[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (32 issues)
Subscriber: pigdaily

Key         Summary
PIG-4818    Single quote inside comment in GENERATE is not being ignored
            https://issues.apache.org/jira/browse/PIG-4818
PIG-4816    Read a null scalar causing a Tez failure
            https://issues.apache.org/jira/browse/PIG-4816
PIG-4796    Authenticate with Kerberos using a keytab file
            https://issues.apache.org/jira/browse/PIG-4796
PIG-4788    the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
            https://issues.apache.org/jira/browse/PIG-4788
PIG-4745    DataBag should protect content of passed list of tuples
            https://issues.apache.org/jira/browse/PIG-4745
PIG-4734    TOMAP schema inferring breaks some scripts in type checking for bincond
            https://issues.apache.org/jira/browse/PIG-4734
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues.apache.org/jira/browse/PIG-4684
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues.apache.org/jira/browse/PIG-4656
PIG-4641    Print the instance of Object without using toString()
            https://issues.apache.org/jira/browse/PIG-4641
PIG-4598    Allow user defined plan optimizer rules
            https://issues.apache.org/jira/browse/PIG-4598
PIG-4581    thread safe issue in NodeIdGenerator
            https://issues.apache.org/jira/browse/PIG-4581
PIG-4551    Partition filter is not pushed down in case of SPLIT
            https://issues.apache.org/jira/browse/PIG-4551
PIG-4539    New PigUnit
            https://issues.apache.org/jira/browse/PIG-4539
PIG-4526    Make setting up the build environment easier
            https://issues.apache.org/jira/browse/PIG-4526
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues.apache.org/jira/browse/PIG-4515
PIG-4455    Should use DependencyOrderWalker instead of DepthFirstWalker in MRPrinter
            https://issues.apache.org/jira/browse/PIG-4455
PIG-4341    Add CMX support to pig.tmpfilecompression.codec
            https://issues.apache.org/jira/browse/PIG-4341
PIG-4323    PackageConverter hanging in Spark
            https://issues.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues.apache.org/jira/browse/PIG-4251
PIG-4111    Make Pig compiles with avro-1.7.7
            https://issues.apache.org/jira/browse/PIG-4111
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3906    ant site errors out
            https://issues.apache.org/jira/browse/PIG-3906
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3866    Create ThreadLocal classloader per PigContext
            https://issues.apache.org/jira/browse/PIG-3866
PIG-3864    ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
            https://issues.apache.org/jira/browse/PIG-3864
PIG-3851    Upgrade jline to 2.11
            https://issues.apache.org/jira/browse/PIG-3851
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384
Re: Review Request 43571: PIG-4788:the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43571/
---

(Updated March 1, 2016, 4:51 a.m.)

Review request for pig, Xianda Ke, Mohit Sabharwal, Pallavi Rao, and Xuefu Zhang.

Changes
---

The patch modifies PigSplit so that it extends FileSplit instead of InputSplit. I explain the reason in detail on the JIRA page (https://issues.apache.org/jira/browse/PIG-4788). I hope more Pig developers will join the discussion.

Bugs: PIG-4788
    https://issues.apache.org/jira/browse/PIG-4788

Repository: pig-git

Description (updated)
---

I explained the modification in more detail on the JIRA page.
The changes in PIG-4788.patch are:
1. PigSplit extends FileSplit, not InputSplit.
2. Add try/catch to PigSplit#getLocations() and PigSplit#getLength().
3. Add PigSplit#getPath(). PigSplit#getPath() will be called in NewTrackingRecordReader.

Diffs
-

  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigSplit.java 92c6bd6
  src/org/apache/pig/builtin/mock/Storage.java afc1d29

Diff: https://reviews.apache.org/r/43571/diff/

Testing
---

After testing, no new unit test failures are introduced, and the unit test failures in TestOrcStoragePushdown will be fixed.

Thanks,

kelly zhang
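
Editor's sketch of the change listed in the review request above. This is not the actual patch: SketchInputSplit, SketchFileSplit, SketchNonFileSplit, SketchOldPigSplit, SketchNewPigSplit and SplitMetricsSketch are hypothetical stand-ins written without the Hadoop or Pig libraries. The rule in bytesRead() (only file-based splits are counted) is an assumption about why the metric was 0, since the message defers the detailed reason to the JIRA page, and the purpose given for the try/catch is likewise a guess.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for Hadoop's InputSplit and FileSplit; the real
// classes live in org.apache.hadoop.mapreduce and have richer APIs.
abstract class SketchInputSplit {
    abstract long getLength();
}

class SketchFileSplit extends SketchInputSplit {
    private final String path;
    private final long length;

    SketchFileSplit(String path, long length) {
        this.path = path;
        this.length = length;
    }

    String getPath() { return path; }

    @Override
    long getLength() { return length; }
}

// A split with no backing file whose length cannot be determined.
class SketchNonFileSplit extends SketchInputSplit {
    @Override
    long getLength() { throw new UnsupportedOperationException("length unknown"); }
}

// Models PigSplit before the patch: a wrapper that is only an InputSplit.
class SketchOldPigSplit extends SketchInputSplit {
    private final List<SketchInputSplit> wrapped;

    SketchOldPigSplit(List<SketchInputSplit> wrapped) { this.wrapped = wrapped; }

    @Override
    long getLength() {
        long total = 0;
        for (SketchInputSplit s : wrapped) total += s.getLength();
        return total;
    }
}

// Models PigSplit after the patch: still a wrapper, but now also a file split.
class SketchNewPigSplit extends SketchFileSplit {
    private final List<SketchInputSplit> wrapped;

    SketchNewPigSplit(List<SketchInputSplit> wrapped) {
        super(null, 0L);
        this.wrapped = wrapped;
    }

    @Override
    long getLength() {
        long total = 0;
        for (SketchInputSplit s : wrapped) {
            try {
                total += s.getLength();
            } catch (RuntimeException e) {
                // Assumed purpose of the added try/catch: a wrapped split that
                // cannot report a length contributes nothing instead of failing.
            }
        }
        return total;
    }

    @Override
    String getPath() {
        for (SketchInputSplit s : wrapped) {
            if (s instanceof SketchFileSplit) return ((SketchFileSplit) s).getPath();
        }
        return null; // no wrapped file split
    }
}

class SplitMetricsSketch {
    // Assumption: the engine only reports bytes read for file-based splits.
    static long bytesRead(SketchInputSplit split) {
        return (split instanceof SketchFileSplit) ? split.getLength() : 0L;
    }

    public static void main(String[] args) {
        SketchInputSplit file = new SketchFileSplit("/data/part-0", 128L);
        System.out.println(bytesRead(new SketchOldPigSplit(Arrays.asList(file))));
        System.out.println(bytesRead(new SketchNewPigSplit(
                Arrays.asList(file, new SketchNonFileSplit()))));
    }
}
```

Under these assumptions the old wrapper reports 0 bytes and the new one reports 128. The sketch also shows the open question for reviewers: once the wrapper is a file split, it has to behave sensibly when what it wraps is not file-based.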
[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
[ https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173203#comment-15173203 ]

liyunzhang_intel commented on PIG-4788:
---

[~mohitsabharwal]: the LoadFunc Javadoc you linked says:
{quote}
A LoadFunc loads data into Pig. It can read from an HDFS file or other source.
{quote}
I do not know whether the "other source" is a file source; maybe we need to ask others such as [~rohini].

> the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
> ---
>
> Key: PIG-4788
> URL: https://issues.apache.org/jira/browse/PIG-4788
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4788.patch
>
>
> In [JobMetricsLinstener#onTaskEnd|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L140],
> taskMetrics.inputMetrics().get().bytesRead() always returns 0 even the length of input file is not zero.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
[ https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173194#comment-15173194 ]

Mohit Sabharwal commented on PIG-4788:
---

A custom LoadFunc, for example, can be written to return any InputFormat, no?
https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/LoadFunc.html
[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
[ https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173191#comment-15173191 ]

liyunzhang_intel commented on PIG-4788:
---

[~mohitsabharwal]: most Pig inputs are file-based. Can you point out a non-file input? No new unit test failures are introduced after PigSplit extends FileSplit (though I cannot guarantee that the unit tests cover everything). I also hope this change does not affect other features.
[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
[ https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173189#comment-15173189 ]

Mohit Sabharwal commented on PIG-4788:
---

Ah, of course, sorry - PigSplit can't be replaced by FileSplit.

My other concern was whether changing PigSplit to extend FileSplit will break PigSplit for input formats that use non-file splits. Makes sense?
[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
[ https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173175#comment-15173175 ]

liyunzhang_intel commented on PIG-4788:
---

[~mohitsabharwal]: PigSplit is an important class that many other classes depend on, so I guess we cannot use FileSplit to replace PigSplit in PigInputFormatSpark, although I have not tried that experiment. Earlier, Pallavi suggested copying PigSplit to a PigSplitSpark class that extends FileSplit and using PigSplitSpark in PigInputFormatSpark. When I tried that, I found it difficult because I would need to copy a lot of code.
[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
[ https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173164#comment-15173164 ]

Mohit Sabharwal commented on PIG-4788:
---

[~kellyzly], if you change {{PigSplit}} to extend {{FileSplit}}, will {{PigInputFormat}} still work with non-file splits such as CombineFileSplit?

Can we use {{FileSplit}} instead of {{PigSplit}} when we create the record reader in {{PigInputFormatSpark}}? That way we could isolate the change in Spark-specific code.
[jira] [Commented] (PIG-4796) Authenticate with Kerberos using a keytab file
[ https://issues.apache.org/jira/browse/PIG-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172798#comment-15172798 ]

Daniel Dai commented on PIG-4796:
---

Looks good to me, and thanks for the docs. [~rohini], do you have any comments, since you are more familiar with this part?

> Authenticate with Kerberos using a keytab file
> ---
>
> Key: PIG-4796
> URL: https://issues.apache.org/jira/browse/PIG-4796
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.15.0
> Reporter: Niels Basjes
> Assignee: Niels Basjes
> Labels: feature, kerberos, security
> Attachments: 2016-02-18-1510-PIG-4796.patch, 2016-02-18-PIG-4796-rough-proof-of-concept.patch, PIG-4796-2016-02-23.patch
>
>
> When running in a Kerberos secured environment users are faced with the limitation that their jobs cannot run longer than the (remaining) ticket lifetime of their Kerberos tickets. The environment I work in these tickets expire after 10 hours, thus limiting the maximum job duration to at most 10 hours (which is a problem).
> In the Hadoop tooling there is a feature where you can authenticate using a Kerberos keytab file (essentially a file that contains the encrypted form of the kerberos principal and password). Using this the running application can request new tickets from the Kerberos server when the initial tickets expire.
> In my Java/Hadoop applications I commonly include these two lines:
> {code}
> System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
> UserGroupInformation.loginUserFromKeytab("nbas...@xx.net", "/home/nbasjes/.krb/nbasjes.keytab");
> {code}
> This way I have run an Apache Flink based application for more than 170 hours (about a week) on the kerberos secured Yarn cluster.
> What I propose is to have a feature that I can set the relevant kerberos values in my pig script and from there be able to run a pig job for many days on the secured cluster.
> Proposal how this can look in a pig script:
> {code}
> SET java.security.krb5.conf '/etc/krb5.conf'
> SET job.security.krb5.principal 'nbas...@xx.net'
> SET job.security.krb5.keytab '/home/nbasjes/.krb/nbasjes.keytab'
> {code}
> So iff all of these are set (or at least the last two) then the aforementioned UserGroupInformation.loginUserFromKeytab method is called before submitting the job to the cluster.
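
Editor's sketch of the rule in the proposal above ("iff all of these are set, or at least the last two, then log in from the keytab"). It is a minimal illustration, not the code from the attached patches. The property names are the ones in the proposal; the KeytabLogin interface is a hypothetical stand-in for a call such as Hadoop's UserGroupInformation.loginUserFromKeytab(principal, keytab), so the example runs without Hadoop on the classpath. The principal and keytab path in main are made-up placeholders.

```java
import java.util.Properties;

class KeytabLoginSketch {
    /** Hypothetical stand-in for the real keytab login call. */
    interface KeytabLogin {
        void login(String principal, String keytabPath);
    }

    /**
     * Performs the keytab login only when both the principal and the keytab
     * are configured; the krb5.conf location is optional.
     * Returns true if a login was attempted.
     */
    static boolean loginIfConfigured(Properties props, KeytabLogin login) {
        String krb5Conf = props.getProperty("java.security.krb5.conf");
        String principal = props.getProperty("job.security.krb5.principal");
        String keytab = props.getProperty("job.security.krb5.keytab");

        if (principal == null || keytab == null) {
            return false; // not enough information: fall back to the ticket cache
        }
        if (krb5Conf != null) {
            System.setProperty("java.security.krb5.conf", krb5Conf);
        }
        login.login(principal, keytab);
        return true;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("job.security.krb5.principal", "user@EXAMPLE.COM");  // placeholder
        props.setProperty("job.security.krb5.keytab", "/path/to/user.keytab"); // placeholder
        boolean attempted = loginIfConfigured(props,
                (principal, keytab) -> System.out.println("login " + principal + " with " + keytab));
        System.out.println(attempted);
    }
}
```

In a real integration the lambda would presumably delegate to the Hadoop call and handle its IOException; that wiring is left out here.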
[jira] [Updated] (PIG-4817) Bump HTTP Logparser to version 2.4
[ https://issues.apache.org/jira/browse/PIG-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-4817:
---
    Resolution: Fixed
    Hadoop Flags: Reviewed
    Fix Version/s: 0.16.0
    Status: Resolved (was: Patch Available)

Patch committed to trunk. Thanks Niels!

> Bump HTTP Logparser to version 2.4
> ---
>
> Key: PIG-4817
> URL: https://issues.apache.org/jira/browse/PIG-4817
> Project: Pig
> Issue Type: Improvement
> Reporter: Niels Basjes
> Assignee: Niels Basjes
> Fix For: 0.16.0
>
> Attachments: PIG-4817-20160229.patch
>
>
> Main reason for the update is this fix:
> Now support parsing the first line even if it is chopped by Apache httpd because of an URI longer than 8000 bytes.
[jira] [Commented] (PIG-4818) Single quote inside comment in GENERATE is not being ignored
[ https://issues.apache.org/jira/browse/PIG-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172789#comment-15172789 ]

Daniel Dai commented on PIG-4818:
---

+1. And this happens only when the comment is in the GENERATE clause.

> Single quote inside comment in GENERATE is not being ignored
> ---
>
> Key: PIG-4818
> URL: https://issues.apache.org/jira/browse/PIG-4818
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.12.0, 0.12.1, 0.13.0, 0.14.0, 0.15.0
> Reporter: Koji Noguchi
> Assignee: Koji Noguchi
> Priority: Minor
> Attachments: pig-4818-v01.patch
>
>
> {code}
> A = load '1.txt' as (a1:int, a2:int);
> B = FOREACH A GENERATE a1,
> -- testing ' here with single quote
> a2;
> dump B;
> {code}
> This fails with
> {panel}
> 2016-02-29 20:09:05,507 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 6, column 0. Encountered: after : ""
> {panel}
Request: Feedback on patches
Hi,

A friendly request for your feedback on these patches I provided over the last few days.
Thanks.

Authenticate with Kerberos using a keytab file
https://issues.apache.org/jira/browse/PIG-4796
This is a feature to allow running jobs on secure clusters for a duration longer than the maximum lifetime of the Kerberos tickets.

Bump HTTP Logparser to version 2.4
https://issues.apache.org/jira/browse/PIG-4817
Simply an update to the most recent version, which parses a few more edge cases. Specifically, corrupted log lines as seen when HTTP 414 (URI too long) occurs, which I see on a daily basis in our logfiles.

Make setting up the build environment easier
https://issues.apache.org/jira/browse/PIG-4526
A set of scripts to create a reproducible Docker-based build environment. This is intended to give a 'very reproducible' way of setting up the build environment to reproduce bugs and to run builds.

ant site errors out
https://issues.apache.org/jira/browse/PIG-3906
I submitted a rather brutal fix: disable the problematic feature.
Question: Is this fix a desirable one?

--
Best regards / Met vriendelijke groeten,

Niels Basjes
[jira] [Updated] (PIG-4818) Single quote inside comment in GENERATE is not being ignored
[ https://issues.apache.org/jira/browse/PIG-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated PIG-4818:
---
    Attachment: pig-4818-v01.patch

It seems that a minor change in PIG-2507 caused this unexpected behavior.
I don't understand JavaCC well, but I'm attaching some changes that seem to work.
I'd appreciate it if someone could review the change.
[jira] [Updated] (PIG-4818) Single quote inside comment in GENERATE is not being ignored
[ https://issues.apache.org/jira/browse/PIG-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated PIG-4818:
---
    Status: Patch Available (was: Open)
[jira] [Created] (PIG-4818) Single quote inside comment in GENERATE is not being ignored
Koji Noguchi created PIG-4818:
---

Summary: Single quote inside comment in GENERATE is not being ignored
Key: PIG-4818
URL: https://issues.apache.org/jira/browse/PIG-4818
Project: Pig
Issue Type: Bug
Affects Versions: 0.15.0, 0.14.0, 0.13.0, 0.12.1, 0.12.0
Reporter: Koji Noguchi
Assignee: Koji Noguchi
Priority: Minor

{code}
A = load '1.txt' as (a1:int, a2:int);
B = FOREACH A GENERATE a1,
-- testing ' here with single quote
a2;
dump B;
{code}
This fails with
{panel}
2016-02-29 20:09:05,507 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 6, column 0. Encountered: after : ""
{panel}
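
Editor's sketch of the behavior the report expects: once a "--" comment starts, everything up to the end of the line is ignored, including a stray single quote. This is a small hand-written scanner for illustration only. It is not Pig's JavaCC grammar and not the attached patch; the class and method names are hypothetical, and it handles only "--" line comments and single-quoted strings.

```java
class CommentStripSketch {
    /**
     * Removes "--" line comments while leaving single-quoted string
     * literals (including any "--" inside them) untouched.
     */
    static String stripLineComments(String script) {
        StringBuilder out = new StringBuilder();
        boolean inQuote = false;
        int i = 0;
        while (i < script.length()) {
            char c = script.charAt(i);
            if (inQuote) {
                out.append(c);
                if (c == '\\' && i + 1 < script.length()) {
                    // Keep an escaped character (such as \') inside the string.
                    out.append(script.charAt(i + 1));
                    i += 2;
                    continue;
                }
                if (c == '\'') inQuote = false;
                i++;
            } else if (c == '-' && i + 1 < script.length() && script.charAt(i + 1) == '-') {
                // Comment: skip to the end of the line, keeping the newline itself.
                // A quote seen here must NOT open a string literal.
                while (i < script.length() && script.charAt(i) != '\n') i++;
            } else {
                if (c == '\'') inQuote = true;
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String script = "A = load '1.txt' as (a1:int, a2:int);\n"
                + "B = FOREACH A GENERATE a1,\n"
                + "-- testing ' here with single quote\n"
                + "a2;\n"
                + "dump B;\n";
        System.out.print(stripLineComments(script));
    }
}
```

The point of the sketch is the ordering of the checks: the comment test runs before the quote test whenever the scanner is outside a string, so the quote in the comment never flips the scanner into string mode.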
[jira] [Commented] (PIG-4796) Authenticate with Kerberos using a keytab file
[ https://issues.apache.org/jira/browse/PIG-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171950#comment-15171950 ]

Niels Basjes commented on PIG-4796:
---

I just did a full-size test run using the following script on 10 days' worth of click data.

Summary: the test passed on the Kerberos secured cluster I have here.

My input was 90 distinct logfiles totaling a few hundred GiB of gzipped Apache access logs.
My Kerberos account has been configured to have the tickets expire after 5 minutes and have a max renew of 10 minutes (for me this is the easiest way to test this feature).

I ran this pig script with the following command line:
{code}
kdestroy
./bin/pig -P nbasjes.kerberos.properties -param_file LogFormats.properties ./useragent.pig
{code}

So I made sure I was logged out of Kerberos and then I ran the script against a Kerberos secured cluster.
Even though the script ran for over 27 minutes, the whole thing completed successfully.
I verified the output of this script and it was correct.

The script I ran (from the pig source directory):
{code}
REGISTER ./contrib/piggybank/java/piggybank.jar ;
REGISTER ./lib/*.jar ;

UserAgents =
    LOAD '$LOGFILE'
    USING org.apache.pig.piggybank.storage.apachelog.LogFormatLoader(
        '$LOGFORMAT',
        'HTTP.USERAGENT:request.user-agent'
    ) AS (
        useragent:chararray
    );

UserAgentsCount =
    FOREACH UserAgents
    GENERATE useragent AS useragent:chararray,
             1L AS clicks:long;

CountsPerUseragents =
    GROUP UserAgentsCount
    BY (useragent);

SumsPerBrowser =
    FOREACH CountsPerUseragents
    GENERATE SUM(UserAgentsCount.clicks) AS clicks,
             group AS useragent;

STORE SumsPerBrowser
    INTO 'TopUseragents'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t','NO_MULTILINE', 'UNIX');
{code}

[~daijy]: Is this the type of manual test you think is correct?
[jira] [Updated] (PIG-4817) Bump HTTP Logparser to version 2.4
[ https://issues.apache.org/jira/browse/PIG-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niels Basjes updated PIG-4817:
---
    Status: Patch Available (was: Open)

I ran a few tests and this works as I expect it to.
[jira] [Resolved] (PIG-4776) Enable unit test "TestOrcStoragePushdown" for spark
[ https://issues.apache.org/jira/browse/PIG-4776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang resolved PIG-4776.
---
    Resolution: Fixed

The patch is committed to Spark branch. Thanks, Liyun!

> Enable unit test "TestOrcStoragePushdown" for spark
> ---
>
> Key: PIG-4776
> URL: https://issues.apache.org/jira/browse/PIG-4776
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4776.patch
>
>
> In latest jenkins report(https://builds.apache.org/job/Pig-spark/292/#showFailuresLink), it shows that following unit tests fail:
> org.apache.pig.builtin.TestOrcStoragePushdown.testColumnPruning
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownBigDecimal
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownTimestamp
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownChar
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownByteShort
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownFloatDouble
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownIntLongString
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownBoolean
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownVarchar
[jira] [Commented] (PIG-4776) Enable unit test "TestOrcStoragePushdown" for spark
[ https://issues.apache.org/jira/browse/PIG-4776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171895#comment-15171895 ]

Xuefu Zhang commented on PIG-4776:
---

+1
[jira] [Resolved] (PIG-4243) Fix "TestStore" for Spark engine
[ https://issues.apache.org/jira/browse/PIG-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang resolved PIG-4243.
---
    Resolution: Fixed

Committed to Spark branch. Thanks, Liyun!

> Fix "TestStore" for Spark engine
> ---
>
> Key: PIG-4243
> URL: https://issues.apache.org/jira/browse/PIG-4243
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4243.patch, PIG-4243_1.patch, TEST-org.apache.pig.test.TestStore.txt
>
>
> 1. Build spark and pig env according to PIG-4168
> 2. add TestStore to $PIG_HOME/test/spark-tests
> cat $PIG_HOME/test/spark-tests
> **/TestStore
> 3. run unit test TestStore
> ant test-spark
> 4. the unit test fails
> error log is attached
[jira] [Updated] (PIG-4817) Bump HTTP Logparser to version 2.4
[ https://issues.apache.org/jira/browse/PIG-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niels Basjes updated PIG-4817:
---
    Attachment: PIG-4817-20160229.patch
[jira] [Resolved] (PIG-4813) AvroStorage doesn't work for schema from external file for EMR
[ https://issues.apache.org/jira/browse/PIG-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jagdish Kewat resolved PIG-4813.
---
    Resolution: Not A Bug

Works on HDFS as well. The key is to use "*org.apache.pig.builtin.AvroStorage*" instead of "*org.apache.pig.piggybank.storage.avro.AvroStorage*".

Resolving as not a bug.

> AvroStorage doesn't work for schema from external file for EMR
> ---
>
> Key: PIG-4813
> URL: https://issues.apache.org/jira/browse/PIG-4813
> Project: Pig
> Issue Type: Bug
> Reporter: Jagdish Kewat
>
> Hi Team,
> I couldn't get the schema loading for AvroStorage as described in http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-etl-avro.html working.
> It works fine if I provide the raw schema string with option 'schema' as described in https://cwiki.apache.org/confluence/display/PIG/AvroStorage.
> On HDFS I don't even need to specify the schema with store command.
> A quick insight regarding the versions.
> * Hadoop :
> {code}
> Hadoop 2.6.0-amzn-2
> Subversion g...@aws157git.com:/pkg/Aws157BigTop -r 41f4e6be3ac5d6676a3464f77de79a33e8fdd9f3
> Compiled by ec2-user on 2015-11-16T20:56Z
> Compiled with protoc 2.5.0
> {code}
> * Pig :
> {code}
> Apache Pig version 0.14.0-amzn-0 (r: unknown)
> {code}
> * piggybank jar version:
> ** piggybank-0.14.0.jar
> * avro jar version :
> ** avro-1.7.7.jar
> * avro-ipc jar version :
> ** avro-ipc-1.7.7.jar
> * json-simple jar version
> ** json-simple-1.1.jar
> I tried looking for any piggybank version of jar for EMR however no luck. I fear I am not using correct versions of jars since the feature should work as it has been documented.
> Please advise if I am missing anything.
> Thanks,
> Jagdish
[jira] [Created] (PIG-4817) Bump HTTP Logparser to version 2.4
Niels Basjes created PIG-4817:
---

Summary: Bump HTTP Logparser to version 2.4
Key: PIG-4817
URL: https://issues.apache.org/jira/browse/PIG-4817
Project: Pig
Issue Type: Improvement
Reporter: Niels Basjes
Assignee: Niels Basjes

Main reason for the update is this fix:
Now support parsing the first line even if it is chopped by Apache httpd because of an URI longer than 8000 bytes.
[jira] [Commented] (PIG-4813) AvroStorage doesn't work for schema from external file for EMR
[ https://issues.apache.org/jira/browse/PIG-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171582#comment-15171582 ]

Jagdish Kewat commented on PIG-4813:
---

Thanks [~daijy]!

org.apache.pig.builtin.AvroStorage worked. I still need to check whether this works on HDFS as well.

Regards,
Jagdish
[jira] [Commented] (PIG-4243) Fix "TestStore" for Spark engine
[ https://issues.apache.org/jira/browse/PIG-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171562#comment-15171562 ]

Pallavi Rao commented on PIG-4243:
---

+1 for the new patch. [~xuefuz], please commit.