[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (30 issues)
Subscriber: pigdaily

Key         Summary
PIG-4699    Print Job stats information in Tez like mapreduce
            https://issues.apache.org/jira/browse/PIG-4699
PIG-4693    Class conflicts: Kryo bundled in spark vs kryo bundled with pig
            https://issues.apache.org/jira/browse/PIG-4693
PIG-4689    CSV Writes incorrect header if two CSV files are created in one script
            https://issues.apache.org/jira/browse/PIG-4689
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues.apache.org/jira/browse/PIG-4684
PIG-4677    Display failure information on stop on failure
            https://issues.apache.org/jira/browse/PIG-4677
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues.apache.org/jira/browse/PIG-4656
PIG-4641    Print the instance of Object without using toString()
            https://issues.apache.org/jira/browse/PIG-4641
PIG-4598    Allow user defined plan optimizer rules
            https://issues.apache.org/jira/browse/PIG-4598
PIG-4581    thread safe issue in NodeIdGenerator
            https://issues.apache.org/jira/browse/PIG-4581
PIG-4539    New PigUnit
            https://issues.apache.org/jira/browse/PIG-4539
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues.apache.org/jira/browse/PIG-4515
PIG-4468    Pig's jackson version conflicts with that of hadoop 2.6.0
            https://issues.apache.org/jira/browse/PIG-4468
PIG-4455    Should use DependencyOrderWalker instead of DepthFirstWalker in MRPrinter
            https://issues.apache.org/jira/browse/PIG-4455
PIG-4417    Pig's register command should support automatic fetching of jars from repo.
            https://issues.apache.org/jira/browse/PIG-4417
PIG-4373    Implement PIG-3861 in Tez
            https://issues.apache.org/jira/browse/PIG-4373
PIG-4341    Add CMX support to pig.tmpfilecompression.codec
            https://issues.apache.org/jira/browse/PIG-4341
PIG-4323    PackageConverter hanging in Spark
            https://issues.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues.apache.org/jira/browse/PIG-4251
PIG-4111    Make Pig compiles with avro-1.7.7
            https://issues.apache.org/jira/browse/PIG-4111
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3866    Create ThreadLocal classloader per PigContext
            https://issues.apache.org/jira/browse/PIG-3866
PIG-3864    ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
            https://issues.apache.org/jira/browse/PIG-3864
PIG-3851    Upgrade jline to 2.11
            https://issues.apache.org/jira/browse/PIG-3851
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384
[jira] [Commented] (PIG-4695) Using 'replicated' left join results in different result from regular left join.
[ https://issues.apache.org/jira/browse/PIG-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954381#comment-14954381 ]

Daniel Dai commented on PIG-4695:
---------------------------------

I've tried 0.15.0 and also cannot reproduce. Can you provide more details how to reproduce?

> Using 'replicated' left join results in different result from regular left join.
> --------------------------------------------------------------------------------
>
> Key: PIG-4695
> URL: https://issues.apache.org/jira/browse/PIG-4695
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.15.0
> Reporter: Zbigniew Rzepka
>
> There seems to be a difference in results between regular LEFT JOIN and
> replicated LEFT JOIN. This may be a case only with very small data sets, as
> we're using piece of code shown below in production with correct results.
> EDIT:
> This issue only occurs when running PIG on Tez. (We're using Tez 7.0).
> Example:
> I have two data sets:
> first_period_users:
> {code}
> (108,11,all_users,all_users)
> (108,13,all_users,all_users)
> (108,17,all_users,all_users)
> (138,11,all_users,all_users)
> {code}
> second_period_users:
> {code}
> (108,11,all_users,all_users)
> (108,13,all_users,all_users)
> {code}
> When I use regular LEFT JOIN on these two I get the correct output:
> {code:sql}
> joined_periods_users = JOIN
> $first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
> $second_period_users BY (user_id, gg_id, dimension_name, dimension_value);
> {code}
> output:
> {code}
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (138,11,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,17,all_users,all_users)
> {code}
> BUT, if I add {{USING 'replicated'}}, the result is completely different:
> {code}
> $joined_periods_users = JOIN
> $first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
> $second_period_users BY (user_id, gg_id, dimension_name, dimension_value)
> USING 'replicated';
> {code}
> output:
> {code}
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,17,all_users,all_users)
> (138,11,all_users,all_users)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
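For anyone trying to reproduce PIG-4695, the expected result can be checked outside Pig with a small reference model. The following is only a sketch in Python, not Pig code: the function name `left_outer_join` is hypothetical, the whole tuple is used as the join key (as in the report's `BY` clause), and unmatched rows are emitted unpadded to mirror how the report prints them (a real engine would typically pad the missing side with nulls).

```python
from collections import defaultdict


def left_outer_join(left, right):
    """Reference model of a LEFT OUTER JOIN keyed on the entire tuple.

    Hypothetical helper for illustration; it is not part of Pig.
    """
    # Index the right-hand relation by join key (here: the whole row).
    index = defaultdict(list)
    for row in right:
        index[row].append(row)

    joined = []
    for row in left:
        matches = index.get(row)
        if matches:
            # One output row per matching right-hand row.
            joined.extend(row + match for match in matches)
        else:
            # No match: keep the left row (unpadded, as printed in the report).
            joined.append(row)
    return joined


# The two data sets quoted in the issue description.
first_period_users = [
    (108, 11, "all_users", "all_users"),
    (108, 13, "all_users", "all_users"),
    (108, 17, "all_users", "all_users"),
    (138, 11, "all_users", "all_users"),
]
second_period_users = [
    (108, 11, "all_users", "all_users"),
    (108, 13, "all_users", "all_users"),
]

result = left_outer_join(first_period_users, second_period_users)
for row in result:
    print(row)
```

Since each key occurs once on either side, the model yields exactly one output row per left-hand row, so any join strategy that returns seven copies of a matched row is multiplying records rather than just reordering them.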
[jira] [Commented] (PIG-4695) Using 'replicated' left join results in different result from regular left join.
[ https://issues.apache.org/jira/browse/PIG-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954341#comment-14954341 ]

Rohini Palaniswamy commented on PIG-4695:
-----------------------------------------

With current trunk code, I get the right results. Haven't checked with 0.15 though.
Dependency version on Kryo
Hi all,

It was found in PIG-4693 (https://issues.apache.org/jira/browse/PIG-4693) that Pig currently depends on Kryo 2.22, whereas Spark depends on 2.21. The two versions are not completely compatible. We tried several ways to solve the problem, but unfortunately none worked, mainly because Spark doesn't give users an opportunity to provide their own Kryo library (SPARK-10910). Please refer to the full discussion in PIG-4693.

It seems that Pig brought in the Kryo dependency for ORC. I'm wondering whether there are any specific reasons for requiring Kryo 2.22 and, if not, whether we can downgrade the dependency to 2.21 instead. Our initial tests show that Kryo 2.21 works just fine for ORC. This would obviously solve our problem as well.

Your input on this is greatly appreciated.

Thanks,
Xuefu
[jira] [Commented] (PIG-4693) Class conflicts: Kryo bundled in spark vs kryo bundled with pig
[ https://issues.apache.org/jira/browse/PIG-4693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954315#comment-14954315 ]

Xuefu Zhang commented on PIG-4693:
----------------------------------

+1 on the latest patch.

> Class conflicts: Kryo bundled in spark vs kryo bundled with pig
> ---------------------------------------------------------------
>
> Key: PIG-4693
> URL: https://issues.apache.org/jira/browse/PIG-4693
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Affects Versions: spark-branch
> Reporter: Srikanth Sundarrajan
> Assignee: Srikanth Sundarrajan
> Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4693.patch
[jira] [Updated] (PIG-4208) Make merge-sparse join work with Spark
[ https://issues.apache.org/jira/browse/PIG-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-4208:
-----------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Abhishek!

> Make merge-sparse join work with Spark
> --------------------------------------
>
> Key: PIG-4208
> URL: https://issues.apache.org/jira/browse/PIG-4208
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Praveen Rachabattuni
> Assignee: Abhishek Agarwal
> Fix For: spark-branch
>
> Attachments: PIG-4208.patch
>
>
> Related e2e tests: MergeSparseJoin_[1-6]
[jira] [Commented] (PIG-4670) Embedded Python scripts still parse line by line
[ https://issues.apache.org/jira/browse/PIG-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954308#comment-14954308 ]

Rohini Palaniswamy commented on PIG-4670:
-----------------------------------------

Committed https://issues.apache.org/jira/secure/attachment/12765711/PIG-4670-fix-e2e-failures.patch. Thanks for the review, Daniel.

> Embedded Python scripts still parse line by line
> ------------------------------------------------
>
> Key: PIG-4670
> URL: https://issues.apache.org/jira/browse/PIG-4670
> Project: Pig
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4670-1.patch, PIG-4670-2.patch,
> PIG-4670-fix-e2e-failures-nowhitespacechange.patch,
> PIG-4670-fix-e2e-failures.patch
>
>
> PIG-3204 fixed pig script parsing to parse in batches instead of line by
> line. But the fix in BoundScript is not right and it is still parsing line by
> line. That makes parsing take a long time for very large pig scripts using
> PigStorage when there is no schema file stored and -noschema is not given, as
> it tries to find the schema file many times.
> It should be grunt.parseStopOnError(false); instead of
> grunt.parseStopOnError(true); to make it parse statements in batch.
[jira] [Commented] (PIG-4699) Print Job stats information in Tez like mapreduce
[ https://issues.apache.org/jira/browse/PIG-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954124#comment-14954124 ]

Daniel Dai commented on PIG-4699:
---------------------------------

+1

> Print Job stats information in Tez like mapreduce
> -------------------------------------------------
>
> Key: PIG-4699
> URL: https://issues.apache.org/jira/browse/PIG-4699
> Project: Pig
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4699-1.patch, sample-output.txt
>
>
> Job stats information in mapreduce is extremely useful while debugging or
> looking for performance bottlenecks, since it shows which of the mapreduce
> jobs is taking time. Without it, it is hard to figure out the same for Tez,
> or to tell which aliases are being processed in which vertices.
[jira] [Commented] (PIG-4670) Embedded Python scripts still parse line by line
[ https://issues.apache.org/jira/browse/PIG-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954095#comment-14954095 ]

Daniel Dai commented on PIG-4670:
---------------------------------

+1
[jira] [Commented] (PIG-4680) Enable pig job graphs to resume from last successful state
[ https://issues.apache.org/jira/browse/PIG-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953014#comment-14953014 ]

Abhishek Agarwal commented on PIG-4680:
---------------------------------------

Posted the review request here - https://reviews.apache.org/r/39226/

> Enable pig job graphs to resume from last successful state
> ----------------------------------------------------------
>
> Key: PIG-4680
> URL: https://issues.apache.org/jira/browse/PIG-4680
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Abhishek Agarwal
> Assignee: Abhishek Agarwal
> Attachments: PIG-4680.patch
>
>
> Pig scripts can have multiple ETL jobs in the DAG which may take hours to
> finish. In case of transient errors, the job fails. When the job is rerun,
> all the nodes in Job graph will rerun. Some of these nodes may have already
> run successfully. Redundant runs lead to wastage of cluster capacity and
> pipeline delays.
> In case of failure, we can persist the graph state. In next run, only the
> failed nodes and their successors will rerun. This is of course subject to
> preconditions such as
> - Pig script has not changed
> - Input locations have not changed
> - Output data from previous run is intact
> - Configuration has not changed
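The resume rule described for PIG-4680 above can be sketched roughly as follows. This is an illustrative Python model only, not the patch's implementation: the names `run_fingerprint`, `nodes_to_rerun`, `plan_resume`, and the shape of the saved state are all assumptions made for the example. It fingerprints the script, input locations, and configuration, and reruns only the failed nodes and their successors when the fingerprint matches and the previous outputs are intact.

```python
import hashlib
import json


def run_fingerprint(script_text, input_locations, config):
    """Hash the script, inputs, and configuration (hypothetical helper)."""
    payload = json.dumps(
        {"script": script_text, "inputs": sorted(input_locations), "config": config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def nodes_to_rerun(successors, failed):
    """Return the failed nodes plus every transitive successor."""
    pending = list(failed)
    rerun = set()
    while pending:
        node = pending.pop()
        if node in rerun:
            continue
        rerun.add(node)
        pending.extend(successors.get(node, ()))
    return rerun


def plan_resume(saved_state, fingerprint, successors, outputs_intact):
    """Decide which nodes to run; fall back to a full run if any precondition fails."""
    all_nodes = set(successors) | {n for succ in successors.values() for n in succ}
    if (
        saved_state is None
        or saved_state["fingerprint"] != fingerprint
        or not outputs_intact
    ):
        return all_nodes
    return nodes_to_rerun(successors, saved_state["failed"])


# Example job graph: node -> list of direct successors (made-up node names).
successors = {"load": ["join"], "join": ["store"], "filter": ["store"], "store": []}
fingerprint = run_fingerprint("A = LOAD 'in'; ...", ["/data/in"], {"engine": "mr"})
saved = {"fingerprint": fingerprint, "failed": ["join"]}

rerun = plan_resume(saved, fingerprint, successors, outputs_intact=True)
print(sorted(rerun))
```

Falling back to a full run whenever any precondition fails keeps the sketch conservative: a stale checkpoint can cost time but never yields results computed from mismatched inputs.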
Re: Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/
-----------------------------------------------------------

(Updated Oct. 12, 2015, 11:30 a.m.)

Review request for pig and Rohini Palaniswamy.

Repository: pig-git

Description (updated)
-------

Pig scripts can have multiple ETL jobs in the DAG which may take hours to finish. In case of transient errors, the job fails. When the job is rerun, all the nodes in Job graph will rerun. Some of these nodes may have already run successfully. Redundant runs lead to wastage of cluster capacity and pipeline delays.

In case of failure, we can persist the graph state. In next run, only the failed nodes and their successors will rerun. This is of course subject to preconditions such as
> Pig script has not changed
> Input locations have not changed
> Output data from previous run is intact
> Configuration has not changed

Diffs
-----

  src/org/apache/pig/PigConfiguration.java 03b36a5
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 595e68c
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRIntermediateDataVisitor.java 4b62112
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobState.java PRE-CREATION
  src/org/apache/pig/impl/io/FileLocalizer.java f0f9b43
  src/org/apache/pig/tools/grunt/GruntParser.java 439d087
  src/org/apache/pig/tools/pigstats/ScriptState.java 03a12b1

Diff: https://reviews.apache.org/r/39226/diff/

Testing
-------

Thanks,

Abhishek Agarwal
Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/
-----------------------------------------------------------

Review request for pig and Rohini Palaniswamy.

Repository: pig-git

Thanks,

Abhishek Agarwal
[jira] [Commented] (PIG-4680) Enable pig job graphs to resume from last successful state
[ https://issues.apache.org/jira/browse/PIG-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952799#comment-14952799 ]

Abhishek Agarwal commented on PIG-4680:
---------------------------------------

[~rohini] I am trying to upload the patch, I have generated using "git diff --cached" (cached option because there are staged changes). However I am getting this error while uploading the diff

Line 2: No valid separator after the filename was found in the diff header.

I see that similar sort of patch is accepted by oozie reviewboard.