[jira] Subscription: PIG patch available

2015-10-12 Thread jira
Issue Subscription
Filter: PIG patch available (30 issues)

Subscriber: pigdaily

Key         Summary
PIG-4699    Print Job stats information in Tez like mapreduce
            https://issues.apache.org/jira/browse/PIG-4699
PIG-4693    Class conflicts: Kryo bundled in spark vs kryo bundled with pig
            https://issues.apache.org/jira/browse/PIG-4693
PIG-4689    CSV Writes incorrect header if two CSV files are created in one script
            https://issues.apache.org/jira/browse/PIG-4689
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues.apache.org/jira/browse/PIG-4684
PIG-4677    Display failure information on stop on failure
            https://issues.apache.org/jira/browse/PIG-4677
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues.apache.org/jira/browse/PIG-4656
PIG-4641    Print the instance of Object without using toString()
            https://issues.apache.org/jira/browse/PIG-4641
PIG-4598    Allow user defined plan optimizer rules
            https://issues.apache.org/jira/browse/PIG-4598
PIG-4581    thread safe issue in NodeIdGenerator
            https://issues.apache.org/jira/browse/PIG-4581
PIG-4539    New PigUnit
            https://issues.apache.org/jira/browse/PIG-4539
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues.apache.org/jira/browse/PIG-4515
PIG-4468    Pig's jackson version conflicts with that of hadoop 2.6.0
            https://issues.apache.org/jira/browse/PIG-4468
PIG-4455    Should use DependencyOrderWalker instead of DepthFirstWalker in MRPrinter
            https://issues.apache.org/jira/browse/PIG-4455
PIG-4417    Pig's register command should support automatic fetching of jars from repo.
            https://issues.apache.org/jira/browse/PIG-4417
PIG-4373    Implement PIG-3861 in Tez
            https://issues.apache.org/jira/browse/PIG-4373
PIG-4341    Add CMX support to pig.tmpfilecompression.codec
            https://issues.apache.org/jira/browse/PIG-4341
PIG-4323    PackageConverter hanging in Spark
            https://issues.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues.apache.org/jira/browse/PIG-4251
PIG-4111    Make Pig compiles with avro-1.7.7
            https://issues.apache.org/jira/browse/PIG-4111
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3866    Create ThreadLocal classloader per PigContext
            https://issues.apache.org/jira/browse/PIG-3866
PIG-3864    ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
            https://issues.apache.org/jira/browse/PIG-3864
PIG-3851    Upgrade jline to 2.11
            https://issues.apache.org/jira/browse/PIG-3851
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587



[jira] [Commented] (PIG-4695) Using 'replicated' left join results in different result from regular left join.

2015-10-12 Thread Daniel Dai (JIRA)

[ https://issues.apache.org/jira/browse/PIG-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954381#comment-14954381 ]

Daniel Dai commented on PIG-4695:
---------------------------------

I've tried 0.15.0 and also cannot reproduce the problem. Can you provide more
details on how to reproduce it?

> Using 'replicated' left join results in different result from regular left
> join.
> ---------------------------------------------------------------------------
>
>                 Key: PIG-4695
>                 URL: https://issues.apache.org/jira/browse/PIG-4695
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.15.0
>            Reporter: Zbigniew Rzepka
>
> There seems to be a difference in results between a regular LEFT JOIN and a
> replicated LEFT JOIN. This may only happen with very small data sets, as we
> are using the code shown below in production with correct results.
> EDIT:
> This issue only occurs when running Pig on Tez. (We're using Tez 7.0).
> Example:
> I have two data sets:
> first_period_users:
> {code}
> (108,11,all_users,all_users)
> (108,13,all_users,all_users)
> (108,17,all_users,all_users)
> (138,11,all_users,all_users)
> {code}
> second_period_users:
> {code}
> (108,11,all_users,all_users)
> (108,13,all_users,all_users)
> {code}
> When I use regular LEFT JOIN on these two I get the correct output:
> {code:sql}
> joined_periods_users = JOIN 
> $first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
> $second_period_users BY (user_id, gg_id, dimension_name, dimension_value);
> {code}
> output:
> {code}
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (138,11,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,17,all_users,all_users)
> {code}
> BUT, if I add {{USING 'replicated'}}, the result is completely different:
> {code}
> $joined_periods_users = JOIN 
> $first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
> $second_period_users BY (user_id, gg_id, dimension_name, dimension_value) 
> USING 'replicated';
> {code}
> output:
> {code}
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,17,all_users,all_users)
> (138,11,all_users,all_users)
> {code}
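
For reference, the result the reporter expects can be checked outside Pig. The
following is a minimal Python sketch, not Pig code and not how Pig implements
either join: it computes a left outer join of the two sample relations on all
four fields. The function name left_outer_join and the None padding for
unmatched rows are assumptions made for illustration. USING 'replicated' only
selects a join strategy, so it should yield the same rows as the regular join:
each matching left row exactly once.

```python
from collections import Counter, defaultdict

# Sample relations copied from the issue description.
first_period_users = [
    (108, 11, "all_users", "all_users"),
    (108, 13, "all_users", "all_users"),
    (108, 17, "all_users", "all_users"),
    (138, 11, "all_users", "all_users"),
]
second_period_users = [
    (108, 11, "all_users", "all_users"),
    (108, 13, "all_users", "all_users"),
]


def left_outer_join(left, right):
    """Left outer join where the join key is the whole four-field tuple."""
    # Index the right-hand relation by key (hypothetical helper structure).
    index = defaultdict(list)
    for row in right:
        index[row].append(row)

    joined = []
    for row in left:
        matches = index.get(row, [])
        if matches:
            joined.extend(row + match for match in matches)
        else:
            # Unmatched left rows are kept and padded with nulls.
            joined.append(row + (None,) * len(row))
    return joined


expected = left_outer_join(first_period_users, second_period_users)
multiplicity = Counter(expected)
for row, count in sorted(multiplicity.items(), key=lambda item: item[0][:2]):
    print(count, row)
```

With this input every joined row has multiplicity 1, which matches the regular
LEFT JOIN output in the report and not the seven-fold duplicates shown for the
replicated variant.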



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4695) Using 'replicated' left join results in different result from regular left join.

2015-10-12 Thread Rohini Palaniswamy (JIRA)

[ https://issues.apache.org/jira/browse/PIG-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954341#comment-14954341 ]

Rohini Palaniswamy commented on PIG-4695:
-----------------------------------------

With current trunk code, I get the right results. Haven't checked with 0.15 
though.






Dependency version on Kryo

2015-10-12 Thread Xuefu Zhang
Hi all,

It was found in PIG-4693 (https://issues.apache.org/jira/browse/PIG-4693)
that Pig currently depends on Kryo 2.22, whereas Spark depends on 2.21. The
two versions are not completely compatible. We tried several ways to solve
the problem, but unfortunately none worked. This is mainly because Spark
doesn't give users an opportunity to provide their own Kryo library
(SPARK-10910). Please refer to the full discussion in PIG-4693.

It seems that Pig brought in the Kryo dependency for ORC. I'm wondering
whether there are any specific reasons for Kryo 2.22 and, if not, whether we
can downgrade the dependency to 2.21 instead. Our initial tests show that
Kryo 2.21 works just fine for ORC. This would obviously solve our problem as
well.

Your input on this is greatly appreciated.

Thanks,
Xuefu


[jira] [Commented] (PIG-4693) Class conflicts: Kryo bundled in spark vs kryo bundled with pig

2015-10-12 Thread Xuefu Zhang (JIRA)

[ https://issues.apache.org/jira/browse/PIG-4693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954315#comment-14954315 ]

Xuefu Zhang commented on PIG-4693:
----------------------------------

+1 on the latest patch.

> Class conflicts: Kryo bundled in spark vs kryo bundled with pig
> ---------------------------------------------------------------
>
>                 Key: PIG-4693
>                 URL: https://issues.apache.org/jira/browse/PIG-4693
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>    Affects Versions: spark-branch
>            Reporter: Srikanth Sundarrajan
>            Assignee: Srikanth Sundarrajan
>              Labels: spork
>             Fix For: spark-branch
>
>         Attachments: PIG-4693.patch
>
>






[jira] [Updated] (PIG-4208) Make merge-sparse join work with Spark

2015-10-12 Thread Xuefu Zhang (JIRA)

[ https://issues.apache.org/jira/browse/PIG-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-4208:
-----------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Abhishek!

> Make merge-sparse join work with Spark
> --------------------------------------
>
>                 Key: PIG-4208
>                 URL: https://issues.apache.org/jira/browse/PIG-4208
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Praveen Rachabattuni
>            Assignee: Abhishek Agarwal
>             Fix For: spark-branch
>
>         Attachments: PIG-4208.patch
>
>
> Related e2e tests: MergeSparseJoin_[1-6]





[jira] [Commented] (PIG-4670) Embedded Python scripts still parse line by line

2015-10-12 Thread Rohini Palaniswamy (JIRA)

[ https://issues.apache.org/jira/browse/PIG-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954308#comment-14954308 ]

Rohini Palaniswamy commented on PIG-4670:
-----------------------------------------

Committed
https://issues.apache.org/jira/secure/attachment/12765711/PIG-4670-fix-e2e-failures.patch.
Thanks for the review, Daniel.

> Embedded Python scripts still parse line by line
> ------------------------------------------------
>
>                 Key: PIG-4670
>                 URL: https://issues.apache.org/jira/browse/PIG-4670
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.16.0
>
>         Attachments: PIG-4670-1.patch, PIG-4670-2.patch,
> PIG-4670-fix-e2e-failures-nowhitespacechange.patch,
> PIG-4670-fix-e2e-failures.patch
>
>
> PIG-3204 fixed Pig script parsing to parse in batches instead of line by
> line. But the fix in BoundScript is not right, and it still parses line by
> line. That makes parsing take a long time for very large Pig scripts that
> use PigStorage without -noschema when no schema file is stored, because it
> tries to find the schema file many times.
> It should be grunt.parseStopOnError(false); instead of
> grunt.parseStopOnError(true); to make it parse statements in batch.





[jira] [Commented] (PIG-4699) Print Job stats information in Tez like mapreduce

2015-10-12 Thread Daniel Dai (JIRA)

[ https://issues.apache.org/jira/browse/PIG-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954124#comment-14954124 ]

Daniel Dai commented on PIG-4699:
---------------------------------

+1

> Print Job stats information in Tez like mapreduce
> -------------------------------------------------
>
>                 Key: PIG-4699
>                 URL: https://issues.apache.org/jira/browse/PIG-4699
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.16.0
>
>         Attachments: PIG-4699-1.patch, sample-output.txt
>
>
> Job stats information in mapreduce is extremely useful when debugging or
> looking for performance bottlenecks, as it shows which of the mapreduce
> jobs is taking time. Without it, it is hard to figure out the same for Tez,
> or which aliases are being processed in each Tez vertex.





[jira] [Commented] (PIG-4670) Embedded Python scripts still parse line by line

2015-10-12 Thread Daniel Dai (JIRA)

[ https://issues.apache.org/jira/browse/PIG-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954095#comment-14954095 ]

Daniel Dai commented on PIG-4670:
---------------------------------

+1






[jira] [Commented] (PIG-4680) Enable pig job graphs to resume from last successful state

2015-10-12 Thread Abhishek Agarwal (JIRA)

[ https://issues.apache.org/jira/browse/PIG-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953014#comment-14953014 ]

Abhishek Agarwal commented on PIG-4680:
---------------------------------------

Posted the review request here - https://reviews.apache.org/r/39226/

> Enable pig job graphs to resume from last successful state
> ----------------------------------------------------------
>
>                 Key: PIG-4680
>                 URL: https://issues.apache.org/jira/browse/PIG-4680
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Abhishek Agarwal
>            Assignee: Abhishek Agarwal
>         Attachments: PIG-4680.patch
>
>
> Pig scripts can have multiple ETL jobs in the DAG, which may take hours to
> finish. In case of transient errors, the job fails. When the job is rerun,
> all the nodes in the job graph rerun, even though some of them may already
> have run successfully. Redundant runs waste cluster capacity and delay the
> pipeline.
> In case of failure, we can persist the graph state. In the next run, only
> the failed nodes and their successors will rerun. This is of course subject
> to preconditions such as:
>  - Pig script has not changed
>  - Input locations have not changed
>  - Output data from previous run is intact
>  - Configuration has not changed
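
The rerun rule in the description ("only the failed nodes and their
successors") amounts to a reachability computation over the job graph. The
following is a minimal Python sketch of that idea only; it is not taken from
the attached patch, and the function nodes_to_rerun, the successors mapping,
and the node names are hypothetical.

```python
from collections import deque


def nodes_to_rerun(successors, failed):
    """Return the failed nodes plus every node downstream of them."""
    rerun = set(failed)
    queue = deque(failed)
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt not in rerun:
                rerun.add(nxt)
                queue.append(nxt)
    return rerun


# Hypothetical job graph: load -> filter -> join -> store, and dims -> join.
successors = {
    "load": ["filter"],
    "filter": ["join"],
    "dims": ["join"],
    "join": ["store"],
    "store": [],
}

# If only "join" failed, the upstream jobs are skipped on the next run.
print(sorted(nodes_to_rerun(successors, ["join"])))  # ['join', 'store']
```

Nodes outside the returned set would reuse the output of the previous run,
which is why the preconditions listed above matter.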





Re: Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

2015-10-12 Thread Abhishek Agarwal

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/
---

(Updated Oct. 12, 2015, 11:30 a.m.)


Review request for pig and Rohini Palaniswamy.


Repository: pig-git


Description (updated)
---

Pig scripts can have multiple ETL jobs in the DAG, which may take hours to
finish. In case of transient errors, the job fails. When the job is rerun, all
the nodes in the job graph rerun, even though some of them may already have
run successfully. Redundant runs waste cluster capacity and delay the
pipeline.

In case of failure, we can persist the graph state. In the next run, only the
failed nodes and their successors will rerun. This is of course subject to
preconditions such as:
 - Pig script has not changed
 - Input locations have not changed
 - Output data from previous run is intact
 - Configuration has not changed
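
One way to check the preconditions listed above is to store a fingerprint of
the script, input locations, and configuration alongside the persisted state,
and compare it on the next run. The following Python sketch illustrates that
approach under stated assumptions; it does not describe the diff under review,
and run_fingerprint, can_resume, the saved-state layout, and all paths and
settings are hypothetical.

```python
import hashlib
import json


def run_fingerprint(script_text, input_locations, configuration):
    """Hash everything that must be unchanged for a resume to be safe."""
    payload = json.dumps(
        {
            "script": script_text,
            "inputs": sorted(input_locations),
            "configuration": configuration,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def can_resume(saved_state, script_text, input_locations, configuration, output_exists):
    """Return True only if every listed precondition holds."""
    if saved_state is None:
        return False  # nothing was persisted by a previous run
    current = run_fingerprint(script_text, input_locations, configuration)
    if saved_state["fingerprint"] != current:
        return False  # script, input locations, or configuration changed
    # Output of every node that completed earlier must still be present.
    return all(output_exists(path) for path in saved_state["completed_outputs"])


# Hypothetical state saved by a failed run.
script = "A = LOAD 'in'; STORE A INTO 'out';"
inputs = ["/data/in"]
conf = {"example.setting": "1"}
state = {
    "fingerprint": run_fingerprint(script, inputs, conf),
    "completed_outputs": ["/tmp/job1"],
}
present = {"/tmp/job1"}

print(can_resume(state, script, inputs, conf, present.__contains__))        # True
print(can_resume(state, script + " ", inputs, conf, present.__contains__))  # False
```

This sketch compares input locations only, not input contents, which matches
the wording of the precondition; detecting modified input data would need an
additional check.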


Diffs
-

  src/org/apache/pig/PigConfiguration.java 03b36a5
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 595e68c
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRIntermediateDataVisitor.java 4b62112
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobState.java PRE-CREATION
  src/org/apache/pig/impl/io/FileLocalizer.java f0f9b43
  src/org/apache/pig/tools/grunt/GruntParser.java 439d087
  src/org/apache/pig/tools/pigstats/ScriptState.java 03a12b1

Diff: https://reviews.apache.org/r/39226/diff/


Testing
---


Thanks,

Abhishek Agarwal



Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

2015-10-12 Thread Abhishek Agarwal

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/
---

Review request for pig and Rohini Palaniswamy.


Repository: pig-git


Description
---

Pig scripts can have multiple ETL jobs in the DAG, which may take hours to
finish. In case of transient errors, the job fails. When the job is rerun, all
the nodes in the job graph rerun, even though some of them may already have
run successfully. Redundant runs waste cluster capacity and delay the
pipeline.

In case of failure, we can persist the graph state. In the next run, only the
failed nodes and their successors will rerun. This is of course subject to
preconditions such as:
 - Pig script has not changed
 - Input locations have not changed
 - Output data from previous run is intact
 - Configuration has not changed


Diffs
-

  src/org/apache/pig/PigConfiguration.java 03b36a5
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 595e68c
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRIntermediateDataVisitor.java 4b62112
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobState.java PRE-CREATION
  src/org/apache/pig/impl/io/FileLocalizer.java f0f9b43
  src/org/apache/pig/tools/grunt/GruntParser.java 439d087
  src/org/apache/pig/tools/pigstats/ScriptState.java 03a12b1

Diff: https://reviews.apache.org/r/39226/diff/


Testing
---


Thanks,

Abhishek Agarwal



[jira] [Commented] (PIG-4680) Enable pig job graphs to resume from last successful state

2015-10-12 Thread Abhishek Agarwal (JIRA)

[ https://issues.apache.org/jira/browse/PIG-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952799#comment-14952799 ]

Abhishek Agarwal commented on PIG-4680:
---------------------------------------

[~rohini] I am trying to upload the patch, which I generated using
"git diff --cached" (the cached option because there are staged changes).
However, I am getting this error while uploading the diff:
Line 2: No valid separator after the filename was found in the diff header.
I see that a similar sort of patch is accepted by the Oozie Review Board.



