[jira] Subscription: PIG patch available

2015-10-09 Thread jira
Issue Subscription
Filter: PIG patch available (29 issues)

Subscriber: pigdaily

Key         Summary
PIG-4689    CSV Writes incorrect header if two CSV files are created in one script
            https://issues.apache.org/jira/browse/PIG-4689
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues.apache.org/jira/browse/PIG-4684
PIG-4677    Display failure information on stop on failure
            https://issues.apache.org/jira/browse/PIG-4677
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues.apache.org/jira/browse/PIG-4656
PIG-4641    Print the instance of Object without using toString()
            https://issues.apache.org/jira/browse/PIG-4641
PIG-4598    Allow user defined plan optimizer rules
            https://issues.apache.org/jira/browse/PIG-4598
PIG-4581    thread safe issue in NodeIdGenerator
            https://issues.apache.org/jira/browse/PIG-4581
PIG-4539    New PigUnit
            https://issues.apache.org/jira/browse/PIG-4539
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues.apache.org/jira/browse/PIG-4515
PIG-4468    Pig's jackson version conflicts with that of hadoop 2.6.0
            https://issues.apache.org/jira/browse/PIG-4468
PIG-4455    Should use DependencyOrderWalker instead of DepthFirstWalker in MRPrinter
            https://issues.apache.org/jira/browse/PIG-4455
PIG-4417    Pig's register command should support automatic fetching of jars from repo.
            https://issues.apache.org/jira/browse/PIG-4417
PIG-4373    Implement PIG-3861 in Tez
            https://issues.apache.org/jira/browse/PIG-4373
PIG-4341    Add CMX support to pig.tmpfilecompression.codec
            https://issues.apache.org/jira/browse/PIG-4341
PIG-4323    PackageConverter hanging in Spark
            https://issues.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues.apache.org/jira/browse/PIG-4251
PIG-4208    Make merge-sparse join work with Spark
            https://issues.apache.org/jira/browse/PIG-4208
PIG-4111    Make Pig compiles with avro-1.7.7
            https://issues.apache.org/jira/browse/PIG-4111
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3866    Create ThreadLocal classloader per PigContext
            https://issues.apache.org/jira/browse/PIG-3866
PIG-3864    ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
            https://issues.apache.org/jira/browse/PIG-3864
PIG-3851    Upgrade jline to 2.11
            https://issues.apache.org/jira/browse/PIG-3851
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384


Jenkins build became unstable: Pig-trunk-commit #2246

2015-10-09 Thread Apache Jenkins Server



[jira] [Commented] (PIG-4208) Make merge-sparse join work with Spark

2015-10-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950656#comment-14950656
 ] 

Xuefu Zhang commented on PIG-4208:
--

Yeah, I saw similar changes in other places. I was just trying to understand 
where PH_PIGGYBANK_JAR might be defined and how it's different from 
$ENV{PH_PIG}/contrib/piggybank/java.

+1 to the patch.

> Make merge-sparse join work with Spark
> --
>
> Key: PIG-4208
> URL: https://issues.apache.org/jira/browse/PIG-4208
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Praveen Rachabattuni
> Assignee: Abhishek Agarwal
> Fix For: spark-branch
>
> Attachments: PIG-4208.patch
>
>
> Related e2e tests: MergeSparseJoin_[1-6]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4680) Enable pig job graphs to resume from last successful state

2015-10-09 Thread Abhishek Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Agarwal updated PIG-4680:
--
Attachment: PIG-4680.patch

Here is a first-cut implementation. I am in the process of adding tests.

> Enable pig job graphs to resume from last successful state
> --
>
> Key: PIG-4680
> URL: https://issues.apache.org/jira/browse/PIG-4680
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Abhishek Agarwal
> Assignee: Abhishek Agarwal
> Attachments: PIG-4680.patch
>
>
> Pig scripts can have multiple ETL jobs in the DAG, which may take hours to 
> finish. If a transient error occurs, the job fails. When the job is rerun, 
> all the nodes in the job graph rerun, even though some of them may already 
> have run successfully. Redundant runs waste cluster capacity and delay 
> pipelines. 
> In case of failure, we can persist the graph state. In the next run, only 
> the failed nodes and their successors will rerun. This is of course subject 
> to preconditions such as 
>  - Pig script has not changed
>  - Input locations have not changed
>  - Output data from previous run is intact
>  - Configuration has not changed
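
As a rough illustration of the selection rule described above (rerun only the failed nodes and their successors), here is a minimal Java sketch. It is not taken from PIG-4680.patch: the class ResumePlanSketch, the method nodesToRerun and the string node names are all hypothetical, and the job graph is assumed to be available as a plain adjacency map. The preconditions listed above would still have to be checked separately before trusting a persisted state.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Pig or of the patch.
class ResumePlanSketch {

    /**
     * Returns the failed nodes plus every node reachable from them,
     * i.e. the part of the job graph that has to run again.
     *
     * @param successors adjacency map: node -> nodes that consume its output
     * @param failed     nodes that did not finish in the previous run
     */
    static Set<String> nodesToRerun(Map<String, List<String>> successors,
                                    Set<String> failed) {
        Set<String> rerun = new LinkedHashSet<>(failed);
        Deque<String> queue = new ArrayDeque<>(failed);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            for (String next : successors.getOrDefault(node, List.of())) {
                // add() returns false if the node was already scheduled
                if (rerun.add(next)) {
                    queue.add(next);
                }
            }
        }
        return rerun;
    }

    public static void main(String[] args) {
        // Assumed example DAG: load feeds filter and group, both feed join.
        Map<String, List<String>> dag = Map.of(
                "load", List.of("filter", "group"),
                "filter", List.of("join"),
                "group", List.of("join"),
                "join", List.of("store"));
        // Only "group" failed, so "load" and "filter" are not rerun.
        System.out.println(nodesToRerun(dag, Set.of("group")));
    }
}
```

With the example graph, load and filter are skipped because they are neither failed nodes nor successors of one.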





[jira] [Commented] (PIG-4208) Make merge-sparse join work with Spark

2015-10-09 Thread Abhishek Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950358#comment-14950358
 ] 

Abhishek Agarwal commented on PIG-4208:
---

It is a fix for the MergeSparse E2E tests. Spark was not able to load the 
piggybank classes from the directory. This change solves that. 

> Make merge-sparse join work with Spark
> --
>
> Key: PIG-4208
> URL: https://issues.apache.org/jira/browse/PIG-4208
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Praveen Rachabattuni
> Assignee: Abhishek Agarwal
> Fix For: spark-branch
>
> Attachments: PIG-4208.patch
>
>
> Related e2e tests: MergeSparseJoin_[1-6]





[jira] [Updated] (PIG-4554) Compress pig.script before encoding

2015-10-09 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4554:

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks [~sandyridgeracer] for the patch and thanks [~daijy] 
for the review.

> Compress pig.script before encoding
> ---
>
> Key: PIG-4554
> URL: https://issues.apache.org/jira/browse/PIG-4554
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.14.0
> Reporter: Rohini Palaniswamy
> Assignee: Sandeep Samdaria
> Labels: newbie
> Fix For: 0.16.0
>
> Attachments: PIG-4554-2.patch, PIG-4554-3.patch, PIG-4554-4.patch, 
> PIG-4554.patch
>
>
> Currently we truncate the pig script (maxScriptSize = 10240), base64 
> encode it and store it in the config. We should remove the truncation and 
> store the full script by compressing it and then base64 encoding it. We 
> already do that for udfcontext, etc. It will save space, as the script will 
> compress really well, and it will also give the full pig script while 
> debugging.
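
For readers unfamiliar with the compress-then-encode idea in the description above, here is a minimal sketch using only the JDK (Java 11 or later). It is an illustration, not the committed patch: the class ScriptCodecSketch and its method names are hypothetical, and the choice of GZIP and java.util.Base64 is an assumption; the Pig code base may well use different utilities for this.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not Pig's actual implementation.
class ScriptCodecSketch {

    /** Compresses the full script with GZIP, then base64 encodes the bytes. */
    static String compressAndEncode(String script) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(script.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The GZIP stream is closed here, so the buffer holds the full output.
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    /** Reverses compressAndEncode: base64 decode, then GZIP decompress. */
    static String decodeAndDecompress(String encoded) {
        byte[] compressed = Base64.getDecoder().decode(encoded);
        try (GZIPInputStream gzip =
                     new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A repetitive script, far larger than the 10240 limit mentioned above.
        String script = "A = LOAD 'input' AS (x:int);\n".repeat(2000);
        String encoded = compressAndEncode(script);
        System.out.println("original length: " + script.length());
        System.out.println("encoded length:  " + encoded.length());
        System.out.println("round trip ok:   "
                + decodeAndDecompress(encoded).equals(script));
    }
}
```

Because base64 adds roughly a third to the size of its input, compressing first is what makes it feasible to keep the whole script instead of a truncated prefix.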





[jira] [Commented] (PIG-4673) Built In UDF - REPLACE_MULTI : For a given string, search and replace all occurrences of search keys with replacement values.

2015-10-09 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950250#comment-14950250
 ] 

Rohini Palaniswamy commented on PIG-4673:
-

Yes. Committed 
https://issues.apache.org/jira/secure/attachment/12765682/PIG-4673-fix-test-failure.patch
 after 

svn mv 
contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestBuiltinReplaceMulti.java
 
contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestReplaceMulti.java

Thanks for the review, Daniel.

> Built In UDF - REPLACE_MULTI : For a given string, search and replace all 
> occurrences of search keys with replacement values. 
> --
>
> Key: PIG-4673
> URL: https://issues.apache.org/jira/browse/PIG-4673
> Project: Pig
> Issue Type: New Feature
> Components: piggybank
> Affects Versions: site
> Reporter: Murali Rao
> Assignee: Murali Rao
> Priority: Minor
> Labels: None
> Fix For: 0.16.0
>
> Attachments: PIG-4673-1.patch, PIG-4673-fix-test-failure.patch, 
> replace_multi_udf.patch
>
>
> Let's say we have a string = 'A1B2C3D4'. Our objective is to replace A with 1, 
> B with 2, C with 3 and D with 4 to derive the string 11223344. 
> Using the existing REPLACE method: 
> REPLACE(REPLACE(REPLACE(REPLACE('A1B2C3D4','A','1'),'B','2'),'C','3'),'D','4')
>  
> With the proposed UDF, REPLACE_MULTI:
> General syntax: 
> REPLACE_MULTI ( sourceString,  [  search1#replacement1, ... ] )
> REPLACE_MULTI ( 'A1B2C3D4',  [ 'A'#'1','B'#'2', 'C'#'3', 'D'#'4' ] )
> Advantages: 
>   1. Fewer function calls. 
>   2. Easier to code and more readable.
>   
> Let me know your thoughts/inputs on having this UDF in Piggy Bank. I will take 
> this up based on the feedback.
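
The behaviour proposed above can be sketched in plain Java as follows. This is not the committed piggybank UDF: the class ReplaceMultiSketch and the method replaceMulti are hypothetical, the replacements are applied one after another in map order (mirroring the nested REPLACE calls), and matching is literal via String.replace, which is an assumption made here rather than something the issue specifies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the proposed UDF; illustration only.
class ReplaceMultiSketch {

    /**
     * Applies each search -> replacement pair in turn, in the map's
     * iteration order, the same way the chained REPLACE calls would.
     */
    static String replaceMulti(String source, Map<String, String> replacements) {
        if (source == null || replacements == null) {
            return source;
        }
        String result = source;
        for (Map.Entry<String, String> entry : replacements.entrySet()) {
            // Literal (non-regex) replacement of every occurrence
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the result is deterministic.
        Map<String, String> pairs = new LinkedHashMap<>();
        pairs.put("A", "1");
        pairs.put("B", "2");
        pairs.put("C", "3");
        pairs.put("D", "4");
        System.out.println(replaceMulti("A1B2C3D4", pairs)); // prints 11223344
    }
}
```

Since the pairs are applied sequentially, a later pair can also rewrite text produced by an earlier one; an ordered map keeps that behaviour predictable.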





[jira] [Created] (PIG-4699) Print Job stats information in Tez like mapreduce

2015-10-09 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-4699:
---

 Summary: Print Job stats information in Tez like mapreduce
 Key: PIG-4699
 URL: https://issues.apache.org/jira/browse/PIG-4699
 Project: Pig
 Issue Type: Improvement
 Reporter: Rohini Palaniswamy
 Assignee: Rohini Palaniswamy
 Fix For: 0.16.0


Job stats information in mapreduce is extremely useful while debugging or 
looking into performance bottlenecks to find which of the mapreduce jobs is 
taking time. Without it, it is hard to figure out the same in Tez, or to tell 
which aliases are being processed in the vertices of Tez. 





[jira] [Commented] (PIG-4670) Embedded Python scripts still parse line by line

2015-10-09 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950218#comment-14950218
 ] 

Rohini Palaniswamy commented on PIG-4670:
-

Earlier I had also changed the code for describe and explain, which caused 
the query to execute, so I reverted those changes back to what they were. 
Basically it turns on the batch, but calls parseStopOnError(true) so that 
executeBatch() is not done in parseStopOnError. 

{code}
if (!sameBatch) {
    executeBatch();
}
{code}

Additionally, I added pigServer.setSkipParseInRegisterForBatch(true); to skip 
parsing while registering the query, as dumpSchema() was calling parseQuery 
again.

> Embedded Python scripts still parse line by line
> 
>
> Key: PIG-4670
> URL: https://issues.apache.org/jira/browse/PIG-4670
> Project: Pig
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4670-1.patch, PIG-4670-2.patch, 
> PIG-4670-fix-e2e-failures-nowhitespacechange.patch, 
> PIG-4670-fix-e2e-failures.patch
>
>
> PIG-3204 fixed pig script parsing to parse in batches instead of line by 
> line. But the fix in BoundScript is not right and it is still parsing line by 
> line. That makes parsing take a long time for very large pig scripts using 
> PigStorage when there is no schema file stored and -noschema is not given, 
> as it tries to find the schema file many times.
> It should be grunt.parseStopOnError(false); instead of 
> grunt.parseStopOnError(true); to make it parse statements in batch.


