[jira] Subscription: PIG patch available

2015-06-22 Thread jira
Issue Subscription
Filter: PIG patch available (25 issues)

Subscriber: pigdaily

Key       Summary
PIG-4598  Allow user defined plan optimizer rules
https://issues.apache.org/jira/browse/PIG-4598
PIG-4581  thread safe issue in NodeIdGenerator
https://issues.apache.org/jira/browse/PIG-4581
PIG-4574  Eliminate identity vertex for order by and skewed join right after LOAD
https://issues.apache.org/jira/browse/PIG-4574
PIG-4539  New PigUnit
https://issues.apache.org/jira/browse/PIG-4539
PIG-4526  Make setting up the build environment easier
https://issues.apache.org/jira/browse/PIG-4526
PIG-4468  Pig's jackson version conflicts with that of hadoop 2.6.0
https://issues.apache.org/jira/browse/PIG-4468
PIG-4455  Should use DependencyOrderWalker instead of DepthFirstWalker in MRPrinter
https://issues.apache.org/jira/browse/PIG-4455
PIG-4417  Pig's register command should support automatic fetching of jars from repo.
https://issues.apache.org/jira/browse/PIG-4417
PIG-4373  Implement Optimize the use of DistributedCache(PIG-2672) and PIG-3861 in Tez
https://issues.apache.org/jira/browse/PIG-4373
PIG-4341  Add CMX support to pig.tmpfilecompression.codec
https://issues.apache.org/jira/browse/PIG-4341
PIG-4323  PackageConverter hanging in Spark
https://issues.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
https://issues.apache.org/jira/browse/PIG-4313
PIG-4251  Pig on Storm
https://issues.apache.org/jira/browse/PIG-4251
PIG-4111  Make Pig compiles with avro-1.7.7
https://issues.apache.org/jira/browse/PIG-4111
PIG-4002  Disable combiner when map-side aggregation is used
https://issues.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
https://issues.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
https://issues.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
https://issues.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
https://issues.apache.org/jira/browse/PIG-3873
PIG-3866  Create ThreadLocal classloader per PigContext
https://issues.apache.org/jira/browse/PIG-3866
PIG-3864  ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
https://issues.apache.org/jira/browse/PIG-3864
PIG-3851  Upgrade jline to 2.11
https://issues.apache.org/jira/browse/PIG-3851
PIG-3668  COR built-in function when atleast one of the coefficient values is NaN
https://issues.apache.org/jira/browse/PIG-3668
PIG-3635  Fix e2e tests for Hadoop 2.X on Windows
https://issues.apache.org/jira/browse/PIG-3635
PIG-3587  add functionality for rolling over dates
https://issues.apache.org/jira/browse/PIG-3587

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384


[jira] [Commented] (PIG-4610) Enable "TestOrcStorage" unit test in spark mode

2015-06-22 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597045#comment-14597045
 ] 

Mohit Sabharwal commented on PIG-4610:
--

+1 (non-binding)

> Enable "TestOrcStorage" unit test in spark mode
> ---
>
> Key: PIG-4610
> URL: https://issues.apache.org/jira/browse/PIG-4610
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4610.patch
>
>
> In https://builds.apache.org/job/Pig-spark/222/#showFailuresLink, it shows 
> following unit test failures about "TestOrcStorage":
> org.apache.pig.builtin.TestOrcStorage.testJoinWithPruning
> org.apache.pig.builtin.TestOrcStorage.testLoadStoreMoreDataType
> org.apache.pig.builtin.TestOrcStorage.testMultiStore



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4610) Enable "TestOrcStorage" unit test in spark mode

2015-06-22 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-4610:
--
Attachment: PIG-4610.patch

[~mohitsabharwal],[~kexianda],[~praveenr019],[~xuefuz]: 
PIG-4610.patch fixes the following unit test failures:
org.apache.pig.builtin.TestOrcStorage.testJoinWithPruning
org.apache.pig.builtin.TestOrcStorage.testLoadStoreMoreDataType
org.apache.pig.builtin.TestOrcStorage.testMultiStore


Here is an example to explain why it failed before.
testOrcStorage.tmp.pig (orc-file-11-format.orc can be found in 
$PIG_HOME/test/org/apache/pig/builtin/orc/orc-file-11-format.orc):
{code}
A = load './orc-file-11-format.orc' using OrcStorage();
B = foreach A generate int1,string1;
D = limit B 10;
store D into './testOrcStorage.tmp.out';
{code}

The result in Spark:
{code}
false   1
false   1
false   1
false   1
false   1
false   1
false   1
false   1
false   1
false   1
{code}

The result in MR:
{code}
65536   hi
65536   bye
65536   hi
65536   bye
65536   hi
65536   bye
65536   hi
65536   bye
65536   hi
65536   bye
{code}
The data in orc-file-11-format.orc looks like the following; the required 
columns are the 4th and 9th (this information is stored in the ORC file itself):
{code}
{true, 100, 2048, 65536, 9223372036854775807, 2.0, -5.0, , bye, {[{1, bye}, {2, 
sigh}]}, [{1, cat}, {-10, in}, {1234, hat}], {chani={5, chani}, 
mauddib={1, mauddib}}, 2000-03-12 15:00:01, 12345678.6547457}
{code}

The difference between Spark and MR arises because [{{OrcStorage#mRequiredColumns}} 
|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/builtin/OrcStorage.java#L298]
 is not 
initialized, since [{{UDFContext.getUDFContext().isFrontend()}}|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/builtin/OrcStorage.java#L296]
 is true. {{UDFContext.getUDFContext().isFrontend()}} returns true 
because 
[{{jconf.get(MRConfiguration.JOB_APPLICATION_ATTEMPT_ID)}}|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/util/UDFContext.java#L238]
 is null. PIG-4610.patch sets {{MRConfiguration.JOB_APPLICATION_ATTEMPT_ID}} 
in SparkUtil#newJobConf.
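The gist of that check can be sketched in plain Java. This is a simplified model, not Pig's actual code: the key name {{mapreduce.job.application.attempt.id}} and the {{FrontendCheck}} helper below are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of UDFContext's frontend detection: Pig assumes it is
// running on the frontend when no application attempt id is present in the
// job configuration. (Key name is an assumption for illustration.)
public class FrontendCheck {
    static final String ATTEMPT_ID_KEY = "mapreduce.job.application.attempt.id";

    static boolean isFrontend(Map<String, String> jobConf) {
        return jobConf.get(ATTEMPT_ID_KEY) == null;
    }

    public static void main(String[] args) {
        // Before the patch: the Spark-side JobConf lacks the attempt id, so
        // backend tasks are misdetected as frontend and
        // OrcStorage#mRequiredColumns is never initialized.
        Map<String, String> conf = new HashMap<>();
        System.out.println(isFrontend(conf)); // true

        // What PIG-4610.patch does in spirit: set the attempt id when
        // building the JobConf (SparkUtil#newJobConf).
        conf.put(ATTEMPT_ID_KEY, "1");
        System.out.println(isFrontend(conf)); // false
    }
}
```

With the attempt id set, the backend code path initializes the required-columns pruning, so Spark produces the same pruned columns as MR.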


> Enable "TestOrcStorage" unit test in spark mode
> ---
>
> Key: PIG-4610
> URL: https://issues.apache.org/jira/browse/PIG-4610
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4610.patch
>
>
> In https://builds.apache.org/job/Pig-spark/222/#showFailuresLink, it shows 
> following unit test failures about "TestOrcStorage":
> org.apache.pig.builtin.TestOrcStorage.testJoinWithPruning
> org.apache.pig.builtin.TestOrcStorage.testLoadStoreMoreDataType
> org.apache.pig.builtin.TestOrcStorage.testMultiStore





[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2015-06-22 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596469#comment-14596469
 ] 

Rohini Palaniswamy commented on PIG-4608:
-

The problem with the implicit add is that a user typo could turn an update into 
an add. For example, if the user specified

updated = FOREACH three_numbers UPDATE_STRICT 3 AS f3, 6 AS f7;

but actually meant to say 6 AS f6;, then the script will run fine and will 
require more debugging to find out why the output is not as expected. So I would 
prefer having ... at the end to make any additions explicit. That way one can 
throw errors for updates of columns that do not exist.

> FOREACH ... UPDATE
> --
>
> Key: PIG-4608
> URL: https://issues.apache.org/jira/browse/PIG-4608
> Project: Pig
>  Issue Type: New Feature
>Reporter: Haley Thrapp
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do 
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large 
> number of fields (in the 20-200 range). Often, we need to only make 
> modifications to a few fields. The FOREACH ... UPDATE statement, allows the 
> developer to focus on the actual logical changes instead of having to list 
> all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe 
> this can be done with changes to the parser and the creation of a new 
> LOUpdate. No physical plan changes should be needed because we will leverage 
> what LOGenerate does.





[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2015-06-22 Thread Kevin J. Price (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596462#comment-14596462
 ] 

Kevin J. Price commented on PIG-4608:
-

Several of us actually discussed this at some length, and didn't think it was 
worth differentiating between modified columns and appended columns in the 
command. Two ideas we had:
# A token, like you have, indicating that the remaining fields are being added. 
We were considering using an 'ADD' keyword. As in:
{code}
updated = FOREACH three_numbers UPDATE 3 AS f3, 6 AS f6 ADD f1+f2 AS new_sum;
{code}
# Separate statements for 'strict' versus 'non-strict' mode. e.g., for updating 
without appending you would use
{code}
updated = FOREACH three_numbers UPDATE_STRICT 3 AS f3, 6 AS f6;
{code}
and for updating with appending, you could use
{code}
updated = FOREACH three_numbers UPDATE 3 AS f3, 6 AS f6, f1+f2 AS new_sum;
{code}

However, our overall view from writing pig scripts is that chances are very few 
people would ever want to use the strict mode, nor did we see much value in 
having the extra token (ADD or ...) separating out appended columns. From a 
programming viewpoint, it just makes more logical sense to us to view it as an 
implicit "update or add" construct.

> FOREACH ... UPDATE
> --
>
> Key: PIG-4608
> URL: https://issues.apache.org/jira/browse/PIG-4608
> Project: Pig
>  Issue Type: New Feature
>Reporter: Haley Thrapp
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do 
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large 
> number of fields (in the 20-200 range). Often, we need to only make 
> modifications to a few fields. The FOREACH ... UPDATE statement, allows the 
> developer to focus on the actual logical changes instead of having to list 
> all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe 
> this can be done with changes to the parser and the creation of a new 
> LOUpdate. No physical plan changes should be needed because we will leverage 
> what LOGenerate does.





[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2015-06-22 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596409#comment-14596409
 ] 

Rohini Palaniswamy commented on PIG-4608:
-

Sounds good. Can we just add ... for the ones to be appended to make appending 
clear? i.e.

updated = FOREACH three_numbers GENERATE 3 as f3, 6 as f6, 9 as f9 ... f1+f2 as 
new_sum;

> FOREACH ... UPDATE
> --
>
> Key: PIG-4608
> URL: https://issues.apache.org/jira/browse/PIG-4608
> Project: Pig
>  Issue Type: New Feature
>Reporter: Haley Thrapp
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do 
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large 
> number of fields (in the 20-200 range). Often, we need to only make 
> modifications to a few fields. The FOREACH ... UPDATE statement, allows the 
> developer to focus on the actual logical changes instead of having to list 
> all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe 
> this can be done with changes to the parser and the creation of a new 
> LOUpdate. No physical plan changes should be needed because we will leverage 
> what LOGenerate does.





[jira] [Comment Edited] (PIG-4608) FOREACH ... UPDATE

2015-06-22 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596409#comment-14596409
 ] 

Rohini Palaniswamy edited comment on PIG-4608 at 6/22/15 6:42 PM:
--

Sounds good. Can we just add ... once at the end for the ones to be appended to 
make appending clear? i.e.

updated = FOREACH three_numbers GENERATE 3 as f3, 6 as f6, 9 as f9 ... f1+f2 as 
new_sum;


was (Author: rohini):
Sounds good. Can we just add ... for the ones to be appended to make appending 
clear? i.e

updated = FOREACH three_numbers GENERATE 3 as f3, 6 as f6, 9 as f9 ... f1+f2 as 
new_sum;

> FOREACH ... UPDATE
> --
>
> Key: PIG-4608
> URL: https://issues.apache.org/jira/browse/PIG-4608
> Project: Pig
>  Issue Type: New Feature
>Reporter: Haley Thrapp
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do 
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large 
> number of fields (in the 20-200 range). Often, we need to only make 
> modifications to a few fields. The FOREACH ... UPDATE statement, allows the 
> developer to focus on the actual logical changes instead of having to list 
> all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe 
> this can be done with changes to the parser and the creation of a new 
> LOUpdate. No physical plan changes should be needed because we will leverage 
> what LOGenerate does.





[jira] [Commented] (PIG-4443) Write inputsplits in Tez to disk if the size is huge and option to compress pig input splits

2015-06-22 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596201#comment-14596201
 ] 

Rohini Palaniswamy commented on PIG-4443:
-

Just to be sure, are you getting this error with Pig on Tez or with MapReduce? 
And does the error occur while submitting the job, or after it completes, when 
fetching the task reports?

> Write inputsplits in Tez to disk if the size is huge and option to compress 
> pig input splits
> 
>
> Key: PIG-4443
> URL: https://issues.apache.org/jira/browse/PIG-4443
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.15.0
>
> Attachments: PIG-4443-1.patch, PIG-4443-Fix-TEZ-2192-2.patch, 
> PIG-4443-Fix-TEZ-2192.patch
>
>
> Pig sets the input split information in user payload and when running against 
> a table with 10s of 1000s of partitions, DAG submission fails with
> java.io.IOException: Requested data length 305844060 is longer than maximum
> configured RPC length 67108864





[jira] [Commented] (PIG-4443) Write inputsplits in Tez to disk if the size is huge and option to compress pig input splits

2015-06-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PIG-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595756#comment-14595756
 ] 

Ángel Álvarez commented on PIG-4443:


I have a script in Pig that loads data from Hive using 
org.apache.hive.hcatalog.pig.HCatLoader. This script works fine in Pig 0.14, 
but in Pig 0.15 I'm getting this error:

Requested data length 160452289 is longer than maximum configured RPC length 
67108864

In Pig 0.14 I had to deal with this issue too, but I could always make it work 
by reducing the number of splits in the Hive tables created by Sqoop (using no 
more than 60 splits). Is there any special configuration needed?
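For context (an aside, not from this thread): the 67108864 figure matches Hadoop's default {{ipc.maximum.data.length}} of 64 MB. Assuming that property is the limit being hit, a workaround sometimes tried is raising it in core-site.xml; note this only papers over the oversized split payload that PIG-4443 addresses properly.

{code}
<!-- Sketch under the assumption above; 192 MB is an arbitrary example value. -->
<property>
  <name>ipc.maximum.data.length</name>
  <value>201326592</value>
</property>
{code}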

> Write inputsplits in Tez to disk if the size is huge and option to compress 
> pig input splits
> 
>
> Key: PIG-4443
> URL: https://issues.apache.org/jira/browse/PIG-4443
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.15.0
>
> Attachments: PIG-4443-1.patch, PIG-4443-Fix-TEZ-2192-2.patch, 
> PIG-4443-Fix-TEZ-2192.patch
>
>
> Pig sets the input split information in user payload and when running against 
> a table with 10s of 1000s of partitions, DAG submission fails with
> java.io.IOException: Requested data length 305844060 is longer than maximum
> configured RPC length 67108864





How can I get the last tuple of a bag

2015-06-22 Thread 李运田
I have data like :
(lucy,{(34,,45),(34,,45),(34,,45),(34,,45),(34,,45),(34,,45),(34,,45),(34,,45),(34,,45)})
(lili,{(12,lili,23),(12,lili,23),(12,lili,23),(12,lili,34),(12,lili,23),(12,lili,89),(12,lili,23),(12,lili,23),(12,lili,23),(12,lili,34),(12,lili,23),(12,lili,89),(12,lili,23),(12,lili,23)})
(limaomao,{(,limaomao,56),(,limaomao,56),(,limaomao,56),(,limaomao,56),(,limaomao,56),(,limaomao,56),(,limaomao,56),(,limaomao,56),(,limaomao,56)})
Its schema is: t2: {group: chararray,t1: {(a: int,b: chararray,c: int)}}
I can get the first tuple of t1 using LIMIT or FirstTupleFromBag. But how can I 
get the last tuple of each t1?
Thanks
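One sketch of an answer (untested here, and it assumes "last" means the largest value of field c, which the question does not state; bags carry no inherent order, so some ordering key must be chosen) is a nested ORDER plus LIMIT inside FOREACH:

{code}
-- Hypothetical sketch: pick one tuple per group, ordering by c descending.
last_per_group = FOREACH t2 {
    sorted = ORDER t1 BY c DESC;
    lastt  = LIMIT sorted 1;
    GENERATE group, FLATTEN(lastt);
};
{code}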