[jira] [Updated] (PIG-3434) Null subexpression in bincond nullifies outer tuple (or bag)

2013-08-22 Thread Mark Wagner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Wagner updated PIG-3434:
-

Attachment: PIG-3434.1.patch

Fixed null handling in POUserFunc. It seems that POStatus.STATUS_NULL isn't 
being set everywhere; for example, NULL constants have STATUS_OK. I think 
that is work for another JIRA, though.
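
The status mismatch mentioned above can be illustrated with a small, self-contained model. This is only a sketch: the `Result` and `Status` types below are hypothetical stand-ins, not Pig's actual classes, and the defensive check is one possible guard, not necessarily what the attached patch does.

```java
// Hypothetical, simplified model of an operator result; not Pig's real classes.
final class NullStatusSketch {
    enum Status { OK, NULL, ERR }

    static final class Result {
        final Status status;
        final Object value;

        Result(Status status, Object value) {
            this.status = status;
            this.value = value;
        }
    }

    // Models a NULL constant that, as the comment describes, still reports OK.
    static Result nullConstant() {
        return new Result(Status.OK, null);
    }

    // Fragile check: trusts the status flag alone.
    static boolean isNullByStatusOnly(Result r) {
        return r.status == Status.NULL;
    }

    // Defensive check: treats either signal (flag or null value) as null.
    static boolean isNull(Result r) {
        return r.status == Status.NULL || r.value == null;
    }
}
```

With this model, a consumer that looks only at the status flag misses the null constant, while the defensive check catches it.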

> Null subexpression in bincond nullifies outer tuple (or bag)
> 
>
> Key: PIG-3434
> URL: https://issues.apache.org/jira/browse/PIG-3434
> Project: Pig
>  Issue Type: Bug
>Reporter: Pavel Fedyakov
>Assignee: Mark Wagner
> Attachments: PIG-3434.1.patch
>
>
> According to the docs, for the bincond operator, "If a Boolean subexpression 
> results in null value, the resulting expression is null" 
> (http://pig.apache.org/docs/r0.11.0/basic.html#nulls).
> This works as described in a plain FOREACH ... GENERATE expression:
> {noformat}
> in = load 'in';
> out = FOREACH in GENERATE 1, ($0 > 0 ? 2 : 3);
> dump out;
> {noformat}
> Input file 'in' (3 lines; the 2nd is empty):
> {noformat}
> 0
>
> 1
> {noformat}
> out:
> {noformat}
> (1,3)
> (1,)
> (1,2)
> {noformat}
> But if we wrap the generated fields in a tuple (or bag), the whole 2nd 
> row of the output becomes empty:
> {noformat}
> out = FOREACH in GENERATE (1, ($0 > 0 ? 2 : 3));
> {noformat}
> out:
> {noformat}
> ((1,3))
> ()
> ((1,2))
> {noformat}
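
The behaviour the reporter expects can be sketched in plain Java. This is a minimal model under stated assumptions: `bincond`, `greaterThanZero`, and `wrapInTuple` are hypothetical helper names, and a Java `List` stands in for a Pig tuple. It shows a null condition producing a null field while the enclosing tuple survives.

```java
import java.util.Arrays;
import java.util.List;

final class BincondNullSketch {
    // Three-valued bincond: a null condition yields a null result.
    static Integer bincond(Boolean cond, Integer ifTrue, Integer ifFalse) {
        if (cond == null) {
            return null;
        }
        return cond ? ifTrue : ifFalse;
    }

    // Models "$0 > 0" where $0 may be null (an empty input line).
    static Boolean greaterThanZero(Integer v) {
        return v == null ? null : v > 0;
    }

    // Expected behaviour: the tuple is kept and only its inner field is null.
    static List<Integer> wrapInTuple(Integer v) {
        return Arrays.asList(1, bincond(greaterThanZero(v), 2, 3));
    }
}
```

Under this model, the second input row would yield a tuple whose second field is null, rather than an empty outer tuple.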

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3379) Alias reuse in nested foreach causes PIG script to fail

2013-08-22 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748184#comment-13748184
 ] 

Daniel Dai commented on PIG-3379:
-

The posted logical plan is missing LODistinct. It should be:
{code}
|---EventsPerMinute: (Name: LOForEach Schema: timeStamp#56:long,nbDevices#57:long,nbDevicesWatching#58:long)
|   |
|   (Name: LOGenerate[false,false,false] Schema: timeStamp#56:long,nbDevices#57:long,nbDevicesWatching#58:long)ColumnPrune:InputUids=[50, 49]ColumnPrune:OutputUids=[58, 57, 56]
|   |   |
|   |   (Name: Multiply Type: long Uid: 56)
|   |   |
|   |   |---group:(Name: Project Type: long Uid: 49 Input: 0 Column: (*))
|   |   |
|   |   |---(Name: Cast Type: long Uid: 54)
|   |   |
|   |   |---(Name: Constant Type: int Uid: 54)
|   |   |
|   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 57)
|   |   |
|   |   |---DistinctDevices:(Name: Project Type: bag Uid: 50 Input: 1 Column: (*))
|   |   |
|   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 58)
|   |   |
|   |   |---DistinctDevices:(Name: Project Type: bag Uid: 50 Input: 2 Column: (*))
|   |
|   |---(Name: LOInnerLoad[0] Schema: group#49:long)
|   |
|   |---DistinctDevices: (Name: LODistinct Schema: deviceId#22:chararray)
|   |   |
|   |   |---1-3: (Name: LOForEach Schema: deviceId#22:chararray)
|   |   |   |
|   |   |   (Name: LOGenerate[false] Schema: deviceId#22:chararray)
|   |   |   |   |
|   |   |   |   deviceId:(Name: Project Type: chararray Uid: 22 Input: 0 Column: (*))
|   |   |   |
|   |   |   |---(Name: LOInnerLoad[1] Schema: deviceId#22:chararray)
|   |   |
|   |   |---Events: (Name: LOInnerLoad[1] Schema: eventTime#21:long,deviceId#22:chararray,eventName#23:chararray)
{code}

The plan looks right.

I talked with [~xuefuz]. The idea is to use projectedOperator instead of the 
alias at the time we convert an alias to a position. The newly introduced 
projectedOperator is only used in alias translation. After that, input# and 
col# will be used as the coordinates of ProjectExpression. The patch looks 
good. I will commit it once tests pass.
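
To illustrate why binding to an operator rather than to an alias name helps when an alias is reused, here is a small self-contained sketch. The `Op`, `ByAlias`, and `ByOperator` types are hypothetical and do not reflect the classes in the actual patch. The sketch only shows that a name-based lookup follows the reassignment, while a captured operator reference does not.

```java
import java.util.HashMap;
import java.util.Map;

final class AliasBindingSketch {
    // Hypothetical stand-in for a logical operator in a nested plan.
    static final class Op {
        final String kind;

        Op(String kind) {
            this.kind = kind;
        }
    }

    // Projection that remembers only the alias name and resolves it later.
    static final class ByAlias {
        final String alias;

        ByAlias(String alias) {
            this.alias = alias;
        }

        Op resolve(Map<String, Op> aliases) {
            return aliases.get(alias);
        }
    }

    // Projection that captures the operator the alias referred to when it was seen.
    static final class ByOperator {
        final Op projectedOperator;

        ByOperator(Op projectedOperator) {
            this.projectedOperator = projectedOperator;
        }

        Op resolve() {
            return projectedOperator;
        }
    }

    // Returns { kind seen by the late name lookup, kind seen by the captured operator }.
    static String[] demo() {
        Map<String, Op> aliases = new HashMap<>();
        aliases.put("DistinctDevices", new Op("LODistinct"));
        ByAlias late = new ByAlias("DistinctDevices");
        ByOperator early = new ByOperator(aliases.get("DistinctDevices"));
        // The script then reuses the same alias for a FILTER.
        aliases.put("DistinctDevices", new Op("LOFilter"));
        return new String[] { late.resolve(aliases).kind, early.resolve().kind };
    }
}
```

In this toy model the name-based projection ends up pointing at the filter, whereas the operator-based projection still refers to the distinct.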

> Alias reuse in nested foreach causes PIG script to fail
> ---
>
> Key: PIG-3379
> URL: https://issues.apache.org/jira/browse/PIG-3379
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: PIG-3379-draft.patch, PIG-3379.patch
>
>
> The following script fails:
> {code:title=temp.pig}
> Events = LOAD 'x' AS (eventTime:long, deviceId:chararray, 
> eventName:chararray);
> Events = FOREACH Events GENERATE eventTime, deviceId, eventName;
> EventsPerMinute = GROUP Events BY (eventTime / 6);
> EventsPerMinute = FOREACH EventsPerMinute {
>   DistinctDevices = DISTINCT Events.deviceId;
>   nbDevices = SIZE(DistinctDevices);
>   DistinctDevices = FILTER Events BY eventName == 'xuaHeartBeat';
>   nbDevicesWatching = SIZE(DistinctDevices);
>   GENERATE $0*6 as timeStamp, nbDevices as nbDevices, nbDevicesWatching 
> as nbDevicesWatching;
> }
> EventsPerMinute = FILTER EventsPerMinute BY timeStamp >= 0  AND timeStamp < 
> 10;
> A = FOREACH EventsPerMinute GENERATE timeStamp;
> describe A;
> {code}
> With the error:
> {code}
> 2013-07-16 11:31:20,450 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1025: Invalid field projection. Projected field [timeStamp] does not exist in 
> schema: deviceId:chararray.
> {code}
> Using a different alias name for the 2nd "DistinctDevices" fixes the problem. 
> Removing the last FILTER statement also fixes it.



[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748183#comment-13748183
 ] 

Julien Le Dem commented on PIG-3419:


+1 LGTM
If test-commit passes I think we can commit to TRUNK

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.12
>Reporter: Achal Soni
>Assignee: Achal Soni
>Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch, 
> stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
> updated-8-22-2013-exec-engine.patch
>
>
> In an effort to adapt Pig to work using Apache Tez 
> (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for 
> a cleaner ExecutionEngine abstraction than existed before. The changes are 
> not that major as Pig was already relatively abstracted out between the 
> frontend and backend. The changes in the attached commit are essentially the 
> barebones changes -- I tried to not change the structure of Pig's different 
> components too much. I think it will be interesting to see in the future how 
> we can refactor more areas of Pig to really honor this abstraction between 
> the frontend and backend. 
> Some of the changes were to reinstate an ExecutionEngine interface to tie 
> together the frontend and backend, to make the changes in Pig to delegate 
> to the EE when necessary, and to create an MRExecutionEngine that implements 
> this interface. Other work included changing ExecType to cycle through the 
> ExecutionEngines on the classpath and select the appropriate one (this is 
> done using Java ServiceLoader, the same way MapReduce chooses the 
> framework to use between local and distributed mode). I also tried to make 
> ScriptState, JobStats, and PigStats as abstract as possible in their current 
> state. I think some future work will be needed here, perhaps to 
> re-evaluate the usage of ScriptState and the responsibilities of the 
> different statistics classes. I haven't touched the PPNL, but I think more 
> abstraction is needed here, perhaps in a separate patch. 
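
The ServiceLoader-based selection described above might look roughly like the following. This is a hedged sketch, not the code in the patch: `EngineProvider`, `accepts`, and `engineName` are hypothetical names chosen for illustration. Only `java.util.ServiceLoader` itself is a real JDK API, and providers would have to be listed under `META-INF/services/` to be discovered.

```java
import java.util.Optional;
import java.util.ServiceLoader;

final class EngineSelectionSketch {
    // Hypothetical provider interface. Implementations would be listed in a
    // META-INF/services/<fully qualified interface name> file to be discoverable.
    interface EngineProvider {
        boolean accepts(String execTypeName);

        String engineName();
    }

    // Walks the candidates and returns the first one that accepts the requested mode.
    static Optional<EngineProvider> select(Iterable<EngineProvider> candidates, String execTypeName) {
        for (EngineProvider p : candidates) {
            if (p.accepts(execTypeName)) {
                return Optional.of(p);
            }
        }
        return Optional.empty();
    }

    // Classpath-based entry point: ServiceLoader is itself an Iterable of providers.
    static Optional<EngineProvider> selectFromClasspath(String execTypeName) {
        return select(ServiceLoader.load(EngineProvider.class), execTypeName);
    }

    // Simple in-memory provider, used here only for demonstration.
    static EngineProvider provider(final String mode, final String engine) {
        return new EngineProvider() {
            @Override
            public boolean accepts(String execTypeName) {
                return mode.equalsIgnoreCase(execTypeName);
            }

            @Override
            public String engineName() {
                return engine;
            }
        };
    }
}
```

Keeping `select` independent of ServiceLoader makes the selection rule easy to exercise with an in-memory list, while `selectFromClasspath` shows where the classpath lookup would plug in.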



[jira] Subscription: PIG patch available

2013-08-22 Thread jira
Issue Subscription
Filter: PIG patch available (18 issues)

Subscriber: pigdaily

Key         Summary
PIG-3431    Return more information for parsing related exceptions.
            https://issues.apache.org/jira/browse/PIG-3431
PIG-3430    Add xml format for explaining MapReduce Plan.
            https://issues.apache.org/jira/browse/PIG-3430
PIG-3426    Add support for removing s3 files
            https://issues.apache.org/jira/browse/PIG-3426
PIG-3419    Pluggable Execution Engine
            https://issues.apache.org/jira/browse/PIG-3419
PIG-3379    Alias reuse in nested foreach causes PIG script to fail
            https://issues.apache.org/jira/browse/PIG-3379
PIG-3374    CASE and IN fail when expression includes dereferencing operator
            https://issues.apache.org/jira/browse/PIG-3374
PIG-3349    Document ToString(Datetime, String) UDF
            https://issues.apache.org/jira/browse/PIG-3349
PIG-3346    New property that controls the number of combined splits
            https://issues.apache.org/jira/browse/PIG-3346
PIG-        Fix remaining Windows core unit test failures
            https://issues.apache.org/jira/browse/PIG-
PIG-3325    Adding a tuple to a bag is slow
            https://issues.apache.org/jira/browse/PIG-3325
PIG-3295    Casting from bytearray failing after Union (even when each field is from a single Loader)
            https://issues.apache.org/jira/browse/PIG-3295
PIG-3292    Logical plan invalid state: duplicate uid in schema during self-join to get cross product
            https://issues.apache.org/jira/browse/PIG-3292
PIG-3257    Add unique identifier UDF
            https://issues.apache.org/jira/browse/PIG-3257
PIG-3199    Expose LogicalPlan via PigServer API
            https://issues.apache.org/jira/browse/PIG-3199
PIG-3117    A debug mode in which pig does not delete temporary files
            https://issues.apache.org/jira/browse/PIG-3117
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3048    Add mapreduce workflow information to job configuration
            https://issues.apache.org/jira/browse/PIG-3048
PIG-3021    Split results missing records when there is null values in the column comparison
            https://issues.apache.org/jira/browse/PIG-3021

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748137#comment-13748137
 ] 

Cheolsoo Park commented on PIG-3419:


I will kick off the unit tests with the new patch now. :-)
{quote}
if you can tell me what exactly I should be doing for indentation/in what files 
that'd be great.
{quote}
This is not a big deal. Basically, you can run the following command to replace 
tab characters in your patch:
{code}
sed -i .orig '/^+/,/$/ s/<tab>/<4 whitespaces>/g' updated-8-22-2013-exec-engine.patch
{code}
Then the modified patch can be applied with the "-l" option 
(--ignore-whitespace): 
{code}
patch -l < updated-8-22-2013-exec-engine.patch
{code}
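
For anyone who prefers not to rely on sed, the tab replacement described above can be sketched in Java. This is an illustrative helper, not part of Pig: the class and method names are hypothetical, and unlike the sed one-liner it deliberately skips the "+++" file header lines, which is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class PatchTabSketch {
    // Replaces each tab with four spaces on added lines ("+...") of a unified diff.
    // Lines starting with "+++" are treated as file headers and left untouched;
    // note that this simple test also skips an added line whose content begins with "++".
    static List<String> expandTabsOnAddedLines(List<String> patchLines) {
        List<String> out = new ArrayList<>(patchLines.size());
        for (String line : patchLines) {
            boolean added = line.startsWith("+") && !line.startsWith("+++");
            out.add(added ? line.replace("\t", "    ") : line);
        }
        return out;
    }
}
```

Context lines and removed lines are passed through unchanged, which is why the result would still be applied with the "-l" option.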

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.12
>Reporter: Achal Soni
>Assignee: Achal Soni
>Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch, 
> stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
> updated-8-22-2013-exec-engine.patch
>
>


[jira] [Updated] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Achal Soni (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Achal Soni updated PIG-3419:


Attachment: updated-8-22-2013-exec-engine.patch

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.12
>Reporter: Achal Soni
>Assignee: Achal Soni
>Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch, 
> stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
> updated-8-22-2013-exec-engine.patch
>
>


[jira] [Updated] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Achal Soni (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Achal Soni updated PIG-3419:


Attachment: (was: updated-8-22-2013-exec-engine.patch)

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.12
>Reporter: Achal Soni
>Assignee: Achal Soni
>Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch, 
> stats_scriptstate.patch, test_failures.txt, test_suite.patch
>
>


[jira] [Commented] (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

2013-08-22 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748057#comment-13748057
 ] 

Russell Jurney commented on PIG-1420:
-

This JIRA is not fixed. I don't know how to re-open it, however.

> Make CONCAT act on all fields of a tuple, instead of just the first two 
> fields of a tuple
> -
>
> Key: PIG-1420
> URL: https://issues.apache.org/jira/browse/PIG-1420
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Russell Jurney
>Assignee: Russell Jurney
>  Labels: five, high
> Fix For: 0.8.0
>
> Attachments: addconcat2.patch, PIG-1420.2.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> org.apache.pig.builtin.CONCAT (which acts on DataByteArrays internally) and 
> org.apache.pig.builtin.StringConcat (which acts on Strings internally) both 
> act on the first two fields of a tuple.  This results in ugly nested CONCAT 
> calls like:
> CONCAT(CONCAT(A, ' '), B)
> The more desirable form is:
> CONCAT(A, ' ', B)
> This change will be backwards compatible, provided that no one was relying on 
> the fact that CONCAT ignores fields after the first two in a tuple.  This 
> seems a reasonable assumption to make, or at least a small break in 
> compatibility for a sizable improvement.
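
The proposed variadic behaviour can be sketched as follows. This is a plain-Java illustration, not the attached patch: a `List<String>` stands in for the tuple, and returning null when the tuple is empty or any field is null is an assumption of this sketch.

```java
import java.util.List;

final class VariadicConcatSketch {
    // Concatenates every field of the "tuple" in order.
    // Returns null for an empty tuple or if any field is null (assumed null handling).
    static String concat(List<String> fields) {
        if (fields.isEmpty()) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            if (field == null) {
                return null;
            }
            sb.append(field);
        }
        return sb.toString();
    }
}
```

With this shape, a single three-field call gives the same string as the nested two-field form, which is the backwards-compatibility argument made in the description.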



[jira] [Updated] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Achal Soni (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Achal Soni updated PIG-3419:


Attachment: (was: updated-8-22-2013-exec-engine.patch)

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.12
>Reporter: Achal Soni
>Assignee: Achal Soni
>Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch, 
> stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
> updated-8-22-2013-exec-engine.patch
>
>


[jira] [Updated] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Achal Soni (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Achal Soni updated PIG-3419:


Attachment: updated-8-22-2013-exec-engine.patch

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.12
>Reporter: Achal Soni
>Assignee: Achal Soni
>Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch, 
> stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
> updated-8-22-2013-exec-engine.patch
>
>


[jira] [Updated] (PIG-3435) Custom Partitioner not working with MultiQueryOptimizer

2013-08-22 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3435:
--

Attachment: pig-3435-v01.patch

I looked at the multi-query optimization code and documentation, and I 
chickened out. 

I am taking the same approach as PIG-1108 and simply skipping the MR jobs that 
have a custom partitioner.

I will attach the test case soon.

> Custom Partitioner not working with MultiQueryOptimizer
> ---
>
> Key: PIG-3435
> URL: https://issues.apache.org/jira/browse/PIG-3435
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
> Attachments: pig-3435-v01.patch
>
>
> While looking at PIG-3385, I noticed some issues in the handling of custom 
> partitioners with multi-query optimization.
> {noformat}
> C1 = group B1 by col1 PARTITION BY
>org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
> C2 = group B2 by col1 PARTITION BY
>org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
> {noformat}
> This seems to be merged into one mapreduce job correctly, but the custom 
> partitioner information is lost.
> {noformat}
> C1 = group B1 by col1 PARTITION BY 
> org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
> C2 = group B2 by col1 parallel 2;
> {noformat}
> These seem to be merged even though they should run with two different 
> partitioners.
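
The conservative rule mentioned in the comment above (skip jobs that have a custom partitioner) could be expressed as a simple eligibility check. This is only a sketch under stated assumptions: `Job` and `canMerge` are hypothetical names, and the real optimizer works on map-reduce plan operators, not on this toy class.

```java
final class PartitionerMergeSketch {
    // Hypothetical stand-in for a map-reduce job; customPartitioner is null when none is set.
    static final class Job {
        final String name;
        final String customPartitioner;

        Job(String name, String customPartitioner) {
            this.name = name;
            this.customPartitioner = customPartitioner;
        }
    }

    // Conservative rule: a job with a custom partitioner is never merged,
    // so two jobs are merge candidates only if neither one sets a partitioner.
    static boolean canMerge(Job a, Job b) {
        return a.customPartitioner == null && b.customPartitioner == null;
    }
}
```

Under this rule both examples from the description are left unmerged, which avoids losing or mixing partitioner settings at the cost of running separate jobs.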



[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Achal Soni (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747968#comment-13747968
 ] 

Achal Soni commented on PIG-3419:
-

Here is the ReviewBoard for the new patch : 

https://reviews.apache.org/r/13752

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.12
>Reporter: Achal Soni
>Assignee: Achal Soni
>Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch, 
> stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
> updated-8-22-2013-exec-engine.patch
>
>


[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Achal Soni (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747966#comment-13747966
 ] 

Achal Soni commented on PIG-3419:
-

Hi all, 

I've re-uploaded the patch with all of the changes that Julien suggested; it 
also accounts for the comments from Mark. The test cases should be OK now 
(hopefully!), as we changed build.xml to package the META-INF folder and I 
addressed the compilePp() issue. 

[~cheolsoo] Can you give this a look when you have time, and run the test 
suite? I think everything should be fine now. Also if you can tell me what 
exactly I should be doing for indentation/in what files that'd be great. I seem 
to have some problems with the whitespace/indentation aspect so some pointers 
here would be awesome. 

Let me know if anything else seems wrong. 

Achal

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.12
>Reporter: Achal Soni
>Assignee: Achal Soni
>Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch, 
> stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
> updated-8-22-2013-exec-engine.patch
>
>


[jira] [Updated] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Achal Soni (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Achal Soni updated PIG-3419:


Attachment: (was: finalpatch.patch)



[jira] [Updated] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Achal Soni (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Achal Soni updated PIG-3419:


Attachment: updated-8-22-2013-exec-engine.patch



[jira] [Updated] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Achal Soni (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Achal Soni updated PIG-3419:


Attachment: finalpatch.patch



[jira] [Updated] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Achal Soni (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Achal Soni updated PIG-3419:


Attachment: (was: finalpatch.patch)



[jira] [Updated] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk

2013-08-22 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3168:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Thanks Rohini. Committed to trunk.

> TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
> 
>
> Key: PIG-3168
> URL: https://issues.apache.org/jira/browse/PIG-3168
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.12
>
> Attachments: PIG-3168-2.patch, PIG-3168-3.patch, PIG-3168.patch
>
>
> PIG-2994 made explain with no alias be equivalent to explain on the previous 
> alias. This breaks 
> TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge because the 
> previous alias is an auto-generated alias not a user-defined alias.
> The following fixes the test:
> {code}
>  "I = GROUP F2 BY (f7, f8);" +
>  "STORE I into 'foo4'  using BinStorage();" +
> -"explain;";
> +"explain I;";
> {code}



[jira] [Commented] (PIG-3434) Null subexpression in bincond nullifies outer tuple (or bag)

2013-08-22 Thread Mark Wagner (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747789#comment-13747789
 ] 

Mark Wagner commented on PIG-3434:
--

I was able to reproduce this. It looks like POUserFunc isn't handling 
STATUS_NULL correctly. I'll make a patch.

> Null subexpression in bincond nullifies outer tuple (or bag)
> 
>
> Key: PIG-3434
> URL: https://issues.apache.org/jira/browse/PIG-3434
> Project: Pig
>  Issue Type: Bug
>Reporter: Pavel Fedyakov
>
> According to docs, for bincond operator "If a Boolean subexpression results 
> in null value, the resulting expression is null" 
> (http://pig.apache.org/docs/r0.11.0/basic.html#nulls).
> It works as described in plain foreach..generate expression:
> {noformat}
> in = load 'in';
> out = FOREACH in GENERATE 1, ($0 > 0 ? 2 : 3);
> dump out;
> {noformat}
> in (3 lines, 2nd is empty):
> {noformat}
> 0
> 1
> {noformat}
> out:
> {noformat}
> (1,3)
> (1,)
> (1,2)
> {noformat}
> But if we wrap generated variables in tuple (or bag), we lose the whole 2nd 
> line in output:
> {noformat}
> out = FOREACH in GENERATE (1, ($0 > 0 ? 2 : 3));
> {noformat}
> out:
> {noformat}
> ((1,3))
> ()
> ((1,2))
> {noformat}
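
The expected semantics from the report can be sketched in Java as below.
This is only an illustration of null propagation, not Pig's POUserFunc or
tuple-construction code; the class and method names are hypothetical. A
null operand makes the comparison null, a null condition makes the bincond
null, and wrapping the fields in a tuple keeps the null as a field instead
of nullifying the whole tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BincondNullSketch {
    // A comparison with a null operand yields null (unknown), not false.
    static Boolean greaterThan(Integer left, int right) {
        return left == null ? null : left > right;
    }

    // bincond: a null condition makes the whole expression null.
    static Integer bincond(Boolean condition, Integer ifTrue, Integer ifFalse) {
        if (condition == null) {
            return null;
        }
        return condition ? ifTrue : ifFalse;
    }

    // Expected tuple construction: a null field stays a null field and the
    // tuple itself is never replaced by null.
    static List<Object> toTuple(Object... fields) {
        return new ArrayList<>(Arrays.asList(fields));
    }

    // Mirrors GENERATE (1, ($0 > 0 ? 2 : 3)) for a single input value.
    static List<Object> row(Integer input) {
        return toTuple(1, bincond(greaterThan(input, 0), 2, 3));
    }
}
```

With the three inputs from the report (0, empty, 1) this gives the tuples
(1,3), (1,) and (1,2), which is what the wrapped form would be expected to
contain.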



[jira] [Assigned] (PIG-3434) Null subexpression in bincond nullifies outer tuple (or bag)

2013-08-22 Thread Mark Wagner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Wagner reassigned PIG-3434:


Assignee: Mark Wagner



[jira] [Commented] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk

2013-08-22 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747745#comment-13747745
 ] 

Rohini Palaniswamy commented on PIG-3168:
-

+1



[jira] [Updated] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk

2013-08-22 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3168:
---

Attachment: PIG-3168-3.patch

I noticed that TestShortcuts is broken with my changes to explain in batch 
mode. I fixed the test in a new patch.

In fact, I didn't commit PIG-3168-2.patch at all. Since the additional changes 
in the new patch are very minor, I will go ahead and commit it.



Re: Slow Group By operator

2013-08-22 Thread Alan Gates
When data comes out of a map task, Hadoop serializes it so that it can know its 
exact size as it writes it into the output buffer.  To run it through the 
combiner it needs to deserialize it again, and then re-serialize it when it 
comes out.  So each pass through the combiner costs a 
serialization/deserialization pass, which is expensive and not worth it unless 
the data reduction is significant.  

In other words, the combiner can be slow because Java lacks a sizeof operator.
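
A rough Java sketch of that cost, using plain java.io streams rather than
Hadoop's own serialization classes (the class and method names here are
made up for illustration): the size of a record is only known once it has
been written to a buffer, and a combine step has to read the bytes back,
merge, and write them again.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class CombinerCostSketch {
    // Serializes one (key, count) record. The exact size is only known
    // after the bytes have been written.
    static byte[] serialize(String key, long count) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(key);
            out.writeLong(count);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // One combine step over two records assumed to share the same key:
    // deserialize both, add the counts, and serialize the result again.
    static byte[] combine(byte[] first, byte[] second) {
        try (DataInputStream a = new DataInputStream(new ByteArrayInputStream(first));
             DataInputStream b = new DataInputStream(new ByteArrayInputStream(second))) {
            String key = a.readUTF();
            b.readUTF(); // skip the key of the second record
            return serialize(key, a.readLong() + b.readLong());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If most keys occur only once per map task, the combine step does this
extra work without shrinking the output.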

Alan.

On Aug 22, 2013, at 4:01 AM, Benjamin Jakobus wrote:

> Hi Cheolsoo,
> 
> Thanks - I will try this now and get back to you.
> 
> Out of interest; could you explain (or point me towards resources that
> would) why the combiner would be a problem?
> 
> Also, could the fact that Pig builds an intermediary data structure (?)
> whilst Hive just performs a sort then the arithmetic operation explain the
> slowdown?
> 
> (Apologies, I'm quite new to Pig/Hive - just my guesses).
> 
> Regards,
> Benjamin
> 
> 
> On 22 August 2013 01:07, Cheolsoo Park  wrote:
> 
>> Hi Benjamin,
>> 
>> Thank you very much for sharing detailed information!
>> 
>> 1) From the runtime numbers that you provided, the mappers are very slow.
>> 
>> CPU time spent (ms)  5,081,610  168,740  5,250,350
>> CPU time spent (ms)  5,052,700  178,220  5,230,920
>> CPU time spent (ms)  5,084,430  193,480  5,277,910
>> 
>> 2) In your GROUP BY query, you have an algebraic UDF "COUNT".
>> 
>> I am wondering whether disabling combiner will help here. I have seen a lot
>> of cases where combiner actually hurt performance significantly if it
>> doesn't combine mapper outputs significantly. Briefly looking at
>> generate_data.pl in PIG-200, it looks like a lot of random keys are
>> generated. So I guess you will end up with a large number of small bags
>> rather than a small number of large bags. If that's the case, combiner will
>> only add overhead to mappers.
>> 
>> Can you try to include this "set pig.exec.nocombiner true;" and see whether
>> it helps?
>> 
>> Thanks,
>> Cheolsoo
>> 
>> 
>> 
>> 
>> 
>> 
>> On Wed, Aug 21, 2013 at 3:52 AM, Benjamin Jakobus wrote:
>> 
>>> Hi Cheolsoo,
>>> 
>>>>> What's your query like? Can you share it? Do you call any algebraic UDF
>>>>> after group by? I am wondering whether combiner matters in your test.
>>> I have been running 3 different types of queries.
>>> 
>>> The first was performed on datasets of 6 different sizes:
>>> 
>>> 
>>>   - Dataset size 1: 30,000 records (772KB)
>>>   - Dataset size 2: 300,000 records (6.4MB)
>>>   - Dataset size 3: 3,000,000 records (63MB)
>>>   - Dataset size 4: 30 million records (628MB)
>>>   - Dataset size 5: 300 million records (6.2GB)
>>>   - Dataset size 6: 3 billion records (62GB)
>>> 
>>> The datasets scale linearly, whereby the size equates to 3000 * 10^n.
>>> A seventh dataset consisting of 1,000 records (23KB) was produced to
>>> perform join operations on. Its schema is as follows:
>>>   name - string
>>>   marks - integer
>>>   gpa - float
>>> The datasets were generated using the generate_data.pl Perl script
>>> available for download from https://issues.apache.org/jira/browse/PIG-200.
>>> The results are as follows:
>>> 
>>> 
>>>               Set 1     Set 2     Set 3     Set 4     Set 5     Set 6
>>> Arithmetic    32.82     36.21     49.49     83.25    423.63   3900.78
>>> Filter 10%    32.94     34.32     44.56     66.68    295.59   2640.52
>>> Filter 90%    33.93     32.55     37.86     53.22    197.36   1657.37
>>> Group         49.43     53.34     69.84    105.12    497.61   4394.21
>>> Join          49.89     50.08     78.55    150.39   1045.34  10258.19
>>> 
>>> Averaged performance of arithmetic, join, group, order, distinct select
>>> and filter operations on six datasets using Pig. Scripts were configured
>>> as to use 8 reduce and 11 map tasks.
>>> 
>>>               Set 1     Set 2     Set 3     Set 4     Set 5     Set 6
>>> Arithmetic    32.84     37.33     72.55    300.08   2633.72  27821.19
>>> Filter 10%    32.36     53.28     59.22     209.5    1672.3  18222.19
>>> Filter 90%    31.23     32.68      36.8     69.55    331.88   3320.59
>>> Group         48.27     47.68     46.87     53.66    141.36    1233.4
>>> Join          48.54     56.86     104.6     517.5   4388.34         -
>>> Distinct      48.73     53.28     72.54    109.77         -         -
>>> 
>>> Averaged performance of arithmetic, join, group, distinct select and
>>> filter operations on six datasets using Hive. Scripts were configured as
>>> to
>>>

[jira] [Commented] (PIG-3424) Package import list should consider class name as is first even if -Dudf.import.list is passed

2013-08-22 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747603#comment-13747603
 ] 

Cheolsoo Park commented on PIG-3424:


Thanks for fixing it. +1.

> Package import list should consider class name as is first even if 
> -Dudf.import.list is passed
> --
>
> Key: PIG-3424
> URL: https://issues.apache.org/jira/browse/PIG-3424
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11.1
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.12
>
> Attachments: PIG-3424-1.patch, PIG-3424-fixtest.patch
>
>
> Currently, if -Dudf.import.list is passed, its entries are added to the 
> beginning of the list, and "" (class name as is), which is defined by 
> default, is always pushed to the end of the list. In cases where the Pig 
> deployment itself contains a predefined -Dudf.import.list, class resolution 
> tries all of those entries before trying the fully qualified class name 
> defined in LOAD as is. 
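
The resolution order the issue title asks for can be sketched as follows.
This is not the Pig implementation; ImportListSketch and its methods are
hypothetical, and the only library call relied on is Class.forName. The
empty prefix (class name as is) is tried first, and the configured package
prefixes are tried afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class ImportListSketch {
    // Builds the search order: "" (class name as is) first, then the
    // configured package prefixes in their given order.
    static List<String> searchOrder(List<String> importList) {
        List<String> order = new ArrayList<>();
        order.add("");
        for (String prefix : importList) {
            if (!prefix.isEmpty()) {
                order.add(prefix);
            }
        }
        return order;
    }

    // Tries each prefix in order and returns the first class that loads.
    static Optional<Class<?>> resolve(String name, List<String> importList) {
        for (String prefix : searchOrder(importList)) {
            try {
                return Optional.of(Class.forName(prefix + name));
            } catch (ClassNotFoundException e) {
                // Not found under this prefix; fall through to the next one.
            }
        }
        return Optional.empty();
    }
}
```

With this order a fully qualified name resolves on the first attempt,
regardless of how many prefixes the deployment predefines.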



[jira] [Updated] (PIG-3424) Package import list should consider class name as is first even if -Dudf.import.list is passed

2013-08-22 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3424:


Attachment: PIG-3424-fixtest.patch

When running the full test suite, encountered a test failure. Attaching patch 
that fixes the testcase. 



Re: Slow Group By operator

2013-08-22 Thread Cheolsoo Park
Hi Benjamin,

To answer your question, how the Hadoop combiner works is that 1) mappers
write outputs to disk and 2) combiners read them, combine and write them
again. So you're paying extra disk I/O as well as
serialization/deserialization.

This will pay off if combiners significantly reduce the intermediate
outputs that reducers need to fetch from mappers. But if reduction is not
significant, it will only slow down mappers. You can identify whether this
is really a problem by comparing the time spent by map and combine
functions in the task logs.

What I usually do is:
1) If there are many small bags, disable combiners.
2) If there are many large bags, enable combiners. Furthermore, turning on
"pig.exec.mapPartAgg" helps (see the Pig blog for details).
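
As a rough illustration of the rule of thumb above, the Java sketch below
estimates how much a combiner could shrink a sample of map output keys. It
assumes an algebraic function such as COUNT, where combining leaves one
record per distinct key. The class name, method names, and the threshold
parameter are hypothetical and not part of Pig or Hadoop.

```java
import java.util.HashSet;
import java.util.List;

final class CombinerHeuristicSketch {
    // Fraction of records left after combining: distinct keys / records.
    // Close to 1.0 means many small bags; close to 0.0 means few large bags.
    static double remainingFraction(List<String> mapOutputKeys) {
        if (mapOutputKeys.isEmpty()) {
            return 1.0;
        }
        return (double) new HashSet<>(mapOutputKeys).size() / mapOutputKeys.size();
    }

    // The cutoff is an arbitrary illustrative value supplied by the caller.
    static boolean combinerLikelyWorthwhile(List<String> mapOutputKeys,
                                            double maxRemainingFraction) {
        return remainingFraction(mapOutputKeys) <= maxRemainingFraction;
    }
}
```

Mostly unique keys give a fraction near 1.0, which is the case where the
combiner only adds overhead to the mappers.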

Thanks,
Cheolsoo



[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747564#comment-13747564
 ] 

Cheolsoo Park commented on PIG-3419:


Attached test_failures.txt.

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.12
>Reporter: Achal Soni
>Assignee: Achal Soni
>Priority: Minor
> Attachments: execengine.patch, finalpatch.patch, 
> mapreduce_execengine.patch, stats_scriptstate.patch, test_failures.txt, 
> test_suite.patch


[jira] [Updated] (PIG-3419) Pluggable Execution Engine

2013-08-22 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3419:
---

Attachment: test_failures.txt

I see 295 failures with "ant clean test". The number looks scary, but all the 
failures seem to boil down to two reasons:
# As Mark already mentioned, META-INF is not executed. So many test cases fail 
with the following error:
{code}
error msg: Unknown exec type: local
{code}
# The removal of PigServer.compilePp() makes many test cases fail with the 
following error:
{code}
java.lang.NoSuchMethodException: org.apache.pig.PigServer.compilePp()
{code}

As for indentation, can you please use 4 spaces instead of tabs? Tabs make 
indentation look funny in several places. A couple of awk/sed commands should 
do the job. 

Otherwise, looks great to me too. Thank you Achal for the wonderful work!

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.12
>Reporter: Achal Soni
>Assignee: Achal Soni
>Priority: Minor
> Attachments: execengine.patch, finalpatch.patch, 
> mapreduce_execengine.patch, stats_scriptstate.patch, test_failures.txt, 
> test_suite.patch
>
>
> In an effort to adapt Pig to work using Apache Tez 
> (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for 
> a cleaner ExecutionEngine abstraction than existed before. The changes are 
> not that major as Pig was already relatively abstracted out between the 
> frontend and backend. The changes in the attached commit are essentially the 
> barebones changes -- I tried to not change the structure of Pig's different 
> components too much. I think it will be interesting to see in the future how 
> we can refactor more areas of Pig to really honor this abstraction between 
> the frontend and backend. 
> Some of the changes were to reinstate an ExecutionEngine interface to tie 
> together the frontend and backend, to change Pig to delegate to the EE 
> when necessary, and to create an MRExecutionEngine that implements 
> this interface. Other work included changing ExecType to cycle through the 
> ExecutionEngines on the classpath and select the appropriate one (this is 
> done using Java ServiceLoader, the same way MapReduce chooses the 
> framework to use between local and distributed mode). Also, I tried to make 
> ScriptState, JobStats, and PigStats as abstract as possible in their current 
> state. I think in the future some work will need to be done here to perhaps 
> re-evaluate the usage of ScriptState and the responsibilities of the 
> different statistics classes. I haven't touched the PPNL, but I think more 
> abstraction is needed here, perhaps in a separate patch. 

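A rough sketch of the lookup described above, written in Python rather than with Java's ServiceLoader. All class and function names here (ExecType, LocalExecType, select_exec_type) are hypothetical and are not Pig's actual API; only the error text is taken from the test failures reported in this thread. The idea is that the selector walks whatever providers were discovered and fails with "Unknown exec type" when none of them accepts the requested name, which is what one would expect to see if the provider registration is missing from the build.

```python
class ExecType:
    """Hypothetical provider interface: one instance per execution engine."""

    def accepts(self, name):
        raise NotImplementedError


class LocalExecType(ExecType):
    """Hypothetical provider that handles the 'local' exec type."""

    def accepts(self, name):
        return name.lower() == "local"


def select_exec_type(name, providers):
    """Return the first discovered provider that accepts the requested name."""
    for provider in providers:
        if provider.accepts(name):
            return provider
    # No provider registered for this name, e.g. because discovery found nothing.
    raise ValueError(f"Unknown exec type: {name}")
```

With an empty provider list, select_exec_type("local", []) raises the same message that the failing test cases report.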


Re: can't parse the values using XML loader

2013-08-22 Thread Muni mahesh
Hi,

Is there any REGEX UDF available for this sort of problem? Thanks in advance.

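A rough sketch of the distinction behind this question, using Python's re module as a stand-in for java.util.regex: re.fullmatch behaves like Matcher.matches() (the whole input must match the pattern), while re.search behaves like Matcher.find() (any substring may match). The TITLE tag and the sample record are assumptions, since the tags of the original input did not survive in this thread.

```python
import re

# Hypothetical record; the real tag names are not visible in the thread.
RECORD = "<CD>\n  <TITLE>hadoop developer</TITLE>\n</CD>"
PATTERN = re.compile(r"<TITLE>(.*)</TITLE>")


def extract_whole(pattern, text):
    """matches()-style: succeeds only if the pattern spans the entire text."""
    m = pattern.fullmatch(text)
    return m.groups() if m else None


def extract_anywhere(pattern, text):
    """find()-style: succeeds if the pattern occurs anywhere in the text."""
    m = pattern.search(text)
    return m.groups() if m else None
```

Here extract_whole finds nothing because the record has markup around the TITLE element, while extract_anywhere returns the captured title. A find-style UDF would behave like the second function.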

On Wed, Aug 21, 2013 at 10:36 PM, Amit  wrote:

> Hello,
> Moreover, REGEX_EXTRACT_ALL uses Matcher.matches(), which tries to match the
> entire input string against the pattern, not just part of it. You may want to
> write your own REGEX UDF (if you are not going the route suggested by Will)
> that uses Matcher.find() instead of Matcher.matches().
>
>
>
> Regards,
> Amit
>
>
>
> 
>  From: "william.dowl...@thomsonreuters.com" <
> william.dowl...@thomsonreuters.com>
> To: u...@pig.apache.org; dev@pig.apache.org
> Sent: Wednesday, August 21, 2013 12:19 PM
> Subject: RE: can't parse the values using XML loader
>
>
> Part of the problem might be that the regexp has
>
> (.*)
>
> but you need
> (.*)
>
> Using regexps to parse XML is awfully brittle. An alternative is to use a
> UDF that calls out to an XML parser. I use ElementTree from python UDFs.
>
> Will Dowling
>
> 
> From: Muni mahesh [mahesh87.had...@gmail.com]
> Sent: Wednesday, August 21, 2013 6:58 AM
> To: dev@pig.apache.org; u...@pig.apache.org
> Subject: can't parse the values using XML loader
>
> *Input file :*
>
> 
> 
> hadoop developer
> ajay
> india
> ITC
> 10.90
> 2013
> 
> 
>
> 
> *Pig Script:*
>
> register /usr/lib/pig/piggybank.jar;
>
> A = load '/home/sudeep/Desktop/CATALOG.xml' using
> org.apache.pig.piggybank.storage.XMLLoader('CATALOG') as (x:
> chararray);
>
>
> B = foreach A GENERATE
>
> FLATTEN(REGEX_EXTRACT_ALL(x,'\\n*\\n(.*)\\n\\s*(.*)\\n\\s*(.*)\\n\\s*(.*)\\n\\s*(.*)\\n\\s*(.*)\\n\\s*\\n*'))
> as (id: int, name:chararray);
>
>
> *Output Expected :*
>
> (hadoop, ajay, india, ITC, 10.90, 2013)
>
> *Issue:*
>
> But the output I am getting is:
>
> ()
>
> I think it is not able to parse the values between the tags.
>

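A minimal sketch of the parser-based alternative Will mentions, using Python's xml.etree.ElementTree. The element names (CATALOG, CD, TITLE and so on) are assumptions, because the markup of the input file was lost in this thread; only the field values come from the original message. This is plain Python, not a registered Pig UDF.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup around the values quoted in the thread.
SAMPLE = """<CATALOG>
  <CD>
    <TITLE>hadoop developer</TITLE>
    <ARTIST>ajay</ARTIST>
    <COUNTRY>india</COUNTRY>
    <COMPANY>ITC</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>2013</YEAR>
  </CD>
</CATALOG>"""


def parse_catalog(xml_text):
    """Return one tuple of field texts per child record of the root element."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root:
        # Each child element of a record contributes one field, in document order.
        rows.append(tuple((field.text or "").strip() for field in record))
    return rows
```

Because the parser works on the element structure, whitespace and line breaks between tags do not affect the result, which is the brittleness the regexp approach runs into.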

[jira] [Commented] (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

2013-08-22 Thread Ido Hadanny (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747531#comment-13747531
 ] 

Ido Hadanny commented on PIG-1420:
--

So, is this fixed or not? I couldn't get it to work in 0.10.0, and I can't 
understand where I can find the new patch from 2013... 

> Make CONCAT act on all fields of a tuple, instead of just the first two 
> fields of a tuple
> -
>
> Key: PIG-1420
> URL: https://issues.apache.org/jira/browse/PIG-1420
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Russell Jurney
>Assignee: Russell Jurney
>  Labels: five, high
> Fix For: 0.8.0
>
> Attachments: addconcat2.patch, PIG-1420.2.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> org.apache.pig.builtin.CONCAT (which acts on DataByteArrays internally) and 
> org.apache.pig.builtin.StringConcat (which acts on Strings internally), both 
> act on the first two fields of a tuple.  This results in ugly nested CONCAT 
> calls like:
> CONCAT(CONCAT(A, ' '), B)
> The more desirable form is:
> CONCAT(A, ' ', B)
> This change will be backwards compatible, provided that no one was relying on 
> the fact that CONCAT ignores fields after the first two in a tuple.  This 
> seems a reasonable assumption to make, or at least a small break in 
> compatibility for a sizable improvement.

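The difference between the two forms can be sketched in plain Python. This is only a model of the behaviour described in the issue, not Pig's CONCAT implementation, and the function names are made up for illustration.

```python
def concat_first_two(fields):
    """Models the old behaviour: only the first two fields are used."""
    return fields[0] + fields[1]


def concat_all(fields):
    """Models the requested behaviour: every field is concatenated in order."""
    return "".join(fields)


# Old style needs nesting to join three values.
nested = concat_first_two([concat_first_two(["A", " "]), "B"])

# Requested style takes all three at once.
flat = concat_all(["A", " ", "B"])
```

Both nested and flat come out as "A B", while concat_first_two(["A", " ", "B"]) silently drops the third field, which is the compatibility caveat the description mentions.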


Re: Slow Group By operator

2013-08-22 Thread Benjamin Jakobus
Hi Cheolsoo,

Thanks - I will try this now and get back to you.

Out of interest, could you explain (or point me towards resources that
would) why the combiner would be a problem?

Also, could the fact that Pig builds an intermediary data structure (?)
whilst Hive just performs a sort and then the arithmetic operation explain
the slowdown?

(Apologies, I'm quite new to Pig/Hive - just my guesses).

Regards,
Benjamin

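A small sketch of the combiner argument made in the quoted reply below, in plain Python. It only models a map-side combine for COUNT on a single mapper's input; the function names and the sample keys are invented for illustration and no Pig or Hadoop code is involved.

```python
from collections import Counter


def combine_counts(keys):
    """Map-side combine for COUNT: one (key, partial count) pair per distinct key."""
    return Counter(keys)


def records_emitted_ratio(keys):
    """Fraction of mapper records that still leave the mapper after combining."""
    return len(combine_counts(keys)) / len(keys)


# Few distinct keys: the combiner collapses 1000 records into 2.
few_keys = ["a", "b"] * 500

# Almost every key distinct: the combiner emits as many records as it read.
many_keys = [f"key{i}" for i in range(1000)]
```

For few_keys the ratio is 0.002, so the combiner saves nearly all shuffle traffic. For many_keys the ratio is 1.0, so the combine step does its work without shrinking the output, which is the overhead-only case described for randomly generated keys.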

On 22 August 2013 01:07, Cheolsoo Park  wrote:

> Hi Benjamin,
>
> Thank you very much for sharing detailed information!
>
> 1) From the runtime numbers that you provided, the mappers are very slow.
>
> Counter                    Map     Reduce      Total
> CPU time spent (ms)  5,081,610    168,740  5,250,350
> CPU time spent (ms)  5,052,700    178,220  5,230,920
> CPU time spent (ms)  5,084,430    193,480  5,277,910
>
> 2) In your GROUP BY query, you have an algebraic UDF "COUNT".
>
> I am wondering whether disabling the combiner will help here. I have seen a
> lot of cases where the combiner actually hurts performance if it doesn't
> reduce the mapper output significantly. Briefly looking at
> generate_data.pl in PIG-200, it looks like a lot of random keys are
> generated. So I guess you will end up with a large number of small bags
> rather than a small number of large bags. If that's the case, the combiner
> will only add overhead to the mappers.
>
> Can you try to include this "set pig.exec.nocombiner true;" and see whether
> it helps?
>
> Thanks,
> Cheolsoo
>
>
>
>
>
>
> On Wed, Aug 21, 2013 at 3:52 AM, Benjamin Jakobus wrote:
>
> > Hi Cheolsoo,
> >
> > >>What's your query like? Can you share it? Do you call any algebraic UDF
> > >> after group by? I am wondering whether combiner matters in your test.
> > I have been running 3 different types of queries.
> >
> > The first was performed on datasets of 6 different sizes:
> >
> >
> >- Dataset size 1: 30,000 records (772KB)
> >- Dataset size 2: 300,000 records (6.4MB)
> >- Dataset size 3: 3,000,000 records (63MB)
> >- Dataset size 4: 30 million records (628MB)
> >- Dataset size 5: 300 million records (6.2GB)
> >- Dataset size 6: 3 billion records (62GB)
> >
> > The datasets scale linearly, whereby the number of records equates to
> > 3000 * 10^n for n = 1 to 6.
> > A seventh dataset consisting of 1,000 records (23KB) was produced to
> > perform join operations on. Its schema is as follows:
> > name - string
> > marks - integer
> > gpa - float
> > The data was generated using the generate_data.pl Perl script available
> > for download from https://issues.apache.org/jira/browse/PIG-200. The
> > results are as follows:
> >
> >
> >                 Set 1     Set 2     Set 3     Set 4     Set 5     Set 6
> > Arithmetic      32.82     36.21     49.49     83.25    423.63   3900.78
> > Filter 10%      32.94     34.32     44.56     66.68    295.59   2640.52
> > Filter 90%      33.93     32.55     37.86     53.22    197.36   1657.37
> > Group           49.43     53.34     69.84    105.12    497.61   4394.21
> > Join            49.89     50.08     78.55    150.39   1045.34  10258.19
> >
> > *Averaged performance of arithmetic, join, group, order, distinct select
> > and filter operations on six datasets using Pig. Scripts were configured
> > as to use 8 reduce and 11 map tasks.*
> >
> >
> >
> >                 Set 1     Set 2     Set 3     Set 4     Set 5     Set 6
> > Arithmetic      32.84     37.33     72.55    300.08   2633.72  27821.19
> > Filter 10%      32.36     53.28     59.22     209.5    1672.3  18222.19
> > Filter 90%      31.23     32.68      36.8     69.55    331.88   3320.59
> > Group           48.27     47.68     46.87     53.66    141.36    1233.4
> > Join            48.54     56.86     104.6     517.5   4388.34         -
> > Distinct        48.73     53.28     72.54    109.77         -         -
> >
> > *Averaged performance of arithmetic, join, group, distinct select and
> > filter operations on six datasets using Hive. Scripts were configured as
> > to use 8 reduce and 11 map tasks.*
> >
> > (If you want to see the standard deviation, let me know).
> >
> > So, to summarize the results: Pig outperforms Hive, with the exception of
> > using *Group By*.
> >
> > The Pig scripts used for this benchmark are as follows:
> > *Arithmetic*
> > -- Generate with basic arithmetic
> > A = load '$input/dataset_3' using PigStorage('\t') as (name, age,
> > gpa) PARALLEL $reducers;
> > B = foreach A generate age * gpa + 3, age/gpa - 1.5 PARALLEL $reducers;
> > store B into '$output/dataset_3_projection' using PigStorage()
> > PARALLEL $reducers;
> >
> > *
> > *

[jira] [Updated] (PIG-3434) Null subexpression in bincond nullifies outer tuple (or bag)

2013-08-22 Thread Pavel Fedyakov (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Fedyakov updated PIG-3434:


Description: 
According to docs, for bincond operator "If a Boolean subexpression results in 
null value, the resulting expression is null" 
(http://pig.apache.org/docs/r0.11.0/basic.html#nulls).

It works as described in plain foreach..generate expression:

{noformat}
in = load 'in';
out = FOREACH in GENERATE 1, ($0 > 0 ? 2 : 3);
dump out;
{noformat}

in (3 lines, 2nd is empty):
{noformat}
0

1
{noformat}

out:
{noformat}
(1,3)
(1,)
(1,2)
{noformat}

But if we wrap generated variables in tuple (or bag), we lose the whole 2nd 
line in output:

{noformat}
out = FOREACH in GENERATE (1, ($0 > 0 ? 2 : 3));
{noformat}

out:
{noformat}
((1,3))
()
((1,2))
{noformat}


  was:
According to docs, for bincond operator "If a Boolean subexpression results in 
null value, the resulting expression is null" 
(http://pig.apache.org/docs/r0.11.0/basic.html#nulls).

It works as described in plain foreach..generate expression:

{{in = load 'in';}}
{{out = FOREACH in GENERATE 1, ($0 > 0 ? 2 : 3);}}
{{dump out;}}


in (3 lines, 2nd is empty):
{{0}}

{{1}}

out:
{{(1,3)}}
{{(1,)}}
{{(1,2)}}

But if we wrap generated variables in tuple (or bag), we lose the whole 2nd 
line in output:

{{out = FOREACH in GENERATE (1, ($0 > 0 ? 2 : 3));}}

out:
{{((1,3))}}
{{()}}
{{((1,2))}}



> Null subexpression in bincond nullifies outer tuple (or bag)
> 
>
> Key: PIG-3434
> URL: https://issues.apache.org/jira/browse/PIG-3434
> Project: Pig
>  Issue Type: Bug
>Reporter: Pavel Fedyakov
>
> According to docs, for bincond operator "If a Boolean subexpression results 
> in null value, the resulting expression is null" 
> (http://pig.apache.org/docs/r0.11.0/basic.html#nulls).
> It works as described in plain foreach..generate expression:
> {noformat}
> in = load 'in';
> out = FOREACH in GENERATE 1, ($0 > 0 ? 2 : 3);
> dump out;
> {noformat}
> in (3 lines, 2nd is empty):
> {noformat}
> 0
>
> 1
> {noformat}
> out:
> {noformat}
> (1,3)
> (1,)
> (1,2)
> {noformat}
> But if we wrap generated variables in tuple (or bag), we lose the whole 2nd 
> line in output:
> {noformat}
> out = FOREACH in GENERATE (1, ($0 > 0 ? 2 : 3));
> {noformat}
> out:
> {noformat}
> ((1,3))
> ()
> ((1,2))
> {noformat}

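The expected behaviour from the report can be modelled in a few lines of Python, with None standing in for Pig's null. This is only a sketch of the semantics quoted from the docs, not of POUserFunc or any other Pig internals, and the function names are invented for illustration.

```python
def greater_than(value, threshold):
    """Comparison with null propagation: a null operand gives a null result."""
    return None if value is None else value > threshold


def bincond(cond, if_true, if_false):
    """Bincond with null propagation: a null condition gives a null result."""
    if cond is None:
        return None
    return if_true if cond else if_false


def generate_flat(x):
    """Models GENERATE 1, ($0 > 0 ? 2 : 3)."""
    return (1, bincond(greater_than(x, 0), 2, 3))


def generate_wrapped(x):
    """Models GENERATE (1, ($0 > 0 ? 2 : 3)): the null should stay inside the inner tuple."""
    return (generate_flat(x),)


rows = [0, None, 1]
```

Under this model the second input row yields ((1, None),) in the wrapped case, that is, the equivalent of ((1,)) in Pig's output notation, rather than the empty () shown in the report.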