[jira] Subscription: PIG patch available

2013-08-20 Thread jira
Issue Subscription
Filter: PIG patch available (18 issues)

Subscriber: pigdaily

Key Summary
PIG-3431 Return more information for parsing related exceptions.
https://issues.apache.org/jira/browse/PIG-3431
PIG-3430 Add xml format for explaining MapReduce Plan.
https://issues.apache.org/jira/browse/PIG-3430
PIG-3426 Add support for removing s3 files
https://issues.apache.org/jira/browse/PIG-3426
PIG-3419 Pluggable Execution Engine
https://issues.apache.org/jira/browse/PIG-3419
PIG-3379 Alias reuse in nested foreach causes PIG script to fail
https://issues.apache.org/jira/browse/PIG-3379
PIG-3374 CASE and IN fail when expression includes dereferencing operator
https://issues.apache.org/jira/browse/PIG-3374
PIG-3349 Document ToString(Datetime, String) UDF
https://issues.apache.org/jira/browse/PIG-3349
PIG-3346 New property that controls the number of combined splits
https://issues.apache.org/jira/browse/PIG-3346
PIG- Fix remaining Windows core unit test failures
https://issues.apache.org/jira/browse/PIG-
PIG-3325 Adding a tuple to a bag is slow
https://issues.apache.org/jira/browse/PIG-3325
PIG-3295 Casting from bytearray failing after Union (even when each field is from a single Loader)
https://issues.apache.org/jira/browse/PIG-3295
PIG-3292 Logical plan invalid state: duplicate uid in schema during self-join to get cross product
https://issues.apache.org/jira/browse/PIG-3292
PIG-3257 Add unique identifier UDF
https://issues.apache.org/jira/browse/PIG-3257
PIG-3199 Expose LogicalPlan via PigServer API
https://issues.apache.org/jira/browse/PIG-3199
PIG-3168 TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
https://issues.apache.org/jira/browse/PIG-3168
PIG-3088 Add a builtin udf which removes prefixes
https://issues.apache.org/jira/browse/PIG-3088
PIG-3048 Add mapreduce workflow information to job configuration
https://issues.apache.org/jira/browse/PIG-3048
PIG-3021 Split results missing records when there is null values in the column comparison
https://issues.apache.org/jira/browse/PIG-3021

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Updated] (PIG-3385) DISTINCT no longer uses custom partitioner

2013-08-20 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3385:
--

Attachment: pig-3385-v01.patch

I wonder whether the custom partitioner ever worked for DISTINCT.

It looks like the partitioner info is passed through POGlobalRearrange, but 
"distinct" doesn't use it.

I'm uploading an initial patch that just passes that info through PODistinct.

This is my first time touching the backend code, so I'd appreciate it if someone 
could take a look. I'll upload a test case next.

> DISTINCT no longer uses custom partitioner
> --
>
> Key: PIG-3385
> URL: https://issues.apache.org/jira/browse/PIG-3385
> Project: Pig
>  Issue Type: Bug
>  Components: documentation
>Reporter: Will Oberman
>Priority: Minor
> Attachments: pig-3385-v01.patch
>
>
> From u...@pig.apache.org:  It looks like an optimization was put in to make 
> distinct use a special partitioner which prevents the user from setting the 
> partitioner.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk

2013-08-20 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745487#comment-13745487
 ] 

Rohini Palaniswamy commented on PIG-3168:
-

+1

> TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
> 
>
> Key: PIG-3168
> URL: https://issues.apache.org/jira/browse/PIG-3168
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.12
>
> Attachments: PIG-3168-2.patch, PIG-3168.patch
>
>
> PIG-2994 made explain with no alias be equivalent to explain on the previous 
> alias. This breaks 
> TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge because the 
> previous alias is an auto-generated alias not a user-defined alias.
> The following fixes the test:
> {code}
>  "I = GROUP F2 BY (f7, f8);" +
>  "STORE I into 'foo4'  using BinStorage();" +
> -"explain;";
> +"explain I;";
> {code}



[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-20 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745448#comment-13745448
 ] 

Cheolsoo Park commented on PIG-3419:


[~julienledem],
{quote}
Do we really throw Exception?
{quote}
No, we don't throw Exception to the end user. But currently, PigServer catches 
them all in a single catch block and sorts them out using instanceof calls (see 
below). We should probably make ExecutionEngine throw FEE, EE, and IOE and 
replace the instanceof calls with catch blocks in PigServer.
{code}
try {
    stats = pigContext.getExecutionEngine().launchPig(lp, jobName, pigContext);
} catch (Exception e) {
    // There are a lot of exceptions thrown by the launcher.  If this
    // is an ExecException, just let it through.  Else wrap it.
    if (e instanceof ExecException) {
        throw (ExecException) e;
    } else if (e instanceof FrontendException) {
        throw (FrontendException) e;
    } else {
        int errCode = 2043;
        String msg = "Unexpected error during execution.";
        throw new ExecException(msg, errCode, PigException.BUG, e);
    }
}
{code}
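
A minimal sketch of the catch-block alternative proposed in this comment. The exception classes and the SketchLauncher interface are simplified stand-ins invented for illustration, not Pig's real ExecException, FrontendException, or ExecutionEngine; only the control flow is the point. It assumes the engine declares just three checked exceptions, so known types pass through and anything else is wrapped.

```java
import java.io.IOException;

// Simplified stand-ins for Pig's exception types (hypothetical, for illustration only).
class SketchFrontendException extends Exception {
    SketchFrontendException(String msg) { super(msg); }
}

class SketchExecException extends Exception {
    final int errorCode;
    SketchExecException(String msg, int errorCode, Throwable cause) {
        super(msg, cause);
        this.errorCode = errorCode;
    }
}

// Hypothetical engine interface that declares only the three checked exceptions.
interface SketchLauncher {
    String launch() throws SketchFrontendException, SketchExecException, IOException;
}

class CatchBlockSketch {
    // Known exceptions pass through unchanged; anything unexpected is wrapped.
    static String run(SketchLauncher launcher)
            throws SketchFrontendException, SketchExecException {
        try {
            return launcher.launch();
        } catch (SketchExecException | SketchFrontendException e) {
            throw e; // let these through, as the instanceof branches did
        } catch (IOException | RuntimeException e) {
            // 2043 mirrors the error code used in the snippet quoted in this comment.
            throw new SketchExecException("Unexpected error during execution.", 2043, e);
        }
    }
}
```

Compared with the instanceof chain, the compiler now checks that every declared exception type is handled, and the casts disappear.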

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.12
>Reporter: Achal Soni
>Assignee: Achal Soni
>Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch, 
> stats_scriptstate.patch, test_suite.patch
>
>
> In an effort to adapt Pig to work using Apache Tez 
> (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for 
> a cleaner ExecutionEngine abstraction than existed before. The changes are 
> not that major as Pig was already relatively abstracted out between the 
> frontend and backend. The changes in the attached commit are essentially the 
> barebones changes -- I tried to not change the structure of Pig's different 
> components too much. I think it will be interesting to see in the future how 
> we can refactor more areas of Pig to really honor this abstraction between 
> the frontend and backend. 
> Some of the changes were to reinstate an ExecutionEngine interface to tie 
> together the front end and backend, to make Pig delegate to the EE when 
> necessary, and to create an MRExecutionEngine that implements this 
> interface. Other work included changing ExecType to cycle through the 
> ExecutionEngines on the classpath and select the appropriate one (this is 
> done using Java ServiceLoader, exactly as MapReduce does when choosing the 
> framework to use between local and distributed mode). I also tried to make 
> ScriptState, JobStats, and PigStats as abstract as possible in their current 
> state. I think in the future some work will need to be done here to perhaps 
> re-evaluate the usage of ScriptState and the responsibilities of the 
> different statistics classes. I haven't touched the PPNL, but I think more 
> abstraction is needed here, perhaps in a separate patch. 
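
The ServiceLoader-based selection described in the issue text can be sketched roughly as follows. EngineProvider, its accepts() method, and the two provider classes are hypothetical names chosen for illustration; the actual interfaces in the attached patches may look different. In a real deployment the candidates would be registered under META-INF/services and discovered on the classpath.

```java
import java.util.ServiceLoader;

// Hypothetical provider interface; the real ExecType/ExecutionEngine contracts may differ.
interface EngineProvider {
    boolean accepts(String execTypeName);
    String name();
}

class LocalProvider implements EngineProvider {
    public boolean accepts(String n) { return "local".equalsIgnoreCase(n); }
    public String name() { return "local"; }
}

class MapReduceProvider implements EngineProvider {
    public boolean accepts(String n) { return "mapreduce".equalsIgnoreCase(n); }
    public String name() { return "mapreduce"; }
}

class EngineSelector {
    // Returns the first provider that accepts the requested name, or null if none does.
    static EngineProvider select(Iterable<EngineProvider> candidates, String requested) {
        for (EngineProvider p : candidates) {
            if (p.accepts(requested)) {
                return p;
            }
        }
        return null;
    }

    // ServiceLoader is itself an Iterable, so classpath discovery reuses select().
    // Without META-INF/services registrations this finds no providers.
    static EngineProvider selectFromClasspath(String requested) {
        return select(ServiceLoader.load(EngineProvider.class), requested);
    }
}
```

Keeping the selection logic separate from the discovery step makes it possible to exercise it with a plain list of providers in a unit test.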



[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-20 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745398#comment-13745398
 ] 

Julien Le Dem commented on PIG-3419:


[~cheolsoo] 
1. Do we really throw Exception? If yes, then let's just throw that. If not, 
then let's instead declare FrontendException, ExecException, and IOException, 
i.e. let's remove the exceptions that are already covered by the highest 
exception level.
2. Agreed. I would expect the execution engine to handle the 
Properties internally and the signature of this method to be:
{noformat}
public void setProperty(String property, String value);
{noformat}



[jira] [Updated] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk

2013-08-20 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3168:
---

Status: Patch Available  (was: Reopened)

TestMultiQueryBasic passes.



[jira] [Updated] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk

2013-08-20 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3168:
---

Attachment: (was: PIG-3618-2.patch)



[jira] [Updated] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk

2013-08-20 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3168:
---

Attachment: PIG-3168-2.patch



[jira] [Updated] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk

2013-08-20 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3168:
---

Attachment: PIG-3618-2.patch

So here is what it does now:
* In interactive mode, explain with no alias == explain on the last relation.
* In batch mode, explain with no alias == explain on the entire script.

Let me know if this isn't acceptable.
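
The two rules in this comment can be summarized in a small sketch. The class, method, and sentinel names are hypothetical and are not taken from the patch; the sketch only encodes the decision described here.

```java
class ExplainTargetSketch {
    // Marker meaning "explain the entire script" (hypothetical sentinel).
    static final String WHOLE_SCRIPT = "<whole script>";

    // alias: what the user typed after "explain", or null if nothing was given.
    // lastAlias: the most recently defined relation in the session.
    static String resolve(String alias, String lastAlias, boolean interactive) {
        if (alias != null) {
            return alias; // an explicit alias always wins
        }
        // No alias: last relation in interactive mode, whole script in batch mode.
        return interactive ? lastAlias : WHOLE_SCRIPT;
    }
}
```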

> TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
> 
>
> Key: PIG-3168
> URL: https://issues.apache.org/jira/browse/PIG-3168
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.12
>
> Attachments: PIG-3168.patch, PIG-3618-2.patch
>
>
> PIG-2994 made explain with no alias be equivalent to explain on the previous 
> alias. This breaks 
> TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge because the 
> previous alias is an auto-generated alias not a user-defined alias.
> The following fixes the test:
> {code}
>  "I = GROUP F2 BY (f7, f8);" +
>  "STORE I into 'foo4'  using BinStorage();" +
> -"explain;";
> +"explain I;";
> {code}



[jira] [Reopened] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk

2013-08-20 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park reopened PIG-3168:



Thanks Rohini. I will post a patch that reverts it to the old behavior in batch 
mode.

> TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
> 
>
> Key: PIG-3168
> URL: https://issues.apache.org/jira/browse/PIG-3168
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.12
>
> Attachments: PIG-3168.patch
>
>
> PIG-2994 made explain with no alias be equivalent to explain on the previous 
> alias. This breaks 
> TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge because the 
> previous alias is an auto-generated alias not a user-defined alias.
> The following fixes the test:
> {code}
>  "I = GROUP F2 BY (f7, f8);" +
>  "STORE I into 'foo4'  using BinStorage();" +
> -"explain;";
> +"explain I;";
> {code}



[jira] [Resolved] (PIG-3432) typo in log message in SchemaTupleFrontend

2013-08-20 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park resolved PIG-3432.


   Resolution: Fixed
Fix Version/s: 0.12
 Assignee: oleksii iepishkin

Committed to trunk. Thank you Oleksii!

> typo in log message in SchemaTupleFrontend
> --
>
> Key: PIG-3432
> URL: https://issues.apache.org/jira/browse/PIG-3432
> Project: Pig
>  Issue Type: Bug
>Reporter: oleksii iepishkin
>Assignee: oleksii iepishkin
> Fix For: 0.12
>
> Attachments: PIG-3432.patch
>
>
> https://github.com/apache/pig/pull/11.patch



Re: Slow Group By operator

2013-08-20 Thread Cheolsoo Park
Hi Benjamin,

Can you describe which step of group by is slow? Mapper side or reducer
side?

What's your query like? Can you share it? Do you call any algebraic UDF
after group by? I am wondering whether combiner matters in your test.

Thanks,
Cheolsoo




On Tue, Aug 20, 2013 at 2:27 AM, Benjamin Jakobus wrote:

> Hi all,
>
> After benchmarking Hive and Pig, I found that the Group By operator in Pig
> is drastically slower than Hive's. I was wondering whether anybody has
> experienced the same? And whether people may have any tips for improving
> the performance of this operation? (Adding a DISTINCT as suggested by an
> earlier post on here doesn't help. I am currently re-running the benchmark
> with LZO compression enabled).
>
> Regards,
> Ben
>


[jira] [Commented] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk

2013-08-20 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745240#comment-13745240
 ] 

Rohini Palaniswamy commented on PIG-3168:
-

bq. PIG-2994 made explain with no alias be equivalent to explain on the 
previous alias. 
  Shouldn't we revert explain with no alias to the older behavior of 
explaining the whole script instead of fixing the test? The change kind of 
breaks backward compatibility.



[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-20 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745227#comment-13745227
 ] 

Cheolsoo Park commented on PIG-3419:


I have two more comments on ExecutionEngine and MRExecutionEngine:
# Can you simplify the checked exceptions in the ExecutionEngine interface? For 
example,
From:
{code}
public PigStats launchPig(LogicalPlan lp, String grpName, PigContext pc)
        throws PlanException, VisitorException, IOException, ExecException,
               JobCreationException, FrontendException, Exception;
{code}
To:
{code}
public PigStats launchPig(LogicalPlan lp, String grpName, PigContext pc)
        throws Exception;
{code}
It looks like there's no point in throwing them again in ExecutionEngine because 
they will be caught as Exception in PigServer anyway. If needed, we should take 
specific actions per exception in ExecutionEngine.
# As for the setProperty method in ExecutionEngine, do we need to pass a 
Properties object? Can we construct a Properties object with the given key/value 
pair and call recomputeProperties() internally?
{code}
public void setProperty(Properties properties, String property, String value);
{code}
Also, as for the setProperty method in MRExecutionEngine, it looks like it's 
mostly a duplicate of recomputeProperties(). Can you just reuse 
recomputeProperties()?
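
A rough sketch of the narrower signature discussed here, in which the engine owns its Properties and recomputes after each change. PropertyEngineSketch, getRecomputeCount(), and the body of recomputeProperties() are placeholders for illustration and say nothing about what MRExecutionEngine actually does.

```java
import java.util.Properties;

// Hypothetical engine skeleton; not Pig's actual MRExecutionEngine.
class PropertyEngineSketch {
    private final Properties properties = new Properties();
    private int recomputeCount = 0;

    // Callers pass only the key/value pair; the engine owns the Properties object.
    public void setProperty(String property, String value) {
        properties.setProperty(property, value);
        recomputeProperties();
    }

    // Stand-in for whatever re-derivation the real engine performs after a change.
    private void recomputeProperties() {
        recomputeCount++;
    }

    public String getProperty(String property) {
        return properties.getProperty(property);
    }

    // Exposed only so the sketch can show that recomputation ran.
    public int getRecomputeCount() {
        return recomputeCount;
    }
}
```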

Julien said you're working on a new patch. It would be nice if you could 
incorporate these (if you agree with me, of course). Thanks a lot!



[jira] [Updated] (PIG-3432) typo in log message in SchemaTupleFrontend

2013-08-20 Thread oleksii iepishkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

oleksii iepishkin updated PIG-3432:
---

Attachment: PIG-3432.patch

> typo in log message in SchemaTupleFrontend
> --
>
> Key: PIG-3432
> URL: https://issues.apache.org/jira/browse/PIG-3432
> Project: Pig
>  Issue Type: Bug
>Reporter: oleksii iepishkin
> Attachments: PIG-3432.patch
>
>
> https://github.com/apache/pig/pull/11.patch



[jira] [Commented] (PIG-3432) typo in log message in SchemaTupleFrontend

2013-08-20 Thread oleksii iepishkin (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745183#comment-13745183
 ] 

oleksii iepishkin commented on PIG-3432:


I've attached the patch to this ticket.

Just in case, here is how the patch was created:
{code}
git clone g...@github.com:apache/pig.git
cd pig
git checkout trunk

# merge pull request
curl https://github.com/apache/pig/pull/11.patch | git am

#create patch file for apache (I wish it was easier for a simple typo fix)
git reset HEAD~
git diff --no-prefix > PIG-3432.patch
{code}



[jira] [Commented] (PIG-3432) typo in log message in SchemaTupleFrontend

2013-08-20 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745112#comment-13745112
 ] 

Cheolsoo Park commented on PIG-3432:


[~epishkin], thank you for the contribution. Unfortunately, we cannot merge your 
pull request on GitHub since it's just a read-only mirror. Do you mind uploading 
your patch to this jira?

Please see here:
https://cwiki.apache.org/confluence/display/PIG/HowToContribute




> typo in log message in SchemaTupleFrontend
> --
>
> Key: PIG-3432
> URL: https://issues.apache.org/jira/browse/PIG-3432
> Project: Pig
>  Issue Type: Bug
>Reporter: oleksii iepishkin
>
> https://github.com/apache/pig/pull/11.patch



Slow Group By operator

2013-08-20 Thread Benjamin Jakobus
Hi all,

After benchmarking Hive and Pig, I found that the Group By operator in Pig
is drastically slower than Hive's. I was wondering whether anybody has
experienced the same? And whether people may have any tips for improving
the performance of this operation? (Adding a DISTINCT as suggested by an
earlier post on here doesn't help. I am currently re-running the benchmark
with LZO compression enabled).

Regards,
Ben


pig pull request: fixed a typo in a log message

2013-08-20 Thread epishkin
GitHub user epishkin opened a pull request:

https://github.com/apache/pig/pull/11

fixed a typo in a log message



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/epishkin/pig patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/pig/pull/11.patch







[jira] [Commented] (PIG-3429) Reduce Pig memory footprint using specialized tuple classes (complementary to SchemaTuple)

2013-08-20 Thread Jonathan Packer (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745007#comment-13745007
 ] 

Jonathan Packer commented on PIG-3429:
--

Hi, the current patch now seems to pass every unit test except those that use 
Tuple's append() method, which breaks. I have an idea for fixing this, but 
wanted to wait for feedback to make sure I'm going in the right direction. I 
know this changes some important classes, but I think the memory 
improvements could especially help make Pig local mode more viable for 
general-purpose use, as memory is more of an issue on laptops than on clusters.

My idea for fixing append() is to give the specialized tuple impls 
an extra field "Tuple promotedTuple". This is null by default, so it only adds 
8 bytes of overhead (still much cheaper than the ArrayList when it is unused). 
If someone needs to append to the specialized tuple, the existing fields are 
copied into a new default tuple in the "promotedTuple" field, and that tuple is 
then used by proxy. So there is a small overhead vs. the default when append is 
used, but in most cases, where append is not used, you retain the memory savings 
of the specialized tuples. Does this seem like a workable idea?
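
The promotion idea might look roughly like the sketch below. LongWrapperTupleSketch is a hypothetical class that does not implement Pig's real Tuple interface, and a plain List stands in for the default tuple that the comment proposes to store in promotedTuple.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; it does not implement Pig's real Tuple interface.
class LongWrapperTupleSketch {
    private final long value;
    // Null until the first append, so the common case carries only one extra reference.
    private List<Object> promotedTuple;

    LongWrapperTupleSketch(long value) {
        this.value = value;
    }

    public int size() {
        return promotedTuple == null ? 1 : promotedTuple.size();
    }

    public Object get(int index) {
        if (promotedTuple != null) {
            return promotedTuple.get(index); // delegate once promoted
        }
        if (index != 0) {
            throw new IndexOutOfBoundsException("index " + index + ", size 1");
        }
        return value; // boxed on demand
    }

    public void append(Object field) {
        if (promotedTuple == null) {
            // Copy the existing field into a general-purpose container once.
            promotedTuple = new ArrayList<>();
            promotedTuple.add(value);
        }
        promotedTuple.add(field);
    }
}
```

Every accessor pays for one null check, which is the small overhead the comment mentions; tuples that are never appended to keep the compact layout.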

> Reduce Pig memory footprint using specialized tuple classes (complementary to 
> SchemaTuple)
> --
>
> Key: PIG-3429
> URL: https://issues.apache.org/jira/browse/PIG-3429
> Project: Pig
>  Issue Type: Improvement
>  Components: data
>Affects Versions: 0.12
>Reporter: Jonathan Packer
>Assignee: Jonathan Packer
> Attachments: PIG-3429-v1.diff, PIG-3429-v2.diff
>
>
> Pig's default tuple implementation is very memory inefficient for small 
> tuples, as the minimum size of an empty tuple is 96 bytes. This leads to bags 
> being spilled more often than they need to. SchemaTuple addresses this, but 
> is not fully integrated into the PhysicalPlan pipeline (and seems like it 
> would be difficult to do so). Furthermore, it is likely that almost all UDFs 
> do not use SchemaTuple.
> This patch therefore provides some basic optimizations to reduce memory 
> footprint of tuples by having BinSedesTupleFactory construct specialized 
> tuple implementations in certain circumstances. This way, anything using 
> BinSedesTupleFactory will reap the benefits, and since SchemaTuple uses a 
> different factory, it will not be interfered with.
> There is a long description below, because this patch might break stuff. I 
> tried to think through possible implementation hazards which I will list.
> The specialized tuple implementations are as follows:
> EmptyTuple           // no fields, just an object header = 8 bytes
> NullWrapperTuple     // wraps a single null field, 8 bytes
> CountingTuple        // replaces (1L) as initial output of COUNT, 8 bytes
> IntegerWrapperTuple  // these all wrap a single primitive field
> LongWrapperTuple     // object header + rounded primitive size = 16 bytes
> FloatWrapperTuple
> DoubleWrapperTuple
> BinSedesTuple2       // these are pair/triples of fields with no ArrayList
> BinSedesTuple3       // 16/24 bytes of overhead as opposed to 80 from ArrayList
> The memory savings are greatest for the algebraic math functions COUNT, SUM, 
> etc. For example, the size of an intermediate tuple for COUNT should go from 
> 112 bytes to 8 bytes. The size of an intermediate tuple from SUM should go 
> from 112 bytes to 16 bytes.
> I haven't finished running the full unit-tests, but TestAlgebraicEval passes 
> so I'm hopeful it will be manageable to debug.
> The three concerns that I have are:
> 1) Since TupleFactory now sometimes outputs non-appendable tuples, the 
> isFixedSize() method had to be removed. A file search didn't show it being 
> used anywhere though. I think appending to tuples instead of finding out the 
> requisite size ahead of time is bad practice as well (I changed POForeach to 
> do the latter so it can take advantage of the special tuple impls).
> 2) Also since TupleFactory now has multiple tuple types, the tupleClass() 
> method gets tricky. I made a superclass GenericBinSedesTuple that all the 
> specialized classes inherit from, and it seems to work, but I'm not sure what 
> the implications of this are. It breaks the inheritance tree of AbstractTuple 
> <-- DefaultTuple <-- BinSedesTuple, so now "DefaultBinSedesTuple" inherits 
> directly from GenericBinSedesTuple and DefaultTuple is left unused. In the 
> patch, all the stuff for DefaultBinSedesTuple is just copied over from the 
> old DefaultTuple.
> 3) I tried to be careful not to break BinInterSedesTupleRawComparator, but 
> this will need verification.
> Finally,
> 4) For my personal use cases, I'd like to make custom tuple implementations