[jira] [Updated] (PIG-3419) Pluggable Execution Engine

2013-08-23 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3419:
---

Attachment: updated-8-23-2013-exec-engine.patch

I am uploading a new patch that includes the following changes:
* Fixes most test cases (issues with JobStats and Explain).
* Removes 
"src/META-INF/services/org.apache.pig.backend.executionengine.ExecType" because 
it's a duplicate. (It was probably added by mistake.)
* Renames TestJobStats.java to TestMRJobStats.java since it tests MRJobStats.
* Fixes a bunch of Java warnings.

The diff from Achal's last patch can be viewed 
[here|https://github.com/piaozhexiu/apache-pig/commit/2a0b8bd00ae8685cd13d9b5ea08cb4672c71f450].

I just kicked off the unit tests again and will let you know how it goes. 
Thanks!

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.12
> Reporter: Achal Soni
> Assignee: Achal Soni
> Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch, 
> stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
> updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch
>
>
> In an effort to adapt Pig to work using Apache Tez 
> (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for 
> a cleaner ExecutionEngine abstraction than existed before. The changes are 
> not that major as Pig was already relatively abstracted out between the 
> frontend and backend. The changes in the attached commit are essentially the 
> barebones changes -- I tried to not change the structure of Pig's different 
> components too much. I think it will be interesting to see in the future how 
> we can refactor more areas of Pig to really honor this abstraction between 
> the frontend and backend. 
> Some of the changes were to reinstate an ExecutionEngine interface to tie 
> together the frontend and backend, to make Pig delegate to the EE when 
> necessary, and to create an MRExecutionEngine that implements this 
> interface. Other work included changing ExecType to cycle through the 
> ExecutionEngines on the classpath and select the appropriate one (this is 
> done using Java ServiceLoader, the same way MapReduce chooses the 
> framework to use between local and distributed mode). I also tried to make 
> ScriptState, JobStats, and PigStats as abstract as possible in their current 
> state. I think in the future some work will need to be done here to perhaps 
> re-evaluate the usage of ScriptState and the responsibilities of the 
> different statistics classes. I haven't touched the PPNL 
> (PigProgressNotificationListener), but I think more abstraction is needed 
> here, perhaps in a separate patch. 
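
To make the engine-selection step in the description above more concrete, here is a minimal sketch of the "cycle through the candidates and pick the one that accepts the requested type" loop. It is written in Python for brevity and is not taken from the patch, which relies on Java's ServiceLoader. The accepts() method, the select_engine() helper, the ExperimentalEngine class, and the CANDIDATES list are all assumptions made for illustration.

```python
class ExecutionEngine:
    """Hypothetical stand-in for the ExecutionEngine interface."""
    name = "abstract"

    def accepts(self, exec_type):
        """Return True if this engine can run the requested exec type."""
        raise NotImplementedError


class MRExecutionEngine(ExecutionEngine):
    """MapReduce-backed engine; assumed here to cover both MR modes."""
    name = "mapreduce"

    def accepts(self, exec_type):
        return exec_type.lower() in ("local", "mapreduce")


class ExperimentalEngine(ExecutionEngine):
    """Placeholder for an engine contributed by a separate jar."""
    name = "experimental"

    def accepts(self, exec_type):
        return exec_type.lower() == "experimental"


def select_engine(exec_type, candidates):
    """Try each candidate class in order and return the first that accepts."""
    for candidate in candidates:
        engine = candidate()
        if engine.accepts(exec_type):
            return engine
    raise ValueError("no execution engine accepts %r" % exec_type)


# In the patch the candidates are discovered on the classpath;
# in this sketch they are simply listed by hand.
CANDIDATES = [MRExecutionEngine, ExperimentalEngine]
print(select_engine("local", CANDIDATES).name)
```

The point of the design is that adding a new engine only means adding another candidate; the selection loop itself does not change.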

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.

2013-08-23 Thread Jeremy Karn (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Karn updated PIG-2417:
-

Patch Info: Patch Available

> Streaming UDFs -  allow users to easily write UDFs in scripting languages 
> with no JVM implementation.
> -
>
> Key: PIG-2417
> URL: https://issues.apache.org/jira/browse/PIG-2417
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.11
> Reporter: Jeremy Karn
> Assignee: Jeremy Karn
> Attachments: PIG-2417-4.patch, PIG-2417-5.patch, streaming2.patch, 
> streaming3.patch, streaming.patch
>
>
> The goal of Streaming UDFs is to allow users to easily write UDFs in 
> scripting languages with no JVM implementation or a limited JVM 
> implementation.  The initial proposal is outlined here: 
> https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.
> In order to implement this we need new syntax to distinguish a streaming UDF 
> from an embedded JVM UDF.  I'd propose something like the following (although 
> I'm not sure 'language' is the best term to be using):
> {code}define my_streaming_udfs language('python') 
> ship('my_streaming_udfs.py'){code}
> We'll also need a language-specific controller script that gets shipped to 
> the cluster, which is responsible for reading the input stream, deserializing 
> the input data, passing it to the user-written script, serializing that 
> script's output, and writing that to the output stream.
> Finally, we'll need to add a StreamingUDF class that extends EvalFunc.  This 
> class will likely share some of the existing code in POStream and 
> ExecutableManager (where it makes sense to pull out shared code) to stream 
> data to/from the controller script.
> One alternative approach to creating the StreamingUDF EvalFunc is to use the 
> POStream operator directly.  This would involve inserting the POStream 
> operator instead of the POUserFunc operator whenever we encountered a 
> streaming UDF while building the physical plan.  This approach seemed 
> problematic because there would need to be a lot of changes in order to 
> support POStream in all of the places we want to be able to use UDFs (for 
> example, to operate on a single field inside of a foreach statement).
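
The controller script described above could look roughly like the following Python sketch. The wire format (one record per line, tab-separated fields) and the names run_controller and concat_upper are assumptions for illustration only; the serialization actually used by the patch may well differ.

```python
import io


def run_controller(udf, in_stream, out_stream, delimiter="\t"):
    """Read one record per line, apply the user function, write one result per line."""
    for line in in_stream:
        # Deserialize: strip the newline and split the record into fields.
        fields = line.rstrip("\n").split(delimiter)
        # Pass the fields to the user-written function.
        result = udf(*fields)
        # Normalize a scalar result into a one-field record.
        if not isinstance(result, (list, tuple)):
            result = (result,)
        # Serialize the result and write it to the output stream.
        out_stream.write(delimiter.join(str(value) for value in result) + "\n")
    out_stream.flush()


def concat_upper(first, second):
    """Example user-written UDF (hypothetical)."""
    return (first + second).upper()


# Demo with in-memory streams; a shipped controller would use stdin/stdout instead.
source = io.StringIO("ab\tcd\nx\ty\n")
sink = io.StringIO()
run_controller(concat_upper, source, sink)
print(sink.getvalue(), end="")
```

Keeping the read/deserialize/call/serialize/write loop separate from the user function is what lets the same controller be reused for any script written in that language.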



[jira] [Work started] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.

2013-08-23 Thread Jeremy Karn (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on PIG-2417 started by Jeremy Karn.




[jira] [Work stopped] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.

2013-08-23 Thread Jeremy Karn (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on PIG-2417 stopped by Jeremy Karn.




[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-23 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749226#comment-13749226
 ] 

Julien Le Dem commented on PIG-3419:


The advantage of having the ExecutionEngine abstraction in trunk is that it allows 
running experimental Pig execution engine implementations like Tez or Spark on 
an official release of Pig without having to build from a specific branch.
The execution engine implementations themselves are fairly independent of Pig 
and do not need to be maintained in a Pig branch.
If the ExecutionEngine abstraction evolves over time, that can be done in trunk 
and can be merged independently of the Tez implementation itself.





[jira] Subscription: PIG patch available

2013-08-23 Thread jira
Issue Subscription
Filter: PIG patch available (18 issues)

Subscriber: pigdaily

Key         Summary
PIG-3436    Make pigmix run with Hadoop2
            https://issues.apache.org/jira/browse/PIG-3436
PIG-3431    Return more information for parsing related exceptions.
            https://issues.apache.org/jira/browse/PIG-3431
PIG-3430    Add xml format for explaining MapReduce Plan.
            https://issues.apache.org/jira/browse/PIG-3430
PIG-3426    Add support for removing s3 files
            https://issues.apache.org/jira/browse/PIG-3426
PIG-3419    Pluggable Execution Engine
            https://issues.apache.org/jira/browse/PIG-3419
PIG-3374    CASE and IN fail when expression includes dereferencing operator
            https://issues.apache.org/jira/browse/PIG-3374
PIG-3349    Document ToString(Datetime, String) UDF
            https://issues.apache.org/jira/browse/PIG-3349
PIG-3346    New property that controls the number of combined splits
            https://issues.apache.org/jira/browse/PIG-3346
PIG-        Fix remaining Windows core unit test failures
            https://issues.apache.org/jira/browse/PIG-
PIG-3325    Adding a tuple to a bag is slow
            https://issues.apache.org/jira/browse/PIG-3325
PIG-3295    Casting from bytearray failing after Union (even when each field is from a single Loader)
            https://issues.apache.org/jira/browse/PIG-3295
PIG-3292    Logical plan invalid state: duplicate uid in schema during self-join to get cross product
            https://issues.apache.org/jira/browse/PIG-3292
PIG-3257    Add unique identifier UDF
            https://issues.apache.org/jira/browse/PIG-3257
PIG-3199    Expose LogicalPlan via PigServer API
            https://issues.apache.org/jira/browse/PIG-3199
PIG-3117    A debug mode in which pig does not delete temporary files
            https://issues.apache.org/jira/browse/PIG-3117
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3048    Add mapreduce workflow information to job configuration
            https://issues.apache.org/jira/browse/PIG-3048
PIG-3021    Split results missing records when there is null values in the column comparison
            https://issues.apache.org/jira/browse/PIG-3021



[jira] [Assigned] (PIG-2606) union is not accepting same alias as multiple inputs

2013-08-23 Thread Hari Sankar Sivarama Subramaniyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sankar Sivarama Subramaniyan reassigned PIG-2606:
--

Assignee: Hari Sankar Sivarama Subramaniyan

> union is not accepting same alias as multiple inputs
> 
>
> Key: PIG-2606
> URL: https://issues.apache.org/jira/browse/PIG-2606
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.2, 0.10.0
> Reporter: Thejas M Nair
> Assignee: Hari Sankar Sivarama Subramaniyan
>
> grunt> l = load 'x';   
> grunt> u = union l, l; 
> 2012-03-16 18:48:45,687 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2998: Unhandled internal error. Union with Count(Operand) < 2



[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-23 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749190#comment-13749190
 ] 

Dmitriy V. Ryaboy commented on PIG-3419:


Olga, the first commit to the spork branch is from *2012*.

https://github.com/dvryaboy/pig  (the default branch on my github is "spork").






Re: Upgrading antlr to 3.5

2013-08-23 Thread Ashutosh Chauhan
Hive had a lot of trouble while upgrading antlr last time. See
https://issues.apache.org/jira/browse/HIVE-2439 &
https://issues.apache.org/jira/browse/HIVE-4547
Not to say you will encounter these difficulties in the next upgrade too, but
given that antlr is not very particular about backward & forward
compatibility and Hive uses antlr in a pretty grueling way, I would be pretty
cautious with the upgrade.

My two cents.
Ashutosh


On Fri, Aug 23, 2013 at 4:17 PM, Daniel Dai wrote:

> If 3.5 can work without any code change, probably should be Ok. But we
> never tried that.
>
>
> On Fri, Aug 23, 2013 at 7:43 PM, Prashant Kommireddi wrote:
>
> > Hi Daniel,
> >
> > The reasons are more internal. Our app is having an issue with 3.4 and
> > it's easier for us to move forward to 3.5
> >
> >
> > http://antlr.markmail.org/search/?q=%22void+%3D+null%3B%22#query:%22void%20%3D%20null%3B%22%20order%3Adate-backward+page:1+mid:7g3th2bg3onyoqhv+state:results
> >
> > Is it difficult to upgrade the version across the board (hive + pig)?
> >
> >
> > On Fri, Aug 23, 2013 at 2:02 PM, Daniel Dai wrote:
> >
> > > Any reason why you want to upgrade to 3.5? We'd like Hive/Pig to use the
> > > same version of antlr, which eases the integration work of Hive/Pig/HCat.
> > >
> > > Thanks,
> > > Daniel
> > >
> > >
> > > On Fri, Aug 23, 2013 at 5:50 PM, Prashant Kommireddi <prash1...@gmail.com> wrote:
> > >
> > > > Hey guys,
> > > >
> > > > Anyone aware of any issues with upgrading antlr to v3.5 for Pig? I am
> > > > planning to try it out, and wanted to make sure it's not already been
> > > > tried.
> > > >
> > > > Thanks,
> > > > Prashant
> > > >
> > >
> > > --
> > > CONFIDENTIALITY NOTICE
> > > NOTICE: This message is intended for the use of the individual or entity
> > > to which it is addressed and may contain information that is confidential,
> > > privileged and exempt from disclosure under applicable law. If the reader
> > > of this message is not the intended recipient, you are hereby notified
> > > that any printing, copying, dissemination, distribution, disclosure or
> > > forwarding of this communication is strictly prohibited. If you have
> > > received this communication in error, please contact the sender
> > > immediately and delete it from your system. Thank You.
> > >
> >
>
>


Re: Upgrading antlr to 3.5

2013-08-23 Thread Daniel Dai
If 3.5 can work without any code change, probably should be Ok. But we
never tried that.


On Fri, Aug 23, 2013 at 7:43 PM, Prashant Kommireddi wrote:

> Hi Daniel,
>
> The reasons are more internal. Our app is having an issue with 3.4 and it's
> easier for us to move forward to 3.5
>
> http://antlr.markmail.org/search/?q=%22void+%3D+null%3B%22#query:%22void%20%3D%20null%3B%22%20order%3Adate-backward+page:1+mid:7g3th2bg3onyoqhv+state:results
>
> Is it difficult to upgrade the version across the board (hive + pig)?
>
>
> On Fri, Aug 23, 2013 at 2:02 PM, Daniel Dai wrote:
>
> > Any reason why you want to upgrade to 3.5? We'd like Hive/Pig to use the
> > same version of antlr, which eases the integration work of Hive/Pig/HCat.
> >
> > Thanks,
> > Daniel
> >
> >
> > On Fri, Aug 23, 2013 at 5:50 PM, Prashant Kommireddi <prash1...@gmail.com> wrote:
> >
> > > Hey guys,
> > >
> > > Anyone aware of any issues with upgrading antlr to v3.5 for Pig? I am
> > > planning to try it out, and wanted to make sure it's not already been
> > > tried.
> > >
> > > Thanks,
> > > Prashant
> > >
> >
> >
>



[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-23 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749106#comment-13749106
 ] 

Olga Natkovich commented on PIG-3419:
-

I think the reason we wanted it on the Tez branch is that it might evolve with 
the Tez implementation, so we would merge the updated code back when Tez is 
ready. Since there are no plans for any additional backend, is there a need to 
apply this to trunk sooner rather than later?




[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-23 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749014#comment-13749014
 ] 

Dmitriy V. Ryaboy commented on PIG-3419:


Rohini, I want to reiterate that this patch has NO Tez dependencies (if it 
does, that's a bug). The intention is not to make Tez possible. It's to make 
pluggable execution engines possible, and I do not want that functionality to 
be tied to a Tez branch that will be unstable and in heavy development for the 
foreseeable future. This work will be immediately useful for the Spork (Pig on 
Spark) branch, for example.

Also, it allows people to work with new runtimes *without modifying Pig*. So 
Pig-on-Tez doesn't even have to be done as a branch of this project; someone 
can go and experiment completely independently.

For these reasons, I would like it in trunk.

You make a great point about the danger of changing exceptions, public methods, 
etc. I believe that most of these are project-public, and annotated as such. Do 
you have specific methods you are concerned about? Ideally we would change as 
little as possible for the end user.

Dmitriy




[jira] [Updated] (PIG-3379) Alias reuse in nested foreach causes PIG script to fail

2013-08-23 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3379:


   Resolution: Fixed
Fix Version/s: 0.12
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks Xuefu!

> Alias reuse in nested foreach causes PIG script to fail
> ---
>
> Key: PIG-3379
> URL: https://issues.apache.org/jira/browse/PIG-3379
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.11.1
> Reporter: Xuefu Zhang
> Assignee: Xuefu Zhang
> Fix For: 0.12
>
> Attachments: PIG-3379-draft.patch, PIG-3379.patch
>
>
> The following script fails:
> {code:title=temp.pig}
> Events = LOAD 'x' AS (eventTime:long, deviceId:chararray, eventName:chararray);
> Events = FOREACH Events GENERATE eventTime, deviceId, eventName;
> EventsPerMinute = GROUP Events BY (eventTime / 6);
> EventsPerMinute = FOREACH EventsPerMinute {
>   DistinctDevices = DISTINCT Events.deviceId;
>   nbDevices = SIZE(DistinctDevices);
>   DistinctDevices = FILTER Events BY eventName == 'xuaHeartBeat';
>   nbDevicesWatching = SIZE(DistinctDevices);
>   GENERATE $0*6 as timeStamp, nbDevices as nbDevices, nbDevicesWatching as nbDevicesWatching;
> }
> EventsPerMinute = FILTER EventsPerMinute BY timeStamp >= 0 AND timeStamp < 10;
> A = FOREACH EventsPerMinute GENERATE timeStamp;
> describe A;
> {code}
> With the error:
> {code}
> 2013-07-16 11:31:20,450 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: 
>  Invalid field projection. Projected field [timeStamp] does not exist in schema: deviceId:chararray.
> {code}
> Using a distinct alias name for the second "DistinctDevices" fixes the
> problem. Removing the last FILTER statement also fixes it.



[jira] [Commented] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.

2013-08-23 Thread Jeremy Karn (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748957#comment-13748957
 ] 

Jeremy Karn commented on PIG-2417:
--

Here's the review board: https://reviews.apache.org/r/13781/

> Streaming UDFs -  allow users to easily write UDFs in scripting languages 
> with no JVM implementation.
> -
>
> Key: PIG-2417
> URL: https://issues.apache.org/jira/browse/PIG-2417
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.11
> Reporter: Jeremy Karn
> Assignee: Jeremy Karn
> Attachments: PIG-2417-4.patch, PIG-2417-5.patch, streaming2.patch, 
> streaming3.patch, streaming.patch
>
>
> The goal of Streaming UDFs is to allow users to easily write UDFs in 
> scripting languages with no JVM implementation or a limited JVM 
> implementation.  The initial proposal is outlined here: 
> https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.
> In order to implement this we need new syntax to distinguish a streaming UDF 
> from an embedded JVM UDF.  I'd propose something like the following (although 
> I'm not sure 'language' is the best term to be using):
> {code}define my_streaming_udfs language('python') 
> ship('my_streaming_udfs.py'){code}
> We'll also need a language-specific controller script that gets shipped to 
> the cluster which is responsible for reading the input stream, deserializing 
> the input data, passing it to the user written script, serializing that 
> script output, and writing that to the output stream.
> Finally, we'll need to add a StreamingUDF class that extends EvalFunc.  This
> class will likely share some of the existing code in POStream and
> ExecutableManager (where it makes sense to pull out shared code) to stream
> data to/from the controller script.
> One alternative approach to creating the StreamingUDF EvalFunc is to use the 
> POStream operator directly.  This would involve inserting the POStream 
> operator instead of the POUserFunc operator whenever we encountered a 
> streaming UDF while building the physical plan.  This approach seemed 
> problematic because there would need to be a lot of changes in order to
> support POStream in all of the places we want to be able to use UDFs (for
> example, to operate on a single field inside a foreach statement).
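
The controller loop described above (read the input stream, deserialize, call the user-written function, serialize, write to the output stream) might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the controller.py from the patch: the tab and newline delimiters, the function names, and the concat_upper example UDF are all hypothetical, and the real wire format used between Pig and the script may differ.

```python
import io

FIELD_DELIM = "\t"   # hypothetical field delimiter, not Pig's actual wire format
RECORD_DELIM = "\n"  # hypothetical record delimiter


def deserialize(line):
    """Split one input record into a list of string fields."""
    return line.rstrip(RECORD_DELIM).split(FIELD_DELIM)


def serialize(value):
    """Render the UDF result as one output record."""
    return str(value) + RECORD_DELIM


def run_controller(udf, in_stream, out_stream):
    """Read records, apply the user-written function, write the results."""
    for line in in_stream:
        fields = deserialize(line)
        out_stream.write(serialize(udf(*fields)))
    out_stream.flush()


def concat_upper(first, second):
    """Stand-in for a user-written streaming UDF."""
    return (first + second).upper()


# In-memory streams stand in for the process's stdin and stdout.
source = io.StringIO("ab\tcd\nef\tgh\n")
sink = io.StringIO()
run_controller(concat_upper, source, sink)
result = sink.getvalue()
```

In an actual streaming setup the same loop would presumably be driven with sys.stdin and sys.stdout, with the user's script imported by name instead of defined inline.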



Review Request 13781: Changes to add support for streaming_python udfs.

2013-08-23 Thread Jeremy Karn

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13781/
---

Review request for pig.


Repository: pig-git


Description
---

Changes for PIG-2417 (https://issues.apache.org/jira/browse/PIG-2417)


Diffs
-

  build.xml b20eb3d 
  src/org/apache/pig/PigToStream.java 7cc2950 
  src/org/apache/pig/StreamToPig.java ff24b27 
  src/org/apache/pig/builtin/PigStreaming.java 5467693 
  src/org/apache/pig/builtin/PigToStreamUDF.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamUDFToPig.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamingDelimiters.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamingUDF.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamingUDFException.java PRE-CREATION 
  src/org/apache/pig/builtin/StreamingUDFOutputSchemaException.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/DefaultInputHandler.java 301bea3 
  src/org/apache/pig/impl/streaming/DefaultOutputHandler.java 1b46e7d 
  src/org/apache/pig/impl/streaming/ExecutableManager.java cf79c83 
  src/org/apache/pig/impl/streaming/InputHandler.java 690d94e 
  src/org/apache/pig/impl/streaming/OutputHandler.java 6e9262a 
  src/org/apache/pig/impl/streaming/StreamingUDFOutputHandler.java PRE-CREATION 
  src/org/apache/pig/impl/streaming/StreamingUtil.java PRE-CREATION 
  src/org/apache/pig/impl/util/JarManager.java 5c4acb0 
  src/org/apache/pig/impl/util/StorageUtil.java dcb62ec 
  src/org/apache/pig/scripting/ScriptEngine.java 29a9e1f 
  src/org/apache/pig/scripting/ScriptingIllustrateOutputCapturer.java PRE-CREATION 
  src/org/apache/pig/scripting/streaming/python/PythonScriptEngine.java PRE-CREATION 
  src/python/streaming/controller.py PRE-CREATION 
  src/python/streaming/pig_util.py PRE-CREATION 
  test/org/apache/pig/builtin/TestPigToStreamUDF.java PRE-CREATION 
  test/org/apache/pig/builtin/TestStreamUDFToPig.java PRE-CREATION 
  test/org/apache/pig/builtin/TestStreamingUDF.java PRE-CREATION 
  test/org/apache/pig/impl/streaming/TestExecutableManager.java 6246019 
  test/org/apache/pig/impl/streaming/TestStreamingUDFOutputHandler.java PRE-CREATION 
  test/org/apache/pig/impl/streaming/TestStreamingUtil.java PRE-CREATION 
  test/org/apache/pig/test/TestPigStreaming.java PRE-CREATION 
  test/org/apache/pig/test/TestStreaming.java 1eac5d2 
  test/python/streaming/test_controller.py PRE-CREATION 
  test/unit-tests d52ad9d 

Diff: https://reviews.apache.org/r/13781/diff/


Testing
---


Thanks,

Jeremy Karn



[jira] [Updated] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.

2013-08-23 Thread Jeremy Karn (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Karn updated PIG-2417:
-

Attachment: PIG-2417-5.patch

Here's an updated patch that I think should be ready for review (review board 
coming soon).

Aside from the streaming python udfs this patch also contains some logic for 
capturing output from the python process that doesn't do much.  However, I'm 
hoping to get a patch up soon with Mortar's illustrate changes and that will 
take advantage of the captured output.

One thing that's still outstanding is the documentation.  Should I just add 
a section similar to http://pig.apache.org/docs/r0.11.1/udf.html#python-udfs 
for streaming python?





Re: Upgrading antlr to 3.5

2013-08-23 Thread Prashant Kommireddi
Hi Daniel,

The reasons are more internal. Our app is having an issue with 3.4 and it's
easier for us to move forward to 3.5
http://antlr.markmail.org/search/?q=%22void+%3D+null%3B%22#query:%22void%20%3D%20null%3B%22%20order%3Adate-backward+page:1+mid:7g3th2bg3onyoqhv+state:results

Is it difficult to upgrade the version across the board (hive + pig)?


On Fri, Aug 23, 2013 at 2:02 PM, Daniel Dai wrote:

> Any reason why you want to upgrade to 3.5? We'd like Hive/Pig to use the same
> version of antlr, which eases the integration work of Hive/Pig/HCat.
>
> Thanks,
> Daniel
>
>
> On Fri, Aug 23, 2013 at 5:50 PM, Prashant Kommireddi wrote:
>
> > Hey guys,
> >
> > Anyone aware of any issues with upgrading antlr to v3.5 for Pig? I am
> > planning to try it out, and wanted to make sure it's not already been
> > tried.
> >
> > Thanks,
> > Prashant
> >
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>


[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-23 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748907#comment-13748907
 ] 

Rohini Palaniswamy commented on PIG-3419:
-

In the Pig-on-Tez meeting at LinkedIn we decided to do the Tez work on a branch,
and that Cheolsoo will start a thread on the mailing list for it and take
up the task of creating the branch. Tez is relatively new and unstable, so it
would be wise not to start with code directly on trunk. Hive is also doing its
Tez work on a branch.

  Cheolsoo had a question as to whether we should commit this to trunk and
branch after that. I would prefer that PIG-3419 also go into the branch and not
be checked into trunk. It makes a lot of changes to the exceptions thrown,
removes public methods, etc., and that might cause backward incompatibility at
runtime with code compiled against previous versions of Pig. All of that needs
to be figured out and fixed, so it might not be a good idea to get this patch
directly into trunk. Thoughts?

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.12
> Reporter: Achal Soni
> Assignee: Achal Soni
> Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch, 
> stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
> updated-8-22-2013-exec-engine.patch
>
>


[jira] [Commented] (PIG-3437) Error while running e2e test: "Can't open ./resource/hadoop23.res, No such file or directory " coming from test_harness.pl line #179

2013-08-23 Thread Annie Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748906#comment-13748906
 ] 

Annie Lin commented on PIG-3437:


Using the Pig trunk workspace, I ran the e2e tests with hadoop23 on CI. The run
failed because hadoop23.res was not found. Can someone point me to where I can
get this file?

In test_harness.pl:
my $harnessRes = "";
if (defined($ENV{'HARNESS_RESOURCE'})) {
    $harnessRes = $ENV{'HARNESS_RESOURCE'};
} elsif ($^O =~ /mswin/i) {
    $harnessRes = "$ROOT/resource/windows.res";
} elsif ($globalCfg->{'hadoopversion'} == '23') {
    $harnessRes = "$ROOT/resource/hadoop23.res";  <=
} else {
    $harnessRes = "$ROOT/resource/default.res";
}


Below is the error in the console log from Jenkins:

 [exec] FATAL ERROR ./test_harness.pl at 179:  Can't open ./resource/hadoop23.res, No such file or directory


Thanks,
Annie

> Error while running e2e test:  "Can't open ./resource/hadoop23.res, No such 
> file or directory " coming from test_harness.pl line #179
> -
>
> Key: PIG-3437
> URL: https://issues.apache.org/jira/browse/PIG-3437
> Project: Pig
> Issue Type: Bug
> Components: e2e harness
> Reporter: Annie Lin
>




[jira] [Created] (PIG-3437) Error while running e2e test: "Can't open ./resource/hadoop23.res, No such file or directory " coming from test_harness.pl line #179

2013-08-23 Thread Annie Lin (JIRA)
Annie Lin created PIG-3437:
--

 Summary: Error while running e2e test:  "Can't open 
./resource/hadoop23.res, No such file or directory " coming from 
test_harness.pl line #179
 Key: PIG-3437
 URL: https://issues.apache.org/jira/browse/PIG-3437
 Project: Pig
  Issue Type: Bug
  Components: e2e harness
Reporter: Annie Lin






[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-23 Thread Achal Soni (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748873#comment-13748873
 ] 

Achal Soni commented on PIG-3419:
-

[~cheolsoo] Thanks a lot for running the test suite! It's good to see where the
patch is failing. I definitely agree that all of these need to be investigated
before the patch goes anywhere.

I have some ideas about a few of the test cases; it looks like minor stuff
with JobStats and the way Explain works now, which I have to look into. The rest
I can't really explain off the top of my head, but I'll give it a shot.

I'll report back with some more findings as soon as possible.

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.12
> Reporter: Achal Soni
> Assignee: Achal Soni
> Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch, 
> stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
> updated-8-22-2013-exec-engine.patch
>
>


Re: Upgrading antlr to 3.5

2013-08-23 Thread Daniel Dai
Any reason why you want to upgrade to 3.5? We'd like Hive/Pig to use the same
version of antlr, which eases the integration work of Hive/Pig/HCat.

Thanks,
Daniel


On Fri, Aug 23, 2013 at 5:50 PM, Prashant Kommireddi wrote:

> Hey guys,
>
> Anyone aware of any issues with upgrading antlr to v3.5 for Pig? I am
> planning to try it out, and wanted to make sure it's not already been
> tried.
>
> Thanks,
> Prashant
>



[jira] [Updated] (PIG-3436) Make pigmix run with Hadoop2

2013-08-23 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3436:


Status: Patch Available  (was: Open)

> Make pigmix run with Hadoop2
> 
>
> Key: PIG-3436
> URL: https://issues.apache.org/jira/browse/PIG-3436
> Project: Pig
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.12
>
> Attachments: PIG-3436-1.patch
>
>




[jira] [Updated] (PIG-3435) Custom Partitioner not working with MultiQueryOptimizer

2013-08-23 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3435:
--

Attachment: pig-3435-v02_skipcustompatitioner_for_merge.patch

While looking at the test case, I found PIG-2627, which fixed one of the issues
with custom partitioners and multiquery optimization (but not all of them).

The specific case mentioned in that ticket is handled there and works,
but my patch here simply skips multiquery optimization for ALL custom
partitioner jobs.
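
To illustrate the rule (this is only a rough sketch of the idea, not the code in the attached patch): a merge check that refuses to combine jobs as soon as any of them declares a custom partitioner. The GroupJob class and the can_merge function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GroupJob:
    """Hypothetical stand-in for one map-reduce job in the plan."""
    alias: str
    custom_partitioner: Optional[str] = None


def can_merge(jobs: List[GroupJob]) -> bool:
    """Allow merging only when no job declares a custom partitioner."""
    return all(job.custom_partitioner is None for job in jobs)


PARTITIONER = "org.apache.pig.test.utils.SimpleCustomPartitioner"

# The two cases from the issue description, plus the plain case.
both_custom = [GroupJob("C1", PARTITIONER), GroupJob("C2", PARTITIONER)]
one_custom = [GroupJob("C1", PARTITIONER), GroupJob("C2")]
no_custom = [GroupJob("C1"), GroupJob("C2")]
```

With this rule both cases from the description are left unmerged, which gives up some optimization but keeps the partitioner information intact.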

Since this is essentially a correctness issue, I want the fix to be back-ported
to 0.11, so I kept the change simple.

Can we create a separate JIRA for reviving custom-partitioner + multiquery
optimization in later releases?


> Custom Partitioner not working with MultiQueryOptimizer
> ---
>
> Key: PIG-3435
> URL: https://issues.apache.org/jira/browse/PIG-3435
> Project: Pig
> Issue Type: Bug
> Components: impl
> Reporter: Koji Noguchi
> Assignee: Koji Noguchi
> Attachments: pig-3435-v01.patch, 
> pig-3435-v02_skipcustompatitioner_for_merge.patch
>
>
> While looking at PIG-3385, I noticed some issues in the handling of custom
> partitioners with multi-query optimization.
> {noformat}
> C1 = group B1 by col1 PARTITION BY
>org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
> C2 = group B2 by col1 PARTITION BY
>org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
> {noformat}
> This seems to be merged into one MapReduce job correctly, but the custom
> partitioner information was lost.
> {noformat}
> C1 = group B1 by col1 PARTITION BY 
> org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
> C2 = group B2 by col1 parallel 2;
> {noformat}
> This seems to be merged even though the two should run with different
> partitioners.



Upgrading antlr to 3.5

2013-08-23 Thread Prashant Kommireddi
Hey guys,

Anyone aware of any issues with upgrading antlr to v3.5 for Pig? I am
planning to try it out, and wanted to make sure it's not already been tried.

Thanks,
Prashant


[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-23 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748657#comment-13748657
 ] 

Cheolsoo Park commented on PIG-3419:


All, so here is the list of failing tests:
{code}
org.apache.pig.test.TestGrunt.testScriptMissingLastNewLine
org.apache.pig.test.TestGrunt.testCheckScriptSyntaxWithSemiColonUDFErr
org.apache.pig.test.TestGrunt.testExplainDot
org.apache.pig.test.TestGrunt.testExplainOut
org.apache.pig.test.TestGrunt.testExplainBrief
org.apache.pig.test.TestGrunt.testExplainEmpty
org.apache.pig.test.TestGrunt.testExplainScript
org.apache.pig.test.TestInputOutputMiniClusterFileValidator.testValidationNeg
org.apache.pig.test.TestJobStats.testOneTaskReport
org.apache.pig.test.TestJobStats.testGetOuputSizeUsingNonFileBasedStorage1
org.apache.pig.test.TestJobStats.testGetOuputSizeUsingNonFileBasedStorage2
org.apache.pig.test.TestJobStats.testGetOuputSizeUsingNonFileBasedStorage3
org.apache.pig.test.TestJobStats.testGetOuputSizeUsingNonFileBasedStorage4
org.apache.pig.test.TestJobStats.testMedianMapReduceTime
org.apache.pig.test.TestJobStats.testGetOuputSizeUsingFileBasedStorage
org.apache.pig.test.TestMRExecutionEngine.testJobConfGeneration
org.apache.pig.test.TestMRExecutionEngine.testJobConfGenerationWithUserConfigs
org.apache.pig.test.TestMacroExpansion.test20
org.apache.pig.test.TestMacroExpansion.test21
org.apache.pig.test.TestMacroExpansion.test22
org.apache.pig.test.TestMacroExpansion.test23
org.apache.pig.test.TestMacroExpansion.test32
org.apache.pig.test.TestMacroExpansion.test33
org.apache.pig.test.TestMacroExpansion.test34
org.apache.pig.test.TestMacroExpansion.test35
org.apache.pig.test.TestMacroExpansion.testCommentInMacro
org.apache.pig.test.TestMacroExpansion.testNegativeNumber
org.apache.pig.test.TestMacroExpansion.typecastTest
org.apache.pig.test.TestMacroExpansion.testFilter
org.apache.pig.test.TestMapSideCogroup.testFailure2
org.apache.pig.test.TestMergeJoinOuter.testFailure
org.apache.pig.test.TestPigRunner.testEmptyFile
org.apache.pig.test.TestScriptLanguage.testSysArguments
org.apache.pig.test.TestShortcuts.testExplainShortcutNoAlias
org.apache.pig.test.TestShortcuts.testExplainShortcutNoAliasDefined
{code}
I prefer fixing them before committing rather than afterward. Although I
believe none of these failures is serious, can we have a couple more days before
committing Achal's patch? I will make sure it gets committed into trunk because
I definitely need it for a Tez branch.

Thoughts?

> Pluggable Execution Engine 
> ---
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.12
> Reporter: Achal Soni
> Assignee: Achal Soni
> Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch, 
> stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
> updated-8-22-2013-exec-engine.patch
>
>