[jira] [Commented] (PIG-3430) Add xml format for explaining MapReduce Plan.

2013-08-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753350#comment-13753350
 ] 

Daniel Dai commented on PIG-3430:
-

I can get the XML MapReduce plan with the patch. Two questions:
1. Is there any reason we only do this for the MapReduce plan?
2. Why do we need to mark tmpLoader? Is it to support the XML MapReduce plan, 
or is it a separate thing?

 Add xml format for explaining MapReduce Plan.
 -

 Key: PIG-3430
 URL: https://issues.apache.org/jira/browse/PIG-3430
 Project: Pig
  Issue Type: New Feature
Reporter: Jeremy Karn
 Attachments: PIG-3430.patch


 At Mortar we needed an easy way to store/parse a script's MapReduce plan.  
 We added an XML output format for the MapReduce plan to make this easier.  We 
 also added a flag to keep track of whether each store or load was from the 
 original script (and associated with an alias) or is a temporary 
 store/load generated by Pig.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2606) union/ join operations are not accepting same alias as multiple inputs

2013-08-29 Thread Hari Sankar Sivarama Subramaniyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sankar Sivarama Subramaniyan updated PIG-2606:
---

Attachment: PIG-2606.2.patch.txt

Adding unit tests.

 union/ join operations are not accepting same alias as multiple inputs
 --

 Key: PIG-2606
 URL: https://issues.apache.org/jira/browse/PIG-2606
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.2, 0.10.0
Reporter: Thejas M Nair
Assignee: Hari Sankar Sivarama Subramaniyan
 Attachments: PIG-2606.2.patch.txt, PIG-2606.patch.txt


 grunt> l = load 'x';
 grunt> u = union l, l;
 2012-03-16 18:48:45,687 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. Union with Count(Operand) < 2
 grunt> a = load 'a0.txt' as (a0, a1);
 grunt> b = join a by a0, a by a1;
 2013-08-27 13:36:21,807 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2225: Projection with nothing to reference!



[jira] [Commented] (PIG-3433) The import sdsu cannot be resolved

2013-08-29 Thread Ido Hadanny (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753523#comment-13753523
 ] 

Ido Hadanny commented on PIG-3433:
--

I didn't know that. As you said, I just ran a top-level ant clean eclipse-files 
followed by ant compile gen.

 The import sdsu cannot be resolved
 --

 Key: PIG-3433
 URL: https://issues.apache.org/jira/browse/PIG-3433
 Project: Pig
  Issue Type: Bug
  Components: build
Affects Versions: 0.11.1
 Environment: Eclipse indigo
Reporter: Ido Hadanny

 executed:
 ➜  trunk  svn update
 At revision 1516115.
 ant clean eclipse-files
 ant compile gen
 getting:
 https://issues.apache.org/jira/browse/PIG-3399
 AND after manually removing the wrong javacc-4.2 dependency, getting:
 The import sdsu cannot be resolved in DataGenerator.java



[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-29 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753843#comment-13753843
 ] 

Bikas Saha commented on PIG-3419:
-

Folks, FYI: based on recent feedback we have changed the names used in some of 
the Tez APIs. It is a simple refactoring on the Tez side and should be a simple 
refactoring fix on the Pig side too. JIRA for reference: TEZ-410.

 Pluggable Execution Engine 
 ---

 Key: PIG-3419
 URL: https://issues.apache.org/jira/browse/PIG-3419
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.12
Reporter: Achal Soni
Assignee: Achal Soni
Priority: Minor
 Attachments: execengine.patch, mapreduce_execengine.patch, 
 stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
 updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch, 
 updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch


 In an effort to adapt Pig to work using Apache Tez 
 (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for 
 a cleaner ExecutionEngine abstraction than existed before. The changes are 
 not that major as Pig was already relatively abstracted out between the 
 frontend and backend. The changes in the attached commit are essentially the 
 barebones changes -- I tried to not change the structure of Pig's different 
 components too much. I think it will be interesting to see in the future how 
 we can refactor more areas of Pig to really honor this abstraction between 
 the frontend and backend. 
 Some of the changes were to reinstate an ExecutionEngine interface to tie 
 together the frontend and backend, to make Pig delegate to the EE when 
 necessary, and to create an MRExecutionEngine that implements this 
 interface. Other work included changing ExecType to cycle through the 
 ExecutionEngines on the classpath and select the appropriate one (this is 
 done using Java ServiceLoader, just as MapReduce does when choosing the 
 framework to use between local and distributed mode). I also tried to make 
 ScriptState, JobStats, and PigStats as abstract as possible in their current 
 state. I think in the future some work will need to be done here to perhaps 
 re-evaluate the usage of ScriptState and the responsibilities of the 
 different statistics classes. I haven't touched the PPNL, but I think more 
 abstraction is needed here, perhaps in a separate patch. 
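The ServiceLoader-based selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions: EngineProvider, EngineSelector, and their methods are hypothetical names invented for this sketch, not the classes in the attached patches; only java.util.ServiceLoader itself is the real JDK API.

```java
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical provider interface; Pig's real ExecType/ExecutionEngine API may differ.
interface EngineProvider {
    /** Returns true if this provider handles the requested exec type name. */
    boolean accepts(String execTypeName);

    /** Human-readable engine name, used here for illustration only. */
    String engineName();
}

final class EngineSelector {
    private EngineSelector() {}

    /** Picks the first candidate that accepts the requested exec type name. */
    static Optional<EngineProvider> select(Iterable<EngineProvider> candidates,
                                           String execTypeName) {
        for (EngineProvider provider : candidates) {
            if (provider.accepts(execTypeName)) {
                return Optional.of(provider);
            }
        }
        return Optional.empty();
    }

    /**
     * Discovers providers registered on the classpath (via files under
     * META-INF/services) and picks the first one that accepts the name.
     */
    static Optional<EngineProvider> selectFromClasspath(String execTypeName) {
        // ServiceLoader is Iterable, so it can be passed straight to select().
        return select(ServiceLoader.load(EngineProvider.class), execTypeName);
    }
}
```

With this shape, adding a new engine would only require putting a jar with a provider implementation and its service registration file on the classpath; the selection code itself would not change.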



[jira] [Commented] (PIG-3441) Allow Pig to use default resources from Configuration objects

2013-08-29 Thread Bhooshan Mogal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753860#comment-13753860
 ] 

Bhooshan Mogal commented on PIG-3441:
-

Could you please tell me where the code snippet you mentioned is from? 

A lot of code-flows in pig seem to re-create configuration objects by loading 
from Properties as well in the ConfigurationUtil.toConfiguration() method. 

In this method, I saw that default resources are ignored as - 
{code}
public static Configuration toConfiguration(Properties properties) {
    assert properties != null;
    // loadDefaults = false: default resources are not loaded
    final Configuration config = new Configuration(false);
    final Enumeration<Object> iter = properties.keys();
    ...
{code}

Due to this, Pig was unable to read from custom resources added statically. The 
patch addresses this by allowing users to create the Configuration object in 
this method with loadDefaults set to true, based on a pig property. 
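To make the effect of the loadDefaults flag concrete, here is a minimal sketch that does not depend on Hadoop. MiniConf is a toy stand-in written for this illustration, not org.apache.hadoop.conf.Configuration, and the key names in the usage note are made up; it only mimics the two behaviours discussed above: defaults registered statically, and a constructor flag that decides whether they are consulted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Toy stand-in for Hadoop's Configuration; NOT the real class.
final class MiniConf {
    // Statically registered defaults, analogous in spirit to addDefaultResource(...).
    private static final Map<String, String> DEFAULTS = new LinkedHashMap<>();

    private final Map<String, String> values = new LinkedHashMap<>();

    static void addDefault(String key, String value) {
        DEFAULTS.put(key, value);
    }

    MiniConf(boolean loadDefaults) {
        // With loadDefaults == false the static defaults are never seen.
        if (loadDefaults) {
            values.putAll(DEFAULTS);
        }
    }

    void set(String key, String value) {
        values.put(key, value);
    }

    String get(String key) {
        return values.get(key);
    }

    /** Same shape as toConfiguration: copy Properties on top of optional defaults. */
    static MiniConf fromProperties(Properties properties, boolean loadDefaults) {
        MiniConf conf = new MiniConf(loadDefaults);
        for (String name : properties.stringPropertyNames()) {
            conf.set(name, properties.getProperty(name));
        }
        return conf;
    }
}
```

In this sketch, a key registered through MiniConf.addDefault is invisible to fromProperties(props, false) but visible to fromProperties(props, true), and any value set explicitly in the Properties still overrides the default.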

 Allow Pig to use default resources from Configuration objects
 -

 Key: PIG-3441
 URL: https://issues.apache.org/jira/browse/PIG-3441
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.11.1
Reporter: Bhooshan Mogal
 Attachments: PIG-3441_1.patch, PIG-3441.patch


 Pig currently ignores parameters from configuration files added statically to 
 Configuration objects via Configuration.addDefaultResource("filename.xml").
 Consider the following scenario -
 In a Hadoop FileSystem driver for a non-HDFS filesystem, you load properties 
 specific to that FileSystem in a static initializer block in the class that 
 extends org.apache.hadoop.fs.FileSystem for your FileSystem, like below - 
 {code}
 class MyFileSystem extends FileSystem {
   static {
     Configuration.addDefaultResource("myfs-default.xml");
     Configuration.addDefaultResource("myfs-site.xml");
   }
 }
 {code}
 Interfaces like the Hadoop CLI, Hive, Hadoop M/R can find configuration 
 parameters defined in these configuration files as long as they are on the 
 classpath.
 However, Pig cannot find parameters from these files, because it ignores 
 configuration files added statically.
 Pig should allow users to specify if they would like pig to read parameters 
 from resources loaded statically.



[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-29 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753869#comment-13753869
 ] 

Bikas Saha commented on PIG-3419:
-

Looks like this JIRA wasn't the appropriate one to comment on. Is there a 
different umbrella JIRA for Pig on Tez that I can track and post comments on?

 Pluggable Execution Engine 
 ---

 Key: PIG-3419
 URL: https://issues.apache.org/jira/browse/PIG-3419
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.12
Reporter: Achal Soni
Assignee: Achal Soni
Priority: Minor
 Attachments: execengine.patch, mapreduce_execengine.patch, 
 stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
 updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch, 
 updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch




[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-29 Thread Achal Soni (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753878#comment-13753878
 ] 

Achal Soni commented on PIG-3419:
-

[~bikassaha] Thanks for the heads-up, Bikas! This JIRA is not concerned with the 
Tez integration for Pig; it is simply the abstraction in Pig to allow for 
alternate ExecutionEngines. But we will certainly make this change on the Tez 
integration side.

Thanks a lot [~cheolsoo] for continuing this! I think everything looks good 
from my end. I can certainly see why we may want to keep this on a different 
branch until everything is finalized. Certain things may still need more work. 
For example, OutputStats is not completely abstracted out, as it still has 
references to POStore, which is an MR implementation construct. 
ScriptState/PPNL/JobStats may still need more abstraction (especially PPNL) and 
reworking to incorporate a new ExecutionEngine abstraction. I think what we 
have done here is the minimum foundation for an abstraction though, and it 
would be appropriate to put into trunk, but these are not my decisions to make. 

With regard to the public methods that were changed, I don't think most of them 
are a big deal, besides, as Cheolsoo said, PigServer throwing PigException. I 
never thought IOException was a good exception to throw, but I think reverting 
PigServer back to IOException, as it is user-facing code, is not a big deal. The 
rest of the method signature changes shouldn't be worrisome because most of 
them are internal to the project. 

However, the change from JobStats to MRJobStats, while necessary (as each 
ExecutionEngine would have its own type of JobStats that it would present to the 
end user), could be problematic because it is user-facing code and would 
probably break people who were previously using JobStats. That, I think, is the 
most important thing to keep in mind. The task of making the PPNL and JobStats 
clearly tied to the ExecutionEngine should also be thought through.

 Pluggable Execution Engine 
 ---

 Key: PIG-3419
 URL: https://issues.apache.org/jira/browse/PIG-3419
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.12
Reporter: Achal Soni
Assignee: Achal Soni
Priority: Minor
 Attachments: execengine.patch, mapreduce_execengine.patch, 
 stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
 updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch, 
 updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch




[jira] [Commented] (PIG-3441) Allow Pig to use default resources from Configuration objects

2013-08-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753880#comment-13753880
 ] 

Daniel Dai commented on PIG-3441:
-

This is from the test case of PIG-3135. If the custom configuration gets lost 
along the way, I wonder why PIG-3135 works. It seems they should share the same 
issue.

 Allow Pig to use default resources from Configuration objects
 -

 Key: PIG-3441
 URL: https://issues.apache.org/jira/browse/PIG-3441
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.11.1
Reporter: Bhooshan Mogal
 Attachments: PIG-3441_1.patch, PIG-3441.patch




[jira] [Commented] (PIG-3441) Allow Pig to use default resources from Configuration objects

2013-08-29 Thread Bhooshan Mogal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753899#comment-13753899
 ] 

Bhooshan Mogal commented on PIG-3441:
-

The test cases of [PIG-3135|https://issues.apache.org/jira/browse/PIG-3135] 
pass a false to the Configuration constructor. 

Also, [PIG-3135|https://issues.apache.org/jira/browse/PIG-3135] calls 
ConfigurationUtil.toConfiguration as well, which creates the configuration 
object with loadDefaults set to false like in my previous comment.

I tried using [PIG-3135|https://issues.apache.org/jira/browse/PIG-3135] and 
setting pig.use.overriden.hadoop.configs to true; however, Pig did not read 
from the custom configuration files added statically. When I changed it to set 
loadDefaults to true, it worked fine.

 Allow Pig to use default resources from Configuration objects
 -

 Key: PIG-3441
 URL: https://issues.apache.org/jira/browse/PIG-3441
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.11.1
Reporter: Bhooshan Mogal
 Attachments: PIG-3441_1.patch, PIG-3441.patch




[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-29 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753912#comment-13753912
 ] 

Julien Le Dem commented on PIG-3419:


[~cheolsoo]: thanks a lot for looking into this.

Here are my thoughts:

1. let's change it back

2. 4. 5. 6. 7. are either internal to Pig or necessary to add the execution 
engine abstraction.

3.
JobStats still exists, but the MR-specific part is split into MRJobStats, which 
extends JobStats.
The same goes for PigStatsUtil and ScriptState. Those classes are not 
disappearing, but the MR-specific part is abstracted out.
HExecutionEngine could be renamed back to what it was, but this is again what is 
becoming the new abstraction.
Unfortunately, tools like Ambrose and Lipstick depend on the MR-specific parts 
of Pig and look at the internals. This patch is a necessary change so that 
those tools can work independently of the execution engine in the future.
The changes to Ambrose and Lipstick should be minimal with this patch. 
Yes, they would suffer from some incompatibility, but there is no way 
around it when a tool looks inside the execution engine internals.

I think we should revert 1. and commit the patch.



 Pluggable Execution Engine 
 ---

 Key: PIG-3419
 URL: https://issues.apache.org/jira/browse/PIG-3419
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.12
Reporter: Achal Soni
Assignee: Achal Soni
Priority: Minor
 Attachments: execengine.patch, mapreduce_execengine.patch, 
 stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
 updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch, 
 updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch




[jira] [Commented] (PIG-3048) Add mapreduce workflow information to job configuration

2013-08-29 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753942#comment-13753942
 ] 

Dmitriy V. Ryaboy commented on PIG-3048:


No objections. After all, usage of the config info is purely optional.

We've run into trouble before with information of this sort becoming very big 
and triggering "JobConf too large" errors. We might want to look at compression 
at some point.

 Add mapreduce workflow information to job configuration
 ---

 Key: PIG-3048
 URL: https://issues.apache.org/jira/browse/PIG-3048
 Project: Pig
  Issue Type: Improvement
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Fix For: 0.11.2

 Attachments: PIG-3048.patch, PIG-3048.patch, PIG-3048.patch


 Adding workflow properties to the job configuration would enable logging and 
 analysis of workflows in addition to individual MapReduce jobs.  Suggested 
 properties include a workflow ID, workflow name, adjacency list connecting 
 nodes in the workflow, and the name of the current node in the workflow.
 mapreduce.workflow.id - a unique ID for the workflow, ideally prepended with 
 the application name
 e.g. pig_pigScriptId
 mapreduce.workflow.name - a name for the workflow, to distinguish this 
 workflow from other workflows and to group different runs of the same workflow
 e.g. pig command line
 mapreduce.workflow.adjacency - an adjacency list for the workflow graph, 
 encoded as mapreduce.workflow.adjacency.<source node> = <comma-separated list 
 of target nodes>
 mapreduce.workflow.node.name - the name of the node corresponding to this 
 MapReduce job in the workflow adjacency list
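The property layout described above can be sketched with plain java.util.Properties. The property names come from the description; the WorkflowProps class, its build method, and the node names in the usage note are hypothetical and are not taken from the attached patch.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper for illustration; not part of the PIG-3048 patch.
final class WorkflowProps {
    private WorkflowProps() {}

    /**
     * Builds the workflow properties for one MapReduce job.
     *
     * @param adjacency maps each source node to the list of its target nodes
     */
    static Properties build(String workflowId,
                            String workflowName,
                            String currentNode,
                            Map<String, List<String>> adjacency) {
        Properties props = new Properties();
        props.setProperty("mapreduce.workflow.id", workflowId);
        props.setProperty("mapreduce.workflow.name", workflowName);
        props.setProperty("mapreduce.workflow.node.name", currentNode);
        // One property per source node; targets are joined with commas.
        for (Map.Entry<String, List<String>> edge : adjacency.entrySet()) {
            props.setProperty("mapreduce.workflow.adjacency." + edge.getKey(),
                              String.join(",", edge.getValue()));
        }
        return props;
    }
}
```

For a workflow in which a node job1 feeds job2 and job3, this would produce mapreduce.workflow.adjacency.job1 = job2,job3. Because there is one entry per source node, the set of properties grows with the size of the graph.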



[jira] [Updated] (PIG-3419) Pluggable Execution Engine

2013-08-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3419:
---

Attachment: updated-8-29-2013-exec-engine.patch

I am uploading a new patch that reverts the PigServer constructor (#1). The diff 
can be viewed 
[here|https://github.com/piaozhexiu/apache-pig/commit/a1e46e23ef0842874db6c09769a630ec47f5d259].
 (There are two unrelated minor changes.)

The new patch is rebased to trunk. Please let me know if anyone has objections. 
If I don't hear back, I will commit it to trunk tomorrow. Thanks!

 Pluggable Execution Engine 
 ---

 Key: PIG-3419
 URL: https://issues.apache.org/jira/browse/PIG-3419
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.12
Reporter: Achal Soni
Assignee: Achal Soni
Priority: Minor
 Attachments: execengine.patch, mapreduce_execengine.patch, 
 stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
 updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch, 
 updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch, 
 updated-8-29-2013-exec-engine.patch




[jira] [Commented] (PIG-3255) Avoid extra byte array copy in streaming deserialize

2013-08-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753979#comment-13753979
 ] 

Daniel Dai commented on PIG-3255:
-

I personally am not aware of anyone using StreamToPig, but we need to check with 
[~alangates], since he marked it as public stable. The other parts of the patch 
look good. Avoiding two byte array copies and reusing the Text object would save 
memory and enhance performance.

 Avoid extra byte array copy in streaming deserialize
 

 Key: PIG-3255
 URL: https://issues.apache.org/jira/browse/PIG-3255
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.12

 Attachments: PIG-3255-1.patch, PIG-3255-2.patch


 PigStreaming.java:
 public Tuple deserialize(byte[] bytes) throws IOException {
     Text val = new Text(bytes);
     return StorageUtil.textToTuple(val, fieldDel);
 }
 Should remove new Text(bytes) copy and construct the tuple directly from the 
 bytes
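The idea of building fields directly from the input bytes, without first copying them into an intermediate object, can be sketched as below. This is a generic illustration and not the code in the attached patches: ByteFieldSplitter is a hypothetical class, it returns plain strings rather than a Pig Tuple, and it assumes a single-byte delimiter and UTF-8 input.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not Pig's StorageUtil.
final class ByteFieldSplitter {
    private ByteFieldSplitter() {}

    /**
     * Splits the input on a single-byte delimiter and decodes each field
     * straight from its range in the original array.
     */
    static List<String> split(byte[] bytes, byte delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == delimiter) {
                // Decode the range [start, i) without copying the whole input.
                fields.add(new String(bytes, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // Trailing field; this also covers input that contains no delimiter.
        fields.add(new String(bytes, start, bytes.length - start, StandardCharsets.UTF_8));
        return fields;
    }
}
```

Scanning for an ASCII delimiter such as a tab is safe on UTF-8 data, because bytes below 0x80 never occur inside a multi-byte sequence.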



[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-29 Thread Achal Soni (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753985#comment-13753985
 ] 

Achal Soni commented on PIG-3419:
-

I agree with all that is said, but there is no need to rename HExecutionEngine 
back. It doesn't semantically make sense, and I don't think that anybody was 
directly interacting with it outside of the test cases?

Whatever changes are needed in Ambrose and Lipstick should also be communicated 
clearly. I have noted some issues with PPNL before with regard to abstraction -- 
namely, Pig provides the MROperPlan to the listeners, which is not relevant in a 
different execution engine. Julien suggested this should be fixed in a follow-up 
patch. This will most certainly affect Ambrose and Lipstick, so we should be 
cautious in that regard.

 Pluggable Execution Engine 
 ---

 Key: PIG-3419
 URL: https://issues.apache.org/jira/browse/PIG-3419
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.12
Reporter: Achal Soni
Assignee: Achal Soni
Priority: Minor
 Attachments: execengine.patch, mapreduce_execengine.patch, 
 stats_scriptstate.patch, test_failures.txt, test_suite.patch, 
 updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch, 
 updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch, 
 updated-8-29-2013-exec-engine.patch


 In an effort to adapt Pig to work using Apache Tez 
 (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for 
 a cleaner ExecutionEngine abstraction than existed before. The changes are 
 not that major, as Pig was already relatively well abstracted between the 
 frontend and backend. The changes in the attached commit are essentially the 
 bare-bones changes; I tried not to change the structure of Pig's different 
 components too much. I think it will be interesting to see in the future how 
 we can refactor more areas of Pig to really honor this abstraction between 
 the frontend and backend. 
 Some of the changes were to reinstate an ExecutionEngine interface to tie 
 together the frontend and backend, to make Pig delegate to the EE when 
 necessary, and to create an MRExecutionEngine that implements this 
 interface. Other work included changing ExecType to cycle through the 
 ExecutionEngines on the classpath and select the appropriate one (this is 
 done using Java ServiceLoader, exactly as MapReduce does when choosing the 
 framework to use between local and distributed mode). I also tried to make 
 ScriptState, JobStats, and PigStats as abstract as possible in their current 
 state. I think some work will need to be done here in the future, perhaps to 
 re-evaluate the usage of ScriptState and the responsibilities of the 
 different statistics classes. I haven't touched the PPNL, but I think more 
 abstraction is needed here, perhaps in a separate patch. 
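
 For readers unfamiliar with ServiceLoader, the selection step described above 
 might look roughly like the sketch below. The Engine interface, its accepts 
 and name methods, and EngineSelector are hypothetical names made up for this 
 illustration; the interface in the attached patch may well look different.

```java
import java.util.ServiceLoader;

// Hypothetical interface for illustration only; the ExecutionEngine
// interface in the patch is not reproduced here.
interface Engine {
    boolean accepts(String execType);
    String name();
}

class EngineSelector {
    // Returns the first candidate engine that accepts the requested exec type.
    static Engine select(Iterable<Engine> candidates, String execType) {
        for (Engine e : candidates) {
            if (e.accepts(execType)) {
                return e;
            }
        }
        throw new IllegalArgumentException("No execution engine for: " + execType);
    }

    // Discovers implementations registered through a provider-configuration
    // file under META-INF/services on the classpath, then picks one.
    static Engine fromClasspath(String execType) {
        return select(ServiceLoader.load(Engine.class), execType);
    }
}
```

 Keeping the selection logic separate from the ServiceLoader call makes it 
 possible to exercise the selection with hand-built engines in a unit test.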



[jira] [Updated] (PIG-3048) Add mapreduce workflow information to job configuration

2013-08-29 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3048:


   Resolution: Fixed
Fix Version/s: (was: 0.11.2)
   0.12
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks guys!

 Add mapreduce workflow information to job configuration
 ---

 Key: PIG-3048
 URL: https://issues.apache.org/jira/browse/PIG-3048
 Project: Pig
  Issue Type: Improvement
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Fix For: 0.12

 Attachments: PIG-3048.patch, PIG-3048.patch, PIG-3048.patch


 Adding workflow properties to the job configuration would enable logging and 
 analysis of workflows in addition to individual MapReduce jobs.  Suggested 
 properties include a workflow ID, workflow name, adjacency list connecting 
 nodes in the workflow, and the name of the current node in the workflow.

 mapreduce.workflow.id - a unique ID for the workflow, ideally prepended with 
 the application name
   e.g. pig_pigScriptId
 mapreduce.workflow.name - a name for the workflow, to distinguish this 
 workflow from other workflows and to group different runs of the same workflow
   e.g. pig command line
 mapreduce.workflow.adjacency - an adjacency list for the workflow graph, 
 encoded as mapreduce.workflow.adjacency.<source node> = <comma-separated list 
 of target nodes>
 mapreduce.workflow.node.name - the name of the node corresponding to this 
 MapReduce job in the workflow adjacency list
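
 To make the encoding concrete, here is a small sketch that builds these 
 properties from a workflow graph. The property names come from the 
 description above; the WorkflowProps class, its build method, and the use of 
 a plain Map instead of a Hadoop job configuration are assumptions made for 
 the example, not what the patch does.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a plain Map stands in for the job configuration.
class WorkflowProps {
    // Builds the workflow properties for one job, given the workflow graph
    // as a map from each source node to its list of target nodes.
    static Map<String, String> build(String id, String name, String currentNode,
                                     Map<String, List<String>> dag) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapreduce.workflow.id", id);
        props.put("mapreduce.workflow.name", name);
        props.put("mapreduce.workflow.node.name", currentNode);
        // One adjacency property per source node; targets are comma-separated.
        for (Map.Entry<String, List<String>> e : dag.entrySet()) {
            props.put("mapreduce.workflow.adjacency." + e.getKey(),
                      String.join(",", e.getValue()));
        }
        return props;
    }
}
```

 With a graph in which job1 feeds job2 and job3, this would yield the entry 
 mapreduce.workflow.adjacency.job1 = job2,job3.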



[jira] [Commented] (PIG-3048) Add mapreduce workflow information to job configuration

2013-08-29 Thread Bill Graham (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13753997#comment-13753997
 ] 

Bill Graham commented on PIG-3048:
--

+1 to commit.

Just one style nit re spaces:

{noformat}
(getFileName() != null)?getFileName():default
{noformat}
should instead be:
{noformat}
(getFileName() != null) ? getFileName() : default
{noformat}




[jira] [Commented] (PIG-3048) Add mapreduce workflow information to job configuration

2013-08-29 Thread Bill Graham (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13753998#comment-13753998
 ] 

Bill Graham commented on PIG-3048:
--

Whoops, I was a minute too late. :)



[jira] [Commented] (PIG-3048) Add mapreduce workflow information to job configuration

2013-08-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754001#comment-13754001
 ] 

Daniel Dai commented on PIG-3048:
-

No problem, I just committed the change you suggested. Thanks Bill!



[jira] [Commented] (PIG-3441) Allow Pig to use default resources from Configuration objects

2013-08-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754017#comment-13754017
 ] 

Daniel Dai commented on PIG-3441:
-

I don't doubt that your case fails, but I am curious why PIG-3135 works. 
Both seem to be trying to pass a custom configuration in: PIG-3135 passes some 
hand-coded entries, while you want to construct Configuration(true); both then 
pass the config object to Pig. If Pig does take the config object, both cases 
should work; if not, both should fail. I would like to solve both issues in 
one consistent way if possible. 

 Allow Pig to use default resources from Configuration objects
 -

 Key: PIG-3441
 URL: https://issues.apache.org/jira/browse/PIG-3441
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.11.1
Reporter: Bhooshan Mogal
 Attachments: PIG-3441_1.patch, PIG-3441.patch


 Pig currently ignores parameters from configuration files added statically to 
 Configuration objects via Configuration.addDefaultResource("filename.xml").
 Consider the following scenario:
 In a Hadoop FileSystem driver for a non-HDFS filesystem, you load properties 
 specific to that FileSystem in a static initializer block in the class that 
 extends org.apache.hadoop.fs.FileSystem, as below:
 {code}
 class MyFileSystem extends FileSystem {
   static {
     // Register the driver's config files as default resources.
     Configuration.addDefaultResource("myfs-default.xml");
     Configuration.addDefaultResource("myfs-site.xml");
   }
 }
 {code}
 Interfaces like the Hadoop CLI, Hive, and Hadoop M/R can find configuration 
 parameters defined in these configuration files as long as they are on the 
 classpath.
 However, Pig cannot find parameters from these files, because it ignores 
 configuration files added statically.
 Pig should allow users to specify whether they would like Pig to read 
 parameters from resources loaded statically.



[jira] [Commented] (PIG-3433) The import sdsu cannot be resolved

2013-08-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754061#comment-13754061
 ] 

Daniel Dai commented on PIG-3433:
-

Does the error happen when you run ant, or inside Eclipse?

 The import sdsu cannot be resolved
 --

 Key: PIG-3433
 URL: https://issues.apache.org/jira/browse/PIG-3433
 Project: Pig
  Issue Type: Bug
  Components: build
Affects Versions: 0.11.1
 Environment: Eclipse indigo
Reporter: Ido Hadanny

 executed:
 ➜  trunk  svn update
 At revision 1516115.
 ant clean eclipse-files
 ant compile gen
 getting:
 https://issues.apache.org/jira/browse/PIG-3399
 AND after manually removing the wrong javacc-4.2 dependency, getting:
 The import sdsu cannot be resolved in DataGenerator.java



[jira] [Updated] (PIG-2606) union/ join operations are not accepting same alias as multiple inputs

2013-08-29 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2606:


   Resolution: Fixed
Fix Version/s: 0.12
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks Hari!

 union/ join operations are not accepting same alias as multiple inputs
 --

 Key: PIG-2606
 URL: https://issues.apache.org/jira/browse/PIG-2606
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.2, 0.10.0
Reporter: Thejas M Nair
Assignee: Hari Sankar Sivarama Subramaniyan
 Fix For: 0.12

 Attachments: PIG-2606.2.patch.txt, PIG-2606.patch.txt


 grunt> l = load 'x';
 grunt> u = union l, l;
 2012-03-16 18:48:45,687 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. Union with Count(Operand) < 2
 grunt> a = load 'a0.txt' as (a0, a1);
 grunt> b = join a by a0, a by a1;
 2013-08-27 13:36:21,807 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2225: Projection with nothing to reference!



[jira] [Commented] (PIG-3441) Allow Pig to use default resources from Configuration objects

2013-08-29 Thread Bhooshan Mogal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754067#comment-13754067
 ] 

Bhooshan Mogal commented on PIG-3441:
-

I sort of see your point now. It seems like 
[PIG-3135|https://issues.apache.org/jira/browse/PIG-3135] would work only if 
resources are added to Configuration objects as confObject.addResource() and 
not Configuration.addDefaultResource(), since loadDefaults is set to false in 
ConfigurationUtil.toConfiguration()?

Unless the standard configuration files are added as configObject.addResource() 
somewhere in the code? 
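
The distinction raised here can be illustrated with a toy model. MiniConf 
below is a hypothetical stand-in written for this example; it only mimics the 
shape of Hadoop's Configuration (a static list of default resources, 
per-instance resources, and a loadDefaults constructor flag) and is not 
Hadoop's code. Whether ConfigurationUtil.toConfiguration() really passes 
loadDefaults as false is the open question above, not something this sketch 
confirms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: imitates the shape of Hadoop's Configuration, not its code.
class MiniConf {
    // Shared by all instances, like resources added via addDefaultResource().
    private static final List<Map<String, String>> DEFAULTS = new ArrayList<>();
    // Owned by one instance, like resources added via addResource().
    private final List<Map<String, String>> resources = new ArrayList<>();
    private final boolean loadDefaults;

    MiniConf(boolean loadDefaults) {
        this.loadDefaults = loadDefaults;
    }

    static void addDefaultResource(Map<String, String> resource) {
        DEFAULTS.add(resource);
    }

    void addResource(Map<String, String> resource) {
        resources.add(resource);
    }

    String get(String key) {
        Map<String, String> merged = new HashMap<>();
        // Static defaults are consulted only when loadDefaults is true.
        if (loadDefaults) {
            for (Map<String, String> r : DEFAULTS) {
                merged.putAll(r);
            }
        }
        // Instance resources are always consulted and override defaults.
        for (Map<String, String> r : resources) {
            merged.putAll(r);
        }
        return merged.get(key);
    }
}
```

In this model, an object created with loadDefaults set to false never sees a 
statically registered resource, while anything added through addResource() 
on that object is still visible.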



[jira] [Commented] (PIG-3349) Document ToString(Datetime, String) UDF

2013-08-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754150#comment-13754150
 ] 

Daniel Dai commented on PIG-3349:
-

+1. We also need to complete the type conversion table later.

 Document ToString(Datetime, String) UDF
 ---

 Key: PIG-3349
 URL: https://issues.apache.org/jira/browse/PIG-3349
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.11.1
Reporter: pat chan
Assignee: Cheolsoo Park
Priority: Minor
 Fix For: 0.12

 Attachments: PIG-3349.patch


 Currently you can't cast a datetime object into a chararray:
 grunt> B = foreach A generate (chararray)a; dump B;
 2013-06-05 15:29:01,372 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1052: 
 <line 8, column 24> Cannot cast datetime to chararray
 Details at logfile: /Users/patc/projects/pig-0.11.1/pig_1370471270879.log
 Was this an oversight? The documented casting matrix does not show the 
 datetime object so I'm not sure if the current behavior is correct or not.
 My recommendation would be to support casting to and from strings. Casting 
 from a string would behave exactly like loading a datetime. Casting to a 
 string would be exactly the format you get when you dump a datetime.



[jira] [Commented] (PIG-3285) Jobs using HBaseStorage fail to ship dependency jars

2013-08-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754182#comment-13754182
 ] 

Daniel Dai commented on PIG-3285:
-

That sounds good. We can switch to it for newer versions of HBase once 
HBASE-9165 is committed. 

 Jobs using HBaseStorage fail to ship dependency jars
 

 Key: PIG-3285
 URL: https://issues.apache.org/jira/browse/PIG-3285
 Project: Pig
  Issue Type: Bug
Reporter: Nick Dimiduk
Assignee: Nick Dimiduk
 Fix For: 0.11.1

 Attachments: 0001-PIG-3285-Add-HBase-dependency-jars.patch, 
 0001-PIG-3285-Add-HBase-dependency-jars.patch, 1.pig, 1.txt, 2.pig


 Launching a job consuming {{HBaseStorage}} fails out of the box. The user 
 must specify {{-Dpig.additional.jars}} for HBase and all of its dependencies. 
 Exceptions look something like this:
 {noformat}
 2013-04-19 18:58:39,360 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoClassDefFoundError: com/google/protobuf/Message
   at org.apache.hadoop.hbase.io.HbaseObjectWritable.<clinit>(HbaseObjectWritable.java:266)
   at org.apache.hadoop.hbase.ipc.Invocation.write(Invocation.java:139)
   at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:612)
   at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:975)
   at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:84)
   at $Proxy7.getProtocolVersion(Unknown Source)
   at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:136)
   at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
 {noformat}



[jira] [Created] (PIG-3445) Make Parquet format available out of the box in Pig

2013-08-29 Thread Julien Le Dem (JIRA)
Julien Le Dem created PIG-3445:
--

 Summary: Make Parquet format available out of the box in Pig
 Key: PIG-3445
 URL: https://issues.apache.org/jira/browse/PIG-3445
 Project: Pig
  Issue Type: Improvement
Reporter: Julien Le Dem


We would add the Parquet jar to the Pig packages to make it available out of 
the box to Pig users.
On top of that, we could add the parquet.pig package to the list of packages 
searched for UDFs. (Alternatively, the Parquet jar could contain classes named 
org.apache.pig.builtin.ParquetLoader and ParquetStorer.)
This way users can use Parquet simply by typing:
A = LOAD 'foo' USING ParquetLoader();
STORE A INTO 'bar' USING ParquetStorer();



Parquet support built in Pig

2013-08-29 Thread Julien Le Dem
Hello fellow Pig developers
I have opened a JIRA to add Parquet as a built-in format in Pig:
https://issues.apache.org/jira/browse/PIG-3445
Please let me know what you think.
Julien

Re: Parquet support built in Pig

2013-08-29 Thread Russell Jurney
I think this is awesome. Best thing since diet sliced bread (they cut the
slices thin).


On Thu, Aug 29, 2013 at 4:36 PM, Julien Le Dem jul...@ledem.net wrote:

 Hello fellow Pig developers
 I have opened a JIRA to add Parquet as a built-in format in Pig:
 https://issues.apache.org/jira/browse/PIG-3445
 Please let me know what you think.
 Julien




-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


[jira] [Commented] (PIG-3419) Pluggable Execution Engine

2013-08-29 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754235#comment-13754235
 ] 

Dmitriy V. Ryaboy commented on PIG-3419:


[~billgraham] looping you in for Ambrose.



[jira] [Commented] (PIG-3285) Jobs using HBaseStorage fail to ship dependency jars

2013-08-29 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754251#comment-13754251
 ] 

Nick Dimiduk commented on PIG-3285:
---

[~daijy] would you mind commenting positively on the HBase ticket as well? 
Thanks.



[jira] Subscription: PIG patch available

2013-08-29 Thread jira
Issue Subscription
Filter: PIG patch available (19 issues)

Subscriber: pigdaily

Key         Summary
PIG-3441    Allow Pig to use default resources from Configuration objects
            https://issues.apache.org/jira/browse/PIG-3441
PIG-3434    Null subexpression in bincond nullifies outer tuple (or bag)
            https://issues.apache.org/jira/browse/PIG-3434
PIG-3431    Return more information for parsing related exceptions.
            https://issues.apache.org/jira/browse/PIG-3431
PIG-3430    Add xml format for explaining MapReduce Plan.
            https://issues.apache.org/jira/browse/PIG-3430
PIG-3426    Add support for removing s3 files
            https://issues.apache.org/jira/browse/PIG-3426
PIG-3419    Pluggable Execution Engine
            https://issues.apache.org/jira/browse/PIG-3419
PIG-3374    CASE and IN fail when expression includes dereferencing operator
            https://issues.apache.org/jira/browse/PIG-3374
PIG-3349    Document ToString(Datetime, String) UDF
            https://issues.apache.org/jira/browse/PIG-3349
PIG-3346    New property that controls the number of combined splits
            https://issues.apache.org/jira/browse/PIG-3346
PIG-        Fix remaining Windows core unit test failures
            https://issues.apache.org/jira/browse/PIG-
PIG-3325    Adding a tuple to a bag is slow
            https://issues.apache.org/jira/browse/PIG-3325
PIG-3295    Casting from bytearray failing after Union (even when each field is from a single Loader)
            https://issues.apache.org/jira/browse/PIG-3295
PIG-3292    Logical plan invalid state: duplicate uid in schema during self-join to get cross product
            https://issues.apache.org/jira/browse/PIG-3292
PIG-3257    Add unique identifier UDF
            https://issues.apache.org/jira/browse/PIG-3257
PIG-3255    Avoid extra byte array copy in streaming deserialize
            https://issues.apache.org/jira/browse/PIG-3255
PIG-3199    Expose LogicalPlan via PigServer API
            https://issues.apache.org/jira/browse/PIG-3199
PIG-3117    A debug mode in which pig does not delete temporary files
            https://issues.apache.org/jira/browse/PIG-3117
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3021    Split results missing records when there is null values in the column comparison
            https://issues.apache.org/jira/browse/PIG-3021

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384