[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1551:
---

Status: Open  (was: Patch Available)

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch, PIG_1551.2.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are OK with sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils.
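
The idea described above can be sketched with plain Java reflection. This is a minimal illustration under stated assumptions, not the invoker code from PIG-1354 or the attached patches: the class DynamicInvokerSketch, its helpers, and the use of a java.util.List as a stand-in for a Pig bag are all hypothetical.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: not the invoker classes added by PIG-1354 or PIG-1551. */
class DynamicInvokerSketch {

    /** Stand-in for a library method that takes a primitive array. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return values.length == 0 ? Double.NaN : sum / values.length;
    }

    /** Stand-in for a method that takes no arguments. */
    static String greeting() {
        return "hello";
    }

    /** Copies a "bag" (modeled here as a List) into the double[] the method expects. */
    static double[] bagToDoubleArray(List<? extends Number> bag) {
        double[] out = new double[bag.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = bag.get(i).doubleValue();
        }
        return out;
    }

    /** Looks up a static method by name and parameter types, then invokes it. */
    static Object invokeStatic(Class<?> owner, String name, Class<?>[] paramTypes, Object... args) {
        try {
            Method m = owner.getDeclaredMethod(name, paramTypes);
            return m.invoke(null, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not invoke " + name, e);
        }
    }

    public static void main(String[] unused) {
        List<Integer> bag = Arrays.asList(1, 2, 3, 4);
        // Cast to Object so the array is passed as one argument, not spread as varargs.
        Object mean = invokeStatic(DynamicInvokerSketch.class, "mean",
                new Class<?>[] {double[].class}, (Object) bagToDoubleArray(bag));
        Object hello = invokeStatic(DynamicInvokerSketch.class, "greeting", new Class<?>[0]);
        System.out.println(mean + " " + hello);
    }
}
```

The reflective lookup on every call is where the speed cost mentioned above would come from; a real implementation would presumably cache the Method object.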

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1551:
---

Attachment: PIG_1551.2.patch

Attaching patch that fixes the two errors Richard pointed out.





[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1551:
---

Status: Patch Available  (was: Open)




[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-24 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-1205:


Attachment: PIG_1205_8.patch

Several updates that continue Dmitriy's work.

1. Add unit tests for HBaseStorage
  - code refactoring of TestHBaseStorage
  - add unit tests for the parameters gt, lt, gte, lte, and limit, and for 
HBaseBinaryConverter.

2.  Update hbase 0.20 to hbase 0.20.6 (Dmitriy, I found that HBaseStorage does 
not work on hbase 0.20; did you also manually test on hbase 0.20.6 rather than 0.20.0?)

3.  I think we need more documentation for HBaseStorage, especially for the 
LoadCaster: if a user specifies the wrong LoadCaster, they will get confusing results.
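
To illustrate why a mismatched caster gives confusing results, here is a minimal sketch that uses only JDK calls. It assumes two encodings for a long value, 8 big-endian bytes and UTF-8 decimal text; the class CasterMismatchSketch and its methods are hypothetical stand-ins, not the HBase or Pig converters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustrative only: plain JDK calls standing in for HBase/Pig converters. */
class CasterMismatchSketch {

    /** Binary encoding: the long as 8 big-endian bytes. */
    static byte[] longToBinary(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    /** Text encoding: the decimal digits as UTF-8. */
    static byte[] longToUtf8(long value) {
        return Long.toString(value).getBytes(StandardCharsets.UTF_8);
    }

    /** Binary-style decode; returns null when the cell is not exactly 8 bytes. */
    static Long binaryToLong(byte[] cell) {
        return cell.length == Long.BYTES ? ByteBuffer.wrap(cell).getLong() : null;
    }

    /** Text-style decode; returns null when the cell is not a decimal number. */
    static Long utf8ToLong(byte[] cell) {
        try {
            return Long.parseLong(new String(cell, StandardCharsets.UTF_8));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] unused) {
        byte[] binaryCell = longToBinary(42L);
        byte[] textCell = longToUtf8(12345678L); // happens to be 8 bytes long
        System.out.println("binary cell, binary caster: " + binaryToLong(binaryCell));
        System.out.println("binary cell, text caster:   " + utf8ToLong(binaryCell));
        System.out.println("text cell, text caster:     " + utf8ToLong(textCell));
        // No error here: the digits are silently reinterpreted as a different number.
        System.out.println("text cell, binary caster:   " + binaryToLong(textCell));
    }
}
```

The last case is the troublesome one: nothing fails, and the value that comes back is simply wrong.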

  


 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, 
 PIG_1205_8.patch







[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901787#action_12901787
 ] 

Dmitriy V. Ryaboy commented on PIG-1205:


Jeff,
Thanks a lot for pitching in with the tests!

I was using 0.20.0 and the old tests passed. I've only tested the binary 
conversion stuff and other new features  on the Twitter machines, and they do 
run a later HBase version -- perhaps the incompatibility is in the filters or 
binary casters code?
Do you know which tests fail with 0.20.0?

I will definitely add a bunch of documentation.




[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-24 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901799#action_12901799
 ] 

Jeff Zhang commented on PIG-1205:
-

Dmitriy,

The test cases testLoadWithParameters_1 and testLoadWithParameters_2 fail 
when using hbase 0.20.
I think TableInputFormat was updated (maybe a bug fix) between hbase 0.20 and 
hbase 0.20.6.

The following is the log:
10/08/24 17:28:00 ERROR mapReduceLayer.Launcher: Backend error message during job submission
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hbase://pigtable_1
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:365)
    at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:347)
    at org.apache.hadoop.hbase.filter.CompareFilter.readFields(CompareFilter.java:132)
    at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:418)
    at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:347)
    at org.apache.hadoop.hbase.filter.FilterList.readFields(FilterList.java:204)
    at org.apache.hadoop.hbase.client.Scan.readFields(Scan.java:523)
    at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertStringToScan(TableMapReduceUtil.java:94)
    at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:79)
    at org.apache.pig.backend.hadoop.hbase.HBaseTableInputFormat$HBaseTableIFBuilder.build(HBaseTableInputFormat.java:77)
    at org.apache.pig.backend.hadoop.hbase.HBaseStorage.getInputFormat(HBaseStorage.java:268)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:257)
    ... 7 more

10/08/24 17:28:00 ERROR pigstats.PigStats: ERROR 2118: Unable to create input splits for: hbase://pigtable_1
10/08/24 17:28:00 ERROR pigstats.PigStatsUtil: 1 map reduce job(s) failed!
10/08/24 17:28:00 INFO pigstats.PigStats: Script Statistics: 




[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-24 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1343:
---

Attachment: 1343.patch

This patch generates an error in the case where a job has failed but MR does 
not return any exception.

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $ java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log
 $ ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed, but no log file was written, so the only way to debug was to 
 look into the Jobtracker logs.
 Here are some possible causes of this behavior:
 1) The underlying filer/NFS had some issues. In that case, should we not 
 report an error on stdout?
 2) Some errors from the backend are not being captured.
 Viraj
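
One way to handle the first case can be sketched as follows: when the log file cannot be written, report the message and the failure on stderr instead of dropping them. This is a hedged illustration only; LogFallbackSketch and logError are hypothetical names and do not reflect Pig's logging code or the attached patches.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/** Illustrative only: not Pig's actual logging code. */
class LogFallbackSketch {

    /**
     * Appends the message to the log file. If that fails, reports both the
     * failure and the message on stderr. Returns true when the file was written.
     */
    static boolean logError(Path logFile, String message) {
        try {
            Files.write(logFile,
                    (message + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            return true;
        } catch (IOException e) {
            System.err.println("Could not write to " + logFile + ": " + e);
            System.err.println(message);
            return false;
        }
    }

    public static void main(String[] unused) {
        Path good = Paths.get(System.getProperty("java.io.tmpdir"),
                "pig_sketch_" + System.nanoTime() + ".log");
        // A path underneath a regular file cannot be created, so this write fails.
        Path bad = good.resolve("child.log");
        System.out.println("good path written: " + logError(good, "ERROR: job failed"));
        System.out.println("bad path written:  " + logError(bad, "ERROR: job failed"));
    }
}
```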




[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-24 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1343:
---

Status: Patch Available  (was: Open)




[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-24 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901924#action_12901924
 ] 

Jeff Zhang commented on PIG-1205:
-

Dmitriy, 

I found the problem. It is a bug in hbase 0.20.0 in the serialization of 
filters (https://issues.apache.org/jira/browse/HBASE-1830).
I think we should update hbase to 0.20.6 in pig; 0.20.6 is compatible with 
0.20.0.






[jira] Commented: (PIG-506) Does pig need a NATIVE keyword?

2010-08-24 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901946#action_12901946
 ] 

Thejas M Nair commented on PIG-506:
---

Unit tests passed, and I committed the changes. However, the tests fail with 
the latest changes that switch to the new logical plan, so I have added the 
test cases to the exclude list in build.xml.
Keeping the jira open until this is fixed.


 Does pig need a NATIVE keyword?
 ---

 Key: PIG-506
 URL: https://issues.apache.org/jira/browse/PIG-506
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Aniket Mokashi
Priority: Minor
 Fix For: 0.8.0

 Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, 
 NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.patch, 
 TestWordCount.jar


 Assume a user had a job that broke easily into three pieces.  Further assume 
 that pieces one and three were easily expressible in pig, but that piece two 
 needed to be written in map reduce for whatever reason (performance, 
 something that pig could not easily express, legacy job that was too 
 important to change, etc.).  Today the user would either have to use map 
 reduce for the entire job or manually handle the stitching together of pig 
 and map reduce jobs.  What if instead pig provided a NATIVE keyword that 
 would allow the script to pass off the data stream to the underlying system 
 (in this case map reduce)?  The semantics of NATIVE would vary by underlying 
 system.  In the map reduce case, we would assume that this indicated a 
 collection of one or more fully contained map reduce jobs, so that pig would 
 store the data, invoke the map reduce jobs, and then read the resulting data 
 to continue.  It might look something like this:
 {code}
 A = load 'myfile';
 X = load 'myotherfile';
 B = group A by $0;
 C = foreach B generate group, myudf(B);
 D = native (jar=mymr.jar, infile=frompig outfile=topig);
 E = join D by $0, X by $0;
 ...
 {code}
 This differs from streaming in that it allows the user to insert an arbitrary 
 amount of native processing, whereas streaming allows the insertion of one 
 binary.  It also differs in that, for streaming, data is piped directly into 
 and out of the binary as part of the pig pipeline.  Here the pipeline would 
 be broken, data written to disk, and the native block invoked, then data read 
 back from disk.
 Another alternative is to say this is unnecessary because the user can do the 
 coordination from java, using the PigServer interface to run pig and calling 
 the map reduce job explicitly.  The advantages of the native keyword are that 
 the user need not worry about coordination between the jobs, since pig will 
 take care of it, and that the user can make use of existing java applications 
 without being a java programmer.
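
The store, invoke, and reload sequence described above can be modeled with local files. The sketch below is only an illustration of that handoff under stated assumptions: it does not use Pig or Hadoop, and NativeHandoffSketch, runWithNativeStep, and upperCaseJob are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Illustrative only: models the proposed handoff with local files, not Pig or Hadoop. */
class NativeHandoffSketch {

    /**
     * Stores the upstream records, runs the opaque "native" step on the stored
     * file, and loads whatever that step wrote so the pipeline can continue.
     */
    static List<String> runWithNativeStep(List<String> upstream, UnaryOperator<Path> nativeJob) {
        try {
            Path stored = Files.createTempFile("frompig_", ".txt");
            Files.write(stored, upstream);            // the pipeline is broken: data goes to disk
            Path produced = nativeJob.apply(stored);  // the native job reads and writes files
            return Files.readAllLines(produced);      // the pipeline resumes from disk
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** A stand-in native job: upper-cases every line into a new file. */
    static Path upperCaseJob(Path input) {
        try {
            List<String> out = new ArrayList<>();
            for (String line : Files.readAllLines(input)) {
                out.add(line.toUpperCase(Locale.ROOT));
            }
            Path output = Files.createTempFile("topig_", ".txt");
            Files.write(output, out);
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] unused) {
        List<String> result = runWithNativeStep(
                Arrays.asList("a\t1", "b\t2"), NativeHandoffSketch::upperCaseJob);
        System.out.println(result);
    }
}
```

Unlike streaming, nothing is piped between the stages here; the only contract between them is the pair of files, which matches the infile and outfile parameters in the example script.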




Re: Caster interface and byte conversion

2010-08-24 Thread Alan Gates
This seems fine.  Is the Pig engine at any point testing to see if the  
interface is implemented and if so calling toBytes, or is this totally  
for use inside the store functions themselves to serialize Pig data  
types?


Alan.

On Aug 22, 2010, at 1:40 AM, Dmitriy Ryaboy wrote:

The current HBase patch on PIG-1205 (patch 7) includes this refactoring.
Please take a look if you have concerns.
Or just if you feel like reviewing the code... :)

-D

On Sat, Aug 21, 2010 at 5:22 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

I just noticed that even though Utf8StorageConverter implements the various
byte[] toBytes(Obj o) methods, they are not part of the LoadCaster interface
-- and therefore can't be relied on when using modular Casters, like I am
trying to do for the HBaseLoader.

Since we don't want to introduce backwards-incompatible changes, I propose
adding a ByteCaster interface that defines these methods, and extending
Utf8StorageConverter to implement them (without actually changing the
implementation at all).
That way StoreFuncs that need to convert to bytes can use pluggable
converters. Objections?

-D
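
A rough sketch of the proposal quoted above, assuming a much smaller method set than the real converters have. ByteCasterSketch, Utf8ByteCasterSketch, and PluggableStoreSketch are hypothetical stand-ins, not Pig's interfaces or the code in the PIG-1205 patch.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: simplified stand-in, not Pig's real interface. */
interface ByteCasterSketch {
    byte[] toBytes(String value);

    byte[] toBytes(Integer value);

    byte[] toBytes(Long value);
}

/** Text-based converter, analogous in spirit to Utf8StorageConverter. */
class Utf8ByteCasterSketch implements ByteCasterSketch {
    @Override
    public byte[] toBytes(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public byte[] toBytes(Integer value) {
        return toBytes(value.toString());
    }

    @Override
    public byte[] toBytes(Long value) {
        return toBytes(value.toString());
    }
}

/** A store-side component that depends on the interface, not on one converter. */
class PluggableStoreSketch {
    private final ByteCasterSketch caster;

    PluggableStoreSketch(ByteCasterSketch caster) {
        this.caster = caster;
    }

    byte[] serialize(Long cell) {
        return caster.toBytes(cell);
    }
}
```

Because the store side only sees the interface, a binary converter could be swapped in without touching PluggableStoreSketch.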





Re: is Hudson awol?

2010-08-24 Thread Alan Gates
Yes, our friend Hudson is ill again.  Giri, Hudson's doctor, should  
get a chance to look at it in a few days.


Alan.

On Aug 23, 2010, at 3:31 PM, Dmitriy Ryaboy wrote:


Haven't heard anything from Hudson in a while...

-D




[jira] Updated: (PIG-1560) Build target 'checkstyle' fails

2010-08-24 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-1560:


Attachment: pig-1560.patch

This patch fixes the checkstyle target build failure.

 Build target 'checkstyle' fails
 ---

 Key: PIG-1560
 URL: https://issues.apache.org/jira/browse/PIG-1560
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Giridharan Kesavan
 Fix For: 0.8.0

 Attachments: pig-1560.patch


 Stack trace:
 {code}
 /trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
     at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130)
     at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73)
     at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222)
     at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372)
     at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304)
     at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265)
     at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
     at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:597)
     at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
     at org.apache.tools.ant.Task.perform(Task.java:348)
     at org.apache.tools.ant.Target.execute(Target.java:390)
     at org.apache.tools.ant.Target.performTasks(Target.java:411)
     at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360)
     at org.apache.tools.ant.Project.executeTarget(Project.java:1329)
     at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
     at org.apache.tools.ant.Project.executeTargets(Project.java:1212)
     at org.apache.tools.ant.Main.runBuild(Main.java:801)
     at org.apache.tools.ant.Main.startAnt(Main.java:218)
     at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
     at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
 Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
     at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386)
     at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336)
     at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074)
     at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
     ... 22 more
 {code}
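
The trace above indicates that org.apache.commons.logging.LogFactory was not visible to the class loader running the checkstyle task. A generic way to probe for a class is sketched below. It is an assumption-laden illustration: ClasspathProbeSketch is a hypothetical helper, it checks the JVM it runs in rather than Ant's task class loader, and it is unrelated to the attached patch.

```java
/** Illustrative only: a quick classpath probe, unrelated to the attached patch. */
class ClasspathProbeSketch {

    /** Returns true when the named class can be found by this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class.forName(className, false, ClasspathProbeSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] unused) {
        String name = "org.apache.commons.logging.LogFactory";
        System.out.println(name + " present: " + isOnClasspath(name));
    }
}
```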




[jira] Updated: (PIG-1311) Pig interfaces should be clearly classified in terms of scope and stability

2010-08-24 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1311:


Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch checked in.

 Pig interfaces should be clearly classified in terms of scope and stability
 ---

 Key: PIG-1311
 URL: https://issues.apache.org/jira/browse/PIG-1311
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.8.0

 Attachments: PIG-1311.patch


 Clearly marking Pig interfaces (Java interfaces but also things like config 
 files, CLIs, Pig Latin syntax and semantics, etc.) to show scope 
 (public/private) and stability (stable/evolving/unstable) will help users 
 understand how to interact with Pig and developers to understand what things 
 they can and cannot change.
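
For Java interfaces, one common way to carry such labels is with annotations. The sketch below assumes a single hypothetical annotation, Classified, with scope and stability values taken from the description above; it is not the set of annotations checked in with PIG-1311.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

/** Illustrative only: hypothetical annotations, not the ones checked in for PIG-1311. */
class ClassificationSketch {

    enum Scope { PUBLIC, PRIVATE }

    enum Stability { STABLE, EVOLVING, UNSTABLE }

    /** Runtime retention so the label can be read back through reflection. */
    @Retention(RetentionPolicy.RUNTIME)
    @interface Classified {
        Scope scope();

        Stability stability();
    }

    @Classified(scope = Scope.PUBLIC, stability = Stability.STABLE)
    interface UserFacingApi { }

    @Classified(scope = Scope.PRIVATE, stability = Stability.UNSTABLE)
    interface InternalHelper { }

    /** Describes a type's label, or reports that it is unlabeled. */
    static String describe(Class<?> type) {
        Classified c = type.getAnnotation(Classified.class);
        return c == null
                ? type.getSimpleName() + ": unclassified"
                : type.getSimpleName() + ": " + c.scope() + "/" + c.stability();
    }

    public static void main(String[] unused) {
        System.out.println(describe(UserFacingApi.class));
        System.out.println(describe(InternalHelper.class));
        System.out.println(describe(String.class));
    }
}
```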




[jira] Resolved: (PIG-1503) Label interfaces for audience and stability in org.apache.pig.backend package

2010-08-24 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates resolved PIG-1503.
-

Resolution: Duplicate

The remaining interfaces were labeled as part of PIG-1311.

 Label interfaces for audience and stability in org.apache.pig.backend package
 -

 Key: PIG-1503
 URL: https://issues.apache.org/jira/browse/PIG-1503
 Project: Pig
  Issue Type: Sub-task
  Components: documentation
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Minor
 Fix For: 0.8.0


 This includes the datastorage and executionengine packages under backend.




Re: Caster interface and byte conversion

2010-08-24 Thread Alan Gates
One other comment.  By making this part of an interface that extends  
LoadCaster you are assuming the implementing class is both a load and  
store function.  It makes more sense to have a separate StoreCaster  
interface rather than extending LoadCaster.
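
The difference can be sketched with two independent interfaces. The names below (LoadSideCaster, StoreSideCaster, StoreOnlyCaster, TwoWayCaster) are hypothetical and reduced to one method each; they are not the interfaces in Pig or in the patch under review.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: simplified stand-in for the load-side interface. */
interface LoadSideCaster {
    Long bytesToLong(byte[] bytes);
}

/** Illustrative only: a separate store-side interface that does not extend the load side. */
interface StoreSideCaster {
    byte[] toBytes(Long value);
}

/** A store-only component implements the store side and nothing else. */
class StoreOnlyCaster implements StoreSideCaster {
    @Override
    public byte[] toBytes(Long value) {
        return value.toString().getBytes(StandardCharsets.UTF_8);
    }
}

/** A converter used in both directions opts in to both interfaces. */
class TwoWayCaster implements LoadSideCaster, StoreSideCaster {
    @Override
    public Long bytesToLong(byte[] bytes) {
        return Long.valueOf(new String(bytes, StandardCharsets.UTF_8));
    }

    @Override
    public byte[] toBytes(Long value) {
        return value.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

With this split, a class that only stores data is not forced to supply load-side methods it has no use for.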


Alan.

On Aug 24, 2010, at 9:18 AM, Alan Gates wrote:


This seems fine.  Is the Pig engine at any point testing to see if the
interface is implemented and if so calling toBytes, or is this totally
for use inside the store functions themselves to serialize Pig data
types?

Alan.








[jira] Resolved: (PIG-1558) build.xml for site directory does not work

2010-08-24 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates resolved PIG-1558.
-

Resolution: Fixed

Patch checked in.

 build.xml for site directory does not work
 --

 Key: PIG-1558
 URL: https://issues.apache.org/jira/browse/PIG-1558
 Project: Pig
  Issue Type: Bug
  Components: build
Affects Versions: 0.8.0
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1558.patch


 Going to the site directory and running ant produces:  
 {code}
 ant 
 Buildfile: build.xml
 clean:
 [delete] Deleting directory /Users/gates/src/pig/apache/site/author/build
 update:
 BUILD FAILED
 /Users/gates/src/pig/apache/site/build.xml:6: Execute failed: java.io.IOException: Cannot run program forrest (in directory /Users/gates/src/pig/apache/site/author): error=2, No such file or directory
 {code}
 Also, forrest here still requires Java 1.5, which can be fixed (see PIG-1508).




[jira] Updated: (PIG-1559) Several things stated in Pig philosophy page are out of date

2010-08-24 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1559:


Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch checked in.

 Several things stated in Pig philosophy page are out of date
 

 Key: PIG-1559
 URL: https://issues.apache.org/jira/browse/PIG-1559
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.7.0
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1559.patch


 The Pig philosophy page says several things that are no longer true (such as 
 that Pig does not have an optimizer (it does now), that we someday hope to 
 support streaming (we already do), that we some day hope to control splits 
 (we don't, we just use what Hadoop gives us now)).  These need to be updated 
 to reflect the current situation.




[jira] Updated: (PIG-1562) Fix the version for the dependent packages for the maven

2010-08-24 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1562:


Fix Version/s: 0.8.0

 Fix the version for the dependent packages for the maven 
 -

 Key: PIG-1562
 URL: https://issues.apache.org/jira/browse/PIG-1562
 Project: Pig
  Issue Type: Bug
Reporter: niraj rai
Assignee: niraj rai
 Fix For: 0.8.0


 We need to fix the version setting so that the version is properly set for the 
 dependent packages in the maven repository.




[jira] Commented: (PIG-1560) Build target 'checkstyle' fails

2010-08-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901975#action_12901975
 ] 

Olga Natkovich commented on PIG-1560:
-

Please commit.




[jira] Commented: (PIG-1559) Several things stated in Pig philosophy page are out of date

2010-08-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901979#action_12901979
 ] 

Olga Natkovich commented on PIG-1559:
-

Looks like the limit issue I was seeing has been addressed in the latest trunk. 

I think we need to add unit tests to catch these things in the future.

 Several things stated in Pig philosophy page are out of date
 

 Key: PIG-1559
 URL: https://issues.apache.org/jira/browse/PIG-1559
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.7.0
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1559.patch


 The Pig philosophy page says several things that are no longer true (such as 
 that Pig does not have an optimizer (it does now), that we someday hope to 
 support streaming (we already do), that we some day hope to control splits 
 (we don't, we just use what Hadoop gives us now)).  These need to be updated 
 to reflect the current situation.




[jira] Commented: (PIG-1559) Several things stated in Pig philosophy page are out of date

2010-08-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901984#action_12901984
 ] 

Olga Natkovich commented on PIG-1559:
-

sorry, wrong JIRA

 Several things stated in Pig philosophy page are out of date
 

 Key: PIG-1559
 URL: https://issues.apache.org/jira/browse/PIG-1559
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.7.0
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1559.patch


 The Pig philosophy page says several things that are no longer true (such as 
 that Pig does not have an optimizer (it does now), that we someday hope to 
 support streaming (we already do), that we some day hope to control splits 
 (we don't, we just use what Hadoop gives us now)).  These need to be updated 
 to reflect the current situation.




[jira] Commented: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901985#action_12901985
 ] 

Olga Natkovich commented on PIG-1557:
-

Looks like the limit issue I was seeing has been addressed in the latest trunk. 

I think we need to add unit tests to catch these things in the future.



 couple of issue mapping aliases to jobs
 ---

 Key: PIG-1557
 URL: https://issues.apache.org/jira/browse/PIG-1557
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1557.patch


 I have a simple script:
 A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 B = group A by name;
 C = foreach B generate group, COUNT(A);
 D = order C by $1;
 E = limit D 10;
 dump E;
 I noticed a couple of issues with alias to job mapping: neither load(A) nor 
 limit(E) shows in the output




[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901992#action_12901992
 ] 

Richard Ding commented on PIG-1551:
---


The typo is still there:

{code}
private static final Class<?> LONG_ARRAY_CLASS = new Long[0].getClass();
{code}

It seems what you want is 

{code}
private static final Class<?> LONG_ARRAY_CLASS = new long[0].getClass();
{code}

so it's consistent with other array classes.

This does raise a question about array parameters: the first form applies to 
methods like _amethod(Long[] nums)_, while the second supports methods like 
_amethod(long[] nums)_. The two are not interchangeable. 
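
The distinction can be checked with plain Java reflection. The sketch below is only an illustration: `ArrayClassDemo` and its `sum` method are hypothetical names invented for this example and are not part of the Pig patch. It relies on `Class.getMethod` matching parameter types exactly, so a lookup with `Long[]` does not find a method declared with `long[]`.

```java
import java.lang.reflect.Method;

// Hypothetical demo class, not part of Pig.
final class ArrayClassDemo {

    // A target method that takes a primitive long[] parameter.
    public static long sum(long[] nums) {
        long total = 0;
        for (long n : nums) {
            total += n;
        }
        return total;
    }

    // Reports whether long[] and Long[] are represented by the same Class object.
    public static boolean sameArrayClass() {
        Class<?> primitive = new long[0].getClass();
        Class<?> boxed = new Long[0].getClass();
        return primitive.equals(boxed);
    }

    // True if a public method with this name and exactly this single parameter type exists.
    public static boolean hasMethod(String name, Class<?> paramType) {
        try {
            Method m = ArrayClassDemo.class.getMethod(name, paramType);
            return m != null;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("same class: " + sameArrayClass());
        System.out.println("sum(long[]) found: " + hasMethod("sum", long[].class));
        System.out.println("sum(Long[]) found: " + hasMethod("sum", Long[].class));
    }
}
```

Supporting both forms would therefore mean trying each candidate signature in turn, since the reflective lookup does no boxing conversion on array types.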

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch, PIG_1551.2.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902008#action_12902008
 ] 

Dmitriy V. Ryaboy commented on PIG-1205:


Ok, let's upgrade to 20.6 then. We could work around this by serializing the filters 
ourselves and applying them to the scan when reading the UDFContext, but that seems 
a bit overboard, and folks should be upgrading anyway. 

*Committers*: this is ready for review.



 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, 
 PIG_1205_8.patch







[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1551:
---

Attachment: PIG_1551.3.patch

Ugh. Thank you for catching that -- fixed, and added a test to make sure it 
stays fixed.

The particular set of methods I needed this for used primitives, so that's what 
I did. It's a bit tricky to add support for Long, Double, etc arrays, as I 
would have to check all combinations of possible method signatures when seeing 
things like (int[], int[], int[]) -- it becomes fairly ugly code. Do you think 
this is particularly compelling? I can't really think of methods that take 
arrays of Number classes; usually, if you start using Numbers, you are also 
using Collections, not plain arrays.

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-24 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902030#action_12902030
 ] 

Richard Ding commented on PIG-1343:
---

The log file is created when running in batch mode, but not in interactive mode.

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything; the only way to 
 debug was to look into the Jobtracker logs.
 Here are some possible causes of this behavior:
 1) The underlying filer/NFS had some issues. In that case, should we not report an error on 
 stdout?
 2) There are some errors from the backend which are not being captured
 Viraj




[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902042#action_12902042
 ] 

Richard Ding commented on PIG-1551:
---

+1.

I'm fine with arrays of primitive types. I can't think of a Java method that 
uses an array of object Long as a parameter.

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




Re: Caster interface and byte conversion

2010-08-24 Thread Dmitriy Ryaboy
As far as the toBytes methods -- I am not sure what they were originally
for. They aren't actually called anywhere that I can find, except my new
HBase stuff.
You are right, I could make it two interfaces, but I consolidated them for
simplicity of use/implementation. Now that I think about it, I can put all
the methods into StoreCaster and just have a unioning interface for
simplicity:

@InterfaceAudience.Public
@InterfaceStability.Evolving
public interface LoadStoreCaster extends LoadCaster, StoreCaster {

}

Does that seem ok?
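
For illustration only, here is how the unioning-interface pattern plays out with simplified stand-ins. The `*Sketch` interfaces and their single methods are assumptions made up for this example; they are not Pig's actual LoadCaster or StoreCaster signatures.

```java
import java.nio.charset.StandardCharsets;

// Simplified stand-in for the load side (hypothetical signature).
interface LoadCasterSketch {
    String bytesToString(byte[] bytes);
}

// Simplified stand-in for the store side (hypothetical signature).
interface StoreCasterSketch {
    byte[] toBytes(String value);
}

// The unioning interface adds no methods of its own; it only combines the two.
interface LoadStoreCasterSketch extends LoadCasterSketch, StoreCasterSketch {
}

// One class can serve both directions by implementing the union.
final class Utf8CasterSketch implements LoadStoreCasterSketch {
    @Override
    public String bytesToString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public byte[] toBytes(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }
}
```

A store-only function can then depend on `StoreCasterSketch` alone, while code that needs both directions asks for `LoadStoreCasterSketch`.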

-D

On Tue, Aug 24, 2010 at 10:01 AM, Alan Gates ga...@yahoo-inc.com wrote:

 One other comment.  By making this part of an interface that extends
 LoadCaster you are assuming the implementing class is both a load and store
 function.  It makes more sense to have a separate StoreCaster interface
 rather than extending LoadCaster.

 Alan.


 On Aug 24, 2010, at 9:18 AM, Alan Gates wrote:

  This seems fine.  Is the Pig engine at any point testing to see if the
 interface is implemented and if so calling toBytes, or is this totally
 for use inside the store functions themselves to serialize Pig data
 types?

 Alan.

 On Aug 22, 2010, at 1:40 AM, Dmitriy Ryaboy wrote:

  The current HBase patch on PIG-1205 (patch 7) includes this
 refactoring.
 Please take a look if you have concerns.

 Or just if you feel like reviewing the code... :)

 -D

 On Sat, Aug 21, 2010 at 5:22 PM, Dmitriy Ryaboy dvrya...@gmail.com
 wrote:

  I just noticed that even though Utf8StorageConverter implements the
 various
 byte[] toBytes(Obj o) methods, they are not part of the LoadCaster
 interface
 -- and therefore can't be relied on when using modular Casters, like I
 am
 trying to do for the HBaseLoader.

 Since we don't want to introduce backwards-incompatible changes, I
 propose
 adding a ByteCaster interface that defines these methods, and
 extending
 Utf8StorageConverter to implement them (without actually changing the
 implementation at all).
 That way StoreFuncs that need to convert to bytes can use pluggable
 converters. Objections?

 -D






[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-24 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902065#action_12902065
 ] 

Thejas M Nair commented on PIG-1501:


Comments on the patch -
TFileStorage.java 
- The getSchema() code that determines the schema from data is the same across TFileStorage 
and InterStorage. The code in BinStorage is also the same, except that it 
uses some deprecated functions. It can be moved to a common util class. 
(Yes, I should have moved it to a util class when I created InterStorage.)

TestTmpFileCompression.java
- Both tests check whether TFile is used. I think one test can be changed to 
check that InterStorage is used when compression is not turned on, or a check 
can be added to any other existing test case that runs an MR job, to see if 
InterStorage is used there.
- The log setup code is duplicated between setup and resetLog(). It can be moved to a 
common function.

SampleOptimizer.java
- The following comment can be updated -
// check that it is using BinaryStorage.
to
// check that it is using the temp file storage format.


TFileRecordWriter.java ,
- The comment in the following section does not seem to be valid anymore:
{code}
 public TFileRecordWriter(Path file, String codec, Configuration conf)
+throws IOException {
+// hardcoded to use gzip and 1M as block size: may wish to be made configurable
{code}




 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch, PIG-1501.patch


 We would like to understand how compressing map results as well as 
 reducer output in a chain of MR jobs impacts performance. We can use PigMix 
 queries for this investigation.




[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1551:
---

  Status: Resolved  (was: Patch Available)
Release Note: 
The idea is simple: frequently, Pig users need to use a simple function that is 
already provided by standard Java libraries, but for which a UDF has not been 
written. Dynamic Invokers allow a Pig programmer to refer to Java functions 
without having to wrap them in custom Pig UDFs, at the cost of doing some Java 
reflection on every function call.

{code}
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');
{code}

Currently, Dynamic Invokers can be used for any static function that accepts no 
arguments or some combination of Strings, ints, longs, doubles, floats, or 
arrays of same, and returns a String, an int, a long, a double, or a float. 
Primitives only for the numbers, no capital-letter numeric classes as 
arguments. Depending on the return type, a specific kind of Invoker must be 
used: InvokeForString, InvokeForInt, InvokeForLong, InvokeForDouble, or 
InvokeForFloat.

The DEFINE keyword is used to bind a keyword to a Java method, as above. The 
first argument to the InvokeFor* constructor is the full path to the desired 
method. The second argument is a space-delimited ordered list of the classes of 
the method arguments. This can be omitted or an empty string if the method 
takes no arguments. Valid class names are String, Long, Float, Double, and Int. 
Invokers can also work with array arguments, represented in Pig as DataBags of 
single-tuple elements. Simply refer to string[], for example. Class names are 
not case-sensitive.
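
As a rough illustration of the mechanism described above, the sketch below resolves a full method path and a space-delimited, case-insensitive type list into a reflective call. It is not Pig's InvokeFor* code, which is not shown here: `InvokerSketch`, `classFor`, `resolve`, and `invoke` are hypothetical names, and the sketch passes Java arrays directly rather than converting from Pig bags.

```java
import java.lang.reflect.Method;
import java.util.Locale;

// Hypothetical sketch; Pig's real InvokeFor* classes are not reproduced here.
final class InvokerSketch {

    // Maps a case-insensitive type name to a Java class, using primitives for numbers.
    public static Class<?> classFor(String typeName) {
        switch (typeName.toLowerCase(Locale.ROOT)) {
            case "string":   return String.class;
            case "int":      return int.class;
            case "long":     return long.class;
            case "float":    return float.class;
            case "double":   return double.class;
            case "string[]": return String[].class;
            case "int[]":    return int[].class;
            case "long[]":   return long[].class;
            case "float[]":  return float[].class;
            case "double[]": return double[].class;
            default:
                throw new IllegalArgumentException("Unsupported type: " + typeName);
        }
    }

    // Resolves "package.Class.method" plus a space-delimited type list to a Method.
    // A null or empty type list means the method takes no arguments.
    public static Method resolve(String fullName, String paramSpec)
            throws ReflectiveOperationException {
        int dot = fullName.lastIndexOf('.');
        Class<?> owner = Class.forName(fullName.substring(0, dot));
        String spec = (paramSpec == null) ? "" : paramSpec.trim();
        String[] names = spec.isEmpty() ? new String[0] : spec.split("\\s+");
        Class<?>[] types = new Class<?>[names.length];
        for (int i = 0; i < names.length; i++) {
            types[i] = classFor(names[i]);
        }
        return owner.getMethod(fullName.substring(dot + 1), types);
    }

    // Looks the method up and calls it as a static method; the lookup is repeated per call.
    public static Object invoke(String fullName, String paramSpec, Object... args) {
        try {
            return resolve(fullName, paramSpec).invoke(null, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot invoke " + fullName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(invoke("java.net.URLDecoder.decode", "String String", "a%20b", "UTF-8"));
        System.out.println(invoke("java.lang.System.currentTimeMillis", ""));
    }
}
```

The sketch keeps the lookup inside `invoke` to mirror the per-call reflection cost mentioned in the release note; caching the resolved `Method` would be an obvious refinement.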

The ability to use invokers on methods that take array arguments makes methods 
like those in org.apache.commons.math.stat.StatUtils available for processing 
the results of grouping your datasets, for example. This is very nice, but a 
word of caution: the resulting UDF will of course not be optimized for Hadoop, 
and the very significant benefits one gains from implementing the Algebraic and 
Accumulative interfaces are lost here. Be careful with this one.
  Resolution: Fixed

Committed.

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




[jira] Updated: (PIG-1354) UDFs for dynamic invocation of simple Java methods

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1354:
---

Release Note: Please see PIG-1551 release notes.

 UDFs for dynamic invocation of simple Java methods
 --

 Key: PIG-1354
 URL: https://issues.apache.org/jira/browse/PIG-1354
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch


 The need to create wrapper UDFs for simple Java functions creates unnecessary 
 work for Pig users, slows down the development process, and produces a lot of 
 trivial classes. We can use Java's reflection to allow invoking a number of 
 methods on the fly, dynamically, by creating a generic UDF to accomplish this.




Re: Caster interface and byte conversion

2010-08-24 Thread Alan Gates


On Aug 24, 2010, at 1:22 PM, Dmitriy Ryaboy wrote:

As far as the toBytes methods -- I am not sure what they were originally
for. They aren't actually called anywhere that I can find, except my new
HBase stuff.
You are right, I could make it two interfaces, but I consolidated them for
simplicity of use/implementation. Now that I think about it, I can put all
the methods into StoreCaster and just have a unioning interface for
simplicity:

@InterfaceAudience.Public
@InterfaceStability.Evolving
public interface LoadStoreCaster extends LoadCaster, StoreCaster {

}

Does that seem ok?


Yeah, makes sense.

Alan.













[jira] Created: (PIG-1563) SUBSTRING function is broken

2010-08-24 Thread Olga Natkovich (JIRA)
SUBSTRING function is broken


 Key: PIG-1563
 URL: https://issues.apache.org/jira/browse/PIG-1563
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0


Script:

A = load 'studenttab10k' as (name, age, gpa);
C = foreach A generate SUBSTRING(name, 0,5);
E = limit C 10;
dump E;

Output is always empty:

()
()
()
()
()
()
()
()
()
()





[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1483:
--

Attachment: PIG-1483_1.patch

New patch adding unit test.

 [piggybank] Add HadoopJobHistoryLoader to the piggybank
 ---

 Key: PIG-1483
 URL: https://issues.apache.org/jira/browse/PIG-1483
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1483.patch, PIG-1483_1.patch


 PIG-1333 added many script-related entries to the MR job xml file and thus 
 it's now possible to use Pig for querying Hadoop job history/xml files to get 
 script-level usage statistics. What we need is a Pig loader that can parse 
 these files and generate corresponding data objects.
 The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
 Here is an example that shows the intended usage:
 *Find all the jobs grouped by script and user:*
 {code}
 a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
 (j:map[], m:map[], r:map[]);
 b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
 j#'USER' as user, (Chararray) j#'JOBID' as job; 
 c = filter b by not (id is null);
 d = group c by (id, user);
 e = foreach d generate flatten(group), c.job;
 dump e;
 {code}
 A couple more examples:
 *Find scripts that use only the default parallelism:*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
 c = group b by (id, user, script_name) parallel 10;
 d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
 max_reduces;
 e = filter d by max_reduces == 1;
 dump e;
 {code}
 *Find the running time of each script (in seconds):*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
 end;
 c = group b by (id, user, script_name);
 d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
 MIN(b.start))/1000;
 dump d;
 {code}




[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1483:
--

Status: Patch Available  (was: Open)

 [piggybank] Add HadoopJobHistoryLoader to the piggybank
 ---

 Key: PIG-1483
 URL: https://issues.apache.org/jira/browse/PIG-1483
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1483.patch, PIG-1483_1.patch


 PIG-1333 added many script-related entries to the MR job xml file and thus 
 it's now possible to use Pig for querying Hadoop job history/xml files to get 
 script-level usage statistics. What we need is a Pig loader that can parse 
 these files and generate corresponding data objects.
 The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
 Here is an example that shows the intended usage:
 *Find all the jobs grouped by script and user:*
 {code}
 a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
 (j:map[], m:map[], r:map[]);
 b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
 j#'USER' as user, (Chararray) j#'JOBID' as job; 
 c = filter b by not (id is null);
 d = group c by (id, user);
 e = foreach d generate flatten(group), c.job;
 dump e;
 {code}
 A couple more examples:
 *Find scripts that use only the default parallelism:*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
 c = group b by (id, user, script_name) parallel 10;
 d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
 max_reduces;
 e = filter d by max_reduces == 1;
 dump e;
 {code}
 *Find the running time of each script (in seconds):*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
 end;
 c = group b by (id, user, script_name);
 d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
 MIN(b.start))/1000;
 dump d;
 {code}




Fwd: hudson patch test jobs : hadoop pig and zookeeper

2010-08-24 Thread Alan Gates



Begin forwarded message:


From: Giridharan  Kesavan gkesa...@yahoo-inc.com
Date: August 24, 2010 4:38:46 PM PDT
To: gene...@hadoop.apache.org gene...@hadoop.apache.org
Subject: hudson patch test jobs : hadoop pig and zookeeper
Reply-To: gene...@hadoop.apache.org gene...@hadoop.apache.org

Hi,

We have a new hudson master hudson.apache.org and  
hudson.zones.apache.org is retired.
This means that we need to port all our patch test admin jobs for  
hadoop(common,hdfs,mapred), pig and zookeeper to the new hudson  
master.


I'm working on configuring patch admin jobs with the new hudson 
master: hudson.apache.org. (This is exactly why the 
patch test builds are not running at the moment.)


Thanks
Giri




[jira] Updated: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer

2010-08-24 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1514:


Status: Patch Available  (was: Open)

 Migrate logical optimization rule: OpLimitOptimizer
 ---

 Key: PIG-1514
 URL: https://issues.apache.org/jira/browse/PIG-1514
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1514-0.patch, jira-1514-1.patch







[jira] Updated: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer

2010-08-24 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1514:


Status: Open  (was: Patch Available)

 Migrate logical optimization rule: OpLimitOptimizer
 ---

 Key: PIG-1514
 URL: https://issues.apache.org/jira/browse/PIG-1514
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1514-0.patch, jira-1514-1.patch







[jira] Updated: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer

2010-08-24 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1514:


Attachment: jira-1514-1.patch

Regenerated the patch to fix a unit test failure.

 Migrate logical optimization rule: OpLimitOptimizer
 ---

 Key: PIG-1514
 URL: https://issues.apache.org/jira/browse/PIG-1514
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1514-0.patch, jira-1514-1.patch







[jira] Updated: (PIG-1321) Logical Optimizer: Merge cascading foreach

2010-08-24 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1321:
-

Status: Open  (was: Patch Available)

 Logical Optimizer: Merge cascading foreach
 --

 Key: PIG-1321
 URL: https://issues.apache.org/jira/browse/PIG-1321
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1321-2.patch, pig-1321.patch


 We can merge consecutive foreach statements.
 E.g.:
 b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1;
 c = foreach b generate b0#'kk1', b0#'kk2', b1, a1;
 => c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1;




[jira] Updated: (PIG-1321) Logical Optimizer: Merge cascading foreach

2010-08-24 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1321:
-

Attachment: jira-1321-2.patch

Regenerated the patch to fix some test failures and to rebase it onto trunk's 
latest code changes.

 Logical Optimizer: Merge cascading foreach
 --

 Key: PIG-1321
 URL: https://issues.apache.org/jira/browse/PIG-1321
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1321-2.patch, pig-1321.patch


 We can merge consecutive foreach statements.
 E.g.:
 b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1;
 c = foreach b generate b0#'kk1', b0#'kk2', b1, a1;
 => c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1;




[jira] Updated: (PIG-1321) Logical Optimizer: Merge cascading foreach

2010-08-24 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1321:
-

Status: Patch Available  (was: Open)

 Logical Optimizer: Merge cascading foreach
 --

 Key: PIG-1321
 URL: https://issues.apache.org/jira/browse/PIG-1321
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1321-2.patch, pig-1321.patch


 We can merge consecutive foreach statements.
 E.g.:
 b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1;
 c = foreach b generate b0#'kk1', b0#'kk2', b1, a1;
 => c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1;




[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1557:
--

Attachment: PIG-1557_1.patch

New patch adds a unit test.

 couple of issue mapping aliases to jobs
 ---

 Key: PIG-1557
 URL: https://issues.apache.org/jira/browse/PIG-1557
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1557.patch, PIG-1557_1.patch


 I have a simple script:
 A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 B = group A by name;
 C = foreach B generate group, COUNT(A);
 D = order C by $1;
 E = limit D 10;
 dump E;
 I noticed a couple of issues with alias to job mapping: neither load(A) nor 
 limit(E) shows up in the output.




[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1557:
--

  Status: Patch Available  (was: Open)
Hadoop Flags: [Reviewed]

 couple of issue mapping aliases to jobs
 ---

 Key: PIG-1557
 URL: https://issues.apache.org/jira/browse/PIG-1557
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1557.patch, PIG-1557_1.patch


 I have a simple script:
 A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 B = group A by name;
 C = foreach B generate group, COUNT(A);
 D = order C by $1;
 E = limit D 10;
 dump E;
 I noticed a couple of issues with alias to job mapping: neither load(A) nor 
 limit(E) shows up in the output.




[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1557:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

 couple of issue mapping aliases to jobs
 ---

 Key: PIG-1557
 URL: https://issues.apache.org/jira/browse/PIG-1557
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1557.patch, PIG-1557_1.patch


 I have a simple script:
 A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 B = group A by name;
 C = foreach B generate group, COUNT(A);
 D = order C by $1;
 E = limit D 10;
 dump E;
 I noticed a couple of issues with alias to job mapping: neither load(A) nor 
 limit(E) shows up in the output.




[jira] Commented: (PIG-1563) SUBSTRING function is broken

2010-08-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902211#action_12902211
 ] 

Olga Natkovich commented on PIG-1563:
-

The same needs to be done (and we need unit tests) for the following string 
manipulation functions:

INDEXOF
LAST_INDEX_OF
REPLACE
SPLIT
TRIM

 SUBSTRING function is broken
 

 Key: PIG-1563
 URL: https://issues.apache.org/jira/browse/PIG-1563
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0


 Script:
 A = load 'studenttab10k' as (name, age, gpa);
 C = foreach A generate SUBSTRING(name, 0,5);
 E = limit C 10;
 dump E;
 Output is always empty:
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
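
 For reference, the behavior one would expect from the script above can be
 sketched as follows. This assumes SUBSTRING follows Java's
 String.substring convention (start index inclusive, stop index exclusive).
 The sample names are made up and are not taken from studenttab10k, and
 the null handling is an assumption.

```python
def substring(value, start, stop):
    """Expected SUBSTRING semantics: characters from start up to, not including, stop."""
    if value is None:
        # Assumption: a null input yields a null output.
        return None
    return value[start:stop]


# Hypothetical sample rows standing in for the name column.
names = ["alice smith", "bob jones", None]
expected = [substring(name, 0, 5) for name in names]
print(expected)
```

 With non-null names, each output tuple should therefore hold the first
 five characters of the name rather than being empty.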




[jira] Updated: (PIG-1564) add support for multiple filesystems

2010-08-24 Thread Andrew Hitchcock (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Hitchcock updated PIG-1564:
--

Attachment: PIG-1564-1.patch

At the moment you cannot, say, read from S3N and write to HDFS in the same job 
(or even read from one S3N bucket and write to another). 
 
The essence of this patch is a change to the way HDataStorage works. Previously 
it mapped to a single Hadoop FileSystem object, which limited jobs to a 
single FileSystem. It is now a wrapper around all Hadoop FileSystems and 
returns the correct one based on the prefix of the path being 
requested. 
 
Another small change: previously Pig assumed the default home directory 
was '/user/username' on the default file system. This directory does not 
necessarily exist, so I made this configurable with a new property, 
pig.initial.fs.name.
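
To make the wrapper idea concrete, here is a rough sketch of prefix-based
dispatch. It is not the code in the patch: the class name
FileSystemRegistry, its methods, and the string placeholders for
file-system clients are all hypothetical, and it assumes that "prefix"
means the URI scheme of the path (hdfs, s3n, and so on), with paths that
have no scheme falling back to a default file system.

```python
from urllib.parse import urlparse


class FileSystemRegistry:
    """Hypothetical stand-in for a wrapper around several file systems."""

    def __init__(self, default_scheme):
        self._default_scheme = default_scheme
        self._by_scheme = {}

    def register(self, scheme, filesystem):
        """Associate a URI scheme with a file-system client object."""
        self._by_scheme[scheme] = filesystem

    def resolve(self, path):
        """Return the file system responsible for the given path."""
        # Paths without a scheme are served by the default file system.
        scheme = urlparse(path).scheme or self._default_scheme
        try:
            return self._by_scheme[scheme]
        except KeyError:
            raise ValueError("no file system registered for scheme %r" % scheme)


registry = FileSystemRegistry(default_scheme="hdfs")
# Plain strings stand in for real client objects in this sketch.
registry.register("hdfs", "hdfs-client")
registry.register("s3n", "s3n-client")

print(registry.resolve("s3n://bucket/input"))
print(registry.resolve("/user/pig/output"))
```

With this shape, a single job can load from one path and store to another
without either side knowing which file system the other uses.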

 add support for multiple filesystems
 

 Key: PIG-1564
 URL: https://issues.apache.org/jira/browse/PIG-1564
 Project: Pig
  Issue Type: Improvement
Reporter: Andrew Hitchcock
 Attachments: PIG-1564-1.patch


 Currently you can't run Pig scripts that read data from one file system and 
 write it to another. Also, Grunt doesn't support CDing from one directory to 
 another on different file systems.




[jira] Created: (PIG-1565) additional piggybank datetime and string UDFs

2010-08-24 Thread Andrew Hitchcock (JIRA)
additional piggybank datetime and string UDFs
-

 Key: PIG-1565
 URL: https://issues.apache.org/jira/browse/PIG-1565
 Project: Pig
  Issue Type: Improvement
Reporter: Andrew Hitchcock


Pig is missing a variety of UDFs that might be helpful for users implementing 
Pig scripts.




[jira] Updated: (PIG-1565) additional piggybank datetime and string UDFs

2010-08-24 Thread Andrew Hitchcock (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Hitchcock updated PIG-1565:
--

Status: Patch Available  (was: Open)

 additional piggybank datetime and string UDFs
 -

 Key: PIG-1565
 URL: https://issues.apache.org/jira/browse/PIG-1565
 Project: Pig
  Issue Type: Improvement
Reporter: Andrew Hitchcock
 Attachments: PIG-1565-1.patch


 Pig is missing a variety of UDFs that might be helpful for users implementing 
 Pig scripts.




[jira] Updated: (PIG-1565) additional piggybank datetime and string UDFs

2010-08-24 Thread Andrew Hitchcock (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Hitchcock updated PIG-1565:
--

Attachment: PIG-1565-1.patch

This patch provides a number of UDFs written by the Amazon Elastic MapReduce 
team that we feel are useful.

A few of these UDFs are duplicates of existing functionality. I am including 
them because they are consistent with the rest of the UDFs in this patch and 
because I'd like to start a discussion about the best way to include these 
UDFs. Here is a list of what I believe to be duplicate UDFs:

INDEX_OF
LAST_INDEX_OF
SPLIT_ON_REGEX

Here are descriptions of the provided UDFs.

datetime/
 These are based on JodaTime and provide a similar model for date handling.

DATE_TIME
 A function that returns a DateTime String, of the form 
yyyy-MM-dd'T'HH:mm:ss.SSSZZ.
DURATION
 A function that returns a Duration as a long. A duration is a length of time 
specified in milliseconds.
EXTRACT_DT
 Extracts the integer numeric value of a field of a LocalDate, LocalTime, 
DateTime, Period or Duration.
FORMAT_DT
 Formats a LocalDate, LocalTime or DateTime given a format string into a string.
LOCAL_DATE
 A function that returns a LocalDate String, of the form yyyy-MM-dd.
LOCAL_TIME
 A function that returns a LocalTime String, of the form HH:mm:ss.SSS.
OFFSET_DT
 Offsets a LocalDate, LocalTime or DateTime by a Period/Duration, returning an 
object of the same type.
PERIOD
 A function that returns a Period String. A Period is specified in terms of 
individual duration fields such as years and days.

string/
 String handling functions modeled after Apache Commons StringUtils.

CAPITALIZE
 Capitalizes a String changing the first letter to upper case.
CENTER
 Centers a String in a larger String.
CONCAT_WITH
 Joins the arguments with a String joiner.
EXTRACT
 Parses input String with a regular expression, and returns all matched groups.
FORMAT
 Formats a list of arguments into a single String.
INDEX_OF
 Finds the first index within a String, from an optional start position, 
handling null.
LAST_INDEX_OF
 Finds the last index within a String, from an optional start position, handling 
null.
LEFT_PAD
 Left pads a string to one of size size.
REPEAT
 Repeat a String repeat times to form a new String.
REPLACE_ONCE
 Replaces a String with another String inside a larger String, once.
RIGHT_PAD
 Right pads a string to one of size size.
SPLIT_ON_REGEX
 Splits this string around matches of the given regular expression.
STRIP
 Strips any of a set of characters from the start and end of a String.
STRIP_END
 Strips any of a set of characters from the end of a String.
STRIP_START
 Strips any of a set of characters from the start of a String.
SWAP_CASE
 Swaps the case of a String changing upper and title case to lower case, and 
lower case to upper case.
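
As a quick illustration of the intended semantics of a few of the string
UDFs, here is a sketch using plain Python string methods. It only mirrors
the one-line descriptions above; it is not the code in the patch, the
function names and signatures are assumptions, and null handling and other
edge cases may differ from the actual UDFs.

```python
def strip_start(value, chars):
    """STRIP_START: remove any of `chars` from the start of the string."""
    return value.lstrip(chars)


def strip_end(value, chars):
    """STRIP_END: remove any of `chars` from the end of the string."""
    return value.rstrip(chars)


def left_pad(value, size):
    """LEFT_PAD: pad with spaces on the left up to `size` characters."""
    return value.rjust(size)


def right_pad(value, size):
    """RIGHT_PAD: pad with spaces on the right up to `size` characters."""
    return value.ljust(size)


def replace_once(text, search, replacement):
    """REPLACE_ONCE: replace only the first occurrence of `search`."""
    return text.replace(search, replacement, 1)


def repeat(value, times):
    """REPEAT: concatenate `value` with itself `times` times."""
    return value * times


print(strip_start("xxabcxx", "x"))
print(strip_end("xxabcxx", "x"))
print(repr(left_pad("7", 3)))
print(replace_once("aba", "a", "z"))
print(repeat("ab", 3))
```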

 additional piggybank datetime and string UDFs
 -

 Key: PIG-1565
 URL: https://issues.apache.org/jira/browse/PIG-1565
 Project: Pig
  Issue Type: Improvement
Reporter: Andrew Hitchcock
 Attachments: PIG-1565-1.patch


 Pig is missing a variety of UDFs that might be helpful for users implementing 
 Pig scripts.




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-24 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Status: Open  (was: Patch Available)

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.
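
 To illustrate why combining helps, here is a rough sketch of packing
 small files into shared splits. This is not the approach taken in the
 attached patches: the function name, the greedy strategy, and the size
 limit are assumptions chosen only to show the effect on the number of
 map tasks.

```python
def combine_into_splits(file_sizes, max_split_bytes):
    """Greedily pack (name, size) pairs into splits of at most max_split_bytes.

    A file larger than the limit still gets a split of its own.
    """
    splits = []
    current, current_bytes = [], 0
    for name, size in file_sizes:
        # Start a new split when adding this file would exceed the limit.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


# Hypothetical input: five small files with sizes in arbitrary units.
files = [("part-0", 10), ("part-1", 20), ("part-2", 50),
         ("part-3", 40), ("part-4", 5)]
splits = combine_into_splits(files, max_split_bytes=64)
print(len(files), "files ->", len(splits), "splits")
```

 In this made-up case five files end up in three splits, so three maps
 would run instead of five.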




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-24 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Status: Patch Available  (was: Open)

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-24 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

Minor polish of debugging code inside comments.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.
