[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1551:
-----------------------------------
    Status: Open  (was: Patch Available)

> Improve dynamic invokers to deal with no-arg methods and array parameters
> -------------------------------------------------------------------------
>
>                 Key: PIG-1551
>                 URL: https://issues.apache.org/jira/browse/PIG-1551
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>             Fix For: 0.8.0
>         Attachments: PIG-1551.patch, PIG_1551.2.patch
>
> PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are OK with sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that take no arguments, and methods that take arrays of {int, long, float, double, string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap the statistical functions in o.a.commons.math.stat.StatUtils.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
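The invoker mechanism described above can be approximated with plain reflection: look up a static method by name and parameter class, convert the incoming bag into the primitive array the method expects, and invoke it. The following is a minimal sketch only; the names (InvokerSketch, bagToDoubleArray, sum) are illustrative stand-ins, not Pig's actual invoker classes, and a List stands in for a Pig bag:

```java
import java.lang.reflect.Method;
import java.util.List;

public class InvokerSketch {
    // Convert a "bag" of doubles (here just a List) into the primitive
    // array the wrapped method expects.
    static double[] bagToDoubleArray(List<Double> bag) {
        double[] arr = new double[bag.size()];
        for (int i = 0; i < arr.length; i++) arr[i] = bag.get(i);
        return arr;
    }

    // Example target: a static method taking a primitive double[],
    // standing in for something like StatUtils.mean(double[]).
    public static double sum(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s;
    }

    public static void main(String[] args) throws Exception {
        // Look up the method by name and parameter class, as a dynamic
        // invoker would from the class/method names given in the script.
        Method m = InvokerSketch.class.getMethod("sum", double[].class);
        double[] input = bagToDoubleArray(List.of(1.0, 2.0, 3.0));
        System.out.println(m.invoke(null, (Object) input)); // prints 6.0
    }
}
```

The no-arg case mentioned in the issue is the same lookup with an empty parameter-class list passed to getMethod, followed by invoke(null).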
[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1551:
-----------------------------------
    Attachment: PIG_1551.2.patch

Attaching a patch that fixes the two errors Richard pointed out.
[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1551:
-----------------------------------
    Status: Patch Available  (was: Open)
[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-1205:
----------------------------
    Attachment: PIG_1205_8.patch

Several updates continuing Dmitriy's work:
1. Add unit tests to HBaseStorage: refactor TestHBaseStorage, and add unit tests for the gt, lt, gte, lte, and limit parameters and for HBaseBinaryConverter.
2. Update hbase 0.20 to hbase 0.20.6. (Dmitriy, I found that HBaseStorage does not work on hbase 0.20; did you test manually against hbase 0.20.6 rather than 0.20.0?)
3. I think we need more documentation for HBaseStorage, especially for the LoadCaster: if a user specifies the wrong LoadCaster, they will get confusing results.

> Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
> ------------------------------------------------------------------------------
>
>                 Key: PIG-1205
>                 URL: https://issues.apache.org/jira/browse/PIG-1205
>             Project: Pig
>          Issue Type: Sub-task
>    Affects Versions: 0.7.0
>            Reporter: Jeff Zhang
>            Assignee: Dmitriy V. Ryaboy
>             Fix For: 0.8.0
>         Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, PIG_1205_8.patch
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901787#action_12901787 ]

Dmitriy V. Ryaboy commented on PIG-1205:
----------------------------------------
Jeff, thanks a lot for pitching in with the tests! I was using 0.20.0 and the old tests passed. I've only tested the binary conversion and other new features on the Twitter machines, and they do run a later HBase version; perhaps the incompatibility is in the filters or the binary casters code? Do you know which tests fail with 0.20.0? I will definitely add a bunch of documentation.
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901799#action_12901799 ]

Jeff Zhang commented on PIG-1205:
---------------------------------
Dmitriy, the test cases testLoadWithParameters_1 and testLoadWithParameters_2 fail when using hbase 0.20. I think TableInputFormat had some updates (maybe bug fixes) between hbase 0.20.0 and hbase 0.20.6. The following is the log:

{code}
10/08/24 17:28:00 ERROR mapReduceLayer.Launcher: Backend error message during job submission
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hbase://pigtable_1
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
	at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
	at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
	at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
	at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
	at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:365)
	at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:347)
	at org.apache.hadoop.hbase.filter.CompareFilter.readFields(CompareFilter.java:132)
	at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:418)
	at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:347)
	at org.apache.hadoop.hbase.filter.FilterList.readFields(FilterList.java:204)
	at org.apache.hadoop.hbase.client.Scan.readFields(Scan.java:523)
	at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertStringToScan(TableMapReduceUtil.java:94)
	at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:79)
	at org.apache.pig.backend.hadoop.hbase.HBaseTableInputFormat$HBaseTableIFBuilder.build(HBaseTableInputFormat.java:77)
	at org.apache.pig.backend.hadoop.hbase.HBaseStorage.getInputFormat(HBaseStorage.java:268)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:257)
	... 7 more
10/08/24 17:28:00 ERROR pigstats.PigStats: ERROR 2118: Unable to create input splits for: hbase://pigtable_1
10/08/24 17:28:00 ERROR pigstats.PigStatsUtil: 1 map reduce job(s) failed!
10/08/24 17:28:00 INFO pigstats.PigStats: Script Statistics:
{code}
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

niraj rai updated PIG-1343:
---------------------------
    Attachment: 1343.patch

This patch generates an error when a job has failed but MR does not return any exception.

> pig_log file missing even though Main tells it is creating one and an M/R job fails
> -----------------------------------------------------------------------------------
>
>                 Key: PIG-1343
>                 URL: https://issues.apache.org/jira/browse/PIG-1343
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: niraj rai
>             Fix For: 0.8.0
>         Attachments: 1343.patch, PIG-1343-1.patch
>
> There is a particular case I ran into with the latest trunk of Pig:
> {code}
> $ java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
> [main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log
>
> $ ls -l pig_1263420012601.log
> ls: pig_1263420012601.log: No such file or directory
> {code}
> The job failed and the log file did not contain anything; the only way to debug was to look into the JobTracker logs. Here are some reasons that could have caused this behavior:
> 1) The underlying filer/NFS had some issues. In that case should we not error on stdout?
> 2) There are some errors from the backend which are not being captured.
> Viraj
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

niraj rai updated PIG-1343:
---------------------------
    Status: Patch Available  (was: Open)
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901924#action_12901924 ]

Jeff Zhang commented on PIG-1205:
---------------------------------
Dmitriy, I found the problem. This is actually a bug in hbase 0.20.0 in the serialization of filters (https://issues.apache.org/jira/browse/HBASE-1830). I think we should update the hbase dependency in pig to 0.20.6; 0.20.6 is compatible with 0.20.0.
[jira] Commented: (PIG-506) Does pig need a NATIVE keyword?
[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901946#action_12901946 ]

Thejas M Nair commented on PIG-506:
-----------------------------------
Unit tests passed and I committed the changes, but they fail with the latest changes that switch to the new logical plan. I have added the test cases to the exclude list in build.xml. Keeping the JIRA open until this is fixed.

> Does pig need a NATIVE keyword?
> -------------------------------
>
>                 Key: PIG-506
>                 URL: https://issues.apache.org/jira/browse/PIG-506
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Aniket Mokashi
>            Priority: Minor
>             Fix For: 0.8.0
>         Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.patch, TestWordCount.jar
>
> Assume a user had a job that broke easily into three pieces. Further assume that pieces one and three were easily expressible in pig, but that piece two needed to be written in map reduce for whatever reason (performance, something that pig could not easily express, a legacy job that was too important to change, etc.). Today the user would either have to use map reduce for the entire job or manually handle the stitching together of pig and map reduce jobs. What if instead pig provided a NATIVE keyword that would allow the script to pass off the data stream to the underlying system (in this case map reduce)? The semantics of NATIVE would vary by underlying system. In the map reduce case, we would assume that this indicated a collection of one or more fully contained map reduce jobs, so that pig would store the data, invoke the map reduce jobs, and then read the resulting data to continue. It might look something like this:
> {code}
> A = load 'myfile';
> X = load 'myotherfile';
> B = group A by $0;
> C = foreach B generate group, myudf(B);
> D = native (jar=mymr.jar, infile=frompig outfile=topig);
> E = join D by $0, X by $0;
> ...
> {code}
> This differs from streaming in that it allows the user to insert an arbitrary amount of native processing, whereas streaming allows the insertion of one binary. It also differs in that, for streaming, data is piped directly into and out of the binary as part of the pig pipeline. Here the pipeline would be broken, data written to disk, the native block invoked, and then data read back from disk. Another alternative is to say this is unnecessary because the user can do the coordination from java, using the PigServer interface to run pig and calling the map reduce job explicitly. The advantages of the native keyword are that the user need not worry about coordination between the jobs; pig will take care of it. Also, the user can make use of existing java applications without being a java programmer.
Re: Caster interface and byte conversion
This seems fine. Is the Pig engine at any point testing to see if the interface is implemented and, if so, calling toBytes, or is this totally for use inside the store functions themselves to serialize Pig data types?

Alan.

On Aug 22, 2010, at 1:40 AM, Dmitriy Ryaboy wrote:

> The current HBase patch on PIG-1205 (patch 7) includes this refactoring. Please take a look if you have concerns. Or just if you feel like reviewing the code... :)
> -D
>
> On Sat, Aug 21, 2010 at 5:22 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
>> I just noticed that even though Utf8StorageConverter implements the various byte[] toBytes(Obj o) methods, they are not part of the LoadCaster interface, and therefore can't be relied on when using modular Casters, like I am trying to do for the HBaseLoader. Since we don't want to introduce backwards-incompatible changes, I propose adding a ByteCaster interface that defines these methods, and extending Utf8StorageConverter to implement them (without actually changing the implementation at all). That way StoreFuncs that need to convert to bytes can use pluggable converters. Objections?
>> -D
Re: is Hudson awol?
Yes, our friend Hudson is ill again. Giri, Hudson's doctor, should get a chance to look at it in a few days.

Alan.

On Aug 23, 2010, at 3:31 PM, Dmitriy Ryaboy wrote:

> Haven't heard anything from Hudson in a while...
> -D
[jira] Updated: (PIG-1560) Build target 'checkstyle' fails
[ https://issues.apache.org/jira/browse/PIG-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giridharan Kesavan updated PIG-1560:
------------------------------------
    Attachment: pig-1560.patch

This patch fixes the checkstyle target build failure.

> Build target 'checkstyle' fails
> -------------------------------
>
>                 Key: PIG-1560
>                 URL: https://issues.apache.org/jira/browse/PIG-1560
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Richard Ding
>            Assignee: Giridharan Kesavan
>             Fix For: 0.8.0
>         Attachments: pig-1560.patch
>
> Stack trace:
> {code}
> /trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
> 	at org.apache.commons.beanutils.ConvertUtilsBean.<init>(ConvertUtilsBean.java:130)
> 	at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73)
> 	at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222)
> 	at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372)
> 	at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304)
> 	at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265)
> 	at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
> 	at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
> 	at org.apache.tools.ant.Task.perform(Task.java:348)
> 	at org.apache.tools.ant.Target.execute(Target.java:390)
> 	at org.apache.tools.ant.Target.performTasks(Target.java:411)
> 	at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360)
> 	at org.apache.tools.ant.Project.executeTarget(Project.java:1329)
> 	at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
> 	at org.apache.tools.ant.Project.executeTargets(Project.java:1212)
> 	at org.apache.tools.ant.Main.runBuild(Main.java:801)
> 	at org.apache.tools.ant.Main.startAnt(Main.java:218)
> 	at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
> 	at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
> Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
> 	at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386)
> 	at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336)
> 	at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
> 	... 22 more
> {code}
[jira] Updated: (PIG-1311) Pig interfaces should be clearly classified in terms of scope and stability
[ https://issues.apache.org/jira/browse/PIG-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1311:
----------------------------
    Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Patch checked in.

> Pig interfaces should be clearly classified in terms of scope and stability
> ---------------------------------------------------------------------------
>
>                 Key: PIG-1311
>                 URL: https://issues.apache.org/jira/browse/PIG-1311
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>             Fix For: 0.8.0
>         Attachments: PIG-1311.patch
>
> Clearly marking Pig interfaces (Java interfaces, but also things like config files, CLIs, Pig Latin syntax and semantics, etc.) to show scope (public/private) and stability (stable/evolving/unstable) will help users understand how to interact with Pig, and developers to understand what they can and cannot change.
[jira] Resolved: (PIG-1503) Label interfaces for audience and stability in org.apache.pig.backend package
[ https://issues.apache.org/jira/browse/PIG-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-1503.
-----------------------------
    Resolution: Duplicate

The remaining interfaces were labeled as part of PIG-1311.

> Label interfaces for audience and stability in org.apache.pig.backend package
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1503
>                 URL: https://issues.apache.org/jira/browse/PIG-1503
>             Project: Pig
>          Issue Type: Sub-task
>          Components: documentation
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Minor
>             Fix For: 0.8.0
>
> This includes the datastorage and executionengine packages under backend.
Re: Caster interface and byte conversion
One other comment. By making this part of an interface that extends LoadCaster you are assuming the implementing class is both a load and store function. It makes more sense to have a separate StoreCaster interface rather than extending LoadCaster.

Alan.

On Aug 24, 2010, at 9:18 AM, Alan Gates wrote:

> This seems fine. Is the Pig engine at any point testing to see if the interface is implemented and, if so, calling toBytes, or is this totally for use inside the store functions themselves to serialize Pig data types?
> Alan.
>
> On Aug 22, 2010, at 1:40 AM, Dmitriy Ryaboy wrote:
>> The current HBase patch on PIG-1205 (patch 7) includes this refactoring. Please take a look if you have concerns. Or just if you feel like reviewing the code... :)
>> -D
>>
>> On Sat, Aug 21, 2010 at 5:22 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
>>> I just noticed that even though Utf8StorageConverter implements the various byte[] toBytes(Obj o) methods, they are not part of the LoadCaster interface, and therefore can't be relied on when using modular Casters, like I am trying to do for the HBaseLoader. Since we don't want to introduce backwards-incompatible changes, I propose adding a ByteCaster interface that defines these methods, and extending Utf8StorageConverter to implement them (without actually changing the implementation at all). That way StoreFuncs that need to convert to bytes can use pluggable converters. Objections?
>>> -D
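Alan's suggestion (a store-side contract independent of the load-side one) can be sketched as two separate interfaces. This is a hedged illustration only: the names LoadSide, StoreSide, and Utf8Converter are stand-ins, not the actual org.apache.pig types, and only a couple of methods are shown:

```java
import java.io.IOException;

public class CasterSketch {
    // Illustrative stand-in for the load-side contract; the real
    // org.apache.pig.LoadCaster declares many more bytesTo* methods.
    interface LoadSide {
        Integer bytesToInteger(byte[] b) throws IOException;
    }

    // The proposed store-side contract, kept separate so a StoreFunc can
    // require serialization support without implying load support.
    interface StoreSide {
        byte[] toBytes(Integer i) throws IOException;
        byte[] toBytes(String s) throws IOException;
    }

    // A UTF-8 converter can opt into both contracts independently, with
    // no forced coupling between loading and storing in the type hierarchy.
    static class Utf8Converter implements LoadSide, StoreSide {
        public Integer bytesToInteger(byte[] b) { return Integer.valueOf(new String(b)); }
        public byte[] toBytes(Integer i) { return i.toString().getBytes(); }
        public byte[] toBytes(String s) { return s.getBytes(); }
    }

    public static void main(String[] args) throws IOException {
        // A store function can hold only the store-side view of the converter.
        StoreSide caster = new Utf8Converter();
        System.out.println(new String(caster.toBytes(Integer.valueOf(42)))); // prints 42
    }
}
```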
[jira] Resolved: (PIG-1558) build.xml for site directory does not work
[ https://issues.apache.org/jira/browse/PIG-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-1558.
-----------------------------
    Resolution: Fixed

Patch checked in.

> build.xml for site directory does not work
> ------------------------------------------
>
>                 Key: PIG-1558
>                 URL: https://issues.apache.org/jira/browse/PIG-1558
>             Project: Pig
>          Issue Type: Bug
>          Components: build
>    Affects Versions: 0.8.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Minor
>             Fix For: 0.8.0
>         Attachments: PIG-1558.patch
>
> Going to the site directory and running ant produces:
> {code}
> ant
> Buildfile: build.xml
>
> clean:
>    [delete] Deleting directory /Users/gates/src/pig/apache/site/author/build
>
> update:
>
> BUILD FAILED
> /Users/gates/src/pig/apache/site/build.xml:6: Execute failed: java.io.IOException: Cannot run program forrest (in directory /Users/gates/src/pig/apache/site/author): error=2, No such file or directory
> {code}
> Also, forrest here still requires Java 1.5, which can be fixed (see PIG-1508).
[jira] Updated: (PIG-1559) Several things stated in Pig philosophy page are out of date
[ https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1559:
----------------------------
    Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Patch checked in.

> Several things stated in Pig philosophy page are out of date
> ------------------------------------------------------------
>
>                 Key: PIG-1559
>                 URL: https://issues.apache.org/jira/browse/PIG-1559
>             Project: Pig
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 0.7.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Minor
>             Fix For: 0.8.0
>         Attachments: PIG-1559.patch
>
> The Pig philosophy page says several things that are no longer true: for example, that Pig does not have an optimizer (it does now), that we someday hope to support streaming (we already do), and that we someday hope to control splits (we don't; we just use what Hadoop gives us now). These need to be updated to reflect the current situation.
[jira] Updated: (PIG-1562) Fix the version for the dependent packages for the maven
[ https://issues.apache.org/jira/browse/PIG-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1562:
--------------------------------
    Fix Version/s: 0.8.0

> Fix the version for the dependent packages for the maven
> ---------------------------------------------------------
>
>                 Key: PIG-1562
>                 URL: https://issues.apache.org/jira/browse/PIG-1562
>             Project: Pig
>          Issue Type: Bug
>            Reporter: niraj rai
>            Assignee: niraj rai
>             Fix For: 0.8.0
>
> We need to fix the set-version step so that the version is properly set for the dependent packages in the maven repository.
[jira] Commented: (PIG-1560) Build target 'checkstyle' fails
[ https://issues.apache.org/jira/browse/PIG-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901975#action_12901975 ]

Olga Natkovich commented on PIG-1560:
-------------------------------------
Please, commit.
[jira] Commented: (PIG-1559) Several things stated in Pig philosophy page are out of date
[ https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901979#action_12901979 ]

Olga Natkovich commented on PIG-1559:
-------------------------------------
Looks like the limit issue I was seeing has been addressed in the latest trunk. I think we need to add unit tests to catch these things in the future.
[jira] Commented: (PIG-1559) Several things stated in Pig philosophy page are out of date
[ https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901984#action_12901984 ]

Olga Natkovich commented on PIG-1559:
-------------------------------------
Sorry, wrong JIRA.
[jira] Commented: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901985#action_12901985 ] Olga Natkovich commented on PIG-1557: - Looks like the limit issue I was seeing has been addressed in the latest trunk. I think we need to add unit tests to catch these things in the future. couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901992#action_12901992 ] Richard Ding commented on PIG-1551: --- The typo is still there: {code} private static final Class<?> LONG_ARRAY_CLASS = new Long[0].getClass(); {code} It seems what you want is {code} private static final Class<?> LONG_ARRAY_CLASS = new long[0].getClass(); {code} so it's consistent with other array classes. This does raise a question about array parameters: the first form applies to methods like _amethod(Long[] nums)_, while the second supports methods like _amethod(long[] nums)_. And they are not interchangeable. Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch, PIG_1551.2.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
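Richard's distinction between the two forms can be verified directly in Java. This is a minimal sketch, independent of the Pig patch, showing that the primitive and boxed array classes are distinct runtime types:

```java
// Minimal demonstration that primitive and boxed array classes differ.
public class ArrayClassDemo {
    public static void main(String[] args) {
        Class<?> primitive = new long[0].getClass(); // the long[] class
        Class<?> boxed = new Long[0].getClass();     // the Long[] class
        System.out.println(primitive.getName());     // prints [J
        System.out.println(boxed.getName());         // prints [Ljava.lang.Long;
        System.out.println(primitive.equals(boxed)); // prints false
    }
}
```

Because the two classes differ, a `Class.getMethod` lookup performed with one will never match a method declared with the other.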
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902008#action_12902008 ] Dmitriy V. Ryaboy commented on PIG-1205: Ok, let's upgrade to 20.6 then. We could work around by serializing the filters ourselves, and applying them to the scan when reading the UDFContext, but that seems a bit overboard, and folks should be upgrading anyway. *Committers*: this is ready for review. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, PIG_1205_8.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1551: --- Attachment: PIG_1551.3.patch Ugh. Thank you for catching that -- fixed, and added a test to make sure it stays fixed. The particular set of methods I needed this for used primitives, so that's what I did. It's a bit tricky to add support for Long, Double, etc. arrays, as I would have to check all combinations of possible method signatures when seeing things like (int[], int[], int[]) -- it becomes fairly ugly code. Do you think this is particularly compelling? I can't really think of methods that take arrays of Number classes; usually, if you start using Numbers, you are also using Collections, not plain arrays. Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902030#action_12902030 ] Richard Ding commented on PIG-1343: --- The log file is created when running in batch mode, but not in interactive mode. pig_log file missing even though Main tells it is creating one and an M/R job fails Key: PIG-1343 URL: https://issues.apache.org/jira/browse/PIG-1343 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: niraj rai Fix For: 0.8.0 Attachments: 1343.patch, PIG-1343-1.patch There is a particular case where I was running with the latest trunk of Pig. {code} $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig [main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log $ls -l pig_1263420012601.log ls: pig_1263420012601.log: No such file or directory {code} The job failed and the log file did not contain anything, the only way to debug was to look into the Jobtracker logs. Here are some reasons which would have caused this behavior: 1) The underlying filer/NFS had some issues. In that case do we not error on stdout? 2) There are some errors from the backend which are not being captured Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902042#action_12902042 ] Richard Ding commented on PIG-1551: --- +1. I'm fine with arrays of primitive types. I can't think of a Java method that uses an array of object Long as a parameter. Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Caster interface and byte conversion
As far as the toBytes methods -- I am not sure what they were originally for. They aren't actually called anywhere that I can find, except my new HBase stuff. You are right, I could make it two interfaces, but I consolidated them for simplicity of use/implementation. Now that I think about it, I can put all the methods into StoreCaster and just have a unioning interface for simplicity: @InterfaceAudience.Public @InterfaceStability.Evolving public interface LoadStoreCaster extends LoadCaster, StoreCaster { } Does that seem ok? -D On Tue, Aug 24, 2010 at 10:01 AM, Alan Gates ga...@yahoo-inc.com wrote: One other comment. By making this part of an interface that extends LoadCaster you are assuming the implementing class is both a load and store function. It makes more sense to have a separate StoreCaster interface rather than extending LoadCaster. Alan. On Aug 24, 2010, at 9:18 AM, Alan Gates wrote: This seems fine. Is the Pig engine at any point testing to see if the interface is implemented and if so calling toBytes, or is this totally for use inside the store functions themselves to serialize Pig data types? Alan. On Aug 22, 2010, at 1:40 AM, Dmitriy Ryaboy wrote: The current HBase patch on PIG-1205 (patch 7) includes this refactoring. Please take a look if you have concerns. Or just if you feel like reviewing the code... :) -D On Sat, Aug 21, 2010 at 5:22 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: I just noticed that even though Utf8StorageConverter implements the various byte[] toBytes(Obj o) methods, they are not part of the LoadCaster interface -- and therefore can't be relied on when using modular Casters, like I am trying to do for the HBaseLoader. Since we don't want to introduce backwards-incompatible changes, I propose adding a ByteCaster interface that defines these methods, and extending Utf8StorageConverter to implement them (without actually changing the implementation at all). 
That way StoreFuncs that need to convert to bytes can use pluggable converters. Objections? -D
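The unioning-interface idea from the thread above can be sketched as follows. The method signatures here are hypothetical stand-ins (Pig's real LoadCaster/StoreCaster surface has many more methods); only the union pattern itself is the point:

```java
// Hypothetical stand-ins for the real interfaces; the union pattern is the point.
interface LoadCaster { Integer bytesToInteger(byte[] b); }
interface StoreCaster { byte[] toBytes(Integer i); }

// The proposed unioning interface adds no methods of its own.
interface LoadStoreCaster extends LoadCaster, StoreCaster { }

// One implementation can now be plugged in wherever either side is needed.
class Utf8Caster implements LoadStoreCaster {
    public Integer bytesToInteger(byte[] b) { return Integer.valueOf(new String(b)); }
    public byte[] toBytes(Integer i) { return String.valueOf(i).getBytes(); }
}

public class CasterDemo {
    public static void main(String[] args) {
        Utf8Caster c = new Utf8Caster();
        System.out.println(c.bytesToInteger(c.toBytes(42))); // prints 42
    }
}
```

A StoreFunc that only needs serialization can depend on `StoreCaster` alone, while a class like Utf8StorageConverter implements the union.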
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902065#action_12902065 ] Thejas M Nair commented on PIG-1501: Comments on the patch - TFileStorage.java - getSchema() code that determines schema from data is the same across TFileStorage and InterStorage. The code in BinStorage is also the same, except that it uses some deprecated functions. That can be moved to a common util class. (Yes, I should have moved it to a util class when I created InterStorage) TestTmpFileCompression.java - both tests test if TFile is getting used. I think one test can be changed to check if InterStorage gets used when compression is not turned on, or a check can be added to any other existing test case that runs an MR job, to see if InterStorage gets used there. - log setup code is duplicated between setup and resetLog(). can be moved to a common function SampleOptimizer.java - The following comment can be updated - // check that it is using BinaryStorage. to // check that it is using the temp file storage format. TFileRecordWriter.java - the comment in the following section does not seem to be valid anymore - {code} public TFileRecordWriter(Path file, String codec, Configuration conf) +throws IOException { +// hardcoded to use gzip and 1M as block size: may wish to be made configurable {code} need to investigate the impact of compression on pig performance Key: PIG-1501 URL: https://issues.apache.org/jira/browse/PIG-1501 Project: Pig Issue Type: Test Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch, PIG-1501.patch We would like to understand how compressing map results as well as reducer output in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1551: --- Status: Resolved (was: Patch Available) Release Note: The idea is simple: frequently, Pig users need to use a simple function that is already provided by standard Java libraries, but for which a UDF has not been written. Dynamic Invokers allow a Pig programmer to refer to Java functions without having to wrap them in custom Pig UDFs, at the cost of doing some Java reflection on every function call. {code} DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String'); encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray); decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8'); {code} Currently, Dynamic Invokers can be used for any static function that accepts no arguments or some combination of Strings, ints, longs, doubles, floats, or arrays of same, and returns a String, an int, a long, a double, or a float. Primitives only for the numbers, no capital-letter numeric classes as arguments. Depending on the return type, a specific kind of Invoker must be used: InvokeForString, InvokeForInt, InvokeForLong, InvokeForDouble, or InvokeForFloat. The DEFINE keyword is used to bind a keyword to a Java method, as above. The first argument to the InvokeFor* constructor is the full path to the desired method. The second argument is a space-delimited ordered list of the classes of the method arguments. This can be omitted or an empty string if the method takes no arguments. Valid class names are String, Long, Float, Double, and Int. Invokers can also work with array arguments, represented in Pig as DataBags of single-tuple elements. Simply refer to string[], for example. Class names are not case-sensitive. 
The ability to use invokers on methods that take array arguments makes methods like those in org.apache.commons.math.stat.StatUtils available for processing the results of grouping your datasets, for example. This is very nice, but a word of caution: the resulting UDF will of course not be optimized for Hadoop, and the very significant benefits one gains from implementing the Algebraic and Accumulator interfaces are lost here. Be careful with this one. Resolution: Fixed Committed. Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
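The method resolution the release note describes can be sketched with plain reflection. This is an illustration of the lookup mechanics only, not Pig's actual invoker implementation; `java.util.Arrays.hashCode(long[])` stands in here for a StatUtils method so the sketch stays self-contained:

```java
import java.lang.reflect.Method;

public class InvokerSketch {
    // Resolve and call a static method by class name, method name, and
    // parameter classes -- the same kind of lookup a dynamic invoker does.
    static Object invokeStatic(String className, String methodName,
                               Class<?>[] paramTypes, Object[] args) throws Exception {
        Method m = Class.forName(className).getMethod(methodName, paramTypes);
        return m.invoke(null, args); // null receiver: static method
    }

    public static void main(String[] args) throws Exception {
        // The array parameter must be looked up as long[].class, not
        // Long[].class: the two signatures denote distinct methods.
        Object result = invokeStatic("java.util.Arrays", "hashCode",
                new Class<?>[]{ long[].class },
                new Object[]{ new long[]{ 1L, 2L, 3L } });
        System.out.println(result);
    }
}
```

The reflection cost on every call is exactly the speed trade-off the release note warns about.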
[jira] Updated: (PIG-1354) UDFs for dynamic invocation of simple Java methods
[ https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1354: --- Release Note: Please see PIG-1551 release notes. UDFs for dynamic invocation of simple Java methods -- Key: PIG-1354 URL: https://issues.apache.org/jira/browse/PIG-1354 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch The need to create wrapper UDFs for simple Java functions creates unnecessary work for Pig users, slows down the development process, and produces a lot of trivial classes. We can use Java's reflection to allow invoking a number of methods on the fly, dynamically, by creating a generic UDF to accomplish this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Caster interface and byte conversion
On Aug 24, 2010, at 1:22 PM, Dmitriy Ryaboy wrote: As far as the toBytes methods -- I am not sure what they were originally for. They aren't actually called anywhere that I can find, except my new HBase stuff. You are right, I could make it two interfaces, but I consolidated them for simplicity of use/implementation. Now that I think about it, I can put all the methods into StoreCaster and just have a unioning interface for simplicity: @InterfaceAudience.Public @InterfaceStability.Evolving public interface LoadStoreCaster extends LoadCaster, StoreCaster { } Does that seem ok? Yeah, makes sense. Alan. -D On Tue, Aug 24, 2010 at 10:01 AM, Alan Gates ga...@yahoo-inc.com wrote: One other comment. By making this part of an interface that extends LoadCaster you are assuming the implementing class is both a load and store function. It makes more sense to have a separate StoreCaster interface rather than extending LoadCaster. Alan. On Aug 24, 2010, at 9:18 AM, Alan Gates wrote: This seems fine. Is the Pig engine at any point testing to see if the interface is implemented and if so calling toBytes, or is this totally for use inside the store functions themselves to serialize Pig data types? Alan. On Aug 22, 2010, at 1:40 AM, Dmitriy Ryaboy wrote: The current HBase patch on PIG-1205 (patch 7) includes this refactoring. Please take a look if you have concerns. Or just if you feel like reviewing the code... :) -D On Sat, Aug 21, 2010 at 5:22 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: I just noticed that even though Utf8StorageConverter implements the various byte[] toBytes(Obj o) methods, they are not part of the LoadCaster interface -- and therefore can't be relied on when using modular Casters, like I am trying to do for the HBaseLoader. 
Since we don't want to introduce backwards-incompatible changes, I propose adding a ByteCaster interface that defines these methods, and extending Utf8StorageConverter to implement them (without actually changing the implementation at all). That way StoreFuncs that need to convert to bytes can use pluggable converters. Objections? -D
[jira] Created: (PIG-1563) SUBSTRING function is broken
SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1483: -- Attachment: PIG-1483_1.patch New patch adding unit test. [piggybank] Add HadoopJobHistoryLoader to the piggybank --- Key: PIG-1483 URL: https://issues.apache.org/jira/browse/PIG-1483 Project: Pig Issue Type: New Feature Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1483.patch, PIG-1483_1.patch PIG-1333 added many script-related entries to the MR job xml file and thus it's now possible to use Pig for querying Hadoop job history/xml files to get script-level usage statistics. What we need is a Pig loader that can parse these files and generate corresponding data objects. The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. Here is an example that shows the intended usage: *Find all the jobs grouped by script and user:* {code} a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) j#'USER' as user, (Chararray) j#'JOBID' as job; c = filter b by not (id is null); d = group c by (id, user); e = foreach d generate flatten(group), c.job; dump e; {code} A couple more examples: *Find scripts that use only the default parallelism:* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; {code} *Find the running time of each script (in seconds):* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as 
user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; c = group b by (id, user, script_name) d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start)/1000; dump d; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1483: -- Status: Patch Available (was: Open) [piggybank] Add HadoopJobHistoryLoader to the piggybank --- Key: PIG-1483 URL: https://issues.apache.org/jira/browse/PIG-1483 Project: Pig Issue Type: New Feature Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1483.patch, PIG-1483_1.patch PIG-1333 added many script-related entries to the MR job xml file and thus it's now possible to use Pig for querying Hadoop job history/xml files to get script-level usage statistics. What we need is a Pig loader that can parse these files and generate corresponding data objects. The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. Here is an example that shows the intended usage: *Find all the jobs grouped by script and user:* {code} a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) j#'USER' as user, (Chararray) j#'JOBID' as job; c = filter b by not (id is null); d = group c by (id, user); e = foreach d generate flatten(group), c.job; dump e; {code} A couple more examples: *Find scripts that use only the default parallelism:* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; {code} *Find the running time of each script (in seconds):* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as 
script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; c = group b by (id, user, script_name); d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000; dump d; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Fwd: hudson patch test jobs : hadoop pig and zookeeper
Begin forwarded message: From: Giridharan Kesavan gkesa...@yahoo-inc.com Date: August 24, 2010 4:38:46 PM PDT To: gene...@hadoop.apache.org gene...@hadoop.apache.org Subject: hudson patch test jobs : hadoop pig and zookeeper Reply-To: gene...@hadoop.apache.org gene...@hadoop.apache.org Hi, We have a new hudson master hudson.apache.org and hudson.zones.apache.org is retired. This means that we need to port all our patch test admin jobs for hadoop(common,hdfs,mapred), pig and zookeeper to the new hudson master. I'm working on configuring patch admin jobs with the new hudson master: hudson.apache.org. (This is exactly why the patch test builds are not running at the moment.) Thanks Giri
[jira] Updated: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer
[ https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1514: Status: Patch Available (was: Open) Migrate logical optimization rule: OpLimitOptimizer --- Key: PIG-1514 URL: https://issues.apache.org/jira/browse/PIG-1514 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1514-0.patch, jira-1514-1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer
[ https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1514: Status: Open (was: Patch Available) Migrate logical optimization rule: OpLimitOptimizer --- Key: PIG-1514 URL: https://issues.apache.org/jira/browse/PIG-1514 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1514-0.patch, jira-1514-1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer
[ https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1514: Attachment: jira-1514-1.patch Regenerated patch to fix unit test failure. Migrate logical optimization rule: OpLimitOptimizer --- Key: PIG-1514 URL: https://issues.apache.org/jira/browse/PIG-1514 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1514-0.patch, jira-1514-1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1321) Logical Optimizer: Merge cascading foreach
[ https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated PIG-1321: - Status: Open (was: Patch Available) Logical Optimizer: Merge cascading foreach -- Key: PIG-1321 URL: https://issues.apache.org/jira/browse/PIG-1321 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1321-2.patch, pig-1321.patch We can merge consecutive foreach statement. Eg: b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1; c = foreach b generate b0#'kk1', b0#'kk2', b1, a1; = c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1321) Logical Optimizer: Merge cascading foreach
[ https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated PIG-1321: - Attachment: jira-1321-2.patch Regenerated the patch to fix some test failures and rebase with trunk's latest code changes. Logical Optimizer: Merge cascading foreach -- Key: PIG-1321 URL: https://issues.apache.org/jira/browse/PIG-1321 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1321-2.patch, pig-1321.patch We can merge consecutive foreach statement. Eg: b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1; c = foreach b generate b0#'kk1', b0#'kk2', b1, a1; = c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1321) Logical Optimizer: Merge cascading foreach
[ https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated PIG-1321: - Status: Patch Available (was: Open) Logical Optimizer: Merge cascading foreach -- Key: PIG-1321 URL: https://issues.apache.org/jira/browse/PIG-1321 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1321-2.patch, pig-1321.patch We can merge consecutive foreach statement. Eg: b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1; c = foreach b generate b0#'kk1', b0#'kk2', b1, a1; = c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Attachment: PIG-1557_1.patch New patch adds a unit test. couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch, PIG-1557_1.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Status: Patch Available (was: Open) Hadoop Flags: [Reviewed] couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch, PIG-1557_1.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Status: Resolved (was: Patch Available) Resolution: Fixed couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch, PIG-1557_1.patch I have a simple script:
A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
B = group A by name;
C = foreach B generate group, COUNT(A);
D = order C by $1;
E = limit D 10;
dump E;
I noticed a couple of issues with alias-to-job mapping: neither the load (A) nor the limit (E) shows up in the output. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902211#action_12902211 ] Olga Natkovich commented on PIG-1563: - The same needs to be done (and we need unit tests) for the following string manipulation functions: INDEXOF, LAST_INDEX_OF, REPLACE, SPLIT, TRIM. SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Script:
A = load 'studenttab10k' as (name, age, gpa);
C = foreach A generate SUBSTRING(name, 0, 5);
E = limit C 10;
dump E;
Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
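For illustration only (this is not the committed fix), the behavior a robust SUBSTRING UDF needs is null- and bounds-safety: return null rather than throw when the input is null or the indices fall outside the string. A minimal Java sketch under those assumptions:

```java
public class SubstringSketch {
    // Hypothetical null- and bounds-safe substring: clamps indices to the
    // string, and returns null for null input or an empty/invalid range,
    // so bad rows yield null fields instead of a failed task.
    public static String substring(String s, int begin, int end) {
        if (s == null) return null;
        if (begin < 0) begin = 0;
        if (end > s.length()) end = s.length();
        if (begin >= end) return null;
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        System.out.println(substring("studenttab", 0, 5)); // prints "stude"
        System.out.println(substring(null, 0, 5));         // prints "null"
    }
}
```

The same defensive pattern applies to the other string UDFs Olga lists: each should tolerate null inputs and out-of-range positions per row.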
[jira] Updated: (PIG-1564) add support for multiple filesystems
[ https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Hitchcock updated PIG-1564: -- Attachment: PIG-1564-1.patch At the moment you cannot, say, read from S3N and write to HDFS in one job (or even read from one S3N bucket and write to another). The essence of this patch is a change to the way HDataStorage works. Previously it mapped to one Hadoop FileSystem object, which basically limited jobs to a single FileSystem. Now it is a wrapper around all Hadoop FileSystems, returning the correct one based on the prefix of the path being requested. Another small change: previously Pig assumed the default home directory was '/user/username' on the default file system. This directory does not necessarily always exist, so I made it configurable with a new property, pig.initial.fs.name. add support for multiple filesystems Key: PIG-1564 URL: https://issues.apache.org/jira/browse/PIG-1564 Project: Pig Issue Type: Improvement Reporter: Andrew Hitchcock Attachments: PIG-1564-1.patch Currently you can't run Pig scripts that read data from one file system and write it to another. Also, Grunt doesn't support CDing from one directory to another on different file systems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
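The dispatch the patch describes (pick a FileSystem by the path's prefix, falling back to the default) can be sketched in plain Java. This is a simplified stand-in, not HDataStorage itself: the cache values here are just labels where the real code would hold Hadoop FileSystem instances.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class FsDispatchSketch {
    // Hypothetical cache keyed by scheme://authority, standing in for the
    // per-FileSystem lookup a multi-filesystem HDataStorage would perform.
    private final Map<String, String> cache = new HashMap<>();

    // Resolve a path to its filesystem key; scheme-less paths (e.g. plain
    // HDFS paths like /user/pig) fall back to the default filesystem URI.
    public String fileSystemFor(String path, String defaultFs) {
        URI uri = URI.create(path);
        URI dflt = URI.create(defaultFs);
        String scheme = uri.getScheme() != null ? uri.getScheme() : dflt.getScheme();
        String authority = uri.getAuthority() != null ? uri.getAuthority() : dflt.getAuthority();
        String key = scheme + "://" + (authority == null ? "" : authority);
        return cache.computeIfAbsent(key, k -> "fs(" + k + ")");
    }
}
```

With this shape, a job can hold one handle per distinct scheme/authority pair, so reading s3n://bucket/data while writing to hdfs://namenode/out resolves to two different filesystems instead of failing.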
[jira] Created: (PIG-1565) additional piggybank datetime and string UDFs
additional piggybank datetime and string UDFs - Key: PIG-1565 URL: https://issues.apache.org/jira/browse/PIG-1565 Project: Pig Issue Type: Improvement Reporter: Andrew Hitchcock Pig is missing a variety of UDFs that might be helpful for users implementing Pig scripts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1565) additional piggybank datetime and string UDFs
[ https://issues.apache.org/jira/browse/PIG-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Hitchcock updated PIG-1565: -- Status: Patch Available (was: Open) additional piggybank datetime and string UDFs - Key: PIG-1565 URL: https://issues.apache.org/jira/browse/PIG-1565 Project: Pig Issue Type: Improvement Reporter: Andrew Hitchcock Attachments: PIG-1565-1.patch Pig is missing a variety of UDFs that might be helpful for users implementing Pig scripts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1565) additional piggybank datetime and string UDFs
[ https://issues.apache.org/jira/browse/PIG-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Hitchcock updated PIG-1565: -- Attachment: PIG-1565-1.patch This patch provides a number of UDFs written by the Amazon Elastic MapReduce team that we feel are useful. A few of these UDFs duplicate existing functionality. I am including them because they are consistent with the rest of the UDFs in this patch and because I'd like to start a discussion about the best way to include these UDFs. Here is a list of what I believe to be duplicate UDFs: INDEX_OF, LAST_INDEX_OF, SPLIT_ON_REGEX. Here are descriptions of the provided UDFs.
datetime/ These are based on JodaTime and provide a similar model for date handling.
DATE_TIME A function that returns a DateTime String, of the form yyyy-MM-dd'T'HH:mm:ss.SSSZZ.
DURATION A function that returns a Duration as a long. A duration is a length of time specified in milliseconds.
EXTRACT_DT Extracts the integer numeric value of a field of a LocalDate, LocalTime, DateTime, Period or Duration.
FORMAT_DT Formats a LocalDate, LocalTime or DateTime into a String, given a format string.
LOCAL_DATE A function that returns a LocalDate String, of the form yyyy-MM-dd.
LOCAL_TIME A function that returns a LocalTime String, of the form HH:mm:ss.SSS.
OFFSET_DT Offsets a LocalDate, LocalTime or DateTime by a Period/Duration, returning an object of the same type.
PERIOD A function that returns a Period String. A Period is specified in terms of individual duration fields such as years and days.
string/ String handling functions modeled after Apache Commons StringUtils.
CAPITALIZE Capitalizes a String, changing the first letter to upper case.
CENTER Centers a String in a larger String.
CONCAT_WITH Joins the arguments with a String joiner.
EXTRACT Parses the input String with a regular expression and returns all matched groups.
FORMAT Formats a list of arguments into a single String.
INDEX_OF Finds the first index within a String, from an optional start position, handling null.
LAST_INDEX_OF Finds the last index within a String, from an optional start position, handling null.
LEFT_PAD Left pads a String to a given size.
REPEAT Repeats a String a given number of times to form a new String.
REPLACE_ONCE Replaces a String with another String inside a larger String, once.
RIGHT_PAD Right pads a String to a given size.
SPLIT_ON_REGEX Splits a String around matches of the given regular expression.
STRIP Strips any of a set of characters from the start and end of a String.
STRIP_END Strips any of a set of characters from the end of a String.
STRIP_START Strips any of a set of characters from the start of a String.
SWAP_CASE Swaps the case of a String, changing upper and title case to lower case, and lower case to upper case.
additional piggybank datetime and string UDFs - Key: PIG-1565 URL: https://issues.apache.org/jira/browse/PIG-1565 Project: Pig Issue Type: Improvement Reporter: Andrew Hitchcock Attachments: PIG-1565-1.patch Pig is missing a variety of UDFs that might be helpful for users implementing Pig scripts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
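To make a couple of the string descriptions concrete, here are plain-Java sketches of SWAP_CASE and STRIP following the StringUtils-style semantics described above. These are illustrative stand-ins, not the patch's piggybank implementations (which would extend Pig's EvalFunc).

```java
public class StringUdfSketches {
    // SWAP_CASE sketch: upper case becomes lower, lower becomes upper;
    // everything else passes through unchanged.
    public static String swapCase(String s) {
        if (s == null) return null;
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            if (Character.isUpperCase(c)) out.append(Character.toLowerCase(c));
            else if (Character.isLowerCase(c)) out.append(Character.toUpperCase(c));
            else out.append(c);
        }
        return out.toString();
    }

    // STRIP sketch: remove any of the given characters from both ends,
    // leaving interior occurrences alone.
    public static String strip(String s, String chars) {
        if (s == null) return null;
        int start = 0, end = s.length();
        while (start < end && chars.indexOf(s.charAt(start)) >= 0) start++;
        while (end > start && chars.indexOf(s.charAt(end - 1)) >= 0) end--;
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(swapCase("Pig 0.8"));       // prints "pIG 0.8"
        System.out.println(strip("xxhelloxy", "xy"));  // prints "hello"
    }
}
```

STRIP_START and STRIP_END are the same loop restricted to one end, which is why the three ship as a family.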
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Status: Open (was: Patch Available) multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Status: Patch Available (was: Open) multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Attachment: PIG-1518.patch Minor polish of debugging code inside comments multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
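The core idea behind such an umbrella input format, packing many small files into each split so one map task handles several files, can be sketched independently of the Hadoop API. This is a simplified illustration, not the PIG-1518 patch: it greedily packs (name, size) pairs into splits capped at maxSplitSize, ignoring block locality, which a real CombineFileInputFormat-style implementation would also weigh.

```java
import java.util.ArrayList;
import java.util.List;

public class CombineSplitsSketch {
    // Greedily pack files into splits no larger than maxSplitSize bytes.
    // A file larger than maxSplitSize still gets its own split.
    public static List<List<String>> combine(List<String> names, List<Long> sizes,
                                             long maxSplitSize) {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentSize = 0;
        for (int i = 0; i < names.size(); i++) {
            // Flush the current split when adding this file would overflow it.
            if (!current.isEmpty() && currentSize + sizes.get(i) > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(names.get(i));
            currentSize += sizes.get(i);
        }
        if (!current.isEmpty()) splits.add(current);
        return splits;
    }
}
```

With, say, three files of 60, 60, and 30 bytes and a 100-byte cap, this yields two splits instead of three map tasks, which is the inefficiency the issue describes.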