[jira] Updated: (PIG-569) Inconsistency with Hadoop in Pig load statements involving globs with subdirectories
[ https://issues.apache.org/jira/browse/PIG-569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Weil updated PIG-569:
---------------------------

    Environment: FC Linux x86/64, Pig revision 724576  (was: FC Linux x86/64)

Inconsistency with Hadoop in Pig load statements involving globs with subdirectories
------------------------------------------------------------------------------------

                Key: PIG-569
                URL: https://issues.apache.org/jira/browse/PIG-569
            Project: Pig
         Issue Type: Bug
         Components: impl
   Affects Versions: types_branch
        Environment: FC Linux x86/64, Pig revision 724576
           Reporter: Kevin Weil
            Fix For: types_branch

Pig cannot handle LOAD statements with Hadoop globs where the globs have subdirectories. For example:

    A = LOAD 'dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}' USING ...

A similar statement in Hadoop,

    hadoop dfs -ls dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}

does work correctly. The output of running the above load statement in Pig, built from svn revision 724576, is:

2008-12-17 12:02:28,480 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2008-12-17 12:02:28,480 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Map reduce job failed
2008-12-17 12:02:28,480 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - java.io.IOException: Unable to get collect for pattern dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}} [Failed to obtain glob for dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}]
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asCollection(HDataStorage.java:231)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asCollection(HDataStorage.java:40)
        at org.apache.pig.impl.io.FileLocalizer.globMatchesFiles(FileLocalizer.java:486)
        at org.apache.pig.impl.io.FileLocalizer.fileExists(FileLocalizer.java:455)
        at org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:108)
        at org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59)
        at org.apache.pig.impl.io.ValidatingInputFileSpec.<init>(ValidatingInputFileSpec.java:44)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:200)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:742)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:370)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.pig.backend.datastorage.DataStorageException: Failed to obtain glob for dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}
        ... 13 more
Caused by: java.io.IOException: Illegal file pattern: Expecting set closure character or end of range, or } for glob {dir1 at 5
        at org.apache.hadoop.fs.FileSystem$GlobFilter.error(FileSystem.java:1084)
        at org.apache.hadoop.fs.FileSystem$GlobFilter.setRegex(FileSystem.java:1069)
        at org.apache.hadoop.fs.FileSystem$GlobFilter.<init>(FileSystem.java:987)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:953)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:962)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:962)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:962)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:902)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:862)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asCollection(HDataStorage.java:215)
        ... 12 more

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
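A note on why the shell command succeeds while Pig fails: in `hadoop dfs -ls dir/{dir1/subdir1,...}` it is bash, not Hadoop, that expands the braces, so HDFS only ever sees plain paths. Hadoop's glob matcher (see `globPathsLevel` in the trace) splits the pattern on `/` before parsing, which cuts a brace group like `{dir1/subdir1,...}` in half and produces the "Illegal file pattern ... for glob {dir1" error. A client-side workaround is to do the same expansion the shell does before handing each path to the filesystem. The sketch below is a hypothetical helper, not part of Pig; it handles simple (non-nested) brace groups only:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not in Pig): expand {a,b,c} groups in a path
// client-side, mirroring what bash does for `hadoop dfs -ls dir/{a/x,b/y}`,
// so each expanded path can be globbed or opened individually.
public class BraceExpander {
    public static List<String> expand(String pattern) {
        List<String> out = new ArrayList<>();
        int open = pattern.indexOf('{');
        int close = pattern.indexOf('}', open);
        if (open < 0 || close < 0) {      // no brace group: nothing to expand
            out.add(pattern);
            return out;
        }
        String prefix = pattern.substring(0, open);
        String suffix = pattern.substring(close + 1);
        for (String alt : pattern.substring(open + 1, close).split(",")) {
            // Recurse so any later brace group in the suffix is expanded too.
            out.addAll(expand(prefix + alt + suffix));
        }
        return out;
    }
}
```

Each string returned by `expand` is free of braces and therefore safe to pass through Hadoop's per-level glob parser.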
[jira] Updated: (PIG-563) PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a pig query
[ https://issues.apache.org/jira/browse/PIG-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-563:
-------------------------------

    Attachment: PIG-563.patch

PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a pig query
-------------------------------------------------------------------------------------------------------

                Key: PIG-563
                URL: https://issues.apache.org/jira/browse/PIG-563
            Project: Pig
         Issue Type: Improvement
   Affects Versions: types_branch
           Reporter: Pradeep Kamath
           Assignee: Pradeep Kamath
            Fix For: types_branch
        Attachments: PIG-563.patch

Currently Pig's use of the combiner assumes the combiner is called exactly once in Hadoop. With Hadoop 18, the combiner can be called 0, 1, or more times. This issue is to track the changes needed in the CombinerOptimizer visitor and the builtin algebraic UDFs (SUM, COUNT, MIN, MAX, AVG) to work in this new model.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
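To see why the "0 or more times" contract matters, consider COUNT decomposed into the three algebraic stages the issue refers to. The key constraint is that the combiner stage must consume and produce the same partial-result type, so it stays correct whether Hadoop runs it zero times (the reducer sees raw Initial outputs) or several times (partials of partials). The class below is an illustrative sketch of that algebra, not Pig's actual EvalFunc/Algebraic API:

```java
import java.util.List;

// Illustrative sketch (not Pig's real Algebraic interface): COUNT split into
// the Initial/Intermediate/Final stages so the combiner may run 0, 1, or N times.
public class AlgebraicCount {
    // Initial: map side, emits one partial count per input tuple.
    public static long initial(Object tuple) {
        return 1L;
    }

    // Intermediate: combiner stage. Its output type equals its input type
    // (partial counts in, partial count out), so applying it repeatedly,
    // or not at all, cannot change the final answer.
    public static long intermediate(List<Long> partials) {
        long sum = 0;
        for (long p : partials) sum += p;
        return sum;
    }

    // Final: reduce side; the same fold over whatever partials arrive.
    public static long finalResult(List<Long> partials) {
        return intermediate(partials);
    }
}
```

With this shape, `finalResult([1,1,1])` (combiner skipped) and `finalResult([intermediate([1,1]), 1])` (combiner ran once on a subset) agree, which is exactly the invariant Hadoop 18's combiner contract demands.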
Re: [jira] Updated: (PIG-563) PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a pig query
IIRC, the last time support for combiners was added, Utkarsh unearthed a bunch of bugs (hence the restricted use of combiners in Pig) ... can't access the test cases in the patch, but hopefully those are also covered!

Regards,
Mridul

Pradeep Kamath (JIRA) wrote:

[ https://issues.apache.org/jira/browse/PIG-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-563:
-------------------------------

    Status: Patch Available  (was: Open)

Changes are in two main places:

1) CombinerOptimizer, which decides whether to use the combiner and also modifies the map/combine/reduce plans to use it
2) Builtin aggregate UDFs - SUM, MIN, MAX, AVG and their typed variants, and COUNT

The CombinerOptimizer is changed as follows. The combiner is used only in the case of a group by followed by "foreach generate simple-project*, algebraic-udf*", where a simple project is a projection of the group-by key (not a nested project like group.$0). Two new foreachs are inserted - one in the combine plan and one in the map plan - both based on the reduce foreach. The map foreach will have one inner plan for each inner plan in the foreach we're duplicating: for projections the plan will be the same, and for algebraic UDFs the plan will have the Initial version of the function. The combine foreach will likewise have one inner plan for each inner plan in the foreach we're duplicating: for projections, the project operators will be changed to project the same column as their position in the foreach, and for algebraic UDFs the plan will have the Intermediate version of the function. In the inner plans of the reduce foreach, for projections the project operators will be changed to project the same column as their position in the foreach; for algebraic UDFs, the plan will have the Final version of the function, and the input to the UDF will be a POProject which projects the column corresponding to the position of the UDF in the foreach.
The map plan is changed by replacing the existing local rearrange with a special operator, POPreCombinerLocalRearrange, which behaves like the regular local rearrange in getNext() as far as getting its input and constructing the key out of the input. It then returns a tuple with two fields - the key in the first position and the value inside a bag in the second position. This output format resembles the output of a Package, and it feeds the map foreach, which expects this format. Then a normal local rearrange is attached as the leaf of the map plan, with a project as its input which projects the key from the map foreach. The combine plan will have the POCombinerPackage (formerly POPostCombinerPackage), the combiner foreach, and a local rearrange. The reduce plan will have a POCombinerPackage and the modified foreach at its root. The UDFs are changed to have correct implementations for Initial, Intermediate and Final. TestBuiltin has also been changed to test this new setup.

PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a pig query
-------------------------------------------------------------------------------------------------------

                Key: PIG-563
                URL: https://issues.apache.org/jira/browse/PIG-563
            Project: Pig
         Issue Type: Improvement
   Affects Versions: types_branch
           Reporter: Pradeep Kamath
           Assignee: Pradeep Kamath
            Fix For: types_branch

Currently Pig's use of the combiner assumes the combiner is called exactly once in Hadoop. With Hadoop 18, the combiner can be called 0, 1, or more times. This issue is to track the changes needed in the CombinerOptimizer visitor and the builtin algebraic UDFs (SUM, COUNT, MIN, MAX, AVG) to work in this new model.
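Among the UDFs listed, AVG is the instructive case for why Initial/Intermediate/Final must be split this way: partial averages cannot be averaged again, so the partial result flowing between the map, combine, and reduce plans has to carry a (sum, count) pair, with the division deferred to Final. The sketch below is illustrative only - a plain-Java model of the algebra, not Pig's actual AVG implementation:

```java
// Illustrative sketch (not Pig's real AVG UDF): the partial result between
// stages is a (sum, count) pair, because averaging partial averages is wrong
// when the combiner runs a variable number of times.
public class AlgebraicAvg {
    // Initial: map side, one (sum, count) pair per input value.
    public static double[] initial(double v) {
        return new double[]{v, 1};
    }

    // Intermediate: combiner stage; folds pairs into a pair, so it is safe
    // to apply 0, 1, or N times over any grouping of the partials.
    public static double[] intermediate(double[][] partials) {
        double sum = 0, count = 0;
        for (double[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return new double[]{sum, count};
    }

    // Final: reduce side; only now is the division performed, once.
    public static double finalResult(double[][] partials) {
        double[] merged = intermediate(partials);
        return merged[0] / merged[1];
    }
}
```

For inputs 1, 3, 5, combining the first two early gives the pair (4, 2), and Final over {(4, 2), (5, 1)} yields 9/3 = 3.0 - the same as if the combiner had never run. A naive "average of averages" would instead give (2.0 + 5.0) / 2 = 3.5.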