[jira] Updated: (PIG-569) Inconsistency with Hadoop in Pig load statements involving globs with subdirectories

2008-12-17 Thread Kevin Weil (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Weil updated PIG-569:
---

Environment: FC Linux x86/64, Pig revision 724576  (was: FC Linux x86/64)

 Inconsistency with Hadoop in Pig load statements involving globs with 
 subdirectories
 

 Key: PIG-569
 URL: https://issues.apache.org/jira/browse/PIG-569
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: types_branch
 Environment: FC Linux x86/64, Pig revision 724576
Reporter: Kevin Weil
 Fix For: types_branch


 Pig cannot handle LOAD statements with Hadoop globs where the globs have 
 subdirectories.  For example, 
 A = LOAD 'dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}' USING ...
 A similar statement in Hadoop, hadoop dfs -ls 
 dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}, does work correctly.
 The output of running the above load statement in Pig, built from SVN 
 revision 724576, is:
 2008-12-17 12:02:28,480 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 0% complete
 2008-12-17 12:02:28,480 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Map reduce job failed
 2008-12-17 12:02:28,480 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - java.io.IOException: Unable to get collect for pattern 
 dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}} [Failed to obtain glob for 
 dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}]
   at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.asCollection(HDataStorage.java:231)
   at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.asCollection(HDataStorage.java:40)
   at 
 org.apache.pig.impl.io.FileLocalizer.globMatchesFiles(FileLocalizer.java:486)
   at 
 org.apache.pig.impl.io.FileLocalizer.fileExists(FileLocalizer.java:455)
   at 
 org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:108)
   at 
 org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59)
   at 
 org.apache.pig.impl.io.ValidatingInputFileSpec.init(ValidatingInputFileSpec.java:44)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:200)
   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:742)
   at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:370)
   at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
   at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
   at java.lang.Thread.run(Thread.java:619)
 Caused by: org.apache.pig.backend.datastorage.DataStorageException: Failed to 
 obtain glob for dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}
   ... 13 more
 Caused by: java.io.IOException: Illegal file pattern: Expecting set closure 
 character or end of range, or } for glob {dir1 at 5
   at 
 org.apache.hadoop.fs.FileSystem$GlobFilter.error(FileSystem.java:1084)
   at 
 org.apache.hadoop.fs.FileSystem$GlobFilter.setRegex(FileSystem.java:1069)
   at 
 org.apache.hadoop.fs.FileSystem$GlobFilter.init(FileSystem.java:987)
   at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:953)
   at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:962)
   at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:962)
   at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:962)
   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:902)
   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:862)
   at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.asCollection(HDataStorage.java:215)
   ... 12 more
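A plausible explanation for the inconsistency (an assumption, not confirmed in this report): `hadoop dfs -ls` works because the shell expands the braces into three separate paths before Hadoop ever sees the pattern, while Pig hands the raw pattern to FileSystem.globStatus, whose globPathsLevel walks the path one '/'-separated component at a time. The slash inside the braces then cuts the group apart, leaving an unterminated `{dir1` fragment, which matches the "Expecting set closure ... for glob {dir1 at 5" message above. A minimal sketch of both behaviors:

```python
import re

def shell_brace_expand(pattern):
    """Rough sketch of what the shell does before Hadoop sees the
    pattern: expand each {a,b,c} group into separate paths."""
    m = re.search(r'\{([^{}]*)\}', pattern)
    if not m:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    out = []
    for alt in m.group(1).split(','):
        out.extend(shell_brace_expand(head + alt + tail))
    return out

def has_unclosed_brace(component):
    # A '{' with no matching '}' inside the same path component is an
    # illegal pattern for a per-component glob matcher.
    depth = 0
    for ch in component:
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
    return depth != 0

pattern = 'dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}'

# Shell path: braces expanded first, so Hadoop sees three plain paths.
print(shell_brace_expand(pattern))

# Pig path: the pattern is split on '/' per level first, as
# FileSystem.globPathsLevel does, so the second component is '{dir1'
# -- exactly the fragment named in the error message.
components = pattern.split('/')
print(components)
bad = [c for c in components if has_unclosed_brace(c)]
print(bad)
```

Under this reading, the fix would be to expand (or at least validate) brace groups before splitting the pattern into path levels.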

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-563) PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a Pig query

2008-12-17 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-563:
---

Attachment: PIG-563.patch

 PERFORMANCE: enable combiner to be called 0 or more times whenever the 
 combiner is used for a Pig query
 --

 Key: PIG-563
 URL: https://issues.apache.org/jira/browse/PIG-563
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-563.patch


 Currently Pig's use of the combiner assumes the combiner is called exactly 
 once in Hadoop. With Hadoop 0.18, the combiner can be called 0, 1, or more 
 times. This issue tracks the changes needed in the CombinerOptimizer visitor 
 and the builtin algebraic UDFs (SUM, COUNT, MIN, MAX, AVG) so that they work 
 under this new model.
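The contract this implies can be sketched as follows (a sketch with hypothetical function names, not Pig's actual Algebraic classes): the Initial stage runs once per input tuple on the map side, the Intermediate stage may run zero or more times in the combiner, and the Final stage runs once in the reduce; so Intermediate must accept its own output as input. COUNT under that model:

```python
# Sketch of an algebraic COUNT under the "combiner runs 0+ times" model.
# Initial: one input tuple -> a partial count of 1.
# Intermediate: a bag of partial counts -> one partial count; it may run
#   any number of times, including zero, so it must consume its own output.
# Final: a bag of partial counts -> the answer.

def count_initial(tup):
    return 1

def count_intermediate(partials):
    return sum(partials)

def count_final(partials):
    return sum(partials)

def run(tuples, combiner_passes):
    partials = [count_initial(t) for t in tuples]
    for _ in range(combiner_passes):
        # Each combiner pass re-combines the partials it has seen so far.
        partials = [count_intermediate(partials)]
    return count_final(partials)

data = ['a'] * 7
# The answer is the same whether the combiner runs 0, 1, or 3 times.
print([run(data, n) for n in (0, 1, 3)])
```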




Re: [jira] Updated: (PIG-563) PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a Pig query

2008-12-17 Thread Mridul


IIRC, the last time support for combiners was added, Utkarsh unearthed 
a bunch of bugs (hence the restricted use of combiners in Pig) ... I can't 
access the test cases in the patch, but hopefully those are also covered!


Regards,
Mridul

Pradeep Kamath (JIRA) wrote:

 [ 
https://issues.apache.org/jira/browse/PIG-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-563:
---

Status: Patch Available  (was: Open)

Changes are in two main places:
1) CombinerOptimizer, which decides whether to use the combiner and modifies 
the map/combine/reduce plans to use it
2) Builtin aggregate UDFs - SUM, MIN, MAX, AVG (and their typed variants) and 
COUNT

The CombinerOptimizer is changed as follows:
The combiner is used only for a group by followed by a foreach generating 
simple project*, algebraic udf*, where a simple project is a projection of the 
group-by key (not a nested project like group.$0). Two new foreach operators 
are inserted - one in the combine plan and one in the map plan - both based on 
the reduce foreach. The map foreach has one inner plan for each inner plan in 
the foreach being duplicated: for projections the plan is the same, and for 
algebraic UDFs the plan holds the Initial version of the function. The combine 
foreach likewise has one inner plan per inner plan in the duplicated foreach: 
its project operators are changed to project the column matching their 
position in the foreach, and for algebraic UDFs the plan holds the 
Intermediate version of the function. In the inner plans of the reduce 
foreach, the project operators are changed the same way, and for algebraic 
UDFs the plan holds the Final version of the function, whose input is a 
POProject projecting the column corresponding to the UDF's position in the 
foreach.

The map plan is changed by replacing the existing local rearrange with a 
special operator, POPreCombinerLocalRearrange, which behaves like the regular 
local rearrange in getNext() as far as reading its input and constructing the 
key, but then returns a tuple with two fields: the key in the first position 
and the value, inside a bag, in the second. This output resembles the format 
produced by a Package, which is what the map foreach expects. A normal local 
rearrange is then attached as the leaf of the map plan, with a project as its 
input that projects the key out of the map foreach. The combine plan has a 
POCombinerPackage (formerly POPostCombinerPackage), the combine foreach, and a 
local rearrange. The reduce plan has a POCombinerPackage and the modified 
foreach at its root.

The UDFs are changed to have correct implementations of Initial, Intermediate, 
and Final. TestBuiltin has also been changed to test the new setup.
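The three-stage rewrite described above can be sketched for AVG (with hypothetical stand-in functions, not Pig's actual classes), where the map foreach applies the Initial version, the combine foreach applies the Intermediate version zero or more times, and the reduce foreach applies the Final version. AVG needs its partial state carried as a (sum, count) pair so that re-combining stays correct:

```python
# Sketch of the map/combine/reduce foreach rewrite for AVG, assuming
# partial state is a (sum, count) pair.

def avg_initial(value):
    # Map-side foreach: one value -> one partial (sum, count).
    return (value, 1)

def avg_intermediate(partials):
    # Combine-side foreach: fold a bag of partials into one partial.
    return (sum(s for s, c in partials), sum(c for s, c in partials))

def avg_final(partials):
    # Reduce-side foreach: fold the remaining partials, then divide.
    s, c = avg_intermediate(partials)
    return s / c

values = [2, 4, 6, 8]
partials = [avg_initial(v) for v in values]

no_combine = avg_final(partials)                       # combiner ran 0 times
one_combine = avg_final([avg_intermediate(partials)])  # combiner ran once
print(no_combine, one_combine)
```

Averaging the per-partial averages instead of carrying (sum, count) would give a wrong answer under repeated combining, which is exactly the class of bug this rewrite guards against.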


  
