[jira] [Commented] (CRUNCH-50) Map-only parallelDos with multiple outputs should be fused

Josh Wills (JIRA) Mon, 27 Aug 2012 05:28:12 -0700

    [ 
https://issues.apache.org/jira/browse/CRUNCH-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442375#comment-13442375
 ]


Josh Wills commented on CRUNCH-50:
----------------------------------

Hey Ryan-- I think that CRUNCH-34 fixed this as a side effect of the planner 
refactoring-- if you would be up for applying that patch to master and trying 
it out, I would be much obliged.

As far as testing it goes, it seems like we should be able to take advantage of 
the PipelineResult object, which will indicate how many MR jobs Crunch had to 
run and can use Counters as a proxy for determining which pipeline tasks were 
assigned to which MR jobs.
                
> Map-only parallelDos with multiple outputs should be fused
> ----------------------------------------------------------
>
>                 Key: CRUNCH-50
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-50
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Ryan Brush
>
> I'm not sure if this is a bug or just an optimization yet to be implemented, 
> but sibling parallelDo operations that have separate outputs are not being 
> fused into a single Map operation.  (The original FlumeJava paper suggests 
> they should be, and it would be a nice optimization to have here as well.)  
> Instead, each parallelDo results in a separate Map-only job that 
> (redundantly) scans the input source of data.
> This can be seen in the current MultipleOutputIT integration test.  Notice 
> the logs below from running one of those tests scans the same input in 
> multiple jobs.
> 8414 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Running 
> job "org.apache.crunch.MultipleOutputIT: 
> Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+even+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/even)"
> 8415 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Job status 
> available at: http://localhost:8080/
> 8417 [Thread-38] WARN  org.apache.hadoop.mapred.JobClient  - Use 
> GenericOptionsParser for parsing the arguments. Applications should implement 
> Tool for the same.
> 8497 [Thread-38] WARN  org.apache.hadoop.mapred.JobClient  - No job jar file 
> set.  User classes may not be found. See JobConf(Class) or 
> JobConf#setJar(String).
> 8532 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Running 
> job "org.apache.crunch.MultipleOutputIT: 
> Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+odd+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/odd)"
> 8532 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Job status 
> available at: http://localhost:8080/
> I was going to take a stab at a patch for this, but noticed some major 
> refactoring in this space is on deck as part of CRUNCH-34...so it might be 
> best to address this after CRUNCH-34 lands.
> As an aside, it wasn't clear how to write a good integration test to expose 
> this functionality.  Would simply counting the stage results and ensuring we 
> have the expected number for a simple job be the best way?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CRUNCH-50) Map-only parallelDos with multiple outputs should be fused

Reply via email to