Ryan Brush created CRUNCH-50:
--------------------------------

             Summary: Map-only parallelDos with multiple outputs should be fused
                 Key: CRUNCH-50
                 URL: https://issues.apache.org/jira/browse/CRUNCH-50
             Project: Crunch
          Issue Type: Improvement
            Reporter: Ryan Brush


I'm not sure if this is a bug or just an optimization yet to be implemented, 
but sibling parallelDo operations that have separate outputs are not being 
fused into a single Map operation.  (The original FlumeJava paper suggest they 
should be, and it would be a nice optimization to have here as well.)  Instead, 
each parallelDo results in a separate Map-only job that (redundantly) scans the 
input source of data.

This can be seen in the current MultipleOutputIT integration test.  Notice the 
logs below from running one of those tests scans the same input in multiple 
jobs.

8414 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Running job 
"org.apache.crunch.MultipleOutputIT: 
Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+even+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/even)"
8415 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Job status 
available at: http://localhost:8080/
8417 [Thread-38] WARN  org.apache.hadoop.mapred.JobClient  - Use 
GenericOptionsParser for parsing the arguments. Applications should implement 
Tool for the same.
8497 [Thread-38] WARN  org.apache.hadoop.mapred.JobClient  - No job jar file 
set.  User classes may not be found. See JobConf(Class) or 
JobConf#setJar(String).
8532 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Running job 
"org.apache.crunch.MultipleOutputIT: 
Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+odd+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/odd)"
8532 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Job status 
available at: http://localhost:8080/

I was going to take a stab at a patch for this, but noticed some major 
refactoring in this space is on deck as part of CRUNCH-34...so it might be best 
to address this after CRUNCH-34 lands.

As an aside, it wasn't clear how to write a good integration test to expose 
this functionality.  Would simply counting the stage results and ensuring we 
have the expected number for a simple job be the best way?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to