[
https://issues.apache.org/jira/browse/CRUNCH-50?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Wills updated CRUNCH-50:
-----------------------------
Affects Version/s: 0.3.0
Fix Version/s: 0.4.0
> Map-only parallelDos with multiple outputs should be fused
> ----------------------------------------------------------
>
> Key: CRUNCH-50
> URL: https://issues.apache.org/jira/browse/CRUNCH-50
> Project: Crunch
> Issue Type: Improvement
> Affects Versions: 0.3.0
> Reporter: Ryan Brush
> Fix For: 0.4.0
>
> Attachments: MULTIOUTPUT_FUSE_TEST.patch
>
>
> I'm not sure if this is a bug or just an optimization yet to be implemented,
> but sibling parallelDo operations that have separate outputs are not being
> fused into a single Map operation. (The original FlumeJava paper suggests
> they should be, and it would be a nice optimization to have here as well.)
> Instead, each parallelDo results in a separate Map-only job that
> (redundantly) scans the input source of data.
> This can be seen in the current MultipleOutputIT integration test. Notice
> the logs below from running one of those tests scans the same input in
> multiple jobs.
> 8414 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Running
> job "org.apache.crunch.MultipleOutputIT:
> Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+even+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/even)"
> 8415 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Job status
> available at: http://localhost:8080/
> 8417 [Thread-38] WARN org.apache.hadoop.mapred.JobClient - Use
> GenericOptionsParser for parsing the arguments. Applications should implement
> Tool for the same.
> 8497 [Thread-38] WARN org.apache.hadoop.mapred.JobClient - No job jar file
> set. User classes may not be found. See JobConf(Class) or
> JobConf#setJar(String).
> 8532 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Running
> job "org.apache.crunch.MultipleOutputIT:
> Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+odd+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/odd)"
> 8532 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Job status
> available at: http://localhost:8080/
> I was going to take a stab at a patch for this, but noticed some major
> refactoring in this space is on deck as part of CRUNCH-34...so it might be
> best to address this after CRUNCH-34 lands.
> As an aside, it wasn't clear how to write a good integration test to expose
> this functionality. Would simply counting the stage results and ensuring we
> have the expected number for a simple job be the best way?
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira