Dave Beech created CRUNCH-144:
---------------------------------

             Summary: Ability to re-use PCollections after a write without 
having to recompute them
                 Key: CRUNCH-144
                 URL: https://issues.apache.org/jira/browse/CRUNCH-144
             Project: Crunch
          Issue Type: Improvement
          Components: Core
    Affects Versions: 0.4.0
            Reporter: Dave Beech
            Assignee: Josh Wills


I have a pipeline that consists of several stages to process and filter a 
dataset. I would like to persist this dataset to HDFS and then perform further 
computation on it. 

Example:
1. ) Load text data A and convert to avro -> A'
2. ) Load text data B and convert to avro -> B'
3. ) Union A' and B' -> C
4. ) Filter C -> D

5. ) Write D to HDFS

6a. ) Use DoFn to extract strings from D -> E
6b. ) Aggregate E ( count strings ) -> F
6c. ) Convert F to HBase puts -> G
6d. ) Write G to HBase

Running this pipeline code generates two mapreduce jobs which run in parallel:
job A) runs steps 1, 2, 3, 4, 5
job B) runs steps 1, 2, 3, 4, 6abcd

If a "pipeline.run()" call is included after step 5, the same two jobs are run 
but sequentially. 

What I would like is to be able to hold on to the PCollection reference to "D", 
so that steps 6* can be run without going back to the start and re-doing all 
the work needed to generate it.

-- 

Ref to original discussion on crunch-user: 
http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201301.mbox/%3CCAH29n6MORejkxD%2ByRycRw40vxf4GruJ8m46AMjx_RGd6DvDUQA%40mail.gmail.com%3E
 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to