[ 
https://issues.apache.org/jira/browse/CRUNCH-144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Wills updated CRUNCH-144:
------------------------------

    Attachment: CRUNCH-144b.patch

The aforementioned ugly, but functional, patch.
                
> Ability to re-use PCollections after a write without having to recompute them
> -----------------------------------------------------------------------------
>
>                 Key: CRUNCH-144
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-144
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.4.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>         Attachments: CRUNCH-144b.patch, CRUNCH-144.patch
>
>
> I have a pipeline that consists of several stages to process and filter a 
> dataset. I would like to persist this dataset to HDFS and then perform 
> further computation on it. 
> Example:
> 1.) Load text data A and convert to Avro -> A'
> 2.) Load text data B and convert to Avro -> B'
> 3.) Union A' and B' -> C
> 4.) Filter C -> D
> 5.) Write D to HDFS
> 6a.) Use DoFn to extract strings from D -> E
> 6b.) Aggregate E (count strings) -> F
> 6c.) Convert F to HBase puts -> G
> 6d.) Write G to HBase
> Running this pipeline code generates two MapReduce jobs, which run in parallel:
> job A) runs steps 1, 2, 3, 4, 5
> job B) runs steps 1, 2, 3, 4, 6abcd
> If a "pipeline.run()" call is included after step 5, the same two jobs are 
> still run, but sequentially. 
> What I would like is to be able to hold on to the PCollection reference to 
> "D", so that steps 6* can be run without going back to the start and re-doing 
> all the work needed to generate it.
> -- 
> Ref to original discussion on crunch-user: 
> http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201301.mbox/%3CCAH29n6MORejkxD%2ByRycRw40vxf4GruJ8m46AMjx_RGd6DvDUQA%40mail.gmail.com%3E
>  
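
The duplicated work described in the issue can be illustrated with a rough toy model. This is only a sketch in plain Python, not Crunch's planner or API: the step names follow the numbered list in the issue, and the names UPSTREAM_OF_D, SINKS, executions_without_reuse and executions_with_reuse are made up for illustration. It simply counts how often each step would execute when every job starts from the raw inputs, versus when the steps that produce D run once and later jobs read the persisted result.

```python
from collections import Counter

# Toy model only: NOT Crunch's planner, just a count of step executions.
# Step names follow the numbered list in the issue description.
UPSTREAM_OF_D = ["1", "2", "3", "4"]
SINKS = {
    "job A (write D to HDFS)": UPSTREAM_OF_D + ["5"],
    "job B (write G to HBase)": UPSTREAM_OF_D + ["6a", "6b", "6c", "6d"],
}


def executions_without_reuse(sinks):
    """Every job recomputes all of its steps from the raw inputs."""
    counts = Counter()
    for steps in sinks.values():
        counts.update(steps)
    return counts


def executions_with_reuse(sinks, materialized):
    """Steps in `materialized` run once; later jobs read the stored result."""
    counts = Counter()
    done = set()
    for steps in sinks.values():
        for step in steps:
            if step in materialized and step in done:
                continue  # already persisted by an earlier job, so skip it
            counts[step] += 1
            done.add(step)
    return counts


if __name__ == "__main__":
    print("without reuse:", dict(executions_without_reuse(SINKS)))
    print("with reuse:   ", dict(executions_with_reuse(SINKS, set(UPSTREAM_OF_D))))
```

In this toy run, steps 1 to 4 execute twice without reuse and once with it, which is the saving the issue asks for when D has already been written to HDFS.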

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
