Yes. If you were feeling adventurous, you could add one processing path to the first stage beyond just the cache() and run() calls, like so:
PCollection<String> rawInput = pipeline.read(From.textFile(s3Path));
rawInput.cache();
PCollection<Foo> fooOutput = rawInput.parallelDo(...);
fooOutput.write(somewhere);
pipeline.run();

On Fri, Nov 13, 2015 at 9:46 AM, David Ortiz <[email protected]> wrote:

> So I would want something like…
>
> PCollection<String> rawInput = pipeline.read(From.textFile(s3Path));
> rawInput.cache();
> pipeline.run();
>
> // Do other processing followed by another pipeline.run()
>
> ?
>
> Thanks,
> Dave
>
> *From:* Josh Wills [mailto:[email protected]]
> *Sent:* Friday, November 13, 2015 12:44 PM
> *To:* [email protected]
> *Subject:* Re: MRPipeline.cache()
>
> To absolutely guarantee it only runs once, you should make reading/copying
> the data from S3 into HDFS its own job by inserting a Pipeline.run() after
> the call to cache() and before any subsequent processing on the data.
> cache() will write data locally, but if you have N processes that want to
> do something to the data, it won't necessarily guarantee that the caching
> happens before the rest of the processes start trying to read the data w/o
> a blocking call to run().
>
> J
>
> On Fri, Nov 13, 2015 at 7:34 AM, David Ortiz <[email protected]> wrote:
>
>> Hey,
>>
>> If I have a super expensive-to-read input data set (think hundreds of
>> GB of data on S3, for example), would I be able to use cache to make sure I
>> only do the read once, then hand it out to the jobs that need it, as
>> opposed to what Crunch does by default, which is to read it once for each
>> parallel thread that needs the data?
>>
>> Thanks,
>> Dave
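Putting the whole thread together, a minimal sketch of the two-stage pattern Josh describes might look like the following. The driver class, DoFn, PType, and output path (MyDriver, MyDoFn, fooType, outputPath) are illustrative placeholders, not names from the thread:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;

public class MyDriver {
  public static void main(String[] args) {
    String s3Path = args[0];
    String outputPath = args[1];
    Pipeline pipeline = new MRPipeline(MyDriver.class, new Configuration());

    // Stage 1: read the expensive S3 input and cache it. The run() call
    // blocks, so the cached copy is fully materialized before any
    // downstream job tries to read it -- this is what guarantees the
    // S3 read happens only once.
    PCollection<String> rawInput = pipeline.read(From.textFile(s3Path));
    rawInput.cache();
    pipeline.run();

    // Stage 2: subsequent processing now reads from the cached data
    // rather than going back to S3. MyDoFn and fooType are hypothetical.
    PCollection<Foo> fooOutput = rawInput.parallelDo(new MyDoFn(), fooType);
    fooOutput.write(To.textFile(outputPath));
    pipeline.done();
  }
}
```

The key design point is the blocking run() between the cache() call and the downstream parallelDo: without it, Crunch's planner may fuse the read into each consuming job, re-reading S3 once per consumer.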
