Yes. If you were feeling adventurous, you could add one processing path to the first stage beyond just the cache() and run() calls, like so:
PCollection<String> rawInput = pipeline.read(From.textFile(s3Path));
rawInput.cache();
PCollection<Foo> fooOutput = rawInput.parallelDo(...);
fooOutput.write(somewhere);
pipeline.run();

On Fri, Nov 13, 2015 at 9:46 AM, David Ortiz <[email protected]> wrote:

> So I would want something like…
>
> PCollection<String> rawInput = pipeline.read(From.textFile(s3Path));
> rawInput.cache();
> pipeline.run();
>
> // Do other processing followed by another pipeline.run()
>
> ?
>
> Thanks,
> Dave
>
> *From:* Josh Wills [mailto:[email protected]]
> *Sent:* Friday, November 13, 2015 12:44 PM
> *To:* [email protected]
> *Subject:* Re: MRPipeline.cache()
>
> To absolutely guarantee it only runs once, you should make reading/copying
> the data from S3 into HDFS its own job by inserting a Pipeline.run() after
> the call to cache() and before any subsequent processing on the data.
> cache() will write data locally, but if you have N processes that want to
> do something to the data, it won't necessarily guarantee that the caching
> happens before the rest of the processes start trying to read the data w/o
> a blocking call to run().
>
> J
>
> On Fri, Nov 13, 2015 at 7:34 AM, David Ortiz <[email protected]> wrote:
>
>> Hey,
>>
>> If I have a super expensive-to-read input data set (think hundreds of
>> GB of data on S3, for example), would I be able to use cache to make sure I
>> only do the read once, then hand it out to the jobs that need it, as
>> opposed to what Crunch does by default, which is to read it once for each
>> parallel thread that needs the data?
>>
>> Thanks,
>> Dave
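Putting the whole thread together, a minimal sketch of the two-stage pattern Josh describes might look like the following. The driver class, DoFn, PType, and output path (MyDriver, MyDoFn, fooType, outputPath) are illustrative placeholders, not names from the thread:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;

public class MyDriver {
  public static void main(String[] args) {
    String s3Path = args[0];
    String outputPath = args[1];
    Pipeline pipeline = new MRPipeline(MyDriver.class, new Configuration());

    // Stage 1: read the expensive S3 input and cache it. The run() call
    // blocks, so the cached copy is fully materialized before any
    // downstream job tries to read it -- this is what guarantees the
    // S3 read happens only once.
    PCollection<String> rawInput = pipeline.read(From.textFile(s3Path));
    rawInput.cache();
    pipeline.run();

    // Stage 2: subsequent processing now reads from the cached data
    // rather than going back to S3. MyDoFn and fooType are hypothetical.
    PCollection<Foo> fooOutput = rawInput.parallelDo(new MyDoFn(), fooType);
    fooOutput.write(To.textFile(outputPath));
    pipeline.done();
  }
}
```

The key design point is the blocking run() between the cache() call and the downstream parallelDo: without it, Crunch's planner may fuse the read into each consuming job, re-reading S3 once per consumer.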
