Thanks Josh - that works. At least I was only 2 characters away from the right answer! ;)
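For the archives, here's roughly what the working version of the pipeline looks like now. The paths, the stand-in DoFn and the String schema below are just placeholders to show the shape of it, not my real pipeline:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.At;
    import org.apache.crunch.io.From;
    import org.apache.crunch.types.avro.Avros;

    public class SequentialPipeline {
      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(SequentialPipeline.class);

        PCollection<String> input = pipeline.readTextFile("/data/input");

        // Placeholder transformation standing in for the real intermediate steps
        PCollection<String> intermediate = input.parallelDo(
            new DoFn<String, String>() {
              @Override
              public void process(String line, Emitter<String> emitter) {
                emitter.emit(line.toUpperCase());
              }
            }, Avros.strings());

        // At.avroFile gives a SourceTarget, so the written output can be read
        // back later; To.avroFile is a plain Target, so it can't be
        intermediate.write(At.avroFile("/data/intermediate", Avros.strings()));

        // Run the first MapReduce job now, before declaring the follow-on steps
        pipeline.run();

        // Re-read the materialized output so the second job picks up where the
        // first left off instead of recomputing the whole PCollection
        PCollection<String> reloaded = pipeline.read(
            From.avroFile("/data/intermediate", Avros.strings()));

        // ... follow-on operations on 'reloaded' go here ...

        pipeline.done();
      }
    }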
On 16 January 2013 15:40, Josh Wills <[email protected]> wrote:

> Hey Dave,
>
> I forgot to tell you something important: your intermediate job should use
> At.avroFile(...) instead of To.avroFile(...) since you're planning on
> consuming additional data from it. If you do that, I believe it will work
> as expected (two sequential jobs, with the second one picking up where the
> first one left off). In any case, we should make that transparent to users,
> so I'm writing a small patch to do the underlying Target -> SourceTarget
> conversion automatically when we can.
>
> Josh
>
>
> On Wed, Jan 16, 2013 at 2:34 AM, Dave Beech <[email protected]> wrote:
>
>> Hi Josh. A follow-up just to check I've got this straight.
>>
>> I've amended my pipeline and added a "pipeline.run()" call after the
>> write to HDFS. Now I do get two MapReduce jobs, but instead of the second
>> carrying on where the first left off, it actually re-does all the steps
>> needed to generate the PCollection that was written. I get the same jobs A
>> and B I described in my original email, but running sequentially rather
>> than in parallel. Is that what you'd expect?
>>
>> So I guess what I have to do following the write is re-read from the
>> output path using pipeline.read(From.avroFile(...)).
>>
>> It'd be good if the pipeline could hold onto information about
>> PCollections even after they're written, so that they can be used by
>> follow-on steps. I'll file a JIRA to this effect so we can discuss it
>> there.
>>
>> Thanks,
>> Dave
>>
>>
>> On 15 January 2013 21:00, Dave Beech <[email protected]> wrote:
>>
>>> Thanks Josh - that's great. I'll file a JIRA about the side-outputs
>>> feature, but the pipeline.run() call will serve my purpose for now.
>>>
>>> Cheers,
>>> Dave
>>>
>>> On 15 January 2013 18:03, Josh Wills <[email protected]> wrote:
>>>
>>>> Hey Dave,
>>>>
>>>> The way to force a sequential run would be to call pipeline.run() after
>>>> you write D to HDFS and before you declare the operations in step 6.
>>>> What we would really want here is a single MapReduce job that wrote
>>>> side outputs on the map side to create the dataset in step D, but we
>>>> don't have support for side-outputs in maps yet. Worth filing a JIRA,
>>>> I think.
>>>>
>>>> Thanks!
>>>> Josh
>>>
>>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
