I'm just back from vacation; I'll take a look at the planner error and see if I can come up with a workaround.
On Sat, Mar 29, 2014 at 1:53 PM, Jinal Shah <[email protected]> wrote: > Any updates on this? > On Mar 20, 2014 2:27 PM, "Jinal Shah" <[email protected]> wrote: > > > Sorry Micah U and V are totally different Types. Just wanted to clarify > it. > > > > > > On Thu, Mar 20, 2014 at 2:00 PM, Jinal Shah <[email protected] > >wrote: > > > >> Hey Micah, > >> > >> Yes you are right and this is what is going on in that // do something. > >> (Higher Level overview) > >> > >> Here U and V are same > >> > >> PCollection<V> collectionWhichCouldBeEmpty = null; > >> if(path.exists){ > >> collectionWhichCouldBeEmpty= pipeline.read(FromPath, PType.V); > >> } else{ > >> collectionWhichCouldBeEmpty = pipeline.emptyPCollection(); > >> } > >> > >> PCollection<U> collectionWhichHasData = DataComingFromDifferentSource(); > >> > >> PTable<K,V> VTable = collectionWhichCouldBeEmpty.by(PType<K>); > >> > >> PTable<K,U> UTable = collectionWhichHasData.by(PType<K>); > >> > >> UVTable = Join.join(UTable, VTable, Join.LEFT); > >> > >> pipeline.write(UVTable.values(), somePath, PType<U>); > >> > >> pipeline.run() // error is here > >> > >> > >> Hope this helps. > >> > >> > >> > >> > >> On Thu, Mar 20, 2014 at 12:50 PM, Micah Whitacre <[email protected] > >wrote: > >> > >>> Jinal can you elaborate on the "//do something" section of the code? I > >>> thought when I heard it described other PCollections were being joined > >>> with > >>> the emptyPCollection and it was the outcome of the joins and additional > >>> processing that was actually being persisted. > >>> > >>> > >>> On Thu, Mar 20, 2014 at 10:18 AM, Chao Shi <[email protected]> wrote: > >>> > >>> > Hi Josh and Jinal, > >>> > > >>> > This was introduced to help the following case: In one of our MR > >>> programs, > >>> > there is a command line option that one can optionally specify a path > >>> to > >>> > data to be joined on. Before introducing emptyCollection(), we have > to > >>> do > >>> > like this: > >>> > > >>> > Path path = ... > >>> > PCollection in1 = null; > >>> > if (path != null) { > >>> > in = pipeline.read(...); > >>> > } > >>> > PCollection in2 = pipeline.read(...); > >>> > if (in1 != null) { > >>> > in2 = in2.join(in1); > >>> > } > >>> > > >>> > You can see checks for null everywhere. With emptyPColleciton, we can > >>> do > >>> > this: > >>> > > >>> > if (path != null) { > >>> > in2 = pipeline.read(); > >>> > } else { > >>> > in2 = emptyPCollection(); > >>> > } > >>> > in1.join(in2) > >>> > > >>> > I think Jinal's case should be a bad case for our current > >>> implementation. > >>> > Perhaps we should change it to create an empty output directory > rather > >>> than > >>> > report an error, which doesn't start the MR and can save the job > >>> start-up > >>> > time. This is the benefit for knowing PCollection in plan-time. > >>> > > >>> > > >>> > 2014-03-16 23:34 GMT+08:00 Josh Wills <[email protected]>: > >>> > > >>> > > +chao > >>> > > > >>> > > Inlined. > >>> > > > >>> > > On Sat, Mar 15, 2014 at 12:34 PM, Jinal Shah < > >>> [email protected] > >>> > >wrote: > >>> > > > >>> > >> Hi, > >>> > >> > >>> > >> I actually came across a particular case I'm not sure whether the > >>> > behavior > >>> > >> is right or not. So here is what is happening I am getting No > Output > >>> > >> exception throwing while trying to run my Crunch job. On further > >>> > >> investigating I found that I was using Pipeline.emptyCollection(). > >>> So > >>> > here > >>> > >> is how my scenario looks like > >>> > >> > >>> > >> PCollection<V> collectionWhichCouldBeEmpty = null; > >>> > >> if(path.exists){ > >>> > >> collectionWhichCouldBeEmpty= pipeline.read(FromPath, PType.V); > >>> > >> } else{ > >>> > >> collectionWhichCouldBeEmpty = pipeline.emptyPCollection(); > >>> > >> } > >>> > >> > >>> > >> //do some operations > >>> > >> > >>> > >> pipeline.write(Target); > >>> > >> > >>> > >> pipeline.run()// this is where it is throwing the error > >>> > >> > >>> > >> > >>> > > >>> > https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/impl/mr/plan/MSCRPlanner.java#L287 > >>> > >> > >>> > >> On further debugging I found that the Vertex didn't have an input. > >>> > >> > >>> > >> > >>> > > >>> > https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/impl/mr/plan/MSCRPlanner.java#L275 > >>> > >> > >>> > >> > >>> > >> So If I use the pipeline.read and it creates an Empty PCollection > it > >>> > works > >>> > >> since it has the input source but If I create an Empty PCollection > >>> using > >>> > >> the pipeline.emptyPCollection which doesn't have an input source > >>> then it > >>> > >> fails > >>> > >> > >>> > >> Not sure if the case is missed or it has to be like this. > >>> > >> > >>> > > > >>> > > It's a good question, and I'm not sure of the answer. Added Chao to > >>> the > >>> > > To: line to ask him what the intention was in this case. > >>> > > > >>> > > > >>> > >> Thanks > >>> > >> Jinal > >>> > >> > >>> > > > >>> > > > >>> > > >>> > >> > >> > > > -- Director of Data Science Cloudera <http://www.cloudera.com> Twitter: @josh_wills <http://twitter.com/josh_wills>
