Let me see if I can transform the thing I've got already into a more suitable shape
On 4 June 2014 17:10, Josh Wills <[email protected]> wrote: > Tracking this here: https://issues.apache.org/jira/browse/CRUNCH-412 > > David, do you have a patch that would be easy to adapt here? > > > On Wed, Jun 4, 2014 at 8:05 AM, Gabriel Reid <[email protected]> > wrote: > > > On Wed, Jun 4, 2014 at 4:53 PM, Josh Wills <[email protected]> wrote: > > > As long as we're on the subject, the fact that we don't > > > serialize/deserialize DoFns before we run MemPipelines has burned me a > > few > > > times when I would make a change and forget about the serialization > > > implications until I tried to run the job. One of the things I like > about > > > the local Spark implementation is that it does this serialization check > > for > > > you. So if we were to have some mode that could be enabled to allow us > to > > > simulate DoFn serialization and object re-use situations locally, even > at > > > the cost of extra runtime overhead, I would be happy. > > > > Yeah, that sounds like a good plan -- it could probably be done via an > > overload of MemPipeline.getInstance() that would take a boolean flag > > to replication MR or not. > > > > > > > > > > > > > On Wed, Jun 4, 2014 at 4:27 AM, Gabriel Reid <[email protected]> > > wrote: > > > > > >> Hi David, > > >> > > >> Yeah, the whole object reuse situation in MapReduce is quite a drag. > > >> By the way, this isn't specific to Avro -- it's just as big of an > > >> issue with Writables. > > >> > > >> I'm kind of torn on the idea of building this into MemPipeline. On the > > >> one hand, the closer the behavior of different pipeline > > >> implementations are, the better, and I think MemPipeline is currently > > >> indeed largely used for fast testing of MR workflows. > > >> > > >> On the other hand, the MemPipeline can also be seen as an optimized > > >> implementation for running a pipeline faster if everything fits in > > >> memory, so adding in extra serialization/deserialization just to be > > >> similar to the MRPipeline would be unfortunately. Additionally, > > >> there's also the SparkPipeline implementation to consider -- I assume > > >> that spark doesn't have this same object reuse situation, although I'm > > >> not actually sure. > > >> > > >> I think that my general feeling is that I'd rather not add object > > >> reuse into MemPipeline, but I do think it would be great if we could > > >> bring the behavior of the two implementations in line. Any other > > >> thoughts on how we could do this? > > >> > > >> - Gabriel > > >> > > >> > > >> On Wed, Jun 4, 2014 at 11:51 AM, David Whiting < > [email protected]> > > >> wrote: > > >> > We are facing some problems where errors are not found with > > MemPipeline > > >> > tests which do happen on the cluster due to the way Avro reuses > > objects > > >> for > > >> > reading; meaning that if you do a parallelDo and a PGroupedTable, > then > > >> the > > >> > Iterable of values you get is actually the same object returned to > you > > >> with > > >> > different contents. If you then try and store certain instances of > > this > > >> to > > >> > be emitted later, then you will get unexpected results without > taking > > a > > >> > copy. > > >> > > > >> > To try and catch these problems on the local side, we've written a > > >> wrapper > > >> > for Iterable<? extends SpecificRecord> which uses the > > SpecificDatumWriter > > >> > and SpecificDatumReader to write to and from ByteBuffers for each > > >> iteration > > >> > and offering the last record for reuse. This allows us to run the > > >> offending > > >> > MapFn or DoFn in isolation with a wrapped Iterable as input to > > identify > > >> and > > >> > fix the problem. > > >> > > > >> > This, however, is a little awkward and it would be much neater if > this > > >> was > > >> > driven from the Crunch side. At the point in the MapShuffler where > the > > >> > Iterable is wrapped in a SingleUseIterable, it could also be wrapped > > in > > >> one > > >> > of these AvroReadSimulatingIterable, meaning that people could find > > these > > >> > problems before even running them live for the first time. However, > > this > > >> is > > >> > a problem that is specific to Avro so it obviously makes no sense to > > hack > > >> > it right into HFunction. > > >> > > > >> > Anyone have any thoughts on how this could be integrated nicely, or > > >> indeed > > >> > if it should be integrated at all? > > >> > > > > > > -- > Director of Data Science > Cloudera <http://www.cloudera.com> > Twitter: @josh_wills <http://twitter.com/josh_wills> >
