Gianmarco, I think you're absolutely correct. Thanks for weighing in :)
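
To make the recap quoted below a bit more concrete, here is a minimal sketch of what a raw comparator gets to see. It uses only the JDK. `RawCompareSketch`, `serialize`, and `compareRaw` are hypothetical names, and a 4-byte big-endian int stands in for a serialized key (an assumption for illustration, not Pig's real key encoding). The only thing borrowed from Hadoop is the parameter shape of `RawComparator.compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)`: each key arrives as a slice of a buffer of serialized keys, which would be consistent with the guess that the comparator never meets the R1/R2/R3 markers.

```java
import java.nio.ByteBuffer;

class RawCompareSketch {
    // Hypothetical key encoding: a 4-byte big-endian int (NOT Pig's real format).
    static byte[] serialize(int key) {
        return ByteBuffer.allocate(4).putInt(key).array();
    }

    // Same parameter shape as Hadoop's RawComparator.compare: each key is a
    // (buffer, start offset, length) slice, compared without deserializing
    // a full key object.
    static int compareRaw(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int k1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int k2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(k1, k2);
    }

    public static void main(String[] args) {
        // Two serialized keys packed back to back in one buffer, as a sort
        // buffer might hold them; only key bytes are present.
        byte[] buf = ByteBuffer.allocate(8).putInt(7).putInt(-3).array();
        int cmp = compareRaw(buf, 0, 4, buf, 4, 4);
        System.out.println(cmp > 0 ? "7 sorts after -3" : "unexpected order");
    }
}
```

Decoding the int rather than comparing bytes lexicographically is deliberate: a plain signed byte-wise comparison would mis-order negative values in this toy encoding.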

2012/6/9 Gianmarco De Francisci Morales <g...@apache.org>

> So, to recap.
>
> InterSedes writes the R1/R2/R3 thing.
> I am quite sure it is done for splittability purposes.
> The RawComparators, as well as InterStorage, operate on binary data that
> does not see this thing.
>
> DISCLAIMER: Guesswork below!
>
> My wild guess is that InterRecordReader has nothing to do with the
> RawComparator.
> The former is used at the input of the map phase (and InterRecordWriter is
> used at the output of the reduce phase) while RawComparator is used at the
> boundary between map and reduce phase.
> If you look at PigGenericMapReduce you will see that the mapper/reducer
> always writes a PigNullableWritable as a key. I think this is the place
> where the RawComparator actually plays a role, which means it does not
> directly see the R1/R2/R3 because they are stripped out by the
> RecordReader.
>
>
> Cheers,
>
> --
> Gianmarco
>
>
>
> On Sun, May 27, 2012 at 10:28 AM, Jonathan Coveney <jcove...@gmail.com> wrote:
>
> > Ashutosh, that definitely does help. Thanks for lending your insight. I
> > think the thing I have little color on at the moment is the relationship
> > between those raw bits, i.e. RECORD_1 RECORD_2 RECORD_3 TUPLE_BITS and so
> > on, and the various byte[] compare functions.
> >
> > 2012/5/27 Ashutosh Chauhan <hashut...@apache.org>
> >
> > > Hey Jon,
> > >
> > > You raised some interesting questions. I don't have answers for all of
> > > them, but I do for a few.
> > >
> > > * BinStorage is a legacy format that was previously used for intermediate
> > > serialization between MR jobs. It is no longer used for that, but it is
> > > still there because, unfortunately, folks have stored their end data
> > > using BinStorage, even though it was considered an internal format and
> > > subject to change. The reason folks chose to store data with it was that
> > > BinStorage was schema aware, so once you wrote end data with it, you
> > > could reload it without specifying a schema. This feature led to its
> > > (mis)use. See https://issues.apache.org/jira/browse/PIG-798 for some
> > > related bugs around this.
> > >
> > > * I think you have the correct intuition: in addition to identifying
> > > tuple boundaries, R1, R2, R3 are also used to identify block boundaries,
> > > that is, to make the file splittable. You can then arbitrarily split the
> > > files among multiple mappers, and each will know where its first record
> > > starts.
> > >
> > > Hope it helps,
> > > Ashutosh
> > >
> > > On Sat, May 26, 2012 at 9:04 PM, Jonathan Coveney <jcove...@gmail.com> wrote:
> > >
> > > > I appreciate it, Gianmarco :)
> > > >
> > > > 2012/5/26 Gianmarco De Francisci Morales <g...@apache.org>
> > > >
> > > > > I am not sure, but I will have a look at it (I implemented the raw
> > > > > comparator for secondary sort).
> > > > > I don't remember having to deal with this issue.
> > > > >
> > > > > Cheers,
> > > > > --
> > > > > Gianmarco
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, May 25, 2012 at 11:07 PM, Jonathan Coveney <jcove...@gmail.com> wrote:
> > > > >
> > > > > > I'll just bump this once. The main thing I'm still unsure about is
> > > > > > the relationship between the various raw comparators, Pig, and
> > > > > > Hadoop. If we're serializing
> > > > > > RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3,
> > > > > > Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2,
> > > > > > RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, and so on,
> > > > > > how come it appears that the raw comparators aren't aware of it?
> > > > > >
> > > > > > 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
> > > > > >
> > > > > > > And one more question to pile on:
> > > > > > >
> > > > > > > What defines the binary data that the raw tuple comparator will be
> > > > > > > run on? It seems like it comes from Hadoop, and the format
> > > > > > > generally makes sense (you get bytes and do with them what you
> > > > > > > will). The thing that confuses me is why you don't have to deal
> > > > > > > with the RECORD_1/RECORD_2/RECORD_3/etc. hoopla. InterStorage
> > > > > > > deals with all of that and reads a deserialized tuple... so at
> > > > > > > what point do you get binary Tuple data that doesn't have all of
> > > > > > > the split stuff? I'll keep digging through, but this is where my
> > > > > > > ignorance of the technicalities of the MR layer comes in...
> > > > > > >
> > > > > > > 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
> > > > > > >
> > > > > > >> Another question is clarifying what BinStorage does compared to
> > > > > > >> InterStorage. It looks like it might just be a legacy storage
> > > > > > >> format?
> > > > > > >>
> > > > > > >> I'm assuming that you do the R_1/R_2/R_3 to be able to find the
> > > > > > >> next Tuple in the stream, but once you do that, can't you just
> > > > > > >> read a tuple, then skip 12 bytes (3 ints), and keep reading?
> > > > > > >>
> > > > > > >>
> > > > > > >> 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
> > > > > > >>
> > > > > > >>> I'm trying to understand how intermediate serialization in Pig
> > > > > > >>> works at a deeper level (understanding the whole code path, not
> > > > > > >>> just BinInterSedes in its own vacuum). Right now I am looking at
> > > > > > >>> InterRecordReader/InterRecordWriter/InterStorage. Is that the
> > > > > > >>> right place to look for understanding how BinInterSedes is
> > > > > > >>> actually called?
> > > > > > >>>
> > > > > > >>> Further, I'm trying to better understand the
> > > > > > >>> RECORD_1/RECORD_2/RECORD_3 thing. My guess is that it's to make
> > > > > > >>> the file splittable? But I'm not really sure. I'd love any
> > > > > > >>> pointers about where to look for how BinInterSedes is used, and
> > > > > > >>> how intermediate storage happens.
> > > > > > >>>
> > > > > > >>> Thanks!
> > > > > > >>> Jon
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
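
On the splittability point discussed in the thread above, the idea can be sketched as follows. This is a toy illustration under stated assumptions, not Pig's actual reader: `SyncMarkerSketch`, `MARKER`, `write`, and `nextRecordStart` are hypothetical names, the three marker bytes are placeholders for RECORD_1/RECORD_2/RECORD_3, and records are plain ASCII strings rather than serialized tuples. It shows how a reader dropped at an arbitrary byte offset can scan forward to the next marker and resume at a record boundary.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class SyncMarkerSketch {
    // Placeholder marker bytes standing in for RECORD_1/RECORD_2/RECORD_3
    // (an assumption for illustration).
    static final byte[] MARKER = {0x01, 0x02, 0x03};

    // Writes each record as: marker bytes followed by the ASCII payload.
    static byte[] write(String... records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String r : records) {
            out.write(MARKER, 0, MARKER.length);
            byte[] payload = r.getBytes(StandardCharsets.US_ASCII);
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    // Returns the offset of the first payload byte after the first complete
    // marker found at or after 'from', or -1 if no marker follows.
    static int nextRecordStart(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + MARKER.length <= data.length; i++) {
            if (data[i] == MARKER[0] && data[i + 1] == MARKER[1] && data[i + 2] == MARKER[2]) {
                return i + MARKER.length;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] file = write("alpha", "beta", "gamma");
        // A split that starts at offset 5 lands in the middle of "alpha";
        // scanning forward finds the marker before "beta".
        System.out.println(nextRecordStart(file, 5)); // prints 11
    }
}
```

The sketch ignores the case where a payload happens to contain the marker sequence; a real format has to account for that, and nothing here says how Pig does so.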
