Re: Some questions on intermediate serialization in Pig

Jonathan Coveney Sun, 27 May 2012 01:28:34 -0700

Ashutosh, that definitely does help. Thanks for lending your insight. I
think the thing I have little color on at the moment is the relationship
between those raw bits (i.e. RECORD_1 RECORD_2 RECORD_3 TUPLE_BITS and so on)
and the various byte[] compare functions.
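
To make the layout concrete, here is a minimal sketch of the marker idea in
Python. It is only an illustration: the marker bytes, the length-prefixed
framing, and the names write_records and read_split are made up for the
example and are not taken from Pig's InterRecordWriter or InterRecordReader.
It also assumes the marker sequence never shows up inside a payload.

```python
# Hypothetical marker values and framing, chosen for illustration only;
# they are NOT taken from Pig's source.
MARKER = bytes([0x01, 0x02, 0x03])  # stands in for RECORD_1, RECORD_2, RECORD_3


def write_records(payloads):
    """Frame each payload as MARKER + 4-byte big-endian length + payload."""
    out = bytearray()
    for p in payloads:
        out += MARKER
        out += len(p).to_bytes(4, "big")
        out += p
    return bytes(out)


def read_split(data, start, end):
    """Return the payloads whose marker begins in [start, end).

    A reader dropped at an arbitrary offset scans forward to the next
    marker and reads whole records from there. It finishes a record that
    runs past `end`; the reader of the next split skips those bytes while
    scanning for its own first marker.
    Assumption: the marker sequence never occurs inside a payload.
    """
    records = []
    pos = data.find(MARKER, start)
    while pos != -1 and pos < end:
        body = pos + len(MARKER)
        length = int.from_bytes(data[body:body + 4], "big")
        records.append(data[body + 4:body + 4 + length])
        pos = data.find(MARKER, body + 4 + length)
    return records


payloads = [b"alpha", b"bravo", b"charlie"]
data = write_records(payloads)
cut = len(data) // 2  # arbitrary split point, which lands mid-record here
first = read_split(data, 0, cut)
second = read_split(data, cut, len(data))
print(first, second)
```

With a layout like this, every record is picked up by exactly one split no
matter where the cut lands, since a reader owns a record only if that
record's marker starts inside the reader's range.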


2012/5/27 Ashutosh Chauhan <[email protected]>

> Hey Jon,
>
> You raised some interesting questions. I don't have answers for all of them,
> but I do for a few.
>
> * BinStorage is a legacy format that was previously used for intermediate
> serialization between MR jobs. It is no longer used for that, but it is still
> there because, unfortunately, folks have stored their end data using
> BinStorage, even though it was considered an internal format and subject to
> change. The reason folks chose to store data with it was that BinStorage was
> schema-aware, so once you wrote end data with it, you could reload it without
> specifying a schema. This feature led to its (mis)use. See
> https://issues.apache.org/jira/browse/PIG-798 for some related bugs around
> this.
>
> * I think your intuition is correct: in addition to identifying tuple
> boundaries, R1, R2, R3 are also used to identify block boundaries, that is,
> to make the file splittable. You can then split the files arbitrarily among
> multiple mappers, and each mapper will know where its first record starts.
>
> Hope it helps,
> Ashutosh
>
> On Sat, May 26, 2012 at 9:04 PM, Jonathan Coveney <[email protected]> wrote:
>
> > I appreciate it, Gianmarco :)
> >
> > 2012/5/26 Gianmarco De Francisci Morales <[email protected]>
> >
> > > I am not sure, but I will have a look at it (I implemented the raw
> > > comparator for secondary sort).
> > > I don't remember having to deal with this issue.
> > >
> > > Cheers,
> > > --
> > > Gianmarco
> > >
> > >
> > >
> > >
> > > On Fri, May 25, 2012 at 11:07 PM, Jonathan Coveney <[email protected]> wrote:
> > >
> > > > I'll just bump this once. The main thing I'm still unsure about is the
> > > > relationship between the various raw comparators, Pig, and Hadoop. If
> > > > we're serializing RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1,
> > > > RECORD_2, RECORD_3, Tuple, and so on, how come it appears that the raw
> > > > comparators aren't aware of it?
> > > >
> > > > 2012/5/23 Jonathan Coveney <[email protected]>
> > > >
> > > > > And one more question to pile on:
> > > > >
> > > > > What defines the binary data that the raw tuple comparator will be
> > > > > run on? It seems like it comes from Hadoop, and the format generally
> > > > > makes sense (you get bytes and do with them what you will). The thing
> > > > > that confuses me is why you don't have to deal with the
> > > > > RECORD_1/RECORD_2/RECORD_3/etc. hoopla. InterStorage deals with all
> > > > > of that and reads a deserialized tuple... so at what point do you get
> > > > > binary Tuple data that doesn't have all of the split stuff? I'll keep
> > > > > digging, but this is where my ignorance of the technicalities of the
> > > > > MR layer comes in...
> > > > >
> > > > > 2012/5/23 Jonathan Coveney <[email protected]>
> > > > >
> > > > > >> Another question is what BinStorage does compared to
> > > > > >> InterStorage. It looks like it might just be a legacy storage
> > > > > >> format?
> > > > > >>
> > > > > >> I'm assuming that you write R_1/R_2/R_3 to be able to find the next
> > > > > >> Tuple in the stream, but once you've done that, can't you just read
> > > > > >> a tuple, then skip 12 bytes (3 ints), and keep reading?
> > > > >>
> > > > >>
> > > > >> 2012/5/23 Jonathan Coveney <[email protected]>
> > > > >>
> > > > > >>> I'm trying to understand how intermediate serialization in Pig
> > > > > >>> works at a deeper level (understanding the whole code path, not
> > > > > >>> just BinInterSedes in its own vacuum). Right now I am looking at
> > > > > >>> InterRecordReader/InterRecordWriter/InterStorage. Is that the
> > > > > >>> right place to look to understand how BinInterSedes is actually
> > > > > >>> called?
> > > > > >>>
> > > > > >>> Further, I'm trying to better understand the
> > > > > >>> RECORD_1/RECORD_2/RECORD_3 thing. My guess is that it's to make the
> > > > > >>> file splittable? But I'm not really sure. I'd love any pointers
> > > > > >>> about where to look for how BinInterSedes is used, and how
> > > > > >>> intermediate storage happens.
> > > > >>>
> > > > >>> Thanks!
> > > > >>> Jon
> > > > >>>
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>
