Re: Some questions on intermediate serialization in Pig

Jonathan Coveney Sat, 26 May 2012 21:05:10 -0700

I appreciate it, Gianmarco :)

2012/5/26 Gianmarco De Francisci Morales <[email protected]>


> I am not sure, but I will have a look at it (I implemented the raw
> comparator for secondary sort).
> I don't remember having to deal with this issue.
>
> Cheers,
> --
> Gianmarco
>
>
>
>
> On Fri, May 25, 2012 at 11:07 PM, Jonathan Coveney <[email protected]> wrote:
>
> > I'll just bump this once. The main thing I'm still unsure about is the
> > relationship between the various raw comparators, Pig, and Hadoop. If we're
> > serializing RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2,
> > RECORD_3, Tuple, and so on, how come it appears that the raw comparators
> > aren't aware of it?
> >
> > 2012/5/23 Jonathan Coveney <[email protected]>
> >
> > > And one more question to pile on:
> > >
> > > What defines the binary data that the raw tuple comparator will be run on?
> > > It seems like it comes from Hadoop, and the format generally makes
> > > sense (you get bytes and do with them what you will). The thing that
> > > confuses me is why you don't have to deal with the
> > > RECORD_1/RECORD_2/RECORD_3/etc. hoopla. InterStorage deals with all of that
> > > and reads a deserialized tuple... so at what point do you get binary Tuple
> > > data that doesn't have all of the split stuff? I'll keep digging through,
> > > but this is where my ignorance of the technicalities of the MR layer comes
> > > in...
> > >
> > > 2012/5/23 Jonathan Coveney <[email protected]>
> > >
> > >> Another question is clarifying what BinStorage does compared to
> > >> InterStorage. It looks like it might just be a legacy storage format?
> > >>
> > >> I'm assuming that you do the R_1/R_2/R_3 to be able to find the next
> > >> Tuple in the stream, but once you do that, can't you just read a tuple,
> > >> then skip 12 bytes (3 ints), and keep reading?
> > >>
> > >>
> > >> 2012/5/23 Jonathan Coveney <[email protected]>
> > >>
> > >>> I'm trying to understand how intermediate serialization in Pig works at
> > >>> a deeper level (understanding the whole code path, not just BinInterSedes
> > >>> in its own vacuum). Right now I am looking at
> > >>> InterRecordReader/InterRecordWriter/InterStorage. Is that the right place
> > >>> to look for understanding how BinInterSedes is actually called?
> > >>>
> > >>> Further, I'm trying to better understand the
> > >>> RECORD_1/RECORD_2/RECORD_3 thing. My guess is that it's to make the file
> > >>> splittable? But I'm not really sure. I'd love any pointers about where to
> > >>> look for how BinInterSedes is used, and how intermediate storage happens.
> > >>>
> > >>> Thanks!
> > >>> Jon
> > >>>
> > >>
> > >>
> > >
> >
>
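
For readers following the R_1/R_2/R_3 guess in the quoted messages, here is a minimal sketch of the general sync-marker idea: a marker written before each record lets a reader that starts at an arbitrary byte offset (for example, the start of a split) scan forward to the next record boundary. This is not Pig's actual code. The marker bytes, the length-prefixed layout, and the names SYNC, write_records and read_from_offset are all assumptions made up for illustration.

```python
import io
import struct

# Hypothetical three-byte sync marker, chosen for illustration only.
# It is not taken from Pig's source.
SYNC = bytes([0x01, 0x02, 0x03])


def write_records(records):
    """Serialize each payload as SYNC + 4-byte big-endian length + payload."""
    out = io.BytesIO()
    for payload in records:
        out.write(SYNC)
        out.write(struct.pack(">I", len(payload)))
        out.write(payload)
    return out.getvalue()


def read_from_offset(data, offset):
    """Return every record whose sync marker starts at or after `offset`.

    Limitation of this sketch: a payload or length field that happens to
    contain the SYNC bytes can cause a false match when the reader starts
    in the middle of a record.
    """
    records = []
    pos = data.find(SYNC, offset)
    while pos != -1:
        header = pos + len(SYNC)
        if header + 4 > len(data):
            break  # truncated length field, stop reading
        (length,) = struct.unpack_from(">I", data, header)
        body = header + 4
        records.append(data[body:body + length])
        # The next marker, if any, is searched for right after this payload.
        pos = data.find(SYNC, body + length)
    return records
```

In this sketch a reader that starts at offset 0 sees every record, while a reader that starts one byte into the first record skips ahead to the second. That is the behaviour a splittable format needs: each split can begin reading at the first marker it finds.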
