I appreciate it, Gianmarco :)

2012/5/26 Gianmarco De Francisci Morales <[email protected]>

> I am not sure, but I will have a look at it (I implemented the raw
> comparator for secondary sort).
> I don't remember having to deal with this issue.
>
> Cheers,
> --
> Gianmarco
>
> On Fri, May 25, 2012 at 11:07 PM, Jonathan Coveney <[email protected]> wrote:
>
> > I'll just bump this once. The main thing I'm still unsure about is the
> > relationship between the various raw comparators, Pig, and Hadoop. If
> > we're serializing
> > RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3,
> > Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3,
> > Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, and so on, how come it
> > appears that the raw comparators aren't aware of it?
> >
> > 2012/5/23 Jonathan Coveney <[email protected]>
> >
> > > And one more question to pile on:
> > >
> > > What defines the binary data that the raw tuple comparator will be run
> > > on? It seems that it comes from Hadoop, and the format generally makes
> > > sense (you get bytes and do with them what you will). The thing that
> > > confuses me is why you don't have to deal with the
> > > RECORD_1/RECORD_2/RECORD_3/etc hoopla. InterStorage deals with all of
> > > that and reads a deserialized tuple... so at what point do you get
> > > binary Tuple data that doesn't have all of the split stuff? I'll keep
> > > digging through, but this is where my ignorance of the technicalities
> > > of the MR layer comes in...
> > >
> > > 2012/5/23 Jonathan Coveney <[email protected]>
> > >
> > >> Another question is clarifying what BinStorage does compared to
> > >> InterStorage. It looks like it might just be a legacy storage format?
> > >>
> > >> I'm assuming that you do the R_1/R_2/R_3 to be able to find the next
> > >> Tuple in the stream, but once you do that, can't you just read a
> > >> tuple, then skip 12 bytes (3 ints), and keep reading?
> > >>
> > >> 2012/5/23 Jonathan Coveney <[email protected]>
> > >>
> > >>> I'm trying to understand how intermediate serialization in Pig works
> > >>> at a deeper level (understanding the whole code path, not just
> > >>> BinInterSedes in its own vacuum). Right now I am looking at
> > >>> InterRecordReader/InterRecordWriter/InterStorage. Is that the right
> > >>> place to look for understanding how BinInterSedes is actually called?
> > >>>
> > >>> Further, I'm trying to better understand the
> > >>> RECORD_1/RECORD_2/RECORD_3 thing. My guess is that it's to make the
> > >>> file splittable? But I'm not really sure. I'd love any pointers about
> > >>> where to look for how BinInterSedes is used, and how intermediate
> > >>> storage happens.
> > >>>
> > >>> Thanks!
> > >>> Jon
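
For anyone following along, here is a toy sketch of the splittability guess from the quoted thread: a marker sequence in front of every record lets a reader that starts at an arbitrary byte offset find the next record boundary. This is not Pig's reader. The marker byte values and the names `SYNC`, `write_records`, and `next_record_start` are made up for illustration, and the sketch ignores the case where a payload happens to contain the marker bytes, which a real format has to handle.

```python
# Hypothetical marker bytes, chosen only for illustration; check Pig's
# InterStorage/InterRecordReader sources for the real values.
RECORD_1, RECORD_2, RECORD_3 = 0x01, 0x02, 0x03
SYNC = bytes([RECORD_1, RECORD_2, RECORD_3])


def write_records(payloads):
    """Concatenate payloads, prefixing each one with the sync marker."""
    return b"".join(SYNC + p for p in payloads)


def next_record_start(data, offset):
    """Return the index just past the first sync marker found at or after
    `offset`, or -1 if no complete marker starts at or after `offset`."""
    i = data.find(SYNC, offset)
    return -1 if i < 0 else i + len(SYNC)


if __name__ == "__main__":
    data = write_records([b"aaaa", b"bbbb", b"cccc"])
    # A reader dropped into the middle of the first payload resynchronizes
    # on the marker in front of the second record.
    start = next_record_start(data, 5)
    print(data[start:start + 4])
```

If this guess is right, the markers only matter to whoever reads the file from a split offset, which would be consistent with the raw comparators never seeing them. That last part is still a guess, not something confirmed in this thread.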
