Ashutosh, that definitely does help. Thanks for lending your insight. The thing I have little color on at the moment is the relationship between those raw bits (i.e. RECORD_1, RECORD_2, RECORD_3, TUPLE_BITS, and so on) and the various byte[] compare functions.
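For concreteness, the splittability point in Ashutosh's reply below can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Pig's actual code: the marker byte values, the length-prefixed record layout, and the names `SYNC`, `write_records`, and `read_split` are all hypothetical, and the sketch ignores the case where a payload happens to contain the marker bytes.

```python
import io
import struct

# Hypothetical marker bytes standing in for RECORD_1, RECORD_2, RECORD_3.
SYNC = bytes([0x01, 0x02, 0x03])


def write_records(payloads):
    """Serialize each payload as: SYNC + 4-byte big-endian length + payload."""
    out = io.BytesIO()
    for payload in payloads:
        out.write(SYNC)
        out.write(struct.pack(">I", len(payload)))
        out.write(payload)
    return out.getvalue()


def read_split(data, start, end):
    """Return the payload of every record whose marker begins in [start, end).

    A reader dropped at an arbitrary offset scans forward to the next marker,
    so it never has to know where the previous record began. A record that
    starts inside the split is read in full even if it runs past `end`.
    """
    records = []
    pos = data.find(SYNC, start)
    while pos != -1 and pos < end:
        length_at = pos + len(SYNC)
        (length,) = struct.unpack_from(">I", data, length_at)
        payload_at = length_at + 4
        records.append(data[payload_at:payload_at + length])
        # Look for the next marker after the record we just consumed.
        pos = data.find(SYNC, payload_at + length)
    return records


payloads = [b"alpha", b"bravo!", b"charlie"]
data = write_records(payloads)
# Two "mappers" given an arbitrary cut point still see every record exactly once.
first_half = read_split(data, 0, 20)
second_half = read_split(data, 20, len(data))
```

The cut at byte 20 falls in the middle of the second record, yet the first reader finishes that record and the second reader skips ahead to the next marker. Note that the markers belong to the file layout only; the payload bytes between them are what a byte[] comparison would operate on in this sketch.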
2012/5/27 Ashutosh Chauhan <hashut...@apache.org>

> Hey Jon,
>
> You raised some interesting questions. I don't have answers for all of
> them, but I do for a few.
>
> * BinStorage is a legacy format that was previously used for intermediate
> serialization between MR jobs. It is no longer used, but it is still there
> because, unfortunately, folks have stored their end-data using BinStorage,
> even though it was considered an internal format and subject to change.
> The reason folks chose to store data with it was that BinStorage was
> schema-aware, so once you wrote end-data with it, you could reload it
> without specifying a schema. This feature led to its (mis)use. See
> https://issues.apache.org/jira/browse/PIG-798 for some related bugs around
> this.
>
> * I think you have the correct intuition that, in addition to identifying
> tuple boundaries, R1, R2, R3 are also used to identify block boundaries,
> that is, to make the file splittable. You can then arbitrarily split the
> files among multiple mappers, and each will know where its first record
> starts.
>
> Hope it helps,
> Ashutosh
>
> On Sat, May 26, 2012 at 9:04 PM, Jonathan Coveney <jcove...@gmail.com> wrote:
>
> > I appreciate it, Gianmarco :)
> >
> > 2012/5/26 Gianmarco De Francisci Morales <g...@apache.org>
> >
> > > I am not sure, but I will have a look at it (I implemented the raw
> > > comparator for secondary sort).
> > > I don't remember having to deal with this issue.
> > >
> > > Cheers,
> > > --
> > > Gianmarco
> > >
> > > On Fri, May 25, 2012 at 11:07 PM, Jonathan Coveney <jcove...@gmail.com> wrote:
> > >
> > > > I'll just bump this once. The main thing I'm still unsure about is
> > > > the relationship between the various raw comparators, Pig, and
> > > > Hadoop. If we're serializing
> > > > RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3,
> > > > Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2,
> > > > RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, and so on, how
> > > > come it appears that the raw comparators aren't aware of it?
> > > >
> > > > 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
> > > >
> > > > > And one more question to pile on:
> > > > >
> > > > > What defines the binary data that the raw tuple comparator will be
> > > > > run on? It seems that it comes from Hadoop, and the format
> > > > > generally makes sense (you get bytes and do with them what you
> > > > > will). The thing that confuses me is why you don't have to deal
> > > > > with the RECORD_1/RECORD_2/RECORD_3/etc. hoopla. InterStorage
> > > > > deals with all of that and reads a deserialized tuple... so at
> > > > > what point do you get binary Tuple data that doesn't have all of
> > > > > the split stuff? I'll keep digging through, but this is where my
> > > > > ignorance of the technicalities of the MR layer comes in...
> > > > >
> > > > > 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
> > > > >
> > > > >> Another question is clarifying what BinStorage does compared to
> > > > >> InterStorage. It looks like it might just be a legacy storage
> > > > >> format?
> > > > >>
> > > > >> I'm assuming that you do the R_1/R_2/R_3 to be able to find the
> > > > >> next Tuple in the stream, but once you do that, can't you just
> > > > >> read a tuple, then skip 12 bytes (3 ints), and keep reading?
> > > > >>
> > > > >> 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
> > > > >>
> > > > >>> I'm trying to understand how intermediate serialization in Pig
> > > > >>> works at a deeper level (understanding the whole code path, not
> > > > >>> just BinInterSedes in its own vacuum). Right now I am looking at
> > > > >>> InterRecordReader/InterRecordWriter/InterStorage. Is that the
> > > > >>> right place to look for understanding how BinInterSedes is
> > > > >>> actually called?
> > > > >>>
> > > > >>> Further, I'm trying to better understand the
> > > > >>> RECORD_1/RECORD_2/RECORD_3 thing. My guess is that it's to make
> > > > >>> the file splittable? But I'm not really sure. I'd love any
> > > > >>> pointers about where to look for how BinInterSedes is used, and
> > > > >>> how intermediate storage happens.
> > > > >>>
> > > > >>> Thanks!
> > > > >>> Jon