Gianmarco, I think you're absolutely correct. Thanks for weighing in :)

2012/6/9 Gianmarco De Francisci Morales <g...@apache.org>

> So, to recap.
>
> InterSedes writes the R1/R2/R3 thing.
> I am quite sure it is done for splittability purposes.
> The RawComparators, as well as InterStorage, operate on binary data that
> does not see this thing.
>
> DISCLAIMER: Guesswork below!
>
> My wild guess is that InterRecordReader has nothing to do with the
> RawComparator.
> The former is used at the input of the map phase (and InterRecordWriter is
> used at the output of the reduce phase) while RawComparator is used at the
> boundary between map and reduce phase.
> If you look at PigGenericMapReduce you will see that the mapper/reducer
> always writes a PigNullableWritable as a key. I think this is the place
> where the RawComparator actually plays a role, which means it does not
> directly see the R1/R2/R3 because they are stripped out by the
> RecordReader.
>
> Cheers,
> --
> Gianmarco
>
> On Sun, May 27, 2012 at 10:28 AM, Jonathan Coveney <jcove...@gmail.com> wrote:
>
> > Ashutosh, that definitely does help. Thanks for lending your insight. I
> > think the thing I have little color on at the moment is the relationship
> > between those raw bits, i.e. RECORD_1 RECORD_2 RECORD_3 TUPLE_BITS and so
> > on, and then the various byte[] compare functions.
> >
> > 2012/5/27 Ashutosh Chauhan <hashut...@apache.org>
> >
> > > Hey Jon,
> > >
> > > You raised some interesting questions. I don't have answers for all of
> > > them, but I do for a few.
> > >
> > > * BinStorage is a legacy format which was used earlier for intermediate
> > > serialization between MR jobs. It is no longer used but is still there
> > > because unfortunately folks have stored their end data using
> > > BinStorage, even though it was considered an internal format and
> > > subject to change. The reason folks chose to store data with it was
> > > that BinStorage was schema-aware, so once you wrote end data with it,
> > > you could reload it without specifying a schema. This feature led to
> > > its (mis)use.
> > > See https://issues.apache.org/jira/browse/PIG-798 for some related
> > > bugs around this.
> > >
> > > * I think you have the correct intuition that, in addition to
> > > identifying tuple boundaries, R1, R2, R3 are also used to identify
> > > block boundaries, that is, to make the file splittable. That way you
> > > can arbitrarily split the files among multiple mappers and they will
> > > know where their first record starts.
> > >
> > > Hope it helps,
> > > Ashutosh
> > >
> > > On Sat, May 26, 2012 at 9:04 PM, Jonathan Coveney <jcove...@gmail.com> wrote:
> > >
> > > > I appreciate it, Gianmarco :)
> > > >
> > > > 2012/5/26 Gianmarco De Francisci Morales <g...@apache.org>
> > > >
> > > > > I am not sure, but I will have a look at it (I implemented the raw
> > > > > comparator for secondary sort).
> > > > > I don't remember having to deal with this issue.
> > > > >
> > > > > Cheers,
> > > > > --
> > > > > Gianmarco
> > > > >
> > > > > On Fri, May 25, 2012 at 11:07 PM, Jonathan Coveney <jcove...@gmail.com> wrote:
> > > > >
> > > > > > I'll just bump this once. The main thing I'm still unsure about
> > > > > > is the relationship between the various raw comparators, Pig,
> > > > > > and Hadoop. If we're serializing
> > > > > > RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3,
> > > > > > Tuple, and so on, how come it appears that the raw comparators
> > > > > > aren't aware of it?
> > > > > >
> > > > > > 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
> > > > > >
> > > > > > > And one more question to pile on:
> > > > > > >
> > > > > > > What defines the binary data that the raw tuple comparator will
> > > > > > > be run on?
> > > > > > > It seems like it comes from Hadoop, and the format generally
> > > > > > > makes sense (you get bytes and do with them what you will). The
> > > > > > > thing that confuses me is why you don't have to deal with the
> > > > > > > RECORD_1/RECORD_2/RECORD_3/etc. hoopla. InterStorage deals with
> > > > > > > all of that and reads a deserialized tuple... so at what point
> > > > > > > do you get binary Tuple data that doesn't have all of the split
> > > > > > > stuff? I'll keep digging through, but this is where my
> > > > > > > ignorance of the technicalities of the MR layer comes in...
> > > > > > >
> > > > > > > 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
> > > > > > >
> > > > > > >> Another question is what BinStorage does compared to
> > > > > > >> InterStorage. It looks like it might just be a legacy storage
> > > > > > >> format?
> > > > > > >>
> > > > > > >> I'm assuming that you write the R_1/R_2/R_3 to be able to find
> > > > > > >> the next Tuple in the stream, but once you do that, can't you
> > > > > > >> just read a tuple, then skip 12 bytes (3 ints), and keep
> > > > > > >> reading?
> > > > > > >>
> > > > > > >> 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
> > > > > > >>
> > > > > > >>> I'm trying to understand how intermediate serialization in Pig
> > > > > > >>> works at a deeper level (understanding the whole code path,
> > > > > > >>> not just BinInterSedes in its own vacuum). Right now I am
> > > > > > >>> looking at InterRecordReader/InterRecordWriter/InterStorage.
> > > > > > >>> Is that the right place to look for understanding how
> > > > > > >>> BinInterSedes is actually called?
> > > > > > >>>
> > > > > > >>> Further, I'm trying to better understand the
> > > > > > >>> RECORD_1/RECORD_2/RECORD_3 thing.
> > > > > > >>> My guess is that it's to make the file splittable? But I'm
> > > > > > >>> not really sure. I'd love any pointers about where to look for
> > > > > > >>> how BinInterSedes is used, and how intermediate storage
> > > > > > >>> happens.
> > > > > > >>>
> > > > > > >>> Thanks!
> > > > > > >>> Jon
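
The splittability idea discussed in this thread (markers written between
records so that a reader dropped at an arbitrary offset can find its first
record) can be sketched in a few lines of Python. This is only an
illustration under stated assumptions, not Pig's code: the 3-byte SYNC
value, the 4-byte length prefix, and the names write_records and read_split
are all hypothetical, and Pig's actual on-disk layout is not reproduced
here. The sketch also assumes the marker bytes never occur inside a payload,
which a real format has to guard against.

```python
import struct

# Hypothetical 3-byte marker standing in for RECORD_1/RECORD_2/RECORD_3.
SYNC = bytes([0x01, 0x02, 0x03])


def write_records(payloads):
    """Serialize each payload as SYNC + 4-byte big-endian length + payload."""
    out = bytearray()
    for payload in payloads:
        out += SYNC
        out += struct.pack(">I", len(payload))
        out += payload
    return bytes(out)


def read_split(data, start, end):
    """Return the payload of every record whose marker begins in [start, end).

    The reader scans forward from `start` to the next marker, so it does not
    need to know where the previous split's records ended. A record that
    starts inside the split is read in full even if it runs past `end`.
    """
    records = []
    pos = data.find(SYNC, start)
    while pos != -1 and pos < end:
        header = pos + len(SYNC)
        (length,) = struct.unpack_from(">I", data, header)
        body = header + 4
        # Only the payload is returned; the marker and length are dropped.
        records.append(data[body:body + length])
        pos = data.find(SYNC, body + length)
    return records


payloads = [b"alice", b"bob", b"carol", b"dave"]
data = write_records(payloads)

# Cut the file at an arbitrary byte offset, as two mappers might see it.
first = read_split(data, 0, 20)
second = read_split(data, 20, len(data))
```

Here the two splits together yield every record exactly once, and the bytes
handed back contain no marker. That is consistent with Gianmarco's guess
above that the comparators never see R1/R2/R3 because the reader has already
stripped them, though the sketch says nothing about how Pig itself wires
this up.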