I'll just bump this once. The main thing I'm still unsure about is the relationship between the various raw comparators, Pig, and Hadoop. If we're serializing RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, and so on, how come the raw comparators don't appear to be aware of those markers?
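
To make the question concrete, here is a toy Python sketch of the framing as I picture it. It is not Pig's actual code: the marker values, the length prefix, and the names `write_stream` and `read_from` are all made up for illustration. It only shows why a reader dropped at an arbitrary offset (as with a file split) could use the markers to find the next record boundary.

```python
import struct

# Hypothetical marker bytes, chosen for illustration only (not taken from Pig).
MARKERS = bytes([0x01, 0x02, 0x03])


def write_stream(payloads):
    """Frame each payload as: 3 marker bytes, 4-byte big-endian length, payload."""
    out = bytearray()
    for payload in payloads:
        out += MARKERS + struct.pack(">I", len(payload)) + payload
    return bytes(out)


def read_from(stream, offset):
    """Start at an arbitrary offset, resync on the markers, then read every
    record that follows. A real format would also have to handle the marker
    sequence occurring by chance inside a payload; this toy ignores that."""
    records = []
    pos = stream.find(MARKERS, offset)
    while pos != -1:
        start = pos + len(MARKERS)
        (length,) = struct.unpack_from(">I", stream, start)
        body_start = start + 4
        records.append(stream[body_start:body_start + length])
        pos = stream.find(MARKERS, body_start + length)
    return records


stream = write_stream([b"alpha", b"beta", b"gamma"])
print(read_from(stream, 0))  # all three payloads
print(read_from(stream, 5))  # starts mid-record, so only b"beta" and b"gamma"
```

In this picture the bytes handed back by `read_from` carry no markers at all, which is the kind of marker-free tuple data the raw comparators seem to receive. What I can't tell is where, if anywhere, the equivalent step happens in Pig.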

2012/5/23 Jonathan Coveney <jcove...@gmail.com>

> And one more question to pile on:
>
> What defines the binary data that the raw tuple comparator will be run on?
> It seems like it comes from Hadoop, and the format generally makes
> sense (you get bytes and do with them what you will). The thing that
> confuses me is why you don't have to deal with the
> RECORD_1/RECORD_2/RECORD_3/etc. hoopla. InterStorage deals with all of that
> and reads a deserialized tuple...so at what point do you get binary Tuple
> data that doesn't have all of the split stuff? I'll keep digging through,
> but this is where my ignorance of the technicalities of the MR layer comes
> in...
>
> 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
>
>> Another question is clarifying what BinStorage does compared to
>> InterStorage. It looks like it might just be a legacy storage format?
>>
>> I'm assuming that you do the R_1/R_2/R_3 to be able to find the next
>> Tuple in the stream, but once you do that, can't you just read a tuple,
>> then skip 12 bytes (3 ints), and keep reading?
>>
>> 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
>>
>>> I'm trying to understand how intermediate serialization in Pig works at
>>> a deeper level (understanding the whole code path, not just BinInterSedes
>>> in its own vacuum). Right now I am looking at
>>> InterRecordReader/InterRecordWriter/InterStorage. Is that the right place
>>> to look for understanding how BinInterSedes is actually called?
>>>
>>> Further, I'm trying to better understand the
>>> RECORD_1/RECORD_2/RECORD_3 thing. My guess is that it's to make the file
>>> splittable? But I'm not really sure. I'd love any pointers about where to
>>> look for how BinInterSedes is used, and how intermediate storage happens.
>>>
>>> Thanks!
>>> Jon