I'll just bump this once. The main thing I'm still unsure about is the
relationship between the various raw comparators, Pig, and Hadoop. If we're serializing
RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3,
Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3,
Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, and so on, how come it appears
that the raw comparators aren't aware of it?
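
To make the layout I'm describing concrete, here is a toy sketch of a marker-delimited stream. It is not Pig's code: the marker byte values, the 4-byte length prefix, and the names MarkerStreamSketch, write, nextMarker and readFrom are all made up for illustration. It only shows the "scan to the next RECORD_1/RECORD_2/RECORD_3, then read a record" idea, and says nothing about what bytes a raw comparator actually receives.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MarkerStreamSketch {
    // Hypothetical marker values, chosen for illustration only.
    static final byte RECORD_1 = 0x01;
    static final byte RECORD_2 = 0x02;
    static final byte RECORD_3 = 0x03;

    // Writes each payload as: RECORD_1, RECORD_2, RECORD_3, 4-byte length, payload bytes.
    // The length prefix is an assumption of this sketch, standing in for a serialized tuple.
    static byte[] write(List<byte[]> payloads) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (byte[] p : payloads) {
            buf.write(RECORD_1);
            buf.write(RECORD_2);
            buf.write(RECORD_3);
            byte[] len = ByteBuffer.allocate(4).putInt(p.length).array();
            buf.write(len, 0, len.length);
            buf.write(p, 0, p.length);
        }
        return buf.toByteArray();
    }

    // Returns the index of the next three-byte marker sequence at or after 'from', or -1.
    // A real format also has to cope with the marker bytes appearing inside a payload;
    // this sketch ignores that.
    static int nextMarker(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + 3 <= data.length; i++) {
            if (data[i] == RECORD_1 && data[i + 1] == RECORD_2 && data[i + 2] == RECORD_3) {
                return i;
            }
        }
        return -1;
    }

    // Starts at an arbitrary byte offset (as a split might), syncs to the next marker,
    // and reads every complete record from there to the end of the data.
    static List<byte[]> readFrom(byte[] data, int offset) {
        List<byte[]> records = new ArrayList<>();
        int pos = nextMarker(data, offset);
        while (pos >= 0 && pos + 7 <= data.length) {
            int len = ByteBuffer.wrap(data, pos + 3, 4).getInt();
            int start = pos + 7;
            if (len < 0 || start + len > data.length) {
                break; // truncated or corrupt record, stop reading
            }
            records.add(Arrays.copyOfRange(data, start, start + len));
            pos = nextMarker(data, start + len);
        }
        return records;
    }
}
```

Calling readFrom(data, 0) returns every record, while calling it with an offset that lands in the middle of the first record skips ahead to the second marker and returns the rest, which is my guess at why the markers exist in the first place.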

2012/5/23 Jonathan Coveney <jcove...@gmail.com>

> And one more question to pile on:
>
> What defines the binary data that the raw tuple comparator will be run on?
> It seems that it comes from Hadoop, and the format generally makes
> sense (you get bytes and do with them what you will). The thing that
> confuses me is why you don't have to deal with the
> RECORD_1/RECORD_2/RECORD_3/etc. hoopla. InterStorage deals with all of that
> and reads a deserialized tuple... so at what point do you get binary Tuple
> data that doesn't have all of the split stuff? I'll keep digging through
> but this is where my ignorance of the technicalities of the MR layer comes
> in...
>
> 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
>
>> Another question is clarifying what BinStorage does compared to
>> InterStorage. It looks like it might just be a legacy storage format?
>>
>> I'm assuming that you do the R_1/R_2/R_3 to be able to find the next
>> Tuple in the stream, but once you do that, can't you just read a tuple,
>> then skip 12 bytes (3 ints), and keep reading?
>>
>>
>> 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
>>
>>> I'm trying to understand how intermediate serialization in Pig works at
>>> a deeper level (understanding the whole code path, not just BinInterSedes
>>> in its own vacuum). Right now I am looking at
>>> InterRecordReader/InterRecordWriter/InterStorage. Is that the right place
>>> to look for understanding how BinInterSedes is actually called?
>>>
>>> Further, I'm trying to better understand the
>>> RECORD_1/RECORD_2/RECORD_3 thing. My guess is that it's to make the file
>>> splittable? But I'm not really sure. I'd love any pointers about where to
>>> look for how BinInterSedes is used, and how intermediate storage happens.
>>>
>>> Thanks!
>>> Jon
>>>
>>
>>
>
