[jira] [Commented] (PIG-3655) BinStorage and InterStorage approach to record markers is broken

Adam Szita (JIRA) Tue, 11 Jul 2017 07:31:25 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082281#comment-16082281
 ]


Adam Szita commented on PIG-3655:
---------------------------------

I agree that we should follow Hadoop's approach and generate a longer, 
random record marker instead of 0x010203.
I propose we use this ticket to fix InterStorage only. Taking the same approach 
for BinStorage would break compatibility with files already written in the 
past (and I also saw some other problems in BinStorage, e.g. writing 
BigDecimals).

InterStorage, on the other hand, is not bound by any contract, so we can change 
its format freely.
I uploaded a fix [^PIG-3655.0.patch]: InterRecordWriter will now generate a 
random record marker the same way [Hadoop does it for a 
SequenceFile|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/SequenceFile.java#L870].
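
For readers who do not want to open the patch, here is a minimal sketch of how such a marker could be generated. It is not the code from PIG-3655.0.patch. It assumes the SequenceFile technique is, as far as I can tell, an MD5 digest over a unique id plus a timestamp, and the class and method names (SyncMarkerSketch, newRandomMarker, fullMarker) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.rmi.server.UID;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not taken from the patch.
final class SyncMarkerSketch {
    // The original fixed marker 0x010203, which the comment says is kept as a prefix.
    static final byte[] LEGACY_PREFIX = {0x01, 0x02, 0x03};

    /** Returns 16 bytes that differ per call: MD5 of a unique id plus a timestamp. */
    static byte[] newRandomMarker() {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            String seed = new UID() + "@" + System.currentTimeMillis();
            return md5.digest(seed.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Concatenates the 3 legacy bytes and the 16 random bytes into a 19-byte marker. */
    static byte[] fullMarker(byte[] random) {
        byte[] out = new byte[LEGACY_PREFIX.length + random.length];
        System.arraycopy(LEGACY_PREFIX, 0, out, 0, LEGACY_PREFIX.length);
        System.arraycopy(random, 0, out, LEGACY_PREFIX.length, random.length);
        return out;
    }
}
```

A writer would call newRandomMarker() once per file and reuse the result for every record in that file, so the marker is unpredictable across files but constant within one.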

InterRecordReader will read this 16-byte magic sequence at initialization 
time from the beginning of the file (19 bytes in total, because I also kept the 
original marker 0x010203). Later, while reading records, it will always compare 
each record marker it sees with the one the reader was initialized with.
In theory this should result in far fewer collisions with data: the original 
sequence was 3+1 bytes (+1 for the Tuple type marker), but now we have 20 bytes 
in total (original 3 + 16 random + 1 Tuple type).
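
The reader side can be sketched roughly as follows. This is an illustration over an in-memory byte array, not the actual InterRecordReader code. The names MarkerScanSketch, readMarkerFromHeader and nextRecordStart are hypothetical, and the layout (19 marker bytes at offset 0) is an assumption based on the description above. The Tuple type byte that follows the marker is left out for brevity.

```java
import java.util.Arrays;

// Hypothetical illustration of marker-based record seeking; not the patch code.
final class MarkerScanSketch {
    static final int MARKER_LENGTH = 3 + 16; // legacy 0x010203 + 16 random bytes

    /** Reads the per-file marker, assumed here to sit at the very start of the file. */
    static byte[] readMarkerFromHeader(byte[] file) {
        if (file.length < MARKER_LENGTH) {
            throw new IllegalArgumentException("file too short to contain a marker");
        }
        return Arrays.copyOfRange(file, 0, MARKER_LENGTH);
    }

    /**
     * Returns the index of the first byte after the next full marker found at or
     * after 'from', or -1 if no further marker exists.
     */
    static int nextRecordStart(byte[] data, int from, byte[] marker) {
        for (int i = Math.max(from, 0); i + marker.length <= data.length; i++) {
            int j = 0;
            while (j < marker.length && data[i + j] == marker[j]) {
                j++;
            }
            if (j == marker.length) {
                return i + marker.length;
            }
        }
        return -1;
    }
}
```

Because the comparison is against the marker read from this particular file, a payload that happens to contain 1 2 3 followed by a tuple type byte no longer looks like a record boundary unless the following 16 bytes match as well.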

> BinStorage and InterStorage approach to record markers is broken
> ----------------------------------------------------------------
>
>                 Key: PIG-3655
>                 URL: https://issues.apache.org/jira/browse/PIG-3655
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0, 0.8.1, 
> 0.9.0, 0.9.1, 0.9.2, 0.10.0, 0.11, 0.10.1, 0.12.0, 0.11.1
>            Reporter: Jeff Plaisance
>            Assignee: Mark Wagner
>         Attachments: PIG-3655.0.patch
>
>
> The way that the record readers for these storage formats seek to the first 
> record in an input split is to find the byte sequence 1 2 3 110 for 
> BinStorage or 1 2 3 19-21|28-30|36-45 for InterStorage. If this sequence 
> occurs in the data for any reason (for example the integer 16909166 stored 
> big endian encodes to the byte sequence for BinStorage) other than to mark 
> the start of a tuple it can cause mysterious failures in pig jobs because the 
> record reader will try to decode garbage and fail.
> For this approach of using an unlikely sequence to mark record boundaries, it 
> is important to reduce the probability of the sequence occurring naturally in 
> the data by ensuring that your record marker is sufficiently long. Hadoop 
> SequenceFile uses 128 bits for this and randomly generates the sequence for 
> each file (selecting a fixed, predetermined value opens up the possibility of 
> a mean person intentionally sending you that value). This makes it extremely 
> unlikely that collisions will occur. In the long run I think that pig should 
> also be doing this.
> As a quick fix it might be good to save the current position in the file 
> before entering readDatum, and if an exception is thrown seek back to the 
> saved position and resume trying to find the next record marker.
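
The collision example in the quoted description can be checked with a few lines of Java. This is only a sketch to confirm the arithmetic; the class name MarkerCollisionDemo is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Shows that the int 16909166 serializes to the old BinStorage marker bytes.
final class MarkerCollisionDemo {
    /** Encodes an int as 4 bytes, most significant byte first. */
    static byte[] bigEndian(int value) {
        // ByteBuffer uses big-endian byte order by default.
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    public static void main(String[] args) {
        // 16909166 == 0x0102036E, and 0x6E == 110
        System.out.println(Arrays.toString(bigEndian(16909166))); // prints [1, 2, 3, 110]
    }
}
```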



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
