[ 
https://issues.apache.org/jira/browse/PIG-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091634#comment-16091634
 ] 

Adam Szita commented on PIG-3655:
---------------------------------

[~rohini] I've attached [^PIG-3655.3.patch] with the sync interval approach. 
I'm currently running unit and e2e tests, will update once I have a result.

> BinStorage and InterStorage approach to record markers is broken
> ----------------------------------------------------------------
>
>                 Key: PIG-3655
>                 URL: https://issues.apache.org/jira/browse/PIG-3655
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0, 0.8.1, 
> 0.9.0, 0.9.1, 0.9.2, 0.10.0, 0.11, 0.10.1, 0.12.0, 0.11.1
>            Reporter: Jeff Plaisance
>            Assignee: Adam Szita
>         Attachments: PIG-3655.0.patch, PIG-3655.1.patch, PIG-3655.2.patch, 
> PIG-3655.3.patch
>
>
> The way that the record readers for these storage formats seek to the first 
> record in an input split is to find the byte sequence 1 2 3 110 for 
> BinStorage or 1 2 3 19-21|28-30|36-45 for InterStorage. If this sequence 
> occurs in the data for any reason (for example the integer 16909166 stored 
> big endian encodes to the byte sequence for BinStorage) other than to mark 
> the start of a tuple it can cause mysterious failures in pig jobs because the 
> record reader will try to decode garbage and fail.
> For this approach of using an unlikely sequence to mark record boundaries, it 
> is important to reduce the probability of the sequence occuring naturally in 
> the data by ensuring that your record marker is sufficiently long. Hadoop 
> SequenceFile uses 128 bits for this and randomly generates the sequence for 
> each file (selecting a fixed, predetermined value opens up the possibility of 
> a mean person intentionally sending you that value). This makes it extremely 
> unlikely that collisions will occur. In the long run I think that pig should 
> also be doing this.
> As a quick fix it might be good to save the current position in the file 
> before entering readDatum, and if an exception is thrown seek back to the 
> saved position and resume trying to find the next record marker.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to