[ 
https://issues.apache.org/jira/browse/PIG-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732007#comment-16732007
 ] 

Nandor Kollar commented on PIG-5373:
------------------------------------

I have one observation on the patch: to be future-proof, I think we should use
CircularFifoQueue from commons-collections4 instead of CircularFifoBuffer from
commons-collections. On one hand, CircularFifoBuffer was removed from the
latest commons-collections code; on the other hand, CircularFifoQueue is
generic, so we can eliminate iterating through Object items and casting to
integer. Be aware of one thing: the semantics of isFull have changed, and a
CircularFifoQueue is never full. The isFull call should be replaced with
{{queue.size() == queue.maxSize()}}.

> InterRecordReader might skip records if certain sync markers are used
> ---------------------------------------------------------------------
>
>                 Key: PIG-5373
>                 URL: https://issues.apache.org/jira/browse/PIG-5373
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>            Priority: Major
>         Attachments: PIG-5373.0.patch
>
>
> Due to a bug in InterRecordReader#skipUntilMarkerOrSplitEndOrEOF(), sync
> markers can go unidentified while reading the interim binary file used to
> hold data between jobs.
> In such files, sync markers are placed at write time and later help when
> reading the data. They are randomly generated, and it seems that for some
> rare combinations of a marker and the data preceding it, the marker cannot be
> found. This can result in reading through all the bytes (looking for the
> marker), reaching the split end or EOF, and extracting no records at all.
> This symptom is also observable in the JobHistory stats: a job affected by
> this issue will have tasks whose HDFS_BYTES_READ or FILE_BYTES_READ is about
> equal to the number of bytes in the split, while MAP_INPUT_RECORDS=0.
> One such (test) example is this:
> {code:java}
> marker: [-128, -128, 4] , data: [127, -1, 2, -128, -128, -128, 4, 1, 2, 
> 3]{code}
> Due to the bug, markers like this one, whose prefix overlaps with the last
> data chunk, are not seen by the reader.
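
To make the quoted example concrete, here is a minimal sketch of a sliding-window marker search that does find the marker in that data. It is not the code from InterRecordReader or from PIG-5373.0.patch; the class name MarkerScan and the method indexAfterMarker are hypothetical, and java.util.ArrayDeque is used only as a stand-in for the circular buffer discussed in the comment so that the example needs no extra dependency. The "window is full" check is written as a size comparison, in the same spirit as the size() == maxSize() suggestion above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class MarkerScan {

    /**
     * Returns the index of the first byte after the first occurrence of
     * marker in data, or -1 if the marker does not occur.
     * Hypothetical helper for illustration only, not Pig's implementation.
     */
    public static int indexAfterMarker(byte[] marker, byte[] data) {
        if (marker.length == 0) {
            return 0; // an empty marker trivially matches at the start
        }
        // Window holding the last marker.length bytes seen so far.
        Deque<Byte> window = new ArrayDeque<>(marker.length);
        for (int i = 0; i < data.length; i++) {
            // Evict the oldest byte once the window is at capacity.
            if (window.size() == marker.length) {
                window.removeFirst();
            }
            window.addLast(data[i]);
            // Compare only when the window is full.
            if (window.size() == marker.length && matches(window, marker)) {
                return i + 1;
            }
        }
        return -1;
    }

    /** Compares the window, oldest byte first, against the marker. */
    private static boolean matches(Deque<Byte> window, byte[] marker) {
        int k = 0;
        for (byte b : window) {
            if (b != marker[k++]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Marker and data taken from the issue description.
        byte[] marker = {-128, -128, 4};
        byte[] data = {127, -1, 2, -128, -128, -128, 4, 1, 2, 3};
        System.out.println(indexAfterMarker(marker, data)); // prints 7
    }
}
```

Because the window advances by exactly one byte per step and is compared in full each time, the extra leading -128 in the data does not hide the marker: the match is found at indices 4 to 6, and reading would resume at index 7.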



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
