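[jira] [Commented] (HADOOP-8655) In TextInputFormat, while specifying textinputformat.record.delimiter the character/character sequences in data file similar to starting character/starting character sequence in delimiter were found missing in certain cases in the Map Output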

Meria Joseph (JIRA) Tue, 07 Aug 2012 21:49:13 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430867#comment-13430867
 ]


Meria Joseph commented on HADOOP-8655:
--------------------------------------

The issue occurs when the buffer used to read the input file happens to end 
with a character or character sequence that matches the head of the record 
delimiter.

For example, in the case above, the last bytes of the buffer at some point 
while reading the file might be:

........</name></entity><entity><id>3</

This causes the reader to skip the last two characters (</), treating them as 
part of the delimiter </entity>.
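
The mechanism described above can be illustrated with a small stand-alone 
model. This is only a sketch under stated assumptions, not Hadoop's actual 
reader code: the class name BoundarySplitSketch, the split method and the 
restoreAcrossRefill flag are hypothetical, and the model assumes a non-empty 
delimiter whose first byte does not recur later in the delimiter (true for 
</entity>). With the flag off, bytes held back as a possible delimiter head 
are dropped if a buffer refill happens before the match fails; with the flag 
on, they are written back to the record.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified model of a chunked delimiter scan; NOT Hadoop's reader code.
class BoundarySplitSketch {

    // First three records of the sample input from the issue description.
    static final String SAMPLE =
        "<entity><id>1</id><name>User1</name></entity>"
      + "<entity><id>2</id><name>User2</name></entity>"
      + "<entity><id>3</id><name>User3</name></entity>";

    // Splits data on delim while consuming it in bufSize-byte chunks.
    // restoreAcrossRefill=false models the reported loss, true the repair.
    static List<String> split(byte[] data, byte[] delim, int bufSize,
                              boolean restoreAcrossRefill) {
        List<String> records = new ArrayList<>();
        ByteArrayOutputStream rec = new ByteArrayOutputStream();
        int matched = 0;         // delimiter bytes matched so far (held back)
        int matchedInBuffer = 0; // how many of those came from this buffer
        for (int start = 0; start < data.length; start += bufSize) {
            int end = Math.min(start + bufSize, data.length);
            matchedInBuffer = 0; // a refill discards the previous buffer
            for (int i = start; i < end; i++) {
                byte b = data[i];
                if (matched > 0 && b != delim[matched]) {
                    // False alarm: the held-back bytes were ordinary data.
                    // The lossy model can only recover bytes that are still
                    // in the current buffer.
                    int restorable =
                        restoreAcrossRefill ? matched : matchedInBuffer;
                    rec.write(delim, matched - restorable, restorable);
                    matched = 0;
                    matchedInBuffer = 0;
                }
                if (b == delim[matched]) {
                    matched++;
                    matchedInBuffer++;
                    if (matched == delim.length) {
                        // Full delimiter seen: emit the record without it.
                        records.add(new String(rec.toByteArray(),
                                               StandardCharsets.UTF_8));
                        rec.reset();
                        matched = 0;
                        matchedInBuffer = 0;
                    }
                } else {
                    rec.write(b);
                }
            }
        }
        // End of input: a partial match is data; emit any unfinished record.
        if (matched > 0) {
            rec.write(delim, 0, matched);
        }
        if (rec.size() > 0) {
            records.add(new String(rec.toByteArray(), StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = SAMPLE.getBytes(StandardCharsets.UTF_8);
        byte[] delim = "</entity>".getBytes(StandardCharsets.UTF_8);
        // Records are 45 bytes, so a 105-byte buffer ends with "<id>3</".
        // prints <entity><id>3id><name>User3</name>
        System.out.println(split(data, delim, 105, false).get(2));
        // prints <entity><id>3</id><name>User3</name>
        System.out.println(split(data, delim, 105, true).get(2));
    }
}
```

In this model a 105-byte buffer stands in for the 4096-byte one, so that the 
boundary falls exactly after "<id>3</" in a short input. The lossy run 
reproduces the shape of "Value 3" from the issue description.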

The default buffer size is 4096 bytes. Hence, to trigger the bug, the input 
must be larger than 4096 bytes and the last bytes of the buffer must match the 
head of the delimiter. Please advise on how to create a test case for the 
patch.
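
One possible starting point for such a test case is to generate an input in 
which the characters "</" of a closing </name> tag occupy the last two 
positions of the first buffer. The following is a minimal sketch, assuming a 
4096-byte buffer and ASCII-only data; the class BoundaryInputSketch and its 
methods are hypothetical helpers, not part of Hadoop. The expected records 
are computed with a plain String.split, so a test could compare them with 
what the record reader under test returns.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper that builds a boundary-hitting test input.
class BoundaryInputSketch {

    static final String DELIMITER = "</entity>";

    // Builds two records; the "</" of the first record's </name> tag lands
    // on the last two bytes of a buffer of the given size.
    static String buildInput(int bufferSize) {
        String head = "<entity><id>1</id><name>";
        int padding = bufferSize - 2 - head.length();
        if (padding < 0) {
            throw new IllegalArgumentException("bufferSize too small");
        }
        char[] fill = new char[padding];
        Arrays.fill(fill, 'x'); // filler that never matches the delimiter
        return head + new String(fill) + "</name>" + DELIMITER
             + "<entity><id>2</id><name>User2</name>" + DELIMITER;
    }

    // Reference split that involves no buffering at all.
    static String[] expectedRecords(String input) {
        return input.split(Pattern.quote(DELIMITER));
    }

    public static void main(String[] args) {
        String input = buildInput(4096);
        System.out.println("input length: " + input.length());
        System.out.println("bytes 4094-4095: " + input.substring(4094, 4096));
        System.out.println("expected records: "
            + expectedRecords(input).length);
    }
}
```

A test would write this input to a file, run it through the reader with the 
delimiter set to </entity>, and check that the first value still ends with 
</name> rather than name>.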



 


                
> In TextInputFormat, while specifying textinputformat.record.delimiter the 
> character/character sequences in data file similar to starting 
> character/starting character sequence in delimiter were found missing in 
> certain cases in the Map Output
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-8655
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8655
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 0.20.2
>         Environment: Linux- Ubuntu 10.04
>            Reporter: Arun A K
>              Labels: hadoop, mapreduce, textinputformat, 
> textinputformat.record.delimiter
>         Attachments: MAPREDUCE-4519.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Set textinputformat.record.delimiter as "</entity>"
> Suppose the input is a text file with the following content
> <entity><id>1</id><name>User1</name></entity><entity><id>2</id><name>User2</name></entity><entity><id>3</id><name>User3</name></entity><entity><id>4</id><name>User4</name></entity><entity><id>5</id><name>User5</name></entity>
> Mapper was expected to get value as 
> Value 1 - <entity><id>1</id><name>User1</name>
> Value 2 - <entity><id>2</id><name>User2</name>
> Value 3 - <entity><id>3</id><name>User3</name>
> Value 4 - <entity><id>4</id><name>User4</name>
> Value 5 - <entity><id>5</id><name>User5</name>
> According to this bug Mapper gets value
> Value 1 - entity><id>1</id><name>User1</name>
> Value 2 - <entity>id>2</id><name>User2</name>
> Value 3 - <entity><id>3id><name>User3</name>
> Value 4 - <entity><id>4</id><name>User4name>
> Value 5 - <entity><id>5</id><name>User5</name>
> The pattern shown above need not necessarily occur for values 1, 2 and 3. 
> The bug occurs at seemingly random positions in the map input.
>  

