[
https://issues.apache.org/jira/browse/MAPREDUCE-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975073#action_12975073
]
Sarthak commented on MAPREDUCE-1487:
------------------------------------
Hi,
I have a similar problem in Hadoop 0.20.2.
I am trying to write a BSONWritable class that extends WritableComparable; it
can be considered an equivalent of IntWritable, FloatWritable, etc. However,
the length of the keys/values produced by the map step may or may not be constant.
I stepped through the code, and specifically the ReduceContext.nextKeyValue()
method (line 115) does a buffer.reset():
DataInputBuffer next = input.getKey();
currentRawKey.set(next.getData(), next.getPosition(),
                  next.getLength() - next.getPosition());
buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
key = keyDeserializer.deserialize(key);
next = input.getValue();
buffer.reset(next.getData(), next.getPosition(), next.getLength());
value = valueDeserializer.deserialize(value);
In the debugger, I see that the value length field in the next object is 103.
However, next.getLength() returns the length of the underlying data itself. In
the reset method, this causes the 'count' field to increase on every call.
In the following step, the valueDeserializer fails because the buffer does not
get reset properly due to the wrong length, and hence deserialization fails.
The only way for me to get an accurate length of the value in bytes is via the
next.length field above. If next.getLength() returned the correct value, the
deserialization step would work fine.
I stepped through the same code with another MapReduce job that has an
IntWritable output. There, the valueDeserializer.deserialize step calls
IntWritable.readFields(), which is pretty simple and reads only the next four
bytes.
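The inflation described above can be reproduced without Hadoop on the
classpath. The sketch below re-implements just the DataInputBuffer fields
quoted in this issue (the class and method here are hypothetical stand-ins,
not the real Hadoop classes) and shows the end offset growing on every
reset(getPosition(), getLength()) round trip:

```java
public class ResetDemo {
    // Minimal stand-in mirroring the 0.20.1 DataInputBuffer state
    // quoted in the issue description (illustration only).
    static class Buffer {
        byte[] buf;
        int pos;    // current read position
        int count;  // one past the last valid byte -- an END OFFSET, not a length
        void reset(byte[] input, int start, int length) {
            this.buf = input;
            this.count = start + length;  // treats 'length' as content length
            this.pos = start;
        }
        int getPosition() { return pos; }
        int getLength() { return count; }  // but returns the end offset
    }

    // Re-resetting with (getPosition(), getLength()) adds 'start' to
    // count on every round, so the apparent length keeps growing.
    static int countAfterResets(int start, int contentLength, int rounds) {
        Buffer b = new Buffer();
        b.reset(new byte[1024], start, contentLength);
        for (int i = 0; i < rounds; i++) {
            b.reset(b.buf, b.getPosition(), b.getLength());
        }
        return b.getLength();
    }

    public static void main(String[] args) {
        // start=10, content=20: the correct end offset is 30 ...
        System.out.println(countAfterResets(10, 20, 0)); // prints 30
        // ... but three more resets inflate it: 30 -> 40 -> 50 -> 60
        System.out.println(countAfterResets(10, 20, 3)); // prints 60
    }
}
```

This is exactly why a variable-length value type sees junk trailing bytes
while a fixed-size reader such as IntWritable.readFields(), which always
consumes exactly four bytes, never notices.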
> io.DataInputBuffer.getLength() semantic wrong/confused
> ------------------------------------------------------
>
> Key: MAPREDUCE-1487
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1487
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 0.20.1
> Environment: linux
> Reporter: Yang Yang
>
> I was trying Google Protocol Buffers as a value type on Hadoop, and when I
> used it in a reducer the parser always failed, while it worked fine with a
> plain input-stream reader or in a mapper.
> The reason is that the reducer interface in Task.java gave the parser a
> buffer larger than the actual encoded record, and the parser does not stop
> until it reaches the buffer end, so it parsed some junk bytes.
> The root cause is in hadoop.io.DataInputBuffer.java.
> In 0.20.1, DataInputBuffer.java line 47 reads:
>   public void reset(byte[] input, int start, int length) {
>     this.buf = input;
>     this.count = start + length;
>     this.mark = start;
>     this.pos = start;
>   }
>   public byte[] getData() { return buf; }
>   public int getPosition() { return pos; }
>   public int getLength() { return count; }
> We see that the above logic seems to assume that getLength() returns the
> total *capacity* (the end offset), not the actual content length, of the
> buffer, yet later code assumes the semantics that "length" is the actual
> content length, i.e. end - start:
>   /** Resets the data that the buffer reads. */
>   public void reset(byte[] input, int start, int length) {
>     buffer.reset(input, start, length);
>   }
> That is, if you call reset(getPosition(), getLength()) on the same buffer
> again and again, the "length" is inflated without bound.
> This confusion in semantics is reflected in many places, at least in
> IFile.java and Task.java, where it caused the original issue.
> Around line 980 of Task.java, we see
>   valueIn.reset(nextValueBytes.getData(), nextValueBytes.getPosition(),
>                 nextValueBytes.getLength())
> If the position above is nonzero, this actually sets a buffer that is too
> long, causing the reported issue.
> Changing Task.java, as a hack, to
>   valueIn.reset(nextValueBytes.getData(), nextValueBytes.getPosition(),
>                 nextValueBytes.getLength() - nextValueBytes.getPosition());
> fixed the issue, but the semantics of DataInputBuffer should be fixed and
> streamlined.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.