Ruiqi Dong created AVRO-4268:
--------------------------------

             Summary: BytesWritableConverter serializes unused capacity bytes
                 Key: AVRO-4268
                 URL: https://issues.apache.org/jira/browse/AVRO-4268
             Project: Apache Avro
          Issue Type: Bug
          Components: java
            Reporter: Ruiqi Dong


*Summary*
`AvroDatumConverterFactory.BytesWritableConverter` converts Hadoop 
`BytesWritable` to Avro `bytes` using `ByteBuffer.wrap(input.getBytes())`. 
`BytesWritable.getBytes()` returns the backing array, whose length can be 
larger than the logical value length reported by `getLength()`. The converter 
therefore serializes stale or unused capacity bytes.
 
*Affected code*
File: `lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroDatumConverterFactory.java`
{code:java}
@Override
public ByteBuffer convert(BytesWritable input) {
  return ByteBuffer.wrap(input.getBytes());
}
{code}
*Reproducer*
Add this test to 
`lang/java/mapred/src/test/java/org/apache/avro/hadoop/io/TestAvroDatumConverterFactory.java`
{code:java}
@Test
void convertBytesWritableRespectsLogicalLength() {
  AvroDatumConverter<BytesWritable, ByteBuffer> converter = mFactory.create(BytesWritable.class);
  BytesWritable writable = new BytesWritable(new byte[] { 1, 2, 3, 4, 5 });
  writable.setSize(3);

  ByteBuffer bytes = converter.convert(writable);

  assertEquals(3, bytes.remaining());
  assertEquals(1, bytes.get(0));
  assertEquals(2, bytes.get(1));
  assertEquals(3, bytes.get(2));
}
{code}
Run:
{code:java}
MAVEN_SKIP_RC=true \
JAVA_HOME=/opt/homebrew/Cellar/openjdk@21/21.0.6/libexec/openjdk.jdk/Contents/Home \
PATH=/opt/homebrew/Cellar/openjdk@21/21.0.6/libexec/openjdk.jdk/Contents/Home/bin:/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin \
/opt/homebrew/bin/mvn -q -t toolchains-local.xml -pl lang/java/mapred -am \
  -Dtest=org.apache.avro.hadoop.io.TestAvroDatumConverterFactory#convertBytesWritableRespectsLogicalLength \
  -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false \
  -Dinvoker.skip=true -Drat.skip=true test
{code}
*Observed behavior*
The converted `ByteBuffer` has `remaining() == 5`.

*Expected behavior*
The converted `ByteBuffer` should have `remaining() == input.getLength()`, 
which is `3` in the reproducer.


Hadoop `BytesWritable` separates capacity from logical length. Avro `bytes` 
should encode only the logical value, not unused backing-array capacity. The 
likely fix is `ByteBuffer.wrap(input.copyBytes())` or 
`ByteBuffer.wrap(input.getBytes(), 0, input.getLength())`.
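
The difference between the two candidate fixes can be seen without Hadoop on the classpath. The following is a minimal sketch using only `java.nio`, in which a plain `byte[]` stands in for the array returned by `getBytes()` and an `int` stands in for `getLength()`. The class and method names (`LogicalLengthSketch`, `wrapWhole`, `wrapView`, `wrapCopy`) are hypothetical and are not part of Avro or Hadoop.
{code:java}
import java.nio.ByteBuffer;
import java.util.Arrays;

// Plain-JDK stand-in for the BytesWritable case. `backing` plays the role of
// getBytes() and `length` the role of getLength(). All names are hypothetical.
class LogicalLengthSketch {

  // Mirrors the current converter: wraps the whole backing array,
  // so unused capacity bytes are included.
  static ByteBuffer wrapWhole(byte[] backing) {
    return ByteBuffer.wrap(backing);
  }

  // Zero-copy option: position 0, limit `length`.
  // The buffer still shares `backing` with the caller.
  static ByteBuffer wrapView(byte[] backing, int length) {
    return ByteBuffer.wrap(backing, 0, length);
  }

  // Copying option, analogous to wrapping copyBytes():
  // the buffer owns an independent array of exactly `length` bytes.
  static ByteBuffer wrapCopy(byte[] backing, int length) {
    return ByteBuffer.wrap(Arrays.copyOf(backing, length));
  }

  public static void main(String[] args) {
    byte[] backing = { 1, 2, 3, 4, 5 };
    int length = 3;
    System.out.println(wrapWhole(backing).remaining());       // 5
    System.out.println(wrapView(backing, length).remaining()); // 3
    System.out.println(wrapCopy(backing, length).remaining()); // 3
  }
}
{code}
Both alternatives report `remaining() == 3`, but they are not interchangeable. The view variant avoids an allocation, yet `array()` on it still returns the full five-byte array, and later writes to the backing array show through the buffer. The copy variant costs one allocation per value and is unaffected if the caller reuses the `BytesWritable` afterwards, which may matter if converted values are held beyond the current record.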



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
