Ruiqi Dong created AVRO-4267:
--------------------------------

             Summary: GenericData.compare(...) orders bytes using signed 
ByteBuffer.compareTo, unlike binary Avro order
                 Key: AVRO-4267
                 URL: https://issues.apache.org/jira/browse/AVRO-4267
             Project: Apache Avro
          Issue Type: Bug
          Components: java
            Reporter: Ruiqi Dong


*Summary*
`BinaryData.compare(...)` compares Avro `bytes` using unsigned byte order. 
`GenericData.compare(...)` falls through to `ByteBuffer.compareTo(...)` for 
`bytes`, which uses signed Java byte comparison. This makes in-memory datum 
comparison inconsistent with binary-encoded datum comparison.
 
*Affected code*
File: `lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java`
{code:java}
protected int compare(Object o1, Object o2, Schema s, boolean equals) {
  ...
  case STRING:
    CharSequence cs1 = o1 instanceof CharSequence ? (CharSequence) o1 : o1.toString();
    CharSequence cs2 = o2 instanceof CharSequence ? (CharSequence) o2 : o2.toString();
    return Utf8.compareSequences(cs1, cs2);
  default:
    return ((Comparable) o1).compareTo(o2);
  }
}
{code}
File: `lang/java/avro/src/main/java/org/apache/avro/io/BinaryData.java`
{code:java}
public static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
  return Arrays.compareUnsigned(b1, s1, s1 + l1, b2, s2, s2 + l2);
}
{code}
*Reproducer*
Add this test to `lang/java/avro/src/test/java/org/apache/avro/TestCompare.java`:
{code:java}
@Test
void genericBytesCompareMatchesBinaryUnsignedByteOrder() throws Exception {
  Schema schema = SchemaParser.parseSingle("\"bytes\"");
  ByteBuffer lower = ByteBuffer.wrap(new byte[] { 0x7F });
  ByteBuffer higher = ByteBuffer.wrap(new byte[] { (byte) 0xFF });

  assertTrue(BinaryData.compare(render(lower, schema, new GenericDatumWriter<>()), 0,
      render(higher, schema, new GenericDatumWriter<>()), 0, schema) < 0);
  assertTrue(GenericData.get().compare(lower, higher, schema) < 0);
}
{code}
Run:
{code:bash}
MAVEN_SKIP_RC=true JAVA_HOME=/opt/homebrew/Cellar/openjdk@21/21.0.6/libexec/openjdk.jdk/Contents/Home \
PATH=/opt/homebrew/Cellar/openjdk@21/21.0.6/libexec/openjdk.jdk/Contents/Home/bin:/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin \
/opt/homebrew/bin/mvn -q -t toolchains-local.xml -pl lang/java/avro \
  -Dtest=org.apache.avro.TestCompare#genericBytesCompareMatchesBinaryUnsignedByteOrder test
{code}
*Observed behavior*
The test fails on the `GenericData.compare(...)` assertion. Binary comparison 
orders `0x7F < 0xFF`, but generic in-memory comparison orders the same buffers 
the other way because `(byte) 0xFF` is `-1` in Java.
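
The sign flip can be seen without any Avro classes. The following is a minimal JDK-only sketch, not part of the Avro code base; the class and method names (`SignedVsUnsignedBytes`, `signedOrder`, `unsignedOrder`) are made up for illustration. It compares the same two one-byte values with `ByteBuffer.compareTo(...)` and with `Arrays.compareUnsigned(...)`:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative only: hypothetical class, not part of Avro.
final class SignedVsUnsignedBytes {

  // ByteBuffer.compareTo compares elements as signed Java bytes.
  static int signedOrder(byte[] a, byte[] b) {
    return ByteBuffer.wrap(a).compareTo(ByteBuffer.wrap(b));
  }

  // Arrays.compareUnsigned treats each byte as a value in 0..255.
  static int unsignedOrder(byte[] a, byte[] b) {
    return Arrays.compareUnsigned(a, b);
  }

  public static void main(String[] args) {
    byte[] lower = { 0x7F };
    byte[] higher = { (byte) 0xFF }; // -1 as a signed byte, 255 as unsigned

    // Signed order puts 0xFF before 0x7F, so the result is positive.
    System.out.println("signed:   " + Integer.signum(signedOrder(lower, higher)));
    // Unsigned order puts 0x7F before 0xFF, so the result is negative.
    System.out.println("unsigned: " + Integer.signum(unsignedOrder(lower, higher)));
  }
}
```

Only the sign of each result matters here; the two calls disagree on it for any pair of bytes that straddles `0x80`.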

*Expected behavior*
`GenericData.compare(...)` should use the same byte ordering as 
`BinaryData.compare(...)` for Avro `bytes`.
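
One possible shape for such a comparison is sketched below. This is an assumption about how a fix might look, not the actual patch: `UnsignedByteBufferOrder` and its `compare` method are hypothetical names, and how a `BYTES` case would be wired into `GenericData.compare(...)` is left open. The helper compares the remaining bytes of two buffers as unsigned values, then falls back to length, without moving either buffer's position:

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Avro: unsigned lexicographic order for ByteBuffers.
final class UnsignedByteBufferOrder {

  private UnsignedByteBufferOrder() {
  }

  static int compare(ByteBuffer b1, ByteBuffer b2) {
    int p1 = b1.position();
    int p2 = b2.position();
    int common = Math.min(b1.remaining(), b2.remaining());
    for (int i = 0; i < common; i++) {
      // Absolute get() leaves position untouched; & 0xFF maps the byte to 0..255.
      int cmp = Integer.compare(b1.get(p1 + i) & 0xFF, b2.get(p2 + i) & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    // All shared bytes are equal, so the shorter buffer sorts first.
    return Integer.compare(b1.remaining(), b2.remaining());
  }
}
```

With this helper, `compare(ByteBuffer.wrap(new byte[] { 0x7F }), ByteBuffer.wrap(new byte[] { (byte) 0xFF }))` is negative, matching the order that `Arrays.compareUnsigned(...)` gives for the same bytes.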


`BinaryData.compare(...)` is documented as consistent with 
`GenericData.compare(...)`. Different ordering for the same logical `bytes` 
values can affect sorting, comparisons inside records, and any code that 
expects binary and in-memory Avro datum ordering to agree.
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
