Ruiqi Dong created AVRO-4267:
--------------------------------
Summary: GenericData.compare(...) orders bytes using signed
ByteBuffer.compareTo, unlike binary Avro order
Key: AVRO-4267
URL: https://issues.apache.org/jira/browse/AVRO-4267
Project: Apache Avro
Issue Type: Bug
Components: java
Reporter: Ruiqi Dong
*Summary*
`BinaryData.compare(...)` compares Avro `bytes` using unsigned byte order.
`GenericData.compare(...)` falls through to `ByteBuffer.compareTo(...)` for
`bytes`, which uses signed Java byte comparison. This makes in-memory datum
comparison inconsistent with binary-encoded datum comparison.
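The divergence can be shown with the JDK alone. The following is a minimal sketch, not Avro code: the class name `SignedVsUnsignedBytes` and its two helper methods are made up for illustration. It compares the same pair of one-byte values through `ByteBuffer.compareTo` (signed) and through `Arrays.compareUnsigned` (the call `BinaryData.compareBytes` uses, Java 9+).

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SignedVsUnsignedBytes {
  // Signed comparison: ByteBuffer.compareTo compares elements as Byte.compare does.
  static int signedCompare(byte[] a, byte[] b) {
    return ByteBuffer.wrap(a).compareTo(ByteBuffer.wrap(b));
  }

  // Unsigned lexicographic comparison over the whole arrays (Java 9+).
  static int unsignedCompare(byte[] a, byte[] b) {
    return Arrays.compareUnsigned(a, b);
  }

  public static void main(String[] args) {
    byte[] lower = { 0x7F };
    byte[] higher = { (byte) 0xFF };
    // Signed: 0xFF is -1, so 0x7F sorts after it.
    System.out.println("signed:   " + Integer.signum(signedCompare(lower, higher))); // signed:   1
    // Unsigned: 0xFF is 255, so 0x7F sorts before it.
    System.out.println("unsigned: " + Integer.signum(unsignedCompare(lower, higher))); // unsigned: -1
  }
}
```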
*Affected code*
File: `lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java`
{code:java}
protected int compare(Object o1, Object o2, Schema s, boolean equals) {
  ...
  case STRING:
    CharSequence cs1 = o1 instanceof CharSequence ? (CharSequence) o1 : o1.toString();
    CharSequence cs2 = o2 instanceof CharSequence ? (CharSequence) o2 : o2.toString();
    return Utf8.compareSequences(cs1, cs2);
  default:
    return ((Comparable) o1).compareTo(o2);
  }
}
{code}
File: `lang/java/avro/src/main/java/org/apache/avro/io/BinaryData.java`
{code:java}
public static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
  return Arrays.compareUnsigned(b1, s1, s1 + l1, b2, s2, s2 + l2);
}
{code}
*Reproducer*
Add this test to `lang/java/avro/src/test/java/org/apache/avro/TestCompare.java`:
{code:java}
@Test
void genericBytesCompareMatchesBinaryUnsignedByteOrder() throws Exception {
  Schema schema = SchemaParser.parseSingle("\"bytes\"");
  ByteBuffer lower = ByteBuffer.wrap(new byte[] { 0x7F });
  ByteBuffer higher = ByteBuffer.wrap(new byte[] { (byte) 0xFF });
  assertTrue(BinaryData.compare(render(lower, schema, new GenericDatumWriter<>()), 0,
      render(higher, schema, new GenericDatumWriter<>()), 0, schema) < 0);
  assertTrue(GenericData.get().compare(lower, higher, schema) < 0);
}
{code}
Run:
{code:bash}
MAVEN_SKIP_RC=true \
JAVA_HOME=/opt/homebrew/Cellar/openjdk@21/21.0.6/libexec/openjdk.jdk/Contents/Home \
PATH=/opt/homebrew/Cellar/openjdk@21/21.0.6/libexec/openjdk.jdk/Contents/Home/bin:/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin \
/opt/homebrew/bin/mvn -q -t toolchains-local.xml -pl lang/java/avro \
  -Dtest=org.apache.avro.TestCompare#genericBytesCompareMatchesBinaryUnsignedByteOrder \
  test
{code}
*Observed behavior*
The test fails on the `GenericData.compare(...)` assertion. Binary comparison
orders `0x7F < 0xFF`, but generic in-memory comparison orders the same buffers
the other way because `(byte) 0xFF` is `-1` in Java.
*Expected behavior*
`GenericData.compare(...)` should use the same byte ordering as
`BinaryData.compare(...)` for Avro `bytes`.
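One possible shape for such a comparison is sketched below. It is only an illustration, not a proposed patch: `UnsignedByteBuffers.compareUnsigned` is a hypothetical helper that does not exist in Avro, and how it would be wired into `GenericData.compare(...)` is left open. It compares the remaining bytes of two buffers as unsigned values and treats a strict prefix as smaller, which is the same ordering `Arrays.compareUnsigned` produces.

```java
import java.nio.ByteBuffer;

public class UnsignedByteBuffers {
  // Hypothetical helper (not part of Avro): unsigned lexicographic comparison
  // of the remaining bytes of two buffers.
  static int compareUnsigned(ByteBuffer b1, ByteBuffer b2) {
    int p1 = b1.position();
    int p2 = b2.position();
    int n = Math.min(b1.remaining(), b2.remaining());
    for (int i = 0; i < n; i++) {
      // Absolute get(int) does not move either buffer's position.
      int x = b1.get(p1 + i) & 0xFF;
      int y = b2.get(p2 + i) & 0xFF;
      if (x != y) {
        return Integer.compare(x, y);
      }
    }
    // All shared bytes are equal: the shorter buffer sorts first.
    return Integer.compare(b1.remaining(), b2.remaining());
  }

  public static void main(String[] args) {
    ByteBuffer lower = ByteBuffer.wrap(new byte[] { 0x7F });
    ByteBuffer higher = ByteBuffer.wrap(new byte[] { (byte) 0xFF });
    System.out.println(compareUnsigned(lower, higher) < 0); // true
  }
}
```

Using absolute reads keeps the helper free of side effects on the caller's buffers and also works for buffers that are not backed by an accessible array.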
`BinaryData.compare(...)` is documented as consistent with
`GenericData.compare(...)`. Different ordering for the same logical `bytes`
values can affect sorting, comparisons inside records, and any code that
expects binary and in-memory Avro datum ordering to agree.
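To illustrate the effect on sorting, here is a small JDK-only sketch (the class `BytesSortOrder` and its members are made up for this example and are not Avro code). It sorts the same four one-byte buffers once with `ByteBuffer`'s natural, signed ordering and once with an unsigned comparator.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class BytesSortOrder {
  // Unsigned order over the full backing arrays; adequate here because every
  // buffer wraps its whole array.
  static final Comparator<ByteBuffer> UNSIGNED =
      (a, b) -> Arrays.compareUnsigned(a.array(), b.array());

  // Sorts four fixed one-byte values with the given order and returns them as hex.
  static List<String> sortedHex(Comparator<ByteBuffer> order) {
    List<ByteBuffer> values = new ArrayList<>();
    for (int v : new int[] { 0x7F, 0xFF, 0x00, 0x80 }) {
      values.add(ByteBuffer.wrap(new byte[] { (byte) v }));
    }
    values.sort(order);
    List<String> hex = new ArrayList<>();
    for (ByteBuffer b : values) {
      hex.add(String.format("%02X", b.get(0) & 0xFF));
    }
    return hex;
  }

  public static void main(String[] args) {
    // Signed natural order puts 0x80 and 0xFF (negative as Java bytes) first.
    System.out.println(sortedHex(Comparator.naturalOrder())); // [80, FF, 00, 7F]
    // Unsigned order matches the ordering of the binary encoding.
    System.out.println(sortedHex(UNSIGNED)); // [00, 7F, 80, FF]
  }
}
```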