Mikael Brännström created DRILL-8535:
----------------------------------------
Summary: tdigest_merge cannot parse correct tdigest data
Key: DRILL-8535
URL: https://issues.apache.org/jira/browse/DRILL-8535
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.22.0
Reporter: Mikael Brännström
The tdigest_merge SQL function parses binary data via a UTF-8 string (bytes ->
UTF-8 String -> bytes), which corrupts the data. Any byte value >= 0x80 that is
not part of a valid UTF-8 sequence is replaced by U+FFFD, which is re-encoded as
the three bytes -17, -65, -67 (EF BF BD).
The effect is that the call to MergingDigest.fromBytes throws parse exceptions,
such as BufferUnderflowException, due to the corrupt data.
To reproduce, create a tdigest with e.g. the single integer value 1082 with the
default compression 100. The resulting data is:
{code:java}
[0, 0, 0, 2, 64, -112, -24, 0, 0, 0, 0, 0, 64, -112, -24, 0, 0, 0, 0, 0, 66,
-56, 0, 0, 0, -46, 4, 26, 0, 1, 63, -128, 0, 0, 68, -121, 64, 0]
{code}
After UTF-8 String corruption, the data becomes:
{code:java}
[0, 0, 0, 2, 64, -17, -65, -67, -17, -65, -67, 0, 0, 0, 0, 0, 64, -17, -65,
-67, -17, -65, -67, 0, 0, 0, 0, 0, 66, -17, -65, -67, 0, 0, 0, -17, -65, -67,
4, 26, 0, 1, 63, -17, -65, -67, 0, 0, 68, -17, -65, -67, 64, 0]
{code}
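The corruption can be reproduced outside Drill with plain JDK classes. The following is a minimal sketch, not Drill code: the class name {{Utf8RoundTripDemo}} and the method {{roundTrip}} are hypothetical, and the input is the byte array quoted above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8RoundTripDemo {
    // Serialized digest bytes quoted in this report
    // (single value 1082, compression 100).
    static final byte[] ORIGINAL = {
        0, 0, 0, 2, 64, -112, -24, 0, 0, 0, 0, 0, 64, -112, -24, 0, 0, 0, 0, 0, 66,
        -56, 0, 0, 0, -46, 4, 26, 0, 1, 63, -128, 0, 0, 68, -121, 64, 0
    };

    // bytes -> UTF-8 String -> bytes: the lossy path described in this report.
    // new String(byte[], Charset) silently replaces malformed input with U+FFFD.
    static byte[] roundTrip(byte[] data) {
        String decoded = new String(data, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] corrupted = roundTrip(ORIGINAL);
        System.out.println(Arrays.toString(corrupted));
        System.out.println(ORIGINAL.length + " bytes in, " + corrupted.length + " bytes out");
    }
}
```

Each of the 8 bytes >= 0x80 in the input becomes 3 bytes, so the 38-byte input grows to 54 bytes and every later field is read at the wrong offset.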
The fix is trivial and relates to the class
{{TDigestFunctions.TDigestMergeFunction}}.
Line 1109 is incorrect:
{code:java}
byte[] buf = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
    .toStringFromUTF8(in.start, in.end, in.buffer)
    .getBytes(java.nio.charset.StandardCharsets.UTF_8);
{code}
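One possible shape of the fix is to copy the range [start, end) out of the buffer directly, with no String in between. The sketch below is only an illustration under stated assumptions: {{java.nio.ByteBuffer}} stands in for the type of {{in.buffer}}, and the names {{RawBytesCopySketch}} and {{copyRange}} are hypothetical. In Drill itself the equivalent would presumably be an absolute read on the buffer, such as {{in.buffer.getBytes(in.start, buf)}}, but that has not been checked against the patch.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class RawBytesCopySketch {
    // Copies bytes [start, end) verbatim. ByteBuffer is a stand-in (assumption)
    // for the buffer held by the function's input holder.
    static byte[] copyRange(ByteBuffer buffer, int start, int end) {
        byte[] out = new byte[end - start];
        ByteBuffer view = buffer.duplicate(); // leave the caller's position untouched
        view.clear();                         // position = 0, limit = capacity
        view.position(start);
        view.get(out, 0, out.length);
        return out;
    }

    public static void main(String[] args) {
        // Digest-like bytes surrounded by unrelated padding.
        byte[] backing = {9, 9, 0, 0, 0, 2, 64, -112, -24, 0, 7, 7};
        byte[] buf = copyRange(ByteBuffer.wrap(backing), 2, 10);
        System.out.println(Arrays.toString(buf));
    }
}
```

Because no charset decoding is involved, bytes such as -112 and -24 survive unchanged.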