[jira] [Created] (KAFKA-16289) Values.parseString on heterogeneous lists and maps sometimes corrupts data by inferring incorrect schema

Greg Harris (Jira) Tue, 20 Feb 2024 15:23:04 -0800

Greg Harris created KAFKA-16289:
-----------------------------------

             Summary: Values.parseString on heterogeneous lists and maps 
sometimes corrupts data by inferring incorrect schema
                 Key: KAFKA-16289
                 URL: https://issues.apache.org/jira/browse/KAFKA-16289
             Project: Kafka
          Issue Type: Bug
          Components: connect
            Reporter: Greg Harris
            Assignee: Greg Harris

The Values.parseString function makes a best-effort conversion of strings to
Connect-schema'd data. It supports reading arrays and maps as delimited by
`[,]` and `\{:,}` characters, and attempts to infer the common type of these
structures from the types of the elements. The algorithm it follows is:

1. Parse the elements of the list in one-pass. Infer the smallest/strictest
type which can contain each value individually.
2. Iterate over the schemas inferred for each element, and repeatedly merge two
schemas together to the smallest type which covers both element schemas.
3. Convert the parsed elements to the common element schema.

The implementation of step 2 here:
[https://github.com/apache/kafka/blob/ead2431c37ace9255df88ffe819bb905311af088/connect/api/src/main/java/org/apache/kafka/connect/data/Values.java#L805-L823]
has a flaw in it however. The `elementSchema` variable has `null` as a
sentinel both of the situations "no elements seen so far" and "no common schema
possible" among the seen elements.

When processing the first element of the list, `null` is used to adopt the
schema of the first element as the initial common schema. Later when an
incompatible element is found, the common schema is set to null to indicate
that there is no common element schema. However, a following iteration can
misinterpret the `null` as being at the start of the list again, and inject a
schema which works for some of the elements and not others.

When the values are converted in step 3, each element has one of the following
happen:
1. The value is left-as is (e.g. no common schema inferred)
2. The value is converted correctly to the destination type (e.g. int -> long)
3. An exception is thrown because the type could not be converted (e.g. string
-> struct)
4. The value is silently corrupted (e.g. long -> int, decimal -> long)

In normal circumstances either case (1) happens to all of the elements, or
case(2) does, depending on if a common schema was found. But when this bug is
triggered by having heterogeneous types, case (2) or case (3) can happen to
some of the elements in the array.

The effects depend on the order of elements in the array, as the sentinel value
bug is dependent on the iteration order of the elements. For example:

* `[1,2,"3"]` returns Byte, Byte, String
* `["1",2,3]` returns Byte, Byte, Byte (safely converts the data, case 2)
* `[1,2,{}]` returns Byte, Byte, Map
* `[{},2,3]` experiences an exception and returns String (exception, case 3)
* `[1, 2, 1000000000000000000000]` returns Byte, Byte, BigDecimal
* `[1000000000000000000000, 1, 2]` returns Byte, Byte, Byte (corruption, case 4)

Fixing this bug would entail changing all of these to return heterogeneous
lists without a common schema, and not convert the values at all. However, this
is a backwards-incompatible change because these are all situations in which we
return data without an exception, so downstream users could be relying on the
result.

However, this behavior is very opaque and unpredictable, and I think anyone
that observes this in the wild would need to work-around it or avoid it, rather
than rely on it happening. I think that fixing it to address the silent
corruption case is a bigger benefit to users than the harm done by changing the
other cases.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (KAFKA-16289) Values.parseString on heterogeneous lists and maps sometimes corrupts data by inferring incorrect schema

Reply via email to