Greg Harris created KAFKA-16289:
-----------------------------------

             Summary: Values.parseString on heterogeneous lists and maps 
sometimes corrupts data by inferring incorrect schema
                 Key: KAFKA-16289
                 URL: https://issues.apache.org/jira/browse/KAFKA-16289
             Project: Kafka
          Issue Type: Bug
          Components: connect
            Reporter: Greg Harris
            Assignee: Greg Harris


The Values.parseString function makes a best-effort conversion of strings to 
Connect-schema'd data. It supports reading arrays and maps as delimited by 
`[,]` and `\{:,}` characters, and attempts to infer the common type of these 
structures from the types of the elements. The algorithm it follows is:

1. Parse the elements of the list in one-pass. Infer the smallest/strictest 
type which can contain each value individually.
2. Iterate over the schemas inferred for each element, and repeatedly merge two 
schemas together to the smallest type which covers both element schemas.
3. Convert the parsed elements to the common element schema.

The implementation of step 2 here: 
[https://github.com/apache/kafka/blob/ead2431c37ace9255df88ffe819bb905311af088/connect/api/src/main/java/org/apache/kafka/connect/data/Values.java#L805-L823]
 has a flaw in it however. The `elementSchema` variable has `null` as a 
sentinel both of the situations "no elements seen so far" and "no common schema 
possible" among the seen elements.

When processing the first element of the list, `null` is used to adopt the 
schema of the first element as the initial common schema. Later when an 
incompatible element is found, the common schema is set to null to indicate 
that there is no common element schema. However, a following iteration can 
misinterpret the `null` as being at the start of the list again, and inject a 
schema which works for some of the elements and not others.

When the values are converted in step 3, each element has one of the following 
happen:
1. The value is left-as is (e.g. no common schema inferred)
2. The value is converted correctly to the destination type (e.g. int -> long)
3. An exception is thrown because the type could not be converted (e.g. string 
-> struct)
4. The value is silently corrupted (e.g. long -> int, decimal -> long)

In normal circumstances either case (1) happens to all of the elements, or 
case(2) does, depending on if a common schema was found. But when this bug is 
triggered by having heterogeneous types, case (2) or case (3) can happen to 
some of the elements in the array.

The effects depend on the order of elements in the array, as the sentinel value 
bug is dependent on the iteration order of the elements. For example:

* `[1,2,"3"]` returns Byte, Byte, String
* `["1",2,3]` returns Byte, Byte, Byte (safely converts the data, case 2)
* `[1,2,{}]` returns Byte, Byte, Map
* `[{},2,3]` experiences an exception and returns String (exception, case 3)
* `[1, 2, 1000000000000000000000]` returns Byte, Byte, BigDecimal
* `[1000000000000000000000, 1, 2]` returns Byte, Byte, Byte (corruption, case 4)

Fixing this bug would entail changing all of these to return heterogeneous 
lists without a common schema, and not convert the values at all. However, this 
is a backwards-incompatible change because these are all situations in which we 
return data without an exception, so downstream users could be relying on the 
result.

However, this behavior is very opaque and unpredictable, and I think anyone 
that observes this in the wild would need to work-around it or avoid it, rather 
than rely on it happening. I think that fixing it to address the silent 
corruption case is a bigger benefit to users than the harm done by changing the 
other cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to