[ 
https://issues.apache.org/jira/browse/NIFI-16061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alaksiej Ščarbaty updated NIFI-16061:
-------------------------------------
    Description: 
h3. Description

When a JSON field carries a quoted string value in one record and a bare number 
in another, the output incorrectly promotes the quoted string to a bare number.
h3. Root cause

Schema inference sees _TextNode("42")_ as STRING and _IntNode(7)_ as INT, so 
_FieldTypeInference_ merges the field to {_}CHOICE(INT, STRING){_}. At write 
time, _DataTypeUtils.findMostSuitableType_ sorts candidates by 
_RecordFieldType_ enum ordinal (INT=3 before STRING=13) and returns the first 
type the string value is convertible to. Because _"42"_ is convertible to INT, 
the string is silently coerced to a number.

The same issue applies to any type narrower than STRING that appears in a 
CHOICE, including BOOLEAN: _"false"_ is promoted to bare {_}false{_}.
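
The selection order described above can be illustrated with a minimal standalone sketch. It does not use NiFi classes: the _FieldType_ enum, _isConvertible_ and _firstConvertible_ are hypothetical stand-ins, and the enum holds only three constants, so its ordinals differ from the real _RecordFieldType_ values. Only the relative order (BOOLEAN and INT before STRING) is assumed to match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified stand-in for the behavior described in this issue; not NiFi code.
final class ChoiceResolutionSketch {

    // Hypothetical subset of field types. Narrower types are declared first,
    // so they sort before STRING when ordered by ordinal.
    enum FieldType { BOOLEAN, INT, STRING }

    // Returns true when the value could be converted to the given type.
    static boolean isConvertible(Object value, FieldType type) {
        String text = String.valueOf(value);
        switch (type) {
            case BOOLEAN:
                return value instanceof Boolean
                        || "true".equalsIgnoreCase(text)
                        || "false".equalsIgnoreCase(text);
            case INT:
                if (value instanceof Integer) {
                    return true;
                }
                try {
                    Integer.parseInt(text);
                    return true;
                } catch (NumberFormatException e) {
                    return false;
                }
            default:
                // Every value has a string representation.
                return true;
        }
    }

    // Sorts the CHOICE candidates by ordinal and returns the first type the
    // value is convertible to, regardless of the value's original type.
    static FieldType firstConvertible(Object value, List<FieldType> choices) {
        List<FieldType> sorted = new ArrayList<>(choices);
        sorted.sort(Comparator.comparingInt(FieldType::ordinal));
        for (FieldType candidate : sorted) {
            if (isConvertible(value, candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<FieldType> intOrString = Arrays.asList(FieldType.STRING, FieldType.INT);
        List<FieldType> boolOrString = Arrays.asList(FieldType.STRING, FieldType.BOOLEAN);

        System.out.println(firstConvertible("42", intOrString));     // INT (string is coerced)
        System.out.println(firstConvertible(7, intOrString));        // INT
        System.out.println(firstConvertible("abc", intOrString));    // STRING
        System.out.println(firstConvertible("false", boolOrString)); // BOOLEAN (string is coerced)
    }
}
```

In this sketch the original Java type of the value never influences the result, which is why _"42"_ and _7_ resolve to the same type.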
h3. Steps to reproduce

Use any flow with _JsonTreeReader_ + _JsonRecordSetWriter_, an inferred schema, 
and the following records:
{code:java}
{"val":"42"}
{"val":7}{code}
 

Expected writer output: 
{code:java}
[{"val":"42"},{"val":7}]{code}
Actual output: 
{code:java}
[{"val":42},{"val":7}]{code}
h3. Open questions

Is implicit type narrowing desired by default? 

Should we avoid type narrowing in these situations and adhere to the actual 
field type, or at least make this behavior configurable?
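
To make the question concrete, here is one possible shape of an "exact type first" resolution, sketched under assumptions and not a proposed patch. The _FieldType_ enum, the _resolve_ method and the _allowNarrowing_ flag are all hypothetical names. No such option exists in NiFi today as far as this issue describes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical alternative strategy; not NiFi code.
final class ExactTypeFirstSketch {

    enum FieldType { BOOLEAN, INT, STRING }

    // True only when the value's Java type already corresponds to the field type.
    static boolean matchesExactly(Object value, FieldType type) {
        switch (type) {
            case BOOLEAN:
                return value instanceof Boolean;
            case INT:
                return value instanceof Integer;
            case STRING:
                return value instanceof String;
            default:
                return false;
        }
    }

    // True when a string value could be parsed into the given narrower type.
    static boolean isParsable(Object value, FieldType type) {
        if (!(value instanceof String)) {
            return false;
        }
        String text = (String) value;
        if (type == FieldType.BOOLEAN) {
            return "true".equalsIgnoreCase(text) || "false".equalsIgnoreCase(text);
        }
        if (type == FieldType.INT) {
            try {
                Integer.parseInt(text);
                return true;
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return false;
    }

    // Prefers a candidate that matches the value's own type. Conversion to a
    // narrower type is attempted only if allowNarrowing is set (assumed flag).
    static FieldType resolve(Object value, List<FieldType> choices, boolean allowNarrowing) {
        for (FieldType candidate : choices) {
            if (matchesExactly(value, candidate)) {
                return candidate;
            }
        }
        if (allowNarrowing) {
            for (FieldType candidate : choices) {
                if (isParsable(value, candidate)) {
                    return candidate;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<FieldType> intOrString = Arrays.asList(FieldType.INT, FieldType.STRING);
        List<FieldType> intOnly = Arrays.asList(FieldType.INT);

        System.out.println(resolve("42", intOrString, true)); // STRING (exact match wins)
        System.out.println(resolve(7, intOrString, true));    // INT
        System.out.println(resolve("42", intOnly, true));     // INT (narrowing allowed)
        System.out.println(resolve("42", intOnly, false));    // null (narrowing disabled)
    }
}
```

With this ordering, a quoted _"42"_ stays a string whenever STRING is among the CHOICE candidates, and conversion happens only when no candidate matches the value as is.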

  was:
h3. Description

When a JSON field carries a quoted string value in one record and a bare number 
in another, the output incorrectly promotes the quoted string to a bare number.
h3. Root cause

Schema inference sees `TextNode("42")` as STRING and `IntNode(7)` as INT, so 
`FieldTypeInference` merges the field to `CHOICE(INT, STRING)`. At write time, 
`DataTypeUtils.findMostSuitableType` sorts candidates by `RecordFieldType` enum 
ordinal (INT=3 before STRING=13) and returns the first type the string value is 
convertible to. Because `"42"` is convertible to INT, the string is silently 
coerced to a number.

The same issue applies to any type narrower than STRING that appears in a 
CHOICE, including BOOLEAN: `"false"` is promoted to bare `false`.
h3. Steps to reproduce

Use any flow with JsonTreeReader + JsonRecordSetWriter and inferred schema with 
records:
{code:java}
{"val":"42"}
{"val":7}{code}
 

Expected output: 
{code:java}
[{"val":"42"},{"val":7}]{code}
Actual output: 

 

 
{code:java}
[{"val":42},{"val":7}]{code}
 
h3. Open questions

Is implicit type narrowing desired by default? 

Shall we avoid type narrowing in these situations and adhere to the actual 
field type? Or at least to make this behavior configurable?


> JsonRecordSetWriter promotes quoted JSON strings to numbers when field schema 
> is CHOICE(INT, STRING)
> ----------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-16061
>                 URL: https://issues.apache.org/jira/browse/NIFI-16061
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>    Affects Versions: 2.10.0
>            Reporter: Alaksiej Ščarbaty
>            Priority: Major
>
> h3. Description
> When a JSON field carries a quoted string value in one record and a bare 
> number in another, the output incorrectly promotes the quoted string to a 
> bare number.
> h3. Root cause
> Schema inference sees _TextNode("42")_ as STRING and _IntNode(7)_ as INT, so 
> _FieldTypeInference_ merges the field to {_}CHOICE(INT, STRING){_}. At write 
> time, _DataTypeUtils.findMostSuitableType_ sorts candidates by 
> _RecordFieldType_ enum ordinal (INT=3 before STRING=13) and returns the first 
> type the string value is convertible to. Because _"42"_ is convertible to 
> INT, the string is silently coerced to a number.
> The same issue applies to any type narrower than STRING that appears in a 
> CHOICE, including BOOLEAN: _"false"_ is promoted to bare {_}false{_}.
> h3. Steps to reproduce
> Use any flow with _JsonTreeReader_ + _JsonRecordSetWriter_, an inferred 
> schema, and the following records:
> {code:java}
> {"val":"42"}
> {"val":7}{code}
>  
> Expected writer output: 
> {code:java}
> [{"val":"42"},{"val":7}]{code}
> Actual output: 
> {code:java}
> [{"val":42},{"val":7}]{code}
> h3. Open questions
> Is implicit type narrowing desired by default? 
> Should we avoid type narrowing in these situations and adhere to the actual 
> field type, or at least make this behavior configurable?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
