[
https://issues.apache.org/jira/browse/NIFI-16061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alaksiej Ščarbaty updated NIFI-16061:
-------------------------------------
Description:
h3. Description
When a JSON field carries a quoted string value in one record and a bare number
in another, the output incorrectly promotes the quoted string to a bare number.
h3. Root cause
Schema inference sees _TextNode("42")_ as STRING and _IntNode(7)_ as INT, so
_FieldTypeInference_ merges the field to {_}CHOICE(INT, STRING){_}. At write
time, _DataTypeUtils.findMostSuitableType_ sorts candidates by
_RecordFieldType_ enum ordinal (INT=3 before STRING=13) and returns the first
type the string value is convertible to. Because _"42"_ is convertible to INT,
the string is silently coerced to a number.
The same issue applies to any type narrower than STRING that appears in a
CHOICE, including BOOLEAN: _"false"_ is promoted to bare {_}false{_}.
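The selection logic described above can be illustrated with a small standalone sketch. It does not use NiFi classes: _FieldType_, _isConvertible_ and _firstConvertible_ are hypothetical stand-ins, and the enum order is chosen only so that INT and BOOLEAN sort before STRING, as in the report. The ordinals are not NiFi's.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class ChoiceNarrowingSketch {
    // Hypothetical stand-in for RecordFieldType; only the relative order matters here.
    enum FieldType { BOOLEAN, INT, STRING }

    // Simplified convertibility check (assumption: strings convert when they parse).
    static boolean isConvertible(Object value, FieldType type) {
        switch (type) {
            case BOOLEAN:
                if (value instanceof Boolean) return true;
                return value instanceof String
                        && ("true".equalsIgnoreCase((String) value)
                            || "false".equalsIgnoreCase((String) value));
            case INT:
                if (value instanceof Integer) return true;
                if (value instanceof String) {
                    try {
                        Integer.parseInt((String) value);
                        return true;
                    } catch (NumberFormatException e) {
                        return false;
                    }
                }
                return false;
            case STRING:
                return value != null;
            default:
                return false;
        }
    }

    // Sort candidates by enum ordinal and return the first one the value converts to.
    static Optional<FieldType> firstConvertible(Object value, List<FieldType> choices) {
        return choices.stream()
                .sorted(Comparator.comparingInt(FieldType::ordinal))
                .filter(type -> isConvertible(value, type))
                .findFirst();
    }

    public static void main(String[] args) {
        List<FieldType> choice = List.of(FieldType.STRING, FieldType.INT);
        // The quoted string "42" resolves to INT because INT sorts before STRING.
        System.out.println(firstConvertible("42", choice).get());
    }
}
```

With this ordering, a string that merely parses as a narrower type never reaches the STRING candidate, which matches the behavior reported here.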
h3. Steps to reproduce
Use any flow with _JsonTreeReader_ + _JsonRecordSetWriter_ and inferred schema
with records:
{code:java}
{"val":"42"}
{"val":7}{code}
Expected writer output:
{code:java}
[{"val":"42"},{"val":7}]{code}
Actual output:
{code:java}
[{"val":42},{"val":7}]{code}
h3. Open questions
Is implicit type narrowing desired by default?
Should we avoid type narrowing in these situations and adhere to the actual
field type, or at least make this behavior configurable?
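To make the "adhere to the actual field type" option concrete, here is a minimal standalone sketch of one possible approach: prefer the CHOICE candidate that matches the value's runtime Java type before attempting any conversion. It is not a proposed patch. _FieldType_, _nativeType_ and _chooseType_ are hypothetical names, and the fallback to conversion is deliberately left out.

```java
import java.util.List;
import java.util.Optional;

class ExactMatchFirstSketch {
    // Hypothetical stand-in for RecordFieldType, reduced to three types.
    enum FieldType { BOOLEAN, INT, STRING }

    // Map a value to the field type that matches its runtime Java type.
    static Optional<FieldType> nativeType(Object value) {
        if (value instanceof Boolean) return Optional.of(FieldType.BOOLEAN);
        if (value instanceof Integer) return Optional.of(FieldType.INT);
        if (value instanceof String) return Optional.of(FieldType.STRING);
        return Optional.empty();
    }

    // Return the exact-match candidate if the CHOICE contains it.
    // A fuller version would fall back to conversion-based selection;
    // this sketch returns empty in that case.
    static Optional<FieldType> chooseType(Object value, List<FieldType> choices) {
        return nativeType(value).filter(choices::contains);
    }

    public static void main(String[] args) {
        List<FieldType> choice = List.of(FieldType.INT, FieldType.STRING);
        // The quoted string "42" keeps its STRING type; the bare 7 stays INT.
        System.out.println(chooseType("42", choice).get());
        System.out.println(chooseType(7, choice).get());
    }
}
```

Under this assumption the reproduction records would be written back as received, with conversion reserved for values that have no exact match in the CHOICE.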
was:
h3. Description
When a JSON field carries a quoted string value in one record and a bare number
in another, the output incorrectly promotes the quoted string to a bare number.
h3. Root cause
Schema inference sees `TextNode("42")` as STRING and `IntNode(7)` as INT, so
`FieldTypeInference` merges the field to `CHOICE(INT, STRING)`. At write time,
`DataTypeUtils.findMostSuitableType` sorts candidates by `RecordFieldType` enum
ordinal (INT=3 before STRING=13) and returns the first type the string value is
convertible to. Because `"42"` is convertible to INT, the string is silently
coerced to a number.
The same issue applies to any type narrower than STRING that appears in a
CHOICE, including BOOLEAN: `"false"` is promoted to bare `false`.
h3. Steps to reproduce
Use any flow with JsonTreeReader + JsonRecordSetWriter and inferred schema with
records:
{code:java}
{"val":"42"}
{"val":7}{code}
Expected output:
{code:java}
[{"val":"42"},{"val":7}]{code}
Actual output:
{code:java}
[{"val":42},{"val":7}]{code}
h3. Open questions
Is implicit type narrowing desired by default?
Shall we avoid type narrowing in these situations and adhere to the actual
field type? Or at least to make this behavior configurable?
> JsonRecordSetWriter promotes quoted JSON strings to numbers when field schema
> is CHOICE(INT, STRING)
> ----------------------------------------------------------------------------------------------------
>
> Key: NIFI-16061
> URL: https://issues.apache.org/jira/browse/NIFI-16061
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Affects Versions: 2.10.0
> Reporter: Alaksiej Ščarbaty
> Priority: Major
>
> h3. Description
> When a JSON field carries a quoted string value in one record and a bare
> number in another, the output incorrectly promotes the quoted string to a
> bare number.
> h3. Root cause
> Schema inference sees _TextNode("42")_ as STRING and _IntNode(7)_ as INT, so
> _FieldTypeInference_ merges the field to {_}CHOICE(INT, STRING){_}. At write
> time, _DataTypeUtils.findMostSuitableType_ sorts candidates by
> _RecordFieldType_ enum ordinal (INT=3 before STRING=13) and returns the first
> type the string value is convertible to. Because _"42"_ is convertible to
> INT, the string is silently coerced to a number.
> The same issue applies to any type narrower than STRING that appears in a
> CHOICE, including BOOLEAN: _"false"_ is promoted to bare {_}false{_}.
> h3. Steps to reproduce
> Use any flow with _JsonTreeReader_ + _JsonRecordSetWriter_ and inferred
> schema with records:
> {code:java}
> {"val":"42"}
> {"val":7}{code}
>
> Expected writer output:
> {code:java}
> [{"val":"42"},{"val":7}]{code}
> Actual output:
> {code:java}
> [{"val":42},{"val":7}]{code}
> h3. Open questions
> Is implicit type narrowing desired by default?
> Should we avoid type narrowing in these situations and adhere to the actual
> field type, or at least make this behavior configurable?