Dain Sundstrom created HIVE-26958:
-------------------------------------
Summary: JsonSerDe data corruption when scalar type is a json
object
Key: HIVE-26958
URL: https://issues.apache.org/jira/browse/HIVE-26958
Project: Hive
Issue Type: Bug
Components: File Formats
Reporter: Dain Sundstrom
JsonSerDe uses the Jackson {{JsonParser.getText}} method to decode scalar
values from JSON strings. The problem is that this Jackson method converts any
token to text, including {{START_OBJECT}}, which becomes '{'. This means that
when a scalar field is actually a JSON object, JsonSerDe processes the opening
curly bracket as the value for {{BOOLEAN}}, {{DECIMAL}}, {{CHAR}},
{{VARCHAR}}, and {{VARBINARY}}. It then continues processing the fields inside
the JSON object as if they were part of the outer JSON object. When the closing
curly bracket is encountered, it pops a level, which can end parsing early.
This bug results in corrupted data for the following JSON:
{code:java}
{ "boolean_field" : {}, "other_field" : 99 }
=> [boolean_field=false, other_field=null]
{ "boolean_field" : { "other_field" : 42 }, "other_field" : 99 }
=> (false, null)
=> [boolean_field=false, other_field=42]
{code}
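The two results above can be reproduced with a small token-stream model. This is a sketch under stated assumptions, not Hive's code: {{LenientScalarModel}}, {{Token}}, {{Kind}}, and the sample builders are hypothetical names, the token lists are written by hand rather than produced by Jackson, and the two-column schema is hard-coded. The only behavior it borrows from the report is that the text of a {{START_OBJECT}} token is '{' and that the reader stops at the first closing bracket it sees.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model with hypothetical names; it does not use Hive's or Jackson's classes.
class LenientScalarModel {
    enum Kind { START_OBJECT, END_OBJECT, FIELD_NAME, VALUE }

    static final class Token {
        final Kind kind;
        final String text; // what a getText()-style call would return for this token
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    static Token startObject() { return new Token(Kind.START_OBJECT, "{"); }
    static Token endObject() { return new Token(Kind.END_OBJECT, "}"); }
    static Token field(String name) { return new Token(Kind.FIELD_NAME, name); }
    static Token value(String text) { return new Token(Kind.VALUE, text); }

    // Tokens for: { "boolean_field" : {}, "other_field" : 99 }
    static List<Token> emptyObjectCase() {
        return Arrays.asList(startObject(),
            field("boolean_field"), startObject(), endObject(),
            field("other_field"), value("99"), endObject());
    }

    // Tokens for: { "boolean_field" : { "other_field" : 42 }, "other_field" : 99 }
    static List<Token> nestedObjectCase() {
        return Arrays.asList(startObject(),
            field("boolean_field"), startObject(), field("other_field"), value("42"), endObject(),
            field("other_field"), value("99"), endObject());
    }

    // Reads one record, decoding every value from its text without checking the token kind.
    static Map<String, Object> readRecord(List<Token> tokens) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("boolean_field", null);
        row.put("other_field", null);
        int i = 1; // skip the record's opening START_OBJECT
        while (i < tokens.size() && tokens.get(i).kind != Kind.END_OBJECT) {
            Token name = tokens.get(i++);
            if (name.kind != Kind.FIELD_NAME) {
                throw new IllegalStateException("Field name expected");
            }
            String text = tokens.get(i++).text;
            if (name.text.equals("boolean_field")) {
                row.put(name.text, Boolean.valueOf(text)); // "{" decodes to false
            } else {
                row.put(name.text, Integer.valueOf(text));
            }
        }
        return row;
    }
}
```

In this model the inner object's closing bracket is indistinguishable from the end of the record, which is why the outer {{other_field}} value of 99 is never read in either case.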
BTW, when a JSON array is passed instead of an object, you get an error because
the array does not contain the field names that the code checks for.
I think a JSON object should result in an error, just as a JSON array does when
it is the field value for a scalar. If so, the fix is to make sure the value
token is a scalar for non-complex types in {{extractCurrentField}}, so
something like this:
{code:java}
if (!hcatFieldSchema.isComplex() && !valueToken.isScalarValue()) {
  throw new IOException(type + " value must be a scalar json value");
}
{code}
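To see which tokens such a check would reject, here is a minimal standalone sketch. It is an illustration only: {{ScalarGuardSketch}}, {{TokenKind}}, {{requireScalar}}, and {{accepts}} are hypothetical names, and {{TokenKind}} is a hand-written stand-in for Jackson's {{JsonToken}}. It assumes scalar tokens are exactly the {{VALUE_*}} tokens, so JSON null still passes the guard.

```java
import java.io.IOException;

// Illustrative sketch; TokenKind is a hypothetical stand-in for Jackson's JsonToken.
class ScalarGuardSketch {
    enum TokenKind {
        START_OBJECT, END_OBJECT, START_ARRAY, END_ARRAY, FIELD_NAME,
        VALUE_STRING, VALUE_NUMBER_INT, VALUE_NUMBER_FLOAT,
        VALUE_TRUE, VALUE_FALSE, VALUE_NULL;

        // Structural tokens and field names are not scalars in this model.
        boolean isScalarValue() {
            return name().startsWith("VALUE_");
        }
    }

    // Rejects non-scalar tokens for non-complex column types before any text decoding.
    static void requireScalar(String type, boolean isComplex, TokenKind valueToken)
            throws IOException {
        if (!isComplex && !valueToken.isScalarValue()) {
            throw new IOException(type + " value must be a scalar json value");
        }
    }

    // Convenience wrapper: true if the guard lets the token through.
    static boolean accepts(String type, boolean isComplex, TokenKind valueToken) {
        try {
            requireScalar(type, isComplex, valueToken);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

With this guard, an object and an array are rejected in the same way for a scalar column, while complex column types are left to their existing handling.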
--
This message was sent by Atlassian Jira
(v8.20.10#820010)