Dain Sundstrom created HIVE-26958:
-------------------------------------

             Summary: JsonSerDe data corruption when scalar type is a json object
                 Key: HIVE-26958
                 URL: https://issues.apache.org/jira/browse/HIVE-26958
             Project: Hive
          Issue Type: Bug
          Components: File Formats
            Reporter: Dain Sundstrom

JsonSerDe uses the Jackson {{JsonParser.getText}} method to decode scalar values from json strings. The problem is that this Jackson method converts any token to text, including {{START_OBJECT}} ('{'). This means that when a scalar field is actually a json object, JsonSerDe will process the open curly bracket as the value for {{BOOLEAN}}, {{DECIMAL}}, {{CHAR}}, {{VARCHAR}}, and {{VARBINARY}}. It then continues processing the fields inside the json object as if they were part of the outer json object. When the closing curly bracket is encountered, it pops a level, which can end parsing early.

This bug will result in corrupted data for the following JSON:

{code:java}
{ "boolean_field" : {}, "other_field" : 99 }
  => [boolean_field=false, other_field=null]

{ "boolean_field" : { "other_field" : 42 }, "other_field" : 99 }
  => (false, null)
  => [boolean_field=false, other_field=42]
{code}

BTW, when a json array is passed instead of an object, you get an error, because the array does not contain the fields that the code checks for.

I think this case should result in an error, like the one you get when a json array is the field value for a scalar. If so, the fix is to make sure the value token is a scalar for non-complex types in {{extractCurrentField}}, so something like this:

{code:java}
if (!hcatFieldSchema.isComplex() && !valueToken.isScalarValue()) {
    throw new IOException(type + " value must be a scalar json value");
}
{code}
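
To show the intent of that check in isolation, here is a minimal, self-contained sketch. It does not use Hive or Jackson: the {{Token}} enum is a hypothetical stand-in for the Jackson token kinds that matter here, and {{requireScalar}}, {{rejects}}, and the {{isComplex}} flag are names invented for illustration (the flag stands in for {{hcatFieldSchema.isComplex()}}). It only demonstrates which combinations the guard would reject.

```java
import java.io.IOException;

final class ScalarGuardSketch {

    /** Hypothetical stand-in for the Jackson token kinds relevant to this issue. */
    enum Token {
        START_OBJECT(false),
        START_ARRAY(false),
        VALUE_STRING(true),
        VALUE_NUMBER_INT(true),
        VALUE_TRUE(true),
        VALUE_FALSE(true),
        VALUE_NULL(true);

        private final boolean scalar;

        Token(boolean scalar) {
            this.scalar = scalar;
        }

        boolean isScalarValue() {
            return scalar;
        }
    }

    /** Same condition as the proposed fix: a non-complex field must hold a scalar token. */
    static void requireScalar(String type, boolean isComplex, Token valueToken) throws IOException {
        if (!isComplex && !valueToken.isScalarValue()) {
            throw new IOException(type + " value must be a scalar json value");
        }
    }

    /** Convenience wrapper: true when the guard throws for this combination. */
    static boolean rejects(String type, boolean isComplex, Token valueToken) {
        try {
            requireScalar(type, isComplex, valueToken);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A json object where a boolean is expected: rejected.
        System.out.println(rejects("boolean", false, Token.START_OBJECT));
        // A real scalar for a boolean field: accepted.
        System.out.println(rejects("boolean", false, Token.VALUE_FALSE));
        // A json object for a complex (struct-like) field: accepted.
        System.out.println(rejects("struct", true, Token.START_OBJECT));
    }
}
```

With a guard like this, both of the JSON inputs above would fail with an error instead of silently producing wrong values, which matches the behavior already seen for json arrays.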