Dain Sundstrom created HIVE-26958:
-------------------------------------

             Summary: JsonSerDe data corruption when scalar type is a json object
                 Key: HIVE-26958
                 URL: https://issues.apache.org/jira/browse/HIVE-26958
             Project: Hive
          Issue Type: Bug
          Components: File Formats
            Reporter: Dain Sundstrom


 

JsonSerDe uses the Jackson {{JsonParser.getText}} for decoding scalar values 
from json strings.  The problem is that this method in Jackson converts any 
token to text, including {{START_OBJECT}} ('{').  This means that when a scalar 
field is actually a json object, JsonSerDe will process the open curly bracket 
as the value for {{BOOLEAN}}, {{DECIMAL}}, {{CHAR}}, {{VARCHAR}}, and 
{{VARBINARY}}. It then continues processing the fields inside the json object 
as if they were part of the outer json object. When the closing curly bracket 
is encountered it pops a level, which can end parsing early. This bug results 
in corrupted data for the following JSON:

 
{code:java}
{ "boolean_field" : {}, "other_field" : 99 } 
  => [boolean_field=false, other_field=null]


{ "boolean_field" : { "other_field" : 42 }, "other_field" : 99 }
  => [boolean_field=false, other_field=42]{code}
 

BTW, when a json array is passed instead of an object, you get an error because 
the array does not contain fields which the code checks for.

I think the behavior should be an error, like the one you get when a json array 
is the field value for a scalar.  If so, the fix is to make sure the value token 
is a scalar for non-complex types in {{extractCurrentField}}, so something like 
this:
{code:java}
if (!hcatFieldSchema.isComplex() && !valueToken.isScalarValue()) {
    throw new IOException(type + " value must be a scalar json value");
} {code}
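
To make the failure mode and the proposed check concrete, here is a minimal toy model in plain Java. It does not use Jackson or the real JsonSerDe code: {{Kind}}, {{Token}}, {{sampleTokens}} and {{decode}} are hypothetical names invented for this sketch, and the only behavior borrowed from the report is that the text of a {{START_OBJECT}} token is "{". With the guard off it reproduces the corrupted row from the second example above; with the guard on it raises the suggested error.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ScalarGuardSketch {
    // Hypothetical stand-in for a JSON token type; only the kinds needed here.
    enum Kind { START_OBJECT, END_OBJECT, FIELD_NAME, VALUE_NUMBER }

    // A token plus its text. For START_OBJECT the text is "{", as described in the report.
    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        boolean isScalarValue() { return kind == Kind.VALUE_NUMBER; }
    }

    // Token stream for: { "boolean_field" : { "other_field" : 42 }, "other_field" : 99 }
    static List<Token> sampleTokens() {
        return Arrays.asList(
            new Token(Kind.START_OBJECT, "{"),
            new Token(Kind.FIELD_NAME, "boolean_field"),
            new Token(Kind.START_OBJECT, "{"),
            new Token(Kind.FIELD_NAME, "other_field"),
            new Token(Kind.VALUE_NUMBER, "42"),
            new Token(Kind.END_OBJECT, "}"),
            new Token(Kind.FIELD_NAME, "other_field"),
            new Token(Kind.VALUE_NUMBER, "99"),
            new Token(Kind.END_OBJECT, "}"));
    }

    // Decodes a flat row with a boolean column and an int column.
    // When guard is false, a non-scalar value token is decoded from its text,
    // which is the behavior the report describes.
    static Map<String, Object> decode(List<Token> tokens, boolean guard) throws IOException {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("boolean_field", null);
        row.put("other_field", null);
        int i = 1; // skip the outer START_OBJECT
        while (i < tokens.size() && tokens.get(i).kind != Kind.END_OBJECT) {
            String name = tokens.get(i++).text; // FIELD_NAME
            Token value = tokens.get(i++);      // whatever token follows the name
            if (guard && !value.isScalarValue()) {
                throw new IOException(name + " value must be a scalar json value");
            }
            if (name.equals("boolean_field")) {
                row.put(name, Boolean.parseBoolean(value.text)); // "{" parses as false
            } else {
                row.put(name, Integer.parseInt(value.text));
            }
        }
        return row; // the first END_OBJECT seen ends the row, even if it was the inner one
    }

    public static void main(String[] args) throws IOException {
        System.out.println(decode(sampleTokens(), false)); // {boolean_field=false, other_field=42}
        try {
            decode(sampleTokens(), true);
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In the unguarded run the inner object's closing bracket is mistaken for the end of the row, so the outer {{other_field}} value of 99 is never read. Checking the value token before decoding stops the row at the first bad field instead.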
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
