[ 
https://issues.apache.org/jira/browse/HIVE-17394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ratandeep Ratti updated HIVE-17394:
-----------------------------------
    Description: 
{{AvroDeserializer}} regenerates {{TypeInfo}} objects for every nullable field 
in every row.

This happens in the following methods:

{code}
private Object deserializeNullableUnion(Object datum, Schema fileSchema,
    Schema recordSchema) throws AvroSerdeException {
  // elided
  // line 312:
  return worker(datum, fileSchema, newRecordSchema,
      SchemaToTypeInfo.generateTypeInfo(newRecordSchema, null));
}
..
private Object deserializeSingleItemNullableUnion(Object datum, Schema fileSchema,
    Schema recordSchema) throws AvroSerdeException {
  // elided
  // line 357:
  return worker(datum, currentFileSchema, schema,
      SchemaToTypeInfo.generateTypeInfo(schema, null));
}
{code}

This is really bad for performance. I'm not sure why we don't reuse the 
TypeInfo we already have instead of generating it again for each nullable field. 
If you look at the {{worker}} method, which calls {{deserializeNullableUnion}}, 
the TypeInfo corresponding to the nullable field's column is already determined. 
Moreover, the cache in the {{SchemaToTypeInfo}} class does not help in the 
nullable Avro record case, because checking whether an Avro record schema object 
already exists in the cache requires traversing all the fields in the record 
schema.
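
To illustrate why a cache hit can still be expensive, here is a minimal, self-contained sketch. It does not use the Avro or Hive classes: {{SketchRecordSchema}} and {{SketchCacheDemo}} are hypothetical stand-ins, and I am assuming a cache keyed by a schema object whose hash is computed structurally over its fields.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a record schema; NOT the Avro Schema class.
final class SketchRecordSchema {
    static long fieldVisits = 0;            // fields touched while hashing
    private final List<String> fields;

    SketchRecordSchema(List<String> fields) {
        this.fields = new ArrayList<>(fields);
    }

    @Override
    public int hashCode() {
        int h = 1;
        for (String f : fields) {           // structural hash walks every field
            fieldVisits++;
            h = 31 * h + f.hashCode();
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SketchRecordSchema
            && fields.equals(((SketchRecordSchema) o).fields);
    }
}

final class SketchCacheDemo {
    // Counts the field visits caused by `lookups` cache hits on a record
    // schema with `fieldCount` fields.
    static long visitsForLookups(int fieldCount, int lookups) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < fieldCount; i++) {
            names.add("f" + i);
        }
        SketchRecordSchema schema = new SketchRecordSchema(names);
        Map<SketchRecordSchema, String> cache = new HashMap<>();
        cache.put(schema, "typeInfo");      // placeholder cached value

        SketchRecordSchema.fieldVisits = 0; // ignore the cost of the put
        for (int i = 0; i < lookups; i++) {
            cache.get(schema);              // every hit re-hashes the key
        }
        return SketchRecordSchema.fieldVisits;
    }
}
```

In this sketch every lookup costs one full pass over the record's fields, so the per-row cost grows with the width of the record even though nothing is ever regenerated on a hit.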

I've attached a profiling snapshot, which shows that most of the time is being 
spent in the cache.

One way of fixing this, IMO, might be to make use of the column TypeInfo that 
is already passed to the {{worker}} method.
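
A rough sketch of that idea follows. It is not the Hive code and not a patch: {{SketchDeserializer}}, its methods, and the use of plain strings in place of {{Schema}} and {{TypeInfo}} are all hypothetical, chosen only to contrast regenerating the type information per value with passing the caller's column type information through.

```java
// Hypothetical stand-in; NOT the Hive AvroDeserializer.
final class SketchDeserializer {
    static int generated = 0;   // how many times type info was regenerated

    // Stand-in for an expensive schema-to-TypeInfo conversion.
    static String generateTypeInfo(String schema) {
        generated++;
        return "typeinfo(" + schema + ")";
    }

    // Stand-in for the worker: it only needs a datum and its type info.
    static String worker(Object datum, String typeInfo) {
        return datum + ":" + typeInfo;
    }

    // Pattern described in this issue: regenerate for every nullable value.
    static String deserializeRegenerating(Object datum, String nonNullSchema) {
        return worker(datum, generateTypeInfo(nonNullSchema));
    }

    // Suggested pattern: reuse the column type info the caller already has.
    static String deserializeReusing(Object datum, String columnTypeInfo) {
        return worker(datum, columnTypeInfo);
    }

    public static void main(String[] args) {
        String columnTypeInfo = generateTypeInfo("string"); // done once per column
        for (int row = 0; row < 3; row++) {
            System.out.println(deserializeReusing("value" + row, columnTypeInfo));
        }
        System.out.println("generateTypeInfo calls: " + generated);
    }
}
```

With the reusing variant the conversion runs once per column rather than once per nullable value per row. Whether the column TypeInfo can be passed through unchanged for every union shape is something the real change would need to verify.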

  was:
{{AvroDeserializer}} regenerates TypeInfo objects for every nullable field in 
every row.

This happens in the following methods:

{code}
private Object deserializeNullableUnion(Object datum, Schema fileSchema,
    Schema recordSchema) throws AvroSerdeException {
  // elided
  // line 312:
  return worker(datum, fileSchema, newRecordSchema,
      SchemaToTypeInfo.generateTypeInfo(newRecordSchema, null));
}
..
private Object deserializeSingleItemNullableUnion(Object datum, Schema fileSchema,
    Schema recordSchema) throws AvroSerdeException {
  // elided
  // line 357:
  return worker(datum, currentFileSchema, schema,
      SchemaToTypeInfo.generateTypeInfo(schema, null));
}
{code}

This is really bad for performance. I'm not sure why we don't reuse the 
TypeInfo we already have instead of generating it again for each nullable field. 
If you look at the {{worker}} method, which calls {{deserializeNullableUnion}}, 
the TypeInfo corresponding to the nullable field's column is already determined. 
Not sure why we have to determine that information again.

Moreover, the cache in SchemaToTypeInfo does not help in the nullable Avro 
record case, because checking whether an Avro record schema object already 
exists in the cache requires traversing all the fields in the record schema.

I've attached a profiling snapshot, which shows that most of the time is being 
spent in the cache.

One way of fixing this, IMO, is to make use of the column TypeInfo that is 
already passed to the worker method.


> AvroSerde is regenerating TypeInfo objects for each nullable Avro field for 
> every row
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-17394
>                 URL: https://issues.apache.org/jira/browse/HIVE-17394
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Ratandeep Ratti
>         Attachments: AvroSerDe.nps, AvroSerDeUnionTypeInfo.png
>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
