[jira] [Commented] (HIVE-17394) AvroSerde is regenerating TypeInfo objects for each nullable Avro field for every row

Ratandeep Ratti (JIRA) Mon, 28 Aug 2017 10:28:33 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-17394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144090#comment-16144090
 ]


Ratandeep Ratti commented on HIVE-17394:
----------------------------------------

I found this problem in Hive 1.1. I haven't looked closely at Hive 2.x or 
trunk, but from a high-level reading of the code the problem appears to 
exist there as well.

> AvroSerde is regenerating TypeInfo objects for each nullable Avro field for 
> every row
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-17394
>                 URL: https://issues.apache.org/jira/browse/HIVE-17394
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Ratandeep Ratti
>         Attachments: AvroSerDe.nps, AvroSerDeUnionTypeInfo.png
>
>
> The following methods in {{AvroDeserializer}} regenerate TypeInfo 
> objects for every nullable field in every row:
> {code}
> private Object deserializeNullableUnion(Object datum, Schema fileSchema,
>     Schema recordSchema) throws AvroSerdeException {
>   // elided
>   // line 312:
>   return worker(datum, fileSchema, newRecordSchema,
>       SchemaToTypeInfo.generateTypeInfo(newRecordSchema, null));
> }
> ..
> private Object deserializeSingleItemNullableUnion(Object datum, Schema fileSchema,
>     Schema recordSchema) {
>   // elided
>   // line 357:
>   return worker(datum, currentFileSchema, schema,
>       SchemaToTypeInfo.generateTypeInfo(schema, null));
> }
> {code}
> This is really bad for performance. I'm not sure why we don't reuse 
> the TypeInfo we already have instead of generating it again for each 
> nullable field. The {{worker}} method, which calls 
> {{deserializeNullableUnion}}, has already been given the TypeInfo for the 
> nullable field's column, so there should be no need to determine that 
> information again.
> Moreover, the cache in SchemaToTypeInfo does not help in the nullable Avro 
> record case, because checking whether an Avro record schema object already 
> exists in the cache requires traversing all the fields in the record schema.
> I've attached a profiling snapshot which shows that most of the time is 
> being spent in the cache.
> One way of fixing this, IMO, is to make use of the column TypeInfo that is 
> already passed to the worker method.
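
To illustrate the difference between regenerating and reusing the TypeInfo, here is a minimal, self-contained sketch. It does not use the Hive or Avro classes: TypeInfoStub, generateTypeInfo, worker, deserializeRegenerating, deserializeReusing and countGenerateCalls are all hypothetical stand-ins, and the sketch assumes the column TypeInfo handed to the worker already describes the non-null branch of the nullable union. It only counts how often the conversion runs, and says nothing about the real cost of each call.

```java
// Hypothetical stand-ins only: none of these are the real Hive or Avro classes.
class TypeInfoReuseSketch {

    /** Stand-in for a Hive TypeInfo; the real class carries far more structure. */
    static final class TypeInfoStub {
        final String typeName;
        TypeInfoStub(String typeName) { this.typeName = typeName; }
    }

    /** Counts calls to the stand-in for SchemaToTypeInfo.generateTypeInfo. */
    static int generateCalls = 0;

    /** Stand-in for the schema-to-TypeInfo conversion. */
    static TypeInfoStub generateTypeInfo(String nonNullSchemaName) {
        generateCalls++;
        return new TypeInfoStub(nonNullSchemaName);
    }

    /** Stand-in for worker(): here it only hands the datum back. */
    static Object worker(Object datum, TypeInfoStub columnType) {
        return datum;
    }

    /** Behaviour described in the report: a fresh TypeInfo per nullable field per row. */
    static Object deserializeRegenerating(Object datum, String nonNullSchemaName) {
        if (datum == null) return null;
        return worker(datum, generateTypeInfo(nonNullSchemaName));
    }

    /** Suggested alternative: pass through the TypeInfo the caller already holds. */
    static Object deserializeReusing(Object datum, TypeInfoStub columnType) {
        if (datum == null) return null;
        return worker(datum, columnType);
    }

    /** Returns how many conversions run for the given number of rows and nullable fields. */
    static int countGenerateCalls(boolean reuse, int rows, int nullableFields) {
        generateCalls = 0;
        // Column TypeInfos are built once up front in both variants.
        TypeInfoStub[] columnTypes = new TypeInfoStub[nullableFields];
        for (int f = 0; f < nullableFields; f++) {
            columnTypes[f] = generateTypeInfo("string");
        }
        for (int r = 0; r < rows; r++) {
            for (int f = 0; f < nullableFields; f++) {
                Object datum = "value";
                if (reuse) {
                    deserializeReusing(datum, columnTypes[f]);
                } else {
                    deserializeRegenerating(datum, columnTypes[f].typeName);
                }
            }
        }
        return generateCalls;
    }

    public static void main(String[] args) {
        System.out.println("regenerating: " + countGenerateCalls(false, 1000, 5));
        System.out.println("reusing:      " + countGenerateCalls(true, 1000, 5));
    }
}
```

In this toy model the regenerating path performs one conversion per nullable field per row on top of the one-time setup, while the reusing path performs only the setup conversions, however many rows are read.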



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
