[ https://issues.apache.org/jira/browse/HIVE-17394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144090#comment-16144090 ]
Ratandeep Ratti commented on HIVE-17394:
----------------------------------------

I found this problem with Hive 1.1. I didn't look too closely at Hive 2.x / trunk, but from a high-level reading of the code it seems the problem exists there as well.

> AvroSerde is regenerating TypeInfo objects for each nullable Avro field for every row
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-17394
>                 URL: https://issues.apache.org/jira/browse/HIVE-17394
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Ratandeep Ratti
>         Attachments: AvroSerDe.nps, AvroSerDeUnionTypeInfo.png
>
>
> The following methods in {{AvroDeserializer}} regenerate TypeInfo objects
> for every nullable field in every row:
> {code}
> private Object deserializeNullableUnion(Object datum, Schema fileSchema,
>     Schema recordSchema) throws AvroSerdeException {
>   // elided
>   // line 312:
>   return worker(datum, fileSchema, newRecordSchema,
>       SchemaToTypeInfo.generateTypeInfo(newRecordSchema, null));
> }
> ..
> private Object deserializeSingleItemNullableUnion(Object datum,
>     Schema fileSchema, Schema recordSchema)
>   // elided
>   // line 357:
>   return worker(datum, currentFileSchema, schema,
>       SchemaToTypeInfo.generateTypeInfo(schema, null));
> {code}
> This is really bad for performance. I'm not sure why we don't use the
> TypeInfo we already have instead of generating it again for each nullable
> field. If you look at the {{worker}} method, which calls
> {{deserializeNullableUnion}}, the TypeInfo corresponding to the nullable
> field's column is already determined, so I'm not sure why we have to
> determine that information again.
> Moreover, the cache in {{SchemaToTypeInfo}} does not help in the nullable
> Avro record case, because checking whether an Avro record schema object
> already exists in the cache requires traversing all the fields in the
> record schema.
> I've attached a profiling snapshot which shows that most of the time is
> spent in the cache.
> One way of fixing this, IMO, is to use the column TypeInfo that is already
> passed to the {{worker}} method.
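
The suggested fix can be sketched roughly as follows. This is a minimal, self-contained illustration and not Hive code: `NullableUnionSketch`, `countCalls`, `deserializeRegenerating` and `deserializeReusing` are hypothetical names, schemas and TypeInfo objects are modelled as plain strings, and `generateTypeInfo` and `worker` are simple stand-ins for the methods quoted above. It also assumes that the column TypeInfo passed to `worker` describes the same type as the one regenerated from the non-null branch of the union.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; none of these are Hive classes.
final class NullableUnionSketch {

    // Counts how often the (assumed expensive) schema-to-type conversion runs.
    static int generateCalls = 0;

    // Stand-in for SchemaToTypeInfo.generateTypeInfo(schema, null).
    static String generateTypeInfo(String schema) {
        generateCalls++;
        return "typeinfo:" + schema;
    }

    // Stand-in for worker(...); the real method dispatches on the TypeInfo.
    static Object worker(Object datum, String typeInfo) {
        return datum;
    }

    // Behaviour described in the issue: the TypeInfo is regenerated on every call.
    static Object deserializeRegenerating(Object datum, String branchSchema) {
        return worker(datum, generateTypeInfo(branchSchema));
    }

    // Suggested behaviour: forward the column TypeInfo the caller already holds.
    static Object deserializeReusing(Object datum, String columnTypeInfo) {
        return worker(datum, columnTypeInfo);
    }

    // Deserializes `rows` rows of nullable fields and returns how many times
    // generateTypeInfo was called in total (including the one-time setup).
    static int countCalls(int rows, List<String> fieldSchemas, boolean reuse) {
        generateCalls = 0;
        // Column types are resolved once, before any row is read.
        String[] columnTypes = new String[fieldSchemas.size()];
        for (int i = 0; i < columnTypes.length; i++) {
            columnTypes[i] = generateTypeInfo(fieldSchemas.get(i));
        }
        for (int r = 0; r < rows; r++) {
            for (int i = 0; i < columnTypes.length; i++) {
                if (reuse) {
                    deserializeReusing("value", columnTypes[i]);
                } else {
                    deserializeRegenerating("value", fieldSchemas.get(i));
                }
            }
        }
        return generateCalls;
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("string", "int", "double");
        System.out.println("regenerating: " + countCalls(1000, fields, false));
        System.out.println("reusing:      " + countCalls(1000, fields, true));
    }
}
```

With 1000 rows and three nullable fields, the regenerating variant performs 3003 conversions (3 at setup plus 3 per row), while the reusing variant performs only the 3 at setup. The sketch only counts calls; the per-call cost in Hive, including the cache lookup, is what the attached profile measures.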