[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergio Peña updated HIVE-9333:
------------------------------
    Status: Patch Available  (was: Open)

> Move parquet serialize implementation to DataWritableWriter to improve write speeds
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-9333
>                 URL: https://issues.apache.org/jira/browse/HIVE-9333
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>         Attachments: HIVE-9333.1.patch, HIVE-9333.2.patch
>
>
> The serialize process in ParquetHiveSerDe converts a Hive object into a Writable object by looping through all of the Hive object's children and creating a new Writable object per child. These Writable objects are then passed to the Parquet writing function and parsed again in the DataWritableWriter class by looping through the ArrayWritable object. These two loops (ParquetHiveSerDe.serialize() and DataWritableWriter.write()) may be reduced to a single loop in the DataWritableWriter.write() method in order to speed up the Parquet write path for Hive.
> To achieve this, we can wrap the Hive object and its object inspector, in the ParquetHiveSerDe.serialize() method, inside an object that implements the Writable interface, thus avoiding the loop that serialize() currently does, and leave the parsing loop to the DataWritableWriter.write() method. We can see how ORC does this with the OrcSerde.OrcSerdeRow class.
> Writable objects are organized differently by each storage format, so I don't think it is necessary to create and keep the Writable objects in the serialize() method, as they won't be used until the writing process starts (DataWritableWriter.write()).
> This performance issue was found using microbenchmark tests from HIVE-8121.
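
A minimal sketch of the wrapping idea described above, following the OrcSerde.OrcSerdeRow pattern: the record implements Writable only to satisfy the OutputFormat contract, while the actual field traversal is deferred to DataWritableWriter.write(). The class name ParquetHiveRecord and its members are illustrative assumptions, not necessarily what the attached patches implement.

{code:java}
import java.io.DataInput;
import java.io.DataOutput;

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.io.Writable;

// Illustrative wrapper: carries the raw Hive row together with its object
// inspector so DataWritableWriter.write() can walk the fields in one pass.
public class ParquetHiveRecord implements Writable {
  private final Object row;
  private final ObjectInspector inspector;

  public ParquetHiveRecord(Object row, ObjectInspector inspector) {
    this.row = row;
    this.inspector = inspector;
  }

  public Object getObject() {
    return row;
  }

  public ObjectInspector getObjectInspector() {
    return inspector;
  }

  // The record is consumed directly by the Parquet writer, so the standard
  // Writable serialization hooks are never exercised.
  @Override
  public void write(DataOutput out) {
    throw new UnsupportedOperationException("serialized by DataWritableWriter");
  }

  @Override
  public void readFields(DataInput in) {
    throw new UnsupportedOperationException("not supported");
  }
}
{code}

With such a wrapper, ParquetHiveSerDe.serialize() reduces to returning new ParquetHiveRecord(obj, objectInspector) with no per-field copy into an ArrayWritable, and the single traversal happens once the writer receives the record.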