[jira] [Updated] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds

JIRA Mon, 26 Jan 2015 17:31:09 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sergio Peña updated HIVE-9333:
------------------------------
    Description: 
The serialize process in ParquetHiveSerDe converts a Hive object into a
Writable object by looping through all of the Hive object's children and
creating a new Writable object per child. The resulting Writable objects
are passed to the Parquet writing function and parsed again in the
DataWritableWriter class by looping through the ArrayWritable object.
These two loops (ParquetHiveSerDe.serialize() and
DataWritableWriter.write()) could be reduced to a single loop in the
DataWritableWriter.write() method in order to increase the write speed
for Hive Parquet.

To achieve this, the ParquetHiveSerDe.serialize() method can wrap the Hive
object and its object inspector in an object that implements the Writable
interface, which avoids the loop in serialize() and leaves the parsing loop
to the DataWritableWriter.write() method. ORC does this with the
OrcSerde.OrcSerdeRow class.
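
A minimal Java sketch of the wrapping idea described above. Everything in it
is a simplified stand-in written for illustration: the local Writable
interface only mirrors the two methods of org.apache.hadoop.io.Writable, and
SimpleInspector, RowWrapper, and SingleLoopSketch are hypothetical names, not
the Hive classes or the code in the attached patch. It assumes a row is a
flat list of values; real Hive rows are nested and read through an
ObjectInspector.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SingleLoopSketch {
    // Stand-in for ParquetHiveSerDe.serialize(): a constant-time wrap,
    // with no per-field loop and no per-field Writable allocation.
    static RowWrapper serialize(List<?> row, SimpleInspector inspector) {
        return new RowWrapper(row, inspector);
    }

    // Stand-in for DataWritableWriter.write(): the only pass over the fields.
    // A real writer would emit Parquet values; this one records strings.
    static List<String> write(RowWrapper wrapped) {
        List<String> emitted = new ArrayList<>();
        for (int i = 0; i < wrapped.inspector.fieldCount(); i++) {
            Object value = wrapped.row.get(i);
            if (value != null) { // null fields are skipped
                emitted.add(wrapped.inspector.fieldName(i) + "=" + value);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        SimpleInspector inspector = new SimpleInspector("id", "name", "city");
        RowWrapper wrapped = serialize(Arrays.asList(7, null, "Austin"), inspector);
        System.out.println(write(wrapped)); // prints [id=7, city=Austin]
    }
}

// Local stand-in with the same two methods as org.apache.hadoop.io.Writable.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical, much-reduced substitute for a Hive object inspector.
final class SimpleInspector {
    private final String[] names;
    SimpleInspector(String... names) { this.names = names; }
    int fieldCount() { return names.length; }
    String fieldName(int i) { return names[i]; }
}

// Hypothetical wrapper: carries the raw row and its inspector untouched.
// It is a Writable only so it can travel through the existing interface;
// in this sketch it refuses to be serialized on its own.
final class RowWrapper implements Writable {
    final List<?> row;
    final SimpleInspector inspector;

    RowWrapper(List<?> row, SimpleInspector inspector) {
        this.row = row;
        this.inspector = inspector;
    }

    @Override
    public void write(DataOutput out) {
        throw new UnsupportedOperationException("wrapper is consumed by the writer");
    }

    @Override
    public void readFields(DataInput in) {
        throw new UnsupportedOperationException("wrapper is consumed by the writer");
    }
}
```

In this arrangement serialize() does no per-field work, so the row is walked
once, inside write(), instead of once in each method.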

Writable objects are organized differently for each storage format, so I
don't think it is necessary to create and keep the Writable objects in the
serialize() method, as they won't be used until the writing process starts
(DataWritableWriter.write()).

This performance issue was found using microbenchmark tests from HIVE-8121.

  was:
The serialize process in ParquetHiveSerDe converts a Hive object into a
Writable object by looping through all of the Hive object's children and
creating a new Writable object per child. The resulting Writable objects
are passed to the Parquet writing function and parsed again in the
DataWritableWriter class by looping through the ArrayWritable object.
These two loops (ParquetHiveSerDe.serialize() and
DataWritableWriter.write()) could be reduced to a single loop in the
DataWritableWriter.write() method in order to increase the write speed
for Hive Parquet.

To achieve this, the ParquetHiveSerDe.serialize() method can wrap the Hive
object and its object inspector in an object that implements the Writable
interface, which avoids the loop in serialize() and leaves the parsing loop
to the DataWritableWriter.write() method. ORC does this with the
OrcSerde.OrcSerdeRow class.

Writable objects are organized differently for each storage format, so I
don't think it is necessary to create and keep the Writable objects in the
serialize() method, as they won't be used until the writing process starts
(DataWritableWriter.write()).

We might save 200% of extra time by making this change.
This performance issue was found using microbenchmark tests from HIVE-8121.


> Move parquet serialize implementation to DataWritableWriter to improve write 
> speeds
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-9333
>                 URL: https://issues.apache.org/jira/browse/HIVE-9333
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>         Attachments: HIVE-9333.1.patch
>
>
> The serialize process in ParquetHiveSerDe converts a Hive object into a
> Writable object by looping through all of the Hive object's children and
> creating a new Writable object per child. The resulting Writable objects
> are passed to the Parquet writing function and parsed again in the
> DataWritableWriter class by looping through the ArrayWritable object.
> These two loops (ParquetHiveSerDe.serialize() and
> DataWritableWriter.write()) could be reduced to a single loop in the
> DataWritableWriter.write() method in order to increase the write speed
> for Hive Parquet.
> To achieve this, the ParquetHiveSerDe.serialize() method can wrap the Hive
> object and its object inspector in an object that implements the Writable
> interface, which avoids the loop in serialize() and leaves the parsing loop
> to the DataWritableWriter.write() method. ORC does this with the
> OrcSerde.OrcSerdeRow class.
> Writable objects are organized differently for each storage format, so I
> don't think it is necessary to create and keep the Writable objects in the
> serialize() method, as they won't be used until the writing process starts
> (DataWritableWriter.write()).
> This performance issue was found using microbenchmark tests from HIVE-8121.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
