[
https://issues.apache.org/jira/browse/TAJO-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943833#comment-13943833
]
David Chen edited comment on TAJO-30 at 3/22/14 7:39 AM:
---------------------------------------------------------
Hi Hyunsik,
That's an interesting idea. Do you mean that Tajo will use Parquet as the
default storage format or have all storage formats deserialize into a
representation that follows the Dremel model? Parquet doesn't really have its
own in-memory representation. Instead, each of the Parquet packages
deserializes into a given in-memory representation using its readers and
writers. For example, parquet-avro deserializes into Avro GenericRecords (or
SpecificRecords), parquet-pig deserializes into Pig Tuples, and my code
deserializes into Tajo Tuples.
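To make the binding pattern concrete, here is a minimal, self-contained sketch of what each Parquet package does: it supplies a materializer that assembles decoded column values into the engine's own record type. The interface and class names below are illustrative only, not the actual parquet-mr or Tajo APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the per-binding materializer concept:
// parquet-avro would produce GenericRecords, parquet-pig would produce
// Pig Tuples, and a Tajo binding would produce Tajo Tuples.
interface Materializer<T> {
    void addField(int index, Object value); // receives a decoded column value
    T materialize();                        // produces the engine-side record
}

// A Tajo-style binding materializes into a Tuple (modeled here as a List).
class TupleMaterializer implements Materializer<List<Object>> {
    private final List<Object> fields = new ArrayList<>();

    public void addField(int index, Object value) {
        fields.add(index, value);
    }

    public List<Object> materialize() {
        return new ArrayList<>(fields);
    }
}

public class MaterializerDemo {
    public static void main(String[] args) {
        TupleMaterializer m = new TupleMaterializer();
        m.addField(0, "alice");
        m.addField(1, 30);
        System.out.println(m.materialize()); // prints [alice, 30]
    }
}
```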
My changes are currently in the {{parquet}} branch in my fork on GitHub:
https://github.com/davidzchen/tajo/tree/parquet
They are almost ready. During further testing, I found a few more issues, most
of which I have now fixed. One thing I noticed was that when reading a
projection, the resulting Tuple still has all the columns of the table schema
but the non-projected fields are simply null. What is the motivation for
retaining all the columns in the Tuple rather than having the Tuple only
contain the projected columns?
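For clarity, here is a small self-contained sketch of the behavior described above: a projected read still produces a row sized to the full table schema, with nulls in the non-projected positions. The schema and projection handling are deliberately simplified and do not reflect Tajo's actual classes.

```java
import java.util.Arrays;

public class ProjectionDemo {
    // Build a full-width row, filling only the projected column indexes;
    // all other positions are left null, mirroring the observed behavior.
    static Object[] readProjected(Object[] fullRow, int[] projection) {
        Object[] out = new Object[fullRow.length]; // full schema width
        for (int col : projection) {
            out[col] = fullRow[col];               // copy projected columns
        }
        return out;                                // non-projected stay null
    }

    public static void main(String[] args) {
        Object[] row = {"alice", 30, "seattle"};
        Object[] projected = readProjected(row, new int[]{0, 2});
        System.out.println(Arrays.toString(projected)); // prints [alice, null, seattle]
    }
}
```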
One last test is still failing because I am not handling the {{NULL_TYPE}}
data type when converting the Tajo schema to a Parquet schema on write. What is
{{NULL_TYPE}} used for? I wasn't able to find
much documentation on its use. I can always write this as a placeholder column
or special-case it. Once I fix this, I will post a review request.
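As a sketch of the placeholder option mentioned above, the write-side schema conversion could special-case {{NULL_TYPE}} instead of failing. The type names and string encodings below are illustrative, not the actual Tajo or Parquet enums.

```java
public class SchemaConverterSketch {
    // Hypothetical subset of Tajo's data types, including NULL_TYPE.
    enum TajoType { INT4, FLOAT8, TEXT, NULL_TYPE }

    // Map a Tajo type to a Parquet type description on write.
    static String toParquetType(TajoType type) {
        switch (type) {
            case INT4:   return "int32";
            case FLOAT8: return "double";
            case TEXT:   return "binary (UTF8)";
            // Special case: NULL_TYPE has no Parquet equivalent, so emit
            // an optional placeholder column that is always null rather
            // than throwing during schema conversion.
            case NULL_TYPE: return "optional binary (placeholder, always null)";
            default: throw new IllegalArgumentException("unknown type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(toParquetType(TajoType.NULL_TYPE));
    }
}
```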
There are some follow-up work items that I plan to do, most likely as review
changes:
* Add TableStats to ParquetAppender.
* Figure out whether ParquetAppender.flush() is needed.
* Perform additional end-to-end testing.
* Add documentation.
Thanks,
David
Edit: Update GitHub link.
> Parquet Integration
> -------------------
>
> Key: TAJO-30
> URL: https://issues.apache.org/jira/browse/TAJO-30
> Project: Tajo
> Issue Type: New Feature
> Reporter: Hyunsik Choi
> Assignee: David Chen
> Labels: Parquet
> Attachments: TAJO-30.patch
>
>
> Parquet is a columnar storage format developed by Twitter. Implement Parquet
> (http://parquet.io/) support for Tajo.
> The implementation consists of the following:
> * {{ParquetScanner}} and {{ParquetAppender}} - FileScanner and FileAppenders
> for reading and writing Parquet.
> * {{TajoParquetReader}} and {{TajoParquetWriter}} - Top-level reader and
> writer for serializing/deserializing to Tajo Tuples.
> * {{TajoReadSupport}} and {{TajoWriteSupport}} - Abstractions to perform
> conversion between Parquet and Tajo records.
> * {{TajoRecordMaterializer}} - Materializes Tajo Tuples from Parquet's
> internal representation.
> * {{TajoRecordConverter}} - Used by {{TajoRecordMaterializer}} to
> materialize a Tajo Tuple.
> * {{TajoSchemaConverter}} - Converts between Tajo and Parquet schemas.
--
This message was sent by Atlassian JIRA
(v6.2#6252)