[
https://issues.apache.org/jira/browse/TAJO-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943833#comment-13943833
]
David Chen edited comment on TAJO-30 at 3/22/14 7:39 AM:
---------------------------------------------------------
Hi Hyunsik,
That's an interesting idea. Do you mean that Tajo will use Parquet as the
default storage format or have all storage formats deserialize into a
representation that follows the Dremel model? Parquet doesn't really have its
own in-memory representation. Instead, each of the Parquet packages
deserializes into a given in-memory representation using its readers and
writers. For example, parquet-avro deserializes into Avro GenericRecords (or
SpecificRecords), parquet-pig deserializes into Pig Tuples, and my code
deserializes into Tajo Tuples.
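To make the binding pattern concrete, here is a minimal, self-contained sketch of what each Parquet package does: it supplies a materializer that assembles decoded column values into the engine's own record type. The interface and class names below are illustrative only, not the actual parquet-mr or Tajo APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the per-binding materializer concept:
// parquet-avro would produce GenericRecords, parquet-pig would produce
// Pig Tuples, and a Tajo binding would produce Tajo Tuples.
interface Materializer<T> {
    void addField(int index, Object value); // receives a decoded column value
    T materialize();                        // produces the engine-side record
}

// A Tajo-style binding materializes into a Tuple (modeled here as a List).
class TupleMaterializer implements Materializer<List<Object>> {
    private final List<Object> fields = new ArrayList<>();

    public void addField(int index, Object value) {
        fields.add(index, value);
    }

    public List<Object> materialize() {
        return new ArrayList<>(fields);
    }
}

public class MaterializerDemo {
    public static void main(String[] args) {
        TupleMaterializer m = new TupleMaterializer();
        m.addField(0, "alice");
        m.addField(1, 30);
        System.out.println(m.materialize()); // prints [alice, 30]
    }
}
```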
My changes are currently in the {{parquet}} branch in my fork on GitHub:
https://github.com/davidzchen/tajo/tree/parquet
They are almost ready. During further testing, I found a few more issues, most
of which I have now fixed. One thing I noticed was that when reading a
projection, the resulting Tuple still has all the columns of the table schema
but the non-projected fields are simply null. What is the motivation for
retaining all the columns in the Tuple rather than having the Tuple only
contain the projected columns?
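For clarity, here is a small self-contained sketch of the behavior described above: a projected read still produces a row sized to the full table schema, with nulls in the non-projected positions. The schema and projection handling are deliberately simplified and do not reflect Tajo's actual classes.

```java
import java.util.Arrays;

public class ProjectionDemo {
    // Build a full-width row, filling only the projected column indexes;
    // all other positions are left null, mirroring the observed behavior.
    static Object[] readProjected(Object[] fullRow, int[] projection) {
        Object[] out = new Object[fullRow.length]; // full schema width
        for (int col : projection) {
            out[col] = fullRow[col];               // copy projected columns
        }
        return out;                                // non-projected stay null
    }

    public static void main(String[] args) {
        Object[] row = {"alice", 30, "seattle"};
        Object[] projected = readProjected(row, new int[]{0, 2});
        System.out.println(Arrays.toString(projected)); // prints [alice, null, seattle]
    }
}
```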
One last test is still failing because I am not handling the {{NULL_TYPE}}
data type when converting the Tajo schema to a Parquet schema on write. What is
{{NULL_TYPE}} used for? I wasn't able to find
much documentation on its use. I can always write this as a placeholder column
or special-case it. Once I fix this, I will post a review request.
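As a sketch of the placeholder option mentioned above, the write-side schema conversion could special-case {{NULL_TYPE}} instead of failing. The type names and string encodings below are illustrative, not the actual Tajo or Parquet enums.

```java
public class SchemaConverterSketch {
    // Hypothetical subset of Tajo's data types, including NULL_TYPE.
    enum TajoType { INT4, FLOAT8, TEXT, NULL_TYPE }

    // Map a Tajo type to a Parquet type description on write.
    static String toParquetType(TajoType type) {
        switch (type) {
            case INT4:   return "int32";
            case FLOAT8: return "double";
            case TEXT:   return "binary (UTF8)";
            // Special case: NULL_TYPE has no Parquet equivalent, so emit
            // an optional placeholder column that is always null rather
            // than throwing during schema conversion.
            case NULL_TYPE: return "optional binary (placeholder, always null)";
            default: throw new IllegalArgumentException("unknown type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(toParquetType(TajoType.NULL_TYPE));
    }
}
```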
There are some follow-up work items that I plan to do, most likely as review
changes:
* Add TableStats to ParquetAppender.
* Figure out whether ParquetAppender.flush() is needed.
* Perform additional end-to-end testing.
* Add documentation.
Thanks,
David
Edit: Update GitHub link.
> Parquet Integration
> -------------------
>
> Key: TAJO-30
> URL: https://issues.apache.org/jira/browse/TAJO-30
> Project: Tajo
> Issue Type: New Feature
> Reporter: Hyunsik Choi
> Assignee: David Chen
> Labels: Parquet
> Attachments: TAJO-30.patch
>
>
> Parquet is a columnar storage format developed by Twitter. Implement Parquet
> (http://parquet.io/) support for Tajo.
> The implementation consists of the following:
> * {{ParquetScanner}} and {{ParquetAppender}} - FileScanner and FileAppenders
> for reading and writing Parquet.
> * {{TajoParquetReader}} and {{TajoParquetWriter}} - Top-level reader and
> writer for serializing/deserializing to Tajo Tuples.
> * {{TajoReadSupport}} and {{TajoWriteSupport}} - Abstractions to perform
> conversion between Parquet and Tajo records.
> * {{TajoRecordMaterializer}} - Materializes Tajo Tuples from Parquet's
> internal representation.
> * {{TajoRecordConverter}} - Used by {{TajoRecordMaterializer}} to
> materialize a Tajo Tuple.
> * {{TajoSchemaConverter}} - Converts between Tajo and Parquet schemas.
--
This message was sent by Atlassian JIRA
(v6.2#6252)