[ 
https://issues.apache.org/jira/browse/TAJO-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986743#comment-13986743
 ] 

Hyunsik Choi commented on TAJO-710:
-----------------------------------

Hi David,

I'm very interested in your ongoing work. Following up on the comment in 
TAJO-809, I'd like to continue the discussion about the internal data model.

Before I start, I should say that I don't yet have a strong opinion on this 
topic. I just want to find the best way.

Currently, there are various file formats in the Hadoop ecosystem, such as 
Avro, ORC, and Parquet. Also, Hive and Pig already have their own data models, 
including complex nested data types. Ultimately, Tajo should support all of 
these file formats and should keep compatibility with Hive and Pig if possible. 
The challenge I see is how to support them all (or most of them) in a simple 
way. Since their models differ somewhat from one another, we need an internal 
data model and a way to translate the other models into it.

Now, instead of creating a new model, I'm focusing on which existing model can 
cover the other models in a simple, processing-friendly way. My rough idea is 
that the Dremel model is a good candidate for this purpose. The array, map, 
union, tuple, bag, and struct types used in existing systems are probably just 
instances of a nested record type with some constraints.

For example, an array type can be represented by a repeated field. A map is an 
instance of a record type (as you used in TAJO-809) and can be represented by a 
repeated key-value pair record. Also, a union type can be represented by a 
record type composed of multiple optional fields. Of course, a union is more 
restricted: only one of those fields may be set at a time.
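
To make the mapping above concrete, here is a minimal Python sketch of how 
array, map, and union types could be expressed as nested records with 
Dremel-style repetition labels. All names here (Field, array_of, map_of, 
union_of, render) and the type names "text" and "int4" are hypothetical and 
chosen only for illustration; this is not Tajo's schema API, and the choice of 
a required key with an optional value is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical repetition labels, in the style of the Dremel model.
REQUIRED, OPTIONAL, REPEATED = "required", "optional", "repeated"


@dataclass
class Field:
    name: str
    type: str                      # primitive type name, or "record" for a group
    repetition: str = REQUIRED
    children: List["Field"] = field(default_factory=list)


def array_of(name, element_type):
    """An array is a repeated field of the element type."""
    return Field(name, element_type, REPEATED)


def map_of(name, key_type, value_type):
    """A map is a repeated record holding one key and one value."""
    return Field(name, "record", REPEATED, [
        Field("key", key_type, REQUIRED),
        Field("value", value_type, OPTIONAL),  # assumption: values may be null
    ])


def union_of(name, branch_types):
    """A union is a record of optional fields, one per branch.

    The "only one branch is set" constraint is not expressed by this
    schema; it would have to be enforced separately.
    """
    return Field(name, "record", REQUIRED, [
        Field(f"branch_{i}_{t}", t, OPTIONAL)
        for i, t in enumerate(branch_types)
    ])


def render(f, indent=0):
    """Return the schema as a list of text lines, one per field or brace."""
    pad = "  " * indent
    if f.type == "record":
        lines = [f"{pad}{f.repetition} group {f.name} {{"]
        for child in f.children:
            lines.extend(render(child, indent + 1))
        lines.append(f"{pad}}}")
        return lines
    return [f"{pad}{f.repetition} {f.type} {f.name};"]


if __name__ == "__main__":
    for schema in (array_of("tags", "text"),
                   map_of("attrs", "text", "int4"),
                   union_of("payload", ["int4", "text"])):
        print("\n".join(render(schema)))
```

The sketch also shows why a union is the more restricted case: the record 
schema alone allows several optional branches to be set at once, so the 
constraint has to live outside the schema.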

What do you think about them? I'm open to any ideas. As I mentioned before, I 
just want to find the best way.

Best regards,
Hyunsik

> Add support for nested schemas and non-scalar types
> ---------------------------------------------------
>
>                 Key: TAJO-710
>                 URL: https://issues.apache.org/jira/browse/TAJO-710
>             Project: Tajo
>          Issue Type: New Feature
>          Components: data type
>            Reporter: David Chen
>            Assignee: David Chen
>
> Add support for nested schemas and non-scalar types (maps, arrays, enums, and 
> unions). Here are some ways other systems handle nested schemas:
>  * Pig and Hive use complex data types, such as bags, structs, arrays, etc.
>  * Impala doesn't support nested schemas or non-scalar data types 
> (http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_unsupported.html)
>  and disallows complex types in their Parquet support 
> (http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_parquet.html).
>  * Presto also does not support non-scalar types 
> (http://prestodb.io/docs/current/language/types.html)
> From the discussion in TAJO-30:
> {quote}
> I have a plan for nested schemas. Currently, Tajo only supports a flat 
> schema, like a relational DBMS, so extending Tajo to a nested data model 
> will not break compatibility.
> I'm thinking that Tajo should adopt the Parquet data model (= protobuf or 
> BigQuery). When considering a nested data model, I had two main points in 
> mind, and the Parquet data model satisfies both. The first point is the 
> processing model on nested data. The Parquet data model is the same as that 
> of BigQuery, and BigQuery has already defined a concrete processing model, 
> including flattening, cross products on repeated fields, and aggregation on 
> repeated fields [1][2]. The second point is the file format. Parquet is a 
> native file format for this model and already includes an efficient record 
> assembly method. Besides, Parquet is already mature and widely used in many 
> systems.
> [1] http://research.google.com/pubs/pub36632.html
> [2] https://developers.google.com/bigquery/docs/data
> I think we need three stages for this work. First, we can start with a small 
> change to improve our schema system. Then, we will add a physical operator 
> that simply flattens one nested row into a number of flat rows. Finally, we 
> will solve query optimization issues such as projection/filter push-down on 
> nested schemas and will add physical operators that process nested rows 
> directly.
> If you have any ideas, feel free to share them with us.
> Thanks,
> Hyunsik
> {quote}
> This ticket may need to be broken up into multiple sub-tasks. Each sub-task 
> will involve defining an extension to the query language to support the data 
> type, implementing the new data type, then adding support for the data type 
> in each of the storage types. I have opened tickets for each of these four 
> tasks, but not as subtasks, because each of them will very likely have 
> subtasks of its own:
>  * TAJO-721: Adding support for nested records
>  * TAJO-722: Adding support for maps
>  * TAJO-723: Adding support for array
>  * TAJO-724: Adding support for unions
> Adding support for the enum type can be considered, but it is lower 
> priority than the other four complex types. Neither Hive nor Pig currently 
> has an enum type (even though storage formats such as Avro and Parquet do) 
> and, I believe, both simply convert enum values to strings.
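
As a rough illustration of the second stage in the quoted TAJO-30 plan 
(flattening one nested row into a number of flat rows), here is a minimal 
Python sketch. It is not Tajo code: the function name flatten_row is 
hypothetical, rows are modeled as plain dicts, repeated fields as lists of 
scalar values, and only one level of nesting is handled. Emitting None for an 
empty repeated field is an assumption made so that the parent row is kept.

```python
from itertools import product


def flatten_row(row):
    """Expand every list-valued column and return the cross product of
    their values as a list of flat rows (one dict per combination)."""
    names = list(row)
    choices = []
    for name in names:
        value = row[name]
        if isinstance(value, list):
            # Assumption: an empty repeated field contributes a single None
            # so that the parent row is not dropped.
            choices.append(value if value else [None])
        else:
            # A scalar column has exactly one choice.
            choices.append([value])
    return [dict(zip(names, combo)) for combo in product(*choices)]


if __name__ == "__main__":
    nested = {"id": 1, "tags": ["a", "b"], "scores": [10, 20]}
    for flat in flatten_row(nested):
        print(flat)
```

With two repeated fields of two values each, the example row expands into 
four flat rows, which corresponds to the cross product on repeated fields 
mentioned in the quoted discussion.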



--
This message was sent by Atlassian JIRA
(v6.2#6252)
