[jira] [Updated] (TAJO-710) Add support for nested schemas and complex types

David Chen (JIRA) Tue, 25 Mar 2014 13:24:53 -0700

     [ 
https://issues.apache.org/jira/browse/TAJO-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


David Chen updated TAJO-710:
----------------------------

    Description: 
Add support for nested schemas and non-scalar types (maps, arrays, enums, and 
unions). Here are some ways other systems handle nested schemas:

 * Pig and Hive use complex data types, such as bags, structs, and arrays.
 * Impala doesn't support nested schemas or non-scalar data types and simply 
flattens nested schemas.

From the discussion in TAJO-30:

{quote}
I have a plan for nested schemas. Currently, Tajo only supports a flat schema, 
like a relational DBMS. So even if Tajo is extended to a nested data model, it 
will not break compatibility.

I'm thinking that Tajo should adopt the Parquet data model (= protobuf or 
BigQuery). When I considered a nested data model, I had two main points in 
mind, and the Parquet data model satisfies both. The first point is the 
processing model for nested data. The Parquet data model is the same as that 
of BigQuery, and BigQuery has already defined a processing model that includes 
flattening, cross products on repeated fields, and aggregation on repeated 
fields [1][2]. The second point is the file format. Parquet is a native file 
format for this model and already includes an efficient record assembly 
method. Besides, Parquet is already mature and widely used in many systems.

[1] http://research.google.com/pubs/pub36632.html
[2] https://developers.google.com/bigquery/docs/data

I think we need three stages for this work. First, we can start with a small 
change to improve our schema system. Then, we will add a physical operator 
that simply flattens one nested row into a number of flat rows. Finally, we 
will address query optimization issues, such as projection/filter push-down 
on nested schemas, and add physical operators that process nested rows 
directly.

If you have any ideas, feel free to share them with us.

Thanks,
Hyunsik
{quote}
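
To make the second stage concrete, here is a minimal sketch of what flattening 
one nested row into several flat rows could look like. It is written in Python 
for brevity and is not Tajo code: the function name flatten, the dotted column 
naming, and the NULL handling for empty repeated fields are all assumptions 
made for illustration. Independent repeated fields are combined by cross 
product, in the spirit of the BigQuery processing model mentioned above.

```python
from itertools import product


def flatten(record, prefix=""):
    """Flatten one nested record (a dict) into a list of flat rows (dicts).

    Assumptions for this sketch: dict values are nested records, list values
    are repeated fields, and everything else is a scalar. Column names are
    dotted paths such as "name.first".
    """
    # Each field yields a list of partial rows; the cross product of those
    # lists gives the flat rows for the whole record.
    partials = []
    for name, value in record.items():
        path = prefix + name
        if isinstance(value, dict):
            partials.append(flatten(value, path + "."))
        elif isinstance(value, list):
            rows = []
            for item in value:
                if isinstance(item, dict):
                    rows.extend(flatten(item, path + "."))
                else:
                    rows.append({path: item})
            # An empty repeated field keeps the parent row with a NULL value.
            partials.append(rows or [{path: None}])
        else:
            partials.append([{path: value}])

    flat_rows = []
    for combination in product(*partials):
        row = {}
        for part in combination:
            row.update(part)
        flat_rows.append(row)
    return flat_rows


# Hypothetical nested row with two independent repeated fields.
nested_row = {
    "id": 1,
    "name": {"first": "a"},
    "tags": ["x", "y"],
    "links": [{"url": "u1"}, {"url": "u2"}],
}
flat_rows = flatten(nested_row)
```

With two values in tags and two entries in links, the cross product yields 
four flat rows. A real operator would also have to decide how to name the 
columns of an empty repeated record, which this sketch glosses over.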

This ticket may need to be broken up into multiple sub-tasks:

 * Extending the query language to support nested schemas and non-scalar types
 * Adding support for nested records
 * Adding support for maps
 * Adding support for enums
 * Adding support for arrays
 * Adding support for unions
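
As a rough illustration of the type system these sub-tasks imply, the sketch 
below models nested records, maps, enums, arrays, and unions as a small set of 
Python classes. None of these class names come from Tajo, and the scalar type 
names used in the example are only placeholders. The is_flat helper expresses 
the current restriction that a schema contains only scalar columns.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scalar:
    name: str  # placeholder scalar type name, e.g. "INT4" or "TEXT"


@dataclass(frozen=True)
class Array:
    element: "DataType"


@dataclass(frozen=True)
class Map:
    key: "DataType"
    value: "DataType"


@dataclass(frozen=True)
class Enum:
    symbols: Tuple[str, ...]


@dataclass(frozen=True)
class UnionOf:
    options: Tuple["DataType", ...]


@dataclass(frozen=True)
class Field:
    name: str
    type: "DataType"


@dataclass(frozen=True)
class Record:
    fields: Tuple[Field, ...]


# Every type a column may have in this hypothetical model.
DataType = Union[Scalar, Array, Map, Enum, UnionOf, Record]


def is_flat(schema: Record) -> bool:
    """Return True if every top-level field of the schema is a scalar."""
    return all(isinstance(field.type, Scalar) for field in schema.fields)


# A flat schema and a nested one, both made up for this example.
flat_schema = Record((Field("id", Scalar("INT4")), Field("name", Scalar("TEXT"))))
nested_schema = Record((
    Field("id", Scalar("INT4")),
    Field("tags", Array(Scalar("TEXT"))),
    Field("address", Record((Field("city", Scalar("TEXT")),))),
    Field("attrs", Map(Scalar("TEXT"), Scalar("TEXT"))),
))
```

Here is_flat(flat_schema) holds while is_flat(nested_schema) does not, which 
is the distinction that an extended schema system would need to represent.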

  was:
Add support for nested schemas and complex types (maps, arrays, enums, and 
unions). Here are some ways other systems handle nested schemas:

 * Pig and Hive use complex data types, such as bags, structs, and arrays.
 * Impala doesn't support nested schemas or non-scalar data types and simply 
flattens nested schemas.

From the discussion in TAJO-30:

{quote}
I have a plan for nested schemas. Currently, Tajo only supports a flat schema, 
like a relational DBMS. So even if Tajo is extended to a nested data model, it 
will not break compatibility.

I'm thinking that Tajo should adopt the Parquet data model (= protobuf or 
BigQuery). When I considered a nested data model, I had two main points in 
mind, and the Parquet data model satisfies both. The first point is the 
processing model for nested data. The Parquet data model is the same as that 
of BigQuery, and BigQuery has already defined a processing model that includes 
flattening, cross products on repeated fields, and aggregation on repeated 
fields [1][2]. The second point is the file format. Parquet is a native file 
format for this model and already includes an efficient record assembly 
method. Besides, Parquet is already mature and widely used in many systems.

[1] http://research.google.com/pubs/pub36632.html
[2] https://developers.google.com/bigquery/docs/data

I think we need three stages for this work. First, we can start with a small 
change to improve our schema system. Then, we will add a physical operator 
that simply flattens one nested row into a number of flat rows. Finally, we 
will address query optimization issues, such as projection/filter push-down 
on nested schemas, and add physical operators that process nested rows 
directly.

If you have any ideas, feel free to share them with us.

Thanks,
Hyunsik
{quote}

This ticket may need to be broken up into multiple sub-tasks:

 * Extending the query language to support nested schemas and non-scalar types
 * Adding support for nested records
 * Adding support for maps
 * Adding support for enums
 * Adding support for arrays
 * Adding support for unions


> Add support for nested schemas and complex types
> ------------------------------------------------
>
>                 Key: TAJO-710
>                 URL: https://issues.apache.org/jira/browse/TAJO-710
>             Project: Tajo
>          Issue Type: New Feature
>          Components: data type
>            Reporter: David Chen
>
> Add support for nested schemas and non-scalar types (maps, arrays, enums, and 
> unions). Here are some ways other systems handle nested schemas:
>  * Pig and Hive use complex data types, such as bags, structs, and arrays.
>  * Impala doesn't support nested schemas or non-scalar data types and simply 
> flattens nested schemas.
> From the discussion in TAJO-30:
> {quote}
> I have a plan for nested schemas. Currently, Tajo only supports a flat 
> schema, like a relational DBMS. So even if Tajo is extended to a nested data 
> model, it will not break compatibility.
> I'm thinking that Tajo should adopt the Parquet data model (= protobuf or 
> BigQuery). When I considered a nested data model, I had two main points in 
> mind, and the Parquet data model satisfies both. The first point is the 
> processing model for nested data. The Parquet data model is the same as that 
> of BigQuery, and BigQuery has already defined a processing model that 
> includes flattening, cross products on repeated fields, and aggregation on 
> repeated fields [1][2]. The second point is the file format. Parquet is a 
> native file format for this model and already includes an efficient record 
> assembly method. Besides, Parquet is already mature and widely used in many 
> systems.
> [1] http://research.google.com/pubs/pub36632.html
> [2] https://developers.google.com/bigquery/docs/data
> I think we need three stages for this work. First, we can start with a small 
> change to improve our schema system. Then, we will add a physical operator 
> that simply flattens one nested row into a number of flat rows. Finally, we 
> will address query optimization issues, such as projection/filter push-down 
> on nested schemas, and add physical operators that process nested rows 
> directly.
> If you have any ideas, feel free to share them with us.
> Thanks,
> Hyunsik
> {quote}
> This ticket may need to be broken up into multiple sub-tasks:
>  * Extending the query language to support nested schemas and non-scalar types
>  * Adding support for nested records
>  * Adding support for maps
>  * Adding support for enums
>  * Adding support for arrays
>  * Adding support for unions



--
This message was sent by Atlassian JIRA
(v6.2#6252)
