[https://issues.apache.org/jira/browse/ARROW-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331162#comment-17331162]

Dr. Christoph Jung edited comment on ARROW-4936 at 4/24/21, 7:14 AM:
---------------------------------------------------------------------

I'm willing to volunteer for this one.

[https://github.com/drcgjung]
{quote}Generally interested in contributing to the datafusion/ballista development stream.
 ~5 years of professional experience in Apache Spark (RDD & DataFrame) for large-scale measurement data.
 ~4 years of open source contributions to JBoss (now "WildFly") a while back.
{quote}
Some obligatory questions:
 * Java API = [https://github.com/apache/parquet-mr] ?

 * Parquet Format = [https://github.com/apache/parquet-format] ?

 * Is it sufficient to restrict to the 2.6.0 format (Java API >= 1.11)? There was never a Java API release using 2.5.0, and [https://github.com/apache/arrow/blob/master/rust/parquet/README.md] refers to 2.6.0.

 * "arrow-testing" = [https://github.com/apache/arrow-testing] ?
 * New folder there, data/parquet/types ?
 * Where should the Java generator project go, also there?

 * Is it better to have a single Parquet file with all the types, or one file per basic type (each can have many derived ones, see below)?

 * Would it be good to include the format version in the test Parquet file name (for later additions when rust/parquet upgrades the format)?

 * I count 14 "plain" logical, parameterized types.

 * I count 29 relevant basic type instantiations; each could be represented as mandatory and as optional (=> 58 test types):
 ** string
 ** enum
 ** uuid
 ** int_8, ... uint_64
 ** decimal_32, decimal_64 (maybe additional precision tests?)
 ** date
 ** time_utc_millis, time_utc_micros, time_utc_nanos, time_local_millis, 
time_local_micros, time_local_nanos
 ** timestamp_utc_millis, ... timestamp_local_nanos
 ** interval
 ** json, bson

 * Nested types could be derived in arbitrary combinations, but I guess it's OK to have one LIST and two MAP types per basic test type (one with the basic type as the required key and one as the value). Again, each nested type could be mandatory or optional. (=> 58*2 + 29*2 + 58*2 = 290 nested test types)

 * Two PRs will be necessary because two repositories are involved (arrow hard-links to a version in the arrow-testing repo). The arrow PR will have to change the version link to the arrow-testing repo (which may not be safe for other arrow subprojects). Is that ok?
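To sanity-check the counts above, here is a small Python sketch (the column names are purely illustrative, not a proposed naming scheme) that enumerates the 29 basic instantiations and reproduces the 58 and 290 totals:

```python
from itertools import product

# The 29 basic logical-type instantiations listed above (names illustrative).
BASIC_TYPES = (
    ["string", "enum", "uuid"]
    + [f"{sign}int_{bits}" for sign in ("", "u") for bits in (8, 16, 32, 64)]
    + ["decimal_32", "decimal_64", "date"]
    + [f"{kind}_{zone}_{unit}"
       for kind in ("time", "timestamp")
       for zone in ("utc", "local")
       for unit in ("millis", "micros", "nanos")]
    + ["interval", "json", "bson"]
)
assert len(BASIC_TYPES) == 29

# Each basic type appears once as required (mandatory) and once as optional.
FLAT = [f"{name}_{rep}"
        for name, rep in product(BASIC_TYPES, ("required", "optional"))]
assert len(FLAT) == 58

# Nested cases: one LIST over each flat variant, one MAP with the basic type
# as required key, one MAP with each flat variant as value; the nested field
# itself is again mandatory or optional.
list_cases = len(FLAT) * 2              # 58 * 2 = 116
map_key_cases = len(BASIC_TYPES) * 2    # 29 * 2 = 58
map_value_cases = len(FLAT) * 2         # 58 * 2 = 116
assert list_cases + map_key_cases + map_value_cases == 290
```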

Thanks for considering me ;)

> [Rust] Add parquet test file for all supported types in 2.5.0 format
> --------------------------------------------------------------------
>
>                 Key: ARROW-4936
>                 URL: https://issues.apache.org/jira/browse/ARROW-4936
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust, Rust - DataFusion
>    Affects Versions: 0.13.0
>            Reporter: Andy Grove
>            Priority: Major
>              Labels: beginner
>
> Suggested A/C
>  * Generate a Parquet file using the Java API and check it into the 
> arrow-testing repo
>  * Write unit tests in the parquet crate for reading all types
>  * Write unit tests in the datafusion crate for reading all types


