[ https://issues.apache.org/jira/browse/ARROW-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331162#comment-17331162 ]
Dr. Christoph Jung edited comment on ARROW-4936 at 4/24/21, 7:14 AM:
---------------------------------------------------------------------

I'm willing to volunteer for this one. [https://github.com/drcgjung]

{quote}Generally interested in contributing to the datafusion/ballista development stream. ~5 years of professional experience with Apache Spark (RDD & DataFrame) for large-scale measurement data. ~4 years of open source contributions to JBoss (aka "Wildfly") a while back.{quote}

Some obligatory questions:
* Java API = [https://github.com/apache/parquet-mr] ?
* Parquet Format = [https://github.com/apache/parquet-format] ?
* Is it sufficient to restrict to the 2.6.0 format (Java API >= 1.11)? There was no Java API release using 2.5.0, and [https://github.com/apache/arrow/blob/master/rust/parquet/README.md] refers to 2.6.0.
* "arrow-testing" = [https://github.com/apache/arrow-testing] ?
* New folder there, data/parquet/types?
* Where to put the Java generator project — also there?
* Is it better to have a single Parquet file with all the types, or one Parquet file per basic type (there can be many derived ones, see below)?
* Would it be good to include the format version in the test Parquet file name (for later additions when rust/parquet upgrades the format)?
* I count 14 "plain" logical, parameterized types.
* I count 29 relevant basic type instantiations; each could be represented as mandatory and optional (=> 58 test types):
** string
** enum
** uuid
** int_8, ... uint_64
** decimal_32, decimal_64 (maybe additional precision tests?)
** date
** time_utc_millis, time_utc_micros, time_utc_nanos, time_local_millis, time_local_micros, time_local_nanos
** timestamp_utc_millis, ... timestamp_local_nanos
** interval
** json, bson
* Nested types could be derived in arbitrary combinations, but I guess it's OK to have one LIST and two MAP types per basic test type (one as required key and one as value). Again, each nested type could be mandatory and optional.
(=> 58*2 + 29*2 + 58*2 = 290 nested test types)
* Two PRs will be necessary because of the two repositories involved (arrow hard-links to a version in the arrow-testing repo). The arrow PR will have to change the version link to the arrow-testing repo (which is maybe not safe for other arrow subprojects). Is that OK?

Thanks if/for considering me ;)

was (Author: doc_schorsch)

> [Rust] Add parquet test file for all supported types in 2.5.0 format
> --------------------------------------------------------------------
>
>                 Key: ARROW-4936
>                 URL: https://issues.apache.org/jira/browse/ARROW-4936
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust, Rust - DataFusion
>    Affects Versions: 0.13.0
>            Reporter: Andy Grove
>            Priority: Major
>              Labels: beginner
>
> Suggested A/C
> * Generate a Parquet file using the Java API and check it into the arrow-testing repo
> * Write unit tests in the parquet crate for reading all types
> * Write unit tests in the datafusion crate for reading all types

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
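The type matrix proposed in the comment above can be sanity-checked with a short script. This is a minimal sketch for illustration only: the type names come from the bullet list and the multipliers mirror the comment's own count formula (58*2 + 29*2 + 58*2 = 290), not any Parquet API.

```python
# Enumerate the 29 basic logical-type instantiations from the bullet list
# and reproduce the flat/nested test-type counts proposed in the comment.
BASIC_TYPES = (
    ["string", "enum", "uuid"]
    + [f"{sign}int_{bits}" for sign in ("", "u") for bits in (8, 16, 32, 64)]
    + ["decimal_32", "decimal_64", "date"]
    + [f"time_{zone}_{unit}" for zone in ("utc", "local")
       for unit in ("millis", "micros", "nanos")]
    + [f"timestamp_{zone}_{unit}" for zone in ("utc", "local")
       for unit in ("millis", "micros", "nanos")]
    + ["interval", "json", "bson"]
)

def count_test_types():
    basic = len(BASIC_TYPES)      # 29 basic instantiations
    flat = basic * 2              # each as mandatory and optional => 58
    list_cases = flat * 2         # LIST of T, list itself mandatory/optional
    map_key_cases = basic * 2     # MAP with T as required key
    map_value_cases = flat * 2    # MAP with T as value
    nested = list_cases + map_key_cases + map_value_cases
    return basic, flat, nested

print(count_test_types())  # -> (29, 58, 290)
```

This matches the comment's arithmetic: 58 flat test types and 290 nested ones under the one-LIST-plus-two-MAPs convention.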