[
https://issues.apache.org/jira/browse/ARROW-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17662145#comment-17662145
]
Rok Mihevc commented on ARROW-5123:
-----------------------------------
This issue has been migrated to [issue
#21608|https://github.com/apache/arrow/issues/21608] on GitHub. Please see the
[migration documentation|https://github.com/apache/arrow/issues/14542] for
further details.
> [Rust] derive RecordWriter from struct definitions
> --------------------------------------------------
>
> Key: ARROW-5123
> URL: https://issues.apache.org/jira/browse/ARROW-5123
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Rust
> Reporter: Xavier Lange
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
>
> Time Spent: 14.5h
> Remaining Estimate: 0h
>
> Migrated from a previous GitHub issue (which saw a lot of comments but at a
> rough transition time in the project):
> https://github.com/sunchao/parquet-rs/pull/197
>
> Goal
> ===
> Writing many columns to a file is a chore. If you can put your values into a
> struct which mirrors the schema of your file, this
> `derive(ParquetRecordWriter)` will write out all the fields, in the order in
> which they are defined, to a row_group.
> How to Use
> ===
> ```
> extern crate parquet;
> #[macro_use] extern crate parquet_derive;
> #[derive(ParquetRecordWriter)]
> struct ACompleteRecord<'a> {
>     pub a_bool: bool,
>     pub a_str: &'a str,
> }
> ```
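> With the derive in place, writing a batch of records means opening a file
> writer and handing each row group a slice of the struct (the derive
> implements the `RecordWriter` trait, shown in the next section, for slices).
> A minimal sketch against the parquet-rs API of that era; the schema string,
> file name, and property choices here are illustrative assumptions:
> ```
> use std::fs::File;
> use std::rc::Rc;
> use parquet::file::properties::WriterProperties;
> use parquet::file::writer::{FileWriter, SerializedFileWriter};
> use parquet::schema::parser::parse_message_type;
>
> // A schema mirroring ACompleteRecord's fields, in declaration order.
> let schema = Rc::new(parse_message_type(
>     "message a_complete_record {
>         required boolean a_bool;
>         required binary a_str (UTF8);
>     }",
> ).unwrap());
> let props = Rc::new(WriterProperties::builder().build());
> let file = File::create("example.parquet").unwrap();
> let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
>
> let records = vec![ACompleteRecord { a_bool: true, a_str: "hello" }];
> let mut row_group = writer.next_row_group().unwrap();
> (&records[..]).write_to_row_group(&mut row_group);
> writer.close_row_group(row_group).unwrap();
> writer.close().unwrap();
> ```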
> RecordWriter trait
> ===
> This is the new trait which `parquet_derive` will implement for your structs.
> ```
> use super::RowGroupWriter;
> pub trait RecordWriter<T> {
>     fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
> }
> ```
> How does it work?
> ===
> The `parquet_derive` crate adds code-generation functionality to the Rust
> compiler: the code generation takes Rust syntax and emits additional syntax.
> This macro expansion works on Rust 1.15+ stable. This is a dynamic plugin,
> loaded by the machinery in cargo. Users don't have to do any special
> `build.rs` steps or anything like that; it's automatic once they include
> `parquet_derive` in their project. The `parquet_derive/Cargo.toml` manifest
> has a section saying as much:
> ```
> [lib]
> proc-macro = true
> ```
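> For context, the rest of that manifest pulls in the parsing and templating
> crates described below; a minimal sketch (the version numbers are
> illustrative assumptions, not from the issue):
> ```
> [dependencies]
> syn = "1.0"
> quote = "1.0"
> ```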
> The rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to
> the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The
> `syn` crate parses the struct from a string representation into an AST (a
> recursive enum value). The AST contains all the values I care about when
> generating a `RecordWriter` impl:
> - the name of the struct
> - the lifetime variables of the struct
> - the fields of the struct
> The fields of the struct are translated from AST to a flat `FieldInfo`
> struct. It has the bits I care about for writing a column: `field_name`,
> `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.
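> A skeletal sketch of that entry point (using current syn conventions for
> illustration; the real field handling is summarized in comments):
> ```
> extern crate proc_macro;
> use proc_macro::TokenStream;
> use syn::{parse_macro_input, Data, DeriveInput};
>
> #[proc_macro_derive(ParquetRecordWriter)]
> pub fn parquet_record_writer(input: TokenStream) -> TokenStream {
>     let input = parse_macro_input!(input as DeriveInput);
>     let struct_name = input.ident;      // the name of the struct
>     let generics = input.generics;      // carries the lifetime variables
>     let fields = match input.data {     // the fields of the struct
>         Data::Struct(s) => s.fields,
>         _ => panic!("#[derive(ParquetRecordWriter)] only supports structs"),
>     };
>     // 1. Translate each field into a flat FieldInfo (field_name,
>     //    field_lifetime, field_type, is_option, column_writer_variant).
>     // 2. Emit one column-writing snippet per FieldInfo.
>     // 3. Assemble the RecordWriter impl with quote! (see below).
>     unimplemented!()
> }
> ```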
> The code then does the equivalent of templating to build the `RecordWriter`
> implementation. The templating functionality is provided by the `quote`
> crate. At a high-level the template for `RecordWriter` looks like:
> ```
> impl RecordWriter for $struct_name {
>     fn write_to_row_group(..) {
>         $({
>             $column_writer_snippet
>         })
>     }
> }
> ```
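> In actual `quote!` syntax that repetition looks roughly like this (a sketch;
> `struct_name` is the parsed `syn::Ident` and `column_writer_snippets` a
> `Vec` of per-field token streams, both illustrative names):
> ```
> // Inside parquet_record_writer, after the fields have been translated:
> let expanded = quote! {
>     impl<'a> RecordWriter<#struct_name> for &'a [#struct_name] {
>         fn write_to_row_group(
>             &self,
>             row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
>         ) {
>             #(
>                 { #column_writer_snippets }
>             )*
>         }
>     }
> };
> expanded.into()
> ```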
> This template is then added under the struct definition, ending up something
> like:
> ```
> struct MyStruct {
> }
> impl RecordWriter for MyStruct {
>     fn write_to_row_group(..) {
>         {
>             write_col_1();
>         };
>         {
>             write_col_2();
>         }
>     }
> }
> ```
> and finally _THIS_ is the code passed to rustc. It's just code now, fully
> expanded and standalone. If a user ever changes their `struct MyValue`
> definition, the `ParquetRecordWriter` impl will be regenerated. There are no
> intermediate values to version control or worry about.
> Viewing the Derived Code
> ===
> To see the generated code before it's compiled, it is very useful to install
> `cargo expand` ([more info on
> GitHub](https://github.com/dtolnay/cargo-expand)); then you can do:
> ```
> cd $WORK_DIR/parquet-rs/parquet_derive_test
> cargo expand --lib > ../temp.rs
> ```
> then you can dump the contents:
> ```
> struct DumbRecord {
>     pub a_bool: bool,
>     pub a2_bool: bool,
> }
> impl RecordWriter<DumbRecord> for &[DumbRecord] {
>     fn write_to_row_group(
>         &self,
>         row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
>     ) {
>         let mut row_group_writer = row_group_writer;
>         {
>             let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
>             let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
>             if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
>                 column_writer
>             {
>                 typed.write_batch(&vals[..], None, None).unwrap();
>             }
>             row_group_writer.close_column(column_writer).unwrap();
>         };
>         {
>             let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
>             let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
>             if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
>                 column_writer
>             {
>                 typed.write_batch(&vals[..], None, None).unwrap();
>             }
>             row_group_writer.close_column(column_writer).unwrap();
>         }
>     }
> }
> ```
> Now I need to write out all the combinations of types we support and make
> sure it writes out data.
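> That enumeration boils down to a mapping from each supported Rust field type
> to the matching `ColumnWriter` variant. A sketch of the lookup (the variant
> names are as used in parquet-rs at the time; the helper itself and its
> string-based signature are hypothetical, for illustration only):
> ```
> // Hypothetical helper: maps a field's Rust type (as written in the struct)
> // to the ColumnWriter variant the generated snippet must match on.
> fn column_writer_variant(field_type: &str) -> &'static str {
>     match field_type {
>         "bool" => "ColumnWriter::BoolColumnWriter",
>         "i32" => "ColumnWriter::Int32ColumnWriter",
>         "i64" => "ColumnWriter::Int64ColumnWriter",
>         "f32" => "ColumnWriter::FloatColumnWriter",
>         "f64" => "ColumnWriter::DoubleColumnWriter",
>         "&str" | "String" => "ColumnWriter::ByteArrayColumnWriter",
>         other => panic!("unsupported field type for derive: {}", other),
>     }
> }
> ```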
> Procedural Macros
> ===
> The `parquet_derive` crate can ONLY export the derivation functionality. No
> traits, nothing else. The derive crate cannot host test cases. It's
> effectively a "dummy" crate which is only used by the compiler, never by
> runtime code.
> The parent crate cannot use the derivation functionality, which is important
> because it means test code cannot be in the parent crate. This forces us to
> have a third crate, `parquet_derive_test`.
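> For illustration, the third crate's manifest just pulls in both of the
> others (the paths are assumptions based on the layout described here):
> ```
> [dependencies]
> parquet = { path = "../parquet" }
> parquet_derive = { path = "../parquet_derive" }
> ```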
> I'm open to being wrong on any one of these finer points. I had to bang on
> this for a while to get it to compile!
> Potentials For Better Design
> ===
> - [x] Recursion could be limited by generating the code as "snippets"
> instead of one big `quote!` AST generator. Or so I think. It might be nicer
> to push generating each column's writing code into another loop.
> - [x] ~~It would be nicer if I didn't have to be so picky about data going
> into the `write_batch` function. Is it possible we could make a version of
> the function which accepts `Into<DataType>` or similar? This would greatly
> simplify this derivation code as it would not need to enumerate all the
> supported types. Something like `write_generic_batch(&[impl Into<DataType>])`
> would be neat.~~ (not tackling in this generation of the plugin; a
> hypothetical sketch follows this list)
> - [x] ~~Another idea for improving column writing: could we have a write
> function for `Iterator`s? I already have a `Vec<DumbRecord>`; if I could just
> write a mapping for accessing the one value, we could skip the whole
> intermediate vec for `write_batch`. That should have some significant memory
> advantages.~~ (not tackling in this generation of the plugin, it's a bigger
> parquet-rs enhancement)
> - [X] ~~It might be worthwhile to derive a parquet schema directly from a
> struct definition. That should stamp out opportunities for type errors.~~
> (moved to #203)
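> For the second item above, a hedged sketch of what such a generic entry
> point might look like (hypothetical, not part of parquet-rs; `D::T` is the
> physical type associated with a `DataType` implementation):
> ```
> use parquet::column::writer::ColumnWriterImpl;
> use parquet::data_type::DataType;
>
> // Hypothetical: a write_batch variant that accepts any value convertible
> // into the column's physical type, so the derive would not need to
> // enumerate every supported type.
> fn write_generic_batch<D, V>(
>     typed: &mut ColumnWriterImpl<D>,
>     vals: &[V],
> ) -> parquet::errors::Result<usize>
> where
>     D: DataType,
>     V: Clone + Into<D::T>,
> {
>     // Convert up front, then delegate to the existing typed write_batch.
>     let converted: Vec<D::T> = vals.iter().cloned().map(Into::into).collect();
>     typed.write_batch(&converted, None, None)
> }
> ```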
> Status
> ===
> I have successfully integrated this work with my own data exporter (takes
> postgres/couchdb and outputs a single parquet file).
> I think this code is worth including in the project, with the caveat that it
> only generates simplistic `RecordWriter`s. As people start to use it, we can
> add code generation for more complex, nested structs.