nevi-me opened a new pull request #7319:
URL: https://github.com/apache/arrow/pull/7319


   **Note**: I started making changes to #6785, and ended up deviating a lot. 
   ___
   
   This is a draft implementation of an Arrow writer for Parquet. It supports 
the following (test coverage is not yet complete):
   
   * writing primitives except for booleans and binary
   * nested structs
   * null values (via definition levels)
   
   It does not yet support:
   
   - [ ] Boolean arrays (have to be handled differently from numeric values)
   - [ ] Binary arrays
   - [ ] List arrays (still figuring out deeply-nested repetition levels)
   - [ ] Dictionary arrays
   - [ ] Union arrays (are they even possible?)
   
   So far I have only added one test, which writes a file with a nested 
schema. I verified the output by reading it with pyarrow.
   
   ```jupyter
   # schema of test_complex.parquet
   
   a: int32 not null
   b: int32
   c: struct<d: double, e: struct<f: float>> not null
     child 0, d: double
     child 1, e: struct<f: float>
         child 0, f: float
   ```
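   
   For reference, here is a minimal sketch of how I understand the maximum 
definition level of each leaf column in this schema: every nullable field on 
the path from the root to the leaf adds one level, and required fields add 
nothing. `Node`, `max_definition_levels` and `test_complex_schema` are 
hypothetical names used for illustration only. They are not types or functions 
from the arrow or parquet crates, and this is not the code in the PR.

```rust
/// Hypothetical, simplified schema node (not the arrow crate's `Field`).
#[derive(Debug, Clone)]
pub struct Node {
    pub name: String,
    pub nullable: bool,
    pub children: Vec<Node>,
}

impl Node {
    /// A primitive column with no children.
    pub fn leaf(name: &str, nullable: bool) -> Self {
        Node { name: name.to_string(), nullable, children: vec![] }
    }

    /// A struct column with child fields.
    pub fn group(name: &str, nullable: bool, children: Vec<Node>) -> Self {
        Node { name: name.to_string(), nullable, children }
    }
}

/// Returns (dotted path, max definition level) for every leaf, in schema order.
pub fn max_definition_levels(fields: &[Node]) -> Vec<(String, i16)> {
    fn walk(node: &Node, prefix: &str, parent_level: i16, out: &mut Vec<(String, i16)>) {
        let path = if prefix.is_empty() {
            node.name.clone()
        } else {
            format!("{}.{}", prefix, node.name)
        };
        // A nullable field adds one level; a required field adds none.
        let level = parent_level + if node.nullable { 1 } else { 0 };
        if node.children.is_empty() {
            out.push((path, level));
        } else {
            for child in &node.children {
                walk(child, &path, level, out);
            }
        }
    }

    let mut out = Vec::new();
    for field in fields {
        walk(field, "", 0, &mut out);
    }
    out
}

/// The schema of test_complex.parquet shown above, with nullability taken
/// from the pyarrow output (`not null` means required).
pub fn test_complex_schema() -> Vec<Node> {
    vec![
        Node::leaf("a", false),
        Node::leaf("b", true),
        Node::group(
            "c",
            false,
            vec![
                Node::leaf("d", true),
                Node::group("e", true, vec![Node::leaf("f", true)]),
            ],
        ),
    ]
}
```

   Under these assumptions, `max_definition_levels(&test_complex_schema())` 
gives level 0 for `a`, 1 for `b`, 1 for `c.d` and 2 for `c.e.f`.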
   
   This PR potentially addresses:
   
   * https://issues.apache.org/jira/browse/ARROW-8289
   * https://issues.apache.org/jira/browse/ARROW-8423
   * https://issues.apache.org/jira/browse/ARROW-8424
   * https://issues.apache.org/jira/browse/ARROW-8425
   
   I would like to propose either opening new JIRAs for the incomplete items 
listed above, or renaming the last 3 JIRAs in this list.
   
   ___
   
   **Help Needed**
   
   I'm implementing the definition and repetition levels from first 
principles, based on an old Parquet post on the Twitter engineering blog. It's 
likely that I'm getting some concepts wrong, so I would appreciate help with:
   
   * Checking if my logic is correct
   * Guidance or suggestions on how to more efficiently extract levels from 
arrays
   * Adding tests. I suspect we will need a lot of them. So far we only test 
writing 1 batch, so I don't know how paging would behave when writing a large 
enough file
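   
   To make the first two points concrete, this is a sketch of the 
straightforward (and probably not the most efficient) way I understand 
per-row definition levels for a leaf nested inside structs: walk the path 
from the root to the leaf and count the nullable fields that are valid, 
stopping at the first null. `Level` and `definition_levels` are hypothetical 
names for illustration, validity is modelled as a plain `bool` slice rather 
than an Arrow bitmap, and this is not the code in the PR.

```rust
/// One field on the path from the root to a leaf column (hypothetical type).
pub struct Level<'a> {
    /// Whether the field is nullable in the schema.
    pub nullable: bool,
    /// Per-row validity: `true` means the slot is not null.
    /// Assumed to have at least `num_rows` entries.
    pub validity: &'a [bool],
}

/// Computes one definition level per row for the leaf at the end of `path`.
/// Lists are not handled, so there are no repetition levels here.
pub fn definition_levels(path: &[Level<'_>], num_rows: usize) -> Vec<i16> {
    (0..num_rows)
        .map(|row| {
            let mut level = 0i16;
            for node in path {
                if !node.nullable {
                    // Required fields are always defined and add no level.
                    continue;
                }
                if node.validity[row] {
                    level += 1;
                } else {
                    // The first null on the path fixes the level for this row.
                    break;
                }
            }
            level
        })
        .collect()
}
```

   For the `c.e.f` column of the test schema (`c` required, `e` and `f` 
nullable), three rows where `e` is valid in rows 0 and 2 and `f` is valid only 
in row 0 come out as levels `[2, 0, 1]`.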
   
   I also don't know whether the various encodings (dictionary, RLE, etc.) and 
compression are applied automatically, or whether we need to enable them 
explicitly.
   
   CC @sunchao @sadikovi @andygrove @paddyhoran 
   
   Might be of interest to @mcassels @maxburke



