alamb opened a new issue #216:
URL: https://github.com/apache/arrow-rs/issues/216


   *Note*: migrated from original JIRA: 
https://issues.apache.org/jira/browse/ARROW-8421
   
   This is the parent story. See subtasks for more information.
   
   Notes from [~wesm] :
   
   A couple of initial things to keep in mind:
    * Support writes of both nullable (OPTIONAL) and non-nullable (REQUIRED) 
fields
    * You can optimize the special case where a nullable field's data has no 
nulls
    * A good amount of code is required to convert the Arrow physical form of 
various logical types to the equivalent Parquet form; see 
[https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc] 
for details
    * It would be worth thinking up front about how dictionary-encoded data is 
handled on both the Arrow write and Arrow read paths. In parquet-cpp we 
initially discarded Arrow DictionaryArrays on write (e.g. casting Dictionary to 
dense String). Real-world need later forced me to revisit this (quite 
painfully) so that Arrow dictionaries survive roundtrips to the Parquet 
format, with better performance and memory use in both reads and 
writes. You can certainly do a dictionary-to-dense conversion like we did, but 
you may someday find yourselves doing the same painful refactor that I did to 
make dictionary writes and reads not only more efficient but also dictionary 
order preserving.
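   
   To make the first two points above concrete, here is a minimal sketch of how 
definition levels could be produced for a flat (non-nested) column, including 
the fast path for a nullable field with no nulls. `Int32Column` and 
`levels_and_values` are hypothetical names for illustration only; they are not 
part of the arrow or parquet crates, and a real writer would also have to 
handle nested types and repetition levels.

```rust
/// Hypothetical stand-in for an Arrow Int32 array: a dense value buffer plus
/// an optional validity mask (`true` = value present). Not a real Arrow type.
struct Int32Column {
    values: Vec<i32>,
    validity: Option<Vec<bool>>,
}

/// Returns (definition levels, values to write) for a flat column.
/// REQUIRED columns carry no definition levels, so `None` is returned.
fn levels_and_values(col: &Int32Column, nullable: bool) -> (Option<Vec<i16>>, Vec<i32>) {
    if !nullable {
        // REQUIRED: every slot is a value; no levels are needed.
        return (None, col.values.clone());
    }
    match &col.validity {
        // Fast path: OPTIONAL field whose data has no nulls. All levels are 1
        // and the value buffer is written unchanged.
        None => (Some(vec![1; col.values.len()]), col.values.clone()),
        Some(mask) if mask.iter().all(|&valid| valid) => {
            (Some(vec![1; col.values.len()]), col.values.clone())
        }
        // General path: level 0 marks a null; only non-null values are kept.
        Some(mask) => {
            let levels: Vec<i16> = mask.iter().map(|&valid| if valid { 1 } else { 0 }).collect();
            let values: Vec<i32> = col
                .values
                .iter()
                .zip(mask)
                .filter(|pair| *pair.1)
                .map(|(&v, _)| v)
                .collect();
            (Some(levels), values)
        }
    }
}

fn main() {
    let col = Int32Column {
        values: vec![7, 0, 9],
        validity: Some(vec![true, false, true]),
    };
    let (levels, values) = levels_and_values(&col, true);
    println!("levels = {:?}, values = {:?}", levels, values);
}
```

   The fast path matters because it skips the per-value filtering pass 
entirely; a production implementation would presumably also avoid the clones 
shown here.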
   
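   As a rough illustration of the dictionary concern in the notes above, the 
sketch below shows why a dictionary-to-dense conversion on write does not 
preserve dictionary order: re-encoding the dense values on read yields an 
equivalent column, but not necessarily the original dictionary. `DictColumn`, 
`to_dense`, and `encode_first_seen` are hypothetical names, and the 
first-seen re-encoding is only one assumed strategy; none of this reflects how 
parquet-cpp or the Rust parquet crate actually implement it.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a dictionary-encoded string column.
#[derive(Debug, Clone, PartialEq)]
struct DictColumn {
    keys: Vec<u32>,      // indices into `values`
    values: Vec<String>, // the dictionary itself
}

/// Dictionary-to-dense conversion: materialize one string per row.
fn to_dense(col: &DictColumn) -> Vec<String> {
    col.keys
        .iter()
        .map(|&k| col.values[k as usize].clone())
        .collect()
}

/// Rebuild a dictionary from dense data, assigning keys in first-seen order.
fn encode_first_seen(dense: &[String]) -> DictColumn {
    let mut index: HashMap<&str, u32> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys: Vec<u32> = Vec::with_capacity(dense.len());
    for s in dense {
        let next = values.len() as u32;
        let key = *index.entry(s.as_str()).or_insert_with(|| {
            values.push(s.clone());
            next
        });
        keys.push(key);
    }
    DictColumn { keys, values }
}

fn main() {
    let original = DictColumn {
        keys: vec![1, 0, 1, 2],
        values: vec!["b".to_string(), "a".to_string(), "c".to_string()],
    };
    let dense = to_dense(&original);
    let rebuilt = encode_first_seen(&dense);
    // Same logical data, but the dictionary order (and hence the keys) differ.
    println!("original dictionary: {:?}", original.values);
    println!("rebuilt dictionary:  {:?}", rebuilt.values);
}
```

   Preserving the original dictionary would mean carrying `values` through the 
write path as-is instead of flattening it, which is the design decision the 
notes suggest making early.
   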
   Notes from [~sunchao] :
   
   I roughly skimmed through the C++ implementation and think that, at a high 
level, we need to do the following:
    # implement a method similar to {{WriteArrow}} in 
[column_writer.cc|https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc].
 We can further break this up into smaller pieces such as: 
dictionary/non-dictionary, primitive types, booleans, timestamps, dates, and 
so on.
    # implement an arrow writer in the parquet crate 
[here|https://github.com/apache/arrow/tree/master/rust/parquet/src/arrow]. This 
needs to offer APIs similar to those in 
[writer.h|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h].
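   
   One way to picture the per-type breakdown in the first step is a dispatch 
from Arrow logical type to Parquet physical type, with a small conversion 
routine per case. The sketch below is only that: `ArrowType`, 
`ParquetPhysical`, `physical_type`, `widen_int8`, and `date64_to_days` are 
hypothetical names covering a tiny subset of types, and writing Date64 as 
INT32 days is an assumption made for illustration rather than a statement 
about what the Rust writer will do.

```rust
/// Hypothetical subset of Arrow logical types (not the real `DataType` enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ArrowType {
    Boolean,
    Int8,
    Int32,
    Int64,
    Float64,
    Date64,
    Utf8,
}

/// Parquet physical types needed for the subset above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParquetPhysical {
    Boolean,
    Int32,
    Int64,
    Double,
    ByteArray,
}

/// Pick the Parquet physical type used to store each Arrow type.
fn physical_type(t: ArrowType) -> ParquetPhysical {
    match t {
        ArrowType::Boolean => ParquetPhysical::Boolean,
        // Parquet has no 8-bit physical type, so narrow ints are widened.
        ArrowType::Int8 | ArrowType::Int32 => ParquetPhysical::Int32,
        // Assumption for this sketch: dates are stored as INT32 days.
        ArrowType::Date64 => ParquetPhysical::Int32,
        ArrowType::Int64 => ParquetPhysical::Int64,
        ArrowType::Float64 => ParquetPhysical::Double,
        ArrowType::Utf8 => ParquetPhysical::ByteArray,
    }
}

/// Widen Arrow Int8 values to the i32 buffer an INT32 column expects.
fn widen_int8(values: &[i8]) -> Vec<i32> {
    values.iter().map(|&v| i32::from(v)).collect()
}

const MILLIS_PER_DAY: i64 = 86_400_000;

/// Convert milliseconds since the epoch to whole days since the epoch.
fn date64_to_days(millis: &[i64]) -> Vec<i32> {
    millis
        .iter()
        .map(|&ms| ms.div_euclid(MILLIS_PER_DAY) as i32)
        .collect()
}

fn main() {
    println!("{:?}", physical_type(ArrowType::Int8));
    println!("{:?}", widen_int8(&[-1, 127]));
    println!("{:?}", date64_to_days(&[0, MILLIS_PER_DAY]));
}
```

   Keeping the type mapping separate from the value conversion makes it 
straightforward to add one Arrow type at a time, in line with the suggestion 
to split the work into smaller pieces.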



