[ https://issues.apache.org/jira/browse/ARROW-12121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neville Dipale resolved ARROW-12121.
------------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 9825
[https://github.com/apache/arrow/pull/9825]

> [Rust] [Parquet] Arrow writer benchmarks
> ----------------------------------------
>
>                 Key: ARROW-12121
>                 URL: https://issues.apache.org/jira/browse/ARROW-12121
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Neville Dipale
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The common concern with Parquet's Arrow readers and writers is that they're
> slow.
> My diagnosis is that we rely on a chain of processes, which introduces
> overhead.
> For example, writing an Arrow RecordBatch involves the following:
> 1. Iterate through arrays to create def/rep levels
> 2. Extract Parquet primitive values from arrays using these levels
> 3. Write primitive values, validating them in the process (when they should
> already have been validated)
> 4. Split the already materialised values into small batches for Parquet
> chunks (consider the case where we have 1e6 values in a batch)
> 5. Write these batches, computing the stats of each batch, and encoding values
> The above is a side effect of convenience, as it would likely require a
> lot more effort to bypass some of the steps.
> I have ideas around going from step 1 to 5 directly, but won't know whether
> that is better without performance benchmarks. I also struggle to see whether
> I'm making improvements while I clean up the writer code, especially when
> removing the allocations that I created to reduce the complexity of the level
> calculations.
> With ARROW-12120 (random array & batch generator), it becomes more convenient
> to benchmark (and test many combinations of) the Arrow writer.
> I would thus like to start adding benchmarks for the Arrow writer.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
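
To make the chain of steps in the description concrete, here is a minimal sketch of steps 1, 2, 4 and 5 for a flat (non-nested) nullable i32 column, using only the Rust standard library. It is not the parquet crate's implementation: the names `definition_levels`, `non_null_values`, `chunk_stats` and `ChunkStats` are hypothetical, repetition levels are left out because a flat column has none, and validation (step 3) and encoding are skipped. It only illustrates how each step allocates an intermediate buffer that the next step then walks again.

```rust
/// Step 1 (simplified): one definition level per slot of a flat nullable
/// column. 1 means the value is present, 0 means it is null.
/// Hypothetical helper, not part of the parquet crate.
fn definition_levels(column: &[Option<i32>]) -> Vec<i16> {
    column.iter().map(|v| v.is_some() as i16).collect()
}

/// Step 2 (simplified): materialise only the non-null values into a new
/// buffer. This is a second pass over the same input.
fn non_null_values(column: &[Option<i32>]) -> Vec<i32> {
    column.iter().filter_map(|v| *v).collect()
}

/// Per-chunk statistics, standing in for the stats computed in step 5.
#[derive(Debug, PartialEq)]
struct ChunkStats {
    min: i32,
    max: i32,
    count: usize,
}

/// Steps 4 and 5 (simplified): split the materialised values into chunks
/// of at most `chunk_size` values and compute stats for each chunk.
/// `chunk_size` must be non-zero, otherwise `chunks` panics.
fn chunk_stats(values: &[i32], chunk_size: usize) -> Vec<ChunkStats> {
    values
        .chunks(chunk_size)
        .map(|chunk| ChunkStats {
            // `chunks` never yields an empty slice, so min/max exist.
            min: *chunk.iter().min().unwrap(),
            max: *chunk.iter().max().unwrap(),
            count: chunk.len(),
        })
        .collect()
}
```

For the column `[Some(5), None, Some(2), Some(9), None, Some(1)]`, `definition_levels` gives `[1, 0, 1, 1, 0, 1]` and `non_null_values` gives `[5, 2, 9, 1]`. Each function makes its own pass and allocation, which is the kind of overhead a benchmark would need to measure before and after any attempt to fuse the steps.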