[ https://issues.apache.org/jira/browse/ARROW-14024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Li updated ARROW-14024:
-----------------------------
    Fix Version/s: 6.0.0

> [C++] ScanOptions::batch_size not respected in parquet/IPC readers
> ------------------------------------------------------------------
>
>                 Key: ARROW-14024
>                 URL: https://issues.apache.org/jira/browse/ARROW-14024
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: David Li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 6.0.0
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> At first glance it seems like Parquet's reader should work.  The
> ScanOptions::batch_size property is forwarded into the ArrowReaderProperties
> for the parquet::arrow::FileReader.  However, we then use ReadOneRowGroup
> which doesn't look at the batch_size option.
> The IPC reader simply doesn't look at the property at all.
> Even if we can't control the source read size (e.g. we have to read a full
> row group / record batch and have no control over its size) we can still
> split whatever we read into smaller batches that respect the batch size.
> This is important for achieving parallelism as we can then partition the CPU
> work across these batches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
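
The splitting step proposed in the description (read a full row group or record batch, then cut it into pieces no larger than the batch size) can be sketched roughly as follows. This is only an illustration, not the fix that was merged for this issue: SplitIntoSlices is a hypothetical helper that does not exist in Arrow, and it uses only the standard library. It computes the (offset, length) pairs for the pieces; in Arrow each pair could then be passed to RecordBatch::Slice(offset, length), which returns a zero-copy view.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Arrow): compute the (offset, length)
// pairs that cut `num_rows` rows into consecutive chunks of at most
// `batch_size` rows each. Only the last chunk may be shorter.
std::vector<std::pair<int64_t, int64_t>> SplitIntoSlices(int64_t num_rows,
                                                         int64_t batch_size) {
  std::vector<std::pair<int64_t, int64_t>> slices;
  // Nothing to split, or an unusable batch size: return no slices.
  if (num_rows <= 0 || batch_size <= 0) {
    return slices;
  }
  for (int64_t offset = 0; offset < num_rows; offset += batch_size) {
    // Clamp the final slice so it does not run past the end of the data.
    const int64_t length = std::min(batch_size, num_rows - offset);
    slices.emplace_back(offset, length);
  }
  return slices;
}
```

For example, a 10-row batch with a batch size of 4 yields the slices (0, 4), (4, 4) and (8, 2). Because each slice is independent, the resulting smaller batches could be handed to separate CPU tasks, which is the parallelism benefit the description mentions.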