[ 
https://issues.apache.org/jira/browse/ARROW-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585507#comment-17585507
 ] 

Todd Farmer commented on ARROW-14648:
-------------------------------------

This issue was last updated over 90 days ago, which may be an indication that it 
is no longer being actively worked on. To better reflect the current state, the 
issue is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. 
Please feel free to re-take assignment of the issue if it is being actively 
worked on, or if you plan to start that work soon.

> [C++][Dataset] Change scanner readahead limits to be based on bytes instead 
> of number of batches
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14648
>                 URL: https://issues.apache.org/jira/browse/ARROW-14648
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>              Labels: datasets, query-engine
>
> In the scanner, readahead is controlled by "batch_readahead" and 
> "fragment_readahead" (both specified in the scan options).  This was mainly 
> motivated by my work with CSV, where the defaults of 32 and 8 will cause the 
> scanner to buffer ~256MB of data (given the default block size of 1MB); a 
> configuration sketch of these knobs follows below.
> For Parquet / IPC this would mean we are buffering 256 row groups, which is 
> entirely too high.
> Rather than make users figure out complex parameters, we should have a single 
> readahead limit specified in bytes.
> This will be "best effort".  I'm not suggesting we support partial reads of 
> row groups / record batches, so if the limit is set very small we might still 
> end up with more in RAM simply because we can only load entire row groups.
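For context, here is a minimal sketch of how the existing count-based readahead can be
configured through the C++ ScannerBuilder. It assumes the BatchReadahead / FragmentReadahead
setters are available (they may differ or be absent in older Arrow versions), and the
MakeScanner helper name is illustrative only.

{code:cpp}
#include <memory>

#include <arrow/dataset/api.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Sketch only: configure the current count-based readahead knobs.
// With batch_readahead = 32, fragment_readahead = 8, and ~1MB CSV blocks,
// roughly 32 * 8 * 1MB = 256MB of data can be buffered by the scanner.
arrow::Result<std::shared_ptr<arrow::dataset::Scanner>> MakeScanner(
    std::shared_ptr<arrow::dataset::Dataset> dataset) {
  arrow::dataset::ScannerBuilder builder(std::move(dataset));
  // Assumed setters; the proposal above would replace these two counts with
  // a single byte-based, best-effort readahead limit.
  ARROW_RETURN_NOT_OK(builder.BatchReadahead(32));    // batches read ahead per fragment
  ARROW_RETURN_NOT_OK(builder.FragmentReadahead(8));  // fragments read ahead
  return builder.Finish();
}
{code}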



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
