[ 
https://issues.apache.org/jira/browse/ARROW-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17659048#comment-17659048
 ] 

Rok Mihevc commented on ARROW-2019:
-----------------------------------

This issue has been migrated to [issue 
#18000|https://github.com/apache/arrow/issues/18000] on GitHub. Please see the 
[migration documentation|https://github.com/apache/arrow/issues/14542] for 
further details.

> Control the memory allocated for inner vector in LIST
> -----------------------------------------------------
>
>                 Key: ARROW-2019
>                 URL: https://issues.apache.org/jira/browse/ARROW-2019
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>            Reporter: Siddharth Teotia
>            Assignee: Siddharth Teotia
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> We have observed cases in our external sort code where the amount of memory 
> actually allocated for a record batch sometimes turns out to be more than 
> necessary and also more than what was reserved by the operator for special 
> purposes. Thus queries fail with OOM.
> The usual way to control the memory allocated by vector.allocateNew() is to 
> call setInitialCapacity() first; the latter modifies the vector state 
> variables that are then used to allocate memory. However, due to the 
> multiplier of 5 used in ListVector, we end up asking for more memory than 
> necessary. For example, for a value count of 4095, we asked for 128KB of 
> memory for the offset buffer of the VarCharVector for a field that was a list 
> of varchars. 
> We computed ((4095 * 5) + 1) * 4 = 81904 bytes (~80KB), which becomes 128KB 
> after rounding up to a power-of-2 allocation. 
> We had earlier made changes to setInitialCapacity() of ListVector when we 
> were facing problems with deeply nested lists and decided to use the 
> multiplier only for the leaf scalar vector. 
> It looks like there is a need for a specialized setInitialCapacity() for 
> ListVector where the caller dictates the repeatedness.
> Also, there is another bug in setInitialCapacity(): the allocation of the 
> validity buffer doesn't obey the capacity specified in setInitialCapacity(). 
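
The offset-buffer arithmetic in the description can be reproduced with a small standalone sketch. It does not use the Arrow API: the class name, the method names, and the "repeat" parameter are hypothetical, and the power-of-2 rounding only imitates the allocator behaviour the description mentions. It assumes 4-byte offsets, as in the calculation above.

```java
public final class OffsetBufferEstimate {
    // Width of one offset entry in bytes (assumed 32-bit offsets, as in the issue's arithmetic).
    static final int OFFSET_WIDTH = 4;

    /** Bytes requested for an offset buffer sized for (valueCount * repeat) entries plus one. */
    static long rawOffsetBytes(int valueCount, int repeat) {
        return ((long) valueCount * repeat + 1) * OFFSET_WIDTH;
    }

    /** Rounds a positive size up to the next power of two, imitating a power-of-2 allocator. */
    static long roundUpToPowerOfTwo(long bytes) {
        long highest = Long.highestOneBit(bytes);
        return highest == bytes ? bytes : highest << 1;
    }

    public static void main(String[] args) {
        int valueCount = 4095;
        // repeat = 5 is the fixed multiplier from the issue; repeat = 1 is a
        // hypothetical caller-supplied value for comparison.
        for (int repeat : new int[] {5, 1}) {
            long raw = rawOffsetBytes(valueCount, repeat);
            System.out.println("repeat=" + repeat
                    + " raw=" + raw
                    + " allocated=" + roundUpToPowerOfTwo(raw));
        }
    }
}
```

With the multiplier of 5, the raw request of 81904 bytes rounds up to 131072 bytes (128KB); with a caller-supplied repeat of 1, the same value count needs exactly 16384 bytes (16KB), which is why letting the caller dictate the repeatedness matters here.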



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
