[ https://issues.apache.org/jira/browse/ARROW-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299344#comment-16299344 ]

ASF GitHub Bot commented on ARROW-1943:
---------------------------------------

siddharthteotia opened a new pull request #1439: ARROW-1943: handle 
setInitialCapacity for deeply nested lists
URL: https://github.com/apache/arrow/pull/1439
 
 
   The current implementation of setInitialCapacity() uses a factor of 5 for 
every level of nesting we go into a list.
   
   So if the schema is LIST(LIST(LIST(LIST(LIST(LIST(LIST(BIGINT))))))) and we 
start with an initial capacity of 128, we end up throwing 
OversizedAllocationException from the BigIntVector: at every level we 
increased the capacity by a factor of 5, and by the time we reached the inner 
scalar vector that actually stores the data, we were well over the max size 
limit per vector (1MB).
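   
   For a back-of-the-envelope check of that growth (my numbers, not taken from 
the PR): seven LIST levels, each multiplying the requested capacity by 5, turn 
an initial capacity of 128 into 128 * 5^7 = 10,000,000 values for the innermost 
BigIntVector, roughly 80 MB of 8-byte slots. A tiny Java snippet (runnable 
as-is in jshell) that traces the same arithmetic:
   
      long capacity = 128;        // initial capacity passed to setInitialCapacity()
      for (int level = 0; level < 7; level++) {
        capacity *= 5;            // factor of 5 applied at every nested LIST level
      }
      long bytes = capacity * 8L; // a BIGINT value occupies 8 bytes
      // prints "10000000 values -> 80000000 bytes", far beyond a 1MB per-vector limit
      System.out.println(capacity + " values -> " + bytes + " bytes");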
   
   We saw this problem downstream when we failed to read deeply nested JSON 
data.
   
   The potential fix is to apply the factor of 5 only when we are down to the 
leaf (scalar) vector. As long as we are still descending through complex/list 
vectors, we pass the capacity through without the factor of 5, as sketched 
below.
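   
   A minimal sketch of that idea (illustrative only, not the code in this PR; 
the names ListCapacitySketch, InnerVector, isScalarLeaf() and 
DEFAULT_REPEAT_PER_RECORD are stand-ins I made up for the example):
   
      // Illustrative sketch, not the actual Arrow vector classes.
      class ListCapacitySketch {
        static final int DEFAULT_REPEAT_PER_RECORD = 5;  // the "factor of 5"

        interface InnerVector {
          boolean isScalarLeaf();                        // true e.g. for a BigIntVector child
          void setInitialCapacity(int numRecords);
        }

        private final InnerVector innerVector;           // the list's single child vector

        ListCapacitySketch(InnerVector innerVector) {
          this.innerVector = innerVector;
        }

        void setInitialCapacity(int numRecords) {
          if (innerVector.isScalarLeaf()) {
            // Only the leaf vector that actually stores values gets the factor of 5.
            innerVector.setInitialCapacity(numRecords * DEFAULT_REPEAT_PER_RECORD);
          } else {
            // Child is another list/complex vector: pass the capacity through
            // unchanged so deep nesting no longer compounds the multiplier.
            innerVector.setInitialCapacity(numRecords);
          }
        }
      }
   
   With that behavior, the nested-list example above would request only 
128 * 5 = 640 values for the BigIntVector regardless of nesting depth.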
   
   cc @jacques-n , @BryanCutler , @icexelloss 
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Handle setInitialCapacity() for deeply nested lists of lists
> ------------------------------------------------------------
>
>                 Key: ARROW-1943
>                 URL: https://issues.apache.org/jira/browse/ARROW-1943
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Siddharth Teotia
>            Assignee: Siddharth Teotia
>              Labels: pull-request-available
>
> The current implementation of setInitialCapacity() uses a factor of 5 for 
> every level of nesting we go into a list.
> So if the schema is LIST(LIST(LIST(LIST(LIST(LIST(LIST(BIGINT))))))) and we 
> start with an initial capacity of 128, we end up throwing 
> OversizedAllocationException from the BigIntVector: at every level we 
> increased the capacity by a factor of 5, and by the time we reached the inner 
> scalar vector that actually stores the data, we were well over the max size 
> limit per vector (1MB).
> We saw this problem in Dremio when we failed to read deeply nested JSON data.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
