[jira] [Commented] (ARROW-1943) Handle setInitialCapacity() for deeply nested lists of lists

2017-12-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301848#comment-16301848
 ] 

ASF GitHub Bot commented on ARROW-1943:
---

siddharthteotia closed pull request #1439: ARROW-1943: [JAVA] handle 
setInitialCapacity for deeply nested lists
URL: https://github.com/apache/arrow/pull/1439
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java
index a9221f2f6..9a23fd8c3 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java
@@ -29,6 +29,8 @@
 import org.apache.arrow.vector.UInt4Vector;
 import org.apache.arrow.vector.ValueVector;
 import org.apache.arrow.vector.ZeroVector;
+import org.apache.arrow.vector.BaseVariableWidthVector;
+import org.apache.arrow.vector.BaseFixedWidthVector;
 import org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID;
 import org.apache.arrow.vector.types.pojo.FieldType;
 import org.apache.arrow.vector.util.CallBack;
@@ -134,7 +136,11 @@ public FieldVector getDataVector() {
   @Override
   public void setInitialCapacity(int numRecords) {
     offsetAllocationSizeInBytes = (numRecords + 1) * OFFSET_WIDTH;
-    vector.setInitialCapacity(numRecords * RepeatedValueVector.DEFAULT_REPEAT_PER_RECORD);
+    if (vector instanceof BaseFixedWidthVector || vector instanceof BaseVariableWidthVector) {
+      vector.setInitialCapacity(numRecords * RepeatedValueVector.DEFAULT_REPEAT_PER_RECORD);
+    } else {
+      vector.setInitialCapacity(numRecords);
+    }
   }
 
   @Override
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java
index 1acce7e0b..e2023f446 100644
--- a/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java
+++ b/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java
@@ -559,6 +559,134 @@ public void testNestedListVector() throws Exception {
     }
   }
 
+  @Test
+  public void testNestedListVector1() throws Exception {
+    try (ListVector listVector = ListVector.empty("sourceVector", allocator)) {
+
+      MinorType listType = MinorType.LIST;
+      MinorType scalarType = MinorType.BIGINT;
+
+      listVector.addOrGetVector(FieldType.nullable(listType.getType()));
+
+      ListVector innerList1 = (ListVector) listVector.getDataVector();
+      innerList1.addOrGetVector(FieldType.nullable(listType.getType()));
+
+      ListVector innerList2 = (ListVector) innerList1.getDataVector();
+      innerList2.addOrGetVector(FieldType.nullable(listType.getType()));
+
+      ListVector innerList3 = (ListVector) innerList2.getDataVector();
+      innerList3.addOrGetVector(FieldType.nullable(listType.getType()));
+
+      ListVector innerList4 = (ListVector) innerList3.getDataVector();
+      innerList4.addOrGetVector(FieldType.nullable(listType.getType()));
+
+      ListVector innerList5 = (ListVector) innerList4.getDataVector();
+      innerList5.addOrGetVector(FieldType.nullable(listType.getType()));
+
+      ListVector innerList6 = (ListVector) innerList5.getDataVector();
+      innerList6.addOrGetVector(FieldType.nullable(scalarType.getType()));
+
+      listVector.setInitialCapacity(128);
+    }
+  }
+
+  @Test
+  public void testNestedListVector2() throws Exception {
+    try (ListVector listVector = ListVector.empty("sourceVector", allocator)) {
+      listVector.setInitialCapacity(1);
+      UnionListWriter listWriter = listVector.getWriter();
+      /* allocate memory */
+      listWriter.allocate();
+
+      /* write one or more inner lists at index 0 */
+      listWriter.setPosition(0);
+      listWriter.startList();
+
+      listWriter.list().startList();
+      listWriter.list().bigInt().writeBigInt(50);
+      listWriter.list().bigInt().writeBigInt(100);
+      listWriter.list().bigInt().writeBigInt(200);
+      listWriter.list().endList();
+
+      listWriter.list().startList();
+      listWriter.list().bigInt().writeBigInt(75);
+      listWriter.list().bigInt().writeBigInt(125);
+      listWriter.list().endList();
+
+      listWriter.endList();
+
+      /* write one or more inner lists at index 1 */
+      listWriter.setPosition(1);
+      listWriter.startList();
+
+      listWriter.list().startList();
+      listWriter.list().bigInt().writeBigInt(15);
+      listWriter.list().bigInt().writeBigInt(20);
+      listWriter.list().endList();
+
+      listWriter.list().startList();

[jira] [Commented] (ARROW-1943) Handle setInitialCapacity() for deeply nested lists of lists

2017-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300939#comment-16300939
 ] 

ASF GitHub Bot commented on ARROW-1943:
---

jacques-n commented on issue #1439: ARROW-1943: [JAVA] handle 
setInitialCapacity for deeply nested lists
URL: https://github.com/apache/arrow/pull/1439#issuecomment-353517425
 
 
   I'm +1 on this approach.  It may not be perfect but it is definitely far 
better than the old approach.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Handle setInitialCapacity() for deeply nested lists of lists
> 
>
> Key: ARROW-1943
> URL: https://issues.apache.org/jira/browse/ARROW-1943
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>
> The current implementation of setInitialCapacity() uses a factor of 5 for
> every level we go into the list:
> So if the schema is LIST (LIST (LIST (LIST (LIST (LIST (LIST (BIGINT)))))))
> and we start with an initial capacity of 128, we end up throwing
> OversizedAllocationException from the BigIntVector, because at every level we
> increased the capacity by 5, and by the time we reached the inner scalar that
> actually stores the data, we were well over the max size limit per vector (1MB).
> We saw this problem in Dremio when we failed to read deeply nested JSON data.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
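To make the arithmetic in the issue description concrete, here is a standalone sketch (plain Java, hypothetical class name, not the Arrow API) of how a 5x multiplier at each of seven list levels blows past the 1 MB per-vector limit:

```java
// Standalone arithmetic only, not Arrow code: the old setInitialCapacity()
// multiplied the requested capacity by 5 at every nesting level.
public class OldCapacityGrowth {
    public static void main(String[] args) {
        long capacity = 128;          // initial capacity at the root list
        for (int level = 0; level < 7; level++) {
            capacity *= 5;            // one multiplication per LIST level
        }
        long bytes = capacity * 8;    // BIGINT values are 8 bytes each
        System.out.println(capacity); // 10000000 values at the leaf
        System.out.println(bytes);    // 80000000 bytes, far over the 1 MB limit
    }
}
```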




[jira] [Commented] (ARROW-1943) Handle setInitialCapacity() for deeply nested lists of lists

2017-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300872#comment-16300872
 ] 

ASF GitHub Bot commented on ARROW-1943:
---

siddharthteotia commented on issue #1439: ARROW-1943: [JAVA] handle 
setInitialCapacity for deeply nested lists
URL: https://github.com/apache/arrow/pull/1439#issuecomment-353504925
 
 
   Ping.
   




[jira] [Commented] (ARROW-1943) Handle setInitialCapacity() for deeply nested lists of lists

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299424#comment-16299424
 ] 

ASF GitHub Bot commented on ARROW-1943:
---

siddharthteotia commented on issue #1439: ARROW-1943: [JAVA] handle 
setInitialCapacity for deeply nested lists
URL: https://github.com/apache/arrow/pull/1439#issuecomment-353239485
 
 
   I thought more about this, and the implemented solution is debatable even
though it fixes the overallocation exception problem.
   
   For example, consider the newly added unit test for the schema LIST (LIST (INT)).
   
   Since each position in the top-level vector has one or more lists, the number
of offsets in the inner list will always be greater than the offsets in its parent.
This implies that there is some factor of increase in capacity as we go down
the tree. In the context of the unit test:
   
   The value capacity of the outer list is 2 and of the inner list is 4, because each
position of the outer list has 2 inner lists, and then we have an int vector with
value capacity 10 comprising data across all inner lists.
   
   So doing list.setInitialCapacity(2) -> innerList.setInitialCapacity(2) ->
intVector.setInitialCapacity(2 * 5) will require expansion of the offset buffer
(and validity buffer) of the inner list.
   
   The question really is whether there is a reasonable way to increase the
multiplier as we go down the nested lists. The current patch keeps it the same
until we get down to scalars, and then we directly use a multiplier of 5.
   
   However, this will potentially require re-allocation of the internal buffers of
each inner list vector as the user app writes data into deeply nested lists.
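The capacity chain in the comment above can be checked with a tiny standalone helper (plain arithmetic, hypothetical class name, not the Arrow API):

```java
// Models the patched behavior for LIST (LIST (INT)): nested lists forward
// the requested capacity unchanged; only the leaf scalar gets the
// factor-of-5 multiplier.
public class NestedCapacityTrace {
    static final int DEFAULT_REPEAT_PER_RECORD = 5;

    static int leafCapacity(int numRecords) {
        // every intermediate list level passes numRecords through unchanged;
        // the multiplier is applied once, at the leaf
        return numRecords * DEFAULT_REPEAT_PER_RECORD;
    }

    public static void main(String[] args) {
        // list.setInitialCapacity(2) -> innerList gets 2 -> intVector gets 10
        System.out.println(NestedCapacityTrace.leafCapacity(2)); // 10
    }
}
```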


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Handle setInitialCapacity() for deeply nested lists of lists
> 
>
> Key: ARROW-1943
> URL: https://issues.apache.org/jira/browse/ARROW-1943
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>
> The current implementation of setInitialCapacity() uses a factor of 5 for 
> every level we go into list:
> So if the schema is LIST (LIST (LIST (LIST (LIST (LIST (LIST (BIGINT)) 
> and we start with an initial capacity of 128, we end up throwing 
> OversizedAllocationException from the BigIntVector because at every level we 
> increased the capacity by 5 and by the time we reached inner scalar that 
> actually stores the data, we were well over max size limit per vector (1MB).
> We saw this problem in Dremio when we failed to read deeply nested JSON data.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




[jira] [Commented] (ARROW-1943) Handle setInitialCapacity() for deeply nested lists of lists

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299344#comment-16299344
 ] 

ASF GitHub Bot commented on ARROW-1943:
---

siddharthteotia opened a new pull request #1439: ARROW-1943: handle 
setInitialCapacity for deeply nested lists
URL: https://github.com/apache/arrow/pull/1439
 
 
   The current implementation of setInitialCapacity() uses a factor of 5 for
every level we go into the list:
   
   So if the schema is LIST (LIST (LIST (LIST (LIST (LIST (LIST (BIGINT)))))))
and we start with an initial capacity of 128, we end up throwing
OversizedAllocationException from the BigIntVector, because at every level we
increased the capacity by 5, and by the time we reached the inner scalar that
actually stores the data, we were well over the max size limit per vector (1MB).
   
   We saw this problem downstream when we failed to read deeply nested JSON
data.
   
   The potential fix is to use the factor of 5 only when we are down to the
leaf vector. As the depth increases and we are still working with a complex/list
vector, we don't use the factor of 5.
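The dispatch described above can be sketched with simplified stand-in classes (hypothetical `Vec`/`ListVec`/`LeafVec` names, not the real Arrow types): the repeat multiplier is applied only when the child is a leaf vector.

```java
// Simplified model of the proposed fix, not the Arrow implementation:
// nested lists forward the requested capacity unchanged; only a leaf
// (scalar) child gets the DEFAULT_REPEAT_PER_RECORD multiplier.
public class CapacityDispatchSketch {
    interface Vec { void setInitialCapacity(int n); }

    static class LeafVec implements Vec {
        int capacity;
        public void setInitialCapacity(int n) { capacity = n; }
    }

    static class ListVec implements Vec {
        static final int DEFAULT_REPEAT_PER_RECORD = 5;
        final Vec child;
        ListVec(Vec child) { this.child = child; }
        public void setInitialCapacity(int n) {
            if (child instanceof LeafVec) {
                // down to the leaf: apply the factor of 5 once
                child.setInitialCapacity(n * DEFAULT_REPEAT_PER_RECORD);
            } else {
                // still inside nested lists: pass the capacity through
                child.setInitialCapacity(n);
            }
        }
    }

    public static void main(String[] args) {
        // LIST nested 7 deep around a scalar, initial capacity 128
        LeafVec leaf = new LeafVec();
        Vec root = leaf;
        for (int i = 0; i < 7; i++) {
            root = new ListVec(root);
        }
        root.setInitialCapacity(128);
        System.out.println(leaf.capacity); // 640, instead of 128 * 5^7 = 10,000,000
    }
}
```

Under this model the leaf still gets five slots per record, but the exponential growth across list levels is gone.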
   
   cc @jacques-n , @BryanCutler , @icexelloss 
   
   

