[jira] [Created] (ARROW-542) [Java] Implement dictionaries in stream/file encoding

2017-02-08 Thread Emilio Lahr-Vivaz (JIRA)
Emilio Lahr-Vivaz created ARROW-542:
---

 Summary: [Java] Implement dictionaries in stream/file encoding
 Key: ARROW-542
 URL: https://issues.apache.org/jira/browse/ARROW-542
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java - Vectors
Reporter: Emilio Lahr-Vivaz
Assignee: Emilio Lahr-Vivaz






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-542) [Java] Implement dictionaries in stream/file encoding

2017-02-08 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858295#comment-15858295
 ] 

Emilio Lahr-Vivaz commented on ARROW-542:
-

[~wesmckinn] I'm looking into how dictionary vectors will be encoded in the 
file format. In the current message definitions, it appears dictionary batches 
are distinct from regular batches, and have an ID associated with them: 
https://github.com/apache/arrow/blob/b99d049c3d1894908b7e52774eb657675dc1f439/format/Message.fbs#L284
Wouldn't the dictionary already be defined by the Field? I'm unclear what the 
ID in the DictionaryBatch is supposed to represent.
Thanks,
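
One way to picture the role of the ID: the Field only records that a column is dictionary-encoded and which dictionary it refers to, while the dictionary values travel separately in a DictionaryBatch tagged with the same ID. The sketch below is a plain-Java illustration under that assumption. `DictionaryIdSketch`, `onDictionaryBatch` and `decode` are hypothetical names, not Arrow APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader-side state; not part of the Arrow Java API. */
final class DictionaryIdSketch {
  // Dictionary values received so far, keyed by the ID carried in each dictionary batch.
  private final Map<Long, List<String>> dictionaries = new HashMap<>();

  /** Called when a dictionary batch arrives on the stream. */
  void onDictionaryBatch(long id, List<String> values) {
    dictionaries.put(id, values);
  }

  /** Resolves the integer indices of an encoded column through the dictionary its field names. */
  List<String> decode(long fieldDictionaryId, int[] indices) {
    List<String> dictionary = dictionaries.get(fieldDictionaryId);
    if (dictionary == null) {
      throw new IllegalStateException("no dictionary batch seen for id " + fieldDictionaryId);
    }
    List<String> decoded = new ArrayList<>(indices.length);
    for (int index : indices) {
      decoded.add(dictionary.get(index));
    }
    return decoded;
  }

  public static void main(String[] args) {
    DictionaryIdSketch reader = new DictionaryIdSketch();
    reader.onDictionaryBatch(7L, Arrays.asList("red", "green"));
    System.out.println(reader.decode(7L, new int[] {1, 0, 1})); // prints [green, red, green]
  }
}
```

With this picture, two fields that name the same ID can share one dictionary, which a definition embedded in each Field could not express as directly.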

> [Java] Implement dictionaries in stream/file encoding
> -
>
> Key: ARROW-542
> URL: https://issues.apache.org/jira/browse/ARROW-542
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Emilio Lahr-Vivaz
>Assignee: Emilio Lahr-Vivaz
>






[jira] [Commented] (ARROW-542) [Java] Implement dictionaries in stream/file encoding

2017-02-08 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858342#comment-15858342
 ] 

Emilio Lahr-Vivaz commented on ARROW-542:
-

Ah, makes sense, thanks.



[jira] [Commented] (ARROW-542) [Java] Implement dictionaries in stream/file encoding

2017-02-09 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859863#comment-15859863
 ] 

Emilio Lahr-Vivaz commented on ARROW-542:
-

It's getting a little complicated trying to encode/decode the dictionaries, 
given the interplay between the reader/writers, the vector loader/unloader and 
the ArrowRecordBatch. Right now I'm trying to rely on finding DictionaryVector 
class instances, but that breaks down when things start getting encoded. The 
two step process between the vector loaders/unloaders and the file 
readers/writers makes it hard to track state. The ArrowRecordBatch which is 
passed around doesn't even include any Field data. It seems like it would be 
more straightforward to require the user to set the dictionary ids up front in 
the Field. The dictionary ID is defined as a Long, which seems to imply that 
they were not meant to be entirely transient (otherwise it could be an Int or 
smaller). [~wesmckinn] thoughts? I realize this goes against what you've been 
saying.
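
To make the proposal concrete, here is a minimal sketch of what "dictionary ids set up front in the Field" could look like. It assumes a hypothetical `FieldSpec` that carries an optional ID, so a writer can learn which dictionaries to emit from the schema alone instead of inspecting vector class instances. None of these names are Arrow APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration; not the Arrow Java API. */
final class UpFrontDictionaryIds {
  /** Field description where dictionaryId is null for plain (unencoded) fields. */
  static final class FieldSpec {
    final String name;
    final Long dictionaryId;

    FieldSpec(String name, Long dictionaryId) {
      this.name = name;
      this.dictionaryId = dictionaryId;
    }
  }

  /** Collects the distinct dictionary IDs a writer must emit, in schema order. */
  static List<Long> dictionariesToWrite(List<FieldSpec> schema) {
    Set<Long> ids = new LinkedHashSet<>();
    for (FieldSpec field : schema) {
      if (field.dictionaryId != null) {
        ids.add(field.dictionaryId);
      }
    }
    return new ArrayList<>(ids);
  }

  public static void main(String[] args) {
    List<FieldSpec> schema = Arrays.asList(
        new FieldSpec("city", 1L),
        new FieldSpec("population", null),
        new FieldSpec("country", 2L),
        new FieldSpec("birth_city", 1L));
    System.out.println(dictionariesToWrite(schema)); // prints [1, 2]
  }
}
```

Because the IDs live in the schema, the answer does not change when the vectors themselves have already been replaced by their encoded form.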



[jira] [Commented] (ARROW-542) [Java] Implement dictionaries in stream/file encoding

2017-02-09 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860292#comment-15860292
 ] 

Emilio Lahr-Vivaz commented on ARROW-542:
-

Another blocker I'm hitting is that I don't see any way that the type of a 
dictionary block can be determined during read. DictionaryEncoding has an 
indexType, but that seems to refer to the ints used to reference the dictionary 
values: 
https://github.com/apache/arrow/blob/b99d049c3d1894908b7e52774eb657675dc1f439/format/Message.fbs#L165
A dictionary encoded vector currently has its type defined as the dictionary 
index type, but the type of the dictionary is not defined. It works when the 
data is in memory with the dictionary alongside it, but not when encoding to 
the file format... Possibly the dictionary encoded vector should specify the 
dictionary type? It seems like either that or the message format needs another 
field for the dictionary type.
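
The gap can be shown with a small sketch: a reader needs two types per encoded column, one for the integer indices and one for the dictionary values. The class below is a hypothetical model that carries both, corresponding to either of the two options mentioned above. `EncodedFieldSketch` and its `Type` enum are stand-ins, not Arrow types.

```java
/** Hypothetical model of a dictionary-encoded field; not the Arrow Java API. */
final class EncodedFieldSketch {
  /** Stand-in for a real type system. */
  enum Type { INT8, INT16, INT32, UTF8, FLOAT8 }

  final String name;
  final Type indexType;      // type of the integers stored in the encoded column
  final Type dictionaryType; // type of the values stored in the dictionary itself

  EncodedFieldSketch(String name, Type indexType, Type dictionaryType) {
    this.name = name;
    this.indexType = indexType;
    this.dictionaryType = dictionaryType;
  }

  /** Type a reader would allocate for the record-batch column. */
  Type columnVectorType() {
    return indexType;
  }

  /** Type a reader would allocate for the dictionary batch; unknowable if only indexType is stored. */
  Type dictionaryVectorType() {
    return dictionaryType;
  }

  public static void main(String[] args) {
    EncodedFieldSketch city = new EncodedFieldSketch("city", Type.INT16, Type.UTF8);
    System.out.println(city.columnVectorType() + " / " + city.dictionaryVectorType()); // prints INT16 / UTF8
  }
}
```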



[jira] [Created] (ARROW-691) [Java] Encode dictionary Int type in message format

2017-03-22 Thread Emilio Lahr-Vivaz (JIRA)
Emilio Lahr-Vivaz created ARROW-691:
---

 Summary: [Java] Encode dictionary Int type in message format
 Key: ARROW-691
 URL: https://issues.apache.org/jira/browse/ARROW-691
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Affects Versions: 0.3.0
Reporter: Emilio Lahr-Vivaz
Assignee: Emilio Lahr-Vivaz
 Fix For: 0.3.0








[jira] [Commented] (ARROW-725) [Format] Constant length list type

2017-03-29 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947105#comment-15947105
 ] 

Emilio Lahr-Vivaz commented on ARROW-725:
-

Yeah, I'll try to put something up in the next day or two

> [Format] Constant length list type
> --
>
> Key: ARROW-725
> URL: https://issues.apache.org/jira/browse/ARROW-725
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Brian Hulette
>Priority: Trivial
>
> It makes sense to store some data in a row-based format. For example, a 
> position might be stored as two or three coordinates per row, and all of them 
> will almost always be accessed simultaneously. Currently, arrow must store 
> these as two or three separate vectors, but cache performance could 
> potentially be improved if every coordinate for a given row were in the same 
> location in memory.
> The List type could satisfy this requirement, but it requires an additional 
> offset vector which isn't necessary when every element is the same size. I 
> think it would be helpful to define a new type that is essentially a List 
> with every element having the same length. I think "Tuple" would be a natural 
> fit for this type but I'm open to other suggestions.
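
The point about the offset vector can be illustrated with a short sketch. When every row has the same number of elements, element `k` of row `r` sits at position `r * listSize + k` in one flat buffer, so the offsets a variable-length List would store are fully determined by arithmetic. `FixedSizeListSketch` is a hypothetical class written for this illustration, not the type that was eventually added to Arrow.

```java
import java.util.Arrays;

/** Hypothetical fixed-size list over a flat buffer; not the Arrow implementation. */
final class FixedSizeListSketch {
  private final double[] values; // all rows back to back, listSize entries per row
  private final int listSize;

  FixedSizeListSketch(int listSize, double[] values) {
    if (listSize <= 0 || values.length % listSize != 0) {
      throw new IllegalArgumentException("values must hold a whole number of rows");
    }
    this.listSize = listSize;
    this.values = values;
  }

  int rowCount() {
    return values.length / listSize;
  }

  /** Element k of the given row, located by arithmetic alone: no offset lookup. */
  double get(int row, int k) {
    if (row < 0 || row >= rowCount() || k < 0 || k >= listSize) {
      throw new IndexOutOfBoundsException("row=" + row + ", k=" + k);
    }
    return values[row * listSize + k];
  }

  /** For comparison: the offsets a variable-length List would have to store for the same data. */
  int[] equivalentListOffsets() {
    int[] offsets = new int[rowCount() + 1];
    for (int i = 0; i < offsets.length; i++) {
      offsets[i] = i * listSize;
    }
    return offsets;
  }

  public static void main(String[] args) {
    // Two 3-D positions stored contiguously: (1, 2, 3) and (4, 5, 6).
    FixedSizeListSketch points = new FixedSizeListSketch(3, new double[] {1, 2, 3, 4, 5, 6});
    System.out.println(points.get(1, 2)); // prints 6.0
    System.out.println(Arrays.toString(points.equivalentListOffsets())); // prints [0, 3, 6]
  }
}
```

All coordinates of a row are adjacent in memory, which is the cache behaviour the description asks for.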





[jira] [Assigned] (ARROW-725) [Format] Constant length list type

2017-03-29 Thread Emilio Lahr-Vivaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emilio Lahr-Vivaz reassigned ARROW-725:
---

Assignee: Emilio Lahr-Vivaz



[jira] [Commented] (ARROW-725) [Format] Constant length list type

2017-03-29 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947126#comment-15947126
 ] 

Emilio Lahr-Vivaz commented on ARROW-725:
-

Should I make that change as part of adding the list? Or leave it for now?



[jira] [Comment Edited] (ARROW-725) [Format] Constant length list type

2017-03-29 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947126#comment-15947126
 ] 

Emilio Lahr-Vivaz edited comment on ARROW-725 at 3/29/17 1:31 PM:
--

Should I make that change as part of adding the list? Or leave it for now?
edit: I think I'll leave it for now, as I'm not entirely comfortable with the 
C++ codebase yet.


was (Author: elahrvivaz):
Should I make that change as part of adding the list? Or leave it for now?



[jira] [Commented] (ARROW-796) [Java] Checkstyle additions causing build failure in some environments

2017-04-10 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962875#comment-15962875
 ] 

Emilio Lahr-Vivaz commented on ARROW-796:
-

Yeah, I've been using 3.3.9 and didn't have a problem. I guess the README does 
say 'maven 3.3 or later'. What version of Maven was IntelliJ using?

> [Java] Checkstyle additions causing build failure in some environments
> --
>
> Key: ARROW-796
> URL: https://issues.apache.org/jira/browse/ARROW-796
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Wes McKinney
> Fix For: 0.3.0
>
>
> Even after the conflict fixed in ARROW-677, I'm running into build problems:
> {code}
> SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible 
> with [1.6, 1.7]
> SLF4J: See http://www.slf4j.org/codes.html#version_mismatch for further 
> details.
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO] 
> [INFO] Apache Arrow Java Root POM  FAILURE [0.586s]
> [INFO] Arrow Format .. SKIPPED
> [INFO] Arrow Memory .. SKIPPED
> [INFO] Arrow Vectors . SKIPPED
> [INFO] Arrow Tools ... SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 0.742s
> [INFO] Finished at: Sat Apr 08 17:11:40 EDT 2017
> [INFO] Final Memory: 20M/633M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (validate) on 
> project arrow-java-root: Execution validate of goal 
> org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check failed: An API 
> incompatibility was encountered while executing 
> org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check: 
> java.lang.AbstractMethodError: 
> org.slf4j.impl.JDK14LoggerAdapter.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
> [ERROR] -
> [ERROR] realm =
> plugin>org.apache.maven.plugins:maven-checkstyle-plugin:2.17
> {code}
> If I remove the checkstyle plugin from the root pom.xml, everything is OK





[jira] [Commented] (ARROW-725) [Format] Constant length list type

2017-04-10 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963146#comment-15963146
 ] 

Emilio Lahr-Vivaz commented on ARROW-725:
-

I'd like to get this into the 0.3 release if possible - can I add it as a 
blocker?



[jira] [Created] (ARROW-815) [Java] Allow for expanding underlying buffer size after allocation

2017-04-12 Thread Emilio Lahr-Vivaz (JIRA)
Emilio Lahr-Vivaz created ARROW-815:
---

 Summary: [Java] Allow for expanding underlying buffer size after 
allocation
 Key: ARROW-815
 URL: https://issues.apache.org/jira/browse/ARROW-815
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Emilio Lahr-Vivaz
Assignee: Emilio Lahr-Vivaz
 Fix For: 0.3.0


There are currently some `reAlloc` methods, but they are not exposed at the 
vector level. It would be useful to expose them so that vector capacity can 
grow without losing current data.
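
As a rough illustration of the behaviour being requested, the sketch below grows a buffer by doubling and copies the existing values across. `GrowableIntBuffer` is a hypothetical stand-in; its `reAlloc` and `setSafe` methods only borrow names used in the Arrow Java code base and make no claim about how Arrow implements them.

```java
import java.util.Arrays;

/** Hypothetical growable buffer; not an Arrow vector. */
final class GrowableIntBuffer {
  private int[] data;

  GrowableIntBuffer(int initialCapacity) {
    data = new int[Math.max(1, initialCapacity)];
  }

  int capacity() {
    return data.length;
  }

  /** Doubles the capacity, copying the existing values into the new array. */
  void reAlloc() {
    data = Arrays.copyOf(data, data.length * 2);
  }

  /** Writes a value, growing first if the index is beyond the current capacity. */
  void setSafe(int index, int value) {
    while (index >= data.length) {
      reAlloc();
    }
    data[index] = value;
  }

  int get(int index) {
    return data[index];
  }

  public static void main(String[] args) {
    GrowableIntBuffer buffer = new GrowableIntBuffer(2);
    buffer.setSafe(0, 10);
    buffer.setSafe(5, 50); // capacity grows 2 -> 4 -> 8
    System.out.println(buffer.capacity()); // prints 8
    System.out.println(buffer.get(0));     // prints 10
  }
}
```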





[jira] [Created] (ARROW-2500) [Java] IPC Writers/readers are not always setting validity bits correctly

2018-04-23 Thread Emilio Lahr-Vivaz (JIRA)
Emilio Lahr-Vivaz created ARROW-2500:


 Summary: [Java] IPC Writers/readers are not always setting 
validity bits correctly
 Key: ARROW-2500
 URL: https://issues.apache.org/jira/browse/ARROW-2500
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Affects Versions: 0.9.0, 0.8.0
Reporter: Emilio Lahr-Vivaz


When writing multiple batches to a Stream/File Writer, the first validity bit 
can get garbled between writing and reading. I couldn't pinpoint the exact 
issue, but I was able to re-create it with a fairly simple unit test.

In TestArrowStream.java:

{code:java}
  @Test
  public void testReadWriteMultipleBatches() throws IOException {
    ByteArrayOutputStream os = new ByteArrayOutputStream();

    try (IntVector vector = new IntVector("foo", allocator);) {
      Schema schema = new Schema(Collections.singletonList(vector.getField()), null);
      try (VectorSchemaRoot root = new VectorSchemaRoot(
               schema, Collections.singletonList((FieldVector) vector), vector.getValueCount());
           ArrowStreamWriter writer = new ArrowStreamWriter(
               root, new MapDictionaryProvider(), Channels.newChannel(os));) {
        writer.start();

        // First batch: five rows, null at indices 0 and 3.
        vector.setNull(0);
        vector.setSafe(1, 1);
        vector.setSafe(2, 2);
        vector.setNull(3);
        vector.setSafe(4, 1);
        vector.setValueCount(5);
        root.setRowCount(5);
        writer.writeBatch();

        // Second batch reuses the same vector: three rows, null at index 0.
        vector.setNull(0);
        vector.setSafe(1, 1);
        vector.setSafe(2, 2);
        vector.setValueCount(3);
        root.setRowCount(3);
        writer.writeBatch();
      }
    }

    ByteArrayInputStream in = new ByteArrayInputStream(os.toByteArray());

    try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator);) {
      IntVector read = (IntVector) reader.getVectorSchemaRoot().getFieldVectors().get(0);

      // Read back the first batch and check values and nulls.
      reader.loadNextBatch();

      assertEquals(read.getValueCount(), 5);
      assertNull(read.getObject(0));
      assertEquals(read.getObject(1), Integer.valueOf(1));
      assertEquals(read.getObject(2), Integer.valueOf(2));
      assertNull(read.getObject(3));
      assertEquals(read.getObject(4), Integer.valueOf(1));

      // Read back the second batch.
      reader.loadNextBatch();

      assertEquals(read.getValueCount(), 3);
      assertNull(read.getObject(0));
      assertEquals(read.getObject(1), Integer.valueOf(1));
      assertEquals(read.getObject(2), Integer.valueOf(2));
    }
  }
{code}

In TestArrowFile.java:

{code:java}
  @Test
  public void testReadWriteMultipleBatches() throws IOException {
    File file = new File("target/mytest_nulls_multibatch.arrow");

    try (IntVector vector = new IntVector("foo", allocator);) {
      Schema schema = new Schema(Collections.singletonList(vector.getField()), null);
      try (FileOutputStream fileOutputStream = new FileOutputStream(file);
           VectorSchemaRoot root = new VectorSchemaRoot(
               schema, Collections.singletonList((FieldVector) vector), vector.getValueCount());
           ArrowFileWriter writer = new ArrowFileWriter(
               root, new MapDictionaryProvider(), fileOutputStream.getChannel());) {
        writer.start();

        // First batch: five rows, null at indices 0 and 3.
        vector.setNull(0);
        vector.setSafe(1, 1);
        vector.setSafe(2, 2);
        vector.setNull(3);
        vector.setSafe(4, 1);
        vector.setValueCount(5);
        root.setRowCount(5);
        writer.writeBatch();

        // Second batch reuses the same vector: three rows, null at index 0.
        vector.setNull(0);
        vector.setSafe(1, 1);
        vector.setSafe(2, 2);
        vector.setValueCount(3);
        root.setRowCount(3);
        writer.writeBatch();
      }
    }

    try (FileInputStream fileInputStream = new FileInputStream(file);
         ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), allocator);) {
      IntVector read = (IntVector) reader.getVectorSchemaRoot().getFieldVectors().get(0);

      // Read back the first batch and check values and nulls.
      reader.loadNextBatch();

      assertEquals(read.getValueCount(), 5);
      assertNull(read.getObject(0));
      assertEquals(read.getObject(1), Integer.valueOf(1));
      assertEquals(read.getObject(2), Integer.valueOf(2));
      assertNull(read.getObject(3));
      assertEquals(read.getObject(4), Integer.valueOf(1));

      // Read back the second batch.
      reader.loadNextBatch();

      assertEquals(read.getValueCount(), 3);
      assertNull(read.getObject(0));
      assertEquals(read.getObject(1), Integer.valueOf(1));
      assertEquals(read.getObject(2), Integer.valueOf(2));
    }
  }
{code}
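
For readers unfamiliar with validity bits, the sketch below shows the kind of bitmap the tests above exercise: one bit per slot, least-significant bit first within each byte, where a set bit marks a non-null value. It is a standalone illustration with hypothetical names (`ValidityBitmapSketch`, `build`, `isSet`); it does not reproduce the bug and makes no claim about its cause.

```java
/** Hypothetical validity bitmap helper; not the Arrow implementation. */
final class ValidityBitmapSketch {
  /** Builds a bitmap with one bit per slot; a set bit marks a non-null value. */
  static byte[] build(boolean[] isValid) {
    byte[] bitmap = new byte[(isValid.length + 7) / 8];
    for (int i = 0; i < isValid.length; i++) {
      if (isValid[i]) {
        bitmap[i >> 3] |= (byte) (1 << (i & 7));
      }
    }
    return bitmap;
  }

  /** Reads the validity bit for the given slot. */
  static boolean isSet(byte[] bitmap, int index) {
    return ((bitmap[index >> 3] >> (index & 7)) & 1) == 1;
  }

  public static void main(String[] args) {
    // Same null pattern as the first batch in the tests: null at indices 0 and 3.
    byte[] bitmap = build(new boolean[] {false, true, true, false, true});
    System.out.println(bitmap[0]);        // prints 22 (binary 00010110)
    System.out.println(isSet(bitmap, 0)); // prints false
  }
}
```

The first validity bit mentioned in the report corresponds to bit 0 of the first byte here.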



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2500) [Java] IPC Writers/readers are not always setting validity bits correctly

2018-04-24 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449781#comment-16449781
 ] 

Emilio Lahr-Vivaz commented on ARROW-2500:
--

Note: this didn't seem to occur in 0.6.

> [Java] IPC Writers/readers are not always setting validity bits correctly
> -
>
> Key: ARROW-2500
> URL: https://issues.apache.org/jira/browse/ARROW-2500
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Emilio Lahr-Vivaz
>Priority: Major
>



[jira] [Created] (ARROW-886) VariableLengthVectors don't reAlloc offsets

2017-04-24 Thread Emilio Lahr-Vivaz (JIRA)
Emilio Lahr-Vivaz created ARROW-886:
---

 Summary: VariableLengthVectors don't reAlloc offsets
 Key: ARROW-886
 URL: https://issues.apache.org/jira/browse/ARROW-886
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Affects Versions: 0.3.0
Reporter: Emilio Lahr-Vivaz
Assignee: Emilio Lahr-Vivaz
 Fix For: 0.3.0








[jira] [Assigned] (ARROW-482) [Java] Provide API access to "custom_metadata" Field attribute in IPC setting

2017-04-27 Thread Emilio Lahr-Vivaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emilio Lahr-Vivaz reassigned ARROW-482:
---

Assignee: Emilio Lahr-Vivaz

> [Java] Provide API access to "custom_metadata" Field attribute in IPC setting
> -
>
> Key: ARROW-482
> URL: https://issues.apache.org/jira/browse/ARROW-482
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Assignee: Emilio Lahr-Vivaz
>
> This is necessary for supplying application-specific type metadata.





[jira] [Commented] (ARROW-482) [Java] Provide API access to "custom_metadata" Field attribute in IPC setting

2017-04-27 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987019#comment-15987019
 ] 

Emilio Lahr-Vivaz commented on ARROW-482:
-

I'm going to work on this if no one else has claimed it. Thanks,



[jira] [Commented] (ARROW-855) Arrow Memory Leak

2017-05-09 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002694#comment-16002694
 ] 

Emilio Lahr-Vivaz commented on ARROW-855:
-

Won't the buffer allocator hold on to and re-use buffers? You might have to 
close the allocator to free those resources.
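
The hypothesis in this comment can be sketched with a toy pooling allocator: releasing a buffer returns it to a free list instead of freeing it, so memory stays reserved until the allocator itself is closed. `PoolingAllocatorSketch` is a hypothetical class written to illustrate the question; it is not how Arrow's allocator is implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical pooling allocator; illustrates the question, not Arrow's allocator. */
final class PoolingAllocatorSketch implements AutoCloseable {
  private final Deque<byte[]> freeList = new ArrayDeque<>();

  /** Hands out a pooled buffer when the most recent one is large enough, otherwise allocates. */
  byte[] allocate(int size) {
    byte[] pooled = freeList.peekFirst();
    if (pooled != null && pooled.length >= size) {
      return freeList.pollFirst();
    }
    return new byte[size];
  }

  /** Returning a buffer does not free it: the pool keeps it for reuse. */
  void release(byte[] buffer) {
    freeList.addFirst(buffer);
  }

  /** Total bytes currently held by the pool. */
  long pooledBytes() {
    long total = 0;
    for (byte[] buffer : freeList) {
      total += buffer.length;
    }
    return total;
  }

  /** Only closing the allocator drops the pooled memory. */
  @Override
  public void close() {
    freeList.clear();
  }

  public static void main(String[] args) {
    PoolingAllocatorSketch allocator = new PoolingAllocatorSketch();
    byte[] table = allocator.allocate(1024);
    allocator.release(table);                     // "drop the table"
    System.out.println(allocator.pooledBytes());  // prints 1024
    allocator.close();
    System.out.println(allocator.pooledBytes());  // prints 0
  }
}
```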

> Arrow Memory Leak
> -
>
> Key: ARROW-855
> URL: https://issues.apache.org/jira/browse/ARROW-855
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Memory, Java - Vectors
>Affects Versions: 0.1.0
> Environment: CentOS release 6.7 + IntelliJ IDEA
>Reporter: xufusheng
>Priority: Critical
>  Labels: test
>
> We create an in-memory table with Arrow; the source data comes from HBase.
> Creating a memory table and then dropping it leaks memory.
> After repeating this hundreds of times, an OutOfMemoryError is thrown.
> Has anyone encountered similar problems?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (ARROW-997) [Java] Implement transfer in FixedSizeListVector

2017-05-10 Thread Emilio Lahr-Vivaz (JIRA)
Emilio Lahr-Vivaz created ARROW-997:
---

 Summary: [Java] Implement transfer in FixedSizeListVector
 Key: ARROW-997
 URL: https://issues.apache.org/jira/browse/ARROW-997
 Project: Apache Arrow
  Issue Type: Task
Affects Versions: 0.3.0
Reporter: Emilio Lahr-Vivaz
Assignee: Emilio Lahr-Vivaz








[jira] [Created] (ARROW-999) [Java] Minor types don't account for nullable FieldType flag

2017-05-10 Thread Emilio Lahr-Vivaz (JIRA)
Emilio Lahr-Vivaz created ARROW-999:
---

 Summary: [Java] Minor types don't account for nullable FieldType 
flag
 Key: ARROW-999
 URL: https://issues.apache.org/jira/browse/ARROW-999
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Emilio Lahr-Vivaz








[jira] [Updated] (ARROW-999) [Java] Minor types don't account for nullable FieldType flag

2017-05-10 Thread Emilio Lahr-Vivaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emilio Lahr-Vivaz updated ARROW-999:

Description: Calling e.g. `FLOAT4.getNewVector("foo", new FieldType(false, 
...), ...)` returns a NullableFloat4Vector instead of a Float4Vector.

> [Java] Minor types don't account for nullable FieldType flag
> 
>
> Key: ARROW-999
> URL: https://issues.apache.org/jira/browse/ARROW-999
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>
> Calling e.g. `FLOAT4.getNewVector("foo", new FieldType(false, ...), ...)` 
> returns a NullableFloat4Vector instead of a Float4Vector.





[jira] [Assigned] (ARROW-999) [Java] Minor types don't account for nullable FieldType flag

2017-05-10 Thread Emilio Lahr-Vivaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emilio Lahr-Vivaz reassigned ARROW-999:
---

Assignee: Emilio Lahr-Vivaz



[jira] [Commented] (ARROW-999) [Java] Minor types don't account for nullable FieldType flag

2017-05-10 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004773#comment-16004773
 ] 

Emilio Lahr-Vivaz commented on ARROW-999:
-

So it seems from the layout 
(https://github.com/apache/arrow/blob/master/format/Layout.md) that all vectors 
have a null bitarray, although it doesn't have to be populated if there are no 
null values. So is the intention that the Field nullable flag should just 
control the creation of the null bitarray?

> [Java] Minor types don't account for nullable FieldType flag
> 
>
> Key: ARROW-999
> URL: https://issues.apache.org/jira/browse/ARROW-999
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>Assignee: Emilio Lahr-Vivaz
>
> Calling e.g. `FLOAT4.getNewVector("foo", new FieldType(false, ...), ...)" 
> returns a NullableFloat4Vector instead of a Float4Vector.





[jira] [Updated] (ARROW-999) [Java] Minor types don't account for nullable FieldType flag

2017-05-10 Thread Emilio Lahr-Vivaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emilio Lahr-Vivaz updated ARROW-999:

Description: 
Calling e.g. `FLOAT4.getNewVector("foo", new FieldType(false, ...), ...)` 
returns a NullableFloat4Vector instead of a Float4Vector.
edit: Float4Vector doesn't implement FieldVector, so can't currently be a 
top-level vector. I'm confused as to what the nullable flag is supposed to 
represent then.

  was:Calling e.g. `FLOAT4.getNewVector("foo", new FieldType(false, ...), ...)` 
returns a NullableFloat4Vector instead of a Float4Vector.


> [Java] Minor types don't account for nullable FieldType flag
> 
>
> Key: ARROW-999
> URL: https://issues.apache.org/jira/browse/ARROW-999
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>Assignee: Emilio Lahr-Vivaz
>
> Calling e.g. `FLOAT4.getNewVector("foo", new FieldType(false, ...), ...)` 
> returns a NullableFloat4Vector instead of a Float4Vector.
> edit: Float4Vector doesn't implement FieldVector, so can't currently be a 
> top-level vector. I'm confused as to what the nullable flag is supposed to 
> represent then.





[jira] [Assigned] (ARROW-999) [Java] Minor types don't account for nullable FieldType flag

2017-05-10 Thread Emilio Lahr-Vivaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emilio Lahr-Vivaz reassigned ARROW-999:
---

Assignee: (was: Emilio Lahr-Vivaz)

> [Java] Minor types don't account for nullable FieldType flag
> 
>
> Key: ARROW-999
> URL: https://issues.apache.org/jira/browse/ARROW-999
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>
> Calling e.g. `FLOAT4.getNewVector("foo", new FieldType(false, ...), ...)` 
> returns a NullableFloat4Vector instead of a Float4Vector.
> edit: Float4Vector doesn't implement FieldVector, so can't currently be a 
> top-level vector. I'm confused as to what the nullable flag is supposed to 
> represent then.





[jira] [Updated] (ARROW-999) [Java] Minor types don't account for nullable FieldType flag

2017-05-10 Thread Emilio Lahr-Vivaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emilio Lahr-Vivaz updated ARROW-999:

Priority: Minor  (was: Major)

> [Java] Minor types don't account for nullable FieldType flag
> 
>
> Key: ARROW-999
> URL: https://issues.apache.org/jira/browse/ARROW-999
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>Priority: Minor
>
> Calling e.g. `FLOAT4.getNewVector("foo", new FieldType(false, ...), ...)` 
> returns a NullableFloat4Vector instead of a Float4Vector.
> edit: Float4Vector doesn't implement FieldVector, so can't currently be a 
> top-level vector. I'm confused as to what the nullable flag is supposed to 
> represent then.





[jira] [Updated] (ARROW-999) [Java] Minor types don't account for nullable FieldType flag

2017-05-10 Thread Emilio Lahr-Vivaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emilio Lahr-Vivaz updated ARROW-999:

Issue Type: Improvement  (was: Bug)

> [Java] Minor types don't account for nullable FieldType flag
> 
>
> Key: ARROW-999
> URL: https://issues.apache.org/jira/browse/ARROW-999
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>
> Calling e.g. `FLOAT4.getNewVector("foo", new FieldType(false, ...), ...)` 
> returns a NullableFloat4Vector instead of a Float4Vector.
> edit: Float4Vector doesn't implement FieldVector, so can't currently be a 
> top-level vector. I'm confused as to what the nullable flag is supposed to 
> represent then.





[jira] [Commented] (ARROW-999) [Java] Minor types don't account for nullable FieldType flag

2017-05-10 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005169#comment-16005169
 ] 

Emilio Lahr-Vivaz commented on ARROW-999:
-

Thanks, maybe I should update this issue to be 'implement non-nullable 
FieldVectors'? Is that something that would be desirable to implement, or would 
it introduce too much complexity? It seems like it would provide a performance 
improvement in the non-nullable case.

> [Java] Minor types don't account for nullable FieldType flag
> 
>
> Key: ARROW-999
> URL: https://issues.apache.org/jira/browse/ARROW-999
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>Priority: Minor
>
> Calling e.g. `FLOAT4.getNewVector("foo", new FieldType(false, ...), ...)` 
> returns a NullableFloat4Vector instead of a Float4Vector.
> edit: Float4Vector doesn't implement FieldVector, so can't currently be a 
> top-level vector. I'm confused as to what the nullable flag is supposed to 
> represent then.





[jira] [Created] (ARROW-1015) [Java] Implement schema-level metadata

2017-05-12 Thread Emilio Lahr-Vivaz (JIRA)
Emilio Lahr-Vivaz created ARROW-1015:


 Summary: [Java] Implement schema-level metadata
 Key: ARROW-1015
 URL: https://issues.apache.org/jira/browse/ARROW-1015
 Project: Apache Arrow
  Issue Type: Task
Reporter: Emilio Lahr-Vivaz
Assignee: Emilio Lahr-Vivaz
 Fix For: 0.4.0


Schema already defines metadata in the arrow format - implement in Java.





[jira] [Commented] (ARROW-994) Arrow is unable to reuse Memory chunk in the Pool.

2017-05-15 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010739#comment-16010739
 ] 

Emilio Lahr-Vivaz commented on ARROW-994:
-

[~tokendeng] did you determine that this wasn't an actual issue?

> Arrow is unable to reuse Memory chunk in the Pool.
> --
>
> Key: ARROW-994
> URL: https://issues.apache.org/jira/browse/ARROW-994
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Memory, Java - Vectors
>Affects Versions: 0.3.0
> Environment: Arrow 0.3.0
> java version "1.7.0_79"
> Windows 7 64bits
> IntelliJ IDEA 14.0 ( run the test)
>Reporter: Deng Changchun
>Priority: Critical
>  Labels: Leak, Memory
> Attachments: MemoryLeakTest.java
>
>
> See the attachment (MemoryLeakTest.java); it just tests whether Arrow can 
> reuse a memory chunk in the pool or not. However, it doesn't always do so. 
> So, I think there is a serious glitch in Arrow 0.3.0 (also in 0.1.0; related 
> bug: https://issues.apache.org/jira/browse/ARROW-855).





[jira] [Commented] (ARROW-1045) [JAVA] Add support for custom metadata in org.apache.arrow.vector.types.pojo.*

2017-05-17 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014551#comment-16014551
 ] 

Emilio Lahr-Vivaz commented on ARROW-1045:
--

This might be fixed already by the two linked issues.

> [JAVA] Add support for custom metadata in org.apache.arrow.vector.types.pojo.*
> --
>
> Key: ARROW-1045
> URL: https://issues.apache.org/jira/browse/ARROW-1045
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jacques Nadeau
>
> Custom metadata for Arrow Schema and Arrow Fields is lost if a user 
> translates to/from the Java implementation's pojo helper objects. Conversion 
> to/from the Flatbuf schema should be lossless.





[jira] [Commented] (ARROW-886) VariableLengthVectors don't reAlloc offsets

2017-06-09 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1604#comment-1604
 ] 

Emilio Lahr-Vivaz commented on ARROW-886:
-

I forget exactly what scenario led to this being an issue, but I think using 
the writer framework, possibly with FixedSizeListVectors (which aren't 
accessible through writers), caused the vector to run out of space. Possibly, 
instead of the changes that were made to reAlloc, we need to make sure 
setSafe() is being called appropriately, and add it where needed.

> VariableLengthVectors don't reAlloc offsets
> ---
>
> Key: ARROW-886
> URL: https://issues.apache.org/jira/browse/ARROW-886
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>Assignee: Emilio Lahr-Vivaz
> Fix For: 0.3.0
>
>






[jira] [Commented] (ARROW-886) VariableLengthVectors don't reAlloc offsets

2017-06-12 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046549#comment-16046549
 ] 

Emilio Lahr-Vivaz commented on ARROW-886:
-

We are calling it explicitly. In our use case we have a NullableMapVector with 
a bunch of child vectors of varying types. Currently we just check the size of 
the top-level vector and call re-alloc if we need to grow. For variable length 
child vectors, we call setSafe - however, for fixed size vectors it would be 
nice to be able to just do the check once on the MapVector, then expand as 
needed, instead of using setSafe and checking on each vector. We're trying to 
minimize reads and checks during write.

> VariableLengthVectors don't reAlloc offsets
> ---
>
> Key: ARROW-886
> URL: https://issues.apache.org/jira/browse/ARROW-886
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>Assignee: Emilio Lahr-Vivaz
> Fix For: 0.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-886) VariableLengthVectors don't reAlloc offsets

2017-06-12 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046550#comment-16046550
 ] 

Emilio Lahr-Vivaz commented on ARROW-886:
-

Is the only issue with variable length buffers? Possibly in those classes we 
could just not reAlloc the data buffer, and leverage setSafe for that.

> VariableLengthVectors don't reAlloc offsets
> ---
>
> Key: ARROW-886
> URL: https://issues.apache.org/jira/browse/ARROW-886
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>Assignee: Emilio Lahr-Vivaz
> Fix For: 0.3.0
>
>






[jira] [Commented] (ARROW-1175) [Java] Implement/test dictionary-encoded subfields

2017-07-03 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072438#comment-16072438
 ] 

Emilio Lahr-Vivaz commented on ARROW-1175:
--

Note: I don't think it's directly supported by the Writer API, so you have to 
do it slightly manually. Here is our code for writing dictionary-encoded 
children in a map vector (FYI, there are some irrelevant bits that involve our 
integration with geotools, and it's in Scala):

https://github.com/locationtech/geomesa/blob/master/geomesa-arrow/geomesa-arrow-gt/src/main/scala/org/locationtech/geomesa/arrow/vector/ArrowAttributeWriter.scala#L189-L191
https://github.com/locationtech/geomesa/blob/master/geomesa-arrow/geomesa-arrow-gt/src/main/scala/org/locationtech/geomesa/arrow/vector/ArrowAttributeWriter.scala#L224-L225

https://github.com/locationtech/geomesa/blob/master/geomesa-arrow/geomesa-arrow-gt/src/main/scala/org/locationtech/geomesa/arrow/vector/ArrowAttributeReader.scala#L148
https://github.com/locationtech/geomesa/blob/master/geomesa-arrow/geomesa-arrow-gt/src/main/scala/org/locationtech/geomesa/arrow/vector/ArrowAttributeReader.scala#L164-L167

> [Java] Implement/test dictionary-encoded subfields
> --
>
> Key: ARROW-1175
> URL: https://issues.apache.org/jira/browse/ARROW-1175
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Wes McKinney
>
> We do not have any tests about types like:
> {code}
> List
> {code}
> cc [~julienledem] [~elahrvivaz]





[jira] [Commented] (ARROW-1175) [Java] Implement/test dictionary-encoded subfields

2017-07-03 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072494#comment-16072494
 ] 

Emilio Lahr-Vivaz commented on ARROW-1175:
--

Some more details - the important bit is calling 'vector.addOrGet' with the 
correct dictionary encoded field. This sets up the metadata correctly. The 
child vector is of the dictionary encoded type (e.g. Int), and you have to 
manually encode dictionary values before writing them. On read, you have to 
examine the schema so that you know to manually decode the Int values 
appropriately.
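
As a rough illustration of the "manually encode / manually decode" steps, here 
is a minimal plain-Java sketch. It does not use any Arrow classes: 
ManualDictionary and its methods are hypothetical names, and in real code the 
int returned by encode() is what you would write into the Int child vector, 
while decode() is what the reader would apply after seeing in the schema that 
the field is dictionary-encoded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names are Arrow APIs.
final class ManualDictionary {
    private final Map<String, Integer> indexByValue = new HashMap<>();
    private final List<String> valueByIndex = new ArrayList<>();

    /** Write side: map a value to its int index, adding it to the dictionary if new. */
    int encode(String value) {
        Integer existing = indexByValue.get(value);
        if (existing != null) {
            return existing;
        }
        int index = valueByIndex.size();
        indexByValue.put(value, index);
        valueByIndex.add(value);
        return index;
    }

    /** Read side: turn a stored int index back into the original value. */
    String decode(int index) {
        return valueByIndex.get(index);
    }
}
```

The dictionary values themselves still have to be shipped alongside the 
encoded ints so the reader can rebuild the index-to-value list.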

> [Java] Implement/test dictionary-encoded subfields
> --
>
> Key: ARROW-1175
> URL: https://issues.apache.org/jira/browse/ARROW-1175
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Wes McKinney
>
> We do not have any tests about types like:
> {code}
> List
> {code}
> cc [~julienledem] [~elahrvivaz]





[jira] [Commented] (ARROW-886) VariableLengthVectors don't reAlloc offsets

2017-08-02 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110778#comment-16110778
 ] 

Emilio Lahr-Vivaz commented on ARROW-886:
-

That would probably be fine. My use case is:

1. Create a parent vector (NullableMapVector) with several children of 
different types (some fixed width, some variable width)
2. Write an unknown number of elements to the vectors, using the child vector 
mutators

For the variable-width vectors, I'm using `mutator.setSafe`, as there is no 
other way to guarantee sufficient size (to my knowledge). But for the 
fixed-width vectors, I'd like to check size once on the parent, then expand them all 
as needed, instead of having to use `.setSafe` for every write. Having a way to 
externally grow a vector (outside of setSafe) seems like it would be useful any 
time you don't know the number of elements you're writing up front.
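
To sketch what I mean by "check once on the parent", here is a small 
plain-Java model. It is not Arrow code: ParentOfFixedWidthColumns is a 
hypothetical stand-in for a parent vector with two fixed-width children, and 
plain arrays stand in for the child buffers. The point is that there is one 
capacity check per row, after which the child writes are unchecked.

```java
import java.util.Arrays;

// Plain-Java model; ParentOfFixedWidthColumns is hypothetical, not an Arrow class.
final class ParentOfFixedWidthColumns {
    private int capacity;
    private int[] ints;
    private double[] doubles;

    ParentOfFixedWidthColumns(int initialCapacity) {
        capacity = initialCapacity;
        ints = new int[initialCapacity];
        doubles = new double[initialCapacity];
    }

    /** One capacity check per row; grows every fixed-width child together. */
    void ensureCapacity(int row) {
        if (row < capacity) {
            return;
        }
        int newCapacity = Math.max(capacity, 1);
        while (newCapacity <= row) {
            newCapacity *= 2;
        }
        ints = Arrays.copyOf(ints, newCapacity);
        doubles = Arrays.copyOf(doubles, newCapacity);
        capacity = newCapacity;
    }

    /** Unchecked writes: only safe after ensureCapacity(row) has been called. */
    void writeRow(int row, int i, double d) {
        ints[row] = i;
        doubles[row] = d;
    }

    int capacity() { return capacity; }
    int intAt(int row) { return ints[row]; }
    double doubleAt(int row) { return doubles[row]; }
}
```

A writer loop would call ensureCapacity(row) and then writeRow(row, ...) for 
each element, so the per-child bounds checks that setSafe implies are avoided.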

> VariableLengthVectors don't reAlloc offsets
> ---
>
> Key: ARROW-886
> URL: https://issues.apache.org/jira/browse/ARROW-886
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>Assignee: Emilio Lahr-Vivaz
> Fix For: 0.3.0
>
>






[jira] [Created] (ARROW-1340) [Java] NullableMapVector field doesn't maintain metadata

2017-08-08 Thread Emilio Lahr-Vivaz (JIRA)
Emilio Lahr-Vivaz created ARROW-1340:


 Summary: [Java] NullableMapVector field doesn't maintain metadata
 Key: ARROW-1340
 URL: https://issues.apache.org/jira/browse/ARROW-1340
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Emilio Lahr-Vivaz
Assignee: Emilio Lahr-Vivaz








[jira] [Commented] (ARROW-1407) Dictionaries can only hold a maximum of 4096 indices

2017-08-23 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138958#comment-16138958
 ] 

Emilio Lahr-Vivaz commented on ARROW-1407:
--

FYI, you don't have to use the DictionaryEncoder class. If you don't mind 
mapping your dictionary values yourself, you can do something like:

{code:java}
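// i is the row index; j is the dictionary index you mapped the value to yourself
NullableIntVector vector = (NullableIntVector) new FieldType(true, 
MinorType.INT.getType(), dictionaryEncoding).createNewSingleVector(name, 
allocator, callBack);
vector.getMutator().setSafe(i, j);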
{code}
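
For what it's worth, the difference between an unchecked set and setSafe can 
be modelled in a few lines of plain Java. This is only a sketch, not Arrow's 
implementation: IndexBuffer is a hypothetical class, and the initial capacity 
of 4096 used in the note below just mirrors the figure in the report, treated 
here as an element count for simplicity.

```java
import java.util.Arrays;

// Plain-Java model of a growable index buffer; not Arrow's NullableIntVector.
final class IndexBuffer {
    private int[] data;

    IndexBuffer(int initialCapacity) {
        data = new int[initialCapacity];
    }

    /** Unchecked write: throws ArrayIndexOutOfBoundsException past the current capacity. */
    void set(int index, int value) {
        data[index] = value;
    }

    /** Checked write: doubles the buffer until the index fits, then writes. */
    void setSafe(int index, int value) {
        int capacity = Math.max(data.length, 1);
        while (capacity <= index) {
            capacity *= 2;
        }
        if (capacity != data.length) {
            data = Arrays.copyOf(data, capacity);
        }
        data[index] = value;
    }

    int get(int index) { return data[index]; }
    int capacity() { return data.length; }
}
```

With new IndexBuffer(4096), set(4096, ...) fails while setSafe(4096, ...) 
grows the buffer first, which is the behaviour the snippet above relies on.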

> Dictionaries can only hold a maximum of 4096 indices
> 
>
> Key: ARROW-1407
> URL: https://issues.apache.org/jira/browse/ARROW-1407
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Affects Versions: 0.6.0
>Reporter: Shayan Monshizadeh
>Priority: Minor
> Attachments: Screen Shot 2017-08-22 at 7.14.07 PM.png
>
>
> Dictionaries seem to only be able to hold 4096 indices, meaning only vectors 
> with 4096 values or fewer can be turned into dictionaries. The image attached 
> is a stack trace of what happens when trying to encode a dictionary with a 
> vector containing 4097 strings, and a dictionary containing two distinct 
> values. 
> Basically the error can be traced to line 95 of DictionaryEncoder.java 
> (`setter.invoke(mutator, i, encoded);`). It seems that the indices array 
> which holds the encoded values is allocated on line 84 as 
> `indices.allocateNew()` and it seems that `allocateNew()` only allocates 4096 
> bytes of data initially. The code runs if there are 4096 rows of data or 
> fewer. Any more, and the same error is given.





[jira] [Commented] (ARROW-1463) [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated code

2017-09-06 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155303#comment-16155303
 ] 

Emilio Lahr-Vivaz commented on ARROW-1463:
--

+1 agree with everything said so far. I'd prefer not to have any code 
generation at all, unless it provides some performance gains.

> [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated 
> code
> 
>
> Key: ARROW-1463
> URL: https://issues.apache.org/jira/browse/ARROW-1463
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jacques Nadeau
>Assignee: SIDDHARTH TEOTIA
>
> The templates used in the java package are very high maintenance and the if 
> conditions are hard to track. As started in the discussion here: 
> https://github.com/apache/arrow/pull/1012, I'd like to propose that we modify 
> the structure of the internal value vectors and code generation dynamics.
> Create new abstract base vectors:
> BaseFixedVector
> BaseVariableVector
> BaseNullableVector
> For each of these, implement all the basic functionality of a vector without 
> using templating.
> Evaluate whether to use code generation to generate specific specializations 
> of this functionality for each type where needed for performance purposes 
> (probably constrained to mutator and accessor set/get methods). Giant and 
> complex if conditions in the templates are actually worse from my perspective 
> than a small amount of hand written duplicated code since templates are much 
> harder to work with. 


