[ https://issues.apache.org/jira/browse/ARROW-14549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498249#comment-17498249 ]

Bryan Cutler commented on ARROW-14549:
--------------------------------------

[~hu6360567] Calling `allocateNew()` will create new buffers, which is one way 
to clear previous results. If you don't want to allocate any new memory, you 
would need to zero out all the vectors by calling `zeroVector()` and 
`setValueCount(0)`.
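
If it helps, here is a minimal sketch of that non-allocating reset, assuming the columns are fixed- or variable-width vectors (the types that expose `zeroVector()`); the `RootCleaner` helper below is hypothetical, not part of the Arrow API:

{code:java}
import org.apache.arrow.vector.BaseFixedWidthVector;
import org.apache.arrow.vector.BaseVariableWidthVector;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;

// Hypothetical helper: clear a reused VectorSchemaRoot in place, without allocating new buffers.
final class RootCleaner {
  static void clearRoot(VectorSchemaRoot root) {
    for (FieldVector v : root.getFieldVectors()) {
      if (v instanceof BaseFixedWidthVector) {
        ((BaseFixedWidthVector) v).zeroVector();    // zero the data and validity buffers
      } else if (v instanceof BaseVariableWidthVector) {
        ((BaseVariableWidthVector) v).zeroVector(); // zero the offset and validity buffers
      } else {
        v.reset();                                  // fall back for other vector types
      }
      v.setValueCount(0);                           // mark the vector as empty
    }
    root.setRowCount(0);                            // reset the root's row count as well
  }
}
{code}

In the reporter's loop below, a call like `RootCleaner.clearRoot(root)` after each `handleBatchResult(root)` could stand in for `root.allocateNew()`.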

> VectorSchemaRoot is not refreshed when value is null
> ----------------------------------------------------
>
>                 Key: ARROW-14549
>                 URL: https://issues.apache.org/jira/browse/ARROW-14549
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 6.0.0
>            Reporter: Wenbo Hu
>            Priority: Major
>
> I'm using `arrow-jdbc` to convert query results from JDBC to Arrow.
>  But with the following code, unexpected behavior happens.
> Assume a SQLite database where, in the 2nd row, col_2 and col_3 are NULL:
> ||col_1||col_2||col_3||
> |1|abc|3.14|
> |2|NULL|NULL|
> As the documentation suggests,
> {quote}populated data over and over into the same VectorSchemaRoot in a 
> stream of batches rather than creating a new VectorSchemaRoot instance each 
> time.
> {quote}
> *JdbcToArrowConfig* is set to reuse the root.
> {code:java}
> public void querySql(String query, QueryOption option) throws Exception {
>   try (final java.sql.Connection conn = connectContainer.getConnection();
>        final Statement stmt = conn.createStatement();
>        final ResultSet rs = stmt.executeQuery(query)) {
>     // create config with reused schema root and custom batch size from option
>     final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
>         .setAllocator(new RootAllocator())
>         .setCalendar(JdbcToArrowUtils.getUtcCalendar())
>         .setTargetBatchSize(option.getBatchSize())
>         .setReuseVectorSchemaRoot(true)
>         .build();
>     final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
>     while (iterator.hasNext()) { // retrieve result from iterator
>       final VectorSchemaRoot root = iterator.next();
>       option.getCallback().handleBatchResult(root);
>       root.allocateNew(); // it has to allocate new buffers
>     }
>   } catch (java.lang.Exception e) {
>     throw new Exception(e.getMessage());
>   }
> }
>
> ......
> // batch_size is set to 1, so the callback is called twice.
> QueryOption options = new QueryOption(1,
>     root -> {
>       // if the printer is not set, get the schema and write the header
>       if (printer == null) {
>         final String[] headers = root.getSchema().getFields().stream()
>             .map(Field::getName).toArray(String[]::new);
>         printer = new CSVPrinter(writer,
>             CSVFormat.Builder.create(CSVFormat.DEFAULT).setHeader(headers).build());
>       }
>
>       final int rows = root.getRowCount();
>       final List<FieldVector> fieldVectors = root.getFieldVectors();
>
>       // iterate over rows
>       for (int i = 0; i < rows; i++) {
>         final int rowId = i;
>         final List<String> row = fieldVectors.stream()
>             .map(v -> v.getObject(rowId))
>             .map(String::valueOf)
>             .collect(Collectors.toList());
>         printer.printRecord(row);
>       }
>     });
>
> connection.querySql("SELECT * FROM test_db", options);
> ......
> {code}
> If `root.allocateNew()` is called, the CSV file is as expected:
>  ```
>  column_1,column_2,column_3
>  1,abc,3.14
>  2,null,null
>  ```
>  Otherwise, the null values in the 2nd row retain the values from the 1st row:
>  ```
>  column_1,column_2,column_3
>  1,abc,3.14
>  2,abc,3.14
>  ```
> **Question: Is it expected to call `allocateNew` every time the schema root is reused?**
> Without reusing the schema root, the following code works as expected.
> {code:java}
> public void querySql(String query, QueryOption option) throws Exception {
>   try (final java.sql.Connection conn = connectContainer.getConnection();
>        final Statement stmt = conn.createStatement();
>        final ResultSet rs = stmt.executeQuery(query)) {
>     // create config without reuse schema root and custom batch size from option
>     final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
>         .setAllocator(new RootAllocator())
>         .setCalendar(JdbcToArrowUtils.getUtcCalendar())
>         .setTargetBatchSize(option.getBatchSize())
>         .setReuseVectorSchemaRoot(false)
>         .build();
>
>     final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
>     while (iterator.hasNext()) {
>       // retrieve result from iterator
>       try (VectorSchemaRoot root = iterator.next()) {
>         option.getCallback().handleBatchResult(root);
>         root.allocateNew();
>       }
>     }
>   } catch (java.lang.Exception e) {
>     throw new Exception(e.getMessage());
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
