[GitHub] [arrow] hu6360567 opened a new issue #11589: VectorSchemaRoot is not refreshed when value is null

GitBox Tue, 02 Nov 2021 00:16:29 -0700


hu6360567 opened a new issue #11589:
URL: https://github.com/apache/arrow/issues/11589



   I'm using `arrow-jdbc` to convert a JDBC query result to Arrow,
   but the following code behaves unexpectedly.
   
   Assume a SQLite database in which `col_2` and `col_3` of the 2nd row are NULL:
   | col_1 | col_2  | col_3  |
   |-------|--------|--------|
   | 1     | abc    | 3.14   |
   | 2     | NULL   | NULL   |
   
   
   ```java
       public void querySql(String query, QueryOption option) throws Exception {
           try (final java.sql.Connection conn = connectContainer.getConnection();
                final Statement stmt = conn.createStatement();
                final ResultSet rs = stmt.executeQuery(query)
           ) {
               // create a config that reuses the schema root, with a custom batch size from option
               final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
                       .setAllocator(new RootAllocator())
                       .setCalendar(JdbcToArrowUtils.getUtcCalendar())
                       .setTargetBatchSize(option.getBatchSize())
                       .setReuseVectorSchemaRoot(true)
                       .build();
   
               final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
               while (iterator.hasNext()) {
                   // retrieve result from iterator
                   final VectorSchemaRoot root = iterator.next();
                   option.getCallback().handleBatchResult(root);
                   root.allocateNew();    // without this call, the next batch shows stale values
               }
   
           } catch (java.lang.Exception e) {
               // keep the original exception as the cause
               throw new Exception(e.getMessage(), e);
           }
       }
   
   ......
   // batch_size is set to 1, so the callback is called twice.
   QueryOption options = new QueryOption(1,
        root -> {
               // if printer is not set, get schema, write header
               if (printer == null) {
                   final String[] headers = root.getSchema().getFields().stream()
                           .map(Field::getName)
                           .toArray(String[]::new);
   
                   printer = new CSVPrinter(writer,
                           CSVFormat.Builder.create(CSVFormat.DEFAULT).setHeader(headers).build());
               }
   
               final int rows = root.getRowCount();
               final List<FieldVector> fieldVectors = root.getFieldVectors();
   
               // iterate over rows
               for (int i = 0; i < rows; i++) {
                   final int rowId = i;
                   final List<String> row = fieldVectors.stream()
                           .map(v -> v.getObject(rowId))
                           .map(String::valueOf)
                           .collect(Collectors.toList());
                   printer.printRecord(row);
               }
           });
   
      connection.querySql("SELECT * FROM test_db", options);
   ......
   ```
   
   If `root.allocateNew()` is called, the CSV file is as expected:
   ```
   column_1,column_2,column_3
   1,abc,3.14
   2,null,null
   ```
   Otherwise, the null values in the 2nd row keep the values from the 1st row:
   ```
   column_1,column_2,column_3
   1,abc,3.14
   2,abc,3.14
   ```
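
The symptom can be reproduced in a toy model, which may help when reasoning about it. The sketch below is not Arrow's implementation: `ReusableColumn`, `readBatches`, and the "writer skips nulls" behavior are all assumptions made for illustration. It shows how a reused buffer keeps the previous batch's value if a null write leaves the slot untouched and nothing resets the validity flags between batches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only; hypothetical names, NOT Arrow's actual implementation. */
public class StaleBatchDemo {

    /** A reusable column: one value slot and one validity flag per row. */
    static final class ReusableColumn {
        private final String[] values;
        private final boolean[] valid;

        ReusableColumn(int capacity) {
            values = new String[capacity];
            valid = new boolean[capacity];
        }

        /** Assumed writer behavior: a null input leaves the slot untouched. */
        void write(int row, String value) {
            if (value != null) {
                values[row] = value;
                valid[row] = true;
            }
        }

        /** Marks every slot as null again, as a reset between batches would. */
        void reset() {
            Arrays.fill(valid, false);
        }

        /** Returns the stored value, or null if the slot is not marked valid. */
        String read(int row) {
            return valid[row] ? values[row] : null;
        }
    }

    /** Writes each source value as its own batch of size 1 into the same column. */
    static List<String> readBatches(String[] source, boolean resetBetweenBatches) {
        ReusableColumn column = new ReusableColumn(1);
        List<String> seen = new ArrayList<>();
        for (String value : source) {
            if (resetBetweenBatches) {
                column.reset();
            }
            column.write(0, value);
            seen.add(String.valueOf(column.read(0)));
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] col2 = {"abc", null};
        System.out.println(readBatches(col2, false)); // prints [abc, abc]
        System.out.println(readBatches(col2, true));  // prints [abc, null]
    }
}
```

In this model, resetting before each batch plays the role that `root.allocateNew()` appears to play in the code above. Whether the real adapter behaves this way internally is not established here.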
   
   **Question: Should I call `allocateNew` every time? When should I close the ValueVector/VectorSchemaRoot?**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

