[GitHub] [orc] guiyanakuang commented on a change in pull request #952: ORC-1004: Make orc writer support the selected vector

GitBox Wed, 05 Jan 2022 04:16:48 -0800


guiyanakuang commented on a change in pull request #952:
URL: https://github.com/apache/orc/pull/952#discussion_r778773979




##########
File path: java/core/src/java/org/apache/orc/impl/WriterImpl.java
##########
@@ -683,20 +683,22 @@ public void addUserMetadata(String name, ByteBuffer value) {
 
   @Override
   public void addRowBatch(VectorizedRowBatch batch) throws IOException {
+    InternalVectorizedRowBatch internalBatch = InternalVectorizedRowBatch.encapsulation(batch);
+    int batchSize = internalBatch.size();
     try {
       // If this is the first set of rows in this stripe, tell the tree writers
       // to prepare the stripe.
-      if (batch.size != 0 && rowsInStripe == 0) {
+      if (batchSize != 0 && rowsInStripe == 0) {
         treeWriter.prepareStripe(stripes.size() + 1);
       }
       if (buildIndex) {
         // Batch the writes up to the rowIndexStride so that we can get the
         // right size indexes.
         int posn = 0;
-        while (posn < batch.size) {
-          int chunkSize = Math.min(batch.size - posn,
+        while (posn < batchSize) {
+          int chunkSize = Math.min(batchSize - posn,
               rowIndexStride - rowsInIndex);
-          treeWriter.writeRootBatch(batch, posn, chunkSize);
+          treeWriter.writeRootBatch(internalBatch, posn, chunkSize);

Review comment:
   Reusing the `selected` vector of a VectorizedRowBatch actually ensures consistent processing across types.
   Let's look at a complex case from the tests:
   ```java
     public void testWriteComplexTypeUseSelectedVector() throws IOException {
       TypeDescription schema =
           TypeDescription.fromString("struct<a:map<int,uniontype<int,string>>," +
               "b:array<struct<c:int>>>");
       ....
     }
   ```
   For example, suppose there are three rows and only the third is selected by the filter. Before reading the third map, we need to determine the offsets of its key and value fields, because the first two maps may each contain multiple key-value pairs. If the key and value are themselves complex types, this has to be done recursively.
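
To make the offset problem concrete, here is a minimal sketch. It assumes a simplified stand-in for a map column in which row `i` owns the child entries `[offsets[i], offsets[i] + lengths[i])`, which mirrors the `offsets`/`lengths` layout of `MapColumnVector`. The class `SelectedMapOffsets` and the method `childIndices` are hypothetical names for illustration and are not part of ORC or of this PR.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical, simplified model of a map column (not ORC code):
 * row i owns child entries [offsets[i], offsets[i] + lengths[i]).
 */
final class SelectedMapOffsets {

  /** Returns the key/value child indices that belong to the selected rows. */
  static List<Integer> childIndices(long[] offsets, long[] lengths,
                                    int[] selected, int size) {
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < size; i++) {
      int row = selected[i];                 // jump from position to row
      int start = (int) offsets[row];        // jump from row to first child
      int end = start + (int) lengths[row];
      for (int child = start; child < end; child++) {
        result.add(child);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Three maps with 2, 3 and 1 entries; only the third row passed the filter.
    long[] offsets = {0, 2, 5};
    long[] lengths = {2, 3, 1};
    int[] selected = {2};
    System.out.println(childIndices(offsets, lengths, selected, 1)); // prints [5]
  }
}
```

The third map's single entry sits at child index 5, not 2, because the two unselected maps still occupy child slots 0 through 4.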
   
   So handling the selection only at the entry point does not cover all cases; each type needs to handle it separately.
   
   Imagine that the recursion eventually reaches a base type, which doesn't know what the offset means: it may be relative to the row, or it may be a value that has already gone through multiple jumps.




