siddharthteotia commented on a change in pull request #5267: Re-implement ORCRecordReader
URL: https://github.com/apache/incubator-pinot/pull/5267#discussion_r410357273
 
 

 ##########
 File path: 
pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReader.java
 ##########
 @@ -95,25 +171,151 @@ public GenericRow next()
   @Override
   public GenericRow next(GenericRow reuse)
       throws IOException {
-    _recordReader.nextBatch(_reusableVectorizedRowBatch);
-    return _recordExtractor.extract(_reusableVectorizedRowBatch, reuse);
+    int numFields = _orcFields.size();
 
 Review comment:
   Hive's ORC reader has a zero-copy option to boost performance when reading from HDFS -- https://issues.apache.org/jira/browse/HIVE-6347
   
   Secondly, I wonder whether we can make this vectorized. Probably not, since we have to create and return a GenericRow for every next() call anyway. However, it is possible to read one column vector at a time from the VectorizedRowBatch and avoid the repeated dynamic dispatch in the inner loop.
   
   Say we do vectorized reads from each column vector of a single VectorizedRowBatch and then pick values from each cell to build a batch of GenericRows -- would performance be any different? In that case, the next() API semantics could be changed to be bulk-based.
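   To make the column-at-a-time idea concrete, here is a minimal self-contained Java sketch of the extraction pattern. The `LongColumn` class and `extractBatch` method are hypothetical stand-ins for ORC's LongColumnVector and a VectorizedRowBatch column array (not Pinot or ORC APIs): the type dispatch happens once per column rather than once per cell, and each column's values are written into a pre-allocated batch of row maps (stand-ins for GenericRow).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnAtATimeSketch {
  // Hypothetical stand-in for ORC's LongColumnVector: values plus a null mask.
  static final class LongColumn {
    final long[] vector;
    final boolean[] isNull;

    LongColumn(long[] vector, boolean[] isNull) {
      this.vector = vector;
      this.isNull = isNull;
    }
  }

  // Column-at-a-time extraction: resolve the column type once per column,
  // then run a tight inner loop over all rows of the batch before moving
  // on to the next column.
  static List<Map<String, Object>> extractBatch(String[] fieldNames, LongColumn[] cols, int numRows) {
    List<Map<String, Object>> rows = new ArrayList<>(numRows);
    for (int r = 0; r < numRows; r++) {
      rows.add(new HashMap<>());
    }
    for (int c = 0; c < fieldNames.length; c++) {
      LongColumn col = cols[c];            // single dispatch point per column
      String name = fieldNames[c];
      for (int r = 0; r < numRows; r++) {  // monomorphic inner loop, no per-cell dispatch
        rows.get(r).put(name, col.isNull[r] ? null : (Object) col.vector[r]);
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    LongColumn a = new LongColumn(new long[]{1, 2, 3}, new boolean[]{false, false, true});
    LongColumn b = new LongColumn(new long[]{10, 20, 30}, new boolean[]{false, true, false});
    List<Map<String, Object>> rows = extractBatch(new String[]{"a", "b"}, new LongColumn[]{a, b}, 3);
    System.out.println(rows);
  }
}
```

   A bulk-based next() could then return (or fill) such a batch of rows per VectorizedRowBatch instead of one GenericRow per call; whether the saved dispatch outweighs the extra batch allocation would need benchmarking.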

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
