siddharthteotia commented on a change in pull request #5267: Re-implement
ORCRecordReader
URL: https://github.com/apache/incubator-pinot/pull/5267#discussion_r410357273
##########
File path:
pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReader.java
##########
@@ -95,25 +171,151 @@ public GenericRow next()
@Override
public GenericRow next(GenericRow reuse)
throws IOException {
- _recordReader.nextBatch(_reusableVectorizedRowBatch);
- return _recordExtractor.extract(_reusableVectorizedRowBatch, reuse);
+ int numFields = _orcFields.size();
Review comment:
Hive's ORC reader has a zero-copy option to boost performance when reading
from HDFS -- https://issues.apache.org/jira/browse/HIVE-6347
Secondly, I wonder whether we can make this read vectorized. Probably not,
since we have to create and return a GenericRow for every next() call anyway.
However, it is possible to read a single column vector at a time from the
VectorizedRowBatch and avoid the repeated dynamic dispatch inside the loop.
Say we do vectorized reads from each column vector of a single
VectorizedRowBatch and then pick values from each cell to build a batch of
GenericRows -- would performance be any different? In that case, the next()
API semantics could be changed to be batch based.
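To illustrate the column-at-a-time idea: below is a minimal, self-contained sketch (plain Java arrays stand in for ORC's ColumnVector subclasses, and a Map stands in for Pinot's GenericRow, so it runs without the ORC/Pinot dependencies). The point is that the per-column type dispatch happens once per column per batch, followed by a tight inner loop over rows, instead of once per cell. The class and field names here are hypothetical, not from the PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnAtATimeSketch {
  // Stand-ins for one VectorizedRowBatch's column vectors
  // (e.g. LongColumnVector, BytesColumnVector in the real ORC API).
  static long[] idColumn = {1L, 2L, 3L};
  static String[] nameColumn = {"a", "b", "c"};

  // Extract a whole batch column-at-a-time: the column-type dispatch is
  // resolved once per column, then a tight primitive loop fills all rows.
  static List<Map<String, Object>> extractBatch() {
    int numRows = idColumn.length;
    List<Map<String, Object>> rows = new ArrayList<>(numRows);
    for (int i = 0; i < numRows; i++) {
      rows.add(new HashMap<>());
    }
    // Column "id": one dispatch decision, then a per-row copy loop.
    for (int i = 0; i < numRows; i++) {
      rows.get(i).put("id", idColumn[i]);
    }
    // Column "name": same pattern for the next column vector.
    for (int i = 0; i < numRows; i++) {
      rows.get(i).put("name", nameColumn[i]);
    }
    return rows;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> rows = extractBatch();
    System.out.println(rows.size() + " rows extracted");
  }
}
```

A batch-based next() along these lines would return the whole List per call rather than one row; whether the saved dispatch outweighs the cost of materializing a batch of GenericRows is exactly the open question in the comment.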