Parth Chandra created DRILL-5351:
------------------------------------
Summary: Excessive bounds checking in the Parquet reader
Key: DRILL-5351
URL: https://issues.apache.org/jira/browse/DRILL-5351
Project: Apache Drill
Issue Type: Improvement
Reporter: Parth Chandra
Profiling of the Parquet reader suggests that variable-length decoding is a
major bottleneck, making the reader CPU-bound rather than disk-bound.
A YourKit profile shows the following methods to be severe bottlenecks:
VarLenBinaryReader.determineSizeSerial(long)
NullableVarBinaryVector$Mutator.setSafe(int, int, int, int, DrillBuf)
DrillBuf.chk(int, int)
NullableVarBinaryVector$Mutator.fillEmpties()
The problem is that each of these methods does some form of bounds checking,
and eventually, of course, the actual write to the ByteBuf is also bounds
checked, so a single write ends up being checked several times.
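To make the layering concrete, here is a minimal sketch of how one logical
write can pass through several bounds checks. CountingBuf and
SketchVarLenVector are hypothetical stand-ins written for this illustration;
they are not Drill's DrillBuf or NullableVarBinaryVector, and the counter
exists only to make the explicit checks visible.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a buffer wrapper; not Drill's DrillBuf.
final class CountingBuf {
    long checks = 0;                 // explicit bounds checks performed so far
    private final ByteBuffer data;

    CountingBuf(int capacity) {
        data = ByteBuffer.allocate(capacity);
    }

    int capacity() {
        return data.capacity();
    }

    // Layer 2: buffer-level check, similar in spirit to DrillBuf.chk(int, int).
    private void chk(int index, int length) {
        checks++;
        if (index < 0 || length < 0 || index > data.capacity() - length) {
            throw new IndexOutOfBoundsException(
                "index=" + index + ", length=" + length);
        }
    }

    void setBytes(int index, byte[] src) {
        chk(index, src.length);
        for (int i = 0; i < src.length; i++) {
            // Layer 3: ByteBuffer.put(int, byte) validates the index again
            // internally; that check is not counted here.
            data.put(index + i, src[i]);
        }
    }
}

// Hypothetical stand-in for a variable-length vector; not a Drill class.
final class SketchVarLenVector {
    private final CountingBuf buf;
    private int writeOffset = 0;

    SketchVarLenVector(CountingBuf buf) {
        this.buf = buf;
    }

    // Layer 1: the vector checks remaining capacity before delegating.
    boolean setSafe(byte[] value) {
        buf.checks++;
        if (value.length > buf.capacity() - writeOffset) {
            return false;            // caller would have to reallocate
        }
        buf.setBytes(writeOffset, value);
        writeOffset += value.length;
        return true;
    }
}
```

In this sketch a single setSafe call performs two explicit checks before the
underlying ByteBuffer validates every byte written again, which is the kind of
redundancy described above.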
DrillBuf.chk can be disabled by a configuration setting. Disabling it does
improve the performance of TPCH queries. In addition, all regression, unit,
and TPCH-SF100 tests pass with the check disabled.
I would recommend we allow users to turn this check off for
performance-critical queries.
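As an illustration of such a switch, the sketch below gates a buffer-level
check behind a static final flag read once from a system property. The
property name sketch.unsafe_memory_access and the classes
SketchBoundsChecking and SketchBuf are assumptions made up for this example;
they are not Drill's actual setting or classes.

```java
// Hypothetical switch; the property name is invented for this sketch.
final class SketchBoundsChecking {
    static final String PROPERTY = "sketch.unsafe_memory_access";

    // Read once at class initialization. Because the field is static final,
    // the JIT can treat it as a constant and drop the guarded branch when
    // checking is disabled.
    static final boolean ENABLED = !Boolean.getBoolean(PROPERTY);

    private SketchBoundsChecking() {
    }
}

// Hypothetical buffer; not Drill's DrillBuf.
final class SketchBuf {
    private final byte[] data;

    SketchBuf(int capacity) {
        data = new byte[capacity];
    }

    // Explicit check that is skipped entirely when the switch is off.
    private void chk(int index, int length) {
        if (SketchBoundsChecking.ENABLED) {
            if (index < 0 || length < 0 || index > data.length - length) {
                throw new IndexOutOfBoundsException(
                    "index=" + index + ", length=" + length
                        + ", capacity=" + data.length);
            }
        }
    }

    void setByte(int index, byte value) {
        chk(index, 1);
        // A Java array still checks its own bounds here; a buffer backed by
        // direct memory would not, which is why turning the check off is a
        // deliberate trade of safety for speed.
        data[index] = value;
    }

    byte getByte(int index) {
        chk(index, 1);
        return data[index];
    }
}
```

With this sketch, starting the JVM with -Dsketch.unsafe_memory_access=true
would skip the explicit check, and leaving the property unset keeps it on.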
Removing the bounds checking at every level is going to be a fair amount of
work. In the meantime, it appears that a few simple changes to variable-length
vectors improve query performance by about 10% across the board.