pavibhai commented on a change in pull request #1072:
URL: https://github.com/apache/orc/pull/1072#discussion_r837606381
##########
File path: java/core/src/java/org/apache/orc/OrcConf.java
##########
@@ -194,6 +194,18 @@
ORC_MAX_DISK_RANGE_CHUNK_LIMIT("orc.max.disk.range.chunk.limit",
"hive.exec.orc.max.disk.range.chunk.limit",
Integer.MAX_VALUE - 1024, "When reading stripes >2GB, specify max limit for the chunk size."),
+ ORC_MIN_DISK_SEEK_SIZE("orc.min.disk.seek.size",
+ "hive.exec.orc.min.disk.seek.size",
+ 0,
+ "When determining contiguous reads, gaps within this size are "
+ + "read contiguously and not seeked. Default value of zero disables this "
+ + "optimization"),
+ ORC_MIN_DISK_SEEK_SIZE_TOLERANCE("orc.min.disk.seek.size.tolerance",
Review comment:
> Would this patch be different from that?
I can see the following differences:
* Here the decision to read extra bytes is based on the ORC read plan, as
opposed to a simple read-ahead
* We do not request extra bytes if the read plan does not intend to read
that region
* In our tests, 4MB seems to work well, as shown in the benchmark results
* These values can additionally be tuned for other file systems by
configuring them per file system
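To illustrate the first two points, here is a minimal sketch of plan-driven coalescing. It is not the code in this PR; the class `SeekGapSketch`, the `Range` type, and the `coalesce` method are hypothetical names, and the input is assumed to be sorted by offset. Ranges are merged only when the gap to the next planned range is within the minimum seek size, so nothing is read beyond the last planned range.

```java
import java.util.ArrayList;
import java.util.List;

public class SeekGapSketch {
  /** Hypothetical byte range [offset, end) that the read plan needs. */
  static final class Range {
    final long offset;
    final long end;

    Range(long offset, long end) {
      this.offset = offset;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + offset + ", " + end + ")";
    }
  }

  /**
   * Merge planned ranges whose gap is at most minSeekSize bytes.
   * With minSeekSize = 0, only touching or overlapping ranges merge.
   * Assumes {@code planned} is sorted by offset.
   */
  static List<Range> coalesce(List<Range> planned, long minSeekSize) {
    List<Range> result = new ArrayList<>();
    Range current = null;
    for (Range next : planned) {
      if (current != null && next.offset - current.end <= minSeekSize) {
        // Gap is small enough: read through it instead of seeking.
        current = new Range(current.offset, Math.max(current.end, next.end));
      } else {
        if (current != null) {
          result.add(current);
        }
        current = next;
      }
    }
    if (current != null) {
      result.add(current);
    }
    return result;
  }

  public static void main(String[] args) {
    List<Range> plan = List.of(
        new Range(0, 100), new Range(150, 200), new Range(10_000, 10_100));
    System.out.println(coalesce(plan, 0));   // three separate reads
    System.out.println(coalesce(plan, 64));  // first two ranges merged
  }
}
```

Unlike a fixed read-ahead, the extra bytes here are bounded by the gaps between ranges that the plan already intends to read.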
> There can be cases when this could be reading more than necessary and
throwing off the read bytes later. Would that cause perf penalties?
There is a trade-off between memory and CPU. Reading the extra bytes and
dropping the extra bytes are each configurable, so one or both can be turned
off as needed.
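As a rough sketch of the drop decision, assume the tolerance is interpreted as the allowed fraction of extra bytes relative to the bytes actually needed. That interpretation is an assumption made for illustration, and `ExtraBytesToleranceSketch` and `shouldDropExtraBytes` are hypothetical names, not part of the patch.

```java
public class ExtraBytesToleranceSketch {
  /**
   * Returns true when the extra bytes read exceed the tolerated fraction,
   * in which case spending CPU to drop them from memory may be worthwhile.
   * Assumption: tolerance is compared against (bytesRead - bytesNeeded) / bytesNeeded.
   */
  static boolean shouldDropExtraBytes(long bytesNeeded, long bytesRead, double tolerance) {
    if (bytesNeeded <= 0 || bytesRead < bytesNeeded) {
      throw new IllegalArgumentException("expected 0 < bytesNeeded <= bytesRead");
    }
    double extraFraction = (double) (bytesRead - bytesNeeded) / bytesNeeded;
    return extraFraction > tolerance;
  }

  public static void main(String[] args) {
    // 10% extra with a 20% tolerance: keep the buffer as read.
    System.out.println(shouldDropExtraBytes(1000, 1100, 0.2));
    // 50% extra with a 20% tolerance: drop the extra bytes.
    System.out.println(shouldDropExtraBytes(1000, 1500, 0.2));
  }
}
```

Under this reading, a higher tolerance keeps more extra bytes in memory and avoids the copy, while a lower one spends CPU to reclaim memory.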