[
https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787375#comment-17787375
]
ASF GitHub Bot commented on PARQUET-2171:
-----------------------------------------
steveloughran commented on PR #1139:
URL: https://github.com/apache/parquet-mr/pull/1139#issuecomment-1817012867
OK, I've tried to address the requested changes as well as merge with master.
The one thing I've yet to do is the one from @danielcweeks: have an interface
for which the hadoop vector IO would be just one implementation.
We effectively have that in SeekableInputStream, which gains two new default
methods: one a probe for the API's availability and the other the invocation.
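
To make the probe/invoke split concrete, here is a minimal, self-contained sketch of the pattern. It is not the code in the PR: `SketchStream`, `Range`, `readRanges` and the method names `readVectoredAvailable` / `readVectored` are all hypothetical stand-ins chosen for illustration, and the "vectored" implementation completes its futures synchronously to keep the example small.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class VectoredProbeSketch {

  /** Hypothetical range descriptor: what to read and where the result is delivered. */
  static final class Range {
    final int offset;
    final int length;
    final CompletableFuture<ByteBuffer> data = new CompletableFuture<>();

    Range(int offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /** Hypothetical stand-in for a seekable stream carrying the two default methods. */
  abstract static class SketchStream {
    /** Plain positioned read that every implementation supports. */
    abstract byte[] readFully(int offset, int length);

    /** Probe: true only when readVectored() is usable on this stream. */
    boolean readVectoredAvailable() {
      return false;
    }

    /** Invocation: unsupported unless an implementation overrides it. */
    void readVectored(List<Range> ranges) {
      throw new UnsupportedOperationException("vectored IO not available");
    }
  }

  /** Caller-side pattern: probe first, otherwise fall back to one read per range. */
  static void readRanges(SketchStream in, List<Range> ranges) {
    if (in.readVectoredAvailable()) {
      in.readVectored(ranges);
      return;
    }
    for (Range r : ranges) {
      r.data.complete(ByteBuffer.wrap(in.readFully(r.offset, r.length)));
    }
  }

  /** In-memory stream with no vectored support; it inherits both defaults. */
  static class ArrayStream extends SketchStream {
    final byte[] bytes;

    ArrayStream(byte[] bytes) {
      this.bytes = bytes;
    }

    @Override
    byte[] readFully(int offset, int length) {
      return Arrays.copyOfRange(bytes, offset, offset + length);
    }
  }

  /** Same data, but overrides both the probe and the invocation. */
  static final class VectoredArrayStream extends ArrayStream {
    VectoredArrayStream(byte[] bytes) {
      super(bytes);
    }

    @Override
    boolean readVectoredAvailable() {
      return true;
    }

    @Override
    void readVectored(List<Range> ranges) {
      // A real implementation would issue these reads asynchronously.
      for (Range r : ranges) {
        r.data.complete(ByteBuffer.wrap(readFully(r.offset, r.length)));
      }
    }
  }

  public static void main(String[] args) {
    byte[] file = {10, 11, 12, 13, 14, 15};
    for (SketchStream in : Arrays.asList(new ArrayStream(file), new VectoredArrayStream(file))) {
      List<Range> ranges = Arrays.asList(new Range(1, 2), new Range(4, 2));
      readRanges(in, ranges);
      System.out.println("vectored=" + in.readVectoredAvailable()
          + ", first byte of second range=" + ranges.get(1).data.join().get(0));
    }
  }
}
```

The point of the probe is that a caller such as another reader only needs the base stream type: it can ask whether the vectored path exists and otherwise keep its existing read loop, without depending on Hadoop classes directly.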
Would you be able to wire up the iceberg reader to that? And if not, what
changes are needed?
One thing we would need to make sure is sound is the awaitFuture code;
that's a copy of what's in hadoop to handle async IO operations. There's also a
hard-coded timeout of 300s to wait for the results; I don't know/recall where
that number came from, but it's potentially dubious as it won't recover from
network problems.
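
For readers unfamiliar with the pattern, the following is a rough sketch of what an await-with-timeout helper of this kind generally looks like. It is written from scratch for illustration and is not the code in the PR or in Hadoop: the class name `AwaitFutureSketch`, the exception-mapping choices, and the decision to cancel the future and raise an `IOException` on timeout are all assumptions.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class AwaitFutureSketch {

  /**
   * Block until the async read completes or the timeout expires,
   * converting concurrency exceptions into IOExceptions for the caller.
   */
  static <T> T awaitFuture(Future<T> future, long timeout, TimeUnit unit) throws IOException {
    try {
      return future.get(timeout, unit);
    } catch (InterruptedException e) {
      // Restore the interrupt flag so callers further up can see it.
      Thread.currentThread().interrupt();
      InterruptedIOException ioe = new InterruptedIOException("interrupted while waiting");
      ioe.initCause(e);
      throw ioe;
    } catch (ExecutionException e) {
      // Unwrap so the caller sees the original failure, not the wrapper.
      Throwable cause = e.getCause();
      if (cause instanceof IOException) {
        throw (IOException) cause;
      }
      if (cause instanceof UncheckedIOException) {
        throw ((UncheckedIOException) cause).getCause();
      }
      if (cause instanceof RuntimeException) {
        throw (RuntimeException) cause;
      }
      if (cause instanceof Error) {
        throw (Error) cause;
      }
      throw new IOException(cause);
    } catch (TimeoutException e) {
      // Assumption: give up on the read instead of leaving it running.
      future.cancel(true);
      throw new IOException("timed out after " + timeout + " " + unit, e);
    }
  }

  public static void main(String[] args) throws IOException {
    // An already-completed future returns immediately, whatever the timeout.
    String value = awaitFuture(CompletableFuture.completedFuture("data"), 300, TimeUnit.SECONDS);
    System.out.println(value);
  }
}
```

Taking the timeout as a parameter, as this sketch does, is one way to avoid baking a fixed 300s into the reader; whether the value should come from configuration is a separate design question.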
> Implement vectored IO in parquet file format
> --------------------------------------------
>
> Key: PARQUET-2171
> URL: https://issues.apache.org/jira/browse/PARQUET-2171
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Mukund Thakur
> Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop for improving
> read performance for seek heavy readers. Spark jobs and others which use
> parquet will greatly benefit from this api. Details can be found here:
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867
--
This message was sent by Atlassian Jira
(v8.20.10#820010)