[
https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770167#comment-17770167
]
ASF GitHub Bot commented on PARQUET-2171:
-----------------------------------------
parthchandra commented on PR #1139:
URL: https://github.com/apache/parquet-mr/pull/1139#issuecomment-1739701945
> @danielcweeks that's a good point about pluggability.
> I don't know if that would be useful for iceberg
https://github.com/apache/hadoop-api-shim
Iceberg can use the base Parquet file reader out of the box, so it should be
able to use vectored IO as is.
> getting iceberg to pass down which stripes it wants to read is critical
> for this to work best with s3, abfs and gcs. how are you reading the files at
> present?
However, if the S3FileIO feature is enabled, Iceberg provides its own
InputStream and InputFile implementations, which use AWS SDK V2. Maybe an option
to provide your own input stream to vectored IO might work.
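
A rough sketch of what such an option might look like, purely as an illustration. This is not parquet-mr, Hadoop, or Iceberg code: `Range`, `RangeReadable`, `InMemoryStream`, and `readAll` are hypothetical names, and the in-memory stream only stands in for a custom stream such as one backed by an object store client.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class PluggableVectorRead {

    /** Hypothetical: a byte range plus a future that completes with its data. */
    static final class Range {
        final long offset;
        final int length;
        final CompletableFuture<ByteBuffer> data = new CompletableFuture<>();

        Range(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Hypothetical hook a custom input stream could implement to offer vectored reads. */
    interface RangeReadable {
        void readRanges(List<Range> ranges);
    }

    /** Stand-in for a user-supplied stream; serves ranges from a byte array. */
    static final class InMemoryStream implements RangeReadable {
        private final byte[] content;

        InMemoryStream(byte[] content) {
            this.content = content;
        }

        @Override
        public void readRanges(List<Range> ranges) {
            for (Range r : ranges) {
                // Completed synchronously here; a real stream could fetch ranges in parallel.
                r.data.complete(ByteBuffer.wrap(content, (int) r.offset, r.length).slice());
            }
        }
    }

    /** Reader side: take the vectored path only if the supplied stream offers it. */
    static List<byte[]> readAll(Object stream, List<Range> ranges) {
        if (!(stream instanceof RangeReadable)) {
            throw new UnsupportedOperationException("stream has no vectored read support");
        }
        ((RangeReadable) stream).readRanges(ranges);
        List<byte[]> out = new ArrayList<>();
        for (Range r : ranges) {
            ByteBuffer buf = r.data.join(); // wait for this range to arrive
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);
            out.add(bytes);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] file = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
        List<Range> ranges = List.of(new Range(2, 3), new Range(10, 4));
        for (byte[] chunk : readAll(new InMemoryStream(file), ranges)) {
            System.out.println(new String(chunk, StandardCharsets.US_ASCII));
        }
    }
}
```

The point of the sketch is only that the reader asks the stream for a list of ranges and does not care how the stream fetches them, so an SDK-backed stream could plug in without going through a Hadoop FileSystem.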
> Implement vectored IO in parquet file format
> --------------------------------------------
>
> Key: PARQUET-2171
> URL: https://issues.apache.org/jira/browse/PARQUET-2171
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Mukund Thakur
> Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop for improving
> read performance for seek-heavy readers. Spark jobs and others that use
> Parquet will greatly benefit from this API. Details can be found here:
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867
--
This message was sent by Atlassian Jira
(v8.20.10#820010)