[
https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770195#comment-17770195
]
ASF GitHub Bot commented on PARQUET-2171:
-----------------------------------------
parthchandra commented on PR #1139:
URL: https://github.com/apache/parquet-mr/pull/1139#issuecomment-1739813955
@mukund-thakur @steveloughran this is a great PR! Here are some numbers from an
independent benchmark. I used Spark to parallelize the reading of all row groups
(just the reading of the raw data) from TPC-DS/SF10000/store_sales using
various APIs.
32 executors, 16 cores
`fs.s3a.threads.max` = 20
Reader | Avg Time (minutes) | Median | vs Baseline
> Implement vectored IO in parquet file format
> --------------------------------------------
>
> Key: PARQUET-2171
> URL: https://issues.apache.org/jira/browse/PARQUET-2171
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Mukund Thakur
> Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop to improve
> read performance for seek-heavy readers. Spark jobs and other applications
> that use Parquet will benefit greatly from this API. Details can be found here:
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867
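
For readers unfamiliar with the idea, the following is a minimal sketch of what a vectored read does conceptually: the caller hands over all the byte ranges it needs up front, nearby ranges are merged, and each merged span is fetched with a single seek and read. This is not the Hadoop or parquet-mr implementation. The names `coalesce`, `read_vectored`, and `max_gap` are hypothetical and chosen only for illustration, and the real Hadoop API can additionally issue the merged reads asynchronously, which this sketch does not attempt.

```python
import io
from typing import BinaryIO, Dict, List, Tuple

Range = Tuple[int, int]  # (offset, length), hypothetical representation


def coalesce(ranges: List[Range], max_gap: int) -> List[Tuple[int, int, List[Range]]]:
    """Merge ranges separated by at most max_gap bytes.

    Returns a list of (start, end, members), where members are the
    original ranges covered by the merged span [start, end).
    """
    merged: List[Tuple[int, int, List[Range]]] = []
    for off, length in sorted(ranges):
        end = off + length
        if merged and off - merged[-1][1] <= max_gap:
            start, prev_end, members = merged[-1]
            merged[-1] = (start, max(prev_end, end), members + [(off, length)])
        else:
            merged.append((off, end, [(off, length)]))
    return merged


def read_vectored(f: BinaryIO, ranges: List[Range], max_gap: int = 16) -> Dict[Range, bytes]:
    """Read every requested range, issuing one seek+read per merged span."""
    out: Dict[Range, bytes] = {}
    for start, end, members in coalesce(ranges, max_gap):
        f.seek(start)
        buf = f.read(end - start)
        # Slice each originally requested range out of the merged buffer.
        for off, length in members:
            out[(off, length)] = buf[off - start: off - start + length]
    return out


if __name__ == "__main__":
    data = bytes(range(256))
    wanted = [(100, 10), (0, 4), (8, 4), (200, 5)]
    result = read_vectored(io.BytesIO(data), wanted)
    assert all(result[(o, n)] == data[o:o + n] for o, n in wanted)
```

For a seek-heavy reader such as a Parquet column-chunk scan, the benefit comes from replacing many small seeks with fewer, larger reads; on an object store each avoided seek can mean one fewer remote request.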
--
This message was sent by Atlassian Jira
(v8.20.10#820010)