[ 
https://issues.apache.org/jira/browse/HADOOP-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved HADOOP-19863.
-------------------------------------
    Fix Version/s: 3.5.1
       Resolution: Fixed

> Incorrect Vectored IO metrics from Local Filesystem
> ---------------------------------------------------
>
>                 Key: HADOOP-19863
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19863
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 3.5.0
>            Reporter: Peter Toth
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.1
>
>         Attachments: Screenshot 2026-04-16 at 19.02.30.png, Screenshot 
> 2026-04-16 at 19.03.51.png
>
>
> As discussed in 
> [https://github.com/apache/parquet-java/issues/2703#issuecomment-4260121705] 
> we noticed that when vectored IO is enabled the {{BytesRead}} metrics of 
> Spark tasks are not correct.
> Spark fetches that metric via {{FileSystem.getAllStatistics}}; see
>  - 
> [https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L98-L109]
>  and
>  - 
> [https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L164-L170]
> Repro with latest Spark 4.2.0-SNAPSHOT using Hadoop 3.5.0:
> Vectored IO is enabled by default:
> {code:java}
> ➜  bin/spark-shell
> scala> spark.createDataFrame((0 until 5000).map(i => (i, 
> s"left_$i"))).repartition(1).write.parquet("/tmp/t2")
> scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
> scala> sql("SELECT * FROM t2").collect()
> {code}
> !Screenshot 2026-04-16 at 19.02.30.png|width=85%!
> Vectored IO is disabled explicitly:
> {code:java}
> ➜  bin/spark-shell --conf 
> spark.hadoop.parquet.hadoop.vectored.io.enabled=false
> scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
> scala> sql("SELECT * FROM t2").collect()
> {code}
> !Screenshot 2026-04-16 at 19.03.51.png|width=85%!
> In my case the generated test file size was ~45KB:
> {code:java}
> ➜  ls -ll /tmp/t2
> total 88
> -rw-r--r--@ 1 ptoth  wheel      0 Apr 16 18:57 _SUCCESS
> -rw-r--r--@ 1 ptoth  wheel  44944 Apr 16 18:57 
> part-00000-cf825cf6-2fa5-46a2-b897-dbb9dc9828a7-c000.snappy.parquet{code}
> I believe reading the parquet footers doesn't go through vectored IO, so the 
> decreased value of 1680B probably corresponds to those footer reads.
> There is no data pruning in the query, so the metric value should be around 
> the file size.
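> The counter can also be checked from the shell instead of the Spark UI. The 
> following is only a minimal sketch: it assumes spark-shell runs in local mode 
> (so tasks read in the same JVM as the REPL) and that the local filesystem 
> reports its statistics under the {{file}} scheme. {{localBytesRead}} is a 
> hypothetical helper, not part of Spark or Hadoop.
> {code:java}
> scala> import org.apache.hadoop.fs.FileSystem
> scala> import scala.jdk.CollectionConverters._
> scala> // Hypothetical helper: sum bytesRead across all threads for file://
> scala> def localBytesRead(): Long = FileSystem.getAllStatistics.asScala.
>      |   filter(_.getScheme == "file").map(_.getBytesRead).sum
> scala> val before = localBytesRead()
> scala> spark.read.parquet("/tmp/t2").collect()
> scala> // The delta also includes footer reads done while opening the file
> scala> println(s"bytes read: ${localBytesRead() - before}")
> {code}
> If the statistics are accurate, the printed delta should be close to the 
> size of the parquet file in both the vectored and non-vectored runs.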



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

