[jira] [Updated] (HADOOP-19863) Incorrect Vectored IO metrics

Peter Toth (Jira) Thu, 16 Apr 2026 10:17:13 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Peter Toth updated HADOOP-19863:
--------------------------------
    Description: 
As discussed in 
[https://github.com/apache/parquet-java/issues/2703#issuecomment-4260121705] we 
noticed that when vectored IO is enabled, the {{BytesRead}} metric of Spark 
tasks is not correct.

Spark fetches that metric via {{FileSystem.getAllStatistics}}; see
 - 
[https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L98-L109]
 and
 - 
[https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L164-L170]

Repro with latest Spark 4.2.0-SNAPSHOT using Hadoop 3.5.0:
Vectored IO is enabled by default:
{code:java}
➜  bin/spark-shell

scala> spark.createDataFrame((0 until 5000).map(i => (i, s"left_$i"))).repartition(1).write.parquet("/tmp/t2")
scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
scala> sql("SELECT * FROM t2").collect()
{code}
!Screenshot 2026-04-16 at 19.02.30.png|width=85%!

Vectored IO is disabled explicitly:
{code:java}
➜  bin/spark-shell --conf spark.hadoop.parquet.hadoop.vectored.io.enabled=false

scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
scala> sql("SELECT * FROM t2").collect()
{code}
!Screenshot 2026-04-16 at 19.03.51.png|width=85%!

In my case the size of the generated test file was ~45KB:
{code:java}
➜  ls -ll /tmp/t2
total 88
-rw-r--r--@ 1 ptoth  wheel      0 Apr 16 18:57 _SUCCESS
-rw-r--r--@ 1 ptoth  wheel  44944 Apr 16 18:57 part-00000-cf825cf6-2fa5-46a2-b897-dbb9dc9828a7-c000.snappy.parquet
{code}
I believe reading the Parquet footers doesn't go through vectored IO, so the 
decreased 1680B probably belongs to that.

There is no data pruning in the query, so the metric value should be around the 
file size.
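
As a rough cross-check of that expectation, here is a minimal sketch that sums the sizes of the data files in the output directory. It assumes the output is on the local filesystem, as in the repro above. {{ExpectedBytesRead}} and {{totalDataFileBytes}} are hypothetical names used only for illustration; they are not part of Spark or Hadoop.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper (not part of Spark or Hadoop): estimates the expected
// BytesRead of a full scan as the total size of the data files in a local
// output directory.
class ExpectedBytesRead {

    // Sums the sizes of regular files directly under dir, skipping marker
    // and hidden files such as _SUCCESS and .crc checksum files.
    static long totalDataFileBytes(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            long total = 0;
            for (Path p : (Iterable<Path>) files::iterator) {
                String name = p.getFileName().toString();
                boolean isData = !name.startsWith("_") && !name.startsWith(".");
                if (Files.isRegularFile(p) && isData) {
                    total += Files.size(p);
                }
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the directory used in the repro; pass another path as args[0].
        Path dir = Paths.get(args.length > 0 ? args[0] : "/tmp/t2");
        if (!Files.isDirectory(dir)) {
            System.out.println("Not a directory: " + dir);
            return;
        }
        System.out.println("Expected BytesRead is roughly "
                + totalDataFileBytes(dir) + " bytes");
    }
}
```

For the directory listing above, this sum would be 44944 bytes (the single part file), which is the value the {{BytesRead}} metric should be close to for a scan without pruning.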


> Incorrect Vectored IO metrics
> -----------------------------
>
>                 Key: HADOOP-19863
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19863
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: cloud-storage
>    Affects Versions: 3.5.0
>            Reporter: Peter Toth
>            Priority: Major
>         Attachments: Screenshot 2026-04-16 at 19.02.30.png, Screenshot 
> 2026-04-16 at 19.03.51.png



