Peter Toth created HADOOP-19863:
-----------------------------------

             Summary: Incorrect Vectored IO metrics
                 Key: HADOOP-19863
                 URL: https://issues.apache.org/jira/browse/HADOOP-19863
             Project: Hadoop Common
          Issue Type: Bug
          Components: cloud-storage
    Affects Versions: 3.5.0
            Reporter: Peter Toth
         Attachments: Screenshot 2026-04-16 at 18.50.15.png, Screenshot 
2026-04-16 at 18.51.26.png, Screenshot 2026-04-16 at 19.02.30.png, Screenshot 
2026-04-16 at 19.03.51.png

As discussed in 
[https://github.com/apache/parquet-java/issues/2703#issuecomment-4260121705], we 
noticed that when vectored IO is enabled, the {{BytesRead}} metric of Spark 
tasks is not correct.

Spark fetches that metric via {{FileSystem.getAllStatistics}}, see:
 - 
[https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L98-L109]
 and
 - 
[https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L164-L170]
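For illustration only, a minimal sketch of how these counters can be inspected from {{spark-shell}}. It is not a copy of the linked Spark code. The helper names {{threadBytesRead}} and {{totalBytesRead}} are hypothetical, and the sketch assumes local mode, where tasks run in the same JVM as the shell:
{code:scala}
import org.apache.hadoop.fs.FileSystem
import scala.jdk.CollectionConverters._

// Hypothetical helper: bytes read by the *current thread*, summed over all
// registered FileSystem statistics. Spark's task-level metric is per thread.
def threadBytesRead(): Long =
  FileSystem.getAllStatistics.asScala.map(_.getThreadStatistics.getBytesRead).sum

// Hypothetical helper: bytes read by *all threads* in this JVM. Useful from the
// shell thread, because tasks execute on other threads.
def totalBytesRead(): Long =
  FileSystem.getAllStatistics.asScala.map(_.getBytesRead).sum

val before = totalBytesRead()
spark.read.parquet("/tmp/t2").collect()
val delta = totalBytesRead() - before
println(s"bytesRead delta across all threads: $delta")
{code}
The delta also includes any reads done on the driver side (for example footer reads for schema inference), so it is only a rough cross-check of the value shown in the Spark UI.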

Repro with the latest Spark 4.2.0-SNAPSHOT using Hadoop 3.5.0.

With vectored IO enabled (the default):
{code:java}
➜  bin/spark-shell

scala> spark.createDataFrame((0 until 5000).map(i => (i, s"left_$i"))).repartition(1).write.parquet("/tmp/t2")
scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
scala> sql("SELECT * FROM t2").collect()
{code}
!Screenshot 2026-04-16 at 19.02.30.png!

With vectored IO disabled explicitly:
{code:java}
➜  bin/spark-shell --conf spark.hadoop.parquet.hadoop.vectored.io.enabled=false

scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
scala> sql("SELECT * FROM t2").collect()
{code}
!Screenshot 2026-04-16 at 19.03.51.png!

In my case the size of the generated test file was ~45KB:
{code:java}
➜  ls -ll /tmp/t2
total 88
-rw-r--r--@ 1 ptoth  wheel      0 Apr 16 18:57 _SUCCESS
-rw-r--r--@ 1 ptoth  wheel  44944 Apr 16 18:57 part-00000-cf825cf6-2fa5-46a2-b897-dbb9dc9828a7-c000.snappy.parquet
{code}
I believe reading the Parquet footers doesn't go through vectored IO, so the 
decreased 1680B figure probably belongs to that.

There is no data pruning in the query so the metric value should be around the 
file size.
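As a rough cross-check of the expected value, the on-disk size of the Parquet data can be computed in the same {{spark-shell}} session. This is only a sketch: the variable names are hypothetical and it assumes the {{/tmp/t2}} path from the repro above:
{code:scala}
import org.apache.hadoop.fs.Path

// Resolve the filesystem for the output directory used in the repro.
val dir = new Path("/tmp/t2")
val fs = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Sum the lengths of the Parquet part files, skipping _SUCCESS and other markers.
val parquetBytes: Long = fs.listStatus(dir)
  .filter(_.getPath.getName.endsWith(".parquet"))
  .map(_.getLen)
  .sum

println(s"Parquet bytes on disk: $parquetBytes")
{code}
With no pruning in the query, the {{BytesRead}} metric would be expected to land close to this number.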



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
