Peter Toth created HADOOP-19863:
-----------------------------------
Summary: Incorrect Vectored IO metrics
Key: HADOOP-19863
URL: https://issues.apache.org/jira/browse/HADOOP-19863
Project: Hadoop Common
Issue Type: Bug
Components: cloud-storage
Affects Versions: 3.5.0
Reporter: Peter Toth
Attachments: Screenshot 2026-04-16 at 18.50.15.png, Screenshot
2026-04-16 at 18.51.26.png, Screenshot 2026-04-16 at 19.02.30.png, Screenshot
2026-04-16 at 19.03.51.png
As discussed in
[https://github.com/apache/parquet-java/issues/2703#issuecomment-4260121705], we
noticed that when vectored IO is enabled, the {{BytesRead}} metrics of Spark
tasks are not correct.
Spark fetches that metric via {{FileSystem.getAllStatistics}}; see:
- [https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L98-L109]
- [https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L164-L170]
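For readers who don't want to follow the links, here is a minimal sketch of how such a per-thread counter can be read from the Hadoop statistics. It is an approximation of the idea, not a copy of the Spark code; the helper name {{threadBytesRead}} is hypothetical, and the snippet assumes a Scala 2.13 spark-shell with Hadoop on the classpath.
{code:scala}
import org.apache.hadoop.fs.FileSystem
import scala.jdk.CollectionConverters._

// Hypothetical helper: sums bytesRead over all FileSystem schemes,
// counting only reads performed by the calling thread.
// FileSystem.getAllStatistics is deprecated but still available.
def threadBytesRead(): Long =
  FileSystem.getAllStatistics.asScala
    .map(_.getThreadStatistics.getBytesRead)
    .sum

val before = threadBytesRead()
// ... perform file reads on this same thread ...
val delta = threadBytesRead() - before
{code}
Because the counter is thread-local, calling this from the REPL thread will not reflect reads done by Spark task threads; inside Spark the equivalent callback is evaluated on the task thread. If a read path does not update these statistics, the delta stays too low, which would be consistent with the numbers reported here.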
Repro with latest Spark 4.2.0-SNAPSHOT using Hadoop 3.5.0:
Vectored IO is enabled by default:
{code:java}
➜ bin/spark-shell
scala> spark.createDataFrame((0 until 5000).map(i => (i, s"left_$i"))).repartition(1).write.parquet("/tmp/t2")
scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
scala> sql("SELECT * FROM t2").collect()
{code}
!Screenshot 2026-04-16 at 19.02.30.png!
Vectored IO is disabled explicitly:
{code:java}
➜ bin/spark-shell --conf spark.hadoop.parquet.hadoop.vectored.io.enabled=false
scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
scala> sql("SELECT * FROM t2").collect()
{code}
!Screenshot 2026-04-16 at 19.03.51.png!
In my case the generated test file's size was ~45KB:
{code:java}
➜ ls -ll /tmp/t2
total 88
-rw-r--r--@ 1 ptoth wheel 0 Apr 16 18:57 _SUCCESS
-rw-r--r--@ 1 ptoth wheel 44944 Apr 16 18:57 part-00000-cf825cf6-2fa5-46a2-b897-dbb9dc9828a7-c000.snappy.parquet
{code}
I believe reading the Parquet footers doesn't go through vectored IO, so the
decreased value of 1680B probably covers only those footer reads.
There is no data pruning in the query, so the metric value should be around the
file size.
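As a rough cross-check, the expected value can be computed in the same spark-shell session by summing the lengths of the Parquet files in the output directory. This is only a sketch; the names {{dir}}, {{fs}} and {{expectedBytes}} are illustrative, and it assumes the data was written to {{/tmp/t2}} as in the repro above.
{code:scala}
import org.apache.hadoop.fs.Path

// Directory written by the repro above.
val dir = new Path("/tmp/t2")
val fs = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Total size of the Parquet data files, ignoring _SUCCESS and any checksum files.
val expectedBytes = fs.listStatus(dir)
  .filter(_.getPath.getName.endsWith(".parquet"))
  .map(_.getLen)
  .sum
{code}
For the listing shown above this should come out at 44944, which is the order of magnitude the {{BytesRead}} metric is expected to report for a full scan.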