[
https://issues.apache.org/jira/browse/HADOOP-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Toth updated HADOOP-19863:
--------------------------------
Description:
As discussed in
[https://github.com/apache/parquet-java/issues/2703#issuecomment-4260121705] we
noticed that when vectored IO is enabled the {{BytesRead}} metric of Spark
tasks is not correct.
Spark fetches that metric via {{FileSystem.getAllStatistics}}, see:
- [https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L98-L109]
- [https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L164-L170]
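For anyone who wants to check the counters without the Spark UI, the following is a minimal sketch that can be pasted into spark-shell. It is not the code Spark itself runs. It only reads the same Hadoop counters through {{FileSystem.getAllStatistics}}. The helper names {{bytesReadOnThisThread}} and {{bytesReadAllThreads}} are made up for illustration, and the sketch assumes a local-mode shell (driver and tasks in one JVM) on Scala 2.13.
{code:scala}
// Sketch only: reads the Hadoop FileSystem statistics that the BytesRead metric is based on.
import org.apache.hadoop.fs.FileSystem
import scala.jdk.CollectionConverters._

// Hypothetical helper: bytes read as seen by the calling thread, summed over all schemes.
def bytesReadOnThisThread(): Long =
  FileSystem.getAllStatistics.asScala.map(_.getThreadStatistics.getBytesRead).sum

// Hypothetical helper: bytes read aggregated over all threads in this JVM.
def bytesReadAllThreads(): Long =
  FileSystem.getAllStatistics.asScala.map(_.getBytesRead).sum

// Take a snapshot before and after the scan and print the difference.
val before = bytesReadAllThreads()
spark.read.parquet("/tmp/t2").collect()
val after = bytesReadAllThreads()
println(s"bytes read during scan: ${after - before}")
{code}
The delta also includes reads done while planning the scan (for example schema inference), so it is only a rough cross-check against the file size.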
Repro with latest Spark 4.2.0-SNAPSHOT using Hadoop 3.5.0:
Vectored IO is enabled by default:
{code:java}
➜ bin/spark-shell
scala> spark.createDataFrame((0 until 5000).map(i => (i, s"left_$i"))).repartition(1).write.parquet("/tmp/t2")
scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
scala> sql("SELECT * FROM t2").collect()
{code}
!Screenshot 2026-04-16 at 19.02.30.png|width=85%!
Vectored IO is disabled explicitly:
{code:java}
➜ bin/spark-shell --conf spark.hadoop.parquet.hadoop.vectored.io.enabled=false
scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
scala> sql("SELECT * FROM t2").collect()
{code}
!Screenshot 2026-04-16 at 19.03.51.png|width=85%!
In my case the size of the generated test file was ~45KB:
{code:java}
➜ ls -ll /tmp/t2
total 88
-rw-r--r--@ 1 ptoth wheel 0 Apr 16 18:57 _SUCCESS
-rw-r--r--@ 1 ptoth wheel 44944 Apr 16 18:57 part-00000-cf825cf6-2fa5-46a2-b897-dbb9dc9828a7-c000.snappy.parquet
{code}
I believe reading the Parquet footer doesn't go through vectored IO, so the
reduced value of 1680B probably corresponds to that.
There is no data pruning in the query, so the metric value should be close to
the file size.
> Incorrect Vectored IO metrics
> -----------------------------
>
> Key: HADOOP-19863
> URL: https://issues.apache.org/jira/browse/HADOOP-19863
> Project: Hadoop Common
> Issue Type: Bug
> Components: cloud-storage
> Affects Versions: 3.5.0
> Reporter: Peter Toth
> Priority: Major
> Attachments: Screenshot 2026-04-16 at 19.02.30.png, Screenshot
> 2026-04-16 at 19.03.51.png