[
https://issues.apache.org/jira/browse/SPARK-34956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080149#comment-18080149
]
Jake Waro commented on SPARK-34956:
-----------------------------------
h2. Adding evidence for SPARK-34956 (multi-field nested column pruning on
Generator output)
This issue is still affected as of Spark 3.5.3. Below is a minimal reproducer
with cost data, in case it's useful for prioritization or as a regression-test
artifact for any future fix.
h2. Why this comment
SPARK-34956 tracks extending nested column pruning on Generator output to the
multi-field case (the single-field case being handled by its predecessor
SPARK-34638). I built a self-contained reproducer for our production scenario
and discovered, while reading \{{NestedColumnAliasing.scala}}, that the
behavior we were seeing is exactly the limitation this JIRA tracks:
{quote}
{\{// We only process single field case. For multiple field case, we cannot}}
{\{// directly move field extractor into the generator expression.}}
{\{// TODO(SPARK-34956): support multiple fields.}}
{quote}
The reproducer demonstrates the limitation cleanly, quantifies the cost on
synthetic data, and shows the issue persists on Spark 3.5.0–3.5.3.
h2. Minimal reproducer
A self-contained Maven project: [SparkNestedPruningRepro
gist|https://gist.github.com/jwaro/441abb857175f9c9d69041bd2a4a81da]. Public
artifacts only (Apache Spark, Apache Iceberg, Scala). No external services or
proprietary data.
To run:
{code:bash}
mvn compile exec:exec
{code}
Exit code is non-zero when the limitation is present (so the reproducer can
serve as a regression-test artifact for a fix).
h2. Schema
{code:sql}
CREATE TABLE repro_cat.db.two_level (
id BIGINT NOT NULL,
outer ARRAY<STRUCT<
a: STRING,
padding_outer: STRING,
inner: ARRAY<STRUCT<
x: INT,
y: INT,
z: INT,
padding_a: STRING,
padding_b: STRING,
padding_c: STRING,
padding_d: STRING
>>
>>
)
USING iceberg
TBLPROPERTIES ('format-version' = '2')
{code}
1 million rows inserted, each with one outer element containing a 5-element
inner array. The \{{padding_*}} string columns vary per row to defeat Parquet's
RLE/dictionary encoding so the bytesRead difference is observable locally.
h2. Failing query — multiple fields from Generator output
Accesses *both* \{{inner.x}} and \{{inner.y}} (two fields from the generator
output — the case this JIRA tracks):
{code:scala}
spark.table("repro_cat.db.two_level")
.select(col("id"), explode(arrays_zip(col("outer.a"),
col("outer.inner"))).as("zipped"))
.select(col("id"), col("zipped.a").as("a"),
explode(col("zipped.inner")).as("inner_elem"))
.select(col("id"), col("inner_elem.x"), col("inner_elem.y"))
.agg(sum("x"), sum("y"))
{code}
Observed \{{BatchScan.readSchema()}} (the only authoritative source for what
got pushed into the V2 scan; the printed plan-tree \{{BatchScan[outer#34]}}
shows output attributes, not the pruned schema):
{noformat}
root
|-- outer: array
| |-- element: struct
| | |-- a: string
| | |-- inner: array
| | | |-- element: struct
| | | | |-- x: integer
| | | | |-- y: integer
| | | | |-- z: integer <-- not accessed
| | | | |-- padding_a: string <-- not accessed
| | | | |-- padding_b: string <-- not accessed
| | | | |-- padding_c: string <-- not accessed
| | | | |-- padding_d: string <-- not accessed
{noformat}
8 leaves total; 5 of them never referenced by the query. This matches the
\{{TODO(SPARK-34956)}} branch in
\{{NestedColumnAliasing.GeneratorNestedColumnAliasing}} (the bailout when
\{{nestedFieldsOnGenerator.size > 1}}).
h2. Control case — single field from Generator output
The same table, queried with a single field from the generator output, prunes
correctly (this is the SPARK-34638 case):
{code:scala}
spark.table("repro_cat.db.two_level")
.select(col("id"), explode(col("outer.a")).as("a"))
.agg(count("a"))
{code}
{\{readSchema}}:
{noformat}
root
|-- outer: array
| |-- element: struct
| | |-- a: string
{noformat}
1 leaf. Single-field generator pruning works; multi-field does not — exactly as
SPARK-34956 describes.
h2. I/O cost
Captured via \{{InputMetrics.bytesRead}} on the source-scan stage on 1M
synthetic rows:
|| Case || Fields from generator output || bytesRead ||
| CONTROL (SPARK-34638 case) | 1 | 1,240,898 bytes |
| FAILING (SPARK-34956 case) | 2 | 33,603,904 bytes |
| Ratio | | *27.1×* |
The failing case reads 27× more bytes than the control despite asking for only
one additional logical leaf. The wide \{{readSchema}} forces Iceberg to pull
every padding field from disk.
In a production deployment, the same access pattern over a real Iceberg V2
table produces a ~6× bytesRead inflation and a ~1.7–3.3× wall-time increase on
individual tests, contributing to a ~2.0× suite-wide runtime gap vs. an
equivalent flat-table layout.
h2. Version sweep
Confirmed identical 27.1× bytesRead ratio and 8-leaf \{{readSchema}} on Spark
3.5.0, 3.5.1, 3.5.2, 3.5.3 (Iceberg 1.9.2 throughout). Not a patch-release
regression; the limitation has been present throughout the 3.5 series. This may
be worth updating the JIRA's "Affects Version" field, which currently lists
only 3.2.0.
h2. Pruning-related configs are no-ops (as expected)
The reproducer also reruns the failing query under four configs commonly cited
as pruning-related:
* \{{spark.sql.optimizer.nestedSchemaPruning.enabled=true}}
* \{{spark.sql.nestedSchemaPruning.enabled=true}}
* \{{spark.sql.optimizer.expression.nestedPruning.enabled=true}}
* \{{spark.sql.optimizer.nestedPredicatePushdown.enabled=true}}
All four produce identical 8-leaf \{{readSchema}}. Consistent with this JIRA's
"the rule itself is what's limited" framing.
h2. Relation to other recent work
* *SPARK-51831* / [PR 51046|https://github.com/apache/spark/pull/51046] (merged
Sep 2025) closed the flat-column V2 pruning gap reported in
[apache/iceberg#9268|https://github.com/apache/iceberg/issues/9268]. The PR
author explicitly noted nested arrays-of-structs were not the focus and left
them as a follow-up — SPARK-34956 is that follow-up.
* *SPARK-42879* (open, unassigned) covers a related multi-field projection case
but for a different access shape.
h2. Reproducer as a regression-test artifact
The reproducer's exit code is non-zero when the limitation is present and zero
otherwise, so it can serve as a regression test for a candidate fix. Happy to
maintain it or move it to a more permanent location if that would help.
h2. Environment
* Spark 3.5.0 / 3.5.1 / 3.5.2 / 3.5.3 (all reproduce)
* Iceberg 1.9.2 (\{{iceberg-spark-runtime-3.5_2.12}})
* Scala 2.12.19
* JDK 17 (Temurin)
* Local-mode single-node Spark; hadoop-catalog Iceberg over a temp directory
* AQE enabled (does not affect the result)
> Support multiple fields when nested column pruning on Generator output
> -----------------------------------------------------------------------
>
> Key: SPARK-34956
> URL: https://issues.apache.org/jira/browse/SPARK-34956
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: L. C. Hsieh
> Priority: Major
>
> SPARK-34638 enhances nested column pruning rule on Generator's output.
> SPARK-34638 supports single field case, e.g.
> {{df.select(explode($"items").as("item")).select($"item.a")}}. This ticket is
> open for tracking multiple-field support.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]