Comparison between union and stack in pyspark

Dhruv Singla Wed, 16 Apr 2025 04:44:37 -0700

Hey Everyone

In Spark, suppose i have the following df.


```
df = spark.createDataFrame([['A', 'A06', 'B', 'B02', '202412'], ['A',
'A04', 'B', 'B03', '202501'], ['B', 'B01', 'C', 'C02', '202411'], ['B',
'B03', 'A', 'A06', '202502']], 'entity_code: string, entity_rollup: string,
target_entity_code: string, target_entity_rollup: string, period: string')

df.show()
df.createOrReplaceTempView('v1')

+-----------+-------------+------------------+--------------------+------+
|entity_code|entity_rollup|target_entity_code|target_entity_rollup|period|
+-----------+-------------+------------------+--------------------+------+
|          A|          A06|                 B|                 B02|202412|
|          A|          A04|                 B|                 B03|202501|
|          B|          B01|                 C|                 C02|202411|
|          B|          B03|                 A|                 A06|202502|
+-----------+-------------+------------------+--------------------+------+

```

I have two queries
```sql
SELECT DISTINCT
  STACK(
    2
    , entity_code, entity_rollup
    , target_entity_code, target_entity_rollup
  ) AS (entity_code, entity_rollup)
  , period
FROM v1

Running 'explain' on the above gives
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[entity_code#22625, entity_rollup#22626,
period#21414], functions=[])
   +- Exchange hashpartitioning(entity_code#22625, entity_rollup#22626,
period#21414, 200), ENSURE_REQUIREMENTS, [plan_id=12977]
      +- HashAggregate(keys=[entity_code#22625, entity_rollup#22626,
period#21414], functions=[])
         +- Project [entity_code#22625, entity_rollup#22626, period#21414]
            +- Generate stack(2, entity_code#21410, entity_rollup#21411,
target_entity_code#21412, target_entity_rollup#21413), [period#21414],
false, [entity_code#22625, entity_rollup#22626]
               +- LocalTableScan [entity_code#21410, entity_rollup#21411,
target_entity_code#21412, target_entity_rollup#21413, period#21414]
```

The second query is
```sql
SELECT entity_code, entity_rollup, period
FROM v1

UNION

SELECT target_entity_code, target_entity_rollup, period
FROM v1

Running 'explain' on above query did not have the FileScan, so added that
from actual query as well
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[entity_source_id#24129, entity_rollup#24130,
entity#24131, period#24175], functions=[])
   +- Exchange hashpartitioning(entity_source_id#24129,
entity_rollup#24130, entity#24131, period#24175, 200), ENSURE_REQUIREMENTS,
[plan_id=14134]
      +- HashAggregate(keys=[entity_source_id#24129, entity_rollup#24130,
entity#24131, period#24175], functions=[])
         +- Union
            :- Project [billing_entity_code#24170 AS
entity_source_id#24129, billing_entity_region#24171 AS entity_rollup#24130,
billing_entity_name#24169 AS entity#24131, period#24175]
            :  +- FileScan ... Batched: true, DataFilters: [], Format:
Parquet, Location: ..., PartitionFilters: [], PushedFilters: [],
ReadSchema:
struct<billing_entity_name:string,billing_entity_code:string,billing_entity_region:string,period:...
            +- Project [receiving_entity_code#24186 AS
entity_source_id#24132, receiving_entity_region#24187 AS
entity_rollup#24133, receiving_entity_name#24185 AS entity#24134,
period#24188]
               +- FileScan ... Batched: true, DataFilters: [], Format:
Parquet, Location: ..., PartitionFilters: [], PushedFilters: [],
ReadSchema:
struct<receiving_entity_name:string,receiving_entity_code:string,receiving_entity_region:string,p...
```

The difference in the two queries is that Union has two file reads, while
Stack has one. Rest is solely the difference between union and stack. Which
one of the two is faster?

Thanks & Regards
Dhruv

Comparison between union and stack in pyspark

Reply via email to