Hey Everyone
In Spark, suppose i have the following df.
```
df = spark.createDataFrame([['A', 'A06', 'B', 'B02', '202412'], ['A',
'A04', 'B', 'B03', '202501'], ['B', 'B01', 'C', 'C02', '202411'], ['B',
'B03', 'A', 'A06', '202502']], 'entity_code: string, entity_rollup: string,
target_entity_code: string, target_entity_rollup: string, period: string')
df.show()
df.createOrReplaceTempView('v1')
+-----------+-------------+------------------+--------------------+------+
|entity_code|entity_rollup|target_entity_code|target_entity_rollup|period|
+-----------+-------------+------------------+--------------------+------+
| A| A06| B| B02|202412|
| A| A04| B| B03|202501|
| B| B01| C| C02|202411|
| B| B03| A| A06|202502|
+-----------+-------------+------------------+--------------------+------+
```
I have two queries
```sql
SELECT DISTINCT
STACK(
2
, entity_code, entity_rollup
, target_entity_code, target_entity_rollup
) AS (entity_code, entity_rollup)
, period
FROM v1
Running 'explain' on the above gives
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[entity_code#22625, entity_rollup#22626,
period#21414], functions=[])
+- Exchange hashpartitioning(entity_code#22625, entity_rollup#22626,
period#21414, 200), ENSURE_REQUIREMENTS, [plan_id=12977]
+- HashAggregate(keys=[entity_code#22625, entity_rollup#22626,
period#21414], functions=[])
+- Project [entity_code#22625, entity_rollup#22626, period#21414]
+- Generate stack(2, entity_code#21410, entity_rollup#21411,
target_entity_code#21412, target_entity_rollup#21413), [period#21414],
false, [entity_code#22625, entity_rollup#22626]
+- LocalTableScan [entity_code#21410, entity_rollup#21411,
target_entity_code#21412, target_entity_rollup#21413, period#21414]
```
The second query is
```sql
SELECT entity_code, entity_rollup, period
FROM v1
UNION
SELECT target_entity_code, target_entity_rollup, period
FROM v1
Running 'explain' on above query did not have the FileScan, so added that
from actual query as well
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[entity_source_id#24129, entity_rollup#24130,
entity#24131, period#24175], functions=[])
+- Exchange hashpartitioning(entity_source_id#24129,
entity_rollup#24130, entity#24131, period#24175, 200), ENSURE_REQUIREMENTS,
[plan_id=14134]
+- HashAggregate(keys=[entity_source_id#24129, entity_rollup#24130,
entity#24131, period#24175], functions=[])
+- Union
:- Project [billing_entity_code#24170 AS
entity_source_id#24129, billing_entity_region#24171 AS entity_rollup#24130,
billing_entity_name#24169 AS entity#24131, period#24175]
: +- FileScan ... Batched: true, DataFilters: [], Format:
Parquet, Location: ..., PartitionFilters: [], PushedFilters: [],
ReadSchema:
struct<billing_entity_name:string,billing_entity_code:string,billing_entity_region:string,period:...
+- Project [receiving_entity_code#24186 AS
entity_source_id#24132, receiving_entity_region#24187 AS
entity_rollup#24133, receiving_entity_name#24185 AS entity#24134,
period#24188]
+- FileScan ... Batched: true, DataFilters: [], Format:
Parquet, Location: ..., PartitionFilters: [], PushedFilters: [],
ReadSchema:
struct<receiving_entity_name:string,receiving_entity_code:string,receiving_entity_region:string,p...
```
The difference in the two queries is that Union has two file reads, while
Stack has one. Rest is solely the difference between union and stack. Which
one of the two is faster?
Thanks & Regards
Dhruv