I have a data frame with the following schema:
household
root
 |-- country_code: string (nullable = true)
 |-- region_code: string (nullable = true)
 |-- individuals: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- individual_id: string (nullable = true)
 |    |    |-- ids: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- id_last_seen: date (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |-- year_released: integer (nullable = true)
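
For reference, here is a minimal sketch of how a toy data frame with this
shape can be built (the sample values are made up; the pasted schema does not
show household_id, but since the query below selects it I included it in the
sketch):

import java.sql.Date
import spark.implicits._

case class DeviceId(id_last_seen: Date, `type`: String, value: String,
                    year_released: Integer)
case class Individual(individual_id: String, ids: Seq[DeviceId])
case class Household(household_id: String, country_code: String,
                     region_code: String, individuals: Seq[Individual])

// Build a one-row data frame and register it under the view name used below.
val df = Seq(
  Household("h1", "US", "CA", Seq(
    Individual("i1", Seq(
      DeviceId(Date.valueOf("2020-06-01"), "phone", "abc", 2019)))))
).toDF()
df.createOrReplaceTempView("household")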

I can use the following query to find households that contain at least one
device that was released after 2018:

val sql = """
select household_id
from household
where exists(individuals, id -> exists(id.ids, dev -> dev.year_released > 2018))
"""
val v = spark.sql(sql)
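
For what it's worth, the same predicate can also be written with the
DataFrame API on 3.0, where exists() is exposed as a higher-order function in
org.apache.spark.sql.functions; I have not checked whether this changes the
pruning behavior, since it should go through the same optimizer:

import org.apache.spark.sql.functions.{col, exists}

// DataFrame-API equivalent of the SQL above (Spark 3.0+ only).
val v2 = spark.table("household")
  .where(exists(col("individuals"),
    ind => exists(ind("ids"), dev => dev("year_released") > 2018)))
  .select("household_id")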

It works well. However, I found that the query planner was not able to prune
the unneeded columns; instead, Spark has to read every column of the nested
structs.
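
One way to see this is the ReadSchema entry in the physical plan, which here
lists every nested field instead of just year_released. There is also a
nested schema pruning flag (present but off by default in 2.4, enabled by
default in 3.0); as far as I can tell it does not cover fields reached
through exists() over an array of structs, which is essentially what I am
asking about:

// Inspect the physical plan; ReadSchema shows which columns are scanned.
v.explain(true)

// Explicitly enable nested schema pruning and re-check the plan.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
spark.sql(sql).explain(true)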

I tested this with Spark 2.4.5 and 3.0.0 and got the same result.

I am just wondering: does Spark support, or are there plans to support,
column pruning when scanning an array of structs?
