I have a data frame with the following schema:

    household
    root
     |-- country_code: string (nullable = true)
     |-- region_code: string (nullable = true)
     |-- individuals: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- individual_id: string (nullable = true)
     |    |    |-- ids: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- id_last_seen: date (nullable = true)
     |    |    |    |    |-- type: string (nullable = true)
     |    |    |    |    |-- value: string (nullable = true)
     |    |    |    |    |-- year_released: integer (nullable = true)
I can use the following code to find households that contain at least one id that was released after the year 2018:

    val sql = """
      select household_id
      from household
      where exists(individuals, id -> exists(id.ids, dev -> dev.year_released > 2018))
    """
    val v = spark.sql(sql)

It works well; however, I found that the query planner is not able to prune the unneeded columns. Spark instead has to read all columns of the nested structures. I tested this with Spark 2.4.5 and 3.0.0 and got the same result in both. Does Spark support, or are there plans to add support for, column pruning on scans over an array of structs?
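For context, nested schema pruning in Spark is gated behind a configuration flag, `spark.sql.optimizer.nestedSchemaPruning.enabled` (present since 2.4, where it is off by default and limited to simple struct field accesses on Parquet; it is on by default in 3.0). A minimal sketch to check whether the flag changes the scan's `ReadSchema`, assuming an existing `SparkSession` named `spark` and a Parquet-backed `household` table; note that pruning through higher-order functions such as `exists` may still not kick in on these versions, which would match the behavior observed above:

```scala
// Sketch: enable nested schema pruning, then compare physical plans.
// Assumes `spark: SparkSession` and a Parquet-backed table `household`.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

val pruned = spark.sql("""
  select household_id
  from household
  where exists(individuals, id -> exists(id.ids, dev -> dev.year_released > 2018))
""")

// In the printed physical plan, look at the FileScan's ReadSchema to see
// which nested fields are actually read from storage.
pruned.explain(true)
```

If `ReadSchema` still lists every field of the `ids` struct with the flag enabled, the optimizer is not pushing the pruning through the lambda, and the limitation lies in the planner rather than in the data source.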