Hi,

When reading a complex JSON file, Spark infers the following schema:
root
 |-- header: struct (nullable = true)
 |    |-- deviceId: string (nullable = true)
 |    |-- sessionId: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- deviceObjects: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- additionalPayload: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- data: struct (nullable = true)
 |    |    |    |    |    |    |-- a: struct (nullable = true)
 |    |    |    |    |    |    |    |-- address: string (nullable = true)

When we save this JSON as Parquet using Spark SQL, we get only two top-level columns in the Parquet file, "header" and "payload". We now want to run a mean calculation over the nested element *a: struct (nullable = true)*.

The Databricks blog on handling complex JSON (https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html) states:

*"when using Parquet, all struct columns will receive the same treatment as top-level columns. Therefore, if you have filters on a nested field, you will get the same benefits as a top-level column."*

With reference to that statement, will Parquet treat *a: struct (nullable = true)* as a top-level column, so that SQL queries on the Dataset are optimized? If not, do we need to impose the schema ourselves, by exploding the complex types before writing to Parquet, in order to get the top-level-column benefit? What can we do in Spark 2.1 to extract the best performance over a nested structure like *a: struct (nullable = true)*?

Thanks
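For context, the explicit flattening we are considering would look roughly like the sketch below (column names are taken from the inferred schema above; the file paths are placeholders, and since "address" is a string we aggregate over its length just to illustrate an aggregation on the struct):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, avg, length}

val spark = SparkSession.builder.appName("flatten-nested").getOrCreate()
import spark.implicits._

// Read the nested Parquet written earlier (path is a placeholder).
val df = spark.read.parquet("/data/events.parquet")

// Explode the two array levels so each row carries a single
// additionalPayload element, then project the nested struct "a"
// up to a top-level column before re-writing to Parquet.
val flat = df
  .select($"header.deviceId", $"header.sessionId",
          explode($"payload.deviceObjects").as("dev"))
  .select($"deviceId", $"sessionId",
          explode($"dev.additionalPayload").as("ap"))
  .select($"deviceId", $"sessionId", $"ap.data.a".as("a"))

flat.write.mode("overwrite").parquet("/data/events_flat.parquet")

// Aggregations on "a" now touch a top-level Parquet column:
spark.read.parquet("/data/events_flat.parquet")
  .groupBy($"deviceId")
  .agg(avg(length($"a.address")))
  .show()
```

Is this kind of pre-flattening actually necessary, or does Parquet's columnar encoding of nested structs already make the original layout equally efficient to scan?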