Hi,

When reading a complex JSON document, Spark infers the schema as follows:

root
 |-- header: struct (nullable = true)
 |    |-- deviceId: string (nullable = true)
 |    |-- sessionId: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- deviceObjects: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- additionalPayload: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- data: struct (nullable = true)
 |    |    |    |    |    |    |-- *a: struct (nullable = true)*
 |    |    |    |    |    |    |    |-- address: string (nullable = true)

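For context, the JSON is read roughly like this (the path and SparkSession
setup here are illustrative, not our actual code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nested-json").getOrCreate()

// Spark samples the JSON records and infers the nested struct/array schema
val df = spark.read.json("/path/to/device_events.json")
df.printSchema()
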
When we save the above JSON to Parquet using Spark SQL, we get only two
top-level columns, "header" and "payload", in the Parquet file.
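
Roughly, the write step looks like this (the output path is illustrative):

// Writes "header" and "payload" as two top-level nested Parquet groups
df.write.parquet("/path/to/device_events_parquet")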

We now want to run a mean calculation on the nested element *a: struct
(nullable = true)*.
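
For illustration, we imagine something like the following (the numeric field
someMetric is a placeholder, since the inferred schema above only shows the
string field "address" under "a"):

import org.apache.spark.sql.functions.{avg, col, explode}

// Each explode un-nests one level of array
val exploded = df
  .select(explode(col("payload.deviceObjects")).as("obj"))
  .select(explode(col("obj.additionalPayload")).as("ap"))

// someMetric is a hypothetical numeric field under data.a
exploded.select(avg(col("ap.data.a.someMetric"))).show()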

With reference to the Databricks blog on handling complex data formats:
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html

*"when using Parquet, all struct columns will receive the same treatment as
top-level columns. Therefore, if you have filters on a nested field, you
will get the same benefits as a top-level column."*
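
As we understand it, that would cover a filter like the following (the field
value is illustrative):

// Filter on a nested struct field; per the blog, with Parquet this gets
// the same benefit as a filter on a top-level column
df.filter(col("header.deviceId") === "device-123").select("header.sessionId")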

Referring to the statement above: will Parquet treat *a: struct (nullable =
true)* as a top-level column, so that SQL queries on the Dataset are
optimized?

If not, do we need to impose the schema explicitly by exploding the complex
types before writing to Parquet in order to get the top-level-column
benefit? What can we do in Spark 2.1 to extract the best performance over a
nested structure like *a: struct (nullable = true)*?
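
For instance, would an explicit flattening step like the following be the
way to go? (The column choices are illustrative, based on the inferred
schema above.)

import org.apache.spark.sql.functions.{col, explode}

// Explode the arrays and promote the nested fields to top-level columns
val flat = df
  .select(col("header.deviceId"), explode(col("payload.deviceObjects")).as("obj"))
  .select(col("deviceId"), explode(col("obj.additionalPayload")).as("ap"))
  .select(col("deviceId"), col("ap.data.a.address").as("address"))

flat.write.mode("overwrite").parquet("/path/to/flat_parquet")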

Thanks
