Hi,

When reading a complex JSON file, Spark infers the following schema:
root
 |-- header: struct (nullable = true)
 |    |-- deviceId: string (nullable = true)
 |    |-- sessionId: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- deviceObjects: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- additionalPayload: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- data: struct (nullable = true)
 |    |    |    |    |    |    |-- a: struct (nullable = true)
 |    |    |    |    |    |    |    |-- address: string (nullable = true)

When we save this JSON as Parquet using Spark SQL, we get only two top-level columns in the Parquet file, "header" and "payload". We now want to run a mean calculation over the nested element *a: struct (nullable = true)*.

The Databricks blog on handling complex JSON (https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html) states:

*"when using Parquet, all struct columns will receive the same treatment as top-level columns. Therefore, if you have filters on a nested field, you will get the same benefits as a top-level column."*

With reference to that statement, will Parquet treat *a: struct (nullable = true)* as a top-level column, so that SQL queries on the Dataset are optimized? If not, do we need to impose the schema ourselves, by exploding the complex types before writing to Parquet, in order to get the top-level-column benefit? What can we do in Spark 2.1 to extract the best performance over a nested structure like *a: struct (nullable = true)*?

Thanks
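For context, the explicit flattening we are considering would look roughly like the sketch below (column names are taken from the inferred schema above; the file paths are placeholders, and since "address" is a string we aggregate over its length just to illustrate an aggregation on the struct):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, avg, length}

val spark = SparkSession.builder.appName("flatten-nested").getOrCreate()
import spark.implicits._

// Read the nested Parquet written earlier (path is a placeholder).
val df = spark.read.parquet("/data/events.parquet")

// Explode the two array levels so each row carries a single
// additionalPayload element, then project the nested struct "a"
// up to a top-level column before re-writing to Parquet.
val flat = df
  .select($"header.deviceId", $"header.sessionId",
          explode($"payload.deviceObjects").as("dev"))
  .select($"deviceId", $"sessionId",
          explode($"dev.additionalPayload").as("ap"))
  .select($"deviceId", $"sessionId", $"ap.data.a".as("a"))

flat.write.mode("overwrite").parquet("/data/events_flat.parquet")

// Aggregations on "a" now touch a top-level Parquet column:
spark.read.parquet("/data/events_flat.parquet")
  .groupBy($"deviceId")
  .agg(avg(length($"a.address")))
  .show()
```

Is this kind of pre-flattening actually necessary, or does Parquet's columnar encoding of nested structs already make the original layout equally efficient to scan?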