MapType vs StructType

2015-07-17 Thread Corey Nolet
I notice JSON objects are all parsed as Map[String,Any] in Jackson but for
some reason, the inferSchema tools in Spark SQL extract the schema of
nested JSON objects as StructTypes.

This makes it really confusing when trying to rectify the object hierarchy
when I have maps because the Catalyst conversion layer underneath is
expecting a Row or Product and not a Map.

Why wasn't MapType used here? Is there any significant difference between
the two of these types that would cause me not to use a MapType when I'm
constructing my own schema representing a set of nested Map[String,_]'s?


Re: MapType vs StructType

2015-07-17 Thread Michael Armbrust
I'll add that there is a JIRA to override the default past some threshold
number of unique keys: https://issues.apache.org/jira/browse/SPARK-4476
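
To make the threshold idea concrete, here is a toy sketch in plain Python.
It is not Spark's inference code and says nothing about how SPARK-4476 is or
would be implemented; the function infer_kind and the max_struct_keys
parameter are hypothetical names chosen for illustration.

```python
import json


def infer_kind(json_lines, max_struct_keys=100):
    """Toy inference for top-level JSON objects.

    Collects the unique keys seen across all records. Up to
    max_struct_keys unique keys, the objects are treated as a struct
    (every key becomes a named field); past that threshold they are
    treated as a map (keys are data, not schema).
    """
    unique_keys = set()
    for line in json_lines:
        unique_keys.update(json.loads(line).keys())
    kind = "map" if len(unique_keys) > max_struct_keys else "struct"
    return kind, sorted(unique_keys)


# A small, stable set of keys: reads naturally as a struct.
stable = ['{"name": "a", "age": 1}', '{"name": "b"}']

# Every record introduces a new key: reads naturally as a map.
sparse = ['{"user_%d": %d}' % (i, i) for i in range(5)]

print(infer_kind(stable, max_struct_keys=3)[0])
print(infer_kind(sparse, max_struct_keys=3)[0])
```

The point of such a cutoff is that a struct schema grows with every new key
it sees, while a map schema stays the same size no matter how many distinct
keys appear in the data.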

On Fri, Jul 17, 2015 at 1:32 PM, Michael Armbrust mich...@databricks.com
wrote:

 The difference between a map and a struct here is that in a struct all
 possible keys are defined as part of the schema and each can have a
 different type (a map's values must share one type, and we don't support
 union types).  JSON doesn't have differentiated data structures, so by
 default inference goes with the one that gives you more information.  If
 you pass a schema to the JSON reader, however, you can override this and
 have a JSON object parsed as a map.


Re: MapType vs StructType

2015-07-17 Thread Michael Armbrust
The difference between a map and a struct here is that in a struct all
possible keys are defined as part of the schema and each can have a
different type (a map's values must share one type, and we don't support
union types).  JSON doesn't have differentiated data structures, so by
default inference goes with the one that gives you more information.  If
you pass a schema to the JSON reader, however, you can override this and
have a JSON object parsed as a map.
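
As a rough illustration of passing in a schema, here is a minimal PySpark
sketch. It assumes a recent Spark version in which spark.read.schema accepts
a DDL-formatted string; on the Spark 1.x releases current at the time of this
thread, the same schema would be built from StructType and MapType objects
instead. The field names id and attrs, the constant EVENT_SCHEMA, and the
helper read_events are made up for the example.

```python
# DDL-formatted schema string. `attrs` is declared as a map explicitly;
# left to inference, a nested JSON object would become a struct.
EVENT_SCHEMA = "id BIGINT, attrs MAP<STRING, STRING>"


def read_events(spark, path):
    """Read JSON records using the explicit schema above.

    `spark` is an existing SparkSession supplied by the caller, and
    `path` points at JSON input (one object per line by default).
    """
    return spark.read.schema(EVENT_SCHEMA).json(path)
```

With this schema, attrs comes back as a map column, so keys that vary from
record to record do not each turn into a separate struct field.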

On Fri, Jul 17, 2015 at 11:02 AM, Corey Nolet cjno...@gmail.com wrote:

 I notice JSON objects are all parsed as Map[String,Any] in Jackson but for
 some reason, the inferSchema tools in Spark SQL extract the schema of
 nested JSON objects as StructTypes.

 This makes it really confusing when trying to rectify the object hierarchy
 when I have maps because the Catalyst conversion layer underneath is
 expecting a Row or Product and not a Map.

 Why wasn't MapType used here? Is there any significant difference between
 the two of these types that would cause me not to use a MapType when I'm
 constructing my own schema representing a set of nested Map[String,_]'s?






Re: MapType vs StructType

2015-07-17 Thread Corey Nolet
This helps immensely. Thanks Michael!

On Fri, Jul 17, 2015 at 4:33 PM, Michael Armbrust mich...@databricks.com
wrote:

 I'll add that there is a JIRA to override the default past some threshold
 number of unique keys: https://issues.apache.org/jira/browse/SPARK-4476
