[ 
https://issues.apache.org/jira/browse/SPARK-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119450#comment-16119450
 ] 

Jochen Niebuhr edited comment on SPARK-21651 at 8/9/17 6:03 AM:
----------------------------------------------------------------

Ok, here's an example:

Suppose each entity is stored as a JSON object that holds its relations to 
other entities, keyed by the related entity's id. It might look like this:
{code}
{ "id": "06d32281-db4d-4d47-911a-c0b59cc0ed26", "relations": { 
"38401db2-1036-499f-b21e-e9be532cddb2": { /* ... some relation content ... */ 
}, "1cbb297c-cec8-4288-9edc-9d4b5dad3eec": { /* ... */ } } }
{ "id": "38401db2-1036-499f-b21e-e9be532cddb2", "relations": { 
"06d32281-db4d-4d47-911a-c0b59cc0ed26": { /* ... */ }, 
"1cbb297c-cec8-4288-9edc-9d4b5dad3eec": { /* ... */ } } }
{ "id": "1cbb297c-cec8-4288-9edc-9d4b5dad3eec", "relations": { 
"06d32281-db4d-4d47-911a-c0b59cc0ed26": { /* ... */ }, 
"38401db2-1036-499f-b21e-e9be532cddb2": { /* ... */ } } }
{code}

If I put that JSON through the JSON schema inference step (InferSchema), it 
generates a schema like this:
{code}Struct<id: String, relations: 
Struct<38401db2-1036-499f-b21e-e9be532cddb2: Struct<>, 
1cbb297c-cec8-4288-9edc-9d4b5dad3eec: Struct<>, 
06d32281-db4d-4d47-911a-c0b59cc0ed26: Struct<>>>{code}
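
To make the growth concrete, here is a minimal sketch in plain Python that mimics struct-only inference by merging the fields of every document. It is not Spark's actual inference code: the functions `infer_type`, `merge_types`, `infer_schema` and `make_documents`, and the synthetic `id-0000` style ids, are assumptions made up for illustration. It only shows that the relations struct ends up with one field per distinct key.

```python
import json


def infer_type(value):
    """Return a simplified type for a parsed JSON value (dict = struct)."""
    if isinstance(value, dict):
        return {key: infer_type(child) for key, child in value.items()}
    if isinstance(value, str):
        return "String"
    return type(value).__name__


def merge_types(left, right):
    """Merge two inferred types; struct fields are unioned by name."""
    if isinstance(left, dict) and isinstance(right, dict):
        merged = dict(left)
        for key, child in right.items():
            merged[key] = merge_types(merged[key], child) if key in merged else child
        return merged
    return left  # simplification: type conflicts are ignored in this sketch


def infer_schema(lines):
    """Fold the per-document types into one schema, one JSON document per line."""
    schema = {}
    for line in lines:
        schema = merge_types(schema, infer_type(json.loads(line)))
    return schema


def make_documents(count):
    """Build synthetic documents; each one relates to the next two ids."""
    ids = [f"id-{i:04d}" for i in range(count)]
    documents = []
    for i, doc_id in enumerate(ids):
        relations = {ids[(i + 1) % count]: {}, ids[(i + 2) % count]: {}}
        documents.append(json.dumps({"id": doc_id, "relations": relations}))
    return documents


schema = infer_schema(make_documents(1000))
relation_field_count = len(schema["relations"])
print(relation_field_count)  # one struct field per distinct relation key
```

Each document only has two relations, but the merged struct gets a field for every id that appears as a key anywhere in the sample.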

If I do this with a sample of 100,000 documents, every distinct relation key 
becomes its own struct field, so the schema becomes very large and will 
probably crash my job or at least take forever. But since all entries in 
relations share the same key and value types, I could just declare relations to 
be a MapType. The schema wouldn't grow as large and I could still query it 
easily.

So the expected schema would be:
{code}Struct<id: String, relations: Map<String, Struct<>>>{code}

In the version I implemented in the MongoDB driver, a Struct is only turned 
into a Map when both of the following requirements are met: 
* The Struct has over 250 keys
* All value types are compatible
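
The two requirements above can be sketched roughly as follows, again in plain Python. This is not the code from the MongoDB driver or from Spark. The schema representation (dict = struct, tuple = map), the names `collapse_wide_structs`, `merge_types` and `Incompatible`, and the definition of "compatible" (equal scalar types, or structs whose shared fields are compatible) are all assumptions for illustration. Only the threshold of 250 keys is taken from the list above.

```python
from functools import reduce

MAP_KEY_THRESHOLD = 250  # "over 250 keys in a single Struct"


class Incompatible(Exception):
    """Raised when two value types cannot be merged (assumed definition)."""


def merge_types(left, right):
    """Merge two types, or raise Incompatible if they do not fit together."""
    if isinstance(left, dict) and isinstance(right, dict):
        merged = dict(left)
        for name, child in right.items():
            merged[name] = merge_types(merged[name], child) if name in merged else child
        return merged
    if left == right:
        return left
    raise Incompatible(f"{left!r} vs {right!r}")


def collapse_wide_structs(schema, threshold=MAP_KEY_THRESHOLD):
    """Replace wide structs with compatible values by ("Map", key, value)."""
    if not isinstance(schema, dict):
        return schema
    fields = {name: collapse_wide_structs(child, threshold)
              for name, child in schema.items()}
    if len(fields) <= threshold:
        return fields  # requirement 1 not met: keep the struct
    try:
        value_type = reduce(merge_types, fields.values())
    except Incompatible:
        return fields  # requirement 2 not met: keep the struct
    return ("Map", "String", value_type)


wide = {"id": "String", "relations": {f"id-{i:04d}": {} for i in range(300)}}
narrow = {"id": "String", "relations": {"a": {}, "b": {}}}
mixed = {"relations": {**{f"k{i}": {} for i in range(300)}, "odd": "String"}}

collapsed = collapse_wide_structs(wide)
unchanged = collapse_wide_structs(narrow)
kept = collapse_wide_structs(mixed)
print(collapsed)
```

In this sketch the wide relations struct collapses to a map with String keys and an empty struct as value type, while the narrow struct and the struct with mixed value types are left as they are.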


> Detect MapType in Json InferSchema
> ----------------------------------
>
>                 Key: SPARK-21651
>                 URL: https://issues.apache.org/jira/browse/SPARK-21651
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0, 2.1.1, 2.2.0
>            Reporter: Jochen Niebuhr
>            Priority: Minor
>
> When loading JSON files that include a map with highly variable keys, the 
> current schema inference logic might create a very large schema. This leads 
> to long load times and possibly out-of-memory errors. 
> I've already submitted a pull request to the MongoDB Spark driver, which had 
> the same problem. Should I port this logic over to the JSON schema inference 
> class?
> The MongoDB Spark pull request mentioned is: 
> https://github.com/mongodb/mongo-spark/pull/24


