I have a JSON file which is one continuous array of objects of similar type,
[{},{},...], about 1.5 GB uncompressed and 33 MB gzip compressed.

The file was uploaded to HDFS as hugedatafile. It is not a JSONL file; it is
one whole regular JSON document.


[{"id":"1","entityMetadata":{"lastChange":"2018-05-11
01:09:18.0","createdDateTime":"2018-05-11
01:09:18.0","modifiedDateTime":"2018-05-11
01:09:18.0"},"type":"11"},{"id":"2","entityMetadata":{"lastChange":"2018-05-11
01:09:18.0","createdDateTime":"2018-05-11
01:09:18.0","modifiedDateTime":"2018-05-11
01:09:18.0"},"type":"11"},{"id":"3","entityMetadata":{"lastChange":"2018-05-11
01:09:18.0","createdDateTime":"2018-05-11
01:09:18.0","modifiedDateTime":"2018-05-11
01:09:18.0"},"type":"11"}..................]


I get OOM on the executors whenever I try to load this into Spark.

Try 1
val hdf=spark.read.json("/user/tmp/hugedatafile")
hdf.show(2) or hdf.take(1) gives OOM

Try 2
Took a small sampledatafile and extracted its schema, to avoid schema inference on the huge file:
val sampleSchema=spark.read.json("/user/tmp/sampledatafile").schema
val hdf=spark.read.schema(sampleSchema).json("/user/tmp/hugedatafile")
hdf.show(2) or hdf.take(1) is stuck for 1.5 hrs and then gives OOM

Try 3
Repartitioned it before performing the action.
Gives OOM
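For reference, a sketch of what Try 3 looked like (assuming the same sampleSchema from Try 2; the partition count of 200 is an arbitrary choice I experimented with):

val hdf = spark.read.schema(sampleSchema).json("/user/tmp/hugedatafile")
// repartition only redistributes rows AFTER the file is parsed,
// so the initial read of the single JSON document still happens in one task
val repartitioned = hdf.repartition(200)
repartitioned.show(2)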

Try 4
Read https://issues.apache.org/jira/browse/SPARK-20980 completely.
val hdf = spark.read.option("multiLine", true).schema(sampleSchema).json("/user/tmp/hugedatafile")
hdf.show(1) or hdf.take(1) gives OOM


Can anyone help me here?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
