Re: Dataframe from 1.5G json (non JSONL)

2018-06-06 Thread raksja
It's happening in the executor:

    #
    # java.lang.OutOfMemoryError: Java heap space
    # -XX:OnOutOfMemoryError="kill -9 %p"
    #   Executing /bin/sh -c "kill -9 25800"...

Re: Dataframe from 1.5G json (non JSONL)

2018-06-06 Thread Jay
I might have missed it, but can you tell whether the OOM is happening in the driver or the executor? Also, it would be good if you could post the actual exception.

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Nicolas Paris
IMO your JSON cannot be read in parallel at all, so Spark only lets you play with memory. I'd say that at some step it has to fit in both a single executor and the driver. I'd try something like 20GB for both the driver and the executors, and use a dynamic number of executors in order to then ...
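
A minimal sketch of the kind of submission Nicolas describes, assuming a YARN cluster; the memory values, class name, and jar are placeholders to adjust for your job:

    spark-submit \
      --master yarn \
      --driver-memory 20g \
      --executor-memory 20g \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --class com.example.HugeJsonJob \
      huge-json-job.jar

Dynamic allocation needs the external shuffle service enabled on the cluster; if it isn't, set --num-executors explicitly instead.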

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread raksja
Yes, I would say that's the first thing I tried. The thing is, even though I provide more executors and more memory for each, this process gets an OOM in only one task, which stays stuck and unfinished. I don't think it's splitting the load across other tasks. I had 11 blocks for that file stored in HDFS.
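
For context, the 11 HDFS blocks do not help here: the line-oriented reader never splits a single record across tasks, so whichever task owns the block where the one huge line begins ends up consuming the entire 1.5 GB. A small check, reusing the path from the thread:

    val lines = spark.read.text("/user/tmp/hugedatafile")
    // Roughly one partition per HDFS block (~11 here), but the file holds
    // only one line, and a line is always read in full by a single task,
    // which is why only one task grows and dies regardless of executor count.
    println(lines.rdd.getNumPartitions)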

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Nicolas Paris
Have you played with the driver/executor memory configuration? Increasing them should avoid the OOM.

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread raksja
Agreed on gzip being non-splittable, but the question I have and the examples I posted above all refer to an uncompressed file: a single JSON file with an array of objects on one continuous line.

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread raksja
Yes, it's in the right format, as we are able to process it in Python. I also agree that JSONL would work if we split that [{},{},...] array of objects into something like this: {} {} {}. But since I get the data from another system in this form and cannot control it, my question is whether it's possible ...
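
If changing the producer is off the table, one workaround is a one-time rewrite of the array into line-delimited JSON with a streaming parser, so the whole 1.5 GB never has to sit in memory at once. A sketch using Jackson (already on Spark's classpath); the local paths are placeholders:

    import java.io.{BufferedWriter, FileInputStream, FileWriter}
    import com.fasterxml.jackson.core.JsonToken
    import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

    val mapper = new ObjectMapper()
    val parser = mapper.getFactory.createParser(new FileInputStream("/tmp/hugedatafile.json"))
    val out    = new BufferedWriter(new FileWriter("/tmp/hugedatafile.jsonl"))

    // Walk the top-level array token by token and emit one object per line.
    assert(parser.nextToken() == JsonToken.START_ARRAY)
    while (parser.nextToken() == JsonToken.START_OBJECT) {
      val obj: JsonNode = mapper.readTree(parser)   // reads exactly one object
      out.write(obj.toString)
      out.newLine()
    }
    out.close()
    parser.close()

The resulting file is JSONL, so spark.read.json can then split it across tasks normally.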

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Holden Karau
If it's one 33 MB file which decompresses to 1.5 GB, then there is also a chance you need to split the inputs, since gzip is a non-splittable compression format.

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Anastasios Zouzias
Are you sure that your JSON file has the right format? spark.read.json(...) expects a file where *each line is a JSON object*. My wild guess is that

    val hdf = spark.read.json("/user/tmp/hugedatafile")
    hdf.show(2) or hdf.take(1)

gives an OOM because it tries to fetch all the data into the driver. Can you ...
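
For completeness, Spark 2.2 and later also have a multiLine option that lets the JSON reader parse a whole-file array like this one; the entire parse still happens in a single task, so that task still needs enough memory for the file, but it removes the one-object-per-line requirement. A sketch reusing the path from the thread:

    val hdf = spark.read.option("multiLine", true).json("/user/tmp/hugedatafile")
    hdf.printSchema()
    // Once it parses, writing it back out produces one object per line (JSONL),
    // so later jobs can read it in parallel.
    hdf.write.json("/user/tmp/hugedatafile_jsonl")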

Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread raksja
I have a JSON file which is a continuous array of objects of similar type, [{},{}...], about 1.5 GB uncompressed and 33 MB gzip-compressed. This hugedatafile is uploaded to HDFS, and it is not a JSONL file; it's a whole regular JSON file:

    [{"id":"1","entityMetadata":{"lastChange":"2018-05-11 ...