Re: Dataframe from 1.5G json (non JSONL)

2018-06-06 Thread raksja
It's happening in the executor:

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 25800"...

-- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
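For reference, a hedged sketch of the kind of memory bump discussed in this thread, raising executor and driver memory at submit time. The 8g/4g values and the jar name are illustrative, not from the thread, and (as the poster reports below) more memory alone does not help when the whole file lands in one task:

```
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --driver-memory 4g \
  my-json-job.jar
```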

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread raksja
Yes, I would say that's the first thing I tried. The thing is, even though I provide more executors and more memory for each, this process gets an OOM in a single task, which stays stuck and unfinished. I don't think it's splitting the load across other tasks. The file I stored in HDFS had 11 blocks.

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread raksja
Agreed, gzip is non-splittable, but the question I have and the examples I posted above all refer to the uncompressed file: a single JSON file with an array of objects on one continuous line.

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread raksja
Yes, it's in the right format, as we are able to process it in Python. I also agree that JSONL would work if we split the [{},{},...] array of objects into something like this: {} {} {} But since I receive the data from another system in this form and cannot control that, my question is whether it's possible
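The split described above, turning a [{},{},...] array into one object per line, can be done as a one-off preprocessing step in plain Python before handing the data to Spark. A minimal sketch using only the stdlib; the function name and paths are illustrative, and note it loads the whole document once, so it needs roughly the file size in RAM:

```python
import json

def array_json_to_jsonl(src_path, dst_path):
    """Rewrite a single top-level JSON array [{}, {}, ...] as JSONL.

    Loads the entire document into memory once (fine as a one-off
    preprocessing step on a machine with enough RAM), then writes
    each element as one compact JSON object per line.
    """
    with open(src_path, "r", encoding="utf-8") as src:
        records = json.load(src)  # the whole top-level array
    with open(dst_path, "w", encoding="utf-8") as dst:
        for rec in records:
            # one compact object per line -- the JSONL contract
            dst.write(json.dumps(rec) + "\n")
```

The resulting file is line-delimited, so Spark's default JSON reader can split it across HDFS blocks and parse it in parallel instead of in a single task.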

Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread raksja
I have a JSON file that is one continuous array of objects of a similar type, [{},{}...], about 1.5 GB uncompressed and 33 MB gzip-compressed. This hugedatafile is uploaded to HDFS, and it is not a JSONL file; it is a whole regular JSON file. [{"id":"1","entityMetadata":{"lastChange":"2018-05-11
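For completeness: Spark 2.2+ has a multiLine JSON option that can parse such a whole-document array directly. A hedged sketch assuming a live pyspark session (the path is illustrative); note that the entire document is still parsed by a single task, which matches the stuck-task OOM behavior reported earlier in the thread:

```
# assumes an existing SparkSession named `spark` (e.g. inside pyspark)
df = spark.read.option("multiLine", True).json("hdfs:///data/hugedatafile.json")
df.printSchema()
```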

Re: Submit many spark applications

2018-05-25 Thread raksja
OK, when should we use which? Do you have any recommendation?

Re: Submit many spark applications

2018-05-25 Thread raksja
When you say Spark uses it, did you mean this: https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala? InProcessLauncher would just start a subprocess, as you mentioned earlier. How about this: does this make a REST API call to

Re: Submit many spark applications

2018-05-25 Thread raksja
Thanks for the reply. Have you tried submitting a Spark job directly to YARN using YarnClient? https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/yarn/client/api/YarnClient.html I'm not sure whether it's performant and scalable.

Re: Submit many spark applications

2018-05-23 Thread raksja
Hi Marcelo, I'm facing the same issue when making spark-submits from an EC2 instance and hitting the native memory limit sooner. We have #1, but we are still on Spark 2.1.0, so I couldn't try #2. Since InProcessLauncher wouldn't use native memory, will it overload the memory of the parent process? Is