Thanks Ted for the input. I was able to get it working with pyspark shell but the same job submitted via 'spark-submit' using client or cluster deploy mode ends up with these errors:
~~~~~ java.lang.OutOfMemoryError: Java heap space at java.lang.Object.clone(Native Method) at akka.util.CompactByteString$.apply(ByteString.scala:410) at akka.util.ByteString$.apply(ByteString.scala:22) at akka.remote.transport.netty.TcpHandlers$class.onMessage(TcpSupport.scala:45) at akka.remote.transport.netty.TcpServerHandler.onMessage(TcpSupport.scala:57) at akka.remote.transport.netty.NettyServerHelpers$class.messageReceived(NettyHelpers.scala:43) at akka.remote.transport.netty.ServerHandler.messageReceived(NettyTransport.scala:179) at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462) at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443) at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:310) at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109) at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312) at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90) at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ERROR Utils: Uncaught exception in thread task-result-getter-3 java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2219) at java.util.ArrayList.grow(ArrayList.java:242) at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216) at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208) at java.util.ArrayList.add(ArrayList.java:440) at com.esotericsoftware.kryo.util.MapReferenceResolver.nextReadId(MapReferenceResolver.java:33) at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:766) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:727) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:275) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:97) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:60) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ~~~~ When using pyspark, I'm passing the options via cmd line: pyspark --master yarn --deploy-mode client --driver-memory 6g --executor-memory 8g --executor-cores 4 while I'm doing the same via conf.set() in the script that I'm using with spark-submit. I do notice the options being picked up correctly under the Environment tab but its not clear as to why pyspark is successful while spark-submit fails. Is there any difference in how these two ways to run the job? Thanks! On Sun, Apr 10, 2016 at 4:28 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Looks like the exception occurred on driver. > > Consider increasing the values for the following config: > > conf.set("spark.driver.memory", "10240m") > conf.set("spark.driver.maxResultSize", "2g") > > Cheers > > On Sat, Apr 9, 2016 at 9:02 PM, Buntu Dev <buntu...@gmail.com> wrote: > >> I'm running it via pyspark against yarn in client deploy mode. I do >> notice in the spark web ui under Environment tab all the options I've set, >> so I'm guessing these are accepted. >> >> On Sat, Apr 9, 2016 at 5:52 PM, Jacek Laskowski <ja...@japila.pl> wrote: >> >>> Hi, >>> >>> (I haven't played with GraphFrames) >>> >>> What's your `sc.master`? How do you run your application -- >>> spark-submit or java -jar or sbt run or...? The reason I'm asking is >>> that few options might not be in use whatsoever, e.g. >>> spark.driver.memory and spark.executor.memory in local mode. >>> >>> Pozdrawiam, >>> Jacek Laskowski >>> ---- >>> https://medium.com/@jaceklaskowski/ >>> Mastering Apache Spark http://bit.ly/mastering-apache-spark >>> Follow me at https://twitter.com/jaceklaskowski >>> >>> >>> On Sat, Apr 9, 2016 at 7:51 PM, Buntu Dev <buntu...@gmail.com> wrote: >>> > I'm running this motif pattern against 1.5M vertices (5.5mb) and 10M >>> (60mb) >>> > edges: >>> > >>> > tgraph.find("(a)-[]->(b); (c)-[]->(b); (c)-[]->(d)") >>> > >>> > I keep running into Java heap space errors: >>> > >>> > ~~~~~ >>> > >>> > ERROR actor.ActorSystemImpl: Uncaught fatal error from thread >>> > [sparkDriver-akka.actor.default-dispatcher-33] shutting down >>> ActorSystem >>> > [sparkDriver] >>> > java.lang.OutOfMemoryError: Java heap space >>> > at scala.reflect.ManifestFactory$$anon$6.newArray(Manifest.scala:90) >>> > at scala.reflect.ManifestFactory$$anon$6.newArray(Manifest.scala:88) >>> > at scala.Array$.ofDim(Array.scala:218) >>> > at akka.util.ByteIterator.toArray(ByteIterator.scala:462) >>> > at akka.util.ByteString.toArray(ByteString.scala:321) >>> > at >>> > >>> akka.remote.transport.AkkaPduProtobufCodec$.decodePdu(AkkaPduCodec.scala:168) >>> > at >>> > >>> akka.remote.transport.ProtocolStateActor.akka$remote$transport$ProtocolStateActor$$decodePdu(AkkaProtocolTransport.scala:513) >>> > at >>> > >>> akka.remote.transport.ProtocolStateActor$$anonfun$5.applyOrElse(AkkaProtocolTransport.scala:357) >>> > at >>> > >>> akka.remote.transport.ProtocolStateActor$$anonfun$5.applyOrElse(AkkaProtocolTransport.scala:352) >>> > at >>> > >>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) >>> > at akka.actor.FSM$class.processEvent(FSM.scala:595) >>> > at >>> > >>> akka.remote.transport.ProtocolStateActor.processEvent(AkkaProtocolTransport.scala:220) >>> > at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:589) >>> > at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:583) >>> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) >>> > at akka.actor.ActorCell.invoke(ActorCell.scala:456) >>> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) >>> > at akka.dispatch.Mailbox.run(Mailbox.scala:219) >>> > at >>> > >>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) >>> > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>> > at >>> > >>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >>> > at >>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>> > at >>> > >>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>> > >>> > ~~~~~ >>> > >>> > >>> > Here is my config: >>> > >>> > conf.set("spark.executor.memory", "8192m") >>> > conf.set("spark.executor.cores", 4) >>> > conf.set("spark.driver.memory", "10240m") >>> > conf.set("spark.driver.maxResultSize", "2g") >>> > conf.set("spark.kryoserializer.buffer.max", "1024mb") >>> > >>> > >>> > Wanted to know if there are any other configs to tweak? >>> > >>> > >>> > Thanks! >>> >> >> >