Try using Spark 1.4.0 with SQL code generation turned on; this should make a huge difference.
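For readers finding this thread later, a sketch of what "turning code generation on" looked like in the Spark 1.4 era. The flag name `spark.sql.codegen` is taken from the 1.4-era SQL configuration; treat the exact property name and the query below as illustrative and check the docs for the release you are running.

```shell
# Launch the spark-sql CLI with expression code generation enabled
# (spark.sql.codegen was off by default in the 1.x releases that had it).
spark-sql --conf spark.sql.codegen=true \
  -e "select distinct isr,event_dt,age,age_cod,sex,year,quarter from aers.aers_demo_view"

# Equivalently, from inside an already-running spark-sql session:
#   SET spark.sql.codegen=true;
```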
On Sat, Jun 13, 2015 at 5:08 PM, Sanjay Subramanian <sanjaysubraman...@yahoo.com> wrote:

> hey guys
>
> I tried the following settings as well. No luck.
>
> --total-executor-cores 24 --executor-memory 4G
>
> BTW on the same cluster, Impala absolutely kills it. Same query: 9 seconds. No memory issues. No issues at all.
>
> In fact I am pretty disappointed with Spark-SQL.
> I have worked with Hive since the 0.9.x releases and taken projects to production successfully, and Hive very rarely craps out.
>
> Whether the Spark folks like what I say or not, yes, my expectations of Spark-SQL are pretty high if I were to change the way we do things at my workplace.
> Until then, we are going to be hugely dependent on Impala and Hive (with SSDs speeding up the shuffle stage, even MR jobs are not that slow now).
>
> I want to clarify, for those of you who may be asking, why I am not using Spark with Scala and am insisting on spark-sql:
>
> - I have already pipelined data from enterprise tables to Hive
> - I am using CDH 5.3.3 (Cloudera starving-developers version)
> - I have close to 300 tables defined as Hive external tables
> - Data is on HDFS
> - On average we have 150 columns per table
> - On an everyday basis we do crazy amounts of ad-hoc joining of new and old tables to get datasets ready for supervised ML
> - I thought that quite simply I could point Spark at the Hive metastore and run queries as I do now; in fact the existing queries should work as-is unless I am using some esoteric Hive/Impala function
>
> Anyway, if there are some settings I can use to get spark-sql to run even in standalone mode, that would be a huge help.
>
> On the pre-production cluster I have Spark on YARN but could never get it to run fairly complex queries, and I have no answers from this group or the CDH groups.
>
> So my assumption is that it's possibly not solved, else I would have gotten very quick answers and responses :-) to my questions, as I always have on all the CDH groups, Spark, Hive.
>
> best regards
>
> sanjay
>
> ------------------------------
> *From:* Josh Rosen <rosenvi...@gmail.com>
> *To:* Sanjay Subramanian <sanjaysubraman...@yahoo.com>
> *Cc:* "user@spark.apache.org" <user@spark.apache.org>
> *Sent:* Friday, June 12, 2015 7:15 AM
> *Subject:* Re: spark-sql from CLI ---> EXCEPTION: java.lang.OutOfMemoryError: Java heap space
>
> It sounds like this might be caused by a memory configuration problem. In addition to looking at the executor memory, I'd also bump up the driver memory, since it appears that your shell is running out of memory when collecting a large query result.
>
> Sent from my phone
>
> On Jun 11, 2015, at 8:43 AM, Sanjay Subramanian <sanjaysubraman...@yahoo.com.INVALID> wrote:
>
> hey guys
>
> Using Hive and Impala daily, intensively.
> Want to transition to spark-sql in CLI mode.
>
> Currently in my sandbox I am using Spark (standalone mode) from the CDH distribution (starving-developer version 5.3.3):
> 3-datanode hadoop cluster
> 32GB RAM per node
> 8 cores per node
>
> spark 1.2.0+cdh5.3.3+371
>
> I am testing some stuff on one view and getting memory errors.
> A possible reason is that the default memory per executor showing on 18080 is 512M.
>
> These options, when used to start the spark-sql CLI, do not seem to have any effect:
> --total-executor-cores 12 --executor-memory 4G
>
> /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql -e "select distinct isr,event_dt,age,age_cod,sex,year,quarter from aers.aers_demo_view"
>
> aers.aers_demo_view (7 million+ records)
> ========================================
> isr        bigint   case id
> event_dt   bigint   event date
> age        double   age of patient
> age_cod    string   days, months, years
> sex        string   M or F
> year       int
> quarter    int
>
> VIEW DEFINITION
> ===============
> CREATE VIEW `aers.aers_demo_view` AS SELECT
>   `isr` AS `isr`, `event_dt` AS `event_dt`, `age` AS `age`, `age_cod` AS `age_cod`,
>   `gndr_cod` AS `sex`, `year` AS `year`, `quarter` AS `quarter`
> FROM (
>   SELECT `aers_demo_v1`.`isr`, `aers_demo_v1`.`event_dt`, `aers_demo_v1`.`age`, `aers_demo_v1`.`age_cod`,
>          `aers_demo_v1`.`gndr_cod`, `aers_demo_v1`.`year`, `aers_demo_v1`.`quarter`
>   FROM `aers`.`aers_demo_v1`
>   UNION ALL
>   SELECT `aers_demo_v2`.`isr`, `aers_demo_v2`.`event_dt`, `aers_demo_v2`.`age`, `aers_demo_v2`.`age_cod`,
>          `aers_demo_v2`.`gndr_cod`, `aers_demo_v2`.`year`, `aers_demo_v2`.`quarter`
>   FROM `aers`.`aers_demo_v2`
>   UNION ALL
>   SELECT `aers_demo_v3`.`isr`, `aers_demo_v3`.`event_dt`, `aers_demo_v3`.`age`, `aers_demo_v3`.`age_cod`,
>          `aers_demo_v3`.`gndr_cod`, `aers_demo_v3`.`year`, `aers_demo_v3`.`quarter`
>   FROM `aers`.`aers_demo_v3`
>   UNION ALL
>   SELECT `aers_demo_v4`.`isr`, `aers_demo_v4`.`event_dt`, `aers_demo_v4`.`age`, `aers_demo_v4`.`age_cod`,
>          `aers_demo_v4`.`gndr_cod`, `aers_demo_v4`.`year`, `aers_demo_v4`.`quarter`
>   FROM `aers`.`aers_demo_v4`
>   UNION ALL
>   SELECT `aers_demo_v5`.`primaryid` AS `ISR`, `aers_demo_v5`.`event_dt`, `aers_demo_v5`.`age`, `aers_demo_v5`.`age_cod`,
>          `aers_demo_v5`.`gndr_cod`, `aers_demo_v5`.`year`, `aers_demo_v5`.`quarter`
>   FROM `aers`.`aers_demo_v5`
>   UNION ALL
>   SELECT `aers_demo_v6`.`primaryid` AS `ISR`, `aers_demo_v6`.`event_dt`, `aers_demo_v6`.`age`, `aers_demo_v6`.`age_cod`,
>          `aers_demo_v6`.`sex` AS `GNDR_COD`, `aers_demo_v6`.`year`, `aers_demo_v6`.`quarter`
>   FROM `aers`.`aers_demo_v6`
> ) `aers_demo_view`
>
> 15/06/11 08:36:36 WARN DefaultChannelPipeline: An exception was thrown by a user handler while handling an exception event ([id: 0x01b99855, /10.0.0.19:58117 => /10.0.0.19:52016] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
> java.lang.OutOfMemoryError: Java heap space
>         at
> org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42)
>         at org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34)
>         at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
>         at org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
>         at org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
>         at org.jboss.netty.handler.codec.frame.FrameDecoder.newCumulationBuffer(FrameDecoder.java:507)
>         at org.jboss.netty.handler.codec.frame.FrameDecoder.updateCumulation(FrameDecoder.java:345)
>         at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:312)
>         at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
>         at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
>         at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
>         at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
>         at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
>         at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
>         at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> 15/06/11 08:36:40 ERROR Utils: Uncaught exception in thread task-result-getter-0
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.lang.Long.valueOf(Long.java:577)
>         at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.read(DefaultSerializers.java:113)
>         at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.read(DefaultSerializers.java:103)
>         at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>         at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
>         at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
>         at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
>         at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
>         at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>         at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>         at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
>         at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
>         at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>         at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:171)
>         at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
>         at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:558)
>         at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:352)
>         at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:80)
>         at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
>         at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
>         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1468)
>         at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:48)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> 15/06/11 08:36:38 ERROR ActorSystemImpl: exception on LARS’ timer thread
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at akka.dispatch.AbstractNodeQueue.<init>(AbstractNodeQueue.java:19)
>         at akka.actor.LightArrayRevolverScheduler$TaskQueue.<init>(Scheduler.scala:431)
>         at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:397)
>         at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
>         at java.lang.Thread.run(Thread.java:745)
> Exception in thread "task-result-getter-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.lang.Long.valueOf(Long.java:577)
>         at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.read(DefaultSerializers.java:113)
>         at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.read(DefaultSerializers.java:103)
>         at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>         at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
>         at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
>         at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
>         at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
>         at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>         at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>         at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
>         at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
>         at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>         at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:171)
>         at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
>         at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:558)
>         at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:352)
>         at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:80)
>         at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
>         at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
>         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1468)
>         at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:48)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> 15/06/11 08:36:41 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-scheduler-1] shutting down ActorSystem [sparkDriver]
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at akka.dispatch.AbstractNodeQueue.<init>(AbstractNodeQueue.java:19)
>         at akka.actor.LightArrayRevolverScheduler$TaskQueue.<init>(Scheduler.scala:431)
>         at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:397)
>         at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
>         at java.lang.Thread.run(Thread.java:745)
> 15/06/11 08:36:46 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-4] shutting down ActorSystem [sparkDriver]
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> 15/06/11 08:36:46 ERROR SparkSQLDriver: Failed in [select distinct isr,event_dt,age,age_cod,sex,year,quarter from aers.aers_demo_view]
> org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
>         at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>         at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1428)
>         at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:201)
>         at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
>         at akka.actor.ActorCell.terminate(ActorCell.scala:338)
>         at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
>         at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
>         at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:218)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 15/06/11 08:36:51 WARN DefaultChannelPipeline: An exception was thrown by a user handler while handling an exception event ([id: 0x79935a9b, /10.0.0.35:54028 => /10.0.0.19:52016] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
> java.lang.OutOfMemoryError: Java heap space
> 15/06/11 08:36:52 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-5] shutting down ActorSystem [sparkDriver]
> java.lang.OutOfMemoryError: Java heap space
> 15/06/11 08:36:53 WARN DefaultChannelPipeline: An exception was thrown by a user handler while handling an exception event ([id: 0xcb8c4b5d, /10.0.0.18:46744 => /10.0.0.19:52016] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
> java.lang.OutOfMemoryError: Java heap space
> 15/06/11 08:36:56 WARN NioEventLoop: Unexpected exception in the selector loop.
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> 15/06/11 08:36:57 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-18] shutting down ActorSystem [sparkDriver]
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> 15/06/11 08:36:58 ERROR Utils: Uncaught exception in thread task-result-getter-3
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> Exception in thread "task-result-getter-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
> 15/06/11 08:37:01 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-4] shutting down ActorSystem [sparkDriver]
> java.lang.OutOfMemoryError: Java heap space
> Time taken: 70.982 seconds
> 15/06/11 08:37:06 WARN QueuedThreadPool: 4 threads could not be stopped
> 15/06/11 08:37:11 ERROR MapOutputTrackerMaster: Error communicating with MapOutputTracker
> akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/MapOutputTracker#-2109395547]] had already been terminated.
>         at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
>         at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:111)
>         at org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:122)
>         at org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:330)
>         at org.apache.spark.SparkEnv.stop(SparkEnv.scala:83)
>         at org.apache.spark.SparkContext.stop(SparkContext.scala:1210)
>         at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66)
>         at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107)
> Exception in thread "Thread-3" org.apache.spark.SparkException: Error communicating with MapOutputTracker
>         at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:116)
>         at org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:122)
>         at org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:330)
>         at org.apache.spark.SparkEnv.stop(SparkEnv.scala:83)
>         at org.apache.spark.SparkContext.stop(SparkContext.scala:1210)
>         at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66)
>         at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107)
> Caused by: akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/MapOutputTracker#-2109395547]] had already been terminated.
>         at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
>         at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:111)
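The memory bump Josh suggests above (raise driver memory as well as executor memory, since `select distinct` pulls the whole result back to the driver) would look roughly like this. The 4G values are illustrative, not tuned recommendations; adjust for your nodes.

```shell
# Sketch: relaunch the spark-sql CLI with the driver heap raised too,
# so collecting a large distinct result does not OOM the shell process.
/opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql \
  --driver-memory 4G \
  --executor-memory 4G \
  --total-executor-cores 24 \
  -e "select distinct isr,event_dt,age,age_cod,sex,year,quarter from aers.aers_demo_view"
```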