Hello all, I've been wrestling with this problem all day and any suggestions would be greatly appreciated.
I'm trying to test reading a Parquet file stored in S3 using a Spark cluster deployed on EC2. The following works in the spark-shell when run completely locally on my own machine (i.e. with no --master option passed to the spark-shell command):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._

    val p = parquetFile("s3n://[bucket]/path-to-parquet-dir/")
    p.registerAsTable("s")
    sql("select count(*) from s").collect

I have an EC2 deployment of Spark (I've tried versions 1.0.2 and 1.1.0-rc4) using the standalone cluster manager, deployed with the spark-ec2 script. Running the same code in a spark-shell connected to the cluster, it basically hangs on the select statement. The workers/slaves simply time out and restart every 30 seconds when they hit what appears to be an activity timeout, as if there's no activity coming from the spark-shell. (Based on what I see in the stderr logs for the job, I assume this is the expected behavior when a connected spark-shell is sitting idle.)
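As an aside, I'm not embedding AWS keys in the s3n:// URI above. In case it's relevant, my understanding is that credentials can instead be set on the SparkContext's Hadoop configuration; a sketch with placeholder keys, using the standard Hadoop s3n property names:

```scala
// Sketch: supplying S3 credentials via the Hadoop configuration rather than
// embedding them in the s3n:// URI. Property names are the standard Hadoop
// s3n ones; the key values here are placeholders.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<access-key-id>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<secret-access-key>")
```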
I see these messages about every 30 seconds:

    14/09/08 17:43:08 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
    14/09/08 17:43:09 INFO AppClient$ClientActor: Executor updated: app-20140908213842-0002/7 is now EXITED (Command exited with code 1)
    14/09/08 17:43:09 INFO SparkDeploySchedulerBackend: Executor app-20140908213842-0002/7 removed: Command exited with code 1
    14/09/08 17:43:09 INFO AppClient$ClientActor: Executor added: app-20140908213842-0002/8 on worker-20140908183422-ip-10-60-107-194.ec2.internal-53445 (ip-10-60-107-194.ec2.internal:53445) with 2 cores
    14/09/08 17:43:09 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140908213842-0002/8 on hostPort ip-10-60-107-194.ec2.internal:53445 with 2 cores, 4.0 GB RAM
    14/09/08 17:43:09 INFO AppClient$ClientActor: Executor updated: app-20140908213842-0002/8 is now RUNNING

Eventually it fails with:

    14/09/08 17:44:16 INFO AppClient$ClientActor: Executor updated: app-20140908213842-0002/9 is now EXITED (Command exited with code 1)
    14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Executor app-20140908213842-0002/9 removed: Command exited with code 1
    14/09/08 17:44:16 ERROR SparkDeploySchedulerBackend: Application has been killed.
    Reason: Master removed our application: FAILED
    14/09/08 17:44:16 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
    14/09/08 17:44:16 INFO TaskSchedulerImpl: Cancelling stage 1
    14/09/08 17:44:16 INFO DAGScheduler: Failed to run collect at SparkPlan.scala:85
    14/09/08 17:44:16 INFO SparkUI: Stopped Spark web UI at http://192.168.10.198:4040
    14/09/08 17:44:16 INFO DAGScheduler: Stopping DAGScheduler
    14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Shutting down all executors
    14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
    org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

As for the "Initial job has not accepted any resources" warning: I'm running the spark-shell command with:

    SPARK_MEM=2g ./spark-shell --master spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077

According to the master web page each node has 6 GB, so I'm not sure why I'm seeing that message either. If I run with less than 2g I get the following in my spark-shell:

    14/09/08 17:47:38 INFO Remoting: Remoting shut down
    14/09/08 17:47:38 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
    java.io.IOException: Error reading summaries
        at parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:128)
        ....
    Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)

I'm not sure whether this exception comes from the spark-shell JVM itself, or is forwarded from the master or from a worker via the master. Any help would be greatly appreciated.

Thanks,
Jim

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Querying-a-parquet-file-in-s3-with-an-ec2-install-tp13737.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
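P.S. One more data point on the memory settings: my understanding is that SPARK_MEM is deprecated in Spark 1.0+ in favor of the spark.executor.memory property, so presumably the equivalent could also be set in conf/spark-defaults.conf, something like the sketch below (the 2g just mirrors what I'm passing via SPARK_MEM):

```
# conf/spark-defaults.conf (sketch; value mirrors the SPARK_MEM=2g above)
spark.executor.memory   2g
```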