Hello all,

I've been wrestling with this problem all day and any suggestions would be
greatly appreciated.

I'm trying to test reading a Parquet file stored in S3 using a Spark
cluster deployed on EC2. The following works in the spark-shell when run
completely locally on my own machine (i.e. no --master option passed to the
spark-shell command):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
// Load the Parquet directory from S3 and register it as a table
val p = parquetFile("s3n://[bucket]/path-to-parquet-dir/")
p.registerAsTable("s")
sql("select count(*) from s").collect()
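In case it's relevant, the AWS credentials are supplied as environment variables before launching the shell; as I understand it, Spark copies these into the Hadoop configuration so the s3n:// filesystem can authenticate. Roughly (a sketch, with placeholder values):

```shell
# Assumed setup: export AWS credentials so Spark can populate
# fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey in the Hadoop conf.
export AWS_ACCESS_KEY_ID=...        # placeholder
export AWS_SECRET_ACCESS_KEY=...    # placeholder
./spark-shell
```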

I have an EC2 deployment of Spark (I've tried versions 1.0.2 and 1.1.0-rc4)
using the standalone cluster manager, deployed with the spark-ec2 script.

Running the same code in a spark-shell connected to the cluster, it
basically hangs on the select statement. The workers/slaves simply time out
and restart every 30 seconds when they hit what appears to be an inactivity
timeout, as if there's no activity coming from the spark-shell (based on
what I see in the stderr logs for the job, I assume this is the expected
behavior when connected from a spark-shell that's sitting idle).

I see these messages about every 30 seconds:

14/09/08 17:43:08 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory
14/09/08 17:43:09 INFO AppClient$ClientActor: Executor updated:
app-20140908213842-0002/7 is now EXITED (Command exited with code 1)
14/09/08 17:43:09 INFO SparkDeploySchedulerBackend: Executor
app-20140908213842-0002/7 removed: Command exited with code 1
14/09/08 17:43:09 INFO AppClient$ClientActor: Executor added:
app-20140908213842-0002/8 on
worker-20140908183422-ip-10-60-107-194.ec2.internal-53445
(ip-10-60-107-194.ec2.internal:53445) with 2 cores
14/09/08 17:43:09 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20140908213842-0002/8 on hostPort ip-10-60-107-194.ec2.internal:53445
with 2 cores, 4.0 GB RAM
14/09/08 17:43:09 INFO AppClient$ClientActor: Executor updated:
app-20140908213842-0002/8 is now RUNNING

Eventually it fails with:

14/09/08 17:44:16 INFO AppClient$ClientActor: Executor updated:
app-20140908213842-0002/9 is now EXITED (Command exited with code 1)
14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Executor
app-20140908213842-0002/9 removed: Command exited with code 1
14/09/08 17:44:16 ERROR SparkDeploySchedulerBackend: Application has been
killed. Reason: Master removed our application: FAILED
14/09/08 17:44:16 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
have all completed, from pool 
14/09/08 17:44:16 INFO TaskSchedulerImpl: Cancelling stage 1
14/09/08 17:44:16 INFO DAGScheduler: Failed to run collect at
SparkPlan.scala:85
14/09/08 17:44:16 INFO SparkUI: Stopped Spark web UI at
http://192.168.10.198:4040
14/09/08 17:44:16 INFO DAGScheduler: Stopping DAGScheduler
14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Shutting down all
executors
14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Asking each executor to
shut down
14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Asking each executor to
shut down
org.apache.spark.SparkException: Job aborted due to stage failure: Master
removed our application: FAILED
        at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

As for the "Initial job has not accepted any resources" warning: I'm
running the spark-shell command with:

SPARK_MEM=2g ./spark-shell --master
spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077
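(As an aside, my understanding is that SPARK_MEM is deprecated in the 1.x line; the equivalent launch with explicit flags would presumably look like the following, assuming spark-shell forwards these options to spark-submit:)

```shell
# Sketch of the same launch with explicit memory flags instead of SPARK_MEM
# (assumption: spark-shell passes --driver-memory/--executor-memory through
# to spark-submit in 1.0+).
./spark-shell \
  --master spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077 \
  --driver-memory 2g \
  --executor-memory 2g
```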

According to the master web UI each node has 6 GB, so I'm not sure why I'm
seeing that message either. If I run with less than 2g I get the following
in my spark-shell:

14/09/08 17:47:38 INFO Remoting: Remoting shut down
14/09/08 17:47:38 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
shut down.
java.io.IOException: Error reading summaries
        at
parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:128)
        ....
Caused by: java.util.concurrent.ExecutionException:
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)

I'm not sure whether this exception comes from the spark-shell JVM itself or
is forwarded from the master or a worker via the master.

Any help would be greatly appreciated.

Thanks
Jim