We have a Spark application that runs a number of ETL jobs, writing the outputs to Redshift (using databricks/spark-redshift). This is triggered by calling DataFrame.write.save on the different DataFrames one after another. I noticed that while the output of one job is being loaded into Redshift (which can take ~20 minutes for some jobs), the cluster sits idle.
In order to maximise the use of the cluster, we tried starting a thread for each job so that they could all be submitted simultaneously, letting another job use the cluster while one output is being written to Redshift. However, when this is run, it fails with a TimeoutException (see stack trace below). Would it make sense to increase "spark.sql.broadcastTimeout"? I'm not sure that would actually solve anything. Should it not be possible to save multiple DataFrames simultaneously? Any other hints on how to make better use of the cluster's resources would be appreciated. Thanks.

Stack trace:

Exception in thread "main" java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    ...
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    ...
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
    at org.apache.spark.sql.DataFrame.rdd$lzycompute(DataFrame.scala:1676)
    at org.apache.spark.sql.DataFrame.rdd(DataFrame.scala:1673)
    at org.apache.spark.sql.DataFrame.mapPartitions(DataFrame.scala:1465)
    at com.databricks.spark.redshift.RedshiftWriter.unloadData(RedshiftWriter.scala:264)
    at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:374)
    at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:106)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Concurrent-Spark-jobs-tp26011.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
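For reference, here is a minimal sketch of the thread-per-job pattern described above, using plain Scala Futures on a dedicated thread pool. The job names, the pool size, and the string results are placeholders: in the real application each Future body would hold one of the df.write.save calls, which blocks its thread until the Redshift load finishes.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContextExecutorService, ExecutionContext, Future}
import scala.concurrent.duration._

object ConcurrentSaves {
  def main(args: Array[String]): Unit = {
    // One thread per concurrent job; each save blocks its thread for the
    // duration of the Redshift load, so size the pool to the job count.
    implicit val ec: ExecutionContextExecutorService =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

    // Placeholder jobs; in the real app each Future body would be a
    // df.write.format("com.databricks.spark.redshift")...save() call.
    val jobs = Seq("job_a", "job_b", "job_c").map { name =>
      Future {
        s"$name done"
      }
    }

    // Block the driver until every save has finished.
    val results = Await.result(Future.sequence(jobs), 60.minutes)
    results.foreach(println)

    ec.shutdown()
  }
}
```

Running this prints "job_a done", "job_b done", "job_c done" (Future.sequence preserves input order). Note that a long Await timeout here only covers the driver-side wait; it is separate from "spark.sql.broadcastTimeout", which governs how long an executor-side broadcast join waits, as in the stack trace above.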