We have a Spark application that runs a number of ETL jobs, writing the outputs to Redshift (using databricks/spark-redshift). This is triggered by calling DataFrame.write.save on the different DataFrames one after another. I noticed that while the output of one job is being loaded into Redshift (which can take ~20 minutes for some jobs), the cluster sits idle.
In order to maximise the use of the cluster, we tried starting a thread for each job so that they could all be submitted simultaneously, letting another job use the cluster while one output is being written to Redshift. However, when this is run, it fails with a TimeoutException (see stack trace below). Would it make sense to increase "spark.sql.broadcastTimeout"? I'm not sure that would actually solve anything. Should it not be possible to save multiple DataFrames simultaneously? Any other hints on how to make better use of the cluster's resources would be appreciated. Thanks.

Stack trace:

Exception in thread "main" java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    ...
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    ...
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
    at org.apache.spark.sql.DataFrame.rdd$lzycompute(DataFrame.scala:1676)
    at org.apache.spark.sql.DataFrame.rdd(DataFrame.scala:1673)
    at org.apache.spark.sql.DataFrame.mapPartitions(DataFrame.scala:1465)
    at com.databricks.spark.redshift.RedshiftWriter.unloadData(RedshiftWriter.scala:264)
    at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:374)
    at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:106)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Concurrent-Spark-jobs-tp26011.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
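For reference, here is a minimal sketch of the thread-per-job pattern described above, using plain Scala Futures on a dedicated thread pool. The job names, the pool size, and the string results are placeholders: in the real application each Future body would hold one of the df.write.save calls, which blocks its thread until the Redshift load finishes.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContextExecutorService, ExecutionContext, Future}
import scala.concurrent.duration._

object ConcurrentSaves {
  def main(args: Array[String]): Unit = {
    // One thread per concurrent job; each save blocks its thread for the
    // duration of the Redshift load, so size the pool to the job count.
    implicit val ec: ExecutionContextExecutorService =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

    // Placeholder jobs; in the real app each Future body would be a
    // df.write.format("com.databricks.spark.redshift")...save() call.
    val jobs = Seq("job_a", "job_b", "job_c").map { name =>
      Future {
        s"$name done"
      }
    }

    // Block the driver until every save has finished.
    val results = Await.result(Future.sequence(jobs), 60.minutes)
    results.foreach(println)

    ec.shutdown()
  }
}
```

Running this prints "job_a done", "job_b done", "job_c done" (Future.sequence preserves input order). Note that a long Await timeout here only covers the driver-side wait; it is separate from "spark.sql.broadcastTimeout", which governs how long an executor-side broadcast join waits, as in the stack trace above.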