Hi,

Just a thought: could we use Spark Job Server and trigger the jobs through
its REST API? That way all jobs would share the same SparkContext and could
run in parallel.
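
For illustration, a rough sketch of what one job might look like against
Spark Job Server's classic SparkJob trait (the EtlJob name and the
"input.path" setting are made up here; check the spark-jobserver docs for
the exact API in your version):

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

    // Hypothetical ETL job; every job object like this runs inside the
    // shared, long-lived SparkContext that Job Server manages.
    object EtlJob extends SparkJob {

      // Accept every submission here; a real job would check its config and
      // return SparkJobInvalid("reason") when something required is missing.
      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        SparkJobValid

      // The actual work, executed against the shared context, so several
      // jobs submitted over REST can run on the cluster at the same time.
      override def runJob(sc: SparkContext, config: Config): Any =
        sc.textFile(config.getString("input.path")).count()
    }

Submission would then be over REST, roughly: create a shared context once
(POST /contexts/shared-context) and fire each job at it
(POST /jobs?appName=etl&classPath=EtlJob&context=shared-context&sync=false);
the exact parameters depend on the Job Server version.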

If anyone has other thoughts, please share.
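
As for the TimeoutException in the quoted mail below: the broadcast timeout
can be raised before kicking off the concurrent saves, although whether that
actually fixes the underlying stall is another question. A rough sketch of
the thread-per-save approach with the timeout bumped (etlOutputs, jdbcUrl and
tempDir are hypothetical names, the options and save mode are placeholders,
and this assumes a Spark 1.5/1.6-era SQLContext with databricks/spark-redshift):

    import java.util.concurrent.Executors
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, ExecutionContext, Future}
    import org.apache.spark.sql.{DataFrame, SQLContext}

    def saveAllConcurrently(sqlContext: SQLContext,
                            etlOutputs: Map[String, DataFrame],
                            jdbcUrl: String,
                            tempDir: String): Unit = {
      // Default is 300s; give broadcast joins longer, since several saves now
      // compete for executors while another job's Redshift COPY is in flight.
      sqlContext.setConf("spark.sql.broadcastTimeout", "1200")

      // One thread per save; each DataFrameWriter.save() blocks until the
      // Redshift load finishes, so the pool size bounds the concurrency.
      implicit val ec: ExecutionContext =
        ExecutionContext.fromExecutorService(
          Executors.newFixedThreadPool(etlOutputs.size))

      val saves = etlOutputs.map { case (table, df) =>
        Future {
          df.write
            .format("com.databricks.spark.redshift")
            .option("url", jdbcUrl)
            .option("dbtable", table)
            .option("tempdir", tempDir)
            .mode("append")
            .save()
        }
      }

      // Block until every load has finished (or one of them has failed).
      saves.foreach(f => Await.result(f, Duration.Inf))
    }

(A real version would also shut the thread pool down afterwards and probably
cap it at a fixed size rather than one thread per DataFrame.)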

Regards,
Rajesh

On Tue, Jan 19, 2016 at 10:28 PM, emlyn <em...@swiftkey.com> wrote:

> We have a Spark application that runs a number of ETL jobs, writing the
> outputs to Redshift (using databricks/spark-redshift). This is triggered by
> calling DataFrame.write.save on the different DataFrames one after another.
> I noticed that while the output of one job is being loaded into Redshift
> (which can take ~20 minutes for some jobs), the cluster is sitting idle.
>
> In order to maximise the use of the cluster, we tried starting a thread for
> each job so that they can all be submitted simultaneously, and therefore the
> cluster can be utilised by another job while one is being written to
> Redshift.
>
> However, when this is run, it fails with a TimeoutException (see stack trace
> below). Would it make sense to increase "spark.sql.broadcastTimeout"? I'm
> not sure that would actually solve anything. Should it not be possible to
> save multiple DataFrames simultaneously? Or are there any other hints on how
> to make better use of the cluster's resources?
>
> Thanks.
>
>
> Stack trace:
>
> Exception in thread "main" java.util.concurrent.ExecutionException:
> java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> ...
>         at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>         at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> ...
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:107)
>         at org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>         at org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>         at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>         at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
>         at org.apache.spark.sql.DataFrame.rdd$lzycompute(DataFrame.scala:1676)
>         at org.apache.spark.sql.DataFrame.rdd(DataFrame.scala:1673)
>         at org.apache.spark.sql.DataFrame.mapPartitions(DataFrame.scala:1465)
>         at com.databricks.spark.redshift.RedshiftWriter.unloadData(RedshiftWriter.scala:264)
>         at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:374)
>         at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:106)
>         at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
>         at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Concurrent-Spark-jobs-tp26011.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
