Thanks for the responses (not sure why they aren't showing up on the list).
Michael wrote:

> The JDBC wrapper for Redshift should allow you to follow these
> instructions. Let me know if you run into any more issues.
> http://apache-spark-user-list.1001560.n3.nabble.com/best-practices-for-pushing-an-RDD-into-a-database-td2681.html

I'm not sure that this solves my problem - if I understand it correctly, this splits a database write over multiple concurrent connections (one from each partition), whereas what I want is to allow other tasks to continue running on the cluster while the write to Redshift is taking place. Also, I don't think it's good practice to load data into Redshift with INSERT statements over JDBC - the recommended approach is to use the bulk load (COPY) commands, which can analyse the data and automatically set appropriate compression etc. on the table.

Rajesh wrote:

> Just a thought. Can we use Spark Job Server and trigger jobs through rest
> apis. In this case, all jobs will share same context and run the jobs
> parallel.
> If any one has other thoughts please share

I'm not sure this would work in my case, as they are not completely separate jobs, just different outputs to Redshift that share intermediate results. Running them as completely separate jobs would mean recalculating the intermediate results for each output. I suppose it might be possible to persist the intermediate results somewhere, and then delete them once all the jobs have run, but that starts to add a lot of complication which I'm not sure is justified.
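To make the "persist the intermediate results" idea concrete, here is a rough sketch in the same kind of pseudocode as my application (the checkpoint path and transform names are made up):

```
// job 0: compute the shared intermediate result once and persist it
df1 = transform1(sqlCtx.read().options(...).parquet("path/to/data"))
df1.write().parquet("path/to/checkpoint")

// jobs 1..N (e.g. triggered separately via Spark Job Server):
// each reads the checkpoint back instead of recomputing transform1
df2 = transform2(sqlCtx.read().parquet("path/to/checkpoint"))
df2.write().options(...).format("com.databricks.spark.redshift").save()

// once all output jobs have finished, delete path/to/checkpoint
```

This would avoid recomputing transform1 per output, at the cost of managing the checkpoint's lifecycle ourselves.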
Maybe some pseudocode would help clarify things, so here is a very simplified view of our Spark application:

    // load and transform data, then cache the result
    df1 = transform1(sqlCtx.read().options(...).parquet("path/to/data"))
    df1.cache()

    // perform some further transforms of the cached data
    df2 = transform2(df1)
    df3 = transform3(df1)

    // write the final data out to Redshift
    df2.write().options(...).format("com.databricks.spark.redshift").save()
    df3.write().options(...).format("com.databricks.spark.redshift").save()

When the application runs, the steps are executed in the following order:

- scan parquet folder
- transform1 executes
- df1 stored in cache
- transform2 executes
- df2 written to Redshift (while the cluster sits idle)
- transform3 executes
- df3 written to Redshift

I would like transform3 to begin executing as soon as the cluster has capacity, without having to wait for df2 to be written to Redshift, so I tried rewriting the last two lines as (again pseudocode):

    f1 = future { df2.write().options(...).format("com.databricks.spark.redshift").save() }.execute()
    f2 = future { df3.write().options(...).format("com.databricks.spark.redshift").save() }.execute()
    f1.get()
    f2.get()

in the hope that the first write would no longer block the following steps, but instead it fails with a TimeoutException (see the stack trace in my previous message).

Is there a way to start the different writes concurrently, or is that not possible in Spark?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Concurrent-Spark-jobs-tp26011p26030.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
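P.S. For anyone reading along, the thread-based pattern I'm trying to use can be sketched in plain Java, independent of Spark. The write() method here is a hypothetical stand-in for the blocking Redshift save; as I understand it, Spark's scheduler is thread-safe, so jobs submitted from separate threads of the same application should be able to run concurrently:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ConcurrentWrites {
    // Hypothetical stand-in for the blocking Redshift save
    // (df.write().options(...).format(...).save() in the real app).
    static String write(String name) throws InterruptedException {
        Thread.sleep(100); // simulate a slow, blocking write
        return name + " written";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Each submit() runs on its own thread, so each write (and the
        // Spark job it would trigger) can start without waiting for the other.
        Future<String> f1 = pool.submit(() -> write("df2"));
        Future<String> f2 = pool.submit(() -> write("df3"));

        // Block until both writes finish, with an explicit, generous timeout.
        System.out.println(f1.get(10, TimeUnit.SECONDS));
        System.out.println(f2.get(10, TimeUnit.SECONDS));

        pool.shutdown();
    }
}
```

If something shaped like this still fails with a TimeoutException, the cause is presumably in the Spark/Redshift layer rather than in the future mechanics themselves.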