Re: Concurrent Spark jobs

emlyn Thu, 21 Jan 2016 04:24:07 -0800

Thanks for the responses (not sure why they aren't showing up on the list).

Michael wrote:
> The JDBC wrapper for Redshift should allow you to follow these
> instructions. Let me know if you run into any more issues. 
> http://apache-spark-user-list.1001560.n3.nabble.com/best-practices-for-pushing-an-RDD-into-a-database-td2681.html

I'm not sure that this solves my problem. If I understand it correctly,
this splits a database write over multiple concurrent connections (one
from each partition), whereas what I want is to allow other tasks to
continue running on the cluster while the write to Redshift is taking
place.
Also, I don't think it's good practice to load data into Redshift with INSERT
statements over JDBC. It is recommended to use the bulk load commands, which
can analyse the data and automatically set appropriate compression etc. on
the table.

Rajesh wrote:
> Just a thought. Can we use Spark Job Server and trigger jobs through rest
> apis. In this case, all jobs will share same context and run the jobs
> parallel.
> If any one has other thoughts please share

I'm not sure this would work in my case, as they are not completely separate
jobs, just different outputs to Redshift that share intermediate
results. Running them as completely separate jobs would mean recalculating
the intermediate results for each output. I suppose it might be possible to
persist the intermediate results somewhere and then delete them once all
the jobs have run, but that starts to add a lot of complication, which
I'm not sure is justified.

Maybe some pseudocode would help clarify things, so here is a very
simplified view of our Spark application:

// load and transform data, then cache the result
df1 = transform1(sqlCtx.read().options(...).parquet('path/to/data'))
df1.cache()

// perform some further transforms of the cached data
df2 = transform2(df1)
df3 = transform3(df1)

// write the final data out to Redshift
df2.write().format("com.databricks.spark.redshift").options(...).save()
df3.write().format("com.databricks.spark.redshift").options(...).save()

When the application runs, the steps are executed in the following order:
- scan parquet folder
- transform1 executes
- df1 stored in cache
- transform2 executes
- df2 written to Redshift (while the cluster sits idle)
- transform3 executes
- df3 written to Redshift

I would like transform3 to begin executing as soon as the cluster has
capacity, without having to wait for df2 to be written to Redshift, so I
tried rewriting the last two lines as (again pseudocode):

f1 = future{
  df2.write().format("com.databricks.spark.redshift").options(...).save()
}.execute()
f2 = future{
  df3.write().format("com.databricks.spark.redshift").options(...).save()
}.execute()
f1.get()
f2.get()

I hoped that the first write would no longer block the following steps,
but instead it fails with a TimeoutException (see the stack trace in my
previous message). Is there a way to start the different writes concurrently,
or is that not possible in Spark?
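
To make the intent concrete, here is a minimal sketch of the pattern I am
after, written in plain Python with no Spark dependency so it can be run
anywhere. write_to_redshift and run_concurrently are hypothetical stand-ins,
not part of Spark or spark-redshift: the sleep simulates a blocking
df.write()...save() call, and each write is submitted from its own thread so
that the second one does not wait for the first to finish. Whether the real
writes behave this way on the cluster is exactly what I am unsure about.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def write_to_redshift(name, seconds, log):
    """Hypothetical stand-in for a blocking df.write()...save() call."""
    log.append(("start", name))
    time.sleep(seconds)  # simulates waiting on the Redshift load
    log.append(("end", name))
    return name


def run_concurrently(writes, max_workers=2):
    """Submit each blocking write from its own thread and wait for all."""
    log = []  # list.append is thread-safe in CPython
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_to_redshift, name, seconds, log)
            for name, seconds in writes
        ]
        # result() blocks until the write finishes and re-raises any
        # exception thrown inside the worker thread
        results = [f.result() for f in futures]
    return results, log


if __name__ == "__main__":
    started = time.monotonic()
    results, log = run_concurrently([("df2", 0.2), ("df3", 0.2)])
    elapsed = time.monotonic() - started
    print(results, log, round(elapsed, 2))
```

In the log, both "start" entries should appear before either "end" entry,
which is the overlap I would like to see between the two Redshift writes.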

