I would like to know if Spark has any facility by which particular tasks can be scheduled to run on chosen nodes.

The use case: we have a large custom-format database. It is partitioned, and the segments are stored on local SSD across multiple nodes. Incoming queries are matched against the database; this involves either sending each key to the node holding its segment, or broadcasting a batch to all nodes simultaneously, where each node filters the queries and processes them against its own segment, and the results are merged at the end.
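To make that concrete, here is a toy sketch of the routing as it works today (purely illustrative; the names QueryRouting, numSegments, segmentToNode and the hash-based mapping are made up, and the real segment catalogue is our own):

```scala
// Toy illustration of our current routing (names are made up):
// each key hashes to a segment, and each segment lives on exactly one node.
object QueryRouting {
  val numSegments = 16
  val segmentToNode: Map[Int, String] =
    (0 until numSegments).map(i => i -> s"node-${i % 4}").toMap

  private def segmentFor(key: String): Int =
    (key.hashCode & Int.MaxValue) % numSegments

  // Option 1: send each key to the node holding its segment.
  def nodeFor(key: String): String = segmentToNode(segmentFor(key))

  // Option 2: broadcast the whole batch; each node keeps only the queries
  // for the segment it owns, and the results are merged afterwards.
  def queriesForSegment(batch: Seq[String], segment: Int): Seq[String] =
    batch.filter(segmentFor(_) == segment)
}
```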

Currently we are doing this with HTCondor, using a DAG to define the workflow and requirements expressions to match particular jobs to particular databases. However, it's coarse-grained and better suited to batch processing than to real-time use, as well as being cumbersome to define and manage.

I wonder whether Spark would suit this workflow, and if so, how?

It seems that we would either need to schedule parts of our jobs on the appropriate nodes, which I can't see how to do from the job-scheduling documentation:
http://spark.apache.org/docs/latest/job-scheduling.html

Or possibly we could define our partitioned database as a custom type of RDD. However, we would then need to define operations that work on two RDDs simultaneously (i.e. the static database and the incoming set of queries), which doesn't seem to fit Spark well, AFAICS. A rough sketch of what I have in mind is below.
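To be concrete about the custom-RDD idea, here is roughly what I imagine, using Spark's RDD developer API (getPartitions, getPreferredLocations, compute). Record, SegmentPartition and SegmentRDD are made-up names, and the actual segment reader is stubbed out:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical record type; the real format is our own.
case class Record(key: String, value: Array[Byte])

// One partition per database segment, carrying its path and the host
// that has it on local SSD.
class SegmentPartition(val index: Int, val path: String, val host: String)
  extends Partition

// Sketch of a custom RDD whose partitions are pinned to the nodes holding
// the corresponding segments, via Spark's preferred-locations hook.
class SegmentRDD(sc: SparkContext, segments: Seq[(String, String)])
  extends RDD[Record](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    segments.zipWithIndex.map { case ((path, host), i) =>
      new SegmentPartition(i, path, host): Partition
    }.toArray

  // Note: the scheduler treats this as a locality hint, not a hard constraint.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[SegmentPartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[Record] = {
    val p = split.asInstanceOf[SegmentPartition]
    // Here we would open the segment at p.path with our own reader; stubbed out.
    Iterator.empty
  }
}
```

The part I can't see how to express cleanly is combining such an RDD with the incoming query set: presumably something like zipPartitions after repartitioning the queries to match the segments, but that is the bit that doesn't seem to fit Spark's model well.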

Any other ideas on how we could approach this, either with Spark or with other frameworks we should look at? (We would actually prefer non-Java frameworks, but are happy to consider all options.)

Thanks,

Brian Candler.

