What is the driver-side Future for? Are you trying to make the remote
Spark workers execute more requests to your service concurrently? It's
not clear from your messages whether the service is something like a
web service, or just local native code.

So the time spent in your processing -- whatever returns Double -- is
mostly waiting for a blocking service to return? I assume the external
service is not yet at capacity and can handle more concurrent
requests; otherwise there's no point in adding parallelism.

First I'd figure out how many parallel requests the service can handle
before it starts to slow down; call it N. It won't help to make more
than N requests in parallel, so first I'd verify you really aren't
already at that point.

You can make more partitions with repartition(), so that you have at
least N partitions. Then make sure there are enough executors, with
access to enough cores, to run N tasks concurrently on the cluster.
That should maximize parallelism.
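As a rough sketch of that idea (callService, the input source, and N =
50 here are hypothetical stand-ins, not taken from your code):

```scala
import org.apache.spark.rdd.RDD

// Sketch only: `sc` is an existing SparkContext; callService is a
// hypothetical blocking client for the external service; n = 50 is a
// made-up capacity figure, not a recommendation.
def callService(record: String): Double = ??? // blocking HTTP/RPC call

val n = 50
val input: RDD[String] = sc.textFile("input-path") // however your data arrives

val results: RDD[Double] = input
  .repartition(n)      // at least N partitions, so N tasks can run at once
  .map(callService)    // each task makes one blocking call at a time
```

With one blocking call in flight per task, total concurrency is just
the number of simultaneously running tasks, which is why the executor
and core counts matter.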

You can indeed write remote functions that parallelize themselves with
Future (on the executors, not on the driver side), but I think ideally
you get the parallelism from Spark, absent a reason not to.
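For illustration, here is that per-task pattern in plain Scala, with
no Spark: each record's blocking call is wrapped in a Future so the
calls overlap, and the task blocks once at the end for all results.
The same shape would sit inside mapPartitions; callService and the
sleep are stand-ins for your real service.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Stand-in for the blocking external service (assumption):
// pretend each call takes a little while and doubles its input.
def callService(x: Int): Double = { Thread.sleep(10); x * 2.0 }

// What one task could do with its partition's records:
val partition = (1 to 8).toList
val futures  = partition.map(x => Future(callService(x))) // issue calls concurrently
val results  = Await.result(Future.sequence(futures), 30.seconds)

println(results.sum) // prints 72.0
```

Note the default global ExecutionContext is sized to the number of
cores; if you go this route you'd likely want a dedicated thread pool
sized for blocking I/O, and to cap in-flight calls so the cluster-wide
total stays under N.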

On Mon, Sep 8, 2014 at 4:30 PM, DrKhu <khudyakov....@gmail.com> wrote:
> What if, when I traverse RDD, I need to calculate values in dataset by
> calling external (blocking) service? How do you think that could be
> achieved?
>
> val values: Future[RDD[Double]] = Future sequence tasks
>
> I've tried to create a list of Futures, but as RDD is not Traversable,
> Future.sequence is not suitable.
>
> I just wonder, if anyone had such a problem, and how did you solve it? What
> I'm trying to achieve is to get a parallelism on a single worker node, so I
> can call that external service 3000 times per second.
>
> Probably, there is another solution, more suitable for Spark, like having
> multiple worker nodes on a single host.
>
> It would be interesting to know how you cope with such a challenge. Thanks.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-perform-blocking-IO-in-apache-spark-job-tp13704.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

