On Wed, Jan 28, 2015 at 1:44 PM, Matan Safriel <dev.ma...@gmail.com> wrote:
> So I assume I can safely run a function F of mine within the spark driver
> program, without dispatching it to the cluster (?), thereby sticking to one
> piece of code for both a real cluster run over big data, and for small
> on-demand runs for a single input (now and then), both scenarios using my
> same code attached to the same application-specific configuration of my
> business logic. Is that correct?
Yes. A function is just a function; nothing stops you from running it in your program on the driver, and that has no relation to Spark. Spark can also send the same function out to run on distributed data.

> Can I still write its output the same way Spark actions allow for a real
> distributed task?

You can call any code you like to write data from the driver, including things like the HDFS APIs. Spark operations work on RDDs, and a piece of data in memory on the driver is not an RDD. In that sense, no. But you can easily call parallelize() to send the data out to the cluster as an RDD. That may be simpler, though it is less efficient: you copy the data to a remote worker just to save it, instead of saving it directly.

> Would I see it as a task in the monitoring UI (http://<driver-node>:4040) of
> the driver?

The UI shows distributed operations run by Spark. It would not show arbitrary function calls on the driver. I suppose I'm trying to say that those have nothing to do with Spark per se.
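To make the point concrete, here is a minimal Scala sketch of both modes. It assumes a configured SparkContext and an application-specific function `process`; the function name and the HDFS paths are illustrative, not from this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverVsCluster {
  // The same business-logic function is used in both scenarios.
  def process(record: String): String = record.toUpperCase

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("example"))

    // Small on-demand run: a plain local call on the driver. No Spark
    // machinery is involved, and nothing appears in the :4040 UI.
    val single = process("one input")

    // To save that driver-local result through Spark's distributed API,
    // wrap it in an RDD first; note this ships the data to a worker
    // just to write it out.
    sc.parallelize(Seq(single)).saveAsTextFile("hdfs:///tmp/single-output")

    // Real cluster run: the same function applied to distributed data.
    // These operations do show up as jobs in the monitoring UI.
    sc.textFile("hdfs:///data/big-input")
      .map(process)
      .saveAsTextFile("hdfs:///data/big-output")

    sc.stop()
  }
}
```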