On Wed, Jan 28, 2015 at 1:44 PM, Matan Safriel <dev.ma...@gmail.com> wrote:
> So I assume I can safely run a function F of mine within the spark driver
> program, without dispatching it to the cluster (?), thereby sticking to one
> piece of code for both a real cluster run over big data, and for small
> on-demand runs for a single input (now and then), both scenarios using my
> same code attached to the same application-specific configuration of my
> business logic. Is that correct?
Yes. A function is just a function; nothing stops you from running it in your program on the driver, and that has no relation to Spark. Spark can also send the same function out to run on distributed data.

> Can I still write its output the same way Spark actions allow for a real
> distributed task?

You can call any code you like to write data from the driver, including things like the HDFS APIs. Spark operations work on RDDs, and a piece of data in memory on the driver is not an RDD. In that sense, no. But you can easily call parallelize() to send the data out to the cluster as an RDD. That may be simpler, though it is less efficient: you copy the data to a remote worker just to save it, instead of saving it directly.

> Would I see it as a task in the monitoring UI (http://<driver-node>:4040) of
> the driver?

The UI shows distributed operations run by Spark. It would not show arbitrary function calls on the driver. I suppose I'm trying to say that those have nothing to do with Spark per se.
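To make the point concrete, here is a minimal Scala sketch of both modes. It assumes a configured SparkContext and an application-specific function `process`; the function name and the HDFS paths are illustrative, not from this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverVsCluster {
  // The same business-logic function is used in both scenarios.
  def process(record: String): String = record.toUpperCase

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("example"))

    // Small on-demand run: a plain local call on the driver. No Spark
    // machinery is involved, and nothing appears in the :4040 UI.
    val single = process("one input")

    // To save that driver-local result through Spark's distributed API,
    // wrap it in an RDD first; note this ships the data to a worker
    // just to write it out.
    sc.parallelize(Seq(single)).saveAsTextFile("hdfs:///tmp/single-output")

    // Real cluster run: the same function applied to distributed data.
    // These operations do show up as jobs in the monitoring UI.
    sc.textFile("hdfs:///data/big-input")
      .map(process)
      .saveAsTextFile("hdfs:///data/big-output")

    sc.stop()
  }
}
```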