Re: Running a task once on each executor
Hi Christopher,

I also need to make a single function call at the node level. Your suggestion solves the requirement, but it still feels like a workaround: the check gets called on every row. Also, creating static members and methods, especially in a multi-threaded environment, is a code smell. It would be nice to have a way of exposing the nodes that allowed simply invoking a function from the driver on the nodes, without having to do any transformation and loop through every record. That would be more efficient and more flexible from a user's perspective.

Thanks,
Rod
Re: Running a task once on each executor
How exactly does rdd.mapPartitions get executed once in each VM? I am running mapPartitions, and the call function does not seem to execute the code:

    JavaPairRDD<String, String> twos =
        input.map(new Split()).sortByKey().partitionBy(new HashPartitioner(k));
    twos.values().saveAsTextFile(args[2]);
    JavaRDD<String> ls = twos.values().mapPartitions(
        new FlatMapFunction<Iterator<String>, String>() {
          @Override
          public Iterable<String> call(Iterator<String> arg0) throws Exception {
            System.out.println("Usage should call my jar once: " + arg0);
            return Lists.newArrayList();
          }
        });
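(One likely explanation, offered as a hedged note rather than a confirmed diagnosis: mapPartitions is a lazy transformation, so the closure above never runs because no action is ever applied to ls. A minimal sketch that forces evaluation, assuming the snippet above:)

    // mapPartitions is lazy; the closure only runs once an action is
    // applied to the resulting RDD. count() here exists only to trigger it.
    long partitionCalls = ls.count();  // call() now runs once per partition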
Re: Running a task once on each executor
Christopher

Sorry, I might be missing the obvious, but how do I get my function called on all executors used by the app? I don't want to use RDDs unless necessary. Once I start my shell or app, how do I get TaskNonce.getSingleton().doThisOnce() executed on each executor?

@dmpour: rdd.mapPartitions would still work, as the code would only be executed once in each VM, but I was wondering if there is a more efficient way of doing this by using a generated RDD with one partition per executor. That remark was misleading; what I meant was that, in conjunction with the TaskNonce pattern, my function would be called only once per executor as long as the RDD had at least one partition on each executor.

Deenar
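(A minimal sketch of that idea, assuming the Spark 1.x Java API — where FlatMapFunction returns an Iterable; newer releases return an Iterator — and a numExecutors value supplied by the caller. Spark does not guarantee partition placement, so over-partitioning only makes it likely that every executor receives at least one partition:)

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;

    public class RunOncePerExecutor {
      public static void run(JavaSparkContext sc, int numExecutors) {
        List<Integer> seed = new ArrayList<Integer>();
        for (int i = 0; i < numExecutors * 10; i++) seed.add(i);
        // Many small partitions make it likely each executor receives at
        // least one; TaskNonce still guards the work to once per JVM.
        sc.parallelize(seed, numExecutors * 10)
          .mapPartitions(new FlatMapFunction<Iterator<Integer>, Integer>() {
            public Iterable<Integer> call(Iterator<Integer> it) {
              TaskNonce.getSingleton().doThisOnce();
              return new ArrayList<Integer>();
            }
          })
          .count();  // an action forces the job to run
      }
    }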
Re: Running a task once on each executor
Deenar, when you say "just once," have you defined across what (e.g., across multiple threads in the same JVM on the same machine)? In principle you can have multiple executors on the same machine. In any case, assuming it's the same JVM, have you considered using a singleton that maintains done/not-done state and is invoked by each of the instances (TaskNonce.getSingleton().doThisOnce())? You can, e.g., mark the state boolean transient to prevent it from going through serdes.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Tue, Mar 25, 2014 at 10:03 AM, deenar.toraskar <deenar.toras...@db.com> wrote:

Hi

Is there a way in Spark to run a function on each executor just once? I have a couple of use cases.

a) I use an external library that is a singleton. It keeps some global state and provides some functions to manipulate it (e.g. reclaim memory, etc.). I want to check the global state of this library on each executor.
b) To get JVM stats or instrumentation on each executor.

Currently I have a crude way of achieving something similar: I just run a map over a large RDD that is hash partitioned, but this does not guarantee that the function would run just once per executor.

Deenar
Re: Running a task once on each executor
Deenar, the singleton pattern I'm suggesting would look something like this:

    public class TaskNonce {
      private transient boolean mIsAlreadyDone;
      private static transient TaskNonce mSingleton = new TaskNonce();
      private transient Object mSyncObject = new Object();

      public static TaskNonce getSingleton() { return mSingleton; }

      public void doThisOnce() {
        if (mIsAlreadyDone) return;
        synchronized (mSyncObject) {
          if (mIsAlreadyDone) return;  // re-check inside the lock
          mIsAlreadyDone = true;
          // ... do the once-per-JVM work here
        }
      }
    }

which you would invoke as TaskNonce.getSingleton().doThisOnce() from within the map closure. If you're using the Spark Java API, you can put all this code in the mapper class itself. There is no need to require one-row RDD partitions to achieve what you want, if I understand your problem statement correctly.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Tue, Mar 25, 2014 at 11:07 AM, deenar.toraskar <deenar.toras...@db.com> wrote:

Christopher

It is once per JVM. TaskNonce would meet my needs. I guess if I want it once per thread, then a ThreadLocal would do the same. But how do I invoke TaskNonce? What is the best way to generate an RDD to ensure that there is one element per executor?

Deenar
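(A minimal sketch of the once-per-thread variant Deenar mentions, using a ThreadLocal flag; the class and method names here are illustrative, not from the thread:)

    // Once-per-thread variant: each executor task thread keeps its own
    // flag, so the work runs once per thread rather than once per JVM.
    public class ThreadTaskNonce {
      private static final ThreadLocal<Boolean> sDone =
          new ThreadLocal<Boolean>() {
            @Override
            protected Boolean initialValue() { return Boolean.FALSE; }
          };

      public static void doThisOncePerThread() {
        if (sDone.get()) return;
        sDone.set(Boolean.TRUE);
        // ... do the once-per-thread work here
      }
    }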