Re: Running a task once on each executor

2014-08-11 Thread RodrigoB
Hi Christopher,

I am also in need of making a single function call at the node level. Your
suggestion makes sense as a solution to the requirement, but it still feels
like a workaround: the check will get called on every row. Also, relying on
static members and methods, especially in a multi-threaded environment, is a
bad code smell.

It would be nice to have a way of exposing the nodes that allows simply
invoking a function from the driver on the executors, without having to do
any transformation and loop through every record. That would be more
efficient and more flexible from a user's perspective.
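The closest workaround I can picture today is to fake it with a small dummy
RDD processed partition-wise -- a rough sketch only, where sc is the
application's existing JavaSparkContext, the executor count is a made-up
number, and NodeSetup.runOnce() stands in for whatever node-level function we
actually want invoked (Spark also does not guarantee that these partitions
land on every executor):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.function.FlatMapFunction;

// Rough workaround, not a real driver-to-executor API: one dummy element per
// executor, processed partition-wise so the function runs once per partition.
int numExecutors = 4;  // made-up value; would come from the cluster configuration
List<Integer> seeds = new ArrayList<Integer>();
for (int i = 0; i < numExecutors; i++) seeds.add(i);

sc.parallelize(seeds, numExecutors)
  .mapPartitions(new FlatMapFunction<Iterator<Integer>, Integer>() {
    @Override
    public Iterable<Integer> call(Iterator<Integer> it) throws Exception {
      NodeSetup.runOnce();  // hypothetical node-level function
      return Collections.<Integer>emptyList();
    }
  })
  .count();  // an action is needed, otherwise nothing runs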

Thanks,
Rod








Re: Running a task once on each executor

2014-03-27 Thread dmpour23
How exactly does rdd.mapPartitions get executed once in each VM?

I am running mapPartitions, but the call() function does not seem to execute
the code:

JavaPairRDD<String, String> twos = input.map(new
Split()).sortByKey().partitionBy(new HashPartitioner(k));
twos.values().saveAsTextFile(args[2]);

JavaRDD<String> ls = twos.values().mapPartitions(new
FlatMapFunction<Iterator<String>, String>() {
@Override
public Iterable<String> call(Iterator<String> arg0) throws Exception {
   System.out.println("Usage should call my jar once: " + arg0);
   return Lists.newArrayList();
}
});
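Or am I just missing an action? mapPartitions is lazy, so if nothing ever
forces ls, the FlatMapFunction never runs. A minimal sketch of what I mean
(just a guess on my part):

// Guess: forcing an action on ls should make the FlatMapFunction execute.
long n = ls.count();
System.out.println("mapPartitions ran, returned " + n + " elements");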





Re: Running a task once on each executor

2014-03-27 Thread deenar.toraskar
Christopher

Sorry, I might be missing the obvious, but how do I get my function called on
all executors used by the app? I don't want to use RDDs unless necessary.

Once I start my shell or app, how do I get
TaskNonce.getSingleton().doThisOnce() executed on each executor?

@dmpour
rdd.mapPartitions would still work, since the code would only be executed
once in each VM, but I was wondering if there is a more efficient way of
doing this by using a generated RDD with one partition per executor.
That remark was misleading; what I meant was that, in conjunction with the
TaskNonce pattern, my function would be called only once per executor as
long as the RDD had at least one partition on each executor.
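Roughly what I have in mind (just a sketch: largeRdd is a placeholder for an
RDD that happens to have at least one partition on every executor, and
TaskNonce is the class from Christopher's reply):

import java.util.Collections;
import java.util.Iterator;
import org.apache.spark.api.java.function.FlatMapFunction;

// Piggy-back the nonce on an existing RDD; the guarded block runs at most
// once per JVM, provided largeRdd has a partition on every executor.
largeRdd.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
  @Override
  public Iterable<String> call(Iterator<String> it) throws Exception {
    TaskNonce.getSingleton().doThisOnce();  // no-op after the first call in this JVM
    return Collections.<String>emptyList();
  }
}).count();  // an action is needed, otherwise the transformation never runs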

Deenar







Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, when you say "just once", have you defined across what scope (e.g.,
across multiple threads in the same JVM on the same machine)? In principle
you can have multiple executors on the same machine.

In any case, assuming it's the same JVM, have you considered using a
singleton that maintains done/not-done state and is invoked by each of the
task instances (TaskNonce.getSingleton().doThisOnce())? You can, e.g., mark
the state boolean transient to prevent it from going through serdes.



--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen



On Tue, Mar 25, 2014 at 10:03 AM, deenar.toraskar <deenar.toras...@db.com> wrote:

 Hi

 Is there a way in Spark to run a function on each executor just once? I
 have a couple of use cases.

 a) I use an external library that is a singleton. It keeps some global
 state and provides some functions to manipulate it (e.g. reclaim memory,
 etc.). I want to check the global state of this library on each executor.

 b) To get JVM stats or instrumentation on each executor.

 Currently I have a crude way of achieving something similar: I just run a
 map on a large RDD that is hash partitioned, but this does not guarantee
 that it would run just once per executor.

 Deenar






Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, the singleton pattern I'm suggesting would look something like this:

public class TaskNonce {

  private transient boolean mIsAlreadyDone;

  private static transient TaskNonce mSingleton = new TaskNonce();

  private transient Object mSyncObject = new Object();

  public static TaskNonce getSingleton() { return mSingleton; }

  public void doThisOnce() {
    if (mIsAlreadyDone) return;
    synchronized (mSyncObject) {
      if (mIsAlreadyDone) return;
      mIsAlreadyDone = true;
      ...
    }
  }
}

which you would invoke as TaskNonce.getSingleton().doThisOnce() from within
the map closure. If you're using the Spark Java API, you can put all this
code in the mapper class itself.

There is no need to require one-row RDD partitions to achieve what you
want, if I understand your problem statement correctly.
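
For instance, sketching the invocation with the Java API (rdd here is just
whatever JavaRDD<String> you already have; any action at the end will do):

// The nonce is called in every task, but the guarded block only runs the
// first time within each JVM.
rdd.map(new org.apache.spark.api.java.function.Function<String, String>() {
  @Override
  public String call(String s) throws Exception {
    TaskNonce.getSingleton().doThisOnce();
    return s;
  }
}).count();  // map by itself is lazy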



--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen



On Tue, Mar 25, 2014 at 11:07 AM, deenar.toraskar <deenar.toras...@db.com> wrote:

 Christopher

 It is once per JVM. TaskNonce would meet my needs. I guess if I want it
 once per thread, then a ThreadLocal would do the same.

 But how do I invoke TaskNonce? What is the best way to generate an RDD to
 ensure that there is one element per executor?

 Deenar


