Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, when you say "just once", have you defined across what scope (e.g.,
across multiple threads in the same JVM on the same machine)? In principle
you can have multiple executors on the same machine.

In any case, assuming it's the same JVM, have you considered using a
singleton that maintains done/not-done state and is invoked by each of the
instances (TaskNonce.getSingleton().doThisOnce())? You can, e.g., mark the
state boolean "transient" to prevent it from going through serdes.



--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Tue, Mar 25, 2014 at 10:03 AM, deenar.toraskar wrote:

> Hi
>
> Is there a way in Spark to run a function on each executor just once? I
> have a couple of use cases.
>
> a) I use an external library that is a singleton. It keeps some global
> state and provides some functions to manipulate it (e.g. reclaim memory,
> etc.). I want to check the global state of this library on each executor.
>
> b) To get JVM stats or instrumentation on each executor.
>
> Currently I have a crude way of achieving something similar: I just run a
> map on a large RDD that is hash partitioned, but this does not guarantee
> that the function runs just once on each executor.
>
> Deenar
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Running a task once on each executor

2014-03-25 Thread deenar.toraskar
Christopher 

It is once per JVM. TaskNonce would meet my needs. I guess if I want it once
per thread, then a ThreadLocal would do the same.
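
(A minimal sketch of that ThreadLocal variant; ThreadNonce and its method
name are hypothetical:)

public class ThreadNonce {
  // Each thread sees its own flag, so the work runs once per thread, not per JVM.
  private static final ThreadLocal<Boolean> done = new ThreadLocal<Boolean>() {
    @Override protected Boolean initialValue() { return Boolean.FALSE; }
  };

  public static void doThisOncePerThread() {
    if (done.get()) return;
    done.set(Boolean.TRUE);
    // ... once-per-thread work here ...
  }
}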

But how do I invoke TaskNonce? What is the best way to generate an RDD to
ensure that there is one element per executor?

Deenar 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3208.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, the singleton pattern I'm suggesting would look something like this:

public class TaskNonce {

  // transient + volatile: skipped by serialization, and writes are visible
  // across the executor's task threads
  private transient volatile boolean mIsAlreadyDone;

  private static final TaskNonce mSingleton = new TaskNonce();

  private final transient Object mSyncObject = new Object();

  public static TaskNonce getSingleton() { return mSingleton; }

  public void doThisOnce() {
    if (mIsAlreadyDone) return;
    synchronized (mSyncObject) {
      if (mIsAlreadyDone) return;  // re-check: another thread may have won the race
      // ... the once-per-JVM work goes here ...
      mIsAlreadyDone = true;
    }
  }
}

which you would invoke as TaskNonce.getSingleton().doThisOnce() from within
the map closure. If you're using the Spark Java API, you can put all this
code in the mapper class itself.

There is no need to require one-row RDD partitions to achieve what you
want, if I understand your problem statement correctly.
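
(A minimal sketch of that invocation; 'rdd' is a hypothetical JavaRDD<String>,
and any element type works the same way:)

// import org.apache.spark.api.java.function.Function;
JavaRDD<String> result = rdd.map(new Function<String, String>() {
  @Override
  public String call(String row) {
    TaskNonce.getSingleton().doThisOnce();  // no-op after the first call in this JVM
    return row;                             // per-row work continues as usual
  }
});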



--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Tue, Mar 25, 2014 at 11:07 AM, deenar.toraskar wrote:

> Christopher
>
> It is once per JVM. TaskNonce would meet my needs. I guess if I want it
> once per thread, then a ThreadLocal would do the same.
>
> But how do I invoke TaskNonce? What is the best way to generate an RDD to
> ensure that there is one element per executor?
>
> Deenar
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3208.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Running a task once on each executor

2014-03-27 Thread deenar.toraskar
Hi Christopher

>>which you would invoke as TaskNonce.getSingleton().doThisOnce() from
within the map closure.

Say I have a cluster with 24 workers (one thread per worker, i.e.
SPARK_WORKER_CORES=1). My application would have 24 executors, each with its
own VM.

The RDDs I process have millions of rows and many partitions. I could do
rdd.mapPartitions and it would still work, as the code would only be executed
once in each VM, but I was wondering if there is a more efficient way of
doing this by using a generated RDD with one partition per executor.

Also, when I want to return some stats from each executor, rdd.mapPartitions
would return multiple results (one per partition).
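
(A minimal sketch of that stats use case, combining mapPartitions with the
TaskNonce idea; StatsNonce and tryClaim are hypothetical names:)

// import java.util.*;
// import java.util.concurrent.atomic.AtomicBoolean;
// import org.apache.spark.api.java.function.FlatMapFunction;
class StatsNonce {  // same spirit as TaskNonce: one atomic claim per JVM
  private static final AtomicBoolean claimed = new AtomicBoolean(false);
  static boolean tryClaim() { return claimed.compareAndSet(false, true); }
}

JavaRDD<String> stats = rdd.mapPartitions(
    new FlatMapFunction<Iterator<String>, String>() {
      @Override
      public Iterable<String> call(Iterator<String> rows) {
        if (StatsNonce.tryClaim()) {  // true for only the first partition in this JVM
          Runtime rt = Runtime.getRuntime();
          return Collections.singletonList(
              "usedMem=" + (rt.totalMemory() - rt.freeMemory()));
        }
        return Collections.emptyList();  // other partitions contribute nothing
      }
    });
// stats.collect() then yields at most one entry per executor JVM.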

Deenar





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3337.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Running a task once on each executor

2014-03-27 Thread dmpour23
How exactly does rdd.mapPartitions get executed once in each VM?

I am running mapPartitions and the call function does not seem to execute
the code.

JavaPairRDD<String, String> twos = input.map(new
Split()).sortByKey().partitionBy(new HashPartitioner(k));
twos.values().saveAsTextFile(args[2]);

JavaRDD<String> ls = twos.values().mapPartitions(new
FlatMapFunction<Iterator<String>, String>() {
  @Override
  public Iterable<String> call(Iterator<String> arg0) throws Exception {
    System.out.println("Usage should call my jar once: " + arg0);
    return Lists.newArrayList();
  }
});
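
(A likely cause here: mapPartitions, like map, is a lazy transformation. It
only builds a new RDD, and call() runs when an action is invoked on the
result. Since nothing above consumes ls, the function never executes. A
minimal sketch of forcing it:)

ls.count();  // any action works: count, collect, saveAsTextFile, ...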



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3353.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Running a task once on each executor

2014-03-27 Thread Christopher Nguyen
Deenar, dmpour is correct in that there's a many-to-many mapping between
executors and partitions (an executor can be assigned multiple partitions,
and a given partition can in principle move to a different executor).

I'm not sure why you seem to require this problem statement to be solved
with RDDs. It is fairly easy to have something executed once per JVM, using
the pattern I suggested. Is there some other requirement I have missed?

Sent while mobile. Pls excuse typos etc.
On Mar 27, 2014 9:06 AM, "dmpour23"  wrote:

> How exactly does rdd.mapPartitions get executed once in each VM?
>
> I am running mapPartitions and the call function does not seem to execute
> the code.
>
> JavaPairRDD<String, String> twos = input.map(new
> Split()).sortByKey().partitionBy(new HashPartitioner(k));
> twos.values().saveAsTextFile(args[2]);
>
> JavaRDD<String> ls = twos.values().mapPartitions(new
> FlatMapFunction<Iterator<String>, String>() {
>   @Override
>   public Iterable<String> call(Iterator<String> arg0) throws Exception {
>     System.out.println("Usage should call my jar once: " + arg0);
>     return Lists.newArrayList();
>   }
> });
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3353.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Running a task once on each executor

2014-03-27 Thread deenar.toraskar
Christopher

Sorry, I might be missing the obvious, but how do I get my function called on
all executors used by the app? I don't want to use RDDs unless necessary.

Once I start my shell or app, how do I get
TaskNonce.getSingleton().doThisOnce() executed on each executor?

@dmpour
>> rdd.mapPartitions and it would still work, as the code would only be
>> executed once in each VM, but I was wondering if there is a more efficient
>> way of doing this by using a generated RDD with one partition per executor.
This remark was misleading. What I meant was that, in conjunction with the
TaskNonce pattern, my function would be called only once per executor as long
as the RDD had at least one partition on each executor.
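
(A minimal sketch of that generated-RDD approach; the numbers and names are
assumptions, and partition placement is not guaranteed, which is why the
TaskNonce guard is still what enforces "once per JVM":)

// import java.util.*;
// import org.apache.spark.api.java.function.VoidFunction;
int numExecutors = 24;  // from the earlier message in this thread
List<Integer> seeds = new ArrayList<Integer>();
for (int i = 0; i < numExecutors; i++) seeds.add(i);

// One partition per executor is the goal; Spark may still place two
// partitions on one JVM and none on another, so over-provisioning
// (e.g., several partitions per core) raises the odds of full coverage.
sc.parallelize(seeds, numExecutors).foreach(new VoidFunction<Integer>() {
  @Override
  public void call(Integer i) {
    TaskNonce.getSingleton().doThisOnce();  // guarded: at most once per JVM
  }
});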

Deenar





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3393.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Running a task once on each executor

2014-03-27 Thread Christopher Nguyen
Deenar, yes, you may indeed be overthinking a bit how Spark executes
maps/filters etc. I'll focus on the high-order bits so it's clear.

Let's assume you're doing this in Java. Then you'd pass some *MyMapper*
instance to *JavaRDD#map(myMapper)*.

So you'd have a class *MyMapper* extends *Function*. The *call()* method of
that class is effectively the function that will be executed by the workers
on your RDD's rows.

Within that *MyMapper#call()*, you can access static members and methods of
*MyMapper* itself. You could implement your *runOnce()* there.
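
(A minimal sketch of that MyMapper pattern; String element types are assumed,
and "extends" matches the pre-1.0 Java API where Function is a class rather
than an interface:)

// import org.apache.spark.api.java.function.Function;
public class MyMapper extends Function<String, String> {

  private static volatile boolean sDone = false;

  // Synchronized so concurrent task threads in the same JVM race safely.
  private static synchronized void runOnce() {
    if (sDone) return;  // re-check under the lock
    // ... once-per-JVM work: check library state, dump JVM stats, etc. ...
    sDone = true;
  }

  @Override
  public String call(String row) {
    if (!sDone) runOnce();  // volatile fast path; near-free after the first call
    return row;             // normal per-row mapping logic
  }
}

// Used as: rdd.map(new MyMapper())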



--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Thu, Mar 27, 2014 at 4:20 PM, deenar.toraskar wrote:

> Christopher
>
> Sorry, I might be missing the obvious, but how do I get my function called
> on all executors used by the app? I don't want to use RDDs unless
> necessary.
>
> Once I start my shell or app, how do I get
> TaskNonce.getSingleton().doThisOnce() executed on each executor?
>
> @dmpour
> >> rdd.mapPartitions and it would still work, as the code would only be
> >> executed once in each VM, but I was wondering if there is a more
> >> efficient way of doing this by using a generated RDD with one partition
> >> per executor.
> This remark was misleading. What I meant was that, in conjunction with the
> TaskNonce pattern, my function would be called only once per executor as
> long as the RDD had at least one partition on each executor.
>
> Deenar
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3393.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Running a task once on each executor

2014-03-28 Thread dmpour23
Is it possible to do this:

JavaRDD<String> parttionedRdds = input.map(new
Split()).sortByKey().partitionBy(new HashPartitioner(k)).values();
parttionedRdds.saveAsTextFile(args[2]);
// Then run my SingletonFunc (my app depends on the saved files)
parttionedRdds.map(new SingletonFunc());

The parttionedRdds.map(new SingletonFunc()) is never called. Do I need to
call ctx.setJobGroup, or will what I am trying to implement not work?

What is a use case for ctx.setJobGroup? Is there any example?
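
(Two notes: map is lazy, so the SingletonFunc line above only builds an RDD
that nothing ever computes; an action is needed, as earlier in the thread.
And setJobGroup does not trigger execution; it tags the jobs that follow so
they can be identified in the UI or cancelled as a group. A hedged sketch:)

ctx.setJobGroup("singleton-pass", "run SingletonFunc after the save");
parttionedRdds.map(new SingletonFunc()).count();  // the action forces execution
// Elsewhere (e.g., another thread): ctx.cancelJobGroup("singleton-pass");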

Thanks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3427.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Running a task once on each executor

2014-08-11 Thread RodrigoB
Hi Christopher,

I am also in need of a single function call at the node level. Your
suggestion makes sense as a solution to the requirement, but it still feels
like a workaround: this check will get called on every row. Also, relying on
static members and methods, especially in a multi-threaded environment, is a
code smell.

It would be nice to have a way of exposing the nodes that would allow simply
invoking a function from the driver on the nodes, without having to do any
transformation or loop through every record. That would be more efficient
and more flexible from a user's perspective.

Thanks,
Rod





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p11908.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
