Re: Execute function once on each node

2016-07-19 Thread Rabin Banerjee
" I am working on a spark application that requires the ability to run a
function on each node in the cluster
"
--
Use Apache Ignite instead of Spark. Trust me it's awesome for this use case.

Regards,
Rabin Banerjee
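
For anyone curious what the Ignite route looks like, here is a minimal,
untested sketch using IgniteCompute.broadcast, which runs a closure once on
every node in the cluster. The object name, the printed hostname, and any
local path are placeholders, not something from this thread.

    import org.apache.ignite.Ignition
    import org.apache.ignite.lang.IgniteRunnable

    object RunOncePerNode {
      def main(args: Array[String]): Unit = {
        // Start (or join) an Ignite node using the default configuration.
        val ignite = Ignition.start()

        // broadcast() executes the closure once on every node in the cluster.
        ignite.compute().broadcast(new IgniteRunnable {
          override def run(): Unit = {
            // Node-local work goes here, e.g. reading from a local SSD path.
            println(s"running on ${java.net.InetAddress.getLocalHost.getHostName}")
          }
        })
      }
    }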
On Jul 19, 2016 3:27 AM, "joshuata"  wrote:

> I am working on a Spark application that requires the ability to run a
> function on each node in the cluster. This is used to read data from a
> directory that is not globally accessible to the cluster. I have tried
> creating an RDD with n elements and n partitions so that it is evenly
> distributed among the n nodes, and then mapping a function over the RDD.
> However, the runtime makes no guarantees that each partition will be stored
> on a separate node. This means that the code will run multiple times on the
> same node while never running on another.
>
> I have looked through the documentation and source code for both RDDs and
> the scheduler, but I haven't found anything that will do what I need. Does
> anybody know of a solution I could use?
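
To make the attempted approach concrete, here is a rough Scala sketch of it;
the node count and the local path are placeholders. As the post explains,
nothing here forces the n partitions onto n distinct nodes, so the same host
can show up more than once.

    import org.apache.spark.{SparkConf, SparkContext}

    object PerNodeAttempt {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("per-node-attempt"))
        val n = 16  // hypothetical: number of nodes in the allocation

        // One element per intended node, one partition per element.
        val hosts = sc.parallelize(1 to n, n).map { _ =>
          // In the real job this would read a node-local directory instead.
          java.net.InetAddress.getLocalHost.getHostName
        }.collect()

        // With no placement guarantee, this can report fewer than n hosts.
        println(s"${hosts.distinct.length} distinct hosts for $n tasks")
      }
    }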


Re: Execute function once on each node

2016-07-19 Thread Josh Asplund
Technical limitations keep us from running another filesystem on the SSDs.
We are running on a very large HPC cluster without control over low-level
system components. We have tried setting up an ad-hoc HDFS cluster on the
nodes in our allocation, but we have had very little luck. It ends up being
very brittle and difficult for the simulation code to access.



Re: Execute function once on each node

2016-07-19 Thread Josh Asplund
Thank you for that advice. I have tried similar techniques, but not that
one.





Re: Execute function once on each node

2016-07-19 Thread Koert Kuipers
The whole point of a well-designed global filesystem is to not move the data.

On Jul 19, 2016 10:07, "Koert Kuipers"  wrote:

> If you run HDFS on those SSDs (with a low replication factor), wouldn't it
> also effectively write to local disk with low latency?
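
Assuming such a node-local HDFS existed, with dfs.replication set low (e.g. 1)
in hdfs-site.xml and a namenode at the placeholder address below, Spark would
read it like any other HDFS path, and the scheduler would prefer the node
holding each block. A minimal sketch:

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalHdfsRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("local-hdfs-read"))

        // Hypothetical HDFS instance backed by the node-local SSDs.
        val simOutput = sc.textFile("hdfs://namenode.example:8020/sim/output/*")

        // HDFS block locations become the tasks' preferred locations,
        // so most reads are served from the local replica.
        println(simOutput.count())
      }
    }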


Re: Execute function once on each node

2016-07-19 Thread Aniket Bhatnagar
Thanks for the explanation. Try creating a custom RDD whose getPartitions
returns an array of custom partition objects of size n (= number of nodes).
In each custom partition object, store the file path and the IP/hostname
where the partition needs to be computed. Then have getPreferredLocations
return the IP/hostname from the partition object, and in the compute
function, assert that you are on the right host (or fail) and read the
contents of the file.

Not 100% sure it will work, though.
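
A rough, untested Scala sketch of that idea follows. The class names,
hostnames, and paths are placeholders, and since the preferred locations are
only a hint to the scheduler, the compute function checks the host explicitly.

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD
    import scala.io.Source

    // One partition per node: the host that should compute it and the file to read there.
    case class NodeLocalPartition(index: Int, host: String, path: String) extends Partition

    class NodeLocalRDD(sc: SparkContext, hostsAndPaths: Seq[(String, String)])
        extends RDD[String](sc, Nil) {

      override def getPartitions: Array[Partition] =
        hostsAndPaths.zipWithIndex.map { case ((host, path), i) =>
          NodeLocalPartition(i, host, path): Partition
        }.toArray

      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq(split.asInstanceOf[NodeLocalPartition].host)

      override def compute(split: Partition, context: TaskContext): Iterator[String] = {
        val p = split.asInstanceOf[NodeLocalPartition]
        val here = java.net.InetAddress.getLocalHost.getHostName
        // Fail loudly if the task was not scheduled on the expected host.
        require(here == p.host, s"expected ${p.host}, but running on $here")
        Source.fromFile(p.path).getLines()
      }
    }

    // Usage (hostnames and paths are hypothetical):
    //   val rdd = new NodeLocalRDD(sc, Seq("node01" -> "/local/ssd/out.dat",
    //                                      "node02" -> "/local/ssd/out.dat"))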



Re: Execute function once on each node

2016-07-18 Thread Josh Asplund
The Spark workers are running side-by-side with scientific simulation code.
The code writes output to local SSDs to keep latency low. Due to the volume
of data being moved (tens of terabytes or more), it isn't really feasible to
copy the data to a global filesystem. Executing a function on each node would
allow us to read the data in situ without a copy.

I understand that manually assigning tasks to nodes reduces fault
tolerance, but the simulation codes already explicitly assign tasks, so a
failure of any one node is already a full-job failure.



Re: Execute function once on each node

2016-07-18 Thread Aniket Bhatnagar
You can't assume that the number of nodes will be constant, as some may
fail, hence you can't guarantee that a function will execute at most once
or at least once on a node. Can you explain your use case in a bit more
detail?
