Thank you for that advice. I have tried similar techniques, but not that
one.

On Mon, Jul 18, 2016 at 11:42 PM Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:

> Thanks for the explanation. Try creating a custom RDD whose getPartitions
> returns an array of n custom partition objects (n = number of nodes). In
> each custom partition object, you can store the file path and the
> ip/hostname where the partition needs to be computed. Then have
> getPreferredLocations return the ip/hostname from the partition object,
> and in the compute function, assert that you are on the right ip/hostname
> (or fail) and read the contents of the file.
>
> Not 100% sure it will work, though.
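>
> Something along these lines might do it (a rough, untested sketch;
> NodeLocalRDD, NodeLocalPartition and the host-to-path mapping are names I
> am making up here, not an existing Spark API):
>
>   import java.net.InetAddress
>   import scala.io.Source
>   import org.apache.spark.{Partition, SparkContext, TaskContext}
>   import org.apache.spark.rdd.RDD
>
>   // Hypothetical partition carrying the host and the local file to read there.
>   case class NodeLocalPartition(index: Int, host: String, path: String)
>     extends Partition
>
>   // One partition per node, pinned to that node via getPreferredLocations.
>   class NodeLocalRDD(sc: SparkContext, hostsToPaths: Seq[(String, String)])
>     extends RDD[String](sc, Nil) {
>
>     override def getPartitions: Array[Partition] =
>       hostsToPaths.zipWithIndex.map { case ((host, path), i) =>
>         NodeLocalPartition(i, host, path)
>       }.toArray[Partition]
>
>     override def getPreferredLocations(split: Partition): Seq[String] =
>       Seq(split.asInstanceOf[NodeLocalPartition].host)
>
>     override def compute(split: Partition, context: TaskContext): Iterator[String] = {
>       val p = split.asInstanceOf[NodeLocalPartition]
>       // getLocalHost may not match the name Spark uses for the executor,
>       // so compare against whatever naming convention your cluster uses.
>       val here = InetAddress.getLocalHost.getHostName
>       // A preferred location is only a hint to the scheduler, so fail loudly
>       // if the task landed somewhere else instead of silently reading nothing.
>       require(here == p.host, s"expected ${p.host}, running on $here")
>       Source.fromFile(p.path).getLines()
>     }
>   }
>
>   // Usage (hostnames and paths are placeholders):
>   // new NodeLocalRDD(sc, Seq("node01" -> "/local/ssd/out.dat")).count()
>
> The point of the require is that a preferred location is only a hint, so a
> misplaced task becomes a visible failure rather than a silent wrong read.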
>
> On Tue, Jul 19, 2016, 2:54 AM Josh Asplund <joshaspl...@gmail.com> wrote:
>
>> The Spark workers are running side-by-side with scientific simulation
>> code. The code writes output to local SSDs to keep latency low. Due to
>> the volume of data being moved (tens of terabytes or more), it isn't
>> really feasible to copy the data to a global filesystem. Executing a
>> function on each node would allow us to read the data in situ without a
>> copy.
>>
>> I understand that manually assigning tasks to nodes reduces fault
>> tolerance, but the simulation codes already explicitly assign tasks, so a
>> failure of any one node is already a full-job failure.
>>
>> On Mon, Jul 18, 2016 at 3:43 PM Aniket Bhatnagar <
>> aniket.bhatna...@gmail.com> wrote:
>>
>>> You can't assume that the number of nodes will stay constant, as some
>>> may fail, so you can't guarantee that a function will execute at most
>>> once or at least once on a given node. Can you explain your use case in
>>> a bit more detail?
>>>
>>> On Mon, Jul 18, 2016, 10:57 PM joshuata <joshaspl...@gmail.com> wrote:
>>>
>>>> I am working on a Spark application that requires the ability to run a
>>>> function on each node in the cluster. This is used to read data from a
>>>> directory that is not globally accessible to the cluster. I have tried
>>>> creating an RDD with n elements and n partitions so that it is evenly
>>>> distributed among the n nodes, and then mapping a function over the RDD
>>>> (a simplified sketch of that attempt is below). However, the runtime
>>>> makes no guarantee that each partition will be placed on a separate
>>>> node, so the function can run multiple times on the same node while
>>>> never running on another.
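>>>>
>>>> A simplified sketch of that attempt (sc is the SparkContext; the node
>>>> count and the local path are placeholders for my real setup):
>>>>
>>>>   import scala.io.Source
>>>>
>>>>   val numNodes = 16  // placeholder: known from the job allocation
>>>>   val lines = sc.parallelize(1 to numNodes, numNodes)
>>>>     .flatMap(_ => Source.fromFile("/local/ssd/sim-output/part.dat").getLines())
>>>>     .collect()
>>>>   // Nothing pins these numNodes tasks to numNodes distinct nodes, so
>>>>   // several can land on one machine while others are never touched.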
>>>>
>>>> I have looked through the documentation and source code for both RDDs
>>>> and the scheduler, but I haven't found anything that will do what I
>>>> need. Does anybody know of a solution I could use?
>>>>
