On Wed, Mar 4, 2015 at 11:57 PM, Steve Ramage <s...@sjrx.net> wrote:

> I'm writing a task dispatching system for scientific data collection on
> timesharing clusters. There is a specific API for submission of tasks and a
> few implementations one of which does the data collection locally, another
> to a MySQL Server, and the one I am working on currently an Akka based
> distributed system. It is a standard many producer, many consumer system,
> where the consumer tasks very in length from subsecond to several hours in
> lengths. Different producers may be different algorithms entirely, with
> different workloads (most typically the workloads are either here are 4
> million tasks to do, or keep generating tasks until all consumers are busy,
> and then slowly keep generating more to keep them busy).
>
> The typical use case is for a series of jobs to be submitted to torque or
> some dispatch system, almost surely consumers will be dispatched this way.
> In some cases producers may be run on a dedicated node, in others they may
> be queued as well. Consumers and producers may run on the same node, or on
> different nodes but pairwise communication is always possible. Users of
> this system really don't want to know or care about how their tasks are
> executed just that they are and that it is done as simple as possible. The
> whole actor system may exist only few a few hours, or may exist over a few
> days cycling both producers and consumers.
>
> The one problem I have right now is configuration of the cluster
> (singleton) in the akka sense. Users have no idea what machine will execute
> the tasks. A realistic assumption is that there exists a shared file
> system, which can always be relied upon. Most likely failure of nodes will
> occur due to being terminated for violating walltime or memory limits.
>
> My thoughts on how to deal with this are as follows:
>
> 1) Pick a random number >= 1024 to serve as the port.
> 2) Users will supply the network to bound too, or it will be auto-detected.
> 3) Look in a known directory and scan for files of the name akka.N, where
> N is an integer.
> 4) Take the highest number add 1 (or assume the highest number is 0), read
> the other files and put the ipaddress : ports as seed nodes.
> 5) Try to write the file as new akka.N+1, if this fails go back to step 1.
> 6) Try to start the actor system, if this fails go back to step 1.
> 7) When shutting down the actor system delete the file akka.N+1. (also
> happens in step 5 and step 6)
>
> Question 1) Is there a better option, and/or is there a problem with the
> above. I am somewhat concerned about producers and consumers not forming
> one cohesive cluster but little islands.
>

I think your solution will work fine. The key to avoiding islands is that a
node can only join an existing member, and only the first seed node joins
itself for bootstrapping. The first seed node is defined in the akka.0
file, so you must use that as the first seed node everywhere.


>
> Question 2) If I detect that one of my nodes has been quarantined by
> another node can I just restart the actor system using the above protocol
> to get it to rejoin the pool.
>

yes

/Patrik


> I actually think partitions will never happen as far as I'm concerned,
> although node deaths might happen. Any other weird failures would result in
> the network operations team terminating every job anyway and shutting down
> the cluster so I don't really need to be robust to that.
>
>
> Thanks,
>
> Steve Ramage
>
>  --
> >>>>>>>>>> Read the docs: http://akka.io/docs/
> >>>>>>>>>> Check the FAQ:
> http://doc.akka.io/docs/akka/current/additional/faq.html
> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
> ---
> You received this message because you are subscribed to the Google Groups
> "Akka User List" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to akka-user+unsubscr...@googlegroups.com.
> To post to this group, send email to akka-user@googlegroups.com.
> Visit this group at http://groups.google.com/group/akka-user.
> For more options, visit https://groups.google.com/d/optout.
>



-- 

Patrik Nordwall
Typesafe <http://typesafe.com/> -  Reactive apps on the JVM
Twitter: @patriknw

[image: Scala Days] <http://event.scaladays.org/scaladays-sanfran-2015>

-- 
>>>>>>>>>>      Read the docs: http://akka.io/docs/
>>>>>>>>>>      Check the FAQ: 
>>>>>>>>>> http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>>      Search the archives: https://groups.google.com/group/akka-user
--- 
You received this message because you are subscribed to the Google Groups "Akka 
User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to akka-user+unsubscr...@googlegroups.com.
To post to this group, send email to akka-user@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.

Reply via email to