HPC Environment

Steve Ramage Wed, 04 Mar 2015 14:57:20 -0800

I'm writing a task dispatching system for scientific data collection on 
timesharing clusters. There is a specific API for submission of tasks and a 
few implementations one of which does the data collection locally, another 
to a MySQL Server, and the one I am working on currently an Akka based 
distributed system. It is a standard many producer, many consumer system, 
where the consumer tasks very in length from subsecond to several hours in 
lengths. Different producers may be different algorithms entirely, with 
different workloads (most typically the workloads are either here are 4 
million tasks to do, or keep generating tasks until all consumers are busy, 
and then slowly keep generating more to keep them busy).


The typical use case is for a series of jobs to be submitted to torque or 
some dispatch system, almost surely consumers will be dispatched this way. 
In some cases producers may be run on a dedicated node, in others they may 
be queued as well. Consumers and producers may run on the same node, or on 
different nodes but pairwise communication is always possible. Users of 
this system really don't want to know or care about how their tasks are 
executed just that they are and that it is done as simple as possible. The 
whole actor system may exist only few a few hours, or may exist over a few 
days cycling both producers and consumers. 

The one problem I have right now is configuration of the cluster 
(singleton) in the akka sense. Users have no idea what machine will execute 
the tasks. A realistic assumption is that there exists a shared file 
system, which can always be relied upon. Most likely failure of nodes will 
occur due to being terminated for violating walltime or memory limits.

My thoughts on how to deal with this are as follows:

1) Pick a random number >= 1024 to serve as the port.
2) Users will supply the network to bound too, or it will be auto-detected.
3) Look in a known directory and scan for files of the name akka.N, where N 
is an integer.
4) Take the highest number add 1 (or assume the highest number is 0), read 
the other files and put the ipaddress : ports as seed nodes.
5) Try to write the file as new akka.N+1, if this fails go back to step 1.
6) Try to start the actor system, if this fails go back to step 1.
7) When shutting down the actor system delete the file akka.N+1. (also 
happens in step 5 and step 6)

Question 1) Is there a better option, and/or is there a problem with the 
above. I am somewhat concerned about producers and consumers not forming 
one cohesive cluster but little islands.

Question 2) If I detect that one of my nodes has been quarantined by 
another node can I just restart the actor system using the above protocol 
to get it to rejoin the pool. I actually think partitions will never happen 
as far as I'm concerned, although node deaths might happen. Any other weird 
failures would result in the network operations team terminating every job 
anyway and shutting down the cluster so I don't really need to be robust to 
that.


Thanks,

Steve Ramage

-- 
>>>>>>>>>>      Read the docs: http://akka.io/docs/
>>>>>>>>>>      Check the FAQ: 
>>>>>>>>>> http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>>      Search the archives: https://groups.google.com/group/akka-user
--- 
You received this message because you are subscribed to the Google Groups "Akka 
User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to akka-user+unsubscr...@googlegroups.com.
To post to this group, send email to akka-user@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.

[akka-user] Starting Akka Cluster & Seed Nodes in Torque/SGE/HPC Environment

Reply via email to