Hi all I've been working on a Parameter Server for Flink and I have some basic functionality up, which supports registering a key, plus setting, updating and fetching a parameter.
The idea is to start a Parameter Server somewhere and pass the address and port in a configuration for another actor system to access it. Right now, I have a standalone module for this, named flink-server under staging. There is a {{ParameterClient}} which allows users to do the above operations in a blocking fashion by waiting on a result from the server. You can look at the code here: https://github.com/apache/flink/compare/master...sachingoel0101:parameter_server [It is highly derived from the JobManager implementation.] One obvious thing to do is to ensure there are several servers which can serve data to users. This can help achieve redundancy too by copying data over several servers and keeping them synchronized. 1. We can follow a slave model where starting a server anywhere starts a server on all slave machines. After this, I plan to copy a key-value pair on several machines by computing their hashes [key's and server's UUID's] modulo #servers. This way every server knows where exactly all the keys are residing. This however has a problem at the time of failures. If a server fails, we need to recompute the modulo values and re-distribute almost all of the data to maintain redundancy. 2. Another method is, for every task manager started, inside the same system, one server should be started and this server will handle all data transfers from the tasks running inside the particular TaskManager. This way, whenever there is a failure of a machine, the JobManager at least knows and can let other TaskManagers and their servers know of the failure of their fellow server. Since the Job Manager is maintaining a list of the servers/task-managers, we can maintain a indexed list of servers very easily. Then it's just a matter of mapping a key to an index in the JobManager's instance list. [Of course, I'm assuming it would be hard to assign indexes to servers in a standalone fashion such that everyone has the exact same view]. I'm more in favor of 2, since we after all need to utilize this in iterative algorithms, and that will need integration into task manager and runtime context anyways. Plus, having a master to control everything makes everything easy. :') What do you guys think? Cheers! Sachin PS. Sorry about the long email on a weekend. -- Sachin Goel Computer Science, IIT Delhi m. +91-9871457685