Hi all
I've been working on a Parameter Server for Flink and I have some basic
functionality up, which supports registering a key, plus setting, updating
and fetching a parameter.

The idea is to start a Parameter Server somewhere and pass the address and
port in a configuration for another actor system to access it. Right now, I
have a standalone module for this, named flink-server under staging.

There is a {{ParameterClient}} which allows users to do the above
operations in a blocking fashion by waiting on a result from the server.
You can look at the code here:
https://github.com/apache/flink/compare/master...sachingoel0101:parameter_server
[It is highly derived from the JobManager implementation.]

One obvious thing to do is to ensure there are several servers which can
serve data to users. This can help achieve redundancy too by copying data
over several servers and keeping them synchronized.

1.  We can follow a slave model where starting a server anywhere starts a
server on all slave machines. After this, I plan to copy a key-value pair
on several machines by computing their hashes [key's and server's UUID's]
modulo #servers. This way every server knows where exactly all the keys are
residing. This however has a problem at the time of failures.
If a server fails, we need to recompute the modulo values and re-distribute
almost all of the data to maintain redundancy.

2. Another method is, for every task manager started, inside the same
system, one server should be started and this server will handle all data
transfers from the tasks running inside the particular TaskManager. This
way, whenever there is a failure of a machine, the JobManager at least
knows and can let other TaskManagers and their servers know of the failure
of their fellow server.
Since the Job Manager is maintaining a list of the servers/task-managers,
we can maintain a indexed list of servers very easily. Then it's just a
matter of mapping a key to an index in the JobManager's instance list.
[Of course, I'm assuming it would be hard to assign indexes to servers in a
standalone fashion such that everyone has the exact same view].

I'm more in favor of 2, since we after all need to utilize this in
iterative algorithms, and that will need integration into task manager and
runtime context anyways. Plus, having a master to control everything makes
everything easy. :')

What do you guys think?

Cheers!
Sachin

PS. Sorry about the long email on a weekend.

-- Sachin Goel
Computer Science, IIT Delhi
m. +91-9871457685

Reply via email to