Aaron Kimball wrote:
Hadoop has some classes for controlling how sockets are used. See
org.apache.hadoop.net.StandardSocketFactory, SocksSocketFactory.

The socket factory implementation chosen is controlled by the
hadoop.rpc.socket.factory.class.default configuration parameter. You could
probably write your own SocketFactory that gives back socket implementations
that tee the conversation to another port, or to a file, etc.
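A minimal sketch of such a factory (the class name and the bare logging are mine, not anything in Hadoop; a real "tee" would also wrap the returned Socket's streams to copy traffic to a file or second port):

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.Socket;
import javax.net.SocketFactory;

// Hypothetical: logs every outbound connection, then delegates to the
// JVM-default factory. All four connecting overloads are covered.
public class LoggingSocketFactory extends SocketFactory {
    private final SocketFactory delegate = SocketFactory.getDefault();

    private Socket log(Socket s, Object host, int port) {
        System.err.println("connect " + host + ":" + port);
        return s;
    }

    @Override
    public Socket createSocket() throws IOException {
        return delegate.createSocket(); // unconnected yet, nothing to log
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        return log(delegate.createSocket(host, port), host, port);
    }

    @Override
    public Socket createSocket(String host, int port, InetAddress localAddr,
                               int localPort) throws IOException {
        return log(delegate.createSocket(host, port, localAddr, localPort), host, port);
    }

    @Override
    public Socket createSocket(InetAddress host, int port) throws IOException {
        return log(delegate.createSocket(host, port), host, port);
    }

    @Override
    public Socket createSocket(InetAddress host, int port, InetAddress localAddr,
                               int localPort) throws IOException {
        return log(delegate.createSocket(host, port, localAddr, localPort), host, port);
    }
}
```

You'd then point hadoop.rpc.socket.factory.class.default at the class in core-site.xml.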

So, "it's possible," but I don't know that anyone's implemented this. I
think others may have examined Hadoop's protocols via wireshark or other
external tools, but those don't have much insight into Hadoop's internals.
(Neither, for that matter, would the socket factory. You'd need to be pretty
clever to work out exactly what type of message is being sent and actually do
semantic analysis on it, etc.)

You'd also need to worry about anything opening a URL directly, for which there are JVM-level handler factories, and about Jetty, which opens its own listeners, though presumably it's the clients you'd want to play with.
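The JVM-level URL hook mentioned above looks like this; note it can be installed at most once per JVM, which is part of what makes it awkward to play with (the class and method names here are my own sketch):

```java
import java.net.URL;

// Sketch of the JVM-wide hook: the factory is consulted for every protocol
// lookup. Returning null falls back to the built-in handler, so this variant
// only observes, it doesn't change behaviour.
public class UrlHook {
    public static void install() {
        URL.setURLStreamHandlerFactory(protocol -> {
            System.err.println("URL handler requested for protocol: " + protocol);
            return null; // fall back to the JVM's default handler
        });
    }
}
```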

I'm going to be honest and say this is a fairly ambitious project for a master's thesis, because you'll be nestling deep into code across the system, possibly making changes whose benefits people who run well-managed datacentres won't see (they don't have connectivity problems, as they set up the machines and the network properly; it's only people like me whose home desktop is badly configured: https://issues.apache.org/jira/browse/HADOOP-3426 ).

Now, what might be handy is better diagnostics of the configuration:
1. Code to run on every machine to test the network, look at the config, play with DNS, detect problems, and report them with meaningful errors that point to wiki pages with hints.
2. Every service which opens ports logs this event somewhere (ideally in a service base class), so that instead of trying to work out which ports Hadoop is using by playing with netstat -p and jps -v, you can make a query of the nodes (command line, signal, or GET /ports) and get each service's list of active protocols, ports, and IP addresses as text or JSON.
3. Some class to take that JSON list, try to access the various things, and log failures.
4. Some MR jobs to run the code in (3) and see what happens.
5. Some MR jobs whose aim in life is to measure network bandwidth and do stats on round-trip times.
Just a thought :)


See also some thoughts of mine on Hadoop/university collaboration
http://www.slideshare.net/steve_l/hadoop-and-universities
