Aaron Kimball wrote:
Hadoop has some classes for controlling how sockets are used. See
org.apache.hadoop.net.StandardSocketFactory, SocksSocketFactory.
The socket factory implementation chosen is controlled by the
hadoop.rpc.socket.factory.class.default configuration parameter. You could
probably write your own SocketFactory that gives back socket implementations
that tee the conversation to another port, or to a file, etc.
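To make that concrete, here is a minimal sketch of such a factory. Everything here (the class names, the tee-to-an-OutputStream design) is made up for illustration; only javax.net.SocketFactory and java.net.Socket are real APIs:

```java
import java.io.*;
import java.net.*;
import javax.net.SocketFactory;

// Hypothetical sketch only: TeeSocketFactory is not a Hadoop class. It shows
// how a custom javax.net.SocketFactory could hand back sockets whose streams
// copy ("tee") every byte to a secondary OutputStream, e.g. a log file.
public class TeeSocketFactory extends SocketFactory {

  // Wrap an InputStream so every byte read is also written to the tap.
  static InputStream tee(final InputStream in, final OutputStream tap) {
    return new InputStream() {
      @Override public int read() throws IOException {
        int b = in.read();
        if (b >= 0) tap.write(b);
        return b;
      }
    };
  }

  // Wrap an OutputStream so every byte written is also written to the tap.
  static OutputStream tee(final OutputStream out, final OutputStream tap) {
    return new OutputStream() {
      @Override public void write(int b) throws IOException {
        out.write(b);
        tap.write(b);
      }
    };
  }

  // A Socket whose streams are teed to the tap.
  class TeeSocket extends Socket {
    @Override public InputStream getInputStream() throws IOException {
      return tee(super.getInputStream(), tap);
    }
    @Override public OutputStream getOutputStream() throws IOException {
      return tee(super.getOutputStream(), tap);
    }
  }

  private final OutputStream tap;

  public TeeSocketFactory(OutputStream tap) { this.tap = tap; }

  private Socket newConnectedSocket(SocketAddress remote, SocketAddress local)
      throws IOException {
    Socket s = new TeeSocket();
    if (local != null) s.bind(local);
    s.connect(remote);
    return s;
  }

  @Override public Socket createSocket() { return new TeeSocket(); }

  @Override public Socket createSocket(String host, int port)
      throws IOException {
    return newConnectedSocket(new InetSocketAddress(host, port), null);
  }

  @Override public Socket createSocket(String host, int port,
      InetAddress localHost, int localPort) throws IOException {
    return newConnectedSocket(new InetSocketAddress(host, port),
        new InetSocketAddress(localHost, localPort));
  }

  @Override public Socket createSocket(InetAddress host, int port)
      throws IOException {
    return newConnectedSocket(new InetSocketAddress(host, port), null);
  }

  @Override public Socket createSocket(InetAddress address, int port,
      InetAddress localAddress, int localPort) throws IOException {
    return newConnectedSocket(new InetSocketAddress(address, port),
        new InetSocketAddress(localAddress, localPort));
  }

  // Demo of the teeing on in-memory streams; no network needed.
  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream tap = new ByteArrayOutputStream();
    InputStream in = tee(new ByteArrayInputStream("hello".getBytes()), tap);
    while (in.read() >= 0) { /* drain */ }
    System.out.println(tap);  // prints "hello"
  }
}
```

You'd then, presumably, point hadoop.rpc.socket.factory.class.default at the factory's class name; but note Hadoop instantiates socket factories reflectively, so a real version would likely need a no-arg constructor (opening a tap file named in the configuration, say), and byte-at-a-time teeing is far too slow for anything but debugging.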
So, "it's possible," but I don't know that anyone's implemented it. I
think others may have examined Hadoop's protocols via Wireshark or other
external tools, but those don't have much insight into Hadoop's internals.
(Neither, for that matter, would the socket factory. You'd probably need to
be pretty clever to introspect on exactly what type of message is being
sent and actually do semantic analysis, etc.)
Also worry about anything opening a URL, for which there are JVM-level
factories, and about Jetty, which opens its own listeners, though
presumably it's the clients you'd want to play with.
I'm going to be honest and say this is a fairly ambitious project for a
master's thesis, because you are going to be nestling deep into code
across the system, possibly making changes whose benefit people who run
well-managed datacentres won't see: they don't have connectivity
problems, as they set up the machines and the network properly. It's
only people like me whose home desktop is badly configured
( https://issues.apache.org/jira/browse/HADOOP-3426 ).
Now, what might be handy is better diagnostics of the configuration:
1. Code to run on every machine to test the network, look at the
config, play with DNS, detect problems, and report them with meaningful
errors that point to wiki pages with hints.
2. Every service that opens ports logging this event somewhere
(ideally in a service base class), so that instead of trying to work out
which ports Hadoop is using by playing with netstat -p and jps -v, you
can make a query of the nodes (command line, signal, or GET /ports) and
get each service's list of active protocols, ports and IP addresses as
text or JSON.
3. Some class to take that JSON list, try to access the various
things, and log failures.
4. Some MR jobs to run the code in (3) and see what happens
5. Some MR jobs whose aim in life is to measure network bandwidth and
do stats on round-trip times.
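The checker in (3) could be quite small. A sketch, assuming the JSON from the nodes has already been parsed down to (host, port) pairs; PortChecker and its method names are invented for illustration:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of step (3): given (host, port) pairs recovered from
// the nodes' port listings, try to connect to each one and collect the
// failures, keeping the underlying exception text, which is where the
// diagnostic value lies.
public class PortChecker {

  // Try to connect; return null on success, else a description of the failure.
  public static String probe(String host, int port, int timeoutMillis) {
    try (Socket s = new Socket()) {
      s.connect(new InetSocketAddress(host, port), timeoutMillis);
      return null;
    } catch (IOException e) {
      return host + ":" + port + " unreachable: " + e;
    }
  }

  // Probe every endpoint; return the list of failure messages (empty = all ok).
  public static List<String> checkAll(Map<String, Integer> endpoints,
                                      int timeoutMillis) {
    List<String> failures = new ArrayList<>();
    for (Map.Entry<String, Integer> e : endpoints.entrySet()) {
      String failure = probe(e.getKey(), e.getValue(), timeoutMillis);
      if (failure != null) {
        failures.add(failure);
      }
    }
    return failures;
  }
}
```

Wrapped in an MR job as in (4), the mappers would run checkAll from every node and the output would show you which links in the cluster are actually broken, not just which processes think they're listening.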
Just a thought :)
See also some thoughts of mine on Hadoop/university collaboration
http://www.slideshare.net/steve_l/hadoop-and-universities