[ 
https://issues.apache.org/jira/browse/HADOOP-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477345#comment-13477345
 ] 

Todd Lipcon commented on HADOOP-6311:
-------------------------------------

Hi Colin,

Thanks for writing up the design doc. I think it probably should actually go on 
HDFS-347, which is the overall feature JIRA, rather than this one, which is 
about one of the implementation subtasks. But, anyway, here are some comments:

{quote}
* Portable.  Hadoop supports multiple operating systems and environments,
including Linux, Solaris, and Windows.
{quote}

IMO, there is not a requirement that performance enhancements work on all 
systems. It is up to the maintainers of each port to come up with the most 
efficient way to do things. Though there is an active effort to get Hadoop 
working on Windows, it is not yet a requirement.

So long as we maintain the current TCP-based read (which we have to, anyway, 
for remote access), we'll have portability. If the Windows port doesn't 
initially offer this feature, that seems acceptable to me, as they could later 
add whatever mechanism makes the most sense for them.

{quote}
* High performance.  If performance is compromised, there is no point to any of
this work, because clients could simply use the existing, non-short-circuit
write pathways to access data.
{quote}
Should clarify that the performance of the mechanism by which FDs are _passed_ 
is less important, since the client will cache the open FDs and just re-use 
them for subsequent random reads against the same file (the primary use case 
for this improvement). So long as the overhead of passing the FDs isn't huge, 
we should be OK.
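
To make that concrete: once the fd has been passed, every subsequent random read is an ordinary pread() against the cached descriptor, so the domain-socket round trip is paid once per block open rather than once per read. A rough sketch (plain POSIX, nothing Hadoop-specific; the function name is just a placeholder):

{code}
#include <unistd.h>

/*
 * Rough sketch: once the block fd has been passed over the domain socket,
 * every random read is an ordinary pread() against the cached descriptor --
 * no further socket traffic for the lifetime of the block.
 * Loops to handle short reads; returns bytes read, or -1 on error.
 */
static ssize_t read_block_range(int cached_block_fd, char *buf,
                                size_t len, off_t offset) {
  size_t done = 0;
  while (done < len) {
    ssize_t n = pread(cached_block_fd, buf + done, len - done, offset + done);
    if (n < 0)  return -1;   /* I/O error */
    if (n == 0) break;       /* EOF */
    done += n;
  }
  return (ssize_t)done;
}
{code}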

{quote}
There are other problems.  How would the datanode clients and the server decide
on a socket path?  If it asks every time prior to connecting, that could be
slow.  If the DFSClient cached this socket path, how long should it cache it
before expiring the cache?  What happens if the administrator does not properly
set up the socket path, as discussed earlier?  What happens if the
administrator wants to put multiple DataNodes on the same node?
{quote}
Per above, slowness here is not a concern, since we only need to do the 
socket-passing on file open. HDFS applications generally open a file once and 
then perform many many reads against the same block before opening the next 
block.

As for how the socket path is communicated, why not do it via an RPC? For 
example, in your solution #3, we're using an RPC to communicate a cookie. 
Instead of that, the RPC could just return the DN's abstract namespace socket 
name. (You seem to propose this under solution #3 below, but reject it here in 
solution #1.)

Another option would be to add a new field to the 
DatanodeId/DatanodeRegistration: when the client gets block locations it could 
also include the socket paths.

{quote}
The response is not a path, but a 64-bit cookie.  The DFSClient then connects
to the DN via a UNIX domain socket, and presents the cookie.  In response, he
receives the file descriptor.
{quote}

I still don't see the purpose of the cookie: it adds yet another opaque token, 
requires the DN code to "publish" the file descriptor under a cookie, and 
leaves us with extra data structures, cached open files, cache expiration 
policies, etc.

----

{quote}
Choice #3.  Blocking FdServer versus non-blocking FdServer.
Non-blocking servers in C are somewhat more complex than blocking servers.
However, if I used a blocking server, there would be no obvious way to
determine how many threads it should use.  Because it depends on how busy the
server is expected to be, only the system administrator can know ahead of time.
Additionally, many schedulers do not deal well with a large number of threads,
especially on older versions of Linux and commercial UNIX variants.
Coincidentally, these happen to be exactly the kind of systems many of our
users run.
{quote}

I don't really buy this. The socket only needs to be active long enough to pass 
a single fd, which should take a few milliseconds. The number of requests for 
fd-passing is based on the number of block opens, _not_ the number of reads. So 
a small handful of threads should be able to handle even significant workloads 
just fine. We also do fine with threads on the data xceiver path, often 
configured into the hundreds or thousands.
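
To put a number on "a small handful": a blocking fd server with a tiny fixed pool of acceptor threads is about as simple as it gets. A rough sketch (plain pthreads; handle_fd_request() is a placeholder for the read-request/pass-fd exchange, not code from any patch):

{code}
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define NUM_WORKERS 4   /* a small fixed pool; requests arrive per block open */

/* placeholder: read one request off the socket, pass the fds, and return */
extern void handle_fd_request(int conn);

static int listen_fd;   /* the already-bound, listening domain socket */

static void *acceptor_loop(void *unused) {
  (void)unused;
  for (;;) {
    int conn = accept(listen_fd, NULL, NULL);
    if (conn < 0)
      continue;                /* e.g. EINTR; a real server would check errno */
    handle_fd_request(conn);   /* one short request/response exchange */
    close(conn);
  }
  return NULL;
}

/* spawn a handful of blocking acceptor threads sharing one listening socket */
static void start_fd_server(int fd) {
  listen_fd = fd;
  for (int i = 0; i < NUM_WORKERS; i++) {
    pthread_t t;
    pthread_create(&t, NULL, acceptor_loop, NULL);
    pthread_detach(t);
  }
}
{code}

Since each exchange is a single short request/response, even a pool this small should keep up with any realistic block-open rate from clients on the same box.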

{quote}
Another problem with blocking servers is that shutting them down can be
difficult.  Since there is no time limit on blocking I/O, a message sent to the
server to terminate may take a while, or possibly forever, to be acted on.
This may seem like a trivial or unimportant problem, but it is a very real one
in unit tests.  Socket receive and send timeouts can reduce the extra time
needed to shut down, but never quite eliminate it.
{quote}
Again, I don't buy it; we do fine with blocking IO everywhere else. Why is 
this context different?

----

*Wire protocol*

The wire protocol should use protobufs, so it can be evolved in the future. I 
am also still against the cookie approach. I think a more sensible protocol 
would be the following:

0) The client somehow obtains the socket path (either by RPC to the DN or by it 
being part of the DatanodeId)
1) Client connects to the socket and sends a fully formed protobuf message, 
including the block ID and block token
2) JNI server receives the message and passes it back to Java, which parses it, 
opens the block, and passes file descriptors (or an error) back to JNI.
3) JNI server sends the file descriptors along with a response protobuf
4) JNI client receives the protobuf data and, on success, the fds, and forwards 
them back to Java where the protobuf decode happens

This eliminates the need for a new cookie construct, and a bunch of data 
structures on the server side. The JNI code becomes stateless except for its 
job of accept(), read(), and write(). It's also extensible due to the use of 
protobufs (which is helpful if the server later needs to provide some extra 
information about the format of the files, etc)

The APIs would then become something like:
{code}
class FdResponse {
  FileDescriptor[] fds;
  byte[] responseData;
}

interface FdProvider {
  FdResponse handleRequest(byte[] request);
}

class FdServer { // implementation in JNI
  FdServer(FdProvider provider);
}
{code}
(the FdServer in C calls back into the FdProvider interface to handle the 
requests)
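
For reference, the stateless C side of steps 3 and 4 is small. A rough sketch of the sending half on Linux (names and the fd limit are placeholders, not a proposal for the actual code), using sendmsg() with SCM_RIGHTS:

{code}
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define MAX_PASSED_FDS 8

/*
 * Send the serialized response bytes plus the open file descriptors over an
 * already-accepted UNIX domain socket.  The fds travel as SCM_RIGHTS
 * ancillary data, so the kernel duplicates them into the receiving process.
 * Returns 0 on success, -1 on error (errno set by sendmsg).
 */
static int send_fds_with_response(int sock, const void *resp, size_t resp_len,
                                  const int *fds, int num_fds) {
  struct iovec iov = { .iov_base = (void *)resp, .iov_len = resp_len };
  union {                                 /* aligned buffer for control data */
    char buf[CMSG_SPACE(sizeof(int) * MAX_PASSED_FDS)];
    struct cmsghdr align;
  } ctrl;
  struct msghdr msg;

  if (num_fds < 0 || num_fds > MAX_PASSED_FDS)
    return -1;

  memset(&msg, 0, sizeof(msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;

  if (num_fds > 0) {
    struct cmsghdr *cmsg;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = CMSG_SPACE(sizeof(int) * num_fds);
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int) * num_fds);
    memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * num_fds);
  }

  return (sendmsg(sock, &msg, 0) < 0) ? -1 : 0;
}
{code}

The client side of step 4 is symmetric: recvmsg() with a control buffer, walk the cmsg headers to pull the fds back out, then hand the response bytes up to Java for the protobuf decode.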

----

*Security:*

One question: if you use the autobinding in the abstract namespace, does that 
prevent a later attacker from explicitly picking the same name? ie is the 
"autobind" namespace fully distinct from the other one? Otherwise I'm afraid of 
the following attack:

- malicious client acts like a reader, finds out the socket path of the local DN
- client starts a while-loop, trying to bind that same socket
- DN eventually restarts due to normal operations. client seizes the socket
- other clients on that machine have the cached socket path and get 
man-in-the-middled

I think this could be somewhat mitigated by using the credentials-checking 
facility of domain sockets: the client can know the proper uid of the datanode, 
and verify against that. Otherwise perhaps there is some token-like scheme we 
could use. But we should make sure this is foolproof.
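
On Linux that check is cheap. A rough sketch of what the client-side verification could look like (SO_PEERCRED; the expected uid would come from config or similar, this is purely illustrative):

{code}
#define _GNU_SOURCE           /* for struct ucred on glibc */
#include <sys/socket.h>
#include <sys/types.h>

/*
 * After connect()ing to what we believe is the datanode's domain socket,
 * ask the kernel who is really on the other end.  SO_PEERCRED returns the
 * pid/uid/gid of the peer; if the uid is not the one the datanode runs as,
 * refuse to use the socket.  Returns 0 if the peer checks out, -1 otherwise.
 */
static int check_peer_is_datanode(int sock, uid_t expected_uid) {
  struct ucred cred;
  socklen_t len = sizeof(cred);

  if (getsockopt(sock, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
    return -1;                      /* cannot verify -> treat as untrusted */
  return (cred.uid == expected_uid) ? 0 : -1;
}
{code}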

If the above is indeed a possibility, I'm not sure the abstract namespace buys 
us anything over using a well-defined path (eg inside /var/run), since we'd 
need to do the credentials check anyway.

----

*Overall thoughts*

I agree in general with your arguments about the complexity of the unix socket 
approach and its non-portability, but let me also bring to the table a couple 
arguments _for_ it:

As history shows, I was originally one of the proponents of the short-circuit 
read approach (in HDFS-347, etc). It provided much better performance than 
loopback TCP, especially for random read. For sequential read, it also uses 
much less CPU: in particular I see a lot of system CPU being used by loopback 
TCP sockets for both read and write on the non-short-circuit code paths. Short 
circuit avoids this.

But from my experience with HDFS-2246, and from thinking about some other 
upcoming improvements, I see a couple of downsides as well:

1) Metrics: because the client gains full control of the file descriptors, the 
datanode no longer knows anything about the amount of disk IO being used by 
each of its clients, or even the total aggregate. This makes the metrics 
under-reported, and I don't see any way around it.

In addition to throughput metrics, we'll also end up lacking latency statistics 
against the local disk in the datanode. We now have latency percentiles for 
disk access, which are useful for identifying dying/slow disks for applications 
like HBase. We don't get that with short-circuit.

2) Fault handling: if there's an actual IO error on the underlying disk, the 
client is the one who sees it instead of the datanode. This means that the 
datanode won't necessarily mark the disk as bad. We should figure out how to 
address this for the short-circuit path (e.g. the client could send an RPC to 
the DN asking it to move a given block to the front of the block scanner queue 
upon hitting an error).

3) Client thickness: there has recently been talk of changing the on-disk 
format in the datanode, for example to introduce inline checksums in the block 
data. Currently, such changes would be datanode-only, but with short circuit 
the client also needs an update. We need to ensure that whatever short circuit 
code we write, we have suitable fallbacks: eg by transferring some kind of disk 
format version identifier in the RPC, and having the DN reject the request if 
the client won't be able to handle the storage format.

4) Future QoS enhancements: currently there is no QoS/prioritization construct 
within the datanode, but as mixed workload clusters become more popular (eg 
HBase + MR) we would like to be able to introduce QoS features. If all IO goes 
through the datanode, then this is far more feasible. With short-circuit, once 
we hand out an fd, we've lost all ability to throttle a client.

So, I'm not _against_ the fd passing approach, but I think these downsides 
should be called out in the docs, and if there's any way we can think of to 
mitigate some of them, that would be good to consider.

I'd also be really interested to see data on performance of the existing 
datanode code running on either unix sockets or on a system patched with the 
new "TCP friends" feature which eliminates a lot of the loopback overhead. 
According to some benchmarks I've read about, it should cut CPU usage by a 
factor of 2-3, which might make the win of short-circuit much smaller. Has 
anyone done any prototypes in this area? The other advantage of this approach 
(non-short-circuit unix sockets instead of TCP) is that it would improve 
performance of the write pipeline as well, where I currently see a ton of 
overhead due to TCP in the kernel.

                
> Add support for unix domain sockets to JNI libs
> -----------------------------------------------
>
>                 Key: HADOOP-6311
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6311
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: native
>    Affects Versions: 0.20.0
>            Reporter: Todd Lipcon
>            Assignee: Colin Patrick McCabe
>         Attachments: 6311-trunk-inprogress.txt, design.txt, 
> HADOOP-6311.014.patch, HADOOP-6311.016.patch, HADOOP-6311.018.patch, 
> HADOOP-6311.020b.patch, HADOOP-6311.020.patch, HADOOP-6311.021.patch, 
> HADOOP-6311.022.patch, HADOOP-6311-0.patch, HADOOP-6311-1.patch, 
> hadoop-6311.txt
>
>
> For HDFS-347 we need to use unix domain sockets. This JIRA is to include a 
> library in common which adds a o.a.h.net.unix package based on the code from 
> Android (apache 2 license)

