[jira] [Commented] (HDFS-11546) Federation Router RPC server

Chris Douglas (JIRA) Tue, 02 May 2017 16:06:12 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-11546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993970#comment-15993970
 ]


Chris Douglas commented on HDFS-11546:
--------------------------------------

Thanks for the iterations, I think this is close.

bq. How do we link to HADOOP-12077
bq. What kind of documentation would we add for the TODOs
I don't think it needs an explicit link, I was just trying to understand the 
semantics the router implements. Users of router-based HDFS federation (R-BF) 
will want to understand what guarantees it provides for these calls without 
reading the code. Somewhere in the branch there should be a .md documenting how 
to configure and use R-BF, how the config knobs interact, limitations (e.g., no 
snapshots), etc. Filing a documentation subtask on HDFS-10467 will provide a 
place to track its impl.

Comments on 006:
* The {{ConnectionPool#isConnectionUsable}} method doesn't distinguish between 
a reference to a busy, closed, or even null connection. The caller is expected 
to handle all these cases, or are closed/busy the same case? Should the caller 
retry on null?
* It looks like the connection could be returned to multiple threads before 
it's marked busy, since that only happens at {{ConnectionContext#getClient}}.
** The caller could receive a "non-busy" connection handle, only to have it 
closed before it can use it.
** Since {{busy}} is a boolean, those multiple threads will mark the thread as 
non-busy at the first call that releases it
** It's plausible that this will grow a pool of connections roughly 
proportional to load, but its semantics are fuzzy.
* The synchronization in {{ConnectionPool}} is hard to follow. For example, 
{{connections}} is protected by the instance ({{#close}}), by the read lock 
({{#getConnection}}), by a lock on the atomic int ({{#addingConnections}}). 
Even if this improves concurrency, it seems difficult to maintain without 
introducing deadlocks (I couldn't find any cycles in the existing code).
** Would it make sense for the {{ConnectionManager}} to mange the size of its 
{{ConnectionPool}} instances, rather than having each dynamically size itself? 
Imposing a cap on the total number of concurrent connections, QoS/quotas, etc. 
at the router probably requires some central management, and it could make the 
synchronization of the pool itself trivial (read lock for grabbing a connection 
from the pool, a write lock to add a thread from the manager
** The pattern of creating a thread to create the connection out-of-band is 
kind of heavy. If this were moved to the {{ConnectionManager}}, perhaps the 
decision could be inline and- if it's expensive- the connection creation could 
be offloaded to a single thread (pool) in the manager

The rest of the patch looks OK. I can't really go through {{ClientProtocol}} in 
detail, but this looks fine as a base.

I looked at {{getListing}} and {{create}}. The semantics for the multiple 
possible create locations, as discussed in HADOOP-12077 and elsewhere, are hard 
to reason about. If a remote location is temporarily unavailable, when it comes 
back clients may have a very strange view of the system, particularly retrying 
through separate routers which may have a partitioned view. It's an interesting 
capability to include, and for read-only paths replicated at all targets it's 
probably fine. If users have docs laying out how to use it (safely and 
dangerously), it seems fine to include even if it won't work seamlessly.

> Federation Router RPC server
> ----------------------------
>
>                 Key: HDFS-11546
>                 URL: https://issues.apache.org/jira/browse/HDFS-11546
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs
>    Affects Versions: HDFS-10467
>            Reporter: Inigo Goiri
>            Assignee: Inigo Goiri
>         Attachments: HDFS-11546-HDFS-10467-000.patch, 
> HDFS-11546-HDFS-10467-001.patch, HDFS-11546-HDFS-10467-002.patch, 
> HDFS-11546-HDFS-10467-003.patch, HDFS-11546-HDFS-10467-004.patch, 
> HDFS-11546-HDFS-10467-005.patch, HDFS-11546-HDFS-10467-006.patch
>
>
> RPC server side of the Federation Router implements ClientProtocol.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Commented] (HDFS-11546) Federation Router RPC server

Reply via email to