This vote failed due to Daryn Sharp's veto. The concern is being addressed by HDFS-13873. I will start a new vote once this is committed.
Note for Daryn. Your non-responsive handling of the veto sets a bad precedent and is a bad example of communication on the lists from a respected member of this community. Please check your availability for follow-up discussions if you choose to get involved with important decisions.

On Fri, Dec 7, 2018 at 4:10 PM Konstantin Shvachko <shv.had...@gmail.com> wrote:

> Hi Daryn,
>
> Wanted to back up Chen's earlier response to your concerns about rotating
> calls in the call queue.
> Our design
> 1. targets directly the livelock problem by rejecting calls on the
> Observer that are not likely to be responded to in a timely manner: HDFS-13873.
> 2. The call queue rotation is only done on Observers, and never on the
> active NN, so it stays free of attacks like you suggest.
>
> If this is a satisfactory mitigation for the problem could you please
> reconsider your -1, so that people could continue voting on this thread.
>
> Thanks,
> --Konst
>
> On Thu, Dec 6, 2018 at 10:38 AM Daryn Sharp <da...@oath.com> wrote:
>
>> -1 pending additional info. After a cursory scan, I have serious
>> concerns regarding the design. This seems like a feature that should have
>> been purely implemented in hdfs w/o touching the common IPC layer.
>>
>> The biggest issue is the alignment context. Its purpose appears to be
>> for allowing handlers to reinsert calls back into the call queue. That's
>> completely unacceptable. A buggy or malicious client can easily cause
>> livelock in the IPC layer with handlers only looping on calls that never
>> satisfy the condition. Why is this not implemented via RetriableExceptions?
>>
>> On Thu, Dec 6, 2018 at 1:24 AM Yongjun Zhang <yzh...@cloudera.com.invalid>
>> wrote:
>>
>>> Great work guys.
>>>
>>> Wonder if we can elaborate what's the impact of not having #2 fixed, and
>>> why #2 is not needed for the feature to complete?
>>> 2. Need to fix automatic failover with ZKFC. Currently it doesn't
>>> know about ObserverNodes, trying to convert them to SBNs.
>>>
>>> Thanks.
>>> --Yongjun
>>>
>>>
>>> On Wed, Dec 5, 2018 at 5:27 PM Konstantin Shvachko <shv.had...@gmail.com>
>>> wrote:
>>>
>>> > Hi Hadoop developers,
>>> >
>>> > I would like to propose to merge to trunk the feature branch HDFS-12943
>>> > for Consistent Reads from Standby Node. The feature is intended to scale
>>> > read RPC workloads. On large clusters reads comprise 95% of all RPCs to
>>> > the NameNode. We should be able to accommodate higher overall RPC
>>> > workloads (up to 4x by some estimates) by adding multiple ObserverNodes.
>>> >
>>> > The main functionality has been implemented; see sub-tasks of HDFS-12943.
>>> > We followed up with the test plan. Testing was done on two independent
>>> > clusters (see HDFS-14058 and HDFS-14059) with security enabled.
>>> > We ran standard HDFS commands, MR jobs, admin commands including manual
>>> > failover.
>>> > We know of one cluster running this feature in production.
>>> >
>>> > There are a few outstanding issues:
>>> > 1. Need to provide proper documentation - a user guide for the new
>>> > feature
>>> > 2. Need to fix automatic failover with ZKFC. Currently it doesn't
>>> > know about ObserverNodes, trying to convert them to SBNs.
>>> > 3. Scale testing and performance fine-tuning
>>> > 4. As testing progresses, we continue fixing non-critical bugs like
>>> > HDFS-14116.
>>> >
>>> > I attached a unified patch to the umbrella jira for the review and
>>> > Jenkins build.
>>> > Please vote on this thread. The vote will run for 7 days until Wed Dec 12.
>>> >
>>> > Thanks,
>>> > --Konstantin
>>> >
>>>
>>
>> --
>>
>> Daryn
>