Hi Ted,

Thank you for the confirmation.

NFS 'rm -rf's often stumble on such dotfiles, indeed.  It would be
interesting to investigate which files are being kept open, and why, but
I'm not going to consider this a blocker for 3.7.0.

(And yes, it would be interesting to see how that FUSE-based FS would
handle that same issue.)

Your report (rightly) stated "That doesn't look good," but did not
contain an explicit -1.  Given the situation and analysis, I am
consequently not going to count it as a "disapproving vote."

Thank you,

Damien Diederen


Ted Dunning <ted.dunn...@gmail.com> writes:
> Damien,
>
> Great analysis. I am running on an NFS mount.
>
> I can move to native disk, but I may try switching to a FUSE mount. The
> issue is likely that the directory is being deleted while there is an open
> file. When that happens, NFS converts that file into a file whose name
> starts with "." which can cause rmdir to subsequently fail. We have fixed
> that problem on the FUSE mounted version of the file system.
>
>
>
> On Thu, Mar 25, 2021 at 1:55 AM Damien Diederen <ddiede...@apache.org>
> wrote:
>
>>
>> Hi Ted, all,
>>
>> Thank you for testing. You reported:
>>
>> >> I had these test failures. That doesn't look good.
>> […]
>> >
>> > Repeated tests. Similar, but not identical failures.
>> […]
>>
>> I agree that it doesn't look good, but am also unsurprised. This is
>> precisely why I included this text in my call for vote:
>>
>> >>> I cannot say that I find the state of the test suite satisfactory, but
>> >>> the failures which are often observed are due to timing and/or TCP/IP
>> >>> port assignment issues, and repeated runs are "sufficient" to clear
>> >>> them.
>> >>>
>> >>> I was hoping to contribute more on that front, but have been unable so
>> >>> far, and don't want to keep the 3.7 branch hostage--so here is a timid
>> >>> RC2.
>>
>>
>> We currently have a number of tests which are very flaky unless they are
>> being run on hardware which is both powerful and dedicated.
>>
>> This happens for a number of reasons, including hardcoded deadlines and
>> TCP/IP port allocation races.
>>
>> There has been some work on fixing this, notably by Justin Ling Mao and
>> Mohammad Arshad. I have also contributed, and intend to do more--but the
>> past weeks have conspired against it. This is a slope we have to climb,
>> and I'm afraid it will take some time.
>>
>> My understanding is that test suites failures were not supposed to delay
>> 3.7.0 as long as they were 1/ sporadic and 2/ understood. It is very
>> unfortunate that the extent of the problem was not discovered before the
>> 3.7 fork point, as it has kept us in limbo for a while!
>>
>>
>> Note that an additional factor is that 'pom.xml' contains the following:
>>
>>     <surefire-forkcount>8</surefire-forkcount>
>>
>> which isn't very good as a default given the current situation. In
>> fact, both CI suites override it, the Jenkins one with:
>>
>>     -Dsurefire-forkcount=4
>>
>> and GitHub CI with:
>>
>>     -Dsurefire-forkcount=1 -Dsurefire.rerunFailingTestsCount=5
>>
>>
>> >> [*ERROR*] * RequestThrottlerTest.testRequestThrottler:198 expected: <1>
>> >> but was: <0>*
>>
>> This test is a known offender, and is on my list of items to investigate.
>>
>> I highly suspect the 5s timeout is the culprit; it should probably be
>> bumped up by some margin (and STALL_TIME be correspondingly adjusted):
>>
>>     // make sure the server received all 5 requests
>>     submitted.await(5, TimeUnit.SECONDS);
>>     Map<String, Object> metrics = MetricsUtils.currentServerMetrics();
>>
>>     // but only two requests can get into the pipeline because of the throttler
>>     assertEquals(2L, (long) metrics.get("prep_processor_request_queued"));
>>     assertEquals(1L, (long) metrics.get("request_throttle_wait_count"));
>>
>>
>> https://github.com/apache/zookeeper/blob/release-3.7.0-2/zookeeper-server/src/test/java/org/apache/zookeeper/server/RequestThrottlerTest.java#L192-L198
>>
>>
>> >> [*ERROR*] *
>> >> LoadFromLogTest>ClientBase.tearDown:590->ClientBase.recursiveDelete:625
>> >> file
>> >> '/mapr/c0/user/tdunning/zookeeper-release-3.7.0-2/zookeeper-server/target/surefire/test3407191642699245455.junit.dir/version-2'
>> >> deletion failed*
>> […]
>> >> [*ERROR*] * TxnLogToolkitTest.tearDown:62 » IO Unable to delete
>> >> directory /mapr/c0/user/td...*
>>
>> This looks a bit more worrisome; I have never seen that failure.
>>
>> The path '/mapr/c0/...', however, hints at a "nonstandard" filesystem.
>>
>> I would expect the current test suite to trip on weird corner cases on
>> networked filesystems such as NFS, let alone something more exotic. Can
>> you confirm that the above is not a "vanilla" filesystem? If so, that
>> would explain the failure.
>>
>> (Not saying that aspect of the test suite should not be improved.)
>>
>>
>> If you are willing to go through the ordeal once more, I would suggest:
>>
>>   * Limiting/disabling the test suite concurrency, which is currently
>>     known to be problematic. Perhaps with the following:
>>
>>         -Dsurefire-forkcount=1 -Dsurefire.rerunFailingTestsCount=5
>>
>>   * Running it on top of a POSIX/local filesystem;
>>
>>
>> What do you think?
>>
>> Cheers, -D
>>
>>
>>
>> Ted Dunning <ted.dunn...@gmail.com> writes:
>> > Repeated tests. Similar, but not identical failures.
>> >
>> >
>> >
>> > On Wed, Mar 24, 2021 at 4:09 PM Ted Dunning <ted.dunn...@gmail.com>
>> > wrote:
>> >
>> >>
>> >> I had these test failures. That doesn't look good.
>> >>
>> >> I haven't been keeping track, however, and am not sure that they are
>> >> problems:
>> >>
>> >> [*ERROR*] *Failures: *
>> >>
>> >> [*ERROR*] * RequestThrottlerTest.testRequestThrottler:198 expected: <1>
>> >> but was: <0>*
>> >>
>> >> [*ERROR*] *
>> >> LoadFromLogTest>ClientBase.tearDown:590->ClientBase.recursiveDelete:625
>> >> file
>> >> '/mapr/c0/user/tdunning/zookeeper-release-3.7.0-2/zookeeper-server/target/surefire/test3407191642699245455.junit.dir/version-2'
>> >> deletion failed*
>> >>
>> >> [*ERROR*] *Errors: *
>> >>
>> >> [*ERROR*] * TxnLogToolkitTest.tearDown:62 » IO Unable to delete
>> >> directory /mapr/c0/user/td...*
>> >>
>> >> [*ERROR*] * TxnLogToolkitTest.tearDown:62 » IO Unable to delete
>> >> directory /mapr/c0/user/td...*
>> >>
>> >> [*ERROR*] * TxnLogToolkitTest.tearDown:62 » IO Unable to delete
>> >> directory /mapr/c0/user/td...*
>> >>
>> >> [*ERROR*] * TxnLogToolkitTest.tearDown:62 » IO Unable to delete
>> >> directory /mapr/c0/user/td...*
>> >>
>> >> [*INFO*]
>> >>
>> >> [*ERROR*] *Tests run: 2907, Failures: 2, Errors: 4, Skipped: 4*
>> >>
>> >>
>> >> Software info:
>> >>
>> >> Downloaded 3.7.0-2 from github in tar.gz form
>> >>
>> >> Platform info:
>> >>
>> >> Linux nodeb 5.4.0-65-generic #73~18.04.1-Ubuntu SMP Tue Jan 19 09:02:24
>> >> UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>> >>
>> >>
>> >> 32GB RAM, modest number of cores
>> >>
>> >>
>> >>
>> >>
>> >> On Wed, Mar 24, 2021 at 3:10 PM Ted Dunning <ted.dunn...@gmail.com>
>> >> wrote:
>> >>
>> >>>
>> >>> I am starting tests now.
>> >>>
>> >>> I had an issue (self-inflicted) when I tried to run two tests at the same
>> >>> time.
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Mar 24, 2021 at 1:22 PM Damien Diederen <ddiede...@apache.org>
>> >>> wrote:
>> >>>
>> >>>>
>> >>>> Dear all,
>> >>>>
>> >>>> Thank you for the reviews and testing.
>> >>>>
>> >>>> Glad that we have not met yet another snag so far!
>> >>>>
>> >>>> I have counted 5 approving votes, 2 of which are binding. (The mail
>> >>>> archives seem to agree.) As per the process, I am waiting for a 3rd
>> >>>> binding vote before starting the release process.
>> >>>>
>> >>>> Feel free to prod your PMC friends ;)
>> >>>>
>> >>>> Cheers, -D
>> >>>>
>> >>>>
>> >>>>
>> >>>> Enrico Olivelli <eolive...@gmail.com> writes:
>> >>>> > +1 (binding)
>> >>>> > - Build and run tests on Ubuntu (Java and C client) on jdk8
>> >>>> > - verified rat,signatures,sha512sum,checkstyle and spotbugs (on JDK8)
>> >>>> > - This time (for the first time!) I was able to build the C client on
>> >>>> > MacOs (BigSur) !
>> >>>> > - verified the list of License files in the binary tarball
>> >>>> >
>> >>>> > great work Damien
>> >>>> >
>> >>>> > Enrico
>> >>>> >
>> >>>> > On Mon, Mar 22, 2021 at 12:37 Mohammad arshad
>> >>>> > <mohammad.ars...@huawei.com> wrote:
>> >>>> >>
>> >>>> >> +1 (non-binding)
>> >>>> >>
>> >>>> >> Verified signature and checksum of release artifacts, all are ok
>> >>>> >> Run Junit test cases with jdk1.8.0_232, total 2951 test cases, 4
>> >>>> >> skipped, rest all passed
>> >>>> >> Built tarball from source code, installed 3 node cluster and
>> >>>> >> verified basic functionalities from API, executed few cli
>> >>>> >> commands. No issues observed
>> >>>> >> Connected HBase, HDFS and Yarn clusters (all using zk 3.5.6) to zk
>> >>>> >> 3.7.0 cluster, no issues observed.
>> >>>> >>
>> >>>> >> Thanks & Regards
>> >>>> >> Arshad
>> >>>> >> -----Original Message-----
>> >>>> >> From: Patrick Hunt [mailto:ph...@apache.org]
>> >>>> >> Sent: Saturday, March 20, 2021 5:17 AM
>> >>>> >> To: DevZooKeeper <dev@zookeeper.apache.org>
>> >>>> >> Subject: Re: [VOTE] Apache ZooKeeper release 3.7.0 candidate 2
>> >>>> >>
>> >>>> >> +1 xsum/sig validate. rat ran clean and I was able to compile (dep/cve
>> >>>> >> check passed) and manual verification of a few different cluster
>> >>>> >> sizes was successful.
>> >>>> >>
>> >>>> >> Regards,
>> >>>> >>
>> >>>> >> Patrick
>> >>>> >>
>> >>>> >>
>> >>>> >> On Wed, Mar 17, 2021 at 4:06 AM Damien Diederen <ddiede...@apache.org>
>> >>>> >> wrote:
>> >>>> >>
>> >>>> >> >
>> >>>> >> > Greetings, all!
>> >>>> >> >
>> >>>> >> > After a long delay, here is a third release candidate for ZooKeeper
>> >>>> >> > 3.7.0.
>> >>>> >> >
>> >>>> >> > Compared to RC1, it contains... quite a few changes. It notably fixes
>> >>>> >> > the quota feature for multi transactions, repairs the test suite on
>> >>>> >> > macOS (Catalina), makes a few tests less flaky, and avoids a CVE.
>> >>>> >> >
>> >>>> >> > The complete set of changes can be obtained with the Git range
>> >>>> >> > expression 'release-3.7.0-1..release-3.7.0-2', or on GitHub at:
>> >>>> >> >
>> >>>> >> > https://github.com/apache/zookeeper/compare/release-3.7.0-1...release-3.7.0-2
>> >>>> >> >
>> >>>> >> > I cannot say that I find the state of the test suite satisfactory, but
>> >>>> >> > the failures which are often observed are due to timing and/or TCP/IP
>> >>>> >> > port assignment issues, and repeated runs are "sufficient" to clear
>> >>>> >> > them.
>> >>>> >> >
>> >>>> >> > I was hoping to contribute more on that front, but have been unable so
>> >>>> >> > far, and don't want to keep the 3.7 branch hostage--so here is a timid
>> >>>> >> > RC2.
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > ZooKeeper 3.7.0 introduces a number of new features, notably:
>> >>>> >> >
>> >>>> >> >   * An API to start a ZooKeeper server from Java (ZOOKEEPER-3874);
>> >>>> >> >
>> >>>> >> >   * Quota enforcement (ZOOKEEPER-3301);
>> >>>> >> >
>> >>>> >> >   * Host name canonicalization in quorum SASL authentication
>> >>>> >> >     (ZOOKEEPER-4030);
>> >>>> >> >
>> >>>> >> >   * Support for BCFKS key/trust store format (ZOOKEEPER-3950);
>> >>>> >> >
>> >>>> >> >   * A choice of mandatory authentication scheme(s) (ZOOKEEPER-3561);
>> >>>> >> >
>> >>>> >> >   * A "whoami" API and CLI command (ZOOKEEPER-3969);
>> >>>> >> >
>> >>>> >> >   * The possibility of disabling digest authentication
>> >>>> >> >     (ZOOKEEPER-3979);
>> >>>> >> >
>> >>>> >> >   * Multiple SASL "superUsers" (ZOOKEEPER-3959);
>> >>>> >> >
>> >>>> >> >   * Fast-tracking of throttled requests (ZOOKEEPER-3683);
>> >>>> >> >
>> >>>> >> >   * Additional security metrics (ZOOKEEPER-3978);
>> >>>> >> >
>> >>>> >> >   * SASL support in the C and Perl clients (ZOOKEEPER-1112,
>> >>>> >> >     ZOOKEEPER-3714);
>> >>>> >> >
>> >>>> >> >   * A new zkSnapshotComparer.sh tool (ZOOKEEPER-3427);
>> >>>> >> >
>> >>>> >> >   * Notes on how to benchmark ZooKeeper with the YCSB tool
>> >>>> >> >     (ZOOKEEPER-3264).
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > The release notes are available here:
>> >>>> >> >
>> >>>> >> > https://people.apache.org/~ddiederen/zookeeper-3.7.0-candidate-2/website/releasenotes.html
>> >>>> >> >
>> >>>> >> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310801&version=12346617
>> >>>> >> >
>> >>>> >> > *** Please download, test and vote by March 21st 2021, 23:59 UTC+0. ***
>> >>>> >> >
>> >>>> >> > Source files:
>> >>>> >> >
>> >>>> >> > https://people.apache.org/~ddiederen/zookeeper-3.7.0-candidate-2/
>> >>>> >> >
>> >>>> >> > Maven staging repo:
>> >>>> >> >
>> >>>> >> > https://repository.apache.org/content/repositories/orgapachezookeeper-1067/
>> >>>> >> >
>> >>>> >> > The release candidate tag in git to be voted upon: release-3.7.0-2
>> >>>> >> >
>> >>>> >> > https://github.com/apache/zookeeper/tree/release-3.7.0-2
>> >>>> >> >
>> >>>> >> > ZooKeeper's KEYS file containing PGP keys we use to sign the release:
>> >>>> >> >
>> >>>> >> > https://www.apache.org/dist/zookeeper/KEYS
>> >>>> >> >
>> >>>> >> > The staging version of the website is:
>> >>>> >> >
>> >>>> >> > https://people.apache.org/~ddiederen/zookeeper-3.7.0-candidate-2/website/
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > Should we release this candidate?
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > Damien Diederen
>> >>>> >> >
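
On the RequestThrottlerTest excerpt quoted above: the call
submitted.await(5, TimeUnit.SECONDS) discards its result.  Assuming
'submitted' is a java.util.concurrent.CountDownLatch (the excerpt suggests
it, but does not show the declaration), await returns false when the wait
times out, so a slow machine could surface as a wrong metric count a few
lines later rather than as a timeout.  Below is only a minimal sketch of a
helper that makes the timeout explicit; LatchAwait and awaitOrFail are
hypothetical names, not part of ZooKeeper.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical test helper (not part of ZooKeeper): fails loudly when a latch wait times out. */
final class LatchAwait {
    private LatchAwait() {}

    /**
     * Waits for the latch and reports a timeout as an AssertionError, so that a
     * slow run is reported as "timed out" instead of as a later assertion failure.
     */
    static void awaitOrFail(CountDownLatch latch, long timeout, TimeUnit unit) {
        boolean completed;
        try {
            // CountDownLatch.await(long, TimeUnit) returns false if the wait timed out.
            completed = latch.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag set for callers
            throw new AssertionError("interrupted while waiting for latch", e);
        }
        if (!completed) {
            throw new AssertionError("latch not released within " + timeout + " " + unit
                    + " (remaining count: " + latch.getCount() + ")");
        }
    }
}
```

In the test, the call would then read
LatchAwait.awaitOrFail(submitted, 5, TimeUnit.SECONDS).  This does not make
the test any less timing-sensitive; it only makes a timeout distinguishable
from a genuine throttling bug.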