Damien,

Great analysis. I am running on an NFS mount.

I can move to native disk, but I may try switching to a FUSE mount.

The issue is likely that the directory is being deleted while there is an open file. When that happens, NFS converts that file into a file whose name starts with ".", which can cause rmdir to subsequently fail. We have fixed that problem on the FUSE-mounted version of the file system.

On Thu, Mar 25, 2021 at 1:55 AM Damien Diederen <ddiede...@apache.org> wrote:

>
> Hi Ted, all,
>
> Thank you for testing. You reported:
>
> >> I had these test failures. That doesn't look good.
> […]
> >
> > Repeated tests. Similar, but not identical failures.
> […]
>
> I agree that it doesn't look good, but am also unsurprised. This is
> precisely why I included this text in my call for vote:
>
> >>> I cannot say that I find the state of the test suite satisfactory, but
> >>> the failures which are often observed are due to timing and/or TCP/IP
> >>> port assignment issues, and repeated runs are "sufficient" to clear
> >>> them.
> >>>
> >>> I was hoping to contribute more on that front, but have been unable so
> >>> far, and don't want to keep the 3.7 branch hostage—so here is a timid
> >>> RC2.
>
>
> We currently have a number of tests which are very flaky unless they are
> being run on hardware which is both powerful and dedicated.
>
> This happens for a number of reasons, including hardcoded deadlines and
> TCP/IP port allocation races.
>
> There has been some work on fixing this, notably by Justin Ling Mao and
> Mohammad Arshad. I have also contributed, and intend to do more—but the
> past weeks have conspired against it. This is a slope we have to climb,
> and I'm afraid it will take some time.
>
> My understanding is that test suite failures were not supposed to delay
> 3.7.0 as long as they were 1/ sporadic and 2/ understood. It is very
> unfortunate that the extent of the problem was not discovered before the
> 3.7 fork point, as it has kept us in limbo for a while!
>
>
> Note that an additional factor is that 'pom.xml' contains the following:
>
>     <surefire-forkcount>8</surefire-forkcount>
>
> which isn't very good as a default given the current situation. In
> fact, both CI suites override it, the Jenkins one with:
>
>     -Dsurefire-forkcount=4
>
> and GitHub CI with:
>
>     -Dsurefire-forkcount=1 -Dsurefire.rerunFailingTestsCount=5
>
>
> >> [*ERROR*] * RequestThrottlerTest.testRequestThrottler:198 expected: <1>
> >> but was: <0>*
>
> This test is a known offender, and is on my list of items to investigate.
>
> I highly suspect the 5s timeout is the culprit; it should probably be
> bumped up by some margin (and STALL_TIME be correspondingly adjusted):
>
>     // make sure the server received all 5 requests
>     submitted.await(5, TimeUnit.SECONDS);
>     Map<String, Object> metrics = MetricsUtils.currentServerMetrics();
>
>     // but only two requests can get into the pipeline because of the throttler
>     assertEquals(2L, (long) metrics.get("prep_processor_request_queued"));
>     assertEquals(1L, (long) metrics.get("request_throttle_wait_count"));
>
>
> https://github.com/apache/zookeeper/blob/release-3.7.0-2/zookeeper-server/src/test/java/org/apache/zookeeper/server/RequestThrottlerTest.java#L192-L198
>
>
> >> [*ERROR*] * LoadFromLogTest>ClientBase.tearDown:590->ClientBase.recursiveDelete:625
> >> file '/mapr/c0/user/tdunning/zookeeper-release-3.7.0-2/zookeeper-server/target/surefire/test3407191642699245455.junit.dir/version-2'
> >> deletion failed*
> […]
> >> [*ERROR*] * TxnLogToolkitTest.tearDown:62 » IO Unable to delete
> >> directory /mapr/c0/user/td...*
>
> This looks a bit more worrisome; I have never seen that failure.
>
> The path '/mapr/c0/...', however, hints at a "nonstandard" filesystem.
>
> I would expect the current test suite to trip on weird corner cases on
> networked filesystems such as NFS, let alone something more exotic. Can
> you confirm that the above is not a "vanilla" filesystem? If so, that
> would explain the failure.
>
> (Not saying that aspect of the test suite should not be improved.)
>
>
> If you are willing to go through the ordeal once more, I would suggest:
>
>   * Limiting/disabling the test suite concurrency, which is currently
>     known to be problematic. Perhaps with the following:
>
>         -Dsurefire-forkcount=1 -Dsurefire.rerunFailingTestsCount=5
>
>   * Running it on top of a POSIX/local filesystem;
>
>
> What do you think?
>
> Cheers, -D
>
>
>
> Ted Dunning <ted.dunn...@gmail.com> writes:
> > Repeated tests. Similar, but not identical failures.
> >
> >
> >
> > On Wed, Mar 24, 2021 at 4:09 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> >>
> >> I had these test failures. That doesn't look good.
> >>
> >> I haven't been keeping track, however, and am not sure that they are
> >> problems:
> >>
> >> [*ERROR*] *Failures: *
> >>
> >> [*ERROR*] * RequestThrottlerTest.testRequestThrottler:198 expected: <1>
> >> but was: <0>*
> >>
> >> [*ERROR*] * LoadFromLogTest>ClientBase.tearDown:590->ClientBase.recursiveDelete:625
> >> file '/mapr/c0/user/tdunning/zookeeper-release-3.7.0-2/zookeeper-server/target/surefire/test3407191642699245455.junit.dir/version-2'
> >> deletion failed*
> >>
> >> [*ERROR*] *Errors: *
> >>
> >> [*ERROR*] * TxnLogToolkitTest.tearDown:62 » IO Unable to delete
> >> directory /mapr/c0/user/td...*
> >>
> >> [*ERROR*] * TxnLogToolkitTest.tearDown:62 » IO Unable to delete
> >> directory /mapr/c0/user/td...*
> >>
> >> [*ERROR*] * TxnLogToolkitTest.tearDown:62 » IO Unable to delete
> >> directory /mapr/c0/user/td...*
> >>
> >> [*ERROR*] * TxnLogToolkitTest.tearDown:62 » IO Unable to delete
> >> directory /mapr/c0/user/td...*
> >>
> >> [*INFO*]
> >>
> >> [*ERROR*] *Tests run: 2907, Failures: 2, Errors: 4, Skipped: 4*
> >>
> >>
> >> Software info:
> >>
> >> Downloaded 3.7.0-2 from github in tar.gz form
> >>
> >> Platform info:
> >>
> >> Linux nodeb 5.4.0-65-generic #73~18.04.1-Ubuntu SMP Tue Jan 19 09:02:24
> >> UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> >>
> >>
> >> 32GB RAM, modest number of cores
> >>
> >>
> >>
> >>
> >> On Wed, Mar 24, 2021 at 3:10 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>
> >>>
> >>> I am starting tests now.
> >>>
> >>> I had an issue (self-inflicted) when I tried to run two tests at the same
> >>> time.
> >>>
> >>>
> >>>
> >>> On Wed, Mar 24, 2021 at 1:22 PM Damien Diederen <ddiede...@apache.org>
> >>> wrote:
> >>>
> >>>>
> >>>> Dear all,
> >>>>
> >>>> Thank you for the reviews and testing.
> >>>>
> >>>> Glad that we have not met yet another snag so far!
> >>>>
> >>>> I have counted 5 approving votes, 2 of which are binding. (The mail
> >>>> archives seem to agree.) As per the process, I am waiting for a 3rd
> >>>> binding vote before starting the release process.
> >>>>
> >>>> Feel free to prod your PMC friends ;)
> >>>>
> >>>> Cheers, -D
> >>>>
> >>>>
> >>>>
> >>>> Enrico Olivelli <eolive...@gmail.com> writes:
> >>>> > +1 (binding)
> >>>> > - Build and run tests on Ubuntu (Java and C client) on jdk8
> >>>> > - verified rat,signatures,sha512sum,checkstyle and spotbugs (on JDK8)
> >>>> > - This time (for the first time!) I was able to build the C client on
> >>>> > MacOs (BigSur) !
> >>>> > - verified the list of License files in the binary tarball
> >>>> >
> >>>> > great work Damien
> >>>> >
> >>>> > Enrico
> >>>> >
> >>>> > On Mon, Mar 22, 2021 at 12:37, Mohammad arshad
> >>>> > <mohammad.ars...@huawei.com> wrote:
> >>>> >>
> >>>> >> +1 (non-binding)
> >>>> >>
> >>>> >> Verified signature and checksum of release artifacts, all are ok
> >>>> >> Run Junit test cases with jdk1.8.0_232, total 2951 test cases, 4
> >>>> >> skipped, rest all passed
> >>>> >> Built tarball from source code, installed 3 node cluster and
> >>>> >> verified basic functionalities from API, executed few cli
> >>>> >> commands. No issues observed
> >>>> >> Connected HBase, HDFS and Yarn clusters (all using zk 3.5.6) to zk
> >>>> >> 3.7.0 cluster, no issues observed.
> >>>> >>
> >>>> >> Thanks & Regards
> >>>> >> Arshad
> >>>> >> -----Original Message-----
> >>>> >> From: Patrick Hunt [mailto:ph...@apache.org]
> >>>> >> Sent: Saturday, March 20, 2021 5:17 AM
> >>>> >> To: DevZooKeeper <dev@zookeeper.apache.org>
> >>>> >> Subject: Re: [VOTE] Apache ZooKeeper release 3.7.0 candidate 2
> >>>> >>
> >>>> >> +1 xsum/sig validate. rat ran clean and I was able to compile (dep/cve
> >>>> >> check passed) and manual verification of a few different cluster
> >>>> >> sizes was successful.
> >>>> >>
> >>>> >> Regards,
> >>>> >>
> >>>> >> Patrick
> >>>> >>
> >>>> >>
> >>>> >> On Wed, Mar 17, 2021 at 4:06 AM Damien Diederen <ddiede...@apache.org>
> >>>> >> wrote:
> >>>> >>
> >>>> >> >
> >>>> >> > Greetings, all!
> >>>> >> >
> >>>> >> > After a long delay, here is a third release candidate for ZooKeeper
> >>>> >> > 3.7.0.
> >>>> >> >
> >>>> >> > Compared to RC1, it contains... quite a few changes. It notably fixes
> >>>> >> > the quota feature for multi transactions, repairs the test suite on
> >>>> >> > macOS (Catalina), makes a few tests less flaky, and avoids a CVE.
> >>>> >> >
> >>>> >> > The complete set of changes can be obtained with the Git range
> >>>> >> > expression 'release-3.7.0-1..release-3.7.0-2', or on GitHub at:
> >>>> >> >
> >>>> >> > https://github.com/apache/zookeeper/compare/release-3.7.0-1...release-3.7.0-2
> >>>> >> >
> >>>> >> > I cannot say that I find the state of the test suite satisfactory, but
> >>>> >> > the failures which are often observed are due to timing and/or TCP/IP
> >>>> >> > port assignment issues, and repeated runs are "sufficient" to clear
> >>>> >> > them.
> >>>> >> >
> >>>> >> > I was hoping to contribute more on that front, but have been unable so
> >>>> >> > far, and don't want to keep the 3.7 branch hostage—so here is a timid
> >>>> >> > RC2.
> >>>> >> >
> >>>> >> >
> >>>> >> > ZooKeeper 3.7.0 introduces a number of new features, notably:
> >>>> >> >
> >>>> >> >   * An API to start a ZooKeeper server from Java (ZOOKEEPER-3874);
> >>>> >> >
> >>>> >> >   * Quota enforcement (ZOOKEEPER-3301);
> >>>> >> >
> >>>> >> >   * Host name canonicalization in quorum SASL authentication
> >>>> >> >     (ZOOKEEPER-4030);
> >>>> >> >
> >>>> >> >   * Support for BCFKS key/trust store format (ZOOKEEPER-3950);
> >>>> >> >
> >>>> >> >   * A choice of mandatory authentication scheme(s) (ZOOKEEPER-3561);
> >>>> >> >
> >>>> >> >   * A "whoami" API and CLI command (ZOOKEEPER-3969);
> >>>> >> >
> >>>> >> >   * The possibility of disabling digest authentication
> >>>> >> >     (ZOOKEEPER-3979);
> >>>> >> >
> >>>> >> >   * Multiple SASL "superUsers" (ZOOKEEPER-3959);
> >>>> >> >
> >>>> >> >   * Fast-tracking of throttled requests (ZOOKEEPER-3683);
> >>>> >> >
> >>>> >> >   * Additional security metrics (ZOOKEEPER-3978);
> >>>> >> >
> >>>> >> >   * SASL support in the C and Perl clients (ZOOKEEPER-1112,
> >>>> >> >     ZOOKEEPER-3714);
> >>>> >> >
> >>>> >> >   * A new zkSnapshotComparer.sh tool (ZOOKEEPER-3427);
> >>>> >> >
> >>>> >> >   * Notes on how to benchmark ZooKeeper with the YCSB tool
> >>>> >> >     (ZOOKEEPER-3264).
> >>>> >> >
> >>>> >> >
> >>>> >> > The release notes are available here:
> >>>> >> >
> >>>> >> > https://people.apache.org/~ddiederen/zookeeper-3.7.0-candidate-2/website/releasenotes.html
> >>>> >> >
> >>>> >> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310801&version=12346617
> >>>> >> >
> >>>> >> > *** Please download, test and vote by March 21st 2021, 23:59 UTC+0.
> >>>> >> > ***
> >>>> >> >
> >>>> >> > Source files:
> >>>> >> >
> >>>> >> > https://people.apache.org/~ddiederen/zookeeper-3.7.0-candidate-2/
> >>>> >> >
> >>>> >> > Maven staging repo:
> >>>> >> >
> >>>> >> > https://repository.apache.org/content/repositories/orgapachezookeeper-1067/
> >>>> >> >
> >>>> >> > The release candidate tag in git to be voted upon: release-3.7.0-2
> >>>> >> >
> >>>> >> > https://github.com/apache/zookeeper/tree/release-3.7.0-2
> >>>> >> >
> >>>> >> > ZooKeeper's KEYS file containing PGP keys we use to sign the release:
> >>>> >> >
> >>>> >> > https://www.apache.org/dist/zookeeper/KEYS
> >>>> >> >
> >>>> >> > The staging version of the website is:
> >>>> >> >
> >>>> >> > https://people.apache.org/~ddiederen/zookeeper-3.7.0-candidate-2/website/
> >>>> >> >
> >>>> >> >
> >>>> >> > Should we release this candidate?
> >>>> >> >
> >>>> >> >
> >>>> >> > Damien Diederen
> >>>> >> >
> >>>>
> >>>
>
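
For reference, here is a minimal sketch of the teardown ordering discussed at the top of this thread: close every open handle first, then delete the directory tree children-first. It assumes, as described above, that deleting a still-open file on NFS leaves a hidden placeholder behind, so the later rmdir finds the directory non-empty. The class and method names (TearDownSketch, deleteRecursively) and the file names are hypothetical and are not taken from the ZooKeeper test suite; this is not the fix that was applied to any file system or test.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration only; not part of ZooKeeper. */
final class TearDownSketch {

    /** Deletes root and everything below it, children before parents. */
    public static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        // Close the directory walk before deleting anything, so the walk
        // itself holds no open directory handles during the deletes.
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse natural order lists the deepest entries first.
            paths = walk.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p); // throws if a directory is unexpectedly non-empty
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("teardown-sketch");
        Path log = dir.resolve("version-2").resolve("log.1");
        Files.createDirectories(log.getParent());

        // try-with-resources closes the handle before cleanup starts.
        try (OutputStream out = Files.newOutputStream(log)) {
            out.write("txn".getBytes(StandardCharsets.UTF_8));
        }

        deleteRecursively(dir);
        System.out.println("directory removed: " + !Files.exists(dir));
    }
}
```

On a local POSIX filesystem the delete would succeed even with the file still open, which is why this kind of ordering problem tends to show up only on networked mounts.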