Hi Ted,

Thank you for the confirmation.  NFS 'rm -rf's often stumble on such
dotfiles, indeed.  It would be interesting to investigate which files are
being kept open, and why—but I'm not going to consider this a blocker
for 3.7.0.  (And yes, it would be interesting to see how that FUSE-based
FS would handle that same issue.)
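
As a starting point for such an investigation, here is a minimal sketch
(not ZooKeeper code) which walks a directory tree and reports leftover
regular files whose names start with a dot.  The names 'DotfileFinder'
and 'findDotfiles' are made up for illustration, and the assumption that
every leftover starts with "." is taken from the description quoted
below; it does not tell which process keeps the file open.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DotfileFinder {
    // Hypothetical helper: returns the regular files under 'root' whose
    // names start with "." (assumed to be how the leftovers are named).
    static List<Path> findDotfiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().startsWith("."))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Directory to inspect; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : findDotfiles(root)) {
            System.out.println("leftover: " + p);
        }
    }
}
```

Running it against the test's temporary directory right after a failed
deletion should list whatever is still blocking the rmdir.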

Your report (rightly) stated "That doesn't look good," but did not
contain an explicit -1.  Given the situation and analysis, I am
consequently not going to count it as a "disapproving vote."

Thank you,
Damien Diederen



Ted Dunning <ted.dunn...@gmail.com> writes:
> Damien,
>
> Great analysis. I am running on an NFS mount.
>
> I can move to native disk, but I may try switching to a FUSE mount. The
> issue is likely that the directory is being deleted while there is an open
> file. When that happens, NFS converts that file into a file whose name
> starts with "." which can cause rmdir to subsequently fail. We have fixed
> that problem on the FUSE mounted version of the file system.
>
>
>
> On Thu, Mar 25, 2021 at 1:55 AM Damien Diederen <ddiede...@apache.org>
> wrote:
>
>>
>> Hi Ted, all,
>>
>> Thank you for testing.  You reported:
>>
>> >> I had these test failures. That doesn't look good.
>> […]
>> >
>> > Repeated tests. Similar, but not identical failures.
>> […]
>>
>> I agree that it doesn't look good, but am also unsurprised.  This is
>> precisely why I included this text in my call for vote:
>>
>> >>> I cannot say that I find the state of the test suite satisfactory, but
>> >>> the failures which are often observed are due to timing and/or TCP/IP
>> >>> port assignment issues, and repeated runs are "sufficient" to clear
>> >>> them.
>> >>>
>> >>> I was hoping to contribute more on that front, but have been unable so
>> >>> far, and don't want to keep the 3.7 branch hostage—so here is a timid
>> >>> RC2.
>>
>>
>> We currently have a number of tests which are very flaky unless they are
>> being run on hardware which is both powerful and dedicated.
>>
>> This happens for a number of reasons, including hardcoded deadlines and
>> TCP/IP port allocation races.
>>
>> There has been some work on fixing this, notably by Justin Ling Mao and
>> Mohammad Arshad.  I have also contributed, and intend to do more—but the
>> past weeks have conspired against it.  This is a slope we have to climb,
>> and I'm afraid it will take some time.
>>
>> My understanding is that test suite failures were not supposed to delay
>> 3.7.0 as long as they were 1/ sporadic and 2/ understood.  It is very
>> unfortunate that the extent of the problem was not discovered before the
>> 3.7 fork point, as it has kept us in limbo for a while!
>>
>>
>> Note that an additional factor is that 'pom.xml' contains the following:
>>
>>     <surefire-forkcount>8</surefire-forkcount>
>>
>> which isn't very good as a default given the current situation.  In
>> fact, both CI suites override it, the Jenkins one with:
>>
>>     -Dsurefire-forkcount=4
>>
>> and GitHub CI with:
>>
>>     -Dsurefire-forkcount=1 -Dsurefire.rerunFailingTestsCount=5
>>
>>
>> >> [*ERROR*] *  RequestThrottlerTest.testRequestThrottler:198 expected: <1>
>> >> but was: <0>*
>>
>> This test is a known offender, and is on my list of items to investigate.
>>
>> I highly suspect the 5s timeout is the culprit; it should probably be
>> bumped up by some margin (and STALL_TIME be correspondingly adjusted):
>>
>>     // make sure the server received all 5 requests
>>     submitted.await(5, TimeUnit.SECONDS);
>>     Map<String, Object> metrics = MetricsUtils.currentServerMetrics();
>>
>>     // but only two requests can get into the pipeline because of the throttler
>>     assertEquals(2L, (long) metrics.get("prep_processor_request_queued"));
>>     assertEquals(1L, (long) metrics.get("request_throttle_wait_count"));
>>
>>
>> https://github.com/apache/zookeeper/blob/release-3.7.0-2/zookeeper-server/src/test/java/org/apache/zookeeper/server/RequestThrottlerTest.java#L192-L198
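
One way to make such a timeout show up as a timeout, rather than as a
later assertion mismatch, is to check the value returned by 'await'.
What follows is a minimal, self-contained sketch, assuming 'submitted'
is a java.util.concurrent.CountDownLatch (the snippet above does not
show its declaration).  'LatchTimeoutDemo' and 'awaitOrFail' are
hypothetical names, not part of the ZooKeeper test suite.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchTimeoutDemo {
    // Hypothetical helper: waits on the latch and fails loudly if the
    // deadline passes before the count reaches zero.
    static void awaitOrFail(CountDownLatch latch, long timeout, TimeUnit unit, String what)
            throws InterruptedException {
        // CountDownLatch.await(long, TimeUnit) returns false on timeout.
        if (!latch.await(timeout, unit)) {
            throw new AssertionError("timed out after " + timeout + " " + unit
                + " waiting for " + what + "; remaining count: " + latch.getCount());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the latch counted down as requests are submitted.
        CountDownLatch submitted = new CountDownLatch(1);
        new Thread(submitted::countDown).start();
        awaitOrFail(submitted, 5, TimeUnit.SECONDS, "submitted requests");
        System.out.println("all requests submitted");
    }
}
```

This does not make the test less sensitive to slow hardware; it only
makes the cause of a failure easier to read in the report.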
>>
>>
>> >> [*ERROR*] *
>> >> LoadFromLogTest>ClientBase.tearDown:590->ClientBase.recursiveDelete:625
>> >> file
>> >> '/mapr/c0/user/tdunning/zookeeper-release-3.7.0-2/zookeeper-server/target/surefire/test3407191642699245455.junit.dir/version-2'
>> >> deletion failed*
>> […]
>> >> [*ERROR*] *  TxnLogToolkitTest.tearDown:62 » IO Unable to delete
>> >> directory /mapr/c0/user/td...*
>>
>> This looks a bit more worrisome; I have never seen that failure.
>>
>> The path '/mapr/c0/...', however, hints at a "nonstandard" filesystem.
>>
>> I would expect the current test suite to trip on weird corner cases on
>> networked filesystems such as NFS, let alone something more exotic.  Can
>> you confirm that the above is not a "vanilla" filesystem?  If so, that
>> would explain the failure.
>>
>> (Not saying that aspect of the test suite should not be improved.)
>>
>>
>> If you are willing to go through the ordeal once more, I would suggest:
>>
>>   * Limiting/disabling the test suite concurrency, which is currently
>>     known to be problematic.  Perhaps with the following:
>>
>>         -Dsurefire-forkcount=1 -Dsurefire.rerunFailingTestsCount=5
>>
>>   * Running it on top of a POSIX/local filesystem;
>>
>>
>> What do you think?
>>
>> Cheers, -D
>>
>>
>>
>> Ted Dunning <ted.dunn...@gmail.com> writes:
>> > Repeated tests. Similar, but not identical failures.
>> >
>> >
>> >
>> > On Wed, Mar 24, 2021 at 4:09 PM Ted Dunning <ted.dunn...@gmail.com>
>> > wrote:
>> >
>> >>
>> >> I had these test failures. That doesn't look good.
>> >>
>> >> I haven't been keeping track, however, and am not sure that they are
>> >> problems:
>> >>
>> >> [*ERROR*] *Failures: *
>> >>
>> >> [*ERROR*] *  RequestThrottlerTest.testRequestThrottler:198 expected: <1>
>> >> but was: <0>*
>> >>
>> >> [*ERROR*] *
>> >> LoadFromLogTest>ClientBase.tearDown:590->ClientBase.recursiveDelete:625
>> >> file
>> >> '/mapr/c0/user/tdunning/zookeeper-release-3.7.0-2/zookeeper-server/target/surefire/test3407191642699245455.junit.dir/version-2'
>> >> deletion failed*
>> >>
>> >> [*ERROR*] *Errors: *
>> >>
>> >> [*ERROR*] *  TxnLogToolkitTest.tearDown:62 » IO Unable to delete
>> >> directory /mapr/c0/user/td...*
>> >>
>> >> [*ERROR*] *  TxnLogToolkitTest.tearDown:62 » IO Unable to delete
>> >> directory /mapr/c0/user/td...*
>> >>
>> >> [*ERROR*] *  TxnLogToolkitTest.tearDown:62 » IO Unable to delete
>> >> directory /mapr/c0/user/td...*
>> >>
>> >> [*ERROR*] *  TxnLogToolkitTest.tearDown:62 » IO Unable to delete
>> >> directory /mapr/c0/user/td...*
>> >>
>> >> [*INFO*]
>> >>
>> >> [*ERROR*] *Tests run: 2907, Failures: 2, Errors: 4, Skipped: 4*
>> >>
>> >>
>> >> Software info:
>> >>
>> >> Downloaded 3.7.0-2 from github in tar.gz form
>> >>
>> >> Platform info:
>> >>
>> >> Linux nodeb 5.4.0-65-generic #73~18.04.1-Ubuntu SMP Tue Jan 19 09:02:24
>> >> UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>> >>
>> >>
>> >> 32GB RAM, modest number of cores
>> >>
>> >>
>> >>
>> >>
>> >> On Wed, Mar 24, 2021 at 3:10 PM Ted Dunning <ted.dunn...@gmail.com>
>> >> wrote:
>> >>
>> >>>
>> >>> I am starting tests now.
>> >>>
>> >>> I had an issue (self-inflicted) when I tried to run two tests at the
>> >>> same time.
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Mar 24, 2021 at 1:22 PM Damien Diederen <ddiede...@apache.org>
>> >>> wrote:
>> >>>
>> >>>>
>> >>>> Dear all,
>> >>>>
>> >>>> Thank you for the reviews and testing.
>> >>>>
>> >>>> Glad that we have not met yet another snag so far!
>> >>>>
>> >>>> I have counted 5 approving votes, 2 of which are binding.  (The mail
>> >>>> archives seem to agree.)  As per the process, I am waiting for a 3rd
>> >>>> binding vote before starting the release process.
>> >>>>
>> >>>> Feel free to prod your PMC friends ;)
>> >>>>
>> >>>> Cheers, -D
>> >>>>
>> >>>>
>> >>>>
>> >>>> Enrico Olivelli <eolive...@gmail.com> writes:
>> >>>> > +1 (binding)
>> >>>> > - Build and run tests on Ubuntu (Java and C client) on jdk8
>> >>>> > - verified rat,signatures,sha512sum,checkstyle and spotbugs (on JDK8)
>> >>>> > - This time (for the first time!) I was able to build the C client on
>> >>>> >   MacOs (BigSur) !
>> >>>> > - verified the list of License files in the binary tarball
>> >>>> >
>> >>>> > great work Damien
>> >>>> >
>> >>>> > Enrico
>> >>>> >
>> >>>> > On Mon, Mar 22, 2021 at 12:37, Mohammad arshad
>> >>>> > <mohammad.ars...@huawei.com> wrote:
>> >>>> >>
>> >>>> >> +1 (non-binding)
>> >>>> >>
>> >>>> >> Verified signature and checksum of release artifacts, all are ok
>> >>>> >> Run Junit test cases with jdk1.8.0_232, total 2951 test cases, 4
>> >>>> >> skipped, rest all passed
>> >>>> >> Built tarball from source code, installed 3 node cluster and
>> >>>> >> verified basic functionalities from API, executed few cli
>> >>>> >> commands. No issues observed
>> >>>> >> Connected HBase, HDFS and Yarn clusters (all using zk 3.5.6) to zk
>> >>>> >> 3.7.0 cluster, no issues observed.
>> >>>> >>
>> >>>> >> Thanks & Regards
>> >>>> >> Arshad
>> >>>> >> -----Original Message-----
>> >>>> >> From: Patrick Hunt [mailto:ph...@apache.org]
>> >>>> >> Sent: Saturday, March 20, 2021 5:17 AM
>> >>>> >> To: DevZooKeeper <dev@zookeeper.apache.org>
>> >>>> >> Subject: Re: [VOTE] Apache ZooKeeper release 3.7.0 candidate 2
>> >>>> >>
>> >>>> >> +1 xsum/sig validate. rat ran clean and I was able to compile
>> >>>> >> (dep/cve check passed) and manual verification of a few different
>> >>>> >> cluster sizes was successful.
>> >>>> >>
>> >>>> >> Regards,
>> >>>> >>
>> >>>> >> Patrick
>> >>>> >>
>> >>>> >>
>> >>>> >> On Wed, Mar 17, 2021 at 4:06 AM Damien Diederen <ddiede...@apache.org>
>> >>>> >> wrote:
>> >>>> >>
>> >>>> >> >
>> >>>> >> > Greetings, all!
>> >>>> >> >
>> >>>> >> > After a long delay, here is a third release candidate for
>> >>>> >> > ZooKeeper 3.7.0.
>> >>>> >> >
>> >>>> >> > Compared to RC1, it contains... quite a few changes.  It notably
>> >>>> >> > fixes the quota feature for multi transactions, repairs the test
>> >>>> >> > suite on macOS (Catalina), makes a few tests less flaky, and
>> >>>> >> > avoids a CVE.
>> >>>> >> >
>> >>>> >> > The complete set of changes can be obtained with the Git range
>> >>>> >> > expression 'release-3.7.0-1..release-3.7.0-2', or on GitHub at:
>> >>>> >> >
>> >>>> >> >   https://github.com/apache/zookeeper/compare/release-3.7.0-1...release-3.7.0-2
>> >>>> >> >
>> >>>> >> > I cannot say that I find the state of the test suite
>> >>>> >> > satisfactory, but the failures which are often observed are due
>> >>>> >> > to timing and/or TCP/IP port assignment issues, and repeated
>> >>>> >> > runs are "sufficient" to clear them.
>> >>>> >> >
>> >>>> >> > I was hoping to contribute more on that front, but have been
>> >>>> >> > unable so far, and don't want to keep the 3.7 branch hostage, so
>> >>>> >> > here is a timid RC2.
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > ZooKeeper 3.7.0 introduces a number of new features, notably:
>> >>>> >> >
>> >>>> >> >   * An API to start a ZooKeeper server from Java (ZOOKEEPER-3874);
>> >>>> >> >
>> >>>> >> >   * Quota enforcement (ZOOKEEPER-3301);
>> >>>> >> >
>> >>>> >> >   * Host name canonicalization in quorum SASL authentication
>> >>>> >> >     (ZOOKEEPER-4030);
>> >>>> >> >
>> >>>> >> >   * Support for BCFKS key/trust store format (ZOOKEEPER-3950);
>> >>>> >> >
>> >>>> >> >   * A choice of mandatory authentication scheme(s) (ZOOKEEPER-3561);
>> >>>> >> >
>> >>>> >> >   * A "whoami" API and CLI command (ZOOKEEPER-3969);
>> >>>> >> >
>> >>>> >> >   * The possibility of disabling digest authentication
>> >>>> >> >     (ZOOKEEPER-3979);
>> >>>> >> >
>> >>>> >> >   * Multiple SASL "superUsers" (ZOOKEEPER-3959);
>> >>>> >> >
>> >>>> >> >   * Fast-tracking of throttled requests (ZOOKEEPER-3683);
>> >>>> >> >
>> >>>> >> >   * Additional security metrics (ZOOKEEPER-3978);
>> >>>> >> >
>> >>>> >> >   * SASL support in the C and Perl clients (ZOOKEEPER-1112,
>> >>>> >> >     ZOOKEEPER-3714);
>> >>>> >> >
>> >>>> >> >   * A new zkSnapshotComparer.sh tool (ZOOKEEPER-3427);
>> >>>> >> >
>> >>>> >> >   * Notes on how to benchmark ZooKeeper with the YCSB tool
>> >>>> >> >     (ZOOKEEPER-3264).
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > The release notes are available here:
>> >>>> >> >
>> >>>> >> >   https://people.apache.org/~ddiederen/zookeeper-3.7.0-candidate-2/website/releasenotes.html
>> >>>> >> >
>> >>>> >> >   https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310801&version=12346617
>> >>>> >> >
>> >>>> >> > *** Please download, test and vote by March 21st 2021, 23:59 UTC+0. ***
>> >>>> >> >
>> >>>> >> > Source files:
>> >>>> >> >
>> >>>> >> >   https://people.apache.org/~ddiederen/zookeeper-3.7.0-candidate-2/
>> >>>> >> >
>> >>>> >> > Maven staging repo:
>> >>>> >> >
>> >>>> >> >   https://repository.apache.org/content/repositories/orgapachezookeeper-1067/
>> >>>> >> >
>> >>>> >> > The release candidate tag in git to be voted upon: release-3.7.0-2
>> >>>> >> >
>> >>>> >> >   https://github.com/apache/zookeeper/tree/release-3.7.0-2
>> >>>> >> >
>> >>>> >> > ZooKeeper's KEYS file containing PGP keys we use to sign the
>> >>>> >> > release:
>> >>>> >> >
>> >>>> >> >   https://www.apache.org/dist/zookeeper/KEYS
>> >>>> >> >
>> >>>> >> > The staging version of the website is:
>> >>>> >> >
>> >>>> >> >   https://people.apache.org/~ddiederen/zookeeper-3.7.0-candidate-2/website/
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > Should we release this candidate?
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > Damien Diederen
>> >>>> >> >
>> >>>>
>> >>>
>>
