“However it's good to find the issue earlier if there really is any, before release announced.”
I run the complete unit test suite before announcing a release candidate. Just to be clear. Totally agree we should get these problems sorted before an actual release. My policy is to cancel an RC if anyone vetoes for this reason; we want as much coverage across varying environments as we can manage.

Thank you for your help so far, and I hope the failures you see result in analysis and fixes that lead to better test stability.

> On Apr 11, 2019, at 9:32 PM, Yu Li <car...@gmail.com> wrote:
>
> Confirmed in 1.4.7 source the listed out cases passed (all in the 1st part
> of hbase-server so the result comes out quickly.)... Also confirmed the
> test ran order are the same...
>
> Will try 1.5.0 again to prevent the environment difference caused by time.
> If 1.5.0 still fails, will start to do the git bisect to locate the first
> bad commit.
>
> Was also expecting an easy pass and +1 as always to save time and efforts,
> but obvious no luck. However it's good to find the issue earlier if there
> really is any, before release announced.
>
> Best Regards,
> Yu
>
>
>> On Fri, 12 Apr 2019 at 12:16, Yu Li <car...@gmail.com> wrote:
>>
>> Fine, let's focus on verifying whether it's a real problem rather than
>> arguing about wording, after all that's not my intention...
>>
>> As mentioned, I participated in the 1.4.7 release vote[1] and IIRC I was
>> using the same env and all tests passed w/o issue, that's where my concern
>> lies and the main reason I gave a -1 vote. I'm running against 1.4.7 source
>> on the same now and let's see the result.
>>
>> [1] https://www.mail-archive.com/dev@hbase.apache.org/msg51380.html
>>
>> Best Regards,
>> Yu
>>
>>
>> On Fri, 12 Apr 2019 at 12:05, Andrew Purtell <andrew.purt...@gmail.com>
>> wrote:
>>
>>> I believe the test execution order matters. We run some tests in
>>> parallel. The ordering of tests is determined by readdir() results and this
>>> differs from host to host and checkout to checkout. So when you see a
>>> repeatable group of failures, that’s great. And when someone else doesn’t
>>> see those same tests fail, or they cannot be reproduced when running by
>>> themselves, the commonly accepted term of art for this is “flaky”.
>>>
>>>
>>>> On Apr 11, 2019, at 8:52 PM, Yu Li <car...@gmail.com> wrote:
>>>>
>>>> Sorry but I'd call it "possible environment related problem" or "some
>>>> feature may not work well in specific environment", rather than a flaky.
>>>>
>>>> Will check against 1.4.7 released source package before opening any JIRA.
>>>>
>>>> Best Regards,
>>>> Yu
>>>>
>>>>
>>>> On Fri, 12 Apr 2019 at 11:37, Andrew Purtell <andrew.purt...@gmail.com>
>>>> wrote:
>>>>
>>>>> And if they pass in my environment, then what should we call it then. I
>>>>> have no doubt you are seeing failures. Therefore can you please file JIRAs
>>>>> and attach information that can help identify a fix. Thanks.
>>>>>
>>>>>> On Apr 11, 2019, at 8:35 PM, Yu Li <car...@gmail.com> wrote:
>>>>>>
>>>>>> I ran the test suite with the -Dsurefire.rerunFailingTestsCount=2 option
>>>>>> and on two different env separately, so it sums up to 6 times stable
>>>>>> failure for each case, and from my perspective this is not flaky.
>>>>>>
>>>>>> IIRC last time when verifying 1.4.7 on the same env no such issue observed,
>>>>>> will double check.
>>>>>>
>>>>>> Best Regards,
>>>>>> Yu
>>>>>>
>>>>>>
>>>>>> On Fri, 12 Apr 2019 at 00:07, Andrew Purtell <andrew.purt...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> There are two failure cases it looks like. And this looks like flakes.
>>>>>>>
>>>>>>> The wrong FS assertions are not something I see when I run these tests
>>>>>>> myself. I am not able to investigate something I can’t reproduce. What I
>>>>>>> suggest is since you can reproduce do a git bisect to find the commit that
>>>>>>> introduced the problem. Then we can revert it. As an alternative we can
>>>>>>> open a JIRA, report the problem, temporarily @ignore the test, and
>>>>>>> continue. This latter option only should be done if we are fairly confident
>>>>>>> it is a test only problem.
>>>>>>>
>>>>>>> The connect exceptions are interesting. I see these sometimes when the
>>>>>>> suite is executed, not this particular case, but when the failed test is
>>>>>>> executed by itself it always passes. It is possible some change to classes
>>>>>>> related to the minicluster or startup or shutdown timing are the cause, but
>>>>>>> it is test time flaky behavior. I’m not happy about this but it doesn’t
>>>>>>> actually fail the release because the failure is never repeatable when the
>>>>>>> test is run standalone.
>>>>>>>
>>>>>>> In general it would be great if some attention was paid to test
>>>>>>> cleanliness on branch-1. As RM I’m not in a position to insist that
>>>>>>> everything is perfect or there will never be another 1.x release, certainly
>>>>>>> not from branch-1. So, tests which fail repeatedly block a release IMHO but
>>>>>>> flakes do not.
>>>>>>>
>>>>>>>
>>>>>>>> On Apr 10, 2019, at 11:20 PM, Yu Li <car...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> -1
>>>>>>>>
>>>>>>>> Observed many UT failures when checking the source package (tried multiple
>>>>>>>> rounds on two different environments, MacOs and Linux, got the same
>>>>>>>> result), including (but not limited to):
>>>>>>>>
>>>>>>>> TestBulkload:
>>>>>>>> shouldBulkLoadSingleFamilyHLog(org.apache.hadoop.hbase.regionserver.TestBulkLoad)
>>>>>>>> Time elapsed: 0.083 s <<< ERROR!
>>>>>>>> java.lang.IllegalArgumentException: Wrong FS:
>>>>>>>> file:/var/folders/t6/vch4nh357f98y1wlq09lbm7h0000gn/T/junit1805329913454564189/junit8020757893576011944/data/default/shouldBulkLoadSingleFamilyHLog/8f4a6b584533de2fd1bf3c398dfaac29,
>>>>>>>> expected: hdfs://localhost:55938
>>>>>>>> at org.apache.hadoop.hbase.regionserver.TestBulkLoad.testRegionWithFamiliesAndSpecifiedTableName(TestBulkLoad.java:246)
>>>>>>>> at org.apache.hadoop.hbase.regionserver.TestBulkLoad.testRegionWithFamilies(TestBulkLoad.java:256)
>>>>>>>> at org.apache.hadoop.hbase.regionserver.TestBulkLoad.shouldBulkLoadSingleFamilyHLog(TestBulkLoad.java:150)
>>>>>>>>
>>>>>>>> TestStoreFile:
>>>>>>>> testCacheOnWriteEvictOnClose(org.apache.hadoop.hbase.regionserver.TestStoreFile)
>>>>>>>> Time elapsed: 0.083 s <<< ERROR!
>>>>>>>> java.net.ConnectException: Call From localhost/127.0.0.1 to localhost:55938
>>>>>>>> failed on connection exception: java.net.ConnectException: Connection
>>>>>>>> refused; For more details see:
>>>>>>>> http://wiki.apache.org/hadoop/ConnectionRefused
>>>>>>>> at org.apache.hadoop.hbase.regionserver.TestStoreFile.writeStoreFile(TestStoreFile.java:1047)
>>>>>>>> at org.apache.hadoop.hbase.regionserver.TestStoreFile.testCacheOnWriteEvictOnClose(TestStoreFile.java:908)
>>>>>>>>
>>>>>>>> TestHFile:
>>>>>>>> testEmptyHFile(org.apache.hadoop.hbase.io.hfile.TestHFile) Time elapsed:
>>>>>>>> 0.08 s <<< ERROR!
>>>>>>>> java.net.ConnectException: Call From
>>>>>>>> z05f06378.sqa.zth.tbsite.net/11.163.183.195 to localhost:35529 failed on
>>>>>>>> connection exception: java.net.ConnectException: Connection refused; For
>>>>>>>> more details see: http://wiki.apache.org/hadoop/ConnectionRefused
>>>>>>>> at org.apache.hadoop.hbase.io.hfile.TestHFile.testEmptyHFile(TestHFile.java:90)
>>>>>>>> Caused by: java.net.ConnectException: Connection refused
>>>>>>>> at org.apache.hadoop.hbase.io.hfile.TestHFile.testEmptyHFile(TestHFile.java:90)
>>>>>>>>
>>>>>>>> TestBlocksScanned:
>>>>>>>> testBlocksScannedWithEncoding(org.apache.hadoop.hbase.regionserver.TestBlocksScanned)
>>>>>>>> Time elapsed: 0.069 s <<< ERROR!
>>>>>>>> java.lang.IllegalArgumentException: Wrong FS:
>>>>>>>> hdfs://localhost:35529/tmp/hbase-jueding.ly/hbase/data/default/TestBlocksScannedWithEncoding/a4a416cc3060d9820a621c294af0aa08,
>>>>>>>> expected: file:///
>>>>>>>> at org.apache.hadoop.hbase.regionserver.TestBlocksScanned._testBlocksScanned(TestBlocksScanned.java:90)
>>>>>>>> at org.apache.hadoop.hbase.regionserver.TestBlocksScanned.testBlocksScannedWithEncoding(TestBlocksScanned.java:86)
>>>>>>>>
>>>>>>>> And please let me know if any known issue I'm not aware of. Thanks.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Yu
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Mon, 8 Apr 2019 at 11:38, Yu Li <car...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> The performance report LGTM, thanks! (and sorry for the lag due to
>>>>>>>>> Qingming Festival Holiday here in China)
>>>>>>>>>
>>>>>>>>> Still verifying the release, just some quick feedback: observed some
>>>>>>>>> incompatible changes in compatibility report including
>>>>>>>>> HBASE-21492/HBASE-21684 and worth a reminder in ReleaseNote.
>>>>>>>>>
>>>>>>>>> Irrelative but noticeable: the 1.4.9 release note URL is invalid on
>>>>>>>>> https://hbase.apache.org/downloads.html
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Yu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Fri, 5 Apr 2019 at 08:45, Andrew Purtell <apurt...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> The difference is basically noise per the usual YCSB evaluation. Small
>>>>>>>>>> differences in workloads D and F (slightly worse) and workload E (slightly
>>>>>>>>>> better) that do not indicate serious regression.
>>>>>>>>>>
>>>>>>>>>> Linux version 4.14.55-62.37.amzn1.x86_64
>>>>>>>>>> c3.8xlarge x 5
>>>>>>>>>> OpenJDK Runtime Environment (build 1.8.0_181-shenandoah-b13)
>>>>>>>>>> -Xms20g -Xmx20g -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseNUMA
>>>>>>>>>> -XX:-UseBiasedLocking -XX:+ParallelRefProcEnabled
>>>>>>>>>> Hadoop 2.9.2
>>>>>>>>>> Init: Load 100 M rows and snapshot
>>>>>>>>>> Run: Delete table, clone and redeploy from snapshot, run 10 M operations
>>>>>>>>>> Args: -threads 100 -target 50000
>>>>>>>>>> Test table: {NAME => 'u', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY
>>>>>>>>>> => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING =>
>>>>>>>>>> 'ROW_INDEX_V1', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS =>
>>>>>>>>>> '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE =>
>>>>>>>>>> '0'}
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> YCSB Workload A
>>>>>>>>>>
>>>>>>>>>> target 50k/op/s                                 1.4.9     1.5.0
>>>>>>>>>>
>>>>>>>>>> [OVERALL], RunTime(ms)                          200592    200583
>>>>>>>>>> [OVERALL], Throughput(ops/sec)                  49852     49855
>>>>>>>>>> [READ], AverageLatency(us)                      544       559
>>>>>>>>>> [READ], MinLatency(us)                          267       292
>>>>>>>>>> [READ], MaxLatency(us)                          165631    185087
>>>>>>>>>> [READ], 95thPercentileLatency(us)               738       742
>>>>>>>>>> [READ], 99thPercentileLatency(us)               1877      1961
>>>>>>>>>> [UPDATE], AverageLatency(us)                    1370      1181
>>>>>>>>>> [UPDATE], MinLatency(us)                        702       646
>>>>>>>>>> [UPDATE], MaxLatency(us)                        180735    177279
>>>>>>>>>> [UPDATE], 95thPercentileLatency(us)             1943      1652
>>>>>>>>>> [UPDATE], 99thPercentileLatency(us)             3257      3085
>>>>>>>>>>
>>>>>>>>>> YCSB Workload B
>>>>>>>>>>
>>>>>>>>>> target 50k/op/s                                 1.4.9     1.5.0
>>>>>>>>>>
>>>>>>>>>> [OVERALL], RunTime(ms)                          200599    200581
>>>>>>>>>> [OVERALL], Throughput(ops/sec)                  49850     49855
>>>>>>>>>> [READ], AverageLatency(us)                      454       471
>>>>>>>>>> [READ], MinLatency(us)                          203       213
>>>>>>>>>> [READ], MaxLatency(us)                          183423    174207
>>>>>>>>>> [READ], 95thPercentileLatency(us)               563       599
>>>>>>>>>> [READ], 99thPercentileLatency(us)               1360      1172
>>>>>>>>>> [UPDATE], AverageLatency(us)                    1064      1029
>>>>>>>>>> [UPDATE], MinLatency(us)                        746       726
>>>>>>>>>> [UPDATE], MaxLatency(us)                        163455    101631
>>>>>>>>>> [UPDATE], 95thPercentileLatency(us)             1327      1157
>>>>>>>>>> [UPDATE], 99thPercentileLatency(us)             2241      1898
>>>>>>>>>>
>>>>>>>>>> YCSB Workload C
>>>>>>>>>>
>>>>>>>>>> target 50k/op/s                                 1.4.9     1.5.0
>>>>>>>>>>
>>>>>>>>>> [OVERALL], RunTime(ms)                          200541    200538
>>>>>>>>>> [OVERALL], Throughput(ops/sec)                  49865     49865
>>>>>>>>>> [READ], AverageLatency(us)                      332       327
>>>>>>>>>> [READ], MinLatency(us)                          175       179
>>>>>>>>>> [READ], MaxLatency(us)                          210559    170367
>>>>>>>>>> [READ], 95thPercentileLatency(us)               410       396
>>>>>>>>>> [READ], 99thPercentileLatency(us)               871       892
>>>>>>>>>>
>>>>>>>>>> YCSB Workload D
>>>>>>>>>>
>>>>>>>>>> target 50k/op/s                                 1.4.9     1.5.0
>>>>>>>>>>
>>>>>>>>>> [OVERALL], RunTime(ms)                          200579    200562
>>>>>>>>>> [OVERALL], Throughput(ops/sec)                  49855     49859
>>>>>>>>>> [READ], AverageLatency(us)                      487       547
>>>>>>>>>> [READ], MinLatency(us)                          210       214
>>>>>>>>>> [READ], MaxLatency(us)                          192255    177535
>>>>>>>>>> [READ], 95thPercentileLatency(us)               973       1529
>>>>>>>>>> [READ], 99thPercentileLatency(us)               1836      2683
>>>>>>>>>> [INSERT], AverageLatency(us)                    1239      1152
>>>>>>>>>> [INSERT], MinLatency(us)                        807       788
>>>>>>>>>> [INSERT], MaxLatency(us)                        184575    148735
>>>>>>>>>> [INSERT], 95thPercentileLatency(us)             1496      1243
>>>>>>>>>> [INSERT], 99thPercentileLatency(us)             2965      2495
>>>>>>>>>>
>>>>>>>>>> YCSB Workload E
>>>>>>>>>>
>>>>>>>>>> target 10k/op/s                                 1.4.9     1.5.0
>>>>>>>>>>
>>>>>>>>>> [OVERALL], RunTime(ms)                          100605    100568
>>>>>>>>>> [OVERALL], Throughput(ops/sec)                  9939      9943
>>>>>>>>>> [SCAN], AverageLatency(us)                      3548      2687
>>>>>>>>>> [SCAN], MinLatency(us)                          696       678
>>>>>>>>>> [SCAN], MaxLatency(us)                          1059839   238463
>>>>>>>>>> [SCAN], 95thPercentileLatency(us)               8327      6791
>>>>>>>>>> [SCAN], 99thPercentileLatency(us)               17647     14415
>>>>>>>>>> [INSERT], AverageLatency(us)                    2688      1555
>>>>>>>>>> [INSERT], MinLatency(us)                        887       815
>>>>>>>>>> [INSERT], MaxLatency(us)                        173311    154623
>>>>>>>>>> [INSERT], 95thPercentileLatency(us)             4455      2571
>>>>>>>>>> [INSERT], 99thPercentileLatency(us)             9303      5375
>>>>>>>>>>
>>>>>>>>>> YCSB Workload F
>>>>>>>>>>
>>>>>>>>>> target 50k/op/s                                 1.4.9     1.5.0
>>>>>>>>>>
>>>>>>>>>> [OVERALL], RunTime(ms)                          200562    204178
>>>>>>>>>> [OVERALL], Throughput(ops/sec)                  49859     48976
>>>>>>>>>> [READ], AverageLatency(us)                      856       1137
>>>>>>>>>> [READ], MinLatency(us)                          262       257
>>>>>>>>>> [READ], MaxLatency(us)                          205567    222335
>>>>>>>>>> [READ], 95thPercentileLatency(us)               2365      3475
>>>>>>>>>> [READ], 99thPercentileLatency(us)               3099      4143
>>>>>>>>>> [READ-MODIFY-WRITE], AverageLatency(us)         2559      2917
>>>>>>>>>> [READ-MODIFY-WRITE], MinLatency(us)             1100      1034
>>>>>>>>>> [READ-MODIFY-WRITE], MaxLatency(us)             208767    204799
>>>>>>>>>> [READ-MODIFY-WRITE], 95thPercentileLatency(us)  5747      7627
>>>>>>>>>> [READ-MODIFY-WRITE], 99thPercentileLatency(us)  7203      8919
>>>>>>>>>> [UPDATE], AverageLatency(us)                    1700      1777
>>>>>>>>>> [UPDATE], MinLatency(us)                        737       687
>>>>>>>>>> [UPDATE], MaxLatency(us)                        97983     94271
>>>>>>>>>> [UPDATE], 95thPercentileLatency(us)             3377      4147
>>>>>>>>>> [UPDATE], 99thPercentileLatency(us)             4147      4831
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 4, 2019 at 1:14 AM Yu Li <car...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the efforts boss.
>>>>>>>>>>>
>>>>>>>>>>> Since it's a new minor release, do we have performance comparison report
>>>>>>>>>>> with 1.4.9 as we did when releasing 1.4.0? If so, any reference? Many
>>>>>>>>>>> thanks!
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Yu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 4 Apr 2019 at 07:44, Andrew Purtell <apurt...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The fourth HBase 1.5.0 release candidate (RC3) is available for download at
>>>>>>>>>>>> https://dist.apache.org/repos/dist/dev/hbase/hbase-1.5.0RC3/ and Maven
>>>>>>>>>>>> artifacts are available in the temporary repository
>>>>>>>>>>>> https://repository.apache.org/content/repositories/orgapachehbase-1292/
>>>>>>>>>>>>
>>>>>>>>>>>> The git tag corresponding to the candidate is '1.5.0RC3' (b0bc7225c5).
>>>>>>>>>>>>
>>>>>>>>>>>> A detailed source and binary compatibility report for this release is
>>>>>>>>>>>> available for your review at
>>>>>>>>>>>> https://dist.apache.org/repos/dist/dev/hbase/hbase-1.5.0RC3/compat-check-report.html
>>>>>>>>>>>> .
>>>>>>>>>>>>
>>>>>>>>>>>> A list of the 115 issues resolved in this release can be found at
>>>>>>>>>>>> https://s.apache.org/K4Wk . The 1.5.0 changelog is derived from the
>>>>>>>>>>>> changelog of the last branch-1.4 release, 1.4.9.
>>>>>>>>>>>>
>>>>>>>>>>>> Please try out the candidate and vote +1/0/-1.
>>>>>>>>>>>>
>>>>>>>>>>>> The vote will be open for at least 72 hours. Unless objection I will try to
>>>>>>>>>>>> close it Friday April 12, 2019 if we have sufficient votes.
>>>>>>>>>>>>
>>>>>>>>>>>> Prior to making this announcement I made the following preflight checks:
>>>>>>>>>>>>
>>>>>>>>>>>> RAT check passes (7u80)
>>>>>>>>>>>> Unit test suite passes (7u80, 8u181)*
>>>>>>>>>>>> Opened the UI in a browser, poked around
>>>>>>>>>>>> LTT load 100M rows with 100% verification and 20% updates (8u181)
>>>>>>>>>>>> ITBLL 1B rows with slowDeterministic monkey (8u181)
>>>>>>>>>>>> ITBLL 1B rows with serverKilling monkey (8u181)
>>>>>>>>>>>>
>>>>>>>>>>>> There are known flaky tests. See HBASE-21904 and HBASE-21905. These flaky
>>>>>>>>>>>> tests do not represent serious test failures that would prevent a release.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Andrew
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Best regards,
>>>>>>>>>> Andrew
>>>>>>>>>>
>>>>>>>>>> Words like orphans lost among the crosstalk, meaning torn from truth's
>>>>>>>>>> decrepit hands
>>>>>>>>>>    - A23, Crosstalk
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>>