[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey
[ https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315797#comment-16315797 ]

Duo Zhang commented on HBASE-19731:
---

I wrote a method to loop testCheckAndDeleteWithCompareOp and can make the test fail.

{code}
@Test
public void test() throws IOException {
  // Re-run the flaky test; drop the table so the next iteration can recreate it.
  for (int i = 0; i < 100; i++) {
    testCheckAndDeleteWithCompareOp();
    TEST_UTIL.deleteTable(TableName.valueOf(name.getMethodName()));
  }
}
{code}

Let me dig more.

> TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey
> ---
>
>                 Key: HBASE-19731
>                 URL: https://issues.apache.org/jira/browse/HBASE-19731
>             Project: HBase
>          Issue Type: Sub-task
>          Components: test
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0-beta-2
>
>
> These two tests fail frequently locally; rarely does this suite pass.
> The failures are in either of these two tests. Unfortunately, running the test
> standalone does not bring on the issue; need to run the whole suite.
> In both cases, we have a Delete followed by a Put and then a checkAnd*-type
> operation which does a Get expecting to find the just-put Put, but it fails on
> occasion.
> Looks to be an MVCC issue or the Put going in at the same timestamp as the Delete.
> It's hard to debug given any added logging seems to make it all pass again.
> Seems this too is new in beta-1. Running tests against alpha-4 seems to pass.
> Doing a compare

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey
[ https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315817#comment-16315817 ]

Duo Zhang commented on HBASE-19731:
---

[~stack] I think the problem is that we assigned the same timestamp twice.

I added a static tsAssigned field in HRegion

{code}
public static volatile List tsAssigned;
{code}

And in MutationBatchOperation.prepareMiniBatchOperations, I did this

{code}
// Record every timestamp assigned to a non-meta region while tracking is enabled.
if (!region.getRegionInfo().isMetaRegion() && HRegion.tsAssigned != null) {
  HRegion.tsAssigned.add(timestamp);
}
{code}

And I also modified the UT

{code}
@Test
public void test() throws IOException {
  try {
    for (int i = 0; i < 100; i++) {
      testCheckAndDeleteWithCompareOp();
      TEST_UTIL.deleteTable(TableName.valueOf(name.getMethodName()));
      HRegion.tsAssigned = null;
    }
  } catch (AssertionError e) {
    // Dump the timestamps assigned during the failing iteration.
    HRegion.tsAssigned.forEach(System.out::println);
    throw e;
  }
}
{code}

Note that I create HRegion.tsAssigned in testCheckAndDeleteWithCompareOp, after the test table is created.

And finally I got this output

{noformat}
1515397552529
1515397552533
1515397552535
1515397552537
1515397552539
1515397552541
1515397552543
1515397552546
1515397552547
1515397552548
1515397552549
1515397552550
1515397552551
1515397552554
1515397552555
1515397552556
1515397552556
{noformat}

You can see that the test fails immediately after we issue the same ts again.

Does this mean mutations are faster in beta-1, so it is easier to run into this situation? Maybe that is good news...
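Why a repeated timestamp makes the Get come back empty can be illustrated with a toy model. The sketch below is not HBase code: ToyColumn and its methods are hypothetical names, and it assumes only the documented HBase behaviour that a delete marker hides puts whose timestamp is not newer than the marker, regardless of arrival order.

```java
/** Toy model of a single column; hypothetical, NOT HBase code. */
final class ToyColumn {
  private long newestDeleteTs = Long.MIN_VALUE; // timestamp of the newest delete marker
  private long newestPutTs = Long.MIN_VALUE;    // timestamp of the newest put
  private String newestPutValue;                // value carried by that put

  /** Records a delete marker at the given timestamp. */
  void delete(long ts) {
    newestDeleteTs = Math.max(newestDeleteTs, ts);
  }

  /** Records a put at the given timestamp. */
  void put(long ts, String value) {
    if (ts >= newestPutTs) {
      newestPutTs = ts;
      newestPutValue = value;
    }
  }

  /**
   * Returns the visible value, or null when the newest put is hidden.
   * Assumption: a marker at timestamp T hides every put with timestamp <= T,
   * even a put that arrived after the marker.
   */
  String get() {
    boolean visible = newestPutValue != null && newestPutTs > newestDeleteTs;
    return visible ? newestPutValue : null;
  }
}
```

With the duplicated timestamp from the output above, delete(1515397552556L) followed by put(1515397552556L, "v") leaves get() returning null in this model, while a put one millisecond later would be visible.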
[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey
[ https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315858#comment-16315858 ]

Duo Zhang commented on HBASE-19731:
---

[~stack] Do you think we need to make this logic the default? I mean, implement the same logic in each region, so that a region never assigns the same timestamp twice?

>         Attachments: HBASE-19731.patch
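The idea of a clock that never hands out the same timestamp twice could be sketched roughly as below. This is not the attached patch: NonRepeatingClockSketch is a hypothetical name, and the approach (bump the value by one millisecond whenever the wall clock has not advanced, using a compare-and-set loop) is an assumption about one reasonable way to do it.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a clock that never returns the same millisecond twice. */
final class NonRepeatingClockSketch {
  // Last timestamp handed out by this clock.
  private final AtomicLong prev = new AtomicLong(0L);

  /** Returns a timestamp strictly greater than any value returned before. */
  long currentTime() {
    while (true) {
      long last = prev.get();
      // Use the wall clock when it has advanced, otherwise step one past the last value.
      long now = Math.max(System.currentTimeMillis(), last + 1);
      if (prev.compareAndSet(last, now)) {
        return now;
      }
      // Another thread won the race; retry with the updated value.
    }
  }
}
```

Under sustained load above one call per millisecond, the values returned by such a clock run ahead of the wall clock, which is one reason it may be better suited to tests than to production.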
[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey
[ https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316536#comment-16316536 ]

stack commented on HBASE-19731:
---

This is great [~Apache9]. I've been running tests to see if this is new since alpha-4.

We used to have a means of ensuring we do not assign the same ts on update. Let me find that too.

Will be back. Nice work sir.
[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey
[ https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316734#comment-16316734 ]

stack commented on HBASE-19731:
---

Patch fixes the test for sure. +1.

On making this test timestamper the default, we can't, right? The proper fix is HLC. Failing that, a timestamper like the one in the patch would impose a limit of about 1k ops a second? And the checkAndSet for time is costly? We'd have to be parsimonious about checking the time (currently we do it all over the code base w/o regard for cost).

It looks like the test fails in the same place in alpha-4, so my thought that it is new to beta-1 doesn't hold. Makes sense.

I don't see it in the general flakies list: https://builds.apache.org/job/HBASE-Find-Flaky-Tests/lastSuccessfulBuild/artifact/dashboard.html probably because Apache Jenkins is slow overall... slower than my local machine or JMS's (or yours).

Thanks for jumping in here [~Apache9] and confirming the speculation on the root issue (it would have taken me way longer to figure out...)
[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey
[ https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317141#comment-16317141 ]

Hudson commented on HBASE-19731:
---

FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4365 (See [https://builds.apache.org/job/HBase-Trunk_matrix/4365/])
HBASE-19731 TestFromClientSide#testCheckAndDeleteWithCompareOp and (stack: rev 2509a150c0792e914429264453510b9028250c29)
* (add) hbase-common/src/test/java/org/apache/hadoop/hbase/util/NonRepeatedEnvironmentEdge.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestFromClientSide.java

>            Assignee: Duo Zhang
>             Fix For: 2.0.0-beta-1
[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey
[ https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317495#comment-16317495 ]

Duo Zhang commented on HBASE-19731:
---

{quote}
Failing that, a timestamper like that in patch would be a limit of about 1k ops a second?
{quote}

It is region level, and is only for writes, so 1k ops per second is fast enough? No?

Since the problem only appears at the row level, we could simply shard. For example, using 1024 AtomicLongs, the qps limit can go up to 1M.

What do you think [~stack]? Thanks.
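The sharding suggestion could look roughly like the sketch below. It is only an illustration: ShardedTimestamperSketch and nextTimestamp are hypothetical names, and picking the shard from a hash of the row key is an assumption. The arithmetic behind the estimate is presumably that one counter yields about 1,000 distinct millisecond timestamps per second, so 1024 counters yield about 1024 x 1,000, roughly 1M per second.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: one non-repeating counter per shard, shard chosen by row hash. */
final class ShardedTimestamperSketch {
  private final AtomicLong[] shards;

  ShardedTimestamperSketch(int shardCount) {
    shards = new AtomicLong[shardCount];
    for (int i = 0; i < shardCount; i++) {
      shards[i] = new AtomicLong(0L);
    }
  }

  /** Returns a timestamp that is unique among rows mapped to the same shard. */
  long nextTimestamp(byte[] row) {
    // The same row always maps to the same shard, so its timestamps never repeat.
    AtomicLong shard = shards[Math.floorMod(Arrays.hashCode(row), shards.length)];
    while (true) {
      long last = shard.get();
      long now = Math.max(System.currentTimeMillis(), last + 1);
      if (shard.compareAndSet(last, now)) {
        return now;
      }
    }
  }
}
```

Rows that hash to different shards may still receive equal timestamps, which is acceptable under the assumption stated in the comment above that the problem only appears at the row level.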
[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey
[ https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317512#comment-16317512 ]

stack commented on HBASE-19731:
---

bq. What do you think stack?

Each edit having its own unique timestamp would solve a bunch of mysterious issues we see in tests but also in prod. Your suggestion is nice and straightforward.

I hate that we already have a running counter that is close to what is needed here -- MVCC -- and then a system that is almost done -- HLC -- that would fix this issue in a manner that could be depended upon cluster-wide, never mind inside region scope only. HLC is so close...

What do you think, [~Apache9]?
[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey
[ https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317528#comment-16317528 ]

Duo Zhang commented on HBASE-19731:
---

So what is the plan for HLC? In which version will it land? If it is not too far away I think it is OK to keep the code as is.
[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey
[ https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319551#comment-16319551 ]

stack commented on HBASE-19731:
---

bq. So what is the plan for HLC? In which version will it land?

Unfortunately no one is working on it at the moment. It was stalled by the need to improve perf (crossing a synchronize to get a unique timestamp), and then the intern who was working on it moved on.
[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey
[ https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319561#comment-16319561 ]

Duo Zhang commented on HBASE-19731:
---

OK, then I think it is worth making a temporary workaround first. Will open a new issue for it.