[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106880#comment-13106880
 ] 

Hudson commented on HBASE-4195:
---

Integrated in HBase-TRUNK # (See 
[https://builds.apache.org/job/HBase-TRUNK//])
HBASE-4195 Possible inconsistency in a memstore read after a reseek, 
possible performance improvement

stack : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java


> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.92.0
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch, 
> 4195-v2.txt, 4195-v3.txt
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-16 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106737#comment-13106737
 ] 

nkeywal commented on HBASE-4195:


Ok, thank you.
The remaining points I saw in 4195-v2.txt  are the one mentioned above by Ted 
at 16/Sep/11 04:11 and by Andrew at 16/Sep/11 01:19 (modifications in the 
comments not in the code).

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch, 
> 4195-v2.txt
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-16 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106734#comment-13106734
 ] 

stack commented on HBASE-4195:
--

I can do clean up on commit N.  No need of new patch.  Let me run some tests 
first.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch, 
> 4195-v2.txt
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-16 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106727#comment-13106727
 ] 

Ted Yu commented on HBASE-4195:
---

@N:
4195-v2.txt has taken care of heap size issue.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch, 
> 4195-v2.txt
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-16 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106718#comment-13106718
 ] 

nkeywal commented on HBASE-4195:


Thank you. What is the next step? Should I write a new patch taking into 
account all the different comment?

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch, 
> 4195-v2.txt
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-16 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106630#comment-13106630
 ] 

Andrew Purtell commented on HBASE-4195:
---

bq. Yes, it's because I removed a field in MemStore (the reseek counter). It 
seems that the fix is to change 12 to 11 in MemStore.java

This is the correct fix for that.


> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch, 
> 4195-v2.txt
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-16 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106627#comment-13106627
 ] 

nkeywal commented on HBASE-4195:


Same here, I got one failure with TestGetRowVersions, but it seems unrelated 
(it worked once & failed once).
Other unit tests in errors fail with and without the patch.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch, 
> 4195-v2.txt
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-16 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106589#comment-13106589
 ] 

Ted Yu commented on HBASE-4195:
---

Test suite passed.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch, 
> 4195-v2.txt
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-16 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106516#comment-13106516
 ] 

nkeywal commented on HBASE-4195:


Note that I am currently running the full unit test and checking the failures 
are not related to this fix. 

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch, 
> 4195-v2.txt
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-16 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105925#comment-13105925
 ] 

nkeywal commented on HBASE-4195:


Yes, it's because I removed a field in MemStore (the reseek counter).

It seems that the fix is to change 12 to 11 in MemStore.java 
{noformat} 
  public final static long FIXED_OVERHEAD = ClassSize.align(
  ClassSize.OBJECT + (12 * ClassSize.REFERENCE));
{noformat}

The unit test is happy after this change, do you confirm it's the right 
solution?

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-16 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105916#comment-13105916
 ] 

nkeywal commented on HBASE-4195:


I am going to check.




> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-15 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105846#comment-13105846
 ] 

Ted Yu commented on HBASE-4195:
---

I think the following test failure (w.r.t. MemStore.FIXED_OVERHEAD) is related 
to this JIRA:
{code}
testSizes(org.apache.hadoop.hbase.io.TestHeapSize)  Time elapsed: 0.056 sec  
<<< FAILURE!
junit.framework.AssertionFailedError: expected:<104> but was:<112>
at junit.framework.Assert.fail(Assert.java:47)
at junit.framework.Assert.failNotEquals(Assert.java:283)
at junit.framework.Assert.assertEquals(Assert.java:64)
at junit.framework.Assert.assertEquals(Assert.java:130)
at junit.framework.Assert.assertEquals(Assert.java:136)
at 
org.apache.hadoop.hbase.io.TestHeapSize.testSizes(TestHeapSize.java:272)
{code}

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-15 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105838#comment-13105838
 ] 

Ted Yu commented on HBASE-4195:
---

Minor comment:
{code}
+  // kvset and snapshot will never be null.
+  // if tailSet can't find anything, SS is empty (not null).
{code}
I think SS above should be replaced with SortedSet.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-15 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105801#comment-13105801
 ] 

Andrew Purtell commented on HBASE-4195:
---

Looks good, especially the new comments, +1.

There is a minor typo in the comments that can be cleaned up on commit:

{code}
+  1) t's not possible to use the 'kvTail' and 'snapshot'
{code}

should be

{code}
+  1) It's not possible to use the 'kvTail' and 'snapshot'
{code}


> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-15 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105752#comment-13105752
 ] 

Ted Yu commented on HBASE-4195:
---

+1 on latest patch.
A few empty lines can be removed at time of commit.

Ran TestHRegion#testWritesWhileGetting 101 times which all passed based on 
latest patch.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch, 20110915_4195_4188_MemStore.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-08 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100994#comment-13100994
 ] 

stack commented on HBASE-4195:
--

Sure on single patch.  Thanks.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-08 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100988#comment-13100988
 ] 

nkeywal commented on HBASE-4195:


Ok, will do (likely this week end). Do you mind if I do a single patch for this 
jira and for HBASE-4188? They both touch MemStore seek and reseek.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-08 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100984#comment-13100984
 ] 

stack commented on HBASE-4195:
--

N

Would you mind making a patch that puts common code into a single method and 
that heavily docs what you've found; i.e. repeat in code your expectations 
above so if we want to change this code subsequently or the scope of readpoint 
changes, the editor will get the benefit of your rumination?

Thanks boss.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-08 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100139#comment-13100139
 ] 

nkeywal commented on HBASE-4195:


@stack: yes I am ok with all your points. Thanks! Some details below:

bq. Are seek and reseek the same now? Or it seems like they have a bunch of 
common code... can we factor it out to common method if so?

The initialization of kvTail & snapshotTail differs, then it's the same code. 
There are only 6 lines of code, but I aggree, it would be cleaner if shared in 
a private method (this would simplify as well the improvement on peek)


bq. We're fixing a bug where we may miss a Put if a flush comes in in meantime 
because we won't have a running Iterator on new KVSet (but maybe this is not 
such a big deal - perhaps - because its unlikely the new Put will be within the 
purview of the current read point?

That's what I expect. Note that between the 3 implementations:
- the initial one: it was impossible because we were just using the iterator 
without going back to the list.
- the one currently in the tunk: possible because we're restarting from the 
very beginning of the list.
- the proposed one; in the middle: we're not restarting from the beginning from 
from an intermediate point of the list.

So we're not in the same situation as we were 2 years ago, but I expect 
(without having done a full analysis) that the readpoint will hide this.

The best of the best, in terms of performance and similarity to the initial 
implementation, would be to get the sub-skiplist implictly pointed by the 
iterator, but there is nothing in the Java API to do it today: it would require 
to implement a specific skip list.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {nofor

[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-07 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13099619#comment-13099619
 ] 

stack commented on HBASE-4195:
--

N:

We should remove:

{code}
this.reseekNumKeys = conf.getInt(RESEEKMAX_KEY, RESEEKMAX_DEFAULT);
{code}

and reseekNumKeys and the defines else someone will come along later and wonder 
what these were about?

Are seek and reseek the same now?  Or it seems like they have a bunch of common 
code... can we factor it out to common method if so?


To add to your synthesis:

+ We're removing the numIterReseek facility because we get new tailSet every 
time we reseek.
+ It looks like there is nice perf. benefit if we go w/ this patch
+ We're fixing a bug where we may miss a Put if a flush comes in in meantime 
because we won't have a running Iterator on new KVSet (but maybe this is not 
such a big deal -- perhaps -- because its unlikely the new Put will be within 
the purview of the current read point?

If you agree on the above changes and synthesis, no need of a new patch.  I"ll 
do the clean up on commit.  Thanks boss.



> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information 

[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-07 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098951#comment-13098951
 ] 

Ted Yu commented on HBASE-4195:
---

I already voted +1 on latest patch.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-09-07 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098944#comment-13098944
 ] 

nkeywal commented on HBASE-4195:


Synthesis:
 - the patch on the test case itself helps to understand where the error come 
from, but does not change anything else
 - the fix on the memstore fix the menstore part.
 - there is still another issue, not linked to the memstore but to the flush 
part (see message above from 23/Aug/11)
 - the test cas may fail for this reason. The probability of a failure is 
increased by increasing the test count and lowering the flush interval
 - Anyway, the flush issue is already handled in HBASE-2856.

So I think we can consider the patch as ok for the scope of this bug? @Ted : 
@Stack: do you agree? Do you need more info?


> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-25 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090922#comment-13090922
 ] 

Ted Yu commented on HBASE-4195:
---

Clarification for my earlier comments:
I performed the test (TestHRegion#testWritesWhileGetting) using the patch 
20110824_4195_MemStore.patch.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-25 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090840#comment-13090840
 ] 

nkeywal commented on HBASE-4195:


With the current implementation, it can fails in TestHRegion at assert 
"assertEquals("i=" + i, expectedCount, result.size());" when there is a mess up 
on the lists. It should not occur with the new implementation.

The issue in the flush is shown by the assert just next to this one. It's (log 
added in the new patch)
{noformat}
  LOG.warn("These two KV should have the same value." +
" Previous KV:" +
previousKV + "(memStoreTS:" + previousKV.getMemstoreTS() + ")" +
", New KV: " +
kv + "(memStoreTS:" + kv.getMemstoreTS() + ")"
  );
  assertEquals(previousKV.getValue(), thisValue);
{noformat}

With these modified values on testWritesWhileGetting:
{noformat}
int testCount = 1000; // more iterations. Increase it more if necessary.
// [...]
int flushInterval = 2; // more flush
int compactInterval = 1000 * flushInterval; // no compact (should have no 
impact, but...)
{noformat}

I produced this error on the trunk + patch proposed:

{noformat}
[...]
put iteration = 1927
put iteration = 2021
These two KV should have the same value. Previous 
KV:row0/family4:qual99/1942/Put/vlen=4(memStoreTS:993), New KV: 
row0/family5:qual0/1944/Put/vlen=4(memStoreTS:0)
E
Time: 19.051
There was 1 failure:
1) testWritesWhileGetting(org.apache.hadoop.hbase.regionserver.TestHRegion)
junit.framework.AssertionFailedError: expected:<1942> but was:<1944>
at 
org.apache.hadoop.hbase.HBaseTestCase.assertEquals(HBaseTestCase.java:704)
at 
org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting(TestHRegion.java:2781)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[...]
{noformat}


May be there are more memory allocations on 0.90 and as a consequence more 
random interruptions caused by the GC? Note that I am doing the tests outside 
of Maven, but with JUnit, without specific java parameters. 

If worth checking that on .90 it is as well an error when the family changes 
(qualifier equals qual0 or qual1). May be it's something different on .90.

Unfortunatly  (or not :-)) I am on vacations for a week, so I won't be able to 
give a hand the next days, but I will be back middle of next week.





> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, t

[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-24 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090791#comment-13090791
 ] 

stack commented on HBASE-4195:
--

bq. But may be it's not an issue as these put could be skipped anyway by the 
readpoint criteria ? I just don't know. This state does not last long, as the 
Store will recreate the scanner when notified of the flush.

It looks like region scanner takes out a read point on construction and holds 
to it as scanner runs so I guess neither put would be seen (since they came in 
after scanner started so will have write points in excess of the read point).  
I'm just a tourist in this code so I could be wrong.

N, so the failing TestHRegion#testWritesWhileGetting happens on the 0.90 branch 
only.  Do you want me to try this patch on 0.90?  We don't fail on TRUNK (Ted 
show this the case above and I verified it a while back).  How could this be 
the case when the MemStore is same in both code branches?





> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-24 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090641#comment-13090641
 ] 

Ted Yu commented on HBASE-4195:
---

Consecutively run TestHRegion#testWritesWhileGetting 100 times.
All of which passed.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Fix For: 0.90.5
>
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-24 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090555#comment-13090555
 ] 

Ted Yu commented on HBASE-4195:
---

+1 on patch (some extra empty lines should be removed).

All tests pass.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Critical
> Attachments: 20110824_4195_MemStore.patch, 
> 20110824_4195_TestHRegion.patch
>
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-24 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090324#comment-13090324
 ] 

nkeywal commented on HBASE-4195:


I tested 2 cases:
1) The current implementation
2) A new one (reusing the first one logic without calling seek).


Times (for public void TestHRegion#testWritesWhileGetting()) are, on a 
virtualized environment:

Current - 26s
Proposed - 13s

It's a little bit suprising to have a so good result, but it could be explained 
by the behaviour when the counter gets negative: we can have both a lot of 
iterations on the list + a call to seek. There is also a lot of gc related 
time, so it will depends on the memory available. 

But as a conclusion, it seems that it's interesting to use directly the search 
on the sublist. I will do a patch on the trunk.

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Priority: Critical
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-24 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090073#comment-13090073
 ] 

nkeywal commented on HBASE-4195:


The behavior we have with the previous (or with the one I propose)
implementation is:

* before the flush, the MemScanner iterator points on the KV Lists of the
MemStore. So the "put" on the MemStore.kvset will be seen by the scanner.

* The flush does:
snapshot = kvset;
kvset = new SkipList();

* So after the flush, the "put" will be made on the new Memstore.kvset,
hence not visible by the existing scanner. So the  memScanner behaves
differently before & after the flush.

But may be it's not an issue as these put could be skipped anyway by the
readpoint criteria ? I just don't know. This state does not last long, as
the Store will recreate the scanner when notified of the flush.





> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Priority: Critical
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-24 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090040#comment-13090040
 ] 

stack commented on HBASE-4195:
--

bq. Problem solved then

Ehh... not exactly (smile)

bq. For #1, note that the previous implementation will completly ignore the 
modifications made during the flush: the flush creates a new KV List (kvset), 
and this list is not seen by the MemScanner, as it uses the previous iterator. 
I don't know if it's an issue or not.

This is what we want, right?  A consistent view across the flush; definitely 
not part of 'snapshot' only.

Your suggested fix sounds good (we'll keep iterators on whatever the sets were 
when scan started in spite of a flush coming in midway through the scan?)


> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Priority: Critical
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-23 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089856#comment-13089856
 ] 

nkeywal commented on HBASE-4195:


For #2, yes, it seems that HBASE-2856 is addressing this type of issues. 
Problem solved then :)

For #1, note that the previous implementation will completly ignore the 
modifications made during the flush: the flush creates a new KV List (kvset), 
and this list is not seen by the MemScanner, as it uses the previous iterator. 
I don't know if it's an issue or not.

This said, I believe it's possible to have an implementation with the same 
properties as the previous one, with an optimized reseek time, by keeping the 
pointers to the sublists in the MemStoreScanner. The "reseek" implementation 
would then become very similar to the "seek" one. I tested this approach, it 
seems to work functionally as the previous one (i.e. I fail in the case #2 
mentionned above). I have not tested the reality of performance improvement, 
but if there is an aggreement on this approach, I can do it.

The implementation of reseek would be:
{noformat}
public synchronized boolean reseek(KeyValue key) {
// kvset and snapshot will never be empty.
// if tailSet cant find anything, SS is empty (not null).
kvTail = kvTail.tailSet(key, true);
snapshotTail = snapshotTail.tailSet(key, true);

kvsetIt = kvTail.iterator();
snapshotIt = snapshotTail.iterator();

kvsetNextRow = getNext(kvsetIt);
snapshotNextRow = getNext(snapshotIt);

KeyValue lowest = getLowest();

// has data := (lowest != null)
return lowest != null;
}
{noformat}

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Priority: Critical
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, bu

[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089798#comment-13089798
 ] 

stack commented on HBASE-4195:
--

#2 above looks like HBASE-2856 (HBASE-2856 is a long issue but skim it and I 
think you'll find it what you describe here -- Amit is working on this one; we 
chatted at the hackathon yesterday on the issue).

#1 sounds bad; should we back out the reseek?


> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Priority: Critical
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-23 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089407#comment-13089407
 ] 

nkeywal commented on HBASE-4195:


Update: I have two other scenarios for failure, both linked to flush occuring 
during a get.

1) With the current/optimized implementation of reseek, the set snapshot and 
kvset can be changed by the thread doing the flush right in the middle of the 
reseek. This will lead to an unconsistant state.

A second effect of the flush process creating a new kvset is that the latter 
writes may not be seen by the MemStore scanner, as it will still be connected 
to the previous kvset.

2) More important, and actually not linked to the reseek optimization itself, 
the following scenario will lead to see a write on multiple families as non 
atomic.

t1 : put, it finishes at t2. Write a single row with multiple families.
t3 : get starts, it finishes at t9
t4 : the get continues, reads the value for the first family (scanner)
t5 : put, it finishes at t6. Change the values of the row previously written.
t7 : flush start,  it finishes at t8. 
t9 : the get  continues, reads the value for the second family (scanner) from 
the FileScanner
t10: get finishes

In this case, the get will have the values of the first write for the first 
families, and the value of the second write for the last families.

This is due to the fact that the flush process create a file, and notifies the 
scanners. The scanners then refreshes their view. The notification is "java 
synchronized" with the next in the StoreFileScanner, so it does not happen 
during a scan within a family, but it occurs between the families within a 
read, as there is one scanner per family. If you add a read lock in the next() 
(on HRegion.updatelock), the problem does not occur, as the flush will not take 
place during a read.

As it's a random bug, there can be other scenarios. In my environnement, when 
there is a failure:
- it's always with a KV with a memstoreTS equals to 0
- the column is always "qual1"

As said, this second is actually not linked to the modifications on 
"MemStoreScanner#reseek", but is linked to the flush/get parallel execution. I 
would tend to think that the issue happens in production as well.


> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Priority: Critical
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a num

[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-17 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086358#comment-13086358
 ] 

nkeywal commented on HBASE-4195:


I can do a simple patch (removing all the code around numIterReseek). However, 
it would conflict with the patch for HBASE-4188/HBASE-1938. Is it possible for 
you to commit this one first? 

Note that I have been able to make this reseek implementation fails as well by 
adding a Thread.sleep between the search on the two iterators. In other words, 
there is a race condition somewhere. It could be a conflict with the "flush" 
process. I noticed that a flush cannot happen during a put (lock on 
hregion.update) or a seek (lock on store), but there is nothing to prevent a 
reseek to take place during the snapshot. But I don't how long it will take to 
find the real issue behind all this, so a partial fix lowering the probability 
of having an issue makes sense...

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Priority: Critical
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-16 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086111#comment-13086111
 ] 

Ted Yu commented on HBASE-4195:
---

TestHRegion.testWritesWhileGetting failed in build 266:
https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.90/lastCompletedBuild/testReport/org.apache.hadoop.hbase.regionserver/TestHRegion/testWritesWhileGetting/

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Priority: Critical
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-16 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086030#comment-13086030
 ] 

stack commented on HBASE-4195:
--

bq. Can someone confirm that we should not see partial writes in this case?

If we are seeing partial writes then we are not paying attention to RWCC 
sequence numbers properly.  The refactored reseek is probably how the code was 
before hbase-3855 (I didn't check) and that code was around when RWCC was being 
worked out is my guess... so it probably 'worked' (not seeing parital writes).  
The new patch adding seeks may not be watching RWCC properly.

You've done some nice analysis above N.  Any chance of your seeing it home?  I 
will run my dumb TestHRegion test if you have any patch you'd like me to try.



> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Priority: Critical
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4195) Possible inconsistency in a memstore read after a reseek, possible performance improvement

2011-08-15 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085122#comment-13085122
 ] 

nkeywal commented on HBASE-4195:


Do you want me do write the patch?

You confirm that we should not see a partial write?

> Possible inconsistency in a memstore read after a reseek, possible 
> performance improvement
> --
>
> Key: HBASE-4195
> URL: https://issues.apache.org/jira/browse/HBASE-4195
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
> Environment: all
>Reporter: nkeywal
>Priority: Critical
>
> This follows the dicussion around HBASE-3855, and the random errors (20% 
> failure on trunk) on the unit test 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting
> I saw some points related to numIterReseek, used in the 
> MemStoreScanner#getNext (line 690):
> {noformat}679 protected KeyValue getNext(Iterator it) {
> 680 KeyValue ret = null;
> 681 long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();
> 682 //DebugPrint.println( " MS@" + hashCode() + ": threadpoint = " + 
> readPoint);
> 683
> 684 while (ret == null && it.hasNext()) {
> 685   KeyValue v = it.next();
> 686   if (v.getMemstoreTS() <= readPoint) {
> 687 // keep it.
> 688 ret = v;
> 689   }
> 690   numIterReseek--;
> 691   if (numIterReseek == 0) {
> 692 break;
> 693}
> 694 }
> 695 return ret;
> 696   }{noformat}
> This function is called by seek, reseek, and next. The numIterReseek is only 
> usefull for reseek.
> There are some issues, I am not totally sure it's the root cause of the test 
> case error, but it could explain partly the randomness of the error, and one 
> point is for sure a bug.
> 1) In getNext, numIterReseek is decreased, then compared to zero. The seek 
> function sets numIterReseek to zero before calling getNext. It means that the 
> value will be actually negative, hence the test will always fail, and the 
> loop will continue. It is the expected behaviour, but it's quite smart.
> 2) In "reseek", numIterReseek is not set between the loops on the two 
> iterators. If the numIterReseek is equals to zero after the loop on the first 
> one, the loop on the second one will never call seek, as numIterReseek will 
> be negative.
> 3) Still in "reseek", the test to call "seek" is (kvsetNextRow == null && 
> numIterReseek == 0). In other words, if kvsetNextRow is not null when 
> numIterReseek equals zero, numIterReseek will start to be negative at the 
> next iteration and seek will never be called.
> 4) You can have side effects if reseek ends with a numIterReseek > 0: the 
> following calls to the "next" function will decrease numIterReseek to zero, 
> and getNext will break instead of continuing the loop. As a result, later 
> calls to next() may return null or not depending on how is configured the 
> default value for numIterReseek.
> To check if the issue comes from point 4, you can set the numIterReseek to 
> zero before returning in reseek:
> {noformat}  numIterReseek = 0;
>   return (kvsetNextRow != null || snapshotNextRow != null);
> }{noformat}
> On my env, on trunk, it seems to work, but as it's random I am not really 
> sure. I also had to modify the test (I added a loop) to make it fails more 
> often, the original test was working quite well here.
> It has to be confirmed that this totally fix (it could be partial or 
> unrelated) 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting 
> before implementing a complete solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira