[jira] [Commented] (HBASE-10679) Both clients get wrong scan results if the first scanner expires and the second scanner is created with the same scannerId on the same region

2014-03-10 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925509#comment-13925509
 ] 

Feng Honghua commented on HBASE-10679:
--

I reran the failed case TestLogRolling several more times (5+) locally and 
couldn't reproduce the failure. I triggered another Hadoop QA run and this 
time all tests passed.

btw: the failed case seems to have been failing pretty frequently recently :-( 
(in the last month it failed for HBASE-10648 / HBASE-10615 / HBASE-9990 / 
HBASE-10582 / HBASE-10527 / HBASE-10575 / HBASE-10570 / HBASE-10532 / 
HBASE-10537 / HBASE-10534 / HBASE-6642 / HBASE-3909 / HBASE-10169, and so on)

 Both clients get wrong scan results if the first scanner expires and the 
 second scanner is created with the same scannerId on the same region
 -

 Key: HBASE-10679
 URL: https://issues.apache.org/jira/browse/HBASE-10679
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Attachments: HBASE-10679-trunk_v1.patch, HBASE-10679-trunk_v2.patch, 
 HBASE-10679-trunk_v2.patch, HBASE-10679-trunk_v2.patch


 The scenario is as follows (both Client A and Client B scan against Region R):
 # A opens a scanner SA on R with scannerId N, and successfully gets its 
 first row "a"
 # SA's lease expires and it is removed from the scanners map
 # B opens a scanner SB on R, which is assigned scannerId N too, and it 
 successfully gets its first row "m"
 # A issues its second scan request with scannerId N; the regionserver finds 
 that N is a valid scannerId and that the region matches too (the region is 
 still online on this regionserver and both scanners are against it), so it 
 executes the scan request on SB and returns "n" to A -- wrong! (A gets data 
 from another client's scanner; it expects the row following "a", e.g. "b")
 # B issues its second scan request with scannerId N; the regionserver also 
 considers it valid and executes the scan on SB, returning "o" to B -- wrong! 
 (it should return "n", but "n" has just been scanned out by A)
 The consequence is that both clients get wrong scan results:
 # A gets data from a scanner created by another client, while its own 
 scanner has expired and been removed
 # B misses data it should have received, because that data was wrongly 
 scanned out by A
 The root cause is that the scannerId generated by a regionserver can't be 
 guaranteed unique within the regionserver's whole lifecycle; *the only 
 guarantee is that the scannerIds of scanners that are currently still valid 
 (not expired) are unique*. So the same scannerId can appear in the scanners 
 map again after a former scanner with that scannerId has expired and been 
 removed, and if the second scanner is against the same region, the bug arises.
 Theoretically, the above scenario should be very rare (two consecutive scans 
 on the same region from two different clients get the same scannerId, and 
 the first expires before the second is created), but it can happen, and once 
 it does, the consequence is severe (all clients involved get wrong data) and 
 extremely hard to diagnose/debug (one way to avoid the scannerId collision 
 is sketched below).
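A hypothetical sketch of the idea discussed on this issue, not the actual 
HBASE-10679 patch: if each regionserver hands out scannerIds from a single 
monotonically increasing AtomicLong, an id can never be reused within the 
process lifetime (overflow aside, which a later comment argues is unreachable 
in practice). The class and field names below (ScannerIdGenerator, 
nextScannerId) are made up for illustration.

{code}
// Hypothetical sketch, not the HBASE-10679 patch: ids come from one counter
// that is never reset, so an id cannot repeat within the process lifetime.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

public class ScannerIdGenerator {
  // Monotonically increasing; stays positive until 2^63 - 1 ids have been used.
  private final AtomicLong nextScannerId = new AtomicLong(0);
  // Stand-in for the regionserver's "scanners" map of id -> RegionScannerHolder.
  private final ConcurrentMap<Long, Object> scanners = new ConcurrentHashMap<>();

  /** Registers a scanner holder under an id never handed out before by this process. */
  public long register(Object regionScannerHolder) {
    long scannerId = nextScannerId.incrementAndGet();
    scanners.put(scannerId, regionScannerHolder);
    return scannerId;
  }

  /** Looks up a holder; a stale id from an expired scanner can no longer collide. */
  public Object lookup(long scannerId) {
    return scanners.get(scannerId);
  }
}
{code}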



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HBASE-10710) TestLogRolling.testLogRollOnDatanodeDeath fails occasionally in Hadoop QA run

2014-03-10 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10710:


 Summary: TestLogRolling.testLogRollOnDatanodeDeath fails 
occasionally in Hadoop QA run
 Key: HBASE-10710
 URL: https://issues.apache.org/jira/browse/HBASE-10710
 Project: HBase
  Issue Type: Bug
  Components: test
Reporter: Feng Honghua


This case failed within the last month for HBASE-10648 / HBASE-10615 / 
HBASE-9990 / HBASE-10582 / HBASE-10527 / HBASE-10575 / HBASE-10570 / 
HBASE-10532 / HBASE-10537 / HBASE-10534 / HBASE-6642 / HBASE-3909 / 
HBASE-10169, and so on, but it doesn't seem to reproduce in local runs.

This issue is created for further tracking.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10679) Both clients get wrong scan results if the first scanner expires and the second scanner is created with the same scannerId on the same region

2014-03-10 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925519#comment-13925519
 ] 

Feng Honghua commented on HBASE-10679:
--

HBASE-10710 has been created to track the failure of 
TestLogRolling.testLogRollOnDatanodeDeath

 Both clients get wrong scan results if the first scanner expires and the 
 second scanner is created with the same scannerId on the same region
 -

 Key: HBASE-10679
 URL: https://issues.apache.org/jira/browse/HBASE-10679
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Attachments: HBASE-10679-trunk_v1.patch, HBASE-10679-trunk_v2.patch, 
 HBASE-10679-trunk_v2.patch, HBASE-10679-trunk_v2.patch


 The scenario is as follows (both Client A and Client B scan against Region R):
 # A opens a scanner SA on R with scannerId N, and successfully gets its 
 first row "a"
 # SA's lease expires and it is removed from the scanners map
 # B opens a scanner SB on R, which is assigned scannerId N too, and it 
 successfully gets its first row "m"
 # A issues its second scan request with scannerId N; the regionserver finds 
 that N is a valid scannerId and that the region matches too (the region is 
 still online on this regionserver and both scanners are against it), so it 
 executes the scan request on SB and returns "n" to A -- wrong! (A gets data 
 from another client's scanner; it expects the row following "a", e.g. "b")
 # B issues its second scan request with scannerId N; the regionserver also 
 considers it valid and executes the scan on SB, returning "o" to B -- wrong! 
 (it should return "n", but "n" has just been scanned out by A)
 The consequence is that both clients get wrong scan results:
 # A gets data from a scanner created by another client, while its own 
 scanner has expired and been removed
 # B misses data it should have received, because that data was wrongly 
 scanned out by A
 The root cause is that the scannerId generated by a regionserver can't be 
 guaranteed unique within the regionserver's whole lifecycle; *the only 
 guarantee is that the scannerIds of scanners that are currently still valid 
 (not expired) are unique*. So the same scannerId can appear in the scanners 
 map again after a former scanner with that scannerId has expired and been 
 removed, and if the second scanner is against the same region, the bug arises.
 Theoretically, the above scenario should be very rare (two consecutive scans 
 on the same region from two different clients get the same scannerId, and 
 the first expires before the second is created), but it can happen, and once 
 it does, the consequence is severe (all clients involved get wrong data) and 
 extremely hard to diagnose/debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10710) TestLogRolling.testLogRollOnDatanodeDeath fails occasionally in Hadoop QA run

2014-03-10 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925551#comment-13925551
 ] 

Feng Honghua commented on HBASE-10710:
--

I see...Thanks for the clarification, [~apurtell]

 TestLogRolling.testLogRollOnDatanodeDeath fails occasionally in Hadoop QA run
 -

 Key: HBASE-10710
 URL: https://issues.apache.org/jira/browse/HBASE-10710
 Project: HBase
  Issue Type: Bug
  Components: test
Reporter: Feng Honghua

 This case failed within the last month for HBASE-10648 / HBASE-10615 / 
 HBASE-9990 / HBASE-10582 / HBASE-10527 / HBASE-10575 / HBASE-10570 / 
 HBASE-10532 / HBASE-10537 / HBASE-10534 / HBASE-6642 / HBASE-3909 / 
 HBASE-10169, and so on, but it doesn't seem to reproduce in local runs.
 This issue is created for further tracking.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10710) TestLogRolling.testLogRollOnDatanodeDeath fails occasionally in Hadoop QA run

2014-03-10 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925562#comment-13925562
 ] 

Feng Honghua commented on HBASE-10710:
--

Get your point, totally agree :-) [~apurtell]

 TestLogRolling.testLogRollOnDatanodeDeath fails occasionally in Hadoop QA run
 -

 Key: HBASE-10710
 URL: https://issues.apache.org/jira/browse/HBASE-10710
 Project: HBase
  Issue Type: Bug
  Components: test
Reporter: Feng Honghua

 This case failed within the last month for HBASE-10648 / HBASE-10615 / 
 HBASE-9990 / HBASE-10582 / HBASE-10527 / HBASE-10575 / HBASE-10570 / 
 HBASE-10532 / HBASE-10537 / HBASE-10534 / HBASE-6642 / HBASE-3909 / 
 HBASE-10169, and so on, but it doesn't seem to reproduce in local runs.
 This issue is created for further tracking.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10679) Both clients get wrong scan results if the first scanner expires and the second scanner is created with the same scannerId on the same region

2014-03-09 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10679:
-

Attachment: HBASE-10679-trunk_v2.patch

 Both clients get wrong scan results if the first scanner expires and the 
 second scanner is created with the same scannerId on the same region
 -

 Key: HBASE-10679
 URL: https://issues.apache.org/jira/browse/HBASE-10679
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Attachments: HBASE-10679-trunk_v1.patch, HBASE-10679-trunk_v2.patch, 
 HBASE-10679-trunk_v2.patch, HBASE-10679-trunk_v2.patch


 The scenario is as follows (both Client A and Client B scan against Region R):
 # A opens a scanner SA on R with scannerId N, and successfully gets its 
 first row "a"
 # SA's lease expires and it is removed from the scanners map
 # B opens a scanner SB on R, which is assigned scannerId N too, and it 
 successfully gets its first row "m"
 # A issues its second scan request with scannerId N; the regionserver finds 
 that N is a valid scannerId and that the region matches too (the region is 
 still online on this regionserver and both scanners are against it), so it 
 executes the scan request on SB and returns "n" to A -- wrong! (A gets data 
 from another client's scanner; it expects the row following "a", e.g. "b")
 # B issues its second scan request with scannerId N; the regionserver also 
 considers it valid and executes the scan on SB, returning "o" to B -- wrong! 
 (it should return "n", but "n" has just been scanned out by A)
 The consequence is that both clients get wrong scan results:
 # A gets data from a scanner created by another client, while its own 
 scanner has expired and been removed
 # B misses data it should have received, because that data was wrongly 
 scanned out by A
 The root cause is that the scannerId generated by a regionserver can't be 
 guaranteed unique within the regionserver's whole lifecycle; *the only 
 guarantee is that the scannerIds of scanners that are currently still valid 
 (not expired) are unique*. So the same scannerId can appear in the scanners 
 map again after a former scanner with that scannerId has expired and been 
 removed, and if the second scanner is against the same region, the bug arises.
 Theoretically, the above scenario should be very rare (two consecutive scans 
 on the same region from two different clients get the same scannerId, and 
 the first expires before the second is created), but it can happen, and once 
 it does, the consequence is severe (all clients involved get wrong data) and 
 extremely hard to diagnose/debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10651) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in Replication

2014-03-08 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13924782#comment-13924782
 ] 

Feng Honghua commented on HBASE-10651:
--

Ping for review, thanks :-)

 Fix incorrect handling of IE that restores current thread's interrupt status 
 within while/for loops in Replication
 --

 Key: HBASE-10651
 URL: https://issues.apache.org/jira/browse/HBASE-10651
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver, Replication
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10651-trunk_v1.patch, HBASE-10651-trunk_v2.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-03-08 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13924783#comment-13924783
 ] 

Feng Honghua commented on HBASE-10595:
--

Ping for review or further comment, [~enis] / [~v.himanshu] ? Thanks. :-)

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Sub-task
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10595-trunk_v1.patch, HBASE-10595-trunk_v2.patch, 
 HBASE-10595-trunk_v3.patch, HBASE-10595-trunk_v4.patch


 When a table dir (in hdfs) is removed externally, HMaster will still return 
 the cached TableDescriptor to the client for a getTableDescriptor request.
 By contrast, HBaseAdmin.listTables() is handled correctly in the current 
 implementation: for a table whose table dir in hdfs has been removed 
 externally, getTableDescriptor can still retrieve a valid-looking (stale) 
 table descriptor while listTables says the table doesn't exist, which is 
 inconsistent.
 The reason for this bug is that HMaster (via FSTableDescriptors) doesn't 
 check whether the table dir exists for a getTableDescriptor() request, 
 whereas for a listTables() request it lists all existing table dirs (rather 
 than consulting the cache first) and answers accordingly.
 When a table is deleted via deleteTable, the cache is cleared after the table 
 dir and tableInfo file are removed, so the listTables/getTableDescriptor 
 inconsistency should be transient (it still exists in the window where the 
 table dir is removed but the cache is not yet cleared) and harder to expose.
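A hedged sketch of the kind of check described above, not the actual 
HBASE-10595 patch: verify that the table directory still exists in HDFS before 
answering from the descriptor cache, and evict the entry if it does not, so the 
answer stays consistent with listTables(). The class name 
CheckedTableDescriptorCache, the generic descriptor type D, and the simplified 
"rootDir/tableName" layout are assumptions for illustration.

{code}
// Hypothetical sketch, not the HBASE-10595 patch: consult HDFS before trusting
// the descriptor cache.
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckedTableDescriptorCache<D> {
  private final FileSystem fs;
  private final Path rootDir;
  private final ConcurrentMap<String, D> cache = new ConcurrentHashMap<>();

  public CheckedTableDescriptorCache(FileSystem fs, Path rootDir) {
    this.fs = fs;
    this.rootDir = rootDir;
  }

  /** Returns a cached descriptor only while the table dir still exists in HDFS. */
  public D get(String tableName) throws IOException {
    Path tableDir = new Path(rootDir, tableName);
    if (!fs.exists(tableDir)) {
      // Table dir was removed externally: drop the stale entry so the answer
      // agrees with what listTables() reports.
      cache.remove(tableName);
      return null;
    }
    return cache.get(tableName);
  }

  public void put(String tableName, D descriptor) {
    cache.put(tableName, descriptor);
  }
}
{code}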



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10662) RegionScanner is never closed if the region has been moved-out or re-opened when performing scan request

2014-03-08 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10662:
-

Attachment: HBASE-10662-0.96_v1.patch
HBASE-10662-0.94_v1.patch

[~lhofhansl] : Patches for 0.94 and 0.96 attached. Tests passed in my local run

 RegionScanner is never closed if the region has been moved-out or re-opened 
 when performing scan request
 

 Key: HBASE-10662
 URL: https://issues.apache.org/jira/browse/HBASE-10662
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Fix For: 0.98.1, 0.99.0

 Attachments: HBASE-10662-0.94_v1.patch, HBASE-10662-0.96_v1.patch, 
 HBASE-10662-trunk_v1.patch


 When a regionserver processes a scan request from a client, it fails the 
 request by throwing a wrapped NotServingRegionException to the client if it 
 finds that the region related to the passed-in scanner-id has been 
 re-opened, and it also removes the RegionScannerHolder from the scanners 
 map. In this case the old, invalid RegionScanner related to the passed-in 
 scanner-id should be closed and the related lease cancelled at the same time.
 Currently a region's scanners aren't closed when the region is closed; a 
 region scanner is closed only when the client explicitly requests it or when 
 the related lease expires, so the closing of region scanners is quite 
 passive and laggy.
 When a regionserver processes a scan request and can't find an online region 
 corresponding to the passed-in scanner-id (because the region has been moved 
 out), or finds that the region has been re-opened, it throws 
 NotServingRegionException and removes the corresponding RegionScannerHolder 
 from the scanners map without closing the related region scanner (or 
 cancelling the related lease); and when the lease later expires, the region 
 scanner still isn't closed because it is no longer present in the scanners map.
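A hedged sketch of the cleanup described above, not the actual HBASE-10662 
patch: when the region behind a scanner-id has been moved out or re-opened, 
close the stale RegionScanner and cancel its lease at the moment the 
RegionScannerHolder is removed, instead of leaving both behind. The 
RegionScanner, Leases and RegionScannerHolder shapes below are simplified 
stand-ins for the real HBase types.

{code}
// Hypothetical sketch, not the HBASE-10662 patch.
import java.io.IOException;
import java.util.Map;

public class StaleScannerCleanup {
  interface RegionScanner { void close() throws IOException; }
  interface Leases { void cancelLease(String leaseName); }

  static class RegionScannerHolder {
    final RegionScanner s;
    RegionScannerHolder(RegionScanner s) { this.s = s; }
  }

  /** Removes, closes and un-leases a scanner whose region was moved out or re-opened. */
  static void closeStaleScanner(Map<String, RegionScannerHolder> scanners,
                                Leases leases, String scannerName) throws IOException {
    RegionScannerHolder rsh = scanners.remove(scannerName);
    if (rsh != null) {
      rsh.s.close();                    // release the stale RegionScanner right away...
      leases.cancelLease(scannerName);  // ...and don't leave the lease to expire later
    }
    // The caller then throws NotServingRegionException back to the client as before.
  }
}
{code}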



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10663) Some code cleanup of class Leases and ScannerListener.leaseExpired

2014-03-07 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10663:
-

Summary: Some code cleanup of class Leases and ScannerListener.leaseExpired 
 (was: Refactor/cleanup of class Leases and ScannerListener.leaseExpired)

Would anybody please help review this? Just some safe code cleanup. Thanks.

 Some code cleanup of class Leases and ScannerListener.leaseExpired
 --

 Key: HBASE-10663
 URL: https://issues.apache.org/jira/browse/HBASE-10663
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor
 Attachments: HBASE-10663-trunk_v1.patch


 Some cleanup of Leases and ScannerListener.leaseExpired:
 # Reject renewLease if stopRequested, as addLease already does (stopRequested 
 means Leases has been asked to stop and is waiting for all remaining leases 
 to expire); see the sketch after this list
 # Raise the log level from info to warn for the case where no related region 
 scanner is found when a lease expires (should it be an error?)
 # Replace System.currentTimeMillis() with 
 EnvironmentEdgeManager.currentTimeMillis()
 # Correct some wrong comments and remove some irrelevant ones (was a Queue 
 rather than a Map used for leases previously?)
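A hedged sketch of items 1 and 3 above, not the actual HBASE-10663 patch: 
renewLease() rejects calls once stopRequested is set, mirroring addLease(), 
and timestamps come from EnvironmentEdgeManager rather than 
System.currentTimeMillis(). The surrounding class, exception and field names 
are simplified placeholders.

{code}
// Hypothetical sketch, not the HBASE-10663 patch.
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.util.EnvironmentEdgeManager;

public class LeasesSketch {
  static class LeaseException extends Exception {
    LeaseException(String msg) { super(msg); }
  }

  private volatile boolean stopRequested = false;
  private final long leasePeriodMs = 60_000;
  private final Map<String, Long> leaseExpirations = new HashMap<>();

  public synchronized void renewLease(String leaseName) throws LeaseException {
    if (stopRequested) {
      // Mirror addLease(): once asked to stop, no lease may be added or extended.
      throw new LeaseException("Leases closing, cannot renew " + leaseName);
    }
    // Injectable clock instead of System.currentTimeMillis(), so tests can
    // control time through EnvironmentEdgeManager.
    leaseExpirations.put(leaseName,
        EnvironmentEdgeManager.currentTimeMillis() + leasePeriodMs);
  }
}
{code}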



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10679) Both clients get wrong scan results if the first scanner expires and the second scanner is created with the same scannerId on the same region

2014-03-07 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923687#comment-13923687
 ] 

Feng Honghua commented on HBASE-10679:
--

Would anybody help confirm the bug and review the patch? Thanks :-)

 Both clients get wrong scan results if the first scanner expires and the 
 second scanner is created with the same scannerId on the same region
 -

 Key: HBASE-10679
 URL: https://issues.apache.org/jira/browse/HBASE-10679
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Attachments: HBASE-10679-trunk_v1.patch, HBASE-10679-trunk_v2.patch


 The scenario is as follows (both Client A and Client B scan against Region R):
 # A opens a scanner SA on R with scannerId N, and successfully gets its 
 first row "a"
 # SA's lease expires and it is removed from the scanners map
 # B opens a scanner SB on R, which is assigned scannerId N too, and it 
 successfully gets its first row "m"
 # A issues its second scan request with scannerId N; the regionserver finds 
 that N is a valid scannerId and that the region matches too (the region is 
 still online on this regionserver and both scanners are against it), so it 
 executes the scan request on SB and returns "n" to A -- wrong! (A gets data 
 from another client's scanner; it expects the row following "a", e.g. "b")
 # B issues its second scan request with scannerId N; the regionserver also 
 considers it valid and executes the scan on SB, returning "o" to B -- wrong! 
 (it should return "n", but "n" has just been scanned out by A)
 The consequence is that both clients get wrong scan results:
 # A gets data from a scanner created by another client, while its own 
 scanner has expired and been removed
 # B misses data it should have received, because that data was wrongly 
 scanned out by A
 The root cause is that the scannerId generated by a regionserver can't be 
 guaranteed unique within the regionserver's whole lifecycle; *the only 
 guarantee is that the scannerIds of scanners that are currently still valid 
 (not expired) are unique*. So the same scannerId can appear in the scanners 
 map again after a former scanner with that scannerId has expired and been 
 removed, and if the second scanner is against the same region, the bug arises.
 Theoretically, the above scenario should be very rare (two consecutive scans 
 on the same region from two different clients get the same scannerId, and 
 the first expires before the second is created), but it can happen, and once 
 it does, the consequence is severe (all clients involved get wrong data) and 
 extremely hard to diagnose/debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10679) Both clients get wrong scan results if the first scanner expires and the second scanner is created with the same scannerId on the same region

2014-03-07 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923790#comment-13923790
 ] 

Feng Honghua commented on HBASE-10679:
--

I reran the failed cases locally and they all passed.

 Both clients get wrong scan results if the first scanner expires and the 
 second scanner is created with the same scannerId on the same region
 -

 Key: HBASE-10679
 URL: https://issues.apache.org/jira/browse/HBASE-10679
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Attachments: HBASE-10679-trunk_v1.patch, HBASE-10679-trunk_v2.patch


 The scenario is as follows (both Client A and Client B scan against Region R):
 # A opens a scanner SA on R with scannerId N, and successfully gets its 
 first row "a"
 # SA's lease expires and it is removed from the scanners map
 # B opens a scanner SB on R, which is assigned scannerId N too, and it 
 successfully gets its first row "m"
 # A issues its second scan request with scannerId N; the regionserver finds 
 that N is a valid scannerId and that the region matches too (the region is 
 still online on this regionserver and both scanners are against it), so it 
 executes the scan request on SB and returns "n" to A -- wrong! (A gets data 
 from another client's scanner; it expects the row following "a", e.g. "b")
 # B issues its second scan request with scannerId N; the regionserver also 
 considers it valid and executes the scan on SB, returning "o" to B -- wrong! 
 (it should return "n", but "n" has just been scanned out by A)
 The consequence is that both clients get wrong scan results:
 # A gets data from a scanner created by another client, while its own 
 scanner has expired and been removed
 # B misses data it should have received, because that data was wrongly 
 scanned out by A
 The root cause is that the scannerId generated by a regionserver can't be 
 guaranteed unique within the regionserver's whole lifecycle; *the only 
 guarantee is that the scannerIds of scanners that are currently still valid 
 (not expired) are unique*. So the same scannerId can appear in the scanners 
 map again after a former scanner with that scannerId has expired and been 
 removed, and if the second scanner is against the same region, the bug arises.
 Theoretically, the above scenario should be very rare (two consecutive scans 
 on the same region from two different clients get the same scannerId, and 
 the first expires before the second is created), but it can happen, and once 
 it does, the consequence is severe (all clients involved get wrong data) and 
 extremely hard to diagnose/debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10679) Both clients get wrong scan results if the first scanner expires and the second scanner is created with the same scannerId on the same region

2014-03-07 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13924756#comment-13924756
 ] 

Feng Honghua commented on HBASE-10679:
--

bq.When the AtomicLong hits the max, it goes negative which should be fine 
since we are toString the value. It then goes down all the ways around the zero 
so all should be good. Nice one Honghua. I know its only a few lines but 
probably took a lot longer than that to figure it out.
# Yes, though the final fix is pretty straightforward, the scenario and the 
condition that trigger the bug are quite tricky and not that easy to comprehend 
and figure out.
# Since the scannerId is returned to the client as a long rather than a string 
for subsequent scan requests, and -1 is treated as an invalid scannerId, a 
negative scannerId isn't acceptable/desirable. But a back-of-the-envelope 
calculation goes like this (see the sketch below): the count of positive long 
values is 2^63 - 1 = 9223372036854775807; the scannerId is per regionserver 
instance and won't span different regionserver process lifecycles; 1000 years 
= 1000 * 365 * 24 * 60 * 60 = 31,536,000,000 seconds; scannerIds are consumed 
fastest if all requests are read/scan, so the read/scan QPS would have to be 
9223372036854775807 / 31,536,000,000 ≈ 292,471,208 for the scannerId to reach 
the max and then go negative. Considering it's almost impossible for a 
regionserver process to live as long as 1000 years without downtime, and 
292,471,208 is also far too big a read/scan QPS for a regionserver to serve, 
we can safely overlook the possibility of the scannerId going negative.
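The arithmetic above can be double-checked with a few lines of Java; this 
assumes the scannerIds come from a single AtomicLong starting at 0, as in the 
discussion above.

{code}
// Re-checking the numbers: how many scans per second, sustained for 1000 years,
// it would take for a single AtomicLong-based scannerId (starting at 0) to
// exhaust the positive long range and go negative.
public class ScannerIdOverflowEstimate {
  public static void main(String[] args) {
    long positiveIds = Long.MAX_VALUE;                     // 9223372036854775807
    long secondsIn1000Years = 1000L * 365 * 24 * 60 * 60;  // 31,536,000,000
    long qpsToOverflow = positiveIds / secondsIn1000Years; // ~292,471,208
    System.out.println(qpsToOverflow + " scans/second for 1000 years");
  }
}
{code}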

bq.Same test failed twice in a row. Want to take a looksee...The tests make 
output. You can navigate some if you click on the above links. You might see 
something in the output that you don't see locally
OK, I'll check. Thanks for the reminder.

 Both clients get wrong scan results if the first scanner expires and the 
 second scanner is created with the same scannerId on the same region
 -

 Key: HBASE-10679
 URL: https://issues.apache.org/jira/browse/HBASE-10679
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Attachments: HBASE-10679-trunk_v1.patch, HBASE-10679-trunk_v2.patch, 
 HBASE-10679-trunk_v2.patch


 The scenario is as follows (both Client A and Client B scan against Region R):
 # A opens a scanner SA on R with scannerId N, and successfully gets its 
 first row "a"
 # SA's lease expires and it is removed from the scanners map
 # B opens a scanner SB on R, which is assigned scannerId N too, and it 
 successfully gets its first row "m"
 # A issues its second scan request with scannerId N; the regionserver finds 
 that N is a valid scannerId and that the region matches too (the region is 
 still online on this regionserver and both scanners are against it), so it 
 executes the scan request on SB and returns "n" to A -- wrong! (A gets data 
 from another client's scanner; it expects the row following "a", e.g. "b")
 # B issues its second scan request with scannerId N; the regionserver also 
 considers it valid and executes the scan on SB, returning "o" to B -- wrong! 
 (it should return "n", but "n" has just been scanned out by A)
 The consequence is that both clients get wrong scan results:
 # A gets data from a scanner created by another client, while its own 
 scanner has expired and been removed
 # B misses data it should have received, because that data was wrongly 
 scanned out by A
 The root cause is that the scannerId generated by a regionserver can't be 
 guaranteed unique within the regionserver's whole lifecycle; *the only 
 guarantee is that the scannerIds of scanners that are currently still valid 
 (not expired) are unique*. So the same scannerId can appear in the scanners 
 map again after a former scanner with that scannerId has expired and been 
 removed, and if the second scanner is against the same region, the bug arises.
 Theoretically, the above scenario should be very rare (two consecutive scans 
 on the same region from two different clients get the same scannerId, and 
 the first expires before the second is created), but it can happen, and once 
 it does, the consequence is severe (all clients involved get wrong data) and 
 extremely hard to diagnose/debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10679) Both clients operating on a same region will get wrong scan results if the first scanner expires and the second scanner is created with the same scannerId

2014-03-06 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10679:
-

Status: Patch Available  (was: Open)

 Both clients operating on a same region will get wrong scan results if the 
 first scanner expires and the second scanner is created with the same 
 scannerId
 --

 Key: HBASE-10679
 URL: https://issues.apache.org/jira/browse/HBASE-10679
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Attachments: HBASE-10679-trunk_v1.patch, HBASE-10679-trunk_v2.patch


 The scenario is as follows (both Client A and Client B scan against Region R):
 # A opens a scanner SA on R with scannerId N, and successfully gets its 
 first row "a"
 # SA's lease expires and it is removed from the scanners map
 # B opens a scanner SB on R, which is assigned scannerId N too, and it 
 successfully gets its first row "m"
 # A issues its second scan request with scannerId N; the regionserver finds 
 that N is a valid scannerId and that the region matches too (the region is 
 still online on this regionserver and both scanners are against it), so it 
 executes the scan request on SB and returns "n" to A -- wrong! (A gets data 
 from another client's scanner; it expects the row following "a", e.g. "b")
 # B issues its second scan request with scannerId N; the regionserver also 
 considers it valid and executes the scan on SB, returning "o" to B -- wrong! 
 (it should return "n", but "n" has just been scanned out by A)
 The consequence is that both clients get wrong scan results:
 # A gets data from a scanner created by another client, while its own 
 scanner has expired and been removed
 # B misses data it should have received, because that data was wrongly 
 scanned out by A
 The root cause is that the scannerId generated by a regionserver can't be 
 guaranteed unique within the regionserver's whole lifecycle; *the only 
 guarantee is that the scannerIds of scanners that are currently still valid 
 (not expired) are unique*. So the same scannerId can appear in the scanners 
 map again after a former scanner with that scannerId has expired and been 
 removed, and if the second scanner is against the same region, the bug arises.
 Theoretically, the above scenario should be very rare (two consecutive scans 
 on the same region from two different clients get the same scannerId, and 
 the first expires before the second is created), but it can happen, and once 
 it does, the consequence is severe (all clients involved get wrong data) and 
 extremely hard to diagnose/debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10679) Both clients get wrong scan results if the first scanner expires and the second scanner is created with the same scannerId on the same region

2014-03-06 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10679:
-

Summary: Both clients get wrong scan results if the first scanner expires 
and the second scanner is created with the same scannerId on the same region  
(was: Both clients operating on a same region will get wrong scan results if 
the first scanner expires and the second scanner is created with the same 
scannerId)

 Both clients get wrong scan results if the first scanner expires and the 
 second scanner is created with the same scannerId on the same region
 -

 Key: HBASE-10679
 URL: https://issues.apache.org/jira/browse/HBASE-10679
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Attachments: HBASE-10679-trunk_v1.patch, HBASE-10679-trunk_v2.patch


 The scenario is as follows (both Client A and Client B scan against Region R):
 # A opens a scanner SA on R with scannerId N, and successfully gets its 
 first row "a"
 # SA's lease expires and it is removed from the scanners map
 # B opens a scanner SB on R, which is assigned scannerId N too, and it 
 successfully gets its first row "m"
 # A issues its second scan request with scannerId N; the regionserver finds 
 that N is a valid scannerId and that the region matches too (the region is 
 still online on this regionserver and both scanners are against it), so it 
 executes the scan request on SB and returns "n" to A -- wrong! (A gets data 
 from another client's scanner; it expects the row following "a", e.g. "b")
 # B issues its second scan request with scannerId N; the regionserver also 
 considers it valid and executes the scan on SB, returning "o" to B -- wrong! 
 (it should return "n", but "n" has just been scanned out by A)
 The consequence is that both clients get wrong scan results:
 # A gets data from a scanner created by another client, while its own 
 scanner has expired and been removed
 # B misses data it should have received, because that data was wrongly 
 scanned out by A
 The root cause is that the scannerId generated by a regionserver can't be 
 guaranteed unique within the regionserver's whole lifecycle; *the only 
 guarantee is that the scannerIds of scanners that are currently still valid 
 (not expired) are unique*. So the same scannerId can appear in the scanners 
 map again after a former scanner with that scannerId has expired and been 
 removed, and if the second scanner is against the same region, the bug arises.
 Theoretically, the above scenario should be very rare (two consecutive scans 
 on the same region from two different clients get the same scannerId, and 
 the first expires before the second is created), but it can happen, and once 
 it does, the consequence is severe (all clients involved get wrong data) and 
 extremely hard to diagnose/debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HBASE-10679) Both clients operating on a same region will get wrong scan results if the first scanner expires and the second scanner is created with the same scannerId

2014-03-05 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10679:


 Summary: Both clients operating on a same region will get wrong 
scan results if the first scanner expires and the second scanner is created 
with the same scannerId
 Key: HBASE-10679
 URL: https://issues.apache.org/jira/browse/HBASE-10679
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical


The scenario is as follows (both Client A and Client B scan against Region R):
# A opens a scanner SA on R with scannerId N, and successfully gets its first 
row "a"
# SA's lease expires and it is removed from the scanners map
# B opens a scanner SB on R, which is assigned scannerId N too, and it 
successfully gets its first row "m"
# A issues its second scan request with scannerId N; the regionserver finds 
that N is a valid scannerId and that the region matches too (the region is 
still online on this regionserver and both scanners are against it), so it 
executes the scan request on SB and returns "n" to A -- wrong! (A gets data 
from another client's scanner; it expects the row following "a", e.g. "b")
# B issues its second scan request with scannerId N; the regionserver also 
considers it valid and executes the scan on SB, returning "o" to B -- wrong! 
(it should return "n", but "n" has just been scanned out by A)

The consequence is that both clients get wrong scan results:
# A gets data from a scanner created by another client, while its own scanner 
has expired and been removed
# B misses data it should have received, because that data was wrongly 
scanned out by A

The root cause is that the scannerId generated by a regionserver can't be 
guaranteed unique within the regionserver's whole lifecycle; *the only 
guarantee is that the scannerIds of scanners that are currently still valid 
(not expired) are unique*. So the same scannerId can appear in the scanners 
map again after a former scanner with that scannerId has expired and been 
removed, and if the second scanner is against the same region, the bug arises.

Theoretically, the above scenario should be very rare (two consecutive scans 
on the same region from two different clients get the same scannerId, and the 
first expires before the second is created), but it can happen, and once it 
does, the consequence is severe (all clients involved get wrong data) and 
extremely hard to diagnose/debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10679) Both clients operating on a same region will get wrong scan results if the first scanner expires and the second scanner is created with the same scannerId

2014-03-05 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10679:
-

Attachment: HBASE-10679-trunk_v1.patch

Patch attached

 Both clients operating on a same region will get wrong scan results if the 
 first scanner expires and the second scanner is created with the same 
 scannerId
 --

 Key: HBASE-10679
 URL: https://issues.apache.org/jira/browse/HBASE-10679
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Attachments: HBASE-10679-trunk_v1.patch


 The scenario is as follows (both Client A and Client B scan against Region R):
 # A opens a scanner SA on R with scannerId N, and successfully gets its 
 first row "a"
 # SA's lease expires and it is removed from the scanners map
 # B opens a scanner SB on R, which is assigned scannerId N too, and it 
 successfully gets its first row "m"
 # A issues its second scan request with scannerId N; the regionserver finds 
 that N is a valid scannerId and that the region matches too (the region is 
 still online on this regionserver and both scanners are against it), so it 
 executes the scan request on SB and returns "n" to A -- wrong! (A gets data 
 from another client's scanner; it expects the row following "a", e.g. "b")
 # B issues its second scan request with scannerId N; the regionserver also 
 considers it valid and executes the scan on SB, returning "o" to B -- wrong! 
 (it should return "n", but "n" has just been scanned out by A)
 The consequence is that both clients get wrong scan results:
 # A gets data from a scanner created by another client, while its own 
 scanner has expired and been removed
 # B misses data it should have received, because that data was wrongly 
 scanned out by A
 The root cause is that the scannerId generated by a regionserver can't be 
 guaranteed unique within the regionserver's whole lifecycle; *the only 
 guarantee is that the scannerIds of scanners that are currently still valid 
 (not expired) are unique*. So the same scannerId can appear in the scanners 
 map again after a former scanner with that scannerId has expired and been 
 removed, and if the second scanner is against the same region, the bug arises.
 Theoretically, the above scenario should be very rare (two consecutive scans 
 on the same region from two different clients get the same scannerId, and 
 the first expires before the second is created), but it can happen, and once 
 it does, the consequence is severe (all clients involved get wrong data) and 
 extremely hard to diagnose/debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10662) RegionScanner never be closed if the region has been moved-out or re-opened when performing scan request

2014-03-05 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10662:
-

Summary: RegionScanner never be closed if the region has been moved-out or 
re-opened when performing scan request  (was: RegionScanner never be closed 
when the region has been moved-out or re-opened when performing scan request)

 RegionScanner never be closed if the region has been moved-out or re-opened 
 when performing scan request
 

 Key: HBASE-10662
 URL: https://issues.apache.org/jira/browse/HBASE-10662
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10662-trunk_v1.patch


 When a regionserver processes a scan request from a client, it fails the 
 request by throwing a wrapped NotServingRegionException to the client if it 
 finds that the region related to the passed-in scanner-id has been 
 re-opened, and it also removes the RegionScannerHolder from the scanners 
 map. In this case the old, invalid RegionScanner related to the passed-in 
 scanner-id should be closed and the related lease cancelled at the same time.
 Currently a region's scanners aren't closed when the region is closed; a 
 region scanner is closed only when the client explicitly requests it or when 
 the related lease expires, so the closing of region scanners is quite 
 passive and laggy.
 When a regionserver processes a scan request and can't find an online region 
 corresponding to the passed-in scanner-id (because the region has been moved 
 out), or finds that the region has been re-opened, it throws 
 NotServingRegionException and removes the corresponding RegionScannerHolder 
 from the scanners map without closing the related region scanner (or 
 cancelling the related lease); and when the lease later expires, the region 
 scanner still isn't closed because it is no longer present in the scanners map.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10662) RegionScanner never be closed when the region has been moved-out or re-opened when performing scan request

2014-03-05 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10662:
-

Summary: RegionScanner never be closed when the region has been moved-out 
or re-opened when performing scan request  (was: RegionScanner should be closed 
and according lease should be cancelled in regionserver immediately if we find 
the related region has been moved-out or re-opened during performing scan 
request)

 RegionScanner never be closed when the region has been moved-out or re-opened 
 when performing scan request
 --

 Key: HBASE-10662
 URL: https://issues.apache.org/jira/browse/HBASE-10662
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10662-trunk_v1.patch


 When a regionserver processes a scan request from a client, it fails the 
 request by throwing a wrapped NotServingRegionException to the client if it 
 finds that the region related to the passed-in scanner-id has been 
 re-opened, and it also removes the RegionScannerHolder from the scanners 
 map. In this case the old, invalid RegionScanner related to the passed-in 
 scanner-id should be closed and the related lease cancelled at the same time.
 Currently a region's scanners aren't closed when the region is closed; a 
 region scanner is closed only when the client explicitly requests it or when 
 the related lease expires, so the closing of region scanners is quite 
 passive and laggy.
 When a regionserver processes a scan request and can't find an online region 
 corresponding to the passed-in scanner-id (because the region has been moved 
 out), or finds that the region has been re-opened, it throws 
 NotServingRegionException and removes the corresponding RegionScannerHolder 
 from the scanners map without closing the related region scanner (or 
 cancelling the related lease); and when the lease later expires, the region 
 scanner still isn't closed because it is no longer present in the scanners map.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10679) Both clients operating on a same region will get wrong scan results if the first scanner expires and the second scanner is created with the same scannerId

2014-03-05 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10679:
-

Attachment: HBASE-10679-trunk_v2.patch

 Both clients operating on a same region will get wrong scan results if the 
 first scanner expires and the second scanner is created with the same 
 scannerId
 --

 Key: HBASE-10679
 URL: https://issues.apache.org/jira/browse/HBASE-10679
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Attachments: HBASE-10679-trunk_v1.patch, HBASE-10679-trunk_v2.patch


 The scenario is as follows (both Client A and Client B scan against Region R):
 # A opens a scanner SA on R with scannerId N, and successfully gets its 
 first row "a"
 # SA's lease expires and it is removed from the scanners map
 # B opens a scanner SB on R, which is assigned scannerId N too, and it 
 successfully gets its first row "m"
 # A issues its second scan request with scannerId N; the regionserver finds 
 that N is a valid scannerId and that the region matches too (the region is 
 still online on this regionserver and both scanners are against it), so it 
 executes the scan request on SB and returns "n" to A -- wrong! (A gets data 
 from another client's scanner; it expects the row following "a", e.g. "b")
 # B issues its second scan request with scannerId N; the regionserver also 
 considers it valid and executes the scan on SB, returning "o" to B -- wrong! 
 (it should return "n", but "n" has just been scanned out by A)
 The consequence is that both clients get wrong scan results:
 # A gets data from a scanner created by another client, while its own 
 scanner has expired and been removed
 # B misses data it should have received, because that data was wrongly 
 scanned out by A
 The root cause is that the scannerId generated by a regionserver can't be 
 guaranteed unique within the regionserver's whole lifecycle; *the only 
 guarantee is that the scannerIds of scanners that are currently still valid 
 (not expired) are unique*. So the same scannerId can appear in the scanners 
 map again after a former scanner with that scannerId has expired and been 
 removed, and if the second scanner is against the same region, the bug arises.
 Theoretically, the above scenario should be very rare (two consecutive scans 
 on the same region from two different clients get the same scannerId, and 
 the first expires before the second is created), but it can happen, and once 
 it does, the consequence is severe (all clients involved get wrong data) and 
 extremely hard to diagnose/debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HBASE-10693) Correct declarations of Atomic* fields from 'volatile' to 'final'

2014-03-05 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10693:


 Summary: Correct declarations of Atomic* fields from 'volatile' to 
'final'
 Key: HBASE-10693
 URL: https://issues.apache.org/jira/browse/HBASE-10693
 Project: HBase
  Issue Type: Improvement
  Components: io, master
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor


By checking the usage of these Atomic* fields, the fields themselves don't 
change once they are assigned to reference an Atomic* object, so 'final' 
rather than 'volatile' is the more appropriate modifier.
On the other hand, the 'value' encapsulated in the Atomic* classes is already 
declared 'volatile', which guarantees correct behavior in multi-threaded 
scenarios.
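A small illustration of the point above (the class and field names are made 
up): the reference to the Atomic* object should be final because it never 
changes, while the thread-safety of updates comes from the volatile value 
inside the Atomic* class itself.

{code}
// Illustration with made-up names.
import java.util.concurrent.atomic.AtomicLong;

public class CounterHolder {
  // Preferred: the field never points to a different AtomicLong after construction.
  private final AtomicLong requestCount = new AtomicLong(0);

  // Discouraged: 'volatile' would only cover re-assigning the reference, which
  // never happens, and adds nothing to the thread-safety of the increments.
  // private volatile AtomicLong requestCountVolatile = new AtomicLong(0);

  public long increment() {
    return requestCount.incrementAndGet();
  }
}
{code}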



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10693) Correct declarations of Atomic* fields from 'volatile' to 'final'

2014-03-05 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10693:
-

Attachment: HBASE-10693-trunk_v1.patch

Patch attached

 Correct declarations of Atomic* fields from 'volatile' to 'final'
 -

 Key: HBASE-10693
 URL: https://issues.apache.org/jira/browse/HBASE-10693
 Project: HBase
  Issue Type: Improvement
  Components: io, master
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor
 Attachments: HBASE-10693-trunk_v1.patch


 By checking the usage of these Atomic* fields, the fields themselves don't 
 change once they are assigned to reference an Atomic* object, so 'final' 
 rather than 'volatile' is the more appropriate modifier.
 On the other hand, the 'value' encapsulated in the Atomic* classes is already 
 declared 'volatile', which guarantees correct behavior in multi-threaded 
 scenarios.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10693) Correct declarations of Atomic* fields from 'volatile' to 'final'

2014-03-05 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10693:
-

Description: 
By checking the usage of these Atomic* fields, they themselves don't change 
once being assigned to referencing an Atomic* object, so 'final' rather than 
'volatile' is more proper.
On the other hand, the 'value' encapsulated in Atomic* is already declared 
'volatile', which guarantees to perform correctly in multi-threads scenarios.

  was:
By checking the usage of these Atomic* fields, they themselves don't change 
once being assigned to referencing an Atomic* object, so 'final' rather than 
'volatile' is more proper.
On the other hand, the 'value' encapsulated in Atomic* is already declared 
'volatile', while guarantees to perform correctly in multi-threads scenarios.


 Correct declarations of Atomic* fields from 'volatile' to 'final'
 -

 Key: HBASE-10693
 URL: https://issues.apache.org/jira/browse/HBASE-10693
 Project: HBase
  Issue Type: Improvement
  Components: io, master
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor
 Attachments: HBASE-10693-trunk_v1.patch


 By checking the usage of these Atomic* fields, the fields themselves don't 
 change once they are assigned to reference an Atomic* object, so 'final' 
 rather than 'volatile' is the more appropriate modifier.
 On the other hand, the 'value' encapsulated in the Atomic* classes is already 
 declared 'volatile', which guarantees correct behavior in multi-threaded 
 scenarios.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10693) Correct declarations of Atomic* fields from 'volatile' to 'final'

2014-03-05 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922055#comment-13922055
 ] 

Feng Honghua commented on HBASE-10693:
--

bq.Any other such cases in the code base? 
There should be no other such cases... I used *grep \-rn 'volatile.\*Atomic' 
hbase-\*/src/main/java* and only those two places were found.

 Correct declarations of Atomic* fields from 'volatile' to 'final'
 -

 Key: HBASE-10693
 URL: https://issues.apache.org/jira/browse/HBASE-10693
 Project: HBase
  Issue Type: Improvement
  Components: io, master
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor
 Attachments: HBASE-10693-trunk_v1.patch


 By checking the usage of these Atomic* fields, the fields themselves don't 
 change once they are assigned to reference an Atomic* object, so 'final' 
 rather than 'volatile' is the more appropriate modifier.
 On the other hand, the 'value' encapsulated in the Atomic* classes is already 
 declared 'volatile', which guarantees correct behavior in multi-threaded 
 scenarios.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10693) Correct declarations of Atomic* fields from 'volatile' to 'final'

2014-03-05 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10693:
-

Status: Patch Available  (was: Open)

 Correct declarations of Atomic* fields from 'volatile' to 'final'
 -

 Key: HBASE-10693
 URL: https://issues.apache.org/jira/browse/HBASE-10693
 Project: HBase
  Issue Type: Improvement
  Components: io, master
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor
 Attachments: HBASE-10693-trunk_v1.patch


 By checking the usage of these Atomic* fields, the fields themselves don't 
 change once they are assigned to reference an Atomic* object, so 'final' 
 rather than 'volatile' is the more appropriate modifier.
 On the other hand, the 'value' encapsulated in the Atomic* classes is already 
 declared 'volatile', which guarantees correct behavior in multi-threaded 
 scenarios.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HBASE-10662) RegionScanner should be closed and according lease should be cancelled in regionserver immediately if the related region has been re-opened during performing scan reques

2014-03-04 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10662:


 Summary: RegionScanner should be closed and according lease should 
be cancelled in regionserver immediately if the related region has been 
re-opened during performing scan request
 Key: HBASE-10662
 URL: https://issues.apache.org/jira/browse/HBASE-10662
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua


When a regionserver processes a scan request from a client, it fails the 
request by throwing a wrapped NotServingRegionException to the client if it 
finds that the region related to the passed-in scanner-id has been re-opened, 
and it also removes the RegionScannerHolder from the scanners map. In this 
case the old, invalid RegionScanner related to the passed-in scanner-id should 
be closed and the related lease cancelled at the same time.

Currently a region's scanners aren't closed when the region is closed; a 
region scanner is closed only when the client explicitly requests it or when 
the related lease expires, so the closing of region scanners is quite passive 
and laggy.

Does it sound reasonable to clean up all related scanners and cancel their 
leases after closing a region?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10662) RegionScanner should be closed and according lease should be cancelled in regionserver immediately if we find the related region has been re-opened during performing sca

2014-03-04 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10662:
-

Summary: RegionScanner should be closed and according lease should be 
cancelled in regionserver immediately if we find the related region has been 
re-opened during performing scan request  (was: RegionScanner should be closed 
and according lease should be cancelled in regionserver immediately if the 
related region has been re-opened during performing scan request)

 RegionScanner should be closed and according lease should be cancelled in 
 regionserver immediately if we find the related region has been re-opened 
 during performing scan request
 --

 Key: HBASE-10662
 URL: https://issues.apache.org/jira/browse/HBASE-10662
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10662-trunk_v1.patch


 When a regionserver processes a scan request from a client, it fails the 
 request by throwing a wrapped NotServingRegionException to the client if it 
 finds that the region related to the passed-in scanner-id has been 
 re-opened, and it also removes the RegionScannerHolder from the scanners 
 map. In this case the old, invalid RegionScanner related to the passed-in 
 scanner-id should be closed and the related lease cancelled at the same time.
 Currently a region's scanners aren't closed when the region is closed; a 
 region scanner is closed only when the client explicitly requests it or when 
 the related lease expires, so the closing of region scanners is quite 
 passive and laggy.
 Does it sound reasonable to clean up all related scanners and cancel their 
 leases after closing a region?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10662) RegionScanner should be closed and according lease should be cancelled in regionserver immediately if the related region has been re-opened during performing scan reques

2014-03-04 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10662:
-

Attachment: HBASE-10662-trunk_v1.patch

Patch with an immediate fix attached

Since there is no valid region for such a stale and invalid region scanner, there are 
no corresponding coprocessor calls such as 
region.getCoprocessorHost().preScannerClose(scanner) or 
region.getCoprocessorHost().postScannerClose(scanner).

 RegionScanner should be closed and according lease should be cancelled in 
 regionserver immediately if the related region has been re-opened during 
 performing scan request
 --

 Key: HBASE-10662
 URL: https://issues.apache.org/jira/browse/HBASE-10662
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10662-trunk_v1.patch


 While a regionserver processes a scan request from a client, it fails the request 
 by throwing a wrapped NotServingRegionException to the client if it finds that the 
 region related to the passed-in scanner-id has been re-opened, and it also removes 
 the RegionScannerHolder from the scanners map. In this case, the old and now-invalid 
 RegionScanner related to the passed-in scanner-id should also be closed and the 
 related lease should be cancelled at the same time.
 Currently a region's scanners aren't closed when the region itself is closed; a 
 region scanner is closed only when the client explicitly requests it, or when the 
 related lease expires. In this sense the closing of region scanners is 
 quite passive and lagging.
 Doesn't it sound reasonable to clean up all related scanners and cancel their 
 leases after closing a region?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10662) RegionScanner should be closed and according lease should be cancelled in regionserver immediately if we find the related region has been re-opened during performing s

2014-03-04 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13919208#comment-13919208
 ] 

Feng Honghua commented on HBASE-10662:
--

When we find, while serving a scan request from a client in the regionserver, that 
the region has been re-opened, and we only remove the RegionScannerHolder from the 
scanners map but don't close the related scanner, the related lease will be cancelled 
when it expires, but the related region scanner won't be closed in leaseExpired as 
expected:
{code}
  public void leaseExpired() {
    RegionScannerHolder rsh = scanners.remove(this.scannerName);
    if (rsh != null) {
      RegionScanner s = rsh.s;
      LOG.info("Scanner " + this.scannerName + " lease expired on region "
          + s.getRegionInfo().getRegionNameAsString());
      try {
        HRegion region = getRegion(s.getRegionInfo().getRegionName());
        if (region != null && region.getCoprocessorHost() != null) {
          region.getCoprocessorHost().preScannerClose(s);
        }

        s.close();
        if (region != null && region.getCoprocessorHost() != null) {
          region.getCoprocessorHost().postScannerClose(s);
        }
      } catch (IOException e) {
        LOG.error("Closing scanner for "
            + s.getRegionInfo().getRegionNameAsString(), e);
      }
    } else {
      LOG.info("Scanner " + this.scannerName + " lease expired");
    }
  }
{code}
In the code above, scanners.remove(this.scannerName) returns a null rsh since the 
holder has already been removed earlier, so the region scanner can't be closed here, 
which means the related region scanner never gets a chance to be closed.
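
For illustration, here is a rough sketch of the idea being proposed (this is not the 
attached patch; the types and helper names below are simplified stand-ins for the real 
regionserver internals): when the request handler detects that the region behind a 
scanner id has been re-opened or moved out, it should close the scanner and cancel its 
lease right away instead of only dropping the holder.

{code}
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Simplified stand-ins for RegionScanner / Leases, for illustration only.
interface SimpleScanner extends Closeable {}
interface SimpleLeases { void cancelLease(String leaseName); }

public class StaleScannerCleanup {
  private final ConcurrentMap<String, SimpleScanner> scanners = new ConcurrentHashMap<>();
  private final SimpleLeases leases;

  public StaleScannerCleanup(SimpleLeases leases) {
    this.leases = leases;
  }

  /**
   * Called when a scan request arrives for a scanner whose region has been
   * re-opened or moved out: remove the holder AND close the scanner AND
   * cancel the lease, so nothing is left dangling until lease expiry.
   */
  public void cleanupStaleScanner(String scannerName) {
    SimpleScanner stale = scanners.remove(scannerName);
    if (stale != null) {
      try {
        stale.close();                  // close the now-invalid scanner immediately
      } catch (IOException e) {
        // log and continue; the scanner is unusable either way
      }
      leases.cancelLease(scannerName);  // don't wait for the lease to expire
    }
  }
}
{code}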

 RegionScanner should be closed and according lease should be cancelled in 
 regionserver immediately if we find the related region has been re-opened 
 during performing scan request
 --

 Key: HBASE-10662
 URL: https://issues.apache.org/jira/browse/HBASE-10662
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10662-trunk_v1.patch


 While a regionserver processes a scan request from a client, it fails the request 
 by throwing a wrapped NotServingRegionException to the client if it finds that the 
 region related to the passed-in scanner-id has been re-opened, and it also removes 
 the RegionScannerHolder from the scanners map. In this case, the old and now-invalid 
 RegionScanner related to the passed-in scanner-id should also be closed and the 
 related lease should be cancelled at the same time.
 Currently a region's scanners aren't closed when the region itself is closed; a 
 region scanner is closed only when the client explicitly requests it, or when the 
 related lease expires. In this sense the closing of region scanners is 
 quite passive and lagging.
 Doesn't it sound reasonable to clean up all related scanners and cancel their 
 leases after closing a region?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HBASE-10663) Refactor/cleanup of class Leases and ScannerListener.leaseExpired

2014-03-04 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10663:


 Summary: Refactor/cleanup of class Leases and 
ScannerListener.leaseExpired
 Key: HBASE-10663
 URL: https://issues.apache.org/jira/browse/HBASE-10663
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor


Some cleanup of Leases and ScannerListener.leaseExpired:
# Reject renewLease if stopRequested (same as addLease; stopRequested means 
Leases has been asked to stop and is waiting for all remaining leases to expire)
# Raise the log level from info to warn for the case where no related region scanner 
is found when a lease expires (should it be an error?)
# Replace System.currentTimeMillis() with 
EnvironmentEdgeManager.currentTimeMillis()
# Correct some wrong comments and remove some irrelevant comments (was a Queue 
rather than a Map used for leases before?)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10663) Refactor/cleanup of class Leases and ScannerListener.leaseExpired

2014-03-04 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10663:
-

Attachment: HBASE-10663-trunk_v1.patch

 Refactor/cleanup of class Leases and ScannerListener.leaseExpired
 -

 Key: HBASE-10663
 URL: https://issues.apache.org/jira/browse/HBASE-10663
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor
 Attachments: HBASE-10663-trunk_v1.patch


 Some cleanup of Leases and ScannerListener.leaseExpired:
 # Reject renewLease if stopRequested (same as addLease; stopRequested means 
 Leases has been asked to stop and is waiting for all remaining leases to expire)
 # Raise the log level from info to warn for the case where no related region scanner 
 is found when a lease expires (should it be an error?)
 # Replace System.currentTimeMillis() with 
 EnvironmentEdgeManager.currentTimeMillis()
 # Correct some wrong comments and remove some irrelevant comments (was a Queue 
 rather than a Map used for leases before?)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10662) RegionScanner should be closed and according lease should be cancelled in regionserver immediately if we find the related region has been re-opened during performing s

2014-03-04 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13919331#comment-13919331
 ] 

Feng Honghua commented on HBASE-10662:
--

This bug occurs not only when the regionserver processes a scan request after a region 
re-open, but also when it processes a scan request after the region has been moved out 
of the regionserver (due to balancing or a user's move request): 
NotServingRegionException is thrown and the RegionScannerHolder is removed from the 
scanners map in the regionserver, but when leaseExpired is executed because the lease 
expires, the related region scanner can't be closed, because the corresponding 
RegionScannerHolder has already been removed from the scanners map without the 
related region scanner being closed...

 RegionScanner should be closed and according lease should be cancelled in 
 regionserver immediately if we find the related region has been re-opened 
 during performing scan request
 --

 Key: HBASE-10662
 URL: https://issues.apache.org/jira/browse/HBASE-10662
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10662-trunk_v1.patch


 While a regionserver processes a scan request from a client, it fails the request 
 by throwing a wrapped NotServingRegionException to the client if it finds that the 
 region related to the passed-in scanner-id has been re-opened, and it also removes 
 the RegionScannerHolder from the scanners map. In this case, the old and now-invalid 
 RegionScanner related to the passed-in scanner-id should also be closed and the 
 related lease should be cancelled at the same time.
 Currently a region's scanners aren't closed when the region itself is closed; a 
 region scanner is closed only when the client explicitly requests it, or when the 
 related lease expires. In this sense the closing of region scanners is 
 quite passive and lagging.
 Doesn't it sound reasonable to clean up all related scanners and cancel their 
 leases after closing a region?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10662) RegionScanner should be closed and according lease should be cancelled in regionserver immediately if we find the related region has been re-opened during performing sca

2014-03-04 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10662:
-

Description: 
While a regionserver processes a scan request from a client, it fails the request by 
throwing a wrapped NotServingRegionException to the client if it finds that the region 
related to the passed-in scanner-id has been re-opened, and it also removes the 
RegionScannerHolder from the scanners map. In this case, the old and now-invalid 
RegionScanner related to the passed-in scanner-id should also be closed and the 
related lease should be cancelled at the same time.

Currently a region's scanners aren't closed when the region itself is closed; a 
region scanner is closed only when the client explicitly requests it, or when the 
related lease expires. In this sense the closing of region scanners is quite 
passive and lagging.

When the regionserver processes a scan request from a client and either can't find an 
online region corresponding to the passed-in scanner-id (because the region has been 
moved out) or finds that the region has been re-opened, it throws 
NotServingRegionException and removes the corresponding RegionScannerHolder from the 
scanners map without closing the related region scanner (or cancelling the related 
lease). But when the lease expires, the related region scanner still isn't closed, 
since it is no longer present in the scanners map.

  was:
While a regionserver processes a scan request from a client, it fails the request by 
throwing a wrapped NotServingRegionException to the client if it finds that the region 
related to the passed-in scanner-id has been re-opened, and it also removes the 
RegionScannerHolder from the scanners map. In this case, the old and now-invalid 
RegionScanner related to the passed-in scanner-id should also be closed and the 
related lease should be cancelled at the same time.

Currently a region's scanners aren't closed when the region itself is closed; a 
region scanner is closed only when the client explicitly requests it, or when the 
related lease expires. In this sense the closing of region scanners is quite 
passive and lagging.

Doesn't it sound reasonable to clean up all related scanners and cancel their 
leases after closing a region?


 RegionScanner should be closed and according lease should be cancelled in 
 regionserver immediately if we find the related region has been re-opened 
 during performing scan request
 --

 Key: HBASE-10662
 URL: https://issues.apache.org/jira/browse/HBASE-10662
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10662-trunk_v1.patch


 While a regionserver processes a scan request from a client, it fails the request 
 by throwing a wrapped NotServingRegionException to the client if it finds that the 
 region related to the passed-in scanner-id has been re-opened, and it also removes 
 the RegionScannerHolder from the scanners map. In this case, the old and now-invalid 
 RegionScanner related to the passed-in scanner-id should also be closed and the 
 related lease should be cancelled at the same time.
 Currently a region's scanners aren't closed when the region itself is closed; a 
 region scanner is closed only when the client explicitly requests it, or when the 
 related lease expires. In this sense the closing of region scanners is 
 quite passive and lagging.
 When the regionserver processes a scan request from a client and either can't find 
 an online region corresponding to the passed-in scanner-id (because the region has 
 been moved out) or finds that the region has been re-opened, it throws 
 NotServingRegionException and removes the corresponding RegionScannerHolder from the 
 scanners map without closing the related region scanner (or cancelling the related 
 lease). But when the lease expires, the related region scanner still isn't closed, 
 since it is no longer present in the scanners map.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10651) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in Replication

2014-03-04 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10651:
-

Status: Patch Available  (was: Open)

 Fix incorrect handling of IE that restores current thread's interrupt status 
 within while/for loops in Replication
 --

 Key: HBASE-10651
 URL: https://issues.apache.org/jira/browse/HBASE-10651
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver, Replication
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10651-trunk_v1.patch, HBASE-10651-trunk_v2.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10663) Refactor/cleanup of class Leases and ScannerListener.leaseExpired

2014-03-04 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10663:
-

Status: Patch Available  (was: Open)

 Refactor/cleanup of class Leases and ScannerListener.leaseExpired
 -

 Key: HBASE-10663
 URL: https://issues.apache.org/jira/browse/HBASE-10663
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor
 Attachments: HBASE-10663-trunk_v1.patch


 Some cleanup of Leases and ScannerListener.leaseExpired:
 # Reject renewLease if stopRequested (same as addLease; stopRequested means 
 Leases has been asked to stop and is waiting for all remaining leases to expire)
 # Raise the log level from info to warn for the case where no related region scanner 
 is found when a lease expires (should it be an error?)
 # Replace System.currentTimeMillis() with 
 EnvironmentEdgeManager.currentTimeMillis()
 # Correct some wrong comments and remove some irrelevant comments (was a Queue 
 rather than a Map used for leases before?)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10662) RegionScanner should be closed and according lease should be cancelled in regionserver immediately if we find the related region has been moved-out or re-opened during p

2014-03-04 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10662:
-

Summary: RegionScanner should be closed and according lease should be 
cancelled in regionserver immediately if we find the related region has been 
moved-out or re-opened during performing scan request  (was: RegionScanner 
should be closed and according lease should be cancelled in regionserver 
immediately if we find the related region has been re-opened during performing 
scan request)

 RegionScanner should be closed and according lease should be cancelled in 
 regionserver immediately if we find the related region has been moved-out or 
 re-opened during performing scan request
 ---

 Key: HBASE-10662
 URL: https://issues.apache.org/jira/browse/HBASE-10662
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10662-trunk_v1.patch


 While a regionserver processes a scan request from a client, it fails the request 
 by throwing a wrapped NotServingRegionException to the client if it finds that the 
 region related to the passed-in scanner-id has been re-opened, and it also removes 
 the RegionScannerHolder from the scanners map. In this case, the old and now-invalid 
 RegionScanner related to the passed-in scanner-id should also be closed and the 
 related lease should be cancelled at the same time.
 Currently a region's scanners aren't closed when the region itself is closed; a 
 region scanner is closed only when the client explicitly requests it, or when the 
 related lease expires. In this sense the closing of region scanners is 
 quite passive and lagging.
 When the regionserver processes a scan request from a client and either can't find 
 an online region corresponding to the passed-in scanner-id (because the region has 
 been moved out) or finds that the region has been re-opened, it throws 
 NotServingRegionException and removes the corresponding RegionScannerHolder from the 
 scanners map without closing the related region scanner (or cancelling the related 
 lease). But when the lease expires, the related region scanner still isn't closed, 
 since it is no longer present in the scanners map.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10650) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in RegionServer

2014-03-03 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10650:
-

Status: Patch Available  (was: Open)

Thank [~nkeywal] for the prompt review!

 Fix incorrect handling of IE that restores current thread's interrupt status 
 within while/for loops in RegionServer
 ---

 Key: HBASE-10650
 URL: https://issues.apache.org/jira/browse/HBASE-10650
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10650-trunk_v1.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10652) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in rpc

2014-03-03 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10652:
-

Status: Patch Available  (was: Open)

Thank [~nkeywal] and [~stack] for the prompt review!

 Fix incorrect handling of IE that restores current thread's interrupt status 
 within while/for loops in rpc
 --

 Key: HBASE-10652
 URL: https://issues.apache.org/jira/browse/HBASE-10652
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver, Replication
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor
 Attachments: HBASE-10652-trunk_v1.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10651) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in Replication

2014-03-03 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918865#comment-13918865
 ] 

Feng Honghua commented on HBASE-10651:
--

bq.How does going back to while to check if we need to terminate relate to 
setting interrupt on thread? isActive doesn't seem to check thread state.
Yeah. Though I added interrupting the replication thread in the 'terminate' method 
after the line that sets 'running' to false, we still need to check the interrupt 
status in isActive() to make the overall logic complete/consistent, since the 
replication thread can possibly be interrupted directly rather than via the 
terminate() method from outside...
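
As a minimal, generic sketch of the idiom being discussed (plain Java with invented 
names; it is not the patch itself): the worker loop treats both a cleared running flag 
and a set interrupt status as reasons to stop, and terminate() flips the flag and 
interrupts the thread so a sleeping worker wakes up promptly.

{code}
public class StoppableWorker implements Runnable {
  private volatile boolean running = true;
  private Thread workerThread;

  public void start() {
    workerThread = new Thread(this, "stoppable-worker");
    workerThread.start();
  }

  // Both conditions are checked: the flag set by terminate(), and the thread's
  // interrupt status in case someone interrupted the thread directly.
  private boolean isActive() {
    return running && !Thread.currentThread().isInterrupted();
  }

  @Override
  public void run() {
    while (isActive()) {
      try {
        doOneUnitOfWork();
        Thread.sleep(1000);            // e.g. the pause between replication attempts
      } catch (InterruptedException ie) {
        // Restore the status and go back to the while condition; since isActive()
        // checks the interrupt status, the loop exits instead of spinning.
        Thread.currentThread().interrupt();
      }
    }
  }

  // Asks the worker to stop and interrupts it so a blocked sleep() returns quickly.
  public void terminate() {
    running = false;
    if (workerThread != null) {
      workerThread.interrupt();
    }
  }

  private void doOneUnitOfWork() {
    // placeholder for real work
  }

  public static void main(String[] args) throws InterruptedException {
    StoppableWorker w = new StoppableWorker();
    w.start();
    Thread.sleep(3000);
    w.terminate();
  }
}
{code}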

 Fix incorrect handling of IE that restores current thread's interrupt status 
 within while/for loops in Replication
 --

 Key: HBASE-10651
 URL: https://issues.apache.org/jira/browse/HBASE-10651
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver, Replication
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10651-trunk_v1.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-10651) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in Replication

2014-03-03 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10651:
-

Attachment: HBASE-10651-trunk_v2.patch

New patch attached per [~stack]'s suggestion, Thanks.

 Fix incorrect handling of IE that restores current thread's interrupt status 
 within while/for loops in Replication
 --

 Key: HBASE-10651
 URL: https://issues.apache.org/jira/browse/HBASE-10651
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver, Replication
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10651-trunk_v1.patch, HBASE-10651-trunk_v2.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-03-03 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918970#comment-13918970
 ] 

Feng Honghua commented on HBASE-10595:
--

Ping for further comment

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Sub-task
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10595-trunk_v1.patch, HBASE-10595-trunk_v2.patch, 
 HBASE-10595-trunk_v3.patch, HBASE-10595-trunk_v4.patch


 When a table dir (in hdfs) is removed (from outside), HMaster will still return 
 the cached TableDescriptor to the client for a getTableDescriptor request.
 On the contrary, HBaseAdmin.listTables() is handled correctly in the current 
 implementation: for a table whose table dir in hdfs has been removed from outside, 
 getTableDescriptor can still retrieve a valid (old) table descriptor, 
 while listTables says the table doesn't exist, which is inconsistent.
 The reason for this bug is that HMaster (via FSTableDescriptors) doesn't 
 check whether the table dir exists for a getTableDescriptor() request (while it 
 lists all existing table dirs, without first consulting the cache, and returns 
 accordingly for a listTables() request).
 When a table is deleted via deleteTable, the cache is cleared after the 
 table dir and tableInfo file are removed, so the listTables/getTableDescriptor 
 inconsistency should be transient (it still exists when the table dir has been 
 removed but the cache not yet cleared) and harder to expose.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-03-02 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13917343#comment-13917343
 ] 

Feng Honghua commented on HBASE-10595:
--

Thanks [~v.himanshu] and [~enis] for review and comments!
bq.This is a call to NN, and this patch would invoke it for every get request.
Actually, before this patch every get request still needed a call to NN. The call 
to NN is to determine whether the request can be satisfied by the cache 
(getTableInfoModtime asks NN for the last modified time of the table 
descriptor file...), so you can view the check for the existence of the table dir as a 
kind of shortcut.
{code}
if (cachedtdm != null) {
  // Check mod time has not changed (this is trip to NN).
  if (getTableInfoModtime(tablename) <= cachedtdm.getModtime()) {
    cachehits++;
    return cachedtdm.getTableDescriptor();
  }
}
{code}

bq.If a user deletes a table dir, wouldn't there be other (and more severe) 
consistencies such as meta being hosed, etc. What is the use case where a user 
is deleting the table dir behind the curtain ?
A user deleting a table dir is not an expected / normal scenario; it should be 
deemed a special case that leads to the inconsistency this patch addresses. HBCK 
should be used to remove this table's items from the meta table for overall 
consistency; that is an issue out of the scope of this patch.
In a broader sense, this patch fixes the bug that listTables() and 
getTableDescriptor() use different criteria when the table dir doesn't exist:
# listTables() : deems a table nonexistent and doesn't return the corresponding 
item if the table dir doesn't exist, without checking whether there is a 
corresponding item in the table descriptor cache
# getTableDescriptor() : checks the last modified time of the table descriptor 
file under the table dir; if the file doesn't exist or its last modified time 
isn't newer than the cached one, it deems this a cache hit and 
just returns the descriptor from the cache

bq.I meant NameNode, not HMaster.
Thank you both for clarifying this; I now realize you meant the access to NN for 
checking the existence of the table dir :-)

bq.That is an inconsistency. But completely making the cache useless is not the 
way to solve it I think.
Glad you agree it's an inconsistency:-)
Why I think completely making the cache useless is acceptable:
# In essence every 'cache' item should be coherent with its backing store (think 
about all other kinds of cache:-)). The differences among the various cache 
strategies (such as no-write / write-through / write-back...) are when and how 
to keep the cache coherent with its backing store, not whether it should 
be coherent with its backing store; and whenever it isn't coherent with the backing 
store, it's deemed 'invalid' and should not be used to serve requests. In 
this scenario, the table descriptor file is the backing store of the table 
descriptor cache.
# listTables() already uses the above semantics of 'whenever the cache 
isn't coherent with the backing store, it's deemed invalid and should not be 
used to serve requests': it doesn't use the cache if the table dir doesn't exist.
# Whether an active-master failover happens is transparent to the 
client; what the client cares about is that its requests are served consistently. But 
if we still use the cache for getTableDescriptor() even when its backing store (table 
dir and table descriptor file) changes (being removed is a special kind of change, 
right?), inconsistency can happen between two consecutive getTableDescriptor() 
calls: the previous active master uses its cache for the first 
getTableDescriptor, then it fails, another master takes over the active master role, 
finds no table dir (hence no table descriptor file), and so returns null for 
the second getTableDescriptor...

bq. That is the job of HBCK. There is no guarantees expected from the master if 
the user deletes dirs under it.
Partially agree:-). But it sounds much better if we could keep consistency from the 
client's perspective even under such corruption as the table dir being removed from 
outside, right?

Lastly, what about not checking the existence of the table dir, and just 
returning null if the table descriptor file doesn't exist (it's currently treated as a 
special case of the modified time not being newer than the cache, which IMHO is 
incorrect)? This way we still align with the cache semantics (invalidate the cache if 
it's not coherent with its backing store) but can save an access to NN if the table 
dir does exist.
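
To make the proposal concrete, here is a rough sketch of the lookup flow being 
suggested (simplified, with hypothetical helper names; it is not the FSTableDescriptors 
code itself): treat a missing table descriptor file as "table gone", invalidate the 
cache entry, and only fall back to the cache when the file exists and is not newer 
than the cached copy.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration only: DescriptorAndModtime / fetchModtimeFromNN / loadFromFs are
// invented stand-ins, not the real FSTableDescriptors members.
public class DescriptorCacheSketch {
  static final long FILE_MISSING = -1L;

  static class DescriptorAndModtime {
    final String descriptor;   // stands in for the real table descriptor object
    final long modtime;
    DescriptorAndModtime(String d, long m) { descriptor = d; modtime = m; }
  }

  private final Map<String, DescriptorAndModtime> cache = new ConcurrentHashMap<>();

  public String get(String tableName) {
    long fileModtime = fetchModtimeFromNN(tableName);
    if (fileModtime == FILE_MISSING) {
      // Descriptor file is gone: drop any stale cache entry and report "not found"
      // instead of treating this as a cache hit.
      cache.remove(tableName);
      return null;
    }
    DescriptorAndModtime cached = cache.get(tableName);
    if (cached != null && fileModtime <= cached.modtime) {
      return cached.descriptor;              // cache is still coherent with the file
    }
    String fresh = loadFromFs(tableName);    // re-read and refresh the cache
    cache.put(tableName, new DescriptorAndModtime(fresh, fileModtime));
    return fresh;
  }

  // Stand-ins for the real NN / filesystem accesses.
  private long fetchModtimeFromNN(String tableName) { return FILE_MISSING; }
  private String loadFromFs(String tableName) { return "descriptor-of-" + tableName; }
}
{code}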

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Sub-task
  Components: 

[jira] [Commented] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-03-02 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13917346#comment-13917346
 ] 

Feng Honghua commented on HBASE-10595:
--

bq.Lastly, what about not checking the existence of the table dir, and just 
returning null if the table descriptor file doesn't exist (it's currently treated as a 
special case of the modified time not being newer than the cache, which IMHO is 
incorrect)? This way we still align with the cache semantics (invalidate the cache if 
it's not coherent with its backing store) but can save an access to NN if the table 
dir does exist.
I will make a new patch accordingly if you guys agree on the above:-)

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Sub-task
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10595-trunk_v1.patch, HBASE-10595-trunk_v2.patch, 
 HBASE-10595-trunk_v3.patch, HBASE-10595-trunk_v4.patch


 When a table dir (in hdfs) is removed (from outside), HMaster will still return 
 the cached TableDescriptor to the client for a getTableDescriptor request.
 On the contrary, HBaseAdmin.listTables() is handled correctly in the current 
 implementation: for a table whose table dir in hdfs has been removed from outside, 
 getTableDescriptor can still retrieve a valid (old) table descriptor, 
 while listTables says the table doesn't exist, which is inconsistent.
 The reason for this bug is that HMaster (via FSTableDescriptors) doesn't 
 check whether the table dir exists for a getTableDescriptor() request (while it 
 lists all existing table dirs, without first consulting the cache, and returns 
 accordingly for a listTables() request).
 When a table is deleted via deleteTable, the cache is cleared after the 
 table dir and tableInfo file are removed, so the listTables/getTableDescriptor 
 inconsistency should be transient (it still exists when the table dir has been 
 removed but the cache not yet cleared) and harder to expose.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10649) TestMasterMetrics fails occasionally

2014-03-02 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13917349#comment-13917349
 ] 

Feng Honghua commented on HBASE-10649:
--

[~yuzhih...@gmail.com] : I have also encountered occasional failures of 
TestMasterMetrics on my local box these days. btw: 
TestFromClientSideWithCoprocessor fails (though quite rarely) as well, have you 
ever encountered it?

 TestMasterMetrics fails occasionally
 

 Key: HBASE-10649
 URL: https://issues.apache.org/jira/browse/HBASE-10649
 Project: HBase
  Issue Type: Test
Reporter: Ted Yu

 Latest occurrence was in https://builds.apache.org/job/HBase-TRUNK/4970
 {code}
 java.io.IOException: Shutting down
   at 
  org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:231)
    at 
  org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:93)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:875)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:839)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:756)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:727)
   at 
 org.apache.hadoop.hbase.master.TestMasterMetrics.startCluster(TestMasterMetrics.java:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
   at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
   at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
   at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
   at org.junit.runners.Suite.runChild(Suite.java:127)
   at org.junit.runners.Suite.runChild(Suite.java:26)
   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:724)
 Caused by: java.lang.RuntimeException: Master not initialized after 20ms 
 seconds
   at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:221)
   at 
 org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:425)
   at 
  org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:224)
   ... 25 more
 {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10629) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops

2014-03-02 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10629:
-

Component/s: (was: master)
 (was: Client)

 Fix incorrect handling of IE that restores current thread's interrupt status 
 within while/for loops
 ---

 Key: HBASE-10629
 URL: https://issues.apache.org/jira/browse/HBASE-10629
 Project: HBase
  Issue Type: Bug
  Components: regionserver, Replication
Reporter: Feng Honghua
Assignee: Feng Honghua

 There are roughly three typical kinds of incorrect handling of an IE thrown during 
 sleep() in the current code base:
 # Swallow it entirely -- has been fixed by HBASE-10497
 # Restore the current thread's interrupt status implicitly within while/for loops 
 (Threads.sleep() being called within while/for loops) -- has been fixed by 
 HBASE-10516
 # Restore the current thread's interrupt status explicitly within while/for loops 
 (directly interrupting the current thread within while/for loops)
 There are still places with the last kind of handling error, and as with 
 HBASE-10497/HBASE-10516, errors of this last kind should be fixed according to 
 their real scenarios case by case. This jira is created to serve as a parent jira 
 to fix errors of the last kind in a systematic manner.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HBASE-10651) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in Replication

2014-03-02 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10651:


 Summary: Fix incorrect handling of IE that restores current 
thread's interrupt status within while/for loops in Replication
 Key: HBASE-10651
 URL: https://issues.apache.org/jira/browse/HBASE-10651
 Project: HBase
  Issue Type: Sub-task
Reporter: Feng Honghua
Assignee: Feng Honghua






--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HBASE-10652) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in rpc

2014-03-02 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10652:


 Summary: Fix incorrect handling of IE that restores current 
thread's interrupt status within while/for loops in rpc
 Key: HBASE-10652
 URL: https://issues.apache.org/jira/browse/HBASE-10652
 Project: HBase
  Issue Type: Sub-task
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HBASE-10650) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in RegionServer

2014-03-02 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10650:


 Summary: Fix incorrect handling of IE that restores current 
thread's interrupt status within while/for loops in RegionServer
 Key: HBASE-10650
 URL: https://issues.apache.org/jira/browse/HBASE-10650
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua






--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10650) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in RegionServer

2014-03-02 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10650:
-

Attachment: HBASE-10650-trunk_v1.patch

patch attached

 Fix incorrect handling of IE that restores current thread's interrupt status 
 within while/for loops in RegionServer
 ---

 Key: HBASE-10650
 URL: https://issues.apache.org/jira/browse/HBASE-10650
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10650-trunk_v1.patch






--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10651) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in Replication

2014-03-02 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10651:
-

Attachment: HBASE-10651-trunk_v1.patch

patch attached

 Fix incorrect handling of IE that restores current thread's interrupt status 
 within while/for loops in Replication
 --

 Key: HBASE-10651
 URL: https://issues.apache.org/jira/browse/HBASE-10651
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver, Replication
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10651-trunk_v1.patch






--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10652) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in rpc

2014-03-02 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10652:
-

Attachment: HBASE-10652-trunk_v1.patch

patch attached

 Fix incorrect handling of IE that restores current thread's interrupt status 
 within while/for loops in rpc
 --

 Key: HBASE-10652
 URL: https://issues.apache.org/jira/browse/HBASE-10652
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver, Replication
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor
 Attachments: HBASE-10652-trunk_v1.patch






--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10651) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in Replication

2014-03-02 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13917393#comment-13917393
 ] 

Feng Honghua commented on HBASE-10651:
--

# Directly go back to the while() condition to check whether we need to terminate 
after receiving an interrupt
# Make the replication thread more responsive to termination by the replication 
manager, by interrupting the current thread in terminate()

 Fix incorrect handling of IE that restores current thread's interrupt status 
 within while/for loops in Replication
 --

 Key: HBASE-10651
 URL: https://issues.apache.org/jira/browse/HBASE-10651
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver, Replication
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10651-trunk_v1.patch






--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10652) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops in rpc

2014-03-02 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13917396#comment-13917396
 ] 

Feng Honghua commented on HBASE-10652:
--

Strictly speaking the previous implementation is logically correct, since the 
loop is terminated by setting running to false and interrupting the hosting 
thread at the same time, but it's still better to align with the standard handling 
idiom here.

 Fix incorrect handling of IE that restores current thread's interrupt status 
 within while/for loops in rpc
 --

 Key: HBASE-10652
 URL: https://issues.apache.org/jira/browse/HBASE-10652
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver, Replication
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor
 Attachments: HBASE-10652-trunk_v1.patch






--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-02-28 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13915541#comment-13915541
 ] 

Feng Honghua commented on HBASE-10595:
--

Thanks [~enis] for the comment! Sorry for the late reply.
bq.Going to NN for checking whether table dir exists basically means that we 
should not be using the cache at all. Users are expected to not delete the 
table directory from the file system, which will cause further inconsistencies. 
Why do you think this is a problem?
# You meant HMaster when saying 'NN', right?
# In the time interval after the table dir is moved to the tmp folder and before 
it's removed from the table descriptor cache, the results of listTables and 
getTableDescriptor contradict each other; don't you think that's a kind of 
inconsistency?
# Users are surely NOT expected to delete the table directory on purpose, but if 
they do delete it by accident, we should still ensure that queries on 
HBase state get consistent results, right? Actually, some HBCK unit tests aim 
to ensure consistency under exactly such corruption by the user.

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Sub-task
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10595-trunk_v1.patch, HBASE-10595-trunk_v2.patch, 
 HBASE-10595-trunk_v3.patch


 When a table dir (in hdfs) is removed (from outside), HMaster will still return 
 the cached TableDescriptor to the client for a getTableDescriptor request.
 On the contrary, HBaseAdmin.listTables() is handled correctly in the current 
 implementation: for a table whose table dir in hdfs has been removed from outside, 
 getTableDescriptor can still retrieve a valid (old) table descriptor, 
 while listTables says the table doesn't exist, which is inconsistent.
 The reason for this bug is that HMaster (via FSTableDescriptors) doesn't 
 check whether the table dir exists for a getTableDescriptor() request (while it 
 lists all existing table dirs, without first consulting the cache, and returns 
 accordingly for a listTables() request).
 When a table is deleted via deleteTable, the cache is cleared after the 
 table dir and tableInfo file are removed, so the listTables/getTableDescriptor 
 inconsistency should be transient (it still exists when the table dir has been 
 removed but the cache not yet cleared) and harder to expose.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-02-28 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10595:
-

Attachment: HBASE-10595-trunk_v4.patch

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Sub-task
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10595-trunk_v1.patch, HBASE-10595-trunk_v2.patch, 
 HBASE-10595-trunk_v3.patch, HBASE-10595-trunk_v4.patch


 When a table dir (in hdfs) is removed (from outside), HMaster will still return 
 the cached TableDescriptor to the client for a getTableDescriptor request.
 On the contrary, HBaseAdmin.listTables() is handled correctly in the current 
 implementation: for a table whose table dir in hdfs has been removed from outside, 
 getTableDescriptor can still retrieve a valid (old) table descriptor, 
 while listTables says the table doesn't exist, which is inconsistent.
 The reason for this bug is that HMaster (via FSTableDescriptors) doesn't 
 check whether the table dir exists for a getTableDescriptor() request (while it 
 lists all existing table dirs, without first consulting the cache, and returns 
 accordingly for a listTables() request).
 When a table is deleted via deleteTable, the cache is cleared after the 
 table dir and tableInfo file are removed, so the listTables/getTableDescriptor 
 inconsistency should be transient (it still exists when the table dir has been 
 removed but the cache not yet cleared) and harder to expose.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-02-28 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13915909#comment-13915909
 ] 

Feng Honghua commented on HBASE-10595:
--

To align with listTables(), which deems a table nonexistent whenever its 
table dir doesn't exist, getTableDescriptor now throws TableNotFoundException 
whenever the table dir doesn't exist, regardless of the table descriptor 
cache.

During table deletion: the table dir is renamed (moved) to the tmp dir => archive all 
region data => remove the table dir => clear the table descriptor cache => remove from 
RegionStates => remove from ZKTable => execute the postDeleteTable coprocessor.

With this patch, the client now considers the table deletion successful once the table 
dir has been renamed (i.e. is nonexistent), rather than after the table descriptor 
cache is cleared. So some unit tests that assume states such as regions having been 
removed from RegionStates, or the postDeleteTable coprocessor having been executed, 
are now more likely to fail (since archiving region data / removing the table dir in 
the tmp dir takes more time); that's why I added Threads.sleep() to some unit tests in 
this patch. These cases passed before this patch not by design, but by chance, because 
much less time elapses between clearing the table descriptor cache and removing from 
RegionStates / executing the postDeleteTable coprocessor (no archiving of table data / 
removal of the table dir in between), and they do fail when I add some extra sleep 
(equivalent to a scenario where HMaster suddenly runs slowly) after clearing the table 
descriptor cache, even without this patch...

The root cause of the above test failures is another bug: HBaseAdmin.deleteTable is 
not really synchronous (some cleanups in HMaster are likely not done yet *after* 
HBaseAdmin.deleteTable() returns). HBASE-10636 has been created for this bug. We can 
remove the added Threads.sleep() once HBASE-10636 is done, and personally I think this 
patch can be resolved independently.

Any opinion?
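
As a side note on the test-timing issue, a common alternative to a fixed 
Threads.sleep() is a small poll-until-condition helper with a timeout (a generic 
sketch with invented names, not something taken from the patch), so the test waits 
only as long as the cleanup actually takes:

{code}
import java.util.function.BooleanSupplier;

public final class WaitUtil {
  private WaitUtil() {}

  /**
   * Polls the condition every intervalMs until it becomes true or timeoutMs elapses.
   * Returns true if the condition was met within the timeout.
   */
  public static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (condition.getAsBoolean()) {
        return true;
      }
      Thread.sleep(intervalMs);
    }
    return condition.getAsBoolean();
  }

  public static void main(String[] args) throws InterruptedException {
    long start = System.currentTimeMillis();
    // Example: wait up to 10s for some cleanup to finish (here: 2s have elapsed).
    boolean done = waitFor(10_000, 100, () -> System.currentTimeMillis() - start > 2_000);
    System.out.println("condition met: " + done);
  }
}
{code}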

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Sub-task
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10595-trunk_v1.patch, HBASE-10595-trunk_v2.patch, 
 HBASE-10595-trunk_v3.patch, HBASE-10595-trunk_v4.patch


 When a table dir (in hdfs) is removed (from outside), HMaster will still return 
 the cached TableDescriptor to the client for a getTableDescriptor request.
 On the contrary, HBaseAdmin.listTables() is handled correctly in the current 
 implementation: for a table whose table dir in hdfs has been removed from outside, 
 getTableDescriptor can still retrieve a valid (old) table descriptor, 
 while listTables says the table doesn't exist, which is inconsistent.
 The reason for this bug is that HMaster (via FSTableDescriptors) doesn't 
 check whether the table dir exists for a getTableDescriptor() request (while it 
 lists all existing table dirs, without first consulting the cache, and returns 
 accordingly for a listTables() request).
 When a table is deleted via deleteTable, the cache is cleared after the 
 table dir and tableInfo file are removed, so the listTables/getTableDescriptor 
 inconsistency should be transient (it still exists when the table dir has been 
 removed but the cache not yet cleared) and harder to expose.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HBASE-10628) Fix semantic inconsistency among methods which are exposed to client

2014-02-27 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10628:


 Summary: Fix semantic inconsistency among methods which are 
exposed to client
 Key: HBASE-10628
 URL: https://issues.apache.org/jira/browse/HBASE-10628
 Project: HBase
  Issue Type: Bug
  Components: Client, master
Reporter: Feng Honghua
Assignee: Feng Honghua


This serves as a placeholder/parent jira for the inconsistencies among client-facing 
methods such as listTables / tableExists / getTableDescriptor described in 
HBASE-10584 and HBASE-10595, and also for some other semantic fixes.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10584) Inconsistency between tableExists and listTables in implementation

2014-02-27 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10584:
-

Issue Type: Sub-task  (was: Bug)
Parent: HBASE-10628

 Inconsistency between tableExists and listTables in implementation
 --

 Key: HBASE-10584
 URL: https://issues.apache.org/jira/browse/HBASE-10584
 Project: HBase
  Issue Type: Sub-task
  Components: Client, master
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10584-trunk_v1.patch, HBASE-10584-trunk_v1.patch


 # HBaseAdmin.tableExists is implemented by scanning the meta table
 # HBaseAdmin.listTables (and HBaseAdmin.getTableDescriptor) is implemented by 
 talking to HMaster, which responds by querying FSTableDescriptors, and 
 FSTableDescriptors returns all tables by scanning all the table descriptor 
 files in the FS (the cache also plays a role here, so most of the time the request 
 can be satisfied by the cache)...
 Actually, HBaseAdmin internally asks HMaster to check whether a table exists 
 when implementing deleteTable (see below), so why does it use a 
 different way (scanning the meta table) to implement tableExists() for outside 
 users to use for the same purpose?
 {code}
   tableExists = false;
   GetTableDescriptorsResponse htds;
   MasterKeepAliveConnection master = connection.getKeepAliveMasterService();
   try {
 GetTableDescriptorsRequest req =
 RequestConverter.buildGetTableDescriptorsRequest(tableName);
 htds = master.getTableDescriptors(null, req);
   } catch (ServiceException se) {
 throw ProtobufUtil.getRemoteException(se);
   } finally {
 master.close();
   }
   tableExists = !htds.getTableSchemaList().isEmpty();
 {code}
 (Above, verifying that the table descriptor file is deleted guarantees that all 
 items of this table have been deleted from the meta table...)
 Since creating the table descriptor file and inserting the item into the meta table 
 occur at different times without atomic semantics, this inconsistency in 
 implementation can lead to confusing behavior when create-table or 
 delete-table fails midway: (before the corresponding cleanup is done) the table 
 descriptor file may exist while no item exists in the meta table (for 
 create-table, where the table descriptor file is created before the item is inserted 
 into the meta table), which leads to listTables including that table while 
 tableExists says no. A similar inconsistency arises if delete-table fails midway...
 Confusing behavior can also happen during the process even when it eventually 
 succeeds:
 # During table creation, if a user calls listTables and then calls 
 tableExists for this table after the table descriptor is created but before the 
 item is inserted into the meta table, he will find that listTables includes the 
 table but tableExists returns false for that same table; this behavior is confusing 
 and should only be acceptable while the table is being deleted...
 # Similar behavior occurs during table deletion.
 It seems the benefit of implementing tableExists this way is that we can avoid 
 talking to HMaster, but considering that we talk to HMaster for listTables and 
 getTableDescriptor, this benefit can't offset the drawback of the inconsistency.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-02-27 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10595:
-

Issue Type: Sub-task  (was: Bug)
Parent: HBASE-10628

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Sub-task
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10595-trunk_v1.patch, HBASE-10595-trunk_v2.patch, 
 HBASE-10595-trunk_v3.patch


 When a table dir (in hdfs) is removed (from outside), HMaster will still return 
 the cached TableDescriptor to the client for a getTableDescriptor request.
 On the contrary, HBaseAdmin.listTables() is handled correctly in the current 
 implementation: for a table whose table dir in hdfs has been removed from outside, 
 getTableDescriptor can still retrieve a valid (old) table descriptor, 
 while listTables says the table doesn't exist, which is inconsistent.
 The reason for this bug is that HMaster (via FSTableDescriptors) doesn't 
 check whether the table dir exists for a getTableDescriptor() request (while it 
 lists all existing table dirs, without first consulting the cache, and returns 
 accordingly for a listTables() request).
 When a table is deleted via deleteTable, the cache is cleared after the 
 table dir and tableInfo file are removed, so the listTables/getTableDescriptor 
 inconsistency should be transient (it still exists when the table dir has been 
 removed but the cache not yet cleared) and harder to expose.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HBASE-10629) Fix incorrect handling of IE that restores current thread's interrupt status within while/for loops

2014-02-27 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10629:


 Summary: Fix incorrect handling of IE that restores current 
thread's interrupt status within while/for loops
 Key: HBASE-10629
 URL: https://issues.apache.org/jira/browse/HBASE-10629
 Project: HBase
  Issue Type: Bug
  Components: Client, master, regionserver, Replication
Reporter: Feng Honghua
Assignee: Feng Honghua


There are about three kinds of typical incorrect handling of IE thrown during 
sleep() in current code base:
# Shadow it totally -- Has been fixed by HBASE-10497
# Restore current thread's interrupt status implicitly within while/for loops 
(Threads.sleep() being called within while/for loops)  -- Has been fixed by 
HBASE-10516
# Restore current thread's interrupt status explicitly within while/for loops 
(directly interrupt current thread within while/for loops)

There are still places with the last kind of handling error, and as with 
HBASE-10497/HBASE-10516, they should be fixed according to their real scenarios, 
case by case. This jira is created to serve as a parent jira for fixing the last 
kind of errors in a systematic manner; an illustrative sketch follows below.
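To make the third pattern concrete, here is an illustrative sketch (hypothetical 
class and method names, not a quote of any specific offending code) of the 
problematic shape and one possible fix:
{code}
// Illustrative sketch only; Worker and tryDoWork() are made-up names.
class Worker implements Runnable {
  @Override
  public void run() {
    boolean done = false;
    while (!done) {
      try {
        Thread.sleep(1000);            // wait between attempts
        done = tryDoWork();
      } catch (InterruptedException ie) {
        // Problematic pattern: restoring the interrupt status but staying in the
        // loop means the next sleep() throws immediately -> busy spinning.
        Thread.currentThread().interrupt();
        // One possible fix: also leave the loop (or rethrow) so the thread
        // actually reacts to the interruption.
        break;
      }
    }
  }

  private boolean tryDoWork() { return true; }
}
{code}
The key point is that merely restoring the interrupt status inside the loop turns 
every subsequent sleep into a no-op, so the loop must also exit (or propagate the 
exception) to honor the interruption.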



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (HBASE-9469) Synchronous replication

2014-02-27 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua resolved HBASE-9469.
-

Resolution: Won't Fix

It has less value than expected, as described in the last comment.

 Synchronous replication
 ---

 Key: HBASE-9469
 URL: https://issues.apache.org/jira/browse/HBASE-9469
 Project: HBase
  Issue Type: New Feature
Reporter: Feng Honghua

 Scenario: 
 A/B clusters with master-master replication: the client writes to A cluster and 
 A pushes all writes to B cluster; when A cluster is down, the client switches to 
 writing to B cluster.
 But the client's write switch is unsafe because the replication between A/B is 
 asynchronous: a delete sent to B cluster which aims to delete a put written 
 earlier can fail to take effect because that put was written to A cluster and 
 wasn't successfully pushed to B before A went down. It can be worse: if this 
 delete is collected (a flush and then a major compaction occur) before A cluster 
 comes back up and the put is eventually pushed to B, the put will never be 
 deleted.
 Can we provide per-table/per-peer synchronous replication which ships the 
 corresponding hlog entry of a write before responding with write success to the 
 client? This would guarantee the client that all write requests for which it 
 got a success response when writing to A cluster are already in B cluster as 
 well.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-9469) Synchronous replication

2014-02-27 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13915428#comment-13915428
 ] 

Feng Honghua commented on HBASE-9469:
-

Thanks [~terry_zhang] for comment!
bq.we need to make sure client write both side is a transaction if you want to 
data is consistent.
I've explained in the above comment that neither writing to both sides from the 
client nor letting the master cluster synchronously push WALEdits to the peer can 
improve the overall availability; both ways improve read availability at the cost 
of write availability: now a write fails in the face of either cluster's outage...
IMHO, to improve read availability *without* hurting write availability, we need 
treatment similar to Megastore or Spanner.

 Synchronous replication
 ---

 Key: HBASE-9469
 URL: https://issues.apache.org/jira/browse/HBASE-9469
 Project: HBase
  Issue Type: New Feature
Reporter: Feng Honghua

 Scenario: 
 A/B clusters with master-master replication: the client writes to A cluster and 
 A pushes all writes to B cluster; when A cluster is down, the client switches to 
 writing to B cluster.
 But the client's write switch is unsafe because the replication between A/B is 
 asynchronous: a delete sent to B cluster which aims to delete a put written 
 earlier can fail to take effect because that put was written to A cluster and 
 wasn't successfully pushed to B before A went down. It can be worse: if this 
 delete is collected (a flush and then a major compaction occur) before A cluster 
 comes back up and the put is eventually pushed to B, the put will never be 
 deleted.
 Can we provide per-table/per-peer synchronous replication which ships the 
 corresponding hlog entry of a write before responding with write success to the 
 client? This would guarantee the client that all write requests for which it 
 got a success response when writing to A cluster are already in B cluster as 
 well.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HBASE-10636) HBaseAdmin.deleteTable isn't 'really' synchronous in that still some cleanup in HMaster after client thinks deleteTable() succeeds

2014-02-27 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10636:


 Summary: HBaseAdmin.deleteTable isn't 'really' synchronous in that 
still some cleanup in HMaster after client thinks deleteTable() succeeds
 Key: HBASE-10636
 URL: https://issues.apache.org/jira/browse/HBASE-10636
 Project: HBase
  Issue Type: Sub-task
  Components: Client, master
Reporter: Feng Honghua
Assignee: Feng Honghua


In HBaseAdmin.deleteTable():
{code}
public void deleteTable(final TableName tableName) throws IOException {
    // Wait until all regions deleted
    for (int tries = 0; tries < (this.numRetries * this.retryLongerMultiplier); tries++) {
      // let us wait until hbase:meta table is updated and
      // HMaster removes the table from its HTableDescriptors
      if (values == null || values.length == 0) {
        tableExists = false;
        GetTableDescriptorsResponse htds;
        MasterKeepAliveConnection master = connection.getKeepAliveMasterService();
        try {
          GetTableDescriptorsRequest req =
            RequestConverter.buildGetTableDescriptorsRequest(tableName);
          htds = master.getTableDescriptors(null, req);
        } catch (ServiceException se) {
          throw ProtobufUtil.getRemoteException(se);
        } finally {
          master.close();
        }
        tableExists = !htds.getTableSchemaList().isEmpty();
        if (!tableExists) {
          break;
        }
      }
    }
{code}
The client thinks deleteTable has succeeded once it can no longer retrieve the 
table descriptor.

But in HMaster, the DeleteTableHandler which really deletes the table:
{code}
  protected void handleTableOperation(List<HRegionInfo> regions)
      throws IOException, KeeperException {
    // 1. Wait because of region in transition

    // 2. Remove regions from META
    LOG.debug("Deleting regions from META");
    MetaEditor.deleteRegions(this.server.getCatalogTracker(), regions);

    // 3. Move the table in /hbase/.tmp
    MasterFileSystem mfs = this.masterServices.getMasterFileSystem();
    Path tempTableDir = mfs.moveTableToTemp(tableName);

    try {
      // 4. Delete regions from FS (temp directory)
      FileSystem fs = mfs.getFileSystem();
      for (HRegionInfo hri: regions) {
        LOG.debug("Archiving region " + hri.getRegionNameAsString() + " from FS");
        HFileArchiver.archiveRegion(fs, mfs.getRootDir(),
            tempTableDir, new Path(tempTableDir, hri.getEncodedName()));
      }

      // 5. Delete table from FS (temp directory)
      if (!fs.delete(tempTableDir, true)) {
        LOG.error("Couldn't delete " + tempTableDir);
      }

      LOG.debug("Table '" + tableName + "' archived!");
    } finally {
      // 6. Update table descriptor cache
      LOG.debug("Removing '" + tableName + "' descriptor.");
      this.masterServices.getTableDescriptors().remove(tableName);

      // 7. Clean up regions of the table in RegionStates.
      LOG.debug("Removing '" + tableName + "' from region states.");
      states.tableDeleted(tableName);

      // 8. If entry for this table in zk, and up in AssignmentManager, remove it.
      LOG.debug("Marking '" + tableName + "' as deleted.");
      am.getZKTable().setDeletedTable(tableName);
    }

    if (cpHost != null) {
      cpHost.postDeleteTableHandler(this.tableName);
    }
  }
{code}
Removing regions from RegionStates, marking the table deleted in ZK, and calling 
coprocessors' postDeleteTableHandler all happen after the table is removed from 
the TableDescriptor cache.

So client code relying on RegionStates/ZKTable/CP being cleaned up after 
deleteTable() can fail if client requests hit HMaster before those three 
cleanups are done...

Actually, when I add some sleep (e.g. 200ms) after the line below to simulate a 
possibly slow-running HMaster,
{code}
this.masterServices.getTableDescriptors().remove(tableName);
{code}
some unit tests (such as moving a region / confirming the postDeleteTable CP hook 
immediately after deleteTable) can no longer pass.
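As an illustration (not part of any attached patch), test code that depends on the 
post-deleteTable cleanup could wait for it explicitly instead of assuming it is 
complete when deleteTable() returns; the helper below is a hypothetical sketch:
{code}
// Hypothetical sketch: poll a condition until it holds or a timeout elapses,
// since RegionStates/ZK/coprocessor cleanup may lag behind deleteTable().
final class TestWaitUtil {
  private TestWaitUtil() {}

  static void waitFor(long timeoutMs, java.util.concurrent.Callable<Boolean> condition)
      throws Exception {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!condition.call()) {
      if (System.currentTimeMillis() > deadline) {
        throw new AssertionError("Condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(100);   // small pause between checks
    }
  }
}
{code}
A test would pass in a Callable that checks, for example, that the region is no 
longer known to the master's RegionStates; the exact accessors depend on the 
branch.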



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10575) ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously

2014-02-24 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910170#comment-13910170
 ] 

Feng Honghua commented on HBASE-10575:
--

[~lhofhansl], thanks for the review! :-)

Can it be committed, or is there any further feedback? Thanks

 ReplicationSource thread can't be terminated if it runs into the loop to 
 contact peer's zk ensemble and fails continuously
 --

 Key: HBASE-10575
 URL: https://issues.apache.org/jira/browse/HBASE-10575
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.98.1, 0.99.0, 0.94.17
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Fix For: 0.96.2, 0.98.1, 0.99.0, 0.94.18

 Attachments: HBASE-10575-trunk_v1.patch


 When the ReplicationSource thread runs into the loop to contact the peer's zk 
 ensemble, it doesn't check isActive() before each retry, so if the given peer's 
 zk ensemble is unreachable for some reason, this ReplicationSource thread just 
 can't be terminated from outside, e.g. by removePeer etc.
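 To illustrate the shape of the fix (hypothetical names, not the actual patch), a 
 retry loop of this kind needs to re-check its termination flag before every 
 attempt:
 {code}
// Sketch only (hypothetical names): a retry loop that re-checks a volatile
// "running" flag before each attempt, so terminate() can actually stop it.
class PeerConnector implements Runnable {
  private volatile boolean running = true;

  public void terminate() { running = false; }

  @Override
  public void run() {
    while (running) {                    // re-checked before every retry
      try {
        connectToPeerQuorum();           // stands in for the real zk connect call
        return;                          // connected: done
      } catch (Exception e) {
        try {
          Thread.sleep(1000);            // back off before the next attempt
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          return;                        // give up promptly when interrupted
        }
      }
    }
  }

  private void connectToPeerQuorum() throws Exception {
    throw new Exception("peer quorum unreachable");   // placeholder
  }
}
 {code}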



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-02-24 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10595:
-

Attachment: HBASE-10595-trunk_v3.patch

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Bug
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10595-trunk_v1.patch, HBASE-10595-trunk_v2.patch, 
 HBASE-10595-trunk_v3.patch


 When a table dir (in hdfs) is removed (by an outside actor), HMaster will still 
 return the cached TableDescriptor to the client for a getTableDescriptor request.
 By contrast, HBaseAdmin.listTables() is handled correctly in the current 
 implementation: for a table whose table dir in hdfs has been removed from 
 outside, getTableDescriptor can still retrieve a valid (old) table descriptor 
 while listTables says the table doesn't exist. This is inconsistent.
 The reason for this bug is that HMaster (via FSTableDescriptors) doesn't check 
 whether the table dir exists for a getTableDescriptor() request (while for a 
 listTables() request it lists all existing table dirs, rather than consulting 
 the cache first, and responds accordingly).
 When a table is deleted via deleteTable, the cache is cleared after the table 
 dir and tableInfo file are removed, so the listTables/getTableDescriptor 
 inconsistency should be transient (though it still exists while the table dir 
 is removed but the cache is not yet cleared) and harder to expose.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-02-23 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10595:
-

Attachment: HBASE-10595-trunk_v1.patch

Patch attached

A new unit test case is added to enforce the consistency between 
getTableDescriptor and listTables when the table dir is removed from outside. 
This case, whose behavior should be taken for granted, can't pass before the patch.
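A rough sketch of such a consistency test is shown below. It assumes the usual 
JUnit/mini-cluster scaffolding; the class is hypothetical, helper names such as 
FSUtils.getTableDir may differ slightly by branch, and whether the fixed 
getTableDescriptor returns null or throws TableNotFoundException depends on the 
patch (the sketch assumes the latter):
{code}
// Rough sketch (not the attached test): enforce that getTableDescriptor and
// listTables agree once the table dir has been removed from outside.
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.fail;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.FSUtils;
import org.junit.Test;

public class TestDescriptorConsistencySketch {
  private static final HBaseTestingUtility TEST_UTIL = new HBaseTestingUtility();

  @Test
  public void testAgreementAfterTableDirRemoved() throws Exception {
    TEST_UTIL.startMiniCluster();
    try {
      TableName tableName = TableName.valueOf("dirRemovedOutside");
      HBaseAdmin admin = TEST_UTIL.getHBaseAdmin();
      TEST_UTIL.createTable(tableName, Bytes.toBytes("cf"));

      // Remove the table dir behind the master's back, as an outside actor would.
      FileSystem fs = TEST_UTIL.getTestFileSystem();
      Path rootDir = FSUtils.getRootDir(TEST_UTIL.getConfiguration());
      fs.delete(FSUtils.getTableDir(rootDir, tableName), true);

      // listTables() no longer reports the table...
      boolean listed = false;
      for (HTableDescriptor htd : admin.listTables()) {
        if (htd.getTableName().equals(tableName)) listed = true;
      }
      assertFalse(listed);

      // ...and getTableDescriptor should agree rather than serve the stale cache.
      try {
        admin.getTableDescriptor(tableName);
        fail("Expected no descriptor once the table dir is gone");
      } catch (TableNotFoundException expected) {
        // consistent with listTables()
      }
    } finally {
      TEST_UTIL.shutdownMiniCluster();
    }
  }
}
{code}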

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Bug
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10595-trunk_v1.patch


 When a table dir (in hdfs) is removed (by an outside actor), HMaster will still 
 return the cached TableDescriptor to the client for a getTableDescriptor request.
 By contrast, HBaseAdmin.listTables() is handled correctly in the current 
 implementation: for a table whose table dir in hdfs has been removed from 
 outside, getTableDescriptor can still retrieve a valid (old) table descriptor 
 while listTables says the table doesn't exist. This is inconsistent.
 The reason for this bug is that HMaster (via FSTableDescriptors) doesn't check 
 whether the table dir exists for a getTableDescriptor() request (while for a 
 listTables() request it lists all existing table dirs, rather than consulting 
 the cache first, and responds accordingly).
 When a table is deleted via deleteTable, the cache is cleared after the table 
 dir and tableInfo file are removed, so the listTables/getTableDescriptor 
 inconsistency should be transient (though it still exists while the table dir 
 is removed but the cache is not yet cleared) and harder to expose.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-02-23 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13909715#comment-13909715
 ] 

Feng Honghua commented on HBASE-10595:
--

Additional note: TestAssignmentManagerOnCluster#testMoveRegionOfDeletedTable 
relies on the assumption that regions of a deleted table are guaranteed to have 
been removed from RegionStates in HMaster once deleteTable() is done, and it 
expects an UnknownRegionException (which is raised by checking that the region is 
not in RegionStates) when moving a region after its table is deleted.

But this assumption is wrong: although deleteTable() is synchronous, it only 
verifies that the table descriptor can no longer be retrieved (the returned 
descriptor list is empty) before returning, while in DeleteTableHandler the 
regions are removed from RegionStates *after* the region data dirs / table dir 
are removed and the table descriptor is removed from the cache.

Instead of fixing it in the unit test (we could sleep for a while before calling 
move(), but that's weird since deleteTable() is synchronous!), I let move() throw 
UnknownRegionException as well when FSTableDescriptors.get(table) returns null 
(meaning the table is already deleted, but the regions haven't yet been removed 
from RegionStates), as sketched below.
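For illustration, the added guard in HMaster.move() could look roughly like the 
following (a sketch of the idea with simplified names, not the exact patch hunk):
{code}
// Sketch only: inside HMaster.move(), after resolving the region, also bail out
// when the table descriptor is already gone. The table may effectively be
// deleted even though RegionStates has not been cleaned up yet.
HTableDescriptor htd = this.tableDescriptors.get(hri.getTable());  // FSTableDescriptors-backed
if (htd == null) {
  throw new UnknownRegionException(
      "Table of region " + hri.getRegionNameAsString() + " has been deleted");
}
{code}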

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Bug
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10595-trunk_v1.patch


 When a table dir (in hdfs) is removed (by an outside actor), HMaster will still 
 return the cached TableDescriptor to the client for a getTableDescriptor request.
 By contrast, HBaseAdmin.listTables() is handled correctly in the current 
 implementation: for a table whose table dir in hdfs has been removed from 
 outside, getTableDescriptor can still retrieve a valid (old) table descriptor 
 while listTables says the table doesn't exist. This is inconsistent.
 The reason for this bug is that HMaster (via FSTableDescriptors) doesn't check 
 whether the table dir exists for a getTableDescriptor() request (while for a 
 listTables() request it lists all existing table dirs, rather than consulting 
 the cache first, and responds accordingly).
 When a table is deleted via deleteTable, the cache is cleared after the table 
 dir and tableInfo file are removed, so the listTables/getTableDescriptor 
 inconsistency should be transient (though it still exists while the table dir 
 is removed but the cache is not yet cleared) and harder to expose.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-02-23 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13909716#comment-13909716
 ] 

Feng Honghua commented on HBASE-10595:
--

TestMasterObserver fails when I run 'mvn test -P runAllTests', but it never 
fails when I run it separately many times (10+). Any clue?

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Bug
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10595-trunk_v1.patch


 When a table dir (in hdfs) is removed (by an outside actor), HMaster will still 
 return the cached TableDescriptor to the client for a getTableDescriptor request.
 By contrast, HBaseAdmin.listTables() is handled correctly in the current 
 implementation: for a table whose table dir in hdfs has been removed from 
 outside, getTableDescriptor can still retrieve a valid (old) table descriptor 
 while listTables says the table doesn't exist. This is inconsistent.
 The reason for this bug is that HMaster (via FSTableDescriptors) doesn't check 
 whether the table dir exists for a getTableDescriptor() request (while for a 
 listTables() request it lists all existing table dirs, rather than consulting 
 the cache first, and responds accordingly).
 When a table is deleted via deleteTable, the cache is cleared after the table 
 dir and tableInfo file are removed, so the listTables/getTableDescriptor 
 inconsistency should be transient (though it still exists while the table dir 
 is removed but the cache is not yet cleared) and harder to expose.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10584) Inconsistency between tableExists and listTables in implementation

2014-02-23 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13909717#comment-13909717
 ] 

Feng Honghua commented on HBASE-10584:
--

I ran 'mvn test' before attaching the patch for this jira; the failed case is 
only exposed when running 'mvn test -P runAllTests', sorry.

Let's re-run Hadoop QA after HBASE-10595 is committed.

 Inconsistency between tableExists and listTables in implementation
 --

 Key: HBASE-10584
 URL: https://issues.apache.org/jira/browse/HBASE-10584
 Project: HBase
  Issue Type: Bug
  Components: Client, master
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10584-trunk_v1.patch, HBASE-10584-trunk_v1.patch


 # HBaseAdmin.tableExists is implemented by scanning meta table
 # HBaseAdmin.listTables (and HBaseAdmin.getTableDescriptor) is implemented by 
 talking with HMaster, which responds by querying FSTableDescriptors, and 
 FSTableDescriptors returns all tables by scanning the table descriptor 
 files in the FS (a cache also plays a role here, so most of the time requests 
 can be satisfied by the cache)...
 Actually HBaseAdmin asks HMaster to check whether a table exists internally 
 when implementing deleteTable (see below), so why does it use a 
 different way (scanning the meta table) to implement tableExists() for outside 
 users who need the same check?
 {code}
   tableExists = false;
   GetTableDescriptorsResponse htds;
   MasterKeepAliveConnection master = connection.getKeepAliveMasterService();
   try {
 GetTableDescriptorsRequest req =
 RequestConverter.buildGetTableDescriptorsRequest(tableName);
 htds = master.getTableDescriptors(null, req);
   } catch (ServiceException se) {
 throw ProtobufUtil.getRemoteException(se);
   } finally {
 master.close();
   }
   tableExists = !htds.getTableSchemaList().isEmpty();
 {code}
 (The above works because verifying that the table descriptor file has been 
 deleted guarantees that all entries of this table have been deleted from the 
 meta table...)
 Since creating the table descriptor file and inserting the entry into the meta 
 table happen at different times without atomic semantics, this inconsistency in 
 implementation can lead to confusing behavior when create-table or delete-table 
 fails midway: before the corresponding cleanup is done, the table descriptor 
 file may exist while no entry exists in the meta table (for create-table, where 
 the table descriptor file is created before the entry is inserted into the meta 
 table). This leads to listTables including that table while tableExists says no. 
 A similar inconsistency arises if delete-table fails midway...
 Confusing behavior can also happen during the process even though it eventually 
 succeeds:
 # During table creation, a user who calls listTables and then calls tableExists 
 for this table after the table descriptor is created but before the entry is 
 inserted into the meta table will find that listTables includes the table while 
 tableExists returns false for that same table; this behavior is confusing and 
 should only be acceptable while a table is being deleted...
 # Similar behavior occurs during table deletion.
 It seems the benefit of implementing tableExists this way is that we avoid 
 talking with HMaster, but considering we already talk with HMaster for 
 listTables and getTableDescriptor, such a benefit can't offset the drawback of 
 the inconsistency.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10584) Inconsistency between tableExists and listTables in implementation

2014-02-23 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13909722#comment-13909722
 ] 

Feng Honghua commented on HBASE-10584:
--

bq.We need a new issue after this one? There should be a general prescription 
on how to avoid our doing table-state transitions that by-pass each other 
thereby going forward? Should all queries about the state of tables 
(enabled/disabled) go via the master from here on out? If so, what to 
deprecate? (Can be different issue)
Agree on 'There should be a general prescription on how to avoid our doing 
table-state transitions that by-pass each other thereby going forward' and 'All 
queries about the state of tables (enabled/disabled) go via the master from 
here on out'. We can come up with a general prescription and then do a 
comprehensive review of existing code to ensure/enforce the consistency.

We can use a separate, more general jira for this purpose (to serve as a central 
placeholder for sub-tasks covering the individual externally visible 
inconsistencies). It's about consistency from the outside perspective of the 
client, which is different from the master's current internal inconsistency 
caused by missed events resulting from misusing zk's watch/notify mechanism for 
state-machine maintenance and from maintaining the truth in multiple places.

Opinion?

 Inconsistency between tableExists and listTables in implementation
 --

 Key: HBASE-10584
 URL: https://issues.apache.org/jira/browse/HBASE-10584
 Project: HBase
  Issue Type: Bug
  Components: Client, master
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10584-trunk_v1.patch, HBASE-10584-trunk_v1.patch


 # HBaseAdmin.tableExists is implemented by scanning meta table
 # HBaseAdmin.listTables (and HBaseAdmin.getTableDescriptor) is implemented by 
 talking with HMaster, which responds by querying FSTableDescriptors, and 
 FSTableDescriptors returns all tables by scanning the table descriptor 
 files in the FS (a cache also plays a role here, so most of the time requests 
 can be satisfied by the cache)...
 Actually HBaseAdmin asks HMaster to check whether a table exists internally 
 when implementing deleteTable (see below), so why does it use a 
 different way (scanning the meta table) to implement tableExists() for outside 
 users who need the same check?
 {code}
   tableExists = false;
   GetTableDescriptorsResponse htds;
   MasterKeepAliveConnection master = connection.getKeepAliveMasterService();
   try {
 GetTableDescriptorsRequest req =
 RequestConverter.buildGetTableDescriptorsRequest(tableName);
 htds = master.getTableDescriptors(null, req);
   } catch (ServiceException se) {
 throw ProtobufUtil.getRemoteException(se);
   } finally {
 master.close();
   }
   tableExists = !htds.getTableSchemaList().isEmpty();
 {code}
 (The above works because verifying that the table descriptor file has been 
 deleted guarantees that all entries of this table have been deleted from the 
 meta table...)
 Since creating the table descriptor file and inserting the entry into the meta 
 table happen at different times without atomic semantics, this inconsistency in 
 implementation can lead to confusing behavior when create-table or delete-table 
 fails midway: before the corresponding cleanup is done, the table descriptor 
 file may exist while no entry exists in the meta table (for create-table, where 
 the table descriptor file is created before the entry is inserted into the meta 
 table). This leads to listTables including that table while tableExists says no. 
 A similar inconsistency arises if delete-table fails midway...
 Confusing behavior can also happen during the process even though it eventually 
 succeeds:
 # During table creation, a user who calls listTables and then calls tableExists 
 for this table after the table descriptor is created but before the entry is 
 inserted into the meta table will find that listTables includes the table while 
 tableExists returns false for that same table; this behavior is confusing and 
 should only be acceptable while a table is being deleted...
 # Similar behavior occurs during table deletion.
 It seems the benefit of implementing tableExists this way is that we avoid 
 talking with HMaster, but considering we already talk with HMaster for 
 listTables and getTableDescriptor, such a benefit can't offset the drawback of 
 the inconsistency.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10556) Possible data loss due to non-handled DroppedSnapshotException for user-triggered flush from client/shell

2014-02-23 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13909724#comment-13909724
 ] 

Feng Honghua commented on HBASE-10556:
--

Ping again...:-)

 Possible data loss due to non-handled DroppedSnapshotException for 
 user-triggered flush from client/shell
 -

 Key: HBASE-10556
 URL: https://issues.apache.org/jira/browse/HBASE-10556
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Attachments: HBASE-10556-trunk_v1.patch


 During the code review when investigating HBASE-10499, a possibility of data 
 loss due to non-handled DroppedSnapshotException for user-triggered flush is 
 exposed.
 Data loss can happen as below:
 # A flush for some region is triggered via HBaseAdmin or shell
 # The request reaches the regionserver and eventually HRegion.internalFlushcache 
 is called, then fails at persisting the memstore's snapshot to an hfile; 
 DroppedSnapshotException is thrown and the snapshot is left uncleared.
 # DroppedSnapshotException is not handled in HRegion, and is just encapsulated 
 as a ServiceException before returning to the client
 # After a while, some new writes are handled and put in the current memstore, 
 then a new flush is triggered for the region because memstoreSize exceeds the 
 flush threshold
 # This second (new) flush succeeds; for the HStore which failed in the previous 
 user-triggered flush, the remaining non-empty snapshot is used rather than a new 
 snapshot made from the current memstore, but HLog's latest sequenceId is used 
 for the resultant hfiles --- the sequenceId attached within the hfiles says all 
 edits with sequenceId <= it have been persisted, but that is not actually true 
 for the edits still in the existing memstore
 # Now the regionserver hosting this region dies
 # During the replay phase of failover, the edits corresponding to the ones which 
 are in the memstore and not actually persisted in hfiles when the previous 
 regionserver dies will be ignored, since they are deemed persisted when compared 
 to the hfiles' latest sequenceId --- these edits are lost...
 For the second flush, we also can't discard the remaining snapshot and make a 
 new one using the current memstore, because that way the data in the remaining 
 snapshot would be lost. We should abort the regionserver immediately and rely on 
 the failover to replay the log for data safety.
 DroppedSnapshotException is correctly handled in MemStoreFlusher for internally 
 triggered flushes (which are generated by flush-size / rollWriter / 
 periodicFlusher). But a user-triggered flush is processed directly by 
 HRegionServer->HRegion without putting a flush entry into flushQueue, and hence 
 is not handled by MemStoreFlusher
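 As a hedged sketch (not the attached patch), the user-triggered flush path could 
 mirror MemStoreFlusher's handling by aborting the regionserver on 
 DroppedSnapshotException; the variable names below are simplified placeholders:
 {code}
// Sketch only: handle DroppedSnapshotException for a client-requested flush the
// same way MemStoreFlusher does for internal flushes -- abort and let WAL replay
// recover the un-persisted edits.
try {
  region.flushcache();
} catch (DroppedSnapshotException ex) {
  // The snapshot could not be persisted; carrying on risks the data-loss
  // scenario described above, so force a server abort.
  regionServer.abort("Replay of WAL required. Forcing server shutdown", ex);
}
 {code}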



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10575) ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously

2014-02-23 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13909725#comment-13909725
 ] 

Feng Honghua commented on HBASE-10575:
--

Ping for another +1 for this jira to be committed? thanks! :-)

 ReplicationSource thread can't be terminated if it runs into the loop to 
 contact peer's zk ensemble and fails continuously
 --

 Key: HBASE-10575
 URL: https://issues.apache.org/jira/browse/HBASE-10575
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.98.1, 0.99.0, 0.94.17
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Fix For: 0.98.1, 0.99.0

 Attachments: HBASE-10575-trunk_v1.patch


 When the ReplicationSource thread runs into the loop to contact the peer's zk 
 ensemble, it doesn't check isActive() before each retry, so if the given peer's 
 zk ensemble is unreachable for some reason, this ReplicationSource thread just 
 can't be terminated from outside, e.g. by removePeer etc.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10575) ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously

2014-02-23 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910028#comment-13910028
 ] 

Feng Honghua commented on HBASE-10575:
--

bq.I would probably rename uninitialize to terminate, otherwise looks good 
to me.
You meant the refactored 'uninitialize' method? Hmmm... IMHO 'uninitialize' is 
more accurate than 'terminate' in that it only does the cleanup of closing the 
connection and logging before the containing thread is terminated; this method 
itself does not directly terminate the replication thread, and there is *already* 
a terminate method which is used by ReplicationManager to terminate a replication 
thread from outside.

 ReplicationSource thread can't be terminated if it runs into the loop to 
 contact peer's zk ensemble and fails continuously
 --

 Key: HBASE-10575
 URL: https://issues.apache.org/jira/browse/HBASE-10575
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.98.1, 0.99.0, 0.94.17
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Fix For: 0.96.2, 0.98.1, 0.99.0, 0.94.18

 Attachments: HBASE-10575-trunk_v1.patch


 When the ReplicationSource thread runs into the loop to contact the peer's zk 
 ensemble, it doesn't check isActive() before each retry, so if the given peer's 
 zk ensemble is unreachable for some reason, this ReplicationSource thread just 
 can't be terminated from outside, e.g. by removePeer etc.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10575) ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously

2014-02-23 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910041#comment-13910041
 ] 

Feng Honghua commented on HBASE-10575:
--

Simply 'close', thanks... I meant 'close + logging' by 'cleanup' in the above 
comment; a misuse of the word 'cleanup'? :-)

 ReplicationSource thread can't be terminated if it runs into the loop to 
 contact peer's zk ensemble and fails continuously
 --

 Key: HBASE-10575
 URL: https://issues.apache.org/jira/browse/HBASE-10575
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.98.1, 0.99.0, 0.94.17
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Fix For: 0.96.2, 0.98.1, 0.99.0, 0.94.18

 Attachments: HBASE-10575-trunk_v1.patch


 When the ReplicationSource thread runs into the loop to contact the peer's zk 
 ensemble, it doesn't check isActive() before each retry, so if the given peer's 
 zk ensemble is unreachable for some reason, this ReplicationSource thread just 
 can't be terminated from outside, e.g. by removePeer etc.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-02-23 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10595:
-

Attachment: HBASE-10595-trunk_v2.patch

New patch 'fixing' previously failed TestMasterObserver case

The cause of the failure for TestMasterObserver is similar to 
TestAssignmentManagerOnCluster#testMoveRegionOfDeletedTable: 
HBaseAdmin.deleteTable is 'synchronous' to the client in that it returns after it 
ensures the table descriptor can't be retrieved from the master once it has asked 
the master to delete the table. But DeleteTableHandler is processed asynchronously 
in the master, and things such as 'clearing the table descriptor cache', 'removing 
regions from RegionStates' and 'calling all coprocessors' postDeleteTableHandler' 
are all done *after* the table dir is removed (after this patch it's the removal 
of the table dir, not the table descriptor cache, that makes the client unable to 
get the table descriptor and believe the table is deleted).

Before this patch, the client could still get a valid table descriptor after the 
master removed the table dir (first rename, then remove all region data dirs and 
finally remove the table dir) until the table descriptor was removed from the 
table descriptor cache. After this patch, the client can't get the table 
descriptor once the master renames the table dir, so test cases which assume that 
regions are removed from RegionStates or that coprocessors' 
postDeleteTableHandler has been called become much more likely to fail: it now 
takes longer from the moment the client can't get the table descriptor to the 
moment regions are removed from RegionStates / coprocessors' 
postDeleteTableHandler is called, and code assuming such things fails when 
executed immediately after HBaseAdmin.deleteTable().

In short, we can't assume regions are removed from RegionStates or coprocessors' 
postDeleteTableHandler has been called after HBaseAdmin.deleteTable() returns, 
even though HBaseAdmin.deleteTable() is seemingly synchronous.

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Bug
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10595-trunk_v1.patch, HBASE-10595-trunk_v2.patch


 When a table dir (in hdfs) is removed (by an outside actor), HMaster will still 
 return the cached TableDescriptor to the client for a getTableDescriptor request.
 By contrast, HBaseAdmin.listTables() is handled correctly in the current 
 implementation: for a table whose table dir in hdfs has been removed from 
 outside, getTableDescriptor can still retrieve a valid (old) table descriptor 
 while listTables says the table doesn't exist. This is inconsistent.
 The reason for this bug is that HMaster (via FSTableDescriptors) doesn't check 
 whether the table dir exists for a getTableDescriptor() request (while for a 
 listTables() request it lists all existing table dirs, rather than consulting 
 the cache first, and responds accordingly).
 When a table is deleted via deleteTable, the cache is cleared after the table 
 dir and tableInfo file are removed, so the listTables/getTableDescriptor 
 inconsistency should be transient (though it still exists while the table dir 
 is removed but the cache is not yet cleared) and harder to expose.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-02-23 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10595:
-

Status: Patch Available  (was: Open)

 HBaseAdmin.getTableDescriptor can wrongly get the previous table's 
 TableDescriptor even after the table dir in hdfs is removed
 --

 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Bug
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10595-trunk_v1.patch, HBASE-10595-trunk_v2.patch


 When a table dir (in hdfs) is removed (by an outside actor), HMaster will still 
 return the cached TableDescriptor to the client for a getTableDescriptor request.
 By contrast, HBaseAdmin.listTables() is handled correctly in the current 
 implementation: for a table whose table dir in hdfs has been removed from 
 outside, getTableDescriptor can still retrieve a valid (old) table descriptor 
 while listTables says the table doesn't exist. This is inconsistent.
 The reason for this bug is that HMaster (via FSTableDescriptors) doesn't check 
 whether the table dir exists for a getTableDescriptor() request (while for a 
 listTables() request it lists all existing table dirs, rather than consulting 
 the cache first, and responds accordingly).
 When a table is deleted via deleteTable, the cache is cleared after the table 
 dir and tableInfo file are removed, so the listTables/getTableDescriptor 
 inconsistency should be transient (though it still exists while the table dir 
 is removed but the cache is not yet cleared) and harder to expose.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10584) Inconsistency between tableExists and listTables in implementation

2014-02-22 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13909615#comment-13909615
 ] 

Feng Honghua commented on HBASE-10584:
--

I'm looking at the failed case.

 Inconsistency between tableExists and listTables in implementation
 --

 Key: HBASE-10584
 URL: https://issues.apache.org/jira/browse/HBASE-10584
 Project: HBase
  Issue Type: Bug
  Components: Client, master
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10584-trunk_v1.patch, HBASE-10584-trunk_v1.patch


 # HBaseAdmin.tableExists is implemented by scanning meta table
 # HBaseAdmin.listTables (and HBaseAdmin.getTableDescriptor) is implemented by 
 talking with HMaster, which responds by querying FSTableDescriptors, and 
 FSTableDescriptors returns all tables by scanning the table descriptor 
 files in the FS (a cache also plays a role here, so most of the time requests 
 can be satisfied by the cache)...
 Actually HBaseAdmin asks HMaster to check whether a table exists internally 
 when implementing deleteTable (see below), so why does it use a 
 different way (scanning the meta table) to implement tableExists() for outside 
 users who need the same check?
 {code}
   tableExists = false;
   GetTableDescriptorsResponse htds;
   MasterKeepAliveConnection master = connection.getKeepAliveMasterService();
   try {
 GetTableDescriptorsRequest req =
 RequestConverter.buildGetTableDescriptorsRequest(tableName);
 htds = master.getTableDescriptors(null, req);
   } catch (ServiceException se) {
 throw ProtobufUtil.getRemoteException(se);
   } finally {
 master.close();
   }
   tableExists = !htds.getTableSchemaList().isEmpty();
 {code}
 (The above works because verifying that the table descriptor file has been 
 deleted guarantees that all entries of this table have been deleted from the 
 meta table...)
 Since creating the table descriptor file and inserting the entry into the meta 
 table happen at different times without atomic semantics, this inconsistency in 
 implementation can lead to confusing behavior when create-table or delete-table 
 fails midway: before the corresponding cleanup is done, the table descriptor 
 file may exist while no entry exists in the meta table (for create-table, where 
 the table descriptor file is created before the entry is inserted into the meta 
 table). This leads to listTables including that table while tableExists says no. 
 A similar inconsistency arises if delete-table fails midway...
 Confusing behavior can also happen during the process even though it eventually 
 succeeds:
 # During table creation, a user who calls listTables and then calls tableExists 
 for this table after the table descriptor is created but before the entry is 
 inserted into the meta table will find that listTables includes the table while 
 tableExists returns false for that same table; this behavior is confusing and 
 should only be acceptable while a table is being deleted...
 # Similar behavior occurs during table deletion.
 It seems the benefit of implementing tableExists this way is that we avoid 
 talking with HMaster, but considering we already talk with HMaster for 
 listTables and getTableDescriptor, such a benefit can't offset the drawback of 
 the inconsistency.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HBASE-10595) HBaseAdmin.getTableDescriptor can wrongly get the previous table's TableDescriptor even after the table dir in hdfs is removed

2014-02-22 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10595:


 Summary: HBaseAdmin.getTableDescriptor can wrongly get the 
previous table's TableDescriptor even after the table dir in hdfs is removed
 Key: HBASE-10595
 URL: https://issues.apache.org/jira/browse/HBASE-10595
 Project: HBase
  Issue Type: Bug
  Components: master, util
Reporter: Feng Honghua
Assignee: Feng Honghua


When a table dir (in hdfs) is removed (by an outside actor), HMaster will still 
return the cached TableDescriptor to the client for a getTableDescriptor request.

By contrast, HBaseAdmin.listTables() is handled correctly in the current 
implementation: for a table whose table dir in hdfs has been removed from 
outside, getTableDescriptor can still retrieve a valid (old) table descriptor 
while listTables says the table doesn't exist. This is inconsistent.

The reason for this bug is that HMaster (via FSTableDescriptors) doesn't check 
whether the table dir exists for a getTableDescriptor() request (while for a 
listTables() request it lists all existing table dirs, rather than consulting the 
cache first, and responds accordingly).

When a table is deleted via deleteTable, the cache is cleared after the table dir 
and tableInfo file are removed, so the listTables/getTableDescriptor 
inconsistency should be transient (though it still exists while the table dir is 
removed but the cache is not yet cleared) and harder to expose.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10584) Inconsistency between tableExists and listTables in implementation

2014-02-22 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13909669#comment-13909669
 ] 

Feng Honghua commented on HBASE-10584:
--

The failed case is due to another bug in HMaster/FSTableDescriptors, for which 
HBASE-10595 was created. This case will pass after that bug is fixed.

 Inconsistency between tableExists and listTables in implementation
 --

 Key: HBASE-10584
 URL: https://issues.apache.org/jira/browse/HBASE-10584
 Project: HBase
  Issue Type: Bug
  Components: Client, master
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10584-trunk_v1.patch, HBASE-10584-trunk_v1.patch


 # HBaseAdmin.tableExists is implemented by scanning meta table
 # HBaseAdmin.listTables (and HBaseAdmin.getTableDescriptor) is implemented by 
 talking with HMaster, which responds by querying FSTableDescriptors, and 
 FSTableDescriptors returns all tables by scanning the table descriptor 
 files in the FS (a cache also plays a role here, so most of the time requests 
 can be satisfied by the cache)...
 Actually HBaseAdmin asks HMaster to check whether a table exists internally 
 when implementing deleteTable (see below), so why does it use a 
 different way (scanning the meta table) to implement tableExists() for outside 
 users who need the same check?
 {code}
   tableExists = false;
   GetTableDescriptorsResponse htds;
   MasterKeepAliveConnection master = connection.getKeepAliveMasterService();
   try {
 GetTableDescriptorsRequest req =
 RequestConverter.buildGetTableDescriptorsRequest(tableName);
 htds = master.getTableDescriptors(null, req);
   } catch (ServiceException se) {
 throw ProtobufUtil.getRemoteException(se);
   } finally {
 master.close();
   }
   tableExists = !htds.getTableSchemaList().isEmpty();
 {code}
 (The above works because verifying that the table descriptor file has been 
 deleted guarantees that all entries of this table have been deleted from the 
 meta table...)
 Since creating the table descriptor file and inserting the entry into the meta 
 table happen at different times without atomic semantics, this inconsistency in 
 implementation can lead to confusing behavior when create-table or delete-table 
 fails midway: before the corresponding cleanup is done, the table descriptor 
 file may exist while no entry exists in the meta table (for create-table, where 
 the table descriptor file is created before the entry is inserted into the meta 
 table). This leads to listTables including that table while tableExists says no. 
 A similar inconsistency arises if delete-table fails midway...
 Confusing behavior can also happen during the process even though it eventually 
 succeeds:
 # During table creation, a user who calls listTables and then calls tableExists 
 for this table after the table descriptor is created but before the entry is 
 inserted into the meta table will find that listTables includes the table while 
 tableExists returns false for that same table; this behavior is confusing and 
 should only be acceptable while a table is being deleted...
 # Similar behavior occurs during table deletion.
 It seems the benefit of implementing tableExists this way is that we avoid 
 talking with HMaster, but considering we already talk with HMaster for 
 listTables and getTableDescriptor, such a benefit can't offset the drawback of 
 the inconsistency.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HBASE-10584) Inconsistency between tableExists and listTables in implementation

2014-02-21 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10584:


 Summary: Inconsistency between tableExists and listTables in 
implementation
 Key: HBASE-10584
 URL: https://issues.apache.org/jira/browse/HBASE-10584
 Project: HBase
  Issue Type: Bug
  Components: Client, master
Reporter: Feng Honghua
Assignee: Feng Honghua


# HBaseAdmin.tableExists is implemented by scanning meta table
# HBaseAdmin.listTables (and HBaseAdmin.getTableDescriptor) is implemented by 
talking with HMaster, which responds by querying FSTableDescriptors, and 
FSTableDescriptors returns all tables by scanning the table descriptor files in 
the FS (a cache also plays a role here, so most of the time requests can be 
satisfied by the cache)...

Actually HBaseAdmin asks HMaster to check whether a table exists internally when 
implementing deleteTable (see below), so why does it use a different way 
(scanning the meta table) to implement tableExists() for outside users who need 
the same check?
{code}
  tableExists = false;
  GetTableDescriptorsResponse htds;
  MasterKeepAliveConnection master = connection.getKeepAliveMasterService();
  try {
GetTableDescriptorsRequest req =
RequestConverter.buildGetTableDescriptorsRequest(tableName);
htds = master.getTableDescriptors(null, req);
  } catch (ServiceException se) {
throw ProtobufUtil.getRemoteException(se);
  } finally {
master.close();
  }
  tableExists = !htds.getTableSchemaList().isEmpty();
{code}

Since creating the table descriptor file and inserting the entry into the meta 
table happen at different times without atomic semantics, this inconsistency in 
implementation can lead to confusing behavior when create-table or delete-table 
fails midway: before the corresponding cleanup is done, the table descriptor file 
may exist while no entry exists in the meta table (for create-table, where the 
table descriptor file is created before the entry is inserted into the meta 
table). This leads to listTables including that table while tableExists says no. 
A similar inconsistency arises if delete-table fails midway...

Confusing behavior can also happen during the process even though it eventually 
succeeds:
# During table creation, a user who calls listTables and then calls tableExists 
for this table after the table descriptor is created but before the entry is 
inserted into the meta table will find that listTables includes the table while 
tableExists returns false for that same table; this behavior is confusing and 
should only be acceptable while a table is being deleted...
# Similar behavior occurs during table deletion.

It seems the benefit of implementing tableExists this way is that we avoid 
talking with HMaster, but considering we already talk with HMaster for listTables 
and getTableDescriptor, such a benefit can't offset the drawback of the 
inconsistency.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10584) Inconsistency between tableExists and listTables in implementation

2014-02-21 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10584:
-

Description: 
# HBaseAdmin.tableExists is implemented by scanning meta table
# HBaseAdmin.listTables (and HBaseAdmin.getTableDescriptor) is implemented by 
talking with HMaster, which responds by querying FSTableDescriptors, and 
FSTableDescriptors returns all tables by scanning the table descriptor files in 
the FS (a cache also plays a role here, so most of the time requests can be 
satisfied by the cache)...

Actually HBaseAdmin asks HMaster to check whether a table exists internally when 
implementing deleteTable (see below), so why does it use a different way 
(scanning the meta table) to implement tableExists() for outside users who need 
the same check?
{code}
  tableExists = false;
  GetTableDescriptorsResponse htds;
  MasterKeepAliveConnection master = connection.getKeepAliveMasterService();
  try {
GetTableDescriptorsRequest req =
RequestConverter.buildGetTableDescriptorsRequest(tableName);
htds = master.getTableDescriptors(null, req);
  } catch (ServiceException se) {
throw ProtobufUtil.getRemoteException(se);
  } finally {
master.close();
  }
  tableExists = !htds.getTableSchemaList().isEmpty();
{code}
(The above works because verifying that the table descriptor file has been 
deleted guarantees that all entries of this table have been deleted from the meta 
table...)

Since creating the table descriptor file and inserting the entry into the meta 
table happen at different times without atomic semantics, this inconsistency in 
implementation can lead to confusing behavior when create-table or delete-table 
fails midway: before the corresponding cleanup is done, the table descriptor file 
may exist while no entry exists in the meta table (for create-table, where the 
table descriptor file is created before the entry is inserted into the meta 
table). This leads to listTables including that table while tableExists says no. 
A similar inconsistency arises if delete-table fails midway...

Confusing behavior can also happen during the process even though it eventually 
succeeds:
# During table creation, a user who calls listTables and then calls tableExists 
for this table after the table descriptor is created but before the entry is 
inserted into the meta table will find that listTables includes the table while 
tableExists returns false for that same table; this behavior is confusing and 
should only be acceptable while a table is being deleted...
# Similar behavior occurs during table deletion.

It seems the benefit of implementing tableExists this way is that we avoid 
talking with HMaster, but considering we already talk with HMaster for listTables 
and getTableDescriptor, such a benefit can't offset the drawback of the 
inconsistency.

  was:
# HBaseAdmin.tableExists is implemented by scanning meta table
# HBaseAdmin.listTables (and HBaseAdmin.getTableDescriptor) is implemented by 
talking with HMaster, which responds by querying FSTableDescriptors, and 
FSTableDescriptors returns all tables by scanning the table descriptor files in 
the FS (a cache also plays a role here, so most of the time requests can be 
satisfied by the cache)...

Actually HBaseAdmin asks HMaster to check whether a table exists internally when 
implementing deleteTable (see below), so why does it use a different way 
(scanning the meta table) to implement tableExists() for outside users who need 
the same check?
{code}
  tableExists = false;
  GetTableDescriptorsResponse htds;
  MasterKeepAliveConnection master = connection.getKeepAliveMasterService();
  try {
GetTableDescriptorsRequest req =
RequestConverter.buildGetTableDescriptorsRequest(tableName);
htds = master.getTableDescriptors(null, req);
  } catch (ServiceException se) {
throw ProtobufUtil.getRemoteException(se);
  } finally {
master.close();
  }
  tableExists = !htds.getTableSchemaList().isEmpty();
{code}

Since creating the table descriptor file and inserting the entry into the meta 
table happen at different times without atomic semantics, this inconsistency in 
implementation can lead to confusing behavior when create-table or delete-table 
fails midway: before the corresponding cleanup is done, the table descriptor file 
may exist while no entry exists in the meta table (for create-table, where the 
table descriptor file is created before the entry is inserted into the meta 
table). This leads to listTables including that table while tableExists says no. 
A similar inconsistency arises if delete-table fails midway...

Confusing behavior can also happen during the process even though it eventually 
succeeds:
# During table creation, a user who calls listTables and then calls tableExists 
for this table after the table descriptor is created but before the entry is 
inserted into the meta table will find that listTables includes the table while 
tableExists returns false for that same table; this behavior is confusing and 
should only be acceptable while a table is being deleted...
# Similar behavior occurs during table deletion.

Seems the benefit of implementing 

[jira] [Commented] (HBASE-10516) Refactor code where Threads.sleep is called within a while/for loop

2014-02-21 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13908183#comment-13908183
 ] 

Feng Honghua commented on HBASE-10516:
--

[~nkeywal], thanks! :-)

 Refactor code where Threads.sleep is called within a while/for loop
 ---

 Key: HBASE-10516
 URL: https://issues.apache.org/jira/browse/HBASE-10516
 Project: HBase
  Issue Type: Bug
  Components: Client, master, regionserver
Affects Versions: 0.98.0, 0.99.0
Reporter: Feng Honghua
Assignee: Feng Honghua
 Fix For: 0.99.0

 Attachments: HBASE-10516-trunk_v1.patch, HBASE-10516-trunk_v2.patch, 
 HBASE-10516-trunk_v3.patch


 Threads.sleep implementation:
 {code}
  public static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      e.printStackTrace();
      Thread.currentThread().interrupt();
    }
  }
 {code}
 From the above implementation, the current thread's interrupt status is 
 restored/reset when InterruptedException is caught and handled. If this 
 method is called within a while/for loop and an InterruptedException is 
 thrown during one sleep, every subsequent Threads.sleep call in the loop 
 will hit InterruptedException again immediately and return without the 
 expected sleep. This behavior breaks the intention of an independent sleep 
 in each iteration.
 I mentioned this in HBASE-10497, and this JIRA was created to handle it 
 separately per [~nkeywal]'s suggestion
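 A small sketch of the effect and of one possible fix (the retry loop and the 
 names below are illustrative assumptions, not code from the patch):
{code}
public class RetrySleepSketch {
  // Stands in for whatever work the loop retries; illustrative only.
  static boolean attempt() { return false; }

  // Problematic shape: after the first interrupt, Threads.sleep() restores the interrupt
  // status, so every later call returns immediately and the loop busy-spins.
  static void retryWithThreadsSleep() {
    while (!attempt()) {
      org.apache.hadoop.hbase.util.Threads.sleep(1000);
    }
  }

  // One fix: sleep directly, and on interruption restore the status once and stop retrying.
  static void retryInterruptibly() {
    while (!attempt()) {
      try {
        Thread.sleep(1000);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();  // keep the interrupt visible to callers
        return;                              // exit instead of spinning
      }
    }
  }
}
{code}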



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10584) Inconsistency between tableExists and listTables in implementation

2014-02-21 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10584:
-

Attachment: HBASE-10584-trunk_v1.patch

 Inconsistency between tableExists and listTables in implementation
 --

 Key: HBASE-10584
 URL: https://issues.apache.org/jira/browse/HBASE-10584
 Project: HBase
  Issue Type: Bug
  Components: Client, master
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10584-trunk_v1.patch


 # HBaseAdmin.tableExists is implemented by scanning the meta table
 # HBaseAdmin.listTables (and HBaseAdmin.getTableDescriptor) is implemented by 
 talking with HMaster, which responds by querying the FSTableDescriptors, and 
 FSTableDescriptors returns all tables by scanning all the table descriptor 
 files in the FS (a cache also plays a role here, so most of the time the 
 request can be satisfied from the cache)...
 Actually HBaseAdmin internally asks HMaster to check whether a table exists 
 when implementing deleteTable (see below), so why does it use a different way 
 (scanning the meta table) to implement tableExists() for outside users with 
 the same purpose?
 {code}
  tableExists = false;
  GetTableDescriptorsResponse htds;
  MasterKeepAliveConnection master = connection.getKeepAliveMasterService();
  try {
    GetTableDescriptorsRequest req =
        RequestConverter.buildGetTableDescriptorsRequest(tableName);
    htds = master.getTableDescriptors(null, req);
  } catch (ServiceException se) {
    throw ProtobufUtil.getRemoteException(se);
  } finally {
    master.close();
  }
  tableExists = !htds.getTableSchemaList().isEmpty();
 {code}
 (Above, verifying that the table descriptor file is deleted can guarantee that 
 all items of this table are deleted from the meta table...)
 Since creating the table descriptor file and inserting the item into the meta 
 table happen at different times without atomic semantics, this inconsistency 
 in implementation can lead to confusing behavior when create-table or 
 delete-table fails midway: before the corresponding cleanup is done, the table 
 descriptor file may exist while no item exists in the meta table (for 
 create-table, where the table descriptor file is created before the item is 
 inserted into the meta table), which leads to listTables including that table 
 while tableExists says no. A similar inconsistency occurs if delete-table 
 fails midway...
 Confusing behavior can happen during the process even though it eventually 
 succeeds:
 # During table creation, if a user calls listTables and then calls tableExists 
 for this table after the table descriptor is created but before the item is 
 inserted into the meta table, he will find that listTables includes the table 
 while tableExists returns false for that same table. This behavior is 
 confusing and should only be acceptable while a table is being deleted...
 # Similar behavior occurs during table deletion.
 It seems the benefit of implementing tableExists this way is that we avoid 
 talking with HMaster; but considering we already talk with HMaster for 
 listTables and getTableDescriptor, such benefit can't offset the drawback of 
 the inconsistency.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HBASE-10575) ReplicationSource thread can't be terminated if it runs into the loop and fails to contact peer's zk ensemble continuously

2014-02-20 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-10575:


 Summary: ReplicationSource thread can't be terminated if it runs 
into the loop and fails to contact peer's zk ensemble continuously
 Key: HBASE-10575
 URL: https://issues.apache.org/jira/browse/HBASE-10575
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.98.1, 0.99.0, 0.94.17
Reporter: Feng Honghua
Assignee: Feng Honghua
 Fix For: 0.99.0


When the ReplicationSource thread runs into the loop to contact the peer's zk 
ensemble, it doesn't check isActive() before each retry, so if the peer's zk 
ensemble is unreachable for some reason, the ReplicationSource thread can't be 
terminated from the outside, e.g. by removePeer.
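A minimal sketch of the fix idea (the field and method names below are 
assumptions, not the actual ReplicationSource code): check the running flag 
before every retry so a stop request, e.g. from removePeer, can break an 
otherwise endless reconnect loop.
{code}
public class PeerConnectLoopSketch {
  private volatile boolean running = true;       // stands in for ReplicationSource's active flag

  public void stop() { running = false; }        // what removePeer/terminate would flip

  boolean isActive() { return running; }

  void connectToPeerWithRetries() {
    while (isActive()) {                         // re-checked before each attempt
      try {
        connectToPeerZk();                       // may fail while the peer zk ensemble is unreachable
        return;
      } catch (Exception e) {
        if (!isActive()) {
          break;                                 // exit immediately, no pointless final sleep
        }
        try {
          Thread.sleep(1000);                    // back off before the next attempt
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }
  }

  void connectToPeerZk() throws Exception {
    throw new Exception("peer zk ensemble unreachable");  // placeholder for the real connect
  }
}
{code}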



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10575) ReplicationSource thread can't be terminated if it runs into the loop and fails to contact peer's zk ensemble continuously

2014-02-20 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10575:
-

Priority: Critical  (was: Major)

 ReplicationSource thread can't be terminated if it runs into the loop and 
 fails to contact peer's zk ensemble continuously
 --

 Key: HBASE-10575
 URL: https://issues.apache.org/jira/browse/HBASE-10575
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.98.1, 0.99.0, 0.94.17
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Fix For: 0.99.0


 When the ReplicationSource thread runs into the loop to contact the peer's zk 
 ensemble, it doesn't check isActive() before each retry, so if the peer's zk 
 ensemble is unreachable for some reason, the ReplicationSource thread can't be 
 terminated from the outside, e.g. by removePeer.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10575) ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously

2014-02-20 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10575:
-

Summary: ReplicationSource thread can't be terminated if it runs into the 
loop to contact peer's zk ensemble and fails continuously  (was: 
ReplicationSource thread can't be terminated if it runs into the loop and fails 
to contact peer's zk ensemble continuously)

 ReplicationSource thread can't be terminated if it runs into the loop to 
 contact peer's zk ensemble and fails continuously
 --

 Key: HBASE-10575
 URL: https://issues.apache.org/jira/browse/HBASE-10575
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.98.1, 0.99.0, 0.94.17
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Fix For: 0.99.0


 When the ReplicationSource thread runs into the loop to contact the peer's zk 
 ensemble, it doesn't check isActive() before each retry, so if the peer's zk 
 ensemble is unreachable for some reason, the ReplicationSource thread can't be 
 terminated from the outside, e.g. by removePeer.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10575) ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously

2014-02-20 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10575:
-

Attachment: HBASE-10575-trunk_v1.patch

Patch attached for the fix

And two minor changes:
# exit immediately without sleeping if isActive() == false after each failed try
# close this.conn and print the ReplicationSource exiting log for a premature 
thread exit as well

 ReplicationSource thread can't be terminated if it runs into the loop to 
 contact peer's zk ensemble and fails continuously
 --

 Key: HBASE-10575
 URL: https://issues.apache.org/jira/browse/HBASE-10575
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.98.1, 0.99.0, 0.94.17
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Fix For: 0.99.0

 Attachments: HBASE-10575-trunk_v1.patch


 When the ReplicationSource thread runs into the loop to contact the peer's zk 
 ensemble, it doesn't check isActive() before each retry, so if the peer's zk 
 ensemble is unreachable for some reason, the ReplicationSource thread can't be 
 terminated from the outside, e.g. by removePeer.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10575) ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously

2014-02-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906766#comment-13906766
 ] 

Feng Honghua commented on HBASE-10575:
--

Looks like all branches have this same bug. I have checked 0.94, 0.98 and 
0.99...

 ReplicationSource thread can't be terminated if it runs into the loop to 
 contact peer's zk ensemble and fails continuously
 --

 Key: HBASE-10575
 URL: https://issues.apache.org/jira/browse/HBASE-10575
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.98.1, 0.99.0, 0.94.17
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Fix For: 0.99.0

 Attachments: HBASE-10575-trunk_v1.patch


 When the ReplicationSource thread runs into the loop to contact the peer's zk 
 ensemble, it doesn't check isActive() before each retry, so if the peer's zk 
 ensemble is unreachable for some reason, the ReplicationSource thread can't be 
 terminated from the outside, e.g. by removePeer.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10575) ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously

2014-02-20 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10575:
-

Status: Patch Available  (was: Open)

 ReplicationSource thread can't be terminated if it runs into the loop to 
 contact peer's zk ensemble and fails continuously
 --

 Key: HBASE-10575
 URL: https://issues.apache.org/jira/browse/HBASE-10575
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.98.1, 0.99.0, 0.94.17
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Fix For: 0.99.0

 Attachments: HBASE-10575-trunk_v1.patch


 When the ReplicationSource thread runs into the loop to contact the peer's zk 
 ensemble, it doesn't check isActive() before each retry, so if the peer's zk 
 ensemble is unreachable for some reason, the ReplicationSource thread can't be 
 terminated from the outside, e.g. by removePeer.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10575) ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously

2014-02-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906946#comment-13906946
 ] 

Feng Honghua commented on HBASE-10575:
--

Unit tests pass in my local run, and the failed cases look like they have 
nothing to do with the patch... weird

 ReplicationSource thread can't be terminated if it runs into the loop to 
 contact peer's zk ensemble and fails continuously
 --

 Key: HBASE-10575
 URL: https://issues.apache.org/jira/browse/HBASE-10575
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.98.1, 0.99.0, 0.94.17
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Fix For: 0.99.0

 Attachments: HBASE-10575-trunk_v1.patch


 When the ReplicationSource thread runs into the loop to contact the peer's zk 
 ensemble, it doesn't check isActive() before each retry, so if the peer's zk 
 ensemble is unreachable for some reason, the ReplicationSource thread can't be 
 terminated from the outside, e.g. by removePeer.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10497) Correct wrong handling and add proper handling for swallowed InterruptedException thrown by Thread.sleep under HBase-Client/HBase-Server folders systematically

2014-02-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907792#comment-13907792
 ] 

Feng Honghua commented on HBASE-10497:
--

Thank you [~nkeywal] :-)

 Correct wrong handling and add proper handling for swallowed 
 InterruptedException thrown by Thread.sleep under HBase-Client/HBase-Server 
 folders systematically
 ---

 Key: HBASE-10497
 URL: https://issues.apache.org/jira/browse/HBASE-10497
 Project: HBase
  Issue Type: Bug
  Components: Client, regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10497-trunk_v1.patch, HBASE-10497-trunk_v2.patch


 There are two kinds of handling problems for InterruptedException thrown by 
 Thread.sleep in many places under the HBase-Client/HBase-Server folders:
 # Thread.currentThread.interrupt() is called within 'while' loops, which can 
 result in buggy behaviors such as the expected sleep not occurring because the 
 interrupt status was restored in a former iteration
 # InterruptedException thrown by Thread.sleep is swallowed silently, neither 
 declared in the caller method's throws clause nor rethrown immediately (see 
 the sketch below)
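 A sketch of one proper handling for case 2 above (the helper name is an 
 illustrative assumption): restore the interrupt status and surface the 
 condition to the caller as an InterruptedIOException instead of swallowing it.
{code}
import java.io.IOException;
import java.io.InterruptedIOException;

public class SleepHandlingSketch {
  // Sleep, but propagate an interrupt to the caller rather than hiding it.
  static void sleepOrPropagate(long millis) throws IOException {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();   // don't hide the interrupt
      throw (InterruptedIOException) new InterruptedIOException(
          "Interrupted while sleeping " + millis + "ms").initCause(ie);
    }
  }
}
{code}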



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10516) Refactor code where Threads.sleep is called within a while/for loop

2014-02-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907837#comment-13907837
 ] 

Feng Honghua commented on HBASE-10516:
--

[~nkeywal], can this be committed as well? thanks:-)

 Refactor code where Threads.sleep is called within a while/for loop
 ---

 Key: HBASE-10516
 URL: https://issues.apache.org/jira/browse/HBASE-10516
 Project: HBase
  Issue Type: Bug
  Components: Client, master, regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10516-trunk_v1.patch, HBASE-10516-trunk_v2.patch, 
 HBASE-10516-trunk_v3.patch


 Threads.sleep implementation:
 {code}
  public static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      e.printStackTrace();
      Thread.currentThread().interrupt();
    }
  }
 {code}
 From the above implementation, the current thread's interrupt status is 
 restored/reset when InterruptedException is caught and handled. If this 
 method is called within a while/for loop and an InterruptedException is 
 thrown during one sleep, every subsequent Threads.sleep call in the loop 
 will hit InterruptedException again immediately and return without the 
 expected sleep. This behavior breaks the intention of an independent sleep 
 in each iteration.
 I mentioned this in HBASE-10497, and this JIRA was created to handle it 
 separately per [~nkeywal]'s suggestion



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10516) Refactor code where Threads.sleep is called within a while/for loop

2014-02-19 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10516:
-

Attachment: HBASE-10516-trunk_v3.patch

v3 patch attached per [~nkeywal]'s latest feedback on the v2 patch:
# DeleteTableHandler : IOE thrown directly
# AssignmentManager : IE thrown directly (a throws clause is added to its 
containing assign() method, and another RuntimeException in the same method is 
changed to IE. The method one level above the direct caller already throws IE, 
so this change should better match the exception design)
# LruBlockCache : exit at the first IE and interrupt()
# ZooKeeperWatcher : RuntimeException thrown, as the nearby timeout handling does

Ping [~nkeywal] and thanks!

 Refactor code where Threads.sleep is called within a while/for loop
 ---

 Key: HBASE-10516
 URL: https://issues.apache.org/jira/browse/HBASE-10516
 Project: HBase
  Issue Type: Bug
  Components: Client, master, regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10516-trunk_v1.patch, HBASE-10516-trunk_v2.patch, 
 HBASE-10516-trunk_v3.patch


 Threads.sleep implementation:
 {code}
  public static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      e.printStackTrace();
      Thread.currentThread().interrupt();
    }
  }
 {code}
 From the above implementation, the current thread's interrupt status is 
 restored/reset when InterruptedException is caught and handled. If this 
 method is called within a while/for loop and an InterruptedException is 
 thrown during one sleep, every subsequent Threads.sleep call in the loop 
 will hit InterruptedException again immediately and return without the 
 expected sleep. This behavior breaks the intention of an independent sleep 
 in each iteration.
 I mentioned this in HBASE-10497, and this JIRA was created to handle it 
 separately per [~nkeywal]'s suggestion



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10516) Refactor code where Threads.sleep is called within a while/for loop

2014-02-19 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13905077#comment-13905077
 ] 

Feng Honghua commented on HBASE-10516:
--

bq.Nice writeup @nkeywal (and nice commentary Feng Honghua – especially the bit 
about resetting interrupt is useless unless it checked in upper layers). We 
should extract your wisdom and put on dev list? It is good stuff. I can write 
it up in refguide after we are agreed.
Sounds good, [~stack]

 Refactor code where Threads.sleep is called within a while/for loop
 ---

 Key: HBASE-10516
 URL: https://issues.apache.org/jira/browse/HBASE-10516
 Project: HBase
  Issue Type: Bug
  Components: Client, master, regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10516-trunk_v1.patch, HBASE-10516-trunk_v2.patch, 
 HBASE-10516-trunk_v3.patch


 Threads.sleep implementation:
 {code}
  public static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      e.printStackTrace();
      Thread.currentThread().interrupt();
    }
  }
 {code}
 From the above implementation, the current thread's interrupt status is 
 restored/reset when InterruptedException is caught and handled. If this 
 method is called within a while/for loop and an InterruptedException is 
 thrown during one sleep, every subsequent Threads.sleep call in the loop 
 will hit InterruptedException again immediately and return without the 
 expected sleep. This behavior breaks the intention of an independent sleep 
 in each iteration.
 I mentioned this in HBASE-10497, and this JIRA was created to handle it 
 separately per [~nkeywal]'s suggestion



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10516) Refactor code where Threads.sleep is called within a while/for loop

2014-02-19 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13905335#comment-13905335
 ] 

Feng Honghua commented on HBASE-10516:
--

bq.Sorry again the the review in 2 passes
No sorry here, your review is really appreciated:-)
bq.I will write up a small text on interruption for the dev list as well.
That's great!

 Refactor code where Threads.sleep is called within a while/for loop
 ---

 Key: HBASE-10516
 URL: https://issues.apache.org/jira/browse/HBASE-10516
 Project: HBase
  Issue Type: Bug
  Components: Client, master, regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10516-trunk_v1.patch, HBASE-10516-trunk_v2.patch, 
 HBASE-10516-trunk_v3.patch


 Threads.sleep implementation:
 {code}
  public static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      e.printStackTrace();
      Thread.currentThread().interrupt();
    }
  }
 {code}
 From the above implementation, the current thread's interrupt status is 
 restored/reset when InterruptedException is caught and handled. If this 
 method is called within a while/for loop and an InterruptedException is 
 thrown during one sleep, every subsequent Threads.sleep call in the loop 
 will hit InterruptedException again immediately and return without the 
 expected sleep. This behavior breaks the intention of an independent sleep 
 in each iteration.
 I mentioned this in HBASE-10497, and this JIRA was created to handle it 
 separately per [~nkeywal]'s suggestion



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10556) Possible data loss due to non-handled DroppedSnapshotException for user-triggered flush from client/shell

2014-02-19 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13905351#comment-13905351
 ] 

Feng Honghua commented on HBASE-10556:
--

Thanks [~yuzhih...@gmail.com] for the review, and ping [~lhofhansl], [~stack] 
and [~apurtell] for review and another +1, thanks:-)

 Possible data loss due to non-handled DroppedSnapshotException for 
 user-triggered flush from client/shell
 -

 Key: HBASE-10556
 URL: https://issues.apache.org/jira/browse/HBASE-10556
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Critical
 Attachments: HBASE-10556-trunk_v1.patch


 During the code review when investigating HBASE-10499, a possibility of data 
 loss due to a non-handled DroppedSnapshotException for a user-triggered flush 
 was exposed.
 Data loss can happen as below:
 # A flush for some region is triggered via HBaseAdmin or the shell
 # The request reaches the regionserver and eventually HRegion.internalFlushcache 
 is called; it then fails at persisting the memstore's snapshot to an hfile, 
 DroppedSnapshotException is thrown, and the snapshot is left uncleared.
 # DroppedSnapshotException is not handled in HRegion and is just encapsulated 
 as a ServiceException before returning to the client
 # After a while, some new writes are handled and put into the current memstore, 
 then a new flush is triggered for the region because memstoreSize exceeds the 
 flush threshold
 # This second (new) flush succeeds. For the HStore which failed in the previous 
 user-triggered flush, the remaining non-empty snapshot is used rather than a 
 new snapshot made from the current memstore, but the HLog's latest sequenceId 
 is used for the resulting hfiles --- the sequenceId attached to the hfiles 
 claims that all edits up to it have been persisted, but that is not true for 
 the edits still in the existing memstore
 # Now the regionserver hosting this region dies
 # During the replay phase of failover, the edits that were in the memstore but 
 not actually persisted in hfiles when the previous regionserver died will be 
 ignored, since they are deemed persisted when compared with the hfiles' latest 
 sequenceId --- these edits are lost...
 For the second flush, we also can't discard the remaining snapshot and make a 
 new one from the current memstore, because that way the data in the remaining 
 snapshot would be lost. We should abort the regionserver immediately and rely 
 on failover to replay the log for data safety.
 DroppedSnapshotException is correctly handled in MemStoreFlusher for internally 
 triggered flushes (generated by flush-size / rollWriter / periodicFlusher). But 
 a user-triggered flush is processed directly by HRegionServer and HRegion 
 without putting a flush entry into the flushQueue, and hence is not handled by 
 MemStoreFlusher (sketched below)
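 A minimal sketch of the intended handling (the interface and method names below 
 are stand-ins, not the actual patch): the user-triggered flush path should treat 
 DroppedSnapshotException the way MemStoreFlusher does and abort the server, so 
 that WAL replay keeps the data safe.
{code}
import java.io.IOException;
import org.apache.hadoop.hbase.DroppedSnapshotException;

public class UserFlushSketch {
  interface Server { void abort(String why, Throwable cause); }
  interface Region { void flushcache() throws IOException; }

  static void flushForUserRequest(Server server, Region region) throws IOException {
    try {
      region.flushcache();
    } catch (DroppedSnapshotException dse) {
      // The snapshot could not be persisted and is left in place; continuing would let a later
      // flush attach a too-new sequenceId and silently drop edits on failover.
      server.abort("Replay of WAL required. Forcing server shutdown", dse);
    }
  }
}
{code}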



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10521) Add handling for swallowed InterruptedException thrown by Thread.sleep in RpcServer and RpcClient

2014-02-19 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10521:
-

Attachment: HBASE-10521-trunk_v3.patch

v3 patch attached per [~nkeywal]'s latest review feedback on v2, thanks!

 Add handling for swallowed InterruptedException thrown by Thread.sleep in 
 RpcServer and RpcClient
 -

 Key: HBASE-10521
 URL: https://issues.apache.org/jira/browse/HBASE-10521
 Project: HBase
  Issue Type: Sub-task
  Components: IPC/RPC
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor
 Attachments: HBASE-10521-trunk_v1.patch, HBASE-10521-trunk_v2.patch, 
 HBASE-10521-trunk_v3.patch


 A sub-task of HBASE-10497



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10524) Correct wrong handling and add proper handling for swallowed InterruptedException thrown by Thread.sleep in regionserver

2014-02-19 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13905502#comment-13905502
 ] 

Feng Honghua commented on HBASE-10524:
--

Yes, both taskLoop and run already handle InterruptedException (though some 
handle it only for their own sleep); letting getTaskList()/taskLoop() just throw 
the IE out and letting run() handle it does make the code cleaner. Good with me :-)
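A hedged sketch of that shape (the names are assumptions, not the actual worker 
code; only the taskLoop()/run() structure mirrors what is discussed above): the 
inner loop just declares InterruptedException, and run() is the single place 
that catches it and exits cleanly.
{code}
public class WorkerSketch implements Runnable {
  private volatile boolean running = true;

  private void taskLoop() throws InterruptedException {
    while (running) {
      // ... pick up and process one task ...
      Thread.sleep(1000);                  // may throw; no local catch, so no hidden busy loop
    }
  }

  @Override
  public void run() {
    try {
      taskLoop();
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();  // record the interrupt on the way out
    } finally {
      running = false;                     // single, clean exit path
    }
  }
}
{code}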

btw: I made a new patch for HBASE-10521, would you help review it again? 
thanks:-)

 Correct wrong handling and add proper handling for swallowed 
 InterruptedException thrown by Thread.sleep in regionserver
 

 Key: HBASE-10524
 URL: https://issues.apache.org/jira/browse/HBASE-10524
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10524-trunk_v1.patch, HBASE-10524-trunk_v2.patch, 
 split.patch


 A sub-task of HBASE-10497
 # correct wrong handling of InterruptedException where 
 Thread.currentThread.interrupt() is called within while loops
 # add proper handling for swallowed InterruptedException



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10524) Correct wrong handling and add proper handling for swallowed InterruptedException thrown by Thread.sleep in regionserver

2014-02-19 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13905509#comment-13905509
 ] 

Feng Honghua commented on HBASE-10524:
--

btw: the attached patch doesn't contain the change to HRegionServer? Please 
remember to include that when committing, or should I make a new one?

 Correct wrong handling and add proper handling for swallowed 
 InterruptedException thrown by Thread.sleep in regionserver
 

 Key: HBASE-10524
 URL: https://issues.apache.org/jira/browse/HBASE-10524
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-10524-trunk_v1.patch, HBASE-10524-trunk_v2.patch, 
 split.patch


 A sub-task of HBASE-10497
 # correct wrong handling of InterruptedException where 
 Thread.currentThread.interrupt() is called within while loops
 # add proper handling for swallowed InterruptedException



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10521) Add handling for swallowed InterruptedException thrown by Thread.sleep in RpcServer and RpcClient

2014-02-19 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906538#comment-13906538
 ] 

Feng Honghua commented on HBASE-10521:
--

I also reran the unit tests and they all pass locally

 Add handling for swallowed InterruptedException thrown by Thread.sleep in 
 RpcServer and RpcClient
 -

 Key: HBASE-10521
 URL: https://issues.apache.org/jira/browse/HBASE-10521
 Project: HBase
  Issue Type: Sub-task
  Components: IPC/RPC
Reporter: Feng Honghua
Assignee: Feng Honghua
Priority: Minor
 Attachments: HBASE-10521-trunk_v1.patch, HBASE-10521-trunk_v2.patch, 
 HBASE-10521-trunk_v3.patch, HBASE-10521-trunk_v3.patch


 A sub-task of HBASE-10497



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

