[jira] Updated: (HBASE-2599) BaseScanner says "Current assignment of X is not valid" over and over for same region

stack (JIRA) Thu, 27 May 2010 16:22:06 -0700

     [ https://issues.apache.org/jira/browse/HBASE-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


stack updated HBASE-2599:
-------------------------

    Attachment: 2599-trunk.txt

Version for trunk that includes Todd's suggested changes.  Will apply soon.

> BaseScanner says "Current assignment of X is not valid" over and over for 
> same region
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-2599
>                 URL: https://issues.apache.org/jira/browse/HBASE-2599
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>         Attachments: 2599-0.20.txt, 2599-trunk.txt
>
>
> From IRC today
> {code}
> 12:41 < cmorgan> hey guys. I'm having a recent  issue with a single node 
> cluster running 0.20.4. After stopping for a backup I now get region 
> assignment churn. Seems master keeps thinking that region
>                  assignment is not valid even when it is. Following is a log 
> snippet:
> 12:41 < cmorgan> [21/05/10 00:59:42] 3443246 [        HMaster] DEBUG 
> ter.RegionServerOperationQueue  - Processing todo: PendingOpenOperation from 
> localhost.,7802,1274425405680
> 12:41 < cmorgan> [21/05/10 00:59:42] 3443246 [        HMaster] INFO  
> e.master.RegionServerOperation  - 
> net_troove_coin_account_AccountCredentials,,1234913258116 open on 
> 127.0.0.1:7802
> 12:41 < cmorgan> [21/05/10 00:59:42] 3443246 [        HMaster] INFO  
> e.master.RegionServerOperation  - Updated row 
> net_troove_coin_account_AccountCredentials,,1234913258116 in region .META.,,1 
> with
>                  startcode=1274425405680, server=127.0.0.1:7802
> 12:41 < cmorgan> [21/05/10 00:59:42] 3443246 [        HMaster] DEBUG 
> ter.RegionServerOperationQueue  - Processing todo: PendingOpenOperation from 
> localhost.,7802,1274425405680
> 12:41 < cmorgan> [21/05/10 00:59:42] 3443246 [        HMaster] INFO  
> e.master.RegionServerOperation  - 
> net_troove_application_request_TemporaryRequest,,1234913268355 open on 
> 127.0.0.1:7802
> 12:41 < cmorgan> [21/05/10 00:59:42] 3443247 [        HMaster] INFO  
> e.master.RegionServerOperation  - Updated row 
> net_troove_application_request_TemporaryRequest,,1234913268355 in region 
> .META.,,1 with
>                  startcode=1274425405680, server=127.0.0.1:7802
> 12:41 < cmorgan> [21/05/10 00:59:42] 3443247 [ger.metaScanner] DEBUG 
> adoop.hbase.master.BaseScanner  - Current assignment of 
> net_troove_coin_account_AccountEntry,,1271448856984 is not valid;
>                  serverAddress=127.0.0.1:7802, startCode=1274425405680 
> unknown.
> 12:41 < cmorgan> [21/05/10 00:59:42] 3443248 [ger.metaScanner] DEBUG 
> adoop.hbase.master.BaseScanner  - Current assignment of 
> net_troove_coin_account_AccountEntry-Base_EntryDay_DESCENDING,,1273266418876
>                  is not valid;  serverAddress=127.0.0.1:7802, 
> startCode=1274425405680 unknown.
> 12:41 < cmorgan> [21/05/10 00:59:42] 3443251 [ger.metaScanner] DEBUG 
> adoop.hbase.master.BaseScanner  - Current assignment of 
> net_troove_coin_bank_BankStatement,,1266433980935 is not valid;
>                  serverAddress=127.0.0.1:7802, startCode=1274425405680 
> unknown.
> 12:58 < cmorgan> stack: I'd been running with 0.20.4 for a week or so 
> starting/stopping every night. Now this happens...
> 14:11 < cmorgan> stack: some more info: On our mini production server the 
> regionserver is getting "My address is localhost.:7802" (notice the dot after 
> localhost). But the master is also sometimes
>                  referring to it as 127.0.0.1. I just used the same data and 
> config on my laptop, and its binding to my external LAN ip ("My address is 
> 10.0.1.4:7802"). Under this setup hbase comes up
>                  stable (no region assignment churn).
> {code}
> Looking at this, I think the issue is that when we register a server we use 
> a getServerName on an HServerInfo provided by the regionserver (though we are 
> on the master side), but BaseScanner uses a getServerName that is made by 
> doing a DNS lookup using the IP that it finds in the server column of .META.  
> My sense is that the regionserver hostname and what the master finds when it 
> does a lookup against DNS can disagree, fatally.
> This issue seems popular over the last few weeks.  It was reported at least 
> once more on a standalone instance and also on krispykola's 15-node EC2 
> cluster (he went back to 0.20.3 and then it went away?).  It made for what 
> looked like double-assignment in his case (our attempt at caching DNS names 
> may be amiss -- I think that is the main diff between 0.20.3 and 0.20.4 in 
> this area).
> My thought is to purge DNS from the HServerInfo passed by the RS to Master on 
> startup and heartbeating and to use IPs only (and even then, the IP that the 
> master tells the RS to use, its remote address as seen by the master).  We 
> might have to do this fix for 0.20.5 since it seems to happen more in 0.20.4.
> I'm looking into this.  Opinions welcome.
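
The mismatch described in the issue can be illustrated with a small, self-contained Java sketch. This is not HBase code: the class ServerNameMismatch and the helpers serverName and assignmentValid are hypothetical names invented for illustration, and the host,port,startcode name format is assumed from the log lines quoted above. It only shows why a lookup keyed on one spelling of the host fails when the other side rebuilds the key from a different spelling, and why keying both sides on the IP seen by the master avoids that.

```java
import java.util.HashSet;
import java.util.Set;

class ServerNameMismatch {
    // Name format assumed from the log lines above: host,port,startcode
    static String serverName(String host, int port, long startCode) {
        return host + "," + port + "," + startCode;
    }

    // Hypothetical stand-in for the scanner's check: an assignment is
    // treated as valid only if the rebuilt name matches a registered one.
    static boolean assignmentValid(Set<String> liveServers, String host,
                                   int port, long startCode) {
        return liveServers.contains(serverName(host, port, startCode));
    }

    public static void main(String[] args) {
        int port = 7802;
        long startCode = 1274425405680L;

        // Registration keyed by the hostname the regionserver reported.
        Set<String> byHostname = new HashSet<>();
        byHostname.add(serverName("localhost.", port, startCode));
        // The scanner rebuilds the name from the address found in .META.
        System.out.println(
            assignmentValid(byHostname, "127.0.0.1", port, startCode)); // false

        // Proposed direction: both sides key on the IP the master sees.
        Set<String> byIp = new HashSet<>();
        byIp.add(serverName("127.0.0.1", port, startCode));
        System.out.println(
            assignmentValid(byIp, "127.0.0.1", port, startCode)); // true
    }
}
```

With the hostname-keyed set, the same server (same port and startcode) is reported as unknown on every scan, which matches the repeated "is not valid" messages in the log. How the real master derives and caches these names is not shown here.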

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
