[ 
https://issues.apache.org/jira/browse/HBASE-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861702#comment-13861702
 ] 

Jean-Daniel Cryans commented on HBASE-10271:
--------------------------------------------

bq. In the chore, it's better to iterate the entrySet; both more efficient than 
gets, and more correct because someone can remove the entry while it's being 
processed.

You're right.
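
To make the point concrete, here is a minimal sketch of a chore pass that walks the entry set instead of doing a get per key. It is not the code from the patch: HeartbeatTracker, recordReport and expireStale are hypothetical names, and the map of last-report timestamps is an assumption about how the tracking could be stored. Each entry carries its key and value together, so a concurrent removal cannot leave the loop holding a null value, and the conditional remove(key, value) leaves alone a server that reported again in the meantime.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the master's heartbeat bookkeeping; not HBase code.
class HeartbeatTracker {
    // Server name -> timestamp (ms) of the last report received from it.
    private final ConcurrentHashMap<String, Long> lastReport = new ConcurrentHashMap<>();

    void recordReport(String server, long nowMs) {
        lastReport.put(server, nowMs);
    }

    // One chore run: a single pass over the entry set, no per-key get().
    // Returns the servers whose last report is older than timeoutMs.
    List<String> expireStale(long nowMs, long timeoutMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastReport.entrySet()) {
            Long seen = e.getValue();
            if (nowMs - seen > timeoutMs) {
                // Remove only if the timestamp is still the stale one we saw,
                // so a report that arrived during the pass is not discarded.
                if (lastReport.remove(e.getKey(), seen)) {
                    expired.add(e.getKey());
                }
            }
        }
        return expired;
    }
}
```

ConcurrentHashMap iterators are weakly consistent, so the pass tolerates entries being added or removed by other threads while it runs.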

bq. Is EMPTY_SERVERLOAD itself still needed?

It seems to be used only in the unit tests. I can look into it today if 
there's interest in the patch.

bq. Could there be a race with znode expiry?

expireServer is synchronized so the second call would be a no-op.
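
As an illustration of why the second call is harmless, here is a small sketch of the pattern. ServerRegistry is a hypothetical class, not the real ServerManager, and the only thing it assumes is that expiry removes the server from the online set under the same lock:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; the real expireServer does far more than this.
class ServerRegistry {
    private final Set<String> online = new HashSet<>();
    private int expirations = 0;

    synchronized void register(String server) {
        online.add(server);
    }

    // Returns true only for the call that actually expired the server.
    synchronized boolean expireServer(String server) {
        if (!online.remove(server)) {
            return false; // already expired by the other path: no-op
        }
        expirations++; // stand-in for the real expiry work
        return true;
    }

    synchronized int expirationCount() {
        return expirations;
    }
}
```

Whichever path gets there first (the znode expiry or the chore) does the work; the other one finds the server already gone and returns.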

bq. Would it make sense to do the check only for znode-less servers?

Haven't thought about it, but right now the solution is simple and hopefully 
harmless.

bq. Is this patch safe w.r.t. network blips? What if the network is out for 10s?

I think you are misreading the patch: 10 seconds is the period that the Chore 
uses, but the timeout for the RS is the same as the ZK session timeout.
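
The difference between the two numbers can be sketched as follows. ChoreTiming and its methods are hypothetical, and the 90-second timeout in the usage note is only an assumed value for the example, not something taken from the patch:

```java
// Hypothetical sketch of period vs. timeout; not code from the patch.
class ChoreTiming {
    // The chore runs at 0, periodMs, 2*periodMs, ... Returns the time of the
    // first run that sees the server as stale, given its last report time.
    static long firstDetection(long lastReportMs, long periodMs, long timeoutMs) {
        long tick = 0;
        while (tick - lastReportMs <= timeoutMs) {
            tick += periodMs;
        }
        return tick;
    }

    // True if any chore run during the outage would expire the server,
    // assuming its last report arrived at outageStartMs.
    static boolean expiredDuringOutage(long outageStartMs, long outageMs,
                                       long periodMs, long timeoutMs) {
        long outageEndMs = outageStartMs + outageMs;
        for (long tick = 0; tick <= outageEndMs; tick += periodMs) {
            if (tick - outageStartMs > timeoutMs) {
                return true;
            }
        }
        return false;
    }
}
```

With a 10-second period and an assumed 90-second timeout, expiredDuringOutage(0, 10_000, 10_000, 90_000) is false, so a 10-second blip does not expire anything. The period only bounds how late detection can be: between the timeout and the timeout plus one period after the last report.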

bq. The master will declare the RS dead, then the network is back, RS will 
still serve the regions, but master will try to reassign.
Is that handled correctly before (or even after) ZK detects the problem?

I don't think that it's different from a GC-induced ZK session timeout where 
the region server comes back and is still serving regions but the master moved 
them elsewhere.

> [regression] Cannot use the wildcard address since HBASE-9593
> -------------------------------------------------------------
>
>                 Key: HBASE-10271
>                 URL: https://issues.apache.org/jira/browse/HBASE-10271
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.0, 0.94.13, 0.96.1
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.98.0, 0.94.16, 0.96.2, 0.99.0
>
>         Attachments: HBASE-10271.patch
>
>
> HBASE-9593 moved the creation of the ephemeral znode earlier in the region 
> server startup process such that we don't have access to the ServerName from 
> the Master's POV. HRS.getMyEphemeralNodePath() calls HRS.getServerName() 
> which at that point will return this.isa.getHostName(). If you set 
> hbase.regionserver.ipc.address to 0.0.0.0, you will create a znode with that 
> address.
> What happens next is that the RS will report for duty correctly but the 
> master will do this:
> {noformat}
> 2014-01-02 11:45:49,498 INFO  [master:172.21.3.117:60000] 
> master.ServerManager: Registering server=0:0:0:0:0:0:0:0%0,60020,1388691892014
> 2014-01-02 11:45:49,498 INFO  [master:172.21.3.117:60000] master.HMaster: 
> Registered server found up in zk but who has not yet reported in: 
> 0:0:0:0:0:0:0:0%0,60020,1388691892014
> {noformat}
> The cluster is then unusable.
> I think a better solution is to track the heartbeats for the region servers 
> and expire those that haven't checked-in for some time. The 0.89-fb branch 
> has this concept, and they also use it to detect rack failures: 
> https://github.com/apache/hbase/blob/0.89-fb/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java#L1224.
>  In this jira's scope I would just add the heartbeat tracking and add a unit 
> test for the wildcard address.
> What do you think [~rajesh23]?



