[jira] [Commented] (HBASE-25251) Enable configuration based enable/disable of Unsafe package usage

2020-11-05 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227096#comment-17227096
 ] 

Sandeep Guggilam commented on HBASE-25251:
--

FYI [~apurtell]

> Enable configuration based enable/disable of Unsafe package usage
> -
>
> Key: HBASE-25251
> URL: https://issues.apache.org/jira/browse/HBASE-25251
> Project: HBase
>  Issue Type: Improvement
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
>
> We need to provide a way for clients to disable Unsafe package usage. 
> Currently there is no way for clients to specify that they don't want to use 
> Unsafe-based conversion in Bytes.
> As a result, there can be issues with missing Unsafe methods when the client 
> is on JDK 11, so clients should be able to disable Unsafe usage and fall 
> back to normal conversion if they want to.
> Also, we use static references to Unsafe availability in the Bytes class, 
> assuming that the availability is set during class loading and can never be 
> overridden later. Now that we plan to expose a util for clients to override 
> the availability if required, we need to avoid the static references and 
> compute the availability whenever we do the comparisons.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25251) Enable configuration based enable/disable of Unsafe package usage

2020-11-05 Thread Sandeep Guggilam (Jira)
Sandeep Guggilam created HBASE-25251:


 Summary: Enable configuration based enable/disable of Unsafe 
package usage
 Key: HBASE-25251
 URL: https://issues.apache.org/jira/browse/HBASE-25251
 Project: HBase
  Issue Type: Improvement
Reporter: Sandeep Guggilam
Assignee: Sandeep Guggilam


We need to provide a way for clients to disable Unsafe package usage. Currently 
there is no way for clients to specify that they don't want to use Unsafe-based 
conversion in Bytes.

As a result, there can be issues with missing Unsafe methods when the client is 
on JDK 11, so clients should be able to disable Unsafe usage and fall back to 
normal conversion if they want to.

Also, we use static references to Unsafe availability in the Bytes class, 
assuming that the availability is set during class loading and can never be 
overridden later. Now that we plan to expose a util for clients to override the 
availability if required, we need to avoid the static references and compute 
the availability whenever we do the comparisons.
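
A minimal sketch of what such a toggle could look like; the class and method 
names below are illustrative stand-ins, not the actual patch:

{code:java}
// Illustrative sketch only: availability is detected at class load but kept in
// a volatile field (not a static final), so a client-facing util can override it.
public final class UnsafeAvailabilitySketch {
  private static volatile boolean unsafeAvail = detectUnsafe();

  private static boolean detectUnsafe() {
    try {
      Class.forName("sun.misc.Unsafe");
      return true;
    } catch (Throwable t) {
      return false; // e.g. class or method missing on newer JDKs
    }
  }

  /** Hypothetical util for clients that want to force normal conversion. */
  public static void disableUnsafe() {
    unsafeAvail = false;
  }

  /** Re-checked on every comparison instead of being frozen in a static final. */
  public static boolean unsafeAvailable() {
    return unsafeAvail;
  }
}
{code}

Bytes.compareTo() would then pick the Unsafe-backed or pure-Java comparer per 
call based on unsafeAvailable() rather than a value captured at class load.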



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25032) Wait for region server to become online before adding it to online servers in Master

2020-10-29 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223221#comment-17223221
 ] 

Sandeep Guggilam commented on HBASE-25032:
--

[~anoop.hbase] [~apurtell] I spent some time looking at the code today. One 
thing I noticed is that we abort the RS, by throwing an exception, in case of 
any issue with the replication setup with the peer during the 
[startup|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L1964]
 of the RS: 
https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/Replication.java#L241

So it looks like the current design already treats some aspects of setting up 
replication as important and aborts the RS if it is not set up properly, as 
opposed to our idea of letting the RS accept requests even if replication fails 
in an async thread.

Should we consider delaying adding this RS to availableServers on the Master 
until it is actually ready to accept requests?

Thoughts?

> Wait for region server to become online before adding it to online servers in 
> Master
> 
>
> Key: HBASE-25032
> URL: https://issues.apache.org/jira/browse/HBASE-25032
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
>
> As part of RS startup, the RS reports for duty to the Master. The Master 
> acknowledges the request and adds it to the onlineServers list for further 
> assignment of regions to the RS.
> Once the Master acknowledges the reportForDuty and sends back the response, 
> the RS performs several steps, such as initializing replication sources, 
> before becoming online. However, sometimes there can be an issue with 
> initializing replication sources when the RS is unable to connect to peer 
> clusters because of some Kerberos configuration, and there can be a delay of 
> around 20 minutes in becoming online.
>  
> Since the Master considers it online, it tries to assign regions, which fails 
> with a ServerNotRunningYetException; the Master then tries to unassign, which 
> again fails with the same exception, leaving the region in FAILED_CLOSE state.
>  
> It would be good to have a check that the RS is ready to accept assignment 
> requests before adding it to the online servers list, which would account for 
> any such delays as described above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24972) Wait for connection attempt to succeed before performing operations on ZK

2020-10-26 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220891#comment-17220891
 ] 

Sandeep Guggilam commented on HBASE-24972:
--

[~pratg] Sure, feel free to pick this one.

> Wait for connection attempt to succeed before performing operations on ZK
> -
>
> Key: HBASE-24972
> URL: https://issues.apache.org/jira/browse/HBASE-24972
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Priority: Minor
>
> Creating the connection with ZK is asynchronous; the passed-in watcher is 
> notified of the successful connection event. When we attempt any operation, 
> we create a connection and then perform a read/write 
> (https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java#L323) 
> without actually waiting for the notification event 
> (https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKWatcher.java#L582).
>  
> It is possible to get ConnectionLoss errors when we perform operations on ZK 
> without waiting for the connection attempt to succeed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HBASE-24972) Wait for connection attempt to succeed before performing operations on ZK

2020-10-15 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam reassigned HBASE-24972:


Assignee: (was: Sandeep Guggilam)

> Wait for connection attempt to succeed before performing operations on ZK
> -
>
> Key: HBASE-24972
> URL: https://issues.apache.org/jira/browse/HBASE-24972
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Priority: Minor
>
> Creating the connection with ZK is asynchronous; the passed-in watcher is 
> notified of the successful connection event. When we attempt any operation, 
> we create a connection and then perform a read/write 
> (https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java#L323) 
> without actually waiting for the notification event 
> (https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKWatcher.java#L582).
>  
> It is possible to get ConnectionLoss errors when we perform operations on ZK 
> without waiting for the connection attempt to succeed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HBASE-25130) Masters in-memory serverHoldings map is not cleared during hbck repair

2020-10-15 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam reassigned HBASE-25130:


Assignee: (was: Sandeep Guggilam)

> Masters in-memory serverHoldings map is not cleared during hbck repair
> --
>
> Key: HBASE-25130
> URL: https://issues.apache.org/jira/browse/HBASE-25130
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Priority: Major
>
> In case of repairing overlaps, hbck essentially calls the closeRegion RPC on 
> the RS, followed by the offline RPC on the Master, to offline all the 
> overlapping regions that will be merged into a new region.
> However, the offline RPC doesn't remove the region from the serverHoldings 
> map unless the new state is MERGED/SPLIT 
> (https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java#L719), 
> but the new state in this case is OFFLINE.
> This is actually intended to match the META entries; the entry would be 
> removed later when the region comes online on a different server. However, in 
> our case the region will never come online on a new server, so the region 
> info is never cleared from the map, which the balancer and SCP then use for 
> incorrect reassignment.
> We might need to tackle this by removing the entries from the map when hbck 
> actually deletes the META entries for the region, which keeps the in-memory 
> map consistent with the META state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25130) Masters in-memory serverHoldings map is not cleared during hbck repair

2020-10-15 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215200#comment-17215200
 ] 

Sandeep Guggilam commented on HBASE-25130:
--

[~apurtell] I didn't get time to go over the full workflow and design a 
solution for this. I have two other JIRAs in my queue and might pick this up 
only after they are completed. Sorry for the delay. Let me unassign it so 
others can take it. Will pick it up again if it remains unassigned once I am 
done with the other JIRAs.

> Masters in-memory serverHoldings map is not cleared during hbck repair
> --
>
> Key: HBASE-25130
> URL: https://issues.apache.org/jira/browse/HBASE-25130
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
>
> In case of repairing overlaps, hbck essentially calls the closeRegion RPC on 
> the RS, followed by the offline RPC on the Master, to offline all the 
> overlapping regions that will be merged into a new region.
> However, the offline RPC doesn't remove the region from the serverHoldings 
> map unless the new state is MERGED/SPLIT 
> (https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java#L719), 
> but the new state in this case is OFFLINE.
> This is actually intended to match the META entries; the entry would be 
> removed later when the region comes online on a different server. However, in 
> our case the region will never come online on a new server, so the region 
> info is never cleared from the map, which the balancer and SCP then use for 
> incorrect reassignment.
> We might need to tackle this by removing the entries from the map when hbck 
> actually deletes the META entries for the region, which keeps the in-memory 
> map consistent with the META state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25032) Wait for region server to become online before adding it to online servers in Master

2020-10-15 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215197#comment-17215197
 ] 

Sandeep Guggilam commented on HBASE-25032:
--

Sure [~apurtell]. I am currently working on submitting a patch for another 
Jira, https://issues.apache.org/jira/browse/HBASE-24768.

Will pick this Jira up after that.

> Wait for region server to become online before adding it to online servers in 
> Master
> 
>
> Key: HBASE-25032
> URL: https://issues.apache.org/jira/browse/HBASE-25032
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
>
> As part of RS startup, the RS reports for duty to the Master. The Master 
> acknowledges the request and adds it to the onlineServers list for further 
> assignment of regions to the RS.
> Once the Master acknowledges the reportForDuty and sends back the response, 
> the RS performs several steps, such as initializing replication sources, 
> before becoming online. However, sometimes there can be an issue with 
> initializing replication sources when the RS is unable to connect to peer 
> clusters because of some Kerberos configuration, and there can be a delay of 
> around 20 minutes in becoming online.
>  
> Since the Master considers it online, it tries to assign regions, which fails 
> with a ServerNotRunningYetException; the Master then tries to unassign, which 
> again fails with the same exception, leaving the region in FAILED_CLOSE state.
>  
> It would be good to have a check that the RS is ready to accept assignment 
> requests before adding it to the online servers list, which would account for 
> any such delays as described above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25130) Masters in-memory serverHoldings map is not cleared during hbck repair

2020-09-30 Thread Sandeep Guggilam (Jira)
Sandeep Guggilam created HBASE-25130:


 Summary: Masters in-memory serverHoldings map is not cleared 
during hbck repair
 Key: HBASE-25130
 URL: https://issues.apache.org/jira/browse/HBASE-25130
 Project: HBase
  Issue Type: Bug
Reporter: Sandeep Guggilam
Assignee: Sandeep Guggilam


In case of repairing overlaps, hbck essentially calls the closeRegion RPC on 
the RS, followed by the offline RPC on the Master, to offline all the 
overlapping regions that will be merged into a new region.

However, the offline RPC doesn't remove the region from the serverHoldings map 
unless the new state is MERGED/SPLIT 
(https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java#L719), 
but the new state in this case is OFFLINE.

This is actually intended to match the META entries; the entry would be removed 
later when the region comes online on a different server. However, in our case 
the region will never come online on a new server, so the region info is never 
cleared from the map, which the balancer and SCP then use for incorrect 
reassignment.

We might need to tackle this by removing the entries from the map when hbck 
actually deletes the META entries for the region, which keeps the in-memory map 
consistent with the META state.
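
To make the asymmetry concrete, here is a small self-contained sketch; the 
names are hypothetical stand-ins, not the actual RegionStates code:

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Self-contained sketch (hypothetical names) of the behavior described above.
public class ServerHoldingsSketch {
  enum State { OFFLINE, MERGED, SPLIT }

  static final Map<String, Set<String>> serverHoldings = new HashMap<>();

  static void regionOffline(String server, String region, State newState) {
    // Mirrors the described branch-1 logic: only MERGED/SPLIT prune the map.
    if (newState == State.MERGED || newState == State.SPLIT) {
      serverHoldings.getOrDefault(server, new HashSet<>()).remove(region);
    }
    // With newState == OFFLINE (the hbck repair case) the entry stays behind.
  }

  public static void main(String[] args) {
    serverHoldings.computeIfAbsent("rs1", s -> new HashSet<>()).add("regionA");
    regionOffline("rs1", "regionA", State.OFFLINE);
    // Stale entry remains, visible to the balancer/SCP for reassignment:
    System.out.println(serverHoldings); // {rs1=[regionA]}
  }
}
{code}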



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25130) Masters in-memory serverHoldings map is not cleared during hbck repair

2020-09-30 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam updated HBASE-25130:
-
Description: 
In case of repairing overlaps, hbck essentially calls the closeRegion RPC on 
the RS, followed by the offline RPC on the Master, to offline all the 
overlapping regions that will be merged into a new region.

However, the offline RPC doesn't remove the region from the serverHoldings map 
unless the new state is MERGED/SPLIT 
(https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java#L719), 
but the new state in this case is OFFLINE.

This is actually intended to match the META entries; the entry would be removed 
later when the region comes online on a different server. However, in our case 
the region will never come online on a new server, so the region info is never 
cleared from the map, which the balancer and SCP then use for incorrect 
reassignment.

We might need to tackle this by removing the entries from the map when hbck 
actually deletes the META entries for the region, which keeps the in-memory map 
consistent with the META state.

  was:
Incase of repairing overlaps, bbck essentially calls the closeRegion RPC on RS 
followed by offline RPC on Master to offline all the overlap regions that would 
be merged into a new region.

However the offline RPC doesn’t remove it from the serverHoldings map unless 
the new state is MERGED/SPLIT 
(https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java#L719) 
but the new state in this case is OFFLINE.

This is actually intended to match with the META entries and would be removed 
later when the region is online on a different server. However, in our case, 
the region would never be online on a new server, hence the region info is 
never cleared from the map that is used by balancer and SCP for incorrect 
reeassignment.

We might need to tackle this by removing the entries from the map when hbck 
actually deletes the meta entries for this region which kind of matches the 
in-memory map’s expectation with the META state.


> Masters in-memory serverHoldings map is not cleared during hbck repair
> --
>
> Key: HBASE-25130
> URL: https://issues.apache.org/jira/browse/HBASE-25130
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
>
> In case of repairing overlaps, hbck essentially calls the closeRegion RPC on 
> the RS, followed by the offline RPC on the Master, to offline all the 
> overlapping regions that will be merged into a new region.
> However, the offline RPC doesn't remove the region from the serverHoldings 
> map unless the new state is MERGED/SPLIT 
> (https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java#L719), 
> but the new state in this case is OFFLINE.
> This is actually intended to match the META entries; the entry would be 
> removed later when the region comes online on a different server. However, in 
> our case the region will never come online on a new server, so the region 
> info is never cleared from the map, which the balancer and SCP then use for 
> incorrect reassignment.
> We might need to tackle this by removing the entries from the map when hbck 
> actually deletes the META entries for the region, which keeps the in-memory 
> map consistent with the META state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25130) Masters in-memory serverHoldings map is not cleared during hbck repair

2020-09-30 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17205161#comment-17205161
 ] 

Sandeep Guggilam commented on HBASE-25130:
--

FYI [~apurtell]

> Masters in-memory serverHoldings map is not cleared during hbck repair
> --
>
> Key: HBASE-25130
> URL: https://issues.apache.org/jira/browse/HBASE-25130
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
>
> In case of repairing overlaps, hbck essentially calls the closeRegion RPC on 
> the RS, followed by the offline RPC on the Master, to offline all the 
> overlapping regions that will be merged into a new region.
> However, the offline RPC doesn't remove the region from the serverHoldings 
> map unless the new state is MERGED/SPLIT 
> (https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java#L719), 
> but the new state in this case is OFFLINE.
> This is actually intended to match the META entries; the entry would be 
> removed later when the region comes online on a different server. However, in 
> our case the region will never come online on a new server, so the region 
> info is never cleared from the map, which the balancer and SCP then use for 
> incorrect reassignment.
> We might need to tackle this by removing the entries from the map when hbck 
> actually deletes the META entries for the region, which keeps the in-memory 
> map consistent with the META state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25032) Wait for region server to become online before adding it to online servers in Master

2020-09-29 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204410#comment-17204410
 ] 

Sandeep Guggilam commented on HBASE-25032:
--

That's a good idea [~anoop.hbase]. We are proceeding with making the server 
online anyway when initialization of the replication peer fails after all the 
retries.

Instead, we could initialize it in an async thread and make the RS online to 
accept requests. Let me go over the code again and see if there is any other 
issue.

> Wait for region server to become online before adding it to online servers in 
> Master
> 
>
> Key: HBASE-25032
> URL: https://issues.apache.org/jira/browse/HBASE-25032
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
>
> As part of RS startup, the RS reports for duty to the Master. The Master 
> acknowledges the request and adds it to the onlineServers list for further 
> assignment of regions to the RS.
> Once the Master acknowledges the reportForDuty and sends back the response, 
> the RS performs several steps, such as initializing replication sources, 
> before becoming online. However, sometimes there can be an issue with 
> initializing replication sources when the RS is unable to connect to peer 
> clusters because of some Kerberos configuration, and there can be a delay of 
> around 20 minutes in becoming online.
>  
> Since the Master considers it online, it tries to assign regions, which fails 
> with a ServerNotRunningYetException; the Master then tries to unassign, which 
> again fails with the same exception, leaving the region in FAILED_CLOSE state.
>  
> It would be good to have a check that the RS is ready to accept assignment 
> requests before adding it to the online servers list, which would account for 
> any such delays as described above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25032) Wait for region server to become online before considering it online in Master

2020-09-15 Thread Sandeep Guggilam (Jira)
Sandeep Guggilam created HBASE-25032:


 Summary: Wait for region server to become online before 
considering it online in Master
 Key: HBASE-25032
 URL: https://issues.apache.org/jira/browse/HBASE-25032
 Project: HBase
  Issue Type: Bug
Reporter: Sandeep Guggilam
Assignee: Sandeep Guggilam


As part of RS startup, the RS reports for duty to the Master. The Master 
acknowledges the request and adds it to the onlineServers list for further 
assignment of regions to the RS.

Once the Master acknowledges the reportForDuty and sends back the response, the 
RS performs several steps, such as initializing replication sources, before 
becoming online. However, sometimes there can be an issue with initializing 
replication sources when the RS is unable to connect to peer clusters because 
of some Kerberos configuration, and there can be a delay of around 20 minutes 
in becoming online.

Since the Master considers it online, it tries to assign regions, which fails 
with a ServerNotRunningYetException; the Master then tries to unassign, which 
again fails with the same exception, leaving the region in FAILED_CLOSE state.

It would be good to have a check that the RS is ready to accept assignment 
requests before adding it to the online servers list, which would account for 
any such delays as described above.
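
To make the proposed check concrete, here is a minimal, self-contained sketch; 
the readiness probe and registry below are hypothetical stand-ins for the 
Master/RS RPC surface, not the actual ServerManager API:

{code:java}
// Illustrative sketch only; the probe and set below are hypothetical stand-ins.
public class OnlineCheckSketch {
  interface ReadinessProbe { boolean isReadyToAcceptRequests(); }

  /** Poll the RS for readiness before adding it to the online-servers list. */
  static boolean registerWhenReady(ReadinessProbe rs,
      java.util.Set<String> onlineServers, String serverName, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (rs.isReadyToAcceptRequests()) {
        onlineServers.add(serverName); // eligible for assignment only now
        return true;
      }
      Thread.sleep(1_000); // ride out e.g. slow replication-source init
    }
    return false; // never added; avoids ServerNotRunningYetException churn
  }
}
{code}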



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25032) Wait for region server to become online before considering it online in Master

2020-09-15 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196506#comment-17196506
 ] 

Sandeep Guggilam commented on HBASE-25032:
--

FYI [~apurtell]

> Wait for region server to become online before considering it online in Master
> --
>
> Key: HBASE-25032
> URL: https://issues.apache.org/jira/browse/HBASE-25032
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
>
> As part of RS startup, the RS reports for duty to the Master. The Master 
> acknowledges the request and adds it to the onlineServers list for further 
> assignment of regions to the RS.
> Once the Master acknowledges the reportForDuty and sends back the response, 
> the RS performs several steps, such as initializing replication sources, 
> before becoming online. However, sometimes there can be an issue with 
> initializing replication sources when the RS is unable to connect to peer 
> clusters because of some Kerberos configuration, and there can be a delay of 
> around 20 minutes in becoming online.
>  
> Since the Master considers it online, it tries to assign regions, which fails 
> with a ServerNotRunningYetException; the Master then tries to unassign, which 
> again fails with the same exception, leaving the region in FAILED_CLOSE state.
>  
> It would be good to have a check that the RS is ready to accept assignment 
> requests before adding it to the online servers list, which would account for 
> any such delays as described above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-25032) Wait for region server to become online before adding it to online servers in Master

2020-09-15 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam updated HBASE-25032:
-
Summary: Wait for region server to become online before adding it to online 
servers in Master  (was: Wait for region server to become online before 
considering it online in Master)

> Wait for region server to become online before adding it to online servers in 
> Master
> 
>
> Key: HBASE-25032
> URL: https://issues.apache.org/jira/browse/HBASE-25032
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
>
> As part of RS startup, the RS reports for duty to the Master. The Master 
> acknowledges the request and adds it to the onlineServers list for further 
> assignment of regions to the RS.
> Once the Master acknowledges the reportForDuty and sends back the response, 
> the RS performs several steps, such as initializing replication sources, 
> before becoming online. However, sometimes there can be an issue with 
> initializing replication sources when the RS is unable to connect to peer 
> clusters because of some Kerberos configuration, and there can be a delay of 
> around 20 minutes in becoming online.
>  
> Since the Master considers it online, it tries to assign regions, which fails 
> with a ServerNotRunningYetException; the Master then tries to unassign, which 
> again fails with the same exception, leaving the region in FAILED_CLOSE state.
>  
> It would be good to have a check that the RS is ready to accept assignment 
> requests before adding it to the online servers list, which would account for 
> any such delays as described above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24972) Wait for connection attempt to succeed before performing operations on ZK

2020-08-31 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam updated HBASE-24972:
-
Description: 
Creating the connection with ZK is asynchronous; the passed-in watcher is 
notified of the successful connection event. When we attempt any operation, we 
create a connection and then perform a read/write 
(https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java#L323) 
without actually waiting for the notification event 
(https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKWatcher.java#L582).

It is possible to get ConnectionLoss errors when we perform operations on ZK 
without waiting for the connection attempt to succeed.

  was:
Creating the connection with ZK is asynchronous and notified via the passed in 
watcher about the successful connection event. When we attempt any operations, 
we try to create a connection and then perform a read/write 
(https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java#L323) 
without really waiting for the notification event 
(https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKWatcher.java#L582)

 

it might be possible we get ConnectionLoss errors when we perform operations on 
ZK without waiting for the connection attempt to succeed


> Wait for connection attempt to succeed before performing operations on ZK
> -
>
> Key: HBASE-24972
> URL: https://issues.apache.org/jira/browse/HBASE-24972
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Minor
>
> Creating the connection with ZK is asynchronous; the passed-in watcher is 
> notified of the successful connection event. When we attempt any operation, 
> we create a connection and then perform a read/write 
> (https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java#L323) 
> without actually waiting for the notification event 
> (https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKWatcher.java#L582).
>  
> It is possible to get ConnectionLoss errors when we perform operations on ZK 
> without waiting for the connection attempt to succeed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24972) Wait for connection attempt to succeed before performing operations on ZK

2020-08-31 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam updated HBASE-24972:
-
Issue Type: Bug  (was: Improvement)
  Priority: Minor  (was: Major)

> Wait for connection attempt to succeed before performing operations on ZK
> -
>
> Key: HBASE-24972
> URL: https://issues.apache.org/jira/browse/HBASE-24972
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Minor
>
> Creating the connection with ZK is asynchronous; the passed-in watcher is 
> notified of the successful connection event. When we attempt any operation, 
> we create a connection and then perform a read/write 
> (https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java#L323) 
> without actually waiting for the notification event 
> (https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKWatcher.java#L582).
>  
> It might be possible to get ConnectionLoss errors when we perform operations 
> on ZK without waiting for the connection attempt to succeed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24972) Wait for connection attempt to succeed before performing operations on ZK

2020-08-31 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187923#comment-17187923
 ] 

Sandeep Guggilam commented on HBASE-24972:
--

FYI [~apurtell]

> Wait for connection attempt to succeed before performing operations on ZK
> -
>
> Key: HBASE-24972
> URL: https://issues.apache.org/jira/browse/HBASE-24972
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Minor
>
> Creating the connection with ZK is asynchronous; the passed-in watcher is 
> notified of the successful connection event. When we attempt any operation, 
> we create a connection and then perform a read/write 
> (https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java#L323) 
> without actually waiting for the notification event 
> (https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKWatcher.java#L582).
>  
> It might be possible to get ConnectionLoss errors when we perform operations 
> on ZK without waiting for the connection attempt to succeed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24972) Wait for connection attempt to succeed before performing operations on ZK

2020-08-31 Thread Sandeep Guggilam (Jira)
Sandeep Guggilam created HBASE-24972:


 Summary: Wait for connection attempt to succeed before performing 
operations on ZK
 Key: HBASE-24972
 URL: https://issues.apache.org/jira/browse/HBASE-24972
 Project: HBase
  Issue Type: Improvement
Reporter: Sandeep Guggilam
Assignee: Sandeep Guggilam


Creating the connection with ZK is asynchronous; the passed-in watcher is 
notified of the successful connection event. When we attempt any operation, we 
create a connection and then perform a read/write 
(https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java#L323) 
without actually waiting for the notification event 
(https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKWatcher.java#L582).

It might be possible to get ConnectionLoss errors when we perform operations on 
ZK without waiting for the connection attempt to succeed.
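
The usual fix pattern, sketched below with the stock ZooKeeper client API, is 
to latch on the SyncConnected event in the watcher before allowing any 
operation. This is a minimal standalone sketch, not the RecoverableZooKeeper 
patch itself:

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

// Block until the watcher sees SyncConnected instead of issuing ops right away.
public class ZkConnectSketch {
  public static ZooKeeper connect(String quorum, int sessionTimeoutMs)
      throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper(quorum, sessionTimeoutMs, event -> {
      if (event.getState() == KeeperState.SyncConnected) {
        connected.countDown(); // the asynchronous handshake has finished
      }
    });
    // Without this wait, an immediate exists()/getData() can fail with
    // KeeperException.ConnectionLossException.
    if (!connected.await(sessionTimeoutMs, TimeUnit.MILLISECONDS)) {
      zk.close();
      throw new java.io.IOException("Timed out waiting for ZK connection");
    }
    return zk;
  }
}
{code}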



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24911) Configure a max global timeout for HMaster to wait for RS to check in

2020-08-19 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180880#comment-17180880
 ] 

Sandeep Guggilam commented on HBASE-24911:
--

FYI [~apurtell]

> Configure a max global timeout for HMaster to wait for RS to check in
> -
>
> Key: HBASE-24911
> URL: https://issues.apache.org/jira/browse/HBASE-24911
> Project: HBase
>  Issue Type: Improvement
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Minor
>
> As part of the initialization process, the active master waits for region 
> servers to check in before completing initialization. Currently it waits in 
> a loop forever when no region servers report to it 
> (https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java#L1126).
>  
> We should probably have a global max timeout after which the master gives up 
> and aborts itself, letting other masters take over as the active master. 
> This is especially useful in cases where incoming connections to the master 
> are not accepted because of a network issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24911) Configure a max global timeout for HMaster to wait for RS to check in

2020-08-19 Thread Sandeep Guggilam (Jira)
Sandeep Guggilam created HBASE-24911:


 Summary: Configure a max global timeout for HMaster to wait for RS 
to check in
 Key: HBASE-24911
 URL: https://issues.apache.org/jira/browse/HBASE-24911
 Project: HBase
  Issue Type: Improvement
Reporter: Sandeep Guggilam
Assignee: Sandeep Guggilam


As part of the initialization process, the active master waits for region 
servers to check in before completing initialization. Currently it waits in a 
loop forever when no region servers report to it 
(https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java#L1126).

We should probably have a global max timeout after which the master gives up 
and aborts itself, letting other masters take over as the active master. This 
is especially useful in cases where incoming connections to the master are not 
accepted because of a network issue.
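
A minimal sketch of the proposed bound, assuming a hypothetical abort hook and 
check-in counter (not the actual ServerManager code):

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch; the names below are hypothetical stand-ins.
public class CheckinTimeoutSketch {
  static final AtomicInteger checkedInServers = new AtomicInteger(0);

  static void waitForRegionServers(long maxWaitMs) throws InterruptedException {
    long start = System.currentTimeMillis();
    while (checkedInServers.get() == 0) {
      if (System.currentTimeMillis() - start > maxWaitMs) {
        // Give up so a backup master can take over as the active master.
        throw new IllegalStateException(
            "No region server checked in within " + maxWaitMs + " ms; aborting");
      }
      Thread.sleep(1_000); // previously: this wait had no upper bound
    }
  }
}
{code}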



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24823) Port HBASE-22762 Print the delta between phases in the split/merge/compact/flush transaction journals to master branch

2020-08-05 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171750#comment-17171750
 ] 

Sandeep Guggilam commented on HBASE-24823:
--

FYI [~apurtell]

> Port HBASE-22762 Print the delta between phases in the 
> split/merge/compact/flush transaction journals to master branch
> --
>
> Key: HBASE-24823
> URL: https://issues.apache.org/jira/browse/HBASE-24823
> Project: HBase
>  Issue Type: Improvement
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24823) Port HBASE-22762 Print the delta between phases in the split/merge/compact/flush transaction journals to master branch

2020-08-05 Thread Sandeep Guggilam (Jira)
Sandeep Guggilam created HBASE-24823:


 Summary: Port HBASE-22762 Print the delta between phases in the 
split/merge/compact/flush transaction journals to master branch
 Key: HBASE-24823
 URL: https://issues.apache.org/jira/browse/HBASE-24823
 Project: HBase
  Issue Type: Improvement
Reporter: Sandeep Guggilam
Assignee: Sandeep Guggilam






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24768) Clear cached service kerberos ticket in case of SASL failures thrown from server side

2020-07-28 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166687#comment-17166687
 ] 

Sandeep Guggilam commented on HBASE-24768:
--

Sure [~apurtell]

> Clear cached service kerberos ticket in case of SASL failures thrown from 
> server side
> -
>
> Key: HBASE-24768
> URL: https://issues.apache.org/jira/browse/HBASE-24768
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
>
> We set up a SASL connection using different mechanisms, such as Digest and 
> Kerberos, from the Master to the RS for various activities like region 
> assignment. In case of SASL connect failures, we dispose of the SaslRpcClient 
> and try to re-login from the keytab on the client side. However, the 
> re-login-from-keytab method doesn't clear the service ticket cached in memory 
> unless the TGT is about to expire within a certain timeframe.
> This causes an issue when the keytab on the RS is refreshed because of 
> expiry: the Master reaches out to the RS with a cached service ticket that no 
> longer works with the refreshed keytab, and the connection fails with a SASL 
> error. We might need to clear the cached service ticket when handling connect 
> failures, since there could have been a credential refresh on the RS side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24768) Clear cached service kerberos ticket in case of SASL failures thrown from server side

2020-07-23 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164083#comment-17164083
 ] 

Sandeep Guggilam commented on HBASE-24768:
--

FYI [~apurtell] [~abhishek.chouhan]

> Clear cached service kerberos ticket in case of SASL failures thrown from 
> server side
> -
>
> Key: HBASE-24768
> URL: https://issues.apache.org/jira/browse/HBASE-24768
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
>
> We set up a SASL connection using different mechanisms, such as Digest and 
> Kerberos, from the Master to the RS for various activities like region 
> assignment. In case of SASL connect failures, we dispose of the SaslRpcClient 
> and try to re-login from the keytab on the client side. However, the 
> re-login-from-keytab method doesn't clear the service ticket cached in memory 
> unless the TGT is about to expire within a certain timeframe.
> This causes an issue when the keytab on the RS is refreshed because of 
> expiry: the Master reaches out to the RS with a cached service ticket that no 
> longer works with the refreshed keytab, and the connection fails with a SASL 
> error. We might need to clear the cached service ticket when handling connect 
> failures, since there could have been a credential refresh on the RS side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24768) Clear cached service kerberos ticket in case of SASL failures thrown from server side

2020-07-23 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam updated HBASE-24768:
-
Summary: Clear cached service kerberos ticket in case of SASL failures 
thrown from server side  (was: Clear service kerberos ticket in case of SASL 
failures from server side)

> Clear cached service kerberos ticket in case of SASL failures thrown from 
> server side
> -
>
> Key: HBASE-24768
> URL: https://issues.apache.org/jira/browse/HBASE-24768
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
>
> We set up a SASL connection using different mechanisms, such as Digest and 
> Kerberos, from the Master to the RS for various activities like region 
> assignment. In case of SASL connect failures, we dispose of the SaslRpcClient 
> and try to re-login from the keytab on the client side. However, the 
> re-login-from-keytab method doesn't clear the service ticket cached in memory 
> unless the TGT is about to expire within a certain timeframe.
> This causes an issue when the keytab on the RS is refreshed because of 
> expiry: the Master reaches out to the RS with a cached service ticket that no 
> longer works with the refreshed keytab, and the connection fails with a SASL 
> error. We might need to clear the cached service ticket when handling connect 
> failures, since there could have been a credential refresh on the RS side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24768) Clear service kerberos ticket in case of SASL failures from server side

2020-07-23 Thread Sandeep Guggilam (Jira)
Sandeep Guggilam created HBASE-24768:


 Summary: Clear service kerberos ticket in case of SASL failures 
from server side
 Key: HBASE-24768
 URL: https://issues.apache.org/jira/browse/HBASE-24768
 Project: HBase
  Issue Type: Bug
Reporter: Sandeep Guggilam
Assignee: Sandeep Guggilam


We set up a SASL connection using different mechanisms, such as Digest and 
Kerberos, from the Master to the RS for various activities like region 
assignment. In case of SASL connect failures, we dispose of the SaslRpcClient 
and try to re-login from the keytab on the client side. However, the 
re-login-from-keytab method doesn't clear the service ticket cached in memory 
unless the TGT is about to expire within a certain timeframe.

This causes an issue when the keytab on the RS is refreshed because of expiry: 
the Master reaches out to the RS with a cached service ticket that no longer 
works with the refreshed keytab, and the connection fails with a SASL error. We 
might need to clear the cached service ticket when handling connect failures, 
since there could have been a credential refresh on the RS side.
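
For illustration, a hedged sketch of what clearing cached service tickets could 
look like using the standard JAAS types; it assumes access to the login Subject 
and is not the actual HBase/Hadoop change:

{code:java}
import javax.security.auth.Subject;
import javax.security.auth.kerberos.KerberosTicket;

// Sketch: drop cached (non-TGT) service tickets so the next SASL attempt
// fetches fresh ones against the RS's refreshed keytab; keep the TGT itself.
public class ServiceTicketSketch {
  static void clearServiceTickets(Subject subject) {
    // getPrivateCredentials(Class) returns a copy, so removal below is safe.
    for (KerberosTicket ticket
        : subject.getPrivateCredentials(KerberosTicket.class)) {
      String server = ticket.getServer().getName();
      if (!server.startsWith("krbtgt/")) {
        subject.getPrivateCredentials().remove(ticket); // evict service ticket
      }
    }
  }
}
{code}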



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-22762) Print the delta between phases in the split/merge/compact/flush transaction journals

2020-07-20 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161557#comment-17161557
 ] 

Sandeep Guggilam commented on HBASE-22762:
--

[~apurtell] While adding a journal for the snapshot operation, I noticed that 
the master branch does not print the delta in the journal entries. Do we want 
to port this to the master branch as well?

> Print the delta between phases in the split/merge/compact/flush transaction 
> journals
> 
>
> Key: HBASE-22762
> URL: https://issues.apache.org/jira/browse/HBASE-22762
> Project: HBase
>  Issue Type: Improvement
>  Components: logging
>Reporter: Andrew Kyle Purtell
>Assignee: Andrew Kyle Purtell
>Priority: Minor
> Fix For: 1.5.0, 1.3.6, 1.4.11
>
> Attachments: HBASE-22762-branch-1-addendum.patch, 
> HBASE-22762.branch-1.001.patch, HBASE-22762.branch-1.002.patch, 
> HBASE-22762.branch-1.004.patch
>
>
> We print the start timestamp for each phase when logging the 
> split/merge/compact/flush transaction journals, so when debugging, an 
> operator must do the math by hand. It would be trivial, and helpful, to also 
> print the delta from the start timestamp of the previous phase.
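
A small self-contained sketch of the delta computation; the journal entry type 
is a simplified stand-in for the actual journal classes:

{code:java}
import java.util.Arrays;
import java.util.List;

// Minimal sketch: print each phase plus the delta from the previous phase.
public class JournalDeltaSketch {
  static class JournalEntry {
    final String phase;
    final long timestampMs;
    JournalEntry(String phase, long timestampMs) {
      this.phase = phase;
      this.timestampMs = timestampMs;
    }
  }

  static void printJournal(List<JournalEntry> journal) {
    long prev = -1;
    for (JournalEntry e : journal) {
      // The delta spares the operator from doing the subtraction by hand.
      String delta = prev < 0 ? "" : " (+" + (e.timestampMs - prev) + " ms)";
      System.out.println(e.phase + " at " + e.timestampMs + delta);
      prev = e.timestampMs;
    }
  }

  public static void main(String[] args) {
    printJournal(Arrays.asList(
        new JournalEntry("PREPARED", 1000),
        new JournalEntry("EXECUTED", 1450))); // "EXECUTED at 1450 (+450 ms)"
  }
}
{code}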



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24740) Enable journal logging for HBase snapshot operation

2020-07-14 Thread Sandeep Guggilam (Jira)
Sandeep Guggilam created HBASE-24740:


 Summary: Enable journal logging for HBase snapshot operation
 Key: HBASE-24740
 URL: https://issues.apache.org/jira/browse/HBASE-24740
 Project: HBase
  Issue Type: Improvement
Reporter: Sandeep Guggilam
Assignee: Sandeep Guggilam


The HBase snapshot operation contains multiple phases: the actual snapshot 
creation, the consolidation phase (reading region manifests from HDFS), and the 
verifier phase (validating the consolidated manifest against the actual number 
of regions in the table).

Sometimes one of the phases takes a long time, and we don't know which one 
unless we capture a thread dump at that very moment. Journal logging would give 
us more insight into the time taken by each phase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24069) Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep between failed region open requests) to region close and split requests

2020-06-06 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam updated HBASE-24069:
-
Fix Version/s: (was: 2.4.0)
   (was: 3.0.0-alpha-1)
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep 
> between failed region open requests) to region close and split requests
> --
>
> Key: HBASE-24069
> URL: https://issues.apache.org/jira/browse/HBASE-24069
> Project: HBase
>  Issue Type: Improvement
>  Components: Region Assignment
>Affects Versions: 1.6.0
>Reporter: Andrew Kyle Purtell
>Assignee: Sandeep Guggilam
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: HBASE-24069.branch-1.001.patch
>
>
> In HBASE-16209 we provide an ExponentialBackOffPolicy sleep between failed 
> region open requests. This should be extended to also apply to region close 
> and split requests. It will reduce the likelihood of FAILED_CLOSE transitions 
> in production by being more tolerant of temporary regionserver loading 
> issues, e.g. CallQueueTooBigException.
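
For reference, the backoff shape being extended looks roughly like this; the 
constants and class below are illustrative, not HBase's actual 
ExponentialBackOffPolicy:

{code:java}
// Illustrative exponential backoff: delay = initial * 2^attempt, capped.
public class BackoffSketch {
  static long backoffMillis(int attempt, long initialMs, long maxMs) {
    // Capping keeps repeated FAILED_CLOSE retries from hammering a loaded
    // regionserver (e.g. one throwing CallQueueTooBigException).
    long delay = initialMs * (1L << Math.min(attempt, 30));
    return Math.min(delay, maxMs);
  }

  public static void main(String[] args) {
    for (int attempt = 0; attempt < 6; attempt++) {
      System.out.println("attempt " + attempt + " -> "
          + backoffMillis(attempt, 100, 60_000) + " ms");
    }
  }
}
{code}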



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24511) Ability to configure timeout between RPC retry to RS from master

2020-06-05 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126966#comment-17126966
 ] 

Sandeep Guggilam commented on HBASE-24511:
--

[~apurtell] FYI

> Ability to configure timeout between RPC retry to RS from master
> 
>
> Key: HBASE-24511
> URL: https://issues.apache.org/jira/browse/HBASE-24511
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Major
>
> This is useful in cases where an environment edge is injected and the first 
> RS RPC request fails, causing it to go to the retry block.
> In the absence of this config, the default timeout is 100 ms, and the 
> DelayedUtil class is meant to execute the retry after 100 ms. However, per 
> the getRemainingTime() logic here 
> (https://github.com/apache/hbase/blob/5b01e613fbbb92e243e99a1d199b4ffbb21ed2d9/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/util/DelayedUtil.java#L94),
>  the check evaluates to
>  
> EnvironmentEdgeManager.currentTime() >= EnvironmentEdgeManager.currentTime() 
> + 100, which never becomes true with an injected edge, so the retry never 
> happens. Hence this config lets you override the timeout to 0 when testing 
> with a manually injected environment edge, so that the retry succeeds.
>  
> An example is the Master trying to open the meta region before the RS is 
> online. With an injected environment edge, the retry of opening the meta 
> region never happens and the Master never finishes initialization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24511) Ability to configure timeout between RPC retry to RS from master

2020-06-05 Thread Sandeep Guggilam (Jira)
Sandeep Guggilam created HBASE-24511:


 Summary: Ability to configure timeout between RPC retry to RS from 
master
 Key: HBASE-24511
 URL: https://issues.apache.org/jira/browse/HBASE-24511
 Project: HBase
  Issue Type: Bug
Reporter: Sandeep Guggilam
Assignee: Sandeep Guggilam


This is useful in cases where an environment edge is injected and the first RS 
RPC request fails, causing it to go to the retry block.

In the absence of this config, the default timeout is 100 ms, and the 
DelayedUtil class is meant to execute the retry after 100 ms. However, per the 
getRemainingTime() logic here 
(https://github.com/apache/hbase/blob/5b01e613fbbb92e243e99a1d199b4ffbb21ed2d9/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/util/DelayedUtil.java#L94), 
the check evaluates to EnvironmentEdgeManager.currentTime() >= 
EnvironmentEdgeManager.currentTime() + 100, which never becomes true with an 
injected edge, so the retry never happens. Hence this config lets you override 
the timeout to 0 when testing with a manually injected environment edge, so 
that the retry succeeds.

An example is the Master trying to open the meta region before the RS is 
online. With an injected environment edge, the retry of opening the meta region 
never happens and the Master never finishes initialization.
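
A self-contained illustration of the pitfall, using simplified stand-ins for 
DelayedUtil and the environment edge:

{code:java}
// Simplified stand-ins; not the actual DelayedUtil/EnvironmentEdge classes.
public class InjectedEdgeSketch {
  interface EnvironmentEdge { long currentTime(); }

  public static void main(String[] args) {
    EnvironmentEdge injected = () -> 1_000L; // a frozen, manually injected clock

    long retryDelayMs = 100;                 // default timeout between retries
    long timeout = injected.currentTime() + retryDelayMs;

    // The readiness check re-reads the same frozen clock, so with a 100 ms
    // delay it can never fire; overriding the delay to 0 makes it fire at once.
    System.out.println(injected.currentTime() >= timeout);                // false, forever
    System.out.println(injected.currentTime() >= injected.currentTime()); // true with 0 delay
  }
}
{code}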



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24446) Use EnvironmentEdgeManager to compute clock skew in Master

2020-05-27 Thread Sandeep Guggilam (Jira)
Sandeep Guggilam created HBASE-24446:


 Summary: Use EnvironmentEdgeManager to compute clock skew in Master
 Key: HBASE-24446
 URL: https://issues.apache.org/jira/browse/HBASE-24446
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.6.0
Reporter: Sandeep Guggilam
Assignee: Sandeep Guggilam
 Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0


There are a few cases where the Master is not able to complete initialization 
because it is waiting for a region server to report to it. The region server 
actually reported to the Master, but the Master rejected the request because of 
a clock skew issue, even though both of them are in the same JVM.

The region server uses EnvironmentEdgeManager.currentTime() to report the 
current time, while HMaster uses System.currentTimeMillis() to get the current 
time for the computation against the time reported by the RS. We should use 
EnvironmentEdgeManager in the Master as well, since we are expected not to use 
System.currentTimeMillis() directly and instead go through 
EnvironmentEdgeManager.
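
A sketch of the idea, with the edge interface simplified; the method below is 
illustrative, not the actual Master code:

{code:java}
// Both sides read the same injectable clock, so an injected test edge cannot
// produce phantom skew (simplified stand-in for EnvironmentEdgeManager).
public class ClockSkewSketch {
  interface EnvironmentEdge { long currentTime(); }

  static void checkClockSkew(long reportedTimeMs, EnvironmentEdge edge,
      long maxSkewMs) {
    // Before the fix, the master compared against System.currentTimeMillis(),
    // diverging from the RS's EnvironmentEdgeManager-reported time.
    long skew = Math.abs(edge.currentTime() - reportedTimeMs);
    if (skew > maxSkewMs) {
      throw new RuntimeException("Server rejected; clock skew " + skew + " ms");
    }
  }
}
{code}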

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24446) Use EnvironmentEdgeManager to compute clock skew in Master

2020-05-27 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117902#comment-17117902
 ] 

Sandeep Guggilam commented on HBASE-24446:
--

FYI [~apurtell]

> Use EnvironmentEdgeManager to compute clock skew in Master
> --
>
> Key: HBASE-24446
> URL: https://issues.apache.org/jira/browse/HBASE-24446
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Sandeep Guggilam
>Assignee: Sandeep Guggilam
>Priority: Minor
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
>
> There are a few cases where the Master is not able to complete 
> initialization because it is waiting for a region server to report to it. 
> The region server actually reported to the Master, but the Master rejected 
> the request because of a clock skew issue, even though both of them are in 
> the same JVM.
> The region server uses EnvironmentEdgeManager.currentTime() to report the 
> current time, while HMaster uses System.currentTimeMillis() to get the 
> current time for the computation against the time reported by the RS. We 
> should use EnvironmentEdgeManager in the Master as well, since we are 
> expected not to use System.currentTimeMillis() directly and instead go 
> through EnvironmentEdgeManager.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24069) Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep between failed region open requests) to region close and split requests

2020-05-26 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116973#comment-17116973
 ] 

Sandeep Guggilam commented on HBASE-24069:
--

FYI [~bharathv]

> Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep 
> between failed region open requests) to region close and split requests
> --
>
> Key: HBASE-24069
> URL: https://issues.apache.org/jira/browse/HBASE-24069
> Project: HBase
>  Issue Type: Improvement
>  Components: Region Assignment
>Affects Versions: 1.6.0
>Reporter: Andrew Kyle Purtell
>Assignee: Sandeep Guggilam
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
> Attachments: HBASE-24069.branch-1.001.patch
>
>
> In HBASE-16209 we provide an ExponentialBackOffPolicy sleep between failed 
> region open requests. This should be extended to also apply to region close 
> and split requests. Will reduce the likelihood of FAILED_CLOSE transitions in 
> production by being more tolerant of temporary regionserver loading issues, 
> e.g. CallQueueTooBigException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24069) Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep between failed region open requests) to region close and split requests

2020-05-22 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114458#comment-17114458
 ] 

Sandeep Guggilam commented on HBASE-24069:
--

[~apurtell] Uploaded the patch for branch-1. As discussed, since the split 
requests don't even have a retry in place, this is limited to retry with 
backoff for failed close requests.

Also, Assignment Manager V2 already has the retry-with-backoff logic for 
failed open/close requests 
(https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/TransitRegionStateProcedure.java#L360)
, so the change is not needed for branch-2 and above.
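
For readers unfamiliar with the strategy, a minimal sketch of an exponential 
backoff between failed close attempts; the method and constants are 
illustrative, not HBase's ExponentialBackOffPolicy itself:

    public class BackoffSketch {
      // Delay doubles on every failed attempt and is capped, so a struggling
      // RS keeps getting retried at a bounded rate instead of the region
      // being quickly driven to FAILED_CLOSE.
      static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis * (1L << Math.min(attempt, 30));
        return Math.min(delay, maxMillis);
      }

      public static void main(String[] args) {
        for (int attempt = 0; attempt < 6; attempt++) {
          // Prints 100, 200, 400, 800, 1600, 3200 with a 100 ms base.
          System.out.println(backoffMillis(attempt, 100, 60_000));
        }
      }
    }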

> Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep 
> between failed region open requests) to region close and split requests
> --
>
> Key: HBASE-24069
> URL: https://issues.apache.org/jira/browse/HBASE-24069
> Project: HBase
>  Issue Type: Improvement
>  Components: Region Assignment
>Affects Versions: 1.6.0
>Reporter: Andrew Kyle Purtell
>Assignee: Sandeep Guggilam
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
> Attachments: HBASE-24069.branch-1.001.patch
>
>
> In HBASE-16209 we provide an ExponentialBackOffPolicy sleep between failed 
> region open requests. This should be extended to also apply to region close 
> and split requests. Will reduce the likelihood of FAILED_CLOSE transitions in 
> production by being more tolerant of temporary regionserver loading issues, 
> e.g. CallQueueTooBigException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24069) Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep between failed region open requests) to region close and split requests

2020-05-20 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam updated HBASE-24069:
-
Status: Patch Available  (was: Open)

> Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep 
> between failed region open requests) to region close and split requests
> --
>
> Key: HBASE-24069
> URL: https://issues.apache.org/jira/browse/HBASE-24069
> Project: HBase
>  Issue Type: Improvement
>  Components: Region Assignment
>Affects Versions: 1.6.0
>Reporter: Andrew Kyle Purtell
>Assignee: Sandeep Guggilam
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
> Attachments: HBASE-24069.branch-1.001.patch
>
>
> In HBASE-16209 we provide an ExponentialBackOffPolicy sleep between failed 
> region open requests. This should be extended to also apply to region close 
> and split requests. Will reduce the likelihood of FAILED_CLOSE transitions in 
> production by being more tolerant of temporary regionserver loading issues, 
> e.g. CallQueueTooBigException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24069) Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep between failed region open requests) to region close and split requests

2020-05-20 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam updated HBASE-24069:
-
Attachment: HBASE-24069.branch-1.001.patch

> Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep 
> between failed region open requests) to region close and split requests
> --
>
> Key: HBASE-24069
> URL: https://issues.apache.org/jira/browse/HBASE-24069
> Project: HBase
>  Issue Type: Improvement
>  Components: Region Assignment
>Affects Versions: 1.6.0
>Reporter: Andrew Kyle Purtell
>Assignee: Sandeep Guggilam
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
> Attachments: HBASE-24069.branch-1.001.patch
>
>
> In HBASE-16209 we provide an ExponentialBackOffPolicy sleep between failed 
> region open requests. This should be extended to also apply to region close 
> and split requests. Will reduce the likelihood of FAILED_CLOSE transitions in 
> production by being more tolerant of temporary regionserver loading issues, 
> e.g. CallQueueTooBigException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24069) Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep between failed region open requests) to region close and split requests

2020-05-14 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17107789#comment-17107789
 ] 

Sandeep Guggilam commented on HBASE-24069:
--

Thanks [~apurtell] for the clarification

> Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep 
> between failed region open requests) to region close and split requests
> --
>
> Key: HBASE-24069
> URL: https://issues.apache.org/jira/browse/HBASE-24069
> Project: HBase
>  Issue Type: Improvement
>  Components: Region Assignment
>Affects Versions: 1.6.0
>Reporter: Andrew Kyle Purtell
>Assignee: Sandeep Guggilam
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
>
> In HBASE-16209 we provide an ExponentialBackOffPolicy sleep between failed 
> region open requests. This should be extended to also apply to region close 
> and split requests. Will reduce the likelihood of FAILED_CLOSE transitions in 
> production by being more tolerant of temporary regionserver loading issues, 
> e.g. CallQueueTooBigException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24069) Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep between failed region open requests) to region close and split requests

2020-05-14 Thread Sandeep Guggilam (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17107722#comment-17107722
 ] 

Sandeep Guggilam commented on HBASE-24069:
--

[~apurtell] I started looking at this. I need a few clarifications with 
respect to the use cases we are trying to solve:

1. Retry with exponential backoff for region close requests

Is this primarily meant to deal with the case where the RS is temporarily too 
busy to respond to the close request because of incoming load, but might free 
up shortly, whereas a retry without exponential backoff would quickly move the 
region to the FAILED_CLOSE state? Is there any other case the fix would help 
with?

2. Retry with exponential backoff for split requests

Can you please help me understand the case where the SPLIT request would 
ultimately end up in the FAILED_CLOSE state?

> Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep 
> between failed region open requests) to region close and split requests
> --
>
> Key: HBASE-24069
> URL: https://issues.apache.org/jira/browse/HBASE-24069
> Project: HBase
>  Issue Type: Improvement
>  Components: Region Assignment
>Affects Versions: 1.6.0
>Reporter: Andrew Kyle Purtell
>Assignee: Sandeep Guggilam
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
>
> In HBASE-16209 we provide an ExponentialBackOffPolicy sleep between failed 
> region open requests. This should be extended to also apply to region close 
> and split requests. Will reduce the likelihood of FAILED_CLOSE transitions in 
> production by being more tolerant of temporary regionserver loading issues, 
> e.g. CallQueueTooBigException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HBASE-24069) Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep between failed region open requests) to region close and split requests

2020-05-05 Thread Sandeep Guggilam (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Guggilam reassigned HBASE-24069:


Assignee: Sandeep Guggilam

> Extend HBASE-16209 strategy (Provide an ExponentialBackOffPolicy sleep 
> between failed region open requests) to region close and split requests
> --
>
> Key: HBASE-24069
> URL: https://issues.apache.org/jira/browse/HBASE-24069
> Project: HBase
>  Issue Type: Improvement
>  Components: Region Assignment
>Affects Versions: 1.6.0
>Reporter: Andrew Kyle Purtell
>Assignee: Sandeep Guggilam
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
>
> In HBASE-16209 we provide an ExponentialBackOffPolicy sleep between failed 
> region open requests. This should be extended to also apply to region close 
> and split requests. Will reduce the likelihood of FAILED_CLOSE transitions in 
> production by being more tolerant of temporary regionserver loading issues, 
> e.g. CallQueueTooBigException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-17521) Avoid stopping the load balancer in graceful stop

2017-04-07 Thread Sandeep Guggilam (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960353#comment-15960353
 ] 

Sandeep Guggilam commented on HBASE-17521:
--

[~lhofhansl] We found there is a JIRA that added support for adding/removing 
znodes through a Ruby script. We now plan to use that for adding/removing 
draining znodes since the functionality is already there. However, we have 
been working on the 0.98 branch to date, which doesn't contain this change. 
We now need to use a 1.x code line that has this change so that we can 
leverage it in the graceful_stop.sh script. We will work on this soon.

[~mnpoonia]

> Avoid stopping the load balancer in graceful stop
> -
>
> Key: HBASE-17521
> URL: https://issues.apache.org/jira/browse/HBASE-17521
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>
> ... instead setting the regionserver in question to draining.
> [~sandeep.guggilam], FYI



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-17521) Avoid stopping the load balancer in graceful stop

2017-02-07 Thread Sandeep Guggilam (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857393#comment-15857393
 ] 

Sandeep Guggilam commented on HBASE-17521:
--

Yes [~stack], we are doing some testing on the same. We will upload the patch 
as soon as it is done.

[~mnpoonia]

> Avoid stopping the load balancer in graceful stop
> -
>
> Key: HBASE-17521
> URL: https://issues.apache.org/jira/browse/HBASE-17521
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>
> ... instead setting the regionserver in question to draining.
> [~sandeep.guggilam], FYI



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)