[
https://issues.apache.org/jira/browse/HBASE-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Junegunn Choi updated HBASE-30101:
----------------------------------
Attachment: image-2026-04-22-13-47-59-850.png
Description:
h2. Description
{{NettyRpcServer}} binds the server socket in its constructor, but auth setup
(SASL secret manager, authorization manager, scheduler) only runs in
{{{}start(){}}}.
In the window between bind and start, netty workers can accept connections and
run handler code that reaches into {{UserGroupInformation}} before the main
thread has finished Kerberos login.
h2. Observed symptoms
We observed two issues due to this window during server restarts.
h3. Error logs from handler code running before login completes
{noformat}
026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-3]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is hbase
2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-2]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is hbase
2026-02-02 17:06:51,662 ERROR [RS-EventLoopGroup-1-2] ipc.RpcServer: Error when
trying to create instance of HBaseSaslRpcServer with sasl provider:
org.apache.hadoop.hbase.security.provider.GssSaslServerAuthenticationProvider@4cf94801
org.apache.hadoop.hbase.security.AccessDeniedException: Kerberos principal does
NOT contain an instance (hostname): hbase
{noformat}
h3. TGT Renewer thread spawned, continuously emitting warning logs
{noformat}
java.io.IOException: Cannot run program "kinit": error=2, No such file or
directory
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
at org.apache.hadoop.util.Shell.run(Shell.java:900)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1306)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1288)
at
org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupInformation.java:1061)
at
org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:972)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
{noformat}
We use keytabs, so this {{TGT Renewer}} thread should never be created, but
when the keytab login is delayed, it is created due to the timing issue.
{noformat}
2026-04-20 16:31:07,446 INFO [main] ipc.NettyRpcServer: Bind to
/10.228.198.67:11471
2026-04-20 16:31:07,458 TRACE [RS-EventLoopGroup-1-2] ipc.NettyRpcServer:
Connection /10.227.143.27:31898; # active connections=0
2026-04-20 16:31:07,466 TRACE [RS-EventLoopGroup-1-3] ipc.NettyRpcServer:
Connection /10.192.163.175:2080; # active connections=1
2026-04-20 16:31:07,472 TRACE [RS-EventLoopGroup-1-4] ipc.NettyRpcServer:
Connection /10.227.225.174:25343; # active connections=2
2026-04-20 16:31:07,496 TRACE [RS-EventLoopGroup-1-5] ipc.NettyRpcServer:
Connection /10.192.163.189:32775; # active connections=3
2026-04-20 16:31:07,497 TRACE [RS-EventLoopGroup-1-6] ipc.NettyRpcServer:
Connection /10.192.163.203:43663; # active connections=4
2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-6]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is ***/***
2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-4]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is ***/***
2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-2]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is ***/***
2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-3]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is ***/***
2026-04-20 16:31:07,559 DEBUG [RS-EventLoopGroup-1-5]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is ***/***
2026-04-20 16:31:07,563 TRACE [RS-EventLoopGroup-1-7] ipc.NettyRpcServer:
Connection /10.228.110.148:20935; # active connections=5
2026-04-20 16:31:07,583 INFO [main] security.UserGroupInformation: Login
successful for user ***/*** using keytab file hbase.keytab. Keytab auto renewal
enabled : false
2026-04-20 16:31:07,587 DEBUG [RS-EventLoopGroup-1-7]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is ***/***
{noformat}
!image-2026-04-22-13-47-59-850.png!
h2. Proposed fix
Disable {{AUTO_READ}} on the server channel at bootstrap time so no connections
are accepted until {{start()}} completes, then re-enable it at the end of
{{{}start(){}}}.
was:
h2. Description
{{NettyRpcServer}} binds the server socket in its constructor, but auth setup
(SASL secret manager, authorization manager, scheduler) only runs in
{{{}start(){}}}.
In the window between bind and start, netty workers can accept connections and
run handler code that reaches into {{UserGroupInformation}} before the main
thread has finished Kerberos login.
h2. Observed symptoms
We observed two issues due to this window during server restarts.
h3. Error logs from handler code running before login completes
{noformat}
026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-3]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is hbase
2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-2]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is hbase
2026-02-02 17:06:51,662 ERROR [RS-EventLoopGroup-1-2] ipc.RpcServer: Error when
trying to create instance of HBaseSaslRpcServer with sasl provider:
org.apache.hadoop.hbase.security.provider.GssSaslServerAuthenticationProvider@4cf94801
org.apache.hadoop.hbase.security.AccessDeniedException: Kerberos principal does
NOT contain an instance (hostname): hbase
{noformat}
h3. TGT Renewer thread spawned, continuously emitting warning logs
{noformat}
java.io.IOException: Cannot run program "kinit": error=2, No such file or
directory
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
at org.apache.hadoop.util.Shell.run(Shell.java:900)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1306)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1288)
at
org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupInformation.java:1061)
at
org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:972)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
{noformat}
We use keytabs, so this {{TGT Renewer}} thread should never be created, but
when the keytab login is delayed, it is created due to the timing issue.
{noformat}
2026-04-20 16:31:07,446 INFO [main] ipc.NettyRpcServer: Bind to
/10.228.198.67:11471
2026-04-20 16:31:07,458 TRACE [RS-EventLoopGroup-1-2] ipc.NettyRpcServer:
Connection /10.227.143.27:31898; # active connections=0
2026-04-20 16:31:07,466 TRACE [RS-EventLoopGroup-1-3] ipc.NettyRpcServer:
Connection /10.192.163.175:2080; # active connections=1
2026-04-20 16:31:07,472 TRACE [RS-EventLoopGroup-1-4] ipc.NettyRpcServer:
Connection /10.227.225.174:25343; # active connections=2
2026-04-20 16:31:07,496 TRACE [RS-EventLoopGroup-1-5] ipc.NettyRpcServer:
Connection /10.192.163.189:32775; # active connections=3
2026-04-20 16:31:07,497 TRACE [RS-EventLoopGroup-1-6] ipc.NettyRpcServer:
Connection /10.192.163.203:43663; # active connections=4
2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-6]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is ***/***
2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-4]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is ***/***
2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-2]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is ***/***
2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-3]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is ***/***
2026-04-20 16:31:07,559 DEBUG [RS-EventLoopGroup-1-5]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is ***/***
2026-04-20 16:31:07,563 TRACE [RS-EventLoopGroup-1-7] ipc.NettyRpcServer:
Connection /10.228.110.148:20935; # active connections=5
2026-04-20 16:31:07,583 INFO [main] security.UserGroupInformation: Login
successful for user ***/*** using keytab file hbase.keytab. Keytab auto renewal
enabled : false
2026-04-20 16:31:07,587 DEBUG [RS-EventLoopGroup-1-7]
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name
is ***/***
{noformat}
h2. Proposed fix
Disable {{AUTO_READ}} on the server channel at bootstrap time so no connections
are accepted until {{start()}} completes, then re-enable it at the end of
{{{}start(){}}}.
> NettyRpcServer may accept connections before start() completes
> --------------------------------------------------------------
>
> Key: HBASE-30101
> URL: https://issues.apache.org/jira/browse/HBASE-30101
> Project: HBase
> Issue Type: Bug
> Reporter: Junegunn Choi
> Assignee: Junegunn Choi
> Priority: Minor
> Labels: pull-request-available
> Attachments: image-2026-04-22-13-47-59-850.png
>
>
> h2. Description
> {{NettyRpcServer}} binds the server socket in its constructor, but auth setup
> (SASL secret manager, authorization manager, scheduler) only runs in
> {{{}start(){}}}.
> In the window between bind and start, netty workers can accept connections
> and run handler code that reaches into {{UserGroupInformation}} before the
> main thread has finished Kerberos login.
> h2. Observed symptoms
> We observed two issues due to this window during server restarts.
> h3. Error logs from handler code running before login completes
> {noformat}
> 026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-3]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is hbase
> 2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-2]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is hbase
> 2026-02-02 17:06:51,662 ERROR [RS-EventLoopGroup-1-2] ipc.RpcServer: Error
> when trying to create instance of HBaseSaslRpcServer with sasl provider:
> org.apache.hadoop.hbase.security.provider.GssSaslServerAuthenticationProvider@4cf94801
> org.apache.hadoop.hbase.security.AccessDeniedException: Kerberos principal
> does NOT contain an instance (hostname): hbase
> {noformat}
> h3. TGT Renewer thread spawned, continuously emitting warning logs
> {noformat}
> java.io.IOException: Cannot run program "kinit": error=2, No such file or
> directory
> at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
> at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
> at org.apache.hadoop.util.Shell.run(Shell.java:900)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1306)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1288)
> at
> org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupInformation.java:1061)
> at
> org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:972)
> at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
> at java.base/java.lang.Thread.run(Thread.java:1583)
> {noformat}
> We use keytabs, so this {{TGT Renewer}} thread should never be created, but
> when the keytab login is delayed, it is created due to the timing issue.
> {noformat}
> 2026-04-20 16:31:07,446 INFO [main] ipc.NettyRpcServer: Bind to
> /10.228.198.67:11471
> 2026-04-20 16:31:07,458 TRACE [RS-EventLoopGroup-1-2] ipc.NettyRpcServer:
> Connection /10.227.143.27:31898; # active connections=0
> 2026-04-20 16:31:07,466 TRACE [RS-EventLoopGroup-1-3] ipc.NettyRpcServer:
> Connection /10.192.163.175:2080; # active connections=1
> 2026-04-20 16:31:07,472 TRACE [RS-EventLoopGroup-1-4] ipc.NettyRpcServer:
> Connection /10.227.225.174:25343; # active connections=2
> 2026-04-20 16:31:07,496 TRACE [RS-EventLoopGroup-1-5] ipc.NettyRpcServer:
> Connection /10.192.163.189:32775; # active connections=3
> 2026-04-20 16:31:07,497 TRACE [RS-EventLoopGroup-1-6] ipc.NettyRpcServer:
> Connection /10.192.163.203:43663; # active connections=4
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-6]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-4]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-2]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-3]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,559 DEBUG [RS-EventLoopGroup-1-5]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,563 TRACE [RS-EventLoopGroup-1-7] ipc.NettyRpcServer:
> Connection /10.228.110.148:20935; # active connections=5
> 2026-04-20 16:31:07,583 INFO [main] security.UserGroupInformation: Login
> successful for user ***/*** using keytab file hbase.keytab. Keytab auto
> renewal enabled : false
> 2026-04-20 16:31:07,587 DEBUG [RS-EventLoopGroup-1-7]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> {noformat}
> !image-2026-04-22-13-47-59-850.png!
> h2. Proposed fix
> Disable {{AUTO_READ}} on the server channel at bootstrap time so no
> connections are accepted until {{start()}} completes, then re-enable it at
> the end of {{{}start(){}}}.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)