[
https://issues.apache.org/jira/browse/HBASE-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Junegunn Choi reassigned HBASE-30101:
-------------------------------------
Assignee: Junegunn Choi
> NettyRpcServer may accept connections before start() completes
> --------------------------------------------------------------
>
> Key: HBASE-30101
> URL: https://issues.apache.org/jira/browse/HBASE-30101
> Project: HBase
> Issue Type: Bug
> Reporter: Junegunn Choi
> Assignee: Junegunn Choi
> Priority: Minor
>
> h2. Description
> {{NettyRpcServer}} binds the server socket in its constructor, but auth setup
> (SASL secret manager, authorization manager, scheduler) only runs in
> {{{}start(){}}}.
> In the window between bind and start, netty workers can accept connections
> and run handler code that reaches into {{UserGroupInformation}} before the
> main thread has finished Kerberos login.
> h2. Observed symptoms
> We observed two issues due to this window during server restarts.
> h3. Error logs from handler code running before login completes
> {noformat}
> 026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-3]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is hbase
> 2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-2]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is hbase
> 2026-02-02 17:06:51,662 ERROR [RS-EventLoopGroup-1-2] ipc.RpcServer: Error
> when trying to create instance of HBaseSaslRpcServer with sasl provider:
> org.apache.hadoop.hbase.security.provider.GssSaslServerAuthenticationProvider@4cf94801
> org.apache.hadoop.hbase.security.AccessDeniedException: Kerberos principal
> does NOT contain an instance (hostname): hbase
> {noformat}
> h3. TGT Renewer thread spawned, continuously emitting warning logs
> {noformat}
> java.io.IOException: Cannot run program "kinit": error=2, No such file or
> directory
> at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
> at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
> at org.apache.hadoop.util.Shell.run(Shell.java:900)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1306)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1288)
> at
> org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupInformation.java:1061)
> at
> org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:972)
> at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
> at java.base/java.lang.Thread.run(Thread.java:1583)
> {noformat}
> We use keytabs, so this {{TGT Renewer}} thread should never be created, but
> when the keytab login is delayed, it is created due to the timing issue.
> {noformat}
> 2026-04-20 16:31:07,446 INFO [main] ipc.NettyRpcServer: Bind to
> /10.228.198.67:11471
> 2026-04-20 16:31:07,458 TRACE [RS-EventLoopGroup-1-2] ipc.NettyRpcServer:
> Connection /10.227.143.27:31898; # active connections=0
> 2026-04-20 16:31:07,466 TRACE [RS-EventLoopGroup-1-3] ipc.NettyRpcServer:
> Connection /10.192.163.175:2080; # active connections=1
> 2026-04-20 16:31:07,472 TRACE [RS-EventLoopGroup-1-4] ipc.NettyRpcServer:
> Connection /10.227.225.174:25343; # active connections=2
> 2026-04-20 16:31:07,496 TRACE [RS-EventLoopGroup-1-5] ipc.NettyRpcServer:
> Connection /10.192.163.189:32775; # active connections=3
> 2026-04-20 16:31:07,497 TRACE [RS-EventLoopGroup-1-6] ipc.NettyRpcServer:
> Connection /10.192.163.203:43663; # active connections=4
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-6]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-4]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-2]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-3]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,559 DEBUG [RS-EventLoopGroup-1-5]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,563 TRACE [RS-EventLoopGroup-1-7] ipc.NettyRpcServer:
> Connection /10.228.110.148:20935; # active connections=5
> 2026-04-20 16:31:07,583 INFO [main] security.UserGroupInformation: Login
> successful for user ***/*** using keytab file hbase.keytab. Keytab auto
> renewal enabled : false
> 2026-04-20 16:31:07,587 DEBUG [RS-EventLoopGroup-1-7]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> {noformat}
> h2. Proposed fix
> Disable {{AUTO_READ}} on the server channel at bootstrap time so no
> connections are accepted until {{start()}} completes, then re-enable it at
> the end of {{{}start(){}}}.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)