[ 
https://issues.apache.org/jira/browse/HBASE-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HBASE-30101:
-----------------------------------
    Labels: pull-request-available  (was: )

> NettyRpcServer may accept connections before start() completes
> --------------------------------------------------------------
>
>                 Key: HBASE-30101
>                 URL: https://issues.apache.org/jira/browse/HBASE-30101
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Junegunn Choi
>            Assignee: Junegunn Choi
>            Priority: Minor
>              Labels: pull-request-available
>
> h2. Description
> {{NettyRpcServer}} binds the server socket in its constructor, but auth setup 
> (SASL secret manager, authorization manager, scheduler) only runs in 
> {{{}start(){}}}.
> In the window between bind and start, netty workers can accept connections 
> and run handler code that reaches into {{UserGroupInformation}} before the 
> main thread has finished Kerberos login.
> h2. Observed symptoms
> We observed two issues due to this window during server restarts.
> h3. Error logs from handler code running before login completes
> {noformat}
> 026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-3] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is hbase
> 2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-2] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is hbase
> 2026-02-02 17:06:51,662 ERROR [RS-EventLoopGroup-1-2] ipc.RpcServer: Error 
> when trying to create instance of HBaseSaslRpcServer with sasl provider: 
> org.apache.hadoop.hbase.security.provider.GssSaslServerAuthenticationProvider@4cf94801
> org.apache.hadoop.hbase.security.AccessDeniedException: Kerberos principal 
> does NOT contain an instance (hostname): hbase
> {noformat}
> h3. TGT Renewer thread spawned, continuously emitting warning logs
> {noformat}
> java.io.IOException: Cannot run program "kinit": error=2, No such file or 
> directory
>     at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
>     at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
>     at org.apache.hadoop.util.Shell.run(Shell.java:900)
>     at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:1306)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:1288)
>     at 
> org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupInformation.java:1061)
>     at 
> org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:972)
>     at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>     at java.base/java.lang.Thread.run(Thread.java:1583)
> {noformat}
> We use keytabs, so this {{TGT Renewer}} thread should never be created, but 
> when the keytab login is delayed, it is created due to the timing issue.
> {noformat}
> 2026-04-20 16:31:07,446 INFO  [main] ipc.NettyRpcServer: Bind to 
> /10.228.198.67:11471
> 2026-04-20 16:31:07,458 TRACE [RS-EventLoopGroup-1-2] ipc.NettyRpcServer: 
> Connection /10.227.143.27:31898; # active connections=0
> 2026-04-20 16:31:07,466 TRACE [RS-EventLoopGroup-1-3] ipc.NettyRpcServer: 
> Connection /10.192.163.175:2080; # active connections=1
> 2026-04-20 16:31:07,472 TRACE [RS-EventLoopGroup-1-4] ipc.NettyRpcServer: 
> Connection /10.227.225.174:25343; # active connections=2
> 2026-04-20 16:31:07,496 TRACE [RS-EventLoopGroup-1-5] ipc.NettyRpcServer: 
> Connection /10.192.163.189:32775; # active connections=3
> 2026-04-20 16:31:07,497 TRACE [RS-EventLoopGroup-1-6] ipc.NettyRpcServer: 
> Connection /10.192.163.203:43663; # active connections=4
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-6] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-4] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-2] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-3] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is ***/***
> 2026-04-20 16:31:07,559 DEBUG [RS-EventLoopGroup-1-5] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is ***/***
> 2026-04-20 16:31:07,563 TRACE [RS-EventLoopGroup-1-7] ipc.NettyRpcServer: 
> Connection /10.228.110.148:20935; # active connections=5
> 2026-04-20 16:31:07,583 INFO  [main] security.UserGroupInformation: Login 
> successful for user ***/*** using keytab file hbase.keytab. Keytab auto 
> renewal enabled : false
> 2026-04-20 16:31:07,587 DEBUG [RS-EventLoopGroup-1-7] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is ***/***
> {noformat}
> h2. Proposed fix
> Disable {{AUTO_READ}} on the server channel at bootstrap time so no 
> connections are accepted until {{start()}} completes, then re-enable it at 
> the end of {{{}start(){}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to