LompleZ opened a new issue, #64138:
URL: https://github.com/apache/doris/issues/64138

   ### Search before asking
   
   - [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### Version
   
   doris 1.2+
   
   ### What's Wrong?
   
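   The Apache HDFS broker leaks threads. The longer the broker runs, the more threads accumulate, and it eventually fails with an OutOfMemoryError.
   
   The broker log looks like this: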
   ```java
   [INFO ] 2025-05-25 00:33:57,190 
method:org.apache.hadoop.lite.client.LiteClientImpl.batchPing(LiteClientImpl.java:1158)
   batch ping size: 3, first 3 sessions: 5796d47541baad0c, 3784319f7320ff12, 
3c9338822793ba8c, lparam: 3d1129a5242ff0dd, current ping num:1
   Exception in thread "TThreadPoolServer WorkerProcess-319" 
java.lang.OutOfMemoryError: Java heap space
   Exception in thread "TThreadPoolServer WorkerProcess-343" 
java.lang.OutOfMemoryError: Java heap space
   Exception in thread "TThreadPoolServer WorkerProcess-264" 
java.lang.OutOfMemoryError: Java heap space
   [WARN ] 2025-05-25 00:34:49,256 
method:org.apache.hadoop.lite.client.LiteClientImpl.sendRequest(LiteClientImpl.java:381)
   failed to send message: DFS_OPEN, lparam: e2065745f259d106, from: 
/10.138.71.133:36690:1102: java.lang.OutOfMemoryError: Java heap space
   [ERROR] 2025-05-25 00:34:49,257 
method:org.apache.hadoop.lite.client.LiteClientImpl.sendMsg(LiteClientImpl.java:243)
   failed to send message: DFS_OPEN, lparam: e2065745f259d106, from: 
/10.138.71.133:36690:1102: java.lang.OutOfMemoryError: Java heap space
   [ERROR] 2025-05-25 00:34:49,257 
method:org.apache.hadoop.lite.client.LiteClientImpl.sendMsg(LiteClientImpl.java:243)
   failed to send request, will try again, lparm: e2065745f259d106:
   java.io.IOException: failed to send from: /10.138.71.133:36690:1102:
           at 
org.apache.hadoop.lite.client.LiteClientImpl.sendRequest(LiteClientImpl.java:384)
           at 
org.apache.hadoop.lite.client.LiteClientImpl.sendMsg(LiteClientImpl.java:241)
           at 
org.apache.hadoop.lite.client.LiteClientImpl.openInternal(LiteClientImpl.java:924)
           at 
org.apache.hadoop.lite.client.LiteClientImpl.open(LiteClientImpl.java:905)
           at 
org.apache.hadoop.fs.lite.file.LiteFileStreamWrapperImpl.open(LiteFileStreamWrapperImpl.java:27)
           at 
org.apache.hadoop.fs.LibDFileSystemImpl.openFile(LibDFileSystemImpl.java:454)
           at 
org.apache.hadoop.fs.LibDFSInputStream.<init>(LibDFSInputStream.java:26)
           at org.apache.hadoop.fs.LiteFileSystem.open(LiteFileSystem.java:133)
           at 
org.apache.doris.broker.hdfs.FileSystemManager.openReader(FileSystemManager.java:1224)
           at 
org.apache.doris.broker.hdfs.HDFSBrokerServiceImpl.openReader(HDFSBrokerServiceImpl.java:184)
   failed to send request, will try again, lparm: e2065745f259d106:
   java.io.IOException: failed to send from: /10.138.71.133:36690:1102:
           at 
org.apache.hadoop.lite.client.LiteClientImpl.sendRequest(LiteClientImpl.java:384)
           at 
org.apache.hadoop.lite.client.LiteClientImpl.sendMsg(LiteClientImpl.java:241)
           at 
org.apache.hadoop.lite.client.LiteClientImpl.openInternal(LiteClientImpl.java:924)
           at 
org.apache.hadoop.lite.client.LiteClientImpl.open(LiteClientImpl.java:905)
           at 
org.apache.hadoop.fs.lite.file.LiteFileStreamWrapperImpl.open(LiteFileStreamWrapperImpl.java:27)
   [ERROR] 2025-05-25 00:34:49,257 
method:org.apache.hadoop.lite.client.LiteClientImpl.sendMsg(LiteClientImpl.java:243)
   failed to send request, will try again, lparm: e2065745f259d106:
   java.io.IOException: failed to send from: /10.138.71.133:36690:1102:
           at 
org.apache.hadoop.lite.client.LiteClientImpl.sendRequest(LiteClientImpl.java:384)
           at 
org.apache.hadoop.lite.client.LiteClientImpl.sendMsg(LiteClientImpl.java:241)
           at 
org.apache.hadoop.lite.client.LiteClientImpl.openInternal(LiteClientImpl.java:924)
           at 
org.apache.hadoop.lite.client.LiteClientImpl.open(LiteClientImpl.java:905)
   [ERROR] 2025-05-25 00:34:49,257 
method:org.apache.hadoop.lite.client.LiteClientImpl.sendMsg(LiteClientImpl.java:243)
   Exception in thread "TThreadPoolServer WorkerProcess-264" 
java.lang.OutOfMemoryError: Java heap space
   [WARN ] 2025-05-25 00:34:49,256 
method:org.apache.hadoop.lite.client.LiteClientImpl.sendRequest(LiteClientImpl.java:381)
   failed to send message: DFS_OPEN, lparam: e2065745f259d106, from: 
/10.138.71.133:36690:1102: java.lang.OutOfMemoryError: Java heap space
   [ERROR] 2025-05-25 00:34:49,257 
method:org.apache.hadoop.lite.client.LiteClientImpl.sendMsg(LiteClientImpl.java:243)
   failed to send request, will try again, lparm: e2065745f259d106:
   java.io.IOException: failed to send from: /10.138.71.133:36690:1102:
           at 
org.apache.hadoop.lite.client.LiteClientImpl.sendRequest(LiteClientImpl.java:384)
           at 
org.apache.hadoop.lite.client.LiteClientImpl.sendMsg(LiteClientImpl.java:241)
           at 
org.apache.hadoop.lite.client.LiteClientImpl.openInternal(LiteClientImpl.java:924)
           at 
org.apache.hadoop.lite.client.LiteClientImpl.open(LiteClientImpl.java:905)
           at 
org.apache.hadoop.fs.lite.file.LiteFileStreamWrapperImpl.open(LiteFileStreamWrapperImpl.java:27)
           at 
org.apache.hadoop.fs.LibDFileSystemImpl.openFile(LibDFileSystemImpl.java:454)
           at 
org.apache.hadoop.fs.LibDFSInputStream.<init>(LibDFSInputStream.java:26)
           at org.apache.hadoop.fs.LiteFileSystem.open(LiteFileSystem.java:133)
           at 
org.apache.doris.broker.hdfs.FileSystemManager.openReader(FileSystemManager.java:1224)
           at 
org.apache.doris.broker.hdfs.HDFSBrokerServiceImpl.openReader(HDFSBrokerServiceImpl.java:184)
           at 
org.apache.doris.thrift.TPaloBrokerService$Processor$openReader.getResult(TPaloBrokerService.java:1145)
           at 
org.apache.doris.thrift.TPaloBrokerService$Processor$openReader.getResult(TPaloBrokerService.java:1125)
           at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
           at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
           at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:250)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   Caused by: io.netty.handler.codec.EncoderException: 
java.lang.OutOfMemoryError: Java heap space
           at 
io.netty.handler.codec.MessageToByteEncoder.write(MessageToByteEncoder.java:125)
           at 
io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
           at 
io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
           at 
io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
           at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
           at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
           at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
           at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
           at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
           ... 1 more
   Caused by: java.lang.OutOfMemoryError: Java heap space
   ``` 
   
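   Cause:
   
   Commit [6fe207eb4b85c92c2b11c266de90bd57e23c9922](https://github.com/apache/doris/commit/6fe207eb4b85c92c2b11c266de90bd57e23c9922) commented out the code that explicitly calls FileSystem.close().
   
   I went through the history of the Baidu code repository. At that time the broker did not yet depend on hadoop-common.jar, so commenting out that call really had no negative effect. Later, when Doris was open-sourced, the broker switched to hadoop-common.jar. That jar creates a netty thread pool (EventLoopGroup) whenever a FileSystem is created. Thread objects are GC roots, so the JVM cannot reclaim them unless .close() is called manually.
   Simply reverting this commit is not feasible: the broker is multi-threaded, and under a race thread A calling .close() can make a read in thread B fail.
   In addition, updateCachedFileSystem() has a bug when called concurrently from multiple threads.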
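
One common way to avoid the "thread A closes while thread B is still reading" race is reference counting: the cache and every reader each hold a reference, and the resource is closed only when the last reference is dropped. The sketch below is illustrative only. `RefCountedResource` is a hypothetical helper, not code from Doris or from the pending fix, and a plain `Closeable` stands in for Hadoop's `FileSystem` (which implements `Closeable`).

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration; not part of Doris or Hadoop.
final class RefCountedResource<T extends Closeable> {
    private final T resource;
    // Starts at 1: the cache that owns this entry holds the first reference.
    private final AtomicInteger refs = new AtomicInteger(1);

    RefCountedResource(T resource) {
        this.resource = resource;
    }

    // Returns the resource for a new user, or null once it has been closed.
    T acquire() {
        while (true) {
            int n = refs.get();
            if (n == 0) {
                return null; // already closed; the caller must create a fresh one
            }
            if (refs.compareAndSet(n, n + 1)) {
                return resource;
            }
        }
    }

    // Drops one reference; whoever drops the last one closes the resource.
    void release() {
        if (refs.decrementAndGet() == 0) {
            try {
                resource.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        AtomicBoolean closed = new AtomicBoolean(false);
        Closeable fake = () -> closed.set(true); // stands in for a FileSystem
        RefCountedResource<Closeable> entry = new RefCountedResource<>(fake);

        Closeable inUse = entry.acquire(); // thread B starts reading
        entry.release();                   // thread A evicts the cache entry
        System.out.println("closed after eviction: " + closed.get());
        entry.release();                   // thread B finishes reading
        System.out.println("closed after last release: " + closed.get());
    }
}
```

In this sketch, eviction only drops the cache's own reference, so the underlying resource stays open until the in-flight reader releases it. Each `acquire()` must be paired with exactly one `release()`; an unbalanced call would either leak the resource or close it early.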
   
   ### What You Expected?
   
   I have already fixed the code and will submit a PR soon.
   
   ### How to Reproduce?
   
   
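   When the BE frequently runs failing load and export jobs through the broker, the broker keeps creating new FileSystem objects and thread pool objects. The following command shows that the JVM holds a large number of thread pool threads that were never reclaimed: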
   ```bash
   jps | awk '/BrokerBootstrap/{print $1}' | xargs jstack | grep -P "pool-\d+-thread-\d+"
   ```
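
The effect can also be seen in isolation with plain JDK executors. The demo below is a minimal sketch, not broker code: it creates several thread pools, runs one task on each, never shuts them down, and then counts live threads whose names match the same `pool-\d+-thread-\d+` pattern used in the grep above (the naming scheme of the JDK default thread factory). `PoolThreadLeakDemo` and its methods are hypothetical names, and the pools here are ordinary `ExecutorService` instances rather than a netty EventLoopGroup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.regex.Pattern;

// Illustrative demo only; class and method names are hypothetical.
final class PoolThreadLeakDemo {
    // Same pattern as the grep above: the JDK default thread factory's naming.
    private static final Pattern POOL_THREAD = Pattern.compile("pool-\\d+-thread-\\d+");

    // Counts live threads in this JVM whose names look like pool worker threads.
    static long countPoolThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> POOL_THREAD.matcher(t.getName()).matches())
                .count();
    }

    // Creates n pools, runs one task on each, and deliberately skips shutdown.
    // Returns how many pool threads are still alive afterwards.
    static long leakPools(int n, List<ExecutorService> created) {
        long before = countPoolThreads();
        for (int i = 0; i < n; i++) {
            ExecutorService pool = Executors.newFixedThreadPool(1);
            try {
                pool.submit(() -> { }).get(); // the task finishes, the worker stays idle
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            created.add(pool);
        }
        return countPoolThreads() - before;
    }

    public static void main(String[] args) {
        List<ExecutorService> pools = new ArrayList<>();
        System.out.println("idle pool threads still alive: " + leakPools(5, pools));
        // Explicit shutdown is what finally lets the worker threads exit.
        pools.forEach(ExecutorService::shutdown);
    }
}
```

Each pool's worker thread stays alive after its task completes, so the count grows by one per pool until `shutdown()` is called. This is the same shape of leak as a FileSystem whose close() is never invoked.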
   
   
   
   ### Anything Else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

