Liu Zhao created KYLIN-5539:
-------------------------------
Summary: Kylin4 JobServer在每次Cube构建后未关闭已打开的HDFS文件流,产生大量CLOSE_WAIT
Key: KYLIN-5539
URL: https://issues.apache.org/jira/browse/KYLIN-5539
Project: Kylin
Issue Type: Bug
Components: Job Engine
Affects Versions: v4.0.3, v4.0.2, v4.0.1
Reporter: Liu Zhao
1.初始发现:线上告警某节点存在大量的CLOSE_WAIT,通过 netstat -anp 发现来自与Kylin4 JobServer
进程,CLOSE_WAIT数达到9000多。并且 CLOSE_WAIT 来自的外部地址端口都是 50010,而改端口是 Hadoop DataNode
数据传输使用,故此怀疑是 JobServer在每次作业构建时 fileSystem.open() 一个流后没有进行close。
2.模拟复现:在研测环境提交cube构建任务,并观察 CLOSE_WAIT 数及增长情况,发现每次cube构建结束后,CLOSE_WAIT
数增加1,至此可以确定时JobServer代码中未关闭流导致。
3.定位代码:深入kylin4 构建代码进行debug,最终定位到
org.apache.kylin.engine.spark.utils.UpdateMetadataUtil#syncLocalMetadataToRemote
94行 (Apache Kylin main分支)
String resKey = toUpdateSeg.getStatisticsResourcePath();
String statisticsDir =
config.getJobTmpDir(currentInstanceCopy.getProject()) + "/"
+ nsparkExecutable.getParam(MetadataConstants.P_JOB_ID) + "/" +
ResourceStore.CUBE_STATISTICS_ROOT + "/"
+ cubeId + "/" + segmentId + "/";
Path statisticsFile = new Path(statisticsDir,
BatchConstants.CFG_STATISTICS_CUBOID_ESTIMATION_FILENAME);
FileSystem fs = HadoopUtil.getWorkingFileSystem();
if (fs.exists(statisticsFile)) {
FSDataInputStream is = fs.open(statisticsFile);
//{color:red}未关闭流{color}
ResourceStore.getStore(config).putBigResource(resKey, is,
System.currentTimeMillis());
}
CubeUpdate update = new CubeUpdate(currentInstanceCopy);
update.setCuboids(distCube.getCuboids());
List<CubeSegment> toRemoveSegs = Lists.newArrayList();
4. 研测验证:
Path statisticsFile = new Path(statisticsDir,
BatchConstants.CFG_STATISTICS_CUBOID_ESTIMATION_FILENAME);
FileSystem fs = HadoopUtil.getWorkingFileSystem();
if (fs.exists(statisticsFile)) {
try (FSDataInputStream is = fs.open(statisticsFile)) { //
{color:red}关闭流{color}
ResourceStore.getStore(config).putBigResource(resKey, is,
System.currentTimeMillis());
}
}
修改代码后,在研测环境进行多轮cube构建测试,CLOSE_WAIT 均无增加,验证解决。
--
This message was sent by Atlassian Jira
(v8.20.10#820010)