[ https://issues.apache.org/jira/browse/KYLIN-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Liu Zhao updated KYLIN-5539:
----------------------------
    Attachment: image-2023-04-28-16-29-06-576.png
                image-2023-04-28-16-25-47-430.png
   Description:
1. Initial discovery: a production alert reported a large number of CLOSE_WAIT connections on one node. `netstat -anp` showed they belonged to the Kylin4 JobServer process, with the CLOSE_WAIT count exceeding 9000. All of the CLOSE_WAIT connections had remote port 50010, which is the port Hadoop DataNodes use for data transfer. We therefore suspected that the JobServer opens a stream with fileSystem.open() during each job build and never closes it.

2. Reproduction: we submitted cube build jobs in a test environment and monitored the CLOSE_WAIT count. It increased by 1 after every cube build, which confirmed that an unclosed stream in the JobServer code was the cause.

3. Locating the code: after debugging the Kylin4 build code, we narrowed the problem down to line 94 of org.apache.kylin.engine.spark.utils.UpdateMetadataUtil#syncLocalMetadataToRemote (Apache Kylin main branch):

{code:java}
String resKey = toUpdateSeg.getStatisticsResourcePath();
String statisticsDir = config.getJobTmpDir(currentInstanceCopy.getProject()) + "/"
        + nsparkExecutable.getParam(MetadataConstants.P_JOB_ID) + "/"
        + ResourceStore.CUBE_STATISTICS_ROOT + "/"
        + cubeId + "/" + segmentId + "/";
Path statisticsFile = new Path(statisticsDir, BatchConstants.CFG_STATISTICS_CUBOID_ESTIMATION_FILENAME);
FileSystem fs = HadoopUtil.getWorkingFileSystem();
if (fs.exists(statisticsFile)) {
    FSDataInputStream is = fs.open(statisticsFile); // {color:red}stream is never closed{color}
    ResourceStore.getStore(config).putBigResource(resKey, is, System.currentTimeMillis());
}
CubeUpdate update = new CubeUpdate(currentInstanceCopy);
update.setCuboids(distCube.getCuboids());
List<CubeSegment> toRemoveSegs = Lists.newArrayList();
{code}

4. Verification: the stream is closed by wrapping it in try-with-resources:

{code:java}
Path statisticsFile = new Path(statisticsDir, BatchConstants.CFG_STATISTICS_CUBOID_ESTIMATION_FILENAME);
FileSystem fs = HadoopUtil.getWorkingFileSystem();
if (fs.exists(statisticsFile)) {
    try (FSDataInputStream is = fs.open(statisticsFile)) { // {color:red}stream is closed{color}
        ResourceStore.getStore(config).putBigResource(resKey, is, System.currentTimeMillis());
    }
}
{code}

After this change, multiple rounds of cube build tests in the test environment showed no increase in the CLOSE_WAIT count, verifying the fix.

!image-2023-04-28-16-29-06-576.png!
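The fix works because try-with-resources guarantees that close() runs when the block exits, even on an exception. A minimal, self-contained sketch of this behavior (the TrackingStream class is illustrative only, standing in for the HDFS FSDataInputStream so that close() can be observed without a cluster):

```java
import java.io.ByteArrayInputStream;

// A stream whose close() call we can observe, standing in for FSDataInputStream.
class TrackingStream extends ByteArrayInputStream {
    boolean closed = false;
    TrackingStream(byte[] buf) { super(buf); }
    @Override public void close() { closed = true; }
}

public class CloseDemo {
    // Returns {leakedClosed, safeClosed} after exercising both patterns.
    static boolean[] run() throws Exception {
        // Pattern from the buggy code: open and read, never close.
        // For an HDFS stream, this is what leaves the socket in CLOSE_WAIT.
        TrackingStream leaked = new TrackingStream(new byte[]{1});
        leaked.read();

        // Pattern from the fix: try-with-resources closes automatically.
        TrackingStream safe = new TrackingStream(new byte[]{1});
        try (TrackingStream is = safe) {
            is.read();
        } // close() is invoked here regardless of how the block exits

        return new boolean[]{leaked.closed, safe.closed};
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = run();
        System.out.println(r[0] + " " + r[1]); // prints "false true"
    }
}
```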
> Kylin4 JobServer does not close the HDFS file stream opened during each cube build, producing a large number of CLOSE_WAIT connections
> -------------------------------------------------------
>
>                 Key: KYLIN-5539
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5539
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>    Affects Versions: v4.0.1, v4.0.2, v4.0.3
>            Reporter: Liu Zhao
>            Priority: Major
>         Attachments: image-2023-04-28-16-29-06-576.png
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)