[ 
https://issues.apache.org/jira/browse/KYLIN-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu Zhao updated KYLIN-5539:
----------------------------
    Description: 
1. Initial discovery: a production alert reported a large number of CLOSE_WAIT connections on one node. `netstat -anp` showed they belonged to the Kylin4 JobServer process, with the CLOSE_WAIT count exceeding 9000. The foreign-address port of every CLOSE_WAIT connection was 50010, the port Hadoop DataNodes use for data transfer, so we suspected that the JobServer opens a stream via fileSystem.open() during each job build and never closes it.
2. Reproduction: we submitted cube build jobs in our test environment and watched the CLOSE_WAIT count. It grew by 1 after every cube build, which confirmed that an unclosed stream in the JobServer code is the cause.
3. Locating the code: after debugging the Kylin4 build code, we traced the leak to line 94 of
org.apache.kylin.engine.spark.utils.UpdateMetadataUtil#syncLocalMetadataToRemote (Apache Kylin main branch):

        String resKey = toUpdateSeg.getStatisticsResourcePath();
        String statisticsDir = config.getJobTmpDir(currentInstanceCopy.getProject()) + "/"
                + nsparkExecutable.getParam(MetadataConstants.P_JOB_ID) + "/"
                + ResourceStore.CUBE_STATISTICS_ROOT + "/"
                + cubeId + "/" + segmentId + "/";
        Path statisticsFile = new Path(statisticsDir, BatchConstants.CFG_STATISTICS_CUBOID_ESTIMATION_FILENAME);
        FileSystem fs = HadoopUtil.getWorkingFileSystem();
        if (fs.exists(statisticsFile)) {
            FSDataInputStream is = fs.open(statisticsFile);   // {color:red}stream is never closed{color}
            ResourceStore.getStore(config).putBigResource(resKey, is, System.currentTimeMillis());
        }

        CubeUpdate update = new CubeUpdate(currentInstanceCopy);
        update.setCuboids(distCube.getCuboids());
        List<CubeSegment> toRemoveSegs = Lists.newArrayList();
4. Verification: wrapping the stream in try-with-resources closes it after use:

        Path statisticsFile = new Path(statisticsDir, BatchConstants.CFG_STATISTICS_CUBOID_ESTIMATION_FILENAME);
        FileSystem fs = HadoopUtil.getWorkingFileSystem();
        if (fs.exists(statisticsFile)) {
            try (FSDataInputStream is = fs.open(statisticsFile)) {  // {color:red}stream is closed{color}
                ResourceStore.getStore(config).putBigResource(resKey, is, System.currentTimeMillis());
            }
        }
After applying this change, we ran multiple rounds of cube builds in the test environment; the CLOSE_WAIT count no longer grew, confirming the fix.
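The fix relies on Java's try-with-resources, which guarantees close() is invoked even if the body throws (e.g. if putBigResource fails mid-upload). A minimal, self-contained sketch of that semantics, with no Hadoop dependency: the Recorder class below is a hypothetical stand-in for FSDataInputStream that records whether it was closed.

```java
// Sketch of try-with-resources close semantics; Recorder is a hypothetical
// stand-in for FSDataInputStream (no Hadoop on the classpath assumed).
public class TryWithResourcesDemo {

    // A fake stream that records whether close() was called.
    static class Recorder implements AutoCloseable {
        boolean closed = false;

        @Override
        public void close() {
            closed = true;
        }
    }

    // Returns true iff the resource was closed even though the body threw.
    static boolean resourceClosedAfterFailure() {
        Recorder recorder = new Recorder();
        try (Recorder managed = recorder) {
            // Simulate the upload inside the try block failing.
            throw new RuntimeException("simulated putBigResource failure");
        } catch (RuntimeException expected) {
            // close() has already run by the time the exception propagates here.
        }
        return recorder.closed;
    }

    public static void main(String[] args) {
        System.out.println(resourceClosedAfterFailure()); // prints "true"
    }
}
```

With the original pattern (plain assignment, no try-with-resources and no finally), an exception thrown by putBigResource would also skip any close() call placed after it, so try-with-resources is the safer shape even beyond the happy path.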


 !image-2023-04-28-16-29-06-576.png! 



> Kylin4 JobServer does not close the opened HDFS file stream after each Cube build, leaving a large number of CLOSE_WAIT connections
> -------------------------------------------------------
>
>                 Key: KYLIN-5539
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5539
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>    Affects Versions: v4.0.1, v4.0.2, v4.0.3
>            Reporter: Liu Zhao
>            Priority: Major
>         Attachments: image-2023-04-28-16-29-06-576.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
