Guoliang created YARN-11784:
-------------------------------
Summary: Counter inaccurate and resource negative
Key: YARN-11784
URL: https://issues.apache.org/jira/browse/YARN-11784
Project: Hadoop YARN
Issue Type: Bug
Reporter: Guoliang
Attachments: tic_stream_49-1, yarnweb1.png, yarnweb2.png
We have run into the following issues in our production environment. They may
share one root cause. Please help identify which bug this is (YARN version
2.7.2.22).
(1) After the cluster has been running for some time, fewer resources can
actually be scheduled. The RM page shows that resources are still available,
but they cannot actually be allocated. For example, some queues have pending
jobs even though the queue has enough resources and the scheduling policy and
thresholds are normal. The total cluster resources are also sufficient.
Note: In the history of the RM scheduler counters, there are cases where a
single Container counter shows -1. How can a count be -1? I suspect this is
related.
(2) Similarly, queue A and the cluster both have resources, but job A cannot
get its AM scheduled and is not allocated to any NM, so it stays stuck.
Analysis: we found a workaround, which is to increase the maxAMShare value.
After the change, the previously pending jobs were scheduled immediately.
Queue A has had a large amount of idle resources the whole time.
We also noticed the following: the resourcemanager_queueinfo_AppsPending count
for the queue, as reported by the RM web UI and JMX, is inaccurate, whereas the
RM 8088 API ws/v1/cluster/apps returns correct results.
See the screenshots below and the attached file tic_stream_49. All 49 jobs are
actually in the Running state, but the numbers shown on the web page and in JMX
are wrong.
!yarnweb1.png!
!yarnweb2.png!
[^tic_stream_49]
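To make the mismatch described above easier to reproduce, the following is a minimal sketch that compares per-queue application counts from ws/v1/cluster/apps with the QueueMetrics values exposed by the RM /jmx servlet. Several things here are assumptions and not taken from this report: the RM address, the queue names, the idea that the QueueMetrics bean carries a `tag.Queue` attribute with the full queue path, and the approximation that AppsPending roughly corresponds to applications in the ACCEPTED state. The helper names are hypothetical.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and parse the body as JSON."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def count_apps_by_state(apps_payload, queue):
    """Count applications per state for one queue in a ws/v1/cluster/apps response."""
    apps = (apps_payload.get("apps") or {}).get("app") or []
    return Counter(a["state"] for a in apps if a.get("queue") == queue)


def queue_metrics(jmx_payload, queue):
    """Return AppsPending/AppsRunning from the QueueMetrics bean of one queue.

    Assumption: the bean exposes the queue path as 'tag.Queue', and per-user
    beans (which would double count) carry a 'tag.User' attribute.
    """
    for bean in jmx_payload.get("beans", []):
        if "name=QueueMetrics" not in bean.get("name", ""):
            continue
        if bean.get("tag.Queue") == queue and "tag.User" not in bean:
            return {k: bean.get(k) for k in ("AppsPending", "AppsRunning")}
    return {}


def compare_counts(apps_payload, jmx_payload, rest_queue, jmx_queue):
    """Pair (REST count, JMX counter) for running and pending applications."""
    states = count_apps_by_state(apps_payload, rest_queue)
    metrics = queue_metrics(jmx_payload, jmx_queue)
    return {
        "running": (states.get("RUNNING", 0), metrics.get("AppsRunning")),
        # Assumption: AppsPending is compared against ACCEPTED apps only.
        "pending": (states.get("ACCEPTED", 0), metrics.get("AppsPending")),
    }


def check_live(rm_url, rest_queue, jmx_queue):
    """Query a live RM (rm_url is a placeholder such as 'http://rm-host:8088')."""
    apps = fetch_json(rm_url + "/ws/v1/cluster/apps")
    jmx = fetch_json(
        rm_url + "/jmx?qry=Hadoop:service=ResourceManager,name=QueueMetrics,*"
    )
    return compare_counts(apps, jmx, rest_queue, jmx_queue)
```

Calling `check_live("http://rm-host:8088", "root.A", "root.A")` against the affected RM would return pairs of (REST count, JMX counter). A pair whose two values differ, for example many RUNNING apps in REST but a nonzero AppsPending in JMX, would match the symptom reported here. The queue name format may differ between the REST response and the JMX tag depending on the scheduler, which is why the two names are passed separately.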