[
https://issues.apache.org/jira/browse/TEZ-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ayush Saxena resolved TEZ-4646.
-------------------------------
Fix Version/s: 1.0.0
Resolution: Fixed
> Periodic jstack collection for tez (tez.thread.dump.interval) only collects
> jstacks once.
> -----------------------------------------------------------------------------------------
>
> Key: TEZ-4646
> URL: https://issues.apache.org/jira/browse/TEZ-4646
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Urmas Tamassy
> Assignee: Ayush Saxena
> Priority: Major
> Fix For: 1.0.0
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> *Issue description:*
> https://issues.apache.org/jira/browse/TEZ-4344 intends to allow users to
> configure periodic jstack collection for tez AM (dag) and executor (task)
> containers.
> Unfortunately the current implementation only allows a single jstack
> collection after an initial delay via the tez.thread.dump.interval
> configuration.
> The issue seems to be due to the improper use of the ScheduledExecutorService
> schedule method, which runs a task only once after the given delay;
> scheduleAtFixedRate or scheduleWithFixedDelay appears more appropriate.
> [https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#schedule-java.lang.Runnable-long-java.util.concurrent.TimeUnit-]
> [https://github.com/apache/tez/blob/c981a9459c48b5e0c49fb4197173a910dcf7a39a/tez-runtime-internals/src/main/java/org/apache/tez/runtime/TezThreadDumpHelper.java#L94]
> *Reproduction* (tested on CDP7.1.9SP1, but based on code it appears to affect
> all releases):
> {code:java}
> set hive.fetch.task.conversion=none;
> -- set hive.security.authorization.sqlstd.confwhitelist.append to
> -- tez\.thread\.dump\.interval ahead of time
> set tez.thread.dump.interval=3s;
> -- also set hive.server2.builtin.udf.blacklist to a dummy value ahead of time
> -- to allow reflects
> select java_method("java.lang.Thread","sleep",10000L);{code}
> With a 10s duration we would expect 2-3 jstacks (depending on initial delay),
> but we only receive 1 after 3 seconds. Log snippets:
> {code:java}
> Container: container_e06_1756713041685_0004_01_000002 on
> ccycloud-3.tamassyurmas.root.comops.site_8041
> LogAggregationType: AGGREGATED
> ======================================================================================================
> LogType:syslog_attempt_1756713041685_0004_1_00_000000_0
> ...
> 2025-09-01 10:04:42,948 [INFO] [main] |runtime.TezThreadDumpHelper|: Periodic
> Thread Dump Capture Service Configured to capture Thread Dumps at 3000 ms
> frequency and at path:
> /var/log/hadoop-yarn/container/application_1756713041685_0004/container_e06_1756713041685_0004_01_000002{code}
> {code:java}
> LogType:attempt_1756713041685_0004_1_00_000000_0_1756721085950.jstack{code}
> 1756721085950 = Mon Sep 1 10:04:45 UTC 2025, which confirms a stack dump after
> 3 seconds, but no others are observed for the same task attempt or the dag
> (also only a single dump). Attaching the app logs of the same.
> *Expectation/severity:*
> The feature should allow periodic collections rather than a single
> collection after an initial delay. Please ensure the feature works as
> expected. The initial delay might also need to be configurable.
> The feature simplifies the investigation of long-running or stuck tez
> applications where jstacks of specific containers (Yarn) or K8s
> executor/coordinator pod processes (CDW) may be necessary. Manual collection
> may be difficult or near-impossible in certain situations, so this is a
> valuable diagnostic feature.
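
The scheduling difference described in the issue can be illustrated with a
minimal, standalone sketch. It is not the Tez code or the committed fix: the
class PeriodicDumpSketch and the helper countRuns are hypothetical names, and a
counter stands in for the actual thread dump. It only shows that schedule runs
a task once, while scheduleWithFixedDelay keeps repeating it.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not taken from TezThreadDumpHelper.
final class PeriodicDumpSketch {

  // Counts how often a stand-in "dump" task runs within observeMs.
  // periodic = false uses schedule (one-shot),
  // periodic = true uses scheduleWithFixedDelay (repeating).
  static int countRuns(boolean periodic, long intervalMs, long observeMs) {
    ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();
    AtomicInteger runs = new AtomicInteger();
    Runnable dump = runs::incrementAndGet; // stand-in for a jstack capture
    try {
      if (periodic) {
        // Runs after intervalMs, then again intervalMs after each completion.
        executor.scheduleWithFixedDelay(
            dump, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
      } else {
        // Runs exactly once after intervalMs.
        executor.schedule(dump, intervalMs, TimeUnit.MILLISECONDS);
      }
      Thread.sleep(observeMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      executor.shutdownNow();
    }
    return runs.get();
  }

  public static void main(String[] args) {
    System.out.println("schedule: " + countRuns(false, 100, 1000));
    System.out.println("scheduleWithFixedDelay: " + countRuns(true, 100, 1000));
  }
}
```

scheduleWithFixedDelay is used here so that a slow dump cannot overlap with
the next one; scheduleAtFixedRate would be the alternative if a steady cadence
matters more.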
--
This message was sent by Atlassian Jira
(v8.20.10#820010)