[ 
https://issues.apache.org/jira/browse/TEZ-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Urmas Tamassy updated TEZ-4646:
-------------------------------
    Description: 
*Issue description:*

https://issues.apache.org/jira/browse/TEZ-4344 intends to allow users to configure 
periodic jstack collection for tez AM(dag) and executor (task) containers.

Unfortunately the current implementation only allows a single jstack collection 
after an initial delay via the tez.thread.dump.interval configuration.

The cause appears to be the use of the one-shot ScheduledExecutorService 
schedule method, where scheduleAtFixedRate or scheduleWithFixedDelay would be 
more appropriate for periodic execution.

[https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#schedule-java.lang.Runnable-long-java.util.concurrent.TimeUnit-]

[https://github.com/apache/tez/blob/c981a9459c48b5e0c49fb4197173a910dcf7a39a/tez-runtime-internals/src/main/java/org/apache/tez/runtime/TezThreadDumpHelper.java#L94]
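
To illustrate the difference between the one-shot schedule method and a periodic variant, here is a minimal standalone sketch. It is not the Tez code: the class name PeriodicDumpSketch and the method countRuns are hypothetical, and the thread dump is replaced by a simple counter. It uses scheduleWithFixedDelay for the periodic case; scheduleAtFixedRate would take the same arguments.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PeriodicDumpSketch {

    /**
     * Schedules a stand-in "dump" task and returns how often it ran within observeMs.
     * periodic=false uses schedule(); periodic=true uses scheduleWithFixedDelay().
     */
    static int countRuns(boolean periodic, long intervalMs, long observeMs) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        // Stand-in for the real thread dump (assumption): it only counts invocations.
        Runnable dump = () -> runs.incrementAndGet();
        try {
            if (periodic) {
                // Arguments: task, initial delay, delay between runs, time unit.
                pool.scheduleWithFixedDelay(dump, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
            } else {
                // One-shot: runs once after the delay and is never rescheduled.
                pool.schedule(dump, intervalMs, TimeUnit.MILLISECONDS);
            }
            Thread.sleep(observeMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("schedule: " + countRuns(false, 100, 1000));
        System.out.println("scheduleWithFixedDelay: " + countRuns(true, 100, 1000));
    }
}
```

With schedule the counter stays at 1 however long the observation window is, which matches the single jstack seen in the reproduction below. The periodic variant keeps firing until the executor is shut down. The second argument of the periodic call is also where a separately configurable initial delay could be passed.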

*Reproduction* (tested on CDP7.1.9SP1, but based on code it appears to affect 
all releases):
{code:java}
set hive.fetch.task.conversion=none;
-- set hive.security.authorization.sqlstd.confwhitelist.append to
-- tez\.thread\.dump\.interval ahead of time
set tez.thread.dump.interval=3s;
-- also set hive.server2.builtin.udf.blacklist to a dummy value ahead of time
-- to allow reflects
select java_method("java.lang.Thread","sleep",10000L);{code}
With a 10s duration we would expect 2-3 jstacks (depending on initial delay), 
but we only receive 1 after 3 seconds. Log snippets:
{code:java}
Container: container_e06_1756713041685_0004_01_000002 on 
ccycloud-3.tamassyurmas.root.comops.site_8041
LogAggregationType: AGGREGATED
======================================================================================================
LogType:syslog_attempt_1756713041685_0004_1_00_000000_0
...
2025-09-01 10:04:42,948 [INFO] [main] |runtime.TezThreadDumpHelper|: Periodic 
Thread Dump Capture Service Configured to capture Thread Dumps at 3000 ms 
frequency and at path: 
/var/log/hadoop-yarn/container/application_1756713041685_0004/container_e06_1756713041685_0004_01_000002{code}
{code:java}
LogType:attempt_1756713041685_0004_1_00_000000_0_1756721085950.jstack{code}
1756721085950 = Mon Sep 1 10:04:45 UTC 2025, which confirms a stack dump after 3 
seconds, but no others are observed for the same task attempt or for the dag 
(which also has only a single dump). Attaching the app logs of the same.
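
The timestamp arithmetic above can be double-checked with a short sketch. The class and method names (DumpTimestampCheck, dumpInstant, millisBetween) are hypothetical, and the sketch assumes the syslog timestamp is in UTC, which the log itself does not state.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class DumpTimestampCheck {

    /** Converts the epoch-millisecond suffix of a .jstack file name to an Instant. */
    static Instant dumpInstant(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    /**
     * Milliseconds between a syslog timestamp and a dump timestamp.
     * Assumption: the syslog timestamp is in UTC.
     */
    static long millisBetween(String logTimestamp, long epochMillis) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");
        Instant logged = LocalDateTime.parse(logTimestamp, fmt).toInstant(ZoneOffset.UTC);
        return Duration.between(logged, dumpInstant(epochMillis)).toMillis();
    }

    public static void main(String[] args) {
        long dumpMillis = 1756721085950L; // suffix of the .jstack LogType above
        System.out.println(dumpInstant(dumpMillis)); // 2025-09-01T10:04:45.950Z
        System.out.println(millisBetween("2025-09-01 10:04:42,948", dumpMillis)); // 3002
    }
}
```

Under that assumption the single dump was written 3002 ms after the helper logged its configuration, in line with the 3s interval acting as a one-time delay.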

*Expectation/severity:*

The feature should allow periodic collections rather than a single collection 
after an initial delay. Please ensure the feature works as expected. The 
initial delay might also need to be configurable.

The feature simplifies the investigation of long-running or stuck tez 
applications where jstacks of specific containers (Yarn) or K8s 
executor/coordinator pod processes (CDW) may be necessary. Manual collection 
may be difficult or near-impossible in certain situations and as such it is a 
valuable diagnostic feature.


> Periodic jstack collection for tez (tez.thread.dump.interval) only collects 
> jstacks once.
> -----------------------------------------------------------------------------------------
>
>                 Key: TEZ-4646
>                 URL: https://issues.apache.org/jira/browse/TEZ-4646
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Urmas Tamassy
>            Priority: Major
>
> *Issue description:*
> https://issues.apache.org/jira/browse/TEZ-4344 intends to allow users to 
> configure periodic jstack collection for tez AM(dag) and executor (task) 
> containers.
> Unfortunately the current implementation only allows a single jstack 
> collection after an initial delay via the tez.thread.dump.interval 
> configuration.
> The cause appears to be the use of the one-shot ScheduledExecutorService 
> schedule method, where scheduleAtFixedRate or scheduleWithFixedDelay would be 
> more appropriate for periodic execution.
> [https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#schedule-java.lang.Runnable-long-java.util.concurrent.TimeUnit-]
> [https://github.com/apache/tez/blob/c981a9459c48b5e0c49fb4197173a910dcf7a39a/tez-runtime-internals/src/main/java/org/apache/tez/runtime/TezThreadDumpHelper.java#L94]
> *Reproduction* (tested on CDP7.1.9SP1, but based on code it appears to affect 
> all releases):
> {code:java}
> set hive.fetch.task.conversion=none;
> -- set hive.security.authorization.sqlstd.confwhitelist.append to
> -- tez\.thread\.dump\.interval ahead of time
> set tez.thread.dump.interval=3s;
> -- also set hive.server2.builtin.udf.blacklist to a dummy value ahead of time
> -- to allow reflects
> select java_method("java.lang.Thread","sleep",10000L);{code}
> With a 10s duration we would expect 2-3 jstacks (depending on initial delay), 
> but we only receive 1 after 3 seconds. Log snippets:
> {code:java}
> Container: container_e06_1756713041685_0004_01_000002 on 
> ccycloud-3.tamassyurmas.root.comops.site_8041
> LogAggregationType: AGGREGATED
> ======================================================================================================
> LogType:syslog_attempt_1756713041685_0004_1_00_000000_0
> ...
> 2025-09-01 10:04:42,948 [INFO] [main] |runtime.TezThreadDumpHelper|: Periodic 
> Thread Dump Capture Service Configured to capture Thread Dumps at 3000 ms 
> frequency and at path: 
> /var/log/hadoop-yarn/container/application_1756713041685_0004/container_e06_1756713041685_0004_01_000002{code}
> {code:java}
> LogType:attempt_1756713041685_0004_1_00_000000_0_1756721085950.jstack{code}
> 1756721085950 = Mon Sep 1 10:04:45 UTC 2025, which confirms a stack dump after 
> 3 seconds, but no others are observed for the same task attempt or for the dag 
> (which also has only a single dump). Attaching the app logs of the same.
> *Expectation/severity:*
> The feature should allow periodic collections rather than a single 
> collection after an initial delay. Please ensure the feature works as 
> expected. The initial delay might also need to be configurable.
> The feature simplifies the investigation of long-running or stuck tez 
> applications where jstacks of specific containers (Yarn) or K8s 
> executor/coordinator pod processes (CDW) may be necessary. Manual collection 
> may be difficult or near-impossible in certain situations and as such it is a 
> valuable diagnostic feature.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
