[ 
https://issues.apache.org/jira/browse/IMPALA-12997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Smith updated IMPALA-12997:
-----------------------------------
    Description: 
In some test runs, most tests under test_query_log will start to fail on 
various conditions like
{code}
custom_cluster/test_query_log.py:452: in 
test_query_log_table_query_select_mt_dop
    "impala-server.completed-queries.written", 1, 60)
common/impala_service.py:144: in wait_for_metric_value
    self.__metric_timeout_assert(metric_name, expected_value, timeout)
common/impala_service.py:213: in __metric_timeout_assert
    assert 0, assert_string
E   AssertionError: Metric impala-server.completed-queries.written did not 
reach value 1 in 60s.
E   Dumping debug webpages in JSON format...
E   Dumped memz JSON to 
$IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/memz.json
E   Dumped metrics JSON to 
$IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/metrics.json
E   Dumped queries JSON to 
$IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/queries.json
E   Dumped sessions JSON to 
$IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/sessions.json
E   Dumped threadz JSON to 
$IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/threadz.json
E   Dumped rpcz JSON to 
$IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/rpcz.json
E   Dumping minidumps for impalads/catalogds...
E   Dumped minidump for Impalad PID 3680802
E   Dumped minidump for Impalad PID 3680805
E   Dumped minidump for Impalad PID 3680809
E   Dumped minidump for Catalogd PID 3680732
{code}
or
{code}
custom_cluster/test_query_log.py:921: in test_query_log_ignored_sqls
    assert len(sql_results.data) == 1, "query not found in completed queries 
table"
E   AssertionError: query not found in completed queries table
E   assert 0 == 1
E    +  where 0 = len([])
E    +    where [] = <tests.common.impala_connection.ImpylaHS2ResultSet object 
at 0xffffa00cc350>.data
{code}

One symptom that seems related to this is INSERT operations into 
sys.impala_query_log that start "UnregisterQuery()" but never finish (with 
"Query successfully unregistered").

We can identify cases like that with
{code}
for log in $(ag -l 'INSERT INTO sys.impala_query_log' impalad.*); do echo $log; 
for qid in $(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO 
sys.impala_query_log' $log | cut -d']' -f1); do if ! ag "Query successfully 
unregistered: query_id=$qid" $log; then echo "$qid not unregistered"; fi; done; 
done
{code}
A similar case may occur with creating the table too
{code}
for log in $(ag -l 'CREATE TABLE IF NOT EXISTS sys.impala_query_log' 
impalad.impala-ec2-rhel88-m7g-4xlarge-ondemand-0a5a.vpc.cloudera.com.jenkins.log.INFO.20240410-*);
 do QID=$(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO 
sys.impala_query_log' $log | cut -d']' -f1); echo $log; ag "Query successfully 
unregistered: query_id=$QID" $log; done
{code}
although these frequently fail because the test completes and shuts down Impala 
before the CREATE TABLE query completes.

Tracking one of those cases led to catalogd errors that repeated for 1m27s 
before the test suite restarted catalogd:
{code}
W0410 12:48:05.051760 3681790 Tasks.java:456] 
6647229faf7637d5:3ec7565b00000000] Retrying task after failure: Waiting for 
lock on table sys.impala_query_log
Java exception follows:
org.apache.iceberg.hive.MetastoreLock$WaitingForLockException: Waiting for lock 
on table sys.impala_query_log
        at 
org.apache.iceberg.hive.MetastoreLock.lambda$acquireLock$1(MetastoreLock.java:217)
        at 
org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
        at 
org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
        at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
        at 
org.apache.iceberg.hive.MetastoreLock.acquireLock(MetastoreLock.java:209)
        at org.apache.iceberg.hive.MetastoreLock.lock(MetastoreLock.java:146)
        at 
org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:194)
        at 
org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:135)
        at 
org.apache.iceberg.BaseTransaction.lambda$commitSimpleTransaction$3(BaseTransaction.java:417)
        at 
org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
        at 
org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
        at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
        at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
        at 
org.apache.iceberg.BaseTransaction.commitSimpleTransaction(BaseTransaction.java:413)
        at 
org.apache.iceberg.BaseTransaction.commitTransaction(BaseTransaction.java:308)
        at 
org.apache.impala.service.CatalogOpExecutor.updateCatalogImpl(CatalogOpExecutor.java:7220)
        at 
org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:6939)
        at 
org.apache.impala.service.JniCatalog.lambda$updateCatalog$15(JniCatalog.java:504)
        at 
org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
        at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
        at 
org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
        at 
org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100)
        at 
org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:245)
        at 
org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:259)
        at 
org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:503)
{code}
This seems to correspond to the first test failure and persisted through all 
tests afterwards. So this lock was likely acquired from HMS, then never 
released.

> test_query_log tests get stuck trying to write to the log
> ---------------------------------------------------------
>
>                 Key: IMPALA-12997
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12997
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 4.4.0
>            Reporter: Michael Smith
>            Priority: Major
>
> In some test runs, most tests under test_query_log will start to fail on 
> various conditions like
> {code}
> custom_cluster/test_query_log.py:452: in 
> test_query_log_table_query_select_mt_dop
>     "impala-server.completed-queries.written", 1, 60)
> common/impala_service.py:144: in wait_for_metric_value
>     self.__metric_timeout_assert(metric_name, expected_value, timeout)
> common/impala_service.py:213: in __metric_timeout_assert
>     assert 0, assert_string
> E   AssertionError: Metric impala-server.completed-queries.written did not 
> reach value 1 in 60s.
> E   Dumping debug webpages in JSON format...
> E   Dumped memz JSON to 
> $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/memz.json
> E   Dumped metrics JSON to 
> $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/metrics.json
> E   Dumped queries JSON to 
> $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/queries.json
> E   Dumped sessions JSON to 
> $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/sessions.json
> E   Dumped threadz JSON to 
> $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/threadz.json
> E   Dumped rpcz JSON to 
> $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/rpcz.json
> E   Dumping minidumps for impalads/catalogds...
> E   Dumped minidump for Impalad PID 3680802
> E   Dumped minidump for Impalad PID 3680805
> E   Dumped minidump for Impalad PID 3680809
> E   Dumped minidump for Catalogd PID 3680732
> {code}
> or
> {code}
> custom_cluster/test_query_log.py:921: in test_query_log_ignored_sqls
>     assert len(sql_results.data) == 1, "query not found in completed queries 
> table"
> E   AssertionError: query not found in completed queries table
> E   assert 0 == 1
> E    +  where 0 = len([])
> E    +    where [] = <tests.common.impala_connection.ImpylaHS2ResultSet 
> object at 0xffffa00cc350>.data
> {code}
> One symptom that seems related to this is INSERT operations into 
> sys.impala_query_log that start "UnregisterQuery()" but never finish (with 
> "Query successfully unregistered").
> We can identify cases like that with
> {code}
> for log in $(ag -l 'INSERT INTO sys.impala_query_log' impalad.*); do echo 
> $log; for qid in $(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO 
> sys.impala_query_log' $log | cut -d']' -f1); do if ! ag "Query successfully 
> unregistered: query_id=$qid" $log; then echo "$qid not unregistered"; fi; 
> done; done
> {code}
> A similar case may occur with creating the table too
> {code}
> for log in $(ag -l 'CREATE TABLE IF NOT EXISTS sys.impala_query_log' 
> impalad.impala-ec2-rhel88-m7g-4xlarge-ondemand-0a5a.vpc.cloudera.com.jenkins.log.INFO.20240410-*);
>  do QID=$(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO 
> sys.impala_query_log' $log | cut -d']' -f1); echo $log; ag "Query 
> successfully unregistered: query_id=$QID" $log; done
> {code}
> although these frequently fail because the test completes and shuts down 
> Impala before the CREATE TABLE query completes.
> Tracking one of those cases led to catalogd errors that repeated for 1m27s 
> before the test suite restarted catalogd:
> {code}
> W0410 12:48:05.051760 3681790 Tasks.java:456] 
> 6647229faf7637d5:3ec7565b00000000] Retrying task after failure: Waiting for 
> lock on table sys.impala_query_log
> Java exception follows:
> org.apache.iceberg.hive.MetastoreLock$WaitingForLockException: Waiting for 
> lock on table sys.impala_query_log
>       at 
> org.apache.iceberg.hive.MetastoreLock.lambda$acquireLock$1(MetastoreLock.java:217)
>       at 
> org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
>       at 
> org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
>       at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
>       at 
> org.apache.iceberg.hive.MetastoreLock.acquireLock(MetastoreLock.java:209)
>       at org.apache.iceberg.hive.MetastoreLock.lock(MetastoreLock.java:146)
>       at 
> org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:194)
>       at 
> org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:135)
>       at 
> org.apache.iceberg.BaseTransaction.lambda$commitSimpleTransaction$3(BaseTransaction.java:417)
>       at 
> org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
>       at 
> org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
>       at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
>       at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
>       at 
> org.apache.iceberg.BaseTransaction.commitSimpleTransaction(BaseTransaction.java:413)
>       at 
> org.apache.iceberg.BaseTransaction.commitTransaction(BaseTransaction.java:308)
>       at 
> org.apache.impala.service.CatalogOpExecutor.updateCatalogImpl(CatalogOpExecutor.java:7220)
>       at 
> org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:6939)
>       at 
> org.apache.impala.service.JniCatalog.lambda$updateCatalog$15(JniCatalog.java:504)
>       at 
> org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
>       at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
>       at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
>       at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100)
>       at 
> org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:245)
>       at 
> org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:259)
>       at 
> org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:503)
> {code}
> This seems to correspond to the first test failure and persisted through all 
> tests afterwards. So this lock was likely acquired from HMS, then never 
> released.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to