[jira] [Commented] (IMPALA-12616) test_restart_catalogd_while_handling_rpc_response* tests fail not reaching expected states

Joe McDonnell (Jira) Thu, 06 Jun 2024 15:34:04 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-12616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852961#comment-17852961
 ]


Joe McDonnell commented on IMPALA-12616:
----------------------------------------

This is looking timing-related. I was able to get this to pass by adjusting 
some of the sleep times. Basically, it looks like the catalog is slower on s3 
and some operations don't finish in the time we thought they would.

 
{noformat}
    debug_action_sleep_time_sec = 10 (NEW: 30)
    DEBUG_ACTION = ("WAIT_BEFORE_PROCESSING_CATALOG_UPDATE:SLEEP@{}"
                    .format(debug_action_sleep_time_sec * 1000))

    query = "alter table {} add columns (age int)".format(tbl_name)
    handle = self.execute_query_async(query, query_options={"debug_action": 
DEBUG_ACTION})

    # Wait a bit so the RPC from the catalogd arrives to the coordinator.
    time.sleep(0.5) (NEW: 5)

    self.cluster.catalogd.restart()

    # Wait for the query to finish.
    max_wait_time = (debug_action_sleep_time_sec
        + self.WAIT_FOR_CATALOG_UPDATE_TIMEOUT_SEC + 10)
    self.wait_for_state(handle, self.client.QUERY_STATES["FINISHED"], 
max_wait_time){noformat}
A successful timeline looks like this:

 
 # Submit an alter table that sleeps before processing the catalog update
 # Sleep a little bit so the catalog knows about the alter table
 # Restart the catalogd
 # The catalog sends an update via the statestore. This has the new catalog ID 
and causes this message: "There was an error processing the impalad catalog 
update. Requesting a full topic update to recover: CatalogException: Detected 
catalog service ID changes from 9c9f7ff13f0e4f72:a896bee4d52fd37e to 
da67610b2c304198:a05daf1bc3d6a4b3. Aborting updateCatalog()"
 # The catalogd sends a full topic update
 # The alter table wakes up and prints this message: Catalog service ID 
mismatch. Current ID: da67610b2c304198:a05daf1bc3d6a4b3. ID in response: 
9c9f7ff13f0e4f72:a896bee4d52fd37e. Catalogd may have been restarted. Waiting 
for new catalog update from statestore.
 # Either it times out or there are too many non-empty updates, and the alter 
table bails out with "W0506 22:42:10.316627 23066 impala-server.cc:2369] 
e14b23a22458ab75:6b26941400000000] Ignoring catalog update result of catalog 
service ID 9c9f7ff13f0e4f72:a896bee4d52fd37e because it does not match with 
current catalog service ID da67610b2c304198:a05daf1bc3d6a4b3. The current 
catalog service ID may be stale (this may be caused by the catalogd having been 
restarted more than once) or newer than the catalog service ID of the update 
result."

If the alter table wakes up from its sleep before #5 happens, the alter table 
will see the catalog service ID change and fail. To avoid that, we adjust the 
WAIT_BEFORE_PROCESSING_CATALOG_UPDATE higher. I also lengthened the sleep in #2 
to give the initial catalog some extra time to hear about the alter table. The 
test verifies that the logs contain the expected messages, so this should be a 
safe modification to the test.

> test_restart_catalogd_while_handling_rpc_response* tests fail not reaching 
> expected states
> ------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-12616
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12616
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 1.4.2
>            Reporter: Andrew Sherman
>            Assignee: Daniel Becker
>            Priority: Critical
>
> There are failures in both 
> custom_cluster.test_restart_services.TestRestart.test_restart_catalogd_while_handling_rpc_response_with_timeout
>  and 
> custom_cluster.test_restart_services.TestRestart.test_restart_catalogd_while_handling_rpc_response_with_max_iters,
>  both look the same:
> {code:java}
> custom_cluster/test_restart_services.py:232: in 
> test_restart_catalogd_while_handling_rpc_response_with_timeout
>     self.wait_for_state(handle, self.client.QUERY_STATES["FINISHED"], 
> max_wait_time)
> common/impala_test_suite.py:1181: in wait_for_state
>     self.wait_for_any_state(handle, [expected_state], timeout, client)
> common/impala_test_suite.py:1199: in wait_for_any_state
>     raise Timeout(timeout_msg)
> E   Timeout: query '6a4e0bad9b511ccf:bf93de6800000000' did not reach one of 
> the expected states [4], last known state 5
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

[jira] [Commented] (IMPALA-12616) test_restart_catalogd_while_handling_rpc_response* tests fail not reaching expected states

Reply via email to