Alexey Serbin created KUDU-3666:
-----------------------------------
Summary: ReplicatedAlterTableTest.AlterTableAndDropTablet fails
from time to time due to absence of only-once semantics for AlterTable
Key: KUDU-3666
URL: https://issues.apache.org/jira/browse/KUDU-3666
Project: Kudu
Issue Type: Bug
Components: master, test
Reporter: Alexey Serbin
Attachments: alter_table-test.00-debug.txt.xz,
alter_table-test.00-release.txt.xz, alter_table-test.01-release.txt.xz
The ReplicatedAlterTableTest.AlterTableAndDropTablet test scenario fails
intermittently. The failures show up as error messages like the ones below:
{noformat}
src/kudu/integration-tests/alter_table-test.cc:2378: Failure
Failed
Bad status: Already present: The column already exists: new_c39
{noformat}
{noformat}
src/kudu/integration-tests/alter_table-test.cc:2378: Failure
Failed
Bad status: Already present: The column already exists: new_c44
{noformat}
{noformat}
src/kudu/integration-tests/alter_table-test.cc:2385: Failure
Failed
Bad status: Invalid argument: no range partition to drop: 9 <= VALUES < 10
{noformat}
The culprit seems to be a retried AlterTable RPC request: the client assumed
that the request had failed, although it had in fact succeeded on the server
side, so the retry was applied to an already altered table.
To address the issue, we need to enable exactly-once RPC semantics (i.e. the
kudu.rpc.track_rpc_result option in protobuf) for the
AlterTable(AlterTableRequestPB) RPC method of masters as well. At the time of
writing, it is enabled only for the Write(WriteRequestPB) RPC method of tablet
servers.
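To illustrate why a retry of a non-idempotent request produces the errors
above, and how tracking results per request avoids them, here is a minimal,
self-contained sketch. It is not Kudu code: ToyMaster, AddColumn, AddColumnOnce,
the string-based Status, and the (client_id, seq_no) request key are all
hypothetical names chosen for illustration, and the sketch ignores concurrency,
persistence, and garbage collection of cached results.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for a status object: an empty string means OK.
using Status = std::string;

// Hypothetical toy "master" that owns a single table schema.
class ToyMaster {
 public:
  // Non-idempotent operation: applying it twice fails the second time.
  Status AddColumn(const std::string& name) {
    if (!columns_.insert(name).second) {
      return "Already present: The column already exists: " + name;
    }
    return "";
  }

  // Exactly-once variant: the first attempt with a given (client_id, seq_no)
  // is executed and its result is remembered; any retry carrying the same
  // key gets the remembered result instead of being executed again.
  Status AddColumnOnce(const std::string& client_id, long seq_no,
                       const std::string& name) {
    assert(seq_no >= 0);
    const auto key = std::make_pair(client_id, seq_no);
    const auto it = results_.find(key);
    if (it != results_.end()) {
      return it->second;  // Retry of an already handled request.
    }
    Status s = AddColumn(name);
    results_.emplace(key, s);
    return s;
  }

 private:
  std::set<std::string> columns_;
  std::map<std::pair<std::string, long>, Status> results_;
};
```

In this sketch, calling AddColumn("new_c39") twice returns the "Already
present" error on the second call, which mirrors the test failure. Calling
AddColumnOnce("client-a", 1, "new_c39") twice returns OK both times, because
the second call is recognized as a retry of the same request.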
Full test logs are attached for convenience. Each log contains evidence of a
re-attempted RPC request, e.g.:
{noformat}
W20250531 02:02:56.259193 30345 master_proxy_rpc.cc:203] Re-attempting
AlterTable request to leader Master (127.22.204.254:43629)
{noformat}