czy006 commented on code in PR #3950:
URL: https://github.com/apache/amoro/pull/3950#discussion_r3008863780


##########
amoro-optimizer/amoro-optimizer-common/src/main/java/org/apache/amoro/optimizer/common/AbstractOptimizerOperator.java:
##########
@@ -86,25 +157,77 @@ private boolean shouldReturnNull(Throwable t) {
     return false;
   }
 
-  protected <T> T callAuthenticatedAms(AmsAuthenticatedCallOperation<T> 
operation)
+  /**
+   * Call authenticated AMS with a specific AMS URL.
+   *
+   * @param amsUrl The AMS node URL to call
+   * @param operation The operation to execute
+   * @return The result of the operation
+   * @throws TException If the operation fails
+   */
+  protected <T> T callAuthenticatedAms(String amsUrl, 
AmsAuthenticatedCallOperation<T> operation)
       throws TException {
+    // Maximum retry time window for auth errors in master-slave mode (30 
seconds)
+    long maxAuthRetryTimeWindow = TimeUnit.SECONDS.toMillis(30);
+    Long firstAuthErrorTime = null;
+
     while (isStarted()) {
       if (tokenIsReady()) {

Review Comment:
   I suspect that if a node keeps failing, it will get stuck on that node for a 
long time, and subsequent nodes will stop polling.We can Enforce per-node retry 
budget (attempt count or retry window), then fail fast to continue with next 
node.



##########
amoro-optimizer/amoro-optimizer-common/src/main/java/org/apache/amoro/optimizer/common/AbstractOptimizerOperator.java:
##########
@@ -86,25 +157,79 @@ private boolean shouldReturnNull(Throwable t) {
     return false;
   }
 
-  protected <T> T callAuthenticatedAms(AmsAuthenticatedCallOperation<T> 
operation)
+  /**
+   * Call authenticated AMS with a specific AMS URL.
+   *
+   * @param amsUrl The AMS node URL to call
+   * @param operation The operation to execute
+   * @return The result of the operation
+   * @throws TException If the operation fails
+   */
+  protected <T> T callAuthenticatedAms(String amsUrl, 
AmsAuthenticatedCallOperation<T> operation)
       throws TException {
+    // Maximum retry time window for auth errors in master-slave mode (30 
seconds)
+    long maxAuthRetryTimeWindow = TimeUnit.SECONDS.toMillis(90);
+    Long firstAuthErrorTime = null;
+
     while (isStarted()) {
       if (tokenIsReady()) {
         String token = getToken();
         try {
-          return 
operation.call(OptimizingClientPools.getClient(config.getAmsUrl()), token);
+          // Reset retry time on successful call
+          firstAuthErrorTime = null;
+          return operation.call(OptimizingClientPools.getClient(amsUrl), 
token);

Review Comment:
   elapsedTime every time is near 0, maxAuthRetryTimeWindow is not to do it



##########
amoro-ams/src/main/java/org/apache/amoro/server/DBBucketAssignStore.java:
##########
@@ -180,6 +180,25 @@ public void updateLastUpdateTime(AmsServerInfo nodeInfo) 
throws BucketAssignStor
     }
   }
 
+  @Override
+  public List<AmsServerInfo> getAliveNodes() throws BucketAssignStoreException 
{
+    try {
+      List<BucketAssignmentMeta> rows =
+          getAs(BucketAssignMapper.class, mapper -> 
mapper.selectAllByCluster(clusterName));

Review Comment:
   `getAliveNodes()` directly iterates through all `bucket_assignments` and 
returns them (only verifying the port), without filtering by 
`node_heartbeat_ts`. Could this result in the optimizer's node list containing 
offline nodes, triggering additional retries?



##########
amoro-ams/src/main/resources/mysql/upgrade.sql:
##########
@@ -164,12 +164,13 @@ ADD COLUMN `process_parameters` mediumtext COMMENT 'Table 
process parameters';
 
 -- ADD table bucket_assignments for storing assigned info
 CREATE TABLE IF NOT EXISTS bucket_assignments (
-                                                  cluster_name       
VARCHAR(64)   NOT NULL COMMENT 'AMS cluster name',
+    cluster_name       VARCHAR(64)   NOT NULL COMMENT 'AMS cluster name',
     node_key           VARCHAR(256) NOT NULL COMMENT 'Node key 
(host:thriftBindPort)',
     server_info_json   TEXT         NULL COMMENT 'JSON encoded AmsServerInfo',
     assignments_json   TEXT         NULL COMMENT 'JSON array of bucket IDs',
     last_update_time   BIGINT       NOT NULL DEFAULT 0 COMMENT 'Last update 
timestamp (ms since epoch)',
+    node_heartbeat_ts  BIGINT       NOT NULL DEFAULT 0 COMMENT 'Per-node 
heartbeat timestamp updated only by the owning node (ms since epoch)',

Review Comment:
   `ALTER TABLE ... ADD COLUMN (node_heartbeat_ts)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to