Copilot commented on code in PR #64275:
URL: https://github.com/apache/doris/pull/64275#discussion_r3378484699


##########
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java:
##########
@@ -120,18 +121,100 @@ public void 
renameVirtualClusterInfoFromMapsNoLock(String clusterId, String oldC
     }
 
     public ComputeGroup getComputeGroupByName(String computeGroupName) {
-        LOG.debug("get id {} computeGroupIdToComputeGroup : {} ", 
computeGroupName, computeGroupIdToComputeGroup);
+        // rlock guards the compound name->id->group lookup: writers 
(add/remove/rename)
+        // update both maps under wlock, and the read must observe a 
consistent snapshot
+        // so callers like getPhysicalCluster don't transiently see a virtual 
group name
+        // with a null group and fall back to treating it as a physical 
cluster.
+        if (LOG.isDebugEnabled()) {
+            LOG.debug("get id {} computeGroupIdToComputeGroup : {} ", 
computeGroupName, computeGroupIdToComputeGroup);
+        }
         try {
             rlock.lock();
-            if (!clusterNameToId.containsKey(computeGroupName)) {
-                return null;
-            }
-            return 
computeGroupIdToComputeGroup.get(clusterNameToId.get(computeGroupName));
+            String id = clusterNameToId.get(computeGroupName);
+            return id == null ? null : computeGroupIdToComputeGroup.get(id);
         } finally {
             rlock.unlock();
         }
     }
 
+    public boolean containsCloudCluster(String clusterName) {
+        return !Strings.isNullOrEmpty(clusterName) && 
clusterNameToId.containsKey(clusterName);
+    }

Review Comment:
   `containsCloudCluster` reads `clusterNameToId` without taking `rlock`, but 
writers update this map under `wlock` using a remove+put sequence (e.g. 
`updateClusterNameToId`). Without the read lock, callers can observe the 
transient state (name removed, new name not yet added) and incorrectly treat a 
valid cluster as not registered, throwing `CURRENT_COMPUTE_GROUP_NOT_EXIST`.



##########
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java:
##########
@@ -120,18 +121,100 @@ public void 
renameVirtualClusterInfoFromMapsNoLock(String clusterId, String oldC
     }
 
     public ComputeGroup getComputeGroupByName(String computeGroupName) {
-        LOG.debug("get id {} computeGroupIdToComputeGroup : {} ", 
computeGroupName, computeGroupIdToComputeGroup);
+        // rlock guards the compound name->id->group lookup: writers 
(add/remove/rename)
+        // update both maps under wlock, and the read must observe a 
consistent snapshot
+        // so callers like getPhysicalCluster don't transiently see a virtual 
group name
+        // with a null group and fall back to treating it as a physical 
cluster.
+        if (LOG.isDebugEnabled()) {
+            LOG.debug("get id {} computeGroupIdToComputeGroup : {} ", 
computeGroupName, computeGroupIdToComputeGroup);
+        }
         try {
             rlock.lock();
-            if (!clusterNameToId.containsKey(computeGroupName)) {
-                return null;
-            }
-            return 
computeGroupIdToComputeGroup.get(clusterNameToId.get(computeGroupName));
+            String id = clusterNameToId.get(computeGroupName);
+            return id == null ? null : computeGroupIdToComputeGroup.get(id);
         } finally {
             rlock.unlock();
         }
     }
 
+    public boolean containsCloudCluster(String clusterName) {
+        return !Strings.isNullOrEmpty(clusterName) && 
clusterNameToId.containsKey(clusterName);
+    }
+
+    // Resolve the cluster id for the current ConnectContext: physical-cluster 
lookup,
+    // priv check, status check (reject MANUAL_SHUTDOWN), wait-for-autoStart, 
existence
+    // check, finally name->id mapping. The result is identical for every 
tablet/replica
+    // within a single request, so hot paths should resolve once and reuse the 
cached value.
+    public String getCurrentClusterId() throws ComputeGroupException {
+        ConnectContext context = ConnectContext.get();
+        if (context == null) {
+            if (LOG.isDebugEnabled()) {
+                LOG.debug("connect context is null in getCurrentClusterId");
+            }
+            throw new ComputeGroupException("connect context not set cluster ",
+                    
ComputeGroupException.FailedTypeEnum.CONNECT_CONTEXT_NOT_SET);
+        }
+
+        String cluster = getPhysicalCluster(context.getCloudCluster());
+        if (LOG.isDebugEnabled()) {
+            LOG.debug("get compute group by context {}", cluster);
+        }
+
+        UserIdentity currentUid = context.getCurrentUserIdentity();
+        if (currentUid == null || 
Strings.isNullOrEmpty(currentUid.getQualifiedUser())) {
+            LOG.info("connect context user is null.");
+            throw new ComputeGroupException("connect context's user is null",
+                    
ComputeGroupException.FailedTypeEnum.CURRENT_USER_NO_AUTH_TO_USE_DEFAULT_COMPUTE_GROUP);
+        }
+        try {
+            ((CloudEnv) Env.getCurrentEnv()).checkCloudClusterPriv(cluster);
+        } catch (Exception e) {
+            LOG.warn("check compute group {} for {} auth failed.", cluster, 
currentUid);
+            throw new ComputeGroupException(
+                    String.format("context compute group %s check auth failed, 
user is %s", cluster, currentUid),
+                    
ComputeGroupException.FailedTypeEnum.CURRENT_USER_NO_AUTH_TO_USE_DEFAULT_COMPUTE_GROUP);

Review Comment:
   `checkCloudClusterPriv` failures are logged without the caught exception, 
which drops the stack trace and makes auth/debugging issues much harder to 
diagnose in production. Please include the exception as the last parameter to 
`LOG.warn`.



##########
fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java:
##########
@@ -4218,9 +4232,18 @@ public TReplacePartitionResult 
replacePartition(TReplacePartitionRequest request
                     // BE id -> path hash
                     Multimap<Long, Long> bePathsMap;
                     try {
-                        if (Config.isCloudMode() && request.isSetBeEndpoint()) 
{
-                            bePathsMap = ((CloudTablet) tablet)
-                                    
.getNormalReplicaBackendPathMapCloud(request.be_endpoint);
+                        if (tablet instanceof CloudTablet) {
+                            CloudTablet cloudTablet = (CloudTablet) tablet;
+                            if (replaceHasBeEndpoint) {
+                                bePathsMap = 
cloudTablet.getNormalReplicaBackendPathMap(request.be_endpoint);
+                            } else {
+                                if (replaceCachedClusterId == null) {
+                                    replaceCachedClusterId = 
((CloudSystemInfoService) Env.getCurrentSystemInfo())
+                                            .getCurrentClusterId();
+                                }

Review Comment:
   `replacePartition` caches `clusterId` by casting 
`Env.getCurrentSystemInfo()` to `CloudSystemInfoService`. This can throw 
`ClassCastException` (bypassing the existing `UserException` handling) if the 
system info service is not cloud-aware (common in unit tests where 
`Env::getCurrentSystemInfo` is mocked).



##########
fe/fe-core/src/main/java/org/apache/doris/planner/OlapTableSink.java:
##########
@@ -773,7 +777,16 @@ private List<TOlapTableLocationParam> createLocation(long 
dbId, OlapTable table)
                     StringBuilder errMsgBuilder = new StringBuilder();
                     Multimap<Long, Long> bePathsMap = HashMultimap.create();
                     try {
-                        bePathsMap = tablet.getNormalReplicaBackendPathMap();
+                        if (tablet instanceof CloudTablet) {
+                            if (cachedClusterId == null) {
+                                cachedClusterId = ((CloudSystemInfoService) 
Env.getCurrentSystemInfo())
+                                        .getCurrentClusterId();
+                            }
+                            bePathsMap = ((CloudTablet) tablet)
+                                    
.getNormalReplicaBackendPathMapByClusterId(cachedClusterId);
+                        } else {

Review Comment:
   `createLocation` resolves `cachedClusterId` via a hard cast to 
`CloudSystemInfoService`. If `Env.getCurrentSystemInfo()` is mocked/initialized 
as the base `SystemInfoService`, this will throw a `ClassCastException` that is 
not caught by the existing `catch (ComputeGroupException e)` block.



##########
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java:
##########
@@ -120,18 +121,100 @@ public void 
renameVirtualClusterInfoFromMapsNoLock(String clusterId, String oldC
     }
 
     public ComputeGroup getComputeGroupByName(String computeGroupName) {
-        LOG.debug("get id {} computeGroupIdToComputeGroup : {} ", 
computeGroupName, computeGroupIdToComputeGroup);
+        // rlock guards the compound name->id->group lookup: writers 
(add/remove/rename)
+        // update both maps under wlock, and the read must observe a 
consistent snapshot
+        // so callers like getPhysicalCluster don't transiently see a virtual 
group name
+        // with a null group and fall back to treating it as a physical 
cluster.
+        if (LOG.isDebugEnabled()) {
+            LOG.debug("get id {} computeGroupIdToComputeGroup : {} ", 
computeGroupName, computeGroupIdToComputeGroup);
+        }
         try {
             rlock.lock();
-            if (!clusterNameToId.containsKey(computeGroupName)) {
-                return null;
-            }
-            return 
computeGroupIdToComputeGroup.get(clusterNameToId.get(computeGroupName));
+            String id = clusterNameToId.get(computeGroupName);
+            return id == null ? null : computeGroupIdToComputeGroup.get(id);
         } finally {
             rlock.unlock();
         }
     }
 
+    public boolean containsCloudCluster(String clusterName) {
+        return !Strings.isNullOrEmpty(clusterName) && 
clusterNameToId.containsKey(clusterName);
+    }
+
+    // Resolve the cluster id for the current ConnectContext: physical-cluster 
lookup,
+    // priv check, status check (reject MANUAL_SHUTDOWN), wait-for-autoStart, 
existence
+    // check, finally name->id mapping. The result is identical for every 
tablet/replica
+    // within a single request, so hot paths should resolve once and reuse the 
cached value.
+    public String getCurrentClusterId() throws ComputeGroupException {
+        ConnectContext context = ConnectContext.get();
+        if (context == null) {
+            if (LOG.isDebugEnabled()) {
+                LOG.debug("connect context is null in getCurrentClusterId");
+            }
+            throw new ComputeGroupException("connect context not set cluster ",
+                    
ComputeGroupException.FailedTypeEnum.CONNECT_CONTEXT_NOT_SET);
+        }
+
+        String cluster = getPhysicalCluster(context.getCloudCluster());
+        if (LOG.isDebugEnabled()) {
+            LOG.debug("get compute group by context {}", cluster);
+        }
+
+        UserIdentity currentUid = context.getCurrentUserIdentity();
+        if (currentUid == null || 
Strings.isNullOrEmpty(currentUid.getQualifiedUser())) {
+            LOG.info("connect context user is null.");
+            throw new ComputeGroupException("connect context's user is null",
+                    
ComputeGroupException.FailedTypeEnum.CURRENT_USER_NO_AUTH_TO_USE_DEFAULT_COMPUTE_GROUP);
+        }
+        try {
+            ((CloudEnv) Env.getCurrentEnv()).checkCloudClusterPriv(cluster);
+        } catch (Exception e) {
+            LOG.warn("check compute group {} for {} auth failed.", cluster, 
currentUid);
+            throw new ComputeGroupException(
+                    String.format("context compute group %s check auth failed, 
user is %s", cluster, currentUid),
+                    
ComputeGroupException.FailedTypeEnum.CURRENT_USER_NO_AUTH_TO_USE_DEFAULT_COMPUTE_GROUP);
+        }
+
+        String clusterStatus = getCloudStatusByName(cluster);
+        if (!Strings.isNullOrEmpty(clusterStatus)
+                && Cloud.ClusterStatus.valueOf(clusterStatus) == 
Cloud.ClusterStatus.MANUAL_SHUTDOWN) {
+            LOG.warn("auto start compute group {} in manual shutdown status", 
cluster);
+            throw new ComputeGroupException(
+                    String.format("The current compute group %s has been 
manually shutdown", cluster),
+                    
ComputeGroupException.FailedTypeEnum.CURRENT_COMPUTE_GROUP_BEEN_MANUAL_SHUTDOWN);
+        }
+
+        return resolveClusterIdByName(cluster);
+    }
+
+    // Resolve a known cluster name to its id, handling auto-start (cluster 
may resume
+    // under a different name) and validating the cluster is registered.
+    public String resolveClusterIdByName(String cluster) throws 
ComputeGroupException {
+        String wakeUPCluster = "";
+        try {
+            wakeUPCluster = waitForAutoStart(cluster);
+        } catch (DdlException e) {
+            LOG.warn("cant resume compute group {}, exception", cluster, e);
+        }
+        if (!Strings.isNullOrEmpty(wakeUPCluster) && 
!cluster.equals(wakeUPCluster)) {
+            cluster = wakeUPCluster;
+            LOG.warn("get backend input compute group {} useless, so auto 
start choose a new one compute group {}",
+                    cluster, wakeUPCluster);
+        }

Review Comment:
   In `resolveClusterIdByName`, `cluster` is overwritten with `wakeUPCluster` 
before logging, so the warn log prints the new cluster name twice and loses the 
original input value. This makes troubleshooting auto-start/redirect behavior 
confusing.



##########
fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudTablet.java:
##########
@@ -71,7 +72,19 @@ private Multimap<Long, Long> 
backendPathMapReprocess(Multimap<Long, Long> pathMa
 
     @Override
     public Multimap<Long, Long> getNormalReplicaBackendPathMap() throws 
UserException {
-        Multimap<Long, Long> pathMap = super.getNormalReplicaBackendPathMap();
+        // Per-tablet entry point: resolves cluster id here for callers that 
don't hoist it.
+        // High-tablet-count callers should resolve once and pass the cluster 
id in via
+        // getNormalReplicaBackendPathMapByClusterId to amortize the 
resolution.
+        String clusterId = ((CloudSystemInfoService) 
Env.getCurrentSystemInfo()).getCurrentClusterId();
+        return getNormalReplicaBackendPathMapByClusterId(clusterId);
+    }

Review Comment:
   `CloudTablet.getNormalReplicaBackendPathMap()` now unconditionally casts 
`Env.getCurrentSystemInfo()` to `CloudSystemInfoService`. In cloud-mode unit 
tests, `Env.getCurrentSystemInfo()` is sometimes mocked as the base 
`SystemInfoService` (see e.g. tests that mock `Env::getCurrentSystemInfo`), 
which would now fail with a `ClassCastException` instead of a controlled 
`UserException`.



##########
fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java:
##########
@@ -3899,9 +3902,17 @@ public TCreatePartitionResult 
createPartition(TCreatePartitionRequest request) t
                     // BE id -> path hash
                     Multimap<Long, Long> bePathsMap;
                     try {
-                        if (Config.isCloudMode() && request.isSetBeEndpoint()) 
{
-                            bePathsMap = ((CloudTablet) tablet)
-                                    
.getNormalReplicaBackendPathMapCloud(request.be_endpoint);
+                        if (tablet instanceof CloudTablet) {
+                            CloudTablet cloudTablet = (CloudTablet) tablet;
+                            if (hasBeEndpoint) {
+                                bePathsMap = 
cloudTablet.getNormalReplicaBackendPathMap(request.be_endpoint);
+                            } else {
+                                if (cachedClusterId == null) {
+                                    cachedClusterId = 
((CloudSystemInfoService) Env.getCurrentSystemInfo())
+                                            .getCurrentClusterId();
+                                }
+                                bePathsMap = 
cloudTablet.getNormalReplicaBackendPathMapByClusterId(cachedClusterId);

Review Comment:
   `createPartition` caches `clusterId` by casting `Env.getCurrentSystemInfo()` 
to `CloudSystemInfoService`. If FE is running with a non-cloud 
`SystemInfoService` (or tests mock the base type), this will throw 
`ClassCastException` and bypass the existing `UserException` handling.



##########
fe/fe-core/src/main/java/org/apache/doris/planner/OlapScanNode.java:
##########
@@ -573,7 +578,16 @@ private void addScanRangeLocations(Partition partition,
                 replicas.sort(Replica.ID_COMPARATOR);
                 Replica replica = replicas.get(useFixReplica >= 
replicas.size() ? replicas.size() - 1 : useFixReplica);
                 if 
(context.getSessionVariable().fallbackOtherReplicaWhenFixedCorrupt) {
-                    long beId = replica.getBackendId();
+                    long beId;
+                    if (replica instanceof CloudReplica) {
+                        if (cachedClusterId == null) {
+                            cachedClusterId = ((CloudSystemInfoService) 
Env.getCurrentSystemInfo())
+                                    .getCurrentClusterId();
+                        }
+                        beId = ((CloudReplica) 
replica).getBackendIdWithClusterId(cachedClusterId);

Review Comment:
   `addScanRangeLocations` caches `clusterId` via `((CloudSystemInfoService) 
Env.getCurrentSystemInfo()).getCurrentClusterId()`. If 
`Env.getCurrentSystemInfo()` is not actually a `CloudSystemInfoService` (e.g. 
in cloud-mode unit tests that mock the base type), this will throw 
`ClassCastException` and bypass the existing `ComputeGroupException` handling.



##########
fe/fe-core/src/main/java/org/apache/doris/planner/OlapScanNode.java:
##########
@@ -627,7 +641,15 @@ private void addScanRangeLocations(Partition partition,
                 Backend backend = null;
                 long backendId = -1;
                 try {
-                    backendId = replica.getBackendId();
+                    if (replica instanceof CloudReplica) {
+                        if (cachedClusterId == null) {
+                            cachedClusterId = ((CloudSystemInfoService) 
Env.getCurrentSystemInfo())
+                                    .getCurrentClusterId();
+                        }
+                        backendId = ((CloudReplica) 
replica).getBackendIdWithClusterId(cachedClusterId);
+                    } else {
+                        backendId = replica.getBackendId();
+                    }

Review Comment:
   Same as above: `Env.getCurrentSystemInfo()` is hard-cast to 
`CloudSystemInfoService` when resolving `cachedClusterId`, which can throw 
`ClassCastException` and escape the existing `ComputeGroupException` catch 
paths. Use an `instanceof` guard and throw a `ComputeGroupException` instead.



##########
fe/fe-core/src/test/java/org/apache/doris/cloud/system/CloudSystemInfoServiceTest.java:
##########
@@ -973,6 +973,21 @@ public void 
testGetMinPipelineExecutorSizeWithConnectContext() {
         }
     }
 
+    @Test
+    public void testContainsCloudCluster() {
+        infoService = new CloudSystemInfoService();
+        // Empty / null inputs short-circuit without touching the map.
+        Assert.assertFalse(infoService.containsCloudCluster(null));
+        Assert.assertFalse(infoService.containsCloudCluster(""));
+        // Unknown cluster name -> false.
+        Assert.assertFalse(infoService.containsCloudCluster("absent_cluster"));
+        // Register a cluster; lookup must hit.
+        infoService.addVirtualClusterInfoToMapsNoLock("cid_1", "cluster_1");
+        Assert.assertTrue(infoService.containsCloudCluster("cluster_1"));
+        // Different name in same map -> still false.
+        Assert.assertFalse(infoService.containsCloudCluster("cluster_2"));
+    }

Review Comment:
   The PR adds substantial new logic (`getCurrentClusterId` / 
`resolveClusterIdByName`) that centralizes auth/status/auto-start/existence 
checks, but the new test only covers `containsCloudCluster`. Consider extending 
the unit tests to cover at least: (1) auth failure path, (2) MANUAL_SHUTDOWN 
status rejection, and (3) auto-start rename via `waitForAutoStart`, to prevent 
regressions on these critical control paths.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to