Copilot commented on code in PR #64275:
URL: https://github.com/apache/doris/pull/64275#discussion_r3378484699
##########
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java:
##########
@@ -120,18 +121,100 @@ public void
renameVirtualClusterInfoFromMapsNoLock(String clusterId, String oldC
}
public ComputeGroup getComputeGroupByName(String computeGroupName) {
- LOG.debug("get id {} computeGroupIdToComputeGroup : {} ",
computeGroupName, computeGroupIdToComputeGroup);
+ // rlock guards the compound name->id->group lookup: writers
(add/remove/rename)
+ // update both maps under wlock, and the read must observe a
consistent snapshot
+ // so callers like getPhysicalCluster don't transiently see a virtual
group name
+ // with a null group and fall back to treating it as a physical
cluster.
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("get id {} computeGroupIdToComputeGroup : {} ",
computeGroupName, computeGroupIdToComputeGroup);
+ }
try {
rlock.lock();
- if (!clusterNameToId.containsKey(computeGroupName)) {
- return null;
- }
- return
computeGroupIdToComputeGroup.get(clusterNameToId.get(computeGroupName));
+ String id = clusterNameToId.get(computeGroupName);
+ return id == null ? null : computeGroupIdToComputeGroup.get(id);
} finally {
rlock.unlock();
}
}
+ public boolean containsCloudCluster(String clusterName) {
+ return !Strings.isNullOrEmpty(clusterName) &&
clusterNameToId.containsKey(clusterName);
+ }
Review Comment:
`containsCloudCluster` reads `clusterNameToId` without taking `rlock`, but
writers update this map under `wlock` using a remove+put sequence (e.g.
`updateClusterNameToId`). Without the read lock, callers can observe the
transient state (name removed, new name not yet added) and incorrectly treat a
valid cluster as not registered, throwing `CURRENT_COMPUTE_GROUP_NOT_EXIST`.
##########
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java:
##########
@@ -120,18 +121,100 @@ public void
renameVirtualClusterInfoFromMapsNoLock(String clusterId, String oldC
}
public ComputeGroup getComputeGroupByName(String computeGroupName) {
- LOG.debug("get id {} computeGroupIdToComputeGroup : {} ",
computeGroupName, computeGroupIdToComputeGroup);
+ // rlock guards the compound name->id->group lookup: writers
(add/remove/rename)
+ // update both maps under wlock, and the read must observe a
consistent snapshot
+ // so callers like getPhysicalCluster don't transiently see a virtual
group name
+ // with a null group and fall back to treating it as a physical
cluster.
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("get id {} computeGroupIdToComputeGroup : {} ",
computeGroupName, computeGroupIdToComputeGroup);
+ }
try {
rlock.lock();
- if (!clusterNameToId.containsKey(computeGroupName)) {
- return null;
- }
- return
computeGroupIdToComputeGroup.get(clusterNameToId.get(computeGroupName));
+ String id = clusterNameToId.get(computeGroupName);
+ return id == null ? null : computeGroupIdToComputeGroup.get(id);
} finally {
rlock.unlock();
}
}
+ public boolean containsCloudCluster(String clusterName) {
+ return !Strings.isNullOrEmpty(clusterName) &&
clusterNameToId.containsKey(clusterName);
+ }
+
+ // Resolve the cluster id for the current ConnectContext: physical-cluster
lookup,
+ // priv check, status check (reject MANUAL_SHUTDOWN), wait-for-autoStart,
existence
+ // check, finally name->id mapping. The result is identical for every
tablet/replica
+ // within a single request, so hot paths should resolve once and reuse the
cached value.
+ public String getCurrentClusterId() throws ComputeGroupException {
+ ConnectContext context = ConnectContext.get();
+ if (context == null) {
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("connect context is null in getCurrentClusterId");
+ }
+ throw new ComputeGroupException("connect context not set cluster ",
+
ComputeGroupException.FailedTypeEnum.CONNECT_CONTEXT_NOT_SET);
+ }
+
+ String cluster = getPhysicalCluster(context.getCloudCluster());
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("get compute group by context {}", cluster);
+ }
+
+ UserIdentity currentUid = context.getCurrentUserIdentity();
+ if (currentUid == null ||
Strings.isNullOrEmpty(currentUid.getQualifiedUser())) {
+ LOG.info("connect context user is null.");
+ throw new ComputeGroupException("connect context's user is null",
+
ComputeGroupException.FailedTypeEnum.CURRENT_USER_NO_AUTH_TO_USE_DEFAULT_COMPUTE_GROUP);
+ }
+ try {
+ ((CloudEnv) Env.getCurrentEnv()).checkCloudClusterPriv(cluster);
+ } catch (Exception e) {
+ LOG.warn("check compute group {} for {} auth failed.", cluster,
currentUid);
+ throw new ComputeGroupException(
+ String.format("context compute group %s check auth failed,
user is %s", cluster, currentUid),
+
ComputeGroupException.FailedTypeEnum.CURRENT_USER_NO_AUTH_TO_USE_DEFAULT_COMPUTE_GROUP);
Review Comment:
`checkCloudClusterPriv` failures are logged without the caught exception,
which drops the stack trace and makes auth/debugging issues much harder to
diagnose in production. Please include the exception as the last parameter to
`LOG.warn`.
##########
fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java:
##########
@@ -4218,9 +4232,18 @@ public TReplacePartitionResult
replacePartition(TReplacePartitionRequest request
// BE id -> path hash
Multimap<Long, Long> bePathsMap;
try {
- if (Config.isCloudMode() && request.isSetBeEndpoint())
{
- bePathsMap = ((CloudTablet) tablet)
-
.getNormalReplicaBackendPathMapCloud(request.be_endpoint);
+ if (tablet instanceof CloudTablet) {
+ CloudTablet cloudTablet = (CloudTablet) tablet;
+ if (replaceHasBeEndpoint) {
+ bePathsMap =
cloudTablet.getNormalReplicaBackendPathMap(request.be_endpoint);
+ } else {
+ if (replaceCachedClusterId == null) {
+ replaceCachedClusterId =
((CloudSystemInfoService) Env.getCurrentSystemInfo())
+ .getCurrentClusterId();
+ }
Review Comment:
`replacePartition` caches `clusterId` by casting
`Env.getCurrentSystemInfo()` to `CloudSystemInfoService`. This can throw
`ClassCastException` (bypassing the existing `UserException` handling) if the
system info service is not cloud-aware (common in unit tests where
`Env::getCurrentSystemInfo` is mocked).
##########
fe/fe-core/src/main/java/org/apache/doris/planner/OlapTableSink.java:
##########
@@ -773,7 +777,16 @@ private List<TOlapTableLocationParam> createLocation(long
dbId, OlapTable table)
StringBuilder errMsgBuilder = new StringBuilder();
Multimap<Long, Long> bePathsMap = HashMultimap.create();
try {
- bePathsMap = tablet.getNormalReplicaBackendPathMap();
+ if (tablet instanceof CloudTablet) {
+ if (cachedClusterId == null) {
+ cachedClusterId = ((CloudSystemInfoService)
Env.getCurrentSystemInfo())
+ .getCurrentClusterId();
+ }
+ bePathsMap = ((CloudTablet) tablet)
+
.getNormalReplicaBackendPathMapByClusterId(cachedClusterId);
+ } else {
Review Comment:
`createLocation` resolves `cachedClusterId` via a hard cast to
`CloudSystemInfoService`. If `Env.getCurrentSystemInfo()` is mocked/initialized
as the base `SystemInfoService`, this will throw a `ClassCastException` that is
not caught by the existing `catch (ComputeGroupException e)` block.
##########
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java:
##########
@@ -120,18 +121,100 @@ public void
renameVirtualClusterInfoFromMapsNoLock(String clusterId, String oldC
}
public ComputeGroup getComputeGroupByName(String computeGroupName) {
- LOG.debug("get id {} computeGroupIdToComputeGroup : {} ",
computeGroupName, computeGroupIdToComputeGroup);
+ // rlock guards the compound name->id->group lookup: writers
(add/remove/rename)
+ // update both maps under wlock, and the read must observe a
consistent snapshot
+ // so callers like getPhysicalCluster don't transiently see a virtual
group name
+ // with a null group and fall back to treating it as a physical
cluster.
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("get id {} computeGroupIdToComputeGroup : {} ",
computeGroupName, computeGroupIdToComputeGroup);
+ }
try {
rlock.lock();
- if (!clusterNameToId.containsKey(computeGroupName)) {
- return null;
- }
- return
computeGroupIdToComputeGroup.get(clusterNameToId.get(computeGroupName));
+ String id = clusterNameToId.get(computeGroupName);
+ return id == null ? null : computeGroupIdToComputeGroup.get(id);
} finally {
rlock.unlock();
}
}
+ public boolean containsCloudCluster(String clusterName) {
+ return !Strings.isNullOrEmpty(clusterName) &&
clusterNameToId.containsKey(clusterName);
+ }
+
+ // Resolve the cluster id for the current ConnectContext: physical-cluster
lookup,
+ // priv check, status check (reject MANUAL_SHUTDOWN), wait-for-autoStart,
existence
+ // check, finally name->id mapping. The result is identical for every
tablet/replica
+ // within a single request, so hot paths should resolve once and reuse the
cached value.
+ public String getCurrentClusterId() throws ComputeGroupException {
+ ConnectContext context = ConnectContext.get();
+ if (context == null) {
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("connect context is null in getCurrentClusterId");
+ }
+ throw new ComputeGroupException("connect context not set cluster ",
+
ComputeGroupException.FailedTypeEnum.CONNECT_CONTEXT_NOT_SET);
+ }
+
+ String cluster = getPhysicalCluster(context.getCloudCluster());
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("get compute group by context {}", cluster);
+ }
+
+ UserIdentity currentUid = context.getCurrentUserIdentity();
+ if (currentUid == null ||
Strings.isNullOrEmpty(currentUid.getQualifiedUser())) {
+ LOG.info("connect context user is null.");
+ throw new ComputeGroupException("connect context's user is null",
+
ComputeGroupException.FailedTypeEnum.CURRENT_USER_NO_AUTH_TO_USE_DEFAULT_COMPUTE_GROUP);
+ }
+ try {
+ ((CloudEnv) Env.getCurrentEnv()).checkCloudClusterPriv(cluster);
+ } catch (Exception e) {
+ LOG.warn("check compute group {} for {} auth failed.", cluster,
currentUid);
+ throw new ComputeGroupException(
+ String.format("context compute group %s check auth failed,
user is %s", cluster, currentUid),
+
ComputeGroupException.FailedTypeEnum.CURRENT_USER_NO_AUTH_TO_USE_DEFAULT_COMPUTE_GROUP);
+ }
+
+ String clusterStatus = getCloudStatusByName(cluster);
+ if (!Strings.isNullOrEmpty(clusterStatus)
+ && Cloud.ClusterStatus.valueOf(clusterStatus) ==
Cloud.ClusterStatus.MANUAL_SHUTDOWN) {
+ LOG.warn("auto start compute group {} in manual shutdown status",
cluster);
+ throw new ComputeGroupException(
+ String.format("The current compute group %s has been
manually shutdown", cluster),
+
ComputeGroupException.FailedTypeEnum.CURRENT_COMPUTE_GROUP_BEEN_MANUAL_SHUTDOWN);
+ }
+
+ return resolveClusterIdByName(cluster);
+ }
+
+ // Resolve a known cluster name to its id, handling auto-start (cluster
may resume
+ // under a different name) and validating the cluster is registered.
+ public String resolveClusterIdByName(String cluster) throws
ComputeGroupException {
+ String wakeUPCluster = "";
+ try {
+ wakeUPCluster = waitForAutoStart(cluster);
+ } catch (DdlException e) {
+ LOG.warn("cant resume compute group {}, exception", cluster, e);
+ }
+ if (!Strings.isNullOrEmpty(wakeUPCluster) &&
!cluster.equals(wakeUPCluster)) {
+ cluster = wakeUPCluster;
+ LOG.warn("get backend input compute group {} useless, so auto
start choose a new one compute group {}",
+ cluster, wakeUPCluster);
+ }
Review Comment:
In `resolveClusterIdByName`, `cluster` is overwritten with `wakeUPCluster`
before logging, so the warn log prints the new cluster name twice and loses the
original input value. This makes troubleshooting auto-start/redirect behavior
confusing.
##########
fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudTablet.java:
##########
@@ -71,7 +72,19 @@ private Multimap<Long, Long>
backendPathMapReprocess(Multimap<Long, Long> pathMa
@Override
public Multimap<Long, Long> getNormalReplicaBackendPathMap() throws
UserException {
- Multimap<Long, Long> pathMap = super.getNormalReplicaBackendPathMap();
+ // Per-tablet entry point: resolves cluster id here for callers that
don't hoist it.
+ // High-tablet-count callers should resolve once and pass the cluster
id in via
+ // getNormalReplicaBackendPathMapByClusterId to amortize the
resolution.
+ String clusterId = ((CloudSystemInfoService)
Env.getCurrentSystemInfo()).getCurrentClusterId();
+ return getNormalReplicaBackendPathMapByClusterId(clusterId);
+ }
Review Comment:
`CloudTablet.getNormalReplicaBackendPathMap()` now unconditionally casts
`Env.getCurrentSystemInfo()` to `CloudSystemInfoService`. In cloud-mode unit
tests, `Env.getCurrentSystemInfo()` is sometimes mocked as the base
`SystemInfoService` (see e.g. tests that mock `Env::getCurrentSystemInfo`),
which would now fail with a `ClassCastException` instead of a controlled
`UserException`.
##########
fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java:
##########
@@ -3899,9 +3902,17 @@ public TCreatePartitionResult
createPartition(TCreatePartitionRequest request) t
// BE id -> path hash
Multimap<Long, Long> bePathsMap;
try {
- if (Config.isCloudMode() && request.isSetBeEndpoint())
{
- bePathsMap = ((CloudTablet) tablet)
-
.getNormalReplicaBackendPathMapCloud(request.be_endpoint);
+ if (tablet instanceof CloudTablet) {
+ CloudTablet cloudTablet = (CloudTablet) tablet;
+ if (hasBeEndpoint) {
+ bePathsMap =
cloudTablet.getNormalReplicaBackendPathMap(request.be_endpoint);
+ } else {
+ if (cachedClusterId == null) {
+ cachedClusterId =
((CloudSystemInfoService) Env.getCurrentSystemInfo())
+ .getCurrentClusterId();
+ }
+ bePathsMap =
cloudTablet.getNormalReplicaBackendPathMapByClusterId(cachedClusterId);
Review Comment:
`createPartition` caches `clusterId` by casting `Env.getCurrentSystemInfo()`
to `CloudSystemInfoService`. If FE is running with a non-cloud
`SystemInfoService` (or tests mock the base type), this will throw
`ClassCastException` and bypass the existing `UserException` handling.
##########
fe/fe-core/src/main/java/org/apache/doris/planner/OlapScanNode.java:
##########
@@ -573,7 +578,16 @@ private void addScanRangeLocations(Partition partition,
replicas.sort(Replica.ID_COMPARATOR);
Replica replica = replicas.get(useFixReplica >=
replicas.size() ? replicas.size() - 1 : useFixReplica);
if
(context.getSessionVariable().fallbackOtherReplicaWhenFixedCorrupt) {
- long beId = replica.getBackendId();
+ long beId;
+ if (replica instanceof CloudReplica) {
+ if (cachedClusterId == null) {
+ cachedClusterId = ((CloudSystemInfoService)
Env.getCurrentSystemInfo())
+ .getCurrentClusterId();
+ }
+ beId = ((CloudReplica)
replica).getBackendIdWithClusterId(cachedClusterId);
Review Comment:
`addScanRangeLocations` caches `clusterId` via `((CloudSystemInfoService)
Env.getCurrentSystemInfo()).getCurrentClusterId()`. If
`Env.getCurrentSystemInfo()` is not actually a `CloudSystemInfoService` (e.g.
in cloud-mode unit tests that mock the base type), this will throw
`ClassCastException` and bypass the existing `ComputeGroupException` handling.
##########
fe/fe-core/src/main/java/org/apache/doris/planner/OlapScanNode.java:
##########
@@ -627,7 +641,15 @@ private void addScanRangeLocations(Partition partition,
Backend backend = null;
long backendId = -1;
try {
- backendId = replica.getBackendId();
+ if (replica instanceof CloudReplica) {
+ if (cachedClusterId == null) {
+ cachedClusterId = ((CloudSystemInfoService)
Env.getCurrentSystemInfo())
+ .getCurrentClusterId();
+ }
+ backendId = ((CloudReplica)
replica).getBackendIdWithClusterId(cachedClusterId);
+ } else {
+ backendId = replica.getBackendId();
+ }
Review Comment:
Same as above: `Env.getCurrentSystemInfo()` is hard-cast to
`CloudSystemInfoService` when resolving `cachedClusterId`, which can throw
`ClassCastException` and escape the existing `ComputeGroupException` catch
paths. Use an `instanceof` guard and throw a `ComputeGroupException` instead.
##########
fe/fe-core/src/test/java/org/apache/doris/cloud/system/CloudSystemInfoServiceTest.java:
##########
@@ -973,6 +973,21 @@ public void
testGetMinPipelineExecutorSizeWithConnectContext() {
}
}
+ @Test
+ public void testContainsCloudCluster() {
+ infoService = new CloudSystemInfoService();
+ // Empty / null inputs short-circuit without touching the map.
+ Assert.assertFalse(infoService.containsCloudCluster(null));
+ Assert.assertFalse(infoService.containsCloudCluster(""));
+ // Unknown cluster name -> false.
+ Assert.assertFalse(infoService.containsCloudCluster("absent_cluster"));
+ // Register a cluster; lookup must hit.
+ infoService.addVirtualClusterInfoToMapsNoLock("cid_1", "cluster_1");
+ Assert.assertTrue(infoService.containsCloudCluster("cluster_1"));
+ // Different name in same map -> still false.
+ Assert.assertFalse(infoService.containsCloudCluster("cluster_2"));
+ }
Review Comment:
The PR adds substantial new logic (`getCurrentClusterId` /
`resolveClusterIdByName`) that centralizes auth/status/auto-start/existence
checks, but the new test only covers `containsCloudCluster`. Consider extending
the unit tests to cover at least: (1) auth failure path, (2) MANUAL_SHUTDOWN
status rejection, and (3) auto-start rename via `waitForAutoStart`, to prevent
regressions on these critical control paths.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]