Copilot commented on code in PR #61776:
URL: https://github.com/apache/doris/pull/61776#discussion_r2994467348


##########
fe/fe-core/src/main/java/org/apache/doris/fs/obj/AzureObjStorage.java:
##########
@@ -436,6 +453,86 @@ public Status globList(String remotePath, List<RemoteFile> 
result, boolean fileN
         return st;
     }
 
+    /**
+     * Get file metadata using getProperties requests for deterministic paths.
+     * This avoids requiring list permission when only read permission is 
granted.
+     *
+     * @param bucket       Azure container name
+     * @param keyPattern   The key pattern (may contain {..} brace or [...] 
bracket patterns but no wildcards)
+     * @param result       List to store matching RemoteFile objects
+     * @param fileNameOnly If true, only store file names; otherwise store 
full paths
+     * @param startTime    Start time for logging duration
+     * @return Status if successful, null if should fall back to listing
+     */
+    private Status globListByGetProperties(String bucket, String keyPattern,
+            List<RemoteFile> result, boolean fileNameOnly, long startTime) {
+        try {
+            // First expand [...] brackets to {...} braces, then expand {..} 
ranges, then expand braces
+            String expandedPattern = S3Util.expandBracketPatterns(keyPattern);
+            expandedPattern = S3Util.extendGlobs(expandedPattern);
+            List<String> expandedPaths = 
S3Util.expandBracePatterns(expandedPattern);
+
+            // Fall back to listing if too many paths to avoid overwhelming 
Azure with requests
+            // Controlled by config: s3_head_request_max_paths
+            if (expandedPaths.size() > Config.s3_head_request_max_paths) {
+                LOG.info("Expanded path count {} exceeds limit {}, falling 
back to LIST",
+                        expandedPaths.size(), 
Config.s3_head_request_max_paths);
+                return null;
+            }

Review Comment:
   `s3_head_request_max_paths` is checked only after expanding patterns into 
the full `expandedPaths` list, so extremely large deterministic patterns can 
still consume significant CPU/memory during expansion before triggering the 
fallback. Consider adding short-circuiting/early-abort expansion once the 
configured max is exceeded.



##########
fe/fe-core/src/main/java/org/apache/doris/fs/obj/S3ObjStorage.java:
##########
@@ -718,6 +741,91 @@ private GlobListResult globListInternal(String remotePath, 
List<RemoteFile> resu
         }
     }
 
+    /**
+     * Get file metadata using HEAD requests for deterministic paths.
+     * This avoids requiring ListBucket permission when only GetObject 
permission is granted.
+     *
+     * @param bucket       S3 bucket name
+     * @param keyPattern   The key pattern (may contain {..} brace or [...] 
bracket patterns but no wildcards)
+     * @param result       List to store matching RemoteFile objects
+     * @param fileNameOnly If true, only store file names; otherwise store 
full S3 paths
+     * @param startTime    Start time for logging duration
+     * @return GlobListResult if successful, null if should fall back to 
listing
+     */
+    private GlobListResult globListByHeadRequests(String bucket, String 
keyPattern,
+            List<RemoteFile> result, boolean fileNameOnly, long startTime) {
+        try {
+            // First expand [...] brackets to {...} braces, then expand {..} 
ranges, then expand braces
+            String expandedPattern = S3Util.expandBracketPatterns(keyPattern);
+            expandedPattern = S3Util.extendGlobs(expandedPattern);
+            List<String> expandedPaths = 
S3Util.expandBracePatterns(expandedPattern);
+
+            // Fall back to listing if too many paths to avoid overwhelming S3 
with HEAD requests
+            // Controlled by config: s3_head_request_max_paths
+            if (expandedPaths.size() > Config.s3_head_request_max_paths) {
+                LOG.info("Expanded path count {} exceeds limit {}, falling 
back to LIST",
+                        expandedPaths.size(), 
Config.s3_head_request_max_paths);
+                return null;
+            }

Review Comment:
   `s3_head_request_max_paths` is checked only after fully expanding 
bracket/range/brace patterns. For very large deterministic patterns, expansion 
itself can be expensive (CPU/memory) before you ever hit the limit. Consider 
adding an expansion routine that can short-circuit once the max is exceeded (or 
pre-compute/estimate expansion cardinality) so the config actually protects the 
system.



##########
fe/fe-core/src/main/java/org/apache/doris/fs/obj/AzureObjStorage.java:
##########
@@ -436,6 +453,86 @@ public Status globList(String remotePath, List<RemoteFile> 
result, boolean fileN
         return st;
     }
 
+    /**
+     * Get file metadata using getProperties requests for deterministic paths.
+     * This avoids requiring list permission when only read permission is 
granted.
+     *
+     * @param bucket       Azure container name
+     * @param keyPattern   The key pattern (may contain {..} brace or [...] 
bracket patterns but no wildcards)
+     * @param result       List to store matching RemoteFile objects
+     * @param fileNameOnly If true, only store file names; otherwise store 
full paths
+     * @param startTime    Start time for logging duration
+     * @return Status if successful, null if should fall back to listing
+     */
+    private Status globListByGetProperties(String bucket, String keyPattern,
+            List<RemoteFile> result, boolean fileNameOnly, long startTime) {
+        try {
+            // First expand [...] brackets to {...} braces, then expand {..} 
ranges, then expand braces
+            String expandedPattern = S3Util.expandBracketPatterns(keyPattern);
+            expandedPattern = S3Util.extendGlobs(expandedPattern);
+            List<String> expandedPaths = 
S3Util.expandBracePatterns(expandedPattern);
+
+            // Fall back to listing if too many paths to avoid overwhelming 
Azure with requests
+            // Controlled by config: s3_head_request_max_paths
+            if (expandedPaths.size() > Config.s3_head_request_max_paths) {
+                LOG.info("Expanded path count {} exceeds limit {}, falling 
back to LIST",
+                        expandedPaths.size(), 
Config.s3_head_request_max_paths);
+                return null;
+            }
+
+            if (LOG.isDebugEnabled()) {
+                LOG.debug("Using getProperties requests for deterministic path 
pattern, expanded to {} paths",
+                        expandedPaths.size());
+            }
+
+            BlobContainerClient containerClient = 
getClient().getBlobContainerClient(bucket);
+            long matchCnt = 0;
+            for (String key : expandedPaths) {
+                String fullPath = constructS3Path(key, bucket);
+                try {
+                    BlobClient blobClient = containerClient.getBlobClient(key);
+                    BlobProperties props = blobClient.getProperties();
+
+                    matchCnt++;
+                    RemoteFile remoteFile = new RemoteFile(
+                            fileNameOnly ? 
Paths.get(key).getFileName().toString() : fullPath,
+                            true, // isFile
+                            props.getBlobSize(),
+                            props.getBlobSize(),
+                            props.getLastModified() != null
+                                    ? props.getLastModified().toEpochSecond() 
: 0
+                    );
+                    result.add(remoteFile);
+
+                    if (LOG.isDebugEnabled()) {
+                        LOG.debug("getProperties success for {}: size={}", 
fullPath, props.getBlobSize());
+                    }
+                } catch (BlobStorageException e) {
+                    if (e.getStatusCode() == HttpStatus.SC_NOT_FOUND
+                            || 
BlobErrorCode.BLOB_NOT_FOUND.equals(e.getErrorCode())) {
+                        // File does not exist, skip it (this is expected for 
some expanded patterns)
+                        if (LOG.isDebugEnabled()) {
+                            LOG.debug("File does not exist (skipped): {}", 
fullPath);
+                        }
+                    } else {
+                        throw e;
+                    }
+                }
+            }
+
+            if (LOG.isDebugEnabled()) {
+                long duration = System.nanoTime() - startTime;
+                LOG.debug("Deterministic path getProperties requests: checked 
{} paths, found {} files, took {} ms",
+                        expandedPaths.size(), matchCnt, duration / 1000 / 
1000);
+            }
+
+            return Status.OK;
+        } catch (Exception e) {
+            LOG.warn("Failed to use getProperties requests, falling back to 
listing: {}", e.getMessage());
+            return null;
+        }

Review Comment:
   If an exception occurs after some successful getProperties calls, this 
method returns null to fall back to LIST but leaves already-added entries in 
`result`, which can lead to duplicates/partial results when the listing path 
runs. Consider building results in a temporary list and only appending on 
success (or rollback `result` before returning null).



##########
fe/fe-core/src/main/java/org/apache/doris/fs/obj/S3ObjStorage.java:
##########
@@ -718,6 +741,91 @@ private GlobListResult globListInternal(String remotePath, 
List<RemoteFile> resu
         }
     }
 
+    /**
+     * Get file metadata using HEAD requests for deterministic paths.
+     * This avoids requiring ListBucket permission when only GetObject 
permission is granted.
+     *
+     * @param bucket       S3 bucket name
+     * @param keyPattern   The key pattern (may contain {..} brace or [...] 
bracket patterns but no wildcards)
+     * @param result       List to store matching RemoteFile objects
+     * @param fileNameOnly If true, only store file names; otherwise store 
full S3 paths
+     * @param startTime    Start time for logging duration
+     * @return GlobListResult if successful, null if should fall back to 
listing
+     */
+    private GlobListResult globListByHeadRequests(String bucket, String 
keyPattern,
+            List<RemoteFile> result, boolean fileNameOnly, long startTime) {
+        try {
+            // First expand [...] brackets to {...} braces, then expand {..} 
ranges, then expand braces
+            String expandedPattern = S3Util.expandBracketPatterns(keyPattern);
+            expandedPattern = S3Util.extendGlobs(expandedPattern);
+            List<String> expandedPaths = 
S3Util.expandBracePatterns(expandedPattern);
+
+            // Fall back to listing if too many paths to avoid overwhelming S3 
with HEAD requests
+            // Controlled by config: s3_head_request_max_paths
+            if (expandedPaths.size() > Config.s3_head_request_max_paths) {
+                LOG.info("Expanded path count {} exceeds limit {}, falling 
back to LIST",
+                        expandedPaths.size(), 
Config.s3_head_request_max_paths);
+                return null;
+            }
+
+            if (LOG.isDebugEnabled()) {
+                LOG.debug("Using HEAD requests for deterministic path pattern, 
expanded to {} paths",
+                        expandedPaths.size());
+            }
+
+            long matchCnt = 0;
+            for (String key : expandedPaths) {
+                String fullPath = "s3://" + bucket + "/" + key;
+                try {
+                    HeadObjectResponse headResponse = getClient()
+                            .headObject(HeadObjectRequest.builder()
+                                    .bucket(bucket)
+                                    .key(key)
+                                    .build());
+
+                    matchCnt++;
+                    RemoteFile remoteFile = new RemoteFile(
+                            fileNameOnly ? 
Paths.get(key).getFileName().toString() : fullPath,
+                            true, // isFile
+                            headResponse.contentLength(),
+                            headResponse.contentLength(),
+                            headResponse.lastModified() != null
+                                    ? 
headResponse.lastModified().toEpochMilli() : 0
+                    );
+                    result.add(remoteFile);
+
+                    if (LOG.isDebugEnabled()) {
+                        LOG.debug("HEAD success for {}: size={}", fullPath, 
headResponse.contentLength());
+                    }
+                } catch (NoSuchKeyException e) {
+                    // File does not exist, skip it (this is expected for some 
expanded patterns)
+                    if (LOG.isDebugEnabled()) {
+                        LOG.debug("File does not exist (skipped): {}", 
fullPath);
+                    }
+                } catch (S3Exception e) {
+                    if (e.statusCode() == HttpStatus.SC_NOT_FOUND) {
+                        if (LOG.isDebugEnabled()) {
+                            LOG.debug("File does not exist (skipped): {}", 
fullPath);
+                        }
+                    } else {
+                        throw e;
+                    }
+                }
+            }
+
+            if (LOG.isDebugEnabled()) {
+                long duration = System.nanoTime() - startTime;
+                LOG.debug("Deterministic path HEAD requests: checked {} paths, 
found {} files, took {} ms",
+                        expandedPaths.size(), matchCnt, duration / 1000 / 
1000);
+            }
+
+            return new GlobListResult(Status.OK, "", bucket, "");
+        } catch (Exception e) {
+            LOG.warn("Failed to use HEAD requests, falling back to listing: 
{}", e.getMessage());
+            return null;
+        }

Review Comment:
   If an exception occurs after some successful HEADs, this method returns null 
to fall back to LIST but leaves already-added entries in `result`, which can 
lead to duplicates or partial results when listing adds the same files again. 
Consider accumulating results in a temporary list and only appending to 
`result` if the whole HEAD pass succeeds (or rollback to the original size 
before returning null).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to