ArafatKhan2198 commented on code in PR #10524:
URL: https://github.com/apache/ozone/pull/10524#discussion_r3453413668


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/chatbot/agent/ChatbotUtils.java:
##########
@@ -129,82 +71,21 @@ public static boolean isBucketScopedListKeysPrefix(String 
startPrefix) {
   // JSON & Text Utilities
   // =========================================================================
 
-  /**
-   * <p>LLMs sometimes wrap their JSON response in prose text (e.g. "Here is 
the result: {...}")
-   * despite being instructed to return JSON only. A simple greedy regex like 
{@code \{.*\}}
-   * fails for nested objects because it can match from the first {@code {} to 
the last {@code }}
-   * in the entire string, returning multiple concatenated objects or 
truncating nested ones.
-   *
-   * <p>This method uses brace-counting with string-awareness to reliably 
extract the first
-   * outermost JSON object regardless of surrounding text, nesting depth, or 
number of
-   * objects in the response:
-   *
-   * @param text the raw LLM response string, which may contain prose 
before/after JSON
-   * @return the first complete JSON object string, or {@code null} if none is 
found
-   */
-  public static String extractFirstJsonObject(String text) {
-    if (text == null) {
-      return null;
-    }
-    int depth = 0;
-    int start = -1;
-    boolean inString = false;
-    boolean escape = false;
-    for (int i = 0; i < text.length(); i++) {
-      char c = text.charAt(i);
-      if (escape) {
-        escape = false;
-        continue;
-      }
-      if (c == '\\' && inString) {
-        escape = true;
-        continue;
-      }
-      if (c == '"') {
-        inString = !inString;
-        continue;
-      }
-      if (inString) {
-        continue;
-      }
-      if (c == '{') {
-        if (depth == 0) {
-          start = i;
-        }
-        depth++;
-      } else if (c == '}') {
-        depth--;
-        if (depth == 0 && start != -1) {
-          return text.substring(start, i + 1);
-        }
-      }
-    }
-    return null;
-  }
-
   public static int parsePositiveInt(String value, int defaultValue) {
     if (StringUtils.isBlank(value)) {
       return defaultValue;
     }
     try {
       int parsed = Integer.parseInt(value.trim());
-      return parsed > 0 ? parsed : defaultValue;
+      if (parsed <= 0) {
+        throw new IllegalArgumentException("limit must be a positive integer");
+      }
+      return parsed;
     } catch (NumberFormatException e) {
       return defaultValue;
     }
   }
 
-  public static String extractStringField(JsonNode node, String field) {
-    if (node == null || field == null || field.isEmpty()) {
-      return null;
-    }
-    JsonNode fieldNode = node.get(field);
-    if (fieldNode == null || fieldNode.isNull()) {
-      return null;
-    }
-    return fieldNode.asText("");
-  }
-
   public static int estimateRecordCount(JsonNode response) {

Review Comment:
   You're right, and this was a real bug. The old `estimateRecordCount` only 
looked for a record array at a few fixed spots (top-level array, `keys`, or 
`data`), so for responses like `listKeys` (records under `fso`/`nonFSO`) or 
containers (records under `data.containers`) it found nothing and returned 0. 
That meant `truncated` never fired and the chatbot would report a capped 
1000-record sample as the complete result.
   
   Rather than hardcode the record field name for each endpoint (which would 
just break again whenever an API's response shape changes), I made the counter 
**shape-agnostic**. It now walks the whole JSON tree and sums the size of every 
array whose elements are objects — i.e. every record-like collection — wherever 
it sits and whatever it's called. Scalar arrays (e.g. histogram bin boundaries) 
are ignored so they don't inflate the count, and split collections like `fso` + 
`nonFSO` are summed.
   
   This keys off structure ("an array of objects") instead of field names, so 
it keeps working when an endpoint renames or nests a field, or a new endpoint 
is added. It's deliberately an estimate used only to detect truncation against 
the cap — if anything it can slightly over-count, which only makes `truncated` 
fire a bit early. That's the safe direction: the chatbot warns the result may 
be a partial sample rather than ever claiming a capped list is complete.
   
   I've also added unit tests covering the various response shapes and sizes 
(nested `data.containers`, `fso`+`nonFSO`, deleted/mismatch arrays, scalar 
arrays ignored, and the 1000-record truncation boundary).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to