ArafatKhan2198 commented on code in PR #10524:
URL: https://github.com/apache/ozone/pull/10524#discussion_r3453413668
##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/chatbot/agent/ChatbotUtils.java:
##########
@@ -129,82 +71,21 @@ public static boolean isBucketScopedListKeysPrefix(String
startPrefix) {
// JSON & Text Utilities
// =========================================================================
- /**
- * <p>LLMs sometimes wrap their JSON response in prose text (e.g. "Here is
the result: {...}")
- * despite being instructed to return JSON only. A simple greedy regex like
{@code \{.*\}}
- * fails for nested objects because it can match from the first {@code {} to
the last {@code }}
- * in the entire string, returning multiple concatenated objects or
truncating nested ones.
- *
- * <p>This method uses brace-counting with string-awareness to reliably
extract the first
- * outermost JSON object regardless of surrounding text, nesting depth, or
number of
- * objects in the response:
- *
- * @param text the raw LLM response string, which may contain prose
before/after JSON
- * @return the first complete JSON object string, or {@code null} if none is
found
- */
- public static String extractFirstJsonObject(String text) {
- if (text == null) {
- return null;
- }
- int depth = 0;
- int start = -1;
- boolean inString = false;
- boolean escape = false;
- for (int i = 0; i < text.length(); i++) {
- char c = text.charAt(i);
- if (escape) {
- escape = false;
- continue;
- }
- if (c == '\\' && inString) {
- escape = true;
- continue;
- }
- if (c == '"') {
- inString = !inString;
- continue;
- }
- if (inString) {
- continue;
- }
- if (c == '{') {
- if (depth == 0) {
- start = i;
- }
- depth++;
- } else if (c == '}') {
- depth--;
- if (depth == 0 && start != -1) {
- return text.substring(start, i + 1);
- }
- }
- }
- return null;
- }
-
public static int parsePositiveInt(String value, int defaultValue) {
if (StringUtils.isBlank(value)) {
return defaultValue;
}
try {
int parsed = Integer.parseInt(value.trim());
- return parsed > 0 ? parsed : defaultValue;
+ if (parsed <= 0) {
+ throw new IllegalArgumentException("limit must be a positive integer");
+ }
+ return parsed;
} catch (NumberFormatException e) {
return defaultValue;
}
}
- public static String extractStringField(JsonNode node, String field) {
- if (node == null || field == null || field.isEmpty()) {
- return null;
- }
- JsonNode fieldNode = node.get(field);
- if (fieldNode == null || fieldNode.isNull()) {
- return null;
- }
- return fieldNode.asText("");
- }
-
public static int estimateRecordCount(JsonNode response) {
Review Comment:
You're right, and this was a real bug. The old `estimateRecordCount` only
looked for a record array at a few fixed spots (top-level array, `keys`, or
`data`), so for responses like `listKeys` (records under `fso`/`nonFSO`) or
containers (records under `data.containers`) it found nothing and returned 0.
That meant `truncated` never fired and the chatbot would report a capped
1000-record sample as the complete result.
Rather than hardcode the record field name for each endpoint (which would
just break again whenever an API's response shape changes), I made the counter
**shape-agnostic**.
So as it walks the JSON:
- **Array of objects** (e.g. `containers`, `fso`, `nonFSO`,
`deletedKeyInfo`) → counts each item, and keeps going deeper.
- **Array of plain values** (numbers/strings, e.g. histogram bins) →
**ignored**, because those aren't records.
- **Plain fields** (`totalCount`, `prevKey`, `state`, etc.) → ignored.
- **Nested objects** → it just walks into them looking for more record
arrays.
Then it returns the **sum** of all the record-like arrays it found, no
matter where they sit or what they're named.
A few consequences worth knowing:
- **Split lists get summed** — `listKeys` returns `fso` *and* `nonFSO`, so
`600 + 400 = 1000`. That's intended.
- **It can slightly over-count** if records themselves contain nested object
arrays (e.g. a container that lists its replicas). We accept that on purpose —
over-counting only makes the "this might be a partial sample" warning fire a
touch early, which is the safe direction. It will never under-count and wrongly
claim a capped list is complete.
- **It's an estimate, not an exact total** — its only job is to answer "did
we likely hit the 1000 cap?", not to be a precise record count for display.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]