jineshparakh opened a new pull request, #18515:
URL: https://github.com/apache/pinot/pull/18515

   ## Summary
   
   Pinot broker's user-facing query endpoints return HTTP 500 for client errors 
such as malformed JSON request bodies, missing required fields (`sql`, 
`query`), and invalid query IDs. This makes it impossible to distinguish 
genuine server failures from bad client requests in metrics and monitoring. 
This PR fixes all broker `POST` endpoints and the `cancelQuery` `DELETE` 
endpoint to return proper HTTP 400 responses, introduces a new 
`BAD_REQUEST_EXCEPTIONS` metric, and adds comprehensive test coverage.
   
   GET endpoints (`processSqlQueryGet`, 
`processSqlWithMultiStageQueryEngineGet`) were reviewed and intentionally 
excluded. They construct JSON from URL parameters internally (`ObjectNode 
requestJson = JsonUtils.newObjectNode()`), have no JSON body parsing, and no 
downstream code throws `BadRequestException`. No changes needed.
   
   ## Changes
   
   ### New metric: `BAD_REQUEST_EXCEPTIONS` (`BrokerMeter.java`)
   
   Added `BAD_REQUEST_EXCEPTIONS` meter alongside the existing 
`UNCAUGHT_POST_EXCEPTIONS` and `WEB_APPLICATION_EXCEPTIONS` exception meters. 
   
   ### Exception handling fixes (`PinotClientRequest.java`)
   
   **Missing-field validations (5 POST endpoints):**
   
   All five POST endpoints previously threw `IllegalStateException` for missing 
required fields. Changed to `javax.ws.rs.BadRequestException` (JAX-RS standard 
400 exception type):
   
   | Endpoint | Required field | Old exception | New exception |
   |----------|---------------|---------------|---------------|
   | `processSqlQueryPost` | `sql` | `IllegalStateException` | 
`BadRequestException` |
   | `getQueryFingerprint` | `sql` | `IllegalStateException` | 
`BadRequestException` |
   | `processSqlWithMultiStageQueryEnginePost` | `sql` | 
`IllegalStateException` | `BadRequestException` |
   | `processTimeSeriesQueryEngine` | `query` | `IllegalStateException` | 
`BadRequestException` |
   | `processSqlQueryWithBothEnginesAndCompareResults` | `sql` or (`sqlV1` + 
`sqlV2`) | `IllegalStateException` | `BadRequestException` |
   
   **Structured exception catch blocks (all 5 POST endpoints):**
   
   Added `catch (BadRequestException bre)` and `catch (JsonProcessingException 
e)` blocks to all POST endpoints. The catch order follows Java's subclass-first 
requirement and groups client errors together:
   
   ```
   BadRequestException → JsonProcessingException → WebApplicationException → 
Exception
   ```
   
   - `BadRequestException` before `WebApplicationException` (subclass must 
precede parent class)
   - `JsonProcessingException` is grouped with 400s since malformed JSON is a 
client error
   - No logging in client-error catches. Metric emission is sufficient, and 
logging at broker QPS would cause log bloat
   - Each client-error catch increments `BAD_REQUEST_EXCEPTIONS`
   - Each generic catch increments `UNCAUGHT_POST_EXCEPTIONS`
   
   **`cancelQuery` silent no-op fix:**
   
   The `catch (NumberFormatException e)` block was:
   ```java
   Response.status(Response.Status.BAD_REQUEST).entity(String.format("Invalid 
internal query id: %s", id));
   ```
   This built a `Response` object but never returned or threw it. The method 
silently fell through to the 404 path, returning "Query not found" instead of 
"Invalid query id". Fixed to:
   ```java
   throw new BadRequestException(String.format("Invalid internal query id: %s", 
id), e);
   ```
   
   **`generateQueryFingerprint` private method:**
   
   - SQL parse failures: was rethrowing generic `Exception` → now 
`BadRequestException`
   - Non-DQL SQL type: was throwing generic `Exception` → now 
`BadRequestException`
   - Null fingerprint from `QueryFingerprintUtils.generateFingerprint`: was 
generic `Exception` → now `IllegalStateException` (server-side invariant 
violation, not a client error, so it correctly remains 500)
   
   **`getQueryFingerprint` JSON injection fix:**
   
   The HTTP 500 error response body was built via unsafe string concatenation:
   ```java
   "{\"error\": \"" + e.getMessage() + "\"}"
   ```
   If `e.getMessage()` contained `"` characters, this produced malformed JSON. 
Fixed to use `ObjectNode`:
   ```java
   ObjectNode errorJson = JsonUtils.newObjectNode();
   errorJson.put("error", e.getMessage());
   ```
   
   **`WEB_APPLICATION_EXCEPTIONS` metric consistency:**
   
   Added `WEB_APPLICATION_EXCEPTIONS` metric to WAE catch blocks in 
`processTimeSeriesQueryEngine` and 
`processSqlQueryWithBothEnginesAndCompareResults` that were previously missing 
it. Also added the missing `UNCAUGHT_POST_EXCEPTIONS` metric to the generic 
catch in `processSqlQueryWithBothEnginesAndCompareResults`.
   
   ### Tests (`PinotClientRequestTest.java`)
   
   **Updated existing tests:**
   - `testGetQueryFingerprintWithInvalidSql`: was asserting HTTP 500, now 
asserts HTTP 400 and verifies `BAD_REQUEST_EXCEPTIONS` metric
   - `testPinotQueryComparisonApiMissingSql` → renamed to 
`testPinotQueryComparisonApiMissingV2Sql` (more precise: payload has `v1sql` 
but is missing `v2sql`); enhanced with `ArgumentCaptor` and metric assertions
   - `testCancelQueryWithInvalidId`: `never(UNCAUGHT_POST_EXCEPTIONS)` changed 
to `never(WEB_APPLICATION_EXCEPTIONS)` because the former was vacuously true (a 
DELETE handler never emits the POST metric)
   
   **9 new tests added (17 total, all passing):**
   
   | Test | Validates |
   |------|-----------|
   | `testProcessSqlQueryPostMissingSql` | Missing `sql` field → 400 |
   | `testProcessSqlQueryPostInvalidJson` | Malformed JSON body → 400 |
   | `testGetQueryFingerprintMissingSql` | Missing `sql` field → 400 |
   | `testGetQueryFingerprintInvalidJson` | Malformed JSON body → 400 |
   | `testCancelQueryWithInvalidId` | Non-numeric query ID → 400 |
   | `testProcessSqlWithMultiStageQueryEnginePostMissingSql` | Missing `sql` 
field → 400 |
   | `testProcessSqlWithMultiStageQueryEnginePostInvalidJson` | Malformed JSON 
body → 400 |
   | `testPinotQueryComparisonApiInvalidJson` | Malformed JSON body → 400 |
   | `testProcessTimeSeriesQueryEngineMissingQuery` | Missing `query` field → 
400 |
   
   Each test verifies three things:
   1. HTTP 400 status code returned
   2. `BAD_REQUEST_EXCEPTIONS` metric incremented
   3. `UNCAUGHT_POST_EXCEPTIONS` metric NOT incremented (no false 5xx signal)
   
   ## Metrics impact
   
   - **`BAD_REQUEST_EXCEPTIONS`** (new): incremented on all 400 paths. 
Previously these were counted under `UNCAUGHT_POST_EXCEPTIONS`.
   - **`UNCAUGHT_POST_EXCEPTIONS`**: loses counts that were never real 5xx 
errors. Existing dashboards keyed on this metric will see a count drop, which 
means fewer false positives.
   - **`WEB_APPLICATION_EXCEPTIONS`**: now excludes `BadRequestException` as a 
category. No downstream broker code throws `BadRequestException`, so the 
practical count change is zero.
   
   ## Behavioral note
   
   `processTimeSeriesQueryEngine` POST takes `JsonNode` as a parameter 
(Jersey-deserialized). Invalid JSON to this endpoint causes Jersey to fail 
before the method is called, so `BAD_REQUEST_EXCEPTIONS` is not emitted for 
that specific error path. All other POST endpoints parse JSON themselves and do 
emit the metric for invalid JSON. This is a pre-existing behavioral difference 
due to the parameter type, not introduced by this PR.
   
   ## Testing
   
   ```
   mvn -pl pinot-broker -Dtest=PinotClientRequestTest test
   Tests run: 17, Failures: 0, Errors: 0, Skipped: 0
   BUILD SUCCESS
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to