siddharthteotia opened a new issue, #18568:
URL: https://github.com/apache/pinot/issues/18568

   ## Summary
   
   With `SET enableNullHandling = true`, queries that use 
`jsonExtractScalar(...)` (no default value argument) over JSON columns where 
some rows have no value at the queried path fail with 
`IllegalArgumentException("Cannot resolve JSON path on some records. Consider 
setting a default value.")`. The same shape with `jsonExtractIndex(...)` 
likewise throws (`RuntimeException("Cannot resolve JSON path...")`).
   
   Both `JsonExtractScalarTransformFunction` and 
`JsonExtractIndexTransformFunction` have `_nullHandlingEnabled` plumbed through 
`init()` but the six typed SV methods 
(`transformTo{Int,Long,Float,Double,BigDecimal,String}ValuesSV`) ignore it and 
throw unconditionally.
   
   Meanwhile, `JsonIndexDistinctOperator` (which `SELECT DISTINCT 
jsonExtractIndex(...)` routes through when a JSON index is present) does honor 
null handling — it folds unresolved rows into a single SQL `NULL` group via 
`handleMissingDocs`. This creates a real divergence: same logical query, 
different outcomes based on whether the DISTINCT operator route is eligible.
   
   ## Fixture used throughout
   
   A `clicks` table with a single JSON column:
   
   ```
   row 0: {"country": "US", "clicks": 5}
   row 1: {"country": "CA", "clicks": 3}
   row 2: {"country": null, "clicks": null}
   row 3: {}
   ```
   
   JSON index is present on `json`.
   
   ## Today on master — `enableNullHandling = true`
   
   | # | Query shape | Default value | `jsonExtractScalar(json, '\$.country', 
'STRING'[, …])` | `jsonExtractIndex(json, '\$.country', 'STRING'[, …])` | 
Consistent? |
   |---|---|---|---|---|---|
   | 1 | **Projection**<br>`SELECT <expr> FROM clicks` | none | throws 
`IllegalArgumentException("Cannot resolve JSON path on some 
records…")`<br>*scalar transform fn* | throws `RuntimeException("Cannot resolve 
JSON path…")`<br>*index transform fn* | ✓ both throw |
   | 2 | **DISTINCT**<br>`SELECT DISTINCT <expr> FROM clicks` | none | throws 
(same scalar transform path as row 1) | returns `"US"`, `"CA"`, `NULL`<br>*via 
`JsonIndexDistinctOperator.handleMissingDocs` → `addNull()`* | ✗ 
**inconsistent** |
   | 3 | **GROUP BY + COUNT**<br>`SELECT <expr> AS c, COUNT(*) FROM clicks 
GROUP BY c` | none | throws (same scalar transform path as row 1) | throws 
(same index transform path as row 1) | ✓ both throw |
   | 4 | **Projection** | `'foobar'` | returns `"US"`, `"CA"`, `"foobar"`, 
`"foobar"` | returns `"US"`, `"CA"`, `"foobar"`, `"foobar"` | ✓ |
   | 5 | **DISTINCT** | `'foobar'` | returns `"US"`, `"CA"`, `"foobar"` | 
returns `"US"`, `"CA"`, `"foobar"`<br>*via 
`JsonIndexDistinctOperator.handleMissingDocs` → 
`addValueToDistinctTable("foobar", …)`* | ✓ |
   | 6 | **GROUP BY** | `'foobar'` | returns `("CA",1)`, `("US",1)`, 
`("foobar",2)` | returns `("CA",1)`, `("US",1)`, `("foobar",2)` | ✓ |
   
   ## After fix — `enableNullHandling = true`
   
   Both `JsonExtractScalarTransformFunction` and 
`JsonExtractIndexTransformFunction` patched to honor `_nullHandlingEnabled` 
when no default literal is supplied.
   
   | # | Query shape | Default value | `jsonExtractScalar(json, '\$.country', 
'STRING'[, …])` | `jsonExtractIndex(json, '\$.country', 'STRING'[, …])` | 
Consistent? |
   |---|---|---|---|---|---|
   | 1 | **Projection** | none | returns `"US"`, `"CA"`, `NULL`, 
`NULL`<br>*scalar transform fn — new null-emit branch* | returns `"US"`, 
`"CA"`, `NULL`, `NULL`<br>*index transform fn — new null-emit branch* | ✓ |
   | 2 | **DISTINCT** | none | returns `"US"`, `"CA"`, `NULL`<br>*scalar SV 
result (with null bitmap) deduped by generic distinct operator* | returns 
`"US"`, `"CA"`, `NULL`<br>*via `JsonIndexDistinctOperator.handleMissingDocs` → 
`addNull()` (unchanged)* | ✓ |
   | 3 | **GROUP BY + COUNT** | none | returns `("CA",1)`, `("US",1)`, 
`(NULL,2)`<br>*scalar SV result (with null bitmap), then standard GROUP BY* | 
returns `("CA",1)`, `("US",1)`, `(NULL,2)`<br>*index transform fn SV result 
(with null bitmap), then standard GROUP BY* | ✓ |
   | 4 | **Projection** | `'foobar'` | returns `"US"`, `"CA"`, `"foobar"`, 
`"foobar"`<br>*unchanged — default-value branch* | returns `"US"`, `"CA"`, 
`"foobar"`, `"foobar"`<br>*unchanged — default-value branch* | ✓ |
   | 5 | **DISTINCT** | `'foobar'` | returns `"US"`, `"CA"`, 
`"foobar"`<br>*unchanged* | returns `"US"`, `"CA"`, `"foobar"`<br>*unchanged* | 
✓ |
   | 6 | **GROUP BY** | `'foobar'` | returns `("CA",1)`, `("US",1)`, 
`("foobar",2)`<br>*unchanged* | returns `("CA",1)`, `("US",1)`, 
`("foobar",2)`<br>*unchanged* | ✓ |
   
   **Change summary (NH = ON):** rows 1–3 in both function-name columns flip 
from `throws` to a SQL `NULL` (or null group / null row). Rows 4–6 are 
unchanged. Every cell is consistent across both function names post-fix.
   
   ## Today on master — `enableNullHandling = false`
   
   | # | Query shape | Default value | `jsonExtractScalar(...)` | 
`jsonExtractIndex(...)` | Consistent? |
   |---|---|---|---|---|---|
   | 1 | **Projection** | none | throws `IllegalArgumentException` | throws 
`RuntimeException` (index transform fn) | ✓ both throw |
   | 2 | **DISTINCT** | none | throws (scalar transform) | throws 
`RuntimeException("Illegal Json Path: [...], for some docIds in segment 
[...]")`<br>*via `JsonIndexDistinctOperator.handleMissingDocs` — NH-off, 
no-default branch* | ✓ both throw |
   | 3 | **GROUP BY** | none | throws (scalar transform) | throws (index 
transform) | ✓ both throw |
   | 4 | **Projection** | `'foobar'` | returns `"US"`, `"CA"`, `"foobar"`, 
`"foobar"` | returns `"US"`, `"CA"`, `"foobar"`, `"foobar"` | ✓ |
   | 5 | **DISTINCT** | `'foobar'` | returns `"US"`, `"CA"`, `"foobar"` | 
returns `"US"`, `"CA"`, `"foobar"` | ✓ |
   | 6 | **GROUP BY** | `'foobar'` | returns `("CA",1)`, `("US",1)`, 
`("foobar",2)` | returns `("CA",1)`, `("US",1)`, `("foobar",2)` | ✓ |
   
   With NH off, all cells are already consistent on master — the fix preserves 
the legacy throw for the no-default case (gated on `_nullHandlingEnabled`).
   
   ## Proposed fix
   
   In both `JsonExtractScalarTransformFunction` and 
`JsonExtractIndexTransformFunction`:
   
   1. When `_nullHandlingEnabled = true` and no default value was supplied: 
emit SQL `NULL` (write the type's null placeholder + mark the row in 
`getNullBitmap()`) for unresolved rows, instead of throwing.
   2. When `_nullHandlingEnabled = false`: preserve the legacy throw.
   3. MV transforms unchanged.
   4. `JsonIndexDistinctOperator` already handles this correctly; no change 
needed.
   
   ## Out of scope (separate follow-up)
   
   - **4-arg with SQL `NULL` literal as default:** 
`LiteralContext.getStringValue()` returns `""` for SQL `NULL`, causing 
`JsonIndexDistinctOperator` to insert an empty string instead of a null group. 
Separate, pre-existing inconsistency.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to