itschrispeck opened a new pull request, #11739:
URL: https://github.com/apache/pinot/pull/11739

   This is a follow-up PR to https://github.com/apache/pinot/issues/11494
   
   I worked on a PoC for this and took the approach of reading the JSON index 
through a transform function. This enables group by, regexp filtering, and the 
majority of the functionality of `JSON_EXTRACT_SCALAR`, and attempts to maintain 
the same syntax. It does not solve the generic problem of adding a path that 
uses the inverted index to speed up group by. 
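
   To illustrate the idea only, here is a toy Python sketch, not Pinot code. It contrasts parsing every document (roughly what a per-row extract does) with reading values from a prebuilt index. The names `docs`, `group_by_scan`, `build_index`, and `group_by_index` are hypothetical, and the index layout is a simplification that assumes flat JSON with at most one value per key.

```python
import json
from collections import defaultdict

# Hypothetical toy rows standing in for a JSON column; not Pinot data.
docs = [
    '{"keyA": "val1", "keyB": 1}',
    '{"keyA": "val2"}',
    '{"keyA": "val1"}',
    '{"keyB": 2}',
]


def group_by_scan(docs, key, default="null"):
    """Per-document path: parse every JSON string, then count values."""
    counts = defaultdict(int)
    for raw in docs:
        counts[str(json.loads(raw).get(key, default))] += 1
    return dict(counts)


def build_index(docs):
    """Toy index built once up front: (key, value) -> list of doc ids."""
    index = defaultdict(list)
    for doc_id, raw in enumerate(docs):
        for k, v in json.loads(raw).items():
            index[(k, str(v))].append(doc_id)
    return index


def group_by_index(index, num_docs, key, default="null"):
    """Index path: count values from posting lists, no per-doc parsing."""
    counts = {}
    matched = 0
    for (k, v), doc_ids in index.items():
        if k == key:
            counts[v] = counts.get(v, 0) + len(doc_ids)
            matched += len(doc_ids)
    # Docs without the key fall back to the default value.
    if matched < num_docs:
        counts[default] = counts.get(default, 0) + (num_docs - matched)
    return counts
```

   In this sketch both paths return the same counts. The index path does no JSON parsing at query time, which is the property that matters when many docs are selected.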
   
   We've seen large improvements for large time ranges and large tables. This 
can reduce query latency for equivalent queries and greatly reduce memory 
pressure, preventing the OOM-caused cluster crashes we encountered. 
   
   As expected, for small time ranges where the number of docs extracted is 
small, the existing `JSON_EXTRACT_SCALAR` function is faster. 
   
   Initial benchmark results:
   
![image](https://github.com/apache/pinot/assets/27231838/40534dec-576c-4261-80a8-7d931cbbc1b4)
   The outlined graphs indicate a cluster crash, so no data could be 
recorded. 
   ```
   A: 10TB table, group by json_extract_scalar(col, '$.keyA', 'STRING', 'null')
   B: 10TB table, group by json_extract_scalar(col, '$.keyB', 'STRING', 'null')
   C: 10TB table, regexp_like(json_extract_index(json_data, '$.keyA', 'STRING', 'null'), 'val') group by json_extract_index(json_data, '$.keyA', 'STRING', 'null')
   D: 10TB table, regexp_like(json_extract_index(json_data, '$.keyA', 'STRING', 'null'), 'val')
   E: 20GB table, group by json_extract_scalar(col, '$.keyA', 'STRING', 'null')
   F: 20GB table, regexp_like(json_extract_index(json_data, '$.keyA', 'STRING', 'null'), 'val') group by json_extract_index(json_data, '$.keyA', 'STRING', 'null')
   ```
   
   Though not low latency, I think this would be solid functionality to have 
available, as it allows for queries that were otherwise unanswerable. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

