kkrugler opened a new issue #7206:
URL: https://github.com/apache/pinot/issues/7206


   Currently per-segment results from an aggregation are trimmed to `Max(5000, 
<query limit> * 5)`. This can result in incorrect results when these partial 
results are combined by the broker, most obviously where the ordering of groups 
is opposite the value being used to do the ordering.
   
   For example, assume a table with two fields, `studentId` and `testScore`. If 
there are multiple test scores per student, and rows for the same student can 
be stored in two different segments, and there are more than 5000 unique 
student ids in each segment, then this query is problematic:
   
   ``` sql
   SELECT studentId, min(testScore) AS lowScore 
     FROM testResultsTable 
     GROUP BY studentId 
     ORDER BY lowScore DESC 
     LIMIT 10
   ```
   
   If `Student1` has a single row with a `testScore` of 0 in segment A, and 
that isn't one of the highest 5000 group results from that segment (likely), 
then it doesn't get returned to the broker. If `Student1` also has a single row 
with `testScore` of 100 in segment B, and this is one of the top 5000 results 
from this segment (likely), then it will be returned to the broker.
   
   So now when the broker recreates the combined aggregation results for 
`Student1`, it appears that the low score is 100, and it gets returned as one 
of the top 10 results...which isn't correct.
   
   In this situation, the only way to get accurate results is to return **ALL** 
of the groups from each segment, so that the broker can calculate correct 
results. You can force this by using a very high limit, but (a) you'd need to 
know the max cardinality of of groups for any segment that participates in the 
query, and (b) the broker is doing a lot of wasted work in generating a very 
large result that will mostly be ignored.
   
   @Jackie-Jiang has suggested supporting a new option (e.g. 
`OPTION(noGroupTrim=true)`) which would force all groups to be returned to the 
broker.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to