renatocron opened a new issue, #12986:
URL: https://github.com/apache/druid/issues/12986
Hi! Druid is returning wrong / unexpected results.
I have a rollup set up with daily segments, ingesting from Kafka, with two
metrics (count and longMax).
The goal of the query is to get the most recent value (even if some dimensions
changed values during the day) by ordering on `MAX(last_seen) DESC`, so that
the most recent version of the rolled-up row is returned from the query segment.
The query is below (the original query has more columns, but I have kept
reducing it to make it easier to follow):
```sql
SELECT
  mm,
  apikey,
  version,
  MAX(last_seen) AS last_seen
FROM datasource
WHERE
  __time >= CURRENT_TIMESTAMP - INTERVAL '18' DAY
  AND "mm" = '1234'
  AND apikey IN ('foo', 'bar')
GROUP BY
  mm, version, apikey
```
The data source has about ~30 million rows, but filtering on the column `mm`
returned just 4 rows (2 for each apikey, for each day, after the daily rollup);
before the rollup, on one of the days, there were 75 rows collapsed into one.
Here is some example data with the same schema, although the error cannot be
reproduced with just this ingestion spec:
https://gist.github.com/renatocron/11a43d7ad4ee1fc3d43c4cda9a17fee2
The column `last_seen` is generated as the `longMax` of `__time`. Both
segmentGranularity and queryGranularity are set to DAY, so it should always
return a timestamp between midnight and the start of the next day, but I was
getting some values 30 years in the future. That was how I discovered that
something was wrong with the results of some executions.
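For reference, the relevant parts of the ingestion spec look roughly like this (a sketch reconstructed from the description above and the linked gist; the exact field and metric names other than `__time` are assumptions):

```json
{
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "longMax", "name": "last_seen", "fieldName": "__time" }
  ],
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "DAY",
    "rollup": true
  }
}
```

With a spec like this, `last_seen` in a DAY-granularity row can never legitimately exceed the end of that day, which is why the far-future values stood out.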
The crazy thing is that, to get the correct result, I can either change the
SELECT to add any column using LATEST or add `array_agg(distinct last_seen)`;
then the `MAX(last_seen)` column returns the correct data (100% of the time? I
don't know, but that is what I would expect from a database even without the
array_agg to begin with :P; I'm not throwing shade at Druid, but that's the truth).
If you want to see it happening in 'real time', here's a video where I ran
some queries https://youtu.be/X1zHJ05GoJ8
Further information/data can be found in this Slack thread:
https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1661203934675279
Since I reported this on Slack (2022-08-23), I have kept testing to gather more
information:
During the initial re-ingestion from Kafka, the new data source showed wrong
results at first (see
https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1661248129192669?thread_ts=1661203934.675279&cid=C0309C9L90D),
but after the first task completed, the results were correct for the next
~24 hours. Then I enabled compression, with the same settings I use on the
production data source (see
https://gist.github.com/renatocron/11a43d7ad4ee1fc3d43c4cda9a17fee2?permalink_comment_id=4281304#gistcomment-4281304
for the settings), waited until today (2022-08-26), and now all the results
are wrong again:
(screenshot of the wrong results omitted)

If I add the `array_agg`:

(screenshot omitted)

Or change the filter to match only one of the apikeys:

(screenshot omitted)
The re-ingested data source currently has 14M rows and is 90.48% compacted, so
auto compaction is effectively done; now I think this could be something
related to the compaction process.
Running the query with the same GROUP BY but without selecting the `apikey`
column returns the right result (two rows, one for each apikey):

(screenshot omitted)

But adding `apikey` back to the SELECT returns bad results again:

(screenshot omitted)
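The two variants can be sketched as follows (reduced from the query above; only the SELECT list differs):

```sql
-- Returns the correct result: apikey is in GROUP BY but not in SELECT
SELECT mm, version, MAX(last_seen) AS last_seen
FROM datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '18' DAY
  AND "mm" = '1234'
  AND apikey IN ('foo', 'bar')
GROUP BY mm, version, apikey

-- Returns wrong results again: apikey added back to the SELECT
SELECT mm, apikey, version, MAX(last_seen) AS last_seen
FROM datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '18' DAY
  AND "mm" = '1234'
  AND apikey IN ('foo', 'bar')
GROUP BY mm, version, apikey
```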
Druid version is 0.23, from the official Docker image.
Thanks for reading this far!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]