funguy-tech commented on issue #15191:
URL: https://github.com/apache/druid/issues/15191#issuecomment-1771934983
@clintropolis
Alright, I lied about the 48h turnaround. Here's something a bit more
plug-and-play. I have a Python Lambda set up that generates pseudorandom data
and publishes to a Kinesis stream to emulate a Kinesis ingestion setup.
The generated data has the following schema:
```python
{
    "firstName": random.choice(common_first_names),
    "lastName": random.choice(common_last_names),
    "favoriteBrand": random.choice(common_brands),
    "age": random.randint(25, 70),
    "hairColor": random.choice(hair_colors),
    "eyeColor": random.choice(eye_colors),
    "streamId": str(uuid.uuid4()),   # A 'random' id for each record
    "streamTime": get_record_time()  # A random time between now and 15 minutes ago (truncated to the minute)
}
```
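For orientation, here is a minimal sketch of generator logic matching that schema. The value pools, `get_record_time`, and `make_record` are my illustrative assumptions, not the exact contents of the attached Lambda:

```python
import random
import uuid
from datetime import datetime, timedelta, timezone

# Hypothetical value pools; the attached Lambda uses its own lists.
common_first_names = ["Amy", "Charles", "Logan", "Maria", "Wei"]
common_last_names = ["Smith", "Garcia", "Chen", "Patel", "Jones"]
common_brands = ["Acme", "Globex", "Initech"]
hair_colors = ["black", "brown", "blonde", "red"]
eye_colors = ["brown", "blue", "green", "hazel"]

def get_record_time() -> str:
    """Random time between now and 15 minutes ago, truncated to the minute."""
    t = datetime.now(timezone.utc) - timedelta(minutes=random.randint(0, 15))
    return t.replace(second=0, microsecond=0).isoformat()

def make_record() -> dict:
    """Build one pseudorandom record with the schema above."""
    return {
        "firstName": random.choice(common_first_names),
        "lastName": random.choice(common_last_names),
        "favoriteBrand": random.choice(common_brands),
        "age": random.randint(25, 70),
        "hairColor": random.choice(hair_colors),
        "eyeColor": random.choice(eye_colors),
        "streamId": str(uuid.uuid4()),
        "streamTime": get_record_time(),
    }
```

Publishing from the Lambda is then just `boto3.client("kinesis").put_records(...)` with each record JSON-encoded and something like `streamId` as the partition key.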
1. Create a new Kinesis stream (in my case, named "generic-data-test-stream")
   - I made this stream with 10 provisioned shards, but this is likely overkill
1. Create a Python Lambda in the same region
   - Replace the default code with the code in [dummy_data_generator.txt](https://github.com/apache/druid/files/13048851/dummy_data_generator.txt)
   - Populate the `stream_name` environment variable with the stream name (name only, not ARN)
     - "generic-data-test-stream" if unchanged
   - Populate the `records_to_generate` environment variable with the number of records to generate per execution
     - 500 is a good starting point
   - Ensure the Lambda has permission to write to the Kinesis stream
     - Lazy route for test accounts: attach the managed policy `AmazonKinesisFullAccess`
   - Add an `EventBridge` trigger with `rate(1 minute)`
     - Not strictly required, but it gives the stream a steady flow of data
1. Add the ingestion spec provided in [dummy_ingestion_spec.txt](https://github.com/apache/druid/files/13048796/dummy_ingestion_spec.txt)
   - Swap `"ioConfig.stream"` and `"ioConfig.endpoint"` as needed for your stream name and AWS region
   - This assumes the cluster already has the proper AWS wiring in place (permissions for Kinesis, etc.)
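For orientation, the two fields to swap look roughly like this inside the supervisor spec (the stream name and region here are placeholders, not values from the attached spec):

```json
{
  "ioConfig": {
    "stream": "generic-data-test-stream",
    "endpoint": "kinesis.us-east-1.amazonaws.com"
  }
}
```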
With this ingestion setup, I can trivially reproduce the problem with the following query (adjust the intervals to cover realtime segments when you run it):
```json
{
  "queryType": "groupBy",
  "dataSource": "generic-data-test-stream",
  "granularity": "all",
  "filter": {
    "type": "selector",
    "dimension": "firstName",
    "value": "Amy"
  },
  "dimensions": ["firstName"],
  "intervals": [
    "2023-10-19T00:00:00+0000/2023-10-21T00:00:00+0000"
  ]
}
```
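If it helps, a quick way to run this repro is to POST the native query to the router's `/druid/v2` endpoint; a sketch (the router URL is an assumption for illustration):

```python
import json
import urllib.request

def build_groupby_query(datasource: str, interval: str) -> dict:
    """Native groupBy query matching the repro above."""
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": "all",
        "filter": {"type": "selector", "dimension": "firstName", "value": "Amy"},
        "dimensions": ["firstName"],
        "intervals": [interval],
    }

def run_query(router_url: str, query: dict) -> list:
    """POST a native query to e.g. http://localhost:8888/druid/v2."""
    req = urllib.request.Request(
        router_url + "/druid/v2",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```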
Instead of the expected response of one single value:
| firstName |
|-----------|
| Amy |
I receive a response with multiple firstName values:
| firstName |
|-----------|
| Amy |
| Charles |
| Logan |
If this query is executed later on only finalized segments, the result will
be the expected value (only `Amy`).