funguy-tech commented on issue #15191:
URL: https://github.com/apache/druid/issues/15191#issuecomment-1771934983
@clintropolis
Alright, I lied about the 48h turnaround. Here's something a bit more
plug-and-play. I have a Python Lambda set up that generates pseudorandom data
and publishes to a Kinesis stream to emulate a Kinesis ingestion setup.
The generated data has the following schema:
```python
{
    "firstName": random.choice(common_first_names),
    "lastName": random.choice(common_last_names),
    "favoriteBrand": random.choice(common_brands),
    "age": random.randint(25, 70),
    "hairColor": random.choice(hair_colors),
    "eyeColor": random.choice(eye_colors),
    "streamId": str(uuid.uuid4()),   # A 'random' id for each record
    "streamTime": get_record_time()  # A random time between now and 15 minutes ago (truncated to the minute)
}
```
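For orientation, here is a minimal sketch of generator logic matching that schema. The value pools, `get_record_time`, and `make_record` are my illustrative assumptions, not the exact contents of the attached Lambda:

```python
import random
import uuid
from datetime import datetime, timedelta, timezone

# Hypothetical value pools; the attached Lambda uses its own lists.
common_first_names = ["Amy", "Charles", "Logan", "Maria", "Wei"]
common_last_names = ["Smith", "Garcia", "Chen", "Patel", "Jones"]
common_brands = ["Acme", "Globex", "Initech"]
hair_colors = ["black", "brown", "blonde", "red"]
eye_colors = ["brown", "blue", "green", "hazel"]

def get_record_time() -> str:
    """Random time between now and 15 minutes ago, truncated to the minute."""
    t = datetime.now(timezone.utc) - timedelta(minutes=random.randint(0, 15))
    return t.replace(second=0, microsecond=0).isoformat()

def make_record() -> dict:
    """Build one pseudorandom record with the schema above."""
    return {
        "firstName": random.choice(common_first_names),
        "lastName": random.choice(common_last_names),
        "favoriteBrand": random.choice(common_brands),
        "age": random.randint(25, 70),
        "hairColor": random.choice(hair_colors),
        "eyeColor": random.choice(eye_colors),
        "streamId": str(uuid.uuid4()),
        "streamTime": get_record_time(),
    }
```

Publishing from the Lambda is then just `boto3.client("kinesis").put_records(...)` with each record JSON-encoded and something like `streamId` as the partition key.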
1. Create a new Kinesis stream (in my case, named "generic-data-test-stream")
   - I made this stream with 10 provisioned shards, but this is likely overkill
1. Create a Python Lambda in the same region
   - Replace the default code with the code in [dummy_data_generator.txt](https://github.com/apache/druid/files/13048851/dummy_data_generator.txt)
   - Populate the `stream_name` environment variable with the stream name (name only, not ARN)
     - "generic-data-test-stream" if unchanged
   - Populate the `records_to_generate` environment variable with the number of records to generate per execution
     - 500 is a good starting point
   - Ensure the Lambda has permission to write to the Kinesis stream
     - Lazy route for test accounts: attach the managed policy `AmazonKinesisFullAccess`
   - Add an `EventBridge` trigger with `rate(1 minute)`
     - Not strictly required, but it gives the stream a steady flow of data
1. Add the ingestion spec provided in [dummy_ingestion_spec.txt](https://github.com/apache/druid/files/13048796/dummy_ingestion_spec.txt)
   - Swap `"ioConfig.stream"` and `"ioConfig.endpoint"` as needed for your stream name and AWS region
   - This assumes the cluster already has the proper AWS wiring in place (permissions for Kinesis, etc.)
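For orientation, the two fields to swap look roughly like this inside the supervisor spec (the stream name and region here are placeholders, not values from the attached spec):

```json
{
  "ioConfig": {
    "stream": "generic-data-test-stream",
    "endpoint": "kinesis.us-east-1.amazonaws.com"
  }
}
```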
With this ingestion setup, I can trivially reproduce the problem with the following query (adjust the intervals to cover realtime segments when you run it):
```json
{
  "queryType": "groupBy",
  "dataSource": "generic-data-test-stream",
  "granularity": "all",
  "filter": {
    "type": "selector",
    "dimension": "firstName",
    "value": "Amy"
  },
  "dimensions": ["firstName"],
  "intervals": [
    "2023-10-19T00:00:00+0000/2023-10-21T00:00:00+0000"
  ]
}
```
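If it helps, a quick way to run this repro is to POST the native query to the router's `/druid/v2` endpoint; a sketch (the router URL is an assumption for illustration):

```python
import json
import urllib.request

def build_groupby_query(datasource: str, interval: str) -> dict:
    """Native groupBy query matching the repro above."""
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": "all",
        "filter": {"type": "selector", "dimension": "firstName", "value": "Amy"},
        "dimensions": ["firstName"],
        "intervals": [interval],
    }

def run_query(router_url: str, query: dict) -> list:
    """POST a native query to e.g. http://localhost:8888/druid/v2."""
    req = urllib.request.Request(
        router_url + "/druid/v2",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```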
Instead of the expected response of one single value:
| firstName |
|-----------|
| Amy |
I receive a response with multiple firstName values:
| firstName |
|-----------|
| Amy |
| Charles |
| Logan |
If this query is executed later on only finalized segments, the result will
be the expected value (only `Amy`).