funguy-tech opened a new issue, #15191:
URL: https://github.com/apache/druid/issues/15191
### Affected Version
V27.0.0
### Impact
This issue appears to be reliably reproduced by executing a
single-dimension, single-filter Native Druid query on any `string` dimension in
a `kinesis` ingestion task that is derived from a `Schema Auto-Discovery` spec,
as long as the data has not been handed off. The issue resolves after hand-off
to Historicals.
### Expected Result
GroupBy and Timeseries Queries against actively ingested single dimension
values are consistently filtered without regard to data residency (realtime vs
fully persisted segment).
### Actual Result
GroupBy and Timeseries Queries against actively ingested single dimension
values temporarily ignore or mis-apply filters until data segments are
persisted, at which point filters are correctly applied.
### Description
My team operates multiple large-scale Druid clusters with roughly identical
base configurations. Pertinent details are as follows:
- Ingestion Method: `kinesis`
- Segment size: `1 hour`
- Lookback period: `3 hours` (a small portion of our data is late-arriving)
- Relevant Middle Manager architecture: ARM processors, statically defined
hardware, dedicated to kinesis ingestion tasks
- Other Middle Manager tasks, such as compaction, are delegated to a
separate Middle Manager tier
As part of Schema Auto-discovery migration, we migrated one of our regions
to a new schema in which we only define a few legacy lists (to retain them as
MVDs) and aggregations - the rest of our fields are ingested via discovery. In
total, we produce records with ~100-150 fields, and the dataTypes do appear to
align correctly post-migration.
In the process of migrating, we stumbled across a perplexing issue with
GroupBy and Timeseries queries. Whenever we perform a single dimension query
that overlaps/involves data on the Middle Managers (in our case, queries that
touch the most recent 3 hours), the results received are nonsensical - the
filter appears to be either inconsistently applied or not applied at all,
resulting in other dimension values 'leaking' into the results despite being
ruled out by the filter. This behavior is almost reminiscent of some sort of
MVD edge case, but again, the fields experiencing this issue are strictly
singular string values (and, as mentioned further down, the behavior changes
between different points of the segment's lifecycle).
Consider the following minimally-reproducible query, a GroupBy that groups
and filters by an `example_field` dimension.
```
{
  "queryType": "groupBy",
  "dataSource": "Example_Records",
  "granularity": "all",
  "filter": {
    "type": "selector",
    "dimension": "example_field",
    "value": "expected_value"
  },
  "dimensions": ["example_field"],
  "intervals": [
    "2023-10-17T00:00:00+0000/2023-10-17T20:55:00+0000"
  ]
}
```
Assuming `example_field` is guaranteed to be a simple string value (and is
identified as such in the schema), this query should return at most one row,
for the value `expected_value`. However, that is not what happens.
- When executed on a data range that still resides on Middle Managers, this
query returns between 20 and 40 different rows with miscellaneous values for
`example_field`.
- When executed on a data range that has been successfully handed off to
Historicals, this query returns the correct / expected value of only
`expected_value`.
- When the same query is executed twice with a 3-hour delay between runs, it
will first return the nonsensical result - and then later return the expected
result - indicating a behavior change between the comparable Middle Manager and
Historical queries.
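To check for the leak programmatically rather than by eye, something like the following Python sketch could be used. It is not part of the original reproduction: the helper names `leaked_rows` and `run_native_query` are hypothetical, the Broker URL is an assumption to adjust per cluster, and it assumes the usual native groupBy response shape (a JSON list of rows, each carrying an `event` map).

```python
import json
import urllib.request


def leaked_rows(results, dimension, expected_value):
    """Return groupBy result rows whose `dimension` value differs from the filter value.

    `results` is assumed to be the decoded native groupBy response: a list of
    objects, each with an "event" map holding the dimension values.
    """
    return [
        row for row in results
        if row.get("event", {}).get(dimension) != expected_value
    ]


def run_native_query(query, broker_url="http://localhost:8082/druid/v2"):
    """POST a native query to a Broker and return the decoded JSON response.

    The URL is an assumption (a local Broker on its default port).
    """
    request = urllib.request.Request(
        broker_url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the query above loaded into `query`, `leaked_rows(run_native_query(query), "example_field", "expected_value")` would be expected to come back empty once the filter is applied correctly, and non-empty while the data still resides on Middle Managers.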
Oddly enough, a modification to the original query appears to fix it. If an
additional dimension - even one that doesn't exist - is added to the query
(ordering does not matter), it returns the expected result 100% of the time:
```
{
  "queryType": "groupBy",
  "dataSource": "Example_Records",
  "granularity": "all",
  "filter": {
    "type": "selector",
    "dimension": "example_field",
    "value": "expected_value"
  },
  "dimensions": ["example_field", "oof"],
  "intervals": [
    "2023-10-17T00:00:00+0000/2023-10-17T20:55:00+0000"
  ]
}
```
The above query will always return one row with an `example_field` value of
`expected_value` and an `oof` value of `null`, somehow avoiding the nonsensical
condition of the first query.
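For anyone who wants to compare the two query shapes side by side, the extra-dimension variant can be derived from the original query instead of being edited by hand. This is a minimal Python sketch; the helper name `with_placeholder_dimension` is hypothetical, and it only reproduces the observation above; it does not establish that the extra dimension is a reliable workaround.

```python
import copy


def with_placeholder_dimension(query, placeholder="oof"):
    """Return a copy of a groupBy query with one extra dimension appended.

    The input query is left untouched. `placeholder` may name a dimension
    that does not exist in the datasource, as in the report above.
    """
    patched = copy.deepcopy(query)
    dimensions = patched.setdefault("dimensions", [])
    if placeholder not in dimensions:
        dimensions.append(placeholder)
    return patched
```

Running both the original query and the patched copy over the same interval should make the difference in row counts directly visible.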