clintropolis opened a new pull request, #14319:
URL: https://github.com/apache/druid/pull/14319
### Description
This PR adds a new interface to control how `SegmentMetadataCache` chooses a
`ColumnType` for computed SQL schemas when segments disagree on a column's
type. The policy is exposed as `druid.sql.planner.metadataColumnTypeMergePolicy`.
The PR also adds a new 'least restrictive type' mode, which chooses the type
that data across all segments can best be coerced into. The existing "newest
first" behavior remains the default, primarily because the new mode changes
when schema migrations take effect for the SQL schema. With
`{"type":"newestFirst"}`, the SQL schema is updated as soon as the first job
with the new schema has published segments; with
`{"type":"leastRestrictive"}`, the schema is only updated once all segments
are reindexed to the new type. The benefit of `leastRestrictive` is that it
eliminates a class of type coercion errors that can occur in SQL under
`newestFirst` when types vary across segments and the newest type cannot
correctly represent older data, such as when segments have a mix of ARRAY and
number types, or any other combination that leads to odd query plans.
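
To make the difference between the two modes concrete, here is a minimal sketch in Python. It is an illustration only and not the code in this PR: the function names, the string type names, and the `SCALAR_RANK` widening order are all assumptions made up for the example, and Druid's actual coercion rules may differ.

```python
# Hypothetical widening order for scalar types. This is an assumption for
# illustration only, not Druid's actual coercion rules.
SCALAR_RANK = {"LONG": 0, "FLOAT": 1, "DOUBLE": 2, "STRING": 3}


def split(type_name):
    """Return (is_array, element_type) for names like 'LONG' or 'ARRAY<LONG>'."""
    if type_name.startswith("ARRAY<") and type_name.endswith(">"):
        return True, type_name[len("ARRAY<"):-1]
    return False, type_name


def least_restrictive_type(a, b):
    """Pick a type that values of both a and b can be coerced into."""
    a_array, a_elem = split(a)
    b_array, b_elem = split(b)
    # Widest element type wins; an array on either side makes the result an array.
    elem = max(a_elem, b_elem, key=SCALAR_RANK.__getitem__)
    return f"ARRAY<{elem}>" if (a_array or b_array) else elem


def newest_first(segments):
    """segments is ordered newest to oldest; the first type seen per column wins."""
    schema = {}
    for columns in segments:
        for name, type_name in columns.items():
            schema.setdefault(name, type_name)
    return schema


def least_restrictive(segments):
    """Fold every segment's type for a column into the least restrictive one."""
    schema = {}
    for columns in segments:
        for name, type_name in columns.items():
            if name in schema:
                schema[name] = least_restrictive_type(schema[name], type_name)
            else:
                schema[name] = type_name
    return schema


# Hypothetical per-segment column types, newest segment first.
segments = [
    {"x": "LONG", "y": "DOUBLE"},
    {"x": "ARRAY<LONG>", "y": "LONG"},
]
```

With these made-up segments, `newest_first(segments)` reports `x` as `LONG` even though the older segment stores arrays, while `least_restrictive(segments)` reports `ARRAY<LONG>`, which both segments' values can be coerced into.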
I am not at all attached to these names, so if they should be called
something else more intuitive then feel free to suggest.
#### Release note
A new broker configuration,
`druid.sql.planner.metadataColumnTypeMergePolicy`, adds configurable modes for
how column types are computed for the SQL table schema when segments disagree
on a column's type. A new 'least restrictive type' mode chooses the type that
data across all segments can best be coerced into. The existing "newest first"
behavior remains the default, primarily because the new mode changes when
schema migrations take effect for the SQL schema. With
`{"type":"newestFirst"}`, the SQL schema is updated as soon as the first job
with the new schema has published segments; with
`{"type":"leastRestrictive"}`, the schema is only updated once all segments
are reindexed to the new type. However, `{"type":"leastRestrictive"}` is
likely to have "better" query-time behavior and eliminates some query-time
errors that can occur when using `newestFirst`.
<hr>
This PR has:
- [ ] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]