JulianJaffePinterest opened a new issue #9463: Add namespaces to Druid segments within a data source
URL: https://github.com/apache/druid/issues/9463

### Motivation

Currently, Druid versions segments by their data source, covered interval, and an arbitrary version string (usually the time at which they were written). However, when users want to ingest and query data from multiple sources, or selectively update subsets of their data, this scheme requires intermediate processing or over-processing. To support these use cases, we can introduce the concept of a namespace to shard specs and VersionedIntervalTimelines, versioning namespaced shard specs by their data source, covered interval, _namespace_, and version string. This would allow data sources to be transparently populated from multiple input sources, with overshadowing and atomic updates continuing to function within the context of a namespace, without affecting data in other namespaces and without affecting the behavior of non-namespaced segments.

### Proposed changes

The proposed changes primarily fall into two buckets.

First, modify the `PartitionChunk` and `ShardSpec` classes to add the methods `default Object getChunkIdentifier` and `default Object getIdentifier`, which simply return the chunk number and partition number, respectively. Further, modify the `SegmentId` and `SegmentDescriptor` classes to add an identifier. From here, all calls to ShardSpec's `getPartitionNum` should be replaced with calls to `getIdentifier`, and all SegmentDescriptor/SegmentId creations updated. Namespaced shard specs can be created that approximate existing shard specs (e.g. a `NamedNumberedShardSpec` can be created that duplicates the logic of a `NumberedShardSpec`, with the `abuts` and `compareTo` methods updated to check for matching namespaces as well). Additionally, real-time ingestion specs (which don't allow specifying a shard spec directly) can be updated to optionally take a namespace.
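As a rough illustration of the default-method approach, here is a minimal sketch using simplified stand-in types rather than Druid's real classes. The names `ShardSpecSketch`, `NumberedSketch`, and `NamedNumberedSketch` are hypothetical, and the `namespace_partitionNum` string used as the namespaced identifier is an assumption made for the example, not something this proposal fixes.

```java
/** Hypothetical, simplified stand-in for ShardSpec; only the methods discussed here. */
interface ShardSpecSketch {
    int getPartitionNum();

    /** Default identifier is the partition number, so existing specs need no changes. */
    default Object getIdentifier() {
        return getPartitionNum();
    }
}

/** Stand-in for a non-namespaced spec: inherits the default identifier. */
final class NumberedSketch implements ShardSpecSketch {
    private final int partitionNum;

    NumberedSketch(int partitionNum) {
        this.partitionNum = partitionNum;
    }

    @Override
    public int getPartitionNum() {
        return partitionNum;
    }
}

/** Stand-in for a namespaced spec: overrides the identifier to include the namespace. */
final class NamedNumberedSketch implements ShardSpecSketch {
    private final String namespace;
    private final int partitionNum;

    NamedNumberedSketch(String namespace, int partitionNum) {
        this.namespace = namespace;
        this.partitionNum = partitionNum;
    }

    @Override
    public int getPartitionNum() {
        return partitionNum;
    }

    @Override
    public Object getIdentifier() {
        // Assumed format for illustration only.
        return namespace + "_" + partitionNum;
    }
}
```

Callers that previously compared `getPartitionNum()` values would compare `getIdentifier()` objects with `equals` instead, which is where the Object-versus-int comparison cost comes from.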
These changes are transparent to existing code, except for a small performance hit from comparing Objects instead of ints.

The other piece of the puzzle is to introduce a NamespacedVersionedIntervalTimeline that mirrors the logic of VersionedIntervalTimeline with a few key changes. Primarily, a NamespacedVersionedIntervalTimeline contains a map from namespace strings to VersionedIntervalTimelines (one timeline per namespace). Adding, removing, and overshadowing segments and partitions can then be done in the context of the appropriate namespace, while lookups can be done across _all_ namespaces. VersionedIntervalTimelines in CachingClusteredClient, BrokerServerView, etc. can then be replaced with NamespacedVersionedIntervalTimelines without affecting any non-namespaced segments (since they'll all be added to the default namespace and thus to the same underlying VersionedIntervalTimeline, as they currently are).

These changes should be entirely transparent to users who aren't interested in using them, since all existing behavior will be unchanged. Users who do want namespaced shard specs can simply specify the appropriate shard spec in their ingestion config (or, for real-time ingestion specs, a namespace), and namespaced shard specs will be created. These shard specs will only extend or overshadow data with the same data source and namespace, but all segments with a given data source will be queried together, regardless of namespace. Here, this proposal takes advantage of the fact that Druid already handles cases where some dimensions or metrics are present in some segments of a data source but not in others: missing metrics are inferred to be 0 and missing dimensions are inferred to be null. If there are shared dimensions between namespaces, post aggregators can be used to return combined results for any given combination, even if not all metrics or dimensions are present in all namespaces.
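To make the per-namespace overshadowing and cross-namespace lookup concrete, here is a heavily simplified sketch. It is not the proposed implementation: `NamespacedTimelineSketch` is a hypothetical class, intervals are treated as opaque string keys (the real timeline handles partially overlapping intervals and partitions), versions are compared as plain strings, and the empty-string default namespace is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model: one "timeline" per namespace, each mapping interval -> winning version. */
final class NamespacedTimelineSketch {
    // Assumed default namespace for non-namespaced segments.
    static final String DEFAULT_NAMESPACE = "";

    // TreeMaps keep iteration order deterministic (sorted by namespace, then interval).
    private final Map<String, Map<String, String>> timelines = new TreeMap<>();

    /** Adds a segment; it can only overshadow segments in the same namespace and interval. */
    void add(String namespace, String interval, String version) {
        String ns = namespace == null ? DEFAULT_NAMESPACE : namespace;
        timelines
            .computeIfAbsent(ns, k -> new TreeMap<>())
            .merge(interval, version, (oldV, newV) -> newV.compareTo(oldV) > 0 ? newV : oldV);
    }

    /** Looks up the visible version in every namespace for the given interval. */
    List<String> lookup(String interval) {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : timelines.entrySet()) {
            String version = e.getValue().get(interval);
            if (version != null) {
                visible.add(e.getKey() + "@" + version);
            }
        }
        return visible;
    }
}
```

In this model, adding `("clicks", interval, "v2")` after `("clicks", interval, "v1")` and `("views", interval, "v1")` replaces only the `clicks` entry, and a lookup for that interval returns one visible entry from each namespace.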
### Rationale

The primary motivation for this proposal is to support scenarios where users produce data from multiple sources but wish to query this data as if it were a single table. Although there are ongoing efforts in the Druid community to address this via joins (#8728) and union data sources, both approaches have certain drawbacks. The initial phases of join support envision joining enhanced lookup-style dimension tables onto data sources, not joining data sources to each other, and union data sources require all unioned data sources to have identical schemata. Namespacing allows multiple logical data sources to be merged, even if they have differing schemata. The obvious downside to namespacing is that this merging must be performed at ingestion time and doesn't support arbitrary joining at query time.

### Operational impact

These proposed changes modify many internal APIs, but do not deprecate or remove any external features or behavior. These changes are transparent to non-namespaced segments and data sources, meaning that clusters can be upgraded and downgraded without issue. Once namespaced segments are created, however, the data sources containing them cannot be downgraded (whichever namespace was lexicographically last would overshadow all other namespaces).

We have been running these changes in production for ~12 months without a noticeable effect on latency or performance. Of course, our implementation and a community implementation will likely differ to some degree to account for the broader use cases throughout the community. It is possible that this change would introduce small latency regressions for some use cases due to the move to comparing objects instead of ints for partition numbers/identifiers.

### Test plan

There are two main pieces of this proposal that need testing: first, that namespacing works as intended, and second, that there are no observable changes for users who do not adopt namespacing.
For testing namespacing, we would want to test the following scenarios: that new namespaced segments with a higher version overshadow only existing segments with the same namespace, and that namespaced segments with the same data source are queried across namespaces when the data source is queried. We would also want to test that all current real-time and batch ingestion mechanisms can correctly create namespaced segments. For testing that there are no changes required for non-namespaced segments, the existing Druid tests should be sufficient (i.e. if all our existing tests work without changes, we can be confident that existing use cases will be unaffected).