JulianJaffePinterest opened a new issue #9463: Add namespaces to Druid segments 
within a data source
URL: https://github.com/apache/druid/issues/9463
 
 
   ### Motivation
   
   Currently, Druid versions segments by their data source, covered interval, 
and an arbitrary version string (usually the time at which they were written). 
However, for cases where users want to be able to ingest and query data from 
multiple sources or selectively update subsets of their data, this requires 
intermediate processing or over-processing. To support these use cases, we can 
introduce the concept of a namespace to shard specs and 
VersionedIntervalTimelines, versioning namespaced shard specs by their data 
source, covered interval, _namespace_, and version string. This would allow 
data sources to be transparently populated from multiple input sources, with 
overshadowing and atomic updates continuing to function within the context of a 
namespace without affecting data in other namespaces and without affecting 
the behavior of non-namespaced segments.
   
   ### Proposed changes
   
   The proposed changes primarily fall into two buckets:
   
   First, modify the `PartitionChunk` and `ShardSpec` classes to add the 
methods `default Object getChunkIdentifier` and `default Object getIdentifier`, 
which simply return the chunk number and partition number, respectively. 
Further, modify the `SegmentId` and `SegmentDescriptor` classes to add an 
identifier. From here, all calls to ShardSpec's `getPartitionNum` should be 
replaced with calls to `getIdentifier` and all SegmentDescriptor/SegmentId 
creations updated. Namespaced shard specs can be created that approximate 
existing shard specs (e.g. a `NamedNumberedShardSpec` can be created that 
duplicates the logic of a `NumberedShardSpec` with the `abuts` and `compareTo` 
methods updated to check for matching namespaces as well). Additionally, 
real-time ingestion specs (which don't allow specifying a shard spec directly) 
can be updated to optionally take a namespace. These changes are transparent to 
existing code, except for a small performance hit from comparing Objects instead 
of ints.
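   
   A minimal sketch of the shard spec change described above, using simplified 
stand-ins rather than Druid's real classes. The names `ShardSpecSketch`, 
`NumberedSketch`, and `NamedNumberedSketch`, and the `namespace_partitionNum` 
identifier format, are assumptions made for illustration only.

```java
import java.util.Objects;

// Hypothetical, simplified stand-in for Druid's ShardSpec (not the real interface).
interface ShardSpecSketch {
    int getPartitionNum();

    // Default implementation: existing shard specs keep identifying
    // themselves by partition number, so their behavior is unchanged.
    default Object getIdentifier() {
        return getPartitionNum();
    }
}

// Non-namespaced spec: inherits the default identifier.
final class NumberedSketch implements ShardSpecSketch {
    private final int partitionNum;

    NumberedSketch(int partitionNum) {
        this.partitionNum = partitionNum;
    }

    @Override
    public int getPartitionNum() {
        return partitionNum;
    }
}

// Namespaced spec: the identifier also carries the namespace.
final class NamedNumberedSketch implements ShardSpecSketch {
    private final String namespace;
    private final int partitionNum;

    NamedNumberedSketch(String namespace, int partitionNum) {
        this.namespace = Objects.requireNonNull(namespace);
        this.partitionNum = partitionNum;
    }

    @Override
    public int getPartitionNum() {
        return partitionNum;
    }

    // Assumed identifier format; the proposal does not specify one.
    @Override
    public Object getIdentifier() {
        return namespace + "_" + partitionNum;
    }

    // Two specs abut only if they share a namespace and are consecutive.
    boolean abuts(NamedNumberedSketch next) {
        return namespace.equals(next.namespace)
            && next.partitionNum == partitionNum + 1;
    }
}
```

   In this sketch, partition 0 of namespace `a` abuts partition 1 of namespace 
`a` but not partition 1 of namespace `b`, which is the namespace check the 
proposal adds on top of the `NumberedShardSpec` logic.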
   
   The other piece of the puzzle is to introduce a 
NamespacedVersionedIntervalTimeline that similarly apes the logic of 
VersionedIntervalTimeline with a few key changes. Primarily, a 
NamespacedVersionedIntervalTimeline contains a map from namespace strings 
to VersionedIntervalTimelines (one timeline per namespace). Adding, removing, 
and overshadowing segments and partitions can then be done in the context of 
the appropriate namespace while lookups can be done across _all_ namespaces. 
VersionedIntervalTimelines in CachingClusteredClient, BrokerServerView, etc. 
can then be replaced with NamespacedVersionedIntervalTimelines without 
affecting any non-namespaced segments (since they'll all be added to the 
default namespace and thus the same underlying VersionedIntervalTimeline, as 
they currently are).
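   
   The timeline structure can be sketched roughly as follows. This is a toy 
model, not the implementation we run: `SegmentSketch`, `TimelineSketch`, and 
`NamespacedTimelineSketch` are hypothetical names, intervals are treated as 
opaque strings, and overshadowing is reduced to "highest version string wins 
for an identical interval".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical segment record; a real Druid segment carries far more.
final class SegmentSketch {
    final String namespace; // null means "no namespace"
    final String interval;
    final String version;

    SegmentSketch(String namespace, String interval, String version) {
        this.namespace = namespace;
        this.interval = interval;
        this.version = version;
    }
}

// Toy per-namespace timeline: for an identical interval,
// the segment with the highest version string stays visible.
final class TimelineSketch {
    private final Map<String, SegmentSketch> visible = new HashMap<>();

    void add(SegmentSketch segment) {
        visible.merge(segment.interval, segment,
            (old, added) -> added.version.compareTo(old.version) > 0 ? added : old);
    }

    SegmentSketch lookup(String interval) {
        return visible.get(interval);
    }
}

// Map of namespace -> timeline. Writes are scoped to one namespace,
// lookups fan out across all of them.
final class NamespacedTimelineSketch {
    static final String DEFAULT_NAMESPACE = ""; // assumed default key

    private final Map<String, TimelineSketch> timelines = new TreeMap<>();

    void add(SegmentSketch segment) {
        String ns = segment.namespace == null ? DEFAULT_NAMESPACE : segment.namespace;
        timelines.computeIfAbsent(ns, k -> new TimelineSketch()).add(segment);
    }

    // Returns the visible segment of every namespace for the interval,
    // ordered by namespace name.
    List<SegmentSketch> lookup(String interval) {
        List<SegmentSketch> result = new ArrayList<>();
        for (TimelineSketch timeline : timelines.values()) {
            SegmentSketch segment = timeline.lookup(interval);
            if (segment != null) {
                result.add(segment);
            }
        }
        return result;
    }
}
```

   Non-namespaced segments all land under the default key in this sketch, so 
they share a single underlying timeline exactly as they would without the 
wrapper.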
   
   These changes should be entirely transparent to users who aren't interested 
in using them, since all existing behavior will be unchanged. For users who do 
want to use namespaced shard specs, they can simply specify the appropriate 
shard spec in their ingestion config (or, for real-time ingestion specs, a 
namespace) and namespaced shard specs will be created. These shard specs will 
only extend or overshadow data with the same data source and namespace, but all 
segments with a given data source will be queried together, regardless of 
namespace. Here, this proposal takes advantage of the fact that Druid already 
handles cases where some dimensions or metrics are available in certain 
segments for a data source but not all of them. Metrics are inferred to be 0 and 
dimensions are inferred to be null. If there are shared dimensions between 
namespaces, post-aggregators can be used to return combined results for any 
given combination, even if not all metrics or dimensions are present in all 
namespaces.
   
   ### Rationale
   
   The primary motivation for this proposal is to support scenarios where users 
produce data from multiple sources but wish to query this data as if it were a 
single table. Although there are ongoing efforts in the Druid community to 
address this via joins (#8728) and union data sources, both approaches have 
certain drawbacks. The initial phases for join support envision supporting 
joining enhanced lookup-style dimension tables onto data sources, not joining 
data sources, and union data sources require all unioned data sources to have 
identical schemata. Namespacing allows multiple logical data sources to be 
merged, even if they have differing schemata. The obvious downside to 
namespacing is that this merging must be performed at ingestion time and 
doesn't support arbitrary joining at query time.
   
   ### Operational impact
   
   These proposed changes modify many internal APIs, but do not deprecate or 
remove any external features or behavior.
   
   These changes are transparent to non-namespaced segments and data sources, 
meaning that clusters can be upgraded and downgraded without issue. Once 
namespaced segments are created, data sources cannot be downgraded (whichever 
namespace was lexicographically last would overshadow all other namespaces).
   
   We have been running these changes in production for ~12 months without a 
noticeable effect on latency or performance. Of course, our implementation and 
a community implementation will likely differ to some degree to account for the 
broader use cases throughout the community. It is possible that this change 
would introduce small latency regressions for some use cases due to the move to 
comparing objects instead of ints for partition numbers/identifiers.
   
   ### Test plan
   
   There are two main pieces of this proposal that need testing: first, that 
namespacing works as intended, and second, that there are no observable changes 
for users who do not adopt namespacing.
   
   For testing namespacing, we would want to test the following scenarios: that 
new namespaced segments with a higher version overshadow only existing 
segments with the same namespace, and that namespaced segments with the same 
data source are queried across namespaces when the data source is queried. We 
would also want to test that all current real-time and batch ingestion 
mechanisms can correctly create namespaced segments.
   
   For testing that there are no changes required for non-namespaced segments, 
the existing Druid tests should be sufficient (i.e. if all our existing tests 
pass without changes, we can be confident that existing use cases will be 
unaffected).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org
