Hi Benedict,

I am curious to understand which Druid functionality you are seeing the slowness in. Is it the coordinator's work of assigning segments to historicals that is slow, or is it the querying of segment information? Have you looked at CPU/network metrics for your metadata RDS? Scaling up to a bigger instance might help. It would also be good to look at the query patterns and possibly tweak or add indexes to speed them up. Also, do you have cleanup of metadata rows enabled (https://druid.apache.org/docs/latest/tutorials/tutorial-delete-data.html#run-a-kill-task and `druid.coordinator.kill.on`)? That should further help control the size of the druid_segments table.
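To make the index suggestion concrete, here is a minimal, hypothetical sketch (MySQL syntax). It assumes the hot statements filter on used and dataSource, which you would want to confirm from the actual slow queries first (e.g. the MySQL slow query log or pg_stat_statements); the index name is illustrative.

```sql
-- Hypothetical example only: index for the query shape you actually observe.
-- This assumes the slow statements filter on used and dataSource.
CREATE INDEX idx_druid_segments_used_datasource
    ON druid_segments (used, dataSource);
```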
On Tue, Apr 6, 2021 at 8:08 AM Ben Krug <ben.k...@imply.io> wrote:

> I suppose, if we were going down this path, something like tombstones in
> Cassandra could be used, but it would increase the complexity
> significantly. I.e., a new row is inserted with a deletion marker and a
> timestamp, indicating that the corresponding row is deleted. Now, anyone
> who scans the table also needs to check for tombstones and process that
> logic. Then, after a configurable amount of time, both the original row
> and the tombstone row can be cleaned up.
>
> Probably a lot of work and complexity for this one use case, though.
>
> On Tue, Apr 6, 2021 at 4:02 AM Abhishek Agarwal <abhishek.agar...@imply.io>
> wrote:
>
> > If an entry is deleted from the metadata, how is the coordinator going
> > to update its own state?
> >
> > On Tue, Apr 6, 2021 at 3:38 PM Itai Yaffe <itai.ya...@gmail.com> wrote:
> >
> > > Hey,
> > > I'm not a Druid developer, so it's quite possible I'm missing many
> > > considerations here, but at first glance I like your proposal, as it
> > > resembles the *tsColumn* in JDBC lookups (
> > > https://druid.apache.org/docs/latest/development/extensions-core/lookups-cached-global.html#jdbc-lookup
> > > ).
> > >
> > > Anyway, just my 2 cents.
> > >
> > > Thanks!
> > > Itai
> > >
> > > On Tue, Apr 6, 2021 at 6:07 AM Benedict Jin <asdf2...@apache.org> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Recently, the Coordinator in our company's Druid cluster has hit a
> > > > performance bottleneck when pulling metadata. The main reason is the
> > > > huge amount of metadata, which makes the full-table scan of the
> > > > metadata storage and the deserialization of the metadata very slow.
> > > > We have already reduced the size of the metadata through TTL,
> > > > compaction, rollup, etc., but the effect has not been significant.
> > > > Therefore, I want to design a scheme for the Coordinator to pull
> > > > metadata incrementally, i.e., each time the Coordinator only pulls
> > > > newly added metadata, so as to reduce both the query pressure on the
> > > > metadata storage and the cost of deserializing metadata. The general
> > > > idea is to add a last_update column to the druid_segments table to
> > > > record the update time of each record. Then, when we query the
> > > > metadata table, we can add a filter condition on the last_update
> > > > column to avoid full-table scans. Moreover, both MySQL and
> > > > PostgreSQL, as metadata storage media, support automatic updates of
> > > > a timestamp field, which is somewhat similar to triggers. So, have
> > > > you encountered this problem before? If so, how did you solve it? In
> > > > addition, do you have any suggestions or comments on the above
> > > > incremental acquisition of metadata? Please let me know, thanks a
> > > > lot.
> > > >
> > > > Regards,
> > > > Benedict Jin
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > > > For additional commands, e-mail: dev-h...@druid.apache.org
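For what it's worth, a minimal sketch of the last_update idea described above might look like the following (MySQL syntax; the index name and the :last_poll_time parameter are illustrative, and PostgreSQL would need a trigger to maintain the column instead of ON UPDATE CURRENT_TIMESTAMP).

```sql
-- Sketch only: auto-maintained update timestamp on druid_segments (MySQL).
ALTER TABLE druid_segments
    ADD COLUMN last_update TIMESTAMP NOT NULL
    DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

-- Index so the incremental poll can avoid a full-table scan.
CREATE INDEX idx_druid_segments_last_update
    ON druid_segments (last_update);

-- Incremental poll: fetch only rows touched since the previous poll.
SELECT payload
FROM druid_segments
WHERE used = true
  AND last_update > :last_poll_time;
```

Note that this only covers inserts and updates; as Abhishek points out above, rows deleted from the metadata store would still need separate handling for the coordinator to drop them from its own state.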