Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-07 Thread Benedict Jin
Hi Julian Jaffe,

Thank you very much. I haven't tried it yet. Could you provide a more specific 
example? In theory, adding indexes slows down insert and update operations; in 
your scenario, how large was that performance hit? Also, regarding the 
Coordinator bottleneck, should we consider introducing a federation 
architecture to Apache Druid?

Regards,
Benedict Jin

On 2021/04/07 06:27:58, Julian Jaffe wrote:

Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Julian Jaffe
Hey Benedict,

Have you tried creating indices on your segments table? I’ve managed Druid 
clusters with orders of magnitude more segments without this issue by indexing 
key filter columns. (The coordinator is still a painful bottleneck, just not 
due to query times to the metadata server 😛)

Best,
Julian
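
For concreteness, here is a minimal sketch of what such indexing could look
like (MySQL syntax; the column names assume the default druid_segments schema,
so verify them against your own metadata store before applying):

    -- The Coordinator's poll filters on the used flag, so an index there
    -- avoids scanning the mostly-unused rows:
    CREATE INDEX idx_druid_segments_used ON druid_segments (used);

    -- A composite index can additionally serve queries that filter by
    -- datasource on top of the used flag:
    CREATE INDEX idx_druid_segments_ds_used ON druid_segments (dataSource, used);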

> On Apr 6, 2021, at 8:53 PM, Benedict Jin wrote:

Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Benedict Jin
Hi Jihoon Son,

Yes, it does bring some compatibility issues. I just checked the latest 
metadata: the metadata table currently holds five million records, nearly half 
of which are marked as used, and the machine hosting the metadata store is 
relatively idle.

Regards,
Benedict Jin

On 2021/04/07 02:35:32, Jihoon Son wrote:

Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Jihoon Son
For this sort of issue, we should think about whether there is any other
way to address the same problem without modifying the metadata table
schema, because changing the schema introduces compatibility issues,
such as the upgrade path for existing users.

Benedict, as Samarth and Lucas pointed out, it would be nice if you
shared more details of exactly where the bottleneck is. That will make
the problem clearer and get everyone on the same page.

On Tue, Apr 6, 2021 at 6:54 PM Benedict Jin wrote:



Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Benedict Jin
Hi Ben Krug,

+1 for adding the is_deleted column; we can then create a scheduled job to 
clear out these old records.

Regards,
Benedict Jin
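
As a rough sketch, that cleanup job could be a MySQL scheduled event
(hypothetical: it assumes the proposed is_deleted and last_update columns
exist, a 30-day retention window, and event_scheduler = ON; PostgreSQL would
need pg_cron or an external scheduler instead):

    CREATE EVENT purge_soft_deleted_segments
    ON SCHEDULE EVERY 1 DAY
    DO
      DELETE FROM druid_segments
      WHERE is_deleted = true
        AND last_update < NOW() - INTERVAL 30 DAY;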

On 2021/04/06 18:28:45, Ben Krug wrote:



Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Benedict Jin
Hi Samarth Jain,

Thanks. The main cause is the sheer volume of metadata, which makes the 
full-table scan of the metadata store and the subsequent deserialization very 
slow. And yes, I have already tried cleaning up the metadata.

Regards,
Benedict Jin

On 2021/04/06 17:20:26, Samarth Jain wrote:



Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Benedict Jin
Hi Ben Krug,

Thank you very much for the ideas, but I feel that introducing Cassandra would 
be too heavyweight. The Cassandra tombstone behavior you mentioned can actually 
be reproduced with scheduled jobs in MySQL or PostgreSQL.

Regards,
Benedict Jin

On 2021/04/06 15:08:03, Ben Krug wrote:



Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Benedict Jin
Hi Abhishek Agarwal,

You made a very important point, thank you very much.

Regards,
Benedict Jin

On 2021/04/06 11:02:34, Abhishek Agarwal wrote:



Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Benedict Jin
Hi Itai Yaffe,

Thank you very much for your support.

Regards,
Benedict Jin

On 2021/04/06 10:06:45, Itai Yaffe wrote:



Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Lucas Capistrant
Hey Benedict,

Adding on to what Samarth said in their reply, could you provide some more
context on this one to help the group understand more about your issue:

   - Is this the area of the code that you are saying is non-performant?
   Link

   - How many rows does your druid_segments table have?
   - Out of those rows, how many segments match "used=true"?
   - What are the general specs of the machine running your metastore, and
   which metastore are you using?

I'm always eager to see coordinator performance improve, but I think we
should be cautious about any changes to metastore table schemas!

- Lucas
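
For reference, the row counts asked about above can be read straight off the
metadata store (standard SQL; druid_segments is the default table name, so
adjust it if you use a custom base name):

    -- Total rows in the segments table:
    SELECT COUNT(*) FROM druid_segments;

    -- Rows the Coordinator actually has to pull on each poll:
    SELECT COUNT(*) FROM druid_segments WHERE used = true;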

On Tue, Apr 6, 2021 at 1:28 PM Ben Krug wrote:


Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Ben Krug
Oh, that's easier than tombstones: flag is_deleted and update the timestamp
(so the row gets pulled again).
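
A minimal sketch of that soft delete (hypothetical: it assumes the is_deleted
column discussed in this thread plus an auto-updating last_update timestamp,
and the segment id is a placeholder):

    -- Mark the row deleted instead of removing it; the refreshed
    -- last_update timestamp makes the next incremental poll pick it up:
    UPDATE druid_segments
    SET is_deleted = true
    WHERE id = 'example_segment_id';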

On Tue, Apr 6, 2021 at 10:48 AM Tijo Thomas wrote:


Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Tijo Thomas
Abhishek,
Good point. Do we need one more column to store whether it's deleted or not?

On Tue, Apr 6, 2021 at 4:32 PM Abhishek Agarwal wrote:


-- 
Thanks & Regards
Tijo Thomas


Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Samarth Jain
Hi Benedict,

I am curious to understand which Druid functionality you are seeing the
slowness in. Is it the coordinator's work of assigning segments to
historicals that is slow, or is it the querying of segment information?
Have you looked into CPU/network metrics for your metadata RDS? Maybe
scaling up to a bigger instance would help. It would also be good to
look at the query patterns and possibly tweak or add new indexes to
help speed things up. Also, do you have cleanup of metadata rows enabled (
https://druid.apache.org/docs/latest/tutorials/tutorial-delete-data.html#run-a-kill-task
and *druid.coordinator.kill*.*on*)? That should further help control the
size of the druid_segments table.
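
One quick way to inspect those query patterns is to run the poll query through
the database's planner (MySQL shown; the query shape is an approximation of the
Coordinator's poll, so adapt it to your metadata store):

    -- Shows whether the poll resolves via an index or falls back
    -- to a full table scan:
    EXPLAIN SELECT payload FROM druid_segments WHERE used = true;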

On Tue, Apr 6, 2021 at 8:08 AM Ben Krug wrote:


Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Ben Krug
I suppose, if we were going down this path, something like tombstones in
Cassandra could be used, but it would increase the complexity significantly.
That is, a new row is inserted with a deletion marker and a timestamp,
indicating that the corresponding row is deleted. Now, whenever anyone scans
the table, they need to check for tombstones too and process that logic.
Then, after a configurable amount of time, both the original row and the
tombstone row can be cleaned up.

Probably a lot of work and complexity for this one use case, though.
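
A rough sketch of that tombstone flow (MySQL syntax; the
druid_segments_tombstones table, the placeholder segment id, and the 7-day
window are hypothetical illustrations, not existing Druid schema):

    -- Instead of deleting a segment row, record a tombstone for it:
    INSERT INTO druid_segments_tombstones (segment_id, deleted_at)
    VALUES ('example_segment_id', CURRENT_TIMESTAMP);

    -- After the configurable window, purge both the original row and
    -- its tombstone in one pass:
    DELETE s, t
    FROM druid_segments s
    JOIN druid_segments_tombstones t ON t.segment_id = s.id
    WHERE t.deleted_at < NOW() - INTERVAL 7 DAY;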

On Tue, Apr 6, 2021 at 4:02 AM Abhishek Agarwal wrote:


Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Abhishek Agarwal
If an entry is deleted from the metadata, how is the coordinator going to
update its own state?

On Tue, Apr 6, 2021 at 3:38 PM Itai Yaffe wrote:


Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Itai Yaffe
Hey,
I'm not a Druid developer, so it's quite possible I'm missing many
considerations here, but at first glance I like your proposal, as it
resembles the *tsColumn* in JDBC lookups (
https://druid.apache.org/docs/latest/development/extensions-core/lookups-cached-global.html#jdbc-lookup
).

Anyway, just my 2 cents.

Thanks!
  Itai
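
For readers unfamiliar with that feature: conceptually, tsColumn turns the
lookup's polling query into an incremental one, along these lines (illustrative
SQL with made-up table and column names, not the extension's literal
implementation):

    -- Only rows whose timestamp advanced since the last poll are re-read:
    SELECT lookup_key, lookup_value
    FROM my_lookup_table
    WHERE updated_at > :last_poll_time;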

On Tue, Apr 6, 2021 at 6:07 AM Benedict Jin wrote:


Propose a scheme for Coordinator to pull metadata incrementally

2021-04-05 Thread Benedict Jin
Hi all,

Recently, the Coordinator in our company's Druid cluster has hit a performance 
bottleneck when pulling metadata. The main cause is the huge amount of 
metadata, which makes the full-table scan of the metadata store and the 
deserialization of that metadata very slow. We have already reduced the size of 
the full metadata through TTL, compaction, rollup, etc., but the effect has not 
been very significant. Therefore, I want to design a scheme for the Coordinator 
to pull metadata incrementally, that is, each time the Coordinator pulls only 
newly added metadata, so as to reduce both the query pressure on the metadata 
store and the cost of deserializing metadata. The general idea is to add a 
last_update column to the druid_segments table to record the update time of 
each record. When querying the metadata table, we can then filter on the 
last_update column and avoid full-table scans. Moreover, both MySQL and 
PostgreSQL, as metadata storage backends, support automatically updating a 
timestamp field, somewhat like a trigger. So, have you encountered this problem 
before? If so, how did you solve it? And do you have any suggestions or 
comments on the above incremental acquisition of metadata? Please let me know, 
thanks a lot.

Regards,
Benedict Jin
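
A minimal sketch of the proposed schema change (illustrative DDL: the index
name, trigger name, and the exact shape of the poll query are assumptions, and
the trigger uses PostgreSQL 11+ syntax):

    -- MySQL: the column maintains itself on both INSERT and UPDATE.
    ALTER TABLE druid_segments
      ADD COLUMN last_update TIMESTAMP NOT NULL
        DEFAULT CURRENT_TIMESTAMP
        ON UPDATE CURRENT_TIMESTAMP;
    CREATE INDEX idx_segments_last_update ON druid_segments (last_update);

    -- PostgreSQL has no ON UPDATE clause, so a trigger does the same job.
    ALTER TABLE druid_segments
      ADD COLUMN last_update TIMESTAMPTZ NOT NULL DEFAULT now();
    CREATE FUNCTION touch_last_update() RETURNS trigger AS $$
    BEGIN
      NEW.last_update := now();
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;
    CREATE TRIGGER druid_segments_touch
      BEFORE UPDATE ON druid_segments
      FOR EACH ROW EXECUTE FUNCTION touch_last_update();

    -- Incremental poll: fetch only rows that changed since the last sync.
    SELECT payload FROM druid_segments
    WHERE used = true AND last_update > :last_sync_time;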

-
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org