[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-04-15 Thread Venugopal Reddy K (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837100#comment-17837100
 ] 

Venugopal Reddy K commented on IMPALA-12709:


I have observed a significant improvement in event processing time even with 
just 10 databases, each with 10 tables and 100 partitions per table. I am 
working on adding more cases. Feel free to share your input. The comparison 
with and without hierarchical processing is available at 
https://docs.google.com/spreadsheets/d/1ByjwPhRy75v_KzWq69iqNmRernze_LZ-qyj6If0-lkY/edit?usp=sharing

> Hierarchical metastore event processing
> ---
>
> Key: IMPALA-12709
> URL: https://issues.apache.org/jira/browse/IMPALA-12709
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Reporter: Venugopal Reddy K
> Assignee: Venugopal Reddy K
> Priority: Major
> Attachments: Hierarchical metastore event processing.docx
>
>
> *Current Issue:*
> At present, the metastore event processor is single-threaded. Notification 
> events are processed sequentially, with at most 1000 events fetched and 
> processed in a single batch. Multiple locks are used to address the 
> concurrency issues that may arise when catalog DDL operation processing and 
> metastore event processing try to access or update the catalog objects 
> concurrently. Waiting for a lock or for file metadata loading of a table can 
> slow event processing and delay the events that follow, even when those 
> events do not depend on the blocked event. Altogether, it takes a very long 
> time to synchronize all the HMS events.
> *Proposal:*
> The existing metastore event processing can be turned into multi-level event 
> processing. The idea is to segregate events based on their dependencies, 
> maintain the order in which events occur within each dependency group, and 
> process independent groups in parallel as much as possible:
>  # All the events of a table are processed in the same order in which they 
> actually occurred.
>  # Events of different tables are processed in parallel.
>  # When a database is altered, all events relating to that database (i.e., 
> for all its tables) that occur after the alter database event are processed 
> only after the alter database event has been processed, preserving the order.
> I have attached an initial proposal design document:
> https://docs.google.com/document/d/1KZ-ANko-qn5CYmY13m4OVJXAYjLaS1VP-c64Pumipq8/edit?pli=1#heading=h.qyk8qz8ez37b
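The three ordering rules in the proposal can be illustrated with a minimal routing sketch. This is not the code from the patch: the `Event` tuple, the `route` function, and the sample event kinds are hypothetical names chosen for illustration. The sketch only records which database-level event a later table event must wait for; whether a database-level event should itself wait for earlier table events is not stated in the description and is not modelled here.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Event(NamedTuple):
    # Hypothetical, simplified stand-in for an HMS notification event.
    event_id: int
    db: str
    table: Optional[str]  # None marks a database-level event
    kind: str


def route(events):
    """Group events into per-table queues (rules 1 and 2) and record, for each
    table event, the latest earlier database-level event of its db (rule 3)."""
    table_queues = defaultdict(list)  # (db, table) -> events in arrival order
    db_queues = defaultdict(list)     # db -> database-level events in order
    barrier = {}                      # table event id -> db event id to wait for
    for ev in sorted(events, key=lambda e: e.event_id):
        if ev.table is None:
            db_queues[ev.db].append(ev)
        else:
            table_queues[(ev.db, ev.table)].append(ev)
            earlier_db_events = db_queues.get(ev.db)
            if earlier_db_events:
                barrier[ev.event_id] = earlier_db_events[-1].event_id
    return table_queues, db_queues, barrier


events = [
    Event(1, "db1", "t1", "INSERT"),
    Event(2, "db1", "t2", "INSERT"),
    Event(3, "db1", None, "ALTER_DATABASE"),
    Event(4, "db1", "t1", "ALTER_TABLE"),
    Event(5, "db2", "t1", "INSERT"),
]
table_queues, db_queues, barrier = route(events)
```

Each per-table queue can then be drained by its own worker, while an event listed in `barrier` is held back until the referenced database-level event has been processed.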



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-04-09 Thread Maxwell Guo (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835459#comment-17835459
 ] 

Maxwell Guo commented on IMPALA-12709:
--

[~VenuReddy] any update here? :D




[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-04-02 Thread Maxwell Guo (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833178#comment-17833178
 ] 

Maxwell Guo commented on IMPALA-12709:
--

[~VenuReddy] Thanks very much, looking forward to your update.




[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-04-02 Thread Venugopal Reddy K (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833161#comment-17833161
 ] 

Venugopal Reddy K commented on IMPALA-12709:


[~maxwellguo] I am currently measuring and comparing the time taken with the 
base and modified versions. I am also tuning the configuration parameters added 
with the gerrit change to see how they affect the event processing time. Since 
there are no existing tests to measure the event processing time, I am adding 
some tests. 
[https://gerrit.cloudera.org/#/c/21031/8/fe/src/test/java/org/apache/impala/catalog/events/EventsProcessorPerfTest.java]

has a test that creates 10 databases with 10 (non-transactional) tables in 
each, inserts data into all 100 tables, and then drops the tables and 
databases. Results showed that with Hierarchical Processing enabled, processing 
of insert into table (which generates ALTER and INSERT events) looks much 
faster (nearly 5 times). However, processing of create database, create table, 
drop table, and drop database events is not. I am checking it. I am also 
planning to add more perf tests with partitioned tables, transactional tables, 
different possible sequences of events, etc.

Test output for 2 runs:

 
{noformat}
I0402 17:04:40.532240 712808 EventsProcessorPerfTest.java:131] [Performance] With Hierarchical Processing: false
I0402 17:04:40.705119 712808 EventsProcessorPerfTest.java:140] [Performance] Time taken to process create database events: 75.11 ms
I0402 17:04:43.643066 712808 EventsProcessorPerfTest.java:153] [Performance] Time taken to process create table events: 136.4 ms
I0402 17:04:44.130368 712808 EventsProcessorPerfTest.java:181] [Performance] Time taken to load table: 486.9 ms
I0402 17:05:15.474153 712808 EventsProcessorPerfTest.java:194] [Performance] Time taken to process insert events : 1.955 s
I0402 17:05:24.419824 712808 EventsProcessorPerfTest.java:206] [Performance] Time taken to process drop table events : 97.01 ms
I0402 17:05:24.684505 712808 EventsProcessorPerfTest.java:216] [Performance] Time taken to process database events : 26.55 ms
I0402 17:05:25.107113 712808 EventsProcessorPerfTest.java:131] [Performance] With Hierarchical Processing: true
I0402 17:05:25.196743 712808 EventsProcessorPerfTest.java:140] [Performance] Time taken to process create database events: 15.21 ms
I0402 17:05:28.118330 712808 EventsProcessorPerfTest.java:153] [Performance] Time taken to process create table events: 50.12 ms
I0402 17:05:28.473388 712808 EventsProcessorPerfTest.java:181] [Performance] Time taken to load table: 354.8 ms
I0402 17:05:52.529421 712808 EventsProcessorPerfTest.java:194] [Performance] Time taken to process insert events : 402.1 ms
I0402 17:06:01.460664 712808 EventsProcessorPerfTest.java:206] [Performance] Time taken to process drop table events : 132.2 ms
I0402 17:06:01.848369 712808 EventsProcessorPerfTest.java:216] [Performance] Time taken to process database events : 27.53 ms
I0402 17:06:02.227852 712808 EventsProcessorPerfTest.java:131] [Performance] With Hierarchical Processing: false
I0402 17:06:02.435050 712808 EventsProcessorPerfTest.java:140] [Performance] Time taken to process create database events: 18.10 ms
I0402 17:06:05.132701 712808 EventsProcessorPerfTest.java:153] [Performance] Time taken to process create table events: 110.8 ms
I0402 17:06:05.726616 712808 EventsProcessorPerfTest.java:181] [Performance] Time taken to load table: 593.7 ms
I0402 17:06:30.767912 712808 EventsProcessorPerfTest.java:194] [Performance] Time taken to process insert events : 2.246 s
I0402 17:06:40.019438 712808 EventsProcessorPerfTest.java:206] [Performance] Time taken to process drop table events : 122.7 ms
I0402 17:06:40.383190 712808 EventsProcessorPerfTest.java:216] [Performance] Time taken to process database events : 22.18 ms
I0402 17:06:40.801436 712808 EventsProcessorPerfTest.java:131] [Performance] With Hierarchical Processing: true
I0402 17:06:41.036427 712808 EventsProcessorPerfTest.java:140] [Performance] Time taken to process create database events: 21.29 ms
I0402 17:06:43.558152 712808 EventsProcessorPerfTest.java:153] [Performance] Time taken to process create table events: 101.3 ms
I0402 17:06:43.942732 712808 EventsProcessorPerfTest.java:181] [Performance] Time taken to load table: 384.1 ms
I0402 17:07:08.202667 712808 EventsProcessorPerfTest.java:194] [Performance] Time taken to process insert events : 465.2 ms
I0402 17:07:17.037060 712808 EventsProcessorPerfTest.java:206] [Performance] Time taken to process drop table events : 137.3 ms
I0402 17:07:17.377442 712808 EventsProcessorPerfTest.java:216] [Performance] Time taken to process database events : 20.56 ms
{noformat}
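
As a quick check of the "nearly 5 times" figure, the insert-event timings from the two runs in the log above can be compared directly. This is only a small arithmetic sketch; the `to_ms` helper is a hypothetical name, and the values are copied from the log lines for `EventsProcessorPerfTest.java:194`.

```python
def to_ms(text):
    """Convert a duration such as '1.955 s' or '402.1 ms' to milliseconds."""
    value, unit = text.split()
    return float(value) * (1000.0 if unit == "s" else 1.0)


# (hierarchical processing disabled, hierarchical processing enabled) per run
insert_runs = [("1.955 s", "402.1 ms"), ("2.246 s", "465.2 ms")]
speedups = [to_ms(off) / to_ms(on) for off, on in insert_runs]
for run, ratio in enumerate(speedups, start=1):
    print(f"run {run}: insert events processed {ratio:.2f}x faster")
```

Both ratios land a little under 5, consistent with the estimate in the comment. The same helper can be applied to the create and drop timings, where the two runs do not show a consistent gain.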
 


[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-03-26 Thread Maxwell Guo (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831139#comment-17831139
 ] 

Maxwell Guo commented on IMPALA-12709:
--

[~VenuReddy] Sorry to bother you again, is there any update on this? :)




[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-03-12 Thread Maxwell Guo (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825575#comment-17825575
 ] 

Maxwell Guo commented on IMPALA-12709:
--

Thanks [~VenuReddy]




[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-03-12 Thread Venugopal Reddy K (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825564#comment-17825564
 ] 

Venugopal Reddy K commented on IMPALA-12709:


[~maxwellguo] Agreed, this needs to be controlled with a configuration flag. I 
was occupied with IMPALA-12851 and IMPALA-12832 lately and couldn't spend much 
time on this. I have yet to compare the performance with and without it. I will 
give an update about it soon.




[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-03-11 Thread Maxwell Guo (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825340#comment-17825340
 ] 

Maxwell Guo commented on IMPALA-12709:
--

Is there any update on this issue? [~VenuReddy] [~rizaon] [~stigahuang]




[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-02-17 Thread Maxwell Guo (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818229#comment-17818229
 ] 

Maxwell Guo commented on IMPALA-12709:
--

Hi [~VenuReddy], after reading the code, I only found 
[EventsProcessorStressTest|https://github.com/apache/impala/blob/master/fe/src/test/java/org/apache/impala/catalog/events/EventsProcessorStressTest.java],
which may have some relation to performance, but I think some customization is 
required if we want to use that code. [~stigahuang] 
[~mylogi...@gmail.com] any more suggestions?
Besides, what about making this patch configurable? One benefit is that you can 
see the comparison results just by changing the configuration, without changing 
the code, and I think new features are generally turned off by default. 




[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-02-13 Thread Venugopal Reddy K (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817025#comment-17817025
 ] 

Venugopal Reddy K commented on IMPALA-12709:


I have put the POC code in this WIP gerrit: 
[https://gerrit.cloudera.org/#/c/21031/] 

I have covered the positive cases but have yet to measure the performance. Do 
we have a performance test suite for catalogd that can be used to measure it? 
Is there any baselined performance report or specification about how to 
measure, such as what data and how much of it to use for the test? Any 
pointers or suggestions would be appreciated. Thanks!




[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-01-25 Thread Maxwell Guo (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810960#comment-17810960
 ] 

Maxwell Guo commented on IMPALA-12709:
--

[~VenuReddy] thanks for your reply, looking forward to your update.




[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-01-25 Thread Riza Suminto (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810952#comment-17810952
 ] 

Riza Suminto commented on IMPALA-12709:
---

[~VenuReddy] what I want to point out is that you might be able to avoid having 
two different thread pools and event routing between them if you pre-process 
the incoming batch into a DAG of non-overlapping mini-batches. You can process 
each DAG node that has no predecessor in parallel and remove it from the DAG 
upon completion. Then, repeat the process until all mini-batches have been 
removed from the DAG. That way, you need just one thread pool and no event 
routing.
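
The DAG idea above can be sketched roughly as follows. This is an illustration under assumptions, not code from the gerrit change: `run_dag`, the mini-batch names, and the dependency edges in the example are all hypothetical, and the sketch waits for a whole wave of ready mini-batches to finish before starting the next wave, which is a simplification of removing each node as soon as it completes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_dag(batches, deps, process, max_workers=4):
    """Process mini-batches in dependency order using a single thread pool.

    batches: dict mapping mini-batch name -> list of events
    deps:    dict mapping mini-batch name -> set of predecessor names
    Returns the list of waves, each wave being the names processed in parallel.
    """
    remaining = {name: set(deps.get(name, ())) for name in batches}
    waves = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining:
            ready = sorted(n for n, preds in remaining.items() if not preds)
            if not ready:
                raise ValueError("cyclic or unknown mini-batch dependency")
            # Every mini-batch without a predecessor runs in parallel.
            list(pool.map(lambda n: process(n, batches[n]), ready))
            waves.append(ready)
            # Remove the completed nodes from the DAG.
            for name in ready:
                del remaining[name]
            for preds in remaining.values():
                preds.difference_update(ready)
    return waves


processed = []
lock = threading.Lock()


def process(name, events):
    # Placeholder for applying the events of one mini-batch sequentially.
    with lock:
        processed.append(name)


# Hypothetical mini-batches: event ids grouped per table, split at an alter db.
batches = {
    "db1.t1:before": [1],
    "db1.t2:before": [2],
    "db1:alter": [3],
    "db1.t1:after": [4],
    "db2.t1": [5],
}
deps = {
    "db1:alter": {"db1.t1:before", "db1.t2:before"},
    "db1.t1:after": {"db1:alter"},
}
waves = run_dag(batches, deps, process)
```

With these example edges, the mini-batches of unrelated tables run together in the first wave, the alter database mini-batch runs alone in the second, and the later `db1.t1` events run last.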

> Hierarchical metastore event processing
> ---
>
> Key: IMPALA-12709
> URL: https://issues.apache.org/jira/browse/IMPALA-12709
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Venugopal Reddy K
>Assignee: Venugopal Reddy K
>Priority: Major
> Attachments: Hierarchical metastore event processing.docx
>
>
> *Current Issue:*
> At present, metastore event processor is single threaded. Notification events 
> are processed sequentially with a maximum limit of 1000 events fetched and 
> processed in a single batch. Multiple locks are used to address the 
> concurrency issues that may arise when catalog DDL operation processing and 
> metastore event processing tries to access/update the catalog objects 
> concurrently. Waiting for a lock or file metadata loading of a table can slow 
> the event processing and can affect the processing of other events following 
> it. Those events may not be dependent on the previous event. Altogether it 
> takes a very long time to synchronize all the HMS events.
> *Proposal:*
> Existing metastore event processing can be turned into multi-level event 
> processing. Idea is to segregate the events based on their dependency, 
> maintain the order of events as they occur within the dependency and process 
> them independently as much as possible:
>  # All the events of a table are processed in the same order they have 
> actually occurred.
>  # Events of different tables are processed in parallel.
>  # When a database is altered, all the events relating to the database(i.e., 
> for all its tables) occurring after the alter db event are processed only 
> after the alter database event is processed ensuring the order.
> Have attached an initial proposal design document
> https://docs.google.com/document/d/1KZ-ANko-qn5CYmY13m4OVJXAYjLaS1VP-c64Pumipq8/edit?pli=1#heading=h.qyk8qz8ez37b






[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-01-25 Thread Venugopal Reddy K (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810813#comment-17810813
 ] 

Venugopal Reddy K commented on IMPALA-12709:


[~rizaon] That's exactly what I am trying to do. Group 4 (the db event) in the 
example is a barrier event. All the table events (of the db) that occurred 
prior to the barrier event are guaranteed to be processed before it. And all 
the table events (of the db) that occurred after the barrier event are 
processed only after the barrier event has been processed.
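
One way to state the barrier rule precisely is as a check over a finished run. The sketch below is illustrative only; barrier_violations and the (event_id, db, table) tuple layout are assumptions, not part of the design doc. It covers only db-level events versus table events of the same db.

```python
def barrier_violations(events, completion_order):
    """Return (barrier_id, table_event_id) pairs processed on the wrong side.

    events: (event_id, db, table) tuples; table is None for a db-level
            (barrier) event.
    completion_order: event ids in the order their processing finished.
    """
    finished_at = {eid: pos for pos, eid in enumerate(completion_order)}
    violations = []
    for b_id, b_db, b_table in events:
        if b_table is not None:
            continue  # only db-level events act as barriers
        for e_id, e_db, e_table in events:
            if e_table is None or e_db != b_db:
                continue  # compare only table events of the barrier's db
            before_in_log = e_id < b_id
            before_in_run = finished_at[e_id] < finished_at[b_id]
            if before_in_log != before_in_run:
                violations.append((b_id, e_id))
    return violations


# The seven events from the example in this thread, numbered in event-id order.
EVENTS = [
    (1, "db1", "tbl1"), (2, "db1", "tbl2"), (3, "db1", "tbl1"),
    (4, "db2", None), (5, "db1", None), (6, "db2", None),
    (7, "db1", "tbl2"),
]
ok = barrier_violations(EVENTS, [4, 2, 1, 6, 3, 5, 7])
bad = barrier_violations(EVENTS, [1, 2, 7, 3, 4, 5, 6])
```

In the second run, the INSERT (event 7) finishes before ALTER_DATABASE db1 (event 5), so the check reports that pair.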




[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-01-25 Thread Venugopal Reddy K (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810774#comment-17810774
 ] 

Venugopal Reddy K commented on IMPALA-12709:


[~maxwellguo] I am doing a small proof of concept based on the design doc and 
am actively working on it. Once it is done, I will add the POC code link to the 
design doc for reference. In parallel, I am working on the review comments in 
the design doc as well. I will keep updating the status periodically in this 
Jira.




[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-01-24 Thread Maxwell Guo (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810730#comment-17810730
 ] 

Maxwell Guo commented on IMPALA-12709:
--

Hi [~VenuReddy], is there any plan for this patch, such as a release 
timeline?

If this patch is going to be split into smaller tasks, I think I can help 
with some of them.




[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-01-14 Thread Maxwell Guo (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806610#comment-17806610
 ] 

Maxwell Guo commented on IMPALA-12709:
--

I may have a different point of view. Is it possible to divide the db into 
buckets according to the original operation time and parallelize each bucket?




[jira] [Commented] (IMPALA-12709) Hierarchical metastore event processing

2024-01-12 Thread Riza Suminto (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806242#comment-17806242
 ] 

Riza Suminto commented on IMPALA-12709:
---

Since the event batch from HMS is already in serialized order, is it possible 
to just reorder the events in a single batch into disjoint groups and process 
non-overlapping groups in parallel while maintaining correctness?

Let's say there are the following events in a single batch, ordered by event id 
ascending:

 
{code:java}
DROP_TABLE db1.tbl1
ALTER_TABLE db1.tbl2
CREATE_TABLE db1.tbl1
DROP_DATABASE db2
ALTER_DATABASE db1
CREATE_DATABASE db2
INSERT db1.tbl2{code}
That can be reordered into the following:

 
{code:java}
DROP_TABLE db1.tbl1
CREATE_TABLE db1.tbl1

ALTER_TABLE db1.tbl2

DROP_DATABASE db2
CREATE_DATABASE db2

ALTER_DATABASE db1

INSERT db1.tbl2{code}
Groups 1, 2, and 3 can be processed in parallel. Meanwhile, Group 5 needs to 
wait for Group 4 to complete, which in turn waits for Groups 1 and 2 to complete.
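
The regrouping above can be sketched roughly as follows. This is only an illustration under simple assumptions: each event is an (event_type, db, table) tuple, table is None for database-level events, and group_events and BATCH are hypothetical names rather than Impala code. Groups are numbered from 0 here, so Group 4 above is index 3.

```python
from collections import defaultdict


def group_events(events):
    """Split events (in event-id order) into groups with dependencies.

    Returns a list of {"events": [...], "deps": set of group indexes}.
    """
    groups = []
    open_tables = defaultdict(dict)  # db -> {table: index of its open group}
    last_db_group = {}               # db -> index of its latest db-level group

    for event in events:
        _, db, table = event
        if table is not None:
            idx = open_tables[db].get(table)
            if idx is None:
                # First event for this table since the last db-level event:
                # start a group that waits for that db-level group, if any.
                deps = {last_db_group[db]} if db in last_db_group else set()
                groups.append({"events": [], "deps": deps})
                idx = len(groups) - 1
                open_tables[db][table] = idx
        elif not open_tables[db] and db in last_db_group:
            # No table events since the previous db-level event: same group.
            idx = last_db_group[db]
        else:
            # A db-level event waits for all open table groups of that db.
            deps = set(open_tables[db].values())
            groups.append({"events": [], "deps": deps})
            idx = len(groups) - 1
            open_tables[db].clear()
            last_db_group[db] = idx
        groups[idx]["events"].append(event)
    return groups


BATCH = [
    ("DROP_TABLE", "db1", "tbl1"),
    ("ALTER_TABLE", "db1", "tbl2"),
    ("CREATE_TABLE", "db1", "tbl1"),
    ("DROP_DATABASE", "db2", None),
    ("ALTER_DATABASE", "db1", None),
    ("CREATE_DATABASE", "db2", None),
    ("INSERT", "db1", "tbl2"),
]
groups = group_events(BATCH)
```

Groups with an empty deps set can start immediately. The sketch ignores cross-object events such as table renames, which would need extra dependencies.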

 

 
