[ https://issues.apache.org/jira/browse/IMPALA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16779112#comment-16779112 ]
ASF subversion and git services commented on IMPALA-7224:
---------------------------------------------------------

Commit 0b9e8cf8f4cd1dd9f0b653efd907d214dcbb2049 in impala's branch refs/heads/2.x from Todd Lipcon
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=0b9e8cf ]

IMPALA-7224. Improve performance of UpdateCatalogMetrics

This function is called after every DDL query, and was implemented by
fetching the entire list of table names, even though only the length of
that list was needed. In workloads with millions of tables, this could
add several seconds of overhead following even simple requests like
'USE' or 'DESCRIBE'.

I tested a backported version of this patch against one such workload.
It reduced the time taken for a simple DESCRIBE query from 12-14sec down
to about 40ms. I also tested locally that the metrics on impalad were
still updated by DDL operations.

Change-Id: Ic5467adbce1e760ff93996925db5611748efafc0
Reviewed-on: http://gerrit.cloudera.org:8080/10846
Reviewed-by: Vuk Ercegovac <vercego...@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstr...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>

> UpdateCatalogMetrics very slow when there are many tables
> ---------------------------------------------------------
>
>          Key: IMPALA-7224
>          URL: https://issues.apache.org/jira/browse/IMPALA-7224
>      Project: IMPALA
>   Issue Type: Bug
>   Components: Catalog
>     Reporter: Todd Lipcon
>     Assignee: Todd Lipcon
>     Priority: Major
>      Fix For: Impala 3.1.0
>
>
> impalad calls UpdateCatalogMetrics after each statement which is considered a
> DDL. This includes statements like USE, SHOW TABLES, DESCRIBE, etc, which
> don't actually change the number of tables in the catalog, and therefore
> probably don't need to update metrics. That aside, even when the metrics _do_
> need to be updated, the implementation is very slow. It calls getTableNames
> on each database, which results in (a) creating an array of all the names,
> (b) sorting that array and (c) encoding/decoding that whole array into
> Thrift. This is very expensive: on a use case with approximately 8M tables,
> each such call takes 10-12 seconds of CPU, most of which is spent in sorting
> and encoding. All that's really needed is a _count_ of tables, which could be
> fetched directly.
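
The difference between the two approaches described in the issue can be illustrated with a small sketch. This is not Impala's code: the `Db` class, `get_table_names`, `num_tables`, and both counting functions are hypothetical names chosen for illustration, and the Thrift encode/decode step is not modeled at all. It only shows why asking each database for its size avoids the copy and sort that listing the names requires.

```python
from typing import List


class Db:
    """Hypothetical stand-in for a catalog database; not Impala's class."""

    def __init__(self, name: str, tables: List[str]) -> None:
        self.name = name
        self._tables = set(tables)

    def get_table_names(self) -> List[str]:
        # Slow path from the report: build a new list of every name and sort it.
        return sorted(self._tables)

    def num_tables(self) -> int:
        # Cheap path: read the container's size, with no copy and no sort.
        return len(self._tables)


def count_tables_slow(dbs: List[Db]) -> int:
    # Fetches every name list only to take its length.
    return sum(len(db.get_table_names()) for db in dbs)


def count_tables_fast(dbs: List[Db]) -> int:
    # Fetches only the counts.
    return sum(db.num_tables() for db in dbs)


# Illustrative data (assumed, not from the issue).
dbs = [Db("sales", ["orders", "customers"]), Db("logs", ["events"])]
```

Both functions return the same total; the slow variant does work proportional to sorting all names, while the fast one does constant work per database.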