Thank you for sharing the result. (Does your result imply that HIVE-14187 is introducing an intended bug?)
Another issue that could be of your interest is the connection leak problem reported in HIVE-20600. Do you see the connection leak problem, or is it not relevant to your environment (e.g., because you don't use HiveServer2)? --- Sungwoo On Fri, Mar 1, 2024 at 9:45 AM Takanobu Asanuma <takanobu.asan...@gmail.com> wrote: > Hi Pau and Sungwoo, > > Thanks for sharing the information. > > We tested a set of simple queries which just referenced the Hive table and > didn't execute any Hive jobs. The result is below. > > No. Version rawstore.impl connectionPoolingType HIVE-14187 QueryTime > -------------------------------------------------- > 1 1.2.1 ObjectStore None Not Applied 11:38 > 2 3.1.3 ObjectStore None Applied 34:00 > 3 3.1.3 CachedStore None Applied 25:00 > 4 3.1.3 ObjectStore HikariCP Applied 21:10 > 5 3.1.3 CachedStore HikariCP Applied 14:30 > 6 3.1.3 ObjectStore None Reverted 13:00 > 7 3.1.3 ObjectStore HikariCP Reverted 11:23 > -------------------------------------------------- > > Initially, we encountered an issue of Hive MetaStore slowness when we > upgraded from environment No.1 to No.2. As shown in the table, environment > No.2 showed the worst test results. > > A unique aspect of our environment is that we don't use connection > pooling. After some investigation, we thought that the combination of > HIVE-14187 and connectionPoolingType=None was negatively impacting > performance. > The fastest case in our tests was when we reverted HIVE-14187 and set > connectionPoolingType=HikariCP (see No.7). Even with connectionPoolingType > set to None, the environment where we reverted HIVE-14187 still performed > reasonably well (see No.6). > > Please note our investigation is still ongoing and we haven't yet come to > a conclusion. > > Regards, > - Takanobu > > 2024年2月29日(木) 12:18 Sungwoo Park <glap...@gmail.com>: > >> We didn't make any other attempt to fix the problem and just decided not >> to use CachedStore. However, I think our installation of Metastore based on >> Hive 3.1.3 is running without any serious problems. >> >> Could you share how long it takes to compile typical queries in your >> environment (with Hive 1 and with Hive 3)? >> >> FYI, in our environment, sometimes it takes about 10 seconds to compile a >> query on TPC-DS 10TB datasets. Specifically, the average compilation time >> of 103 queries is 1.7 seconds (as reported by Hive), and the longest >> compilation time is 9.6 seconds (query 49). The compilation time includes >> the time for accessing Metastore. >> >> Thanks, >> >> --- Sungwoo >> >> >> On Wed, Feb 28, 2024 at 9:59 PM Takanobu Asanuma <tasan...@apache.org> >> wrote: >> >>> Thanks for your detailed answer! >>> >>> In the original email, you reported "the query compilation takes long" >>> in Hive 3.0, but has this issue been resolved in your fork of Hive 3.1.3? >>> Thank you for sharing the issue with CachedStore and the JIRA tickets. >>> I will also try out metastore.stats.fetch.bitvector=true. >>> >>> Regards, >>> - Takanobu >>> >>> 2024年2月28日(水) 18:49 Sungwoo Park <glap...@gmail.com>: >>> >>>> Hello Takanobu, >>>> >>>> We did not test with vanilla Hive 3.1.3 and Metastore databases can be >>>> different, so I don't know why Metastore responses are very slow. I can >>>> only share some results of testing CachedStore in Metastore. Please note >>>> that we did not use vanilla Hive 3.1.3 and instead used our own fork of >>>> Hive 3.1.3 (which applies many additional patches). >>>> >>>> 1. >>>> When CachedStore is enabled, column stats are not computed. As a >>>> result, some queries generate very inefficient plans because of >>>> wrong/inaccurate stats. >>>> >>>> Perhaps this is because not all patches for CachedStore have been >>>> merged to Hive 3.1.3. For example, these patches are not merged. Or, there >>>> might be some way to properly configure CachedStore so that it correctly >>>> computes column stats. >>>> >>>> HIVE-20896: CachedStore fail to cache stats in multiple code paths >>>> HIVE-21063: Support statistics in cachedStore for transactional table >>>> HIVE-24258: Data mismatch between CachedStore and ObjectStore for >>>> constraint >>>> >>>> So, we decided that CachedStore should not be enabled in Hive 3.1.3. >>>> >>>> (If anyone is running Hive Metastore 3.1.3 in production with >>>> CachedStore enabled, please let us know how you configure it.) >>>> >>>> 2. >>>> Setting metastore.stats.fetch.bitvector=true can also help generate >>>> more efficient query plans. >>>> >>>> --- Sungwoo >>>> >>>> >>>> On Wed, Feb 28, 2024 at 1:40 PM Takanobu Asanuma <tasan...@apache.org> >>>> wrote: >>>> >>>>> Hi Sungwoo Park, >>>>> >>>>> I'm sorry for the late reply to this old email. >>>>> We are attempting to upgrade Hive MetaStore from Hive1 to Hive3, and >>>>> noticed that the response of the Hive3 MetaStore is very slow. >>>>> We suspect that HIVE-14187 might be causing this slowness. >>>>> Could you tell me if you have resolved this problem? Are there still >>>>> any problems when you enable CachedStore? >>>>> >>>>> Regards, >>>>> - Takanobu >>>>> >>>>> 2018年6月13日(水) 0:37 Sungwoo Park <glap...@gmail.com>: >>>>> >>>>>> Hello Hive users, >>>>>> >>>>>> I am experience a problem with MetaStore in Hive 3.0. >>>>>> >>>>>> 1. Start MetaStore >>>>>> with >>>>>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.ObjectStore. >>>>>> >>>>>> 2. Generate TPC-DS data. >>>>>> >>>>>> 3. TPC-DS queries run okay and produce correct results. E.g., from >>>>>> query 1: >>>>>> +-------------------+ >>>>>> | c_customer_id | >>>>>> +-------------------+ >>>>>> | AAAAAAAAAAAACHAA | >>>>>> | AAAAAAAAAAAADCAA | >>>>>> | AAAAAAAAAAAADDAA | >>>>>> ... >>>>>> | AAAAAAAAAAAILIAA | >>>>>> +-------------------+ >>>>>> 100 rows selected (69.901 seconds) >>>>>> >>>>>> However, the query compilation takes long ( >>>>>> https://issues.apache.org/jira/browse/HIVE-16520). >>>>>> >>>>>> 4. Now, restart MetaStore with >>>>>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.cache.CachedStore. >>>>>> >>>>>> 5. TPC-DS queries run okay, but produce wrong results. E.g, from >>>>>> query 1: >>>>>> +----------------+ >>>>>> | c_customer_id | >>>>>> +----------------+ >>>>>> +----------------+ >>>>>> No rows selected (37.448 seconds) >>>>>> >>>>>> What I noticed is that with hive.metastore.rawstore.impl=CachedStore, >>>>>> HiveServer2 produces such log messages: >>>>>> >>>>>> 2018-06-12T23:50:04,223 WARN [b3041385-0290-492f-aef8-c0249de328ad >>>>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats >>>>>> for >>>>>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year >>>>>> 2018-06-12T23:50:04,223 INFO [b3041385-0290-492f-aef8-c0249de328ad >>>>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for >>>>>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year >>>>>> 2018-06-12T23:50:04,225 WARN [b3041385-0290-492f-aef8-c0249de328ad >>>>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats >>>>>> for >>>>>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk >>>>>> 2018-06-12T23:50:04,225 INFO [b3041385-0290-492f-aef8-c0249de328ad >>>>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for >>>>>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk >>>>>> 2018-06-12T23:50:04,226 WARN [b3041385-0290-492f-aef8-c0249de328ad >>>>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats >>>>>> for >>>>>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk, >>>>>> c_customer_id >>>>>> 2018-06-12T23:50:04,226 INFO [b3041385-0290-492f-aef8-c0249de328ad >>>>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for >>>>>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk, >>>>>> c_customer_id >>>>>> >>>>>> 2018-06-12T23:50:05,158 ERROR [b3041385-0290-492f-aef8-c0249de328ad >>>>>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory: >>>>>> Invalid column stats: No of nulls > cardinality >>>>>> 2018-06-12T23:50:05,159 ERROR [b3041385-0290-492f-aef8-c0249de328ad >>>>>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory: >>>>>> Invalid column stats: No of nulls > cardinality >>>>>> 2018-06-12T23:50:05,160 ERROR [b3041385-0290-492f-aef8-c0249de328ad >>>>>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory: >>>>>> Invalid column stats: No of nulls > cardinality >>>>>> >>>>>> However, even after computing column stats, queries still return >>>>>> wrong results, despite the fact that the above log messages disappear. >>>>>> >>>>>> I guess I am missing some configuration parameters (because I >>>>>> imported hive-site.xml from Hive 2). Any suggestion would be appreciated. >>>>>> >>>>>> Thanks a lot, >>>>>> >>>>>> --- Sungwoo Park >>>>>> >>>>>>