Thank you for sharing the result. (Does your result imply that HIVE-14187
is introducing an intended bug?)

Another issue that could be of your interest is the connection leak problem
reported in HIVE-20600. Do you see the connection leak problem, or is it
not relevant to your environment (e.g., because you don't use HiveServer2)?

--- Sungwoo

On Fri, Mar 1, 2024 at 9:45 AM Takanobu Asanuma <takanobu.asan...@gmail.com>
wrote:

> Hi Pau and Sungwoo,
>
> Thanks for sharing the information.
>
> We tested a set of simple queries which just referenced the Hive table and
> didn't execute any Hive jobs. The result is below.
>
> No. Version rawstore.impl connectionPoolingType HIVE-14187 QueryTime
> --------------------------------------------------
> 1   1.2.1   ObjectStore    None                Not Applied 11:38
> 2   3.1.3   ObjectStore    None                Applied     34:00
> 3   3.1.3   CachedStore    None                Applied     25:00
> 4   3.1.3   ObjectStore    HikariCP            Applied     21:10
> 5   3.1.3   CachedStore    HikariCP            Applied     14:30
> 6   3.1.3   ObjectStore    None                Reverted    13:00
> 7   3.1.3   ObjectStore    HikariCP            Reverted    11:23
> --------------------------------------------------
>
> Initially, we encountered an issue of Hive MetaStore slowness when we
> upgraded from environment No.1 to No.2. As shown in the table, environment
> No.2 showed the worst test results.
>
> A unique aspect of our environment is that we don't use connection
> pooling. After some investigation, we thought that the combination of
> HIVE-14187 and connectionPoolingType=None was negatively impacting
> performance.
> The fastest case in our tests was when we reverted HIVE-14187 and set
> connectionPoolingType=HikariCP (see No.7). Even with connectionPoolingType
> set to None, the environment where we reverted HIVE-14187 still performed
> reasonably well (see No.6).
>
> Please note our investigation is still ongoing and we haven't yet come to
> a conclusion.
>
> Regards,
> - Takanobu
>
> 2024年2月29日(木) 12:18 Sungwoo Park <glap...@gmail.com>:
>
>> We didn't make any other attempt to fix the problem and just decided not
>> to use CachedStore. However, I think our installation of Metastore based on
>> Hive 3.1.3 is running without any serious problems.
>>
>> Could you share how long it takes to compile typical queries in your
>> environment (with Hive 1 and with Hive 3)?
>>
>> FYI, in our environment, sometimes it takes about 10 seconds to compile a
>> query on TPC-DS 10TB datasets. Specifically, the average compilation time
>> of 103 queries is 1.7 seconds (as reported by Hive), and the longest
>> compilation time is 9.6 seconds (query 49). The compilation time includes
>> the time for accessing Metastore.
>>
>> Thanks,
>>
>> --- Sungwoo
>>
>>
>> On Wed, Feb 28, 2024 at 9:59 PM Takanobu Asanuma <tasan...@apache.org>
>> wrote:
>>
>>> Thanks for your detailed answer!
>>>
>>> In the original email, you reported "the query compilation takes long"
>>> in Hive 3.0, but has this issue been resolved in your fork of Hive 3.1.3?
>>> Thank you for sharing the issue with CachedStore and the JIRA tickets.
>>> I will also try out metastore.stats.fetch.bitvector=true.
>>>
>>> Regards,
>>> - Takanobu
>>>
>>> 2024年2月28日(水) 18:49 Sungwoo Park <glap...@gmail.com>:
>>>
>>>> Hello Takanobu,
>>>>
>>>> We did not test with vanilla Hive 3.1.3 and Metastore databases can be
>>>> different, so I don't know why Metastore responses are very slow. I can
>>>> only share some results of testing CachedStore in Metastore. Please note
>>>> that we did not use vanilla Hive 3.1.3 and instead used our own fork of
>>>> Hive 3.1.3 (which applies many additional patches).
>>>>
>>>> 1.
>>>> When CachedStore is enabled, column stats are not computed. As a
>>>> result, some queries generate very inefficient plans because of
>>>> wrong/inaccurate stats.
>>>>
>>>> Perhaps this is because not all patches for CachedStore have been
>>>> merged to Hive 3.1.3. For example, these patches are not merged. Or, there
>>>> might be some way to properly configure CachedStore so that it correctly
>>>> computes column stats.
>>>>
>>>> HIVE-20896: CachedStore fail to cache stats in multiple code paths
>>>> HIVE-21063: Support statistics in cachedStore for transactional table
>>>> HIVE-24258: Data mismatch between CachedStore and ObjectStore for
>>>> constraint
>>>>
>>>> So, we decided that CachedStore should not be enabled in Hive 3.1.3.
>>>>
>>>> (If anyone is running Hive Metastore 3.1.3 in production with
>>>> CachedStore enabled, please let us know how you configure it.)
>>>>
>>>> 2.
>>>> Setting metastore.stats.fetch.bitvector=true can also help generate
>>>> more efficient query plans.
>>>>
>>>> --- Sungwoo
>>>>
>>>>
>>>> On Wed, Feb 28, 2024 at 1:40 PM Takanobu Asanuma <tasan...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Sungwoo Park,
>>>>>
>>>>> I'm sorry for the late reply to this old email.
>>>>> We are attempting to upgrade Hive MetaStore from Hive1 to Hive3, and
>>>>> noticed that the response of the Hive3 MetaStore is very slow.
>>>>> We suspect that HIVE-14187 might be causing this slowness.
>>>>> Could you tell me if you have resolved this problem? Are there still
>>>>> any problems when you enable CachedStore?
>>>>>
>>>>> Regards,
>>>>> - Takanobu
>>>>>
>>>>> 2018年6月13日(水) 0:37 Sungwoo Park <glap...@gmail.com>:
>>>>>
>>>>>> Hello Hive users,
>>>>>>
>>>>>> I am experience a problem with MetaStore in Hive 3.0.
>>>>>>
>>>>>> 1. Start MetaStore
>>>>>> with 
>>>>>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.ObjectStore.
>>>>>>
>>>>>> 2. Generate TPC-DS data.
>>>>>>
>>>>>> 3. TPC-DS queries run okay and produce correct results. E.g., from
>>>>>> query 1:
>>>>>> +-------------------+
>>>>>> |   c_customer_id   |
>>>>>> +-------------------+
>>>>>> | AAAAAAAAAAAACHAA  |
>>>>>> | AAAAAAAAAAAADCAA  |
>>>>>> | AAAAAAAAAAAADDAA  |
>>>>>> ...
>>>>>> | AAAAAAAAAAAILIAA  |
>>>>>> +-------------------+
>>>>>> 100 rows selected (69.901 seconds)
>>>>>>
>>>>>> However, the query compilation takes long (
>>>>>> https://issues.apache.org/jira/browse/HIVE-16520).
>>>>>>
>>>>>> 4. Now, restart MetaStore with
>>>>>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.cache.CachedStore.
>>>>>>
>>>>>> 5. TPC-DS queries run okay, but produce wrong results. E.g, from
>>>>>> query 1:
>>>>>> +----------------+
>>>>>> | c_customer_id  |
>>>>>> +----------------+
>>>>>> +----------------+
>>>>>> No rows selected (37.448 seconds)
>>>>>>
>>>>>> What I noticed is that with hive.metastore.rawstore.impl=CachedStore,
>>>>>> HiveServer2 produces such log messages:
>>>>>>
>>>>>> 2018-06-12T23:50:04,223  WARN [b3041385-0290-492f-aef8-c0249de328ad
>>>>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats 
>>>>>> for
>>>>>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
>>>>>> 2018-06-12T23:50:04,223  INFO [b3041385-0290-492f-aef8-c0249de328ad
>>>>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>>>>>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
>>>>>> 2018-06-12T23:50:04,225  WARN [b3041385-0290-492f-aef8-c0249de328ad
>>>>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats 
>>>>>> for
>>>>>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
>>>>>> 2018-06-12T23:50:04,225  INFO [b3041385-0290-492f-aef8-c0249de328ad
>>>>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>>>>>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
>>>>>> 2018-06-12T23:50:04,226  WARN [b3041385-0290-492f-aef8-c0249de328ad
>>>>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats 
>>>>>> for
>>>>>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
>>>>>> c_customer_id
>>>>>> 2018-06-12T23:50:04,226  INFO [b3041385-0290-492f-aef8-c0249de328ad
>>>>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>>>>>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
>>>>>> c_customer_id
>>>>>>
>>>>>> 2018-06-12T23:50:05,158 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>>>>>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>>>>>> Invalid column stats: No of nulls > cardinality
>>>>>> 2018-06-12T23:50:05,159 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>>>>>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>>>>>> Invalid column stats: No of nulls > cardinality
>>>>>> 2018-06-12T23:50:05,160 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>>>>>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>>>>>> Invalid column stats: No of nulls > cardinality
>>>>>>
>>>>>> However, even after computing column stats, queries still return
>>>>>> wrong results, despite the fact that the above log messages disappear.
>>>>>>
>>>>>> I guess I am missing some configuration parameters (because I
>>>>>> imported hive-site.xml from Hive 2). Any suggestion would be appreciated.
>>>>>>
>>>>>> Thanks a lot,
>>>>>>
>>>>>> --- Sungwoo Park
>>>>>>
>>>>>>

Reply via email to