[ https://issues.apache.org/jira/browse/HIVE-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546625#comment-14546625 ]
Raj Bains commented on HIVE-10725:
----------------------------------

Please consider the following when looking at the resource design.

1. Yes, the end goal is to prevent HS2 from becoming unresponsive or going OOM, but we should add more context to that target. For example, are we targeting 100 concurrent sessions of a particular query on a box with 8 cores and 64 GB of RAM, load balanced across two HS2 instances? A concrete target like that gives us something to measure against and helps test our resource management. Operational databases can handle thousands of queries per second (though their queries are simpler); HS2 should not be the bottleneck.

2. Admission control should be simple, but richer than just the number of queries. It should also be expressed in terms a database user understands: the number of threads is a bad metric to expose, because it is not stated in terms of the user's workload, so exposing it just punts our engineering work to the end user. As a concrete case, suppose I set up an HS2 instance for 100 short concurrent queries and someone then submits a 5-fact-table join - what should happen? Admission control should allow rules/limits on the number of queries AND on query attributes; the number of tables accessed or the number of joins might be a good place to start. Admission control also definitely needs an API, on the assumption that an external workload management module/UI will let the user configure it along with the other aspects of workload management. (A sketch of what such rules could look like follows the quoted description below.)

3. We will have to implement parallel compilation at some point, so we should have guardrails around the allowed parallelism and make sure resource management still works then. Is the lifetime of a query modeled as a pipeline with stages? Can we know how many resources a query consumes in each stage? Should we allow limits on how many queries may be in each stage at once? For example, you might allow 100 threads overall but only 5 queries running CBO at any one time, with another 50 waiting on a stats fetch from a remote server or on query execution to finish. (See the per-stage sketch below.)

4. We will have caches inside HS2 - a statistics cache, a query compile cache - how will their resource utilization be handled?

5. What is the performance degradation model? A good model is: admit queries until we reach maximum throughput; beyond that, queue them, which increases latency but does not reduce throughput, so we degrade gracefully up to a point; after a certain amount of latency increase, reject further incoming queries rather than growing the queue. Is the architecture set up to reason in such terms? I haven't looked at the code. (A minimal admit/queue/reject sketch is included below.)

> Better resource management in HiveServer2
> -----------------------------------------
>
>                 Key: HIVE-10725
>                 URL: https://issues.apache.org/jira/browse/HIVE-10725
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2, JDBC
>    Affects Versions: 1.3.0
>            Reporter: Vaibhav Gumashta
>
> We have various ways to control the number of queries that can be run on one
> HS2 instance (max threads, thread pool queuing etc). We also have ways to run
> multiple HS2 instances using dynamic service discovery. We should do a better
> job at:
> 1. Monitoring resource utilization (sessions, ophandles, memory, threads etc).
> 2. Being upfront to the client when we cannot accept new queries.
> 3. Throttle among different server instances in case dynamic service discovery is used.
> 4. Consolidate existing ways to control #queries into a simpler model.
> 5. See if we can recommend reasonable values for OS resources or provide alerts if we run out of those.
> 6. Health reports, server status API (to get number of queries, sessions etc).
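To make the admission-control idea in point 2 concrete, here is a minimal, purely illustrative sketch; AdmissionController, Rule, and QueryAttributes are hypothetical names, not existing HiveServer2 classes. It expresses admission as a set of rules over the running query count and user-visible query attributes (tables accessed, join count), which is the kind of thing an external workload management module/UI could configure through an API:

{code:java}
// Hypothetical sketch only: none of these classes exist in HiveServer2 today.
// It illustrates admission rules expressed over user-visible query attributes
// (running query count, tables accessed, join count) rather than thread counts.
import java.util.ArrayList;
import java.util.List;

public class AdmissionController {

  /** Attributes the planner can cheaply extract before full compilation. */
  public static class QueryAttributes {
    public final int tablesAccessed;
    public final int joinCount;
    public QueryAttributes(int tablesAccessed, int joinCount) {
      this.tablesAccessed = tablesAccessed;
      this.joinCount = joinCount;
    }
  }

  /** A single admission rule; an external WLM UI would create these via an API. */
  public interface Rule {
    boolean admits(int runningQueries, QueryAttributes attrs);
  }

  private final List<Rule> rules = new ArrayList<>();

  public void addRule(Rule rule) {
    rules.add(rule);
  }

  /** Returns true only if every configured rule admits the query. */
  public boolean admit(int runningQueries, QueryAttributes attrs) {
    for (Rule rule : rules) {
      if (!rule.admits(runningQueries, attrs)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    AdmissionController ac = new AdmissionController();
    // Limit on concurrent queries.
    ac.addRule((running, attrs) -> running < 100);
    // Limit on query shape: only admit wide joins while the server is lightly loaded.
    ac.addRule((running, attrs) -> attrs.joinCount <= 4 || running < 10);

    System.out.println(ac.admit(50, new QueryAttributes(2, 1)));  // true
    System.out.println(ac.admit(50, new QueryAttributes(6, 5)));  // false
  }
}
{code}

The point of the Rule interface is that limits stay in workload terms (queries, joins, tables) rather than in engine terms such as thread counts.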
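For point 3, a rough sketch of treating a query's lifetime as a pipeline of stages, each with its own limit. The StageGate class and the 100/5 limits are assumptions for illustration only; nothing like this exists in HS2 today:

{code:java}
// Hypothetical sketch only. It shows one way to reason about a query's lifetime as
// a pipeline of stages with per-stage limits: e.g. 100 admitted queries overall,
// but at most 5 inside CBO at any one time, while the rest wait on stats fetches
// or on execution to finish.
import java.util.concurrent.Semaphore;

public class StagedQueryPipeline {

  /** A bounded stage; enter() blocks callers beyond the per-stage limit. */
  static class StageGate {
    private final Semaphore permits;
    StageGate(int limit) {
      this.permits = new Semaphore(limit, true);
    }
    void enter() throws InterruptedException { permits.acquire(); }
    void exit() { permits.release(); }
  }

  private final StageGate admission = new StageGate(100);  // overall concurrency
  private final StageGate cbo = new StageGate(5);          // optimizer stage only

  /** Runs one query through the pipeline, honoring each stage's limit. */
  public void run(Runnable compileWithCbo, Runnable execute) throws InterruptedException {
    admission.enter();
    try {
      cbo.enter();          // at most 5 queries optimizing concurrently
      try {
        compileWithCbo.run();
      } finally {
        cbo.exit();
      }
      execute.run();        // execution is bounded only by the admission gate here
    } finally {
      admission.exit();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    StagedQueryPipeline pipeline = new StagedQueryPipeline();
    pipeline.run(
        () -> System.out.println("compiling with CBO"),
        () -> System.out.println("executing"));
  }
}
{code}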
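For the degradation model in point 5, a small self-contained sketch using plain java.util.concurrent; the pool and queue sizes are assumed numbers, not recommendations. It admits work up to a fixed concurrency, queues a bounded amount beyond that (latency grows, throughput does not drop), and rejects once the queue is full, which is where HS2 would return an explicit "server busy" error to the client:

{code:java}
// Hypothetical sketch of the admit -> queue -> reject degradation model.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class GracefulDegradationDemo {
  public static void main(String[] args) {
    int maxConcurrent = 8;      // roughly where throughput saturates (assumed)
    int maxQueued = 32;         // beyond this, the added latency is no longer acceptable

    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        maxConcurrent, maxConcurrent,
        0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(maxQueued),
        new ThreadPoolExecutor.AbortPolicy());  // reject instead of growing the queue

    for (int i = 0; i < 100; i++) {
      final int queryId = i;
      try {
        pool.execute(() -> simulateQuery(queryId));
      } catch (RejectedExecutionException e) {
        // In HS2 this is the point to fail fast with a clear "server busy" error.
        System.out.println("Rejected query " + queryId);
      }
    }
    pool.shutdown();
  }

  private static void simulateQuery(int queryId) {
    try {
      Thread.sleep(100);   // stand-in for query compile/execute work
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}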