[ 
https://issues.apache.org/jira/browse/HIVE-19847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562408#comment-16562408
 ] 

BELUGA BEHR commented on HIVE-19847:
------------------------------------

I think we need to look at this more holistically.

There are four different instances of {{Executors.newFixedThreadPool}} in 
{{Hive.java}}.

I think we need one global thread pool that all these file-system threads can 
live in so that Hive administrators can control the total number of threads 
in use.  Right now, the code seems to indicate that each thread pool can have 
up to 25 threads.  With 40 simultaneous connections, in the worst case we 
could be using 40 x 25 = 1,000 threads just for moving files.  That is too 
many threads and creates a lot of thread churn.  Hive would benefit from 
re-using these threads instead of creating and destroying them for each 
invocation of a file operation.
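
To make the idea concrete, here is a minimal sketch of a single process-wide 
pool.  It is not code from {{Hive.java}}: the class name, the {{runAll}} 
helper, and the cap of 25 are assumptions for illustration, and a real 
change would read the cap from a Hive configuration property.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: one process-wide pool for file-system tasks. */
final class SharedFileOpPool {
    // Assumed cap; in practice this would come from an administrator-facing setting.
    static final int MAX_THREADS = 25;

    private static final AtomicInteger THREAD_ID = new AtomicInteger();

    // Created once and reused by every caller, so threads are not rebuilt per operation.
    // Daemon threads are used so the pool does not keep the JVM alive on exit.
    private static final ExecutorService POOL =
        Executors.newFixedThreadPool(MAX_THREADS, r -> {
            Thread t = new Thread(r, "shared-file-op-" + THREAD_ID.incrementAndGet());
            t.setDaemon(true);
            return t;
        });

    private SharedFileOpPool() {}

    /** Runs all tasks on the shared pool and waits for every result. */
    static <T> List<T> runAll(List<Callable<T>> tasks) throws Exception {
        List<T> results = new ArrayList<>();
        for (Future<T> f : POOL.invokeAll(tasks)) {
            results.add(f.get());
        }
        return results;
    }

    /** Worst-case thread count when every connection builds its own pool. */
    static int worstCasePerConnectionThreads(int connections, int threadsPerPool) {
        return connections * threadsPerPool;
    }
}
```

With this shape the total never exceeds {{MAX_THREADS}} regardless of how 
many connections submit work, whereas per-connection pools scale as 
connections x threads per pool.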


https://github.com/apache/hive/blob/1203ee834d709d4710fd6a41daaeb6da48c4d8f6/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java

> Create Separate getInputSummary Service
> ---------------------------------------
>
>                 Key: HIVE-19847
>                 URL: https://issues.apache.org/jira/browse/HIVE-19847
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>    Affects Versions: 3.0.0, 4.0.0
>            Reporter: BELUGA BEHR
>            Assignee: BELUGA BEHR
>            Priority: Major
>         Attachments: HIVE-19847.1.patch, HIVE-19847.2.patch
>
>
> The Hive {{org.apache.hadoop.hive.ql.exec.Utilities.java}} file has taken on 
> a life of its own.  We should consider separating out the various components 
> into their own classes.  For this ticket, I propose separating out the 
> {{getInputSummary}} functionality into its own class.
> There are several issues with the current implementation:
> # It is 
> [synchronized|https://github.com/apache/hive/blob/f27c38ff55902827499192a4f8cf8ed37d6fd967/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2383].
>   Only one query can get file input summary at a time.  For a query which 
> deals with a large data set with a large number of files, this can block 
> other queries for a long period of time.  This is especially painful when 
> most queries use a small data set, but a large data set is submitted on 
> occasion.
> # For each query, time is spent setting up and tearing down a ThreadPool
> # It uses deprecated code
> I propose breaking it out into its own class and creating a single thread 
> pool that all queries pull from.  In this way, the bottleneck will be on 
> the number of available threads, not on a single query.  If a big query is 
> running and a small query is also submitted, the smaller query will be able 
> to proceed.
> With regard to setup/teardown: if a query uses 15 threads to perform this 
> summary action and then finishes, it will tear down the threads, and the 
> next query may immediately create 15 new threads for processing.  With a 
> single pool, those threads are never torn down and set up again.
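
The proposal quoted above could look roughly like the sketch below.  It is 
only an illustration, not the attached patch: {{InputSummaryService}} and 
{{SizeLookup}} are hypothetical names, and {{SizeLookup}} stands in for 
whatever file-system call returns a path's size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: a summary service with one shared pool and no global lock. */
final class InputSummaryService {
    /** Stand-in for a file-system lookup; not a Hive or Hadoop interface. */
    interface SizeLookup {
        long sizeOf(String path) throws Exception;
    }

    private final ExecutorService pool;
    private final SizeLookup lookup;

    InputSummaryService(int threads, SizeLookup lookup) {
        // One pool for the lifetime of the service, shared by all queries.
        this.pool = Executors.newFixedThreadPool(threads, r -> {
            Thread t = new Thread(r, "input-summary");
            t.setDaemon(true);
            return t;
        });
        this.lookup = lookup;
    }

    /** Not synchronized: tasks from concurrent queries interleave on the pool. */
    long totalSize(List<String> paths) throws Exception {
        List<Future<Long>> pending = new ArrayList<>();
        for (String p : paths) {
            Callable<Long> task = () -> lookup.sizeOf(p);
            pending.add(pool.submit(task));
        }
        long total = 0;
        for (Future<Long> f : pending) {
            total += f.get();
        }
        return total;
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

Because callers only queue tasks and wait on their own futures, a small 
query's tasks can run between those of a large query instead of waiting for 
a method-level lock to be released.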



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
