Chaozhong Yang created HIVE-16972:
-------------------------------------

             Summary: FetchOperator: filter out inputSplits which length is zero
                 Key: HIVE-16972
                 URL: https://issues.apache.org/jira/browse/HIVE-16972
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2, Physical Optimizer, Query Planning
    Affects Versions: 2.1.1, 2.1.0
            Reporter: Chaozhong Yang
            Assignee: Chaozhong Yang
             Fix For: 2.1.2


* Background
   The basic workflow of a common HQL query is as follows:
  1. compile and execute
  2. fetch results
  In many cases, we don't need to worry about fetching results from 
HDFS (which happens only if MapReduce jobs are generated in the planning 
step). However, the number of result files on HDFS and the distribution of 
data across them affect the final status of an HQL query, especially for 
HiveServer2. We have some map-only queries, e.g.: 
{code:sql}
select * from myTable where date > '20170201' and date <= '20170301' and id = 88;
{code}
    This query generates more than 10,000 files on HDFS, and most of those 
files are empty, so the result rows are spread very sparsely across them. If 
we send a TFetchResultsRequest from the HiveServer2 client with some 
parameters (timeout: 90s, maxRows: 1024), FetchOperator cannot fetch 1024 rows 
in 90 seconds and our HiveServer2 client marks this TFetchResultsRequest as a 
timed-out failure. Why? Fetching results from an empty file is expensive. In 
our HDFS cluster (5000+ DataNodes), reading data from an empty file costs 
almost 100 ms, so reading just 1000 empty files takes about 100 ms * 1000 = 
100 s, which already exceeds the 90 s timeout. We can therefore filter out 
those empty files or splits to speed up FetchResults. 
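
    The filtering idea can be sketched roughly as follows. This is only an 
illustration, not the actual FetchOperator change: Split and FixedSplit are 
hypothetical stand-ins for Hadoop's InputSplit (which exposes getLength()), 
and dropEmptySplits is an assumed helper name.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmptySplitFilter {

    /** Hypothetical stand-in for an input split; only its length is modelled. */
    interface Split {
        long getLength();
    }

    /** Hypothetical split with a fixed length in bytes, for demonstration only. */
    static class FixedSplit implements Split {
        private final long length;

        FixedSplit(long length) {
            this.length = length;
        }

        @Override
        public long getLength() {
            return length;
        }
    }

    /** Returns only the splits whose length is greater than zero, in order. */
    static <T extends Split> List<T> dropEmptySplits(List<T> splits) {
        List<T> nonEmpty = new ArrayList<>();
        for (T split : splits) {
            if (split.getLength() > 0) {
                nonEmpty.add(split);
            }
        }
        return nonEmpty;
    }

    public static void main(String[] args) {
        List<FixedSplit> splits = Arrays.asList(
                new FixedSplit(0), new FixedSplit(128),
                new FixedSplit(0), new FixedSplit(64));
        List<FixedSplit> kept = dropEmptySplits(splits);
        // Prints "2 of 4 splits kept"
        System.out.println(kept.size() + " of " + splits.size() + " splits kept");
    }
}
{code}
    With a filter like this applied before any reader is opened, the cost of 
roughly 100 ms per empty file described above would no longer be paid.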




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
