[ https://issues.apache.org/jira/browse/HIVE-17004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bing Li reassigned HIVE-17004:
------------------------------

    Assignee: Bing Li

Calculating Number Of Reducers Looks At All Files
-------------------------------------------------

                Key: HIVE-17004
                URL: https://issues.apache.org/jira/browse/HIVE-17004
            Project: Hive
         Issue Type: Improvement
         Components: Hive
   Affects Versions: 2.1.1
           Reporter: BELUGA BEHR
           Assignee: Bing Li

When calculating the number of Mappers and Reducers, the two algorithms look at different data sets. The number of Mappers is based on the number of input splits, while the number of Reducers is estimated from the total size of the files under the table's HDFS directory, including sub-directories. As a result, if I add files to a sub-directory of that directory, the number of splits remains the same (since I did not tell Hive to search recursively), yet the number of Reducers increases. Please improve this so that the Reducer estimate looks at the same files that are considered for splits, and not at files within sub-directories (unless configured to do so).
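The core of the mismatch (a non-recursive file listing for splits versus a recursive size scan for the Reducer estimate) can be illustrated outside Hive with plain java.nio. This is a hypothetical sketch of the two listing behaviors, not Hive's actual code:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class ListingMismatch {
    // Non-recursive sum, analogous to the default split calculation:
    // only files directly under the directory are counted.
    static long topLevelBytes(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.filter(Files::isRegularFile).mapToLong(ListingMismatch::size).sum();
        }
    }

    // Recursive sum, analogous to the Reducer estimate:
    // files in sub-directories are included as well.
    static long recursiveBytes(Path dir) throws IOException {
        try (Stream<Path> s = Files.walk(dir)) {
            return s.filter(Files::isRegularFile).mapToLong(ListingMismatch::size).sum();
        }
    }

    static long size(Path p) {
        try { return Files.size(p); } catch (IOException e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("complaints");
        Files.write(dir.resolve("a.csv"), new byte[100]);   // top-level file
        Path sub = Files.createDirectory(dir.resolve("t"));
        Files.write(sub.resolve("b.csv"), new byte[100]);   // file in sub-directory
        System.out.println(topLevelBytes(dir));   // prints 100: sub-directory ignored
        System.out.println(recursiveBytes(dir));  // prints 200: sub-directory included
    }
}
```

Adding a file under `t/` changes only the recursive sum, which is exactly the asymmetry the runs below demonstrate.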
{code}
CREATE EXTERNAL TABLE Complaints (
  a string,
  b string,
  c string,
  d string,
  e string,
  f string,
  g string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/admin/complaints';
{code}

{code}
[root@host ~]# sudo -u hdfs hdfs dfs -ls -R /user/admin/complaints
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.1.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.2.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.3.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.4.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.5.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.csv
{code}

{code}
INFO : Compiling command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae): select a, count(1) from complaints group by a limit 10
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a, type:string, comment:null), FieldSchema(name:_c1, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae); Time taken: 0.077 seconds
INFO : Executing command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae): select a, count(1) from complaints group by a limit 10
INFO : Query ID = hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks not specified. Estimated from input data size: 11
INFO : In order to change the average load for a reducer (in bytes):
INFO :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO :   set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO :   set mapreduce.job.reduces=<number>
INFO : number of splits:2
INFO : Submitting tokens for job: job_1493729203063_0003
INFO : The url to track the job: http://host:8088/proxy/application_1493729203063_0003/
INFO : Starting Job = job_1493729203063_0003, Tracking URL = http://host:8088/proxy/application_1493729203063_0003/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/bin/hadoop job -kill job_1493729203063_0003
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 11
INFO : 2017-05-02 14:20:14,206 Stage-1 map = 0%, reduce = 0%
INFO : 2017-05-02 14:20:22,520 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.48 sec
INFO : 2017-05-02 14:20:34,029 Stage-1 map = 100%, reduce = 27%, Cumulative CPU 15.72 sec
INFO : 2017-05-02 14:20:35,069 Stage-1 map = 100%, reduce = 55%, Cumulative CPU 21.94 sec
INFO : 2017-05-02 14:20:36,110 Stage-1 map = 100%, reduce = 64%, Cumulative CPU 23.97 sec
INFO : 2017-05-02 14:20:39,233 Stage-1 map = 100%, reduce = 73%, Cumulative CPU 25.26 sec
INFO : 2017-05-02 14:20:43,392 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 30.9 sec
INFO : MapReduce Total cumulative CPU time: 30 seconds 900 msec
INFO : Ended Job = job_1493729203063_0003
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2  Reduce: 11   Cumulative CPU: 30.9 sec   HDFS Read: 735691149 HDFS Write: 153 SUCCESS
INFO : Total MapReduce CPU Time Spent: 30 seconds 900 msec
INFO : Completed executing command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae); Time taken: 36.035 seconds
INFO : OK
{code}
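The estimate of 11 Reducers above is consistent with a simple size-based formula, roughly ceil(totalInputBytes / hive.exec.reducers.bytes.per.reducer), capped at hive.exec.reducers.max. A minimal sketch of that arithmetic, assuming this cluster uses a 64 MB (67108864 bytes) bytes-per-reducer setting and the default cap of 1009 (both assumptions; neither value is printed in the log):

```java
public class ReducerEstimate {
    // Simplified version of the size-based estimate: ceiling division of the
    // total input bytes by the per-reducer target, clamped to [1, maxReducers].
    static int estimateReducers(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        long reducers = (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer;
        return (int) Math.max(1, Math.min(maxReducers, reducers));
    }

    public static void main(String[] args) {
        long bytesPerReducer = 67_108_864L;      // assumed hive.exec.reducers.bytes.per.reducer (64 MB)
        long sixFiles = 6L * 122_607_137L;       // the six top-level CSV files in the listing above
        System.out.println(estimateReducers(sixFiles, bytesPerReducer, 1009));     // prints 11, matching run 1
        System.out.println(estimateReducers(2 * sixFiles, bytesPerReducer, 1009)); // prints 22, matching run 2
    }
}
```

Doubling the bytes the estimate sees (the recursive scan picking up the sub-directory copies) doubles the estimate, which is what the second run below shows.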
{code}
[root@host ~]# sudo -u hdfs hdfs dfs -ls -R /user/admin/complaints
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.1.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.2.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.3.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.4.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.5.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.csv
drwxr-xr-x   - admin admin          0 2017-05-02 14:16 /user/admin/complaints/t
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.1.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.2.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.3.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.4.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.5.csv
-rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.csv
{code}

{code}
INFO : Compiling command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e): select a, count(1) from complaints group by a limit 10
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a, type:string, comment:null), FieldSchema(name:_c1, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e); Time taken: 0.073 seconds
INFO : Executing command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e): select a, count(1) from complaints group by a limit 10
INFO : Query ID = hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks not specified. Estimated from input data size: 22
INFO : In order to change the average load for a reducer (in bytes):
INFO :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO :   set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO :   set mapreduce.job.reduces=<number>
INFO : number of splits:2
INFO : Submitting tokens for job: job_1493729203063_0004
INFO : The url to track the job: http://host:8088/proxy/application_1493729203063_0004/
INFO : Starting Job = job_1493729203063_0004, Tracking URL = http://host:8088/proxy/application_1493729203063_0004/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/bin/hadoop job -kill job_1493729203063_0004
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 22
INFO : 2017-05-02 14:29:27,464 Stage-1 map = 0%, reduce = 0%
INFO : 2017-05-02 14:29:36,829 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.2 sec
INFO : 2017-05-02 14:29:47,287 Stage-1 map = 100%, reduce = 14%, Cumulative CPU 15.36 sec
INFO : 2017-05-02 14:29:49,381 Stage-1 map = 100%, reduce = 27%, Cumulative CPU 20.76 sec
INFO : 2017-05-02 14:29:50,433 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 22.69 sec
INFO : 2017-05-02 14:29:56,743 Stage-1 map = 100%, reduce = 45%, Cumulative CPU 27.73 sec
INFO : 2017-05-02 14:30:00,916 Stage-1 map = 100%, reduce = 64%, Cumulative CPU 34.95 sec
INFO : 2017-05-02 14:30:06,142 Stage-1 map = 100%, reduce = 77%, Cumulative CPU 41.49 sec
INFO : 2017-05-02 14:30:10,297 Stage-1 map = 100%, reduce = 82%, Cumulative CPU 42.92 sec
INFO : 2017-05-02 14:30:11,334 Stage-1 map = 100%, reduce = 86%, Cumulative CPU 45.24 sec
INFO : 2017-05-02 14:30:12,365 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.33 sec
INFO : MapReduce Total cumulative CPU time: 50 seconds 330 msec
INFO : Ended Job = job_1493729203063_0004
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2  Reduce: 22   Cumulative CPU: 50.33 sec   HDFS Read: 735731640 HDFS Write: 153 SUCCESS
INFO : Total MapReduce CPU Time Spent: 50 seconds 330 msec
INFO : Completed executing command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e); Time taken: 51.841 seconds
INFO : OK
{code}

https://github.com/apache/hive/blob/bc510f63de9d6baab3a5ad8a4bf4eed9c6fde8b1/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2959

The number of splits (Mappers) stays the same between the two runs, while the number of Reducers doubles:

*INFO : number of splits:2*
# Number of reduce tasks not specified. Estimated from input data size: 11
# Number of reduce tasks not specified. Estimated from input data size: 22

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)