BELUGA BEHR created HIVE-17004:
----------------------------------
Summary: Calculating Number Of Reducers Looks At All Files
Key: HIVE-17004
URL: https://issues.apache.org/jira/browse/HIVE-17004
Project: Hive
Issue Type: Improvement
Components: Hive
Affects Versions: 2.1.1
Reporter: BELUGA BEHR
When calculating the number of Mappers and Reducers, the two algorithms look at different data sets. The number of Mappers is calculated from the number of input splits, while the number of Reducers is estimated from the total size of the files under the table's HDFS directory, recursively. As a result, if I add files to a sub-directory of the HDFS directory, the number of splits remains the same, since I did not tell Hive to search recursively, but the number of Reducers increases. Please improve this so that the Reducer estimate considers the same files that are considered for splits, and not files within sub-directories (unless configured to do so).
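The divergence can be modeled outside Hive. Below is a minimal Python sketch (the function names, the temp-directory layout, and the 64 MB {{hive.exec.reducers.bytes.per.reducer}} value are my assumptions, not Hive code): the Mapper side lists only the directory's immediate files, while the Reducer estimate sums file sizes recursively, ContentSummary-style.

```python
import math
import os
import tempfile

# Assumption: 64 MB, a common default for hive.exec.reducers.bytes.per.reducer
BYTES_PER_REDUCER = 64 * 1024 * 1024

def split_input_size(table_dir):
    """Mapper side (sketch): non-recursive listing of files directly under the directory."""
    return sum(os.path.getsize(os.path.join(table_dir, f))
               for f in os.listdir(table_dir)
               if os.path.isfile(os.path.join(table_dir, f)))

def content_summary_size(table_dir):
    """Reducer side (sketch): recursive total, sub-directories included."""
    return sum(os.path.getsize(os.path.join(root, f))
               for root, _dirs, files in os.walk(table_dir)
               for f in files)

def estimated_reducers(total_size):
    """ceil(totalSize / bytesPerReducer), floored at 1."""
    return max(1, math.ceil(total_size / BYTES_PER_REDUCER))

# Two top-level files plus one file in a sub-directory 't':
tmp = tempfile.mkdtemp()
for name in ("a.csv", "b.csv"):
    with open(os.path.join(tmp, name), "wb") as f:
        f.write(b"x" * 10)
os.mkdir(os.path.join(tmp, "t"))
with open(os.path.join(tmp, "t", "c.csv"), "wb") as f:
    f.write(b"x" * 10)

print(split_input_size(tmp))      # 20 -- the sub-directory is invisible to splits
print(content_summary_size(tmp))  # 30 -- but it inflates the reducer estimate's input
```

Adding files under {{t/}} grows only the second number, which is the shape of the bug reproduced below.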
{code}
CREATE EXTERNAL TABLE Complaints (
a string,
b string,
c string,
d string,
e string,
f string,
g string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/admin/complaints';
{code}
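For completeness: Hive can be told to pick up sub-directories on the split side as well, which makes both calculations see the same files. These settings are deliberately left off in the reproduction below.

{code}
-- Optional: make splits descend into sub-directories (not used in this reproduction)
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
{code}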
{code}
[root@host ~]# sudo -u hdfs hdfs dfs -ls -R /user/admin/complaints
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.1.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.2.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.3.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.4.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.5.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.csv
{code}
{code}
INFO : Compiling command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae): select a, count(1) from complaints group by a limit 10
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a, type:string, comment:null), FieldSchema(name:_c1, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae); Time taken: 0.077 seconds
INFO : Executing command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae): select a, count(1) from complaints group by a limit 10
INFO : Query ID = hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks not specified. Estimated from input data size: 11
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:2
INFO : Submitting tokens for job: job_1493729203063_0003
INFO : The url to track the job: http://host:8088/proxy/application_1493729203063_0003/
INFO : Starting Job = job_1493729203063_0003, Tracking URL = http://host:8088/proxy/application_1493729203063_0003/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/bin/hadoop job -kill job_1493729203063_0003
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 11
INFO : 2017-05-02 14:20:14,206 Stage-1 map = 0%, reduce = 0%
INFO : 2017-05-02 14:20:22,520 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.48 sec
INFO : 2017-05-02 14:20:34,029 Stage-1 map = 100%, reduce = 27%, Cumulative CPU 15.72 sec
INFO : 2017-05-02 14:20:35,069 Stage-1 map = 100%, reduce = 55%, Cumulative CPU 21.94 sec
INFO : 2017-05-02 14:20:36,110 Stage-1 map = 100%, reduce = 64%, Cumulative CPU 23.97 sec
INFO : 2017-05-02 14:20:39,233 Stage-1 map = 100%, reduce = 73%, Cumulative CPU 25.26 sec
INFO : 2017-05-02 14:20:43,392 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 30.9 sec
INFO : MapReduce Total cumulative CPU time: 30 seconds 900 msec
INFO : Ended Job = job_1493729203063_0003
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 11 Cumulative CPU: 30.9 sec HDFS Read: 735691149 HDFS Write: 153 SUCCESS
INFO : Total MapReduce CPU Time Spent: 30 seconds 900 msec
INFO : Completed executing command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae); Time taken: 36.035 seconds
INFO : OK
{code}
{code}
[root@host ~]# sudo -u hdfs hdfs dfs -ls -R /user/admin/complaints
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.1.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.2.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.3.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.4.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.5.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.csv
drwxr-xr-x - admin admin 0 2017-05-02 14:16 /user/admin/complaints/t
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.1.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.2.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.3.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.4.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.5.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.csv
{code}
{code}
INFO : Compiling command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e): select a, count(1) from complaints group by a limit 10
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a, type:string, comment:null), FieldSchema(name:_c1, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e); Time taken: 0.073 seconds
INFO : Executing command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e): select a, count(1) from complaints group by a limit 10
INFO : Query ID = hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks not specified. Estimated from input data size: 22
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:2
INFO : Submitting tokens for job: job_1493729203063_0004
INFO : The url to track the job: http://host:8088/proxy/application_1493729203063_0004/
INFO : Starting Job = job_1493729203063_0004, Tracking URL = http://host:8088/proxy/application_1493729203063_0004/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/bin/hadoop job -kill job_1493729203063_0004
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 22
INFO : 2017-05-02 14:29:27,464 Stage-1 map = 0%, reduce = 0%
INFO : 2017-05-02 14:29:36,829 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.2 sec
INFO : 2017-05-02 14:29:47,287 Stage-1 map = 100%, reduce = 14%, Cumulative CPU 15.36 sec
INFO : 2017-05-02 14:29:49,381 Stage-1 map = 100%, reduce = 27%, Cumulative CPU 20.76 sec
INFO : 2017-05-02 14:29:50,433 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 22.69 sec
INFO : 2017-05-02 14:29:56,743 Stage-1 map = 100%, reduce = 45%, Cumulative CPU 27.73 sec
INFO : 2017-05-02 14:30:00,916 Stage-1 map = 100%, reduce = 64%, Cumulative CPU 34.95 sec
INFO : 2017-05-02 14:30:06,142 Stage-1 map = 100%, reduce = 77%, Cumulative CPU 41.49 sec
INFO : 2017-05-02 14:30:10,297 Stage-1 map = 100%, reduce = 82%, Cumulative CPU 42.92 sec
INFO : 2017-05-02 14:30:11,334 Stage-1 map = 100%, reduce = 86%, Cumulative CPU 45.24 sec
INFO : 2017-05-02 14:30:12,365 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.33 sec
INFO : MapReduce Total cumulative CPU time: 50 seconds 330 msec
INFO : Ended Job = job_1493729203063_0004
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 22 Cumulative CPU: 50.33 sec HDFS Read: 735731640 HDFS Write: 153 SUCCESS
INFO : Total MapReduce CPU Time Spent: 50 seconds 330 msec
INFO : Completed executing command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e); Time taken: 51.841 seconds
INFO : OK
{code}
https://github.com/apache/hive/blob/bc510f63de9d6baab3a5ad8a4bf4eed9c6fde8b1/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2959
The number of splits (Mappers) stays the same between the two runs, while the number of Reducers doubles:
*INFO : number of splits:2*
# Number of reduce tasks not specified. Estimated from input data size: 11
# Number of reduce tasks not specified. Estimated from input data size: 22
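Both estimates are consistent with a ceiling division of the recursive input size by the per-reducer byte target. Assuming {{hive.exec.reducers.bytes.per.reducer}} was at the 64 MB value common on CDH clusters (an assumption; the logs do not print it), the file sizes in the listings above reproduce both numbers exactly:

```python
FILE_SIZE = 122_607_137          # each Consumer_Complaints CSV, from the listings above
BYTES_PER_REDUCER = 67_108_864   # 64 MB; assumed hive.exec.reducers.bytes.per.reducer

def estimate_reducers(total_input_size):
    # ceil(totalSize / bytesPerReducer) via integer arithmetic
    return int((total_input_size + BYTES_PER_REDUCER - 1) // BYTES_PER_REDUCER)

print(estimate_reducers(6 * FILE_SIZE))   # 11 -- first run: 6 top-level files
print(estimate_reducers(12 * FILE_SIZE))  # 22 -- second run: sub-directory doubles the bytes
```

The split count stays at 2 in both runs because split computation never descends into {{/user/admin/complaints/t}}, while the size total used for the Reducer estimate does.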
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)