I added “set hive.execution.engine=mr;” at top of the script, seems the result
is correct…
NFO : Completed compiling
command(queryId=hive_20160302005157_6ca43017-d52c-48c8-b4e4-e11c724c3d39); Time
taken: 0.032 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing
command(queryId=hive_20160302005157_6ca43017-d52c-48c8-b4e4-e11c724c3d39):
SELECT
a.key,
a.a_one,
b.b_one,
a.a_zero,
b.b_zero
FROM
(
SELECT
11 key,
0 confuse_you,
1 a_one,
0 a_zero
) a
LEFT JOIN
(
SELECT
11 key,
0 confuse_you,
1 b_one,
0 b_zero
) b
ON a.key = b.key
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark, tez)
or using Hive 1.X releases.
INFO : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
the future versions. Consider using a different execution engine (i.e. spark,
tez) or using Hive 1.X releases.
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark, tez)
or using Hive 1.X releases.
INFO : Query ID = hive_20160302005157_6ca43017-d52c-48c8-b4e4-e11c724c3d39
INFO : Total jobs = 1
INFO : Starting task [Stage-4:MAPREDLOCAL] in serial mode
INFO : Execution completed successfully
INFO : MapredLocal task succeeded
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-3:MAPRED] in serial mode
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1456723490535_5586
INFO : The url to track the job:
http://apache-master:8088/proxy/application_1456723490535_5586/
INFO : Starting Job = job_1456723490535_5586, Tracking URL =
http://apache-master:8088/proxy/application_1456723490535_5586/
INFO : Kill Command = /opt/hadoop/hadoop-2.7.1/bin/hadoop job -kill
job_1456723490535_5586
INFO : Hadoop job information for Stage-3: number of mappers: 0; number of
reducers: 0
INFO : 2016-03-02 00:52:13,316 Stage-3 map = 0%, reduce = 0%
INFO : 2016-03-02 00:52:17,416 Stage-3 map = 100%, reduce = 0%
INFO : Ended Job = job_1456723490535_5586
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-3: HDFS Read: 0 HDFS Write: 0 SUCCESS
INFO : Total MapReduce CPU Time Spent: 0 msec
INFO : Completed executing
command(queryId=hive_20160302005157_6ca43017-d52c-48c8-b4e4-e11c724c3d39); Time
taken: 19.56 seconds
INFO : OK
+--------+----------+----------+-----------+-----------+--+
| a.key | a.a_one | b.b_one | a.a_zero | b.b_zero |
+--------+----------+----------+-----------+-----------+--+
| 11 | 1 | 1 | 0 | 0 |
+--------+----------+----------+-----------+-----------+--+
1 row selected (19.643 seconds)
From: Stephen Sprague [mailto:[email protected]]
Sent: Tuesday, March 1, 2016 11:45 PM
To: [email protected]
Subject: Re: Wrong column is picked in HIVE 2.0.0 + TEZ 0.8.2 left join
good to know. then per Jeff Zhang's thinking if you were to set the exec
engine to 'mr' would it still fail? if so, then its not Tez . :)
On Mon, Feb 29, 2016 at 9:31 PM, GAO Chi <[email protected]
<mailto:[email protected]> > wrote:
Yes. We have not changed our script, and this only appears after we upgraded to
new version at 24th.
Previously we’re using HIVE 1.2.0 + TEZ 0.7.0
Thanks!
Chi
From: Stephen Sprague [mailto:[email protected] <mailto:[email protected]> ]
Sent: Tuesday, March 1, 2016 12:31 PM
To: [email protected] <mailto:[email protected]>
Subject: Re: Wrong column is picked in HIVE 2.0.0 + TEZ 0.8.2 left join
very interesting. so this did work correctly on your previous distribution of
these two products? May i ask what they were?
On Mon, Feb 29, 2016 at 8:24 PM, GAO Chi <[email protected]
<mailto:[email protected]> > wrote:
Hi all,
We encountered a strange behavior after upgrading to HIVE 2.0.0 + TEZ 0.8.2.
I simplified our query to this:
SELECT
a.key,
a.a_one,
b.b_one,
a.a_zero,
b.b_zero
FROM
(
SELECT
11 key,
0 confuse_you,
1 a_one,
0 a_zero
) a
LEFT JOIN
(
SELECT
11 key,
0 confuse_you,
1 b_one,
0 b_zero
) b
ON a.key = b.key
;
Above query generates this unexpected result:
INFO : Status: Running (Executing on YARN cluster with App id
application_1456723490535_3653)
INFO : Map 1: 0/1 Map 2: 0/1
INFO : Map 1: 0/1 Map 2: 0(+1)/1
INFO : Map 1: 0(+1)/1 Map 2: 0(+1)/1
INFO : Map 1: 0(+1)/1 Map 2: 1/1
INFO : Map 1: 1/1 Map 2: 1/1
INFO : Completed executing
command(queryId=hive_20160301115630_0a0dbee5-ba4b-45e7-b027-085f655640fd); Time
taken: 10.225 seconds
INFO : OK
+--------+----------+----------+-----------+-----------+--+
| a.key | a.a_one | b.b_one | a.a_zero | b.b_zero |
+--------+----------+----------+-----------+-----------+--+
| 11 | 1 | 0 | 0 | 1 |
+--------+----------+----------+-----------+-----------+--+
If you change the constant value of subquery-b’s confuse_you column from 0 to
2, the problem disappears. The plan returned from EXPLAIN shows the incorrect
one is picking _col1 and _col2, while the correct one is picking _col2 and
_col3 form sub query b.
Seems it cannot distinguish 2 columns with same constant value?
Anyone encountered similar problem?
Thanks!
Chi