Hi all,
Noticed something extremely strange and concerning regarding subquery column
selection. Here's the query, and the sample results in presto vs hive
SELECT ipm.ip, ipm.master_id AS mid, COALESCE(ad.cookie, '') AS cid,
COALESCE(ad.url_cid, '') AS campaign_id, COALESCE(ad.url_sid, '') AS
segment_id, CASE WHEN ad.url_event='clk' THEN 'media_click'
ELSE 'media_impression' END AS event, count(*) AS count, CASE
WHEN COALESCE(o.name, '') != '' THEN 'first_party_media'
ELSE ot.token_type END AS campaign_type, ad.dt,
lower(COALESCE(ot.organization_name, o.name)) AS orgFROM ( SELECT
base_ip, url_event, url_orgid,
url_token, url_cid, url_sid, cookie,
url, date_time, dt FROM
tag.pixel WHERE dt BETWEEN '2017-08-23' AND
'2017-08-23' AND user_agent NOT LIKE 'ELB-HealthChecker%'
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ) ad JOIN
default.ip_to_master_v2 ipm ON ipm.ip = ad.base_ip lEFT JOIN
default.organization_token ot ON ot.token = ad.url_token LEFT JOIN
aa.organization o ON ad.url_orgid = o.idWHERE
COALESCE(ot.organization_name, o.name, '') != '' GROUP BY 1, 2, 3, 4, 5, 6,
8, 9, 10LIMIT 10
Presto ip | mid
| cid
| campaign_id | segment_id | event | count |
campaign_type | dt | org (ommitted)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
216.110.25.2 | 93c8a694f6c0d39- |
420ad817041900006549e258d9030000aba10300 | 50 | 2
| media_impression | 1 | first_party_media | 2017-08-23 | ******
14.140.116.135 | 6dfbca80b99c365-60934b2c332bcaa |
96453917a70d0000cbeb59598e010000fe970400 | 43 | 7
| media_impression | 2 | first_party_media | 2017-08-23 | ******
Hive103.4.16.84 51 | ba82f9c1e7ef7-51ba82f9c1e7ef7 | 50 | 50 | 2 |
media_impression | 2 | first_party_media | 2017-08-23 | 5012.175.247.66 |
c4028d27f0ebbb2-ef86eff7815a004 | 109 | 109 | 21 | media_impression | 23 |
first_party_media | 2017-08-23 | 109116.197.188.12 |
e78268e3ce3e13c-19ffd91bd5236a8 | 43 | 43 | 7 | media_impression | 4 |
first_party_media | 2017-08-23 | 43
Alright, so clearly the results are very different in presto compared to hive.
What's interesting is that columns 3, 4, and 10 show the same stat: campaign_id.
I tried a bunch of different things (different aliasing, grouping, etc.), but
nothing worked. The thing that did the trick was pulling out the subquery into
its own table
DROP TABLE IF EXISTS aa.tmp_subquery;CREATE TABLE IF NOT EXISTS aa.tmp_subquery
ASSELECT base_ip, url_event, url_orgid, url_token, url_cid,
url_sid, cookie, url, date_time, dtFROM tag.pixelWHERE dt
BETWEEN '{start-yyyy-mm-dd}' AND '{end-yyyy-mm-dd}' AND user_agent NOT LIKE
'ELB-HealthChecker%'GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10;
INSERT OVERWRITE TABLE aa.campaign_accounts PARTITION (campaign_type, dt, org)
SELECT ipm.ip, ipm.master_id AS mid, COALESCE(ad.cookie, '') AS cid,
COALESCE(ad.url_cid, '') AS campaign_id, COALESCE(ad.url_sid, '') AS
segment_id, CASE WHEN ad.url_event='clk' THEN 'media_click'
ELSE 'media_impression' END AS event, count(*) AS count, CASE
WHEN COALESCE(o.name, '') != '' THEN 'first_party_media'
ELSE ot.token_type END AS campaign_type, ad.dt,
lower(COALESCE(ot.organization_name, o.name)) AS orgFROM aa.tmp_subquery
ad JOIN default.ip_to_master_v2 ipm ON ipm.ip = ad.base_ip lEFT
JOIN default.organization_token ot ON ot.token = ad.url_token LEFT
JOIN aa.organization o ON ad.url_orgid = o.idWHERE
COALESCE(ot.organization_name, o.name, '') != '' GROUP BY 1, 2, 3, 4, 5, 6,
8, 9, 10DISTRIBUTE BY campaign_type, dt, orgSORT BY mid;
The hive output matches presto's.Seems like a bug that's easy to miss and can
cause some major damage.
Anyone encountered this before?
Thanks,
Nitin