Hi all,
Noticed something extremely strange and concerning regarding subquery column
selection. Here's the query, and the sample results in presto vs hive
SELECT    ipm.ip,    ipm.master_id AS mid,    COALESCE(ad.cookie, '') AS cid,
    COALESCE(ad.url_cid, '') AS campaign_id,    COALESCE(ad.url_sid, '') AS
segment_id,    CASE        WHEN ad.url_event='clk'        THEN 'media_click' 
ELSE 'media_impression'    END AS event,    count(*) AS count,    CASE
        WHEN COALESCE(o.name, '') != ''        THEN 'first_party_media'
        ELSE ot.token_type END AS campaign_type,    ad.dt,
    lower(COALESCE(ot.organization_name, o.name)) AS orgFROM    (        SELECT
            base_ip,            url_event,            url_orgid,
            url_token, url_cid,            url_sid,            cookie,
            url,            date_time,            dt        FROM
            tag.pixel        WHERE            dt BETWEEN '2017-08-23' AND
'2017-08-23'            AND user_agent NOT LIKE 'ELB-HealthChecker%'
        GROUP BY            1, 2, 3, 4, 5, 6, 7, 8, 9, 10    ) ad    JOIN
default.ip_to_master_v2 ipm        ON ipm.ip = ad.base_ip    lEFT JOIN
default.organization_token ot        ON ot.token = ad.url_token    LEFT JOIN
aa.organization o        ON ad.url_orgid = o.idWHERE 
COALESCE(ot.organization_name, o.name, '') != '' GROUP BY    1, 2, 3, 4, 5, 6,
8, 9, 10LIMIT 10
Presto       ip                 | mid                                          
            | cid                                                              
             | campaign_id | segment_id | event                      | count |
campaign_type    | dt                | org (ommitted)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 216.110.25.2     | 93c8a694f6c0d39-                               |
420ad817041900006549e258d9030000aba10300 | 50                 | 2              
   | media_impression | 1        | first_party_media | 2017-08-23 | ******
 14.140.116.135 | 6dfbca80b99c365-60934b2c332bcaa |
96453917a70d0000cbeb59598e010000fe970400  | 43                 | 7              
    | media_impression | 2        | first_party_media | 2017-08-23 | ******
Hive103.4.16.84 51 | ba82f9c1e7ef7-51ba82f9c1e7ef7       | 50    | 50    | 2   |
media_impression | 2   | first_party_media | 2017-08-23 | 5012.175.247.66  |
c4028d27f0ebbb2-ef86eff7815a004   | 109  | 109  | 21 | media_impression | 23 |
first_party_media | 2017-08-23 | 109116.197.188.12 |
e78268e3ce3e13c-19ffd91bd5236a8 | 43    | 43    | 7   | media_impression | 4   |
first_party_media | 2017-08-23 | 43
Alright, so clearly the results are very different in presto compared to hive. 
What's interesting is that columns 3, 4, and 10 show the same stat: campaign_id.
I tried a bunch of different things (different aliasing, grouping, etc.), but
nothing worked. The thing that did the trick was pulling out the subquery into
its own table
DROP TABLE IF EXISTS aa.tmp_subquery;CREATE TABLE IF NOT EXISTS aa.tmp_subquery
ASSELECT    base_ip,    url_event,    url_orgid,    url_token, url_cid,
    url_sid,    cookie,    url,    date_time,    dtFROM    tag.pixelWHERE    dt
BETWEEN '{start-yyyy-mm-dd}' AND '{end-yyyy-mm-dd}'    AND user_agent NOT LIKE
'ELB-HealthChecker%'GROUP BY    1, 2, 3, 4, 5, 6, 7, 8, 9, 10;
INSERT OVERWRITE TABLE aa.campaign_accounts PARTITION (campaign_type, dt, org)
SELECT    ipm.ip,    ipm.master_id AS mid,    COALESCE(ad.cookie, '') AS cid,
    COALESCE(ad.url_cid, '') AS campaign_id,    COALESCE(ad.url_sid, '') AS
segment_id,    CASE        WHEN ad.url_event='clk'        THEN 'media_click' 
ELSE 'media_impression'    END AS event,    count(*) AS count,    CASE
        WHEN COALESCE(o.name, '') != ''        THEN 'first_party_media'
        ELSE ot.token_type END AS campaign_type,    ad.dt,
    lower(COALESCE(ot.organization_name, o.name)) AS orgFROM    aa.tmp_subquery
ad    JOIN default.ip_to_master_v2 ipm        ON ipm.ip = ad.base_ip    lEFT
JOIN default.organization_token ot        ON ot.token = ad.url_token    LEFT
JOIN aa.organization o        ON ad.url_orgid = o.idWHERE 
COALESCE(ot.organization_name, o.name, '') != '' GROUP BY    1, 2, 3, 4, 5, 6,
8, 9, 10DISTRIBUTE BY    campaign_type, dt, orgSORT BY    mid;
The hive output matches presto's.Seems like a bug that's easy to miss and can
cause some major damage.
Anyone encountered this before?
Thanks,
Nitin

Reply via email to