Re: A problem of left outer join when statement contains two aggregation for the same column

Csaba Ringhofer Thu, 04 Apr 2019 08:09:04 -0700

Hi!

I have checked the queries, and I can verify that Impala incorrectly
returns 1 row while the same query with Hive (or common sense..) returns 2
rows.


> "but if remove the "t2.amount2"  like this："
Indeed, the issue seems to be related to returning the same aggregate twice
+ the fact that one of these values is NULL. The planner introduces a
predicate that checks if amount1=amount2, which is false, if both values
are NULL:

explain select * from (select t2.a_id,t2.amount1,t2.amount2
    from( select a_id from a) t1
    left outer join (
        select c.a_id,sum(amount) as amount1,sum(amount) as amount2
        from b join c  on b.b_id = c.b_id group by c.a_id) t2
    on t1.a_id = t2.a_id) t;

results in:
PLAN-ROOT SINK
|
05:HASH JOIN [RIGHT OUTER JOIN]
|  hash predicates: c.a_id = a_id
|  other predicates: sum(amount) = sum(amount) <----- I don't know why this
predicate is added.
|  runtime filters: RF000 <- a_id
|  row-size=16B cardinality=2
....

If I EXPLAIN the query without the outer select, the sum(amount) =
sum(amount) is not added, which explains the difference.

I do not know why the planner adds this predicate, my guess is that this is
some kind of bug in Impala.


On Thu, Apr 4, 2019 at 2:27 PM skyyws <[email protected]> wrote:

> Hi all, I met a problem of left outer join recently, and I reproduce this
> problem by some simple test data with three tables a, b, c：
> table A
> +------+
> | a_id |
> +------+
> | 1    |
> | 2    |
> +------+
> table B
> +------+--------+
> | b_id | amount |
> +------+--------+
> | 1    | 10     |
> | 1    | 20     |
> | 2    | NULL   |
> +------+--------+
> table C
> +------+------+
> | a_id | b_id |
> +------+------+
> | 1    | 1    |
> | 2    | 2    |
> +------+------+
> The sql below：
> select count(1) from (
>     select t2.a_id,t2.amount1,t2.amount2
>     from( select a_id from a) t1
>     left outer join (
>         select c.a_id,sum(amount) as amount1,sum(amount) as amount2
>         from b join c  on b.b_id = c.b_id group by c.a_id) t2
>     on t1.a_id = t2.a_id
> ) t;
> +----------+
> | count(1) |
> +----------+
> | 1        |
> +----------+
> but if remove the "t2.amount2"  like this：
> select count(1) from (
>     select t2.a_id,t2.amount1
>     from( select a_id from a) t1
>     left outer join (
>         select c.a_id,sum(amount) as amount1,sum(amount) as amount2
>         from b join c  on b.b_id = c.b_id group by c.a_id) t2
>     on t1.a_id = t2.a_id
> ) t;
> +----------+
> | count(1) |
> +----------+
> | 2        |
> +----------+
> Here is the result of two subquery without count(1)：
> +------+---------+---------+
> | a_id | amount1 | amount2 |
> +------+---------+---------+
> | 1    | 30      | 30      |
> | 2    | NULL    | NULL    |
> +------+---------+---------+             why the count(1) of this
> resultset is 1?
> +------+---------+
> | a_id | amount1 |
> +------+---------+
> | 1    | 30      |
> | 2    | NULL    |
> +------+---------+                           why the count(1) of this
> resultset is 2?
> I want to ask why the first sql return just 1， but second return 2，is this
> correct or impala bug？How impala deal with count aggr.?
> If I change the sum to other aggr. function like count/max/min, result is
> same. I test this on 2.12.0 and 3.1.0 version.
>
>

Re: A problem of left outer join when statement contains two aggregation for the same column

Reply via email to