Hi!
I have checked the queries, and I can verify that Impala incorrectly
returns 1 row while the same query with Hive (or common sense..) returns 2
rows.
> "but if remove the "t2.amount2" like this:"
Indeed, the issue seems to be related to returning the same aggregate twice
+ the fact that one of these values is NULL. The planner introduces a
predicate that checks if amount1=amount2, which is false, if both values
are NULL:
explain select * from (select t2.a_id,t2.amount1,t2.amount2
from( select a_id from a) t1
left outer join (
select c.a_id,sum(amount) as amount1,sum(amount) as amount2
from b join c on b.b_id = c.b_id group by c.a_id) t2
on t1.a_id = t2.a_id) t;
results in:
PLAN-ROOT SINK
|
05:HASH JOIN [RIGHT OUTER JOIN]
| hash predicates: c.a_id = a_id
| other predicates: sum(amount) = sum(amount) <----- I don't know why this
predicate is added.
| runtime filters: RF000 <- a_id
| row-size=16B cardinality=2
....
If I EXPLAIN the query without the outer select, the sum(amount) =
sum(amount) is not added, which explains the difference.
I do not know why the planner adds this predicate, my guess is that this is
some kind of bug in Impala.
On Thu, Apr 4, 2019 at 2:27 PM skyyws <[email protected]> wrote:
> Hi all, I met a problem of left outer join recently, and I reproduce this
> problem by some simple test data with three tables a, b, c:
> table A
> +------+
> | a_id |
> +------+
> | 1 |
> | 2 |
> +------+
> table B
> +------+--------+
> | b_id | amount |
> +------+--------+
> | 1 | 10 |
> | 1 | 20 |
> | 2 | NULL |
> +------+--------+
> table C
> +------+------+
> | a_id | b_id |
> +------+------+
> | 1 | 1 |
> | 2 | 2 |
> +------+------+
> The sql below:
> select count(1) from (
> select t2.a_id,t2.amount1,t2.amount2
> from( select a_id from a) t1
> left outer join (
> select c.a_id,sum(amount) as amount1,sum(amount) as amount2
> from b join c on b.b_id = c.b_id group by c.a_id) t2
> on t1.a_id = t2.a_id
> ) t;
> +----------+
> | count(1) |
> +----------+
> | 1 |
> +----------+
> but if remove the "t2.amount2" like this:
> select count(1) from (
> select t2.a_id,t2.amount1
> from( select a_id from a) t1
> left outer join (
> select c.a_id,sum(amount) as amount1,sum(amount) as amount2
> from b join c on b.b_id = c.b_id group by c.a_id) t2
> on t1.a_id = t2.a_id
> ) t;
> +----------+
> | count(1) |
> +----------+
> | 2 |
> +----------+
> Here is the result of two subquery without count(1):
> +------+---------+---------+
> | a_id | amount1 | amount2 |
> +------+---------+---------+
> | 1 | 30 | 30 |
> | 2 | NULL | NULL |
> +------+---------+---------+ why the count(1) of this
> resultset is 1?
> +------+---------+
> | a_id | amount1 |
> +------+---------+
> | 1 | 30 |
> | 2 | NULL |
> +------+---------+ why the count(1) of this
> resultset is 2?
> I want to ask why the first sql return just 1, but second return 2,is this
> correct or impala bug?How impala deal with count aggr.?
> If I change the sum to other aggr. function like count/max/min, result is
> same. I test this on 2.12.0 and 3.1.0 version.
>
>