Hi Rex,
Could you also attach an example of the SQL / Table code involved? One thing
to confirm: do the operators with the same names also have the same inputs?
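To illustrate what I mean by "the same inputs", here is a minimal sketch in
Python. It is not Flink's actual code: Op, digest and same_subplan are
hypothetical names, and the sources are made up. It only shows that two
operators that print identically are the same computation only if their
inputs match as well.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for one operator in a printed plan."""
    description: str              # e.g. the "Join(...)" text from the plan
    inputs: Tuple["Op", ...] = ()


def digest(op: Op) -> str:
    """Structural digest: the operator's own text plus its inputs' digests."""
    desc = " ".join(op.description.split())  # collapse line-wrap whitespace
    return desc + "[" + ",".join(digest(i) for i in op.inputs) + "]"


def same_subplan(a: Op, b: Op) -> bool:
    """True only if both the operator text and all upstream inputs match."""
    return digest(a) == digest(b)


# Made-up sources; only the abbreviated join text mirrors the plan in the mail.
src_a = Op("Source(table=[a])")
src_b = Op("Source(table=[b])")
src_c = Op("Source(table=[c])")

join_text = "Join(joinType=[InnerJoin], where=[(id = id_splatted)])"
join1 = Op(join_text, (src_a, src_b))
# Same text, wrapped differently, same inputs.
join2 = Op("Join(joinType=[InnerJoin],\n  where=[(id = id_splatted)])", (src_a, src_b))
# Same text, but one input differs.
join3 = Op(join_text, (src_a, src_c))

print(same_subplan(join1, join2))
print(same_subplan(join1, join3))
```

Here join1 and join2 compare equal, while join3 has the same name but a
different input, so it is a different sub-plan even though it prints the same.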
Best,
Yun
------------------Original Mail ------------------
Sender:Rex Fenley <[email protected]>
Send Date:Fri Dec 4 02:55:41 2020
Recipients:user <[email protected]>
Subject:Duplicate operators generated by plan
Hello,
I'm running into an issue where my execution plan creates the exact same join
operator multiple times, simply because the operator that follows each copy
filters on a different boolean value. This is a massive duplication of storage
and work. Each of the filtering operators that follow also removes only a
small number of elements from its copy of the data.
E.g., two separate operators that are identical:
Join(joinType=[InnerJoin], where=[(id = id_splatted)], select=[id,
organization_id, user_id, roles, id_splatted, org_user_is_admin,
org_user_is_teacher, org_user_is_student, org_user_is_parent],
leftInputSpec=[JoinKeyContainsUniqueKey],
rightInputSpec=[JoinKeyContainsUniqueKey]) -> Calc(select=[id, organization_id,
user_id, roles AS org_user_roles, org_user_is_admin, org_user_is_teacher,
org_user_is_student, org_user_is_parent])
Join(joinType=[InnerJoin], where=[(id = id_splatted)], select=[id,
organization_id, user_id, roles, id_splatted, org_user_is_admin,
org_user_is_teacher, org_user_is_student, org_user_is_parent],
leftInputSpec=[JoinKeyContainsUniqueKey],
rightInputSpec=[JoinKeyContainsUniqueKey]) -> Calc(select=[id, organization_id,
user_id, roles AS org_user_roles, org_user_is_admin, org_user_is_teacher,
org_user_is_student, org_user_is_parent])
Both process exactly the same dataset.
The first one points to
GroupAggregate(groupBy=[user_id], select=[user_id,
IDsAgg_RETRACT(organization_id) AS TMP_0]) -> Calc(select=[user_id, TMP_0.f0 AS
admin_organization_ids])
The second one points to
GroupAggregate(groupBy=[user_id], select=[user_id,
IDsAgg_RETRACT(organization_id) AS TMP_0]) -> Calc(select=[user_id, TMP_0.f0 AS
teacher_organization_ids])
These two operate on overlapping, though slightly different, sets of data. I
don't see why that would make the single join from before split into two,
though. There's even a case where I'm seeing a join tripled.
Is there a good reason why this should happen? Is there a way to tell Flink
not to duplicate operators where it doesn't need to?
Thanks!
--
Rex Fenley | Software Engineer - Mobile and Backend
Remind.com