Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17387 )

Change subject: IMPALA-10681: Improve join cardinality estimates
......................................................................


Patch Set 3:

(9 comments)

Hi Aman, Thanks a lot for the work. I think it should help with the cases 
nicely.

http://gerrit.cloudera.org:8080/#/c/17387/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17387/3//COMMIT_MSG@20
PS3, Line 20: and
nit. extra


http://gerrit.cloudera.org:8080/#/c/17387/3//COMMIT_MSG@21
PS3, Line 21: If it has a group-by and the group-by
            : columns alread have associated NDV, we can can still know the
            : combined NDV.
Just wonder if this case is handled in the code.


http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@430
PS3, Line 430: getOtherJoinCardinality
getNdvAdjustedJoinCardinality() may sound more close in semanticds of this 
function.


http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@431
PS3, Line 431: long lhsCard, long rhsCard
May add a comment for these two input variables.


http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@441
PS3, Line 441: if (stats.lhsNumRows() > lhsCard) lhsAdjNdv *= lhsCard / 
stats.lhsNumRows();
I wonder if we should trust lhsCard more here. On paper, NDV of an aggregate 
function is always 1, while the row count at the table level is somewhat 
accurate, and the row count above the scan is much less.


http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@448
PS3, Line 448: long joinCard = Math.round((lhsCard / Math.max(1, 
Math.max(lhsAdjNdv, rhsAdjNdv))) *
             :           rhsCard);
For equi- inner joins and the max. cardinality is to be returned, it seems an 
alternative formula could be

(CardL / NdvL) * (CardR / NdvR)

which is more intuitive to read and takes care of the occurrence frequency for 
both sides.


http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@550
PS3, Line 550: if (!hasNdvStats(slotDesc)) return false;
             :       if (!hasNumRowsStats(slotDesc)) return false;
             :       return true;
nit. May shorten to return (hasNdvStats() && hasNumRowsStats()).


http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@606
PS3, Line 606: return Math.min(lhsNdv_, lhsNumRows_);
nit. May adjust the lhsNdv_ in the cstr to avoid repeated computation.


http://gerrit.cloudera.org:8080/#/c/17387/3/fe/src/main/java/org/apache/impala/planner/JoinNode.java@607
PS3, Line 607: return Math.min(rhsNdv_, rhsNumRows_);
Same as above.



--
To view, visit http://gerrit.cloudera.org:8080/17387
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8aa9d3b8f3c4848b3e9414fe19ad7ad348d12ecc
Gerrit-Change-Number: 17387
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Wed, 19 May 2021 15:09:17 +0000
Gerrit-HasComments: Yes

Reply via email to