pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/18413 )
Change subject: IMPALA-11205: Implement CORR() function ...................................................................... Patch Set 5: (9 comments) > Patch Set 4: > > (9 comments) http://gerrit.cloudera.org:8080/#/c/18413/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18413/4//COMMIT_MSG@10 PS4, Line 10: t > nit: remove the leading space Done http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@292 PS4, Line 292: // type and returns the correlation coefficient between them. > Could you comment the formula here? Done http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@359 PS4, Line 359: dst_state->count += src_state->count; > nit: remove this blank line Done http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@366 PS4, Line 366: return result; > Could you explain why we need to free it here? It's not allocated by this f Yeah, we can remove it, it'll get cleared after the function pops from the stack trace so its not needed. http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@373 PS4, Line 373: // Calculating correlation coefficient using Pearson Product Moment Correlation formula > Why do we free it here? We still use the CorrState below. Yeah, we can remove it. I intended to remove it, but probably forgot. http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@375 PS4, Line 375: (pow((((state->count*state->sum_squaredx) - : (state->sumx*state->sumx)) * ((state->count*state->sum_squaredy) - : (state->sumy*state->sumy))), 0.5)); : return DoubleVal(corr); : } > I just realized that we are using a different formula than Hive's: Yes, they're equivalent. Both are using Pearson's correlation coefficient itself. In hive, there are functions for covariance and standard deviation, so the formula might look a little different. But on simplifying its the same. http://gerrit.cloudera.org:8080/#/c/18413/4/testdata/workloads/functional-query/queries/QueryTest/uda.test File testdata/workloads/functional-query/queries/QueryTest/uda.test: http://gerrit.cloudera.org:8080/#/c/18413/4/testdata/workloads/functional-query/queries/QueryTest/uda.test@12 PS4, Line 12: 0.000321849166368 > I checked the test codes and found that it already handles rounding inaccur Done http://gerrit.cloudera.org:8080/#/c/18413/4/testdata/workloads/functional-query/queries/QueryTest/uda.test@15 PS4, Line 15: ==== > Can we add tests on NULL values? Some tpcds tables contain NULLs, e.g. cata Done http://gerrit.cloudera.org:8080/#/c/18413/4/testdata/workloads/functional-query/queries/QueryTest/uda.test@32 PS4, Line 32: ==== > Could you add tests for other types as well? E.g. Sure, done! -- To view, visit http://gerrit.cloudera.org:8080/18413 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I32ad627c953ba24d9cde2d5549bdd0d27a9c0d06 Gerrit-Change-Number: 18413 Gerrit-PatchSet: 5 Gerrit-Owner: Anonymous Coward <pranav.lo...@cloudera.com> Gerrit-Reviewer: Anonymous Coward <pranav.lo...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Comment-Date: Mon, 18 Apr 2022 09:23:02 +0000 Gerrit-HasComments: Yes