Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18413 )
Change subject: IMPALA-11205: Implement CORR() function ...................................................................... Patch Set 5: (6 comments) http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@366 PS4, Line 366: return result; > Yeah, we can remove it, it'll get cleared after the function pops from the Oh, I think we need this. Otherwise we'll see warnings like WARNINGS: UDF WARNING: Memory leaked via FunctionContext::Allocate() Look at other udas, they use StringValSerializeOrFinalize() for such simple purpose, i.e. copy and free. I think we don't need this CorrSerialize() method, just do the same as other udas, e.g. https://github.com/apache/impala/blob/1358700740dbeff799f6a6a95f95b1d9fe7281d2/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java#L1376 http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@373 PS4, Line 373: // Calculating correlation coefficient using Pearson Product Moment Correlation formula > Yeah, we can remove it. I intended to remove it, but probably forgot. Based on the above comment, I think we do need this. http://gerrit.cloudera.org:8080/#/c/18413/5/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/18413/5/be/src/exprs/aggregate-functions-ir.cc@372 PS5, Line 372: if (state->count == 0 || state->count == 1) return DoubleVal::null(); Can we add test cases for these? http://gerrit.cloudera.org:8080/#/c/18413/5/be/src/exprs/aggregate-functions-ir.cc@374 PS5, Line 374: ((state->prod*state->count) - (state->sumx*state->sumy)) / : (pow((((state->count*state->sum_squaredx) - : (state->sumx*state->sumx)) * ((state->count*state->sum_squaredy) - : (state->sumy*state->sumy))), 0.5)); Can we format this a little bit? There are too many brackets. Maybe this is better: int64_t n = state->count; double r1 = n * state->prod - state->sumx * state->sumy; double r2 = (n * state->sum_squaredx - state->sumx * state->sumx) * (n * state->sum_squaredy - state->sumy * state->sumy); double corr = r1 / sqrt(r2); Note that we can replace pow(x, 0.5) with sqrt(x). We should also check if r2 is 0 and return NULL in that case. Hive does it this way: https://github.com/apache/hive/blob/7b3ecf617a6d46f48a3b6f77e0339fd4ad95a420/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCorrelation.java#L366 http://gerrit.cloudera.org:8080/#/c/18413/5/testdata/workloads/functional-query/queries/QueryTest/uda.test File testdata/workloads/functional-query/queries/QueryTest/uda.test: http://gerrit.cloudera.org:8080/#/c/18413/5/testdata/workloads/functional-query/queries/QueryTest/uda.test@19 PS5, Line 19: NULL Is this correct? In Hive, it's 1.0 http://gerrit.cloudera.org:8080/#/c/18413/5/testdata/workloads/functional-query/queries/QueryTest/uda.test@26 PS5, Line 26: corr(cast(timestamp_col as int),cast(timestamp_col as int)) Hive doesn't need to cast timestamp in the query. Can we improve this? -- To view, visit http://gerrit.cloudera.org:8080/18413 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I32ad627c953ba24d9cde2d5549bdd0d27a9c0d06 Gerrit-Change-Number: 18413 Gerrit-PatchSet: 5 Gerrit-Owner: Anonymous Coward <pranav.lo...@cloudera.com> Gerrit-Reviewer: Anonymous Coward <pranav.lo...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Comment-Date: Thu, 21 Apr 2022 03:27:18 +0000 Gerrit-HasComments: Yes