Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18413 )

Change subject: IMPALA-11205: Implement CORR() function
......................................................................


Patch Set 5:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@366
PS4, Line 366:   return result;
> Yeah, we can remove it, it'll get cleared after the function pops from the
Oh, I think we need this. Otherwise we'll see warnings like

 WARNINGS: UDF WARNING: Memory leaked via FunctionContext::Allocate()

Look at other udas, they use StringValSerializeOrFinalize() for such simple 
purpose, i.e. copy and free. I think we don't need this CorrSerialize() method, 
just do the same as other udas, e.g.
https://github.com/apache/impala/blob/1358700740dbeff799f6a6a95f95b1d9fe7281d2/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java#L1376


http://gerrit.cloudera.org:8080/#/c/18413/4/be/src/exprs/aggregate-functions-ir.cc@373
PS4, Line 373:   // Calculating correlation coefficient using Pearson Product 
Moment Correlation formula
> Yeah, we can remove it. I intended to remove it, but probably forgot.
Based on the above comment, I think we do need this.


http://gerrit.cloudera.org:8080/#/c/18413/5/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/18413/5/be/src/exprs/aggregate-functions-ir.cc@372
PS5, Line 372:   if (state->count == 0 || state->count == 1) return 
DoubleVal::null();
Can we add test cases for these?


http://gerrit.cloudera.org:8080/#/c/18413/5/be/src/exprs/aggregate-functions-ir.cc@374
PS5, Line 374: ((state->prod*state->count) - (state->sumx*state->sumy)) /
             :       (pow((((state->count*state->sum_squaredx) -
             :       (state->sumx*state->sumx)) * 
((state->count*state->sum_squaredy) -
             :                    (state->sumy*state->sumy))), 0.5));
Can we format this a little bit? There are too many brackets. Maybe this is 
better:

  int64_t n = state->count;
  double r1 = n * state->prod - state->sumx * state->sumy;
  double r2 = (n * state->sum_squaredx - state->sumx * state->sumx) *
      (n * state->sum_squaredy - state->sumy * state->sumy);
  double corr = r1 / sqrt(r2);

Note that we can replace pow(x, 0.5) with sqrt(x).

We should also check if r2 is 0 and return NULL in that case. Hive does it this 
way:
https://github.com/apache/hive/blob/7b3ecf617a6d46f48a3b6f77e0339fd4ad95a420/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCorrelation.java#L366


http://gerrit.cloudera.org:8080/#/c/18413/5/testdata/workloads/functional-query/queries/QueryTest/uda.test
File testdata/workloads/functional-query/queries/QueryTest/uda.test:

http://gerrit.cloudera.org:8080/#/c/18413/5/testdata/workloads/functional-query/queries/QueryTest/uda.test@19
PS5, Line 19: NULL
Is this correct? In Hive, it's 1.0


http://gerrit.cloudera.org:8080/#/c/18413/5/testdata/workloads/functional-query/queries/QueryTest/uda.test@26
PS5, Line 26: corr(cast(timestamp_col as int),cast(timestamp_col as int))
Hive doesn't need to cast timestamp in the query. Can we improve this?



--
To view, visit http://gerrit.cloudera.org:8080/18413
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I32ad627c953ba24d9cde2d5549bdd0d27a9c0d06
Gerrit-Change-Number: 18413
Gerrit-PatchSet: 5
Gerrit-Owner: Anonymous Coward <pranav.lo...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward <pranav.lo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 03:27:18 +0000
Gerrit-HasComments: Yes

Reply via email to