How are you defining the window? It looks like it's something like "rows
between unbounded preceding and current row" (or the reverse), because the
correlation varies across the rows of each group as if it were computed on
1, then 2, then 3 elements. A correlation over a single row is undefined,
which would explain the null on the first row of each group. Don't you want
the correlation across the whole group? Otherwise this answer looks 'right'
for what you're doing.
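
To check that reading against the numbers, here is a minimal plain-Scala
sketch (no Spark) that recomputes the correlation over growing prefixes of
group2, in the row order shown in the quoted table below. RunningCorr and
pearson are illustrative names of mine, not Spark APIs, and returning None
for an undefined correlation is my assumption about how to model the null.

```scala
object RunningCorr {
  // Pearson correlation of (x, y) pairs.
  // Returns None when it is undefined: fewer than two points, or zero
  // variance in either column (both make the denominator zero).
  def pearson(points: Seq[(Double, Double)]): Option[Double] = {
    val n = points.size
    if (n < 2) None
    else {
      val meanX = points.map(_._1).sum / n
      val meanY = points.map(_._2).sum / n
      val sxy = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
      val sxx = points.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
      val syy = points.map { case (_, y) => (y - meanY) * (y - meanY) }.sum
      if (sxx == 0.0 || syy == 0.0) None
      else Some(sxy / math.sqrt(sxx * syy))
    }
  }

  // (count1Sum, count2Sum) for group2, in the order id3, id2, id1
  // as it appears in the quoted output.
  val group2: Seq[(Double, Double)] = Seq((21.0, 12.0), (5.0, 5.0), (4.0, 4.0))

  def main(args: Array[String]): Unit = {
    // Mimics a frame that grows from the first row to the current row.
    (1 to group2.size).foreach { k =>
      println(s"first $k row(s): ${pearson(group2.take(k))}")
    }
    // Mimics a frame that spans the whole partition.
    println(s"whole group: ${pearson(group2)}")
  }
}
```

The prefixes give nothing for one row, 1.0 for two rows, and roughly 0.998
for all three, which is the pattern in the cf column. If you want one value
per group, the frame has to cover the whole partition, as it does with
unboundedPreceding to unboundedFollowing.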

On Mon, Feb 28, 2022 at 7:49 AM Edgar H <kaotix...@gmail.com> wrote:

> My bad completely, I missed the example by a mile, sorry for that. Let me
> change a couple of things.
>
> - Got to add "id" to the initial grouping and also add more elements to
> the initial set;
>
> val sampleSet = Seq(
>   ("group1", "id1", 1, 1, 6),
>   ("group1", "id1", 4, 4, 6),
>   ("group1", "id2", 2, 2, 5),
>   ("group1", "id3", 3, 3, 4),
>   ("group2", "id1", 4, 4, 3),
>   ("group2", "id2", 5, 5, 2),
>   ("group2", "id3", 6, 6, 1),
>   ("group2", "id3", 15, 6, 1)
> )
>
> val groupedSet = initialSet
>   .groupBy(
>     "group", "id"
>   ).agg(
>     sum("count1").as("count1Sum"),
>     sum("count2").as("count2Sum"),
>     sum("orderCount").as("orderCountSum")
>   )
>   .withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))
>
> Now, with this in place, in case the correlation is applied, the following
> is shown:
>
> +------+---+---------+---------+-------------+------------------+
> | group| id|count1Sum|count2Sum|orderCountSum|                cf|
> +------+---+---------+---------+-------------+------------------+
> |group1|id3|        3|        3|            4|              null|
> |group1|id2|        2|        2|            5|               1.0|
> |group1|id1|        5|        5|           12|               1.0|
> |group2|id3|       21|       12|            2|              null|
> |group2|id2|        5|        5|            2|               1.0|
> |group2|id1|        4|        4|            3|0.9980460957560549|
> +------+---+---------+---------+-------------+------------------+
>
> Taking into account what you just mentioned... Even if the Window is only
> partitioned by "group", would it still be impossible to obtain a
> correlation? I'm trying to do like...
>
> group1 = id1, id2, id3 (and their respective counts) - apply the
> correlation over the set of ids within the group (without taking into
> account they are a sum)
> group2 = id1, id2, id3 (and their respective counts) - same as before
>
> However, the highest element is still null. When changing the rowsBetween
> call to .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> it will just calculate the whole subset correlation. Shouldn't the first
> element of the correlation calculate itself?
>
