I'm new to Spark and would like to seek some advice on how to approach a 
problem.

I have a large dataset of dated observations. It also has columns that are 
running sums of some of the other columns.

   date     | thing |   foo   |   bar   |  foo_sum  |  bar_sum  |
============+=======+=========+=========+===========+===========+
 2020-01-01 |  101  |      1  |       3 |        1  |        3  |
 2020-01-01 |  202  |      0  |       2 |        0  |        2  |
 2020-01-01 |  303  |      1  |       1 |        1  |        1  |
------------+-------+---------+---------+-----------+-----------+
 2020-01-02 |  101  |      1  |       2 |        2  |        5  |
 2020-01-02 |  202  |      0  |       0 |        0  |        2  |
 2020-01-02 |  303  |      4  |       1 |        5  |        2  |
 2020-01-02 |  404  |      2  |       2 |        2  |        2  |
------------+-------+---------+---------+-----------+-----------+
Currently I generate the running sums using a WindowSpec:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

final WindowSpec w =
    Window.partitionBy(col("thing"))
        .orderBy(col("date"))
        .rowsBetween(Window.unboundedPreceding(), Window.currentRow());

return df
    .withColumn("foo_sum", sum("foo").over(w))
    .withColumn("bar_sum", sum("bar").over(w));
Once these extra sum columns are computed, they are written back to storage.
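For context, the write-back looks roughly like this (the Parquet format, the 
path, and the withSums name are just placeholders for my actual setup); 
partitioning by date is what lets me cheaply read back only the latest rows 
later:

import org.apache.spark.sql.SaveMode;

// Placeholder write-back: withSums is the DataFrame with the sum columns.
// Partitioning by date keeps the most recent totals easy to reach.
withSums.write()
    .mode(SaveMode.Append)
    .partitionBy("date")
    .parquet("/data/observations");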

Periodically this dataset is appended to with new observations. These new 
observations are all chronologically later than any of the previous 
observations.

I need to "continue" the previous running sums for the new observations -- but 
I want to avoid having to recompute the running sums completely from scratch. 
A sketch of the approach I'm considering follows the table below.

   date     | thing |   foo   |   bar   |  foo_sum  |  bar_sum  |
============+=======+=========+=========+===========+===========+
 2020-01-01 |  101  |      1  |       3 |        1  |        3  |
 2020-01-01 |  202  |      0  |       2 |        0  |        2  |
 2020-01-01 |  303  |      1  |       1 |        1  |        1  |
------------+-------+---------+---------+-----------+-----------+
 2020-01-02 |  101  |      1  |       2 |        2  |        5  |
 2020-01-02 |  202  |      0  |       0 |        0  |        2  |
 2020-01-02 |  303  |      4  |       1 |        5  |        2  |
 2020-01-02 |  404  |      2  |       2 |        2  |        2  |
------------+-------+---------+---------+-----------+-----------+   new data
 2020-01-03 |  101  |      2  |       2 |        .  |        .  |
 2020-01-03 |  303  |      1  |       1 |        .  |        .  |
 2020-01-03 |  404  |      2  |       1 |        .  |        .  |
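Here is a minimal sketch of what I have in mind, assuming the stored data and 
the new batch are available as two Datasets (the names existing, newRows, and 
continueSums are placeholders): take each thing's latest cumulative totals 
from the stored data, compute running sums over the new batch alone, then 
shift those sums by the carried-over totals.

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// existing: stored rows with foo_sum/bar_sum already filled in
// newRows:  the appended observations, sums not yet computed
Dataset<Row> continueSums(Dataset<Row> existing, Dataset<Row> newRows) {
    // Each thing's most recent cumulative totals become the offsets.
    Dataset<Row> offsets = existing
        .withColumn("rn", row_number().over(
            Window.partitionBy(col("thing")).orderBy(col("date").desc())))
        .filter(col("rn").equalTo(1))
        .select(col("thing").as("offset_thing"),
            col("foo_sum").as("foo_offset"),
            col("bar_sum").as("bar_offset"));

    // Running sums over the new batch only, same window as before.
    final WindowSpec w =
        Window.partitionBy(col("thing"))
            .orderBy(col("date"))
            .rowsBetween(Window.unboundedPreceding(), Window.currentRow());

    Dataset<Row> batchSums = newRows
        .withColumn("foo_sum", sum("foo").over(w))
        .withColumn("bar_sum", sum("bar").over(w));

    // Shift the batch sums by the carried-over totals; things seen for
    // the first time have no offset row, so coalesce to zero.
    return batchSums
        .join(offsets, col("thing").equalTo(col("offset_thing")), "left_outer")
        .withColumn("foo_sum",
            col("foo_sum").plus(coalesce(col("foo_offset"), lit(0L))))
        .withColumn("bar_sum",
            col("bar_sum").plus(coalesce(col("bar_offset"), lit(0L))))
        .drop("offset_thing", "foo_offset", "bar_offset");
}

Things that appear for the first time in the new batch fall through the left 
join and get an offset of zero, which I believe is the behaviour I want.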
I would appreciate any pointers anyone could share on how to approach this 
sort of problem.

Kind regards,

--
Greg