Do you have to use SQL/window functions for this? If I understand
correctly, you could just keep track of the last record for each "thing",
then calculate the new sum by adding the current value of "thing" to the
sum in that last record whenever a new record is generated. Your problem
looks much easier if you think outside the SQL box, or maybe outside the
Spark box...
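The bookkeeping suggested above can be sketched outside Spark entirely. This is a minimal illustration (the class and method names are hypothetical, not from the original thread): keep the last running sums per "thing" in a map, and extend them as each new record arrives. Using the example data below, thing 101 has sums {1, 3} after 2020-01-01 and {2, 5} after 2020-01-02.

```java
import java.util.HashMap;
import java.util.Map;

public class RunningSums {
    // last known {foo_sum, bar_sum} for each "thing"
    private final Map<Integer, long[]> lastSums = new HashMap<>();

    /** Records a new observation and returns its {foo_sum, bar_sum}. */
    public long[] observe(int thing, long foo, long bar) {
        // an unseen "thing" starts from {0, 0}
        long[] prev = lastSums.getOrDefault(thing, new long[]{0L, 0L});
        long[] next = new long[]{prev[0] + foo, prev[1] + bar};
        lastSums.put(thing, next);
        return next;
    }

    public static void main(String[] args) {
        RunningSums s = new RunningSums();
        s.observe(101, 1, 3);            // 2020-01-01: sums become {1, 3}
        long[] r = s.observe(101, 1, 2); // 2020-01-02: sums become {2, 5}
        System.out.println(r[0] + "," + r[1]); // prints "2,5"
    }
}
```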
On 10/11/22 8:33 PM, Greg Kopff wrote:
I'm new to Spark and would like to seek some advice on how to approach
a problem.
I have a large dataset that has dated observations. There are also
columns that are running sums of some of the other columns.
 date       | thing |   foo   |   bar   |  foo_sum  |  bar_sum  |
============+=======+=========+=========+===========+===========+
 2020-01-01 |  101  |    1    |    3    |     1     |     3     |
 2020-01-01 |  202  |    0    |    2    |     0     |     2     |
 2020-01-01 |  303  |    1    |    1    |     1     |     1     |
------------+-------+---------+---------+-----------+-----------+
 2020-01-02 |  101  |    1    |    2    |     2     |     5     |
 2020-01-02 |  202  |    0    |    0    |     0     |     2     |
 2020-01-02 |  303  |    4    |    1    |     5     |     2     |
 2020-01-02 |  404  |    2    |    2    |     2     |     2     |
------------+-------+---------+---------+-----------+-----------+
Currently I generate the running sums using a WindowSpec (note that
Dataset.withColumn takes the column name as a String, not a Column):

final WindowSpec w = Window.partitionBy(col("thing"))
    .orderBy(col("date"), col("thing"))
    .rowsBetween(Window.unboundedPreceding(), Window.currentRow());

return df
    .withColumn("foo_sum", sum("foo").over(w))
    .withColumn("bar_sum", sum("bar").over(w));
Once these extra sum columns are computed, they are written back to
storage.
Periodically this dataset is appended to with new observations. These
new observations are all chronologically later than any of the
previous observations.
I need to "continue" the previous running sums for the new
observations -- but I want to avoid having to recompute the running
sums completely from scratch.
 date       | thing |   foo   |   bar   |  foo_sum  |  bar_sum  |
============+=======+=========+=========+===========+===========+
 2020-01-01 |  101  |    1    |    3    |     1     |     3     |
 2020-01-01 |  202  |    0    |    2    |     0     |     2     |
 2020-01-01 |  303  |    1    |    1    |     1     |     1     |
------------+-------+---------+---------+-----------+-----------+
 2020-01-02 |  101  |    1    |    2    |     2     |     5     |
 2020-01-02 |  202  |    0    |    0    |     0     |     2     |
 2020-01-02 |  303  |    4    |    1    |     5     |     2     |
 2020-01-02 |  404  |    2    |    2    |     2     |     2     |
------------+-------+---------+---------+-----------+-----------+
 new data:
 2020-01-03 |  101  |    2    |    2    |     .     |     .     |
 2020-01-03 |  303  |    1    |    1    |     .     |     .     |
 2020-01-02 |  404  |    2    |    1    |     .     |     .     |
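One way to continue the sums without a full recompute, assuming the last stored row per "thing" can be read back cheaply, is to treat the stored sums as per-thing offsets: compute running sums within the new batch only, then add the offset (a Spark version would do the same with the existing window over just the new batch, plus a join on "thing"). A plain-Java sketch of that arithmetic, with hypothetical names, shown here for the foo column only:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class ContinueSums {
    /**
     * newFoo maps each "thing" in the new batch to its foo values in date order;
     * lastFooSum holds the final stored foo_sum per thing.
     * Returns the continued foo_sum sequence per thing.
     */
    public static Map<Integer, long[]> continueSums(Map<Integer, long[]> newFoo,
                                                    Map<Integer, Long> lastFooSum) {
        Map<Integer, long[]> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, long[]> e : newFoo.entrySet()) {
            // a "thing" never seen before starts from an offset of 0
            long run = lastFooSum.getOrDefault(e.getKey(), 0L);
            long[] vals = e.getValue();
            long[] sums = new long[vals.length];
            for (int i = 0; i < vals.length; i++) {
                run += vals[i];   // running sum within the new batch, plus the offset
                sums[i] = run;
            }
            result.put(e.getKey(), sums);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Long> last = new HashMap<>();
        last.put(101, 2L);                 // stored foo_sum for thing 101 as of 2020-01-02
        Map<Integer, long[]> batch = new HashMap<>();
        batch.put(101, new long[]{2});     // 2020-01-03 observation: foo = 2
        long[] out = continueSums(batch, last).get(101);
        System.out.println(out[0]);        // prints 4: previous sum 2 plus new foo 2
    }
}
```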
I would appreciate any pointers on how to approach this sort of problem.
Kind regards,
—
Greg