Hi All,
We have a Static DataFrame as follows:
+---+----------+
| id|time_stamp|
+---+----------+
|  1|1540527851|
|  2|1540525602|
|  3|1530529187|
|  4|1520529185|
|  5|1510529182|
|  6|1578945709|
+---+----------+

We also have a live stream of events, a Streaming DataFrame, which contains the
id and an updated time_stamp. The first batch of events is as follows:
+---+----------+
| id|time_stamp|
+---+----------+
|  1|1540527888|
|  2|1540525999|
|  3|1530529784|
+---+----------+
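
For anyone who wants to try this locally, here is a minimal sketch that rebuilds
the two tables above as plain DataFrames. The real streaming_df of course comes
from readStream; first_batch_df below is only a hypothetical stand-in for one
micro-batch, so the merge logic can be experimented with outside a stream.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Static reference data, exactly as in the first table above
static_df = spark.createDataFrame(
    [(1, 1540527851), (2, 1540525602), (3, 1530529187),
     (4, 1520529185), (5, 1510529182), (6, 1578945709)],
    ["id", "time_stamp"])

# The first micro-batch, as in the second table above; built as a plain
# DataFrame here only to try the merge without a running stream
first_batch_df = spark.createDataFrame(
    [(1, 1540527888), (2, 1540525999), (3, 1530529784)],
    ["id", "time_stamp"])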
Now, after every batch, we want to update the Static DataFrame with the values
from the Streaming DataFrame, so that the Static DataFrame after the first
batch looks like this (a sketch of this merge on plain DataFrames follows the
table):
+---+----------+
| id|time_stamp|
+---+----------+
|  1|1540527888|
|  2|1540525999|
|  3|1530529784|
|  4|1520529185|
|  5|1510529182|
|  6|1578945709|
+---+----------+
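
To make the intended merge concrete, here it is on the plain DataFrames from
the sketch above (first_batch_df being the hypothetical stand-in for one
micro-batch): every row from the batch is kept, and a static row survives only
if its id does not appear in the batch.

# Batch values win; static rows survive only when their id has no match
# in the batch (that is what a left_anti join keeps)
updated_df = first_batch_df.union(
    static_df.join(first_batch_df, on="id", how="left_anti"))

updated_df.orderBy("id").show()
# Produces exactly the "Static DF after first batch" table above.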
Below is a snippet of our PySpark code (Spark 2.4.4):

from pyspark.sql import DataFrame
from pyspark.sql.streaming import StreamingQuery

def update_static_df(batch_df, static_df):
    # Keep every row from the micro-batch and only those static rows whose
    # id does not appear in the batch (left_anti), so the batch values win
    static_df: DataFrame = batch_df.union(
        static_df.join(batch_df, batch_df.id == static_df.id, "left_anti"))
    return static_df

def update_reference_df(df, static_df):
    query: StreamingQuery = df \
        .writeStream \
        .outputMode("append") \
        .foreachBatch(lambda batch_df, batchId: update_static_df(batch_df, static_df)) \
        .start()
    return query

static_df = spark.read.schema(schemaName).json(fileName)
streaming_df = spark.readStream(....)
new_reference_data = update_reference_df(streaming_df, static_df)

Question:
I want to know how static_df will get refreshed with the new values from the
data processed via foreachBatch. As far as I know, foreachBatch returns nothing
(void). I need to use the refreshed values from static_df in further
processing. I would appreciate your help.
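
Part of my doubt is plain Python scoping, which the toy sketch below
illustrates with an ordinary list instead of a DataFrame: update_static_df only
rebinds its local static_df name, and foreachBatch discards the function's
return value, so I do not see how the caller's static_df could ever change.

def update(ref):
    ref = ref + ["new value"]   # rebinds the local name only
    return ref

reference = ["old value"]
update(reference)               # return value is ignored, as with foreachBatch
print(reference)                # still ['old value'] - no refresh visible outside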

Thanks in advance !

Debu
