Hi Mich,
Thank you so much for your response. I really appreciate your help!
You mentioned "defining the watermark using the withWatermark function on the
streaming_df before creating the temporary view" - I believe this is what I’m
doing and it’s not working for me. Here is the exact code snippet that I’m
running:
```
>>> streaming_df = session.readStream\
        .format("kafka")\
        .option("kafka.bootstrap.servers", "localhost:9092")\
        .option("subscribe", "payment_msg")\
        .option("startingOffsets", "earliest")\
        .load()\
        .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))\
        .select("parsed_value.*")\
        .withWatermark("createTime", "10 seconds")
>>> streaming_df.createOrReplaceTempView("streaming_df")
>>> spark.sql("""
SELECT
window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
FROM streaming_df
GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
ORDER BY window.start
""")\
.withWatermark("start", "10 seconds")\
.writeStream\
.format("kafka") \
.option("checkpointLocation", "checkpoint") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "sink") \
.start()
AnalysisException: Append output mode not supported when there are streaming
aggregations on streaming DataFrames/DataSets without watermark;
EventTimeWatermark start#37: timestamp, 10 seconds
```
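For reference, the function-API version of the same aggregation does work for me. A rough sketch of what I mean (no ORDER BY, reusing the streaming_df with the watermark defined above; the variable names and column aliases here are just for illustration):
```
from pyspark.sql.functions import col, window, sum as sum_

# Same windowed aggregation expressed with the DataFrame API instead of spark.sql()
windowed = (
    streaming_df
    .groupBy("provinceId", window(col("createTime"), "1 hour", "30 minutes"))
    .agg(sum_("payAmount").alias("totalPayAmount"))
    .select(
        col("window.start").alias("windowStart"),
        col("window.end").alias("windowEnd"),
        col("provinceId"),
        col("totalPayAmount"),
    )
)

# The Kafka sink needs a string "value" column, so serialize each row as JSON
query = (
    windowed
    .selectExpr("to_json(struct(windowStart, windowEnd, provinceId, totalPayAmount)) AS value")
    .writeStream
    .format("kafka")
    .option("checkpointLocation", "checkpoint")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "sink")
    .start()
)
```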
I’m using pyspark 3.5.1. Please let me know if I missed something. Thanks again!
Best,
Chloe
On 2024/04/02 20:32:11 Mich Talebzadeh wrote:
> OK, let us take it for a test.
>
> The original code of mine:
>
> def fetch_data(self):
>     self.sc.setLogLevel("ERROR")
>     schema = StructType() \
>         .add("rowkey", StringType()) \
>         .add("timestamp", TimestampType()) \
>         .add("temperature", IntegerType())
>     checkpoint_path = "file:///ssd/hduser/avgtemperature/chkpt"
>     try:
>
>         # construct a streaming dataframe 'streamingDataFrame' that subscribes to topic temperature
>         streamingDataFrame = self.spark \
>             .readStream \
>             .format("kafka") \
>             .option("kafka.bootstrap.servers", config['MDVariables']['bootstrapServers'],) \
>             .option("schema.registry.url", config['MDVariables']['schemaRegistryURL']) \
>             .option("group.id", config['common']['appName']) \
>             .option("zookeeper.connection.timeout.ms", config['MDVariables']['zookeeperConnectionTimeoutMs']) \
>             .option("rebalance.backoff.ms", config['MDVariables']['rebalanceBackoffMS']) \
>             .option("zookeeper.session.timeout.ms", config['MDVariables']['zookeeperSessionTimeOutMs']) \
>             .option("auto.commit.interval.ms", config['MDVariables']['autoCommitIntervalMS']) \
>             .option("subscribe", "temperature") \
>             .option("failOnDataLoss", "false") \
>             .option("includeHeaders", "true") \
>             .option("startingOffsets", "earliest") \
>             .load() \
>             .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))
>
>         resultC = streamingDataFrame.select( \
>               col("parsed_value.rowkey").alias("rowkey") \
>             , col("parsed_value.timestamp").alias("timestamp") \
>             , col("parsed_value.temperature").alias("temperature"))
>
>         """
>         We work out the window and the AVG(temperature) in the window's timeframe below.
>         This should return the following DataFrame as a struct:
>
>         root
>          |-- window: struct (nullable = false)
>          |    |-- start: timestamp (nullable = true)
>          |    |-- end: timestamp (nullable = true)
>          |-- avg(temperature): double (nullable = true)
>
>         """
>         resultM = resultC. \
>             withWatermark("timestamp", "5 minutes"). \
>             groupBy(window(resultC.timestamp, "5 minutes", "5 minutes")). \
>             avg('temperature')
>
>         # We take the above DataFrame and flatten it to get the columns aliased as
>         # "startOfWindow", "endOfWindow" and "AVGTemperature"
>         resultMF = resultM. \
>             select( \
>                   F.col("window.start").alias("startOfWindow") \
>                 , F.col("window.end").alias("endOfWindow") \
>                 , F.col("avg(temperature)").alias("AVGTemperature"))
>
>         # Kafka producer requires a key, value pair. We generate a UUID key as the
>         # unique identifier of the Kafka record
>         uuidUdf = F.udf(lambda: str(uuid.uuid4()), StringType())
>
>         """
>         We take DataFrame resultMF containing temperature info and write it to Kafka.
>         The uuid is serialized as a string and used as the key. We take all the columns
>         of the DataFrame and serialize them as a JSON string, putting the results in
>         the "value" of the record.
>         """
>         result = resultMF.withColumn("uuid", uuidUdf()) \
>             .selectExpr("CAST(uuid AS STRING) AS key",
>                         "to_json(struct(startOfWindow, endOfWindow, AVGTemperature)) AS value") \
>             .writeStream \
>             .outputMode('complete') \
>             .format("kafka") \
>             .option("kafka.bootstrap.servers", config['MDVariables']['bootstrapServers'],) \
>             .option("topic", "avgtemperature") \
>             .option('checkpointLocation', checkpoint_path) \
>             .queryName("avgtemperature") \
>             .start()
>
>     except Exception as e:
>         print(f"""{e}, quitting""")
>         sys.exit(1)
>
>     #print(result.status)
>     #print(result.recentProgress)
>     #print(result.lastProgress)
>
>     result.awaitTermination()
>
> Now try to use SQL for the entire transformation and aggregation:
>
> # import this and anything else needed
> from pyspark.sql.functions import from_json, col, window
> from pyspark.sql.types import StructType, StringType, IntegerType, FloatType, TimestampType
>
>
> # Define the schema for the JSON data
> schema = ... # Replace with your schema definition
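>
> # For example, for the payment messages used below it could be defined like
> # this (the field types here are my assumption, adjust to your data):
> schema = StructType() \
>     .add("createTime", TimestampType()) \
>     .add("provinceId", IntegerType()) \
>     .add("payAmount", FloatType())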
>
> # construct a streaming dataframe 'streaming_df' that subscribes to topic temperature
> streaming_df = spark \
>     .readStream \
>     .format("kafka") \
>     .option("kafka.bootstrap.servers", config['MDVariables']['bootstrapServers'],) \
>     .option("schema.registry.url", config['MDVariables']['schemaRegistryURL']) \
>     .option("group.id", config['common']['appName']) \
>     .option("zookeeper.connection.timeout.ms", config['MDVariables']['zookeeperConnectionTimeoutMs']) \
>     .option("rebalance.backoff.ms", config['MDVariables']['rebalanceBackoffMS']) \
>     .option("zookeeper.session.timeout.ms", config['MDVariables']['zookeeperSessionTimeOutMs']) \
>     .option("auto.commit.interval.ms", config['MDVariables']['autoCommitIntervalMS']) \
>     .option("subscribe", "temperature") \
>     .option("failOnDataLoss", "false") \
>     .option("includeHeaders", "true") \
>     .option("startingOffsets", "earliest") \
>     .load() \
>     .select(from_json(col("value").cast("string"), schema).alias("parsed_value")) \
>     .select("parsed_value.*") \
>     .withWatermark("createTime", "10 seconds")  # Define the watermark here
>
> # Create a temporary view from the streaming DataFrame with watermark
> streaming_df.createOrReplaceTempView("michboy")
>
> # Execute SQL queries on the temporary view
> result_df = (spark.sql("""
>     SELECT
>         window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
>     FROM michboy
>     GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
>     ORDER BY window.start
>     """)
>     .writeStream
>     .format("kafka")
>     .option("checkpointLocation", "checkpoint")
>     .option("kafka.bootstrap.servers", "localhost:9092")
>     .option("topic", "sink")
>     .start())
>
> Note that the watermark is defined using the withWatermark function on
> streaming_df before the temporary view (michboy 😀) is created. This way
> the watermark information is propagated to the temporary view, so Spark
> recognizes it and allows streaming aggregations and window operations in
> your SQL queries.
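>
> As a quick sanity check (just a sketch, assuming the streaming_df and
> michboy view above), you can print the analysed plan of a query against
> the view and confirm that an EventTimeWatermark node on createTime is
> still present:
>
> # hypothetical check: the watermark should appear in the analysed plan
> spark.sql("SELECT * FROM michboy").explain(extended=True)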
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer | Generative AI
> London
> United Kingdom
>
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand expert
> opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Tue, 2 Apr 2024 at 20:24, Chloe He <[email protected]> wrote:
>
> > Hello!
> >
> > I am attempting to write a streaming pipeline that would consume data from
> > a Kafka source, manipulate the data, and then write results to a downstream
> > sink (Kafka, Redis, etc). I want to write fully formed SQL instead of using
> > the function API that Spark offers. I read a few guides on how to do this
> > and my understanding is that I need to create a temp view in order to
> > execute my raw SQL queries via spark.sql().
> >
> > However, I’m having trouble defining watermarks on my source. It doesn’t
> > seem like there is a way to introduce watermark in the raw SQL that Spark
> > supports, so I’m using the .withWatermark() function. However, this
> > watermark does not work on the temp view.
> >
> > Example code:
> > ```
> > streaming_df.select(from_json(col("value").cast("string"), schema).alias("parsed_value"))\
> >     .select("parsed_value.*")\
> >     .withWatermark("createTime", "10 seconds")
> >
> > json_df.createOrReplaceTempView("json_df")
> >
> > session.sql("""
> > SELECT
> > window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
> > FROM json_df
> > GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
> > ORDER BY window.start
> > """)\
> > .writeStream\
> > .format("kafka") \
> > .option("checkpointLocation", "checkpoint") \
> > .option("kafka.bootstrap.servers", "localhost:9092") \
> > .option("topic", "sink") \
> > .start()
> > ```
> > This throws
> > ```
> > AnalysisException: Append output mode not supported when there are
> > streaming aggregations on streaming DataFrames/DataSets without watermark;
> > ```
> >
> > If I switch out the SQL query and write it in the function API instead,
> > everything seems to work fine.
> >
> > How can I use .sql() in conjunction with watermarks?
> >
> > Best,
> > Chloe
> >
>