Hi Experts,

I have a Structured Streaming query running on Spark 2.3 over a YARN
cluster, with the following setup (a rough code sketch follows the list):

   - Reading JSON messages from a Kafka topic, with:
      - maxOffsetsPerTrigger set to 5000
      - a trigger interval of 500ms on the writeStream task
      - the streaming dataset registered as "events", with fields: id,
      name, refid, createdTime
   - A cached dataset built from a CSV file read from HDFS, where the CSV
   file contains the refid values of prohibited events (registered as
   "CSVData")
   - An intermediate dataset defined with the following query, which
   filters the prohibited events out of the streaming data:
      - select * from events where events.refid NOT IN (select refid from
      CSVData)
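
For context, here is a rough sketch of how the job is wired up (the
broker, topic, HDFS path and the console sink are placeholders, not my
real values):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val spark = SparkSession.builder().appName("ProhibitedEventFilter").getOrCreate()

// Schema of the JSON events coming from Kafka
val eventSchema = new StructType()
  .add("id", StringType)
  .add("name", StringType)
  .add("refid", StringType)
  .add("createdTime", TimestampType)

// Streaming source: JSON messages from Kafka
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder
  .option("subscribe", "events-topic")                // placeholder
  .option("maxOffsetsPerTrigger", "5000")
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("e"))
  .select("e.*")
events.createOrReplaceTempView("events")

// Static, cached list of prohibited refids read from HDFS
val csvData = spark.read
  .option("header", "true")
  .csv("hdfs:///data/prohibited_refids.csv")          // placeholder path
  .cache()
csvData.createOrReplaceTempView("CSVData")

// Intermediate dataset: drop events whose refid is prohibited
val allowed = spark.sql(
  "select * from events where events.refid NOT IN (select refid from CSVData)")

// Sink: console here for the sketch; the real job writes to a database
val query = allowed.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("500 milliseconds"))
  .start()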


The query progress reported by the StreamingQuery object shows
numInputRows, inputRowsPerSecond and processedRowsPerSecond as 0, even
though each micro-batch executes in ~400ms, and I can see that the query
does read records from Kafka and writes the processed data to the output
database.
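
This is roughly how I am reading the metrics (the query value is the
handle returned by writeStream.start() in the sketch above):

// Most recent progress of the running StreamingQuery; may be null
// right after start()
val progress = query.lastProgress
if (progress != null) {
  println(s"numInputRows           = ${progress.numInputRows}")           // reported as 0
  println(s"inputRowsPerSecond     = ${progress.inputRowsPerSecond}")     // reported as 0.0
  println(s"processedRowsPerSecond = ${progress.processedRowsPerSecond}") // reported as 0.0
  println(s"durationMs             = ${progress.durationMs}")             // shows the ~400ms batch
}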

If I remove the event-filtering step, all of these metrics are reported
correctly.

Can anyone please point out why this behaviour occurs, and how I can
gather metrics like numInputRows while still filtering against the refids
read from the CSV file?
I am also open to suggestions if there is a better way of filtering out
the prohibited events in Structured Streaming.
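
For example, I assume the same filter could also be expressed as a left
anti join against the cached CSV dataset (sketch below, reusing the
events and csvData values from the first sketch); would that behave any
differently with respect to the progress metrics?

// Same filter expressed as a left anti join instead of NOT IN
val allowedViaJoin = events.join(csvData, Seq("refid"), "left_anti")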

Thanks in advance.

Akshay Bhardwaj
+91-97111-33849
