If it helps, below is the query progress report that I fetch from the StreamingQuery object.
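For reference, this is roughly how I fetch it (a minimal sketch; the helper name printProgress is just illustrative):

    import org.apache.spark.sql.streaming.StreamingQuery

    // `query` is the handle returned by writeStream.start()
    def printProgress(query: StreamingQuery): Unit = {
      // most recent StreamingQueryProgress; null before the first trigger completes
      val progress = query.lastProgress
      if (progress != null) println(progress.prettyJson)
    }

One such report: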
{ "id" : "f2cb24d4-622e-4355-b315-8e440f01a90c", "runId" : "6f3834ff-10a9-4f57-ae71-8a434ee519ce", "name" : "query_name_1", "timestamp" : "2019-02-27T06:06:58.500Z", "batchId" : 3725, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0, "durationMs" : { "addBatch" : 275, "getBatch" : 3, "getOffset" : 8, "queryPlanning" : 79, "triggerExecution" : 409, "walCommit" : 43 }, "stateOperators" : [ ], "sources" : [ { "description" : "KafkaSource[SubscribePattern[kafka_events_topic]]", "startOffset" : { "kafka_events_topic" : { "2" : 32822078, "1" : 114248484, "0" : 114242134 } }, "endOffset" : { "kafka_events_topic" : { "2" : 32822496, "1" : 114248896, "0" : 114242552 } }, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0 } ], "sink" : { "description" : "ForeachSink" } } Akshay Bhardwaj +91-97111-33849 On Wed, Feb 27, 2019 at 11:36 AM Akshay Bhardwaj < akshay.bhardwaj1...@gmail.com> wrote: > Hi Experts, > > I have a structured streaming query running on spark 2.3 over yarn > cluster, with below features: > > - Reading JSON messages from Kafka topic with: > - maxOffsetsPerTrigger as 5000 > - trigger interval of my writeStream task is 500ms. > - streaming dataset is defined as events with fields: id, name, > refid, createdTime > - A cached dataset of CSV file read from HDFS, such that, the CSV file > contains a list of prohibited events refid > - I have defined an intermediate dataset with the following query, > which filters out prohibited events from the streaming data > - select * from events where event.refid NOT IN (select refid from > CSVData) > > > The query progress from StreamingQuery object, it shows metrics as > numInputRows, inputRowsPerSecond and processedRowsPerSecond as 0, although > my query is executing with an execution time of ~400ms. And I can see that > the query does take records from kafka and writes the processed data to the > output database. > > If I remove the event filtering tasks, then all the metrics are displayed > properly. > > Can anyone please point out why this behaviour is observed and how to > gather metrics like numInputRows, etc while also filtering events fetched > from CSV file? > I am also open to suggestions if there is a better way of filtering out > the prohibited events in structured streaming. > > Thanks in advance. > > Akshay Bhardwaj > +91-97111-33849 >