If it helps, below is the same query progress report that I am able to
fetch from the streaming query (the snippet after the report shows how
it is fetched):
{
  "id" : "f2cb24d4-622e-4355-b315-8e440f01a90c",
  "runId" : "6f3834ff-10a9-4f57-ae71-8a434ee519ce",
  "name" : "query_name_1",
  "timestamp" : "2019-02-27T06:06:58.500Z",
  "batchId" : 3725,
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "addBatch" : 275,
    "getBatch" : 3,
    "getOffset" : 8,
    "queryPlanning" : 79,
    "triggerExecution" : 409,
    "walCommit" : 43
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[SubscribePattern[kafka_events_topic]]",
    "startOffset" : {
      "kafka_events_topic" : {
        "2" : 32822078,
        "1" : 114248484,
        "0" : 114242134
      }
    },
    "endOffset" : {
      "kafka_events_topic" : {
        "2" : 32822496,
        "1" : 114248896,
        "0" : 114242552
      }
    },
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "ForeachSink"
  }
}
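
For reference, the report above is fetched roughly like this (a minimal
sketch; "query" stands for the handle returned by writeStream.start() for
query_name_1):

    import org.apache.spark.sql.streaming.StreamingQuery

    // lastProgress holds the most recent progress report of a running
    // streaming query; it is null until the first trigger completes.
    def printLastProgress(query: StreamingQuery): Unit =
      Option(query.lastProgress).foreach(p => println(p.prettyJson))
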
Akshay Bhardwaj
+91-97111-33849
On Wed, Feb 27, 2019 at 11:36 AM Akshay Bhardwaj <
akshay.bhardwaj1...@gmail.com> wrote:
> Hi Experts,
>
> I have a Structured Streaming query running on Spark 2.3 on a YARN
> cluster, with the following features:
>
> - Reading JSON messages from a Kafka topic, with:
>    - maxOffsetsPerTrigger set to 5000
>    - a trigger interval of 500ms on my writeStream task
>    - the streaming dataset defined as "events", with fields: id, name,
>      refid, createdTime
> - A cached dataset read from a CSV file on HDFS; the CSV file contains
>   the list of prohibited event refids
> - An intermediate dataset, defined with the following query, which
>   filters the prohibited events out of the streaming data (see the
>   sketch after this list):
>    - select * from events where events.refid NOT IN (select refid from
>      CSVData)
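>
> In code, the setup looks roughly like this (a simplified sketch; the
> broker address, CSV path and database writer are placeholders, not the
> exact production values):
>
>     import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
>     import org.apache.spark.sql.functions.from_json
>     import org.apache.spark.sql.streaming.Trigger
>     import org.apache.spark.sql.types._
>
>     val spark = SparkSession.builder.appName("query_name_1").getOrCreate()
>     import spark.implicits._
>
>     val eventSchema = new StructType()
>       .add("id", StringType).add("name", StringType)
>       .add("refid", StringType).add("createdTime", TimestampType)
>
>     // Streaming source: JSON events from Kafka, at most 5000 offsets
>     // per trigger.
>     val events = spark.readStream
>       .format("kafka")
>       .option("kafka.bootstrap.servers", "broker:9092") // placeholder
>       .option("subscribePattern", "kafka_events_topic")
>       .option("maxOffsetsPerTrigger", "5000")
>       .load()
>       .select(from_json($"value".cast("string"), eventSchema).as("e"))
>       .select("e.*")
>     events.createOrReplaceTempView("events")
>
>     // Static, cached list of prohibited refids from a CSV on HDFS.
>     val csvData = spark.read.option("header", "true")
>       .csv("hdfs:///path/to/prohibited.csv") // placeholder path
>       .cache()
>     csvData.createOrReplaceTempView("CSVData")
>
>     // Intermediate dataset: filter out the prohibited events.
>     val filtered = spark.sql(
>       "select * from events where events.refid NOT IN " +
>         "(select refid from CSVData)")
>
>     // Trivial stand-in for the real database writer.
>     val dbWriter = new ForeachWriter[Row] {
>       def open(partitionId: Long, version: Long): Boolean = true
>       def process(row: Row): Unit = () // real writer inserts into the DB
>       def close(errorOrNull: Throwable): Unit = ()
>     }
>
>     // Sink: foreach writer to the output database, 500ms trigger.
>     val query = filtered.writeStream
>       .queryName("query_name_1")
>       .trigger(Trigger.ProcessingTime("500 milliseconds"))
>       .foreach(dbWriter)
>       .start()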
>
>
> When I fetch the query progress from the StreamingQuery object, it
> reports the metrics numInputRows, inputRowsPerSecond and
> processedRowsPerSecond as 0, although the query is executing with an
> execution time of ~400ms, and I can see that it does consume records
> from Kafka and write the processed data to the output database.
>
> If I remove the event-filtering step, then all the metrics are
> reported properly.
>
> Can anyone please point out why this behaviour occurs, and how I can
> gather metrics like numInputRows while still filtering out the events
> listed in the CSV file?
> I am also open to suggestions if there is a better way of filtering
> out the prohibited events in Structured Streaming; for example, would
> the anti-join sketched below be preferable?
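>
> (A sketch of what I mean, assuming the events and csvData datasets
> described above; I have not verified whether this reports the metrics
> correctly:)
>
>     // Express the same filter as a left anti join against the static
>     // CSV dataset: keep only events whose refid has no match in CSVData.
>     val filteredEvents = events.join(
>       csvData, events("refid") === csvData("refid"), "left_anti")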
>
> Thanks in advance.
>
> Akshay Bhardwaj
> +91-97111-33849
>