If it helps, below is the query progress report that I am able to fetch
from the streaming query:

{
  "id" : "f2cb24d4-622e-4355-b315-8e440f01a90c",
  "runId" : "6f3834ff-10a9-4f57-ae71-8a434ee519ce",
  "name" : "query_name_1",
  "timestamp" : "2019-02-27T06:06:58.500Z",
  "batchId" : 3725,
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "addBatch" : 275,
    "getBatch" : 3,
    "getOffset" : 8,
    "queryPlanning" : 79,
    "triggerExecution" : 409,
    "walCommit" : 43
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[SubscribePattern[kafka_events_topic]]",
    "startOffset" : {
      "kafka_events_topic" : {
        "2" : 32822078,
        "1" : 114248484,
        "0" : 114242134
      }
    },
    "endOffset" : {
      "kafka_events_topic" : {
        "2" : 32822496,
        "1" : 114248896,
        "0" : 114242552
      }
    },
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "ForeachSink"
  }
}
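
For reference, this is how I am pulling the report above; a minimal sketch
in Scala, where query stands for the StreamingQuery handle returned by
writeStream.start():

// query is the running StreamingQuery handle
val progress = query.lastProgress  // latest StreamingQueryProgress; null until the first batch completes
if (progress != null) {
  println(progress.prettyJson)     // prints the JSON report shown above
}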



Akshay Bhardwaj
+91-97111-33849


On Wed, Feb 27, 2019 at 11:36 AM Akshay Bhardwaj <
akshay.bhardwaj1...@gmail.com> wrote:

> Hi Experts,
>
> I have a structured streaming query running on Spark 2.3 on a YARN
> cluster, with the following features:
>
>    - Reading JSON messages from a Kafka topic, with:
>       - maxOffsetsPerTrigger set to 5000
>       - a trigger interval of 500ms on my writeStream task
>       - the streaming dataset defined as events with fields: id, name,
>       refid, createdTime
>    - A cached dataset read from a CSV file on HDFS; the CSV file
>    contains the refids of prohibited events
>    - An intermediate dataset defined with the following query, which
>    filters the prohibited events out of the streaming data (a condensed
>    sketch of the whole setup follows this list):
>       - select * from events where event.refid NOT IN (select refid from
>       CSVData)
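>
> A condensed sketch of this setup in Scala (kafkaBootstrap, the JSON
> schema, the HDFS path and myForeachWriter below are placeholders for
> illustration, not my exact values):
>
> import org.apache.spark.sql.functions.from_json
> import org.apache.spark.sql.streaming.Trigger
> import org.apache.spark.sql.types._
> import spark.implicits._
>
> // JSON events from Kafka, capped at 5000 offsets per trigger
> val schema = new StructType()
>   .add("id", StringType).add("name", StringType)
>   .add("refid", StringType).add("createdTime", TimestampType)
> val events = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", kafkaBootstrap)  // placeholder
>   .option("subscribePattern", "kafka_events_topic")
>   .option("maxOffsetsPerTrigger", 5000L)
>   .load()
>   .select(from_json($"value".cast("string"), schema).as("event"))
> events.createOrReplaceTempView("events")
>
> // Prohibited refids from HDFS, cached once
> val csvData = spark.read.option("header", "true")
>   .csv("hdfs:///path/to/prohibited_refids.csv").cache()  // placeholder path
> csvData.createOrReplaceTempView("CSVData")
>
> // Intermediate dataset that drops the prohibited events
> val filtered = spark.sql(
>   "select * from events where event.refid NOT IN (select refid from CSVData)")
>
> val query = filtered.writeStream
>   .trigger(Trigger.ProcessingTime("500 milliseconds"))
>   .foreach(myForeachWriter)  // placeholder ForeachWriter writing to the output DB
>   .start()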
>
>
> The query progress fetched from the StreamingQuery object shows the
> metrics numInputRows, inputRowsPerSecond and processedRowsPerSecond as 0,
> even though the query executes with a trigger execution time of ~400ms,
> and I can see that it does consume records from Kafka and write the
> processed data to the output database.
>
> If I remove the event-filtering step, then all the metrics are displayed
> properly.
>
> Can anyone please point out why this behaviour occurs, and how I can
> gather metrics like numInputRows while still filtering out the events
> listed in the CSV file?
> I am also open to suggestions if there is a better way of filtering out
> the prohibited events in structured streaming; one alternative I have
> been considering is sketched below.
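>
> (Roughly what I had in mind, reusing events and csvData from the sketch
> above, with broadcast as a hint since the prohibited list is small; note
> that NOT IN and a left anti join treat null refids differently:)
>
> import org.apache.spark.sql.functions.broadcast
> import spark.implicits._
>
> // Keep only events whose refid has no match in the prohibited list
> val filteredAlt = events.join(
>   broadcast(csvData), $"event.refid" === csvData("refid"), "left_anti")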
>
> Thanks in advance.
>
> Akshay Bhardwaj
> +91-97111-33849
>
