I can't find a good source of documentation, but the source code "org.apache.spark.sql.execution.streaming.ProgressReporter" is helpful for answering some of these questions.
For example: inputRowsPerSecond = numRecords / inputTimeSec and processedRowsPerSecond = numRecords / processingTimeSec, which explains why the two rows-per-second values differ.

> On Feb 10, 2018, at 8:42 PM, M Singh <mans2si...@yahoo.com.INVALID> wrote:
>
> Hi:
>
> I am working with Spark 2.2.0 and am looking at the query status console output.
>
> My application reads from Kafka, performs flatMapGroupsWithState, and then aggregates the elements for two group counts. The output is sent to a console sink. I see the following output (with my questions inline).
>
> Please let me know where I can find a detailed description of the query status fields for Spark Structured Streaming?
>
> StreamExecution: Streaming query made progress: {
>   "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",
>   "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",
>   "name" : null,
>   "timestamp" : "2018-02-11T01:18:00.005Z",
>   "numInputRows" : 5780,
>   "inputRowsPerSecond" : 96.32851690748795,
>   "processedRowsPerSecond" : 583.9563548191554,  // Why is processedRowsPerSecond greater than inputRowsPerSecond? Does this include shuffling/grouping?
>   "durationMs" : {
>     "addBatch" : 9765,         // Is this the time taken to send output to all console output streams?
>     "getBatch" : 3,            // Is this the time taken to get the batch from Kafka?
>     "getOffset" : 3,           // Is this the time for getting offsets from Kafka?
>     "queryPlanning" : 89,      // The value of this field changes with different triggers, but the query is not changing, so why does this change?
>     "triggerExecution" : 9898, // Is this the total time for this trigger?
>     "walCommit" : 35           // Is this for checkpointing?
>   },
>   "stateOperators" : [ {       // What are the two state operators? I am assuming one is flatMapGroupsWithState (the first one).
>     "numRowsTotal" : 8,
>     "numRowsUpdated" : 1
>   }, {
>     "numRowsTotal" : 6,        // Is this the group-by state operator? If so, I have two group-bys, so why do I see only one?
>     "numRowsUpdated" : 6
>   } ],
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[xyz]]",
>     "startOffset" : {
>       "xyz" : {
>         "2" : 9183,
>         "1" : 9184,
>         "3" : 9184,
>         "0" : 9183
>       }
>     },
>     "endOffset" : {
>       "xyz" : {
>         "2" : 10628,
>         "1" : 10629,
>         "3" : 10629,
>         "0" : 10628
>       }
>     },
>     "numInputRows" : 5780,
>     "inputRowsPerSecond" : 96.32851690748795,
>     "processedRowsPerSecond" : 583.9563548191554
>   } ],
>   "sink" : {
>     "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c"
>   }
> }
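
As a standalone sketch of the two rate formulas mentioned at the top of this reply, using the numbers from the quoted progress JSON (the object and helper names here are my own, not Spark's; the ~60 s input interval is inferred from the reported inputRowsPerSecond, not read from the JSON):

```scala
// Illustration of how the two rates differ: same row count, two different
// durations. Not Spark code; values taken from the progress output above.
object RateExample {
  // rows per second over a duration given in milliseconds
  def ratePerSecond(numRecords: Long, durationMs: Long): Double =
    numRecords * 1000.0 / durationMs

  def main(args: Array[String]): Unit = {
    val numRecords = 5780L
    // inputTimeMs: wall-clock time since the previous trigger (~60 s interval, assumed)
    val inputTimeMs = 60000L
    // processingTimeMs: how long this batch actually took ("triggerExecution" above)
    val processingTimeMs = 9898L

    println(f"inputRowsPerSecond     = ${ratePerSecond(numRecords, inputTimeMs)}%.2f")      // ~96.3
    println(f"processedRowsPerSecond = ${ratePerSecond(numRecords, processingTimeMs)}%.2f") // ~584
  }
}
```

Processing only takes ~10 s of each ~60 s interval, so the batch is processed roughly six times faster than rows arrive, which is why processedRowsPerSecond is the larger number.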