Thanks Richard.  I am hoping that the Spark team will, at some point, provide more 
detailed documentation.
 

    On Sunday, February 11, 2018 2:17 AM, Richard Qiao 
<richardqiao2...@gmail.com> wrote:
 

 I can't find a good source of documentation, but the source code 
“org.apache.spark.sql.execution.streaming.ProgressReporter” is helpful for 
answering some of them.
For example:  inputRowsPerSecond = numRecords / inputTimeSec, 
processedRowsPerSecond = numRecords / processingTimeSec
This explains why the two rows-per-second values differ.
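The two denominators are different. A minimal sketch of how I read that code 
(the variable names are mine, not the exact ones in the Spark source, and the 
numbers are taken from your log below):

    // inputTimeSec      = time since the previous trigger started (~ the trigger interval)
    // processingTimeSec = time spent executing this trigger (triggerExecution)
    object RateSketch {
      def main(args: Array[String]): Unit = {
        val numRecords        = 5780L    // numInputRows in your log
        val inputTimeSec      = 60.0     // ~60 s between trigger starts, implied by your numbers
        val processingTimeSec = 9.898    // triggerExecution = 9898 ms

        val inputRowsPerSecond     = numRecords / inputTimeSec      // ~96.3
        val processedRowsPerSecond = numRecords / processingTimeSec // ~584.0

        println(s"input: $inputRowsPerSecond rows/s, processed: $processedRowsPerSecond rows/s")
      }
    }

Because the batch was processed faster than it accumulated (about 9.9 s of work 
for about 60 s of input), processedRowsPerSecond comes out higher.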

On Feb 10, 2018, at 8:42 PM, M Singh <mans2si...@yahoo.com.INVALID> wrote:
Hi:
I am working with Spark 2.2.0 and am looking at the query status console 
output.

My application reads from Kafka, performs flatMapGroupsWithState, and then 
aggregates the elements for two group counts. The output is sent to the console 
sink. A simplified sketch of the query is below, followed by the progress output 
I see (with my questions inline as comments).
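Roughly, the query has this shape (a simplified sketch; the broker, columns, 
state type, and state logic are placeholders, not the real ones):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    val spark = SparkSession.builder.appName("progress-example").getOrCreate()
    import spark.implicits._

    // Kafka source
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "xyz")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")
      .as[String]

    // Stateful step (real state logic elided)
    val enriched = events
      .groupByKey(identity)
      .flatMapGroupsWithState[Long, String](OutputMode.Append, GroupStateTimeout.NoTimeout) {
        (key: String, values: Iterator[String], state: GroupState[Long]) =>
          state.update(state.getOption.getOrElse(0L) + values.size)
          Iterator.single(key)
      }

    // One of the two group counts (the second is analogous)
    val counts = enriched.groupBy($"value").count()

    val query = counts.writeStream
      .format("console")
      .outputMode("update")
      .start()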

Please let me know where I can find a detailed description of the query status 
fields for Spark Structured Streaming.


StreamExecution: Streaming query made progress: {
  "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",
  "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",
  "name" : null,
  "timestamp" : "2018-02-11T01:18:00.005Z",
  "numInputRows" : 5780,
  "inputRowsPerSecond" : 96.32851690748795,            
  "processedRowsPerSecond" : 583.9563548191554,   // Why is the number of 
processedRowsPerSecond greater than inputRowsPerSecond ? Does this include 
shuffling/grouping ?
  "durationMs" : {
    "addBatch" : 9765,                                                    // Is 
the time taken to get send output to all console output streams ? 
    "getBatch" : 3,                                                           
// Is this time taken to get the batch from Kafka ?
    "getOffset" : 3,                                                           
// Is this time for getting offset from Kafka ?
    "queryPlanning" : 89,                                                 // 
The value of this field changes with different triggers but the query is not 
changing so why does this change ?
    "triggerExecution" : 9898,                                         // Is 
this total time for this trigger ?
    "walCommit" : 35                                                     // Is 
this for checkpointing ?
  },
  "stateOperators" : [ {                                                   // 
What are the two state operators ? I am assuming one is flatMapWthState (first 
one).
    "numRowsTotal" : 8,
    "numRowsUpdated" : 1
  }, {
    "numRowsTotal" : 6,                                                //Is 
this the group by state operator ?  If so, I have two group by so why do I see 
only one ?
    "numRowsUpdated" : 6
  } ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[xyz]]",
    "startOffset" : {
      "xyz" : {
        "2" : 9183,
        "1" : 9184,
        "3" : 9184,
        "0" : 9183
      }
    },
    "endOffset" : {
      "xyz" : {
        "2" : 10628,
        "1" : 10629,
        "3" : 10629,
        "0" : 10628
      }
    },
    "numInputRows" : 5780,
    "inputRowsPerSecond" : 96.32851690748795,
    "processedRowsPerSecond" : 583.9563548191554
  } ],
  "sink" : {
    "description" : 
"org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c"
  }
}
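For reference, these appear to be the same fields that StreamingQueryProgress 
exposes programmatically; a minimal sketch of pulling a few of them at runtime 
(assuming `spark` is the session and `query` is the running query handle):

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

    // Listener that logs a few progress fields on every trigger
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = {}
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        println(s"batch=${p.batchId} rows=${p.numInputRows} " +
          s"processedRowsPerSecond=${p.processedRowsPerSecond} " +
          s"durationMs=${p.durationMs}")
      }
    })

    // Or, for the most recent progress of a specific query:
    println(query.lastProgress.prettyJson)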