I can't find a good source of documentation, but the source code "org.apache.spark.sql.execution.streaming.ProgressReporter" is helpful for answering some of these questions.
For example: inputRowsPerSecond = numRecords / inputTimeSec and processedRowsPerSecond = numRecords / processingTimeSec, which explains why the two rows-per-second values differ.

> On Feb 10, 2018, at 8:42 PM, M Singh <mans2si...@yahoo.com.INVALID> wrote:
>
> Hi:
>
> I am working with Spark 2.2.0 and am looking at the query status console output.
>
> My application reads from Kafka, performs flatMapGroupsWithState, and then aggregates the elements for two group counts. The output is sent to a console sink. I see the following output (with my questions inline).
>
> Please let me know where I can find a detailed description of the query status fields for Spark Structured Streaming?
>
> StreamExecution: Streaming query made progress: {
>   "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",
>   "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",
>   "name" : null,
>   "timestamp" : "2018-02-11T01:18:00.005Z",
>   "numInputRows" : 5780,
>   "inputRowsPerSecond" : 96.32851690748795,
>   "processedRowsPerSecond" : 583.9563548191554,  // Why is processedRowsPerSecond greater than inputRowsPerSecond? Does this include shuffling/grouping?
>   "durationMs" : {
>     "addBatch" : 9765,         // Is this the time taken to send output to all console output streams?
>     "getBatch" : 3,            // Is this the time taken to get the batch from Kafka?
>     "getOffset" : 3,           // Is this the time for getting offsets from Kafka?
>     "queryPlanning" : 89,      // The value of this field changes with different triggers, but the query is not changing, so why does this change?
>     "triggerExecution" : 9898, // Is this the total time for this trigger?
>     "walCommit" : 35           // Is this for checkpointing?
>   },
>   "stateOperators" : [ {       // What are the two state operators? I am assuming one is flatMapGroupsWithState (the first one).
>     "numRowsTotal" : 8,
>     "numRowsUpdated" : 1
>   }, {
>     "numRowsTotal" : 6,        // Is this the group-by state operator? If so, I have two group-bys, so why do I see only one?
>     "numRowsUpdated" : 6
>   } ],
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[xyz]]",
>     "startOffset" : {
>       "xyz" : {
>         "2" : 9183,
>         "1" : 9184,
>         "3" : 9184,
>         "0" : 9183
>       }
>     },
>     "endOffset" : {
>       "xyz" : {
>         "2" : 10628,
>         "1" : 10629,
>         "3" : 10629,
>         "0" : 10628
>       }
>     },
>     "numInputRows" : 5780,
>     "inputRowsPerSecond" : 96.32851690748795,
>     "processedRowsPerSecond" : 583.9563548191554
>   } ],
>   "sink" : {
>     "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c"
>   }
> }
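
As a standalone sketch of the two rate formulas mentioned at the top of this reply, using the numbers from the quoted progress JSON (the object and helper names here are my own, not Spark's; the ~60 s input interval is inferred from the reported inputRowsPerSecond, not read from the JSON):

```scala
// Illustration of how the two rates differ: same row count, two different
// durations. Not Spark code; values taken from the progress output above.
object RateExample {
  // rows per second over a duration given in milliseconds
  def ratePerSecond(numRecords: Long, durationMs: Long): Double =
    numRecords * 1000.0 / durationMs

  def main(args: Array[String]): Unit = {
    val numRecords = 5780L
    // inputTimeMs: wall-clock time since the previous trigger (~60 s interval, assumed)
    val inputTimeMs = 60000L
    // processingTimeMs: how long this batch actually took ("triggerExecution" above)
    val processingTimeMs = 9898L

    println(f"inputRowsPerSecond     = ${ratePerSecond(numRecords, inputTimeMs)}%.2f")      // ~96.3
    println(f"processedRowsPerSecond = ${ratePerSecond(numRecords, processingTimeMs)}%.2f") // ~584
  }
}
```

Processing only takes ~10 s of each ~60 s interval, so the batch is processed roughly six times faster than rows arrive, which is why processedRowsPerSecond is the larger number.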