Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread M Singh
Thanks Richard.  I am hoping that Spark team will at some time, provide more 
detailed documentation.
 

On Sunday, February 11, 2018 2:17 AM, Richard Qiao 
 wrote:
 

 Can find a good source for documents, but the source code 
“org.apache.spark.sql.execution.streaming.ProgressReporter” is helpful to 
answer some of them.
For example:  inputRowsPerSecond = numRecords / inputTimeSec,  
processedRowsPerSecond = numRecords / processingTimeSecThis is explaining why 
the 2 rowPerSec difference.

On Feb 10, 2018, at 8:42 PM, M Singh  wrote:
Hi:
I am working with spark 2.2.0 and am looking at the query status console 
output.  

My application reads from kafka - performs flatMapGroupsWithState and then 
aggregates the elements for two group counts.  The output is send to console 
sink.  I see the following output  (with my questions in bold). 

Please me know where I can find detailed description of the query status fields 
for spark structured streaming ?


StreamExecution: Streaming query made progress: {
  "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",
  "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",
  "name" : null,
  "timestamp" : "2018-02-11T01:18:00.005Z",
  "numInputRows" : 5780,
  "inputRowsPerSecond" : 96.32851690748795,    
  "processedRowsPerSecond" : 583.9563548191554,   // Why is the number of 
processedRowsPerSecond greater than inputRowsPerSecond ? Does this include 
shuffling/grouping ?
  "durationMs" : {
    "addBatch" : 9765,    // Is 
the time taken to get send output to all console output streams ? 
    "getBatch" : 3,   
// Is this time taken to get the batch from Kafka ?
    "getOffset" : 3,   
// Is this time for getting offset from Kafka ?
    "queryPlanning" : 89, // 
The value of this field changes with different triggers but the query is not 
changing so why does this change ?
    "triggerExecution" : 9898, // Is 
this total time for this trigger ?
    "walCommit" : 35 // Is 
this for checkpointing ?
  },
  "stateOperators" : [ {   // 
What are the two state operators ? I am assuming one is flatMapWthState (first 
one).
    "numRowsTotal" : 8,
    "numRowsUpdated" : 1
  }, {
    "numRowsTotal" : 6,    //Is 
this the group by state operator ?  If so, I have two group by so why do I see 
only one ?
    "numRowsUpdated" : 6
  } ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[xyz]]",
    "startOffset" : {
  "xyz" : {
    "2" : 9183,
    "1" : 9184,
    "3" : 9184,
    "0" : 9183
  }
    },
    "endOffset" : {
  "xyz" : {
    "2" : 10628,
    "1" : 10629,
    "3" : 10629,
    "0" : 10628
  }
    },
    "numInputRows" : 5780,
    "inputRowsPerSecond" : 96.32851690748795,
    "processedRowsPerSecond" : 583.9563548191554
  } ],
  "sink" : {
    "description" : 
"org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c"
  }
}






   

Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread Richard Qiao
Can find a good source for documents, but the source code 
“org.apache.spark.sql.execution.streaming.ProgressReporter” is helpful to 
answer some of them.

For example:
  inputRowsPerSecond = numRecords / inputTimeSec,
  processedRowsPerSecond = numRecords / processingTimeSec
This is explaining why the 2 rowPerSec difference.

> On Feb 10, 2018, at 8:42 PM, M Singh  wrote:
> 
> Hi:
> 
> I am working with spark 2.2.0 and am looking at the query status console 
> output.  
> 
> My application reads from kafka - performs flatMapGroupsWithState and then 
> aggregates the elements for two group counts.  The output is send to console 
> sink.  I see the following output  (with my questions in bold). 
> 
> Please me know where I can find detailed description of the query status 
> fields for spark structured streaming ?
> 
> 
> StreamExecution: Streaming query made progress: {
>   "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",
>   "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",
>   "name" : null,
>   "timestamp" : "2018-02-11T01:18:00.005Z",
>   "numInputRows" : 5780,
>   "inputRowsPerSecond" : 96.32851690748795,
>   "processedRowsPerSecond" : 583.9563548191554,   // Why is the number of 
> processedRowsPerSecond greater than inputRowsPerSecond ? Does this include 
> shuffling/grouping ?
>   "durationMs" : {
> "addBatch" : 9765,// 
> Is the time taken to get send output to all console output streams ? 
> "getBatch" : 3,   
> // Is this time taken to get the batch from Kafka ?
> "getOffset" : 3,  
>  // Is this time for getting offset from Kafka ?
> "queryPlanning" : 89, // 
> The value of this field changes with different triggers but the query is not 
> changing so why does this change ?
> "triggerExecution" : 9898, // Is 
> this total time for this trigger ?
> "walCommit" : 35 // 
> Is this for checkpointing ?
>   },
>   "stateOperators" : [ {   // 
> What are the two state operators ? I am assuming one is flatMapWthState 
> (first one).
> "numRowsTotal" : 8,
> "numRowsUpdated" : 1
>   }, {
> "numRowsTotal" : 6,//Is 
> this the group by state operator ?  If so, I have two group by so why do I 
> see only one ?
> "numRowsUpdated" : 6
>   } ],
>   "sources" : [ {
> "description" : "KafkaSource[Subscribe[xyz]]",
> "startOffset" : {
>   "xyz" : {
> "2" : 9183,
> "1" : 9184,
> "3" : 9184,
> "0" : 9183
>   }
> },
> "endOffset" : {
>   "xyz" : {
> "2" : 10628,
> "1" : 10629,
> "3" : 10629,
> "0" : 10628
>   }
> },
> "numInputRows" : 5780,
> "inputRowsPerSecond" : 96.32851690748795,
> "processedRowsPerSecond" : 583.9563548191554
>   } ],
>   "sink" : {
> "description" : 
> "org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c"
>   }
> }
> 
> 



Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-10 Thread M Singh
Hi:
I am working with spark 2.2.0 and am looking at the query status console 
output.  

My application reads from kafka - performs flatMapGroupsWithState and then 
aggregates the elements for two group counts.  The output is send to console 
sink.  I see the following output  (with my questions in bold). 

Please me know where I can find detailed description of the query status fields 
for spark structured streaming ?


StreamExecution: Streaming query made progress: {
  "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",
  "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",
  "name" : null,
  "timestamp" : "2018-02-11T01:18:00.005Z",
  "numInputRows" : 5780,
  "inputRowsPerSecond" : 96.32851690748795,    
  "processedRowsPerSecond" : 583.9563548191554,   // Why is the number of 
processedRowsPerSecond greater than inputRowsPerSecond ? Does this include 
shuffling/grouping ?
  "durationMs" : {
    "addBatch" : 9765,    // Is 
the time taken to get send output to all console output streams ? 
    "getBatch" : 3,   
// Is this time taken to get the batch from Kafka ?
    "getOffset" : 3,   
// Is this time for getting offset from Kafka ?
    "queryPlanning" : 89, // 
The value of this field changes with different triggers but the query is not 
changing so why does this change ?
    "triggerExecution" : 9898, // Is 
this total time for this trigger ?
    "walCommit" : 35 // Is 
this for checkpointing ?
  },
  "stateOperators" : [ {   // 
What are the two state operators ? I am assuming one is flatMapWthState (first 
one).
    "numRowsTotal" : 8,
    "numRowsUpdated" : 1
  }, {
    "numRowsTotal" : 6,    //Is 
this the group by state operator ?  If so, I have two group by so why do I see 
only one ?
    "numRowsUpdated" : 6
  } ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[xyz]]",
    "startOffset" : {
  "xyz" : {
    "2" : 9183,
    "1" : 9184,
    "3" : 9184,
    "0" : 9183
  }
    },
    "endOffset" : {
  "xyz" : {
    "2" : 10628,
    "1" : 10629,
    "3" : 10629,
    "0" : 10628
  }
    },
    "numInputRows" : 5780,
    "inputRowsPerSecond" : 96.32851690748795,
    "processedRowsPerSecond" : 583.9563548191554
  } ],
  "sink" : {
    "description" : 
"org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c"
  }
}