Re: Apache Spark - Structured Streaming Query Status - field descriptions
Thanks Richard. I am hoping that Spark team will at some time, provide more detailed documentation. On Sunday, February 11, 2018 2:17 AM, Richard Qiaowrote: Can find a good source for documents, but the source code “org.apache.spark.sql.execution.streaming.ProgressReporter” is helpful to answer some of them. For example: inputRowsPerSecond = numRecords / inputTimeSec, processedRowsPerSecond = numRecords / processingTimeSecThis is explaining why the 2 rowPerSec difference. On Feb 10, 2018, at 8:42 PM, M Singh wrote: Hi: I am working with spark 2.2.0 and am looking at the query status console output. My application reads from kafka - performs flatMapGroupsWithState and then aggregates the elements for two group counts. The output is send to console sink. I see the following output (with my questions in bold). Please me know where I can find detailed description of the query status fields for spark structured streaming ? StreamExecution: Streaming query made progress: { "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2", "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce", "name" : null, "timestamp" : "2018-02-11T01:18:00.005Z", "numInputRows" : 5780, "inputRowsPerSecond" : 96.32851690748795, "processedRowsPerSecond" : 583.9563548191554, // Why is the number of processedRowsPerSecond greater than inputRowsPerSecond ? Does this include shuffling/grouping ? "durationMs" : { "addBatch" : 9765, // Is the time taken to get send output to all console output streams ? "getBatch" : 3, // Is this time taken to get the batch from Kafka ? "getOffset" : 3, // Is this time for getting offset from Kafka ? "queryPlanning" : 89, // The value of this field changes with different triggers but the query is not changing so why does this change ? "triggerExecution" : 9898, // Is this total time for this trigger ? "walCommit" : 35 // Is this for checkpointing ? }, "stateOperators" : [ { // What are the two state operators ? I am assuming one is flatMapWthState (first one). "numRowsTotal" : 8, "numRowsUpdated" : 1 }, { "numRowsTotal" : 6, //Is this the group by state operator ? If so, I have two group by so why do I see only one ? "numRowsUpdated" : 6 } ], "sources" : [ { "description" : "KafkaSource[Subscribe[xyz]]", "startOffset" : { "xyz" : { "2" : 9183, "1" : 9184, "3" : 9184, "0" : 9183 } }, "endOffset" : { "xyz" : { "2" : 10628, "1" : 10629, "3" : 10629, "0" : 10628 } }, "numInputRows" : 5780, "inputRowsPerSecond" : 96.32851690748795, "processedRowsPerSecond" : 583.9563548191554 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c" } }
Re: Apache Spark - Structured Streaming Query Status - field descriptions
Can find a good source for documents, but the source code “org.apache.spark.sql.execution.streaming.ProgressReporter” is helpful to answer some of them. For example: inputRowsPerSecond = numRecords / inputTimeSec, processedRowsPerSecond = numRecords / processingTimeSec This is explaining why the 2 rowPerSec difference. > On Feb 10, 2018, at 8:42 PM, M Singhwrote: > > Hi: > > I am working with spark 2.2.0 and am looking at the query status console > output. > > My application reads from kafka - performs flatMapGroupsWithState and then > aggregates the elements for two group counts. The output is send to console > sink. I see the following output (with my questions in bold). > > Please me know where I can find detailed description of the query status > fields for spark structured streaming ? > > > StreamExecution: Streaming query made progress: { > "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2", > "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce", > "name" : null, > "timestamp" : "2018-02-11T01:18:00.005Z", > "numInputRows" : 5780, > "inputRowsPerSecond" : 96.32851690748795, > "processedRowsPerSecond" : 583.9563548191554, // Why is the number of > processedRowsPerSecond greater than inputRowsPerSecond ? Does this include > shuffling/grouping ? > "durationMs" : { > "addBatch" : 9765,// > Is the time taken to get send output to all console output streams ? > "getBatch" : 3, > // Is this time taken to get the batch from Kafka ? > "getOffset" : 3, > // Is this time for getting offset from Kafka ? > "queryPlanning" : 89, // > The value of this field changes with different triggers but the query is not > changing so why does this change ? > "triggerExecution" : 9898, // Is > this total time for this trigger ? > "walCommit" : 35 // > Is this for checkpointing ? > }, > "stateOperators" : [ { // > What are the two state operators ? I am assuming one is flatMapWthState > (first one). > "numRowsTotal" : 8, > "numRowsUpdated" : 1 > }, { > "numRowsTotal" : 6,//Is > this the group by state operator ? If so, I have two group by so why do I > see only one ? > "numRowsUpdated" : 6 > } ], > "sources" : [ { > "description" : "KafkaSource[Subscribe[xyz]]", > "startOffset" : { > "xyz" : { > "2" : 9183, > "1" : 9184, > "3" : 9184, > "0" : 9183 > } > }, > "endOffset" : { > "xyz" : { > "2" : 10628, > "1" : 10629, > "3" : 10629, > "0" : 10628 > } > }, > "numInputRows" : 5780, > "inputRowsPerSecond" : 96.32851690748795, > "processedRowsPerSecond" : 583.9563548191554 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c" > } > } > >
Apache Spark - Structured Streaming Query Status - field descriptions
Hi: I am working with spark 2.2.0 and am looking at the query status console output. My application reads from kafka - performs flatMapGroupsWithState and then aggregates the elements for two group counts. The output is send to console sink. I see the following output (with my questions in bold). Please me know where I can find detailed description of the query status fields for spark structured streaming ? StreamExecution: Streaming query made progress: { "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2", "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce", "name" : null, "timestamp" : "2018-02-11T01:18:00.005Z", "numInputRows" : 5780, "inputRowsPerSecond" : 96.32851690748795, "processedRowsPerSecond" : 583.9563548191554, // Why is the number of processedRowsPerSecond greater than inputRowsPerSecond ? Does this include shuffling/grouping ? "durationMs" : { "addBatch" : 9765, // Is the time taken to get send output to all console output streams ? "getBatch" : 3, // Is this time taken to get the batch from Kafka ? "getOffset" : 3, // Is this time for getting offset from Kafka ? "queryPlanning" : 89, // The value of this field changes with different triggers but the query is not changing so why does this change ? "triggerExecution" : 9898, // Is this total time for this trigger ? "walCommit" : 35 // Is this for checkpointing ? }, "stateOperators" : [ { // What are the two state operators ? I am assuming one is flatMapWthState (first one). "numRowsTotal" : 8, "numRowsUpdated" : 1 }, { "numRowsTotal" : 6, //Is this the group by state operator ? If so, I have two group by so why do I see only one ? "numRowsUpdated" : 6 } ], "sources" : [ { "description" : "KafkaSource[Subscribe[xyz]]", "startOffset" : { "xyz" : { "2" : 9183, "1" : 9184, "3" : 9184, "0" : 9183 } }, "endOffset" : { "xyz" : { "2" : 10628, "1" : 10629, "3" : 10629, "0" : 10628 } }, "numInputRows" : 5780, "inputRowsPerSecond" : 96.32851690748795, "processedRowsPerSecond" : 583.9563548191554 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c" } }