Github user arunmahadevan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21721#discussion_r200730270
  
    --- Diff: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/JsonUtils.scala
 ---
    @@ -95,4 +95,25 @@ private object JsonUtils {
         }
         Serialization.write(result)
       }
    +
    +  /**
    +   * Write per-topic partition lag as json string
    --- End diff ---
    
    This is the difference between the latest offsets in Kafka at the time the metrics are reported (just after a micro-batch completes) and the latest offsets Spark has processed. It can be 0 if Spark keeps up with the rate at which messages are ingested into the Kafka topics (steady state).
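    
    To make the computation concrete, here is a minimal sketch of such a helper in the style of `JsonUtils` (the method name `partitionLags` and the two offset maps are illustrative placeholders, not the PR's actual signature):
    
    ```scala
    import org.apache.kafka.common.TopicPartition
    import org.json4s.NoTypeHints
    import org.json4s.jackson.Serialization
    
    implicit val formats = Serialization.formats(NoTypeHints)
    
    // Hypothetical helper: lag = latest broker offset minus the latest offset
    // Spark has processed, reported per "topic-partition" key as a JSON string.
    def partitionLags(
        latestOffsets: Map[TopicPartition, Long],
        processedOffsets: Map[TopicPartition, Long]): String = {
      val lags = latestOffsets.map { case (tp, latest) =>
        // A partition Spark has not processed yet reports the full backlog.
        val processed = processedOffsets.getOrElse(tp, 0L)
        s"${tp.topic}-${tp.partition}" -> math.max(0L, latest - processed)
      }
      Serialization.write(lags)
    }
    ```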
    
    I would assume we would always want to set a reasonable micro-batch size by setting `maxOffsetsPerTrigger`. Otherwise Spark can end up processing the entire data in the topics in one micro-batch (e.g. if the starting offset is set to earliest, or if the streaming job is stopped for some time and restarted). IMO we should address this by setting some sane defaults, which are currently missing.
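    
    For illustration, a sketch of capping the batch size on the Kafka source (the broker address, topic name, and the 10000 cap are placeholder values; `spark` is assumed to be an existing `SparkSession`):
    
    ```scala
    // Cap each micro-batch at 10000 offsets so a restart from "earliest" (or
    // after downtime) does not pull the whole topic backlog into one batch.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092") // placeholder brokers
      .option("subscribe", "events")                   // placeholder topic
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", "10000")
      .load()
    ```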
    
    If we want to handle the custom metrics for Kafka outside the scope of this PR, I can raise a separate one for it. But this can be really useful for identifying issues like data skew in some partitions, or other issues causing Spark to not keep up with the ingestion rate.


---
