Github user arunmahadevan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21721#discussion_r200730270

    --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/JsonUtils.scala ---
    @@ -95,4 +95,25 @@ private object JsonUtils {
        }
        Serialization.write(result)
      }
    +
    +  /**
    +   * Write per-topic partition lag as json string
    --- End diff --

This is the difference between the latest offsets in Kafka at the time the metrics are reported (just after a micro-batch completes) and the latest offsets Spark has processed. It can be 0 if Spark keeps up with the rate at which messages are ingested into the Kafka topics (steady state).

I would assume we would always want to set a reasonable micro-batch size by setting `maxOffsetsPerTrigger`. Otherwise Spark can end up processing the entire data in the topics in one micro-batch (e.g. if the starting offset is set to earliest, or if the streaming job is stopped for some time and restarted). IMO, we should address this by setting some sane defaults, which are currently missing.

If we want to handle the custom metrics for Kafka outside the scope of this PR, I will raise a separate one, but this can be really useful for identifying issues like data skew in some partitions, or other issues causing Spark to not keep up with the ingestion rate.
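To make the lag definition above concrete, here is a minimal standalone sketch (not the PR's actual code; `computeLag` and the plain `Map[String, Long]` offset representation are hypothetical simplifications): for each topic-partition, lag is the latest Kafka offset minus the latest offset Spark has processed, floored at zero for the steady-state case.

```scala
// Hypothetical sketch of the per-partition lag computation described above.
// Offsets are modeled as a Map from "topic-partition" to offset; the real
// code in JsonUtils works with Kafka's TopicPartition type instead.
object LagSketch {
  def computeLag(
      latestInKafka: Map[String, Long],
      processedBySpark: Map[String, Long]): Map[String, Long] = {
    latestInKafka.map { case (tp, latest) =>
      // Partitions Spark has not seen yet count their full offset range as lag.
      val processed = processedBySpark.getOrElse(tp, 0L)
      // 0 when Spark keeps up with the ingestion rate (steady state).
      tp -> math.max(0L, latest - processed)
    }
  }
}
```

For example, `computeLag(Map("t-0" -> 150L), Map("t-0" -> 100L))` reports a lag of 50 for partition `t-0`, while equal offsets report 0; a persistently non-zero value for one partition is the data-skew signal mentioned above.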