[ https://issues.apache.org/jira/browse/BEAM-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739694#comment-15739694 ]
Aviem Zur commented on BEAM-1126: --------------------------------- Reasoning: The backlog accessors are a very good indicator for application monitoring. As such, we plan to expose backlog as aggregators in spark-runner. Number of events is more human comprehensible than bytes. Specifically, in Kafka, backlog (or lag) is reasoned about in {{number of messages}}. See: https://kafka.apache.org/documentation#others_monitoring If I understand correctly, for {{PubSub}} it is more common to reason about backlog in bytes, however, the implementation for {{KafkaIO}} seems forced, applying a byte approximation on a value that is originally in {{number of messages}}: {code:java} synchronized long approxBacklogInBytes() { // Note that is an an estimate of uncompressed backlog. if (latestOffset < 0 || nextOffset < 0) { return UnboundedReader.BACKLOG_UNKNOWN; } return Math.max(0, (long) ((latestOffset - nextOffset) * avgRecordSize)); } {code} In conclusion - it seems that the API was written with {{PubSub}} in mind, however, {{Kafka}}, the open source equivalent, relates to backlog in terms of {{number of messages}}. > Expose UnboundedSource split backlog in number of events > -------------------------------------------------------- > > Key: BEAM-1126 > URL: https://issues.apache.org/jira/browse/BEAM-1126 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core > Reporter: Aviem Zur > Assignee: Daniel Halperin > Priority: Minor > > Today {{UnboundedSource}} exposes split backlog in bytes via > {{getSplitBacklogBytes()}} > There is value in exposing backlog in number of events as well, since this > number can be more human comprehensible than bytes. something like > {{getSplitBacklogEvents()}} or {{getSplitBacklogCount()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)