[ https://issues.apache.org/jira/browse/BEAM-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15693377#comment-15693377 ]
Amit Sela commented on BEAM-1048:
---------------------------------

To clarify - calling {{rdd.count()}} on the read input RDD creates a batch job each interval. I was following the "direct Kafka" implementation, but in that case it's only a metadata computation, since the range read from Kafka is known ahead of the actual read (it has preset [start, end] offsets).
This is used for UI visibility and for feeding the {{RateController}}. If the {{RateController}} is not used, we might want to disable this for now (the UI will lose the count visualization too).
[~ksalant] I've assigned you since you're working on it. If resolving this takes long, let's open a sub-task to disable this when the {{RateController}} is not used, and re-enable it once resolved.

> Spark Runner streaming batch duration does not include duration of reading from source
> --------------------------------------------------------------------------------------
>
>                 Key: BEAM-1048
>                 URL: https://issues.apache.org/jira/browse/BEAM-1048
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>    Affects Versions: 0.4.0-incubating
>            Reporter: Kobi Salant
>            Assignee: Kobi Salant
>
> The Spark Runner streaming batch duration does not include the duration of reading from the source. This is because we perform {{rdd.count()}} in {{SparkUnboundedSource}}, which invokes a regular Spark job outside the streaming context.
> We do this to report the batch size, both for the UI and for back pressure.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
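The tradeoff the comment describes can be sketched as follows. This is a hypothetical illustration, not actual Beam or Spark code: with a generic unbounded source the batch size is only knowable by materializing and counting the records (what {{rdd.count()}} does, launching a separate job each interval), while the direct-Kafka style read fixes a [start, end] offset range before reading, so the count is pure metadata arithmetic.

```python
def count_by_scan(records):
    # Generic unbounded source: the number of records read is unknown until
    # the data is actually consumed, so the only way to get the batch size
    # is a full pass over the records -- analogous to rdd.count(), which
    # triggers a separate Spark job outside the streaming context.
    return sum(1 for _ in records)


def count_from_offsets(start_offset, end_offset):
    # Direct-Kafka style: the [start, end) offset range is decided before
    # the read, so the batch size is simple metadata arithmetic and no
    # extra job is needed.
    return end_offset - start_offset
```

The second approach is what makes the "direct Kafka" count cheap enough to feed the UI and the {{RateController}} every interval; the first is the cost BEAM-1048 is about.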