Hi Team,

I have ETL use cases with Kafka as the source (in *Batch* mode) on SparkRunner, and I am trying to understand the internals of KafkaIO.Read.
Can someone please confirm whether my understanding is correct?

- WindowedContextOutputReceiver is used to collect the Kafka records that KafkaIO.Read produces through BoundedReadFromUnboundedSource [1].
- All Kafka records that are read get stored in memory and are only spilled downstream once the loop in the read function ends [2] (a simplified sketch of this pattern is in the P.P.S. below).

Refs:
[1] https://github.com/apache/beam/blob/3bb232fb098700de408f574585dfe74bbaff7230/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedReadFromUnboundedSource.java#L206
[2] https://github.com/apache/beam/blob/3bb232fb098700de408f574585dfe74bbaff7230/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedReadFromUnboundedSource.java#L201

Thank You,
Shrikant Bang.

On Sun, Jan 10, 2021 at 11:43 PM shrikant bang <[email protected]> wrote:

> Hi Team,
>
> I have the below questions/understandings about KafkaIO.Read in batch mode:
>
> 1. From debugging, I understand that KafkaIO converts the unbounded
> stream into a bounded read and *buffers all records* until either
> criterion is met - max records or max read time.
> If this is correct, the read is memory intensive, since KafkaIO has to
> buffer all read records before passing them downstream. Is my
> understanding correct?
>
> 2. If #1 is correct, is there any way to keep emitting records
> downstream instead of buffering them in memory during the KafkaIO read
> (in batch mode)?
>
> Thank You,
> Shrikant Bang
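P.S. For context, the read is configured roughly as below. As far as I can tell, setting withMaxNumRecords / withMaxReadTime is what makes KafkaIO expand through BoundedReadFromUnboundedSource. The broker address, topic name, class name, and bounds here are illustrative placeholders, not my real settings:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.joda.time.Duration;

    public class KafkaBatchReadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply("ReadFromKafka",
            KafkaIO.<String, String>read()
                .withBootstrapServers("broker:9092") // placeholder broker
                .withTopic("events")                 // placeholder topic
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                // Setting either bound below turns the unbounded Kafka source
                // into a bounded one (BoundedReadFromUnboundedSource under
                // the hood):
                .withMaxNumRecords(1_000_000)
                .withMaxReadTime(Duration.standardMinutes(10))
                .withoutMetadata());
        p.run().waitUntilFinish();
      }
    }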

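P.P.S. To make the second bullet above concrete, the pattern I believe I am seeing around [2] is the buffer-then-emit sketch below. This is my own simplified paraphrase with hypothetical names, not Beam's actual code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    // Hypothetical paraphrase of the buffer-then-emit pattern I believe
    // I am seeing around [2] -- NOT Beam's actual code.
    public class BufferThenEmitSketch {

      static <T> void read(Iterable<T> source, Consumer<T> downstream,
                           long maxRecords, long maxReadMillis) {
        List<T> buffer = new ArrayList<>();
        long deadline = System.currentTimeMillis() + maxReadMillis;
        // Read until either bound trips, holding every record in memory.
        for (T record : source) {
          buffer.add(record);
          if (buffer.size() >= maxRecords
              || System.currentTimeMillis() >= deadline) {
            break;
          }
        }
        // Only after the loop ends is the buffer spilled downstream.
        buffer.forEach(downstream);
      }
    }

If this is really what happens, my question #2 above stands: is there a way to emit each record downstream as it is read, instead of holding the whole batch in memory?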