Hi Rico,

there is no way to defer records from one micro-batch to the next one, so it is guaranteed that the data and its trigger event are processed within the same batch.
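To illustrate, a minimal sketch of the pattern in question (the Record case class, the fetchRecords helper, topic names, servers and paths are placeholders, not taken from Rico's actual pipeline):

import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("ingest-sketch").getOrCreate()
import spark.implicits._

case class Record(id: String, payload: String)

// hypothetical helper: pulls the real data for one trigger event via REST
def fetchRecords(event: String): Seq[Record] =
  Seq(Record(event, "fetched-payload")) // stand-in for the actual HTTP call

val triggers = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "trigger-topic")             // placeholder
  .load()

// one trigger event fans out into a few thousand rows here
val records: Dataset[Record] = triggers
  .selectExpr("CAST(value AS STRING) AS event")
  .as[String]
  .flatMap(event => fetchRecords(event))

// all rows that flatMap produced for a given trigger event arrive in the
// same micro-batch; Spark never defers part of them to a later batch
val writeBatch: (Dataset[Record], Long) => Unit = (batch, batchId) =>
  batch.write.mode("append").parquet(s"/tmp/out/batch_$batchId") // placeholder sink

records.writeStream
  .foreachBatch(writeBatch)
  .option("checkpointLocation", "/tmp/checkpoints/ingest-sketch") // placeholder
  .start()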
I assume that one trigger event leads to an unknown number of actual records pulled via HTTP. This bypasses the throughput properties of Spark streaming. Depending on the volume of the resulting HTTP records, you might consider splitting the pipeline into two parts (see the sketch at the end of this mail):

- process the trigger event, pull the data via HTTP, write it to Kafka
- perform the structured streaming ingestion from that topic

Kind regards

Dipl.-Inf. Rico Bergmann <i...@ricobergmann.de> wrote on Fri, 5 Mar 2021 at 09:06:

> Hi all!
>
> I'm using Spark structured streaming for a data ingestion pipeline.
> Basically the pipeline reads events (notifications of newly available
> data) from a Kafka topic and then queries a REST endpoint to get the
> real data (within a flatMap).
>
> For one single event the pipeline creates a few thousand records (rows)
> that have to be stored. To write the data I use foreachBatch().
>
> My question is now: Is it guaranteed by Spark that all output records of
> one event are always contained in a single batch, or can the records also
> be split into multiple batches?
>
> Best,
>
> Rico.

--
Roland Johann
Data Architect/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany
Mobile: +49 172 365 26 46
Mail: roland.joh...@phenetic.io
Web: phenetic.io

Commercial register: Amtsgericht Köln (HRB 92595)
Managing Directors: Roland Johann, Uwe Reimann
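A rough sketch of the two-part split suggested above, reusing the SparkSession and the Record/fetchRecords placeholders from the previous sketch; the records-topic name, servers, paths and option values are again assumptions, not from the original thread:

import org.apache.spark.sql.DataFrame

// Part 1: expand each trigger event into the actual records and write them
// back to Kafka, so the HTTP fan-out happens before the real ingestion.
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "trigger-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS event")
  .as[String]
  .flatMap(event => fetchRecords(event))           // HTTP fan-out happens here
  .selectExpr("id AS key", "payload AS value")     // shape rows for the Kafka sink
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "records-topic")
  .option("checkpointLocation", "/tmp/checkpoints/fanout")
  .start()

// Part 2: the ingestion job reads the already-expanded records from Kafka,
// so the usual throughput controls apply to the real data volume again.
val ingestBatch: (DataFrame, Long) => Unit = (batch, batchId) =>
  batch.write.mode("append").parquet(s"/tmp/ingested/batch_$batchId")

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "records-topic")
  .option("maxOffsetsPerTrigger", "10000")         // batch sizes become predictable again
  .load()
  .writeStream
  .foreachBatch(ingestBatch)
  .option("checkpointLocation", "/tmp/checkpoints/ingest")
  .start()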