Hi Rico,

there is no way to defer records from one micro-batch to the next one. So
it's guaranteed that the data and the trigger event will be processed
within the same batch.
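
To illustrate that semantics, here is a minimal sketch (the Kafka options,
the topic name, and the fetchRecords helper are hypothetical, not your
actual code): every row the flatMap emits for a trigger event is
materialized in the same micro-batch that delivers the event, so
foreachBatch always sees them together.

    import org.apache.spark.sql.{Dataset, SparkSession}

    val spark = SparkSession.builder.appName("ingest").getOrCreate()
    import spark.implicits._

    // Hypothetical REST call: one trigger event expands into many records.
    def fetchRecords(eventId: String): Seq[String] = Seq.empty

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "trigger-events")
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String]

    // All rows produced here for a given event stay in that event's
    // micro-batch; Spark cannot push them into a later one.
    val records = events.flatMap(fetchRecords _)

    records.writeStream
      .option("checkpointLocation", "/tmp/chk")
      .foreachBatch((batch: Dataset[String], batchId: Long) =>
        // batch contains every record derived from this micro-batch's events
        batch.write.parquet(s"/tmp/out/batch=$batchId")
      )
      .start()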

I assume that one trigger event leads to an unknown number of actual
events pulled via HTTP. This bypasses the throughput properties of Spark
streaming. Depending on the volume of the resulting HTTP records, you
might consider splitting the pipeline into two parts (sketched below):
- process the trigger event, pull the data via HTTP, write to Kafka
- perform the structured streaming ingestion from that topic
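
Continuing the hypothetical names from the sketch above, the split could
look roughly like this: stage one expands the trigger events and lands the
raw records in an intermediate topic, and stage two ingests that topic as
an ordinary stream, where maxOffsetsPerTrigger bounds the batch size again.

    // Stage 1: expand trigger events via HTTP, land the records in Kafka.
    // The topic name and checkpoint paths are assumptions for illustration.
    events
      .flatMap(fetchRecords _)
      .toDF("value") // Kafka sink expects a "value" column
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "expanded-records")
      .option("checkpointLocation", "/tmp/chk-stage1")
      .start()

    // Stage 2: plain structured streaming ingestion of the expanded
    // records, with the per-batch size bounded via maxOffsetsPerTrigger.
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "expanded-records")
      .option("maxOffsetsPerTrigger", "10000")
      .load()
      .selectExpr("CAST(value AS STRING)")
      .writeStream
      .format("parquet")
      .option("path", "/tmp/ingested")
      .option("checkpointLocation", "/tmp/chk-stage2")
      .start()

With the intermediate topic in place, the HTTP fan-out can no longer
inflate a single micro-batch of the ingestion query.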

Kind regards

Dipl.-Inf. Rico Bergmann <i...@ricobergmann.de> wrote on Fri, 5 Mar 2021
at 09:06:

> Hi all!
>
> I'm using Spark structured streaming for a data ingestion pipeline.
> Basically the pipeline reads events (notifications of newly available
> data) from a Kafka topic and then queries a REST endpoint to get the
> real data (within a flatMap).
>
> For one single event the pipeline creates a few thousand records (rows)
> that have to be stored. And to write the data I use foreachBatch().
>
> My question now is: Is it guaranteed by Spark that all output records of
> one event are always contained in a single batch, or can the records also
> be split across multiple batches?
>
>
> Best,
>
> Rico.
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
--
Roland Johann
Data Architect/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany

Mobil: +49 172 365 26 46
Mail: roland.joh...@phenetic.io
Web: phenetic.io

Handelsregister: Amtsgericht Köln (HRB 92595)
Geschäftsführer: Roland Johann, Uwe Reimann
