For posterity: the problem was in the FileStreamSourceLog class. I needed to
override the method shouldRetain, which by default returns true and whose doc
says: "Default implementation retains all log entries. Implementations should
override the method to change the behavior."
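For anyone landing here later, a minimal sketch of such an override, assuming the internal API shape of Spark 3.x (FileStreamSourceLog extends CompactibleFileStreamLog[FileEntry]; the class sits in a private Spark package and the exact shouldRetain signature varies between versions, so treat this as illustrative, not drop-in):

```scala
// Illustrative only: FileStreamSourceLog is internal to Spark
// (org.apache.spark.sql.execution.streaming), so a subclass must be compiled
// against the matching Spark version, whose signatures may differ.
package org.apache.spark.sql.execution.streaming

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.FileStreamSource.FileEntry

class PruningFileStreamSourceLog(
    metadataLogVersion: Int,
    sparkSession: SparkSession,
    path: String)
  extends FileStreamSourceLog(metadataLogVersion, sparkSession, path) {

  // Hypothetical policy: retain only entries seen in the last 7 days,
  // so the compact file stops growing without bound.
  private val retentionMs = 7L * 24 * 60 * 60 * 1000

  override def shouldRetain(log: FileEntry, currentTime: Long): Boolean =
    currentTime - log.timestamp < retentionMs
}
```

Dropping an entry means the source will treat a file with that path as new if it ever reappears, so the retention window should comfortably exceed any chance of reprocessing old input.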
--
Kind regards/ Pozdrawiam,
Wojciech Indyk
Hi Gourav!
I use stateless processing, no watermarking, no aggregations.
I don't want any data loss, so changing the checkpoint location is not an
option for me.
--
Kind regards/ Pozdrawiam,
Wojciech Indyk
On Fri, 29 Apr 2022 at 11:07, Gourav Sengupta wrote:
Hi,
this may not solve the problem, but have you tried stopping the job
gracefully, and then restarting without much delay by pointing to a new
checkpoint location? The approach will have certain uncertainties for
scenarios where the source system can lose data, or we do not expect
duplicates to be
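For context, the restart Gourav suggests would look roughly like this (a sketch only; the source/sink formats, S3 paths, and the `query` handle are hypothetical):

```scala
// Sketch: stop the running query, then start the same pipeline again
// against a fresh checkpoint directory. Note that stop() interrupts the
// query; a truly "graceful" stop usually also means making sure the last
// micro-batch finished and no new input is arriving.
query.stop()

val restarted = spark.readStream
  .format("text")
  .load("s3://my-bucket/input/")            // hypothetical input path
  .writeStream
  .format("parquet")
  .option("path", "s3://my-bucket/output/")
  .option("checkpointLocation", "s3://my-bucket/checkpoint-v2/") // new location
  .start()
```

The catch, as the thread notes, is that a new checkpoint location loses the file-source's memory of which input files were already processed, which is exactly the data-loss/duplicate risk being discussed.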
Update for the scenario of deleting compact files: the job recovers from the
most recent (non-compacted) checkpoint file, but when the next compaction is
due it fails with a missing recent compaction file. I use Spark 3.1.2.
--
Kind regards/ Pozdrawiam,
Wojciech Indyk
On Fri, 29 Apr 2022 at 07:00
Hello!
I use Spark Structured Streaming. I need to use S3 for storing checkpoint
metadata (I know it's not the optimal storage for checkpoint metadata). The
compaction interval is 10 (the default) and I set
"spark.sql.streaming.minBatchesToRetain"=5. When the job had been running for a
few weeks, checkpointing
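For reference, the two settings mentioned above would be applied like this (config names as I understand them from Spark's SQL configuration; the compact-interval key is an internal one, so verify it against your Spark version):

```scala
// File-source metadata log compaction interval (default 10) and the
// number of batches of metadata to retain, as described in the message.
spark.conf.set("spark.sql.streaming.fileSource.log.compactInterval", "10")
spark.conf.set("spark.sql.streaming.minBatchesToRetain", "5")
```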