Re: structured streaming - checkpoint metadata growing indefinitely

2022-05-04 Thread Wojciech Indyk
For posterity: the problem was the FileStreamSourceLog class. I needed to override the method shouldRetain, which by default returns true and whose doc says: "Default implementation retains all log entries. Implementations should override the method to change the behavior." -- Kind regards/ Pozdrawiam,
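For readers landing here later, a minimal sketch (Scala) of what such an override could look like. FileStreamSourceLog and FileStreamSource.FileEntry are Spark-internal classes; the constructor arguments and the shouldRetain(entry, currentTime) signature shown here are assumptions to verify against your Spark version, and retentionMs is a hypothetical knob, not something from the thread:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.execution.streaming.{FileStreamSource, FileStreamSourceLog}

  // Metadata log that expires old file entries instead of retaining everything.
  // retentionMs is hypothetical: entries older than this are dropped at compaction.
  class ExpiringFileStreamSourceLog(
      metadataLogVersion: Int,
      sparkSession: SparkSession,
      path: String,
      retentionMs: Long)
    extends FileStreamSourceLog(metadataLogVersion, sparkSession, path) {

    // The default implementation returns true for every entry, so each compact
    // file carries the full history forward; here, entries outside the retention
    // window are dropped instead.
    override def shouldRetain(entry: FileStreamSource.FileEntry, currentTime: Long): Boolean =
      currentTime - entry.timestamp <= retentionMs
  }

Because FileStreamSourceLog is instantiated internally by the file stream source, applying an override like this in practice usually means patching the class in your own Spark build rather than plugging it in through configuration.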

Re: structured streaming - checkpoint metadata growing indefinitely

2022-04-30 Thread Wojciech Indyk
Hi Gourav! I use stateless processing, no watermarking, no aggregations. I don't want any data loss, so changing the checkpoint location is not an option for me. -- Kind regards/ Pozdrawiam, Wojciech Indyk On Fri, 29 Apr 2022 at 11:07, Gourav Sengupta wrote: > Hi, > > this may not solve the

Re: structured streaming - checkpoint metadata growing indefinitely

2022-04-29 Thread Gourav Sengupta
Hi, this may not solve the problem, but have you tried to stop the job gracefully and then restart it without much delay, pointing to a new checkpoint location? The approach will have certain uncertainties for scenarios where the source system can lose data, or where we do not expect duplicates to be
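A minimal sketch (Scala) of that workaround, with hypothetical bucket names, formats and schema; as noted above, abandoning the old checkpoint can lose data or introduce duplicates if the source cannot replay from where the old checkpoint left off:

  import org.apache.spark.sql.SparkSession

  object RestartWithNewCheckpoint {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("restart-with-new-checkpoint")
        .getOrCreate()

      val input = spark.readStream
        .format("json")
        .schema("id LONG, payload STRING")   // hypothetical schema
        .load("s3a://my-bucket/input/")

      input.writeStream
        .format("parquet")
        .option("path", "s3a://my-bucket/output/")
        // Fresh checkpoint directory; the old, oversized one is simply abandoned.
        .option("checkpointLocation", "s3a://my-bucket/checkpoints/job-v2/")
        .start()
        .awaitTermination()
    }
  }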

Re: structured streaming - checkpoint metadata growing indefinitely

2022-04-29 Thread Wojciech Indyk
Update for the scenario of deleting compact files: the job recovers from the most recent (not compacted) checkpoint file, but when the next checkpoint compaction happens it fails with a missing recent compaction file. I use Spark 3.1.2. -- Kind regards/ Pozdrawiam, Wojciech Indyk On Fri, 29 Apr 2022 at 07:00

structured streaming - checkpoint metadata growing indefinitely

2022-04-28 Thread Wojciech Indyk
Hello! I use Spark Structured Streaming. I need to use S3 for storing checkpoint metadata (I know it's not an optimal store for checkpoint metadata). The compaction interval is 10 (the default) and I set "spark.sql.streaming.minBatchesToRetain"=5. After the job had been running for a few weeks, checkpointing
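For context, a minimal spark-shell-style sketch (Scala) of such a setup, assuming the compaction interval mentioned above is the file source metadata log interval (spark.sql.streaming.fileSource.log.compactInterval, default 10); bucket names, formats and paths are placeholders:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("s3-checkpoint-example")
    // File source metadata log is compacted every 10 micro-batches (the default).
    .config("spark.sql.streaming.fileSource.log.compactInterval", "10")
    // Keep checkpoint metadata for only the last 5 batches.
    .config("spark.sql.streaming.minBatchesToRetain", "5")
    .getOrCreate()

  val input = spark.readStream
    .format("text")
    .load("s3a://my-bucket/landing/")

  input.writeStream
    .format("parquet")
    .option("path", "s3a://my-bucket/output/")
    // Checkpoint metadata kept on S3, as described above.
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/my-job/")
    .start()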