Hi,

In one of the Spark Summit demos, it was suggested that we should think of batch jobs in a streaming pattern, using "run once" on a schedule. I find this idea very interesting, and I understand how it can be achieved for sources like Kafka, Kinesis, or similar; in fact, we have implemented this model for the Cosmos DB change feed.
My question is: can this model extend to file-based sources? I understand it can for append-only file streams. The use case I have is a CDC tool (AWS DMS, SharePlex, or similar) writing changes as a stream of files into date-based folders, so it just goes on like T1, T2, etc. Also, let's assume files are written every 10 minutes, but I only want to process them every 4 hours. Can I use the streaming approach so that it manages checkpoints on its own?
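For concreteness, below is a rough sketch of the kind of job I am imagining, using Structured Streaming's Trigger.Once on a file source. The paths, schema, and file format are just placeholders I made up, not necessarily what the CDC tool actually produces:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object RunOnceCdcJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("cdc-run-once").getOrCreate()

    // Streaming file sources need an explicit schema up front.
    // These columns are hypothetical CDC fields.
    val schema = new StructType()
      .add("op", StringType)
      .add("id", LongType)
      .add("payload", StringType)

    // Glob over the date-based folders (T1, T2, ...) the CDC tool writes into.
    val changes = spark.readStream
      .schema(schema)
      .format("csv")                       // or json/parquet, whatever the tool emits
      .load("/cdc/landing/*")

    // Trigger.Once processes everything new since the last checkpoint, then stops.
    val query = changes.writeStream
      .format("parquet")
      .option("path", "/cdc/processed")
      .option("checkpointLocation", "/cdc/checkpoints/run-once")
      .trigger(Trigger.Once())
      .start()

    query.awaitTermination()
  }
}

An external scheduler (cron, Airflow, or similar) would launch this every 4 hours, and my assumption is that the checkpoint location is what would let Spark pick up only the files it has not processed yet. Is that a sound way to think about it?

Best - Ayan

--
Best Regards,
Ayan Guha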