I have a Spark Streaming job that writes data to S3. I know there are saveAsXXXX functions that help write data to S3, but they bundle all the elements of the RDD together and write them out to S3 at once. So my first question: is there any way to make the saveAsXXXX functions write the data in smaller batches, or element by element, instead of the whole bundle?
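For context, this is roughly what I mean by the saveAsXXXX behaviour (the bucket and path are just placeholders):

```scala
// Each micro-batch is written out as one whole set of part files under a
// timestamped directory; I don't see a way to write smaller batches or
// individual elements through this API.
stream.saveAsTextFiles("s3a://my-bucket/output/stream", "txt")
```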
Right now I use the S3 TransferManager to upload files in batches. The code looks roughly like the following (sorry, I don't have the exact code at hand):

```scala
// TransferManager is created once, on the driver
val manager = // initialize TransferManager...

stream.foreachRDD { rdd =>
  // collect() pulls every element of the RDD back to the driver
  val elements = rdd.collect()
  manager.upload...(elements)
}
```

I suspect this is problematic because the TransferManager instance lives in the driver program (the job currently works, but that may only be because I run Spark as a single process). From what I've read online, it seems the recommendation is to use foreachPartition instead, and to avoid functions that trigger actions, such as rdd.collect.

So my other questions: what is the best practice for this scenario (batch-uploading transformed data to external storage such as S3)? And which functions cause an 'action' to be triggered (i.e. data to be sent back to the driver program)?
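To make the second question concrete, here is roughly how I imagine the foreachPartition version would look. This is only my own sketch: the bucket name, the object key naming, creating an S3 client inside each partition, and turning one partition into one object are all my assumptions, not something I have verified as best practice:

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ObjectMetadata
import java.io.ByteArrayInputStream
import java.util.UUID

stream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // This closure runs on the executors, so the client has to be created
    // here (or reused via a per-JVM lazy singleton), not on the driver.
    val client = AmazonS3ClientBuilder.defaultClient()

    // Assumption: the stream carries strings and one partition becomes one object.
    val bytes = partition.mkString("\n").getBytes("UTF-8")
    val metadata = new ObjectMetadata()
    metadata.setContentLength(bytes.length.toLong)

    client.putObject(
      "my-bucket",                         // placeholder bucket name
      s"output/part-${UUID.randomUUID}",   // placeholder object key
      new ByteArrayInputStream(bytes),
      metadata
    )
  }
}
```

Is something along these lines closer to the recommended pattern?

Thanks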