I have a Spark Streaming job that writes data to S3. I know there are saveAsXXXX 
functions that help write data to S3, but they bundle all elements of a micro-batch 
and then write them out to S3. So my first question: is there any way to let the 
saveAsXXXX functions write data in smaller batches, or even element by element, 
instead of the whole bundle? 

Right now I use the S3 TransferManager to upload files in batches. The code looks 
roughly like the following (sorry, I don't have the exact code at hand):

...

val manager = // initialize TransferManager...

stream.foreachRDD { rdd =>
  // collect all elements of the micro-batch back to the driver
  val elements = rdd.collect()
  // upload the collected batch via the TransferManager
  manager.upload...(elements)
}

...


I suppose there could be a problem here, because the TransferManager instance lives 
in the driver program (the job currently works, but that may only be because I run 
Spark as a single process). From what I have read online, the recommendation seems 
to be to use foreachPartition instead, and to avoid functions that trigger actions, 
such as rdd.collect; a rough sketch of what I have in mind is below. So my other 
questions: what is the best practice for this scenario (batch-uploading transformed 
data to external storage such as S3)? And which functions cause an 'action' to be 
triggered (i.e. cause data to be sent back to the driver program)? 
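For reference, here is a minimal sketch of the foreachPartition version I am 
considering. It swaps the TransferManager for a plain AmazonS3 client created 
inside the partition closure, so nothing has to be serialized from the driver; the 
bucket name, key layout and batch size of 500 are placeholders, and I am not sure 
this is the recommended way to manage the client per partition:

  import com.amazonaws.services.s3.AmazonS3ClientBuilder

  stream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // create the client on the executor instead of shipping it from the driver
      val s3 = AmazonS3ClientBuilder.defaultClient()
      partition.grouped(500).foreach { batch =>
        // placeholder bucket and key; each group of 500 elements becomes one object
        val key = s"output/part-${java.util.UUID.randomUUID()}"
        s3.putObject("my-bucket", key, batch.mkString("\n"))
      }
    }
  }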


Thanks


 
