Thanks Evo for your detailed explanation.
> On Apr 16, 2015, at 1:38 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>
> The reason for this is as follows:
>
> 1. You are saving data on HDFS
> 2. HDFS as a cluster/server-side service has a Single Writer / Multiple Reader multithreading model
> 3. Hence each thread of execution in Spark has to write to a separate file in HDFS
> 4. Moreover, the RDDs are partitioned across cluster nodes and operated upon by multiple threads there, and on top of that, in Spark Streaming you have many micro-batch RDDs streaming in all the time as part of a DStream
>
> If you want fine/detailed management of the writing to HDFS, you can implement your own HDFS adapter and invoke it in foreachRDD and foreach
>
> Regards,
> Evo Eftimov
>
> From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
> Sent: Thursday, April 16, 2015 6:33 PM
> To: user@spark.apache.org
> Subject: saveAsTextFile
>
> I am using Spark Streaming where during each micro-batch I output data to S3 using saveAsTextFile. Right now each batch of data is put into its own directory containing 2 objects, "_SUCCESS" and "part-00000".
>
> How do I output each batch into a common directory?
>
> Thanks,
> Vadim
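The custom-adapter idea Evo describes can be sketched without a cluster: since HDFS allows only one writer per file, every micro-batch partition must write to its *own* file, but those files can all live in one common directory if their names embed the batch time and partition index. The helper below is a minimal, hypothetical sketch of that naming scheme against the local filesystem; the commented PySpark wiring (`foreachRDD` / `mapPartitionsWithIndex`, which are real DStream/RDD methods) shows where it would plug in, but the directory path and function names are assumptions, not code from the thread.

```python
import os

def write_partition(records, out_dir, batch_stamp, partition_id):
    """Write one partition's records to a uniquely named file inside a
    shared directory. The name embeds the batch timestamp and partition
    index, so no two concurrent writers ever touch the same file --
    which is what HDFS's single-writer-per-file model requires."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "part-%s-%05d" % (batch_stamp, partition_id))
    with open(path, "w") as f:
        for rec in records:
            f.write(str(rec) + "\n")
    return path

# Hypothetical wiring inside a PySpark Streaming job (not executed here):
#
# def save_batch(time, rdd):
#     stamp = time.strftime("%Y%m%d-%H%M%S")
#     # mapPartitionsWithIndex gives each partition its index, so every
#     # task builds a distinct file name under the one common directory.
#     rdd.mapPartitionsWithIndex(
#         lambda i, it: [write_partition(it, "/data/common-output", stamp, i)]
#     ).count()  # count() forces the write to actually run
#
# dstream.foreachRDD(save_batch)
```

The same naming trick works for S3, where "directories" are just key prefixes, so a common prefix with per-batch, per-partition object names avoids any collision.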