Thanks, Evo, for your detailed explanation.

> On Apr 16, 2015, at 1:38 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
> 
> The reason for this is as follows:
>  
> 1. You are saving data on HDFS.
> 2. HDFS, as a cluster/server-side service, has a single-writer / multiple-reader concurrency model.
> 3. Hence each thread of execution in Spark has to write to a separate file in HDFS.
> 4. Moreover, the RDDs are partitioned across the cluster nodes and operated on there by multiple threads; on top of that, in Spark Streaming you have many micro-batch RDDs streaming in all the time as part of a DStream.
>  
> If you want fine-grained control over how data is written to HDFS, you can 
> implement your own HDFS adapter and invoke it from foreachRDD and foreach.
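>  
> A minimal sketch of that approach in Scala, assuming a DStream[String] called 
> lines; the namenode URI, output path, and file-naming scheme here are 
> illustrative assumptions, not a definitive implementation:
>  
>   import java.util.UUID
>   import org.apache.hadoop.conf.Configuration
>   import org.apache.hadoop.fs.{FileSystem, Path}
>  
>   lines.foreachRDD { (rdd, time) =>
>     rdd.foreachPartition { records =>
>       // Each partition of each micro-batch writes its own uniquely named
>       // file, so every batch can share one common output directory
>       // without violating HDFS's single-writer rule.
>       val fs = FileSystem.get(new java.net.URI("hdfs://namenode:8020"), new Configuration())
>       val file = new Path(s"/output/common/batch-${time.milliseconds}-${UUID.randomUUID()}")
>       val out = fs.create(file)
>       try {
>         records.foreach(r => out.write((r + "\n").getBytes("UTF-8")))
>       } finally {
>         out.close()
>       }
>     }
>   }
>  
> The key point is that every concurrent writer creates a distinct file name, 
> which is what saveAsTextFile does for you automatically with its part-NNNNN files.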
>  
> Regards
> Evo Eftimov  
>  
> From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] 
> Sent: Thursday, April 16, 2015 6:33 PM
> To: user@spark.apache.org
> Subject: saveAsTextFile
>  
> I am using Spark Streaming, where during each micro-batch I output data to S3 
> using saveAsTextFile. Right now each batch of data is put into its own 
> directory containing two objects, "_SUCCESS" and "part-00000".
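>  
> For reference, the write looks roughly like this (the bucket name and path 
> layout are placeholders):
>  
>   myDStream.foreachRDD { (rdd, time) =>
>     rdd.saveAsTextFile(s"s3n://my-bucket/output/${time.milliseconds}")
>   }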
>  
> How do I output each batch into a common directory?
>  
> Thanks,
> Vadim
