The reason for this is as follows:

1. You are saving data on HDFS.

2. HDFS, as a cluster/server-side service, has a Single Writer / Multiple Reader concurrency model.

3. Hence each thread of execution in Spark has to write to a separate file in HDFS.

4. Moreover, the RDDs are partitioned across the cluster nodes and operated on there by multiple threads, and on top of that, in Spark Streaming you have many micro-batch RDDs arriving all the time as part of a DStream (illustrated in the snippet below).

 

If you want fine-grained / detailed control over the writing to HDFS, you can implement your own HDFS adapter and invoke it in foreachRDD and foreach – a rough sketch follows below.
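
For example, a minimal sketch of such an adapter in Scala – the common directory /data/out and the file-naming scheme are assumptions on my part, not a definitive implementation. Each partition of each micro-batch writes its own uniquely named file inside one common directory, so there is never more than a single writer per file, and as a side effect all batches end up in a common directory:

import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.dstream.DStream

def writeToCommonDir(lines: DStream[String]): Unit = {
  lines.foreachRDD { (rdd, time) =>
    rdd.foreachPartition { partition =>
      if (partition.nonEmpty) {
        // Unique file per partition per batch => single writer per file
        val fs = FileSystem.get(new Configuration())
        val path = new Path(s"/data/out/batch-${time.milliseconds}-${UUID.randomUUID()}")
        val out = fs.create(path)
        try {
          partition.foreach(line => out.write((line + "\n").getBytes("UTF-8")))
        } finally {
          out.close()
        }
      }
    }
  }
}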

 

Regards

Evo Eftimov  

 

From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] 
Sent: Thursday, April 16, 2015 6:33 PM
To: user@spark.apache.org
Subject: saveAsTextFile

 

I am using Spark Streaming, where during each micro-batch I output data to S3 using saveAsTextFile. Right now each batch of data is put into its own directory containing two objects, "_SUCCESS" and "part-00000".

 

How do I output each batch into a common directory?

 

Thanks,

Vadim
