Keep in mind that if you repartition to 1 partition, you are using only one task to write the output, and potentially only one task to compute some parent RDDs, so you lose parallelism. The files-in-a-directory output scheme is standard for Hadoop, and for a reason. Therefore I would consider separating these concerns and merging the files afterwards if you need to.
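For example, a minimal standalone sketch of the merge step, assuming Hadoop 2.x (where FileUtil.copyMerge is available) and illustrative paths:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeOutput {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Concatenate every part-xxxx file under the output directory
            // into a single file, then delete the source directory.
            FileUtil.copyMerge(
                fs, new Path("/output/dir/on/hdfs"),       // source dir (illustrative)
                fs, new Path("/merged-output/result.csv"), // destination file (illustrative)
                true,                                      // delete the source dir when done
                conf,
                null);                                     // no separator string between files
        }
    }

Run this (or call copyMerge from your driver) once the output directory is complete; hadoop fs -getmerge, mentioned below, is the same idea, except that it writes the merged file to the local filesystem.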
On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Simplest way would be to merge the output files at the end of your job, like:
>
>     hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
>
> If you want to do it programmatically, then you can use the
> FileUtil.copyMerge API, like:
>
>     FileUtil.copyMerge(FileSystem of source (hdfs), /output-location,
>                        FileSystem of destination (hdfs),
>                        Path to the merged file /merged-output,
>                        true (to delete the original dir), null)
>
> Thanks
> Best Regards
>
> On Sat, Feb 14, 2015 at 2:18 AM, Su She <suhsheka...@gmail.com> wrote:
>>
>> Thanks Akhil for the suggestion, it is now only giving me one part-xxxx
>> file. Is there any way I can just create a file rather than a directory?
>> It doesn't seem like there is a saveAsTextFile option for
>> JavaPairReceiverInputDStream.
>>
>> Also, for the copy/merge API, how would I add that to my Spark job?
>>
>> Thanks Akhil!
>>
>> Best,
>>
>> Su
>>
>> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>>
>>> For a streaming application, it will create a new directory for every
>>> batch and put the data in it. If you don't want multiple part-xxxx files
>>> inside the directory, you can do a repartition before the saveAs* call:
>>>
>>>     messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
>>>         String.class, String.class, (Class) TextOutputFormat.class);
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com> wrote:
>>>>
>>>> Hello Everyone,
>>>>
>>>> I am writing simple word counts to HDFS using
>>>>
>>>>     messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
>>>>         String.class, String.class, (Class) TextOutputFormat.class);
>>>>
>>>> 1) However, every 2 seconds I am getting a new directory that is titled
>>>> as a csv. So I'll have test.csv, which will be a directory that has two
>>>> files inside of it called part-00000 and part-00001 (something like
>>>> that). This obviously makes it very hard for me to read the data stored
>>>> in the csv files. I am wondering if there is a better way to store the
>>>> JavaPairReceiverInputDStream and JavaPairDStream?
>>>>
>>>> 2) I know there is a copy/merge Hadoop API for merging files... can this
>>>> be done inside Java? I am not sure of the logic behind this API if I am
>>>> using Spark Streaming, which is constantly making new files.
>>>>
>>>> Thanks a lot for the help!
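As for Su's open question above (how to add the merge to the Spark job itself): one option is a foreachRDD block, whose function runs on the driver after each batch, so the write and the merge can happen back to back. A rough sketch, not tested, assuming Java 8, Hadoop 2.x, and hypothetical HDFS paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.streaming.Time;

    // messages is the JavaPairDStream<String, String> from the thread above.
    messages.foreachRDD((JavaPairRDD<String, String> rdd, Time time) -> {
        String dir = "hdfs:///user/ec2-user/batch-" + time.milliseconds();  // hypothetical layout
        rdd.saveAsTextFile(dir);  // writes part-xxxx files under dir
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Merge this batch's part files into a single csv and drop the dir.
        FileUtil.copyMerge(fs, new Path(dir),
            fs, new Path("hdfs:///user/ec2-user/merged/" + time.milliseconds() + ".csv"),
            true, conf, null);
        return null;  // this foreachRDD overload takes a Function2 returning Void
    });

The caveat from the top of this mail still applies: merging serializes the output, so it trades away the parallel write.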