Thanks Sean and Akhil! I will take out the repartition(1).

Please let me know if I understood this correctly. Spark Streaming writes data like this:
foo-10000001.csv/part-xxxxx, part-xxxxx
foo-10000002.csv/part-xxxxx, part-xxxxx

When I see this in Hue, the csv's appear to me as *directories*, but if I understand correctly, they will appear as csv *files* to other Hadoop ecosystem tools? And, if I understand Tathagata's answer correctly, other Hadoop-based tools, such as Hive, will be able to create a table based on the multiple foo-100000x.csv "directories"?

Thank you, I really appreciate the help!

On Sat, Feb 14, 2015 at 3:20 AM, Sean Owen <so...@cloudera.com> wrote:
> Keep in mind that if you repartition to 1 partition, you are only
> using 1 task to write the output, and potentially only 1 task to
> compute some parent RDDs. You lose parallelism. The
> files-in-a-directory output scheme is standard for Hadoop, and for a
> reason.
>
> Therefore I would consider separating this concern and merging the
> files afterwards if you need to.
>
> On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> > Simplest way would be to merge the output files at the end of your job, like:
> >
> > hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
> >
> > If you want to do it programmatically, then you can use the
> > FileUtil.copyMerge API, like:
> >
> > FileUtil.copyMerge(FileSystem of source (hdfs), /output-location, FileSystem
> > of destination (hdfs), Path to the merged file /merged-output, true (to delete
> > the original dir), null)
> >
> > Thanks
> > Best Regards
> >
> > On Sat, Feb 14, 2015 at 2:18 AM, Su She <suhsheka...@gmail.com> wrote:
> >> Thanks Akhil for the suggestion, it is now only giving me one part-xxxx.
> >> Is there any way I can just create a file rather than a directory? It doesn't
> >> seem like there is a saveAsTextFile option for JavaPairRecieverDstream.
> >>
> >> Also, for the copy/merge API, how would I add that to my Spark job?
> >>
> >> Thanks Akhil!
> >>
> >> Best,
> >>
> >> Su
> >>
> >> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> >>> For a streaming application, for every batch it will create a new directory
> >>> and put the data in it. If you don't want to have multiple files inside the
> >>> directory as part-xxxx, then you can do a repartition before the saveAs*
> >>> call:
> >>>
> >>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
> >>> String.class, (Class) TextOutputFormat.class);
> >>>
> >>> Thanks
> >>> Best Regards
> >>>
> >>> On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com> wrote:
> >>>> Hello Everyone,
> >>>>
> >>>> I am writing simple word counts to HDFS using
> >>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
> >>>> String.class, (Class) TextOutputFormat.class);
> >>>>
> >>>> 1) However, every 2 seconds I am getting a new directory that is titled as a
> >>>> csv. So I'll have test.csv, which will be a directory that has two files
> >>>> inside of it called part-00000 and part-00001 (something like that). This
> >>>> obviously makes it very hard for me to read the data stored in the csv files. I am
> >>>> wondering if there is a better way to store the JavaPairRecieverDStream and
> >>>> JavaPairDStream?
> >>>>
> >>>> 2) I know there is a copy/merge Hadoop API for merging files... can this
> >>>> be done inside Java? I am not sure of the logic behind this API if I am using
> >>>> Spark Streaming, which is constantly making new files.
> >>>>
> >>>> Thanks a lot for the help!
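[Editor's note for readers following this thread: Sean's "merge the files afterwards" suggestion can also be done without the Hadoop API at all, once the part files have been copied to a local filesystem (e.g. via hadoop fs -get). The sketch below is a hypothetical plain-Java illustration of what `hadoop fs -getmerge` and FileUtil.copyMerge do: concatenate the part-NNNNN files of one batch directory, in name order, into a single file. Class and path names are invented for the example; nothing here is Spark or Hadoop API.]

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class MergeParts {

    // Concatenate every part-* file in a directory, in lexicographic
    // order (part-00000, part-00001, ...), into a single output file.
    // This is the local-filesystem analogue of `hadoop fs -getmerge`.
    public static void merge(Path partsDir, Path outputFile) throws IOException {
        List<Path> parts;
        try (var stream = Files.list(partsDir)) {
            parts = stream
                    .filter(p -> p.getFileName().toString().startsWith("part-"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(outputFile)) {
            for (Path part : parts) {
                Files.copy(part, out); // append this part's bytes to the output
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical demo: fake one streaming batch directory with two parts.
        Path dir = Files.createTempDirectory("foo-10000001.csv");
        Files.writeString(dir.resolve("part-00000"), "hello,1\n");
        Files.writeString(dir.resolve("part-00001"), "world,2\n");

        Path merged = dir.resolve("merged.csv");
        merge(dir, merged); // merged.csv is skipped by the part- filter
        System.out.print(Files.readString(merged));
    }
}
```

For the streaming case, the usual approach is to run a merge like this (or FileUtil.copyMerge on HDFS) as a periodic cleanup step over completed batch directories, rather than forcing repartition(1) inside the job and losing the write parallelism Sean describes.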