Okay, got it, thanks for the help Sean!
On Sat, Feb 14, 2015 at 1:08 PM, Sean Owen <so...@cloudera.com> wrote:
> No, they appear as directories + files to everything. Lots of tools
> are used to taking an input that is a directory of part files, though.
> You can certainly point MR, Hive, etc. at a directory of these files.
>
> On Sat, Feb 14, 2015 at 9:05 PM, Su She <suhsheka...@gmail.com> wrote:
> > Thanks Sean and Akhil! I will take out the repartition(1). Please let me
> > know if I understood this correctly. Spark Streaming writes data like this:
> >
> > foo-10000001.csv/part-xxxxx, part-xxxxx
> > foo-10000002.csv/part-xxxxx, part-xxxxx
> >
> > When I see this on Hue, the csv's appear to me as directories, but if I
> > understand correctly, they will appear as csv files to other Hadoop
> > ecosystem tools? And, if I understand Tathagata's answer correctly, other
> > Hadoop-based tools, such as Hive, will be able to create a table based
> > on the multiple foo-100000x.csv "directories"?
> >
> > Thank you, I really appreciate the help!
> >
> > On Sat, Feb 14, 2015 at 3:20 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> Keep in mind that if you repartition to 1 partition, you are only
> >> using 1 task to write the output, and potentially only 1 task to
> >> compute some parent RDDs. You lose parallelism. The
> >> files-in-a-directory output scheme is standard for Hadoop, and for a
> >> reason.
> >>
> >> Therefore I would consider separating this concern and merging the
> >> files afterwards if you need to.
> >>
> >> On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das <ak...@sigmoidanalytics.com>
> >> wrote:
> >> > Simplest way would be to merge the output files at the end of your
> >> > job, like:
> >> >
> >> > hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
> >> >
> >> > If you want to do it programmatically, then you can use the
> >> > FileUtil.copyMerge API, like:
> >> >
> >> > FileUtil.copyMerge(FileSystem of source (hdfs), /output-location,
> >> > FileSystem of destination (hdfs), Path to the merged file
> >> > /merged-output, true (to delete the original dir), null)
> >> >
> >> > Thanks
> >> > Best Regards
> >> >
> >> > On Sat, Feb 14, 2015 at 2:18 AM, Su She <suhsheka...@gmail.com> wrote:
> >> >>
> >> >> Thanks Akhil for the suggestion, it is now only giving me one
> >> >> part-xxxx. Is there any way I can just create a file rather than a
> >> >> directory? It doesn't seem like there is a saveAsTextFile option for
> >> >> JavaPairReceiverDStream.
> >> >>
> >> >> Also, for the copy/merge API, how would I add that to my Spark job?
> >> >>
> >> >> Thanks Akhil!
> >> >>
> >> >> Best,
> >> >>
> >> >> Su
> >> >>
> >> >> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das
> >> >> <ak...@sigmoidanalytics.com> wrote:
> >> >>>
> >> >>> For a streaming application, it will create a new directory for
> >> >>> every batch and put the data in it. If you don't want to have
> >> >>> multiple files inside the directory as part-xxxx, then you can do a
> >> >>> repartition before the saveAs* call:
> >> >>>
> >> >>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/",
> >> >>>     "csv", String.class, String.class, (Class) TextOutputFormat.class);
> >> >>>
> >> >>> Thanks
> >> >>> Best Regards
> >> >>>
> >> >>> On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> Hello Everyone,
> >> >>>>
> >> >>>> I am writing simple word counts to HDFS using
> >> >>>>
> >> >>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
> >> >>>>     String.class, String.class, (Class) TextOutputFormat.class);
> >> >>>>
> >> >>>> 1) However, every 2 seconds I am getting a new directory that is
> >> >>>> titled as a csv. So I'll have test.csv, which will be a directory
> >> >>>> that has two files inside of it called part-00000 and part-00001
> >> >>>> (something like that). This obviously makes it very hard for me to
> >> >>>> read the data stored in the csv files. I am wondering if there is a
> >> >>>> better way to store the JavaPairReceiverDStream and JavaPairDStream?
> >> >>>>
> >> >>>> 2) I know there is a copy/merge Hadoop API for merging files...can
> >> >>>> this be done inside Java? I am not sure of the logic behind this
> >> >>>> API if I am using Spark Streaming, which is constantly making new
> >> >>>> files.
> >> >>>>
> >> >>>> Thanks a lot for the help!