No, they appear as directories plus files to everything. Many tools are accustomed to taking a directory of part files as input, though. You can certainly point MapReduce, Hive, etc. at a directory of these files.
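For instance, Spark itself will happily read one of these "directories" back: textFile accepts a directory (or a glob) and reads every part-xxxxx file inside it. A minimal sketch, with illustrative paths borrowed from the foo-100000x.csv example below:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadPartFiles {
      public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("ReadPartFiles"));
        // Pointing textFile at the directory reads all of its part-xxxxx files.
        JavaRDD<String> oneBatch =
            sc.textFile("hdfs:///user/ec2-user/foo-10000001.csv");
        // A glob pulls every batch directory into a single RDD.
        JavaRDD<String> allBatches =
            sc.textFile("hdfs:///user/ec2-user/foo-*.csv");
        System.out.println(oneBatch.count() + " / " + allBatches.count() + " lines");
        sc.stop();
      }
    }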
On Sat, Feb 14, 2015 at 9:05 PM, Su She <suhsheka...@gmail.com> wrote:
> Thanks Sean and Akhil! I will take out the repartition(1). Please let me
> know if I understood this correctly. Spark Streaming writes data like this:
>
> foo-10000001.csv/part-xxxxx, part-xxxxx
> foo-10000002.csv/part-xxxxx, part-xxxxx
>
> When I see this in Hue, the csvs appear to me as directories, but if I
> understand correctly, they will appear as csv files to other Hadoop
> ecosystem tools? And, if I understand Tathagata's answer correctly, other
> Hadoop-based ecosystem tools, such as Hive, will be able to create a table
> based on the multiple foo-100000x.csv "directories"?
>
> Thank you, I really appreciate the help!
>
> On Sat, Feb 14, 2015 at 3:20 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Keep in mind that if you repartition to 1 partition, you are only
>> using 1 task to write the output, and potentially only 1 task to
>> compute some parent RDDs. You lose parallelism. The
>> files-in-a-directory output scheme is standard for Hadoop, and for a
>> reason.
>>
>> Therefore I would consider separating this concern and merging the
>> files afterwards if you need to.
>>
>> On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>> > The simplest way would be to merge the output files at the end of your
>> > job, like:
>> >
>> > hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
>> >
>> > If you want to do it programmatically, then you can use the
>> > FileUtil.copyMerge API, like:
>> >
>> > FileUtil.copyMerge(srcFs /* FileSystem of source (hdfs) */,
>> >     new Path("/output-location"),
>> >     dstFs /* FileSystem of destination (hdfs) */,
>> >     new Path("/merged-output") /* path to the merged file */,
>> >     true /* delete the original dir */,
>> >     conf, null);
>> >
>> > Thanks
>> > Best Regards
>> >
>> > On Sat, Feb 14, 2015 at 2:18 AM, Su She <suhsheka...@gmail.com> wrote:
>> >>
>> >> Thanks Akhil for the suggestion, it is now only giving me one
>> >> part-xxxxx file. Is there any way I can just create a file rather
>> >> than a directory? It doesn't seem like there is a saveAsTextFile
>> >> option for JavaPairReceiverDStream.
>> >>
>> >> Also, for the copy/merge API, how would I add that to my Spark job?
>> >>
>> >> Thanks Akhil!
>> >>
>> >> Best,
>> >>
>> >> Su
>> >>
>> >> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das
>> >> <ak...@sigmoidanalytics.com> wrote:
>> >>>
>> >>> For a streaming application, for every batch it will create a new
>> >>> directory and put the data in it. If you don't want to have multiple
>> >>> files inside the directory as part-xxxx, then you can do a
>> >>> repartition before the saveAs* call:
>> >>>
>> >>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
>> >>>     String.class, String.class, (Class) TextOutputFormat.class);
>> >>>
>> >>> Thanks
>> >>> Best Regards
>> >>>
>> >>> On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Hello Everyone,
>> >>>>
>> >>>> I am writing simple word counts to HDFS using
>> >>>>
>> >>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
>> >>>>     String.class, String.class, (Class) TextOutputFormat.class);
>> >>>>
>> >>>> 1) However, every 2 seconds I get a new directory that is titled as
>> >>>> a csv. So I'll have test.csv, which will be a directory that has two
>> >>>> files inside of it called part-00000 and part-00001 (something like
>> >>>> that).
>> >>>> This obviously makes it very hard for me to read the data stored in
>> >>>> the csv files. I am wondering if there is a better way to store the
>> >>>> JavaPairReceiverDStream and JavaPairDStream?
>> >>>>
>> >>>> 2) I know there is a copy/merge Hadoop API for merging files... can
>> >>>> this be done inside Java? I am not sure of the logic behind this API
>> >>>> if I am using Spark Streaming, which is constantly making new files.
>> >>>>
>> >>>> Thanks a lot for the help!
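To make Akhil's copyMerge suggestion above concrete: a minimal, self-contained sketch, assuming Hadoop 2.x (FileUtil.copyMerge was removed in Hadoop 3) and illustrative paths. It would run as a separate cleanup step once a batch directory is complete, rather than inside the streaming job itself:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergePartFiles {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Concatenates every part-xxxxx file under the source directory
        // into one destination file, then deletes the source directory.
        boolean merged = FileUtil.copyMerge(
            fs, new Path("/user/ec2-user/foo-10000001.csv"),        // source dir
            fs, new Path("/user/ec2-user/merged/foo-10000001.csv"), // dest file
            true,  // delete the source directory when done
            conf,
            null); // no separator string appended between files
        System.out.println("merged: " + merged);
      }
    }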