http://stackoverflow.com/questions/23527941/how-to-write-to-csv-in-spark
Just read this... seems like it should be easily readable. Thanks!

On Sat, Feb 14, 2015 at 1:36 AM, Su She <suhsheka...@gmail.com> wrote:

> Thanks Akhil for the link. Is there a reason why a new directory is
> created for each batch? Is this a format that is easily readable by
> other applications such as hive/impala?
>
> On Sat, Feb 14, 2015 at 1:28 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> You can directly write to hbase with Spark. Here's an example for
>> doing that: https://issues.apache.org/jira/browse/SPARK-944
>>
>> Thanks
>> Best Regards
>>
>> On Sat, Feb 14, 2015 at 2:55 PM, Su She <suhsheka...@gmail.com> wrote:
>>
>>> Hello Akhil, thank you for your continued help!
>>>
>>> 1) So, if I can write it programmatically after every batch, then
>>> technically I should be able to have just the csv files in one
>>> directory. However, can the /desired/output/file.txt be in hdfs? If
>>> it can only be local, I am not sure it will help me for the use case
>>> I describe in 2).
>>>
>>> So can I do something like this: hadoop fs -getmerge
>>> /output/dir/on/hdfs /desired/dir/in/hdfs ?
>>>
>>> 2) Just to make sure I am going down the right path... my end use
>>> case is to use hive or hbase to create a database off these csv
>>> files. Is there an easy way for hive to read /user/test/<many
>>> subdirectories, with one csv file in each> into a table?
>>>
>>> Thank you!
>>>
>>> On Sat, Feb 14, 2015 at 12:39 AM, Akhil Das
>>> <ak...@sigmoidanalytics.com> wrote:
>>>
>>>> Simplest way would be to merge the output files at the end of your
>>>> job, like:
>>>>
>>>> hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
>>>>
>>>> If you want to do it programmatically, then you can use the
>>>> FileUtil.copyMerge API, like:
>>>>
>>>> FileUtil.copyMerge(srcFS, new Path("/output/dir/on/hdfs"),
>>>>     dstFS, new Path("/merged-output"),
>>>>     true,        // delete the original dir
>>>>     conf, null); // nothing inserted between merged files
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Sat, Feb 14, 2015 at 2:18 AM, Su She <suhsheka...@gmail.com> wrote:
>>>>
>>>>> Thanks Akhil for the suggestion, it is now only giving me one
>>>>> part-xxxx file. Is there any way I can just create a file rather
>>>>> than a directory? There doesn't seem to be a plain saveAsTextFile
>>>>> option for JavaPairReceiverDStream.
>>>>>
>>>>> Also, for the copy/merge api, how would I add that to my spark job?
>>>>>
>>>>> Thanks Akhil!
>>>>>
>>>>> Best,
>>>>>
>>>>> Su
>>>>>
>>>>> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das
>>>>> <ak...@sigmoidanalytics.com> wrote:
>>>>>
>>>>>> For a streaming application, every batch creates a new directory
>>>>>> and puts the data in it. If you don't want multiple part-xxxx
>>>>>> files inside that directory, you can do a repartition before the
>>>>>> saveAs* call:
>>>>>>
>>>>>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/",
>>>>>>     "csv", String.class, String.class, (Class) TextOutputFormat.class);
>>>>>>
>>>>>> Thanks
>>>>>> Best Regards
>>>>>>
>>>>>> On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Everyone,
>>>>>>>
>>>>>>> I am writing simple word counts to hdfs using
>>>>>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
>>>>>>> String.class, String.class, (Class) TextOutputFormat.class);
>>>>>>>
>>>>>>> 1) However, every 2 seconds I am getting a new *directory* that
>>>>>>> is named like a csv file. So I'll have test.csv, which will be a
>>>>>>> directory that has two files inside of it called part-00000 and
>>>>>>> part-00001 (something like that). This obviously makes it very
>>>>>>> hard for me to read the data stored in the csv files. I am
>>>>>>> wondering if there is a better way to store the
>>>>>>> JavaPairReceiverDStream and JavaPairDStream?
>>>>>>>
>>>>>>> 2) I know there is a copy/merge hadoop api for merging files...
>>>>>>> can this be done inside java? I am not sure of the logic behind
>>>>>>> this api if I am using spark streaming, which is constantly
>>>>>>> making new files.
>>>>>>>
>>>>>>> Thanks a lot for the help!
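
A note on the FileUtil.copyMerge suggestion above: spelled out as a
complete, compilable snippet, it can look like the sketch below. This is a
minimal sketch, assuming the Hadoop 2.x API (FileUtil.copyMerge was removed
in Hadoop 3); the class name and all paths are placeholders, not from the
thread. Because copyMerge takes a destination FileSystem, the merged file
can live on hdfs, unlike hadoop fs -getmerge, whose destination must be a
local path.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeBatchOutput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Source and destination both use the default FileSystem,
            // i.e. hdfs if fs.defaultFS points at the cluster.
            FileSystem fs = FileSystem.get(conf);

            boolean merged = FileUtil.copyMerge(
                    fs, new Path("/output/dir/on/hdfs"), // dir of part-xxxx files
                    fs, new Path("/merged-output.csv"),  // single destination file
                    true,                                // delete the source dir
                    conf,
                    null);                               // nothing inserted between files

            System.out.println("copyMerge returned " + merged);
        }
    }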
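
As for "how would I add that to my spark job?": one possible approach,
sketched below, is to do the write-then-merge once per batch inside
foreachRDD, so every batch ends up as a single csv file in one flat
directory (the kind of layout a hive table can point at). This is a sketch,
not the thread's method: it assumes a Spark version whose Java API has the
VoidFunction2 overload of foreachRDD, uses the same Hadoop imports as the
previous snippet, takes `messages` to be the JavaPairDStream<String, String>
from the thread, and all paths are placeholders.

    // Inside the streaming job, after `messages` is defined.
    messages.foreachRDD((rdd, time) -> {
        if (rdd.count() == 0) {
            return; // skip empty batches (count() does force the RDD)
        }
        // Write this batch to its own directory with a single part file...
        String batchDir = "/user/ec2-user/batches/" + time.milliseconds();
        rdd.repartition(1)                   // one part-00000 file
           .map(t -> t._1() + "," + t._2())  // (key, value) -> "key,value" csv line
           .saveAsTextFile(batchDir);

        // ...then collapse it into one csv file in a flat directory.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileUtil.copyMerge(
                fs, new Path(batchDir),
                fs, new Path("/user/ec2-user/csv/" + time.milliseconds() + ".csv"),
                true,       // remove the per-batch directory afterwards
                conf, null);
    });

The foreachRDD body runs on the driver, so the FileSystem calls here are
driver-side; only saveAsTextFile launches distributed work.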