Just read this...seems like it should be easily readable. Thanks!

On Sat, Feb 14, 2015 at 1:36 AM, Su She <> wrote:

> Thanks Akhil for the link. Is there a reason why there is a new directory
> created for each batch? Is this a format that is easily readable by other
> applications such as hive/impala?
> On Sat, Feb 14, 2015 at 1:28 AM, Akhil Das <>
> wrote:
>> You can directly write to HBase with Spark. Here's an example for doing
>> that
>> Thanks
>> Best Regards
>> On Sat, Feb 14, 2015 at 2:55 PM, Su She <> wrote:
>>> Hello Akhil, thank you for your continued help!
>>> 1) So, if I can write it programmatically after every batch, then
>>> technically I should be able to have just the csv files in one directory.
>>> However, can the /desired/output/file.txt be in hdfs? If it is only local,
>>> I am not sure if it will help me for the use case I describe in 2).
>>> So can I do something like this: hadoop fs -getmerge /output/dir/on/hdfs
>>> desired/dir/in/hdfs ?
>>> 2) Just to make sure I am going down the right path: my end use case is
>>> to use hive or hbase to create a database off these csv files. Is there an
>>> easy way for hive to read /user/test/many sub directories/with one csv file
>>> in each into a table?
>>> Thank you!
>>> On Sat, Feb 14, 2015 at 12:39 AM, Akhil Das <>
>>> wrote:
>>>> Simplest way would be to merge the output files at the end of your job
>>>> like:
>>>> hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
>>>> If you want to do it programmatically, then you can use the
>>>> FileUtil.copyMerge API, like:
>>>> FileUtil.copyMerge(FileSystem of source(hdfs), /output-location,
>>>> FileSystem of destination(hdfs), Path to the merged files /merged-output,
>>>> true(to delete the original dir), Configuration, null)
>>>> Thanks
>>>> Best Regards
>>>> On Sat, Feb 14, 2015 at 2:18 AM, Su She <> wrote:
>>>>> Thanks Akhil for the suggestion, it is now only giving me one
>>>>> part-xxxx. Is there any way I can just create a file rather than a
>>>>> directory? It doesn't seem like there is just a saveAsTextFile option for
>>>>> JavaPairRecieverDstream.
>>>>> Also, for the copy/merge API, how would I add that to my Spark job?
>>>>> Thanks Akhil!
>>>>> Best,
>>>>> Su
>>>>> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das <
>>>>>> wrote:
>>>>>> For a streaming application, every batch creates a new directory and
>>>>>> puts the data in it. If you don't want to have multiple files inside
>>>>>> the directory as part-xxxx, then you can do a repartition before the
>>>>>> saveAs* call.
>>>>>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>>>>>> String.class, (Class) TextOutputFormat.class);
>>>>>> Thanks
>>>>>> Best Regards
>>>>>> On Fri, Feb 13, 2015 at 11:59 AM, Su She <>
>>>>>> wrote:
>>>>>>> Hello Everyone,
>>>>>>> I am writing simple word counts to hdfs using
>>>>>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>>>>>>> String.class, (Class) TextOutputFormat.class);
>>>>>>> 1) However, every 2 seconds I am getting a new *directory* that is
>>>>>>> titled as a csv. So I'll have test.csv, which will be a directory that
>>>>>>> has two files inside of it called part-00000 and part-00001 (something
>>>>>>> like that). This obviously makes it very hard for me to read the data
>>>>>>> stored in the csv files. I am wondering if there is a better way to store
>>>>>>> the JavaPairRecieverDStream and JavaPairDStream?
>>>>>>> 2) I know there is a copy/merge hadoop api for merging files...can
>>>>>>> this be done inside java? I am not sure of the logic behind this api if
>>>>>>> I am using spark streaming, which is constantly making new files.
>>>>>>> Thanks a lot for the help!
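
To make the merge step discussed in this thread concrete, here is a minimal sketch on the local filesystem, using only java.nio. It is not the Hadoop API: the class name PartFileMerger and the method mergeParts are hypothetical names chosen for illustration, and the sketch assumes the per-batch directories have already been copied to local disk. It only shows the idea behind getmerge/copyMerge, which is to concatenate every part-xxxx file into one output file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop or Spark.
final class PartFileMerger {

    private PartFileMerger() {}

    /**
     * Concatenates every regular file whose name starts with "part-" found
     * under rootDir (including per-batch sub-directories) into one merged file.
     * Files are appended in sorted path order so the result is deterministic.
     * Marker files such as _SUCCESS are skipped by the name filter.
     *
     * @return the total number of bytes written to the merged file
     */
    static long mergeParts(Path rootDir, Path merged) throws IOException {
        List<Path> parts;
        // Collect the part files first, so the output file is never picked up
        // by the directory walk while it is being written.
        try (Stream<Path> walk = Files.walk(rootDir)) {
            parts = walk.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().startsWith("part-"))
                        .sorted()
                        .collect(Collectors.toList());
        }

        long bytes = 0;
        // With no options given, newOutputStream creates or truncates the file.
        try (OutputStream out = Files.newOutputStream(merged)) {
            for (Path part : parts) {
                bytes += Files.copy(part, out);
            }
        }
        return bytes;
    }
}
```

A call such as PartFileMerger.mergeParts(Paths.get("/tmp/output"), Paths.get("/tmp/merged.csv")) (both paths are placeholders) would then produce a single csv file from all batch directories under /tmp/output. On a real cluster the same idea would go through the Hadoop FileSystem API rather than java.nio.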
