Thanks Akhil for the suggestion; it is now only giving me one part file, part-xxxx. Is there any way I can just create a file rather than a directory? It doesn't seem like there is a saveAsTextFile option for JavaPairReceiverDStream.
Also, for the copy/merge API, how would I add that to my Spark job?

Thanks Akhil!

Best,
Su

On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> For a streaming application, it will create a new directory for every batch
> and put the data in it. If you don't want to have multiple part-xxxx files
> inside the directory, you can do a repartition before the saveAs* call:
>
> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
>     String.class, String.class, (Class) TextOutputFormat.class);
>
> Thanks
> Best Regards
>
> On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com> wrote:
>
>> Hello Everyone,
>>
>> I am writing simple word counts to HDFS using:
>>
>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
>>     String.class, (Class) TextOutputFormat.class);
>>
>> 1) However, every 2 seconds I am getting a new *directory* that is titled
>> as a csv. So I'll have test.csv, which will be a directory with two files
>> inside it called part-00000 and part-00001 (something like that). This
>> obviously makes it very hard for me to read the data stored in the csv
>> files. I am wondering if there is a better way to store the
>> JavaPairReceiverDStream and JavaPairDStream?
>>
>> 2) I know there is a copy/merge Hadoop API for merging files... can this
>> be done inside Java? I am not sure of the logic behind this API if I am
>> using Spark Streaming, which is constantly making new files.
>>
>> Thanks a lot for the help!
>>
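The copy/merge API the thread refers to is Hadoop's FileUtil.copyMerge, which concatenates every file in a source directory into a single destination file (it exists in Hadoop 2.x; it was removed in Hadoop 3, where you would roll your own). In a streaming job it could be invoked after each batch's output directory is written. As a self-contained sketch of the same merge logic in plain Java against the local filesystem (the class name, paths, and sample contents below are hypothetical, for illustration only):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PartFileMerger {

    // Concatenate every part-* file in srcDir into a single dstFile,
    // in lexicographic order (part-00000, part-00001, ...), mirroring
    // what FileUtil.copyMerge does for a directory on HDFS.
    public static void mergeParts(Path srcDir, Path dstFile) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(srcDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        parts.sort(null); // Path is Comparable; sorts part-00000 before part-00001

        try (OutputStream out = Files.newOutputStream(dstFile)) {
            for (Path part : parts) {
                Files.copy(part, out); // append this part's bytes to the merged file
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate one batch's saveAsHadoopFiles output directory.
        Path dir = Files.createTempDirectory("batch-out");
        Files.write(dir.resolve("part-00000"), "hello,1\n".getBytes());
        Files.write(dir.resolve("part-00001"), "world,2\n".getBytes());

        Path merged = dir.resolve("merged.csv");
        mergeParts(dir, merged);
        System.out.print(new String(Files.readAllBytes(merged)));
    }
}
```

On an actual cluster the equivalent call would go against HDFS paths (FileSystem, Path from org.apache.hadoop.fs) rather than java.nio, and since Spark Streaming produces a fresh directory per batch interval, the merge would have to run once per batch, e.g. from a foreachRDD hook or a periodic job that sweeps completed output directories.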