For a streaming application, every batch creates a new directory and writes its data into it. If you don't want multiple part-xxxx files inside each directory, you can repartition down to a single partition before the saveAs* call:
messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
    String.class, String.class, (Class) TextOutputFormat.class);

Thanks
Best Regards

On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com> wrote:

> Hello Everyone,
>
> I am writing simple word counts to HDFS using
> messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
> String.class, (Class) TextOutputFormat.class);
>
> 1) However, every 2 seconds I get a new *directory* that is titled as a
> csv. So I'll have test.csv, which will be a directory that has two files
> inside of it called part-00000 and part-00001 (something like that). This
> obviously makes it very hard for me to read the data stored in the csv
> files. I am wondering if there is a better way to store the
> JavaPairReceiverDStream and JavaPairDStream?
>
> 2) I know there is a copy/merge Hadoop API for merging files... can this
> be done inside Java? I am not sure of the logic behind this API if I am
> using Spark Streaming, which is constantly making new files.
>
> Thanks a lot for the help!
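Regarding the copy/merge question from the original mail: Hadoop's FileUtil.copyMerge can be called from plain Java to concatenate the part-* files of a finished batch directory into one file. A minimal sketch, assuming a Hadoop 2.x classpath (copyMerge was removed in Hadoop 3.x) and hypothetical source/destination paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: one batch output directory and the merged target file.
        Path srcDir = new Path("hdfs:///user/ec2-user/test.csv");
        Path dstFile = new Path("hdfs:///user/ec2-user/merged/test.csv");

        // copyMerge concatenates every file in srcDir into dstFile on the target FS.
        // Available in Hadoop 2.x; removed from FileUtil in Hadoop 3.x.
        FileUtil.copyMerge(fs, srcDir, fs, dstFile,
                false,  // deleteSource: keep the original batch directory
                conf,
                null);  // addString: optional separator written between files
    }
}
```

You would need to run this only on directories whose batch has completed, since Spark Streaming keeps creating new ones; repartition(1) before the save, as above, avoids the merge step entirely at the cost of writing each batch through a single task.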