For a streaming application, every batch creates a new directory and writes its data into it. If you don't want multiple part-xxxx files inside each directory, you can repartition down to a single partition before the saveAs* call:
messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
    String.class, String.class, (Class) TextOutputFormat.class);

Thanks
Best Regards

On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com> wrote:

> Hello Everyone,
>
> I am writing simple word counts to HDFS using
> messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
> String.class, (Class) TextOutputFormat.class);
>
> 1) However, every 2 seconds I get a new *directory* that is titled as a
> csv. So I'll have test.csv, which will be a directory that has two files
> inside of it called part-00000 and part-00001 (something like that). This
> obviously makes it very hard for me to read the data stored in the csv
> files. I am wondering if there is a better way to store the
> JavaPairReceiverDStream and JavaPairDStream?
>
> 2) I know there is a copy/merge Hadoop API for merging files... can this
> be done inside Java? I am not sure of the logic behind this API if I am
> using Spark Streaming, which is constantly making new files.
>
> Thanks a lot for the help!
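Regarding the copy/merge question from the original mail: Hadoop's FileUtil.copyMerge can be called from plain Java to concatenate the part-* files of a finished batch directory into one file. A minimal sketch, assuming a Hadoop 2.x classpath (copyMerge was removed in Hadoop 3.x) and hypothetical source/destination paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: one batch output directory and the merged target file.
        Path srcDir = new Path("hdfs:///user/ec2-user/test.csv");
        Path dstFile = new Path("hdfs:///user/ec2-user/merged/test.csv");

        // copyMerge concatenates every file in srcDir into dstFile on the target FS.
        // Available in Hadoop 2.x; removed from FileUtil in Hadoop 3.x.
        FileUtil.copyMerge(fs, srcDir, fs, dstFile,
                false,  // deleteSource: keep the original batch directory
                conf,
                null);  // addString: optional separator written between files
    }
}
```

You would need to run this only on directories whose batch has completed, since Spark Streaming keeps creating new ones; repartition(1) before the save, as above, avoids the merge step entirely at the cost of writing each batch through a single task.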