Re: Why are there different parts in my CSV?

2015-02-14 Thread Akhil Das
You can directly write to hbase with Spark. Here's an example for doing that: https://issues.apache.org/jira/browse/SPARK-944 Thanks Best Regards On Sat, Feb 14, 2015 at 2:55 PM, Su She suhsheka...@gmail.com wrote: Hello Akhil, thank you for your continued help! 1) So, if I can write it in

Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
http://stackoverflow.com/questions/23527941/how-to-write-to-csv-in-spark Just read this...seems like it should be easily readable. Thanks! On Sat, Feb 14, 2015 at 1:36 AM, Su She suhsheka...@gmail.com wrote: Thanks Akhil for the link. Is there a reason why there is a new directory created

Re: Why are there different parts in my CSV?

2015-02-14 Thread Akhil Das
The simplest way would be to merge the output files at the end of your job, like: hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt If you want to do it programmatically, then you can use the FileUtil.copyMerge API, like: FileUtil.copyMerge(FileSystem of source(hdfs),
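As a rough illustration of what a getmerge-style step does, the sketch below concatenates the part- files of one batch directory into a single file. It uses only java.nio on a local filesystem and is not the Hadoop FileUtil.copyMerge API. The class name PartFileMerger, the method name mergeParts, and the choice to skip files that do not start with "part-" (such as _SUCCESS markers) are assumptions made for the example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Local-filesystem sketch of a getmerge-style step (hypothetical helper, not a Hadoop class). */
final class PartFileMerger {

    /** Concatenates every "part-*" file in sourceDir, in file-name order, into target. */
    static void mergeParts(Path sourceDir, Path target) throws IOException {
        List<Path> parts;
        // Collect the part files first so the directory handle is closed before writing.
        try (Stream<Path> entries = Files.list(sourceDir)) {
            parts = entries
                    .filter(p -> p.getFileName().toString().startsWith("part-"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        // Append the raw bytes of each part to the target, creating or truncating it.
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }
}
```

Calling PartFileMerger.mergeParts(batchDir, mergedFile) once per batch directory would leave one plain file per batch. The target should not be a part- file inside the same directory, or it would be picked up as input.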

Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
Hello Akhil, thank you for your continued help! 1) So, if I can write it in programmatically after every batch, then technically I should be able to have just the csv files in one directory. However, can the /desired/output/file.txt be in hdfs? If it is only local, I am not sure if it will help me

Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
Thanks Akhil for the link. Is there a reason why there is a new directory created for each batch? Is this a format that is easily readable by other applications such as hive/impala? On Sat, Feb 14, 2015 at 1:28 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can directly write to hbase

Re: Why are there different parts in my CSV?

2015-02-14 Thread Sean Owen
Keep in mind that if you repartition to 1 partition, you are only using 1 task to write the output, and potentially only 1 task to compute some parent RDDs. You lose parallelism. The files-in-a-directory output scheme is standard for Hadoop and for a reason. Therefore I would consider separating

Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
Thanks Sean and Akhil! I will take out the repartition(1). Please let me know if I understood this correctly. Spark Streaming writes data like this: foo-1001.csv/part-x, part-x foo-1002.csv/part-x, part-x When I see this on Hue, the csv's appear to me as *directories*,

Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
Okay, got it, thanks for the help Sean! On Sat, Feb 14, 2015 at 1:08 PM, Sean Owen so...@cloudera.com wrote: No, they appear as directories + files to everything. Lots of tools are used to taking an input that is a directory of part files though. You can certainly point MR, Hive, etc at a

Re: Why are there different parts in my CSV?

2015-02-14 Thread Sean Owen
No, they appear as directories + files to everything. Lots of tools are used to taking an input that is a directory of part files though. You can certainly point MR, Hive, etc at a directory of these files. On Sat, Feb 14, 2015 at 9:05 PM, Su She suhsheka...@gmail.com wrote: Thanks Sean and
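To make the "directory of part files as one input" idea concrete, here is a minimal sketch of a reader that treats a directory as a single dataset. It is a local-filesystem illustration only, not how MR or Hive are implemented. The names PartDirectoryReader and readAllLines are hypothetical, and reading the parts in file-name order is an assumption of the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Reads a directory of part files as if it were one file (hypothetical helper). */
final class PartDirectoryReader {

    /** Returns the lines of every "part-*" file in dir, visiting files in name order. */
    static List<String> readAllLines(Path dir) throws IOException {
        List<Path> parts = new ArrayList<>();
        // The glob skips marker files such as _SUCCESS that carry no records.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path part : stream) {
                parts.add(part);
            }
        }
        Collections.sort(parts);

        List<String> lines = new ArrayList<>();
        for (Path part : parts) {
            lines.addAll(Files.readAllLines(part, StandardCharsets.UTF_8));
        }
        return lines;
    }
}
```

The caller passes the directory (for example the foo-1001.csv directory from a batch) rather than an individual part file, which mirrors how such tools are pointed at the output.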

Re: Why are there different parts in my CSV?

2015-02-13 Thread Su She
Thanks Akhil for the suggestion, it is now only giving me one part- file. Is there any way I can just create a file rather than a directory? It doesn't seem like there is just a saveAsTextFile option for JavaPairRecieverDstream. Also, for the copyMerge API, how would I add that to my spark job?

Why are there different parts in my CSV?

2015-02-12 Thread Su She
Hello Everyone, I am writing simple word counts to hdfs using messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class, String.class, (Class) TextOutputFormat.class); 1) However, every 2 seconds I am getting a new *directory* that is titled as a csv. So I'll have test.csv, which will be a

Re: Why are there different parts in my CSV?

2015-02-12 Thread Akhil Das
For a streaming application, every batch creates a new directory and puts its data there. If you don't want multiple part- files inside the directory, you can do a repartition before the saveAs* call.