Okay, got it, thanks for the help Sean!
On Sat, Feb 14, 2015 at 1:08 PM, Sean Owen wrote:
No, they appear as directories + files to everything. Lots of tools
are used to taking an input that is a directory of part files though.
You can certainly point MR, Hive, etc at a directory of these files.
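A Hadoop-free sketch of what "pointing a tool at a directory of part files" amounts to: downstream consumers simply enumerate the part-* files in sorted order and treat their concatenation as one logical dataset. The class name, paths, and sample rows below are invented for illustration; real tools go through Hadoop's InputFormat machinery, not java.nio.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: treats a directory of part-* files as one dataset,
// the way MR/Hive conceptually consume Hadoop-style output directories.
public class PartFileReader {
    // Concatenate the lines of every part-* file, in sorted file-name order.
    static List<String> readAllParts(Path dir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, "part-*")) {
            ds.forEach(parts::add);
        }
        parts.sort(Comparator.comparing(Path::getFileName));
        List<String> lines = new ArrayList<>();
        for (Path p : parts) {
            lines.addAll(Files.readAllLines(p));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("foo-1001.csv");
        Files.write(dir.resolve("part-00000"), List.of("a,1", "b,2"));
        Files.write(dir.resolve("part-00001"), List.of("c,3"));
        System.out.println(readAllParts(dir));  // [a,1, b,2, c,3]
    }
}
```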
On Sat, Feb 14, 2015 at 9:05 PM, Su She wrote:
Thanks Sean and Akhil! I will take out the repartition(1). Please let me
know if I understood this correctly, Spark Streaming writes data like this:
foo-1001.csv/part-xxxxx, part-xxxxx
foo-1002.csv/part-xxxxx, part-xxxxx
When I see this on Hue, the csv's appear to me as *directories*…
Keep in mind that if you repartition to 1 partition, you are only
using 1 task to write the output, and potentially only 1 task to
compute some parent RDDs. You lose parallelism. The
files-in-a-directory output scheme is standard for Hadoop and for a
reason.
Therefore I would consider separating…
http://stackoverflow.com/questions/23527941/how-to-write-to-csv-in-spark
Just read this...seems like it should be easily readable. Thanks!
On Sat, Feb 14, 2015 at 1:36 AM, Su She wrote:
Thanks Akhil for the link. Is there a reason why there is a new directory
created for each batch? Is this a format that is easily readable by other
applications such as hive/impala?
On Sat, Feb 14, 2015 at 1:28 AM, Akhil Das wrote:
You can directly write to HBase with Spark. Here's an example for doing
that: https://issues.apache.org/jira/browse/SPARK-944
Thanks
Best Regards
On Sat, Feb 14, 2015 at 2:55 PM, Su She wrote:
Hello Akhil, thank you for your continued help!
1) So, if I can write it programmatically after every batch, then
technically I should be able to have just the csv files in one directory.
However, can the /desired/output/file.txt be in HDFS? If it is only local,
I am not sure if it will help me.
Simplest way would be to merge the output files at the end of your job like:
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
If you want to do it programmatically, then you can use the
FileUtil.copyMerge API like:
FileUtil.copyMerge(FileSystem of source (hdfs), /output/dir/on/hdfs/,
FileSystem of destination, /desired/output/file.txt, deleteSource, conf, null)
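For illustration, here is a minimal, Hadoop-free sketch of what FileUtil.copyMerge does: concatenate every part-* file of a source directory into a single destination file. The class name and temp paths are invented; a real job would go through the Hadoop FileSystem API rather than java.nio.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hadoop-free sketch of FileUtil.copyMerge's behavior:
// append the bytes of each part-* file, in order, to one output file.
public class CopyMergeSketch {
    static void mergeParts(Path srcDir, Path dstFile) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(srcDir, "part-*")) {
            ds.forEach(parts::add);
        }
        parts.sort(Comparator.comparing(Path::getFileName));
        try (OutputStream out = Files.newOutputStream(dstFile)) {
            for (Path p : parts) {
                Files.copy(p, out);  // append this part's bytes
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("out.csv");
        Files.writeString(dir.resolve("part-00000"), "a,1\n");
        Files.writeString(dir.resolve("part-00001"), "b,2\n");
        Path merged = dir.resolve("merged.txt");
        mergeParts(dir, merged);
        System.out.print(Files.readString(merged));  // a,1 then b,2, one per line
    }
}
```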
Thanks Akhil for the suggestion, it is now only giving me one part-xxxxx file.
Is there any way I can just create a file rather than a directory? It
doesn't seem like there is a plain saveAsTextFile option for
JavaPairReceiverInputDStream.
Also, for the copy/merge API, how would I add that to my Spark job?
For a streaming application, for every batch it will create a new directory
and put the data in it. If you don't want to have multiple files inside
the directory as part- then you can do a repartition before the saveAs*
call.
messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
String.class, String.class, (Class) TextOutputFormat.class);
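As a side note on the naming Su She is seeing below: for each batch, saveAsHadoopFiles builds the output directory name as prefix-&lt;batch time in ms&gt;.suffix, which is why a "csv" suffix ends up on directory names. A tiny sketch of that pattern (the helper name is invented, only the naming scheme is Spark's):

```java
// Illustrates the directory-name pattern Spark Streaming's
// saveAsHadoopFiles(prefix, suffix) produces per batch:
// "<prefix>-<batch time in ms>.<suffix>", with part files inside it.
public class BatchDirNames {
    static String batchDirName(String prefix, long batchTimeMs, String suffix) {
        return prefix + "-" + batchTimeMs + "." + suffix;
    }

    public static void main(String[] args) {
        // A directory, not a file, despite the .csv suffix:
        System.out.println(batchDirName("test", 1423900800000L, "csv"));
        // test-1423900800000.csv
    }
}
```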
Hello Everyone,
I am writing simple word counts to hdfs using
messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
String.class, (Class) TextOutputFormat.class);
1) However, every 2 seconds I am getting a new *directory* that is titled as
a csv. So I'll have test.csv, which will be…