Re: Reading kafka stream and writing to hdfs

2015-09-30 Thread Akhil Das
Like: counts.saveAsTextFiles("hdfs://host:port/some/location") Thanks Best Regards On Tue, Sep 29, 2015 at 2:15 AM, Chengi Liu wrote: > Hi, > I am going thru this example here: >

Reading kafka stream and writing to hdfs

2015-09-28 Thread Chengi Liu
Hi, I am going through this example here: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py If I want to write this data to HDFS, what's the right way to do it? Thanks
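A minimal sketch of what Akhil suggests, written against the Spark 1.x-era Java streaming API (the linked example is Python; the ZooKeeper address, topic name, consumer group, and HDFS path below are placeholders, not from the thread — `saveAsTextFiles` writes one directory per batch):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

import scala.Tuple2;

public class KafkaToHdfs {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("KafkaToHdfs");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Receiver-based Kafka stream (spark-streaming-kafka, Spark 1.x era).
        Map<String, Integer> topics = new HashMap<>();
        topics.put("wordcount-topic", 1); // topic -> number of receiver threads
        JavaPairReceiverInputDStream<String, String> messages =
                KafkaUtils.createStream(jssc, "zkhost:2181", "wordcount-group", topics);

        JavaPairDStream<String, Integer> counts = messages
                .map(Tuple2::_2)
                .flatMap(line -> Arrays.asList(line.split(" ")))
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // Each batch lands in a directory named counts-<batch timestamp>.txt
        counts.dstream().saveAsTextFiles("hdfs://host:port/some/location/counts", "txt");

        jssc.start();
        jssc.awaitTermination();
    }
}
```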

Re: Writing to HDFS

2015-08-04 Thread Akhil Das
Just to add rdd.take(1) won't trigger the entire computation, it will just pull out the first record. You need to do a rdd.count() or rdd.saveAs*Files to trigger the complete pipeline. How many partitions do you see in the last stage? Thanks Best Regards On Tue, Aug 4, 2015 at 7:10 AM, ayan guha
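Akhil's point about laziness, as a sketch (Spark core Java API, local master assumed for illustration): `take(1)` computes only as many partitions as needed to return one record, while `count()` or a `saveAs*File` action forces the full pipeline.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazinessDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[2]", "LazinessDemo");
        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4), 4)
                .map(x -> { System.out.println("computing " + x); return x * x; });

        // take(1) only evaluates enough partitions to yield one record...
        rdd.take(1);
        // ...whereas count() runs the map over every partition.
        rdd.count();
        sc.stop();
    }
}
```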

Re: Writing to HDFS

2015-08-03 Thread ayan guha
Is your data skewed? What happens if you do rdd.count()? On 4 Aug 2015 05:49, Jasleen Kaur jasleenkaur1...@gmail.com wrote: I am executing a spark job on a cluster in yarn-client mode (yarn-cluster is not an option due to permission issues). - num-executors 800 - spark.akka.frameSize=1024

Writing to HDFS

2015-08-03 Thread Jasleen Kaur
I am executing a spark job on a cluster in yarn-client mode (yarn-cluster is not an option due to permission issues). - num-executors 800 - spark.akka.frameSize=1024 - spark.default.parallelism=25600 - driver-memory=4G - executor-memory=32G. - My input size is around 1.5TB. My problem

Re: writing to hdfs on master node much faster

2015-04-20 Thread Sean Owen
as opposed to 1.2 min for the slaves). Any suggestion what the reason might be? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/writing-to-hdfs-on-master-node-much-faster-tp22570.html Sent from the Apache Spark User List mailing list archive

Re: writing to hdfs on master node much faster

2015-04-20 Thread Tamas Jambor
to 1.2 min for the slaves). Any suggestion what the reason might be? thanks,

RE: writing to hdfs on master node much faster

2015-04-20 Thread Evo Eftimov
on the other 2 nodes -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Monday, April 20, 2015 12:57 PM To: jamborta Cc: user@spark.apache.org Subject: Re: writing to hdfs on master node much faster What machines are HDFS data nodes -- just your master? that would explain

writing to hdfs on master node much faster

2015-04-20 Thread jamborta

Re: bulk writing to HDFS in Spark Streaming?

2015-02-19 Thread Akhil Das
There was already a thread around it if i understood your question correctly, you can go through this https://mail-archives.apache.org/mod_mbox/spark-user/201502.mbox/%3ccannjawtrp0nd3odz-5-_ya351rin81q-9+f2u-qn+vruqy+...@mail.gmail.com%3E Thanks Best Regards On Thu, Feb 19, 2015 at 8:16 PM,

bulk writing to HDFS in Spark Streaming?

2015-02-19 Thread Chico Qi
Hi all, In Spark Streaming I want to use DStream.saveAsTextFiles with bulk writing, because the normal saveAsTextFiles cannot finish within the configured batch interval. Maybe a common pool of writers, or a separate worker assigned to bulk writing? Thanks! B/R Jichao
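The thread gets no direct answer here; one common workaround, sketched under the assumption that many small per-batch files are the bottleneck (window length, partition count, and output path are placeholders), is to widen the window and coalesce before writing so each write covers several batches:

```java
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;

public class BulkWrite {
    // Widen the window so each write covers several batches, then coalesce
    // so each window produces a few files instead of one per partition.
    static void writeInBulk(JavaDStream<String> lines) {
        lines.window(Durations.minutes(5), Durations.minutes(5))
             .foreachRDD((rdd, time) -> {
                 if (!rdd.isEmpty()) {
                     rdd.coalesce(4)
                        .saveAsTextFile("hdfs:///bulk/out-" + time.milliseconds());
                 }
             });
    }
}
```

The trade-off is latency: data only reaches HDFS once per window rather than once per batch.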

Re: Writing to HDFS from spark Streaming

2015-02-16 Thread Sean Owen
PS this is the real fix to this issue: https://issues.apache.org/jira/browse/SPARK-5795 I'd like to merge it as I don't think it breaks the API; it actually fixes it to work as intended. On Mon, Feb 16, 2015 at 3:25 AM, Bahubali Jain bahub...@gmail.com wrote: I used the latest assembly jar and

Re: Writing to HDFS from spark Streaming

2015-02-15 Thread Bahubali Jain
I used the latest assembly jar and the below as suggested by Akhil to fix this problem... temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class, (Class) TextOutputFormat.class); Thanks All for the help ! On Wed, Feb 11, 2015 at 1:38 PM, Sean Owen so...@cloudera.com wrote: That

Re: Writing to HDFS from spark Streaming

2015-02-11 Thread Sean Owen
That kinda dodges the problem by ignoring generic types. But it may be simpler than the 'real' solution, which is a bit ugly. (But first, to double check, are you importing the correct TextOutputFormat? there are two versions. You use .mapred. with the old API and .mapreduce. with the new API.)
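Sean's two points side by side, as a hedged sketch (Spark 1.x-era `JavaPairDStream` signatures; paths are placeholders): each Hadoop API has its own `TextOutputFormat`, and each pairs with a different save method. The raw `(Class)` cast is the workaround for the generic-signature mismatch tracked in SPARK-5795.

```java
import org.apache.spark.streaming.api.java.JavaPairDStream;

public class OutputFormatSketch {
    // Old-API ('.mapred.') TextOutputFormat pairs with saveAsHadoopFiles;
    // new-API ('.mapreduce.') pairs with saveAsNewAPIHadoopFiles. The raw
    // (Class) cast sidesteps the generics mismatch (SPARK-5795).
    @SuppressWarnings({"unchecked", "rawtypes"})
    static void save(JavaPairDStream<String, String> pairs) {
        pairs.saveAsHadoopFiles("hdfs:///out/oldapi", "txt",
                String.class, String.class,
                (Class) org.apache.hadoop.mapred.TextOutputFormat.class);
        pairs.saveAsNewAPIHadoopFiles("hdfs:///out/newapi", "txt",
                String.class, String.class,
                (Class) org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);
    }
}
```

Importing the wrong `TextOutputFormat` for the save method you call is what produces the compile errors discussed in these threads.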

Spark doesn't retry task while writing to HDFS

2014-10-24 Thread Aniket Bhatnagar
to know is why didn't Spark retry writing file to HDFS? It just shows it as failed job in Spark UI. Error: java.io.IOException: All datanodes x.x.x.x: are bad. Aborting... org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1128
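The thread leaves the question open. As a hedged, related note: ordinary task failures are retried up to `spark.task.maxFailures` times (a real config key; default 4, the value below is illustrative), but a failure in the HDFS write pipeline outside a task, e.g. on the driver, would not be covered by that retry path.

```java
import org.apache.spark.SparkConf;

public class RetryConf {
    public static void main(String[] args) {
        // spark.task.maxFailures controls how many times a failed task is
        // re-attempted before the stage, and hence the job, is failed.
        SparkConf conf = new SparkConf()
                .setAppName("RetryConf")
                .set("spark.task.maxFailures", "8");
        System.out.println(conf.get("spark.task.maxFailures"));
    }
}
```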

Re: Spark Streaming writing to HDFS

2014-10-05 Thread Sean Owen
On Sat, Oct 4, 2014 at 5:28 PM, Abraham Jacob abe.jac...@gmail.com wrote: import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; Good. There is also a

Re: Spark Streaming writing to HDFS

2014-10-04 Thread Sean Owen
Are you importing the '.mapred.' version of TextOutputFormat instead of the new API '.mapreduce.' version? On Sat, Oct 4, 2014 at 1:08 AM, Abraham Jacob abe.jac...@gmail.com wrote: Hi All, Would really appreciate if someone in the community can help me with this. I have a simple Java spark

Re: Spark Streaming writing to HDFS

2014-10-04 Thread Abraham Jacob
Hi Sean/All, I am importing among various other things the newer mapreduce version - import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; import

Spark Streaming writing to HDFS

2014-10-03 Thread Abraham Jacob
Hi All, Would really appreciate if someone in the community can help me with this. I have a simple Java spark streaming application - NetworkWordCount SparkConf sparkConf = new SparkConf().setMaster("yarn-cluster").setAppName("Streaming WordCount"); JavaStreamingContext jssc = new
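Abraham's snippet is cut off in the archive preview; a completed sketch of the same NetworkWordCount setup (Spark 1.x-era Java API; the batch interval, host, and port are assumptions):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf sparkConf = new SparkConf()
                .setMaster("yarn-cluster").setAppName("Streaming WordCount");
        JavaStreamingContext jssc =
                new JavaStreamingContext(sparkConf, Durations.seconds(2));

        // Count words arriving on a socket, one batch at a time.
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")))
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```

The follow-up replies in this thread concern which `TextOutputFormat` to pass when saving these counts to HDFS.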