Re: job hangs when using pipe() with reduceByKey()

2015-11-01 Thread hotdog
Yes, the first version takes only 30 minutes, but with the second method I have waited for 5 hours and it has only finished 10%.

job hangs when using pipe() with reduceByKey()

2015-10-31 Thread hotdog
I ran into a situation: when I use

    val a = rdd.pipe("./my_cpp_program").persist()
    a.count() // just use it to persist a
    val b = a.map(s => (s, 1)).reduceByKey(_ + _).count()

it is so fast, but when I use

    val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey(_ + _).count()

it is so slow.
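A minimal sketch of the two variants (the input RDD and the ./my_cpp_program binary are the poster's placeholders); the only structural difference is whether the pipe() output is materialized with persist()/count() before the shuffle:

    // Variant 1: materialize the pipe() output once, then shuffle the cached data
    val piped = rdd.pipe("./my_cpp_program").persist()
    piped.count()  // forces the external program to run now; its output is cached
    val fast = piped.map(s => (s, 1)).reduceByKey(_ + _).count()

    // Variant 2: no persist -- the pipe() stage is the map side of the shuffle,
    // so task retries or reuse of the lineage rerun the external program
    val slow = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey(_ + _).count()

If the second form is dramatically slower, it is worth checking in the Spark UI whether the pipe stage is being re-executed or whether the shuffle itself is the bottleneck.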

Re: Configuring Spark for reduceByKey on on massive data sets

2015-10-12 Thread hotdog
Hi Daniel, did you solve your problem? I met the same problem when running reduceByKey on massive data on YARN.

run reduceByKey on huge data in spark

2015-06-30 Thread hotdog
I'm running reduceByKey in Spark. My program is the simplest Spark example:

    val counts = textFile.flatMap(line => line.split(" "))
      .repartition(2)
      .map(word => (word, 1))
      .reduceByKey(_ + _, 1)
    counts.saveAsTextFile("hdfs://...")

but it always runs out of memory.
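The second argument to reduceByKey here forces all reduced data through a single partition, and repartition(2) similarly limits map-side parallelism. A sketch of the same word count with more shuffle parallelism, where the partition counts are illustrative rather than tuned values:

    val counts = textFile
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _, 200)   // many reduce partitions instead of 1
    counts.saveAsTextFile("hdfs://...")

    // If only a few output files are wanted, coalesce after the shuffle:
    // counts.coalesce(2).saveAsTextFile("hdfs://...")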

how to read lz4 compressed data using fileStream of spark streaming?

2015-05-13 Thread hotdog
In Spark Streaming, I want to use fileStream to monitor a directory, but the files in that directory are compressed with lz4, so the new lz4 files are not detected by the following code. How can I detect these new files?

    val list_join_action_stream = ssc.fileStream[LongWritable, Text,
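A minimal sketch of how the complete call is often written, assuming Hadoop's Lz4Codec is on the classpath and registered via io.compression.codecs so TextInputFormat can decompress the files; the directory path is a placeholder, and the explicit path filter makes the matching rule for .lz4 files visible:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val list_join_action_stream = ssc.fileStream[LongWritable, Text, TextInputFormat](
        "hdfs:///data/incoming",                          // hypothetical directory
        (path: Path) => path.getName.endsWith(".lz4"),    // only pick up lz4 files
        newFilesOnly = true
      ).map { case (_, text) => text.toString }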

force the kafka consumer process to different machines

2015-05-13 Thread hotdog
I'm using Spark Streaming integrated with spark-streaming-kafka. My Kafka topic has 80 partitions, while my machines have 40 cores. I found that when the job is running, the Kafka consumer processes are deployed to only 2 machines, and the bandwidth of those 2 machines becomes very, very high. I wonder whether there is a way to force the consumer processes onto different machines.
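With the receiver-based integration, one common workaround (a sketch, not the poster's code; the ZooKeeper quorum, group and topic names are placeholders) is to create several smaller receivers and union them, which gives the scheduler more receivers to spread across executors:

    import org.apache.spark.streaming.kafka.KafkaUtils

    // 8 receivers x 10 consumer threads covers the 80 Kafka partitions;
    // each receiver occupies one core on whichever executor it lands on.
    val kafkaStreams = (1 to 8).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181,zk2:2181", "my-group", Map("my-topic" -> 10))
    }
    val unified = ssc.union(kafkaStreams)

Whether the receivers end up on distinct machines still depends on the scheduler and on how many executors are available when the receivers start.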

can we start a new thread in foreachRDD in spark streaming?

2015-05-11 Thread hotdog
I want to start a child thread in foreachRDD. My situation is: the job reads from an HDFS directory continuously, and every 100 batches I want to launch a model-training task (I will make a snapshot of the RDDs at that time and start the training task; the training task takes a very long time (2
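A minimal sketch of that pattern, assuming a hypothetical trainModel() and a placeholder input stream: count batches on the driver and hand the long-running training to a background thread so foreachRDD returns quickly and later batches are not blocked:

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}

    // single-threaded pool on the driver for the long-running training job
    implicit val trainingEc: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

    var batchCount = 0L  // foreachRDD bodies run on the driver, so this counter lives there

    stream.foreachRDD { rdd =>
      batchCount += 1
      if (batchCount % 100 == 0) {
        val snapshot = rdd.persist()   // snapshot of the current batch
        snapshot.count()               // materialize before handing off
        Future {
          trainModel(snapshot)         // hypothetical long-running training call
          snapshot.unpersist()
        }
      }
    }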