Re: HdfsWordCount only counts some of the words
I guess it is because this example is stateless, so it outputs counts only for the given RDD. Take a look at the stateful word counter, StatefulNetworkWordCount.scala.

On Wed, Sep 24, 2014 at 4:29 AM, SK skrishna...@gmail.com wrote:

> I execute it as follows:
>
> $SPARK_HOME/bin/spark-submit --master <master url> --class
> org.apache.spark.examples.streaming.HdfsWordCount
> target/scala-2.10/spark_stream_examples-assembly-1.0.jar hdfsdir
>
> After I start the job, I add a new test file in hdfsdir. It is a large
> text file which I will not be able to copy here, but it probably has at
> least 100 distinct words. However, the streaming output has only about
> 5-6 words along with their counts, as follows. I then stop the job after
> some time.
>
> Time ...
> (word1, cnt1)
> (word2, cnt2)
> (word3, cnt3)
> (word4, cnt4)
> (word5, cnt5)
>
> Time ...
>
> Time ...
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HdfsWordCount-only-counts-some-of-the-words-tp14929p14967.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
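The stateless-vs-stateful distinction can be sketched without Spark at all. Below is a minimal plain-Scala illustration (no streaming involved; the batch contents are made up for the example): counting each batch on its own, as HdfsWordCount does, resets the counts every micro-batch, while folding each batch into a running state, roughly what updateStateByKey does in StatefulNetworkWordCount, accumulates totals across batches.

```scala
object StatelessVsStateful {
  // Two hypothetical micro-batches of words, as a streaming job
  // might see them across two batch intervals.
  val batches: Seq[Seq[String]] = Seq(
    Seq("spark", "hdfs", "spark"),
    Seq("hdfs", "scala")
  )

  // Stateless, like HdfsWordCount: each batch is counted on its own,
  // so earlier batches contribute nothing to later output.
  def perBatchCounts: Seq[Map[String, Int]] =
    batches.map(_.groupBy(identity).map { case (w, ws) => (w, ws.size) })

  // Stateful, in the spirit of StatefulNetworkWordCount: every batch
  // is merged into the accumulated counts from all previous batches.
  def cumulativeCounts: Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
      batch.foldLeft(state) { (s, w) => s.updated(w, s.getOrElse(w, 0) + 1) }
    }

  def main(args: Array[String]): Unit = {
    println(perBatchCounts)   // counts reset for every batch
    println(cumulativeCounts) // running totals across all batches
  }
}
```

Note that "hdfs" is counted once per batch in the stateless version, but ends up at 2 in the cumulative map.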
Re: HdfsWordCount only counts some of the words
If you look at the code for HdfsWordCount, you will see that it calls print(), which by default prints only the first 10 elements of each RDD. If you are just talking about the console output, then it is not expected to print all the words to begin with.

On Wed, Sep 24, 2014 at 2:29 AM, SK skrishna...@gmail.com wrote:

> I execute it as follows: [...]
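The truncation described above can be sketched in plain Scala (the word list below is made up): with 100 distinct counted words per batch, a take-the-first-10 step, which is effectively what print() does per RDD, hides the other 90.

```scala
object PrintTruncation {
  // 100 distinct hypothetical (word, count) pairs -- far more than
  // the default print() will show for any one batch.
  val counts: Seq[(String, Int)] = (1 to 100).map(i => (s"word$i", i))

  // print() effectively takes the first 10 elements of each RDD
  // before writing to the console, so only 10 pairs are ever shown.
  def printed: Seq[(String, Int)] = counts.take(10)

  def main(args: Array[String]): Unit = {
    println(s"distinct words: ${counts.size}, shown by print(): ${printed.size}")
  }
}
```

To see every pair in a real job, one option is to replace print() with something like `wordCounts.foreachRDD { rdd => rdd.collect().foreach(println) }`; collecting to the driver is only reasonable when the per-batch result is small.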
RE: HdfsWordCount only counts some of the words
I execute it as follows:

$SPARK_HOME/bin/spark-submit --master <master url> --class org.apache.spark.examples.streaming.HdfsWordCount target/scala-2.10/spark_stream_examples-assembly-1.0.jar hdfsdir

After I start the job, I add a new test file in hdfsdir. It is a large text file which I will not be able to copy here, but it probably has at least 100 distinct words. However, the streaming output has only about 5-6 words along with their counts, as follows. I then stop the job after some time.

Time ...
(word1, cnt1)
(word2, cnt2)
(word3, cnt3)
(word4, cnt4)
(word5, cnt5)

Time ...

Time ...