Re: HdfsWordCount only counts some of the words
I guess it is because this example is stateless, so it outputs counts only for the given RDD. Take a look at the stateful word counter, StatefulNetworkWordCount.scala.

On Wed, Sep 24, 2014 at 4:29 AM, SK skrishna...@gmail.com wrote:

> I execute it as follows:
>
> $SPARK_HOME/bin/spark-submit --master <master url> --class
> org.apache.spark.examples.streaming.HdfsWordCount
> target/scala-2.10/spark_stream_examples-assembly-1.0.jar hdfsdir
>
> After I start the job, I add a new test file in hdfsdir. It is a large
> text file which I will not be able to copy here, but it probably has at
> least 100 distinct words. However, the streaming output has only about
> 5-6 words along with their counts, as follows. I then stop the job after
> some time.
>
> Time ...
> (word1, cnt1)
> (word2, cnt2)
> (word3, cnt3)
> (word4, cnt4)
> (word5, cnt5)
>
> Time ...
>
> Time ...
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HdfsWordCount-only-counts-some-of-the-words-tp14929p14967.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
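The stateless-vs-stateful distinction can be sketched without Spark at all. Below is a minimal plain-Scala illustration (no streaming involved; the batch contents are made up for the example): counting each batch on its own, as HdfsWordCount does, resets the counts every micro-batch, while folding each batch into a running state, roughly what updateStateByKey does in StatefulNetworkWordCount, accumulates totals across batches.

```scala
object StatelessVsStateful {
  // Two hypothetical micro-batches of words, as a streaming job
  // might see them across two batch intervals.
  val batches: Seq[Seq[String]] = Seq(
    Seq("spark", "hdfs", "spark"),
    Seq("hdfs", "scala")
  )

  // Stateless, like HdfsWordCount: each batch is counted on its own,
  // so earlier batches contribute nothing to later output.
  def perBatchCounts: Seq[Map[String, Int]] =
    batches.map(_.groupBy(identity).map { case (w, ws) => (w, ws.size) })

  // Stateful, in the spirit of StatefulNetworkWordCount: every batch
  // is merged into the accumulated counts from all previous batches.
  def cumulativeCounts: Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
      batch.foldLeft(state) { (s, w) => s.updated(w, s.getOrElse(w, 0) + 1) }
    }

  def main(args: Array[String]): Unit = {
    println(perBatchCounts)   // counts reset for every batch
    println(cumulativeCounts) // running totals across all batches
  }
}
```

Note that "hdfs" is counted once per batch in the stateless version, but ends up at 2 in the cumulative map.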
Re: HdfsWordCount only counts some of the words
If you look at the code for HdfsWordCount, you will see that it calls print(), which by default prints only the first 10 elements of each RDD. If you are just talking about the console output, then it is not expected to print all the words to begin with.

On Wed, Sep 24, 2014 at 2:29 AM, SK skrishna...@gmail.com wrote:

> I execute it as follows: [...]
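The truncation described above can be sketched in plain Scala (the word list below is made up): with 100 distinct counted words per batch, a take-the-first-10 step, which is effectively what print() does per RDD, hides the other 90.

```scala
object PrintTruncation {
  // 100 distinct hypothetical (word, count) pairs -- far more than
  // the default print() will show for any one batch.
  val counts: Seq[(String, Int)] = (1 to 100).map(i => (s"word$i", i))

  // print() effectively takes the first 10 elements of each RDD
  // before writing to the console, so only 10 pairs are ever shown.
  def printed: Seq[(String, Int)] = counts.take(10)

  def main(args: Array[String]): Unit = {
    println(s"distinct words: ${counts.size}, shown by print(): ${printed.size}")
  }
}
```

To see every pair in a real job, one option is to replace print() with something like `wordCounts.foreachRDD { rdd => rdd.collect().foreach(println) }`; collecting to the driver is only reasonable when the per-batch result is small.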
RE: HdfsWordCount only counts some of the words
I execute it as follows:

$SPARK_HOME/bin/spark-submit --master <master url> --class org.apache.spark.examples.streaming.HdfsWordCount target/scala-2.10/spark_stream_examples-assembly-1.0.jar hdfsdir

After I start the job, I add a new test file in hdfsdir. It is a large text file which I will not be able to copy here, but it probably has at least 100 distinct words. However, the streaming output has only about 5-6 words along with their counts, as follows. I then stop the job after some time.

Time ...
(word1, cnt1)
(word2, cnt2)
(word3, cnt3)
(word4, cnt4)
(word5, cnt5)

Time ...

Time ...