Let me first clarify a couple of things.

1. If you just want to count the words in a single text file like Don Quixote (that is, not a stream of data), you should use plain Spark, without Spark Streaming. If you are not super comfortable with Java, I strongly recommend using the Scala API or PySpark. Scala may be a little trickier to learn if you have never used it, but it is worth it. In Scala, the program to count the frequency of each word in a text file would look like this.

    val sc = new SparkContext(...)
    val linesInFile = sc.textFile("path_to_file")
    val words = linesInFile.flatMap(line => line.split(" "))
    val frequencies = words.map(word => (word, 1L)).reduceByKey(_ + _)
    // collect() is costly if the file is large
    println("Word frequencies = " + frequencies.collect().mkString(", "))

2. Let me assume that you want to read a stream of text over the network and then write the total number of words to a file. Note that this is the "total number of words" and not the "frequency of each word". The Java version would be something like this.

    // count() yields a stream with one Long per batch
    JavaDStream<Long> totalCounts = words.count();
    totalCounts.foreachRDD(new Function2<JavaRDD<Long>, Time, Void>() {
      @Override
      public Void call(JavaRDD<Long> rdd, Time time) throws Exception {
        // read the count from the RDD of this batch
        Long totalCount = rdd.first();
        // print to screen
        System.out.println(totalCount);
        // append count to file ...
        return null;
      }
    });

This counts how many words have been received in each batch. The Scala version is much simpler to read.

    words.count().foreachRDD(rdd => {
      val totalCount = rdd.first()
      // print to screen
      println(totalCount)
      // append count to file ...
    })

Hope this helps! I apologize if the code doesn't compile; I didn't test it for syntax and such.
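
In case the difference between the two counts is still unclear, here is a small plain-Java sketch, without Spark, of what each one computes on an in-memory list of lines. The class and method names (WordCountSketch, words, frequencies, totalWords) are made up for illustration and are not part of any Spark API. It also assumes splitting on any run of whitespace rather than on a single space.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only (not a Spark API).
final class WordCountSketch {

    // Split every line into words, similar in spirit to
    // flatMap(line => line.split(" ")), but dropping empty tokens.
    public static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    // "Frequency of each word": one count per distinct word,
    // the same result shape as map(word => (word, 1L)).reduceByKey(_ + _).
    public static Map<String, Long> frequencies(List<String> lines) {
        return words(lines).stream()
                .collect(Collectors.groupingBy(
                        word -> word, TreeMap::new, Collectors.counting()));
    }

    // "Total number of words": a single number for the whole input,
    // which is what count() gives per batch in the streaming case.
    public static long totalWords(List<String> lines) {
        return words(lines).size();
    }

    public static void main(String[] args) {
        List<String> lines =
                Arrays.asList("to be or not to be", "that is the question");
        System.out.println("Word frequencies = " + frequencies(lines));
        System.out.println("Total words = " + totalWords(lines));
    }
}
```

The first method returns a map with one entry per distinct word, while the second returns a single number, which is the total you were asking about.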

TD

On Thu, Jan 30, 2014 at 8:12 AM, Eduardo Costa Alfaia <e.costaalf...@unibs.it> wrote:

> Hi Guys,
>
> I'm not very good like java programmer, so anybody could me help with this
> code piece from JavaNetworkWordcount:
>
> JavaPairDStream<String, Integer> wordCounts = words.map(
>   new PairFunction<String, String, Integer>() {
>     @Override
>     public Tuple2<String, Integer> call(String s) throws Exception {
>       return new Tuple2<String, Integer>(s, 1);
>     }
>   }).reduceByKey(new Function2<Integer, Integer, Integer>() {
>     @Override
>     public Integer call(Integer i1, Integer i2) throws Exception {
>       return i1 + i2;
>     }
>   });
>
> JavaPairDStream<String, Integer> counts =
>   wordCounts.reduceByKeyAndWindow(
>     new Function2<Integer, Integer, Integer>() {
>       public Integer call(Integer i1, Integer i2) { return i1 + i2; }
>     },
>     new Function2<Integer, Integer, Integer>() {
>       public Integer call(Integer i1, Integer i2) { return i1 - i2; }
>     },
>     new Duration(60 * 5 * 1000),
>     new Duration(1 * 1000)
>   );
>
> I would like to think a manner of counting and after summing and getting a
> total from words counted in a single file, for example a book in txt
> extension Don Quixote. The counts function give me the resulted from each
> word has found and not a total of words from the file.
> Tathagata has sent me a piece from scala code, Thanks Tathagata by your
> attention with my posts I am very thankfully,
>
> yourDStream.foreachRDD(rdd => {
>
>   // Get and print first n elements
>   val firstN = rdd.take(n)
>   println("First N elements = " + firstN)
>
>   // Count the number of elements in each batch
>   println("RDD has " + rdd.count() + " elements")
>
> })
>
> yourDStream.count.print()
>
> Could anybody help me?
>
>
> Thanks Guys