Wordcount is a very common example, so you'll find it in several places in the
Spark documentation and tutorials. Beware, though: they typically tokenize the
text by splitting on whitespace, which leaves you with tokens that still have
trailing commas, periods, and other punctuation attached.
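To see the problem concretely, here's what plain whitespace splitting does to a short sentence (plain Python, no Spark needed):

```python
# str.split() breaks only on whitespace, so punctuation stays glued to words.
text = "Hello, world. It's a test."
print(text.split())
# ['Hello,', 'world.', "It's", 'a', 'test.']
```

"Hello," and "hello" would then count as different words, which is exactly what you don't want.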

You'll also probably want to lowercase your text and drop non-word tokens, and
you may want to filter out stopwords and rare words. Here's an example. It has
been edited a bit without re-running, so no guarantees it will work out of the
box. It's in Python and uses NLTK for tokenization, but if you don't have NLTK
handy you can write a simple tokenizer yourself.
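For instance, a rough regex-based tokenizer (my own sketch, not part of NLTK) is often good enough for wordcount-style preprocessing:

```python
import re

def simple_tokenize(s):
    # Lowercase, then pull out runs of ASCII letters; punctuation and
    # numbers are simply dropped. Cruder than NLTK, but dependency-free.
    return re.findall(r"[a-z]+", s.lower())

print(simple_tokenize("Hello, world. It's a test."))
# ['hello', 'world', 'it', 's', 'a', 'test']
```

Note it splits "It's" into "it" and "s"; a real tokenizer handles contractions more gracefully.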


from nltk.tokenize import TreebankWordTokenizer

# Abbreviated stopword list -- fill in the rest (or use nltk.corpus.stopwords).
stopwords = {"a", "about", "above", "across", "after", ... "yet",
             "you", "your"}

# Build the tokenizer once rather than inside the lambda for every line.
tokenizer = TreebankWordTokenizer()

linesRDD = sc.textFile("path/to/file.txt")
words = (linesRDD
         .flatMap(lambda s: tokenizer.tokenize(s.lower()))
         .filter(lambda w: w.isalpha())          # drop punctuation and numbers
         .filter(lambda w: w not in stopwords))
# Count each word, then drop rare words (count <= 3).
wcounts = (words
           .map(lambda w: (w, 1))
           .reduceByKey(lambda x, y: x + y)
           .filter(lambda x: x[1] > 3))

# Note: wcounts is not sorted.
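If you want to sanity-check the pipeline logic without a Spark cluster, the same lowercase/tokenize/filter/count steps can be mimicked locally in plain Python (toy stopword list and text, a cruder regex tokenizer in place of NLTK):

```python
import re
from collections import Counter

stopwords = {"a", "it", "s", "the"}  # tiny illustrative list
text = "the cat sat, the cat ran. The cat sat."

# Same shape as the Spark job: lowercase -> tokenize -> drop stopwords -> count.
tokens = [w for w in re.findall(r"[a-z]+", text.lower())
          if w not in stopwords]
print(Counter(tokens))
# Counter({'cat': 3, 'sat': 2, 'ran': 1})
```

Counter plays the role of map + reduceByKey here; the rare-word cutoff would be one more filter over the counts.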


> -----Original Message-----
> From: heszak [mailto:hzakerza...@collabware.com]
> Sent: Wednesday, March 18, 2015 1:35 PM
> To: user@spark.apache.org
> Subject: topic modeling using LDA in MLLib
> 
> I'm coming from a Hadoop background but I'm totally new to Apache Spark.
> I'd like to do topic modeling using LDA algorithm on some txt files. The
> example on the Spark website assumes that the input to the LDA is a file
> containing the words counts. I wonder if someone could help me figuring
> out the steps to start from actual txt documents (actual content) and come
> up with the actual topics.
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/topic-modeling-using-LDA-in-MLLib-tp22128.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

