spark git commit: [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide
Repository: spark
Updated Branches:
  refs/heads/master f4fa61eff -> 747c2ba80

[SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide

Add Python example for mllib LDAModel user guide

Author: Yanbo Liang <yblia...@gmail.com>

Closes #8227 from yanboliang/spark-10032.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/747c2ba8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/747c2ba8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/747c2ba8

Branch: refs/heads/master
Commit: 747c2ba8006d5b86f3be8dfa9ace639042a35628
Parents: f4fa61e
Author: Yanbo Liang <yblia...@gmail.com>
Authored: Tue Aug 18 12:56:36 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Tue Aug 18 12:56:36 2015 -0700

----------------------------------------------------------------------
 docs/mllib-clustering.md | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/747c2ba8/docs/mllib-clustering.md

diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index bb875ae..fd9ab25 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -564,6 +564,34 @@ public class JavaLDAExample {
 {% endhighlight %}
 </div>
 
+<div data-lang="python" markdown="1">
+{% highlight python %}
+from pyspark.mllib.clustering import LDA, LDAModel
+from pyspark.mllib.linalg import Vectors
+
+# Load and parse the data
+data = sc.textFile("data/mllib/sample_lda_data.txt")
+parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))
+# Index documents with unique IDs
+corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
+
+# Cluster the documents into three topics using LDA
+ldaModel = LDA.train(corpus, k=3)
+
+# Output topics. Each is a distribution over words (matching word count vectors)
+print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):")
+topics = ldaModel.topicsMatrix()
+for topic in range(3):
+    print("Topic " + str(topic) + ":")
+    for word in range(0, ldaModel.vocabSize()):
+        print(" " + str(topics[word][topic]))
+
+# Save and load model (the trained model is ldaModel, not an undefined `model`)
+ldaModel.save(sc, "myModelPath")
+sameModel = LDAModel.load(sc, "myModelPath")
+{% endhighlight %}
+</div>
+
 </div>

 ## Streaming k-means

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
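Note that `topicsMatrix()` in the example above returns a `vocabSize x k` matrix whose columns hold per-topic word weights (expected counts), so the printed values are not normalized probabilities. A minimal plain-Python sketch of turning such a matrix into per-topic word distributions; the data and the helper name `topic_distributions` are illustrative, not part of the Spark API:

```python
def topic_distributions(topics):
    """Normalize each topic column of a vocabSize x k matrix so it sums to 1."""
    vocab_size = len(topics)
    k = len(topics[0])
    dists = []
    for topic in range(k):
        column = [topics[word][topic] for word in range(vocab_size)]
        total = sum(column)
        # Each topic becomes a probability distribution over the vocabulary
        dists.append([c / total for c in column])
    return dists

# Hypothetical 3-word vocabulary, 3 topics, laid out like topicsMatrix():
# topics[word][topic] holds the weight of `word` in `topic`.
counts = [
    [4.0, 1.0, 0.0],  # word 0
    [2.0, 3.0, 1.0],  # word 1
    [2.0, 0.0, 3.0],  # word 2
]
dists = topic_distributions(counts)
print(dists[0])  # topic 0 -> [0.5, 0.25, 0.25]
```

Each inner list sums to 1.0, so the values can be read directly as "distributions over vocab" in the sense the example's print statement describes.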
spark git commit: [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 80debff12 -> ec7079f9c

[SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide

Add Python example for mllib LDAModel user guide

Author: Yanbo Liang <yblia...@gmail.com>

Closes #8227 from yanboliang/spark-10032.

(cherry picked from commit 747c2ba8006d5b86f3be8dfa9ace639042a35628)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ec7079f9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ec7079f9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ec7079f9

Branch: refs/heads/branch-1.5
Commit: ec7079f9c94cb98efdac6f92b7c85efb0e67492e
Parents: 80debff
Author: Yanbo Liang <yblia...@gmail.com>
Authored: Tue Aug 18 12:56:36 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Tue Aug 18 12:56:43 2015 -0700

----------------------------------------------------------------------
 docs/mllib-clustering.md | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)
----------------------------------------------------------------------

The diff is identical to the master commit 747c2ba8 above.