Adding dev@. What are the topics identified? Do they have real-life explanations?
What happens with 3 or more topics?

--Srinath

On Wed, Dec 16, 2015 at 2:00 PM, Sinnathamby Mahesan <sinnatha...@wso2.com> wrote:

> Dear Srinath / Nirmal
>
> This is about the program for Apache Spark LDA modelling:
>
> - The data file 'twittersGL.txt' is used, treating each tweet as a document.
> - Stop words are removed.
> - The word frequency in each document is computed.
> - The prescribed numbers of most- and least-frequent words are removed.
> - A new vocabulary list is constructed (after removal of those words).
> - Using the frequency counts in each document and the new vocabulary list, the input data file for the Apache Spark LDA model is constructed.
> - The data file is passed to the LDA algorithm along with other parameters, such as the number of topics.
> - The LDA algorithm produces LDA scores: vocab-size x number-of-topics entries.
> - These entries (saved in a file) should be passed to a data visualisation application for decision making.
>
> Tried with LibreOffice Calc for three out of five topics: the chart leads us to conclude that two topics can be identified, while the third is a kind of mixture of the other two.
>
> Here is a brief description of the variables, listed in processing order; I believe it is self-explanatory.
>
> STOPWORDS_FILE: stopwordstwitter.txt, loaded into stopWords (List<String>)
>
> Input data: twittersGL.txt, passed in through args[0]
>
> JavaSparkContext jSc
>
> lines (JavaRDD<String>): each line, delimited by a newline, is considered a document
>
> wordListsLcRDD: lines mapped to lower case and split into lists of words
>
> cleanListsRDD: wordListsLcRDD mapped to remove stop words, with empty lines removed (cached, as it is needed more than once)
>
> countMapRDD: cleanListsRDD mapped to find the word frequency in each list separately (i.e. in each document)
>
> flatList: cleanListsRDD flattened and collected as a list, used to construct the vocabulary list
>
> initialVocabSize (i.e. before removing the most- and least-frequent words): flatList streamed, made distinct, and counted
>
> allDocsWordCount (Map): flatList mapped to group words and count them separately
>
> nTopWords, nBttmWords: set to 20 and 154 respectively
>
> newVocabSize = initialVocabSize - nTopWords - nBttmWords
>
> countMapSortedOnValues: countMapRDD sorted on values, with the most- and least-frequent words removed (the numbers of words to remove are prescribed), thanks to the skip(n) and limit(m) methods of Java streams
>
> vocabList (of newVocabSize words): constructed from the keySet of countMapSortedOnValues and sorted in alphabetical order
>
> docWordFreqRDD: the matrix needed for the Apache Spark LDA model; countMapRDD is mapped to check against the new vocabulary list and to assign values from the map, with zero for non-existing words. Rows with no non-zero entries are removed.
>
> LDATopicModel: uses the Apache Spark LDA algorithm, taking docWordFreqRDD and nTopics as parameters, and produces a matrix of LDA scores.
>
> LDAScores+TimeStamp: a text file containing the LDA scores ((newVocabSize x nTopics) entries, with words) is saved.
>
> All mapping and filter functions are defined with the prefix 'sm_' in MyFunctions.java.
>
> All related documents are in a folder zipped and attached herewith.
>
> Thank you
> Mahesan
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Sinnathamby Mahesan
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--
============================
Blog: http://srinathsview.blogspot.com
twitter: @srinath_perera
Site: http://people.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902
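For reference, the skip(n)/limit(m) trimming step described above (dropping the most- and least-frequent words from the sorted count map) could be sketched in plain Java roughly as follows. This is a minimal sketch only; the class and method names are illustrative and are not taken from MyFunctions.java.

```java
import java.util.*;
import java.util.stream.*;

public class VocabTrim {
    // Sort words by frequency (descending), skip the nTop most frequent,
    // keep the middle band with limit(...) so the nBottom least frequent
    // fall away, then return the surviving vocabulary alphabetically sorted.
    static List<String> trimVocab(Map<String, Long> counts, int nTop, int nBottom) {
        int keep = counts.size() - nTop - nBottom;
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .skip(nTop)                 // drop the most frequent words
                .limit(Math.max(keep, 0))   // drop the least frequent words
                .map(Map.Entry::getKey)
                .sorted()                   // alphabetical order, as for vocabList
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("the", 100L);
        counts.put("spark", 10L);
        counts.put("lda", 8L);
        counts.put("rare", 1L);
        // Drop the 1 most frequent ("the") and the 1 least frequent ("rare").
        System.out.println(trimVocab(counts, 1, 1)); // [lda, spark]
    }
}
```

In the program itself the numbers corresponding to nTop and nBottom are 20 and 154 (nTopWords and nBttmWords).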
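Similarly, the per-document mapping behind docWordFreqRDD (look each vocabulary word up in the document's count map, writing zero for absent words) could look roughly like this, stripped of the Spark plumbing. Again a hedged sketch with hypothetical names, not the actual implementation:

```java
import java.util.*;

public class DocVector {
    // Given one document's word-frequency map and the alphabetically sorted
    // vocabulary, build a dense row with one entry per vocabulary word and
    // zero for words the document does not contain. (In the program, rows
    // with no non-zero entries are then filtered out.)
    static double[] toFreqVector(Map<String, Integer> docCounts, List<String> vocab) {
        double[] row = new double[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            row[i] = docCounts.getOrDefault(vocab.get(i), 0);
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("lda", "model", "spark");
        Map<String, Integer> doc = new HashMap<>();
        doc.put("spark", 2);
        doc.put("lda", 1);
        System.out.println(Arrays.toString(toFreqVector(doc, vocab))); // [1.0, 0.0, 2.0]
    }
}
```

In the Spark program these rows would become the vectors of the document-term matrix handed to the LDA model.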
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev