Adding dev@. What are the topics identified? Do they have real-life explanations?
What happens with 3 or more topics?

--Srinath

On Wed, Dec 16, 2015 at 2:00 PM, Sinnathamby Mahesan <sinnatha...@wso2.com> wrote:

> Dear Srinath / Nirmal
>
> This is about the program for Apache Spark LDA modelling:
>
> - The data file 'twittersGL.txt' is used, treating each tweet as a document.
> - Stop words are removed.
> - The word frequency in each document is computed.
> - The prescribed numbers of most- and least-frequent words are removed.
> - A new vocabulary list is constructed (after removal of those words).
> - Using the frequency counts in each document and the new vocabulary list, the input data file for the Apache Spark LDA model is constructed.
> - The data file is passed to the LDA algorithm along with other parameters, such as the number of topics.
> - The LDA algorithm produces LDA scores: vocab-size x number-of-topics entries.
> - These entries (saved in a file) should be passed to a data visualisation application for decision making.
>
> Tried with LibreOffice Calc for three out of five topics: the chart leads us to conclude that two topics can be identified, while the third is a kind of mixture of the other two.
>
> Here is a brief description of the variables, listed in processing order; I believe it is self-explanatory.
>
> STOPWORDS_FILE: stopwordstwitter.txt, loaded into stopWords (List<String>)
>
> Input data: twittersGL.txt, passed in through args[0]
>
> JavaSparkContext jSc
>
> lines (JavaRDD<String>): each line, delimited by a newline, is considered a document
>
> wordListsLcRDD: lines mapped to lower case and split into lists of words
>
> cleanListsRDD: wordListsLcRDD mapped to remove stop words, with empty lines removed (cached, as it is needed more than once)
>
> countMapRDD: cleanListsRDD mapped to find the word frequency in each list separately (i.e. in each document)
>
> flatList: cleanListsRDD flattened and collected as a list, used to construct the vocabulary list
>
> initialVocabSize (i.e. before removing the most- and least-frequent words): flatList streamed, made distinct, and counted
>
> allDocsWordCount (Map): flatList mapped to group words and count them separately
>
> nTopWords, nBttmWords: set to 20 and 154 respectively
>
> newVocabSize = initialVocabSize - nTopWords - nBttmWords
>
> countMapSortedOnValues: countMapRDD sorted on values, with the most- and least-frequent words removed (the numbers of words to remove are prescribed), thanks to the skip(n) and limit(m) methods of Java streams
>
> vocabList (of newVocabSize words): constructed from the keySet of countMapSortedOnValues and sorted in alphabetical order
>
> docWordFreqRDD: the matrix needed for the Apache Spark LDA model; countMapRDD is mapped to check against the new vocabulary list and to assign values from the map, with zero for non-existing words. Rows with no non-zero entries are removed.
>
> LDATopicModel: uses the Apache Spark LDA algorithm, taking docWordFreqRDD and nTopics as parameters, and produces a matrix of LDA scores.
>
> LDAScores+TimeStamp: a text file containing the LDA scores ((newVocabSize x nTopics) entries, with words) is saved.
>
> All mapping and filter functions are defined with the prefix 'sm_' in MyFunctions.java.
>
> All related documents are in a folder zipped and attached herewith.
>
> Thank you
> Mahesan
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Sinnathamby Mahesan
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--
============================
Blog: http://srinathsview.blogspot.com
twitter: @srinath_perera
Site: http://people.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902
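For reference, the skip(n)/limit(m) trimming step described above (dropping the most- and least-frequent words from the sorted count map) could be sketched in plain Java roughly as follows. This is a minimal sketch only; the class and method names are illustrative and are not taken from MyFunctions.java.

```java
import java.util.*;
import java.util.stream.*;

public class VocabTrim {
    // Sort words by frequency (descending), skip the nTop most frequent,
    // keep the middle band with limit(...) so the nBottom least frequent
    // fall away, then return the surviving vocabulary alphabetically sorted.
    static List<String> trimVocab(Map<String, Long> counts, int nTop, int nBottom) {
        int keep = counts.size() - nTop - nBottom;
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .skip(nTop)                 // drop the most frequent words
                .limit(Math.max(keep, 0))   // drop the least frequent words
                .map(Map.Entry::getKey)
                .sorted()                   // alphabetical order, as for vocabList
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("the", 100L);
        counts.put("spark", 10L);
        counts.put("lda", 8L);
        counts.put("rare", 1L);
        // Drop the 1 most frequent ("the") and the 1 least frequent ("rare").
        System.out.println(trimVocab(counts, 1, 1)); // [lda, spark]
    }
}
```

In the program itself the numbers corresponding to nTop and nBottom are 20 and 154 (nTopWords and nBttmWords).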
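Similarly, the per-document mapping behind docWordFreqRDD (look each vocabulary word up in the document's count map, writing zero for absent words) could look roughly like this, stripped of the Spark plumbing. Again a hedged sketch with hypothetical names, not the actual implementation:

```java
import java.util.*;

public class DocVector {
    // Given one document's word-frequency map and the alphabetically sorted
    // vocabulary, build a dense row with one entry per vocabulary word and
    // zero for words the document does not contain. (In the program, rows
    // with no non-zero entries are then filtered out.)
    static double[] toFreqVector(Map<String, Integer> docCounts, List<String> vocab) {
        double[] row = new double[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            row[i] = docCounts.getOrDefault(vocab.get(i), 0);
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("lda", "model", "spark");
        Map<String, Integer> doc = new HashMap<>();
        doc.put("spark", 2);
        doc.put("lda", 1);
        System.out.println(Arrays.toString(toFreqVector(doc, vocab))); // [1.0, 0.0, 2.0]
    }
}
```

In the Spark program these rows would become the vectors of the document-term matrix handed to the LDA model.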
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev