> On Feb 6, 2014, at 10:08 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> I can't comment on the specific question that you ask, but it should not
> necessarily be expected that LDA will reconstruct the categories that you
> have in mind. It will develop categories that explain the data as well as
> it can, but that won't necessarily match the categories you intend.
>
> It is likely, however, that the topics that LDA derives would make a good
> set of features for a classifier.
>
> On Thu, Feb 6, 2014 at 2:56 PM, Stamatis Rapanakis
> <stamrapana...@gmail.com> wrote:
>
>> I am trying to run the LDA algorithm. I can create meaningful topics but
>> the document/topic assignment is of very bad quality.
>>
>> I have assigned 30 tweets to each of the following 10 topics:
>>
>> /grammy awards
>> /greek crisis
>> /greek islands
>> /premier inn
>> /premier league
>> /rihanna
>> /syria
>> /terrorism
>> /winter olympics
>> /winter sales
>>
>> I have a total of 300 tweets and my purpose is to run the LDA algorithm
>> to see how well these tweets are assigned. For example, if the number of
>> topics parameter is set to 10, how well does the result match the
>> original assignment?
>>
>> 1. I start by creating a file that contains the tweets in random order
>> (*tweets.tsv*). This file will be used to compare the final tweet/topic
>> assignment.
>>
>> 2. I remove stopwords, URLs and replies, and create a file with the tweet
>> text only (*tweets_no_stopwords.tsv*). One tweet (document) per file
>> line. This will be the LDA input file.
>>
>> 3. I use some Java code to create a sequence file from
>> *tweets_no_stopwords.tsv*. I use a SequenceFile.Writer object with an
>> integer as key and the tweet text as value (extract the attached
>> tweets_no_stopwords.rar, which contains a chunk-0 file).
>>
>> By executing the command:
>> *mahout seqdumper -i tweets_no_stopwords/chunk-0*
>> the chunk-0 file contents appear correctly:
>>
>> *Key: 1: Value: #nowplaying Rihanna - Unfaithful !! trop belle !!*
>> *Key: 2: Value: Grammy Awards Hairstyles: Memorable Moments*
>> *...*
>> *Key: 299: Value: team scored goal matches! (Man City)*
>> *Key: 300: Value: Rocsi Diaz Wearing 5th Mercer- Grammy Awards*
>>
>> 4. I convert the data to vectors:
>>
>> bin/mahout seq2sparse -i tweets_no_stopwords -o tweets_no_stopwords-vectors -ow
>>
>> (I review the file with the command:
>> *bin/mahout seqdumper -i tweets_no_stopwords-vectors/tf-vectors/part-r-00000*)
>>
>> 5. I convert the keys to IntWritables:
>>
>> bin/mahout rowid -i tweets_no_stopwords-vectors/tf-vectors/ -o tweets_no_stopwords-vectors/tf-vectors-cvb
>>
>> The created tf-vectors-cvb/docIndex and tf-vectors-cvb/matrix files have
>> keys from 0 to 299 (300 instances).
>>
>> 6. Finally I run the LDA algorithm:
>>
>> *bin/mahout cvb -i tweets_no_stopwords-vectors/tf-vectors-cvb/matrix/ -o lda_output/topicterm -mt lda_output/models -dt lda_output/docTopics -k 10 -x 40 -dict tweets_no_stopwords-vectors/dictionary.file-0*
>>
>> Note: I have to press Ctrl+C to stop the command execution (after it has
>> finished and the message "Program took XXXX ms" appears). But the folders
>> are created as expected.

This was an issue with thread pools not being terminated and was fixed in Mahout 0.8.

>> The topics created (lda_output/topicterm) seem fine. I execute the command:
>>
>> *bin/mahout vectordump -i lda_output/topicterm -d tweets_no_stopwords-vectors/dictionary.file-0 -dt sequencefile -c csv -p true -o p_term_topic.txt -sort lda_output/topicterm -vs 10*
>>
>> and follow the steps described in this link
>> (http://sujitpal.blogspot.gr/2013/10/topic-modeling-with-mahout-on-amazon-emr.html)
>> to create a file *p_term_topic.txt* and show a report with the output.
>>
>> *Topic 0*: winter, sales, olympics, love, played, people, big, photo, sale, trail
>> *Topic 1*: terrorism, grammy, awards, blaindianexus, 56th, balochistan, bla, rock, 2014, photos
>> *Topic 2*: islands, greek, greece, travel, find, book, make, kea, days, holiday
>> *Topic 3*: greek, crisis, β, lol, s, top, economic, tomorrow, job, eu
>> *Topic 4*: grammys, found, style, red, hairdressers, room, mata, good, ty, walks
>> *Topic 5*: sochi, team, time, all, usa, war, free, syria, sending, check
>> *Topic 6*: syria, city, manchester, united, back, hit, watching, chelsea, week, matchday
>> *Topic 7*: syria, support, olympic, economy, video, today, competition, arab, u.s, inn's
>> *Topic 8*: rihanna, time, watch, unapologetic, follow, great, euro, congrats, bet, hotels
>> *Topic 9*: premier, inn, league, stay, season, β, year, home, goals, won
>>
>> These results are good, if you have in mind the (10) categories they
>> belonged to:
>>
>> /grammy awards
>> /greek crisis
>> /greek islands
>> /premier inn
>> /premier league
>> /rihanna
>> /syria
>> /terrorism
>> /winter olympics
>> /winter sales
>>
>> But the results in the folder *lda_output/docTopics* are really bad!
>>
>> bin/mahout seqdumper -i lda_output/docTopics/part-m-00000 (Display the results)
>>
>> Key: 0: Value: {0:2.7932644743653218E-5,1:0.2582390963222569,2:0.03389979994715306,3:0.16986766822778876,4:*0.5144069716184998*,5:6.134281324000599E-5,6:0.022817498374309925,7:1.2427551415773865E-4,8:4.7632128287483606E-4,9:7.909325497553191E-5}
>> Key: 1: Value: {0:0.004101560509130678,1:0.02531905947518225,2:0.14528444920763148,3:*0.32904199007739116*,4:0.06024210378042988,5:0.15510210839789676,6:0.0364093686560865,7:0.13256015086012124,8:0.0613456311044372,9:0.05059357793169288}
>> Key: 2: Value: {0:2.093051210521087E-4,1:0.0242076645518674,2:0.12014785226603218,3:0.15589333731396188,4:0.022516226489811282,5:0.015141667919690474,6:0.08494844406302673,7:0.150039462386397,8:0.15927498562672762,9:*0.2676210542614334*}
>>
>> *Tweet*  *Topic*  *Tweet text*
>> 1        4        #nowplaying Rihanna Unfaithful !! trop belle !!
>> 2        3        Grammy Awards Hairstyles: Memorable Moments
>> 3        9        Preeminent #terrorism research center website. Check out: cc
>>
>> Am I missing something? Doesn't key 0 correspond to the first tweet
>> (document), key 1 to the second tweet and so on?

Could this be a minor issue that needs to be fixed? Could you verify that the issue exists on 0.8?

>> Thank you in advance for your responses.
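
Since the topic ids that LDA produces are arbitrary, one rough way to check the docTopics output against the hand-made categories is to take the most probable topic per document and score each topic against whichever category dominates it. Below is a minimal Python sketch of that idea. It is not part of Mahout: the function names (parse_doc_topics, dominant_topics, purity) are made up for illustration, and it assumes the seqdumper output has been saved as plain text with one intact "Key: N: Value: {...}" record each and without the emphasis asterisks. It also does not settle which tweet a given key refers to; that mapping should be checked against the docIndex file written in step 5.

```python
import re
from collections import Counter, defaultdict

# Matches one seqdumper record, e.g. "Key: 0: Value: {0:2.79E-5,1:0.258}"
RECORD = re.compile(r"Key:\s*(\d+):\s*Value:\s*\{([^}]*)\}")


def parse_doc_topics(dump_text):
    """Return {doc_key: {topic_id: probability}} parsed from seqdumper text."""
    docs = {}
    for key, body in RECORD.findall(dump_text):
        dist = {}
        for pair in body.split(","):
            if not pair.strip():
                continue  # tolerate an empty vector "{}"
            topic, prob = pair.split(":")
            dist[int(topic)] = float(prob)
        docs[int(key)] = dist
    return docs


def dominant_topics(docs):
    """Map each document key to the topic with the highest probability."""
    return {key: max(dist, key=dist.get) for key, dist in docs.items()}


def purity(assigned, labels):
    """Fraction of documents whose label is the majority label of their topic.

    assigned: {doc_key: topic_id}, labels: {doc_key: hand-assigned category}.
    Topic ids carry no meaning, so each topic is scored against whichever
    category occurs most often among the documents assigned to it.
    """
    by_topic = defaultdict(Counter)
    for key, topic in assigned.items():
        by_topic[topic][labels[key]] += 1
    matched = sum(c.most_common(1)[0][1] for c in by_topic.values())
    return matched / len(assigned)
```

For the three records quoted above, dominant_topics picks topics 4, 3 and 9, the same values as in the small tweet/topic table. A purity close to 1.0 would mean the topics line up with the categories even though the ids differ; a low value would mean the documents really are scattered across topics.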