
> On Feb 6, 2014, at 10:08 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> 
> I can't comment on the specific question that you ask, but it should not
> necessarily be expected that LDA will reconstruct the categories that you
> have in mind.  It will develop categories that explain the data as well as
> it can, but that won't necessarily match the categories you intend.
> 
> It is likely, however, that the topics that LDA derives would make a good
> set of features for a classifier.
> 
> 
> 
> 
> On Thu, Feb 6, 2014 at 2:56 PM, Stamatis Rapanakis
> <stamrapana...@gmail.com>wrote:
> 
>>  I am trying to run the LDA algorithm. I can create meaningful topics, but
>> the document-to-topic assignment is of very poor quality.
>> 
>>  I have assigned 30 tweets to the following 10 topics:
>> 
>> /grammy awards
>> /greek crisis
>> /greek islands
>> /premier inn
>> /premier league
>> /rihanna
>> /syria
>> /terrorism
>> /winter olympics
>> /winter sales
>> 
>>  I have a total of 300 tweets, and my goal is to run the LDA algorithm
>> to see how well these tweets are assigned. For example, with the
>> number-of-topics parameter set to 10, how closely does the result match
>> the original assignment?
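
A note on measuring that match: once the docTopics assignment is trustworthy, a simple purity score quantifies how closely the LDA topics line up with the hand-assigned categories. A minimal Python sketch with made-up labels (nothing Mahout-specific):

```python
from collections import Counter, defaultdict

def topic_purity(true_labels, lda_topics):
    """For each LDA topic, find the dominant hand-assigned category and
    return the fraction of documents that fall into their topic's
    dominant category."""
    by_topic = defaultdict(list)
    for label, topic in zip(true_labels, lda_topics):
        by_topic[topic].append(label)
    matched = sum(Counter(labels).most_common(1)[0][1]
                  for labels in by_topic.values())
    return matched / len(true_labels)

# Toy example: 6 documents, 2 hand-assigned categories, 2 LDA topics.
labels = ["syria", "syria", "syria", "rihanna", "rihanna", "rihanna"]
topics = [0, 0, 1, 1, 1, 1]
print(topic_purity(labels, topics))  # 5 of 6 match -> 5/6
```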
>> 
>> 1. I start by creating a file that contains the tweets in random order
>> (*tweets.tsv*). This file will be used to compare against the final
>> tweet-to-topic assignment.
>> 
>> 2. I remove stopwords, URLs, and replies, and create a file containing
>> the tweet text only (*tweets_no_stopwords.tsv*), one tweet (document)
>> per line. This will be the LDA input file.
>> 
>> 3. I use some Java code to create a sequence file from
>> *tweets_no_stopwords.tsv*. I use a SequenceFile.Writer with an integer
>> key and the tweet text as the value (extract the attached
>> tweets_no_stopwords.rar, which contains a chunk-0 file).
>> 
>> By executing the command: *mahout seqdumper -i
>> tweets_no_stopwords/chunk-0*
>> the chunk-0 file contents appear correctly:
>> 
>> *Key: 1: Value: #nowplaying Rihanna - Unfaithful !! trop belle !!*
>> *Key: 2: Value: Grammy Awards Hairstyles: Memorable Moments*
>> *...*
>> *Key: 299: Value: team scored goal matches! (Man City)*
>> *Key: 300: Value: Rocsi Diaz Wearing 5th Mercer- Grammy Awards*
>> 
>> 4. I convert the data to vectors:
>> 
>> bin/mahout seq2sparse -i tweets_no_stopwords -o
>> tweets_no_stopwords-vectors -ow
>> 
>> (I review the file with the command: *bin/mahout seqdumper -i
>> tweets_no_stopwords-vectors/tf-vectors/part-r-00000*)
>> 
>> 5. I convert keys to IntWritables
>> 
>> bin/mahout rowid -i tweets_no_stopwords-vectors/tf-vectors/ -o
>> tweets_no_stopwords-vectors/tf-vectors-cvb
>> 
>> The created tf-vectors-cvb/docIndex and tf-vectors-cvb/matrix files have
>> keys from 0 to 299 (300 instances).
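
The docIndex file is what records which original sequence-file key each integer row id came from, so any later comparison with the original tweets has to go through it. A minimal Python sketch of reading two seqdumper text dumps and joining them (the "Key: N: Value: ..." line format is taken from the dumps shown in this thread; the sample values below are hypothetical):

```python
import re

def parse_seqdumper(text):
    """Parse `mahout seqdumper` text output into (key, value) string pairs,
    assuming the 'Key: <k>: Value: <v>' line format shown in this thread."""
    pairs = []
    for line in text.splitlines():
        m = re.match(r"Key:\s*(\S+):\s*Value:\s*(.*)", line)
        if m:
            pairs.append((m.group(1), m.group(2)))
    return pairs

# Hypothetical docIndex dump: CVB row id -> original sequence-file key.
row_to_key = dict(parse_seqdumper("Key: 0: Value: 1\nKey: 1: Value: 2"))
print(row_to_key["0"])  # prints 1: CVB row 0 came from original key 1
```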
>> 
>> 6. Finally I run the LDA algorithm:
>> 
>> *bin/mahout cvb -i tweets_no_stopwords-vectors/tf-vectors-cvb/matrix/ -o
>> lda_output/topicterm -mt lda_output/models -dt lda_output/docTopics -k 10
>> -x 40 -dict tweets_no_stopwords-vectors/dictionary.file-0*
>> 
>> Note: I have to press Ctrl+C to stop the command execution (after it
>> finishes and the message "Program took XXXX ms" appears), but the folders
>> are created as expected.

This was an issue with thread pools not being terminated; it was fixed in
Mahout 0.8.
>> 
>> The topics created (lda_output/topicterm) seem fine. I execute the command:
>> 
>> *bin/mahout vectordump -i lda_output/topicterm -d
>> tweets_no_stopwords-vectors/dictionary.file-0 -dt sequencefile -c csv -p
>> true -o p_term_topic.txt -sort lda_output/topicterm -vs 10*
>> 
>> and follow the steps described at this link (
>> http://sujitpal.blogspot.gr/2013/10/topic-modeling-with-mahout-on-amazon-emr.html)
>> to create the file *p_term_topic.txt* and produce a report from the output.
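
The report step is essentially just ranking each topic's terms by weight. A small Python sketch of that ranking, independent of the exact p_term_topic.txt layout (the in-memory weights below are hypothetical, not taken from the real output):

```python
def top_terms_report(topic_terms, n=10):
    """Render a top-n-terms-per-topic report.
    topic_terms maps topic id -> list of (term, weight) pairs; filling it
    from p_term_topic.txt depends on the vectordump flags used."""
    lines = []
    for topic in sorted(topic_terms):
        ranked = sorted(topic_terms[topic], key=lambda tw: -tw[1])[:n]
        lines.append("Topic %d: %s" % (topic, ", ".join(t for t, _ in ranked)))
    return "\n".join(lines)

# Hypothetical weights for two topics:
print(top_terms_report({0: [("sales", 0.7), ("winter", 0.9)],
                        1: [("syria", 0.8), ("war", 0.5)]}, n=2))
# Topic 0: winter, sales
# Topic 1: syria, war
```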
>> 
>> Topic 0: winter, sales, olympics, love, played, people, big, photo, sale, trail
>> Topic 1: terrorism, grammy, awards, blaindianexus, 56th, balochistan, bla, rock, 2014, photos
>> Topic 2: islands, greek, greece, travel, find, book, make, kea, days, holiday
>> Topic 3: greek, crisis, β, lol, s, top, economic, tomorrow, job, eu
>> Topic 4: grammys, found, style, red, hairdressers, room, mata, good, ty, walks
>> Topic 5: sochi, team, time, all, usa, war, free, syria, sending, check
>> Topic 6: syria, city, manchester, united, back, hit, watching, chelsea, week, matchday
>> Topic 7: syria, support, olympic, economy, video, today, competition, arab, u.s, inn's
>> Topic 8: rihanna, time, watch, unapologetic, follow, great, euro, congrats, bet, hotels
>> Topic 9: premier, inn, league, stay, season, β, year, home, goals, won
>> 
>> 
>> 
>> These results are good, keeping in mind the (10) categories the tweets
>> originally belonged to:
>> 
>> /grammy awards
>> /greek crisis
>> /greek islands
>> /premier inn
>> /premier league
>> /rihanna
>> /syria
>> /terrorism
>> /winter olympics
>> /winter sales
>> 
>> But the results in the folder *lda_output/docTopics* are really bad!
>> 
>> bin/mahout seqdumper -i lda_output/docTopics/part-m-00000  (Display the
>> results)
>> 
>> Key: 0: Value: {0:2.7932644743653218E-5, 1:0.2582390963222569, 2:0.03389979994715306, 3:0.16986766822778876, 4:*0.5144069716184998*, 5:6.134281324000599E-5, 6:0.022817498374309925, 7:1.2427551415773865E-4, 8:4.7632128287483606E-4, 9:7.909325497553191E-5}
>> Key: 1: Value: {0:0.004101560509130678, 1:0.02531905947518225, 2:0.14528444920763148, 3:*0.32904199007739116*, 4:0.06024210378042988, 5:0.15510210839789676, 6:0.0364093686560865, 7:0.13256015086012124, 8:0.0613456311044372, 9:0.05059357793169288}
>> Key: 2: Value: {0:2.093051210521087E-4, 1:0.0242076645518674, 2:0.12014785226603218, 3:0.15589333731396188, 4:0.022516226489811282, 5:0.015141667919690474, 6:0.08494844406302673, 7:0.150039462386397, 8:0.15927498562672762, 9:*0.2676210542614334*}
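
Each Value above is a sparse {topicIndex:probability} map, so a document's assigned topic is just the argmax entry. A minimal Python sketch of parsing one such line (using the Key 0 vector from the dump above, with the emphasis markers stripped):

```python
def argmax_topic(vector_str):
    """Parse a '{index:probability,...}' seqdumper value and return the
    index of the most probable topic."""
    pairs = (p.split(":") for p in vector_str.strip("{}").split(","))
    probs = {int(k): float(v) for k, v in pairs}
    return max(probs, key=probs.get)

# Key 0's vector from the dump above:
v = ("{0:2.7932644743653218E-5,1:0.2582390963222569,2:0.03389979994715306,"
     "3:0.16986766822778876,4:0.5144069716184998,5:6.134281324000599E-5,"
     "6:0.022817498374309925,7:1.2427551415773865E-4,"
     "8:4.7632128287483606E-4,9:7.909325497553191E-5}")
print(argmax_topic(v))  # prints 4
```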
>> 
>> 
>> Tweet  Topic  Tweet text
>> 1      4      #nowplaying Rihanna Unfaithful !! trop belle !!
>> 2      3      Grammy Awards Hairstyles: Memorable Moments
>> 3      9      Preeminent #terrorism research center website. Check out: cc
>> 
>> 
>> Am I missing something? Doesn't key 0 correspond to the first tweet
>> (document), key 1 to the second tweet, and so on?

Could this be a minor issue that needs to be fixed? Could you verify that
the issue exists on 0.8?
>> 
>>  Thank you in advance for your responses.
>> 
>> 
>> 
