Re: New logo
Will you be wearing “one of those t-shirts” on Monday in Houston :) ? SCott Scott C. Cote scottcc...@gmail.com 972.672.6484 > On May 6, 2017, at 1:52 PM, Ted Dunning <ted.dunn...@gmail.com> wrote: > > I know where one of those t-shirts is. > > > > On Sat, May 6, 2017 at 7:13 AM, Isabel Drost-Fromm <isa...@apache.org> > wrote: > >> The green logo was the very first design iteration before, iirc, Robin came >> up with the yellow one. There should be like five t-shirts worldwide with the >> old logo printed in 2009. >> >> >> On May 1, 2017 at 20:41:43 CEST, Trevor Grant < >> trevor.d.gr...@gmail.com> wrote: >>> Thanks Scott, >>> >>> You are correct - in fact we're going even further now, so that you can do >>> native optimization regardless of the architecture with native-solvers. >>> >>> Do you or anyone more familiar with the history of the website know >>> anything about the origins/uses of this: >>> https://mahout.apache.org/images/Mahout-logo-245x300.png >>> It seems to be a green Mahout logo. >>> >>> Also Scott, or anyone lurking who may be able to help. As part of the >>> website reboot I've included a "history" page and would really >>> appreciate >>> some help capturing that from first-person sources if possible. I've put >>> in >>> some headers but those are only directional: >>> >>> https://github.com/rawkintrevo/mahout/blob/website/website/front/community/history.md >>> >>> >>> >>> Trevor Grant >>> Data Scientist >>> https://github.com/rawkintrevo >>> http://stackexchange.com/users/3002022/rawkintrevo >>> http://trevorgrant.org >>> >>> *"Fortunate is he, who is able to know the causes of things." -Virgil* >>> >>> >>> On Mon, May 1, 2017 at 11:18 AM, scott cote <scottcc...@gmail.com> >>> wrote: >>> >>>> Trevor et al: >>>> >>>> Some ideas to spur you on (and related points): >>>> >>>> Mahout is no longer a grab bag of algorithms and routines, but a math >>>> language, right? You don’t care about the under-the-cover >>> implementation. >>>> Today it's Spark, with alternative implementations in Flink, etc. …. >>>> >>>> Don’t know if that is the long-term goal still - haven’t kept up - >>> but it >>>> seems like you are insulating yourself from the underlying >>> technology. >>>> >>>> Math is a universal language. Right? >>>> >>>> Tower of Babel is coming to mind …. >>>> >>>> SCott >>>> >>>>> On Apr 27, 2017, at 10:27 PM, Trevor Grant >>> <trevor.d.gr...@gmail.com> >>>> wrote: >>>>> >>>>> It also bugs me when I can't suggest any alternatives, yet don't >>> like the >>>>> ones in front of me... >>>>> >>>>> I became aware of a symbol a week or so ago, and it keeps coming >>> back to >>>>> me. >>>>> >>>>> The Enso. >>>>> https://en.wikipedia.org/wiki/Ens%C5%8D >>>>> >>>>> Things I like about it: >>>>> (all from Wikipedia, since the only thing I knew about this symbol >>> prior >>>> is >>>>> that someone I met had a tattoo of it). >>>>> It represents (among a few other things) enlightenment. >>>>> ^^ This resonated with the 'alternate definition of mahout' from >>> Hebrew - >>>>> which may be something akin to essence or truth. >>>>> >>>>> It is a circle - which plays to the Samsara theme. >>>>> >>>>> It is very expressive, a simple one- or two-brush-stroke circle >>> which >>>>> symbolizes several large concepts and things about the creator, >>>> expressive >>>>> like our DSL (I feel gross comparing such a symbol to a Scala DSL, >>> but >>>> I'm >>>>> spitballing here, please forgive me - I am not so expressive). >>>>> >>>>> "Once the *ensō* is drawn, one does not change it. It evidences the >>>>> character of its creator and the context of its creation in a >>> brief, >>>>> contiguous period of time." Which reminds me of the DRMs. >>>>> >>>>> In closed form it
Re: streaming kmeans vs incremental canopy/solr/kmeans
Mahout Gurus, I’m back at the clustering-text game (after a hiatus of a year). Not for recommendation purposes - thanks for the book and the idea of Solr for recommendation …. that’s cool (found Ted at Data Days in Austin - nice to see you again). My question: how do I apply streaming cluster technology to text when I don’t have accurate vectors? Let me explain exactly what I mean. I have a series of sentences coming at me over time. I may or may not have a word in my “dictionary” when I receive it. I need to group the similar sentences together, so I want to cluster the sentences. The streaming cluster lib listed in Mahout assumes that the text has already been vectorized. So how do I vectorize a sentence that has words that are not in the dictionary? Do I save the elements of the prior TF-IDF calculations and incrementally update? ... Ugh - I think I just figured out my source of confusion. Please confirm my understanding: streaming does NOT imply an unbounded set of data …. I will have a set of sentences that arrives in some period of time T. Those that arrive in time T will be treated as a “batch” and vectorized in the usual fashion (TF-IDF). Then I feed the batched vector sets into the shiny new streaming methods (instead of using the tired old canopy combined with straight k-means) to arrive at my groupings. - No time or CPU burned up discovering canopies. - No intermediate disk consumed pushing canopy output into k-means. Nice groups. So all I have to do is keep updating the TF-IDF as new sentences arrive and re-“ball” the sentences with the fast, shiny streaming cluster technology. My big hurdle is coming up with an efficient way to update TF-IDF (ideas are welcome). On a separate note - over the last year I have been using markdown and developing my documentation skills. Held off on writing docs on canopy as I saw that it is going to be deprecated (Suneel). Does my use case sound like a good example for streaming? If yes, I’ll cook up my specifics into a postable example. Also - just checking - streaming isn’t going to be deprecated, is it? I know that I crammed a whole bunch of questions into this letter, so I will truly appreciate y’all being patient and wading through. Regards, SCott On 2/14/14, 12:55 PM, Ted Dunning ted.dunn...@gmail.com wrote: In-memory ball k-means should solve your problem pretty well right now. In-memory streaming k-means followed by ball k-means will take you well beyond your scaled case. At 1 million documents, you should be able to do your clustering in a few minutes, depending on whether some of the sparse matrix performance issues got fixed in the clustering code (I think they did). On Fri, Feb 14, 2014 at 10:50 AM, Scott C. Cote scottcc...@gmail.com wrote: Right now - I'm dealing with only 40,000 documents, but we will eventually grow more than 10x (put on the manager hat and say 1 mil docs), where a doc is usually no longer than 20 or 30 words. SCott On 2/14/14 12:46 PM, Ted Dunning ted.dunn...@gmail.com wrote: Scott, How much data do you have? How much do you plan to have? On Fri, Feb 14, 2014 at 8:04 AM, Scott C. Cote scottcc...@gmail.com wrote: Hello All, I have two questions (Q1, Q2). Q1: Am digging into text analysis and am wrestling with competing analyzed-data maintenance strategies. NOTE: my text comes from a very narrowly focused source. - Am currently crunching the data (batch) using the following scheme: 1. Load source text as rows in a MySQL database. 2. Create named TFIDF vectors using a custom analyzer from source text (-stopwords, lowercase, std filter, …) 3. Perform Canopy Cluster and then KMeans Cluster using an enhanced cosine metric (derived from a custom metric found in MiA) 4. Load references of clusters into SOLR (core1) - cluster id, top terms - along with full cluster data into Mongo (a cluster is a doc) 5. Then load source text into SOLR (core2) using the same custom analyzer with appropriate boost, along with the reference cluster id. NOTE: in all cases, the id of the source text is preserved throughout the flow - in the vector naming process, etc. So now I have a MySQL table, two SOLR cores, and a Mongo document collection (all tied together with text id as the common name). - Now when a new document enters the system after batch has been performed, I use core2 to test the top SOLR matches (the custom analyzer normalizes the new doc) to find the best cluster within a tolerance. If a cluster is found, then I place the text in that cluster; if not, then I start a new group (my word for a cluster not generated via kmeans). Either way, the doc makes its way into both (core1 and core2). I keep track of the number of group creations/document placements so that if a threshold is crossed, then I can re-batch the data. MiA (I think ch. 11) suggests that a user could run the canopy cluster routine to assign new
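One way to attack the "efficient way to update TF-IDF" hurdle described above is to treat the document-frequency table itself as the only incremental state, and compute weights on demand. The following is a minimal plain-Java sketch under that assumption; it is illustrative only (not the Mahout API - the class and method names are invented):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

/** Minimal sketch of an incrementally updatable TF-IDF model (illustrative, not Mahout API). */
public class IncrementalTfIdf {
  private final Map<String, Integer> docFreq = new HashMap<>(); // term -> number of docs containing it
  private int numDocs = 0;

  /** Fold one new, already-tokenized sentence into the document-frequency counts. */
  public void addDocument(String[] tokens) {
    numDocs++;
    for (String t : new HashSet<>(Arrays.asList(tokens))) {
      docFreq.merge(t, 1, Integer::sum); // count each distinct term once per document
    }
  }

  /** Smoothed TF-IDF weight; a term absent from the dictionary simply has df = 0. */
  public double weight(String term, int tfInDoc) {
    int df = docFreq.getOrDefault(term, 0);
    double idf = Math.log((numDocs + 1.0) / (df + 1.0)) + 1.0; // add-one smoothing handles unseen terms
    return tfInDoc * idf;
  }
}

Under this scheme, an out-of-dictionary word is just a term with df = 0: it gets a high IDF until it accumulates counts, and the full batch vectorization can be re-run whenever the weights have drifted too far - which matches the "re-batch when a threshold is crossed" strategy quoted above.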
Re: canopy creating canopies with the same points
Reinis, The documentation has several Jiras open - with one with my name on it. Fortunately, the canopy cluster technology has a good page (as well as some outdated pages). Please see this link for your question: http://mahout.apache.org/users/clustering/canopy-clustering.html as I believe that it is well written. To directly answer your question: Remember that T1 > T2, and points within T2 are added to the cluster and removed from the input set, while points within T1 are added to the cluster but NOT removed from the input set (and therefore may be added to another cluster later in the process). SCott On 3/24/14, 6:44 AM, Reinis Vicups mah...@orbit-x.de wrote: Hi, apparently I am misunderstanding the way canopy works. I thought that once a data point is added to a canopy, it is removed from the list of to-be-clustered points, thus one point is assigned to one canopy. In the example below this is not the case: :C-28{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:238.981, 468:40.572, 556:10.985, 889:8.678, 1101:114 :C-29{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:217.804, 468:33.560, 556:10.985, 889:8.678, 1101:113 :C-30{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:215.841, 468:37.231, 556:10.985, 889:8.678, 1101:113 :C-31{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:206.121, 468:32.243, 556:10.985, 889:8.678, 1101:112 So is the correct assumption that only the points within T2 get assigned to exactly one canopy, or can even points within T2 get assigned to more than one canopy? greets reinis
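To make the T1/T2 behavior in the quoted passage concrete, here is a toy in-memory sketch of canopy assignment. It is an illustration of the algorithm as described on the page linked above, not the actual Mahout implementation, and the distance measure is passed in as a function:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;

/** Toy canopy clustering over double[] points; not the Mahout implementation. */
public class CanopySketch {
  public static List<List<double[]>> canopies(List<double[]> points, double t1, double t2,
                                              BiFunction<double[], double[], Double> dist) {
    // Assumes t1 > t2 (loose threshold strictly larger than tight threshold).
    List<List<double[]>> result = new ArrayList<>();
    List<double[]> remaining = new ArrayList<>(points);
    while (!remaining.isEmpty()) {
      double[] center = remaining.remove(0);   // pick an arbitrary seed
      List<double[]> canopy = new ArrayList<>();
      canopy.add(center);
      Iterator<double[]> it = remaining.iterator();
      while (it.hasNext()) {
        double[] p = it.next();
        double d = dist.apply(center, p);
        if (d < t1) canopy.add(p);             // within T1: joins this canopy...
        if (d < t2) it.remove();               // ...but only points within T2 leave the pool
      }
      result.add(canopy);
    }
    return result;
  }
}

Note that with T1 == T2 the two conditions coincide, so every point lands in exactly one canopy; singleton canopies like C-28 through C-31 above can still appear whenever a point lies farther than T2 from every seed chosen before it.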
Re: canopy creating canopies with the same points
Reinis, I don’t know - perhaps one of the other denizens of Users has an answer? SCott On 3/24/14, 10:13 AM, Reinis Vicups mah...@orbit-x.de wrote: Scott, thx a bunch for the pointer, very useful. One thing I would like to clarify tho. I forgot to mention that I ran canopy with T1 == T2 (this was suggested in some post as a fast method to find a T2 that gives a particular number of canopies.) You mention jiras you opened (gonna check them right after) - could it be one of them is for this special T1 == T2 case? br reinis On 24.03.2014 15:28, Scott C. Cote wrote: Reinis, The documentation has several Jiras open - with one with my name on it. Fortunately, the canopy cluster technology has a good page (as well as some outdated pages). Please see this link for your question: http://mahout.apache.org/users/clustering/canopy-clustering.html as I believe that it is well written. To directly answer your question: Remember that T1 > T2, and points within T2 are added to the cluster and removed from the input set, while points within T1 are added to the cluster but NOT removed from the input set (and therefore may be added to another cluster later in the process). SCott On 3/24/14, 6:44 AM, Reinis Vicups mah...@orbit-x.de wrote: Hi, apparently I am misunderstanding the way canopy works. I thought that once a data point is added to a canopy, it is removed from the list of to-be-clustered points, thus one point is assigned to one canopy. In the example below this is not the case: :C-28{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:238.981, 468:40.572, 556:10.985, 889:8.678, 1101:114 :C-29{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:217.804, 468:33.560, 556:10.985, 889:8.678, 1101:113 :C-30{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:215.841, 468:37.231, 556:10.985, 889:8.678, 1101:113 :C-31{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:206.121, 468:32.243, 556:10.985, 889:8.678, 1101:112 So is the correct assumption that only the points within T2 get assigned to exactly one canopy, or can even points within T2 get assigned to more than one canopy? greets reinis
Re: Website, urgent help needed
I have created issue https://issues.apache.org/jira/browse/MAHOUT-1461 Will upload shell scripts and suggested replacement text later tonight …. SCott On 3/13/14, 10:43 AM, Sebastian Schelter s...@apache.org wrote: Hi Scott, Create a jira ticket and attach your scripts and a text version of the page there. Best, Sebastian On 03/12/2014 03:27 PM, Scott C. Cote wrote: I took the tour of the text analysis and pushed through despite the problems on the page. Committers helped me over the hump where others might have just given up (to your point). When I did it, I made shell scripts so that my steps would be repeatable, in anticipation of updating the page. Unfortunately, I gave up on trying to figure out how to update the page (there were links indicating that I could do it), and I didn't want to appear stupid asking how to update the documentation (my bad - not anyone else's). Now I know that it was not possible unless I was a committer. Who should I send my scripts to, or how should I proceed with a current form of the page? SCott On 3/12/14, 5:02 AM, Sebastian Schelter s...@apache.org wrote: Hi Pavan, Awesome that you're willing to help. The documentation is the set of pages listed under Clustering in the navigation bar on mahout.apache.org. If you start working on one of the pages listed there (e.g. the k-Means doc), please create a jira ticket in our issue tracker with a title along the lines of "Cleaning up the documentation for k-Means on the website". Put a list of errors and corrections into the jira and I (or some other committer) will make sure to fix the website. Thanks, Sebastian On 03/12/2014 08:48 AM, Pavan Kumar N wrote: I'll help with the clustering algorithms documentation. Do send me the old documentation and I will check and remove errors - or better, let me know how to proceed. Pavan On Mar 12, 2014 12:35 PM, Sebastian Schelter s...@apache.org wrote: Hi, As you've probably noticed, I've put in a lot of effort over the last days to kickstart cleaning up our website. I've thrown out a lot of stuff and have been startled by the amount of outdated and incorrect information on our website, as well as links pointing to nowhere. I think our lack of documentation makes it super hard for new people to use Mahout. A crucial next step is to clean up the documentation on classification and clustering. I cannot do this alone, because I don't have the time and I'm not so familiar with the background of the algorithms. I need volunteers to go through all the pages under Classification and Clustering on the website. For the algorithms, the content and claims of the articles need to be checked; for the examples, we need to make sure that everything still works as described. It would also be great to move articles from personal blogs to our website. Imagine that some developer wants to try out Mahout and takes one hour for that in the evening. She will go to our website, download Mahout, read the description of an algorithm, and try to run an example. In the current state of the documentation, I'm afraid that most people will walk away frustrated, because the website does not help them as it should. Best, Sebastian PS: I will make my standpoint on whether Mahout should do a 1.0 release depend on whether we manage to clean up and maintain our documentation.
Re: Website, urgent help needed
I took the tour of the text analysis and pushed through despite the problems on the page. Committers helped me over the hump where others might have just given up (to your point). When I did it, I made shell scripts so that my steps would be repeatable, in anticipation of updating the page. Unfortunately, I gave up on trying to figure out how to update the page (there were links indicating that I could do it), and I didn't want to appear stupid asking how to update the documentation (my bad - not anyone else's). Now I know that it was not possible unless I was a committer. Who should I send my scripts to, or how should I proceed with a current form of the page? SCott On 3/12/14, 5:02 AM, Sebastian Schelter s...@apache.org wrote: Hi Pavan, Awesome that you're willing to help. The documentation is the set of pages listed under Clustering in the navigation bar on mahout.apache.org. If you start working on one of the pages listed there (e.g. the k-Means doc), please create a jira ticket in our issue tracker with a title along the lines of "Cleaning up the documentation for k-Means on the website". Put a list of errors and corrections into the jira and I (or some other committer) will make sure to fix the website. Thanks, Sebastian On 03/12/2014 08:48 AM, Pavan Kumar N wrote: I'll help with the clustering algorithms documentation. Do send me the old documentation and I will check and remove errors - or better, let me know how to proceed. Pavan On Mar 12, 2014 12:35 PM, Sebastian Schelter s...@apache.org wrote: Hi, As you've probably noticed, I've put in a lot of effort over the last days to kickstart cleaning up our website. I've thrown out a lot of stuff and have been startled by the amount of outdated and incorrect information on our website, as well as links pointing to nowhere. I think our lack of documentation makes it super hard for new people to use Mahout. A crucial next step is to clean up the documentation on classification and clustering. I cannot do this alone, because I don't have the time and I'm not so familiar with the background of the algorithms. I need volunteers to go through all the pages under Classification and Clustering on the website. For the algorithms, the content and claims of the articles need to be checked; for the examples, we need to make sure that everything still works as described. It would also be great to move articles from personal blogs to our website. Imagine that some developer wants to try out Mahout and takes one hour for that in the evening. She will go to our website, download Mahout, read the description of an algorithm, and try to run an example. In the current state of the documentation, I'm afraid that most people will walk away frustrated, because the website does not help them as it should. Best, Sebastian PS: I will make my standpoint on whether Mahout should do a 1.0 release depend on whether we manage to clean up and maintain our documentation.
Re: Website, urgent help needed
I’ll make it work. Don’t know markdown (assume some reduced mark”up” language) - but I’ll figure it out. I will assume that I can check with my consulting buddy “Google” and find it. :) Thank you for your contributions - glad that I can give “something” back. I’ll start off by sending the doc to one of the committers, and then if you guys like my work, we can proceed from there …. SCott On 3/12/14, 9:38 AM, Sebastian Schelter s...@apache.org wrote: Hi Scott, The CMS behind the website uses markdown. So ideally you would attach a text file with markdown formatting to a jira issue, and a committer will put that into the website. Does that work for you? PS: There are a lot of online markdown editors out there. On 03/12/2014 03:27 PM, Scott C. Cote wrote: I took the tour of the text analysis and pushed through despite the problems on the page. Committers helped me over the hump where others might have just given up (to your point). When I did it, I made shell scripts so that my steps would be repeatable, in anticipation of updating the page. Unfortunately, I gave up on trying to figure out how to update the page (there were links indicating that I could do it), and I didn't want to appear stupid asking how to update the documentation (my bad - not anyone else's). Now I know that it was not possible unless I was a committer. Who should I send my scripts to, or how should I proceed with a current form of the page? SCott On 3/12/14, 5:02 AM, Sebastian Schelter s...@apache.org wrote: Hi Pavan, Awesome that you're willing to help. The documentation is the set of pages listed under Clustering in the navigation bar on mahout.apache.org. If you start working on one of the pages listed there (e.g. the k-Means doc), please create a jira ticket in our issue tracker with a title along the lines of "Cleaning up the documentation for k-Means on the website". Put a list of errors and corrections into the jira and I (or some other committer) will make sure to fix the website. Thanks, Sebastian On 03/12/2014 08:48 AM, Pavan Kumar N wrote: I'll help with the clustering algorithms documentation. Do send me the old documentation and I will check and remove errors - or better, let me know how to proceed. Pavan On Mar 12, 2014 12:35 PM, Sebastian Schelter s...@apache.org wrote: Hi, As you've probably noticed, I've put in a lot of effort over the last days to kickstart cleaning up our website. I've thrown out a lot of stuff and have been startled by the amount of outdated and incorrect information on our website, as well as links pointing to nowhere. I think our lack of documentation makes it super hard for new people to use Mahout. A crucial next step is to clean up the documentation on classification and clustering. I cannot do this alone, because I don't have the time and I'm not so familiar with the background of the algorithms. I need volunteers to go through all the pages under Classification and Clustering on the website. For the algorithms, the content and claims of the articles need to be checked; for the examples, we need to make sure that everything still works as described. It would also be great to move articles from personal blogs to our website. Imagine that some developer wants to try out Mahout and takes one hour for that in the evening. She will go to our website, download Mahout, read the description of an algorithm, and try to run an example. In the current state of the documentation, I'm afraid that most people will walk away frustrated, because the website does not help them as it should. Best, Sebastian PS: I will make my standpoint on whether Mahout should do a 1.0 release depend on whether we manage to clean up and maintain our documentation.
Re: Website, urgent help needed
OK On 3/12/14, 9:58 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Thanks Scott; please just attach your work to an issue in the Jira system; if there's not one already you could file a new issue. On Mar 12, 2014, at 7:44 AM, Scott C. Cote scottcc...@gmail.com wrote: I’ll make it work. Don’t know markdown (assume some reduced mark”up” language) - but I’ll figure it out. I will assume that I can check with my consulting buddy “Google” and find it. :) Thank you for your contributions - glad that I can give “something” back. I’ll start off by sending the doc to one of the committers, and then if you guys like my work, we can proceed from there …. SCott On 3/12/14, 9:38 AM, Sebastian Schelter s...@apache.org wrote: Hi Scott, The CMS behind the website uses markdown. So ideally you would attach a text file with markdown formatting to a jira issue, and a committer will put that into the website. Does that work for you? PS: There are a lot of online markdown editors out there. On 03/12/2014 03:27 PM, Scott C. Cote wrote: I took the tour of the text analysis and pushed through despite the problems on the page. Committers helped me over the hump where others might have just given up (to your point). When I did it, I made shell scripts so that my steps would be repeatable, in anticipation of updating the page. Unfortunately, I gave up on trying to figure out how to update the page (there were links indicating that I could do it), and I didn't want to appear stupid asking how to update the documentation (my bad - not anyone else's). Now I know that it was not possible unless I was a committer. Who should I send my scripts to, or how should I proceed with a current form of the page? SCott On 3/12/14, 5:02 AM, Sebastian Schelter s...@apache.org wrote: Hi Pavan, Awesome that you're willing to help. The documentation is the set of pages listed under Clustering in the navigation bar on mahout.apache.org. If you start working on one of the pages listed there (e.g. the k-Means doc), please create a jira ticket in our issue tracker with a title along the lines of "Cleaning up the documentation for k-Means on the website". Put a list of errors and corrections into the jira and I (or some other committer) will make sure to fix the website. Thanks, Sebastian On 03/12/2014 08:48 AM, Pavan Kumar N wrote: I'll help with the clustering algorithms documentation. Do send me the old documentation and I will check and remove errors - or better, let me know how to proceed. Pavan On Mar 12, 2014 12:35 PM, Sebastian Schelter s...@apache.org wrote: Hi, As you've probably noticed, I've put in a lot of effort over the last days to kickstart cleaning up our website. I've thrown out a lot of stuff and have been startled by the amount of outdated and incorrect information on our website, as well as links pointing to nowhere. I think our lack of documentation makes it super hard for new people to use Mahout. A crucial next step is to clean up the documentation on classification and clustering. I cannot do this alone, because I don't have the time and I'm not so familiar with the background of the algorithms. I need volunteers to go through all the pages under Classification and Clustering on the website. For the algorithms, the content and claims of the articles need to be checked; for the examples, we need to make sure that everything still works as described. It would also be great to move articles from personal blogs to our website. Imagine that some developer wants to try out Mahout and takes one hour for that in the evening. She will go to our website, download Mahout, read the description of an algorithm, and try to run an example. In the current state of the documentation, I'm afraid that most people will walk away frustrated, because the website does not help them as it should. Best, Sebastian PS: I will make my standpoint on whether Mahout should do a 1.0 release depend on whether we manage to clean up and maintain our documentation.
Re: Welcome Andrew Musselman as new committer
I personally am looking forward to the “advice” from the newest “recommended” committer to Hadoop. Congratulations to the Mahout team for increasing and growing :) Now back to my using … (and hopefully creating something meaningful for you guys). Scott PS: am bootstrapping my machine learning knowledge by taking the Coursera course offered by Andrew Ng - to correct my shaky knowledge of classifiers. Anyone else on this list taking or have taken this course? (Obviously, committers are probably not, but ….) On 3/7/14, 11:36 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Thank you for the welcome! Looking forward to it. I have a math background and got started with recommenders by building the first album recommender for Rhapsody ( http://rhapsody.com ) while I was doing web development and web services work for the service. Since then I learned to love/hate Pig and Hadoop for a living, and now I do data engineering and analytics at Accenture. We've used Mahout on a few production projects, and we're looking forward to more. See you on the lists! Best Andrew On Fri, Mar 7, 2014 at 9:12 AM, Sebastian Schelter s...@apache.org wrote: Hi, this is to announce that the Project Management Committee (PMC) for Apache Mahout has asked Andrew Musselman to become a committer, and we are pleased to announce that he has accepted. Being a committer enables easier contribution to the project, since in addition to posting patches on JIRA it also gives write access to the code repository. That also means that now we have yet another person who can commit patches submitted by others to our repo *wink* Andrew, we look forward to working with you in the future. Welcome! It would be great if you could introduce yourself with a few words :) Sebastian
Re: Rework our website
Ok - I expected (and am actually pleased) that it’s not a free-for-all. I’ll see what has already been updated in this latest flurry of updates and see what I can contribute. Forwarded to you. Thanks, SCott On 3/5/14, 4:43 PM, Sebastian Schelter s...@apache.org wrote: At the moment, only committers can change the website, unfortunately. If you have a text to add, I'm happy to work it in and add your name to our contributors list in the CHANGELOG. Best, Sebastian On 03/05/2014 04:58 PM, Scott C. Cote wrote: I had recently taken the text tour of Mahout, but I couldn't decipher a way to contribute updates to the tour (some of the file names have changed, etc). How would I start? (This was part of my offer to help with the documentation of Mahout.) SCott On 3/5/14 9:47 AM, Pat Ferrel p...@occamsmachete.com wrote: What, no centered text?? ;-) Love either. BTW users are no longer able to contribute content to the wiki. Most CMSs have a way to allow input that is moderated. Might this make getting documentation help easier? Allow anyone to contribute but committers can filter out the bad - sort of like submitting patches. On Mar 5, 2014, at 4:11 AM, Sebastian Schelter s...@apache.org wrote: Hi everyone, In our latest discussion, I argued that the lack (and errors) of documentation on our website is one of the main pain points of Mahout atm. To be honest, I'm also not very happy with the design; especially the fonts and spacing make it super hard to read long articles. This also prevents me from wanting to add articles and documentation. I think we should have a beautiful website, where it is fun to add new stuff. My design skills are pretty limited, but fortunately my brother is an art director! I asked him to make our website a bit more beautiful without changing too much of the structure, so that a redesign wouldn't take too long. I really like the results and would volunteer to dig out my CSS skills and do the redesign, if people agree. Here are his drafts; I like the second one best: https://people.apache.org/~ssc/mahout/mahout.jpg https://people.apache.org/~ssc/mahout/mahout2.jpg Let me know what you think! Best, Sebastian
Re: Rework our website
I had recently taken the text tour of Mahout, but I couldn't decipher a way to contribute updates to the tour (some of the file names have changed, etc). How would I start? (This was part of my offer to help with the documentation of Mahout.) SCott On 3/5/14 9:47 AM, Pat Ferrel p...@occamsmachete.com wrote: What, no centered text?? ;-) Love either. BTW users are no longer able to contribute content to the wiki. Most CMSs have a way to allow input that is moderated. Might this make getting documentation help easier? Allow anyone to contribute but committers can filter out the bad - sort of like submitting patches. On Mar 5, 2014, at 4:11 AM, Sebastian Schelter s...@apache.org wrote: Hi everyone, In our latest discussion, I argued that the lack (and errors) of documentation on our website is one of the main pain points of Mahout atm. To be honest, I'm also not very happy with the design; especially the fonts and spacing make it super hard to read long articles. This also prevents me from wanting to add articles and documentation. I think we should have a beautiful website, where it is fun to add new stuff. My design skills are pretty limited, but fortunately my brother is an art director! I asked him to make our website a bit more beautiful without changing too much of the structure, so that a redesign wouldn't take too long. I really like the results and would volunteer to dig out my CSS skills and do the redesign, if people agree. Here are his drafts; I like the second one best: https://people.apache.org/~ssc/mahout/mahout.jpg https://people.apache.org/~ssc/mahout/mahout2.jpg Let me know what you think! Best, Sebastian
streaming kmeans vs incremental canopy/solr/kmeans
Hello All, I have two questions (Q1, Q2). Q1: Am digging into text analysis and am wrestling with competing analyzed-data maintenance strategies. NOTE: my text comes from a very narrowly focused source. - Am currently crunching the data (batch) using the following scheme: 1. Load source text as rows in a MySQL database. 2. Create named TFIDF vectors using a custom analyzer from source text (-stopwords, lowercase, std filter, …) 3. Perform Canopy Cluster and then KMeans Cluster using an enhanced cosine metric (derived from a custom metric found in MiA) 4. Load references of clusters into SOLR (core1) - cluster id, top terms - along with full cluster data into Mongo (a cluster is a doc) 5. Then load source text into SOLR (core2) using the same custom analyzer with appropriate boost, along with the reference cluster id. NOTE: in all cases, the id of the source text is preserved throughout the flow - in the vector naming process, etc. So now I have a MySQL table, two SOLR cores, and a Mongo document collection (all tied together with text id as the common name). - Now when a new document enters the system after batch has been performed, I use core2 to test the top SOLR matches (the custom analyzer normalizes the new doc) to find the best cluster within a tolerance. If a cluster is found, then I place the text in that cluster; if not, then I start a new group (my word for a cluster not generated via kmeans). Either way, the doc makes its way into both (core1 and core2). I keep track of the number of group creations/document placements so that if a threshold is crossed, then I can re-batch the data. MiA (I think ch. 11) suggests that a user could run the canopy cluster routine to assign new entries to the clusters (instead of what I am doing). Does he mean to regenerate a new dictionary, frequencies, etc. for the corpus for every inbound document? My observations have been that this has been a very speedy process, but I'm hoping that I'm just too much of a novice and haven't thought of a way to simply update the dictionary/frequencies. (This process also calls for the eventual re-batching of the clusters.) While I was very early in my "implement what I have read" process, Suneel and Ted recommended that I examine the Streaming KMeans process. Would that process sidestep much of what I'm doing? Q2: I need to really understand the lexicon of my corpus. How do I see the list of terms that have been omitted due either to being in too many documents or to not being in enough documents for consideration? Please know that I know that I can look at the dictionary to see what terms are covered. And since my custom analyzer is using the StandardAnalyzer stop words, those are obvious also. If there isn't an option to emit the omitted words, where would be the natural place to capture that data and save it into yet another data store (sequence file, etc.)? Thanks in Advance for the Guidance, SCott
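On Q2: as far as I know, seq2sparse does not emit the terms it filters out, so the most direct route is to recompute document frequencies from the tokenized output and report everything that falls outside the thresholds. A minimal in-memory sketch follows, assuming the tokenized corpus fits in a List; the threshold parameters are illustrative stand-ins for whatever minimum/maximum document-frequency settings were used during vectorization:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Reports terms a min/max document-frequency filter would drop (illustrative only). */
public class OmittedTerms {
  public static List<String> omitted(List<String[]> docs, int minDf, double maxDfFraction) {
    Map<String, Integer> df = new HashMap<>();
    for (String[] doc : docs) {
      for (String term : new HashSet<>(Arrays.asList(doc))) {
        df.merge(term, 1, Integer::sum); // document frequency: count each term once per doc
      }
    }
    int maxDf = (int) Math.floor(maxDfFraction * docs.size());
    List<String> dropped = new ArrayList<>();
    for (Map.Entry<String, Integer> e : df.entrySet()) {
      if (e.getValue() < minDf || e.getValue() > maxDf) {
        dropped.add(e.getKey() + " (df=" + e.getValue() + ")");
      }
    }
    return dropped; // persist to a sequence file, MySQL, etc. as suits the pipeline above
  }
}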
Re: get similar items
I generate my initial sequence files directly from records in my MySQL database. Follow Martin's advice on going through the tutorial. Very, very, very helpful. Also - I really like MiA even if it is a couple of versions behind. The clustering chapters are still very accurate (seem to be :) ). You really need to get a good feel for what kind of vectors you are going to use as input to your clusters. SCott On 2/14/14 1:32 AM, N! 12481...@qq.com wrote: Thank you Sebastian, Martin, Scott. I checked 'https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line'. It looks like the case I described. But I am using Java with a MySQL database; is there an example related to this? Thanks. -- Original -- From: Scott C. Cote scottcc...@gmail.com Date: Wed, Feb 12, 2014 11:47 PM To: user@mahout.apache.org Subject: Re: get similar items Since you are relying on unguided data - switch from recommenders/classifiers to clustering. Anyone else agree with me on this??? SCott On 2/12/14 9:04 AM, Martin, Nick nimar...@pssd.com wrote: Yeah, since it would appear you're lacking the requisite data for recommenders, the only other thing I can think of in this case is potentially treating the movie records as documents and clustering them (via whatever might be in the 'description' field). Have a look here https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line and see if you can support something like this with your dataset. -----Original Message----- From: Sebastian Schelter [mailto:ssc.o...@googlemail.com] Sent: Wednesday, February 12, 2014 6:28 AM To: user@mahout.apache.org Subject: Re: get similar items Hi, Mahout's recommenders are based on analyzing interactions between users and items/movies, e.g. ratings or counts of how often the movie was watched. On 02/12/2014 11:34 AM, N! wrote: Hi all: Does anyone have any suggestions for the questions below? Thanks a lot. -- Original -- Sender: N! 12481...@qq.com Send time: Wednesday, Feb 12, 2014 6:17 PM To: user@mahout.apache.org Subject: Re: get similar items Hi Sean: Thanks for the reply. Assume I have only one table named 'movie' with 1000+ records; this table has three columns: 'id', 'movieName', 'movieDescription'. Can Mahout calculate the most similar movies for a movie (based on only the 'movie' table)? Code like: List mostSimilarMovieList = recommender.mostSimilar(int movieId). If not, do you have any suggestions for this scenario?
Re: streaming kmeans vs incremental canopy/solr/kmeans
Right now - I'm dealing with only 40,000 documents, but we will eventually grow more than 10x (put on the manager hat and say 1 mil docs), where a doc is usually no longer than 20 or 30 words. SCott On 2/14/14 12:46 PM, Ted Dunning ted.dunn...@gmail.com wrote: Scott, How much data do you have? How much do you plan to have? On Fri, Feb 14, 2014 at 8:04 AM, Scott C. Cote scottcc...@gmail.com wrote: Hello All, I have two questions (Q1, Q2). Q1: Am digging into text analysis and am wrestling with competing analyzed-data maintenance strategies. NOTE: my text comes from a very narrowly focused source. - Am currently crunching the data (batch) using the following scheme: 1. Load source text as rows in a MySQL database. 2. Create named TFIDF vectors using a custom analyzer from source text (-stopwords, lowercase, std filter, …) 3. Perform Canopy Cluster and then KMeans Cluster using an enhanced cosine metric (derived from a custom metric found in MiA) 4. Load references of clusters into SOLR (core1) - cluster id, top terms - along with full cluster data into Mongo (a cluster is a doc) 5. Then load source text into SOLR (core2) using the same custom analyzer with appropriate boost, along with the reference cluster id. NOTE: in all cases, the id of the source text is preserved throughout the flow - in the vector naming process, etc. So now I have a MySQL table, two SOLR cores, and a Mongo document collection (all tied together with text id as the common name). - Now when a new document enters the system after batch has been performed, I use core2 to test the top SOLR matches (the custom analyzer normalizes the new doc) to find the best cluster within a tolerance. If a cluster is found, then I place the text in that cluster; if not, then I start a new group (my word for a cluster not generated via kmeans). Either way, the doc makes its way into both (core1 and core2). I keep track of the number of group creations/document placements so that if a threshold is crossed, then I can re-batch the data. MiA (I think ch. 11) suggests that a user could run the canopy cluster routine to assign new entries to the clusters (instead of what I am doing). Does he mean to regenerate a new dictionary, frequencies, etc. for the corpus for every inbound document? My observations have been that this has been a very speedy process, but I'm hoping that I'm just too much of a novice and haven't thought of a way to simply update the dictionary/frequencies. (This process also calls for the eventual re-batching of the clusters.) While I was very early in my "implement what I have read" process, Suneel and Ted recommended that I examine the Streaming KMeans process. Would that process sidestep much of what I'm doing? Q2: I need to really understand the lexicon of my corpus. How do I see the list of terms that have been omitted due either to being in too many documents or to not being in enough documents for consideration? Please know that I know that I can look at the dictionary to see what terms are covered. And since my custom analyzer is using the StandardAnalyzer stop words, those are obvious also. If there isn't an option to emit the omitted words, where would be the natural place to capture that data and save it into yet another data store (sequence file, etc.)? Thanks in Advance for the Guidance, SCott
Re: get similar items
Since you are relying on unguided data - switch from recommenders/classifiers to clustering. Anyone else agree with me on this??? SCott On 2/12/14 9:04 AM, Martin, Nick nimar...@pssd.com wrote: Yeah, since it would appear you're lacking the requisite data for recommenders, the only other thing I can think of in this case is potentially treating the movie records as documents and clustering them (via whatever might be in the 'description' field). Have a look here https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line and see if you can support something like this with your dataset. -----Original Message----- From: Sebastian Schelter [mailto:ssc.o...@googlemail.com] Sent: Wednesday, February 12, 2014 6:28 AM To: user@mahout.apache.org Subject: Re: get similar items Hi, Mahout's recommenders are based on analyzing interactions between users and items/movies, e.g. ratings or counts of how often the movie was watched. On 02/12/2014 11:34 AM, N! wrote: Hi all: Does anyone have any suggestions for the questions below? Thanks a lot. -- Original -- Sender: N! 12481...@qq.com Send time: Wednesday, Feb 12, 2014 6:17 PM To: user@mahout.apache.org Subject: Re: get similar items Hi Sean: Thanks for the reply. Assume I have only one table named 'movie' with 1000+ records; this table has three columns: 'id', 'movieName', 'movieDescription'. Can Mahout calculate the most similar movies for a movie (based on only the 'movie' table)? Code like: List mostSimilarMovieList = recommender.mostSimilar(int movieId). If not, do you have any suggestions for this scenario?
Re: Problem converting tokenized documents into TFIDF vectors
Drew, I'm sorry - I'm derelict (as opposed to dirichlet) in responding that I got past my problem. It was the min freq that was killing me. Forgot about that parameter. Thank you for your assist. Hope to be able to return the favor. Am on the hook to update documentation for Mahout already - maybe that will do it :) This week, I'll be testing my code against the .9 distribution. SCott On 1/26/14 10:57 AM, Drew Farris d...@apache.org wrote: Scott, Based on the dictionary output, it looks like the process of generating vectors from your tokenized text is not working properly. The only term that's making it into your dictionary is 'java' - everything else is being filtered out. Furthermore, your tf vectors have a single dimension '0' with a weight that corresponds to the frequency of the term 'java' in each document. I would check the settings for minimum document frequency in the vectorization process. What is the command you are using to create vectors from your tokenized documents? Drew On Tue, Jan 21, 2014 at 6:30 PM, Scott C. Cote scottcc...@gmail.com wrote: All, Not a Mahout .9 problem; once I have this working with .8 Mahout, will immediately pull in the .9 stuff…. I am trying to make a small data set work (perhaps it is too small?) where I am clustering skills (phrases). For the sake of brevity (my steps are long), I have not documented the steps that I took to get my text of skills into tokenized form…. By the time I get to the TFIDF vectors (step 4) my output is of zero …. No tfidf vectors generated. I have broken this down into 4 steps. Step 1. Tokenize docs. Here is output validating success of tokenization. mahout seqdumper -i tokenized-documents/part-m-0 yields Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.common.StringTuple Key: 1: Value: [rest, web, services] Key: 2: Value: [soa, design, build, service, oriented, architecture, using, java] Key: 3: Value: [oracle, jdbc, build, java, database, connectivity, layer, oracle] Key: 4: Value: [spring, injection, use, spring, templates, inversion, control] Key: 5: Value: [j2ee, create, device, enterprise, java, beans, integrate, spring] Key: 6: Value: [can, deploy, web, archive, war, files, tomcat] Key: 7: Value: [java, graphics, uses, android, graphics, packages, create, user, interfaces] Key: 8: Value: [core, java, understand, core, libraries, java, development, kit] Key: 9: Value: [design, develop, jdbc, sql, queries] Key: 10: Value: [multithreading, thread, synchronization] Count: 10 Step 2. Create term frequency vectors from the tokenized sequence file (step 1). mahout seqdumper -i dictionary.file-0 yields Key: java: Value: 0 Count: 1 mahout seqdumper -i tf-vectors/part-r-0 yields Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable Key: 2: Value: 2:{0:1.0} Key: 3: Value: 3:{0:1.0} Key: 5: Value: 5:{0:1.0} Key: 7: Value: 7:{0:1.0} Key: 8: Value: 8:{0:2.0} Count: 5 Step 3. Create the document frequency data. mahout seqdumper -i frequency.file-0 yields Key: 0: Value: 5 Count: 1 NOTE to READER: "java" is NOT the only common word - "web" occurs more than once; how come it's not included? Step 4. Create the tfidf vectors: (can't remember if partials were created in the previous step) mahout seqdumper -i partial-vectors-0/part-r-0 yields INFO: Command line arguments: {--endPhase=[2147483647], --input=[part-r-0], --startPhase=[0], --tempDir=[temp]} 2014-01-21 16:57:23.661 java[24565:1203] Unable to load realm info from SCDynamicStore Input Path: part-r-0 Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable Key: 2: Value: 2:{} Key: 3: Value: 3:{} Key: 5: Value: 5:{} Key: 7: Value: 7:{} Key: 8: Value: 8:{} Count: 5 NOTE to READER: What do the empty brackets mean here? mahout seqdumper -i tfidf-vectors/part-r-0 yields Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable Count: 0 Why 0? What am I NOT understanding here? SCott
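For anyone who lands on this thread with the same symptom (a one-term dictionary followed by empty tfidf vectors): an aggressive corpus-wide frequency cutoff produces exactly this on a tiny corpus. The sketch below recreates the effect in plain Java on the ten tokenized documents shown above; the cutoff value is a hypothetical stand-in, since the exact flags used are not shown in the thread (seq2sparse's --minSupport defaults to 2):

import java.util.HashMap;
import java.util.Map;

/** Shows how a corpus-wide term-frequency cutoff shrinks a tiny corpus's dictionary. */
public class MinSupportDemo {
  public static void main(String[] args) {
    String[][] docs = {
      {"rest", "web", "services"},
      {"soa", "design", "build", "service", "oriented", "architecture", "using", "java"},
      {"oracle", "jdbc", "build", "java", "database", "connectivity", "layer", "oracle"},
      {"spring", "injection", "use", "spring", "templates", "inversion", "control"},
      {"j2ee", "create", "device", "enterprise", "java", "beans", "integrate", "spring"},
      {"can", "deploy", "web", "archive", "war", "files", "tomcat"},
      {"java", "graphics", "uses", "android", "graphics", "packages", "create", "user", "interfaces"},
      {"core", "java", "understand", "core", "libraries", "java", "development", "kit"},
      {"design", "develop", "jdbc", "sql", "queries"},
      {"multithreading", "thread", "synchronization"}
    };
    Map<String, Integer> totalFreq = new HashMap<>();
    for (String[] doc : docs)
      for (String term : doc) totalFreq.merge(term, 1, Integer::sum);

    int minSupport = 4; // hypothetical; "java" occurs 6 times, the runner-up "spring" only 3
    totalFreq.entrySet().removeIf(e -> e.getValue() < minSupport);
    System.out.println(totalFreq); // prints {java=6}: the one-entry dictionary seen above
  }
}

Consistent with this, the five tf vectors in the thread (documents 2, 3, 5, 7, and 8, with weights 1, 1, 1, 1, and 2) are exactly the per-document counts of "java" after every other term has been cut.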
Re: Problem converting tokenized documents into TFIDF vectors
I understand that it is not official. Am just trying to provide another test opportunity for the .9 release. SCott On 1/26/14 1:05 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Scott, FYI... the 0.9 release is not official yet. The project trunk's still at 0.9-SNAPSHOT. Please feel free to update the documentation. On Sunday, January 26, 2014 1:34 PM, Scott C. Cote scottcc...@gmail.com wrote: Drew, I'm sorry - I'm derelict (as opposed to dirichlet) in responding that I got past my problem. It was the min freq that was killing me. Forgot about that parameter. Thank you for your assist. Hope to be able to return the favor. Am on the hook to update documentation for Mahout already - maybe that will do it :) This week, I'll be testing my code against the .9 distribution. SCott On 1/26/14 10:57 AM, Drew Farris d...@apache.org wrote: Scott, Based on the dictionary output, it looks like the process of generating vectors from your tokenized text is not working properly. The only term that's making it into your dictionary is 'java' - everything else is being filtered out. Furthermore, your tf vectors have a single dimension '0' with a weight that corresponds to the frequency of the term 'java' in each document. I would check the settings for minimum document frequency in the vectorization process. What is the command you are using to create vectors from your tokenized documents? Drew On Tue, Jan 21, 2014 at 6:30 PM, Scott C. Cote scottcc...@gmail.com wrote: All, Not a Mahout .9 problem; once I have this working with .8 Mahout, will immediately pull in the .9 stuff…. I am trying to make a small data set work (perhaps it is too small?) where I am clustering skills (phrases). For the sake of brevity (my steps are long), I have not documented the steps that I took to get my text of skills into tokenized form…. By the time I get to the TFIDF vectors (step 4) my output is of zero …. No tfidf vectors generated. I have broken this down into 4 steps. Step 1. Tokenize docs. Here is output validating success of tokenization. mahout seqdumper -i tokenized-documents/part-m-0 yields Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.common.StringTuple Key: 1: Value: [rest, web, services] Key: 2: Value: [soa, design, build, service, oriented, architecture, using, java] Key: 3: Value: [oracle, jdbc, build, java, database, connectivity, layer, oracle] Key: 4: Value: [spring, injection, use, spring, templates, inversion, control] Key: 5: Value: [j2ee, create, device, enterprise, java, beans, integrate, spring] Key: 6: Value: [can, deploy, web, archive, war, files, tomcat] Key: 7: Value: [java, graphics, uses, android, graphics, packages, create, user, interfaces] Key: 8: Value: [core, java, understand, core, libraries, java, development, kit] Key: 9: Value: [design, develop, jdbc, sql, queries] Key: 10: Value: [multithreading, thread, synchronization] Count: 10 Step 2. Create term frequency vectors from the tokenized sequence file (step 1). mahout seqdumper -i dictionary.file-0 yields Key: java: Value: 0 Count: 1 mahout seqdumper -i tf-vectors/part-r-0 yields Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable Key: 2: Value: 2:{0:1.0} Key: 3: Value: 3:{0:1.0} Key: 5: Value: 5:{0:1.0} Key: 7: Value: 7:{0:1.0} Key: 8: Value: 8:{0:2.0} Count: 5 Step 3. Create the document frequency data. mahout seqdumper -i frequency.file-0 yields Key: 0: Value: 5 Count: 1 NOTE to READER: "java" is NOT the only common word - "web" occurs more than once; how come it's not included? Step 4. Create the tfidf vectors: (can't remember if partials were created in the previous step) mahout seqdumper -i partial-vectors-0/part-r-0 yields INFO: Command line arguments: {--endPhase=[2147483647], --input=[part-r-0], --startPhase=[0], --tempDir=[temp]} 2014-01-21 16:57:23.661 java[24565:1203] Unable to load realm info from SCDynamicStore Input Path: part-r-0 Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable Key: 2: Value: 2:{} Key: 3: Value: 3:{} Key: 5: Value: 5:{} Key: 7: Value: 7:{} Key: 8: Value: 8:{} Count: 5 NOTE to READER: What do the empty brackets mean here? mahout seqdumper -i tfidf-vectors/part-r-0 yields Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable Count: 0 Why 0? What am I NOT understanding here? SCott
Re: Running Mahout Example
To eliminate the MAHOUT_LOCAL stack traces, I set the env var to an arbitrary value. export MAHOUT_HOME=~/mahout export MAHOUT_LOCAL=yes export PATH=$PATH:${MAHOUT_HOME}/bin On 1/22/14 9:50 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: What's your Mahout version? On Wednesday, January 22, 2014 10:27 AM, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Strangely, I get the following: MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. Exception in thread "main" java.lang.NoClassDefFoundError: classpath Caused by: java.lang.ClassNotFoundException: classpath at java.net.URLClassLoader.findClass(URLClassLoader.java:434) at java.lang.ClassLoader.loadClass(ClassLoader.java:653) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:358) at java.lang.ClassLoader.loadClass(ClassLoader.java:619) Could not find the main class: classpath. Program will exit. Running on hadoop, using /mnt/hdgpfs/shared_home/hadoop/IHC-0.20.2/bin/hadoop and HADOOP_CONF_DIR=/mnt/hdgpfs/shared_home/hadoop/IHC-0.20.2/conf Benjamin On Wed, Jan 22, 2014 at 4:59 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Try examples/bin/cluster-reuters.sh Sent from my iPhone On Jan 22, 2014, at 9:56 AM, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Hi, I wished to run the Mahout example for the KMeans algorithm. I suppose that it is org.apache.mahout.clustering.syntheticcontrol.kmeans.Job. (1) Is it right? It looks for a /testdata/ directory. I did not find it. (2) Where is it, please? I thought to use the reuters data set described in the Manning book, and I extracted it to my disk and pointed to this directory in the main method. However, I get the following when running the Job: java.lang.NumberFormatException: For input string: "amex" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Double.valueOf(Unknown Source) at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:48) at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:1) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177) (3) What do I do wrong? Best regards Benjamin
Problem converting tokenized documents into TFIDF vectors
All, Not a Mahout .9 problem; once I have this working with .8 Mahout, will immediately pull in the .9 stuff…. I am trying to make a small data set work (perhaps it is too small?) where I am clustering skills (phrases). For the sake of brevity (my steps are long), I have not documented the steps that I took to get my text of skills into tokenized form…. By the time I get to the TFIDF vectors (step 4) my output is of zero …. No tfidf vectors generated. I have broken this down into 4 steps. Step 1. Tokenize docs. Here is output validating success of tokenization. mahout seqdumper -i tokenized-documents/part-m-0 yields Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.common.StringTuple Key: 1: Value: [rest, web, services] Key: 2: Value: [soa, design, build, service, oriented, architecture, using, java] Key: 3: Value: [oracle, jdbc, build, java, database, connectivity, layer, oracle] Key: 4: Value: [spring, injection, use, spring, templates, inversion, control] Key: 5: Value: [j2ee, create, device, enterprise, java, beans, integrate, spring] Key: 6: Value: [can, deploy, web, archive, war, files, tomcat] Key: 7: Value: [java, graphics, uses, android, graphics, packages, create, user, interfaces] Key: 8: Value: [core, java, understand, core, libraries, java, development, kit] Key: 9: Value: [design, develop, jdbc, sql, queries] Key: 10: Value: [multithreading, thread, synchronization] Count: 10 Step 2. Create term frequency vectors from the tokenized sequence file (step 1). mahout seqdumper -i dictionary.file-0 yields Key: java: Value: 0 Count: 1 mahout seqdumper -i tf-vectors/part-r-0 yields Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable Key: 2: Value: 2:{0:1.0} Key: 3: Value: 3:{0:1.0} Key: 5: Value: 5:{0:1.0} Key: 7: Value: 7:{0:1.0} Key: 8: Value: 8:{0:2.0} Count: 5 Step 3. Create the document frequency data. mahout seqdumper -i frequency.file-0 yields Key: 0: Value: 5 Count: 1 NOTE to READER: "java" is NOT the only common word - "web" occurs more than once; how come it's not included? Step 4. Create the tfidf vectors: (can't remember if partials were created in the previous step) mahout seqdumper -i partial-vectors-0/part-r-0 yields INFO: Command line arguments: {--endPhase=[2147483647], --input=[part-r-0], --startPhase=[0], --tempDir=[temp]} 2014-01-21 16:57:23.661 java[24565:1203] Unable to load realm info from SCDynamicStore Input Path: part-r-0 Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable Key: 2: Value: 2:{} Key: 3: Value: 3:{} Key: 5: Value: 5:{} Key: 7: Value: 7:{} Key: 8: Value: 8:{} Count: 5 NOTE to READER: What do the empty brackets mean here? mahout seqdumper -i tfidf-vectors/part-r-0 yields Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable Count: 0 Why 0? What am I NOT understanding here? SCott
Re: need help explaining difference in k means output
Mahesh, I guess this is what I get for working too long and not recognizing the diff …. Suspected it was something silly. Changing the driver parameters to EXACTLY the same as the command line does indeed work. Thank you. I now have one file. Not sure if it was the convergence or the sequential, but I have a hunch that the problem was the sequential (as you pointed out, I have plenty of iterations left). Cheers! SCott On 1/6/14 3:58 AM, Mahesh Balija balijamahesh@gmail.com wrote: Hi Scott, Not very sure why you are getting many part files in the code execution. The difference between your command line and the code execution is that your cd [convergence delta] is different: 0.1 vs 0.01. In the latter case KMeans might take more iterations to converge, since its convergenceDelta is much smaller, but anyway you have the number of iterations set to 10. Another difference is you are running your source-code execution in sequential mode. I am not sure whether these factors really affect the number of part files being generated. Anyhow, you have to evaluate the number of clusters being generated finally by using ClusterDumper in both cases; that will give you the number of clusters and the points associated with each cluster. The ClusteredPoints will be generated in the last iteration and will have the info about the clusters and the associated points for each cluster. Best, Mahesh Balija. On Sun, Jan 5, 2014 at 1:59 AM, Scott C. Cote scottcc...@gmail.com wrote: All, When I run the KMeans analysis from the command line, # # added the -cd option per instructions in Mahout in Action (MiA), so the convergence threshold is .1 # instead of the default value of .5, because cosines lie within 0 and 1. # # maximum number of iterations is 10 # mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-canopy-centroids/clusters-0-final/ -cl -ow -o reuters-kmeans-clusters -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 the iterations resolve to a directory with the word "final" that has a single file with a name like part-r-0. If I run it as a Java routine: KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0-final"), clusterOutput, new CosineDistanceMeasure(), 0.01, 20, true, 0.0, true); thousands of files such as part-00338 are produced. The same data is used as input for both, and both are initialized from canopy. Why does the command-line form generate a single file while my Java version generates multiple output files? What setting/configuration am I missing? Secondary question: the sequence files located in the final folder I assume to contain the centroids of the data (and the points that the centroids were derived from are in clusteredPoints; please confirm). Thanks in advance. SCott
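For the record, "exactly the same as the command line" would mean something like the call below: same convergence delta, same iteration cap, and the MapReduce path rather than sequential execution. This mirrors the signature shown in the original post; check it against your Mahout version before copying, since the driver API changed between releases:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.CosineDistanceMeasure;

public class RunKMeansLikeTheCli {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path vectorsFolder = new Path("reuters-vectors/tfidf-vectors");   // -i
    Path canopyCentroids = new Path("reuters-canopy-centroids");
    Path clusterOutput = new Path("reuters-kmeans-clusters");         // -o
    KMeansDriver.run(conf, vectorsFolder,
        new Path(canopyCentroids, "clusters-0-final"),                // -c
        clusterOutput,
        new CosineDistanceMeasure(),                                  // -dm
        0.1,    // -cd: 0.1 as on the command line, not 0.01
        10,     // -x: 10 iterations as on the command line, not 20
        true,   // -cl: run the clustering/classification pass
        0.0,    // outlier removal threshold, unchanged from the original call
        false); // runSequential = false: use the MapReduce path, like the CLI
  }
}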
need help explaining difference in k means output
All, When I run the Kmeans analysis from the command line,

#
# added the -cd option per instructions in Mahout In Action (MiA) so the convergence threshold is .1
# instead of the default value of .5, because cosines lie within 0 and 1.
#
# maximum number of iterations is 10
#
mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-canopy-centroids/clusters-0-final/ -cl -ow -o reuters-kmeans-clusters -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1

the iterations resolve to a directory with the word "final" that has a single file whose name is like part-r-0. If I run it as a Java routine:

KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0-final"), clusterOutput, new CosineDistanceMeasure(), 0.01, 20, true, 0.0, true);

thousands of files such as part-00338 are produced. The same data is used as input for both, and both are initialized from canopy. Why does the command line form generate a single file while my Java version generates multiple output files? What setting/configuration am I missing?

Secondary question: I assume the sequence files located in the "final" folder contain the centroids of the data, and that the points the centroids were derived from are in the clusteredPoints (please confirm). Thanks in advance. SCott
Re: Equality of two DenseMatrix objects
Ted - thank you for taking the time to point out that in multivariate systems, there are many interpretations of what would seem ordinary and non-debatable in scalar mathematics. For example, in the relational algebra world, I know of seven different interpretations of relational division. SCott

On 12/29/13 10:02 PM, Ted Dunning ted.dunn...@gmail.com wrote:
On Sun, Dec 29, 2013 at 7:30 PM, Tharindu Rusira tharindurus...@gmail.com wrote:
Hi Ted, Thanks for bringing this discussion back alive. It's true, as Sebastian mentioned, equality checking for matrices is an expensive task, and Ted has come up with a smart one-liner here (even though a considerable amount of computational complexity is hidden somewhere). But don't you think (at least for the sake of completeness) that we should have an implementation of this?

Not really. The problem is that there are many different meanings of "equal" for matrices. In fact there are many definitions of "zero" as well. This stems partly from the fact that we have to inherit a sense of "nearly zero" or "nearly equal" from the fact that we are using floating point arithmetic. This is exactly why equals is poorly defined for floating point numbers, but worse. As such, any single definition is going to be seriously problematic. Any definition that doesn't have a tolerance argument is inherently dangerous to use except in very limited situations. For example, here are some possibilities for vector equality:

|x - y|_F < \delta
|x - y|_1 < \delta
|x - y|_\infty < \delta
(x - y)^T A (x - y) < \delta
x^T A y > 1 - \delta/2

The first says that the sum of the squares of the components of the difference is less than a particular number. The second says that the sum of the absolute values of the difference is less. The third says that the maximum component of the difference is small. The fourth says that the quadratic form of the difference is nearly zero, neglecting components in the null space of A. The last form is useful for cases where x and y have unit norm with respect to A (i.e. x^T A x = 1). Which of these is correct? Of all of these, only the last two are equivalent, and only in limited situations. For matrices, there are even more possibilities.

Btw, this thread has turned into a developers' discussion, so I'm not sure whether we should continue this on the developers list.

I think that this is a very important thread for users at large as well.
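To make the tolerance point concrete, here is a small sketch of the first criterion (Frobenius distance below a tolerance) against Mahout's DenseMatrix. It is an illustration only, not a proposed Mahout API; it assumes both matrices have the same shape:

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

public class MatrixNearEquality {

  // True iff the Frobenius norm of (a - b) is below delta: |a - b|_F < delta.
  // Assumes a and b have identical dimensions.
  static boolean frobeniusClose(Matrix a, Matrix b, double delta) {
    double sumSq = 0.0;
    for (int row = 0; row < a.numRows(); row++) {
      for (int col = 0; col < a.numCols(); col++) {
        double d = a.get(row, col) - b.get(row, col);
        sumSq += d * d;
      }
    }
    return Math.sqrt(sumSq) < delta;
  }

  public static void main(String[] args) {
    Matrix a = new DenseMatrix(new double[][] {{1.0, 2.0}, {3.0, 4.0}});
    Matrix b = new DenseMatrix(new double[][] {{1.0, 2.0}, {3.0, 4.0 + 1e-12}});
    System.out.println(frobeniusClose(a, b, 1e-9)); // true: equal within tolerance
  }
}

Note that the tolerance argument delta is exactly the piece a fixed equals() method cannot supply, which is Ted's point.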
Mahout In Action - NewsKMeansClustering sample not generating clusters
Hello Mahout Trainers and Gurus: I am plowing through the sample code from Mahout in Action. Have been trying to run the example NewsKMeansClustering using the Reuters dataset. Found Alex Ott's blog http://alexott.blogspot.co.uk/2012/07/getting-started-with-examples-from.html and downloaded the updated examples for 0.7 mahout. I took the exploded zip and modified the pom.xml so that it referenced 0.8 mahout instead of 0.7 mahout. Of course, there are compile errors (expected), but the only seemingly significant problems are in the helper class called MyAnalyzer. NOTE: I am NOT complaining about the fact that the samples don't compile properly in 0.8. If my efforts to make it work result in sharable code, then I have helped (or the person who helps me has helped).

I need help in potentially two different parts: revision of MyAnalyzer (steps 1 and 2) and/or sidestepping it (step 3).

Steps taken (total of 3 steps):

Step 1. Performed the sgml2text conversion of the Reuters data and then converted the text to sequence files.

Step 2. Attempted to run the Java NewsKMeansClustering with MyAnalyzer - attempted to modify MyAnalyzer to fit into the 0.8 mahout world. When I try to run the program, the sample blows up with this message:

2013-12-27 12:59:29.870 java[86219:1203] Unable to load realm info from SCDynamicStore
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/scottccote/.m2/repository/org/slf4j/slf4j-jcl/1.7.5/slf4j-jcl-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/scottccote/.m2/repository/org/slf4j/slf4j-log4j12/1.5.11/slf4j-log4j12-1.5.11.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JCLLoggerFactory]
2013-12-27 12:59:30 NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-12-27 12:59:30 JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-12-27 12:59:30 LocalJobRunner [WARN] job_local_0001
java.lang.NullPointerException
    at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.fill(CharacterUtils.java:209)
    at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:135)
    at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:49)
    at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:38)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
Exception in thread "main" java.lang.IllegalStateException: Job failed!
    at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:95)
    at mia.clustering.ch09.NewsKMeansClustering.main(NewsKMeansClustering.java:53)

Here is the source code to my revised MyAnalyzer. I tried to stay as true to form of the original MyAnalyzer, but I'm sure that I misunderstood something in this class when I ported it to the new Lucene Analyzer interface API.
public class MyAnalyzer extends Analyzer {

    private final Pattern alphabets = Pattern.compile("[a-z]+");

    /*
     * (non-Javadoc)
     * @see org.apache.lucene.analysis.Analyzer#createComponents(java.lang.String, java.io.Reader)
     */
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        final Tokenizer source = new StandardTokenizer(Version.LUCENE_CURRENT, reader);
        TokenStream result = new StandardFilter(Version.LUCENE_CURRENT, source);
        result = new LowerCaseFilter(Version.LUCENE_CURRENT, result);
        result = new StopFilter(Version.LUCENE_CURRENT, result, StandardAnalyzer.STOP_WORDS_SET);
        CharTermAttribute termAtt = result.addAttribute(CharTermAttribute.class);
        StringBuilder buf = new StringBuilder();
        try {
            result.reset();
            while (result.incrementToken()) {
                if (termAtt.length() < 3) continue;
                String word = new String(termAtt.buffer(), 0, termAtt.length());
                Matcher m = alphabets.matcher(word);
                if (m.matches()) {
                    buf.append(word).append(" ");
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_CURRENT, new StringReader(buf.toString()));
        return new TokenStreamComponents(source, ts);
    }
}

Step 3. Since I wasn't progressing with MyAnalyzer, I commented out the MyAnalyzer reference inside NewsKMeansClustering and replaced it with:

// MyAnalyzer analyzer = new MyAnalyzer();
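The NullPointerException above is consistent with createComponents consuming its token stream eagerly: the returned chain is already exhausted and tied to the first document's reader, so the tokenizer's next fill() hits a dead stream. One possible port, sketched under the assumption of Lucene 4.x (the line the Mahout 0.8 stack trace comes from), filters lazily inside the chain instead of buffering tokens into a string; this is a sketch, not the MiA authors' code:

import java.io.IOException;
import java.io.Reader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends Analyzer {

    private static final Pattern ALPHABETS = Pattern.compile("[a-z]+");

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_CURRENT, reader);
        TokenStream result = new StandardFilter(Version.LUCENE_CURRENT, source);
        result = new LowerCaseFilter(Version.LUCENE_CURRENT, result);
        result = new StopFilter(Version.LUCENE_CURRENT, result, StandardAnalyzer.STOP_WORDS_SET);
        // Filter lazily instead of pre-consuming the stream: keep only
        // lowercase alphabetic tokens of length >= 3, matching the intent
        // of the original MyAnalyzer.
        result = new TokenFilter(result) {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            @Override
            public boolean incrementToken() throws IOException {
                while (input.incrementToken()) {
                    if (termAtt.length() >= 3 && ALPHABETS.matcher(termAtt).matches()) {
                        return true;
                    }
                }
                return false;
            }
        };
        return new TokenStreamComponents(source, result);
    }
}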
Re: Mahout In Action - NewsKMeansClustering sample not generating clusters
source from Alex Ott's .7 version of NewsKMeansClustering:

/*
 * Source code for Listing 9.4
 */
package mia.clustering.ch09;

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.classify.WeightedVectorWritable;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.vectorizer.DictionaryVectorizer;
import org.apache.mahout.vectorizer.DocumentProcessor;
import org.apache.mahout.vectorizer.tfidf.TFIDFConverter;

public class NewsKMeansClustering {

    public static void main(String args[]) throws Exception {
        //
        // changes from Alex Ott's source:
        //
        // 1. changed booleans that indicate the use of named vectors from false to true
        // 2. changed sequential access booleans from false to true
        // 3. changed MyAnalyzer to StandardAnalyzer
        // 4. added System.out.println statements to provide console guidance on progress
        // 5. changed input dir to "reuters-seqfiles" to make use of output from command line approach in tour
        //
        int minSupport = 5;
        int minDf = 5;
        int maxDFPercent = 95;
        int maxNGramSize = 2;
        int minLLRValue = 50;
        int reduceTasks = 1;
        int chunkSize = 200;
        int norm = 2;
        boolean sequentialAccessOutput = true;

        // String inputDir = "inputDir";
        String inputDir = "reuters-seqfiles";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        String outputDir = "newsClusters";
        HadoopUtil.delete(conf, new Path(outputDir));
        Path tokenizedPath = new Path(outputDir, DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);

        // MyAnalyzer analyzer = new MyAnalyzer();
        System.out.println("tokenizing the documents");
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT, StandardAnalyzer.STOP_WORDS_SET);
        DocumentProcessor.tokenizeDocuments(new Path(inputDir), analyzer.getClass().asSubclass(Analyzer.class),
            tokenizedPath, conf);

        System.out.println("creating the term frequency vectors from tokenized documents");
        DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, new Path(outputDir),
            DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf, minSupport, maxNGramSize, minLLRValue,
            2, true, reduceTasks, chunkSize, sequentialAccessOutput, true);

        System.out.println("calculating document frequencies from tf vectors");
        Pair<Long[], List<Path>> dfData = TFIDFConverter.calculateDF(
            new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
            new Path(outputDir), conf, chunkSize);

        System.out.println("creating the tfidf vectors");
        TFIDFConverter.processTfIdf(new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
            new Path(outputDir), conf, dfData, minDf, maxDFPercent, norm, true,
            sequentialAccessOutput, true, reduceTasks);

        Path vectorsFolder = new Path(outputDir, "tfidf-vectors");
        Path canopyCentroids = new Path(outputDir, "canopy-centroids");
        Path clusterOutput = new Path(outputDir, "clusters");

        System.out.println("Deriving canopy clusters from the tfidf vectors");
        // CanopyDriver.run(vectorsFolder, canopyCentroids, new EuclideanDistanceMeasure(),
        //     250, 120, false, 0.0, false);
        CanopyDriver.run(vectorsFolder, canopyCentroids, new CosineDistanceMeasure(), .4, .8, true, 0.0, true);

        System.out.println("running cluster kmean");
        // KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
        //     new TanimotoDistanceMeasure(), 0.01, 20, true, 0.0, false);
        KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
            new CosineDistanceMeasure(), 0.01, 20, true, 0.0, true);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs,
            new Path(clusterOutput + Cluster.CLUSTERED_POINTS_DIR + "/part-0"), conf);
        IntWritable key = new IntWritable();
        WeightedVectorWritable value = new WeightedVectorWritable();
        while (reader.next(key, value)) {
            System.out.println(key.toString() + " belongs to cluster " + value.toString());
        }
        reader.close();
    }
}

I'm running out of ideas. SCott

From: Scott C. Cote scottcc...@gmail.com
Date: Friday, December 27, 2013 1:56 PM
To: user@mahout.apache.org user@mahout.apache.org
Subject: Mahout In Action - NewsKMeansClustering sample not generating clusters

Hello Mahout Trainers and Gurus: I am plowing through the sample code from Mahout in Action. Have been trying to run the example NewsKMeansClustering using the Reuters dataset. Found Alex Ott's blog http://alexott.blogspot.co.uk/2012/07/getting-started-with-examples-from.html and downloaded the updated
Questions related to MiA and Quick tour of text analysis ..
All, Two questions related to "Quick tour of text analysis using the Mahout command line":

1. Metrics: When moving through the process of performing the cluster analysis, one can use many different metrics. In the tour, the choice was made to use the cosine metric. Are there any problems that can arise from using the cosine metric to define the clusters, but using Tanimoto or Euclidean distance to dump the clusters? I have so far remained consistent: once I start with cosine, I go all the way with cosine. When does it make sense to not do what I am doing? To be clear, the current version of the tour does NOT specify that a metric should be used when dumping a cluster, so the default Euclidean is used.

2. Parameters around canopy clustering: What are parameters t3 and t4? I understand that they are optional reducer-side thresholds, and that t1 and t2 are used for them if t3 and t4 are not specified. https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering Lots of discussion about t1 and t2, but t3 and t4 are not covered in MiA either. Are these params that I should ignore for now? SCott
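On question 2: the canopy wiki page describes t3 and t4 as reducer-phase analogues of t1 and t2 that default to t1/t2 when not specified, so they can usually be ignored at first. For orientation, this sketch shows where t1 and t2 enter the CanopyDriver.run overload that appears in the NewsKMeansClustering listing earlier in this archive; conventionally t1 (the loose membership threshold) is larger than t2 (the tight removal threshold), and the values and paths here are illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.common.distance.CosineDistanceMeasure;

public class CanopyThresholds {
  public static void main(String[] args) throws Exception {
    // t1 > t2: points within t1 of a canopy center join that canopy;
    // points within t2 are removed from further canopy seeding.
    // With CosineDistanceMeasure, distances lie in [0, 1].
    CanopyDriver.run(
        new Path("reuters-vectors/tfidf-vectors"), // input vectors (illustrative)
        new Path("reuters-canopy-centroids"),      // canopy output (illustrative)
        new CosineDistanceMeasure(),
        0.8,    // t1: loose threshold
        0.4,    // t2: tight threshold
        true,   // run clustering
        0.0,    // cluster classification threshold
        false); // runSequential
  }
}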
Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis
version of java (java -version): java version 1.6.0_65, Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609), Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)
version of os (uname -a): Darwin Scotts-MacBook-Air.local 12.5.0 Darwin Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013; root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64

On 12/19/13 1:08 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
I don't see a need for uploading ur commands. Clean up HDFS (both output and temp folders) and try running the 5 steps again - extract reuters, seqdirectory, seq2sparse, rowid job, rowsimilarity job. Please use the '-ow' option while running each of the jobs.

On Thursday, December 19, 2013 2:04 PM, Scott C. Cote scottcc...@gmail.com wrote:
I manually deleted the temp folder too (after 2 failed starts). Would it be helpful for me to upload my shell scripts that encapsulate all of the commands posted on the tour? They reflect the current state of reuters and .8 mahout. And if I did - how would I do it? Thanks, SCott

On 12/19/13 1:00 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
Yep, that's what has happened in ur case. The wiki doesn't have it, but please specify the -ow (overwrite) option while running the RowSimilarityJob. That should clean up both the output and temp folders before running the job.

On Thursday, December 19, 2013 1:50 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
Haha... that could explain it. RowSimilarityJob creates temp files during execution. If ur laptop 'sleeped' then the temp files still persist, and running the job again wouldn't overwrite the old temp files (I need to verify that). It should be good enough to run the RowSimilarity job again.

On Thursday, December 19, 2013 1:46 PM, Scott C. Cote scottcc...@gmail.com wrote:
Suneel, I'm going to do the similarity part of the tour over - my laptop went to sleep in the middle of the run of the rowsimilarity job. Maybe the job is sensitive to that …. :( Normally a server would not go to sleep, nor would it run in local mode. Sorry that I didn't think of that sooner. Will let you know my outcome. Am planning on redoing by deleting the contents and the folder titled reuters-similarity. Please let me know if that is not good enough. Thanks again. SCott

On 12/19/13 11:53 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:
What you are seeing is the output matrix of the RowSimilarity job. You are right, there should be only 21578 documents in the reuters corpus.
a) How many documents do you have in your docIndex? DocIndex is one of the artifacts of the RowIDJob and should have been executed prior to the RowSimilarity Job. You can run seqdumper on docIndex to see the output.
b) Also, what was the message at the end of the RowId job? It should read something like 'Wrote out matrix with 21578 rows and 19515 columns to reuters-matrix/matrix'.

On Thursday, December 19, 2013 12:14 PM, Scott C. Cote scottcc...@gmail.com wrote:
All, I am a newbie Mahout user and am trying to use the "Quick tour of text analysis using the Mahout command line". Thank you to whomever contributed to that page. https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

Went all the way from beginning to end of the page with seemingly no hiccups. At the very end of the tour, I became confused because the command:

mahout seqdumper -i reuters-matrix/matrix | more

allowed me to see output (snippet):

Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.2279237043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.105152778531,4:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.2279237043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}

Reading through that snippet of data made me think that there exists a document with row id 41154 with cosine value of ~0.0658 (the last
Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis
you should be looking at a seqdumper of the output from rowsimilarity, which in ur case would be the output in reuters-similarity. That should give the 10 most similar documents and their cosine distances from the referenced document.

mahout seqdumper -i reuters-similarity/part-r-* | more

yields

Input Path: reuters-similarity/part-r-0
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {0:0.,13611:0.17446750688012366,13430:0.15853208358190823,17520:0.19351644052283437,18330:0.15898358188286904,4411:0.20851636244169733,13403:0.1663674094837415,14458:0.17265033919444714,14613:0.15365176945223238,11399:0.19745333923929734}
Key: 1: Value: {9858:0.32081902404236906,9704:0.2485999435029943,9833:0.30851564542610826,19789:0.37458607189215337,10056:0.2885413911200995,10601:0.25986402839977124,11858:0.305718360283,17412:0.30330496505095894,1:0.9998,9702:0.26198579353949075}
Key: 2: Value: {2:1.0004,1087:0.28125327148896956,10390:0.2690057046963114,10022:0.27668518648436297,6746:0.26969982074464605,12886:0.27032675431539793,13168:0.25889934686395943,997:0.26225673856545156,1392:0.2673559453473729,20614:0.3009916279814217}
…..

:) There's an error in the wiki link instructions: the seqdumper should have been on rowsimilarity/part-r-* and not on matrix/matrix for determining similar documents. Hope this helps. Sorry again for the confusion.

On Friday, December 20, 2013 4:51 PM, Scott C. Cote scottcc...@gmail.com wrote:
Suneel and others, I am still getting the strange results when I do the tour. Suneel: I manually wiped out the temp folder and also deleted the reuters-XXX folders. Also, per your advice I added the -ow option to all of the commands. NOTE: the step to create a matrix would NOT take a -ow option. I have tried again, and am still seeing references to documents that do not exist. The tail end of reuters-matrix/docIndex looks like (mahout seqdumper -i reuters-matrix/docIndex | tail):

INFO: Program took 1077 ms (Minutes: 0.01795)
Key: 21569: Value: /reut2-021.sgm-91.txt
Key: 21570: Value: /reut2-021.sgm-92.txt
Key: 21571: Value: /reut2-021.sgm-93.txt
Key: 21572: Value: /reut2-021.sgm-94.txt
Key: 21573: Value: /reut2-021.sgm-95.txt
Key: 21574: Value: /reut2-021.sgm-96.txt
Key: 21575: Value: /reut2-021.sgm-97.txt
Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578

And the following snippet exists inside reuters-matrix/matrix and references key 41625 (which is larger than any key in docIndex).

Key: 2: Value: /reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}

---

So in this email, I have listed the following pieces of information: 1. commands, 2. env vars, 3. SW version info. Again, thank you in advance for your help. Scott

INFO below:

1. sequence of commands with relevant logged output points (omitted the sequence dump commands):

mv reuters xreuters
rm -r temp
rm -r reuters-*
mv xreuters reuters
mvn -e -q exec:java -Dexec.mainClass=org.apache.lucene.benchmark.utils.ExtractReuters -Dexec.args=reuters/ reuters-extracted/
mahout seqdirectory -c UTF-8 -i reuters-extracted/ -o reuters-seqfiles -ow
mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors/ -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -nv
#
# added the -cd option per instructions in Mahout In Action (MiA) so the convergence threshold is .1
# (originally this was the default value, with no effect on the unexpected results)
# instead of the default value of .5, because cosines lie within 0 and 1.
#
mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-kmeans-centroids -cl -ow -o reuters-kmeans-clusters -k 20 -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
mahout clusterdump -d reuters-vectors/dictionary.file-0 -dt sequencefile -i reuters-kmeans-clusters/clusters-3-final -n 20 -b 100 -o cdump.txt -p reuters-kmeans-clusters/clusteredPoints/
mahout rowid -i reuters-vectors/tfidf
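To read those similarity vectors programmatically rather than through seqdumper, here is a sketch using the same SequenceFile.Reader pattern as the NewsKMeansClustering listing; it assumes Mahout 0.8 (Vector.iterateNonZero) and an illustrative part file name:

import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DumpRowSimilarity {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Illustrative path: one output part of the rowsimilarity job
    Path part = new Path("reuters-similarity/part-r-00000");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    IntWritable docId = new IntWritable();      // row id assigned by the rowid job
    VectorWritable similarities = new VectorWritable();
    while (reader.next(docId, similarities)) {
      Vector v = similarities.get();
      Iterator<Vector.Element> it = v.iterateNonZero();
      while (it.hasNext()) {
        Vector.Element e = it.next();           // e.index() = other document's row id
        System.out.println(docId.get() + " ~ " + e.index() + " : " + e.get());
      }
    }
    reader.close();
  }
}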
Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis
What does the data in cdump.txt represent? Can you point me in the right direction? SCott

On 12/20/13 4:30 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
Sorry Scott, I should have looked at this more closely. I apologize.

1. You are doing a seqdumper of the matrix (which is generated from the rowid job and is not the output of the rowsimilarity job). The RowId job generates an MxN matrix where M = no. of documents and N = terms associated with each document. The value of a cell in the matrix is the tf-idf weight of the term. So in the following output:

{Code}
Key: 2: Value: /reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}
{Code}

means: for document 2, what follows are the term:tf-idf weight pairs. To see the term corresponding to 41625, look at dictionary.file-0 for the corresponding key. Hope that clarifies and clears the confusion here.

2. In order to see the most similar documents for a given document, you should be looking at a seqdumper of the output from rowsimilarity, which in ur case would be the output in reuters-similarity. That should give the 10 most similar documents and their cosine distances from the referenced document. There's an error in the wiki link instructions: the seqdumper should have been on rowsimilarity/part-r-* and not on matrix/matrix for determining similar documents. Hope this helps. Sorry again for the confusion.

On Friday, December 20, 2013 4:51 PM, Scott C. Cote scottcc...@gmail.com wrote:
Suneel and others, I am still getting the strange results when I do the tour. Suneel: I manually wiped out the temp folder and also deleted the reuters-XXX folders. Also, per your advice I added the -ow option to all of the commands. NOTE: the step to create a matrix would NOT take a -ow option. I have tried again, and am still seeing references to documents that do not exist. The tail end of reuters-matrix/docIndex looks like (mahout seqdumper -i reuters-matrix/docIndex | tail):

INFO: Program took 1077 ms (Minutes: 0.01795)
Key: 21569: Value: /reut2-021.sgm-91.txt
Key: 21570: Value: /reut2-021.sgm-92.txt
Key: 21571: Value: /reut2-021.sgm-93.txt
Key: 21572: Value: /reut2-021.sgm-94.txt
Key: 21573: Value: /reut2-021.sgm-95.txt
Key: 21574: Value: /reut2-021.sgm-96.txt
Key: 21575: Value: /reut2-021.sgm-97.txt
Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578

And the following snippet exists inside reuters-matrix/matrix and references key 41625 (which is larger than any key in docIndex).

Key: 2: Value: /reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}

---

So in this email, I have listed the following pieces of information: 1. commands, 2. env vars, 3. SW version info. Again, thank you in advance for your help. Scott

INFO below:

1. sequence of commands with relevant logged output points (omitted the sequence dump commands):

mv reuters xreuters
rm -r temp
rm -r reuters-*
mv xreuters reuters
mvn -e -q exec:java -Dexec.mainClass=org.apache.lucene.benchmark.utils.ExtractReuters -Dexec.args=reuters
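Suneel's pointer to dictionary.file-0 can be automated. The dictionary is a sequence file of Text term to IntWritable index (the seqdumper output "Key: java: Value: 0" earlier in this archive shows the same layout), so inverting it gives a lookup from a term id such as 41625 back to the term; paths here are illustrative:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DictionaryLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path dict = new Path("reuters-vectors/dictionary.file-0"); // illustrative

    // Invert term -> index into index -> term
    Map<Integer, String> byIndex = new HashMap<Integer, String>();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, dict, conf);
    Text term = new Text();
    IntWritable index = new IntWritable();
    while (reader.next(term, index)) {
      byIndex.put(index.get(), term.toString());
    }
    reader.close();

    System.out.println("term 41625 = " + byIndex.get(41625));
  }
}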
Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis
Suneel, I think I have it :) Pls confirm this understanding: I'm looking at the cdump.out that comes from clusterdump. It has the 20 clusters, each of the top words in the cluster, and each of the vectors that are members of the cluster. Do I have it? Am I getting this? Thanks, SCott

On 12/20/13 6:32 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
Which cdump.txt?

On Friday, December 20, 2013 7:29 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
You could use clusterdump to see the output of your clusters. E.g.:

$MAHOUT clusterdump \
  -i ${WORK_DIR}/reuters-kmeans/clusters-*-final \
  -o ${WORK_DIR}/reuters-kmeans/clusterdump \
  -d ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
  -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure -sp 0 \
  --pointsDir ${WORK_DIR}/reuters-kmeans/clusteredPoints

I am assuming you had run kmeans clustering; if so the clusters wouldn't overlap. You would see cluster overlap if u were to run fuzzy kmeans clustering.

On Friday, December 20, 2013 7:06 PM, Scott C. Cote scottcc...@gmail.com wrote:
Suneel, Thank you for your help. :) Thought I was completely in the ditch. If you are interested: inline with your comments are demonstrations that I finally have it (and the commands that I used)….

YAQ (Yet another question): How do I see with the dumper the documents that belong in a given cluster? I issued the command:

mahout seqdumper -i reuters-kmeans-clusters/clusters-3-final/part-r-0

which yields data like:

Input Path: part-r-0
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.iterator.ClusterWritable
Key: 0: Value: org.apache.mahout.clustering.iterator.ClusterWritable@28b301f2
Key: 1: Value: org.apache.mahout.clustering.iterator.ClusterWritable@28b301f2
Key: 2: Value: org.apache.mahout.clustering.iterator.ClusterWritable@28b301f2
…
Key: 19: Value: org.apache.mahout.clustering.iterator.ClusterWritable@193936e1
Count: 20

I was hoping to see something that associated a centroid/cluster with its members. Given that there are 20 centroids, how do I break out the files into, say, 20 folders - one folder per centroid - so that I know their associations (I'm assuming that the clusters don't overlap)? Or - is there a sequence file generated somewhere that definitively associates the vectors with each cluster?

Here is what I do know: I know that the clusters are not given names, and it is suggested that we use the top terms of the cluster to define a name. According to the tour, I should be able to see a likelihood that a given vector is in a cluster. But

mahout seqdumper -i reuters-kmeans-clusters/clusteredPoints/part-m-0 | more

yields:

Input Path: part-m-0
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 10266: Value: 1.0: /reut2-000.sgm-0.txt = [62:0.085, 222:0.043, 291:0.084, 1411:0.083, 1421:0.087, 1451:0.085, 1456:0.092, 1457:0.092, 1462:0.135, 1512:0.070, 1543:0.104, 2962:0.037 ….

which does NOT look like the output in the tour (did I miss something again?). But I'll try to interpret the output as saying vector with key 62 has a cosine distance of .085 from key 10266 - is that right? What do I need to look at? - MiA sheds no light on this part that I have found.

NOTE: I wrote a very simple - non scalable - k-means java routine that found the clusters in a set of points (2 dimensional) and tracked which point belongs to which cluster (no overlap). Want to do the same with Mahout. Looking forward to your response to get me over this next hump …. SCott

On 12/20/13 4:30 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
Sorry Scott, I should have looked at this more closely. I apologize.
1. You are doing a seqdumper of the matrix (which is generated from the rowid job and is not the output of the rowsimilarity job). The RowId job generates an MxN matrix where M = no. of documents and N = terms associated with each document. The value of a cell in the matrix is the tf-idf weight of the term. So in the following output:

{Code}
Key: 2: Value: /reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502
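On the "which documents belong to which cluster" question: in the 0.8 clusteredPoints output the IntWritable key (10266 above) is the cluster id, the leading 1.0 is the point's classification weight, and the bracketed pairs are term-index:tf-idf entries of the document vector, not distances. So grouping documents per cluster is a single pass over the sequence file; a sketch in the spirit of the reader loop from the NewsKMeansClustering listing (path illustrative, and it assumes vectors were created with -nv so they are NamedVectors):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.classify.WeightedVectorWritable;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;

public class GroupClusteredPoints {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path("reuters-kmeans-clusters/clusteredPoints/part-m-00000"); // illustrative

    Map<Integer, List<String>> members = new HashMap<Integer, List<String>>();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    IntWritable clusterId = new IntWritable();
    WeightedVectorWritable point = new WeightedVectorWritable();
    while (reader.next(clusterId, point)) {
      Vector v = point.getVector();
      // With -nv the vector is a NamedVector whose name is the document path
      String name = (v instanceof NamedVector) ? ((NamedVector) v).getName() : v.toString();
      List<String> docs = members.get(clusterId.get());
      if (docs == null) {
        docs = new ArrayList<String>();
        members.put(clusterId.get(), docs);
      }
      docs.add(name);
    }
    reader.close();

    for (Map.Entry<Integer, List<String>> e : members.entrySet()) {
      System.out.println("cluster " + e.getKey() + " has " + e.getValue().size() + " documents");
    }
  }
}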
unexpected results in seqdump of reuters-matrix in quick tour of text analysis
All, I am a newbie Mahout user and am trying to use the "Quick tour of text analysis using the Mahout command line". Thank you to whomever contributed to that page. https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

Went all the way from beginning to end of the page with seemingly no hiccups. At the very end of the tour, I became confused because the command:

mahout seqdumper -i reuters-matrix/matrix | more

allowed me to see output (snippet):

Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.2279237043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.105152778531,4:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.2279237043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}

Reading through that snippet of data made me think that there exists a document with row id 41154 with cosine value of ~0.0658 (the last element in the snippet). The problem is that the folder /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted only has 21578 files in it. Indeed, my dictionary file (output command used shown below)

mahout seqdumper -i reuters-matrix/docIndex | tail

has a max key of

Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578

So I cannot find the document with key value 41154. What does the 41154 relate to? Obviously I have misunderstood something that I did or need to do in the tour. Can someone please shine a light on where I strayed? I have scripted every step that I took and can share them here if desired (I noticed that some of the output file names changed since the page was written, so I made adjustments). Regards, SCott

PS Thanks TD for helping me earlier
Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis
Suneel, Thank you for your help.

On 12/19/13 11:53 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:
What you are seeing is the output matrix of the RowSimilarity job. You are right, there should be only 21578 documents in the reuters corpus.
a) How many documents do you have in your docIndex? DocIndex is one of the artifacts of the RowIDJob and should have been executed prior to the RowSimilarity Job. You can run seqdumper on docIndex to see the output.

mahout seqdumper -i reuters-matrix/docIndex | tail

has a max key of

Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578

b) Also, what was the message at the end of the RowId job? It should read something like 'Wrote out matrix with 21578 rows and 19515 columns to reuters-matrix/matrix'.

Dec 18, 2013 4:01:13 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Wrote out matrix with 21578 rows and 41807 columns to reuters-matrix/matrix
Dec 18, 2013 4:01:13 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 3453 ms (Minutes: 0.05755)

On Thursday, December 19, 2013 12:14 PM, Scott C. Cote scottcc...@gmail.com wrote:
All, I am a newbie Mahout user and am trying to use the "Quick tour of text analysis using the Mahout command line". Thank you to whomever contributed to that page. https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

Went all the way from beginning to end of the page with seemingly no hiccups. At the very end of the tour, I became confused because the command:

mahout seqdumper -i reuters-matrix/matrix | more

allowed me to see output (snippet):

Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.2279237043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.105152778531,4:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.2279237043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}

Reading through that snippet of data made me think that there exists a document with row id 41154 with cosine value of ~0.0658 (the last element in the snippet). The problem is that the folder /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted only has 21578 files in it. Indeed, my dictionary file (output command used shown below)

mahout seqdumper -i reuters-matrix/docIndex | tail

has a max key of

Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578

So I cannot find the document with key value 41154. What does the 41154 relate to? Obviously I have misunderstood something that I did or need to do in the tour. Can someone please shine a light on where I strayed? I have scripted every step that I took and can share them here if desired (I noticed that some of the output file names changed since the page was written, so I made adjustments). Regards, SCott

PS Thanks TD for helping me earlier
Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis
Suneel, I'm going to do the similarity part of the tour over - my laptop went to sleep in the middle of the run of the rowsimilarity job. Maybe the job is sensitive to that …. :( Normally a server would not go to sleep, nor would it run in local mode. Sorry that I didn't think of that sooner. Will let you know my outcome. Am planning on redoing by deleting the contents and the folder titled reuters-similarity. Please let me know if that is not good enough. Thanks again. SCott

On 12/19/13 11:53 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:
What you are seeing is the output matrix of the RowSimilarity job. You are right, there should be only 21578 documents in the reuters corpus.
a) How many documents do you have in your docIndex? DocIndex is one of the artifacts of the RowIDJob and should have been executed prior to the RowSimilarity Job. You can run seqdumper on docIndex to see the output.
b) Also, what was the message at the end of the RowId job? It should read something like 'Wrote out matrix with 21578 rows and 19515 columns to reuters-matrix/matrix'.

On Thursday, December 19, 2013 12:14 PM, Scott C. Cote scottcc...@gmail.com wrote:
All, I am a newbie Mahout user and am trying to use the "Quick tour of text analysis using the Mahout command line". Thank you to whomever contributed to that page. https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

Went all the way from beginning to end of the page with seemingly no hiccups. At the very end of the tour, I became confused because the command:

mahout seqdumper -i reuters-matrix/matrix | more

allowed me to see output (snippet):

Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.2279237043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.105152778531,4:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.2279237043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}

Reading through that snippet of data made me think that there exists a document with row id 41154 with cosine value of ~0.0658 (the last element in the snippet). The problem is that the folder /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted only has 21578 files in it. Indeed, my dictionary file (output command used shown below)

mahout seqdumper -i reuters-matrix/docIndex | tail

has a max key of

Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578

So I cannot find the document with key value 41154. What does the 41154 relate to? Obviously I have misunderstood something that I did or need to do in the tour. Can someone please shine a light on where I strayed? I have scripted every step that I took and can share them here if desired (I noticed that some of the output file names changed since the page was written, so I made adjustments). Regards, SCott

PS Thanks TD for helping me earlier
Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis
I manually deleted the temp folder too (after 2 failed starts). Would it be helpful for me to upload my shell scripts that encapsulate all of the commands posted on the tour? They reflect the current state of reuters and .8 mahout. And if I did - how would I do it? Thanks, SCott

On 12/19/13 1:00 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
Yep, that's what has happened in ur case. The wiki doesn't have it, but please specify the -ow (overwrite) option while running the RowsimilarityJob. That should clean up both the output and temp folders before running the job.

On Thursday, December 19, 2013 1:50 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
Haha... that could explain it. RowSimilarityJob creates temp files during execution. If ur laptop 'sleeped' then the temp files still persist, and running the job again wouldn't overwrite the old temp files (I need to verify that). It should be good enough to run the RowSimilarity job again.

On Thursday, December 19, 2013 1:46 PM, Scott C. Cote scottcc...@gmail.com wrote:
Suneel, I'm going to do the similarity part of the tour over - my laptop went to sleep in the middle of the run of the rowsimilarity job. Maybe the job is sensitive to that …. :( Normally a server would not go to sleep, nor would it run in local mode. Sorry that I didn't think of that sooner. Will let you know my outcome. Am planning on redoing by deleting the contents and the folder titled reuters-similarity. Please let me know if that is not good enough. Thanks again. SCott

On 12/19/13 11:53 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:
What you are seeing is the output matrix of the RowSimilarity job. You are right, there should be only 21578 documents in the reuters corpus.
a) How many documents do you have in your docIndex? DocIndex is one of the artifacts of the RowIDJob and should have been executed prior to the RowSimilarity Job. You can run seqdumper on docIndex to see the output.
b) Also, what was the message at the end of the RowId job? It should read something like 'Wrote out matrix with 21578 rows and 19515 columns to reuters-matrix/matrix'.

On Thursday, December 19, 2013 12:14 PM, Scott C. Cote scottcc...@gmail.com wrote:
All, I am a newbie Mahout user and am trying to use the "Quick tour of text analysis using the Mahout command line". Thank you to whomever contributed to that page. https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

Went all the way from beginning to end of the page with seemingly no hiccups. At the very end of the tour, I became confused because the command:

mahout seqdumper -i reuters-matrix/matrix | more

allowed me to see output (snippet):

Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.2279237043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.105152778531,4:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.2279237043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}

Reading through that snippet of data made me think that there exists a document with row id 41154 with cosine value of ~0.0658 (the last element in the snippet). The problem is that the folder /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted only has 21578 files in it. Indeed, my dictionary file (output command used shown below)

mahout seqdumper -i reuters-matrix/docIndex | tail

has a max key of

Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578

So I cannot find the document with key value 41154. What does the 41154 relate to? Obviously I have misunderstood something that I did or need to do in the tour. Can someone please shine a light on where I strayed? I have scripted every step that I took and can share them here if desired (I noticed that some of the output file names changed since the page was written so I made adjustments