Re: New logo

2017-05-06 Thread Scott C. Cote
Will you be wearing “one of those t-shirts” on Monday  in Houston :)   ?
SCott
Scott C. Cote
scottcc...@gmail.com
972.672.6484



> On May 6, 2017, at 1:52 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> 
> I know where one of those t-shirts is.
> 
> 
> 
> On Sat, May 6, 2017 at 7:13 AM, Isabel Drost-Fromm <isa...@apache.org>
> wrote:
> 
>> The green logo was the very first design iteration before, iirc, Robin came
>> up with the yellow one. There should be something like five T-shirts worldwide
>> with the old logo, printed in 2009.
>> 
>> 
>> Am 1. Mai 2017 20:41:43 MESZ schrieb Trevor Grant <
>> trevor.d.gr...@gmail.com>:
>>> Thanks Scott,
>>> 
>>> You are correct - in fact we're going even further now: you can do
>>> native optimization regardless of the architecture with native solvers.
>>> 
>>> Do you or anyone more familiar with the history of the website know
>>> anything about the origins/uses of this:
>>> https://mahout.apache.org/images/Mahout-logo-245x300.png
>>> It seems to be a green mahout logo.
>>> 
>>> Also Scott, or anyone lurking who may be able to help.  As part of the
>>> website reboot I've included a "history" page and would really appreciate
>>> some help capturing that from first-person sources if possible. I've put in
>>> some headers but those are only directional:
>>> 
>>> https://github.com/rawkintrevo/mahout/blob/website/website/front/
>> community/history.md
>>> 
>>> 
>>> 
>>> Trevor Grant
>>> Data Scientist
>>> https://github.com/rawkintrevo
>>> http://stackexchange.com/users/3002022/rawkintrevo
>>> http://trevorgrant.org
>>> 
>>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>> 
>>> 
>>> On Mon, May 1, 2017 at 11:18 AM, scott cote <scottcc...@gmail.com>
>>> wrote:
>>> 
>>>> Trevor et al:
>>>> 
>>>> Some ideas to spur you on (and related points):
>>>> 
>>>> Mahout is no longer a grab bag of algorithms and routines, but a math
>>>> language, right?  You don’t care about the under-the-covers
>>> implementation.
>>>> Today it's Spark, with alternative implementations in Flink, etc. ….
>>>> 
>>>> Don’t know if that is the long term goal still  - haven’t kept up -
>>> but it
>>>> seems like you are insulating yourself from the underlying
>>> technology.
>>>> 
>>>> Math is a universal language.  Right?
>>>> 
>>>> Tower of Babel is coming to mind ….
>>>> 
>>>> SCott
>>>> 
>>>>> On Apr 27, 2017, at 10:27 PM, Trevor Grant
>>> <trevor.d.gr...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> It also bugs me when I can't suggest any alternatives, yet don't
>>> like the
>>>>> ones in front of me...
>>>>> 
>>>>> I became aware of a symbol a week or so ago, and it keeps coming
>>> back to
>>>>> me.
>>>>> 
>>>>> The Enso.
>>>>> https://en.wikipedia.org/wiki/Ens%C5%8D
>>>>> 
>>>>> Things I like about it:
>>>>> (all from wikipedia, since the only thing I knew about this symbol
>>> prior
>>>> is
>>>>> that someone I met had a tattoo of it).
>>>>> It represents (among a few other things) enlightenment.
>>>>> ^^ This resonated with the 'alternate definition of mahout' from
>>> Hebrew-
>>>>> which may be something akin to essence or truth.
>>>>> 
>>>>> It is a circle- which plays to the Samsara theme.
>>>>> 
>>>>> It is very expressive, a simple one or two brush stroke circle
>>> which
>>>>> symbolizes several large concepts and things about the creator,
>>>> expressive
>>>>> like our DSL (I feel gross comparing such a symbol to a Scala DSL,
>>> but
>>>> I'm
>>>>> spit balling here, please forgive me- I am not so expressive).
>>>>> 
>>>>> "Once the *ensō* is drawn, one does not change it. It evidences the
>>>>> character of its creator and the context of its creation in a
>>> brief,
>>>>> contiguous period of time." Which reminds me of the DRMs
>>>>> 
>>>>> In closed form it 

Re: streaming kmeans vs incremental canopy/solr/kmeans

2015-01-22 Thread Scott C. Cote
Mahout Gurus,

I’m back at the clustering text game (after a hiatus of a year).   Not for
recommendation purposes - thanks for the book and the idea of solr for
recommendation ….  that’s cool (Found Ted at Data Days in Austin - nice to
see you again).

My question:
How do I apply streaming cluster technology to text when I don’t have
accurate vectors?  

Let me explain exactly what I mean.
I have a series of sentences coming at me over time.  I may or may not
have the word in a “dictionary” when I receive it.  I need to group the
similar sentences together.  So I want to cluster the sentences.
The streaming clustering lib listed in Mahout assumes that the text has already
been vectorized.  So how do I vectorize a sentence that has words that are
not in the dictionary?

Do I save the elements of the TF-IDF prior calculations and incrementally
update? 

...

Ugh - I think I just figured out my source of confusion.

Please confirm my understanding

Streaming does NOT imply an unbounded set of data ….

I will have a set of sentences that arrives in some period of time T.
Those that arrive in time T will be treated as a “batch” and vectorized in
the usual fashion (TF-IDF).
Then I feed the batched vector sets into the shiny new streaming methods
(instead of using the tired old canopy combined with straight k-means) to
arrive at my groupings.

- No time or cpu burned up discovering canopies.
- No intermediate disk consumed pushing canopy output into k-means.

Nice groups.

So all I have to do is keep updating the tfidf as new sentences arrive and
re-“ball” the sentences with the fast shiny streaming cluster technology.

My big hurdle is coming up with an efficient way to update tfidf (ideas
are welcome).
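
As a rough sketch of what "keep updating the tfidf" could look like (plain
Java, not a Mahout API; the IDF formula is just the textbook log(N/df), and
the class and method names here are made up for illustration):

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Running document-frequency table that new sentences fold into; IDF is
// recomputed from the current counts instead of re-reading the whole corpus.
public class IncrementalTfIdf {
  private final Map<String, Integer> docFreq = new HashMap<String, Integer>();  // term -> #docs containing it
  private long numDocs = 0;

  // Fold one new (already tokenized) sentence into the statistics.
  public void addDocument(List<String> tokens) {
    numDocs++;
    for (String term : new HashSet<String>(tokens)) {
      Integer df = docFreq.get(term);
      docFreq.put(term, df == null ? 1 : df + 1);
    }
  }

  // TF-IDF weight for a term in one document, using the statistics seen so far.
  public double weight(String term, int termFreqInDoc) {
    Integer df = docFreq.get(term);
    if (df == null || df == 0) {
      return 0.0;  // word not in the dictionary yet; it gains weight once counted
    }
    return termFreqInDoc * Math.log((double) numDocs / df);
  }
}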


On a separate note - over the last year - I have been using markdown and
developing my documentation skills.  Held off on writing docs on canopy as
I saw that it is going to be deprecated (Suneel).  Does my use case sound
like a good example for streaming?  If yes - I’ll cook up my specifics
into a postable example.   Also - just checking - streaming isn’t going to
be deprecated is it?


I know that I crammed a whole bunch of questions into this letter - so I
will truly appreciate ya’ll being patient and wading through.

Regards,

SCott


On 2/14/14, 12:55 PM, Ted Dunning ted.dunn...@gmail.com wrote:

In-memory ball k-means should solve your problem pretty well right now.
 In-memory streaming k-means followed by ball k-means will take you to
well
beyond your scaled case.

At 1 million documents, you should be able to do your clustering in a few
minutes, depending on whether some of the sparse matrix performance issues
got fixed in the clustering code (I think they did).
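
As a toy sketch of that two-stage idea (plain Java, NOT Mahout's
implementation; the distance-cutoff handling and the probabilistic rule are
simplified): one streaming pass collapses the data into a small set of
weighted centroids, and an ordinary in-memory k-means ("ball" k-means) pass
can then refine that sketch.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class StreamingSketch {
  static class Centroid {
    double[] mean;
    double weight;
    Centroid(double[] p) { mean = p.clone(); weight = 1; }
    void absorb(double[] p) {
      weight++;
      for (int i = 0; i < mean.length; i++) mean[i] += (p[i] - mean[i]) / weight;
    }
  }

  private final List<Centroid> centroids = new ArrayList<Centroid>();
  private final int maxCentroids;
  private double cutoff;                  // distance below which a point is usually merged
  private final Random rnd = new Random(42);

  StreamingSketch(int maxCentroids, double initialCutoff) {
    this.maxCentroids = maxCentroids;
    this.cutoff = initialCutoff;
  }

  void add(double[] point) {
    Centroid nearest = null;
    double best = Double.MAX_VALUE;
    for (Centroid c : centroids) {
      double d = dist(c.mean, point);
      if (d < best) { best = d; nearest = c; }
    }
    // Far-away points (probabilistically) seed new centroids; close ones merge.
    if (nearest == null || rnd.nextDouble() < best / cutoff) {
      centroids.add(new Centroid(point));
    } else {
      nearest.absorb(point);
    }
    // If the sketch grows too large, loosen the cutoff so future points merge more often.
    if (centroids.size() > maxCentroids) cutoff *= 1.5;
  }

  static double dist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(s);
  }

  List<Centroid> sketch() { return centroids; }   // feed these weighted centroids to ball k-means
}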




On Fri, Feb 14, 2014 at 10:50 AM, Scott C. Cote
scottcc...@gmail.comwrote:

 Right now - I'm dealing with only 40,000 documents, but we will
eventually
 grow more than 10x (put on the manager hat and say 1 mil docs) where a
doc
 is usually no longer than 20 or 30 words.

 SCott

 On 2/14/14 12:46 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Scott,
 
 How much data do you have?
 
 How much do you plan to have?
 
 
 
 On Fri, Feb 14, 2014 at 8:04 AM, Scott C. Cote scottcc...@gmail.com
 wrote:
 
  Hello All,
 
  I have two questions (Q1, Q2).
 
  Q1: Am digging in to Text Analysis and am wrestling with competing
 analyzed
  data maintenance strategies.
 
  NOTE: my text comes from a very narrowly focused source.
 
  - Am currently crunching the data (batch) using the following scheme:
  1. Load source text as rows in a mysql database.
  2. Create named TFIDF vectors using a custom analyzer from source
text
  (-stopwords, lowercase, std filter, ….)
  3. Perform Canopy Cluster and then Kmeans Cluster using an enhanced
 cosine
  metric (derived from a custom metric found in MiA)
  4. Load references of Clusters into SOLR (core1) - cluster id, top
terms
  along with full cluster data into Mongo (a cluster is a doc)
  5. Then load source text into SOLR(core2) using same custom analyzer
 with
  appropriate boost along with the reference cluster id
  NOTE: in all cases, the id of the source text is preserved throughout
 the
  flow in the vector naming process, etc.
 
  So now I have a mysql table,  two SOLR cores, and a Mongo Document
  Collection (all tied together with text id as the common name)
 
  - Now when  a new document enters the system after batch has been
  performed, I use core2 to test the top  SOLR matches (custom analyzer
  normalizes the new doc) to find best cluster within a tolerance.  If
a
  cluster is found, then I place the text in that cluster - if not,
then I
  start a new group (my word for a cluster not generated via kmeans).
 Either
  way, the doc makes its way into both (core1 and core2). I keep track
of
 the
  number of group creations/document placements so that if a threshold
is
  crossed, then I can re-batch the data.
 
  MiA (I think ch. 11) suggests that a user could run the canopy
 cluster
  routine to assign new

Re: canopy creating canopies with the same points

2014-03-24 Thread Scott C. Cote
Reinis,

The documentation has several Jiras open - one with my name on it.

Fortunately, the canopy cluster technology has a good page (as well as
some outdated pages).

Please see this link for your question:

http://mahout.apache.org/users/clustering/canopy-clustering.html


as I believe that it is well written.

To directly answer your question:

Remember that T1 > T2: points within T2 are added to the cluster and
removed from the input set, while points within T1 are added to the
cluster but NOT removed from the “input set” (and therefore may be added
to another cluster later in the process).
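
A toy illustration of that rule in plain Java (this is not Mahout's
CanopyClusterer, just the T1/T2 bookkeeping spelled out):

import java.util.ArrayList;
import java.util.List;

public class CanopySketch {
  public static List<List<double[]>> canopies(List<double[]> input, double t1, double t2) {
    List<double[]> remaining = new ArrayList<double[]>(input);
    List<List<double[]>> canopies = new ArrayList<List<double[]>>();
    while (!remaining.isEmpty()) {
      double[] centre = remaining.remove(0);          // pick an arbitrary remaining point
      List<double[]> canopy = new ArrayList<double[]>();
      canopy.add(centre);
      List<double[]> stillRemaining = new ArrayList<double[]>();
      for (double[] p : remaining) {
        double d = dist(centre, p);
        if (d < t1) canopy.add(p);                    // within T1: member of this canopy...
        if (d >= t2) stillRemaining.add(p);           // ...but only consumed if also within T2
      }
      remaining = stillRemaining;
      canopies.add(canopy);
    }
    return canopies;
  }

  private static double dist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(s);
  }
}

So a point that falls between T2 and T1 of one centre stays in the input set
and can show up in later canopies as well, which is what you are seeing.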

SCott

On 3/24/14, 6:44 AM, Reinis Vicups mah...@orbit-x.de wrote:

Hi,

apparently I am misunderstanding the way canopy works. I thought that
once a datapoint is added to a canopy, it is removed from the list of
to-be-clustered points, thus one point is assigned to one canopy.

In the example below this is not the case:

:C-28{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:238.981, 468:40.572,
556:10.985, 889:8.678, 1101:114
:C-29{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:217.804, 468:33.560,
556:10.985, 889:8.678, 1101:113
:C-30{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:215.841, 468:37.231,
556:10.985, 889:8.678, 1101:113
:C-31{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:206.121, 468:32.243,
556:10.985, 889:8.678, 1101:112

So is the correct assumption that only the points within T2 get assigned
to exactly one canopy, or can even points within T2 get assigned to more
than one canopy?

greets
reinis




Re: canopy creating canopies with the same points

2014-03-24 Thread Scott C. Cote
Reinis,

I don’t know - perhaps one of the other denizens of Users has an answer?


SCott

On 3/24/14, 10:13 AM, Reinis Vicups mah...@orbit-x.de wrote:

Scott,

thx a bunch for the pointer, very useful.

One thing I would like to clarify, though. I forgot to mention that I ran
canopy with T1 == T2 (this was suggested in some post as a fast way to
find a T2 that gives a particular number of canopies). You
mention jiras you opened (gonna check them right after) - could it be
one of them is for this special T1 == T2 case?

br
reinis

On 24.03.2014 15:28, Scott C. Cote wrote:
 Reinis,

 The documentation has several Jiras open - one with my name on it.

 Fortunately, the canopy cluster technology has a good page (as well as
 some outdated pages).

 Please see this link for your question:

  http://mahout.apache.org/users/clustering/canopy-clustering.html


 as I believe that it is well written.

 To directly answer your question:

 Remember that T1 > T2: points within T2 are added to the cluster and
 removed from the input set, while points within T1 are added to the
 cluster but NOT removed from the “input set” (and therefore may be added
 to another cluster later in the process).

 SCott

 On 3/24/14, 6:44 AM, Reinis Vicups mah...@orbit-x.de wrote:

 Hi,

 apparently I am misunderstanding the way canopy works. I thought that
 once a datapoint is added to a canopy, it is removed from the list of
 to-be-clustered points, thus one point is assigned to one canopy.

 In the example below this is not the case:

 :C-28{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:238.981, 468:40.572,
 556:10.985, 889:8.678, 1101:114
 :C-29{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:217.804, 468:33.560,
 556:10.985, 889:8.678, 1101:113
 :C-30{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:215.841, 468:37.231,
 556:10.985, 889:8.678, 1101:113
 :C-31{n=1 c=[70:11.686, 72:7.170, 236:8.182, 396:206.121, 468:32.243,
 556:10.985, 889:8.678, 1101:112

 So is the correct assumption that only the points within T2 get assigned
 to exactly one canopy, or can even points within T2 get assigned to more
 than one canopy?

 greets
 reinis





Re: Website, urgent help needed

2014-03-13 Thread Scott C. Cote
I have created issue https://issues.apache.org/jira/browse/MAHOUT-1461

Will upload shell scripts and suggested replacement text later tonight ….

SCott

On 3/13/14, 10:43 AM, Sebastian Schelter s...@apache.org wrote:

Hi Scott,

Create a jira ticket and attach your scripts and a text version of the
page there.

Best,
Sebastian


On 03/12/2014 03:27 PM, Scott C. Cote wrote:
 I took the tour of the text analysis and pushed through despite the
 problems on the page.  Committers helped me over the hump where others
 might have just given up (to your point).
 When I did it, I made shell scripts so that my steps would be repeatable
 with an anticipation of updating the page.

 Unfortunately, I gave up on trying to figure out how to update the page
 (there were links indicating that I could do it), and I didn’t want to
 appear to be stupid asking how to update the documentation (my bad - not
 anyone else).  Now I know that it was not possible unless I was a
committer.

 Who should I send my scripts to, or how should I proceed with a current
 form of the page?

 SCott

 On 3/12/14, 5:02 AM, Sebastian Schelter s...@apache.org wrote:

 Hi Pavan,

 Awesome that you're willing to help. The documentation are the pages
 listed under Clustering in the navigation bar under mahout.apache.org

 If you start working on one of the pages listed there (e.g. the k-Means
 doc), please create a jira ticket in our issue tracker with a title
along
 the lines of Cleaning up the documentation for k-Means on the
website.

 Put a list of errors and corrections into the jira and I (or some other
 committer) will make sure to fix the website.

 Thanks,
 Sebastian


 On 03/12/2014 08:48 AM, Pavan Kumar N wrote:
 i ll help with clustering algorithms documentation. do send me old
 documentation and i will check and remove errors.  or better let me
know
 how to proceed.

 Pavan
 On Mar 12, 2014 12:35 PM, Sebastian Schelter s...@apache.org wrote:

 Hi,

 As you've probably noticed, I've put in a lot of effort over the last
 days
 to kickstart cleaning up our website. I've thrown out a lot of stuff
 and
 have been startled by the amount of outdated and incorrect information
 on
 our website, as well as links pointing to nowhere.

 I think our lack of documentation makes it superhard to use Mahout
for
 new
 people. A crucial next step is to clean up the documentation on
 classification and clustering. I cannot do this alone, because I
don't
 have
 the time and I'm not so familiar with the background of the
algorithms.

 I need volunteers to go through all the pages under Classification
 and
 Clustering on the website. For the algorithms, the content and
 claims of
 the articles need to be checked, for the examples we need to make
sure
 that
 everything still works as described. It would also be great to move
 articles from personal blogs to our website.

 Imagine that some developer wants to try out Mahout and takes one
hour
 for
 that in the evening. She will go to our website, download Mahout,
read
 the
 description of an algorithm and try to run an example. In the current
 state
 of the documentation, I'm afraid that most people will walk away
 frustrated, because the website does not help them as it should.

 Best,
 Sebastian

 PS: I will make my standpoint on whether Mahout should do a 1.0
release
 depend on whether we manage to clean up and maintain our
documentation.










Re: Website, urgent help needed

2014-03-12 Thread Scott C. Cote
I took the tour of the text analysis and pushed through despite the
problems on the page.  Committers helped me over the hump where others
might have just given up (to your point).
When I did it, I made shell scripts so that my steps would be repeatable
with an anticipation of updating the page.

Unfortunately, I gave up on trying to figure out how to update the page
(there were links indicating that I could do it), and I didn’t want to
appear to be stupid asking how to update the documentation (my bad - not
anyone else).  Now I know that it was not possible unless I was a committer.

Who should I send my scripts to, or how should I proceed with a current
form of the page?

SCott

On 3/12/14, 5:02 AM, Sebastian Schelter s...@apache.org wrote:

Hi Pavan,

Awesome that you're willing to help. The documentation are the pages
listed under Clustering in the navigation bar under mahout.apache.org

If you start working on one of the pages listed there (e.g. the k-Means
doc), please create a jira ticket in our issue tracker with a title along
the lines of Cleaning up the documentation for k-Means on the website.

Put a list of errors and corrections into the jira and I (or some other
committer) will make sure to fix the website.

Thanks,
Sebastian


On 03/12/2014 08:48 AM, Pavan Kumar N wrote:
 i ll help with clustering algorithms documentation. do send me old
 documentation and i will check and remove errors.  or better let me know
 how to proceed.

 Pavan
 On Mar 12, 2014 12:35 PM, Sebastian Schelter s...@apache.org wrote:

 Hi,

 As you've probably noticed, I've put in a lot of effort over the last
days
 to kickstart cleaning up our website. I've thrown out a lot of stuff
and
 have been startled by the amount of outdated and incorrect information
on
 our website, as well as links pointing to nowhere.

 I think our lack of documentation makes it superhard to use Mahout for
new
 people. A crucial next step is to clean up the documentation on
 classification and clustering. I cannot do this alone, because I don't
have
 the time and I'm not so familiar with the background of the algorithms.

 I need volunteers to go through all the pages under Classification
and
 Clustering on the website. For the algorithms, the content and
claims of
 the articles need to be checked, for the examples we need to make sure
that
 everything still works as described. It would also be great to move
 articles from personal blogs to our website.

 Imagine that some developer wants to try out Mahout and takes one hour
for
 that in the evening. She will go to our website, download Mahout, read
the
 description of an algorithm and try to run an example. In the current
state
 of the documentation, I'm afraid that most people will walk away
 frustrated, because the website does not help them as it should.

 Best,
 Sebastian

 PS: I will make my standpoint on whether Mahout should do a 1.0 release
 depend on whether we manage to clean up and maintain our documentation.







Re: Website, urgent help needed

2014-03-12 Thread Scott C. Cote
I’ll make it work.
Don’t know markdown (assume some reduced mark”up” language) - but I’ll
figure it out.  I will assume that I can check with my consulting buddy
“Google” and find it. :)

Thank you for your contributions - glad that I can give “something” back.
I’ll start off by sending the doc to one of the committers, and then if
you guys like my work, then we can proceed from there ….

SCott

On 3/12/14, 9:38 AM, Sebastian Schelter s...@apache.org wrote:

Hi Scott,

The cms behind the website uses markdown. So ideally you would attach a
textfile with markdown formattings to a jira issue and a committer will
put that into the website.

Does that work for you?

PS: There are a lot of online markdown editors out there.

On 03/12/2014 03:27 PM, Scott C. Cote wrote:
 I took the tour of the text analysis and pushed through despite the
 problems on the page.  Committers helped me over the hump where others
 might have just given up (to your point).
 When I did it, I made shell scripts so that my steps would be repeatable
 with an anticipation of updating the page.

 Unfortunately, I gave up on trying to figure out how to update the page
 (there were links indicating that I could do it), and I didn’t want to
 appear to be stupid asking how to update the documentation (my bad - not
 anyone else).  Now I know that it was not possible unless I was a
committer.

 Who should I send my scripts to, or how should I proceed with a current
 form of the page?

 SCott

 On 3/12/14, 5:02 AM, Sebastian Schelter s...@apache.org wrote:

 Hi Pavan,

 Awesome that you're willing to help. The documentation are the pages
 listed under Clustering in the navigation bar under mahout.apache.org

 If you start working on one of the pages listed there (e.g. the k-Means
 doc), please create a jira ticket in our issue tracker with a title
along
 the lines of Cleaning up the documentation for k-Means on the
website.

 Put a list of errors and corrections into the jira and I (or some other
 committer) will make sure to fix the website.

 Thanks,
 Sebastian


 On 03/12/2014 08:48 AM, Pavan Kumar N wrote:
 i ll help with clustering algorithms documentation. do send me old
 documentation and i will check and remove errors.  or better let me
know
 how to proceed.

 Pavan
 On Mar 12, 2014 12:35 PM, Sebastian Schelter s...@apache.org wrote:

 Hi,

 As you've probably noticed, I've put in a lot of effort over the last
 days
 to kickstart cleaning up our website. I've thrown out a lot of stuff
 and
 have been startled by the amount of outdated and incorrect information
 on
 our website, as well as links pointing to nowhere.

 I think our lack of documentation makes it superhard to use Mahout
for
 new
 people. A crucial next step is to clean up the documentation on
 classification and clustering. I cannot do this alone, because I
don't
 have
 the time and I'm not so familiar with the background of the
algorithms.

 I need volunteers to go through all the pages under Classification
 and
 Clustering on the website. For the algorithms, the content and
 claims of
 the articles need to be checked, for the examples we need to make
sure
 that
 everything still works as described. It would also be great to move
 articles from personal blogs to our website.

 Imagine that some developer wants to try out Mahout and takes one
hour
 for
 that in the evening. She will go to our website, download Mahout,
read
 the
 description of an algorithm and try to run an example. In the current
 state
 of the documentation, I'm afraid that most people will walk away
 frustrated, because the website does not help them as it should.

 Best,
 Sebastian

 PS: I will make my standpoint on whether Mahout should do a 1.0
release
 depend on whether we manage to clean up and maintain our
documentation.










Re: Website, urgent help needed

2014-03-12 Thread Scott C. Cote
ok

On 3/12/14, 9:58 AM, Andrew Musselman andrew.mussel...@gmail.com wrote:

Thanks Scott; please just attach your work to an issue in the Jira
system; if there's not one already you could file a new issue.

 On Mar 12, 2014, at 7:44 AM, Scott C. Cote scottcc...@gmail.com
wrote:
 
 I’ll make it work.
 Don’t know markdown (assume some reduced mark”up” language) - but I’ll
 figure it out.  I will assume that I can check with my consulting buddy
 “Google” and find it. :)
 
 Thank you for your contributions - glad that I can give “something”
back.
 I’ll start off by sending the doc to one of the committers, and then if
 you guys like my work, then we can proceed from there ….
 
 SCott
 
 On 3/12/14, 9:38 AM, Sebastian Schelter s...@apache.org wrote:
 
 Hi Scott,
 
 The cms behind the website uses markdown. So ideally you would attach a
 textfile with markdown formattings to a jira issue and a committer will
 put that into the website.
 
 Does that work for you?
 
 PS: There are a lot of online markdown editors out there.
 
 On 03/12/2014 03:27 PM, Scott C. Cote wrote:
 I took the tour of the text analysis and pushed through despite the
 problems on the page.  Committers helped me over the hump where others
 might have just given up (to your point).
 When I did it, I made shell scripts so that my steps would be
repeatable
 with an anticipation of updating the page.
 
 Unfortunately, I gave up on trying to figure out how to update the page
 (there were links indicating that I could do it), and I didn’t want to
 appear to be stupid asking how to update the documentation (my bad -
not
 anyone else).  Now I know that it was not possible unless I was a
 committer.
 
 Who should I send my scripts to, or how should I proceed with a
current
 form of the page?
 
 SCott
 
 On 3/12/14, 5:02 AM, Sebastian Schelter s...@apache.org wrote:
 
 Hi Pavan,
 
 Awesome that you're willing to help. The documentation are the pages
 listed under Clustering in the navigation bar under
mahout.apache.org
 
 If you start working on one of the pages listed there (e.g. the
k-Means
 doc), please create a jira ticket in our issue tracker with a title
 along
 the lines of Cleaning up the documentation for k-Means on the
 website.
 
 Put a list of errors and corrections into the jira and I (or some
other
 committer) will make sure to fix the website.
 
 Thanks,
 Sebastian
 
 
 On 03/12/2014 08:48 AM, Pavan Kumar N wrote:
 i ll help with clustering algorithms documentation. do send me old
 documentation and i will check and remove errors.  or better let me
 know
 how to proceed.
 
 Pavan
 On Mar 12, 2014 12:35 PM, Sebastian Schelter s...@apache.org
wrote:
 
 Hi,
 
 As you've probably noticed, I've put in a lot of effort over the
last
 days
 to kickstart cleaning up our website. I've thrown out a lot of
stuff
 and
 have been startled by the amount of outdated and incorrect
information
 on
 our website, as well as links pointing to nowhere.
 
 I think our lack of documentation makes it superhard to use Mahout
 for
 new
 people. A crucial next step is to clean up the documentation on
 classification and clustering. I cannot do this alone, because I
 don't
 have
 the time and I'm not so familiar with the background of the
 algorithms.
 
 I need volunteers to go through all the pages under
Classification
 and
 Clustering on the website. For the algorithms, the content and
 claims of
 the articles need to be checked, for the examples we need to make
 sure
 that
 everything still works as described. It would also be great to move
 articles from personal blogs to our website.
 
 Imagine that some developer wants to try out Mahout and takes one
 hour
 for
 that in the evening. She will go to our website, download Mahout,
 read
 the
 description of an algorithm and try to run an example. In the
current
 state
 of the documentation, I'm afraid that most people will walk away
 frustrated, because the website does not help them as it should.
 
 Best,
 Sebastian
 
 PS: I will make my standpoint on whether Mahout should do a 1.0
 release
 depend on whether we manage to clean up and maintain our
 documentation.
 
 




Re: Welcome Andrew Musselman as new comitter

2014-03-07 Thread Scott C. Cote
I personally am looking forward to the “advice” from the newest
“recommended” committer to Hadoop.

Congratulations to the Mahout team for increasing and growing :)

Now back to my using ….  (and hopefully creating something meaningful for
you guys)


Scott

PS:  am bootstrapping my Machine Learning knowledge by taking the coursera
course offered by Andrew Ng - to correct my shaky knowledge of classifiers.
Anyone else on this list taking or having taken this course?  (obviously -
committers are probably not, but ….)


On 3/7/14, 11:36 AM, Andrew Musselman andrew.mussel...@gmail.com wrote:

Thank you for the welcome!  Looking forward to it.

I have a math background and got started with recommenders by building the
first album recommender for Rhapsody ( http://rhapsody.com ) while I was
doing web development and web services work for the service.  Since then I
learned to love/hate Pig and Hadoop for a living, and now I do data
engineering and analytics at Accenture.

We've used Mahout on a few production projects, and we're looking forward
to more.

See you on the lists!

Best
Andrew


On Fri, Mar 7, 2014 at 9:12 AM, Sebastian Schelter s...@apache.org wrote:

 Hi,

 this is to announce that the Project Management Committee (PMC) for
Apache
 Mahout has asked Andrew Musselman to become committer and we are
pleased to
 announce that he has accepted.

 Being a committer enables easier contribution to the project since in
 addition to posting patches on JIRA it also gives write access to the
code
 repository. That also means that now we have yet another person who can
 commit patches submitted by others to our repo *wink*

 Andrew, we look forward to working with you in the future. Welcome! It
 would be great if you could introduce yourself with a few words :)

 Sebastian





Re: Rework our website

2014-03-06 Thread Scott C. Cote
Ok - I expected (and am actually pleased) that it's not a free-for-all.

I’ll see what has already been updated in this latest flurry of updates
and see what I can contribute.  Forwarded to you.

Thanks,

SCott

On 3/5/14, 4:43 PM, Sebastian Schelter s...@apache.org wrote:

At the moment, only committers can change the website unfortunately. If
you have a text to add, I'm happy to work it in and add your name to our
contributors list in the CHANGELOG.

Best,
Sebastian


On 03/05/2014 04:58 PM, Scott C. Cote wrote:
 I had recently taken the text tour of mahout, but I couldn't decipher a
 way to contribute updates to the tour (some of the file names have
 changed, etc).

 How would I start?   (this was part of my offer to help with the
 documentation of Mahout).

 SCott

 On 3/5/14 9:47 AM, Pat Ferrel p...@occamsmachete.com wrote:

 What no centered text??

 ;-)

 Love either.

 BTW users are no longer able to contribute content to the wiki. Most
CMSs
 have a way to allow input that is moderated. Might this make getting
 documentation help easier? Allow anyone to contribute but committers
can
 filter out the bad - sort of like submitting patches.

 On Mar 5, 2014, at 4:11 AM, Sebastian Schelter s...@apache.org wrote:

 Hi everyone,

 In our latest discussion, I argued that the lack (and errors) of
 documentation on our website is one of the main pain points of Mahout
 atm. To be honest, I'm also not very happy with the design, especially
 fonts and spacing make it super hard to read long articles. This also
 prevents me from wanting to add articles and documentation.

 I think we should have a beautiful website, where it is fun to add new
 stuff.

 My design skills are pretty limited, but fortunately my brother is an
art
 director! I asked him to make our website a bit more beautiful without
 changing too much of the structure, so that a redesign wouldn't take too
 long.

 I really like the results and would volunteer to dig out my CSS skills
 and do the redesign, if people agree.

 Here are his drafts, I like the second one best:

 https://people.apache.org/~ssc/mahout/mahout.jpg
 https://people.apache.org/~ssc/mahout/mahout2.jpg

 Let me know what you think!

 Best,
 Sebastian








Re: Rework our website

2014-03-05 Thread Scott C. Cote
I had recently taken the text tour of mahout, but I couldn't decipher a
way to contribute updates to the tour (some of the file names have
changed, etc).

How would I start?   (this was part of my offer to help with the
documentation of Mahout).

SCott

On 3/5/14 9:47 AM, Pat Ferrel p...@occamsmachete.com wrote:

What no centered text??

;-)

Love either.

BTW users are no longer able to contribute content to the wiki. Most CMSs
have a way to allow input that is moderated. Might this make getting
documentation help easier? Allow anyone to contribute but committers can
filter out the bad - sort of like submitting patches.

On Mar 5, 2014, at 4:11 AM, Sebastian Schelter s...@apache.org wrote:

Hi everyone,

In our latest discussion, I argued that the lack (and errors) of
documentation on our website is one of the main pain points of Mahout
atm. To be honest, I'm also not very happy with the design, especially
fonts and spacing make it super hard to read long articles. This also
prevents me from wanting to add articles and documentation.

I think we should have a beautiful website, where it is fun to add new
stuff.

My design skills are pretty limited, but fortunately my brother is an art
director! I asked him to make our website a bit more beautiful without
changing too much of the structure, so that a redesign wouldn't take too
long.

I really like the results and would volunteer to dig out my CSS skills
and do the redesign, if people agree.

Here are his drafts, I like the second one best:

https://people.apache.org/~ssc/mahout/mahout.jpg
https://people.apache.org/~ssc/mahout/mahout2.jpg

Let me know what you think!

Best,
Sebastian





streaming kmeans vs incremental canopy/solr/kmeans

2014-02-14 Thread Scott C. Cote
Hello All,

I have two questions (Q1, Q2).

Q1: Am digging in to Text Analysis and am wrestling with competing analyzed
data maintenance strategies.

NOTE: my text comes from a very narrowly focused source.

- Am currently crunching the data (batch) using the following scheme:
1. Load source text as rows in a mysql database.
2. Create named TFIDF vectors using a custom analyzer from source text
(-stopwords, lowercase, std filter, ….)
3. Perform Canopy Cluster and then Kmeans Cluster using an enhanced cosine
metric (derived from a custom metric found in MiA)
4. Load references of Clusters into SOLR (core1) - cluster id, top terms
along with full cluster data into Mongo (a cluster is a doc)
5. Then load source text into SOLR(core2) using same custom analyzer with
appropriate boost along with the reference cluster id
NOTE: in all cases, the id of the source text is preserved throughout the
flow in the vector naming process, etc.
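
A rough sketch of step 2's naming trick in Java: the text id rides along
inside a NamedVector, so it survives vectorization and clustering. The path,
dimensionality, and weights below are made-up placeholders, not the actual
pipeline.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteNamedVector {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("tfidf-vectors-example/part-r-00000");   // hypothetical output

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    try {
      // Pretend this row came out of the mysql table: id "42", two weighted terms.
      String textId = "42";
      Vector v = new RandomAccessSparseVector(10000);   // dictionary size
      v.set(17, 2.31);     // tf-idf weight of one term
      v.set(9321, 0.87);   // tf-idf weight of another term

      // The NamedVector wrapper is what carries the text id through the flow.
      writer.append(new Text(textId), new VectorWritable(new NamedVector(v, textId)));
    } finally {
      writer.close();
    }
  }
}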

So now I have a mysql table,  two SOLR cores, and a Mongo Document
Collection (all tied together with text id as the common name)

- Now when  a new document enters the system after batch has been
performed, I use core2 to test the top  SOLR matches (custom analyzer
normalizes the new doc) to find best cluster within a tolerance.  If a
cluster is found, then I place the text in that cluster - if not, then I
start a new group (my word for a cluster not generated via kmeans).  Either
way, the doc makes its way into both (core1 and core2). I keep track of the
number of group creations/document placements so that if a threshold is
crossed, then I can re-batch the data.

MiA (I think ch. 11) suggests that a user could run the canopy cluster
routine to assign new entries to the clusters (instead of what I am doing).
Does he mean to regenerate a new dictionary, frequencies, etc for the corpus
for every inbound document?  My observations have been that this has been a
very speedy process, but I'm hoping that I'm just too much of a novice and
haven't thought of a way to simply update the dictionary/frequencies.  (this
process also calls for the eventual rebatching of the clusters).

While I was very early in my "implement what I have read" process, Suneel
and Ted recommended that I examine the Streaming Kmeans process.  Would that
process sidestep much of what I'm doing?

Q2: I need to really understand the lexicon of my corpus.  How do I see the
list of terms that have been omitted due either to being in too many
documents or are not in enough documents for consideration?

Please know that I know that I can look at the dictionary to see what terms
are covered.  And since my custom analyzer is using the
StandardAnalyzer.stop words, those are obvious also.  If there isn't an
option to emit the  omitted words, where would be the natural place to
capture that data and save it into yet another data store (Sequence
file,etc)?
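
For Q2, one low-tech option is to diff the analyzer's token stream against the
surviving dictionary. A sketch, assuming the tokenized documents are stored as
Text -> StringTuple pairs and the dictionary as Text -> IntWritable pairs
(worth double-checking against your own seqdumper output):

import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.StringTuple;

public class OmittedTerms {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // 1. Terms that survived vectorization (dictionary.file-0: term -> index).
    Set<String> kept = new HashSet<String>();
    SequenceFile.Reader dict = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text term = new Text();
    IntWritable index = new IntWritable();
    while (dict.next(term, index)) {
      kept.add(term.toString());
    }
    dict.close();

    // 2. Every token the analyzer emitted (tokenized-documents: docId -> tokens).
    Set<String> omitted = new TreeSet<String>();
    SequenceFile.Reader docs = new SequenceFile.Reader(fs, new Path(args[1]), conf);
    Text docId = new Text();
    StringTuple tokens = new StringTuple();
    while (docs.next(docId, tokens)) {
      for (String token : tokens.getEntries()) {
        if (!kept.contains(token)) {
          omitted.add(token);
        }
      }
    }
    docs.close();

    System.out.println("Omitted terms: " + omitted);
  }
}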

Thanks in Advance for the Guidance,

SCott




Re: get similar items

2014-02-14 Thread Scott C. Cote
I generate my initial sequence files directly from records in my mysql
database.  Follow Martin's advice on going through the tutorial.  Very
very very helpful.  Also - I really like MiA even if it is a couple of
versions behind.  The clustering chapters are still very accurate (seem to
be :)  ).  
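
In case it helps anyone doing the same, a bare-bones sketch of the
mysql-to-sequence-file step (the JDBC URL, table, and column names are
hypothetical; the key becomes the document id and the value the raw text, so
downstream vectorization can keep the id):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MysqlToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("documents/chunk-0");   // hypothetical output path

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try (Connection c = DriverManager.getConnection(
             "jdbc:mysql://localhost/corpus", "user", "password");   // hypothetical
         Statement s = c.createStatement();
         ResultSet rs = s.executeQuery("SELECT id, body FROM documents")) {   // hypothetical
      while (rs.next()) {
        // Key = document id (kept as the vector name later), value = raw text.
        writer.append(new Text(rs.getString("id")), new Text(rs.getString("body")));
      }
    } finally {
      writer.close();
    }
  }
}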

You really need to get a good feel of what kind of vectors you are going
to use as input to your clusters.

SCott

On 2/14/14 1:32 AM, N! 12481...@qq.com wrote:

Thank you Sebastian, Martin, Scott.
I checked 
'https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line'.
It looks like the case I described. But I am using Java with a MySQL
database; is there an example related to this?


thanks.
-- Original --
From:  Scott C. Cote;scottcc...@gmail.com;
Date:  Wed, Feb 12, 2014 11:47 PM
To:  user@mahout.apache.orguser@mahout.apache.org;

Subject:  Re: get similar items



Since you are relying on unguided data - switch from
recommenders/classifier to clustering.

Anyone else agree with me on this???

SCott

On 2/12/14 9:04 AM, Martin, Nick nimar...@pssd.com wrote:

Yeah, since it would appear you're lacking requisite data for
recommenders the only other thing I can think of in this case is
potentially treating the movie records as documents and clustering them
(via whatever might be in the 'description' field).

Have a look here 
https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
and see if you can support something
like this with your dataset.

-Original Message-
From: Sebastian Schelter [mailto:ssc.o...@googlemail.com]
Sent: Wednesday, February 12, 2014 6:28 AM
To: user@mahout.apache.org
Subject: Re: get similar items

Hi,

Mahout's recommenders are based on analyzing interactions between users
and items/movies, e.g. ratings or counts how often the movie was watched.


On 02/12/2014 11:34 AM, N! wrote:
 Hi all:
   Does anyone have any suggestions for the questions below?


   thanks a lot.


 -- Original --
 Sender: N!12481...@qq.com;
 Send time: Wednesday, Feb 12, 2014 6:17 PM
 To: useruser@mahout.apache.org;

 Subject: Re: get similar items



 Hi Sean:
  Thanks for the reply.
  Assume I have only one table named 'movie' with 1000+
records, this table have three
columns:'id','movieName','movieDescription'.
  Can Mahout calculate the most similar movies for a
movie.(based on only the 'movie' table)?
  code like: List mostSimilarMovieList =
recommender.mostSimilar(int movieId).
  if not, do you have any suggestions for this scenario?




.




Re: streaming kmeans vs incremental canopy/solr/kmeans

2014-02-14 Thread Scott C. Cote
Right now - I'm dealing with only 40,000 documents, but we will eventually
grow more than 10x (put on the manager hat and say 1 mil docs) where a doc
is usually no longer than 20 or 30 words.

SCott

On 2/14/14 12:46 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Scott,

How much data do you have?

How much do you plan to have?



On Fri, Feb 14, 2014 at 8:04 AM, Scott C. Cote scottcc...@gmail.com
wrote:

 Hello All,

 I have two questions (Q1, Q2).

 Q1: Am digging in to Text Analysis and am wrestling with competing
analyzed
 data maintenance strategies.

 NOTE: my text comes from a very narrowly focused source.

 - Am currently crunching the data (batch) using the following scheme:
 1. Load source text as rows in a mysql database.
 2. Create named TFIDF vectors using a custom analyzer from source text
 (-stopwords, lowercase, std filter, ….)
 3. Perform Canopy Cluster and then Kmeans Cluster using an enhanced
cosine
 metric (derived from a custom metric found in MiA)
 4. Load references of Clusters into SOLR (core1) - cluster id, top terms
 along with full cluster data into Mongo (a cluster is a doc)
 5. Then load source text into SOLR(core2) using same custom analyzer
with
 appropriate boost along with the reference cluster id
 NOTE: in all cases, the id of the source text is preserved throughout
the
 flow in the vector naming process, etc.

 So now I have a mysql table,  two SOLR cores, and a Mongo Document
 Collection (all tied together with text id as the common name)

 - Now when  a new document enters the system after batch has been
 performed, I use core2 to test the top  SOLR matches (custom analyzer
 normalizes the new doc) to find best cluster within a tolerance.  If a
 cluster is found, then I place the text in that cluster - if not, then I
 start a new group (my word for a cluster not generated via kmeans).
Either
 way, the doc makes its way into both (core1 and core2). I keep track of
the
 number of group creations/document placements so that if a threshold is
 crossed, then I can re-batch the data.

 MiA (I think ch. 11) suggests that a user could run the canopy
cluster
 routine to assign new entries to the clusters (instead of what I am
doing).
 Does he mean to regenerate a new dictionary, frequencies, etc for the
 corpus
 for every inbound document?  My observations have been that this has
been a
 very speedy process, but I'm hoping that I'm just too much of a novice
and
 haven't thought of a way to simply update the dictionary/frequencies.
  (this
 process also calls for the eventual rebatching of the clusters).

 While I was very early in my "implement what I have read" process,
Suneel
 and Ted recommended that I examine the Streaming Kmeans process.  Would
 that
 process sidestep much of what I'm doing?

 Q2: I need to really understand the lexicon of my corpus.  How do I see
the
 list of terms that have been omitted due either to being in too many
 documents or are not in enough documents for consideration?

 Please know that I know that I can look at the dictionary to see what
terms
 are covered.  And since my custom analyzer is using the
 StandardAnalyzer.stop words, those are obvious also.  If there isn't an
 option to emit the  omitted words, where would be the natural place to
 capture that data and save it into yet another data store (Sequence
 file,etc)?

 Thanks in Advance for the Guidance,

 SCott







Re: get similar items

2014-02-12 Thread Scott C. Cote
Since you are relying on unguided data - switch from
recommenders/classifier to clustering.

Anyone else agree with me on this???

SCott

On 2/12/14 9:04 AM, Martin, Nick nimar...@pssd.com wrote:

Yeah, since it would appear you're lacking requisite data for
recommenders the only other thing I can think of in this case is
potentially treating the movie records as documents and clustering them
(via whatever might be in the 'description' field).

Have a look here 
https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
and see if you can support something
like this with your dataset.

-Original Message-
From: Sebastian Schelter [mailto:ssc.o...@googlemail.com]
Sent: Wednesday, February 12, 2014 6:28 AM
To: user@mahout.apache.org
Subject: Re: get similar items

Hi,

Mahout's recommenders are based on analyzing interactions between users
and items/movies, e.g. ratings or counts how often the movie was watched.


On 02/12/2014 11:34 AM, N! wrote:
 Hi all:
   Does anyone have any suggestions for the questions below?


   thanks a lot.


 -- Original --
 Sender: N!12481...@qq.com;
 Send time: Wednesday, Feb 12, 2014 6:17 PM
 To: useruser@mahout.apache.org;

 Subject: Re: get similar items



 Hi Sean:
  Thanks for the reply.
  Assume I have only one table named 'movie' with 1000+
records, this table have three
columns:'id','movieName','movieDescription'.
  Can Mahout calculate the most similar movies for a
movie.(based on only the 'movie' table)?
  code like: List mostSimilarMovieList =
recommender.mostSimilar(int movieId).
  if not, do you have any suggestions for this scenario?






Re: Problem converting tokenized documents into TFIDF vectors

2014-01-26 Thread Scott C. Cote
Drew,

I'm sorry - I'm derelict (as opposed to dirichlet) in responding that I
got past my problem.

It was the min freq that was killing me.  Forgot about that parameter.

Thank you for your assist.

Hope to be able to return the favor.

Am on the hook to update documentation for Mahout already - maybe that
will do it :)

This week, I'll be testing my code against the .9 distribution.

SCott
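
PS: for anyone else who trips over this, here is a toy illustration (plain
Java, not the seq2sparse code) of what minimum-frequency pruning does to a
corpus this small. The document frequencies below are taken from the ten
sentences in my original post; the exact option names (e.g. --minSupport and
--minDF on seq2sparse) should be checked against the CLI help.

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class MinFreqPruning {
  // Terms whose document frequency falls below the minimum never get a
  // dictionary slot, so they silently disappear from the tf and tfidf vectors.
  public static Map<String, Integer> buildDictionary(Map<String, Integer> docFreq, int minDf) {
    Map<String, Integer> dictionary = new LinkedHashMap<String, Integer>();
    int nextIndex = 0;
    for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
      if (e.getValue() >= minDf) {
        dictionary.put(e.getKey(), nextIndex++);
      }
    }
    return dictionary;
  }

  public static void main(String[] args) {
    Map<String, Integer> docFreq = new HashMap<String, Integer>();
    docFreq.put("java", 5);            // docs 2, 3, 5, 7, 8
    docFreq.put("web", 2);             // docs 1, 6
    docFreq.put("spring", 2);          // docs 4, 5
    docFreq.put("multithreading", 1);  // doc 10

    // With a high threshold only "java" survives - which matches the dictionary I saw.
    System.out.println(buildDictionary(docFreq, 3));   // {java=0}
    // Relax the threshold and the rest of the vocabulary comes back.
    System.out.println(buildDictionary(docFreq, 1));
  }
}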

On 1/26/14 10:57 AM, Drew Farris d...@apache.org wrote:

Scott,

Based on the dictionary output, it looks like the process of generating
vectors from your tokenized text is not working properly. The only term
that's making it into your dictionary is 'java' - everything else is being
filtered out. Furthermore, your tf vectors have a single dimension '0'
with a weight that corresponds to the frequency of the term 'java' in
each
document.

I would check the settings for minimum document frequency in the
vectorization process. What is the command you are using to create vectors
from your tokenized documents?

Drew


On Tue, Jan 21, 2014 at 6:30 PM, Scott C. Cote scottcc...@gmail.com
wrote:

 All,

 Not a Mahout .9 problem - once I have this working with .8 Mahout, will
 immediately pull in the .9 stuff ….

 I am trying to make a small data set work (perhaps it is too small?)
where
 I
 am clustering skills (phrases).  For the sake of brevity (my steps are
long),
 I
 have not documented the steps that I took to get my text of skills into
 tokenized form ….

 By the time I get to the TFIDF vectors (step 4), my output is empty ….
 No tfidf vectors are generated.


 I have broken this down into 4 steps.



 Step 1. Tokenize docs.  Here is output validating success of
tokenization.

 mahout seqdumper -i tokenized-documents/part-m-0

 yields

 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.mahout.common.StringTuple
 Key: 1: Value: [rest, web, services]
 Key: 2: Value: [soa, design, build, service, oriented, architecture,
using,
 java]
 Key: 3: Value: [oracle, jdbc, build, java, database, connectivity,
layer,
 oracle]
 Key: 4: Value: [spring, injection, use, spring, templates, inversion,
 control]
 Key: 5: Value: [j2ee, create, device, enterprise, java, beans,
integrate,
 spring]
 Key: 6: Value: [can, deploy, web, archive, war, files, tomcat]
 Key: 7: Value: [java, graphics, uses, android, graphics, packages,
create,
 user, interfaces]
 Key: 8: Value: [core, java, understand, core, libraries, java,
development,
 kit]
 Key: 9: Value: [design, develop, jdbc, sql, queries]
 Key: 10: Value: [multithreading, thread, synchronization]
 Count: 10


 Step 2. Create term frequency vectors from the tokenized sequence file
 (step
 1).

 mahout seqdumper -i dictionary.file-0

 Yields

 Key: java: Value: 0
 Count: 1

 mahout seqdumper -i tf-vectors/part-r-0

 Yields

 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.mahout.math.VectorWritable
 Key: 2: Value: 2:{0:1.0}
 Key: 3: Value: 3:{0:1.0}
 Key: 5: Value: 5:{0:1.0}
 Key: 7: Value: 7:{0:1.0}
 Key: 8: Value: 8:{0:2.0}
 Count: 5


 Step 3. Create the document frequency data.

 mahout seqdumper -i frequency.file-0

 Yields

 Key: 0: Value: 5
 Count: 1

 NOTE to READER:  Java is NOT the only common word - web occurs more than
 once - how come it's not included?





 Step 4. Create the tfidf vectors: (can't remember if partials were
created
 in the past step)

 mahout seqdumper -i partial-vectors-0/part-r-0

 yields

 INFO: Command line arguments: {--endPhase=[2147483647],
 --input=[part-r-0], --startPhase=[0], --tempDir=[temp]}
 2014-01-21 16:57:23.661 java[24565:1203] Unable to load realm info from
 SCDynamicStore
 Input Path: part-r-0
 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.mahout.math.VectorWritable
 Key: 2: Value: 2:{}
 Key: 3: Value: 3:{}
 Key: 5: Value: 5:{}
 Key: 7: Value: 7:{}
 Key: 8: Value: 8:{}
 Count: 5

 NOTE to READER:  What do the empty brackets mean here?


 mahout seqdumper -i tfidf-vectors/part-r-0

 Yields

 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.mahout.math.VectorWritable
 Count: 0

 Why 0?

 What am I NOT understanding here?

 SCott







Re: Problem converting tokenized documents into TFIDF vectors

2014-01-26 Thread Scott C. Cote
I understand that it is not official.

Am just trying to provide another test opportunity for the .9 release.

SCott

On 1/26/14 1:05 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Scott,

FYI... 0.9 Release is not official yet. The project trunk's still at
0.9-SNAPSHOT.

Please feel free to update the documentation.






On Sunday, January 26, 2014 1:34 PM, Scott C. Cote scottcc...@gmail.com
wrote:
 
Drew,

I'm sorry - I'm derelict (as opposed to dirichlet) in responding that I
got past my problem.

It was the min freq that was killing me.  Forgot about that parameter.

Thank you for your assist.

Hope to be able to return the favor.

Am on the hook to update documentation for Mahout already - maybe that
will do it :)

This week, I'll be testing my code against the .9 distribution.

SCott


On 1/26/14 10:57 AM, Drew Farris d...@apache.org wrote:

Scott,

Based on the dictionary output, it looks like the process of
generating
vectors from your tokenized text is not working properly. The only term
that's making it into your dictionary is 'java' - everything else is
being
filtered out. Furthermore, your tf vectors have a single dimension '0'
with a weight that corresponds to the frequency of the term 'java' in
each
document.

I would check the settings for minimum document frequency in the
vectorization process. What is the command you are using to create
vectors
from your tokenized documents?

Drew


On Tue, Jan 21, 2014 at 6:30 PM, Scott C. Cote scottcc...@gmail.com
wrote:

 All,

 Not a Mahout .9 problem - once I have this working with .8 Mahout, will
 immediately pull in the .9 stuff ….

 I am trying to make a small data set work (perhaps it is too small?)
where
 I
 am clustering skills (phrases).  For the sake of brevity (my steps are
long),
 I
 have not documented the steps that I took to get my text of skills into
 tokenized form ….

 By the time I get to the TFIDF vectors (step 4), my output is empty ….
 No tfidf vectors are generated.


 I have broken this down into 4 steps.



 Step 1. Tokenize docs.  Here is output validating success of
tokenization.

 mahout seqdumper -i tokenized-documents/part-m-0

 yields

 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.mahout.common.StringTuple
 Key: 1: Value: [rest, web, services]
 Key: 2: Value: [soa, design, build, service, oriented, architecture,
using,
 java]
 Key: 3: Value: [oracle, jdbc, build, java, database, connectivity,
layer,
 oracle]
 Key: 4: Value: [spring, injection, use, spring, templates, inversion,
 control]
 Key: 5: Value: [j2ee, create, device, enterprise, java, beans,
integrate,
 spring]
 Key: 6: Value: [can, deploy, web, archive, war, files, tomcat]
 Key: 7: Value: [java, graphics, uses, android, graphics, packages,
create,
 user, interfaces]
 Key: 8: Value: [core, java, understand, core, libraries, java,
development,
 kit]
 Key: 9: Value: [design, develop, jdbc, sql, queries]
 Key: 10: Value: [multithreading, thread, synchronization]
 Count: 10


 Step 2. Create term frequency vectors from the tokenized sequence file
 (step
 1).

 mahout seqdumper -i dictionary.file-0

 Yields

 Key: java: Value: 0
 Count: 1

 mahout seqdumper -i tf-vectors/part-r-0

 Yields

 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.mahout.math.VectorWritable
 Key: 2: Value: 2:{0:1.0}
 Key: 3: Value: 3:{0:1.0}
 Key: 5: Value: 5:{0:1.0}
 Key: 7: Value: 7:{0:1.0}
 Key: 8: Value: 8:{0:2.0}
 Count: 5


 Step 3. Create the document frequency data.

 mahout seqdumper -i frequency.file-0

 Yields

 Key: 0: Value: 5
 Count: 1

 NOTE to READER:  Java is NOT the only common word - web occurs more
than
 once - how come it's not included?





 Step 4. Create the tfidf vectors: (can't remember if partials were
created
 in the past step)

 mahout seqdumper -i partial-vectors-0/part-r-0

 yields

 INFO: Command line arguments: {--endPhase=[2147483647],
 --input=[part-r-0], --startPhase=[0], --tempDir=[temp]}
 2014-01-21 16:57:23.661 java[24565:1203] Unable to load realm info from
 SCDynamicStore
 Input Path: part-r-0
 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.mahout.math.VectorWritable
 Key: 2: Value: 2:{}
 Key: 3: Value: 3:{}
 Key: 5: Value: 5:{}
 Key: 7: Value: 7:{}
 Key: 8: Value: 8:{}
 Count: 5

 NOTE to READER:  What do the empty brackets mean here?


 mahout seqdumper -i tfidf-vectors/part-r-0

 Yields

 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.mahout.math.VectorWritable
 Count: 0

 Why 0?

 What am I NOT understanding here?

 SCott






Re: Running Mahout Example

2014-01-22 Thread Scott C. Cote
To eliminate the MAHOUT_LOCAL stack traces, I set the env var to an
arbitrary value.  

export MAHOUT_HOME=~/mahout
export MAHOUT_LOCAL=yes
export PATH=$PATH:${MAHOUT_HOME}/bin



On 1/22/14 9:50 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

What's ur Mahout version?





On Wednesday, January 22, 2014 10:27 AM, Sznajder ForMailingList
bs4mailingl...@gmail.com wrote:
 
Strangely,

I get the following:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Exception in thread main java.lang.NoClassDefFoundError: classpath
Caused by: java.lang.ClassNotFoundException: classpath
at java.net.URLClassLoader.findClass(URLClassLoader.java:434)
at java.lang.ClassLoader.loadClass(ClassLoader.java:653)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:358)
at java.lang.ClassLoader.loadClass(ClassLoader.java:619)
Could not find the main class: classpath.  Program will exit.
Running on hadoop, using
/mnt/hdgpfs/shared_home/hadoop/IHC-0.20.2/bin/hadoop and
HADOOP_CONF_DIR=/mnt/hdgpfs/shared_home/hadoop/IHC-0.20.2/conf


Benjamin



On Wed, Jan 22, 2014 at 4:59 PM, Suneel Marthi
suneel_mar...@yahoo.comwrote:

 Try examples/bin/cluster-reuters.sh

 Sent from my iPhone

  On Jan 22, 2014, at 9:56 AM, Sznajder ForMailingList 
 bs4mailingl...@gmail.com wrote:
 
  Hi,
 
  I wished to run the mahout example for Kmeans algorithm.
 
  I suppose that it is:
  org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
 
  (1) Is it right?
 
 
  It looks for a /testdata/ directory. I did not find it
 
  (2) Where is it, please?
 
 
  I thought to use the reuters data set described in Manning book and I
  extracted it to my disk and pointed to this directory in the main
method.
 
  However, I get the following, when running the Job:
 
  java.lang.NumberFormatException: For input string: amex
 at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
 at java.lang.Double.valueOf(Unknown Source)
 at
 
 
org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:
48)
 at
 
 
org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:
1)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at
  
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
 
 
  (3) What do I do wrong?
 
  Best regards
  Benjamin




Problem converting tokenized documents into TFIDF vectors

2014-01-21 Thread Scott C. Cote
All,

Not a Mahout .9 problem - once I have this working with .8 Mahout, will
immediately pull in the .9 stuff ….

I am trying to make a small data set work (perhaps it is too small?) where I
am clustering skills (phrases).  For the sake of brevity (my steps are long), I
have not documented the steps that I took to get my text of skills into
tokenized form ….

By the time I get to the TFIDF vectors (step 4), my output is empty ….
No tfidf vectors are generated.


I have broken this down into 4 steps.



Step 1. Tokenize docs.  Here is output validating success of tokenization.

mahout seqdumper -i tokenized-documents/part-m-0

yields

Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.common.StringTuple
Key: 1: Value: [rest, web, services]
Key: 2: Value: [soa, design, build, service, oriented, architecture, using,
java]
Key: 3: Value: [oracle, jdbc, build, java, database, connectivity, layer,
oracle]
Key: 4: Value: [spring, injection, use, spring, templates, inversion,
control]
Key: 5: Value: [j2ee, create, device, enterprise, java, beans, integrate,
spring]
Key: 6: Value: [can, deploy, web, archive, war, files, tomcat]
Key: 7: Value: [java, graphics, uses, android, graphics, packages, create,
user, interfaces]
Key: 8: Value: [core, java, understand, core, libraries, java, development,
kit]
Key: 9: Value: [design, develop, jdbc, sql, queries]
Key: 10: Value: [multithreading, thread, synchronization]
Count: 10


Step 2. Create term frequency vectors from the tokenized sequence file (step
1).

mahout seqdumper -i dictionary.file-0

Yields

Key: java: Value: 0
Count: 1

mahout seqdumper -i tf-vectors/part-r-0

Yields

Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Key: 2: Value: 2:{0:1.0}
Key: 3: Value: 3:{0:1.0}
Key: 5: Value: 5:{0:1.0}
Key: 7: Value: 7:{0:1.0}
Key: 8: Value: 8:{0:2.0}
Count: 5


Step 3. Create the document frequency data.

mahout seqdumper -i frequency.file-0

Yields

Key: 0: Value: 5
Count: 1

NOTE to READER:  Java is NOT the only common word. "web" occurs more than
once, so how come it's not included?





Step 4. Create the tfidf vectors: (can't remember if partials were created
in the past step)

mahout seqdumper -i partial-vectors-0/part-r-0

yields

INFO: Command line arguments: {--endPhase=[2147483647],
--input=[part-r-0], --startPhase=[0], --tempDir=[temp]}
2014-01-21 16:57:23.661 java[24565:1203] Unable to load realm info from
SCDynamicStore
Input Path: part-r-0
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Key: 2: Value: 2:{}
Key: 3: Value: 3:{}
Key: 5: Value: 5:{}
Key: 7: Value: 7:{}
Key: 8: Value: 8:{}
Count: 5

NOTE to READER:  What do the empty brackets mean here?


mahout seqdumper -i tfidf-vectors/part-r-0

Yields

Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Count: 0

Why 0?

What am I NOT understanding here?

SCott




Re: need help explaining difference in k means output

2014-01-06 Thread Scott C. Cote
Mahesh,

I guess this is what I get for working too long and not recognizing the
diff ...  Suspected it was something silly.

Changing the driver parameters to EXACTLY the same as the command line
does indeed work.   Thank you.

I now have one file.  Not sure if it was the convergence or the
sequential, but I have a hunch that the problem was the sequential (As you
pointed out, I have plenty of iterations left).

Cheers!

SCott
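
(For reference, a sketch, not from the original exchange, of the driver call with its arguments lined up
with the command line invocation quoted below: cosine distance, convergence delta 0.1, at most 10
iterations, clustering enabled, and MapReduce rather than sequential execution, since no -xm sequential
flag was passed on the command line:

 KMeansDriver.run(conf, vectorsFolder,
     new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
     new CosineDistanceMeasure(),
     0.1,    // -cd convergence delta
     10,     // -x  maximum number of iterations
     true,   // -cl run clustering after the final iteration
     0.0,    // clusterClassificationThreshold
     false); // runSequential: false, i.e. MapReduce, matching the command line

The positional signature is the same one used in the Java snippet quoted below.)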

On 1/6/14 3:58 AM, Mahesh Balija balijamahesh@gmail.com wrote:

Hi Scott,

I am not very sure why you are getting many part files from the code execution.
One difference between your command line and the code execution is the cd
[convergence delta]: 0.1 versus 0.01. In the latter case KMeans might take more
iterations to converge since its convergenceDelta is smaller, but in any case
you have the number of iterations set to 10.
Another difference is that you are running your source code execution in
sequential mode. I am not sure whether these factors really affect the
number of part files being generated.

Anyhow, you should evaluate the clusters that are finally generated by using
ClusterDumper in both cases; that will give you the number of
clusters and the points associated with each cluster.

The ClusteredPoints will be generated in the last iteration and will have
the info about the clusters and associated points for each cluster.
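
For illustration, a minimal sketch of walking clusteredPoints with the plain SequenceFile API, printing
which cluster each point was assigned to. The path is assumed from the commands in this thread, and the
class name is made up for the example:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.SequenceFile;
 import org.apache.mahout.clustering.classify.WeightedVectorWritable;

 public class DumpClusteredPoints {
   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     // assumed location: the clusteredPoints folder written because kmeans ran with -cl
     Path points = new Path("reuters-kmeans-clusters/clusteredPoints/part-m-0");
     SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), points, conf);
     IntWritable clusterId = new IntWritable();                   // key: id of the cluster the point fell into
     WeightedVectorWritable point = new WeightedVectorWritable(); // value: weight plus the (named) vector
     while (reader.next(clusterId, point)) {
       System.out.println(clusterId.get() + " <- " + point.getVector());
     }
     reader.close();
   }
 }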

Best,
Mahesh Balija.


On Sun, Jan 5, 2014 at 1:59 AM, Scott C. Cote scottcc...@gmail.com
wrote:

 All,

 When I run the Kmeans analysis from the command line,

  #
  # added the -cd option per instructions in the Mahout In Action (MiA) so the
  # convergence threshold is .1
  #   instead of default value of .5 because cosines lie within 0 and 1.
  #
  # maximum number of iterations is 10
  #
  mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-canopy-centroids/clusters-0-final/ -cl -ow -o reuters-kmeans-clusters -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1

  the iterations resolve to a directory with the word final that has a
 single file where the name is like part-r-0  .
  If I run it as a java routine:

 KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
     new CosineDistanceMeasure(), 0.01, 20, true, 0.0, true);



  thousands of files such as part-00338 are produced.  The same data is
 used as input for both, and both are initialized from canopy.

 Why does the command line form generate a single file while my Java version
 generates multiple output files?  What setting/configuration am I missing?

 Secondary question:  I assume the sequence files located in the final folder
 contain the centroids of the data, and that the points the centroids were
 derived from are in clusteredPoints (please confirm).

 Thanks in advance.

 SCott









need help explaining difference in k means output

2014-01-04 Thread Scott C. Cote
All,

When I run the Kmeans analysis from the command line,

 #
 # added the -cd option per instructions in the Mahout In Action (MiA) so the
 # convergence threshold is .1
 #   instead of default value of .5  because cosines lie within 0 and 1.
 #
 # maximum number of iterations is 10
 #
 mahout kmeans -i reuters-vectors/tfidf-vectors/ -c
 reuters-canopy-centroids/clusters-0-final/ -cl -ow -o reuters-kmeans-clusters
 -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1

 the iterations resolve to a directory with the word final that has a
single file where the name is like part-r-0  .
 If I run it as a java routine:

KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
    new CosineDistanceMeasure(), 0.01, 20, true, 0.0, true);



 thousands of files such as part-00338 are produced.  The same data is
used as input for both, and both are initialized from canopy.

Why does the command line form generate a single file while my Java version
generates multiple output files?  What setting/configuration am I missing?

Secondary question:  I assume the sequence files located in the final folder
contain the centroids of the data, and that the points the centroids were
derived from are in clusteredPoints (please confirm).

Thanks in advance.

SCott






Re: Equality of two DenseMatrix objects

2013-12-29 Thread Scott C. Cote
Ted - thank you for taking the time to point out that in Multivariate
Systems, there are many interpretations to what would seem ordinary and
non-debatable in scalar mathematics.

For example, in the relational algebra world, I know of seven different
interpretations of relational division.

SCott

On 12/29/13 10:02 PM, Ted Dunning ted.dunn...@gmail.com wrote:

On Sun, Dec 29, 2013 at 7:30 PM, Tharindu Rusira
tharindurus...@gmail.com wrote:

 Hi Ted, Thanks for taking this discussion back alive. It's true, as
 Sebestian mentioned, equality checking for matrices is an expensive task
 and Ted has come up with a smart one liner here(even though a
considerable
 amount of computational complexity is hidden somewhere).
 But don't you think (at least for the sake of completeness) that we
should
 have an implementation of this?


Not really.  The problem is that there are many different meanings of
"equal" for matrices. In fact there are many definitions of "zero" as well.
 This stems partly from the fact that we have to inherit a sense of "nearly
zero" or "nearly equal" from the floating point arithmetic we are using.
This is exactly why equals is poorly defined for floating point numbers,
only worse here.

As such any single definition is going to be seriously problematic.  Any
definition that doesn't have a tolerance argument is inherently dangerous
to use except in very limited situations.

For example here are some possibilities for vector equality:

   | x - y |_F < \delta
   | x - y |_1 < \delta
   | x - y |_0 < \delta
   (x - y)^T A (x - y) < \delta
   x^T A y > 1 - \delta/2

The first says that the sum of the squares of the components of the
difference is less than a particular number.  The second says that the sum
of the absolute values of the difference is less.  The third says that the
largest component of the difference is small.  The fourth says that the
quadratic form of the difference with respect to A is nearly zero,
neglecting components in the null space of A.  The last form is useful for
cases where x and y have unit norm with respect to A (i.e. x^T A x = 1).

Which of these is correct?  Of all of these, only the last two are
equivalent and only in limited situations.

For matrices, there are even more possibilities.
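
For what it is worth, here is a minimal plain-Java sketch of the first three tolerance-based checks above.
The class and method names are just for illustration; this is not an existing Mahout API:

 public final class NearlyEqual {
   // |x - y|_F < delta : square root of the sum of squared component differences
   public static boolean byL2(double[] x, double[] y, double delta) {
     double sum = 0.0;
     for (int i = 0; i < x.length; i++) {
       double d = x[i] - y[i];
       sum += d * d;
     }
     return Math.sqrt(sum) < delta;
   }

   // |x - y|_1 < delta : sum of absolute component differences
   public static boolean byL1(double[] x, double[] y, double delta) {
     double sum = 0.0;
     for (int i = 0; i < x.length; i++) {
       sum += Math.abs(x[i] - y[i]);
     }
     return sum < delta;
   }

   // largest single component difference smaller than delta
   public static boolean byMax(double[] x, double[] y, double delta) {
     double max = 0.0;
     for (int i = 0; i < x.length; i++) {
       max = Math.max(max, Math.abs(x[i] - y[i]));
     }
     return max < delta;
   }
 }

Each check answers a different question, which is exactly why a single equals without a tolerance
argument is hard to defend.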



 Btw, this thread has turned into a developers discussion, so I'm not
sure
 whether we should continue this on the developers list.


I think that this is a very important thread for users at large as well.




Mahout In Action - NewsKMeansClustering sample not generating clusters

2013-12-27 Thread Scott C. Cote
Hello Mahout Trainers and Gurus:

I am plowing through the sample code from Mahout in Action.  Have been
trying to run the example NewsKMeansClustering using the  Reuters dataset.
Found Alex Ott's Blog

http://alexott.blogspot.co.uk/2012/07/getting-started-with-examples-from.html

And downloaded the updated examples for 0.7 mahout.  I took the exploded zip
and modified the pom.xml so that it referenced 0.8 mahout instead of 0.7
mahout.

Of course, there are compile errors (expected), but the only seemingly
significant problems are in the helper class called MyAnalyzer.

NOTE: I am NOT complaining about the fact that the samples don't compile
properly in 0.8.  If my effort to make it work results in sharable code,
then I have helped (or the person who helps me has helped).


I need help in potentially two different parts:   Revision of MyAnalyzer
(steps 1 and 2) and/or sidestepping it (step 3)

Steps Taken (total of 3 steps):

Step 1. Performed the sgml2text conversion of reuters data and then
converted the text to sequence files.
Step 2. Attempted to run the java NewsKMeansClustering  with MyAnalyzer -
attempted to modify MyAnalyzer to fit into the 0.8 mahout world

When I try to run the program, the sample blows up with this message:

 2013-12-27 12:59:29.870 java[86219:1203] Unable to load realm info from
 SCDynamicStore
 
 SLF4J: Class path contains multiple SLF4J bindings.
 
 SLF4J: Found binding in
 [jar:file:/Users/scottccote/.m2/repository/org/slf4j/slf4j-jcl/1.7.5/slf4j-jcl
 -1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 
 SLF4J: Found binding in
 [jar:file:/Users/scottccote/.m2/repository/org/slf4j/slf4j-log4j12/1.5.11/slf4
 j-log4j12-1.5.11.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
 explanation.
 
 SLF4J: Actual binding is of type [org.slf4j.impl.JCLLoggerFactory]
 
 2013-12-27 12:59:30 NativeCodeLoader [WARN] Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 
 2013-12-27 12:59:30 JobClient [WARN] Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 
 2013-12-27 12:59:30 LocalJobRunner [WARN] job_local_0001
 
 java.lang.NullPointerException
 
 at 
 org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.fill(Charac
 terUtils.java:209)
 
 at 
 org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.jav
 a:135)
 
 at 
 org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(Sequence
 FileTokenizerMapper.java:49)
 
 at 
 org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(Sequence
 FileTokenizerMapper.java:38)
 
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
 
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
 
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
 
 Exception in thread main java.lang.IllegalStateException: Job failed!
 
 at 
 org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProce
 ssor.java:95)
 
 at mia.clustering.ch09.NewsKMeansClustering.main(NewsKMeansClustering.java:53)


Here is the source code of my revised MyAnalyzer. I tried to stay as true as
possible to the form of the original MyAnalyzer, but I'm sure that I
misunderstood something in this class when I ported it to the new Lucene
Analyzer interface API ...

 public class MyAnalyzer extends Analyzer
 {
     private final Pattern alphabets = Pattern.compile("[a-z]+");

     /*
      * (non-Javadoc)
      * @see org.apache.lucene.analysis.Analyzer#createComponents(java.lang.String, java.io.Reader)
      */
     @Override
     protected TokenStreamComponents createComponents(String fieldName, Reader reader)
     {
         final Tokenizer source = new StandardTokenizer(Version.LUCENE_CURRENT, reader);
         TokenStream result = new StandardFilter(Version.LUCENE_CURRENT, source);
         result = new LowerCaseFilter(Version.LUCENE_CURRENT, result);
         result = new StopFilter(Version.LUCENE_CURRENT, result, StandardAnalyzer.STOP_WORDS_SET);
         CharTermAttribute termAtt = result.addAttribute(CharTermAttribute.class);
         StringBuilder buf = new StringBuilder();

         try
         {
             result.reset();
             while ( result.incrementToken() )
             {
                 if ( termAtt.length() < 3 )
                     continue;
                 String word = new String(termAtt.buffer(), 0, termAtt.length());
                 Matcher m = alphabets.matcher(word);

                 if ( m.matches() )
                 {
                     buf.append(word).append(" ");
                 }
             }
         }
         catch ( IOException e )
         {
             e.printStackTrace();
         }

         TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_CURRENT, new StringReader(buf.toString()));
         return new TokenStreamComponents(source, ts);
     }
 }
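
One plausible cause of the NullPointerException above (an assumption on my part, not something confirmed
in this thread) is that createComponents consumes the token stream eagerly, while in the Lucene 4.x API the
method is only supposed to assemble the analysis chain. A sketch of a lazily built chain for comparison;
the length/regex filtering from the original would have to be re-expressed as TokenFilters, which is
omitted here:

 @Override
 protected TokenStreamComponents createComponents(String fieldName, Reader reader)
 {
     // assemble the chain only; do not call reset()/incrementToken() here
     Tokenizer source = new StandardTokenizer(Version.LUCENE_CURRENT, reader);
     TokenStream result = new StandardFilter(Version.LUCENE_CURRENT, source);
     result = new LowerCaseFilter(Version.LUCENE_CURRENT, result);
     result = new StopFilter(Version.LUCENE_CURRENT, result, StandardAnalyzer.STOP_WORDS_SET);
     return new TokenStreamComponents(source, result);
 }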


Step 3. Since I wasn't progressing with MyAnalyzer - I commented out the
MyAnalyzer reference inside NewsKMeansClustering and replaced with

 // MyAnalyzer analyzer = new MyAnalyzer();
 
 

Re: Mahout In Action - NewsKMeansClustering sample not generating clusters

2013-12-27 Thread Scott C. Cote
 source from Alex Ott's .7 version of
NewsKMeansClustering:

/*
 * Source code for Listing 9.4
 */

package mia.clustering.ch09;

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.classify.WeightedVectorWritable;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.vectorizer.DictionaryVectorizer;
import org.apache.mahout.vectorizer.DocumentProcessor;
import org.apache.mahout.vectorizer.tfidf.TFIDFConverter;

public class NewsKMeansClustering
{

    public static void main(String args[]) throws Exception
    {
        //
        // changes from Alex Ott's source:
        //
        // 1. changed booleans that indicate the use of named vectors from false to true
        // 2. changed sequential access booleans from false to true
        // 3. changed MyAnalyzer to StandardAnalyzer
        // 4. added System.out.println statements to provide console guidance on progress
        // 5. changed input dir to reuters-seqfiles to make use of output from command line approach in tour
        //
        int minSupport = 5;
        int minDf = 5;
        int maxDFPercent = 95;
        int maxNGramSize = 2;
        int minLLRValue = 50;
        int reduceTasks = 1;
        int chunkSize = 200;
        int norm = 2;
        boolean sequentialAccessOutput = true;

        // String inputDir = "inputDir";
        String inputDir = "reuters-seqfiles";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        String outputDir = "newsClusters";
        HadoopUtil.delete(conf, new Path(outputDir));
        Path tokenizedPath = new Path(outputDir, DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
        // MyAnalyzer analyzer = new MyAnalyzer();
        System.out.println("tokenizing the documents");
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT, StandardAnalyzer.STOP_WORDS_SET);

        DocumentProcessor.tokenizeDocuments(new Path(inputDir), analyzer.getClass().asSubclass(Analyzer.class),
            tokenizedPath, conf);

        System.out.println("creating the term frequency vectors from tokenized documents");
        DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, new Path(outputDir),
            DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf, minSupport, maxNGramSize, minLLRValue, 2, true,
            reduceTasks, chunkSize, sequentialAccessOutput, true);

        System.out.println("calculating document frequencies from tf vectors");
        Pair<Long[], List<Path>> dfData = TFIDFConverter.calculateDF(new Path(outputDir,
            DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new Path(outputDir), conf, chunkSize);
        System.out.println("creating the tfidf vectors");
        TFIDFConverter.processTfIdf(new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
            new Path(outputDir), conf, dfData, minDf, maxDFPercent, norm, true, sequentialAccessOutput, true,
            reduceTasks);

        Path vectorsFolder = new Path(outputDir, "tfidf-vectors");
        Path canopyCentroids = new Path(outputDir, "canopy-centroids");
        Path clusterOutput = new Path(outputDir, "clusters");

        System.out.println("Deriving canopy clusters from the tfidf vectors");
        // CanopyDriver.run(vectorsFolder, canopyCentroids, new EuclideanDistanceMeasure(), 250, 120, false, 0.0, false);
        CanopyDriver.run(vectorsFolder, canopyCentroids, new CosineDistanceMeasure(), .4, .8, true, 0.0, true);

        System.out.println("running cluster kmean");
        // KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
        //     new TanimotoDistanceMeasure(), 0.01, 20, true, 0.0, false);
        KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
            new CosineDistanceMeasure(), 0.01, 20, true, 0.0, true);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs,
            new Path(clusterOutput + Cluster.CLUSTERED_POINTS_DIR + "/part-0"), conf);

        IntWritable key = new IntWritable();
        WeightedVectorWritable value = new WeightedVectorWritable();
        while ( reader.next(key, value) )
        {
            System.out.println(key.toString() + " belongs to cluster " + value.toString());
        }
        reader.close();
    }
}
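
(For reference, assuming the same Maven exec setup used elsewhere in these threads for ExtractReuters, a
run of this class might look like

 mvn -q exec:java -Dexec.mainClass=mia.clustering.ch09.NewsKMeansClustering

but the exact invocation depends on the project's pom, so treat this as a sketch.)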


I'm running out of ideas ...

SCott

From:  Scott C. Cote scottcc...@gmail.com
Date:  Friday, December 27, 2013 1:56 PM
To:  user@mahout.apache.org user@mahout.apache.org
Subject:  Mahout In Action - NewsKMeansClustering sample not generating
clusters

Hello Mahout Trainers and Gurus:

I am plowing through the sample code from Mahout in Action.  Have been
trying to run the example NewsKMeansClustering using the  Reuters dataset.
Found Alex Ott's Blog

http://alexott.blogspot.co.uk/2012/07/getting-started-with-examples-from.html

And downloaded the updated

Questions related to MiA and Quick tour of text analysis ...

2013-12-23 Thread Scott C. Cote
All,

Two questions related to Quick tour of text analysis using the Mahout
command line

1.  Metrics:
When moving through the process of performing the cluster analysis, one can
use many different metrics.  In the tour, the choice was made to use the
Cosine metric.  Are there any problems that can arise from using the cosine
metric to define the clusters, but using Tanimoto or Euclidean distance to
dump the clusters?  I have so far remained consistent in that once starting
with Cosine, I go all the way with cosine.  When does it make sense not to do
what I am doing?

To be clear, the current version of the tour does NOT specify that a metric
should be used when dumping a cluster, so the default Euclidean measure is used.

2. Parameters around canopy cluster:
What are parameters t3 and t4?  I know that they are optional reducer-side
thresholds and that t1 and t2 are used for them if t3 and t4 are not specified.

https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering

Lots of discussion about t1 and t2, but t3 and t4 are not covered in MiA
either.  Are these params that I should ignore for now?

SCott






Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis

2013-12-20 Thread Scott C. Cote

version of java (java -version): java version 1.6.0_65, Java(TM) SE
Runtime Environment (build 1.6.0_65-b14-462-11M4609),Java HotSpot(TM)
64-Bit Server VM (build 20.65-b04-462, mixed mode)

Version of os (uname -a): Darwin Scotts-MacBook-Air.local 12.5.0 Darwin
Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013;
root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64


 

On 12/19/13 1:08 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

I don't see a need for uploading ur commands.  Clean up HDFS (both output
and temp folders) and try running the 5 steps again - extract reuters,
seqdirectory, seq2sparse, rowid job, rowsimilarity job.

Please use '-ow' option while running each of the jobs.







On Thursday, December 19, 2013 2:04 PM, Scott C. Cote
scottcc...@gmail.com wrote:
 
I manually deleted the temp folder too (After 2 failed starts).

Would it be helpful for me to upload my shells that encapsulate all of the
commands posted on the tour?  They reflect the current state of reuters
and .8 mahout.
And if I did - how would I do it?

Thanks,

SCott


On 12/19/13 1:00 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Yep, that's what has happened in ur case. the wiki doesn't have but
please specify the -ow (overwrite) option while running the
RowsimilarityJob. That should clear up both the output and temp folders
before running the job.





On Thursday, December 19, 2013 1:50 PM, Suneel Marthi
suneel_mar...@yahoo.com wrote:
 
Haha... that could explain it, Rowsimilarityjob creates temp files during
execution. If ur laptop 'sleeped' then the temp files still persist and
running the job again wouldn't overwrite the old temp files (i need to
verify that).

It should be good enough to run the Rowsimilarity job again.







On Thursday, December 19, 2013 1:46 PM, Scott C. Cote
scottcc...@gmail.com wrote:
 
Suneel,

I'm going to do the similarity part of the tour over - my laptop was
sleeped in the middle of the run of the rowsimilarity job.
Maybe the job is sensitive to that ….  :(  Normally - a server would not
go to sleep nor would it run
in local mode.

Sorry that I didn't think of that sooner.
Will let you know my outcome.

Am planning on redoing by deleting the contents and the folder titled
reuters-similarity

Please let me know if that is not good enough.

Thanks again.

SCott


On 12/19/13 11:53 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

What you are seeing is the output matrix of the RowSimilarity job.  You
are right there should be 21578 documents only in the reuters corpus.

a) How many documents do you have in your docIndex?  DocIndex is one of
the artifacts of the RowIDJob and should have been executed prior to the
RowSimilarity Job. You can run seqdumper on docIndex to see the output.

b) Also what was the message at the end of the RowId job. It should read
something like 'Wrote out matrix with 21578 rows and 19515 columns to
reuters-matrix/matrix'.




On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
scottcc...@gmail.com wrote:
 
All,

I am a newbie Mahout user and am trying to use the Quick tour of text
analysis using the Mahout command line .  Thank you to whomever
contributed
to that page.

 
https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

Went all the way from beginning to end of the page with seemingly no hiccups.
At the very end of the tour, I became confused because the command:

 mahout seqdumper -i reuters-matrix/matrix | more

Allowed me to see output (snippet)

 Key: 1: Value:
 /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.2279237043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.105152778531,4:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.2279237043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}

Reading through that snippet of data made me think that there exists a
document with row id 41154 and a cosine value of ~0.0658 (the last

Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis

2013-12-20 Thread Scott C. Cote
 you
should be looking at a seqdumper of the output from rowsimilarity which
in ur case would be the output in reuters-similarity.  That should give
the 10 most similar documents and their cosine distances from the
referenced document.


mahout seqdumper -i reuters-similarity/part-r-* | more

Yields

Input Path: reuters-similarity/part-r-0
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.math.VectorWritable
Key: 0: Value: {0:0.,13611:0.17446750688012366,13430:0.15853208358190823,17520:0.19351644052283437,18330:0.15898358188286904,4411:0.20851636244169733,13403:0.1663674094837415,14458:0.17265033919444714,14613:0.15365176945223238,11399:0.19745333923929734}
Key: 1: Value: {9858:0.32081902404236906,9704:0.2485999435029943,9833:0.30851564542610826,19789:0.37458607189215337,10056:0.2885413911200995,10601:0.25986402839977124,11858:0.305718360283,17412:0.30330496505095894,1:0.9998,9702:0.26198579353949075}
Key: 2: Value: {2:1.0004,1087:0.28125327148896956,10390:0.2690057046963114,10022:0.27668518648436297,6746:0.26969982074464605,12886:0.27032675431539793,13168:0.25889934686395943,997:0.26225673856545156,1392:0.2673559453473729,20614:0.3009916279814217}
…..




:)





There's an error on the wiki link instructions, the seqdumper should have
been on rowsimilarity/part-r-* and not on matrix/matrix for determining
similar documents.

Hope this helps. Sorry again for the confusion.








On Friday, December 20, 2013 4:51 PM, Scott C. Cote
scottcc...@gmail.com wrote:
 
Suneel and others,

I am still getting the strange results when I do the tour. Suneel: I
manually wiped out the temp folder and also deleted the reuters-XXX
folders.  
Also, per your advice I added the -ow option to all of the commands.
NOTE: The step to create a matrix would NOT take a -ow option

I have tried again, and am still seeing references to documents that do
not exist.

The tail end of reuters-matrix/docindex looks like (mahout seqdumper -i
reuters-matrix/matrix | tail) :

INFO: Program took 1077 ms (Minutes: 0.01795)
Key: 21569: Value: /reut2-021.sgm-91.txt
Key: 21570: Value: /reut2-021.sgm-92.txt
Key: 21571: Value: /reut2-021.sgm-93.txt
Key: 21572: Value: /reut2-021.sgm-94.txt
Key: 21573: Value: /reut2-021.sgm-95.txt
Key: 21574: Value: /reut2-021.sgm-96.txt
Key: 21575: Value: /reut2-021.sgm-97.txt
Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578



And the following snippet exists inside reuters-matrix/matrix and
references key 41625 (which is larger than any key in docindex).

Key: 2: Value: 
/reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}

--- So in this email, I have listed the following pieces of information: 1. Commands, 2. Env vars, 3. Sw version info

Again, thank you in advance for your help.

Scott

INFO Below:

1. sequence of commands with relevant logged output points (omitted the
sequence dump commands):

mv reuters xreuters
rm -r temp

rm -r reuters-*
mv xreuters reuters
mvn -e -q exec:java -Dexec.mainClass=org.apache.lucene.benchmark.utils.ExtractReuters -Dexec.args="reuters/ reuters-extracted/"
mahout seqdirectory -c UTF-8 -i reuters-extracted/ -o reuters-seqfiles -ow
mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors/ -ow -chunk 100
-x 90 -seq -ml 50 -n 2 -nv
#
# added the -cd option per instructions in the Mahout In Action (MiA) so
the convergence threshold is .1 (originally this was the default value, but
with no effect on the unexpected results)
#   instead of default value of .5 because cosines lie within 0 and 1.
#
mahout kmeans -i reuters-vectors/tfidf-vectors/ -c
reuters-kmeans-centroids -cl -ow -o reuters-kmeans-clusters -k 20 -x 10
-dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
mahout clusterdump -d reuters-vectors/dictionary.file-0 -dt sequencefile
-i reuters-kmeans-clusters/clusters-3-final -n 20 -b 100 -o cdump.txt -p
reuters-kmeans-clusters/clusteredPoints/

mahout rowid -i reuters-vectors/tfidf

Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis

2013-12-20 Thread Scott C. Cote
What does the data in cdump.txt represent?  Can you point me in the right
direction?

SCott

On 12/20/13 4:30 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Sorry Scott I should have looked at this more closely. I apologize.

1. You are doing a seqdumper of the matrix (which is generated from the
rowid job and is not the output of the rowsimilarity job).

 The Rowid job generates an M x N matrix, where M is the number of documents
and N is the number of terms associated with the documents.

The value of a cell in the matrix is the tf-idf weight of the term.

 So in the following output:

 {Code}


  
Key: 2: Value: 
/reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}

{Code}

means that for document 2, what follows are the term:tf-idf weight pairs.

To see the term corresponding to 41625 look at dictionary.file-0 for the
corresponding key.
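
(For example, with the dictionary produced by seq2sparse in the commands elsewhere in this thread,
something along these lines should print the term whose id is 41625; the exact path is assumed, not
prescribed:

 mahout seqdumper -i reuters-vectors/dictionary.file-0 | grep 'Value: 41625$'

since seqdumper prints dictionary entries in the form Key: term: Value: id.)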

Hope that clarifies and clears the confusion here.

2.  In order to see the most similar documents for a given document you
should be looking at a seqdumper of the output from rowsimilarity which
in ur case would be the output in reuters-similarity.  That should give
the 10 most similar documents and their cosine distances from the
referenced document.

There's an error on the wiki link instructions, the seqdumper should have
been on rowsimilarity/part-r-* and not on matrix/matrix for determining
similar documents.

Hope this helps. Sorry again for the confusion.








On Friday, December 20, 2013 4:51 PM, Scott C. Cote
scottcc...@gmail.com wrote:
 
Suneel and others,

I am still getting the strange results when I do the tour. Suneel: I
manually wiped out the temp folder and also deleted the reuters-XXX
folders.  
Also, per your advice I added the -ow option to all of the commands.
NOTE: The step to create a matrix would NOT take a -ow option

I have tried again, and am still seeing references to documents that do
not exist.

The tail end of reuters-matrix/docindex looks like (mahout seqdumper -i
reuters-matrix/matrix | tail) :

INFO: Program took 1077 ms (Minutes: 0.01795)
Key: 21569: Value: /reut2-021.sgm-91.txt
Key: 21570: Value: /reut2-021.sgm-92.txt
Key: 21571: Value: /reut2-021.sgm-93.txt
Key: 21572: Value: /reut2-021.sgm-94.txt
Key: 21573: Value: /reut2-021.sgm-95.txt
Key: 21574: Value: /reut2-021.sgm-96.txt
Key: 21575: Value: /reut2-021.sgm-97.txt
Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578



And the following snippet exists inside reuters-matrix/matrix and
references key 41625 (which is larger than any key in docindex).

Key: 2: Value: 
/reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}

--- So in this email, I have listed the following pieces of information: 1. Commands, 2. Env vars, 3. Sw version info

Again, thank you in advance for your help.

Scott

INFO Below:

1. sequence of commands with relevant logged output points (omitted the
sequence dump commands):

mv reuters xreuters
rm -r temp

rm -r reuters-*
mv xreuters reuters
mvn -e -q exec:java
-Dexec.mainClass=org.apache.lucene.benchmark.utils.ExtractReuters
-Dexec.args=reuters

Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis

2013-12-20 Thread Scott C. Cote
Suneel,

I think I have it :)

Pls confirm this understanding:

I'm looking at the cdump.out that comes from clusterdump.   It has the 20
clusters, each of the top words in the cluster, and each of the vectors
that are members of the cluster.   Do I have it?  Am I getting this?

Thanks,

SCott   

On 12/20/13 6:32 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Which cdump.txt ?





On Friday, December 20, 2013 7:29 PM, Suneel Marthi
suneel_mar...@yahoo.com wrote:
 
You could use clusterdump to see the output of your clusters.

Eg: 

  $MAHOUT clusterdump \
-i ${WORK_DIR}/reuters-kmeans/clusters-*-final \
-o ${WORK_DIR}/reuters-kmeans/clusterdump \
-d ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
-dt sequencefile -b 100 -n 20 --evaluate -dm
org.apache.mahout.common.distance.CosineDistanceMeasure -sp 0 \
--pointsDir ${WORK_DIR}/reuters-kmeans/clusteredPoints \

I am assuming you had run kmeans clustering, if so the clusters wouldn't
overlap. You would see cluster overlap if u were to run fuzzy kmeans
clustering.





On Friday, December 20, 2013 7:06 PM, Scott C. Cote
scottcc...@gmail.com wrote:
 
Suneel,

Thank you for your help.  :)   Thought I was completely in the ditch.

If you are interested: inline with you comments are demonstrations that I
finally have it  (and the commands that I used)….

YAQ (Yet another question):
How do I see with the dumper the documents that belong in a given cluster?

I issued the command:  mahout seqdumper -I
reuters-kmeans-clusters/clusters-3-final/part-r-0

Which yields data like:

Input Path: part-r-0
Key class: class
 org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.iterator.ClusterWritable
Key: 0: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@28b301f2
Key: 1: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@28b301f2
Key: 2: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@28b301f2
…
Key: 19: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@193936e1
Count: 20


Was hoping to see something that associated a centroid/cluster with its
members.  
Given that there are 20 centroids, how do I break out the files into say:
20 folders - one folder per centroid so that I know their associations
(I'm assuming that the clusters don't overlap).  Or - is there a sequence
file that is generated somewhere that definitively associates the vectors
with each cluster?

Here is what I do know:
I know that the clusters are not given names and it is suggested that we
use the top terms of the cluster to define a name.

According to the tour, I should be able to see a likelihood that a given
vector is in a cluster.  But

mahout seqdumper -i reuters-kmeans-clusters/clusteredPoints/part-m-0 | more

Yields:

Input Path: part-m-0
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 10266: Value: 1.0: /reut2-000.sgm-0.txt = [62:0.085, 222:0.043,
291:0.084, 1411:0.083, 1421:0.087, 1451:0.085, 1456:0.092, 1457:0.092,
1462:0.135, 1512:0.070, 1543:0.104, 2962:0.037
….


which does NOT look like the output in the tour (did I miss something
again?).   But I'll try to interpret the output as saying vector with key
62 has a cosine distance of .085 from key 10266 - is that right?

What do I need to look at?  MiA sheds no light on this part that I have
found.  NOTE:  I wrote a very simple, non-scalable k-means Java routine
that found the clusters in a set of points (2-dimensional) and tracked
which point belongs to which cluster (no overlap).  I want to do the same
with Mahout.

Looking forward to your response to get me over this next hump ….

SCott

On 12/20/13 4:30 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Sorry Scott I should have looked at this more closely. I apologize.

1. You are doing a seqdumper of the matrix (which is generated from the
rowid job and is not the output of the rowsimilarity job).

 Rowid Job generates a MxN matrix where M - no. of documents and N -
terms associated with each document

The value of a cell in the Matrix is the tf-idf weight of the term.

 So in the following output:

 {Code}


  
Key: 2: Value: 
/reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502

unexpected results in seqdump of reuters-matrix in quick tour of text analysis

2013-12-19 Thread Scott C. Cote
All,

I am a newbie Mahout user and am trying to use the Quick tour of text
analysis using the Mahout command line .  Thank you to whomever contributed
to that page.

 https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

Went all the way from beginning to end of the page with seemingly no
hiccups.
At the very end of the tour, I became confused because the command:

 mahout seqdumper -i reuters-matrix/matrix | more

Allowed me to see output (snippet)

 Key: 1: Value: 
 /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.2279237043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.105152778531,4:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.2279237043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}

Reading through that snippet of data made me think that there exists a
document with row id 41154 and a cosine value of ~0.0658 (the last element in
the snippet).

The problem is that the folder

 /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted

Only has 21578 files in it.  Indeed, my dictionary file  (output command
used shown below)

 mahout seqdumper -i reuters-matrix/docIndex  | tail

Has a max key of

 Key: 21576: Value: /reut2-021.sgm-98.txt
 Key: 21577: Value: /reut2-021.sgm-99.txt
 Count: 21578

So I cannot find the document with key value 41154.  What does the 41154
relate to?

Obviously I have misunderstood something that I did, or need to do, in the
tour.  Can someone please shine a light on where I strayed?  I have scripted
every step that I took and can share them here if desired (I noticed that
some of the output file names changed since the page was written, so I made
adjustments).

Regards,

SCott  

PS  Thanks TD for helping me earlier




Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis

2013-12-19 Thread Scott C. Cote
Suneel,

Thank you for your help.



On 12/19/13 11:53 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

What you are seeing is the output matrix of the RowSimilarity job.  You
are right there should be 21578 documents only in the reuters corpus.

a) How many documents do you have in your docIndex?  DocIndex is one of
the artifacts of the RowIDJob and should have been executed prior to the
RowSimilarity Job. You can run seqdumper on docIndex to see the output.



mahout seqdumper -i reuters-matrix/docIndex  | tail

Has a max key of

Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578



 

b) Also what was the message at the end of the RowId job. It should read
something like 'Wrote out matrix with 21578 rows and 19515 columns to
reuters-matrix/matrix'.



Dec 18, 2013 4:01:13 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Wrote out matrix with 21578 rows and 41807 columns to
reuters-matrix/matrix
Dec 18, 2013 4:01:13 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 3453 ms (Minutes: 0.05755)







On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
scottcc...@gmail.com wrote:
 
All,

I am a newbie Mahout user and am trying to use the Quick tour of text
analysis using the Mahout command line .  Thank you to whomever
contributed
to that page.

 
https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

Went all the way from beginning to end of the page with seemingly no
hiccups.
At the very end of the tour, I became confused because the command:

 mahout seqdumper -i reuters-matrix/matrix | more

Allowed me to see output (snippet)

 Key: 1: Value: 
 
/reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.2279237043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.105152778531,4:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.2279237043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}

Reading through that snippet of data made me think that there exists a
document with row id 41154 and a cosine value of ~0.0658 (the last element in
the snippet).

The problem is that the folder

 /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted

Only has 21578 files in it.  Indeed, my dictionary file  (output command
used shown below)

 mahout seqdumper -i reuters-matrix/docIndex  | tail

Has a max key of

 Key: 21576: Value: /reut2-021.sgm-98.txt
 Key: 21577: Value: /reut2-021.sgm-99.txt
 Count: 21578

So I cannot find the document with key value 41154.  What does the 41154
relate to?

Obviously I have misunderstood something that I did, or need to do, in the
tour.  Can someone please shine a light on where I strayed?  I have scripted
every step that I took and can share them here if desired (I noticed that
some of the output file names changed since the page was written, so I made
adjustments).

Regards,

SCott  

PS  Thanks TD for helping me earlier




Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis

2013-12-19 Thread Scott C. Cote
Suneel,

I'm going to do the similarity part of the tour over - my laptop was
sleeped in the middle of the run of the rowsimilarity job.
Maybe the job is sensitive to that ….  :(  Normally - a server would not
go to sleep nor would it run
in local mode.

Sorry that I didn't think of that sooner.
Will let you know my outcome.

Am planning on redoing by deleting the contents and the folder titled
reuters-similarity

Please let me know if that is not good enough.

Thanks again.

SCott

On 12/19/13 11:53 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

What you are seeing is the output matrix of the RowSimilarity job.  You
are right there should be 21578 documents only in the reuters corpus.

a) How many documents do you have in your docIndex?  DocIndex is one of
the artifacts of the RowIDJob and should have been executed prior to the
RowSimilarity Job. You can run seqdumper on docIndex to see the output.

b) Also what was the message at the end of the RowId job. It should read
something like 'Wrote out matrix with 21578 rows and 19515 columns to
reuters-matrix/matrix'.




On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
scottcc...@gmail.com wrote:
 
All,

I am a newbie Mahout user and am trying to use the Quick tour of text
analysis using the Mahout command line .  Thank you to whomever
contributed
to that page.

 
https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

Went all the way from beginning to end of the page with seemingly no
hiccups.
At the very end of the tour, I became confused because the command:

 mahout seqdumper -i reuters-matrix/matrix | more

Allowed me to see output (snippet)

 Key: 1: Value: 
 
/reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.2279237043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.105152778531,4:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.2279237043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}

Reading through that snippet of data made me think that there exists a
document with row id 41154 and a cosine value of ~0.0658 (the last element in
the snippet).

The problem is that the folder

 /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted

Only has 21578 files in it.  Indeed, my dictionary file  (output command
used shown below)

 mahout seqdumper -i reuters-matrix/docIndex  | tail

Has a max key of

 Key: 21576: Value: /reut2-021.sgm-98.txt
 Key: 21577: Value: /reut2-021.sgm-99.txt
 Count: 21578

So I cannot find the document with key value 41154.  What does the 41154
relate to?

Obviously I have misunderstood something that I did, or need to do, in the
tour.  Can someone please shine a light on where I strayed?  I have scripted
every step that I took and can share them here if desired (I noticed that
some of the output file names changed since the page was written, so I made
adjustments).

Regards,

SCott  

PS  Thanks TD for helping me earlier




Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis

2013-12-19 Thread Scott C. Cote
I manually deleted the temp folder too (After 2 failed starts).

Would it be helpful for me to upload my shells that encapsulate all of the
commands posted on the tour?  They reflect the current state of reuters
and .8 mahout.
And if I did - how would I do it?

Thanks,

SCott

On 12/19/13 1:00 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Yep, that's what has happened in ur case. the wiki doesn't have but
please specify the -ow (overwrite) option while running the
RowsimilarityJob. That should clear up both the output and temp folders
before running the job.





On Thursday, December 19, 2013 1:50 PM, Suneel Marthi
suneel_mar...@yahoo.com wrote:
 
Haha... that could explain it, Rowsimilarityjob creates temp files during
execution. If ur laptop 'sleeped' then the temp files still persist and
running the job again wouldn't overwrite the old temp files (i need to
verify that).

It should be good enough to run the Rowsimilarity job again.







On Thursday, December 19, 2013 1:46 PM, Scott C. Cote
scottcc...@gmail.com wrote:
 
Suneel,

I'm going to do the similarity part of the tour over - my laptop was
sleeped in the middle of the run of the rowsimilarity job.
Maybe the job is sensitive to that ….  :(  Normally - a server would not
go to sleep nor would it run
in local mode.

Sorry that I didn't think of that sooner.
Will let you know my outcome.

Am planning on redoing by deleting the contents and the folder titled
reuters-similarity

Please let me know if that is not good enough.

Thanks again.

SCott


On 12/19/13 11:53 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

What you are seeing is the output matrix of the RowSimilarity job.  You
are right there should be 21578 documents only in the reuters corpus.

a) How many documents do you have in your docIndex?  DocIndex is one of
the artifacts of the RowIDJob and should have been executed prior to the
RowSimilarity Job. You can run seqdumper on docIndex to see the output.

b) Also what was the message at the end of the RowId job. It should read
something like 'Wrote out matrix with 21578 rows and 19515 columns to
reuters-matrix/matrix'.




On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
scottcc...@gmail.com wrote:
 
All,

I am a newbie Mahout user and am trying to use the Quick tour of text
analysis using the Mahout command line .  Thank you to whomever
contributed
to that page.

 
https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

Went all the way from beginning to end of the page with seemingly no hiccups.
At the very end of the tour, I became confused because the command:

 mahout seqdumper -i reuters-matrix/matrix | more

Allowed me to see output (snippet)

 Key: 1: Value: 
 
/reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.2279237043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.105152778531,4:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.2279237043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}

Reading through that snippet of data made me think that there exists a
document with row id 41154 and a cosine value of ~0.0658 (the last element in
the snippet).

The problem is that the folder

 /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted

Only has 21578 files in it.  Indeed, my dictionary file  (output command
used shown below)

 mahout seqdumper -i reuters-matrix/docIndex  | tail

Has a max key of

 Key: 21576: Value: /reut2-021.sgm-98.txt
 Key: 21577: Value:
 /reut2-021.sgm-99.txt
 Count: 21578

 So I cannot find the document with key value 41154.  What does the 41154
relate to?

Obviously I have misunderstood something that I did, or need to do, in the
tour.  Can someone please shine a light on where I strayed?  I have scripted
every step that I took and can share them here if desired (I noticed that
some of the output file names changed since the page was written, so I made
adjustments