Well, there is a PLSI implementation using Pig (over Hadoop), available as a Mahout patch: https://issues.apache.org/jira/browse/MAHOUT-106
-P

On Wed, Jun 17, 2009 at 7:34 AM, Paul Jones <[email protected]> wrote:

> Hi to one and all
>
> First time on this list, have read through the wiki, FAQ and other docs,
> but before I dived further into Mahout I had a few questions, or should I
> say clarifications.
> I am looking for a system which would allow me to:
>
> 1. Take a set of words
> 2. Build clusters of these words, i.e. work out the semantic relationships
> between these words (I guess I could use WordNet as a starter), i.e.
> inter-relationships
> 3. Once clusters of words have been formed, also work out the relationships
> between the clusters themselves.
>
> So in essence I could work out that red was similar to crimson, and hence
> a search on red would produce docs with crimson in them even though red was
> not mentioned.
>
> Would Mahout work here?
>
> Of course prior to this, there is the problem of cleaning up the data, i.e.
> stemming etc.
>
> Now I have read several detailed papers on clustering, ranking, etc., and of
> course some algos are better than others, but to me a platform like Mahout
> seems interesting since you can deploy the existing ones in the system, and
> also later on add others.
>
> Looking at the algorithms, it seems as if LSI (PLSI) has not been
> implemented as yet; if so, which other algo would "suffice" in this case?
> Admittedly my knowledge of algos is poor to say the least :-). Also, where
> would Lucene fit in (if it does)? Would it be used to search the results
> after the algos had been applied, since it seems as if Lucene just uses a
> weighting system to create the index, or can Mahout do it all?
>
> As you can see, I am confused, but this is my first pass at this system.
>
> tks
>
> Paul
>
> P.S. Are any of the algos feedback algos, i.e. so that someone could
> improve results using user feedback?
