Yup, I see that WordNet has also been "ported" to a Lucene index, so pulling the hyponyms works great.
tks
Paul

________________________________
From: Tommy Chheng <[email protected]>
To: [email protected]
Sent: Tuesday, 23 June, 2009 23:19:25
Subject: Re: mahout PLSI (with some lucene, thrown in)

Have you looked at WordNet to get the hyponyms?

Tommy

On Jun 23, 2009, at 3:09 PM, Paul Jones wrote:

> Okay, have seen the difficulty (apart from the maths :-)).
>
> I guess "similar" can mean many things, i.e. hyponyms, but also words such as
> hot...cold are also "related", hence to solve my little problem I am
> wondering if there is an easier way, i.e. to use things like existing hyponym
> relations (WordNet and the like), and/or if they do not exist then I
> guess using something similar to a "google distance measure" may help in
> "adding" new words to the system....
>
> Paul
>
> ________________________________
> From: Ted Dunning <[email protected]>
> To: [email protected]
> Sent: Tuesday, 23 June, 2009 18:00:12
> Subject: Re: mahout PLSI (with some lucene, thrown in)
>
> Yes. This can be done. It isn't necessarily real simple to do.
>
> See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.7275 for an
> old (but still pretty good) example.
>
> On Tue, Jun 23, 2009 at 6:45 AM, Paul Jones <[email protected]> wrote:
>
>> Imagine we have crawled 100K webpages, and we have 100 pages which show
>> "red" and 100 which show "crimson" and then 100 which show both "red and
>> crimson". Is there a way to deduce that there may be an (albeit weak)
>> relationship between red AND crimson? Of course we can pre-seed this info,
>> which then gets weighted by actual results.
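
For what it's worth, the red/crimson numbers from the original question can be plugged into two simple co-occurrence scores. This is only a rough sketch, not what Mahout or PLSI does and not the method in the paper Ted linked: it assumes plain document counts, uses pointwise mutual information (PMI) and the normalized Google distance formula as stand-ins for the "google distance measure" idea, and the function names are made up for illustration.

```python
import math


def pmi(n_docs, df_x, df_y, df_xy):
    """Pointwise mutual information (base 2) from document frequencies.

    Positive values mean the two terms co-occur more often than they
    would if they were independent; 0 means no association.
    """
    if df_xy == 0:
        return float("-inf")  # the terms never co-occur
    p_x = df_x / n_docs
    p_y = df_y / n_docs
    p_xy = df_xy / n_docs
    return math.log2(p_xy / (p_x * p_y))


def normalized_google_distance(n_docs, df_x, df_y, df_xy):
    """Normalized Google distance from document frequencies.

    Values near 0 mean the terms nearly always appear together;
    larger values mean they are less related.
    """
    if df_xy == 0:
        return float("inf")  # the terms never co-occur
    log_x, log_y, log_xy = math.log(df_x), math.log(df_y), math.log(df_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(n_docs) - min(log_x, log_y))


# Counts from the question: 100K pages, 100 with only "red",
# 100 with only "crimson", and 100 with both.
n_docs = 100_000
red_only, crimson_only, both = 100, 100, 100
df_red = red_only + both          # pages containing "red" at all
df_crimson = crimson_only + both  # pages containing "crimson" at all

print(f"PMI: {pmi(n_docs, df_red, df_crimson, both):.2f}")  # PMI: 7.97
print(f"NGD: {normalized_google_distance(n_docs, df_red, df_crimson, both):.2f}")  # NGD: 0.11
```

With these counts the two words co-occur about 250 times more often than chance would predict, so the relationship shows up quite clearly in the statistics. Pre-seeded relations (e.g. from WordNet) could then be weighted by scores like these, though how to combine them is a separate design decision.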
