Carrot (2): http://www.cs.put.poznan.pl/dweiss/carrot/xml/index.xml?lang=en
Otis

--- [EMAIL PROTECTED] wrote:
> I got all excited reading the subject line "clustering results" but
> this isn't really clustering is it? This is more sorting. Does anyone
> know of any work within Lucene (or another indexer) to do actual
> subject clustering (i.e. like Vivisimo @ http://vivisimo.com/ or
> Kartoo @ http://www.kartoo.com/)? It would be pretty awesome if Lucene
> had such ability, I know there aren't a whole lot of clustering
> options, and the commercial products are very expensive.
> Anyhow, just curious.
>
> A brief definition of clustering: automatically organizing search or
> database query results into meaningful hierarchical folders ...
> transforming long lists of search results into categorized information
> without any clumsy pre-processing of the source documents.
>
> I'm not sure how it would be done...? Based off of top Term
> Frequencies for a document?
>
> -K
>
> Quoting "Michael A. Schoen" <[EMAIL PROTECTED]>:
>
> > So as Venu pointed out, sorting doesn't seem to help the problem.
> > If we have to walk the result set, access docs and dedupe using
> > brute force, we're better off w/ the standard order by relevance.
> >
> > If you've got an example of this type of clustering done in a more
> > efficient way, that'd be great.
> >
> > Any other ideas?
> >
> >
> > ----- Original Message -----
> > From: "Erik Hatcher" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Saturday, April 10, 2004 12:35 AM
> > Subject: Re: clustering results
> >
> >
> > > On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
> > > > I have an index of urls, and need to display the top 10 results
> > > > for a given query, but want to display only 1 result per domain.
> > > > It seems that using either Hits or a HitCollector, I'll need to
> > > > access the doc, grab the domain field (I'll have it parse ahead
> > > > of time) and only take/display documents that are unique.
> > > >
> > > > A significant percentage of the time I expect I may have to
> > > > access thousands of results before I find 10 in unique domains.
> > > > Is there a faster approach that won't require accessing
> > > > thousands of documents?
> > >
> > > I have examples of this that I can post when I have more time,
> > > but a quick pointer... check out the overloaded
> > > IndexSearcher.search() methods which accept a Sort. You can do
> > > really really interesting slicing and dicing, I think, using it.
> > > Try this one on for size:
> > >
> > >     example.displayHits(allBooks,
> > >         new Sort(new SortField[]{
> > >             new SortField("category"),
> > >             SortField.FIELD_SCORE,
> > >             new SortField("pubmonth", SortField.INT, true)
> > >         }));
> > >
> > > Be clever indexing the piece you want to group on - I think you
> > > may find this the solution you're looking for.
> > >
> > > Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
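For reference, the brute-force "one result per domain" walk that Michael describes can be sketched roughly as below. This is only an illustration in plain Java, not Lucene code: the Hit class, its fields, and the topUniqueByDomain method are hypothetical names made up for the example, and it assumes the hits arrive already ordered by relevance with the domain parsed out at index time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: keep the first (best-ranked) hit per domain from a ranked list. */
final class DomainDedupe {

    /** Hypothetical stand-in for a search hit; this is not a Lucene class. */
    static final class Hit {
        final String url;
        final String domain; // assumed to be parsed and stored at index time
        final float score;

        Hit(String url, String domain, float score) {
            this.url = url;
            this.domain = domain;
            this.score = score;
        }
    }

    /**
     * Walks the hits in the order given (assumed: descending relevance) and
     * keeps the first hit seen for each domain, stopping once `limit` hits
     * have been collected.
     */
    static List<Hit> topUniqueByDomain(List<Hit> rankedHits, int limit) {
        Set<String> seenDomains = new HashSet<>();
        List<Hit> unique = new ArrayList<>();
        for (Hit hit : rankedHits) {
            if (unique.size() >= limit) {
                break; // enough distinct domains found; stop walking early
            }
            // Set.add returns false when the domain was already seen
            if (seenDomains.add(hit.domain)) {
                unique.add(hit);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Hit> ranked = Arrays.asList(
            new Hit("http://a.example/1", "a.example", 0.9f),
            new Hit("http://a.example/2", "a.example", 0.8f),
            new Hit("http://b.example/1", "b.example", 0.7f),
            new Hit("http://c.example/1", "c.example", 0.6f));
        for (Hit hit : topUniqueByDomain(ranked, 2)) {
            System.out.println(hit.url);
        }
    }
}
```

The early exit keeps the walk as short as the data allows, but in the worst case (many hits from few domains) it still touches every hit, which is the cost Michael is worried about.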