Carrot 2 (was: Re: clustering results)

Otis Gospodnetic Sun, 11 Apr 2004 12:15:22 -0700

Carrot (2):

  http://www.cs.put.poznan.pl/dweiss/carrot/xml/index.xml?lang=en


Otis

--- [EMAIL PROTECTED] wrote:
> I got all excited reading the subject line "clustering results" but
> this isn't 
> really clustering is it?  This is more sorting.  Does anyone know of
> any work 
> within Lucene (or another indexer) to do actual subject clustering
> (i.e. like 
> Vivisimo @ http://vivisimo.com/ or Kartoo @ http://www.kartoo.com/)? 
> It would 
> be pretty awesome if Lucene had such ability, I know there aren't a
> whole lot 
> of clustering options, and the commercial products are very
> expensive.  
> Anyhow, just curious.
> 
> A brief definition of clustering: automatically organizing search or
> database 
> query results into meaningful hierarchical folders ... transforming
> long lists 
> of search results into categorized information without any clumsy
> pre-
> processing of the source documents.
> 
> I'm not sure how it would be done...?  Based off of top Term
> Frequencies for a 
> document?
> 
> -K
> 
> Quoting "Michael A. Schoen" <[EMAIL PROTECTED]>:
> 
> > So as Venu pointed out, sorting doesn't seem to help the problem.
> If we have
> > to walk the result set, access docs and dedupe using brute force,
> we're
> > better off w/ the standard order by relevance.
> > 
> > If you've got an example of this type of clustering done in a more
> efficient
> > way, that'd be great.
> > 
> > Any other ideas?
> > 
> > 
> > ----- Original Message ----- 
> > From: "Erik Hatcher" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Saturday, April 10, 2004 12:35 AM
> > Subject: Re: clustering results
> > 
> > 
> > > On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
> > > > I have an index of urls, and need to display the top 10 results
> for a
> > > > given query, but want to display only 1 result per domain. It
> seems
> > > > that using either Hits or a HitCollector, I'll need to access
> the doc,
> > > > grab the domain field (I'll have it parse ahead of time) and
> only
> > > > take/display documents that are unique.
> > > >
> > > > A significant percentage of the time I expect I may have to
> access
> > > > thousands of results before I find 10 in unique domains. Is
> there a
> > > > faster approach that won't require accessing thousands of
> documents?
> > >
> > > I have examples of this that I can post when I have more time,
> but a
> > > quick pointer... check out the overloaded IndexSearcher.search()
> > > methods which accept a Sort.  You can do really really
> interesting
> > > slicing and dicing, I think, using it.  Try this one on for size:
> > >
> > >      example.displayHits(allBooks,
> > >          new Sort(new SortField[]{
> > >            new SortField("category"),
> > >            SortField.FIELD_SCORE,
> > >            new SortField("pubmonth", SortField.INT, true)
> > >          }));
> > >
> > > Be clever indexing the piece you want to group on - I think you
> may
> > > find this the solution you're looking for.
> > >
> > > Erik
> > >
> > >
> > >
> ---------------------------------------------------------------------
> > > To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> > > For additional commands, e-mail:
> [EMAIL PROTECTED]
> > >
> > 
> > 
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail:
> [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Carrot 2 (was: Re: clustering results)

Reply via email to