From my past projects, our Lucene classification corpus looked like this:

0|document text...|categoryA
1|document text...|categoryB
2|document text...|categoryA
3|document text...|categoryA
...
800|document text...|categoryC
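
For illustration only, a corpus in this pipe-delimited layout could be read roughly as follows. The three fields (id, text, category) come from the sample above; the function name `parse_corpus_line`, the sample rows, and the choice to split on the first and last pipe are assumptions, not what our project actually used.

```python
from collections import Counter


def parse_corpus_line(line):
    """Split one 'id|text|category' line into (int id, text, category)."""
    # Assumption: the id ends at the first pipe and the category starts
    # after the last pipe, so pipes inside the document text are kept.
    doc_id, rest = line.rstrip("\n").split("|", 1)
    text, category = rest.rsplit("|", 1)
    return int(doc_id), text, category


# Hypothetical sample rows in the same layout as the corpus above.
sample = [
    "0|first document text|categoryA",
    "1|second document text|categoryB",
    "2|third | with a pipe|categoryA",
]

docs = [parse_corpus_line(line) for line in sample]

# Count documents per category, e.g. to check class balance.
counts = Counter(category for _, _, category in docs)
```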

With Solr's faceting capabilities it is now possible to add more
dimensions of categories/taxonomies to a corpus with minimal impact (?) on
computation time! On top of that, synonyms can be set up in the Solr
configuration.
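
As a minimal sketch of what faceting over several category dimensions might look like, the snippet below builds a Solr select URL with one `facet.field` per dimension. The `q`, `rows`, `facet` and `facet.field` parameters are standard Solr request parameters; the host, path, and the field names `category` and `topic` are hypothetical and would depend on the schema.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_fields):
    """Build a Solr select URL that facets on each of the given fields."""
    params = [
        ("q", query),
        ("rows", 0),        # only facet counts are wanted, no documents
        ("facet", "true"),
    ]
    # One facet.field parameter per category/taxonomy dimension.
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and field names.
url = facet_query_url(
    "http://localhost:8983/solr/select",
    "document text",
    ["category", "topic"],
)
```

Each extra dimension is just one more `facet.field` parameter on the same request.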

I like the idea of using Solr!

On Wed, Jan 28, 2009 at 7:57 AM, Neal Richter <nrich...@gmail.com> wrote:

> On Tue, Jan 27, 2009 at 2:21 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> > One of the things I am interested in is the marriage of Solr and Mahout
> > (which has some Genetic Algorithms support) and other ML (Weka, etc.)
> > tools.
>  [snip]
>
> I love it, good to know you are thinking big here.  Here's another big
> thought:
> http://www.eml-r.org/nlp/papers/ponzetto07b.pdf .. but assume we want
> to extract this type of structure from the full text of Wikipedia
> rather than the narrow categories DB.
>
> > Things that can help with all this:  LukeReqHandler, TermVectorComponent,
> > TermsComponent, others
> >
>
> [snip]
>
> > Neal, what did you have in mind for a JIRA issue?  I'd love to see a
> > patch.
>
> More research needed, but the initial idea would be to enable the
> passing in of a weighted term vector as a query and allowing a
> more-like-this type search on it.  Anyone attempt this yet?
>
> An interesting point about faceting here is that it would give feedback
> on which /new/ words (not in the initial query) would, if added to the
> query, result in additional discrimination between the matched
> categories.
>
> So Solr outputs a set of categories for a document, and also emits a
> set of related words to the initial query!  Categorization and
> recommendation in one.
>
> - Neal
>
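
One rough way to approximate the weighted-term-vector query idea quoted above, without any new Solr component, is to turn the vector into a boosted query string using the standard Lucene `term^boost` syntax. This is only a sketch of that workaround, not the more-like-this feature Neal describes; the function name `boosted_query`, the `top_n` cutoff, and the example weights are hypothetical, and terms are assumed to be plain words that need no query-syntax escaping.

```python
def boosted_query(term_weights, top_n=10):
    """Turn a {term: weight} vector into a Lucene 'term^boost' query string."""
    # Keep only the top_n heaviest terms so the query stays short.
    top = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    # Assumption: terms contain no characters special to the query parser.
    return " ".join(f"{term}^{weight:.2f}" for term, weight in top)


# Hypothetical weights, e.g. TF-IDF scores taken from a document.
q = boosted_query({"lucene": 2.5, "solr": 1.75, "facet": 0.5}, top_n=2)
```

The resulting string could be sent as the `q` parameter, with faceting on the category field returning the candidate categories.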
