From my past projects, our Lucene classification corpus looked like this:

0|document text...|categoryA
1|document text...|categoryB
2|document text...|categoryA
3|document text...|categoryA
...
800|document text...|categoryC

With Solr's faceting capabilities, it is now possible to design more
dimensions of categories/taxonomies in a corpus with minimal impact (?)
on computation time! Synonyms can also be set up in the Solr
configuration.

I like the idea of using Solr!

On Wed, Jan 28, 2009 at 7:57 AM, Neal Richter <nrich...@gmail.com> wrote:
> On Tue, Jan 27, 2009 at 2:21 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> > One of the things I am interested in is the marriage of Solr and Mahout
> > (which has some Genetic Algorithms support) and other ML (Weka, etc.) tools.
> [snip]
>
> I love it, good to know you are thinking big here. Here's another big
> thought:
> http://www.eml-r.org/nlp/papers/ponzetto07b.pdf .. but assume we want
> to extract this type of structure from the full text of Wikipedia
> rather than the narrow categories DB.
>
> > Things that can help with all this: LukeReqHandler, TermVectorComponent,
> > TermsComponent, others
> >
> > [snip]
>
> > Neal, what did you have in mind for a JIRA issue? I'd love to see a patch.
>
> More research needed, but the initial idea would be to enable the
> passing in of a weighted term vector as a query and allowing a
> more-like-this type search on it. Anyone attempt this yet?
>
> Interesting point about faceting here is that it would give outgoing
> feedback on what /new/ words (not in initial query) that if added to
> the query would result in additional discrimination between the
> matched categories.
>
> So Solr outputs a set of categories for a document, and also emits a
> set of related words to the initial query! Categorization and
> recommendation in one.
>
> - Neal
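
P.S. To make the id|text|category layout at the top of this message concrete, here is a minimal Python sketch that reads such records and tallies documents per category, which is roughly the kind of count a facet on a category field would report. The function names and sample rows are hypothetical and purely illustrative; this is not the code from our project, and it does not call Solr or Lucene.

```python
from collections import Counter


def parse_corpus_line(line):
    """Split one 'id|text|category' record into (id, text, category).

    Assumption: the id is the first field and the category is the last,
    so any '|' inside the document text is preserved.
    """
    doc_id, rest = line.rstrip("\n").split("|", 1)
    text, category = rest.rsplit("|", 1)
    return int(doc_id), text, category


def category_counts(lines):
    """Count documents per category, skipping blank lines."""
    return Counter(parse_corpus_line(l)[2] for l in lines if l.strip())


# Hypothetical sample rows in the same layout as the corpus above.
sample = [
    "0|first document text|categoryA",
    "1|second document text|categoryB",
    "2|third document | with a pipe|categoryA",
]

counts = category_counts(sample)
for category, n in sorted(counts.items()):
    print(category, n)
```

Adding a second taxonomy would just mean one more trailing field per record and one more counter, which is why extra facet dimensions looked cheap to us.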