: The Collocations would be similar to facets except I am also trying to get
: multi word phrases as well as single terms. So suppose I could write

Assuming I understand what you want, I would look into using the 
ShingleFilter to build up tokens consisting of N->M terms; then you could 
just facet on that field to see the really common "phrases", or use the 
TermsComponent to get them as well...
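Something like this in schema.xml, for instance (the field type name and 
shingle sizes here are just placeholders; adjust them to your setup):

```xml
<!-- Sketch of a shingled field type: emits 2- and 3-word shingles
     alongside the single terms, so faceting on the field surfaces
     common multi-word phrases as well as common words. -->
<fieldType name="shingled_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="3"
            outputUnigrams="true"/>
  </analyzer>
</fieldType>
```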

: The highly unusual phrases, on the other hand, require getting a handle on
: the IDF, which at present only appears to be available via the explain
: function of debugging. 

...as I mentioned, you can use the TermsComponent to get terms and their 
document counts ... it has a terms.maxcount param, so you can use that to 
limit the output to only terms that appear in no more than X documents.
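For example, a request like the following (the Solr URL and field name are 
just placeholders for whatever your setup uses):

```shell
# Hypothetical Solr core URL and field name -- adjust to your schema.
SOLR_URL="http://localhost:8983/solr/collection1"
FIELD="shingled_text"

# TermsComponent request: terms.fl picks the field, terms.maxcount=5
# limits the output to terms appearing in at most 5 documents, and
# terms.limit caps how many terms come back.
REQUEST="${SOLR_URL}/terms?terms.fl=${FIELD}&terms.maxcount=5&terms.limit=25"
echo "${REQUEST}"   # fetch this with curl/wget against a running Solr
```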

That said: these are possible ways of solving these types of problems 
using Solr, which can be handy if you are building a Solr index for other 
purposes anyway -- but if you are just trying to do a one-time analysis of 
a large corpus of data (or even a many-time analysis of a corpus that 
changes very frequently) without needing any of Solr's other features, you 
may find that you can accomplish this type of task much more simply (and 
probably faster) with some simple map/reduce jobs in Hadoop.
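The core of such a job is tiny -- here is a single-machine Python sketch of 
the same idea (the sample documents are made up for illustration): the map 
step would emit (bigram, 1) pairs per document and the reduce step would 
sum them; a Counter plays both roles here.

```python
from collections import Counter

# Toy corpus standing in for your real documents.
docs = [
    "the quick brown fox",
    "the quick red fox",
    "a quick brown dog",
]

counts = Counter()
for doc in docs:
    words = doc.split()
    # Adjacent word pairs, i.e. 2-word "shingles".
    counts.update(zip(words, words[1:]))

# The most common pairs are your candidate collocations;
# the rarest are your "highly unusual" phrases.
print(counts.most_common(2))
```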


-Hoss
