: The Collocations would be similar to facets except I am also trying to get
: multi word phrases as well as single terms. So suppose I could write
Assuming I understand what you want, I would look into using the ShingleFilter to build up tokens consisting of N->M tokens; then you could just facet on that field to see the really common "phrases", or use the TermsComponent to get them as well.

: The highly unusual phrases on the other hand requires getting a handle on
: the IDF which at present only appears to be available via the explain
: function of debugging.

As I mentioned, you can use the TermsComponent to get terms and their document counts. It has a terms.maxcount param, so you can use that to limit the output to only terms that appear in no more than X documents.

That said: these are possible ways of solving these types of problems using Solr, which can be handy if you are building a Solr index for other things in general -- but if you are just trying to do a one-time analysis of a large corpus of data (or even a many-time analysis of a corpus that changes very frequently) w/o needing any of Solr's other features, then you may find that you can accomplish this type of task much more simply (and probably faster) with some simple map/reduce jobs in Hadoop.

-Hoss
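
For what it's worth, here is a rough sketch of the shape such a map/reduce job might take. It is plain in-memory Python, not Hadoop and not Solr: the "map" step emits every 1-to-3 word phrase once per document, the "reduce" step sums those into document counts, and a final filter plays the role of a maxcount cutoff. The function names, the whitespace tokenizer, and the 1..3 phrase length are all assumptions made for illustration.

```python
from collections import Counter
from itertools import chain


def shingles(text, min_n=1, max_n=3):
    """Yield every run of min_n..max_n consecutive words as one string."""
    words = text.lower().split()  # naive tokenizer, an assumption
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def map_doc(text, min_n=1, max_n=3):
    """Map step: emit each distinct phrase once per document."""
    return set(shingles(text, min_n, max_n))


def reduce_counts(mapped_docs):
    """Reduce step: sum the per-document emissions into document counts."""
    return Counter(chain.from_iterable(mapped_docs))


def rare_phrases(doc_freq, max_count):
    """Keep only phrases that appear in no more than max_count documents."""
    return sorted(p for p, c in doc_freq.items() if c <= max_count)
```

Usage would look something like `doc_freq = reduce_counts(map_doc(d) for d in docs)`, after which `doc_freq.most_common(20)` gives the common phrases and `rare_phrases(doc_freq, 1)` the ones that occur in a single document. In a real Hadoop job the map and reduce functions would run on separate workers, with the framework grouping the emitted phrases between the two steps.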