Hoss,
Thanks for your reply. As you pointed out, the TermsComponent along with
terms.maxcount did the trick for single terms, and ShingleFilter did the
trick for phrases. I have not ventured into Hadoop just yet - any examples
you could point me to of simple map/reduce jobs?

Most grateful,
Christopher

---------------------
Subject: Re: Using IDF to find Collocations and SIPs . . ?
<http://www.lucidimagination.com/search/document/34c21176f2004d70/using_idf_to_find_collactions_and_sips>
From: Chris Hostetter <hossman_luc...@...>
Date: 2009-12-31 22:55

: The Collocations would be similar to facets except I am also trying to get
: multi word phrases as well as single terms. So suppose I could write

Assuming I understand what you want, I would look into using the
ShingleFilter to build up tokens consisting of N->M tokens; then you could
just facet on that field to see the really common "phrases", or use the
TermsComponent to get them as well...

: The highly unusual phrases, on the other hand, require getting a handle on
: the IDF, which at present only appears to be available via the explain
: function of debugging.

...as I mentioned, you can use the TermsComponent to get terms and their
document counts ... it has a terms.maxcount param, so you can use that to
limit the output to only terms that appear in no more than X documents.

That said: these are possible ways of solving these types of problems using
Solr, which can be handy if you are running Solr for other purposes anyway -
but if you are just trying to do a one-time analysis of a large corpus of
data (or even a many-time analysis of a corpus that changes very frequently)
without needing any of Solr's other features, then you may find you can
accomplish this type of task much more simply (and probably faster) with
some simple map/reduce jobs in Hadoop.

-Hoss
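
For reference, a minimal sketch of the ShingleFilter approach Hoss
describes: tokenize, then emit 2- and 3-word shingles that you would index
into a separate field and facet on. This assumes a reasonably recent Lucene
(package names and constructors differed in the 3.x releases of this
thread's era), and the field/sample text are illustrative only; in Solr you
would normally wire the same thing up with solr.ShingleFilterFactory in
schema.xml rather than in Java.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShingleDemo {
    public static void main(String[] args) throws IOException {
        // Tokenize on whitespace, then build 2- and 3-word shingles.
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("using idf to find collocations"));

        // min shingle size 2, max shingle size 3 (the N->M above)
        ShingleFilter shingles = new ShingleFilter(tokenizer, 2, 3);
        shingles.setOutputUnigrams(false); // phrases only; single terms
                                           // are covered by the plain field

        CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
        shingles.reset();
        while (shingles.incrementToken()) {
            System.out.println(term.toString());
        }
        shingles.end();
        shingles.close();
    }
}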
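
And a sketch of pulling the rare terms back out through the TermsComponent
with SolrJ, equivalent to hitting a /terms handler with terms.fl,
terms.maxcount, and terms.limit. The core URL, field name, and cutoff
values are placeholders, and the client class assumes a modern SolrJ (the
2009-era equivalent was CommonsHttpSolrServer).

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class RareTerms {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build();

        SolrQuery q = new SolrQuery();
        q.setRequestHandler("/terms");  // a handler with TermsComponent enabled
        q.setTerms(true);
        q.addTermsField("shingles");    // the shingled field from the sketch above
        q.setTermsMaxCount(5);          // terms.maxcount: only terms in <= 5 docs
        q.setTermsLimit(100);           // at most 100 such terms back

        QueryResponse rsp = solr.query(q);
        TermsResponse terms = rsp.getTermsResponse();
        for (TermsResponse.Term t : terms.getTerms("shingles")) {
            System.out.println(t.getTerm() + " -> " + t.getFrequency());
        }
        solr.close();
    }
}

Dropping terms.maxcount and sorting by count instead gives you the other
end of the scale, i.e. the most common shingles for the collocations case.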
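
As for simple map/reduce examples: the canonical starting point is word
count, and a small variation of it computes document frequency (the DF
behind IDF) by emitting each distinct term once per document. A sketch,
assuming one document per input line and the newer
org.apache.hadoop.mapreduce API (2009-era Hadoop used
org.apache.hadoop.mapred); the class names here are made up for
illustration.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts how many documents each term appears in (document frequency),
 *  assuming one document per input line. A close cousin of word count. */
public class DocFreq {

    public static class DFMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text term = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Emit each distinct term once per document, not once per occurrence.
            Set<String> seen = new HashSet<>();
            for (String tok : value.toString().toLowerCase().split("\\W+")) {
                if (!tok.isEmpty() && seen.add(tok)) {
                    term.set(tok);
                    ctx.write(term, ONE);
                }
            }
        }
    }

    public static class DFReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int df = 0;
            for (IntWritable c : counts) df += c.get();
            ctx.write(key, new IntWritable(df));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "doc-freq");
        job.setJarByClass(DocFreq.class);
        job.setMapperClass(DFMapper.class);
        job.setCombinerClass(DFReducer.class); // summing is safe to combine
        job.setReducerClass(DFReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run the same job over shingled input instead of single tokens and the
low-DF output lines are your SIP candidates, the high-DF ones your
collocation candidates.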