Hoss,

 

Thanks for your reply.

 

As you pointed out, the TermsComponent along with the terms.maxcount
parameter did the trick for single terms.

 

And ShingleFilter did the trick for phrases.
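
In case it is useful to anyone searching the archives, here is roughly the
field type involved (the type name and filter settings here are a sketch,
not my exact config):

  <fieldType name="shingles" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
              outputUnigrams="true"/>
    </analyzer>
  </fieldType>

maxShingleSize controls the longest phrase produced, and
outputUnigrams="true" keeps the single terms in the same field.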

 

I have not ventured into Hadoop just yet - could you point me to any
examples of simple map/reduce jobs?

 

Most grateful,

 

Christopher

 

---------------------

 

Subject: Re: Using IDF to find Collactions and SIPs . . ?
<http://www.lucidimagination.com/search/document/34c21176f2004d70/using_idf_to_find_collactions_and_sips>

From: Chris Hostetter <hossman_luc...@...>

Date: 2009-12-31 22:55

: The Collocations would be similar to facets except I am also trying to get
: multi word phrases as well as single terms. So suppose I could write
 
Assuming I understand what you want, I would look into using the 
ShingleFilter to build up tokens consisting of N->M consecutive terms; 
then you could just facet on that field to see the really common 
"phrases" or use the TermsComponent to get them as well...
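
For example, such a faceting request might look like this (untested
sketch -- the core URL and the shingled field name are placeholders):

  http://localhost:8983/solr/select?q=*:*&rows=0
    &facet=true&facet.field=text_shingles
    &facet.limit=50&facet.mincount=5

Facet counts are sorted by count by default, so the most common shingles
come back first.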
 
: The highly unusual phrases on the other hand requires getting a handle on
: the IDF which at present only appears to be available via the explain
: function of debugging. 
 
...as I mentioned, you can use the TermsComponent to get terms and their 
document count ... it has a terms.maxcount param, so you can use that to 
limit the output to only terms that appear in no more than X documents.
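
Something like this, assuming the example /terms request handler from
solrconfig.xml and the same placeholder field name as above:

  http://localhost:8983/solr/terms?terms=true&terms.fl=text_shingles
    &terms.maxcount=5&terms.limit=100

That returns only terms (or shingles) appearing in at most 5 documents,
which is where the unusual phrases live.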
 
That said: these are possible ways of solving these types of problems 
using Solr, which can be handy if you are building a Solr index for other 
purposes anyway -- but if you are just trying to do a one-time analysis of 
a large corpus of data (or even a repeated analysis of a corpus that 
changes very frequently) without needing any of Solr's other features, 
then you may find you can accomplish this type of task much more simply 
(and probably faster) with some simple map/reduce jobs in Hadoop.
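
(If it helps, the canonical starting point is a word-count style job -- a
minimal sketch against Hadoop's newer MapReduce API; class names are made
up, and you would swap the tokenizer for a shingle producer to count
phrases:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TermCount {

  // Mapper: emit (term, 1) for every whitespace-separated token.
  public static class TokenMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text term = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tok = new StringTokenizer(value.toString());
      while (tok.hasMoreTokens()) {
        term.set(tok.nextToken().toLowerCase());
        context.write(term, ONE);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts per term.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "term count");
    job.setJarByClass(TermCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with input and output directories as arguments. Note this counts
total occurrences; to get document frequencies like the TermsComponent
reports, the mapper would emit each distinct term once per document.)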
 
 
-Hoss

 
