Brian Whitman wrote:

> The Solr installations I know with many millions of docs don't have 
> hundreds of KB of text per doc. The "special" thing I'm doing is storing 
> the parse text from the nutch crawls (and other sources), which we need 
> for various reasons. We have an extraordinary amount of unique tokens, 
> which turns Solr/Lucene into a disk seek speed test. Full text search is 

This thread is already slightly off-topic ... but regarding the number
of unique terms: when I'm faced with an explosion of unique terms, due
either to the nature of the data or to the tokenization method, I use
one (or both) of the following methods where possible: splitting and
combining. An example of splitting is dates - if you index the year,
month and day in separate fields, the total number of unique terms
across those fields stays small even when the number of distinct dates
is large: ten years of daily dates is roughly 3650 unique terms in a
single field, but only about 10 + 12 + 31 terms when split. The other
method (combining) is already in use in Nutch, implemented in
CommonGrams, which joins very frequent terms with their neighbors into
single tokens.
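
To make the splitting idea concrete, here's a minimal sketch against the
plain Lucene API (the field names and the helper are just illustration;
with Solr you'd declare the equivalent fields in schema.xml instead):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public class DateFields {
    /** Index a date as three small-vocabulary fields instead of one big one. */
    public static void addDate(Document doc, String year, String month, String day) {
      doc.add(new Field("year",  year,  Field.Store.NO, Field.Index.UN_TOKENIZED));
      doc.add(new Field("month", month, Field.Store.NO, Field.Index.UN_TOKENIZED));
      doc.add(new Field("day",   day,   Field.Store.NO, Field.Index.UN_TOKENIZED));
      // a single "date" field would need one unique term per distinct date:
      // doc.add(new Field("date", year + month + day, Field.Store.NO, Field.Index.UN_TOKENIZED));
    }
  }

A query for a specific date then becomes a conjunction, e.g.
+year:2007 +month:06 +day:14, at the cost of three term lookups instead
of one.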

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

