Brian Whitman wrote:
> The Solr installations I know with many millions of docs don't have
> hundreds of KB of text per doc. The "special" thing I'm doing is storing
> the parse text from the nutch crawls (and other sources), which we need
> for various reasons. We have an extraordinary amount of unique tokens,
> which turns Solr/Lucene into a disk seek speed test. Full text search is
This thread is already slightly off-topic ... but regarding the number of unique terms: when I'm faced with an explosion of unique terms due to the nature of the data or the tokenization method, I use one (or both) of the following methods where possible: splitting and combining.

An example of splitting would be dates: if you split year, month and day into separate fields, then even if you have to index many unique dates, the total number of unique terms across these fields will be much smaller than if the dates (at this resolution) were stored in a single field. (See the first sketch below.)

The other method (combining) is already in use in Nutch, where it is implemented in CommonGrams: very frequent terms are combined with their neighbours into n-grams, so phrase queries involving those terms don't have to scan their very long posting lists. (See the second sketch below.)
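To illustrate the splitting idea (this is a minimal sketch, not code from this thread), here is how a date could be indexed as three small fields with a recent Lucene API; the field names "year"/"month"/"day" and the helper class are hypothetical. With a single date field, every distinct date is its own unique term; split across three fields, the same data produces at most (number of distinct years + 12 + 31) unique terms.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;

    // Hypothetical helper: index a date as three small fields instead of one.
    // A single "date" field with day resolution yields one unique term per
    // distinct date; the split fields yield at most (#years + 12 + 31) terms.
    public class DateFieldSplitter {
        public static void addDate(Document doc, int year, int month, int day) {
            doc.add(new StringField("year",  Integer.toString(year),  Store.NO));
            doc.add(new StringField("month", Integer.toString(month), Store.NO));
            doc.add(new StringField("day",   Integer.toString(day),   Store.NO));
        }
    }

To match documents from a specific date you would then query all three fields together (e.g. a BooleanQuery with required clauses on year, month and day), trading a slightly more complex query for a much smaller term dictionary.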
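For the combining side, Nutch's CommonGrams does this at analysis time; the sketch below uses the CommonGramsFilter that later ended up in Lucene/Solr's analyzers-common module, assuming a recent Lucene version (exact package names have moved between releases) and a made-up common-word list.

    import java.util.Arrays;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // Sketch of "combining": very frequent words are glued to their neighbours
    // ("the_quick", "quick_brown"), so phrase queries containing them can read
    // the short bigram postings instead of the huge postings of the common word.
    public class CommonGramsAnalyzer extends Analyzer {
        // Toy list of frequent words; in practice this is derived from the corpus.
        private static final CharArraySet COMMON_WORDS =
            new CharArraySet(Arrays.asList("the", "a", "an", "of", "to"), true);

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream sink = new CommonGramsFilter(source, COMMON_WORDS);
            return new TokenStreamComponents(source, sink);
        }
    }

At query time the matching CommonGramsQueryFilter is normally applied so that phrase queries actually search the bigrams; the trade-off is a somewhat larger term dictionary in exchange for much shorter posting lists on the hottest terms.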
-- 
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com