: The corpus is the English Wikipedia, and I indexed the title and body of
: the articles. I used a list of 525 stop words.
:
: With stopwords removed the index is 227MB.
: With stopwords kept the index is 331MB.
That doesn't seem horribly surprising. Consider that for every Term in the index, Lucene keeps track of the list of <docId, freq> pairs for every document that contains that term.

Assume that a word has to appear in at least 25% of the docs before you decide it's worth making it a stop word. Your URL indicates you are dealing with 400k docs, which means that for each stop word, the space needed to store the int pairs for <docId, freq> is...

   (4B + 4B) * 100,000 =~ 780KB (per stop word Term, minimum)

...not counting any indexing structures that may be used internally to speed up the lookup of a Term. Given that some of those 525 words appear in more (or less) than 25% of your documents, that could easily account for a difference of 100MB.

I suspect an interesting exercise would be to use some of the code I've seen tossed around on this list that lets you iterate over all Terms and find the most common ones, to help you determine your stop word list programmatically. Then remove/reindex the documents that contain each word as you add it to your stop list (one word at a time) and watch your index shrink.

-Hoss
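P.S. Something along these lines (untested, typed from memory, and assuming the plain IndexReader.terms()/TermEnum API; the class name and 25% cutoff are just placeholders) should dump every term whose docFreq crosses that 25% line, which makes a decent starting point for a stop word list:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class CommonTerms {
  public static void main(String[] args) throws Exception {
    // args[0] is the path to your index directory
    IndexReader reader = IndexReader.open(args[0]);
    int threshold = reader.numDocs() / 4;      // "in at least 25% of the docs"

    TermEnum terms = reader.terms();           // enumerate every Term in the index
    while (terms.next()) {
      Term t = terms.term();
      int df = terms.docFreq();                // number of docs containing this term
      if (df > threshold) {
        System.out.println(t.field() + ":" + t.text() + "\t" + df);
      }
    }
    terms.close();
    reader.close();
  }
}

Compile it with the Lucene jar on your classpath, point it at your index directory, and sort the output by the docFreq column to see where your cutoff probably ought to be.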