On Wed, Jun 10, 2009 at 12:02 PM, Evan Daniel <evanbd at gmail.com> wrote:
> On my (incomplete) spider index, the index file for the word "the" (it
> indexes no other words) is 17MB.  This seems rather large.  It might
> make sense to have the spider not even bother creating an index on a
> handful of very common words (the, be, to, of, and, a, in, I, etc).
> Of course, this presents the occasional difficulty:
> http://bash.org/?514353  I think I'm in favor of not indexing common
> words even so.
Yes, it should ignore common words.
These are called "stopwords" in search engine terminology.

>
> Also, on a related note, the index splitting policy should be a bit
> more sophisticated: in an attempt to fit within the max index size as
> configured, it split all the way down to index_8fc42.xml.  As a
> result, the file index_8fc4b.xml sits all by itself at 3KiB.  It
> contains the two words "vergessene" and "txjmnsm".  I suspect it would
> have reliability issues should anyone actually want to search either
> of those.  It would make more sense to have all of index_8fc4 in one
> file, since it would be only trivially larger.  (I have a patch that I
> thought did that, but it has a bug; I'll test once my indexwriter is
> finished writing, since I don't want to interrupt it by reloading the
> plugin.)

"Trivially larger"... ugh... how trivial is trivial?
XMLLibrarian can handle index_8fc42.xml in its own file with all the
other 8fc4 entries in index_8fc4.xml. However, as I have stated on IRC,
that makes index generation even slower.

> Evan Daniel
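
To make the stopword idea concrete, here is a minimal sketch in Python. It is not the spider's actual code: the STOPWORDS set simply reuses the words listed in the quoted message, and the function name indexable_words is made up for illustration.

```python
# Hypothetical stopword list, taken from the words named in the quoted message.
STOPWORDS = frozenset({"the", "be", "to", "of", "and", "a", "in", "i"})


def indexable_words(text):
    """Return the lowercase words of `text` that are not stopwords, in order."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


if __name__ == "__main__":
    print(indexable_words("The cat sat in a hat"))
    print(indexable_words("to be"))
```

Note the trade-off raised in the quoted message: with this filter, a phrase made up only of stopwords leaves nothing to index or search for.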

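
The layout described above (large sub-indexes in their own files, everything else kept together in the parent file) could be planned roughly as follows. This is only a sketch under assumptions: split_prefix is a hypothetical name, the sizes in the example are invented, and it says nothing about how the spider or XMLLibrarian actually implement splitting or how much slower generation becomes.

```python
def split_prefix(parent, child_sizes, max_size):
    """Plan index files for one hash prefix.

    `child_sizes` maps each child prefix to its size in bytes.
    The largest children are moved into their own files until what is
    left fits within `max_size`; the rest stay together in the parent file.
    Returns a dict mapping file prefix -> list of child prefixes it holds.
    """
    files = {parent: []}
    remaining = sum(child_sizes.values())
    # Visit children from largest to smallest.
    by_size = sorted(child_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for child, size in by_size:
        if remaining > max_size:
            files[child] = [child]  # too big to keep: give it its own file
            remaining -= size
        else:
            files[parent].append(child)  # small enough: keep in parent file
    if not files[parent]:
        del files[parent]
    return files


if __name__ == "__main__":
    # Invented sizes, for illustration only.
    sizes = {"8fc42": 900, "8fc47": 50, "8fc4b": 3}
    print(split_prefix("8fc4", sizes, max_size=500))
```

With these invented sizes, 8fc42 gets its own file and the small entries stay in the 8fc4 file instead of each sitting alone. A child that is itself larger than max_size would still need further splitting, which this sketch does not attempt.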