[freenet-dev] Should the spider ignore common words?

Evan Daniel Wed, 10 Jun 2009 00:02:46 -0400

On my (incomplete) spider index, the index file for the word "the" (it
indexes no other words) is 17MB.  This seems rather large.  It might
make sense to have the spider not even bother creating an index on a
handful of very common words (the, be, to, of, and, a, in, I, etc).
Of course, this presents the occasional difficulty; see
http://bash.org/?514353 for an example.  Even so, I think I'm in favor
of not indexing common words.
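
A minimal sketch of the stop-word idea, not the spider's actual code.
The names STOP_WORDS and indexable_words are hypothetical, the list
holds only the words named above, and splitting on whitespace is a
simplifying assumption:

```python
# Hypothetical stop list: only the words named in this mail.
# A real list would need tuning, probably per language.
STOP_WORDS = frozenset({"the", "be", "to", "of", "and", "a", "in", "i"})


def indexable_words(text):
    """Yield the lowercased words of text that are not stop words.

    Assumes words are separated by whitespace; a real tokenizer would
    also have to deal with punctuation.
    """
    for word in text.lower().split():
        if word not in STOP_WORDS:
            yield word


if __name__ == "__main__":
    print(list(indexable_words("The cat and I sat in a box")))
```

A filter like this would run before the indexer sees a word, so no
index file for "the" would ever be created.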


Also, on a related note, the index splitting policy should be a bit
more sophisticated: in an attempt to fit within the max index size as
configured, it split all the way down to index_8fc42.xml.  As a
result, the file index_8fc4b.xml sits all by itself at 3KiB.  It
contains the two words "vergessene" and "txjmnsm".  I suspect it would
have reliability issues should anyone actually want to search for
either of those.  It would make more sense to have all of index_8fc4 in one
file, since it would be only trivially larger.  (I have a patch that I
thought did that, but it has a bug; I'll test once my indexwriter is
finished writing, since I don't want to interrupt it by reloading the
plugin.)
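
One way to picture the policy described above, as a sketch under
stated assumptions rather than the patch itself: split a prefix into
its children only when that prefix alone exceeds the configured
maximum.  The function split_index, its parameters, and the example
sizes are all hypothetical; entries are assumed to be keyed by a hex
hash string, with sizes in bytes.

```python
def split_index(entries, max_size, prefix="", max_depth=8):
    """Group entries into index files by hash prefix.

    entries:  dict mapping hex hash string -> entry size in bytes
              (every key is assumed to start with prefix).
    max_size: assumed per-file size limit in bytes.
    Returns a dict mapping prefix -> sorted list of hashes in that file.

    A prefix is split only when its own total exceeds max_size, so
    small sibling buckets stay together in one file.
    """
    total = sum(entries.values())
    # Stop when the bucket fits, cannot be split further, or is too deep.
    if total <= max_size or len(entries) <= 1 or len(prefix) >= max_depth:
        return {prefix: sorted(entries)}

    # Partition by the next hex digit after the current prefix.
    children = {}
    for h, size in entries.items():
        children.setdefault(h[:len(prefix) + 1], {})[h] = size

    result = {}
    for child_prefix in sorted(children):
        result.update(
            split_index(children[child_prefix], max_size, child_prefix, max_depth)
        )
    return result


if __name__ == "__main__":
    # Made-up hashes and sizes, for illustration only.
    example = {
        "8fc42a": 100, "8fc4b1": 50, "8fc4b2": 50,
        "1a0000": 5000, "1b0000": 5000,
    }
    for name, hashes in sorted(split_index(example, 4096).items()):
        print("index_%s.xml" % name, hashes)
```

With this approach the large "1" bucket is split while the small "8"
entries stay in a single file, instead of every prefix being split to
the same depth.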

Evan Daniel
