On Wed, Jun 10, 2009 at 3:18 PM, Evan Daniel<evanbd at gmail.com> wrote:
> On Wed, Jun 10, 2009 at 2:56 AM, Daniel Cheng<j16sdiz+freenet at gmail.com>
> wrote:
>> On Wed, Jun 10, 2009 at 2:06 PM, Evan Daniel<evanbd at gmail.com> wrote:
>>> On Wed, Jun 10, 2009 at 1:54 AM, Daniel Cheng<j16sdiz+freenet at gmail.com>
>>> wrote:
>>>> On Wed, Jun 10, 2009 at 12:02 PM, Evan Daniel<evanbd at gmail.com> wrote:
>>>>> On my (incomplete) spider index, the index file for the word "the" (it
>>>>> indexes no other words) is 17 MB.  This seems rather large.  It might
>>>>> make sense to have the spider not even bother creating an index on a
>>>>> handful of very common words (the, be, to, of, and, a, in, I, etc).
>>>>> Of course, this presents the occasional difficulty:
>>>>> http://bash.org/?514353  I think I'm in favor of not indexing common
>>>>> words even so.
>>>>
>>>> Yes, it should ignore common words.
>>>> These are called "stopwords" in search engine terminology.
>>>>
>>>>>
>>>>> Also, on a related note, the index splitting policy should be a bit
>>>>> more sophisticated: in an attempt to fit within the max index size as
>>>>> configured, it split all the way down to index_8fc42.xml.  As a
>>>>> result, the file index_8fc4b.xml sits all by itself at 3 KiB.  It
>>>>> contains the two words "vergessene" and "txjmnsm".  I suspect it would
>>>>> have reliability issues should anyone actually want to search either
>>>>> of those.  It would make more sense to have all of index_8fc4 in one
>>>>> file, since it would be only trivially larger.  (I have a patch that I
>>>>> thought did that, but it has a bug; I'll test once my indexwriter is
>>>>> finished writing, since I don't want to interrupt it by reloading the
>>>>> plugin.)
>>>>
>>>> "trivially larger" ...
>>>> ugh... how trivial is trivial?
>>>>
>>>> The xmllibrarian can handle index_8fc42.xml on its own, with all other
>>>> 8fc4 entries in index_8fc4.xml.
>>>> However, as I have stated on IRC, that makes index generation even slower.
>>>
>>> 8fc42 is 17382 KiB.  All other 8fc4 are 79 KiB combined.
>>>
>>> Also, it would make index generation faster.  The spider first does
>>> all the work of creating 8fc4, then discards it to recreate the
>>> sub-indexes.  The vast majority of this work is in 8fc42, which gets
>>> created twice.  Not splitting the index would nearly halve the time to
>>
>> It doesn't get created twice; it shortcuts early.
>> See the estimateSize variable in IndexWriter.
>
> Unless I'm mistaken, the slow part of the index creation is the
> term.getPages() call.  That call is where all the disk io hides, no?

no :)
getPages() returns an IPersistentSet (ScalableSet), which is lazily evaluated.
Internally, it is a linked set when small and a btree when large.
The .size() value is always cached.

> The "shortcut" doesn't occur until after that call returns.  As
> discussed above, "the" accounts for about 99.5% of the whole index,
> and therefore (I'm assuming) 99.5% of the disk io.  And that 99.5%
> happens twice.
>
> The shortcut only functions properly when the largest term accounts
> for a modest fraction of the total work, which is exactly what isn't
> happening here.
>
> Evan Daniel
>
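
For reference, the splitting policy discussed in the quoted thread (give an oversized sub-index its own file and keep the remaining small ones together in the parent file) might be sketched roughly as below. This is only an illustration, not the spider's actual IndexWriter code: the class SplitPolicySketch, the plan() and fileName() helpers, the 4096 KiB limit, and the 8fc4e entry with its 76 KiB size are all made up for the example. Only the 17382 KiB and 3 KiB figures and the file naming come from the thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a split policy; not the spider's real logic. */
class SplitPolicySketch {

    /** File naming as seen in the thread, e.g. index_8fc4.xml. */
    static String fileName(String prefix) {
        return "index_" + prefix + ".xml";
    }

    /**
     * Plans the output files for one parent prefix.
     * Children larger than maxKiB each get their own file. The remaining
     * children share the parent's file if their combined size fits;
     * otherwise each of them falls back to its own file.
     */
    static Map<String, List<String>> plan(String parent,
                                          Map<String, Long> childKiB,
                                          long maxKiB) {
        Map<String, List<String>> files = new TreeMap<>();
        List<String> shared = new ArrayList<>();
        long sharedKiB = 0;
        // Iterate in sorted order so the result is deterministic.
        for (Map.Entry<String, Long> e : new TreeMap<>(childKiB).entrySet()) {
            if (e.getValue() > maxKiB) {
                files.put(fileName(e.getKey()),
                          Collections.singletonList(e.getKey()));
            } else {
                shared.add(e.getKey());
                sharedKiB += e.getValue();
            }
        }
        if (sharedKiB <= maxKiB) {
            if (!shared.isEmpty()) {
                files.put(fileName(parent), shared);
            }
        } else {
            for (String child : shared) {
                files.put(fileName(child), Collections.singletonList(child));
            }
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new HashMap<>();
        sizes.put("8fc42", 17382L); // size quoted in the thread
        sizes.put("8fc4b", 3L);     // size quoted in the thread
        sizes.put("8fc4e", 76L);    // made-up entry so the small ones total 79 KiB
        // 4096 KiB is an assumed limit; the configured max is not given.
        System.out.println(plan("8fc4", sizes, 4096L));
    }
}
```

With these inputs, 8fc42 stays in its own file while the small sub-indexes end up together in index_8fc4.xml instead of one tiny file each. The sketch says nothing about generation time, which is the part still under dispute above.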
