I just found a problem with my results. It looks like, the way Lucene was set up, it was unable to cache the entire result set, so it was only returning a subset of the files. The result for returning 2% of the files is still correct (because that result set was small enough for Lucene to return): the index gets the 2% in around ten seconds, whereas collection takes around 40 seconds. I need to figure out a few things before I can retest the 100% case, though. It might not even be possible to return this amount of data from the index at one time.
Steven

On Tue, Sep 17, 2013 at 11:47 AM, Vinayak Borkar <[email protected]> wrote:

> That seems very plausible. Opening and closing files is fairly expensive.
> In addition, each file read is a random seek on disk.
>
> On 9/17/13 11:33 AM, Steven Jacobs wrote:
>
>> My current theory is that collection faces the overhead of creating file
>> handlers 40,000 times while index creates a handler only once to read the
>> index results. I don't know if this is enough to produce the large
>> difference though.
>>
>> Steven
>>
>> On Tue, Sep 17, 2013 at 11:28 AM, Michael Carey <[email protected]> wrote:
>>
>>> Interesting! I haven't followed enough yet, but now you have my
>>> interest. :-)
>>> do you have an explanation for why your index wins even in the case of
>>> 100%?
>>> (Not intuitive - maybe I am missing some details that would fill my
>>> intuition gap.)
>>>
>>> On 9/17/13 10:59 AM, Steven Jacobs wrote:
>>>
>>>> I ran a test on one of Preston's real-world data sets (Weather
>>>> collection) that had around 40,000 files. I am attaching the results.
>>>> There are three graphs.
>>>>
>>>> The first shows the time for returning the entire XML for all 40000
>>>> files. My index algorithm has huge gains over collection, no matter how
>>>> much of the data is returned.
>>>>
>>>> The second shows how the two algorithms perform as the number of files
>>>> increases. Both linearly increase, but collection has a much higher
>>>> slope.
>>>>
>>>> The last is just a one-point comparison for returning paths that only
>>>> exist in only 100 out of the 40000 files. Once again, index has a huge
>>>> advantage.
>>>>
>>>> Steven
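One rough way to check the file-handle theory from the quoted thread is a small microbenchmark that reads the same bytes once through many small files and once through a single combined file. The sketch below is only an illustration, not the code used for the graphs: the function names, file names, payload, and file count are all made up, and because the files are freshly written they will sit in the OS page cache, so it measures open/close overhead rather than disk seeks.

```python
import os
import tempfile
import time


def make_corpus(directory, n_files, payload):
    """Write n_files small files plus one combined file holding the same bytes."""
    paths = []
    for i in range(n_files):
        path = os.path.join(directory, f"doc_{i}.xml")  # hypothetical file names
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
    combined = os.path.join(directory, "combined.bin")
    with open(combined, "wb") as f:
        for _ in range(n_files):
            f.write(payload)
    return paths, combined


def read_many(paths):
    """Open, read, and close every file: one handle per file."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            total += len(f.read())
    return total


def read_one(path):
    """Read the same number of bytes through a single handle."""
    with open(path, "rb") as f:
        return len(f.read())


def timed(fn, arg):
    """Return (result, elapsed seconds) for fn(arg)."""
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start


def run(n_files=1000, payload=b"<r>" + b"x" * 1024 + b"</r>"):
    """Build a throwaway corpus and time both access patterns."""
    with tempfile.TemporaryDirectory() as d:
        paths, combined = make_corpus(d, n_files, payload)
        many_bytes, many_secs = timed(read_many, paths)
        one_bytes, one_secs = timed(read_one, combined)
    return many_bytes, one_bytes, many_secs, one_secs


if __name__ == "__main__":
    many_bytes, one_bytes, many_secs, one_secs = run()
    print(f"many files: {many_bytes} bytes in {many_secs:.4f}s")
    print(f"one file:   {one_bytes} bytes in {one_secs:.4f}s")
```

If the per-file cost turns out to be small even at 40,000 files, the handle overhead alone probably does not explain the gap, and the random-seek cost on a cold cache would be the next thing to measure.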
