And I may be misinterpreting, but - it also sounds like it's not
apples-to-apples in terms of the queries?
(I.e., you are maybe not returning the entirety of the documents? Or
does the index hold a complete replica of the data? What is the query,
or queries?)
Cheers,
Mike
On 9/17/13 12:06 PM, Steven Jacobs wrote:
I just found a problem with my results. It looks like the way Lucene was
set up, it was unable to cache the entire result set, so it was only
returning a subset of the files. The result is still correct for returning
2% of the file (because that was small enough for Lucene to return), so it
does get 2% in around ten seconds, whereas collection takes around 40
seconds. I need to figure out some things before I can retest the 100%
results, though. It might not even be possible to return this amount of
data from the index at one time.
Steven
On Tue, Sep 17, 2013 at 11:47 AM, Vinayak Borkar <[email protected]> wrote:
That seems very plausible. Opening and closing files is fairly expensive.
In addition, each file read is a random seek on disk.
On 9/17/13 11:33 AM, Steven Jacobs wrote:
My current theory is that collection faces the overhead of creating file
handles 40,000 times, while index creates a handle only once to read the
index results. I don't know if this is enough to produce the large
difference though.
Steven
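A rough way to check the per-file overhead theory above is to time reading
the same bytes from many small files versus from one combined file. This is
only a sketch under stated assumptions: the file count, the payload, and the
names (build_fixture, read_many, read_one, run) are made up for illustration
and are not taken from the actual test setup. Freshly written files will
usually sit in the OS page cache, so this mostly measures open/close cost
and not the disk seeks mentioned above.

```python
import os
import tempfile
import time

# Hypothetical payload standing in for one small XML document.
PAYLOAD = b"<obs>42</obs>\n" * 20


def build_fixture(root, n_files):
    """Write n_files small files plus one file holding the same bytes."""
    paths = []
    for i in range(n_files):
        path = os.path.join(root, f"doc_{i:05d}.xml")
        with open(path, "wb") as fh:
            fh.write(PAYLOAD)
        paths.append(path)
    combined = os.path.join(root, "combined.bin")
    with open(combined, "wb") as fh:
        for _ in range(n_files):
            fh.write(PAYLOAD)
    return paths, combined


def read_many(paths):
    """Open, read, and close every file: one handle per document."""
    total = 0
    for path in paths:
        with open(path, "rb") as fh:
            total += len(fh.read())
    return total


def read_one(combined):
    """Read the same number of bytes through a single handle."""
    with open(combined, "rb") as fh:
        return len(fh.read())


def timed(fn, arg):
    """Return (result, elapsed seconds) for one call of fn(arg)."""
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start


def run(n_files=2000):
    """Build the fixture in a temp dir and time both access patterns."""
    with tempfile.TemporaryDirectory() as root:
        paths, combined = build_fixture(root, n_files)
        many_bytes, many_secs = timed(read_many, paths)
        one_bytes, one_secs = timed(read_one, combined)
    return many_bytes, one_bytes, many_secs, one_secs


if __name__ == "__main__":
    many_bytes, one_bytes, many_secs, one_secs = run()
    print(f"many files: {many_bytes} bytes in {many_secs:.4f}s")
    print(f"one file:   {one_bytes} bytes in {one_secs:.4f}s")
```

Both paths read the same number of bytes, so any timing gap comes from the
access pattern. To see seek costs as well, the files would have to be read
with a cold cache, which this sketch does not attempt.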
On Tue, Sep 17, 2013 at 11:28 AM, Michael Carey <[email protected]>
wrote:
Interesting! I haven't followed enough yet, but now you have my
interest. :-)
Do you have an explanation for why your index wins even in the case of
100%?
(Not intuitive - maybe I am missing some details that would fill my
intuition gap.)
On 9/17/13 10:59 AM, Steven Jacobs wrote:
I ran a test on one of Preston's real-world data sets (Weather
collection) that had around 40,000 files. I am attaching the results.
There are three graphs.
The first shows the time for returning the entire XML for all 40000
files. My index algorithm has huge gains over collection, no matter how
much of the data is returned.
The second shows how the two algorithms perform as the number of files
increases. Both linearly increase, but collection has a much higher
slope.
The last is just a one-point comparison for returning paths that exist
in only 100 out of the 40000 files. Once again, index has a huge
advantage.
Steven