I just found a problem with my results. It looks like, the way Lucene was
configured, it could not cache the entire result set, so it was returning
only a subset of the files. The result for the 2% case is still correct
(it was small enough for Lucene to return): the index gets 2% in around
ten seconds, whereas collection takes around 40 seconds. I still need to
figure out a few things before I can retest the 100% results. It might not
even be possible to return this much data from the index at one time.
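
A rough sketch of one possible workaround for the 100% case: pull the hits
in fixed-size pages instead of in a single call. This is only an
illustration under assumptions. `search_page`, `fake_search_page`, and
`FAKE_HITS` are hypothetical stand-ins, not Lucene calls, and the page size
of 1000 is arbitrary.

```python
from typing import Callable, Iterator, List


def iter_results(search_page: Callable[[int, int], List[str]],
                 page_size: int = 1000) -> Iterator[str]:
    """Yield every hit by requesting one fixed-size page at a time.

    search_page(offset, limit) is a hypothetical callback that returns
    at most `limit` hits starting at position `offset`.
    """
    offset = 0
    while True:
        page = search_page(offset, page_size)
        if not page:
            break  # nothing left to fetch
        yield from page
        if len(page) < page_size:
            break  # a short page means this was the last one
        offset += len(page)


# Hypothetical stand-in for the index: 40,000 fake document paths.
FAKE_HITS = [f"doc_{i}.xml" for i in range(40_000)]


def fake_search_page(offset: int, limit: int) -> List[str]:
    """Return one slice of the fake hit list (assumption, not a real API)."""
    return FAKE_HITS[offset:offset + limit]


all_hits = list(iter_results(fake_search_page, page_size=1000))
print(len(all_hits))
```

With this shape only one page of hits is requested at a time, so the
caller never asks the index for the full result set in a single call.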

Steven


On Tue, Sep 17, 2013 at 11:47 AM, Vinayak Borkar <[email protected]> wrote:

> That seems very plausible. Opening and closing files is fairly expensive.
> In addition, each file read is a random seek on disk.
>
>
>
>
> On 9/17/13 11:33 AM, Steven Jacobs wrote:
>
>> My current theory is that collection faces the overhead of creating file
>> handles 40,000 times, while index creates a handle only once to read the
>> index results. I don't know if this is enough to produce the large
>> difference, though.
>>
>> Steven
>>
>>
>> On Tue, Sep 17, 2013 at 11:28 AM, Michael Carey <[email protected]>
>> wrote:
>>
>>  Interesting!  I haven't followed enough yet, but now you have my
>>> interest. :-)
>>> Do you have an explanation for why your index wins even in the case of
>>> 100%?
>>> (Not intuitive - maybe I am missing some details that would fill my
>>> intuition gap.)
>>>
>>>
>>> On 9/17/13 10:59 AM, Steven Jacobs wrote:
>>>
>>>  I ran a test on one of Preston's real-world data sets (Weather
>>>> collection) that had around 40,000 files. I am attaching the results.
>>>> There are three graphs.
>>>>
>>>> The first shows the time for returning the entire XML for all 40000
>>>> files. My index algorithm has huge gains over collection, no matter how
>>>> much of the data is returned.
>>>>
>>>> The second shows how the two algorithms perform as the number of files
>>>> increases. Both increase linearly, but collection has a much higher
>>>> slope.
>>>>
>>>> The last is just a one-point comparison for returning paths that
>>>> exist in only 100 out of the 40000 files. Once again, index has a huge
>>>> advantage.
>>>>
>>>>
>>>> Steven
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
