My current theory is that collection pays the overhead of creating file
handlers 40,000 times, while index creates a handler only once to read the
index results. I don't know whether this alone is enough to produce such a
large difference, though.
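
One rough way to check this theory in isolation would be a micro-benchmark
like the Python sketch below. It is not the collection or index code: the
file names, payload, and counts are made up for illustration. It only
compares opening many small files one by one against opening a single
combined file holding the same bytes.

```python
import os
import tempfile
import time


def make_files(directory, count, payload):
    """Write `count` small files plus one combined file with the same bytes."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"doc{i}.xml")  # hypothetical names
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
    combined = os.path.join(directory, "combined.bin")
    with open(combined, "wb") as f:
        for _ in range(count):
            f.write(payload)
    return paths, combined


def read_many(paths):
    """Open and read every file separately (one open per file)."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            total += len(f.read())
    return total


def read_one(combined):
    """Open a single file once and read all of it."""
    with open(combined, "rb") as f:
        return len(f.read())


def compare(count=2000, payload=b"<a>1</a>" * 64):
    """Return (bytes_many, bytes_one, seconds_many, seconds_one)."""
    with tempfile.TemporaryDirectory() as d:
        paths, combined = make_files(d, count, payload)
        t0 = time.perf_counter()
        n_many = read_many(paths)
        t1 = time.perf_counter()
        n_one = read_one(combined)
        t2 = time.perf_counter()
    return n_many, n_one, t1 - t0, t2 - t1


if __name__ == "__main__":
    n_many, n_one, t_many, t_one = compare()
    print(f"many files: {n_many} bytes in {t_many:.4f}s")
    print(f"one file:   {n_one} bytes in {t_one:.4f}s")
```

The files were just written, so they will likely be served from the OS
cache, and the timings mostly reflect per-file open/close cost rather than
disk reads. If the gap here is small compared to what the graphs show, file
opening alone probably does not explain the difference.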

Steven


On Tue, Sep 17, 2013 at 11:28 AM, Michael Carey <[email protected]> wrote:

> Interesting!  I haven't followed enough yet, but now you have my
> interest. :-)
> Do you have an explanation for why your index wins even in the case of
> 100%?
> (Not intuitive - maybe I am missing some details that would fill my
> intuition gap.)
>
>
> On 9/17/13 10:59 AM, Steven Jacobs wrote:
>
>> I ran a test on one of Preston's real-world data sets (Weather
>> collection) that had around 40,000 files. I am attaching the results. There
>> are three graphs.
>>
>> The first shows the time for returning the entire XML for all 40000
>> files. My index algorithm has huge gains over collection, no matter how
>> much of the data is returned.
>>
>> The second shows how the two algorithms perform as the number of files
>> increases. Both linearly increase, but collection has a much higher slope.
>>
>> The last is just a one-point comparison for returning paths that exist
>> in only 100 out of the 40000 files. Once again, index has a huge
>> advantage.
>>
>>
>> Steven
>>
>>
>>
>
