I made changes to my indexing algorithm so that it incrementally receives
only as much data as it can handle at a time. This yielded results more in
line with what we expected. I am attaching the revised graphs. The second
and third graphs did not change, since those results already fit in a
single fetch originally. They show that the index has a huge advantage for
returning small amounts of data.

The first graph shows the negative effect of returning all of the data
from the index. What slows my algorithm down is the number of times that
it needs to re-fetch from the index. In the case of returning all of the
data from all 40,000 files, it took 32 chunks to perform the operation. An
interesting thing to find out would be how the number of chunk fetches
correlates with the loss of advantage over collection. I don't know if I
will be able to do a more thorough examination before Summer of Code is
over, but it is definitely something to look at in the future.
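For reference, the paging loop behaves roughly like the toy sketch below
(plain Python, not the actual implementation; the chunk size of 1,250 is a
hypothetical value chosen only so that 40,000 results take 32 fetches,
matching the numbers above):

```python
def fetch_in_chunks(total_hits, chunk_size):
    """Simulate incremental retrieval: fetch at most chunk_size
    result ids per round trip instead of everything at once."""
    results = []
    num_fetches = 0
    offset = 0
    while offset < total_hits:
        # In the real system each round trip goes to the index;
        # here we just materialize the id range for the chunk.
        chunk = list(range(offset, min(offset + chunk_size, total_hits)))
        results.extend(chunk)
        num_fetches += 1
        offset += chunk_size
    return results, num_fetches

results, num_fetches = fetch_in_chunks(40000, 1250)
print(num_fetches)  # 32 chunk fetches to return everything
```

The re-fetch cost I mention above corresponds to num_fetches growing with
the fraction of data returned.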

In any case, indexing will be a poor choice when returning most of the
data from a large data set, but it shines when fetching a small subset.
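As a side note, the dotted-id encoding described in the quoted discussion
below can be sketched in a few lines of plain Python (a rough stand-in
using the standard library's ElementTree, not our actual Lucene-backed
implementation; tail text between siblings is ignored in this sketch):

```python
import xml.etree.ElementTree as ET

def encode(elem, node_id="0"):
    """Emit (id, value, type) tuples using dotted parent-child ids:
    child k of node "0.2" gets id "0.2.k"."""
    tuples = [(node_id, elem.tag, "element")]
    k = 0
    for name, value in elem.attrib.items():
        attr_id = "%s.%d" % (node_id, k)
        tuples.append((attr_id, name, "attribute"))
        tuples.append((attr_id + ".0", value, "text"))  # attribute value
        k += 1
    if elem.text and elem.text.strip():
        tuples.append(("%s.%d" % (node_id, k), elem.text.strip(), "text"))
        k += 1
    for child in elem:
        tuples.extend(encode(child, "%s.%d" % (node_id, k)))
        k += 1
    return tuples

doc = ET.fromstring(
    '<letter><title maxlength="10">My letter</title>'
    '<body>Here is the body</body>'
    '<footer><greeting>Good Bye!</greeting></footer></letter>')
for t in encode(doc):
    print(t)  # e.g. ('0.0.0', 'maxlength', 'attribute')
```

Running it on the letter example reproduces the tuple listing quoted
further down the thread.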

Steven



On Tue, Sep 17, 2013 at 10:51 PM, Michael Carey <[email protected]> wrote:

> Got it - now it makes more sense.  (In that case, I question the use of
> the term "index", though; I would view this instead as an "indexed
> collection implementation" that's an alternative to basic file storage.
>  But that's just words; doesn't affect the implementation work or
> performance work. :-))
>
>
> On 9/17/13 2:35 PM, Steven Jacobs wrote:
>
>> The index actually has all of the data, so it is capable of recreating
>> the XML structure. We use a parent-child relationship to maintain an id
>> for each piece of data. For example, assume the following XML code:
>> <letter>
>>   <title maxlength="10">My letter</title>
>>   <body>Here is the body</body>
>>   <footer>
>>     <greeting>Good Bye!</greeting>
>>   </footer>
>> </letter>
>>
>> From this snippet we would get the following tuples (id, value, type):
>> 0 "letter" element
>> 0.0 "title" element
>> 0.0.0 "maxlength" attribute
>> 0.0.0.0 "10" text
>> 0.0.1 "My letter" text
>> 0.1 "body" element
>> 0.1.0 "Here is the body" text
>> 0.2 "footer" element
>> 0.2.0 "greeting" element
>> 0.2.0.0 "Good Bye!" text
>>
>> Then we store these tuples in a Lucene index. This means that the tests
>> I ran on 100% of the data were actually producing the entire result from
>> the index. Intuitively, the index should really stand out when we are
>> only returning a small subset of the results, since collection parses
>> everything while the index only looks at the relevant data, and this
>> does seem to be the case (it finished in 1/4 of the time in my test).
>>
>> At this point I am getting a memory overflow when trying to retrieve ALL
>> of the data from the index, so I am looking into Lucene's support for
>> incremental (paged) results. I will post results when I get things
>> working.
>>
>> Steven
>>
>>
>> On Tue, Sep 17, 2013 at 2:04 PM, Vinayak Borkar <[email protected]>
>> wrote:
>>
>>> Steven,
>>>
>>>
>>> It might be useful to send a brief note describing how you use Lucene to
>>> store XML documents.
>>>
>>> Thanks,
>>> Vinayak
>>>
>>>
>>>
>>> On 9/17/13 12:56 PM, Michael Carey wrote:
>>>
>>>> PS (to my last mail) - I would have thought the index results would
>>>> still have to be used to go get the full XML documents back? (Or is
>>>> this less of an index and more of an alternative complete storage
>>>> solution?)
>>>>
>>>> On 9/17/13 11:33 AM, Steven Jacobs wrote:
>>>>
>>>>> My current theory is that collection faces the overhead of creating
>>>>> file handlers 40,000 times, while index creates a handler only once
>>>>> to read the index results. I don't know if this is enough to produce
>>>>> the large difference, though.
>>>>>
>>>>> Steven
>>>>>
>>>>>
>>>>> On Tue, Sep 17, 2013 at 11:28 AM, Michael Carey <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Interesting! I haven't followed enough yet, but now you have my
>>>>>> interest. :-)
>>>>>> Do you have an explanation for why your index wins even in the case
>>>>>> of 100%?
>>>>>> (Not intuitive - maybe I am missing some details that would fill my
>>>>>> intuition gap.)
>>>>>>
>>>>>>
>>>>>> On 9/17/13 10:59 AM, Steven Jacobs wrote:
>>>>>>
>>>>>>> I ran a test on one of Preston's real-world data sets (Weather
>>>>>>> collection) that had around 40,000 files. I am attaching the
>>>>>>> results. There are three graphs.
>>>>>>>
>>>>>>> The first shows the time for returning the entire XML for all
>>>>>>> 40,000 files. My index algorithm has huge gains over collection, no
>>>>>>> matter how much of the data is returned.
>>>>>>>
>>>>>>> The second shows how the two algorithms perform as the number of
>>>>>>> files increases. Both increase linearly, but collection has a much
>>>>>>> steeper slope.
>>>>>>>
>>>>>>> The last is just a one-point comparison for returning paths that
>>>>>>> exist in only 100 of the 40,000 files. Once again, the index has a
>>>>>>> huge advantage.
>>>>>>>
>>>>>>>
>>>>>>> Steven
>>>>>>>
>>>>>>>
>>>>
>

Attachment: index results 40000.xlsx
Description: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
