The index actually has all of the data, so it is capable of recreating the
XML structure. We use a parent-child ID scheme to maintain an identifier
for each piece of data. For example, assume the following XML:
<letter>
  <title maxlength="10">My letter</title>
  <body>Here is the body</body>
  <footer>
    <greeting>Good Bye!</greeting>
  </footer>
</letter>

From this snippet we would get the following tuples (id, value, type):
0 "letter" element
0.0 "title" element
0.0.0 "maxlength" attribute
0.0.0.0 "10" text
0.0.1 "My letter" text
0.1 "body" element
0.1.0 "Here is the body" text
0.2 "footer" element
0.2.0 "greeting" element
0.2.0.0 "Good Bye!" text
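As a rough illustration, the tuple derivation above can be sketched like this. This is a minimal Python sketch using the standard library's ElementTree, not our actual Lucene indexing code; the traversal order (attributes first, then text, then child elements) is inferred from the listing above:

```python
import xml.etree.ElementTree as ET

def tuples_for(elem, dewey):
    """Emit (id, value, type) rows for an element: the element itself,
    then its attributes, then its text, then its child elements."""
    rows = [(dewey, elem.tag, "element")]
    child = 0
    # Each attribute gets its own child id; its value is a text node under it.
    for name, value in elem.attrib.items():
        attr_id = f"{dewey}.{child}"
        rows.append((attr_id, name, "attribute"))
        rows.append((f"{attr_id}.0", value, "text"))
        child += 1
    # Non-whitespace text content becomes a text child.
    if elem.text and elem.text.strip():
        rows.append((f"{dewey}.{child}", elem.text.strip(), "text"))
        child += 1
    # Recurse into child elements, extending the dotted id.
    for sub in elem:
        rows.extend(tuples_for(sub, f"{dewey}.{child}"))
        child += 1
    return rows

doc = ET.fromstring(
    '<letter><title maxlength="10">My letter</title>'
    '<body>Here is the body</body>'
    '<footer><greeting>Good Bye!</greeting></footer></letter>')

for row in tuples_for(doc, "0"):
    print(row)
```

Running this reproduces the ten tuples listed above, e.g. ("0.0.0.0", "10", "text") for the attribute value.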

Then we store these tuples in a Lucene index. This means that the tests I
ran on 100% of the data were actually producing the entire result from the
index. Intuitively, the index should really stand out when we are only
returning a small subset of the results, since collection must parse
everything while the index only looks at the relevant data, and this does
seem to be the case (the index finished in 1/4 of the time in my test).

At this point I am getting a memory overflow when trying to retrieve ALL
of the data from the index, so I am looking into Lucene's support for
incremental results. I will post results when I get things working.

Steven


On Tue, Sep 17, 2013 at 2:04 PM, Vinayak Borkar <[email protected]> wrote:

> Steven,
>
>
> It might be useful to send a brief note describing how you use Lucene to
> store XML documents.
>
> Thanks,
> Vinayak
>
>
>
> On 9/17/13 12:56 PM, Michael Carey wrote:
>
>> PS (to my last mail) - I would have thought the index results would
>> still have to now be used to go get the full XML documents back? (Or is
>> this less of an index and more of an alternative complete storage
>> solution?)
>>
>> On 9/17/13 11:33 AM, Steven Jacobs wrote:
>>
>>> My current theory is that collection faces the overhead of creating file
>>> handlers 40,000 times while index creates a handler only once to read the
>>> index results. I don't know if this is enough to produce the large
>>> difference though.
>>>
>>> Steven
>>>
>>>
>>> On Tue, Sep 17, 2013 at 11:28 AM, Michael Carey <[email protected]>
>>> wrote:
>>>
>>>> Interesting! I haven't followed enough yet, but now you have my
>>>> interest. :-)
>>>> do you have an explanation for why your index wins even in the case of
>>>> 100%?
>>>> (Not intuitive - maybe I am missing some details that would fill my
>>>> intuition gap.)
>>>>
>>>>
>>>> On 9/17/13 10:59 AM, Steven Jacobs wrote:
>>>>
>>>>  I ran a test on one of Preston's real-world data sets (Weather
>>>>> collection) that had around 40,000 files. I am attaching the
>>>>> results. There
>>>>> are three graphs.
>>>>>
>>>>> The first shows the time for returning the entire XML for all 40,000
>>>>> files. My index algorithm has huge gains over collection, no matter how
>>>>> much of the data is returned.
>>>>>
>>>>> The second shows how the two algorithms perform as the number of files
>>>>> increases. Both linearly increase, but collection has a much higher
>>>>> slope.
>>>>>
>>>>> The last is just a one-point comparison for returning paths that exist
>>>>> in only 100 of the 40,000 files. Once again, index has a huge
>>>>> advantage.
>>>>>
>>>>>
>>>>> Steven
>>>>>
>>>>>
>>>>>
>>>>>
>>
>>
>
