What I meant by "hiding" the result is that it takes the VM a long time to start up. If your query takes only 100 milliseconds to run in either case, then you won't see any difference. It can be very hard to do timing with the Java VM from the command line. The only really valid tests are those where the startup time is a small part of the overall time of the test.
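One way around that is to do the timing inside the VM rather than from the command line: run the query once as a warm-up, then measure the second run with System.nanoTime(), so VM startup, class loading, and JIT compilation don't swamp a fast query. A minimal sketch, where runQuery() is just a hypothetical stand-in for whatever Xindice call you are really timing:

```java
public class QueryTimer {
    // Hypothetical stand-in for the real Xindice XPath query call;
    // it just burns a deterministic amount of CPU.
    static int runQuery() {
        int n = 0;
        for (int i = 0; i < 1_000_000; i++) n += i % 7; // simulated work
        return n;
    }

    public static void main(String[] args) {
        runQuery(); // warm-up pass: absorbs class loading and JIT cost
        long start = System.nanoTime();
        int result = runQuery();
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println("result=" + result + " elapsed=" + elapsedMicros + "us");
    }
}
```

Timed this way, an indexed and an unindexed query that both finish in a few milliseconds will still look different, where from the command line both would be buried under seconds of VM startup.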
Also, if both of your queries bring back 2300 documents and both print them out, then indexing won't buy you anything, because in both cases you want all 2300 documents. Where indexing will get you the biggest gains is when you want one document out of the 2300. Without the index, Xindice has to read each document one at a time looking for your match. With the index, it can use a binary search and only read the index about 11 or 12 times to get to your document (e.g. 2300 -> 1150 -> 575 -> 288 -> 144 -> 72 -> 36 -> 18 -> 9 -> 5 -> 3 -> 2 -> 1).

If your search is for the very first document that you put in, and Xindice puts them in the file in the order that you added them, then searching the entire file only takes one read, while using the index takes 11 or 12 hits to the index and then a read. So in that one case it will take longer with the index. If you are trying to see a difference between having and not having an index, you have to write your test correctly. With 2300 documents and a binary search, Xindice can find any document in a dozen or so index hits and a read; the worst case with no index is that it has to read all 2300 documents. My guess is that Xindice puts them into the file in the order that you add them, so do your search trying to find the last document you added.

Adding an index will require that Xindice maintain that index: every time you add a new document it has to update all applicable indexes, so your inserts take longer. You can always add indexes later if it looks like your queries are taking too long. In the old days before databases, you had to add code for indexes to your programs; in these modern times you can add and remove indexes on the fly. The time to add an index is when a particular query takes too long!

Sorry for the long rambling rant ;-).

Mark

Sreeni Chippada wrote:
> Mark,
> It took me about 3 minutes to load about 2300 documents in 102MB.
> It took me 31 sec to index /INVOICE/BILL_INVOICE.bill_ref_no.
> Now, I deleted that collection and added a new collection with 22
> documents/1MB (just to make it simple). I did not index the xpath. If I run
>     xindiceadmin xpath -c /db/lucent -q /INVOICE/BILL_INVOICE.bill_ref_no
> I get all the 22 documents. If I run
>     xindiceadmin xpath -c /db/lucent -q /INVOICE/[BILL_INVOICE.bill_ref_no="2"]
> I get nothing. I hope this query is correct.
>
> What do you mean by 'VM startup is "hiding" the result'? Could this
> be what is happening in my case?
>
> Thanks,
> Sreeni
>
> -----Original Message-----
> From: Mark J. Stang [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 04, 2002 4:50 PM
> To: [email protected]
> Subject: Re: indexing/xpath query question
>
> How much time is it really taking? It may be fast enough that
> the VM startup is "hiding" the result. In Kimbro's example,
> he did a complete search of 149,025 documents in less than
> 12 minutes. If you have 2,000 documents, then it could
> take 1/75 of the time, or about 10 seconds. If you are printing
> the output to the screen it may seem the same. Try doing an XPath
> search for the last document added, with and without the index.
> Just the one document, not all of them.
>
> HTH,
> Mark
>
> Kimbro did some tests last September. He wrote this:
> "As I've been working out some issues with the CORBA system I've been
> working on getting larger document sets into the server. My largest set
> right now is 149,025 documents in a single collection. The server can
> easily handle more documents; this is just the largest dataset I have
> available right now. Here are some stats to give us a better idea where we
> stand. These are run against the current CVS version with one exception: I
> used OpenORB for the server ORB instead of JacORB. JacORB was still used
> for the client. It's likely we'll need to switch to OpenORB overall, as
> even the latest JacORB leaks memory on the server.
> Computer: 750MHz P3, 256MB RAM laptop running Mandrake Linux 8
> JDK: Sun 1.3.0_04
> Dataset size: 149,025 documents, 601MB
> Insertion time (no indexes): 1 hour 45 minutes, which is roughly 1,424 docs
> per minute or 24 per second
> Collection size: 657MB
> Document retrieval: 2 seconds (including VM startup, which is most of the time)
> Full collection scan query /disc[id = '11041c03']: 12 minutes
> Index creation: 13.5 minutes
> Index based query /disc[id = '11041c03']: 2.12 seconds (including VM
> startup, which is most of that time)
> Index size: 164MB
>
> The data set consists of documents similar to the following.
>
> <?xml version="1.0"?>
> <disc>
>   <id>11041c03</id>
>   <length>1054</length>
>   <title>Orchestral Manoeuvres In The Dark / The OMD Remixes (Single)</title>
>   <genre>cddb/misc</genre>
>   <track index="1" offset="150">Enola Gay (OMD vs Sash! Radio Edit)</track>
>   <track index="2" offset="18790"> (2)Souvenir (Moby Remix)</track>
>   <track index="3" offset="39790"> (3)Electricity (The Micronauts Remix)</track>
> </disc>
>
> Kimbro Staken"
>
> Sreeni Chippada wrote:
> >
> > Hi,
> > I am new to Xindice. I added a few documents as DOMs and ran an XPath
> > query successfully. Then I added an index on the collection and ran the
> > query. It takes the same amount of time.
> >
> > Here are the details:
> >
> > My document structure looks like this:
> >
> > <INVOICE>
> >   <BILL_INVOICE.bill_ref_no>2</BILL_INVOICE.bill_ref_no>
> >   .
> >   .
> >   .
> > </INVOICE>
> >
> > I loaded about 2000 documents.
> >
> > When I run 'xindiceadmin xpath -c /db/test -q
> > /INOVICE/BILL_INVOICE.bill_ref_no' I get all the
> > /INOVICE/BILL_INVOICE.bill_ref_no elements.
> >
> > Then I ran the following command to add an index:
> >
> > xindiceadmin ai -c /db/test -n BillRefNum -p
> > /INOVICE/BILL_INVOICE.bill_ref_no
> >
> > Now if I run the same query as above, it still takes the same time.
> > Looks like it's not using the index I created.
> >
> > Appreciate any help.
> >
> > Thanks,
> > Sreeni
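P.S. To put rough numbers on the index-versus-scan point at the top of this mail: here is a quick sketch in plain Java that counts comparisons instead of doing real I/O, assuming documents are stored in insertion order and the index is a sorted array (both assumptions mine, not a statement about Xindice's actual file layout). It shows why the last document added, not the first, is the interesting test case:

```java
import java.util.stream.IntStream;

public class IndexDemo {
    // Reads a linear scan needs: check each document until the match.
    static int linearReads(int[] docs, int target) {
        int reads = 0;
        for (int d : docs) {
            reads++;
            if (d == target) break;
        }
        return reads;
    }

    // Reads a binary search over a sorted index needs.
    static int binaryReads(int[] sortedIndex, int target) {
        int lo = 0, hi = sortedIndex.length - 1, reads = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            reads++;
            if (sortedIndex[mid] == target) return reads;
            if (sortedIndex[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return reads; // not found
    }

    public static void main(String[] args) {
        // 2300 documents, keys in insertion order
        int[] docs = IntStream.range(0, 2300).toArray();
        System.out.println("linear, last doc:  " + linearReads(docs, 2299));
        System.out.println("binary, last doc:  " + binaryReads(docs, 2299));
        System.out.println("linear, first doc: " + linearReads(docs, 0));
        System.out.println("binary, first doc: " + binaryReads(docs, 0));
    }
}
```

Searching for the first document, the plain scan wins (1 read versus 11 index probes); searching for the last, the index wins by two orders of magnitude (12 probes versus 2300 reads).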
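The same reasoning scales up to Kimbro's 149,025-document test quoted above. A back-of-envelope check (and remember his 2.12-second indexed query mostly measured VM startup, so the real gap is even wider than this shows):

```java
public class ScanVsIndex {
    public static void main(String[] args) {
        int docs = 149_025;
        // Worst-case binary search probes: ceil(log2(docs))
        int probes = 32 - Integer.numberOfLeadingZeros(docs - 1);
        double scanSec = 12 * 60;  // full collection scan: 12 minutes
        double indexSec = 2.12;    // indexed query (mostly VM startup)
        System.out.println("worst-case index probes: " + probes);
        System.out.printf("measured speedup: %.0fx%n", scanSec / indexSec);
    }
}
```

So even at 149,025 documents a binary-searchable index needs at most 18 probes, and the measured end-to-end speedup was already around 340x.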
