Re: Performance question (Am I doing something wrong?)

Beni Ruef 11 Nov 2002 21:45:01 -0000

On Sunday, November 10, 2002, at 08:45, Mark J. Stang wrote:

Not wrong.   If that is what your tests show, then you are not doing
anything wrong.   Not if your test is "how long will a command-line
query take?"


Got your point, excuse my stupidity ;-)

Try adding in a thousand such documents and
index them.   My guess is that the query time will increase but
it will be hard to measure the increase.

That's correct, but there's a big but: searching for rare words (as in my example) is indeed rather independent of the corpus size, while searching for more common words is another story. The following numbers are with a 30 MB corpus: searching for a word occurring only twice takes ca. 3 seconds, searching for a word occurring 13 times takes 70 seconds, and searching for a word occurring 1300 times takes more than 6 minutes!

This is a typical characteristic of linguistic corpora: although the number of words is virtually unlimited, the number of unique word forms is quite restricted. One of my corpora has 106'000 words but only 19'000 different word forms. It's even worse with word categories or POS tags (the 'pos' attributes in my example), as there are just 50 - 100 of them. Technically speaking, the values used as keys in the index are far from unique -- I don't know how Xindice handles this problem.
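
To make the "far from unique" point concrete, here is a toy sketch of a value index that maps each key to the positions of all tokens carrying it. It is only an illustration under assumptions: the class name, the method name and the sample tags are made up, and it says nothing about how Xindice actually lays out its indexes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PostingListSketch {
    // Hypothetical helper: maps each index key (here a POS tag) to the
    // positions of all tokens that carry that key.
    static Map<String, List<Integer>> buildIndex(String[] tags) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < tags.length; i++) {
            index.computeIfAbsent(tags[i], k -> new ArrayList<>()).add(i);
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up sample data: few distinct keys, many tokens per key.
        String[] tags = {"DET", "NOUN", "VERB", "DET", "NOUN", "ADP", "DET", "NOUN"};
        Map<String, List<Integer>> index = buildIndex(tags);
        System.out.println("distinct keys: " + index.size());
        System.out.println("hits for NOUN: " + index.get("NOUN").size());
    }
}
```

With only 50 - 100 distinct tags in a corpus of this size, each key in such an index would point to thousands of hits, so the lookup itself is cheap but everything done per hit afterwards is not.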

The only accurate way to measure the time is to build a small
program and test the actual call to the database.   Try using
Example1.java, that one has always worked well for me.

I did this and it takes two seconds (measured between getCollection() and getIterator()) for my "rare word" query.
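
For reference, a minimal sketch of that kind of measurement, assuming the timed section is wrapped in a task object. The names QueryTimer and timeMillis are made up for illustration, and the sleep is only a stand-in for the real database calls, which would need the Xindice libraries on the classpath.

```java
import java.util.concurrent.Callable;

public class QueryTimer {
    // Hypothetical helper: runs the task once and returns the elapsed
    // wall-clock time in milliseconds.
    static <T> long timeMillis(Callable<T> task) throws Exception {
        long start = System.nanoTime();
        task.call();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload. In a real test this would be the calls from
        // getCollection() through getIterator().
        long elapsed = timeMillis(() -> {
            Thread.sleep(50);
            return null;
        });
        System.out.println("elapsed ms: " + elapsed);
    }
}
```

Timing inside the program like this leaves out the JVM startup, which a command-line measurement always includes.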

I had a problem with Mac OS X with every query taking
forever.  Turned out that the database initialization was being
done every time and it took forever.

But why? Isn't the Java code identical? (Disclaimer: I know nothing about Java ;-)

The same query was faster on
Windows and Linux.

I did my benchmarks on a slow Linux machine (AMD-K6 350 MHz) and they take only twice as long as on my (fast :-) iBook.

I ended up caching my collections.


How do you do this?

This is from an e-mail dated September 5, 2001.
Kimbro Staken wrote:
computer: 750MHZ P3 256MB RAM Laptop running Mandrake Linux 8
jdk: Sun 1.3.0_04
Dataset size: 149,025 documents 601MB
Insertion time (no indexes): 1 hour 45 minutes which is roughly 1,424 docs per minute or 24 per second.
Collection size: 657MB
Document retrieval: 2 seconds (including VM startup which is most of the time)
Full collection scan query /disc[id = '11041c03']: 12 minutes
Index creation: 13.5 minutes
Index based query /disc[id = '11041c03']: 2.12 seconds (including VM startup which is most of that time)


I'd really like such a fast response :-) but these are unique keys...

Cheers
-Beni
