On Sunday, November 10, 2002, at 08:45, Mark J. Stang wrote:
Not wrong. If that is what your tests show, then you are not doing anything wrong. Not if your test is "how long will a command-line query take?"
Got your point, excuse my stupidity ;-)
Try adding in a thousand such documents and index them. My guess is that the query time will increase but it will be hard to measure the increase.
That's correct, but there's a big "but": searching for rare words (as in my example) is indeed largely independent of the corpus size, whereas searching for more common words is another story (the following numbers are for a 30 MB corpus):
Searching for a word occurring only twice takes ca. 3 seconds, searching for a word occurring 13 times takes 70 seconds, and searching for a word occurring 1300 times takes more than 6 minutes!
This is a typical characteristic of linguistic corpora: although the number of words is virtually unlimited, the number of unique word forms is quite restricted. One of my corpora has 106'000 words but only 19'000 different word forms. It's even worse with word categories or POS tags (the 'pos' attributes in my example), as there are just 50 to 100 of them. Technically speaking, the values used as keys in the index are far from unique -- I don't know how Xindice can handle this problem.
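To illustrate what I mean by non-unique keys, here is a minimal sketch in Python. The toy tokens and the build_index() helper are made up for illustration, and this is certainly not how Xindice builds its indexes; it only shows that an index over a few distinct values ends up with long hit lists per key. My assumption is that finding the key is cheap, while the cost grows with the number of hits behind it.

```python
from collections import defaultdict

# Toy corpus: (word form, POS tag) pairs. Made-up data, for illustration only.
tokens = [
    ("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
    ("on", "PREP"), ("the", "DET"), ("mat", "NOUN"),
    ("the", "DET"), ("dog", "NOUN"), ("slept", "VERB"),
]


def build_index(pairs, field):
    """Map each distinct key to the list of token positions holding it."""
    index = defaultdict(list)
    for position, pair in enumerate(pairs):
        index[pair[field]].append(position)
    return index


word_index = build_index(tokens, 0)  # keyed by word form
pos_index = build_index(tokens, 1)   # keyed by POS tag

# Few distinct keys means long hit lists behind each key.
print(len(tokens), "tokens,", len(word_index), "word forms,",
      len(pos_index), "POS tags")
print("hits for 'mat':", len(word_index["mat"]))
print("hits for 'the':", len(word_index["the"]))
print("hits for NOUN:", len(pos_index["NOUN"]))
```

Even in this tiny example the POS index has fewer keys than the word index, and a common key such as 'the' or NOUN points at several positions where a rare word points at one.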
The only accurate way to measure the time is to build a small program and test the actual call to the database. Try using Example1.java, that one has always worked well for me.
I did this and it takes two seconds (measured between getCollection() and getIterator()) for my "rare word" query.
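The idea of timing the database call itself, rather than the whole command-line run, can be sketched as follows. This is Python with stand-ins, not the Java I ran: open_collection(), run_query(), the URI and the toy documents are all hypothetical placeholders and not Xindice calls. Timing the two steps separately shows which one dominates.

```python
import time


def open_collection(uri):
    # Hypothetical stand-in for opening the collection; the sleep simulates
    # initialization cost.
    time.sleep(0.01)
    return {"uri": uri,
            "docs": ["<w pos='NN'>corpus</w>", "<w pos='VB'>search</w>"]}


def run_query(collection, needle):
    # Hypothetical stand-in for the query: a plain substring scan.
    return [doc for doc in collection["docs"] if needle in doc]


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# The URI is a made-up example.
collection, open_secs = timed(open_collection, "xmldb:example:///db/corpus")
hits, query_secs = timed(run_query, collection, "corpus")
print(f"open: {open_secs:.3f} s, query: {query_secs:.6f} s, hits: {len(hits)}")
```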
I had a problem with Mac OS X with every query taking forever. Turned out that the database initialization was being done every time and it took forever.
But why? Isn't the Java code identical? (Disclaimer: I know nothing about Java ;-)
The same query was faster on Windows and Linux.
I did my benchmarks on a slow Linux machine (AMD-K6 350 MHz) and they take only twice as long as on my (fast :-) iBook.
I ended up caching my collections.
How do you do this?
This is from an e-mail dated September 5, 2001.
Kimbro Staken wrote:
computer: 750 MHz P3 256 MB RAM laptop running Mandrake Linux 8
jdk: Sun 1.3.0_04
Dataset size: 149,025 documents 601 MB
Insertion time (no indexes): 1 hour 45 minutes which is roughly 1,424 docs per minute or 24 per second.
Collection size: 657 MB
Document retrieval: 2 seconds (including VM startup which is most of the time)
Full collection scan query /disc[id = '11041c03']: 12 minutes
Index creation: 13.5 minutes
Index based query /disc[id = '11041c03']: 2.12 seconds (including VM startup which is most of that time)
I'd really like such a fast response :-) but these are unique keys...
Cheers -Beni
