Sharon,
The system that we are working on does recordset paging, so only the first
100 are returned on the first page. But then you are able to page through
the rest of the of the results.
The Lucene index is stored as a standard FSDirectory on a RAID filesystem,
and it used only for the initial name lookup, to discover the SpeciesID.
Further requests are handled by the mySQL backend database.
The way that Lucene returns results by default is based on a document
scoring algorithm, and so results most likely are out of alphabetical order.
I think that we will likely have to use a custom hitcollector object to do
the sorting.
Many thanks, Peter.
-Original Message-
From: Xiaohong Yang (Sharon) [mailto:[EMAIL PROTECTED]
Sent: 25 January 2005 00:37
To: Lucene Users List
Subject: Re: Sort Performance Problems across large dataset
Hi Peter,
I just got on the list a few hours ago. I am still reading the source code.
I am not going to send this to the list.
I would like to know the ".2 sec" query time for 2 million fields, should it
display only the first page (100 or so), not the whole 3000 found? It is
very fast I agree.
If the alphabetic index display only a link, not the content, then it should
not be very slow since you only need to sort part of what a user need. May
be display only the first "A" page, as it did with the regular scored
results. Just my thought, might not work for you.
Do you store the Lucene index in the database or in a text file?
Best,
Sharon
LangPower Computing, Inc.
http://www.indexingonline.com
Peter Hollas <[EMAIL PROTECTED]> wrote:
I am working on a public accessible Struts based species database project
where the number of species names is currently at 2.3 million, and in the
near future will be somewhere nearer 4 million (probably the largest there
is). The species names are typically 1 to 7 words in length, and the broad
requirement is to be able to do a fulltext search across them. It is also
necessary to sort the results into alphabetical order by species name.
Currently we can issue a simple search query and expect a response back in
about 0.2 seconds (~3,000 results) with the Lucene index that we have built.
Lucene gives a much more predictable and faster average query time than
using standard fulltext indexing with mySQL. This however returns result in
score order, and not alphabetically.
To sort the resultset into alphabetical order, we added the species names as
a seperate keyword field, and sorted using it whilst querying. This solution
works fine, but is unacceptable since a query that returns thousands of
results can take upwards of 30 seconds to sort them.
My question is whether it is possible to somehow return the names in
alphabetical order without using a String SortField. My last resort will be
to perform a monthly index rebuild, and return results by index order (about
a day to re-index!). But ideally there might be a way to modify the Lucene
API to incorporate a scoring system in a way that scores by lexical order.
Any ideas are appreciated!
Many thanks, Peter.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]