On Aug 24, 2005, at 3:32 AM, Wolfgang Täger wrote:
Dear all,
we are using Lucene to store 10 million bilingual sentence pairs for
natural language processing. Each document contains a sentence, its
translation and a topical code. We want to select sentences containing
certain words and compute statistics over the topical codes in order to
detect translations that depend on the topic (like key => Taste (topic:
input devices), key => Schlüssel (topic: cryptography)).
While the search is carried out in a reasonably short time (about
500 to 800 ms), we have a performance problem with actually retrieving
the documents, using code like this:
    for (int i = nrhits - 1; i >= 0; i--) {
        Document hitDoc = hits.doc(i);
        String code = hitDoc.get("code");
        // ... statistics
    }
Even when restricting nrhits to 2000, we have to wait 10 to 20 seconds
just for the retrieval. Since the documents are so short, we would have
expected quicker retrieval. BTW, the loop runs in reverse order in the
hope of accelerating the retrieval.
How many documents are you trying to retrieve? I think you'll have
much better luck if you walk the documents in ascending Hits order
rather than backwards, as Hits caches documents with the presumption
you'll move forward through them. I'd be curious to see how much (or if)
moving forwards through Hits helps.
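To make the forward walk concrete, here is a minimal sketch of the same tally loop running in ascending order. It is plain Java with no Lucene dependency, so it only shows the loop shape: CodeSource, CodeTally and fromList are hypothetical names made up for illustration, where length() stands in for hits.length() and code(i) for hits.doc(i).get("code") from your snippet. It makes no claim about how much time this saves; that still has to be measured.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the search results; not a Lucene type.
// With Lucene's Hits, length() would map to hits.length() and
// code(i) to hits.doc(i).get("code").
interface CodeSource {
    int length();
    String code(int i);
}

class CodeTally {
    // Walk the results from first to last (ascending order) and
    // count how often each topical code occurs, up to `limit` hits.
    static Map<String, Integer> tally(CodeSource source, int limit) {
        Map<String, Integer> counts = new TreeMap<>();
        int n = Math.min(limit, source.length());
        for (int i = 0; i < n; i++) {
            String code = source.code(i);
            if (code == null) {
                continue; // this hit has no "code" value; skip it
            }
            counts.merge(code, 1, Integer::sum);
        }
        return counts;
    }

    // List-backed source, used here only to exercise the loop.
    static CodeSource fromList(final List<String> codes) {
        return new CodeSource() {
            public int length() { return codes.size(); }
            public String code(int i) { return codes.get(i); }
        };
    }

    public static void main(String[] args) {
        CodeSource demo = fromList(Arrays.asList("cryptography", "input devices", "cryptography"));
        System.out.println(tally(demo, 2000));
    }
}
```

Capping the loop with Math.min keeps the 2000-hit restriction from your test while still starting at index 0 and moving forward.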
Erik