On Aug 24, 2005, at 3:32 AM, Wolfgang Täger wrote:

Dear all,

We are using Lucene to store 10 million bilingual sentence pairs for natural language processing. Each document contains a sentence, its translation, and a topical code. We want to select sentences containing certain words and compute statistics over the topical codes in order to detect translations that depend on the topic (e.g. key => Taste (topic: input devices), key => Schlüssel (topic: cryptography)).

While the search itself completes in a reasonably short time (about 500 to 800 ms), we have a performance problem when actually retrieving the documents with code like:

for (int i = nrhits - 1; i >= 0; i--) {
    Document hitDoc = hits.doc(i);
    String code = hitDoc.get("code");
    // ... statistics
}

Even when restricting nrhits to 2000, we have to wait 10 to 20 seconds just for the retrieval. Since the documents are so short, we would have expected quicker retrieval. By the way, the loop runs in reverse order in the hope of accelerating the retrieval.

How many documents are you trying to retrieve? I think you'll have much better luck if you walk the documents in ascending Hits order rather than backwards, as Hits caches documents on the presumption that you'll move forward through them. I'd be curious to see how much (or if) moving forward through Hits helps.
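
For illustration, here is a minimal sketch of the same loop walking the hits in ascending order and tallying the topical codes. It is only a sketch under assumptions: HitSource is a hypothetical stand-in for the part of Hits used here (its code(i) plays the role of hits.doc(i).get("code")), and CodeStats and countCodes are made-up names, so the example runs without Lucene on the classpath. Whether it actually speeds things up would still need to be measured.

```java
import java.util.Map;
import java.util.TreeMap;

final class CodeStats {

    /** Hypothetical stand-in for the part of Lucene's Hits used in this thread. */
    interface HitSource {
        int length();          // plays the role of hits.length()
        String code(int i);    // plays the role of hits.doc(i).get("code")
    }

    /**
     * Walks the hits in ascending order (0, 1, 2, ...) and counts how often
     * each topical code occurs among at most maxHits hits.
     */
    static Map<String, Integer> countCodes(HitSource hits, int maxHits) {
        Map<String, Integer> counts = new TreeMap<>();
        int n = Math.min(maxHits, hits.length());
        for (int i = 0; i < n; i++) {          // forward, not backward
            String code = hits.code(i);
            if (code != null) {                // skip documents without a code
                counts.merge(code, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With a real Hits object, the loop body would call hits.doc(i).get("code") as in the original snippet; only the loop direction changes.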

    Erik
