Dear all,

We are using Lucene to store 10 million bilingual sentence pairs for 
natural language processing. Each document contains a sentence, its 
translation and a topical code. We want to select sentences containing 
certain words and compute statistics over the topical codes in order to 
detect translations that depend on the topic (for example key => Taste 
(topic: input devices), key => Schlüssel (topic: cryptography)).
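
To illustrate the kind of statistics meant here, the following is a minimal sketch of a tally over topical codes. It assumes the codes of the matching documents have already been read into a list. The class name TopicCodeTally, the method countCodes and the sample code strings are hypothetical and are not part of Lucene or of our actual program. It also assumes a Java version with generics and Map.merge (Java 8 or later).

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: tallies the topical codes
// collected from the documents that matched one search word.
class TopicCodeTally {

    // Returns, for each distinct topical code, how many times it occurs
    // in the given list. A TreeMap keeps the codes in sorted order.
    static Map<String, Integer> countCodes(List<String> codes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String code : codes) {
            // Insert 1 for a new code, otherwise add 1 to the existing count.
            counts.merge(code, 1, Integer::sum);
        }
        return counts;
    }
}
```

For the hypothetical codes "crypto", "input", "crypto" the method returns a count of 2 for "crypto" and 1 for "input". Comparing such counts per translation would show which translation dominates under which topic.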

While the search itself completes in a reasonably short time (about 
500 to 800 ms), we have a performance problem when actually retrieving 
the documents with code like this:

for (int i = nrhits - 1; i >= 0; i--) {
        Document hitDoc = hits.doc(i);       // load the stored fields of hit i
        String code = hitDoc.get("code");    // topical code of this sentence pair
        // ... statistics
}
 
Even when restricting nrhits to 2000, we have to wait 10 to 20 seconds 
just for the retrieval. Since the documents are so short, we would have 
expected retrieval to be quicker. By the way, the loop runs in reverse 
order in the hope of speeding up the retrieval.

We are using the Java version of Lucene 1.4.3 on a Windows PC.
 
Would you recommend using the C version? I suppose it is stable and we 
could reuse the existing database? Any other suggestions?

Thanks for your help!

Wolfgang

 
 
 
