On 31.07.2012 12:10, Ian Lea wrote:
Hi Ian,
> Lucene 4.0 allows you to use custom codecs and there may be one that
> would be better for this sort of data, or you could write one.
>
> In your tests is it the searching that is slow or are you reading lots
> of data for lots of docs? The latter is always likely to be slow.
> General performance advice as in
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed may be
> relevant. SSDs and loads of RAM never hurt.

You are quite right: the slower searches on that index return many results from many docs.

However, I am still wondering about the theoretical implications. With a small vocabulary and many tokens, an inverted index ends up with rather long lists of occurrences for some, many, or all of the search terms, depending on the actual distribution.

Thanks for the pointer to the codecs in Lucene 4; I suppose they are the right place to attack this scenario.

It may be a silly question, but one that might be of interest to the whole community ;-) : can someone point me to in-depth documentation of Lucene 4 codecs, ideally covering both the theoretical background and the implementation? There are numerous helpful blog entries, presentations, etc. available on the net, but if there is a central reference, I have not been able to find it.

Thanks!

Best regards,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
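
P.S. To make the concern about long occurrence lists concrete, here is a minimal sketch of a toy inverted index in plain Python. It is not Lucene code and says nothing about how Lucene stores postings; the random corpus, the helper names (build_postings, average_postings_length, make_corpus) and all sizes are assumptions chosen purely for illustration. It only shows that, for a fixed number of tokens, the average list length grows as the vocabulary shrinks (roughly tokens divided by distinct terms).

```python
from collections import defaultdict
import random


def build_postings(docs):
    """Map each term to its list of (doc_id, position) occurrences."""
    postings = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        for pos, term in enumerate(tokens):
            postings[term].append((doc_id, pos))
    return postings


def average_postings_length(docs):
    """Average number of occurrences per distinct term in the corpus."""
    postings = build_postings(docs)
    total = sum(len(occurrences) for occurrences in postings.values())
    return total / len(postings)


def make_corpus(vocab_size, n_docs=100, doc_len=50, seed=0):
    """Hypothetical corpus: tokens drawn uniformly from a vocabulary."""
    rng = random.Random(seed)
    vocab = [f"t{i}" for i in range(vocab_size)]
    return [[rng.choice(vocab) for _ in range(doc_len)] for _ in range(n_docs)]


# Same number of tokens (100 docs x 50 tokens), different vocabulary sizes.
small = average_postings_length(make_corpus(vocab_size=10))
large = average_postings_length(make_corpus(vocab_size=1000))
print("avg list length, 10-term vocabulary:  ", small)
print("avg list length, 1000-term vocabulary:", large)
```

A real corpus will have a skewed rather than uniform term distribution, so some lists would be far longer than this average suggests.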