Re: Small Vocabulary

Carsten Schnober Thu, 02 Aug 2012 01:19:37 -0700

Am 31.07.2012 12:10, schrieb Ian Lea:

Hi Ian,


> Lucene 4.0 allows you to use custom codecs and there may be one that
> would be better for this sort of data, or you could write one.
> 
> In your tests is it the searching that is slow or are you reading lots
> of data for lots of docs?  The latter is always likely to be slow.
> General performance advice as in
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed may be
> relevant.  SSDs and loads of RAM never hurt.

You are very right, therer are many results from many docs for the
slower searches performed on that index. However, I am still wondering
about the theoretical implications: having a small vocabulary with many
tokens in an inverted index would yield a rather long list of
occurrences for some/many/all (depending on the actual distribution) of
the search terms.
Thanks for your pointer to the codecs in Lucene 4, I suppose that this
will be the actual point to attack for that scenario. It may be a silly
question, but one that might be of interest for the whole community ;-)
: can someone point me to an in-depth documentation of Lucene 4 codecs,
ideally covering both theoretical backgrounds and implementation? There
are numerous helpful blog entries, presentations, etc. available on the
net, but in case there is some central instance, I have not been able to
find it anyway.
Thanks!
Best regards,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Small Vocabulary

Reply via email to