On Thu, Sep 19, 2013 at 7:18 PM, Marcos Juarez Lopez <mjua...@gmail.com> wrote:
> Hi,
>
> I'm trying to optimize an index we have, and one thing that has come up
> recently is that we're not really using term frequencies, and we don't need
> any scoring.  We noticed that the term frequencies (.doc files) are a
> significant chunk of the total index size, and we'd like to reduce those,
> or eliminate them, if at all possible.

You should index the field with IndexOptions.DOCS_ONLY; you will still
have .doc files, but they will be smaller since they won't store
frequencies.  Also, you won't have .pos files anymore (unless other
fields are still indexed "normally", i.e. with positions).
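
To see why matching without scoring does not need frequencies, here is a
toy sketch in plain Java.  It is only an illustration: the class
PostingsSketch and its methods are made up for this example, and the
"stored ints" count is a stand-in for index size, not Lucene's actual
on-disk postings format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical toy inverted index; not part of Lucene.
class PostingsSketch {
    // term -> (docId -> term frequency)
    private final Map<String, SortedMap<Integer, Integer>> postings = new TreeMap<>();

    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Matching only reads the doc IDs; the frequencies are never consulted.
    List<Integer> matchingDocs(String term) {
        SortedMap<Integer, Integer> p = postings.get(term);
        if (p == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(p.keySet());
    }

    // Integers a naive encoding would store: one per posting (doc ID only),
    // or two per posting (doc ID plus frequency).
    int storedInts(boolean withFreqs) {
        int count = 0;
        for (SortedMap<Integer, Integer> p : postings.values()) {
            count += withFreqs ? 2 * p.size() : p.size();
        }
        return count;
    }

    public static void main(String[] args) {
        PostingsSketch index = new PostingsSketch();
        index.addDocument(0, "apple apple banana");
        index.addDocument(1, "banana cherry");
        System.out.println(index.matchingDocs("banana")); // [0, 1]
        System.out.println(index.storedInts(true));       // 8
        System.out.println(index.storedInts(false));      // 4
    }
}
```

The set of matching documents is the same either way; dropping the
frequencies only removes data that scoring would have used.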

You should also omit norms, since they are only used for scoring (no
more .nrm files, or smaller ones if other fields still have norms).

> We don't do any sort of ranking, or scoring, and so I believe wouldn't need
> to store, or to use, any term frequencies (please correct me if I'm wrong
> on this assumption). The way our indexes work, we want to always return all
> matching documents, regardless of the amount of documents returned.

My silly pet peeve: it really should be "number of documents" not
"amount of documents".  You can have an amount of uncountable things
like water and happiness, but things that can be counted are "numbers
of ...".

> I've been looking at several things, specifically the
> FieldInfo.IndexOptions and creating a custom FieldType that implements
> IndexableFieldType, so that it would not store any of the TermVector info.
>  However, I want to make sure I'm on the right path, before I start
> changing our app.

That's exactly the right approach.  You can fork an existing FieldType
and tweak it, e.g.:

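  // Assuming Lucene 4.x; imports: org.apache.lucene.document.FieldType,
  // org.apache.lucene.document.TextField, org.apache.lucene.index.FieldInfo.IndexOptions
  FieldType myType = new FieldType(TextField.TYPE_NOT_STORED);
  myType.setIndexOptions(IndexOptions.DOCS_ONLY);  // doc IDs only: no freqs, no positions
  myType.setOmitNorms(true);                       // no norms for this field
  myType.freeze();                                 // prevent later changes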

Do that once up front in your app, then, per doc, be sure to use myType, e.g.:

  doc.add(new Field("body", contents, myType));

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
