If I understand correctly, you are comparing the size of the .frq file when
WhitespaceTokenizer is used against its size with the CJK tokenizer?

I'd bet this is because WhitespaceTokenizer creates far fewer terms than the
CJK one. Whitespace tokenizes the text by splitting on whitespace, while
CJK does a sort of N-gram tokenization, which usually produces many more
terms. This affects the .frq file because many more posting lists are
created, and those are stored in the .frq file. Even with
omitTermFreqAndPositions set, the .frq file still holds the document IDs of
every posting list, so more postings means a larger file.
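
To illustrate the effect, here is a minimal sketch in Python. It is not
Lucene's CJKTokenizer, only a simplified overlapping-bigram tokenizer, and
the helper names and the two sample documents are made up for the example.
It builds a toy inverted index with each tokenizer and counts the unique
terms and the posting entries (term/document pairs):

```python
from collections import defaultdict


def whitespace_tokens(text):
    """Split on whitespace only, roughly what WhitespaceTokenizer does."""
    return text.split()


def bigram_tokens(text):
    """Emit overlapping character bigrams for each whitespace-delimited run.

    A simplification of N-gram style CJK tokenization (an assumption, not
    the actual Lucene implementation).
    """
    tokens = []
    for run in text.split():
        if len(run) < 2:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


def build_postings(docs, tokenize):
    """Map each term to the set of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return postings


def count_entries(postings):
    """Total number of (term, document) pairs across all posting lists."""
    return sum(len(doc_ids) for doc_ids in postings.values())


# Hypothetical sample documents with no whitespace between words.
docs = ["東京都は日本の首都です", "京都は古い都市です"]

ws = build_postings(docs, whitespace_tokens)
bg = build_postings(docs, bigram_tokens)

print("whitespace:", len(ws), "terms,", count_entries(ws), "posting entries")
print("bigram:    ", len(bg), "terms,", count_entries(bg), "posting entries")
```

On text like this the bigram index has many more terms and posting entries
than the whitespace one, which is the kind of growth that would show up in
the .frq file.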

Check whether the .tii and .tis files differ as well, and whether their
difference is of the same order as the .frq difference (e.g. if they are 2x
larger w/ CJK, the .frq should be larger by a similar factor). If so, I
believe this is the reason.
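
One way to run that comparison is sketched below, assuming you have the two
indexes in separate directories. The function names and the directory paths
in the usage comment are hypothetical; the script only sums file sizes per
extension and does not read the Lucene files themselves:

```python
import os
from collections import defaultdict


def size_by_extension(index_dir, extensions=(".frq", ".tis", ".tii")):
    """Sum the sizes in bytes of files in index_dir, grouped by extension."""
    totals = defaultdict(int)
    for name in os.listdir(index_dir):
        ext = os.path.splitext(name)[1]
        if ext in extensions:
            totals[ext] += os.path.getsize(os.path.join(index_dir, name))
    return dict(totals)


def size_ratios(baseline, other):
    """Return other/baseline size ratio for each extension present in both."""
    return {
        ext: other[ext] / baseline[ext]
        for ext in baseline
        if ext in other and baseline[ext] > 0
    }


# Usage (paths are placeholders for your own index directories):
#   ws = size_by_extension("/path/to/whitespace-index")
#   cjk = size_by_extension("/path/to/cjk-index")
#   print(size_ratios(ws, cjk))
```

If the ratios for .tis/.tii and .frq come out in the same ballpark, the
extra terms are the likely explanation.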

Shai

On Tue, Jan 18, 2011 at 2:13 PM, dan sutton <danbsut...@gmail.com> wrote:

> Hi,
>
> We're trying to create a large index via solr for trends and notice
> that we have a large '.frq' file after doing the following:
>
>
> make all text fields index="true", stored="false",
> omitTermFreqAndPositions="true" omitNorms="true" termPositions="false"
> termOffsets="false" termVectors="false"
>
> We are using a variation on org.apache.lucene.analysis.cjk and notice
> that the .frq is about 4 times larger than with, for example, the
> WhitespaceTokenizer.
>
>
> Considering that omitTermFreqAndPositions="true" is set for the text
> fields, I'd have thought the file would only contain what the file
> format documentation describes: "If omitTf were true it would be this
> sequence of VInts instead:"
> (http://lucene.apache.org/java/2_9_1/fileformats.html#Frequencies)
>
>
> Can anyone suggest how I can reduce the size of this file?
>
>
> Many thanks,
> Dan
>
> Lucene Specification Version: 2.9.1
> Solr Specification Version: 1.4.0.2010.09.10.17.10.36
