If I understand correctly, you compare the size of the .frq when WhitespaceTokenizer is used, vs the CJK ones?
I'd bet this is because WhitespaceTokenizer creates far less terms than the CJK one. Whitespace tokenizes the text by separating on whitespace, while CJK does sort of N-Gram tokenization, which usually leads to much more terms created. This affects the .frq file in that there are much more posting lists created, which are stored in the .frq file. See if the .tii and .tis files differ and if their difference is the same order of the .frq differences (e.g. if they are 2x larger w/ CJK, so .frq should be of the same order of difference), then I believe this is the reason. Shai On Tue, Jan 18, 2011 at 2:13 PM, dan sutton <danbsut...@gmail.com> wrote: > Hi, > > We're trying to create a large index via solr for trends and notice > that we have a large '.frq' file after doing the following: > > > make all text fields index="true", stored="false", > omitTermFreqAndPositions="true" omitNorms="true" termPositions="false" > termOffsets="false" termVectors="false" > > We are using a variation on org.apache.lucene.analysis.cjk and notice > that the .frq is about 4 time larger than, for example, the > WhiteSpaceTokenizer. > > > Considering that with omitTermFreqAndPositions="true" for the text > fields I'd have thought this should be : "If omitTf were true it would > be this sequence of VInts instead:" > (http://lucene.apache.org/java/2_9_1/fileformats.html#Frequencies) > > > Can anyone suggest how I can reduce the size of this file? > > > Many thanks, > Dan > > Lucene Specification Version: 2.9.1 > Solr Specification Version: 1.4.0.2010.09.10.17.10.36 > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >