Is there a way to get an approximate measure of the memory used by an indexed field (or fields)? I'm looking into a problem with one of our Solr indexes: a Japanese query causes the replicas to run out of memory while they process it.

Also, is there a way to change or disable the timeout in the Solr Console? When I run this query there it always times out, which is a real pain, because I know the query will complete eventually.

I have this field type:

    <!-- Field type to support Asian languages
         Transforms Traditional Han to Simplified Han
         Transforms Hiragana to Katakana
         tokenizes languages to unigrams and bigrams for analysis and searching -->
    <fieldtype name="text_deep_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
      <analyzer type="index">
        <!-- remove spaces between CJK characters -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])"
                    replacement="$1"/>
        <tokenizer class="solr.ICUTokenizerFactory" />
        <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- Transform Traditional Han to Simplified Han -->
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
        <!-- Transform Hiragana to Katakana just as was done for Endeca -->
        <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
        <filter class="solr.ICUFoldingFilterFactory"/> <!-- NFKC, case folding, diacritics removed -->
        <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
      </analyzer>
      <analyzer type="query">
        <!-- remove spaces between CJK characters -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])"
                    replacement="$1"/>
        <tokenizer class="solr.ICUTokenizerFactory" />
        <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.ICUTokenizerFactory" />
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- Transform Traditional Han to Simplified Han -->
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
        <!-- Transform Hiragana to Katakana just as was done for Endeca -->
        <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
        <filter class="solr.ICUFoldingFilterFactory"/> <!-- NFKC, case folding, diacritics removed -->
        <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
      </analyzer>
    </fieldtype>

I have a number of fields of this type. The CJKBigramFilterFactory can generate a lot of tokens. I'm concerned that this combination is what is killing our Solr instances.

This is the query that is causing my problems:

    モノクローナル抗ニコチン性アセチルコリンレセプター(??7サブユニット)抗体 マウス宿主抗体

We are using Solr 7.2 in a SolrCloud.
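
To give a rough sense of the token volume I'm worried about, here is a back-of-the-envelope Python sketch. It does not run the real Solr analysis chain: it ignores the ICU tokenizer, the charFilter, the synonym filter, and the transforms, and simply assumes that every run of n CJK characters yields n unigrams plus n-1 bigrams, which is roughly what I understand outputUnigrams="true" to do. The names `CJK_RUN` and `estimate_tokens` are made up for illustration, and the code-point ranges only approximate the \p{IsHan}, \p{IsHiragana}, \p{IsKatakana} and \p{IsHangul} classes.

```python
import re

# Approximate CJK ranges (an assumption, not the exact Unicode script sets):
# Hiragana + Katakana blocks, CJK Unified Ideographs, Hangul syllables.
CJK_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]+")


def estimate_tokens(text):
    """Return (unigrams, bigrams) under the n / n-1 per-run assumption."""
    unigrams = 0
    bigrams = 0
    for run in CJK_RUN.findall(text):
        unigrams += len(run)       # one unigram per character
        bigrams += len(run) - 1    # one bigram per adjacent pair
    return unigrams, bigrams


if __name__ == "__main__":
    query = "モノクローナル抗ニコチン性アセチルコリンレセプター(??7サブユニット)抗体 マウス宿主抗体"
    uni, bi = estimate_tokens(query)
    print(f"approx. unigrams={uni}, bigrams={bi}, total={uni + bi}")
```

Even if this estimate is off, it suggests the term count per query is roughly double the character count, and that is multiplied across every field of this type that the query hits.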