On 11/22/2011 7:54 PM, Ellery Leung wrote:
> I am searching for an email address: off...@officeofficeoffice.com. If I
> search for any text under 20 characters, results are returned. But when I
> search for the whole string, off...@officeofficeoffice.com, no results are
> returned. As you can see in the "index" part of the schema, when I search
> for the whole string, it matches the analyzer chain up to
> NGramFilterFactory. But after NGram, no results are found.
> Here are my questions:
> - Is this behavior normal?
Yes, I would expect that behavior to be normal. Your query has to match
the tokens produced by the entire index analyzer chain, not an
intermediate stage of it.
> - In order to get a match for "off...@officeofficeoffice.com", does that
>   mean I have to make maxGramSize larger (like 70)?
If you were to increase the maxGramSize to 70, you would get a match in
this case, but your index might get a lot larger, depending on what's in
your source data. That's probably not the right approach, though.
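To make the failure concrete, here is a rough Python sketch of what solr.NGramFilterFactory emits for a single keyword token. The token and sizes here are illustrative assumptions, not your actual data (your real address is masked in the archive):

```python
def ngrams(token, min_size, max_size):
    """Roughly what solr.NGramFilterFactory emits for one token:
    every substring whose length is between min_size and max_size."""
    return [token[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(token) - n + 1)]

# A made-up 22-character token, longer than maxGramSize=20.
token = "officeofficeoffice.com"
indexed = ngrams(token, 1, 20)

# Every substring of 20 characters or fewer is in the index...
assert "office" in indexed
# ...but the full 22-character token never is, so a whole-string query
# that is analyzed without NGram cannot match anything in the index.
assert token not in indexed
```

This is exactly why short queries match and the full string does not.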
In general, you want to have your index and query analyzer chains
exactly the same. There are some exceptions, but I don't think the
NGram filter is one of them. The synonym filter and WordDelimiterFilter
are examples where it is expected that your index and query analyzer
chains will be different.
Add the NGram and CommonGrams filters to the query chain, and searches
should start working. Alternatively, use a single analyzer for both, like
the following. You wouldn't even need to reindex, since the index
analyzer would be unchanged.
<fieldType name="substring_search" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory"
        mapping="../../filters/filter-mappings.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory"
        words="../../filters/stopwords.txt" ignoreCase="true"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="20"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Regarding your NGram filter, I would actually increase the minGramSize
to at least 2 and decrease the maxGramSize to something like 10 or 15,
then reindex.
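To see why the gram sizes matter for index size, here is a back-of-the-envelope sketch. The 30-character token length is an arbitrary assumption for illustration:

```python
def gram_count(length, min_size, max_size):
    """How many n-grams a single token of the given length produces:
    a token of length L yields (L - n + 1) grams of each size n."""
    top = min(max_size, length)
    return sum(length - n + 1 for n in range(min_size, top + 1))

# For one hypothetical 30-character token:
print(gram_count(30, 1, 20))  # current settings (1..20): 410 terms
print(gram_count(30, 2, 10))  # tighter range (2..10): 225 terms
print(gram_count(30, 1, 70))  # maxGramSize=70: 465 terms, every substring
```

Every one of those terms is stored and searchable, which is why a tighter gram range keeps the index smaller.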
An additional note: CommonGrams may not be all that useful unless you
are indexing large numbers of huge documents, like entire books. This
particular fieldType is not suitable for full text anyway, since it uses
KeywordTokenizer. Consider removing CommonGrams from this fieldType and
reindexing. Unless you are dealing with large amounts of text, consider
removing it from the entire schema. If you do remove it, it's usually
not a good idea to replace it with a StopFilter: the index size
reduction from stopword removal is usually not worth the potential loss
of recall.
Be prepared to test all reasonable analyzer combinations, rather than
taking my word for it.
After reading the Hathi Trust blog, I tried CommonGrams on my own
index. It actually made things slower, not faster. My typical document
is only a few thousand bytes of metadata. The Hathi Trust is indexing
millions of full-length books.
Thanks,
Shawn