After some further playing around, I think I understand what's going on. Because the SynonymFilterFactory pays attention to term position when it inserts a multi-word synonym, I had assumed it also respected term position when scanning for matches (i.e., for a two-word synonym, I assumed it would look for the second word in position n+1 if it found the first word in position n).
This does not appear to be the case. It appears to find multi-word synonym matches by simply walking the list of terms, exhausting all the terms in position one before looking at any terms in position two. The ShingleFilter adds terms to most positions, which throws off the 'adjacency' of the flattened list of terms. As a result, a two-word synonym can only match if it consists of the original term (position 1) followed by the added shingle (also in position 1).

Perhaps a better description: if you're looking at the analysis.jsp display, it does not scan for multi-word synonym tokens "across then down", it scans "down then across".

It doesn't look like there's a way to do what I'm trying to do (index shingles AND multi-word synonyms in one field) without writing my own filter.

-----Original Message-----
From: Jeff Wartes [mailto:jwar...@whitepages.com]
Sent: Wednesday, August 10, 2011 1:27 PM
To: solr-user@lucene.apache.org
Subject: RE: Can't mix Synonyms with Shingles?

Hi Steven,

The token separator was certainly a deliberate choice. Are you saying that after applying shingles, synonyms can only match shingled terms? The term analysis suggests the original tokens still exist.

You've made me realize that only certain synonyms seem to have problems though, so it's not a blanket failure. Take this synonym definition:

wamu, washington mutual bank, washington mutual

Indexing "wamu" looks like it'll work fine: there are no shingles, and all three synonym expansions appear to get indexed (expand="true"). However, indexing "washington mutual" applies the shingles correctly (adds washingtonmutual to position 1), but the synonym expansion does not happen. I would still expect the synonym definition to match the original terms and index 'wamu' along with the other stuff.

Thanks.

-----Original Message-----
From: Steven A Rowe [mailto:sar...@syr.edu]
Sent: Wednesday, August 10, 2011 12:54 PM
To: solr-user@lucene.apache.org
Subject: RE: Can't mix Synonyms with Shingles?

Hi Jeff,

You have configured ShingleFilterFactory with a token separator of "", so e.g. "International Corporation" will output the shingle "InternationalCorporation". If this is the form you want to use for synonym matching, it must exist in your synonym file. Does it?

Steve

> -----Original Message-----
> From: Jeff Wartes [mailto:jwar...@whitepages.com]
> Sent: Wednesday, August 10, 2011 3:43 PM
> To: solr-user@lucene.apache.org
> Subject: Can't mix Synonyms with Shingles?
>
>
> I would like to combine the ShingleFilterFactory with a
> SynonymFilterFactory in a field type.
>
> I've looked at something like this using the analysis.jsp tool:
>
> <fieldType name="TestTerm" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1" stemEnglishPosessive="1"/>
>     <filter class="solr.ShingleFilterFactory" tokenSeparator="" />
>     <filter class="solr.SynonymFilterFactory"
>             synonyms="synonyms.BusinessNames.txt" ignoreCase="true" expand="true"/>
>     ...
>   </analyzer>
>   <analyzer type="query">
>     ...
>   </analyzer>
> </fieldType>
>
> However, when a ShingleFilterFactory is applied first, the
> SynonymFilterFactory appears to do nothing.
> I haven't found any documentation or other warnings against this
> combination, and I don't want to apply shingles after synonyms (this
> works) because multi-word synonyms then cause severe term expansion. I
> don't really mind if the synonyms fail to match shingles, (although
> I'd prefer they succeed) but I'd at least expect that synonyms would
> continue to match on the original tokens, as they do if I remove the
> ShingleFilterFactory.
>
> I'm using Solr 3.3, any clarification would be appreciated.
>
> Thanks,
> -Jeff Wartes
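
As a footnote to the "down then across" versus "across then down" description at the top of this thread, the two scan orders can be mimicked with a small toy model. This is only a sketch of the behavior as observed in analysis.jsp, not Solr's actual SynonymFilter code. The function names (shingle, match_flat, match_by_position) are made up for illustration, and the shingle step assumes two-word shingles with the original unigrams kept and an empty token separator.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, int]  # (term, position), positions start at 1


def shingle(words: List[str], sep: str = "") -> List[Token]:
    """Toy stand-in for a 2-word shingle filter (hypothetical helper).

    Keeps each original word and adds the two-word shingle at the
    same position as its first word.
    """
    tokens: List[Token] = []
    for i, word in enumerate(words):
        pos = i + 1
        tokens.append((word, pos))
        if i + 1 < len(words):
            tokens.append((word + sep + words[i + 1], pos))
    return tokens


def match_flat(tokens: List[Token], phrase: List[str]) -> bool:
    """'Down then across': ignore positions and require the phrase
    words to be adjacent in the flattened token list."""
    terms = [term for term, _ in tokens]
    n = len(phrase)
    return any(terms[i:i + n] == phrase for i in range(len(terms) - n + 1))


def match_by_position(tokens: List[Token], phrase: List[str]) -> bool:
    """'Across then down': word k of the phrase may be any token
    sitting at position start + k."""
    by_pos: Dict[int, Set[str]] = {}
    for term, pos in tokens:
        by_pos.setdefault(pos, set()).add(term)
    return any(
        all(word in by_pos.get(start + k, set()) for k, word in enumerate(phrase))
        for start in by_pos
    )


if __name__ == "__main__":
    tokens = shingle(["washington", "mutual"])
    print(tokens)
    print(match_flat(tokens, ["washington", "mutual"]))
    print(match_by_position(tokens, ["washington", "mutual"]))
```

In this model, the shingle "washingtonmutual" sits between "washington" and "mutual" in the flattened list, so the flat scan misses the two-word synonym while the position-aware scan finds it. A custom filter along the lines mentioned above would need matching closer to match_by_position.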