I have an autocomplete index for which I return highlighting information, but I am getting an error with certain search strings and fields on Solr 3.5. I’ve narrowed it down to a specific field matching a specific search string. I’ve tried several changes to the schema, rebuilding after each, but so far I cannot get the error to go away. The failing field is an ngram-indexed field for matching on the start of any word. Any help would be appreciated.
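To make "matching on the start of any word" concrete, here is a minimal Python sketch of front-anchored (edge) n-gram generation. It is only an illustration, not Solr's implementation: the function names are made up for this example, the gram sizes mirror the minGramSize="2" and maxGramSize="20" settings in the schema further down, and the sample tokens are the word parts reported by the field analysis later in this message.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the front-anchored n-grams of a token, shortest first.

    Tokens shorter than min_gram produce no grams at all.
    """
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(tokens, min_gram=2, max_gram=20):
    """Lowercase each token, then expand it into its edge n-grams."""
    terms = []
    for token in tokens:
        terms.extend(edge_ngrams(token.lower(), min_gram, max_gram))
    return terms


if __name__ == "__main__":
    # Word parts as reported by the analysis output in this message.
    print(index_terms(["Anti", "A", "dipus"]))
```

A query term such as "ant" matches because it equals one of the indexed grams; the single-letter part "A" is shorter than the minimum gram size and contributes nothing.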
The text being searched for is “ant” (without quotes). The field value that is matching and causing the error is “Anti-Å’dipus” (again without quotes).

The field schema is (additional fields and field types removed):

<types>
  <fieldType name="autocomplete_ngram" class="solr.TextField">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\|)" replaceWith="or" replace="all"/>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([&amp;])" replaceWith="and" replace="all"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="2"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\|)" replaceWith="or" replace="all"/>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([&amp;])" replaceWith="and" replace="all"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
    </analyzer>
  </fieldType>
</types>
<fields>
  <field name="ng" type="autocomplete_ngram" indexed="true" stored="true" omitNorms="true" omitTermFreqAndPositions="true"/>
</fields>

Things I’ve tried changing in the above: making the PatternReplaceCharFilterFactory charFilters PatternReplaceFilterFactory filters instead; moving around the order of the filters (particularly moving the PatternReplaceFilterFactory filters to the top or bottom of the filter chain); and completely removing the WordDelimiterFilterFactory and the PatternReplaceFilterFactory that has pattern="([^\w\d\*æøåÆØÅ ])". No matter what I do, I still get errors. Sometimes the matched value that triggers the error seems to change, but the one included here is the most consistent.

Highlighting is configured as:

<requestHandler name="ac" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="wt">json</str>
    <int name="rows">10</int>
    <bool name="hl">true</bool>
    <str name="hl.fl">ng</str>
    <int name="hl.snippets">4</int>
    <bool name="hl.requireFieldMatch">true</bool>
    <int name="hl.fragsize">2</int>
    <str name="fl">ng score</str>
  </lst>
</requestHandler>

When I do a field analysis using that search term and field value I get:

*Index Analyzer*

*org.apache.solr.analysis.MappingCharFilterFactory {mapping=mapping-ISOLatin1Accent.txt, luceneMatchVersion=LUCENE_35}*
*text*  Anti-A’dipus

*org.apache.solr.analysis.PatternReplaceCharFilterFactory {replace=all, pattern=(\|), replaceWith=or, luceneMatchVersion=LUCENE_35}*
*text*  Anti-A’dipus

*org.apache.solr.analysis.PatternReplaceCharFilterFactory {replace=all, pattern=([&]), replaceWith=and, luceneMatchVersion=LUCENE_35}*
*text*  Anti-A’dipus

*org.apache.solr.analysis.StandardTokenizerFactory {luceneMatchVersion=LUCENE_35}*
*position*     1           2
*term text*    Anti        A’dipus
*startOffset*  0           5
*endOffset*    4           12
*type*         <ALPHANUM>  <ALPHANUM>
*org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=0, luceneMatchVersion=LUCENE_35, generateWordParts=1, catenateAll=0, catenateNumbers=0}*
*position*     1           2           3
*term text*    Anti        A           dipus
*startOffset*  0           5           7
*endOffset*    4           6           12
*type*         <ALPHANUM>  <ALPHANUM>  <ALPHANUM>

*org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_35}*
*position*     1           2           3
*term text*    anti        a           dipus
*startOffset*  0           5           7
*endOffset*    4           6           12
*type*         <ALPHANUM>  <ALPHANUM>  <ALPHANUM>

*org.apache.solr.analysis.EdgeNGramFilterFactory {maxGramSize=20, minGramSize=2, luceneMatchVersion=LUCENE_35}*
*position*     1     2     3     4     5     6     7
*term text*    an    ant   anti  di    dip   dipu  dipus
*startOffset*  0     0     0     7     7     7     7
*endOffset*    2     3     4     9     10    11    12
*type*         word  word  word  word  word  word  word

*org.apache.solr.analysis.PatternReplaceFilterFactory {replace=all, replacement= , pattern=([^\w\d\*æøåÆØÅ ]), luceneMatchVersion=LUCENE_35}*
*position*     1     2     3     4     5     6     7
*term text*    an    ant   anti  di    dip   dipu  dipus
*startOffset*  0     0     0     7     7     7     7
*endOffset*    2     3     4     9     10    11    12
*type*         word  word  word  word  word  word  word

*Query Analyzer*
ant ant ant ant ant

And when I call the search URL:

http://localhost:8983/solr/autocomplete/select/?q=ng%3A%28ant%29

I get the following error stack:

HTTP ERROR 500
Problem accessing /solr/autocomplete/select/.
Reason: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token oedipus exceeds length of provided text sized 11

org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token oedipus exceeds length of provided text sized 11
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:497)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
    at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token oedipus exceeds length of provided text sized 11
    at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490)
    ... 24 more

I am not even sure where the “oedipus” token is coming from. It doesn’t show up in the analysis. Help please?

Thank you,
Justin
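For readers trying to reproduce the error message above, the following Python sketch models the kind of bounds check that the message implies: a token whose start or end offset lies beyond the length of the stored text is rejected. Everything here is an assumption made for illustration, not Lucene code: the exception class is a stand-in, the token offsets (5, 12) are hypothetical, and the idea that the stored value is really "Anti-Œdipus" (which is 11 characters long and turns into the 12-character "Anti-Å’dipus" when its UTF-8 bytes are read as Windows-1252) is only one possible reading of the data shown in this message.

```python
class InvalidTokenOffsets(Exception):
    """Stand-in for Lucene's InvalidTokenOffsetsException (illustration only)."""


def check_offsets(text, tokens):
    """Raise if any (term, start, end) token points past the end of text."""
    for term, start, end in tokens:
        if start > len(text) or end > len(text):
            raise InvalidTokenOffsets(
                f"Token {term} exceeds length of provided text sized {len(text)}"
            )


# Assumption: the stored value uses the single character U+0152 (OE ligature).
stored = "Anti-\u0152dipus"
# The same bytes misread as Windows-1252 give the 12-character form.
misread = stored.encode("utf-8").decode("cp1252")

if __name__ == "__main__":
    print(len(stored), len(misread))
    try:
        # Hypothetical offsets: a token ending at 12 against 11-character text.
        check_offsets(stored, [("oedipus", 5, 12)])
    except InvalidTokenOffsets as exc:
        print(exc)
```

If that reading holds, it would be worth running the field analysis with the value exactly as stored in the index rather than a pasted copy, since the two strings differ in length.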