[ https://issues.apache.org/jira/browse/SOLR-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148433#comment-13148433 ]
Vadim Kisselmann commented on SOLR-2891:
----------------------------------------

It's an old bug. I also have big problems with offset exceptions when I use highlighting or Carrot. It looks like a problem with HTMLStripCharFilter. The patch doesn't work.
https://issues.apache.org/jira/browse/LUCENE-2208

> InvalidTokenOffsetsException when using MappingCharFilterFactory,
> DictionaryCompoundWordTokenFilterFactory and Highlighting
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2891
>                 URL: https://issues.apache.org/jira/browse/SOLR-2891
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter, Schema and Analysis, search
>    Affects Versions: 3.1, 3.4
>         Environment: MacOS X, Java 1.6, Tomcat 7
>            Reporter: Edwin Steiner
>            Priority: Critical
>
> I would like to handle German umlauts (Umlaute) by replacing the accented
> char with its two-letter substitute (e.g. ä => ae). For this reason I use the
> char filter solr.MappingCharFilterFactory, configured with a mapping file
> containing entries like "ä" => "ae". I also want to use the
> solr.DictionaryCompoundWordTokenFilterFactory to find words which are part of
> compound words (e.g. revision in totalrevision). And finally I want to use
> Solr highlighting. But there seems to be a problem if I combine the char
> filter and the compound word filter with highlighting (an
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException is raised).
> Here are the details:
>
> types:
> --------
> <fieldType name="textAnalyzedFailed" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="words.txt"/>
>   </analyzer>
> </fieldType>
>
> schema:
> -----------
> <fields>
>   <field name="id" type="string" indexed="true" stored="true" required="true" />
>   <field name="title" type="textAnalyzedFailed" indexed="true" stored="true"/>
> </fields>
>
> document:
> --------------
> <doc>
>   <field name="id">1</field>
>   <field name="title">banküberfall</field>
> </doc>
>
> mapping.txt:
> -----------------
> "ü" => "ue"
>
> words.txt:
> --------------
> fall
>
> The resulting error when searching with:
> http://localhost:8080/solr/select/?q=banküberfall&hl=true&hl.fl=title
>
> Nov 4, 2011 4:29:12 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/select/ params={q=bank?berfall&hl.fl=title_hl&hl=true} hits=1 status=0 QTime=13
> Nov 4, 2011 4:29:16 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall exceeds length of provided text sized 12
>     at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:469)
>     at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
>     at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:116)
>     at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
>     at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
>     at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:851)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
>     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:278)
>     at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
>     at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:302)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:680)
> Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall exceeds length of provided text sized 12
>     at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:228)
>     at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:462)
>     ... 23 more
>
> The analysis tool says the following for field name=title, field value=banküberfall:
> ------------------------------------------------------------------------------------
> Index Analyzer
>
> org.apache.solr.analysis.MappingCharFilterFactory {mapping=mapping.txt, luceneMatchVersion=LUCENE_31}
>   text          bankueberfall
>
> org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_31}
>   position      1
>   term text     bankueberfall
>   startOffset   0
>   endOffset     12
>
> org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory {dictionary=words.txt, luceneMatchVersion=LUCENE_31}
>   position      1
>   term text     bankueberfall   fall
>   startOffset   0               9
>   endOffset     12              13
>   flags         0               0
>   type          word            word
>   payload
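
The offsets in the analysis output above hint at where the mismatch comes from. The stored text "banküberfall" has 12 characters, the mapped text "bankueberfall" has 13, and the sub-token "fall" is reported at 9-13, which are positions in the mapped text. The Python sketch below is only a toy model of that arithmetic, not Lucene or Solr code. The helper names (apply_mapping, naive_subtoken_offsets, corrected_subtoken_offsets) are hypothetical, and the idea that the compound word filter adds positions in the mapped term to the token's start offset is an assumption inferred from the reported numbers.

```python
# Toy model of the offset arithmetic in this report. Not Lucene/Solr code;
# all function names here are hypothetical.

ORIGINAL = "bank\u00fcberfall"   # "banküberfall", 12 characters as stored
MAPPING = {"\u00fc": "ue"}       # the single entry from mapping.txt
WORD = "fall"                    # the single entry from words.txt


def apply_mapping(text, mapping):
    """Apply a char mapping and record, for every index of the mapped
    text, the index in the original text it came from."""
    mapped, origin = [], []
    for i, ch in enumerate(text):
        for c in mapping.get(ch, ch):
            mapped.append(c)
            origin.append(i)
    origin.append(len(text))  # end of mapped text maps to end of original
    return "".join(mapped), origin


def naive_subtoken_offsets(term, token_start, word):
    """Assumed behaviour: offsets are the token start plus the position
    of the sub-word inside the mapped term, with no correction."""
    pos = term.find(word)
    start = token_start + pos
    return start, start + len(word)


def corrected_subtoken_offsets(term, origin, word):
    """Same positions, but translated back to the original text."""
    pos = term.find(word)
    return origin[pos], origin[pos + len(word)]


mapped, origin = apply_mapping(ORIGINAL, MAPPING)
naive = naive_subtoken_offsets(mapped, 0, WORD)
fixed = corrected_subtoken_offsets(mapped, origin, WORD)

if __name__ == "__main__":
    print("mapped text:", mapped, "length", len(mapped))
    print("naive offsets:", naive, "original length", len(ORIGINAL))
    print("corrected offsets:", fixed, "->", ORIGINAL[fixed[0]:fixed[1]])
```

In this model the naive end offset is larger than the length of the stored text, which matches the "exceeds length of provided text sized 12" message, while the corrected offsets select "fall" in the original string.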