[ https://issues.apache.org/jira/browse/SOLR-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751797#action_12751797 ]

Igor Motov commented on SOLR-1404:
----------------------------------

First of all, HTMLStripWhitespaceTokenizerFactory is deprecated, so it might be 
better to replace it with HTMLStripCharFilterFactory and 
WhitespaceTokenizerFactory:

{code:xml}
<analyzer>
  <charFilter class="solr.HTMLStripCharFilterFactory" />
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
</analyzer>
{code}
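
For example, the "testtype" field type from the schema in the issue description 
could be rewritten along these lines (just a sketch, reusing the field type name 
from the reported schema):

{code:xml}
<fieldtype name="testtype" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
  </analyzer>
</fieldtype>
{code}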

Anyway, there seems to be a bug in resetting a token stream created by the 
HTMLStripWhitespaceTokenizerFactory. That's why the test works the first time, 
when the token stream is created, and fails the next time, when it's reused. The 
problem might have been introduced in revision 802286 (see 
[SOLR-1343|http://issues.apache.org/jira/browse/SOLR-1343]), when 
HTMLStripReader, which was a Reader, became HTMLStripCharFilter, which is a 
CharStream. As a result, super.reset in the following code changed from 
reset(Reader input) to reset(CharStream input).

{code}
public class HTMLStripWhitespaceTokenizerFactory extends BaseTokenizerFactory {
  public Tokenizer create(Reader input) {
    return new WhitespaceTokenizer(new HTMLStripReader(input)) {
      @Override
      public void reset(Reader input) throws IOException {
        super.reset(new HTMLStripReader(input));
      }
    };
  }
}
{code}

WhitespaceTokenizer inherits from CharTokenizer. But CharTokenizer implements 
only reset(Reader input) and doesn't reset the stream on reset(CharStream 
input), which is now called. The simplest fix is to explicitly invoke 
super.reset(Reader input). A better fix, perhaps, would be to implement 
reset(CharStream input) in CharTokenizer in Lucene. 
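
To illustrate the simple fix, here is a sketch of how the factory could force the 
reset(Reader input) overload. It assumes the overload resolution described above, 
i.e. that passing an expression whose static type is a CharStream binds the call 
to reset(CharStream input) at compile time:

{code}
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class HTMLStripWhitespaceTokenizerFactory extends BaseTokenizerFactory {
  public Tokenizer create(Reader input) {
    return new WhitespaceTokenizer(new HTMLStripReader(input)) {
      @Override
      public void reset(Reader input) throws IOException {
        // Give the stripped stream the static type Reader so that the
        // compiler binds the call to reset(Reader), which CharTokenizer
        // overrides and which actually resets its internal state.
        Reader stripped = new HTMLStripReader(input);
        super.reset(stripped);
      }
    };
  }
}
{code}

Since overload resolution happens at compile time based on the argument's 
declared type, the local variable (or an explicit (Reader) cast) is enough to 
route the call back to CharTokenizer's reset(Reader input).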


> Random failures with highlighting
> ---------------------------------
>
>                 Key: SOLR-1404
>                 URL: https://issues.apache.org/jira/browse/SOLR-1404
>             Project: Solr
>          Issue Type: Bug
>          Components: Analysis, highlighter
>    Affects Versions: 1.4
>            Reporter: Anders Melchiorsen
>             Fix For: 1.4
>
>
> With a recent Solr nightly, we started getting errors when highlighting.
> I have not been able to reduce our real setup to a minimal one that is 
> failing, but the same error seems to pop up with the configuration below. 
> Note that the QUERY will mostly fail, but it will work sometimes. Notably, 
> after running "java -jar start.jar", the QUERY will work the first time, but 
> then start failing for a while. Seems that something is not being reset 
> properly.
> The example uses the deprecated HTMLStripWhitespaceTokenizerFactory but the 
> problem apparently also exists with other tokenizers; I was just unable to 
> create a minimal example with other configurations.
> SCHEMA
> <?xml version="1.0" encoding="UTF-8" ?>
> <schema name="example" version="1.2">
>   <types>
>     <fieldType name="string" class="solr.StrField" />
>     <fieldtype name="testtype" class="solr.TextField">
>       <analyzer>
>         <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory" />
>       </analyzer>
>     </fieldtype>
>  </types>
>  <fields>
>    <field name="id" type="string" indexed="true" stored="false" />
>    <field name="test" type="testtype" indexed="false" stored="true" />
>  </fields>
>  <uniqueKey>id</uniqueKey>
> </schema>
> INDEX
> URL=http://localhost:8983/solr/update
> curl $URL --data-binary '<add><doc><field name="id">1</field><field 
> name="test">test</field></doc></add>' -H 'Content-type:text/xml; 
> charset=utf-8'
> curl $URL --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
> QUERY
> curl 'http://localhost:8983/solr/select/?hl.fl=test&hl=true&q=id:1'
> ERROR
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token test 
> exceeds length of provided text sized 4
> org.apache.solr.common.SolrException: 
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token test 
> exceeds length of provided text sized 4
>       at 
> org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:328)
>       at 
> org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
>       at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
>       at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
>       at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>       at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>       at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>       at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>       at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>       at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>       at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>       at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>       at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>       at 
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>       at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>       at org.mortbay.jetty.Server.handle(Server.java:285)
>       at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>       at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
>       at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
>       at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
>       at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>       at 
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>       at 
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: 
> Token test exceeds length of provided text sized 4
>       at 
> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:254)
>       at 
> org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:321)
>       ... 23 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
