Hi,

I've modified the HyphenationCompoundWordTokenFilter to emit fewer subtokens, 
because the original filter can emit all kinds of subtokens that have a very 
different meaning on their own. My version emits no overlapping subtokens and 
no subtokens that can be found within another subtoken. It also requires that 
the generated subtokens together make up the whole original token; if they 
don't, the subtokens are discarded. Finally, it doesn't return the original 
token anymore, whereas the original filter produces a duplicate of the 
original input token. For example, verzekeringmaatschappij now becomes 
verzekering and maatschappij, and not verzekeringmaatschappij, ver, zeker, 
verzeker, zekering, ringmaat, maat and more.
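
To make the selection rule concrete, here is a minimal sketch in plain Java. 
It is not the modified filter itself: it ignores hyphenation points and the 
Lucene TokenStream API entirely, and the class name CompoundSplitSketch, the 
method split and the small dictionary are made up for illustration. It only 
shows the idea of keeping adjacent, non-overlapping dictionary words that 
together cover the whole token, and emitting nothing otherwise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene or of the modified filter.
public class CompoundSplitSketch {

    /**
     * Splits the token into dictionary words that are adjacent, do not overlap
     * and together cover the whole token. Returns an empty list if no such
     * split exists.
     */
    static List<String> split(String token, Set<String> dictionary) {
        int n = token.length();
        // reachable[i] is true if token[0, i) can be fully covered by dictionary words.
        boolean[] reachable = new boolean[n + 1];
        // prev[i] is the start of the last word in such a covering of token[0, i).
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        reachable[0] = true;

        for (int end = 1; end <= n; end++) {
            // Trying the smallest start first prefers the longest last word.
            for (int start = 0; start < end; start++) {
                if (reachable[start] && dictionary.contains(token.substring(start, end))) {
                    reachable[end] = true;
                    prev[end] = start;
                    break;
                }
            }
        }

        if (!reachable[n]) {
            // The candidate words do not make up the whole token: emit nothing.
            return Collections.emptyList();
        }

        // Walk back from the end of the token to recover the chosen words in order.
        List<String> parts = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            parts.add(0, token.substring(prev[end], end));
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "ver", "zeker", "verzeker", "zekering", "verzekering",
                "ringmaat", "maat", "maatschappij"));
        // prints [verzekering, maatschappij]
        System.out.println(split("verzekeringmaatschappij", dictionary));
        // prints [] because the trailing "x" cannot be covered
        System.out.println(split("verzekeringx", dictionary));
    }
}
```

Note that if the whole token is itself in the dictionary, this sketch returns 
it as a single part, which is a simplification compared to what I described 
above.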

But it seems that I have done something wrong, because my modified version 
sometimes causes the Highlighter to throw the following IOOBE:

java.lang.StringIndexOutOfBoundsException: String index out of range: -14
        at java.lang.String.substring(String.java:1937)
        at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.makeFragment(BaseFragmentsBuilder.java:172)
        at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.createFragments(BaseFragmentsBuilder.java:138)
        at org.apache.lucene.search.vectorhighlight.FastVectorHighlighter.getBestFragments(FastVectorHighlighter.java:186)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByFastVectorHighlighter(DefaultSolrHighlighter.java:571)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
        at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:136)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:214)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1750)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        .....

Can anyone point me in the right direction? I've checked the LIA book on how to 
manipulate the token stream and thought my changes should be all right. My 
analysis tests also yield good results, with nothing strange to be found. Or 
could it be an error in the highlighter that only now shows up?

Thanks,
Markus
