[ 
https://issues.apache.org/jira/browse/SOLR-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874765#action_12874765
 ] 

Khaled Hammouda commented on SOLR-1630:
---------------------------------------

We just hit this bug as well. To reproduce, index a document containing a 
hyphenated (or underscored) term, then search with a misspelled version of 
that term; e.g.

document contains: mid-term
query: mis-term
result: exception thrown

I looked at the code where this happens, and it seems to be related to token 
offsets (of the tokenized query) in conjunction with a spellcheck component 
feature called collation. Basically, collation tries to rewrite the original 
query using the top suggested words. It relies on the offsets reported by the 
tokenizer to remove each misspelled word and insert its suggestion (using 
StringBuilder.replace). Unfortunately, the token offsets look weird for words 
with hyphens (or underscores); for example:

query: abc_def
1st token: value = abc; startOffset = 0; endOffset = 7
2nd token: value = def; startOffset = 0; endOffset = 7

Because both tokens report the same range (0-7), the replacement logic breaks. 
I'm not sure whether this tokenizer behavior is correct, but it's part of the 
problem.
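
To illustrate, here is a minimal sketch of an offset-based replacement loop of 
the kind described above. It is not the actual SpellCheckComponent code: the 
Tok class, the collate method, and the suggestions "abd" and "deg" are made up 
for the example, and the running-offset bookkeeping is my assumption about how 
such a loop would typically be written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CollationSketch {

    /** Hypothetical stand-in for an analyzer token; only the offsets matter. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Replaces each token's range in the query with its suggestion.
     * 'offset' tracks how much earlier replacements shifted the string.
     */
    static String collate(String query, Map<Tok, String> best) {
        StringBuilder sb = new StringBuilder(query);
        int offset = 0;
        for (Map.Entry<Tok, String> e : best.entrySet()) {
            Tok t = e.getKey();
            String fix = e.getValue();
            sb.replace(t.start + offset, t.end + offset, fix);
            offset += fix.length() - (t.end - t.start);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Offsets as reported above: both tokens claim the whole range 0-7.
        Map<Tok, String> best = new LinkedHashMap<Tok, String>();
        best.put(new Tok("abc", 0, 7), "abd");
        best.put(new Tok("def", 0, 7), "deg");
        try {
            System.out.println(collate("abc_def", best));
        } catch (StringIndexOutOfBoundsException ex) {
            // The first replace shrinks the string and drives 'offset'
            // negative, so the second replace starts at a negative index.
            System.out.println("collation failed: " + ex);
        }
    }
}
```

With distinct, non-overlapping offsets (say 0-3 and 4-7 for "abc def") the same 
loop works fine, so the overlapping ranges are what trigger the exception in 
this sketch.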

That said, I changed the spellcheck tokenizer from standard to whitespace, and 
this actually solved the problem: no errors, and I get correct suggestions.
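
A rough sketch of why whitespace tokenization sidesteps the issue: each 
whitespace-delimited word becomes a single token with its own distinct range, 
so the replacements never overlap. This uses a plain regex as a stand-in for 
the real whitespace tokenizer; the class name, the collate method, and the 
suggestion map are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceCollationSketch {

    /**
     * Splits the query on whitespace and replaces every word that has a
     * suggestion, adjusting for length changes from earlier replacements.
     */
    static String collate(String query, Map<String, String> suggestions) {
        StringBuilder sb = new StringBuilder(query);
        int offset = 0;
        Matcher m = Pattern.compile("\\S+").matcher(query);
        while (m.find()) {
            String fix = suggestions.get(m.group());
            if (fix == null) {
                continue; // word is spelled correctly, leave it alone
            }
            // "mis-term" is one token here, covering exactly its own range.
            sb.replace(m.start() + offset, m.end() + offset, fix);
            offset += fix.length() - (m.end() - m.start());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> suggestions =
                Collections.singletonMap("mis-term", "mid-term");
        System.out.println(collate("mis-term exam", suggestions));
        // prints "mid-term exam"
    }
}
```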

So, until this gets fixed, you can either:

1) Disable spellchecker collation, or
2) Use a whitespace tokenizer for the spellchecker component

> StringIndexOutOfBoundsException in SpellCheckComponent
> ------------------------------------------------------
>
>                 Key: SOLR-1630
>                 URL: https://issues.apache.org/jira/browse/SOLR-1630
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis, spellchecker
>    Affects Versions: 1.4
>         Environment: Solr 1.4
> Lucene 2.9.1
> Win XP
> java version "1.6.0_14"
>            Reporter: Robin Wojciki
>            Assignee: Shalin Shekhar Mangar
>         Attachments: bug.xml, schema.xml, SOLR-1630.patch, solrconfig.xml, 
> spellcheckconfig.xml
>
>
> For some documents/search strings, the SpellCheckComponent throws 
> StringIndexOutOfBoundsException.
> See: http://www.lucidimagination.com/search/document/3be6555227e031fc/
> h2. Replication
>  * Save attached schema.xml and solrconfig.xml in 
> apache-solr-1.4.0/example/solr/conf
>  * Start Solr
>  * Index attached bug.xml
>  * Query [http://localhost:8983/solr/select/?q=awehjse-wjkekw]
> It throws a StringIndexOutOfBoundsException
> {noformat} String index out of range: -7
> java.lang.StringIndexOutOfBoundsException: String index out of range: -7
>       at java.lang.AbstractStringBuilder.replace(Unknown Source)
>       at java.lang.StringBuilder.replace(Unknown Source)
>       at 
> org.apache.solr.handler.component.SpellCheckComponent.toNamedList(SpellCheckComponent.java:248)
>       at 
> org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:143)
>       at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
>       at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>       at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>       at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> {noformat} 

