[
https://issues.apache.org/jira/browse/SOLR-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507672#comment-13507672
]
Steven Rowe commented on SOLR-4115:
-----------------------------------
Looks to me like a fundamental bug in WordBreakSpellChecker.
I agree with James that there's invalid UTF-8 here, but it's not
{{\uD864\uDC79}}, which is a valid UTF-16 sequence representing a single
character (codepoint: {{U+29079}}, UTF-8: {{F0 A9 81 B9}} - this is a CJK
ideograph above the BMP).
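The encoding claims above can be checked with plain JDK calls. This is a minimal sketch, not code from Lucene or Solr; the class name {{SurrogatePairDemo}} and the helper {{toUtf8Hex}} are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

public class SurrogatePairDemo {
    /** Returns the UTF-8 encoding of s as space-separated uppercase hex bytes. */
    static String toUtf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The query string from the issue: one surrogate pair.
        String s = "\uD864\uDC79";
        System.out.println(s.length());                      // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point)
        System.out.println(Integer.toHexString(s.codePointAt(0)).toUpperCase()); // 29079
        System.out.println(toUtf8Hex(s));                    // F0 A9 81 B9
    }
}
```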
The wordbreak suggester is breaking up multibyte UTF-8 characters at
non-character boundaries.
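For contrast, here is a rough sketch of what breaking only at character boundaries could look like. It is not the actual WordBreakSpellChecker code or a proposed patch; {{CodePointBreaks}} and {{splits}} are hypothetical names, and the sketch works on a Java {{String}} rather than on the UTF-8 bytes of a {{Term}}.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointBreaks {
    /** Returns every two-way split of s whose break falls on a code point boundary. */
    static List<String[]> splits(String s) {
        List<String[]> result = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            // Advance by one whole code point: 1 char in the BMP, 2 for a surrogate pair.
            i += Character.charCount(s.codePointAt(i));
            if (i < s.length()) {
                result.add(new String[] { s.substring(0, i), s.substring(i) });
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A lone supplementary character has no interior boundary to break at.
        System.out.println(splits("\uD864\uDC79").size());
        // With a BMP character on each side there are two legal break positions.
        System.out.println(splits("a\uD864\uDC79b").size());
    }
}
```

Stepping by {{Character.charCount()}} means a surrogate pair is never cut in half, so each piece remains a well-formed string.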
As a method on TestWordBreakSpellChecker, this fails for me with the same stack
trace in Lucene:
{code:java}
public void testBreakingCharAboveBMP() throws Exception {
  IndexReader ir = null;
  try {
    ir = DirectoryReader.open(dir);
    WordBreakSpellChecker wbsp = new WordBreakSpellChecker();
    Term term = new Term("numbers", "\uD864\uDC79");
    wbsp.setMaxChanges(1);
    wbsp.setMinBreakWordLength(1);
    wbsp.setMinSuggestionFrequency(1);
    SuggestWord[][] sw = wbsp.suggestWordBreaks(term, 5, ir,
        SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX,
        BreakSuggestionSortMethod.NUM_CHANGES_THEN_MAX_FREQUENCY);
    Assert.assertEquals("sw.length", 0, sw.length);
  } catch(Exception e) {
    throw e;
  } finally {
    try { ir.close(); } catch(Exception e1) { }
  }
}
{code}
{{UnicodeUtil.UTF8toUTF16()}} assumes you're sending it a valid UTF-8 sequence,
and so it croaks when WordBreakSpellChecker sends it the first byte in the
UTF-8 representation of {{\uD864\uDC79}}: {{F0}}, which is not a valid UTF-8
sequence unless three continuation bytes follow it.
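To illustrate why a truncated prefix is a problem, the sketch below runs each prefix of the four-byte encoding through the JDK's strict UTF-8 decoder. It does not use {{UnicodeUtil}} at all; {{Utf8PrefixCheck}} and {{isValidUtf8}} are hypothetical names, and the JDK decoder reports an error where {{UTF8toUTF16()}}, as described above, simply assumes valid input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8PrefixCheck {
    /** Returns true if bytes decode as well-formed UTF-8 under a strict decoder. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Truncated or otherwise malformed byte sequence.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] full = "\uD864\uDC79".getBytes(StandardCharsets.UTF_8); // F0 A9 81 B9
        // Check every prefix: F0, F0 A9, F0 A9 81, and the full sequence.
        for (int len = 1; len <= full.length; len++) {
            System.out.println(len + " byte(s): " + isValidUtf8(Arrays.copyOf(full, len)));
        }
    }
}
```

Only the complete four-byte sequence should be reported as valid; every shorter prefix, including the lone {{F0}}, is rejected.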
> WordBreakSpellChecker throws ArrayIndexOutOfBoundsException for random query
> string
> -----------------------------------------------------------------------------------
>
> Key: SOLR-4115
> URL: https://issues.apache.org/jira/browse/SOLR-4115
> Project: Solr
> Issue Type: Bug
> Components: spellchecker
> Affects Versions: 4.0
> Environment: java version "1.6.0_37"
> Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
> Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)
> Reporter: Andreas Hubold
>
> The following SolrJ test code causes an ArrayIndexOutOfBoundsException in the
> WordBreakSpellChecker. I tested this with the Solr 4.0.0 example webapp
> started with {{java -jar start.jar}}.
> {code:java}
> @Test
> public void testWordbreakSpellchecker() throws Exception {
>   SolrQuery q = new SolrQuery("\uD864\uDC79");
>   q.setRequestHandler("/browse");
>   q.setParam("spellcheck.dictionary", "wordbreak");
>   HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
>   server.query(q, SolrRequest.METHOD.POST);
> }
> {code}
> {noformat}
> INFO: [collection1] webapp=/solr path=/browse params={spellcheck.dictionary=wordbreak&qt=/browse&wt=javabin&q=?&version=2} hits=0 status=500 QTime=11
> Nov 28, 2012 11:23:01 AM org.apache.solr.common.SolrException log
> SEVERE: null:java.lang.ArrayIndexOutOfBoundsException: 1
>     at org.apache.lucene.util.UnicodeUtil.UTF8toUTF16(UnicodeUtil.java:599)
>     at org.apache.lucene.util.BytesRef.utf8ToString(BytesRef.java:165)
>     at org.apache.lucene.index.Term.text(Term.java:72)
>     at org.apache.lucene.search.spell.WordBreakSpellChecker.generateSuggestWord(WordBreakSpellChecker.java:350)
>     at org.apache.lucene.search.spell.WordBreakSpellChecker.generateBreakUpSuggestions(WordBreakSpellChecker.java:283)
>     at org.apache.lucene.search.spell.WordBreakSpellChecker.suggestWordBreaks(WordBreakSpellChecker.java:122)
>     at org.apache.solr.spelling.WordBreakSolrSpellChecker.getSuggestions(WordBreakSolrSpellChecker.java:229)
>     at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:172)
>     at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
>     at org.eclipse.jetty.server.Server.handle(Server.java:351)
>     at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
>     at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
>     at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:900)
>     at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:954)
>     at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:857)
>     at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>     at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
>     at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
>     at java.lang.Thread.run(Thread.java:662)
> {noformat}
> The query string is a random one (we found it in a randomized test). Other
> random strings work.
> There are no problems with this query string when the DirectSolrSpellChecker
> is used or during search.