Arun Rangarajan created SOLR-7154:
-------------------------------------
Summary: Wildcard query matches special characters
Key: SOLR-7154
URL: https://issues.apache.org/jira/browse/SOLR-7154
Project: Solr
Issue Type: Bug
Reporter: Arun Rangarajan
Priority: Minor
I have a string field raw_name defined like this:
{code}
<fieldType name="string" class="solr.StrField" sortMissingLast="true"
omitNorms="true"/>
...
<field name="raw_name" type="string" indexed="true" stored="true" />
{code}
I have a document like this:
{code}
{raw_name: beyoncé}
{code}
Notice that the last character is a special character (accented e).
When I issue this wildcard query:
{code}
q=raw_name:beyonce*
{code}
i.e. with the last character being the plain ASCII 'e', Solr returns the
above document.
Exact query:
{code}
/select?q=raw_name:beyonce*&wt=json&fl=raw_name
{code}
Response:
{code}
{
"responseHeader": {
"status": 0,
"QTime": 0,
"params": {
"fl": "raw_name",
"q": "raw_name:beyonce*",
"wt": "json"
}
},
"response": {
"numFound": 2,
"start": 0,
"docs": [
{
"raw_name": "beyoncé"
},
{
"raw_name": "beyoncé"
}
]
}
}
{code}
I used the analysis tool in Solr admin (with Jetty). The raw bytes look like
this:
Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]
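Looking at the bytes, the indexed value ends in a plain 'e' (65) followed by the
combining acute accent U+0301 (cc 81), rather than the precomposed é (c3 a9).
The ASCII string beyonce is therefore a literal prefix of the indexed value,
which seems to explain why beyonce* matches beyoncé.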
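The byte comparison above can be reproduced outside Solr. The following is a minimal Python sketch, not Solr code: it assumes the indexed value is stored in decomposed (NFD) form, as the raw bytes suggest, and it treats the wildcard as a plain byte-prefix check, which is only an approximation of how a prefix query on a string field behaves. The helper names utf8_hex and prefix_matches are hypothetical and used only for illustration.

```python
import unicodedata


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


def prefix_matches(prefix: str, value: str) -> bool:
    """Approximate a prefix query as a raw UTF-8 byte-prefix comparison."""
    return value.encode("utf-8").startswith(prefix.encode("utf-8"))


# Precomposed form: 'é' is the single code point U+00E9.
composed = unicodedata.normalize("NFC", "beyonce\u0301")
# Decomposed form: 'e' followed by the combining acute accent U+0301.
decomposed = unicodedata.normalize("NFD", composed)

print("ascii     :", utf8_hex("beyonce"))
print("decomposed:", utf8_hex(decomposed))
print("composed  :", utf8_hex(composed))

# The ASCII prefix is a byte prefix of the decomposed value only.
print("prefix matches decomposed:", prefix_matches("beyonce", decomposed))
print("prefix matches composed  :", prefix_matches("beyonce", composed))
```

Under these assumptions, the decomposed value reproduces the byte sequence reported by the analysis tool, and only that form has the ASCII query string as a prefix. Both forms render identically on screen, so the difference is visible only at the byte level.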
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)