Arun Rangarajan created SOLR-7154:
-------------------------------------

             Summary: Wildcard query matches special characters
                 Key: SOLR-7154
                 URL: https://issues.apache.org/jira/browse/SOLR-7154
             Project: Solr
          Issue Type: Bug
            Reporter: Arun Rangarajan
            Priority: Minor


I have a string field raw_name defined like this:

{code}
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
...
<field name="raw_name" type="string" indexed="true" stored="true" />
{code}

I have a document like this:
{code}
{raw_name: beyoncé}
{code}
Notice that the last character is a special character (accented e).

When I issue this wildcard query:
{code}
q=raw_name:beyonce*
{code}
i.e. with the last character being a plain ASCII 'e', Solr returns the 
above document.

Exact query:
{code}
/select?q=raw_name:beyonce*&wt=json&fl=raw_name
{code}

Response:

{code}
{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "fl": "raw_name",
      "q": "raw_name:beyonce*",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "raw_name": "beyoncé"
      },
      {
        "raw_name": "beyoncé"
      }
    ]
  }
}
{code}

I used the analysis tool in Solr admin (with Jetty). The raw bytes look like 
this:

Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]

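The indexed value ends in a plain 'e' (65) followed by the combining acute 
accent U+0301 (cc 81), so the bytes of beyonce are a literal prefix of the 
bytes of beyoncé. That seems to explain why beyonce* matches beyoncé.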
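
The byte comparison can be reproduced outside Solr. The following is only a minimal sketch in Python, not anything Solr itself runs; it assumes the string field indexes the value verbatim, and the helper name utf8_hex is made up for illustration. It prints the UTF-8 bytes of the decomposed form (e + U+0301, as in the document above) and of the precomposed form (U+00E9), and checks which of them starts with "beyonce".

```python
import unicodedata


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex (hypothetical helper)."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


# Decomposed form: plain 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "beyonce\u0301"
# Precomposed form: NFC normalization folds the pair into U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(utf8_hex("beyonce"))   # 62 65 79 6f 6e 63 65
print(utf8_hex(decomposed))  # 62 65 79 6f 6e 63 65 cc 81
print(utf8_hex(composed))    # 62 65 79 6f 6e 63 c3 a9

# A prefix match on the raw value succeeds only for the decomposed form.
print(decomposed.startswith("beyonce"))  # True
print(composed.startswith("beyonce"))    # False
```

If this reading is right, a value stored in precomposed (NFC) form would presumably not match beyonce*, since its bytes diverge at the last character.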



