[jira] [Created] (SOLR-7154) Wildcard query matches special characters

2015-02-24 Thread Arun Rangarajan (JIRA)
Arun Rangarajan created SOLR-7154:
-

 Summary: Wildcard query matches special characters
 Key: SOLR-7154
 URL: https://issues.apache.org/jira/browse/SOLR-7154
 Project: Solr
  Issue Type: Bug
Reporter: Arun Rangarajan
Priority: Minor


I have a string field raw_name defined like this:

{code}

...

{code}

I have a document like this:
{code}
{raw_name: beyoncé}
{code}
Notice that the last character is a special character (accented e).

When I issue this wildcard query:
{code}
q=raw_name:beyonce*
{code}
i.e. with the last character being a plain ASCII 'e', Solr returns the 
above document.

Exact query:
{code}
/select?q=raw_name:beyonce*&wt=json&fl=raw_name
{code}

Response:

{code}
{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "fl": "raw_name",
      "q": "raw_name:beyonce*",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "raw_name": "beyoncé"
      },
      {
        "raw_name": "beyoncé"
      }
    ]
  }
}
{code}

I used the analysis tool in Solr admin (with Jetty). The raw bytes look like 
this:

Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]

Looking at the bytes, this seems to explain why beyonce* matches beyoncé: 
the indexed value starts with the same seven bytes as beyonce, followed by 
the two extra bytes cc 81.
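
The byte comparison above can be reproduced outside Solr. The following is 
a minimal Python sketch, not part of the original report; the variable and 
function names are illustrative. It assumes the indexed value is an ASCII 
'e' followed by U+0301 (combining acute accent), which is what the trailing 
bytes cc 81 decode to in UTF-8.

```python
import unicodedata

ascii_prefix = "beyonce"
# Assumption: the stored value is 'e' + U+0301 COMBINING ACUTE ACCENT.
decomposed = "beyonce\u0301"
# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)


def hex_bytes(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


print(hex_bytes(ascii_prefix))  # 62 65 79 6f 6e 63 65
print(hex_bytes(decomposed))    # 62 65 79 6f 6e 63 65 cc 81
print(hex_bytes(composed))      # 62 65 79 6f 6e 63 c3 a9

# A prefix comparison on the raw value succeeds only for the decomposed form.
print(decomposed.startswith(ascii_prefix))  # True
print(composed.startswith(ascii_prefix))    # False
```

Both forms render the same on screen, but only the decomposed one literally 
begins with "beyonce", which is consistent with a prefix match on an 
unanalyzed string field.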




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7154) Wildcard query matches special characters

2015-02-24 Thread Arun Rangarajan (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335436#comment-14335436 ]

Arun Rangarajan commented on SOLR-7154:
---

I initially saw this on Solr 4.2.1. After seeing your comment, I tried the 
same on Solr 5.0.0 and got the same results.




[jira] [Commented] (SOLR-7154) Wildcard query matches special characters

2015-02-24 Thread Arun Rangarajan (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335452#comment-14335452 ]

Arun Rangarajan commented on SOLR-7154:
---

Right, that seems to be the issue. I added another document that uses the 
single character "Latin small letter e with acute" (U+00E9), and that 
document does not match the wildcard query. So I think this needs to be 
fixed in my data source itself.
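
One way to fix this at the data source is to normalize every string value 
to NFC before it is sent to Solr. The following is a minimal Python sketch 
under that assumption; normalize_fields is a hypothetical helper, not a 
Solr API, and it only handles flat documents with top-level string values.

```python
import unicodedata


def normalize_fields(doc):
    """Return a copy of doc with every string value converted to NFC.

    Hypothetical pre-indexing helper: non-string values pass through as-is.
    """
    return {
        key: unicodedata.normalize("NFC", value) if isinstance(value, str) else value
        for key, value in doc.items()
    }


# 'e' + U+0301 (decomposed) becomes the single code point U+00E9.
raw_doc = {"raw_name": "beyonce\u0301"}
clean_doc = normalize_fields(raw_doc)

print(clean_doc["raw_name"] == "beyonc\u00e9")        # True
print(clean_doc["raw_name"].startswith("beyonce"))    # False
```

After this step the stored value no longer begins with the ASCII string 
"beyonce", so a prefix query on that string would not be expected to match it.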
