Re: Special character and wildcard matching
On 24 February 2015 at 15:50, Jack Krupansky wrote:
> It's a string field, so there shouldn't be any analysis. (read back in the
> thread for the field and field type.)

It's a multi-term expansion. There is _some_ analysis one way or another :-)

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/
Re: Special character and wildcard matching
It's a string field, so there shouldn't be any analysis. (Read back in the thread for the field and field type.)

-- Jack Krupansky

On Tue, Feb 24, 2015 at 3:19 PM, Alexandre Rafalovitch wrote:
> What happens if the query does not have wildcard expansion (*)? If the
> behavior is correct, then the issue is somehow with the
> MultitermQueryAnalysis (a hidden automatically generated analyzer
> chain): http://wiki.apache.org/solr/MultitermQueryAnalysis
>
> Which would still make it a bug, but at least the cause could be narrowed
> down.
>
> Regards,
> Alex.
Re: Special character and wildcard matching
What happens if the query does not have wildcard expansion (*)? If the behavior is correct in that case, then the issue is somehow with the MultitermQueryAnalysis (a hidden, automatically generated analyzer chain): http://wiki.apache.org/solr/MultitermQueryAnalysis

That would still make it a bug, but at least the cause could be narrowed down.

Regards,
Alex.

On 24 February 2015 at 14:56, Arun Rangarajan wrote:
> Thanks, Jack.
> I have filed a tkt: https://issues.apache.org/jira/browse/SOLR-7154
Re: Special character and wildcard matching
Thanks, Jack.
I have filed a ticket: https://issues.apache.org/jira/browse/SOLR-7154

On Tue, Feb 24, 2015 at 11:43 AM, Jack Krupansky wrote:
> Thanks. That at least verifies that the accented e is stored in the field.
> I don't see anything wrong here, so it is as if the Lucene prefix query was
> mapping the accented characters. It's not supposed to do that, but...
>
> Go ahead and file a Jira bug. Include all of the details that you provided
> in this thread.
>
> -- Jack Krupansky
Re: Special character and wildcard matching
Thanks. That at least verifies that the accented e is stored in the field. I don't see anything wrong here, so it is as if the Lucene prefix query was mapping the accented characters. It's not supposed to do that, but...

Go ahead and file a Jira bug. Include all of the details that you provided in this thread.

-- Jack Krupansky

On Tue, Feb 24, 2015 at 2:35 PM, Arun Rangarajan wrote:
> Exact query:
> /select?q=raw_name:beyonce*&wt=json&fl=raw_name
>
> Response:
>
> { "responseHeader": {"status": 0,"QTime": 0,"params": {
> "fl": "raw_name", "q": "raw_name:beyonce*", "wt": "json"
> } }, "response": {"numFound": 2,"start": 0,"docs": [
> {"raw_name": "beyoncé" }, {"raw_name":
> "beyoncé" }] }}
Re: Special character and wildcard matching
Exact query:
/select?q=raw_name:beyonce*&wt=json&fl=raw_name

Response:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "fl": "raw_name",
      "q": "raw_name:beyonce*",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"raw_name": "beyoncé"},
      {"raw_name": "beyoncé"}
    ]
  }
}

On Tue, Feb 24, 2015 at 11:01 AM, Jack Krupansky wrote:
> Please post the info I requested - the exact query, and the Solr response.
>
> -- Jack Krupansky
Re: Special character and wildcard matching
Please post the info I requested - the exact query, and the Solr response.

-- Jack Krupansky

On Tue, Feb 24, 2015 at 12:45 PM, Arun Rangarajan wrote:
> In our case, the lower-casing is happening in a custom Java indexer code,
> via Java's String.toLowerCase() method.
>
> I used the analysis tool in Solr admin (with Jetty). I believe the raw
> bytes explain this.
>
> Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and
> beyoncé in file beyonce_with_spl_chars.JPG.
>
> Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
> Raw bytes for beyoncé:[62 65 79 6f 6e 63 65 cc 81]
>
> So when you look at the bytes, it seems to explain why beyonce* matches
> beyoncé.
>
> I tried your approach with a KeywordTokenizer followed by a
> LowerCaseFilter, but I see the same behavior.
Re: Special character and wildcard matching
In our case, the lower-casing happens in custom Java indexer code, via Java's String.toLowerCase() method.

I used the analysis tool in Solr admin (with Jetty). I believe the raw bytes explain this.

Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and beyoncé in file beyonce_with_spl_chars.JPG.

Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]

So the bytes seem to explain why beyonce* matches beyoncé: the indexed value is the seven bytes of beyonce followed by cc 81 (the UTF-8 encoding of U+0301, COMBINING ACUTE ACCENT), so beyonce is a literal prefix of it.

I tried your approach with a KeywordTokenizer followed by a LowerCaseFilter, but I see the same behavior.

On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky wrote:
> But how is that lowercasing occurring? I mean, solr.StrField doesn't do
> that.
>
> Some containers default to automatically mapping accented characters, so
> that the accented "e" would then get indexed as a normal "e", and then your
> wildcard would match it, and an accented "e" in a query would get mapped as
> well and then match the normal "e" in the index. What does your query
> response look like?
>
> This blog post explains that problem:
> http://bensch.be/tomcat-solr-and-special-characters
>
> Note that you could make your string field a text field with the keyword
> tokenizer and then filter it for lower case, such as when the user query
> might have a capital "B". String field is most appropriate when the field
> really is 100% raw.
>
> -- Jack Krupansky
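The raw bytes reported in this message can be reproduced outside Solr. The following is a minimal sketch in plain Java, not code from the thread: the class name AccentBytes and the helper utf8Hex are made up for illustration, and it assumes the indexed value is the decomposed form (plain "e" followed by U+0301), which is what the trailing cc 81 suggests. It compares that form with the precomposed form (U+00E9) and shows that only the decomposed one has beyonce as a literal prefix.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Illustration only: AccentBytes and utf8Hex are hypothetical names, not part of Solr.
class AccentBytes {
    // Hex dump of the UTF-8 encoding, in the same style as the bytes quoted in the thread.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed stored form: 'e' followed by the combining acute accent (NFD).
        String decomposed = "beyonce\u0301";
        // Precomposed form (NFC): the last two code points collapse into one.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(utf8Hex(decomposed)); // 62 65 79 6f 6e 63 65 cc 81
        System.out.println(utf8Hex(composed));   // 62 65 79 6f 6e 63 c3 a9

        // A prefix match on "beyonce" succeeds only against the decomposed form.
        System.out.println(decomposed.startsWith("beyonce")); // true
        System.out.println(composed.startsWith("beyonce"));   // false
    }
}
```

If the assumption holds, the wildcard match would be an ordinary prefix match on the stored bytes rather than any accent folding, though the thread itself does not confirm that this is the whole explanation.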
Re: Special character and wildcard matching
But how is that lowercasing occurring? I mean, solr.StrField doesn't do that.

Some containers default to automatically mapping accented characters, so that the accented "e" would then get indexed as a normal "e", and then your wildcard would match it, and an accented "e" in a query would get mapped as well and then match the normal "e" in the index. What does your query response look like?

This blog post explains that problem:
http://bensch.be/tomcat-solr-and-special-characters

Note that you could make your string field a text field with the keyword tokenizer and then filter it for lower case, such as when the user query might have a capital "B". String field is most appropriate when the field really is 100% raw.

-- Jack Krupansky

On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan wrote:
> Yes, it is a string field and not a text field.
>
> [field type XML lost in archiving; surviving fragment: omitNorms="true"/>]
>
> Lower-casing done to do case-insensitive matching.
Re: Special character and wildcard matching
Yes, it is a string field and not a text field.

[XML for the fieldType and field definitions lost in archiving; fragments that survive in quoted replies: sortMissingLast="true" omitNorms="true"/> and stored="true" />]

Lower-casing is done to allow case-insensitive matching.

On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky wrote:
> Is it really a string field - as opposed to a text field? Show us the field
> and field type.
>
> Besides, if it really were a "raw" name, wouldn't that be a capital "B"?
>
> -- Jack Krupansky
Re: Special character and wildcard matching
Is it really a string field - as opposed to a text field? Show us the field and field type.

Besides, if it really were a "raw" name, wouldn't that be a capital "B"?

-- Jack Krupansky

On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan wrote:
> I have a string field raw_name like this in my document:
>
> {raw_name: beyoncé}
>
> (Notice that the last character is a special character.)
>
> When I issue this wildcard query:
>
> q=raw_name:beyonce*
>
> i.e. with the last character simply being the ASCII 'e', Solr returns me
> the above document.
>
> How do I prevent this?
Special character and wildcard matching
I have a string field raw_name like this in my document:

{raw_name: beyoncé}

(Notice that the last character is a special character.)

When I issue this wildcard query:

q=raw_name:beyonce*

i.e. with the last character simply being the ASCII 'e', Solr returns me the above document.

How do I prevent this?
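One possible way to prevent the match, sketched under assumptions rather than taken from the thread: if the indexed value is in decomposed form (plain "e" followed by U+0301, as the raw bytes reported elsewhere in this thread suggest), the custom Java indexer could compose the string to NFC before lower-casing it. The class RawNameNormalizer and the method normalizeForIndex below are hypothetical names for illustration only; java.text.Normalizer and String.toLowerCase(Locale) are standard JDK calls.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustration only: a hypothetical helper for a custom indexer, not part of Solr.
class RawNameNormalizer {
    // Compose to NFC first, then lower-case with a fixed locale so the result
    // does not depend on the JVM's default locale.
    static String normalizeForIndex(String raw) {
        String composed = Normalizer.normalize(raw, Normalizer.Form.NFC);
        return composed.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Input in decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
        String indexed = normalizeForIndex("Beyonce\u0301");

        // After NFC the accented letter is the single code point U+00E9.
        System.out.println(indexed.equals("beyonc\u00e9")); // true
        // The ASCII prefix "beyonce" no longer matches the stored value.
        System.out.println(indexed.startsWith("beyonce"));  // false
    }
}
```

With this approach, query strings that contain accented characters would need the same normalization on the client side, otherwise a decomposed é in a query would not match the composed value in the index.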