Re: Special character and wildcard matching

2015-02-24 Thread Alexandre Rafalovitch
On 24 February 2015 at 15:50, Jack Krupansky  wrote:
> It's a string field, so there shouldn't be any analysis. (read back in the
> thread for the field and field type.)

It's a multi-term expansion. There is _some_ analysis one way or another :-)


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


Re: Special character and wildcard matching

2015-02-24 Thread Jack Krupansky
It's a string field, so there shouldn't be any analysis. (read back in the
thread for the field and field type.)

-- Jack Krupansky

On Tue, Feb 24, 2015 at 3:19 PM, Alexandre Rafalovitch 
wrote:

> What happens if the query does not have wildcard expansion (*)? If the
> behavior is correct, then the issue is somehow with the
> MultitermQueryAnalysis (a hidden automatically generated analyzer
> chain): http://wiki.apache.org/solr/MultitermQueryAnalysis
>
> Which would still make it a bug, but at least the cause could be narrowed
> down.
>
> Regards,
>Alex.
>
>
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/

Re: Special character and wildcard matching

2015-02-24 Thread Alexandre Rafalovitch
What happens if the query does not use wildcard expansion (*)? If the
behavior is correct without the wildcard, then the issue is most likely in
the multi-term query analysis (a hidden, automatically generated analyzer
chain): http://wiki.apache.org/solr/MultitermQueryAnalysis

That would still make it a bug, but at least the cause could be narrowed down.

Regards,
   Alex.



Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 24 February 2015 at 14:56, Arun Rangarajan  wrote:
> Thanks, Jack.
> I have filed a ticket: https://issues.apache.org/jira/browse/SOLR-7154

Re: Special character and wildcard matching

2015-02-24 Thread Arun Rangarajan
Thanks, Jack.
I have filed a ticket: https://issues.apache.org/jira/browse/SOLR-7154


On Tue, Feb 24, 2015 at 11:43 AM, Jack Krupansky 
wrote:

> Thanks. That at least verifies that the accented e is stored in the field.
> I don't see anything wrong here, so it is as if the Lucene prefix query was
> mapping the accented characters. It's not supposed to do that, but...
>
> Go ahead and file a Jira bug. Include all of the details that you provided
> in this thread.
>
> -- Jack Krupansky


Re: Special character and wildcard matching

2015-02-24 Thread Jack Krupansky
Thanks. That at least verifies that the accented e is stored in the field.
I don't see anything wrong here, so it is as if the Lucene prefix query was
mapping the accented characters. It's not supposed to do that, but...

Go ahead and file a Jira bug. Include all of the details that you provided
in this thread.

-- Jack Krupansky

On Tue, Feb 24, 2015 at 2:35 PM, Arun Rangarajan 
wrote:

> Exact query:
> /select?q=raw_name:beyonce*&wt=json&fl=raw_name
>
> Response:
>
> {
>   "responseHeader": {
>     "status": 0,
>     "QTime": 0,
>     "params": {
>       "fl": "raw_name",
>       "q": "raw_name:beyonce*",
>       "wt": "json"
>     }
>   },
>   "response": {
>     "numFound": 2,
>     "start": 0,
>     "docs": [
>       {"raw_name": "beyoncé"},
>       {"raw_name": "beyoncé"}
>     ]
>   }
> }


Re: Special character and wildcard matching

2015-02-24 Thread Arun Rangarajan
Exact query:
/select?q=raw_name:beyonce*&wt=json&fl=raw_name

Response:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "fl": "raw_name",
      "q": "raw_name:beyonce*",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"raw_name": "beyoncé"},
      {"raw_name": "beyoncé"}
    ]
  }
}



On Tue, Feb 24, 2015 at 11:01 AM, Jack Krupansky 
wrote:

> Please post the info I requested - the exact query, and the Solr response.
>
> -- Jack Krupansky


Re: Special character and wildcard matching

2015-02-24 Thread Jack Krupansky
Please post the info I requested - the exact query, and the Solr response.

-- Jack Krupansky

On Tue, Feb 24, 2015 at 12:45 PM, Arun Rangarajan 
wrote:

> In our case, the lower-casing is happening in a custom Java indexer code,
> via Java's String.toLowerCase() method.
>
> I used the analysis tool in Solr admin (with Jetty). I believe the raw
> bytes explain this.
>
> Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and
> beyoncé in file beyonce_with_spl_chars.JPG.
>
> Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
> Raw bytes for beyoncé:[62 65 79 6f 6e 63 65 cc 81]
>
> So when you look at the bytes, it seems to explain why beyonce* matches
> beyoncé.
>
> I tried your approach with a KeywordTokenizer followed by a
> LowerCaseFilter, but I see the same behavior.


Re: Special character and wildcard matching

2015-02-24 Thread Arun Rangarajan
In our case, the lower-casing happens in our custom Java indexer code,
via Java's String.toLowerCase() method.

I used the analysis tool in Solr admin (with Jetty). I believe the raw
bytes explain this.

Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and
beyoncé in file beyonce_with_spl_chars.JPG.

Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]

So when you look at the bytes, it seems to explain why beyonce* matches
beyoncé.

I tried your approach with a KeywordTokenizer followed by a
LowerCaseFilter, but I see the same behavior.
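
To make the byte comparison above concrete, here is a minimal Java sketch.
It assumes the stored value is in decomposed form (a plain "e" followed by
the combining acute accent U+0301), which is what the trailing cc 81 bytes
suggest. The class and method names are made up for illustration, and plain
String.startsWith only stands in for the prefix match; it is not Lucene's
actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class: compares the decomposed and composed forms of the name.
final class AccentPrefixDemo {

    // Renders the UTF-8 bytes of a string as space-separated lowercase hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Composes base letter + combining mark into a single code point (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "beyonce\u0301";      // "e" followed by U+0301
        String composed = compose(decomposed);    // ends in the single code point U+00E9

        System.out.println(utf8Hex("beyonce"));   // 62 65 79 6f 6e 63 65
        System.out.println(utf8Hex(decomposed));  // 62 65 79 6f 6e 63 65 cc 81
        System.out.println(utf8Hex(composed));    // 62 65 79 6f 6e 63 c3 a9

        // The ASCII prefix is a literal prefix of the decomposed form only.
        System.out.println(decomposed.startsWith("beyonce")); // true
        System.out.println(composed.startsWith("beyonce"));   // false
    }
}
```

If the stored bytes really are decomposed, the match on beyonce* is a plain
byte-prefix match and no accent mapping is needed to explain it.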



On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky 
wrote:

> But how is that lowercasing occurring? I mean, solr.StrField doesn't do
> that.
>
> Some containers default to automatically mapping accented characters, so
> that the accented "e" would then get indexed as a normal "e", and then your
> wildcard would match it, and an accented "e" in a query would get mapped as
> well and then match the normal "e" in the index. What does your query
> response look like?
>
> This blog post explains that problem:
> http://bensch.be/tomcat-solr-and-special-characters
>
> Note that you could make your string field a text field with the keyword
> tokenizer and then filter it for lower case, such as when the user query
> might have a capital "B". String field is most appropriate when the field
> really is 100% raw.
>
>
> -- Jack Krupansky


Re: Special character and wildcard matching

2015-02-23 Thread Jack Krupansky
But how is that lowercasing occurring? I mean, solr.StrField doesn't do
that.

Some containers map accented characters automatically by default. In that
case the accented "e" would get indexed as a plain "e" and your wildcard
would match it; an accented "e" in a query would be mapped the same way and
then match the plain "e" in the index. What does your query response look
like?

This blog post explains that problem:
http://bensch.be/tomcat-solr-and-special-characters

Note that you could make your string field a text field with the keyword
tokenizer followed by a lower case filter, which also covers the case where
the user query has a capital "B". String field is most appropriate when the
field really is 100% raw.
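
If the field stays a string field, any normalization has to happen in the
indexing code itself (and in whatever builds the query). Below is a rough
Java sketch of that idea, assuming the indexer is free to change the stored
value: compose the text to Unicode NFC and then lower-case it with a fixed
locale. The class name RawNameNormalizer is hypothetical, and this is only
one possible workaround, not something confirmed in this thread.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for a custom indexer; not part of Solr or Lucene.
final class RawNameNormalizer {

    // Composes accents into single code points (NFC), then lower-cases
    // with Locale.ROOT so the result does not depend on the JVM's default locale.
    static String normalize(String raw) {
        String composed = Normalizer.normalize(raw, Normalizer.Form.NFC);
        return composed.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Input uses a capital "B" and a decomposed accent ("e" + U+0301).
        String indexed = normalize("Beyonce\u0301");

        System.out.println(indexed.equals("beyonc\u00e9"));  // true
        System.out.println(indexed.startsWith("beyonce"));   // false
    }
}
```

The same method would need to be applied to query terms, otherwise a query
typed with a decomposed accent would no longer match the composed value.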


-- Jack Krupansky

On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan 
wrote:

> Yes, it is a string field and not a text field.
>
>  omitNorms="true"/>
> 
>
> Lower-casing done to do case-insensitive matching.


Re: Special character and wildcard matching

2015-02-23 Thread Arun Rangarajan
Yes, it is a string field and not a text field.




The lower-casing is done to allow case-insensitive matching.

On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky 
wrote:

> Is it really a string field - as opposed to a text field? Show us the field
> and field type.
>
> Besides, if it really were a "raw" name, wouldn't that be a capital "B"?
>
> -- Jack Krupansky


Re: Special character and wildcard matching

2015-02-23 Thread Jack Krupansky
Is it really a string field - as opposed to a text field? Show us the field
and field type.

Besides, if it really were a "raw" name, wouldn't that be a capital "B"?

-- Jack Krupansky

On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan 
wrote:

> I have a string field raw_name like this in my document:
>
> {raw_name: beyoncé}
>
> (Notice that the last character is a special character.)
>
> When I issue this wildcard query:
>
> q=raw_name:beyonce*
>
> i.e. with the last character simply being the ASCII 'e', Solr returns me
> the above document.
>
> How do I prevent this?
>