Re: Special character and wildcard matching

Jack Krupansky Tue, 24 Feb 2015 12:55:50 -0800

It's a string field, so there shouldn't be any analysis. (Read back in the
thread for the field and field type.)


-- Jack Krupansky

On Tue, Feb 24, 2015 at 3:19 PM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> What happens if the query does not have wildcard expansion (*)? If the
> behavior is correct, then the issue is somehow with the
> MultitermQueryAnalysis (a hidden automatically generated analyzer
> chain): http://wiki.apache.org/solr/MultitermQueryAnalysis
>
> Which would still make it a bug, but at least the cause could be narrowed
> down.
>
> Regards,
>    Alex.
>
>
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 24 February 2015 at 14:56, Arun Rangarajan <arunrangara...@gmail.com>
> wrote:
> > Thanks, Jack.
> > I have filed a ticket: https://issues.apache.org/jira/browse/SOLR-7154
> >
> >
> > On Tue, Feb 24, 2015 at 11:43 AM, Jack Krupansky <
> jack.krupan...@gmail.com>
> > wrote:
> >
> >> Thanks. That at least verifies that the accented e is stored in the field.
> >> I don't see anything wrong here, so it is as if the Lucene prefix query was
> >> mapping the accented characters. It's not supposed to do that, but...
> >>
> >> Go ahead and file a Jira bug. Include all of the details that you provided
> >> in this thread.
> >>
> >> -- Jack Krupansky
> >>
> >> On Tue, Feb 24, 2015 at 2:35 PM, Arun Rangarajan <
> arunrangara...@gmail.com
> >> >
> >> wrote:
> >>
> >> > Exact query:
> >> > /select?q=raw_name:beyonce*&wt=json&fl=raw_name
> >> >
> >> > Response:
> >> >
> >> > {
> >> >   "responseHeader": {
> >> >     "status": 0,
> >> >     "QTime": 0,
> >> >     "params": {
> >> >       "fl": "raw_name",
> >> >       "q": "raw_name:beyonce*",
> >> >       "wt": "json"
> >> >     }
> >> >   },
> >> >   "response": {
> >> >     "numFound": 2,
> >> >     "start": 0,
> >> >     "docs": [
> >> >       { "raw_name": "beyoncé" },
> >> >       { "raw_name": "beyoncé" }
> >> >     ]
> >> >   }
> >> > }
> >> >
> >> >
> >> >
> >> > On Tue, Feb 24, 2015 at 11:01 AM, Jack Krupansky <
> >> jack.krupan...@gmail.com
> >> > >
> >> > wrote:
> >> >
> >> > > Please post the info I requested - the exact query, and the Solr
> >> > response.
> >> > >
> >> > > -- Jack Krupansky
> >> > >
> >> > > On Tue, Feb 24, 2015 at 12:45 PM, Arun Rangarajan <
> >> > > arunrangara...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > In our case, the lower-casing is happening in custom Java indexer code,
> >> > > > via Java's String.toLowerCase() method.
> >> > > >
> >> > > > I used the analysis tool in Solr admin (with Jetty). I believe the raw
> >> > > > bytes explain this.
> >> > > >
> >> > > > Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and
> >> > > > beyoncé in file beyonce_with_spl_chars.JPG.
> >> > > >
> >> > > > Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
> >> > > > Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]
> >> > > >
> >> > > > So when you look at the bytes, it seems to explain why beyonce* matches
> >> > > > beyoncé.
> >> > > >
> >> > > > I tried your approach with a KeywordTokenizer followed by a
> >> > > > LowerCaseFilter, but I see the same behavior.
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky <
> >> > > jack.krupan...@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > >> But how is that lowercasing occurring? I mean, solr.StrField doesn't do
> >> > > >> that.
> >> > > >>
> >> > > >> Some containers default to automatically mapping accented characters, so
> >> > > >> that the accented "e" would then get indexed as a normal "e", and then your
> >> > > >> wildcard would match it, and an accented "e" in a query would get mapped as
> >> > > >> well and then match the normal "e" in the index. What does your query
> >> > > >> response look like?
> >> > > >>
> >> > > >> This blog post explains that problem:
> >> > > >> http://bensch.be/tomcat-solr-and-special-characters
> >> > > >>
> >> > > >> Note that you could make your string field a text field with the keyword
> >> > > >> tokenizer and then filter it for lower case, such as when the user query
> >> > > >> might have a capital "B". String field is most appropriate when the field
> >> > > >> really is 100% raw.
> >> > > >>
> >> > > >>
> >> > > >> -- Jack Krupansky
> >> > > >>
> >> > > >> On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan <
> >> > > >> arunrangara...@gmail.com>
> >> > > >> wrote:
> >> > > >>
> >> > > >> > Yes, it is a string field and not a text field.
> >> > > >> >
> >> > > >> > <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> >> > > >> >            omitNorms="true"/>
> >> > > >> > <field name="raw_name" type="string" indexed="true" stored="true" />
> >> > > >> >
> >> > > >> > Lower-casing is done to allow case-insensitive matching.
> >> > > >> >
> >> > > >> > On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky <
> >> > > >> jack.krupan...@gmail.com>
> >> > > >> > wrote:
> >> > > >> >
> >> > > >> > > Is it really a string field - as opposed to a text field? Show us the
> >> > > >> > > field and field type.
> >> > > >> > >
> >> > > >> > > Besides, if it really were a "raw" name, wouldn't that be a capital "B"?
> >> > > >> > >
> >> > > >> > > -- Jack Krupansky
> >> > > >> > >
> >> > > >> > > On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan <
> >> > > >> > arunrangara...@gmail.com
> >> > > >> > > >
> >> > > >> > > wrote:
> >> > > >> > >
> >> > > >> > > > I have a string field raw_name like this in my document:
> >> > > >> > > >
> >> > > >> > > > {raw_name: beyoncé}
> >> > > >> > > >
> >> > > >> > > > (Notice that the last character is a special character.)
> >> > > >> > > >
> >> > > >> > > > When I issue this wildcard query:
> >> > > >> > > >
> >> > > >> > > > q=raw_name:beyonce*
> >> > > >> > > >
> >> > > >> > > > i.e. with the last character simply being the ASCII 'e', Solr returns me
> >> > > >> > > > the above document.
> >> > > >> > > >
> >> > > >> > > > How do I prevent this?
> >> > > >> > > >
> >> > > >> > >
> >> > > >> >
> >> > > >>
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
>

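A note on the raw bytes quoted in the thread: the trailing "65 cc 81" is the
ASCII 'e' followed by the UTF-8 encoding of U+0301 (COMBINING ACUTE ACCENT),
which suggests the indexed value is in decomposed form. In that form the bytes
of "beyonce" really are a prefix of the stored term. The thread does not
confirm that this is the whole story behind SOLR-7154, so treat the following
as a sketch under that assumption only. It uses the JDK's java.text.Normalizer
to compose the value to NFC before indexing; the class name NfcPrefixCheck and
the helper normalizeForIndex are hypothetical and are not part of Solr or of
the poster's indexer.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Locale;

final class NfcPrefixCheck {

    // Hypothetical indexer-side helper: compose to NFC, then lower-case.
    static String normalizeForIndex(String raw) {
        return Normalizer.normalize(raw, Normalizer.Form.NFC).toLowerCase(Locale.ROOT);
    }

    // UTF-8 bytes as space-separated hex, in the same layout as the thread's dumps.
    static String hexBytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decomposed form: 'e' followed by the combining acute accent.
        String decomposed = "beyonce\u0301";
        String composed = normalizeForIndex(decomposed);

        System.out.println(hexBytes(decomposed)); // 62 65 79 6f 6e 63 65 cc 81
        System.out.println(hexBytes(composed));   // 62 65 79 6f 6e 63 c3 a9

        System.out.println(decomposed.startsWith("beyonce")); // true
        System.out.println(composed.startsWith("beyonce"));   // false
    }
}
```

If the stored values are composed this way, the same normalization would also
have to be applied to user queries so that a query containing an accented "e"
still matches the indexed term.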