Re: Unicode Quotes in query parser

Michael Sokolov Tue, 22 Jan 2019 08:20:52 -0800

Right - QueryParsers generally do a first pass, parsing incoming Strings
using their operator characters to tokenize the input, and only after that
do they pass the tokens (or phrases) to an Analyzer. I haven't checked
Dismax - not sure how it does its parsing exactly, but I doubt you can just
"turn on the right Analyzer" to get it to recognize, for example, curly
quotes as phrase operators.
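
To illustrate the ordering point, here is a minimal sketch of folding quote
variants in the raw query string before any parser sees it. It is written in
Python for brevity and is not Solr or dismax code; `QUOTE_CHARS` and
`fold_quotes` are hypothetical names, and the character set is simply the list
from John's original message. The last function checks what Python's NFKC
normalization does to a character, which may differ from how an ICU
normalizer filter is configured in a given Solr schema.

```python
import unicodedata

# Quote variants taken from the list in the original post (an assumption
# about which characters matter; extend or trim as needed).
QUOTE_CHARS = (
    "\u201c"  # left double quotation mark
    "\u201d"  # right double quotation mark
    "\u201e"  # double low-9 quotation mark
    "\u201f"  # double high-reversed-9 quotation mark
    "\u00ab"  # left-pointing double angle quotation mark
    "\u00bb"  # right-pointing double angle quotation mark
    "\u275d"  # heavy double turned comma quotation mark ornament
    "\u275e"  # heavy double comma quotation mark ornament
    "\u2e42"  # double low-reversed-9 quotation mark
    "\uff02"  # fullwidth quotation mark
)

# Translation table mapping every variant to the ASCII double quote U+0022.
TO_ASCII_QUOTE = str.maketrans({ch: '"' for ch in QUOTE_CHARS})


def fold_quotes(query: str) -> str:
    """Replace quote variants with ASCII '"' in the raw query string.

    This has to run before query parsing, because the parser decides what
    is a phrase from the operator characters, not from analyzer output.
    """
    return query.translate(TO_ASCII_QUOTE)


def changed_by_nfkc(ch: str) -> bool:
    """Return True if NFKC normalization alters the given character."""
    return unicodedata.normalize("NFKC", ch) != ch
```

With this sketch, `fold_quotes("\u201csolr rocks\u201d")` yields a string
delimited by ASCII quotes that a parser could treat as a phrase. Under NFKC
the fullwidth quotation mark (U+FF02) is mapped to the ASCII quote, while the
curly quotes U+201C and U+201D are left unchanged, so normalization alone
would not cover the whole list.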


On Tue, Jan 22, 2019 at 10:39 AM Mikhail Khludnev <[email protected]> wrote:

> My impression is that these quotes are part of the dismax query syntax,
> i.e. they should be handled before the analysis happens.
>
> On Mon, Jan 21, 2019 at 8:09 PM Walter Underwood <[email protected]>
> wrote:
>
>> First, check which transforms are already handled by Unicode
>> normalization. Put this in all of your analyzer chains:
>>
>>         <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>>
>> You probably need this in solrconfig.xml:
>>
>>   <!-- extras for ICU-based Unicode normalization -->
>>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/" regex=".*\.jar" />
>>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
>>
>> I really cannot think of a reason to use unnormalized Unicode in Solr.
>> That should be in all the sample files.
>>
>> For search character matching, yes, all spaces should be normalized. I
>> have too many hacks fixing non-breaking spaces spread around the code. When
>> matching, there is zero use for stuff like ideographic space (U+3000).
>>
>> I’m not sure if quotes are normalized. I did some searching around
>> without success. That might come under character folding. There was a
>> draft, now withdrawn, for standard character folding. I’d probably start
>> there for a Unicode folding char filter.
>>
>> https://www.unicode.org/reports/tr30/tr30-4.html
>>
>> wunder
>> Walter Underwood
>> [email protected]
>> http://observer.wunderwood.org/  (my blog)
>>
>> On Jan 21, 2019, at 7:43 AM, Michael Sokolov <[email protected]> wrote:
>>
>> I think this is probably better to discuss on solr-user, or maybe
>> solr-dev, since it is the dismax parser you are talking about, which really
>> lives in Solr. However, my 2c - this seems somewhat dubious. Maybe people
>> want to include those in their terms? Also, it leads to a kind of slippery
>> slope: would you also want to convert all the various white space
>> characters (no-break space, thin space, em space, etc.) to vanilla ASCII
>> 32? How about all the other "operator" characters like brackets?
>>
>> On Mon, Jan 21, 2019 at 9:50 AM John Ryan <[email protected]>
>> wrote:
>>
>>> I'm looking to create an issue to add support for Unicode Double Quotes
>>> to the dismax parser.
>>>
>>> I want to replace all types of double quotes with standard ones before
>>> they get stripped,
>>>
>>> i.e.
>>>         “ ” „ “ „ « » ‟ ❝ ❞ ⹂ ＂
>>>
>>> With
>>>         "
>>> I presume this has been discussed before?
>>>
>>> I have a POC here:
>>> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
>>>
>>> Thanks,
>>>
>>> John
>>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
