Re: Unicode Quotes in query parser

Mikhail Khludnev Tue, 22 Jan 2019 07:40:06 -0800

My impression is that these quotes are part of the dismax query syntax,
i.e. they should be handled before the analysis happens.
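
As a rough illustration of handling the quotes before analysis, the sketch
below maps the double-quote variants from the original proposal to the ASCII
double quote on the raw query string, before it would reach the parser. The
class name QuoteNormalizer and its normalize method are hypothetical: they are
not part of Solr and are not taken from the POC linked below. Where such a step
would hook into the dismax parser is left open.

```java
public final class QuoteNormalizer {
    // Double-quote variants listed in the original proposal, written as
    // escapes: U+201C, U+201D, U+201E, U+201F, U+00AB, U+00BB, U+275D,
    // U+275E, U+2E42, U+FF02. This set is an assumption, not a standard.
    private static final String UNICODE_DOUBLE_QUOTES =
        "\u201C\u201D\u201E\u201F\u00AB\u00BB\u275D\u275E\u2E42\uFF02";

    private QuoteNormalizer() {}

    // Replaces every character from the set above with the ASCII double
    // quote (U+0022); all other characters pass through unchanged.
    public static String normalize(String query) {
        if (query == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            sb.append(UNICODE_DOUBLE_QUOTES.indexOf(c) >= 0 ? '"' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Curly quotes around a phrase become plain ASCII quotes.
        System.out.println(normalize("\u201Cquick fox\u201D jumps"));
    }
}
```

Because this runs on the query string rather than in a char filter, it changes
only what the parser sees as phrase delimiters and leaves the indexed terms
untouched.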


On Mon, Jan 21, 2019 at 8:09 PM Walter Underwood <[email protected]>
wrote:

> First, check which transforms are already handled by Unicode
> normalization. Put this in all of your analyzer chains:
>
>         <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>
> Probably need this in solrconfig.xml:
>
>   <!-- extras for ICU-based Unicode normalization -->
>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/" regex=".*\.jar" />
>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
>
> I really cannot think of a reason to use unnormalized Unicode in Solr.
> That should be in all the sample files.
>
> For search character matching, yes, all spaces should be normalized. I
> have too many hacks fixing non-breaking spaces spread around the code. When
> matching, there is zero use for stuff like ideographic space (U+3000).
>
> I’m not sure if quotes are normalized. I did some searching around without
> success. That might come under character folding. There was a draft, now
> withdrawn, for standard character folding. I’d probably start there for a
> Unicode folding char filter.
>
> https://www.unicode.org/reports/tr30/tr30-4.html
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/  (my blog)
>
> On Jan 21, 2019, at 7:43 AM, Michael Sokolov <[email protected]> wrote:
>
> I think this is probably better to discuss on solr-user, or maybe
> solr-dev, since it is dismax parser you are talking about, which really
> lives in Solr. However, my 2c: this seems somewhat dubious. Maybe people
> want to include those in their terms? Also, it leads to a kind of slippery
> slope: would you also want to convert all the various white space
> characters (no-break space, thin space, em space, etc.) to plain ASCII
> 32? How about all the other "operator" characters like brackets?
>
> On Mon, Jan 21, 2019 at 9:50 AM John Ryan <[email protected]>
> wrote:
>
>> I'm looking to create an issue to add support for Unicode Double Quotes
>> to the dismax parser.
>>
>> I want to replace all types of double quotes with standard ones before
>> they get stripped
>>
>> i.e.
>>         “ ” „ “ „ « » ‟ ❝ ❞ ⹂ ＂
>>
>> With
>>         "
>> I presume this has been discussed before?
>>
>> I have a POC here:
>> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
>>
>> Thanks,
>>
>> John
>>
>>
>

-- 
Sincerely yours
Mikhail Khludnev
