Re: Unicode Quotes in query parser

John Ryan Tue, 22 Jan 2019 07:35:46 -0800

Thanks Walter,

The testing and research I have done with solr.ICUNormalizer2CharFilterFactory 
lead me to believe that quotes are not normalised.
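
A quick check outside Solr points the same way. This is only a sketch: it uses 
NFKC from Python's unicodedata module as a stand-in for the ICU filter (which, 
as far as I understand, defaults to nfkc_cf, i.e. NFKC plus case folding), so it 
assumes the two agree for these characters. The helper name nfkc is made up for 
illustration.

```python
import unicodedata


def nfkc(text):
    """Apply NFKC compatibility normalisation (a stand-in for the ICU filter)."""
    return unicodedata.normalize("NFKC", text)


# Spaces that carry a compatibility mapping collapse to plain U+0020.
print(nfkc("\u00a0") == " ")   # no-break space
print(nfkc("\u3000") == " ")   # ideographic space

# The fullwidth quotation mark folds to the ASCII double quote ...
print(nfkc("\uff02") == '"')

# ... but curly quotes and guillemets have no compatibility mapping,
# so normalisation hands them back unchanged.
for ch in "\u201c\u201d\u201e\u00ab\u00bb":
    print(f"U+{ord(ch):04X} unchanged: {nfkc(ch) == ch}")
```

So normalisation covers the space cases Walter mentions, but of the quote 
characters tried here only the fullwidth one is folded.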


I attempted to do this with character folding. There are many implementations 
out there, but none actually seem to work.
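
For reference, the kind of folding in question can be sketched as a plain lookup 
table applied to the raw query string before it reaches the parser. This is a 
Python illustration, not the Java code from the POC; the names DOUBLE_QUOTES, 
QUOTE_TABLE and fold_double_quotes are hypothetical, and the character set is 
just the list from the original message.

```python
# Double-quote variants from the original message, written as escapes:
# U+201C U+201D U+201E U+201F U+00AB U+00BB U+275D U+275E U+2E42 U+FF02
DOUBLE_QUOTES = "\u201c\u201d\u201e\u201f\u00ab\u00bb\u275d\u275e\u2e42\uff02"

# Translation table mapping every variant to the ASCII double quote.
QUOTE_TABLE = str.maketrans({ch: '"' for ch in DOUBLE_QUOTES})


def fold_double_quotes(query):
    """Replace Unicode double-quote variants with the ASCII double quote."""
    return query.translate(QUOTE_TABLE)


print(fold_double_quotes("\u201cred shoes\u201d \u00absize 9\u00bb"))
```

Doing this on the query string rather than in the analysis chain matters for the 
phrase case, since the parser has to see the ASCII quote to treat it as an 
operator.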

I’ll look into the draft.
        
Thanks
--
John  

> On 21 Jan 2019, at 17:09, Walter Underwood <[email protected]> wrote:
> 
> First, check which transforms are already handled by Unicode normalization. 
> Put this in all of your analyzer chains:
> 
>         <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> 
> Probably need this in solrconfig.xml:
> 
>   <!-- extras for ICU-based Unicode normalization -->
>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/" regex=".*\.jar" />
>   <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
> 
> I really cannot think of a reason to use unnormalized Unicode in Solr. That 
> char filter should be in all the sample files.
> 
> For search character matching, yes, all spaces should be normalized. I have 
> too many hacks fixing non-breaking spaces spread around the code. When 
> matching, there is zero use for stuff like ideographic space (U+3000).
> 
> I’m not sure if quotes are normalized. I did some searching around without 
> success. That might come under character folding. There was a draft, now 
> withdrawn, for standard character folding. I’d probably start there for a 
> Unicode folding char filter.
> 
> https://www.unicode.org/reports/tr30/tr30-4.html
> 
> wunder
> Walter Underwood
> [email protected] <mailto:[email protected]>
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jan 21, 2019, at 7:43 AM, Michael Sokolov <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> I think this is probably better to discuss on solr-user, or maybe solr-dev, 
>> since it is the dismax parser you are talking about, which really lives in 
>> Solr. However, my 2c: this seems somewhat dubious. Maybe people want to include 
>> those in their terms? Also, it leads to a kind of slippery slope: would you 
>> also want to convert all the various white space characters (no-break space, 
>> thin space, em space, etc.) to vanilla ASCII 32? How about all the other 
>> "operator" characters like brackets?
>> 
>> On Mon, Jan 21, 2019 at 9:50 AM John Ryan <[email protected] 
>> <mailto:[email protected]>> wrote:
>> I'm looking to create an issue to add support for Unicode Double Quotes to 
>> the dismax parser. 
>> 
>> I want to replace all types of double quotes with standard ones before they 
>> get stripped.
>> 
>> i.e.
>>         “ ” „ “ „ « » ‟ ❝ ❞ ⹂ ＂
>> 
>> With 
>>         "
>> I presume this has been discussed before?
>> 
>> I have a POC here: 
>> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
>> 
>> Thanks, 
>> 
>> John
