Thanks Walter,
From the testing and research I have done on
solr.ICUNormalizer2CharFilterFactory, I believe that quotes are not normalised.
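
For reference, here is a minimal sketch of the kind of check I mean. It uses the JDK's java.text.Normalizer in NFKC form as a stand-in for the ICU filter (which, as far as I know, defaults to nfkc_cf). That substitution is an assumption, and the Solr filter itself is not invoked here. The class and method names are only illustrative.

```java
import java.text.Normalizer;

class NfkcQuoteCheck {
    // JDK NFKC normalization; a stand-in for the ICU char filter, not the filter itself.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String nbsp = "a\u00A0b";         // no-break space, U+00A0
        String ideographic = "a\u3000b";  // ideographic space, U+3000
        String curly = "\u201Cfoo\u201D"; // curly double quotes, U+201C and U+201D

        // Both space variants have a compatibility mapping to U+0020.
        System.out.println(nfkc(nbsp).equals("a b"));
        System.out.println(nfkc(ideographic).equals("a b"));
        // The curly quotes have no such mapping, so they pass through unchanged.
        System.out.println(nfkc(curly).equals(curly));
    }
}
```

All three lines print true: the spaces are normalised to a plain space, and the quotes are left as they are.
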
I attempted to do this with character folding. There are many implementations
out there, but none of them actually seems to work.
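
What I am after is roughly the following. This is only a sketch of the folding step, not the code from my POC. The class and method names are made up for illustration, and the set of code points is taken from the list in my original mail below.

```java
class QuoteFolder {
    // Double-quote variants from the original request:
    // U+201C, U+201D, U+201E, U+201F, U+00AB, U+00BB, U+275D, U+275E, U+2E42
    private static final String VARIANTS =
        "\u201C\u201D\u201E\u201F\u00AB\u00BB\u275D\u275E\u2E42";

    // Replaces every listed variant with the ASCII double quote (U+0022).
    // All variants are in the BMP, so iterating over chars is sufficient.
    static String foldQuotes(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            out.append(VARIANTS.indexOf(c) >= 0 ? '"' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Curly quotes around a phrase become plain ASCII quotes.
        System.out.println(foldQuotes("\u201Csolr search\u201D"));
    }
}
```

The idea would be to run something like this over the query string before the parser strips or interprets quotes, so that a phrase typed with curly quotes is treated the same as one typed with ASCII quotes.
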
I’ll look into the draft.
Thanks
--
John
> On 21 Jan 2019, at 17:09, Walter Underwood <[email protected]> wrote:
>
> First, check which transforms are already handled by Unicode normalization.
> Put this in all of your analyzer chains:
>
> <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>
> You probably also need this in solrconfig.xml:
>
> <!-- extras for ICU-based Unicode normalization -->
> <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/" regex=".*\.jar" />
> <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
>
> I really cannot think of a reason to use unnormalized Unicode in Solr. That
> char filter should be in all the sample files.
>
> For search character matching, yes, all spaces should be normalized. I have
> too many hacks fixing non-breaking spaces spread around the code. When
> matching, there is zero use for stuff like ideographic space (U+3000).
>
> I’m not sure if quotes are normalized. I did some searching around without
> success. That might come under character folding. There was a draft, now
> withdrawn, for standard character folding. I’d probably start there for a
> Unicode folding char filter.
>
> https://www.unicode.org/reports/tr30/tr30-4.html
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/ (my blog)
>
>> On Jan 21, 2019, at 7:43 AM, Michael Sokolov <[email protected]> wrote:
>>
>> I think this is probably better to discuss on solr-user, or maybe solr-dev,
>> since it is dismax parser you are talking about, which really lives in Solr.
>> However, my 2c - this seems somewhat dubious. Maybe people want to include
>> those in their terms? Also, it leads to a kind of slippery slope: would you
>> also want to convert all the various white space characters (no-break space,
>> thin space, em space, etc.) to vanilla ASCII 32? How about all the other
>> "operator" characters like brackets?
>>
>> On Mon, Jan 21, 2019 at 9:50 AM John Ryan <[email protected]> wrote:
>> I'm looking to create an issue to add support for Unicode Double Quotes to
>> the dismax parser.
>>
>> I want to replace all types of double quotes with standard ones before they
>> get stripped
>>
>> i.e.
>> “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
>>
>> With
>> "
>> I presume this has been discussed before?
>>
>> I have a POC here:
>> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
>>
>> Thanks,
>>
>> John
>>
>