My impression is that these quotes are part of the dismax query syntax, i.e. they should be handled before analysis happens.
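To make "before analysis" concrete, here is a minimal sketch of mapping the quote variants from the original proposal to an ASCII double quote on the raw query string, before it reaches the parser. The class and method names are hypothetical and are not taken from Solr or from the linked POC; where such a step would actually hook into the dismax parser is left open.

```java
// Hypothetical helper, not part of Solr: maps Unicode double-quote
// look-alikes to the ASCII quote (") that the query syntax recognises.
final class QuoteNormalizer {

    // The variants listed in the thread:
    // U+201C, U+201D, U+201E, U+201F, U+00AB, U+00BB, U+275D, U+275E, U+2E42
    private static final String UNICODE_DOUBLE_QUOTES =
            "\u201C\u201D\u201E\u201F\u00AB\u00BB\u275D\u275E\u2E42";

    // Returns the query with every listed quote variant replaced by '"'.
    // All listed characters are in the BMP, so char-by-char iteration is safe.
    static String normalizeQuotes(String query) {
        if (query == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            sb.append(UNICODE_DOUBLE_QUOTES.indexOf(c) >= 0 ? '"' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A phrase typed with curly quotes becomes a regular phrase query.
        System.out.println(normalizeQuotes("\u201Capache solr\u201D dismax"));
    }
}
```

Because this runs on the query string rather than in an analyzer chain, it only affects how phrases are recognised and leaves indexed terms untouched.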
On Mon, Jan 21, 2019 at 8:09 PM Walter Underwood <[email protected]> wrote:

> First, check which transforms are already handled by Unicode
> normalization. Put this in all of your analyzer chains:
>
> <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>
> Probably need this in solrconfig.xml:
>
> <!-- extras for ICU-based Unicode normalization -->
> <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/"
>      regex=".*\.jar" />
> <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs"
>      regex=".*\.jar" />
>
> I really cannot think of a reason to use unnormalized Unicode in Solr.
> That should be in all the sample files.
>
> For search character matching, yes, all spaces should be normalized. I
> have too many hacks fixing non-breaking spaces spread around the code. When
> matching, there is zero use for stuff like ideographic space (U+3000).
>
> I’m not sure if quotes are normalized. I did some searching around without
> success. That might come under character folding. There was a draft, now
> withdrawn, for standard character folding. I’d probably start there for a
> Unicode folding char filter.
>
> https://www.unicode.org/reports/tr30/tr30-4.html
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/ (my blog)
>
> On Jan 21, 2019, at 7:43 AM, Michael Sokolov <[email protected]> wrote:
> >
> > I think this is probably better to discuss on solr-user, or maybe
> > solr-dev, since it is dismax parser you are talking about, which really
> > lives in Solr. However, my 2c - this seems somewhat dubious. Maybe people
> > want to include those in their terms? Also, it leads to a kind of slippery
> > slope: would you also want to convert all the various white space
> > characters (no-break space, thin space, em space, etc) as vanilla ascii
> > 32? How about all the other "operator" characters like brackets?
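On the open question of whether normalization already covers quotes: a quick check with the JDK's own `java.text.Normalizer` suggests that NFKC maps the space variants mentioned above to U+0020 but leaves curly quotes and guillemets alone. This is only a sketch using the JDK as a stand-in; it does not run `ICUNormalizer2CharFilterFactory` itself, whose configured normalization form may differ, and the class name is hypothetical.

```java
import java.text.Normalizer;

// Hypothetical probe class: prints what NFKC does to a few single characters.
final class NormalizationCheck {

    // Applies Unicode NFKC (compatibility decomposition, then composition).
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // no-break space, ideographic space, thin space, left curly quote, left guillemet
        String[] samples = { "\u00A0", "\u3000", "\u2009", "\u201C", "\u00AB" };
        for (String s : samples) {
            String out = nfkc(s);
            // Show the code point before and after normalization.
            System.out.printf("U+%04X -> U+%04X%n", s.codePointAt(0), out.codePointAt(0));
        }
    }
}
```

If that holds for the ICU filter as well, whitespace would be handled by normalization alone, while quotes would still need a separate mapping step.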
> >
> > On Mon, Jan 21, 2019 at 9:50 AM John Ryan <[email protected]>
> > wrote:
> >
> >> I'm looking to create an issue to add support for Unicode Double Quotes
> >> to the dismax parser.
> >>
> >> I want to replace all types of double quotes with standard ones before
> >> they get stripped
> >>
> >> i.e.
> >> “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
> >>
> >> With
> >> "
> >> I presume this has been discussed before?
> >>
> >> I have a POC here:
> >> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
> >>
> >> Thanks,
> >>
> >> John
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >

--
Sincerely yours
Mikhail Khludnev
