Right - QueryParsers generally do a first pass, parsing the incoming String and using their operator characters to tokenize the input, and only after that do they pass the tokens (or phrases) to an Analyzer. I haven't checked Dismax, so I'm not sure exactly how it does its parsing, but I doubt you can just "turn on the right Analyzer" to get it to recognize curly quotes as phrase operators, for example.
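To make the ordering concrete: if the quotes have to be folded before the parser's first pass, one place to do it is on the raw query string before it ever reaches Solr. The following is only a minimal client-side sketch under that assumption, not the dismax change proposed in the thread. The helper name `normalize_query` and the `DOUBLE_QUOTES` table are hypothetical, the quote list is taken from the original post, and Python's `unicodedata` NFKC stands in for whatever normalization the analysis chain applies. It also shows that NFKC folds no-break and ideographic spaces to an ASCII space but leaves curly quotes untouched, so quote folding needs its own explicit mapping.

```python
import unicodedata

# Double-quote variants listed in the original post (assumption: all of
# them should act as the ASCII phrase operator, including the guillemets).
DOUBLE_QUOTES = "\u201c\u201d\u201e\u201f\u00ab\u00bb\u275d\u275e\u2e42"
QUOTE_TABLE = str.maketrans({ch: '"' for ch in DOUBLE_QUOTES})


def normalize_query(raw: str) -> str:
    """Hypothetical helper: fold quote variants to ASCII '"', then apply NFKC.

    Intended to run on the raw query string, before any query parser
    tokenizes it on operator characters.
    """
    return unicodedata.normalize("NFKC", raw.translate(QUOTE_TABLE))


if __name__ == "__main__":
    # Curly quotes become ASCII quotes, so a parser can see a phrase.
    print(normalize_query("\u201csolr rocks\u201d"))
    # NFKC alone does not change a curly quote...
    print(unicodedata.normalize("NFKC", "\u201c") == "\u201c")
    # ...but it does fold no-break space and ideographic space to U+0020.
    print(normalize_query("a\u00a0b\u3000c") == "a b c")
```

Doing this outside the parser avoids touching dismax at all, at the cost of also rewriting quotes that a user may have meant literally, which is the concern raised further down the thread.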

On Tue, Jan 22, 2019 at 10:39 AM Mikhail Khludnev <[email protected]> wrote:

> My impression that these quotes are ones which are part of dismax query
> syntax ie they should be handled before the analysis happens.
>
> On Mon, Jan 21, 2019 at 8:09 PM Walter Underwood <[email protected]>
> wrote:
>
>> First, check which transforms are already handled by Unicode
>> normalization. Put this in all of your analyzer chains:
>>
>> <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>>
>> Probably need this in solrconfig.xml:
>>
>> <!-- extras for ICU-based Unicode normalization -->
>> <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/" regex=".*\.jar" />
>> <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
>>
>> I really cannot think of a reason to use unnormalized Unicode in Solr.
>> That should be in all the sample files.
>>
>> For search character matching, yes, all spaces should be normalized. I
>> have too many hacks fixing non-breaking spaces spread around the code. When
>> matching, there is zero use for stuff like ideographic space (U+3000).
>>
>> I’m not sure if quotes are normalized. I did some searching around
>> without success. That might come under character folding. There was a
>> draft, now withdrawn, for standard character folding. I’d probably start
>> there for a Unicode folding char filter.
>>
>> https://www.unicode.org/reports/tr30/tr30-4.html
>>
>> wunder
>> Walter Underwood
>> [email protected]
>> http://observer.wunderwood.org/ (my blog)
>>
>> On Jan 21, 2019, at 7:43 AM, Michael Sokolov <[email protected]> wrote:
>>
>> I think this is probably better to discuss on solr-user, or maybe
>> solr-dev, since it is dismax parser you are talking about, which really
>> lives in Solr. However, my 2c - this seems somewhat dubious. Maybe people
>> want to include those in their terms? Also, it leads to a kind of slippery
>> slope: would you also want to convert all the various white space
>> characters (no-break space, thin space, em space, etc) as vanilla ascii
>> 32? How about all the other "operator" characters like brackets?
>>
>> On Mon, Jan 21, 2019 at 9:50 AM John Ryan <[email protected]>
>> wrote:
>>
>>> I'm looking to create an issue to add support for Unicode Double Quotes
>>> to the dismax parser.
>>>
>>> I want to replace all types of double quotes with standard ones before
>>> they get stripped
>>>
>>> i.e.
>>> “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
>>>
>>> With
>>> "
>>> I presume this has been discussed before?
>>>
>>> I have a POC here:
>>> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x
>>>
>>> Thanks,
>>>
>>> John
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
