Re: Terms with hyphens and fuzzy search

Markus Jelsma Tue, 23 Aug 2022 09:00:33 -0700

It's been a while, but I seem to remember that fuzzy queries are not
analyzed. That means you are looking for term-with-hyphens as a single
token, with a maximum edit distance of 1. But because you use an analyzer
that splits on hyphens, there is no term containing a hyphen in your index.


If you switch to a WhitespaceTokenizer (with no WordDelimiterFilter) and
reindex, term-with-hyphens will be stored in the index as a single term.
Then you can find it using FuzzyQuery.
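
To make the point above concrete, here is a toy Python sketch. It is not
Solr or Lucene code: the two tokenizer functions are simplified stand-ins
for an analyzer that splits on hyphens and for a whitespace-only analyzer,
and the matching uses plain Levenshtein distance, which is an assumption
made for illustration. All function names are hypothetical.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def split_on_hyphens(text: str) -> list[str]:
    # Stand-in for an analyzer that breaks on whitespace and hyphens.
    return [t for t in re.split(r"[\s\-]+", text.lower()) if t]


def split_on_whitespace(text: str) -> list[str]:
    # Stand-in for a whitespace-only analyzer: hyphens stay inside the term.
    return text.lower().split()


def fuzzy_matches(query_term: str, indexed_terms: list[str], max_edits: int) -> list[str]:
    # The query term is NOT tokenized; it is compared as-is to each indexed term.
    return sorted(
        t for t in set(indexed_terms) if edit_distance(query_term, t) <= max_edits
    )


docs = ["term-with-hyphens", "term with hyphens"]
hyphen_split_index = [t for d in docs for t in split_on_hyphens(d)]
whitespace_index = [t for d in docs for t in split_on_whitespace(d)]

# Index holds only "term", "with", "hyphens": nothing is within 1 edit.
print(fuzzy_matches("term-with-hyphens", hyphen_split_index, 1))
# Index holds "term-with-hyphens" as one term: the fuzzy lookup finds it.
print(fuzzy_matches("term-with-hyphens", whitespace_index, 1))
print(fuzzy_matches("term-with-hyphen", whitespace_index, 1))
```

With the hyphen-splitting stand-in the first lookup comes back empty, while
the whitespace-only variant matches both the exact term and a one-character
misspelling of it.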

On Tue, 23 Aug 2022 at 17:50, Morten Ernebjerg
<[email protected]> wrote:

> Hi again
>
> OK, so I think this is starting to make sense. What was confusing us was
> that we indeed thought of a hyphenated term (like: term-with-hyphens) as
> just a single term, meaning that fuzzy search should apply as usual.
> However, if I understand you correctly, it sounds like the correct
> statement is actually that fuzzy search applies to *terms that result in a
> single token after indexing*. Since the standard tokenizer splits on
> hyphens, fuzzy search would then not apply. Did I get that right?
>
> >phrase query fields
>
> I'm not sure I quite follow - do you mean using the qf query parameter or
> setting up separate "parallel" fields of some sort?
>
> Best,
>
> Morten
>
> On Tue, 23 Aug 2022 at 17:29, Dave <[email protected]> wrote:
>
> > Ok, so from what I’m looking at, you have a proximity search, so the
> > terms have to be within the distance value of each other. In my example,
> > 2, which obviously won’t work since there are three terms. A fuzzy search
> > is based on a single term/token. So you need to add ~2 to each term if
> > that’s what you want. There’s really good documentation about the
> > difference and why it’s not working as you expected here:
> >
> > https://examples.javacodegeeks.com/apache-solr-fuzzy-search-example/
> >
> > Also try to make use of phrase query fields and boost them.
> >
> >
> >
> > > On Aug 23, 2022, at 11:18 AM, Morten Ernebjerg
> > <[email protected]> wrote:
> > >
> > > (replying on behalf of my colleague Julian, who wrote this question
> > > and is unable to reply for technical reasons)
> > > Hi David,
> > >
> > > Thanks for the reply! I think your question may point to something we
> > > overlooked. We are actually using Solr 8.11 and we want to use fuzzy
> > > search
> > > (https://solr.apache.org/guide/8_11/the-standard-query-parser.html#fuzzy-searches),
> > > i.e. find words that differ from the query by one or a few characters.
> > > Our understanding was that to get matches that differ by max two chars
> > > from (using separate line to avoid adding confusing quotation marks)
> > >
> > > term-with-hyphens
> > >
> > > we should send the following query (without any quotation marks):
> > >
> > > term-with-hyphens~2
> > >
> > > Our thinking was that the hyphenated term is one word so there is no
> > > need to quote it. We had a quick try quoting the hyphenated term in the
> > > query as you suggested and it looks like it works (i.e. returns
> > > matches). Since the standard tokenizer splits on hyphens, I'm wondering
> > > whether the unquoted query somehow gets converted to the *proximity
> > > search* query
> > >
> > > "term with hyphens"~2
> > >
> > > which then fails (though it looks like it should still match
> > > term-with-hyphens). Would be great to understand what is happening.
> > >
> > > Best,
> > >
> > > Morten
> > >
> > >
> > >
> > >> On Tue, 23 Aug 2022 at 16:30, David Hastings <[email protected]>
> > >> wrote:
> > >>
> > >> I’m not certain of your tokenizer, of course, but shouldn’t it be
> > >> “terms-with-hyphens”~1 ?
> > >>
> > >> Just a syntax thing that may not have translated over email, but
> > >> I'm curious.
> > >>
> > >> On Tue, Aug 23, 2022 at 10:12 AM Julian Hugo <[email protected]>
> > >> wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> I am getting peculiar results when querying for a term containing
> > >>> hyphens and adding fuzzy search
> > >>> <https://solr.apache.org/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-FuzzySearches>.
> > >>>
> > >>> I have indexed two items (1) "term-with-hyphens" and (2) "term with
> > >>> hyphens". When I query ("q") for "term-with-hyphens" or "term with
> > >>> hyphens", both items are returned as expected. The same is the case
> > >>> for escaped hyphens "term\-with\-hyphens".
> > >>>
> > >>> The problem: When I add the fuzzy search parameter (i.e.,
> > >>> "term-with-hyphens~1" or "term\-with\-hyphens~1"), I get zero results
> > >>> back.
> > >>>
> > >>> I struggle to understand the results, or how to solve this problem.
> > >>> My intuition tells me that adding a fuzzy search parameter should
> > >>> surely increase the size of the set of results. I am happy for any
> > >>> help on this!
> > >>>
> > >>> Our current setup is using the "Extended DisMax Query Parser"
> > >>> <https://solr.apache.org/guide/6_6/the-extended-dismax-query-parser.html>;
> > >>> however, we observe the same behaviour using the "Standard Query Parser"
> > >>> <https://solr.apache.org/guide/6_6/the-standard-query-parser.html>. We
> > >>> are using the "Standard Tokenizer"
> > >>> <https://solr.apache.org/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer>,
> > >>> which splits at hyphens. Does this relate to this problem?
> > >>>
> > >>> Thank you!
> > >>>
> > >>> --
> > >>>
> > >>> *Julian Hugo*
> > >>>
> > >>> Working Student
> > >>> Backend Development
> > >>>
> > >>> (he/his)
> > >>>
> > >>>
> > >>> [email protected]
> > >>>
> > >>>
> > >>> D4L data4life gGmbH
> > >>> Charlottenstraße 109
> > >>> 14467 Potsdam, Germany
> > >>>
> > >>> www.data4life.care
> > >>>
> > >>>
> > >>> Amtsgericht Potsdam, HRB 30667
> > >>>
> > >>> Managing Director: Christian-Cornelius Weiß
> > >>>
> > >>>
> > >>> We are Data4Life. We've been certified by the German Federal Office
> > >>> for Information Security (BSI) in accordance with ISO 27001 on the
> > >>> basis of "IT-Grundschutz".
> > >>>
> > >>>
> > >>> Diversity is the driving force behind our work towards a society
> > >>> where digital health improves quality of life for everyone.
> > >>> Data4Life warmly welcomes applicants from the LGBTQI+ community,
> > >>> people with a migration background, People of Color, and individuals
> > >>> with disabilities or chronic illnesses to the team.
> > >>>
> > >>>
> > >>> Climate neutral since 2019 <https://wtca.lfca.earth/e/data4life>
> > >>>
> > >>
> > >
> > >
> > > --
> > >
> > > *Morten Ernebjerg, Ph.D.*
> > >
> > > Senior Developer
> > >
> > >
> > > [email protected]
> > >
> > > D4L data4life gGmbH
> > >
> > > Charlottenstraße 109
> > >
> > > 14467 Potsdam, Germany
> > >
> > > www.data4life.care
> > >
> > > Amtsgericht Potsdam, HRB 30667
> > >
> > > Managing Director: Christian-Cornelius Weiß
> > >
> > >
> > > We are Data4Life. We've been certified by the German Federal Office for
> > > Information Security (BSI) in accordance with ISO 27001 on the basis of
> > > "IT-Grundschutz".
> > >
> > >
> > > Climate neutral since 2019 <https://wtca.lfca.earth/e/data4life>
> >
