Re: Prefix + Suffix Wildcards in Searches

Chris Dempsey Tue, 30 Jun 2020 04:28:57 -0700

@Erick,

You've got the idea. Basically, users can attach zero or more tags (*that
they create*) to a document. As an example, say they've created these tags
(just a small subset of the total):

   - paid
   - invoice-paid
   - ms-reply-unpaid-2019
   - credit-ms-reply-unpaid
   - ms-reply-paid-2019
   - ms-reply-paid-2020

and attached them in various combinations to documents. They then want to
find all documents by tag that don't contain the characters "paid" anywhere
in the tag, don't contain tags with the characters "ms-reply-unpaid", but
do include documents tagged with the characters "ms-reply-paid".

The obvious suggestion would be to have users supply the entire tag as a
condition (i.e. don't let them do a "contains"), which would eliminate the
wildcards. That would work, but unfortunately we have customers with (*not
joking*) over 100K different tags (*why anyone has a taxonomy like that is
a different issue*). I'm willing to accept that in our scenario n-grams
might be the Solr-based answer (the other being to change what "contains"
means within our application) but thought I'd check I hadn't overlooked any
other options. :)
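
To make the n-gram idea concrete, here is a minimal Python sketch of how
character n-grams turn a "contains" match into exact term lookups. It is
only an illustration, not Solr's implementation: the class TagNgramIndex,
the helper ngrams, and the gram size of 3 are all assumptions made up for
this example. In Solr the grams would presumably come from an index-time
filter such as NGramFilterFactory.

```python
from collections import defaultdict


def ngrams(text, n):
    """Return the set of all character n-grams of length n in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class TagNgramIndex:
    """Toy inverted index from character n-grams to tags (hypothetical)."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)  # gram -> tags containing it
        self.tags = set()

    def add(self, tag):
        self.tags.add(tag)
        for gram in ngrams(tag, self.n):
            self.postings[gram].add(tag)

    def contains(self, fragment):
        """Return every indexed tag that contains fragment as a substring."""
        grams = ngrams(fragment, self.n)
        if not grams:
            # Fragment is shorter than n, so no gram exists: scan all tags.
            return {t for t in self.tags if fragment in t}
        # Exact lookups per gram, then intersect the posting lists.
        candidates = set.intersection(
            *(self.postings.get(g, set()) for g in grams)
        )
        # All grams can occur without being contiguous, so verify.
        return {t for t in candidates if fragment in t}


index = TagNgramIndex(n=3)
for tag in [
    "paid",
    "invoice-paid",
    "ms-reply-unpaid-2019",
    "credit-ms-reply-unpaid",
    "ms-reply-paid-2019",
    "ms-reply-paid-2020",
]:
    index.add(tag)

print(sorted(index.contains("ms-reply-paid")))
print(sorted(index.contains("ms-reply-unpaid")))
```

The trade-off is the one mentioned above: every tag is stored under each
of its grams, so the index grows, but a "contains" no longer has to walk
every term.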

On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev <m...@apache.org> wrote:

> Hello, Chris.
> I suppose index time analysis can yield these terms:
> "paid","ms-reply-unpaid","ms-reply-paid", and thus let you avoid these
> expensive wildcard queries. Here's why it's worth avoiding them:
> https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam
>
> On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey <cdal...@gmail.com> wrote:
>
> > Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*) but
> > I'm looking into options for optimizing something like this:
> >
> > > fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR tag:*ms-reply-paid*
> >
> > It's probably not a surprise that we're seeing performance issues with
> > something like this. My understanding is that using the wildcard on both
> > ends forces a full-text index search. Something like the above can't take
> > advantage of something like the ReversedWildcardFilterFactory either. I believe
> > constructing `n-grams` is an option (*at the expense of index size*) but is
> > there anything I'm overlooking as a possible avenue to look into?
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
