Re: Multiword synonyms and term wildcards/substring matching

2021-03-02 Thread Martin Graney
Hi Alex

Thanks for the reply.
We are not using the 'copyField bucket' approach as it is inflexible. Our
textual fields are all multivalued dynamic fields, which allows us to craft
a list of `pf` (phrase fields) with associated weighting boosts that are
meant to be used in the search on a *per-collection* basis. This allows us
to have all of the textual fields indexed independently and then simply
change the query when we want to include/exclude a field from the search
without the need to reindex the entire collection. e/dismax makes this more
flexible approach possible.
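
As a rough illustration of the per-collection approach described above, the weighted field list can be kept in configuration and rendered into the e/dismax `qf`/`pf` parameters at query time. This is only a sketch: the collection names, field names, and boost values below are hypothetical, and applying the same list to both `qf` and `pf` is an assumption.

```python
from urllib.parse import urlencode

# Hypothetical per-collection weights; names and boosts are illustrative only.
COLLECTION_FIELDS = {
    "products": {"title_txt": 2.0, "description_txt": 0.5},
    "articles": {"headline_txt": 3.0, "body_txt": 1.0},
}

def weighted_fields(weights):
    """Render {'title_txt': 2.0} as the 'title_txt^2.0' syntax used by qf/pf."""
    return " ".join(f"{field}^{boost}" for field, boost in weights.items())

def edismax_params(collection, user_query):
    """Build request parameters for one collection without touching the index."""
    fields = weighted_fields(COLLECTION_FIELDS[collection])
    return {"defType": "edismax", "q": user_query, "qf": fields, "pf": fields}

params = edismax_params("products", "bread stick")
print(urlencode(params))
```

Dropping a field from the search then means editing the dictionary, not reindexing.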

I'll take a look at the Complex Phrase Query Parser and see if it is a good
fit. We use a lot of the e/dismax params though, such as `bf` (boost
functions), `bq` (boost queries), and `pf` (phrase fields), to influence the
relevance score.

FYI: We are using Solr 8.3.

On Tue, 2 Mar 2021 at 13:38, Alexandre Rafalovitch wrote:

> I admit to not fully understanding the examples, but the Complex Phrase
> Query Parser looks like something worth at least reviewing:
>
>
> https://lucene.apache.org/solr/guide/8_8/other-parsers.html#complex-phrase-query-parser
>
> Also, I did not see any mention of trying copyField to process the same
> content in different ways. If the copyField target is not stored, the
> overhead is not as large.
>
> Regards,
>     Alex
>
>
>
> On Tue., Mar. 2, 2021, 7:08 a.m. Martin Graney wrote:
>
> > Hi All
> >
> > I have been trying to add multi-word synonyms, using `sow=false`, to a
> > pre-existing system that pre-processes the phrase to wrap each term in
> > wildcards, e.g. `bread stick` => `*bread* *stick*`.
> >
> > I got the synonyms expansion working perfectly, after discovering the
> > `preserveOriginal` filter param, but then I needed to re-implement the
> > existing wildcard behaviour.
> > I tried using the edge-ngram filter, but found that searching for the
> > phrase `bread stick` with `q.op=AND` on a field containing the word
> > `breadstick` returns no results, as the content `breadstick` does not
> > _start with_ `stick`. The previous wildcard behaviour would return all
> > documents that contain the substrings `bread` AND `stick`, which is the
> > desired behaviour.
> > I tried using the ngram filter, but it does not support
> > `preserveOriginal`, and so loses a lot of relevance for exact matches.
> > It also produces matches that are far too broad: with `minGramSize=3`
> > and `maxGramSize=5` it creates 21 tokens from `breadstick`, which in
> > practice match essentially all of the documents. As a result, boosts
> > applied to other fields, such as 'in stock', push irrelevant documents
> > to the top.
> >
> > Finally, I tried to strip out ngrams entirely and use nested queries
> > with local params, a Solr feature that is not very well documented.
> > I created something like `q={!edismax sow=true v=$wildcards} OR {!edismax
> > sow=false v=$plain}` to effectively create a union of results, one with
> > multi-word synonym support and one with wildcard support.
> > But then I had to implement the other edismax params and immediately
> > stumbled.
> > Each query in production normally has a slew of `bf` and `bq` params,
> > and I cannot see a way to pass these into the nested queries using local
> > params. If I have 3 different `bf` params, how can I pass them into the
> > local param subqueries?
> >
> > Also, as the search in production is across multiple fields, I found
> > that passing `qf` to both subqueries using dereferencing failed, as the
> > parser saw it as a single field and threw a 'number format exception'.
> > i.e.
> > q={!edismax sow=true v=$tw tf=$tqf} OR {!edismax sow=false v=$tp tf=$tqf}
> > $tw=*bread* *stick*
> > $tp=bread stick
> > $tqf=title^2 description^0.5
> >
> > As you can guess, I have spent quite some time going down this rabbit
> > hole in my attempt to reproduce the existing desired functionality
> > alongside multiterm synonyms.
> > Is there a way to get multiterm synonyms working with substring matching
> > effectively?
> > I am sure there is a much simpler way that I am missing than all of my
> > attempts so far.
> >
> > Solr: 8.3
> >
> > Thanks
> > Martin Graney


-- 
Martin Graney
Lead Developer

http://sooqr.com <http://www.sooqr.com/>
http://twitter.com/sooqrcom

Office: +31 (0) 88 766 7700
Mobile: +31 (0) 64 660 8543

-- 
 <https://www.linkedin.com/company/sooqr-com/>


Multiword synonyms and term wildcards/substring matching

2021-03-02 Thread Martin Graney
Hi All

I have been trying to add multi-word synonyms, using `sow=false`, to a
pre-existing system that pre-processes the phrase to wrap each term in
wildcards, e.g. `bread stick` => `*bread* *stick*`.
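
For reference, the pre-processing step described above amounts to something like the following sketch. The function name is hypothetical, and splitting on whitespace is an assumption about how terms are delimited.

```python
def wrap_in_wildcards(phrase):
    """Wrap each whitespace-separated term in a leading and trailing '*'."""
    return " ".join(f"*{term}*" for term in phrase.split())

print(wrap_in_wildcards("bread stick"))  # *bread* *stick*
```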

I got the synonyms expansion working perfectly, after discovering the
`preserveOriginal` filter param, but then I needed to re-implement the
existing wildcard behaviour.
I tried using the edge-ngram filter, but found that searching for the
phrase `bread stick` with `q.op=AND` on a field containing the word
`breadstick` returns no results, as the content `breadstick` does not
_start with_ `stick`. The previous wildcard behaviour would return all
documents that contain the substrings `bread` AND `stick`, which is the
desired behaviour.
I tried using the ngram filter, but it does not support `preserveOriginal`,
and so loses a lot of relevance for exact matches. It also produces matches
that are far too broad: with `minGramSize=3` and `maxGramSize=5` it creates
21 tokens from `breadstick`, which in practice match essentially all of the
documents. As a result, boosts applied to other fields, such as 'in stock',
push irrelevant documents to the top.
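
The token counts and the edge-ngram miss described above can be reproduced with a small pure-Python emulation. This is a sketch of the general n-gram idea, not Solr's filter implementation, and the edge-ngram maximum size of 10 is an assumption chosen for illustration.

```python
def ngrams(text, min_size, max_size):
    """All substrings of length min_size..max_size, as an n-gram filter would emit."""
    return [text[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(text) - n + 1)]

def edge_ngrams(text, min_size, max_size):
    """Prefixes only, as a front-anchored edge n-gram filter would emit."""
    return [text[:n] for n in range(min_size, min(max_size, len(text)) + 1)]

grams = ngrams("breadstick", 3, 5)
edges = edge_ngrams("breadstick", 3, 10)

print(len(grams))        # 21
print("stick" in grams)  # True
print("bread" in edges)  # True
print("stick" in edges)  # False: 'breadstick' does not start with 'stick'
```

With `q.op=AND`, the missing `stick` prefix is enough to exclude the document, while the 21 short grams explain why the plain n-gram field matches so broadly.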

Finally, I tried to strip out ngrams entirely and use nested queries with
local params, a Solr feature that is not very well documented.
I created something like `q={!edismax sow=true v=$wildcards} OR {!edismax
sow=false v=$plain}` to effectively create a union of results, one with
multi-word synonym support and one with wildcard support.
But then I had to implement the other edismax params and immediately
stumbled.
Each query in production normally has a slew of `bf` and `bq` params, and I
cannot see a way to pass these into the nested queries using local params.
If I have 3 different `bf` params, how can I pass them into the local param
subqueries?

Also, as the search in production is across multiple fields I found passing
`qf` to both subqueries using dereferencing failed, as the parser saw it as
a single field and threw a 'number format exception'.
i.e.
q={!edismax sow=true v=$tw tf=$tqf} OR {!edismax sow=false v=$tp tf=$tqf}
$tw=*bread* *stick*
$tp=bread stick
$tqf=title^2 description^0.5
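
To make the request shape above concrete, here is a sketch of how such a request could be assembled on the client side. It assumes the `$` prefix in the listing is notation only and that the dereferenced values are sent as ordinary request parameters under the bare names `tw`, `tp`, and `tqf`. The `q` string is copied verbatim from the listing, including its `tf` local param, and the function name is hypothetical. It only illustrates how the request is built and does not address the parsing failure itself.

```python
from urllib.parse import urlencode, parse_qs

# Copied verbatim from the message above.
UNION_QUERY = ("{!edismax sow=true v=$tw tf=$tqf} OR "
               "{!edismax sow=false v=$tp tf=$tqf}")

def union_request(wildcard_query, plain_query, fields):
    """Request params: one q that dereferences three ordinary parameters."""
    return {
        "q": UNION_QUERY,
        "tw": wildcard_query,   # referenced as $tw inside q
        "tp": plain_query,      # referenced as $tp inside q
        "tqf": fields,          # referenced as $tqf inside q
    }

params = union_request("*bread* *stick*", "bread stick",
                       "title^2 description^0.5")
query_string = urlencode(params)
print(query_string)
```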

As you can guess, I have spent quite some time going down this rabbit hole
in my attempt to reproduce the existing desired functionality alongside
multiterm synonyms.
Is there a way to get multiterm synonyms working with substring matching
effectively?
I am sure there is a much simpler way that I am missing than all of my
attempts so far.

Solr: 8.3

Thanks
Martin Graney



send

2021-02-23 Thread Martin Graney