Re: Supporting multiple indexes in one collection
Did the test a while back; revisiting this again. In standalone Solr we have seen queries take more time when the data exists in 2 shards. That's the main reason this test was done. If anyone has experience with this, I'd like to hear about it.

On Tue, Jun 30, 2020 at 11:50 PM Jörn Franke wrote:
> How many documents?
> The real difference was only a couple of ms?
Re: Supporting multiple indexes in one collection
How many documents? The real difference was only a couple of ms?

> Am 01.07.2020 um 07:34 schrieb Raji N:
> Had 2 indexes in 2 separate shards in one collection and had exact same data published with composite router with a prefix.
Re: Supporting multiple indexes in one collection
Had 2 indexes in 2 separate shards in one collection, and had the exact same data published with the composite router with a prefix. Disabled all caches. Issued the same query, which is a small query with a q parameter and an fq parameter. The number of queries that got executed (with the same threads, run for the same time) was higher in the 2-indexes-in-2-separate-shards case. The 90th percentile response time was also a few ms better.

Thanks,
Raji

On Tue, Jun 30, 2020 at 10:06 PM Jörn Franke wrote:
> What did you test? Which queries? What were the exact results in terms of time?
Re: Supporting multiple indexes in one collection
What did you test? Which queries? What were the exact results in terms of time?

> Am 30.06.2020 um 22:47 schrieb Raji N:
> Trying to place multiple smaller indexes in one collection.
Supporting multiple indexes in one collection
Hi,

We are trying to place multiple smaller indexes in one collection (as we have read that SolrCloud performance degrades as the number of collections increases). We are exploring two ways:

1) Placing each index on a single shard of a collection. In this case, placing documents for a single index is manual, and automatic rebalancing is not done by Solr.

2) Using Solr's composite router with a prefix. In this case Solr doesn't place all the docs with the same prefix in one shard, so searches become distributed, but shard rebalancing is taken care of by Solr.

We did a small perf test with both of these setups and saw that performance for the first case (placing an index explicitly on a shard) is better.

Has anyone done anything similar? Can you please share your experience?

Thanks,
Raji
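For reference, a minimal client-side sketch of what option 2's document routing looks like (the index and field names here are made up; the prefix!id convention itself is Solr's compositeId scheme):

```python
import json

def route_docs(index_name, docs):
    """Prefix every doc id with the index name so Solr's compositeId
    router can use it as the routing key, e.g. "indexA!1"."""
    return [{**d, "id": f"{index_name}!{d['id']}"} for d in docs]

# Hypothetical update payload for an index named "indexA".
payload = json.dumps(route_docs("indexA", [{"id": "1", "title_t": "hello"}]))
```

At query time the same prefix can be passed in the `_route_` parameter to direct the request to the shard(s) holding that prefix.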
RE: Query in quotes cannot find results
Thank you Walter, I'll look into the "mm" (minimum match) parameter.

Best Regards,
Vadim Permakoff

-----Original Message-----
From: Walter Underwood
Sent: Tuesday, June 30, 2020 2:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Query in quotes cannot find results

This is exactly why the “mm” (minimum match) parameter exists, to reduce the number of hits with fewer matches. Think of it as a sliding scale between OR and AND.
Re: Query in quotes cannot find results
This is exactly why the “mm” (minimum match) parameter exists, to reduce the number of hits with fewer matches. Think of it as a sliding scale between OR and AND.

On the other hand, I don’t usually worry about hits with fewer matches. Those are not on the first page, so I don’t care. In general, you can either optimize more related hits or optimize fewer unrelated hits. Everything you do to reduce the unrelated hits will cause some related hits to not match.

Also, do all of your tuning with real user queries from logs. Making up queries for testing will lead to fixing problems that never occur in production and to missing problems that do occur.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Jun 30, 2020, at 11:07 AM, Permakoff, Vadim wrote:
> Hi Erick,
> Thank you for the suggestion.
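Walter's mm suggestion can be sketched like this (the parameter values are hypothetical; the rounding rule mimics Solr's documented behavior of rounding a fractional requirement down):

```python
# Hypothetical edismax parameters: with mm=3<75%, queries of up to three
# terms require every term (effectively AND), while longer queries only
# require 75% of them — a sliding scale between AND and OR.
params = {
    "defType": "edismax",
    "q": "expand the methods for mailing cancellation",
    "mm": "3<75%",
}

def required_terms(num_terms, threshold=3, pct=0.75):
    """Mimic the mm rule above; the fractional requirement rounds down."""
    return num_terms if num_terms <= threshold else int(num_terms * pct)
```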
RE: Query in quotes cannot find results
Hi Walter,
I'm with you; sometimes the stopwords are very important. A few years back, just for fun, I did a Solr demo for Wikipedia search where you can see that nothing is removed:
http://www.softcorporation.com/lab/solr/wiki/?sq=to+be+or+not+to+be
But with enterprise search you are sometimes better off removing the stopwords; I replied to Erick explaining why.
My question is not "Should we remove the stopwords?" My question is: "Apparently synonyms with spaces are not working if we are removing the stopwords. Is there a way to fix it, or is there a Jira for it?"

Best Regards,
Vadim Permakoff

-----Original Message-----
From: Walter Underwood
Sent: Tuesday, June 30, 2020 12:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Query in quotes cannot find results

Removing stopwords is a dumb requirement.
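The position shift described in this thread can be illustrated with a toy model (this is not Lucene's actual code, just a sketch of how a dropped stopword leaves a position gap that a slop-0 phrase cannot bridge):

```python
# Toy analyzer: a StopFilter drops "the" but keeps its position
# increment, so the surviving terms are no longer adjacent.
def term_positions(text, stopwords=frozenset({"the", "a", "an", "to", "for"})):
    return {tok: pos for pos, tok in enumerate(text.lower().split())
            if tok not in stopwords}

doc = term_positions("to expand the methods for mailing cancellation")

# "expand" sits at position 1 and "methods" at position 3.
gap = doc["methods"] - doc["expand"] - 1
exact_phrase_matches = gap == 0    # a slop-0 phrase query needs adjacency
sloppy_phrase_matches = gap <= 1   # "expand the methods"~1 bridges the gap
```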
RE: Query in quotes cannot find results
Hi Erick,
Thank you for the suggestion, I should have added it. Actually, before asking this question here, I tried adding and removing the FlattenGraphFilterFactory, plus other variations, like expand / not expand, autoGeneratePhraseQueries / not autoGeneratePhraseQueries; it just does not work with this particular example. You can try it yourself.

Regarding removing the stopwords, I agree there are many cases where you don't want to remove them, but there is one very compelling case where you do.

Imagine you have one document with the following text:
1. "to expand the methods for mailing cancellation"
And another document with the text:
2. "to expand methods for mailing cancellation"

The user query is (without quotes): q=expand the methods for mailing cancellation
I don't want to bring back all the documents with q.op=OR; it will find too many unrelated documents, so I want to search with q.op=AND. Unfortunately, document 2 will not be found, as it has no stop word "the" in it.
What should I do now?

Best Regards,
Vadim Permakoff

-----Original Message-----
From: Erick Erickson
Sent: Tuesday, June 30, 2020 12:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Query in quotes cannot find results

Well, the first thing is that you haven’t included FlattenGraphFilterFactory in the index analysis chain, see:
https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#synonym-graph-filter
IDK whether that actually pertains, but I’d reindex with that included before pursuing.

Second, “I have a requirement to remove the stopwords”. Why? Who thinks it’s necessary? Is there any evidence for this, or any use case that shows it _is_ necessary? Removing stopwords became common in the long-ago days when memory and disk capacity were vastly more constrained than now. At this point, I require proof that it’s _necessary_ to remove them before accepting this kind of requirement.

There are situations where removing stopwords is worth the difficulty it causes. But I’ve seen far too many unnecessary requirements to let that one pass without pushing back ;).

And you can hack around this by adding slop to the phrase; perhaps you can get “good enough” results by adding one slop for every stopword, i.e. if the input is “expand the methods”, detect that there’s one stopword and change it to “expand the methods”~1. That’ll introduce other problems of course.

Best,
Erick
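Erick's slop workaround can be automated on the client side; a sketch, assuming a small illustrative stopword list:

```python
# Illustrative stopword list — in practice this should mirror stopwords.txt.
STOPWORDS = {"a", "an", "and", "the", "to", "for", "of", "in", "on"}

def phrase_with_slop(phrase):
    """Quote the phrase and append ~N, where N is the number of stopwords,
    so it can still match documents whose stopwords were removed at index
    time (Erick's one-slop-per-stopword hack)."""
    n = sum(1 for w in phrase.split() if w.lower() in STOPWORDS)
    return f'"{phrase}"' + (f"~{n}" if n else "")
```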
Re: How to determine why solr stops running?
Hi, Maybe https://github.com/sematext/solr-diagnostics can be of use? Otis -- Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch Consulting Support Training - http://sematext.com/ On Mon, Jun 29, 2020 at 3:46 PM Erick Erickson wrote: > Really look at your cache size settings. > > This is to eliminate this scenario: > - your cache sizes are very large > - when you looked and the memory was 9G, you also had a lot of cache > entries > - there was a commit, which threw out the old cache and reduced your cache > size > > This is frankly kind of unlikely, but worth checking. > > The other option is that you haven’t been hitting OOMs at all and that’s a > complete > red herring. Let’s say in actuality, you only need an 8G heap or even > smaller. By > overallocating memory garbage will simply accumulate for a long time and > when it > is eventually collected, _lots_ of memory will be collected. > > Another rather unlikely scenario, but again worth checking. > > Best, > Erick > > > On Jun 29, 2020, at 3:27 PM, Ryan W wrote: > > > > On Mon, Jun 29, 2020 at 3:13 PM Erick Erickson > > wrote: > > > >> ps aux | grep solr > >> > > > > [solr@faspbsy0002 database-backups]$ ps aux | grep solr > > solr 72072 1.6 33.4 22847816 10966476 ? 
Sl 13:35 1:36 java > > -server -Xms16g -Xmx16g -XX:+UseG1GC -XX:+ParallelRefProcEnabled > > -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:+UseLargePages > > -XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails > > -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps > > -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime > > -Xloggc:/opt/solr/server/logs/solr_gc.log -XX:+UseGCLogFileRotation > > -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M > > -Dsolr.log.dir=/opt/solr/server/logs -Djetty.port=8983 -DSTOP.PORT=7983 > > -DSTOP.KEY=solrrocks -Duser.timezone=UTC -Djetty.home=/opt/solr/server > > -Dsolr.solr.home=/opt/solr/server/solr -Dsolr.data.home= > > -Dsolr.install.dir=/opt/solr > > -Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf > > -Xss256k -Dsolr.jetty.https.port=8983 -Dsolr.log.muteconsole > > -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 > /opt/solr/server/logs > > -jar start.jar --module=http > > > > > > > >> should show you all the parameters Solr is running with, as would the > >> admin screen. You should see something like: > >> > >> -XX:OnOutOfMemoryError=your_solr_directory/bin/oom_solr.sh > >> > >> And there should be some logs laying around if that was the case > >> similar to: > >> $SOLR_LOGS_DIR/solr_oom_killer-$SOLR_PORT-$NOW.log > >> > > > > This log is not being written, even though in the oom_solr.sh it does > > appear a solr_oom_killer-$SOLR_PORT-$NOW.log should be written to the > logs > > directory, but it isn't. There are some log files in > /opt/solr/server/logs, > > and they are indeed being written to. There are fresh entries in the > logs, > > but no sign of any problem. If I grep for oom in the logs directory, the > > only references I see are benign... just a few entries that list all the > > flags, and oom_solr.sh is among the settings visible in the entry. And > > someone did a search for "Mushroom," so there's another instance of oom > > from that search. 
> > > > > > As for memory, It Depends (tm). There are configurations > >> you can make choices about that will affect the heap requirements. > >> You can’t really draw comparisons between different projects. Your > >> Drupal + Solr app has how many documents? Indexed how? Searched > >> how? .vs. this one. > >> > >> The usual suspect for configuration settings that are responsible > >> include: > >> > >> - filterCache size too large. Each filterCache entry is bounded by > >> maxDoc/8 bytes. I’ve seen people set this to over 1M… > >> > >> - using non-docValues for fields used for sorting, grouping, function > >> queries > >> or faceting. Solr will uninvert the field on the heap, whereas if you > have > >> specified docValues=true, the memory is out in OS memory space rather > than > >> heap. > >> > >> - People just putting too many docs in a collection in a single JVM in > >> aggregate. > >> All replicas in the same instance are using part of the heap. > >> > >> - Having unnecessary options on your fields, although that’s more MMap > >> space than > >> heap. > >> > >> The problem basically is that all of Solr’s access is essentially > random, > >> so for > >> performance reasons lots of stuff has to be in memory. > >> > >> That said, Solr hasn’t been as careful as it should be about using up > >> memory, > >> that’s ongoing. > >> > >> If you really want to know what’s using up memory, throw a heap analysis > >> tool > >> at it. That’ll give you a clue what’s hogging memory and you can go from > >> there. > >> > >>> On Jun 29, 2020, at 1:48 PM, David Hastings < > >> hastings.recurs...@gmail.com> wrote: > >>> > >>> little nit picky note here, use 31gb, never 32. > >>> > >>> On Mon, Jun 29, 2020 at
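Erick's filterCache point above can be put in numbers; a back-of-envelope sketch with made-up index and cache sizes:

```python
def filter_cache_bytes(max_doc, entries):
    """Worst-case heap held by the filterCache: each entry can cost up to
    maxDoc/8 bytes (one bit per document in the index)."""
    return (max_doc // 8) * entries

# e.g. a 100M-doc index with filterCache size=512 can pin ~6 GB of heap.
heap = filter_cache_bytes(100_000_000, 512)
```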
Re: Query in quotes cannot find results
Removing stopwords is a dumb requirement. “Doctor, it hurts when I shove hedgehogs up my arse.” Part of our job as search engineers is to solve the real problem, not implement a pile of requirements from people who don’t understand how search works.

Here is an article I wrote 13 years ago about why we didn’t remove stopwords at Netflix.

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Jun 30, 2020, at 8:56 AM, Permakoff, Vadim wrote:
>
> Hi Erik,
> That's what I did in the past, but this is an enterprise search and I have a requirement to remove the stopwords.
> To have both features I can add synonyms in the front-end application. I know it will work, but I need a justification for why I have to do it in the application, as it is an additional effort.
> I thought there was a bug for such a case to which I can refer, because according to the documentation it should work, right?
> Anyway, there is more to it. If I add the same synonym processing to the indexing part, i.e. the configuration will be like this:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>   </analyzer>
> </fieldType>
>
> The analysis shows the parsing is matching now for the indexing and querying paths, but the exact-match result still cannot be found! This is weird.
> Any thoughts?
>
> Best Regards,
> Vadim Permakoff
>
> -----Original Message-----
> From: Erick Erickson
> Sent: Monday, June 29, 2020 10:19 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query in quotes cannot find results
>
> Looks like you’re removing stopwords. Stopwords cause issues like this with the positions being off.
>
> It’s becoming more and more common to _NOT_ remove stopwords, is that an option?
> > > > Best, > Erick > >> On Jun 29, 2020, at 7:32 PM, Permakoff, Vadim >> wrote: >> >> Hi Shawn, >> Many thanks for the response, I checked the field and it is correct. Let's >> call it _text_ to make it easier. >> I believe the parsing is also correct, please see below: >> - Query without quotes (works): >> "querystring":"expand the methods", >> "parsedquery":"(PhraseQuery(_text_:\"blow up\") _text_:expand) >> _text_:methods", >> >> - Query with quotes (does not work): >> "querystring":"\"expand the methods\"", >> "parsedquery":"SpanNearQuery(spanNear([spanOr([spanNear([_text_:blow, >> _text_:up], 0, true), _text_:expand]), _text_:methods], 0, true))", >> >> The document has text: >> "to expand the methods for mailing cancellation" >> >> The analysis on this field shows that all words are present in the index and >> the query, the order is also correct, but the word "methods" in moved one >> position, I guess that's why the result is not found. >> >> Best Regards, >> Vadim Permakoff >> >> >> >> >> -Original Message- >> From: Shawn Heisey >> Sent: Monday, June 29, 2020 6:28 PM >> To: solr-user@lucene.apache.org >> Subject: Re: Query in quotes cannot find results >> >> On 6/29/2020 3:34 PM, Permakoff, Vadim wrote: >>> The basic query q=expand the methods <<< finds the document, >>> the query (in quotes) q="expand the methods" <<< cannot find the document >>> >>> Am I doing something wrong, or is it known bug (I saw similar issues >>> discussed in the past, but not for exact match query) and if yes - what is >>> the Jira for it? >> >> The most helpful information will come from running both queries with debug >> enabled, so you can see how the query is parsed. If you add a parameter >> "debugQuery=true" to the URL, then the response should include the parsed >> query. Compare those, and see if you can tell what the differences are. >> >> One of the most common problems for queries like this is that you're not >> searching the field that you THINK you're searching. 
I don't know whether >> this is the problem, I just mention it because it is a common error. >> >> Thanks, >> Shawn >> >> >> >> This email is intended solely for the recipient. It may contain privileged, >> proprietary or confidential information or material. If you are not the >> intended recipient, please delete this email and any attachments and notify >> the sender of the error. >
Re: Query in quotes cannot find results
Well, the first thing is that you haven’t include FlattenGraphFilterFactory in the index analysis chain, see: https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#synonym-graph-filter. IDK whether that actually pertains, but I’d reindex with that included before pursuing. Second, “I have a requirement to remove the stopwords”. Why? Who thinks it’s necessary? Is there any evidence for this or any use-case that shows it _is_ necessary? Removing stopwords became common in the long-ago days when memory and disk capacity were vastly more constrained than now. At this point, I require proof that it’s _necessary_ to remove them before accepting this kind of requirement. There are situations where removing stopwords is worth the difficulty it causes. But I’ve seen far too many unnecessary requirements to let that one pass without pushing back ;). And you can hack around this by adding slop to the phrase, perhaps you can get “good enough” results by adding one slop for every stopword, i.e. if the input is “expand the methods”, detect that there’s one stopword and change it to “expand the methods”~1. That’ll introduce other problems of course. Best, Erick > On Jun 30, 2020, at 11:56 AM, Permakoff, Vadim > wrote: > > Hi Erik, > That's what I did in the past, but this is an enterprise search and I have a > requirement to remove the stopwords. > To have both features I can add synonyms in the front-end application, I know > it will work, but I need a justification why I have to do it in the > application as it is an additional effort. > I thought there is a bug for such case to which I can refer, because > according to documentation it should work, right? > Anyway, there is more to it. If I'll add the same synonym processing to the > indexing part, i.e. 
the configuration will be like this: > > positionIncrementGap="100" autoGeneratePhraseQueries="true"> > > > ignoreCase="true"/> > words="stopwords.txt"/> > > > > > ignoreCase="true" expand="true"/> > words="stopwords.txt"/> > > > > > The analysis shows the parsing is matching now for indexing and querying > path, but the exact match result still cannot be found! This is weird. > Any thoughts? > > Best Regards, > Vadim Permakoff > > > -Original Message- > From: Erick Erickson > Sent: Monday, June 29, 2020 10:19 PM > To: solr-user@lucene.apache.org > Subject: Re: Query in quotes cannot find results > > Looks like you’re removing stopwords. Stopwords cause issues like this with > the positions being off. > > It’s becoming more and more common to _NOT_ remove stopwords, is that an > option? > > > > Best, > Erick > >> On Jun 29, 2020, at 7:32 PM, Permakoff, Vadim >> wrote: >> >> Hi Shawn, >> Many thanks for the response, I checked the field and it is correct. Let's >> call it _text_ to make it easier. >> I believe the parsing is also correct, please see below: >> - Query without quotes (works): >> "querystring":"expand the methods", >> "parsedquery":"(PhraseQuery(_text_:\"blow up\") _text_:expand) >> _text_:methods", >> >> - Query with quotes (does not work): >> "querystring":"\"expand the methods\"", >> "parsedquery":"SpanNearQuery(spanNear([spanOr([spanNear([_text_:blow, >> _text_:up], 0, true), _text_:expand]), _text_:methods], 0, true))", >> >> The document has text: >> "to expand the methods for mailing cancellation" >> >> The analysis on this field shows that all words are present in the index and >> the query, the order is also correct, but the word "methods" in moved one >> position, I guess that's why the result is not found. 
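Erick's slop workaround can be applied client-side before sending the query. A minimal sketch under stated assumptions: the stopword list here is an illustrative subset, and the helper is hypothetical, not a Solr API:

```python
STOPWORDS = {"a", "an", "and", "of", "the", "to"}  # illustrative subset

def phrase_with_slop(phrase: str) -> str:
    """Quote a phrase and add one unit of slop per stopword, so that
    stopword removal at index time does not break exact-phrase matches."""
    slop = sum(1 for word in phrase.split() if word.lower() in STOPWORDS)
    return f'"{phrase}"~{slop}' if slop else f'"{phrase}"'

phrase_with_slop("expand the methods")  # '"expand the methods"~1'
```

As Erick notes, this introduces other problems: the slop also admits matches with unrelated words in the gap.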
RE: Query in quotes cannot find results
Hi Erik, That's what I did in the past, but this is an enterprise search and I have a requirement to remove the stopwords. To have both features I can add synonyms in the front-end application, I know it will work, but I need a justification why I have to do it in the application as it is an additional effort. I thought there is a bug for such case to which I can refer, because according to documentation it should work, right? Anyway, there is more to it. If I'll add the same synonym processing to the indexing part, i.e. the configuration will be like this: The analysis shows the parsing is matching now for indexing and querying path, but the exact match result still cannot be found! This is weird. Any thoughts? Best Regards, Vadim Permakoff -Original Message- From: Erick Erickson Sent: Monday, June 29, 2020 10:19 PM To: solr-user@lucene.apache.org Subject: Re: Query in quotes cannot find results Looks like you’re removing stopwords. Stopwords cause issues like this with the positions being off. It’s becoming more and more common to _NOT_ remove stopwords, is that an option? Best, Erick > On Jun 29, 2020, at 7:32 PM, Permakoff, Vadim > wrote: > > Hi Shawn, > Many thanks for the response, I checked the field and it is correct. Let's > call it _text_ to make it easier. 
> I believe the parsing is also correct, please see below: > - Query without quotes (works): >"querystring":"expand the methods", >"parsedquery":"(PhraseQuery(_text_:\"blow up\") _text_:expand) > _text_:methods", > > - Query with quotes (does not work): >"querystring":"\"expand the methods\"", >"parsedquery":"SpanNearQuery(spanNear([spanOr([spanNear([_text_:blow, > _text_:up], 0, true), _text_:expand]), _text_:methods], 0, true))", > > The document has text: > "to expand the methods for mailing cancellation" > > The analysis on this field shows that all words are present in the index and > the query, the order is also correct, but the word "methods" in moved one > position, I guess that's why the result is not found. > > Best Regards, > Vadim Permakoff > > > > > -Original Message- > From: Shawn Heisey > Sent: Monday, June 29, 2020 6:28 PM > To: solr-user@lucene.apache.org > Subject: Re: Query in quotes cannot find results > > On 6/29/2020 3:34 PM, Permakoff, Vadim wrote: >> The basic query q=expand the methods <<< finds the document, >> the query (in quotes) q="expand the methods" <<< cannot find the document >> >> Am I doing something wrong, or is it known bug (I saw similar issues >> discussed in the past, but not for exact match query) and if yes - what is >> the Jira for it? > > The most helpful information will come from running both queries with debug > enabled, so you can see how the query is parsed. If you add a parameter > "debugQuery=true" to the URL, then the response should include the parsed > query. Compare those, and see if you can tell what the differences are. > > One of the most common problems for queries like this is that you're not > searching the field that you THINK you're searching. I don't know whether > this is the problem, I just mention it because it is a common error. > > Thanks, > Shawn > > > > This email is intended solely for the recipient. It may contain privileged, > proprietary or confidential information or material. 
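The analyzer configuration in Vadim's message was mangled by the mail archive; only scattered attributes survived. Judging from those attributes and the surrounding discussion, it was presumably something like the following sketch. The tokenizer and filter class names are reconstructions, not taken from the original message:

```xml
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true"/>
    <!-- Note Erick's point: this index chain lacks FlattenGraphFilterFactory
         after the synonym graph filter -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```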
Re: Config files not replicating
Yes, The config is there and it works for me in live environment but not the new staging environment. On Tue, Jun 30, 2020 at 2:29 PM Erick Erickson wrote: > Did you put your auxiliary files in the > confFiles tag? E.g. from the page you referenced: > > schema.xml,stopwords.txt,elevate.xml > > Best, > Erick > > > On Jun 30, 2020, at 5:38 AM, Atita Arora wrote: > > > > Hi, > > > > We are using Solr 6.6.2 in the Master-Slave mode ( hot star of the > > discussion thread these days !!) and lately, I got into this weird issue > > that at each replication trigger my index gets correctly replicated but > my > > config changes are not replicated to my slaves. > > > > We are using referential properties i.e. my solrconfig.xml imports the > > different configs like requesthandler_config.xml, > > replication_handler_config.xml, etc which essentially means if going by > > solr doc ( > https://lucene.apache.org/solr/guide/6_6/index-replication.html) : > > > > Unlike the index files, where the timestamp is good enough to figure out > if > > they are identical, configuration files are compared against their > > checksum. The schema.xml files (on master and slave) are judged to be > > identical if their checksums are identical. > > > > The checksum of my solrconfig.xml would not vary, is it why my files > won't > > replicate? > > > > I already have another Master-Slave in a different environment working > with > > the same config version, so I don't smell any issue with the replication > > configuration. > > > > I have tried manual replication too but the files would not change. > > Maybe it is something weirdly trivial or stupid that I seem to be missing > > here, any pointers or ideas what else can I check? > > > > Thank you, > > > > Atita > >
Re: solrj - get metrics from all nodes
Use nodes=, not node= > On 30 Jun 2020, at 02:02, ChienHuaWang wrote: > > Hi Jan, > > Thanks for the response. > Could you please share more detail how you request the metric with multiple > nodes same time? > I do something as below, but only get one node info, the data I'm interested > most is, ex: CONTAINER.fs.totalSpace, CONTAINER.fs.usableSpace. etc.. > > > solr/admin/metrics?group=node&node=node1_name,node2_name > > > > > -- > Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
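Jan's fix, sketched as a tiny helper. The base URL and node names are placeholders; the point is the plural `nodes` parameter, which takes a comma-separated list, where the singular `node` only returns a single node's metrics:

```python
def metrics_url(base_url: str, nodes: list[str]) -> str:
    """Build a Metrics API request covering several nodes at once.

    Uses the plural "nodes" parameter -- the singular "node" (as in the
    original request in the thread) only yields one node's metrics.
    """
    return f"{base_url}/admin/metrics?group=node&nodes={','.join(nodes)}"

metrics_url("http://localhost:8983/solr", ["node1_name", "node2_name"])
# 'http://localhost:8983/solr/admin/metrics?group=node&nodes=node1_name,node2_name'
```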
Re: Config files not replicating
Did you put your auxiliary files in the confFiles tag? E.g. from the page you referenced: schema.xml,stopwords.txt,elevate.xml Best, Erick > On Jun 30, 2020, at 5:38 AM, Atita Arora wrote: > > Hi, > > We are using Solr 6.6.2 in the Master-Slave mode ( hot star of the > discussion thread these days !!) and lately, I got into this weird issue > that at each replication trigger my index gets correctly replicated but my > config changes are not replicated to my slaves. > > We are using referential properties i.e. my solrconfig.xml imports the > different configs like requesthandler_config.xml, > replication_handler_config.xml, etc which essentially means if going by > solr doc (https://lucene.apache.org/solr/guide/6_6/index-replication.html) : > > Unlike the index files, where the timestamp is good enough to figure out if > they are identical, configuration files are compared against their > checksum. The schema.xml files (on master and slave) are judged to be > identical if their checksums are identical. > > The checksum of my solrconfig.xml would not vary, is it why my files won't > replicate? > > I already have another Master-Slave in a different environment working with > the same config version, so I don't smell any issue with the replication > configuration. > > I have tried manual replication too but the files would not change. > Maybe it is something weirdly trivial or stupid that I seem to be missing > here, any pointers or ideas what else can I check? > > Thank you, > > Atita
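Erick's confFiles example also lost its markup in the archive. In the master's replication handler it would look roughly like this; file names beyond those mentioned in the thread are illustrative:

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <!-- Every auxiliary config file to push to slaves must be listed
         explicitly; imported files such as requesthandler_config.xml are
         NOT picked up just because solrconfig.xml references them. -->
    <str name="confFiles">schema.xml,stopwords.txt,elevate.xml,requesthandler_config.xml,replication_handler_config.xml</str>
  </lst>
</requestHandler>
```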
Re: Prefix + Suffix Wildcards in Searches
That’s not quite the question I was asking. Let’s take "…that don’t contain the characters ‘paid’ “. Start with the fact that no matter what the mechanics of implementing pre-and-post wildcards, something like *:* -tags:*paid* would exclude a doc with a tag of "credit-ms-reply-unpaid" or "ms-reply-unpaid-2019”. I really think this is an XY problem, You’re assuming that the solution is pre-and-post wildcards without a precise definition of the problem you’re trying to solve. Do they want to exclude things with the characters ‘ia’ or ‘id’? Or is their “unit of exclusion” the _entire_ word ‘paid’? Or can we define it so? Because if we can, what I wrote yesterday about using proper tokenization and phrase queries will work. If you break up all your tags in your example into individual tokens on non-alphanumerics, then your problem is much simpler, excluding “*paid*” becomes -tags:paid excluding “*ms-reply*” becomes -tags:”ms reply” trying to exclude “*ms-unpaid*” would _not_ exclude the doc with the tag "credit-ms-reply-unpaid” because “ms” and “unpaid” are not sequential. _Including_ is the same argument. BTW, this is where “positionIncrementGap” comes in. If they can define multiple tags in each document, phrase searching with a gap greater than 1 (100 is the usual default) _and_ each tag is an entry in a multiValued field, you can prevent matching across tags with phrase searches. Consider two tags “ms-tag1” and “paid-2019”. You don’t want “*tag1-paid*” to exclude this doc I’d imagine. The positionIncrementGap takes care of this in the phrase case. Remember that in this solution, the dashes aren’t included in each token. prefix only or postfix only would be a little tricky, one idea would be to copyField into an _untokenized_ field and search there in those cases. But even here, you need to determine precisely what you expect. What would “*d-2019” return? Would it return something ending in “ms-reply-paid-2019”? 
Alternatively, you wouldn’t need a copyField if you introduced special tokens before and after each tag, so indexing “invoice-paid” would index tokens: specialbegintoken invoice paid specialendtoken and searching for *paid becomes tag:“paid specialendtoken" Best, Erick > On Jun 30, 2020, at 7:29 AM, Chris Dempsey wrote: > > @Mikhail > > Thanks for the link! I'll read through that. > > On Tue, Jun 30, 2020 at 6:28 AM Chris Dempsey wrote: > >> @Erick, >> >> You've got the idea. Basically the users can attach zero or more tags (*that >> they create*) to a document. So as an example say they've created the >> tags (this example is just a small subset of the total tags): >> >> - paid >> - invoice-paid >> - ms-reply-unpaid-2019 >> - credit-ms-reply-unpaid >> - ms-reply-paid-2019 >> - ms-reply-paid-2020 >> >> and attached them in various combinations to documents. They then want to >> find all documents by tag that don't contain the characters "paid" anywhere >> in the tag, don't contain tags with the characters "ms-reply-unpaid", but >> do include documents tagged with the characters "ms-reply-paid". >> >> The obvious suggestion would be to have the users just use the entire tag >> (i.e. don't let them do a "contains") as a condition to eliminate the >> wildcards - which would work - but unfortunately we have customers with >> (*not >> joking*) over 100K different tags (*why have a taxonomy like that is yet >> a different issue*). I'm willing to accept that in our scenario n-grams >> might be the Solr-based answer (the other being to change what "contains" >> means within our application) but thought I'd check I hadn't overlooked any >> other options. :) >> >> On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev wrote: >> >>> Hello, Chris. >>> I suppose index time analysis can yield these terms: >>> "paid","ms-reply-unpaid","ms-reply-paid", and thus let you avoid these >>> expensive wildcard queries. 
Here's why it's worth to avoid them >>> >>> https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam >>> >>> On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey wrote: >>> Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*) >>> but I'm looking into options for optimizing something like this: > fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR tag:*ms-reply-paid* It's probably not a surprise that we're seeing performance issues with something like this. My understanding is that using the wildcard on both ends forces a full-text index search. Something like the above can't >>> take advantage of something like the ReverseWordFilter either. I believe constructing `n-grams` is an option (*at the expense of index size*) >>> but is there anything I'm overlooking as a possible avenue to look into? >>> >>> >>> -- >>> Sincerely yours >>> Mikhail Khludnev >>> >>
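Erick's tokenize-the-tags suggestion might look like this in the schema. This is a sketch, not from the thread: StandardTokenizer splits the tags on the hyphens, and the positionIncrementGap keeps phrase queries from matching across separate values of the multiValued field:

```xml
<field name="tags" type="tag_tokens" indexed="true" stored="true"
       multiValued="true"/>

<fieldType name="tag_tokens" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this in place, excluding "*paid*" becomes `-tags:paid` and excluding "*ms-reply*" becomes `-tags:"ms reply"`, as described in Erick's message.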
Re: Prefix + Suffix Wildcards in Searches
@Mikhail Thanks for the link! I'll read through that. On Tue, Jun 30, 2020 at 6:28 AM Chris Dempsey wrote: > @Erick, > > You've got the idea. Basically the users can attach zero or more tags (*that > they create*) to a document. So as an example say they've created the > tags (this example is just a small subset of the total tags): > >- paid >- invoice-paid >- ms-reply-unpaid-2019 >- credit-ms-reply-unpaid >- ms-reply-paid-2019 >- ms-reply-paid-2020 > > and attached them in various combinations to documents. They then want to > find all documents by tag that don't contain the characters "paid" anywhere > in the tag, don't contain tags with the characters "ms-reply-unpaid", but > do include documents tagged with the characters "ms-reply-paid". > > The obvious suggestion would be to have the users just use the entire tag > (i.e. don't let them do a "contains") as a condition to eliminate the > wildcards - which would work - but unfortunately we have customers with (*not > joking*) over 100K different tags (*why have a taxonomy like that is yet > a different issue*). I'm willing to accept that in our scenario n-grams > might be the Solr-based answer (the other being to change what "contains" > means within our application) but thought I'd check I hadn't overlooked any > other options. :) > > On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev wrote: > >> Hello, Chris. >> I suppose index time analysis can yield these terms: >> "paid","ms-reply-unpaid","ms-reply-paid", and thus let you avoid these >> expensive wildcard queries. Here's why it's worth to avoid them >> >> https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam >> >> On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey wrote: >> >> > Hello, all! 
I'm relatively new to Solr and Lucene (*using Solr 7.7.1*) >> but >> > I'm looking into options for optimizing something like this: >> > >> > > fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR >> > tag:*ms-reply-paid* >> > >> > It's probably not a surprise that we're seeing performance issues with >> > something like this. My understanding is that using the wildcard on both >> > ends forces a full-text index search. Something like the above can't >> take >> > advantage of something like the ReverseWordFilter either. I believe >> > constructing `n-grams` is an option (*at the expense of index size*) >> but is >> > there anything I'm overlooking as a possible avenue to look into? >> > >> >> >> -- >> Sincerely yours >> Mikhail Khludnev >> >
Re: Prefix + Suffix Wildcards in Searches
@Erick, You've got the idea. Basically the users can attach zero or more tags (*that they create*) to a document. So as an example say they've created the tags (this example is just a small subset of the total tags): - paid - invoice-paid - ms-reply-unpaid-2019 - credit-ms-reply-unpaid - ms-reply-paid-2019 - ms-reply-paid-2020 and attached them in various combinations to documents. They then want to find all documents by tag that don't contain the characters "paid" anywhere in the tag, don't contain tags with the characters "ms-reply-unpaid", but do include documents tagged with the characters "ms-reply-paid". The obvious suggestion would be to have the users just use the entire tag (i.e. don't let them do a "contains") as a condition to eliminate the wildcards - which would work - but unfortunately we have customers with (*not joking*) over 100K different tags (*why have a taxonomy like that is yet a different issue*). I'm willing to accept that in our scenario n-grams might be the Solr-based answer (the other being to change what "contains" means within our application) but thought I'd check I hadn't overlooked any other options. :) On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev wrote: > Hello, Chris. > I suppose index time analysis can yield these terms: > "paid","ms-reply-unpaid","ms-reply-paid", and thus let you avoid these > expensive wildcard queries. Here's why it's worth to avoid them > https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam > > On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey wrote: > > > Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*) > but > > I'm looking into options for optimizing something like this: > > > > > fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR > > tag:*ms-reply-paid* > > > > It's probably not a surprise that we're seeing performance issues with > > something like this. My understanding is that using the wildcard on both > > ends forces a full-text index search. 
Something like the above can't take > > advantage of something like the ReverseWordFilter either. I believe > > constructing `n-grams` is an option (*at the expense of index size*) but > is > > there anything I'm overlooking as a possible avenue to look into? > > > > > -- > Sincerely yours > Mikhail Khludnev >
Re: solrj - get metrics from all nodes
Hi Jan, Thanks for the response. Could you please share more detail on how you request metrics for multiple nodes at the same time? I do something like the request below, but only get one node's info. The data I'm most interested in is, e.g., CONTAINER.fs.totalSpace, CONTAINER.fs.usableSpace, etc. solr/admin/metrics?group=node&node=node1_name,node2_name -- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Config files not replicating
Hi, We are using Solr 6.6.2 in the Master-Slave mode ( hot star of the discussion thread these days !!) and lately, I got into this weird issue that at each replication trigger my index gets correctly replicated but my config changes are not replicated to my slaves. We are using referential properties i.e. my solrconfig.xml imports the different configs like requesthandler_config.xml, replication_handler_config.xml, etc which essentially means if going by solr doc (https://lucene.apache.org/solr/guide/6_6/index-replication.html) : Unlike the index files, where the timestamp is good enough to figure out if they are identical, configuration files are compared against their checksum. The schema.xml files (on master and slave) are judged to be identical if their checksums are identical. The checksum of my solrconfig.xml would not vary, is it why my files won't replicate? I already have another Master-Slave in a different environment working with the same config version, so I don't smell any issue with the replication configuration. I have tried manual replication too but the files would not change. Maybe it is something weirdly trivial or stupid that I seem to be missing here, any pointers or ideas what else can I check? Thank you, Atita
Re: About timeAllowed when using LTR
Hi, Dawn. It might make sense. Feel free to raise a jira, and "patches are welcome!". On Tue, Jun 30, 2020 at 10:33 AM Dawn wrote: > Hi: > > When using the LTR, open timeAllowed parameter, LTR feature of query may > call ExitableFilterAtomicReader. CheckAndThrow timeout detection. > > If a timeout occurs at this point, the exception ExitingReaderException is > thrown, resulting in a no-result return. > > Is it possible to accommodate this exception in LTR so that any result > that THE LTR has cleared will be returned instead of empty. > > This exception occurs in two places: > > 1. LTRScoringQuery. CreateWeight or createWeightsParallel. Here is the > loading stage, timeout directly end is acceptable. > > 2. ModelWeight.scorer. This is a stage that evaluates each Doc and can > catch the exception, end early, and return part of the result. -- Sincerely yours Mikhail Khludnev
About timeAllowed when using LTR
Hi: When using LTR with the timeAllowed parameter enabled, the query's LTR features may call ExitableFilterAtomicReader.checkAndThrow for timeout detection. If a timeout occurs at this point, an ExitingReaderException is thrown, resulting in no results being returned. Would it be possible to handle this exception inside LTR, so that whatever results the LTR has already processed are returned instead of an empty response? The exception occurs in two places: 1. LTRScoringQuery.createWeight or createWeightsParallel. This is the loading stage, so ending immediately on timeout is acceptable here. 2. ModelWeight.scorer. This stage evaluates each doc; it could catch the exception, end early, and return partial results.