Re: Supporting multiple indexes in one collection

2020-06-30 Thread Raji N
Did the test a while back; revisiting this again. In standalone Solr we have
seen queries take more time when the data exists in 2 shards. That's the main
reason this test was done. If anyone has experience with this, I'd like to
hear about it.



Re: Supporting multiple indexes in one collection

2020-06-30 Thread Jörn Franke
How many documents?
The real difference was only a couple of ms?



Re: Supporting multiple indexes in one collection

2020-06-30 Thread Raji N
Had 2 indexes in 2 separate shards in one collection, and had the exact same
data published with the composite router using a prefix. Disabled all caches.
Issued the same query, which is a small query with a q parameter and an fq
parameter. The number of queries that got executed (with the same threads,
run for the same time) was higher in the 2-indexes-in-2-separate-shards case.
The 90th-percentile response time was also a few ms better.

Thanks,
Raji



Re: Supporting multiple indexes in one collection

2020-06-30 Thread Jörn Franke
What did you test? Which queries? What were the exact results in terms of time?



Supporting multiple indexes in one collection

2020-06-30 Thread Raji N
Hi ,


Trying to place multiple smaller indexes in one collection (as we read that
SolrCloud performance degrades as the number of collections increases). We are
exploring two ways:


1) Placing each index on a single shard of a collection

   In this case, placing documents for a single index is manual, and automatic
rebalancing is not done by Solr.


2) Solr routing: composite router with a prefix

  In this case Solr doesn't place all the docs with the same prefix in one
shard, so searches become distributed. But shard rebalancing is taken care of
by Solr.


We did a small perf test with both these setups. We saw that performance for
the first case (placing an index explicitly on a shard) is better.


Has anyone done anything similar? Can you please share your experience?


Thanks,

Raji
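For reference, option 2 relies on Solr's compositeId router: everything before "!" in a document id is hashed to pick the shard, and a query can be restricted to the shard(s) owning that prefix with the _route_ parameter. A minimal sketch of the id and parameter shapes (the "invoices" prefix is invented for illustration; this only builds strings and does not contact Solr):

```python
# Sketch of compositeId routing: "prefix!id" document ids and the
# matching _route_ query parameter. The prefix is hashed by Solr
# (not here) to choose the shard range.

def routed_id(index_prefix: str, doc_id: str) -> str:
    """Build a compositeId so all docs of one logical index share a prefix."""
    return f"{index_prefix}!{doc_id}"

def routed_query_params(index_prefix: str, q: str) -> dict:
    """Query params restricting the search to the shard(s) owning the prefix."""
    return {"q": q, "_route_": f"{index_prefix}!"}

print(routed_id("invoices", "42"))                     # invoices!42
print(routed_query_params("invoices", "status:open"))
```

With ids built this way each logical index keeps a stable prefix, which is what makes per-prefix routing in option 2 possible.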


RE: Query in quotes cannot find results

2020-06-30 Thread Permakoff, Vadim
Thank you Walter, I'll look into the “mm” (minimum match) parameter.

Best Regards,
Vadim Permakoff



Re: Query in quotes cannot find results

2020-06-30 Thread Walter Underwood
This is exactly why the “mm” (minimum match) parameter exists, to reduce the 
number of hits with fewer matches. Think of it as a sliding scale between OR 
and AND.

On the other hand, I don’t usually worry about hits with fewer matches. Those 
are not on the first page, so I don’t care.

In general, you can either optimize more related hits or optimize fewer 
unrelated hits. Everything you do to reduce the unrelated hits will cause some 
related hits to not match. 

Also, do all of your tuning with real user queries from logs. Making up queries 
for testing will lead to fixing problems that never occur in production and to 
missing problems that do occur.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
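To make the sliding scale concrete, here is a rough model of how simple mm values map to a minimum clause count. It covers only plain integers and percentages; the full mm grammar (conditional "N<" clauses) is richer, so treat this as a sketch rather than Solr's implementation:

```python
def min_should_match(mm: str, num_clauses: int) -> int:
    """Simplified edismax mm: positive/negative integers and percentages.
    Negative values mean 'all but N' (or 'all but N percent')."""
    mm = mm.strip()
    if mm.endswith("%"):
        pct = int(mm[:-1])
        part = abs(pct) * num_clauses // 100      # percentages round down
        required = part if pct >= 0 else num_clauses - part
    else:
        n = int(mm)
        required = n if n >= 0 else num_clauses + n
    return max(0, min(num_clauses, required))

# 4-term query: mm=100% behaves like AND, mm=75% and mm=-1 both need 3 terms
print(min_should_match("100%", 4), min_should_match("75%", 4), min_should_match("-1", 4))
```

Tuning mm between these extremes is the "sliding scale between OR and AND" described above.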


RE: Query in quotes cannot find results

2020-06-30 Thread Permakoff, Vadim
Hi Walter,
I'm with you; sometimes the stopwords are very important. A few years back,
just for fun, I did a Solr demo for Wikipedia search; you can see nothing is
removed:
http://www.softcorporation.com/lab/solr/wiki/?sq=to+be+or+not+to+be

But with enterprise search, sometimes you are better off removing the
stopwords; I replied to Erick explaining why.
My question is not "Should we remove the stopwords?", my question is:
"Apparently the synonyms with spaces are not working if we are removing the
stopwords. Is there a way to fix it, or is there a Jira for it?"

Best Regards,
Vadim Permakoff


> -Original Message-
> From: Erick Erickson  
> Sent: Monday, June 29, 2020 10:19 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query in quotes cannot find results
> 
> Looks like you’re removing stopwords. Stopwords cause issues like this with 
> the positions being off.
> 
> It’s becoming more and more common to _NOT_ remove stopwords, is that an 
> option?
> 
> 
> 
> Best,
> Erick
> 
>> On Jun 29, 2020, at 7:32 PM, Permakoff, Vadim  
>> wrote:
>> 
>> Hi Shawn,
>> Many thanks for the response, I checked the field and it is correct. Let's 
>> call it _text_ to make it easier.
>> I believe the parsing is also correct, please see below:
>> - Query without quotes (works):
>>   "querystring":"expand the methods",
>>   "parsedquery":"(PhraseQuery(_text_:\"blow up\") _text_:expand) 
>> _text_:methods",
>> 
>> - Query with quotes (does not work):
>>   "querystring":"\"expand the methods\"",
>>   "parsedquery":"SpanNearQuery(spanNear([spanOr([spanNear([_text_:blow, 
>> _text_:up], 0, true), _text_:expand]), _text_:methods], 0, true))",
>> 
>> The document has text:
>> "to expand the methods for mailing cancellation"
>> 
>> The analysis on this field shows that all words are present in the index and 
>> the query, and the order is also correct, but the word "methods" is moved one 
>> position; I guess that's why the result is not found.
>> 
>> Best Regards,
>> Vadim Permakoff
>> 
>> 
>> 
>> 
>> -Original Message-
>> From: Shawn Heisey 
>> Sent: Monday, June 29, 2020 6:28 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Query in quotes cannot find results
>> 
>> On 6/29/2020 3:34 PM, Permakoff, Vadim wrote:
>>> The basic query q=expand the methods   <<< finds the document,
>>> the query (in quotes) q="expand the methods"   <<< cannot find the document
>>> 
>>> Am I doing something wrong, or is it known bug (I saw similar issues 
>>> discussed in the past, but not for exact match query) and if yes - what is 
>>> the Jira for it?
>> 
>> The most helpful information will come from running both queries with debug 
>> enabled, so you can see how the query is parsed.
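The position shift described above can be imitated offline. This toy model is not Lucene's actual matcher (and real query-side analysis also keeps position increments, which the synonym graph can disturb), but it shows why the one-position hole left by a removed stopword breaks a zero-slop phrase while slop=1 recovers it:

```python
from itertools import product

STOPWORDS = {"to", "the", "for"}  # toy stopword list, not stopwords.txt

def analyze(text):
    # Drop stopwords but keep each surviving token's original position,
    # like StopFilter preserving position increments.
    return [(tok, pos) for pos, tok in enumerate(text.lower().split())
            if tok not in STOPWORDS]

def phrase_match(tokens, terms, slop=0):
    # Naive span check: terms must appear in order, with at most `slop`
    # extra positions between the first and last term.
    pos = {t: [p for tok, p in tokens if tok == t] for t in terms}
    if any(not pos[t] for t in terms):
        return False
    for combo in product(*(pos[t] for t in terms)):
        if all(b > a for a, b in zip(combo, combo[1:])):
            if (combo[-1] - combo[0]) - (len(terms) - 1) <= slop:
                return True
    return False

doc = analyze("to expand the methods for mailing cancellation")
print(phrase_match(doc, ["expand", "methods"], slop=0))  # False: gap from "the"
print(phrase_match(doc, ["expand", "methods"], slop=1))  # True
```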

RE: Query in quotes cannot find results

2020-06-30 Thread Permakoff, Vadim
Hi Erick,
Thank you for the suggestion; I should have added it. Actually, before asking 
this question here, I tried adding and removing the FlattenGraphFilterFactory, 
plus other variations, like expand / not expand, autoGeneratePhraseQueries / 
not autoGeneratePhraseQueries - it just does not work with this particular 
example. You can try it yourself.

Regarding removing the stopwords, I agree there are many cases when you don't 
want to remove the stopwords, but there is one very compelling case when you 
want them removed.

Imagine you have one document with the following text: 
1. "to expand the methods for mailing cancellation" 
And another document with the text: 
2. "to expand methods for mailing cancellation"

The user query is (without quotes): q=expand the methods for mailing 
cancellation
I don't want to bring back all the documents with q.op=OR; it will find too 
many unrelated documents, so I want to search with q.op=AND. Unfortunately, 
document 2 will not be found, as it has no stop word "the" in it.
What should I do now?

Best Regards,
Vadim Permakoff


-Original Message-
From: Erick Erickson  
Sent: Tuesday, June 30, 2020 12:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Query in quotes cannot find results

Well, the first thing is that you haven’t included FlattenGraphFilterFactory in 
the index analysis chain, see: 
https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#synonym-graph-filter
IDK whether that actually pertains, but I’d reindex with that included 
before pursuing.

Second, “I have a requirement to remove the stopwords”. Why? Who thinks it’s 
necessary? Is there any evidence for this or any use-case that shows it _is_ 
necessary? Removing stopwords became common in the long-ago days when memory 
and disk capacity were vastly more constrained than now. At this point, I 
require proof that it’s _necessary_ to remove them before accepting this kind 
of requirement.

There are situations where removing stopwords is worth the difficulty it 
causes. But I’ve seen far too many unnecessary requirements to let that one 
pass without pushing back ;).

And you can hack around this by adding slop to the phrase, perhaps you can get 
“good enough” results by adding one slop for every stopword, i.e. if the input 
is “expand the methods”, detect that there’s one stopword and change it to 
“expand the methods”~1. That’ll introduce other problems of course.

Best,
Erick
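The workaround Erick describes above — one unit of slop per removed stopword — can be automated in front of Solr. A toy sketch (the stopword list and function name are made up for illustration, and should mirror the stopwords.txt actually in use):

```python
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by",
             "for", "if", "in", "into", "is", "it", "of", "on", "or",
             "the", "to"}

def phrase_with_stopword_slop(phrase: str) -> str:
    """Quote the phrase and add ~N slop, where N is the number of
    stopwords the analysis chain is expected to drop."""
    n = sum(1 for t in phrase.lower().split() if t in STOPWORDS)
    quoted = f'"{phrase}"'
    return f"{quoted}~{n}" if n else quoted

print(phrase_with_stopword_slop("expand the methods"))
# "expand the methods"~1
```

As noted above, the added slop also admits some near-matches you did not intend, so this trades precision for recall.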

> On Jun 30, 2020, at 11:56 AM, Permakoff, Vadim  
> wrote:
> 
> Hi Erik,
> That's what I did in the past, but this is an enterprise search and I have a 
> requirement to remove the stopwords.
> To have both features I can add synonyms in the front-end application; I know 
> it will work, but I need a justification for why I have to do it in the 
> application, as it is an additional effort.
> I thought there is a bug for such a case to which I can refer, because 
> according to the documentation it should work, right?
> Anyway, there is more to it. If I add the same synonym processing to the 
> indexing part, the configuration will be like this:
> 
> <fieldType name="…" class="solr.TextField"
>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
>   <analyzer type="index">
>     <tokenizer class="…"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="…" ignoreCase="true"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="…"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="…" ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>   </analyzer>
> </fieldType>
> 
> The analysis shows that the parsing now matches for the indexing and querying 
> paths, but the exact-match result still cannot be found! This is weird.
> Any thoughts?
> 
> Best Regards,
> Vadim Permakoff

Re: How to determine why solr stops running?

2020-06-30 Thread Otis Gospodnetić
Hi,

Maybe https://github.com/sematext/solr-diagnostics can be of use?

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



On Mon, Jun 29, 2020 at 3:46 PM Erick Erickson 
wrote:

> Really look at your cache size settings.
>
> This is to eliminate this scenario:
> - your cache sizes are very large
> - when you looked and the memory was 9G, you also had a lot of cache
> entries
> - there was a commit, which threw out the old cache and reduced your cache
> size
>
> This is frankly kind of unlikely, but worth checking.
>
> The other option is that you haven’t been hitting OOMs at all and that’s a
> complete
> red herring. Let’s say in actuality, you only need an 8G heap or even
> smaller. By
> overallocating memory garbage will simply accumulate for a long time and
> when it
> is eventually collected, _lots_ of memory will be collected.
>
> Another rather unlikely scenario, but again worth checking.
>
> Best,
> Erick
>
> > On Jun 29, 2020, at 3:27 PM, Ryan W  wrote:
> >
> > On Mon, Jun 29, 2020 at 3:13 PM Erick Erickson 
> > wrote:
> >
> >> ps aux | grep solr
> >>
> >
> > [solr@faspbsy0002 database-backups]$ ps aux | grep solr
> > solr  72072  1.6 33.4 22847816 10966476 ?   Sl   13:35   1:36 java
> > -server -Xms16g -Xmx16g -XX:+UseG1GC -XX:+ParallelRefProcEnabled
> > -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:+UseLargePages
> > -XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails
> > -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
> > -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
> > -Xloggc:/opt/solr/server/logs/solr_gc.log -XX:+UseGCLogFileRotation
> > -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M
> > -Dsolr.log.dir=/opt/solr/server/logs -Djetty.port=8983 -DSTOP.PORT=7983
> > -DSTOP.KEY=solrrocks -Duser.timezone=UTC -Djetty.home=/opt/solr/server
> > -Dsolr.solr.home=/opt/solr/server/solr -Dsolr.data.home=
> > -Dsolr.install.dir=/opt/solr
> > -Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf
> > -Xss256k -Dsolr.jetty.https.port=8983 -Dsolr.log.muteconsole
> > -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983
> /opt/solr/server/logs
> > -jar start.jar --module=http
> >
> >
> >
> >> should show you all the parameters Solr is running with, as would the
> >> admin screen. You should see something like:
> >>
> >> -XX:OnOutOfMemoryError=your_solr_directory/bin/oom_solr.sh
> >>
> >> And there should be some logs laying around if that was the case
> >> similar to:
> >> $SOLR_LOGS_DIR/solr_oom_killer-$SOLR_PORT-$NOW.log
> >>
> >
> > This log is not being written, even though from oom_solr.sh it appears a
> > solr_oom_killer-$SOLR_PORT-$NOW.log should be written to the logs
> > directory, but it isn't. There are some log files in
> > /opt/solr/server/logs, and they are indeed being written to. There are
> > fresh entries in the logs, but no sign of any problem. If I grep for oom
> > in the logs directory, the only references I see are benign... just a few
> > entries that list all the flags, and oom_solr.sh is among the settings
> > visible in the entry. And someone did a search for "Mushroom," so there's
> > another instance of oom from that search.
> >
> >
> > As for memory, It Depends (tm). There are configurations
> >> you can make choices about that will affect the heap requirements.
> >> You can’t really draw comparisons between different projects. Your
> >> Drupal + Solr app has how many documents? Indexed how? Searched
> >> how? .vs. this one.
> >>
> >> The usual suspect for configuration settings that are responsible
> >> include:
> >>
> >> - filterCache size too large. Each filterCache entry is bounded by
> >> maxDoc/8 bytes. I’ve seen people set this to over 1M…
> >>
> >> - using non-docValues for fields used for sorting, grouping, function
> >> queries
> >> or faceting. Solr will uninvert the field on the heap, whereas if you
> have
> >> specified docValues=true, the memory is out in OS memory space rather
> than
> >> heap.
> >>
> >> - People just putting too many docs in a collection in a single JVM in
> >> aggregate.
> >> All replicas in the same instance are using part of the heap.
> >>
> >> - Having unnecessary options on your fields, although that’s more MMap
> >> space than
> >> heap.
> >>
> >> The problem basically is that all of Solr’s access is essentially
> random,
> >> so for
> >> performance reasons lots of stuff has to be in memory.
> >>
> >> That said, Solr hasn’t been as careful as it should be about using up
> >> memory,
> >> that’s ongoing.
> >>
> >> If you really want to know what’s using up memory, throw a heap analysis
> >> tool
> >> at it. That’ll give you a clue what’s hogging memory and you can go from
> >> there.
> >>
> >>> On Jun 29, 2020, at 1:48 PM, David Hastings <
> >> hastings.recurs...@gmail.com> wrote:
> >>>
> >>> little nit picky note here, use 31gb, never 32.
> >>>
> >>> On Mon, Jun 29, 2020 at 

Re: Query in quotes cannot find results

2020-06-30 Thread Walter Underwood
Removing stopwords is a dumb requirement. “Doctor, it hurts when I shove 
hedgehogs up my arse.”

Part of our job as search engineers is to solve the real problem, not implement 
a pile of requirements from people who don’t understand how search works.

Here is an article I wrote 13 years ago about why we didn’t remove stopwords at 
Netflix.

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 30, 2020, at 8:56 AM, Permakoff, Vadim  
> wrote:
> 
> Hi Erik,
> That's what I did in the past, but this is an enterprise search and I have a 
> requirement to remove the stopwords.
> To have both features I can add synonyms in the front-end application, I know 
> it will work, but I need a justification why I have to do it in the 
> application as it is an additional effort.
> I thought there was a bug for such a case that I could refer to, because 
> according to the documentation it should work, right?
> Anyway, there is more to it. If I add the same synonym processing to the 
> indexing part, i.e. the configuration will be like this:
> 
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>   <analyzer type="index">
>     <tokenizer class="..."/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="..." ignoreCase="true"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="..."/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="..." ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>   </analyzer>
> </fieldType>
> 
> The analysis shows the parsing is matching now for indexing and querying 
> path, but the exact match result still cannot be found! This is weird.
> Any thoughts?
> 
> Best Regards,
> Vadim Permakoff
> 
> 
> -Original Message-
> From: Erick Erickson  
> Sent: Monday, June 29, 2020 10:19 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query in quotes cannot find results
> 
> Looks like you’re removing stopwords. Stopwords cause issues like this with 
> the positions being off.
> 
> It’s becoming more and more common to _NOT_ remove stopwords, is that an 
> option?
> 
> 
> 
> Best,
> Erick
> 
>> On Jun 29, 2020, at 7:32 PM, Permakoff, Vadim  
>> wrote:
>> 
>> Hi Shawn,
>> Many thanks for the response, I checked the field and it is correct. Let's 
>> call it _text_ to make it easier.
>> I believe the parsing is also correct, please see below:
>> - Query without quotes (works):
>>   "querystring":"expand the methods",
>>   "parsedquery":"(PhraseQuery(_text_:\"blow up\") _text_:expand) 
>> _text_:methods",
>> 
>> - Query with quotes (does not work):
>>   "querystring":"\"expand the methods\"",
>>   "parsedquery":"SpanNearQuery(spanNear([spanOr([spanNear([_text_:blow, 
>> _text_:up], 0, true), _text_:expand]), _text_:methods], 0, true))",
>> 
>> The document has text:
>> "to expand the methods for mailing cancellation"
>> 
>> The analysis on this field shows that all words are present in the index and 
>> the query, the order is also correct, but the word "methods" is moved one 
>> position; I guess that's why the result is not found.
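
The position shift described above can be illustrated with a toy analyzer (a sketch only, not Solr's actual analysis chain; the stopword list is hypothetical):

```python
STOPWORDS = {"to", "the", "for"}

def analyze(text):
    """Toy stopword removal that preserves position increments,
    the way Lucene's StopFilter does by default: the slot of a
    removed stopword stays empty rather than being closed up."""
    tokens = []
    for pos, tok in enumerate(text.lower().split()):
        if tok not in STOPWORDS:
            tokens.append((tok, pos))
    return tokens

doc = dict(analyze("to expand the methods for mailing cancellation"))
# "expand" sits at position 1 and "methods" at position 3. A phrase
# query with zero slop needs them at adjacent positions, so the exact
# phrase "expand the methods" fails even though both words are present
# and in order.
```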
>> 
>> Best Regards,
>> Vadim Permakoff
>> 
>> 
>> 
>> 
>> -Original Message-
>> From: Shawn Heisey 
>> Sent: Monday, June 29, 2020 6:28 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Query in quotes cannot find results
>> 
>> On 6/29/2020 3:34 PM, Permakoff, Vadim wrote:
>>> The basic query q=expand the methods   <<< finds the document,
>>> the query (in quotes) q="expand the methods"   <<< cannot find the document
>>> 
>>> Am I doing something wrong, or is it known bug (I saw similar issues 
>>> discussed in the past, but not for exact match query) and if yes - what is 
>>> the Jira for it?
>> 
>> The most helpful information will come from running both queries with debug 
>> enabled, so you can see how the query is parsed.  If you add a parameter 
>> "debugQuery=true" to the URL, then the response should include the parsed 
>> query.  Compare those, and see if you can tell what the differences are.
>> 
>> One of the most common problems for queries like this is that you're not 
>> searching the field that you THINK you're searching.  I don't know whether 
>> this is the problem, I just mention it because it is a common error.
>> 
>> Thanks,
>> Shawn
>> 
>> 
>> 
>> This email is intended solely for the recipient. It may contain privileged, 
>> proprietary or confidential information or material. If you are not the 
>> intended recipient, please delete this email and any attachments and notify 
>> the sender of the error.
> 



Re: Query in quotes cannot find results

2020-06-30 Thread Erick Erickson
Well, the first thing is that you haven’t included FlattenGraphFilterFactory in 
the index analysis chain, see: 
https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#synonym-graph-filter.
 IDK whether that actually pertains, but I’d reindex with that included before 
pursuing.

Second, “I have a requirement to remove the stopwords”. Why? Who thinks it’s 
necessary? Is there any evidence for this or any use-case that shows it _is_ 
necessary? Removing stopwords became common in the long-ago days when memory 
and disk capacity were vastly more constrained than now. At this point, I 
require proof that it’s _necessary_ to remove them before accepting this kind 
of requirement.

There are situations where removing stopwords is worth the difficulty it 
causes. But I’ve seen far too many unnecessary requirements to let that one 
pass without pushing back ;).

And you can hack around this by adding slop to the phrase, perhaps you can get 
“good enough” results by adding one slop for every stopword, i.e. if the input 
is “expand the methods”, detect that there’s one stopword and change it to 
“expand the methods”~1. That’ll introduce other problems of course.
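
The slop hack could be automated along these lines (a hypothetical helper, with a made-up stopword list):

```python
STOPWORDS = {"the", "a", "an", "of", "to"}

def phrase_with_slop(query: str) -> str:
    """Add one unit of phrase slop per stopword in the user's phrase,
    so "expand the methods" becomes "expand the methods"~1."""
    n = sum(1 for tok in query.lower().split() if tok in STOPWORDS)
    return f'"{query}"~{n}' if n else f'"{query}"'

print(phrase_with_slop("expand the methods"))   # "expand the methods"~1
```

As noted above, this trades precision for recall: the slop also admits intervening terms, which may or may not be acceptable.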

Best,
Erick

> On Jun 30, 2020, at 11:56 AM, Permakoff, Vadim  
> wrote:
> 
> Hi Erik,
> That's what I did in the past, but this is an enterprise search and I have a 
> requirement to remove the stopwords.
> To have both features I can add synonyms in the front-end application, I know 
> it will work, but I need a justification why I have to do it in the 
> application as it is an additional effort.
> I thought there was a bug for such a case that I could refer to, because 
> according to the documentation it should work, right?
> Anyway, there is more to it. If I add the same synonym processing to the 
> indexing part, i.e. the configuration will be like this:
> 
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>   <analyzer type="index">
>     <tokenizer class="..."/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="..." ignoreCase="true"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="..."/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="..." ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>   </analyzer>
> </fieldType>
> 
> The analysis shows the parsing is matching now for indexing and querying 
> path, but the exact match result still cannot be found! This is weird.
> Any thoughts?
> 
> Best Regards,
> Vadim Permakoff
> 
> 
> -Original Message-
> From: Erick Erickson  
> Sent: Monday, June 29, 2020 10:19 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query in quotes cannot find results
> 
> Looks like you’re removing stopwords. Stopwords cause issues like this with 
> the positions being off.
> 
> It’s becoming more and more common to _NOT_ remove stopwords, is that an 
> option?
> 
> 
> 
> Best,
> Erick
> 
>> On Jun 29, 2020, at 7:32 PM, Permakoff, Vadim  
>> wrote:
>> 
>> Hi Shawn,
>> Many thanks for the response, I checked the field and it is correct. Let's 
>> call it _text_ to make it easier.
>> I believe the parsing is also correct, please see below:
>> - Query without quotes (works):
>>   "querystring":"expand the methods",
>>   "parsedquery":"(PhraseQuery(_text_:\"blow up\") _text_:expand) 
>> _text_:methods",
>> 
>> - Query with quotes (does not work):
>>   "querystring":"\"expand the methods\"",
>>   "parsedquery":"SpanNearQuery(spanNear([spanOr([spanNear([_text_:blow, 
>> _text_:up], 0, true), _text_:expand]), _text_:methods], 0, true))",
>> 
>> The document has text:
>> "to expand the methods for mailing cancellation"
>> 
>> The analysis on this field shows that all words are present in the index and 
>> the query, the order is also correct, but the word "methods" is moved one 
>> position; I guess that's why the result is not found.
>> 
>> Best Regards,
>> Vadim Permakoff
>> 
>> 
>> 
>> 
>> -Original Message-
>> From: Shawn Heisey 
>> Sent: Monday, June 29, 2020 6:28 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Query in quotes cannot find results
>> 
>> On 6/29/2020 3:34 PM, Permakoff, Vadim wrote:
>>> The basic query q=expand the methods   <<< finds the document,
>>> the query (in quotes) q="expand the methods"   <<< cannot find the document
>>> 
>>> Am I doing something wrong, or is it known bug (I saw similar issues 
>>> discussed in the past, but not for exact match query) and if yes - what is 
>>> the Jira for it?
>> 
>> The most helpful information will come from running both queries with debug 
>> enabled, so you can see how the query is parsed.  If you add a parameter 
>> "debugQuery=true" to the URL, then the response should include the parsed 
>> query.  Compare those, and see if you can tell what the differences are.
>> 
>> One of the most common problems for queries like this is that you're not 
>> searching the field that you THINK you're searching.  I don't know whether 
>> this is the problem, I just mention it because it is a common error.
>> 
>> Thanks,
>> Shawn
>> 
>> 
>> 

RE: Query in quotes cannot find results

2020-06-30 Thread Permakoff, Vadim
Hi Erik,
That's what I did in the past, but this is an enterprise search and I have a 
requirement to remove the stopwords.
To have both features I can add synonyms in the front-end application, I know 
it will work, but I need a justification why I have to do it in the application 
as it is an additional effort.
I thought there was a bug for such a case that I could refer to, because 
according to the documentation it should work, right?
Anyway, there is more to it. If I add the same synonym processing to the 
indexing part, i.e. the configuration will be like this:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="..."/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="..." ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="..."/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="..." ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
  </analyzer>
</fieldType>

The analysis shows the parsing is matching now for indexing and querying path, 
but the exact match result still cannot be found! This is weird.
Any thoughts?

Best Regards,
Vadim Permakoff


-Original Message-
From: Erick Erickson  
Sent: Monday, June 29, 2020 10:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Query in quotes cannot find results

Looks like you’re removing stopwords. Stopwords cause issues like this with the 
positions being off.

It’s becoming more and more common to _NOT_ remove stopwords, is that an option?



Best,
Erick

> On Jun 29, 2020, at 7:32 PM, Permakoff, Vadim  
> wrote:
> 
> Hi Shawn,
> Many thanks for the response, I checked the field and it is correct. Let's 
> call it _text_ to make it easier.
> I believe the parsing is also correct, please see below:
> - Query without quotes (works):
>"querystring":"expand the methods",
>"parsedquery":"(PhraseQuery(_text_:\"blow up\") _text_:expand) 
> _text_:methods",
> 
> - Query with quotes (does not work):
>"querystring":"\"expand the methods\"",
>"parsedquery":"SpanNearQuery(spanNear([spanOr([spanNear([_text_:blow, 
> _text_:up], 0, true), _text_:expand]), _text_:methods], 0, true))",
> 
> The document has text:
> "to expand the methods for mailing cancellation"
> 
> The analysis on this field shows that all words are present in the index and 
> the query, the order is also correct, but the word "methods" is moved one 
> position; I guess that's why the result is not found.
> 
> Best Regards,
> Vadim Permakoff
> 
> 
> 
> 
> -Original Message-
> From: Shawn Heisey 
> Sent: Monday, June 29, 2020 6:28 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query in quotes cannot find results
> 
> On 6/29/2020 3:34 PM, Permakoff, Vadim wrote:
>> The basic query q=expand the methods   <<< finds the document,
>> the query (in quotes) q="expand the methods"   <<< cannot find the document
>> 
>> Am I doing something wrong, or is it known bug (I saw similar issues 
>> discussed in the past, but not for exact match query) and if yes - what is 
>> the Jira for it?
> 
> The most helpful information will come from running both queries with debug 
> enabled, so you can see how the query is parsed.  If you add a parameter 
> "debugQuery=true" to the URL, then the response should include the parsed 
> query.  Compare those, and see if you can tell what the differences are.
> 
> One of the most common problems for queries like this is that you're not 
> searching the field that you THINK you're searching.  I don't know whether 
> this is the problem, I just mention it because it is a common error.
> 
> Thanks,
> Shawn
> 
> 
> 



Re: Config files not replicating

2020-06-30 Thread Atita Arora
Yes, The config is there and it works for me in live environment but not
the new staging environment.


On Tue, Jun 30, 2020 at 2:29 PM Erick Erickson 
wrote:

> Did you put your auxiliary files in the
> confFiles tag? E.g. from the page you referenced:
>
> <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
>
> Best,
> Erick
>
> > On Jun 30, 2020, at 5:38 AM, Atita Arora  wrote:
> >
> > Hi,
> >
> > We are using Solr 6.6.2 in the Master-Slave mode (a hot topic of the
> > discussion threads these days!!) and lately, I got into this weird issue
> > that at each replication trigger my index gets correctly replicated but
> my
> > config changes are not replicated to my slaves.
> >
> > We are using referential properties i.e. my solrconfig.xml imports the
> > different configs like requesthandler_config.xml,
> > replication_handler_config.xml, etc  which essentially means if going by
> > solr doc (
> https://lucene.apache.org/solr/guide/6_6/index-replication.html) :
> >
> > Unlike the index files, where the timestamp is good enough to figure out
> if
> > they are identical, configuration files are compared against their
> > checksum. The schema.xml files (on master and slave) are judged to be
> > identical if their checksums are identical.
> >
> > The checksum of my solrconfig.xml would not vary, is it why my files
> won't
> > replicate?
> >
> > I already have another Master-Slave in a different environment working
> with
> > the same config version, so I don't smell any issue with the replication
> > configuration.
> >
> > I have tried manual replication too but the files would not change.
> > Maybe it is something weirdly trivial or stupid that I seem to be missing
> > here, any pointers or ideas what else can I check?
> >
> > Thank you,
> >
> > Atita
>
>


Re: solrj - get metrics from all nodes

2020-06-30 Thread Jan Høydahl
Use nodes=, not node=
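
In other words, the request would be built with the plural parameter; a sketch of assembling it (host and node names below are placeholders):

```python
from urllib.parse import urlencode

params = {
    "group": "node",
    "nodes": "node1:8983_solr,node2:8983_solr",  # comma-separated node list
    "prefix": "CONTAINER.fs",                    # narrow to filesystem metrics
}
# "nodes" (plural) selects which nodes report; "node" (singular) does not.
url = "http://localhost:8983/solr/admin/metrics?" + urlencode(params)
```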

> 30. jun. 2020 kl. 02:02 skrev ChienHuaWang :
> 
> Hi Jan,
> 
> Thanks for the response.
> Could you please share more detail on how you request the metrics from
> multiple nodes at the same time?
> I do something as below, but only get one node's info; the data I'm most
> interested in is, e.g., CONTAINER.fs.totalSpace, CONTAINER.fs.usableSpace, etc.
> 
> 
> solr/admin/metrics?group=node&node=node1_name,node2_name
> 
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Config files not replicating

2020-06-30 Thread Erick Erickson
Did you put your auxiliary files in the 
confFiles tag? E.g. from the page you referenced:

<str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>

Best,
Erick

> On Jun 30, 2020, at 5:38 AM, Atita Arora  wrote:
> 
> Hi,
> 
> We are using Solr 6.6.2 in the Master-Slave mode (a hot topic of the
> discussion threads these days!!) and lately, I got into this weird issue
> that at each replication trigger my index gets correctly replicated but my
> config changes are not replicated to my slaves.
> 
> We are using referential properties i.e. my solrconfig.xml imports the
> different configs like requesthandler_config.xml,
> replication_handler_config.xml, etc  which essentially means if going by
> solr doc (https://lucene.apache.org/solr/guide/6_6/index-replication.html) :
> 
> Unlike the index files, where the timestamp is good enough to figure out if
> they are identical, configuration files are compared against their
> checksum. The schema.xml files (on master and slave) are judged to be
> identical if their checksums are identical.
> 
> The checksum of my solrconfig.xml would not vary, is it why my files won't
> replicate?
> 
> I already have another Master-Slave in a different environment working with
> the same config version, so I don't smell any issue with the replication
> configuration.
> 
> I have tried manual replication too but the files would not change.
> Maybe it is something weirdly trivial or stupid that I seem to be missing
> here, any pointers or ideas what else can I check?
> 
> Thank you,
> 
> Atita



Re: Prefix + Suffix Wildcards in Searches

2020-06-30 Thread Erick Erickson
That’s not quite the question I was asking.

Let’s take "…that don’t contain the characters ‘paid’ “.

Start with the fact that no matter what the mechanics of
implementing pre-and-post wildcards, something like

*:* -tags:*paid*

would exclude a doc with a tag of "credit-ms-reply-unpaid" or
"ms-reply-unpaid-2019”. I really think this is an XY problem,
You’re assuming that the solution is pre-and-post wildcards
without a precise definition of the problem you’re trying to solve.

Do they want to exclude things with the characters ‘ia’ or ‘id’? Or
is their “unit of exclusion” the _entire_ word ‘paid’? Or can we
define it so? Because if we can, what I wrote yesterday about
using proper tokenization and phrase queries will work.

If you break up all your tags in your example into individual
tokens on non-alphanumerics, then your problem is much simpler,
excluding “*paid*” becomes

-tags:paid

excluding “*ms-reply*” becomes 

-tags:”ms reply”

trying to exclude “*ms-unpaid*”

would _not_ exclude the doc with the tag "credit-ms-reply-unpaid”
because “ms” and “unpaid” are not sequential.

_Including_ is the same argument.

BTW, this is where “positionIncrementGap” comes in. If they can
define multiple tags in each document, phrase searching with
a gap greater than 1 (100 is the usual default) _and_ each tag
is an entry in a multiValued field, you can prevent matching
across tags with phrase searches. Consider two tags “ms-tag1”
and “paid-2019”. You don’t want “*tag1-paid*” to exclude this
doc I’d imagine. The positionIncrementGap takes care of this in the
phrase case. Remember that in this solution, the dashes aren’t
included in each token.

prefix only or postfix only would be a little tricky, one idea would be
to copyField into an _untokenized_ field and search
there in those cases. But even here, you need to determine precisely
what you expect. What would “*d-2019” return? Would it return 
something ending in “ms-reply-paid-2019”?

Alternatively, you wouldn’t need a copyField if you introduced
special tokens before and after each tag, so indexing “invoice-paid”
would index tokens:
specialbegintoken invoice paid specialendtoken
and searching for 

*paid 

becomes tag:“paid specialendtoken”
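
The tokenize-and-phrase-match idea above can be sketched outside Solr (toy code, not the actual analysis chain):

```python
import re

def tag_tokens(tag: str) -> list:
    """Break a tag into tokens on non-alphanumerics,
    so 'credit-ms-reply-unpaid' -> ['credit','ms','reply','unpaid']."""
    return [t for t in re.split(r"[^0-9a-z]+", tag.lower()) if t]

def contains_phrase(tokens: list, phrase: str) -> bool:
    """True only if the phrase tokens appear sequentially, like a
    zero-slop phrase query against the tokenized tag."""
    p = phrase.lower().split()
    return any(tokens[i:i + len(p)] == p for i in range(len(tokens) - len(p) + 1))

tags = ["paid", "invoice-paid", "ms-reply-unpaid-2019", "credit-ms-reply-unpaid"]
# '-tags:paid' excludes only tags containing the whole token "paid",
# so the "unpaid" tags survive:
kept = [t for t in tags if "paid" not in tag_tokens(t)]
# a phrase must be sequential: "ms reply" matches credit-ms-reply-unpaid,
# but "ms unpaid" does not.
hit = contains_phrase(tag_tokens("credit-ms-reply-unpaid"), "ms reply")
```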

Best,
Erick

> On Jun 30, 2020, at 7:29 AM, Chris Dempsey  wrote:
> 
> @Mikhail
> 
> Thanks for the link! I'll read through that.
> 
> On Tue, Jun 30, 2020 at 6:28 AM Chris Dempsey  wrote:
> 
>> @Erick,
>> 
>> You've got the idea. Basically the users can attach zero or more tags (*that
>> they create*) to a document. So as an example say they've created the
>> tags (this example is just a small subset of the total tags):
>> 
>>   - paid
>>   - invoice-paid
>>   - ms-reply-unpaid-2019
>>   - credit-ms-reply-unpaid
>>   - ms-reply-paid-2019
>>   - ms-reply-paid-2020
>> 
>> and attached them in various combinations to documents. They then want to
>> find all documents by tag that don't contain the characters "paid" anywhere
>> in the tag, don't contain tags with the characters "ms-reply-unpaid", but
>> do include documents tagged with the characters "ms-reply-paid".
>> 
>> The obvious suggestion would be to have the users just use the entire tag
>> (i.e. don't let them do a "contains") as a condition to eliminate the
>> wildcards - which would work -  but unfortunately we have customers with 
>> (*not
>> joking*) over 100K different tags (*why have a taxonomy like that is yet
>> a different issue*). I'm willing to accept that in our scenario n-grams
>> might be the Solr-based answer (the other being to change what "contains"
>> means within our application) but thought I'd check I hadn't overlooked any
>> other options. :)
>> 
>> On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev  wrote:
>> 
>>> Hello, Chris.
>>> I suppose index time analysis can yield these terms:
>>> "paid","ms-reply-unpaid","ms-reply-paid", and thus let you avoid these
>>> expensive wildcard queries. Here's why it's worth to avoid them
>>> 
>>> https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam
>>> 
>>> On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey  wrote:
>>> 
 Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*)
>>> but
 I'm looking into options for optimizing something like this:
 
> fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR
 tag:*ms-reply-paid*
 
 It's probably not a surprise that we're seeing performance issues with
 something like this. My understanding is that using the wildcard on both
 ends forces a full-text index search. Something like the above can't
>>> take
 advantage of something like the ReverseWordFilter either. I believe
 constructing `n-grams` is an option (*at the expense of index size*)
>>> but is
 there anything I'm overlooking as a possible avenue to look into?
 
>>> 
>>> 
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>> 
>> 



Re: Prefix + Suffix Wildcards in Searches

2020-06-30 Thread Chris Dempsey
@Mikhail

Thanks for the link! I'll read through that.

On Tue, Jun 30, 2020 at 6:28 AM Chris Dempsey  wrote:

> @Erick,
>
> You've got the idea. Basically the users can attach zero or more tags (*that
> they create*) to a document. So as an example say they've created the
> tags (this example is just a small subset of the total tags):
>
>- paid
>- invoice-paid
>- ms-reply-unpaid-2019
>- credit-ms-reply-unpaid
>- ms-reply-paid-2019
>- ms-reply-paid-2020
>
> and attached them in various combinations to documents. They then want to
> find all documents by tag that don't contain the characters "paid" anywhere
> in the tag, don't contain tags with the characters "ms-reply-unpaid", but
> do include documents tagged with the characters "ms-reply-paid".
>
> The obvious suggestion would be to have the users just use the entire tag
> (i.e. don't let them do a "contains") as a condition to eliminate the
> wildcards - which would work -  but unfortunately we have customers with (*not
> joking*) over 100K different tags (*why have a taxonomy like that is yet
> a different issue*). I'm willing to accept that in our scenario n-grams
> might be the Solr-based answer (the other being to change what "contains"
> means within our application) but thought I'd check I hadn't overlooked any
> other options. :)
>
> On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev  wrote:
>
>> Hello, Chris.
>> I suppose index time analysis can yield these terms:
>> "paid","ms-reply-unpaid","ms-reply-paid", and thus let you avoid these
>> expensive wildcard queries. Here's why it's worth to avoid them
>>
>> https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam
>>
>> On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey  wrote:
>>
>> > Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*)
>> but
>> > I'm looking into options for optimizing something like this:
>> >
>> > > fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR
>> > tag:*ms-reply-paid*
>> >
>> > It's probably not a surprise that we're seeing performance issues with
>> > something like this. My understanding is that using the wildcard on both
>> > ends forces a full-text index search. Something like the above can't
>> take
>> > advantage of something like the ReverseWordFilter either. I believe
>> > constructing `n-grams` is an option (*at the expense of index size*)
>> but is
>> > there anything I'm overlooking as a possible avenue to look into?
>> >
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>


Re: Prefix + Suffix Wildcards in Searches

2020-06-30 Thread Chris Dempsey
@Erick,

You've got the idea. Basically the users can attach zero or more tags (*that
they create*) to a document. So as an example say they've created the tags
(this example is just a small subset of the total tags):

   - paid
   - invoice-paid
   - ms-reply-unpaid-2019
   - credit-ms-reply-unpaid
   - ms-reply-paid-2019
   - ms-reply-paid-2020

and attached them in various combinations to documents. They then want to
find all documents by tag that don't contain the characters "paid" anywhere
in the tag, don't contain tags with the characters "ms-reply-unpaid", but
do include documents tagged with the characters "ms-reply-paid".

The obvious suggestion would be to have the users just use the entire tag
(i.e. don't let them do a "contains") as a condition to eliminate the
wildcards - which would work -  but unfortunately we have customers with (*not
joking*) over 100K different tags (*why have a taxonomy like that is yet a
different issue*). I'm willing to accept that in our scenario n-grams might
be the Solr-based answer (the other being to change what "contains" means
within our application) but thought I'd check I hadn't overlooked any other
options. :)

On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev  wrote:

> Hello, Chris.
> I suppose index time analysis can yield these terms:
> "paid","ms-reply-unpaid","ms-reply-paid", and thus let you avoid these
> expensive wildcard queries. Here's why it's worth to avoid them
> https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam
>
> On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey  wrote:
>
> > Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*)
> but
> > I'm looking into options for optimizing something like this:
> >
> > > fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR
> > tag:*ms-reply-paid*
> >
> > It's probably not a surprise that we're seeing performance issues with
> > something like this. My understanding is that using the wildcard on both
> > ends forces a full-text index search. Something like the above can't take
> > advantage of something like the ReverseWordFilter either. I believe
> > constructing `n-grams` is an option (*at the expense of index size*) but
> is
> > there anything I'm overlooking as a possible avenue to look into?
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: solrj - get metrics from all nodes

2020-06-30 Thread ChienHuaWang
Hi Jan,

Thanks for the response.
Could you please share more detail on how you request the metrics from multiple
nodes at the same time?
I do something as below, but only get one node's info; the data I'm most
interested in is, e.g., CONTAINER.fs.totalSpace, CONTAINER.fs.usableSpace, etc.


solr/admin/metrics?group=node&node=node1_name,node2_name




--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Config files not replicating

2020-06-30 Thread Atita Arora
Hi,

We are using Solr 6.6.2 in the Master-Slave mode (a hot topic of the
discussion threads these days!!) and lately, I got into this weird issue
that at each replication trigger my index gets correctly replicated but my
config changes are not replicated to my slaves.

We are using referential properties i.e. my solrconfig.xml imports the
different configs like requesthandler_config.xml,
replication_handler_config.xml, etc  which essentially means if going by
solr doc (https://lucene.apache.org/solr/guide/6_6/index-replication.html) :

Unlike the index files, where the timestamp is good enough to figure out if
they are identical, configuration files are compared against their
checksum. The schema.xml files (on master and slave) are judged to be
identical if their checksums are identical.
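
That comparison boils down to something like the following sketch (Solr's actual implementation differs in detail). Note that it is applied per file listed in confFiles, so an imported file that is not listed there is never compared at all:

```python
import hashlib

def checksum(data: bytes) -> str:
    """Checksum of a config file's contents (md5 used here for illustration)."""
    return hashlib.md5(data).hexdigest()

# hypothetical contents of one listed config file on master and slave
master_file = b"<requestHandler ... />"
slave_file  = b"<requestHandler ... />"

# identical checksums -> the replication handler would skip the file
needs_copy = checksum(master_file) != checksum(slave_file)
```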

The checksum of my solrconfig.xml would not vary, is it why my files won't
replicate?

I already have another Master-Slave in a different environment working with
the same config version, so I don't smell any issue with the replication
configuration.

I have tried manual replication too but the files would not change.
Maybe it is something weirdly trivial or stupid that I seem to be missing
here, any pointers or ideas what else can I check?

Thank you,

Atita


Re: About timeAllowed when using LTR

2020-06-30 Thread Mikhail Khludnev
Hi, Dawn.

It might make sense. Feel free to raise a jira, and "patches are welcome!".


On Tue, Jun 30, 2020 at 10:33 AM Dawn  wrote:

> Hi:
>
> When using LTR with the timeAllowed parameter enabled, an LTR feature of a
> query may call ExitableFilterAtomicReader.checkAndThrow for timeout detection.
>
> If a timeout occurs at this point, the exception ExitingReaderException is
> thrown, resulting in an empty result being returned.
>
> Is it possible to handle this exception in LTR so that any results the LTR
> has already scored are returned instead of nothing?
>
> This exception occurs in two places:
>
> 1. LTRScoringQuery.createWeight or createWeightsParallel. This is the loading
> stage; ending directly on timeout is acceptable.
>
> 2. ModelWeight.scorer. This stage evaluates each doc; it could catch the
> exception, end early, and return partial results.



-- 
Sincerely yours
Mikhail Khludnev


About timeAllowed when using LTR

2020-06-30 Thread Dawn
Hi:

When using LTR with the timeAllowed parameter enabled, an LTR feature of a query 
may call ExitableFilterAtomicReader.checkAndThrow for timeout detection.

If a timeout occurs at this point, the exception ExitingReaderException is 
thrown, resulting in an empty result being returned.

Is it possible to handle this exception in LTR so that any results the LTR has 
already scored are returned instead of nothing?

This exception occurs in two places:

1. LTRScoringQuery.createWeight or createWeightsParallel. This is the loading 
stage; ending directly on timeout is acceptable.

2. ModelWeight.scorer. This stage evaluates each doc; it could catch the 
exception, end early, and return partial results.
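
Point 2 amounts to catching the timeout inside the scoring loop and returning whatever has been scored so far. A minimal sketch of the idea (class and function names are made up, not the actual LTR/Lucene APIs):

```python
class TimeExceededError(Exception):
    """Stand-in for Lucene's ExitingReaderException."""

def score_docs(doc_ids, scorer, time_budget):
    """Score docs until a simulated timeAllowed fires; on timeout,
    return the partial results instead of failing with no results."""
    scored, partial = [], False
    try:
        for i, doc in enumerate(doc_ids):
            if i >= time_budget:          # simulate the timeout check firing
                raise TimeExceededError()
            scored.append((doc, scorer(doc)))
    except TimeExceededError:
        partial = True                    # degrade gracefully, as proposed
    return scored, partial

results, partial = score_docs(range(10), lambda d: d * 0.5, time_budget=4)
# 4 docs scored and flagged partial, rather than an empty response
```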