Re: Solr relevancy score different on replicated nodes
Thanks Erick for your suggestions and time.

On Tue, Feb 12, 2019, 22:32 Erick Erickson wrote:
> [...]
Re: Solr relevancy score different on replicated nodes
You really only have four options:

1> Use exactstats. This won't guarantee precise matches, but the scores will be closer.
2> Optimize (not particularly recommended, but if you're willing to do it periodically, the stats will match until the next updates).
3> Use TLOG/PULL replicas and confine the requests to the PULL replicas. There'll _still_ be some window for mismatches; specifically, the default is commit_interval/2.
4> Define the problem away.

Best,
Erick

On Tue, Feb 12, 2019 at 2:42 AM Aman Tandon wrote:
> [...]
Re: Solr relevancy score different on replicated nodes
Hi Erick,

Any suggestions on this?

Regards,
Aman

On Fri, Feb 8, 2019, 17:07 Aman Tandon wrote:
> [...]
Re: Solr relevancy score different on replicated nodes
Hi Erick,

I find this thread very relevant to people who are facing the same problem.

In our case, we have a signals aggregation collection with a total of around 8 million records. We have a SolrCloud architecture (3 shards and 4 replicas), and the whole index is around 2.5 GB.

We use this collection to fetch the most-clicked products for a query and boost them in the search results. The boost score is the query score on the aggregation collection.

But when the query goes to a different replica, we get a different boost score for some of the keywords; hence, on page refresh, the ordering of results keeps changing.

To solve this, we tried the exactstats cache for distributed IDF. At debug level I can see the global stats merge in the logs, but we still get different scores when refreshing the results from the aggregation collection.

Our indexing occurs once a day, so should we do a daily optimization, or should we reduce the merge segment count to 2/3? Currently it is -1.

What are your suggestions on this?

Regards,
Aman

On Fri, Feb 8, 2019, 00:15 Erick Erickson wrote:
> [...]
Re: Solr relevancy score different on replicated nodes
Optimization is safe. The large segment is irrelevant; you'll lose a little parallelization, but on an index with this few documents I doubt you'll notice.

As of Solr 7.5, optimize will respect the max segment size, which defaults to 5G, but you're well under that limit.

Best,
Erick

On Sun, Feb 3, 2019 at 11:54 PM Ashish Bisht wrote:
> [...]
Re: Solr relevancy score different on replicated nodes
Thanks Erick and everyone. We are checking on the stats cache.

I noticed stats skew again and optimized the index to correct it, as per these documents:

https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
and
https://lucidworks.com/2018/06/20/solr-and-optimizing-your-index-take-ii/

I wanted to check on the points below, considering we want the stats skew to be corrected.

1. Once optimized, a single segment won't be naturally merged easily. As we might be doing a manual optimize every time, what I visualize is that at a certain point in the future we might have a single large segment. What impact is this large segment going to have?
Our index is ~30k documents, i.e. files with content (segment size <1 GB as of now).

2. Do you recommend going for optimize in these situations? Probably it will be done only when the stats skew. Is it safe?

Regards
Ashish

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr relevancy score different on replicated nodes
Is this a sharded Solr Cloud collection? If so, you can try using global IDF. That should make the scores more similar on different nodes.

https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Jan 29, 2019, at 10:38 AM, David Hastings wrote:
> [...]
Re: Solr relevancy score different on replicated nodes
Maybe instead of using the Solr score in your metrics, find a way to use the document's location in the results? You can never trust the score to be consistent; it's constantly changing as the index changes.

On Tue, Jan 29, 2019 at 1:29 PM Ashish Bisht wrote:
> [...]
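One way to read the suggestion of using a document's position instead of its score is sketched below. It is only an illustration: the function name, the linear position-to-score mapping, and the 80/20 default weighting are assumptions, and the user metric is assumed to be already scaled to the range 0 to 1. It removes the dependence on score magnitudes; it does not prevent two replicas from returning documents in a different order.

```python
def blend_by_rank(ranked_ids, user_metric, relevancy_weight=0.8):
    """Blend result position (not raw score) with a user metric in [0, 1].

    ranked_ids:  document ids in the order Solr returned them.
    user_metric: dict mapping document id to a metric in [0, 1].
    Returns the ids re-sorted by the blended value, best first.
    """
    n = len(ranked_ids)
    blended = {}
    for position, doc_id in enumerate(ranked_ids):
        # Top hit gets 1.0, last hit gets 1/n, whatever the raw scores were.
        rank_score = (n - position) / n
        metric = user_metric.get(doc_id, 0.0)
        blended[doc_id] = (relevancy_weight * rank_score
                           + (1 - relevancy_weight) * metric)
    return sorted(blended, key=blended.get, reverse=True)


if __name__ == "__main__":
    # "c" is heavily clicked, so it moves ahead of "b" but not ahead of "a".
    print(blend_by_rank(["a", "b", "c", "d", "e"], {"c": 1.0}))
```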
Re: Solr relevancy score different on replicated nodes
Hi Erick,

Our business wanted the score not to be based entirely on the default relevancy algorithm, but on a mix of Solr relevancy and user metrics (80% + 20%).

Each result doc's score is calculated against the max score as a fraction of 80. The remaining 20 comes from user metrics.

Finally, the sort happens on the new score.

But say we got the first page correctly, and the request for the second page goes to another replica where the max score is different. The UI may then show a wrong sort order compared to the first page. For example, the last value on page 1 is 70 and the first value on page 2 can be 72, i.e. distorted sorting.

On top of that, we are not using pagination but infinite scroll, which makes it more noticeable.

Please suggest.

Regards
Ashish
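The arithmetic behind the 70-versus-72 example can be reproduced with a small sketch. All numbers here are made up for illustration, and the function name and the assumption that the user metric lies between 0 and 1 are not from the thread; the point is only that dividing by a different max score on each page is enough to break the ordering across pages.

```python
def blended(score, max_score, user_metric):
    """80 points from the Solr score relative to maxScore, 20 from a user metric in [0, 1]."""
    return 80.0 * score / max_score + 20.0 * user_metric


# Hypothetical values: page 1 is served by replica A, page 2 by replica B,
# and the two replicas report different max scores for the same query.
page1_last = blended(score=7.0, max_score=10.0, user_metric=0.7)   # replica A
page2_first = blended(score=6.9, max_score=9.2, user_metric=0.6)   # replica B

if __name__ == "__main__":
    print(round(page1_last, 1), round(page2_first, 1))
    print("ordering broken across pages:", page2_first > page1_last)
```

Normalizing every page against the same reference value (for instance the max score captured with the first page) would remove this particular distortion, though the raw scores themselves can still differ between replicas.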
Re: Solr relevancy score different on replicated nodes
No, this is not a bug but a consequence of the design. ExactStats can help, but there is no guarantee that different replicas will compute the exact same score. Scores should be very close, however.

You haven't explained why you need the scores to match. 99% of the time, worrying about scores at this level is misguided. So I'd really try to figure out whether they're necessary or not.

Best,
Erick

On Tue, Jan 29, 2019 at 1:51 AM Ashish Bisht wrote:
> [...]
Re: Solr relevancy score different on replicated nodes
Hi Erick,

To test this scenario I added the replica again and for a few days have been monitoring metrics like Num Docs, Max Doc, and Deleted Docs from the *Overview* section of the core. I checked the *Segments Info* section too. Everything looks in sync.

http://:8983/solr/#/MyTestCollection_*shard1_replica_n7*/
http://:8983/solr/#/MyTestCollection_*4_shard1_replica_n7*/

If they go out of sync in the future, I just wanted to confirm whether this is a bug, although you mentioned:

*bq. Shouldn't both replica and leader come to same state after this much long period.

No. After that long, the docs will be the same, all the docs present on one replica will be present and searchable on the other. However, they will be in different segments so the "stats skew" will remain.*

We need these scores, so as a temporary solution we could monitor these metrics for any issues and take action (either optimize or delete and re-add the replica) accordingly. Does that make sense?
Re: Solr relevancy score different on replicated nodes
What Elizabeth said. Really, this is an intractable problem. Even in the TLOG and PULL replica case, an index getting updates will still fire its replication requests at different wall-clock times. Even if that were coordinated, the vagaries of networks etc. would _still_ mean the various replicas would see slightly different "snapshots" of the index. True, the window would be smaller.

The only situations I've seen where the scores on different replicas are always identical are when the index is optimized, which isn't recommended unless you can do it all the time, or when TLOG and PULL replicas are used and the index is not undergoing continuous updates.

As for locking subsequent requests to a set of nodes, the idea has been bandied about but usually falls down when it's realized that this has the potential to unevenly distribute the load.

Best,
Erick

On Fri, Jan 11, 2019 at 3:13 AM Elizabeth Haubert wrote:
> [...]
Re: Solr relevancy score different on replicated nodes
Hello,

To a certain extent, I agree with Erick that this isn't a problem, but looks like one. The nature of TF*IDF is such that you will see different scores for the same query over time on the same replica, or on different replicas for the same query with most replication schemes. This is mildly annoying when the score is displayed to the user, although I have found most end users do not pay that much attention to the floating point score. Testers do. On a small index with high write/delete traffic and homogeneous docs, I've seen it cause document re-orderings when the same query is repeated and sent to different replicas, such as for paging, and that is noticeable to end users.

How big is your index, and how different are the percentages you are seeing? This is a much more pronounced problem on smaller indices; it is possible this is a problem with your test setup, but not production.

Your solution of directing users to a consistent replica will solve the change in values over a session-sized window of time. With a single shard, you could use a Master/Slave setup and direct queries at a given slave. This has a number of operational consequences though, as it means you will lose the benefits of SolrCloud.

Mikhail's suggestion to use ExactStats would be cleaner:
https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_

Elizabeth

On Fri, Jan 11, 2019 at 3:56 AM Ashish Bisht wrote:
> [...]
Re: Solr relevancy score different on replicated nodes
Hi Erick,

Your statement "*At best, I've seen UIs where they display, say, 1 to 5 stars that are just showing the percentile that the particular doc had _relative to the max score*" is something we are trying to achieve, but we are dealing in percentages rather than stars (ratings).

The change in MaxScore per node is messing it up.

I was wondering whether it is possible to make one complete request (for a term) go through one replica, i.e. if we could tell the client which replica served the first request, then subsequent paginated requests would go through that replica until the keyword is changed. Do you think this is possible, or a good idea? If yes, is there a way in Solr to know which replica served a request?

Regards
Ashish
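The idea of sending every page for one keyword to the same replica can be approximated on the client side without asking Solr which replica answered. The sketch below is an assumption-laden illustration for the single-shard case only: the replica URLs are hypothetical, the hashing scheme is just one possible choice, and whether Solr actually serves the request from the core that was addressed should be verified for the version in use.

```python
import hashlib

# Hypothetical core URLs for the replicas of a single-shard collection.
REPLICAS = [
    "http://solr1:8983/solr/products_shard1_replica_n1",
    "http://solr2:8983/solr/products_shard1_replica_n2",
    "http://solr3:8983/solr/products_shard1_replica_n3",
]


def replica_for(keyword, replicas=REPLICAS):
    """Pick the same replica for every page of the same keyword."""
    normalized = keyword.strip().lower().encode("utf-8")
    digest = hashlib.sha256(normalized).digest()
    # A stable hash (unlike the built-in hash()) gives the same choice in
    # every client process.
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


if __name__ == "__main__":
    print(replica_for("red shoes"))
    print(replica_for("Red Shoes "))  # same replica as the line above
```

Two caveats apply to this kind of routing: popular keywords concentrate their traffic on one replica, and the mapping changes whenever the replica list changes.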
Re: Solr relevancy score different on replicated nodes
bq. Shouldn't both replica and leader come to same state after this much long period.

No. After that long, the docs will be the same, all the docs present on one replica will be present and searchable on the other. However, they will be in different segments so the "stats skew" will remain.

But displaying the scores isn't a good reason to worry about this. Frankly, that's almost always a mistake. Scores are meaningless outside of ranking the docs _in a single query_. Because a doc in one query got a score of 10 but some other doc in some other query scored 5 doesn't say anything at all about whether one was "twice as good" as another. Even within the same query, the same two scores don't mean one doc is "twice as good".

I think this is a waste of effort frankly. At best, I've seen UIs where they display, say, 1 to 5 stars that are just showing the percentile that the particular doc had _relative to the max score of that query_, unrelated to any other query.

If you insist (and again I think it's a mistake) you can optimize periodically, but if you're using anything earlier than Solr 7.5 that has its own traps and I do NOT recommend it unless you can do it every time you change your index.

See:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
and
https://lucidworks.com/2018/06/20/solr-and-optimizing-your-index-take-ii/

On Tue, Jan 8, 2019 at 7:28 AM Ashish Bisht wrote:
> [...]
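The star display described above, where each document is rated relative to the top score of its own query, might look like the following sketch. The function name, the five-level default, and the rounding rule are assumptions made for illustration. Coarse buckets like these also tend to hide the small score differences between replicas.

```python
import math


def stars(score, max_score, levels=5):
    """Map a score to 1..levels stars relative to the top score of the same query."""
    if max_score <= 0:
        return 1
    # Clamp to [0, 1] so an out-of-range score cannot produce an invalid rating.
    fraction = min(max(score / max_score, 0.0), 1.0)
    return max(1, math.ceil(fraction * levels))


if __name__ == "__main__":
    # The same document scored slightly differently by two replicas
    # still lands in the same bucket.
    print(stars(9.61, 10.0), stars(9.58, 10.0))
```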
Re: Solr relevancy score different on replicated nodes
Thank you Erick for explaining.

In my scenario, I stopped indexing and updates too and waited for 1 day. I restarted Solr too. Shouldn't both replica and leader come to same state after this much long period? As you said, this gets corrected by segment merging; I hope that is an internal process and no manual activity is required.

For us the score matters, as we are using it to display some scenarios on search, and it gave changing values. As of now we depend on a single shard-replica, but in future we might need more replicas. Will planning indexing and updates outside peak query hours help?

I have tried the exact stats cache while debugging score differences during sharding. It didn't help much. Anyhow, that's a different topic.

Thanks again,

Regards
Ashish Bisht
Re: Solr relevancy score different on replicated nodes
You misunderstand my point. The wall clock times _will_ be different on leader and follower. It follows that the documents contained in the individual segments on the leader and follower will _not_ be identical. This leads to _deleted_ documents being in different segments on the leader and follower. Which also means that the merge decisions will eventually merge different segments.

Now remember that over time when you update a doc, the doc is "marked as deleted", but some of the stats, e.g. term frequency, _still_ include the data for the deleted docs and will until the segment is merged. So the term frequency for some term on the leader will be slightly different than on the follower, and thus the scoring will differ depending on which replica gets the query. Etc.

The fact that you deleted and re-added the follower supports the above. And your scores will skew as you continue to update documents over time.

Generally this isn't something that people concern themselves with, but if it's important to you, you can try enabling exactstatscache, which may help; see:
https://lucene.apache.org/solr/guide/6_6/distributed-requests.html

Best,
Erick

On Sun, Jan 6, 2019 at 10:25 PM Ashish Bisht wrote:
> [...]
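The effect of deleted-but-unmerged documents on term statistics can be illustrated with a toy model. Everything in it is an assumption for illustration: the document counts are invented, the BM25-style IDF formula stands in for whatever similarity the collection actually uses, and real Lucene statistics are gathered per segment rather than from two counters. The sketch only shows that two replicas holding the same live documents can still compute different IDF values when their leftover deleted documents differ.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25-style IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def replica_idf(live_docs, live_with_term, deleted, deleted_with_term):
    """IDF as a replica would see it while deleted docs still count in the stats."""
    return bm25_idf(live_with_term + deleted_with_term, live_docs + deleted)


if __name__ == "__main__":
    # Same live documents on both replicas; only the unmerged deletes differ.
    leader = replica_idf(30_000, 1_200, deleted=500, deleted_with_term=40)
    follower = replica_idf(30_000, 1_200, deleted=2_000, deleted_with_term=310)
    print("leader  :", round(leader, 4))
    print("follower:", round(follower, 4))
```

With no deleted documents on either side, both calls reduce to the same value, which matches the observation in this thread that a freshly re-added replica or a fully optimized index shows matching scores until updates resume.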
Re: Solr relevancy score different on replicated nodes
Hi Erick,

Thank you for the details, but it doesn't look like a time difference in autocommit caused this issue. As I said, if I run a retrieve-all query or a keyword query on both servers, they return the correct number of docs; it's just that the relevancy score takes different values.

I waited for a brief period, and the discrepancy was still there (with no indexing going on). So I went ahead and deleted the follower node (thinking the leader replica should be in the correct state). After adding the new replica again, the issue no longer appears.

We will monitor whether it appears again in future.

Regards
Ashish
Re: Solr relevancy score different on replicated nodes
Ashish:

Deleting and re-adding a replica is not a solution. Even if you did, the replicas would then be identical only until you started indexing again; then the stats could skew a bit.

When you index to NRT replicas, the wall-clock times that cause the commits to trigger will be different due to network delays. What happens essentially is that the doc gets indexed to the leader at time X but hits the replica Y milliseconds later. So on the leader, the autocommit interval expires at time X+Z (Z being your autocommit interval) but at X+Y+Z on the follower. However, some additional docs may have already been indexed on the leader but not yet on the follower when the autocommit trigger happens, so the newly-closed segment on the leader can have docs that the newly-closed segment on the follower does not have.

The point is that the termfreq does _not_ change when a document is deleted in some segment (and remember that an update is really a delete followed by an add). The data associated with deleted docs is not purged until segments are merged. Further, the decision about which segments to merge is influenced by how many documents are deleted in each.

All of which means that the tf/idf statistics are different (slightly), and you either have to use distributed IDF or just live with it.

You're saying that the document count of live documents is different, and that's more concerning. Is this true for brief intervals, or is it true when there is _no_ indexing going on _and_ your autocommit interval is allowed to expire? In that case it's a different problem. However, if the condition is transitory and goes away if you stop indexing, then it's the same issue I outlined above: autocommit is happening at different wall-clock times.

Best,
Erick

On Fri, Jan 4, 2019 at 11:12 AM Ashish Bisht wrote:
>
> Hi Erick,
>
> To update: I am not facing this problem in a new collection.
>
> As per 3), I can try deleting a replica and adding it again, but the
> confusion is which of the two I should delete (I wonder which replica
> is giving the correct score for the query).
>
> Both replicas give the same number of docs for a retrieve-all query. It's
> strange that docCount and docFreq differ in the query explain output.
>
> Regards
> Ashish
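Erick's explanation above (deleted documents keep contributing to term statistics until their segment is merged) can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Lucene's data structures: a segment is just a list of documents, `doc_freq` counts every document containing the term whether deleted or not, and `merge` rewrites all segments into one while dropping deleted documents. The names `Doc`, `doc_freq`, `live_count`, and `merge` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Doc:
    doc_id: str
    has_term: bool
    deleted: bool = False


Segment = List[Doc]


def doc_freq(segments: List[Segment]) -> int:
    """Docs containing the term; deleted docs still count until a merge."""
    return sum(1 for seg in segments for d in seg if d.has_term)


def live_count(segments: List[Segment]) -> int:
    """Docs a query can actually return (deleted docs excluded)."""
    return sum(1 for seg in segments for d in seg if not d.deleted)


def merge(segments: List[Segment]) -> List[Segment]:
    """Rewrite all segments into one, purging deleted docs."""
    return [[d for seg in segments for d in seg if not d.deleted]]


# Doc "a" was updated once: the old copy is marked deleted, a new copy was added.
old_a = Doc("a", has_term=True, deleted=True)
new_a = Doc("a", has_term=True)
b = Doc("b", has_term=True)
c = Doc("c", has_term=False)

leader = [[old_a, b], [new_a, c]]           # segments not merged yet
follower = merge([[old_a, b], [new_a, c]])  # same docs, but already merged

print("live docs:", live_count(leader), live_count(follower))
print("docFreq:  ", doc_freq(leader), doc_freq(follower))
```

Both replicas report the same number of live documents while their docFreq differs, which is the pattern described in this thread: identical numFound, different docCount and docFreq in the explain output.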
Re: Solr relevancy score different on replicated nodes
Hi Erick,

To update: I am not facing this problem in a new collection.

As per 3), I can try deleting a replica and adding it again, but the confusion is which of the two I should delete (I wonder which replica is giving the correct score for the query).

Both replicas give the same number of docs for a retrieve-all query. It's strange that docCount and docFreq differ in the query explain output.

Regards
Ashish
Re: Solr relevancy score different on replicated nodes
Replicated segments might have different deleted documents by design. Precise numbers can be achieved via exact stats; see
https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_

On Fri, Jan 4, 2019 at 2:40 PM AshB wrote:

> Version: Solr 7.4.0, ZooKeeper 3.4.11
> Architecture: two boxes, Machine-1 and Machine-2, each holding a single
> instance of Solr
>
> We have a collection which was single shard and single replica, i.e. s=1
> and rf=1.
>
> A few days back we tried to add a replica to it, but the score for the same
> query comes back different from different replicas.
>
> http://Machine-1:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)&rows=10&fl=score&defType=edismax&qf=search_field+content&wt=json
>
> "response":{"numFound":5836,"start":0,"maxScore":*4.418847*,"docs":[
>
> whereas on the other machine (replica):
>
> http://Machine-2:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)&rows=10&fl=score&defType=edismax&qf=search_field+content&wt=json
>
> "response":{"numFound":5836,"start":0,"maxScore":*4.4952264*,"docs":[
>
> The maxScore is different.
>
> Relevancy is affected by sharding, but we did not expect replication to
> affect it, as the same documents get copied to the other node. The score
> explanation shows docCount and docFreq differing.
>
> idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
> 1.050635000
> docCount: *10020.0*
> docFreq: *3504.000*
>
> idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
> 1.068795100
> docCount: *10291.0*
> docFreq: *3534.000*
>
> Is this expected? What could be wrong here? Please suggest.

--
Sincerely yours
Mikhail Khludnev
Re: Solr relevancy score different on replicated nodes
See particularly point 3 here, and to a lesser extent point 2:
https://support.lucidworks.com/s/question/0D5803LRpijCAD/the-number-of-results-returned-is-not-constant-every-time-i-query-solr

For point 2 (the internal Lucene doc IDs are different), you can easily correct it by adding sort=score desc, solrId asc to the query.

That article was written before TLOG and PULL replicas came into the picture. Since those replica types all have the exact same index structure, you shouldn't have this problem in that case.

Best,
Erick

On Fri, Jan 4, 2019 at 3:40 AM AshB wrote:
>
> Version: Solr 7.4.0, ZooKeeper 3.4.11
> Architecture: two boxes, Machine-1 and Machine-2, each holding a single
> instance of Solr
>
> We have a collection which was single shard and single replica, i.e. s=1
> and rf=1.
>
> A few days back we tried to add a replica to it, but the score for the same
> query comes back different from different replicas.
>
> http://Machine-1:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)&rows=10&fl=score&defType=edismax&qf=search_field+content&wt=json
>
> "response":{"numFound":5836,"start":0,"maxScore":*4.418847*,"docs":[
>
> whereas on the other machine (replica):
>
> http://Machine-2:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)&rows=10&fl=score&defType=edismax&qf=search_field+content&wt=json
>
> "response":{"numFound":5836,"start":0,"maxScore":*4.4952264*,"docs":[
>
> The maxScore is different.
>
> Relevancy is affected by sharding, but we did not expect replication to
> affect it, as the same documents get copied to the other node. The score
> explanation shows docCount and docFreq differing.
>
> idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
> 1.050635000
> docCount: *10020.0*
> docFreq: *3504.000*
>
> idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
> 1.068795100
> docCount: *10291.0*
> docFreq: *3534.000*
>
> Is this expected? What could be wrong here? Please suggest.
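The tie-break sort suggested in Erick's reply above can be added to the query from the original post roughly as follows. This is a minimal sketch: the host, collection, and query parameters are taken from the URLs quoted in this thread, `solrId` is the field name used in the reply and is assumed here to stand for the collection's unique key field (substitute your own), and `build_select_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_select_url(host: str, collection: str, unique_key: str = "solrId") -> str:
    """Build the select URL from the thread with an added tie-break sort."""
    params = {
        "q": '"data" OR (data)',
        "rows": 10,
        "fl": "score",
        "defType": "edismax",
        "qf": "search_field content",
        # Order by score first; break ties on the unique key so that documents
        # with equal scores come back in the same order from every replica.
        "sort": f"score desc, {unique_key} asc",
        "wt": "json",
    }
    return f"http://{host}:8983/solr/{collection}/select?{urlencode(params)}"


url = build_select_url("Machine-1", "MyTestCollection")
print(url)
```

The tie-break only stabilizes the order of documents whose scores are equal. It does not make the scores themselves match across replicas.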
Solr relevancy score different on replicated nodes
Version: Solr 7.4.0, ZooKeeper 3.4.11
Architecture: two boxes, Machine-1 and Machine-2, each holding a single instance of Solr

We have a collection which was single shard and single replica, i.e. s=1 and rf=1.

A few days back we tried to add a replica to it, but the score for the same query comes back different from different replicas.

http://Machine-1:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)&rows=10&fl=score&defType=edismax&qf=search_field+content&wt=json

"response":{"numFound":5836,"start":0,"maxScore":*4.418847*,"docs":[

whereas on the other machine (replica):

http://Machine-2:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)&rows=10&fl=score&defType=edismax&qf=search_field+content&wt=json

"response":{"numFound":5836,"start":0,"maxScore":*4.4952264*,"docs":[

The maxScore is different.

Relevancy is affected by sharding, but we did not expect replication to affect it, as the same documents get copied to the other node. The score explanation shows docCount and docFreq differing.

idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
1.050635000
docCount: *10020.0*
docFreq: *3504.000*

idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
1.068795100
docCount: *10291.0*
docFreq: *3534.000*

Is this expected? What could be wrong here? Please suggest.
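As a quick check of the two explain outputs above, the idf values can be recomputed from the reported docCount and docFreq using the formula printed in the explain text. This is a small sketch: the function name `explain_idf` and the labels `first` and `second` are illustrative, since the post does not say which machine produced which explain output.

```python
import math


def explain_idf(doc_count: float, doc_freq: float) -> float:
    """log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), as in the explain output."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# (docCount, docFreq) pairs copied from the two explain outputs in the post.
explain_stats = {
    "first": (10020.0, 3504.0),
    "second": (10291.0, 3534.0),
}

idf_by_explain = {name: explain_idf(*stats) for name, stats in explain_stats.items()}

for name, value in idf_by_explain.items():
    print(f"{name}: idf = {value:.6f}")
```

The recomputed values agree with the reported 1.050635 and 1.0687951 to within rounding, so the score difference between the two replicas traces back to their differing docCount and docFreq rather than to a different formula.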