Re: optimize boosting parameters
We monitor the response time (Pingdom) of the page that uses these boosting parameters. Since the addition of these boosting parameters and an additional field to search on (which I will start a separate thread about on the mailing list), the page's average response time has increased by 1-2 seconds. Management has given feedback on this.

> If it does turn out to be the boosting (and IIRC the map function can be expensive), can you pre-compute some number of the boosts? Your requirements look like they can be computed at index time, then boost by just the value of the pre-computed field.

I have gone through the list of functions and the map function is the only one that can meet the requirements. Or is there a less expensive function that I missed?

By "pre-compute some number", do you mean that before indexing, at the preparation stage, I check the value of P_SupplierResponseRate and, if the value = 3, specify 'boost="0.4"' for the field of the document?

> BTW, boosts < 1.0 _reduce_ the score. I mention that just in case that’s a surprise ;)

Oh, it is to reduce the score?! Not increase (multiply or add) the score by less than 1?

> You use termfreq, which changes of course, but 1> if your corpus is updated often enough, the termfreqs will be relatively stable. in that case you can pre-compute them too.

We do incremental indexing every half an hour on this collection, averaging 50K-100K documents per indexing run. The collection has 7+ million documents, so the entire corpus does not get updated in every indexing.

> 2> your problem statement has nothing to do with termfreq so why are you using it in the first place?

I read up on the termfreq function again. It returns the number of times the term appears in the field for that document. It does not really fit the requirements. Thank you for pointing it out. Should I use map instead?

Derek

On 8/12/2020 9:48 pm, Erick Erickson wrote:

Before worrying about it too much, exactly _how_ much has the performance changed?
I’ve just been in too many situations where there’s no objective measure of performance before and after, just someone saying “it seems slower”, and had those performance changes disappear when a rigorous test is done. Then spent a lot of time figuring out that the person reporting the problem hadn’t had coffee yet. Or the network was slow. Or….

If it does turn out to be the boosting (and IIRC the map function can be expensive), can you pre-compute some number of the boosts? Your requirements look like they can be computed at index time, then boost by just the value of the pre-computed field.

BTW, boosts < 1.0 _reduce_ the score. I mention that just in case that’s a surprise ;) Of course that means that to change the boosting you need to re-index.

You use termfreq, which changes of course, but
1> if your corpus is updated often enough, the termfreqs will be relatively stable. In that case you can pre-compute them too.
2> your problem statement has nothing to do with termfreq, so why are you using it in the first place?

Best,
Erick

On Dec 8, 2020, at 12:46 AM, Radu Gheorghe wrote:

Hi Derek,

Ah, then my reply was completely off :)

I don’t really see a better way. Maybe other than changing termfreq to field, if the numeric field has docValues? That may be faster, but I don’t know for sure.

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

On 8 Dec 2020, at 06:17, Derek Poh wrote:

Hi Radu

Apologies for not making myself clear.

I would like to know if there is a simpler or more efficient way to craft the boosting parameters based on the requirements.

For example, I am using the 'if', 'map' and 'termfreq' functions in the bf parameters. Is there a more efficient or simpler function that can be used instead? Or a more efficient way to craft the 'formula'?

On 7/12/2020 10:05 pm, Radu Gheorghe wrote:

Hi Derek,

It’s hard to tell whether your boosts can be made better without knowing your data and what users expect of it. Which is a problem in itself.

I would suggest gathering judgements, like: if a user queries for X, what doc IDs do you expect to get back?

Once you have enough of these judgements, you can experiment with boosts and see how the query results change. There are measures such as nDCG ( https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG ) that can help you measure that per query, and you can average this score across all your judgements to get an overall measure of how well you’re doing.

Or even better, you can have something like Quaerite play with boost values for you:
https://github.com/tballison/quaerite/blob/main/quaerite-examples/README.md#genetic-algorithms-ga-runga

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

On 7 Dec 2020, at 10:51, Derek Poh wrote:
Re: Need help to configure automated deletion of shard in solr
Hi Erick,

COLSTATUS does not work with an implicit-router collection. Is there any way to get the replica details?

Regards

On Mon, Nov 30, 2020 at 8:48 PM Erick Erickson wrote:
> Are you using the implicit router? Otherwise you cannot delete a shard.
> And you won’t have any shards that have zero documents anyway.
>
> It’d be a little convoluted, but you could use the collections COLSTATUS API to
> find the names of all your replicas. Then query _one_ replica of each
> shard with something like
> solr/collection1_shard1_replica_n1/q=*:*&distrib=false
>
> that’ll return the number of live docs (i.e. non-deleted docs) and if it’s zero
> you can delete the shard.
>
> But the implicit router requires you take complete control of where documents
> go, i.e. which shard they land on.
>
> This really sounds like an XY problem. What’s the use case you’re trying
> to support where you expect a shard’s number of live docs to drop to zero?
>
> Best,
> Erick
>
> > On Nov 30, 2020, at 4:57 AM, Pushkar Mishra wrote:
> >
> > Hi Solr team,
> >
> > I am using Solr Cloud (version 8.5.x). I need to find a
> > configuration where I can delete a shard when the number of documents
> > in the shard reaches zero. Can someone help me out to achieve that?
> >
> > It is urgent, so a quick response will be highly appreciated.
> >
> > Thanks
> > Pushkar
> >
> > --
> > Pushkar Kumar Mishra
> > "Reactions are always instinctive whereas responses are always well thought
> > of... So start responding rather than reacting in life"

--
Pushkar Kumar Mishra
"Reactions are always instinctive whereas responses are always well thought of... So start responding rather than reacting in life"
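Erick's check above (query one replica per shard with distrib=false, delete the shard only when it reports zero live docs) can be sketched as follows. The JSON shape is the standard Solr select-handler response; the replica/core name in the comment is made up for illustration:

```python
import json

def shard_is_empty(select_response: dict) -> bool:
    """True if a replica's select response reports zero live (non-deleted) docs.

    Expects the standard Solr select-handler JSON, i.e. the result of
    GET /solr/<replica_core>/select?q=*:*&rows=0&distrib=false
    """
    return select_response["response"]["numFound"] == 0

# Simulated response from one replica of an empty shard
# (e.g. collection1_shard1_replica_n1 -- the name is invented here).
raw = '{"responseHeader": {"status": 0}, "response": {"numFound": 0, "start": 0, "docs": []}}'
if shard_is_empty(json.loads(raw)):
    print("shard1 is empty -> safe to call DELETESHARD")
```

This only decides whether a shard is deletable; issuing the DELETESHARD call itself would follow the curl examples elsewhere in this digest.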
solrcloud with EKS kubernetes
Hello guys,

We are facing some issues (timeouts, etc.) which are very inconsistent. By any chance could these be related to EKS? We are using Solr 7.7 and ZooKeeper 3.4.13. Should we move to ECS?

Regards,
Abhishek
Re: Can I express this nested query in JSON DSL?
Hi, Mikhail. Shouldn't be a big deal:

"bool": {
  "must": [
    "x",
    { "bool": { "should": ["y", "z"] } }
  ]
}

On Tue, Dec 8, 2020 at 6:13 AM Mikhail Edoshin wrote:
> Hi,
>
> I'm more or less new to Solr. I need to run queries that use joins all
> over the place. (The idea is to index database records pretty much as
> they are and then query them in interesting ways and, most importantly,
> get the rank. Our dataset is not too large so the performance is great.)
>
> I managed to express the logic using the following approach. For
> example, I want to search people by their names or addresses:
>
>    q=type:Person^=0 AND ({!edismax qf= v=$p0} OR {!join v=$p1})
>    p1={!edismax qf= v=p0}
>    p0=
>
> (Here 'type:Person' works as a filter so I zero its score.) This seems
> to work as expected and give the right results and ranking. It also
> seems to scale nicely for two levels of joins, although the queries
> become rather hard to follow in their raw form (I used a custom
> XML-to-query transformer to actually formulate more complex queries).
>
> So my question is: can I express an equivalent query using the
> query DSL? I know I can use 'bool' like that:
>
> {
>    "query": {
>      "bool" : {
>        "must" : [ ... ],
>        "should" : [ ... ]
>      }
>    }
> }
>
> But how do I actually go from 'x AND (y OR z)' to 'bool' in the query
> DSL? I seem to lose the nice compositional properties of the expression.
> Here, for example, the expression implies that at least 'y' or 'z' must
> match; I don't quite see how I can express this in the DSL.
>
> Kind regards,
> Mikhail

--
Sincerely yours
Mikhail Khludnev
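Spelled out as a complete JSON request body (with "x", "y" and "z" standing in for whatever real query strings are used), the suggestion would look something like the following; nothing here goes beyond the stock JSON Query DSL 'bool' parser:

```json
{
  "query": {
    "bool": {
      "must": [
        "x",
        { "bool": { "should": ["y", "z"] } }
      ]
    }
  }
}
```

In a 'bool' with no 'must' clauses, at least one 'should' clause has to match (standard Lucene BooleanQuery behavior), which is what reproduces the 'y OR z' part inside the nested bool.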
Boost a dynamic field
Hello, I'm trying to boost a document score based on the existence of a dynamic field. I can't seem to get the syntax right and get either Solr server errors or it just doesn't do anything to the Solr response. In solrconfig.xml the dynamic fields are defined as... stored="true" multiValued="true"/> The field I want to check for is called DYNAMIC_rank. If it exists I want to boost the score so the document shows up first. Hoping someone can help! Thanks, Kelv
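For reference, one way an "exists" boost like this is commonly written in Solr (a sketch, not tested against this schema; DYNAMIC_rank is the field from the question and the boost values are arbitrary) is either an additive bq on an open-ended range query, or a multiplicative boost using the exists() function with edismax:

```
# additive: documents with any value in DYNAMIC_rank get an extra scoring clause
bq=DYNAMIC_rank:[* TO *]^10

# multiplicative (edismax): score * 10 when the field exists, unchanged otherwise
defType=edismax&boost=if(exists(DYNAMIC_rank),10,1)
```

The multiplicative form tends to dominate the base relevancy score more predictably than an additive bq, which may matter if the goal is for such documents to always show up first.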
Re: No numShards attribute exists in 'core.properties' with the newly added replica
I raised this JIRA: https://issues.apache.org/jira/browse/SOLR-15035 What’s not clear to me is whether numShards should even be in core.properties at all, even on the create command. In the state.json file it’s a collection-level property and not reflected in the individual replica’s information. However, we should be consistent. Best, Erick > On Dec 8, 2020, at 4:34 AM, Dawn wrote: > > Hi > > Solr8.7.0 > > No numShards attribute exists in 'core.properties' with the newly added > replica. Causes numShards to be null using CloudDescriptor. > > Since the ADDREPLICA command does not get numShards property, the > coreProps will not save numShards in the constructor that creates the > CoreDescriptor, so that the 'core.properties' file will be generated without > numShards. > > Can the numShards attribute function be added to the process of adding > replica so that the 'core-properties' file of replica can contain numShards > attribute?
Can I express this nested query in JSON DSL?
Hi,

I'm more or less new to Solr. I need to run queries that use joins all over the place. (The idea is to index database records pretty much as they are and then query them in interesting ways and, most importantly, get the rank. Our dataset is not too large so the performance is great.)

I managed to express the logic using the following approach. For example, I want to search people by their names or addresses:

   q=type:Person^=0 AND ({!edismax qf= v=$p0} OR {!join v=$p1})
   p1={!edismax qf= v=p0}
   p0=

(Here 'type:Person' works as a filter so I zero its score.) This seems to work as expected and give the right results and ranking. It also seems to scale nicely for two levels of joins, although the queries become rather hard to follow in their raw form (I used a custom XML-to-query transformer to actually formulate more complex queries).

So my question is: can I express an equivalent query using the query DSL? I know I can use 'bool' like that:

{
   "query": {
     "bool" : {
       "must" : [ ... ],
       "should" : [ ... ]
     }
   }
}

But how do I actually go from 'x AND (y OR z)' to 'bool' in the query DSL? I seem to lose the nice compositional properties of the expression. Here, for example, the expression implies that at least 'y' or 'z' must match; I don't quite see how I can express this in the DSL.

Kind regards,
Mikhail
Re: optimize boosting parameters
Before worrying about it too much, exactly _how_ much has the performance changed? I’ve just been in too many situations where there’s no objective measure of performance before and after, just someone saying “it seems slower” and had those performance changes disappear when a rigorous test is done. Then spent a lot of time figuring out that the person reporting the problem hadn’t had coffee yet. Or the network was slow. Or…. If it does turn out to be the boosting (and IIRC the map function can be expensive), can you pre-compute some number of the boosts? Your requirements look like they can be computed at index time, then boost by just the value of the pre-computed field. BTW, boosts < 1.0 _reduce_ the score. I mention that just in case that’s a surprise ;) Of course that means that to change the boosting you need to re-index. You use termfreq, which changes of course, but 1> if your corpus is updated often enough, the termfreqs will be relatively stable. in that case you can pre-compute them too. 2> your problem statement has nothing to do with termfreq so why are you using it in the first place? Best, Erick > On Dec 8, 2020, at 12:46 AM, Radu Gheorghe wrote: > > Hi Derek, > > Ah, then my reply was completely off :) > > I don’t really see a better way. Maybe other than changing termfreq to field, > if the numeric field has docValues? That may be faster, but I don’t know for > sure. > > Best regards, > Radu > -- > Sematext Cloud - Full Stack Observability - https://sematext.com > Solr and Elasticsearch Consulting, Training and Production Support > >> On 8 Dec 2020, at 06:17, Derek Poh wrote: >> >> Hi Radu >> >> Apologies for not making myself clear. >> >> I would like to know if there is a more simple or efficient way to craft the >> boosting parameters based on the requirements. >> >> For example, I am using 'if', 'map' and 'termfreq' functions in the bf >> parameters. >> >> Is there a more efficient or simple function that can be use instead? 
Or
>> craft the 'formula' in a more efficient way?
>>
>> On 7/12/2020 10:05 pm, Radu Gheorghe wrote:
>>> Hi Derek,
>>>
>>> It’s hard to tell whether your boosts can be made better without knowing
>>> your data and what users expect of it. Which is a problem in itself.
>>>
>>> I would suggest gathering judgements, like if a user queries for X, what
>>> doc IDs do you expect to get back?
>>>
>>> Once you have enough of these judgements, you can experiment with boosts
>>> and see how the query results change. There are measures such as nDCG (
>>> https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG
>>> ) that can help you measure that per query, and you can average this score
>>> across all your judgements to get an overall measure of how well you’re doing.
>>>
>>> Or even better, you can have something like Quaerite play with boost values
>>> for you:
>>> https://github.com/tballison/quaerite/blob/main/quaerite-examples/README.md#genetic-algorithms-ga-runga
>>>
>>> Best regards,
>>> Radu
>>> --
>>> Sematext Cloud - Full Stack Observability - https://sematext.com
>>> Solr and Elasticsearch Consulting, Training and Production Support
>>>
>>> On 7 Dec 2020, at 10:51, Derek Poh wrote:

Hi

I have added the following boosting requirements to the search query of a page. Feedback from the monitoring team is that the overall response time of the page has increased since then. I am trying to find out if the added boosting parameters (below) could have contributed to the increase. The boosting is working as per the requirements.

May I know if the implemented boosting parameters can be enhanced or optimized further? Hopefully to improve the response time of the query and the page.

Requirements:
1. If P_SupplierResponseRate is:
   a. 3, boost by 0.4
   b. 2, boost by 0.2
2. If P_SupplierResponseTime is:
   a. 4, boost by 0.4
   b. 3, boost by 0.2
3. If P_MWSScore is:
   a. between 80-100, boost by 1.6
   b. between 60-79, boost by 0.8
4. If P_SupplierRanking is:
   a. 3, boost by 0.3
   b. 4, boost by 0.6
   c. 5, boost by 0.9
   d. 6, boost by 1.2

Boosting parameters implemented:
bf=map(P_SupplierResponseRate,3,3,0.4,0)
bf=map(P_SupplierResponseRate,2,2,0.2,0)
bf=map(P_SupplierResponseTime,4,4,0.4,0)
bf=map(P_SupplierResponseTime,3,3,0.2,0)
bf=map(P_MWSScore,80,100,1.6,0)
bf=map(P_MWSScore,60,79,0.8,0)
bf=if(termfreq(P_SupplierRanking,3),0.3,if(termfreq(P_SupplierRanking,4),0.6,if(termfreq(P_SupplierRanking,5),0.9,if(termfreq(P_SupplierRanking,6),1.2,0))))

I am using Solr 7.7.2
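Following Erick's suggestion elsewhere in this thread, these boosts could be precomputed at index time into a single numeric field, so the query side shrinks to one parameter such as bf=field(P_PrecomputedBoost). The field name P_PrecomputedBoost is invented; the mappings below mirror the bf/map parameters in the message, which add their boosts together. A rough sketch of the index-time side:

```python
# Sketch of precomputing the additive boosts above before indexing.
# P_PrecomputedBoost is a hypothetical field name; the value ranges and
# boost amounts are copied from the requirements in this message.

def precomputed_boost(doc: dict) -> float:
    """Additive boost for one document, computed at preparation time."""
    boost = 0.0
    boost += {3: 0.4, 2: 0.2}.get(doc.get("P_SupplierResponseRate"), 0.0)
    boost += {4: 0.4, 3: 0.2}.get(doc.get("P_SupplierResponseTime"), 0.0)
    score = doc.get("P_MWSScore")
    if score is not None and 80 <= score <= 100:
        boost += 1.6
    elif score is not None and 60 <= score <= 79:
        boost += 0.8
    boost += {3: 0.3, 4: 0.6, 5: 0.9, 6: 1.2}.get(doc.get("P_SupplierRanking"), 0.0)
    return boost

doc = {"P_SupplierResponseRate": 3, "P_MWSScore": 85, "P_SupplierRanking": 5}
doc["P_PrecomputedBoost"] = precomputed_boost(doc)
print(round(doc["P_PrecomputedBoost"], 2))  # 2.9 (0.4 + 1.6 + 0.9)
```

The trade-off, as Erick notes, is that changing the boost rules then requires a re-index.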
Re: Is there a way to search for "..." (three dots)?
Yes, but…

Odds are your analysis configuration for the field is removing the dots. Go to the admin/analysis page, pick your field type and put examples in the “index” and “query” boxes and you’ll see what I mean.

You need something like WhitespaceTokenizer as your tokenizer, and avoid things like WordDelimiter(Graph)FilterFactory.

You’ll find this is tricky though. For instance, if you index “…something is here”, WhitespaceTokenizer will split this into “…something”, “is”, “here” and you won’t be able to search for “something” since the _token_ is “…something”. You could use one of the other tokenizers or use one of the regular expression tokenizers.

Best,
Erick

> On Dec 8, 2020, at 5:56 AM, nettadalet wrote:
>
> Hi,
> I need to be able to search for "..." (three dots), meaning the query should
> be "..." and the search should return results that have "..." in their names.
> Is there a way to do it?
> Thanks in advance.
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
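A minimal field type along the lines Erick describes might look like the following (the type name is made up; the point is whitespace-only tokenization with no word-delimiter filtering, so a punctuation-only token like "..." survives analysis):

```xml
<!-- Sketch: tokenize on whitespace only, so punctuation-only tokens such
     as "..." are kept. Deliberately no WordDelimiterGraphFilterFactory. -->
<fieldType name="text_ws_verbatim" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

As Erick notes, this also keeps punctuation attached to adjacent words ("…something" stays one token), so a pattern-based tokenizer may be a better fit depending on the data.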
Re: Commits (with openSearcher = true) are too slow in solr 8
matthew sporleder wrote
> I would stick to soft commits and schedule hard-commits as
> spaced-out-as-possible in regular maintenance windows until you can
> find the culprit of the timeout.
>
> This way you will have very focused windows for intense monitoring
> during the hard-commit runs.

*Little correction:* In my last post, I had mentioned that softCommit is working fine and there is no delay or error message. Here is what is happening:

1. Hard commit with openSearcher=true
curl "http://:solr_port/solr/my_collection/update?openSearcher=true&commit=true&wt=json"
All the cores started processing the commit except the one hosted on ``. Also we are getting a timeout error on this.

2. softCommit
curl "http://:solr_port/solr/my_collection/update?softCommit=true&wt=json"
Same as 1.

3. Hard commit with openSearcher=false
curl "http://:solr_port/solr/my_collection/update?openSearcher=false&commit=true&wt=json"
All the cores started processing the commit immediately and there is no error.

Solr commands used to set up the system

Solr start command
#/var/solr-8.5.2/bin/solr start -c -p solr_port -z zk_host1:zk_port,zk_host1:zk_port,zk_host1:zk_port -s /var/node_my_collection_1/solr-8.5.2/server/solr -h -m 26g -DzkClientTimeout=3 -force

Create collection
1. Upload config to ZooKeeper
#var/solr-8.5.2/server/scripts/cloud-scripts/./zkcli.sh -z zk_host1:zk_port,zk_host1:zk_port,zk_host1:zk_port -cmd upconfig -confname my_collection -confdir /
2. Created collection with 3 shards (shard1,shard2,shard3)
#curl "http://:solr_port/solr/admin/collections?action=CREATE&name=my_collection&numShards=3&replicationFactor=1&maxShardsPerNode=1&collection.configName=my_collection&createNodeSet=solr_node1:solr_port,solr_node2:solr_port,solr_node3:solr_port"
3. Used SPLITSHARD command to split each shard into two halves (shard1_1,shard1_0,shard2_0,...)
e.g. #curl "http://:solr_port/solr/admin/collections?action=SPLITSHARD&collection=my_collection&shard=shard1
4. Used DELETESHARD command to delete the old shards (shard1,shard2,shard3)
e.g. #curl "http://:solr_port/solr/admin/collections?action=DELETESHARD&collection=my_collection&shard=shard1
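The pattern Matthew recommends (frequent soft commits, infrequent hard commits that never open a searcher) is usually configured in solrconfig.xml rather than issued as explicit update requests; a sketch with placeholder intervals:

```xml
<!-- Sketch only: the intervals are placeholders, tune for your ingest rate. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flush segments to stable storage, but do not open a searcher -->
  <autoCommit>
    <maxTime>300000</maxTime>            <!-- every 5 minutes -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: make newly indexed documents visible to searches -->
  <autoSoftCommit>
    <maxTime>10000</maxTime>             <!-- every 10 seconds -->
  </autoSoftCommit>
</updateHandler>
```

With this in place, clients never need to send commit=true themselves, which matches the observation above that only the openSearcher=true commits are slow.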
Re: How to get the config set name of Solr core
Hi,

I was able to add the config set to the STATUS response by implementing a custom extended CoreAdminHandler. However, it would be nice if this could be added in Solr itself. I've created a JIRA for this: https://issues.apache.org/jira/browse/SOLR-15034

Kind regards,
Andreas
Getting Reset cancel_stream_error on solr-8.5.2
Hey All,

We have updated our system from Solr 5.4 to Solr 8.5.2 and we are suddenly seeing a lot of the below errors in our logs.

HttpChannelState org.eclipse.jetty.io.EofException: Reset cancel_stream_error

Is this related to some system-level or Solr-level config? How do I find the cause of this? How do I solve this?

*Solr Setup Details:*

Solr version => solr-8.5.2

GC settings:
GC_TUNE=" -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=150 -XX:InitiatingHeapOccupancyPercent=60 -XX:+UseLargePages -XX:+AggressiveOpts "

Solr collection details: (running in SolrCloud mode)
It has 6 shards, and each shard has only one replica (which is also a leader) and the replica type is NRT. Total docs in collection: 77 million; each shard index size: 11 GB; avg size/doc: 1.0 KB

Zookeeper => We are using an external ZooKeeper ensemble (3 node cluster)

System details: CentOS (7.7); disk size: 250 GB; cpu: (8 vcpus, 64 GiB memory)

Solr OPs

Solr start command
#/var/solr-8.5.2/bin/solr start -c -p solr_port -z zk_host1:zk_port,zk_host1:zk_port,zk_host1:zk_port -s /var/node_my_collection_1/solr-8.5.2/server/solr -h -m 26g -DzkClientTimeout=3 -force

Create collection
1. Upload config to ZooKeeper
#var/solr-8.5.2/server/scripts/cloud-scripts/./zkcli.sh -z zk_host1:zk_port,zk_host1:zk_port,zk_host1:zk_port -cmd upconfig -confname my_collection -confdir /
2. Created collection with 3 shards (shard1,shard2,shard3)
#curl "http://:solr_port/solr/admin/collections?action=CREATE&name=my_collection&numShards=3&replicationFactor=1&maxShardsPerNode=1&collection.configName=my_collection&createNodeSet=solr_node1:solr_port,solr_node2:solr_port,solr_node3:solr_port"
3. Used SPLITSHARD command to split each shard into two halves (shard1_1,shard1_0,shard2_0,...)
e.g. #curl "http://:solr_port/solr/admin/collections?action=SPLITSHARD&collection=my_collection&shard=shard1
4. Used DELETESHARD command to delete the old shards (shard1,shard2,shard3)
e.g. #curl "http://:solr_port/solr/admin/collections?action=DELETESHARD&collection=my_collection&shard=shard1
Is there a way to search for "..." (three dots)?
Hi, I need to be able to search for "..." (three dots), meaning the query should be "..." and the search should return results that have "..." in their names. Is there a way to do it? Thanks in advance.
No numShards attribute exists in 'core.properties' with the newly added replica
Hi

Solr 8.7.0

No numShards attribute exists in 'core.properties' for a newly added replica. This causes numShards to be null when using CloudDescriptor.

Since the ADDREPLICA command does not get the numShards property, coreProps will not save numShards in the constructor that creates the CoreDescriptor, so the 'core.properties' file is generated without numShards.

Can numShards be added during the add-replica process so that the replica's 'core.properties' file contains the numShards attribute?