Re: Unfinished Business: Fast Global IDF
I’ve never been in that part of the code, but it feels like it could have a small blast radius. We already have an interface for global IDF, so calculating it differently shouldn’t be huge. It does need a change in the shard response format. It wouldn’t hurt to return DF in the response to regular clients. That would help with distributed search across collections, clusters, or even different kinds of engines. We did that ages ago at Verity with a SOAP interface (yuk). wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Aug 27, 2024, at 8:10 PM, David Smiley wrote: > > Thanks for sharing Walter! I hope someone enterprising tackles it. > It'd be nice to have global IDF by default without having to go enable > something that adds a performance risk. > > I'm sure you have many career stories to tell. If you find yourself > at Acadia National Park hiking & backpacking, as you like to do, shoot > me a message. :-D > > ~ David > > On Tue, Aug 27, 2024 at 3:01 PM Walter Underwood > wrote: >> >> When I’ve enabled global exact IDF in Solr, the speed penalty was about 10X. >> Back in 1995, Infoseek figured out how to do that with no speed penalty. >> They patented it, but that patent expired several years ago. I’ll try and >> hunt it down. >> >> Short version, from each shard return the number of docs and the df for each >> term. When combining results, add all the DF, add all the NUMDOCS, divide, >> and you have the global IDF. This is constant for the whole result list. >> Each shard already needs that info for local score, so it shouldn’t be extra >> work. >> >> When does this matter? When the relevant documents for a term are mostly on >> one shard, either intentionally or accidentally. Let’s say we have a news >> search and all the stories for August 2024 are on one shard. The term >> “kamala” will be much more common on that shard, giving a lower IDF, but…the >> relevant documents are probably on that shard.
So the best documents have a >> lower score using local IDF. >> >> This also shows up with lots of shards or small shards, because there will >> be uneven distribution of docs. When I retired from LexisNexis, we had a >> cluster with 320 shards. I’m sure that had some interesting IDF behavior. >> >> I wrote up how we did this in a Java distributed search layer for Ultraseek: >> https://observer.wunderwood.org/2007/04/04/progressive-reranking/ >> >> There is some earlier discussion here: >> https://solr-user.lucene.apache.narkive.com/zNa1Hn4p/single-call-for-distributed-idf >> >> I don’t think there is a Jira issue for this. >> >> I think that is all the unfinished business since putting Solr 1.3 into >> production at Netflix. Pretty darned good job everybody. Huge thanks to all >> the contributors and committers who have put in years of effort over that >> time. >> >> wunder >> Walter Underwood >> wun...@wunderwood.org >> http://observer.wunderwood.org/ (my blog) >> > > - > To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org > For additional commands, e-mail: dev-h...@solr.apache.org >
Re: Unfinished Business: Fast Global IDF
This is the patent. Last assignee was Google, expired in 2017. https://patents.google.com/patent/US5659732A/en —wunder > On Aug 27, 2024, at 12:01 PM, Walter Underwood wrote: > > When I’ve enabled global exact IDF in Solr, the speed penalty was about 10X. > Back in 1995, Infoseek figured out how to do that with no speed penalty. They > patented it, but that patent expired several years ago. I’ll try and hunt it > down. > > Short version, from each shard return the number of docs and the df for each > term. When combining results, add all the DF, add all the NUMDOCS, divide, > and you have the global IDF. This is constant for the whole result list. Each > shard already needs that info for local score, so it shouldn’t be extra work. > > When does this matter? When the relevant documents for a term are mostly on > one shard, either intentionally or accidentally. Let’s say we have a news > search and all the stories for August 2024 are on one shard. The term > “kamala” will be much more common on that shard, giving a lower IDF, but…the > relevant documents are probably on that shard. So the best documents have a > lower score using local IDF. > > This also shows up with lots of shards or small shards, because there will be > uneven distribution of docs. When I retired from LexisNexis, we had a cluster > with 320 shards. I’m sure that had some interesting IDF behavior. > > I wrote up how we did this in a Java distributed search layer for Ultraseek: > https://observer.wunderwood.org/2007/04/04/progressive-reranking/ > > There is some earlier discussion here: > https://solr-user.lucene.apache.narkive.com/zNa1Hn4p/single-call-for-distributed-idf > > I don’t think there is a Jira issue for this. > > I think that is all the unfinished business since putting Solr 1.3 into > production at Netflix. Pretty darned good job everybody. Huge thanks to all > the contributors and committers who have put in years of effort over that > time. 
> > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ (my blog) >
Unfinished Business: Fast Global IDF
When I’ve enabled global exact IDF in Solr, the speed penalty was about 10X. Back in 1995, Infoseek figured out how to do that with no speed penalty. They patented it, but that patent expired several years ago. I’ll try and hunt it down. Short version, from each shard return the number of docs and the df for each term. When combining results, add all the DF, add all the NUMDOCS, divide, and you have the global IDF. This is constant for the whole result list. Each shard already needs that info for local score, so it shouldn’t be extra work. When does this matter? When the relevant documents for a term are mostly on one shard, either intentionally or accidentally. Let’s say we have a news search and all the stories for August 2024 are on one shard. The term “kamala” will be much more common on that shard, giving a lower IDF, but…the relevant documents are probably on that shard. So the best documents have a lower score using local IDF. This also shows up with lots of shards or small shards, because there will be uneven distribution of docs. When I retired from LexisNexis, we had a cluster with 320 shards. I’m sure that had some interesting IDF behavior. I wrote up how we did this in a Java distributed search layer for Ultraseek: https://observer.wunderwood.org/2007/04/04/progressive-reranking/ There is some earlier discussion here: https://solr-user.lucene.apache.narkive.com/zNa1Hn4p/single-call-for-distributed-idf I don’t think there is a Jira issue for this. I think that is all the unfinished business since putting Solr 1.3 into production at Netflix. Pretty darned good job everybody. Huge thanks to all the contributors and committers who have put in years of effort over that time. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)
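The merge step described above is small enough to sketch. This is an illustrative sketch, not Solr code: each shard reports its local document count and per-term document frequency, the coordinator sums both, then computes one IDF that is constant for the whole result list. The particular smoothed IDF formula is an assumption for illustration (Lucene's BM25 similarity uses a similar form).

```python
import math

def merge_global_idf(shard_stats):
    """shard_stats: list of (num_docs, df) pairs, one per shard.
    Sum doc counts and document frequencies across shards, then
    compute a single IDF for the whole result list."""
    total_docs = sum(n for n, _ in shard_stats)
    total_df = sum(df for _, df in shard_stats)
    # Smoothed IDF, assumed here for illustration.
    return math.log(1 + (total_docs - total_df + 0.5) / (total_df + 0.5))

# Skewed term: nearly all matching docs live on one "hot" shard,
# like "kamala" on the August 2024 shard in the example above.
shards = [(1_000_000, 50_000), (1_000_000, 200)]
global_idf = merge_global_idf(shards)

# The hot shard's purely local IDF understates the term's global rarity,
# so its (probably most relevant) documents would score too low.
local_idf_hot = math.log(1 + (1_000_000 - 50_000 + 0.5) / (50_000 + 0.5))
```

Because each shard already computes num docs and df for local scoring, returning them in the shard response adds essentially no work, which is the point of the Infoseek approach.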
Re: Unfinished Business: Fuzzy in edismax
Oops. https://issues.apache.org/jira/browse/SOLR-629 —wunder > On Aug 27, 2024, at 11:40 AM, Walter Underwood wrote: > > I’m retired and not working on Solr all the time, but there are two things I > didn’t finish that should be picked up. I’m not going to do these, I’ve got > plenty of retirement stuff to do. > > The first is SOLR-629, probably the oldest open feature request and a good > first project for someone. This adds support for fuzzy search to the edismax > query parser. The external impact is tiny, the qf config just says “title~” > instead of “title”. > > The most recent patch is for 4.x. It doesn’t apply 100% to the current code > (more like 50%), but it should be fairly easy to figure out the needed mods. > > This should be a nice project for a first-time contributor, because it is > localized to the edismax parse. That is spread out a bit, but not too bad. > Besides, who gets to work on a three-digit Jira issue? > > Two notes: > > 1. You’ll get the urge to rewrite the whole damned edismax config parser with > a real parser generator. Resist that and just make the change. > 2. It isn’t possible to have a higher boost for an exact match and a lower > boost for a fuzzy match because it only handles one config spec per field > name. And it doesn’t throw an error for the second time, either. It really > should handle “title^4 title~^2”. The workaround is to make a copy of the > title field. Maybe that should be a separate Jira issue? > > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ (my blog) >
Unfinished Business: Fuzzy in edismax
I’m retired and not working on Solr all the time, but there are two things I didn’t finish that should be picked up. I’m not going to do these, I’ve got plenty of retirement stuff to do. The first is SOLR-629, probably the oldest open feature request and a good first project for someone. This adds support for fuzzy search to the edismax query parser. The external impact is tiny, the qf config just says “title~” instead of “title”. The most recent patch is for 4.x. It doesn’t apply 100% to the current code (more like 50%), but it should be fairly easy to figure out the needed mods. This should be a nice project for a first-time contributor, because it is localized to the edismax parse. That is spread out a bit, but not too bad. Besides, who gets to work on a three-digit Jira issue? Two notes: 1. You’ll get the urge to rewrite the whole damned edismax config parser with a real parser generator. Resist that and just make the change. 2. It isn’t possible to have a higher boost for an exact match and a lower boost for a fuzzy match because it only handles one config spec per field name. And it doesn’t throw an error for the second time, either. It really should handle “title^4 title~^2”. The workaround is to make a copy of the title field. Maybe that should be a separate Jira issue? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)
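The qf changes discussed above might look something like the following. This is a hypothetical sketch based on the syntax in the notes (a trailing `~` marking a field as fuzzy), not the final SOLR-629 syntax:

```text
# Fuzzy matching on title, per the patch: trailing ~ in qf
qf=title~ body

# Desired but unsupported: exact match boosted above fuzzy on the same field
qf=title^4 title~^2

# Workaround: copyField title into a second field (title_fuzzy is a
# hypothetical name), then boost each spec separately
qf=title^4 title_fuzzy~^2
```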
Re: ZkStateReader.getUpdateLock / ClusterState immutability
Would per-replica state (PRS) help with that? That slices by replica, not collection, but it should allow finer-grained locking. https://searchscale.com/blog/prs/ wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jul 16, 2024, at 9:03 AM, David Smiley wrote: > > At work, in a scenario when a node starts with thousands of cores for > thousands of collections, we've seen that core registration can > bottleneck on ZkStateReader.forceUpdateCollection(collection) which > synchronizes on getUpdateLock, a global lock (not per-collection). I > don't know the history or strategy behind that lock, but it's a > code-smell to see a global lock that is used in a circumstance that is > scoped to one collection. I suspect it's there because ClusterState > is immutable and encompasses basically all state. If it was instead a > cache that can be snapshotted (for consumers that require an immutable > state to act on), we could probably make getUpdateLock go away. *If* > a collection's state needs to be locked (and I'm suspicious that it > is, so long as cache insertion is done properly / exclusively), we > could have a lock just for the collection. > > Any concerns with this idea? > > ~ David Smiley > Apache Lucene/Solr Search Developer > http://www.linkedin.com/in/davidwsmiley > > - > To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org > For additional commands, e-mail: dev-h...@solr.apache.org >
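David's per-collection locking idea can be sketched outside of Solr. A minimal illustrative sketch (not ZkStateReader code, and the class name is made up): a registry that hands out one lock per collection, so state updates for different collections never contend on a single global lock.

```python
import threading

class CollectionLocks:
    """Hands out one lock per collection name instead of a global lock."""

    def __init__(self):
        # Guards only the dict itself, held very briefly.
        self._registry_lock = threading.Lock()
        self._locks = {}

    def lock_for(self, collection):
        with self._registry_lock:
            return self._locks.setdefault(collection, threading.Lock())

locks = CollectionLocks()
with locks.lock_for("collection1"):
    pass  # update collection1's cached state; collection2 is unaffected
```

The registry lock is only held while fetching or creating a lock object, never while collection state is being updated, which is what removes the bottleneck when thousands of cores register at once.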
Re: Compatibility of Solrj with older versions of Solr
First, this question belongs on the users@solr.apache.org mailing list. Second, I would not use any SolrJ later than 6.6 against Solr 6.6. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On May 15, 2024, at 10:48 AM, Todd Stevenson > wrote: > > I’m trying to upgrade the Java apps I support to current versions of > SpringBoot and Java. I’m wanting to use the later versions of Solrj also. > These apps run against Solr 6.6 (I have no control over upgrading Solr). > What versions of Solrj are compatible with Solr 6.6? I’ve looked extensively > to find this information and can’t see it in the documentation. > > Can you point me to a Solrj user guide? The only documentation I can see > are the javadocs. I need more help than the javadocs. > > Thank you so much. > > Todd Stevenson > Software Engineer – Technical Lead > Intermountain Health, Canyons Region > Cell: 801-589-1115 > Work Schedule: Monday to Thursday
Re: solr query alerting
The functionality is alerts, but that doesn’t mean it has to be a push API. Alerts can be fetched just as easily as pushed. I don’t know the limits of this proposal, but LexisNexis needs alerting as we move all of our 114 billion documents onto Solr. I’m retiring this week, so I won’t be around to implement it, but that is one potential large customer. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On May 1, 2024, at 2:26 PM, Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) > wrote: > >> I kind of like "search-alerts". "query-alerts" sounds like alerting on >> query metrics, but IMO "search-alerts" doesn't come with the same baggage. > > Someone in the PR had mentioned that "alerts" is a bit off because the > proposal does not really manage alerts and it feels too far out of solr's > domain. The current approach, much like percolator, simply exposes a > request/response API that then can be **used** by an alerting system > (request/stream could also be considered if there is worry about > scaling the number of queries one request can match). > >> I think this is certainly something that can start in the sandbox and move> >> into the main repo once it's clear that there is interest from >> multiple committers and community members in using and maintaining it. > > I've seen many homegrown/complex solutions of percolator-type functionality > so even this narrower "inverted search" solution has **some** use but > admittedly this is a niche area. It might not really gain traction unless it > is marketed the right way as there are probably very few solr users that > happen to be thinking about revamping their saved-search platform in any > given year. Given that, what do you think I can do to reach them? :-) > > I am trying my best to talk about this within my firm but the sample is > obviously smaller. 
> > From: dev@solr.apache.org At: 05/01/24 16:16:50 UTC-4:00To: > dev@solr.apache.org > Subject: Re: solr query alerting > > I think I'd prefer a more self-descriptive name than "Luwak", which is just > a product name that was decided a while ago. > > I kind of like "search-alerts". "query-alerts" sounds like alerting on > query metrics, but IMO "search-alerts" doesn't come with the same baggage. > > Luwak is fine though if everyone agrees on that. > > On one hand we have a number of committers here from >> Bloomberg, yet the abandoned and now-removed "analytics" component >> shows that abandonment is a risk nonetheless. >> > > I don't want to bikeshed here, but I'm not sure this is a fair > assessment of what happened with the analytics module. > Sure there wasn't a ton of development, but in general it was feature rich > and had very little feature requests. > It was removed in 10, because a lack of user usage, not because it was > "abandoned" IMO. If there were requests from users > to keep it or improve it, then it would be a much different story. The > whole "thrown over the wall" comment is fair, but > not particularly relevant to this PR, which is being worked on in public. > > I think this is certainly something that can start in the sandbox and move > into the main repo once it's clear that there is interest from > multiple committers and community members in using and maintaining it. > > - Houston > > On Wed, May 1, 2024 at 2:32 PM David Smiley wrote: > >> Luwak is good to me! >> >> On Tue, Apr 30, 2024 at 4:01 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD >> A) wrote: >>> >>> I love the name "luwak"! I was about to suggest the same but was worried >> about the trademark concerns and I assumed there was a reason they changed >> the name when donating it to lucene. 
>>> >>> From: dev@solr.apache.org At: 04/30/24 15:56:22 UTC-4:00To: >> dev@solr.apache.org >>> Subject: Re: solr query alerting >>> >>> Luwak is the original name of the Lucene monitor, contributed by Flax >> back in >>> the days: https://github.com/flaxsearch/luwak >>> >>> Perhaps we could go full circle (if no trademark issues) to call it the >> Solr >>> luwak module? Luwak is a type of coffee, and thus related to percolator >> 😉 >>> >>> Otherwise “stored-queries” is an option. >>> >>> Jan Høydahl >>> >>>> 30. apr. 2024 kl. 19:26 skrev David Smiley : >>>> >>>> I agree the feature is relevant / useful. >>>>
Re: solr query alerting
Do people want to spend the next ten years explaining that the alerting feature is called “Luwak”? I’d call it “Alerting” or “Alerts”. —wunder > On May 1, 2024, at 1:16 PM, Houston Putman wrote: > > I think I'd prefer a more self-descriptive name than "Luwak", which is just > a product name that was decided a while ago. > > I kind of like "search-alerts". "query-alerts" sounds like alerting on > query metrics, but IMO "search-alerts" doesn't come with the same baggage. > > Luwak is fine though if everyone agrees on that. > > On one hand we have a number of committers here from >> Bloomberg, yet the abandoned and now-removed "analytics" component >> shows that abandonment is a risk nonetheless. >> > > I don't want to bikeshed here, but I'm not sure this is a fair > assessment of what happened with the analytics module. > Sure there wasn't a ton of development, but in general it was feature rich > and had very few feature requests. > It was removed in 10 because of a lack of user usage, not because it was > "abandoned" IMO. If there were requests from users > to keep it or improve it, then it would be a much different story. The > whole "thrown over the wall" comment is fair, but > not particularly relevant to this PR, which is being worked on in public. > > I think this is certainly something that can start in the sandbox and move > into the main repo once it's clear that there is interest from > multiple committers and community members in using and maintaining it. > > - Houston > > On Wed, May 1, 2024 at 2:32 PM David Smiley wrote: > >> Luwak is good to me! >> >> On Tue, Apr 30, 2024 at 4:01 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD >> A) wrote: >>> >>> I love the name "luwak"! I was about to suggest the same but was worried >> about the trademark concerns and I assumed there was a reason they changed >> the name when donating it to lucene.
>>> >>> From: dev@solr.apache.org At: 04/30/24 15:56:22 UTC-4:00To: >> dev@solr.apache.org >>> Subject: Re: solr query alerting >>> >>> Luwak is the original name of the Lucene monitor, contributed by Flax >> back in >>> the days: https://github.com/flaxsearch/luwak >>> >>> Perhaps we could go full circle (if no trademark issues) to call it the >> Solr >>> luwak module? Luwak is a type of coffee, and thus related to percolator >> 😉 >>> >>> Otherwise “stored-queries” is an option. >>> >>> Jan Høydahl >>> 30. apr. 2024 kl. 19:26 skrev David Smiley : I agree the feature is relevant / useful. Another angle on the module vs sandbox or wherever else is maintenance cost. If a lot of code is being contributed as is here, then as a PMC member I hope to get a subjective sense that folks are interested in maintaining it. On one hand we have a number of committers here from Bloomberg, yet the abandoned and now-removed "analytics" component shows that abandonment is a risk nonetheless. I don't know how to conclude this thought but I'm hoping to hear from folks that they intend to look after this module. It's not just being "thrown over the wall", so to speak. Naming is hard... * ...-monitor-: sorry I hate it * ...-percolator- No clue why this was chosen for ElasticSearch. I can appreciate a curious/non-obvious name like this that is not going to conflict with anyone's guesses at what a general name might convey. * "indexed-queries" or "query-indexing" would be a good name? This is the best technical name I can think of. * "reverse search" came to mind (based on the Netflix article) although that makes me think of leading-wildcard / suffix-search. * "inverted-search" * "indexed-query-alerts" incorporates "alerts" thus might better convey the use-case > On Mon, Apr 1, 2024 at 3:53 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD > A) wrote: > > Hi All, > > A few months ago I wrote the user list about potentially integrating >> lucene >>> monitor into solr. 
I have raised this PR with a first attempt at >> implementing >>> this integration. I'd greatly appreciate any feedback on this even >> though I >>> still have it marked as draft. I want to make sure I'm heading in the >> right >>> direction here so input from solr dev community would be extremely >> valuable :-) > > Many thanks, > Luke - To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org For additional commands, e-mail: dev-h...@solr.apache.org >>> >>> >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org >> For additional commands, e-mail: dev-h...@solr.apache.org >> >> - To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org For additional commands, e-mail: d
Re: timeout HTTP response code; use 524?
I still think 503 is appropriate when timeAllowed is exceeded. The service requested is a response within the set time. That service is not available. Here are the RFC definitions of 500 and 503. Exceeding timeAllowed isn’t an “unexpected condition”, it is part of the normal operation of that limit. 6.6.1. 500 Internal Server Error The 500 (Internal Server Error) status code indicates that the server encountered an unexpected condition that prevented it from fulfilling the request. https://datatracker.ietf.org/doc/html/rfc7231#section-6.6.1 6.6.4. 503 Service Unavailable The 503 (Service Unavailable) status code indicates that the server is currently unable to handle the request due to a temporary overload or scheduled maintenance, which will likely be alleviated after some delay. The server MAY send a Retry-After header field (Section 7.1.3) to suggest an appropriate amount of time for the client to wait before retrying the request. https://datatracker.ietf.org/doc/html/rfc7231#section-6.6.4 Solr could even return 503 with a message of “timeAllowed exceeded”. I spent about a decade working on a search engine with an integrated web spider. Accurate HTTP response codes are really useful. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 19, 2024, at 3:12 PM, Chris Hostetter wrote: > > > Agree on all of Uwe's points below > > I think 500 is the most appropriate for exceeding QueryLimits -- > unless/until we decide we want Solr to start using custom response codes in > some cases, but in that case i would suggest we explicitly *avoid* using > 504, 524, & 529 precisely because they already have specific meanings in > well known HTTP proxies/services that don't match what we're talking about > here. > > As far as one of David's specific observations...
> > : > ideal IMO because Solr's health could reasonably be judged by looking > : > for 500's specifically as a sign of a general error that service > : > operators should pay attention to. > > Any client that is interpreting a '500' error as a *general* indication of > a problem with Solr, and not specific to that request, would not be > respecting the spec on what '500' means. *Some* '5xx' are documented > to indicate that there may be a general problem afflicting the > server/service as a whole (notably '503') but most do not. > > But i also think that if we really want to cover our basis -- we can > always make it configurable. Let people configure Solr to return > 500, 400, 418, 666, 999, ... wtf they want ... but 500 is probably the > best sane default that doesn't carry around implicit baggage. > > : 524 or 504 both refer to timeouts, but both are meant for proxies (so > reverse > : proxy can't reach the backend server in time). So both of them do not match. > : > : 408 is "request timeout", but that's client's fault (4xx code). In that case > : its a more technical code because it also requires to close the connection > and > : not keep it alive, so we can't trigger that from Servlet API in a correct > way. > : > : 503 does not fit well as Solr is not overloaded, but would be the only > : alternative I see. Maybe add a new Solr-specific one? Anyways, I think 500 > : seems the best response unless you find another one not proxy-related. > : > : Uwe > > > -Hoss > http://www.lucidworks.com/ > > - > To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org > For additional commands, e-mail: dev-h...@solr.apache.org >
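Walter's 503-plus-message idea and Hoss's "make it configurable" point can be combined in a tiny mapping function. This is an illustrative sketch with made-up names, not Solr's actual error handling:

```python
# Default status when a query limit is exceeded; configurable per Hoss's
# suggestion (503 per Walter, 500 per Hoss, or whatever the operator wants).
DEFAULT_LIMIT_EXCEEDED_STATUS = 503

def response_for_limit(time_allowed_exceeded, status=DEFAULT_LIMIT_EXCEEDED_STATUS):
    """Map a timeAllowed-exceeded condition to (status, headers, message)."""
    if not time_allowed_exceeded:
        return 200, {}, "ok"
    headers = {}
    if status == 503:
        # RFC 7231 6.6.4: the server MAY suggest a retry delay.
        headers["Retry-After"] = "1"
    return status, headers, "timeAllowed exceeded"

status, headers, msg = response_for_limit(True)
```

Keeping the status in one configurable place means operators who alarm on 500s, and operators who want retry semantics, can each get the behavior they expect.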
Re: timeout HTTP response code; use 524?
503 Service Unavailable is the standard response for down or overloaded. I don’t see that 529 is significantly different. I do think it is a good idea to distinguish overload or down conditions from the catch-all 500 error. I interpret that as a broken server, not one that is functioning properly but overloaded. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 18, 2024, at 3:23 PM, David Smiley wrote: > > If timeAllowed is set and Solr takes too long then we fail the > response with an HTTP 500 response code. It's not bad but it's not > ideal IMO because Solr's health could reasonably be judged by looking > for 500's specifically as a sign of a general error that service > operators should pay attention to. There is a 529 response code used > by CloudFlare (judging from Wikipedia): > https://en.wikipedia.org/wiki/List_of_HTTP_status_codes > > Any opinion on the use of 529 instead of 500; or alternative perspectives? > > ~ David Smiley > Apache Lucene/Solr Search Developer > http://www.linkedin.com/in/davidwsmiley > > - > To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org > For additional commands, e-mail: dev-h...@solr.apache.org
Re: Moving to bin/solr start defaulting to SolrCloud mode?
Standalone makes sense for the configs. Each node has its own local set of configs, which are not shared. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 28, 2024, at 10:51 AM, David Smiley wrote: > > On Wed, Feb 28, 2024 at 7:50 AM Gus Heck wrote: >> IIRC "standalone" was deemed the wrong color for the shed because >> [original/non-cloud/standalone/legacy/user-managed] solr can have more than >> one machine, and does distributed search. > > Nonetheless each node acts alone and/or acts on requests which include > URLs. "standalone" may be an imperfect word but perhaps no word is > perfect. What "standalone" has going for it is mindshare / usage for > a decade now. > > - > To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org > For additional commands, e-mail: dev-h...@solr.apache.org >
Re: MixedCase or dashed-case for long options in Solr CLI?
Long options are dashed-case, following the GNU convention. POSIX only specifies single character options. The “--” prefix for long options is a GNU invention, as far as I know. Older Unix commands with long option names, e.g. find, only use a single dash. https://www.gnu.org/software/libc/manual/html_node/Argument-Syntax.html https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap12.html wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 26, 2024, at 5:29 AM, Eric Pugh > wrote: > > I hear a vote for dashed-case, how about some more votes? --solr-update-url > versus --solrUpdateUrl ? > > > >> On Feb 26, 2024, at 7:29 AM, Jason Gerlowski wrote: >> >> My guess is that "dashed-case" is slightly more common -- at least, >> that's my sense from haphazardly checking a few tools I use often >> ("curl", "kubectl", "git", "docker"). >> >> But I don't have an opinion as long as we're internally consistent >> about using one convention or the other. >> >> Best, >> >> Jason >> >> On Sat, Feb 24, 2024 at 11:35 AM Eric Pugh >> mailto:ep...@opensourceconnections.com>> >> wrote: >>> >>> Hi all, >>> >>> I wanted to get the community's input on formatting of long options for the >>> Solr CLI. I noticed on https://commons.apache.org/proper/commons-cli/ >>> that their examples all are --dashed-case. >>> >>> However, we have --solrUrl or --zkHost as our pattern. Though in working on >>> the PostTool, I used --solr-update-url as the parameter because I had been >>> reading the commons-cli docs... >>> >>> I’d like to get this sorted so that I can get >>> https://issues.apache.org/jira/browse/SOLR-16824 over the finish line. So >>> please do speak up with preferences! (And please let’s not support both!) >>> >>> >>> The changes to the formatting will be a 10x thing. 
>>> >>> Eric >>> >>> ___ >>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | >>> http://www.opensourceconnections.com >>> <http://www.opensourceconnections.com/><http://www.opensourceconnections.com/> >>> | My Free/Busy <http://tinyurl.com/eric-cal> >>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed >>> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw> >>> This e-mail and all contents, including attachments, is considered to be >>> Company Confidential unless explicitly stated otherwise, regardless of >>> whether attachments are marked as such. >>> >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org >> <mailto:dev-unsubscr...@solr.apache.org> >> For additional commands, e-mail: dev-h...@solr.apache.org >> <mailto:dev-h...@solr.apache.org> > ___ > Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | > http://www.opensourceconnections.com <http://www.opensourceconnections.com/> > | My Free/Busy <http://tinyurl.com/eric-cal> > Co-Author: Apache Solr Enterprise Search Server, 3rd Ed > <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw> > > This e-mail and all contents, including attachments, is considered to be > Company Confidential unless explicitly stated otherwise, regardless of > whether attachments are marked as such. >
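The GNU convention Walter describes is visible in any getopt-style parser. A small sketch using Python's argparse as a stand-in for commons-cli (the flag name --solr-update-url is taken from Eric's example; the URL is made up):

```python
import argparse

parser = argparse.ArgumentParser()
# GNU style: double-dash prefix, lowercase words separated by dashes.
# argparse maps --solr-update-url to the attribute solr_update_url.
parser.add_argument("--solr-update-url", dest="solr_update_url")

args = parser.parse_args(
    ["--solr-update-url", "http://localhost:8983/solr/update"]
)
```

Note that the parser defines one canonical spelling; supporting both --solrUpdateUrl and --solr-update-url would mean registering aliases, which is exactly the "please let's not support both" situation.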
Re: Use cases for interacting direct with ZK versus using our APIs?
Zookeeper file size limits are probably the most common failure. I had to mess around a lot with our suggestion dictionary to get it to upload. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 11, 2024, at 11:25 AM, Eric Pugh > wrote: > > Ah.. yeah, I can’t speak to Solr 6.x! In 9x at least you could use the > configset API to deploy configs and avoid the direct ZK interaction. > > It would be interesting to explore: if the process of deploying a configset is > risky and has a high chance of things failing, then how do we account for that > as part of the process? So you don’t have to do things like upload the > previous config ;-). > > And other common reasons to use ZK directly? > >> On Feb 11, 2024, at 12:14 PM, Walter Underwood wrote: >> >> That was deploying configs with Jenkins on Solr 6.x. Maybe the APIs were >> there, but I didn't know about them. >> >> Rebuilding the suggester did need external help, since that needs to be done >> separately on each node. >> >> I think working directly with Zookeeper is less risky. If there is any issue >> with the upload, then don’t reload the collections. You can back out the >> changes by uploading the previous config to Zookeeper. >> >> wunder >> Walter Underwood >> wun...@wunderwood.org <mailto:wun...@wunderwood.org> >> http://observer.wunderwood.org/ (my blog) >> >>> On Feb 11, 2024, at 11:07 AM, Eric Pugh >> <mailto:ep...@opensourceconnections.com>> wrote: >>> >>> Could you share more about “update Solr remotely” that you were doing? >>> Are we missing some APIs that would have made whatever you had to do >>> require ZK direct access? >>> >>> While it’s cool that we can impact Solr via hacking around in ZK, it also >>> seems like an approach fraught with risk! 
>>> >>>> On Feb 11, 2024, at 11:32 AM, Walter Underwood >>>> wrote: >>>> >>>> I wanted something that didn’t require installing Solr locally in order to >>>> update Solr remotely, so I didn’t use the provided zk commands. I wrote >>>> some Python to dig the Zookeeper addresses out of clusterstatus (I think) >>>> then uploaded directly to Zookeeper with the Python kazoo package. >>>> >>>> The tool had a bunch of other things, like async reload checking for >>>> results, and rebuilding suggestion dictionaries on each node. >>>> >>>> wunder >>>> Walter Underwood >>>> wun...@wunderwood.org >>>> http://observer.wunderwood.org/ (my blog) >>>> >>>>> On Feb 11, 2024, at 9:04 AM, Gus Heck wrote: >>>>> >>>>> I pretty much always use zk upconfig, which also works for overwriting >>>>> existing. I certainly tell my clients to use apis from the ref guide for >>>>> such operations, but zk upconfig certainly counts as one. Mostly I tell >>>>> them that they should only break out things like >>>>> https://github.com/rgs1/zk_shell as a last resort (which is what I think >>>>> of >>>>> as direct modification), and if they are unsure, call me *before* doing >>>>> anything in zk directly. >>>>> >>>>> By the way, I don't know if this has come up in a dev/build setting or >>>>> not, >>>>> but are you aware of https://plugins.gradle.org/search?term=solr ? It is >>>>> presently only really suitable for local dev, with a single config set, >>>>> but >>>>> could easily grow patches and suggestions welcome of course. >>>>> >>>>> On Sun, Feb 11, 2024, 9:10 AM Eric Pugh >>>>> wrote: >>>>> >>>>>> Hi all.. I was playing around with a cluster and wanted to upload a >>>>>> configset into Solr…. >>>>>> >>>>>> I ran bin/solr and noticed a bin/solr config -h command, but it just lets >>>>>> me tweak a config. Then I ran bin/solr create -h and it appears to let >>>>>> me >>>>>> upload a configset, but I have to create the collection as well, and I’m >>>>>> not ready to do that. 
>>>>>> >>>>>> Then I poked around and discovered hidden under bin/solr zk a command >>>>>> upconfig…. So bin/solr zk upconfig will let me get my configset into >>>>>> Solr, >>>>>&g
Re: Use cases for interacting direct with ZK versus using our APIs?
That was deploying configs with Jenkins on Solr 6.x. Maybe the APIs were there, but I didn't know about them. Rebuilding the suggester did need external help, since that needs to be done separately on each node. I think working directly with Zookeeper is less risky. If there is any issue with the upload, then don’t reload the collections. You can back out the changes by uploading the previous config to Zookeeper. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 11, 2024, at 11:07 AM, Eric Pugh > wrote: > > Could you share more about “update Solr remotely” that you were doing? Are > we missing some APIs that would have made whatever you had to do require ZK > direct access? > > While it’s cool that we can impact Solr via hacking around in ZK, it also > seems like an approach fraught with risk! > >> On Feb 11, 2024, at 11:32 AM, Walter Underwood wrote: >> >> I wanted something that didn’t require installing Solr locally in order to >> update Solr remotely, so I didn’t use the provided zk commands. I wrote some >> Python to dig the Zookeeper addresses out of clusterstatus (I think) then >> uploaded directly to Zookeeper with the Python kazoo package. >> >> The tool had a bunch of other things, like async reload checking for >> results, and rebuilding suggestion dictionaries on each node. >> >> wunder >> Walter Underwood >> wun...@wunderwood.org >> http://observer.wunderwood.org/ (my blog) >> >>> On Feb 11, 2024, at 9:04 AM, Gus Heck wrote: >>> >>> I pretty much always use zk upconfig, which also works for overwriting >>> existing. I certainly tell my clients to use apis from the ref guide for >>> such operations, but zk upconfig certainly counts as one. Mostly I tell >>> them that they should only break out things like >>> https://github.com/rgs1/zk_shell as a last resort (which is what I think of >>> as direct modification), and if they are unsure, call me *before* doing >>> anything in zk directly. 
>>> >>> By the way, I don't know if this has come up in a dev/build setting or not, >>> but are you aware of https://plugins.gradle.org/search?term=solr ? It is >>> presently only really suitable for local dev, with a single config set, but >>> could easily grow patches and suggestions welcome of course. >>> >>> On Sun, Feb 11, 2024, 9:10 AM Eric Pugh >>> wrote: >>> >>>> Hi all.. I was playing around with a cluster and wanted to upload a >>>> configset into Solr…. >>>> >>>> I ran bin/solr and noticed a bin/solr config -h command, but it just lets >>>> me tweak a config. Then I ran bin/solr create -h and it appears to let me >>>> upload a configset, but I have to create the collection as well, and I’m >>>> not ready to do that. >>>> >>>> Then I poked around and discovered hidden under bin/solr zk a command >>>> upconfig…. So bin/solr zk upconfig will let me get my configset into Solr, >>>> but does require me to remember what my magic ZK string is ;-). >>>> >>>> I went and checked the ref guide, and yes, it states that there are two >>>> ways: >>>> >>>> A configset can be uploaded to ZooKeeper either via the Configsets API < >>>> https://solr.apache.org/guide/solr/latest/configuration-guide/configsets-api.html> >>>> or more directly via bin/solr zk upconfig < >>>> https://solr.apache.org/guide/solr/latest/deployment-guide/solr-control-script-reference.html#upload-a-configuration-set>. >>>> The Configsets API has some other operations as well, and likewise, so does >>>> the CLI. >>>> >>>> Are there use cases where interacting directly with ZooKeeper is preferred >>>> over making changes via the APIs? Or is the use of bin/solr zk upconfig >>>> more of an evolutionary byproduct of how we built SolrCloud? 
>>>> >>>> Eric >>>> >>>> ___ >>>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | >>>> http://www.opensourceconnections.com < >>>> http://www.opensourceconnections.com/> | My Free/Busy < >>>> http://tinyurl.com/eric-cal> >>>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed < >>>> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw> >>>> >>>> This
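Walter's back-out approach (if there is any issue with the upload, don't reload the collections; restore the previous config instead) can be sketched as below. The upload and reload callables are hypothetical stand-ins for real ZooKeeper uploads (e.g. via kazoo) and Collections API reloads, not any actual Solr or kazoo API:

```python
# Sketch of the deploy-with-rollback flow described above. Keep the
# previous known-good config, try the new one, and only reload the
# collections if the upload succeeded.

def deploy_config(upload, reload_collections, new_config, previous_config):
    """Return True if the new config was deployed, False if rolled back."""
    try:
        upload(new_config)
    except Exception:
        # Upload failed: restore the previous config and skip the reload,
        # so running collections never see the half-applied change.
        upload(previous_config)
        return False
    reload_collections()
    return True

# Toy in-memory stand-ins to exercise the happy path.
state = {"config": "old", "reloaded": False}

def ok_upload(cfg):
    state["config"] = cfg

def reload_collections():
    state["reloaded"] = True

deploy_config(ok_upload, reload_collections, "new", "old")
print(state)  # {'config': 'new', 'reloaded': True}
```

The key property, per Walter's reasoning, is that the reload only ever happens after a clean upload, so a bad push never reaches serving collections.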
Re: Use cases for interacting direct with ZK versus using our APIs?
I wanted something that didn’t require installing Solr locally in order to update Solr remotely, so I didn’t use the provided zk commands. I wrote some Python to dig the Zookeeper addresses out of clusterstatus (I think) then uploaded directly to Zookeeper with the Python kazoo package. The tool had a bunch of other things, like async reload checking for results, and rebuilding suggestion dictionaries on each node. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 11, 2024, at 9:04 AM, Gus Heck wrote: > > I pretty much always use zk upconfig, which also works for overwriting > existing. I certainly tell my clients to use apis from the ref guide for > such operations, but zk upconfig certainly counts as one. Mostly I tell > them that they should only break out things like > https://github.com/rgs1/zk_shell as a last resort (which is what I think of > as direct modification), and if they are unsure, call me *before* doing > anything in zk directly. > > By the way, I don't know if this has come up in a dev/build setting or not, > but are you aware of https://plugins.gradle.org/search?term=solr ? It is > presently only really suitable for local dev, with a single config set, but > could easily grow patches and suggestions welcome of course. > > On Sun, Feb 11, 2024, 9:10 AM Eric Pugh > wrote: > >> Hi all.. I was playing around with a cluster and wanted to upload a >> configset into Solr…. >> >> I ran bin/solr and noticed a bin/solr config -h command, but it just lets >> me tweak a config. Then I ran bin/solr create -h and it appears to let me >> upload a configset, but I have to create the collection as well, and I’m >> not ready to do that. >> >> Then I poked around and discovered hidden under bin/solr zk a command >> upconfig…. So bin/solr zk upconfig will let me get my configset into Solr, >> but does require me to remember what my magic ZK string is ;-). 
>> >> I went and checked the ref guide, and yes, it states that there are two >> ways: >> >> A configset can be uploaded to ZooKeeper either via the Configsets API < >> https://solr.apache.org/guide/solr/latest/configuration-guide/configsets-api.html> >> or more directly via bin/solr zk upconfig < >> https://solr.apache.org/guide/solr/latest/deployment-guide/solr-control-script-reference.html#upload-a-configuration-set>. >> The Configsets API has some other operations as well, and likewise, so does >> the CLI. >> >> Are there use cases where interacting directly with ZooKeeper is preferred >> over making changes via the APIs? Or is the use of bin/solr zk upconfig >> more of an evolutionary byproduct of how we built SolrCloud? >> >> Eric >> >> ___ >> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | >> http://www.opensourceconnections.com < >> http://www.opensourceconnections.com/> | My Free/Busy < >> http://tinyurl.com/eric-cal> >> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed < >> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw> >> >> This e-mail and all contents, including attachments, is considered to be >> Company Confidential unless explicitly stated otherwise, regardless of >> whether attachments are marked as such. >> >>
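Walter mentions digging the ZooKeeper addresses out of clusterstatus before uploading with kazoo. A minimal sketch of that parsing step is below; note the sample payload and the zkHost field location are hypothetical (Walter himself says "I think", and the real endpoint and field name depend on the Solr version):

```python
import json

# Parse a cluster-status-style JSON response and pull out the ZooKeeper
# ensemble. The payload shape here is an assumption for illustration.
sample_response = json.dumps({
    "responseHeader": {"status": 0},
    "cluster": {"live_nodes": ["10.0.0.1:8983_solr"]},
    "zkHost": "zk1:2181,zk2:2181,zk3:2181/solr",
})

def zk_servers(response_text):
    """Return the host:port pairs from a ZooKeeper connect string."""
    payload = json.loads(response_text)
    connect_string = payload["zkHost"]
    # A chroot suffix like "/solr" applies to the whole ensemble,
    # so strip it before splitting the comma-separated host list.
    hosts, _, _chroot = connect_string.partition("/")
    return hosts.split(",")

print(zk_servers(sample_response))  # ['zk1:2181', 'zk2:2181', 'zk3:2181']
```

The resulting list is what a kazoo client would be pointed at, without needing a local Solr install.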
Re: Collections LIST semantics
If a program gets a list from a remote server, then expects that list to be accurate when it makes calls based on it, well, my kindest thought is “charmingly naive”. Really, that is just bad code that hasn’t broken yet. That is true even if it gets a list from Zookeeper. Things change while you aren’t looking at them. Solr could make that happen less often or more often, but it will happen. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 29, 2024, at 10:42 AM, Jason Gerlowski wrote: > > Thanks for calling this out more explicitly; definitely worth discussing. > >> If a client/caller/user lists collections and then loops them to take > some action on them, it needs to be tolerant of the collection not working; > may seem to not exist. > > I'd go even a step further and say that users should always have > error-handling around their calls to Solr. > > But even so I'm leery of changing the semantics here. I think the > assumption of most folks is that each entry returned by a "list" exists > fully, unless the response gives more granular info to augment that. I'd > worry that returning partially-created or partially-deleted collections > would be confusing and unintuitive to most users. (e.g. Imagine iterating > over a "list", getting a not-found error running some operation on one of > the entries, but still seeing the collection when you call "list" again to > double-check.) > > I understand the need for a more scalable API, or a way to detect orphaned > data in ZK. But I'd personally rather not see us change the LIST semantics > to accomplish that. If you need the ZK child nodes, is there maybe a > scalable way to invoke ZookeeperInfoHandler to get that information? 
> > Best, > > Jason > > On Fri, Jan 26, 2024 at 2:46 PM David Smiley wrote: > >> https://issues.apache.org/jira/browse/SOLR-16909 >>> Collections LIST command should fetch ZK data, not cached state >> >> I want to get further input from folks that changing the semantics is >> okay. If the change is applied, LIST will be much faster but it will >> return collections that have not yet been fully constructed or >> deleted. If a client/caller/user lists collections and then loops >> them to take some action on them, it needs to be tolerant of the >> collection not working; may seem to not exist. I argue callers should >> *already* behave in this way or it may be brittle to circumstances >> that are hard to reason about. On the other hand, maybe this would >> increase the frequency of errors to existing clients that didn't >> encounter this in testing? Shrug. I could imagine ways to solve this >> but it would add some complexity and it's not clear it's worthwhile. >> >> A related aside: the method ClusterStatus.getCollectionsMap is not >> scalable for clusters with 10K+ collections because it loops every >> collection to fetch the latest state from ZK, putting a massive load >> on ZK. Our implementation of collection listing calls it, as do a >> number of places across Solr. Some could be changed with relative >> ease; some are more thorny. I'd love to rename this thing, putting >> "slow" in the name so that you think twice before calling it :-) >> >> ~ David Smiley >> Apache Lucene/Solr Search Developer >> http://www.linkedin.com/in/davidwsmiley >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org >> For additional commands, e-mail: dev-h...@solr.apache.org >> >>
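The defensive pattern being argued for in this thread, treating a collection list as a stale snapshot and tolerating entries that stop existing before you act on them, might look like the sketch below. The exception type and client callables are hypothetical stand-ins, not actual SolrJ or Solr API names:

```python
# Iterate a listed set of collections, skipping any that vanished
# between the LIST call and the per-collection action.

class CollectionGone(Exception):
    """Stand-in for a not-found error from a per-collection operation."""

def for_each_collection(list_collections, action):
    """Apply action to each listed collection; skip ones that disappeared."""
    processed, skipped = [], []
    for name in list_collections():
        try:
            action(name)
        except CollectionGone:
            # The list was stale: the collection is gone now. Move on.
            skipped.append(name)
        else:
            processed.append(name)
    return processed, skipped

# Toy demonstration: "ghost" was listed but deleted before we acted on it.
def fake_list():
    return ["films", "ghost", "techproducts"]

def fake_action(name):
    if name == "ghost":
        raise CollectionGone(name)

print(for_each_collection(fake_list, fake_action))
# (['films', 'techproducts'], ['ghost'])
```

This is the behavior David argues callers should already have, whether or not the LIST semantics change.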
Re: New Feature: Query Elevation on "fq" field.
fq doesn’t calculate scores, so it doesn’t do any ranking. Query elevation for fq doesn’t make any sense. What problem do you think this solves? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 23, 2023, at 1:33 PM, Mouhcine Boutinzer > wrote: > > Hi there, > I am planning to suggest/introduce a new Query Elevation feature to Solr. > Currently, Solr supports Query Elevation for query parameter only (aka "q"). > What I suggest is to allow users to set a Query Elevation configuration for > filter query parameter as well (aka "fq"). > I am willing to create a JIRA issue for that, but I thought it might be > better if I communicate the idea to the dev team before submitting my issue. > Many thanks. > Regards, > Mouhcine
Re: New branch and feature freeze for Solr 9.4.0
I think this is missing some words: “A new Always-On trace id generator and the rid parameter is being deprecated” Maybe “…generator has been added and…”? As it stands, it looks like the new trace id generator is being deprecated. The circuit breaker descriptions are accurate, but probably do not need to be capitalized. In general, there seem to be extra capitalizations, like “Always-On” and “Backup, Restore, and Split”. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 3, 2023, at 2:00 PM, Alex Deparvu wrote: > > Please update the draft release notes if you have any suggestions: > > https://cwiki.apache.org/confluence/display/SOLR/ReleaseNote9_4_0 > > best, > alex > > > > On Tue, Oct 3, 2023 at 11:26 AM Alex Deparvu wrote: > >> NOTICE: >> >> Branch branch_9_4 has been cut and versions updated to 9.5 on the stable >> branch. >> >> Please observe the normal rules: >> >> * No new features may be committed to the branch. >> * Documentation patches, build patches and serious bug fixes may be >> committed to the branch. However, you should submit all patches you >> want to commit to Jira first to give others the chance to review >> and possibly vote against the patch. Keep in mind that it is our >> main intention to keep the branch as stable as possible. >> * All patches that are intended for the branch should first be committed >> to the unstable branch, merged into the stable branch, and then into >> the current release branch. >> * Normal unstable and stable branch development may continue as usual. >> However, if you plan to commit a big change to the unstable branch >> while the branch feature freeze is in effect, think twice: can't the >> addition wait a couple more days? Merges of bug fixes into the branch >> may become more difficult. >> * Only Jira issues with Fix version 9.4 and priority "Blocker" will delay >> a release candidate build. >>
Re: Sitemap to get latest reference manual to rank in Google/Bing?
I would also prefer to have the old versions in web search. Antora can build a sitemap.xml file, so the right place to do this work is probably in the ref guide part of the Solr build. URLs that are not in the sitemap will still get indexed, so we can use the sitemap to hint that the latest guide is preferred. The entries would look something like this: <url><loc>https://solr.apache.org/guide/solr/latest/index.html</loc><priority>0.80</priority></url> Default priority is 0.5, so 0.8 would make the latest more important. wunder > On Sep 21, 2023, at 3:14 PM, Arrieta, Alejandro > wrote: > > Hello, > > Please don't remove the indexing of older Solr guides. It helps to search > for "Solr X.Y what_to_search" and get the link to the corresponding guide. > Thumbs up to give higher priority to the latest guide. > > Kind Regards, > Alejandro Arrieta > > On Thu, Sep 21, 2023 at 3:42 PM Walter Underwood > wrote: > >> Actually, the robots.txt file should also disallow the 9.x guides. That >> won’t touch guide/latest. >> >> User-agent: * >> Disallow: /guide/9* >> Disallow: /guide/8* >> Disallow: /guide/7* >> Disallow: /guide/6* >> >> wunder >> >>> On Sep 21, 2023, at 2:38 PM, Walter Underwood >> wrote: >>> >>> I’m actually OK with them being indexed. It could be helpful to search >> for “Solr 8.11 aliases” or something like that. >>> >>> The priority attribute in sitemap.xml should boost the default, latest >> manual and that shouldn’t require any web server config. I’m glad to craft >> a static sitemap.xml file. One generated from the guide would be better, >> but that can be a later improvement. >>> >>> To get the old versions completely out of the index, add a robots.txt >> file to the solr-site repo under contents/ with these lines: >>> >>> User-agent: * >>> Disallow: /guide/8* >>> Disallow: /guide/7* >>> Disallow: /guide/6* >>> >>> Note that the wildcards on the paths aren't needed, but they help >> humans understand that the disallows are a prefix match. 
>>> >>> wunder >>> Walter Underwood >>> wun...@wunderwood.org >>> http://observer.wunderwood.org/ (my blog) >>> >>>> On Sep 21, 2023, at 12:08 PM, Houston Putman >> wrote: >>>> >>>> I've been trying to get this working for the last year. Basically our >> issue >>>> is that the htaccess files do not add the right X-Robots-Tag header for >> old >>>> ref guide pages. >>>> >>>> >> https://github.com/apache/solr-site/blob/main/themes/solr/templates/htaccess.ref-guide-old#L1 >>>> >>>> This works locally, but in the actual Solr site, the headers are not >>>> returned. I have no idea why. Would love some help though, as I also >> hate >>>> seeing the old ref guide in the google results. >>>> >>>> - Houston >>>> >>>> On Thu, Sep 21, 2023 at 11:30 AM Walter Underwood < >> wun...@wunderwood.org> >>>> wrote: >>>> >>>>> When I get web search results that include the Solr Reference Guide, I >>>>> often get older versions (6.6, 7.4) in the results. I would prefer to >>>>> always get the latest reference ( >>>>> https://solr.apache.org/guide/solr/latest/index.html). >>>>> >>>>> I think we can list the URLs for that in a sitemap.xml file with a >> higher >>>>> priority to suggest to the crawlers that these are the preferred pages. >>>>> >>>>> I don’t see a sitemap.xml or sitemap.xml.gz at >> https://solr.apache.org < >>>>> https://solr.apache.org/>. >>>>> >>>>> Should we prefer the latest manual? How do we build/deploy a sitemap? >> See: >>>>> https://www.sitemaps.org/ >>>>> >>>>> wunder >>>>> Walter Underwood >>>>> wun...@wunderwood.org >>>>> http://observer.wunderwood.org/ (my blog) >>>>> >>>>> >>> >> >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org >> For additional commands, e-mail: dev-h...@solr.apache.org >> >> - To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org For additional commands, e-mail: dev-h...@solr.apache.org
Re: Sitemap to get latest reference manual to rank in Google/Bing?
Actually, the robots.txt file should also disallow the 9.x guides. That won’t touch guide/latest. User-agent: * Disallow: /guide/9* Disallow: /guide/8* Disallow: /guide/7* Disallow: /guide/6* wunder > On Sep 21, 2023, at 2:38 PM, Walter Underwood wrote: > > I’m actually OK with them being indexed. It could be helpful to search for > “Solr 8.11 aliases” or something like that. > > The priority attribute in sitemap.xml should boost the default, latest manual > and that shouldn’t require any web server config. I’m glad to craft a static > sitemap.xml file. One generated from the guide would be better, but that can > be a later improvement. > > To get the old versions completely out of the index, add a robots.txt file to > the solr-site repo under contents/ with these lines: > > User-agent: * > Disallow: /guide/8* > Disallow: /guide/7* > Disallow: /guide/6* > > Note that the wildcards on the paths aren't needed, but they help humans > understand that the disallows are a prefix match. > > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ (my blog) > >> On Sep 21, 2023, at 12:08 PM, Houston Putman wrote: >> >> I've been trying to get this working for the last year. Basically our issue >> is that the htaccess files do not add the right X-Robots-Tag header for old >> ref guide pages. >> >> https://github.com/apache/solr-site/blob/main/themes/solr/templates/htaccess.ref-guide-old#L1 >> >> This works locally, but in the actual Solr site, the headers are not >> returned. I have no idea why. Would love some help though, as I also hate >> seeing the old ref guide in the google results. >> >> - Houston >> >> On Thu, Sep 21, 2023 at 11:30 AM Walter Underwood >> wrote: >> >>> When I get web search results that include the Solr Reference Guide, I >>> often get older versions (6.6, 7.4) in the results. I would prefer to >>> always get the latest reference ( >>> https://solr.apache.org/guide/solr/latest/index.html). 
>>> >>> I think we can list the URLs for that in a sitemap.xml file with a higher >>> priority to suggest to the crawlers that these are the preferred pages. >>> >>> I don’t see a sitemap.xml or sitemap.xml.gz at https://solr.apached.org < >>> https://solr.apached.org/>. >>> >>> Should we prefer the latest manual? How do we build/deploy a sitemap? See: >>> https://www.sitemaps.org/ >>> >>> wunder >>> Walter Underwood >>> wun...@wunderwood.org >>> http://observer.wunderwood.org/ (my blog) >>> >>> > - To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org For additional commands, e-mail: dev-h...@solr.apache.org
Re: Sitemap to get latest reference manual to rank in Google/Bing?
I’m actually OK with them being indexed. It could be helpful to search for “Solr 8.11 aliases” or something like that. The priority attribute in sitemap.xml should boost the default, latest manual and that shouldn’t require any web server config. I’m glad to craft a static sitemap.xml file. One generated from the guide would be better, but that can be a later improvement. To get the old versions completely out of the index, add a robots.txt file to the solr-site repo under contents/ with these lines: User-agent: * Disallow: /guide/8* Disallow: /guide/7* Disallow: /guide/6* Note that the wildcards on the paths aren't needed, but they help humans understand that the disallows are a prefix match. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Sep 21, 2023, at 12:08 PM, Houston Putman wrote: > > I've been trying to get this working for the last year. Basically our issue > is that the htaccess files do not add the right X-Robots-Tag header for old > ref guide pages. > > https://github.com/apache/solr-site/blob/main/themes/solr/templates/htaccess.ref-guide-old#L1 > > This works locally, but in the actual Solr site, the headers are not > returned. I have no idea why. Would love some help though, as I also hate > seeing the old ref guide in the google results. > > - Houston > > On Thu, Sep 21, 2023 at 11:30 AM Walter Underwood > wrote: > >> When I get web search results that include the Solr Reference Guide, I >> often get older versions (6.6, 7.4) in the results. I would prefer to >> always get the latest reference ( >> https://solr.apache.org/guide/solr/latest/index.html). >> >> I think we can list the URLs for that in a sitemap.xml file with a higher >> priority to suggest to the crawlers that these are the preferred pages. >> >> I don’t see a sitemap.xml or sitemap.xml.gz at https://solr.apache.org < >> https://solr.apache.org/>. >> >> Should we prefer the latest manual? How do we build/deploy a sitemap? 
See: >> https://www.sitemaps.org/ >> >> wunder >> Walter Underwood >> wun...@wunderwood.org >> http://observer.wunderwood.org/ (my blog) >> >>
Sitemap to get latest reference manual to rank in Google/Bing?
When I get web search results that include the Solr Reference Guide, I often get older versions (6.6, 7.4) in the results. I would prefer to always get the latest reference (https://solr.apache.org/guide/solr/latest/index.html). I think we can list the URLs for that in a sitemap.xml file with a higher priority to suggest to the crawlers that these are the preferred pages. I don’t see a sitemap.xml or sitemap.xml.gz at https://solr.apache.org <https://solr.apache.org/>. Should we prefer the latest manual? How do we build/deploy a sitemap? See: https://www.sitemaps.org/ wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)
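Building the sitemap.xml discussed in this thread could be as small as the sketch below. The page list is illustrative (the real set would come from the Antora build), and the 0.80 priority mirrors the value suggested elsewhere in the thread against the 0.5 default:

```python
import xml.etree.ElementTree as ET

# Generate a minimal sitemaps.org-format sitemap that lists the latest
# ref guide pages with a raised priority.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls, priority="0.80"):
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "priority").text = priority
    return ET.tostring(urlset, encoding="unicode")

xml_text = build_sitemap([
    "https://solr.apache.org/guide/solr/latest/index.html",
])
print(xml_text)
```

The output would be written to sitemap.xml at the site root, where crawlers look for it.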