Re: Unfinished Business: Fast Global IDF

2024-08-28 Thread Walter Underwood
I’ve never been in that part of the code, but it feels like it could have a 
small blast radius. We already have an interface for global IDF, so calculating 
it differently shouldn’t be huge. It does need a change in the shard response 
format.

It wouldn’t hurt to return DF in the response to regular clients. That would 
help with distributed search across collections, clusters, or even different 
kinds of engines. We did that ages ago at Verity with a SOAP interface (yuk). 
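
To make that concrete, here is one hypothetical shape for a shard response 
that carries the stats. The termStats section and its field names are 
illustrative only, not an existing Solr format:

    {
      "responseHeader": {"status": 0},
      "response": {"numFound": 1042, "docs": ["..."]},
      "termStats": {
        "numDocs": 3500000,
        "df": {"kamala": 48211, "shard": 9120}
      }
    }

A merging layer would sum numDocs and each term’s df across shards before 
computing the final IDF.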

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 27, 2024, at 8:10 PM, David Smiley  wrote:
> 
> Thanks for sharing Walter!  I hope someone enterprising tackles it.
> It'd be nice to have global IDF by default without having to go enable
> something that adds a performance risk.
> 
> I'm sure you have many career stories to tell.  If you find yourself
> at Acadia National Park hiking & backpacking, as you like to do, shoot
> me a message. :-D
> 
> ~ David
> 
> On Tue, Aug 27, 2024 at 3:01 PM Walter Underwood  
> wrote:
>> 
>> When I’ve enabled global exact IDF in Solr, the speed penalty was about 10X. 
>> Back in 1995, Infoseek figured out how to do that with no speed penalty. 
>> They patented it, but that patent expired several years ago. I’ll try and 
>> hunt it down.
>> 
>> Short version, from each shard return the number of docs and the df for each 
>> term. When combining results, add all the DF, add all the NUMDOCS, divide, 
>> and you have the global IDF. This is constant for the whole result list. 
>> Each shard already needs that info for local score, so it shouldn’t be extra 
>> work.
>> 
>> When does this matter? When the relevant documents for a term are mostly on 
>> one shard, either intentionally or accidentally. Let’s say we have a news 
>> search and all the stories for August 2024 are on one shard. The term 
>> “kamala” will be much more common on that shard, giving a lower IDF, but…the 
>> relevant documents are probably on that shard. So the best documents have a 
>> lower score using local IDF.
>> 
>> This also shows up with lots of shards or small shards, because there will 
>> be uneven distribution of docs. When I retired from LexisNexis, we had a 
>> cluster with 320 shards. I’m sure that had some interesting IDF behavior.
>> 
>> I wrote up how we did this in a Java distributed search layer for Ultraseek: 
>> https://observer.wunderwood.org/2007/04/04/progressive-reranking/
>> 
>> There is some earlier discussion here: 
>> https://solr-user.lucene.apache.narkive.com/zNa1Hn4p/single-call-for-distributed-idf
>> 
>> I don’t think there is a Jira issue for this.
>> 
>> I think that is all the unfinished business since putting Solr 1.3 into 
>> production at Netflix. Pretty darned good job everybody. Huge thanks to all 
>> the contributors and committers who have put in years of effort over that 
>> time.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> For additional commands, e-mail: dev-h...@solr.apache.org
> 



Re: Unfinished Business: Fast Global IDF

2024-08-27 Thread Walter Underwood
This is the patent. Last assignee was Google, expired in 2017. 
https://patents.google.com/patent/US5659732A/en  —wunder

> On Aug 27, 2024, at 12:01 PM, Walter Underwood  wrote:
> 
> When I’ve enabled global exact IDF in Solr, the speed penalty was about 10X. 
> Back in 1995, Infoseek figured out how to do that with no speed penalty. They 
> patented it, but that patent expired several years ago. I’ll try and hunt it 
> down.
> 
> Short version, from each shard return the number of docs and the df for each 
> term. When combining results, add all the DF, add all the NUMDOCS, divide, 
> and you have the global IDF. This is constant for the whole result list. Each 
> shard already needs that info for local score, so it shouldn’t be extra work.
> 
> When does this matter? When the relevant documents for a term are mostly on 
> one shard, either intentionally or accidentally. Let’s say we have a news 
> search and all the stories for August 2024 are on one shard. The term 
> “kamala” will be much more common on that shard, giving a lower IDF, but…the 
> relevant documents are probably on that shard. So the best documents have a 
> lower score using local IDF.
> 
> This also shows up with lots of shards or small shards, because there will be 
> uneven distribution of docs. When I retired from LexisNexis, we had a cluster 
> with 320 shards. I’m sure that had some interesting IDF behavior.
> 
> I wrote up how we did this in a Java distributed search layer for Ultraseek: 
> https://observer.wunderwood.org/2007/04/04/progressive-reranking/
> 
> There is some earlier discussion here: 
> https://solr-user.lucene.apache.narkive.com/zNa1Hn4p/single-call-for-distributed-idf
> 
> I don’t think there is a Jira issue for this.
> 
> I think that is all the unfinished business since putting Solr 1.3 into 
> production at Netflix. Pretty darned good job everybody. Huge thanks to all 
> the contributors and committers who have put in years of effort over that 
> time.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 



Unfinished Business: Fast Global IDF

2024-08-27 Thread Walter Underwood
When I’ve enabled global exact IDF in Solr, the speed penalty was about 10X. 
Back in 1995, Infoseek figured out how to do that with no speed penalty. They 
patented it, but that patent expired several years ago. I’ll try and hunt it 
down.

Short version, from each shard return the number of docs and the df for each 
term. When combining results, add all the DF, add all the NUMDOCS, divide, and 
you have the global IDF. This is constant for the whole result list. Each shard 
already needs that info for local score, so it shouldn’t be extra work.
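
A minimal sketch of that merge, in Python for illustration. The numbers are 
made up, and the classic log-based IDF formula here is an assumption; BM25 
would apply its own formula to the same summed stats:

    import math

    # (numDocs, df) pairs as returned by each shard for one term
    shard_stats = [(1_000_000, 50_000), (1_000_000, 800), (1_000_000, 1_200)]

    total_docs = sum(num_docs for num_docs, _ in shard_stats)
    total_df = sum(df for _, df in shard_stats)

    # one global IDF for the term, constant across the merged result list
    global_idf = 1.0 + math.log(total_docs / (total_df + 1))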

When does this matter? When the relevant documents for a term are mostly on one 
shard, either intentionally or accidentally. Let’s say we have a news search 
and all the stories for August 2024 are on one shard. The term “kamala” will be 
much more common on that shard, giving a lower IDF, but…the relevant documents 
are probably on that shard. So the best documents have a lower score using 
local IDF.

This also shows up with lots of shards or small shards, because there will be 
uneven distribution of docs. When I retired from LexisNexis, we had a cluster 
with 320 shards. I’m sure that had some interesting IDF behavior.

I wrote up how we did this in a Java distributed search layer for Ultraseek: 
https://observer.wunderwood.org/2007/04/04/progressive-reranking/

There is some earlier discussion here: 
https://solr-user.lucene.apache.narkive.com/zNa1Hn4p/single-call-for-distributed-idf

I don’t think there is a Jira issue for this.

I think that is all the unfinished business since putting Solr 1.3 into 
production at Netflix. Pretty darned good job everybody. Huge thanks to all the 
contributors and committers who have put in years of effort over that time.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Unfinished Business: Fuzzy in edismax

2024-08-27 Thread Walter Underwood
Oops. https://issues.apache.org/jira/browse/SOLR-629  —wunder

> On Aug 27, 2024, at 11:40 AM, Walter Underwood  wrote:
> 
> I’m retired and not working on Solr all the time, but there are two things I 
> didn’t finish that should be picked up. I’m not going to do these, I’ve got 
> plenty of retirement stuff to do.
> 
> The first is SOLR-629, probably the oldest open feature request and a good 
> first project for someone. This adds support for fuzzy search to the edismax 
> query parser. The external impact is tiny, the qf config just says “title~” 
> instead of “title”.
> 
> The most recent patch is for 4.x. It doesn’t apply 100% to the current code 
> (more like 50%), but it should be fairly easy to figure out the needed mods.
> 
> This should be a nice project for a first-time contributor, because it is 
> localized to the edismax parser. That is spread out a bit, but not too bad. 
> Besides, who gets to work on a three-digit Jira issue?
> 
> Two notes:
> 
> 1. You’ll get the urge to rewrite the whole damned edismax config parser with 
> a real parser generator. Resist that and just make the change.
> 2. It isn’t possible to have a higher boost for an exact match and a lower 
> boost for a fuzzy match because it only handles one config spec per field 
> name. And it doesn’t throw an error for the second time, either. It really 
> should handle “title^4 title~^2”. The workaround is to make a copy of the 
> title field. Maybe that should be a separate Jira issue?
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 



Unfinished Business: Fuzzy in edismax

2024-08-27 Thread Walter Underwood
I’m retired and not working on Solr all the time, but there are two things I 
didn’t finish that should be picked up. I’m not going to do these, I’ve got 
plenty of retirement stuff to do.

The first is SOLR-629, probably the oldest open feature request and a good 
first project for someone. This adds support for fuzzy search to the edismax 
query parser. The external impact is tiny, the qf config just says “title~” 
instead of “title”.
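
For example, a request using that proposed syntax might look like the sketch 
below. Nothing here works until SOLR-629 lands, and the exact parameter 
handling is whatever the patch implements:

    defType=edismax
    q=managment
    qf=title~ description

The tilde on “title~” marks the field for fuzzy matching, so the misspelled 
“managment” could still match “management” in titles.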

The most recent patch is for 4.x. It doesn’t apply 100% to the current code 
(more like 50%), but it should be fairly easy to figure out the needed mods.

This should be a nice project for a first-time contributor, because it is 
localized to the edismax parser. That is spread out a bit, but not too bad. 
Besides, who gets to work on a three-digit Jira issue?

Two notes:

1. You’ll get the urge to rewrite the whole damned edismax config parser with a 
real parser generator. Resist that and just make the change.
2. It isn’t possible to have a higher boost for an exact match and a lower 
boost for a fuzzy match because it only handles one config spec per field name. 
And it doesn’t throw an error for the second time, either. It really should 
handle “title^4 title~^2”. The workaround is to make a copy of the title 
field, as sketched below. Maybe that should be a separate Jira issue?
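
The copy-field workaround from note 2 would look something like this in the 
schema (field and type names are illustrative):

    <field name="title_fuzzy" type="text_general" indexed="true" stored="false"/>
    <copyField source="title" dest="title_fuzzy"/>

Then the qf could be “title^4 title_fuzzy~^2”: exact matches on title get the 
high boost, fuzzy matches on the copy get the low one.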

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: ZkStateReader.getUpdateLock / ClusterState immutability

2024-07-16 Thread Walter Underwood
Would per-replica state (PRS) help with that? That slices by replica, not 
collection, but it should allow finer-grained locking.

https://searchscale.com/blog/prs/
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 16, 2024, at 9:03 AM, David Smiley  wrote:
> 
> At work, in a scenario when a node starts with thousands of cores for
> thousands of collections, we've seen that core registration can
> bottleneck on ZkStateReader.forceUpdateCollection(collection) which
> synchronizes on getUpdateLock, a global lock (not per-collection).  I
> don't know the history or strategy behind that lock, but it's a
> code-smell to see a global lock that is used in a circumstance that is
> scoped to one collection.  I suspect it's there because ClusterState
> is immutable and encompasses basically all state.  If it was instead a
> cache that can be snapshotted (for consumers that require an immutable
> state to act on), we could probably make getUpdateLock go away.  *If*
> a collection's state needs to be locked (and I'm suspicious that it
> is, so long as cache insertion is done properly / exclusively), we
> could have a lock just for the collection.
> 
> Any concerns with this idea?
> 
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> For additional commands, e-mail: dev-h...@solr.apache.org
> 



Re: Compatibility of Solrj with older versions of Solr

2024-05-15 Thread Walter Underwood
First, this question belongs on the users@solr.apache.org mailing list. 
Second, I would not use any SolrJ later than 6.6 against Solr 6.6.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 15, 2024, at 10:48 AM, Todd Stevenson 
>  wrote:
> 
> I’m trying to upgrade the Java apps I support to current versions of 
> SpringBoot and Java.   I’m wanting to use the later versions of Solrj also.   
> These apps run against Solr 6.6  (I have no control over upgrading Solr).   
> What versions of Solrj are compatible with Solr 6.6.  I’ve looked extensively 
> to find this information and can’t see it in the documentation.
>  
> Can you point me to a  Solrj user guide?  The only documentation I can see 
> are the javadocs.   I need more help than the javadocs.
>  
> Thank you so much.
>  
> Todd Stevenson
> Software Engineer – Technical Lead
> Intermountain Health, Canyons Region
> Cell: 801-589-1115
> Work Schedule:  Monday to Thursday
>  
>  
>  
> NOTICE: This e-mail is for the sole use of the intended recipient and may 
> contain confidential and privileged information. If you are not the intended 
> recipient, you are prohibited from reviewing, using, disclosing or 
> distributing this e-mail or its contents. If you have received this e-mail in 
> error, please contact the sender by reply e-mail and destroy all copies of 
> this e-mail and its contents.



Re: solr query alerting

2024-05-01 Thread Walter Underwood
The functionality is alerts, but that doesn’t mean it has to be a push API. 
Alerts can be fetched just as easily as pushed.

I don’t know the limits of this proposal, but LexisNexis needs alerting as we 
move all of our 114 billion documents onto Solr. I’m retiring this week, so I 
won’t be around to implement it, but that is one potential large customer.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 1, 2024, at 2:26 PM, Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) 
>  wrote:
> 
>> I kind of like "search-alerts". "query-alerts" sounds like alerting on
>> query metrics, but IMO "search-alerts" doesn't come with the same baggage.
> 
> Someone in the PR had mentioned that "alerts" is a bit off because the 
> proposal does not really manage alerts and it feels too far out of solr's 
> domain. The current approach, much like percolator, simply exposes a 
> request/response API that then can be **used** by an alerting system 
> (request/stream could also be considered if there is worry about 
> scaling the number of queries one request can match). 
> 
>> I think this is certainly something that can start in the sandbox and move
>> into the main repo once it's clear that there is interest from
>> multiple committers and community members in using and maintaining it.
> 
> I've seen many homegrown/complex solutions of percolator-type functionality 
> so even this narrower "inverted search" solution has **some** use but 
> admittedly this is a niche area. It might not really gain traction unless it 
> is marketed the right way as there are probably very few solr users that 
> happen to be thinking about revamping their saved-search platform in any 
> given year. Given that, what do you think I can do to reach them? :-)
> 
> I am trying my best to talk about this within my firm but the sample is 
> obviously smaller.
> 
> From: dev@solr.apache.org At: 05/01/24 16:16:50 UTC-4:00To:  
> dev@solr.apache.org
> Subject: Re: solr query alerting
> 
> I think I'd prefer a more self-descriptive name than "Luwak", which is just
> a product name that was decided a while ago.
> 
> I kind of like "search-alerts". "query-alerts" sounds like alerting on
> query metrics, but IMO "search-alerts" doesn't come with the same baggage.
> 
> Luwak is fine though if everyone agrees on that.
> 
> On one hand we have a number of committers here from
>> Bloomberg, yet the abandoned and now-removed "analytics" component
>> shows that abandonment is a risk nonetheless.
>> 
> 
> I don't want to bikeshed here, but I'm not sure this is a fair
> assessment of what happened with the analytics module.
> Sure there wasn't a ton of development, but in general it was feature rich
> and had very little feature requests.
> It was removed in 10, because a lack of user usage, not because it was
> "abandoned" IMO. If there were requests from users
> to keep it or improve it, then it would be a much different story. The
> whole "thrown over the wall" comment is fair, but
> not particularly relevant to this PR, which is being worked on in public.
> 
> I think this is certainly something that can start in the sandbox and move
> into the main repo once it's clear that there is interest from
> multiple committers and community members in using and maintaining it.
> 
> - Houston
> 
> On Wed, May 1, 2024 at 2:32 PM David Smiley  wrote:
> 
>> Luwak is good to me!
>> 
>> On Tue, Apr 30, 2024 at 4:01 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD
>> A)  wrote:
>>> 
>>> I love the name "luwak"! I was about to suggest the same but was worried
>> about the trademark concerns and I assumed there was a reason they changed
>> the name when donating it to lucene.
>>> 
>>> From: dev@solr.apache.org At: 04/30/24 15:56:22 UTC-4:00To:
>> dev@solr.apache.org
>>> Subject: Re: solr query alerting
>>> 
>>> Luwak is the original name of the Lucene monitor, contributed by Flax
>> back in
>>> the days: https://github.com/flaxsearch/luwak
>>> 
>>> Perhaps we could go full circle (if no trademark issues) to call it the
>> Solr
>>> luwak module? Luwak is a type of coffee, and thus related to percolator
>> 😉
>>> 
>>> Otherwise “stored-queries” is an option.
>>> 
>>> Jan Høydahl
>>> 
>>>> On 30 Apr 2024, at 19:26, David Smiley  wrote:
>>>> 
>>>> I agree the feature is relevant / useful.
>>>>

Re: solr query alerting

2024-05-01 Thread Walter Underwood
Do people want to spend the next ten years explaining that the alerting 
feature is called “Luwak”? I’d call it “Alerting” or “Alerts”.  —wunder

> On May 1, 2024, at 1:16 PM, Houston Putman  wrote:
> 
> I think I'd prefer a more self-descriptive name than "Luwak", which is just
> a product name that was decided a while ago.
> 
> I kind of like "search-alerts". "query-alerts" sounds like alerting on
> query metrics, but IMO "search-alerts" doesn't come with the same baggage.
> 
> Luwak is fine though if everyone agrees on that.
> 
> On one hand we have a number of committers here from
>> Bloomberg, yet the abandoned and now-removed "analytics" component
>> shows that abandonment is a risk nonetheless.
>> 
> 
> I don't want to bikeshed here, but I'm not sure this is a fair
> assessment of what happened with the analytics module.
> Sure there wasn't a ton of development, but in general it was feature rich
> and had very little feature requests.
> It was removed in 10, because a lack of user usage, not because it was
> "abandoned" IMO. If there were requests from users
> to keep it or improve it, then it would be a much different story. The
> whole "thrown over the wall" comment is fair, but
> not particularly relevant to this PR, which is being worked on in public.
> 
> I think this is certainly something that can start in the sandbox and move
> into the main repo once it's clear that there is interest from
> multiple committers and community members in using and maintaining it.
> 
> - Houston
> 
> On Wed, May 1, 2024 at 2:32 PM David Smiley  wrote:
> 
>> Luwak is good to me!
>> 
>> On Tue, Apr 30, 2024 at 4:01 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD
>> A)  wrote:
>>> 
>>> I love the name "luwak"! I was about to suggest the same but was worried
>> about the trademark concerns and I assumed there was a reason they changed
>> the name when donating it to lucene.
>>> 
>>> From: dev@solr.apache.org At: 04/30/24 15:56:22 UTC-4:00To:
>> dev@solr.apache.org
>>> Subject: Re: solr query alerting
>>> 
>>> Luwak is the original name of the Lucene monitor, contributed by Flax
>> back in
>>> the days: https://github.com/flaxsearch/luwak
>>> 
>>> Perhaps we could go full circle (if no trademark issues) to call it the
>> Solr
>>> luwak module? Luwak is a type of coffee, and thus related to percolator
>> 😉
>>> 
>>> Otherwise “stored-queries” is an option.
>>> 
>>> Jan Høydahl
>>> 
 On 30 Apr 2024, at 19:26, David Smiley  wrote:
 
 I agree the feature is relevant / useful.
 
 Another angle on the module vs sandbox or wherever else is maintenance
 cost.  If a lot of code is being contributed as is here, then as a PMC
 member I hope to get a subjective sense that folks are interested in
 maintaining it.  On one hand we have a number of committers here from
 Bloomberg, yet the abandoned and now-removed "analytics" component
 shows that abandonment is a risk nonetheless.  I don't know how to
 conclude this thought but I'm hoping to hear from folks that they
 intend to look after this module.  It's not just being "thrown over
 the wall", so to speak.
 
 Naming is hard...
 * ...-monitor-: sorry I hate it
 * ...-percolator- No clue why this was chosen for ElasticSearch.
 I can appreciate a curious/non-obvious name like this that is not
 going to conflict with anyone's guesses at what a general name might
 convey.
 * "indexed-queries" or "query-indexing" would be a good name?  This is
 the best technical name I can think of.
 *  "reverse search" came to mind (based on the Netflix article)
 although that makes me think of leading-wildcard / suffix-search.
 * "inverted-search"
 *  "indexed-query-alerts" incorporates "alerts" thus might better
 convey the use-case
 
> On Mon, Apr 1, 2024 at 3:53 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD
> A)  wrote:
> 
> Hi All,
> 
> A few months ago I wrote the user list about potentially integrating
>> lucene
>>> monitor into solr. I have raised this PR with a first attempt at
>> implementing
>>> this integration. I'd greatly appreciate any feedback on this even
>> though I
>>> still have it marked as draft. I want to make sure I'm heading in the
>> right
>>> direction here so input from solr dev community would be extremely
>> valuable :-)
> 
> Many thanks,
> Luke
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
 For additional commands, e-mail: dev-h...@solr.apache.org
 
>>> 
>>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
>> For additional commands, e-mail: dev-h...@solr.apache.org
>> 
>> 


-
To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
For additional commands, e-mail: dev-h...@solr.apache.org

Re: timeout HTTP response code; use 524?

2024-03-19 Thread Walter Underwood
I still think 503 is appropriate when timeAllowed is exceeded. The service 
requested is a response within the set time. That service is not available. Here 
are the RFC definitions of 500 and 503. Exceeding timeAllowed isn’t an 
“unexpected condition”, it is part of the normal operation of that limit.

6.6.1.  500 Internal Server Error

   The 500 (Internal Server Error) status code indicates that the server
   encountered an unexpected condition that prevented it from fulfilling
   the request.
https://datatracker.ietf.org/doc/html/rfc7231#section-6.6.1

6.6.4.  503 Service Unavailable

   The 503 (Service Unavailable) status code indicates that the server
   is currently unable to handle the request due to a temporary overload
   or scheduled maintenance, which will likely be alleviated after some
   delay.  The server MAY send a Retry-After header field
   (Section 7.1.3) to suggest an appropriate amount of time for the
   client to wait before retrying the request
https://datatracker.ietf.org/doc/html/rfc7231#section-6.6.4

Solr could even return 503 with a message of “timeAllowed exceeded”.
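
Roughly like this; the body is a sketch of Solr’s usual error envelope, not a 
promise about the exact format:

    HTTP/1.1 503 Service Unavailable
    Retry-After: 5

    {"error": {"msg": "timeAllowed exceeded", "code": 503}}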

I spent about a decade working on a search engine with an integrated web 
spider. Accurate HTTP response codes are really useful.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 19, 2024, at 3:12 PM, Chris Hostetter  wrote:
> 
> 
> Agree on all of Uwe's points below
> 
> I think 500 is the most appropriate for exceeding QueryLimits -- 
> unless/until we decide we want Solr to start using custom response codes in 
> some cases, but in that case i would suggest we explicitly *avoid* using 
> 504, 524, & 529 precisely because they already have specific meanings in 
> well known HTTP proxies/services that don't match what we're talking about 
> here.
> 
> As far as one of David's specific observations...
> 
> : > ideal IMO because Solr's health could reasonably be judged by looking
> : > for 500's specifically as a sign of a general error that service
> : > operators should pay attention to.
> 
> Any client that is interpreting a '500' error as a *general* indication of 
> a problem with Solr, and not specific to that request, would not be 
> respecting the spec on what '500' means.  *Some* '5xx' are documented 
> to indicate that there may be a general problem afflicting the 
> server/service as a whole (notably '503') but most do not.
> 
> But i also think that if we really want to cover our basis -- we can 
> always make it configurable.  Let people configure Solr to return 
> 500, 400, 418, 666, 999, ... wtf they want ... but 500 is probably the 
> best sane default that doesn't carry around implicit baggage.
> 
> : 524 or 504 both refer to timeouts, but both are meant for proxies (so 
> reverse
> : proxy can't reach the backend server in time). So both of them do not match.
> : 
> : 408 is "request timeout", but that's client's fault (4xx code). In that case
> : its a more technical code because it also requires to close the connection 
> and
> : not keep it alive, so we can't trigger that from Servlet API in a correct 
> way.
> : 
> : 503 does not fit well as Solr is not overloaded, but would be the only
> : alternative I see. Maybe add a new Solr-specific one? Anyways, I think 500
> : seems the best response unless you find another one not proxy-related.
> : 
> : Uwe
> 
> 
> -Hoss
> http://www.lucidworks.com/
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> For additional commands, e-mail: dev-h...@solr.apache.org
> 



Re: timeout HTTP response code; use 524?

2024-03-18 Thread Walter Underwood
503 Service Unavailable is the standard response for down or overloaded. I 
don’t see that 529 is significantly different. 

I do think it is a good idea to distinguish overload or down conditions from 
the catch-all 500 error. I interpret that as a broken server, not one that is 
functioning properly but overloaded.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 18, 2024, at 3:23 PM, David Smiley  wrote:
> 
> If timeAllowed is set and Solr takes too long then we fail the
> response with an HTTP 500 response code.  It's not bad but it's not
> ideal IMO because Solr's health could reasonably be judged by looking
> for 500's specifically as a sign of a general error that service
> operators should pay attention to.  There is a 529 response code used
> by CloudFlare (judging from Wikipedia):
> https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
> 
> Any opinion on the use of 529 instead of 500; or alternative perspectives?
> 
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> For additional commands, e-mail: dev-h...@solr.apache.org



Re: Moving to bin/solr start defaulting to SolrCloud mode?

2024-02-28 Thread Walter Underwood
Standalone makes sense for the configs. Each node has its own local set of 
configs which are not shared.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 28, 2024, at 10:51 AM, David Smiley  wrote:
> 
> On Wed, Feb 28, 2024 at 7:50 AM Gus Heck  wrote:
>> IIRC "standalone" was deemed the wrong color for the shed because
>> [original/non-cloud/standalone/legacy/user-managed] solr can have more than
>> one machine, and does distributed search.
> 
> Nonetheless each node acts alone and/or acts on requests which include
> URLs.  "standalone" may be an imperfect word but perhaps no word is
> perfect.  What "standalone" has going for it is mindshare / usage for
> a decade now.
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> For additional commands, e-mail: dev-h...@solr.apache.org
> 



Re: MixedCase or dashed-case for long options in Solr CLI?

2024-02-26 Thread Walter Underwood
Long options are dashed-case, following the GNU convention. POSIX only 
specifies single character options. The “--” prefix for long options is a GNU 
invention, as far as I know. Older Unix commands with long option names, e.g. 
find, only use a single dash.
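
A quick contrast of the three styles:

    -n                  (POSIX short option)
    --dry-run           (GNU long option, dashed-case)
    find . -name '*.c'  (older single-dash long option)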

https://www.gnu.org/software/libc/manual/html_node/Argument-Syntax.html
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap12.html

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 26, 2024, at 5:29 AM, Eric Pugh  
> wrote:
> 
> I hear a vote for dashed-case, how about some more votes?   --solr-update-url 
> versus --solrUpdateUrl?
> 
> 
> 
>> On Feb 26, 2024, at 7:29 AM, Jason Gerlowski  wrote:
>> 
>> My guess is that "dashed-case" is slightly more common -- at least,
>> that's my sense from haphazardly checking a few tools I use often
>> ("curl", "kubectl", "git", "docker").
>> 
>> But I don't have an opinion as long as we're internally consistent
>> about using one convention or the other.
>> 
>> Best,
>> 
>> Jason
>> 
>> On Sat, Feb 24, 2024 at 11:35 AM Eric Pugh
>> mailto:ep...@opensourceconnections.com>> 
>> wrote:
>>> 
>>> Hi all,
>>> 
>>> I wanted to get the community’s input on formatting of long options for the 
>>> Solr CLI.   I noticed on https://commons.apache.org/proper/commons-cli/ 
>>> that their examples all are --dashed-case.
>>> 
>>> However, we have --solrUrl or --zkHost as our pattern.   Though in working on 
>>> the PostTool, I used --solr-update-url as the parameter because I had been 
>>> reading the commons-cli docs...
>>> 
>>> I’d like to get this sorted so that I can get 
>>> https://issues.apache.org/jira/browse/SOLR-16824 over the finish line.   So 
>>> please do speak up with preferences!   (And please let’s not support both!)
>>> 
>>> 
>>> The changes to the formatting will be a 10x thing.
>>> 
>>> Eric
>>> 
>>> ___
>>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
>>> http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
>>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
>>> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>>> This e-mail and all contents, including attachments, is considered to be 
>>> Company Confidential unless explicitly stated otherwise, regardless of 
>>> whether attachments are marked as such.
>>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
>> For additional commands, e-mail: dev-h...@solr.apache.org
> ___
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
> http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>  
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>   
> This e-mail and all contents, including attachments, is considered to be 
> Company Confidential unless explicitly stated otherwise, regardless of 
> whether attachments are marked as such.
> 



Re: Use cases for interacting direct with ZK versus using our APIs?

2024-02-11 Thread Walter Underwood
Zookeeper file size limits are probably the most common failure. I had to mess 
around a lot with our suggestion dictionary to get it to upload.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 11, 2024, at 11:25 AM, Eric Pugh  
> wrote:
> 
> Ah.. yeah, I can’t speak to Solr 6.x!   In 9x at least you could use the 
> configset API to deploy configs and avoid the direct ZK interaction.
> 
> It would be interesting to explore if the process of deploying a configset is 
> risky, has a high chance of things failing, then how do we account for that 
> as part of the process?So you don’t have to do things like upload the 
> previous config ;-).
> 
> And other common reasons to use ZK directly?
> 
>> On Feb 11, 2024, at 12:14 PM, Walter Underwood  wrote:
>> 
>> That was deploying configs with Jenkins on Solr 6.x. Maybe the APIs were 
>> there, but I didn't know about them.
>> 
>> Rebuilding the suggester did need external help, since that needs to be done 
>> separately on each node.
>> 
>> I think working directly with Zookeeper is less risky. If there is any issue 
>> with the upload, then don’t reload the collections. You can back out the 
>> changes by uploading the previous config to Zookeeper.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 11, 2024, at 11:07 AM, Eric Pugh  wrote:
>>> 
>>> Could you share more about “update Solr remotely” that you were doing?   
>>> Are we missing some APIs that would have made whatever you had to do 
>>> require ZK direct access?   
>>> 
>>> While it’s cool that we can impact Solr via hacking around in ZK, it also 
>>> seems like an approach fraught with risk!
>>> 
>>>> On Feb 11, 2024, at 11:32 AM, Walter Underwood  
>>>> wrote:
>>>> 
>>>> I wanted something that didn’t require installing Solr locally in order to 
>>>> update Solr remotely, so I didn’t use the provided zk commands. I wrote 
>>>> some Python to dig the Zookeeper addresses out of clusterstatus (I think) 
>>>> then uploaded directly to Zookeeper with the Python kazoo package.
>>>> 
>>>> The tool had a bunch of other things, like async reload checking for 
>>>> results, and rebuilding suggestion dictionaries on each node.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>>> On Feb 11, 2024, at 9:04 AM, Gus Heck  wrote:
>>>>> 
>>>>> I pretty much always use zk upconfig, which also works for overwriting
>>>>> existing. I certainly tell my clients to use apis from the ref guide for
>>>>> such operations, but zk upconfig certainly counts as one. Mostly I tell
>>>>> them that they should only break out things like
>>>>> https://github.com/rgs1/zk_shell as a last resort (which is what I think 
>>>>> of
>>>>> as direct modification), and if they are unsure, call me *before* doing
>>>>> anything in zk directly.
>>>>> 
>>>>> By the way, I don't know if this has come up in a dev/build setting or 
>>>>> not,
>>>>> but are you aware of https://plugins.gradle.org/search?term=solr ? It is
>>>>> presently only really suitable for local dev, with a single config set, 
>>>>> but
>>>>> could easily grow patches and suggestions welcome of course.
>>>>> 
>>>>> On Sun, Feb 11, 2024, 9:10 AM Eric Pugh 
>>>>> wrote:
>>>>> 
>>>>>> Hi all..   I was playing around with a cluster and wanted to upload a
>>>>>> configset into Solr….
>>>>>> 
>>>>>> I ran bin/solr and noticed a bin/solr config -h command, but it just lets
>>>>>> me tweak a config.   Then I ran bin/solr create -h and it appears to let 
>>>>>> me
>>>>>> upload a configset, but I have to create the collection as well, and I’m
>>>>>> not ready to do that.
>>>>>> 
>>>>>> Then I poked around and discovered hidden under bin/solr zk a command
>>>>>> upconfig…. So bin/solr zk upconfig will let me get my configset into 
>>>>>> Solr,

Re: Use cases for interacting direct with ZK versus using our APIs?

2024-02-11 Thread Walter Underwood
That was deploying configs with Jenkins on Solr 6.x. Maybe the APIs were there, 
but I didn't know about them.

Rebuilding the suggester did need external help, since that needs to be done 
separately on each node.

I think working directly with Zookeeper is less risky. If there is any issue 
with the upload, then don’t reload the collections. You can back out the 
changes by uploading the previous config to Zookeeper.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 11, 2024, at 11:07 AM, Eric Pugh  
> wrote:
> 
> Could you share more about “update Solr remotely” that you were doing?   Are 
> we missing some APIs that would have made whatever you had to do require ZK 
> direct access?   
> 
> While it’s cool that we can impact Solr via hacking around in ZK, it also 
> seems like an approach fraught with risk!
> 
>> On Feb 11, 2024, at 11:32 AM, Walter Underwood  wrote:
>> 
>> I wanted something that didn’t require installing Solr locally in order to 
>> update Solr remotely, so I didn’t use the provided zk commands. I wrote some 
>> Python to dig the Zookeeper addresses out of clusterstatus (I think) then 
>> uploaded directly to Zookeeper with the Python kazoo package.
>> 
>> The tool had a bunch of other things, like async reload checking for 
>> results, and rebuilding suggestion dictionaries on each node.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 11, 2024, at 9:04 AM, Gus Heck  wrote:
>>> 
>>> I pretty much always use zk upconfig, which also works for overwriting
>>> existing. I certainly tell my clients to use apis from the ref guide for
>>> such operations, but zk upconfig certainly counts as one. Mostly I tell
>>> them that they should only break out things like
>>> https://github.com/rgs1/zk_shell as a last resort (which is what I think of
>>> as direct modification), and if they are unsure, call me *before* doing
>>> anything in zk directly.
>>> 
>>> By the way, I don't know if this has come up in a dev/build setting or not,
>>> but are you aware of https://plugins.gradle.org/search?term=solr ? It is
>>> presently only really suitable for local dev, with a single config set, but
>>> could easily grow patches and suggestions welcome of course.
>>> 
>>> On Sun, Feb 11, 2024, 9:10 AM Eric Pugh 
>>> wrote:
>>> 
>>>> Hi all..   I was playing around with a cluster and wanted to upload a
>>>> configset into Solr….
>>>> 
>>>> I ran bin/solr and noticed a bin/solr config -h command, but it just lets
>>>> me tweak a config.   Then I ran bin/solr create -h and it appears to let me
>>>> upload a configset, but I have to create the collection as well, and I’m
>>>> not ready to do that.
>>>> 
>>>> Then I poked around and discovered hidden under bin/solr zk a command
>>>> upconfig…. So bin/solr zk upconfig will let me get my configset into Solr,
>>>> but does require me to remember what my magic ZK string is ;-).
>>>> 
>>>> I went and checked the ref guide, and yes, it states that there are two
>>>> ways:
>>>> 
>>>> A configset can be uploaded to ZooKeeper either via the Configsets API <
>>>> https://solr.apache.org/guide/solr/latest/configuration-guide/configsets-api.html>
>>>> or more directly via bin/solr zk upconfig <
>>>> https://solr.apache.org/guide/solr/latest/deployment-guide/solr-control-script-reference.html#upload-a-configuration-set>.
>>>> The Configsets API has some other operations as well, and likewise, so does
>>>> the CLI.
>>>> 
>>>> Are there use cases where interacting directly with ZooKeeper is preferred
>>>> over making changes via the APIs?  Of is the use of bin/solr zk upconfig
>>>> more of a evolutionary byproduct of how we built SolrCloud?
>>>> 
>>>> Eric
>>>> 
>>>> ___
>>>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
>>>> http://www.opensourceconnections.com <
>>>> http://www.opensourceconnections.com/> | My Free/Busy <
>>>> http://tinyurl.com/eric-cal>
>>>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
>>>> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>>>> 
>>>> This 

Re: Use cases for interacting direct with ZK versus using our APIs?

2024-02-11 Thread Walter Underwood
I wanted something that didn’t require installing Solr locally in order to 
update Solr remotely, so I didn’t use the provided zk commands. I wrote some 
Python to dig the Zookeeper addresses out of clusterstatus (I think) then 
uploaded directly to Zookeeper with the Python kazoo package.
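
A stripped-down sketch of that flow. The cluster-status key for the Zookeeper 
address and the znode path are from memory, so treat the specifics as 
assumptions:

    import requests
    from kazoo.client import KazooClient

    # find the Zookeeper connect string via the Collections API
    status = requests.get(
        "http://solr-host:8983/solr/admin/collections",
        params={"action": "CLUSTERSTATUS", "wt": "json"},
    ).json()
    zk_hosts = status["cluster"]["zkHost"]  # hypothetical key name

    # push one config file straight into the configset znode
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    path = "/configs/myconfig/solrconfig.xml"
    data = open("solrconfig.xml", "rb").read()
    if zk.exists(path):
        zk.set(path, data)
    else:
        zk.create(path, data, makepath=True)
    zk.stop()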

The tool had a bunch of other things, like async reload checking for results, 
and rebuilding suggestion dictionaries on each node.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 11, 2024, at 9:04 AM, Gus Heck  wrote:
> 
> I pretty much always use zk upconfig, which also works for overwriting
> existing. I certainly tell my clients to use apis from the ref guide for
> such operations, but zk upconfig certainly counts as one. Mostly I tell
> them that they should only break out things like
> https://github.com/rgs1/zk_shell as a last resort (which is what I think of
> as direct modification), and if they are unsure, call me *before* doing
> anything in zk directly.
> 
> By the way, I don't know if this has come up in a dev/build setting or not,
> but are you aware of https://plugins.gradle.org/search?term=solr ? It is
> presently only really suitable for local dev, with a single config set, but
> could easily grow patches and suggestions welcome of course.
> 
> On Sun, Feb 11, 2024, 9:10 AM Eric Pugh 
> wrote:
> 
>> Hi all..   I was playing around with a cluster and wanted to upload a
>> configset into Solr….
>> 
>> I ran bin/solr and noticed a bin/solr config -h command, but it just lets
>> me tweak a config.   Then I ran bin/solr create -h and it appears to let me
>> upload a configset, but I have to create the collection as well, and I’m
>> not ready to do that.
>> 
>> Then I poked around and discovered hidden under bin/solr zk a command
>> upconfig…. So bin/solr zk upconfig will let me get my configset into Solr,
>> but does require me to remember what my magic ZK string is ;-).
>> 
>> I went and checked the ref guide, and yes, it states that there are two
>> ways:
>> 
>> A configset can be uploaded to ZooKeeper either via the Configsets API <
>> https://solr.apache.org/guide/solr/latest/configuration-guide/configsets-api.html>
>> or more directly via bin/solr zk upconfig <
>> https://solr.apache.org/guide/solr/latest/deployment-guide/solr-control-script-reference.html#upload-a-configuration-set>.
>> The Configsets API has some other operations as well, and likewise, so does
>> the CLI.
>> 
>> Are there use cases where interacting directly with ZooKeeper is preferred
>> over making changes via the APIs?  Of is the use of bin/solr zk upconfig
>> more of a evolutionary byproduct of how we built SolrCloud?
>> 
>> Eric
>> 
>> ___
>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
>> http://www.opensourceconnections.com <
>> http://www.opensourceconnections.com/> | My Free/Busy <
>> http://tinyurl.com/eric-cal>
>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
>> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>> 
>> This e-mail and all contents, including attachments, is considered to be
>> Company Confidential unless explicitly stated otherwise, regardless of
>> whether attachments are marked as such.
>> 
>> 



Re: Collections LIST semantics

2024-01-29 Thread Walter Underwood
If a program gets a list from a remote server, then expects that list to be 
accurate when it makes calls based on it, well, my kindest thought is 
“charmingly naive”. Really, that is just bad code that hasn’t broken yet.

That is true even if it gets a list from Zookeeper. Things change while you 
aren’t looking at them.

Solr could make that happen less often or more often, but it will happen.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 29, 2024, at 10:42 AM, Jason Gerlowski  wrote:
> 
> Thanks for calling this out more explicitly; definitely worth discussing.
> 
>> If a client/caller/user lists collections and then loops them to take
>> some action on them, it needs to be tolerant of the collection not working;
>> may seem to not exist.
> 
> I'd go even a step further and say that users should always have
> error-handling around their calls to Solr.
> 
> But even so I'm leery of changing the semantics here.  I think the
> assumption of most folks is that each entry returned by a "list" exists
> fully, unless the response gives more granular info to augment that.  I'd
> worry that returning partially-created or partially-deleted collections
> would be confusing and unintuitive to most users.  (e.g. Imagine iterating
> over a "list", getting a not-found error running some operation on one of
> the entries, but still seeing the collection when you call "list" again to
> double-check.)
> 
> I understand the need for a more scalable API, or a way to detect orphaned
> data in ZK.  But I'd personally rather not see us change the LIST semantics
> to accomplish that.  If you need the ZK child nodes, is there maybe a
> scalable way to invoke ZookeeperInfoHandler to get that information?
> 
> Best,
> 
> Jason
> 
> On Fri, Jan 26, 2024 at 2:46 PM David Smiley  wrote:
> 
>> https://issues.apache.org/jira/browse/SOLR-16909
>>> Collections LIST command should fetch ZK data, not cached state
>> 
>> I want to get further input from folks that changing the semantics is
>> okay.  If the change is applied, LIST will be much faster but it will
>> return collections that have not yet been fully constructed or
>> deleted.  If a client/caller/user lists collections and then loops
>> them to take some action on them, it needs to be tolerant of the
>> collection not working; may seem to not exist.  I argue callers should
>> *already* behave in this way or it may be brittle to circumstances
>> that are hard to reason about.  On the other hand, maybe this would
>> increase the frequency of errors to existing clients that didn't
>> encounter this in testing?  Shrug.  I could imagine ways to solve this
>> but it would add some complexity and it's not clear it's worthwhile.
>> 
>> A related aside: the method ClusterStatus.getCollectionsMap is not
>> scalable for clusters with 10K+ collections because it loops every
>> collection to fetch the latest state from ZK, putting a massive load
>> on ZK.  Our implementation of collection listing calls it, as does a
>> number of places across Solr.  Some could be changed with relative
>> ease; some are more thorny.  I'd love to rename this thing, putting
>> "slow" in the name so that you think twice before calling it :-)
>> 
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
>> For additional commands, e-mail: dev-h...@solr.apache.org
>> 
>> 



Re: New Feature: Query Elevation on "fq" field.

2023-11-24 Thread Walter Underwood
fq doesn’t calculate scores, so it doesn’t do any ranking. Query elevation for 
fq doesn’t make any sense.
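
For reference, the split in a request looks like this: only q contributes to 
the score, fq just filters.

    q=title:solr          (scored, contributes to ranking)
    fq=category:books     (filter only, no score contribution)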

What problem do you think this solves?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 23, 2023, at 1:33 PM, Mouhcine Boutinzer 
>  wrote:
> 
> Hi there,
> I am planning to suggest/introduce a new Query Elevation feature to Solr.
> Currently, Solr supports Query Elevation for query parameter only (aka "q").
> What I suggest is to allow users to set a Query Elevation configuration for
> filter query parameter as well (aka "fq").
> I am willing to create a JIRA issue for that, but I thought it might be
> better if I communicate the idea to the dev team before submitting my issue.
> Many thanks.
> Regards,
> Mouhcine



Re: New branch and feature freeze for Solr 9.4.0

2023-10-03 Thread Walter Underwood
I think this is missing some words:

"A new Always-On trace id generator and the rid parameter is being deprecated”

Maybe “…generator has been added and…”? As it stands, it looks like the new 
trace id generator is being deprecated.

The circuit breaker descriptions are accurate, but probably do not need to be 
capitalized. In general, there seem to be extra capitalizations, like 
“Always-On” and “Backup, Restore, and Split”.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 3, 2023, at 2:00 PM, Alex Deparvu  wrote:
> 
> Please update the draft release notes if you have any suggestions:
> 
> https://cwiki.apache.org/confluence/display/SOLR/ReleaseNote9_4_0
> 
> best,
> alex
> 
> 
> 
> On Tue, Oct 3, 2023 at 11:26 AM Alex Deparvu  wrote:
> 
>> NOTICE:
>> 
>> Branch branch_9_4 has been cut and versions updated to 9.5 on stable the
>> branch.
>> 
>> Please observe the normal rules:
>> 
>> * No new features may be committed to the branch.
>> * Documentation patches, build patches and serious bug fixes may be
>>  committed to the branch. However, you should submit all patches you
>>  want to commit to Jira first to give others the chance to review
>>  and possibly vote against the patch. Keep in mind that it is our
>>  main intention to keep the branch as stable as possible.
>> * All patches that are intended for the branch should first be committed
>>  to the unstable branch, merged into the stable branch, and then into
>>  the current release branch.
>> * Normal unstable and stable branch development may continue as usual.
>>  However, if you plan to commit a big change to the unstable branch
>>  while the branch feature freeze is in effect, think twice: can't the
>>  addition wait a couple more days? Merges of bug fixes into the branch
>>  may become more difficult.
>> * Only Jira issues with Fix version 9.4 and priority "Blocker" will delay
>>  a release candidate build.
>> 



Re: Sitemap to get latest reference manual to rank in Google/Bing?

2023-09-21 Thread Walter Underwood
I would also prefer to have the old versions in web search.

Antora can build a sitemap.xml file, so the right place to do this work is 
probably in the ref guide part of the Solr build.

URLs that are not in the sitemap will still get indexed, so we can use the 
sitemap to hint that the latest guide is preferred. The entries would look 
something like this.


<url>
  <loc>https://solr.apache.org/guide/solr/latest/index.html</loc>
  <priority>0.80</priority>
</url>


Default priority is 0.5, so 0.8 would make the latest more important.

wunder

> On Sep 21, 2023, at 3:14 PM, Arrieta, Alejandro  
> wrote:
> 
> Hello,
> 
> Please don't remove the indexing of older Solr guides. It helps to search
> for "Solr X.Y what_to_search" and get the link to the corresponding guide.
> Thumbs up to give higher priority to the latest guide.
> 
> Kind Regards,
> Alejandro Arrieta
> 
> On Thu, Sep 21, 2023 at 3:42 PM Walter Underwood 
> wrote:
> 
>> Actually, the robots.txt file should also disallow the 9.x guides. That
>> won’t touch guide/latest.
>> 
>> User-agent: *
>> Disallow: /guide/9*
>> Disallow: /guide/8*
>> Disallow: /guide/7*
>> Disallow: /guide/6*
>> 
>> wunder
>> 
>>> On Sep 21, 2023, at 2:38 PM, Walter Underwood 
>> wrote:
>>> 
>>> I’m actually OK with them being indexed. It could be helpful to search
>> for “Solr 8.11 aliases” or something like that.
>>> 
>>> The priority attribute in sitemap.xml should boost the default, latest
>> manual and that shouldn’t require any web server config. I’m glad to craft
>> a static sitemap.xml file. One generated from the guide would be better,
>> but that can be a later improvement.
>>> 
>>> To get the old versions completely out of the index, add a robots.txt
>> file to the solr-site repo under contents/ with these lines:
>>> 
>>> User-agent: *
>>> Disallow: /guide/8*
>>> Disallow: /guide/7*
>>> Disallow: /guide/6*
>>> 
>>> Note that the wildcards on the paths aren't needed, but they help
>> humans understand that the disallows are a prefix match.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>> On Sep 21, 2023, at 12:08 PM, Houston Putman 
>> wrote:
>>>> 
>>>> I've been trying to get this working for the last year. Basically our
>> issue
>>>> is that the htaccess files do not add the right X-Robots-Tag header for
>> old
>>>> ref guide pages.
>>>> 
>>>> 
>> https://github.com/apache/solr-site/blob/main/themes/solr/templates/htaccess.ref-guide-old#L1
>>>> 
>>>> This works locally, but in the actual Solr site, the headers are not
>>>> returned. I have no idea why. Would love some help though, as I also
>> hate
>>>> seeing the old ref guide in the google results.
>>>> 
>>>> - Houston
>>>> 
>>>> On Thu, Sep 21, 2023 at 11:30 AM Walter Underwood <
>> wun...@wunderwood.org>
>>>> wrote:
>>>> 
>>>>> When I get web search results that include the Solr Reference Guide, I
>>>>> often get older versions (6.6, 7.4) in the results. I would prefer to
>>>>> always get the latest reference (
>>>>> https://solr.apache.org/guide/solr/latest/index.html).
>>>>> 
>>>>> I think we can list the URLs for that in a sitemap.xml file with a
>> higher
>>>>> priority to suggest to the crawlers that these are the preferred pages.
>>>>> 
>>>>> I don’t see a sitemap.xml or sitemap.xml.gz at
>> https://solr.apache.org <
>>>>> https://solr.apache.org/>.
>>>>> 
>>>>> Should we prefer the latest manual? How do we build/deploy a sitemap?
>> See:
>>>>> https://www.sitemaps.org/
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>> 
>>>>> 
>>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
>> For additional commands, e-mail: dev-h...@solr.apache.org
>> 
>> 


-
To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
For additional commands, e-mail: dev-h...@solr.apache.org



Re: Sitemap to get latest reference manual to rank in Google/Bing?

2023-09-21 Thread Walter Underwood
Actually, the robots.txt file should also disallow the 9.x guides. That won’t 
touch guide/latest.

User-agent: *
Disallow: /guide/9* 
Disallow: /guide/8* 
Disallow: /guide/7*
Disallow: /guide/6*

wunder

> On Sep 21, 2023, at 2:38 PM, Walter Underwood  wrote:
> 
> I’m actually OK with them being indexed. It could be helpful to search for 
> “Solr 8.11 aliases” or something like that.
> 
> The priority attribute in sitemap.xml should boost the default, latest manual 
> and that shouldn’t require any web server config. I’m glad to craft a static 
> sitemap.xml file. One generated from the guide would be better, but that can 
> be a later improvement.
> 
> To get the old versions completely out of the index, add a robots.txt file to 
> the solr-site repo under contents/ with these lines:
> 
> User-agent: *
> Disallow: /guide/8*
> Disallow: /guide/7*
> Disallow: /guide/6*
> 
> Note that the wildcards on the paths aren't needed, but they help humans 
> understand that the disallows are a prefix match.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Sep 21, 2023, at 12:08 PM, Houston Putman  wrote:
>> 
>> I've been trying to get this working for the last year. Basically our issue
>> is that the htaccess files do not add the right X-Robots-Tag header for old
>> ref guide pages.
>> 
>> https://github.com/apache/solr-site/blob/main/themes/solr/templates/htaccess.ref-guide-old#L1
>> 
>> This works locally, but in the actual Solr site, the headers are not
>> returned. I have no idea why. Would love some help though, as I also hate
>> seeing the old ref guide in the google results.
>> 
>> - Houston
>> 
>> On Thu, Sep 21, 2023 at 11:30 AM Walter Underwood 
>> wrote:
>> 
>>> When I get web search results that include the Solr Reference Guide, I
>>> often get older versions (6.6, 7.4) in the results. I would prefer to
>>> always get the latest reference (
>>> https://solr.apache.org/guide/solr/latest/index.html).
>>> 
>>> I think we can list the URLs for that in a sitemap.xml file with a higher
>>> priority to suggest to the crawlers that these are the preferred pages.
>>> 
>>> I don’t see a sitemap.xml or sitemap.xml.gz at https://solr.apache.org <
>>> https://solr.apache.org/>.
>>> 
>>> Should we prefer the latest manual? How do we build/deploy a sitemap? See:
>>> https://www.sitemaps.org/
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
> 


-
To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
For additional commands, e-mail: dev-h...@solr.apache.org



Re: Sitemap to get latest reference manual to rank in Google/Bing?

2023-09-21 Thread Walter Underwood
I’m actually OK with them being indexed. It could be helpful to search for 
“Solr 8.11 aliases” or something like that.

The priority attribute in sitemap.xml should boost the default, latest manual 
and that shouldn’t require any web server config. I’m glad to craft a static 
sitemap.xml file. One generated from the guide would be better, but that can be 
a later improvement.

To get the old versions completely out of the index, add a robots.txt file to 
the solr-site repo under contents/ with these lines:

User-agent: *
Disallow: /guide/8*
Disallow: /guide/7*
Disallow: /guide/6*

Note that the wildcards on the paths aren't needed, but they help humans 
understand that the disallows are a prefix match.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 21, 2023, at 12:08 PM, Houston Putman  wrote:
> 
> I've been trying to get this working for the last year. Basically our issue
> is that the htaccess files do not add the right X-Robots-Tag header for old
> ref guide pages.
> 
> https://github.com/apache/solr-site/blob/main/themes/solr/templates/htaccess.ref-guide-old#L1
> 
> This works locally, but in the actual Solr site, the headers are not
> returned. I have no idea why. Would love some help though, as I also hate
> seeing the old ref guide in the google results.
> 
> - Houston
> 
> On Thu, Sep 21, 2023 at 11:30 AM Walter Underwood 
> wrote:
> 
>> When I get web search results that include the Solr Reference Guide, I
>> often get older versions (6.6, 7.4) in the results. I would prefer to
>> always get the latest reference (
>> https://solr.apache.org/guide/solr/latest/index.html).
>> 
>> I think we can list the URLs for that in a sitemap.xml file with a higher
>> priority to suggest to the crawlers that these are the preferred pages.
>> 
>> I don’t see a sitemap.xml or sitemap.xml.gz at https://solr.apache.org <
>> https://solr.apache.org/>.
>> 
>> Should we prefer the latest manual? How do we build/deploy a sitemap? See:
>> https://www.sitemaps.org/
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 



Sitemap to get latest reference manual to rank in Google/Bing?

2023-09-21 Thread Walter Underwood
When I get web search results that include the Solr Reference Guide, I often 
get older versions (6.6, 7.4) in the results. I would prefer to always get the 
latest reference (https://solr.apache.org/guide/solr/latest/index.html).

I think we can list the URLs for that in a sitemap.xml file with a higher 
priority to suggest to the crawlers that these are the preferred pages.

I don’t see a sitemap.xml or sitemap.xml.gz at https://solr.apache.org.

Should we prefer the latest manual? How do we build/deploy a sitemap? See: 
https://www.sitemaps.org/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)