I can see that some of these configurations should be moved to
clusterprops.json, but I don't believe this is the case for all of them.
Some are configurations that target the local node (e.g., the sharedLib
path), some are needed before connecting to ZooKeeper (the zk config). As
for the configuration of global handlers and components: while in general
you do want to see the same conf across all nodes, you may not want the
changes to take effect atomically, and instead rely on a phased upgrade
(rolling, blue/green, etc.), where the conf goes together with the
binaries that are being deployed. I also fear that making the
configuration of some of these components dynamic means we have to make
the code handle them dynamically (e.g., recreate the CollectionsHandler
based on a callback from ZooKeeper). This would rarely be used in
reality, yet all our code would need to be restructured to handle it; I
fear this will complicate the code needlessly and may introduce leaks and
races of all kinds. If those components have configuration that should be
dynamic (some toggle, threshold, etc.), I'd love to see those as
clusterprops, mostly key-value.
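
To make that concrete, here is a rough sketch of the kind of key-value
clusterprops.json I'd be happy with (urlScheme is a real cluster property
today; "someFeature.enabled" is a made-up name, purely for illustration):

    {
      "urlScheme": "https",
      "someFeature.enabled": "true"
    }

These are values a component can simply re-read whenever it needs them,
with no need to re-instantiate anything on a ZooKeeper callback.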

If we were to put this configuration in clusterprops, would that mean I'm
only able to do config changes via the API? On a new cluster, do I need
to start Solr and then make a Collections API call to change the
collections handler? Or am I supposed to manually edit the clusterprops
file before starting Solr and push it to ZooKeeper (having a file
intended for both manual edits and API edits is bad IMO)? Maybe via the
CLI, but still, I'd need to do this for every cluster I create (vs.
having the solr.xml in my source repository and Docker image, for
example). I'd also lose the ability to keep this configuration in my git
repo.

I'm +1 to keeping node configuration local to the node, in the
filesystem. Currently, that's solr.xml. I've seen comments that XML is
difficult to read/write; I think that's personal preference, so while I
don't see it that way, I understand lots of people do and things have
been moving to other formats. I'm open to discussing that as a change.
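
For reference, the node-local settings I mean look roughly like this in
today's solr.xml (values filled in via system-property substitution, so
the same file can ship on every node):

    <solr>
      <!-- a path on this node's local filesystem -->
      <str name="sharedLib">${solr.sharedLib:}</str>
      <solrcloud>
        <!-- needed before the node can even connect to ZooKeeper -->
        <str name="zkHost">${zkHost:}</str>
        <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
      </solrcloud>
    </solr>

None of these can live in clusterprops: they are either per-node or
required before ZooKeeper is reachable.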

> However, 1, 2, and 3, are not trivial for a large number of Solr nodes
> and if they aren’t right diagnosing them can be “challenging”…
In my mind, solr.xml goes with your code. Having it up to date means
having all your nodes running the same version of your code. As I said,
this is the "desired state" of the cluster; it may not be the actual
state all the time (e.g., during deployments), and that's fine. Depending
on how you manage the cluster, you may want to live with different
versions for some time (you may have canaries or be doing a blue/green
deployment, etc.). Realistically speaking, if you have a 500+ node
cluster, you must already have a system in place to manage configuration
and versions; let's not bend over backwards for a situation that isn't
that realistic.

Let me give an example of what I fear about making these changes atomic.
Let's say I want to start using a new, custom HealthCheckHandler
implementation that I have put in a jar (and let's assume the jar is
already on all nodes). If I use solr.xml (where one can currently
configure this implementation, sketched below), I can do a phased
deployment (yes, this is a restart of all nodes); if the healthcheck
handler is buggy and fails requests, the nodes with the new code will
never show as healthy, so the deployment will likely stop (e.g., if you
are using Kubernetes with probes, those instances will keep restarting;
if you use an ASG in AWS you can do the same thing). If you make it an
atomic change, bye-bye cluster: all nodes will start reporting unhealthy,
and Kubernetes or the ASG will kill all of them. Good luck making API
calls to revert now, there is no node left to respond to those requests.
Hopefully you were using some sort of stable storage, because everything
ephemeral is gone. Bringing back that cluster is going to be a PITA. I
have seen similar things happen.
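
For clarity, the solr.xml hook I was referring to above looks something
like this (the class name is made up; the jar containing it must already
be on the node, e.g. via sharedLib):

    <solr>
      <!-- swap in a custom health check implementation -->
      <str name="healthCheckHandler">com.example.MyHealthCheckHandler</str>
    </solr>

Because this only takes effect when a node restarts with the new
binaries, a buggy implementation takes down only the nodes that have been
rolled so far, not the whole cluster at once.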


On Thu, Sep 3, 2020 at 9:40 AM Erick Erickson <erickerick...@gmail.com>
wrote:

> bq.  Isn’t solr.xml a way to hardcode config in a more flexible way
> than a Java class?
>
> Yes, and the problem word here is “flexible”. For a single-node system
> that flexibility is desirable. Flexibility comes at the cost of complexity,
> especially in the SolrCloud case. In this case, not so much Solr code
> complexity as operations complexity.
>
> For me this isn’t so much a question of functionality as
> administration/troubleshooting/barrier to entry.
>
> If:
> 1. you can guarantee that every solr.xml file on every node in your entire
> 500 node cluster is up to date
> 2. or you can guarantee that the solr.xml stored on ZooKeeper is up to
> date
> 3. and you can guarantee that clusterprops.json in cloud mode is
> interacting properly with whichever solr.xml is read
> 4. Then I’d have no problem with solr.xml.
>
> However, 1, 2, and 3, are not trivial for a large number of Solr nodes and
> if they aren’t right diagnosing them can be “challenging”…
>
> Imagine all the ways that “somehow” the solr.xml file on one node or more
> nodes of a 500 node cluster didn’t get updated and you’re trying to track
> down why query X isn’t working as you expect. Some of the time. When you
> happen to hit conditions X, Y and Z on a subrequest that goes to the node
> in question (which won’t be all of the time, or even possibly a significant
> fraction of the time). Do Containers matter here? Some glitch in Puppet or
> similar? Somebody didn’t follow every step in the process in the playbook?
> It doesn’t matter how you got into this situation, tracking it down would
> be a nightmare.
>
> Or, for that matter, you’ve solved all the distribution concerns and _can_
> guarantee 1 and 3. Then somebody pushes a solr.xml to ZK either
> intentionally or by mistake (OH, I thought I was on the QA system, oops).
> Now I get to spend a week tracking down why the guarantee of 1 is still
> true, it’s just not relevant any more.
>
> To me, it’s the same problem that is solved by the blob store for jar
> files, or having configsets in ZK. When I want something available to all
> my Solr instances, I do not want to have to run around to every node and
> determine that the object I copied there is the right one, especially if
> I’m trying to track down a problem.
>
> Sure, all my concerns can be solved, but why make it harder than it needs
> to be? Distributed systems are hard enough already…
>
> FWIW,
> Erick
>
>
>
>
> > On Sep 3, 2020, at 11:00 AM, Ilan Ginzburg <ilans...@gmail.com> wrote:
> >
> >  Isn’t solr.xml a way to hardcode config in a more flexible way than
> > a Java class?
>
>