Re: Solr configuration options

Noble Paul Wed, 02 Sep 2020 22:05:39 -0700

Let's take a step back and take a look at the history of Solr.

Long ago there was only standalone Solr with a single core.
There were 3 files:


* solr.xml : everything required for CoreContainer went here
* solrconfig.xml : per-core configuration went here
* schema.xml: this is not relevant for this discussion

Now we are in the cloud world where everything lives in ZK. This also
means there are potentially thousands of nodes reading configuration from
ZK. This is quite a convenient setup. The same configset is shared
by a very large number of nodes and everyone is happy.

But solr.xml still sticks out like a sore thumb. We have no idea what
it is for. Is it a node-specific configuration? Or is it something
that every single node in the cluster should have in common?

e.g. shardHandlerFactoryConfig.

Does it even make sense to use a separate
"shardHandlerFactoryConfig" for each node? Or should every
node have the same shardHandlerFactoryConfig? It makes no
sense at all to have a different config on each node. Here is an exhaustive
list of the parameters that we can configure in solr.xml:

https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/core/NodeConfig.java

99% of these parameters should not be configured on a per-node
basis. They should be shared across the cluster. If that is the
case, we should always have this XML file stored in ZK and not on
every node.

So, if this file is in ZK, does it make sense to be able to
update it in ZK without reloading all nodes? Yes, totally. Anyone
who has thousands of nodes in a cluster will definitely hesitate to
restart their clusters. Editing an XML file is extremely hard. Users hate
XML. Everyone is familiar with JSON and they love JSON. The entire
security framework runs off of security.json. It's extremely easy to
manipulate and read.

So we introduced clusterprops.json for some attributes we may change
in a live cluster. The advantages were:

* It's easy to manipulate this data. Simple APIs can be provided.
These APIs may validate your input. There is a near 1:1 mapping
between API input and the actual config.
* Users can directly edit this JSON using simple tools.
* We can change data live and it's quite easy to reload a specific component.
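
To make the first point concrete, here is a rough sketch in Python (not
Solr code). The schema, the property names and allowed values, and the
function set_cluster_prop are all assumptions made up for illustration.
It only shows how an API call can validate its input and still map
almost 1:1 onto the JSON that gets stored.

```python
import json

# Hypothetical schema: property name -> validator. Illustrative only,
# not the list of properties Solr actually accepts.
SCHEMA = {
    "urlScheme": lambda v: v in ("http", "https"),
    "maxCoresPerNode": lambda v: (
        isinstance(v, int) and not isinstance(v, bool) and v > 0
    ),
}


def set_cluster_prop(doc: str, name: str, value) -> str:
    """Return a new clusterprops-style JSON document with name set to value.

    A value of None removes the property. Unknown names and invalid
    values are rejected before anything is written.
    """
    if name not in SCHEMA:
        raise ValueError(f"unknown property: {name}")
    props = json.loads(doc) if doc else {}
    if value is None:
        props.pop(name, None)
    elif not SCHEMA[name](value):
        raise ValueError(f"invalid value for {name}: {value!r}")
    else:
        props[name] = value
    return json.dumps(props, sort_keys=True)
```

The API input (a name and a value) ends up as exactly one key in the
stored document, which is what keeps the file easy to read and to edit
by hand.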

My preference is:
* Support everything in solr.xml in clusterprops.json. Get rid of XML everywhere.
* This file shall be supported in standalone mode as well. There is no harm.
* Maybe there are a few attributes we do not want to configure on a
cluster-wide setup. Use a simple node.properties file for those. Nobody
likes XML configuration. It's error-prone to edit XML files.
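
A rough sketch of how the two sources could be combined, in Python
rather than Solr code. The function names are hypothetical, and the
rule that a node-local value wins over a cluster-wide one is my
assumption for the example; it is not something that exists today.

```python
import json


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines; blank lines and # comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {line!r}")
        props[key.strip()] = value.strip()
    return props


def effective_config(cluster_json: str, node_properties: str) -> dict:
    """Start from the cluster-wide JSON, then apply node-local overrides."""
    config = json.loads(cluster_json) if cluster_json else {}
    config.update(parse_properties(node_properties))
    return config
```

Values read from the properties text stay strings here; a real
implementation would need to decide how to type them.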

--Noble




On Sat, Aug 29, 2020 at 5:32 AM Alexandre Rafalovitch
<[email protected]> wrote:
>
> This is way above my head, but I wonder if we could dogfood any of
> this with a future Solr cloud example? At the moment, it sets up 2-4
> nodes, 1 collection, any number of shards/replicas. And it does it by
> directory clone and some magic in bin/solr to ensure logs don't step
> on each other's toes.
>
> If we have an idea of what this should look like and an example we
> actually ship, we could probably make it much more concrete.
>
> Regards,
>    Alex.
>
>
> On Fri, 28 Aug 2020 at 15:12, Gus Heck <[email protected]> wrote:
> >
> > Sure of course someone has to set up the first one, that should be an 
> > initial collaboration with devops one can never escape that. Mount points 
> > can be established in an automated fashion and named by convention. My 
> > yearning is to make the devops side of it devops based (provide machines 
> > that look like X where all the "X things" are attributes familiar to devops 
> > people such as CPUs/mounts/RAM/etc.) and the Solr side of it controlled by 
> > those who are experts in Solr to the greatest extent possible. So my desire 
> > is that Solr specific stuff go in ZK and machine definitions be controlled 
> > by devops. Once the initial setup for type X is done then the solr guy says 
> > to devops pls give me 3 more of type X (zk locations are a devops thing 
> > btw, they might move zk as they see fit) and when they start, the nodes 
> > join the cluster. Solr guy does his thing, twiddles configs to make it hum 
> > (within limits, of course, some changes require machine level changes), 
> > occasionally requests reboots, and when he doesn't need the machines he 
> > says... you can turn off machine A, B and C now. Solr guy doesn't care if 
> > it's AMI or docker or that new Flazllebarp thing that devops seem to like 
> > for no clear reason other than it's sold to them by TABS 
> > (TinyAuspexBananaSoft Inc) who threw it in when they sold them a bunch of 
> > other stuff...
> >
> > The config is packaged with the code because there's no better way for a 
> > lot of software out there. Use of Zk to serve up configuration gives us the 
> > opportunity to do better (well I think it sounds better YMMV of course).
> >
> > -Gus
> >
> > On Fri, Aug 28, 2020 at 2:43 PM Tomás Fernández Löbbe 
> > <[email protected]> wrote:
> >>
> >> As for AMIs, you have to do it at least once, right? Or are you thinking 
> >> of someone using a pre-existing AMI? I see your point for the case of 
> >> someone using the official Solr image as-is without any volume mounts, I 
> >> guess. I'm wondering if trying to put node configuration inside ZooKeeper 
> >> is another case where we try to solve things inside Solr that the industry 
> >> has already solved differently (AMIs and Docker images are exactly about 
> >> packaging code and config).
> >>
> >> On Fri, Aug 28, 2020 at 11:11 AM Gus Heck <[email protected]> wrote:
> >>>
> >>> Which means whoever wants to make changes to solr needs to be 
> >>> able/willing/competent to make AMI/dockers/etc ... and one has to manage 
> >>> versions of those variants as opposed to managing versions of config 
> >>> files.
> >>>
> >>> On Fri, Aug 28, 2020 at 1:55 PM Tomás Fernández Löbbe 
> >>> <[email protected]> wrote:
> >>>>
> >>>> I think if you are using AMIs (or Docker), you could put the node 
> >>>> configuration inside the AMI (or Docker image), as Ilan said, together 
> >>>> with the binaries. Say you have a custom top-level handler (Collections, 
> >>>> Cores, Info, whatever), which takes some arguments and it's configured 
> >>>> in solr.xml and you are doing an upgrade, you probably want your old 
> >>>> nodes (running with your old AMI/Docker image with old jars) to keep the 
> >>>> old configuration and your new nodes to use the new.
> >>>>
> >>>> On Fri, Aug 28, 2020 at 10:42 AM Gus Heck <[email protected]> wrote:
> >>>>>
> >>>>> Putting solr.xml in zookeeper means you can add a node simply by 
> >>>>> starting solr pointing to the zookeeper, and ensure a consistent 
> >>>>> solr.xml for the new node if you've customized it. Since I rarely 
> >>>>> (never) hit use cases where I need a different per-node solr.xml, I 
> >>>>> generally advocate putting it in ZK. I'd say heterogeneous node configs 
> >>>>> are the special case for advanced use here.  I'm a fan of a 
> >>>>> (hypothetical future) world where nodes can be added/removed simply 
> >>>>> without need for local configuration. It would be desirable IMHO to 
> >>>>> have a smooth node add and remove process and having to install a file 
> >>>>> into a distribution manually after unpacking it (or having to coordinate 
> >>>>> variations of config to be pushed to machines) is a minus. If and when 
> >>>>> autoscaling is happy again I'd like to be able to start an AMI in AWS 
> >>>>> pointing at zk (or similar) and have it join automatically, and then 
> >>>>> receive replicas to absorb load (per whatever autoscaling is 
> >>>>> specified), and then be able to issue a single command to a node to 
> >>>>> sunset the node that moves replicas back off of it (again per 
> >>>>> autoscaling preferences, failing if autoscaling constraints would be 
> >>>>> violated) and then asks the node to shut down so that the instance in 
> >>>>> AWS (or wherever) can be shut down safely.  This is a black friday,  
> >>>>> new tenants/lost tenants, or new feature/EOL feature sort of use case.
> >>>>>
> >>>>> Thus IMHO all config for cloud should live somewhere in ZK. File system 
> >>>>> access should not be required to add/remove capacity. If multiple node 
> >>>>> configurations need to be supported we should have nodeTypes directory 
> >>>>> in zk (similar to configsets for collections), possible node specific 
> >>>>> configs there and an env var that can be read to determine the type 
> >>>>> (with some cluster level designation of a default node type). I think 
> >>>>> that would be sufficient to parameterize AMI stuff (or containers) by 
> >>>>> reading tags into env variables
> >>>>>
> >>>>> As for knowing what a node loaded, we really should be able to emit any 
> >>>>> config file we've loaded (without reference to disk or zk). They aren't 
> >>>>> that big and in most cases don't change that fast, so caching a simple 
> >>>>> copy as a string in memory (but only if THAT node loaded it) for 
> >>>>> verification would seem smart. Having a file on disk doesn't tell you 
> >>>>> if solr loaded with that version or if it's changed since solr loaded 
> >>>>> it either.
> >>>>>
> >>>>> Anyway, that's the pie in my sky...
> >>>>>
> >>>>> -Gus
> >>>>>
> >>>>> On Fri, Aug 28, 2020 at 11:51 AM Ilan Ginzburg <[email protected]> 
> >>>>> wrote:
> >>>>>>
> >>>>>> What I'm really looking for (and currently my understanding is that 
> >>>>>> solr.xml is the only option) is a cluster config a Solr dev can set as 
> >>>>>> a default when introducing a new feature for example, so that the 
> >>>>>> config is picked out of the box in SolrCloud, yet allowing the end 
> >>>>>> user to override it if he so wishes.
> >>>>>>
> >>>>>> But "cluster config" in this context with a caveat: when doing a 
> >>>>>> rolling upgrade, nodes running new code need the new cluster config, 
> >>>>>> nodes running old code need the previous cluster config... Having a 
> >>>>>> per node solr.xml deployed atomically with the code as currently the 
> >>>>>> case has disadvantages, but solves this problem effectively in a very 
> >>>>>> simple way. If we were to move to a central cluster config, we'd 
> >>>>>> likely need to introduce config versioning or as Noble suggested 
> >>>>>> elsewhere, only write code that's backward compatible (w.r.t. config), 
> >>>>>> deploy that code everywhere then once no old code is running, update 
> >>>>>> the cluster config. I find this approach complicated from both dev and 
> >>>>>> operational perspective with an unclear added value.
> >>>>>>
> >>>>>> Ilan
> >>>>>>
> >>>>>> PS. I've stumbled upon the loading of solr.xml from Zookeeper in the 
> >>>>>> past but couldn't find it as I wrote my message so I thought I 
> >>>>>> imagined it...
> >>>>>>
> >>>>>> It's in SolrDispatchFilter.loadNodeConfig(). It establishes a 
> >>>>>> connection to ZK for fetching solr.xml then closes it.
> >>>>>> It relies on system property waitForZk as the connection timeout (in 
> >>>>>> seconds, defaults to 30) and system property zkHost as the Zookeeper 
> >>>>>> host.
> >>>>>>
> >>>>>> I believe solr.xml can only end up in ZK through the use of ZkCLI. 
> >>>>>> Then the user is on his own to manage SolrCloud version upgrades: if a 
> >>>>>> new solr.xml is included as part of a new version of SolrCloud, the 
> >>>>>> user having pushed a previous version into ZK will not see the update.
> >>>>>> I wonder if putting solr.xml in ZK is a common practice.
> >>>>>>
> >>>>>> On Fri, Aug 28, 2020 at 4:58 PM Jan Høydahl <[email protected]> 
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> I interpret solr.xml as the node-local configuration for a single 
> >>>>>>> node.
> >>>>>>> clusterprops.json is the cluster-wide configuration applying to all 
> >>>>>>> nodes.
> >>>>>>> solrconfig.xml is of course per core etc
> >>>>>>>
> >>>>>>> solr.in.sh is the per-node ENV-VAR way of configuring a node, and 
> >>>>>>> many of those are picked up in solr.xml (other in bin/solr).
> >>>>>>>
> >>>>>>> I think it is important to keep a file-local config file which can 
> >>>>>>> only be modified if you have shell access to that local node, it 
> >>>>>>> provides an extra layer of security.
> >>>>>>> And in certain cases a node may need a different configuration from 
> >>>>>>> another node, i.e. during an upgrade.
> >>>>>>>
> >>>>>>> I put solr.xml in zookeeper. It may have been a mistake, since it may 
> >>>>>>> not make all that much sense to load solr.xml which is a node-level 
> >>>>>>> file, from ZK. But if it uses var substitutions for all node-level 
> >>>>>>> stuff, it will still work since those vars are pulled from local 
> >>>>>>> properties when parsed anyway.
> >>>>>>>
> >>>>>>> I’m also somewhat against hijacking clusterprops.json as a general 
> >>>>>>> purpose JSON config file for the cluster. It was supposed to be for 
> >>>>>>> simple properties.
> >>>>>>>
> >>>>>>> Jan
> >>>>>>>
> >>>>>>> > 28. aug. 2020 kl. 14:23 skrev Erick Erickson 
> >>>>>>> > <[email protected]>:
> >>>>>>> >
> >>>>>>> > Solr.xml can also exist on Zookeeper, it doesn’t _have_ to exist 
> >>>>>>> > locally. You do have to restart to have any changes take effect.
> >>>>>>> >
> >>>>>>> > Long ago in a Solr far away solr.xml was where all the cores were 
> >>>>>>> > defined. This was before “core discovery” was put in. Since 
> >>>>>>> > solr.xml had to be there anyway and was read at startup, other 
> >>>>>>> > global information was added and it’s lived on...
> >>>>>>> >
> >>>>>>> > Then clusterprops.json came along as a place to put, well, 
> >>>>>>> > cluster-wide properties so having solr.xml too seems awkward. 
> >>>>>>> > Although if you do have solr.xml locally to each node, you could 
> >>>>>>> > theoretically have different settings for different Solr instances. 
> >>>>>>> > Frankly I consider this more of a bug than a feature.
> >>>>>>> >
> >>>>>>> > I know there has been some talk about removing solr.xml entirely, 
> >>>>>>> > but I’m not sure what the thinking is about what to do instead. 
> >>>>>>> > Whatever we do needs to accommodate standalone. We could do the 
> >>>>>>> > same trick we do now, and essentially move all the current options 
> >>>>>>> > in solr.xml to clusterprops.json (or other ZK node) and read it 
> >>>>>>> > locally for stand-alone. The API could even be used to change it if 
> >>>>>>> > it was stored locally.
> >>>>>>> >
> >>>>>>> > That still leaves the chicken-and-egg problem of connecting to ZK 
> >>>>>>> > in the first place.
> >>>>>>> >
> >>>>>>> >> On Aug 28, 2020, at 7:43 AM, Ilan Ginzburg <[email protected]> 
> >>>>>>> >> wrote:
> >>>>>>> >>
> >>>>>>> >> I want to ramp-up/discuss/inventory configuration options in Solr. 
> >>>>>>> >> Here's my understanding of what exists and what could/should be 
> >>>>>>> >> used depending on the need. Please correct/complete as needed (or 
> >>>>>>> >> point to documentation I might have missed).
> >>>>>>> >>
> >>>>>>> >> There are currently 3 sources of general configuration I'm aware 
> >>>>>>> >> of:
> >>>>>>> >>      • Collection specific config bootstrapped by file 
> >>>>>>> >> solrconfig.xml and copied into the initial (_default) then 
> >>>>>>> >> subsequent Config Sets in Zookeeper.
> >>>>>>> >>      • Cluster wide config in Zookeeper /clusterprops.json 
> >>>>>>> >> editable globally through Zookeeper interaction using an API. Not 
> >>>>>>> >> bootstrapped by anything (i.e. does not exist until the user 
> >>>>>>> >> explicitly creates it)
> >>>>>>> >>      • Node config file solr.xml deployed with Solr on each node 
> >>>>>>> >> and loaded when Solr starts. Changes to this file are per node and 
> >>>>>>> >> require node restart to be taken into account.
> >>>>>>> >> The Collection specific config (file solrconfig.xml then in 
> >>>>>>> >> Zookeeper /configs/<config set name>/solrconfig.xml) allows Solr 
> >>>>>>> >> devs to set reasonable defaults (the file is part of the Solr 
> >>>>>>> >> distribution). Content can be changed by users as they create new 
> >>>>>>> >> Config Sets persisted in Zookeeper.
> >>>>>>> >>
> >>>>>>> >> Zookeeper's /clusterprops.json can be edited through the 
> >>>>>>> >> collection admin API CLUSTERPROP. If users do not set anything 
> >>>>>>> >> there, the file doesn't even exist in Zookeeper therefore Solr 
> >>>>>>> >> devs cannot use it to set a default cluster config, there's no 
> >>>>>>> >> clusterprops.json file in the Solr distrib like there's a 
> >>>>>>> >> solrconfig.xml.
> >>>>>>> >>
> >>>>>>> >> File solr.xml is used by Solr devs to set some reasonable default 
> >>>>>>> >> configuration (parametrized through property files or system 
> >>>>>>> >> properties). There's no API to change that file, users would have 
> >>>>>>> >> to edit/redeploy the file on each node and restart the Solr JVM on 
> >>>>>>> >> that node for the new config to be taken into account.
> >>>>>>> >>
> >>>>>>> >> Based on the above, my vision (or mental model) of what to use 
> >>>>>>> >> depending on the need:
> >>>>>>> >>
> >>>>>>> >> solrconfig.xml is the only per collection config. IMO it does its 
> >>>>>>> >> job correctly: Solr devs can set defaults, users tailor the 
> >>>>>>> >> content to what they need for new config sets. It's the only 
> >>>>>>> >> option for per collection config anyway.
> >>>>>>> >>
> >>>>>>> >> The real hesitation could be between solr.xml and Zookeeper 
> >>>>>>> >> /clusterprops.json. What should go where?
> >>>>>>> >>
> >>>>>>> >> For user configs (anything the user does to the Solr cluster AFTER 
> >>>>>>> >> it was deployed and started), /clusterprops.json seems to be the 
> >>>>>>> >> obvious choice and offers the right abstractions (global config, 
> >>>>>>> >> no need to worry about individual nodes, all nodes pick up configs 
> >>>>>>> >> and changes to configs dynamically).
> >>>>>>> >>
> >>>>>>> >> For configs that need to be available without requiring user 
> >>>>>>> >> intervention or needed before the connection to ZK is established, 
> >>>>>>> >> there's currently no other choice than using solr.xml. Such 
> >>>>>>> >> configuration obviously includes parameters that are needed to 
> >>>>>>> >> connect to ZK (timeouts, credential provider and hopefully one day 
> >>>>>>> >> an option to either use direct ZK interaction code or Curator 
> >>>>>>> >> code), but also configuration of general features that should be 
> >>>>>>> >> the default without requiring users to opt in yet allowing them to 
> >>>>>>> >> easily opt out by editing solr.xml before deploying to their 
> >>>>>>> >> cluster (in the future, this could include which Lucene version to 
> >>>>>>> >> load in Solr for example).
> >>>>>>> >>
> >>>>>>> >> To summarize:
> >>>>>>> >>      • Collection specific config? --> solrconfig.xml
> >>>>>>> >>      • User provided cluster config once SolrCloud is running? --> 
> >>>>>>> >> ZK /clusterprops.json
> >>>>>>> >>      • Solr dev provided cluster config? --> solr.xml
> >>>>>>> >>
> >>>>>>> >> Going forward, some (but only some!) of the config that currently 
> >>>>>>> >> can only live in solr.xml could be made to go to 
> >>>>>>> >> /clusterprops.json or another ZK based config file. This would 
> >>>>>>> >> require adding code to create that ZK file upon initial cluster 
> >>>>>>> >> start (to not force the user to push it) and devise a mechanism 
> >>>>>>> >> (likely a script, could be tricky though) to update that file in 
> >>>>>>> >> ZK when a new release of Solr is deployed and a previous version 
> >>>>>>> >> of that file already exists. Not impossible tasks, but not trivial 
> >>>>>>> >> ones either. Whatever the needs of such an approach are, it might 
> >>>>>>> >> be easier to keep the existing solr.xml as a file and allow users 
> >>>>>>> >> to define overrides in Zookeeper for the configuration parameters 
> >>>>>>> >> from solr.xml that make sense to be overridden in ZK (obviously ZK 
> >>>>>>> >> credentials or connection timeout do not make sense in that 
> >>>>>>> >> context, but defining the shard handler implementation class does 
> >>>>>>> >> since it is likely loaded after a node managed to connect to ZK).
> >>>>>>> >>
> >>>>>>> >> Some config will have to stay in a local Node file system file and 
> >>>>>>> >> only there no matter what: Zookeeper timeout definition or any 
> >>>>>>> >> node configuration that is needed before the node connects to 
> >>>>>>> >> Zookeeper.
> >>>>>>> >>
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > ---------------------------------------------------------------------
> >>>>>>> > To unsubscribe, e-mail: [email protected]
> >>>>>>> > For additional commands, e-mail: [email protected]
> >>>>>>> >
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> http://www.needhamsoftware.com (work)
> >>>>> http://www.the111shift.com (play)
> >>>
> >>>
> >>>
> >>> --
> >>> http://www.needhamsoftware.com (work)
> >>> http://www.the111shift.com (play)
> >
> >
> >
> > --
> > http://www.needhamsoftware.com (work)
> > http://www.the111shift.com (play)
>
>


-- 
-----------------------------------------------------
Noble Paul

