autoAddReplicas – what am I missing?
Hi,

I have a SolrCloud 6.6.5 cluster with three nodes (.1, .2, .3). It has 3 collections, all of which are configured with replicationFactor=3 and autoAddReplicas=true. When I go to the Cloud/Graph page of the admin interface, I see what I expect: 3 collections, 3 nodes each, all green.

Then I try my experiment:

1) I bring up a 4th node (.4) and wait for it to join the cluster. I now see .1, .2, .3, and .4 in live_nodes, and .1, .2, and .3 on the graph, still as expected.
2) I kill .2. Predictably, it falls off the list of live_nodes and turns gray on the cloud diagram.

Expected: 3 fully green collections replicated on .1, .3, and .4, and .2 dropped from the cloud.
Actual: 3 collections replicated on .1 (green), .2 (gray), and .3 (green), and .4 nowhere to be seen (except in live_nodes).

I don't have any special rules or snitches or anything configured. What gives? What else should I be looking at?

Thanks,
Michael
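P.S. In case more detail than the Cloud graph would help: I can pull the full cluster state with CLUSTERSTATUS, e.g. (host and collection names here are placeholders for my real ones):

curl "http://10.0.0.1:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=collection1&wt=json"

Happy to post the JSON output if that would be useful for diagnosis.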
Load balanced Solr cluster not updating leader
Hi all,

I've encountered a reproducible and confusing issue with our Solr 6.6 cluster. (Updating to 7.x is an option, but not an immediate one.) This is in our staging environment, running on AWS. To save money, we scale our entire stack down to zero instances every night and spin it back up every morning. Here's the process:

SCALE DOWN:
1) Commit & optimize all collections.
2) Back up each collection to a shared volume (using the Collections API).
3) Spin down all (3) solr instances.
4) Spin down all (2) zookeeper instances.

SPIN UP:
1) Spin up zookeeper instances; wait for the instances to find each other and the ensemble to stabilize.
2) Spin up solr instances; wait for them all to stabilize and for zookeeper to recognize them as live nodes.
3) Restore each collection (using the Collections API).

It works ALMOST perfectly. The restore operation reports success, and if I look at the UI, everything looks great in the Cloud graph view: all green, with one leader and two other active instances per collection. But once we start updating, we run into problems. The two NON-leaders in each collection get the updates, but the leader never does. Since the instances are behind a round-robin load balancer, every third query hits an out-of-date core, with unfortunate results for our app, which depends on near-real-time indexing.

Reloading the collection doesn't seem to help, but if I use the Collections API to DELETEREPLICA the leader of each collection and follow it with an ADDREPLICA, everything syncs up (with a new leader) and stays in sync from there on out.

I don't know what to look for in my settings or my logs to diagnose or try to fix this issue. It only affects collections that have been restored from backup. Any suggestions or guidance would be a big help.

Thanks,
Michael

--
Michael B. Klein
Lead Developer, Repository Development and Administration
Northwestern University Libraries
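P.S. For reference, the backup and restore calls in the process above look roughly like this (host, collection, and backup names are placeholders for our real ones):

curl "http://10.0.0.1:8983/solr/admin/collections?action=BACKUP&name=coll1-nightly&collection=coll1&location=/mnt/solr-backups"
curl "http://10.0.0.1:8983/solr/admin/collections?action=RESTORE&name=coll1-nightly&collection=coll1&location=/mnt/solr-backups"

And the workaround that gets a restored collection back in sync (the replica name comes from whatever CLUSTERSTATUS reports as the current leader):

curl "http://10.0.0.1:8983/solr/admin/collections?action=DELETEREPLICA&collection=coll1&shard=shard1&replica=core_node1"
curl "http://10.0.0.1:8983/solr/admin/collections?action=ADDREPLICA&collection=coll1&shard=shard1"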
Re: Replication Question
And the one that isn't getting the updates is the one marked in the cloud diagram as the leader. /me bangs head on desk

On Wed, Aug 2, 2017 at 10:31 AM, Michael B. Klein <mbkl...@gmail.com> wrote:
> Another observation: After bringing the cluster back up just now, the "1-in-3 nodes don't get the updates" issue persists, even with the cloud diagram showing 3 nodes, all green.
>
> On Wed, Aug 2, 2017 at 9:56 AM, Michael B. Klein <mbkl...@gmail.com> wrote:
>
>> Thanks for your responses, Shawn and Erick.
>>
>> Some clarification questions, but first a description of my (non-standard) use case:
>>
>> My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are working well so far on the production cluster (knock wood); it's the staging cluster that's giving me fits. Here's why: In order to save money, I have the AWS auto-scaler scale the cluster down to zero nodes when it's not in use. Here's the (automated) procedure:
>>
>> SCALE DOWN
>> 1) Call admin/collections?action=BACKUP for each collection to a shared NFS volume
>> 2) Shut down all the nodes
>>
>> SCALE UP
>> 1) Spin up 2 Zookeeper nodes and wait for them to stabilize
>> 2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's live_nodes
>> 3) Call admin/collections?action=RESTORE to put all the collections back
>>
>> This has been working very well, for the most part, with the following complications/observations:
>>
>> 1) If I don't optimize each collection right before BACKUP, the backup fails (see the attached solr_backup_error.json).
>> 2) If I don't specify a replicationFactor during RESTORE, the admin interface's Cloud diagram only shows one active node per collection. Is this expected? Am I required to specify the replicationFactor unless I'm using a shared HDFS volume for solr data?
>> 3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning message in the response, even though the restore seems to succeed.
>> 4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I do not currently have any replication stuff configured (as it seems I should not).
>> 5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to think all replicas were live and synced and happy, and because I was accessing solr through a round-robin load balancer, I was never able to tell which node was out of sync.
>>
>> If it happens again, I'll make node-by-node requests and try to figure out what's different about the failing one. But the fact that this happened (and the way it happened) is making me wonder if/how I can automate this automated staging environment scaling reliably and with confidence that it will Just Work™.
>>
>> Comments and suggestions would be GREATLY appreciated.
>>
>> Michael
>>
>> On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>> And please do not use optimize unless your index is totally static. I only recommend it when the pattern is to update the index periodically, like every day or something, and not update any docs in between times.
>>>
>>> Implied in Shawn's e-mail was that you should undo anything you've done in terms of configuring replication; just go with the defaults.
>>>
>>> Finally, my bet is that your problematic Solr node is misconfigured.
>>>
>>> Best,
>>> Erick
>>>
>>> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>>>> On 8/1/2017 12:09 PM, Michael B. Klein wrote:
>>>>> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff seems to be working OK, except that one of the nodes never seems to get its replica updated.
>>>>>
>>>>> Queries take place through a non-caching, round-robin load balancer. The collection looks fine, with one shard and a replicationFactor of 3. Everything in the cloud diagram is green.
>>>>>
>>>>> But if I (for example) select?q=id:hd76s004z, the results come up empty 1 out of every 3 times.
>>>>>
>>>>> Even several minutes after a commit and optimize, one replica still isn't returning the right info. [...]
Re: Replication Question
Another observation: After bringing the cluster back up just now, the "1-in-3 nodes don't get the updates" issue persists, even with the cloud diagram showing 3 nodes, all green.

On Wed, Aug 2, 2017 at 9:56 AM, Michael B. Klein <mbkl...@gmail.com> wrote:
> Thanks for your responses, Shawn and Erick.
>
> Some clarification questions, but first a description of my (non-standard) use case:
>
> My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are working well so far on the production cluster (knock wood); it's the staging cluster that's giving me fits. Here's why: In order to save money, I have the AWS auto-scaler scale the cluster down to zero nodes when it's not in use. Here's the (automated) procedure:
>
> SCALE DOWN
> 1) Call admin/collections?action=BACKUP for each collection to a shared NFS volume
> 2) Shut down all the nodes
>
> SCALE UP
> 1) Spin up 2 Zookeeper nodes and wait for them to stabilize
> 2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's live_nodes
> 3) Call admin/collections?action=RESTORE to put all the collections back
>
> This has been working very well, for the most part, with the following complications/observations:
>
> 1) If I don't optimize each collection right before BACKUP, the backup fails (see the attached solr_backup_error.json).
> 2) If I don't specify a replicationFactor during RESTORE, the admin interface's Cloud diagram only shows one active node per collection. Is this expected? Am I required to specify the replicationFactor unless I'm using a shared HDFS volume for solr data?
> 3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning message in the response, even though the restore seems to succeed.
> 4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I do not currently have any replication stuff configured (as it seems I should not).
> 5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to think all replicas were live and synced and happy, and because I was accessing solr through a round-robin load balancer, I was never able to tell which node was out of sync.
>
> If it happens again, I'll make node-by-node requests and try to figure out what's different about the failing one. But the fact that this happened (and the way it happened) is making me wonder if/how I can automate this automated staging environment scaling reliably and with confidence that it will Just Work™.
>
> Comments and suggestions would be GREATLY appreciated.
>
> Michael
>
> On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> And please do not use optimize unless your index is totally static. I only recommend it when the pattern is to update the index periodically, like every day or something, and not update any docs in between times.
>>
>> Implied in Shawn's e-mail was that you should undo anything you've done in terms of configuring replication; just go with the defaults.
>>
>> Finally, my bet is that your problematic Solr node is misconfigured.
>>
>> Best,
>> Erick
>>
>> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>>> On 8/1/2017 12:09 PM, Michael B. Klein wrote:
>>>> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff seems to be working OK, except that one of the nodes never seems to get its replica updated.
>>>>
>>>> Queries take place through a non-caching, round-robin load balancer. The collection looks fine, with one shard and a replicationFactor of 3. Everything in the cloud diagram is green.
>>>>
>>>> But if I (for example) select?q=id:hd76s004z, the results come up empty 1 out of every 3 times.
>>>>
>>>> Even several minutes after a commit and optimize, one replica still isn’t returning the right info.
>>>>
>>>> Do I need to configure my `solrconfig.xml` with `replicateAfter` options on the `/replication` requestHandler, or is that a non-solrcloud, standalone-replication thing?
>>>
>>> This is one of the more confusing aspects of SolrCloud.
>>>
>>> When everything is working perfectly in a SolrCloud install, the feature in Solr called "replication" is *never* used. SolrCloud does require the replication feature, though ... which is what makes this whole thing very confusing. [...]
Re: Replication Question
Thanks for your responses, Shawn and Erick.

Some clarification questions, but first a description of my (non-standard) use case:

My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are working well so far on the production cluster (knock wood); it's the staging cluster that's giving me fits. Here's why: In order to save money, I have the AWS auto-scaler scale the cluster down to zero nodes when it's not in use. Here's the (automated) procedure:

SCALE DOWN
1) Call admin/collections?action=BACKUP for each collection to a shared NFS volume
2) Shut down all the nodes

SCALE UP
1) Spin up 2 Zookeeper nodes and wait for them to stabilize
2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's live_nodes
3) Call admin/collections?action=RESTORE to put all the collections back

This has been working very well, for the most part, with the following complications/observations:

1) If I don't optimize each collection right before BACKUP, the backup fails (see the attached solr_backup_error.json).
2) If I don't specify a replicationFactor during RESTORE, the admin interface's Cloud diagram only shows one active node per collection. Is this expected? Am I required to specify the replicationFactor unless I'm using a shared HDFS volume for solr data?
3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning message in the response, even though the restore seems to succeed.
4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I do not currently have any replication stuff configured (as it seems I should not).
5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to think all replicas were live and synced and happy, and because I was accessing solr through a round-robin load balancer, I was never able to tell which node was out of sync.

If it happens again, I'll make node-by-node requests and try to figure out what's different about the failing one. But the fact that this happened (and the way it happened) is making me wonder if/how I can automate this automated staging environment scaling reliably and with confidence that it will Just Work™.

Comments and suggestions would be GREATLY appreciated.

Michael

On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> And please do not use optimize unless your index is totally static. I only recommend it when the pattern is to update the index periodically, like every day or something, and not update any docs in between times.
>
> Implied in Shawn's e-mail was that you should undo anything you've done in terms of configuring replication; just go with the defaults.
>
> Finally, my bet is that your problematic Solr node is misconfigured.
>
> Best,
> Erick
>
> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>> On 8/1/2017 12:09 PM, Michael B. Klein wrote:
>>> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff seems to be working OK, except that one of the nodes never seems to get its replica updated.
>>>
>>> Queries take place through a non-caching, round-robin load balancer. The collection looks fine, with one shard and a replicationFactor of 3. Everything in the cloud diagram is green.
>>>
>>> But if I (for example) select?q=id:hd76s004z, the results come up empty 1 out of every 3 times.
>>>
>>> Even several minutes after a commit and optimize, one replica still isn’t returning the right info.
>>>
>>> Do I need to configure my `solrconfig.xml` with `replicateAfter` options on the `/replication` requestHandler, or is that a non-solrcloud, standalone-replication thing?
>>
>> This is one of the more confusing aspects of SolrCloud.
>>
>> When everything is working perfectly in a SolrCloud install, the feature in Solr called "replication" is *never* used. SolrCloud does require the replication feature, though ... which is what makes this whole thing very confusing.
>>
>> Replication is used to replicate an entire Lucene index (consisting of a bunch of files on the disk) from a core on a master server to a core on a slave server. This is how replication was done before SolrCloud was created.
>>
>> The way that SolrCloud keeps replicas in sync is *entirely* different. SolrCloud has no masters and no slaves. When you index or delete a document in a SolrCloud collection, the request is forwarded to the leader of the correct shard for that document. The leader then sends a copy of that request to the other replicas of the shard. [...]
Replication Question
I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff seems to be working OK, except that one of the nodes never seems to get its replica updated.

Queries take place through a non-caching, round-robin load balancer. The collection looks fine, with one shard and a replicationFactor of 3. Everything in the cloud diagram is green.

But if I (for example) select?q=id:hd76s004z, the results come up empty 1 out of every 3 times.

Even several minutes after a commit and optimize, one replica still isn’t returning the right info.

Do I need to configure my `solrconfig.xml` with `replicateAfter` options on the `/replication` requestHandler, or is that a non-solrcloud, standalone-replication thing?

Michael
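P.S. One more question: is querying each node's core directly, with distrib=false so the request isn't forwarded to another replica, a valid way to pin down which replica is stale? Something like this (the core name is a guess at what SolrCloud generated for me):

curl "http://10.0.0.1:8983/solr/collection1_shard1_replica1/select?q=id:hd76s004z&distrib=false&wt=json"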
Re: Need help indexing/querying a particular type of hierarchy
After a whole lot of facet-wrangling, I've come up with a practical solution that suits my situation, which is to index each triple as a series of paths. For example, if the shelve process of the accessionWF workflow is completed, it gets indexed as:

<field name="wf_wps">accessionWF</field>
<field name="wf_wps">accessionWF:shelve</field>
<field name="wf_wps">accessionWF:shelve:completed</field>
<field name="wf_wsp">accessionWF</field>
<field name="wf_wsp">accessionWF:completed</field>
<field name="wf_wsp">accessionWF:completed:shelve</field>
<field name="wf_swp">completed</field>
<field name="wf_swp">completed:accessionWF</field>
<field name="wf_swp">completed:accessionWF:shelve</field>

(I could use PathHierarchyTokenizerFactory to eliminate 2/3 of those field declarations, but doing it this way keeps me from having to upgrade my Solr to 3.1 yet.)

That lets solr return a facet structure that looks like this:

<lst name="facet_fields">
  <lst name="wf_wps">
    <int name="accessionWF">554</int>
    <int name="accessionWF:shelve">554</int>
    <int name="accessionWF:shelve:completed">550</int>
    <int name="accessionWF:shelve:error">4</int>
  </lst>
  <lst name="wf_wsp">
    <int name="accessionWF">554</int>
    <int name="accessionWF:completed">554</int>
    <int name="accessionWF:completed:shelve">550</int>
    <int name="accessionWF:error">4</int>
    <int name="accessionWF:error:shelve">4</int>
  </lst>
  <lst name="wf_swp">
    <int name="completed">554</int>
    <int name="completed:accessionWF">554</int>
    <int name="completed:accessionWF:shelve">550</int>
    <int name="error">4</int>
    <int name="error:accessionWF">4</int>
    <int name="error:accessionWF:shelve">4</int>
  </lst>
</lst>

I then use some Ruby post-processing to turn it into:

{
  wf_wps: {
    accessionWF: [554, {
      shelve: [554, { completed: 550, error: 4 }],
      publish: [554, { completed: 554 }]
    }]
  },
  wf_swp: {
    completed: [554, { accessionWF: [554, { shelve: 550, publish: 554 }] }],
    error: [4, { accessionWF: [4, { shelve: 4 }] }]
  },
  wf_wsp: {
    accessionWF: [554, {
      completed: [554, { shelve: 550, publish: 554 }],
      error: [4, { shelve: 4 }]
    }]
  }
}

Eventually I may try to code up something that does the restructuring on the solr side, but for now, this suits my purposes.

Michael
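P.S. In case it's useful to anyone, the post-processing boils down to something like this sketch (simplified from what I actually run; it assumes the counts for one field arrive as Solr's flat term/count array from wt=json, with parent paths sorted before their children, which is what index-ordered facets give you):

def nest_facets(pairs)
  tree = {}
  pairs.each_slice(2) do |path, count|
    *parents, leaf = path.split(':').map(&:to_sym)
    node = parents.reduce(tree) do |current, seg|
      # Promote a bare count to [count, children] the first time we descend into it
      current[seg] = [current[seg], {}] unless current[seg].is_a?(Array)
      current[seg][1]
    end
    node[leaf] ||= count
  end
  tree
end

nest_facets(["accessionWF", 554, "accessionWF:shelve", 554, "accessionWF:shelve:completed", 550])
# => { accessionWF: [554, { shelve: [554, { completed: 550 }] }] }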
Need help indexing/querying a particular type of hierarchy
Hi all,

I have a particular data structure I'm trying to index into a solr document so that I can query and facet it in a particular way, and I can't quite figure out the best way to go about it. One sample object is here: https://gist.github.com/1139065

The part that's tripping me up is the workflows. Each workflow has a name (in this case, digitizationWF and accessionWF). Each workflow is made up of a number of processes, each of which has its own current status. Every time the status of a process within a workflow changes, the object is reindexed.

What I'd like to be able to do is present several hierarchies of facets. In one, the workflow name is the top-level facet, with the second level showing each process, under which is listed each status (completed, waiting, or error) and the number of documents with that status for that process (some values omitted for brevity):

accessionWF (583)
  publish (583)
    completed (574)
    waiting (6)
    error (3)
  shelve (583)
    completed (583)
  etc.

I'd also like to be able to invert that presentation:

accessionWF (583)
  completed (583)
    publish (574)
    shelve (583)
  waiting (6)
    publish (6)
  error (3)
    publish (3)

or even

completed (583)
  accessionWF (583)
    publish (574)
    shelve (583)
  digitizationWF (583)
    initiate (583)
error (3)
  accessionWF (3)
    shelve (3)
etc.

I don't think Solr 4.0's pivot/hierarchical facets are what I'm looking for, because the status values are ambiguous when not qualified by the process name -- the object itself has no "completed" status, only a publish:completed and a shelve:completed that I want to be able to group together into a count/list of objects with completed processes. I also don't think PathHierarchyTokenizerFactory is quite the answer either.

What kind of Solr magic, if any, am I looking for here? Thanks in advance for any help or advice.

Michael
---
Michael B. Klein
Digitization Workflow Engineer
Stanford University Libraries
Re: Need help indexing/querying a particular type of hierarchy
I've been experimenting with that, but that fq wouldn't limit my facet counts adequately. Since the document has both an accessionWF and a digitizationWF, the fq would match (and count) the document no matter what the status of each process is.

I suppose I could do something like this:

<field name="status_wps">accessionWF:start-accession:completed</field>
<field name="status_wps">accessionWF:cleanup:waiting</field>
<field name="status_wps">accessionWF:descriptive-metadata:completed</field>
<field name="status_wps">accessionWF:content-metadata:completed</field>
<field name="status_wps">accessionWF:rights-metadata:completed</field>
<field name="status_wps">accessionWF:publish:completed</field>
<field name="status_wps">accessionWF:shelve:error</field>
<field name="status_wsp">accessionWF:completed:start-accession</field>
<field name="status_wsp">accessionWF:waiting:cleanup</field>
<field name="status_wsp">accessionWF:completed:descriptive-metadata</field>
<field name="status_wsp">accessionWF:completed:content-metadata</field>
<field name="status_wsp">accessionWF:completed:rights-metadata</field>
<field name="status_wsp">accessionWF:completed:publish</field>
<field name="status_wsp">accessionWF:error:shelve</field>
<field name="status_swp">completed:accessionWF:start-accession</field>
<field name="status_swp">waiting:accessionWF:cleanup</field>
<field name="status_swp">completed:accessionWF:descriptive-metadata</field>
<field name="status_swp">completed:accessionWF:content-metadata</field>
<field name="status_swp">completed:accessionWF:rights-metadata</field>
<field name="status_swp">completed:accessionWF:publish</field>
<field name="status_swp">error:accessionWF:shelve</field>

and use a PathHierarchyTokenizerFactory with ":" as the delimiter (see the fieldType sketch at the end of this message). Then I could use facet.field=status_wps&f.status_wps.facet.prefix=accessionWF: to get the counts for all the accessionWF processes and statuses, then repeat using status_wsp and status_swp for the various inversions.

I was hoping for something easier. :)

On Thu, Aug 11, 2011 at 6:40 AM, Dmitry Kan <dmitry@gmail.com> wrote:
> Hi,
>
> Can you keep your hierarchy flat in SOLR and then use filter queries (fq=wf:accessionWF) inside your facet queries (facet.field=status)? Or is the requirement to have one single facet query producing the hierarchical facet counts?
>
> On Thu, Aug 11, 2011 at 10:43 AM, Michael B. Klein <mbkl...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a particular data structure I'm trying to index into a solr document so that I can query and facet it in a particular way, and I can't quite figure out the best way to go about it. One sample object is here: https://gist.github.com/1139065
>>
>> The part that's tripping me up is the workflows. Each workflow has a name (in this case, digitizationWF and accessionWF). Each workflow is made up of a number of processes, each of which has its own current status. Every time the status of a process within a workflow changes, the object is reindexed.
>>
>> What I'd like to be able to do is present several hierarchies of facets. In one, the workflow name is the top-level facet, with the second level showing each process, under which is listed each status (completed, waiting, or error) and the number of documents with that status for that process (some values omitted for brevity):
>>
>> accessionWF (583)
>>   publish (583)
>>     completed (574)
>>     waiting (6)
>>     error (3)
>>   shelve (583)
>>     completed (583)
>>   etc.
>> I'd also like to be able to invert that presentation:
>>
>> accessionWF (583)
>>   completed (583)
>>     publish (574)
>>     shelve (583)
>>   waiting (6)
>>     publish (6)
>>   error (3)
>>     publish (3)
>>
>> or even
>>
>> completed (583)
>>   accessionWF (583)
>>     publish (574)
>>     shelve (583)
>>   digitizationWF (583)
>>     initiate (583)
>> error (3)
>>   accessionWF (3)
>>     shelve (3)
>> etc.
>>
>> I don't think Solr 4.0's pivot/hierarchical facets are what I'm looking for, because the status values are ambiguous when not qualified by the process name -- the object itself has no "completed" status, only a publish:completed and a shelve:completed that I want to be able to group together into a count/list of objects with completed processes. I also don't think PathHierarchyTokenizerFactory is quite the answer either.
>>
>> What kind of Solr magic, if any, am I looking for here? Thanks in advance for any help or advice.
>>
>> Michael
>> ---
>> Michael B. Klein
>> Digitization Workflow Engineer
>> Stanford University Libraries
>
> --
> Regards,
> Dmitry Kan
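P.S. For concreteness, here's the kind of fieldType I have in mind for the PathHierarchyTokenizerFactory approach -- untested, and the type/field names are just placeholders:

<fieldType name="status_path" class="solr.TextField">
  <analyzer type="index">
    <!-- Emits each ":"-delimited prefix as a token: a:b:c -> a, a:b, a:b:c -->
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter=":"/>
  </analyzer>
  <analyzer type="query">
    <!-- Match whole paths at query time rather than re-splitting them -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="status_wps" type="status_path" indexed="true" stored="false" multiValued="true"/>

That would let me index only the full wps/wsp/swp paths and have the prefixes generated at index time, instead of declaring every prefix by hand as above.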