autoAddReplicas – what am I missing?

2018-10-01 Thread Michael B. Klein
Hi,

I have a SolrCloud 6.6.5 cluster with three nodes (.1, .2, .3). It has 3
collections, all of which are configured with replicationFactor=3 and
autoAddReplicas=true. When I go to the Cloud/Graph page of the admin
interface, I see what I expect – 3 collections, 3 nodes each, all green.
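
For reference, the collections were created with a Collections API call roughly
like the one below (the collection name is a placeholder and the shard count is
from memory, so this is a sketch rather than the exact command):

admin/collections?action=CREATE&name=<collection>&numShards=1&replicationFactor=3&autoAddReplicas=true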

Then I try my experiment.

1) I bring up a 4th node (.4) and wait for it to join the cluster. I now
see .1, .2, .3, and .4 in live_nodes, and .1, .2, and .3 on the graph,
still as expected.
2) I kill .2. Predictably, it falls off the list of live_nodes and turns
gray on the cloud diagram.

Expected: 3 fully green collections replicated on .1, .3, and .4, and .2
dropped from the cloud.
Actual: 3 collections replicated on .1 (green), .2 (gray), and .3 (green),
and .4 nowhere to be seen (except in live_nodes).

I don't have any special rules or snitches or anything configured.

What gives? What else should I be looking at?

Thanks,
Michael


Load balanced Solr cluster not updating leader

2018-05-02 Thread Michael B. Klein
Hi all,

I've encountered a reproducible and confusing issue with our Solr 6.6
cluster. (Updating to 7.x is an option, but not an immediate one.) This is
in our staging environment, running on AWS. To save money, we scale our
entire stack down to zero instances every night and spin it back up every
morning. Here's the process:

SCALE DOWN:
1) Commit & Optimize all collections.
2) Back up each collection to a shared volume (using the Collections API).
3) Spin down all (3) solr instances.
4) Spin down all (2) zookeeper instances.

SPIN UP:
1) Spin up zookeeper instances; wait for the instances to find each other
and the ensemble to stabilize.
2) Spin up solr instances; wait for them all to stabilize and for zookeeper
to recognize them as live nodes.
3) Restore each collection (using the Collections API).
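
Concretely, the backup (SCALE DOWN step 2) and restore (SPIN UP step 3) are plain
Collections API calls, roughly like these (backup name, collection name, and the
shared location are placeholders):

admin/collections?action=BACKUP&name=<backup-name>&collection=<collection>&location=<shared-volume-path>
admin/collections?action=RESTORE&name=<backup-name>&collection=<collection>&location=<shared-volume-path>&replicationFactor=3&maxShardsPerNode=1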

It works ALMOST perfectly. The restore operation reports success, and if I
look at the UI, everything looks great in the Cloud graph view. All green,
one leader and two other active instances per collection.

But once we start updating, we run into problems. The two NON-leaders in
each collection get the updates, but the leader never does. Since the
instances are behind a round-robin load balancer, every third query hits an
out-of-date core, with unfortunate results for our app, which depends on
near-real-time indexing.

Reloading the collection doesn't seem to help, but if I use the Collections
API to DELETEREPLICA the leader of each collection and follow it with an
ADDREPLICA, everything syncs up (with a new leader) and stays in sync from
there on out.
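
Concretely, the workaround is, per collection, roughly the following two calls
(the shard, replica, and node values are placeholders; I read the real ones out
of the Cloud graph / CLUSTERSTATUS first):

admin/collections?action=DELETEREPLICA&collection=<collection>&shard=shard1&replica=<core_nodeN>
admin/collections?action=ADDREPLICA&collection=<collection>&shard=shard1&node=<host>:8983_solr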

I don't know what to look for in my settings or my logs to diagnose or try
to fix this issue. It only affects collections that have been restored from
backup. Any suggestions or guidance would be a big help.

Thanks,
Michael

-- 
Michael B. Klein
Lead Developer, Repository Development and Administration
Northwestern University Libraries


Re: Replication Question

2017-08-02 Thread Michael B. Klein
And the one that isn't getting the updates is the one marked in the cloud
diagram as the leader.

/me bangs head on desk

On Wed, Aug 2, 2017 at 10:31 AM, Michael B. Klein <mbkl...@gmail.com> wrote:

> Another observation: After bringing the cluster back up just now, the
> "1-in-3 nodes don't get the updates" issue persists, even with the cloud
> diagram showing 3 nodes, all green.
>
> On Wed, Aug 2, 2017 at 9:56 AM, Michael B. Klein <mbkl...@gmail.com>
> wrote:
>
>> Thanks for your responses, Shawn and Erick.
>>
>> Some clarification questions, but first a description of my
>> (non-standard) use case:
>>
>> My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are
>> working well so far on the production cluster (knock wood); it's the staging
>> cluster that's giving me fits. Here's why: In order to save money, I have
>> the AWS auto-scaler scale the cluster down to zero nodes when it's not in
>> use. Here's the (automated) procedure:
>>
>> SCALE DOWN
>> 1) Call admin/collections?action=BACKUP for each collection to a shared
>> NFS volume
>> 2) Shut down all the nodes
>>
>> SCALE UP
>> 1) Spin up 2 Zookeeper nodes and wait for them to stabilize
>> 2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's
>> live_nodes
>> 3) Call admin/collections?action=RESTORE to put all the collections back
>>
>> This has been working very well, for the most part, with the following
>> complications/observations:
>>
>> 1) If I don't optimize each collection right before BACKUP, the backup
>> fails (see the attached solr_backup_error.json).
>> 2) If I don't specify a replicationFactor during RESTORE, the admin
>> interface's Cloud diagram only shows one active node per collection. Is
>> this expected? Am I required to specify the replicationFactor unless I'm
>> using a shared HDFS volume for solr data?
>> 3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning
>> message in the response, even though the restore seems to succeed.
>> 4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I do
>> not currently have any replication stuff configured (as it seems I should
>> not).
>> 5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud
>> diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to
>> think all replicas were live and synced and happy, and because I was
>> accessing solr through a round-robin load balancer, I was never able to
>> tell which node was out of sync.
>>
>> If it happens again, I'll make node-by-node requests and try to figure
>> out what's different about the failing one. But the fact that this happened
>> (and the way it happened) is making me wonder if/how I can automate this
>> automated staging environment scaling reliably and with confidence that it
>> will Just Work™.
>>
>> Comments and suggestions would be GREATLY appreciated.
>>
>> Michael
>>
>>
>>
>> On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>>> And please do not use optimize unless your index is
>>> totally static. I only recommend it when the pattern is
>>> to update the index periodically, like every day or
>>> something and not update any docs in between times.
>>>
>>> Implied in Shawn's e-mail was that you should undo
>>> anything you've done in terms of configuring replication,
>>> just go with the defaults.
>>>
>>> Finally, my bet is that your problematic Solr node is misconfigured.
>>>
>>> Best,
>>> Erick
>>>
>>> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <apa...@elyograg.org>
>>> wrote:
>>> > On 8/1/2017 12:09 PM, Michael B. Klein wrote:
>>> >> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most
>>> stuff
>>> >> seems to be working OK, except that one of the nodes never seems to
>>> get its
>>> >> replica updated.
>>> >>
>>> >> Queries take place through a non-caching, round-robin load balancer.
>>> The
>>> >> collection looks fine, with one shard and a replicationFactor of 3.
>>> >> Everything in the cloud diagram is green.
>>> >>
>>> >> But if I (for example) select?q=id:hd76s004z, the results come up
>>> empty 1
>>> >> out of every 3 times.
>>> >>
>>> >> Even several minutes after a commit and optimize, one repl

Re: Replication Question

2017-08-02 Thread Michael B. Klein
Another observation: After bringing the cluster back up just now, the
"1-in-3 nodes don't get the updates" issue persists, even with the cloud
diagram showing 3 nodes, all green.

On Wed, Aug 2, 2017 at 9:56 AM, Michael B. Klein <mbkl...@gmail.com> wrote:

> Thanks for your responses, Shawn and Erick.
>
> Some clarification questions, but first a description of my (non-standard)
> use case:
>
> My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are
> working well so far on the production cluster (knock wood); it's the staging
> cluster that's giving me fits. Here's why: In order to save money, I have
> the AWS auto-scaler scale the cluster down to zero nodes when it's not in
> use. Here's the (automated) procedure:
>
> SCALE DOWN
> 1) Call admin/collections?action=BACKUP for each collection to a shared
> NFS volume
> 2) Shut down all the nodes
>
> SCALE UP
> 1) Spin up 2 Zookeeper nodes and wait for them to stabilize
> 2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's
> live_nodes
> 3) Call admin/collections?action=RESTORE to put all the collections back
>
> This has been working very well, for the most part, with the following
> complications/observations:
>
> 1) If I don't optimize each collection right before BACKUP, the backup
> fails (see the attached solr_backup_error.json).
> 2) If I don't specify a replicationFactor during RESTORE, the admin
> interface's Cloud diagram only shows one active node per collection. Is
> this expected? Am I required to specify the replicationFactor unless I'm
> using a shared HDFS volume for solr data?
> 3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning
> message in the response, even though the restore seems to succeed.
> 4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I do
> not currently have any replication stuff configured (as it seems I should
> not).
> 5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud
> diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to
> think all replicas were live and synced and happy, and because I was
> accessing solr through a round-robin load balancer, I was never able to
> tell which node was out of sync.
>
> If it happens again, I'll make node-by-node requests and try to figure out
> what's different about the failing one. But the fact that this happened
> (and the way it happened) is making me wonder if/how I can automate this
> automated staging environment scaling reliably and with confidence that it
> will Just Work™.
>
> Comments and suggestions would be GREATLY appreciated.
>
> Michael
>
>
>
> On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> And please do not use optimize unless your index is
>> totally static. I only recommend it when the pattern is
>> to update the index periodically, like every day or
>> something and not update any docs in between times.
>>
>> Implied in Shawn's e-mail was that you should undo
>> anything you've done in terms of configuring replication,
>> just go with the defaults.
>>
>> Finally, my bet is that your problematic Solr node is misconfigured.
>>
>> Best,
>> Erick
>>
>> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>> > On 8/1/2017 12:09 PM, Michael B. Klein wrote:
>> >> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
>> >> seems to be working OK, except that one of the nodes never seems to
>> get its
>> >> replica updated.
>> >>
>> >> Queries take place through a non-caching, round-robin load balancer.
>> The
>> >> collection looks fine, with one shard and a replicationFactor of 3.
>> >> Everything in the cloud diagram is green.
>> >>
>> >> But if I (for example) select?q=id:hd76s004z, the results come up
>> empty 1
>> >> out of every 3 times.
>> >>
>> >> Even several minutes after a commit and optimize, one replica still
>> isn’t
>> >> returning the right info.
>> >>
>> >> Do I need to configure my `solrconfig.xml` with `replicateAfter`
>> options on
>> >> the `/replication` requestHandler, or is that a non-solrcloud,
>> >> standalone-replication thing?
>> >
>> > This is one of the more confusing aspects of SolrCloud.
>> >
>> > When everything is working perfectly in a SolrCloud install, the feature
>> > in Solr called "replication" is *never* used.  SolrCloud does require

Re: Replication Question

2017-08-02 Thread Michael B. Klein
Thanks for your responses, Shawn and Erick.

Some clarification questions, but first a description of my (non-standard)
use case:

My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are working
well so far on the production cluster (knock wood); it's the staging cluster
that's giving me fits. Here's why: In order to save money, I have the AWS
auto-scaler scale the cluster down to zero nodes when it's not in use.
Here's the (automated) procedure:

SCALE DOWN
1) Call admin/collections?action=BACKUP for each collection to a shared NFS
volume
2) Shut down all the nodes

SCALE UP
1) Spin up 2 Zookeeper nodes and wait for them to stabilize
2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's
live_nodes
3) Call admin/collections?action=RESTORE to put all the collections back

This has been working very well, for the most part, with the following
complications/observations:

1) If I don't optimize each collection right before BACKUP, the backup
fails (see the attached solr_backup_error.json).
2) If I don't specify a replicationFactor during RESTORE, the admin
interface's Cloud diagram only shows one active node per collection. Is
this expected? Am I required to specify the replicationFactor unless I'm
using a shared HDFS volume for solr data?
3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning
message in the response, even though the restore seems to succeed.
4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I do
not currently have any replication stuff configured (as it seems I should
not).
5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud
diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to
think all replicas were live and synced and happy, and because I was
accessing solr through a round-robin load balancer, I was never able to
tell which node was out of sync.

If it happens again, I'll make node-by-node requests and try to figure out
what's different about the failing one. But the fact that this happened
(and the way it happened) is making me wonder if/how I can automate this
automated staging environment scaling reliably and with confidence that it
will Just Work™.
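
Concretely, the node-by-node check I have in mind is just hitting each Solr host
directly with distrib=false, so each core answers only from its own index
(hostnames and collection name below are placeholders):

http://solr-1:8983/solr/<collection>/select?q=id:hd76s004z&distrib=false
http://solr-2:8983/solr/<collection>/select?q=id:hd76s004z&distrib=false
http://solr-3:8983/solr/<collection>/select?q=id:hd76s004z&distrib=false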

Comments and suggestions would be GREATLY appreciated.

Michael



On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> And please do not use optimize unless your index is
> totally static. I only recommend it when the pattern is
> to update the index periodically, like every day or
> something and not update any docs in between times.
>
> Implied in Shawn's e-mail was that you should undo
> anything you've done in terms of configuring replication,
> just go with the defaults.
>
> Finally, my bet is that your problematic Solr node is misconfigured.
>
> Best,
> Erick
>
> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> > On 8/1/2017 12:09 PM, Michael B. Klein wrote:
> >> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
> >> seems to be working OK, except that one of the nodes never seems to get
> its
> >> replica updated.
> >>
> >> Queries take place through a non-caching, round-robin load balancer. The
> >> collection looks fine, with one shard and a replicationFactor of 3.
> >> Everything in the cloud diagram is green.
> >>
> >> But if I (for example) select?q=id:hd76s004z, the results come up empty
> 1
> >> out of every 3 times.
> >>
> >> Even several minutes after a commit and optimize, one replica still
> isn’t
> >> returning the right info.
> >>
> >> Do I need to configure my `solrconfig.xml` with `replicateAfter`
> options on
> >> the `/replication` requestHandler, or is that a non-solrcloud,
> >> standalone-replication thing?
> >
> > This is one of the more confusing aspects of SolrCloud.
> >
> > When everything is working perfectly in a SolrCloud install, the feature
> > in Solr called "replication" is *never* used.  SolrCloud does require
> > the replication feature, though ... which is what makes this whole thing
> > very confusing.
> >
> > Replication is used to replicate an entire Lucene index (consisting of a
> > bunch of files on the disk) from a core on a master server to a core on
> > a slave server.  This is how replication was done before SolrCloud was
> > created.
> >
> > The way that SolrCloud keeps replicas in sync is *entirely* different.
> > SolrCloud has no masters and no slaves.  When you index or delete a
> > document in a SolrCloud collection, the request is forwarded to the
> > leader of the correct shard for that document.  The leader then sends a
> > copy of that req

Replication Question

2017-08-01 Thread Michael B. Klein
I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
seems to be working OK, except that one of the nodes never seems to get its
replica updated.

Queries take place through a non-caching, round-robin load balancer. The
collection looks fine, with one shard and a replicationFactor of 3.
Everything in the cloud diagram is green.

But if I (for example) select?q=id:hd76s004z, the results come up empty 1
out of every 3 times.

Even several minutes after a commit and optimize, one replica still isn’t
returning the right info.

Do I need to configure my `solrconfig.xml` with `replicateAfter` options on
the `/replication` requestHandler, or is that a non-solrcloud,
standalone-replication thing?

Michael


Re: Need help indexing/querying a particular type of hierarchy

2011-08-12 Thread Michael B. Klein
After a whole lot of facet-wrangling, I've come up with a practical
solution that suits my situation, which is to index each triple as a
series of paths. For example, if the shelve process of the
accessionWF workflow is completed, it gets indexed as:

<field name="wf_wps">accessionWF</field>
<field name="wf_wps">accessionWF:shelve</field>
<field name="wf_wps">accessionWF:shelve:completed</field>
<field name="wf_wsp">accessionWF</field>
<field name="wf_wsp">accessionWF:completed</field>
<field name="wf_wsp">accessionWF:completed:shelve</field>
<field name="wf_swp">completed</field>
<field name="wf_swp">completed:accessionWF</field>
<field name="wf_swp">completed:accessionWF:shelve</field>

(I could use PathHierarchyTokenizerFactory to eliminate two-thirds of those
field declarations, but doing it this way saves me from having to upgrade
my Solr to 3.1 just yet.)

That lets Solr return a facet structure that looks like this:

<lst name="facet_fields">
    <lst name="wf_wps">
        <int name="accessionWF">554</int>
        <int name="accessionWF:shelve">554</int>
        <int name="accessionWF:shelve:completed">550</int>
        <int name="accessionWF:shelve:error">4</int>
    </lst>
    <lst name="wf_wsp">
        <int name="accessionWF">554</int>
        <int name="accessionWF:completed">554</int>
        <int name="accessionWF:completed:shelve">550</int>
        <int name="accessionWF:error">4</int>
        <int name="accessionWF:error:shelve">4</int>
    </lst>
    <lst name="wf_swp">
        <int name="completed">554</int>
        <int name="completed:accessionWF">554</int>
        <int name="completed:accessionWF:shelve">550</int>
        <int name="error">4</int>
        <int name="error:accessionWF">4</int>
        <int name="error:accessionWF:shelve">4</int>
    </lst>
</lst>
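
(For completeness, the request behind that response is just an ordinary facet
query, roughly: select?q=*:*&rows=0&facet=true&facet.field=wf_wps&facet.field=wf_wsp&facet.field=wf_swp)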

I then use some Ruby post-processing to turn it into:

{
  wf_wps: {
    accessionWF: [554, {
      shelve: [554, {
        completed: 550,
        error: 4
      }],
      publish: [554, {
        completed: 554
      }]
    }]
  },
  wf_swp: {
    completed: [554, {
      accessionWF: [554, {
        shelve: 550,
        publish: 554
      }]
    }],
    error: [4, {
      accessionWF: [4, {
        shelve: 4
      }]
    }]
  },
  wf_wsp: {
    accessionWF: [554, {
      completed: [554, {
        shelve: 550,
        publish: 554
      }],
      error: [4, {
        shelve: 4
      }]
    }]
  }
}

Eventually I may try to code up something that does the restructuring
on the solr side, but for now, this suits my purposes.
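
In case it's useful, the post-processing is essentially the following (a minimal
sketch, not the exact code; it assumes the counts for one of the three facet
fields have already been pulled out of the response into a plain Hash of
path => count):

def nest_facets(counts)
  # counts: { "accessionWF" => 554, "accessionWF:shelve" => 554,
  #           "accessionWF:shelve:completed" => 550, ... }
  result = {}
  counts.sort_by { |path, _| path.count(":") }.each do |path, count|
    *parents, leaf = path.split(":")
    # walk down through the [count, children] pairs created for the parent paths
    node = parents.inject(result) { |h, key| h[key][1] }
    # three-segment paths become bare counts; shallower ones get a children hash
    node[leaf] = parents.size < 2 ? [count, {}] : count
  end
  result
end

# The final structure is then just
# { wf_wps: nest_facets(wps_counts), wf_wsp: nest_facets(wsp_counts), wf_swp: nest_facets(swp_counts) }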

Michael


Need help indexing/querying a particular type of hierarchy

2011-08-11 Thread Michael B. Klein
Hi all,

I have a particular data structure I'm trying to index into a solr document
so that I can query and facet it in a particular way, and I can't quite
figure out the best way to go about it.

One sample object is here: https://gist.github.com/1139065

The part that's tripping me up is the workflows. Each workflow has a name
(in this case, digitizationWF and accessionWF). Each workflow is made up of
a number of processes, each of which has its own current status. Every time
the status of a process within a workflow changes, the object is reindexed.

What I'd like to be able to do is present several hierarchies of facets: In
one, the workflow name is the top-level facet, with the second level showing
each process, under which is listed each status (completed, waiting, or
error) and the number of documents with that status for that process (some
values omitted for brevity):

accessionWF (583)
  publish (583)
    completed (574)
    waiting (6)
    error (3)
  shelve (583)
    completed (583)
etc.

I'd also like to be able to invert that presentation:

accessionWF (583)
  completed (583)
    publish (574)
    shelve (583)
  waiting (6)
    publish (6)
  error (3)
    publish (3)

or even

completed (583)
  accessionWF (583)
    publish (574)
    shelve (583)
  digitizationWF (583)
    initiate (583)
error (3)
  accessionWF (3)
    shelve (3)
etc.

I don't think Solr 4.0's pivot/hierarchical facets are what I'm looking for,
because the status values are ambiguous when not qualified by the process
name -- the object itself has no completed status, only a
publish:completed and a shelve:completed that I want to be able to group
together into a count/list of objects with completed processes. I also
don't think PathHierarchyTokenizerFactory is quite the answer either.

What kind of Solr magic, if any, am I looking for here?

Thanks in advance for any help or advice.
Michael

---
Michael B. Klein
Digitization Workflow Engineer
Stanford University Libraries


Re: Need help indexing/querying a particular type of hierarchy

2011-08-11 Thread Michael B. Klein
I've been experimenting with that, but that fq wouldn't limit my facet
counts adequately. Since the document has both an accessionWF and a
digitizationWF, the fq would match (and count) the document regardless of
the status of each process.

I suppose I could do something like this:

<field name="status_wps">accessionWF:start-accession:completed</field>
<field name="status_wps">accessionWF:cleanup:waiting</field>
<field name="status_wps">accessionWF:descriptive-metadata:completed</field>
<field name="status_wps">accessionWF:content-metadata:completed</field>
<field name="status_wps">accessionWF:rights-metadata:completed</field>
<field name="status_wps">accessionWF:publish:completed</field>
<field name="status_wps">accessionWF:shelve:error</field>
<field name="status_wsp">accessionWF:completed:start-accession</field>
<field name="status_wsp">accessionWF:waiting:cleanup</field>
<field name="status_wsp">accessionWF:completed:descriptive-metadata</field>
<field name="status_wsp">accessionWF:completed:content-metadata</field>
<field name="status_wsp">accessionWF:completed:rights-metadata</field>
<field name="status_wsp">accessionWF:completed:publish</field>
<field name="status_wsp">accessionWF:error:shelve</field>
<field name="status_swp">completed:accessionWF:start-accession</field>
<field name="status_swp">waiting:accessionWF:cleanup</field>
<field name="status_swp">completed:accessionWF:descriptive-metadata</field>
<field name="status_swp">completed:accessionWF:content-metadata</field>
<field name="status_swp">completed:accessionWF:rights-metadata</field>
<field name="status_swp">completed:accessionWF:publish</field>
<field name="status_swp">error:accessionWF:shelve</field>

and use a PathHierarchyTokenizerFactory with ":" as the delimiter. Then I
could use facet.field=status_wps&f.status_wps.facet.prefix=accessionWF: to
get the counts for all the accessionWF processes and statuses, then repeat
using status_wsp and status_swp for the various inversions. I was hoping for
something easier. :)
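
For reference, the field type I have in mind would look roughly like this in
schema.xml (an untested sketch; the type name and field attributes are just
what I'd start with):

<fieldType name="status_path" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter=":"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="status_wps" type="status_path" indexed="true" stored="false" multiValued="true"/>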

On Thu, Aug 11, 2011 at 6:40 AM, Dmitry Kan dmitry@gmail.com wrote:

 Hi,

 Can you keep your hierarchy flat in SOLR and then use filter queries
 (fq=wf:accessionWF) inside your facet queries (facet.field=status)?

 Or is the requirement to have one single facet query producing the
 hierarchical facet counts?

 On Thu, Aug 11, 2011 at 10:43 AM, Michael B. Klein mbkl...@gmail.com
 wrote:

  Hi all,
 
  I have a particular data structure I'm trying to index into a solr
 document
  so that I can query and facet it in a particular way, and I can't quite
  figure out the best way to go about it.
 
  One sample object is here: https://gist.github.com/1139065
 
  The part that's tripping me up is the workflows. Each workflow has a name
  (in this case, digitizationWF and accessionWF). Each workflow is made up
 of
  a number of processes, each of which has its own current status. Every
 time
  the status of a process within a workflow changes, the object is
 reindexed.
 
  What I'd like to be able to do is present several hierarchies of facets:
 In
  one, the workflow name is the top-level facet, with the second level
  showing
  each process, under which is listed each status (completed, waiting, or
  error) and the number of documents with that status for that process
 (some
  values omitted for brevity):
 
  accessionWF (583)
    publish (583)
      completed (574)
      waiting (6)
      error (3)
    shelve (583)
      completed (583)
 
  etc.
 
  I'd also like to be able to invert that presentation:
 
  accessionWF (583)
    completed (583)
      publish (574)
      shelve (583)
    waiting (6)
      publish (6)
    error (3)
      publish (3)
 
  or even
 
  completed (583)
    accessionWF (583)
      publish (574)
      shelve (583)
    digitizationWF (583)
      initiate (583)
  error (3)
    accessionWF (3)
      shelve (3)
 
  etc.
 
  I don't think Solr 4.0's pivot/hierarchical facets are what I'm looking
  for,
  because the status values are ambiguous when not qualified by the process
  name -- the object itself has no completed status, only a
  publish:completed and a shelve:completed that I want to be able to
  group
  together into a count/list of objects with completed processes. I also
  don't think PathHierarchyTokenizerFactory is quite the answer either.
 
  What kind of Solr magic, if any, am I looking for here?
 
  Thanks in advance for any help or advice.
  Michael
 
  ---
  Michael B. Klein
  Digitization Workflow Engineer
  Stanford University Libraries
 



 --
 Regards,

 Dmitry Kan