Re: Production Issue: SOLR node goes to non responsive, restart not helping at peak hours

2019-09-05 Thread Jack Schlederer
My mistake on the link, which should be this:
https://lucene.apache.org/solr/guide/7_1/solrcloud-autoscaling-auto-add-replicas.html#implementation-using-autoaddreplicas-trigger
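
For convenience, the suspend call that page describes looks roughly like this
(host and port below are placeholders for one of your Solr nodes):

curl -X POST -H 'Content-Type: application/json' \
  http://localhost:8983/solr/admin/autoscaling \
  -d '{"suspend-trigger": {"name": ".auto_add_replicas"}}'

A matching resume-trigger call with the same name should turn it back on once
the new hardware is in place.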

--Jack

On Thu, Sep 5, 2019 at 11:02 AM Jack Schlederer 
wrote:

> I'd defer to the committers if they have any further advice, but you might
> have to suspend the autoAddReplicas trigger through the autoscaling API (
> https://solr.stage.ecommerce.sandbox.directsupply-sandbox.cloud:8985/solr/ )
> if you set up your collections with autoAddReplicas enabled. Then, the
> system will not try to re-create missing replicas.
>
> Just another note on your setup: it seems to me that using only 3 nodes
> for 168 GB worth of indices isn't making the most of SolrCloud, which is
> built to shard indices across a large number of nodes. As a data point for
> your cluster sizing: my org runs only about 50GB of indices, but we run it
> over 35 nodes with 8GB of heap apiece, and each collection has 2+ shards.
>
> --Jack
>
> On Thu, Sep 5, 2019 at 8:47 AM Doss  wrote:
>
>> Thanks, Erick, for the explanation. The sum of all our index sizes is about
>> 138 GB, and only 2 indexes are > 19 GB, so it's time to scale up :-). Adding
>> new hardware will take at least a couple of days; until then, is there any
>> option to control the replication method?
>>
>> Thanks,
>> Doss.
>>
>> On Thu, Sep 5, 2019 at 6:12 PM Erick Erickson 
>> wrote:
>>
>> > You say you have three nodes, 130 replicas, and a replication factor of
>> > 3, so you have 130 cores/node. At least one of those cores has a 20G
>> > index, right?
>> >
>> > What is the sum of all the indexes on a single physical machine?
>> >
>> > I think your system is under-provisioned and that you’ve been riding at
>> > the edge
>> > of instability for quite some time and have added enough more docs that
>> > you finally reached a tipping point. But that’s largely speculation.
>> >
>> > So adding more heap may help. But Real Soon Now you need to think about
>> > adding
>> > more hardware and moving some of your replicas to that new hardware.
>> >
>> > Again, this is speculation. But when systems are running with an
>> > _aggregate_ index size that is many multiples of the available memory
>> > (total physical memory), it's a red flag. I'm guessing a bit since I
>> > don't know the aggregate for all replicas…
>> >
>> > Best,
>> > Erick
>> >
>> > > On Sep 5, 2019, at 8:08 AM, Doss  wrote:
>> > >
>> > > @Jorn: We are adding a few more zookeeper nodes soon. Thanks.
>> > >
>> > > @Erick, sorry, I couldn't follow that clearly. We have 90GB RAM per
>> > > node, out of which 14 GB is assigned to the heap. Do you mean we have
>> > > to allocate more heap, or do we need to add more physical RAM?
>> > >
>> > > This system ran for 8 to 9 months without any major issues; only in
>> > > recent times have we been facing so many such incidents.
>> > >
>> > > On Thu, Sep 5, 2019 at 5:20 PM Erick Erickson <
>> erickerick...@gmail.com>
>> > > wrote:
>> > >
>> > >> If I'm reading this correctly, you have a huge amount of index in not
>> > >> much memory. You only have 14g allocated across 130 replicas, at
>> > >> least one of which has a 20g index. You don't need as much memory as
>> > >> your aggregate index size, but this system feels severely
>> > >> under-provisioned. I suspect that's the root of your instability.
>> > >>
>> > >> Best,
>> > >> Erick
>> > >>
>> > >> On Thu, Sep 5, 2019, 07:08 Doss  wrote:
>> > >>
>> > >>> Hi,
>> > >>>
>> > >>> We are using a 3-node SOLR (7.0.1) cloud setup with a 1-node
>> > >>> zookeeper ensemble. Each system has 16 CPUs and 90GB RAM (14GB heap),
>> > >>> with 130 cores (3 NRT replicas) and index sizes ranging from 700MB
>> > >>> to 20GB.
>> > >>>
>> > >>> autoCommit - once every 10 minutes
>> > >>> softCommit - once every 30 seconds
>> > >>>
>> > >>> At peak time, if one shard goes into recovery mode, many other shards
>> > >>> also go into recovery mode within a few minutes, which creates huge
>> > >>> load (200+ load average) and SOLR becomes non-responsive. To fix this
>> > >>> we restart the node, but then the leader tries to correct the index by
>> > >>> initiating replication, which causes load again, and the node goes
>> > >>> back into a non-responsive state.
>> > >>>
>> > >>> As soon as a node starts, the replication process is initiated for
>> > >>> all 130 cores. Is there any way we can control it, for example one
>> > >>> core after the other?
>> > >>>
>> > >>> Thanks,
>> > >>> Doss.
>> > >>>
>> > >>
>> >
>> >
>>
>


Re: Production Issue: SOLR node goes to non responsive, restart not helping at peak hours

2019-09-05 Thread Jack Schlederer
I'd defer to the committers if they have any further advice, but you might
have to suspend the autoAddReplicas trigger through the autoscaling API (
https://solr.stage.ecommerce.sandbox.directsupply-sandbox.cloud:8985/solr/ )
if you set up your collections with autoAddReplicas enabled. Then, the
system will not try to re-create missing replicas.
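
Alternatively, if it's only a couple of collections, I believe autoAddReplicas
can be switched off per collection with MODIFYCOLLECTION (collection name and
host below are placeholders):

curl 'http://localhost:8983/solr/admin/collections?action=MODIFYCOLLECTION&collection=mycollection&autoAddReplicas=false'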

Just another note on your setup: it seems to me that using only 3 nodes
for 168 GB worth of indices isn't making the most of SolrCloud, which is
built to shard indices across a large number of nodes. As a data point for
your cluster sizing: my org runs only about 50GB of indices, but we run it
over 35 nodes with 8GB of heap apiece, and each collection has 2+ shards.

--Jack

On Thu, Sep 5, 2019 at 8:47 AM Doss  wrote:

> Thanks, Erick, for the explanation. The sum of all our index sizes is about
> 138 GB, and only 2 indexes are > 19 GB, so it's time to scale up :-). Adding
> new hardware will take at least a couple of days; until then, is there any
> option to control the replication method?
>
> Thanks,
> Doss.
>
> On Thu, Sep 5, 2019 at 6:12 PM Erick Erickson 
> wrote:
>
> > You say you have three nodes, 130 replicas, and a replication factor of 3,
> > so you have 130 cores/node. At least one of those cores has a 20G index,
> > right?
> >
> > What is the sum of all the indexes on a single physical machine?
> >
> > I think your system is under-provisioned and that you’ve been riding at
> > the edge
> > of instability for quite some time and have added enough more docs that
> > you finally reached a tipping point. But that’s largely speculation.
> >
> > So adding more heap may help. But Real Soon Now you need to think about
> > adding
> > more hardware and moving some of your replicas to that new hardware.
> >
> > Again, this is speculation. But when systems are running with an
> > _aggregate_ index size that is many multiples of the available memory
> > (total physical memory), it's a red flag. I'm guessing a bit since I
> > don't know the aggregate for all replicas…
> >
> > Best,
> > Erick
> >
> > > On Sep 5, 2019, at 8:08 AM, Doss  wrote:
> > >
> > > @Jorn: We are adding a few more zookeeper nodes soon. Thanks.
> > >
> > > @Erick, sorry, I couldn't follow that clearly. We have 90GB RAM per
> > > node, out of which 14 GB is assigned to the heap. Do you mean we have
> > > to allocate more heap, or do we need to add more physical RAM?
> > >
> > > This system ran for 8 to 9 months without any major issues; only in
> > > recent times have we been facing so many such incidents.
> > >
> > > On Thu, Sep 5, 2019 at 5:20 PM Erick Erickson  >
> > > wrote:
> > >
> > >> If I'm reading this correctly, you have a huge amount of index in not
> > >> much memory. You only have 14g allocated across 130 replicas, at
> > >> least one of which has a 20g index. You don't need as much memory as
> > >> your aggregate index size, but this system feels severely
> > >> under-provisioned. I suspect that's the root of your instability.
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> On Thu, Sep 5, 2019, 07:08 Doss  wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> We are using a 3-node SOLR (7.0.1) cloud setup with a 1-node
> > >>> zookeeper ensemble. Each system has 16 CPUs and 90GB RAM (14GB heap),
> > >>> with 130 cores (3 NRT replicas) and index sizes ranging from 700MB
> > >>> to 20GB.
> > >>>
> > >>> autoCommit - once every 10 minutes
> > >>> softCommit - once every 30 seconds
> > >>>
> > >>> At peak time, if one shard goes into recovery mode, many other shards
> > >>> also go into recovery mode within a few minutes, which creates huge
> > >>> load (200+ load average) and SOLR becomes non-responsive. To fix this
> > >>> we restart the node, but then the leader tries to correct the index by
> > >>> initiating replication, which causes load again, and the node goes
> > >>> back into a non-responsive state.
> > >>>
> > >>> As soon as a node starts, the replication process is initiated for
> > >>> all 130 cores. Is there any way we can control it, for example one
> > >>> core after the other?
> > >>>
> > >>> Thanks,
> > >>> Doss.
> > >>>
> > >>
> >
> >
>
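
One more thought on the replication-control question quoted above: if I
remember right, the /replication handler accepts a maxWriteMBPerSec setting
that throttles how fast a recovering replica pulls index files; worth
double-checking against the Index Replication page for your version. A sketch
of what that might look like in solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="invariants">
    <!-- cap recovery transfer speed so 130 replicating cores can't saturate the node -->
    <str name="maxWriteMBPerSec">16</str>
  </lst>
</requestHandler>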


Different DIH failure behavior on non-sharded and sharded collections

2019-08-26 Thread Jack Schlederer
Hello,

The size and complexity of a collection that I'm running on a SolrCloud
(v7.5) has recently grown to the point where it warranted splitting the
collection into two shards. I run the data import handler once a day to
index documents returned by a MSSQL stored proc. Previously, on the
single-shard collection, when the DIH encountered a document that was
missing a required field or otherwise couldn't be indexed, it would throw a
warning into the log and continue. Now, with a doubly-sharded collection, a
similar event causes the entire DIH full import to fail with a
DistributedUpdatesAsyncException when posting that document to another
node. I was wondering if this is a known issue with the DIH as of 7.5 and
if there's a way to have the DistributedUpdateProcessor sort of "warn and
continue" when this type of document is encountered.

Thanks in advance!
Jack


Restoring and upgrading a standalone index to SolrCloud

2018-10-03 Thread Jack Schlederer
Hello,

We currently run Solr 5.4 as our production search backend. We run it in a
master/slave replication architecture, and we're starting an upgrade to
Solr 7.5 using a SolrCloud architecture.

One of our collections is around 20GB and hosts about 200M documents, and
it would take around 6 hours to do a full dataimport from the database, so
we'd like to upgrade the index and restore it to SolrCloud. I've
successfully upgraded the Lucene 5 index to Lucene 6, and then to Lucene 7,
so I think I have an index that can be restored to Solr 7. Do you know if
it's possible to restore an index like this to a SolrCloud environment if I
can get it into a directory that is shared by all the nodes?
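
For reference, the upgrade itself was just the Lucene IndexUpgrader run once
per major-version hop, along these lines (jar versions and the index path
below are illustrative):

# 5.x index -> 6.x format (lucene-backward-codecs must be on the classpath)
java -cp lucene-core-6.6.5.jar:lucene-backward-codecs-6.6.5.jar \
  org.apache.lucene.index.IndexUpgrader -delete-prior-commits /data/index

# 6.x index -> 7.x format
java -cp lucene-core-7.5.0.jar:lucene-backward-codecs-7.5.0.jar \
  org.apache.lucene.index.IndexUpgrader -delete-prior-commits /data/index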

Thanks,
Jack


Re: ZooKeeper issues with AWS

2018-09-05 Thread Jack Schlederer
Ah, yes. We use ZK 3.4.13 for our ZK server nodes, but we never thought to
upgrade the ZK JAR within Solr. We included that in our Solr image, and
it's working like a charm, re-resolving DNS names when new ZKs come up with
different IPs. Thanks for the help, guys!
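
In case it helps anyone else, the image change amounted to roughly this (the
/opt/solr prefix below is from the official Docker image, so adjust for your
install; the exact bundled jar version varies by Solr release):

# remove the bundled ZooKeeper client jar and drop in 3.4.13
rm /opt/solr/server/solr-webapp/webapp/WEB-INF/lib/zookeeper-*.jar
cp zookeeper-3.4.13.jar /opt/solr/server/solr-webapp/webapp/WEB-INF/lib/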

--Jack

On Sat, Sep 1, 2018 at 9:41 AM Shawn Heisey  wrote:

> On 9/1/2018 3:42 AM, Björn Häuser wrote:
> > as far as I can see the required fix for this is finally in 3.4.13:
> >
> > - https://issues.apache.org/jira/browse/ZOOKEEPER-2184
> >
> > Would be great to have this in the next solr update.
>
> Issue created.
>
> https://issues.apache.org/jira/browse/SOLR-12727
>
> Note that you can actually do this upgrade yourself on your Solr
> install.  In server/solr-webapp/webapp/WEB-INF/lib, just delete the
> current zookeeper jar, copy the 3.4.13 jar into the directory, then
> restart Solr.  If you're on Windows, you'll need to stop Solr before you
> can do that.  Windows doesn't allow deleting a file that is open.
>
> I expect that if you do this upgrade yourself, Solr should work without
> problems.  Typically in the past when a new ZK version is included, no
> code changes are required.
>
> Thanks,
> Shawn
>
>


Re: ZooKeeper issues with AWS

2018-08-31 Thread Jack Schlederer
Thanks Erick. After some more testing, I'd like to correct the failure case
we're seeing. It's not when 2 ZK nodes are killed that we have trouble
recovering, but rather when all 3 ZK nodes that came up when the cluster
was initially started get killed at some point. Even if it's one at a time,
and we wait for a new one to spin up and join the cluster before killing
the next one, we get into a bad state when none of the 3 nodes that were in
the cluster initially are there anymore, even though they've been replaced
by our cloud provider spinning up new ZK's. We assign DNS names to the
ZooKeepers as they spin up, with a 10 second TTL, and those are what get
set as the ZK_HOST environment variable on the Solr hosts (i.e., ZK_HOST=
zk1.foo.com:2182,zk2.foo.com:2182,zk3.foo.com:2182). Our working hypothesis
is that Solr's JVM is caching the IP addresses for the ZK hosts' DNS names
when it starts up, and doesn't re-query DNS for some reason when it finds
that that IP address is no longer reachable (i.e., when a ZooKeeper node
dies and spins up at a different IP). Our current trajectory has us finding
a way to assign known static IPs to the ZK nodes upon startup, and
assigning those IPs to the ZK_HOST env var, so we can take DNS lookups out
of the picture entirely.

We can reproduce this in our cloud environment, as each ZK node has its own
IP and DNS name, but it's difficult to reproduce locally due to all the
ZooKeeper containers having the same IP when running locally (127.0.0.1).

Please let us know if you have insight into this issue.

Thanks,
Jack

On Fri, Aug 31, 2018 at 10:40 AM Erick Erickson 
wrote:

> Jack:
>
> Is it possible to reproduce "manually"? By that I mean without the
> chaos bit by the following:
>
> - Start 3 ZK nodes
> - Create a multi-node, multi-shard Solr collection.
> - Sequentially stop and start the ZK nodes, waiting for the ZK quorum
> to recover between restarts.
> - Solr does not reconnect to the restarted ZK node and will think it's
> lost quorum after the second node is restarted.
>
> bq. Kill 2, however, and we lose the quorum and we have
> collections/replicas that appear as "gone" on the Solr Admin UI's
> cloud graph display.
>
> It's odd that replicas appear as "gone", and suggests that your ZK
> ensemble is possibly not correctly configured, although exactly how is
> a mystery. Solr pulls its picture of the topology of the network from
> ZK, establishes watches and the like. For most operations, Solr
> doesn't even ask ZooKeeper for anything since its picture of the
> cluster is stored locally. ZK's job is to inform the various Solr nodes
> when the topology changes, i.e. when _Solr_ nodes change state. For
> querying and indexing, there's no ZK involved at all. Even if _all_
> ZooKeeper nodes disappear, Solr should still be able to talk to other
> Solr nodes and shouldn't show them as down just because it can't talk
> to ZK. Indeed, querying should be OK although indexing will fail if
> quorum is lost.
>
> But you say you see the restarted ZK nodes rejoin the ZK ensemble, so
> the ZK config seems right. Is there any chance your chaos testing
> "somehow" restarts the ZK nodes with any changes to the configs?
> Shooting in the dark here.
>
> For a replica to be "gone", the host node should _also_ be removed
> form the "live_nodes" znode, H. I do wonder if what you're
> observing is a consequence of both killing ZK nodes and Solr nodes.
> I'm not saying this is what _should_ happen, just trying to understand
> what you're reporting.
>
> My theory here is that your chaos testing kills some Solr nodes and
> that fact is correctly propagated to the remaining Solr nodes. Then
> your ZK nodes are killed and somehow Solr doesn't reconnect to ZK
> appropriately so its picture of the cluster has the node as
> permanently down. Then you restart the Solr node and that information
> isn't propagated to the Solr nodes since they didn't reconnect. If
> that were the case, then I'd expect the admin UI to correctly show the
> state of the cluster when hit on a Solr node that has never been
> restarted.
>
> As you can tell, I'm using something of a scattergun approach here b/c
> this isn't what _should_ happen given what you describe.
> Theoretically, all the ZK nodes should be able to go away and come
> back and Solr reconnect...
>
> As an aside, if you are ever in the code you'll see that for a replica
> to be usable, it must have both the state set to "active" _and_ the
> corresponding node has to be present in the live_nodes ephemeral
> zNode.
>
> Is there any chance you could try the manual steps above (AWS isn't
> necessary here) and let us know what happens? And if we can get a
> reproducible set of steps, feel free to open a

Re: ZooKeeper issues with AWS

2018-08-30 Thread Jack Schlederer
We run a 3 node ZK cluster, but I'm not concerned about 2 nodes failing at
the same time. Our chaos process only kills approximately one node per
hour, and our cloud service provider automatically spins up another ZK node
when one goes down. All 3 ZK nodes are back up within 2 minutes, talking to
each other and syncing data. It's just that Solr doesn't seem to recognize
it. We'd have to restart Solr to get it to recognize the new Zookeepers,
which we can't do without taking downtime or losing data that's stored on
non-persistent disk within the container.

The ZK_HOST environment variable lists all 3 ZK nodes.

We're running ZooKeeper version 3.4.13.

Thanks,
Jack

On Thu, Aug 30, 2018 at 4:12 PM Walter Underwood 
wrote:

> How many Zookeeper nodes in your ensemble? You need five nodes to
> handle two failures.
>
> Are your Solr instances started with a zkHost that lists all five
> Zookeeper nodes?
>
> What version of Zookeeper?
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Aug 30, 2018, at 1:45 PM, Jack Schlederer <
> jack.schlede...@directsupply.com> wrote:
> >
> > Hi all,
> >
> > My team is attempting to spin up a SolrCloud cluster with an external
> > ZooKeeper ensemble. We're trying to engineer our solution to be HA and
> > fault-tolerant such that we can lose either 1 Solr instance or 1
> > ZooKeeper and not take downtime. We use chaos engineering to randomly
> > kill instances to test our fault-tolerance. Killing Solr instances seems
> > to be solved, as we use a high enough replication factor and Solr's
> > built-in autoscaling to ensure that new Solr nodes added to the cluster
> > get the replicas that were lost from the killed node. However, ZooKeeper
> > seems to be a different story. We can kill 1 ZooKeeper instance and
> > still maintain quorum, and everything is good. It comes back and starts
> > participating in leader elections, etc.
> > Kill 2, however, and we lose the quorum and we have collections/replicas
> > that appear as "gone" on the Solr Admin UI's cloud graph display, and we
> > get Java errors in the log reporting that collections can't be read from
> > ZK. This means we aren't servicing search requests. We found an open JIRA
> > that reports this same issue, but its only affected version is 5.3.1. We
> > are experiencing this problem in 7.3.1. Has there been any progress or
> > potential workarounds on this issue since?
> >
> > Thanks,
> > Jack
> >
> > Reference:
> > https://issues.apache.org/jira/browse/SOLR-8868
>
>


ZooKeeper issues with AWS

2018-08-30 Thread Jack Schlederer
Hi all,

My team is attempting to spin up a SolrCloud cluster with an external
ZooKeeper ensemble. We're trying to engineer our solution to be HA and
fault-tolerant such that we can lose either 1 Solr instance or 1 ZooKeeper
and not take downtime. We use chaos engineering to randomly kill instances
to test our fault-tolerance. Killing Solr instances seems to be solved, as
we use a high enough replication factor and Solr's built-in autoscaling to
ensure that new Solr nodes added to the cluster get the replicas that were
lost from the killed node. However, ZooKeeper seems to be a different story.
We can kill 1 ZooKeeper instance and still maintain quorum, and everything
is good. It comes back and starts participating in leader elections, etc.
Kill 2, however, and we lose the quorum and we have collections/replicas
that appear as "gone" on the Solr Admin UI's cloud graph display, and we
get Java errors in the log reporting that collections can't be read from
ZK. This means we aren't servicing search requests. We found an open JIRA
that reports this same issue, but its only affected version is 5.3.1. We
are experiencing this problem in 7.3.1. Has there been any progress or
potential workarounds on this issue since?
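
For concreteness, the Solr-side autoscaling piece is a nodeLost trigger set up
roughly like this (trigger name, waitFor, and host are illustrative):

curl -X POST -H 'Content-Type: application/json' \
  http://localhost:8983/solr/admin/autoscaling -d '{
    "set-trigger": {
      "name": "node_lost_trigger",
      "event": "nodeLost",
      "waitFor": "120s",
      "enabled": true,
      "actions": [
        {"name": "compute_plan", "class": "solr.ComputePlanAction"},
        {"name": "execute_plan", "class": "solr.ExecutePlanAction"}
      ]
    }
  }'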

Thanks,
Jack

Reference:
https://issues.apache.org/jira/browse/SOLR-8868


Problem with 60 cc and 60cc

2015-07-30 Thread Jack Schlederer
Hi,

I'm in the process of revising a schema for the search function of an
eCommerce platform. One of the sticking points is a particular use case:
searching for "xx yy", where xx is any number and yy is an abbreviation for
a unit of measurement (mm, cc, ml, in, etc.). The problem is that
searching for "xx yy" and "xxyy" returns different results. One possible
solution I tried was applying a few PatternReplaceCharFilterFactories to
remove the whitespace between xx and yy if there was any (at both index-
and query-time). These are the first few lines in the analyzer:

<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(?i)(\d+)\s?(pounds?|lbs?)" replacement="$1lb" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(?i)(\d+)\s?(inch[es]?|in?)" replacement="$1in" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(?i)(\d+)\s?(ounc[es]?|oz)" replacement="$1oz" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(?i)(\d+)\s?(quarts?|qts?)" replacement="$1qt" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(?i)(\d+)\s?(gallons?|gal?)" replacement="$1gal" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(?i)(\d+)\s?(mm|cc|ml)" replacement="$1$2" />

A few more lines down, I use a PatternCaptureGroupFilterFactory to emit the
tokens "xxyy", "xx", and "yy":

<filter class="solr.PatternCaptureGroupFilterFactory"
        pattern="(\d+)(lb|oz|in|qt|gal|mm|cc|ml)" preserve_original="true" />

In the Solr admin analysis tool for the field type this applies to, both
"xx yy" and "xxyy" are tokenized and filtered down identically (at both
index- and query-time).

The platform I'm working on searches many different fields by default, but
even when I rig up the query to search only this one field, I still get
different results for "xxyy" and "xx yy". I'm wondering why that is.
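
One way to see how the two forms actually parse (core and field names below
are placeholders) is to compare the debug=query output for each:

curl -G 'http://localhost:8983/solr/products/select' \
  --data-urlencode 'q=size_text:"60 cc"' \
  --data-urlencode 'debug=query' --data-urlencode 'rows=0'

curl -G 'http://localhost:8983/solr/products/select' \
  --data-urlencode 'q=size_text:60cc' \
  --data-urlencode 'debug=query' --data-urlencode 'rows=0'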

Attached is a screenshot from Solr analysis.

Thanks,
John