Re: Solr 7.7.2 Autoscaling policy - Poor performance
> there are known perf issues in computing very large clusters

Is there any documentation or open tickets on this that you have handy? If that is the case, then we might be back to looking at separate znodes. Right now, if we provide a nodeset on collection creation, it creates them quickly. I don't want to make many changes, as this is part of our production at this time.

From: Noble Paul
Sent: Wednesday, September 4, 2019 12:14 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 7.7.2 Autoscaling policy - Poor performance

There are known perf issues in computing very large clusters. Give it a try with the following rules:

"FOO_CUSTOMER":[
  {
    "replica":"0",
    "sysprop.HELM_CHART":"!FOO_CUSTOMER",
    "strict":"true"},
  {
    "replica":"<2",
    "node":"#ANY",
    "strict":"false"}]

On Wed, Sep 4, 2019 at 1:49 AM Andrew Kettmann wrote:
>
> Currently our 7.7.2 cluster has ~600 hosts and each collection is using an autoscaling policy based on a system property. Our goal is a single core per host (container, running on K8S). However, as we have rolled more containers/collections into the cluster, any creation/move actions are taking a huge amount of time. In fact, we generally hit the 180 second timeout if we don't schedule the action as async, though it gets completed anyway. Looking at the code, it looks like for each core it is considering the entire cluster.
>
> Right now our autoscaling policies look like this; note we are feeding a sysprop on startup for each collection to map to specific containers:
>
> "FOO_CUSTOMER":[
>   {
>     "replica":"#ALL",
>     "sysprop.HELM_CHART":"FOO_CUSTOMER",
>     "strict":"true"},
>   {
>     "replica":"<2",
>     "node":"#ANY",
>     "strict":"false"}]
>
> Does name-based filtering allow wildcards? Also, would that likely fix the issue of the time it takes for Solr to decide where cores can go? Or any other suggestions for making this more efficient on the Solr overseer?
> We do have dedicated overseer nodes, but the leader maxes out CPU for a while while it is thinking about this.
>
> We are considering putting each collection into its own zookeeper znode/chroot if we can't support this many nodes per overseer. I would like to avoid that if possible, but creating a collection in under 10 minutes would be neat too.
>
> I appreciate any input/suggestions anyone has!
>
> Andrew Kettmann
> DevOps Engineer
> P: 1.314.596.2836
>
> evolve24 Confidential & Proprietary Statement: This email and any attachments are confidential and may contain information that is privileged, confidential or exempt from disclosure under applicable law. It is intended for the use of the recipients. If you are not the intended recipient, or believe that you have received this communication in error, please do not read, print, copy, retransmit, disseminate, or otherwise use the information. Please delete this email and attachments, without reading, printing, copying, forwarding or saving them, and notify the Sender immediately by reply email. No confidentiality or privilege is waived or lost by any transmission in error.

--
- Noble Paul
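The rules suggested above can be applied with a set-policy command against the Solr 7.x autoscaling write API. A minimal sketch in Python; the `build_set_policy` helper is illustrative (not part of any Solr client), and the endpoint path assumes the default /solr/admin/autoscaling:

```python
import json

def build_set_policy(policy_name, helm_chart):
    """Build a set-policy payload that forbids replicas on nodes whose
    HELM_CHART sysprop does not match, mirroring the suggested rules."""
    return {
        "set-policy": {
            policy_name: [
                # Strict: zero replicas on nodes with a different HELM_CHART.
                {"replica": "0",
                 "sysprop.HELM_CHART": "!" + helm_chart,
                 "strict": "true"},
                # Non-strict preference: at most one replica per node.
                {"replica": "<2", "node": "#ANY", "strict": "false"},
            ]
        }
    }

payload = build_set_policy("FOO_CUSTOMER", "FOO_CUSTOMER")
body = json.dumps(payload)
# POST `body` to http://<host>:8983/solr/admin/autoscaling to apply it.
```

The "replica":"0" plus negated sysprop form excludes non-matching nodes directly, rather than forcing #ALL replicas onto matching ones.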
Re: Solr 7.7.2 Autoscaling policy - Poor performance
> You’re going to want to start by having more than 3GB for memory in my opinion, but the rest of your setup is more complex than I’ve dealt with.

Right now the overseer is set to a max heap of 3GB, but it is only using ~260MB of heap, so memory doesn't seem to be the issue unless there is a part of the picture I am missing. Our overseers' only jobs are being overseer and holding the .system collection. I would imagine that if the overseer were hitting memory constraints, it would have allocated more than 300MB of the total 3GB it is allowed, right?
Re: Solr 7.7.2 Autoscaling policy - Poor performance
> How many zookeepers do you have? How many collections? What is their size? How much CPU / memory do you give per container? How much heap in comparison to total memory of the container?

3 ZooKeepers. 733 containers/nodes, 735 total cores. Each core ranges from ~4-10GB of index (autoscaling splits at 12GB). 10 collections, ranging from 147 shards at most to 3 at least. Replication factor of 2, other than .system, which has 3 replicas.

Each container has a min/max heap of 750MB, other than the overseer containers, which have a min/max of 3GB. Containers aren't hard limited by K8S on memory or CPU, but the machines the containers are on have 4 cores and ~13GB of RAM.

Now that I look at CPU usage on a per-container basis, it looks like it is maxing out all four cores on the VM hosting the overseer container, while barely using the heap (300MB). I suppose that means that if we put the overseers on machines with more cores, it might get things done a bit faster. Though that still seems like a limited solution, as we are going to grow this cluster to at least double its size, if not larger. We are using the solr:7.7.2 container.
Java options on the home page are below:

  -DHELM_CHART=overseer -DSTOP.KEY=solrrocks -DSTOP.PORT=7983
  -Dhost=overseer-solr-0.solr.DOMAIN
  -Djetty.home=/opt/solr/server -Djetty.port=8983
  -Dsolr.data.home=
  -Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf
  -Dsolr.install.dir=/opt/solr
  -Dsolr.jetty.https.port=8983
  -Dsolr.log.dir=/opt/solr/server/logs
  -Dsolr.log.level=INFO
  -Dsolr.solr.home=/opt/solr/server/home
  -Duser.timezone=UTC
  -DzkClientTimeout=6
  -DzkHost=zookeeper-1.DOMAIN:2181,zookeeper-2.DOMAIN:2181,zookeeper-3.DOMAIN:2181/ZNODE
  -XX:+CMSParallelRemarkEnabled -XX:+CMSScavengeBeforeRemark
  -XX:+ParallelRefProcEnabled -XX:+UseCMSInitiatingOccupancyOnly
  -XX:+UseConcMarkSweepGC -XX:-OmitStackTraceInFastThrow
  -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000
  -XX:ConcGCThreads=4 -XX:MaxTenuringThreshold=8 -XX:NewRatio=3
  -XX:ParallelGCThreads=4 -XX:PretenureSizeThreshold=64m
  -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90
  -Xlog:gc*:file=/opt/solr/server/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M
  -Xms3g -Xmx3g -Xss256k
Solr 7.7.2 Autoscaling policy - Poor performance
Currently our 7.7.2 cluster has ~600 hosts and each collection is using an autoscaling policy based on a system property. Our goal is a single core per host (container, running on K8S). However, as we have rolled more containers/collections into the cluster, any creation/move actions are taking a huge amount of time. In fact, we generally hit the 180 second timeout if we don't schedule the action as async, though it gets completed anyway. Looking at the code, it looks like for each core it is considering the entire cluster.

Right now our autoscaling policies look like this; note we are feeding a sysprop on startup for each collection to map to specific containers:

"FOO_CUSTOMER":[
  {
    "replica":"#ALL",
    "sysprop.HELM_CHART":"FOO_CUSTOMER",
    "strict":"true"},
  {
    "replica":"<2",
    "node":"#ANY",
    "strict":"false"}]

Does name-based filtering allow wildcards? Also, would that likely fix the issue of the time it takes for Solr to decide where cores can go? Or any other suggestions for making this more efficient on the Solr overseer? We do have dedicated overseer nodes, but the leader maxes out CPU for a while while it is thinking about this.

We are considering putting each collection into its own zookeeper znode/chroot if we can't support this many nodes per overseer. I would like to avoid that if possible, but creating a collection in under 10 minutes would be neat too.

I appreciate any input/suggestions anyone has!

Andrew Kettmann
DevOps Engineer
P: 1.314.596.2836
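Since create/move actions routinely exceed the 180 second timeout unless scheduled as async, the submission pattern described above can be sketched in Python. The helper name and async id are illustrative; the parameter names mirror the Collections API CREATE call:

```python
def create_collection_params(name, config, num_shards, rf, policy, async_id):
    """Build /admin/collections CREATE parameters, scheduled async so the
    HTTP call returns immediately instead of risking the 180s timeout."""
    return {
        "action": "CREATE",
        "name": name,
        "collection.configName": config,
        "numShards": str(num_shards),
        "replicationFactor": str(rf),
        "policy": policy,
        "async": async_id,
    }

params = create_collection_params(
    "FOO_CUSTOMER", "project-solr-7", 4, 2, "FOO_CUSTOMER", "create-foo-1")
# GET http://<host>:8983/solr/admin/collections with these params, then poll
# action=REQUESTSTATUS&requestid=create-foo-1 until the task finishes.
```

The async id doubles as the handle for later REQUESTSTATUS checks, so picking a deterministic id per operation makes automation easier.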
Re: Solr 7.7.2 - Autoscaling in new cluster ignoring sysprop rules, possibly all rules
Entered ticket https://issues.apache.org/jira/browse/SOLR-13586

Sadly, no patch attached this time, as it is a much more complicated issue than my last one and a good bit above my pay grade with Java.

From: Andrzej Białecki
Sent: Friday, June 28, 2019 4:29:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 7.7.2 - Autoscaling in new cluster ignoring sysprop rules, possibly all rules

Andrew, please create a JIRA issue - in my opinion this is a bug, not a feature, or at least something that needs clarification.

> On 27 Jun 2019, at 23:56, Andrew Kettmann wrote:
>
> I found the issue. Autoscaling seems to silently ignore rules (at least sysprop rules). Example rule:
>
> {'set-policy': {'sales-uat': [{'node': '#ANY',
>                                'replica': '<2',
>                                'strict': 'false'},
>                               {'replica': '#ALL',
>                                'strict': 'true',
>                                'sysprop.HELM_CHART': 'foo'}]}}
>
> Two cases will get the sysprop rule ignored:
>
> 1. No nodes have a HELM_CHART system property defined
> 2. No nodes have the value "foo" for the HELM_CHART system property
>
> If you have SOME nodes that have -DHELM_CHART=foo, then it will fail if it cannot satisfy another strict rule. So it appears sysprop autoscaling rules cannot be strict on their own.
>
> Hopefully this can solve some issues for other people as well.
>
> From: Andrew Kettmann
> Sent: Tuesday, June 25, 2019 1:04:21 PM
> To: solr-user@lucene.apache.org
> Subject: Solr 7.7.2 - Autoscaling in new cluster ignoring sysprop rules, possibly all rules
>
> Using the Docker 7.7.2 image.
>
> Solr 7.7.2 on a new znode on ZK. Created the chroot using solr zk mkroot.
>
> Created a policy:
>
> {'set-policy': {'banana': [{'replica': '#ALL',
>                             'sysprop.HELM_CHART': 'notbanana'}]}}
>
> No errors on creation of the policy.
>
> I have no nodes with that value for the system property "HELM_CHART"; I have nodes that contain only "banana" and "rulesos" for that value.
> I create the collection with a call to /admin/collections:
>
> {'action': 'CREATE',
>  'collection.configName': 'project-solr-7',
>  'name': 'banana',
>  'numShards': '2',
>  'policy': 'banana',
>  'replicationFactor': '2'}
>
> and it creates the collection without an error, where what I expected was for the collection creation to fail. This is the behavior I had seen in the past, but after tearing down and recreating the cluster in a higher environment, it does not appear to function.
>
> Is there some prerequisite before policies will be respected? The .system collection is in place as expected, and I am not seeing anything in the logs on the overseer to suggest any problems.
Re: Solr 7.7.2 - Autoscaling in new cluster ignoring sysprop rules, possibly all rules
I found the issue. Autoscaling seems to silently ignore rules (at least sysprop rules). Example rule:

{'set-policy': {'sales-uat': [{'node': '#ANY',
                               'replica': '<2',
                               'strict': 'false'},
                              {'replica': '#ALL',
                               'strict': 'true',
                               'sysprop.HELM_CHART': 'foo'}]}}

Two cases will get the sysprop rule ignored:

1. No nodes have a HELM_CHART system property defined
2. No nodes have the value "foo" for the HELM_CHART system property

If you have SOME nodes that have -DHELM_CHART=foo, then it will fail if it cannot satisfy another strict rule. So it appears sysprop autoscaling rules cannot be strict on their own.

Hopefully this can solve some issues for other people as well.

____
From: Andrew Kettmann
Sent: Tuesday, June 25, 2019 1:04:21 PM
To: solr-user@lucene.apache.org
Subject: Solr 7.7.2 - Autoscaling in new cluster ignoring sysprop rules, possibly all rules

Using the Docker 7.7.2 image.

Solr 7.7.2 on a new znode on ZK. Created the chroot using solr zk mkroot.

Created a policy:

{'set-policy': {'banana': [{'replica': '#ALL',
                            'sysprop.HELM_CHART': 'notbanana'}]}}

No errors on creation of the policy.

I have no nodes with that value for the system property "HELM_CHART"; I have nodes that contain only "banana" and "rulesos" for that value.

I create the collection with a call to /admin/collections:

{'action': 'CREATE',
 'collection.configName': 'project-solr-7',
 'name': 'banana',
 'numShards': '2',
 'policy': 'banana',
 'replicationFactor': '2'}

and it creates the collection without an error, where what I expected was for the collection creation to fail. This is the behavior I had seen in the past, but after tearing down and recreating the cluster in a higher environment, it does not appear to function.

Is there some prerequisite before policies will be respected? The .system collection is in place as expected, and I am not seeing anything in the logs on the overseer to suggest any problems.
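Given that a strict sysprop rule is silently ignored when no node defines the property, or none has the expected value, a pre-flight check of node sysprops is cheap insurance. A minimal sketch, assuming you have already collected each node's JVM system properties (e.g. from each node's /admin/info/properties endpoint); the helper and node names are illustrative:

```python
def nodes_missing_sysprop(node_props, sysprop, expected):
    """Given {node: {sysprop_name: value}}, return the nodes where the
    sysprop is absent or differs -- the two cases in which a strict
    sysprop rule was observed to be silently ignored."""
    bad = []
    for node, props in node_props.items():
        if props.get(sysprop) != expected:
            bad.append(node)
    return sorted(bad)

# Hypothetical snapshot of three nodes' system properties:
cluster = {
    "node-1:8983_solr": {"HELM_CHART": "foo"},
    "node-2:8983_solr": {"HELM_CHART": "banana"},
    "node-3:8983_solr": {},
}
print(nodes_missing_sysprop(cluster, "HELM_CHART", "foo"))
# -> ['node-2:8983_solr', 'node-3:8983_solr']
```

If the returned list covers every node, the policy would match nothing, which is exactly the situation where the rule gets ignored rather than failing loudly.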
Re: Solr 7.7.2 - Autoscaling in new cluster ignoring sysprop rules, possibly all rules
Is there some step I am missing here? Policies seem to be entirely ignored in this new cluster and I am at a loss. Is there some default setting that will cause autoscaling to be ignored?

____
From: Andrew Kettmann
Sent: Tuesday, June 25, 2019 1:04:21 PM
To: solr-user@lucene.apache.org
Subject: Solr 7.7.2 - Autoscaling in new cluster ignoring sysprop rules, possibly all rules

Using the Docker 7.7.2 image.

Solr 7.7.2 on a new znode on ZK. Created the chroot using solr zk mkroot.

Created a policy:

{'set-policy': {'banana': [{'replica': '#ALL',
                            'sysprop.HELM_CHART': 'notbanana'}]}}

No errors on creation of the policy.

I have no nodes with that value for the system property "HELM_CHART"; I have nodes that contain only "banana" and "rulesos" for that value.

I create the collection with a call to /admin/collections:

{'action': 'CREATE',
 'collection.configName': 'project-solr-7',
 'name': 'banana',
 'numShards': '2',
 'policy': 'banana',
 'replicationFactor': '2'}

and it creates the collection without an error, where what I expected was for the collection creation to fail. This is the behavior I had seen in the past, but after tearing down and recreating the cluster in a higher environment, it does not appear to function.

Is there some prerequisite before policies will be respected? The .system collection is in place as expected, and I am not seeing anything in the logs on the overseer to suggest any problems.
Solr 7.7.2 - Autoscaling in new cluster ignoring sysprop rules, possibly all rules
Using the Docker 7.7.2 image.

Solr 7.7.2 on a new znode on ZK. Created the chroot using solr zk mkroot.

Created a policy:

{'set-policy': {'banana': [{'replica': '#ALL',
                            'sysprop.HELM_CHART': 'notbanana'}]}}

No errors on creation of the policy.

I have no nodes with that value for the system property "HELM_CHART"; I have nodes that contain only "banana" and "rulesos" for that value.

I create the collection with a call to /admin/collections:

{'action': 'CREATE',
 'collection.configName': 'project-solr-7',
 'name': 'banana',
 'numShards': '2',
 'policy': 'banana',
 'replicationFactor': '2'}

and it creates the collection without an error, where what I expected was for the collection creation to fail. This is the behavior I had seen in the past, but after tearing down and recreating the cluster in a higher environment, it does not appear to function.

Is there some prerequisite before policies will be respected? The .system collection is in place as expected, and I am not seeing anything in the logs on the overseer to suggest any problems.

Andrew Kettmann
DevOps Engineer
P: 1.314.596.2836
Solr 7.7.2 - SolrCloud - Autoscale Triggers - indexSize trigger - Failure isn't sending listener a FAILED message, but a SUCCEEDED message
pt. The failure, I understand, because this is an unfixable situation for Solr: it can't both meet my policies in this situation AND execute the trigger. The problem is the listener sending successes each time. Anyone able to shed some light on this?

Working on setting up some automation so that when we split cores, we automatically create new containers for Solr to use and shuffle cores onto, I was testing failure cases and found this issue. Is this just a ticket I need to open in Jira, or is there something I am missing?
Re: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk usage checks
Entered issue: https://issues.apache.org/jira/browse/SOLR-13563 Please let me know if I need to include any other information.

I have to say, props to anyone involved in making the "ant idea" target a thing. It makes it ridiculously easy for someone who can code, but not in Java specifically, to look at and suggest possible fixes to the code. 10/10, would submit a ticket again!

From: Andrzej Białecki
Sent: Wednesday, June 19, 2019 7:07:02 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk usage checks

Hi Andrew,

Please create a JIRA issue and attach this patch, I'll look into fixing this. Thanks!

> On 18 Jun 2019, at 23:19, Andrew Kettmann wrote:
>
> Attached the patch, but that isn't sent out on the mailing list, my mistake. Patch below:
>
> ### START
> diff --git a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
> index 24a52eaf97..e018f8a42f 100644
> --- a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
> +++ b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
> @@ -135,7 +135,9 @@ public class SplitShardCmd implements OverseerCollectionMessageHandler.Cmd {
>      }
>
>      RTimerTree t = timings.sub("checkDiskSpace");
> -    checkDiskSpace(collectionName, slice.get(), parentShardLeader);
> +    if (splitMethod != SolrIndexSplitter.SplitMethod.LINK) {
> +      checkDiskSpace(collectionName, slice.get(), parentShardLeader);
> +    }
>      t.stop();
>
>      // let's record the ephemeralOwner of the parent leader node
> ### END
>
> From: Andrew Kettmann
> Sent: Tuesday, June 18, 2019 3:05:15 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk usage checks
>
> Looks like the disk check here is the problem. I am no Java developer, but this patch ignores the check if you are using the link method for splitting. This is off of the commit for 7.7.2, d4c30fc285. The modified version only has to be run on the overseer machine, so there is that at least.
>
> From: Andrew Kettmann
> Sent: Tuesday, June 18, 2019 11:32:43 AM
> To: solr-user@lucene.apache.org
> Subject: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk usage checks
>
> Using the Solr 7.7.2 Docker image, testing some of the new autoscale features; huge fan so far. Tested with the link method on a 2GB core and found that it took less than 1MB of additional space. Filled the core quite a bit larger, 12GB of a 20GB PVC, and now splitting the shard fails with the following error message on my overseer:
>
> 2019-06-18 16:27:41.754 ERROR (OverseerThreadFactory-49-thread-5-processing-n:10.0.192.74:8983_solr) [c:test_autoscale s:shard1 ] o.a.s.c.a.c.OverseerCollectionMessageHandler Collection: test_autoscale operation: splitshard failed:org.apache.solr.common.SolrException: not enough free disk space to perform index split on node 10.0.193.23:8983_solr, required: 23.35038321465254, available: 7.811378479003906
>     at org.apache.solr.cloud.api.collections.SplitShardCmd.checkDiskSpace(SplitShardCmd.java:567)
>     at org.apache.solr.cloud.api.collections.SplitShardCmd.split(SplitShardCmd.java:138)
>     at org.apache.solr.cloud.api.collections.SplitShardCmd.call(SplitShardCmd.java:94)
>     at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:294)
>     at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:505)
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:834)
>
> I attempted sending the request to the node itself to see if it did anything different, but no luck. My parameters are (note Python formatting, as that is my language of choice):
>
> splitparams = {'action': 'SPLITSHARD',
>                'collection': 'test_autoscale',
>                'shard': 'shard1',
>                'splitMethod': 'link',
>                'timing': 'true',
>                'async': 'shardsplitasync'}
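For planning splits, the figures quoted in this thread (a ~12GB core requiring ~23.35GB free) suggest the rewrite method needs roughly twice the index size, while the link method needs almost nothing. A rough pre-flight estimate under that assumption; note the 2x factor is inferred from the error message, not taken from the Solr source:

```python
def split_disk_needed_gb(index_size_gb, method="rewrite"):
    """Rough estimate of the free space a SPLITSHARD needs.
    Assumption: rewrite needs about 2x the index size (matching the
    'required: 23.35' error for a ~12GB core); link uses hard links
    and was observed to need <1MB extra on a 2GB core."""
    if method == "link":
        return 0.0  # effectively negligible extra space
    return 2.0 * index_size_gb

print(split_disk_needed_gb(12))          # -> 24.0
print(split_disk_needed_gb(12, "link"))  # -> 0.0
```

This makes the reported bug concrete: the overseer applies the rewrite-sized estimate even when splitMethod=link is requested.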
Re: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk usage checks
Attached the patch, but that isn't sent out on the mailing list, my mistake. Patch below:

### START
diff --git a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
index 24a52eaf97..e018f8a42f 100644
--- a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
+++ b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
@@ -135,7 +135,9 @@ public class SplitShardCmd implements OverseerCollectionMessageHandler.Cmd {
     }

     RTimerTree t = timings.sub("checkDiskSpace");
-    checkDiskSpace(collectionName, slice.get(), parentShardLeader);
+    if (splitMethod != SolrIndexSplitter.SplitMethod.LINK) {
+      checkDiskSpace(collectionName, slice.get(), parentShardLeader);
+    }
     t.stop();

     // let's record the ephemeralOwner of the parent leader node
### END

____
From: Andrew Kettmann
Sent: Tuesday, June 18, 2019 3:05:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk usage checks

Looks like the disk check here is the problem. I am no Java developer, but this patch ignores the check if you are using the link method for splitting. Attached the patch. This is off of the commit for 7.7.2, d4c30fc285. The modified version only has to be run on the overseer machine, so there is that at least.

____
From: Andrew Kettmann
Sent: Tuesday, June 18, 2019 11:32:43 AM
To: solr-user@lucene.apache.org
Subject: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk usage checks

Using the Solr 7.7.2 Docker image, testing some of the new autoscale features; huge fan so far. Tested with the link method on a 2GB core and found that it took less than 1MB of additional space.
Filled the core quite a bit larger, 12GB of a 20GB PVC, and now splitting the shard fails with the following error message on my overseer:

2019-06-18 16:27:41.754 ERROR (OverseerThreadFactory-49-thread-5-processing-n:10.0.192.74:8983_solr) [c:test_autoscale s:shard1 ] o.a.s.c.a.c.OverseerCollectionMessageHandler Collection: test_autoscale operation: splitshard failed:org.apache.solr.common.SolrException: not enough free disk space to perform index split on node 10.0.193.23:8983_solr, required: 23.35038321465254, available: 7.811378479003906
    at org.apache.solr.cloud.api.collections.SplitShardCmd.checkDiskSpace(SplitShardCmd.java:567)
    at org.apache.solr.cloud.api.collections.SplitShardCmd.split(SplitShardCmd.java:138)
    at org.apache.solr.cloud.api.collections.SplitShardCmd.call(SplitShardCmd.java:94)
    at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:294)
    at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:505)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

I attempted sending the request to the node itself to see if it did anything different, but no luck.
My parameters are (note Python formatting, as that is my language of choice):

splitparams = {'action': 'SPLITSHARD',
               'collection': 'test_autoscale',
               'shard': 'shard1',
               'splitMethod': 'link',
               'timing': 'true',
               'async': 'shardsplitasync'}

And this is confirmed by the log message from the node itself:

2019-06-18 16:27:41.730 INFO (qtp1107530534-16) [c:test_autoscale ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={async=shardsplitasync&timing=true&action=SPLITSHARD&collection=test_autoscale&shard=shard1&splitMethod=link} status=0 QTime=20

While it is true that I do not have enough space if I were using the rewrite method, the link method on a 2GB core used less than an additional 1MB of space. Is there something I am missing here? Is there an option to disable the disk space check that I need to pass? I can't find anything in the documentation at this point.
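Because the split is submitted with async=shardsplitasync, its outcome has to be fetched separately. A sketch of the REQUESTSTATUS polling side; helper names are illustrative, and the response shape follows the Collections API async status format:

```python
def request_status_params(request_id):
    """Parameters for polling an async Collections API operation."""
    return {"action": "REQUESTSTATUS", "requestid": request_id}

def is_done(status_response):
    """True once the async operation has finished, successfully or not.
    REQUESTSTATUS states include submitted/running/completed/failed."""
    state = status_response.get("status", {}).get("state", "")
    return state in ("completed", "failed")

# Poll /admin/collections with request_status_params("shardsplitasync")
# after submitting the SPLITSHARD; an example completed response body:
resp = {"status": {"state": "completed",
                   "msg": "found [shardsplitasync] in completed tasks"}}
print(is_done(resp))  # -> True
```

Treating "failed" as done matters here: the trigger-listener thread above shows why relying on a SUCCEEDED notification alone can mislead automation.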
Re: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk usage checks
Looks like the disk check here is the problem. I am no Java developer, but this patch ignores the check if you are using the link method for splitting. Patch attached; it is off of the commit for 7.7.2, d4c30fc285. The modified version only has to be run on the overseer machine, so there is that at least.

From: Andrew Kettmann
Sent: Tuesday, June 18, 2019 11:32:43 AM
To: solr-user@lucene.apache.org
Subject: Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk usage checks

diff --git a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
index 24a52eaf97..e018f8a42f 100644
--- a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
+++ b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
@@ -135,7 +135,9 @@ public class SplitShardCmd implements OverseerCollectionMessageHandler.Cmd {
     }
     RTimerTree t = timings.sub("checkDiskSpace");
-    checkDiskSpace(collectionName, slice.get(), parentShardLeader);
+    if (splitMethod != SolrIndexSplitter.SplitMethod.LINK) {
+      checkDiskSpace(collectionName, slice.get(), parentShardLeader);
+    }
     t.stop();
     // let's record t
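For context, the check this patch bypasses appears to require roughly twice the parent index size free on the target node, which matches the numbers in the error (required 23.35 GB for a roughly 11.7 GB index, with only 7.8 GB available). A minimal sketch of that logic, with the 2x factor inferred from those numbers rather than taken from Solr's source:

```python
def check_disk_space(index_size_gb: float, free_gb: float) -> bool:
    """Sketch of the overseer's pre-split check: demand ~2x the parent
    index size free on the target node (factor inferred from the error
    message above, not from Solr's actual code)."""
    required = 2.0 * index_size_gb
    return free_gb >= required

# Numbers from the failing split: ~11.7 GB index, ~7.8 GB free.
print(check_disk_space(11.675, 7.811))  # split is refused
```

The link method hard-links segment files instead of rewriting them, so its real footprint is tiny; that is why the author's 2 GB test split consumed under 1 MB despite this check.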
Solr 7.7.2 - SolrCloud - SPLITSHARD - Using LINK method fails on disk usage checks
Using the Solr 7.7.2 Docker image, testing some of the new autoscale features; huge fan so far. Tested the link method on a 2GB core and found that it took less than 1MB of additional space. Filled the core quite a bit larger, 12GB of a 20GB PVC, and now splitting the shard fails with the following error message on my overseer:

2019-06-18 16:27:41.754 ERROR (OverseerThreadFactory-49-thread-5-processing-n:10.0.192.74:8983_solr) [c:test_autoscale s:shard1 ] o.a.s.c.a.c.OverseerCollectionMessageHandler Collection: test_autoscale operation: splitshard failed:org.apache.solr.common.SolrException: not enough free disk space to perform index split on node 10.0.193.23:8983_solr, required: 23.35038321465254, available: 7.811378479003906
    at org.apache.solr.cloud.api.collections.SplitShardCmd.checkDiskSpace(SplitShardCmd.java:567)
    at org.apache.solr.cloud.api.collections.SplitShardCmd.split(SplitShardCmd.java:138)
    at org.apache.solr.cloud.api.collections.SplitShardCmd.call(SplitShardCmd.java:94)
    at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:294)
    at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:505)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

I attempted sending the request to the node itself to see if it did anything different, but no luck.

My parameters are (note: Python formatting, as that is my language of choice):

splitparams = {'action': 'SPLITSHARD',
               'collection': 'test_autoscale',
               'shard': 'shard1',
               'splitMethod': 'link',
               'timing': 'true',
               'async': 'shardsplitasync'}

And this is confirmed by the log message from the node itself:

2019-06-18 16:27:41.730 INFO (qtp1107530534-16) [c:test_autoscale ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={async=shardsplitasync&timing=true&action=SPLITSHARD&collection=test_autoscale&shard=shard1&splitMethod=link} status=0 QTime=20

While it is true that I would not have enough space if I were using the rewrite method, the link method on a 2GB core used less than 1MB of additional space. Is there something I am missing here? Is there an option to disable the disk space check that I need to pass? I can't find anything in the documentation at this point.

Andrew Kettmann
DevOps Engineer
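For reference, the splitparams dict above serializes to the query string that shows up in the node's log line. A standard-library sketch of building that request URL (the localhost host/port is a placeholder for whichever node you send the request to):

```python
from urllib.parse import urlencode

splitparams = {
    'action': 'SPLITSHARD',
    'collection': 'test_autoscale',
    'shard': 'shard1',
    'splitMethod': 'link',
    'timing': 'true',
    'async': 'shardsplitasync',
}

# Serialize the dict exactly as an HTTP client would.
query = urlencode(splitparams)
url = 'http://localhost:8983/solr/admin/collections?' + query  # placeholder host
print(query)
```

Because the call is async, the result must be fetched separately with a REQUESTSTATUS call against the same Collections API endpoint, using the 'shardsplitasync' request id.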