[
https://issues.apache.org/jira/browse/SOLR-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527025#comment-14527025
]
Sachin Goyal commented on SOLR-7121:
------------------------------------
Thanks for the patch file, [[email protected]]! In the future I will attach a
patch file whenever I update the pull request.
Please see my comments below:
\\
{quote}I think we want to look at making these new tests much faster.{quote}
Please let me know how long the newly added tests take to run for you.
I think the new tests use the existing SolrCloud test infrastructure, so they
will probably need some time to set up and shut down ZK, the cloud, etc.,
unless we are happy with unit tests instead of functional tests. If you have
any ideas for speeding up the particular tests added in this ticket, I will be
happy to improve them.
\\
\\
{quote}The test suite with this patch doesn't yet fully pass for me
either.{quote}
Can you please run those failing tests without the patch and let me know if
they are still failing?
The build seems to be passing at my end.
\\
\\
{quote}What is the motivation behind the core regex matching and multiple
config entries? Do you really need to configure different healthcheck
thresholds per core in a collection?{quote}
At a minimum, we may want to configure the cores differently for different
collections.
The regular expression approach allows a single configuration file to cover
both collections serving millions of documents on more powerful machines and
collections serving a couple of thousand small documents on less powerful
machines.
Without the regular expression, one would need a separate configuration file
for each collection, which is a pain to manage.
So basically, the regular expressions help define different thresholds for Solr
running on heterogeneous hardware.
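To illustrate the idea, here is a minimal sketch of a regex-keyed threshold lookup. It is not the code from the patch: the class names (HealthThresholds, ThresholdConfig), the two threshold fields, and the first-match-wins rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from the SOLR-7121 patch.
final class HealthThresholds {
    final int maxThreads;          // assumed threshold: live thread count
    final long maxGcPauseMillis;   // assumed threshold: GC pause length

    HealthThresholds(int maxThreads, long maxGcPauseMillis) {
        this.maxThreads = maxThreads;
        this.maxGcPauseMillis = maxGcPauseMillis;
    }
}

final class ThresholdConfig {
    // One config entry: a core-name pattern and the thresholds it selects.
    private static final class Entry {
        final Pattern corePattern;
        final HealthThresholds thresholds;

        Entry(Pattern corePattern, HealthThresholds thresholds) {
            this.corePattern = corePattern;
            this.thresholds = thresholds;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final HealthThresholds defaults;

    ThresholdConfig(HealthThresholds defaults) {
        this.defaults = defaults;
    }

    // Entries are checked in insertion order (assumption: first match wins).
    ThresholdConfig add(String coreRegex, HealthThresholds thresholds) {
        entries.add(new Entry(Pattern.compile(coreRegex), thresholds));
        return this;
    }

    // Returns the thresholds of the first pattern that matches the whole
    // core name, or the defaults when no pattern matches.
    HealthThresholds forCore(String coreName) {
        for (Entry e : entries) {
            if (e.corePattern.matcher(coreName).matches()) {
                return e.thresholds;
            }
        }
        return defaults;
    }
}
```

With entries such as "bigdata_.*" and "small_.*" (made-up collection prefixes), one file can serve both kinds of hardware, and any core that matches no entry falls back to the defaults.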
\\
\\
{quote}We also want to make it clear this functionality only works with
SolrCloud and think about how that should best be expressed in the code - this
bleeds a bit of SolrCloud specific code out of ZkController and into SolrCore
in a way we have not really done yet I think.{quote}
I agree to some extent. However, please note that all the new code is protected
with *cc.isZooKeeperAware()* and it should not affect non-cloud-aware code.
If you have more specific thoughts on improving this, I would be happy to
refactor the current patch.
\\
\\
{quote}What if we are the leader and publish a down state due to overload?
Shouldn't we also give up our leader position?{quote}
I am a little confused on this one.
Wouldn't a down state trigger re-election? If not, it should probably be fixed
elsewhere by asking non-leaders to start the election process.
In any case, note that this code will be reached only when the leader is near
exhaustion.
Without this code, the leader would have tipped over completely and would have
needed a restart.
So this code helps the leader node avoid a crash and become available again
later.
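As a rough sketch of the behaviour I have in mind (go down proactively past a threshold, come back up when the load improves), consider the following. The class name, the single integer "load" metric, and the separate lower recovery threshold are assumptions for illustration only and are not taken from the patch.

```java
// Hypothetical sketch; not code from the SOLR-7121 patch.
final class CoreHealthMonitor {
    enum State { ACTIVE, DOWN }

    private final int downAbove; // go down when the load exceeds this value
    private final int upBelow;   // come back up when the load drops below this value
    private State state = State.ACTIVE;

    CoreHealthMonitor(int downAbove, int upBelow) {
        if (upBelow > downAbove) {
            throw new IllegalArgumentException("upBelow must not exceed downAbove");
        }
        this.downAbove = downAbove;
        this.upBelow = upBelow;
    }

    // Feeds one load sample (for example a thread count) and returns the
    // resulting state. The gap between the two thresholds keeps the core
    // from flapping when the load hovers around a single limit.
    State update(int load) {
        if (state == State.ACTIVE && load > downAbove) {
            state = State.DOWN;
        } else if (state == State.DOWN && load < upBelow) {
            state = State.ACTIVE;
        }
        return state;
    }
}
```

In this sketch a core that went down at a load of 120 (limit 100) stays down at 90 and only returns to active once the load falls below 80, without needing a restart.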
> Solr nodes should go down based on configurable thresholds and not rely on
> resource exhaustion
> ----------------------------------------------------------------------------------------------
>
> Key: SOLR-7121
> URL: https://issues.apache.org/jira/browse/SOLR-7121
> Project: Solr
> Issue Type: New Feature
> Reporter: Sachin Goyal
> Attachments: SOLR-7121.patch, SOLR-7121.patch, SOLR-7121.patch,
> SOLR-7121.patch, SOLR-7121.patch, SOLR-7121.patch, SOLR-7121.patch
>
>
> Currently, there is no way to control when a Solr node goes down.
> If the server is having high GC pauses or too many threads or is just getting
> too many queries due to some bad load-balancer, the cores in the machine keep
> on serving unless they exhaust the machine's resources and everything comes
> to a stall.
> Such a slow-dying core can affect other cores as well by taking huge time to
> serve their distributed queries.
> There should be a way to specify some threshold values beyond which the
> targeted core can detect its ill-health and proactively go down to recover.
> When the load improves, the core should come up automatically.