[jira] [Commented] (SOLR-7121) Solr nodes should go down based on configurable thresholds and not rely on resource exhaustion

Sachin Goyal (JIRA) Mon, 04 May 2015 11:38:36 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527025#comment-14527025
 ]


Sachin Goyal commented on SOLR-7121:
------------------------------------

Thanks for the patch file [[email protected]]! In the future I will attach a 
patch file whenever I update the pull request.
Please see my comments below:

\\
{quote}I think we want to look at making these new tests much faster.{quote}
Please let me know how long the newly added tests take to run for you.
I think the new tests use the existing SolrCloud infrastructure and will 
probably need a little time to set up and shut down ZK, the cloud, etc., unless 
we are happy with unit tests instead of functional tests. If you have ideas for 
speeding up the particular tests added in this ticket, I will be happy to apply 
them.

\\
\\
{quote}The test suite with this patch doesn't yet fully pass for me 
either.{quote}
Can you please run those failing tests without the patch and let me know if 
they are still failing?
The build seems to be passing at my end.

\\
\\
{quote}What is the motivation behind the core regex matching and multiple 
config entries? Do you really need to configure different healthcheck 
thresholds per core in a collection?{quote}
At a minimum, we may want to configure cores differently for different 
collections.
The regular expression approach lets a single configuration file cover both 
collections serving millions of documents on more powerful machines and 
collections serving a couple of thousand small documents on less powerful 
machines.
Without regular expressions, one would need a separate configuration file for 
each collection, which is somewhat of a pain to manage.
In short, the regular expressions help define different thresholds for Solr 
running on heterogeneous hardware.
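
As a rough illustration of the idea (this is not the code in the patch; the 
class name, the threshold fields and the first-match-wins rule are all 
assumptions made for this sketch), regex-keyed thresholds could be resolved 
per core like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from the SOLR-7121 patch.
final class CoreThresholds {

    /** Illustrative threshold values for one group of cores. */
    static final class Limits {
        public final int maxThreads;
        public final long maxGcPauseMillis;

        public Limits(int maxThreads, long maxGcPauseMillis) {
            this.maxThreads = maxThreads;
            this.maxGcPauseMillis = maxGcPauseMillis;
        }
    }

    /** One config entry: a core-name regex plus the limits it selects. */
    private static final class Rule {
        final Pattern corePattern;
        final Limits limits;

        Rule(Pattern corePattern, Limits limits) {
            this.corePattern = corePattern;
            this.limits = limits;
        }
    }

    // Rules are checked in the order they were added (assumed: first match wins).
    private final List<Rule> rules = new ArrayList<>();

    public CoreThresholds addRule(String coreNameRegex, Limits limits) {
        rules.add(new Rule(Pattern.compile(coreNameRegex), limits));
        return this;
    }

    /** Returns the limits for a core, or null if no rule matches its name. */
    public Limits limitsFor(String coreName) {
        for (Rule rule : rules) {
            // matches() requires the regex to cover the whole core name.
            if (rule.corePattern.matcher(coreName).matches()) {
                return rule.limits;
            }
        }
        return null;
    }
}
```

With one rule such as `big_.*` and another such as `small_.*`, a single file 
can carry generous limits for the large collections and tighter ones for the 
small collections, and a core matching neither rule would simply get no health 
check in this sketch.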

\\
\\
{quote}We also want to make it clear this functionality only works with 
SolrCloud and think about how that should best be expressed in the code - this 
bleeds a bit of SolrCloud specific code out of ZkController and into SolrCore 
in a way we have not really done yet I think.{quote}
I agree to some extent. However, please note that all the new code is guarded 
by *cc.isZooKeeperAware()*, so it should not affect non-cloud code.
If you have more specific thoughts on improving this, I would be happy to 
refactor the current patch.

\\
\\
{quote}What if we are the leader and publish a down state due to overload? 
Shouldn't we also give up our leader position?{quote}
I am a little confused by this one.
Wouldn't a down state trigger a re-election? If not, that should probably be 
fixed elsewhere, by having the non-leaders start the election process.
In any case, note that this code is reached only when the leader is near 
exhaustion.
Without this code, the leader would have tipped over completely and needed a 
restart.
So this code helps the leader node avoid a crash and become available again 
later.

> Solr nodes should go down based on configurable thresholds and not rely on 
> resource exhaustion
> ----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7121
>                 URL: https://issues.apache.org/jira/browse/SOLR-7121
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Sachin Goyal
>         Attachments: SOLR-7121.patch, SOLR-7121.patch, SOLR-7121.patch, 
> SOLR-7121.patch, SOLR-7121.patch, SOLR-7121.patch, SOLR-7121.patch
>
>
> Currently, there is no way to control when a Solr node goes down.
> If the server is having high GC pauses or too many threads or is just getting 
> too many queries due to some bad load-balancer, the cores in the machine keep 
> on serving until they exhaust the machine's resources and everything comes 
> to a stall.
> Such a slow-dying core can affect other cores as well by taking a very long 
> time to serve their distributed queries.
> There should be a way to specify some threshold values beyond which the 
> targeted core can detect its ill-health and proactively go down to recover.
> When the load improves, the core should come up automatically.
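
The behaviour described above (go down past a threshold, come back up when the 
load improves) can be pictured as a small two-state gate. This is only a 
sketch under stated assumptions: the class name is hypothetical, and the 
separate, lower recovery threshold is an assumption added here to keep a core 
from flapping around a single value; the issue text itself only speaks of 
threshold values.

```java
// Hypothetical sketch: not taken from the SOLR-7121 patch.
final class HealthGate {
    private final double downThreshold; // go down when a sample exceeds this
    private final double upThreshold;   // assumed: recover at or below this
    private boolean down = false;

    public HealthGate(double downThreshold, double upThreshold) {
        if (upThreshold > downThreshold) {
            throw new IllegalArgumentException(
                "upThreshold must not exceed downThreshold");
        }
        this.downThreshold = downThreshold;
        this.upThreshold = upThreshold;
    }

    /**
     * Feeds one metric sample (for example a thread count) and returns
     * true if the core should currently be marked down.
     */
    public boolean update(double sample) {
        if (!down && sample > downThreshold) {
            down = true;   // overloaded: proactively go down
        } else if (down && sample <= upThreshold) {
            down = false;  // load has improved: come back up
        }
        return down;
    }
}
```

For example, with `new HealthGate(100, 80)` a sample of 120 marks the core 
down, a later sample of 90 leaves it down, and a sample of 80 brings it back 
up.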



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
