Shalin, we're seeing that issue too (and actually actively debugging it
these days). So far I can confirm the following (on a 2-node cluster):

1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
2) It does not reproduce when SSL is disabled
3) After restarting the Solr process (sometimes both nodes need to be
restarted), the CLOSE_WAIT count drops to 0, but if indexing continues, it
climbs again

When it does happen, Solr seems stuck. The leader cannot talk to the
replica, or vice versa; the replica is usually put in DOWN state, and
there's no way to fix it besides restarting the JVM.

Reviewing the changes from 5.4.1 to 5.5.1, I tried reverting some that
looked suspicious (SOLR-8451 and SOLR-8578), even though the changes look
legit. That did not help, and honestly I did that before we suspected SSL
might be involved. Therefore I think those are "safe", but just FYI.

When it does happen, the number of CLOSE_WAITs climbs very high, on the
order of 30K+ entries in 'netstat'.
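
For anyone who wants to track the same number over time, here is a minimal
sketch that tallies TCP connection states from 'netstat -ant'-style text. It
assumes the usual column layout where the state is the sixth column; the
function name and the sample addresses are made up for illustration and are
not taken from our setup.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state in `netstat -ant`-style output.

    Assumes the state is the sixth whitespace-separated column, as in the
    common Linux and BSD layouts.
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Only rows whose protocol column is tcp, tcp4 or tcp6 carry a state.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Hypothetical sample; real input would be the captured output of netstat.
sample = """Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 10.0.0.1:8983 10.0.0.2:51234 CLOSE_WAIT
tcp6 1 0 10.0.0.1:8983 10.0.0.2:51235 CLOSE_WAIT
tcp 0 0 0.0.0.0:8983 0.0.0.0:* LISTEN
"""
close_waits = count_tcp_states(sample)["CLOSE_WAIT"]
```

Feeding it the captured netstat output every few seconds and logging the
CLOSE_WAIT entry should be enough to compare runs across versions.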

When I say it does not reproduce on 5.4.1, I really mean the numbers don't
go as high as they do in 5.5.1. That is, when running without SSL, the
number of CLOSE_WAITs is smallish, usually fewer than 10 (I would
separately like to understand why we have any in that state at all). When
running with SSL on 5.4.1, they stay low, on the order of hundreds at most.

Unfortunately, running without SSL is not an option for us. We will likely
roll back to 5.4.1, even though the problem exists there too, to a lesser
degree.

I will post back here when/if we have more info about this.

Shai

On Thu, Jul 7, 2016 at 5:32 PM Shalin Shekhar Mangar <shalinman...@gmail.com>
wrote:

> I have myself seen this CLOSE_WAIT issue at a customer. I am running some
> tests with different versions trying to pinpoint the cause of this leak.
> Once I have some more information and a reproducible test, I'll open a jira
> issue. I'll keep you posted.
>
> On Thu, Jul 7, 2016 at 5:13 PM, Mads Tomasgård Bjørgan <m...@dips.no>
> wrote:
>
> > Hello there,
> > Our SolrCloud is experiencing an FD leak while running with SSL. This is
> > occurring on the one machine that our program is sending data to. We have
> > a total of three servers running as an ensemble.
> >
> > While running without SSL, the FD count remains quite constant at
> > around 180 while indexing. Performing a garbage collection also clears
> > almost the entire JVM memory.
> >
> > However, when indexing with SSL, the FD count grows polynomially. The
> > count increases by a few hundred every five seconds or so, and easily
> > reaches 50 000 within three to four minutes. Performing a GC frees most
> > of the memory on the two machines our program isn't transmitting the data
> > directly to. The last machine is unaffected by the GC, and neither memory
> > nor the FD count resets before Solr is restarted on that machine.
> >
> > Running netstat reveals that the open FDs mostly consist of
> > TCP connections in the "CLOSE_WAIT" state.
> >
> >
> >
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
