bq: As a follow up, the default is set to "NRTCachingDirectoryFactory"
for DirectoryFactory but not MMapDirectory. It is mentioned that
NRTCachingDirectoryFactory "caches small files in memory for better
NRT performance".

NRTCachingDirectoryFactory also uses MMapDirectory under the covers, in
addition to "caching small files in memory....", so you really can't
separate the two.
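
For reference, the stock solrconfig.xml entry looks roughly like this (the
exact system-property default can vary a bit between Solr versions):

<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>

So unless you override solr.directoryFactory, you get
NRTCachingDirectoryFactory wrapping the MMap-backed directory that
FSDirectory.open() picks on 64-bit systems.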

I didn't mention this explicitly, but your original problem should
_not_ be happening in a well-tuned
system. Why your nodes go into a down state needs to be understood.
The connection timeout is
the only clue so far, and the usual reason here is that very long GC
pauses are happening. If this
continually happens, you might try turning on GC reporting options.
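
If you do, for a Java 7/8 JVM the usual flags are something like the
following (the log path is just a placeholder; put these in CATALINA_OPTS
or wherever you set your JVM options):

-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/path/to/solr_gc.log

Long pauses then show up as large "Total time for which application
threads were stopped" entries in the log.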

Best,
Erick


On Mon, Aug 24, 2015 at 2:47 AM, Rallavagu <rallav...@gmail.com> wrote:
> As a follow up, the default is set to "NRTCachingDirectoryFactory" for
> DirectoryFactory but not MMapDirectory. It is mentioned that
> NRTCachingDirectoryFactory "caches small files in memory for better NRT
> performance".
>
> Wondering if this would also consume physical memory to the same extent
> as the MMap directory. Thoughts?
>
> On 8/18/15 9:29 AM, Erick Erickson wrote:
>>
>> Couple of things:
>>
>> 1> Here's an excellent backgrounder for MMapDirectory, which is
>> what makes it appear that Solr is consuming all the physical memory....
>>
>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>
>> 2> It's possible that your transaction log was huge. Perhaps not likely,
>> but possible. If Solr terminates abnormally (kill -9 is a prime way to do
>> this), then upon restart the transaction log is replayed. This log is
>> rolled over upon every hard commit (openSearcher true or false doesn't
>> matter). So, in the scenario where you are indexing a whole lot of stuff
>> without committing, it can take a very long time to replay the log. Not
>> only that, but while the log is being replayed, any incoming updates are
>> written to the end of the tlog. That said, nothing in your e-mails
>> indicates this is the problem, and it's frankly not consistent with the
>> errors you _do_ report, but I thought I'd mention it.
>> See:
>> https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>> You can avoid the possibility of this by configuring your autoCommit
>> interval to be relatively short (say 60 seconds) with
>> openSearcher=false....
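>>
>> For example, something along these lines in the <updateHandler> section
>> of solrconfig.xml (60000 ms = 60 seconds):
>>
>>   <autoCommit>
>>     <maxTime>60000</maxTime>
>>     <openSearcher>false</openSearcher>
>>   </autoCommit>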
>>
>> 3> ConcurrentUpdateSolrServer isn't the best thing for bulk loading
>> SolrCloud; CloudSolrServer (renamed CloudSolrClient in 5.x) is better.
>> CUSS sends all the docs to some node, and from there that node figures
>> out which shard each doc belongs on and forwards the doc (actually in
>> batches) to the appropriate leader. So doing what you're doing creates a
>> lot of cross chatter amongst nodes. CloudSolrServer/Client figures that
>> out on the client side and only sends each leader packets consisting of
>> the docs that belong on that shard. You can get nearly linear throughput
>> with increasing numbers of shards this way.
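>>
>> A rough sketch with SolrJ 5.x (the ZooKeeper ensemble, collection name,
>> and batch size below are just placeholders, adjust for your setup):
>>
>> import java.util.ArrayList;
>> import java.util.List;
>> import org.apache.solr.client.solrj.impl.CloudSolrClient;
>> import org.apache.solr.common.SolrInputDocument;
>>
>> public class BulkLoad {
>>   public static void main(String[] args) throws Exception {
>>     // ZooKeeper ensemble and collection name are placeholders
>>     try (CloudSolrClient client =
>>              new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
>>       client.setDefaultCollection("collection1");
>>       List<SolrInputDocument> batch = new ArrayList<>();
>>       for (int i = 0; i < 100000; i++) {
>>         SolrInputDocument doc = new SolrInputDocument();
>>         doc.addField("id", Integer.toString(i));
>>         batch.add(doc);
>>         if (batch.size() == 1000) {    // send reasonably sized batches
>>           client.add(batch);           // routed to the right leaders client-side
>>           batch.clear();
>>         }
>>       }
>>       if (!batch.isEmpty()) {
>>         client.add(batch);
>>       }
>>       client.commit();                 // or rely on autoCommit
>>     }
>>   }
>> }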
>>
>> Best,
>> Erick
>>
>> On Tue, Aug 18, 2015 at 9:03 AM, Rallavagu <rallav...@gmail.com> wrote:
>>>
>>> Thanks Shawn.
>>>
>>> All participating cloud nodes are running Tomcat and, as you suggested,
>>> I will review the number of threads and increase them as needed.
>>>
>>> Essentially, what I have noticed was that two of four nodes caught up
>>> with "bulk" updates instantly while the other two nodes took almost 3
>>> hours to be completely in sync with the "leader". I "tickled" the other
>>> nodes by sending an update, thinking that it would initiate replication,
>>> but I am not sure if that is what caused the other two nodes to
>>> eventually catch up.
>>>
>>> On a similar note, I was using "ConcurrentUpdateSolrServer" pointed
>>> directly at the leader to bulk load SolrCloud. I have configured the
>>> chunk size and thread count for it. Is this the right practice for bulk
>>> loading SolrCloud?
>>>
>>> Also, the maximum number of connections per host parameter for
>>> "HttpShardHandler" is set in solrconfig.xml, I suppose?
>>>
>>> Thanks
>>>
>>>
>>>
>>> On 8/18/15 8:28 AM, Shawn Heisey wrote:
>>>>
>>>>
>>>> On 8/18/2015 8:18 AM, Rallavagu wrote:
>>>>>
>>>>>
>>>>> Thanks for the response. Does this cache behavior influence the delay
>>>>> in catching up with the cloud? How can we explain SolrCloud
>>>>> replication, and what are the options to monitor it and take proactive
>>>>> action (such as initializing, pausing, etc.) if needed?
>>>>
>>>>
>>>>
>>>> I don't know enough about your setup to speculate.
>>>>
>>>> I did notice this exception in a previous reply:
>>>>
>>>> org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for
>>>> connection from pool
>>>>
>>>> I can think of two things that would cause this.
>>>>
>>>> One cause is that your servlet container is limiting the number of
>>>> available threads.  A typical jetty or tomcat default for maxThreads is
>>>> 200, which can easily be exceeded by a small Solr install, especially if
>>>> it's SolrCloud.  The jetty included with Solr sets maxThreads to 10000,
>>>> which is effectively unlimited except for extremely large installs.  If
>>>> you are providing your own container, this will almost certainly need to
>>>> be raised.
>>>>
>>>> The other cause is that your install is extremely busy and you have run
>>>> out of available HttpClient connections.  The solution in this case is
>>>> to increase the maximum number of connections per host in the
>>>> HttpShardHandler config, which defaults to 20.
>>>>
>>>>
>>>>
>>>> https://wiki.apache.org/solr/SolrConfigXml#Configuration_of_Shard_Handlers_for_Distributed_searches
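>>>>
>>>> The syntax is roughly along these lines (the value here is just an
>>>> example; you can also configure a standalone <shardHandlerFactory>
>>>> element):
>>>>
>>>>   <requestHandler name="/select" class="solr.SearchHandler">
>>>>     <shardHandlerFactory class="HttpShardHandlerFactory">
>>>>       <int name="maxConnectionsPerHost">100</int>
>>>>     </shardHandlerFactory>
>>>>   </requestHandler>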
>>>>
>>>> There might be other causes for that exception, but I think those are
>>>> the most common causes.  Depending on how things are set up, you might
>>>> have problems with both.
>>>>
>>>> Thanks,
>>>> Shawn
>>>>
>>>
>
