Cool, useful info.

As soon as I can duplicate the issue I'll work out what we need to do 
differently for this case.

- Mark

On Mar 7, 2013, at 10:19 AM, Brett Hoerner <br...@bretthoerner.com> wrote:

> As an update to this, I did my SolrCloud dance and split it into 2 JVMs per
> machine (2 machines still, the same ones) and spread the load around. Each
> Solr instance now has 16 total shards (master for 8, replica for 8).
> 
> *drum roll* ... I can repeatedly run my delete script and nothing breaks. :)
> 
> 
> On Thu, Mar 7, 2013 at 11:03 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
> 
>> Here is the other server when it's locked:
>> https://gist.github.com/3529b7b6415756ead413
>> 
>> To be clear, neither is really "the replica": I have 32 shards, and each
>> physical server is the leader for 16 and the replica for the other 16.
>> 
>> Also, related to the max-threads hunch: my working cluster has many, many
>> fewer shards per Solr instance. I'm going to do some migration dancing on
>> this cluster today to have more Solr JVMs, each with fewer cores, and see
>> how it affects the deletes.
>> 
>> 
>> On Wed, Mar 6, 2013 at 5:40 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> 
>>> Any chance you can grab the stack trace of a replica as well? (also when
>>> it's locked up of course).
>>> 
>>> - Mark
>>> 
>>> On Mar 6, 2013, at 3:34 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
>>> 
>>>> If there's anything I can try, let me know. Interestingly, I think I have
>>>> noticed that if I stop my indexer, do my delete, and restart the indexer,
>>>> then I'm fine. Which goes along with the update thread contention theory.
>>>> 
>>>> 
>>>> On Wed, Mar 6, 2013 at 5:03 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>> 
>>>>> This is what I see:
>>>>> 
>>>>> We currently limit the number of outstanding update requests at one time
>>>>> to avoid a crazy number of threads being used.
>>>>> 
>>>>> It looks like a bunch of update requests are stuck in socket reads and are
>>>>> taking up the available threads. It looks like the deletes are hanging out
>>>>> waiting for a free thread.
>>>>> 
>>>>> It seems the question is: why are the requests stuck in socket reads? I
>>>>> don't have an answer at the moment.
>>>>> 
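>>>>> To see which threads those are in a dump, something like this works
>>>>> (assuming one Solr JVM per box running under Jetty's start.jar; the frame
>>>>> name is what blocking socket reads show up as in a jstack dump):
>>>>> 
>>>>>   jstack $(pgrep -f start.jar) | grep -B 5 "SocketInputStream.socketRead"
>>>>> 
>>>>> Threads parked in that frame are the ones tying up the available update
>>>>> threads.
>>>>> 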
>>>>> We should probably get this into a JIRA issue though.
>>>>> 
>>>>> - Mark
>>>>> 
>>>>> 
>>>>> On Mar 6, 2013, at 2:15 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>>> 
>>>>>> It does not look like a deadlock, though it could be a distributed one. Or
>>>>>> it could be a livelock, though that's less likely.
>>>>>> 
>>>>>> Here is what we used to recommend in similar situations for large Java
>>>>>> systems (BEA WebLogic):
>>>>>> 1) Do a thread dump of both systems before anything, as simultaneously as
>>>>>> you can make it.
>>>>>> 2) Do the first delete. Do a thread dump every 2 minutes on both servers
>>>>>> (so, say, 3 dumps in that 5-minute wait).
>>>>>> 3) Do the second delete and do thread dumps every 30 seconds on both
>>>>>> servers, from just before and then during. Preferably all the way until the
>>>>>> problem shows itself; every 5 seconds if the problem shows itself really
>>>>>> quickly. A simple shell loop (sketched below) makes the periodic dumps easy.
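>>>>>> 
>>>>>> Something along these lines, assuming Solr runs under Jetty's start.jar
>>>>>> (adjust the pid lookup and the interval to taste):
>>>>>> 
>>>>>>   while true; do
>>>>>>     jstack $(pgrep -f start.jar) > dump-$(hostname)-$(date +%H%M%S).txt
>>>>>>     sleep 30
>>>>>>   done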
>>>>>> 
>>>>>> That gives you a LOT of thread dumps. But it also gives you something that
>>>>>> allows you to compare thread state before and after the problem starts
>>>>>> showing itself, and to identify moving (or unnaturally still) threads. I
>>>>>> even wrote a tool a long time ago that parsed those thread dumps
>>>>>> automatically and generated pretty deadlock graphs of them.
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Alex.
>>>>>> 
>>>>>> Personal blog: http://blog.outerthoughts.com/
>>>>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>>>>> - Time is the quality of nature that keeps events from happening all at
>>>>>> once. Lately, it doesn't seem to be working. (Anonymous, via GTD book)
>>>>>> 
>>>>>> 
>>>>>> On Wed, Mar 6, 2013 at 5:04 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>> 
>>>>>>> Thanks Brett, good stuff (though not a good problem).
>>>>>>> 
>>>>>>> We definitely need to look into this.
>>>>>>> 
>>>>>>> - Mark
>>>>>>> 
>>>>>>> On Mar 6, 2013, at 1:53 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
>>>>>>> 
>>>>>>>> Here is a dump after the delete, indexing has been stopped:
>>>>>>>> https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
>>>>>>>> 
>>>>>>>> An interesting hint that I forgot to mention: it doesn't always happen on
>>>>>>>> the first delete. I manually ran the delete cron, and the server continued
>>>>>>>> to work. I waited about 5 minutes and ran it again, and it stalled the
>>>>>>>> indexer (as seen from the indexer process): http://i.imgur.com/1Tt35u0.png
>>>>>>>> 
>>>>>>>> Another thing I forgot to mention: to bring the cluster back to life, I:
>>>>>>>> 
>>>>>>>> 1) stop my indexer
>>>>>>>> 2) stop server1, start server1
>>>>>>>> 3) stop server2, start server2
>>>>>>>> 4) manually rebalance half of the shards to be mastered on server2
>>>>>>>> (unload/create on server1)
>>>>>>>> 5) restart indexer
>>>>>>>> 
>>>>>>>> And it works again until a delete eventually kills it.
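>>>>>>>> 
>>>>>>>> Step 4 is the CoreAdmin API, roughly like this (core and collection names
>>>>>>>> here are made up for illustration):
>>>>>>>> 
>>>>>>>>   curl "http://server1:8983/solr/admin/cores?action=UNLOAD&core=mycoll_shard1_replica1"
>>>>>>>>   curl "http://server2:8983/solr/admin/cores?action=CREATE&name=mycoll_shard1_replica1&collection=mycoll&shard=shard1"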
>>>>>>>> 
>>>>>>>> To be clear again, select queries continue to work indefinitely.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Brett
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Which version of Solr?
>>>>>>>>> 
>>>>>>>>> Can you use jconsole, visualvm, or jstack to get some stack traces and
>>>>>>>>> see where things are halting?
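>>>>>>>>> 
>>>>>>>>> With jstack it's just one dump per run, so repeat it while things are
>>>>>>>>> hung, e.g.:
>>>>>>>>> 
>>>>>>>>>   jstack <solr-pid> > solr-threads-$(date +%H%M%S).txt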
>>>>>>>>> 
>>>>>>>>> - Mark
>>>>>>>>> 
>>>>>>>>> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
>>>>>>>>> 
>>>>>>>>>> I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards,
>>>>>>>>>> replication factor of 2) that I've been using for over a month now in
>>>>>>>>>> production.
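>>>>>>>>>> 
>>>>>>>>>> For reference, a layout like that can be created with something along
>>>>>>>>>> these lines (the collection name is made up):
>>>>>>>>>> 
>>>>>>>>>>   curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=messages&numShards=32&replicationFactor=2&maxShardsPerNode=32"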
>>>>>>>>>> 
>>>>>>>>>> Suddenly, the hourly cron I run that dispatches a delete by query
>>>>>>>>>> completely halts all indexing. Select queries still run (and quickly),
>>>>>>>>>> there is no CPU or disk I/O happening, but suddenly my indexer (which
>>>>>>>>>> runs at ~400 docs/sec steady) pauses, and everything blocks indefinitely.
>>>>>>>>>> 
>>>>>>>>>> To clarify the schema a bit: this is a moving window of data (imagine
>>>>>>>>>> messages that don't matter after a 24-hour period) which is regularly
>>>>>>>>>> "chopped" off by my hourly cron (deleting messages over 24 hours old) to
>>>>>>>>>> keep the index size reasonable.
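>>>>>>>>>> 
>>>>>>>>>> The delete itself is just a delete-by-query against the update handler,
>>>>>>>>>> roughly like this (field and collection names made up for illustration):
>>>>>>>>>> 
>>>>>>>>>>   curl "http://localhost:8983/solr/messages/update?commit=true" \
>>>>>>>>>>     -H "Content-Type: text/xml" \
>>>>>>>>>>     --data-binary "<delete><query>created_at:[* TO NOW-24HOURS]</query></delete>"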
>>>>>>>>>> 
>>>>>>>>>> There are no errors (log level warn) in the logs. I'm not sure what to
>>>>>>>>>> look into. As I've said, this has been running (delete included) for
>>>>>>>>>> about a month.
>>>>>>>>>> 
>>>>>>>>>> I'll also note that I have another cluster much like this one where I do
>>>>>>>>>> the very same thing... it has 4 machines, and indexes 10x the documents
>>>>>>>>>> per second, with more indexes... and yet I delete on a cron without
>>>>>>>>>> issue...
>>>>>>>>>> 
>>>>>>>>>> Any ideas on where to start, or other information I could provide?
>>>>>>>>>> 
>>>>>>>>>> Thanks much.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 
