As an update to this, I did my SolrCloud dance and moved to 2 JVMs per
machine (still the same 2 machines) to spread the load around. Each
Solr instance now hosts 16 shards in total (leader for 8, replica for 8).

*drum roll* ... I can repeatedly run my delete script and nothing breaks. :)
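
For reference, a minimal sketch of the two-JVMs-per-machine layout described
above, assuming the stock Solr 4.x Jetty example; the heap size, install
paths, ports, and ZooKeeper addresses are placeholders, not the actual
production values:

    # First JVM on the default port
    cd /opt/solr-node1/example
    java -Xmx4g -Djetty.port=8983 -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar &

    # Second JVM on the same machine, different port, same ZooKeeper ensemble
    cd /opt/solr-node2/example
    java -Xmx4g -Djetty.port=8984 -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar &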


On Thu, Mar 7, 2013 at 11:03 AM, Brett Hoerner <br...@bretthoerner.com> wrote:

> Here is the other server when it's locked:
> https://gist.github.com/3529b7b6415756ead413
>
> To be clear, neither is really "the replica": I have 32 shards, and each
> physical server is the leader for 16 and the replica for the other 16.
>
> Also, related to the max threads hunch: my working cluster has many, many
> fewer shards per Solr instance. I'm going to do some migration dancing on
> this cluster today to have more Solr JVMs each with fewer cores, and see
> how it affects the deletes.
>
>
> On Wed, Mar 6, 2013 at 5:40 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> Any chance you can grab the stack trace of a replica as well? (also when
>> it's locked up of course).
>>
>> - Mark
>>
>> On Mar 6, 2013, at 3:34 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
>>
>> > If there's anything I can try, let me know. Interestingly, I think I have
>> > noticed that if I stop my indexer, do my delete, and restart the indexer
>> > then I'm fine. Which goes along with the update thread contention theory.
>> >
>> >
>> > On Wed, Mar 6, 2013 at 5:03 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >
>> >> This is what I see:
>> >>
>> >> We currently limit the number of outstanding update requests at one time
>> >> to avoid a crazy number of threads being used.
>> >>
>> >> It looks like a bunch of update requests are stuck in socket reads and are
>> >> taking up the available threads. It looks like the deletes are hanging out
>> >> waiting for a free thread.
>> >>
>> >> It seems the question is, why are the requests stuck in socket reads. I
>> >> don't have an answer at the moment.
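
One quick way to check that theory against a dump like the gists above is to
look for threads parked in the JDK's blocking socket read. A rough sketch
(the process id and output file name are placeholders):

    # SOLR_PID is a placeholder for the actual Solr process id.
    jstack -l "$SOLR_PID" > solr-threads.txt
    # Threads blocked reading from a socket usually show this native frame:
    grep -B 3 -A 15 "java.net.SocketInputStream.socketRead0" solr-threads.txt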
>> >>
>> >> We should probably get this into a JIRA issue though.
>> >>
>> >> - Mark
>> >>
>> >>
>> >> On Mar 6, 2013, at 2:15 PM, Alexandre Rafalovitch <arafa...@gmail.com>
>> >> wrote:
>> >>
>> >>> It does not look like a deadlock, though it could be a distributed one.
>> >>> Or it could be a livelock, though that's less likely.
>> >>>
>> >>> Here is what we used to recommend in similar situations for large Java
>> >>> systems (BEA Weblogic):
>> >>> 1) Do a thread dump of both systems before anything. As simultaneous as
>> >>> you can make it.
>> >>> 2) Do the first delete. Do a thread dump every 2 minutes on both servers
>> >>> (so, say, 3 dumps in that 5 minute wait).
>> >>> 3) Do the second delete and do thread dumps every 30 seconds on both
>> >>> servers, from just before and then during. Preferably all the way until
>> >>> the problem shows itself. Every 5 seconds if the problem shows itself
>> >>> really quickly.
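
A throwaway loop for collecting those periodic dumps might look like this
(the process id, dump count, interval, and output directory are placeholders):

    # SOLR_PID is a placeholder; substitute the actual Solr process id.
    SOLR_PID=12345
    mkdir -p /tmp/solr-dumps
    for i in $(seq 1 20); do
        jstack -l "$SOLR_PID" > "/tmp/solr-dumps/dump-$(date +%H%M%S).txt"
        sleep 30    # drop to 5 once the problem starts showing itself quickly
    done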
>> >>>
>> >>> That gives you a LOT of thread dumps. But it also gives you something
>> >>> that allows you to compare thread state before and after the problem
>> >>> starts showing itself, and to identify moving (or unnaturally still)
>> >>> threads. I even wrote a tool a long time ago that parsed those thread
>> >>> dumps automatically and generated pretty deadlock graphs from them.
>> >>>
>> >>>
>> >>> Regards,
>> >>>  Alex.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> Personal blog: http://blog.outerthoughts.com/
>> >>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> >>> - Time is the quality of nature that keeps events from happening all at
>> >>> once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
>> >>>
>> >>>
>> >>> On Wed, Mar 6, 2013 at 5:04 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >>>
>> >>>> Thanks Brett, good stuff (though not a good problem).
>> >>>>
>> >>>> We def need to look into this.
>> >>>>
>> >>>> - Mark
>> >>>>
>> >>>> On Mar 6, 2013, at 1:53 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
>> >>>>
>> >>>>> Here is a dump after the delete, indexing has been stopped:
>> >>>>> https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
>> >>>>>
>> >>>>> An interesting hint that I forgot to mention: it doesn't always happen
>> >>>>> on the first delete. I manually ran the delete cron, and the server
>> >>>>> continued to work. I waited about 5 minutes and ran it again, and it
>> >>>>> stalled the indexer (as seen from the indexer process):
>> >>>>> http://i.imgur.com/1Tt35u0.png
>> >>>>>
>> >>>>> Another thing I forgot to mention. To bring the cluster back to life I:
>> >>>>>
>> >>>>> 1) stop my indexer
>> >>>>> 2) stop server1, start server1
>> >>>>> 3) stop server2, start server2
>> >>>>> 4) manually rebalance half of the shards so that server2 is the leader
>> >>>>> (unload/create on server1, sketched below)
>> >>>>> 5) restart indexer
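
For step 4, the unload/create can be done per core through the CoreAdmin API.
A rough sketch (the hosts, core names, collection, and shard are placeholders
for whatever the cluster actually uses):

    # Unload the core on server1; its replica on server2 takes over as leader.
    curl "http://server1:8983/solr/admin/cores?action=UNLOAD&core=mycoll_shard1_replica1"

    # Re-create the core on server1 so it rejoins as the replica for that shard.
    curl "http://server1:8983/solr/admin/cores?action=CREATE&name=mycoll_shard1_replica1&collection=mycoll&shard=shard1"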
>> >>>>>
>> >>>>> And it works again until a delete eventually kills it.
>> >>>>>
>> >>>>> To be clear again, select queries continue to work indefinitely.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Brett
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >>>>>
>> >>>>>> Which version of Solr?
>> >>>>>>
>> >>>>>> Can you use jconsole, visualvm, or jstack to get some stack traces
>> >>>>>> and see where things are halting?
>> >>>>>>
>> >>>>>> - Mark
>> >>>>>>
>> >>>>>> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
>> >>>>>>
>> >>>>>>> I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards,
>> >>>>>>> replication factor of 2) that I've been using for over a month now
>> >>>>>>> in production.
>> >>>>>>>
>> >>>>>>> Suddenly, the hourly cron I run that dispatches a delete by query
>> >>>>>>> completely halts all indexing. Select queries still run (and quickly),
>> >>>>>>> there is no CPU or disk I/O happening, but suddenly my indexer (which
>> >>>>>>> runs at ~400 doc/sec steady) pauses, and everything blocks indefinitely.
>> >>>>>>>
>> >>>>>>> To clarify the schema a bit: this is a moving window of data (imagine
>> >>>>>>> messages that don't matter after a 24-hour period) which is regularly
>> >>>>>>> "chopped" off by my hourly cron (deleting messages over 24 hours old)
>> >>>>>>> to keep the index size reasonable.
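
For reference, that hourly "chop" is just a delete-by-query posted to the
update handler. A minimal sketch of that kind of cron job (the host,
collection name, timestamp field, and cutoff are placeholders, not the
actual query used here):

    curl "http://server1:8983/solr/mycoll/update?commit=true" \
      -H "Content-Type: text/xml" \
      --data-binary "<delete><query>created_at:[* TO NOW-24HOURS]</query></delete>"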
>> >>>>>>>
>> >>>>>>> There are no errors (log level warn) in the logs. I'm not sure what
>> >>>>>>> to look into. As I've said this has been running (delete included)
>> >>>>>>> for about a month.
>> >>>>>>>
>> >>>>>>> I'll also note that I have another cluster much like this one where
>> >>>>>>> I do the very same thing... it has 4 machines, and indexes 10x the
>> >>>>>>> documents per second, with more indexes... and yet I delete on a cron
>> >>>>>>> without issue...
>> >>>>>>>
>> >>>>>>> Any ideas on where to start, or other information I could provide?
>> >>>>>>>
>> >>>>>>> Thanks much.
>> >>>>>>
>> >>>>>>
>> >>>>
>> >>>>
>> >>
>> >>
>>
>>
>
