Re: Stress test deadlocks

Erick Erickson Tue, 29 Jan 2013 19:27:31 -0800

First, I approach synchronization with fear and loathing because it's
so easy to screw up, so any comments deeply appreciated.


Second, the code (see below) in DefaultSolrCoreState just looks wrong.
Why synchronize the entire method and then synchronize everything
except the error condition on a _different_ object? The same thing
happens a bit further down in the newIndexWriter method and yet again
potentially in, I think, every synchronized method. That seems like
classic deadlock-vulnerable code. I really think we've been getting
away with this because the incidence of simultaneously opening and
closing cores is very low, or was before the "many cores" madness. In
fact, I'm not sure _any_ of the synchronized methods in
DefaultSolrCoreState should be synchronized.

Odd that I had that exact thought about running this against the
unmodified code an hour ago on the way back from some family business.
I'm not entirely sure what it would show though, since you don't get
into this situation without loading and unloading cores at the same
time, the stuff from SOLR-1028 and SOLR-4300. Which you couldn't do
before this rapidly. The deadlock is definitely happening when closing
a core, which generally wouldn't happen the way it's happening with
the stress test before the other changes. I can say that the deadlock
doesn't involve any of the new synchronization SOLR-4196 has
introduced, but I think is a consequence of rapidly opening/closing
cores. Now that I look more closely, the stack trace for the
deadlocked cores are _both_ in DefaultSolrCoreState. Smoking gun?

re: reviewing the patch. It spun out of control as I tried to get all
of the solr.xml purged. It's huge. For the main changes to things like
CoreContainer I don't think looking at the differences will help,
we'll have to evaluate it as new code. I'd be happy to get on a
marathon remote-review session with anyone who wants to rip things
apart..... All tests pass at this point, but I'm not sure how much
that really means in terms of deadlocking.... And I'm particularly
uncertain of the violence I did to persisting....

Final comment. With the one synchronization on the one method removed,
I've had about 11 hours if my stress test run without problems. So
this is in the right area at least. I'm about to remove the other
synchronized statement on the methods and run it overnight, will look
at the logs and report in the morning.

Here's the code pattern I just think is wrong (pseudo). It's repeated
several times...

 public synchronized RefCounted<IndexWriter> getIndexWriter(SolrCore core)
      throws IOException {

    if (closed) {
      throw new RuntimeException("SolrCoreState already closed");
    }

    synchronized (writerPauseLock) {
         all the rest of the method
    }
}

On Tue, Jan 29, 2013 at 1:32 PM, Mark Miller <[email protected]> wrote:
> There is a lot of complicated interplay between locks in that area of the 
> code - small changes can easily get you into trouble.
>
> Can you modify your test to run on the code before your patch? That would 
> help in telling if it's something existing or something your introducing.
>
> I still plan on looking at this issue closer this week, so perhaps I can 
> offer some more help soon.
>
> - Mark
>
> On Jan 29, 2013, at 1:26 PM, Erick Erickson <[email protected]> wrote:
>
>> Two runs doth not a conclusion reach, but removing "synchronized" from:
>> DefaultSolrCoreState.getIndexWriter (line 78)
>>
>> let me run for an hour, at least twice. And my stress test succeeds,
>> which fires up 15 indexing threads on 100 cores (transient core size
>> is 20), indexes documents for an hour while another 15 threads fire
>> off queries. At the end, it inspects each core to see if there are the
>> expected number of documents.
>>
>> But that's kinda a frail reed to pin my hopes on, these are
>> notoriously hard to reproduce.
>>
>> I'll set this up to run on an old machine for much longer later today.
>> Does anyone who knows that code know whether I'm playing with fire? I
>> haven't looked at the synchronization in that code to try to figure
>> out the purpose, I'm hoping someone will glance at it and say "that's
>> wrong".
>>
>> I'll dig into it later and see how much I can figure out about whether it's 
>> safe
>>
>> FWIW,
>>
>> On Tue, Jan 29, 2013 at 8:31 AM, Erick Erickson <[email protected]> 
>> wrote:
>>> All:
>>>
>>> As part of SOLR-4196, I'm opening and closing cores at a furious rate.
>>> My tests are running for 20-40 minutes then locking up quite
>>> regularly. Of course the first place I'm looking is my recent code,
>>> since it has a bunch of synchronized blocks.
>>>
>>> The deadlock is definitely happening at a call from the new code to
>>> close a Solr core, so to really look at this anyone will need to get
>>> the patch I'll put up in a minute. The deadlock trace is below.
>>>
>>> But without going that far, I question whether it's really anything to
>>> do with new synchronizations I'm doing or whether it's just something
>>> that's been lurking for a while and I'm flushing out of the woodwork.
>>> One of the deadlocked threads may be called form my code, but as far
>>> as I can tell none of the actual synchronization objects I'm using are
>>> held. I have the full jstack output if anyone needs it...
>>>
>>> Of course I'll continue looking, but at a glance I'm wondering if this
>>> code has really ever been stressed this way before or whether these
>>> have existed for a while. All synchronization should be approached
>>> with fear and loathing IMO.....
>>>
>>> One thread blocks at a synchronized method, but should this method
>>> really be synchronized?
>>>
>>> at 
>>> org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:78)
>>> (here's the method)  @Override
>>>  public synchronized RefCounted<IndexWriter> getIndexWriter(SolrCore core)
>>>      throws IOException {
>>>
>>> and a little later in the method there's:
>>>    synchronized (writerPauseLock) {
>>>      if (core == null) {
>>>
>>>
>>>
>>> and the other thread blocks at:
>>>
>>> at 
>>> org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:668),
>>> (here's the method)
>>>  // IndexWriterCloser interface method - called from 
>>> solrCoreState.decref(this)
>>>  @Override
>>>  public void closeWriter(IndexWriter writer) throws IOException {
>>>    boolean clearRequestInfo = false;
>>>    commitLock.lock(); **********locking here!
>>>    try {
>>>      SolrQueryRequest req = new LocalSolrQueryRequest(core, new
>>> ModifiableSolrParams());
>>>      SolrQueryResponse rsp = new SolrQueryResponse();
>>>      if (SolrRequestInfo.getRequestInfo() == null) {
>>>        clearRequestInfo = true;
>>>
>>>
>>>
>>>
>>> Java stack information for the threads listed above:
>>> ===================================================
>>> "commitScheduler-42617-thread-1":
>>> at 
>>> org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:78)
>>> - waiting to lock <78b4aa518> (a 
>>> org.apache.solr.update.DefaultSolrCoreState)
>>> at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1359)
>>> at 
>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:561)
>>> - locked <7884ca730> (a java.lang.Object)
>>> at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>> at 
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
>>> at 
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>> at java.lang.Thread.run(Thread.java:680)
>>>
>>> *********
>>> Other thread
>>> "qtp1401888126-32":
>>> at sun.misc.Unsafe.park(Native Method)
>>> - parking to wait for  <788d73208> (a
>>> java.util.concurrent.locks.ReentrantLock$NonfairSync)
>>> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>>> at 
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
>>> at 
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
>>> at 
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
>>> at 
>>> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
>>> at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
>>> at 
>>> org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:668)
>>> at 
>>> org.apache.solr.update.DefaultSolrCoreState.closeIndexWriter(DefaultSolrCoreState.java:64)
>>> - locked <78b4aa518> (a org.apache.solr.update.DefaultSolrCoreState)
>>> at 
>>> org.apache.solr.update.DefaultSolrCoreState.close(DefaultSolrCoreState.java:272)
>>> - locked <78b4aa518> (a org.apache.solr.update.DefaultSolrCoreState)
>>> at org.apache.solr.core.SolrCore.decrefSolrCoreState(SolrCore.java:888)
>>> - locked <78b4aa518> (a org.apache.solr.update.DefaultSolrCoreState)
>>> at org.apache.solr.core.SolrCore.close(SolrCore.java:980)
>>> at org.apache.solr.core.CoreMaps.putTransientCore(CoreContainer.java:1465)
>>> at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:730)
>>> at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1137)
>>> at 
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:190)
>>> at
>>>
>>> Thanks,
>>> Erick
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Stress test deadlocks

Reply via email to