No pending tasks for compactionstats and netstats.

On Fri, Aug 26, 2011 at 6:07 AM, aaron morton <aa...@thelastpickle.com> wrote:
> That's a thread waiting for other threads / activities to complete. Nothing
> unusual there.
>
> Work out how far the repair gets. Is there a validation compaction listed
> in nodetool compactionstats? Are there any streams running in nodetool
> netstats?
>
> Look through the logs on the machine you start the repair on, and follow the
> messages from the AntiEntropyService. They will say when they send messages
> to other nodes to build the merkle tree and when they get the response back.
> You can then check whether the other nodes respond.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 25/08/2011, at 7:02 PM, Boris Yen wrote:
>
> We tried to dump the stack traces of the threads, and we noticed:
>
> "manual-repair-d08349af-189f-47cb-9cc3-452538ce04d1" daemon prio=10
> tid=0x00000000406a3000 nid=0x1890 waiting on condition [0x00007f5c97be8000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x00007f5d4acf0f38> (a java.util.concurrent.CountDownLatch$Sync)
>     at java.util.concurrent.locks.LockSupport.park(Unknown Source)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(Unknown Source)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source)
>     at java.util.concurrent.CountDownLatch.await(Unknown Source)
>     at org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:665)
>
> This seems to be the thread that causes the repair to hang.
>
> We also noticed another odd thing: sometimes we see lots of [WRITE-/...]
> threads.
>
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
>
> On Thu, Aug 25, 2011 at 11:10 AM, Boris Yen <yulin...@gmail.com> wrote:
>
>> Would CASSANDRA-2433 cause this?
>>
>> On Wed, Aug 24, 2011 at 7:23 PM, Boris Yen <yulin...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> In our testing environment, we have two nodes with RF=2 running 0.8.4. We
>>> tried to test the repair function of Cassandra; however, every once in a
>>> while "nodetool repair" never returns. We have checked system.log, and
>>> nothing seems out of the ordinary: no errors, no exceptions. The data is
>>> only 50 MB, and it is constantly updated.
>>>
>>> Shutting down one node during the repair process can cause a similar
>>> symptom. So our original thought was that maybe one of the TreeRequests is
>>> not sent to the other node correctly, which might cause the repair to run
>>> forever. However, I did not see any related log messages to support that. I
>>> am kind of running out of ideas about this... Does anyone else have this
>>> problem?
>>>
>>> Regards,
>>> Boris
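
A note on the stack trace above: RepairSession.run is parked in CountDownLatch.await(), presumably with one count per expected merkle-tree response. If even a single TreeRequest or its reply is lost (as suspected above), the latch never reaches zero and "nodetool repair" never returns, with nothing in the logs. Below is a minimal sketch of that wait pattern; the class and constant names are invented for illustration, and this is not Cassandra's actual code:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch of the latch wait seen in the stack trace.
    public class RepairLatchSketch {

        // One expected merkle-tree response, e.g. from the other node
        // in the two-node RF=2 cluster described above.
        static final int EXPECTED_RESPONSES = 1;

        public static void main(String[] args) throws InterruptedException {
            final CountDownLatch responses = new CountDownLatch(EXPECTED_RESPONSES);

            // If the reply never arrives (dropped TreeRequest, node shut
            // down mid-repair, ...), countDown() is never called and
            //
            //     responses.await();
            //
            // parks forever, exactly like the manual-repair-... thread.

            // A timed wait at least makes the hang observable:
            if (!responses.await(10, TimeUnit.SECONDS)) {
                System.err.println("still waiting on " + responses.getCount()
                        + " merkle tree response(s); repair appears hung");
            }
        }
    }

If a lost response really is the cause, it would match the symptom described: no errors or exceptions in system.log, just a repair session that waits forever.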