On Mon, Jun 8, 2009 at 12:04 PM, Neil Ellis <neil.el...@mangala.co.uk> wrote:
> So after a weekend away from the problem it occurred to me that I had
> neglected to nohup the java process. Therefore I suspect that was
> the problem, running now with nohup ;-)
>
Let's hope this works. I am not 100% sure about this, but normally a kill
signal sent to a Java process shouldn't affect the interrupt status of
any Java thread. If it does (for example when using java.nio), that
behavior is new to me. A way to achieve this would be to register a
signal handler (via sun.misc.Signal) and manually go through each thread
and interrupt them (so check if there is any code like that running).
What JVM are you running?

Another thing: I just now went through all the places where we catch
InterruptedException in Neo4j and actually found one place that calls
java.nio code that didn't reset the interrupt properly after a wait. I
fixed this in trunk (b9-SNAPSHOT) and if you want you could try to apply
the same patch to b8 (see https://trac.neo4j.org/changeset/2890) or I
could provide a jar for you. If you want to try out b9-SNAPSHOT, just
remember you can't switch back to b8 since the store layout has changed.

> It also occurred to me that on receipt of ClosedByInterruptException
> Neo4J should probably trigger a shutdown (i.e. neo.shutdown()) and log
> that it was interrupted since an interrupt has been triggered on the
> write thread. The later stack traces show that Neo doesn't recover
> from such an interrupt, so one would suspect that this is currently
> the best course of action and is logically what should happen when an
> interrupt is triggered.
>

Agreed, performing a shutdown and presenting a message to the user about
the cause would be much better than the current behavior. I will start
thinking about how this can be implemented.

Regards,
Johan

> Let me know what you think please Johan, meanwhile I'm running with
> nohup and will let you know if that was indeed the cause.
>
> All the best
> Neil
>
> On 5 Jun 2009, at 19:05, Johan Svensson wrote:
>
>> These problems are hard to find.
>>
>> I do not like the way java.nio behaves on interrupts since I don't
>> know how much data got written/read and the underlying file channel
>> just gets closed. At the moment the only thing we can do is throw an
>> exception and do a full recovery process...
>>
>> Also I think the Thread.interrupt idiom/usage is broken. The only time
>> you can use it is when you have total control and "own" the full stack,
>> knowing exactly where the specific thread is executing. I have seen
>> some web servers/containers use it to try to control threads that
>> should time out, and that doesn't work very well.
>>
>> How many CPUs do you have on the machine running this (could be a
>> spurious wakeup somewhere)? Also do you run concurrent transactions or
>> not? I have been trying to reproduce the other problem with nested
>> transactions but nothing so far.
>>
>> -Johan
>>
>> On Fri, Jun 5, 2009 at 7:42 PM, Neil Ellis
>> <neil.el...@mangala.co.uk> wrote:
>>> Nope, don't think it's a user exception, digging deeper.
>>>
>>>
>>> On 5 Jun 2009, at 14:55, Neil Ellis wrote:
>>>
>>>> Hi Johan
>>>>
>>>> Took a little longer to fail (circa 31Gb) with
>>>>
>>>> java.nio.channels.ClosedByInterruptException
>>>> Received http://www.myspace.com/nuski
>>>>     at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
>>>>     at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
>>>>     at org.neo4j.impl.transaction.TxLog.txDone(TxLog.java:221)
>>>>     at org.neo4j.impl.transaction.TxManager.rollback(TxManager.java:732)
>>>>     at org.neo4j.impl.transaction.TransactionImpl.rollback(TransactionImpl.java:108)
>>>>     at org.neo4j.api.core.EmbeddedNeo$TransactionImpl.finish(EmbeddedNeo.java:377)
>>>>
>>>> So I'm going to have to check through my code to see how the thread
>>>> got interrupted.
>>>> That smells like a user error, however it would be good of course
>>>> if we can track the mistake down and allow Neo4j to realise it is a
>>>> user error and fail more gracefully.
>>>> Anyway, my turn to investigate a little more :)
>>>>
>>>> ATB
>>>> Neil
>>>> On 4 Jun 2009, at 19:19, Johan Svensson wrote:
>>>>
>>>>> Thanks, I'll have a look at this and run some tests with nested
>>>>> transactions.
>>>>>
>>>>> -Johan
>>>>>
>>>>> On Thu, Jun 4, 2009 at 7:25 PM, Neil Ellis
>>>>> <neil.el...@mangala.co.uk> wrote:
>>>>>> Hi, these are from CentOS and 1.0-b8. I have changed my code to
>>>>>> avoid nested transactions and now I'm not getting this, so (at the
>>>>>> moment) it looks like a nested transaction quirk ... and would only
>>>>>> occur after a large number of successful writes (gigs).
>>>>>>
>>>>>> Thx
>>>>>> Neil
>>>>>>
>>>>>> On 4 Jun 2009, at 18:18, Johan Svensson wrote:
>>>>>>
>>>>>>> Hi Neil,
>>>>>>>
>>>>>>> What version of Neo4j are you running (b8, b9-SNAPSHOT) and on
>>>>>>> what OS/file system?
>>>>>>>
>>>>>>> -Johan
>>>>>>>
_______________________________________________
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
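
For anyone reading this thread later, here is a small standalone sketch of the java.nio behaviour discussed above: a FileChannel write on a thread whose interrupt flag is set closes the channel and throws ClosedByInterruptException. It also shows one possible way to defer a pending interrupt around a write and hand it back afterwards. The class and method names (InterruptDemo, channelClosedByInterrupt, writeWithDeferredInterrupt) are made up for illustration. This is not Neo4j code and not necessarily what changeset 2890 does, and it does not protect against an interrupt that arrives while the write is in progress.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; none of these names come from Neo4j.
final class InterruptDemo {

    /** Writes with the interrupt flag set; true if the write failed and left the channel closed. */
    static boolean channelClosedByInterrupt(Path file) {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            Thread.currentThread().interrupt(); // simulate a pending interrupt
            try {
                ch.write(ByteBuffer.wrap(new byte[] {1, 2, 3}));
                return false; // write went through, channel still usable
            } catch (ClosedByInterruptException e) {
                return !ch.isOpen(); // java.nio closed the channel underneath us
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            Thread.interrupted(); // clear the simulated interrupt again
        }
    }

    /** Clears a pending interrupt, writes all bytes, then restores the flag. Returns bytes written. */
    static int writeWithDeferredInterrupt(Path file, byte[] data) {
        boolean wasInterrupted = Thread.interrupted(); // reads and clears the flag
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            int written = 0;
            while (buf.hasRemaining()) {
                written += ch.write(buf);
            }
            return written;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (wasInterrupted) {
                Thread.currentThread().interrupt(); // hand the interrupt back to the caller
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("interrupt-demo", ".bin");
        try {
            System.out.println("closed by interrupt: " + channelClosedByInterrupt(file));
            Thread.currentThread().interrupt();
            System.out.println("bytes written: "
                    + writeWithDeferredInterrupt(file, new byte[] {1, 2, 3}));
            System.out.println("interrupt still pending: " + Thread.interrupted());
        } finally {
            Files.delete(file);
        }
    }
}
```

The second method only moves the interrupt out of the way of the write; whether to then shut down, as suggested in the thread, is still up to the caller once the flag has been restored.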