Hi John, let's see if I can help:

On Apr 5, 2012, at 3:19 PM, John Nagro wrote:

> Hello -
> 
> I've been hitting Ivan up for advice about a bookkeeper project of mine. I 
> recently ran into another issue and he suggested I inquire here since he is 
> traveling.
> 
> We've got a pool of 5 BK servers running in EC2. Last night they got into a 
> funky state and/or crashed - unfortunately the log with the original event 
> got rotated (that has been fixed). I was running a cut of 4.1.0-SNAPSHOT sha 
> 6d56d60831a63fe9520ce156686d0cb1142e44f5 from Wed Mar 28 21:57:40 2012 +0000 
> which brought everything up to BOOKKEEPER-195. That build had some bugfixes 
> over 4.0.0 that I was originally running (and a previous version before that).
> 

Can you say more about your application, for example how fast you are writing 
and how often you roll ledgers? Are you deleting ledgers at all?


> When I restart the servers after the incident this is what the logs looked 
> like:
> 
> https://gist.github.com/f2b9c8c76943b057546e
> 
> The logs contain a lot of errors, although it appears the servers come up (I 
> have not tried to use the servers yet). I don't have the original stack that 
> caused the crash, but the logs from shortly after the crash contained a lot 
> of this stack:
> 
> 2012-04-04 21:04:58,833 - INFO  [GarbageCollectorThread:GarbageCollectorThread@266] - Deleting entryLogId 4 as it has no active ledgers!
> 2012-04-04 21:04:58,834 - ERROR [GarbageCollectorThread:EntryLogger@188] - Trying to delete an entryLog file that could not be found: 4.log
> 2012-04-04 21:04:59,783 - WARN  [NIOServerFactory-3181:NIOServerFactory@129] - Exception in server socket loop: /0.0.0.0
> 
> java.util.NoSuchElementException
>         at java.util.LinkedList.getFirst(LinkedList.java:109)
>         at org.apache.bookkeeper.bookie.LedgerCacheImpl.grabCleanPage(LedgerCacheImpl.java:458)
>         at org.apache.bookkeeper.bookie.LedgerCacheImpl.putEntryOffset(LedgerCacheImpl.java:165)
>         at org.apache.bookkeeper.bookie.LedgerDescriptorImpl.addEntry(LedgerDescriptorImpl.java:93)
>         at org.apache.bookkeeper.bookie.Bookie.addEntryInternal(Bookie.java:999)
>         at org.apache.bookkeeper.bookie.Bookie.addEntry(Bookie.java:1034)
>         at org.apache.bookkeeper.proto.BookieServer.processPacket(BookieServer.java:359)
>         at org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.readRequest(NIOServerFactory.java:315)
>         at org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.doIO(NIOServerFactory.java:213)
>         at org.apache.bookkeeper.proto.NIOServerFactory.run(NIOServerFactory.java:124)

This looks like what we found and resolved here:

        https://issues.apache.org/jira/browse/BOOKKEEPER-198
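
For reference, the top frame of that trace is java.util.LinkedList.getFirst, 
which throws NoSuchElementException when the list is empty. So the list that 
grabCleanPage read from had nothing in it at that moment. Below is a minimal 
sketch of that JDK behaviour only. It is not the LedgerCacheImpl code and not 
the BOOKKEEPER-198 patch; the class name EmptyListDemo, the helper firstOrNull 
and the parameter name pages are made up for illustration.

```java
import java.util.LinkedList;
import java.util.NoSuchElementException;

public class EmptyListDemo {
    // Hypothetical helper (not BookKeeper code): returns the first element,
    // or null when the list is empty. peekFirst() does not throw.
    static Long firstOrNull(LinkedList<Long> pages) {
        return pages.peekFirst();
    }

    // Shows the failure mode seen in the trace: getFirst() on an empty
    // LinkedList throws NoSuchElementException.
    static boolean getFirstThrowsOnEmpty() {
        try {
            new LinkedList<Long>().getFirst();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(getFirstThrowsOnEmpty());                 // true
        System.out.println(firstOrNull(new LinkedList<Long>()));     // null
    }
}
```

A caller that can legitimately see an empty list has to check for it (or use 
peekFirst and handle null) rather than call getFirst directly.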

> 
> This morning I upgraded to the most recent cut - sha 
> f694716e289c448ab89cab5fa81ea0946f9d9193 made on Tue Apr 3 16:02:44 2012 
> +0000 and restarted. That did not seem to correct matters, although the log 
> has slightly different error messages:
> 
> https://gist.github.com/aea874d89b28d4cfef31
> 
> Does anyone know what's going on? How can I correct these errors? Are the 
> machines in an okay state to use?

It sounds like we resolved it in BOOKKEEPER-198, so if you're running a recent 
cut you shouldn't observe this problem anymore. If it does happen again, it 
would be great to find a way to reproduce it so that we can track down the 
bug, assuming it is a bug.

-Flavio

