Flavio - I forgot to mention some figures on scale. At the moment, let's say we create a couple dozen ledgers a minute, and we persist them at about the same pace. If something goes wrong (which has happened a few times - new software), it is not uncommon to see tens of thousands of ledgers in BK.
Thanks.
-John

On Thu, Apr 5, 2012 at 10:53 AM, John Nagro <[email protected]> wrote:

> Flavio -
>
> I really appreciate your prompt response. Some quick background: we use
> some of the Hadoop technologies for storage, coordination, and processing.
> Recently we wanted to add a write-ahead log to our infrastructure so that
> clients could record "transactions" prior to executing them - such as
> updates going to an API or processing of an event. I've written a set of
> tools that use BK as a generic write-ahead logger. Clients (using ZooKeeper
> for coordination) can create named write-ahead logs with custom chunking
> (how frequently a new ledger is created, based on size/time). Once a
> ledger has rolled over (or a client crashes), a persister (monitoring ZK)
> reads that ledger and persists it to S3/HDFS as Hadoop sequence files, where
> a map-reduce process can reconcile it. The ledger is then deleted from BK.
> This is all done using ZK in such a fashion that (hopefully) once a client
> has written any data to the ledger, it will always end up on S3/HDFS (via
> BK), even if the client crashes (the persister will always know which
> ledger belongs to which log and which ledgers are currently in use).
>
> Does that sound like an appropriate use of BK? It seemed like a natural
> fit as a durable storage solution until something can reliably get the
> data to a place where it would ultimately be archived and could be
> reprocessed/reconciled (S3/HDFS).
>
> As for the bug fix you mentioned, this gist shows the logs from the cut I
> made this morning:
>
> https://gist.github.com/aea874d89b28d4cfef31
>
> As you can see, there are still some exceptions and error messages that
> repeat (forever).
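[Editor's note: the size/time-based chunking John describes is not shown in the thread. As a minimal illustration of that rollover rule only - class and method names here are hypothetical, not taken from John's actual tooling - the decision of when to cut a new ledger could be sketched in Java like this:]

```java
/**
 * Hypothetical sketch of a size/time ledger-rollover policy: a new ledger
 * is cut when either the bytes written to the current ledger or its age
 * crosses a configured threshold. Timestamps are passed in explicitly so
 * the logic is easy to test; real tooling would use the system clock.
 */
public class RolloverPolicy {
    private final long maxBytes;
    private final long maxAgeMillis;
    private long bytesWritten = 0;
    private long openedAtMillis;

    public RolloverPolicy(long maxBytes, long maxAgeMillis, long nowMillis) {
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
        this.openedAtMillis = nowMillis;
    }

    /** Record an entry write and report whether a new ledger should be cut. */
    public boolean recordAndCheck(int entrySize, long nowMillis) {
        bytesWritten += entrySize;
        return bytesWritten >= maxBytes
                || (nowMillis - openedAtMillis) >= maxAgeMillis;
    }

    /** Reset the counters once a new ledger has been created. */
    public void rolled(long nowMillis) {
        bytesWritten = 0;
        openedAtMillis = nowMillis;
    }
}
```

[In a setup like the one described, the writer would consult such a policy after each `addEntry`, close the current ledger when it fires, and record the new ledger id under the log's ZK node so the persister can find it.]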
> This is the newest cut available on github; the last commit is:
>
> commit f694716e289c448ab89cab5fa81ea0946f9d9193
> Author: Flavio Paiva Junqueira <[email protected]>
> Date:   Tue Apr 3 16:02:44 2012 +0000
>
>     BOOKKEEPER-207: BenchBookie doesn't run correctly (ivank via fpj)
>
>     git-svn-id: https://svn.apache.org/repos/asf/zookeeper/bookkeeper/trunk@1309007 13f79535-47bb-0310-9956-ffa450edef68
>
> What are your thoughts? Thanks!
>
> -John
>
> On Thu, Apr 5, 2012 at 10:10 AM, Flavio Junqueira <[email protected]> wrote:
>
>> Hi John, let's see if I can help:
>>
>> On Apr 5, 2012, at 3:19 PM, John Nagro wrote:
>>
>> Hello -
>>
>> I've been hitting Ivan up for advice about a BookKeeper project of mine.
>> I recently ran into another issue, and he suggested I inquire here since
>> he is traveling.
>>
>> We've got a pool of 5 BK servers running in EC2. Last night they got into
>> a funky state and/or crashed - unfortunately, the log with the original
>> event got rotated (that has been fixed). I was running a cut of
>> 4.1.0-SNAPSHOT, sha 6d56d60831a63fe9520ce156686d0cb1142e44f5 from Wed Mar
>> 28 21:57:40 2012 +0000, which brought everything up to BOOKKEEPER-195.
>> That build had some bugfixes over 4.0.0, which I was originally running
>> (and a previous version before that).
>>
>> Is there anything else you can say about your application, like how fast
>> you're writing and how often you're rolling ledgers, maybe? Are you
>> deleting ledgers at all?
>>
>> When I restarted the servers after the incident, this is what the logs
>> looked like:
>>
>> https://gist.github.com/f2b9c8c76943b057546e
>>
>> Which contain a lot of errors - although it appears the servers come up
>> (I have not tried to use the servers yet).
>> Although I don't have the original stack that caused the crash, the logs
>> from shortly after the crash contained a lot of this stack:
>>
>> 2012-04-04 21:04:58,833 - INFO  [GarbageCollectorThread:GarbageCollectorThread@266] - Deleting entryLogId 4 as it has no active ledgers!
>> 2012-04-04 21:04:58,834 - ERROR [GarbageCollectorThread:EntryLogger@188] - Trying to delete an entryLog file that could not be found: 4.log
>> 2012-04-04 21:04:59,783 - WARN  [NIOServerFactory-3181:NIOServerFactory@129] - Exception in server socket loop: /0.0.0.0
>>
>> java.util.NoSuchElementException
>>         at java.util.LinkedList.getFirst(LinkedList.java:109)
>>         at org.apache.bookkeeper.bookie.LedgerCacheImpl.grabCleanPage(LedgerCacheImpl.java:458)
>>         at org.apache.bookkeeper.bookie.LedgerCacheImpl.putEntryOffset(LedgerCacheImpl.java:165)
>>         at org.apache.bookkeeper.bookie.LedgerDescriptorImpl.addEntry(LedgerDescriptorImpl.java:93)
>>         at org.apache.bookkeeper.bookie.Bookie.addEntryInternal(Bookie.java:999)
>>         at org.apache.bookkeeper.bookie.Bookie.addEntry(Bookie.java:1034)
>>         at org.apache.bookkeeper.proto.BookieServer.processPacket(BookieServer.java:359)
>>         at org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.readRequest(NIOServerFactory.java:315)
>>         at org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.doIO(NIOServerFactory.java:213)
>>         at org.apache.bookkeeper.proto.NIOServerFactory.run(NIOServerFactory.java:124)
>>
>> This looks like what we found and resolved here:
>>
>> https://issues.apache.org/jira/browse/BOOKKEEPER-198
>>
>> This morning I upgraded to the most recent cut -
>> sha f694716e289c448ab89cab5fa81ea0946f9d9193, made on Tue Apr 3 16:02:44
>> 2012 +0000 - and restarted. That did not seem to correct matters,
>> although the log has slightly different error messages:
>>
>> https://gist.github.com/aea874d89b28d4cfef31
>>
>> Does anyone know what's going on? How can I correct these errors?
>> Are the machines in an okay state to use?
>>
>> It sounds like we have resolved it in 198, so if you're using a recent
>> cut, you shouldn't observe this problem anymore. But if it does happen
>> again, it would be great to try to find a way to reproduce it so that we
>> can track the bug... assuming it is a bug.
>>
>> -Flavio
