Flavio - Thanks a lot, I really appreciate your help. I'll look over the ticket. I'm also going to setup a new install of the latest server code at some point today. I'll let you know how that goes.
-John On Fri, Apr 6, 2012 at 9:13 AM, Flavio Junqueira <[email protected]> wrote: > Hi John, I have created a jira to report my findings so far. Here is a > link in the case you're interested: > > https://issues.apache.org/jira/browse/BOOKKEEPER-212 > > Feel free to add to the discussion if you see anything I'm missing. > > -Flavio > > On Apr 5, 2012, at 9:57 PM, John Nagro wrote: > > Thanks Flavio. > > On Thu, Apr 5, 2012 at 3:36 PM, Flavio Junqueira <[email protected]>wrote: > >> Exactly, that's my suspicion so far, but we need to make sure. I'm also >> seeing if I can reproduce the problem independently. >> >> -Flavio >> >> >> On Apr 5, 2012, at 9:28 PM, John Nagro wrote: >> >> Okay cool, thanks for the help Flavio. So, just to recap so that I >> understand, you suspect that we experienced BOOKKEEPER-198, put the >> machines into a funky state, and now we just need to start over with these >> servers - however 198 has been patched and we shouldn't see this specific >> failure mode again with a current cut of the code? >> >> Thanks again for all your assistance. >> >> -John >> >> On Thu, Apr 5, 2012 at 3:05 PM, Flavio Junqueira <[email protected]>wrote: >> >>> >>> On Apr 5, 2012, at 8:41 PM, John Nagro wrote: >>> >>> All 5 machines behave the same way. It is not the end of the world if we >>> can't recover the data from them in this instance, but at some point the >>> data sent through this WAL will be very important to be able to recover in >>> the event of a crash. >>> >>> >>> Understood and agreed, losing data is not an option. Unfortunately, it >>> sounds like you have hit a nasty bug, let's make sure to fix it because I'm >>> sure it could also hit others, especially if BOOKKEEPER-198 does not fix >>> it. >>> >>> >>> I tried to move the logs/current directory out of the way, replacing it >>> with another and got this: >>> >>> 2012-04-05 14:39:31,829 - INFO [main-EventThread:Bookie$3@678] - ZK >>> client has been reconnected to the ZK server! >>> 2012-04-05 14:39:31,854 - ERROR [main:Bookie@331] - Cookie exists in >>> zookeeper, but not in all local directories >>> java.io.FileNotFoundException: /opt/bookkeeper/logs/current/VERSION (No >>> such file or directory) >>> at java.io.FileInputStream.open(Native Method) >>> at java.io.FileInputStream.<init>(FileInputStream.java:106) >>> at java.util.Scanner.<init>(Scanner.java:636) >>> at org.apache.bookkeeper.bookie.Cookie.readFromDirectory(Cookie.java:188) >>> at >>> org.apache.bookkeeper.bookie.Bookie.checkEnvironment(Bookie.java:326) >>> at org.apache.bookkeeper.bookie.Bookie.<init>(Bookie.java:401) >>> at >>> org.apache.bookkeeper.proto.BookieServer.<init>(BookieServer.java:75) >>> at org.apache.bookkeeper.proto.BookieServer.main(BookieServer.java:294) >>> 2012-04-05 14:39:31,856 - ERROR [main:BookieServer@308] - Exception >>> running bookie server : >>> org.apache.bookkeeper.bookie.BookieException$InvalidCookieException >>> at org.apache.bookkeeper.bookie.Bookie.checkEnvironment(Bookie.java:332) >>> at org.apache.bookkeeper.bookie.Bookie.<init>(Bookie.java:401) >>> at org.apache.bookkeeper.proto.BookieServer.<init>(BookieServer.java:75) >>> at org.apache.bookkeeper.proto.BookieServer.main(BookieServer.java:294) >>> >>> which makes me believe i'd need to clean out the cookie entry in ZK for >>> this server? >>> >>> >>> Yes, you need to clean up and restart zookeeper too. >>> >>> -Flavio >>> >>> >>> On Thu, Apr 5, 2012 at 1:54 PM, Flavio Junqueira <[email protected]>wrote: >>> >>>> By cleaning it up, I meant deleting the log and index files. You must >>>> have passed paths when you started bookies, I was suggesting to wipe them >>>> out. >>>> >>>> I just realized that there are a couple of things I don't understand: >>>> >>>> 1- How many bookies have been affected? Just one or all of them? If it >>>> is just one, then you may need to use the bookie recovery tool to >>>> reconstruct the ledger fragments of the problematic bookie. >>>> 2- Can you clean up all bookies and zookeeper and start it over or >>>> there is data you need to recover? >>>> >>>> -Flavio >>>> >>>> On Apr 5, 2012, at 7:32 PM, John Nagro wrote: >>>> >>>> Flavio - >>>> >>>> The log is from an existing server which crashed and I am trying to >>>> bring it back up. How might i clean things up? What about existing ledgers? >>>> >>>> -John >>>> >>>> On Thu, Apr 5, 2012 at 12:57 PM, Flavio Junqueira >>>> <[email protected]>wrote: >>>> >>>>> Actually, I have a question about the new logs you sent. Have you >>>>> tried to clean up the state of bookies before rerunning them or these logs >>>>> correspond to a bookie that you're bringing back up and the bookie is >>>>> trying to read its old state? If it is the latter, then I would suggest to >>>>> give it a fresh start. >>>>> >>>>> -Flavio >>>>> >>>>> On Apr 5, 2012, at 4:53 PM, John Nagro wrote: >>>>> >>>>> Flavio - >>>>> >>>>> I really appreciate your prompt response. Some quick background - we >>>>> use some of the hadoop technologies for storage, coordination, and >>>>> processing. Recently we wanted to add a write-ahead-log to our >>>>> infrastructure so that clients could record "transactions" prior to >>>>> executing them - such as updates going to an API or processing of an >>>>> event. >>>>> I've written a set of tools that use BK as a generic write-ahead-logger. >>>>> Clients (using zookeeper for coordination) can create named write ahead >>>>> logs with custom chunking (how frequently a new ledger is created - based >>>>> on size/time). Once a ledger has rolled-over (or a client crashes), a >>>>> persister (monitoring ZK) reads that ledger and persists it to S3/HDFS as >>>>> hadoop sequence files where a map-reduce process can reconcile it. The >>>>> ledger is then deleted from BK. This is all done using ZK in a fashion >>>>> where (hopefully) once a client has written any data to the ledger it will >>>>> always end up on S3/HDFS (via BK) even if the client crashes (the >>>>> persister >>>>> will always know which ledger belongs to which log and which ledgers are >>>>> currently in use). >>>>> >>>>> Does that sound like an appropriate use of BK? It seemed like a >>>>> natural fit as a durable storage solution until something can reliably get >>>>> it to a place where it would ultimately be archived and could be >>>>> reprocessed/reconciled (S3/HDFS). >>>>> >>>>> As for the bug fix you mentioned, this gist shows the logs from the >>>>> cut i made this morning: >>>>> >>>>> https://gist.github.com/aea874d89b28d4cfef31 >>>>> >>>>> As you can see, there are still some exceptions and error messages >>>>> that repeat (forever). This is the newest cut available on github, last >>>>> commit is: >>>>> >>>>> commit f694716e289c448ab89cab5fa81ea0946f9d9193 >>>>> Author: Flavio Paiva Junqueira <[email protected]> >>>>> Date: Tue Apr 3 16:02:44 2012 +0000 >>>>> >>>>> BOOKKEEPER-207: BenchBookie doesn't run correctly (ivank via fpj) >>>>> >>>>> git-svn-id: >>>>> https://svn.apache.org/repos/asf/zookeeper/bookkeeper/trunk@130900713f79535-47bb-0310-9956-ffa450edef68 >>>>> >>>>> >>>>> What are your thoughts? Thanks! >>>>> >>>>> -John >>>>> >>>>> >>>>> On Thu, Apr 5, 2012 at 10:10 AM, Flavio Junqueira >>>>> <[email protected]>wrote: >>>>> >>>>>> Hi John, Let's see if I can help: >>>>>> >>>>>> On Apr 5, 2012, at 3:19 PM, John Nagro wrote: >>>>>> >>>>>> Hello - >>>>>> >>>>>> I've been hitting Ivan up for advice about a bookkeeper project of >>>>>> mine. I recently ran into another issue and he suggested I inquire here >>>>>> since he is traveling. >>>>>> >>>>>> We've got a pool of 5 BK servers running in EC2. Last night they got >>>>>> into a funky state and/or crashed - unfortunately the log with the >>>>>> original >>>>>> event got rotated (that has been fixed). I was running a cut of >>>>>> 4.1.0-SNAPSHOT sha 6d56d60831a63fe9520ce156686d0cb1142e44f5 from Wed Mar >>>>>> 28 >>>>>> 21:57:40 2012 +0000 which brought everything up to BOOKKEEPER-195. That >>>>>> build had some bugfixes over 4.0.0 that I was originally running (and a >>>>>> previous version before that). >>>>>> >>>>>> >>>>>> Is there anything else you can say about your application, like how >>>>>> fast you're writing and how often you're rolling ledgers maybe? Are you >>>>>> deleting ledgers at all? >>>>>> >>>>>> >>>>>> When I restart the servers after the incident this is what the logs >>>>>> looked like: >>>>>> >>>>>> https://gist.github.com/f2b9c8c76943b057546e >>>>>> >>>>>> Which contain a lot of errors - although it appears the servers come >>>>>> up (i have not tried to use the servers yet). Although I don't have the >>>>>> original stack that caused the crash, the logs from recently after the >>>>>> crash contained a lot of this stack: >>>>>> >>>>>> 2012-04-04 21:04:58,833 - INFO >>>>>> [GarbageCollectorThread:GarbageCollectorThread@266] - Deleting >>>>>> entryLogId 4 as it has no active ledgers! >>>>>> 2012-04-04 21:04:58,834 - ERROR >>>>>> [GarbageCollectorThread:EntryLogger@188] - Trying to delete an >>>>>> entryLog file that could not be found: 4.log >>>>>> 2012-04-04 21:04:59,783 - WARN >>>>>> [NIOServerFactory-3181:NIOServerFactory@129] - Exception in server >>>>>> socket loop: /0.0.0.0 >>>>>> >>>>>> java.util.NoSuchElementException >>>>>> at java.util.LinkedList.getFirst(LinkedList.java:109) >>>>>> at >>>>>> org.apache.bookkeeper.bookie.LedgerCacheImpl.grabCleanPage(LedgerCacheImpl.java:458) >>>>>> at >>>>>> org.apache.bookkeeper.bookie.LedgerCacheImpl.putEntryOffset(LedgerCacheImpl.java:165) >>>>>> at >>>>>> org.apache.bookkeeper.bookie.LedgerDescriptorImpl.addEntry(LedgerDescriptorImpl.java:93) >>>>>> at >>>>>> org.apache.bookkeeper.bookie.Bookie.addEntryInternal(Bookie.java:999) >>>>>> at >>>>>> org.apache.bookkeeper.bookie.Bookie.addEntry(Bookie.java:1034) >>>>>> at >>>>>> org.apache.bookkeeper.proto.BookieServer.processPacket(BookieServer.java:359) >>>>>> at >>>>>> org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.readRequest(NIOServerFactory.java:315) >>>>>> at >>>>>> org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.doIO(NIOServerFactory.java:213) >>>>>> >>>>>> at >>>>>> org.apache.bookkeeper.proto.NIOServerFactory.run(NIOServerFactory.java:124) >>>>>> >>>>>> >>>>>> This looks like what we found and resolved here: >>>>>> >>>>>> https://issues.apache.org/jira/browse/BOOKKEEPER-198 >>>>>> >>>>>> >>>>>> This morning I upgraded to the most recent cut - >>>>>> sha f694716e289c448ab89cab5fa81ea0946f9d9193 made on Tue Apr 3 16:02:44 >>>>>> 2012 +0000 and restarted. That did not seem to correct matters, although >>>>>> the log has slightly different error messages: >>>>>> >>>>>> https://gist.github.com/aea874d89b28d4cfef31 >>>>>> >>>>>> Does anyone know whats going on? How i can correct these errors? Are >>>>>> the machines in an okay state to use? >>>>>> >>>>>> >>>>>> It sounds like we have resolved it in 198, so if you're using a >>>>>> recent cut, you shouldn't observe this problem anymore. But, if it does >>>>>> happen again, it would be great to try to find a way to reproduce it so >>>>>> that we can track the bug... assuming it is a bug. >>>>>> >>>>>> -Flavio >>>>>> >>>>>> >>>>>> >>>>> >>>>> flavio >>>>> junqueira >>>>> senior research scientist >>>>> >>>>> [email protected] >>>>> direct +34 93-183-8828 >>>>> >>>>> avinguda diagonal 177, 8th floor, barcelona, 08018, es >>>>> phone (408) 349 3300 fax (408) 349 3301 >>>>> >>>>> >>>> >>>> flavio >>>> junqueira >>>> senior research scientist >>>> >>>> [email protected] >>>> direct +34 93-183-8828 >>>> >>>> avinguda diagonal 177, 8th floor, barcelona, 08018, es >>>> phone (408) 349 3300 fax (408) 349 3301 >>>> >>>> >>> >>> flavio >>> junqueira >>> senior research scientist >>> >>> [email protected] >>> direct +34 93-183-8828 >>> >>> avinguda diagonal 177, 8th floor, barcelona, 08018, es >>> phone (408) 349 3300 fax (408) 349 3301 >>> >>> >> >> flavio >> junqueira >> senior research scientist >> >> [email protected] >> direct +34 93-183-8828 >> >> avinguda diagonal 177, 8th floor, barcelona, 08018, es >> phone (408) 349 3300 fax (408) 349 3301 >> >> > > flavio > junqueira > senior research scientist > > [email protected] > direct +34 93-183-8828 > > avinguda diagonal 177, 8th floor, barcelona, 08018, es > phone (408) 349 3300 fax (408) 349 3301 > >
