Flavio - I forgot to mention some figures on scale. At the moment, let's say we create a couple dozen ledgers a minute, and we persist them at about the same pace. If something goes wrong (which has happened a few times - new software), it is not uncommon to see tens of thousands of ledgers in BK.
Thanks.
-John

On Thu, Apr 5, 2012 at 10:53 AM, John Nagro <[email protected]> wrote:

> Flavio -
>
> I really appreciate your prompt response. Some quick background: we use
> some of the Hadoop technologies for storage, coordination, and processing.
> Recently we wanted to add a write-ahead log to our infrastructure so that
> clients could record "transactions" prior to executing them - such as
> updates going to an API or processing of an event. I've written a set of
> tools that use BK as a generic write-ahead logger. Clients (using ZooKeeper
> for coordination) can create named write-ahead logs with custom chunking
> (how frequently a new ledger is created, based on size/time). Once a
> ledger has rolled over (or a client crashes), a persister (monitoring ZK)
> reads that ledger and persists it to S3/HDFS as Hadoop sequence files, where
> a map-reduce process can reconcile it. The ledger is then deleted from BK.
> This is all done using ZK in such a fashion that (hopefully) once a client
> has written any data to the ledger, it will always end up on S3/HDFS (via
> BK), even if the client crashes (the persister will always know which
> ledger belongs to which log and which ledgers are currently in use).
>
> Does that sound like an appropriate use of BK? It seemed like a natural
> fit as a durable storage solution until something can reliably get the
> data to a place where it would ultimately be archived and could be
> reprocessed/reconciled (S3/HDFS).
>
> As for the bug fix you mentioned, this gist shows the logs from the cut I
> made this morning:
>
> https://gist.github.com/aea874d89b28d4cfef31
>
> As you can see, there are still some exceptions and error messages that
> repeat (forever).
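[Editor's note: the size/time-based chunking John describes is not shown in the thread. As a minimal illustration of that rollover rule only - class and method names here are hypothetical, not taken from John's actual tooling - the decision of when to cut a new ledger could be sketched in Java like this:]

```java
/**
 * Hypothetical sketch of a size/time ledger-rollover policy: a new ledger
 * is cut when either the bytes written to the current ledger or its age
 * crosses a configured threshold. Timestamps are passed in explicitly so
 * the logic is easy to test; real tooling would use the system clock.
 */
public class RolloverPolicy {
    private final long maxBytes;
    private final long maxAgeMillis;
    private long bytesWritten = 0;
    private long openedAtMillis;

    public RolloverPolicy(long maxBytes, long maxAgeMillis, long nowMillis) {
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
        this.openedAtMillis = nowMillis;
    }

    /** Record an entry write and report whether a new ledger should be cut. */
    public boolean recordAndCheck(int entrySize, long nowMillis) {
        bytesWritten += entrySize;
        return bytesWritten >= maxBytes
                || (nowMillis - openedAtMillis) >= maxAgeMillis;
    }

    /** Reset the counters once a new ledger has been created. */
    public void rolled(long nowMillis) {
        bytesWritten = 0;
        openedAtMillis = nowMillis;
    }
}
```

[In a setup like the one described, the writer would consult such a policy after each `addEntry`, close the current ledger when it fires, and record the new ledger id under the log's ZK node so the persister can find it.]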
> This is the newest cut available on github; the last commit is:
>
> commit f694716e289c448ab89cab5fa81ea0946f9d9193
> Author: Flavio Paiva Junqueira <[email protected]>
> Date:   Tue Apr 3 16:02:44 2012 +0000
>
>     BOOKKEEPER-207: BenchBookie doesn't run correctly (ivank via fpj)
>
>     git-svn-id: https://svn.apache.org/repos/asf/zookeeper/bookkeeper/trunk@1309007 13f79535-47bb-0310-9956-ffa450edef68
>
> What are your thoughts? Thanks!
>
> -John
>
> On Thu, Apr 5, 2012 at 10:10 AM, Flavio Junqueira <[email protected]> wrote:
>
>> Hi John, let's see if I can help:
>>
>> On Apr 5, 2012, at 3:19 PM, John Nagro wrote:
>>
>> Hello -
>>
>> I've been hitting Ivan up for advice about a BookKeeper project of mine.
>> I recently ran into another issue, and he suggested I inquire here since
>> he is traveling.
>>
>> We've got a pool of 5 BK servers running in EC2. Last night they got into
>> a funky state and/or crashed - unfortunately, the log with the original
>> event got rotated (that has been fixed). I was running a cut of
>> 4.1.0-SNAPSHOT, sha 6d56d60831a63fe9520ce156686d0cb1142e44f5 from Wed Mar
>> 28 21:57:40 2012 +0000, which brought everything up to BOOKKEEPER-195.
>> That build had some bugfixes over 4.0.0, which I was originally running
>> (and a previous version before that).
>>
>> Is there anything else you can say about your application, like how fast
>> you're writing and how often you're rolling ledgers, maybe? Are you
>> deleting ledgers at all?
>>
>> When I restarted the servers after the incident, this is what the logs
>> looked like:
>>
>> https://gist.github.com/f2b9c8c76943b057546e
>>
>> Which contain a lot of errors - although it appears the servers come up
>> (I have not tried to use the servers yet).
>> Although I don't have the original stack that caused the crash, the logs
>> from shortly after the crash contained a lot of this stack:
>>
>> 2012-04-04 21:04:58,833 - INFO  [GarbageCollectorThread:GarbageCollectorThread@266] - Deleting entryLogId 4 as it has no active ledgers!
>> 2012-04-04 21:04:58,834 - ERROR [GarbageCollectorThread:EntryLogger@188] - Trying to delete an entryLog file that could not be found: 4.log
>> 2012-04-04 21:04:59,783 - WARN  [NIOServerFactory-3181:NIOServerFactory@129] - Exception in server socket loop: /0.0.0.0
>>
>> java.util.NoSuchElementException
>>         at java.util.LinkedList.getFirst(LinkedList.java:109)
>>         at org.apache.bookkeeper.bookie.LedgerCacheImpl.grabCleanPage(LedgerCacheImpl.java:458)
>>         at org.apache.bookkeeper.bookie.LedgerCacheImpl.putEntryOffset(LedgerCacheImpl.java:165)
>>         at org.apache.bookkeeper.bookie.LedgerDescriptorImpl.addEntry(LedgerDescriptorImpl.java:93)
>>         at org.apache.bookkeeper.bookie.Bookie.addEntryInternal(Bookie.java:999)
>>         at org.apache.bookkeeper.bookie.Bookie.addEntry(Bookie.java:1034)
>>         at org.apache.bookkeeper.proto.BookieServer.processPacket(BookieServer.java:359)
>>         at org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.readRequest(NIOServerFactory.java:315)
>>         at org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.doIO(NIOServerFactory.java:213)
>>         at org.apache.bookkeeper.proto.NIOServerFactory.run(NIOServerFactory.java:124)
>>
>> This looks like what we found and resolved here:
>>
>> https://issues.apache.org/jira/browse/BOOKKEEPER-198
>>
>> This morning I upgraded to the most recent cut -
>> sha f694716e289c448ab89cab5fa81ea0946f9d9193, made on Tue Apr 3 16:02:44
>> 2012 +0000 - and restarted. That did not seem to correct matters,
>> although the log has slightly different error messages:
>>
>> https://gist.github.com/aea874d89b28d4cfef31
>>
>> Does anyone know what's going on? How can I correct these errors?
>> Are the machines in an okay state to use?
>>
>> It sounds like we have resolved it in 198, so if you're using a recent
>> cut, you shouldn't observe this problem anymore. But if it does happen
>> again, it would be great to try to find a way to reproduce it so that we
>> can track the bug... assuming it is a bug.
>>
>> -Flavio
