Re: Ledgers failing to replicate

Sijie Guo Thu, 12 Jan 2017 08:32:14 -0800

I see. Let me ask one more question: how do you create ledgers? And when
do you write to these ledgers, and when do you close them?

I think they are probably just empty ledgers at the time you were rolling.
There is a setting in the recovery tool to force close the open ledgers. I
need to check and confirm that.

On Jan 12, 2017 6:14 AM, "Sebastián Schepens" <
[email protected]> wrote:

Sijie,
We were replacing all our nodes and testing how to do it best without
affecting the cluster.

This same thing happened again yesterday. I have 4 underreplicated ledgers,
which are empty.
But this time, I turned off bookies one by one, waiting for all
underreplicated ledgers to replicate before turning off another bookie.
Even while doing this 'rolling' replace, I ended up with inconsistent
ledgers. How can this be possible?
One would expect that when there are no underreplicated ledgers, it would
be safe to lose a machine.

What's the recommended quorum setup if I wanted to safely tolerate 2
machine failures?

If you want to tolerate 2 failures, the write quorum size minus the ack
quorum size needs to be larger than or equal to 2.
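
As a quick check of that rule, here is a minimal Python sketch. The function
name is made up for illustration and is not part of BookKeeper; it only does
the subtraction described above, taken at face value.

```python
def tolerated_failures(write_quorum: int, ack_quorum: int) -> int:
    """Failures tolerated under the rule above: write quorum - ack quorum."""
    if not 1 <= ack_quorum <= write_quorum:
        raise ValueError("need 1 <= ack_quorum <= write_quorum")
    return write_quorum - ack_quorum


# Ledger 772 in this thread: quorumSize 3, ackQuorumSize 2.
print(tolerated_failures(3, 2))

# A hypothetical setup that would satisfy the rule for 2 failures.
print(tolerated_failures(5, 3))
```

Note that the ensemble size has to be at least the write quorum size, so the
second setup would need at least 5 bookies.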

Thanks,
Sebastian

On Wed, Jan 11, 2017 at 5:04 PM Sijie Guo <[email protected]> wrote:

> On Wed, Jan 11, 2017 at 11:15 AM, Sebastián Schepens <sebastian.schepens@
> mercadolibre.com> wrote:
>
> Hi guys,
> I'm doing some tests and turned off 2 bookies almost simultaneously hoping
> that all the ledgers would still be able to replicate since we have
> ensemble and quorum size of 3.
> Almost all ledgers managed to replicate using the autorecovery daemon
> except for 5. What's curious about these 5 ledgers is that they are all
> empty and the only node which contains data for them claims they do not
> exist.
>
> Here's the ledger metadata for one of them:
> ledgerID: 772
> BookieMetadataFormatVersion 2
> quorumSize: 3
> ensembleSize: 3
> length: 0
> lastEntryId: -1
> state: IN_RECOVERY
> segment {
>   ensembleMember: "10.64.103.57:3181"
>   ensembleMember: "10.64.103.249:3181"
>   ensembleMember: "10.64.102.95:3181"
>   firstEntryId: 0
> }
> digestType: CRC32
> password: ""
> ackQuorumSize: 2
>
> Where all nodes except 10.64.103.249 are down.
>
> And that node contains these logs:
> ERROR - [BookieReadThread-3181-10-1:ReadEntryProcessorV3@123] - No ledger
> found while reading entry:-1 from ledger: 772
>
>
> They seem to be empty ledgers with no entries.
>
>
>
> I don't understand how these ledgers ended up in this state. Are they
> recoverable?
>
>
> If the ledgers are closed and you lose two bookies, re-replication can
> replicate the data correctly. When a ledger is in the closed state, its
> metadata contains the last entry id, and re-replication uses that
> information to determine the state of the ledger and replicate the data
> correctly.
>
> However, if the ledgers are open and you lose two bookies (which is the
> majority of your quorum), the client can't decide what the last entry id
> is based on only the one remaining bookie, so it cannot close/seal the
> ledger correctly.
>
> Can you explain more about your tests? It would help me understand more
> about that.
>
>
>
> I could just delete the ledgers because they are empty too. By the way,
> bookkeeper shell should have a command for deleting ledgers.
>
>
> Yeah, this is a good suggestion. Do you mind creating a jira for adding
> the delete ledger command?
>
>
>
> Thanks,
> Sebastian
>
>
>
