[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963072#comment-13963072
 ] 

Ivan Kelly commented on BOOKKEEPER-733:
---------------------------------------

I don't like the idea of putting more logic into the ledger underreplication 
manager. As a component, I already find it hard to test and understand. I'd 
actually prefer to take some logic out of it.

I think it would be better to get rid of getLedgerToRereplicate & 
pollLedgerToRereplicate completely. Instead we could return an iterator over 
the underreplicated ledgers. Each worker would loop over the list and check 
whether it can rereplicate each item; if so, it tries to lock the item, and if 
it gets the lock, it rereplicates. This way the error handling and backoff 
logic can be put in the replication worker, where it belongs IMO.
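
A rough sketch of what that loop could look like, in Java. Every name here 
(UnderreplicatedLedger, UnderreplicationView, WorkerSketch, tryLock, 
backoffMillis) is hypothetical and not part of the BookKeeper API. The "can I 
rereplicate this?" check is assumed to mean "my bookie is not already in the 
ensemble", and the backoff constants are arbitrary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical record of one underreplicated ledger (not a BookKeeper class).
final class UnderreplicatedLedger {
    final long ledgerId;
    final Set<String> ensemble; // bookies that already hold a replica

    UnderreplicatedLedger(long ledgerId, String... ensemble) {
        this.ledgerId = ledgerId;
        this.ensemble = new HashSet<>(Arrays.asList(ensemble));
    }
}

// Hypothetical slimmed-down manager: it only lists and locks, nothing else.
final class UnderreplicationView {
    private final List<UnderreplicatedLedger> ledgers;
    private final Set<Long> locked = ConcurrentHashMap.newKeySet();

    UnderreplicationView(List<UnderreplicatedLedger> ledgers) {
        this.ledgers = new ArrayList<>(ledgers);
    }

    Iterator<UnderreplicatedLedger> iterator() { return ledgers.iterator(); }
    boolean tryLock(long ledgerId) { return locked.add(ledgerId); }
    void unlock(long ledgerId) { locked.remove(ledgerId); }
}

// The skip, lock and backoff decisions all live in the worker.
final class WorkerSketch {
    private final String localBookie;
    private final UnderreplicationView view;

    WorkerSketch(String localBookie, UnderreplicationView view) {
        this.localBookie = localBookie;
        this.view = view;
    }

    // One pass over the list; returns the ids this worker rereplicated.
    List<Long> runOnePass() {
        List<Long> done = new ArrayList<>();
        Iterator<UnderreplicatedLedger> it = view.iterator();
        while (it.hasNext()) {
            UnderreplicatedLedger ledger = it.next();
            if (ledger.ensemble.contains(localBookie)) {
                continue; // assumed check: we already hold a replica
            }
            if (!view.tryLock(ledger.ledgerId)) {
                continue; // another worker has this ledger
            }
            try {
                rereplicate(ledger);
                done.add(ledger.ledgerId);
            } finally {
                view.unlock(ledger.ledgerId);
            }
        }
        return done;
    }

    // Placeholder: the real fragment copy would happen here.
    private void rereplicate(UnderreplicatedLedger ledger) { }

    // Capped exponential backoff after passes that did no work.
    static long backoffMillis(int emptyPasses) {
        int shift = Math.min(Math.max(emptyPasses, 0), 5);
        return Math.min(30_000L, 1_000L << shift);
    }
}
```

A caller would sleep for backoffMillis(n) after n consecutive passes that 
return an empty list, and reset n once a pass does some work. In the scenario 
from the description, that would stop a worker from spinning on ledgers it 
cannot help with.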

> Improve ReplicationWorker to handle the urLedgers which already have same 
> ledger replica in hand
> -----------------------------------------------------------------------------------------------
>
>                 Key: BOOKKEEPER-733
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-733
>             Project: Bookkeeper
>          Issue Type: Bug
>          Components: bookkeeper-auto-recovery
>    Affects Versions: 4.2.2, 4.3.0
>            Reporter: Rakesh R
>            Assignee: Rakesh R
>
> +Scenario:+
> Step1 : Have three bookies BK1, BK2, BK3
> Step2 : Have written ledgers with quorum 2
> Step3 : Unfortunately BK2 and BK3 both went down for a few moments.
> The following messages flood the BK1 autorecovery logs. RW tries to 
> replicate the ledgers, but it simply skips the fragment and moves to the next 
> cycle when it finds that it already holds a replica. IMO, we should have a 
> mechanism in place to avoid unnecessary cycles.
> {code}
> 2014-02-18 21:47:55,140 - ERROR - [New I/O client boss 
> #2-1:PerChannelBookieClient$1@230] - Could not connect to bookie: [id: 
> 0x00ba679e]/10.18.170.130:15002, current state CONNECTING : 
> java.net.ConnectException: Connection refused: no further information
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>       at 
> org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:401)
>       at 
> org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:370)
>       at 
> org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:292)
>       at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>       at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
>       at java.lang.Thread.run(Thread.java:619)
> 2014-02-18 21:47:55,140 - INFO  - 
> [ReplicationWorker:PerChannelBookieClient@194] - Connecting to bookie: 
> 10.18.170.130:15002
> 2014-02-18 21:59:33,215 - DEBUG  - 
> [ReplicationWorker:ReplicationWorker@182] - Target 
> Bookie[10.18.170.130:15003] found in the fragment ensemble: 
> [10.18.170.130:15003, 10.18.170.130:15001, 10.18.170.130:15002]
> 2014-02-18 21:47:56,162 - ERROR - [New I/O client boss 
> #2-1:PerChannelBookieClient$1@230] - Could not connect to bookie: [id: 
> 0x0003f377]/10.18.170.130:15002, current state CONNECTING : 
> java.net.ConnectException: Connection refused: no further information
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>       at 
> org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:401)
>       at 
> org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:370)
>       at 
> org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:292)
>       at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>       at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
>       at java.lang.Thread.run(Thread.java:619)
> 2014-02-18 21:59:33,215 - DEBUG  - [ReplicationWorker:ReplicationWorker@182] 
> - Target Bookie[10.18.170.130:15003] found in the fragment ensemble: 
> [10.18.170.130:15003, 10.18.170.130:15001, 10.18.170.130:15002]
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
