[ https://issues.apache.org/jira/browse/ZOOKEEPER-3757 ]
Gowrima deleted comment on ZOOKEEPER-3757:
------------------------------------
was (Author: JIRAUSER307071):
Can I take this up? I am interested to work on it.
> Transaction log sync can take 20+ seconds after leader election when there is
> a large snapCount
> -----------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-3757
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3757
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.5.6
> Reporter: Alex Kaiser
> Priority: Minor
>
> Short overview:
> If you have a large snapCount (we are using 10,000,000) you can end up with a
> very large transaction log (ours are between 1GB - 1.5 GB), which can cause
> the sync between a newly elected leader and it's followers to take 20+
> seconds. This stems from the code (FileTxnIterator.getStorageSize()) in most
> cases returning 0 even if the transaction log is 1GB.
>
> Long Explanation:
> A few years ago we had some trouble with our zookeeper cluster having many
> shortish (100-500ms) pauses during our peak traffic times. These ended up
> resulting from the master taking a snap shot. To solve this we upped the
> snapCount to 10,000,000 so that we weren't taking snapshots nearly as often.
> We also made changes to reduce the size of our snapshots (from around 2.5 GB
> to ~500 MB).
> I don't remember what version of zookeeper we were using originally, but this
> was all working fine using 3.4.10, but we started to have problems when we
> upgraded to 3.5.6 around 3 months ago. We have a fairly high transaction
> rate and thus end up hitting the zxid overflow about once a month, which will
> cause a leader election. When we were on 3.4.10, this was fine because
> leader election and syncing would happen within 2-4 seconds, which was low
> enough for us to be able to basically ignore it. However after we upgraded
> to 3.5.6 the pauses we saw took between 15 - 30 seconds which were
> unacceptable for us.
> For now to solve this I set zookeeper.forceSnapshotSync=true (yes, I know the
> comments say this is only supposed to be used for testing), which causes
> syncing using snapshots (only 10-50 MB) instead of the transaction log (1-1.5
> GB).
>
> Technical details:
> I tried taking a look at the code and I think I know why this happens. From
> what I learned, it looks like when a follower needs to sync with a leader on
> the leader LearnerHandler.syncFollower() gets called. It goes through a big
> if statement, but at one point it will call
> db.getProposalsFromTxnLog(peerLastZxid, sizeLimit). That peerLastZxid could
> be some very old zxid if the follower hadn't taken a snapshot in a long time
> (i.e. has a large snapCount) and the sizeLimit will generally be 0.33 *
> snapshot size (in my case around 10 MB).
> Inside of that getProposalsFromTxnLog it will create a TxnIterator and then
> call getStorageSize() on it. The problem comes from the fact that this call
> to getStorageSize() will usually return with 0. The reason that happens is
> because the FileTxnIterator class has a "current" log file that it is
> reading, this.logFile, and a list of files that it would still have to
> iterate through, this.storedFiles. The getStroageSize() function though only
> looks at the storedFiles list, so if the iterator has one large transaction
> log as the "current" log file and nothing in the storedFiles list, then this
> method will return 0 even though there is a huge transaction log to sync.
> One other side affect of this problem is that even bouncing a follower can
> cause long (5-10 second) pauses as the leader will hold a read lock on the
> transaction log while it syncs up with the follower.
> While I know what the problem is I don't know what the best solution is. I'm
> willing to work on the solution, but I would appreciate suggestions. One
> possible solution would be to include the this.logFile in the
> getStorageSize() calculation, however this could cause the estimate to over
> estimate the amount of data that is in the iterator (possibly by a lot) and I
> don't know what the consequences of doing that is. I'm not quite sure what
> is a good way to get an accurate estimate.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)