[ https://issues.apache.org/jira/browse/KAFKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manikumar resolved KAFKA-1712.
------------------------------
    Resolution: Fixed

Fixed via https://issues.apache.org/jira/browse/KAFKA-2511

> Excessive storage usage on newly added node
> -------------------------------------------
>
>                 Key: KAFKA-1712
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1712
>             Project: Kafka
>          Issue Type: Bug
>          Components: log
>            Reporter: Oleg Golovin
>            Priority: Major
>
> When a new node is added to the cluster, data starts replicating to it. The
> mtime of the newly created segments is set when the last message is written
> to them. Though replication is a prolonged process, let's assume (for
> simplicity of explanation) that their mtime is very close to the time when
> the new node was added.
> After the replication is done, new data starts to flow into the new node.
> After `log.retention.hours` the amount of data will be
> 2 * daily_amount_of_data_in_kafka_node: the first half is the data
> replicated from the other nodes when the node was added (call that time
> `t1`), and the second half is the new data replicated from other nodes
> between `t1` and `t1 + log.retention.hours`. So by that time the node will
> hold twice as much data as the other nodes (a worked example of this
> arithmetic follows at the end of this message).
> This poses a big problem for us, as our storage is sized for the normal
> amount of data, not twice that amount.
> In our particular case it causes another problem as well. We have an
> emergency segment cleaner which runs when storage is nearly full (>90%).
> We try to balance the amount of data so that the cleaner does not run and
> we can rely solely on Kafka's internal log deletion, but sometimes the
> emergency cleaner does run.
> It works this way (a sketch of this procedure also follows at the end of
> this message):
> - it gets all Kafka segments on the volume
> - it filters out the last segment of each partition (just to avoid
> unnecessary recreation of the last, small segments)
> - it sorts them by segment mtime
> - it changes the mtime of the first N segments (those with the lowest
> mtime) to 1, so they become really, really old. N is chosen to free a
> specified percentage of the volume (3% in our case). Kafka deletes these
> segments later (as they are very old).
> The emergency cleaner works very well, except when data has been
> replicated to a newly added node.
> In that case a segment's mtime is the time the segment was replicated and
> does not reflect the real creation time of the original data stored in it.
> So the emergency cleaner will delete the segments with the lowest mtime,
> which may hold data that is much more recent than the data in other
> segments.
> This is not a big problem until we delete data which hasn't been fully
> consumed. Then we lose data, and that makes it a big problem.
> Is it possible to retain segment mtime during the initial replication to a
> new node? That would keep the new node from being loaded with twice as
> much data as the other nodes.
> Or maybe there are other ways to sort segments by data creation time (or
> something close to it)? For example, if
> https://issues.apache.org/jira/browse/KAFKA-1403 is implemented, we could
> take the time of the first message from the .index file. In our case that
> would let the emergency cleaner delete the truly oldest data.
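
To make the storage arithmetic above concrete, here is a small worked
example in Python. It only illustrates the report's reasoning; the unit
rate and the single retention window are hypothetical numbers, and the
report itself asserts only the factor of two.

    # One retention window's worth of data per node, in arbitrary units
    # (hypothetical figure purely for illustration).
    window_amount = 1.0  # data a node receives per log.retention.hours

    # An established node holds one window of data at steady state:
    # anything older has already been deleted by retention.
    steady_state = window_amount

    # A freshly added node bootstraps one full window of history at t1.
    # All of those replicated segments carry mtime ~= t1, so none of them
    # expires before t1 + log.retention.hours, while a full window of new
    # data arrives on top of them.
    replicated_history = window_amount
    new_data_since_t1 = window_amount
    new_node_peak = replicated_history + new_data_since_t1

    print(new_node_peak / steady_state)  # 2.0 - twice the other nodes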
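
And a minimal sketch of the emergency segment cleaner described in the
list above, again in Python. The log directory path and the
<log_dir>/<topic>-<partition>/<offset>.log layout are assumptions made
for illustration; only the logic (skip each partition's last segment,
sort by mtime, set mtime to 1 until the target ~3% of the volume would
be freed) comes from the report.

    import glob
    import os

    LOG_DIR = "/var/kafka-logs"  # volume to clean (assumed path)
    TARGET_FRACTION = 0.03       # free 3% of the volume, as in the report

    def segments_by_partition(log_dir):
        """Map each partition directory to its .log segments, sorted by
        base offset (assumed <topic>-<partition> directory layout)."""
        parts = {}
        for part_dir in glob.glob(os.path.join(log_dir, "*-*")):
            segs = sorted(glob.glob(os.path.join(part_dir, "*.log")))
            if segs:
                parts[part_dir] = segs
        return parts

    def emergency_clean(log_dir=LOG_DIR, target_fraction=TARGET_FRACTION):
        stat = os.statvfs(log_dir)
        bytes_to_free = int(stat.f_frsize * stat.f_blocks * target_fraction)

        # Collect all segments except the last one of each partition, to
        # avoid unnecessary recreation of the last, small segments.
        candidates = []
        for segs in segments_by_partition(log_dir).values():
            candidates.extend(segs[:-1])

        # Lowest mtime first: the segments retention would expire next.
        candidates.sort(key=os.path.getmtime)

        freed = 0
        for seg in candidates:
            if freed >= bytes_to_free:
                break
            os.utime(seg, (1, 1))          # mtime=1: "really, really old"
            freed += os.path.getsize(seg)  # Kafka's deletion frees this later

    if __name__ == "__main__":
        emergency_clean()

As the report notes, this scheme is only safe while segment mtime tracks
data creation time, which is exactly what breaks on a newly added node.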