Hey Todd,

The problem you pointed out is real. Unfortunately, placing by available size at creation time actually makes things worse. The original plan was to place new partitions on the disk with the most free space, but consider a common case:

  disk 1: 500M used
  disk 2: 0M used

Now say you are creating 10 partitions for what will be a massively large topic. You will place them all on disk 2, since it has the most free space, and then immediately discover that this was a bad idea as those partitions get huge. I think the current balancing by number of partitions is better than this because at least you get basically random assignment.
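To make the two-disk example concrete, here is a rough sketch in Python (not Kafka code). The function names, the existing partition counts, and the 1000M that each partition eventually grows to are all made up for illustration; only the 500M/0M starting point and the 10 new partitions come from the example.

```python
# Illustrative simulation only; names and numbers below are assumptions,
# not anything taken from the Kafka code base.

def place_by_free_space(used_mb, n_partitions):
    """Put each new partition on the disk holding the least data right now."""
    # New partitions are empty at creation, so usage never changes while
    # placing them and every partition lands on the same disk.
    return [min(range(len(used_mb)), key=lambda d: used_mb[d])
            for _ in range(n_partitions)]

def place_by_partition_count(existing_counts, n_partitions):
    """Put each new partition on the disk holding the fewest partitions."""
    counts = list(existing_counts)
    assignment = []
    for _ in range(n_partitions):
        d = min(range(len(counts)), key=lambda i: counts[i])
        counts[d] += 1
        assignment.append(d)
    return assignment

def final_usage(used_mb, assignment, grown_mb):
    """Disk usage once every new partition has grown to grown_mb."""
    usage = list(used_mb)
    for d in assignment:
        usage[d] += grown_mb
    return usage

used = [500, 0]          # index 0 is disk 1, index 1 is disk 2 (from the example)
existing_counts = [1, 0]  # assumption: disk 1 holds one existing partition
grown_mb = 1000           # assumption: eventual size of each new partition

by_space = place_by_free_space(used, 10)
by_count = place_by_partition_count(existing_counts, 10)

print("by free space:     ", final_usage(used, by_space, grown_mb))
print("by partition count:", final_usage(used, by_count, grown_mb))
```

Under these assumptions, placing by free space puts all 10 partitions on disk 2, while placing by count splits them 5 and 5. Neither policy knows how big the partitions will get, which is the point.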

I think to solve the problem you describe we need to do active rebalancing, since predicting the size of partitions ahead of time is basically impossible.

I think having the controller be unaware of disks is probably a good thing. The controller would balance partitions over servers, and each server would be responsible for balancing over its own disks.

I think this kind of balancing is possible, though not totally trivial. Basically, a background thread in LogManager would decide that there was too much skew in data assignment (it is actually debatable whether data size or I/O throughput is what you should optimize) and would then try to rebalance. To do the rebalance it would:

1. Do a background copy of the log from the current disk to the new disk.
2. Take the partition offline and delete the old log.
3. Bring the partition back using the new log and catch it back up off the leader.

-Jay

On Thu, Apr 9, 2015 at 8:19 AM, Todd Palino <[email protected]> wrote:

> I think this is a good start. We've been discussing JBOD internally, so
> it's good to see a discussion going externally about it as well.
>
> The other big blocker to using JBOD is the lack of intelligent partition
> assignment logic, and the lack of tools to adjust it. The controller is not
> smart enough to take into account disk usage when deciding to place a
> partition, which may not be a big concern (at the controller level, you
> worry about broker usage, not individual mounts). However, the broker is
> not smart enough to do it either, when looking at the local directories. It
> just round robins.
>
> In addition, there is no tool available to move a partition from one mount
> point to another. So in the case that you do have a hot disk, you cannot do
> anything about it without shutting down the broker and doing a manual move
> of the log segments.
>
> -Todd
>
>
> On Thu, Apr 9, 2015 at 5:36 AM, Andrii Biletskyi <
> [email protected]> wrote:
>
> > Hi,
> >
> > Let me start discussion thread for KIP-18 - JBOD Support.
> >
> > Link to wiki:
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-18+-+JBOD+Support
> >
> >
> > Thanks,
> > Andrii Biletskyi
> >
>
