At Mon, 16 Mar 2015 21:19:49 +0800, Liu Yuan wrote: > > On Mon, Mar 16, 2015 at 09:37:45PM +0900, Hitoshi Mitake wrote: > > On Mon, Mar 16, 2015 at 4:39 PM, Liu Yuan <namei.u...@gmail.com> wrote: > > > On Mon, Mar 16, 2015 at 03:55:43PM +0900, Hitoshi Mitake wrote: > > >> At Mon, 16 Mar 2015 14:48:47 +0800, > > >> Liu Yuan wrote: > > >> > > > >> > On Mon, Mar 16, 2015 at 03:39:17PM +0900, Hitoshi Mitake wrote: > > >> > > At Mon, 16 Mar 2015 14:31:44 +0800, > > >> > > Liu Yuan wrote: > > >> > > > > > >> > > > On Mon, Mar 16, 2015 at 02:08:25PM +0800, Liu Yuan wrote: > > >> > > > > On Mon, Mar 16, 2015 at 02:36:57PM +0900, Hitoshi Mitake wrote: > > >> > > > > > At Mon, 16 Mar 2015 10:21:50 +0800, > > >> > > > > > Liu Yuan wrote: > > >> > > > > > > > > >> > > > > > > On Thu, Mar 12, 2015 at 08:14:33PM +0900, Hitoshi Mitake > > >> > > > > > > wrote: > > >> > > > > > > > At Thu, 12 Mar 2015 14:41:56 +0800, > > >> > > > > > > > Liu Yuan wrote: > > >> > > > > > > > > > > >> > > > > > > > > On Tue, Jan 13, 2015 at 10:37:40AM +0900, Hitoshi Mitake > > >> > > > > > > > > wrote: > > >> > > > > > > > > > Current sheepdog never recycles VIDs. But it will > > >> > > > > > > > > > cause problems > > >> > > > > > > > > > e.g. VID space exhaustion, too much garbage inode > > >> > > > > > > > > > objects. > > >> > > > > > > > > > > > >> > > > > > > > > > Keeping deleted inode objects is required because > > >> > > > > > > > > > living inodes > > >> > > > > > > > > > (snapshots or clones) can point objects of the deleted > > >> > > > > > > > > > inodes. So if > > >> > > > > > > > > > every member of VDI family is deleted, it is safe to > > >> > > > > > > > > > remove deleted > > >> > > > > > > > > > inode objects. > > >> > > > > > > > > > > > >> > > > > > > > > > v2: > > >> > > > > > > > > > - update test scripts > > >> > > > > > > > > > > >> > > > > > > > > All the nodes of our test cluster panic out for the > > >> > > > > > > > > following problem: > > >> > > > > > > > > > > >> > > > > > > > > Mar 12 00:05:03 DEBUG [main] zk_handle_notify(1216) > > >> > > > > > > > > NOTIFY > > >> > > > > > > > > Mar 12 00:05:03 DEBUG [main] sd_notify_handler(960) op > > >> > > > > > > > > NOTIFY_VDI_ADD, size: 96, from: IPv4 ip:192.168.39.177 > > >> > > > > > > > > port:7000 > > >> > > > > > > > > Mar 12 00:05:03 DEBUG [main] do_add_vdi_state(362) > > >> > > > > > > > > 7c2b2b, 3, 0, 22, 0 > > >> > > > > > > > > Mar 12 00:05:03 DEBUG [main] do_add_vdi_state(362) > > >> > > > > > > > > 7c2b2c, 3, 0, 22, 7c2b2b > > >> > > > > > > > > Mar 12 00:05:03 EMERG [main] update_vdi_family(127) > > >> > > > > > > > > PANIC: parent VID: 7c2b2b not found > > >> > > > > > > > > Mar 12 00:05:03 EMERG [main] crash_handler(286) sheep > > >> > > > > > > > > exits unexpectedly (Aborted), si pid 4786, uid 0, errno > > >> > > > > > > > > 0, code -6 > > >> > > > > > > > > Mar 12 00:05:03 EMERG [main] sd_backtrace(833) > > >> > > > > > > > > sheep.c:288: crash_handler > > >> > > > > > > > > Mar 12 00:05:03 EMERG [main] sd_backtrace(847) > > >> > > > > > > > > /lib64/libpthread.so.0() [0x338200f4ff] > > >> > > > > > > > > Mar 12 00:05:03 EMERG [main] sd_backtrace(847) > > >> > > > > > > > > /lib64/libc.so.6(gsignal+0x34) [0x3381c328a4] > > >> > > > > > > > > Mar 12 00:05:03 EMERG [main] sd_backtrace(847) > > >> > > > > > > > > /lib64/libc.so.6(abort+0x174) [0x3381c34084] > > >> > > > > > > > > Mar 12 00:05:03 EMERG [main] sd_backtrace(833) > > >> > > > > > > > > vdi.c:127: update_vdi_family > > >> > > > > > > > > Mar 12 00:05:03 EMERG [main] sd_backtrace(833) > > >> > > > > > > > > vdi.c:398: add_vdi_state > > >> > > > > > > > > Mar 12 00:05:03 EMERG [main] sd_backtrace(833) > > >> > > > > > > > > ops.c:711: cluster_notify_vdi_add > > >> > > > > > > > > Mar 12 00:05:03 EMERG [main] sd_backtrace(833) > > >> > > > > > > > > group.c:975: sd_notify_handler > > >> > > > > > > > > > > >> > > > > > > > > So I tracked back to this patch set. The problem of this > > >> > > > > > > > > patch set tried to > > >> > > > > > > > > solve is very clear and come along with sheepdog since > > >> > > > > > > > > its born. This reveals > > >> > > > > > > > > actually the defeciency of our vdi allocation algorithm, > > >> > > > > > > > > which we need rethink > > >> > > > > > > > > a completely new algorithm to replace it and is not > > >> > > > > > > > > fixable, unfortunately. > > >> > > > > > > > > > > >> > > > > > > > > One simple rule, we can't recyle any vid if it is once > > >> > > > > > > > > created because of its > > >> > > > > > > > > current hash collision handling. Our current > > >> > > > > > > > > implementation forbigs recycling. > > >> > > > > > > > > > > >> > > > > > > > > Instead of fixing the above panic bug, I'd suggest we > > >> > > > > > > > > revert this patch set. > > >> > > > > > > > > For the problem this patch set mentioned, I think we > > >> > > > > > > > > need a new algoirthm and > > >> > > > > > > > > implementation. But before that, we should stay with old > > >> > > > > > > > > one, it is stable and > > >> > > > > > > > > reliable and should work for small size cluster. > > >> > > > > > > > > > > >> > > > > > > > > How do you think, Hitoshi and Kazutaka? > > >> > > > > > > > > > >> > > > > > > > How about providing switch turn on/off VID recycling? e.g. > > >> > > > > > > > dog cluster > > >> > > > > > > > format --enable-vid-recycle. The code can easily be pushed > > >> > > > > > > > into > > >> > > > > > > > conditional branches. I can post a patch if this way is > > >> > > > > > > > good for you. > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > This temporary workaroud looks okay but not good enough to > > >> > > > > > > me, what I am > > >> > > > > > > concerned is that vdi recycle will probably never be > > >> > > > > > > implemented if we stick to > > >> > > > > > > current vdi allocation algorithm. Once the new vdi > > >> > > > > > > allocation is intruduced > > >> > > > > > > someday in the future, the new algorithm would have no this > > >> > > > > > > kind of problem at > > >> > > > > > > all. If this is the case, the above code we leave here is > > >> > > > > > > also useless. > > >> > > > > > > > > >> > > > > > > I think we should focus on the new vdi allocation algorithm, > > >> > > > > > > e.g, store > > >> > > > > > > {name, vid} directly into a kv engine either implemented by > > >> > > > > > > sheep or by with the > > >> > > > > > > help of other software like zookeeper. > > >> > > > > > > > > >> > > > > > > I'm inclined to revert above patch set, for > > >> > > > > > > 1. it can't fix a non-fixable problem inherently > > >> > > > > > > 2. the code is probalematic and can cause a catastraphic > > >> > > > > > > disaster (all node die) > > >> > > > > > > 3. we might not need it in the future because it is specific > > >> > > > > > > for current vdi > > >> > > > > > > allocation algorithm. > > >> > > > > > > > >> > > > > > We can simply employ whole range lookup of bitmap as a VID > > >> > > > > > allocation > > >> > > > > > algorithm for recycling policy. Of course it would be harmful > > >> > > > > > for > > >> > > > > > snapshot and clone creation, but it can work correctly (and we > > >> > > > > > have > > >> > > > > > optimization e.g. parallelizing, caching, etc). In addition, > > >> > > > > > the > > >> > > > > > performance degradation can happen potentially even if we use > > >> > > > > > the > > >> > > > > > existing VID allocation algorithm (e.g. hash collision, > > >> > > > > > although > > >> > > > > > course it can happen rearely). > > >> > > > > > > >> > > > > Do we really need vdi recycle if we bring very complex lines of > > >> > > > > code? Current > > >> > > > > algorithm can *reuse* of deleted vdi IDs and inodes. So the very > > >> > > > > problem is > > >> > > > > actually the space effeciency, so you try to reclaim the space > > >> > > > > occupied by > > >> > > > > deleted vdis. > > >> > > > > > > >> > > > > If it is very easy to reclaim deleted inodes, I'd say great and > > >> > > > > let's go ahead. > > >> > > > > But it apparently not. We have this patch set and then the > > >> > > > > lookup algorithm is > > >> > > > > heavily degrated. > > >> > > > > > > >> > > > > I'm afraid, lookup the whole range is too costy, considering the > > >> > > > > deleted inodes > > >> > > > > space we reclaim. I think, most of users can bear the very > > >> > > > > little space overhead > > >> > > > > for better performance. So this patch set trade the code > > >> > > > > complexity and > > >> > > > > performance for the space efficiency. Note, we can reclaim > > >> > > > > inodes only in the > > >> > > > > case that we delete the whole snapshot chain and parent. This is > > >> > > > > actually a rare > > >> > > > > case. > > >> > > > > > >> > > > Delete the vdi & snapthos data objects is really good enough to > > >> > > > me. Your patch > > >> > > > set is the one of efforts to perfect current algorithms. But the > > >> > > > cost is too > > >> > > > high because the hottest path of vdi_lookup() is heavily degrated > > >> > > > for gerenal > > >> > > > cases, even though later we can fix all the bugs related to the > > >> > > > this patch set. > > >> > > > > > >> > > > Please consider it. > > >> > > > > >> > > Of course the increased cost of vdi_lookup() is problem. So I'm > > >> > > posting a patch for providing an option for enabling/disabling vid > > >> > > recycling. In default, the recycling will be disabled with the > > >> > > patch. So users can choose two different policies with different > > >> > > pros/cons. > > >> > > > > >> > > The recycling VID is an actual requirement from the development team > > >> > > of NTT DATA. I need to provide it at least as an option. > > >> > > > >> > THe requrement is for reclaim the deleted inodes? The vid exaustion > > >> > problem > > >> > mentioned in your patch set, is actually not a problem, no? We can > > >> > reuse deleted > > >> > inode and vid. If so, reclaim the deleted inodes, which are very > > >> > little and > > >> > reuable, is so important? > > >> > > >> The primary problem is VID. But reusing deleted VID correctly will > > >> require mucm more complex code e.g. rewriting parent/child > > >> relationship (and it would be superset of current code). > > > > > > I guess so. I don't have a glance of this problem in a code manner, but > > > we have > > > two state to indicate if the vid is used or not, > > > > > > 1. bit int bitmap of system_info > > > 2. inode's name field > > > > > > For the non-snapshot case, the valid vid is a) bit is set, b) name is not > > > empty. > > > So if we zero name, we can resue this bit by check if name is empty. > > > > > > For the snapshot case, the above criteria still hold true, no? > > > > > > 001111000 > > > ABCD > > > > > > D is the working vid, ABC are the snapshots. After we remove B, > > > > > > 0011 11000 > > > AB'CD > > > > > > B' mean its name is empty. So the snapshot chain becomes A->C->D. B is > > > still > > > reusable, no? > > > > > > > No. Because C and D can point objects created when B was a working VDI. > > You are right, but your patch set won't fix this problem either. I am not > concerned this problem, but rather: > > 1. code complexity and code stability. > 2. heavily degraded vdi_lookup(). > > If we fix one problem by introducing yet another problem(2), I'd suggest we > think twice. >
At least currently we don't have effective vdi_lookup() for recycle VDI policy. But it can be disabled by the cluster format option. Thanks, Hitoshi -- sheepdog mailing list sheepdog@lists.wpkg.org https://lists.wpkg.org/mailman/listinfo/sheepdog