Re: [Linux-cachefs] [PATCH] netfs: Fix netfs_extract_iter_to_sg() for ITER_UBUF/IOVEC
On Wed, Apr 12, 2023 at 5:19 AM David Howells wrote:
>
> Could you apply this, please? It doesn't affect anything yet, but I have
> patches in the works that will use it.

Applied,

Linus
--
Linux-cachefs mailing list
Linux-cachefs@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-cachefs
Re: [Linux-cachefs] [GIT PULL] erofs updates for 6.2-rc1 (fscache part inclusive)
On Sun, Dec 11, 2022 at 11:01 PM Gao Xiang wrote:
>
> Comparing with the latest -next on the last Thursday, there was one-liner
> addition commit 927e5010ff5b ("erofs: use kmap_local_page()
> only for erofs_bread()") as below:
[...]
> Because there is no -next version on Monday, if you would like to
> take all commits in -next you could take
>   git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs.git tags/erofs-for-6.2-rc1-alt

No worries. That one-liner isn't a problem, and you sent the pull request to me early, so I'm perfectly happy with your original request and have pulled it.

Thanks for being careful,

Linus
Re: [Linux-cachefs] [ammarfaizi2-block:dhowells/linux-fs/fscache-fixes] [mm, netfs, fscache] 6919cda8e0: canonical_address#:#[##]
The disassembly isn't great, because the test robot doesn't try to find where the instructions start, but before that

>    4: 48 8b 57 18    mov 0x18(%rdi),%rdx

instruction we also had a

    mov (%rdi),%rax

and it looks like this is the very top of 'filemap_release_folio()', so '%rdi' contains the folio pointer coming into this. End result:

On Sun, Dec 11, 2022 at 6:27 AM kernel test robot wrote:
>    4: 48 8b 57 18    mov 0x18(%rdi),%rdx
>    8: 83 e0 01       and $0x1,%eax
>    b: 74 59          je 0x66

The

    and $0x1,%eax
    je 0x66

above is the test for

    BUG_ON(!folio_test_locked(folio));

where it's jumping out to the 'ud2' in case the lock bit (bit #0) isn't set. Then we have this:

>    d: 48 f7 07 00 60 00 00    testq $0x6000,(%rdi)
>   14: 74 22                   je 0x38

Which is testing PG_private | PG_private2, and jumping out (which we also don't do) if neither is set. And then we have:

>   16: 48 8b 07    mov (%rdi),%rax
>   19: f6 c4 80    test $0x80,%ah
>   1c: 75 32       jne 0x50

Which is checking for PG_writeback. So then we get to

    if (mapping && mapping->a_ops->release_folio)
        return mapping->a_ops->release_folio(folio, gfp);

which is this:

>   1e: 48 85 d2    test %rdx,%rdx
>   21: 74 34       je 0x57

This %rdx value is the early load from the top of the function; it's checking 'mapping' for NULL. It's not NULL, but it's some odd value according to the oops report:

    RDX: 889f03987f71

which doesn't look like it's valid (well, it's a valid kernel pointer, but it's not aligned like a 'mapping' pointer should be). So now when we're going to load 'a_ops' from there, we load another garbage value:

>   23: 48 8b 82 90 00 00 00    mov 0x90(%rdx),%rax

and we now have

    RAX: b000

and then the 'a_ops->release_folio' access will trap:

>   2a:* 48 8b 40 48    mov 0x48(%rax),%rax    <-- trapping instruction
>   2e: 48 85 c0       test %rax,%rax
>   31: 74 24          je 0x57

The above is the "load a_ops->release_folio and test it for NULL", but the load took a page fault because RAX was garbage. But RAX was garbage because we already had a bogus "mapping" pointer earlier.
Now, why 'mapping' was bogus, I don't know. Maybe that page wasn't a page cache page at all? The mapping field is in a union and can contain other things.

So I have no explanation for the oops, but I thought I'd just post the decoding of the instruction stream in case that helps somebody else to figure it out.

Linus
Re: [Linux-cachefs] [PATCH v4 0/3] mm, netfs, fscache: Stop read optimisation when folio removed from pagecache
On Wed, Nov 23, 2022 at 2:48 PM David Howells wrote:
>
> I've also got rid of the bit clearances from the network filesystem
> evict_inode functions as they don't seem to be necessary.

Well, the patches look superficially cleaner to me, at least.

That "don't seem to be necessary" makes me a bit worried, and I'd have liked to see a more clear-cut "clearing it isn't necessary because X", but I _assume_ it's not necessary simply because the 'struct address_space' is released and never re-used.

But making the lifetime of that bit explicit might just be a good idea.

Linus
Re: [Linux-cachefs] [PATCH v3] mm, netfs, fscache: Stop read optimisation when folio removed from pagecache
On Wed, Nov 23, 2022 at 12:03 PM David Howells wrote:
>
> Linus Torvalds wrote:
> > But I also think it's strange in another way, with that odd placement of
> >
> >         mapping_clear_release_always(inode->i_mapping);
> >
> > at inode eviction time. That just feels very random.
>
> I was under the impression that a warning got splashed if unexpected
> address_space flags were set when ->evict_inode() returned. I may be thinking
> of page flags. If it doesn't, fine, this isn't required.

I don't know if the warning happens or not, but the thing I reacted to was just how *random* this was. There was no logic to it, nor any explanation.

I *suspect* that if we add this kind of new generic address space flag, then that flag should just be cleared by generic code when the address space is released.

But I'm not saying it has to be done that way - I'm just saying that however it is done, please don't make it this random mess with no explanation.

The *setting* of the flag was at least fairly obvious. I didn't find code like this odd:

+       if (v9inode->netfs.cache)
+               mapping_set_release_always(inode->i_mapping);

and it makes all kinds of sense (ie I can read it as a "if I use netfs caching for this inode, then I want to be informed when a folio is released from this mapping").

It's just the clearing that looked very random to me. Maybe just a comment would have helped, but I get the feeling that it might as well just be cleared in "clear_inode()" or something like that.

> > That code makes no sense what-so-ever. Why isn't it using
> > "folio_has_private()"?
>
> It should be, yes.
>
> > Why is this done as an open-coded - and *badly* so - version of
> > !folio_needs_release() that you for some reason made private to mm/vmscan.c?
>
> Yeah, in retrospect, I should have put that in mm/internal.h.
So if folio_needs_release() is in mm/internal.h, and then mm/filemap.c uses it in filemap_release_folio() instead of the odd open-coding, I think that would clear up my worries about both mm/filemap.c and mm/vmscan.c.

Linus
Re: [Linux-cachefs] [PATCH v3] mm, netfs, fscache: Stop read optimisation when folio removed from pagecache
On Wed, Nov 23, 2022 at 5:02 AM David Howells wrote:
>
> Is the attached patch too heavy to be applied this late in the merge cycle?
> Or would you prefer it to wait for the merge window?

This patch is much too much for this point in the release.

But I also think it's strange in another way, with that odd placement of

        mapping_clear_release_always(inode->i_mapping);

at inode eviction time. That just feels very random.

Similarly, that change to shrink_folio_list() looks strange, with the nasty folio_needs_release() helper. It seems entirely pointless, with the use then being

        if (folio_needs_release(folio)) {
                if (!filemap_release_folio(folio, sc->gfp_mask))
                        goto activate_locked;

when everybody else is just using filemap_release_folio() and checking its return value.

I like how you changed other cases of

        if (folio_has_private(folio) && !filemap_release_folio(folio, 0))
                return 0;

to just use "!filemap_release_folio()" directly, and that felt like a cleanup, but the shrink_folio_list() changes look like a step backwards.

And the change to mm/filemap.c is completely unacceptable in all forms, and this added test

+       if ((!mapping || !mapping_release_always(mapping)) &&
+           !folio_test_private(folio) &&
+           !folio_test_private_2(folio))
+               return true;

will not be accepted even during the merge window. That code makes no sense what-so-ever, and is in no way acceptable.

Why isn't it using "folio_has_private()"? Why is it using its own illegible version of that that doesn't match any other case? Why is this done as an open-coded - and *badly* so - version of !folio_needs_release() that you for some reason made private to mm/vmscan.c?

So no, this patch is too ugly to apply as-is *ever*, much less during the late rc series.

Linus
Re: [Linux-cachefs] [RFC][PATCH 0/3] netfs, afs: Cleanups
On Fri, Jun 10, 2022 at 12:56 PM David Howells wrote:
>
> Here are some cleanups, one for afs and a couple for netfs:

Pulled,

Linus
Re: [Linux-cachefs] Out of order read() completion and buffer filling beyond returned amount
On Mon, Jan 17, 2022 at 11:57 AM David Howells wrote:
>
> Do you have an opinion on whether it's permissible for a filesystem to write
> into the read() buffer beyond the amount it claims to return, though still
> within the specified size of the buffer?

I'm pretty sure that would seriously violate POSIX in the general case, and maybe even break some programs that do fancy buffer management (ie I could imagine some circular buffer thing that expects any "unwritten" ('unread'?) parts to stay with the old contents).

That said, that's for generic 'read()' cases for things like ttys or pipes etc that can return partial reads in the first place.

If it's a regular file, then any partial read *already* violates POSIX, and nobody sane would do any such buffer management because it's supposed to be a 'can't happen' thing. And since you mention DIO, that's doubly true, and is already outside basic POSIX, and has already violated things like "all or nothing" rules for visibility of writes-vs-reads (which admittedly most Linux filesystems have violated even outside of DIO, since the strictest reading of the rules is incredibly nasty anyway). But filesystems like XFS which took some of the strict rules more seriously already ignored them for DIO, afaik.

So I suspect you're fine. Buffered reads might care more, but even there the whole "you can't really validly have partial reads anyway" thing is a bigger violation to begin with.

With DIO, I suspect nobody cares about _those_ kinds of semantic details. People who use DIO tend to care primarily about performance - it's why they use it, after all - and are probably more than happy to be lax about other rules.

But maybe somebody would prefer to have a mount option to specify just how out-of-spec things can be (ie like the traditional old nfs 'intr' thing). If only for testing, and for 'in case some odd app breaks'.

Linus
Re: [Linux-cachefs] [PATCH v3 56/68] afs: Handle len being extending over page end in write_begin/write_end
On Thu, Dec 16, 2021 at 11:28 AM Matthew Wilcox wrote:
>
> Since ->write_begin is the place where we actually create folios, it
> needs to know what size folio to create. Unless you'd rather we do
> something to actually create the folio before calling ->write_begin?

I don't think we can create a folio before that, because the filesystem may not even want a folio (think persistent memory or whatever).

Honestly, I think you need to describe more what you actually want to happen. Because generic_perform_write() has already decided to use a PAGE_SIZE chunk by the time write_begin() is called. Right now the world order is "we chunk things by PAGE_SIZE", and that's just how it is.

I can see other options - like the filesystem passing in the chunk size when it calls generic_perform_write(). Or we make the rule be that ->write_begin() simply always is given the whole area, and the filesystem can decide how it wants to chunk things up, and return the size of the write chunk in the status (rather than the current "success or error").

But at no point will this *EVER* be a "afs will limit the size to the folio size" issue. Nothing like that will ever make sense. Allowing bigger chunks will not be about any fscache issues, it will be about every single filesystem that uses generic_perform_write().

So I will NAK these patches by David, because they are fundamentally wrong, whichever way we turn. Any "write in bigger chunks" patch will be something else entirely.

Linus
Re: [Linux-cachefs] [PATCH v3 56/68] afs: Handle len being extending over page end in write_begin/write_end
On Thu, Dec 16, 2021 at 8:22 AM David Howells wrote:
>
> With transparent huge pages, in the future, write_begin() and write_end()
> may be passed a length parameter that, in combination with the offset into
> the page, exceeds the length of that page. This allows
> grab_cache_page_write_begin() to better choose the size of THP to allocate.

I still think this is a fundamental bug in the caller.

That "explanation" is weak, and the whole concept smells like week-old fish to me.

Linus
Re: [Linux-cachefs] [PATCH v3 57/68] afs: Fix afs_write_end() to handle len > page size
On Thu, Dec 16, 2021 at 8:22 AM David Howells wrote:
>
> It is possible for the len argument to afs_write_end() to overrun the end
> of the page (len is used to key the size of the page in afs_write_start()
> when compound pages become a regular thing).

This smells like a bug in the caller. It's just insane to call "write_end()" with a range that doesn't actually fit in the page provided.

Exactly how does that happen, and why should AFS deal with it, not whoever called write_end()?

Linus
Re: [Linux-cachefs] [PATCH v2 07/67] fscache: Implement a hash function
On Fri, Dec 10, 2021 at 6:41 AM David Howells wrote:
>
> However, since the comparator functions are only used to see if they're the
> same or different, the attached change makes them return bool instead, no
> cast or subtraction required.

Ok, thanks, that resolves my worries.

Which is not to say it all works - I obviously only scanned the patches rather than testing anything.

Linus
Re: [Linux-cachefs] [PATCH v2 07/67] fscache: Implement a hash function
On Thu, Dec 9, 2021 at 1:57 PM David Howells wrote:
>
> What I'm trying to get at is that the hash needs to be consistent, no matter
> the endianness of the cpu, for any particular input blob.

Yeah, if that's the case, then you should probably make that "unsigned int *data" argument just be "void *" and then:

>                 a = *data++;    <<<
>                 HASH_MIX(x, y, a);
>         }
>         return fold_hash(x, y);
> }
>
> The marked line should probably use something like le/be32_to_cpu().

Yes, it should be using a '__le32 *' inside that function and you should use le32_to_cpu(). Obviously, BE would work too, but would cause unnecessary work on common hardware.

But as mentioned for the other patches, you should then also be a lot more careful about always using the end result as an 'unsigned int' (or maybe 'u32') too, and when comparing hashes for binary search or other, you should always do the compare in some stable format.

Because doing

        return (long)hash_a - (long)hash_b;

and looking at the sign doesn't actually result in a stable ordering on 32-bit architectures. You don't get a transitive ordering (ie a < b and b < c doesn't imply a < c). And presumably if the hashes are meaningful across machines, then hash comparisons should also be meaningful across machines.

So when comparing hashes, you need to compare them either in a truly bigger signed type (and make sure that doesn't get truncated) - kind of like how a lot of 'memcmp()' functions do 'unsigned char' subtractions in an 'int' - or you need to compare them _as_ 'unsigned int'. Otherwise the comparisons will be all kinds of messed up.

Linus
Re: [Linux-cachefs] [PATCH v2 10/67] fscache: Implement cookie registration
On Thu, Dec 9, 2021 at 8:55 AM David Howells wrote:
>
> +               buf = (u32 *)cookie->inline_key;
> +       }
> +
> +       memcpy(buf, index_key, index_key_len);
> +       cookie->key_hash = fscache_hash(cookie->volume->key_hash, buf, bufs);

This is actively wrong given the noise about "endianness independence" of the fscache_hash() function.

There is absolutely nothing endianness-independent in the above. You're taking some random data, casting the pointer to a native word-order 32-bit entity, and then doing things in that native word order. The same data will give different results on different endiannesses.

Maybe some other code has always munged stuff so that it's in some "native word format", but if so, the type system should have been made to match. And it's not. It explicitly casts what is clearly some other pointer type to "u32 *". There is no way in hell this is properly endianness-independent with each word in an array having some actual endianness-independent value when you write code like this.

I'd suggest making endianness either explicit (make things explicitly "__le32" or whatever) and making sure that you don't just randomly cast pointers, you actually have the proper types.

Or, alternatively, just say "nobody cares about BE any more, endianness isn't relevant, get over it". But don't have functions that claim to be endianness-independent and then randomly access data like this.

Linus
Re: [Linux-cachefs] [PATCH v2 07/67] fscache: Implement a hash function
On Thu, Dec 9, 2021 at 8:54 AM David Howells wrote:
>
> Implement a function to generate hashes. It needs to be stable over time
> and endianness-independent as the hashes will appear on disk in future
> patches.

I'm not actually seeing this being endianness-independent.

Is the input just regular 32-bit data in native word order? Because then it's not endianness-independent; it's purely that there *is* no endianness to the data at all and it is purely native data.

So the code may be correct, but the explanation is confusing. There is absolutely nothing here that is about endianness.

Linus
Re: [Linux-cachefs] [PATCH 00/64] fscache, cachefiles: Rewrite
On Mon, Nov 29, 2021 at 6:22 AM David Howells wrote:
>
> The patchset is structured such that the first few patches disable fscache
> use by the network filesystems using it, remove the cachefiles driver
> entirely and as much of the fscache driver as can be got away with without
> causing build failures in the network filesystems. The patches after that
> recreate fscache and then cachefiles, attempting to add the pieces in a
> logical order. Finally, the filesystems are reenabled and then the very
> last patch changes the documentation.

Thanks, this all looks conceptually sane to me.

But I only really scanned the commit messages, not the actual new code. That obviously needs all the usual testing and feedback from the users of this all.

Linus
Re: [Linux-cachefs] [PATCH v4 00/10] fscache: Replace and remove old I/O API
On Fri, Oct 29, 2021 at 10:55 AM David Howells wrote:
>
> This means bisection is of limited value and why I'm looking at a 'flag day'.

So I'm kind of resigned to that by now. I just wanted to again clarify that the rest of my comments are about "if we have to deal with a flag day anyway, then make it as simple and straightforward as possible, rather than adding extra steps that are only noise".

> [ Snip explanation of netfslib ]

> This particular patchset is intended to enable removal of the old I/O routines
> by changing nfs and cifs to use a "fallback" method to use the new kiocb-using
> API and thus allow me to get on with the rest of it.

Ok, at least that explains that part. But:

> However, if you would rather I just removed all of fscache and (most of[*])
> cachefiles, that I can do.

I assume and think that if you just do that part first, then the "convert to netfslib" of afs and ceph at that later stage will mean that the fallback code will never be needed?

So I would much prefer that streamlined model over one that adds that temporary intermediate stage only for it to be deleted.

Linus
Re: [Linux-cachefs] [PATCH v4 00/10] fscache: Replace and remove old I/O API
On Fri, Oct 29, 2021 at 7:09 AM David Howells wrote:
>
> (1) A simple fallback API is added that can read or write a single page
>     synchronously. The functions for this have "fallback" in their names
>     as they have to be removed at some point.

David, I still don't understand WHY. I read the explanations in the commits, and that didn't help either.

Why, oh why, do you insist on adding this intermediate interface that is already documented to "must be removed" at the point it is even added? What's the point of adding garbage that is useless in the long run?

Why is the first step not just "remove fscache"? Why is there this addition of the "deprecated" interface - that you have now renamed "fallback"? I agree that "fallback" is a less annoying name, so that renaming is an improvement, but WHY?

I absolutely detested your whole "move old garbage around before removal", and I also detest this "add new garbage that will be removed". What's the point?

Why isn't the fix just "remove CONFIG_FSCACHE and all the code"? You already *HAVE* the "fallback" code - it's all that

  #else /* CONFIG_NFS_FSCACHE */
  static inline int nfs_fscache_register(void) { return 0; }
  static inline void nfs_fscache_unregister(void) {}
  ...

stuff in and friends.

So why do you need _new_ fallback code, when CONFIG_FSCACHE already exists and already has a "this disables fscache"?

Maybe there is some really good reason, but that really good reason sure as hell isn't documented anywhere, and I really don't see the point.

So let me say this again:

 - it would be much better if you could incrementally just improve the existing FSCACHE so that it just _works_ all the time, and fixes the problems in it, and a bisection works, and there is no flag-day.

 - but dammit, if you have to have a flag-day, then there is NO POINT in all this "move the old code around before moving it", or "add a fallback interface before removing it again".
Oh, I can understand wanting to keep the header files around in case the interfaces end up being similar enough in the end that it all matters. But I don't understand why you do this kind of crud:

  fs/cachefiles/io.c |  28 -
  fs/fscache/io.c    | 137 +++--

when neither of those directories will ever even be *compiled* if CONFIG_FSCACHE isn't true (because CACHEFILES has a "depends on FSCACHE").

See my argument? If FSCACHE isn't usable during the transition, then don't make these kinds of pointless code movement or creation things that are just dead.

There's absolutely no point in having some "fallback" interface that allows people to test some configuration that simply isn't relevant. It doesn't help anything. It just adds more noise and more configurations, and no actual value that I can see.

It doesn't help bisectability: if some bug ever bisects to the fallback code, what does that tell us? Nothing. Sure, it trivially tells us the fallback code was buggy, but since the fallback code has been removed afterwards, the _value_ of that information is nil, zilch, nada. It's not "information", it's just "worthless data".

And hey, maybe there's some issue that I don't understand, and I don't see. But if there is some subtle value here, it should have been documented.

So I say exactly the same thing I said last time: if the old fscache code is not usable, and you can't incrementally fix it so that it works all the time, then JUST REMOVE IT ALL.

Moving it elsewhere before the removal is only pointless noise. But adding some fallback intermediate code before removal is ALSO just pointless noise.

Doing a flag-day with "switch from A to B" is already painful and wrong. I don't like it. But I like it even _less_ if it's a "switch from A to B to C".
If you do want to have a "halfway state", the only halfway state that makes sense to me is something like

 (a) make all the changes to the old FSCACHE - keeping it all _working_ during this phase - to make it have the same _interfaces_ as the new fscache will have.

 (b) then remove the old FSCACHE entirely

 (c) then plop in the new FSCACHE

But note how there was no "fallback" stage anywhere. No code that lies around dead at any point. At each point it was either all working old or all working new (or nothing at all).

Yes, in this case that "step (a)" is extra work and you're basically modifying code that you know will be removed, but the advantage now is

 - at least the fscache _users_ are being modified while the old and tested world is still working, and the interface change is "bisectable" in that sense. That's useful in itself.

 - if it turns out that people have problems with the new generation FSCACHE, they can reverse steps (b) and (c) without having to touch and revert all the other filesystems changes.

IOW, if a "same interfaces" state exists, that's fine. But for it to make sense, those same interfaces have to be actually _useful_, not some fallback
Re: [Linux-cachefs] [PATCH v2 00/53] fscache: Rewrite index API and management system
On Fri, Oct 22, 2021 at 9:58 AM Linus Torvalds wrote:
>
> and if (c) is the thing that all the network filesystem people want,
> then what the heck is the point in keeping dead code around? At that
> point, all the rename crap is just extra work, extra noise, and only a
> distraction. There's no upside that I can see.

Again, I'm not a fan of (c) as an option, but if done, then the simplest model would appear to be:

 - remove the old fscache code, obviously disabling the Kconfig for it for each filesystem, all in one fell swoop.

 - add the new fscache code (possibly preferably in sane chunks that explain the parts).

 - then do a "convert to new world order and enable" commit individually for each filesystem

but as mentioned, there's no sane way to bisect things, or have a sane development history in this kind of situation.

Linus
Re: [Linux-cachefs] [PATCH v2 00/53] fscache: Rewrite index API and management system
On Fri, Oct 22, 2021 at 9:40 AM David Howells wrote:
>
> What's the best way to do this? Is it fine to disable caching in all the
> network filesystems and then directly remove the fscache and cachefiles
> drivers and replace them?

So the basic issue with this whole "total rewrite" is that there's no way to bisect anything. And there's no way for people to say "I don't trust the rewrite, I want to keep using the old tested model".

Which makes this all painful and generally the wrong way to do anything like this, and there's fundamentally no "best way". The real best way would be if the conversion could be done truly incrementally. Flag-days simply aren't good for development, because even if the patch to enable the new code might be some trivial one-liner, that doesn't _help_ anything. The switch-over switches from one code-base to another, with no help from "this is where the problem started".

So in order of preference:

 (a) actual incremental changes where the code keeps working all the time, and no flag days

 (b) same interfaces, so at least you can do A/B testing and people can choose one or the other

 (c) total rewrite

and if (c) is the thing that all the network filesystem people want, then what the heck is the point in keeping dead code around? At that point, all the rename crap is just extra work, extra noise, and only a distraction. There's no upside that I can see.

Linus
Re: [Linux-cachefs] [PATCH v2 00/53] fscache: Rewrite index API and management system
On Fri, Oct 22, 2021 at 8:58 AM David Howells wrote:
>
> David Howells (52):
>       fscache_old: Move the old fscache driver to one side
>       fscache_old: Rename CONFIG_FSCACHE* to CONFIG_FSCACHE_OLD*
>       cachefiles_old: Move the old cachefiles driver to one side

Honestly, I don't see the point of this when it ends up just being dead code basically immediately.

You don't actually support picking one or the other at build time, just a hard switch-over. That makes the old fscache driver useless. You can't say "use the old one because I don't trust the new". You just have a legacy implementation with no users.

Linus
Re: [Linux-cachefs] [RFC PATCH 0/8] fscache: Replace and remove old I/O API
On Tue, Sep 14, 2021 at 9:21 AM Linus Torvalds wrote:
>
> Call it "fallback" or "simple" or something that shows the intent, but
> no, I'm not taking patches that introduce a _new_ interface and call
> it "deprecated".

Put another way: to call something "deprecated", you have to already have the replacement all ready to go. And if you have that, then converting existing code to a deprecated model isn't the way to go.

So in neither situation does it make any sense to convert anything to a model that is deprecated.

Linus
Re: [Linux-cachefs] [RFC PATCH 0/8] fscache: Replace and remove old I/O API
On Tue, Sep 14, 2021 at 6:54 AM David Howells wrote:
>
> (1) A simple fallback API is added that can read or write a single page
>     synchronously. The functions for this have "deprecated" in their names
>     as they have to be removed at some point.

I'm looking at those patches, and there's no way I'll apply anything that starts out with moving to a "deprecated" interface.

Call it "fallback" or "simple" or something that shows the intent, but no, I'm not taking patches that introduce a _new_ interface and call it "deprecated".

Linus
Re: [Linux-cachefs] Canvassing for network filesystem write size vs page size
On Thu, Aug 5, 2021 at 9:36 AM David Howells wrote:
>
> Some network filesystems, however, currently keep track of which byte ranges
> are modified within a dirty page (AFS does; NFS seems to also) and only write
> out the modified data.

NFS definitely does. I haven't used NFS in two decades, but I worked on some of the code (read: I made nfs use the page cache both for reading and writing) back in my Transmeta days, because NFSv2 was the default filesystem setup back then.

See fs/nfs/write.c, although I have to admit that I don't recognize that code any more.

It's fairly important to be able to do streaming writes without having to read the old contents for some loads. And read-modify-write cycles are death for performance, so you really want to coalesce writes until you have the whole page.

That said, I suspect it's also *very* filesystem-specific, to the point where it might not be worth trying to do in some generic manner.

In particular, NFS had things like interesting credential issues, so if you have multiple concurrent writers that used different 'struct file *' to write to the file, you can't just mix the writes. You have to sync the writes from one writer before you start the writes for the next one, because one might succeed and the other not.

So you can't just treat it as some random "page cache with dirty byte extents". You really have to be careful about credentials, timeouts, etc, and the pending writes have to keep a fair amount of state around. At least that was the case two decades ago.

[ goes off and looks. See "nfs_write_begin()" and friends in fs/nfs/file.c for some of the examples of these things, although it looks like the code is less aggressive about avoiding the read-modify-write case than I thought I remembered, and only does it for write-only opens ]

Linus
Re: [Linux-cachefs] [RFC][PATCH] mm: Split page_has_private() in two to better handle PG_private_2
On Thu, Apr 8, 2021 at 2:15 PM David Howells wrote:
>
> mm: Split page_has_private() in two to better handle PG_private_2

From a look through the patch and some (limited) thinking about it, I like the patch.

I think it clarifies the two very different cases, and makes it clear that one is about that page cleanup, and the other is about the magical reference counting. The two are separate issues, even if for PG_private both happen to be true.

So this seems sane to me. That said, I had a couple of reactions:

> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 04a34c08e0a6..04cb440ce06e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -832,14 +832,27 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
>
> #define PAGE_FLAGS_PRIVATE \
>        (1UL << PG_private | 1UL << PG_private_2)

I think this should be re-named to be PAGE_FLAGS_CLEANUP, because I don't think it makes any other sense to "combine" the two PG_private* bits any more. No?

> +static inline int page_private_count(struct page *page)
> +{
> +       return test_bit(PG_private, &page->flags) ? 1 : 0;
> +}

Why is this open-coding the bit test, rather than just doing

        return PagePrivate(page) ? 1 : 0;

instead? In fact, since test_bit() _should_ return a 'bool', I think even just

        return PagePrivate(page);

should work and give the same result, but I could imagine that some architecture version of "test_bit()" might return some other non-zero value (although honestly, I think that should be fixed if so).

Linus
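[ Editor's note: a self-contained sketch of the equivalence being pointed out above. The flag machinery here is a userspace stand-in, not the kernel's real definitions; only the shape of the two `page_private_count()` spellings is taken from the mail. ]

```c
#include <assert.h>

/* Stand-ins for the kernel's page-flag machinery, just to illustrate
 * the point - these are not the real definitions. */
enum { PG_private = 11 };

struct page { unsigned long flags; };

static int test_bit(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1;
}

static int PagePrivate(struct page *page)
{
	return test_bit(PG_private, &page->flags);
}

/* The open-coded version from the patch... */
static int page_private_count_open_coded(struct page *page)
{
	return test_bit(PG_private, &page->flags) ? 1 : 0;
}

/* ...and the suggested spelling: as long as test_bit() returns 0 or 1,
 * the two give identical results. */
static int page_private_count(struct page *page)
{
	return PagePrivate(page) ? 1 : 0;
}
```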
Re: [Linux-cachefs] [PATCH v4 02/28] mm: Add an unlock function for PG_private_2/PG_fscache
[ Adding btrfs people explicitly, maybe they see this on the fs-devel list, but maybe they don't react .. ]

On Tue, Mar 16, 2021 at 12:07 PM Matthew Wilcox wrote:
>
> This isn't a problem with this patch per se, but I'm concerned about
> private2 and expected page refcounts.

Ugh. You are very right.

It would be good to just change the rules - I get the feeling nobody actually depended on them anyway because they were _so_ esoteric.

> static inline int is_page_cache_freeable(struct page *page)
> {
>        /*
>         * A freeable page cache page is referenced only by the caller
>         * that isolated the page, the page cache and optional buffer
>         * heads at page->private.
>         */
>        int page_cache_pins = thp_nr_pages(page);
>        return page_count(page) - page_has_private(page) == 1 + page_cache_pins;

You're right, that "page_has_private()" is really really nasty.

The comment is, I think, the traditional usage case, which used to be about page->buffers. Obviously these days it is now about page->private with PG_private set, pointing to buffers (attach_page_private() and detach_page_private()).

But as you point out:

> #define PAGE_FLAGS_PRIVATE \
>        (1UL << PG_private | 1UL << PG_private_2)
>
> So ... a page with both flags cleared should have a refcount of N.
> A page with one or both flags set should have a refcount of N+1.

Could we just remove the PG_private_2 thing in this context entirely, and make the rule be that

 (a) PG_private means that you have some local private data in page->private, and that's all that matters for the "freeable" thing.

 (b) PG_private_2 does *not* have the same meaning, and has no bearing on freeability (and only the refcount matters)

I _think_ the btrfs behavior is to only use PagePrivate2() when it has a reference to the page, so btrfs doesn't care?

I think fscache is already happy to take the page count when using PG_private_2 for locking, exactly because I didn't want to have any confusion about lifetimes.
But this "page_has_private()" math ends up meaning it's confusing anyway.

btrfs people? What are the semantics for PG_private_2? Is it just a flag, and you really don't want it to have anything to do with any page lifetime decisions? Or?

Linus
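[ Editor's note: a toy model of rules (a) and (b) proposed above - PG_private contributes one expected reference (the private data at page->private), while PG_private_2 contributes nothing extra because its holder must already own a normal page reference. This is an illustrative userspace sketch, not the kernel's actual `is_page_cache_freeable()`. ]

```c
#include <assert.h>
#include <stdbool.h>

enum { PG_private = 11, PG_private_2 = 12 };

struct page {
	unsigned long flags;
	int _refcount;
};

static bool flag_set(const struct page *p, int nr)
{
	return (p->flags >> nr) & 1;
}

/* Replacement for the page_has_private() arithmetic: only PG_private
 * raises the expected count (rule (a)); a PG_private_2 holder shows up
 * directly in _refcount instead (rule (b)). */
static bool is_page_cache_freeable(const struct page *page)
{
	int expected = 1 /* isolating caller */ + 1 /* page cache */;

	if (flag_set(page, PG_private))
		expected++;
	return page->_refcount == expected;
}
```

With these rules the VM needs no special case for PG_private_2 at all: a cache that is still writing the page simply holds a reference, and the refcount arithmetic does the rest.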
Re: [Linux-cachefs] [GIT PULL] fscache: I/O API modernisation and netfs helper library
On Sun, Feb 14, 2021 at 4:23 PM David Howells wrote:
>
> Anyway, I have posted my fscache modernisation patches multiple times for
> public review, I have tried to involve the wider community in aspects of the
> development on public mailing lists and I have been including the maintainers
> in to/cc.

So then add those links and the cc's to the commit logs, so that I can *see* them.

I'm done with this discussion. If I see a pull request from you, I DO NOT WANT TO HAVE TO HAVE A WEEK-LONG EMAIL THREAD ABOUT HOW I CANNOT SEE THAT IT HAS EVER SEEN ANY REVIEW.

So if all I see is "Signed-off-by:" from you, I will promptly throw that pull request into the garbage, because it's just not worth my time to try to get you, kicking and screaming, to show that others have been involved.

Can you not understand that?

When I get that pull request, I need to see that yes, this has been reviewed, people have been involved, and yes, it's been in linux-next.

I want to see "reviewed-by" and "tested-by", I want to see "cc", and I want to see links to submission threads with discussion showing that others actually were involved.

I do *not* want to see just a single signed-off-by line from you, and then have to ask "has anybody else actually seen this and reviewed it".

Look, here's an entirely unrelated example from a single fairly recent trivial one-liner memory leak fix:

    Fixes: 87c715dcde63 ("scsi: scsi_debug: Add per_host_store option")
    Link: https://lore.kernel.org/r/20210208111734.34034-1-mlomb...@redhat.com
    Acked-by: Douglas Gilbert
    Signed-off-by: Maurizio Lombardi
    Signed-off-by: Martin K. Petersen

That's from quite a trivial commit. Yes, it's trivial, but it could still be wrong, of course. And if somebody ever reports that it causes problems despite how simple it was, look at what I have: I have three people to contact, and I have a pointer to the actual original submission of the patch.

Do we have that for all our commits? No.
But it's also not at all unusual any more, and in fact many commits have even more, with testing etc.

And yes, sometimes the test results and acks come back later after you've already pushed the changes out etc, and no, it's generally not worth rebasing for that - maybe others have now started to rely on whatever public branch you have. Which is why the "Link:" is useful, so that if things come in later, the discussion can still be found.

But quite often, you shouldn't have pushed out some final branch before you've gotten at least *some* positive response from people, so I do kind of expect some "Acked-by" etc in the commit itself.

THAT is what you need to aim for.

And yes, I'm picking on you. Because we've had this problem before. I've complained when you've sent me pull requests that don't even build, that you in fact had been told by linux-next didn't build, and you still sent them to me.

And as a result, I've asked for more involvement from other people before. So now I'm clarifying that requirement - I absolutely need to see that it has actually seen testing, that it has seen other people being involved, and that it isn't just you throwing spaghetti at the wall to see what sticks.

And I'm not going to do that for every pull request. I want to see that data *in* the pull request itself.

Linus
Re: [Linux-cachefs] [GIT PULL] fscache: I/O API modernisation and netfs helper library
On Thu, Feb 11, 2021 at 3:21 PM David Howells wrote:
>
> Most of the development discussion took place on IRC and waving snippets of
> code about in pastebin rather than email - the latency of email is just too
> high. There's not a great deal I can do about that now as I haven't kept IRC
> logs. I can do that in future if you want.

No, I really don't.

IRC is fine for discussing ideas about how to solve things. But no, it's not a replacement for actual code review after the fact.

If you think email has too long latency for review, and can't use public mailing lists and cc the people who are maintainers, then I simply don't want your patches. You need to fix your development model.

This whole "I need to get feedback from whoever still uses irc and is active RIGHT NOW" is not a valid model. It's fine for brainstorming for possible approaches, and getting ideas, sure.

Linus
Re: [Linux-cachefs] [GIT PULL] fscache: I/O API modernisation and netfs helper library
On Wed, Feb 10, 2021 at 8:33 AM David Howells wrote:
>
> Then I could follow it up with this patch here, moving towards dropping the
> PG_fscache alias for the new API.

So I don't mind the alias per se, but I did mind the odd mixing of names for the same thing.

So I think your change to make it be named "wait_on_page_private_2()" fixed that mixing, but I also think that it's probably then a good idea to have aliases in place for filesystems that actually include the fscache.h header.

Put another way: I think that it would be even better to simply just have a function like

    static inline void wait_on_page_fscache(struct page *page)
    {
        if (PagePrivate2(page))
            wait_on_page_bit(page, PG_private_2);
    }

and make that be *not* in <linux/pagemap.h>, but simply be in <linux/fscache.h> under that big comment about how PG_private_2 is used for the fscache bit.

You already have that comment, putting the above kind of helper function right there would very much explain why a "wait for fscache bit" function then uses the PagePrivate2 function to test the bit.

Agreed?

Alternatively, since that header file already has

    #define PageFsCache(page) PagePrivate2((page))

you could also just write the above as

    static inline void wait_on_page_fscache(struct page *page)
    {
        if (PageFsCache(page))
            wait_on_page_bit(page, PG_fscache);
    }

and now it is even more obvious. And there's no odd mixing of "fscache" and "private_2", it's all consistent.

IOW, I'm not against "wait_on_page_fscache()" as a function, but I *am* against the odd _mixing_ of things without a big explanation, where the code itself looks very odd and questionable.

And I think the "fscache" waiting functions should not be visible to any core VM or filesystem code - it should be limited explicitly to those filesystems that use fscache, and include that header file.

Wouldn't that make sense?
Also, honestly, I really *REALLY* want your commit messages to talk about who has been cc'd, who has been part of development, and point to the PUBLIC MAILING LISTS WHERE THAT DISCUSSION WAS TAKING PLACE, so that I can actually see that "yes, other people were involved".

No, I don't require this in general, but exactly because of the history we have, I really really want to see that.

I want to see a

    Link: https://lore.kernel.org/r/

and the Cc's - or better yet, the Reviewed-by's etc - so that when I get a pull request, it really is very obvious to me when I look at it that others really have been involved.

So if I continue to see just

    Signed-off-by: David Howells

at the end of the commit messages, I will not pull.

Yes, in this thread a couple of people have piped up and said that they were part of the discussion and that they are interested, but if I have to start asking around just to see that, then it's too little, too late.

No more of this "it looks like David Howells did things in private". I want links I can follow to see the discussion, and I really want to see that others really have been involved.

Ok?

Linus
Re: [Linux-cachefs] [GIT PULL] fscache: I/O API modernisation and netfs helper library
On Tue, Feb 9, 2021 at 12:21 PM Matthew Wilcox wrote:
>
> Yeah, I have trouble with the private2 vs fscache bit too. I've been
> trying to persuade David that he doesn't actually need an fscache
> bit at all; he can just increment the page's refcount to prevent it
> from being freed while he writes data to the cache.

Does the code not hold a refcount already?

Honestly, the fact that writeback doesn't take a refcount, and then has magic "if writeback is set, don't free" code in other parts of the VM layer has been a problem already, when the wakeup ended up "leaking" from a previous page to a new allocation.

I very much hope the fscache bit does not make similar mistakes, because the rest of the VM will _not_ have special "if fscache is set, then we won't do X" the way we do for writeback.

So I think the fscache code needs to hold a refcount regardless, and that if the fscache bit is set the page has to have a reference.

So what are the current lifetime rules for the fscache bit?

Linus
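[ Editor's note: a toy model of the lifetime rule being asked for - whoever sets the fscache bit must also hold a page reference, so the rest of the VM never needs a special "if fscache is set, don't free" case. The helper names (`set_page_fscache()`, `end_page_fscache()`) are hypothetical; this is a userspace sketch, not kernel code. ]

```c
#include <assert.h>

enum { PG_private_2 = 12 };

struct page {
	unsigned long flags;
	int _refcount;
};

static void get_page(struct page *p) { p->_refcount++; }
static void put_page(struct page *p) { p->_refcount--; }

/* Mark the page as being written to the cache: pin it first, then set
 * the bit, so the bit is never observed without a backing reference. */
static void set_page_fscache(struct page *p)
{
	get_page(p);
	p->flags |= 1ul << PG_private_2;
}

/* Finished with the cache write: clear the bit, then drop the pin.
 * (The real code would also wake anyone waiting on the bit.) */
static void end_page_fscache(struct page *p)
{
	p->flags &= ~(1ul << PG_private_2);
	put_page(p);
}
```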
Re: [Linux-cachefs] [GIT PULL] fscache: I/O API modernisation and netfs helper library
So I'm looking at this early, because I have more time now than I will have during the merge window, and honestly, your pull requests have been problematic in the past.

The PG_fscache bit waiting functions are completely crazy. The comment about "this will wake up others" is actively wrong, and the waiting function looks insane, because you're mixing the two names for "fscache" which makes the code look totally incomprehensible. Why would we wait for PG_fscache, when PG_private_2 was set? Yes, I know why, but the code looks entirely nonsensical.

So just looking at the support infrastructure changes, I get a big "Hmm".

But the thing that makes me go "No, I won't pull this", is that it has all the same hallmark signs of trouble that I've complained about before: I see absolutely zero sign of "this has more developers involved".

There's not a single ack from a VM person for the VM changes. There's no sign that this isn't yet another "David Howells went off alone and did something that absolutely nobody else cared about".

See my problem? I need to be convinced that this makes sense outside of your world, and it's not yet another thing that will cause problems down the line because nobody else really ever used it or cared about it until we hit a snag.

Linus
Re: [Linux-cachefs] [GIT PULL] fscache and cachefiles fixes
On Wed, Jul 25, 2018 at 11:18 AM David Howells wrote:
>
> Linus Torvalds wrote:
>
> > Why do they have commit times literally *minutes* before your email to
> > ask me to pull?
>
> I split the first patch into two because it had two more-or-less independent
> changes and it made it easier to describe them separately.

This didn't explain the background before that happened.

I really want the warm and fuzzies about things, and when I see something being changed just minutes before, something like "I did a rebase because XYZ, but this was around in linux-next for a week before that without problems" tells me not only why the times are so recent, but also tells me that I really shouldn't worry because so-and-so.

Linus
Re: [Linux-cachefs] [GIT PULL] fscache and cachefiles fixes
On Wed, Jul 25, 2018 at 7:02 AM David Howells wrote:
>
> Can you pull these fixes for fscache and cachefiles please?

I've pulled them, but I'm not happy about it. They are all *very* new. Why do they have commit times literally *minutes* before your email to ask me to pull? What kind of testing have these gotten?

Seriously, if I get a pull request with commits this new, I want to have an *explanation*. The explanation could be "I use quilt, and this underwent lots of testing outside of git", but the explanation should be there!

I took them because they look simple enough, but that still leaves me grumpy.

Linus
Re: [Linux-cachefs] [PATCH net-next 00/12] fscache: Fixes, traces and development
On Fri, Apr 6, 2018 at 11:21 AM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>
> No, but if you can redo the pull request part so that the diffstat I
> get will match the diffstat I see in the pull request, that would be
> good.

Oh, and can you please make sure there is a "[GIT PULL]" in the subject line for your pull requests?

Particularly during the merge window (not so much later) I end up having a separate filtered list of emails that I look at that mention "git pull". Your pull requests don't seem to match that, so your pull requests don't actually even show up in my list of pending pull requests.

That doesn't mean that they get lost - it just tends to mean that I don't necessarily see them during my busiest days of pulling. I usually start my days off looking at emails in general, so I see discussions and I see your emails, but then when I start actually pulling I go to that list of pending pull requests, and never actually pull your stuff.

That all changes the second week of the merge window when I generally don't have a separate list of pull requests at all (because I've gotten through the big pile of pending stuff), so the pull *does* eventually happen, but it kind of gets delayed.

For example, yesterday when I was doing filesystem pulls (I try to often group "similar areas" together), I did nfsd and f2fs. I would have done your afs and fscache pulls while at it, but they didn't show up in my pull request list, so I didn't..

So if you have that [GIT PULL] in the subject line, the pulls will often be a bit timelier. (Again, this is mainly just an issue for the merge window, but it certainly doesn't hurt at other times).

Linus
Re: [Linux-cachefs] [PATCH 06/12] fscache: Detect multiple relinquishment of a cookie
On Wed, Apr 4, 2018 at 3:07 PM, David Howells wrote:
> Report if an fscache cookie is relinquished multiple times by the netfs.
>
> -       set_bit(FSCACHE_COOKIE_RELINQUISHED, &cookie->flags);
> +       if (test_and_set_bit(FSCACHE_COOKIE_RELINQUISHED, &cookie->flags))
> +               BUG();

Ugh. Please try to avoid adding BUG() calls for reporting. They can often cause the machine to be basically dead.

This one seems fairly ok, simply because it looks like the only caller doesn't really hold a lot of locks (the superblock s_umount lock, but that may be all). So I'll pull this change, I just wanted people to realize that if this is a "help make sure we notice if things go wrong", then "WARN_ON_ONCE()" or something is usually a better choice than "potentially kill the machine and make it harder to actually see the error".

Linus
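[ Editor's note: a userspace sketch of the alternative suggested above - report a double relinquishment once and bail out, instead of BUG() taking the machine down. The function and flag names mirror the patch, but `test_and_set_bit()` and `warn_on_once()` here are illustrative stand-ins, not the kernel's implementations. ]

```c
#include <stdbool.h>
#include <stdio.h>

enum { FSCACHE_COOKIE_RELINQUISHED = 4 };

struct fscache_cookie { unsigned long flags; };

/* test_and_set_bit(): set the bit, return its old value. */
static bool test_and_set_bit(int nr, unsigned long *addr)
{
	bool old = (*addr >> nr) & 1;

	*addr |= 1ul << nr;
	return old;
}

/* WARN_ON_ONCE() stand-in: complain the first time only, never die. */
static bool warn_on_once(bool cond, const char *msg)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

static int relinquish_count;	/* how many times the real work ran */

static void fscache_relinquish_cookie(struct fscache_cookie *cookie)
{
	if (warn_on_once(test_and_set_bit(FSCACHE_COOKIE_RELINQUISHED,
					  &cookie->flags),
			 "cookie relinquished twice"))
		return;		/* already relinquished: skip, don't crash */
	relinquish_count++;	/* ...normal relinquishment work... */
}
```

The double-relinquish still gets noticed in the logs, but the box stays up and the error is actually visible, which is the point of the mail.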