Kernel failure - computer freezes
Got this kernel failure today while making a btrfs snapshot.

# uname -r
4.2.0-1-amd64

Dez 16 20:33:14 aldebaran kernel: btrfs: page allocation failure: order:1, mode:0x204020
Dez 16 20:33:14 aldebaran kernel: CPU: 1 PID: 8016 Comm: btrfs Tainted: G U W 4.2.0-1-amd64 #1 Debian 4.2.6-3
Dez 16 20:33:14 aldebaran kernel: Hardware name: Hewlett-Packard HP ProBook 450 G2/2248, BIOS M74 Ver. 01.08 12/12/2014
Dez 16 20:33:14 aldebaran kernel: 0001 8154f6c3 00204020
Dez 16 20:33:14 aldebaran kernel: 8115727f 88014f5fbb00 0001
Dez 16 20:33:14 aldebaran kernel: 0001 88014f5fdfa8 0046
Dez 16 20:33:14 aldebaran kernel: Call Trace:
Dez 16 20:33:14 aldebaran kernel: [] ? dump_stack+0x40/0x50
Dez 16 20:33:14 aldebaran kernel: [] ? warn_alloc_failed+0xcf/0x130
Dez 16 20:33:14 aldebaran kernel: [] ? __alloc_pages_nodemask+0x2b4/0x9e0
Dez 16 20:33:14 aldebaran kernel: [] ? kmem_getpages+0x61/0x110
Dez 16 20:33:14 aldebaran kernel: [] ? fallback_alloc+0x143/0x1f0
Dez 16 20:33:14 aldebaran kernel: [] ? kmem_cache_alloc+0x1eb/0x430
Dez 16 20:33:14 aldebaran kernel: [] ? ida_pre_get+0x5c/0xd0
Dez 16 20:33:14 aldebaran kernel: [] ? get_anon_bdev+0x6d/0xe0
Dez 16 20:33:14 aldebaran kernel: [] ? btrfs_init_free_ino_ctl+0x61/0xa0 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? btrfs_init_fs_root+0x106/0x180 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? btrfs_read_fs_root+0x33/0x40 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? btrfs_get_fs_root.part.49+0x93/0x180 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? memcmp_extent_buffer+0xc1/0x120 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? btrfs_lookup_dentry+0x285/0x500 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? btrfs_lookup+0xe/0x30 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? lookup_real+0x19/0x60
Dez 16 20:33:14 aldebaran kernel: [] ? path_openat+0xa98/0x14c0
Dez 16 20:33:14 aldebaran kernel: [] ? do_set_pte+0x9e/0xd0
Dez 16 20:33:14 aldebaran kernel: [] ? filemap_map_pages+0x21e/0x230
Dez 16 20:33:14 aldebaran kernel: [] ? do_filp_open+0x75/0xd0
Dez 16 20:33:14 aldebaran kernel: [] ? __alloc_fd+0x3f/0x110
Dez 16 20:33:14 aldebaran kernel: [] ? do_sys_open+0x12a/0x200
Dez 16 20:33:14 aldebaran kernel: [] ? system_call_fast_compare_end+0xc/0x6b
Dez 16 20:33:14 aldebaran kernel: Mem-Info:
Dez 16 20:33:14 aldebaran kernel: active_anon:226184 inactive_anon:65166 isolated_anon:0
Dez 16 20:33:14 aldebaran kernel:  active_file:245088 inactive_file:246297 isolated_file:0
Dez 16 20:33:14 aldebaran kernel:  unevictable:192 dirty:1706 writeback:50 unstable:0
Dez 16 20:33:14 aldebaran kernel:  slab_reclaimable:32362 slab_unreclaimable:16012
Dez 16 20:33:14 aldebaran kernel:  mapped:96648 shmem:53953 pagetables:9075 bounce:0
Dez 16 20:33:14 aldebaran kernel:  free:7046 free_pcp:1265 free_cma:0
Dez 16 20:33:14 aldebaran kernel: Node 0 DMA free:13364kB min:32kB low:40kB high:48kB active_anon:252kB inactive_anon:284kB active_file:560kB inactive_file:660kB unevictable:0kB isolated(anon):0kB isolated(file)
Dez 16 20:33:14 aldebaran kernel: lowmem_reserve[]: 0 2125 3337 3337
Dez 16 20:33:14 aldebaran kernel: Node 0 DMA32 free:11232kB min:4624kB low:5780kB high:6936kB active_anon:612872kB inactive_anon:162128kB active_file:616016kB inactive_file:618100kB unevictable:516kB isolated(an
Dez 16 20:33:14 aldebaran kernel: lowmem_reserve[]: 0 0 1212 1212
Dez 16 20:33:14 aldebaran kernel: Node 0 Normal free:3464kB min:2636kB low:3292kB high:3952kB active_anon:291612kB inactive_anon:98252kB active_file:363776kB inactive_file:366212kB unevictable:252kB isolated(ano
Dez 16 20:33:14 aldebaran kernel: lowmem_reserve[]: 0 0 0 0
Dez 16 20:33:14 aldebaran kernel: Node 0 DMA: 31*4kB (UE) 17*8kB (UM) 8*16kB (UE) 5*32kB (UEM) 2*64kB (E) 1*128kB (E) 1*256kB (M) 2*512kB (EM) 1*1024kB (E) 1*2048kB (E) 2*4096kB (UM) = 13348kB
Dez 16 20:33:14 aldebaran kernel: Node 0 DMA32: 2720*4kB (UEM) 30*8kB (UEM) 7*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11232kB
Dez 16 20:33:14 aldebaran kernel: Node 0 Normal: 837*4kB (UEM) 22*8kB (UM) 0*16kB 2*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3588kB
Dez 16 20:33:14 aldebaran kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dez 16 20:33:14 aldebaran kernel: 545434 total pagecache pages
Dez 16 20:33:14 aldebaran kernel: 4 pages in swap cache
Dez 16 20:33:14 aldebaran kernel: Swap cache stats: add 10, delete 6, find 0/0
Dez 16 20:33:14 aldebaran kernel: Free swap = 7812052kB
Dez 16 20:33:14 aldebaran kernel: Total swap = 7812092kB
Dez 16 20:33:14 aldebaran kernel: 892955 pages RAM
Dez 16 20:33:14 aldebaran kernel: 0 pages HighMem/MovableOnly
Dez 16 20:33:14 aldebaran kernel: 33747 pages reserved
Dez 16 20:33:14 aldebaran kernel: 0 pages hwpoisoned
Dez 16 20:33:14 aldebaran kernel: BUG: unable to handle kernel NULL
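For context on the trace above: "order:1" means the allocator needed 2^1 = 2 contiguous 4 KiB pages and couldn't find any. The per-order counts in the "Node 0 Normal" buddy list (837*4kB, 22*8kB, 0*16kB, 2*32kB, ...) are the same numbers /proc/buddyinfo reports. A small sketch of reading such a line; the sample string is reconstructed from the log above, not real buddyinfo output:

```python
# Parse a /proc/buddyinfo-style line: after "Node N, zone NAME" come the
# per-order free-block counts, where order n = a block of 2**n contiguous
# 4 KiB pages.
def free_blocks_at_order(buddyinfo_line, order):
    fields = buddyinfo_line.split()
    counts = [int(x) for x in fields[4:]]  # skip "Node", "0,", "zone", name
    return counts[order]

# Sample reconstructed from the "Node 0 Normal" line in the log:
# plenty of single pages, only 22 order-1 (8 KiB) blocks, nothing bigger
# except two order-3 (32 kB) blocks.
sample = "Node 0, zone Normal 837 22 0 2 0 0 0 0 0 0 0"
print(free_blocks_at_order(sample, 1))   # order-1 blocks still free
```

With counts this low, an order-1 GFP_ATOMIC-style request can easily fail even though hundreds of single 4 KiB pages remain: the failure is about contiguity, not total free memory.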
Re: [GIT PULL] Btrfs fixes for 4.4
On Thu, Dec 10, 2015 at 11:44:50AM +, fdman...@kernel.org wrote:
> From: Filipe Manana
>
> Hi Chris,
>
> Please consider the following fixes for kernel 4.4. Two of them are
> fixes to new issues introduced in the 4.4 merge window and 4.4 release
> candidates. The other one just fixes a warning message that is
> confusing and has made several users wonder if they are supposed to do
> anything or not when we fail to read a space cache.
>
> All these fixes have been previously sent to the mailing list.

Thanks Filipe, I tested these and pushed out, along with my two from this week.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)
On Wed, 2015-12-09 at 16:36 +, Duncan wrote:
> But... as I've pointed out in other replies, in many cases including
> this specific one (bittorrent), applications have already had to
> develop their own integrity management features

Well, let's move discussion of that into the "dear developers, can we have nodatacow + checksumming, plz?" thread, where I showed in one of the more recent threads that bittorrent seems to be about the only thing which uses that by default... while on the VM image front, nothing seems to support it, and on the DB front, some support it but don't use it by default.

> In the bittorrent case specifically, torrent chunks are already
> checksummed, and if they don't verify upon download, the chunk is
> thrown away and redownloaded.

I'm not a bittorrent expert, because I don't use it, but that sounds more like the edonkey model, where - while there are checksums - these are only used until the download completes. Then you have the complete file, any checksum info is thrown away, and the file is again "at risk" (i.e. not checksum protected).

> And after the download is complete and the file isn't being constantly
> rewritten, it's perfectly fine to copy it elsewhere, into a dir where
> nocow doesn't apply.

Sure, but again, nothing the user may automatically do, and there's still the gap between the final verification by the bt software and the time it's copied over. Arguably, that may be very short, but I see no reason to make any breaks in the everything-verified chain on the btrfs side.

> With the copy, btrfs will create checksums, and if you're paranoid you
> can hashcheck the original nocow copy against the new checksummed/cow
> copy, and after that, any on-media changes will be caught by the normal
> checksum verification mechanisms.

As before... of course you're right that one can do this, but nothing that happens by default. And I think that's just one of the nice things btrfs would/should give us.
That the filesystem assures that data is valid, at least in terms of storage device and bus errors (it cannot, of course, protect against memory errors or the like).

> > Hmm doesn't seem really good to me if systemd would do that, cause it
> > then excludes any such files from being snapshot.
>
> Of course if the directories are already present due to systemd
> upgrading from non-btrfs-aware versions, they'll remain as normal
> dirs, not subvolumes.

This is the case here. Well, even if not, because one starts from a fresh system... people may not want that.

> And of course you can switch them around to dirs if you like, and/or
> override the shipped tmpfiles.d config with your own.

... sure, but people may not even notice that. I don't think such a decision is up to systemd. Anyway, since we're btrfs here, not systemd, that shouldn't bother us ;)

> > > and small ones such as the sqlite files generated by firefox and
> > > various email clients are handled quite well by autodefrag, with
> > > that general desktop usage being its primary target.
> >
> > Which is however not yet the default...
>
> Distro integration bug! =:^)

Nah... really not... I'm quite sure that most distros will generally decide against diverging from upstream in such choices.

> > It feels a bit, if there should be some tools provided by btrfs,
> > which tell the users which files are likely problematic and should
> > be nodatacow'ed
>
> And there very well might be such a tool... five or ten years down the
> road when btrfs is much more mature and generally stabilized, well
> beyond the "still maturing and stabilizing" status of the moment.
Hmm, let's hope btrfs isn't finished only when the next-gen default fs arrives ;^)

> But it can be the case that as filesystem fragmentation levels rise,
> free-space itself is fragmented, to the point where files that would
> otherwise not be fragmented as they're created once and never touched
> again, end up fragmented, because there's simply no free-space extents
> big enough to create them in unfragmented, so a bunch of smaller
> free-space extents must be used where one larger one would have been
> used had it existed.

I'm kinda curious what free-space fragmentation actually means here.

Is it simply like this:

+---+---+---+---+
| F | D | F | D |
+---+---+---+---+

where D is data (i.e. files/metadata) and F is free space? In other words, (F)ree space itself is not further subdivided and is only fragmented by the (D)ata extents in between.

Or is it more complex, like this:

+---+---+---+---+---+
| F | F | D | F | D |
+---+---+---+---+---+

where the (F)ree space itself is subdivided into "extents" (not necessarily of the same size), and btrfs couldn't use e.g. the first two F's as one contiguous amount of free space for a larger (D)ata extent of that size:

+---+---+---+---+
| D | D | F | D |
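The difference between the two pictures can be made concrete with a toy allocator. This is purely illustrative, not btrfs's actual free-space handling: the point is that total free space can be ample while no single free extent is large enough for a contiguous allocation.

```python
# Free space modeled as (offset, length) extents, already broken up by
# the data extents between them. Toy units, not real block addresses.
free_extents = [(0, 64), (128, 64), (256, 32)]

def can_allocate_contiguously(free_extents, size):
    # A request succeeds only if one single free extent can hold it;
    # free extents separated by data cannot be merged.
    return any(length >= size for _, length in free_extents)

total_free = sum(length for _, length in free_extents)
print(total_free)                                    # 160 units free overall
print(can_allocate_contiguously(free_extents, 100))  # False: largest piece is 64
```

So in the second picture, a 100-unit write fails (or gets split into smaller extents, i.e. fragments) even though 160 units are free in total.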
Re: btrfs: poor performance on deleting many large files
On Sun, 2015-12-13 at 07:10 +, Duncan wrote:
> > So you basically mean that ro snapshots won't have their atime
> > updated even without noatime?
> > Well I guess that was anyway the recent behaviour of Linux
> > filesystems, and only very old UNIX systems updated the atime even
> > when the fs was set ro.
>
> I'd test it to be sure before relying on it (keeping in mind that my
> own use-case doesn't include subvolumes/snapshots so it's quite
> possible I could get fine details of this nature wrong), but that
> would be my very (_very_! see next) strong assumption, yes.
>
> Because read-only snapshots are used for btrfs-send among other
> things, with the idea being that the read-only will keep them from
> changing in the middle of the send, and ro snapshot atime updates
> would seem to throw that entirely out the window. So I can't imagine
> ro snapshots doing atime updates under any circumstance because I just
> can't see how send could rely on them then, but I'd still test it
> before counting on it.

For those who haven't followed the other threads: I've tried it out, and yes, ro snapshots (as well as ro-mounted btrfs filesystems/subvolumes) don't have their atimes changed on e.g. read.

> AFAIK, the general idea was to eventually have all the (possible, some
> are global-filesystem-scope) subvolume mount options exposed as
> properties, it's just not implemented yet, but I'm not entirely sure
> if that was all /btrfs-specific/ mount options, or included the
> generic ones such as the *atime and no* (noexec/nodev/...) options as
> well. In view of that and the fact that noatime is generic, adding it
> as a specific request still makes sense. Someone with more specific
> knowledge on the current plan can remove it if it's already covered.
Not sure if I had already posted that here, but I did write some of these ideas up and added them to the wiki:
https://btrfs.wiki.kernel.org/index.php?title=Project_ideas=historysubmit=29757=29743

Best wishes,
Chris.
Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)
On Mon, 2015-12-14 at 10:51 +, Duncan wrote:
> > AFAIU, the one that gets fragmented then is the snapshot, right, and
> > the "original" will stay in place where it was? (Which is of course
> > good, because one probably marked it nodatacow to avoid that
> > fragmentation problem on internal writes).
>
> No. Or more precisely, keep in mind that from btrfs' perspective, in
> terms of reflinks, once made, there's no "original" in terms of
> special treatment; all references to the extent are treated the same.

Sure... you misunderstood me, I guess.

> What a snapshot actually does is create another reference (reflink) to
> an extent.
[snip snap]
> So in the case of nocow, a cow1 (one-time-cow) exception must be made,
> rewriting the changed data to a new location, as the old location
> continues to be referenced by at least one other reflink.

That's what I meant.

> So (with the fact that writable snapshots are available and thus it
> can be the snapshot that changed if it's what was written to) the one
> that gets the changed fragment written elsewhere, thus getting
> fragmented, is the one that changed, whether that's the working copy
> or the snapshot of that working copy.

Yep... that's what I suspected and asked about. The "original" file, in the sense of the file that first reflinked the contiguous blocks, will continue to point to these contiguous blocks. The "new" file, i.e. the CoW-1-ed snapshot's file, will partially reflink blocks from the contiguous range, and its rewritten blocks will reflink somewhere else. Thus the "new" file is the one that gets fragmented.

> > And one more:
> > You both said auto-defrag is generally recommended.
> > Does that also apply to SSDs (where we want to avoid unnecessary
> > writes)?
> > It does seem to get enabled when SSD mode is detected.
> > What would it actually do on an SSD?
> Did you mean it does _not_ seem to get (automatically) enabled, when
> SSD mode is detected, or that it _does_ seem to get enabled, when
> specifically included in the mount options, even on SSDs?

It does seem to get enabled when specifically included in the mount options (the ssd mount option is not used), i.e.:

/dev/mapper/system  /  btrfs  subvol=/root,defaults,noatime,autodefrag  0 1

leads to:

[    5.294205] BTRFS: device label foo devid 1 transid 13 /dev/disk/by-label/foo
[    5.295957] BTRFS info (device sdb3): disk space caching is enabled
[    5.296034] BTRFS: has skinny extents
[   67.082702] BTRFS: device label system devid 1 transid 60710 /dev/mapper/system
[   67.85] BTRFS info (device dm-0): disk space caching is enabled
[   67.111267] BTRFS: has skinny extents
[   67.305084] BTRFS: detected SSD devices, enabling SSD mode
[   68.562084] BTRFS info (device dm-0): enabling auto defrag
[   68.562150] BTRFS info (device dm-0): disk space caching is enabled

> Or did you actually mean it the way you wrote it, that it seems to be
> enabled (implying automatically, along with ssd), when ssd mode is
> detected?

No, sorry for being unclear. I meant it in the sense that having the ssd detected doesn't auto-disable auto-defrag, which I thought might make sense, given that I didn't know exactly what it would do on SSDs... IIRC, Hugo or Austin mentioned the thing about making for better IOPS, but I hadn't considered that to have enough impact... so I thought it could have made sense to ignore the "autodefrag" mount option in case an ssd was detected.

> There are three factors I'm aware of here as well, all favoring
> autodefrag, just as the two above favored leaving it off.
>
> 1) IOPS, Input/Output Operations Per Second. SSDs typically have both
> an IOPS and a throughput rating. And unlike spinning rust, where raw
> non-sequential-write IOPS are generally bottlenecked by seek times, on
> SSDs with their zero seek-times, IOPS can actually be the bottleneck.
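(As an aside, the boot log quoted above can be checked mechanically; a small sketch over the same messages, showing that SSD detection and autodefrag coexist rather than one disabling the other:)

```python
# Lines taken from the dmesg excerpt quoted above.
boot_log = [
    "BTRFS: detected SSD devices, enabling SSD mode",
    "BTRFS info (device dm-0): enabling auto defrag",
    "BTRFS info (device dm-0): disk space caching is enabled",
]

ssd_mode   = any("enabling SSD mode" in line for line in boot_log)
autodefrag = any("enabling auto defrag" in line for line in boot_log)

# Both appear in the same boot: SSD detection did not disable autodefrag.
print(ssd_mode and autodefrag)   # True
```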
Hmm, it would be really nice to get someone who has found a way to do some sound analysis/benchmarking of that.

> 2) SSD physical write and erase block sizes as multiples of the
> logical/read block size. To the extent that extent sizes are multiples
> of the write and/or erase-block size, writing larger extents will
> reduce write amplification due to writing blocks smaller than the
> write or erase block size.

Hmm... okay, I don't know the details of how btrfs does this, but I'd have expected that all extents are aligned to the underlying physical devices' block structure. Thus each extent should start at such a write/erase block, and at most it wouldn't end perfectly at the end of the extent. If the file is fragmented (i.e. more than one extent), I'd have even hoped that all but the last one fit perfectly.

So what you basically mean, AFAIU, is that by having auto-defrag you get larger extents (i.e. smaller ones collapsed into one), and thus you get less cut off at the end of extents where these
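The write-amplification argument in point 2) can be illustrated with rough arithmetic. The 512 KiB erase-block size below is an assumed example (real values vary widely by drive), and the model is deliberately simplistic: each extent is written separately, and a partial erase block at an extent's tail still costs a whole erase block of wear.

```python
import math

ERASE_BLOCK_KIB = 512   # assumed for illustration; real SSDs differ

def erase_blocks_touched(extent_sizes_kib):
    # Each extent rounds up to whole erase blocks on its own.
    return sum(math.ceil(s / ERASE_BLOCK_KIB) for s in extent_sizes_kib)

fragmented = [128] * 8   # a 1 MiB file in eight 128 KiB fragments
defragged  = [1024]      # the same data as one 1 MiB extent

print(erase_blocks_touched(fragmented))  # 8 erase blocks touched
print(erase_blocks_touched(defragged))   # 2 erase blocks touched
```

Under these assumptions, collapsing the fragments cuts the erase blocks touched from 8 to 2 for the same megabyte of data, which is the gain Duncan is describing.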
Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)
On Wed, 9 Dec 2015 13:36:01 + (UTC), Duncan <1i5t5.dun...@cox.net> wrote:

> >> > 4) Duncan mentioned that defrag (and I guess that's also for
> >> > auto-defrag) isn't ref-link aware...
> >> > Isn't that somehow a complete showstopper?
>
> >> It is, but the one attempt at dealing with it caused massive data
> >> corruption, and it was turned off again.
>
> IIRC, it wasn't data corruption so much as massive scaling issues, to
> the point where defrag was entirely useless, as it could take a week
> or more for just one file.
>
> So the decision was made that a non-reflink-aware defrag that actually
> worked in something like reasonable time, even if it did break
> reflinks and thus increase space usage, was of more use than a defrag
> that basically didn't work at all, because it effectively took an
> eternity. After all, you can always decide not to run it if you're
> worried about the space effects it's going to have, but if it's going
> to take a week or more for just one file, you effectively don't have
> the choice to run it at all.
>
> > So... does this mean that it's still planned to be implemented some
> > day or has it been given up forever?
>
> AFAIK it's still on the list. And the scaling issues are better, but
> one big thing holding it up now is quota management. Quotas never have
> worked correctly, but they were a big part (close to half, IIRC) of
> the original snapshot-aware-defrag scaling issues, and thus must be
> reliably working and in a generally stable state before a
> snapshot-aware defrag can be coded to work with them. And without
> that, it's only half a solution that would have to be redone when
> quotas stabilized anyway, so really, quota code /must/ be stabilized
> to the point that it's not a moving target before reimplementing
> snapshot-aware defrag makes any sense at all.
>
> But even at that point, while snapshot-aware defrag is still on the
> list, I'm not sure if it's ever going to be actually viable.
> It may be that the scaling issues are just too big, and it simply
> can't be made to work both correctly and in anything approaching
> practical time. Time will tell, of course, but until then...

I'd like to throw in an idea... Couldn't auto-defrag just be made "sort of reflink-aware" in a very simple fashion: just let it ignore extents that are shared?

That way you could still enjoy its benefits in a mixed-mode scenario where you work with snapshots on some subvolumes, while other subvolumes that never have snapshots taken still get defragmented.

Comments?
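The proposed heuristic is simple enough to sketch. The data layout below is hypothetical; a real implementation would have to detect sharing via FIEMAP's shared-extent flag or a backref lookup rather than a stored refcount, and the field names are invented for illustration:

```python
# Hypothetical extent records: (offset, length, refcount).
extents = [
    (0,   128, 1),   # referenced only by this file -> safe to relocate
    (128, 128, 3),   # shared with two snapshots -> relocating unshares data
    (256, 128, 1),
]

def defrag_candidates(extents):
    # "Sort of reflink-aware": simply skip anything shared, so defrag
    # never duplicates snapshot-shared data.
    return [(off, length) for off, length, refs in extents if refs == 1]

print(defrag_candidates(extents))   # [(0, 128), (256, 128)]
```

The trade-off is visible even in the toy version: the shared middle extent stays fragmented, so files that are mostly snapshot-shared would see little benefit, but space usage never grows.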
Ideas on unified real-ro mount option across all filesystems
Hi,

In a recent btrfs patch, a mount option is going to be added to disable log replay for btrfs, just like "norecovery" for ext4/xfs.

But in the discussion on the mount option name and use case, it seems better to have a unified and fs-independent mount option alias for a real RO mount.

Reasons:

1) Some filesystems may have already used a [no]"recovery" mount option.
In fact, btrfs has already used the "recovery" mount option, so using a "norecovery" mount option would be quite confusing for btrfs.

2) A more straightforward mount option.
Currently, to get a real RO mount for ext4/xfs, the user must use -o ro,norecovery. Just ro won't ensure real RO, and norecovery can't be used alone. If we had a simple alias, it would be much easier for users. (It may be done just in userspace mount.)
Not to mention some fs (yeah, btrfs again) doesn't have "norecovery" but "nologreplay".

3) A lot of users don't even know that mounting ro can still modify the device.
Yes, I didn't know this point until I checked the log replay code of btrfs. Adding such a mount option alias may raise some awareness among users.

Any ideas about this?

Thanks,
Qu
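Point 2) could be prototyped entirely in userspace mount, as suggested. A sketch of the idea; the alias name "hardro" and the lookup table are invented here for illustration (only the per-fs options norecovery/nologreplay come from the mail, and nologreplay was still under discussion at the time):

```python
# Hypothetical userspace expansion of a unified "really read-only" alias
# into each filesystem's own option. "hardro" is NOT an existing mount
# option; it stands in for whatever name such an alias would get.
PER_FS_OPTION = {
    "ext4":  "norecovery",
    "xfs":   "norecovery",
    "btrfs": "nologreplay",   # per the patch under discussion
}

def expand_hardro(fstype):
    extra = PER_FS_OPTION.get(fstype)
    if extra is None:
        raise ValueError("no known real-RO option for " + fstype)
    return "ro," + extra

print(expand_hardro("btrfs"))   # ro,nologreplay
print(expand_hardro("ext4"))    # ro,norecovery
```

The user would type one alias everywhere; mount(8) would translate it before calling the kernel, so no per-fs knowledge leaks into scripts.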
Re: Ideas on unified real-ro mount option across all filesystems
And here is the existing discussion on the btrfs mailing list, just for reference:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/51098

Thanks,
Qu

Qu Wenruo wrote on 2015/12/17 09:41 +0800:
> Hi,
>
> In a recent btrfs patch, it is going to add a mount option to disable
> log replay for btrfs, just like "norecovery" for ext4/xfs.
> But in the discussion on the mount option name and use case, it seems
> better to have a unified and fs-independent mount option alias for
> real RO mount.
>
> Reasons:
> 1) Some file system may have already used [no]"recovery" mount option
> In fact, btrfs has already used "recovery" mount option. Using
> "norecovery" mount option will be quite confusing for btrfs.
>
> 2) More straightforward mount option
> Currently, to get real RO mount, for ext4/xfs, user must use -o
> ro,norecovery. Just ro won't ensure real RO, and norecovery can't be
> used alone. If we have a simple alias, it would be much better for
> user to use. (it may be done just in user space mount)
> Not to mention some fs (yeah, btrfs again) doesn't have "norecovery"
> but "nologreplay".
>
> 3) A lot of users even don't know mount ro can still modify device
> Yes, I didn't know this point until I checked the log replay code of
> btrfs. Adding such mount option alias may raise some attention of
> users.
>
> Any ideas about this?
>
> Thanks,
> Qu
Re: attacking btrfs filesystems via UUID collisions?
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 13:03:24 +0100 as excerpted:

> Human readable labels are basically guaranteed to collide,

Heh, not here, tho one could argue that my labels aren't "human readable", I suppose.

grep LABEL= /etc/fstab | cut -f1

LABEL=bt0238gcn1+35l0
LABEL=bt0238gcn0+35l0
LABEL=bt0465gsg0+47f0
LABEL=rt0238gcnx+35l0
LABEL=rt0238gcnx+35l1
LABEL=rt0465gsg0+47f0
LABEL=hm0238gcnx+35l0
LABEL=pk0238gcnx+35l0
LABEL=nr0238gcnx+35l0
LABEL=hm0238gcnx+35l1
LABEL=pk0238gcnx+35l1
LABEL=nr0238gcnx+35l1
LABEL=hm0465gsg0+47f0
LABEL=pk0465gsg0+47f0
LABEL=nr0465gsg0+47f0
LABEL=lg0238gcnx+35l0
LABEL=lg0465gsg0+47f0
LABEL=mm0465gsg0+2550
LABEL=mm0465gsg0+2551
#LABEL=sw0465gsg0+47f0

The scheme was originally designed with reiserfs' 15-char limited labels in mind, so it's 15 chars. These days I use it for both fs labels and gpt partition names/labels, with the two generally matched except for the device sequential, which is x in the multi-device case.

* function: 2 chars (bt=boot, hm=home, etc.)
* device-id: 8 chars, unique-in-scope device id
  ** size: 5 (0238g = 238 GiB)
  ** brand: 2 (sg=seagate, cn=corsair neutron, etc.)
  ** dev-seq: 1 (there can be more than one 465 GiB seagate)
* target: 1 (+=home workstation, . for the netbook, etc.)
* date: 3, date of original partition creation
  ** year: 1, last digit of year, gives decade scope
  ** month: 1, 1-9abc
  ** day: 1, 1-9a-v (2 chars would be nice here, but...)
* func-seq: 1 (0=working, backup-N)

2+8+1+3+1 = 15 chars =:^)

So for example rt0238gcnx+35l0 is root, on 238 GiB Corsair Neutrons (multi-device), targeted at the workstation, with the partitions originally set up in 2013, June (something, whatever l is), working copy.

(Hmm... Only apropos to this thread due to the tangential btrfs angle, but that's two and a half years ago. Which, since that's when I first deployed btrfs permanently, means I've been running btrfs for two and a half years now. ... =:^)

The function tells me at a glance what it's intended to be used for.
The target (which also functions as a visual separator) tells me at a glance where the device is intended to be used. The func-seq tells me at a glance whether I'm dealing with the working copy or what level of backup, and taken together with the function and target, uniquely IDs the partition/filesystem "software device".

The dev-id is unique-in-scope, easily IDing size, brand, and number of the "hardware device", and size is ridiculously scalable from bytes to PiB and beyond. For multi-device btrfs, dev-seq is "x", while the individual device partitions composing it still have their sequence numbers in their gpt labels.

The date (along with size, of course) provides some idea of the age of the device, or at least of the partitioning scheme on it, as well as providing more bits of "software device" and overall unique-id.

Both sequence numbers can easily and intuitively scale to 61 (1-9a-zA-Z) if needed, and less intuitively a bit higher if it's really necessary. Target would lose its separator status if it scaled too far, but it certainly gives me as an individual a /reasonable/ number of machines of flexibility.

This scheme self-evidently and easily scales to a library well into the multi-hundreds if not thousands of physical devices, portable or permanently installed, partitioned up as needed. I haven't yet found the need, as my "device library" is small enough, but were I to need to, I could reasonably easily put together a database tracking where various files (and even various versions of those files) are located.

With the "software device" and "hardware device" IDed separately, I can easily substitute out or add/remove hardware devices from software devices, or the reverse, as necessary.

The biggest problem is the 15-char limit; I had to pack the fields rather tighter and more cryptically than I'd have liked, so it's not as easily human readable as I'd have liked.
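For anyone wanting to adopt something similar: the scheme is regular enough to generate mechanically. A sketch following the field layout described above; the function and argument names are my own, only the encoding comes from the post:

```python
# Encode the 15-char label scheme: function(2) size(5) brand(2)
# dev-seq(1) target(1) year(1) month(1) day(1) func-seq(1).
def month_code(m):   # 1-9 literal, then a=10, b=11, c=12
    return str(m) if m <= 9 else chr(ord('a') + m - 10)

def day_code(d):     # 1-9 literal, then a=10 ... v=31
    return str(d) if d <= 9 else chr(ord('a') + d - 10)

def make_label(func, size_gib, brand, dev_seq, target,
               year, month, day, func_seq):
    return (func                              # function, 2 chars
            + format(size_gib, '04d') + 'g'   # size, 5 chars
            + brand                           # brand, 2 chars
            + str(dev_seq)                    # device sequential (x = multi)
            + target                          # target machine
            + str(year % 10)                  # last digit of year
            + month_code(month)
            + day_code(day)
            + str(func_seq))                  # 0=working, else backup level

# Reproduces a label from the fstab listing above (day "l" = 21).
print(make_label('rt', 238, 'cn', 'x', '+', 2013, 5, 21, 0))  # rt0238gcnx+35l0
```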
And of course it'd need to be adapted for deployment scales on the level of facebook/google/nsa, where 60-some device scaling in the sequence numbers, and the target scaling as well, is pitifully laughable, but it's certainly reasonable on an individual scale. And with a couple of revisions for mdraid and btrfs (basically, md for brand when I was doing partitioned mdraid, and substituting x for the individual sequence number for multi-device), the scheme has served me surprisingly well over the years since I came up with it, and should continue to do so, I suppose, until I no longer have the need (death, or near-vegetable in a nursing home or whatever).

Tho if HP's "the machine" were to ever take off in my lifetime, it could prove somewhat... challenging to the mental and nomenclature model, but that pretty much applies to the entire computer field, both hardware and software, as we know it, so I'm far from alone there.

But, despite the debatable
Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted:

> > And there very well might be such a tool... five or ten years down
> > the road when btrfs is much more mature and generally stabilized,
> > well beyond the "still maturing and stabilizing" status of the
> > moment.
>
> Hmm let's hope btrfs isn't finished only when the next-gen default fs
> arrives ;^)

[Again, breaking into smaller point replies...]

Well, given the development history of both zfs and btrfs to date, five to ten years down the line, with yet another even newer filesystem then already under development, is more being "real" than not. Also see the history of MS' attempt at a next-gen filesystem. The reality is these things take FAR longer than one might think.

FWIW, on the wiki I see feature points and benchmarks for v0.14, introduced in April of 2008, and a link to an earlier btree filesystem on which btrfs apparently was based, dating to 2006. So while I don't have a precise beginning date, and to some extent such a thing would be rather arbitrary anyway, as Chris would certainly have done some major thinking, preliminary research, and coding before his first announcement, a project origin in late 2006 or sometime in 2007 has to be quite close.

And (as I noted in a parenthetical at my discovery in a different thread), I switched to btrfs for my main filesystems when I bought my first SSDs, in June of 2013, so already a quarter decade ago. At the time btrfs was just starting to remove some of the more dire "experimental" warnings. Obviously it has stabilized quite a bit since then, but due to the oft-quoted 80/20 rule and its extensions, where the last 20% of the progress takes 80% of the work, etc...

It could well be another five years before btrfs is at a point most here would call stable. That would be 2020 or so, about 13 years for the project, and if you look at the similar projects mentioned above, that really isn't unrealistic at all.
Ten years minimum, and that's with serious corporate-level commitments and a lot more dedicated devs than btrfs has. Twelve years not unusual at all, and a decade and a half still well within reasonable range, for a filesystem with this level of complexity, scope, and features.

And realistically, by that time, yet another successor filesystem may indeed be in the early stages of development, say at the 20/80 point: 20% of the required effort invested, possibly 80% of the features done, but not stabilized.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted:

> I'm kinda curious what free-space fragmentation actually means here.
>
> Is it simply like this:
>
> +---+---+---+---+
> | F | D | F | D |
> +---+---+---+---+
>
> Where D is data (i.e. files/metadata) and F is free space. In other
> words, (F)ree space itself is not further subdivided and is only
> fragmented by the (D)ata extents in between.
>
> Or is it more complex, like this:
>
> +---+---+---+---+---+
> | F | F | D | F | D |
> +---+---+---+---+---+
>
> Where the (F)ree space itself is subdivided into "extents" (not
> necessarily of the same size), and btrfs couldn't use e.g. the first
> two F's as one contiguous amount of free space for a larger (D)ata
> extent

[still breaking into smaller points for reply]

At one level, I had the simpler f/d/f/d scheme in mind, but that would be the case inside a single data chunk. At the higher file level, with files ranging from significant fractions of the size of a single data chunk to much larger than a single data chunk, the more complex second f/f/d/f/d case would apply, with the chunk boundary as the separation between the f/f.

IOW, files larger than data-chunk size will always be fragmented into data-chunk-size fragments/extents, at the largest, because chunks are designed to be movable using balance, device remove, replace, etc.

So (using the size numbers from a recent comment by Qu in a different thread), on a filesystem with under 100 GiB total space-effective (space-effective: space available accounting for the replication type, raid1, etc., and I'm simplifying here...), data chunks should be 1 GiB, while above that, with striping, they might be up to 10 GiB. Using the 1 GiB nominal figure, files over 1 GiB would always be broken into 1 GiB maximum-size extents, corresponding to 1 extent per chunk.
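The chunk-size ceiling just described directly gives a lower bound on a large file's extent count. A quick sketch under the nominal 1 GiB chunk size from the paragraph above (real chunk sizes vary with filesystem size and profile):

```python
import math

CHUNK_GIB = 1   # nominal data-chunk size on sub-100-GiB filesystems

def min_extents(file_size_gib):
    # No extent can cross a chunk boundary, so a file needs at least
    # size/chunk-size extents, and at least one.
    return max(1, math.ceil(file_size_gib / CHUNK_GIB))

print(min_extents(0.5))   # 1: fits within a single chunk
print(min_extents(4))     # 4: at least four 1 GiB extents
```

So even a completely "unfragmented" 4 GiB file reports at least 4 extents on btrfs, which matters when interpreting filefrag numbers later in this thread.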
But while 4 KiB extents are clearly tiny and inefficient at today's scale, in practice, efficiency gains break down at well under GiB scale, with AFAIK 128 MiB being the upper bound at which any efficiency gains could really be expected, and 1 MiB arguably being a reasonable point at which further increases in extent size likely won't have a whole lot of effect even on SSD erase-block (where 1 MiB is a nominal max), but that's still 256X the usual 4 KiB minimum data block size, 8X the 128 KiB btrfs compression-block size, and 4X the 256 KiB defrag default "don't bother with extents larger than this" size. Basically, the 256 KiB btrfs defrag "don't bother with anything larger than this" default is quite reasonable, tho for massive multi-gig VM images, the number of 256 KiB fragments will still look pretty big, so while technically a very reasonable choice, the "eye appeal" still isn't that great. But based on real reports posting before and after numbers from filefrag (on uncompressed btrfs), we do have cases where defrag can't find 256 KiB free-space blocks and thus can actually fragment a file worse than it was before, so free-space fragmentation is indeed a very real problem.
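A quick sanity check of the size ratios cited above, plus (commented out, as a hypothetical invocation with a placeholder path) the defrag flag that raises the 256 KiB default target extent size:

```shell
# All sizes in KiB: 4 KiB data block, 128 KiB compression block,
# 256 KiB defrag default, 1 MiB (1024 KiB) candidate upper bound.
blk=4 comp=128 defrag=256 target=1024
echo "$((target / blk))x the data block size"          # 256x
echo "$((target / comp))x the compression-block size"  # 8x
echo "$((target / defrag))x the defrag default"        # 4x
# btrfs defrag accepts a target extent size; e.g. (path is a placeholder):
# btrfs filesystem defragment -t 1M /path/to/vm-image
```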
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted: > I'm a bit unsure how to read filefrag's output... (even in the > uncompressed case). > What would it show me if there was fragmentation /path/to/file: 18 extents found It tells you the number of extents found. Nominally, each extent should be a fragment, but as has been discussed elsewhere, on btrfs compressed files it will interpret each 128 KiB btrfs compression block as its own extent, even if (as seen in verbose mode) the next one begins where the previous one ends so it's really just a single extent. Apparently on ext3/4, it's possible to have multi-gig files as a single extent, thus unfragmented, but as explained in an earlier reply to a point earlier in your post, on btrfs, extents of a GiB are nominally the best you can do as that's the nominal data chunk size, tho in limited circumstances larger extents are still possible on btrfs. In the case above, where I took the 18 extents result from a real file (tho obviously the posted path isn't real), it was 4 MiB in size (I think exactly, it's a 4 MiB BIOS image =:^), so doing the math, extents average 227 KiB. That's on a filesystem that is always mounted with autodefrag, but it's also always mounted with compress, so it's possible some of the reported extents are compressed. Actually, looking at filefrag -v output (which I've never used before but which someone noted could be used to check fragmentation on compressed files, tho it's not as straightforward as you might think), it looks like all but two of the listed extents are 32 blocks long (with 4096 byte blocks), which equates to 128 KiB, the btrfs compression-block size, and the two remaining extents are 224 blocks long or 896 KiB, an exact 7 multiple of 128 KiB, so this file would indeed appear to be compressed except for those two uncompressed extents. 
(As for figuring out how to interpret the full -v output to know whether the compressed blocks are actually single extents or not, as I said this is my first time trying -v, and I didn't bother going that far with it.)
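The extent arithmetic described above can be reproduced with a little awk over filefrag-style data. The extent list here is fabricated to match the sizes mentioned (32-block and 224-block extents), not real filefrag output:

```shell
# Each input line: extent number and length in 4 KiB blocks.
# 32 blocks = 128 KiB (the btrfs compression-block size); 224 blocks = 896 KiB.
printf '%s\n' '0 32' '1 32' '2 224' '3 32' '4 224' |
awk '{ kib = $2 * 4
       tag = (kib % 128 == 0) ? "multiple of 128 KiB" : "odd size"
       printf "extent %s: %d KiB (%s)\n", $1, kib, tag
       total += kib; n++ }
     END { printf "average: %.1f KiB\n", total / n }'
```

Every length here is a multiple of 128 KiB, which is the pattern that suggests compressed extents on btrfs.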
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted: >> he obviously didn't think thru the fact that compression MUST be a >> rewrite, thereby breaking snapshot reflinks, even were normal >> non-compression defrag to be snapshot aware, because compression >> substantially changes the way the file is stored), that's _implied_, >> not explicit. > So you mean, even if ref-link aware defrag would return, it would still > break them again when compressing/uncompressing/recompressing? > I'd have hoped that then, all snapshots respectively other reflinks > would simply also change to being compressed, You're correct. I "obviously didn't think thru" that the whole way, myself. =:^( But meanwhile, we don't have snapshot-aware-defrag, and in that case, the implication... and his result... remains.
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted: >> It's certainly in quite a few on-list posts over the years > okay,.. in other words: no ;-) > list posts scattered over the years don't count as documentation :P =:^)
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On Thu, 2015-12-17 at 01:09 +, Duncan wrote: > Well, "don't load the journal on mounting" is exactly what the option > would do. The journal (aka log) of course has a slightly different > meaning, it's only the fsync log, but loading it is exactly what the > option would prevent, here. That's not the point. What David asked for was an option that has the meaning "do whatever is necessary to mount the fs in such a way that the device isn't changed". At least that's how I've understood him. *Right now* this is, for btrfs, the "nologreplay" option, so "nodevwrites" or whatever you call it would simply imply "nologreplay"... and in the future any other options that are necessary to get the above-defined semantics. For ext, "nodevwrites" would imply "noload" (AFAIU). If you now make "noload" an alias for "nodevwrites" in btrfs, you clearly break semantics here: "noload" from ext4 doesn't have the same meaning as "nodevwrites" from btrfs... it does *just now*, only while ext doesn't need any other possible future options. Maybe in 10 years, ext has a dozen new features (because btrfs still hasn't stabilised yet, as it misses snapshot-aware defrag and checksums for noCoWed data >:-D O:-D ... sorry, couldn't resist ;) ), one of those new features needs to be disabled for "hard ro mounts", thus ext's "nodevwrites" would in addition imply "noSuperNewExtFeature". Then "noload" from ext4 isn't even *effectively* the same anymore as "nodevwrites" from btrfs. Therefore, it shouldn't be an alias. Cheers, Chris.
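A sketch of the alias idea being argued here, as it might look if mount(8) expanded it in userspace. The per-filesystem option names are taken from the discussion; the "nodevwrites" expansion mechanism itself is hypothetical:

```shell
# Hypothetical "nodevwrites" alias: each filesystem maps it onto whatever
# options it currently needs for a no-device-writes mount.
fs=btrfs
case "$fs" in
    btrfs) opts="ro,nologreplay" ;;  # today; may imply more options later
    ext4)  opts="ro,noload" ;;       # AFAIU the ext4 equivalent
    xfs)   opts="ro,norecovery" ;;
esac
echo "mount -o $opts /dev/sdx /mnt"  # /dev/sdx and /mnt are placeholders
```

The point of Christoph's argument is that the mapping on the right-hand side may grow per filesystem over time, which is exactly why a cross-filesystem alias shouldn't be hard-wired to any single existing option.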
Ideas to do custom operation just after mount?
Hi, Does xfstests provide some API to do some operation just after mounting a filesystem? Some filesystems (OK, btrfs again) have some functionality (currently qgroup only) which needs to be enabled by ioctl instead of by mount option. Currently, for btrfs qgroup we added special test cases enabling qgroup and doing the test. But there are fewer than 10 qgroup test cases, and we didn't test most of the rest of the cases with qgroup enabled. Things will get even worse if we are adding in-band deduplication. So is there any good idea for doing some operation just after mounting in xfstests? Thanks, Qu
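One way this could look, sketched as a shell helper. The helper name and structure are invented for illustration; xfstests has no such API today, which is the point of the question:

```shell
# Hypothetical: run an arbitrary post-mount hook, e.g. enabling qgroups,
# which is driven by ioctl ("btrfs quota enable") rather than a mount option.
mount_with_hook() {
    dev=$1; mnt=$2; hook=$3
    mount "$dev" "$mnt" || return 1
    if [ -n "$hook" ]; then
        $hook "$mnt"     # e.g. hook="btrfs quota enable"
    fi
}
# usage (placeholders): mount_with_hook /dev/sdx /mnt "btrfs quota enable"
```

With something like this in the harness's mount path, every existing test case would automatically run with qgroups (or dedup, later) enabled, instead of only the handful of dedicated qgroup tests.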
Re: attacking btrfs filesystems via UUID collisions?
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 16:04:03 +0100 as excerpted: > On Wed, 2015-12-16 at 09:41 -0500, Chris Mason wrote: >> Hugo is right here. reiserfs had tools that would scan an entire >> block device for metadata blocks and try to reconstruct the filesystem >> based on what it found. > Creepy... at least when talking about a "normal" fsck... good that btrfs > is going to be the next-gen-ext, and not reiser4 ;) What often gets lost in discussions of this nature is that it _wasn't_ "normal" fsck that had the problem, but rather, a parameter (--rebuild-tree, IIRC) much like those btrfs check (--init-csum-tree, --init-extent-tree) and rescue (chunk-recover) use for blowing away and recreating the checksum tree, extent tree, chunk tree, etc. So it's definitely _not_ something that reiserfsck would do in a "normal" fsck, only when doing "I'm desperate and don't have backups, go to the ends of the earth if necessary to recover what you can of my data, and yes, I understand it could be a bit risky or end up rather disordered, but I'm willing to take that risk because I _am_ that desperate" level recovery. Arguably, however, the problem was that reiserfs (heh, that's the second time I almost wrote btrfs and caught it, hope I didn't miss any! =:^) had a repair mode for rather minor items, and an "I'm desperate, ends of the earth and I don't care about the risk as anything is better than nothing" mode, but not a lot of choice in between the two. Additionally, now looking at btrfs (a correct reference this time! =:^), the "desperate" solution in btrfs is rather more fine-grained, including at least the three above options plus one for the superblock, with an additional read-only restore tool that can often restore most or all data to elsewhere, in the case of a missed or not-current backup, that reiserfs never had.
But AFAIK reiser4 (which I never actually tried as it never made mainline, which in general I prefer to stick to, but I read about it) improved on the reiserfs model in this regard as well -- indeed, it would have been surprising if it didn't, since both reiser4 and btrfs had the lessons of reiserfs to build upon. And of course reiserfs might have gotten the same sort of tool changes too, except for Hans Reiser's controversial policy of letting stable be stable, and putting the improvements into reiser4, which of course was intended to get into mainline in some reasonable time and thus wouldn't have left reiserfs users so in the lurch as actually happened, because reiser4 never did hit mainline due to $reasons, most/all of which I agree with, or at least understand, where I don't entirely agree. But anyway, for anyone with half a tech-oriented brain, it was very evident that the required options were "desperate" level, and for people without half a tech-oriented brain, the documentation clearly suggested danger ahead, you should have backups if you're going to do this as it's a risky process that could destroy chances of recovery instead of fixing things, as well. But of course so many don't read the docs, they just do it... and sometimes they suffer the consequences when they do... and sometimes then try to blame others for it. That's the way of the world; not something we're going to change. Even the required actually-spelled-out "yes" confirmation, not just "y", didn't stop people, either from doing it or from blaming reiserfs for problems that were in fact mostly their own, when they went ahead anyway.
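For reference, the finer-grained btrfs equivalents mentioned above, shown as commented commands since all of them rewrite or reconstruct metadata and should only ever be pointed at an unmounted (and ideally first imaged) device; /dev/sdx is a placeholder:

```shell
# btrfs check --init-csum-tree /dev/sdx    # blow away and recreate the checksum tree
# btrfs check --init-extent-tree /dev/sdx  # recreate the extent tree
# btrfs rescue chunk-recover /dev/sdx      # rebuild the chunk tree by device scan
# btrfs rescue super-recover /dev/sdx      # repair the superblock from good copies
# btrfs restore /dev/sdx /recovery/dir     # read-only salvage of data to elsewhere
```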
Re: Ideas on unified real-ro mount option across all filesystems
On Wed, Dec 16, 2015 at 09:15:59PM -0600, Eric Sandeen wrote: > > > On 12/16/15 7:41 PM, Qu Wenruo wrote: > > Hi, > > > > In a recent btrfs patch, it is going to add a mount option to disable > > log replay for btrfs, just like "norecovery" for ext4/xfs. > > > > But in the discussion on the mount option name and use case, it seems > > better to have an unified and fs independent mount option alias for > > real RO mount > > > > Reasons: > > 1) Some file system may have already used [no]"recovery" mount option > >In fact, btrfs has already used "recovery" mount option. > >Using "norecovery" mount option will be quite confusing for btrfs. > > Too bad btrfs picked those semantics when "norecovery" has existed on > other filesystems for quite some time with a different meaning... :( > > > 2) More straight forward mount option > >Currently, to get real RO mount, for ext4/xfs, user must use -o > >ro,norecovery. > >Just ro won't ensure real RO, and norecovery can't be used alone. > >If we have a simple alias, it would be much better for user to use. > >(it maybe done just in user space mount) > > mount(8) simply says: > >ro Mount the filesystem read-only. > > and mount(2) is no more illustrative: > >MS_RDONLY > Mount file system read-only. > > kernel code is no help, either: > > #define MS_RDONLY1 /* Mount read-only */ > > They say nothing about what, exactly, "read-only" means. But since at least > the early ext3 days, it means that you cannot write through the filesystem, > not > that the filesystem will leave the block device unmodified when it mounts. > > I have always interpreted it as simply "no user changes to the filesystem," > and that is clearly what the vfs does with the flag... That ("-o ro means no user changes") has always been my understanding too. You /want/ the FS to replay the journal on an RO mount so that regular FS operation picks up the committed transactions. 
--D > > >Not to mention some fs (yeah, btrfs again) doesn't have "norecovery" > >but "nologreplay". > > well, again, btrfs picked unfortunate semantics, given the precedent set > by other filesystems. > > f2fs, ext4, gfs2, nilfs2, and xfs all support "norecovery" - xfs since > forever, ext4 & f2fs since 2009, etc. > > > 3) A lot of user even don't now mount ro can still modify device > >Yes, I didn't know this point until I checked the log replay code of > >btrfs. > >Adding such mount option alias may raise some attention of users. > > Given that nothing in the documentation implies that the block device itself > must remain unchanged on a read-only mount, I don't see any problem which > needs fixing. MS_RDONLY rejects user IO; that's all. > > If you want to be sure your block device rejects all IO for forensics or > what have you, I'd suggest # blockdev --setro /dev/whatever prior to mount, > and take it out of the filesystem's control. Or better yet, making an > image and not touching the original. > > -Eric > > > Any ideas about this? > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted: > On Wed, 2015-12-09 at 16:36 +, Duncan wrote: >> But... as I've pointed out in other replies, in many cases including >> this specific one (bittorrent), applications have already had to >> develop their own integrity management features > Well let's move discussion upon that into the "dear developers, can we > have notdatacow + checksumming, plz?" where I showed in one of the more > recent threads that bittorrent seems rather to be the only thing which > does use that per default... while on the VM image front, nothing seems > to support it, and on the DB front, some support it, but don't use it > per default. > >> In the bittorrent case specifically, torrent chunks are already >> checksummed, and if they don't verify upon download, the chunk is >> thrown away and redownloaded. > I'm not a bittorrent expert, because I don't use it, but that sounds to > be more like the edonkey model, where - while there are checksums - > these are only used until the download completes. Then you have the > complete file, any checksum info thrown away, and the file again being > "at risk" (i.e. not checksum protected). [I'm breaking this into smaller replies again.] Just to mention here, that I said "integrity management features", which includes more than checksumming. As Austin Hemmelgarn has been pointing out, DBs and some VMs do COW, some DBs do checksumming or at least have that option, and both VMs and DBs generally do at least some level of consistency checking as they load. Those are all "integrity management features" at some level. As for bittorrent, I /think/ the checksums are in the torrent files themselves (and if I'm not mistaken, much as git, the chunks within the file are actually IDed by checksum, not specific position, so as long as the torrent is active, uploading or downloading, these will by definition be retained). As long as those are retained, the checksums should be retained. 
And ideally, people will continue to torrent the files long after they've finished downloading them, in which case they'll still need the torrent files themselves, along with the checksum info. And for longer-term storage, people really should be copying/moving their torrented files elsewhere, in such a way that they either eliminate the fragmentation if the files weren't nocowed, or eliminate the nocow attribute and get them checksum-protected as normal for files not intended to be constantly randomly rewritten, which will be the case once they're no longer being actively downloaded. Of course that's at the slightly technically oriented user level, but then, the whole nocow thing, or even caring about checksums and longer term file integrity in the first place, is also technically oriented user level. Normal users will just download without worrying about the nocow in the first place, and perhaps wonder why the disk is thrashing so, but not be inclined to do anything about it except perhaps switch back to their old filesystem, where it was faster and the disk didn't sound as bad. In doing so, they'll either automatically get the checksumming along with the worse performance, or go back to a filesystem without the checksumming, and think it's fine as they know no different. Meanwhile, if they do it correctly there's no window without protection, as the torrent file can be used to double-verify the file once moved, as well, before deleting it.
Re: Ideas on unified real-ro mount option across all filesystems
On 12/16/15 7:41 PM, Qu Wenruo wrote: > Hi, > > In a recent btrfs patch, it is going to add a mount option to disable > log replay for btrfs, just like "norecovery" for ext4/xfs. > > But in the discussion on the mount option name and use case, it seems > better to have an unified and fs independent mount option alias for > real RO mount > > Reasons: > 1) Some file system may have already used [no]"recovery" mount option >In fact, btrfs has already used "recovery" mount option. >Using "norecovery" mount option will be quite confusing for btrfs. Too bad btrfs picked those semantics when "norecovery" has existed on other filesystems for quite some time with a different meaning... :( > 2) More straight forward mount option >Currently, to get real RO mount, for ext4/xfs, user must use -o >ro,norecovery. >Just ro won't ensure real RO, and norecovery can't be used alone. >If we have a simple alias, it would be much better for user to use. >(it maybe done just in user space mount) mount(8) simply says: ro Mount the filesystem read-only. and mount(2) is no more illustrative: MS_RDONLY Mount file system read-only. kernel code is no help, either: #define MS_RDONLY1 /* Mount read-only */ They say nothing about what, exactly, "read-only" means. But since at least the early ext3 days, it means that you cannot write through the filesystem, not that the filesystem will leave the block device unmodified when it mounts. I have always interpreted it as simply "no user changes to the filesystem," and that is clearly what the vfs does with the flag... >Not to mention some fs (yeah, btrfs again) doesn't have "norecovery" >but "nologreplay". well, again, btrfs picked unfortunate semantics, given the precedent set by other filesystems. f2fs, ext4, gfs2, nilfs2, and xfs all support "norecovery" - xfs since forever, ext4 & f2fs since 2009, etc. > 3) A lot of user even don't now mount ro can still modify device >Yes, I didn't know this point until I checked the log replay code of >btrfs. 
>Adding such mount option alias may raise some attention of users. Given that nothing in the documentation implies that the block device itself must remain unchanged on a read-only mount, I don't see any problem which needs fixing. MS_RDONLY rejects user IO; that's all. If you want to be sure your block device rejects all IO for forensics or what have you, I'd suggest # blockdev --setro /dev/whatever prior to mount, and take it out of the filesystem's control. Or better yet, making an image and not touching the original. -Eric > Any ideas about this?
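Eric's blockdev suggestion as a concrete sequence, commented out since it needs root and a real device; /dev/sdx is a placeholder:

```shell
# blockdev --setro /dev/sdx   # block layer now fails every write to the device
# blockdev --getro /dev/sdx   # prints 1 once the device is marked read-only
# mount -o ro /dev/sdx /mnt   # any log replay the fs attempts can no longer
#                             # touch the disk (the mount may fail instead)
```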
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 12:45:00 +0100 as excerpted: > On Wed, 2015-12-16 at 11:10 +, Duncan wrote: >> And noload doesn't have the namespace collision problem norecovery does >> on btrfs, so I'd strongly suggest using it, at least as an alias for >> whatever other btrfs-specific name we might choose. > > but noload is, AFAIU, not what's desired here, is it? > Per manpage it's "Don't load the journal on mounting",... not only > wouldn't that fit for btrfs, it's also not what's really desired, i.e. > an option that implies everything necessary to not modify the device. Well, "don't load the journal on mounting" is exactly what the option would do. The journal (aka log) of course has a slightly different meaning, it's only the fsync log, but loading it is exactly what the option would prevent, here. Of course that isn't to say there shouldn't be another option, call it nomodify, for argument, that includes this and perhaps other options that would otherwise trigger filesystem level changes on a normal read-only mount. Too bad we can't simply rename the recovery mount option so norecovery could be used as well, but I guess that could potentially break too many existing deployments. =:^(
Re: dear developers, can we have notdatacow + checksumming, plz?
On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote: > > Well sure, I think we'd done most of this and have dedicated controllers, at least of a quality that funding allows us ;-) > > But regardless how much one tunes, and how good the hardware is: if you'd then always lose a fraction of your overall IO, be it just 5%, to defragging these types of files, one may actually want to avoid this at all, for which nodatacow seems *the* solution. > nodatacow only works for that if the file is pre-allocated; if it isn't, then it still ends up fragmented. Hmm, is that "it may end up fragmented" or "it will definitely"? Cause I'd have hoped that, if nothing else had been written in the meantime, btrfs would perhaps try to write next to the already allocated blocks. > > > The problem is not entirely the lack of COW semantics, it's also the fact that it's impossible to implement an atomic write on a hard disk. > > Sure... but that's just the same for the nodatacow writes of data. > > (And the same, AFAIU, for CoW itself, just that we'd notice any corruption in case of a crash due to the CoWed nature of the fs and could go back to the last generation.) > Yes, but it's also the reason that using either COW or a log-structured filesystem (like NILFS2, LogFS, or I think F2FS) is important for consistency. So then it's no reason why it shouldn't work. The metadata is CoWed; any incomplete writes of checksum data in that (be it for CoWed data or no-CoWed data, should the latter be implemented) would be protected at that level. Currently, the no-CoWed data is, AFAIU, completely at risk of being corrupted (no checksums, no journal). Checksums on no-CoWed data would just improve that. > > What about VMs? At least a quick google search didn't give me any results on whether there would be e.g. checksumming support for qcow2. > > For raw images there surely is not.
> I don't mean that the VMM does checksumming, I mean that the guest OS should be the one to handle the corruption. No sane OS doesn't run at least some form of consistency checks when mounting a filesystem. Well, but we're not talking about having a filesystem that "looks clear" here. For this alone we wouldn't need any checksumming at all. We talk about data integrity protection, i.e. all files and their contents. Nothing which a fsck inside a guest VM would ever notice (I mean by a fsck), if there are just some bit flips or things like that. > > And even if DBs do some checksumming now, it may be just a consequence of that missing in the filesystems. > > As I've written somewhere else in the previous mail: it's IMHO much better if one system takes care of this, where the code is well tested, than each application doing its own thing. > That's really a subjective opinion. The application knows better than we do what type of data integrity it needs, and can almost certainly do a better job of providing it than we can. Hmm, I don't see that. When we, at the filesystem level, provide data integrity, then all data is guaranteed to be valid. What more should an application be able to provide? At best they can do the same thing faster, but even for that I see no immediate reason to believe it. And in practise it seems far more likely that if countless applications each handle such tasks on their own, it's more error prone (that's why we have libraries for all kinds of code, trying to reuse code, minimising the possibility of errors in countless home-brew solutions), or not done at all. > > > > - the data was written out correctly, but before the csum was written the system crashed, so the csum would now tell us that the block is bad, while in reality it isn't.
> > > There is another case to consider: the data got written out, but the crash happened while writing the checksum (so the checksum was partially written, and is corrupt). This means we get a false positive on a disk error that isn't there, even when the data is correct, and that should be avoided if at all possible. > > I've had that, and I've left it quoted above. > > But as I've said before: that's one case out of many? How likely is it that the crash happens exactly after a large data block has been written, followed by a relatively tiny amount of checksum data? I'd assume it's far more likely that the crash happens during writing the data. > Except that the whole metadata block pointing to that data block gets rewritten, not just the checksum. But that's the case anyway, isn't it? With or without checksums. > > And regarding "reporting data to be in error, which is actually correct"... isn't that what all journaling systems may do? > No, most of them don't actually do that. The general design of a journaling filesystem is that
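The false-positive scenario being debated can be illustrated with a toy checksum check: plain CRC via POSIX cksum standing in for btrfs's real checksumming. A torn or garbage stored checksum is indistinguishable from corrupt data:

```shell
data="hello world"                                        # stand-in data block
stored=$(printf '%s' "$data" | cksum | awk '{print $1}')  # pretend on-disk csum
verify() {  # compare a block against its stored checksum
    crc=$(printf '%s' "$1" | cksum | awk '{print $1}')
    if [ "$crc" = "$2" ]; then echo match; else echo MISMATCH; fi
}
verify "$data" "$stored"      # match: data and checksum both intact
verify "$data" "${stored}9"   # MISMATCH: checksum torn, yet the data is fine
```

The second call is exactly the torn-checksum crash window: the verifier reports a data error even though only the checksum write was interrupted.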
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
Qu Wenruo posted on Wed, 16 Dec 2015 09:36:23 +0800 as excerpted: > David Sterba wrote on 2015/12/14 18:32 +0100: >> On Thu, Dec 10, 2015 at 10:34:06AM +0800, Qu Wenruo wrote: >>> Introduce a new mount option "nologreplay" to co-operate with "ro" >>> mount option to get real readonly mount, like "norecovery" in ext* and >>> xfs. >>> >>> Since the new parse_options() need to check new flags at remount time, >>> so add a new parameter for parse_options(). >>> >>> Signed-off-by: Qu Wenruo>>> Reviewed-by: Chandan Rajendra >>> Tested-by: Austin S. Hemmelgarn >> >> I've read the discussions around the change and from the user's POV I'd >> suggest to add another mount option that would be just an alias for any >> mount options that would implement the 'hard-ro' semantics. >> >> Say it's called 'nowr'. Now it would imply 'nologreplay', but may cover >> more options in the future. >> >> mount -o ro,nowr /dev/sdx /mnt >> >> would work when switching kernels. >> >> > That would be nice. > > I'd like to forward the idea/discussion to filesystem ml, not only btrfs > maillist. > > Such behavior should better be coordinated between all(at least xfs and > ext4 and btrfs) filesystems. > > One sad example is, we can't use 'norecovery' mount option to disable > log replay in btrfs, as there is 'recovery' mount option already. > > So I hope we can have a unified mount option between mainline > filesystems. FWIW, I was just reading the mount manpage in connection with a reply for a different thread, and noticed... mount (8) (from util-linux 2.27.1) says noload and norecovery are the same option, for ext3/4 at least. It refers to the xfs (5) manpage, from xfsprogs, for xfs mount options, and that's not installed here, so I can't confirm noload for it, but it's there for ext3/4. And noload doesn't have the namespace collision problem norecovery does on btrfs, so I'd strongly suggest using it, at least as an alias for whatever other btrfs-specific name we might choose. 
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On Wed, 2015-12-16 at 11:10 +, Duncan wrote: > And noload doesn't have the namespace collision problem norecovery > does > on btrfs, so I'd strongly suggest using it, at least as an alias for > whatever other btrfs-specific name we might choose. but noload is, AFAIU, not what's desired here, is it? Per manpage it's "Don't load the journal on mounting",... not only wouldn't that fit for btrfs, it's also not what's really desired, i.e. an option that implies everything necessary to not modify the device. Cheers, Chris.
Re: attacking btrfs filesystems via UUID collisions?
On Tue, 2015-12-15 at 08:54 -0500, Austin S. Hemmelgarn wrote:
> Except for one thing: Automobiles actually provide a measurable
> significant benefit to society. What specific benefit does embedding
> the filesystem UUID in the metadata actually provide?

I guess that's quite obvious. You want something that can be used to address devices stably (i.e. not by their "path" like sda, sdb). So either some ID or a label. Human-readable labels are basically guaranteed to collide, so UUIDs are the clean solution. Since there is however no guarantee that they don't collide (either by accident or malicious intent), you need to protect against that. The same goes for the device IDs of multi-device filesystems or containers.

> > "A UUID is 128 bits long, and can guarantee uniqueness across space
> > and time."
> >
> > Also see security considerations in section 6.
> Both aspects ignore the facts that:
> Version 1 is easy to cause a collision with (MAC addresses are by no
> means unique, and are easy to spoof, and so are timestamps).
> Version 2 is relatively easy to cause a collision with, because UID and
> GID numbers are a fixed-size namespace.
> Version 3 is slightly better, but still not by any means unique because
> you just have to guess the seed string (or a collision for it).
> Version 4 is probably the hardest to get a collision with, but only if
> you are using a true RNG, and even then, 122 bits of entropy is not much
> protection.
> Version 5 has the same issues as Version 3, but is more secure against
> hash collisions.

I guess we don't need to discuss how unique UUIDs are when they're *freshly created*, since that is the only thing the RFC "guarantees"... That's mostly irrelevant for us here, as we have two far stronger cases: accidental duplication and malicious collisions. The chance that a UUID collision occurs by normal means (e.g. mkfs.btrfs) is small, but solving the two actual cases here will solve that one as well.
Apart from that, I've noticed in several of your mails that either something with the indentation level goes wrong, or you mix contents from multiple mails from different people. E.g. the "Also see security considerations in section 6." wasn't from me, though it was at quotation level 1 in your mail, but the example with the automobile, which was also on level 1, was from me. That's kinda confusing...

> In general, you should only use UUID's when either:
> a. You have absolutely 100% complete control of the storage of them,
> such that you can guarantee they don't get reused.
> b. They can be guaranteed to be relatively unique for the system
> using them.

No, these aren't necessary constraints. And in fact they would make multi-device practically impossible (you always need some ID, unless you want to open the door for countless errors where people wrongly assemble their devices... whether it's a UUID or anything else doesn't matter). The only thing that one needs to do is handle collisions gracefully and not do auto-assemblies... all as I've described in the mail from "Fri, 11 Dec 2015 23:06:03 +0100" (http://thread.gmane.org/gmane.comp.file-systems.btrfs/50909/focus=51147)

> > There could be some leveraging of the device WWN, or absent that its
> > serial number, propagated into all of the volume's devices (cross
> > referencing each other's devid to WWN or serial). And then that way
> > there's a way to differentiate. In the dd case, there would be
> > mismatching real device WWN/serial number and the one written in
> > metadata on all drives, including the copy. This doesn't say what
> > policy should happen next, just that at least it's known there's a
> > mismatch.
>
> That gets tricky too, because for example you have stuff like flat files
> used as filesystem images.

Plus... one cannot be sure whether any hardware device IDs, like serial numbers, are unique... a powerful attacker could surely change these as well.
Or imagine you have a failing harddisk and dd its contents to another... the btrfs part would stay identical, while the hardware device IDs change and confuse everything.

> However, if we then use some separate UUID (possibly hashed off of the
> file location) in place of the device serial/WWN, that could
> theoretically provide some better protection.

Not really... it just delegates the problem one level further. The only real protection is that the kernel and userland tools deal correctly with the situation.

> The obvious solution in the case of a mismatch would be to refuse the
> mount until either the issue is fixed using the tools, or the user
> specifies some particular mount option to either fix it automatically,
> or ignore copies with a mismatching serial.

Sure, as I've said before :-)

Cheers, Chris
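As a side note, the version-4 ("random") UUIDs debated above can be generated straight from the kernel, which exposes a generator at /proc/sys/kernel/random/uuid. A minimal sketch (assumes a Linux system with procfs mounted; whether a given mkfs uses v4 is a detail of that tool, not something this demonstrates):

```shell
#!/bin/sh
# Read two freshly generated version-4 UUIDs from the kernel.
u1=$(cat /proc/sys/kernel/random/uuid)
u2=$(cat /proc/sys/kernel/random/uuid)

# Each is 36 characters: 32 hex digits plus 4 hyphens.
echo "u1=$u1"
echo "u2=$u2"

# With only 122 random bits, uniqueness is probabilistic, not guaranteed:
# two draws colliding is astronomically unlikely, but nothing enforces it,
# and nothing stops dd or an attacker from presenting a duplicate.
[ "$u1" != "$u2" ] && echo "distinct"
```

Which is exactly the thread's point: fresh UUIDs essentially never collide on their own, so the cases that need handling are duplication (dd, restored images) and malice.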
Re: attacking btrfs filesystems via UUID collisions?
On Tue, 2015-12-15 at 14:18 +, Hugo Mills wrote:
> That one's easy to answer. It deals with a major issue that
> reiserfs had: if you have a filesystem with another filesystem image
> stored on it, reiserfsck could end up deciding that both the metadata
> blocks of the main filesystem *and* the metadata blocks of the image
> were part of the same FS (because they're on the same block device),
> and so would splice both filesystems into one, generally complaining
> loudly along the way that there was a lot of corruption present that
> it was trying to fix.

Hmm, that's a bit strange though, and to me it rather sounds like other bugs... You can have an ext4 on a file in an ext4, with or without the same UUIDs, and it will just work. If the filesystem takes contents from a normal file as possible metadata, then something else is severely screwed up... or in the case of the fsck: it probably means it's a bit too liberal in searching places.

I'd be quite shocked if this were the case in btrfs, cause it would mean, again, that we have a vulnerability against UUID collisions. Imagine some attacker finds out the UUID of a filesystem (which is probably rather easy)... next he uploads some file (e.g. it's a webserver which allows image uploads, a forum perhaps) that in reality contains what looks like btrfs metadata and uses a matching UUID. It would run into the same issues as what you describe for reiserfs... the UUID would be no real help to solve that problem.

Does anyone know whether btrfsck (or other userland) tools do such things? I.e. search more or less arbitrary blocks, where it cannot be sure they're *not* data, for what it would interpret as metadata subsequently?

Cheers, Chris.
Re: attacking btrfs filesystems via UUID collisions?
On Tue, 2015-12-15 at 11:03 -0500, Austin S. Hemmelgarn wrote:
> May I propose the alternative option of adding a flag to tell mount to
> _only_ use the devices specified in the options?

That's one part of exactly what I've been proposing for a few days :-P (no one seems to read my mails ;-) ) Plus that this shouldn't be the case only for mounts, but also fsck, repair, and all other userland tool operations.

But it's only part of the solution to the whole problem; the other part is that automatic device activations/rebuilds/etc. of _already active_ devices should generally not happen (manual ones of course may happen, again with device= options, specifying *which* devices are actually meant).

See my mail from "Fri, 11 Dec 2015 23:06:03 +0100" (http://thread.gmane.org/gmane.comp.file-systems.btrfs/50909/focus=51147) which I think covers pretty much all cases.

Cheers, Chris.
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On 2015-12-16 06:10, Duncan wrote:
> Qu Wenruo posted on Wed, 16 Dec 2015 09:36:23 +0800 as excerpted:
>> David Sterba wrote on 2015/12/14 18:32 +0100:
>>> On Thu, Dec 10, 2015 at 10:34:06AM +0800, Qu Wenruo wrote:
>>>> Introduce a new mount option "nologreplay" to co-operate with "ro"
>>>> mount option to get real readonly mount, like "norecovery" in ext*
>>>> and xfs.
>>>>
>>>> Since the new parse_options() needs to check new flags at remount
>>>> time, add a new parameter for parse_options().
>>>>
>>>> Signed-off-by: Qu Wenruo
>>>> Reviewed-by: Chandan Rajendra
>>>> Tested-by: Austin S. Hemmelgarn
>>>
>>> I've read the discussions around the change and from the user's POV
>>> I'd suggest to add another mount option that would be just an alias
>>> for any mount options that would implement the 'hard-ro' semantics.
>>>
>>> Say it's called 'nowr'. Now it would imply 'nologreplay', but may
>>> cover more options in the future.
>>>
>>> mount -o ro,nowr /dev/sdx /mnt
>>>
>>> would work when switching kernels.
>>
>> That would be nice.
>>
>> I'd like to forward the idea/discussion to filesystem ml, not only
>> btrfs maillist.
>>
>> Such behavior should better be coordinated between all (at least xfs,
>> ext4 and btrfs) filesystems.
>>
>> One sad example is, we can't use the 'norecovery' mount option to
>> disable log replay in btrfs, as there is a 'recovery' mount option
>> already.
>>
>> So I hope we can have a unified mount option between mainline
>> filesystems.
>
> FWIW, I was just reading the mount manpage in connection with a reply
> for a different thread, and noticed... mount (8) (from util-linux
> 2.27.1) says noload and norecovery are the same option, for ext3/4 at
> least. It refers to the xfs (5) manpage, from xfsprogs, for xfs mount
> options, and that's not installed here, so I can't confirm noload for
> it, but it's there for ext3/4.

Unless it's undocumented, XFS doesn't have it (as much as I hate XFS, I have to have xfsprogs installed so that I can do recovery for the few systems at work that actually use it if the need arises).
> And noload doesn't have the namespace collision problem norecovery does
> on btrfs, so I'd strongly suggest using it, at least as an alias for
> whatever other btrfs-specific name we might choose.

I kind of agree with Christoph here. I don't think that noload should be what we actually use, although I do think having it as an alias for whatever name we end up using would be a good thing.
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On Mon, Dec 14, 2015 at 12:50:37PM -0500, Austin S. Hemmelgarn wrote:
> On 2015-12-14 12:32, David Sterba wrote:
> > On Thu, Dec 10, 2015 at 10:34:06AM +0800, Qu Wenruo wrote:
> >> Introduce a new mount option "nologreplay" to co-operate with "ro" mount
> >> option to get real readonly mount, like "norecovery" in ext* and xfs.
> >>
> >> Since the new parse_options() need to check new flags at remount time,
> >> so add a new parameter for parse_options().
> >>
> >> Signed-off-by: Qu Wenruo
> >> Reviewed-by: Chandan Rajendra
> >> Tested-by: Austin S. Hemmelgarn
> >
> > I've read the discussions around the change and from the user's POV I'd
> > suggest to add another mount option that would be just an alias for any
> > mount options that would implement the 'hard-ro' semantics.
> >
> > Say it's called 'nowr'. Now it would imply 'nologreplay', but may cover
> > more options in the future.
> It should also imply noatime. I'm not sure how BTRFS handles atime when
> mounted RO, but I know a lot of old UNIX systems updated atime even on
> filesystems mounted RO, and I know that at least at one point Linux did too.

A mount with -o ro will not touch atimes. At one point the read-only snapshots changed atimes, but this has been fixed since.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/inode.c#n1602
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/inode.c#n5973

> > mount -o ro,nowr /dev/sdx /mnt
> >
> > would work when switching kernels.
> I like this idea, but I think that having a name like true-ro or hard-ro
> and making it imply ro (and noatime) would probably be better (or at
> least, simpler to use from a user perspective).

Ok, a single option to do the real-ro sounds better than ro,something.
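The umbrella-option idea discussed above can be sketched as a tiny option-expander. Note that 'nowr' is only a name proposed in this thread, not a real btrfs mount option; 'nologreplay' is the option the patch actually introduces, and the noatime expansion follows Austin's suggestion:

```shell
#!/bin/sh
# Sketch: expand a hypothetical 'nowr' umbrella option into the concrete
# flags a hard-read-only mount would need (ro + nologreplay + noatime).
expand_opts() {
    out=""
    old_ifs=$IFS; IFS=,
    for opt in $1; do
        case $opt in
            nowr) out="$out,ro,nologreplay,noatime" ;;  # umbrella alias
            *)    out="$out,$opt" ;;                    # pass through
        esac
    done
    IFS=$old_ifs
    echo "${out#,}"   # drop the leading comma
}

expand_opts "nowr,compress"   # -> ro,nologreplay,noatime,compress
```

A real implementation would live in the kernel's parse_options() (or in mount(8)) rather than a wrapper, but the mapping is the whole idea: one user-facing name, however many write-avoiding flags it takes underneath.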
Re: need to recover large file
Langhorst, Brad posted on Wed, 16 Dec 2015 03:13:48 + as excerpted:

> Hi:
>
> I just screwed up... spent the last 3 weeks generating a 400G file
> (genome assembly).
> Went to back it up and swapped the arguments to tar (tar Jcf my_precious
> my_precious.tar.xz)
> what was once 400G is now 108 bytes of xz header - argh.
>
> This is on a 6-volume btrfs filesystem.
>
> I immediately unmounted the fs (had to cd / first).
>
> After a bit of searching I found Chris Mason's post about using
> btrfs-debug-tree -R

[Snip most of the result as I'm not familiar with this utility. But it ends with...]

> Btrfs v3.12

[...]

> I also read that one can use btrfs-find-root to get a list of files to
> recover and just ran btrfs-find-root on one of the underlying disks but
> I get an error "Super think's the tree root is at 25606900367360,
> chunk root 25606758596608
> Went past the fs size, exiting

WTF? I thought that bug was patched a long time ago. How old a btrfs-progs are you using, anyway?

*OH!* *3.12*

Why are you still using 3.12? That's nigh back in the dark ages as far as btrfs is concerned. AFAIK, btrfs was either still labeled experimental back then, or 3.12 was the first version where experimental was stripped, so it's a very long time ago indeed, particularly for a filesystem that, while no longer experimental, is still under heavy development, with bugs being fixed in every release. The patches don't always reach old-stable (tho since 3.12, for the kernel anyway, they do try), so if you're running old releases, you're running code that's known to be buggy and known to have fixes for those bugs in newer releases.
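For anyone else reading: the mishap above comes from tar's argument order. With the 'f' flag, the very next argument is the archive tar will *write* (truncating it if it exists), so `tar Jcf my_precious my_precious.tar.xz` truncated the 400G file. A sketch of the safe order, with stand-in filenames:

```shell
#!/bin/sh
# tar's 'f' flag consumes the NEXT argument as the archive to CREATE,
# truncating it if it already exists.  So
#     tar Jcf my_precious my_precious.tar.xz      # WRONG
# destroys 'my_precious' by writing an archive into it.
# The safe order is archive first, members after:
echo "three weeks of genome assembly" > my_precious   # stand-in data file
tar -cf my_precious.tar my_precious    # add J (and a .xz suffix) for xz
ls -l my_precious my_precious.tar      # original file survives intact
```

Using explicit dashed options (`tar -cJf archive.tar.xz file`) and double-checking that the name after `-f` ends in .tar.xz is cheap insurance against exactly this class of mistake.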
The general list recommendation for the kernel, unless you have a known reason (like an already reported but still unfixed bug in newer), is to run the latest or next-to-latest release series of either current, which would ATM be 4.3 or 4.2, as 4.4 isn't out yet (tho it's getting close), or LTS, which would be 4.1 or 3.18 (tho 4.4 will be LTS too, and it's getting close, so you should already be preparing to upgrade to at least 4.1 if you're on the LTS series). The coverage of the penultimate series gives you some time to upgrade to the latest, since the penultimate series is covered too.

For runtime, the kernel code is generally used (userspace mostly just makes calls to the kernel and lets it do the work), so kernel code is most important. However, once you're trying to recover things, basically when you're working with the unmounted filesystem, userspace code is used, so having something reasonably current there becomes important.

As a rule of thumb, then, running a btrfs-progs userspace of at least the same release series as the kernel you're running is recommended (tho newer is fine), since the kernel and userspace were developed at about the same time and with the same problems in mind, and if you're keeping up with the kernel series recommendation, that means your userspace isn't getting /too/ old. But even then, once you're trying to do a btrfs recovery with those tools, a recommendation to try the latest current can be considered normal, since it'll be able to detect and usually fix the latest problems.

So a 3.18 series kernel and at least a 3.18 series userspace would be the minimum recommended, and indeed, for a filesystem like btrfs that is still stabilizing and not yet fully stable and mature, something quite a bit newer is quite reasonable.
While some people have reason to use particularly old and demonstrated-stable versions, and enterprise distros generally cater to this need with support for up to a decade, using a still new and maturing btrfs is incompatible with a need for old and demonstrated stable. So in that case, from the viewpoint of this list at least, if you're looking for that old and stable, you really should be using a different filesystem, as btrfs simply isn't that old and stable, yet.

Meanwhile, while that's the view of upstream btrfs and thus this upstream list, some distros nevertheless choose to support old btrfs, backporting patches, etc., as they consider necessary. However, that's their support then, and their business. If you're trusting them for that support, really, you should be contacting them for it, as this list really isn't in the best position to supply that sort of support. Yes, they may be using an old-in-number kernel and perhaps userspace, with newer patches backported to it, but it's the distro that makes those choices and knows what patches it's backporting, and thus is in the best position to support it. Not that we on the list won't try, but we're simply not in a good position to provide support that far back, as we've long since moved on, and neither do we track what distros have backported and what they haven't, etc.

So, basically, you have four choices:

1) Follow list recommendations and upgrade to something that isn't out of the dark ages in terms of btrfs history.

2) Follow the presumed
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On 2015-12-16 07:34, Christoph Anton Mitterer wrote:
> On Wed, 2015-12-16 at 07:12 -0500, Austin S. Hemmelgarn wrote:
>> I kind of agree with Christoph here. I don't think that noload should
>> be what we actually use, although I do think having it as an alias
>> for whatever name we end up using would be a good thing.
> No, because people would start using it, getting used to it, and in 4
> years we could never change it again,... which may be necessary...

No, because we should ease the transition from other filesystems to the greatest extent reasonably possible. It should be clearly documented as an alias for compatibility with ext{3,4}, and that it might go away in the future.
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On Wed, Dec 16, 2015 at 09:36:23AM +0800, Qu Wenruo wrote:
> > > mount -o ro,nowr /dev/sdx /mnt
> > >
> > > would work when switching kernels.
>
> That would be nice.
>
> I'd like to forward the idea/discussion to filesystem ml, not only btrfs
> maillist.

Good idea.

> Such behavior should better be coordinated between all (at least xfs and
> ext4 and btrfs) filesystems.
>
> One sad example is, we can't use 'norecovery' mount option to disable
> log replay in btrfs, as there is 'recovery' mount option already.

I think we should pick a name that's not tied to the implementation of how the potential writes could happen under a RO mount. Recovery/replay/whatever, the expected use is "avoid any writes".

> So I hope we can have a unified mount option between mainline filesystems.

That would be a good thing indeed.
Re: Still not production ready
On Tue, Dec 15, 2015 at 06:30:58PM -0800, Liu Bo wrote:
> On Wed, Dec 16, 2015 at 10:19:00AM +0800, Qu Wenruo wrote:
> > > max_stripe_size is fixed at 1GB and the chunk size is stripe_size *
> > > data_stripes, may I know how your partition gets a 10GB chunk?
> >
> > Oh, it seems that I remembered the wrong size.
> > After checking the code, yes you're right.
> > A stripe won't be larger than 1G, so my assumption above is totally wrong.
> >
> > And the problem is not in the 10% limit.
> >
> > Please forget it.
>
> No problem, glad to see people talking about the space issue again.

You can still end up with larger block groups if you have a lot of drives. We've had different problems with that in the past, but it is limited now to 10G. At any rate, if things are still getting badly out of balance we need to tweak the allocator some more. It's hard to reproduce because you need a burst of allocations for whatever type is full. I'll give it another shot.

-chris
Re: attacking btrfs filesystems via UUID collisions?
On Wed, Dec 16, 2015 at 01:03:38PM +0100, Christoph Anton Mitterer wrote:
> On Tue, 2015-12-15 at 14:18 +, Hugo Mills wrote:
> > That one's easy to answer. It deals with a major issue that
> > reiserfs had: if you have a filesystem with another filesystem image
> > stored on it, reiserfsck could end up deciding that both the metadata
> > blocks of the main filesystem *and* the metadata blocks of the image
> > were part of the same FS (because they're on the same block device),
> > and so would splice both filesystems into one, generally complaining
> > loudly along the way that there was a lot of corruption present that
> > it was trying to fix.
> Hmm that's a bit strange though, and to me it rather sounds like other
> bugs...
> You can have an ext4 on a file in an ext4, with or without the same
> UUIDs, and it will just work.

Hugo is right here. reiserfs had tools that would scan an entire block device for metadata blocks and try to reconstruct the filesystem based on what they found. Since there was no uuid, it was impossible to tell if a block from the scan was really part of this filesystem or part of some image file that happened to be sitting there.

Adding UUIDs doesn't make that whole class of problem go away (you could have an image of the filesystem inside that filesystem), but it does make it dramatically less likely. At the end of the day it's just a best-practice mechanism to help recovery and prevent admin mistakes. It's also a building block of the multi-device support. We could change the multi-device support to allow duplicate uuids in single-device filesystems. But I'd much rather see a variation on seed devices enable transitioning from one uuid to another.

> If the filesystem takes contents from a normal file as possible
> metadata, then something else is severely screwed up... or in case of
> the fsck: it probably means it's a bit too liberal in searching places.
> I'd be quite shocked if this is the case in btrfs, cause it would mean
> again, that we have a vulnerability against UUID collisions.
> Imagine some attacker finds out the UUID of a filesystem (which is
> probably rather easy)... next he uploads some file (e.g. it's a
> webserver which allows image uploads, a forum perhaps) that in reality
> contains what looks like btrfs metadata and uses a matching UUID.
>
> It would run into the same issues as what you describe for reiser...
> the UUID would be no real help to solve that problem.
>
> Does anyone know whether btrfsck (or other userland) tools do such
> things? I.e. search more or less arbitrary blocks, where it cannot be
> sure it's *not* data, for what it would interpret as meta-data
> subsequently?

These are emergency tools; btrfs restore and find-roots can do some scanning. We don't do it the way reiserfs did because it would be very difficult to reconstruct shared data and metadata from snapshots.

-chris
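To make the "scanning" discussed above concrete: the primary btrfs superblock sits at a fixed 64 KiB offset, with the magic string "_BHRfS_M" 64 bytes into it (absolute offset 65600). A minimal sketch of the kind of signature scan such emergency tools build on, run against a throwaway image rather than a real device:

```shell
#!/bin/sh
# Minimal signature scan: find a btrfs superblock magic in a disk image.
# Real tools like btrfs-find-root do far more validation (checksums,
# generation numbers, tree linkage) than a bare string match.

# Build a throwaway 128 KiB image and plant the magic where mkfs would.
dd if=/dev/zero of=fake.img bs=1024 count=128 2>/dev/null
printf '_BHRfS_M' | dd of=fake.img bs=1 seek=65600 conv=notrunc 2>/dev/null

# grep -a treats the binary as text, -b prints the byte offset of the
# match, -o prints only the matched string.
grep -aob '_BHRfS_M' fake.img
```

This also illustrates the thread's worry: a bare signature scan cannot tell a real superblock from the same bytes sitting inside an ordinary data file, which is exactly why checksum and generation validation on top of the match matters.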
Re: attacking btrfs filesystems via UUID collisions?
On Wed, 2015-12-16 at 09:41 -0500, Chris Mason wrote:
> Hugo is right here. reiserfs had tools that would scan an entire block
> device for metadata blocks and try to reconstruct the filesystem based
> on what it found.

Creepy... at least when talking about a "normal" fsck... good that btrfs is going to be the next-gen-ext, and not reiser4 ;)

> Adding UUIDs doesn't make that whole class of problem go away (you could
> have an image of the filesystem inside that filesystem), but it does
> make it dramatically less likely.

Sure...

> > Does anyone know whether btrfsck (or other userland) tools do such
> > things? I.e. search more or less arbitrary blocks, where it cannot be
> > sure it's *not* data, for what it would interpret as meta-data
> > subsequently?
>
> These are emergency tools, btrfs restore and find-roots can do some
> scanning. We don't do it the way reiserfs did because it would be very
> difficult to reconstruct shared data and metadata from snapshots.

Hmm, I agree that it's valid for such tools to do these kinds of scans (i.e. scan for metadata in places that are not known for sure to be metadata) when doing some last-resort rescue tries... or for rescue operations where it's clearly documented that this is done.

But I think it shouldn't happen e.g. during a normal fsck - only when special options are given. And it should be properly documented (i.e. telling people in the docs that this does a block-for-block scan for metadata even within normal data, and that if they had e.g. another fs with the same UUIDs within, the results may be completely bogus).

Cheers, Chris.
Re: attacking btrfs filesystems via UUID collisions?
On Tue, 2015-12-15 at 14:42 +, Hugo Mills wrote:
> I would suggest trying to migrate to a state where detecting more
> than one device with the same UUID and devid is cause to prevent the
> FS from mounting, unless there's also a "mount_duplicates_yes_i_
> know_this_is_dangerous_and_i_know_what_im_doing" mount flag present,
> for the multipathing people. That will break existing userspace
> behaviour for the multipathing case, but the migration can probably be
> managed. (e.g. NFS has successfully changed default behaviour for one
> of its mount options in the last few(?) years).

I don't think that a single mount option a la "force-and-do-it" is a proper solution here. It would still open surface for attacks and also for accidents.

In the case multipathing is used, the only realistic way seems to be manually specifying the devices a la device=/dev/sda,/dev/sdb. Of course btrfs would still use the UUIDs/deviceIDs of these, but *only* of those devices that have been whitelisted with the device= option.

In the case of a general "mount_duplicates_yes_iknow_th..." option you could end up with having e.g. three duplicates, two being actual multi-paths, and the third one being a losetup or USB clone of the image... again allowing for the aforementioned attacks to happen, and again allowing for severe corruption to occur.
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On Wed, 2015-12-16 at 07:12 -0500, Austin S. Hemmelgarn wrote:
> I kind of agree with Christoph here. I don't think that noload should
> be what we actually use, although I do think having it as an alias
> for whatever name we end up using would be a good thing.

No, because people would start using it, getting used to it, and in 4 years we could never change it again... which may be necessary...

noload seems to mean "don't load the journal". Unless btrfs gets a journal in the sense xfs/ext has one, it simply should either not use that name at all... or not try to "map" it to something of its own which is similar, but in reality not the same.

Chris.
Re: attacking btrfs filesystems via UUID collisions?
On Tue, 2015-12-15 at 09:19 -0500, Austin S. Hemmelgarn wrote:
> Um, no you don't have direct physical access to the hardware with an
> ATM, at least, not unless you are going to take apart the cover and
> anything else in your way (and probably set off internal alarms).

Well, access to the service ports (which may be USB) is typically much easier, and doesn't require completely dismantling the steel and so on... simply because service teams also need to access these "regularly". But even if we don't count ATMs here, take any other publicly accessible computer terminal: library computers, the entertainment systems in airplanes, TVs in a shopping centre, etc. pp.

> And even without that, it's still possible to DoS an ATM without much
> effort. Most of them have a 3.5mm headphone jack for TTS for people
> with poor vision, and that's more than enough to overload at least part
> of the system with a relatively simple to put together bit of
> electronics that would cost you less than 10 USD.

As I've said before... you always find another weak link, of course... as it was pointed out before, USB itself is quite a security problem (firmware attacks and the like). But just because there are other issues right now, there is no justification to make btrfs "weak" as well... because this just leads to the vicious circle that everyone has security issues, no one willing to solve them, pointing to others as an excuse.

> > Imagine you're running a VM hosting service, where you allow users to
> > upload images and have them deployed.
> > In the "cheap" case these will end up as regular files, where they
> > couldn't do any harm (even with colliding UUIDs)... but even there one
> > would have to expect that the hypervisor admin may losetup them for
> > whichever reason.
> > But if you offer more professional services, you may give your clients
> > e.g. direct access to some storage backend, which is then probably
> > also seen on the host by its kernel.
> > And here we already have the case that a client could remotely
> > trigger such a collision.
> In that particular situation, it's not relevant unless the host admin
> goes to mount them. UUID collisions are only an issue if the
> filesystems get mounted.

Hmm, from the impression I got so far, it was not only a problem when actually mounting... but even if, this doesn't change the situation. Same problem as before: the host system may have btrfs filesystems whose IDs have leaked, the attacker may upload them as VM images as described above, and even if the host's admin doesn't want to mount those, he may mount what he considers his filesystems, which however also collide. Boom. Same issues as before. Turn it as you want, resistance is futile ;-)

> > Well, I think it's a proper way to e.g. handle the multi-device case.
> > You have n devices, you want to differ them... using a pseudo-random
> > UUID is surely better than giving them numbers.
> That's debatable, the same issues are obviously present in both cases
> (individual numbers can collide too).

Sure, as I've said. You always must handle the case of accidentally or maliciously colliding IDs if you count on data integrity and security. But using UUIDs makes the chances at least small that you run into collisions (that users must then manually resolve somehow) *even when* you just create fresh filesystems, with no attacker and no dd or the like getting in your way.

> > Same for the fs UUID, e.g. when used for mounting devices whose paths
> > aren't stable.
> In the case of a sanely designed system using LVM for example, device
> paths are stable.

Well, but LVM itself works with UUIDs again, so you just delegate the problem. And apart from that, with btrfs, I thought we rather want to avoid using LVM below.

> > As said before, using the UUID isn't the problem - not protecting
> > against collisions is.
> No, the issues are:
> 1.
> We assume that the UUID will be unique for the life of the
> filesystem, which is not a safe assumption.
> 2. We don't sanely handle things if it isn't unique.

Well, isn't that what I've said? At least it's what I've meant ;)

> > Well,... I don't think that writing *into* the filesystem is covered
> > by common practise anymore.
> For end users, I agree. Part of the discussion involves attacks on the
> system, and for an attacker it's not a far stretch to write directly to
> the block device if possible (and it's even common practice for
> bypassing permission checks done in the VFS layer).

Well, but that's something else, which I don't think we can cover. What we must assume is that devices show up with colliding IDs, either by "accident" or by means like dd... or by an attacker somehow being able to make them show up (USB, the image upload scenarios I've described before, and so on). If the attacker can however write to *arbitrary* (and not just "his") devices, bypassing checks in the VFS layer or anything else... well than
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On Wed, 2015-12-16 at 07:57 -0500, Austin S. Hemmelgarn wrote:
> No, because we should ease the transition from other filesystems to
> the greatest extent reasonably possible. It should be clearly
> documented as an alias for compatibility with ext{3,4}, and that it
> might go away in the future.

Where's the need for an easy migration path? Such an option wouldn't be
used by default, and even if it were, people would need to change their
fstab/etc. anyway (ext->btrfs).

The note that things will go away never really works out that easily...
people start to use it, rely on it... and sooner or later you have a
situation as with atime, where you basically never can get rid of it.
IMHO an alias here would be just ambiguous.

Cheers,
Chris.
Re: dear developers, can we have notdatacow + checksumming, plz?
Austin S. Hemmelgarn posted on Tue, 15 Dec 2015 11:00:40 -0500 as
excerpted:

> And in particular, the only journaling filesystem that I know of that
> even allows the option of journaling the file contents instead of
> just metadata is ext4.

IIRC, ext3 was the first to have it in Linux mainline, with
data=writeback for the speed freaks that don't care about data loss,
data=ordered as the default normal option (except for that infamous
period when Linus lost his head and let people talk him into switching
to data=writeback, despite the risks... he later came back to his
senses and reverted that), and data=journal for the folks that were
willing to trade a bit of speed for better data protection (tho it was
famous for surprising everybody, in that in certain use-cases it was
extremely fast, faster than data=writeback, something I don't think was
ever fully explained). To my knowledge ext3 still has that, tho I
haven't used it in probably a decade.

Reiserfs has all three data= options as well, with data=ordered the
default, tho it only had data=writeback initially. While I've used
reiserfs for years, it has always been with the default data=ordered
since that was introduced, and I'd be surprised if data=journal had the
same use-case speed advantage that it did on ext3, as it's too
different. Meanwhile, that early data=writeback default is where
reiserfs got its ill repute for data loss, but it had long switched to
data=ordered by default by the time Linus lost his senses and tried
data=writeback by default on ext3. Because I was on reiserfs from the
data=writeback era, I was rather glad most kernel hackers didn't want
to touch it by the time Linus let them talk him into data=writeback on
ext3, and thus left reiserfs (which again had long been data=ordered by
default by then) well enough alone.
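For reference, the three journaling levels discussed above are selected
per mount via the data= option; a hypothetical fstab line requesting
full data journaling on ext4 (device, mountpoint, and pass numbers are
placeholders, not taken from any real system) might look like:

```
# /etc/fstab -- hypothetical example; device and mountpoint are placeholders
/dev/sda2   /var   ext4   data=journal   0   2
```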
But I did help a few people running ext3 trace their new ext3 stability
issues down to that bad data=writeback experiment, and persuaded them
to specify data=ordered, which solved their problems, so indeed they
/were/ data=writeback related. And happily, Linus did eventually regain
his senses and return ext3 to data=ordered by default once again.

And based on what you said, ext4 still has all three data= options,
including data=journal. But I wasn't sure of that myself (tho I would
have assumed it inherited it from ext3) and thus am /definitely/ not
sure whether it inherits ext3's data=journal speed advantages in
certain corner-cases.

I have no idea whether other journaled filesystems allow choosing the
journal level or not, tho. I only know of those three.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs: poor performance on deleting many large files
Lionel Bouton posted on Tue, 15 Dec 2015 03:38:33 +0100 as excerpted:

> I just checked: this has only been made crystal-clear in the latest
> man-pages version 4.03, released 10 days ago.
>
> The mount(8) page of Gentoo's current stable man-pages (4.02, released
> in August) which is installed on my systems states for noatime:
> "Do not update inode access times on this filesystem (e.g., for
> faster access on the news spool to speed up news servers)."

Hmm... I hadn't synced and updated in about that time, and sure enough,
while I've just synced I've not yet updated, and still have man-pages
4.02 installed.

But the mount.8 manpage (.bz2 in my case, as that's the compression I'm
configured for; I had to use man -d mount to debug-dump what file it
was actually loading) actually belongs to util-linux, according to
equery belongs, while equery files man-pages | grep mount only returns
hits for mount.2 (.bz2, and umount). So at least here, it's util-linux
providing the mount(8) manpage, not man-pages.

Tho I'm on ~amd64 and IIRC just updated util-linux in the last update,
so the cross-ref to nodiratime in the noatime entry (saying it isn't
necessary as noatime covers it) probably came from there, or a similar
recent util-linux update.

Let's see... My current util-linux (with the xref in both noatime and
nodiratime to the other, saying nodiratime isn't needed if noatime is
used) is 2.27.1. The oldest version I still have in my binpkg cache
(tho I likely have older on the backup) is util-linux 2.24.2. For
noatime it has the wording you mention, don't update inode access
times, but for nodiratime, it specifically mentions directory inode
access times.

So from util-linux 2.24.2 at least, the information was there, but you
had to read between the lines a bit more, because nodiratime mentions
dir inodes, and noatime says don't update atime on inodes, so it's
there but you have to be a reasonably astute reader to see it.
In between those two I have other versions including 2.26.2 and 2.27.
Looks like 2.27 added both the "implies nodiratime" wording to the
noatime entry, and the nodiratime-unneeded-if-noatime-set notation to
the nodiratime entry. If there was a util-linux 2.26.x beyond x=2, I
apparently never installed it, so the wording likely changed with 2.27,
but may have changed with late 2.26 versions as well, if there were any
beyond 2.26.2. And on Gentoo, 2.26.2 appears to be the latest
stable-keyworded, so that's what stable users would have.

But as I said, the info is there at least as of 2.24.2; you just have
to note that the nodiratime entry says dir inodes, while the noatime
entry simply says inodes, without excluding dir inodes. So it's there,
you just have to be a somewhat astute reader to note it.

Anywhere else, say in on-the-net recommendations for nodiratime, they
/should/ mention that it isn't necessary if noatime is used as well,
but of course not all of them will. (Tho I'd actually find it a bit
strange to see discussion of nodiratime without discussion of noatime
as well, as I'd guess any discussion of just one of the two would
likely be on noatime, leaving nodiratime unmentioned if they're only
covering one, as it shouldn't be necessary to mention, since it's
already included in noatime.)

But there's probably a bunch of folks who originally read coverage of
noatime, then saw nodiratime later, and thought "Oh, that's separate?
Well I want that too!" and simply enabled them both, without actually
checking the manpage or other documentation, including on-the-net
discussion. I know here I originally saw noatime and decided I wanted
it, then was confused when I saw nodiratime sometime later. But I don't
just enable stuff without having some idea what I'm enabling, so I did
my research, and saw noatime implied nodiratime as well, so the only
reason nodiratime might be needed would be if you wanted atime in
general, but not on dirs.
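The rule discussed above, that noatime implies nodiratime, can be
stated in a few lines. This toy helper (the function name is invented
for illustration, and it deliberately ignores relatime and other
refinements; it is not a real mount(8) or util-linux API) encodes it:

```python
# Toy illustration of the rule: the noatime mount option covers
# directories too, so adding nodiratime on top of it is redundant.
# Deliberately ignores relatime/strictatime; illustrative only.

def atime_updates(options):
    """options: iterable of mount option strings.
    Returns (files_get_atime, dirs_get_atime)."""
    opts = set(options)
    if "noatime" in opts:       # noatime implies nodiratime
        return (False, False)
    if "nodiratime" in opts:    # only directory atimes are suppressed
        return (True, False)
    return (True, True)

print(atime_updates(["noatime"]))                # -> (False, False)
print(atime_updates(["noatime", "nodiratime"]))  # -> (False, False), redundant
print(atime_updates(["nodiratime"]))             # -> (True, False)
```

The second call shows why specifying both options buys nothing: the
result is identical to noatime alone.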
-- 
Duncan - List replies preferred. No HTML msgs.
Re: dear developers, can we have notdatacow + checksumming, plz?
Austin S. Hemmelgarn posted on Tue, 15 Dec 2015 11:00:40 -0500 as
excerpted:

> AFAIUI, checksums are stored per-instance for every block. This is
> important in a multi-device filesystem in case you lose a device, so
> that you still have a checksum for the block. There should be no
> difference between extent layout and compression between devices
> however.

I don't believe that's quite correct. What is correct, to the best of
my knowledge, is that checksums are metadata, and thus have whatever
duplication/parity level metadata is assigned.

For single devices, that is of course by default dup, 2X the metadata
and thus 2X the checksums, both on the single data (as effectively the
only choice on a single device, at least thru 4.3, tho there's a patch
adding dup data as an option that I think should be in 4.4) when
covering data, dup metadata when covering it.

For multiple devices, it's default raid1 metadata, default single data,
so the picture doesn't differ much by default from the single-device
default picture. It's also possible to do single metadata, raidN data,
which really doesn't make sense except for raid0 data, and thus I
believe there's a warning about that sort of layout in newer
mkfs.btrfs, or when lowering the metadata redundancy using balance
filters.

But of course it's possible to do raid1 data and metadata, which would
be two copies of each, regardless of the number of devices (except that
it's 2+, of course). But the copies aren't 1:1 assigned. That is, if
they're of equal generation, btrfs can read either checksum and apply
it to either data/metadata block. (Of course if they're not of equal
generation, btrfs will choose the higher one, thus covering the case of
writing at the time of a crash, since either they will both be the same
generation if the root block wasn't updated to the new one on either
one yet, or one will be a higher/newer generation than the other, if it
had already finished writing one but not the other at the time of the
crash.)
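For illustration: the checksums being discussed are, in btrfs of this
era, CRC-32C (Castagnoli) values computed over each block. A minimal
bit-by-bit sketch of that checksum follows; the real kernel code is
table-driven or uses the SSE4.2 crc32 instruction, so this is only to
show what quantity is stored in the checksum tree, not how btrfs
computes it:

```python
# Minimal bit-by-bit CRC-32C (Castagnoli) sketch -- the checksum btrfs
# stores for data and metadata blocks. Illustrative only; real
# implementations are table-driven or hardware-accelerated.

def crc32c(data: bytes, crc: int = 0) -> int:
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # 0x82F63B78 is the reflected CRC-32C polynomial
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard CRC-32C check value for the ASCII string "123456789":
print(hex(crc32c(b"123456789")))  # -> 0xe3069283
```

On a verify, btrfs recomputes this over the block it just read and
compares against the stored value; with dup/raid1 metadata there are
two stored copies of the checksum, either of which can be used, which
is the point made above.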
This is why it's an extremely good idea, if you have a pair of devices
in raid1 and you mount one of them degraded/writable with the other
unavailable for some reason, that you don't also mount the other one
writable and then try to recombine them. Chances are the generations
wouldn't match and it'd pick the one with the higher generation, but if
they did for some reason match, and both checksums were valid on their
data, but the data differed... either one could be chosen, and a scrub
might choose either one to fix the other as well, which could in theory
result in a file with intermixed blocks from the two different
versions!

Just ensure that if one is mounted writable, it's the only one mounted
writable if there's a chance of recombining, and you'll be fine, as
it'll be the only one with advancing generations. And if by some
accident both are mounted writable separately, the best bet is to be
sure and wipe the one, then add it as a new device, if you're going to
reintroduce it to the same filesystem.

Of course this gets a bit more complicated with 3+ device raid1, since
currently there's still only two copies of each block and two copies of
the checksum, meaning there's at least one device without a copy of
each block, and if the filesystem is mounted degraded writable
repeatedly with a random device missing...

Similarly, the permutations can be calculated for the other raid types,
and for mixed raid types like raid6 data (specified) and raid1 metadata
(unspecified, so the default used), but I won't attempt that here.

-- 
Duncan - List replies preferred. No HTML msgs.