Kernel failure - computer freezes
Got this kernel failure today while making a btrfs snapshot.

# uname -r
4.2.0-1-amd64

Dez 16 20:33:14 aldebaran kernel: btrfs: page allocation failure: order:1, mode:0x204020
Dez 16 20:33:14 aldebaran kernel: CPU: 1 PID: 8016 Comm: btrfs Tainted: G U W 4.2.0-1-amd64 #1 Debian 4.2.6-3
Dez 16 20:33:14 aldebaran kernel: Hardware name: Hewlett-Packard HP ProBook 450 G2/2248, BIOS M74 Ver. 01.08 12/12/2014
Dez 16 20:33:14 aldebaran kernel: 0001 8154f6c3 00204020
Dez 16 20:33:14 aldebaran kernel: 8115727f 88014f5fbb00 0001
Dez 16 20:33:14 aldebaran kernel: 0001 88014f5fdfa8 0046
Dez 16 20:33:14 aldebaran kernel: Call Trace:
Dez 16 20:33:14 aldebaran kernel: [] ? dump_stack+0x40/0x50
Dez 16 20:33:14 aldebaran kernel: [] ? warn_alloc_failed+0xcf/0x130
Dez 16 20:33:14 aldebaran kernel: [] ? __alloc_pages_nodemask+0x2b4/0x9e0
Dez 16 20:33:14 aldebaran kernel: [] ? kmem_getpages+0x61/0x110
Dez 16 20:33:14 aldebaran kernel: [] ? fallback_alloc+0x143/0x1f0
Dez 16 20:33:14 aldebaran kernel: [] ? kmem_cache_alloc+0x1eb/0x430
Dez 16 20:33:14 aldebaran kernel: [] ? ida_pre_get+0x5c/0xd0
Dez 16 20:33:14 aldebaran kernel: [] ? get_anon_bdev+0x6d/0xe0
Dez 16 20:33:14 aldebaran kernel: [] ? btrfs_init_free_ino_ctl+0x61/0xa0 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? btrfs_init_fs_root+0x106/0x180 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? btrfs_read_fs_root+0x33/0x40 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? btrfs_get_fs_root.part.49+0x93/0x180 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? memcmp_extent_buffer+0xc1/0x120 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? btrfs_lookup_dentry+0x285/0x500 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? btrfs_lookup+0xe/0x30 [btrfs]
Dez 16 20:33:14 aldebaran kernel: [] ? lookup_real+0x19/0x60
Dez 16 20:33:14 aldebaran kernel: [] ? path_openat+0xa98/0x14c0
Dez 16 20:33:14 aldebaran kernel: [] ? do_set_pte+0x9e/0xd0
Dez 16 20:33:14 aldebaran kernel: [] ? filemap_map_pages+0x21e/0x230
Dez 16 20:33:14 aldebaran kernel: [] ? do_filp_open+0x75/0xd0
Dez 16 20:33:14 aldebaran kernel: [] ? __alloc_fd+0x3f/0x110
Dez 16 20:33:14 aldebaran kernel: [] ? do_sys_open+0x12a/0x200
Dez 16 20:33:14 aldebaran kernel: [] ? system_call_fast_compare_end+0xc/0x6b
Dez 16 20:33:14 aldebaran kernel: Mem-Info:
Dez 16 20:33:14 aldebaran kernel: active_anon:226184 inactive_anon:65166 isolated_anon:0
Dez 16 20:33:14 aldebaran kernel:  active_file:245088 inactive_file:246297 isolated_file:0
Dez 16 20:33:14 aldebaran kernel:  unevictable:192 dirty:1706 writeback:50 unstable:0
Dez 16 20:33:14 aldebaran kernel:  slab_reclaimable:32362 slab_unreclaimable:16012
Dez 16 20:33:14 aldebaran kernel:  mapped:96648 shmem:53953 pagetables:9075 bounce:0
Dez 16 20:33:14 aldebaran kernel:  free:7046 free_pcp:1265 free_cma:0
Dez 16 20:33:14 aldebaran kernel: Node 0 DMA free:13364kB min:32kB low:40kB high:48kB active_anon:252kB inactive_anon:284kB active_file:560kB inactive_file:660kB unevictable:0kB isolated(anon):0kB isolated(file)
Dez 16 20:33:14 aldebaran kernel: lowmem_reserve[]: 0 2125 3337 3337
Dez 16 20:33:14 aldebaran kernel: Node 0 DMA32 free:11232kB min:4624kB low:5780kB high:6936kB active_anon:612872kB inactive_anon:162128kB active_file:616016kB inactive_file:618100kB unevictable:516kB isolated(an
Dez 16 20:33:14 aldebaran kernel: lowmem_reserve[]: 0 0 1212 1212
Dez 16 20:33:14 aldebaran kernel: Node 0 Normal free:3464kB min:2636kB low:3292kB high:3952kB active_anon:291612kB inactive_anon:98252kB active_file:363776kB inactive_file:366212kB unevictable:252kB isolated(ano
Dez 16 20:33:14 aldebaran kernel: lowmem_reserve[]: 0 0 0 0
Dez 16 20:33:14 aldebaran kernel: Node 0 DMA: 31*4kB (UE) 17*8kB (UM) 8*16kB (UE) 5*32kB (UEM) 2*64kB (E) 1*128kB (E) 1*256kB (M) 2*512kB (EM) 1*1024kB (E) 1*2048kB (E) 2*4096kB (UM) = 13348kB
Dez 16 20:33:14 aldebaran kernel: Node 0 DMA32: 2720*4kB (UEM) 30*8kB (UEM) 7*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11232kB
Dez 16 20:33:14 aldebaran kernel: Node 0 Normal: 837*4kB (UEM) 22*8kB (UM) 0*16kB 2*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3588kB
Dez 16 20:33:14 aldebaran kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dez 16 20:33:14 aldebaran kernel: 545434 total pagecache pages
Dez 16 20:33:14 aldebaran kernel: 4 pages in swap cache
Dez 16 20:33:14 aldebaran kernel: Swap cache stats: add 10, delete 6, find 0/0
Dez 16 20:33:14 aldebaran kernel: Free swap = 7812052kB
Dez 16 20:33:14 aldebaran kernel: Total swap = 7812092kB
Dez 16 20:33:14 aldebaran kernel: 892955 pages RAM
Dez 16 20:33:14 aldebaran kernel: 0 pages HighMem/MovableOnly
Dez 16 20:33:14 aldebaran kernel: 33747 pages reserved
Dez 16 20:33:14 aldebaran kernel: 0 pages hwpoisoned
Dez 16 20:33:14 aldebaran kernel: BUG: unable to handle kernel NULL
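For context on the trace above: "order:1" means the allocator needed 2^1 = 2 contiguous 4 KiB pages and couldn't find any. The per-order counts in the "Node 0 Normal" buddy list (837*4kB, 22*8kB, 0*16kB, 2*32kB, ...) are the same numbers /proc/buddyinfo reports. A small sketch of reading such a line; the sample string is reconstructed from the log above, not real buddyinfo output:

```python
# Parse a /proc/buddyinfo-style line: after "Node N, zone NAME" come the
# per-order free-block counts, where order n = a block of 2**n contiguous
# 4 KiB pages.
def free_blocks_at_order(buddyinfo_line, order):
    fields = buddyinfo_line.split()
    counts = [int(x) for x in fields[4:]]  # skip "Node", "0,", "zone", name
    return counts[order]

# Sample reconstructed from the "Node 0 Normal" line in the log:
# plenty of single pages, only 22 order-1 (8 KiB) blocks, nothing bigger
# except two order-3 (32 kB) blocks.
sample = "Node 0, zone Normal 837 22 0 2 0 0 0 0 0 0 0"
print(free_blocks_at_order(sample, 1))   # order-1 blocks still free
```

With counts this low, an order-1 GFP_ATOMIC-style request can easily fail even though hundreds of single 4 KiB pages remain: the failure is about contiguity, not total free memory.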
Re: [GIT PULL] Btrfs fixes for 4.4
On Thu, Dec 10, 2015 at 11:44:50AM +, fdman...@kernel.org wrote:
> From: Filipe Manana
>
> Hi Chris,
>
> Please consider the following fixes for kernel 4.4. Two of them are
> fixes to new issues introduced in the 4.4 merge window and 4.4 release
> candidates. The other one just fixes a warning message that is
> confusing and has made several users wonder if they are supposed to do
> anything or not when we fail to read a space cache.
>
> All these fixes have been previously sent to the mailing list.

Thanks Filipe, I tested these and pushed out, along with my two from this week.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)
On Wed, 2015-12-09 at 16:36 +, Duncan wrote:
> But... as I've pointed out in other replies, in many cases including
> this specific one (bittorrent), applications have already had to
> develop their own integrity management features

Well, let's move discussion of that into the "dear developers, can we have nodatacow + checksumming, plz?" thread, where I showed in one of the more recent threads that bittorrent seems to be about the only thing which uses that by default... while on the VM image front, nothing seems to support it, and on the DB front, some support it but don't use it by default.

> In the bittorrent case specifically, torrent chunks are already
> checksummed, and if they don't verify upon download, the chunk is
> thrown away and redownloaded.

I'm not a bittorrent expert, because I don't use it, but that sounds more like the edonkey model, where - while there are checksums - these are only used until the download completes. Then you have the complete file, any checksum info is thrown away, and the file is again "at risk" (i.e. not checksum protected).

> And after the download is complete and the file isn't being constantly
> rewritten, it's perfectly fine to copy it elsewhere, into a dir where
> nocow doesn't apply.

Sure, but again, nothing the user may automatically do, and there's still the gap between the final verification by the bt software and the time it's copied over. Arguably, that may be very short, but I see no reason to make any breaks in the everything-verified chain on the btrfs side.

> With the copy, btrfs will create checksums, and if you're paranoid you
> can hashcheck the original nocow copy against the new checksummed/cow
> copy, and after that, any on-media changes will be caught by the normal
> checksum verification mechanisms.

As before... of course you're right that one can do this, but nothing that happens by default. And I think that's just one of the nice things btrfs would/should give us.
That the filesystem assures that data is valid, at least in terms of storage device and bus errors (it cannot, of course, protect against memory errors or the like).

> > Hmm doesn't seem really good to me if systemd would do that, cause it
> > then excludes any such files from being snapshot.
>
> Of course if the directories are already present due to systemd
> upgrading from non-btrfs-aware versions, they'll remain as normal
> dirs, not subvolumes.

This is the case here. Well, even if not, because one starts from a fresh system... people may not want that.

> And of course you can switch them around to dirs if you like, and/or
> override the shipped tmpfiles.d config with your own.

... sure, but people may not even notice that. I don't think such a decision is up to systemd. Anyway, since we're btrfs here, not systemd, that shouldn't bother us ;)

> > > and small ones such as the sqlite files generated by firefox and
> > > various email clients are handled quite well by autodefrag, with
> > > that general desktop usage being its primary target.
> >
> > Which is however not yet the default...
>
> Distro integration bug! =:^)

Nah... really not... I'm quite sure that most distros will generally decide against diverging from upstream in such choices.

> > It feels a bit, if there should be some tools provided by btrfs,
> > which tell the users which files are likely problematic and should
> > be nodatacow'ed
>
> And there very well might be such a tool... five or ten years down the
> road when btrfs is much more mature and generally stabilized, well
> beyond the "still maturing and stabilizing" status of the moment.
Hmm, let's hope btrfs isn't finished only when the next-gen default fs arrives ;^)

> But it can be the case that as filesystem fragmentation levels rise,
> free-space itself is fragmented, to the point where files that would
> otherwise not be fragmented as they're created once and never touched
> again, end up fragmented, because there's simply no free-space extents
> big enough to create them in unfragmented, so a bunch of smaller
> free-space extents must be used where one larger one would have been
> used had it existed.

I'm kinda curious what free-space fragmentation actually means here.

Is it simply like this:

+---+---+---+---+
| F | D | F | D |
+---+---+---+---+

where D is data (i.e. files/metadata) and F is free space? In other words, (F)ree space itself is not further subdivided and is only fragmented by the (D)ata extents in between.

Or is it more complex, like this:

+---+---+---+---+---+
| F | F | D | F | D |
+---+---+---+---+---+

where the (F)ree space itself is subdivided into "extents" (not necessarily of the same size), and btrfs couldn't use e.g. the first two F's as one contiguous amount of free space for a larger (D)ata extent of that size:

+---+---+---+---+
| D | D | F | D |
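The difference between the two pictures can be made concrete with a toy allocator. This is purely illustrative, not btrfs's actual free-space handling: the point is that total free space can be ample while no single free extent is large enough for a contiguous allocation.

```python
# Free space modeled as (offset, length) extents, already broken up by
# the data extents between them. Toy units, not real block addresses.
free_extents = [(0, 64), (128, 64), (256, 32)]

def can_allocate_contiguously(free_extents, size):
    # A request succeeds only if one single free extent can hold it;
    # free extents separated by data cannot be merged.
    return any(length >= size for _, length in free_extents)

total_free = sum(length for _, length in free_extents)
print(total_free)                                    # 160 units free overall
print(can_allocate_contiguously(free_extents, 100))  # False: largest piece is 64
```

So in the second picture, a 100-unit write fails (or gets split into smaller extents, i.e. fragments) even though 160 units are free in total.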
Re: btrfs: poor performance on deleting many large files
On Sun, 2015-12-13 at 07:10 +, Duncan wrote:
> > So you basically mean that ro snapshots won't have their atime
> > updated even without noatime?
> > Well I guess that was anyway the recent behaviour of Linux
> > filesystems, and only very old UNIX systems updated the atime even
> > when the fs was set ro.
>
> I'd test it to be sure before relying on it (keeping in mind that my
> own use-case doesn't include subvolumes/snapshots so it's quite
> possible I could get fine details of this nature wrong), but that
> would be my very (_very_! see next) strong assumption, yes.
>
> Because read-only snapshots are used for btrfs-send among other
> things, with the idea being that the read-only will keep them from
> changing in the middle of the send, and ro snapshot atime updates
> would seem to throw that entirely out the window. So I can't imagine
> ro snapshots doing atime updates under any circumstance because I just
> can't see how send could rely on them then, but I'd still test it
> before counting on it.

For those who haven't followed the other threads: I've tried it out, and yes, ro snapshots (as well as ro-mounted btrfs filesystems/subvolumes) don't have their atimes changed on e.g. read.

> AFAIK, the general idea was to eventually have all the (possible, some
> are global-filesystem-scope) subvolume mount options exposed as
> properties, it's just not implemented yet, but I'm not entirely sure
> if that was all /btrfs-specific/ mount options, or included the
> generic ones such as the *atime and no* (noexec/nodev/...) options as
> well. In view of that and the fact that noatime is generic, adding it
> as a specific request still makes sense. Someone with more specific
> knowledge on the current plan can remove it if it's already covered.
Not sure if I had already posted that here, but I did write some of these ideas up and added them to the wiki:
https://btrfs.wiki.kernel.org/index.php?title=Project_ideas=historysubmit=29757=29743

Best wishes,
Chris.
Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)
On Mon, 2015-12-14 at 10:51 +, Duncan wrote:
> > AFAIU, the one that gets fragmented then is the snapshot, right, and
> > the "original" will stay in place where it was? (Which is of course
> > good, because one probably marked it nodatacow to avoid that
> > fragmentation problem on internal writes).
>
> No. Or more precisely, keep in mind that from btrfs' perspective, in
> terms of reflinks, once made, there's no "original" in terms of
> special treatment; all references to the extent are treated the same.

Sure... you misunderstood me, I guess.

> What a snapshot actually does is create another reference (reflink) to
> an extent.
[snip snap]
> So in the case of nocow, a cow1 (one-time-cow) exception must be made,
> rewriting the changed data to a new location, as the old location
> continues to be referenced by at least one other reflink.

That's what I meant.

> So (with the fact that writable snapshots are available and thus it
> can be the snapshot that changed if it's what was written to) the one
> that gets the changed fragment written elsewhere, thus getting
> fragmented, is the one that changed, whether that's the working copy
> or the snapshot of that working copy.

Yep... that's what I suspected and asked about. The "original" file, in the sense of the file that first reflinked the contiguous blocks, will continue to point to these contiguous blocks. The "new" file, i.e. the CoW-1-ed snapshot's file, will partially reflink blocks from the contiguous range, and its rewritten blocks will reflink somewhere else. Thus the "new" file is the one that gets fragmented.

> > And one more:
> > You both said auto-defrag is generally recommended.
> > Does that also apply to SSDs (where we want to avoid unnecessary
> > writes)?
> > It does seem to get enabled when SSD mode is detected.
> > What would it actually do on an SSD?
> Did you mean it does _not_ seem to get (automatically) enabled, when
> SSD mode is detected, or that it _does_ seem to get enabled, when
> specifically included in the mount options, even on SSDs?

It does seem to get enabled when specifically included in the mount options (the ssd mount option is not used), i.e.:

/dev/mapper/system  /  btrfs  subvol=/root,defaults,noatime,autodefrag  0 1

leads to:

[    5.294205] BTRFS: device label foo devid 1 transid 13 /dev/disk/by-label/foo
[    5.295957] BTRFS info (device sdb3): disk space caching is enabled
[    5.296034] BTRFS: has skinny extents
[   67.082702] BTRFS: device label system devid 1 transid 60710 /dev/mapper/system
[   67.85] BTRFS info (device dm-0): disk space caching is enabled
[   67.111267] BTRFS: has skinny extents
[   67.305084] BTRFS: detected SSD devices, enabling SSD mode
[   68.562084] BTRFS info (device dm-0): enabling auto defrag
[   68.562150] BTRFS info (device dm-0): disk space caching is enabled

> Or did you actually mean it the way you wrote it, that it seems to be
> enabled (implying automatically, along with ssd), when ssd mode is
> detected?

No, sorry for being unclear. I meant it in the sense that having the ssd detected doesn't auto-disable auto-defrag, which I thought might make sense, given that I didn't know exactly what it would do on SSDs... IIRC, Hugo or Austin mentioned the thing about making for better IOPS, but I hadn't considered that to have enough impact... so I thought it could have made sense to ignore the "autodefrag" mount option in case an ssd was detected.

> There are three factors I'm aware of here as well, all favoring
> autodefrag, just as the two above favored leaving it off.
>
> 1) IOPS, Input/Output Operations Per Second. SSDs typically have both
> an IOPS and a throughput rating. And unlike spinning rust, where raw
> non-sequential-write IOPS are generally bottlenecked by seek times, on
> SSDs with their zero seek-times, IOPS can actually be the bottleneck.
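(As an aside, the boot log quoted above can be checked mechanically; a small sketch over the same messages, showing that SSD detection and autodefrag coexist rather than one disabling the other:)

```python
# Lines taken from the dmesg excerpt quoted above.
boot_log = [
    "BTRFS: detected SSD devices, enabling SSD mode",
    "BTRFS info (device dm-0): enabling auto defrag",
    "BTRFS info (device dm-0): disk space caching is enabled",
]

ssd_mode   = any("enabling SSD mode" in line for line in boot_log)
autodefrag = any("enabling auto defrag" in line for line in boot_log)

# Both appear in the same boot: SSD detection did not disable autodefrag.
print(ssd_mode and autodefrag)   # True
```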
Hmm, it would be really nice to get someone who has found a way to do some sound analysis/benchmarking of that.

> 2) SSD physical write and erase block sizes as multiples of the
> logical/read block size. To the extent that extent sizes are multiples
> of the write and/or erase-block size, writing larger extents will
> reduce write amplification due to writing blocks smaller than the
> write or erase block size.

Hmm... okay, I don't know the details of how btrfs does this, but I'd have expected that all extents are aligned to the underlying physical devices' block structure. Thus each extent should start at such a write/erase block, and at most it wouldn't end perfectly at the end of the extent. If the file is fragmented (i.e. more than one extent), I'd have even hoped that all but the last one fit perfectly.

So what you basically mean, AFAIU, is that by having auto-defrag you get larger extents (i.e. smaller ones collapsed into one), and thus you get less cut off at the end of extents where these
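The write-amplification argument in point 2) can be illustrated with rough arithmetic. The 512 KiB erase-block size below is an assumed example (real values vary widely by drive), and the model is deliberately simplistic: each extent is written separately, and a partial erase block at an extent's tail still costs a whole erase block of wear.

```python
import math

ERASE_BLOCK_KIB = 512   # assumed for illustration; real SSDs differ

def erase_blocks_touched(extent_sizes_kib):
    # Each extent rounds up to whole erase blocks on its own.
    return sum(math.ceil(s / ERASE_BLOCK_KIB) for s in extent_sizes_kib)

fragmented = [128] * 8   # a 1 MiB file in eight 128 KiB fragments
defragged  = [1024]      # the same data as one 1 MiB extent

print(erase_blocks_touched(fragmented))  # 8 erase blocks touched
print(erase_blocks_touched(defragged))   # 2 erase blocks touched
```

Under these assumptions, collapsing the fragments cuts the erase blocks touched from 8 to 2 for the same megabyte of data, which is the gain Duncan is describing.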
Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)
On Wed, 9 Dec 2015 13:36:01 + (UTC), Duncan <1i5t5.dun...@cox.net> wrote:

> >> > 4) Duncan mentioned that defrag (and I guess that's also for
> >> > auto-defrag) isn't ref-link aware...
> >> > Isn't that somehow a complete showstopper?
>
> >> It is, but the one attempt at dealing with it caused massive data
> >> corruption, and it was turned off again.
>
> IIRC, it wasn't data corruption so much as massive scaling issues, to
> the point where defrag was entirely useless, as it could take a week
> or more for just one file.
>
> So the decision was made that a non-reflink-aware defrag that actually
> worked in something like reasonable time, even if it did break
> reflinks and thus increase space usage, was of more use than a defrag
> that basically didn't work at all, because it effectively took an
> eternity. After all, you can always decide not to run it if you're
> worried about the space effects it's going to have, but if it's going
> to take a week or more for just one file, you effectively don't have
> the choice to run it at all.
>
> > So... does this mean that it's still planned to be implemented some
> > day or has it been given up forever?
>
> AFAIK it's still on the list. And the scaling issues are better, but
> one big thing holding it up now is quota management. Quotas never have
> worked correctly, but they were a big part (close to half, IIRC) of
> the original snapshot-aware-defrag scaling issues, and thus must be
> reliably working and in a generally stable state before a
> snapshot-aware defrag can be coded to work with them. And without
> that, it's only half a solution that would have to be redone when
> quotas stabilized anyway, so really, quota code /must/ be stabilized
> to the point that it's not a moving target before reimplementing
> snapshot-aware defrag makes any sense at all.
>
> But even at that point, while snapshot-aware defrag is still on the
> list, I'm not sure if it's ever going to be actually viable.
> It may be that the scaling issues are just too big, and it simply
> can't be made to work both correctly and in anything approaching
> practical time. Time will tell, of course, but until then...

I'd like to throw in an idea... Couldn't auto-defrag just be made "sort of reflink-aware" in a very simple fashion: just let it ignore extents that are shared?

That way you could still enjoy its benefits in a mixed-mode scenario where you work with snapshots on some subvolumes, while other subvolumes that never have snapshots taken still get defragmented.

Comments?
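The proposed heuristic is simple enough to sketch. The data layout below is hypothetical; a real implementation would have to detect sharing via FIEMAP's shared-extent flag or a backref lookup rather than a stored refcount, and the field names are invented for illustration:

```python
# Hypothetical extent records: (offset, length, refcount).
extents = [
    (0,   128, 1),   # referenced only by this file -> safe to relocate
    (128, 128, 3),   # shared with two snapshots -> relocating unshares data
    (256, 128, 1),
]

def defrag_candidates(extents):
    # "Sort of reflink-aware": simply skip anything shared, so defrag
    # never duplicates snapshot-shared data.
    return [(off, length) for off, length, refs in extents if refs == 1]

print(defrag_candidates(extents))   # [(0, 128), (256, 128)]
```

The trade-off is visible even in the toy version: the shared middle extent stays fragmented, so files that are mostly snapshot-shared would see little benefit, but space usage never grows.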
Ideas on unified real-ro mount option across all filesystems
Hi,

In a recent btrfs patch, a mount option is going to be added to disable log replay for btrfs, just like "norecovery" for ext4/xfs.

But in the discussion on the mount option name and use case, it seems better to have a unified and fs-independent mount option alias for a real RO mount.

Reasons:

1) Some filesystems may have already used a [no]"recovery" mount option.
In fact, btrfs has already used the "recovery" mount option, so using a "norecovery" mount option would be quite confusing for btrfs.

2) A more straightforward mount option.
Currently, to get a real RO mount for ext4/xfs, the user must use -o ro,norecovery. Just ro won't ensure real RO, and norecovery can't be used alone. If we had a simple alias, it would be much easier for users. (It may be done just in userspace mount.)
Not to mention some fs (yeah, btrfs again) doesn't have "norecovery" but "nologreplay".

3) A lot of users don't even know that mounting ro can still modify the device.
Yes, I didn't know this point until I checked the log replay code of btrfs. Adding such a mount option alias may raise some awareness among users.

Any ideas about this?

Thanks,
Qu
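Point 2) could be prototyped entirely in userspace mount, as suggested. A sketch of the idea; the alias name "hardro" and the lookup table are invented here for illustration (only the per-fs options norecovery/nologreplay come from the mail, and nologreplay was still under discussion at the time):

```python
# Hypothetical userspace expansion of a unified "really read-only" alias
# into each filesystem's own option. "hardro" is NOT an existing mount
# option; it stands in for whatever name such an alias would get.
PER_FS_OPTION = {
    "ext4":  "norecovery",
    "xfs":   "norecovery",
    "btrfs": "nologreplay",   # per the patch under discussion
}

def expand_hardro(fstype):
    extra = PER_FS_OPTION.get(fstype)
    if extra is None:
        raise ValueError("no known real-RO option for " + fstype)
    return "ro," + extra

print(expand_hardro("btrfs"))   # ro,nologreplay
print(expand_hardro("ext4"))    # ro,norecovery
```

The user would type one alias everywhere; mount(8) would translate it before calling the kernel, so no per-fs knowledge leaks into scripts.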
Re: Ideas on unified real-ro mount option across all filesystems
And here is the existing discussion on the btrfs mailing list, just for reference:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/51098

Thanks,
Qu

Qu Wenruo wrote on 2015/12/17 09:41 +0800:
> Hi,
>
> In a recent btrfs patch, it is going to add a mount option to disable
> log replay for btrfs, just like "norecovery" for ext4/xfs.
> But in the discussion on the mount option name and use case, it seems
> better to have a unified and fs-independent mount option alias for
> real RO mount.
>
> Reasons:
> 1) Some file system may have already used [no]"recovery" mount option
> In fact, btrfs has already used "recovery" mount option. Using
> "norecovery" mount option will be quite confusing for btrfs.
>
> 2) More straightforward mount option
> Currently, to get real RO mount, for ext4/xfs, user must use -o
> ro,norecovery. Just ro won't ensure real RO, and norecovery can't be
> used alone. If we have a simple alias, it would be much better for
> user to use. (it may be done just in user space mount)
> Not to mention some fs (yeah, btrfs again) doesn't have "norecovery"
> but "nologreplay".
>
> 3) A lot of users even don't know mount ro can still modify device
> Yes, I didn't know this point until I checked the log replay code of
> btrfs. Adding such mount option alias may raise some attention of
> users.
>
> Any ideas about this?
>
> Thanks,
> Qu
Re: attacking btrfs filesystems via UUID collisions?
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 13:03:24 +0100 as excerpted:

> Human readable labels are basically guaranteed to collide,

Heh, not here, tho one could argue that my labels aren't "human readable", I suppose.

grep LABEL= /etc/fstab | cut -f1

LABEL=bt0238gcn1+35l0
LABEL=bt0238gcn0+35l0
LABEL=bt0465gsg0+47f0
LABEL=rt0238gcnx+35l0
LABEL=rt0238gcnx+35l1
LABEL=rt0465gsg0+47f0
LABEL=hm0238gcnx+35l0
LABEL=pk0238gcnx+35l0
LABEL=nr0238gcnx+35l0
LABEL=hm0238gcnx+35l1
LABEL=pk0238gcnx+35l1
LABEL=nr0238gcnx+35l1
LABEL=hm0465gsg0+47f0
LABEL=pk0465gsg0+47f0
LABEL=nr0465gsg0+47f0
LABEL=lg0238gcnx+35l0
LABEL=lg0465gsg0+47f0
LABEL=mm0465gsg0+2550
LABEL=mm0465gsg0+2551
#LABEL=sw0465gsg0+47f0

The scheme was originally designed with reiserfs' 15-char limited labels in mind, so it's 15 chars. These days I use it for both fs labels and gpt partition names/labels, with the two generally matched except for the device sequential, which is x in the multi-device case.

* function: 2 chars (bt=boot, hm=home, etc.)
* device-id: 8 chars, unique-in-scope device id
  ** size: 5 (0238g = 238 GiB)
  ** brand: 2 (sg=seagate, cn=corsair neutron, etc.)
  ** dev-seq: 1 (there can be more than one 465 GiB seagate)
* target: 1 (+=home workstation, . for the netbook, etc.)
* date: 3, date of original partition creation
  ** year: 1, last digit of year, gives decade scope
  ** month: 1, 1-9abc
  ** day: 1, 1-9a-v (2 chars would be nice here, but...)
* func-seq: 1 (0=working, backup-N)

2+8+1+3+1 = 15 chars =:^)

So for example rt0238gcnx+35l0 is root, on 238 GiB Corsair Neutrons (multi-device), targeted at the workstation, with the partitions originally set up in 2013, June (something, whatever l is), working copy.

(Hmm... Only apropos to this thread due to the tangential btrfs angle, but that's two and a half years ago. Which, since that's when I first deployed btrfs permanently, means I've been running btrfs for two and a half years now. ... =:^)

The function tells me at a glance what it's intended to be used for.
The target (which also functions as a visual separator) tells me at a glance where the device is intended to be used. The func-seq tells me at a glance whether I'm dealing with the working copy or what level of backup, and taken together with the function and target, uniquely IDs the partition/filesystem "software device".

The dev-id is unique-in-scope, easily IDing size, brand, and number of the "hardware device", and size is ridiculously scalable from bytes to PiB and beyond. For multi-device btrfs, dev-seq is "x", while the individual device partitions composing it still have their sequence numbers in their gpt labels.

The date (along with size, of course) provides some idea of the age of the device, or at least of the partitioning scheme on it, as well as providing more bits of "software device" and overall unique-id.

Both sequence numbers can easily and intuitively scale to 61 (1-9a-zA-Z) if needed, and less intuitively a bit higher if it's really necessary. Target would lose its separator status if it scaled too far, but it certainly gives me as an individual a /reasonable/ number of machines of flexibility.

This scheme self-evidently and easily scales to a library well into the multi-hundreds if not thousands of physical devices, portable or permanently installed, partitioned up as needed. I haven't yet found the need, as my "device library" is small enough, but were I to need to, I could reasonably easily put together a database tracking where various files (and even various versions of those files) are located.

With the "software device" and "hardware device" IDed separately, I can easily substitute out or add/remove hardware devices from software devices, or the reverse, as necessary.

The biggest problem is the 15-char limit; I had to pack the fields rather tighter and more cryptically than I'd have liked, so it's not as easily human readable as I'd have liked.
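For anyone wanting to adopt something similar: the scheme is regular enough to generate mechanically. A sketch following the field layout described above; the function and argument names are my own, only the encoding comes from the post:

```python
# Encode the 15-char label scheme: function(2) size(5) brand(2)
# dev-seq(1) target(1) year(1) month(1) day(1) func-seq(1).
def month_code(m):   # 1-9 literal, then a=10, b=11, c=12
    return str(m) if m <= 9 else chr(ord('a') + m - 10)

def day_code(d):     # 1-9 literal, then a=10 ... v=31
    return str(d) if d <= 9 else chr(ord('a') + d - 10)

def make_label(func, size_gib, brand, dev_seq, target,
               year, month, day, func_seq):
    return (func                              # function, 2 chars
            + format(size_gib, '04d') + 'g'   # size, 5 chars
            + brand                           # brand, 2 chars
            + str(dev_seq)                    # device sequential (x = multi)
            + target                          # target machine
            + str(year % 10)                  # last digit of year
            + month_code(month)
            + day_code(day)
            + str(func_seq))                  # 0=working, else backup level

# Reproduces a label from the fstab listing above (day "l" = 21).
print(make_label('rt', 238, 'cn', 'x', '+', 2013, 5, 21, 0))  # rt0238gcnx+35l0
```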
And of course it'd need to be adapted for deployment scales on the level of facebook/google/nsa, where 60-some device scaling in the sequence numbers, and the target scaling as well, is pitifully laughable, but it's certainly reasonable on an individual scale. And with a couple of revisions for mdraid and btrfs (basically, md for brand when I was doing partitioned mdraid, and substituting x for the individual sequence number for multi-device), the scheme has served me surprisingly well over the years since I came up with it, and should continue to do so, I suppose, until I no longer have the need (death, or near-vegetable in a nursing home or whatever).

Tho if HP's "the machine" were to ever take off in my lifetime, it could prove somewhat... challenging to the mental and nomenclature model, but that pretty much applies to the entire computer field, both hardware and software, as we know it, so I'm far from alone there.

But, despite the debatable
Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted:

> > And there very well might be such a tool... five or ten years down
> > the road when btrfs is much more mature and generally stabilized,
> > well beyond the "still maturing and stabilizing" status of the
> > moment.
>
> Hmm let's hope btrfs isn't finished only when the next-gen default fs
> arrives ;^)

[Again, breaking into smaller point replies...]

Well, given the development history of both zfs and btrfs to date, five to ten years down the line, with yet another even newer filesystem then already under development, is more being "real" than not. Also see the history of MS' attempt at a next-gen filesystem. The reality is these things take FAR longer than one might think.

FWIW, on the wiki I see feature points and benchmarks for v0.14, introduced in April of 2008, and a link to an earlier btree filesystem on which btrfs apparently was based, dating to 2006. So while I don't have a precise beginning date, and to some extent such a thing would be rather arbitrary anyway, as Chris would certainly have done some major thinking, preliminary research, and coding before his first announcement, a project origin in late 2006 or sometime in 2007 has to be quite close.

And (as I noted in a parenthetical at my discovery in a different thread), I switched to btrfs for my main filesystems when I bought my first SSDs, in June of 2013, so already a quarter decade ago. At the time btrfs was just starting to remove some of the more dire "experimental" warnings. Obviously it has stabilized quite a bit since then, but due to the oft-quoted 80/20 rule and its extensions, where the last 20% of the progress takes 80% of the work, etc...

It could well be another five years before btrfs is at a point most here would call stable. That would be 2020 or so, about 13 years for the project, and if you look at the similar projects mentioned above, that really isn't unrealistic at all.
Ten years minimum, and that's with serious corporate-level commitments and a lot more dedicated devs than btrfs has. Twelve years not unusual at all, and a decade and a half still well within reasonable range, for a filesystem with this level of complexity, scope, and features.

And realistically, by that time, yet another successor filesystem may indeed be in the early stages of development, say at the 20/80 point: 20% of the required effort invested, possibly 80% of the features done, but not stabilized.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted:

> I'm kinda curious what free-space fragmentation actually means here.
>
> Is it simply like this:
>
> +---+---+---+---+
> | F | D | F | D |
> +---+---+---+---+
>
> Where D is data (i.e. files/metadata) and F is free space. In other
> words, (F)ree space itself is not further subdivided and is only
> fragmented by the (D)ata extents in between.
>
> Or is it more complex, like this:
>
> +---+---+---+---+---+
> | F | F | D | F | D |
> +---+---+---+---+---+
>
> Where the (F)ree space itself is subdivided into "extents" (not
> necessarily of the same size), and btrfs couldn't use e.g. the first
> two F's as one contiguous amount of free space for a larger (D)ata
> extent

[still breaking into smaller points for reply]

At one level, I had the simpler f/d/f/d scheme in mind, but that would be the case inside a single data chunk. At the higher file level, with files ranging from significant fractions of the size of a single data chunk to much larger than a single data chunk, the more complex second f/f/d/f/d case would apply, with the chunk boundary as the separation between the f/f.

IOW, files larger than data-chunk size will always be fragmented into data-chunk-size fragments/extents, at the largest, because chunks are designed to be movable using balance, device remove, replace, etc.

So (using the size numbers from a recent comment by Qu in a different thread), on a filesystem with under 100 GiB total space-effective (space-effective: space available accounting for the replication type, raid1, etc., and I'm simplifying here...), data chunks should be 1 GiB, while above that, with striping, they might be up to 10 GiB. Using the 1 GiB nominal figure, files over 1 GiB would always be broken into 1 GiB maximum-size extents, corresponding to 1 extent per chunk.
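The chunk-size ceiling just described directly gives a lower bound on a large file's extent count. A quick sketch under the nominal 1 GiB chunk size from the paragraph above (real chunk sizes vary with filesystem size and profile):

```python
import math

CHUNK_GIB = 1   # nominal data-chunk size on sub-100-GiB filesystems

def min_extents(file_size_gib):
    # No extent can cross a chunk boundary, so a file needs at least
    # size/chunk-size extents, and at least one.
    return max(1, math.ceil(file_size_gib / CHUNK_GIB))

print(min_extents(0.5))   # 1: fits within a single chunk
print(min_extents(4))     # 4: at least four 1 GiB extents
```

So even a completely "unfragmented" 4 GiB file reports at least 4 extents on btrfs, which matters when interpreting filefrag numbers later in this thread.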
But while 4 KiB extents are clearly tiny and inefficient at today's scale, in practice, efficiency gains break down at well under GiB scale, with AFAIK 128 MiB being the upper bound at which any efficiency gains could really be expected, and 1 MiB arguably being a reasonable point at which further increases in extent size likely won't have a whole lot of effect even on SSD erase-block (where 1 MiB is a nominal max), but that's still 256X the usual 4 KiB minimum data block size, 8X the 128 KiB btrfs compression-block size, and 4X the 256 KiB defrag default "don't bother with extents larger than this" size. Basically, the 256 KiB btrfs defrag "don't bother with anything larger than this" default is quite reasonable, tho for massive multi-gig VM images, the number of 256 KiB fragments will still look pretty big, so while technically a very reasonable choice, the "eye appeal" still isn't that great. But based on real reports posting before and after numbers from filefrag (on uncompressed btrfs), we do have cases where defrag can't find 256 KiB free-space blocks and thus can actually fragment a file worse than it was before, so free-space fragmentation is indeed a very real problem.
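A quick sanity check of the size ratios cited above, plus (commented out, as a hypothetical invocation with a placeholder path) the defrag flag that raises the 256 KiB default target extent size:

```shell
# All sizes in KiB: 4 KiB data block, 128 KiB compression block,
# 256 KiB defrag default, 1 MiB (1024 KiB) candidate upper bound.
blk=4 comp=128 defrag=256 target=1024
echo "$((target / blk))x the data block size"          # 256x
echo "$((target / comp))x the compression-block size"  # 8x
echo "$((target / defrag))x the defrag default"        # 4x
# btrfs defrag accepts a target extent size; e.g. (path is a placeholder):
# btrfs filesystem defragment -t 1M /path/to/vm-image
```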
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted: > I'm a bit unsure how to read filefrag's output... (even in the > uncompressed case). > What would it show me if there was fragmentation /path/to/file: 18 extents found It tells you the number of extents found. Nominally, each extent should be a fragment, but as has been discussed elsewhere, on btrfs compressed files it will interpret each 128 KiB btrfs compression block as its own extent, even if (as seen in verbose mode) the next one begins where the previous one ends so it's really just a single extent. Apparently on ext3/4, it's possible to have multi-gig files as a single extent, thus unfragmented, but as explained in an earlier reply to a point earlier in your post, on btrfs, extents of a GiB are nominally the best you can do as that's the nominal data chunk size, tho in limited circumstances larger extents are still possible on btrfs. In the case above, where I took the 18 extents result from a real file (tho obviously the posted path isn't real), it was 4 MiB in size (I think exactly, it's a 4 MiB BIOS image =:^), so doing the math, extents average 227 KiB. That's on a filesystem that is always mounted with autodefrag, but it's also always mounted with compress, so it's possible some of the reported extents are compressed. Actually, looking at filefrag -v output (which I've never used before but which someone noted could be used to check fragmentation on compressed files, tho it's not as straightforward as you might think), it looks like all but two of the listed extents are 32 blocks long (with 4096 byte blocks), which equates to 128 KiB, the btrfs compression-block size, and the two remaining extents are 224 blocks long or 896 KiB, an exact 7 multiple of 128 KiB, so this file would indeed appear to be compressed except for those two uncompressed extents. 
(As for figuring out how to interpret the full -v output to know whether the compressed blocks are actually single extents or not, as I said this is my first time trying -v, and I didn't bother going that far with it.)
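The extent arithmetic described above can be reproduced with a little awk over filefrag-style data. The extent list here is fabricated to match the sizes mentioned (32-block and 224-block extents), not real filefrag output:

```shell
# Each input line: extent number and length in 4 KiB blocks.
# 32 blocks = 128 KiB (the btrfs compression-block size); 224 blocks = 896 KiB.
printf '%s\n' '0 32' '1 32' '2 224' '3 32' '4 224' |
awk '{ kib = $2 * 4
       tag = (kib % 128 == 0) ? "multiple of 128 KiB" : "odd size"
       printf "extent %s: %d KiB (%s)\n", $1, kib, tag
       total += kib; n++ }
     END { printf "average: %.1f KiB\n", total / n }'
```

Every length here is a multiple of 128 KiB, which is the pattern that suggests compressed extents on btrfs.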
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted: >> he obviously didn't think thru the fact that compression MUST be a >> rewrite, thereby breaking snapshot reflinks, even were normal >> non-compression defrag to be snapshot aware, because compression >> substantially changes the way the file is stored), that's _implied_, >> not explicit. > So you mean, even if ref-link aware defrag would return, it would still > break them again when compressing/uncompressing/recompressing? > I'd have hoped that then, all snapshots respectively other reflinks > would simply also change to being compressed, You're correct. I "obviously didn't think thru" that the whole way, myself. =:^( But meanwhile, we don't have snapshot-aware-defrag, and in that case, the implication... and his result... remains.
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted: >> It's certainly in quite a few on-list posts over the years > okay,.. in other words: no ;-) > list posts scattered over the years don't count as documentation :P =:^)
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On Thu, 2015-12-17 at 01:09 +, Duncan wrote: > Well, "don't load the journal on mounting" is exactly what the option > would do. The journal (aka log) of course has a slightly different > meaning, it's only the fsync log, but loading it is exactly what the > option would prevent, here. That's not the point. What David asked for was an option that has the meaning "do whatever is necessary to mount the fs in such a way that the device isn't changed". At least that's how I've understood him. *Right now* this is, for btrfs, the "nologreplay" option, so "nodevwrites" or whatever you call it would simply imply "nologreplay"... and in the future any other options that are necessary to get the above-defined semantics. For ext, "nodevwrites" would imply "noload" (AFAIU). If you now make "noload" an alias for "nodevwrites" in btrfs, you clearly break semantics here: "noload" from ext4 doesn't have the same meaning as "nodevwrites" from btrfs... it does *just now*, only while ext doesn't need any other possible future options. Maybe in 10 years, ext has a dozen new features (because btrfs still hasn't stabilised yet, as it misses snapshot-aware defrag and checksums for noCoWed data >:-D O:-D ... sorry, couldn't resist ;) ), one of those new features needs to be disabled for "hard ro mounts", thus ext's "nodevwrites" would in addition imply "noSuperNewExtFeature". Then "noload" from ext4 isn't even *effectively* the same anymore as "nodevwrites" from btrfs. Therefore, it shouldn't be an alias. Cheers, Chris.
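A sketch of the alias idea being argued here, as it might look if mount(8) expanded it in userspace. The per-filesystem option names are taken from the discussion; the "nodevwrites" expansion mechanism itself is hypothetical:

```shell
# Hypothetical "nodevwrites" alias: each filesystem maps it onto whatever
# options it currently needs for a no-device-writes mount.
fs=btrfs
case "$fs" in
    btrfs) opts="ro,nologreplay" ;;  # today; may imply more options later
    ext4)  opts="ro,noload" ;;       # AFAIU the ext4 equivalent
    xfs)   opts="ro,norecovery" ;;
esac
echo "mount -o $opts /dev/sdx /mnt"  # /dev/sdx and /mnt are placeholders
```

The point of Christoph's argument is that the mapping on the right-hand side may grow per filesystem over time, which is exactly why a cross-filesystem alias shouldn't be hard-wired to any single existing option.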
Ideas to do custom operation just after mount?
Hi, Does xfstests provide some API to do some operation just after mounting a filesystem? Some filesystems (OK, btrfs again) have some functionality (currently qgroup only) which needs to be enabled by ioctl instead of by mount option. Currently, for btrfs qgroup we added special test cases enabling qgroup and doing the test. But there are fewer than 10 qgroup test cases, and we didn't test most of the rest of the cases with qgroup enabled. Things will get even worse if we are adding in-band deduplication. So is there any good idea for doing some operation just after mounting in xfstests? Thanks, Qu
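One way this could look, sketched as a shell helper. The helper name and structure are invented for illustration; xfstests has no such API today, which is the point of the question:

```shell
# Hypothetical: run an arbitrary post-mount hook, e.g. enabling qgroups,
# which is driven by ioctl ("btrfs quota enable") rather than a mount option.
mount_with_hook() {
    dev=$1; mnt=$2; hook=$3
    mount "$dev" "$mnt" || return 1
    if [ -n "$hook" ]; then
        $hook "$mnt"     # e.g. hook="btrfs quota enable"
    fi
}
# usage (placeholders): mount_with_hook /dev/sdx /mnt "btrfs quota enable"
```

With something like this in the harness's mount path, every existing test case would automatically run with qgroups (or dedup, later) enabled, instead of only the handful of dedicated qgroup tests.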
Re: attacking btrfs filesystems via UUID collisions?
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 16:04:03 +0100 as excerpted: > On Wed, 2015-12-16 at 09:41 -0500, Chris Mason wrote: >> Hugo is right here. reiserfs had tools that would scan an entire >> block device for metadata blocks and try to reconstruct the filesystem >> based on what it found. > Creepy... at least when talking about a "normal" fsck... good that btrfs > is going to be the next-gen-ext, and not reiser4 ;) What often gets lost in discussions of this nature is that it _wasn't_ "normal" fsck that had the problem, but rather, a parameter (--rebuild-tree, IIRC) much like those btrfs check (--init-csum-tree, --init-extent-tree) and rescue (chunk-recover) use for blowing away and recreating the checksum tree, extent tree, chunk tree, etc. So it's definitely _not_ something that reiserfsck would do in a "normal" fsck, only when doing "I'm desperate and don't have backups, go to the ends of the earth if necessary to recover what you can of my data, and yes, I understand it could be a bit risky or end up rather disordered, but I'm willing to take that risk because I _am_ that desperate" level recovery. Arguably, however, the problem was that reiserfs (heh, that's the second time I almost wrote btrfs and caught it, hope I didn't miss any! =:^) had a repair mode for rather minor items, and an "I'm desperate, ends of the earth and I don't care about the risk as anything is better than nothing" mode, but not a lot of choice in between the two. Additionally, now looking at btrfs (a correct reference this time! =:^), the "desperate" solution in btrfs is rather more fine-grained, including at least the three above options plus one for the superblock, with an additional read-only restore tool that can often restore most or all data to elsewhere, in the case of a missed or not-current backup, that reiserfs never had.
But AFAIK reiser4 (which I never actually tried as it never made mainline, which in general I prefer to stick to, but I read about it) improved on the reiserfs model in this regard as well -- indeed, it would have been surprising if it didn't, since both reiser4 and btrfs had the lessons of reiserfs to build upon. And of course reiserfs might have gotten the same sort of tool changes too, except for Hans Reiser's controversial policy of letting stable be stable, and putting the improvements into reiser4, which of course was intended to get into mainline in some reasonable time and thus wouldn't have left reiserfs users so in the lurch as actually happened, because reiser4 never did hit mainline due to $reasons, most/all of which I agree with, or at least understand, where I don't entirely agree. But anyway, for anyone with half a tech-oriented brain, it was very evident that the required options were "desperate" level, and for people without half a tech-oriented brain, the documentation clearly suggested danger ahead, you should have backups if you're going to do this as it's a risky process that could destroy chances of recovery instead of fixing things, as well. But of course so many don't read the docs, they just do it... and sometimes they suffer the consequences when they do... and sometimes then try to blame others for it. That's the way of the world; not something we're going to change. Even the required actually-spelled-out "yes" confirmation, not just "y", didn't stop people, either from doing it or from blaming reiserfs for problems that were in fact mostly their own, when they went ahead anyway.
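For reference, the finer-grained btrfs equivalents mentioned above, shown as commented commands since all of them rewrite or reconstruct metadata and should only ever be pointed at an unmounted (and ideally first imaged) device; /dev/sdx is a placeholder:

```shell
# btrfs check --init-csum-tree /dev/sdx    # blow away and recreate the checksum tree
# btrfs check --init-extent-tree /dev/sdx  # recreate the extent tree
# btrfs rescue chunk-recover /dev/sdx      # rebuild the chunk tree by device scan
# btrfs rescue super-recover /dev/sdx      # repair the superblock from good copies
# btrfs restore /dev/sdx /recovery/dir     # read-only salvage of data to elsewhere
```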
Re: Ideas on unified real-ro mount option across all filesystems
On Wed, Dec 16, 2015 at 09:15:59PM -0600, Eric Sandeen wrote: > > > On 12/16/15 7:41 PM, Qu Wenruo wrote: > > Hi, > > > > In a recent btrfs patch, it is going to add a mount option to disable > > log replay for btrfs, just like "norecovery" for ext4/xfs. > > > > But in the discussion on the mount option name and use case, it seems > > better to have an unified and fs independent mount option alias for > > real RO mount > > > > Reasons: > > 1) Some file system may have already used [no]"recovery" mount option > >In fact, btrfs has already used "recovery" mount option. > >Using "norecovery" mount option will be quite confusing for btrfs. > > Too bad btrfs picked those semantics when "norecovery" has existed on > other filesystems for quite some time with a different meaning... :( > > > 2) More straight forward mount option > >Currently, to get real RO mount, for ext4/xfs, user must use -o > >ro,norecovery. > >Just ro won't ensure real RO, and norecovery can't be used alone. > >If we have a simple alias, it would be much better for user to use. > >(it maybe done just in user space mount) > > mount(8) simply says: > >ro Mount the filesystem read-only. > > and mount(2) is no more illustrative: > >MS_RDONLY > Mount file system read-only. > > kernel code is no help, either: > > #define MS_RDONLY1 /* Mount read-only */ > > They say nothing about what, exactly, "read-only" means. But since at least > the early ext3 days, it means that you cannot write through the filesystem, > not > that the filesystem will leave the block device unmodified when it mounts. > > I have always interpreted it as simply "no user changes to the filesystem," > and that is clearly what the vfs does with the flag... That ("-o ro means no user changes") has always been my understanding too. You /want/ the FS to replay the journal on an RO mount so that regular FS operation picks up the committed transactions. 
--D > > >Not to mention some fs (yeah, btrfs again) doesn't have "norecovery" > >but "nologreplay". > > well, again, btrfs picked unfortunate semantics, given the precedent set > by other filesystems. > > f2fs, ext4, gfs2, nilfs2, and xfs all support "norecovery" - xfs since > forever, ext4 & f2fs since 2009, etc. > > > 3) A lot of user even don't now mount ro can still modify device > >Yes, I didn't know this point until I checked the log replay code of > >btrfs. > >Adding such mount option alias may raise some attention of users. > > Given that nothing in the documentation implies that the block device itself > must remain unchanged on a read-only mount, I don't see any problem which > needs fixing. MS_RDONLY rejects user IO; that's all. > > If you want to be sure your block device rejects all IO for forensics or > what have you, I'd suggest # blockdev --setro /dev/whatever prior to mount, > and take it out of the filesystem's control. Or better yet, making an > image and not touching the original. > > -Eric > > > Any ideas about this? > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted: > On Wed, 2015-12-09 at 16:36 +, Duncan wrote: >> But... as I've pointed out in other replies, in many cases including >> this specific one (bittorrent), applications have already had to >> develop their own integrity management features > Well let's move discussion upon that into the "dear developers, can we > have notdatacow + checksumming, plz?" where I showed in one of the more > recent threads that bittorrent seems rather to be the only thing which > does use that per default... while on the VM image front, nothing seems > to support it, and on the DB front, some support it, but don't use it > per default. > >> In the bittorrent case specifically, torrent chunks are already >> checksummed, and if they don't verify upon download, the chunk is >> thrown away and redownloaded. > I'm not a bittorrent expert, because I don't use it, but that sounds to > be more like the edonkey model, where - while there are checksums - > these are only used until the download completes. Then you have the > complete file, any checksum info thrown away, and the file again being > "at risk" (i.e. not checksum protected). [I'm breaking this into smaller replies again.] Just to mention here, that I said "integrity management features", which includes more than checksumming. As Austin Hemmelgarn has been pointing out, DBs and some VMs do COW, some DBs do checksumming or at least have that option, and both VMs and DBs generally do at least some level of consistency checking as they load. Those are all "integrity management features" at some level. As for bittorrent, I /think/ the checksums are in the torrent files themselves (and if I'm not mistaken, much as git, the chunks within the file are actually IDed by checksum, not specific position, so as long as the torrent is active, uploading or downloading, these will by definition be retained). As long as those are retained, the checksums should be retained. 
And ideally, people will continue to torrent the files long after they've finished downloading them, in which case they'll still need the torrent files themselves, along with the checksum info. And for longer-term storage, people really should be copying/moving their torrented files elsewhere, in such a way that they either eliminate the fragmentation if the files weren't nocowed, or eliminate the nocow attribute and get them checksum-protected as normal for files not intended to be constantly randomly rewritten, which will be the case once they're no longer being actively downloaded. Of course that's at the slightly technically oriented user level, but then, the whole nocow thing, or even caring about checksums and longer term file integrity in the first place, is also technically oriented user level. Normal users will just download without worrying about the nocow in the first place, and perhaps wonder why the disk is thrashing so, but not be inclined to do anything about it except perhaps switch back to their old filesystem, where it was faster and the disk didn't sound as bad. In doing so, they'll either automatically get the checksumming along with the worse performance, or go back to a filesystem without the checksumming, and think it's fine as they know no different. Meanwhile, if they do it correctly there's no window without protection, as the torrent file can be used to double-verify the file once moved, as well, before deleting it.
Re: Ideas on unified real-ro mount option across all filesystems
On 12/16/15 7:41 PM, Qu Wenruo wrote: > Hi, > > In a recent btrfs patch, it is going to add a mount option to disable > log replay for btrfs, just like "norecovery" for ext4/xfs. > > But in the discussion on the mount option name and use case, it seems > better to have an unified and fs independent mount option alias for > real RO mount > > Reasons: > 1) Some file system may have already used [no]"recovery" mount option >In fact, btrfs has already used "recovery" mount option. >Using "norecovery" mount option will be quite confusing for btrfs. Too bad btrfs picked those semantics when "norecovery" has existed on other filesystems for quite some time with a different meaning... :( > 2) More straight forward mount option >Currently, to get real RO mount, for ext4/xfs, user must use -o >ro,norecovery. >Just ro won't ensure real RO, and norecovery can't be used alone. >If we have a simple alias, it would be much better for user to use. >(it maybe done just in user space mount) mount(8) simply says: ro Mount the filesystem read-only. and mount(2) is no more illustrative: MS_RDONLY Mount file system read-only. kernel code is no help, either: #define MS_RDONLY1 /* Mount read-only */ They say nothing about what, exactly, "read-only" means. But since at least the early ext3 days, it means that you cannot write through the filesystem, not that the filesystem will leave the block device unmodified when it mounts. I have always interpreted it as simply "no user changes to the filesystem," and that is clearly what the vfs does with the flag... >Not to mention some fs (yeah, btrfs again) doesn't have "norecovery" >but "nologreplay". well, again, btrfs picked unfortunate semantics, given the precedent set by other filesystems. f2fs, ext4, gfs2, nilfs2, and xfs all support "norecovery" - xfs since forever, ext4 & f2fs since 2009, etc. > 3) A lot of user even don't now mount ro can still modify device >Yes, I didn't know this point until I checked the log replay code of >btrfs. 
>Adding such mount option alias may raise some attention of users. Given that nothing in the documentation implies that the block device itself must remain unchanged on a read-only mount, I don't see any problem which needs fixing. MS_RDONLY rejects user IO; that's all. If you want to be sure your block device rejects all IO for forensics or what have you, I'd suggest # blockdev --setro /dev/whatever prior to mount, and take it out of the filesystem's control. Or better yet, making an image and not touching the original. -Eric > Any ideas about this?
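Eric's blockdev suggestion as a concrete sequence, commented out since it needs root and a real device; /dev/sdx is a placeholder:

```shell
# blockdev --setro /dev/sdx   # block layer now fails every write to the device
# blockdev --getro /dev/sdx   # prints 1 once the device is marked read-only
# mount -o ro /dev/sdx /mnt   # any log replay the fs attempts can no longer
#                             # touch the disk (the mount may fail instead)
```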
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 12:45:00 +0100 as excerpted: > On Wed, 2015-12-16 at 11:10 +, Duncan wrote: >> And noload doesn't have the namespace collision problem norecovery does >> on btrfs, so I'd strongly suggest using it, at least as an alias for >> whatever other btrfs-specific name we might choose. > > but noload is, AFAIU, not what's desired here, is it? > Per manpage it's "Don't load the journal on mounting",... not only > wouldn't that fit for btrfs, it's also not what's really desired, i.e. > an option that implies everything necessary to not modify the device. Well, "don't load the journal on mounting" is exactly what the option would do. The journal (aka log) of course has a slightly different meaning, it's only the fsync log, but loading it is exactly what the option would prevent, here. Of course that isn't to say there shouldn't be another option, call it nomodify, for argument, that includes this and perhaps other options that would otherwise trigger filesystem level changes on a normal read-only mount. Too bad we can't simply rename the recovery mount option so norecovery could be used as well, but I guess that could potentially break too many existing deployments. =:^(
Re: dear developers, can we have notdatacow + checksumming, plz?
On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote: > > Well sure, I think we'd done most of this and have dedicated controllers, at least of a quality that funding allows us ;-) > > But regardless how much one tunes, and how good the hardware is: if you'd then always lose a fraction of your overall IO, be it just 5%, to defragging these types of files, one may actually want to avoid this at all, for which nodatacow seems *the* solution. > nodatacow only works for that if the file is pre-allocated; if it isn't, then it still ends up fragmented. Hmm, is that "it may end up fragmented" or "it will definitely"? Cause I'd have hoped that, if nothing else had been written in the meantime, btrfs would perhaps try to write next to the already allocated blocks. > > > The problem is not entirely the lack of COW semantics, it's also the fact that it's impossible to implement an atomic write on a hard disk. > > Sure... but that's just the same for the nodatacow writes of data. > > (And the same, AFAIU, for CoW itself, just that we'd notice any corruption in case of a crash due to the CoWed nature of the fs and could go back to the last generation.) > Yes, but it's also the reason that using either COW or a log-structured filesystem (like NILFS2, LogFS, or I think F2FS) is important for consistency. So then it's no reason why it shouldn't work. The metadata is CoWed; any incomplete writes of checksum data in that (be it for CoWed data or no-CoWed data, should the latter be implemented) would be protected at that level. Currently, the no-CoWed data is, AFAIU, completely at risk of being corrupted (no checksums, no journal). Checksums on no-CoWed data would just improve that. > > What about VMs? At least a quick google search didn't give me any results on whether there would be e.g. checksumming support for qcow2. > > For raw images there surely is not.
> I don't mean that the VMM does checksumming, I mean that the guest OS should be the one to handle the corruption. No sane OS doesn't run at least some form of consistency checks when mounting a filesystem. Well, but we're not talking about having a filesystem that "looks clear" here. For this alone we wouldn't need any checksumming at all. We talk about data integrity protection, i.e. all files and their contents. Nothing which a fsck inside a guest VM would ever notice (I mean by a fsck), if there are just some bit flips or things like that. > > And even if DBs do some checksumming now, it may be just a consequence of that missing in the filesystems. > > As I've written somewhere else in the previous mail: it's IMHO much better if one system takes care of this, where the code is well tested, than each application doing its own thing. > That's really a subjective opinion. The application knows better than we do what type of data integrity it needs, and can almost certainly do a better job of providing it than we can. Hmm, I don't see that. When we, at the filesystem level, provide data integrity, then all data is guaranteed to be valid. What more should an application be able to provide? At best they can do the same thing faster, but even for that I see no immediate reason to believe it. And in practise it seems far more likely that if countless applications each handle such tasks on their own, it's more error prone (that's why we have libraries for all kinds of code, trying to reuse code, minimising the possibility of errors in countless home-brew solutions), or not done at all. > > > > - the data was written out correctly, but before the csum was written the system crashed, so the csum would now tell us that the block is bad, while in reality it isn't.
> > > There is another case to consider: the data got written out, but the crash happened while writing the checksum (so the checksum was partially written, and is corrupt). This means we get a false positive on a disk error that isn't there, even when the data is correct, and that should be avoided if at all possible. > > I've had that, and I've left it quoted above. > > But as I've said before: that's one case out of many? How likely is it that the crash happens exactly after a large data block has been written, followed by a relatively tiny amount of checksum data? I'd assume it's far more likely that the crash happens during writing the data. > Except that the whole metadata block pointing to that data block gets rewritten, not just the checksum. But that's the case anyway, isn't it? With or without checksums. > > And regarding "reporting data to be in error, which is actually correct"... isn't that what all journaling systems may do? > No, most of them don't actually do that. The general design of a journaling filesystem is that
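The false-positive scenario being debated can be illustrated with a toy checksum check: plain CRC via POSIX cksum standing in for btrfs's real checksumming. A torn or garbage stored checksum is indistinguishable from corrupt data:

```shell
data="hello world"                                        # stand-in data block
stored=$(printf '%s' "$data" | cksum | awk '{print $1}')  # pretend on-disk csum
verify() {  # compare a block against its stored checksum
    crc=$(printf '%s' "$1" | cksum | awk '{print $1}')
    if [ "$crc" = "$2" ]; then echo match; else echo MISMATCH; fi
}
verify "$data" "$stored"      # match: data and checksum both intact
verify "$data" "${stored}9"   # MISMATCH: checksum torn, yet the data is fine
```

The second call is exactly the torn-checksum crash window: the verifier reports a data error even though only the checksum write was interrupted.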
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
Qu Wenruo posted on Wed, 16 Dec 2015 09:36:23 +0800 as excerpted: > David Sterba wrote on 2015/12/14 18:32 +0100: >> On Thu, Dec 10, 2015 at 10:34:06AM +0800, Qu Wenruo wrote: >>> Introduce a new mount option "nologreplay" to co-operate with "ro" >>> mount option to get real readonly mount, like "norecovery" in ext* and >>> xfs. >>> >>> Since the new parse_options() need to check new flags at remount time, >>> so add a new parameter for parse_options(). >>> >>> Signed-off-by: Qu Wenruo>>> Reviewed-by: Chandan Rajendra >>> Tested-by: Austin S. Hemmelgarn >> >> I've read the discussions around the change and from the user's POV I'd >> suggest to add another mount option that would be just an alias for any >> mount options that would implement the 'hard-ro' semantics. >> >> Say it's called 'nowr'. Now it would imply 'nologreplay', but may cover >> more options in the future. >> >> mount -o ro,nowr /dev/sdx /mnt >> >> would work when switching kernels. >> >> > That would be nice. > > I'd like to forward the idea/discussion to filesystem ml, not only btrfs > maillist. > > Such behavior should better be coordinated between all(at least xfs and > ext4 and btrfs) filesystems. > > One sad example is, we can't use 'norecovery' mount option to disable > log replay in btrfs, as there is 'recovery' mount option already. > > So I hope we can have a unified mount option between mainline > filesystems. FWIW, I was just reading the mount manpage in connection with a reply for a different thread, and noticed... mount (8) (from util-linux 2.27.1) says noload and norecovery are the same option, for ext3/4 at least. It refers to the xfs (5) manpage, from xfsprogs, for xfs mount options, and that's not installed here, so I can't confirm noload for it, but it's there for ext3/4. And noload doesn't have the namespace collision problem norecovery does on btrfs, so I'd strongly suggest using it, at least as an alias for whatever other btrfs-specific name we might choose. 
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On Wed, 2015-12-16 at 11:10 +, Duncan wrote: > And noload doesn't have the namespace collision problem norecovery > does > on btrfs, so I'd strongly suggest using it, at least as an alias for > whatever other btrfs-specific name we might choose. but noload is, AFAIU, not what's desired here, is it? Per manpage it's "Don't load the journal on mounting",... not only wouldn't that fit for btrfs, it's also not what's really desired, i.e. an option that implies everything necessary to not modify the device. Cheers, Chris.
Re: attacking btrfs filesystems via UUID collisions?
On Tue, 2015-12-15 at 08:54 -0500, Austin S. Hemmelgarn wrote:
> Except for one thing: Automobiles actually provide a measurable
> significant benefit to society. What specific benefit does embedding
> the filesystem UUID in the metadata actually provide?

I guess that's quite obvious. You want something that can be used to address devices stably (i.e. not by their "path" like sda, sdb). So either some ID or a label. Human-readable labels are basically guaranteed to collide, so UUIDs are the clean solution. Since there is however no guarantee that they don't collide (either by accident or malicious intent), you need to protect against that. The same goes for the device IDs of multi-device filesystems or containers.

> > "A UUID is 128 bits long, and can guarantee uniqueness across space
> > and time."
> >
> > Also see security considerations in section 6.
> Both aspects ignore the facts that:
> Version 1 is easy to cause a collision with (MAC addresses are by no
> means unique, and are easy to spoof, and so are timestamps).
> Version 2 is relatively easy to cause a collision with, because UID and
> GID numbers are a fixed-size namespace.
> Version 3 is slightly better, but still not by any means unique because
> you just have to guess the seed string (or a collision for it).
> Version 4 is probably the hardest to get a collision with, but only if
> you are using a true RNG, and even then, 122 bits of entropy is not much
> protection.
> Version 5 has the same issues as Version 3, but is more secure against
> hash collisions.

I guess we don't need to discuss how unique UUIDs are when they're *freshly created*, since that is the only thing the RFC "guarantees"... That's mostly irrelevant for us here, as we have two far stronger cases: accidental duplication and malicious collisions. The chance that a UUID collision occurs by normal means (e.g. mkfs.btrfs) is small, but solving the two actual cases here will solve that one as well.
Apart from that, I've noticed in several of your mails that either something with the indentation level goes wrong, or you mix contents from multiple mails from different people. E.g. the "Also see security considerations in section 6." wasn't from me, though it was at quotation level 1 in your mail, but the example with the automobile, which was also on level 1, was from me. That's kinda confusing...

> In general, you should only use UUID's when either:
> a. You have absolutely 100% complete control of the storage of them,
> such that you can guarantee they don't get reused.
> b. They can be guaranteed to be relatively unique for the system
> using them.

No, these aren't necessary constraints. And in fact they would make multi-device practically impossible (you always need some ID, unless you want to open the door for countless errors where people wrongly assemble their devices... whether it's a UUID or anything else doesn't matter). The only thing that one needs to do is handle collisions gracefully and not do auto-assemblies... all as I've described in the mail from "Fri, 11 Dec 2015 23:06:03 +0100" (http://thread.gmane.org/gmane.comp.file-systems.btrfs/50909/focus=51147)

> > There could be some leveraging of the device WWN, or absent that its
> > serial number, propagated into all of the volume's devices (cross
> > referencing each other's devid to WWN or serial). And then that way
> > there's a way to differentiate. In the dd case, there would be
> > mismatching real device WWN/serial number and the one written in
> > metadata on all drives, including the copy. This doesn't say what
> > policy should happen next, just that at least it's known there's a
> > mismatch.
>
> That gets tricky too, because for example you have stuff like flat files
> used as filesystem images.

Plus... one cannot be sure whether any hardware device IDs, like serial numbers, are unique... a powerful attacker could surely change these as well.
Or imagine you have a failing harddisk and dd its contents to another... the btrfs part would stay identical, while the hardware device IDs change and confuse everything.

> However, if we then use some separate UUID (possibly hashed off of the
> file location) in place of the device serial/WWN, that could
> theoretically provide some better protection.

Not really... it just delegates the problem one level further. The only real protection is that the kernel and userland tools deal correctly with the situation.

> The obvious solution in the case of a mismatch would be to refuse the
> mount until either the issue is fixed using the tools, or the user
> specifies some particular mount option to either fix it automatically,
> or ignore copies with a mismatching serial.

Sure, as I've said before :-)

Cheers, Chris
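As a side note, the version-4 ("random") UUIDs debated above can be generated straight from the kernel, which exposes a generator at /proc/sys/kernel/random/uuid. A minimal sketch (assumes a Linux system with procfs mounted; whether a given mkfs uses v4 is a detail of that tool, not something this demonstrates):

```shell
#!/bin/sh
# Read two freshly generated version-4 UUIDs from the kernel.
u1=$(cat /proc/sys/kernel/random/uuid)
u2=$(cat /proc/sys/kernel/random/uuid)

# Each is 36 characters: 32 hex digits plus 4 hyphens.
echo "u1=$u1"
echo "u2=$u2"

# With only 122 random bits, uniqueness is probabilistic, not guaranteed:
# two draws colliding is astronomically unlikely, but nothing enforces it,
# and nothing stops dd or an attacker from presenting a duplicate.
[ "$u1" != "$u2" ] && echo "distinct"
```

Which is exactly the thread's point: fresh UUIDs essentially never collide on their own, so the cases that need handling are duplication (dd, restored images) and malice.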
Re: attacking btrfs filesystems via UUID collisions?
On Tue, 2015-12-15 at 14:18 +, Hugo Mills wrote:
> That one's easy to answer. It deals with a major issue that
> reiserfs had: if you have a filesystem with another filesystem image
> stored on it, reiserfsck could end up deciding that both the metadata
> blocks of the main filesystem *and* the metadata blocks of the image
> were part of the same FS (because they're on the same block device),
> and so would splice both filesystems into one, generally complaining
> loudly along the way that there was a lot of corruption present that
> it was trying to fix.

Hmm, that's a bit strange though, and to me it rather sounds like other bugs... You can have an ext4 on a file in an ext4, with or without the same UUIDs, and it will just work. If the filesystem takes contents from a normal file as possible metadata, then something else is severely screwed up... or in the case of the fsck: it probably means it's a bit too liberal in searching places.

I'd be quite shocked if this were the case in btrfs, cause it would mean, again, that we have a vulnerability against UUID collisions. Imagine some attacker finds out the UUID of a filesystem (which is probably rather easy)... next he uploads some file (e.g. it's a webserver which allows image uploads, a forum perhaps) that in reality contains what looks like btrfs metadata and uses a matching UUID. It would run into the same issues as what you describe for reiserfs... the UUID would be no real help to solve that problem.

Does anyone know whether btrfsck (or other userland) tools do such things? I.e. search more or less arbitrary blocks, where it cannot be sure they're *not* data, for what it would interpret as metadata subsequently?

Cheers, Chris.
Re: attacking btrfs filesystems via UUID collisions?
On Tue, 2015-12-15 at 11:03 -0500, Austin S. Hemmelgarn wrote:
> May I propose the alternative option of adding a flag to tell mount to
> _only_ use the devices specified in the options?

That's one part of exactly what I've been proposing for a few days :-P (no one seems to read my mails ;-) ) Plus that this shouldn't be the case only for mounts, but also fsck, repair, and all other userland tool operations.

But it's only part of the solution to the whole problem; the other part is that automatic device activations/rebuilds/etc. of _already active_ devices should generally not happen (manual ones of course may happen, again with device= options, specifying *which* devices are actually meant).

See my mail from "Fri, 11 Dec 2015 23:06:03 +0100" (http://thread.gmane.org/gmane.comp.file-systems.btrfs/50909/focus=51147) which I think covers pretty much all cases.

Cheers, Chris.
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On 2015-12-16 06:10, Duncan wrote:
> Qu Wenruo posted on Wed, 16 Dec 2015 09:36:23 +0800 as excerpted:
>> David Sterba wrote on 2015/12/14 18:32 +0100:
>>> On Thu, Dec 10, 2015 at 10:34:06AM +0800, Qu Wenruo wrote:
>>>> Introduce a new mount option "nologreplay" to co-operate with "ro"
>>>> mount option to get real readonly mount, like "norecovery" in ext*
>>>> and xfs.
>>>>
>>>> Since the new parse_options() needs to check new flags at remount
>>>> time, add a new parameter for parse_options().
>>>>
>>>> Signed-off-by: Qu Wenruo
>>>> Reviewed-by: Chandan Rajendra
>>>> Tested-by: Austin S. Hemmelgarn
>>>
>>> I've read the discussions around the change and from the user's POV
>>> I'd suggest to add another mount option that would be just an alias
>>> for any mount options that would implement the 'hard-ro' semantics.
>>>
>>> Say it's called 'nowr'. Now it would imply 'nologreplay', but may
>>> cover more options in the future.
>>>
>>> mount -o ro,nowr /dev/sdx /mnt
>>>
>>> would work when switching kernels.
>>
>> That would be nice.
>>
>> I'd like to forward the idea/discussion to filesystem ml, not only
>> btrfs maillist.
>>
>> Such behavior should better be coordinated between all (at least xfs,
>> ext4 and btrfs) filesystems.
>>
>> One sad example is, we can't use the 'norecovery' mount option to
>> disable log replay in btrfs, as there is a 'recovery' mount option
>> already.
>>
>> So I hope we can have a unified mount option between mainline
>> filesystems.
>
> FWIW, I was just reading the mount manpage in connection with a reply
> for a different thread, and noticed... mount (8) (from util-linux
> 2.27.1) says noload and norecovery are the same option, for ext3/4 at
> least. It refers to the xfs (5) manpage, from xfsprogs, for xfs mount
> options, and that's not installed here, so I can't confirm noload for
> it, but it's there for ext3/4.

Unless it's undocumented, XFS doesn't have it (as much as I hate XFS, I have to have xfsprogs installed so that I can do recovery for the few systems at work that actually use it if the need arises).
> And noload doesn't have the namespace collision problem norecovery does
> on btrfs, so I'd strongly suggest using it, at least as an alias for
> whatever other btrfs-specific name we might choose.

I kind of agree with Christoph here. I don't think that noload should be what we actually use, although I do think having it as an alias for whatever name we end up using would be a good thing.
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On Mon, Dec 14, 2015 at 12:50:37PM -0500, Austin S. Hemmelgarn wrote:
> On 2015-12-14 12:32, David Sterba wrote:
> > On Thu, Dec 10, 2015 at 10:34:06AM +0800, Qu Wenruo wrote:
> >> Introduce a new mount option "nologreplay" to co-operate with "ro" mount
> >> option to get real readonly mount, like "norecovery" in ext* and xfs.
> >>
> >> Since the new parse_options() need to check new flags at remount time,
> >> so add a new parameter for parse_options().
> >>
> >> Signed-off-by: Qu Wenruo
> >> Reviewed-by: Chandan Rajendra
> >> Tested-by: Austin S. Hemmelgarn
> >
> > I've read the discussions around the change and from the user's POV I'd
> > suggest to add another mount option that would be just an alias for any
> > mount options that would implement the 'hard-ro' semantics.
> >
> > Say it's called 'nowr'. Now it would imply 'nologreplay', but may cover
> > more options in the future.
> It should also imply noatime. I'm not sure how BTRFS handles atime when
> mounted RO, but I know a lot of old UNIX systems updated atime even on
> filesystems mounted RO, and I know that at least at one point Linux did too.

A mount with -o ro will not touch atimes. At one point the read-only snapshots changed atimes, but this has been fixed since.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/inode.c#n1602
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/inode.c#n5973

> > mount -o ro,nowr /dev/sdx /mnt
> >
> > would work when switching kernels.
> I like this idea, but I think that having a name like true-ro or hard-ro
> and making it imply ro (and noatime) would probably be better (or at
> least, simpler to use from a user perspective).

Ok, a single option to do the real-ro sounds better than ro,something.
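The umbrella-option idea discussed above can be sketched as a tiny option-expander. Note that 'nowr' is only a name proposed in this thread, not a real btrfs mount option; 'nologreplay' is the option the patch actually introduces, and the noatime expansion follows Austin's suggestion:

```shell
#!/bin/sh
# Sketch: expand a hypothetical 'nowr' umbrella option into the concrete
# flags a hard-read-only mount would need (ro + nologreplay + noatime).
expand_opts() {
    out=""
    old_ifs=$IFS; IFS=,
    for opt in $1; do
        case $opt in
            nowr) out="$out,ro,nologreplay,noatime" ;;  # umbrella alias
            *)    out="$out,$opt" ;;                    # pass through
        esac
    done
    IFS=$old_ifs
    echo "${out#,}"   # drop the leading comma
}

expand_opts "nowr,compress"   # -> ro,nologreplay,noatime,compress
```

A real implementation would live in the kernel's parse_options() (or in mount(8)) rather than a wrapper, but the mapping is the whole idea: one user-facing name, however many write-avoiding flags it takes underneath.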
Re: need to recover large file
Langhorst, Brad posted on Wed, 16 Dec 2015 03:13:48 + as excerpted:

> Hi:
>
> I just screwed up... spent the last 3 weeks generating a 400G file
> (genome assembly).
> Went to back it up and swapped the arguments to tar (tar Jcf my_precious
> my_precious.tar.xz)
> what was once 400G is now 108 bytes of xz header - argh.
>
> This is on a 6-volume btrfs filesystem.
>
> I immediately unmounted the fs (had to cd / first).
>
> After a bit of searching I found Chris Mason's post about using
> btrfs-debug-tree -R

[Snip most of the result as I'm not familiar with this utility. But it ends with...]

> Btrfs v3.12

[...]

> I also read that one can use btrfs-find-root to get a list of files to
> recover and just ran btrfs-find-root on one of the underlying disks but
> I get an error "Super think's the tree root is at 25606900367360,
> chunk root 25606758596608
> Went past the fs size, exiting

WTF? I thought that bug was patched a long time ago. How old a btrfs-progs are you using, anyway?

*OH!* *3.12*

Why are you still using 3.12? That's nigh back in the dark ages as far as btrfs is concerned. AFAIK, btrfs was either still labeled experimental back then, or 3.12 was the first version where experimental was stripped, so it's a very long time ago indeed, particularly for a filesystem that, while no longer experimental, is still under heavy development, with bugs being fixed in every release. The patches don't always reach old-stable (tho since 3.12, for the kernel anyway, they do try), so if you're running old releases, you're running code that's known to be buggy and known to have fixes for those bugs in newer releases.
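For anyone else reading: the mishap above comes from tar's argument order. With the 'f' flag, the very next argument is the archive tar will *write* (truncating it if it exists), so `tar Jcf my_precious my_precious.tar.xz` truncated the 400G file. A sketch of the safe order, with stand-in filenames:

```shell
#!/bin/sh
# tar's 'f' flag consumes the NEXT argument as the archive to CREATE,
# truncating it if it already exists.  So
#     tar Jcf my_precious my_precious.tar.xz      # WRONG
# destroys 'my_precious' by writing an archive into it.
# The safe order is archive first, members after:
echo "three weeks of genome assembly" > my_precious   # stand-in data file
tar -cf my_precious.tar my_precious    # add J (and a .xz suffix) for xz
ls -l my_precious my_precious.tar      # original file survives intact
```

Using explicit dashed options (`tar -cJf archive.tar.xz file`) and double-checking that the name after `-f` ends in .tar.xz is cheap insurance against exactly this class of mistake.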
The general list recommendation for the kernel, unless you have a known reason (like an already reported but still unfixed bug in newer), is to run the latest or next-to-latest release series of either current, which would ATM be 4.3 or 4.2, as 4.4 isn't out yet (tho it's getting close), or LTS, which would be 4.1 or 3.18 (tho 4.4 will be LTS too, and it's getting close, so you should already be preparing to upgrade to at least 4.1 if you're on the LTS series). The coverage of the penultimate series gives you some time to upgrade to the latest, since the penultimate series is covered too.

For runtime, the kernel code is generally used (userspace mostly just makes calls to the kernel and lets it do the work), so kernel code is most important. However, once you're trying to recover things, basically when you're working with the unmounted filesystem, userspace code is used, so having something reasonably current there becomes important.

As a rule of thumb, then, running a btrfs-progs userspace of at least the same release series as the kernel you're running is recommended (tho newer is fine), since the kernel and userspace were developed at about the same time and with the same problems in mind, and if you're keeping up with the kernel series recommendation, that means your userspace isn't getting /too/ old. But even then, once you're trying to do a btrfs recovery with those tools, a recommendation to try the latest current can be considered normal, since it'll be able to detect and usually fix the latest problems.

So a 3.18 series kernel and at least a 3.18 series userspace would be the minimum recommended, and indeed, for a filesystem like btrfs that is still stabilizing and not yet fully stable and mature, something quite a bit newer is quite reasonable.
While some people have reason to use particularly old and demonstrated-stable versions, and enterprise distros generally cater to this need with support for up to a decade, using a still new and maturing btrfs is incompatible with a need for old and demonstrated stable. So in that case, from the viewpoint of this list at least, if you're looking for that old and stable, you really should be using a different filesystem, as btrfs simply isn't that old and stable, yet.

Meanwhile, while that's the view of upstream btrfs and thus this upstream list, some distros nevertheless choose to support old btrfs, backporting patches, etc., as they consider necessary. However, that's their support then, and their business. If you're trusting them for that support, really, you should be contacting them for it, as this list really isn't in the best position to supply that sort of support. Yes, they may be using an old-in-number kernel and perhaps userspace, with newer patches backported to it, but it's the distro that makes those choices and knows what patches it's backporting, and thus is in the best position to support it. Not that we on the list won't try, but we're simply not in a good position to provide support that far back, as we've long since moved on, and neither do we track what distros have backported and what they haven't, etc.

So, basically, you have four choices:

1) Follow list recommendations and upgrade to something that isn't out of the dark ages in terms of btrfs history.

2) Follow the presumed
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On 2015-12-16 07:34, Christoph Anton Mitterer wrote:
> On Wed, 2015-12-16 at 07:12 -0500, Austin S. Hemmelgarn wrote:
>> I kind of agree with Christoph here. I don't think that noload should
>> be what we actually use, although I do think having it as an alias
>> for whatever name we end up using would be a good thing.
> No, because people would start using it, getting used to it, and in 4
> years we could never change it again,... which may be necessary...

No, because we should ease the transition from other filesystems to the greatest extent reasonably possible. It should be clearly documented as an alias for compatibility with ext{3,4}, and that it might go away in the future.
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On Wed, Dec 16, 2015 at 09:36:23AM +0800, Qu Wenruo wrote:
> > > mount -o ro,nowr /dev/sdx /mnt
> > >
> > > would work when switching kernels.
>
> That would be nice.
>
> I'd like to forward the idea/discussion to filesystem ml, not only btrfs
> maillist.

Good idea.

> Such behavior should better be coordinated between all (at least xfs and
> ext4 and btrfs) filesystems.
>
> One sad example is, we can't use 'norecovery' mount option to disable
> log replay in btrfs, as there is 'recovery' mount option already.

I think we should pick a name that's not tied to the implementation of how the potential writes could happen under a RO mount. Recovery/replay/whatever, the expected use is "avoid any writes".

> So I hope we can have a unified mount option between mainline filesystems.

That would be a good thing indeed.
Re: Still not production ready
On Tue, Dec 15, 2015 at 06:30:58PM -0800, Liu Bo wrote:
> On Wed, Dec 16, 2015 at 10:19:00AM +0800, Qu Wenruo wrote:
> > > max_stripe_size is fixed at 1GB and the chunk size is stripe_size *
> > > data_stripes, may I know how your partition gets a 10GB chunk?
> >
> > Oh, it seems that I remembered the wrong size.
> > After checking the code, yes you're right.
> > A stripe won't be larger than 1G, so my assumption above is totally wrong.
> >
> > And the problem is not in the 10% limit.
> >
> > Please forget it.
>
> No problem, glad to see people talking about the space issue again.

You can still end up with larger block groups if you have a lot of drives. We've had different problems with that in the past, but it is limited now to 10G. At any rate, if things are still getting badly out of balance we need to tweak the allocator some more. It's hard to reproduce because you need a burst of allocations for whatever type is full. I'll give it another shot.

-chris
Re: attacking btrfs filesystems via UUID collisions?
On Wed, Dec 16, 2015 at 01:03:38PM +0100, Christoph Anton Mitterer wrote:
> On Tue, 2015-12-15 at 14:18 +, Hugo Mills wrote:
> > That one's easy to answer. It deals with a major issue that
> > reiserfs had: if you have a filesystem with another filesystem image
> > stored on it, reiserfsck could end up deciding that both the metadata
> > blocks of the main filesystem *and* the metadata blocks of the image
> > were part of the same FS (because they're on the same block device),
> > and so would splice both filesystems into one, generally complaining
> > loudly along the way that there was a lot of corruption present that
> > it was trying to fix.
> Hmm that's a bit strange though, and to me it rather sounds like other
> bugs...
> You can have an ext4 on a file in an ext4, with or without the same
> UUIDs, and it will just work.

Hugo is right here. reiserfs had tools that would scan an entire block device for metadata blocks and try to reconstruct the filesystem based on what they found. Since there was no uuid, it was impossible to tell if a block from the scan was really part of this filesystem or part of some image file that happened to be sitting there.

Adding UUIDs doesn't make that whole class of problem go away (you could have an image of the filesystem inside that filesystem), but it does make it dramatically less likely. At the end of the day it's just a best-practice mechanism to help recovery and prevent admin mistakes. It's also a building block of the multi-device support. We could change the multi-device support to allow duplicate uuids in single-device filesystems. But I'd much rather see a variation on seed devices enable transitioning from one uuid to another.

> If the filesystem takes contents from a normal file as possible
> metadata, then something else is severely screwed up... or in case of
> the fsck: it probably means it's a bit too liberal in searching places.
> I'd be quite shocked if this is the case in btrfs, cause it would mean
> again, that we have a vulnerability against UUID collisions.
> Imagine some attacker finds out the UUID of a filesystem (which is
> probably rather easy)... next he uploads some file (e.g. it's a
> webserver which allows image uploads, a forum perhaps) that in reality
> contains what looks like btrfs metadata and uses a matching UUID.
>
> It would run into the same issues as what you describe for reiser...
> the UUID would be no real help to solve that problem.
>
> Does anyone know whether btrfsck (or other userland) tools do such
> things? I.e. search more or less arbitrary blocks, where it cannot be
> sure it's *not* data, for what it would interpret as meta-data
> subsequently?

These are emergency tools; btrfs restore and find-roots can do some scanning. We don't do it the way reiserfs did because it would be very difficult to reconstruct shared data and metadata from snapshots.

-chris
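To make the "scanning" discussed above concrete: the primary btrfs superblock sits at a fixed 64 KiB offset, with the magic string "_BHRfS_M" 64 bytes into it (absolute offset 65600). A minimal sketch of the kind of signature scan such emergency tools build on, run against a throwaway image rather than a real device:

```shell
#!/bin/sh
# Minimal signature scan: find a btrfs superblock magic in a disk image.
# Real tools like btrfs-find-root do far more validation (checksums,
# generation numbers, tree linkage) than a bare string match.

# Build a throwaway 128 KiB image and plant the magic where mkfs would.
dd if=/dev/zero of=fake.img bs=1024 count=128 2>/dev/null
printf '_BHRfS_M' | dd of=fake.img bs=1 seek=65600 conv=notrunc 2>/dev/null

# grep -a treats the binary as text, -b prints the byte offset of the
# match, -o prints only the matched string.
grep -aob '_BHRfS_M' fake.img
```

This also illustrates the thread's worry: a bare signature scan cannot tell a real superblock from the same bytes sitting inside an ordinary data file, which is exactly why checksum and generation validation on top of the match matters.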
Re: attacking btrfs filesystems via UUID collisions?
On Wed, 2015-12-16 at 09:41 -0500, Chris Mason wrote:
> Hugo is right here. reiserfs had tools that would scan an entire block
> device for metadata blocks and try to reconstruct the filesystem based
> on what it found.

Creepy... at least when talking about a "normal" fsck... good that btrfs is going to be the next-gen-ext, and not reiser4 ;)

> Adding UUIDs doesn't make that whole class of problem go away (you could
> have an image of the filesystem inside that filesystem), but it does
> make it dramatically less likely.

Sure...

> > Does anyone know whether btrfsck (or other userland) tools do such
> > things? I.e. search more or less arbitrary blocks, where it cannot be
> > sure it's *not* data, for what it would interpret as meta-data
> > subsequently?
>
> These are emergency tools, btrfs restore and find-roots can do some
> scanning. We don't do it the way reiserfs did because it would be very
> difficult to reconstruct shared data and metadata from snapshots.

Hmm, I agree that it's valid for such tools to do these kinds of scans (i.e. scan for metadata in places that are not known for sure to be metadata) when doing some last-resort rescue tries... or for rescue operations where it's clearly documented that this is done.

But I think it shouldn't happen e.g. during a normal fsck - only when special options are given. And it should be properly documented (i.e. telling people in the docs that this does a block-for-block scan for metadata even within normal data, and that if they had e.g. another fs with the same UUIDs within, the results may be completely bogus).

Cheers, Chris.
Re: attacking btrfs filesystems via UUID collisions?
On Tue, 2015-12-15 at 14:42 +, Hugo Mills wrote:
> I would suggest trying to migrate to a state where detecting more
> than one device with the same UUID and devid is cause to prevent the
> FS from mounting, unless there's also a "mount_duplicates_yes_i_
> know_this_is_dangerous_and_i_know_what_im_doing" mount flag present,
> for the multipathing people. That will break existing userspace
> behaviour for the multipathing case, but the migration can probably be
> managed. (e.g. NFS has successfully changed default behaviour for one
> of its mount options in the last few(?) years).

I don't think that a single mount option a la "force-and-do-it" is a proper solution here. It would still open surface for attacks and also for accidents.

In the case multipathing is used, the only realistic way seems to be manually specifying the devices a la device=/dev/sda,/dev/sdb. Of course btrfs would still use the UUIDs/deviceIDs of these, but *only* of those devices that have been whitelisted with the device= option.

In the case of a general "mount_duplicates_yes_iknow_th..." option you could end up with having e.g. three duplicates, two being actual multi-paths, and the third one being a losetup or USB clone of the image... again allowing for the aforementioned attacks to happen, and again allowing for severe corruption to occur.
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On Wed, 2015-12-16 at 07:12 -0500, Austin S. Hemmelgarn wrote:
> I kind of agree with Christoph here. I don't think that noload should
> be what we actually use, although I do think having it as an alias
> for whatever name we end up using would be a good thing.

No, because people would start using it, getting used to it, and in 4 years we could never change it again... which may be necessary...

noload seems to mean "don't load the journal". Unless btrfs gets a journal in the sense xfs/ext has one, it simply should either not use that name at all... or not try to "map" it to something of its own which is similar, but in reality not the same.

Chris.
Re: attacking btrfs filesystems via UUID collisions?
On Tue, 2015-12-15 at 09:19 -0500, Austin S. Hemmelgarn wrote:
> Um, no you don't have direct physical access to the hardware with an
> ATM, at least, not unless you are going to take apart the cover and
> anything else in your way (and probably set off internal alarms).

Well, access to the service ports (which may be USB) is typically much easier, and doesn't require completely dismantling the steel and so on... simply because service teams also need to access these "regularly". But even if we don't count ATMs here, take any other publicly accessible computer terminal: library computers, the entertainment systems in airplanes, TVs in a shopping centre, etc. pp.

> And even without that, it's still possible to DoS an ATM without much
> effort. Most of them have a 3.5mm headphone jack for TTS for people
> with poor vision, and that's more than enough to overload at least part
> of the system with a relatively simple to put together bit of
> electronics that would cost you less than 10 USD.

As I've said before... you always find another weak link, of course... as it was pointed out before, USB itself is quite a security problem (firmware attacks and the like). But just because there are other issues right now, there is no justification to make btrfs "weak" as well... because this just leads to the vicious circle that everyone has security issues, no one willing to solve them, pointing to others as an excuse.

> > Imagine you're running a VM hosting service, where you allow users to
> > upload images and have them deployed.
> > In the "cheap" case these will end up as regular files, where they
> > couldn't do any harm (even with colliding UUIDs)... but even there one
> > would have to expect that the hypervisor admin may losetup them for
> > whichever reason.
> > But if you offer more professional services, you may give your clients
> > e.g. direct access to some storage backend, which is then probably
> > also seen on the host by its kernel.
> > And here we already have the case that a client could remotely
> > trigger such a collision.
> In that particular situation, it's not relevant unless the host admin
> goes to mount them. UUID collisions are only an issue if the
> filesystems get mounted.

Hmm, from the impression I got so far, it was not only a problem when actually mounting... but even if, this doesn't change the situation. Same problem as before: the host system may have btrfs filesystems whose IDs have leaked, the attacker may upload them as VM images as described above, and even if the host's admin doesn't want to mount those, he may mount what he considers his filesystems, which however also collide. Boom. Same issues as before. Turn it as you want, resistance is futile ;-)

> > Well, I think it's a proper way to e.g. handle the multi-device case.
> > You have n devices, you want to differ them... using a pseudo-random
> > UUID is surely better than giving them numbers.
> That's debatable, the same issues are obviously present in both cases
> (individual numbers can collide too).

Sure, as I've said. You always must handle the case of accidentally or maliciously colliding IDs if you count on data integrity and security. But using UUIDs makes the chances at least small that you run into collisions (that users must then manually resolve somehow) *even when* you just create fresh filesystems, with no attacker and no dd or the like getting in your way.

> > Same for the fs UUID, e.g. when used for mounting devices whose paths
> > aren't stable.
> In the case of a sanely designed system using LVM for example, device
> paths are stable.

Well, but LVM itself works with UUIDs again, so you just delegate the problem. And apart from that, with btrfs, I thought we rather want to avoid using LVM below.

> > As said before, using the UUID isn't the problem - not protecting
> > against collisions is.
> No, the issues are:
> 1.
> We assume that the UUID will be unique for the life of the
> filesystem, which is not a safe assumption.
> 2. We don't sanely handle things if it isn't unique.

Well, isn't that what I've said? At least it's what I've meant ;)

> > Well,... I don't think that writing *into* the filesystem is covered
> > by common practise anymore.
> For end users, I agree. Part of the discussion involves attacks on the
> system, and for an attacker it's not a far stretch to write directly to
> the block device if possible (and it's even common practice for
> bypassing permission checks done in the VFS layer).

Well, but that's something else, which I don't think we can cover. What we must assume is that devices show up with colliding IDs, either by "accident" or by means like dd... or by an attacker somehow being able to make them show up (USB, the image upload scenarios I've described before, and so on). If the attacker can however write to *arbitrary* (and not just "his") devices, bypassing checks in the VFS layer or anything else... well than
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On Wed, 2015-12-16 at 07:57 -0500, Austin S. Hemmelgarn wrote:
> No, because we should ease the transition from other filesystems to
> the greatest extent reasonably possible. It should be clearly
> documented as an alias for compatibility with ext{3,4}, and that it
> might go away in the future.

Where's the need for an easy migration path? Such an option wouldn't be
used by default, and even if it were, people would need to change their
fstab/etc. anyway (ext->btrfs).

The note that things will go away never really works out that easily...
people start to use it, rely on it... and sooner or later you have a
situation as with atime, where you basically never can get rid of it.
IMHO an alias here would be just ambiguous.

Cheers,
Chris.
Re: dear developers, can we have notdatacow + checksumming, plz?
Austin S. Hemmelgarn posted on Tue, 15 Dec 2015 11:00:40 -0500 as
excerpted:

> And in particular, the only journaling filesystem that I know of that
> even allows the option of journaling the file contents instead of
> just metadata is ext4.

IIRC, ext3 was the first to have it in Linux mainline, with
data=writeback for the speed freaks that don't care about data loss,
data=ordered as the default normal option (except for that infamous
period when Linus lost his head and let people talk him into switching
to data=writeback, despite the risks... he later came back to his
senses and reverted that), and data=journal for the folks that were
willing to trade a bit of speed for better data protection (tho it was
famous for surprising everybody, in that in certain use-cases it was
extremely fast, faster than data=writeback, something I don't think was
ever fully explained). To my knowledge ext3 still has that, tho I
haven't used it in probably a decade.

Reiserfs has all three data= options as well, with data=ordered the
default, tho it only had data=writeback initially. While I've used
reiserfs for years, it has always been with the default data=ordered
since that was introduced, and I'd be surprised if data=journal had the
same use-case speed advantage that it did on ext3, as it's too
different. Meanwhile, that early data=writeback default is where
reiserfs got its ill repute for data loss, but it had long switched to
data=ordered by default by the time Linus lost his senses and tried
data=writeback by default on ext3. Because I was on reiserfs from the
data=writeback era, I was rather glad most kernel hackers didn't want
to touch it by the time Linus let them talk him into data=writeback on
ext3, and thus left reiserfs (which again had long been data=ordered by
default by then) well enough alone.
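For reference, the three journaling levels discussed above are selected
per mount via the data= option; a hypothetical fstab line requesting
full data journaling on ext4 (device, mountpoint, and pass numbers are
placeholders, not taken from any real system) might look like:

```
# /etc/fstab -- hypothetical example; device and mountpoint are placeholders
/dev/sda2   /var   ext4   data=journal   0   2
```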
But I did help a few people running ext3 trace their new ext3 stability
issues down to that bad data=writeback experiment, and persuaded them
to specify data=ordered, which solved their problems, so indeed they
/were/ data=writeback related. And happily, Linus did eventually regain
his senses and return ext3 to data=ordered by default once again.

And based on what you said, ext4 still has all three data= options,
including data=journal. But I wasn't sure of that myself (tho I would
have assumed it inherited it from ext3) and thus am /definitely/ not
sure whether it inherits ext3's data=journal speed advantages in
certain corner-cases.

I have no idea whether other journaled filesystems allow choosing the
journal level or not, tho. I only know of those three.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs: poor performance on deleting many large files
Lionel Bouton posted on Tue, 15 Dec 2015 03:38:33 +0100 as excerpted:

> I just checked: this has only been made crystal-clear in the latest
> man-pages version 4.03, released 10 days ago.
>
> The mount(8) page of Gentoo's current stable man-pages (4.02, released
> in August) which is installed on my systems states for noatime:
> "Do not update inode access times on this filesystem (e.g., for
> faster access on the news spool to speed up news servers)."

Hmm... I hadn't synced and updated in about that time, and sure enough,
while I've just synced I've not yet updated, and still have man-pages
4.02 installed.

But the mount.8 manpage (.bz2 in my case, as that's the compression I'm
configured for; I had to use man -d mount to debug-dump what file it
was actually loading) actually belongs to util-linux, according to
equery belongs, while equery files man-pages | grep mount only returns
hits for mount.2 (.bz2, and umount). So at least here, it's util-linux
providing the mount(8) manpage, not man-pages.

Tho I'm on ~amd64 and IIRC just updated util-linux in the last update,
so the cross-ref to nodiratime in the noatime entry (saying it isn't
necessary as noatime covers it) probably came from there, or a similar
recent util-linux update.

Let's see... My current util-linux (with the xref in both noatime and
nodiratime to the other, saying nodiratime isn't needed if noatime is
used) is 2.27.1. The oldest version I still have in my binpkg cache
(tho I likely have older on the backup) is util-linux 2.24.2. For
noatime it has the wording you mention, don't update inode access
times, but for nodiratime, it specifically mentions directory inode
access times.

So from util-linux 2.24.2 at least, the information was there, but you
had to read between the lines a bit more, because nodiratime mentions
dir inodes, and noatime says don't update atime on inodes, so it's
there but you have to be a reasonably astute reader to see it.
In between those two I have other versions including 2.26.2 and 2.27.
Looks like 2.27 added both the "implies nodiratime" wording to the
noatime entry, and the nodiratime-unneeded-if-noatime-set notation to
the nodiratime entry. If there was a util-linux 2.26.x beyond x=2, I
apparently never installed it, so the wording likely changed with 2.27,
but may have changed with late 2.26 versions as well, if there were any
beyond 2.26.2. And on Gentoo, 2.26.2 appears to be the latest
stable-keyworded, so that's what stable users would have.

But as I said, the info is there at least as of 2.24.2; you just have
to note that the nodiratime entry says dir inodes, while the noatime
entry simply says inodes, without excluding dir inodes. So it's there,
you just have to be a somewhat astute reader to note it.

Anywhere else, say in on-the-net recommendations for nodiratime, they
/should/ mention that it isn't necessary if noatime is used as well,
but of course not all of them will. (Tho I'd actually find it a bit
strange to see discussion of nodiratime without discussion of noatime
as well, as I'd guess any discussion of just one of the two would
likely be on noatime, leaving nodiratime unmentioned if they're only
covering one, as it shouldn't be necessary to mention, since it's
already included in noatime.)

But there's probably a bunch of folks who originally read coverage of
noatime, then saw nodiratime later, and thought "Oh, that's separate?
Well I want that too!" and simply enabled them both, without actually
checking the manpage or other documentation, including on-the-net
discussion. I know here I originally saw noatime and decided I wanted
it, then was confused when I saw nodiratime sometime later. But I don't
just enable stuff without having some idea what I'm enabling, so I did
my research, and saw noatime implied nodiratime as well, so the only
reason nodiratime might be needed would be if you wanted atime in
general, but not on dirs.
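The rule discussed above, that noatime implies nodiratime, can be
stated in a few lines. This toy helper (the function name is invented
for illustration, and it deliberately ignores relatime and other
refinements; it is not a real mount(8) or util-linux API) encodes it:

```python
# Toy illustration of the rule: the noatime mount option covers
# directories too, so adding nodiratime on top of it is redundant.
# Deliberately ignores relatime/strictatime; illustrative only.

def atime_updates(options):
    """options: iterable of mount option strings.
    Returns (files_get_atime, dirs_get_atime)."""
    opts = set(options)
    if "noatime" in opts:       # noatime implies nodiratime
        return (False, False)
    if "nodiratime" in opts:    # only directory atimes are suppressed
        return (True, False)
    return (True, True)

print(atime_updates(["noatime"]))                # -> (False, False)
print(atime_updates(["noatime", "nodiratime"]))  # -> (False, False), redundant
print(atime_updates(["nodiratime"]))             # -> (True, False)
```

The second call shows why specifying both options buys nothing: the
result is identical to noatime alone.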
-- 
Duncan - List replies preferred. No HTML msgs.
Re: dear developers, can we have notdatacow + checksumming, plz?
Austin S. Hemmelgarn posted on Tue, 15 Dec 2015 11:00:40 -0500 as
excerpted:

> AFAIUI, checksums are stored per-instance for every block. This is
> important in a multi-device filesystem in case you lose a device, so
> that you still have a checksum for the block. There should be no
> difference between extent layout and compression between devices
> however.

I don't believe that's quite correct. What is correct, to the best of
my knowledge, is that checksums are metadata, and thus have whatever
duplication/parity level metadata is assigned.

For single devices, that is of course by default dup, 2X the metadata
and thus 2X the checksums, both on the single data (as effectively the
only choice on a single device, at least thru 4.3, tho there's a patch
adding dup data as an option that I think should be in 4.4) when
covering data, dup metadata when covering it.

For multiple devices, it's default raid1 metadata, default single data,
so the picture doesn't differ much by default from the single-device
default picture. It's also possible to do single metadata, raidN data,
which really doesn't make sense except for raid0 data, and thus I
believe there's a warning about that sort of layout in newer
mkfs.btrfs, or when lowering the metadata redundancy using balance
filters.

But of course it's possible to do raid1 data and metadata, which would
be two copies of each, regardless of the number of devices (except that
it's 2+, of course). But the copies aren't 1:1 assigned. That is, if
they're of equal generation, btrfs can read either checksum and apply
it to either data/metadata block. (Of course if they're not of equal
generation, btrfs will choose the higher one, thus covering the case of
writing at the time of a crash, since either they will both be the same
generation if the root block wasn't updated to the new one on either
one yet, or one will be a higher/newer generation than the other, if it
had already finished writing one but not the other at the time of the
crash.)
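For illustration: the checksums being discussed are, in btrfs of this
era, CRC-32C (Castagnoli) values computed over each block. A minimal
bit-by-bit sketch of that checksum follows; the real kernel code is
table-driven or uses the SSE4.2 crc32 instruction, so this is only to
show what quantity is stored in the checksum tree, not how btrfs
computes it:

```python
# Minimal bit-by-bit CRC-32C (Castagnoli) sketch -- the checksum btrfs
# stores for data and metadata blocks. Illustrative only; real
# implementations are table-driven or hardware-accelerated.

def crc32c(data: bytes, crc: int = 0) -> int:
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # 0x82F63B78 is the reflected CRC-32C polynomial
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard CRC-32C check value for the ASCII string "123456789":
print(hex(crc32c(b"123456789")))  # -> 0xe3069283
```

On a verify, btrfs recomputes this over the block it just read and
compares against the stored value; with dup/raid1 metadata there are
two stored copies of the checksum, either of which can be used, which
is the point made above.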
This is why it's an extremely good idea, if you have a pair of devices
in raid1 and you mount one of them degraded/writable with the other
unavailable for some reason, that you don't also mount the other one
writable and then try to recombine them. Chances are the generations
wouldn't match and it'd pick the one with the higher generation, but if
they did for some reason match, and both checksums were valid on their
data, but the data differed... either one could be chosen, and a scrub
might choose either one to fix the other as well, which could in theory
result in a file with intermixed blocks from the two different
versions!

Just ensure that if one is mounted writable, it's the only one mounted
writable if there's a chance of recombining, and you'll be fine, as
it'll be the only one with advancing generations. And if by some
accident both are mounted writable separately, the best bet is to be
sure and wipe the one, then add it as a new device, if you're going to
reintroduce it to the same filesystem.

Of course this gets a bit more complicated with 3+ device raid1, since
currently there's still only two copies of each block and two copies of
the checksum, meaning there's at least one device without a copy of
each block, and if the filesystem is mounted degraded writable
repeatedly with a random device missing...

Similarly, the permutations can be calculated for the other raid types,
and for mixed raid types like raid6 data (specified) and raid1 metadata
(unspecified, so the default used), but I won't attempt that here.

-- 
Duncan - List replies preferred. No HTML msgs.