Re: bad metadata crossing stripe boundary
On Thu, Mar 31, 2016 at 11:16:30PM +0200, Kai Krakow wrote:
> Am Thu, 31 Mar 2016 23:00:04 +0200 schrieb Marc Haber:
> > I find it somewhere between funny and disturbing that the first call of btrfs check made my kernel log the following:
> >
> > Mar 31 22:45:36 fan kernel: [ 6253.178264] EXT4-fs (dm-31): mounted filesystem with ordered data mode. Opts: (null)
> > Mar 31 22:45:38 fan kernel: [ 6255.361328] BTRFS: device label fanbtr devid 1 transid 67526 /dev/dm-31
> >
> > No, the filesystem was not converted, it was directly created as btrfs, and no, I didn't try mounting it.
>
> I suggest that your partition contained ext4 before, and you didn't run wipefs before running mkfs.btrfs.

I cryptsetup luksFormat'ted the partition before I mkfs.btrfs'ed it. That should do a much better job than running wipefs, shouldn't it?

Greetings
Marc
--
Marc Haber         | "I don't trust Computers. They     | Mailadresse im Header
Leimen, Germany    |  lose things." Winona Ryder        | Fon: +49 6224 1600402
Nordisch by Nature |  How to make an American Quilt     | Fax: +49 6224 1600421
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
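For what it's worth, the two signatures don't even compete for the same bytes, which is why wipefs exists at all: the ext4 primary superblock sits at offset 1 KiB and the btrfs one at 64 KiB, so a stale ext4 signature can survive a mkfs.btrfs. A toy scanner illustrating this (the offsets and magics are the documented superblock locations; the script itself is only a sketch, not how blkid actually works):

```python
import os
import struct
import tempfile

EXT4_SB_OFFSET = 1024        # ext4 primary superblock lives at 1 KiB
EXT4_MAGIC_OFF = 56          # s_magic field within that superblock
EXT4_MAGIC = 0xEF53
BTRFS_SB_OFFSET = 64 * 1024  # btrfs primary superblock lives at 64 KiB
BTRFS_MAGIC_OFF = 64         # magic follows csum + fsid + bytenr + flags
BTRFS_MAGIC = b"_BHRfS_M"

def scan_signatures(path):
    """Toy wipefs: report which of the two signatures are present."""
    found = []
    with open(path, "rb") as f:
        f.seek(EXT4_SB_OFFSET + EXT4_MAGIC_OFF)
        if struct.unpack("<H", f.read(2))[0] == EXT4_MAGIC:
            found.append("ext4")
        f.seek(BTRFS_SB_OFFSET + BTRFS_MAGIC_OFF)
        if f.read(8) == BTRFS_MAGIC:
            found.append("btrfs")
    return found

# Simulate a partition that held ext4 and was then mkfs.btrfs'ed without
# wipefs: the two superblocks don't overlap, so both signatures coexist.
with tempfile.NamedTemporaryFile(delete=False) as img:
    img.truncate(1024 * 1024)
    img.seek(EXT4_SB_OFFSET + EXT4_MAGIC_OFF)
    img.write(struct.pack("<H", EXT4_MAGIC))   # stale ext4 signature
    img.seek(BTRFS_SB_OFFSET + BTRFS_MAGIC_OFF)
    img.write(BTRFS_MAGIC)                     # new btrfs signature
    path = img.name

sigs = scan_signatures(path)
print(sigs)  # ['ext4', 'btrfs']
os.remove(path)
```

This is also why running `wipefs -a` before mkfs is the usual advice: it zeroes the known signature offsets rather than relying on the new filesystem to happen to overwrite them.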
Re: [PATCH V15 00/15] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size
On Thursday 31 Mar 2016 15:59:11 David Sterba wrote:
> On Thu, Mar 31, 2016 at 11:31:06AM +0200, David Sterba wrote:
> > On Tue, Mar 22, 2016 at 06:50:32PM +0530, Chandan Rajendra wrote:
> > > On Tuesday 22 Mar 2016 12:04:23 David Sterba wrote:
> > > > On Thu, Feb 11, 2016 at 11:17:38PM +0530, Chandan Rajendra wrote:
> > > > > this patchset temporarily disables the commit f82c458a2c3ffb94b431fc6ad791a79df1b3713e.
> > > > >
> > > > > The commits for the Btrfs kernel module can be found at https://github.com/chandanr/linux/tree/btrfs/subpagesize-blocksize.
> > > >
> > > > The branch does not apply cleanly to at least 4.5; I've tried to rebase it but there are conflicts that are not simple. Please update it on top of current master, i.e. with the preparatory patchset merged.
> > >
> > > I will rebase the branch and post the patchset soon.
> >
> > JFYI, I've seen some minor compilation failures:
> > - __do_readpage: unused variable cached
> > - end_bio_extent_buffer_readpage: btree_readahead_hook must take fs_info
> > - fails build with (config) sanity checks enabled, the fs_info moved from eb to eb head
>
> And the tests crash with my quick fixes, so I'll move the branch out of next for now. Please fix it and let me know. My fixes are on top of your latest branch, chandan-subpage-latest, in my development gits.

Hi David,

After rebasing the patchset, I found a hard-to-reproduce space accounting bug. I am currently figuring out the root cause of the bug. I will post the patchset once the issue is fixed.

--
chandan
Re: [PATCH v9 00/19] Btrfs dedupe framework
David Sterba wrote on 2016/03/31 18:12 +0200:
> On Wed, Mar 30, 2016 at 03:55:55PM +0800, Qu Wenruo wrote:
> > This March 30th patchset update mostly addresses the patchset structure comment from David:
> > 1) Change the patchset sequence
> > If only applying the first 14 patches, it can provide the full backward-compatible in-memory-only dedupe backend. Only starting from patch 15 will the on-disk format be changed.
> > So patches 1~14 are going to be pushed for the next merge window, while I'll still submit them all for review purposes.
>
> I'll buy 1-10 with the ioctl hidden under the BTRFS_DEBUG config option until the interface is settled.

BTW, don't pick them directly, as I just forgot 2 independent bug fix patches.

Thanks,
Qu
Re: fallocate mode flag for "unshare blocks"?
On Fri, Apr 01, 2016 at 11:33:00AM +1100, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 06:34:17PM -0400, J. Bruce Fields wrote:
> > I haven't looked at the code, but I assume a JUKEBOX-returning write to an absent file brings into cache the bits necessary to perform the write, but stops short of actually doing the write.
>
> Not exactly, as all subsequent read/write/truncate requests will return EJUKEBOX until the absent file has been brought back onto disk. Once that is done, the next operation attempt will proceed.
>
> > That allows handling the retried write quickly without doing the wrong thing in the case the retry never comes.
>
> Essentially. But if a retry never comes it means there's either a bug in the client NFS implementation or the client crashed,

NFS clients are under no obligation to retry operations after JUKEBOX. And I'd expect them not to in the case where the calling process was interrupted, for example.

> > I guess it doesn't matter as much in practice, since the only way you're likely to notice that fallocate unexpectedly succeeded would be if it caused you to hit ENOSPC elsewhere. Is that right? Still, it seems a little weird.
>
> s/succeeded/failed/ and that statement is right.

Sorry, I didn't explain clearly. The case I was worrying about was the case where the on-the-wire ALLOCATE call returns JUKEBOX, but the server allocates anyway. That behavior violates the spec as I understand it. The client therefore assumes there was no allocation, when in fact there was.

So, technically a bug, but I wondered if it's likely to bite anyone. One of the only ways someone would notice is if it caused the filesystem to run out of space earlier than expected. But perhaps that's unlikely.

--b.
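As an aside, the retry contract being debated here is easy to mock. NFSv4 spells the v3 JUKEBOX error as NFS4ERR_DELAY (10008); the server mock and function names below are invented purely for illustration of the "error promises nothing, client may retry or give up" semantics Bruce describes:

```python
import time

NFS4ERR_DELAY = 10008  # NFSv4's spelling of the v3 JUKEBOX error

def call_with_retry(op, retries=5, delay=0.0):
    """Client-side handling of DELAY/JUKEBOX: the error promises nothing,
    so the client may retry later or simply give up (e.g. if the calling
    process was interrupted)."""
    for _ in range(retries):
        status, result = op()
        if status != NFS4ERR_DELAY:
            return status, result
        time.sleep(delay)
    return NFS4ERR_DELAY, None

# Mock server: an "absent" file being migrated back from slow storage;
# the first two attempts get DELAY, then the operation succeeds.
attempts = {"n": 0}
def mock_allocate():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        return NFS4ERR_DELAY, None
    return 0, "allocated"

status, result = call_with_retry(mock_allocate)
print(status, result)  # 0 allocated
```

The dispute in the thread is precisely about the server side of this: whether the work behind `mock_allocate` may keep running after DELAY has been returned, given that the retry loop above is optional for the client.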
Re: btrfsck: backpointer mismatch (and multiple other errors)
Henk Slager wrote on 2016/04/01 01:27 +0200:
> On Thu, Mar 31, 2016 at 10:44 PM, Kai Krakow wrote:
> > Hello!
> >
> > I already reported this in another thread but it was a bit confusing by intermixing multiple volumes. So let's start a new thread:
> >
> > Since one of the last kernel upgrades, I'm experiencing one VDI file (containing an NTFS image with Windows 7) getting damaged when running the machine in VirtualBox. I got knowledge about this after experiencing an error "duplicate object" and btrfs went RO. I fixed it by deleting the VDI and restoring from backup - but now I get csum errors as soon as some VM IO goes into the VDI file.
> >
> > The FS is still usable. One effect is that after reading all files with rsync (to copy to my backup), each call of "du" or "df" hangs; similar calls to "btrfs {sub|fi} ..." show the same effect. I guess one outcome of this is that the FS does not properly unmount during shutdown.
> >
> > Kernel is 4.5.0 by now (the FS is much much older, dates back to the 3.x series, and never had problems), including Gentoo patch-set r1.
>
> One possibility could be that the vbox kernel modules somehow corrupt the btrfs kernel area since kernel 4.5. In order to make this reproducible (or an attempt to reproduce) for others, you could unload the VirtualBox stuff, restore the VDI file from backup (or whatever big file), and then make pseudo-random but reproducible writes to the file. It is not clear to me what 'Gentoo patch-set r1' is and does, so just boot a vanilla v4.5 kernel from kernel.org and see if you get csum errors in dmesg.
>
> Also, where does 'duplicate object' come from? dmesg? Then please post its surroundings, straight from dmesg.
> > The device layout is:
> >
> > $ lsblk -o NAME,MODEL,FSTYPE,LABEL,MOUNTPOINT
> > NAME        MODEL            FSTYPE LABEL  MOUNTPOINT
> > sda         Crucial_CT128MX1
> > ├─sda1                       vfat   ESP    /boot
> > ├─sda2
> > └─sda3                       bcache
> >   ├─bcache0                  btrfs  system
> >   ├─bcache1                  btrfs  system
> >   └─bcache2                  btrfs  system /usr/src
> > sdb         SAMSUNG HD103SJ
> > ├─sdb1                       swap   swap0  [SWAP]
> > └─sdb2                       bcache
> >   └─bcache2                  btrfs  system /usr/src
> > sdc         SAMSUNG HD103SJ
> > ├─sdc1                       swap   swap1  [SWAP]
> > └─sdc2                       bcache
> >   └─bcache1                  btrfs  system
> > sdd         SAMSUNG HD103UJ
> > ├─sdd1                       swap   swap2  [SWAP]
> > └─sdd2                       bcache
> >   └─bcache0                  btrfs  system
> >
> > Mount options are:
> >
> > $ mount | fgrep btrfs
> > /dev/bcache2 on / type btrfs (rw,noatime,compress=lzo,nossd,discard,space_cache,autodefrag,subvolid=256,subvol=/gentoo/rootfs)
> >
> > The FS uses mraid=1 and draid=0.
> >
> > Output of btrfsck is (also available here: https://gist.github.com/kakra/bfcce4af242f6548f4d6b45c8afb46ae):
> >
> > $ btrfsck /dev/disk/by-label/system
> > checking extents
> > ref mismatch on [10443660537856 524288] extent item 1, found 2
>
> This 10443660537856 number is bigger than the 1832931324360 number found for total bytes. AFAIK, this is already wrong.

Nope. That's a btrfs logical space address, which can be beyond the real disk bytenr. The easiest way to reproduce such a case is to write something into a 256M btrfs and balance the fs several times; then all chunks can be at bytenr beyond 256M.

The real problem is that the extent has a mismatched reference. Normally it can be fixed by the --init-extent-tree option, but it usually indicates a bigger problem, especially since it has already caused a kernel delayed-ref problem. Not to mention the error "extent item 11271947091968 has multiple extent items", which makes the problem more serious.

I assume some older kernel has already screwed up the extent tree; delayed-ref handling is bug-prone, although it has improved in recent years. But it seems the fs tree is less damaged, so I assume the extent tree corruption could be fixed by "--init-extent-tree".
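Qu's point that a logical address can legitimately exceed the device size is easy to model. The following is a deliberately tiny toy (nothing like btrfs's real chunk allocator), but it shows how repeated balances, which relocate data into freshly allocated chunks, push logical bytenrs past the 256M device size from his example:

```python
class ToyChunkAllocator:
    """Toy model of btrfs logical addressing (not the real allocator):
    every new chunk gets a fresh logical range, and balance relocates
    live chunks into newly allocated ones."""
    CHUNK = 32 * 1024 * 1024  # 32M chunk size, purely illustrative

    def __init__(self):
        self.next_logical = 0
        self.live_chunks = []

    def alloc_chunk(self):
        start = self.next_logical
        self.next_logical += self.CHUNK
        self.live_chunks.append(start)
        return start

    def balance(self):
        # Relocation: copy every live chunk into a brand-new chunk,
        # then drop the old ones; logical addresses only ever grow.
        old, self.live_chunks = self.live_chunks, []
        for _ in old:
            self.alloc_chunk()

DEV_SIZE = 256 * 1024 * 1024  # the 256M filesystem from Qu's example
fs = ToyChunkAllocator()
fs.alloc_chunk()              # write something
for _ in range(10):           # balance the fs several times
    fs.balance()
print(fs.live_chunks[0] > DEV_SIZE)  # True: logical bytenr beyond device size
```

The physical placement is handled separately by the chunk mapping, which is why a logical address like 10443660537856 being larger than the ~1.8T "total bytes" figure is not in itself an inconsistency.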
For the only fs tree error (missing csum), if "btrfsck --init-extent-tree --repair" works without any problem, the most simple fix would be just removing the file. Or you can use a lot of CPU time and disk IO to rebuild the whole csum tree, by using the "--init-csum-tree" option.

Thanks,
Qu

> [...]
> > checking fs roots
> > root 4336 inode 4284125 errors 1000, some csum missing
>
> What is in this inode?
>
> > Checking filesystem on /dev/disk/by-label/system
> > UUID: d2bb232a-2e8f-4951-8bcc-97e237f1b536
> > found 1832931324360 bytes used err is 1
> > total csum bytes: 1730105656
> > total tree bytes: 6494474240
> > total fs tree bytes: 3789783040
> > total extent tree bytes: 608219136
> > btree space waste bytes: 1221460063
> > file data blocks allocated: 2406059724800
> > referenced 2040857763840
Re: fallocate mode flag for "unshare blocks"?
On Thu, Mar 31, 2016 at 06:34:17PM -0400, J. Bruce Fields wrote:
> On Fri, Apr 01, 2016 at 09:20:23AM +1100, Dave Chinner wrote:
> > On Thu, Mar 31, 2016 at 01:47:50PM -0600, Andreas Dilger wrote:
> > > On Mar 31, 2016, at 12:08 PM, J. Bruce Fields wrote:
> > > > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
> > > >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> > > >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> > > >>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > > >>>>> Or is it ok that fallocate could block, potentially for a long time as we stream cows through the page cache (or however unshare works internally)? Those same programs might not be expecting fallocate to take a long time.
> > > >>>>
> > > >>>> Yes, it's perfectly fine for fallocate to block for long periods of time. See what gfs2 does during preallocation of blocks - it ends up calling sb_issue_zerout() because it doesn't have unwritten extents, and hence can block for long periods of time.
> > > >>>
> > > >>> gfs2 fallocate is an implementation that will cause all but the most trivial users real pain. Even the initial XFS implementation just marking the transactions synchronous made it unusable for all kinds of applications, and this is much worse. E.g. an NFS ALLOCATE operation to gfs2 will probably hang your connection for extended periods of time.
> > > >>>
> > > >>> If we need to support something like what gfs2 does we should have a separate flag for it.
> > > >>
> > > >> Using fallocate() for preallocation was always intended to be a faster, more efficient method of allocating zeroed space than having userspace write blocks of data. Faster, more efficient does not mean instantaneous, and gfs2 using sb_issue_zerout() means that if the hardware has zeroing offloads (deterministic trim, write same, etc.) it will use them, and that will be much faster than writing zeros from userspace.
> > > >>
> > > >> IMO, what gfs2 does is definitely within the intended usage of fallocate() for accelerating the preallocation of blocks.
> > > >>
> > > >> Yes, it may not be optimal for things like NFS servers which haven't considered that a fallocate-based offload operation might take some time to execute, but that's not a problem with fallocate. i.e. that's a problem with the nfs server ALLOCATE implementation not being prepared to return NFSERR_JUKEBOX to prevent client side hangs and timeouts while the operation is run.
> > > >
> > > > That's an interesting idea, but I don't think it's really legal. I take JUKEBOX to mean "sorry, I'm failing this operation for now, try again later and it might succeed", not "OK, I'm working on it, try again and you may find out I've done it".
> > > >
> > > > So if the client gets a JUKEBOX error but the server goes ahead and does the operation anyway, that'd be unexpected.
> > >
> > > Well, the tape continued to be mounted in the background and/or the file restored from the tape into the filesystem...
> >
> > Right, and SGI have been shipping a DMAPI-aware Linux NFS server for many years, using the above NFSERR_JUKEBOX behaviour for operations that may block for a long time due to the need to pull stuff into the filesystem from the slow backing store. Best explanation is in the relevant commit in the last published XFS+DMAPI branch from SGI, for example:
> >
> > http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=28b171cf2b64167826474efbb82ad9d471a05f75
>
> I haven't looked at the code, but I assume a JUKEBOX-returning write to an absent file brings into cache the bits necessary to perform the write, but stops short of actually doing the write.

Not exactly, as all subsequent read/write/truncate requests will return EJUKEBOX until the absent file has been brought back onto disk. Once that is done, the next operation attempt will proceed.

> That allows handling the retried write quickly without doing the wrong thing in the case the retry never comes.

Essentially. But if a retry never comes it means there's either a bug in the client NFS implementation or the client crashed, in which case we don't really care.

> Implementing fallocate by returning JUKEBOX while still continuing the allocation in the background is a bit different.

Not really. Like the HSM case, we don't really care if a retry occurs or not - the server simply needs to reply NFSERR_JUKEBOX for all subsequent read/write/fallocate/truncate operations on that inode until the fallocate completes... i.e. it requires O_NONBLOCK style operation for filesystem IO operations to
Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning
David Sterba wrote on 2016/03/31 18:30 +0200:
> On Thu, Mar 31, 2016 at 10:19:34AM +0800, Qu Wenruo wrote:
> > At least 2 users from the mailing list reported btrfsck raising false alerts of "bad metadata [,) crossing stripe boundary", while the reported numbers are all inside the same 64K boundary.
> >
> > After some checking, all the false alerts share the same bytenr feature: the bytenr can be divided by the stripe size (64K).
> >
> > The cause seems to be that the initial 'max_size' can be 0, causing 'start' + 'max_size' - 1 to cross the stripe boundary.
> >
> > Fix it by always updating extent_record->cross_stripe when the extent_record is updated, to avoid a temporary false alert being reported.
> >
> > Signed-off-by: Qu Wenruo
>
> Applied, thanks. Do you have a test image for that?

Unfortunately, no. Although I figured out the cause of the false alert, I still didn't find an image/method to reproduce it, except the images from the reporters. I can dig a little further trying to make an image.

Thanks,
Qu
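The false alert Qu describes can be seen in a sketch of the boundary check (a simplified stand-in for the btrfsck logic, using Python integers rather than the unsigned arithmetic of the C code):

```python
STRIPE_LEN = 64 * 1024  # the 64K stripe size from the patch

def crosses_stripe(start, size):
    """Does [start, start + size - 1] span a 64K stripe boundary?
    A simplified stand-in for the btrfsck check the patch touches."""
    return start // STRIPE_LEN != (start + size - 1) // STRIPE_LEN

aligned = 8 * STRIPE_LEN  # a bytenr divisible by 64K, as in the reports
print(crosses_stripe(aligned, 16 * 1024))  # False: extent fits in one stripe
# But with an uninitialized max_size of 0, start + 0 - 1 lands in the
# *previous* stripe, so the check fires spuriously:
print(crosses_stripe(aligned, 0))          # True: the false alert
```

This also matches the observation in the patch description that every false alert had a stripe-aligned bytenr: only for aligned extents does `start - 1` fall into a different stripe.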
Re: [PATCH v9 00/19] Btrfs dedupe framework
David Sterba wrote on 2016/03/31 18:12 +0200:
> On Wed, Mar 30, 2016 at 03:55:55PM +0800, Qu Wenruo wrote:
> > This March 30th patchset update mostly addresses the patchset structure comment from David:
> > 1) Change the patchset sequence
> > If only applying the first 14 patches, it can provide the full backward-compatible in-memory-only dedupe backend. Only starting from patch 15 will the on-disk format be changed.
> > So patches 1~14 are going to be pushed for the next merge window, while I'll still submit them all for review purposes.
>
> I'll buy 1-10 with the ioctl hidden under the BTRFS_DEBUG config option until the interface is settled.

Nice to hear that. I'll add the BTRFS_DEBUG config then.

BTW, any comment on the btrfs-convert rewrite?

Thanks,
Qu
Re: btrfsck: backpointer mismatch (and multiple other errors)
On Thu, Mar 31, 2016 at 10:44 PM, Kai Krakow wrote:
> Hello!
>
> I already reported this in another thread but it was a bit confusing by intermixing multiple volumes. So let's start a new thread:
>
> Since one of the last kernel upgrades, I'm experiencing one VDI file (containing an NTFS image with Windows 7) getting damaged when running the machine in VirtualBox. I got knowledge about this after experiencing an error "duplicate object" and btrfs went RO. I fixed it by deleting the VDI and restoring from backup - but now I get csum errors as soon as some VM IO goes into the VDI file.
>
> The FS is still usable. One effect is that after reading all files with rsync (to copy to my backup), each call of "du" or "df" hangs; similar calls to "btrfs {sub|fi} ..." show the same effect. I guess one outcome of this is that the FS does not properly unmount during shutdown.
>
> Kernel is 4.5.0 by now (the FS is much much older, dates back to the 3.x series, and never had problems), including Gentoo patch-set r1.

One possibility could be that the vbox kernel modules somehow corrupt the btrfs kernel area since kernel 4.5. In order to make this reproducible (or an attempt to reproduce) for others, you could unload the VirtualBox stuff, restore the VDI file from backup (or whatever big file), and then make pseudo-random but reproducible writes to the file. It is not clear to me what 'Gentoo patch-set r1' is and does, so just boot a vanilla v4.5 kernel from kernel.org and see if you get csum errors in dmesg.

Also, where does 'duplicate object' come from? dmesg? Then please post its surroundings, straight from dmesg.
> The device layout is:
>
> $ lsblk -o NAME,MODEL,FSTYPE,LABEL,MOUNTPOINT
> NAME        MODEL            FSTYPE LABEL  MOUNTPOINT
> sda         Crucial_CT128MX1
> ├─sda1                       vfat   ESP    /boot
> ├─sda2
> └─sda3                       bcache
>   ├─bcache0                  btrfs  system
>   ├─bcache1                  btrfs  system
>   └─bcache2                  btrfs  system /usr/src
> sdb         SAMSUNG HD103SJ
> ├─sdb1                       swap   swap0  [SWAP]
> └─sdb2                       bcache
>   └─bcache2                  btrfs  system /usr/src
> sdc         SAMSUNG HD103SJ
> ├─sdc1                       swap   swap1  [SWAP]
> └─sdc2                       bcache
>   └─bcache1                  btrfs  system
> sdd         SAMSUNG HD103UJ
> ├─sdd1                       swap   swap2  [SWAP]
> └─sdd2                       bcache
>   └─bcache0                  btrfs  system
>
> Mount options are:
>
> $ mount | fgrep btrfs
> /dev/bcache2 on / type btrfs (rw,noatime,compress=lzo,nossd,discard,space_cache,autodefrag,subvolid=256,subvol=/gentoo/rootfs)
>
> The FS uses mraid=1 and draid=0.
>
> Output of btrfsck is (also available here: https://gist.github.com/kakra/bfcce4af242f6548f4d6b45c8afb46ae):
>
> $ btrfsck /dev/disk/by-label/system
> checking extents
> ref mismatch on [10443660537856 524288] extent item 1, found 2

This 10443660537856 number is bigger than the 1832931324360 number found for total bytes. AFAIK, this is already wrong.

[...]

> checking fs roots
> root 4336 inode 4284125 errors 1000, some csum missing

What is in this inode?

> Checking filesystem on /dev/disk/by-label/system
> UUID: d2bb232a-2e8f-4951-8bcc-97e237f1b536
> found 1832931324360 bytes used err is 1
> total csum bytes: 1730105656
> total tree bytes: 6494474240
> total fs tree bytes: 3789783040
> total extent tree bytes: 608219136
> btree space waste bytes: 1221460063
> file data blocks allocated: 2406059724800
> referenced 2040857763840
Re: "/tmp/mnt.", and not honouring compression
On 31/03/2016 22:49, Chris Murray wrote:
> Hi,
>
> I'm trying to troubleshoot a ceph cluster which doesn't seem to be honouring BTRFS compression on some OSDs. Can anyone offer some help? Is it likely to be a ceph issue or a BTRFS one? Or something else? I've asked on ceph-users already, but not received a response yet.
>
> Config is set to mount with "noatime,nodiratime,compress-force=lzo"
>
> Some OSDs have been getting much more full than others though, which I think is something to do with these 'tmp' mounts e.g. below:

Note that there are other reasons for unbalanced storage on Ceph OSDs. The main reason is too few PGs (there's a calculator on ceph.com, google for it). These tmp mounts aren't normal; you should find out what is causing them.

So it might be a Ceph issue (too few PGs) or a system issue (some component trying to use your filesystems for its own purposes). You might have more luck on the ceph-users list (post your Ceph version, the result of "ceph osd tree", df on all OSDs, and hunt for the process creating these mounts on your systems). It's probably not a Btrfs issue (I run a Ceph-on-Btrfs cluster in production and I've never seen this kind of problem).

Lionel
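On the "too few PGs" point: the ceph.com calculator Lionel mentions implements roughly the rule of thumb sketched below, targeting about 100 placement groups per OSD divided by the replica count, rounded up to a power of two (the function name and exact rounding here are mine, for illustration):

```python
def suggest_pg_num(num_osds, replicas, target_pgs_per_osd=100):
    """Rule-of-thumb placement group count for one pool:
    (OSDs * target PGs per OSD) / replica count, rounded up
    to the next power of two."""
    raw = num_osds * target_pgs_per_osd / replicas
    pg_num = 1
    while pg_num < raw:
        pg_num *= 2
    return pg_num

print(suggest_pg_num(9, 3))  # e.g. 9 OSDs, 3 replicas
```

Too few PGs means each PG holds a large share of the data, and random placement of a handful of big PGs per OSD makes some OSDs much fuller than others, which matches the imbalance Chris reports.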
Re: "/tmp/mnt.", and not honouring compression
Chris Murray posted on Thu, 31 Mar 2016 21:49:29 +0100 as excerpted:
> I'm using Proxmox, based on Debian. Kernel version 4.2.8-1-pve. Btrfs v3.17.

The problem itself is beyond my level, but aiming for the obvious low-hanging fruit...

On this list, which is forward-looking as btrfs remains stabilizing, not yet fully stable and mature, kernel support comes in four tracks: mainstream and btrfs development trees, mainstream current, mainstream LTS, and everything else.

Mainstream and btrfs development trees should be obvious. This covers mainstream current git and rc kernels as well as btrfs-integration and linux-next. Generally only recommended for bleeding edge testers willing to lose what they're testing.

Mainstream current follows mainstream latest releases, with generally the latest two kernel series being best supported. With 4.5 out, that's 4.5 and 4.4.

Mainstream LTS follows the mainstream LTS series, and until recently, again the latest two were best supported. That's the 4.4 and 4.1 LTS series. However, as btrfs has matured, the previous LTS series, 3.18, hasn't turned out so bad and remains reasonably well supported as well, tho depending on the issue, you may still be asked to upgrade and see if it's still there in 4.1 or 4.4.

Then there's "everything else", which is where a 4.2 kernel such as you're running comes in. These kernels are either long-ago history (pre-3.18 LTS, for instance) in btrfs terms, or out of their mainstream kernel support windows, which is where 4.2 is. While we recognize that various distros claiming btrfs support may still be using these kernels, because we're mainline-focused we don't track what patches they may or may not have backported, and thus aren't in a particularly good position to support them. If you're relying on your distro's support in such a case, that's where you need to look, as they know what they've backported and what they haven't and are thus in a far better position to provide support.

As for the list, we still do the best we can with these "everything else" kernels, but unless it's a known problem recognized on sight, that's most often simply to recommend upgrading to something that's better supported and trying to duplicate the problem there.

Meanwhile, for long-term enterprise-level stability, btrfs isn't likely to be a good choice in any case, as it really is still stabilizing and the expectation is that people running it will be upgrading to get the newer patches. If that's not feasible, as it may not be for the enterprise-stability-level use-case, then it's very likely that btrfs isn't a good match for the use-case anyway, as it's simply not at that level of stability yet. A more mature filesystem such as ext4, ext3, the old reiserfs which I still use on some spinning rust here (all my btrfs are on ssd), xfs, etc., is very likely to be a more appropriate choice for that use-case.

For kernel 4.2, that leaves you with a few choices:

1) Ask your distro for btrfs support if they offer it on the out-of-mainline-support kernels which they've obviously chosen to use instead of the LTS series that /are/ still mainline supported.

2) Upgrade to the supported 4.4 LTS kernel series.

3) Downgrade to the older supported 4.1 LTS kernel series.

4) Decide btrfs is inappropriate for your use-case and switch to a fully stable and mature filesystem.

5) Continue with 4.2 and muddle thru, using our "best effort" help where you can, and doing without or getting it elsewhere if the opportunity presents itself or you have money to buy it from a qualified provider.

Personally I'd choose option 2, upgrading to 4.4, but that's just me. The other choices may work better for you.

As for btrfs-progs userspace, when the filesystem is working it's not as critical, since other than filesystem creation with mkfs.btrfs, most operational commands simply invoke kernel code to do the real work.

However, once problems appear, a newer version can be critical, as patches to deal with newly discovered problems continue to be added to tools such as btrfs check (for detecting and repairing problems) and btrfs restore (for recovery of files off an unmountable filesystem). And newer userspace is designed to work with older kernels, so newer isn't a problem in that regard.

As a result, to keep userspace from getting /too/ far behind, and because userspace release version numbers are synced with kernel versions, a good rule of thumb is to run a userspace version similar to that of your kernel, or newer. Assuming you're already following the current or LTS track kernel recommendations, that will keep you reasonably current, and you can always upgrade to the newest available if you're trying to fix otherwise unfixable problems.

Unfortunately your userspace falls well outside that recommendation as well, with 3.17 userspace being before the earliest supported 3.18 LTS kernel, let alone comparable to your current
Re: fallocate mode flag for "unshare blocks"?
On Fri, Apr 01, 2016 at 09:20:23AM +1100, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 01:47:50PM -0600, Andreas Dilger wrote:
> > On Mar 31, 2016, at 12:08 PM, J. Bruce Fields wrote:
> > > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
> > > >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> > > >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> > > >>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > > >>>>> Or is it ok that fallocate could block, potentially for a long time as we stream cows through the page cache (or however unshare works internally)? Those same programs might not be expecting fallocate to take a long time.
> > > >>>>
> > > >>>> Yes, it's perfectly fine for fallocate to block for long periods of time. See what gfs2 does during preallocation of blocks - it ends up calling sb_issue_zerout() because it doesn't have unwritten extents, and hence can block for long periods of time.
> > > >>>
> > > >>> gfs2 fallocate is an implementation that will cause all but the most trivial users real pain. Even the initial XFS implementation just marking the transactions synchronous made it unusable for all kinds of applications, and this is much worse. E.g. an NFS ALLOCATE operation to gfs2 will probably hang your connection for extended periods of time.
> > > >>>
> > > >>> If we need to support something like what gfs2 does we should have a separate flag for it.
> > > >>
> > > >> Using fallocate() for preallocation was always intended to be a faster, more efficient method of allocating zeroed space than having userspace write blocks of data. Faster, more efficient does not mean instantaneous, and gfs2 using sb_issue_zerout() means that if the hardware has zeroing offloads (deterministic trim, write same, etc.) it will use them, and that will be much faster than writing zeros from userspace.
> > > >>
> > > >> IMO, what gfs2 does is definitely within the intended usage of fallocate() for accelerating the preallocation of blocks.
> > > >>
> > > >> Yes, it may not be optimal for things like NFS servers which haven't considered that a fallocate-based offload operation might take some time to execute, but that's not a problem with fallocate. i.e. that's a problem with the nfs server ALLOCATE implementation not being prepared to return NFSERR_JUKEBOX to prevent client side hangs and timeouts while the operation is run.
> > >
> > > That's an interesting idea, but I don't think it's really legal. I take JUKEBOX to mean "sorry, I'm failing this operation for now, try again later and it might succeed", not "OK, I'm working on it, try again and you may find out I've done it".
> > >
> > > So if the client gets a JUKEBOX error but the server goes ahead and does the operation anyway, that'd be unexpected.
> >
> > Well, the tape continued to be mounted in the background and/or the file restored from the tape into the filesystem...
>
> Right, and SGI have been shipping a DMAPI-aware Linux NFS server for many years, using the above NFSERR_JUKEBOX behaviour for operations that may block for a long time due to the need to pull stuff into the filesystem from the slow backing store. Best explanation is in the relevant commit in the last published XFS+DMAPI branch from SGI, for example:
>
> http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=28b171cf2b64167826474efbb82ad9d471a05f75

I haven't looked at the code, but I assume a JUKEBOX-returning write to an absent file brings into cache the bits necessary to perform the write, but stops short of actually doing the write. That allows handling the retried write quickly without doing the wrong thing in the case the retry never comes.

Implementing fallocate by returning JUKEBOX while still continuing the allocation in the background is a bit different.

I guess it doesn't matter as much in practice, since the only way you're likely to notice that fallocate unexpectedly succeeded would be if it caused you to hit ENOSPC elsewhere. Is that right? Still, it seems a little weird.

--b.
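For readers following along, the userspace-visible contract under discussion, space reserved up front so that later writes into the range cannot hit ENOSPC, can be observed through Python's binding of posix_fallocate(). Whether the call returns quickly (unwritten extents) or blocks for a long time (gfs2-style zeroing) is entirely the filesystem's choice, which is what the thread is arguing about:

```python
import os
import tempfile

# Reserve 4 MiB up front; once this returns successfully, writes into
# the range are guaranteed not to fail with ENOSPC.
fd, path = tempfile.mkstemp()
try:
    os.posix_fallocate(fd, 0, 4 * 1024 * 1024)
    st = os.fstat(fd)
    print(st.st_size)          # 4194304: the file was extended
    print(st.st_blocks * 512)  # typically >= 4 MiB: blocks really reserved
finally:
    os.close(fd)
    os.remove(path)
```

Note that nothing in the interface tells the caller how long the call may block, which is exactly why a slow implementation surprises programs (and NFS servers) that assumed preallocation was near-instantaneous.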
Re: fallocate mode flag for "unshare blocks"?
On Thu, Mar 31, 2016 at 01:47:50PM -0600, Andreas Dilger wrote: > On Mar 31, 2016, at 12:08 PM, J. Bruce Fields wrote: > > > > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote: > >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote: > >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote: > On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote: > > Or is it ok that fallocate could block, potentially for a long time as > > we stream cows through the page cache (or however unshare works > > internally)? Those same programs might not be expecting fallocate to > > take a long time. > > Yes, it's perfectly fine for fallocate to block for long periods of > time. See what gfs2 does during preallocation of blocks - it ends up > calling sb_issue_zerout() because it doesn't have unwritten > extents, and hence can block for long periods of time > >>> > >>> gfs2 fallocate is an implementation that will cause all but the most > >>> trivial users real pain. Even the initial XFS implementation just > >>> marking the transactions synchronous made it unusable for all kinds > >>> of applications, and this is much worse. E.g. an NFS ALLOCATE operation > >>> to gfs2 will probably hang your connection for extended periods of > >>> time. > >>> > >>> If we need to support something like what gfs2 does we should have a > >>> separate flag for it. > >> > >> Using fallocate() for preallocation was always intended to > >> be a faster, more efficient method of allocating zeroed space > >> than having userspace write blocks of data. Faster, more efficient > >> does not mean instantaneous, and gfs2 using sb_issue_zerout() means > >> that if the hardware has zeroing offloads (deterministic trim, write > >> same, etc) it will use them, and that will be much faster than > >> writing zeros from userspace. > >> > >> IMO, what gfs2 does is definitely within the intended usage of > >> fallocate() for accelerating the preallocation of blocks.
> >> > >> Yes, it may not be optimal for things like NFS servers which haven't > >> considered that a fallocate based offload operation might take some > >> time to execute, but that's not a problem with fallocate. i.e. > >> that's a problem with the nfs server ALLOCATE implementation not > >> being prepared to return NFSERR_JUKEBOX to prevent client side hangs > >> and timeouts while the operation is run > > > > That's an interesting idea, but I don't think it's really legal. I take > > JUKEBOX to mean "sorry, I'm failing this operation for now, try again > > later and it might succeed", not "OK, I'm working on it, try again and > > you may find out I've done it". > > > > So if the client gets a JUKEBOX error but the server goes ahead and does > > the operation anyway, that'd be unexpected. > > Well, the tape continued to be mounted in the background and/or the file > restored from the tape into the filesystem... Right, and SGI have been shipping a DMAPI-aware Linux NFS server for many years, using the above NFSERR_JUKEBOX behaviour for operations that may block for a long time due to the need to pull stuff into the filesystem from the slow backing store. Best explanation is in the relevant commit in the last published XFS+DMAPI branch from SGI, for example: http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=28b171cf2b64167826474efbb82ad9d471a05f75 Cheers, Dave. -- Dave Chinner da...@fromorbit.com
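The preallocation path being debated here is reachable from userspace as posix_fallocate(3) (mode-0 fallocate(2)). A minimal sketch — how long the call blocks is filesystem-dependent: ext4 and XFS mark unwritten extents and return quickly, while gfs2 zeroes the blocks as described above.

```python
# Minimal demonstration of mode-0 preallocation from userspace via
# posix_fallocate(3). On filesystems with unwritten extents this is a
# fast metadata operation; on gfs2 it zeroes (or offload-zeroes) the
# range, which is why it can block for a long time. The timing
# difference is filesystem-specific and not asserted here.
import os, tempfile

fd, path = tempfile.mkstemp()
try:
    os.posix_fallocate(fd, 0, 1 << 20)   # reserve 1 MiB starting at offset 0
    st = os.fstat(fd)
    # The file size is extended to cover the allocated range, and blocks
    # are actually reserved, so later writes into it cannot hit ENOSPC.
    print(st.st_size)   # 1048576
finally:
    os.close(fd)
    os.unlink(path)
```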
Re: bad metadata crossing stripe boundary
On Thu, 31 Mar 2016 23:16:30 +0200, Kai Krakow wrote: > On Thu, 31 Mar 2016 23:00:04 +0200, Marc Haber wrote: > > > On Thu, Mar 31, 2016 at 10:31:49AM +0800, Qu Wenruo wrote: > > > Would you please try the following patch based on v4.5 > > > btrfs-progs? https://patchwork.kernel.org/patch/8706891/ > > > > This also fixes the "bad metadata crossing stripe boundary" on my > > pet patient. > > > > I find it somewhere between funny and disturbing that the first call > > of btrfs check made my kernel log the following: > > Mar 31 22:45:36 fan kernel: [ 6253.178264] EXT4-fs (dm-31): mounted > > filesystem with ordered data mode. Opts: (null) Mar 31 22:45:38 fan > > kernel: [ 6255.361328] BTRFS: device label fanbtr devid 1 transid > > 67526 /dev/dm-31 > > > > No, the filesystem was not converted, it was directly created as > > btrfs, and no, I didn't try mounting it. > > I suggest that your partition contained ext4 before, and you didn't > run wipefs before running mkfs.btrfs. Now, the ext4 superblock is > still detected because no btrfs structure or block overwrote it. > I had a similar problem when I first tried btrfs. > > I think there is some magic dd-fu to damage the ext4 superblock > without hurting the btrfs itself. But I leave this to the fs devs, > they could properly tell you. Tho, you could also try to force detecting btrfs before ext4 by modifying /etc/filesystems. -- Regards, Kai Replies to list-only preferred.
Re: bad metadata crossing stripe boundary
On Thu, 31 Mar 2016 23:00:04 +0200, Marc Haber wrote: > On Thu, Mar 31, 2016 at 10:31:49AM +0800, Qu Wenruo wrote: > > Would you please try the following patch based on v4.5 btrfs-progs? > > https://patchwork.kernel.org/patch/8706891/ > > This also fixes the "bad metadata crossing stripe boundary" on my pet > patient. > > I find it somewhere between funny and disturbing that the first call > of btrfs check made my kernel log the following: > Mar 31 22:45:36 fan kernel: [ 6253.178264] EXT4-fs (dm-31): mounted > filesystem with ordered data mode. Opts: (null) Mar 31 22:45:38 fan > kernel: [ 6255.361328] BTRFS: device label fanbtr devid 1 transid > 67526 /dev/dm-31 > > No, the filesystem was not converted, it was directly created as > btrfs, and no, I didn't try mounting it. I suggest that your partition contained ext4 before, and you didn't run wipefs before running mkfs.btrfs. Now, the ext4 superblock is still detected because no btrfs structure or block overwrote it. I had a similar problem when I first tried btrfs. I think there is some magic dd-fu to damage the ext4 superblock without hurting the btrfs itself. But I leave this to the fs devs, they could properly tell you. -- Regards, Kai Replies to list-only preferred.
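Kai's explanation works because the two filesystems keep their primary superblocks at different offsets: the ext4 magic (0xEF53) sits at byte 1080 (superblock at 1 KiB, s_magic at +0x38), while the btrfs magic ("_BHRfS_M") sits at byte 65600 (superblock at 64 KiB, magic at +0x40). mkfs.btrfs never touches byte 1080, so both signatures survive unless wipefs removes the old one. A small sketch, scanning an image roughly the way blkid does:

```python
# Why a stale ext4 signature can coexist with a fresh btrfs: the two
# filesystems keep their primary superblocks at different offsets, and
# mkfs.btrfs does not overwrite the ext4 one (wipefs would).
import os, tempfile

EXT4_MAGIC_OFF = 1024 + 0x38      # ext4 superblock at 1 KiB, s_magic at +0x38
EXT4_MAGIC = b"\x53\xef"          # 0xEF53, little-endian on disk
BTRFS_MAGIC_OFF = 0x10000 + 0x40  # btrfs superblock at 64 KiB, magic at +0x40
BTRFS_MAGIC = b"_BHRfS_M"

def detect_signatures(image_path):
    """Report which of the two signatures are present, as blkid roughly does."""
    found = []
    with open(image_path, "rb") as img:
        img.seek(EXT4_MAGIC_OFF)
        if img.read(2) == EXT4_MAGIC:
            found.append("ext4")
        img.seek(BTRFS_MAGIC_OFF)
        if img.read(8) == BTRFS_MAGIC:
            found.append("btrfs")
    return found

# Fake image that was "mkfs.ext4'ed" and later "mkfs.btrfs'ed" without
# wipefs in between: both signatures end up present, as on dm-31 above.
fd, image = tempfile.mkstemp()
with os.fdopen(fd, "wb") as img:
    img.truncate(128 * 1024)
    img.seek(EXT4_MAGIC_OFF); img.write(EXT4_MAGIC)
    img.seek(BTRFS_MAGIC_OFF); img.write(BTRFS_MAGIC)
print(detect_signatures(image))
```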
Re: [PATCH v2] Btrfs: fix file/data loss caused by fsync after rename and new inode
On Wed, Mar 30, 2016 at 11:37:21PM +0100, fdman...@kernel.org wrote: > From: Filipe Manana> > If we rename an inode A (be it a file or a directory), create a new > inode B with the old name of inode A and under the same parent directory, > fsync inode B and then power fail, at log tree replay time we end up > removing inode A completely. If inode A is a directory then all its files > are gone too. > > Example scenarios where this happens: > This is reproducible with the following steps, taken from a couple of > test cases written for fstests which are going to be submitted upstream > soon: Thanks Filipe! Since this is an older bug, I won't rush it into tomorrow's pull, but I'll test and get it into next week. -chris
Re: [PATCH v2] Btrfs: fix file/data loss caused by fsync after rename and new inode
fdmanana posted on Wed, 30 Mar 2016 23:37:21 +0100 as excerpted: > From: Filipe Manana> > If we rename an inode A (be it a file or a directory), create a new > inode B with the old name of inode A and under the same parent > directory, fsync inode B and then power fail, at log tree replay time > we end up removing inode A completely. If inode A is a directory then > all its files are gone too. ... > V2: Node code changes, only updated the change log and the comment to > be more clear about the problems solved by the new checks. If there's a V3 anyway, apparent typo: s/Node code/No code/ -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: bad metadata crossing stripe boundary
On Thu, Mar 31, 2016 at 10:31:49AM +0800, Qu Wenruo wrote: > Would you please try the following patch based on v4.5 btrfs-progs? > https://patchwork.kernel.org/patch/8706891/ This also fixes the "bad metadata crossing stripe boundary" on my pet patient. I find it somewhere between funny and disturbing that the first call of btrfs check made my kernel log the following: Mar 31 22:45:36 fan kernel: [ 6253.178264] EXT4-fs (dm-31): mounted filesystem with ordered data mode. Opts: (null) Mar 31 22:45:38 fan kernel: [ 6255.361328] BTRFS: device label fanbtr devid 1 transid 67526 /dev/dm-31 No, the filesystem was not converted, it was directly created as btrfs, and no, I didn't try mounting it. Greetings Marc -- - Marc Haber | "I don't trust Computers. They | Mailadresse im Header Leimen, Germany| lose things."Winona Ryder | Fon: *49 6224 1600402 Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
"/tmp/mnt.", and not honouring compression
Hi, I'm trying to troubleshoot a ceph cluster which doesn't seem to be honouring BTRFS compression on some OSDs. Can anyone offer some help? Is it likely to be a ceph issue or a BTRFS one? Or something else? I've asked on ceph-users already, but not received a response yet. Config is set to mount with "noatime,nodiratime,compress-force=lzo". Some OSDs have been getting much more full than others though, which I think is something to do with these 'tmp' mounts, e.g. below:

/dev/sdc1 on /var/lib/ceph/tmp/mnt.AywYKY type btrfs (rw,noatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
/dev/sdd1 on /var/lib/ceph/osd/ceph-16 type btrfs (rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/)
/dev/sdc1 on /var/lib/ceph/osd/ceph-15 type btrfs (rw,noatime,nodiratime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
/dev/sdb1 on /var/lib/ceph/osd/ceph-20 type btrfs (rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/)

After a reboot, it's moved to another drive:

/dev/sdd1 on /var/lib/ceph/tmp/mnt.kWh2NA type btrfs (rw,noatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
/dev/sdd1 on /var/lib/ceph/osd/ceph-16 type btrfs (rw,noatime,nodiratime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
/dev/sdc1 on /var/lib/ceph/osd/ceph-15 type btrfs (rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/)
/dev/sdb1 on /var/lib/ceph/osd/ceph-20 type btrfs (rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/)

I'm using Proxmox, based on Debian. Kernel version 4.2.8-1-pve. Btrfs v3.17. Thank you, Chris
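The mismatch Chris describes — some btrfs mounts carrying compress-force=lzo and others not — is easy to spot mechanically by parsing mount(8)-style lines and flagging mounts that lack the expected option. A throwaway sketch; the sample lines mirror the output above.

```python
# Flag btrfs mounts missing the expected compression option, as a quick
# way to spot the inconsistency described in this mail.
import re

def missing_compress(mount_lines, wanted="compress-force=lzo"):
    bad = []
    for line in mount_lines:
        m = re.match(r"(\S+) on (\S+) type btrfs \((.*)\)", line)
        if m and wanted not in m.group(3).split(","):
            bad.append(m.group(2))   # mountpoint without the option
    return bad

mounts = [
    "/dev/sdd1 on /var/lib/ceph/tmp/mnt.kWh2NA type btrfs "
    "(rw,noatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)",
    "/dev/sdb1 on /var/lib/ceph/osd/ceph-20 type btrfs "
    "(rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/)",
]
print(missing_compress(mounts))  # ['/var/lib/ceph/tmp/mnt.kWh2NA']
```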
btrfsck: backpointer mismatch (and multiple other errors)
Hello! I already reported this in another thread but it was a bit confusing by intermixing multiple volumes. So let's start a new thread: Since one of the last kernel upgrades, I'm experiencing one VDI file (containing a NTFS image with Windows 7) getting damaged when running the machine in VirtualBox. I learned about this after experiencing an error "duplicate object", after which btrfs went RO. I fixed it by deleting the VDI and restoring from backup - but now I get csum errors as soon as some VM IO goes into the VDI file. The FS is still usable. One effect is that after reading all files with rsync (to copy to my backup), each call of "du" or "df" hangs; similar calls to "btrfs {sub|fi} ..." show the same effect. I guess one outcome of this is that the FS does not properly unmount during shutdown. Kernel is 4.5.0 by now (the FS is much much older, dates back to the 3.x series, and never had problems), including Gentoo patch-set r1. The device layout is:

$ lsblk -o NAME,MODEL,FSTYPE,LABEL,MOUNTPOINT
NAME        MODEL            FSTYPE LABEL  MOUNTPOINT
sda         Crucial_CT128MX1
├─sda1                       vfat   ESP    /boot
├─sda2
└─sda3                       bcache
  ├─bcache0                  btrfs  system
  ├─bcache1                  btrfs  system
  └─bcache2                  btrfs  system /usr/src
sdb         SAMSUNG HD103SJ
├─sdb1                       swap   swap0  [SWAP]
└─sdb2                       bcache
  └─bcache2                  btrfs  system /usr/src
sdc         SAMSUNG HD103SJ
├─sdc1                       swap   swap1  [SWAP]
└─sdc2                       bcache
  └─bcache1                  btrfs  system
sdd         SAMSUNG HD103UJ
├─sdd1                       swap   swap2  [SWAP]
└─sdd2                       bcache
  └─bcache0                  btrfs  system

Mount options are:

$ mount|fgrep btrfs
/dev/bcache2 on / type btrfs (rw,noatime,compress=lzo,nossd,discard,space_cache,autodefrag,subvolid=256,subvol=/gentoo/rootfs)

The FS uses mraid=1 and draid=0.
Output of btrfsck is: (also available here: https://gist.github.com/kakra/bfcce4af242f6548f4d6b45c8afb46ae)

$ btrfsck /dev/disk/by-label/system
checking extents
ref mismatch on [10443660537856 524288] extent item 1, found 2
Backref 10443660537856 root 256 owner 23536425 offset 1310720 num_refs 0 not found in extent tree
Incorrect local backref count on 10443660537856 root 256 owner 23536425 offset 1310720 found 1 wanted 0 back 0x4ceee750
Backref disk bytenr does not match extent record, bytenr=10443660537856, ref bytenr=10443660914688
Backref bytes do not match extent backref, bytenr=10443660537856, ref bytes=524288, backref bytes=69632
backpointer mismatch on [10443660537856 524288]
extent item 11271946579968 has multiple extent items
ref mismatch on [11271946579968 110592] extent item 1, found 2
Backref disk bytenr does not match extent record, bytenr=11271946579968, ref bytenr=11271946629120
backpointer mismatch on [11271946579968 110592]
extent item 11271946690560 has multiple extent items
ref mismatch on [11271946690560 114688] extent item 1, found 2
Backref disk bytenr does not match extent record, bytenr=11271946690560, ref bytenr=11271946739712
Backref bytes do not match extent backref, bytenr=11271946690560, ref bytes=114688, backref bytes=110592
backpointer mismatch on [11271946690560 114688]
extent item 11271946805248 has multiple extent items
ref mismatch on [11271946805248 114688] extent item 1, found 3
Backref disk bytenr does not match extent record, bytenr=11271946805248, ref bytenr=11271946850304
Backref bytes do not match extent backref, bytenr=11271946805248, ref bytes=114688, backref bytes=53248
Backref disk bytenr does not match extent record, bytenr=11271946805248, ref bytenr=11271946903552
Backref bytes do not match extent backref, bytenr=11271946805248, ref bytes=114688, backref bytes=49152
backpointer mismatch on [11271946805248 114688]
extent item 11271946919936 has multiple extent items
ref mismatch on [11271946919936 61440] extent item
1, found 2
Backref disk bytenr does not match extent record, bytenr=11271946919936, ref bytenr=11271946952704
Backref bytes do not match extent backref, bytenr=11271946919936, ref bytes=61440, backref bytes=110592
backpointer mismatch on [11271946919936 61440]
extent item 11271946981376 has multiple extent items
ref mismatch on [11271946981376 110592] extent item 1, found 2
Backref disk bytenr does not match extent record, bytenr=11271946981376, ref bytenr=11271947063296
backpointer mismatch on [11271946981376 110592]
extent item 11271947091968 has multiple extent items
ref mismatch on [11271947091968 110592] extent item 1, found 2
Backref disk bytenr does not match extent record, bytenr=11271947091968, ref bytenr=11271947173888
Backref bytes do not match extent backref, bytenr=11271947091968, ref bytes=110592, backref bytes=114688
backpointer mismatch on [11271947091968 110592]
extent item 11271947202560 has multiple
Re: bad metadata crossing stripe boundary
>> Would you please try the following patch based on v4.5 btrfs-progs? >> https://patchwork.kernel.org/patch/8706891/ >> >> According to your output, all the output is false alert. >> All the extent starting bytenr can be divided by 64K, and I think at >> initial time, its 'max_size' may be set to 0, causing "start + 0 - 1" >> to be inside previous 64K range. >> >> The patch would update cross_stripe every time the extent is updated, >> so such temporary false alert should disappear. > > Applied and no more reports of crossing stripe boundary - thanks. > > Will this go into 4.5.1 or 4.5.2? It is not in 4.5.1
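The arithmetic behind the false alert Qu describes above can be checked directly: the checker tests whether [start, start + max_size - 1] stays inside one 64 KiB stripe, and a temporary max_size of 0 makes the end "start - 1", which lands in the previous stripe for any 64K-aligned start. A sketch, using the 64K stripe and 16K nodesize from the thread:

```python
# Demonstrate the "crossing stripe boundary" false alert mechanism:
# with max_size temporarily 0, the computed end byte is start - 1,
# which falls into the previous 64 KiB stripe even though the real
# 16 KiB metadata block is perfectly aligned.
STRIPE = 64 * 1024

def crosses_stripe(start, size):
    return start // STRIPE != (start + size - 1) // STRIPE

start = 4425377054720                 # a 64K-aligned bytenr from such a report
print(start % STRIPE)                 # 0: aligned
print(crosses_stripe(start, 16384))   # real 16K metadata block: False
print(crosses_stripe(start, 0))       # uninitialized max_size: True (false alert)
```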
Re: bad metadata crossing stripe boundary
On Thu, 31 Mar 2016 10:31:49 +0800, Qu Wenruo wrote: > Qu Wenruo wrote on 2016/03/31 09:33 +0800: > > > > > > Kai Krakow wrote on 2016/03/28 12:02 +0200: > >> Changing subject to reflect the current topic... > >> > >> On Sun, 27 Mar 2016 21:55:40 +0800, Qu Wenruo wrote: > >> > [...] > [...] > >> > >> No, btrfs-progs 4.5 reports those, too (as far as I understood, > >> this includes the fixes for bogus "bad metadata" errors, tho I > >> thought this had already been fixed in 4.2.1, I used 4.4.1). There > >> were some nbytes wrong errors before which I already repaired > >> using "--repair". I think that's okay, I had those in the past and > >> it looks like btrfsck can repair those now (and I don't have to > >> delete and recreate the files). It caused problems with "du" and > >> "df" in the past, a problem that I'm currently facing too. So I > >> better fixed them. > >> > >> With that done, the backup fs now only reports "bad metadata" which > >> have been there before space cache v2. Full output below. > >> > [...] > [...] > >> > >> Copy and paste problem. Claws mail pretends to be smarter than me > >> - I missed to fix that one. ;-) > > > > I was searching for the missing '\n' and hoped to find any chance to > > submit a new patch. > > What a pity. :( > > > >> > [...] > >> > >> Helped, it automatically reverted the FS back to space cache v1 > >> with incompat flag cleared. (I wouldn't have enabled v2 if it > >> wasn't documented that this is possible) > >> > [...] > [...] > >> > >> It's gone now, ignore that. It's back to the situation before space > >> cache v2. Minus some "nbytes wrong" errors I had and fixed. > > > > Nice to see it works. > > > >> > >> Nevertheless, I'm now using btrfs-progs 4.5.
Here's the full > >> output: (the lines seem to be partly out of order, probably due to > >> the redirection) > >> > >> $ sudo btrfsck /dev/sde1 2>&1 | tee btrfsck-label-usb-backup.txt > >> checking extents > >> bad metadata [156041216, 156057600) crossing stripe boundary > >> bad metadata [181403648, 181420032) crossing stripe boundary > >> bad metadata [392167424, 392183808) crossing stripe boundary > >> bad metadata [783482880, 783499264) crossing stripe boundary > >> bad metadata [784924672, 784941056) crossing stripe boundary > >> bad metadata [130151612416, 130151628800) crossing stripe boundary > >> bad metadata [162826813440, 162826829824) crossing stripe boundary > >> bad metadata [162927083520, 162927099904) crossing stripe boundary > >> bad metadata [619740659712, 619740676096) crossing stripe boundary > >> bad metadata [619781947392, 619781963776) crossing stripe boundary > >> bad metadata [619795644416, 619795660800) crossing stripe boundary > >> bad metadata [619816091648, 619816108032) crossing stripe boundary > >> bad metadata [620011388928, 620011405312) crossing stripe boundary > >> bad metadata [890992459776, 890992476160) crossing stripe boundary > >> bad metadata [891022737408, 891022753792) crossing stripe boundary > >> bad metadata [891101773824, 891101790208) crossing stripe boundary > >> bad metadata [891301199872, 891301216256) crossing stripe boundary > >> bad metadata [1012219314176, 1012219330560) crossing stripe > >> boundary bad metadata [1017202409472, 1017202425856) crossing > >> stripe boundary bad metadata [1017365397504, 1017365413888) > >> crossing stripe boundary bad metadata [1020764422144, > >> 1020764438528) crossing stripe boundary bad metadata > >> [1251103342592, 1251103358976) crossing stripe boundary bad > >> metadata [1251145809920, 1251145826304) crossing stripe boundary > >> bad metadata [1251147055104, 1251147071488) crossing stripe > >> boundary bad metadata [1259271225344, 1259271241728) crossing > >> stripe 
boundary bad metadata [1266223611904, 1266223628288) > >> crossing stripe boundary bad metadata [1304750063616, > >> 130475008) crossing stripe boundary bad metadata > >> [1304790106112, 1304790122496) crossing stripe boundary bad > >> metadata [1304850792448, 1304850808832) crossing stripe boundary > >> bad metadata [1304869928960, 1304869945344) crossing stripe > >> boundary bad metadata [1305089540096, 1305089556480) crossing > >> stripe boundary bad metadata [1309581443072, 1309581459456) > >> crossing stripe boundary bad metadata [1309583671296, > >> 1309583687680) crossing stripe boundary bad metadata > >> [1309942808576, 1309942824960) crossing stripe boundary bad > >> metadata [1310050549760, 1310050566144) crossing stripe boundary > >> bad metadata [1313031585792, 1313031602176) crossing stripe > >> boundary bad metadata [1313232912384, 1313232928768) crossing > >> stripe boundary bad metadata [1555210764288, 1555210780672) > >> crossing stripe boundary bad metadata [1555395182592, > >> 1555395198976) crossing stripe boundary bad metadata > >> [205057678, 2050576760832) crossing stripe boundary bad > >> metadata [2050803957760,
Re: fallocate mode flag for "unshare blocks"?
On Thu, Mar 31, 2016 at 07:18:55AM -0400, Austin S. Hemmelgarn wrote: > On 2016-03-30 20:32, Liu Bo wrote: > >On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote: > >>Hi all, > >> > >>Christoph and I have been working on adding reflink and CoW support to > >>XFS recently. Since the purpose of (mode 0) fallocate is to make sure > >>that future file writes cannot ENOSPC, I extended the XFS fallocate > >>handler to unshare any shared blocks via the copy on write mechanism I > >>built for it. However, Christoph shared the following concerns with > >>me about that interpretation: > >> > >>>I know that I suggested unsharing blocks on fallocate, but it turns out > >>>this is causing problems. Applications expect falloc to be a fast > >>>metadata operation, and copying a potentially large number of blocks > >>>is against that expectation. This is especially bad for the NFS > >>>server, which should not be blocked for a long time in a synchronous > >>>operation. > >>> > >>>I think we'll have to remove the unshare and just fail the fallocate > >>>for a reflinked region for now. I still think it makes sense to expose > >>>an unshare operation, and we probably should make that another > >>>fallocate mode. > > > >I'm expecting fallocate to be fast, too. > > > >Well, btrfs fallocate doesn't allocate space if it's a shared one > >because it thinks the space is already allocated. So a later overwrite > >over this shared extent may hit enospc errors. > And this _really_ should get fixed, otherwise glibc will add a check for > running posix_fallocate against BTRFS and force emulation, and people _will_ > complain about performance. Even if glibc adds a check like that and emulates fallocate by writing zero to real blocks, btrfs still does cow and requests to allocate space for new writes, so it's not only a performance problem, but also a risk of ENOSPC in extreme cases.
Thanks, -liubo
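The accounting gap Liu Bo describes — fallocate treating shared extents as already allocated, while a CoW overwrite still needs fresh blocks — can be modeled with toy numbers. Purely illustrative; this is not btrfs's real reservation code.

```python
# Toy model: fallocate reserves nothing for extents that already have
# blocks on disk, even when those blocks are shared with a snapshot or
# reflink copy. A full CoW overwrite must still allocate new blocks for
# every shared extent, which is exactly the space fallocate failed to pin.
def unreserved_after_fallocate(extents):
    """extents: (length_bytes, is_shared) pairs covering the fallocated range."""
    return sum(length for length, shared in extents if shared)

# 128K file: first 64K reflinked from a snapshot, second 64K exclusive.
extents = [(64 * 1024, True), (64 * 1024, False)]
print(unreserved_after_fallocate(extents))  # 65536 bytes at ENOSPC risk
```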
Re: fallocate mode flag for "unshare blocks"?
On Mar 31, 2016, at 12:08 PM, J. Bruce Fields wrote: > > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote: >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote: >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote: On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote: > Or is it ok that fallocate could block, potentially for a long time as > we stream cows through the page cache (or however unshare works > internally)? Those same programs might not be expecting fallocate to > take a long time. Yes, it's perfectly fine for fallocate to block for long periods of time. See what gfs2 does during preallocation of blocks - it ends up calling sb_issue_zerout() because it doesn't have unwritten extents, and hence can block for long periods of time >>> >>> gfs2 fallocate is an implementation that will cause all but the most >>> trivial users real pain. Even the initial XFS implementation just >>> marking the transactions synchronous made it unusable for all kinds >>> of applications, and this is much worse. E.g. an NFS ALLOCATE operation >>> to gfs2 will probably hang your connection for extended periods of >>> time. >>> >>> If we need to support something like what gfs2 does we should have a >>> separate flag for it. >> >> Using fallocate() for preallocation was always intended to >> be a faster, more efficient method of allocating zeroed space >> than having userspace write blocks of data. Faster, more efficient >> does not mean instantaneous, and gfs2 using sb_issue_zerout() means >> that if the hardware has zeroing offloads (deterministic trim, write >> same, etc) it will use them, and that will be much faster than >> writing zeros from userspace. >> >> IMO, what gfs2 does is definitely within the intended usage of >> fallocate() for accelerating the preallocation of blocks.
>> >> Yes, it may not be optimal for things like NFS servers which haven't >> considered that a fallocate based offload operation might take some >> time to execute, but that's not a problem with fallocate. i.e. >> that's a problem with the nfs server ALLOCATE implementation not >> being prepared to return NFSERR_JUKEBOX to prevent client side hangs >> and timeouts while the operation is run > > That's an interesting idea, but I don't think it's really legal. I take > JUKEBOX to mean "sorry, I'm failing this operation for now, try again > later and it might succeed", not "OK, I'm working on it, try again and > you may find out I've done it". > > So if the client gets a JUKEBOX error but the server goes ahead and does > the operation anyway, that'd be unexpected. Well, the tape continued to be mounted in the background and/or the file restored from the tape into the filesystem... > I suppose it's comparable to the case where a slow fallocate is > interrupted--would it be legal to return EINTR in that case and leave > the application to sort out whether some part of the allocation had > already happened? If the later fallocate() was not re-doing the same work as the first one, it should be fine for the client to re-send the fallocate() request. The fallocate() to reserve blocks does not touch the blocks that are already allocated, so this is safe to do even if another process is writing to the file. If you have multiple processes writing and calling fallocate() with PUNCH/ZERO/COLLAPSE/INSERT to overlapping regions at the same time then the application is in for a world of hurt already. > Would it be legal to continue the fallocate under the covers even after > returning EINTR? That might produce unexpected results in some cases, but it depends on the options used. Probably the safest is to not continue, and depend on userspace to retry the operation on EINTR. For fallocate() doing prealloc or punch or zero this should eventually complete even if it is slow. 
Cheers, Andreas > But anyway my first inclination is to say that the NFS FALLOCATE > protocol just wasn't designed to handle long-running fallocates, and if > we really need that then we need to give it a way to either report > partial results or to report results asynchronously. > > --b.
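A client-side counterpart of this discussion — treating JUKEBOX (and, per Andreas' point that mode-0 preallocation is idempotent, an interrupted call) as "retry later" — might look like the sketch below. The server function is a stand-in, not a real NFS client.

```python
# Retry loop for an ALLOCATE-style call: JUKEBOX and EINTR mean "try
# again later"; anything else non-zero is a hard failure. Idempotency
# of preallocation (it never re-does already-allocated ranges) is what
# makes blind resending safe.
import errno

NFSERR_JUKEBOX = 10008

def allocate_with_retry(do_allocate, max_tries=5):
    for _ in range(max_tries):
        rc = do_allocate()
        if rc == 0:
            return True
        if rc not in (NFSERR_JUKEBOX, errno.EINTR):
            raise OSError(rc, "ALLOCATE failed")
    return False   # still busy after max_tries; give up for now

# Simulated server: busy twice (recall in progress), then done.
replies = iter([NFSERR_JUKEBOX, NFSERR_JUKEBOX, 0])
print(allocate_with_retry(lambda: next(replies)))  # True
```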
contents incomplete (?)
Hi all - I recently spent some time writing up btrfs ioctl decoding for strace to be used in debugging various third party tools (e.g. snapper) in their interactions with btrfs.[1] In doing so, I found that substantial parts of the tree interface haven't been exported via the uapi headers. For my purposes, it was mostly flags and two structures passed in as ioctl arguments. All things that are intended to be part of the API. But as I started to write up a patch to remedy that, I realized that we've basically exported most of the file system internals via the SEARCH_TREE ioctl. In fact, the SEARCH_TREE ioctl is the intended interface for a number of features (e.g. qgroup status). As a result, pretty much every item type, objectid values, etc are part of the API and should be published via the uapi headers for consumers. I'm happy to do the lifting to move things around to make this happen, but before I do, I wanted to ask: Is it expected that consumers of the interface at this low a level will provide their own copies of the structures, flags, types, and objectid values they want to consume? If so, why? These are all on-disk format values and effectively set in stone. At least the ones that are already defined. Wouldn't things like new flags and reserved values being claimed cause API drift pretty quickly and put the maintenance burden on the third party developer? -Jeff [1] It's also pretty neat in that you can just write a quick program to run an ioctl and not have to bother decoding most of it -- just look at the strace output. I do provide symbolic names for tree ids and key types, but don't decode the item contents. -- Jeff Mahoney SUSE Labs
Re: "bad metadata" not fixed by btrfs repair
On Mon, Mar 28, 2016 at 4:37 PM, Marc Haber wrote:
> Hi,
>
> I have a btrfs which btrfs check --repair doesn't fix:
>
> # btrfs check --repair /dev/mapper/fanbtr
> bad metadata [4425377054720, 4425377071104) crossing stripe boundary
> bad metadata [4425380134912, 4425380151296) crossing stripe boundary
> bad metadata [4427532795904, 4427532812288) crossing stripe boundary
> bad metadata [4568321753088, 4568321769472) crossing stripe boundary
> bad metadata [4568489656320, 4568489672704) crossing stripe boundary
> bad metadata [4571474493440, 4571474509824) crossing stripe boundary
> bad metadata [4571946811392, 4571946827776) crossing stripe boundary
> bad metadata [4572782919680, 4572782936064) crossing stripe boundary
> bad metadata [4573086351360, 4573086367744) crossing stripe boundary
> bad metadata [4574221041664, 4574221058048) crossing stripe boundary
> bad metadata [4574373412864, 4574373429248) crossing stripe boundary
> bad metadata [4574958649344, 4574958665728) crossing stripe boundary
> bad metadata [4575996018688, 4575996035072) crossing stripe boundary
> bad metadata [4580376772608, 4580376788992) crossing stripe boundary

In this case, for all ... [X, Y) ... X is 64K aligned and Y - X = 16K, so these are also false alerts.
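Henk's sanity check — every X 64K-aligned, every Y - X equal to the 16K nodesize — is easy to automate over such a report. A sketch, fed with the first two lines above:

```python
# Classify "bad metadata [X, Y) crossing stripe boundary" lines: a 64K-
# aligned start plus a 16K node size marks the known false alerts.
import re

def false_alerts(report):
    out = []
    for x, y in re.findall(r"bad metadata \[(\d+), (\d+)\)", report):
        x, y = int(x), int(y)
        out.append(x % 65536 == 0 and y - x == 16384)
    return out

report = """\
bad metadata [4425377054720, 4425377071104) crossing stripe boundary
bad metadata [4425380134912, 4425380151296) crossing stripe boundary
"""
print(false_alerts(report))  # [True, True]
```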
Re: "bad metadata" not fixed by btrfs repair
On Thu, Mar 31, 2016 at 4:23 AM, Qu Wenruo wrote: > > > Henk Slager wrote on 2016/03/30 16:03 +0200: >> >> On Wed, Mar 30, 2016 at 9:00 AM, Qu Wenruo >> wrote: >>> >>> First of all. >>> >>> The "crossing stripe boundary" error message itself is *HARMLESS* for >>> recent kernels. >>> >>> It only means that the metadata extent won't be checked by scrub on recent >>> kernels, because scrub, by its code, has a limitation: it can only check tree >>> blocks which are inside a 64K block. >>> >>> Old kernels won't have anything wrong until that tree block is being >>> scrubbed. When scrubbed, old kernels just BUG_ON(). >>> >>> Recent kernels now handle this limitation by checking extent allocation >>> and avoiding crossing the boundary, so a newly created fs with a new kernel >>> won't cause such error messages at all. >>> >>> But for a previously created fs, the problem can't be avoided; at least >>> new kernels will not BUG_ON() when you scrub these extents, they just get >>> ignored (not that good, but at least no BUG_ON). >>> >>> And new fsck will check for such cases and give such a warning. >>> >>> Overall, you're OK if you are using recent kernels. >>> >>> Marc Haber wrote on 2016/03/29 08:43 +0200: On Mon, Mar 28, 2016 at 03:35:32PM -0400, Austin S. Hemmelgarn wrote: > > > Did you convert this filesystem from ext4 (or ext3)? No. > You hadn't mentioned what version of btrfs-progs you're using, and that > is somewhat important for recovery. I'm not sure if current versions of > btrfs check can fix this issue, but I know for a fact that older versions > (prior to at least 4.1) can not fix it. 4.1 for creation and btrfs check. >>> >>> I assume that you have run an older kernel on it, like v4.1 or v4.2. >>> Those old kernels lack the check to avoid such extent allocation. >>>
> As far as what the kernel is involved with, the easy way to check is if > it's operating on a mounted filesystem or not. If it only operates on > mounted filesystems, it almost certainly goes through the kernel; if it only > operates on unmounted filesystems, it's almost certainly done in userspace > (except dev scan and technically fi show). Then btrfs check is a userspace-only matter, as it wants the fs unmounted, and it is irrelevant that I did btrfs check from a rescue system with an older kernel, 3.16 if I recall correctly. >>> >>> Not recommended to use an older kernel to RW mount, or an older fsck to do >>> repair, as it's possible that an older kernel/btrfsck may allocate extents that cross >>> the 64K boundary. >>> > 2. Regarding general support: If you're using an enterprise distribution > (RHEL, SLES, CentOS, OEL, or something similar), you are almost certainly > going to get better support from your vendor than from the mailing list or > IRC. My "productive" desktops (fan is one of them) run Debian unstable with a current vanilla kernel. At the moment, I can't use 4.5 because it acts up with KVM. When I need a rescue system, I use grml, which unfortunately hasn't released since November 2014 and is still with kernel 3.16. >>> >>> To fix your problem (make these error messages just disappear, even though they >>> are harmless on recent kernels), the easiest way is to balance your >>> metadata. >> >> I did a balance with filter -musage=100 (kernel/tools 4.5/4.5) of the >> filesystem mentioned in here: >> http://www.spinics.net/lists/linux-btrfs/msg51405.html >> >> but still bad metadata [ ), crossing stripe boundary messages, >> double the amount compared to 2 months ago >> >> Kernel operating this fs has always been at most 1 month behind >> 'Latest Stable Kernel' (kernel.org terminology) > > Would you please try the following patch? > https://patchwork.kernel.org/patch/8706891/ > > It is based on v4.5 and I think it should fix the false alert.
I have applied the patch to 4.5 and ran btrfs check again, and now there are no more "bad metadata [ ), crossing stripe boundary" messages (and also no other errors). Thanks.

> Thanks,
> Qu

>>> As I explained, the bug only lies in metadata, and balance will allocate new
>>> tree blocks, then copy the old data into new locations.
>>>
>>> In the allocation process, recent kernels will avoid such boundary crossing,
>>> which fixes your problem.
>>>
>>> But if you are using old kernels, don't scrub your metadata.
>>>
>>> Thanks,
>>> Qu

Greetings
Marc
Re: "bad metadata" not fixed by btrfs repair
On Thu, Mar 31, 2016 at 2:28 AM, Qu Wenruo wrote:
[snip]
>> I did a balance with filter -musage=100 (kernel/tools 4.5/4.5) of the
>> filesystem mentioned in here:
>> http://www.spinics.net/lists/linux-btrfs/msg51405.html
>>
>> but still bad metadata [ ), crossing stripe boundary messages,
>> double amount compared to 2 months ago
>
> Would you please give an example of the output?
> So I can check if it's really crossing the boundary.

This is the 1st one of the 105 messages:

bad metadata [8263437058048, 8263437062144) crossing stripe boundary

For all "... [X,Y) ..." messages, X is 64K aligned and Y - X = 4K.

So in my case, all false alerts.
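The alignment arithmetic described here can be sketched directly. The following is illustrative Python, not btrfs-progs code: a tree block [X, Y) only "crosses a stripe boundary" if it does not fit within a single 64K-aligned stripe, so a 4K block whose start is 64K aligned cannot cross.

```python
# Sketch (not btrfs-progs code) of the conceptual check behind the warning.
STRIPE_LEN = 64 * 1024

def crosses_stripe_boundary(start, end):
    """True if the extent [start, end) spans two 64K stripes."""
    return start // STRIPE_LEN != (end - 1) // STRIPE_LEN

# The first of the 105 reported extents:
start, end = 8263437058048, 8263437062144
assert start % STRIPE_LEN == 0              # X is 64K aligned
assert end - start == 4096                  # Y - X = 4K
print(crosses_stripe_boundary(start, end))  # -> False: a false alert
```

A genuinely crossing extent, e.g. [61440, 69632), would return True here; the reported ones do not.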
Re: fallocate mode flag for "unshare blocks"?
On Thu, Mar 31, 2016 at 02:08:21PM -0400, J. Bruce Fields wrote:
> [snip]
> I suppose it's comparable to the case where a slow fallocate is
> interrupted--would it be legal to return EINTR in that case and leave
> the application to sort out whether some part of the allocation had
> already happened?

The unshare component to XFS fallocate does this if something sends a fatal signal to the process. There's a difference between shooting down a process in the middle of fallocate and fallocate returning EINTR out of the blue, though... ...the manpage for fallocate says that "EINTR == a signal was caught".

> Would it be legal to continue the fallocate under the covers even
> after returning EINTR?

It doesn't do that, however.

--D

> But anyway my first inclination is to say that the NFS FALLOCATE
> protocol just wasn't designed to handle long-running fallocates, and if
> we really need that then we need to give it a way to either report
> partial results or to report results asynchronously.
>
> --b.
Re: fallocate mode flag for "unshare blocks"?
On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> > On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> > > On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > > > Or is it ok that fallocate could block, potentially for a long time as
> > > > we stream cows through the page cache (or however unshare works
> > > > internally)? Those same programs might not be expecting fallocate to
> > > > take a long time.
> > >
> > > Yes, it's perfectly fine for fallocate to block for long periods of
> > > time. See what gfs2 does during preallocation of blocks - it ends up
> > > calling sb_issue_zerout() because it doesn't have unwritten
> > > extents, and hence can block for long periods of time.
> >
> > gfs2 fallocate is an implementation that will cause all but the most
> > trivial users real pain. Even the initial XFS implementation just
> > marking the transactions synchronous made it unusable for all kinds
> > of applications, and this is much worse. E.g. an NFS ALLOCATE operation
> > to gfs2 will probably hang your connection for extended periods of
> > time.
> >
> > If we need to support something like what gfs2 does we should have a
> > separate flag for it.
>
> Using fallocate() for preallocation was always intended to be a faster,
> more efficient method of allocating zeroed space than having userspace
> write blocks of data. Faster, more efficient does not mean
> instantaneous, and gfs2 using sb_issue_zerout() means that if the
> hardware has zeroing offloads (deterministic trim, write same, etc) it
> will use them, and that will be much faster than writing zeros from
> userspace.
>
> IMO, what gfs2 does is definitely within the intended usage of
> fallocate() for accelerating the preallocation of blocks.
>
> Yes, it may not be optimal for things like NFS servers which haven't
> considered that a fallocate based offload operation might take some
> time to execute, but that's not a problem with fallocate. i.e.
> that's a problem with the nfs server ALLOCATE implementation not
> being prepared to return NFSERR_JUKEBOX to prevent client side hangs
> and timeouts while the operation is run.

That's an interesting idea, but I don't think it's really legal. I take JUKEBOX to mean "sorry, I'm failing this operation for now, try again later and it might succeed", not "OK, I'm working on it, try again and you may find out I've done it".

So if the client gets a JUKEBOX error but the server goes ahead and does the operation anyway, that'd be unexpected.

I suppose it's comparable to the case where a slow fallocate is interrupted--would it be legal to return EINTR in that case and leave the application to sort out whether some part of the allocation had already happened? Would it be legal to continue the fallocate under the covers even after returning EINTR?

But anyway my first inclination is to say that the NFS FALLOCATE protocol just wasn't designed to handle long-running fallocates, and if we really need that then we need to give it a way to either report partial results or to report results asynchronously.

--b.
Re: fallocate mode flag for "unshare blocks"?
On Thu, Mar 31, 2016 at 5:31 PM, Andreas Dilger wrote:
> On Mar 31, 2016, at 1:55 AM, Christoph Hellwig wrote:
>>
>> On Wed, Mar 30, 2016 at 05:32:42PM -0700, Liu Bo wrote:
>>> Well, btrfs fallocate doesn't allocate space if it's a shared one
>>> because it thinks the space is already allocated. So a later overwrite
>>> over this shared extent may hit enospc errors.
>>
>> And this makes it an incorrect implementation of posix_fallocate,
>> which glibc implements using fallocate if available.
>
> It isn't really useful for a COW filesystem to implement fallocate()
> to reserve blocks. Even if it did allocate all of the blocks on the
> initial fallocate() call, when it comes time to overwrite these blocks
> new blocks need to be allocated as the old ones will not be overwritten.

There are also use-cases on BTRFS with CoW disabled, like operations on virtual machine images that aren't snapshotted. Those files tend to be big, and having fallocate() implemented and working like on e.g. XFS, in order to achieve space and speed efficiency, makes sense IMHO.
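For readers who want to see the contract under discussion concretely, here is a minimal demonstration using Python's binding of posix_fallocate. Note this only shows the intended semantics of space reservation; whether Btrfs honours it for shared extents is exactly what this thread disputes.

```python
import os
import tempfile

# posix_fallocate reserves space up front so that later writes into the
# range should not hit ENOSPC; glibc falls back to writing zeros on
# filesystems without native fallocate support.
with tempfile.NamedTemporaryFile() as f:
    os.posix_fallocate(f.fileno(), 0, 16 * 1024 * 1024)  # reserve 16 MiB
    st = os.stat(f.name)
    print(st.st_size)  # the file is extended to the allocated range
```

On a CoW filesystem, the contested point is that the reserved range may still be shared, so a later overwrite can require a fresh allocation despite this call succeeding.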
Re: [PATCH] btrfs: Reset IO error counters before start of device replacing
On Tue, Mar 29, 2016 at 02:17:48PM -0700, Yauhen Kharuzhy wrote:
> If a device replace entry was found on disk at mounting and its num_write_errors
> stats counter has a non-zero value, then the replace operation will never be
> finished and an -EIO error will be reported by btrfs_scrub_dev(), because
> this counter is never reset.
>
> # mount -o degraded /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
> # btrfs replace status /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
> Started on 25.Mar 07:28:00, canceled on 25.Mar 07:28:01 at 0.0%, 40 write
> errs, 0 uncorr. read errs
> # btrfs replace start -B 4 /dev/sdg
> /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
> ERROR: ioctl(DEV_REPLACE_START) failed on
> "/media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/": Input/output error, no error
>
> Reset the num_write_errors and num_uncorrectable_read_errors counters in the
> dev_replace structure before the start of replacing.
>
> Signed-off-by: Yauhen Kharuzhy

Reviewed-by: David Sterba
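The logic of the fix is simply to zero the stale per-replace error counters before a new operation starts. A schematic sketch follows; the field names mirror the patch description, but this is a toy model, not the kernel code.

```python
# Toy model of the fix: stale counters from an earlier, canceled replace
# must not leak into a new replace, or the final status check reports -EIO.
class DevReplace:
    def __init__(self, num_write_errors=0, num_uncorrectable_read_errors=0):
        self.num_write_errors = num_write_errors
        self.num_uncorrectable_read_errors = num_uncorrectable_read_errors

def replace_start(dev_replace):
    # The fix: reset both counters before kicking off the scrub-based copy.
    dev_replace.num_write_errors = 0
    dev_replace.num_uncorrectable_read_errors = 0

dr = DevReplace(num_write_errors=40)  # the "40 write errs" left on disk
replace_start(dr)
print(dr.num_write_errors)  # -> 0
```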
Re: [RESEND][PATCH] btrfs: Add qgroup tracing
On Tue, Mar 29, 2016 at 05:19:55PM -0700, Mark Fasheh wrote:
> This patch adds tracepoints to the qgroup code on both the reporting side
> (insert_dirty_extents) and the accounting side. Taken together it allows us
> to see what qgroup operations have happened, and what their result was.
>
> Signed-off-by: Mark Fasheh

Reviewed-by: David Sterba
Re: [PATCH] Btrfs: don't use src fd for printk
On Fri, Mar 25, 2016 at 10:02:41AM -0400, Josef Bacik wrote:
> The fd we pass in may not be on a btrfs file system, so don't try to do
> BTRFS_I() on it. Thanks,
>
> Signed-off-by: Josef Bacik

Reviewed-by: David Sterba
Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning
On Thu, Mar 31, 2016 at 10:19:34AM +0800, Qu Wenruo wrote:
> At least 2 users from the mailing list reported that btrfsck raised false alerts of
> "bad metadata [,) crossing stripe boundary",
> while the reported numbers are all inside the same 64K boundary.
>
> After some checking, all the false alerts share the same bytenr feature:
> the bytenr can be divided by the stripe size (64K).
>
> The cause seems to be that the initial 'max_size' can be 0, causing 'start' +
> 'max_size' - 1 to cross the stripe boundary.
>
> Fix it by always updating extent_record->cross_stripe when the
> extent_record is updated, to avoid a temporary false alert being reported.
>
> Signed-off-by: Qu Wenruo

Applied, thanks. Do you have a test image for that?
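The arithmetic behind the false alert can be illustrated with a small sketch (Python for clarity; not the actual btrfs-progs code): with a stale max_size of 0, the inclusive end computed as start + max_size - 1 lands one byte *before* start, i.e. in the previous 64K stripe, so a perfectly aligned extent looks like it crosses.

```python
# Illustration (not btrfs-progs code) of why max_size == 0 produced false
# "crossing stripe boundary" alerts for 64K-aligned bytenrs.
STRIPE_LEN = 64 * 1024

def check_crossing(start, max_size):
    end_inclusive = start + max_size - 1  # with max_size == 0: start - 1
    return start // STRIPE_LEN != end_inclusive // STRIPE_LEN

start = 8263437058048                # 64K aligned, as in the reports
print(check_crossing(start, 0))      # -> True: the stale-size false alert
print(check_crossing(start, 4096))   # -> False: the real 4K extent is fine
```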
Re: [PATCH v9 00/19] Btrfs dedupe framework
On Wed, Mar 30, 2016 at 03:55:55PM +0800, Qu Wenruo wrote:
> This March 30th patchset update mostly addresses the patchset structure
> comment from David:
> 1) Change the patchset sequence
>    If only the first 14 patches are applied, they provide the fully
>    backward-compatible, in-memory-only dedupe backend.
>
>    Only starting from patch 15 is the on-disk format changed.
>
>    So patches 1~14 are going to be pushed for the next merge window, while I'll
>    still submit them all for review purposes.

I'll buy 1-10 with the ioctl hidden under the BTRFS_DEBUG config option until the interface is settled.
Re: fallocate mode flag for "unshare blocks"?
On 2016-03-31 11:31, Andreas Dilger wrote:
> On Mar 31, 2016, at 1:55 AM, Christoph Hellwig wrote:
>> On Wed, Mar 30, 2016 at 05:32:42PM -0700, Liu Bo wrote:
>>> Well, btrfs fallocate doesn't allocate space if it's a shared one
>>> because it thinks the space is already allocated. So a later overwrite
>>> over this shared extent may hit enospc errors.
>>
>> And this makes it an incorrect implementation of posix_fallocate,
>> which glibc implements using fallocate if available.
>
> It isn't really useful for a COW filesystem to implement fallocate()
> to reserve blocks. Even if it did allocate all of the blocks on the
> initial fallocate() call, when it comes time to overwrite these blocks
> new blocks need to be allocated as the old ones will not be overwritten.
> Because of snapshots that could hold references to the old blocks, there
> isn't even the guarantee that the previous fallocated blocks will be
> released in a reasonable time to free up an equal amount of space.

That really depends on how it's done. AFAIK, unwritten extents on BTRFS are block reservations which make sure that you can write there (IOW, the unwritten extent gets converted to a regular extent in-place, not via COW). This means that it is possible to guarantee that the first write to that area will work, which is technically all that POSIX requires. This in turn means that stuff like SystemD and RDBMS software don't exactly see things working as they expect them to, but that's because they make assumptions based on existing technology.
Re: fallocate mode flag for "unshare blocks"?
On Mar 31, 2016, at 1:55 AM, Christoph Hellwig wrote:
> On Wed, Mar 30, 2016 at 05:32:42PM -0700, Liu Bo wrote:
>> Well, btrfs fallocate doesn't allocate space if it's a shared one
>> because it thinks the space is already allocated. So a later overwrite
>> over this shared extent may hit enospc errors.
>
> And this makes it an incorrect implementation of posix_fallocate,
> which glibc implements using fallocate if available.

It isn't really useful for a COW filesystem to implement fallocate() to reserve blocks. Even if it did allocate all of the blocks on the initial fallocate() call, when it comes time to overwrite these blocks new blocks need to be allocated as the old ones will not be overwritten. Because of snapshots that could hold references to the old blocks, there isn't even the guarantee that the previous fallocated blocks will be released in a reasonable time to free up an equal amount of space.

Cheers, Andreas
Btrfs progs release 4.5.1
Hi,

btrfs-progs 4.5.1 has been released. A bugfix release: several build fixes, a few minor fixes, improvements, or preparatory work that do not need to be delayed.

There's one user-visible change:

* mkfs: allow DUP on multi-device filesystem

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

Anand Jain (6):
      btrfs-progs: rearrange subvolume functions together
      btrfs-progs: move test_issubvolume() to utils.c
      btrfs-progs: remove duplicate function __is_subvol()
      btrfs-progs: move get_subvol_name() to utils.c
      btrfs-progs: create get_subvol_info()
      btrfs-progs: rename get_subvol_name() to subvol_strip_mountpoint()

Austin S. Hemmelgarn (1):
      btrfs-progs: fix fi du so it works in more cases

David Sterba (14):
      btrfs-progs: fragments: fix build
      btrfs-progs: utils: make more arguments const
      btrfs-progs: cleanup block group helpers types
      btrfs-progs: fix build of standalone utilities after clean
      btrfs-progs: tests: introduce mustfail helper
      btrfs-progs: tests: add misc 014-filesystem-label
      btrfs-progs: make error message from add_clone_source more generic
      btrfs-progs: mkfs: allow DUP on multidev fs, only warn
      btrfs-progs: rename __strncpy__null to __strncpy_null
      btrfs-progs: use safe copy for label buffer everywhere
      btrfs-progs: fix fd leak in get_subvol_info
      btrfs-progs: tests: update 001-basic-profiles, dup on multidev fs
      btrfs-progs: docs: update mkfs page for dup on multidev fs
      Btrfs progs v4.5.1

Julio Montes (1):
      btrfs-progs: fix unknown type name 'u64' in gccgo

Noah Massey (1):
      btrfs-progs: build: fix static standalone utilities

Petros Angelatos (1):
      btrfs-progs: utils: make sure set_label_mounted uses correct length buffers

Satoru Takeuchi (1):
      btrfs-progs: mkfs: fix an error when using DUP on multidev fs

Tsutomu Itoh (1):
      btrfs-progs: send: fix handling of multiple snapshots
Re: [PATCH] btrfs-progs: fix unknown type name 'u64' in gccgo
On Tue, Mar 29, 2016 at 03:34:48PM -0600, Julio Montes wrote:
> From: Julio Montes
>
> Signed-off-by: Julio Montes

Applied, thanks.
Re: [PATCH V15 00/15] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size
On Thu, Mar 31, 2016 at 11:31:06AM +0200, David Sterba wrote:
> On Tue, Mar 22, 2016 at 06:50:32PM +0530, Chandan Rajendra wrote:
> > On Tuesday 22 Mar 2016 12:04:23 David Sterba wrote:
> > > On Thu, Feb 11, 2016 at 11:17:38PM +0530, Chandan Rajendra wrote:
> > > > this patchset temporarily disables the commit
> > > > f82c458a2c3ffb94b431fc6ad791a79df1b3713e.
> > > >
> > > > The commits for the Btrfs kernel module can be found at
> > > > https://github.com/chandanr/linux/tree/btrfs/subpagesize-blocksize.
> > >
> > > The branch does not apply cleanly to at least 4.5, I've tried to rebase
> > > it but there are conflicts that are not simple. Please update it on top
> > > of current master, ie. with the preparatory patchset merged.
> >
> > I will rebase the branch and post the patchset soon.
>
> JFYI, I've seen some minor compilation failures:
> - __do_readpage: unused variable cached
> - end_bio_extent_buffer_readpage: btree_readahead_hook must take fs_info
> - fails build with (config) sanity checks enabled, the fs_info moved
>   from eb to eb head

And the tests crash with my quick fixes, so I'll move the branch out of next for now. Please fix it and let me know. My fixes are on top of your latest branch, chandan-subpage-latest, in my development gits.
Re: How to cancel btrfs balance on unmounted filesystem
On Thu, Mar 31, 2016 at 01:01:37PM +0500, Roman Mamedov wrote: > On Thu, 31 Mar 2016 08:21:12 +0200 > Marc Haberwrote: > > the balance restarts immediately after mounting > > You can use the skip_balance mount option to prevent that. Thanks. I now have this in all fstabs. On the system in questionl, I was able to sneak in a btrfs balance cancel before the system hanged itself. Mar 31 08:17:42 fan kernel: [ 240.595465] INFO: task kworker/u16:0:6 blocked for more than 120 seconds. Mar 31 08:17:42 fan kernel: [ 240.595604] Tainted: GW 4.4.6-zgws1 #2 Mar 31 08:17:42 fan kernel: [ 240.595705] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Mar 31 08:17:42 fan kernel: [ 240.595845] kworker/u16:0 D 88062fc956c0 0 6 2 0x Mar 31 08:17:42 fan kernel: [ 240.595913] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] Mar 31 08:17:42 fan kernel: [ 240.595919] 88017ca680c0 0002 88017ca78000 88017ca77ca0 Mar 31 08:17:42 fan kernel: [ 240.595927] 8800c9388960 0002 81409e1c 88017ca680c0 Mar 31 08:17:42 fan kernel: [ 240.595934] 81408329 7fff 81409e5a 00c0a044e7d3 Mar 31 08:17:42 fan kernel: [ 240.595941] Call Trace: Mar 31 08:17:42 fan kernel: [ 240.595955] [] ? usleep_range+0x35/0x35 Mar 31 08:17:42 fan kernel: [ 240.595964] [] ? schedule+0x6f/0x7c Mar 31 08:17:42 fan kernel: [ 240.595973] [] ? schedule_timeout+0x3e/0x128 Mar 31 08:17:42 fan kernel: [ 240.595981] [] ? cache_alloc+0x1bd/0x277 Mar 31 08:17:42 fan kernel: [ 240.595990] [] ? __wait_for_common+0x121/0x16d Mar 31 08:17:42 fan kernel: [ 240.595997] [] ? __wait_for_common+0x121/0x16d Mar 31 08:17:42 fan kernel: [ 240.596006] [] ? wake_up_q+0x3b/0x3b Mar 31 08:17:42 fan kernel: [ 240.596047] [] ? btrfs_async_run_delayed_refs+0xbf/0xd5 [btrfs] Mar 31 08:17:42 fan kernel: [ 240.596093] [] ? __btrfs_end_transaction+0x291/0x2d5 [btrfs] Mar 31 08:17:42 fan kernel: [ 240.596140] [] ? btrfs_finish_ordered_io+0x418/0x4d7 [btrfs] Mar 31 08:17:42 fan kernel: [ 240.596187] [] ? 
btrfs_scrubparity_helper+0xf4/0x233 [btrfs] Mar 31 08:17:42 fan kernel: [ 240.596198] [] ? process_one_work+0x178/0x27b Mar 31 08:17:42 fan kernel: [ 240.596206] [] ? worker_thread+0x1da/0x280 Mar 31 08:17:42 fan kernel: [ 240.596213] [] ? rescuer_thread+0x284/0x284 Mar 31 08:17:42 fan kernel: [ 240.596220] [] ? kthread+0x95/0x9d Mar 31 08:17:42 fan kernel: [ 240.596227] [] ? kthread_parkme+0x16/0x16 Mar 31 08:17:42 fan kernel: [ 240.596234] [] ? ret_from_fork+0x3f/0x70 Mar 31 08:17:42 fan kernel: [ 240.596240] [] ? kthread_parkme+0x16/0x16 Mar 31 08:17:42 fan kernel: [ 240.596272] INFO: task kworker/u16:2:134 blocked for more than 120 seconds. Mar 31 08:17:42 fan kernel: [ 240.596399] Tainted: GW 4.4.6-zgws1 #2 Mar 31 08:17:42 fan kernel: [ 240.596499] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Mar 31 08:17:42 fan kernel: [ 240.596637] kworker/u16:2 D 88062fcd56c0 0 134 2 0x Mar 31 08:17:42 fan kernel: [ 240.596688] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] Mar 31 08:17:42 fan kernel: [ 240.596692] 8806130e4780 0003 880613108000 880613107ca0 Mar 31 08:17:42 fan kernel: [ 240.596699] 8805caa1d960 0002 81409e1c 8806130e4780 Mar 31 08:17:42 fan kernel: [ 240.596706] 81408329 7fff 81409e5a 88062fd556c0 Mar 31 08:17:42 fan kernel: [ 240.596712] Call Trace: Mar 31 08:17:42 fan kernel: [ 240.596721] [] ? usleep_range+0x35/0x35 Mar 31 08:17:42 fan kernel: [ 240.596728] [] ? schedule+0x6f/0x7c Mar 31 08:17:42 fan kernel: [ 240.596735] [] ? schedule_timeout+0x3e/0x128 Mar 31 08:17:42 fan kernel: [ 240.596742] [] ? check_preempt_curr+0x41/0x63 Mar 31 08:17:42 fan kernel: [ 240.596750] [] ? ttwu_do_wakeup+0xf/0xd0 Mar 31 08:17:42 fan kernel: [ 240.596757] [] ? __wait_for_common+0x121/0x16d Mar 31 08:17:42 fan kernel: [ 240.596764] [] ? __wait_for_common+0x121/0x16d Mar 31 08:17:42 fan kernel: [ 240.596771] [] ? wake_up_q+0x3b/0x3b Mar 31 08:17:42 fan kernel: [ 240.596812] [] ? 
btrfs_async_run_delayed_refs+0xbf/0xd5 [btrfs] Mar 31 08:17:42 fan kernel: [ 240.596858] [] ? __btrfs_end_transaction+0x291/0x2d5 [btrfs] Mar 31 08:17:42 fan kernel: [ 240.596904] [] ? btrfs_finish_ordered_io+0x418/0x4d7 [btrfs] Mar 31 08:17:42 fan kernel: [ 240.596952] [] ? btrfs_scrubparity_helper+0xf4/0x233 [btrfs] Mar 31 08:17:42 fan kernel: [ 240.596960] [] ? process_one_work+0x178/0x27b Mar 31 08:17:42 fan kernel: [ 240.596968] [] ? worker_thread+0x1da/0x280 Mar 31 08:17:42 fan kernel: [ 240.596976] [] ? rescuer_thread+0x284/0x284 Mar 31 08:17:42 fan kernel: [ 240.596982] [] ? kthread+0x95/0x9d Mar 31 08:17:42 fan
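For reference, an fstab entry using the skip_balance option mentioned above might look like the following (the UUID is a placeholder, not taken from this thread):

```
# /etc/fstab -- skip_balance prevents an interrupted balance from resuming
# automatically at mount time; cancel or resume it manually later.
UUID=0123abcd-...  /mnt/data  btrfs  defaults,skip_balance  0  0
```

The paused balance can then be resumed with `btrfs balance resume` or canceled with `btrfs balance cancel` once the filesystem is mounted.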
Re: fallocate mode flag for "unshare blocks"?
On 2016-03-31 07:18, Austin S. Hemmelgarn wrote:
> On 2016-03-30 20:32, Liu Bo wrote:
> [snip]
>> Well, btrfs fallocate doesn't allocate space if it's a shared one
>> because it thinks the space is already allocated. So a later overwrite
>> over this shared extent may hit enospc errors.
>
> And this _really_ should get fixed, otherwise glibc will add a check for
> running posix_fallocate against BTRFS and force emulation, and people
> _will_ complain about performance.

Thinking a bit further about this, how hard would it be to add the ability to have unwritten extents point somewhere else for reads? Then when we get an fallocate call, we create the unwritten extents, and add the metadata to make them read from the shared region. Then, when a write gets issued to that extent, the parts that aren't being written in that block get copied, the write happens, and then the link for that block gets removed.
This way, fallocate would still provide the correct semantics, it would be relatively fast (still not quite as fast as it is now, but nowhere near as slow as copying the data), and the cost of copying gets amortized across writes (we may not need to copy everything, but we'll still copy less than we would for just un-sharing the extent). This would of course need to be an incompat feature, but I would personally say that's not as much of an issue, as things are subtly broken in the common use-case right now (at this point I'm just thinking BTRFS, as what Darrick suggested for XFS seems to be a better solution there, at least short term).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: fallocate mode flag for "unshare blocks"?
On 2016-03-30 20:32, Liu Bo wrote:
> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
>> Hi all,
>>
>> Christoph and I have been working on adding reflink and CoW support to XFS recently. Since the purpose of (mode 0) fallocate is to make sure that future file writes cannot ENOSPC, I extended the XFS fallocate handler to unshare any shared blocks via the copy on write mechanism I built for it. However, Christoph shared the following concerns with me about that interpretation:
>>
>>> I know that I suggested unsharing blocks on fallocate, but it turns out this is causing problems. Applications expect falloc to be a fast metadata operation, and copying a potentially large number of blocks is against that expectation. This is especially bad for the NFS server, which should not be blocked for a long time in a synchronous operation.
>>>
>>> I think we'll have to remove the unshare and just fail the fallocate for a reflinked region for now. I still think it makes sense to expose an unshare operation, and we probably should make that another fallocate mode.
> I'm expecting fallocate to be fast, too.
>
> Well, btrfs fallocate doesn't allocate space if it's a shared one because it thinks the space is already allocated. So a later overwrite over this shared extent may hit enospc errors.

And this _really_ should get fixed, otherwise glibc will add a check for running posix_fallocate against BTRFS and force emulation, and people _will_ complain about performance.
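For reference, the preallocation semantics under discussion can be exercised from userspace with util-linux's fallocate(1), which calls the fallocate(2) syscall directly. A minimal sketch (the file path is arbitrary; on btrfs the shared-extent case described above is exactly where the ENOSPC guarantee can still be violated):

```shell
# Preallocate 1 MiB for a file via fallocate(2). After this succeeds,
# writes into the first 1 MiB are expected not to fail with ENOSPC.
f=/tmp/prealloc_demo
fallocate -l 1M "$f"

# The file size is extended and blocks are reserved without writing data:
stat -c 'size=%s blocks=%b' "$f"
```

Note that this is the raw syscall path; posix_fallocate(3) adds a glibc fallback that writes data when the filesystem lacks fallocate support, which is the slow emulation being threatened above.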
Re: fallocate mode flag for "unshare blocks"?
On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> > On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > > Or is it ok that fallocate could block, potentially for a long time as
> > > we stream cows through the page cache (or however unshare works
> > > internally)? Those same programs might not be expecting fallocate to
> > > take a long time.
> >
> > Yes, it's perfectly fine for fallocate to block for long periods of
> > time. See what gfs2 does during preallocation of blocks - it ends up
> > calling sb_issue_zerout() because it doesn't have unwritten
> > extents, and hence can block for long periods of time.
>
> gfs2 fallocate is an implementation that will cause all but the most
> trivial users real pain. Even the initial XFS implementation just
> marking the transactions synchronous made it unusable for all kinds
> of applications, and this is much worse. E.g. an NFS ALLOCATE operation
> to gfs2 will probably hang your connection for extended periods of
> time.
>
> If we need to support something like what gfs2 does we should have a
> separate flag for it.

Using fallocate() for preallocation was always intended to be a faster, more efficient method of allocating zeroed space than having userspace write blocks of data. Faster and more efficient does not mean instantaneous, and gfs2 using sb_issue_zerout() means that if the hardware has zeroing offloads (deterministic trim, write same, etc.) it will use them, and that will be much faster than writing zeros from userspace.

IMO, what gfs2 does is definitely within the intended usage of fallocate() for accelerating the preallocation of blocks. Yes, it may not be optimal for things like NFS servers which haven't considered that a fallocate based offload operation might take some time to execute, but that's not a problem with fallocate. i.e.
that's a problem with the NFS server ALLOCATE implementation not being prepared to return NFSERR_JUKEBOX to prevent client side hangs and timeouts while the operation is run.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: fallocate mode flag for "unshare blocks"?
On 2016-03-31 03:58, Christoph Hellwig wrote:
> On Wed, Mar 30, 2016 at 02:58:38PM -0400, Austin S. Hemmelgarn wrote:
>> Nothing that I can find in the man-pages or API documentation for Linux's fallocate explicitly says that it will be fast. There are bits that say it should be efficient, but that is not itself well defined (given context, I would assume it to mean that it doesn't use as much I/O as writing out that many bytes of zero data, not necessarily that it will return quickly).
> And that's pretty much as narrow a definition as we get. But apparently gfs2 already breaks that expectation :(

GFS2 breaks other expectations as well (mostly stuff with locking) in arguably more significant ways, so I would not personally consider it to be precedent for breaking this on other filesystems.

>>> delalloc system is careful enough to check that there are enough free blocks to handle both the allocation and the metadata updates. The only gap in this scheme that I can see is if we fallocate, crash, and upon restart the program then tries to write without retrying the fallocate. Can we trade some performance for the added requirement that we must fallocate -> write -> fsync, and retry the trio if we crash before the fsync returns? I think that's already an implicit requirement, so we might be ok here.
>> Most of the software I've seen that doesn't use fallocate like this is either doing odd things otherwise, or is just making sure it has space for temporary files, so I think it is probably safe to require this.
> posix_fallocate guarantees you that you don't get ENOSPC from the write, and there is plenty of software relying on that or crashing / causing data integrity problems that way.

posix_fallocate is not the same thing as the fallocate syscall. It's there for compatibility, it has less functionality, and most importantly, it _can_ be slow (because at least glibc will emulate it if the underlying FS doesn't support fallocate, which means it's no faster than just using dd).
[GIT PULL] Btrfs file/data loss bug fix
From: Filipe Manana

Hi Chris,

Please consider the following fix for the linux kernel 4.6 release. It is not a regression in the code for 4.6 nor introduced in any recent release; it's a problem that's been around for a long time (years). But since it's a quite serious one, it's important in my opinion to get the fix into 4.6 (and to the stable releases) instead of waiting for the 4.7 merge window. Three test cases were sent upstream for xfstests.

Thanks.

The following changes since commit 232cad8413a0bfbd25f11cc19fd13dfd85e1d8ad:

  Merge branch 'misc-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6 (2016-03-24 17:36:13 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git for-chris-4.6

for you to fetch changes up to 3943609915b333dd6c69ea2993e4c717da07ad46:

  Btrfs: fix file/data loss caused by fsync after rename and new inode (2016-03-30 23:39:06 +0100)

Filipe Manana (1):
      Btrfs: fix file/data loss caused by fsync after rename and new inode

 fs/btrfs/tree-log.c | 137 +
 1 file changed, 137 insertions(+)

--
2.7.0.rc3
[PATCH v2] Btrfs: fix file/data loss caused by fsync after rename and new inode
From: Filipe Manana

If we rename an inode A (be it a file or a directory), create a new inode B with the old name of inode A and under the same parent directory, fsync inode B and then power fail, at log tree replay time we end up removing inode A completely. If inode A is a directory then all its files are gone too.

Example scenarios where this happens. These are reproducible with the following steps, taken from a couple of test cases written for fstests which are going to be submitted upstream soon:

  # Scenario 1
  mkfs.btrfs -f /dev/sdc
  mount /dev/sdc /mnt
  mkdir -p /mnt/a/x
  echo "hello" > /mnt/a/x/foo
  echo "world" > /mnt/a/x/bar
  sync
  mv /mnt/a/x /mnt/a/y
  mkdir /mnt/a/x
  xfs_io -c fsync /mnt/a/x

The next time the fs is mounted, log tree replay happens and the directory "y" does not exist, nor do the files "foo" and "bar" exist anywhere (neither in "y" nor in "x", nor the root nor anywhere).

  # Scenario 2
  mkfs.btrfs -f /dev/sdc
  mount /dev/sdc /mnt
  mkdir /mnt/a
  echo "hello" > /mnt/a/foo
  sync
  mv /mnt/a/foo /mnt/a/bar
  echo "world" > /mnt/a/foo
  xfs_io -c fsync /mnt/a/foo

The next time the fs is mounted, log tree replay happens and the file "bar" does not exist anymore. A file with the name "foo" exists and it matches the second file we created.
Another related problem that does not involve file/data loss is when a new inode is created with the name of a deleted snapshot and we fsync it:

  mkfs.btrfs -f /dev/sdc
  mount /dev/sdc /mnt
  mkdir /mnt/testdir
  btrfs subvolume snapshot /mnt /mnt/testdir/snap
  btrfs subvolume delete /mnt/testdir/snap
  rmdir /mnt/testdir
  mkdir /mnt/testdir
  xfs_io -c fsync /mnt/testdir   # or fsync some file inside /mnt/testdir

The next time the fs is mounted the log replay procedure fails because it attempts to delete the snapshot entry (which has dir item key type of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry, resulting in the following error that causes mount to fail:

[52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
[52174.512570] [ cut here ]
[52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]()
[52174.514681] BTRFS: Transaction aborted (error -2)
[52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc
[52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: G W 4.5.0-rc6-btrfs-next-27+ #1
[52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[52174.524053] 8801df2a7710 81264e93 8801df2a7758
[52174.524053] 0009 8801df2a7748 81051618 a03591cd
[52174.524053] fffe 88015e6e5000 88016dbc3c88 88016dbc3c88
[52174.524053] Call Trace:
[52174.524053] [] dump_stack+0x67/0x90
[52174.524053] [] warn_slowpath_common+0x99/0xb2
[52174.524053] [] ? __btrfs_unlink_inode+0x178/0x351 [btrfs]
[52174.524053] [] warn_slowpath_fmt+0x48/0x50
[52174.524053] [] __btrfs_unlink_inode+0x178/0x351 [btrfs]
[52174.524053] [] ? iput+0xb0/0x284
[52174.524053] [] btrfs_unlink_inode+0x1c/0x3d [btrfs]
[52174.524053] [] check_item_in_log+0x1fe/0x29b [btrfs]
[52174.524053] [] replay_dir_deletes+0x167/0x1cf [btrfs]
[52174.524053] [] fixup_inode_link_count+0x289/0x2aa [btrfs]
[52174.524053] [] fixup_inode_link_counts+0xcb/0x105 [btrfs]
[52174.524053] [] btrfs_recover_log_trees+0x258/0x32c [btrfs]
[52174.524053] [] ? replay_one_extent+0x511/0x511 [btrfs]
[52174.524053] [] open_ctree+0x1dd4/0x21b9 [btrfs]
[52174.524053] [] btrfs_mount+0x97e/0xaed [btrfs]
[52174.524053] [] ? trace_hardirqs_on+0xd/0xf
[52174.524053] [] mount_fs+0x67/0x131
[52174.524053] [] vfs_kern_mount+0x6c/0xde
[52174.524053] [] btrfs_mount+0x1ac/0xaed [btrfs]
[52174.524053] [] ? trace_hardirqs_on+0xd/0xf
[52174.524053] [] ? lockdep_init_map+0xb9/0x1b3
[52174.524053] [] mount_fs+0x67/0x131
[52174.524053] [] vfs_kern_mount+0x6c/0xde
[52174.524053] [] do_mount+0x8a6/0x9e8
[52174.524053] [] ? strndup_user+0x3f/0x59
[52174.524053] [] SyS_mount+0x77/0x9f
[52174.524053] [] entry_SYSCALL_64_fastpath+0x12/0x6b
[52174.561288] ---[ end trace 6b53049efb1a3ea6 ]---

Fix this by forcing a transaction commit when such cases happen. This means we check, in the commit root of the subvolume tree, if there was any other inode with the same reference when the inode we are fsync'ing is a new inode (created in the current transaction).

Test cases covering all the scenarios given above were submitted upstream for fstests:

* fstests: generic test for fsync after
Re: [PATCH V15 00/15] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size
On Tue, Mar 22, 2016 at 06:50:32PM +0530, Chandan Rajendra wrote:
> On Tuesday 22 Mar 2016 12:04:23 David Sterba wrote:
> > On Thu, Feb 11, 2016 at 11:17:38PM +0530, Chandan Rajendra wrote:
> > > this patchset temporarily disables the commit
> > > f82c458a2c3ffb94b431fc6ad791a79df1b3713e.
> > >
> > > The commits for the Btrfs kernel module can be found at
> > > https://github.com/chandanr/linux/tree/btrfs/subpagesize-blocksize.
> >
> > The branch does not apply cleanly to at least 4.5, I've tried to rebase
> > it but there are conflicts that are not simple. Please update it on top
> > of current master, ie. with the preparatory patchset merged.
>
> I will rebase the branch and post the patchset soon.

JFYI, I've seen some minor compilation failures:
- __do_readpage: unused variable cached
- end_bio_extent_buffer_readpage: btree_readahead_hook must take fs_info
- fails build with (config) sanity checks enabled, the fs_info moved from eb to eb head
Re: How to cancel btrfs balance on unmounted filesystem
Hello.

There is no tool to cancel a balance on an unmounted filesystem, but you can use the skip_balance mount option for this.

Original Message
From: Marc Haber
Sent: March 31, 2016 9:21:12 AM GMT+03:00
To: linux-btrfs@vger.kernel.org
Subject: How to cancel btrfs balance on unmounted filesystem

Hi,

one of my problem btrfs instances went into a hung process state while balancing metadata. This process is recorded in the file system somehow, and the balance restarts immediately after mounting the filesystem, with no chance to issue a btrfs balance cancel command before the system hangs again. Is there any possibility to cancel the pending balance without mounting the fs first?

I have also filed https://bugzilla.kernel.org/show_bug.cgi?id=115581 to address this in a more elegant way.

Greetings
Marc
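For reference, the workaround described above looks like this in practice. This is a sketch for a real system (the device name /dev/sdX is a placeholder for the affected filesystem): skip_balance makes the mount leave the interrupted balance in a paused state instead of resuming it, and a paused balance can then be cancelled.

```shell
# Mount with skip_balance so the interrupted balance is not resumed;
# it is left paused instead of immediately restarting.
mount -o skip_balance /dev/sdX /mnt

# Cancel the paused balance for good:
btrfs balance cancel /mnt

# Optionally confirm that no balance is running or pending:
btrfs balance status /mnt
```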
Re: How to cancel btrfs balance on unmounted filesystem
On Thu, 31 Mar 2016 08:21:12 +0200 Marc Haber wrote:

> the balance restarts immediately after mounting

You can use the skip_balance mount option to prevent that.

--
With respect,
Roman
Re: fallocate mode flag for "unshare blocks"?
On Wed, Mar 30, 2016 at 02:58:38PM -0400, Austin S. Hemmelgarn wrote:
> Nothing that I can find in the man-pages or API documentation for Linux's
> fallocate explicitly says that it will be fast. There are bits that say it
> should be efficient, but that is not itself well defined (given context, I
> would assume it to mean that it doesn't use as much I/O as writing out that
> many bytes of zero data, not necessarily that it will return quickly).

And that's pretty much as narrow a definition as we get. But apparently gfs2 already breaks that expectation :(

> > delalloc system is careful enough to check that there are enough free
> > blocks to handle both the allocation and the metadata updates. The
> > only gap in this scheme that I can see is if we fallocate, crash, and
> > upon restart the program then tries to write without retrying the
> > fallocate. Can we trade some performance for the added requirement
> > that we must fallocate -> write -> fsync, and retry the trio if we
> > crash before the fsync returns? I think that's already an implicit
> > requirement, so we might be ok here.
> Most of the software I've seen that doesn't use fallocate like this is
> either doing odd things otherwise, or is just making sure it has space for
> temporary files, so I think it is probably safe to require this.

posix_fallocate guarantees you that you don't get ENOSPC from the write, and there is plenty of software relying on that or crashing / causing data integrity problems that way.
Re: fallocate mode flag for "unshare blocks"?
On Wed, Mar 30, 2016 at 05:32:42PM -0700, Liu Bo wrote:
> Well, btrfs fallocate doesn't allocate space if it's a shared one
> because it thinks the space is already allocated. So a later overwrite
> over this shared extent may hit enospc errors.

And this makes it an incorrect implementation of posix_fallocate, which glibc implements using fallocate if available.
Re: fallocate mode flag for "unshare blocks"?
On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > Or is it ok that fallocate could block, potentially for a long time as
> > we stream cows through the page cache (or however unshare works
> > internally)? Those same programs might not be expecting fallocate to
> > take a long time.
>
> Yes, it's perfectly fine for fallocate to block for long periods of
> time. See what gfs2 does during preallocation of blocks - it ends up
> calling sb_issue_zerout() because it doesn't have unwritten
> extents, and hence can block for long periods of time.

gfs2 fallocate is an implementation that will cause all but the most trivial users real pain. Even the initial XFS implementation just marking the transactions synchronous made it unusable for all kinds of applications, and this is much worse. E.g. an NFS ALLOCATE operation to gfs2 will probably hang your connection for extended periods of time.

If we need to support something like what gfs2 does we should have a separate flag for it.
How to cancel btrfs balance on unmounted filesystem
Hi,

one of my problem btrfs instances went into a hung process state while balancing metadata. This process is recorded in the file system somehow, and the balance restarts immediately after mounting the filesystem, with no chance to issue a btrfs balance cancel command before the system hangs again. Is there any possibility to cancel the pending balance without mounting the fs first?

I have also filed https://bugzilla.kernel.org/show_bug.cgi?id=115581 to address this in a more elegant way.

Greetings
Marc
--
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things." Winona Ryder    | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt  | Fax: *49 6224 1600421