Re: Exporting a unique ino/dev pair to user space
On Fri, Jun 8, 2018 at 12:06 AM, Mark Fasheh wrote: [...] >> > 2c) I don't think we can really use a dedicated callback without >> > passing the vfsmount through since Overlayfs ->getattr might call >> > the lower fs ->getattr. At that point we might as well use getattr. >> > >> >> Didn't get the Overlayfs lower fs getattr argument. >> Overlayfs doesn't use the vfsmount passed into getattr >> and it could very well pass a dentry to lower fs getattr. > > My main point in 2c) is that, from my understanding, Overlayfs may need to > call down to one of the filesystem ->getattr() calls. Since those take a > vfsmount we don't gain anything by having a unique callback from this - the > plumbing work would be the same. > I guess I don't understand what you mean by "dedicated callback", but I think we both in agreement that changing fs getattr() to take dentry is preferred. > >> As a matter of fact, out of 35 getattr implementations in the kernel: >> (git grep "\s\.getattr\s" fs| awk '{print $4}'| sort -u|grep -v >> "nfs.*_proc_getattr"|wc -l) >> there is only one using the vfsmount - nfs_getattr() for MNT_NOATIME >> check and most of them only ever use d_inode(path->dentry). >> >> This API seems quite odd. >> Maybe it should be fixed so more in kernel call sites could call getattr >> without a vfsmount. >> Not sure what would be the best way to handle nfs_getattr(). > > Yeah I saw that nfs_getattr() is the only user of the vfsmount. I totally > agree that this would be tons easier if we didn't have to pass the vfsmount > (and like you said, there's only the ONE user). > > This is a bit hacky, but I wonder if we could blow the function signature > back out to a dentry + vfsmount and make the vfsmount optional when getattr > is called only for ino/dev. It's a bit ugly to have optional arguments like > that but nfs would work with just a line or two change and the other fs > would never even care. OR.. change the signature of fs getattr() and vfs_getattr_nosec() to dentry and set a kernel query_flag AT_STATX_NO_REMOTE_ATIME from vfs_getattr(). No need to pass vfsmount to nfs. Then users that access i_ino/s_dev can call vfs_getattr_nosec() with a bonus of not having to go through security modules (which now they don't). The only current user of vfs_getattr_nosec() expfs.c queries STATX_INO, so change is safe. Overlayfs doesn't need to get vfsmount in ovl_getattr() - it knows the lower fs vfsmount for calling vfs_getattr() regardless and in other places overlayfs just needs to query STATX_INO, so it may also use the light vfs_getattr_nosec() callback. Thanks, Amir. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] fstests: btrfs: Test if btrfs will corrupt nodatasum compressed extent when replacing device
This is a long existing bug (from 2012) but exposed by a reporter recently, that when compressed extent without data csum get written to device-replace target device, the written data is in fact uncompressed data other than the original compressed data. And since btrfs still consider the data is compressed and will try to read it as compressed, it can cause read error. The root cause is located, and one RFC patch already sent to fix it, titled "[PATCH] btrfs: scrub: Don't use inode pages for device replace". (The RFC is only for the extra possible way to fix the bug, the fix itself should work without problem) Reported-by: James Harvey Signed-off-by: Qu Wenruo --- changelog: v2: Now the fix patch is no longer RFC. Remove _require_test as we don't really touch it. Add comment on the mount cycle. Add the test to group 'volume'. --- tests/btrfs/161 | 91 + tests/btrfs/161.out | 2 + tests/btrfs/group | 1 + 3 files changed, 94 insertions(+) create mode 100755 tests/btrfs/161 create mode 100644 tests/btrfs/161.out diff --git a/tests/btrfs/161 b/tests/btrfs/161 new file mode 100755 index ..ce1b0e04 --- /dev/null +++ b/tests/btrfs/161 @@ -0,0 +1,91 @@ +#! /bin/bash +# FS QA Test 161 +# +# Test if btrfs will corrupt compressed data extent without data csum +# by replacing it with uncompressed data, when doing replacing device. +# +# This could be fixed by the following RFC patch: +# "[PATCH] btrfs: scrub: Don't use inode pages for device replace" +# +#--- +# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# remove previous $seqres.full before test +rm -f $seqres.full + +# real QA test starts here + +# Modify as appropriate. +_supported_fs btrfs +_supported_os Linux +_require_scratch_dev_pool 2 +_require_scratch_dev_pool_equal_size + + +_scratch_dev_pool_get 1 +_spare_dev_get +_scratch_pool_mkfs >> $seqres.full 2>&1 + +# Create nodatasum inode +_scratch_mount "-o nodatasum" +touch $SCRATCH_MNT/nodatasum_file +_scratch_remount "datasum,compress" +_pwrite_byte 0xcd 0 128K $SCRATCH_MNT/nodatasum_file > /dev/null + +# Write the compressed data back to disk +sync + +# Replace the device +_run_btrfs_util_prog replace start -Bf 1 $SPARE_DEV $SCRATCH_MNT + +# Unmount to drop all cache so next read will read from disk +_scratch_unmount +_mount $SPARE_DEV $SCRATCH_MNT + +# Now the EXTENT_DATA item still marks the extent as compressed, +# but the on-disk data is uncompressed, thus reading it as compressed +# will definitely cause EIO. +cat $SCRATCH_MNT/nodatasum_file > /dev/null + +_scratch_unmount +_spare_dev_put +_scratch_dev_pool_put + +echo "Silence is golden" +# success, all done +status=0 +exit diff --git a/tests/btrfs/161.out b/tests/btrfs/161.out new file mode 100644 index ..1752a243 --- /dev/null +++ b/tests/btrfs/161.out @@ -0,0 +1,2 @@ +QA output created by 161 +Silence is golden diff --git a/tests/btrfs/group b/tests/btrfs/group index f04ee8d5..9195b368 100644 --- a/tests/btrfs/group +++ b/tests/btrfs/group @@ -163,3 +163,4 @@ 158 auto quick raid scrub 159 auto quick 160 auto quick +161 auto quick replace volume -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6] btrfs-progs: Fixes inline ram_bytes related bugs
On 2018年06月08日 12:37, Steve Leung wrote: > On 06/06/2018 01:27 AM, Qu Wenruo wrote: >> The patchset can be fetched from github (*): >> https://github.com/adam900710/btrfs-progs/tree/inline_ram_bytes >> >> It's based on David's devel branch, whose HEAD is: >> commit 0d1c5812e28e286648781c7b35b542311cc01aa4 (david/devel) >> Author: Matthias Benkard >> Date: Wed Apr 25 16:34:54 2018 +0200 >> >> btrfs-progs: mkfs: traverse_directory: Reset error code on continue >> >> Reported-by Steve Leung , his old btrfs (at least >> offending inodes are from 2014) has inline uncompressed extent, while >> its ram_bytes mismatch with item size. > > Took a while to run, but: > > Tested-by: Steve Leung > > Verifying everything against backups now, but things look good so far. > > I'm not 100% sure how to interpret the "btrfs check" output, but it > sounded like it was going over each subvolume. And since the corruption > was referenced by many subvolumes, the same work essentially had to be > duplicated many times. Yes, that's the case. That would be a little like the way we do in kernel balance, thus it's not implemented in btrfs-progs, so it would take a long time before we have similar facility to fix it properly. Or, I could fix it in another way, break the CoW behavior and do in-place fix. But if anyone wants to interrupt the fix, it may cause disaster. > > Is that a correct assessment? Would it have saved time to remove a > bunch of snapshots first before doing "btrfs check"? Yep, it would save a lot of time. Thanks, Qu > > Either way, many thanks for getting everything going again! > > Steve > > >> Latest kernel tree check catches this bug, while we failed to detect by >> dump-tree. >> >> It turns out that btrfs-progs is doing something evil to avoid reading >> ram_bytes from inline uncompressed extent. >> >> >> So this patchset will address all such ram_bytes related problems. >> >> The 1st patch is a not-so-relative fix for restore, which is using >> ram_bytes for decompress. Although thanks to the compression header, we >> won't read out-of-boundary, but fixing it is never a bad thing. >> >> The 2nd patch will get rid of the evil btrfs_file_extent_inline_len() >> which hides raw ram_bytes from us, and fooling us for a long long time. >> >> The 3rd~5th patches introduce check/repair function for both original >> and lowmem mode (although lowmem mode can detect it even before this >> patch). >> >> The last one is the test case for it as usual. >> >> *: Or should I just migrate to gitlab after M$ acquired github? >> >> Qu Wenruo (6): >> btrfs-progs: restore: Fix wrong compressed item size for decompress() >> btrfs-progs: Get rid of the confusing btrfs_file_extent_inline_len() >> btrfs-progs: check/original: Detect and repair wrong inline ram_bytes >> btrfs-progs: check/lowmem: Prepare check_file_extent() to handle >> repair >> btrfs-progs: check/lowmem: Repair wrong inlien ram_bytes for >> uncompressed extent >> btrfs-progs: fsck-tests: Add test case for corrupted inline ram_bytes >> >> check/main.c | 46 ++- >> check/mode-lowmem.c | 120 ++ >> check/mode-original.h | 1 + >> cmds-restore.c | 5 +- >> ctree.h | 22 >> file.c | 3 +- >> print-tree.c | 4 +- >> .../offset_by_one.img | Bin 0 -> 3072 bytes >> .../035-inline-bad-ram-bytes/test.sh | 11 ++ >> 9 files changed, 157 insertions(+), 55 deletions(-) >> create mode 100644 >> tests/fsck-tests/035-inline-bad-ram-bytes/offset_by_one.img >> create mode 100755 tests/fsck-tests/035-inline-bad-ram-bytes/test.sh >> > > signature.asc Description: OpenPGP digital signature
Re: [PATCH 0/6] btrfs-progs: Fixes inline ram_bytes related bugs
On 06/06/2018 01:27 AM, Qu Wenruo wrote: The patchset can be fetched from github (*): https://github.com/adam900710/btrfs-progs/tree/inline_ram_bytes It's based on David's devel branch, whose HEAD is: commit 0d1c5812e28e286648781c7b35b542311cc01aa4 (david/devel) Author: Matthias Benkard Date: Wed Apr 25 16:34:54 2018 +0200 btrfs-progs: mkfs: traverse_directory: Reset error code on continue Reported-by Steve Leung , his old btrfs (at least offending inodes are from 2014) has inline uncompressed extent, while its ram_bytes mismatch with item size. Took a while to run, but: Tested-by: Steve Leung Verifying everything against backups now, but things look good so far. I'm not 100% sure how to interpret the "btrfs check" output, but it sounded like it was going over each subvolume. And since the corruption was referenced by many subvolumes, the same work essentially had to be duplicated many times. Is that a correct assessment? Would it have saved time to remove a bunch of snapshots first before doing "btrfs check"? Either way, many thanks for getting everything going again! Steve Latest kernel tree check catches this bug, while we failed to detect by dump-tree. It turns out that btrfs-progs is doing something evil to avoid reading ram_bytes from inline uncompressed extent. So this patchset will address all such ram_bytes related problems. The 1st patch is a not-so-relative fix for restore, which is using ram_bytes for decompress. Although thanks to the compression header, we won't read out-of-boundary, but fixing it is never a bad thing. The 2nd patch will get rid of the evil btrfs_file_extent_inline_len() which hides raw ram_bytes from us, and fooling us for a long long time. The 3rd~5th patches introduce check/repair function for both original and lowmem mode (although lowmem mode can detect it even before this patch). The last one is the test case for it as usual. *: Or should I just migrate to gitlab after M$ acquired github? Qu Wenruo (6): btrfs-progs: restore: Fix wrong compressed item size for decompress() btrfs-progs: Get rid of the confusing btrfs_file_extent_inline_len() btrfs-progs: check/original: Detect and repair wrong inline ram_bytes btrfs-progs: check/lowmem: Prepare check_file_extent() to handle repair btrfs-progs: check/lowmem: Repair wrong inlien ram_bytes for uncompressed extent btrfs-progs: fsck-tests: Add test case for corrupted inline ram_bytes check/main.c | 46 ++- check/mode-lowmem.c | 120 ++ check/mode-original.h | 1 + cmds-restore.c| 5 +- ctree.h | 22 file.c| 3 +- print-tree.c | 4 +- .../offset_by_one.img | Bin 0 -> 3072 bytes .../035-inline-bad-ram-bytes/test.sh | 11 ++ 9 files changed, 157 insertions(+), 55 deletions(-) create mode 100644 tests/fsck-tests/035-inline-bad-ram-bytes/offset_by_one.img create mode 100755 tests/fsck-tests/035-inline-bad-ram-bytes/test.sh -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: in which directions does btrfs send -p | btrfs receive work
07.06.2018 05:50, Christoph Anton Mitterer пишет: > Hey. > > Just wondered about the following: > > When I have a btrfs which acts as a master and from which I make copies > of snapshots on it via send/receive (with using -p at send) to other > btrfs which acts as copies like this: > master +--> copy1 >+--> copy2 >\--> copy3 > and if now e.g. the device of master breaks, can I move *with > incremential send -p / receive backups from one of the copies? > > Which of the following two would work (or both?): > > A) Redesignating a copy to be a new master, e.g.: >old-copy1/new-master +--> new-disk/new-copy1 > What is old-copy1? You never told how it was created. >+--> copy2 > \--> copy3 >Obviously at least > send/receiving to new-copy1 shoud work, >but would that work as well > to copy2/copy3 (with -p), since they're >based on (and probably using > UUIDs) from the snapshot on the old >broken master? > > B) Let a new device be the master and move on from that (kinda creating >a "send/receive cycle": >1st: >copy1 +--> new-disk/new-master > >from then on (when new snapshots should be incrementally sent): >new-master +--> copy1 > +--> copy2 > \--> copy3 It is again not clear - you want to overwrite exiting copy1, copy2, copy3? Do these copyN have any relation to copyN in the beginning of your message? Or just want to start new incremental replication from scratch? >Again, not sure whether send/receiving to copy2/3 would work, since >they're based on snapshots/parents from the old broken master. >And I'm even more unsure, whether this back&forth send/receiving, >from copy1->new-master->copy1 would work. > > > Any expert having some definite idea? :-) > > Thanks, > Chris. > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Bad superblock when mounting rw, ro mount works
On 2018年06月08日 01:07, Daniel Underwood wrote: > Hi, > > I have a raid10-like setup that is failing to mount in rw mode with the error > > > mount: /mnt/media: wrong fs type, bad option, bad superblock on > /dev/mapper/em1, missing codepage or helper program, or other error > > read-only mounts seem to work and the files seem to be there. > > I started having issues after a system crash during the process of > deleting a number of large files. After this (Ubuntu 16.04/Kernel > 4.4), any attempt to mount the array in rw mode would cause a similar > crash. I did an upgrade to Ubuntu 18.04/Kernel 4.15 and now get the > error above. > > I have looked through a variety of posts on the mailing list, but > couldn't find anything with the same issue. I have done a scrub on the > array that resulted in 6 verify errors with dmesg showing something > about extent trees. It didn't list them as uncorrectable errors, but > couldn't correct them either as I can't mount in rw. I also tried > `btrfs rescue zero-log /dev/mapper/em1`, which changes the above to > (say) em5, but then zero-log on em5 causes it to go back to em1. The problem is, your extent tree is corrupted. And normally, it indicates your whole filesystem has more or less extra errors. In this case, btrfs check could help you to know how badly your fs is corrupted. (don't run --repair). Optionally, btrfs check --mode=lowmem is pretty a good try, as it would provide better error message. (And help us enhance btrfs-progs) > > Any direction would be appreciated. From what I could tell, my next > steps would be a check with --repair or --init-extent-tree, though I'm > reluctant to try those without being explicitly told to do so. Never run this two, unless you're 100% sure about: 1) What's the problem 2) What --repair can fix Personally speaking, and also stated in man page, unless you're a btrfs-progs developer or has enough experience, don't do any write operation with btrfs check. Thanks, Qu > > I have attached a dmesg.log file and hope I haven't greped out > anything important. > > Thanks, > Daniel > signature.asc Description: OpenPGP digital signature
Re: RAID-1 refuses to balance large drive
On Sat, May 26, 2018 at 06:27:57PM -0700, Brad Templeton wrote: > A few years ago, I encountered an issue (halfway between a bug and a > problem) with attempting to grow a BTRFS 3 disk Raid 1 which was > fairly full. The problem was that after replacing (by add/delete) a > small drive with a larger one, there were now 2 full drives and one > new half-full one, and balance was not able to correct this situation > to produce the desired result, which is 3 drives, each with a roughly > even amount of free space. It can't do it because the 2 smaller > drives are full, and it doesn't realize it could just move one of the > copies of a block off the smaller drive onto the larger drive to free > space on the smaller drive, it wants to move them both, and there is > nowhere to put them both. > > I'm about to do it again, taking my nearly full array which is 4TB, > 4TB, 6TB and replacing one of the 4TB with an 8TB. I don't want to > repeat the very time consuming situation, so I wanted to find out if > things were fixed now. I am running Xenial (kernel 4.4.0) and could > consider the upgrade to bionic (4.15) though that adds a lot more to > my plate before a long trip and I would prefer to avoid if I can. > > So what is the best strategy: > > a) Replace 4TB with 8TB, resize up and balance? (This is the "basic" > strategy) > b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks > from 4TB but possibly not enough) > c) Replace 6TB with 8TB, resize/balance, then replace 4TB with > recently vacated 6TB -- much longer procedure but possibly better d) Run "btrfs balance start -dlimit=3 /fs" to make some unallocated space on all drives *before* adding disks. Then replace, resize up, and balance until unallocated space on all disks are equal. There is no need to continue balancing after that, so once that point is reached you can cancel the balance. A number of bad things can happen when unallocated space goes to zero, and being unable to expand a raid1 array is only one of them. Avoid that situation even when not resizing the array, because some cases can be very difficult to get out of. Assuming your disk is not filled to the last gigabyte, you'll be able to keep at least 1GB unallocated on every disk at all times. Monitor the amount of unallocated space and balance a few data block groups (e.g. -dlimit=3) whenever unallocated space gets low. A potential btrfs enhancement area: allow the 'devid' parameter of balance to specify two disks to balance block groups that contain chunks on both disks. We want to balance only those block groups that consist of one chunk on each smaller drive. This redistributes those block groups to have one chunk on the large disk and one chunk on one of the smaller disks, freeing space on the other small disk for the next block group. Block groups that consist of a chunk on the big disk and one of the small disks are already in the desired configuration, so rebalancing them is just a waste of time. Currently it's only possible to do this by writing a script to select individual block groups with python-btrfs or similar--much faster than plain btrfs balance for this case, but more involved to set up. > Or has this all been fixed and method A will work fine and get to the > ideal goal -- 3 drives, with available space suitably distributed to > allow full utilization over time? > > On Sat, May 26, 2018 at 6:24 PM, Brad Templeton wrote: > > A few years ago, I encountered an issue (halfway between a bug and a > > problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly > > full. The problem was that after replacing (by add/delete) a small drive > > with a larger one, there were now 2 full drives and one new half-full one, > > and balance was not able to correct this situation to produce the desired > > result, which is 3 drives, each with a roughly even amount of free space. > > It can't do it because the 2 smaller drives are full, and it doesn't realize > > it could just move one of the copies of a block off the smaller drive onto > > the larger drive to free space on the smaller drive, it wants to move them > > both, and there is nowhere to put them both. > > > > I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB > > and replacing one of the 4TB with an 8TB. I don't want to repeat the very > > time consuming situation, so I wanted to find out if things were fixed now. > > I am running Xenial (kernel 4.4.0) and could consider the upgrade to bionic > > (4.15) though that adds a lot more to my plate before a long trip and I > > would prefer to avoid if I can. > > > > So what is the best strategy: > > > > a) Replace 4TB with 8TB, resize up and balance? (This is the "basic" > > strategy) > > b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks from > > 4TB but possibly not enough) > > c) Replace 6TB with 8TB, resize/balance, then replace 4TB with recently > > vacated 6TB -- much longer procedure
Re: [PATCH v3 (Only) 2/3] btrfs-progs: map-logical: Use btrfs_next_extent_item()
On 06/08/2018 02:41 AM, james harvey wrote: On Thu, Jun 7, 2018 at 4:50 AM, Su Yue wrote: On 06/07/2018 03:20 PM, james harvey wrote: btrfs_next_extent_item() looks for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_KEY, which are the types we're looking for. Signed-off-by: James Harvey Reviewed-by: Su Yue btrfs_next_extent_item() looks for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_KEY, which are the types we're looking for. (Previous versions missed last 0 argument to btrfs_next_extent_item(). Sorry, I realize I need to take more time before sending, being more careful on everything.) So do I. If you want to write something which is unreleated to changes in a patch. Just put them and changelog belows "---" under SOB line. It would be easier for maintainers to apply patches. And, I noticed patches sent by you are separated in ML. It's OK. Puting series into one thread is better for review. You can use tools like 'git-send-email' to send patches. The last nag: a coverletter is nice with patches. Link of conversation: https://www.spinics.net/lists/linux-btrfs/msg77791.html Signed-off-by: James Harvey --- btrfs-map-logical.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c index 2451012b..8a41b037 100644 --- a/btrfs-map-logical.c +++ b/btrfs-map-logical.c @@ -83,7 +83,8 @@ again: ret = btrfs_previous_extent_item(fs_info->extent_root, path, 0); else - ret = btrfs_next_item(fs_info->extent_root, path); + ret = btrfs_next_extent_item(fs_info->extent_root, +path, 0); After a careful look, if btrfs_next/previous_extent_item is used here. It's no need to check if ((search_foward && key.objectid < logical) || (!search_foward && key.objectid > logical) || (key.type != BTRFS_EXTENT_ITEM_KEY && key.type != BTRFS_METADATA_ITEM_KEY)) { === Pass @logical as the last parameters of btrfs_next/previous_extent_item() is simple enough. Thanks, Su if (ret) goto out; goto again; -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] btrfs-progs: map-logical: look at next leaf if slot > items
On 2018年06月08日 03:57, james harvey wrote: > "On Thu, Jun 7, 2018 at 3:26 AM, Qu Wenruo wrote: >> On 2018年06月07日 15:19, james harvey wrote: >>> On Thu, Jun 7, 2018 at 12:44 AM, Qu Wenruo wrote: On 2018年06月07日 11:33, james harvey wrote: > * btrfs_item_key_to_cpu sets key to (10955960320, BTRFS_EXTENT_ITEM_KEY, > 4096) > !!! Bonus bug, LEFT FOR READER. Why is this item #197, 5 items before > the 203 > given? I think no bounds checking causes a buffer over-read here. > btrfs_item_key_to_cpu() calls btrfs_item_key(), which uses the macro > read_eb_member() to call read_extent_buffer() which memcpy's using an > out > of range index, at least for this leaf. Here the key you got could be garbage. Either the leaf has enough free space, then you got pending zero into the key, or the leaf is full, you got some item data into the key. Either way, the key should be read/trusted at all. >>> >>> Is the group's preference to leave things like bounds checking to the >>> caller? >> >> The problem here is, btrfs_search_slot() is not only called for >> read_only search, but also for insert. >> >> For insert usage, such returned path is completely valid and in fact >> could be pretty useful. > > I don't mean for btrfs_search_slot() behavior to change, since it is > called for inserting. I mean btrfs_item_key_to_cpu() behavior could > change. Makes sense! An nice point to enhance! > >>> (I get the merit to that, avoiding redundant checks.) My >>> normal style would be to have read_extent_buffer() fail if start + len >>> is out of bounds. >> >> In this case, it's not out-of-boundary at all, and the invalid key you >> read out if from the last item, which could contain a valid key. >> >> For leaf, btrfs puts btrfs items pointers at the head, and the items >> data at the tail. >> >> Thus for this case, your invalid key slot is just in the middle of the >> leaf, making read_extent_buffer() unable to detect anything wrong. >> >> Thanks, >> Qu > > You're right. read_extent_buffer() doesn't have the information to > look at this. > > And, "buffer over-read error" was probably the wrong term. I mean in > the sense that it has 203 valid values, and is being asked for the > 204'th item. > > btrfs_item_key() could error on > > (nr > btrfs_header_nritems(eb)) > > This defensive programming would cause redundant checks, if the caller > checks this too as they are supposed to, so I understand if the group > doesn't swing that far on defensive programming. > > Although, perhaps in such situation, btrfs_item_key() could > automatically call btrfs_next_leaf(). This introduce extra return value, and for things like btrfs_item_key(), we need to modify tons of call sites. But at least, I could check if doing extra check (mostly a WARN_ON()) in btrfs_item_key() is valid. Thanks, Qu > I'm thinking this might not be > desired behavior everywhere, but just in case it is, mentioning it > because then btrfs_item_key() could handle all of this, not allowing > the caller to shoot their own foot, without redundant checks. > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > signature.asc Description: OpenPGP digital signature
Re: [PATCH] btrfs-progs: check: Initialize all filed of btrfs_inode_item in insert_inode_item()
On 2018/06/07 21:22, David Sterba wrote: > On Thu, Jun 07, 2018 at 11:49:58AM +0900, Misono Tomohiro wrote: >> Initialize all filed of btrfs_inode_item to zero in order to prevent >> having some garbage, especially for flags field. > > Have you observed in practice or is it a matter of precaution? I saw failure of fsck-test/010 in yesterday's devel branch and made this patch. It turned out that root cause was wrong flag comparison in btrfs check. (https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg77758.html) With Su's fix, failure of fsck-test/010 is also gone without this patch, but it is better to initialize the variables anyway. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Exporting a unique ino/dev pair to user space
On Thu, Jun 07, 2018 at 10:38:51AM +0300, Amir Goldstein wrote: > On Thu, Jun 7, 2018 at 12:38 AM, Mark Fasheh wrote: > > Hi, > > > > We have an inconsistency in how the kernel is exporting inode number / > > device pairs for user space. There's of course stat(2) and statx(2), > > but aside from those we simply dump inode->i_ino and super->s_dev. In > > some cases, the dumped values differ from what is returned via stat(2) > > or statx(2). Some filesystems might even show duplicate (but > > internally different!) pairs when the raw i_ino/s_dev is used. > > > > Some examples where we dump raw ino/dev: > > > > - /proc//maps. I've written about how this confuses lsof(8): > > https://marc.info/?l=linux-btrfs&m=130074451403261&w=2 > > > > - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See > > trace/events/lock.h or trace/events/writeback.h for examples. > > > > - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo() > > > > - Audit records the raw ino/dev and passes them around. We do seem to > > have paths printed from audit as well, but if it's printed with the > > wrong ino/dev pair I believe my point still stands. > > > > knfsd also looks at i_ino. I converted one or two of these places > to getattr callbacks recently, but there are still some more. > > Anyway, w.r.t. Overlayfs, I believe Miklos' work to make file_inode() > an overlay inode should resolve several of the issues listed above - > probably not all, but didn't check every tracepoint.. Ok, thanks for letting me know about knfsd and the overlayfs work. I haven't had a chance to check out the overlayfs work yet but I will shortly. > > This breaks software which expects these pairs to be unique, and can > > put the user in a situation where they might not be able to find an > > inode referenced from the kernel. What's even worse - depending on how > > ino is exported, they might even find the *wrong* inode. > > > > I also should point out that we're likely at this point because > > stat(2) has been using an unsigned long for ino. On a 32 bit system, > > it would have been impossible for the user to get the real inode > > number in the first place. So there probably wasn't much we could do. > > > > These days though, we have statx(2) which will do the right thing on > > all platforms so we no longer have that excuse. The user ought to be > > able to take an ino/dev pair and ultimately find the actual file on > > their system, partially with the help of statx(2). > > > > > > Some examples of how ino is manipulated by filesystems: > > > > - On 64 bit i_ino and kstat->ino tend to be filled in correctly (from > > what I can tell). stat->dev is not always consistent with super->s_dev. > > > > - On 32 bits, many filesystems employ a transformation to squeeze a 64 > > bit identifier into 32 bits. The exact methods are fs specific, > > what's important is that we're losing information, introducing the > > possibility of duplicate inode numbers. > > > > - On all platforms, Btrfs and Overlayfs can have duplicate inode > > numbers. Of course, device can be different across the fs as well > > with the kernel reporting s_dev and these filesystems returning > > a different device via stat() or statx(). > > > > > > Getting the inode number portion of this pair fixed would immediately > > solve the situation for all filesystems except Btrfs and > > Overlayfs - they report a different device from stat. > > > > Regarding the device portion of the pair, I'm honestly not sure > > whether Overlayfs cares, and my attempts to fix the s_dev situation > > for Btrfs have all come to the same dead ends that I've hit briefly > > looking into this inode number issue - the problems are intrinsically > > linked. > > > > So my questions are: > > > > 1) Do we care about this? On one hand, we've had this inconsistency > >for a long time, for various reasons. On the other hand, I can point > >to bugzilla's where these inconsistencies have become a problem. > > > >In the case that we don't care, any fs-internal solutions are > >likely to be extremely disruptive to the end user. > > > > > > 2) If we do care, how do we fix this? > > > > 2a) Do we use 64 bit i_ino on 32 bit systems? This has some obvious > > downsides from a memory usage standpoint. Also it doesn't fully > > address the issue - we still have a device field that Btrfs and > > Overlayfs override. > > > > We could combine this with an intermediate structure between the > > inode and super block so s_dev could be abstracted out. See my > > fs_view patches for an example of how this could be done: > > https://www.spinics.net/lists/linux-fsdevel/msg125492.html > > > > 2b) Do we call ->getattr for this information? Plumbing a vfsmount to > > various regions of the kernel like audit, or some of the deeper > > tracepoints looks ugly and prone to life-timing issues (which > > vfsmount do we use?). On the upside, we
Re: Bad superblock when mounting rw, ro mount works
On Thu, Jun 7, 2018 at 2:38 PM, Chris Murphy wrote: > Your very first task right now is to mount ro, and update your > backups. Don't do anything else until you've done that. It's a > testimony to Btrfs that this file system mounts at all, even ro, so > take advantage of this fact before you can't mount it anymore. After you've done the backup, you need to find out why one of these devices is being so unreliable. That has to be fixed first. You can recreate a new Btrfs or some other file system, and you'll just run into the exact same problem down the road. Next, it might be useful to see the output from btrfs-progs 4.16.1 'btrfs check' and 'btrfs check --mode=lowmem' both of which are slow, the second one is really slow but is a different implementation so it's helpful to see both outputs. That's safe as long as you do not use --repair. Also we need to see the output from 'btrfs fi us ' with the volume mounted (ro). Off hand I think the most likely outcome is that you get a backup from the ro mounted file system, and you'll have to recreate it from scratch and restore from backups. In other words, no matter what you need a current backup. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/2] vfs: allow dedupe of user owned read-only files
Hi Darrick, On Thu, Jun 07, 2018 at 11:17:51AM -0700, Darrick J. Wong wrote: > Looks ok, but could you please update the manpage for > ioctl_fideduperange to elaborate on when userspace can expect EPERM? > > Acked-by: Darrick J. Wong Yes, good idea. I can handle that. Thanks, --Mark -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Bad superblock when mounting rw, ro mount works
So you zero'd the log? Why? You need to slow down and stop making changes to your file system unless you hate your data and want to lose it. The log is there to help protect your data, you don't zero the log except under very specific situations, none of which appear to apply to your situation. 'btrfs check' is safe, --repair may or may not be safe, you didn't say what version of btrfs-progs you're using, I would consider 4.14.1 at the oldest, it should be straightforward with any distro to get something relatively new, maybe from a testing repo. Current is 4.16. And good god don't use --init-extent-tree, that's a hammer. The absolute safest way to troubleshoot Btrfs problems is using newer kernel versions, even though 4.15 is not old, there are still tons of bug fixes in every release. Suspicious items from your dmesg: [95639.492347] BTRFS error (device dm-11): pending csums is 44810240 [95639.492791] BTRFS error (device dm-11): cleaner transaction attach returned -30 [95639.609171] BTRFS error (device dm-11): open_ctree failed [96614.907963] BTRFS info (device dm-11): disk space caching is enabled [96614.907965] BTRFS info (device dm-11): has skinny extents [96615.207692] BTRFS info (device dm-11): bdev /dev/mapper/em3 errs: wr 0, rd 0, flush 0, corrupt 0, gen 3 [96615.207702] BTRFS info (device dm-11): bdev /dev/mapper/em5 errs: wr 164286, rd 3444262, flush 2110, corrupt 3, gen 181 OK so it's a multiple device Btrfs volume, and you have one device missing a metric done of reads and writes and flushes, with a few corruptions and generation mismatches. What's wrong with /dev/mapper/em5? Have you done any troubleshooting to figure out what that problem is? Because Btrfs can't fix it, and it can't make up for it. That looks like the device was missing for a long time, and writes were only going to em3. So what's the backstory? Your very first task right now is to mount ro, and update your backups. Don't do anything else until you've done that. It's a testimony to Btrfs that this file system mounts at all, even ro, so take advantage of this fact before you can't mount it anymore. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] btrfs-progs: map-logical: look at next leaf if slot > items
"On Thu, Jun 7, 2018 at 3:26 AM, Qu Wenruo wrote: > On 2018年06月07日 15:19, james harvey wrote: >> On Thu, Jun 7, 2018 at 12:44 AM, Qu Wenruo wrote: >>> On 2018年06月07日 11:33, james harvey wrote: * btrfs_item_key_to_cpu sets key to (10955960320, BTRFS_EXTENT_ITEM_KEY, 4096) !!! Bonus bug, LEFT FOR READER. Why is this item #197, 5 items before the 203 given? I think no bounds checking causes a buffer over-read here. btrfs_item_key_to_cpu() calls btrfs_item_key(), which uses the macro read_eb_member() to call read_extent_buffer() which memcpy's using an out of range index, at least for this leaf. >>> >>> Here the key you got could be garbage. >>> Either the leaf has enough free space, then you got pending zero into >>> the key, or the leaf is full, you got some item data into the key. >>> >>> Either way, the key should be read/trusted at all. >> >> Is the group's preference to leave things like bounds checking to the >> caller? > > The problem here is, btrfs_search_slot() is not only called for > read_only search, but also for insert. > > For insert usage, such returned path is completely valid and in fact > could be pretty useful. I don't mean for btrfs_search_slot() behavior to change, since it is called for inserting. I mean btrfs_item_key_to_cpu() behavior could change. >> (I get the merit to that, avoiding redundant checks.) My >> normal style would be to have read_extent_buffer() fail if start + len >> is out of bounds. > > In this case, it's not out-of-boundary at all, and the invalid key you > read out if from the last item, which could contain a valid key. > > For leaf, btrfs puts btrfs items pointers at the head, and the items > data at the tail. > > Thus for this case, your invalid key slot is just in the middle of the > leaf, making read_extent_buffer() unable to detect anything wrong. > > Thanks, > Qu You're right. read_extent_buffer() doesn't have the information to look at this. And, "buffer over-read error" was probably the wrong term. I mean in the sense that it has 203 valid values, and is being asked for the 204'th item. btrfs_item_key() could error on (nr > btrfs_header_nritems(eb)) This defensive programming would cause redundant checks, if the caller checks this too as they are supposed to, so I understand if the group doesn't swing that far on defensive programming. Although, perhaps in such situation, btrfs_item_key() could automatically call btrfs_next_leaf(). I'm thinking this might not be desired behavior everywhere, but just in case it is, mentioning it because then btrfs_item_key() could handle all of this, not allowing the caller to shoot their own foot, without redundant checks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 (Only) 2/3] btrfs-progs: map-logical: Use btrfs_next_extent_item()
On Thu, Jun 7, 2018 at 4:50 AM, Su Yue wrote: > On 06/07/2018 03:20 PM, james harvey wrote: >> >> btrfs_next_extent_item() looks for BTRFS_EXTENT_ITEM_KEY and >> BTRFS_METADATA_KEY, >> which are the types we're looking for. >> >> Signed-off-by: James Harvey > > Reviewed-by: Su Yue btrfs_next_extent_item() looks for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_KEY, which are the types we're looking for. (Previous versions missed last 0 argument to btrfs_next_extent_item(). Sorry, I realize I need to take more time before sending, being more careful on everything.) Signed-off-by: James Harvey --- btrfs-map-logical.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c index 2451012b..8a41b037 100644 --- a/btrfs-map-logical.c +++ b/btrfs-map-logical.c @@ -83,7 +83,8 @@ again: ret = btrfs_previous_extent_item(fs_info->extent_root, path, 0); else - ret = btrfs_next_item(fs_info->extent_root, path); + ret = btrfs_next_extent_item(fs_info->extent_root, +path, 0); if (ret) goto out; goto again; -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/2] vfs: allow dedupe of user owned read-only files
On Thu, Jun 07, 2018 at 10:38:53AM -0700, Mark Fasheh wrote: > The permission check in vfs_dedupe_file_range() is too coarse - We > only allow dedupe of the destination file if the user is root, or > they have the file open for write. > > This effectively limits a non-root user from deduping their own read-only > files. In addition, the write file descriptor that the user is forced to > hold open can prevent execution of files. As file data during a dedupe > does not change, the behavior is unexpected and this has caused a number of > issue reports. For an example, see: > > https://github.com/markfasheh/duperemove/issues/129 > > So change the check so we allow dedupe on the target if: > > - the root or admin is asking for it > - the process has write access > - the owner of the file is asking for the dedupe > - the process could get write access > > That way users can open read-only and still get dedupe. > > Signed-off-by: Mark Fasheh Looks ok, but could you please update the manpage for ioctl_fideduperange to elaborate on when userspace can expect EPERM? Acked-by: Darrick J. Wong --D > --- > fs/read_write.c | 17 +++-- > 1 file changed, 15 insertions(+), 2 deletions(-) > > diff --git a/fs/read_write.c b/fs/read_write.c > index e83bd9744b5d..71e9077f8bc1 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -1964,6 +1964,20 @@ int vfs_dedupe_file_range_compare(struct inode *src, > loff_t srcoff, > } > EXPORT_SYMBOL(vfs_dedupe_file_range_compare); > > +/* Check whether we are allowed to dedupe the destination file */ > +static bool allow_file_dedupe(struct file *file) > +{ > + if (capable(CAP_SYS_ADMIN)) > + return true; > + if (file->f_mode & FMODE_WRITE) > + return true; > + if (uid_eq(current_fsuid(), file_inode(file)->i_uid)) > + return true; > + if (!inode_permission(file_inode(file), MAY_WRITE)) > + return true; > + return false; > +} > + > int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) > { > struct file_dedupe_range_info *info; > @@ -1972,7 +1986,6 @@ int vfs_dedupe_file_range(struct file *file, struct > file_dedupe_range *same) > u64 len; > int i; > int ret; > - bool is_admin = capable(CAP_SYS_ADMIN); > u16 count = same->dest_count; > struct file *dst_file; > loff_t dst_off; > @@ -2036,7 +2049,7 @@ int vfs_dedupe_file_range(struct file *file, struct > file_dedupe_range *same) > > if (info->reserved) { > info->status = -EINVAL; > - } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) { > + } else if (!allow_file_dedupe(dst_file)) { > info->status = -EINVAL; > } else if (file->f_path.mnt != dst_file->f_path.mnt) { > info->status = -EXDEV; > -- > 2.15.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 2/2] vfs: dedupe should return EPERM if permission is not granted
Right now we return EINVAL if a process does not have permission to dedupe a file. This was an oversight on my part. EPERM gives a true description of the nature of our error, and EINVAL is already used for the case that the filesystem does not support dedupe. Signed-off-by: Mark Fasheh Reviewed-by: Darrick J. Wong Acked-by: David Sterba --- fs/read_write.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/read_write.c b/fs/read_write.c index 71e9077f8bc1..7188982e2733 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -2050,7 +2050,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) if (info->reserved) { info->status = -EINVAL; } else if (!allow_file_dedupe(dst_file)) { - info->status = -EINVAL; + info->status = -EPERM; } else if (file->f_path.mnt != dst_file->f_path.mnt) { info->status = -EXDEV; } else if (S_ISDIR(dst->i_mode)) { -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/2] vfs: better dedupe permission check
Hi Al, The following patches fix a couple of issues with the permission check we do in vfs_dedupe_file_range(). I sent them out for review twice, a changelog is attached. If they look ok to you, I'd appreciate them being pushed upstream. You can get them from git if you like: git pull https://github.com/markfasheh/linux dedupe-perms The first patch expands our check to allow dedupe of a file if the user owns it or otherwise would be allowed to write to it. Current behavior is that we'll allow dedupe only if: - the user is an admin (root) - the user has the file open for write This makes it impossible for a user to dedupe their own file set unless they do it as root, or ensure that all files have write permission. There's a couple of duperemove bugs open for this: https://github.com/markfasheh/duperemove/issues/129 https://github.com/markfasheh/duperemove/issues/86 The other problem we have is also related to forcing the user to open target files for write - A process trying to exec a file currently being deduped gets ETXTBUSY. The answer (as above) is to allow them to open the targets ro - root can already do this. There was a patch from Adam Borowski to fix this back in 2016: https://lkml.org/lkml/2016/7/17/130 which I have incorporated into my changes. The 2nd patch fixes our return code for permission denied to be EPERM. For some reason we're returning EINVAL - I think that's probably my fault. At any rate, we need to be returning something descriptive of the actual problem, otherwise callers see EINVAL and can't really make a valid determination of what's gone wrong. This has also popped up in duperemove, mostly in the form of cryptic error messages. Because this is a code returned to userspace, I did check the other users of extent-same that I could find. Both 'bees' and 'rust-btrfs' do the same as duperemove and simply report the error (as they should). Please apply. Thanks, --Mark Changes from V2 to V3: - Return bool from allow_file_dedupe - V2 discussion: https://www.spinics.net/lists/linux-btrfs/msg78421.html Changes from V1 to V2: - Add inode_permission check as suggested by Adam Borowski - V1 discussion: https://marc.info/?l=linux-xfs&m=152606684017965&w=2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 1/2] vfs: allow dedupe of user owned read-only files
The permission check in vfs_dedupe_file_range() is too coarse - We only allow dedupe of the destination file if the user is root, or they have the file open for write. This effectively limits a non-root user from deduping their own read-only files. In addition, the write file descriptor that the user is forced to hold open can prevent execution of files. As file data during a dedupe does not change, the behavior is unexpected and this has caused a number of issue reports. For an example, see: https://github.com/markfasheh/duperemove/issues/129 So change the check so we allow dedupe on the target if: - the root or admin is asking for it - the process has write access - the owner of the file is asking for the dedupe - the process could get write access That way users can open read-only and still get dedupe. Signed-off-by: Mark Fasheh --- fs/read_write.c | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index e83bd9744b5d..71e9077f8bc1 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1964,6 +1964,20 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, } EXPORT_SYMBOL(vfs_dedupe_file_range_compare); +/* Check whether we are allowed to dedupe the destination file */ +static bool allow_file_dedupe(struct file *file) +{ + if (capable(CAP_SYS_ADMIN)) + return true; + if (file->f_mode & FMODE_WRITE) + return true; + if (uid_eq(current_fsuid(), file_inode(file)->i_uid)) + return true; + if (!inode_permission(file_inode(file), MAY_WRITE)) + return true; + return false; +} + int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) { struct file_dedupe_range_info *info; @@ -1972,7 +1986,6 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) u64 len; int i; int ret; - bool is_admin = capable(CAP_SYS_ADMIN); u16 count = same->dest_count; struct file *dst_file; loff_t dst_off; @@ -2036,7 +2049,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) if (info->reserved) { info->status = -EINVAL; - } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) { + } else if (!allow_file_dedupe(dst_file)) { info->status = -EINVAL; } else if (file->f_path.mnt != dst_file->f_path.mnt) { info->status = -EXDEV; -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 (Only) 1/3] btrfs-progs: map-logical: look at next leaf if slot > items
btrfs-map-logical -l 10955980800 -b 4096 No extent found at range [10955980800,10955984896) But, this extent exists. btrfs-debug-tree shows: item 202 key (10955976704 EXTENT_ITEM 4096) itemoff 8772 itemsize 37 refs 1 gen 62656 flags DATA shared data backref parent 142655488 count 1 ... item 0 key (10955980800 EXTENT_ITEM 4096) itemoff 16246 itemsize 37 refs 1 gen 62656 flags DATA shared data backref parent 128958464 count 1 The code searches for (logical, 0, 0), and looks forward then backward. It needs to go to the next leaf when slot > number of items on this leaf. Signed-off-by: James Harvey --- btrfs-map-logical.c | 8 1 file changed, 8 insertions(+) diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c index 7a8bcff9..8a4228a4 100644 --- a/btrfs-map-logical.c +++ b/btrfs-map-logical.c @@ -65,6 +65,14 @@ static int map_one_extent(struct btrfs_fs_info *fs_info, BUG_ON(ret == 0); ret = 0; + if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) { + ret = btrfs_next_leaf(fs_info->extent_root, path); + if (ret > 0) + ret = -ENOENT; + if (ret) + goto out; + } + again: btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); if ((search_foward && key.objectid < logical) || -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Bad superblock when mounting rw, ro mount works
Hi, I have a raid10-like setup that is failing to mount in rw mode with the error mount: /mnt/media: wrong fs type, bad option, bad superblock on /dev/mapper/em1, missing codepage or helper program, or other error read-only mounts seem to work and the files seem to be there. I started having issues after a system crash during the process of deleting a number of large files. After this (Ubuntu 16.04/Kernel 4.4), any attempt to mount the array in rw mode would cause a similar crash. I did an upgrade to Ubuntu 18.04/Kernel 4.15 and now get the error above. I have looked through a variety of posts on the mailing list, but couldn't find anything with the same issue. I have done a scrub on the array that resulted in 6 verify errors with dmesg showing something about extent trees. It didn't list them as uncorrectable errors, but couldn't correct them either as I can't mount in rw. I also tried `btrfs rescue zero-log /dev/mapper/em1`, which changes the above to (say) em5, but then zero-log on em5 causes it to go back to em1. Any direction would be appreciated. From what I could tell, my next steps would be a check with --repair or --init-extent-tree, though I'm reluctant to try those without being explicitly told to do so. I have attached a dmesg.log file and hope I haven't greped out anything important. Thanks, Daniel -- Daniel Underwood dmesg.log Description: Binary data
Re: [PATCH 3/3] btrfs: fix race between mkfs and mount
On Mon, Jun 04, 2018 at 11:00:30PM +0800, Anand Jain wrote: > In an instrumented testing it is possible that the mount and > a newer mkfs.btrfs thread on the same device can race and if the new > mkfs.btrfs wins it will free the older fs_devices, then the mount thread > will lead to oops. > > Thread1 Thread2 > --- --- > mkfs.btrfs -fq /dev/sdb > mount /dev/sdb /btrfs > |_btrfs_mount_root() > |_btrfs_scan_one_device(... &fs_devices) > > mkfs.btrfs -fq /dev/sdb > |_btrfs_contol_ioctl() > |_btrfs_scan_one_device(... > &fs_devices) > |_:: > > |_btrfs_free_stale_devices() > > |_btrfs_open_devices(fs_devices ..) <-- stale fs_devices. > > Fix this with a mutually exclusive flag BTRFS_VOL_FLAG_EXCL_OPS. I'm not sure this is the right way to fix it, adding another bit to the uuid_mutex and device_list_mutex combo. To fix the race between mount and mkfs we can add a bit of logic to the device scanning so that mount calling scan will track the purpose and mkfs scan will not be allowed to free that device. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: fix race between free_stale_devices and close_fs_devices
On Thu, Jun 07, 2018 at 12:01:09AM +0800, Anand Jain wrote: > %fs_devices can be free-ed by btrfs_free_stale_devices() when the > close_fs_devices() drops fs_devices::opened to zero, but close_fs_devices > tries to access the %fs_devices again without the device_list_mutex. > > Fix this by bringing the %fs_devices access with in the device_list_mutex. > > Stack trace as below. > > HEAD commit:716a685fdb89 Merge branch 'x86-hyperv-for-linus' of git://.. > :: > CPU: 1 PID: 4499 Comm: syz-executor921 Not tainted 4.17.0+ #84 > :: > WARNING: CPU: 1 PID: 4499 at fs/btrfs/volumes.c:1071 > close_fs_devices+0xbc7/0xfa0 fs/btrfs/volumes.c:1071 > Kernel panic - not syncing: panic_on_warn set ... > :: > RIP: 0010:close_fs_devices+0xbc7/0xfa0 fs/btrfs/volumes.c:1071 > :: > btrfs_close_devices+0x29/0x150 fs/btrfs/volumes.c:1085 > open_ctree+0x589/0x7898 fs/btrfs/disk-io.c:3358 > btrfs_fill_super fs/btrfs/super.c:1202 [inline] > btrfs_mount_root+0x16df/0x1e70 fs/btrfs/super.c:1593 > mount_fs+0xae/0x328 fs/super.c:1277 > vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037 > vfs_kern_mount+0x40/0x60 fs/namespace.c:1027 > btrfs_mount+0x4a1/0x213e fs/btrfs/super.c:1661 > mount_fs+0xae/0x328 fs/super.c:1277 > > Reported-by: syzbot+ceb2606025ec1cc34...@syzkaller.appspotmail.com > Signed-off-by: Anand Jain > --- > fs/btrfs/volumes.c | 11 ++- > 1 file changed, 6 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index c2b7d66192e8..32fba4e24027 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -1153,6 +1153,12 @@ static int close_fs_devices(struct btrfs_fs_devices > *fs_devices) > btrfs_prepare_close_one_device(device); > list_add(&device->dev_list, &pending_put); > } > + > + WARN_ON(fs_devices->open_devices); > + WARN_ON(fs_devices->rw_devices); > + fs_devices->opened = 0; > + clear_bit(BTRFS_VOLUME_STATE_SEEDING, &fs_devices->volume_state); This is from some other patch. Moving that to the protected section should fix it but we'd also need to extend the critical section to: 1047 if (--fs_devices->opened > 0) 1048 return 0; Otherwise I think there are still some cornercases to fix as the fs_devices members are not accessed properly everywhere. This patch should be enough to fix the parallel mount and stale device freeing races though, so I'll queue it up. > + > mutex_unlock(&fs_devices->device_list_mutex); -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Btrfs progs pre-release 4.17-rc1
Hi, a pre-release has been tagged. The full 4.17 release will be done next Thursday. Only fixes (with tests) or documentation updates will be added, exceptions on case-by-case basis. ETA for 4.17 is in +7 days (2018-06-14). Changes: * check * many lowmem mode improvements * properly report qgroup mismatch errors * check symlinks with append/immutable flags * fi usage * correctly calculate allocated/unallocated for raid10 * minor output updates * mkfs * detect ENOSPC on thinly provisioned devices * fix spurious EEXIST during directory traversal * restore: fix relative path for restore target * dump-tree: print symbolic tree names for backrefs * send: fix regression preventing send -p with subvolumes mounted on "/" * corrupt-tree: refactoring and command line updates * build * make it work with e2fsprogs < 1.42 again * restore support for autoconf 2.63 * detect if -std=gnu90 is supported * other * new tests * cleanups Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/ Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git Shortlog: Axel Burri (1): btrfs-progs: fix regression preventing send -p with subvolumes mounted on "/" David Sterba (4): btrfs-progs: tests: update log markers btrfs-progs: tests: fix fsck-tests/031 when on NFS btrfs-progs: update CHANGES for v4.17 Btrfs progs v4.17-rc1 Gu JinXiang (1): btrfs-progs: Add DEBUG_CFLAGS_INTERNAL for libbtrfsutil Hans van Kranenburg (1): btrfs-progs: fi-usage: fix RAID10 raw disk usage James Harvey (1): btrfs-progs: map-logical: fix spelling of a variable Jeff Mahoney (3): btrfs-progs: convert: fix support for e2fsprogs < 1.42 btrfs-progs: build: autoconf 2.63 compatibility btrfs-progs: build: detect whether -std=gnu90 is supported Lu Fengqi (3): btrfs-progs: qgroup: swap the argument in the caller of update_qgroup_relation btrfs-progs: tests: check btrfs qgroup parent-child relation output btrfs-progs: print-tree: remove dead code from btrfs_print_tree Matthias Benkard (1): btrfs-progs: mkfs: traverse_directory: Reset error code on continue Misono Tomohiro (5): btrfs-progs: fi usage: change warning message more appropriately btrfs-progs: fi usage: change to output more info without root privilege btrfs-progs: ins: dump-tree: Use %lld for extent data backref offset btrfs-progs: ins: dump-tree: Print tree name for extent data/tree block backref btrfs-progs: check: Initialize all filed of btrfs_inode_item in insert_inode_item() Nikolay Borisov (37): btrfs-progs: Remove devid parameter from btrfs_rmap_block btrfs-progs: Remove add_cache_extent2 btrfs-progs: Remove objectid argument from alloc_cache_extent btrfs-progs: Use exclude_super_stripes instead of account_super_bytes btrfs-progs: check: fix DIR_ITEM checking in lowmem btrfs-progs: tests: Add test for collision DIR_ITEM handling btrfs-progs: check: Drop ext_ref parameter from find_inode_ref btrfs-progs: check: Drop ext_ref param from check_dir_item btrfs-progs: check: Drop ext_ref argument from check_inode_item btrfs-progs: check: Drop unused ext_ref parameter from process_one_leaf btrfs-progs: check: Remove ext_ref param from check_fs_first_inode btrfs-progs: check: Remove ext_ref param from walk_down_tree btrfs-progs: check: Drop ext_ref param from check_fs_first_inode btrfs-progs: check: Drop ext_ref arument from check_fs_root btrfs-progs: check: Remove ext_ref local variable from check_fs_roots_lowmem btrfs-progs: Make __btrfs_fs_incompat return bool btrfs-progs: check: Remove root argument from delete_extent_records btrfs-progs: check: Remove root parameter from btrfs_fix_block_accounting btrfs-progs: check: Remove root parameter from del_pending_extents btrfs-progs: check: Remove root argument from finish_current_insert btrfs-progs: check: Make update_pinned_extents take btrfs_fs_info btrfs-progs: Remove unused argument from clean_tree_block btrfs-progs: check: Remove unused root argument from btrfs_extent_post_op btrfs-progs: Change btrfs_root to btrfs_fs_info argument in btrfs_lookup_extent_info btrfs-progs: Remove root argument from btrfs_set_block_flags btrfs-progs: Remove root argument from write_one_cache_group btrfs-progs: Remove fs_info argument from write_ctree_super btrfs-progs: Use symbolic names for read ahead behavior btrfs-progs: corrupt-block: Factor out specific-root code btrfs-progs: corrupt-block: Correctly handle -r when passing -I btrfs-progs: corrupt-block: Factor out key parsing function btrfs-progs: corrupt-block: Change -I flag parameter format btrfs-progs: corrupt-block: Convert -K flag argument handling to common function btrfs-
Re: [PATCH v2 3/3] Fix misspelling of forward
On Thu, Jun 07, 2018 at 03:20:07AM -0400, james harvey wrote: > Signed-off-by: James Harvey Applied, thanks. Please don't forget to add the 'btrfs-progs' prefix to the subject so the patch does not get overlooked. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/3] btrfs-progs: check: verify symlinks with append/immutable flags
On Thu, Jun 07, 2018 at 02:59:23PM +0800, Su Yue wrote: > Could you replace this patchset with new V3? It fixes the problem > reported by Misono. Done, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: check: Initialize all filed of btrfs_inode_item in insert_inode_item()
On Thu, Jun 07, 2018 at 11:49:58AM +0900, Misono Tomohiro wrote: > Initialize all filed of btrfs_inode_item to zero in order to prevent > having some garbage, especially for flags field. Have you observed in practice or is it a matter of precaution? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6] Add ioctl to clear unused space
On Fri, Apr 20, 2018 at 04:21:43PM +0200, David Sterba wrote: > This patchset adds new ioctl similar to TRIM, that provides several > other ways how to clear the unused space. The changelogs are > incomplete, for preview not for inclusion yet. > > It compiles and has been tested lightly, the clearing modes depend on hw > capabilities (secure discard, sector unmapping instead of zeros), so > I've tested only zeroing and discard. > > I personally think the zeroing has a usecase and the other modes were > easy to add. Further extensions can be considered, eg. WRITE_SAME, > overwriting with a randomly generated pattern, or some filesystem canary > patterns that can be used to report unused block read as metadata. > > As this is modelled after the generic FITRIM ioctl, so this could be a > new generic ioctl too. However, last time somebody wanted such ioctl, > there was a pushback. I'll consider making a generic version and send it > for comments to fsdevel eventually. > > git://github.com/kdave/btrfs-devel dev/zero-free > > > David Sterba (6): > btrfs: extend trim callchain to pass the operation type > btrfs: add wrapper to switch clearing operation > btrfs: add zeroing clear operation > btrfs: add new ioctl BTRFS_IOC_CLEAR_FREE > btrfs: add discard secure to clear unused space > btrfs: add more zeroout modes for clearing unused space For the record, this patchset has been shifted to 4.19 or later. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Exporting a unique ino/dev pair to user space
On Thu, 2018-06-07 at 04:06 +0200, Mark Fasheh wrote: > Hi Ian, > > On Thu, Jun 07, 2018 at 08:47:28AM +0800, Ian Kent wrote: > > On Wed, 2018-06-06 at 23:38 +0200, Mark Fasheh wrote: > > > Hi, > > > > I'm not sure I understand what the problem is. > > I'll try to elaborate below. > > > > > We have an inconsistency in how the kernel is exporting inode number / > > > device pairs for user space. There's of course stat(2) and statx(2), > > > but aside from those we simply dump inode->i_ino and super->s_dev. In > > > some cases, the dumped values differ from what is returned via stat(2) > > > or statx(2). Some filesystems might even show duplicate (but > > > internally different!) pairs when the raw i_ino/s_dev is used. > > > > How is it that you can dump the raw ino and s_dev if your not in > > kernel code? > > If you look below my first paragraph, you'll see a list of places where the > kernel publishes (maybe that's a better word?) ino/dev pairs by printing or > otherwise copying raw ino/s_dev values into user accesible buffers. > > > > For stat family system calls, if the file system defines the inode > > operation getattr it will be used to fill the stat structure otherwise > > the VFS will fill the stat structure and it will use inode->i_ino and > > sb->s_dev as you say. > > My concern is that those pairs are sometimes not unique and do not line up > with what statx(2) returns. We actually need the true inode / device as is > returned by statx in those places. I go into much more detail in my original > mail. IMHO I think you are right in that the values seen by user space should be consistent. I also think that vfs_statx() (which is pretty much what's called by all the stat family system calls, and indirectly vfs_getattr) should be the defining way to get those values if only because it's core VFS and maintained by the VFS maintainer who should be the one to set the rules. But, as you say there are a bunch of places, not necessarily easy to find, that would need review. And there's the question of 32 bit . > > > > > Some examples where we dump raw ino/dev: > > > > > > - /proc//maps. I've written about how this confuses lsof(8): > > > https://marc.info/?l=linux-btrfs&m=130074451403261&w=2 > > > > > > - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See > > > trace/events/lock.h or trace/events/writeback.h for examples. > > > > > > - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo() > > > > > > - Audit records the raw ino/dev and passes them around. We do seem to > > > have paths printed from audit as well, but if it's printed with the > > > wrong ino/dev pair I believe my point still stands. > > > > > > > > > This breaks software which expects these pairs to be unique, and can > > > put the user in a situation where they might not be able to find an > > > inode referenced from the kernel. What's even worse - depending on how > > > ino is exported, they might even find the *wrong* inode. > > Thanks, > --Mark -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/3] btrfs-progs: check: verify symlinks with append/immutable flags
On 7.06.2018 12:10, Su Yue wrote: > > > On 06/07/2018 02:54 PM, Nikolay Borisov wrote: >> >> >> On 7.06.2018 09:55, Su Yue wrote: >>> This patchset can be fetch from my github: >>> https://github.com/Damenly/btrfs-progs/commits/odd_inode_flags >>> It's based on kdave/devel whose HEAD is: >>> commit 1c846faaf87fbb01e080c94098c02b1695ed86dd >>> Author: Nikolay Borisov >>> Date: Mon May 28 09:36:50 2018 +0300 >>> >>> btrfs-progs: Remove fs_info argument from write_ctree_super >>> >>> symlinks should never have append/immutable attributes. >>> This patchset enables btrfs check to verify such corruption. >>> >>> PATCH[1] is for original mode. >>> PATCH[2] is for original mode. >>> >>> PATCH[3] adds a test image. >>> For further use, the directory is called bad-inode-flags. >>> >>> #issue 133 >> >> >> So you areadding code to detect the problem but not to fix it. Is there >> some technical reason why you didn't implement clearing those flags or >> just didn't do it for this series? IMHO it will be good if the check + > > TBH I just didn't plan to do it while writing the patches. > If the functionality of repair is necessary, I'm willing to > do it. That's fine, however in case this code detects such a symlink what's the user supposed to do - they can just delete the symlink because it's immutable/append-only. SO they will know they have such symlinks but won't be able to repair them. > > However, it seems I have to complete work about repair mismatched file > type first. So we can verify whether a bad inode is a symlink with > append/immutable attributes or a normal file set with BTRFS_FT_SYMLINK. > > I'm fine that v2 patchset is to be replaced then support repair late > or to be reverted. > > Thanks, > Su > >> repair code are part of the same series. Then you can also extend the >> test case to cover the 2 functionalities? >> >>> >>> --- >>> Changelog: >>> v3: >>> Use S_ISLNK instead of BTRFS_FT_SYMLINK. Thanks Misono. >>> Change the test image created by hand. >>> v2: >>> Use "rec->errors |=" instead of "rec->errors =" in patch[1]. >>> Define new error bit of invalid inode flags in lowmem mode. >>> Adjust print message in patch[2]. Thanks, Qu and Nikolay. >>> Rename test directory from odd-inode-flags to bad-inode-flags. >>> >>> Su Yue (3): >>> btrfs-progs: check: check symlinks with append/immutable flags >>> btrfs-progs: check: lowmem: check symlinks with append/immutable >>> flags >>> btrfs-progs: fsck-tests: add test case to check symlinks with bad >>> flags >>> >>> check/main.c | 7 +++ >>> check/mode-lowmem.c | 10 ++ >>> check/mode-lowmem.h | 1 + >>> check/mode-original.h | 1 + >>> .../034-bad-inode-flags/default_case.img | Bin 0 -> 4096 bytes >>> tests/fsck-tests/034-bad-inode-flags/test.sh | 15 +++ >>> 6 files changed, 34 insertions(+) >>> create mode 100644 >>> tests/fsck-tests/034-bad-inode-flags/default_case.img >>> create mode 100755 tests/fsck-tests/034-bad-inode-flags/test.sh >>> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> > > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/3] btrfs-progs: check: verify symlinks with append/immutable flags
On 06/07/2018 02:54 PM, Nikolay Borisov wrote: On 7.06.2018 09:55, Su Yue wrote: This patchset can be fetch from my github: https://github.com/Damenly/btrfs-progs/commits/odd_inode_flags It's based on kdave/devel whose HEAD is: commit 1c846faaf87fbb01e080c94098c02b1695ed86dd Author: Nikolay Borisov Date: Mon May 28 09:36:50 2018 +0300 btrfs-progs: Remove fs_info argument from write_ctree_super symlinks should never have append/immutable attributes. This patchset enables btrfs check to verify such corruption. PATCH[1] is for original mode. PATCH[2] is for original mode. PATCH[3] adds a test image. For further use, the directory is called bad-inode-flags. #issue 133 So you areadding code to detect the problem but not to fix it. Is there some technical reason why you didn't implement clearing those flags or just didn't do it for this series? IMHO it will be good if the check + TBH I just didn't plan to do it while writing the patches. If the functionality of repair is necessary, I'm willing to do it. However, it seems I have to complete work about repair mismatched file type first. So we can verify whether a bad inode is a symlink with append/immutable attributes or a normal file set with BTRFS_FT_SYMLINK. I'm fine that v2 patchset is to be replaced then support repair late or to be reverted. Thanks, Su repair code are part of the same series. Then you can also extend the test case to cover the 2 functionalities? --- Changelog: v3: Use S_ISLNK instead of BTRFS_FT_SYMLINK. Thanks Misono. Change the test image created by hand. v2: Use "rec->errors |=" instead of "rec->errors =" in patch[1]. Define new error bit of invalid inode flags in lowmem mode. Adjust print message in patch[2]. Thanks, Qu and Nikolay. Rename test directory from odd-inode-flags to bad-inode-flags. Su Yue (3): btrfs-progs: check: check symlinks with append/immutable flags btrfs-progs: check: lowmem: check symlinks with append/immutable flags btrfs-progs: fsck-tests: add test case to check symlinks with bad flags check/main.c | 7 +++ check/mode-lowmem.c | 10 ++ check/mode-lowmem.h | 1 + check/mode-original.h| 1 + .../034-bad-inode-flags/default_case.img | Bin 0 -> 4096 bytes tests/fsck-tests/034-bad-inode-flags/test.sh | 15 +++ 6 files changed, 34 insertions(+) create mode 100644 tests/fsck-tests/034-bad-inode-flags/default_case.img create mode 100755 tests/fsck-tests/034-bad-inode-flags/test.sh -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 3/3] Fix misspelling of forward
On 06/07/2018 03:20 PM, james harvey wrote: Signed-off-by: James Harvey Reviewed-by: Su Yue --- btrfs-map-logical.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c index 8a41b037..59ba731b 100644 --- a/btrfs-map-logical.c +++ b/btrfs-map-logical.c @@ -39,7 +39,7 @@ static FILE *info_file; static int map_one_extent(struct btrfs_fs_info *fs_info, - u64 *logical_ret, u64 *len_ret, int search_foward) + u64 *logical_ret, u64 *len_ret, int search_forward) { struct btrfs_path *path; struct btrfs_key key; @@ -75,11 +75,11 @@ static int map_one_extent(struct btrfs_fs_info *fs_info, again: btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); - if ((search_foward && key.objectid < logical) || - (!search_foward && key.objectid > logical) || + if ((search_forward && key.objectid < logical) || + (!search_forward && key.objectid > logical) || (key.type != BTRFS_EXTENT_ITEM_KEY && key.type != BTRFS_METADATA_ITEM_KEY)) { - if (!search_foward) + if (!search_forward) ret = btrfs_previous_extent_item(fs_info->extent_root, path, 0); else -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/3] btrfs-progs: map-logical: Use btrfs_next_extent_item()
On 06/07/2018 03:20 PM, james harvey wrote: btrfs_next_extent_item() looks for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_KEY, which are the types we're looking for. Signed-off-by: James Harvey Reviewed-by: Su Yue --- btrfs-map-logical.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c index 2451012b..8a41b037 100644 --- a/btrfs-map-logical.c +++ b/btrfs-map-logical.c @@ -83,7 +83,8 @@ again: ret = btrfs_previous_extent_item(fs_info->extent_root, path, 0); else - ret = btrfs_next_item(fs_info->extent_root, path); + ret = btrfs_next_extent_item(fs_info->extent_root, +path); if (ret) goto out; goto again; -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/3] btrfs-progs: map-logical: look at next leaf if slot > items
On 06/07/2018 03:19 PM, james harvey wrote: btrfs-map-logical -l 10955980800 -b 4096 No extent found at range [10955980800,10955984896) But, this extent exists. btrfs-debug-tree shows: item 202 key (10955976704 EXTENT_ITEM 4096) itemoff 8772 itemsize 37 refs 1 gen 62656 flags DATA shared data backref parent 142655488 count 1 ... item 0 key (10955980800 EXTENT_ITEM 4096) itemoff 16246 itemsize 37 refs 1 gen 62656 flags DATA shared data backref parent 128958464 count 1 The code searches for (, 0, 0), and looks forward then backward. It needs to go to the next leaf when slot > number of items on this leaf. Signed-off-by: James Harvey --- btrfs-map-logical.c | 8 1 file changed, 8 insertions(+) diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c index 7a8bcff9..2451012b 100644 --- a/btrfs-map-logical.c +++ b/btrfs-map-logical.c @@ -65,6 +65,14 @@ static int map_one_extent(struct btrfs_fs_info *fs_info, BUG_ON(ret == 0); ret = 0; + if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) { + ret = btrfs_next_leaf(fs_info->extent_root, path); It seems that you forgot to handle the case ret < 0. Thanks, Su + if (ret > 0) { + ret = -ENOENT; + goto out; + } + } + again: btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); if ((search_foward && key.objectid < logical) || -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Exporting a unique ino/dev pair to user space
On Thu, Jun 7, 2018 at 12:38 AM, Mark Fasheh wrote: > Hi, > > We have an inconsistency in how the kernel is exporting inode number / > device pairs for user space. There's of course stat(2) and statx(2), > but aside from those we simply dump inode->i_ino and super->s_dev. In > some cases, the dumped values differ from what is returned via stat(2) > or statx(2). Some filesystems might even show duplicate (but > internally different!) pairs when the raw i_ino/s_dev is used. > > Some examples where we dump raw ino/dev: > > - /proc//maps. I've written about how this confuses lsof(8): > https://marc.info/?l=linux-btrfs&m=130074451403261&w=2 > > - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See > trace/events/lock.h or trace/events/writeback.h for examples. > > - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo() > > - Audit records the raw ino/dev and passes them around. We do seem to > have paths printed from audit as well, but if it's printed with the > wrong ino/dev pair I believe my point still stands. > knfsd also looks at i_ino. I converted one or two of these places to getattr callbacks recently, but there are still some more. Anyway, w.r.t. Overlayfs, I believe Miklos' work to make file_inode() an overlay inode should resolve several of the issues listed above - probably not all, but didn't check every tracepoint.. > > This breaks software which expects these pairs to be unique, and can > put the user in a situation where they might not be able to find an > inode referenced from the kernel. What's even worse - depending on how > ino is exported, they might even find the *wrong* inode. > > I also should point out that we're likely at this point because > stat(2) has been using an unsigned long for ino. On a 32 bit system, > it would have been impossible for the user to get the real inode > number in the first place. So there probably wasn't much we could do. > > These days though, we have statx(2) which will do the right thing on > all platforms so we no longer have that excuse. The user ought to be > able to take an ino/dev pair and ultimately find the actual file on > their system, partially with the help of statx(2). > > > Some examples of how ino is manipulated by filesystems: > > - On 64 bit i_ino and kstat->ino tend to be filled in correctly (from > what I can tell). stat->dev is not always consistent with super->s_dev. > > - On 32 bits, many filesystems employ a transformation to squeeze a 64 > bit identifier into 32 bits. The exact methods are fs specific, > what's important is that we're losing information, introducing the > possibility of duplicate inode numbers. > > - On all platforms, Btrfs and Overlayfs can have duplicate inode > numbers. Of course, device can be different across the fs as well > with the kernel reporting s_dev and these filesystems returning > a different device via stat() or statx(). > > > Getting the inode number portion of this pair fixed would immediately > solve the situation for all filesystems except Btrfs and > Overlayfs - they report a different device from stat. > > Regarding the device portion of the pair, I'm honestly not sure > whether Overlayfs cares, and my attempts to fix the s_dev situation > for Btrfs have all come to the same dead ends that I've hit briefly > looking into this inode number issue - the problems are intrinsically > linked. > > So my questions are: > > 1) Do we care about this? On one hand, we've had this inconsistency >for a long time, for various reasons. On the other hand, I can point >to bugzilla's where these inconsistencies have become a problem. > >In the case that we don't care, any fs-internal solutions are >likely to be extremely disruptive to the end user. > > > 2) If we do care, how do we fix this? > > 2a) Do we use 64 bit i_ino on 32 bit systems? This has some obvious > downsides from a memory usage standpoint. Also it doesn't fully > address the issue - we still have a device field that Btrfs and > Overlayfs override. > > We could combine this with an intermediate structure between the > inode and super block so s_dev could be abstracted out. See my > fs_view patches for an example of how this could be done: > https://www.spinics.net/lists/linux-fsdevel/msg125492.html > > 2b) Do we call ->getattr for this information? Plumbing a vfsmount to > various regions of the kernel like audit, or some of the deeper > tracepoints looks ugly and prone to life-timing issues (which > vfsmount do we use?). On the upside, we could probably make it > really light by only sending down the STATX_INO flag and letting > the filesystem optimize accordingly. > > 2c) I don't think we can really use a dedicated callback without > passing the vfsmount through since Overlayfs ->getattr might call > the lower fs ->getattr. At that point we might as well use getattr. > Didn't get the Overlayfs lower fs getattr argume
[PATCH v2 2/3] btrfs-progs: map-logical: Use btrfs_next_extent_item()
btrfs_next_extent_item() looks for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_KEY, which are the types we're looking for. Signed-off-by: James Harvey --- btrfs-map-logical.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c index 2451012b..8a41b037 100644 --- a/btrfs-map-logical.c +++ b/btrfs-map-logical.c @@ -83,7 +83,8 @@ again: ret = btrfs_previous_extent_item(fs_info->extent_root, path, 0); else - ret = btrfs_next_item(fs_info->extent_root, path); + ret = btrfs_next_extent_item(fs_info->extent_root, +path); if (ret) goto out; goto again; -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/3] Fix misspelling of forward
Signed-off-by: James Harvey --- btrfs-map-logical.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c index 8a41b037..59ba731b 100644 --- a/btrfs-map-logical.c +++ b/btrfs-map-logical.c @@ -39,7 +39,7 @@ static FILE *info_file; static int map_one_extent(struct btrfs_fs_info *fs_info, - u64 *logical_ret, u64 *len_ret, int search_foward) + u64 *logical_ret, u64 *len_ret, int search_forward) { struct btrfs_path *path; struct btrfs_key key; @@ -75,11 +75,11 @@ static int map_one_extent(struct btrfs_fs_info *fs_info, again: btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); - if ((search_foward && key.objectid < logical) || - (!search_foward && key.objectid > logical) || + if ((search_forward && key.objectid < logical) || + (!search_forward && key.objectid > logical) || (key.type != BTRFS_EXTENT_ITEM_KEY && key.type != BTRFS_METADATA_ITEM_KEY)) { - if (!search_foward) + if (!search_forward) ret = btrfs_previous_extent_item(fs_info->extent_root, path, 0); else -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/3] btrfs-progs: map-logical: look at next leaf if slot > items
btrfs-map-logical -l 10955980800 -b 4096 No extent found at range [10955980800,10955984896) But, this extent exists. btrfs-debug-tree shows: item 202 key (10955976704 EXTENT_ITEM 4096) itemoff 8772 itemsize 37 refs 1 gen 62656 flags DATA shared data backref parent 142655488 count 1 ... item 0 key (10955980800 EXTENT_ITEM 4096) itemoff 16246 itemsize 37 refs 1 gen 62656 flags DATA shared data backref parent 128958464 count 1 The code searches for (, 0, 0), and looks forward then backward. It needs to go to the next leaf when slot > number of items on this leaf. Signed-off-by: James Harvey --- btrfs-map-logical.c | 8 1 file changed, 8 insertions(+) diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c index 7a8bcff9..2451012b 100644 --- a/btrfs-map-logical.c +++ b/btrfs-map-logical.c @@ -65,6 +65,14 @@ static int map_one_extent(struct btrfs_fs_info *fs_info, BUG_ON(ret == 0); ret = 0; + if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) { + ret = btrfs_next_leaf(fs_info->extent_root, path); + if (ret > 0) { + ret = -ENOENT; + goto out; + } + } + again: btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); if ((search_foward && key.objectid < logical) || -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] btrfs-progs: map-logical: look at next leaf if slot > items
On Thu, Jun 7, 2018 at 12:20 AM, Su Yue wrote: > On 06/07/2018 11:33 AM, james harvey wrote: >> Using btrfs-progs v4.16: >> >> No extent found at range [10955980800,10955984896) >> >> But, this extent exists. btrfs-debug-tree shows: > Make sense. IMP the commit message is too long to read. > Code wise is almost fine. Some nitpicks are below. Thanks for response. I'll try to work on that. >> + if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) { >> + ret = btrfs_next_leaf(fs_info->extent_root, path); >> + if (ret != 0) { >> + ret = -1; > > btrfs_next_leaf() may return -EIO, so keep the negative return code is > prefered. > btrfs_next_leaf may return > 0, here I'd like to set ret=-ENOENT. > You can refer callers of btrfs_next_leaf() how to handle the return codes. Fixed in v2. >> else >> - ret = btrfs_next_item(fs_info->extent_root, path); >> + ret = > Nit: > Misclick here? Missed semicolon being 81st character, fixed in v2. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html