Re: Exporting a unique ino/dev pair to user space

2018-06-07 Thread Amir Goldstein
On Fri, Jun 8, 2018 at 12:06 AM, Mark Fasheh  wrote:
[...]

>> >  2c) I don't think we can really use a dedicated callback without
>> >  passing the vfsmount through since Overlayfs ->getattr might call
>> >  the lower fs ->getattr. At that point we might as well use getattr.
>> >
>>
>> Didn't get the Overlayfs lower fs getattr argument.
>> Overlayfs doesn't use the vfsmount passed into getattr
>> and it could very well pass a dentry to lower fs getattr.
>
> My main point in 2c) is that, from my understanding, Overlayfs may need to
> call down to one of the filesystem ->getattr() calls. Since those take a
> vfsmount we don't gain anything by having a unique callback from this - the
> plumbing work would be the same.
>

I guess I don't understand what you mean by "dedicated callback", but I
think we both in agreement that changing fs getattr() to take dentry is
preferred.

>
>> As a matter of fact, out of 35 getattr implementations in the kernel:
>> (git grep "\s\.getattr\s" fs| awk '{print $4}'| sort -u|grep -v
>> "nfs.*_proc_getattr"|wc -l)
>> there is only one using the vfsmount - nfs_getattr() for MNT_NOATIME
>> check and most of them only ever use d_inode(path->dentry).
>>
>> This API seems quite odd.
>> Maybe it should be fixed so more in kernel call sites could call getattr
>> without a vfsmount.
>> Not sure what would be the best way to handle nfs_getattr().
>
> Yeah I saw that nfs_getattr() is the only user of the vfsmount. I totally
> agree that this would be tons easier if we didn't have to pass the vfsmount
> (and like you said, there's only the ONE user).
>
> This is a bit hacky, but I wonder if we could blow the function signature
> back out to a dentry + vfsmount and make the vfsmount optional when getattr
> is called only for ino/dev. It's a bit ugly to have optional arguments like
> that but nfs would work with just a line or two change and the other fs
> would never even care.

OR.. change the signature of fs getattr() and vfs_getattr_nosec() to dentry
and set a kernel query_flag AT_STATX_NO_REMOTE_ATIME from
vfs_getattr(). No need to pass vfsmount to nfs.

Then users that access i_ino/s_dev can call vfs_getattr_nosec() with a
bonus of not having to go through security modules (which now they don't).

The only current user of vfs_getattr_nosec() expfs.c queries STATX_INO,
so change is safe.

Overlayfs doesn't need to get vfsmount in ovl_getattr() - it knows the
lower fs vfsmount for calling vfs_getattr() regardless and in other
places overlayfs just needs to query  STATX_INO, so it may also
use the light vfs_getattr_nosec() callback.

Thanks,
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] fstests: btrfs: Test if btrfs will corrupt nodatasum compressed extent when replacing device

2018-06-07 Thread Qu Wenruo
This is a long existing bug (from 2012) but exposed by a reporter
recently, that when compressed extent without data csum get written to
device-replace target device, the written data is in fact uncompressed data
other than the original compressed data.

And since btrfs still consider the data is compressed and will try to read it
as compressed, it can cause read error.

The root cause is located, and one RFC patch already sent to fix it,
titled "[PATCH] btrfs: scrub: Don't use inode pages for device replace".
(The RFC is only for the extra possible way to fix the bug, the fix
itself should work without problem)

Reported-by: James Harvey 
Signed-off-by: Qu Wenruo 
---
changelog:
v2:
  Now the fix patch is no longer RFC.
  Remove _require_test as we don't really touch it.
  Add comment on the mount cycle.
  Add the test to group 'volume'.
---
 tests/btrfs/161 | 91 +
 tests/btrfs/161.out |  2 +
 tests/btrfs/group   |  1 +
 3 files changed, 94 insertions(+)
 create mode 100755 tests/btrfs/161
 create mode 100644 tests/btrfs/161.out

diff --git a/tests/btrfs/161 b/tests/btrfs/161
new file mode 100755
index ..ce1b0e04
--- /dev/null
+++ b/tests/btrfs/161
@@ -0,0 +1,91 @@
+#! /bin/bash
+# FS QA Test 161
+#
+# Test if btrfs will corrupt compressed data extent without data csum
+# by replacing it with uncompressed data, when doing replacing device.
+#
+# This could be fixed by the following RFC patch:
+# "[PATCH] btrfs: scrub: Don't use inode pages for device replace"
+#
+#---
+# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+_require_scratch_dev_pool_equal_size
+
+
+_scratch_dev_pool_get 1
+_spare_dev_get
+_scratch_pool_mkfs >> $seqres.full 2>&1
+
+# Create nodatasum inode
+_scratch_mount "-o nodatasum"
+touch $SCRATCH_MNT/nodatasum_file
+_scratch_remount "datasum,compress"
+_pwrite_byte 0xcd 0 128K $SCRATCH_MNT/nodatasum_file > /dev/null
+
+# Write the compressed data back to disk
+sync
+
+# Replace the device
+_run_btrfs_util_prog replace start -Bf 1 $SPARE_DEV $SCRATCH_MNT
+
+# Unmount to drop all cache so next read will read from disk
+_scratch_unmount
+_mount $SPARE_DEV $SCRATCH_MNT
+
+# Now the EXTENT_DATA item still marks the extent as compressed,
+# but the on-disk data is uncompressed, thus reading it as compressed
+# will definitely cause EIO.
+cat $SCRATCH_MNT/nodatasum_file > /dev/null
+
+_scratch_unmount
+_spare_dev_put
+_scratch_dev_pool_put
+
+echo "Silence is golden"
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/161.out b/tests/btrfs/161.out
new file mode 100644
index ..1752a243
--- /dev/null
+++ b/tests/btrfs/161.out
@@ -0,0 +1,2 @@
+QA output created by 161
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index f04ee8d5..9195b368 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -163,3 +163,4 @@
 158 auto quick raid scrub
 159 auto quick
 160 auto quick
+161 auto quick replace volume
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] btrfs-progs: Fixes inline ram_bytes related bugs

2018-06-07 Thread Qu Wenruo


On 2018年06月08日 12:37, Steve Leung wrote:
> On 06/06/2018 01:27 AM, Qu Wenruo wrote:
>> The patchset can be fetched from github (*):
>> https://github.com/adam900710/btrfs-progs/tree/inline_ram_bytes
>>
>> It's based on David's devel branch, whose HEAD is:
>> commit 0d1c5812e28e286648781c7b35b542311cc01aa4 (david/devel)
>> Author: Matthias Benkard 
>> Date:   Wed Apr 25 16:34:54 2018 +0200
>>
>>  btrfs-progs: mkfs: traverse_directory: Reset error code on continue
>>
>> Reported-by Steve Leung , his old btrfs (at least
>> offending inodes are from 2014) has inline uncompressed extent, while
>> its ram_bytes mismatch with item size.
> 
> Took a while to run, but:
> 
> Tested-by: Steve Leung 
> 
> Verifying everything against backups now, but things look good so far.
> 
> I'm not 100% sure how to interpret the "btrfs check" output, but it
> sounded like it was going over each subvolume.  And since the corruption
> was referenced by many subvolumes, the same work essentially had to be
> duplicated many times.

Yes, that's the case.

That would be a little like the way we do in kernel balance, thus it's
not implemented in btrfs-progs, so it would take a long time before we
have similar facility to fix it properly.

Or, I could fix it in another way, break the CoW behavior and do
in-place fix.
But if anyone wants to interrupt the fix, it may cause disaster.

> 
> Is that a correct assessment?  Would it have saved time to remove a
> bunch of snapshots first before doing "btrfs check"?

Yep, it would save a lot of time.

Thanks,
Qu

> 
> Either way, many thanks for getting everything going again!
> 
> Steve
> 
> 
>> Latest kernel tree check catches this bug, while we failed to detect by
>> dump-tree.
>>
>> It turns out that btrfs-progs is doing something evil to avoid reading
>> ram_bytes from inline uncompressed extent.
>>
>>
>> So this patchset will address all such ram_bytes related problems.
>>
>> The 1st patch is a not-so-relative fix for restore, which is using
>> ram_bytes for decompress. Although thanks to the compression header, we
>> won't read out-of-boundary, but fixing it is never a bad thing.
>>
>> The 2nd patch will get rid of the evil btrfs_file_extent_inline_len()
>> which hides raw ram_bytes from us, and fooling us for a long long time.
>>
>> The 3rd~5th patches introduce check/repair function for both original
>> and lowmem mode (although lowmem mode can detect it even before this
>> patch).
>>
>> The last one is the test case for it as usual.
>>
>> *: Or should I just migrate to gitlab after M$ acquired github?
>>
>> Qu Wenruo (6):
>>    btrfs-progs: restore: Fix wrong compressed item size for decompress()
>>    btrfs-progs: Get rid of the confusing btrfs_file_extent_inline_len()
>>    btrfs-progs: check/original: Detect and repair wrong inline ram_bytes
>>    btrfs-progs: check/lowmem: Prepare check_file_extent() to handle
>>  repair
>>    btrfs-progs: check/lowmem: Repair wrong inlien ram_bytes for
>>  uncompressed extent
>>    btrfs-progs: fsck-tests: Add test case for corrupted inline ram_bytes
>>
>>   check/main.c  |  46 ++-
>>   check/mode-lowmem.c   | 120 ++
>>   check/mode-original.h |   1 +
>>   cmds-restore.c    |   5 +-
>>   ctree.h   |  22 
>>   file.c    |   3 +-
>>   print-tree.c  |   4 +-
>>   .../offset_by_one.img | Bin 0 -> 3072 bytes
>>   .../035-inline-bad-ram-bytes/test.sh  |  11 ++
>>   9 files changed, 157 insertions(+), 55 deletions(-)
>>   create mode 100644
>> tests/fsck-tests/035-inline-bad-ram-bytes/offset_by_one.img
>>   create mode 100755 tests/fsck-tests/035-inline-bad-ram-bytes/test.sh
>>
> 
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 0/6] btrfs-progs: Fixes inline ram_bytes related bugs

2018-06-07 Thread Steve Leung

On 06/06/2018 01:27 AM, Qu Wenruo wrote:

The patchset can be fetched from github (*):
https://github.com/adam900710/btrfs-progs/tree/inline_ram_bytes

It's based on David's devel branch, whose HEAD is:
commit 0d1c5812e28e286648781c7b35b542311cc01aa4 (david/devel)
Author: Matthias Benkard 
Date:   Wed Apr 25 16:34:54 2018 +0200

 btrfs-progs: mkfs: traverse_directory: Reset error code on continue

Reported-by Steve Leung , his old btrfs (at least
offending inodes are from 2014) has inline uncompressed extent, while
its ram_bytes mismatch with item size.


Took a while to run, but:

Tested-by: Steve Leung 

Verifying everything against backups now, but things look good so far.

I'm not 100% sure how to interpret the "btrfs check" output, but it 
sounded like it was going over each subvolume.  And since the corruption 
was referenced by many subvolumes, the same work essentially had to be 
duplicated many times.


Is that a correct assessment?  Would it have saved time to remove a 
bunch of snapshots first before doing "btrfs check"?


Either way, many thanks for getting everything going again!

Steve



Latest kernel tree check catches this bug, while we failed to detect by
dump-tree.

It turns out that btrfs-progs is doing something evil to avoid reading
ram_bytes from inline uncompressed extent.


So this patchset will address all such ram_bytes related problems.

The 1st patch is a not-so-relative fix for restore, which is using
ram_bytes for decompress. Although thanks to the compression header, we
won't read out-of-boundary, but fixing it is never a bad thing.

The 2nd patch will get rid of the evil btrfs_file_extent_inline_len()
which hides raw ram_bytes from us, and fooling us for a long long time.

The 3rd~5th patches introduce check/repair function for both original
and lowmem mode (although lowmem mode can detect it even before this patch).

The last one is the test case for it as usual.

*: Or should I just migrate to gitlab after M$ acquired github?

Qu Wenruo (6):
   btrfs-progs: restore: Fix wrong compressed item size for decompress()
   btrfs-progs: Get rid of the confusing btrfs_file_extent_inline_len()
   btrfs-progs: check/original: Detect and repair wrong inline ram_bytes
   btrfs-progs: check/lowmem: Prepare check_file_extent() to handle
 repair
   btrfs-progs: check/lowmem: Repair wrong inlien ram_bytes for
 uncompressed extent
   btrfs-progs: fsck-tests: Add test case for corrupted inline ram_bytes

  check/main.c  |  46 ++-
  check/mode-lowmem.c   | 120 ++
  check/mode-original.h |   1 +
  cmds-restore.c|   5 +-
  ctree.h   |  22 
  file.c|   3 +-
  print-tree.c  |   4 +-
  .../offset_by_one.img | Bin 0 -> 3072 bytes
  .../035-inline-bad-ram-bytes/test.sh  |  11 ++
  9 files changed, 157 insertions(+), 55 deletions(-)
  create mode 100644 tests/fsck-tests/035-inline-bad-ram-bytes/offset_by_one.img
  create mode 100755 tests/fsck-tests/035-inline-bad-ram-bytes/test.sh



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: in which directions does btrfs send -p | btrfs receive work

2018-06-07 Thread Andrei Borzenkov
07.06.2018 05:50, Christoph Anton Mitterer пишет:
> Hey.
> 
> Just wondered about the following:
> 
> When I have a btrfs which acts as a master and from which I make copies
>  of snapshots on it via send/receive (with using -p at send) to other
> btrfs which acts as copies like this:
> master +--> copy1
>+--> copy2
>\--> copy3
> and if now e.g. the device of master breaks, can I move *with
> incremential send -p / receive backups from one of the copies?
> 
> Which of the following two would work (or both?):
> 
> A) Redesignating a copy to be a new master, e.g.:
>old-copy1/new-master +--> new-disk/new-copy1
>

What is old-copy1? You never told how it was created.


>+--> copy2
> \--> copy3
>Obviously at least
> send/receiving to new-copy1 shoud work,
>but would that work as well
> to copy2/copy3 (with -p), since they're
>based on (and probably using
> UUIDs) from the snapshot on the old
>broken master?
> 
> B) Let a new device be the master and move on from that (kinda creating
>a "send/receive cycle":
>1st:
>copy1 +--> new-disk/new-master
> 
>from then on (when new snapshots should be incrementally sent):
>new-master +--> copy1
>   +--> copy2
>   \--> copy3

It is again not clear - you want to overwrite exiting copy1, copy2,
copy3? Do these copyN have any relation to copyN in the beginning of
your message? Or just want to start new incremental replication from
scratch?

>Again, not sure whether send/receiving to copy2/3 would work, since
>they're based on snapshots/parents from the old broken master.
>And I'm even more unsure, whether this back&forth send/receiving,
>from copy1->new-master->copy1 would work.
> 
> 
> Any expert having some definite idea? :-)
> 
> Thanks,
> Chris.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bad superblock when mounting rw, ro mount works

2018-06-07 Thread Qu Wenruo


On 2018年06月08日 01:07, Daniel Underwood wrote:
> Hi,
> 
> I have a raid10-like setup that is failing to mount in rw mode with the error
> 
> 
> mount: /mnt/media: wrong fs type, bad option, bad superblock on
> /dev/mapper/em1, missing codepage or helper program, or other error
> 
> read-only mounts seem to work and the files seem to be there.
> 
> I started having issues after a system crash during the process of
> deleting a number of large files. After this (Ubuntu 16.04/Kernel
> 4.4), any attempt to mount the array in rw mode would cause a similar
> crash. I did an upgrade to Ubuntu 18.04/Kernel 4.15 and now get the
> error above.
> 
> I have looked through a variety of posts on the mailing list, but
> couldn't find anything with the same issue. I have done a scrub on the
> array that resulted in 6 verify errors with dmesg showing something
> about extent trees. It didn't list them as uncorrectable errors, but
> couldn't correct them either as I can't mount in rw. I also tried
> `btrfs rescue zero-log /dev/mapper/em1`, which changes the above to
> (say) em5, but then zero-log on em5 causes it to go back to em1.

The problem is, your extent tree is corrupted.

And normally, it indicates your whole filesystem has more or less extra
errors.

In this case, btrfs check could help you to know how badly your fs is
corrupted.
(don't run --repair).

Optionally, btrfs check --mode=lowmem is pretty a good try, as it would
provide better error message.
(And help us enhance btrfs-progs)

> 
> Any direction would be appreciated. From what I could tell, my next
> steps would be a check with --repair or --init-extent-tree, though I'm
> reluctant to try those without being explicitly told to do so.

Never run this two, unless you're 100% sure about:
1) What's the problem
2) What --repair can fix

Personally speaking, and also stated in man page, unless you're a
btrfs-progs developer or has enough experience, don't do any write
operation with btrfs check.

Thanks,
Qu

> 
> I have attached a dmesg.log file and hope I haven't greped out
> anything important.
> 
> Thanks,
> Daniel
> 



signature.asc
Description: OpenPGP digital signature


Re: RAID-1 refuses to balance large drive

2018-06-07 Thread Zygo Blaxell
On Sat, May 26, 2018 at 06:27:57PM -0700, Brad Templeton wrote:
> A few years ago, I encountered an issue (halfway between a bug and a
> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
> fairly full.   The problem was that after replacing (by add/delete) a
> small drive with a larger one, there were now 2 full drives and one
> new half-full one, and balance was not able to correct this situation
> to produce the desired result, which is 3 drives, each with a roughly
> even amount of free space.  It can't do it because the 2 smaller
> drives are full, and it doesn't realize it could just move one of the
> copies of a block off the smaller drive onto the larger drive to free
> space on the smaller drive, it wants to move them both, and there is
> nowhere to put them both.
> 
> I'm about to do it again, taking my nearly full array which is 4TB,
> 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
> repeat the very time consuming situation, so I wanted to find out if
> things were fixed now.   I am running Xenial (kernel 4.4.0) and could
> consider the upgrade to  bionic (4.15) though that adds a lot more to
> my plate before a long trip and I would prefer to avoid if I can.
> 
> So what is the best strategy:
> 
> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" 
> strategy)
> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
> from 4TB but possibly not enough)
> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
> recently vacated 6TB -- much longer procedure but possibly better

d) Run "btrfs balance start -dlimit=3 /fs" to make some unallocated
space on all drives *before* adding disks.  Then replace, resize up,
and balance until unallocated space on all disks are equal.  There is
no need to continue balancing after that, so once that point is reached
you can cancel the balance.

A number of bad things can happen when unallocated space goes to zero,
and being unable to expand a raid1 array is only one of them.  Avoid that
situation even when not resizing the array, because some cases can be
very difficult to get out of.

Assuming your disk is not filled to the last gigabyte, you'll be able
to keep at least 1GB unallocated on every disk at all times.  Monitor
the amount of unallocated space and balance a few data block groups
(e.g. -dlimit=3) whenever unallocated space gets low.

A potential btrfs enhancement area:  allow the 'devid' parameter of
balance to specify two disks to balance block groups that contain chunks
on both disks.  We want to balance only those block groups that consist of
one chunk on each smaller drive.  This redistributes those block groups
to have one chunk on the large disk and one chunk on one of the smaller
disks, freeing space on the other small disk for the next block group.
Block groups that consist of a chunk on the big disk and one of the
small disks are already in the desired configuration, so rebalancing
them is just a waste of time.  Currently it's only possible to do this
by writing a script to select individual block groups with python-btrfs
or similar--much faster than plain btrfs balance for this case, but more
involved to set up.

> Or has this all been fixed and method A will work fine and get to the
> ideal goal -- 3 drives, with available space suitably distributed to
> allow full utilization over time?
> 
> On Sat, May 26, 2018 at 6:24 PM, Brad Templeton  wrote:
> > A few years ago, I encountered an issue (halfway between a bug and a
> > problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
> > full.   The problem was that after replacing (by add/delete) a small drive
> > with a larger one, there were now 2 full drives and one new half-full one,
> > and balance was not able to correct this situation to produce the desired
> > result, which is 3 drives, each with a roughly even amount of free space.
> > It can't do it because the 2 smaller drives are full, and it doesn't realize
> > it could just move one of the copies of a block off the smaller drive onto
> > the larger drive to free space on the smaller drive, it wants to move them
> > both, and there is nowhere to put them both.
> >
> > I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB
> > and replacing one of the 4TB with an 8TB.  I don't want to repeat the very
> > time consuming situation, so I wanted to find out if things were fixed now.
> > I am running Xenial (kernel 4.4.0) and could consider the upgrade to  bionic
> > (4.15) though that adds a lot more to my plate before a long trip and I
> > would prefer to avoid if I can.
> >
> > So what is the best strategy:
> >
> > a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic"
> > strategy)
> > b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks from
> > 4TB but possibly not enough)
> > c) Replace 6TB with 8TB, resize/balance, then replace 4TB with recently
> > vacated 6TB -- much longer procedure

Re: [PATCH v3 (Only) 2/3] btrfs-progs: map-logical: Use btrfs_next_extent_item()

2018-06-07 Thread Su Yue




On 06/08/2018 02:41 AM, james harvey wrote:

On Thu, Jun 7, 2018 at 4:50 AM, Su Yue  wrote:

On 06/07/2018 03:20 PM, james harvey wrote:


btrfs_next_extent_item() looks for BTRFS_EXTENT_ITEM_KEY and
BTRFS_METADATA_KEY,
which are the types we're looking for.

Signed-off-by: James Harvey 


Reviewed-by: Su Yue 


btrfs_next_extent_item() looks for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_KEY,
which are the types we're looking for.

(Previous versions missed last 0 argument to btrfs_next_extent_item().  Sorry, I
realize I need to take more time before sending, being more careful on
everything.)



So do I.

If you want to write something which is unreleated to
changes in a patch. Just put them and changelog belows "---" under
SOB line.
It would be easier for maintainers to apply patches.

And, I noticed patches sent by you are separated in ML. It's OK.
Puting series into one thread is better for review. You can
use tools like 'git-send-email' to send patches.
The last nag: a coverletter is nice with patches.
Link of conversation:
https://www.spinics.net/lists/linux-btrfs/msg77791.html


Signed-off-by: James Harvey 
---
  btrfs-map-logical.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 2451012b..8a41b037 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -83,7 +83,8 @@ again:
 ret = btrfs_previous_extent_item(fs_info->extent_root,
  path, 0);
 else
-   ret = btrfs_next_item(fs_info->extent_root, path);
+   ret = btrfs_next_extent_item(fs_info->extent_root,
+path, 0);

After a careful look, if btrfs_next/previous_extent_item is used here.
It's no need to check

 if ((search_foward && key.objectid < logical) || 

(!search_foward && key.objectid > logical) || 

(key.type != BTRFS_EXTENT_ITEM_KEY && 


 key.type != BTRFS_METADATA_ITEM_KEY)) {
===
Pass @logical as the last parameters of
btrfs_next/previous_extent_item() is simple enough.

Thanks,
Su


 if (ret)
 goto out;
 goto again;
--
2.17.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] btrfs-progs: map-logical: look at next leaf if slot > items

2018-06-07 Thread Qu Wenruo


On 2018年06月08日 03:57, james harvey wrote:
> "On Thu, Jun 7, 2018 at 3:26 AM, Qu Wenruo  wrote:
>> On 2018年06月07日 15:19, james harvey wrote:
>>> On Thu, Jun 7, 2018 at 12:44 AM, Qu Wenruo  wrote:
 On 2018年06月07日 11:33, james harvey wrote:
> * btrfs_item_key_to_cpu sets key to (10955960320, BTRFS_EXTENT_ITEM_KEY, 
> 4096)
> !!! Bonus bug, LEFT FOR READER.  Why is this item #197, 5 items before 
> the 203
> given?  I think no bounds checking causes a buffer over-read here.
> btrfs_item_key_to_cpu() calls btrfs_item_key(), which uses the macro
> read_eb_member() to call read_extent_buffer() which memcpy's using an 
> out
> of range index, at least for this leaf.

 Here the key you got could be garbage.
 Either the leaf has enough free space, then you got pending zero into
 the key, or the leaf is full, you got some item data into the key.

 Either way, the key should be read/trusted at all.
>>>
>>> Is the group's preference to leave things like bounds checking to the
>>> caller?
>>
>> The problem here is, btrfs_search_slot() is not only called for
>> read_only search, but also for insert.
>>
>> For insert usage, such returned path is completely valid and in fact
>> could be pretty useful.
> 
> I don't mean for btrfs_search_slot() behavior to change, since it is
> called for inserting.  I mean btrfs_item_key_to_cpu() behavior could
> change.

Makes sense!

An nice point to enhance!

> 
>>>  (I get the merit to that, avoiding redundant checks.)  My
>>> normal style would be to have read_extent_buffer() fail if start + len
>>> is out of bounds.
>>
>> In this case, it's not out-of-boundary at all, and the invalid key you
>> read out if from the last item, which could contain a valid key.
>>
>> For leaf, btrfs puts btrfs items pointers at the head, and the items
>> data at the tail.
>>
>> Thus for this case, your invalid key slot is just in the middle of the
>> leaf, making read_extent_buffer() unable to detect anything wrong.
>>
>> Thanks,
>> Qu
> 
> You're right.  read_extent_buffer() doesn't have the information to
> look at this.
> 
> And, "buffer over-read error" was probably the wrong term.  I mean in
> the sense that it has 203 valid values, and is being asked for the
> 204'th item.
> 
> btrfs_item_key() could error on
> 
> (nr > btrfs_header_nritems(eb))
> 
> This defensive programming would cause redundant checks, if the caller
> checks this too as they are supposed to, so I understand if the group
> doesn't swing that far on defensive programming.
> 
> Although, perhaps in such situation, btrfs_item_key() could
> automatically call btrfs_next_leaf().

This introduce extra return value, and for things like btrfs_item_key(),
we need to modify tons of call sites.

But at least, I could check if doing extra check (mostly a WARN_ON()) in
btrfs_item_key() is valid.

Thanks,
Qu

>  I'm thinking this might not be
> desired behavior everywhere, but just in case it is, mentioning it
> because then btrfs_item_key() could handle all of this, not allowing
> the caller to shoot their own foot, without redundant checks.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH] btrfs-progs: check: Initialize all filed of btrfs_inode_item in insert_inode_item()

2018-06-07 Thread Misono Tomohiro
On 2018/06/07 21:22, David Sterba wrote:
> On Thu, Jun 07, 2018 at 11:49:58AM +0900, Misono Tomohiro wrote:
>> Initialize all filed of btrfs_inode_item to zero in order to prevent
>> having some garbage, especially for flags field.
> 
> Have you observed in practice or is it a matter of precaution?

I saw failure of fsck-test/010 in yesterday's devel branch and
made this patch. It turned out that root cause was wrong flag comparison
in btrfs check.
(https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg77758.html)

With Su's fix, failure of fsck-test/010 is also gone without this patch,
but it is better to initialize the variables anyway.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Exporting a unique ino/dev pair to user space

2018-06-07 Thread Mark Fasheh
On Thu, Jun 07, 2018 at 10:38:51AM +0300, Amir Goldstein wrote:
> On Thu, Jun 7, 2018 at 12:38 AM, Mark Fasheh  wrote:
> > Hi,
> >
> > We have an inconsistency in how the kernel is exporting inode number /
> > device pairs for user space. There's of course stat(2) and statx(2),
> > but aside from those we simply dump inode->i_ino and super->s_dev. In
> > some cases, the dumped values differ from what is returned via stat(2)
> > or statx(2). Some filesystems might even show duplicate (but
> > internally different!) pairs when the raw i_ino/s_dev is used.
> >
> > Some examples where we dump raw ino/dev:
> >
> > - /proc//maps. I've written about how this confuses lsof(8):
> >   https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
> >
> > - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
> >   trace/events/lock.h or trace/events/writeback.h for examples.
> >
> > - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()
> >
> > - Audit records the raw ino/dev and passes them around. We do seem to
> >   have paths printed from audit as well, but if it's printed with the
> >   wrong ino/dev pair I believe my point still stands.
> >
> 
> knfsd also looks at i_ino. I converted one or two of these places
> to getattr callbacks recently, but there are still some more.
> 
> Anyway, w.r.t. Overlayfs, I believe Miklos' work to make file_inode()
> an overlay inode should resolve several of the issues listed above -
> probably not all, but didn't check every tracepoint..

Ok, thanks for letting me know about knfsd and the overlayfs work. I haven't
had a chance to check out the overlayfs work yet but I will shortly.


> > This breaks software which expects these pairs to be unique, and can
> > put the user in a situation where they might not be able to find an
> > inode referenced from the kernel. What's even worse - depending on how
> > ino is exported, they might even find the *wrong* inode.
> >
> > I also should point out that we're likely at this point because
> > stat(2) has been using an unsigned long for ino. On a 32 bit system,
> > it would have been impossible for the user to get the real inode
> > number in the first place. So there probably wasn't much we could do.
> >
> > These days though, we have statx(2) which will do the right thing on
> > all platforms so we no longer have that excuse. The user ought to be
> > able to take an ino/dev pair and ultimately find the actual file on
> > their system, partially with the help of statx(2).
> >
> >
> > Some examples of how ino is manipulated by filesystems:
> >
> > - On 64 bit i_ino and kstat->ino tend to be filled in correctly (from
> >   what I can tell). stat->dev is not always consistent with super->s_dev.
> >
> > - On 32 bits, many filesystems employ a transformation to squeeze a 64
> >   bit identifier into 32 bits. The exact methods are fs specific,
> >   what's important is that we're losing information, introducing the
> >   possibility of duplicate inode numbers.
> >
> > - On all platforms, Btrfs and Overlayfs can have duplicate inode
> >   numbers. Of course, device can be different across the fs as well
> >   with the kernel reporting s_dev and these filesystems returning
> >   a different device via stat() or statx().
> >
> >
> > Getting the inode number portion of this pair fixed would immediately
> > solve the situation for all filesystems except Btrfs and
> > Overlayfs - they report a different device from stat.
> >
> > Regarding the device portion of the pair, I'm honestly not sure
> > whether Overlayfs cares, and my attempts to fix the s_dev situation
> > for Btrfs have all come to the same dead ends that I've hit briefly
> > looking into this inode number issue - the problems are intrinsically
> > linked.
> >
> > So my questions are:
> >
> > 1) Do we care about this? On one hand, we've had this inconsistency
> >for a long time, for various reasons. On the other hand, I can point
> >to bugzilla's where these inconsistencies have become a problem.
> >
> >In the case that we don't care, any fs-internal solutions are
> >likely to be extremely disruptive to the end user.
> >
> >
> > 2) If we do care, how do we fix this?
> >
> >  2a) Do we use 64 bit i_ino on 32 bit systems? This has some obvious
> >  downsides from a memory usage standpoint. Also it doesn't fully
> >  address the issue - we still have a device field that Btrfs and
> >  Overlayfs override.
> >
> >  We could combine this with an intermediate structure between the
> >  inode and super block so s_dev could be abstracted out. See my
> >  fs_view patches for an example of how this could be done:
> >  https://www.spinics.net/lists/linux-fsdevel/msg125492.html
> >
> >  2b) Do we call ->getattr for this information? Plumbing a vfsmount to
> >  various regions of the kernel like audit, or some of the deeper
> >  tracepoints looks ugly and prone to life-timing issues (which
> >  vfsmount do we use?). On the upside, we 

Re: Bad superblock when mounting rw, ro mount works

2018-06-07 Thread Chris Murphy
On Thu, Jun 7, 2018 at 2:38 PM, Chris Murphy  wrote:


> Your very first task right now is to mount ro, and update your
> backups. Don't do anything else until you've done that. It's a
> testimony to Btrfs that this file system mounts at all, even ro, so
> take advantage of this fact before you can't mount it anymore.

After you've done the backup, you need to find out why one of these
devices is being so unreliable. That has to be fixed first. You can
recreate a new Btrfs or some other file system, and you'll just run
into the exact same problem down the road. Next, it might be useful to
see the output from btrfs-progs 4.16.1 'btrfs check' and 'btrfs check
--mode=lowmem'  both of which are slow, the second one is really slow
but is a different implementation so it's helpful to see both outputs.
That's safe as long as you do not use --repair.

Also we need to see the output from 'btrfs fi us ' with
the volume mounted (ro). Off  hand I think the most likely outcome is
that you get a backup from the ro mounted file system, and you'll have
to recreate it from scratch and restore from backups. In other words,
no matter what you need a current backup.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/2] vfs: allow dedupe of user owned read-only files

2018-06-07 Thread Mark Fasheh
Hi Darrick,

On Thu, Jun 07, 2018 at 11:17:51AM -0700, Darrick J. Wong wrote:
> Looks ok, but could you please update the manpage for
> ioctl_fideduperange to elaborate on when userspace can expect EPERM?
> 
> Acked-by: Darrick J. Wong 

Yes, good idea. I can handle that.

Thanks,
--Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bad superblock when mounting rw, ro mount works

2018-06-07 Thread Chris Murphy
So you zero'd the log? Why? You need to slow down and stop making
changes to your file system unless you hate your data and want to lose
it. The log is there to help protect your data, you don't zero the log
except under very specific situations, none of which appear to apply
to your situation.

'btrfs check' is safe, --repair may or may not be safe, you didn't say
what version of btrfs-progs you're using, I would consider 4.14.1 at
the oldest, it should be straightforward with any distro to get
something relatively new, maybe from a testing repo. Current is 4.16.

And good god don't use --init-extent-tree, that's a hammer. The
absolute safest way to troubleshoot Btrfs problems is using newer
kernel versions, even though 4.15 is not old, there are still tons of
bug fixes in every release.

Suspicious items from your dmesg:

[95639.492347] BTRFS error (device dm-11): pending csums is 44810240
[95639.492791] BTRFS error (device dm-11): cleaner transaction attach
returned -30
[95639.609171] BTRFS error (device dm-11): open_ctree failed
[96614.907963] BTRFS info (device dm-11): disk space caching is enabled
[96614.907965] BTRFS info (device dm-11): has skinny extents
[96615.207692] BTRFS info (device dm-11): bdev /dev/mapper/em3 errs:
wr 0, rd 0, flush 0, corrupt 0, gen 3
[96615.207702] BTRFS info (device dm-11): bdev /dev/mapper/em5 errs:
wr 164286, rd 3444262, flush 2110, corrupt 3, gen 181

OK so it's a multiple device Btrfs volume, and you have one device
missing a metric done of reads and writes and flushes, with a few
corruptions and generation mismatches. What's wrong with
/dev/mapper/em5? Have you done any troubleshooting to figure out what
that problem is? Because Btrfs can't fix it, and it can't make up for
it. That looks like the device was missing for a long time, and writes
were only going to em3. So what's the backstory?

Your very first task right now is to mount ro, and update your
backups. Don't do anything else until you've done that. It's a
testimony to Btrfs that this file system mounts at all, even ro, so
take advantage of this fact before you can't mount it anymore.



--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] btrfs-progs: map-logical: look at next leaf if slot > items

2018-06-07 Thread james harvey
"On Thu, Jun 7, 2018 at 3:26 AM, Qu Wenruo  wrote:
> On 2018年06月07日 15:19, james harvey wrote:
>> On Thu, Jun 7, 2018 at 12:44 AM, Qu Wenruo  wrote:
>>> On 2018年06月07日 11:33, james harvey wrote:
 * btrfs_item_key_to_cpu sets key to (10955960320, BTRFS_EXTENT_ITEM_KEY, 
 4096)
 !!! Bonus bug, LEFT FOR READER.  Why is this item #197, 5 items before the 
 203
 given?  I think no bounds checking causes a buffer over-read here.
 btrfs_item_key_to_cpu() calls btrfs_item_key(), which uses the macro
 read_eb_member() to call read_extent_buffer() which memcpy's using an 
 out
 of range index, at least for this leaf.
>>>
>>> Here the key you got could be garbage.
>>> Either the leaf has enough free space, then you got pending zero into
>>> the key, or the leaf is full, you got some item data into the key.
>>>
>>> Either way, the key should be read/trusted at all.
>>
>> Is the group's preference to leave things like bounds checking to the
>> caller?
>
> The problem here is, btrfs_search_slot() is not only called for
> read_only search, but also for insert.
>
> For insert usage, such returned path is completely valid and in fact
> could be pretty useful.

I don't mean for btrfs_search_slot() behavior to change, since it is
called for inserting.  I mean btrfs_item_key_to_cpu() behavior could
change.

>>  (I get the merit to that, avoiding redundant checks.)  My
>> normal style would be to have read_extent_buffer() fail if start + len
>> is out of bounds.
>
> In this case, it's not out-of-boundary at all, and the invalid key you
> read out if from the last item, which could contain a valid key.
>
> For leaf, btrfs puts btrfs items pointers at the head, and the items
> data at the tail.
>
> Thus for this case, your invalid key slot is just in the middle of the
> leaf, making read_extent_buffer() unable to detect anything wrong.
>
> Thanks,
> Qu

You're right.  read_extent_buffer() doesn't have the information to
look at this.

And, "buffer over-read error" was probably the wrong term.  I mean in
the sense that it has 203 valid values, and is being asked for the
204'th item.

btrfs_item_key() could error on

(nr > btrfs_header_nritems(eb))

This defensive programming would cause redundant checks, if the caller
checks this too as they are supposed to, so I understand if the group
doesn't swing that far on defensive programming.

Although, perhaps in such situation, btrfs_item_key() could
automatically call btrfs_next_leaf().  I'm thinking this might not be
desired behavior everywhere, but just in case it is, mentioning it
because then btrfs_item_key() could handle all of this, not allowing
the caller to shoot their own foot, without redundant checks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 (Only) 2/3] btrfs-progs: map-logical: Use btrfs_next_extent_item()

2018-06-07 Thread james harvey
On Thu, Jun 7, 2018 at 4:50 AM, Su Yue  wrote:
> On 06/07/2018 03:20 PM, james harvey wrote:
>>
>> btrfs_next_extent_item() looks for BTRFS_EXTENT_ITEM_KEY and
>> BTRFS_METADATA_KEY,
>> which are the types we're looking for.
>>
>> Signed-off-by: James Harvey 
>
> Reviewed-by: Su Yue 

btrfs_next_extent_item() looks for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_KEY,
which are the types we're looking for.

(Previous versions missed last 0 argument to btrfs_next_extent_item().  Sorry, I
realize I need to take more time before sending, being more careful on
everything.)

Signed-off-by: James Harvey 
---
 btrfs-map-logical.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 2451012b..8a41b037 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -83,7 +83,8 @@ again:
ret = btrfs_previous_extent_item(fs_info->extent_root,
 path, 0);
else
-   ret = btrfs_next_item(fs_info->extent_root, path);
+   ret = btrfs_next_extent_item(fs_info->extent_root,
+path, 0);
if (ret)
goto out;
goto again;
--
2.17.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/2] vfs: allow dedupe of user owned read-only files

2018-06-07 Thread Darrick J. Wong
On Thu, Jun 07, 2018 at 10:38:53AM -0700, Mark Fasheh wrote:
> The permission check in vfs_dedupe_file_range() is too coarse - We
> only allow dedupe of the destination file if the user is root, or
> they have the file open for write.
> 
> This effectively limits a non-root user from deduping their own read-only
> files. In addition, the write file descriptor that the user is forced to
> hold open can prevent execution of files. As file data during a dedupe
> does not change, the behavior is unexpected and this has caused a number of
> issue reports. For an example, see:
> 
> https://github.com/markfasheh/duperemove/issues/129
> 
> So change the check so we allow dedupe on the target if:
> 
> - the root or admin is asking for it
> - the process has write access
> - the owner of the file is asking for the dedupe
> - the process could get write access
> 
> That way users can open read-only and still get dedupe.
> 
> Signed-off-by: Mark Fasheh 

Looks ok, but could you please update the manpage for
ioctl_fideduperange to elaborate on when userspace can expect EPERM?

Acked-by: Darrick J. Wong 

--D

> ---
>  fs/read_write.c | 17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index e83bd9744b5d..71e9077f8bc1 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1964,6 +1964,20 @@ int vfs_dedupe_file_range_compare(struct inode *src, 
> loff_t srcoff,
>  }
>  EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
>  
> +/* Check whether we are allowed to dedupe the destination file */
> +static bool allow_file_dedupe(struct file *file)
> +{
> + if (capable(CAP_SYS_ADMIN))
> + return true;
> + if (file->f_mode & FMODE_WRITE)
> + return true;
> + if (uid_eq(current_fsuid(), file_inode(file)->i_uid))
> + return true;
> + if (!inode_permission(file_inode(file), MAY_WRITE))
> + return true;
> + return false;
> +}
> +
>  int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>  {
>   struct file_dedupe_range_info *info;
> @@ -1972,7 +1986,6 @@ int vfs_dedupe_file_range(struct file *file, struct 
> file_dedupe_range *same)
>   u64 len;
>   int i;
>   int ret;
> - bool is_admin = capable(CAP_SYS_ADMIN);
>   u16 count = same->dest_count;
>   struct file *dst_file;
>   loff_t dst_off;
> @@ -2036,7 +2049,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
> file_dedupe_range *same)
>  
>   if (info->reserved) {
>   info->status = -EINVAL;
> - } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
> + } else if (!allow_file_dedupe(dst_file)) {
>   info->status = -EINVAL;
>   } else if (file->f_path.mnt != dst_file->f_path.mnt) {
>   info->status = -EXDEV;
> -- 
> 2.15.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-06-07 Thread Mark Fasheh
Right now we return EINVAL if a process does not have permission to dedupe a
file. This was an oversight on my part. EPERM gives a true description of
the nature of our error, and EINVAL is already used for the case that the
filesystem does not support dedupe.

Signed-off-by: Mark Fasheh 
Reviewed-by: Darrick J. Wong 
Acked-by: David Sterba 
---
 fs/read_write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 71e9077f8bc1..7188982e2733 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -2050,7 +2050,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
if (info->reserved) {
info->status = -EINVAL;
} else if (!allow_file_dedupe(dst_file)) {
-   info->status = -EINVAL;
+   info->status = -EPERM;
} else if (file->f_path.mnt != dst_file->f_path.mnt) {
info->status = -EXDEV;
} else if (S_ISDIR(dst->i_mode)) {
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 0/2] vfs: better dedupe permission check

2018-06-07 Thread Mark Fasheh
Hi Al,

The following patches fix a couple of issues with the permission check
we do in vfs_dedupe_file_range(). I sent them out for review twice, a
changelog is attached. If they look ok to you, I'd appreciate them
being pushed upstream.

You can get them from git if you like:

git pull https://github.com/markfasheh/linux dedupe-perms


The first patch expands our check to allow dedupe of a file if the
user owns it or otherwise would be allowed to write to it.

Current behavior is that we'll allow dedupe only if:

- the user is an admin (root)
- the user has the file open for write

This makes it impossible for a user to dedupe their own file set
unless they do it as root, or ensure that all files have write
permission. There's a couple of duperemove bugs open for this:

https://github.com/markfasheh/duperemove/issues/129
https://github.com/markfasheh/duperemove/issues/86

The other problem we have is also related to forcing the user to open
target files for write - A process trying to exec a file currently
being deduped gets ETXTBUSY. The answer (as above) is to allow them to
open the targets ro - root can already do this. There was a patch from
Adam Borowski to fix this back in 2016:

https://lkml.org/lkml/2016/7/17/130

which I have incorporated into my changes.


The 2nd patch fixes our return code for permission denied to be
EPERM. For some reason we're returning EINVAL - I think that's
probably my fault. At any rate, we need to be returning something
descriptive of the actual problem, otherwise callers see EINVAL and
can't really make a valid determination of what's gone wrong.

This has also popped up in duperemove, mostly in the form of cryptic
error messages. Because this is a code returned to userspace, I did
check the other users of extent-same that I could find. Both 'bees'
and 'rust-btrfs' do the same as duperemove and simply report the error
(as they should).

Please apply.

Thanks,
  --Mark

Changes from V2 to V3:
- Return bool from allow_file_dedupe
- V2 discussion: https://www.spinics.net/lists/linux-btrfs/msg78421.html

Changes from V1 to V2:
- Add inode_permission check as suggested by Adam Borowski
- V1 discussion: https://marc.info/?l=linux-xfs&m=152606684017965&w=2
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 1/2] vfs: allow dedupe of user owned read-only files

2018-06-07 Thread Mark Fasheh
The permission check in vfs_dedupe_file_range() is too coarse - We
only allow dedupe of the destination file if the user is root, or
they have the file open for write.

This effectively limits a non-root user from deduping their own read-only
files. In addition, the write file descriptor that the user is forced to
hold open can prevent execution of files. As file data during a dedupe
does not change, the behavior is unexpected and this has caused a number of
issue reports. For an example, see:

https://github.com/markfasheh/duperemove/issues/129

So change the check so we allow dedupe on the target if:

- the root or admin is asking for it
- the process has write access
- the owner of the file is asking for the dedupe
- the process could get write access

That way users can open read-only and still get dedupe.

Signed-off-by: Mark Fasheh 
---
 fs/read_write.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index e83bd9744b5d..71e9077f8bc1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1964,6 +1964,20 @@ int vfs_dedupe_file_range_compare(struct inode *src, 
loff_t srcoff,
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
 
+/* Check whether we are allowed to dedupe the destination file */
+static bool allow_file_dedupe(struct file *file)
+{
+   if (capable(CAP_SYS_ADMIN))
+   return true;
+   if (file->f_mode & FMODE_WRITE)
+   return true;
+   if (uid_eq(current_fsuid(), file_inode(file)->i_uid))
+   return true;
+   if (!inode_permission(file_inode(file), MAY_WRITE))
+   return true;
+   return false;
+}
+
 int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 {
struct file_dedupe_range_info *info;
@@ -1972,7 +1986,6 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
u64 len;
int i;
int ret;
-   bool is_admin = capable(CAP_SYS_ADMIN);
u16 count = same->dest_count;
struct file *dst_file;
loff_t dst_off;
@@ -2036,7 +2049,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
 
if (info->reserved) {
info->status = -EINVAL;
-   } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
+   } else if (!allow_file_dedupe(dst_file)) {
info->status = -EINVAL;
} else if (file->f_path.mnt != dst_file->f_path.mnt) {
info->status = -EXDEV;
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 (Only) 1/3] btrfs-progs: map-logical: look at next leaf if slot > items

2018-06-07 Thread james harvey
btrfs-map-logical -l 10955980800 -b 4096 
No extent found at range [10955980800,10955984896)

But, this extent exists.  btrfs-debug-tree shows:

   item 202 key (10955976704 EXTENT_ITEM 4096) itemoff 8772 itemsize 37
   refs 1 gen 62656 flags DATA
   shared data backref parent 142655488 count 1
...
   item 0 key (10955980800 EXTENT_ITEM 4096) itemoff 16246 itemsize 37
   refs 1 gen 62656 flags DATA
   shared data backref parent 128958464 count 1

The code searches for (logical, 0, 0), and looks forward then backward.  It
needs to go to the next leaf when slot > number of items on this leaf.

Signed-off-by: James Harvey 
---
 btrfs-map-logical.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 7a8bcff9..8a4228a4 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -65,6 +65,14 @@ static int map_one_extent(struct btrfs_fs_info *fs_info,
BUG_ON(ret == 0);
ret = 0;

+   if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+   ret = btrfs_next_leaf(fs_info->extent_root, path);
+   if (ret > 0)
+   ret = -ENOENT;
+   if (ret)
+   goto out;
+   }
+
 again:
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
if ((search_foward && key.objectid < logical) ||
--
2.17.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Bad superblock when mounting rw, ro mount works

2018-06-07 Thread Daniel Underwood
Hi,

I have a raid10-like setup that is failing to mount in rw mode with the error


mount: /mnt/media: wrong fs type, bad option, bad superblock on
/dev/mapper/em1, missing codepage or helper program, or other error

read-only mounts seem to work and the files seem to be there.

I started having issues after a system crash during the process of
deleting a number of large files. After this (Ubuntu 16.04/Kernel
4.4), any attempt to mount the array in rw mode would cause a similar
crash. I did an upgrade to Ubuntu 18.04/Kernel 4.15 and now get the
error above.

I have looked through a variety of posts on the mailing list, but
couldn't find anything with the same issue. I have done a scrub on the
array that resulted in 6 verify errors with dmesg showing something
about extent trees. It didn't list them as uncorrectable errors, but
couldn't correct them either as I can't mount in rw. I also tried
`btrfs rescue zero-log /dev/mapper/em1`, which changes the above to
(say) em5, but then zero-log on em5 causes it to go back to em1.

Any direction would be appreciated. From what I could tell, my next
steps would be a check with --repair or --init-extent-tree, though I'm
reluctant to try those without being explicitly told to do so.

I have attached a dmesg.log file and hope I haven't greped out
anything important.

Thanks,
Daniel
-- 
Daniel Underwood


dmesg.log
Description: Binary data


Re: [PATCH 3/3] btrfs: fix race between mkfs and mount

2018-06-07 Thread David Sterba
On Mon, Jun 04, 2018 at 11:00:30PM +0800, Anand Jain wrote:
> In an instrumented testing it is possible that the mount and
> a newer mkfs.btrfs thread on the same device can race and if the new
> mkfs.btrfs wins it will free the older fs_devices, then the mount thread
> will lead to oops.
> 
> Thread1   Thread2
> ---   ---
> mkfs.btrfs -fq /dev/sdb
> mount /dev/sdb /btrfs
> |_btrfs_mount_root()
>   |_btrfs_scan_one_device(... &fs_devices)
> 
>   mkfs.btrfs -fq /dev/sdb
>   |_btrfs_contol_ioctl()
> |_btrfs_scan_one_device(... 
> &fs_devices)
>   |_::
> 
> |_btrfs_free_stale_devices()
> 
>   |_btrfs_open_devices(fs_devices ..) <-- stale fs_devices.
> 
> Fix this with a mutually exclusive flag BTRFS_VOL_FLAG_EXCL_OPS.

I'm not sure this is the right way to fix it, adding another bit to the
uuid_mutex and device_list_mutex combo.

To fix the race between mount and mkfs we can add a bit of logic to the
device scanning so that mount calling scan will track the purpose and
mkfs scan will not be allowed to free that device.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: fix race between free_stale_devices and close_fs_devices

2018-06-07 Thread David Sterba
On Thu, Jun 07, 2018 at 12:01:09AM +0800, Anand Jain wrote:
> %fs_devices can be free-ed by btrfs_free_stale_devices() when the
> close_fs_devices() drops fs_devices::opened to zero, but close_fs_devices
> tries to access the %fs_devices again without the device_list_mutex.
> 
> Fix this by bringing the %fs_devices access with in the device_list_mutex.
> 
> Stack trace as below.
> 
> HEAD commit:716a685fdb89 Merge branch 'x86-hyperv-for-linus' of git://..
> ::
> CPU: 1 PID: 4499 Comm: syz-executor921 Not tainted 4.17.0+ #84
> ::
> WARNING: CPU: 1 PID: 4499 at fs/btrfs/volumes.c:1071 
> close_fs_devices+0xbc7/0xfa0 fs/btrfs/volumes.c:1071
> Kernel panic - not syncing: panic_on_warn set ...
> ::
> RIP: 0010:close_fs_devices+0xbc7/0xfa0 fs/btrfs/volumes.c:1071
> ::
>  btrfs_close_devices+0x29/0x150 fs/btrfs/volumes.c:1085
>  open_ctree+0x589/0x7898 fs/btrfs/disk-io.c:3358
>  btrfs_fill_super fs/btrfs/super.c:1202 [inline]
>  btrfs_mount_root+0x16df/0x1e70 fs/btrfs/super.c:1593
>  mount_fs+0xae/0x328 fs/super.c:1277
>  vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037
>  vfs_kern_mount+0x40/0x60 fs/namespace.c:1027
>  btrfs_mount+0x4a1/0x213e fs/btrfs/super.c:1661
>  mount_fs+0xae/0x328 fs/super.c:1277
> 
> Reported-by: syzbot+ceb2606025ec1cc34...@syzkaller.appspotmail.com
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/volumes.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c2b7d66192e8..32fba4e24027 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1153,6 +1153,12 @@ static int close_fs_devices(struct btrfs_fs_devices 
> *fs_devices)
>   btrfs_prepare_close_one_device(device);
>   list_add(&device->dev_list, &pending_put);
>   }
> +
> + WARN_ON(fs_devices->open_devices);
> + WARN_ON(fs_devices->rw_devices);
> + fs_devices->opened = 0;
> + clear_bit(BTRFS_VOLUME_STATE_SEEDING, &fs_devices->volume_state);

This is from some other patch.

Moving that to the protected section should fix it but we'd also need to
extend the critical section to:

1047 if (--fs_devices->opened > 0)
1048 return 0;

Otherwise I think there are still some cornercases to fix as the
fs_devices members are not accessed properly everywhere. This patch
should be enough to fix the parallel mount and stale device freeing
races though, so I'll queue it up.

> +
>   mutex_unlock(&fs_devices->device_list_mutex);
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Btrfs progs pre-release 4.17-rc1

2018-06-07 Thread David Sterba
Hi,

a pre-release has been tagged. The full 4.17 release will be done next
Thursday.

Only fixes (with tests) or documentation updates will be added, exceptions on
case-by-case basis.

ETA for 4.17 is in +7 days (2018-06-14).

Changes:

  * check
* many lowmem mode improvements
* properly report qgroup mismatch errors
* check symlinks with append/immutable flags
  * fi usage
* correctly calculate allocated/unallocated for raid10
* minor output updates
  * mkfs
* detect ENOSPC on thinly provisioned devices
* fix spurious EEXIST during directory traversal
  * restore: fix relative path for restore target
  * dump-tree: print symbolic tree names for backrefs
  * send: fix regression preventing send -p with subvolumes mounted on "/"
  * corrupt-tree: refactoring and command line updates
  * build
* make it work with e2fsprogs < 1.42 again
* restore support for autoconf 2.63
* detect if -std=gnu90 is supported
  * other
* new tests
* cleanups

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

Axel Burri (1):
  btrfs-progs: fix regression preventing send -p with subvolumes mounted on 
"/"

David Sterba (4):
  btrfs-progs: tests: update log markers
  btrfs-progs: tests: fix fsck-tests/031 when on NFS
  btrfs-progs: update CHANGES for v4.17
  Btrfs progs v4.17-rc1

Gu JinXiang (1):
  btrfs-progs: Add DEBUG_CFLAGS_INTERNAL for libbtrfsutil

Hans van Kranenburg (1):
  btrfs-progs: fi-usage: fix RAID10 raw disk usage

James Harvey (1):
  btrfs-progs: map-logical: fix spelling of a variable

Jeff Mahoney (3):
  btrfs-progs: convert: fix support for e2fsprogs < 1.42
  btrfs-progs: build: autoconf 2.63 compatibility
  btrfs-progs: build: detect whether -std=gnu90 is supported

Lu Fengqi (3):
  btrfs-progs: qgroup: swap the argument in the caller of 
update_qgroup_relation
  btrfs-progs: tests: check btrfs qgroup parent-child relation output
  btrfs-progs: print-tree: remove dead code from btrfs_print_tree

Matthias Benkard (1):
  btrfs-progs: mkfs: traverse_directory: Reset error code on continue

Misono Tomohiro (5):
  btrfs-progs: fi usage: change warning message more appropriately
  btrfs-progs: fi usage: change to output more info without root privilege
  btrfs-progs: ins: dump-tree: Use %lld for extent data backref offset
  btrfs-progs: ins: dump-tree: Print tree name for extent data/tree block 
backref
  btrfs-progs: check: Initialize all filed of btrfs_inode_item in 
insert_inode_item()

Nikolay Borisov (37):
  btrfs-progs: Remove devid parameter from btrfs_rmap_block
  btrfs-progs: Remove add_cache_extent2
  btrfs-progs: Remove objectid argument from alloc_cache_extent
  btrfs-progs: Use exclude_super_stripes instead of account_super_bytes
  btrfs-progs: check: fix DIR_ITEM checking in lowmem
  btrfs-progs: tests: Add test for collision DIR_ITEM handling
  btrfs-progs: check: Drop ext_ref parameter from find_inode_ref
  btrfs-progs: check: Drop ext_ref param from check_dir_item
  btrfs-progs: check: Drop ext_ref argument from check_inode_item
  btrfs-progs: check: Drop unused ext_ref parameter from process_one_leaf
  btrfs-progs: check: Remove ext_ref param from check_fs_first_inode
  btrfs-progs: check: Remove ext_ref param from walk_down_tree
  btrfs-progs: check: Drop ext_ref param from check_fs_first_inode
  btrfs-progs: check: Drop ext_ref arument from check_fs_root
  btrfs-progs: check: Remove ext_ref local variable from 
check_fs_roots_lowmem
  btrfs-progs: Make __btrfs_fs_incompat return bool
  btrfs-progs: check: Remove root argument from delete_extent_records
  btrfs-progs: check: Remove root parameter from btrfs_fix_block_accounting
  btrfs-progs: check: Remove root parameter from del_pending_extents
  btrfs-progs: check: Remove root argument from finish_current_insert
  btrfs-progs: check: Make update_pinned_extents take btrfs_fs_info
  btrfs-progs: Remove unused argument from clean_tree_block
  btrfs-progs: check: Remove unused root argument from btrfs_extent_post_op
  btrfs-progs: Change btrfs_root to btrfs_fs_info argument in 
btrfs_lookup_extent_info
  btrfs-progs: Remove root argument from btrfs_set_block_flags
  btrfs-progs: Remove root argument from write_one_cache_group
  btrfs-progs: Remove fs_info argument from write_ctree_super
  btrfs-progs: Use symbolic names for read ahead behavior
  btrfs-progs: corrupt-block: Factor out specific-root code
  btrfs-progs: corrupt-block: Correctly handle -r when passing -I
  btrfs-progs: corrupt-block: Factor out key parsing function
  btrfs-progs: corrupt-block: Change -I flag parameter format
  btrfs-progs: corrupt-block: Convert -K flag argument handling to common 
function
  btrfs-

Re: [PATCH v2 3/3] Fix misspelling of forward

2018-06-07 Thread David Sterba
On Thu, Jun 07, 2018 at 03:20:07AM -0400, james harvey wrote:
> Signed-off-by: James Harvey 

Applied, thanks. Please don't forget to add the 'btrfs-progs' prefix to
the subject so the patch does not get overlooked.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/3] btrfs-progs: check: verify symlinks with append/immutable flags

2018-06-07 Thread David Sterba
On Thu, Jun 07, 2018 at 02:59:23PM +0800, Su Yue wrote:
>   Could you replace this patchset with new V3? It fixes the problem 
> reported by Misono.

Done, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: check: Initialize all filed of btrfs_inode_item in insert_inode_item()

2018-06-07 Thread David Sterba
On Thu, Jun 07, 2018 at 11:49:58AM +0900, Misono Tomohiro wrote:
> Initialize all filed of btrfs_inode_item to zero in order to prevent
> having some garbage, especially for flags field.

Have you observed in practice or is it a matter of precaution?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] Add ioctl to clear unused space

2018-06-07 Thread David Sterba
On Fri, Apr 20, 2018 at 04:21:43PM +0200, David Sterba wrote:
> This patchset adds new ioctl similar to TRIM, that provides several
> other ways how to clear the unused space.  The changelogs are
> incomplete, for preview not for inclusion yet.
> 
> It compiles and has been tested lightly, the clearing modes depend on hw
> capabilities (secure discard, sector unmapping instead of zeros), so
> I've tested only zeroing and discard.
> 
> I personally think the zeroing has a usecase and the other modes were
> easy to add. Further extensions can be considered, eg. WRITE_SAME,
> overwriting with a randomly generated pattern, or some filesystem canary
> patterns that can be used to report unused block read as metadata.
> 
> As this is modelled after the generic FITRIM ioctl, so this could be a
> new generic ioctl too. However, last time somebody wanted such ioctl,
> there was a pushback. I'll consider making a generic version and send it
> for comments to fsdevel eventually.
> 
> git://github.com/kdave/btrfs-devel dev/zero-free
> 
> 
> David Sterba (6):
>   btrfs: extend trim callchain to pass the operation type
>   btrfs: add wrapper to switch clearing operation
>   btrfs: add zeroing clear operation
>   btrfs: add new ioctl BTRFS_IOC_CLEAR_FREE
>   btrfs: add discard secure to clear unused space
>   btrfs: add more zeroout modes for clearing unused space

For the record, this patchset has been shifted to 4.19 or later.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Exporting a unique ino/dev pair to user space

2018-06-07 Thread Ian Kent
On Thu, 2018-06-07 at 04:06 +0200, Mark Fasheh wrote:
> Hi Ian,
> 
> On Thu, Jun 07, 2018 at 08:47:28AM +0800, Ian Kent wrote:
> > On Wed, 2018-06-06 at 23:38 +0200, Mark Fasheh wrote:
> > > Hi,
> > 
> > I'm not sure I understand what the problem is.
> 
> I'll try to elaborate below.
> 
> 
> > > We have an inconsistency in how the kernel is exporting inode number /
> > > device pairs for user space. There's of course stat(2) and statx(2),
> > > but aside from those we simply dump inode->i_ino and super->s_dev. In
> > > some cases, the dumped values differ from what is returned via stat(2)
> > > or statx(2). Some filesystems might even show duplicate (but
> > > internally different!) pairs when the raw i_ino/s_dev is used.
> > 
> > How is it that you can dump the raw ino and s_dev if your not in
> > kernel code?
> 
> If you look below my first paragraph, you'll see a list of places where the
> kernel publishes (maybe that's a better word?) ino/dev pairs by printing or
> otherwise copying raw ino/s_dev values into user accesible buffers.
> 
> 
> > For stat family system calls, if the file system defines the inode
> > operation getattr it will be used to fill the stat structure otherwise
> > the VFS will fill the stat  structure and it will use inode->i_ino and
> > sb->s_dev as you say.
> 
> My concern is that those pairs are sometimes not unique and do not line up
> with what statx(2) returns. We actually need the true inode / device as is
> returned by statx in those places. I go into much more detail in my original
> mail.

IMHO I think you are right in that the values seen by user space
should be consistent.

I also think that vfs_statx() (which is pretty much what's called
by all the stat family system calls, and indirectly vfs_getattr)
should be the defining way to get those values if only because it's
core VFS and maintained by the VFS maintainer who should be the one
to set the rules.

But, as you say there are a bunch of places, not necessarily easy
to find, that would need review.

And there's the question of 32 bit .

> 
> 
> > > Some examples where we dump raw ino/dev:
> > > 
> > > - /proc//maps. I've written about how this confuses lsof(8):
> > >   https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
> > > 
> > > - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
> > >   trace/events/lock.h or trace/events/writeback.h for examples.
> > > 
> > > - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()
> > > 
> > > - Audit records the raw ino/dev and passes them around. We do seem to
> > >   have paths printed from audit as well, but if it's printed with the
> > >   wrong ino/dev pair I believe my point still stands.
> > > 
> > > 
> > > This breaks software which expects these pairs to be unique, and can
> > > put the user in a situation where they might not be able to find an
> > > inode referenced from the kernel. What's even worse - depending on how
> > > ino is exported, they might even find the *wrong* inode.
> 
> Thanks,
>   --Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] btrfs-progs: check: verify symlinks with append/immutable flags

2018-06-07 Thread Nikolay Borisov



On  7.06.2018 12:10, Su Yue wrote:
> 
> 
> On 06/07/2018 02:54 PM, Nikolay Borisov wrote:
>>
>>
>> On  7.06.2018 09:55, Su Yue wrote:
>>> This patchset can be fetch from my github:
>>> https://github.com/Damenly/btrfs-progs/commits/odd_inode_flags
>>> It's based on kdave/devel whose HEAD is:
>>> commit 1c846faaf87fbb01e080c94098c02b1695ed86dd
>>> Author: Nikolay Borisov 
>>> Date:   Mon May 28 09:36:50 2018 +0300
>>>
>>> btrfs-progs: Remove fs_info argument from write_ctree_super
>>> 
>>> symlinks should never have append/immutable attributes.
>>> This patchset enables btrfs check to verify such corruption.
>>>
>>> PATCH[1] is for original mode.
>>> PATCH[2] is for original mode.
>>>
>>> PATCH[3] adds a test image.
>>> For further use, the directory is called bad-inode-flags.
>>>
>>> #issue 133
>>
>>
>> So you areadding code to detect the problem but not to fix it. Is there
>> some technical reason why you didn't implement clearing those flags or
>> just didn't do it for this series? IMHO it will be good if the check +
> 
> TBH I just didn't plan to do it while writing the patches.
> If the functionality of repair is necessary, I'm willing to
> do it.

That's fine, however in case this code detects such a symlink what's the
user supposed to do - they can just delete the symlink because it's
immutable/append-only. SO they will know they have such symlinks but
won't be able to repair them.

> 
> However, it seems I have to complete work about repair mismatched file
> type first. So we can verify whether a bad inode is a symlink with
> append/immutable attributes or a normal file set with BTRFS_FT_SYMLINK.
> 
> I'm fine that v2 patchset is to be replaced then support repair late
> or to be reverted.
> 
> Thanks,
> Su
> 
>> repair code are part of the same series. Then you can also extend the
>> test case to cover the 2 functionalities?
>>
>>>
>>> ---
>>> Changelog:
>>> v3:
>>>    Use S_ISLNK instead of BTRFS_FT_SYMLINK. Thanks Misono.
>>>    Change the test image created by hand.
>>> v2:
>>>    Use "rec->errors |=" instead of "rec->errors =" in patch[1].
>>>    Define new error bit of invalid inode flags in lowmem mode.
>>>    Adjust print message in patch[2]. Thanks, Qu and Nikolay.
>>>    Rename test directory from odd-inode-flags to bad-inode-flags.
>>>
>>> Su Yue (3):
>>>    btrfs-progs: check: check symlinks with append/immutable flags
>>>    btrfs-progs: check: lowmem: check symlinks with append/immutable
>>> flags
>>>    btrfs-progs: fsck-tests: add test case to check symlinks with bad
>>>  flags
>>>
>>>   check/main.c |   7 +++
>>>   check/mode-lowmem.c  |  10 ++
>>>   check/mode-lowmem.h  |   1 +
>>>   check/mode-original.h    |   1 +
>>>   .../034-bad-inode-flags/default_case.img | Bin 0 -> 4096 bytes
>>>   tests/fsck-tests/034-bad-inode-flags/test.sh |  15 +++
>>>   6 files changed, 34 insertions(+)
>>>   create mode 100644
>>> tests/fsck-tests/034-bad-inode-flags/default_case.img
>>>   create mode 100755 tests/fsck-tests/034-bad-inode-flags/test.sh
>>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] btrfs-progs: check: verify symlinks with append/immutable flags

2018-06-07 Thread Su Yue




On 06/07/2018 02:54 PM, Nikolay Borisov wrote:



On  7.06.2018 09:55, Su Yue wrote:

This patchset can be fetch from my github:
https://github.com/Damenly/btrfs-progs/commits/odd_inode_flags
It's based on kdave/devel whose HEAD is:
commit 1c846faaf87fbb01e080c94098c02b1695ed86dd
Author: Nikolay Borisov 
Date:   Mon May 28 09:36:50 2018 +0300

btrfs-progs: Remove fs_info argument from write_ctree_super

symlinks should never have append/immutable attributes.
This patchset enables btrfs check to verify such corruption.

PATCH[1] is for original mode.
PATCH[2] is for original mode.

PATCH[3] adds a test image.
For further use, the directory is called bad-inode-flags.

#issue 133



So you areadding code to detect the problem but not to fix it. Is there
some technical reason why you didn't implement clearing those flags or
just didn't do it for this series? IMHO it will be good if the check +


TBH I just didn't plan to do it while writing the patches.
If the functionality of repair is necessary, I'm willing to
do it.

However, it seems I have to complete work about repair mismatched file 
type first. So we can verify whether a bad inode is a symlink with 
append/immutable attributes or a normal file set with BTRFS_FT_SYMLINK.


I'm fine that v2 patchset is to be replaced then support repair late
or to be reverted.

Thanks,
Su


repair code are part of the same series. Then you can also extend the
test case to cover the 2 functionalities?



---
Changelog:
v3:
   Use S_ISLNK instead of BTRFS_FT_SYMLINK. Thanks Misono.
   Change the test image created by hand.
v2:
   Use "rec->errors |=" instead of "rec->errors =" in patch[1].
   Define new error bit of invalid inode flags in lowmem mode.
   Adjust print message in patch[2]. Thanks, Qu and Nikolay.
   Rename test directory from odd-inode-flags to bad-inode-flags.

Su Yue (3):
   btrfs-progs: check: check symlinks with append/immutable flags
   btrfs-progs: check: lowmem: check symlinks with append/immutable flags
   btrfs-progs: fsck-tests: add test case to check symlinks with bad
 flags

  check/main.c |   7 +++
  check/mode-lowmem.c  |  10 ++
  check/mode-lowmem.h  |   1 +
  check/mode-original.h|   1 +
  .../034-bad-inode-flags/default_case.img | Bin 0 -> 4096 bytes
  tests/fsck-tests/034-bad-inode-flags/test.sh |  15 +++
  6 files changed, 34 insertions(+)
  create mode 100644 tests/fsck-tests/034-bad-inode-flags/default_case.img
  create mode 100755 tests/fsck-tests/034-bad-inode-flags/test.sh


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/3] Fix misspelling of forward

2018-06-07 Thread Su Yue




On 06/07/2018 03:20 PM, james harvey wrote:

Signed-off-by: James Harvey 


Reviewed-by: Su Yue 


---
  btrfs-map-logical.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 8a41b037..59ba731b 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -39,7 +39,7 @@
  static FILE *info_file;

  static int map_one_extent(struct btrfs_fs_info *fs_info,
- u64 *logical_ret, u64 *len_ret, int search_foward)
+ u64 *logical_ret, u64 *len_ret, int search_forward)
  {
 struct btrfs_path *path;
 struct btrfs_key key;
@@ -75,11 +75,11 @@ static int map_one_extent(struct btrfs_fs_info *fs_info,

  again:
 btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
-   if ((search_foward && key.objectid < logical) ||
-   (!search_foward && key.objectid > logical) ||
+   if ((search_forward && key.objectid < logical) ||
+   (!search_forward && key.objectid > logical) ||
 (key.type != BTRFS_EXTENT_ITEM_KEY &&
  key.type != BTRFS_METADATA_ITEM_KEY)) {
-   if (!search_foward)
+   if (!search_forward)
 ret = btrfs_previous_extent_item(fs_info->extent_root,
  path, 0);
 else
--
2.17.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] btrfs-progs: map-logical: Use btrfs_next_extent_item()

2018-06-07 Thread Su Yue




On 06/07/2018 03:20 PM, james harvey wrote:

btrfs_next_extent_item() looks for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_KEY,
which are the types we're looking for.

Signed-off-by: James Harvey 


Reviewed-by: Su Yue 


---
  btrfs-map-logical.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 2451012b..8a41b037 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -83,7 +83,8 @@ again:
 ret = btrfs_previous_extent_item(fs_info->extent_root,
  path, 0);
 else
-   ret = btrfs_next_item(fs_info->extent_root, path);
+   ret = btrfs_next_extent_item(fs_info->extent_root,
+path);
 if (ret)
 goto out;
 goto again;
--
2.17.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/3] btrfs-progs: map-logical: look at next leaf if slot > items

2018-06-07 Thread Su Yue




On 06/07/2018 03:19 PM, james harvey wrote:

btrfs-map-logical -l 10955980800 -b 4096 
No extent found at range [10955980800,10955984896)

But, this extent exists.  btrfs-debug-tree shows:

item 202 key (10955976704 EXTENT_ITEM 4096) itemoff 8772 itemsize 37
refs 1 gen 62656 flags DATA
shared data backref parent 142655488 count 1
...
item 0 key (10955980800 EXTENT_ITEM 4096) itemoff 16246 itemsize 37
refs 1 gen 62656 flags DATA
shared data backref parent 128958464 count 1

The code searches for (, 0, 0), and looks forward then backward.  It
needs to go to the next leaf when slot > number of items on this leaf.

Signed-off-by: James Harvey 
---
  btrfs-map-logical.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 7a8bcff9..2451012b 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -65,6 +65,14 @@ static int map_one_extent(struct btrfs_fs_info *fs_info,
 BUG_ON(ret == 0);
 ret = 0;

+   if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+   ret = btrfs_next_leaf(fs_info->extent_root, path);


It seems that you forgot to handle the case ret < 0.

Thanks,
Su


+   if (ret > 0) {
+   ret = -ENOENT;
+   goto out;
+   }
+   }
+
  again:
 btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
 if ((search_foward && key.objectid < logical) ||
--
2.17.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Exporting a unique ino/dev pair to user space

2018-06-07 Thread Amir Goldstein
On Thu, Jun 7, 2018 at 12:38 AM, Mark Fasheh  wrote:
> Hi,
>
> We have an inconsistency in how the kernel is exporting inode number /
> device pairs for user space. There's of course stat(2) and statx(2),
> but aside from those we simply dump inode->i_ino and super->s_dev. In
> some cases, the dumped values differ from what is returned via stat(2)
> or statx(2). Some filesystems might even show duplicate (but
> internally different!) pairs when the raw i_ino/s_dev is used.
>
> Some examples where we dump raw ino/dev:
>
> - /proc//maps. I've written about how this confuses lsof(8):
>   https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
>
> - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
>   trace/events/lock.h or trace/events/writeback.h for examples.
>
> - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()
>
> - Audit records the raw ino/dev and passes them around. We do seem to
>   have paths printed from audit as well, but if it's printed with the
>   wrong ino/dev pair I believe my point still stands.
>

knfsd also looks at i_ino. I converted one or two of these places
to getattr callbacks recently, but there are still some more.

Anyway, w.r.t. Overlayfs, I believe Miklos' work to make file_inode()
an overlay inode should resolve several of the issues listed above -
probably not all, but didn't check every tracepoint..

>
> This breaks software which expects these pairs to be unique, and can
> put the user in a situation where they might not be able to find an
> inode referenced from the kernel. What's even worse - depending on how
> ino is exported, they might even find the *wrong* inode.
>
> I also should point out that we're likely at this point because
> stat(2) has been using an unsigned long for ino. On a 32 bit system,
> it would have been impossible for the user to get the real inode
> number in the first place. So there probably wasn't much we could do.
>
> These days though, we have statx(2) which will do the right thing on
> all platforms so we no longer have that excuse. The user ought to be
> able to take an ino/dev pair and ultimately find the actual file on
> their system, partially with the help of statx(2).
>
>
> Some examples of how ino is manipulated by filesystems:
>
> - On 64 bit i_ino and kstat->ino tend to be filled in correctly (from
>   what I can tell). stat->dev is not always consistent with super->s_dev.
>
> - On 32 bits, many filesystems employ a transformation to squeeze a 64
>   bit identifier into 32 bits. The exact methods are fs specific,
>   what's important is that we're losing information, introducing the
>   possibility of duplicate inode numbers.
>
> - On all platforms, Btrfs and Overlayfs can have duplicate inode
>   numbers. Of course, device can be different across the fs as well
>   with the kernel reporting s_dev and these filesystems returning
>   a different device via stat() or statx().
>
>
> Getting the inode number portion of this pair fixed would immediately
> solve the situation for all filesystems except Btrfs and
> Overlayfs - they report a different device from stat.
>
> Regarding the device portion of the pair, I'm honestly not sure
> whether Overlayfs cares, and my attempts to fix the s_dev situation
> for Btrfs have all come to the same dead ends that I've hit briefly
> looking into this inode number issue - the problems are intrinsically
> linked.
>
> So my questions are:
>
> 1) Do we care about this? On one hand, we've had this inconsistency
>for a long time, for various reasons. On the other hand, I can point
>to bugzilla's where these inconsistencies have become a problem.
>
>In the case that we don't care, any fs-internal solutions are
>likely to be extremely disruptive to the end user.
>
>
> 2) If we do care, how do we fix this?
>
>  2a) Do we use 64 bit i_ino on 32 bit systems? This has some obvious
>  downsides from a memory usage standpoint. Also it doesn't fully
>  address the issue - we still have a device field that Btrfs and
>  Overlayfs override.
>
>  We could combine this with an intermediate structure between the
>  inode and super block so s_dev could be abstracted out. See my
>  fs_view patches for an example of how this could be done:
>  https://www.spinics.net/lists/linux-fsdevel/msg125492.html
>
>  2b) Do we call ->getattr for this information? Plumbing a vfsmount to
>  various regions of the kernel like audit, or some of the deeper
>  tracepoints looks ugly and prone to life-timing issues (which
>  vfsmount do we use?). On the upside, we could probably make it
>  really light by only sending down the STATX_INO flag and letting
>  the filesystem optimize accordingly.
>
>  2c) I don't think we can really use a dedicated callback without
>  passing the vfsmount through since Overlayfs ->getattr might call
>  the lower fs ->getattr. At that point we might as well use getattr.
>

Didn't get the Overlayfs lower fs getattr argume

[PATCH v2 2/3] btrfs-progs: map-logical: Use btrfs_next_extent_item()

2018-06-07 Thread james harvey
btrfs_next_extent_item() looks for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_KEY,
which are the types we're looking for.

Signed-off-by: James Harvey 
---
 btrfs-map-logical.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 2451012b..8a41b037 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -83,7 +83,8 @@ again:
ret = btrfs_previous_extent_item(fs_info->extent_root,
 path, 0);
else
-   ret = btrfs_next_item(fs_info->extent_root, path);
+   ret = btrfs_next_extent_item(fs_info->extent_root,
+path);
if (ret)
goto out;
goto again;
--
2.17.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 3/3] Fix misspelling of forward

2018-06-07 Thread james harvey
Signed-off-by: James Harvey 
---
 btrfs-map-logical.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 8a41b037..59ba731b 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -39,7 +39,7 @@
 static FILE *info_file;

 static int map_one_extent(struct btrfs_fs_info *fs_info,
- u64 *logical_ret, u64 *len_ret, int search_foward)
+ u64 *logical_ret, u64 *len_ret, int search_forward)
 {
struct btrfs_path *path;
struct btrfs_key key;
@@ -75,11 +75,11 @@ static int map_one_extent(struct btrfs_fs_info *fs_info,

 again:
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
-   if ((search_foward && key.objectid < logical) ||
-   (!search_foward && key.objectid > logical) ||
+   if ((search_forward && key.objectid < logical) ||
+   (!search_forward && key.objectid > logical) ||
(key.type != BTRFS_EXTENT_ITEM_KEY &&
 key.type != BTRFS_METADATA_ITEM_KEY)) {
-   if (!search_foward)
+   if (!search_forward)
ret = btrfs_previous_extent_item(fs_info->extent_root,
 path, 0);
else
--
2.17.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/3] btrfs-progs: map-logical: look at next leaf if slot > items

2018-06-07 Thread james harvey
btrfs-map-logical -l 10955980800 -b 4096 
No extent found at range [10955980800,10955984896)

But, this extent exists.  btrfs-debug-tree shows:

   item 202 key (10955976704 EXTENT_ITEM 4096) itemoff 8772 itemsize 37
   refs 1 gen 62656 flags DATA
   shared data backref parent 142655488 count 1
...
   item 0 key (10955980800 EXTENT_ITEM 4096) itemoff 16246 itemsize 37
   refs 1 gen 62656 flags DATA
   shared data backref parent 128958464 count 1

The code searches for (, 0, 0), and looks forward then backward.  It
needs to go to the next leaf when slot > number of items on this leaf.

Signed-off-by: James Harvey 
---
 btrfs-map-logical.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 7a8bcff9..2451012b 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -65,6 +65,14 @@ static int map_one_extent(struct btrfs_fs_info *fs_info,
BUG_ON(ret == 0);
ret = 0;

+   if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+   ret = btrfs_next_leaf(fs_info->extent_root, path);
+   if (ret > 0) {
+   ret = -ENOENT;
+   goto out;
+   }
+   }
+
 again:
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
if ((search_foward && key.objectid < logical) ||
--
2.17.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] btrfs-progs: map-logical: look at next leaf if slot > items

2018-06-07 Thread james harvey
On Thu, Jun 7, 2018 at 12:20 AM, Su Yue  wrote:
> On 06/07/2018 11:33 AM, james harvey wrote:
>> Using btrfs-progs v4.16:
>>
>> No extent found at range [10955980800,10955984896)
>>
>> But, this extent exists.  btrfs-debug-tree shows:
> Make sense. IMP the commit message is too long to read.
> Code wise is almost fine. Some nitpicks are below.

Thanks for response.  I'll try to work on that.

>> +   if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
>> +   ret = btrfs_next_leaf(fs_info->extent_root, path);
>> +   if (ret != 0) {
>> +   ret = -1;
>
> btrfs_next_leaf() may return -EIO, so keep the negative return code is
> prefered.
> btrfs_next_leaf may return > 0, here I'd like to set ret=-ENOENT.
> You can refer callers of btrfs_next_leaf() how to handle the return codes.

Fixed in v2.

>> else
>> -   ret = btrfs_next_item(fs_info->extent_root, path);
>> +   ret =
> Nit:
> Misclick here?

Missed semicolon being 81st character, fixed in v2.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html