[PATCH 2/2] btrfs: submit a normal bio for the mirror dio read

2017-11-28 Thread Anand Jain
Similarly, do not set REQ_FAILFAST_DEV for the mirror read bio during dio.

[Taken from patch in ml
 btrfs: submit a normal bio for the mirror read]
[When the first mirror fails to read, we submit the bio again to read
from the 2nd mirror. However, we also set the flag REQ_FAILFAST_DEV,
which tells the underlying block device driver not to perform its
default IO retries (5 attempts in sd), so the command fails and is
returned at the first failed attempt.

In fact it should be the other way around: as the 2nd mirror holds the
last copy of the data, the driver should try equally hard to make this
read succeed. At the same time, with or without REQ_FAILFAST_DEV there
is no performance benefit if the command succeeds on the first attempt.

The only benefit of REQ_FAILFAST_DEV is that when both copies fail to
read, EIO is reported to the application roughly 40% faster (since it
saves 4 retries). But when the first mirror has failed, what matters
more is making a successful read from the 2nd mirror.

And for the DUP case, where both copies are on the same disk, this
change makes the driver retry 5 times on 2 different LBAs/sectors of
the same disk, which is probably not a good idea if your test case
involves pulling out the disk. But as of now we have no way to tell
the block layer how critical a read is, so it cannot adapt the retries
dynamically].

Signed-off-by: Anand Jain 
---
 fs/btrfs/inode.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b93fe05a39c7..9a539cf82e51 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8014,9 +8014,6 @@ static blk_status_t dio_read_error(struct inode *inode, 
struct bio *failed_bio,
}
 
segs = bio_segments(failed_bio);
-   if (segs > 1 ||
-   (failed_bio->bi_io_vec->bv_len > btrfs_inode_sectorsize(inode)))
-   read_mode |= REQ_FAILFAST_DEV;
 
isector = start - btrfs_io_bio(failed_bio)->logical;
isector >>= inode->i_sb->s_blocksize_bits;
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs: submit a normal bio for the mirror read

2017-11-28 Thread Anand Jain
When the first mirror fails to read, we submit the bio again to read
from the 2nd mirror. However, we also set the flag REQ_FAILFAST_DEV,
which tells the underlying block device driver not to perform its
default IO retries (5 attempts in sd), so the command fails and is
returned at the first failed attempt.

In fact it should be the other way around: as the 2nd mirror holds the
last copy of the data, the driver should try equally hard to make this
read succeed. At the same time, with or without REQ_FAILFAST_DEV there
is no performance benefit if the command succeeds on the first attempt.

The only benefit of REQ_FAILFAST_DEV is that when both copies fail to
read, EIO is reported to the application roughly 40% faster (since it
saves 4 retries). But when the first mirror has failed, what matters
more is making a successful read from the 2nd mirror.

And for the DUP case, where both copies are on the same disk, this
change makes the driver retry 5 times on 2 different LBAs/sectors of
the same disk, which is probably not a good idea if your test case
involves pulling out the disk. But as of now we have no way to tell
the block layer how critical a read is, so it cannot adapt the retries
dynamically.

Signed-off-by: Anand Jain 
---
 fs/btrfs/extent_io.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d8eb457f2a70..29c03fb4d5af 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2388,9 +2388,6 @@ static int bio_readpage_error(struct bio *failed_bio, u64 
phy_offset,
return -EIO;
}
 
-   if (failed_bio->bi_vcnt > 1)
-   read_mode |= REQ_FAILFAST_DEV;
-
phy_offset >>= inode->i_sb->s_blocksize_bits;
bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
  start - page_offset(page),
-- 
2.15.0



[PATCH] fstests: btrfs: Add test case to check if btrfs can handle full fs trim correctly

2017-11-28 Thread Qu Wenruo
Ancient commit f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in
fitrim ioctl") introduced a regression where btrfs may fail to trim any
free space in existing block groups.

It's caused by confusing btrfs_super_block->total_bytes with the btrfs
logical address space.
Unlike a physical address, any aligned bytenr in the range [0, U64_MAX)
is valid in the btrfs logical address space, and it is the chunk mapping
mechanism of btrfs that handles the logical<->physical mapping.

The test case will craft a btrfs with the following features:
0) Single data/meta profile
   Make trimmed bytes reporting and chunk allocation more predictable.

1) All chunks start beyond super_block->total_bytes (1G)
   By relocating the chunks several times.

2) Unallocated space is less than 50% of the whole fs

3) Fragmented data chunks
   Data chunks will be full of fragments; 50% of the data chunk space
   will be free space.

So in theory fstrim should be able to trim over 50% of the space of
the fs (with the fix, 64% of the fs can be trimmed), while the
regression makes btrfs able to trim only unallocated space, which is
less than 50% of the total space (without the fix, it's only 31%).

Fixed by patch named "btrfs: Ensure btrfs_trim_fs can trim the whole fs".

Signed-off-by: Qu Wenruo 
---
 tests/btrfs/155 | 120 
 tests/btrfs/155.out |   2 +
 tests/btrfs/group   |   1 +
 3 files changed, 123 insertions(+)
 create mode 100755 tests/btrfs/155
 create mode 100644 tests/btrfs/155.out

diff --git a/tests/btrfs/155 b/tests/btrfs/155
new file mode 100755
index ..6918f093
--- /dev/null
+++ b/tests/btrfs/155
@@ -0,0 +1,120 @@
+#! /bin/bash
+# FS QA Test 155
+#
+# Check if btrfs can correctly trim free space in block groups
+#
+# An ancient regression prevents btrfs from trimming free space inside
+# existing block groups if the bytenr of a block group starts beyond
+# btrfs_super_block->total_bytes.
+# However, all bytenrs in btrfs live in the btrfs logical address space,
+# where any bytenr in the range [0, U64_MAX) is valid.
+#
+# Fixed by patch named "btrfs: Ensure btrfs_trim_fs can trim the whole fs".
+#
+#---
+# Copyright (c) 2017 SUSE Linux Products GmbH.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_fstrim
+
+# 1G fs size
+fs_size=$((1024 * 1024 * 1024))
+
+# Use small files to fill half of the fs
+file_size=$(( 1024 * 1024 ))
+nr_files=$(( $fs_size / $file_size / 2))
+
+# Force to use single data and meta profile.
+# Since the test relies on fstrim output, which will differ for different
+# profiles
+_scratch_mkfs -b $fs_size -m single -d single > /dev/null
+_scratch_mount
+
+_require_batched_discard "$SCRATCH_MNT"
+
+for n in $(seq -w 0 $(( $nr_files - 1))); do
+   $XFS_IO_PROG -f -c "pwrite 0 $file_size" "$SCRATCH_MNT/file_$n" \
+   > /dev/null
+done
+
+# Flush all buffer data into disk, to trigger chunk allocation
+sync
+
+# Now we have taken at least 50% of the filesystem; relocate all chunks
+# twice so that every chunk starts after 1G in logical address space.
+# (Btrfs chunk allocation does not rewind to reuse lower space)
+_run_btrfs_util_prog balance start --full-balance "$SCRATCH_MNT"
+
+# Sync to avoid a possible false ENOSPC alert on v4.15-rc1; it seems to
+# be a reserved-space related bug (maybe related to the outstanding
+# space rework?), but that's another story.
+sync
+
+_run_btrfs_util_prog balance start --full-balance "$SCRATCH_MNT"
+
+# Now remove half of the files to make some holes for later trim,
+# while still keeping the chunk space fragmented so that no chunk is freed
+rm $SCRATCH_MNT/file_*[13579] -f
+
+# Make sure space is freed
+sync
+
+trimmed=$($FSTRIM_PROG -v "$SCRATCH_MNT" | _filter_fstrim)
+echo 

[PATCH 0/4] define BTRFS_DEV_STATE

2017-11-28 Thread Anand Jain
As of now, device properties and states are represented as individual
int variables. Patches in the ML, such as the device failed state, need
this cleanup as well.

Anand Jain (4):
  btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE
  btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA
  btrfs: cleanup device states define BTRFS_DEV_STATE_MISSING
  btrfs: cleanup device states define BTRFS_DEV_STATE_CAN_DISCARD

 fs/btrfs/dev-replace.c |   3 +-
 fs/btrfs/disk-io.c |  23 ++---
 fs/btrfs/extent-tree.c |   5 +-
 fs/btrfs/extent_io.c   |   3 +-
 fs/btrfs/ioctl.c   |   2 +-
 fs/btrfs/scrub.c   |  13 +++--
 fs/btrfs/super.c   |   7 +--
 fs/btrfs/volumes.c | 129 -
 fs/btrfs/volumes.h |  10 ++--
 9 files changed, 118 insertions(+), 77 deletions(-)

-- 
2.15.0



[PATCH 2/4] btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA

2017-11-28 Thread Anand Jain
Currently, device state is managed by individual int variables such
as struct btrfs_device::in_fs_metadata. Instead of that, declare the
device state BTRFS_DEV_STATE_IN_FS_METADATA and use bit operations.

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 21 ++---
 fs/btrfs/scrub.c   |  3 ++-
 fs/btrfs/super.c   |  5 +++--
 fs/btrfs/volumes.c | 29 +
 fs/btrfs/volumes.h |  2 +-
 5 files changed, 37 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0d361b6713e1..ab1a514e5c8d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3563,8 +3563,10 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
continue;
if (!dev->bdev)
continue;
-   if (!dev->in_fs_metadata ||
-   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   &dev->dev_state) ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE,
+   &dev->dev_state))
continue;
 
write_dev_flush(dev);
@@ -3579,8 +3581,10 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
errors_wait++;
continue;
}
-   if (!dev->in_fs_metadata ||
-   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   &dev->dev_state) ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE,
+   &dev->dev_state))
continue;
 
ret = wait_dev_flush(dev);
@@ -3677,7 +3681,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int 
max_mirrors)
total_errors++;
continue;
}
-   if (!dev->in_fs_metadata ||
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   &dev->dev_state) ||
!test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
continue;
 
@@ -3717,8 +3722,10 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int 
max_mirrors)
list_for_each_entry_rcu(dev, head, dev_list) {
if (!dev->bdev)
continue;
-   if (!dev->in_fs_metadata ||
-   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   &dev->dev_state) ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE,
+   &dev->dev_state))
continue;
 
ret = wait_dev_supers(dev, max_mirrors);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e027e0de66a5..060fa93731e5 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4129,7 +4129,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
devid, u64 start,
}
 
mutex_lock(&fs_info->scrub_lock);
-   if (!dev->in_fs_metadata || dev->is_tgtdev_for_dev_replace) {
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
+   dev->is_tgtdev_for_dev_replace) {
mutex_unlock(&fs_info->scrub_lock);
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
return -EIO;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 161694b66038..9089aad2f3aa 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1945,8 +1945,9 @@ static int btrfs_calc_avail_data_space(struct 
btrfs_fs_info *fs_info,
 
rcu_read_lock();
list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
-   if (!device->in_fs_metadata || !device->bdev ||
-   device->is_tgtdev_for_dev_replace)
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   &device->dev_state) ||
+   !device->bdev || device->is_tgtdev_for_dev_replace)
continue;
 
if (i >= nr_devices)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0f5be1808c6e..13a6ae80bee1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -652,7 +652,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
fs_devices->rotating = 1;
 
device->bdev = bdev;
-   device->in_fs_metadata = 0;
+   clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
device->mode = flags;
 
fs_devices->open_devices++;
@@ -857,7 +857,8 @@ void btrfs_close_extra_devices(struct btrfs_fs_devices 
*fs_devices, int step)
 again:
/* This is the 

[PATCH 4/4] btrfs: cleanup device states define BTRFS_DEV_STATE_CAN_DISCARD

2017-11-28 Thread Anand Jain
Currently, device state is managed by individual int variables such
as struct btrfs_device::can_discard. Instead of that, declare
btrfs_device::dev_state BTRFS_DEV_STATE_CAN_DISCARD and use bit
operations.

Signed-off-by: Anand Jain 
---
 fs/btrfs/extent-tree.c | 3 ++-
 fs/btrfs/volumes.c | 6 +++---
 fs/btrfs/volumes.h | 2 +-
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f81d928754e1..ee79f7cdc543 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2155,7 +2155,8 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, 
u64 bytenr,
 
for (i = 0; i < bbio->num_stripes; i++, stripe++) {
u64 bytes;
-   if (!stripe->dev->can_discard)
+   if (!test_bit(BTRFS_DEV_STATE_CAN_DISCARD,
+   &stripe->dev->dev_state))
continue;
 
ret = btrfs_issue_discard(stripe->dev->bdev,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 75839e07ce10..a9c6486f06f4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -647,7 +647,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
 
q = bdev_get_queue(bdev);
if (blk_queue_discard(q))
-   device->can_discard = 1;
+   set_bit(BTRFS_DEV_STATE_CAN_DISCARD, &device->dev_state);
if (!blk_queue_nonrot(q))
fs_devices->rotating = 1;
 
@@ -2401,7 +2401,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
 
q = bdev_get_queue(bdev);
if (blk_queue_discard(q))
-   device->can_discard = 1;
+   set_bit(BTRFS_DEV_STATE_CAN_DISCARD, &device->dev_state);
set_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
device->generation = trans->transid;
device->io_width = fs_info->sectorsize;
@@ -2601,7 +2601,7 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info 
*fs_info,
 
q = bdev_get_queue(bdev);
if (blk_queue_discard(q))
-   device->can_discard = 1;
+   set_bit(BTRFS_DEV_STATE_CAN_DISCARD, &device->dev_state);
mutex_lock(&fs_info->fs_devices->device_list_mutex);
set_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
device->generation = 0;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 2fbff6902c8d..85e4b2dcc071 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -50,6 +50,7 @@ struct btrfs_pending_bios {
 #define BTRFS_DEV_STATE_WRITEABLE  (1UL << 1)
 #define BTRFS_DEV_STATE_IN_FS_METADATA (1UL << 2)
 #define BTRFS_DEV_STATE_MISSING(1UL << 3)
+#define BTRFS_DEV_STATE_CAN_DISCARD(1UL << 4)
 
 struct btrfs_device {
struct list_head dev_list;
@@ -74,7 +75,6 @@ struct btrfs_device {
fmode_t mode;
 
unsigned long dev_state;
-   int can_discard;
int is_tgtdev_for_dev_replace;
blk_status_t last_flush_error;
int flush_bio_sent;
-- 
2.15.0



[PATCH 3/4] btrfs: cleanup device states define BTRFS_DEV_STATE_MISSING

2017-11-28 Thread Anand Jain
Currently, device state is managed by individual int variables such
as struct btrfs_device::missing. Instead of that, declare
btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use bit
operations.

Signed-off-by: Anand Jain 
---
 fs/btrfs/dev-replace.c |  3 ++-
 fs/btrfs/disk-io.c |  4 ++--
 fs/btrfs/scrub.c   |  7 ---
 fs/btrfs/super.c   |  2 +-
 fs/btrfs/volumes.c | 33 +++--
 fs/btrfs/volumes.h |  2 +-
 6 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 4b6ceb38cb5f..559db7667f38 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -306,7 +306,8 @@ void btrfs_after_dev_replace_commit(struct btrfs_fs_info 
*fs_info)
 
 static inline char *dev_missing_or_rcu_str(struct btrfs_device *device)
 {
-   return device->missing ? "<missing disk>" : rcu_str_deref(device->name);
+   return test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state) ?
+   "<missing disk>" : rcu_str_deref(device->name);
 }
 
 int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ab1a514e5c8d..ac1079fc9382 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3559,7 +3559,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
/* send down all the barriers */
head = &info->fs_devices->devices;
list_for_each_entry_rcu(dev, head, dev_list) {
-   if (dev->missing)
+   if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state))
continue;
if (!dev->bdev)
continue;
@@ -3575,7 +3575,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
/* wait for all the barriers */
list_for_each_entry_rcu(dev, head, dev_list) {
-   if (dev->missing)
+   if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state))
continue;
if (!dev->bdev) {
errors_wait++;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 060fa93731e5..666c00cbee03 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2535,7 +2535,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 
logical, u64 len,
}
 
WARN_ON(sblock->page_count == 0);
-   if (dev->missing) {
+   if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state)) {
/*
 * This case should only be hit for RAID 5/6 device replace. See
 * the comment in scrub_missing_raid56_pages() for details.
@@ -2870,7 +2870,7 @@ static int scrub_extent_for_parity(struct scrub_parity 
*sparity,
u8 csum[BTRFS_CSUM_SIZE];
u32 blocksize;
 
-   if (dev->missing) {
+   if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state)) {
scrub_parity_mark_sectors_error(sparity, logical, len);
return 0;
}
@@ -4112,7 +4112,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
devid, u64 start,
 
mutex_lock(&fs_info->fs_devices->device_list_mutex);
dev = btrfs_find_device(fs_info, devid, NULL, NULL);
-   if (!dev || (dev->missing && !is_dev_replace)) {
+   if (!dev || (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) &&
+   !is_dev_replace)) {
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
return -ENODEV;
}
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9089aad2f3aa..4b70ae6f0245 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2243,7 +2243,7 @@ static int btrfs_show_devname(struct seq_file *m, struct 
dentry *root)
while (cur_devices) {
head = &cur_devices->devices;
list_for_each_entry(dev, head, dev_list) {
-   if (dev->missing)
+   if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state))
continue;
if (!dev->name)
continue;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 13a6ae80bee1..75839e07ce10 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -773,9 +773,9 @@ static noinline int device_list_add(const char *path,
return -ENOMEM;
rcu_string_free(device->name);
rcu_assign_pointer(device->name, name);
-   if (device->missing) {
+   if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
fs_devices->missing_devices--;
-   device->missing = 0;
+   clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
}
}
 
@@ -957,7 +957,7 @@ static void btrfs_prepare_close_one_device(struct 
btrfs_device *device)
fs_devices->rw_devices--;
}
 
-   if (device->missing)
+   if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
  

[PATCH 1/4] btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE

2017-11-28 Thread Anand Jain
Currently, device state is managed by individual int variables such
as struct btrfs_device::writeable. Instead of that, declare the
device state BTRFS_DEV_STATE_WRITEABLE and use bit operations.

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 12 ++
 fs/btrfs/extent-tree.c |  2 +-
 fs/btrfs/extent_io.c   |  3 ++-
 fs/btrfs/ioctl.c   |  2 +-
 fs/btrfs/scrub.c   |  3 ++-
 fs/btrfs/volumes.c | 63 +-
 fs/btrfs/volumes.h |  4 +++-
 7 files changed, 54 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dfdab849037b..0d361b6713e1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3563,7 +3563,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
continue;
if (!dev->bdev)
continue;
-   if (!dev->in_fs_metadata || !dev->writeable)
+   if (!dev->in_fs_metadata ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
continue;
 
write_dev_flush(dev);
@@ -3578,7 +3579,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
errors_wait++;
continue;
}
-   if (!dev->in_fs_metadata || !dev->writeable)
+   if (!dev->in_fs_metadata ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
continue;
 
ret = wait_dev_flush(dev);
@@ -3675,7 +3677,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int 
max_mirrors)
total_errors++;
continue;
}
-   if (!dev->in_fs_metadata || !dev->writeable)
+   if (!dev->in_fs_metadata ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
continue;
 
btrfs_set_stack_device_generation(dev_item, 0);
@@ -3714,7 +3717,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int 
max_mirrors)
list_for_each_entry_rcu(dev, head, dev_list) {
if (!dev->bdev)
continue;
-   if (!dev->in_fs_metadata || !dev->writeable)
+   if (!dev->in_fs_metadata ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
continue;
 
ret = wait_dev_supers(dev, max_mirrors);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e2d7e86b51d1..f81d928754e1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10935,7 +10935,7 @@ static int btrfs_trim_free_extents(struct btrfs_device 
*device,
*trimmed = 0;
 
/* Not writeable = nothing to do. */
-   if (!device->writeable)
+   if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
return 0;
 
/* No free space = nothing to do. */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7fa50e12f18e..f51c797847c7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2028,7 +2028,8 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 
ino, u64 start,
bio->bi_iter.bi_sector = sector;
dev = bbio->stripes[bbio->mirror_num - 1].dev;
btrfs_put_bbio(bbio);
-   if (!dev || !dev->bdev || !dev->writeable) {
+   if (!dev || !dev->bdev ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
btrfs_bio_counter_dec(fs_info);
bio_put(bio);
return -EIO;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 6c7a49faf4e0..8a74c83503d6 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1518,7 +1518,7 @@ static noinline int btrfs_ioctl_resize(struct file *file,
goto out_free;
}
 
-   if (!device->writeable) {
+   if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
btrfs_info(fs_info,
   "resizer unable to apply on readonly device %llu",
   devid);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e3f6c49e5c4d..e027e0de66a5 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4117,7 +4117,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
devid, u64 start,
return -ENODEV;
}
 
-   if (!is_dev_replace && !readonly && !dev->writeable) {
+   if (!is_dev_replace && !readonly &&
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
rcu_read_lock();
name = rcu_dereference(dev->name);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 397b335d108c..0f5be1808c6e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -634,10 +634,15 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 

Re: [PATCH 0/7] retry write on error

2017-11-28 Thread Anand Jain



On 11/29/2017 07:41 AM, p...@btrfs.list.sabi.co.uk wrote:

If the underlying protocal doesn't support retry and there
are some transient errors happening somewhere in our IO
stack, we'd like to give an extra chance for IO.



A limited number of retries may make sense, though I saw some
long stalls after retries on bad disks.


Indeed! One of the major issues in actual storage administration
is finding ways to reliably disable most retries, or to shorten
them, both at the block device level and the device level,
because in almost all cases where storage reliability matters
what is important is simply swapping out the failing device
immediately and then examining and possibly refreshing it
offline.

To the point that many device manufacturers deliberately cripple
retry-shortening or -disabling options in cheaper products to
force long stalls, so that people who care about reliability
more than price will buy the more expensive version that can
disable or shorten retries.


Seems preferable to avoid issuing retries when the underlying
transport layer(s) has already done so, but I am not sure
there is a way to know that at the fs level.


Indeed, and to use a euphemism, a third layer of retries at the
filesystem level is currently a thoroughly imbecilic idea :-),
as whether retries are worth doing is not a filesystem-dependent
issue (but then plugging is done at the block io level when it
is entirely device-dependent whether it is worth doing, so there
is famous precedent).

There are excellent reasons why error recovery has in general not
been done at the filesystem level for around 20 years, which do
not need repeating every time. One of them is that device
firmware does retries where that makes sense, and the block
device layer does retries too, which is often a bad idea; where
retrying is worthwhile, the block io level should do it, not the
filesystem.

A large part of the above discussion would not be needed if
Linux kernel "developers" exposed a clear notion of hardware
device and block device state machines and related semantics, or
even knew that this is desirable, but that's an idea that is
only 50 years old, so it may not yet have reached popularity :-).



 I agree with Ed and Peter; a similar opinion was posted here [1].

[1]
https://www.spinics.net/lists/linux-btrfs/msg70240.html

Thanks, Anand






[PATCH v3 2/3] btrfs-progs: fi defrag: do not exit if defrag range ioctl is unsupported

2017-11-28 Thread Su Yue
If the defrag range ioctl is unsupported, defrag will exit immediately.

Since the caller can handle the error, let cmd_filesystem_defrag()
close the file, break the loop and return the error instead of calling
exit(1).

Suggested-by: David Sterba 
Signed-off-by: Su Yue 
---
Changelog:
v2: Separate the patch from commit 6e991b9161fa ("btrfs-progs: fi
defrag: clean up duplicate code if find errors").
v3: Call close_file_or_dir() before breaking the loop.
---
 cmds-filesystem.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 17d399d58adf..232d4e88e0c0 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -1049,8 +1049,10 @@ static int cmd_filesystem_defrag(int argc, char **argv)
if (recursive && S_ISDIR(st.st_mode)) {
ret = nftw(argv[i], defrag_callback, 10,
FTW_MOUNT | FTW_PHYS);
-   if (ret == ENOTTY)
-   exit(1);
+   if (ret == ENOTTY) {
+   close_file_or_dir(fd, dirstream);
+   break;
+   }
/* errors are handled in the callback */
ret = 0;
} else {
-- 
2.15.0





Re: [PATCH v2 3/3] btrfs-progs: fi defrag: extend -c to drop nocompress flag on files

2017-11-28 Thread Su Yue



On 11/29/2017 12:07 AM, David Sterba wrote:

On Tue, Nov 28, 2017 at 05:14:50PM +0800, Su Yue wrote:

Now, files which have the nocompress flag will also be defragmented
with compression. However, the nocompress flag remains set and has to
be cleared manually.

So add an option '--clear-nocompress' to extend -c to drop the
nocompress flag after defragmentation.



Suggested-by: David Sterba 
Suggested-by: Anand Jain 


Do you have the pointer to the discussion? The whole idea sounds
familiar and seeing my name here means I must have been involved, but I
have only vague memories.



Here is the link:
https://www.spinics.net/lists/linux-btrfs/msg67529.html


Signed-off-by: Su Yue 
---
Changelog:
v2: Remove change about indentation of defrag_callback().

---
  cmds-filesystem.c | 65 ++-
  1 file changed, 64 insertions(+), 1 deletion(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 3931333f76c6..84242814798e 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -27,6 +27,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include "kerncompat.h"

  #include "ctree.h"
@@ -867,6 +868,8 @@ static const char * const cmd_filesystem_defrag_usage[] = {
"-l len  defragment only up to len bytes",
"-t size target extent size hint (default: 32M)",
"",
+   "--compress-force  clear nocompress flag on files after defragment, only work with option -c",


This is probably --clear-nocompress .



Oh... I forgot to change the usage info here.


The "force compress" semantics is up to kernel, so I think naming the
options 'clear-nocompress' is the right thing, as it's what it really
does.


+   "",
"Warning: most Linux kernels will break up the ref-links of COW data",
"(e.g., files copied with 'cp --reflink', snapshots) which may cause",
"considerable increase of space usage. See btrfs-filesystem(8) for",
@@ -874,9 +877,39 @@ static const char * const cmd_filesystem_defrag_usage[] = {
NULL
  };
  
+static int clear_nocompress_flag(int fd)

+{
+   unsigned int flags;
+   int ret = 0;
+
+   ret = ioctl(fd, FS_IOC_GETFLAGS, &flags);
+   if (ret < 0) {
+   ret = -errno;
+   error("failed to get flags: %s", strerror(-ret));
+   goto out;
+   }
+
+   if (!(flags & FS_NOCOMP_FL)) {
+   ret = 0;
+   goto out;
+   }
+   flags &= ~FS_NOCOMP_FL;
+   ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);


This is inherently racy, but the inode flags do not change that often,
so a concurrent change is unlikely.



I don't quite understand what you mean.
Could you explain it in more detail?

Thanks,
Su
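The race David refers to is the classic read-modify-write window between FS_IOC_GETFLAGS and FS_IOC_SETFLAGS: the two ioctls are not atomic as a pair, so if another process changes the inode flags in between, its change is silently overwritten by the SETFLAGS. A minimal sketch of one such interleaving (plain variables stand in for the ioctls; the flag values are the real ones from linux/fs.h):

```c
#include <assert.h>

#define FS_NOCOMP_FL    0x00000400  /* from linux/fs.h */
#define FS_IMMUTABLE_FL 0x00000010  /* from linux/fs.h */

/* Shared state standing in for the on-disk inode flags. */
static unsigned int inode_flags;

/*
 * One interleaving of two concurrent flag updates: process A clears
 * FS_NOCOMP_FL while process B sets FS_IMMUTABLE_FL between A's GET
 * and SET.  Returns the final flags; B's update is lost.
 */
static unsigned int lost_update_interleaving(void)
{
	inode_flags = FS_NOCOMP_FL;

	unsigned int a = inode_flags;   /* A: FS_IOC_GETFLAGS */
	inode_flags |= FS_IMMUTABLE_FL; /* B: sets a flag meanwhile */
	a &= ~FS_NOCOMP_FL;             /* A: modify in memory */
	inode_flags = a;                /* A: FS_IOC_SETFLAGS -- B's bit gone */

	return inode_flags;
}
```

The window is tiny and inode flags rarely change concurrently, which is why the race is considered acceptable here.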

+   if (ret < 0) {
+   ret = -errno;
+   error("failed to set flags: %s", strerror(-ret));
+   goto out;
+   }
+
+   ret = 0;
+out:
+   return ret;
+}
+
  static struct btrfs_ioctl_defrag_range_args defrag_global_range;
  static int defrag_global_verbose;
  static int defrag_global_errors;
+static int defrag_global_clear_nocompress;
  static int defrag_callback(const char *fpath, const struct stat *sb,
int typeflag, struct FTW *ftwbuf)
  {
@@ -904,6 +937,14 @@ static int defrag_callback(const char *fpath, const struct 
stat *sb,
err = errno;
goto error;
}
+
+   if (defrag_global_clear_nocompress) {
+   ret = clear_nocompress_flag(fd);
+   if (ret) {
+   err = -ret;
+   goto error;
+   }
+   }
}
return 0;
  
@@ -926,6 +967,12 @@ static int cmd_filesystem_defrag(int argc, char **argv)

int compress_type = BTRFS_COMPRESS_NONE;
DIR *dirstream;
  
+	enum { GETOPT_VAL_CLEAR_NOCOMPRESS = 257};

+   static const struct option long_options[] = {
+   { "clear-nocompress", no_argument, NULL,
+ GETOPT_VAL_CLEAR_NOCOMPRESS},
+   { NULL, 0, NULL, 0}
+   };
/*
 * Kernel has a different default (256K) that is supposed to be safe,
 * but it does not defragment very well. The 32M will likely lead to
@@ -937,8 +984,10 @@ static int cmd_filesystem_defrag(int argc, char **argv)
defrag_global_errors = 0;
defrag_global_verbose = 0;
defrag_global_errors = 0;
+   defrag_global_clear_nocompress = 0;
while(1) {
-   int c = getopt(argc, argv, "vrc::fs:l:t:");
+   int c = getopt_long(argc, argv, "vrc::fs:l:t:", long_options,
+   NULL);
if (c < 0)
break;
  
@@ -972,6 +1021,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)

case 'r':
recursive 
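The getopt_long() wiring in the patch above can be exercised standalone; GETOPT_VAL_CLEAR_NOCOMPRESS starts at 257 precisely so it cannot collide with any short option character. A minimal sketch of the same pattern (stand-alone helper, not the actual cmds-filesystem.c code):

```c
#include <getopt.h>
#include <stddef.h>

enum { GETOPT_VAL_CLEAR_NOCOMPRESS = 257 };

/* Parse argv and report whether --clear-nocompress and -c were seen. */
static void parse_defrag_opts(int argc, char **argv,
			      int *clear_nocompress, int *compress)
{
	static const struct option long_options[] = {
		{ "clear-nocompress", no_argument, NULL,
		  GETOPT_VAL_CLEAR_NOCOMPRESS },
		{ NULL, 0, NULL, 0 }
	};
	int c;

	*clear_nocompress = 0;
	*compress = 0;
	optind = 1;	/* reset getopt state for repeated calls */
	while ((c = getopt_long(argc, argv, "vrc::fs:l:t:",
				long_options, NULL)) >= 0) {
		switch (c) {
		case 'c':
			*compress = 1;
			break;
		case GETOPT_VAL_CLEAR_NOCOMPRESS:
			*clear_nocompress = 1;
			break;
		default:
			break;
		}
	}
}
```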

Re: Read before you deploy btrfs + zstd

2017-11-28 Thread Nick Terrell

> On Nov 28, 2017, at 3:49 PM, David Sterba  wrote:
> 
> On Tue, Nov 28, 2017 at 09:31:57PM +, Nick Terrell wrote:
>> 
>>> On Nov 21, 2017, at 8:22 AM, David Sterba  wrote:
>>> 
>>> On Wed, Nov 15, 2017 at 08:09:15PM +, Nick Terrell wrote:
 On 11/15/17, 6:41 AM, "David Sterba"  wrote:
> The branch is now in a state that can be tested. Turns out the memory
> requirements are too much for grub, so the boot fails with "not enough
> memory". The calculated value
> 
> ZSTD_BTRFS_MAX_INPUT: 131072
> ZSTD_DStreamWorkspaceBound with ZSTD_BTRFS_MAX_INPUT: 549424
> 
> This is not something I could fix easily, we'd probably need a tuned
> version of ZSTD for grub constraints. Adding Nick to CC.
 
 If I understand the grub code correctly, we only need to read, and we have
 the entire input and output buffer in one segment. In that case you can use
 ZSTD_initDCtx(), and ZSTD_decompressDCtx(). ZSTD_DCtxWorkspaceBound() is
 only 155984. See decompress_single() in
 https://patchwork.kernel.org/patch/9997909/ for an example.
>>> 
>>> Does not help, still ENOMEM.
>> 
>> It looks like XZ had the same issue, and they make the decompression
>> context a static object (grep for GRUB_EMBED_DECOMPRESSOR). We could
>> potentially do the same and statically allocate the workspace:
>> 
>> ```
>> /* Could also be size_t */
>> #define BTRFS_ZSTD_WORKSPACE_SIZE_U64 (155984 / sizeof(uint64_t))
>> static uint64_t workspace[BTRFS_ZSTD_WORKSPACE_SIZE_U64];
>> 
>> /* ... */
>> 
>> assert(sizeof(workspace) >= ZSTD_DCtxWorkspaceBound());
>> ```
> 
> Interesting, thanks for the tip, I'll try it next.
> 
> I've meanwhile tried to tweak the numbers, the maximum block for zstd,
> that squeezed the DCtx somewhere under 48k, with block size 8k. Still
> enomem.

Sadly the block size has to stay 128 KiB, since that is the block size that
is used for compression.
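The static-workspace approach suggested above trades heap allocation for a fixed BSS buffer plus a size check at init time. A sketch of the pattern (the bound function here is a stand-in for ZSTD_DCtxWorkspaceBound(), with the 155984-byte figure quoted in the thread hard-coded as an assumption):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in for ZSTD_DCtxWorkspaceBound(); the real function derives
 * this from compression parameters.  155984 is the value quoted above.
 */
static size_t dctx_workspace_bound(void)
{
	return 155984;
}

/* Statically allocated workspace, uint64_t elements for alignment. */
#define BTRFS_ZSTD_WORKSPACE_SIZE_U64 (155984 / sizeof(uint64_t))
static uint64_t workspace[BTRFS_ZSTD_WORKSPACE_SIZE_U64];

/* Verify once at init time that the static buffer is large enough. */
static int workspace_fits(void)
{
	return sizeof(workspace) >= dctx_workspace_bound();
}
```

Since the workspace lives in BSS, no allocator call is made at decompression time, which is exactly what the grub environment needs.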

> I've tried to add some debugging prints to see what numbers get actually
> passed to the allocator, but did not see anything printed.  I'm sure
> there is a more intelligent way to test the grub changes.  So far each
> test loop takes quite some time, as I build the rpm package, test it in
> a VM and have to recreate the environment each time.

Is the code in github.com/kdave/grub in the btrfs-zstd branch up to date?
btrfs.c:1230 looks like it will error for zstd compression, but not with
ENOMEM. btrfs.c:1272 [2] looks like a typo. Maybe the second is causing
the compressed block to be interpreted as uncompressed, and causes a
too-large allocation?

[1] https://github.com/kdave/grub/blob/btrfs-zstd/grub-core/fs/btrfs.c#L1230
[2] https://github.com/kdave/grub/blob/btrfs-zstd/grub-core/fs/btrfs.c#L1272
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/7] Btrfs: retry write for non-raid56

2017-11-28 Thread Liu Bo
On Wed, Nov 22, 2017 at 04:41:10PM +0200, Nikolay Borisov wrote:
> 
> 
> On 22.11.2017 02:35, Liu Bo wrote:
> > If the underlying protocol doesn't support retry and there are some
> > transient errors happening somewhere in our IO stack, we'd like to
> > give an extra chance for IO.
> > 
> > In btrfs, read retry is handled in bio_readpage_error() with the retry
> > unit being page size, for write retry however, we're going to do it
> > in a different way, as a write may consist of several writes onto
> > different stripes, retry write needs to be done right after the IO on
> > each stripe completes and reaches to endio.
> > 
> > As write errors are really corner cases, performance impact is not
> > considered.
> > 
> > This adds code to retry write on errors _sector by sector_ in order to
> > find out which sector is really bad so that we can mark it somewhere.
> > And since endio is running in interrupt context, the real retry work
> > is scheduled in system_unbound_wq in order to get a normal context.
> > 
> > Signed-off-by: Liu Bo 
> > ---
> >  fs/btrfs/volumes.c | 162 
> > +
> >  fs/btrfs/volumes.h |   3 +
> >  2 files changed, 142 insertions(+), 23 deletions(-)
> > 
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 853f9ce..c11db0b 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -6023,34 +6023,150 @@ static void __btrfs_end_bio(struct bio *bio)
> > }
> >  }
> >  
> > -static void btrfs_end_bio(struct bio *bio)
> > +static inline struct btrfs_device *get_device_from_bio(struct bio *bio)
> >  {
> > struct btrfs_bio *bbio = bio->bi_private;
> > -   int is_orig_bio = 0;
> > +   unsigned int stripe_index = btrfs_io_bio(bio)->stripe_index;
> > +
> > +   BUG_ON(stripe_index >= bbio->num_stripes);
> > +   return bbio->stripes[stripe_index].dev;
> > +}
> > +
> > +/*
> > + * return 1 if every sector retry returns successful.
> > + * return 0 if one or more sector retries fails.
> > + */
> > +int btrfs_narrow_write_error(struct bio *bio, struct btrfs_device *dev)
> > +{
> > +   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> > +   u64 sectors_to_write;
> > +   u64 offset;
> > +   u64 orig;
> > +   u64 unit;
> > +   u64 block_sectors;
> > +   int ok = 1;
> > +   struct bio *wbio;
> > +
> > +   /* offset and unit are bytes aligned, not 512-bytes aligned. */
> > +   sectors_to_write = io_bio->iter.bi_size >> 9;
> > +   orig = io_bio->iter.bi_sector;
> > +   offset = 0;
> > +   block_sectors = bdev_logical_block_size(dev->bdev) >> 9;
> > +   unit = block_sectors;
> > +   ASSERT(unit == 1);
> > +
> > +   while (1) {
> > +   if (!sectors_to_write)
> > +   break;
> > +   /*
> > +* LIUBO: I don't think unit > sectors_to_write could
> > +* happen, sectors_to_write should be aligned to PAGE_SIZE
> > +* which is > unit.  Just in case.
> > +*/
> > +   if (unit > sectors_to_write) {
> > +   WARN_ONCE(1, "unit %llu > sectors_to_write (%llu)\n", 
> > unit, sectors_to_write);
> > +   unit = sectors_to_write;
> > +   }
> > +
> > +   /* write @unit bytes at @offset */
> > +   /* this would never fail, check btrfs_bio_clone(). */
> > +   wbio = btrfs_bio_clone(bio);
> > +   wbio->bi_opf = REQ_OP_WRITE;
> > +   wbio->bi_iter = io_bio->iter;
> > +
> > +   bio_trim(wbio, offset, unit);
> > +   bio_copy_dev(wbio, bio);
> > +
> > +   /* submit in sync way */
> > +   /*
> > +* LIUBO: There is an issue, if this bio is quite
> > +* large, say 1M or 2M, and sector size is just 512,
> > +* then this may take a while.
> > +*
> > +* May need to schedule the job to workqueue.
> > +*/
> > +   if (submit_bio_wait(wbio) < 0) {
> > +   ok = 0 && ok;
> > +   /*
> > +* This is not correct if badblocks is enabled
> > +* as we need to record every bad sector by
> > +* trying sectors one by one.
> > +*/
> > +   break;
> > +   }
> > +
> > +   bio_put(wbio);
> > +   offset += unit;
> > +   sectors_to_write -= unit;
> > +   unit = block_sectors;
> > +   }
> > +   return ok;
> > +}
> > +
> > +void btrfs_record_bio_error(struct bio *bio, struct btrfs_device *dev)
> > +{
> > +   if (bio->bi_status == BLK_STS_IOERR ||
> > +   bio->bi_status == BLK_STS_TARGET) {
> > +   if (dev->bdev) {
> > +   if (bio_data_dir(bio) == WRITE)
> > +   btrfs_dev_stat_inc(dev,
> > +  BTRFS_DEV_STAT_WRITE_ERRS);
> > +   else
> > +   btrfs_dev_stat_inc(dev,
> > +  
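The narrowing loop in btrfs_narrow_write_error() above can be modelled outside the kernel: split the failed range into device-block-sized units, resubmit each unit synchronously, and stop (or record the bad unit) on the first failure. A simplified user-space sketch with a pluggable submit callback (everything here is a stand-in, not kernel API; -5 is an illustrative error code):

```c
#include <stdint.h>

typedef uint64_t u64;

/* Stand-in for submit_bio_wait(): returns 0 on success, <0 on error. */
typedef int (*submit_fn)(u64 sector, u64 nr_sectors);

/*
 * Retry a failed write unit by unit.  Returns 1 if every unit
 * succeeded, 0 if any unit failed (mirroring the kernel helper).
 */
static int narrow_write_error(u64 first_sector, u64 sectors_to_write,
			      u64 block_sectors, submit_fn submit)
{
	u64 offset = 0;
	u64 unit = block_sectors;

	while (sectors_to_write) {
		if (unit > sectors_to_write)
			unit = sectors_to_write;	/* trim the tail */

		if (submit(first_sector + offset, unit) < 0)
			return 0;	/* would record the bad sector here */

		offset += unit;
		sectors_to_write -= unit;
		unit = block_sectors;
	}
	return 1;
}

/* Example submit callback: fails exactly when sector 5 is covered. */
static int fail_at_5(u64 sector, u64 nr_sectors)
{
	return (sector <= 5 && 5 < sector + nr_sectors) ? -5 : 0;
}
```

With a unit of one sector, the loop isolates the failing sector; with badblocks support it would keep going and record every bad unit instead of breaking out.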

Re: [PATCH] btrfs: handle dynamically reappearing missing device

2017-11-28 Thread Liu Bo
On Fri, Nov 17, 2017 at 07:53:29PM +0800, Anand Jain wrote:
> 
> 
> On 11/17/2017 07:28 AM, Liu Bo wrote:
> > On Sun, Nov 12, 2017 at 06:56:50PM +0800, Anand Jain wrote:
> > > If the device is not present at the time of (-o degrade) mount
> > > the mount context will create a dummy missing struct btrfs_device.
> > > Later this device may reappear after the FS is mounted.
> 
> > This commit log doesn't explain what would happen when the missing
> > device re-appears.
> 
>  Hm. You mean in the current design without this patch?
>  Just nothing; it gives a false impression to the user via
> 'btrfs dev ready' and 'btrfs fi show' that the missing device
>  has been included. Will update the change log.
> 
> > > So this
> > > patch handles that case by going through the open_device steps
> 
> > > which this device missed and finally adds to the device alloc list.
> > > 
> > > So now with this patch, to bring back the missing device user can run,
> > > 
> > > btrfs dev scan 
> > 
> > Most common distros already have some udev rules of btrfs to process a
> > reappeared disk.
> 
>  Right. Do you see anything that will break ?
>  Planning to add comments [1] in v2 for more clarity around this.
> 
> > > 
> > > Signed-off-by: Anand Jain 
> > > ---
> > > This patch needs:
> > >   [PATCH 0/4]  factor __btrfs_open_devices()
> > > 
> > >   fs/btrfs/volumes.c | 43 +--
> > >   1 file changed, 41 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > > index d24e966ee29f..e7dd996831f2 100644
> > > --- a/fs/btrfs/volumes.c
> > > +++ b/fs/btrfs/volumes.c
> > > @@ -769,8 +769,47 @@ static noinline int device_list_add(const char *path,
> > >   rcu_string_free(device->name);
> > >   rcu_assign_pointer(device->name, name);
> > >   if (device->missing) {
> > > - fs_devices->missing_devices--;
> > > - device->missing = 0;
> > > + int ret;
> > > + struct btrfs_fs_info *fs_info = fs_devices->fs_info;
> > > + fmode_t fmode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
> > > +
> > > + if (btrfs_super_flags(disk_super) & 
> > > BTRFS_SUPER_FLAG_SEEDING)
> > > + fmode &= ~FMODE_WRITE;
> > > +
> > > + /*
> > > +  * Missing can be set only when FS is mounted.
> > > +  * So here its always fs_devices->opened > 0 and most
> > > +  * of the struct device members are already updated by
> > > +  * the mount process even if this device was missing, so
> > > +  * now follow the normal open device procedure for this
> > > +  * device. The scrub will take care of filling the
> > > +  * missing stripes for raid56 and balance for raid1 and
> > > +  * raid10.
> > > +  */
> > 
> > 
> > The idea looks good to me.
> > 
> > What if users skip balance/scrub given both could take a while?
> > 
> > Then btrfs takes the disks back and reads content from it, and it's OK
> > for raid1/10 as they may have a good copy,
> 
>  Yes, except for the split brain situation for which patches are
>  in the ML, review comments appreciated.
> 
> > but in case of raid56, it
> > might hit the reconstruction bug if two disks are missing and added
> > back, which I tried to address recently.
> 
> > At least ->writable should not be set until balance/scrub completes
> > the re-sync job.
> 
>  Hm. Why ?
>

(Sorry for the late reply.)

On second thought, write is OK since parity will be calculated along
writes.

Will check the v2.


Thanks,

-liubo


Re: Read before you deploy btrfs + zstd

2017-11-28 Thread David Sterba
On Tue, Nov 28, 2017 at 09:31:57PM +, Nick Terrell wrote:
> 
> > On Nov 21, 2017, at 8:22 AM, David Sterba  wrote:
> > 
> > On Wed, Nov 15, 2017 at 08:09:15PM +, Nick Terrell wrote:
> >> On 11/15/17, 6:41 AM, "David Sterba"  wrote:
> >>> The branch is now in a state that can be tested. Turns out the memory
> >>> requirements are too much for grub, so the boot fails with "not enough
> >>> memory". The calculated value
> >>> 
> >>> ZSTD_BTRFS_MAX_INPUT: 131072
> >>> ZSTD_DStreamWorkspaceBound with ZSTD_BTRFS_MAX_INPUT: 549424
> >>> 
> > > This is not something I could fix easily, we'd probably need a tuned
> >>> version of ZSTD for grub constraints. Adding Nick to CC.
> >> 
> >> If I understand the grub code correctly, we only need to read, and we have
> >> the entire input and output buffer in one segment. In that case you can use
> >> ZSTD_initDCtx(), and ZSTD_decompressDCtx(). ZSTD_DCtxWorkspaceBound() is
> >> only 155984. See decompress_single() in
> >> https://patchwork.kernel.org/patch/9997909/ for an example.
> > 
> > Does not help, still ENOMEM.
> 
> It looks like XZ had the same issue, and they make the decompression
> context a static object (grep for GRUB_EMBED_DECOMPRESSOR). We could
> potentially do the same and statically allocate the workspace:
> 
> ```
> /* Could also be size_t */
> #define BTRFS_ZSTD_WORKSPACE_SIZE_U64 (155984 / sizeof(uint64_t))
> static uint64_t workspace[BTRFS_ZSTD_WORKSPACE_SIZE_U64];
> 
> /* ... */
> 
> assert(sizeof(workspace) >= ZSTD_DCtxWorkspaceBound());
> ```

Interesting, thanks for the tip, I'll try it next.

I've meanwhile tried to tweak the numbers, the maximum block for zstd,
that squeezed the DCtx somewhere under 48k, with block size 8k. Still
enomem.

I've tried to add some debugging prints to see what numbers get actually
passed to the allocator, but did not see anything printed.  I'm sure
there is a more intelligent way to test the grub changes.  So far each
test loop takes quite some time, as I build the rpm package, test it in
a VM and have to recreate the environment each time.


Re: [PATCH 0/7] retry write on error

2017-11-28 Thread Peter Grandi
>>> If the underlying protocal doesn't support retry and there
>>> are some transient errors happening somewhere in our IO
>>> stack, we'd like to give an extra chance for IO.

>> A limited number of retries may make sense, though I saw some
>> long stalls after retries on bad disks.

Indeed! One of the major issues in actual storage administration
is to find ways to reliably disable most retries, or to shorten
them, both at the block device level and the device level,
because in almost all cases where storage reliability matters
what is important is simply swapping out the failing device
immediately and then examining and possibly refreshing it
offline.

To the point that many device manufacturers deliberately cripple
in cheaper products retry shortening or disabling options to
force long stalls, so that people who care about reliability
more than price will buy the more expensive version that can
disable or shorten retries.

> Seems preferable to avoid issuing retries when the underlying
> transport layer(s) has already done so, but I am not sure
> there is a way to know that at the fs level.

Indeed, and to use a euphemism, a third layer of retries at the
filesystem level is currently a thoroughly imbecilic idea :-),
as whether retries are worth doing is not a filesystem-dependent
issue (but then plugging is done at the block io level when it
is entirely device dependent whether it is worth doing, so there
is famous precedent).

There are excellent reasons why error recovery is in general not
done at the filesystem level since around 20 years ago, which do
not need repeating every time. However one of them is that where
it makes sense device firmware does retries, and the block
device layer does retries too, which is often a bad idea, and
where it is not, the block io level should do that, not the
filesystem.

A large part of the above discussion would not be needed if
Linux kernel "developers" exposed a clear notion of hardware
device and block device state machine and related semantics, or
even knew that it were desirable, but that's an idea that is
only 50 years old, so may not have yet reached popularity :-).


Re: [PATCH 0/7] retry write on error

2017-11-28 Thread Edmund Nadolski


On 11/28/2017 12:22 PM, David Sterba wrote:
> On Tue, Nov 21, 2017 at 05:35:51PM -0700, Liu Bo wrote:
>> If the underlying protocol doesn't support retry and there are some
>> transient errors happening somewhere in our IO stack, we'd like to
>> give an extra chance for IO.  Or sometimes you see btrfs reporting
>> 'wrr 1 flush 0 read 0 blabla' but the disk drive is 100% good, this
>> retry may help a bit.
> 
> A limited number of retries may make sense, though I saw some long
> stalls after retries on bad disks. Tracking the retries would be a good
> addition to the dev stats, ie. a soft error but still worth reporting.

Seems preferable to avoid issuing retries when the underlying transport
layer(s) has already done so, but I am not sure there is a way to know
that at the fs level. Likewise for cases where a retry cannot succeed
(transport failures? media errors?).  It can get tricky to know how long
to delay between retries and when it is appropriate to stop retrying.

Ed
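The questions Ed raises about delay between retries and when to stop are usually handled with a bounded retry count plus exponential backoff. A generic sketch of that policy (the numbers are illustrative, not anything btrfs does; -5 is an illustrative error code):

```c
/* Stand-in for an IO submission: 0 on success, <0 on error. */
typedef int (*io_fn)(void *ctx);

/*
 * Retry @io up to @max_tries times, doubling the backoff between
 * attempts.  Returns the last status and the number of attempts
 * made via @tries.
 */
static int submit_with_retry(io_fn io, void *ctx, int max_tries,
			     int *tries)
{
	int status = -1;
	int delay_ms = 10;	/* illustrative initial backoff */
	int i;

	for (i = 0; i < max_tries; i++) {
		status = io(ctx);
		if (status == 0)
			break;
		/* a real caller would sleep delay_ms here */
		delay_ms *= 2;
	}
	*tries = (i < max_tries) ? i + 1 : max_tries;
	return status;
}

/* Example: an IO that succeeds on its third attempt. */
static int flaky_io(void *ctx)
{
	int *calls = ctx;
	return (++*calls >= 3) ? 0 : -5;
}
```

The hard part Ed points out remains: the policy cannot tell transient errors (worth retrying) from permanent ones (media errors, where retrying only stalls), which is why a fixed small bound is the usual compromise.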

>> In btrfs, read retry is handled in bio_readpage_error() with the retry
>> unit being page size, for write retry however, we're going to do it in
>> a different way, as a write may consist of several writes onto
>> different stripes, retry write needs to be done right after the IO on
>> each stripe completes and arrives at endio.
>>
>> Patch 1-3 are the implementation of retry write on error for
>> non-raid56 profile.  Patch 4-6 are for raid56 profile.  Both raid56
>> and non-raid56 shares one retry function helper.
>>
>> Patch 3 does retry sector by sector, but since this patch set doesn't
>> included badblocks support, patch 7 changes it back to retry the whole
>> bio.  (I didn't fold patch 7 to patch 3 in the hope of just reverting
>> patch 7 once badblocks support is done, but I'm open to it.)
> 
> What does 'badblocks' refer to? I know about the badblocks utility that
> find and reportts bad blocks, possibly ext2 understands that and avoids
> allocating them. Btrfs does not have such support.


Re: How about adding an ioctl to convert a directory to a subvolume?

2017-11-28 Thread Timofey Titovets
2017-11-28 21:48 GMT+03:00 David Sterba :
> On Mon, Nov 27, 2017 at 05:41:56PM +0800, Lu Fengqi wrote:
>> As we all know, under certain circumstances, it is more appropriate to
>> create some subvolumes rather than keep everything in the same
>> subvolume. As the condition of demand change, the user may need to
>> convert a previous directory to a subvolume. For this reason,how about
>> adding an ioctl to convert a directory to a subvolume?
>
> I'd say too difficult to get everything right in kernel. This is
> possible to be done in userspace, with existing tools.
>
> The problem is that the conversion cannot be done atomically in most
> cases, so even if it's just one ioctl call, there are several possible
> intermediate states that would exist during the call. Reporting where
> did the ioctl fail would need some extended error code semantics.
>
>> Users can convert by the scripts mentioned in this
>> thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is
>> it easier to use the off-the-shelf btrfs subcommand?
>
> Adding a subcommand would work, though I'd rather avoid reimplementing
> 'cp -ax' or 'rsync -ax'.  We want to copy the files preserving all
> attributes, with reflink, and be able to identify partially synced
> files, and not cross the mountpoints or subvolumes.
>
> The middle step with snapshotting the containing subvolume before
> syncing the data is also a valid option, but not always necessary.
>
>> After an initial consideration, our implementation is broadly divided
>> into the following steps:
>> 1. Freeze the filesystem or set the subvolume above the source directory
>> to read-only;
>
> Freezing the filesystme will freeze all IO, so this would not work, but
> I understand what you mean. The file data are synced before the snapshot
> is taken, but nothing prevents applications to continue writing data.
>
> Open and live files is a problem and don't see a nice solution here.
>
>> 2. Perform a pre-check, for example, check if a cross-device link
>> creation during the conversion;
>
> Cross-device links are not a problem as long as we use 'cp' ie. the
> manual creation of files in the target.
>
>> 3. Perform conversion, such as creating a new subvolume and moving the
>> contents of the source directory;
>> 4. Thaw the filesystem or restore the subvolume writable property.
>>
>> In fact, I am not so sure whether this use of freeze is appropriate
>> because the source directory the user needs to convert may be located
>> at / or /home and this pre-check and conversion process may take a long
>> time, which can lead to some shell and graphical application suspended.
>
> I think the closest operation is a read-only remount, which is not
> always possible due to open files and can otherwise considered as quite
> intrusive operation to the whole system. And the root filesystem cannot
> be easily remounted read-only in the systemd days anyway.

My 2c,
If we are talking about a 'fast' conversion of a dir to a subvolume
(i.e. I like the idea that the ioctl call should be fast), it can be
done like this (sorry if I misunderstood something and this is a
rant, or I'm crazy..):

To make the idea clearer, for userspace that can look like:
1. Create a snapshot of the parent subvol of that dir
2. Clean up all data except the content of the dir in the snapshot
3. Move the content of that dir to the snapshot root
4. Replace the dir with that snapshot/subvol
i.e. no copy, no cp, only rename() and garbage collecting.

In the kernel that would in "theory" look like:
1. Copy the subvol root inode
2. Replace the root inode with the target dir inode
3. Replace the target dir in the old subvol with the new subvol
4. GC the old dir content from the parent subvol, GC all content
outside the dir in the new subvol

That may be the fastest way for the user, but it will not solve the
problems with open files etc.; still, it should be fast from the
user's point of view, and all the other stuff can simply be cleaned
up in the background.

Thanks
-- 
Have a nice day,
Timofey.


Re: [PATCH v7 1/5] add infrastructure for tagging functions as error injectable

2017-11-28 Thread Josef Bacik
On Tue, Nov 28, 2017 at 11:58:41AM -0700, Jonathan Corbet wrote:
> On Wed, 22 Nov 2017 16:23:30 -0500
> Josef Bacik  wrote:
> 
> > From: Josef Bacik 
> > 
> > Using BPF we can override kprob'ed functions and return arbitrary
> > values.  Obviously this can be a bit unsafe, so make this feature opt-in
> > for functions.  Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
> > order to give BPF access to that function for error injection purposes.
> > 
> > Signed-off-by: Josef Bacik 
> > Acked-by: Ingo Molnar 
> > ---
> >  arch/x86/include/asm/asm.h|   6 ++
> >  include/asm-generic/vmlinux.lds.h |  10 +++
> >  include/linux/bpf.h   |  11 +++
> >  include/linux/kprobes.h   |   1 +
> >  include/linux/module.h|   5 ++
> >  kernel/kprobes.c  | 163 
> > ++
> >  kernel/module.c   |   6 +-
> >  7 files changed, 201 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
> > index b0dc91f4bedc..340f4cc43255 100644
> > --- a/arch/x86/include/asm/asm.h
> > +++ b/arch/x86/include/asm/asm.h
> > @@ -85,6 +85,12 @@
> > _ASM_PTR (entry);   \
> > .popsection
> >  
> > +# define _ASM_KPROBE_ERROR_INJECT(entry)   \
> > +   .pushsection "_kprobe_error_inject_list","aw" ; \
> > +   _ASM_ALIGN ;\
> > +   _ASM_PTR (entry);   \
> > +   .popseciton
> 
> So this stuff is not my area of greatest expertise, but I do have to wonder
> how ".popseciton" can work ... ?
> 

Well fuck, do you want me to send an incremental patch, Daniel/Alexei, or resend
this patch fixed?  Thanks,

Josef


Re: How about adding an ioctl to convert a directory to a subvolume?

2017-11-28 Thread Austin S. Hemmelgarn

On 2017-11-28 13:48, David Sterba wrote:

On Mon, Nov 27, 2017 at 05:41:56PM +0800, Lu Fengqi wrote:

As we all know, under certain circumstances, it is more appropriate to
create some subvolumes rather than keep everything in the same
subvolume. As the condition of demand change, the user may need to
convert a previous directory to a subvolume. For this reason,how about
adding an ioctl to convert a directory to a subvolume?


I'd say too difficult to get everything right in kernel. This is
possible to be done in userspace, with existing tools.

The problem is that the conversion cannot be done atomically in most
cases, so even if it's just one ioctl call, there are several possible
intermediate states that would exist during the call. Reporting where
did the ioctl fail would need some extended error code semantics.

I think you mean it can't be done atomically in an inexpensive manner
without significant work on the kernel side.  It should in theory be 
possible to do it atomically by watching for and mirroring changes from 
the source directory to the new subvolume.  Such an approach is however 
expensive, and is not guaranteed to ever finish if the source directory 
is under active usage.  The only issue is updating open file descriptors 
to point to the new files.


In short, the flow I'm thinking of is:

1. Create the subvolume in a temporary state that would get cleaned up 
by garbage collection if the FS got remounted.
2. Start watching the directory structure in the source directory for 
changes, recursively, and mirror all such changes to the subvolume as 
they happen.
3. For each file in the source directory and sub-directories, create a 
file in the new subvolume using a reflink, and add start watching that 
file for changes.  Reflink any detected updates to the temporary file.

Beyond this point, there are two possible methods to finish things:
A:
4. Freeze all userspace I/O to the filesystem.
5. Update the dentry for the source directory to point to the new 
subvolume, remove the subvolume's 'temporary' status, and force a commit.
6. Update all open file descriptors to point to the files in the 
new subvolume.

7. Thaw all userspace I/O to the filesystem.
8. Garbage collect the source directory and it's contents.
or:
B:
4. Update the dentry for the source directory to point to the new 
subvolume, remove the subvolume's temporary status, and force a commit.
5. Keep the old data around until there are no references to it, 
and indirect opens on files that were already open prior to step 4 to 
point to the old file, while keeping watches around for the old files 
that were open.


Prior to step A5 or B4, this can be atomically rolled back by simply 
nuking the temporary subvolume and removing the watches.  After those 
steps, it's fully complete as far as the on-device state is concerned. 
Method A of completing the conversion has less overall impact on 
long-term operation of the system, but may require significant changes 
to the VFS API to be viable (I don't know if some of the overlay stuff 
could be used or not).  Method B will continue to negatively impact 
performance until all the files that were open are closed, but shouldn't 
require as much legwork on the kernel side.  In both cases, it ends up 
working similarly to the device replace operation, or LVM's pvmove 
operation, both of which can be made atomic.


For what it's worth, I am of the opinion that this would be nice to have 
not so that stuff could be converted on-line, but so that you can 
convert a directory more easily off-line.  Right now, it's a serious 
pain in the arse to do such a conversion (the Python code I linked is 
simple because it's using high-level operations out of the shutil 
standard module, and/or the reflink module on PyPI), largely because it 
is decidedly non-trivial to actually copy all the data about a file (and 
both `cp -ax` and `rsync -ax` miss information other than reflinks, most 
notably file attributes normally set by `chattr`).


I personally care less about the atomicity of the operation than the 
fact that it actually preserves _everything_ (with the likely exception 
of EVM and IMA xattrs, but _nothing_ should be preserving those).  IOW, 
I would be perfectly fine with something that does this in the kernel 
but returns -EWHATEVER if there are open files below the source 
directory and blocks modification to them until the switch is done.



Users can convert by the scripts mentioned in this
thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is
it easier to use the off-the-shelf btrfs subcommand?


Adding a subcommand would work, though I'd rather avoid reimplementing
'cp -ax' or 'rsync -ax'.  We want to copy the files preserving all
attributes, with reflink, and be able to identify partially synced
files, and not cross the mountpoints or subvolumes.

The middle step with snapshotting the containing subvolume before
syncing the data is also a 

Re: About 'key type for persistent [...] items'

2017-11-28 Thread David Sterba
On Tue, Nov 28, 2017 at 08:28:58PM +0100, Hans van Kranenburg wrote:
> >> E.g. if I wanted to (just a random idea) add per device statistics, and
> >> use this, I'd need to use the key also with objectid 1, 2, 3, etc... if
> >> I have multiple devices. That's already a no go if there's anyone in any
> >> other tree that is doing anything with any objectid in the range of
> >> valid device numbers.
> > 
> > In that case there would be a new objectid PER_DEVICE_OBJECTID, with the
> > persistent key, and all the device ids can go to the offset field. The
> > objectid field should not be dynamic, by design.
> 
> Ok, didn't think of that yet, clear.
> 
> So... just a random idea...

You're not the first one to have that idea :)

> How cool would it be to have the block group items being done this
> way... PER_CHUNK_OBJECTID.
> 
> Imagine the time to mount a large btrfs filesystem going down from
> minutes or tens of minutes to just a few seconds...

Exactly for that reason, but in such case we're allowed to grab a new
key type and set objectid to a value that would group all the bg items,
maybe also moving them to the beginning of the key space.

The slightly tricky part would be the backward compatibility when we'd
have to keep the existing bg and new-bg in sync, but there are no
fundamental obstacles.
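
The speed-up would come from key locality: today's BLOCK_GROUP_ITEM keys are
scattered across the extent tree by bytenr, while grouping them under one
objectid turns them into a single contiguous key range that one tree search
can return. A rough sketch with a sorted key list standing in for the b-tree
(the grouping objectid and sample bytenr values are made up; 192 is the
BLOCK_GROUP_ITEM key type):

```python
import bisect

# Keys are (objectid, type, offset) tuples that sort like btrfs keys.
BLOCK_GROUP_ITEM_KEY = 192
GROUPED_OBJECTID = 7            # hypothetical grouping objectid

def items_in_range(keys, lo, hi):
    """One range lookup over a sorted key list (a stand-in for a
    b-tree search): return all keys in [lo, hi)."""
    return keys[bisect.bisect_left(keys, lo):bisect.bisect_left(keys, hi)]

bytenrs = [0, 5 << 30, 9 << 30]

# Today: bg items keyed by bytenr, interleaved with extent items,
# so mount effectively does one search per block group.
scattered = sorted([(b, BLOCK_GROUP_ITEM_KEY, 1 << 30) for b in bytenrs] +
                   [(2 << 30, 168, 4096), (6 << 30, 169, 0)])

# Grouped: all bg items share one objectid, bytenr moves to offset,
# so a single contiguous range query returns every one of them.
grouped = sorted((GROUPED_OBJECTID, BLOCK_GROUP_ITEM_KEY, b) for b in bytenrs)
bg_items = items_in_range(grouped,
                          (GROUPED_OBJECTID, BLOCK_GROUP_ITEM_KEY, 0),
                          (GROUPED_OBJECTID, BLOCK_GROUP_ITEM_KEY + 1, 0))
```

With the grouped layout, one contiguous leaf walk replaces a seek per block
group, which is where the mount time goes on large filesystems.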
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: About 'key type for persistent [...] items'

2017-11-28 Thread Hans van Kranenburg
Ok, just want to add one more thing :)

On 11/28/2017 08:12 PM, David Sterba wrote:
> On Tue, Nov 28, 2017 at 07:00:28PM +0100, Hans van Kranenburg wrote:
>> On 11/28/2017 06:34 PM, David Sterba wrote:
>>> On Fri, Nov 24, 2017 at 08:16:05PM +0100, Hans van Kranenburg wrote:
 Last week, when implementing the automatic classifier to dynamically
 create tree item data objects by key type in python-btrfs, I ran into
 the following commits in btrfs-progs:

   commit 8609c8bad68528f668d9ce564b868aa4828107a0
   btrfs-progs: print-tree: factor out temporary_item dump
 and
   commit a4b65f00d53deb1b495728dd58253af44fcf70df
   btrfs-progs: print-tree: factor out persistent_item dump

 ...which are related to kernel...

   commit 50c2d5abe64c1726b48d292a2ab04f60e8238933
   btrfs: introduce key type for persistent permanent items
 and
   commit 0bbbccb17fea86818e1a058faf5903aefd20b31a
   btrfs: introduce key type for persistent temporary items

 Afaics the goal is to overload types because there can be only 256 in
 total. However, I'm missing the design decisions behind the
 implementation of it. It's not in the commit messages, and it hasn't
 been on the mailing list.
>>>
>>> The reason is to avoid wasting key types but still allow storing new types
>>> of data in the btrees, if they were not part of the on-disk format.
>>>
>>> I'm not sure if this has been discussed or mentioned under some patches
>>> or maybe unrelated patches. I do remember that I discussed that with
>>> Chris in private on IRC and have the logs, dated 2015-09-02.
>>>
>>> At that time the balance item and dev stats item were introduced, maybe
>>> also the qgroup status item type. This had me alarmed enough to
>>> reconsider how the keys are allocated.
>>>
 Before, there was a 1:1 mapping from key types to data structures. Now,
 with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems items
 which use this type can be using any data structure they want, so it's
 some kind of YOLO_ITEM_KEY.
>>>
>>> In some sense it is, so it's key+objectid to determine the structure.
>>>
 The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. For
 example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with
 case BTRFS_DEV_STATS_OBJECTID.

 However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means
 that if I want to have another tree where BTRFS_MOUTON_OBJECTID is also
 0, and I'm storing a btrfs_kebab_item struct indexed at
 (BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c
 will try to parse the data by calling print_dev_stats?
>>>
>>> As answered by Qu, you can't use 0 for BTRFS_MOUTON_OBJECTID in that
>>> case.
>>
>> (I'm just thinking out loud here, if you think I'm wasting your time
>> just say.)
>>
>> Yes, so the objectid numbers have to be "registered" / "reserved" in the
>> documentation, and they have to be unique over all trees.
> 
> Right.
> 
>> Maybe the information I was looking for is... in what cases should or
>> shouldn't this be used? Because that limits the possible usage quite a
>> bit. Or is it only for very very specific things.
> 
> The keys are not free for use, they need to have a defined meaning and
> are sort of part of the on-disk format. There must be some use case and
> reasoning why it's necessary to be done that way and not in another,
> like with xattrs.
> 
> I don't have an example now, but I could imagine some per-tree
> information that can be tracked and updated at commit time.
> 
>> E.g. if I wanted to (just a random idea) add per device statistics, and
>> use this, I'd need to use the key also with objectid 1, 2, 3, etc... if
>> I have multiple devices. That's already a no go if there's anyone in any
>> other tree that is doing anything with any objectid in the range of
>> valid device numbers.
> 
> In that case there would be a new objectid PER_DEVICE_OBJECTID, with the
> persistent key, and all the device ids can go to the offset field. The
> objectid field should not be dynamic, by design.

Ok, didn't think of that yet, clear.

So... just a random idea...

How cool would it be to have the block group items being done this
way... PER_CHUNK_OBJECTID.

Imagine the time to mount a large btrfs filesystem going down from
minutes or tens of minutes to just a few seconds...

\:D/

 What's the idea behind that? Instead of having the key type field define
 the struct and meaning, we now suddenly need the tuple (tree, objectid,
 type), and we need all three to determine what's inside the item data?
 So, the code in print_tree.c would also need to know about the tree
 number and pass that into the different functions.
>>>
>>> No, all key types, even the persistent/temporary are independent of the
>>> tree type. So it's only type <-> structure mapping, besides
>>> persistent/temporary types.
>>
>> Yeah, I wasn't explicit about that, I meant only for the
>> persistent/temporary case yes.

Re: [PATCH 0/7] retry write on error

2017-11-28 Thread David Sterba
On Tue, Nov 21, 2017 at 05:35:51PM -0700, Liu Bo wrote:
> If the underlying protocol doesn't support retry and there are some
> transient errors happening somewhere in our IO stack, we'd like to
> give an extra chance for IO.  Or sometimes you see btrfs reporting
> 'wrr 1 flush 0 read 0 blabla' but the disk drive is 100% good; this
> retry may help a bit.

A limited number of retries may make sense, though I saw some long
stalls after retries on bad disks. Tracking the retries would be a good
addition to the dev stats, i.e. a soft error but still worth reporting.

> In btrfs, read retry is handled in bio_readpage_error() with the retry
> unit being page size; for write retry, however, we're going to do it in
> a different way: as a write may consist of several writes onto
> different stripes, the write retry needs to be done right after the IO
> on each stripe completes and arrives at endio.
> 
> Patch 1-3 are the implementation of retry write on error for
> non-raid56 profile.  Patch 4-6 are for raid56 profile.  Both raid56
> and non-raid56 share one retry function helper.
> 
> Patch 3 does retry sector by sector, but since this patch set doesn't
> include badblocks support, patch 7 changes it back to retry the whole
> bio.  (I didn't fold patch 7 into patch 3 in the hope of just reverting
> patch 7 once badblocks support is done, but I'm open to it.)

What does 'badblocks' refer to? I know about the badblocks utility that
finds and reports bad blocks; possibly ext2 understands that and avoids
allocating them. Btrfs does not have such support.
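
A bounded-retry scheme with the retry count surfaced as a soft error in the
dev stats, as suggested above, could look like this sketch; `submit_io`, the
stats keys and the retry limit are illustrative assumptions, not btrfs
interfaces:

```python
MAX_RETRIES = 3  # illustrative bound, not a btrfs constant

def submit_with_retry(submit_io, dev_stats):
    """Submit an IO, retrying a bounded number of times on failure.
    Each retry is counted as a soft error; only exhausting all
    retries counts as a hard write error."""
    for attempt in range(1 + MAX_RETRIES):
        if submit_io():
            return True                  # IO succeeded
        if attempt < MAX_RETRIES:
            dev_stats["retries"] += 1    # soft error, still reported
    dev_stats["write_errs"] += 1         # hard error after all retries
    return False
```

A transient error that clears after two attempts then bumps only the soft
counter, while a genuinely bad disk still fails after a bounded delay instead
of stalling indefinitely.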


Re: About 'key type for persistent [...] items'

2017-11-28 Thread David Sterba
On Tue, Nov 28, 2017 at 07:00:28PM +0100, Hans van Kranenburg wrote:
> On 11/28/2017 06:34 PM, David Sterba wrote:
> > On Fri, Nov 24, 2017 at 08:16:05PM +0100, Hans van Kranenburg wrote:
> >> Last week, when implementing the automatic classifier to dynamically
> >> create tree item data objects by key type in python-btrfs, I ran into
> >> the following commits in btrfs-progs:
> >>
> >>   commit 8609c8bad68528f668d9ce564b868aa4828107a0
> >>   btrfs-progs: print-tree: factor out temporary_item dump
> >> and
> >>   commit a4b65f00d53deb1b495728dd58253af44fcf70df
> >>   btrfs-progs: print-tree: factor out persistent_item dump
> >>
> >> ...which are related to kernel...
> >>
> >>   commit 50c2d5abe64c1726b48d292a2ab04f60e8238933
> >>   btrfs: introduce key type for persistent permanent items
> >> and
> >>   commit 0bbbccb17fea86818e1a058faf5903aefd20b31a
> >>   btrfs: introduce key type for persistent temporary items
> >>
> >> Afaics the goal is to overload types because there can be only 256 in
> >> total. However, I'm missing the design decisions behind the
> >> implementation of it. It's not in the commit messages, and it hasn't
> >> been on the mailing list.
> > 
> > The reason is to avoid wasting key types but still allow storing new types
> > of data in the btrees, if they were not part of the on-disk format.
> > 
> > I'm not sure if this has been discussed or mentioned under some patches
> > or maybe unrelated patches. I do remember that I discussed that with
> > Chris in private on IRC and have the logs, dated 2015-09-02.
> > 
> > At that time the balance item and dev stats item were introduced, maybe
> > also the qgroup status item type. This had me alarmed enough to
> > reconsider how the keys are allocated.
> > 
> >> Before, there was a 1:1 mapping from key types to data structures. Now,
> >> with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems items
> >> which use this type can be using any data structure they want, so it's
> >> some kind of YOLO_ITEM_KEY.
> > 
> > In some sense it is, so it's key+objectid to determine the structure.
> > 
> >> The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. For
> >> example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with
> >> case BTRFS_DEV_STATS_OBJECTID.
> >>
> >> However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means
> >> that if I want to have another tree where BTRFS_MOUTON_OBJECTID is also
> >> 0, and I'm storing a btrfs_kebab_item struct indexed at
> >> (BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c
> >> will try to parse the data by calling print_dev_stats?
> > 
> > As answered by Qu, you can't use 0 for BTRFS_MOUTON_OBJECTID in that
> > case.
> 
> (I'm just thinking out loud here, if you think I'm wasting your time
> just say.)
> 
> Yes, so the objectid numbers have to be "registered" / "reserved" in the
> documentation, and they have to be unique over all trees.

Right.

> Maybe the information I was looking for is... in what cases should or
> shouldn't this be used? Because that limits the possible usage quite a
> bit. Or is it only for very very specific things.

The keys are not free for use, they need to have a defined meaning and
are sort of part of the on-disk format. There must be some use case and
reasoning why it's necessary to be done that way and not in another,
like with xattrs.

I don't have an example now, but I could imagine some per-tree
information that can be tracked and updated at commit time.

> E.g. if I wanted to (just a random idea) add per device statistics, and
> use this, I'd need to use the key also with objectid 1, 2, 3, etc... if
> I have multiple devices. That's already a no go if there's anyone in any
> other tree that is doing anything with any objectid in the range of
> valid device numbers.

In that case there would be a new objectid PER_DEVICE_OBJECTID, with the
persistent key, and all the device ids can go to the offset field. The
objectid field should not be dynamic, by design.

> >> What's the idea behind that? Instead of having the key type field define
> >> the struct and meaning, we now suddenly need the tuple (tree, objectid,
> >> type), and we need all three to determine what's inside the item data?
> >> So, the code in print_tree.c would also need to know about the tree
> >> number and pass that into the different functions.
> > 
> > No, all key types, even the persistent/temporary are independent of the
> > tree type. So it's only type <-> structure mapping, besides
> > persistent/temporary types.
> 
> Yeah, I wasn't explicit about that, I meant only for the
> persistent/temporary case yes.

So for this case it's type + objectid, the tree independence stays.
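
The type + objectid mapping can be sketched as a two-level dispatch,
mirroring what print-tree has to do; 249 is BTRFS_PERSISTENT_ITEM_KEY and 0
is BTRFS_DEV_STATS_OBJECTID in the current headers, while
PER_DEVICE_OBJECTID and its structure name are hypothetical, taken from the
discussion above:

```python
PERSISTENT_ITEM_KEY = 249   # BTRFS_PERSISTENT_ITEM_KEY
DEV_STATS_OBJECTID = 0      # BTRFS_DEV_STATS_OBJECTID
PER_DEVICE_OBJECTID = 1     # hypothetical, would need to be reserved

# For persistent items the objectid picks the structure; the offset
# is then free for per-instance data such as the device id.
PERSISTENT_STRUCTS = {
    DEV_STATS_OBJECTID: "btrfs_dev_stats_item",
    PER_DEVICE_OBJECTID: "btrfs_per_device_item",  # hypothetical
}

def structure_for(objectid, key_type):
    """Return the structure name stored under a key, the way a
    print-tree style dispatcher would."""
    if key_type == PERSISTENT_ITEM_KEY:
        return PERSISTENT_STRUCTS.get(objectid, "unknown persistent item")
    # every other key type maps 1:1 to a structure, independent of tree
    return "structure determined by key type alone"
```

Note that the tree id never enters the dispatch, which is the tree
independence discussed above.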


Re: [PATCH 1/5] btrfs-progs: Add location check when process share_data_ref item

2017-11-28 Thread David Sterba
On Fri, Nov 24, 2017 at 06:41:28PM +0800, Gu Jinxiang wrote:
> The following test failed because share_data_ref is added into
> extent_cache when dealing with the root tree node.

The whole function run_next_block would need to be revisited with
respect to error handling and sanity checks. Adding them only when a
fuzzed image hits the particular code path is not IMHO the best
approach.

If there's some fuzzed test case, we should try to find all similar
missing checks and fix them before moving to another type. Addressing
only the failed tests gives a false sense of fixing; there are usually
more similar bugs.


Re: [PATCH v7 1/5] add infrastructure for tagging functions as error injectable

2017-11-28 Thread Jonathan Corbet
On Wed, 22 Nov 2017 16:23:30 -0500
Josef Bacik  wrote:

> From: Josef Bacik 
> 
> Using BPF we can override kprob'ed functions and return arbitrary
> values.  Obviously this can be a bit unsafe, so make this feature opt-in
> for functions.  Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
> order to give BPF access to that function for error injection purposes.
> 
> Signed-off-by: Josef Bacik 
> Acked-by: Ingo Molnar 
> ---
>  arch/x86/include/asm/asm.h|   6 ++
>  include/asm-generic/vmlinux.lds.h |  10 +++
>  include/linux/bpf.h   |  11 +++
>  include/linux/kprobes.h   |   1 +
>  include/linux/module.h|   5 ++
>  kernel/kprobes.c  | 163 
> ++
>  kernel/module.c   |   6 +-
>  7 files changed, 201 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
> index b0dc91f4bedc..340f4cc43255 100644
> --- a/arch/x86/include/asm/asm.h
> +++ b/arch/x86/include/asm/asm.h
> @@ -85,6 +85,12 @@
>   _ASM_PTR (entry);   \
>   .popsection
>  
> +# define _ASM_KPROBE_ERROR_INJECT(entry) \
> + .pushsection "_kprobe_error_inject_list","aw" ; \
> + _ASM_ALIGN ;\
> + _ASM_PTR (entry);   \
> + .popseciton

So this stuff is not my area of greatest expertise, but I do have to wonder
how ".popseciton" can work ... ?

Thanks,

jon


Re: How about adding an ioctl to convert a directory to a subvolume?

2017-11-28 Thread David Sterba
On Mon, Nov 27, 2017 at 05:41:56PM +0800, Lu Fengqi wrote:
> As we all know, under certain circumstances, it is more appropriate to
> create some subvolumes rather than keep everything in the same
> subvolume. As the condition of demand change, the user may need to
> convert a previous directory to a subvolume. For this reason,how about
> adding an ioctl to convert a directory to a subvolume?

I'd say it's too difficult to get everything right in the kernel. This
can be done in userspace with existing tools.

The problem is that the conversion cannot be done atomically in most
cases, so even if it's just one ioctl call, there are several possible
intermediate states that would exist during the call. Reporting where
the ioctl failed would need some extended error code semantics.

> Users can convert using the scripts mentioned in this
> thread (https://www.spinics.net/lists/linux-btrfs/msg33252.html), but
> wouldn't it be easier to use an off-the-shelf btrfs subcommand?

Adding a subcommand would work, though I'd rather avoid reimplementing
'cp -ax' or 'rsync -ax'.  We want to copy the files preserving all
attributes, with reflink, and be able to identify partially synced
files, and not cross the mountpoints or subvolumes.

The middle step with snapshotting the containing subvolume before
syncing the data is also a valid option, but not always necessary.

> After an initial consideration, our implementation is broadly divided
> into the following steps:
> 1. Freeze the filesystem or set the subvolume above the source directory
> to read-only;

Freezing the filesystem will freeze all IO, so this would not work, but
I understand what you mean. The file data are synced before the snapshot
is taken, but nothing prevents applications from continuing to write data.

Open and live files are a problem and I don't see a nice solution here.

> 2. Perform a pre-check, for example, check if a cross-device link
> creation during the conversion;

Cross-device links are not a problem as long as we use 'cp' ie. the
manual creation of files in the target.

> 3. Perform conversion, such as creating a new subvolume and moving the
> contents of the source directory;
> 4. Thaw the filesystem or restore the subvolume writable property.
> 
> In fact, I am not so sure whether this use of freeze is appropriate
> because the source directory the user needs to convert may be located
> at / or /home and this pre-check and conversion process may take a long
> time, which can lead to some shell and graphical application suspended.

I think the closest operation is a read-only remount, which is not
always possible due to open files and can otherwise be considered a
quite intrusive operation for the whole system. And the root filesystem
cannot be easily remounted read-only in the systemd days anyway.
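
The four-step plan quoted above is inherently non-atomic, which is what
makes in-kernel error reporting hard: a failure can leave the filesystem in
any of several intermediate states. A toy simulation of that property (the
step names follow the plan; nothing here touches a real filesystem):

```python
STEPS = ["freeze", "precheck", "convert", "thaw"]

def convert_dir_to_subvol(fail_at=None):
    """Run the conversion steps in order; on failure, report which
    step failed and which steps had already completed, i.e. the
    intermediate state the caller would have to recover from."""
    completed = []
    for step in STEPS:
        if step == fail_at:
            return {"ok": False, "failed_step": step, "completed": completed}
        completed.append(step)
    return {"ok": True, "failed_step": None, "completed": completed}
```

A single errno cannot carry the "completed" part of the result, which is the
extended error code semantics the ioctl would need.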


Re: About 'key type for persistent [...] items'

2017-11-28 Thread Hans van Kranenburg
On 11/28/2017 06:34 PM, David Sterba wrote:
> On Fri, Nov 24, 2017 at 08:16:05PM +0100, Hans van Kranenburg wrote:
>> Last week, when implementing the automatic classifier to dynamically
>> create tree item data objects by key type in python-btrfs, I ran into
>> the following commits in btrfs-progs:
>>
>>   commit 8609c8bad68528f668d9ce564b868aa4828107a0
>>   btrfs-progs: print-tree: factor out temporary_item dump
>> and
>>   commit a4b65f00d53deb1b495728dd58253af44fcf70df
>>   btrfs-progs: print-tree: factor out persistent_item dump
>>
>> ...which are related to kernel...
>>
>>   commit 50c2d5abe64c1726b48d292a2ab04f60e8238933
>>   btrfs: introduce key type for persistent permanent items
>> and
>>   commit 0bbbccb17fea86818e1a058faf5903aefd20b31a
>>   btrfs: introduce key type for persistent temporary items
>>
>> Afaics the goal is to overload types because there can be only 256 in
>> total. However, I'm missing the design decisions behind the
>> implementation of it. It's not in the commit messages, and it hasn't
>> been on the mailing list.
> 
> The reason is to avoid wasting key types but still allow storing new types
> of data in the btrees, if they were not part of the on-disk format.
> 
> I'm not sure if this has been discussed or mentioned under some patches
> or maybe unrelated patches. I do remember that I discussed that with
> Chris in private on IRC and have the logs, dated 2015-09-02.
> 
> At that time the balance item and dev stats item were introduced, maybe
> also the qgroup status item type. This had me alarmed enough to
> reconsider how the keys are allocated.
> 
>> Before, there was a 1:1 mapping from key types to data structures. Now,
>> with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems items
>> which use this type can be using any data structure they want, so it's
>> some kind of YOLO_ITEM_KEY.
> 
> In some sense it is, so it's key+objectid to determine the structure.
> 
>> The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. For
>> example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with
>> case BTRFS_DEV_STATS_OBJECTID.
>>
>> However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means
>> that if I want to have another tree where BTRFS_MOUTON_OBJECTID is also
>> 0, and I'm storing a btrfs_kebab_item struct indexed at
>> (BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c
>> will try to parse the data by calling print_dev_stats?
> 
> As answered by Qu, you can't use 0 for BTRFS_MOUTON_OBJECTID in that
> case.

(I'm just thinking out loud here, if you think I'm wasting your time
just say.)

Yes, so the objectid numbers have to be "registered" / "reserved" in the
documentation, and they have to be unique over all trees.

Maybe the information I was looking for is... in what cases should or
shouldn't this be used? Because that limits the possible usage quite a
bit. Or is it only for very very specific things.

E.g. if I wanted to (just a random idea) add per device statistics, and
use this, I'd need to use the key also with objectid 1, 2, 3, etc... if
I have multiple devices. That's already a no go if there's anyone in any
other tree that is doing anything with any objectid in the range of
valid device numbers.

>> What's the idea behind that? Instead of having the key type field define
>> the struct and meaning, we now suddenly need the tuple (tree, objectid,
>> type), and we need all three to determine what's inside the item data?
>> So, the code in print_tree.c would also need to know about the tree
>> number and pass that into the different functions.
> 
> No, all key types, even the persistent/temporary are independent of the
> tree type. So it's only type <-> structure mapping, besides
> persistent/temporary types.

Yeah, I wasn't explicit about that, I meant only for the
persistent/temporary case yes.

-- 
Hans van Kranenburg


Re: About 'key type for persistent [...] items'

2017-11-28 Thread David Sterba
On Sat, Nov 25, 2017 at 09:12:35AM +0800, Qu Wenruo wrote:
> On 2017年11月25日 03:16, Hans van Kranenburg wrote:
> > Last week, when implementing the automatic classifier to dynamically
> > create tree item data objects by key type in python-btrfs, I ran into
> > the following commits in btrfs-progs:
> > 
> >   commit 8609c8bad68528f668d9ce564b868aa4828107a0
> >   btrfs-progs: print-tree: factor out temporary_item dump
> > and
> >   commit a4b65f00d53deb1b495728dd58253af44fcf70df
> >   btrfs-progs: print-tree: factor out persistent_item dump
> > 
> > ...which are related to kernel...
> > 
> >   commit 50c2d5abe64c1726b48d292a2ab04f60e8238933
> >   btrfs: introduce key type for persistent permanent items
> > and
> >   commit 0bbbccb17fea86818e1a058faf5903aefd20b31a
> >   btrfs: introduce key type for persistent temporary items
> > 
> > Afaics the goal is to overload types because there can be only 256 in
> > total.
> 
> Personally speaking, to overload types, we can easily make different
> meanings of type based on tree owner.

A key type tied to a tree would cause some ambiguity, while now we
always know what's in an item if we e.g. read a random metadata block.

We're now at ~40 key types, scattered in the 256 range so further
extensions are possible should we need a particular number due to the
ordering.


Re: About 'key type for persistent [...] items'

2017-11-28 Thread David Sterba
On Fri, Nov 24, 2017 at 08:16:05PM +0100, Hans van Kranenburg wrote:
> Last week, when implementing the automatic classifier to dynamically
> create tree item data objects by key type in python-btrfs, I ran into
> the following commits in btrfs-progs:
> 
>   commit 8609c8bad68528f668d9ce564b868aa4828107a0
>   btrfs-progs: print-tree: factor out temporary_item dump
> and
>   commit a4b65f00d53deb1b495728dd58253af44fcf70df
>   btrfs-progs: print-tree: factor out persistent_item dump
> 
> ...which are related to kernel...
> 
>   commit 50c2d5abe64c1726b48d292a2ab04f60e8238933
>   btrfs: introduce key type for persistent permanent items
> and
>   commit 0bbbccb17fea86818e1a058faf5903aefd20b31a
>   btrfs: introduce key type for persistent temporary items
> 
> Afaics the goal is to overload types because there can be only 256 in
> total. However, I'm missing the design decisions behind the
> implementation of it. It's not in the commit messages, and it hasn't
> been on the mailing list.

The reason is to avoid wasting key types but still allow storing new types
of data in the btrees, if they were not part of the on-disk format.

I'm not sure if this has been discussed or mentioned under some patches
or maybe unrelated patches. I do remember that I discussed that with
Chris in private on IRC and have the logs, dated 2015-09-02.

At that time the balance item and dev stats item were introduced, maybe
also the qgroup status item type. This had me alarmed enough to
reconsider how the keys are allocated.

> Before, there was a 1:1 mapping from key types to data structures. Now,
> with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems items
> which use this type can be using any data structure they want, so it's
> some kind of YOLO_ITEM_KEY.

In some sense it is, so it's key+objectid to determine the structure.

> The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. For
> example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with
> case BTRFS_DEV_STATS_OBJECTID.
> 
> However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means
> that if I want to have another tree where BTRFS_MOUTON_OBJECTID is also
> 0, and I'm storing a btrfs_kebab_item struct indexed at
> (BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c
> will try to parse the data by calling print_dev_stats?

As answered by Qu, you can't use 0 for BTRFS_MOUTON_OBJECTID in that
case.

> What's the idea behind that? Instead of having the key type field define
> the struct and meaning, we now suddenly need the tuple (tree, objectid,
> type), and we need all three to determine what's inside the item data?
> So, the code in print_tree.c would also need to know about the tree
> number and pass that into the different functions.

No, all key types, even the persistent/temporary are independent of the
tree type. So it's only type <-> structure mapping, besides
persistent/temporary types.


Re: [PATCH v7 0/4] Add the ability to do BPF directed error injection

2017-11-28 Thread Daniel Borkmann
On 11/22/2017 10:23 PM, Josef Bacik wrote:
> This is hopefully the final version, I've addressed the comment by Ingo and
> added his Acks.
> 
> v6->v7:
> - moved the opt-in macro to bpf.h out of kprobes.h.
> 
> v5->v6:
> - add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
>   feature.  This way only functions that opt-in will be allowed to be
>   overridden.
> - added a btrfs patch to allow error injection for open_ctree() so that the
>   bpf sample actually works.
> 
> v4->v5:
> - disallow kprobe_override programs from being put in the prog map array so we
>   don't tail call into something we didn't check.  This allows us to make the
>   normal path still fast without a bunch of percpu operations.
> 
> v3->v4:
> - fix a build error found by kbuild test bot (I didn't wait long enough
>   apparently.)
> - Added a warning message as per Daniel's suggestion.
> 
> v2->v3:
> - added a ->kprobe_override flag to bpf_prog.
> - added some sanity checks to disallow attaching bpf progs that have
>   ->kprobe_override set that aren't for ftrace kprobes.
> - added the trace_kprobe_ftrace helper to check if the trace_event_call is a
>   ftrace kprobe.
> - renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read
>   this value in the kprobe path, and thus only write to it if we're
>   overriding or clearing the override.
> 
> v1->v2:
> - moved things around to make sure that bpf_override_return could really
>   only be used for an ftrace kprobe.
> - killed the special return values from trace_call_bpf.
> - renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
>   it was being called from an ftrace kprobe context.
> - reworked the logic in kprobe_perf_func to take advantage of
>   bpf_kprobe_state.
> - updated the test as per Alexei's review.
> 
> - Original message -
> 
> A lot of our error paths are not well tested because we have no good way of
> injecting errors generically.  Some subsystems (block, memory) have ways to
> inject errors, but they are random so it's hard to get reproducible results.
> 
> With BPF we can add determinism to our error injection.  We can use kprobes
> and other things to verify we are injecting errors at the exact case we are
> trying to test.  This patch gives us the tool to actually do the error
> injection part.
> It is very simple, we just set the return value of the pt_regs we're given to
> whatever we provide, and then override the PC with a dummy function that
> simply returns.
> 
> Right now this only works on x86, but it would be simple enough to expand to
> other architectures.  Thanks,

Ok, given the remaining feedback from Ingo was addressed and therefore
the series acked, I've applied it to bpf-next tree, thanks Josef.


Re: [PATCH] Btrfs: incremental send, fix wrong unlink path after renaming file

2017-11-28 Thread David Sterba
On Fri, Nov 17, 2017 at 02:48:30PM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Under some circumstances, an incremental send operation can issue wrong
> paths for unlink commands related to files that have multiple hard links
> and some (or all) of those links were renamed between the parent and send
> snapshots. Consider the following example:
> 
> Parent snapshot
> 
>  .  (ino 256)
>  | a/   (ino 257)
>  | | b/ (ino 259)
>  | | | c/   (ino 260)
>  | | | f2   (ino 261)
>  | |
>  | | f2l1   (ino 261)
>  |
>  | d/   (ino 262)
>   | f1l1_2 (ino 258)
>   | f2l2   (ino 261)
>   | f1_2   (ino 258)
> 
> Send snapshot
> 
>  .  (ino 256)
>  | a/   (ino 257)
>  | | f2l1/  (ino 263)
>  | | b2/(ino 259)
>  |   | c/   (ino 260)
>  |   | | d3 (ino 262)
>  |   |   | f1l1_2   (ino 258)
>  |   |   | f2l2_2   (ino 261)
>  |   |   | f1_2 (ino 258)
>  |   |
>  |   | f2   (ino 261)
>  |   | f1l2 (ino 258)
>  |
>  | d(ino 261)
> 
> When computing the incremental send stream the following steps happen:
> 
> 1) When processing inode 261, a rename operation is issued that renames
>    inode 262, which currently has a path of "d", to an orphan name of
>    "o262-7-0". This is done because in the send snapshot, inode 261 has
>    one of its hard links with a path of "d" as well.
> 
> 2) Two link operations are issued that create the new hard links for
>    inode 261, whose names are "d" and "f2l2_2", at paths "/" and
>    "o262-7-0/" respectively.
> 
> 3) Still while processing inode 261, unlink operations are issued to
>    remove the old hard links of inode 261, with names "f2l1" and "f2l2",
>    at paths "a/" and "d/". However path "d/" does not correspond anymore
>    to the directory inode 262 but corresponds instead to a hard link of
>    inode 261 (link command issued in the previous step). This makes the
>    receiver fail with an ENOTDIR error when attempting the unlink
>    operation.
> 
> The problem happens because, before sending the unlink operation, we failed
> to detect that inode 262 was one of the ancestors of inode 261 in the parent
> snapshot, and therefore we didn't recompute the path for inode 262 before
> issuing the unlink operation for the link named "f2l2" of inode 261. The
> detection failed because the function "is_ancestor()" only follows the
> first hard link it finds for an inode instead of all of its hard links
> (as it was originally created for being used with directories only, for
> which only one hard link exists). So fix this by making "is_ancestor()"
> follow all hard links of the input inode.
> 
> A test case for fstests follows soon.
> 
> Signed-off-by: Filipe Manana 

FYI, added to the rc2 pull.


Re: [PATCH] btrfs-progs: mkfs: fix verbose value

2017-11-28 Thread David Sterba
On Fri, Nov 24, 2017 at 02:24:17PM +0900, Misono, Tomohiro wrote:
> The value of 'verbose' is either 1 (default) or 0 (-q)
> and "verbose >= 2" will not be true.
> 
> After fixing this, we get something like:
>  adding device /dev/sde id 2
>  adding device /dev/sdf id 3
> during mkfs time when multiple devices are used.

I think the original idea was to allow different levels of verbosity,
and the "adding device" messages do not bring much value to be printed
by default.

The verbose/quiet options will be handled globally so this code will get
cleaned at that time.


Re: [PATCH v2 1/3] btrfs-progs: fi defrag: clean up duplicate code if find errors

2017-11-28 Thread David Sterba
On Tue, Nov 28, 2017 at 05:14:48PM +0800, Su Yue wrote:
> In function cmd_filesystem_defrag(), lines of code for error handling
> are duplicated and hard to extend further.
> 
> Create a jump label for errors.
> 
> Signed-off-by: Su Yue 

Applied, thanks.


Re: [PATCH v2 3/3] btrfs-progs: fi defrag: extend -c to drop nocompress flag on files

2017-11-28 Thread David Sterba
On Tue, Nov 28, 2017 at 05:14:50PM +0800, Su Yue wrote:
> Now, files which have the nocompress flag will also be defragmented
> with compression. However, the nocompress flag still exists afterwards
> and has to be cleared manually.
> 
> So add an option '--clear-nocompress' to extend -c to drop the
> nocompress flag after defragmentation.

> Suggested-by: David Sterba 
> Suggested-by: Anand Jain 

Do you have the pointer to the discussion? The whole idea sounds
familiar and seeing my name here means I must have been involved, but I
have only vague memories.

> Signed-off-by: Su Yue 
> ---
> Changelog:
> v2:   Remove change about indentation of defrag_callback().
> 
> ---
>  cmds-filesystem.c | 65 
> ++-
>  1 file changed, 64 insertions(+), 1 deletion(-)
> 
> diff --git a/cmds-filesystem.c b/cmds-filesystem.c
> index 3931333f76c6..84242814798e 100644
> --- a/cmds-filesystem.c
> +++ b/cmds-filesystem.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "kerncompat.h"
>  #include "ctree.h"
> @@ -867,6 +868,8 @@ static const char * const cmd_filesystem_defrag_usage[] = 
> {
>   "-l len  defragment only up to len bytes",
>   "-t size target extent size hint (default: 32M)",
>   "",
> + "--compress-forceclear nocompress flag on files after defragment, 
> only work with option -c",

This is probably --clear-nocompress .

The "force compress" semantics are up to the kernel, so I think naming
the option 'clear-nocompress' is the right thing, as that is what it
really does.

> + "",
>   "Warning: most Linux kernels will break up the ref-links of COW data",
>   "(e.g., files copied with 'cp --reflink', snapshots) which may cause",
>   "considerable increase of space usage. See btrfs-filesystem(8) for",
> @@ -874,9 +877,39 @@ static const char * const cmd_filesystem_defrag_usage[] 
> = {
>   NULL
>  };
>  
> +static int clear_nocompress_flag(int fd)
> +{
> + unsigned int flags;
> + int ret = 0;
> +
> + ret = ioctl(fd, FS_IOC_GETFLAGS, );
> + if (ret < 0) {
> + ret = -errno;
> + error("failed to get flags: %s", strerror(-ret));
> + goto out;
> + }
> +
> + if (!(flags & FS_NOCOMP_FL)) {
> + ret = 0;
> + goto out;
> + }
> + flags &= ~FS_NOCOMP_FL;
> + ret = ioctl(fd, FS_IOC_SETFLAGS, );

This is inherently racy, but the inode flags do not change that often,
so a concurrent change is unlikely.

> + if (ret < 0) {
> + ret = -errno;
> + error("failed to set flags: %s", strerror(-ret));
> + goto out;
> + }
> +
> + ret = 0;
> +out:
> + return ret;
> +}
> +
>  static struct btrfs_ioctl_defrag_range_args defrag_global_range;
>  static int defrag_global_verbose;
>  static int defrag_global_errors;
> +static int defrag_global_clear_nocompress;
>  static int defrag_callback(const char *fpath, const struct stat *sb,
>   int typeflag, struct FTW *ftwbuf)
>  {
> @@ -904,6 +937,14 @@ static int defrag_callback(const char *fpath, const 
> struct stat *sb,
>   err = errno;
>   goto error;
>   }
> +
> + if (defrag_global_clear_nocompress) {
> + ret = clear_nocompress_flag(fd);
> + if (ret) {
> + err = -ret;
> + goto error;
> + }
> + }
>   }
>   return 0;
>  
> @@ -926,6 +967,12 @@ static int cmd_filesystem_defrag(int argc, char **argv)
>   int compress_type = BTRFS_COMPRESS_NONE;
>   DIR *dirstream;
>  
> + enum { GETOPT_VAL_CLEAR_NOCOMPRESS = 257};
> + static const struct option long_options[] = {
> + { "clear-nocompress", no_argument, NULL,
> +   GETOPT_VAL_CLEAR_NOCOMPRESS},
> + { NULL, 0, NULL, 0}
> + };
>   /*
>* Kernel has a different default (256K) that is supposed to be safe,
>* but it does not defragment very well. The 32M will likely lead to
> @@ -937,8 +984,10 @@ static int cmd_filesystem_defrag(int argc, char **argv)
>   defrag_global_errors = 0;
>   defrag_global_verbose = 0;
>   defrag_global_errors = 0;
> + defrag_global_clear_nocompress = 0;
>   while(1) {
> - int c = getopt(argc, argv, "vrc::fs:l:t:");
> + int c = getopt_long(argc, argv, "vrc::fs:l:t:", long_options,
> + NULL);
>   if (c < 0)
>   break;
>  
> @@ -972,6 +1021,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)
>   case 'r':
>   recursive = 1;
>   break;
> + case GETOPT_VAL_CLEAR_NOCOMPRESS:
> + defrag_global_clear_nocompress = 1;
> + 

Re: [PATCH v2 2/3] btrfs-progs: fi defrag: do not exit if defrag range ioctl is unsupported

2017-11-28 Thread David Sterba
On Tue, Nov 28, 2017 at 05:14:49PM +0800, Su Yue wrote:
> If ioctl of defrag range is unsupported, defrag will exit
> immediately.
> 
> Since caller can handle the error, let cmd_filesystem_defrag()
> break the loop and return error instead of calling exit(1).
> 
> Suggested-by: David Sterba 
> Signed-off-by: Su Yue 
> ---
> Changelog:
> v2:   Separate the patch from commit 6e991b9161fa ("btrfs-progs: fi
>   defrag: clean up duplicate code if find errors").
> ---
>  cmds-filesystem.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/cmds-filesystem.c b/cmds-filesystem.c
> index 17d399d58adf..3931333f76c6 100644
> --- a/cmds-filesystem.c
> +++ b/cmds-filesystem.c
> @@ -1050,7 +1050,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
>   ret = nftw(argv[i], defrag_callback, 10,
>   FTW_MOUNT | FTW_PHYS);
>   if (ret == ENOTTY)
> - exit(1);
> + break;

And we still need to call close_file_or_dir.

>   /* errors are handled in the callback */
>   ret = 0;
>   } else {
> -- 
> 2.15.0
> 
> 
> 


Re: [PATCH 0/7] btrfs-progs: mkfs: Reword --rootdir

2017-11-28 Thread David Sterba
On Fri, Oct 20, 2017 at 09:59:00AM +0800, Qu Wenruo wrote:
> Qu Wenruo (7):
>   btrfs-progs: mkfs: Don't use custom chunk allocator for rootdir
>   btrfs-progs: mkfs/rootdir: Use over-reserve method to make size
> estimate easier
>   btrfs-progs: mkfs: Only zero out the first 1M for rootdir
>   btrfs-progs: mkfs/rootdir: Introduce function to get end position of
> last device extent
>   btrfs-progs: mkfs/rootdir: Shrink fs for rootdir option
>   btrfs-progs: mkfs: Update allocation info before verbose output
>   btrfs-progs: mkfs: Separate shrink from rootdir

I've merged independent patches or what applied almost cleanly on
current devel. Please refresh and resend patches 1, 2, 5 and 7, thanks.


Re: [PATCH 6/7] btrfs-progs: mkfs: Update allocation info before verbose output

2017-11-28 Thread David Sterba
On Fri, Oct 20, 2017 at 09:59:06AM +0800, Qu Wenruo wrote:
> Since new --rootdir can allocate chunk, it will modify the chunk
> allocation result.
> 
> This patch will update allocation info before verbose output to reflect
> such info.
> 
> Signed-off-by: Qu Wenruo 

Applied, thanks.


Re: [PATCH 4/7] btrfs-progs: mkfs/rootdir: Introduce function to get end position of last device extent

2017-11-28 Thread David Sterba
On Fri, Oct 20, 2017 at 09:59:04AM +0800, Qu Wenruo wrote:
> Useful for later 'mkfs.btrfs --rootdir' shrink support.
> 
> Signed-off-by: Qu Wenruo 

Applied, thanks.


Re: [PATCH 3/7] btrfs-progs: mkfs: Only zero out the first 1M for rootdir

2017-11-28 Thread David Sterba
On Fri, Oct 20, 2017 at 09:59:03AM +0800, Qu Wenruo wrote:
> It's a waste of IO to fill the whole image before creating btrfs on it;
> just wipe the first 1M, and then write 1 byte at the last position to
> create a sparse file.
> 
> Signed-off-by: Qu Wenruo 

Applied, thanks.


Re: [PATCH v4 4/4] btrfs-progs: test/mkfs: Test if the minimal device size is valid

2017-11-28 Thread David Sterba
On Wed, Nov 01, 2017 at 09:30:43AM +0800, Qu Wenruo wrote:
> New test case to check that the minimal device size reported by a
> failed "mkfs.btrfs" run is valid.
> 
> Signed-off-by: Qu Wenruo 

Applied, thanks.


Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-28 Thread David Sterba
On Tue, Nov 28, 2017 at 02:16:15AM +0100, Adam Borowski wrote:
> On Tue, Nov 28, 2017 at 08:51:07AM +0800, Qu Wenruo wrote:
> > On 2017年11月27日 22:22, David Sterba wrote:
> > > On Mon, Nov 27, 2017 at 02:23:49PM +0100, Adam Borowski wrote:
> > >> On 4.15-rc1, I get the following failure:
> > >>
> > >> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
> > >> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed 
> > >> inline
> > >> extent, have 134 expect 281474976710677
> > > 
> > > A quick look at the suspiciously large number:
> > > 
> > > hex(281474976710677) = 0x1000000000015
> > > 
> > > may be a bitflip, but 0x15 does not match 134, so there could be
> > > something else involved in the corruption.
> > 
> > That's a known bug, fixed by that patch which is not merged yet.
> > 
> > https://patchwork.kernel.org/patch/10047489/
> 
> This helped, thanks!

Patch is going to 4.15-rc2.
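For reference, the bitflip hypothesis above can be checked in a couple of lines (a standalone sketch, not part of the original thread):

```python
# The bogus ram_bytes value from the corrupt-leaf message.
val = 281474976710677
assert hex(val) == "0x1000000000015"

# If a single bit (bit 48) had flipped, clearing it should recover the
# expected ram_bytes of 134 (0x86) -- but it leaves 0x15 == 21 instead,
# so a plain bitflip alone does not explain the corruption.
cleared = val & ~(1 << 48)
print(cleared)
```

This matches the observation in the reply: the low bits (0x15) do not correspond to the expected value 134.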


Re: swapon: swapfile has holes

2017-11-28 Thread Austin S. Hemmelgarn

On 2017-11-28 07:31, Cristian wrote:

Hello,

Report in:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1723449

dmesg:
[ 3.782191] swapon: swapfile has holes


As mentioned in the linked bug report, this is a known issue.  To use a 
swap file on BTRFS, you need to:


1. Create an empty file.
2. Mark it NOCOW with `chattr +C`
3. Use fallocate to pre-allocate to the appropriate size.
4. Hook it up to a loop device.
5. Use the loop device for swap.

This is because the kernel needs a consistent block mapping for a swap
file, since it bypasses the filesystem and accesses the block device
directly. However, since BTRFS has copy-on-write semantics, blocks move
around on each write, which violates the requirement for a consistent
block mapping, so you can't use a file on a BTRFS filesystem directly as
a swap file.


Of the above steps, you technically don't need to mark the file NOCOW or 
pre-allocate it, but those are good practice to remove random latency.
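As root, the steps above might look like the following (a command transcript for illustration only; the path and size are made up, and everything from `losetup` on requires root privileges):

```sh
# 1. Create an empty file (it must be empty before chattr +C can apply).
truncate -s 0 /var/swapfile
chmod 600 /var/swapfile

# 2. Mark it NOCOW so btrfs keeps its block mapping stable.
chattr +C /var/swapfile

# 3. Pre-allocate the full size up front, leaving no holes.
fallocate -l 4G /var/swapfile

# 4. Hook it up to a loop device.
LOOPDEV=$(losetup -f --show /var/swapfile)

# 5. Use the loop device for swap.
mkswap "$LOOPDEV"
swapon "$LOOPDEV"
```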



Re: [PATCH v4 1/4] btrfs-progs: mkfs: Enhance minimal device size calculation to fix mkfs failure on small file

2017-11-28 Thread David Sterba
On Tue, Nov 28, 2017 at 08:40:47AM +0800, Qu Wenruo wrote:
> 
> 
> On 2017年11月28日 07:21, David Sterba wrote:
> > On Wed, Nov 01, 2017 at 09:30:40AM +0800, Qu Wenruo wrote:
> >> +u64 btrfs_min_dev_size(u32 nodesize, int mixed, u64 meta_profile,
> >> + u64 data_profile)
> >> +{
> >> +  u64 reserved = 0;
> >> +  u64 meta_size;
> >> +  u64 data_size;
> >> +
> >> +  if (mixed)
> >> +  return 2 * (BTRFS_MKFS_SYSTEM_GROUP_SIZE +
> >> +  btrfs_min_global_blk_rsv_size(nodesize));
> >> +
> >> +  /*
> >> +   * minimal size calculation is complex due to several factors:
> >> +   * 1) Temporary chunk reuse
> >> +   *If specified chunk profile is SINGLE, we can reuse
> >> +   *temporary chunks, no need to alloc new chunks.
> >> +   *
> >> +   * 2) Different minimal chunk size for different profile
> >> +   *For initial sys chunk, chunk size is fixed to 4M.
> >> +   *For single profile, minimal chunk size is 8M for all.
> >> +   *For other profiles, minimal chunk and stripe size
> >> +   *differs from 8M to 64M.
> >> +   *
> >> +   * To calculate it a little easier, here we assume we don't
> >> +   * reuse any temporary chunk, and calculate the size all
> >> +   * by ourselves.
> >> +   *
> >> +   * Temporary chunks sizes are always fixed:
> >> +   * One initial sys chunk, one SINGLE meta, and one SINGLE data.
> >> +   * The latter two are all 8M, according to @calc_size of
> >> +   * btrfs_alloc_chunk().
> >> +   */
> >> +  reserved += BTRFS_MKFS_SYSTEM_GROUP_SIZE + SZ_8M * 2;
> >> +
> >> +  /*
> >> +   * For real chunks, we need to refer different sizes:
> >> +   * For SINGLE, it's still fixed to 8M (@calc_size).
> >> +   * For other profiles, refer to max(@min_stripe_size, @calc_size).
> >> +   *
> >> +   * And use the stripe size to calculate its physical used space.
> >> +   */
> >> +  if (meta_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK)
> >> +  meta_size = SZ_8M + SZ_32M;
> >> +  else
> >> +  meta_size = SZ_8M + SZ_8M;
> >> +  /* Only metadata put 2 stripes into one disk */
> > 
> > 'mkfs.btrfs -d dup' also works, so I think this should be handled here
> > the same way as metadata, right?
> 
> Just a few lines below.
> > 
> >> +  if (meta_profile & BTRFS_BLOCK_GROUP_DUP)
> >> +  meta_size *= 2;
> >> +  reserved += meta_size;
> >> +
> >> +  if (data_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK)
> >> +  data_size = SZ_64M;
> >> +  else
> >> +  data_size = SZ_8M;
> >> +  if (data_profile & BTRFS_BLOCK_GROUP_DUP)
> >> +  data_size *= 2;
> Here data DUP is also handled.
> 
> So the problem is about the word "Only".

Yeah, it's confusing when the comment does not match the code, because
it's not clear what the actual intention is.

> Do I need to re-submit the whole patchset or just a separate patch to
> update the comment?

No, I'll update it myself as it is a fairly trivial change.
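The minimal-size logic quoted above can be mirrored in a few lines (a sketch of the arithmetic only; the real constants and profile-mask checks live in btrfs-progs, and the `*_striped`/`*_dup` flags here are simplified stand-ins for the `BTRFS_BLOCK_GROUP_PROFILE_MASK` and `BTRFS_BLOCK_GROUP_DUP` tests):

```python
SZ_8M, SZ_32M, SZ_64M = 8 << 20, 32 << 20, 64 << 20
SYSTEM_GROUP_SIZE = 4 << 20  # BTRFS_MKFS_SYSTEM_GROUP_SIZE: fixed 4M sys chunk

def min_dev_size(meta_striped=False, meta_dup=False,
                 data_striped=False, data_dup=False):
    # Temporary chunks: the initial sys chunk plus one SINGLE meta and
    # one SINGLE data chunk of 8M each (assumes no temporary-chunk reuse,
    # as the quoted comment does).
    reserved = SYSTEM_GROUP_SIZE + SZ_8M * 2
    # Real metadata: SINGLE stays at 8M, striped profiles use 32M stripes.
    meta = SZ_8M + (SZ_32M if meta_striped else SZ_8M)
    if meta_dup:          # DUP puts both stripes on the same disk
        meta *= 2
    data = SZ_64M if data_striped else SZ_8M
    if data_dup:          # -d dup, handled the same way as metadata
        data *= 2
    return reserved + meta + data

print(min_dev_size() // (1 << 20))  # all-SINGLE case, in MiB: 44
```

Note how the DUP doubling is applied to both metadata and data, which is the point of the comment/code mismatch discussed above.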


swapon: swapfile has holes

2017-11-28 Thread Cristian
Hello,

Report in:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1723449

dmesg:
[ 3.782191] swapon: swapfile has holes

Regards,
-- 
Cristian


[PATCH v2 3/3] btrfs-progs: fi defrag: extend -c to drop nocompress flag on files

2017-11-28 Thread Su Yue
Now, files which have the nocompress flag will also be defragmented
with compression. However, the nocompress flag still exists afterwards
and has to be cleared manually.

So add an option '--clear-nocompress' to extend -c to drop the
nocompress flag after defragmentation.

Suggested-by: David Sterba 
Suggested-by: Anand Jain 
Signed-off-by: Su Yue 
---
Changelog:
v2: Remove change about indentation of defrag_callback().

---
 cmds-filesystem.c | 65 ++-
 1 file changed, 64 insertions(+), 1 deletion(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 3931333f76c6..84242814798e 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "kerncompat.h"
 #include "ctree.h"
@@ -867,6 +868,8 @@ static const char * const cmd_filesystem_defrag_usage[] = {
"-l len  defragment only up to len bytes",
"-t size target extent size hint (default: 32M)",
"",
+   "--compress-forceclear nocompress flag on files after defragment, 
only work with option -c",
+   "",
"Warning: most Linux kernels will break up the ref-links of COW data",
"(e.g., files copied with 'cp --reflink', snapshots) which may cause",
"considerable increase of space usage. See btrfs-filesystem(8) for",
@@ -874,9 +877,39 @@ static const char * const cmd_filesystem_defrag_usage[] = {
NULL
 };
 
+static int clear_nocompress_flag(int fd)
+{
+   unsigned int flags;
+   int ret = 0;
+
+   ret = ioctl(fd, FS_IOC_GETFLAGS, );
+   if (ret < 0) {
+   ret = -errno;
+   error("failed to get flags: %s", strerror(-ret));
+   goto out;
+   }
+
+   if (!(flags & FS_NOCOMP_FL)) {
+   ret = 0;
+   goto out;
+   }
+   flags &= ~FS_NOCOMP_FL;
+   ret = ioctl(fd, FS_IOC_SETFLAGS, );
+   if (ret < 0) {
+   ret = -errno;
+   error("failed to set flags: %s", strerror(-ret));
+   goto out;
+   }
+
+   ret = 0;
+out:
+   return ret;
+}
+
 static struct btrfs_ioctl_defrag_range_args defrag_global_range;
 static int defrag_global_verbose;
 static int defrag_global_errors;
+static int defrag_global_clear_nocompress;
 static int defrag_callback(const char *fpath, const struct stat *sb,
int typeflag, struct FTW *ftwbuf)
 {
@@ -904,6 +937,14 @@ static int defrag_callback(const char *fpath, const struct 
stat *sb,
err = errno;
goto error;
}
+
+   if (defrag_global_clear_nocompress) {
+   ret = clear_nocompress_flag(fd);
+   if (ret) {
+   err = -ret;
+   goto error;
+   }
+   }
}
return 0;
 
@@ -926,6 +967,12 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int compress_type = BTRFS_COMPRESS_NONE;
DIR *dirstream;
 
+   enum { GETOPT_VAL_CLEAR_NOCOMPRESS = 257};
+   static const struct option long_options[] = {
+   { "clear-nocompress", no_argument, NULL,
+ GETOPT_VAL_CLEAR_NOCOMPRESS},
+   { NULL, 0, NULL, 0}
+   };
/*
 * Kernel has a different default (256K) that is supposed to be safe,
 * but it does not defragment very well. The 32M will likely lead to
@@ -937,8 +984,10 @@ static int cmd_filesystem_defrag(int argc, char **argv)
defrag_global_errors = 0;
defrag_global_verbose = 0;
defrag_global_errors = 0;
+   defrag_global_clear_nocompress = 0;
while(1) {
-   int c = getopt(argc, argv, "vrc::fs:l:t:");
+   int c = getopt_long(argc, argv, "vrc::fs:l:t:", long_options,
+   NULL);
if (c < 0)
break;
 
@@ -972,6 +1021,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)
case 'r':
recursive = 1;
break;
+   case GETOPT_VAL_CLEAR_NOCOMPRESS:
+   defrag_global_clear_nocompress = 1;
+   break;
default:
usage(cmd_filesystem_defrag_usage);
}
@@ -987,6 +1039,8 @@ static int cmd_filesystem_defrag(int argc, char **argv)
if (compress_type) {
defrag_global_range.flags |= BTRFS_DEFRAG_RANGE_COMPRESS;
defrag_global_range.compress_type = compress_type;
+   } else if (defrag_global_clear_nocompress) {
+   warning("Option --clear-nocompress only works for -c");
}
if (flush)
defrag_global_range.flags |= BTRFS_DEFRAG_RANGE_START_IO;
@@ -1071,6 +1125,15 @@ static int 

[PATCH v2 2/3] btrfs-progs: fi defrag: do not exit if defrag range ioctl is unsupported

2017-11-28 Thread Su Yue
If ioctl of defrag range is unsupported, defrag will exit
immediately.

Since caller can handle the error, let cmd_filesystem_defrag()
break the loop and return error instead of calling exit(1).

Suggested-by: David Sterba 
Signed-off-by: Su Yue 
---
Changelog:
v2: Separate the patch from commit 6e991b9161fa ("btrfs-progs: fi
defrag: clean up duplicate code if find errors").
---
 cmds-filesystem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 17d399d58adf..3931333f76c6 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -1050,7 +1050,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
ret = nftw(argv[i], defrag_callback, 10,
FTW_MOUNT | FTW_PHYS);
if (ret == ENOTTY)
-   exit(1);
+   break;
/* errors are handled in the callback */
ret = 0;
} else {
-- 
2.15.0





[PATCH v2 1/3] btrfs-progs: fi defrag: clean up duplicate code if find errors

2017-11-28 Thread Su Yue
In function cmd_filesystem_defrag(), lines of code for error handling
are duplicated and hard to extend further.

Create a jump label for errors.

Signed-off-by: Su Yue 
---
Changelog:
v2: Record -errno to return value if open() and fstat() failed.
Move change "do no exit if defrag range ioctl is unsupported"
to another patch.
Suggested by David.
---
 cmds-filesystem.c | 44 
 1 file changed, 24 insertions(+), 20 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 7728430f16a1..17d399d58adf 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -1029,23 +1029,22 @@ static int cmd_filesystem_defrag(int argc, char **argv)
if (fd < 0) {
error("cannot open %s: %s", argv[i],
strerror(errno));
-   defrag_global_errors++;
-   close_file_or_dir(fd, dirstream);
-   continue;
+   ret = -errno;
+   goto next;
}
-   if (fstat(fd, )) {
+
+   ret = fstat(fd, );
+   if (ret) {
error("failed to stat %s: %s",
argv[i], strerror(errno));
-   defrag_global_errors++;
-   close_file_or_dir(fd, dirstream);
-   continue;
+   ret = -errno;
+   goto next;
}
if (!(S_ISDIR(st.st_mode) || S_ISREG(st.st_mode))) {
error("%s is not a directory or a regular file",
argv[i]);
-   defrag_global_errors++;
-   close_file_or_dir(fd, dirstream);
-   continue;
+   ret = -EINVAL;
+   goto next;
}
if (recursive && S_ISDIR(st.st_mode)) {
ret = nftw(argv[i], defrag_callback, 10,
@@ -1060,20 +1059,25 @@ static int cmd_filesystem_defrag(int argc, char **argv)
ret = ioctl(fd, BTRFS_IOC_DEFRAG_RANGE,
_global_range);
defrag_err = errno;
-   }
-   close_file_or_dir(fd, dirstream);
-   if (ret && defrag_err == ENOTTY) {
-   error(
+   if (ret && defrag_err == ENOTTY) {
+   error(
 "defrag range ioctl not supported in this kernel version, 2.6.33 and newer is 
required");
-   defrag_global_errors++;
-   break;
+   defrag_global_errors++;
+   close_file_or_dir(fd, dirstream);
+   break;
+   }
+   if (ret) {
+   error("defrag failed on %s: %s", argv[i],
+ strerror(defrag_err));
+   goto next;
+   }
}
-   if (ret) {
-   error("defrag failed on %s: %s", argv[i],
-   strerror(defrag_err));
+next:
+   if (ret)
defrag_global_errors++;
-   }
+   close_file_or_dir(fd, dirstream);
}
+
if (defrag_global_errors)
fprintf(stderr, "total %d failures\n", defrag_global_errors);
 
-- 
2.15.0





Re: [PATCH 3/3] btrfs-progs: check: exclude extents of tree blocks and extent items

2017-11-28 Thread Su Yue



On 11/28/2017 04:25 PM, Qu Wenruo wrote:



On 2017年11月28日 15:47, Su Yue wrote:



On 11/28/2017 12:05 PM, Qu Wenruo wrote:



On 2017年11月28日 10:38, Su Yue wrote:

Thanks for review.

On 11/27/2017 06:37 PM, Qu Wenruo wrote:



On 2017年11月27日 11:13, Su Yue wrote:

Commit d17d6663c99c ("btrfs-progs: lowmem check: Fix regression which
screws up extent allocator") removes pin_metadata_blocks() from
lowmem repair.
So we have to find another way to exclude extents which should be
occupied by tree blocks.


First things first: any tree block which is referred to will not be
freed.

Unless some special case, like extent tree initialization which clears
the whole extent tree so we can't check if one tree block should be
freed using extent tree, there is no need to explicitly pin existing
tree blocks.

Their creation/freeing is already handled well.


If I understand the logic of extents correctly:

The extents in free space cache are handled in the way like
cache_block_group() in extent-tree.c.
It traverses all extents items then marks all holes free in free space
cache.

Yes, in a normal situation, extents are already handled well.
But original and lowmem check try to repair corrupted extent tree.

Suppose a situation:
1) An extent item of one tree block is corrupted or missing.
2) The extent will then be wrongly marked as free in the free space cache.
3) The next CoW may allocate the "used" extents from free space
     and insert an extent item.
4) Lowmem repair increases refs of the extent item and
     causes a wrong extent item.
OR
3) Lowmem repair inserts the extent item first.
4) CoW may allocate the "used" extents from free space,
     hitting the BUG_ON when inserting the extent item.


OK, this explains things, but still not optimal.

Such behavior should not happen if nothing is exposed.
Pinning down all tree blocks is never a light operation and should be
avoided if possible.



Agreed.


It would be nice to do it when starting a new transaction to modify
the fs.



Now, there is only one transaction while checking chunks and extents,
because of the previous wrong call of pin_metadata_blocks().
It's meaningless to put it at the start of a new transaction now.


Because you start that transaction under all conditions.

Normally, a transaction should only be started if you're sure you will
modify the filesystem.

Starting them unconditionally is not a preferred behavior.



I will split the transaction into many smaller ones. Then lowmem repair
will speed up if there are no errors in chunks and extents. But it is
just a little optimization.


Compared to iterating all tree blocks, it's a big optimization.

Just think of a filesystem with several TB of metadata.






Please explain the reason why you need to pin extents first.


What I do in the patch is like the original mode's.
Pinning extents first ensures all extents (corrupted and uncorrupted)
will not be reallocated.


Only extent tree reinit will pin down all metadata block in original
mode while normal repair won't.

So you're still doing something which is not done in original mode,
which either needs to be explained in detail or fix original mode first.



You are right. I did miss details of how original mode frees extent
records.

After some thinking, pinning extents is still necessary in lowmem
repair.

Original mode has an extent cache to record all extents, and frees
normal extents while traversing tree blocks.
After traversal, only corrupted extent records exist in the cache.
At this moment, it pins them and starts transactions to repair.

Lowmem mode does not store anything like extent records. It begins
transactions (in a further patch) during traversal. We cannot guarantee
that extents allocated from free space are not occupied by a tree block
which has not been traversed yet.

Although pinning extents is heavy and painful work, it seems
inevitable. Otherwise, I have no better idea than pinning all
extents.


Firstly, only a missing backref in the extent tree will cause a problem.

Even if we have a corrupted backref, or an extra backref which doesn't
have any tree block/data referring to it, it won't cause a big problem.

So for lowmem mode, we can do it in a different way, just like a seed
device: we allocate new meta chunks, and only use the new meta chunks
for our tree operations.

This will avoid any extra tree iteration, and is easier to implement AFAIK.
(Just mark all existing block groups full, to force chunk allocation)



Seems feasible! I will try to implement it.

Thanks,
Su


Thanks,
Qu



Thanks,
Su



Thanks,
Su


Thanks,
Qu



Modify pin_down_tree_blocks() only for code reuse.
So behavior of pin_metadata_blocks() which works with option
'init-extent-tree' is not influenced.

Introduce exclude_blocks_and_extent_items() to mark extents of all
tree
blocks dirty in fs_info->excluded_extents. It also calls
exclude_extent_items() to exclude extent_items in extent tree.

Signed-off-by: Su Yue 
---
    cmds-check.c | 122
++-
    1 file changed, 112 

Re: How about adding an ioctl to convert a directory to a subvolume?

2017-11-28 Thread Qu Wenruo


On 2017年11月28日 16:29, Lu Fengqi wrote:
> On Mon, Nov 27, 2017 at 06:13:10PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2017年11月27日 17:41, Lu Fengqi wrote:
>>> Hi all,
>>>
>>> As we all know, under certain circumstances, it is more appropriate to
>>> create some subvolumes rather than keep everything in the same
>>> subvolume. As the condition of demand change, the user may need to
>>> convert a previous directory to a subvolume. For this reason,how about
>>> adding an ioctl to convert a directory to a subvolume?
>>
>> The idea seems interesting.
>>
>> However in my opinion, this can be done quite easily in (mostly) user
>> space, thanks to btrfs support of relink.
>>
>> The method from Hugo or Chris is quite good, maybe it can be enhanced a
>> little.
>>
>> Use the following layout as an example:
>>
>> root_subv
>> |- subvolume_1
>> |  |- dir_1
>> |  |  |- file_1
>> |  |  |- file_2
>> |  |- dir_2
>> | |- file_3
>> |- subvolume_2
>>
>> If we want to convert dir_1 into subvolume, we can do it like:
>>
>> 1) Create a temporary readonly snapshot of parent subvolume containing
>>   the desired dir
>>   # btrfs sub snapshot -r root_subv/subvolume_1 \
>> root_subv/tmp_snapshot_1
>>
>> 2) Create a new subvolume, as destination.
>>   # btrfs sub create root_subv/tmp_dest/
>>
>> 3) Copy the content and sync the fs
>>   Use of reflink is necessary.
>>   # cp -r --reflink=always root_subv/tmp_snapshot_1/dir_1 \
>> root_subv/tmp_dest
>>   # btrfs sync root_subv/tmp_dest
>>
>> 4) Delete temporary readonly snapshot
>>   # btrfs subvolume delete root_subv/tmp_snapshot_1
>>
>> 5) Remove the source dir
>>   # rm -rf root_subv/subvolume_1/dir_1
>>
>> 6) Create a final destination snapshot of "root_subv/tmp_dest"
>>   # btrfs subvolume snapshot root_subv/tmp_dest \
>> root_subv/subvolume_1/dir_1
>>
>> 7) Remove the temporary destination
>>   # btrfs subvolume delete root_subv/tmp_dest
>>
>>
>> The main challenge is in step 3).
>> In fact, the above method can only handle normal dirs/files.
>> If there is another subvolume inside the desired dir, a plain "cp -r"
>> is a bad idea.
>> We need to skip each subvolume dir and create a snapshot for it instead.
>>
>> But it's quite easy to write a user space program to handle this.
>> Maybe the "find" command can already handle it well.
>>
>> Anyway, doing it in user space is already possible and much easier than
>> doing it in kernel.
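The steps quoted above can be sketched as a small dry-run helper that only
builds and prints the command lines without executing anything. The function
name and the dry-run framing are mine; the command sequence and paths follow
the example layout discussed in the thread (note the `cp -r --reflink=always`
step still needs the subvolume-aware handling described above):

```python
# Dry-run sketch of the dir-to-subvolume conversion steps quoted above.
# Nothing here executes btrfs commands; it only prints them for review.

def conversion_commands(parent_subvol, dirname):
    """Return the btrfs/coreutils command lines for converting
    parent_subvol/dirname into a subvolume."""
    snap = f"{parent_subvol}_tmp_snapshot"   # temporary readonly snapshot
    dest = "root_subv/tmp_dest"              # temporary destination subvolume
    src = f"{parent_subvol}/{dirname}"
    return [
        f"btrfs subvolume snapshot -r {parent_subvol} {snap}",
        f"btrfs subvolume create {dest}",
        f"cp -r --reflink=always {snap}/{dirname} {dest}",
        f"btrfs filesystem sync {dest}",
        f"btrfs subvolume delete {snap}",
        f"rm -rf {src}",
        f"btrfs subvolume snapshot {dest} {src}",
        f"btrfs subvolume delete {dest}",
    ]

for cmd in conversion_commands("root_subv/subvolume_1", "dir_1"):
    print(cmd)
```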
>>
>>>
>>> Users can convert with the scripts mentioned in this
>>> thread (https://www.spinics.net/lists/linux-btrfs/msg33252.html), but
>>> wouldn't it be easier to use an off-the-shelf btrfs subcommand?
>>
>> If you just want to integrate the functionality into btrfs-progs, maybe
>> it's possible.
>>
>> But if you insist on providing a new ioctl for this, I highly doubt
>> the extra hassle is worth it.
>>
> 
> Thanks for getting back to me.
> 
> The enhanced approach you provide is pretty good, and I agree that it
> meets the needs of some users.
> 
>>>
>>> After an initial consideration, our implementation is broadly divided
>>> into the following steps:
>>> 1. Freeze the filesystem or set the subvolume above the source directory
>>> to read-only;
>>
>> There's no real need to freeze the whole fs.
>> Just create a readonly snapshot of the parent subvolume which contains
>> the dir.
>> That's what snapshots are designed for.
>>
> 
> I still worry about the following problem. Although tmp_snapshot_1 is
> read-only, the source directory dir_1 is still writable. Also, step 5)
> will delete dir_1, which may result in the loss of newly written data
> during the conversion.

That's just a problem of timing.

Outputting a message before or after creating the read-only snapshot to
inform the user should be good enough.

Especially considering that btrfs snapshot creation won't flush data,
which already means some buffered data will not appear in the snapshot.

Thanks,
Qu

> 
> I can not think of any other method for now, except setting subvolume_1
> read-only (obviously a bad idea). Could you provide some
> suggestions?
> 
>>> 2. Perform a pre-check, for example, check whether a cross-device link
>>> would be created during the conversion;
>>
>> This can be done on the fly,
>> as the check is easy (it only needs to check whether the inode number
>> is 256).
>> We only need an in-order iteration of the source dir (in the temporary
>> snapshot): for a normal file, use reflink;
>> for a subvolume dir, create a snapshot of it.
>>
>> And for such an iteration, a python script of less than 100 lines would
>> be sufficient.
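A minimal sketch of that iteration: on btrfs, the root directory of every
subvolume has inode number 256 (BTRFS_FIRST_FREE_OBJECTID), which is what
the check above relies on. The function names and the "plan" representation
are mine; only the inode-256 test comes from the thread:

```python
import os

BTRFS_FIRST_FREE_OBJECTID = 256  # inode number of every btrfs subvolume root

def is_subvolume_root(path):
    """True if path looks like the root of a btrfs subvolume:
    a directory whose inode number is 256."""
    return os.path.isdir(path) and os.lstat(path).st_ino == BTRFS_FIRST_FREE_OBJECTID

def classify_entries(src_dir):
    """Walk src_dir and plan the per-entry action: snapshot for nested
    subvolumes (without descending into them), reflink-copy for files,
    plain mkdir for ordinary directories."""
    plan = []
    for root, dirs, files in os.walk(src_dir):
        for d in list(dirs):
            full = os.path.join(root, d)
            if is_subvolume_root(full):
                plan.append(("snapshot", full))
                dirs.remove(d)  # do not descend into the nested subvolume
            else:
                plan.append(("mkdir", full))
        for f in files:
            plan.append(("reflink", os.path.join(root, f)))
    return plan
```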
> 
> Just to clarify, the check is to confirm whether there is a hard link
> crossing the dir_1 boundary. dir_1 can't be converted to a subvolume if
> there is such a link, since it would become a cross-device link that can
> not be created.
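One way to sketch that pre-check: for every multiply-linked inode found
under the directory, count how many of its links live inside it; if fewer
than `st_nlink`, at least one link is outside the boundary and would become
a forbidden cross-subvolume link after conversion. The heuristic and helper
name are mine, not code from the thread:

```python
import os
from collections import defaultdict

def hardlinks_crossing_boundary(src_dir):
    """Return groups of paths whose inode has more links (st_nlink)
    than the number of links found inside src_dir, i.e. hard links
    that cross the would-be subvolume boundary."""
    seen = defaultdict(list)   # (dev, ino) -> paths found inside src_dir
    nlink = {}                 # (dev, ino) -> total link count
    for root, _, files in os.walk(src_dir):
        for name in files:
            full = os.path.join(root, name)
            st = os.lstat(full)
            if st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                seen[key].append(full)
                nlink[key] = st.st_nlink
    return [paths for key, paths in seen.items() if len(paths) < nlink[key]]
```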
> 
>>
>> Thanks,
>> Qu
>>
>>> 3. Perform conversion, such as creating a new subvolume and moving the
>>> contents of the source directory;
>>> 4. Thaw the filesystem or restore the subvolume writable property.

Re: How about adding an ioctl to convert a directory to a subvolume?

2017-11-28 Thread Lu Fengqi
On Mon, Nov 27, 2017 at 06:13:10PM +0800, Qu Wenruo wrote:
>
>
>On 2017-11-27 17:41, Lu Fengqi wrote:
>> Hi all,
>> 
>> As we all know, under certain circumstances, it is more appropriate to
>> create some subvolumes rather than keep everything in the same
>> subvolume. As demands change, the user may need to convert an
>> existing directory to a subvolume. For this reason, how about
>> adding an ioctl to convert a directory to a subvolume?
>
>The idea seems interesting.
>
>However in my opinion, this can be done quite easily in (mostly) user
>space, thanks to btrfs's support for reflink.
>
>The method from Hugo or Chris is quite good, maybe it can be enhanced a
>little.
>
>Use the following layout as an example:
>
>root_subv
>|- subvolume_1
>|  |- dir_1
>|  |  |- file_1
>|  |  |- file_2
>|  |- dir_2
>| |- file_3
>|- subvolume_2
>
>If we want to convert dir_1 into subvolume, we can do it like:
>
>1) Create a temporary readonly snapshot of parent subvolume containing
>   the desired dir
>   # btrfs sub snapshot -r root_subv/subvolume_1 \
> root_subv/tmp_snapshot_1
>
>2) Create a new subvolume, as destination.
>   # btrfs sub create root_subv/tmp_dest/
>
>3) Copy the content and sync the fs
>   Use of reflink is necessary.
>   # cp -r --reflink=always root_subv/tmp_snapshot_1/dir_1 \
> root_subv/tmp_dest
>   # btrfs sync root_subv/tmp_dest
>
>4) Delete temporary readonly snapshot
>   # btrfs subvolume delete root_subv/tmp_snapshot_1
>
>5) Remove the source dir
>   # rm -rf root_subv/subvolume_1/dir_1
>
>6) Create a final destination snapshot of "root_subv/tmp_dest"
>   # btrfs subvolume snapshot root_subv/tmp_dest \
> root_subv/subvolume_1/dir_1
>
>7) Remove the temporary destination
>   # btrfs subvolume delete root_subv/tmp_dest
>
>
>The main challenge is in step 3).
>In fact, the above method can only handle normal dirs/files.
>If there is another subvolume inside the desired dir, a plain "cp -r"
>is a bad idea.
>We need to skip each subvolume dir and create a snapshot for it instead.
>
>But it's quite easy to write a user space program to handle this.
>Maybe the "find" command can already handle it well.
>
>Anyway, doing it in user space is already possible and much easier than
>doing it in kernel.
>
>> 
>> Users can convert with the scripts mentioned in this
>> thread (https://www.spinics.net/lists/linux-btrfs/msg33252.html), but
>> wouldn't it be easier to use an off-the-shelf btrfs subcommand?
>
>If you just want to integrate the functionality into btrfs-progs, maybe
>it's possible.
>
>But if you insist on providing a new ioctl for this, I highly doubt
>the extra hassle is worth it.
>

Thanks for getting back to me.

The enhanced approach you provide is pretty good, and I agree that it
meets the needs of some users.

>> 
>> After an initial consideration, our implementation is broadly divided
>> into the following steps:
>> 1. Freeze the filesystem or set the subvolume above the source directory
>> to read-only;
>
>There's no real need to freeze the whole fs.
>Just create a readonly snapshot of the parent subvolume which contains
>the dir.
>That's what snapshots are designed for.
>

I still worry about the following problem. Although tmp_snapshot_1 is
read-only, the source directory dir_1 is still writable. Also, step 5)
will delete dir_1, which may result in the loss of newly written data
during the conversion.

I can not think of any other method for now, except setting subvolume_1
read-only (obviously a bad idea). Could you provide some
suggestions?

>> 2. Perform a pre-check, for example, check whether a cross-device link
>> would be created during the conversion;
>
>This can be done on the fly,
>as the check is easy (it only needs to check whether the inode number
>is 256).
>We only need an in-order iteration of the source dir (in the temporary
>snapshot): for a normal file, use reflink;
>for a subvolume dir, create a snapshot of it.
>
>And for such an iteration, a python script of less than 100 lines would
>be sufficient.

Just to clarify, the check is to confirm whether there is a hard link
crossing the dir_1 boundary. dir_1 can't be converted to a subvolume if
there is such a link, since it would become a cross-device link that can
not be created.

>
>Thanks,
>Qu
>
>> 3. Perform conversion, such as creating a new subvolume and moving the
>> contents of the source directory;
>> 4. Thaw the filesystem or restore the subvolume writable property.
>> 
>> In fact, I am not so sure whether this use of freeze is appropriate,
>> because the source directory the user needs to convert may be located
>> at / or /home, and this pre-check and conversion process may take a
>> long time, which could leave some shell and graphical applications
>> suspended.
>> 
>> Please give your comments if any.
>> 
>

-- 
Thanks,
Lu




Re: [PATCH 3/3] btrfs-progs: check: exclude extents of tree blocks and extent items

2017-11-28 Thread Qu Wenruo


On 2017-11-28 15:47, Su Yue wrote:
> 
> 
> On 11/28/2017 12:05 PM, Qu Wenruo wrote:
>>
>>
>> On 2017-11-28 10:38, Su Yue wrote:
>>> Thanks for review.
>>>
>>> On 11/27/2017 06:37 PM, Qu Wenruo wrote:


 On 2017-11-27 11:13, Su Yue wrote:
> Commit d17d6663c99c ("btrfs-progs: lowmem check: Fix regression which
> screws up extent allocator") removes pin_metadata_blocks() from
> lowmem repair.
> So we have to find another way to exclude extents which should be
> occupied by tree blocks.

 First things first: any tree block which is still referred to will not
 be freed.

 Unless there is some special case, like extent tree initialization,
 which clears the whole extent tree so we can't use the extent tree to
 check whether a tree block should be freed, there is no need to
 explicitly pin existing tree blocks.

 Their creation/freeing is already handled well.

>>> If I understand the logic of extents correctly:
>>>
>>> The extents in the free space cache are handled the way
>>> cache_block_group() in extent-tree.c does:
>>> it traverses all extent items, then marks all the holes between them
>>> free in the free space cache.
>>>
>>> Yes, in a normal situation, extents are already handled well.
>>> But original and lowmem check try to repair corrupted extent tree.
>>>
>>> Suppose a situation:
>>> 1) The extent item of one tree block is corrupted or missing.
>>> 2) The corresponding extent will be marked as free in the free space
>>>    cache.
>>> 3) The next CoW may allocate the "used" extent from free space
>>>    and insert an extent item.
>>> 4) Lowmem repair increases the refs of the extent item and
>>>    creates a wrong extent item.
>>> OR
>>> 3) Lowmem repair inserts the extent item first.
>>> 4) CoW may allocate the "used" extent from free space.
>>>    BUG_ON failure when inserting the extent item.
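The scenario can be illustrated with a toy model: free space is computed as
the holes between recorded extent items (roughly what cache_block_group()
does), so a missing extent item makes a still-used tree block range look
free to the allocator. The function and the 16-unit block sizes here are
illustrative only, not btrfs-progs code:

```python
# Toy model: a missing extent item exposes a used tree block as free space.

def free_space(total, extent_items):
    """Return free (start, length) ranges: the holes left between
    sorted extent items within [0, total)."""
    free, pos = [], 0
    for start, length in sorted(extent_items):
        if start > pos:
            free.append((pos, start - pos))
        pos = max(pos, start + length)
    if pos < total:
        free.append((pos, total - pos))
    return free

used_tree_blocks = [(0, 16), (16, 16), (32, 16)]  # three in-use tree blocks
extent_items    = [(0, 16), (32, 16)]             # item for (16, 16) is missing

free = free_space(64, extent_items)
# The still-referenced block (16, 16) shows up as allocatable free space:
overlap = [(s, l) for s, l in free if (s, l) in used_tree_blocks]
print(overlap)  # [(16, 16)]
```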
>>
>> OK, this explains things, but still not optimal.
>>
>> Such behavior should not happen if nothing is exposed.
>> Pinning down all tree blocks is never a light operation and should be
>> avoided if possible.
>>
> 
> Agreed.
> 
>> It would be nice to do it when starting a new transaction to modify
>> the fs.
>>
> 
> Now, there is only one transaction while checking chunks and extents,
> because of the previous wrong call of pin_metadata_blocks().
> It's meaningless to do it at the start of a new transaction now.

Because you start that transaction under all conditions.

Normally, a transaction should only be started if you're sure you will
modify the filesystem.

Starting them unconditionally is not preferred behavior.

> 
> I will split the transaction into many smaller ones. Then lowmem repair
> will speed up if there is no error in chunks and extents. But it is
> just a small optimization.

Compared to iterating all tree blocks, it's a big optimization.

Just think of filesystem with several TB metadata.

> 
>>>
 Please explain the reason why you need to pin extents first.

>>> What I do in the patch is like original mode's.
>>> Pinning extents first ensures that extents (corrupted and uncorrupted)
>>> will not be reallocated.
>>
>> Only an extent tree reinit will pin down all metadata blocks in
>> original mode, while a normal repair won't.
>>
>> So you're still doing something which is not done in original mode,
>> which either needs to be explained in detail or fix original mode first.
>>
> 
> You are right. I did miss details of how original mode frees extent
> records.
> 
> After some thinking, pinning extents is still necessary in lowmem
> repair.
> 
> Original mode has an extent cache to record all extents, and frees
> normal extent records while traversing tree blocks.
> After the traversal, only corrupted extent records remain in the cache.
> At that moment, it pins them and starts transactions to repair.
> 
> Lowmem mode does not store anything like extent records. It begins
> transactions (in a further patch) during the traversal. We can not
> guarantee that an extent allocated from free space is not occupied by a
> tree block which has not been traversed yet.
> 
> Although pinning extents is heavy and painful work, it seems
> inevitable. I have no better idea than pinning all extents.

Firstly, only a missing backref in the extent tree will cause a problem.

Even if we have a corrupted backref, or an extra backref which doesn't
have any tree block/data referring to it, it won't cause a big problem.

So for lowmem mode, we can do it in a different way, just like a seed
device: we allocate new meta chunks, and only use the new meta chunks
for our tree operations.

This will avoid any extra tree iteration, and it is easier to implement
AFAIK.
(Just mark all existing block groups full, to force chunk allocation.)
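The "mark everything full" trick can be sketched as a toy model: once every
existing block group reports no free space, any allocation is forced into a
freshly allocated chunk, which the repair code can then use exclusively.
The class and field names are illustrative, not the real btrfs-progs
structures:

```python
# Toy model of forcing chunk allocation by marking block groups full.

class BlockGroup:
    def __init__(self, start, length, used):
        self.start, self.length, self.used = start, length, used

    @property
    def full(self):
        return self.used >= self.length

def force_new_chunks(groups):
    """Mark all existing block groups full, so any subsequent
    allocation must come from a freshly allocated chunk."""
    for bg in groups:
        bg.used = bg.length

def allocate(groups, need, new_chunk):
    """Take space from a non-full group if possible, otherwise fall
    back to allocating a brand-new chunk via new_chunk()."""
    for bg in groups:
        if not bg.full and bg.length - bg.used >= need:
            bg.used += need
            return bg
    return new_chunk()

groups = [BlockGroup(0, 1024, 100), BlockGroup(1024, 1024, 0)]
force_new_chunks(groups)
fresh = allocate(groups, 16, lambda: BlockGroup(2048, 1024, 16))
print(fresh.start)  # 2048: the allocation came from a new chunk
```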

Thanks,
Qu

> 
> Thanks,
> Su
> 
>>>
>>> Thanks,
>>> Su
>>>
 Thanks,
 Qu

>
> Modify pin_down_tree_blocks() only for code reuse.
> So behavior of pin_metadata_blocks() which works with option
> 'init-extent-tree' is not influenced.
>
> Introduce exclude_blocks_and_extent_items() to mark extents of all
> tree
> blocks