Re: [btrfs-progs] Bug in mkfs.btrfs -r

2017-09-01 Thread Duncan
Austin S. Hemmelgarn posted on Fri, 01 Sep 2017 10:07:47 -0400 as
excerpted:

>  On 2017-09-01 09:54, Qu Wenruo wrote:
>> 
>> On 2017年09月01日 20:47, Austin S. Hemmelgarn wrote:

>>> On 2017-09-01 08:19, Qu Wenruo wrote:

>>>> Current kernel (and btrfs-progs, which also tries to follow the
>>>> kernel chunk allocator's behavior) will not make a chunk larger
>>>> than 10% of the RW space.
>>>> So for a small filesystem, chunks won't be maximally sized.

>>> Are you sure about this?  I've got a couple of sub 10GB BTRFS volumes
>>> that definitely have more than one 1GB data chunk.
>> 
>> Yes, check the following code:
>> 
>>      /* we don't want a chunk larger than 10% of writeable space */
>>      max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
>>                           max_chunk_size);
>>
>> Which is in __btrfs_alloc_chunk() function in fs/btrfs/volumes.c
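
For reference, div_factor() is the btrfs tenths helper, so a factor of 1
yields the 10% cap quoted in the comment.  A minimal sketch of the math,
modeled on the helper in the btrfs sources (the exact definition here is
from memory, so treat it as an assumption):

        /* returns num * factor / 10, so div_factor(total_rw_bytes, 1)
         * computes the "10% of writeable space" cap */
        static inline u64 div_factor(u64 num, int factor)
        {
                if (factor == 10)
                        return num;
                num *= factor;
                return div_u64(num, 10);  /* 64-bit divide helper */
        }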

> Huh, I may have the remnants of an old bug present on those filesystems
> then, I'll have to look further into this.

Please do.

While I redid my /boot (and backups on other devices) when I upgraded to
all-ssd, and my /boot is now 512 MiB, with what appears to be a 64 MiB
first-chunk (mixed-bg mode so data/metadata), doubled due to dup-mode to
128 MiB, and while that works out to 1/8=12.5% of the filesystem size,
it still rebalances just fine as long as there's still 128 MiB unallocated
to create the new chunks to write into.

But my earlier layout had a 256 MiB /boot (and backups), and the initial
chunk size was *STILL* 64 MiB, dup to 128 MiB, half the 256 MiB filesystem
size, so obviously no balance of that chunk was possible, because there's
not enough space left, after the system chunk takes its bite as well, to
write the new copy of the chunk (and its dup) into.

Now the last time I redid the old-layout /boot with a mkfs.btrfs was
several kernel and userspace cycles ago, so mkfs.btrfs might well have
changed and now create smaller chunks on a 256 MiB filesystem, but it
sure was frustrating not to be able to rebalance that chunk, and I don't
/know/ that the bug was fixed, because I have the larger /boot now.  Tho
as I said, even there it's apparently 1/8 = 12.5%, larger than the 10%
quoted, yet I know 32 MiB and I believe even 16 MiB chunks are possible,
tho I'm not sure what the exact minimum is.

Anyway, be sure and check mixed-mode too, because that's where I had my
problems, tho I'm not sure if it's where you had yours.  But it could be
that the differing code path of mixed-mode misses that 10% check, which
would explain my problem, and possibly yours if that's what yours were.

Meanwhile, I might have some free time to do my own checks tomorrow.
It'd be worth it just to my own peace of mind to settle the issue,
since it frustrated me for so long.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



[PATCH] Btrfs: move finish_wait out of the loop

2017-09-01 Thread Liu Bo
If we're still going to wait after schedule(), we don't have to do
finish_wait() to remove our %wait_queue_entry since prepare_to_wait()
won't add the same %wait_queue_entry twice.
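
For illustration, the canonical open-coded wait loop this relies on looks
roughly like the following; a sketch with a generic wait queue and
predicate (wq and done() are stand-ins, not the patched btrfs fields):

        static void wait_until_done(wait_queue_head_t *wq, bool (*done)(void))
        {
                DEFINE_WAIT(wait);

                do {
                        /* re-arming is safe: prepare_to_wait() only enqueues
                         * the entry if it is not already on the queue */
                        prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
                        if (!done())
                                schedule();
                } while (!done());
                finish_wait(wq, &wait); /* dequeue once, after the loop */
        }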

Signed-off-by: Liu Bo 
---
 fs/btrfs/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index fa1b78c..19e4dec 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -651,8 +651,8 @@ static void btrfs_wait_for_no_snapshoting_writes(struct btrfs_root *root)
 		if (writers)
 			schedule();
 
-		finish_wait(&root->subv_writers->wait, &wait);
 	} while (writers);
+	finish_wait(&root->subv_writers->wait, &wait);
 }
 
 static int create_snapshot(struct btrfs_root *root, struct inode *dir,
-- 
2.9.4



[PATCH] Btrfs: remove batch plug in run_scheduled_IO

2017-09-01 Thread Liu Bo
The block layer has a limit on plugging, i.e. BLK_MAX_REQUEST_COUNT == 16,
so we don't gain any benefit by batching 64 bios here.

Signed-off-by: Liu Bo 
---
 fs/btrfs/volumes.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e8b9a26..de55024 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -524,12 +524,6 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
 					 &device->work);
 			goto done;
 		}
-		/* unplug every 64 requests just for good measure */
-		if (batch_run % 64 == 0) {
-			blk_finish_plug(&plug);
-			blk_start_plug(&plug);
-			sync_pending = 0;
-		}
 	}
 
 	cond_resched();
-- 
2.9.4



[PATCH] Btrfs: use wait_event instead of a single function

2017-09-01 Thread Liu Bo
Since TASK_UNINTERRUPTIBLE has been used here, wait_event() can do the
same job.
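
For reference, wait_event(wq, condition) expands to roughly the loop
below; a simplified sketch of the macro from include/linux/wait.h,
omitting the "condition already true" fast path:

        DEFINE_WAIT(__wait);

        for (;;) {
                prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE);
                if (condition)
                        break;
                schedule();
        }
        finish_wait(&wq, &__wait);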

Signed-off-by: Liu Bo 
---
 fs/btrfs/ioctl.c | 21 +++--
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 19e4dec..1c8bdde 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -638,23 +638,6 @@ static noinline int create_subvol(struct inode *dir,
 	return ret;
 }
 
-static void btrfs_wait_for_no_snapshoting_writes(struct btrfs_root *root)
-{
-	s64 writers;
-	DEFINE_WAIT(wait);
-
-	do {
-		prepare_to_wait(&root->subv_writers->wait, &wait,
-				TASK_UNINTERRUPTIBLE);
-
-		writers = percpu_counter_sum(&root->subv_writers->counter);
-		if (writers)
-			schedule();
-
-	} while (writers);
-	finish_wait(&root->subv_writers->wait, &wait);
-}
-
 static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 			   struct dentry *dentry,
 			   u64 *async_transid, bool readonly,
@@ -683,7 +666,9 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 
 	atomic_inc(&root->will_be_snapshoted);
 	smp_mb__after_atomic();
-	btrfs_wait_for_no_snapshoting_writes(root);
+	/* wait for no snapshot writes */
+	wait_event(root->subv_writers->wait,
+		   percpu_counter_sum(&root->subv_writers->counter) == 0);
 
 	ret = btrfs_start_delalloc_inodes(root, 0);
 	if (ret)
-- 
2.9.4



[PATCH] Btrfs: protect conditions within root->log_mutex while waiting

2017-09-01 Thread Liu Bo
Both wait_log_commit() and wait_for_writer() are checking their wait
condition outside the mutex lock.

This refactors the code a bit to be lock safe.

Signed-off-by: Liu Bo 
---
 fs/btrfs/tree-log.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 3a11ae6..434dcd5 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2701,34 +2701,36 @@ static void wait_log_commit(struct btrfs_root *root, int transid)
 	 * so we know that if ours is more than 2 older than the
 	 * current transaction, we're done
 	 */
-	do {
+	for (;;) {
 		prepare_to_wait(&root->log_commit_wait[index],
 				&wait, TASK_UNINTERRUPTIBLE);
-		mutex_unlock(&root->log_mutex);
 
-		if (root->log_transid_committed < transid &&
-		    atomic_read(&root->log_commit[index]))
-			schedule();
+		if (!(root->log_transid_committed < transid &&
+		      atomic_read(&root->log_commit[index])))
+			break;
 
-		finish_wait(&root->log_commit_wait[index], &wait);
+		mutex_unlock(&root->log_mutex);
+		schedule();
 		mutex_lock(&root->log_mutex);
-	} while (root->log_transid_committed < transid &&
-		 atomic_read(&root->log_commit[index]));
+	}
+	finish_wait(&root->log_commit_wait[index], &wait);
 }
 
 static void wait_for_writer(struct btrfs_root *root)
 {
 	DEFINE_WAIT(wait);
 
-	while (atomic_read(&root->log_writers)) {
-		prepare_to_wait(&root->log_writer_wait,
-				&wait, TASK_UNINTERRUPTIBLE);
+	for (;;) {
+		prepare_to_wait(&root->log_writer_wait, &wait,
+				TASK_UNINTERRUPTIBLE);
+		if (!atomic_read(&root->log_writers))
+			break;
+
 		mutex_unlock(&root->log_mutex);
-		if (atomic_read(&root->log_writers))
-			schedule();
-		finish_wait(&root->log_writer_wait, &wait);
+		schedule();
 		mutex_lock(&root->log_mutex);
 	}
+	finish_wait(&root->log_writer_wait, &wait);
 }
 
 static inline void btrfs_remove_log_ctx(struct btrfs_root *root,
-- 
2.9.4



[PATCH] Btrfs: avoid wake_up if possible

2017-09-01 Thread Liu Bo
wake_up() checks whether anyone is on the waiting list while holding a
spin_lock().

In some btrfs code paths we don't check waitqueue_active() first, so
the spin_lock()/spin_unlock() pair in wake_up() is executed even if no
one is waiting on the queue.

There are more wake_up()s without waitqueue_active(), but these two
are the hottest ones I've run into so far.
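
The resulting pattern, for reference; a sketch whose safety caveat is
documented above waitqueue_active() in include/linux/wait.h (wq here is
a generic wait_queue_head_t, not a specific btrfs field):

        /*
         * Only take wake_up()'s internal spinlock when someone may
         * actually be waiting.  This is only safe when the waiter's
         * condition store and this check are ordered, e.g. both sides
         * run under the same spin_lock, or an smp_mb() precedes the
         * check.
         */
        if (waitqueue_active(&wq))
                wake_up(&wq);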

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c| 9 -
 fs/btrfs/ordered-data.c | 8 +++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 825fad6..e2dc042 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -536,8 +536,15 @@ static struct extent_state *clear_state_bit(struct extent_io_tree *tree,
 	clear_state_cb(tree, state, bits);
 	add_extent_changeset(state, bits_to_clear, changeset, 0);
 	state->state &= ~bits_to_clear;
-	if (wake)
+
+	assert_spin_locked(&tree->lock);
+	/*
+	 * The spin_lock is acquired by both waker and waiter, thus no
+	 * need to restrict the order.
+	 */
+	if (wake && waitqueue_active(&state->wq))
 		wake_up(&state->wq);
+
 	if (state->state == 0) {
 		next = next_state(state);
 		if (extent_state_in_tree(state)) {
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a3aca49..e439fb4 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -647,7 +647,13 @@ void btrfs_remove_ordered_extent(struct inode *inode,
 		spin_unlock(&fs_info->ordered_root_lock);
 	}
 	spin_unlock(&root->ordered_extent_lock);
-	wake_up(&entry->wait);
+
+	/*
+	 * Setting the flag is protected by a spin_lock pair, which has an
+	 * implicit memory barrier.
+	 */
+	if (waitqueue_active(&entry->wait))
+		wake_up(&entry->wait);
 }
 
 static void btrfs_run_ordered_extent_work(struct btrfs_work *work)
-- 
2.9.4



Re: BTRFS: error (device dm-2) in btrfs_run_delayed_refs:2960: errno=-17 Object already exists (since 3.4 / 2012)

2017-09-01 Thread Josef Bacik
You'll be fine, it's only happening on the one fs, right?  That's 13 GiB of
metadata with checksums and all that shit, it'll probably look like 8 or
9 GiB of RAM worst case.  I'd mount with -o ref_verify and check the slab
amount in /proc/meminfo to get an idea of real usage.  Once the mount is
finished that'll be about as much memory as you will use; of course it'll
grow as metadata usage grows, but it should be nominal.  Thanks,
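
For reference, the slab numbers mentioned above can be read straight out
of /proc/meminfo; a minimal sketch (doing roughly what grepping
/proc/meminfo by hand would), not part of any btrfs tooling:

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                char line[256];
                FILE *f = fopen("/proc/meminfo", "r");

                if (!f)
                        return 1;
                /* print the Slab, SReclaimable and SUnreclaim totals */
                while (fgets(line, sizeof(line), f))
                        if (!strncmp(line, "Slab:", 5) ||
                            !strncmp(line, "SReclaimable:", 13) ||
                            !strncmp(line, "SUnreclaim:", 11))
                                fputs(line, stdout);
                fclose(f);
                return 0;
        }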

Josef

Sent from my iPhone

> On Sep 1, 2017, at 4:43 PM, Marc MERLIN  wrote:
> 
>> On Thu, Aug 31, 2017 at 05:48:23PM +, Josef Bacik wrote:
>> We are using 4.11 in production at fb with backports from recent (a month 
>> ago?) stuff.  I’m relatively certain nothing bad will happen, and this 
>> branch has the most recent fsync() corruption fix (which exists in your 
>> kernel so it’s not new).  That said if you are uncomfortable I can rebase 
>> this patch onto whatever base you want and push out a branch, it’s your 
>> choice.  Keep in mind this is going to hold a lot of shit in memory, so I 
>> hope you have enough, and I’d definitely remove the sleep’s from your 
>> script, there’s no telling if this is a race condition or not and the 
>> overhead of the ref-verify stuff may cause it to be less likely to happen.  
>> Thanks,
> 
> Thanks for the warning. I have 32GB of RAM in the server, and I probably use
> 8. Most of the rest is so that I can do btrfs check --repair without the
> machine dying :-/
> 
> I am concerned that I have a lot more metadata than I have memory:
> gargamel:~# btrfs fi df /mnt/btrfs_pool1
> Data, single: total=10.66TiB, used=10.60TiB
> System, DUP: total=32.00MiB, used=1.20MiB
> Metadata, DUP: total=58.00GiB, used=12.76GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> gargamel:~# btrfs fi df /mnt/btrfs_pool2
> Data, single: total=5.07TiB, used=4.78TiB
> System, DUP: total=8.00MiB, used=640.00KiB
> Metadata, DUP: total=70.50GiB, used=66.58GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> That's 13GB + 67GB.
> Is it going to fall over if I only have 32GB of RAM?
> 
> If I stop mounting /mnt/btrfs_pool2 for a while, will 32GB of RAM
> cover the 13GB of metadata from /mnt/btrfs_pool1 ?
> 
> Thanks,
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems 
>   what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/


Re: mount time for big filesystems

2017-09-01 Thread Dan Merillat
On Fri, Sep 1, 2017 at 11:20 AM, Austin S. Hemmelgarn
 wrote:
> No, that's not what I'm talking about.  You always get one bcache device per
> backing device, but multiple bcache devices can use the same physical cache
> device (that is, backing devices map 1:1 to bcache devices, but cache
> devices can map 1:N to bcache devices).  So, in other words, the layout I'm
> suggesting looks like this:
>
> This is actually simpler to manage for multiple reasons, and will avoid
> wasting space on the cache device because of random choices made by BTRFS
> when deciding where to read data.

Be careful with bcache - if you lose the SSD and it has dirty data on
it, your entire FS is gone.   I ended up contributing a number of
patches to the recovery tools digging my array out from that.   Even
if a single file is dirty, the new metadata tree will only exist on
the cache device, which doesn't honor barriers writing back to the
underlying storage.   That means it's likely to have a root pointing
at a metadata tree that's no longer there.  The recovery method is
finding an older root that has a complete tree, and recovery-walking
the entire FS from that.

I don't know if dm-cache honors write barriers from the cache to the
backing storage, but I would still recommend using them both in
write-through mode, not write-back.


Re: BTRFS: error (device dm-2) in btrfs_run_delayed_refs:2960: errno=-17 Object already exists (since 3.4 / 2012)

2017-09-01 Thread Marc MERLIN
On Thu, Aug 31, 2017 at 05:48:23PM +, Josef Bacik wrote:
> We are using 4.11 in production at fb with backports from recent (a month 
> ago?) stuff.  I’m relatively certain nothing bad will happen, and this branch 
> has the most recent fsync() corruption fix (which exists in your kernel so 
> it’s not new).  That said if you are uncomfortable I can rebase this patch 
> onto whatever base you want and push out a branch, it’s your choice.  Keep in 
> mind this is going to hold a lot of shit in memory, so I hope you have 
> enough, and I’d definitely remove the sleep’s from your script, there’s no 
> telling if this is a race condition or not and the overhead of the ref-verify 
> stuff may cause it to be less likely to happen.  Thanks,

Thanks for the warning. I have 32GB of RAM in the server, and I probably use
8. Most of the rest is so that I can do btrfs check --repair without the
machine dying :-/

I am concerned that I have a lot more metadata than I have memory:
gargamel:~# btrfs fi df /mnt/btrfs_pool1
Data, single: total=10.66TiB, used=10.60TiB
System, DUP: total=32.00MiB, used=1.20MiB
Metadata, DUP: total=58.00GiB, used=12.76GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
gargamel:~# btrfs fi df /mnt/btrfs_pool2
Data, single: total=5.07TiB, used=4.78TiB
System, DUP: total=8.00MiB, used=640.00KiB
Metadata, DUP: total=70.50GiB, used=66.58GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

That's 13GB + 67GB.
Is it going to fall over if I only have 32GB of RAM?

If I stop mounting /mnt/btrfs_pool2 for a while, will 32GB of RAM
cover the 13GB of metadata from /mnt/btrfs_pool1 ?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


Re: BTRFS critical (device sda2): corrupt leaf, bad key order: block=293438636032, root=1, slot=11

2017-09-01 Thread Chris Murphy
On Fri, Sep 1, 2017 at 7:38 AM, Eric Wolf <19w...@gmail.com> wrote:
> Okay,
> I have a hex editor open. Now what? Your instructions seem
> straightforward, but I have no idea what I'm doing.

First step, backup as much as you can, because if you don't know what
you're doing, good chance you make a mistake and break the file
system. So be prepared for making things worse.

Next, you need to get the physical sector(s) for the leaf containing
the error. Use btrfs-map-logical -l <address> for this, where <address>
is the same one you used before with btrfs-debug-tree. If this is a
default file system, the leaf is 16 KiB, which is 32 512-byte sectors.
And btrfs-map-logical will report back an LBA based on 512 bytes *if*
this is a 512e drive, which most drives are, but you should make sure.

]$ sudo blockdev --getss /dev/nvme0n1
512

If it's 512 bytes, your bad item is in one of 32 sectors starting from
the LBA reported by btrfs-map-logical. If it's 4096 bytes, then it's
one of four sectors starting from that LBA. Find out which sector the
bad key is in, which you have to do by looking at the hex; literally you
will have to go byte by byte from the beginning, learning how to parse
the on-disk format. And fix the bad item objectid as described by Hugo.
And then after you fix that, mount the volume, scrub (or read a file
that will trigger the problem if you can) this time you'll get a bad
csum error instead of the original corrupt leaf error. And that csum
error will tell you what csum it found and what csum it expects. So
now you go back to that first sector, which is where the csum is
stored, find it, and replace it with the expected csum. And maybe (not
sure) the csum error will be in decimal, in which case you'd have to
convert to hex to find the bad one and replace with the good one.
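
To make the sector arithmetic concrete, a small sketch; the three input
values here are hypothetical examples, not taken from this report:

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                uint64_t leaf_lba    = 583680; /* start LBA from btrfs-map-logical */
                uint32_t byte_offset = 14496;  /* bad key's offset within the 16 KiB leaf */
                uint32_t sector_size = 512;    /* from blockdev --getss */

                printf("edit sector %llu, byte %u within it\n",
                       (unsigned long long)(leaf_lba + byte_offset / sector_size),
                       byte_offset % sector_size);
                return 0;
        }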

Tedious. Maybe Hans van Kranenburg's btrfs-python is easier, I have no idea.


-- 
Chris Murphy


Re: [PATCH] btrfs-progs: doc: fix btrfs-inspect-internal rootid doc

2017-09-01 Thread David Sterba
On Fri, Aug 25, 2017 at 09:34:49AM +0900, Misono, Tomohiro wrote:
> On 2017/08/25 2:37, David Sterba wrote:
> > On Thu, Aug 24, 2017 at 04:39:53PM +0900, Misono, Tomohiro wrote:
> >> "btrfs inspect-internal rootid " rejects a file to be specified in
> >> the implementation.
> >> Therefore change "file or directory" to "directory" in the doc.
> > 
> > Is there a reason why a file should not be accepted? The ioctl supports
> > that just fine.
> 
> I thought the ioctl would work for a file, but the current code checks
> whether <path> is a directory or not in the first place. So, I changed the doc.
> 
> If it is better to change the code rather than the doc, I will do that. 
> What do you think?

There are helpers that can open either a file or a directory from a
given path, e.g. open_file_or_dir3, so that can be used.


Re: [PATCH] btrfs-progs: Move check of mixed block early.

2017-09-01 Thread David Sterba
On Tue, Jul 25, 2017 at 09:57:44AM +0800, Gu Jinxiang wrote:
> Make the check for mixed block groups early.
> Reason:
> We do not support re-initing the extent tree for mixed block groups,
> so it will return -EINVAL in the function reinit_extent_tree.
> In this situation, we do not need to start a transaction.
> We do not have a btrfs_abort_transaction like the kernel does yet,
> so we should avoid starting a transaction that is not necessary.

I've fixed the crash in another way, and started adding parts of the
transaction abort code. The check for mixed blockgroups IMO belongs
inside the function and not to cmd_check, as it's a lower level
implementation detail.


Re: [PATCH][v2] btrfs-progs: print the csum length in debug-tree

2017-09-01 Thread David Sterba
On Fri, Aug 25, 2017 at 06:17:23PM +0300, Nikolay Borisov wrote:
> 
> 
> On 25.08.2017 18:11, jo...@toxicpanda.com wrote:
> > From: Josef Bacik 
> > 
> > While looking at a log of a corrupted fs I needed to verify we were
> > missing csums for a given range.  Make this easier by printing out the
> > range of bytes a csum item covers.
> > 
> > Signed-off-by: Josef Bacik 
> 
> Reviewed-by: Nikolay Borisov 

v2 applied, thanks.


Snapper Rollback filesystem is read only

2017-09-01 Thread Eric Wolf
I rolled back my filesystem with 'snapper rollback 81 (or whatever
snapshot it was)' and now when I boot my filesystem is read-only. How
do I fix it?


Re: mount time for big filesystems

2017-09-01 Thread Austin S. Hemmelgarn

On 2017-09-01 11:00, Juan Orti Alcaine wrote:



On 1 Sep 2017 15:59, "Austin S. Hemmelgarn" wrote:


If you are going to use bcache, you don't need separate caches for
each device (and in fact, you're probably better off sharing a cache
across devices).


But, if I mix all the backing devices, I'll only get one bcache device, 
so I won't be able to do btrfs RAID1 on that.
No, that's not what I'm talking about.  You always get one bcache device 
per backing device, but multiple bcache devices can use the same 
physical cache device (that is, backing devices map 1:1 to bcache 
devices, but cache devices can map 1:N to bcache devices).  So, in other 
words, the layout I'm suggesting looks like this:


/dev/sda1: Backing device.
/dev/sdb1: Backing device.
/dev/sdc1: Backing device.
/dev/sdd1: Backing device.
/dev/sde1: SSD cache device.
/dev/bcache0: Corresponds to /dev/sda1, uses /dev/sde1 as cache
/dev/bcache1: Corresponds to /dev/sdb1, uses /dev/sde1 as cache
/dev/bcache2: Corresponds to /dev/sdc1, uses /dev/sde1 as cache
/dev/bcache3: Corresponds to /dev/sdd1, uses /dev/sde1 as cache

This is actually simpler to manage for multiple reasons, and will avoid 
wasting space on the cache device because of random choices made by 
BTRFS when deciding where to read data.



Re: [PATCH] btrfs-progs: print the csum length in debug-tree

2017-09-01 Thread David Sterba
On Fri, Aug 25, 2017 at 04:48:43PM +0300, Nikolay Borisov wrote:
> 
> 
> On 25.08.2017 16:13, jo...@toxicpanda.com wrote:
> > From: Josef Bacik 
> > 
> > While looking at a log of a corrupted fs I needed to verify we were
> > missing csums for a given range.  Make this easier by printing out how
> > many bytes a csum extent item represents when using btrfs_debug_tree.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  print-tree.c | 10 --
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/print-tree.c b/print-tree.c
> > index 5927ed3..a124c96 100644
> > --- a/print-tree.c
> > +++ b/print-tree.c
> > @@ -1103,9 +1103,15 @@ void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *eb)
> > case BTRFS_CSUM_ITEM_KEY:
> > printf("\t\tcsum item\n");
> > break;
> > -   case BTRFS_EXTENT_CSUM_KEY:
> > -   printf("\t\textent csum item\n");
> > +   case BTRFS_EXTENT_CSUM_KEY: {
> > +   u16 csum_size =
> > +   btrfs_super_csum_size(root->fs_info->super_copy);
> > +   u32 size = (item_size / csum_size) *
> > +   root->fs_info->sectorsize;
> > +   printf("\t\textent csum item bytes %lu\n",
> > +  (unsigned long)size);
> 
> Currently for a csum item we get:
> 
> item 0 key (EXTENT_CSUM EXTENT_CSUM 1103101952) itemoff 16279 itemsize 4
>   extent csum item
> 
> 
>  Why don't you go one step further and print the covered range -
> it would be key.offset + size (the number you've calculated) And so you
> can print something like:
> 
> item 0 key (EXTENT_CSUM EXTENT_CSUM 1103101952) itemoff 16279 itemsize 4
>   extent csum item 1103101952-1103106048
> 
> "extent csum item range %llu-%lly\n", key.offset, key.offset + size

I think the calculation is really simple to do, if one needs it.

The 'bytes' seems to refer to the range, so I'll change that to length.
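
For a concrete check of the arithmetic in Nikolay's example, a small
sketch assuming crc32 checksums (csum_size 4) and a 4 KiB sectorsize:

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                uint32_t item_size = 4, csum_size = 4, sectorsize = 4096;
                uint64_t key_offset = 1103101952ULL; /* from the quoted item */
                uint32_t size = (item_size / csum_size) * sectorsize;

                /* prints "extent csum item range 1103101952-1103106048",
                 * matching the range in the example above */
                printf("extent csum item range %llu-%llu\n",
                       (unsigned long long)key_offset,
                       (unsigned long long)(key_offset + size));
                return 0;
        }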


Re: [PATCH] btrfs: incorrect invalid lookup_path_rootid() error handling

2017-09-01 Thread David Sterba
On Fri, Aug 18, 2017 at 09:04:19AM +0200, Goffredo Baroncelli wrote:
> Hi All,
> 
> Piotr and Chris, pointed me at these bugs which could be triggered by this 
> test case:
> 
> # btrfs sub create test1
> # btrfs sub create test1/test2
> # btrfs sub snap test1 test1.snap
> # btrfs fi du -s test1
>  Total   Exclusive  Set shared  Filename
>  0.00B   0.00B   0.00B  test1
> # btrfs fi du -s test1.snap
>  Total   Exclusive  Set shared  Filename
> ERROR: cannot check space of 'test1.snap': Inappropriate ioctl for device
> 
> These are two bugs:
> 1) in the function du_walk_dir() the error returned by du_add_file() is
> normally ignored, but if du_walk_dir() "walks" the last item, the error
> is returned to the caller
> 2) in the function du_add_file() it doesn't make sense to call
> lookup_path_rootid() when the inode is BTRFS_EMPTY_SUBVOL_DIR_OBJECTID:
> in this case the function doesn't return a valid value.

Thanks, patches applied and I've added a testcase following the
instructions above.


Re: joining to contribute

2017-09-01 Thread Hans van Kranenburg
On 09/01/2017 07:15 AM, Qu Wenruo wrote:
> 
> 
> On 2017年09月01日 11:36, Anthony Riley wrote:
>> Hey folks,
>>
>> I thought I would finally take a swing at things; I've wanted to be a
>> kernel/fs dev for a few years now. My current $job is as an
>> Infrastructure Engineer. I'm currently teaching myself C and have
>> background in shell scripting & python. I love doing deep dives and
>> learning about linux internals.  I've read the btrfs.wiki and can't
>> really decide which project to choose to start.
>>
>> Also should I go through this
>> https://kernelnewbies.org/FirstKernelPatch first?
>> Or should i start with something in Userspace?
> 
> Well, personally I strongly recommend starting with the btrfs on-disk
> format first, then btrfs-progs/test cases, with kernel contribution
> as the final objective.
> 
> BTW, if you want to start with the btrfs on-disk format, print-tree.c from
> btrfs-progs is a good starting point and the btrfs wiki has a relatively
> well documented entry for it.
> https://btrfs.wiki.kernel.org/index.php/Btrfs_design
> https://btrfs.wiki.kernel.org/index.php/Btree_Items
> 

(Insert THX sound effect here)

Previous very useful Qu answer:

https://www.spinics.net/lists/linux-btrfs/msg66489.html

-- 
Hans van Kranenburg


Re: [btrfs-progs] Bug in mkfs.btrfs -r

2017-09-01 Thread Austin S. Hemmelgarn

On 2017-09-01 09:54, Qu Wenruo wrote:



On 2017年09月01日 20:47, Austin S. Hemmelgarn wrote:

On 2017-09-01 08:19, Qu Wenruo wrote:



On 2017年09月01日 20:05, Austin S. Hemmelgarn wrote:

On 2017-09-01 07:49, Qu Wenruo wrote:


On 2017年09月01日 19:28, Austin S. Hemmelgarn wrote:

On 2017-08-31 20:13, Qu Wenruo wrote:


On 2017年09月01日 01:27, Goffredo Baroncelli wrote:

Hi All,

I found a bug in mkfs.btrfs when the option '-r' is used. It seems
that the full disk is not visible.


Despite the new bug you found, -r has several existing bugs.
Is this actually a bug though?  Every other filesystem creation tool
that I know of that offers functionality like this generates the
filesystem just large enough to contain the data you want in it, so I
would argue that making this use the whole device is actually breaking
consistency with other tools, not to mention removing functionality that
is useful (even aside from the system image generation use case I
mentioned, there are other practical applications; seed 'device'
generation comes to mind).


Well, then documentation bug.

And I'm not sure the chunk size is correct or optimized.
Even for btrfs-convert, which will make data chunks very scattered, 
we still try to make a large chunk to cover scattered data extents.
For a one-shot or read-only filesystem though, a maximally sized 
chunk is probably suboptimal.


Not exactly.
The current kernel (and btrfs-progs, which also tries to follow the kernel
chunk allocator's behavior) will not make a chunk larger than 10% of the
RW space.

So for a small filesystem, chunks won't be maximally sized.
Are you sure about this?  I've got a couple of sub 10GB BTRFS volumes 
that definitely have more than one 1GB data chunk.


Yes, check the following code:

     /* we don't want a chunk larger than 10% of writeable space */
     max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
                          max_chunk_size);
Which is in __btrfs_alloc_chunk() function in fs/btrfs/volumes.c
Huh, I may have the remnants of an old bug present on those filesystems 
then, I'll have to look further into this.




  Suppose you use this to generate a base image for a system in the 
form of a seed device.  This actually ends up being a pretty easy 
way to get factory reset functionality.  It's also a case where you 
want the base image to take up as little space as possible, so that 
the end-user usable storage space is as much as possible.  In that 
case, if your base image doesn't need an exact multiple of 1GB for 
data chunks, then using 1GB data chunks is not the best choice for 
at least the final data chunk (because the rest of that 1GB gets 
wasted).  A similar argument applies for metadata.


Yes, your example makes sense (despite the 10% limit I mentioned above).

The problem is, no one really knows how the image will be used.
Maybe it will be used as normal btrfs (with fi resize), or with your 
purpose.
We can't save users from making poor choices.  If we could, we 
wouldn't have anywhere near as many e-mails on the list from people 
who are trying to recover data from their broken filesystems because 
they have no backups.


The only case I can find where '-r' is a win is when you need the 
filesystem to be as small as possible with no free space.  The moment 
you need free space, it's actually faster to just create the 
filesystem, resize it to the desired size, and then copy in your data 
(I've actually benchmarked this, and while it's not _much_ difference 
in time spent, there is a measurable difference, with my guess being 
that the allocation code is doing more work in userspace than in the 
kernel).  At a minimum, I think it's probably worth documenting this 
fact.


I still remember, some time ago, other guys told me that the main
advantage of -r is that we don't need root privilege to mount.
That's a good point I hadn't thought of.  I'm used to working on single 
user systems (where I can just trap out to root to do stuff like that 
without any issue), or multi-user systems where I'm the admin (where I 
can also trap out to root to do that kind of thing with limited issues). 
 Getting a working FUSE module for BTRFS could help with this too 
though (and actually would probably not be hugely difficult, considering 
that we have most of the FS specific code in userspace already as part 
of btrfs-progs), but that's kind of beyond this discussion.


Anyway, documentation is important, but we need to first know the 
correct or designed behavior of -r.

Agreed.


At least mkfs.ext4's -d option doesn't limit the size.
In my test, a 1G file with mkfs.ext4 -d still shows about 900M+ of
available space.
That may just be them choosing to use whatever size the device has. 
They have limited incentive to do anything else, because genext2fs 
exists and covers the minimal filesystem generation side of things.


For the normal btrfs case, although it may not cause much of a problem,
it will not be the optimized use case and may need

Re: mount time for big filesystems

2017-09-01 Thread Austin S. Hemmelgarn

On 2017-09-01 09:52, Juan Orti Alcaine wrote:

2017-08-31 13:36 GMT+02:00 Roman Mamedov :

If you could implement SSD caching in front of your FS (such as lvmcache or
bcache), that would work wonders for performance in general, and especially
for mount times. I have seen amazing results with lvmcache (of just 32 GB) for
a 14 TB FS.


I'm thinking about adding a SSD for my 4 disks RAID1 filesystem, but I
have doubts about how to correctly do it in a multidevice filesystem.

I guess I should make 4 partitions on the SSD and pair them with my
backing devices, then create the btrfs on top of bcache0, bcache1,...
is this the right way to do it?
If you are going to use bcache, you don't need separate caches for each 
device (and in fact, you're probably better off sharing a cache across 
devices).


If instead you're going to use dm-cache/LVM, you will need two logical 
volumes per-device for the cache, one big one (for the actual cache), 
and one little one (for metadata, usually a few hundred MB is fine).


In general though, you're correct, it is preferred to do things in the 
order you suggested.  It is technically possible sometimes to convert an 
existing device to being cached in-place, but it's risky, and restoring 
from a backup onto a clean filesystem has other benefits too.



Re: [btrfs-progs] Bug in mkfs.btrfs -r

2017-09-01 Thread Qu Wenruo



On 2017年09月01日 20:47, Austin S. Hemmelgarn wrote:

On 2017-09-01 08:19, Qu Wenruo wrote:



On 2017年09月01日 20:05, Austin S. Hemmelgarn wrote:

On 2017-09-01 07:49, Qu Wenruo wrote:


On 2017年09月01日 19:28, Austin S. Hemmelgarn wrote:

On 2017-08-31 20:13, Qu Wenruo wrote:


On 2017年09月01日 01:27, Goffredo Baroncelli wrote:

Hi All,

I found a bug in mkfs.btrfs when the option '-r' is used. It seems
that the full disk is not visible.


Despite the new bug you found, -r has several existing bugs.
Is this actually a bug though?  Every other filesystem creation tool
that I know of that offers functionality like this generates the
filesystem just large enough to contain the data you want in it, so I
would argue that making this use the whole device is actually breaking
consistency with other tools, not to mention removing functionality that
is useful (even aside from the system image generation use case I
mentioned, there are other practical applications; seed 'device'
generation comes to mind).


Well, then documentation bug.

And I'm not sure the chunk size is correct or optimized.
Even for btrfs-convert, which will make data chunks very scattered, 
we still try to make a large chunk to cover scattered data extents.
For a one-shot or read-only filesystem though, a maximally sized 
chunk is probably suboptimal.


Not exactly.
The current kernel (and btrfs-progs, which also tries to follow the kernel
chunk allocator's behavior) will not make a chunk larger than 10% of the
RW space.

So for a small filesystem, chunks won't be maximally sized.
Are you sure about this?  I've got a couple of sub 10GB BTRFS volumes 
that definitely have more than one 1GB data chunk.


Yes, check the following code:

	/* we don't want a chunk larger than 10% of writeable space */
	max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
			     max_chunk_size);
Which is in __btrfs_alloc_chunk() function in fs/btrfs/volumes.c



  Suppose you use this to generate a base image for a system in the 
form of a seed device.  This actually ends up being a pretty easy way 
to get factory reset functionality.  It's also a case where you want 
the base image to take up as little space as possible, so that the 
end-user usable storage space is as much as possible.  In that case, 
if your base image doesn't need an exact multiple of 1GB for data 
chunks, then using 1GB data chunks is not the best choice for at 
least the final data chunk (because the rest of that 1GB gets 
wasted).  A similar argument applies for metadata.


Yes, your example makes sense (despite the 10% limit I mentioned above).

The problem is, no one really knows how the image will be used.
Maybe it will be used as normal btrfs (with fi resize), or with your 
purpose.
We can't save users from making poor choices.  If we could, we wouldn't 
have anywhere near as many e-mails on the list from people who are 
trying to recover data from their broken filesystems because they have 
no backups.


The only case I can find where '-r' is a win is when you need the 
filesystem to be as small as possible with no free space.  The moment 
you need free space, it's actually faster to just create the filesystem, 
resize it to the desired size, and then copy in your data (I've actually 
benchmarked this, and while it's not _much_ difference in time spent, 
there is a measurable difference, with my guess being that the 
allocation code is doing more work in userspace than in the kernel).  At 
a minimum, I think it's probably worth documenting this fact.


I still remember, some time ago, other guys told me that the main
advantage of -r is that we don't need root privilege to mount.


Anyway, documentation is important, but we need to first know the 
correct or designed behavior of -r.


At least mkfs.ext4's -d option doesn't limit the size.
In my test, a 1G file with mkfs.ext4 -d still shows about 900M+ of
available space.


For the normal btrfs case, although it may not cause much of a problem,
it will not be the optimized use case and may need an extra manual balance.
Actually, until the first write to the filesystem, it will still be an 
optimal layout.  Once you start writing to any BTRFS filesystem that has 
an optimal layout though, it immediately becomes non-optimal, and 
there's not really anything we can do about that unless we allow chunks 
that are already allocated to be resized on the fly (which is a bad idea 
for multiple reasons).




At least to me, that's not the case for chunks created by the -r option.

BTW, the seed device is RO anyway, so how much or how little spare space
we have is not a problem at all.
That really depends on how you look at it.  Aside from the above 
example, there's the rather specific question of why you would not 
want to avoid wasting space.  The filesystem is read-only, which 
means that any 'free space' on that filesystem is completely 
unusable, can't be reclaimed for anything else, and in general is 
just a waste.


Still same problem 

Re: mount time for big filesystems

2017-09-01 Thread Juan Orti Alcaine
2017-08-31 13:36 GMT+02:00 Roman Mamedov :
> If you could implement SSD caching in front of your FS (such as lvmcache or
> bcache), that would work wonders for performance in general, and especially
> for mount times. I have seen amazing results with lvmcache (of just 32 GB) for
> a 14 TB FS.

I'm thinking about adding a SSD for my 4 disks RAID1 filesystem, but I
have doubts about how to correctly do it in a multidevice filesystem.

I guess I should make 4 partitions on the SSD and pair them with my
backing devices, then create the btrfs on top of bcache0, bcache1,...
is this the right way to do it?


Re: BTRFS critical (device sda2): corrupt leaf, bad key order: block=293438636032, root=1, slot=11

2017-09-01 Thread Eric Wolf
On Thu, Aug 31, 2017 at 4:11 PM, Hugo Mills  wrote:
> On Thu, Aug 31, 2017 at 03:21:07PM -0400, Eric Wolf wrote:
>> I've previously confirmed it's a bad ram module which I have already
>> submitted an RMA for. Any advice for manually fixing the bits?
>
>What I'd do... use a hex editor and the contents of ctree.h as
> documentation to find the byte in question, change it back to what it
> should be, mount the FS, try reading the directory again, look up the
> csum failure in dmesg, edit the block again to fix up the csum, and
> it's done. (Yes, I've done this before, and I'm a massive nerd).
>
>It's also possible to use Hans van Kranenburg's btrfs-python to fix
> up this kind of thing, but I've not done it myself. There should be a
> couple of talk-throughs from Hans in various archives -- both this
> list (find it on, say, http://www.spinics.net/lists/linux-btrfs/), and
> on the IRC archives (http://logs.tvrrug.org.uk/logs/%23btrfs/latest.html).
>
>> Sorry for top leveling, not sure how mailing lists work (again sorry
>> if this message is top leveled, how do I ensure it's not?)
>
>Just write your answers _after_ the quoted text that you're
> replying to, not before. It's a convention, rather than a technical
> thing...
>
>Hugo.
>
>>
>>
>>
>> On Thu, Aug 31, 2017 at 2:59 PM, Hugo Mills  wrote:
>> >(Please don't top-post; edited for conversation flow)
>> >
>> > On Thu, Aug 31, 2017 at 02:44:39PM -0400, Eric Wolf wrote:
>> >> On Thu, Aug 31, 2017 at 2:33 PM, Hugo Mills  wrote:
>> >> > On Thu, Aug 31, 2017 at 01:53:58PM -0400, Eric Wolf wrote:
>> >> >> I'm having issues with a bad block(?) on my root ssd.
>> >> >>
>> >> >> dmesg is consistently outputting "BTRFS critical (device sda2):
>> >> >> corrupt leaf, bad key order: block=293438636032, root=1, slot=11"
>> >> >>
>> >> >> "btrfs scrub stat /" outputs "scrub status for 
>> >> >> b2c9ff7b-[snip]-48a02cc4f508
>> >> >> scrub started at Wed Aug 30 11:51:49 2017 and finished after 00:02:55
>> >> >> total bytes scrubbed: 53.41GiB with 2 errors
>> >> >> error details: verify=2
>> >> >> corrected errors: 0, uncorrectable errors: 2, unverified errors: 0"
>> >> >>
>> >> >> Running "btrfs check --repair /dev/sda2" from a live system stalls
>> >> >> after telling me corrupt leaf etc etc then "11 12". CPU usage hits
>> >> >> 100% and disk activity remains at 0.
>> >> >
>> >> >This error is usually attributable to bad hardware. Typically RAM,
>> >> > but might also be marginal power regulation (blown capacitor
>> >> > somewhere) or a slightly broken CPU.
>> >> >
>> >> >Can you show us the output of "btrfs-debug-tree -b 293438636032 
>> >> > /dev/sda2"?
>> >
>> >Here's the culprit:
>> >
>> > [snip]
>> >> item 10 key (890553 EXTENT_DATA 0) itemoff 14685 itemsize 269
>> >>inline extent data size 248 ram 248 compress 0
>> >> item 11 key (890554 INODE_ITEM 0) itemoff 14525 itemsize 160
>> >>inode generation 5386763 transid 5386764 size 135 nbytes 135
>> >>block group 0 mode 100644 links 1 uid 10 gid 10
>> >>rdev 0 flags 0x0
>> >> item 12 key (856762 INODE_REF 31762) itemoff 14496 itemsize 29
>> >>inode ref index 2745 namelen 19 name: dpkg.statoverride.0
>> >> item 13 key (890554 EXTENT_DATA 0) itemoff 14340 itemsize 156
>> >>inline extent data size 135 ram 135 compress 0
>> > [snip]
>> >
>> >Note the objectid field -- the first number in the brackets after
>> > "key" for each item. This sequence of values should be non-decreasing.
>> > Thus, item 12 should have an objectid of 890554 to match the items
>> > either side of it, and instead it has 856762.
>> >
>> >In hex, these are:
>> >
>>  hex(890554)
>> > '0xd96ba'
>>  hex(856762)
>> > '0xd12ba'
>> >
>> >Which means you've had two bitflips close together:
>> >
>>  hex(856762 ^ 890554)
>> > '0x8400'
>> >
>> >Given that everything else is OK, and it's just one byte affected
>> > in the middle of a load of data that's really quite sensitive to
>> > errors, it's very unlikely that it's the result of a misplaced pointer
>> > in the kernel, or some other subsystem accidentally walking over that
>> > piece of RAM. It is, therefore, almost certainly your hardware that's
>> > at fault.
>> >
>> >I would strongly suggest running memtest86 on your machine -- I'd
>> > usually say a minimum of 8 hours, or longer if you possibly can (24
>> > hours), or until you have errors reported. If you get errors reported
>> > in the same place on multiple passes, then it's the RAM. If you have
>> > errors scattered around seemingly at random, then it's probably your
>> > power regulation (PSU or motherboard).
>> >
>> >Sadly, btrfs check on its own won't be able to fix this, as it's
>> > two bits flipped. (It can cope with one bit flipped in the key, most
>> > of the time, but not two). It can be fixed manually, if you're
>> > familiar with a hex editor and the on-disk data structures.
>> >
>> 

Re: BTRFS critical (device sda2): corrupt leaf, bad key order: block=293438636032, root=1, slot=11

2017-09-01 Thread Eric Wolf
Okay,
I have a hex editor open. Now what? Your instructions seem
straightforward, but I have no idea what I'm doing.
---
Eric Wolf
(201) 316-6098
19w...@gmail.com


On Thu, Aug 31, 2017 at 4:11 PM, Hugo Mills  wrote:
> On Thu, Aug 31, 2017 at 03:21:07PM -0400, Eric Wolf wrote:
>> I've previously confirmed it's a bad ram module which I have already
>> submitted an RMA for. Any advice for manually fixing the bits?
>
>What I'd do... use a hex editor and the contents of ctree.h as
> documentation to find the byte in question, change it back to what it
> should be, mount the FS, try reading the directory again, look up the
> csum failure in dmesg, edit the block again to fix up the csum, and
> it's done. (Yes, I've done this before, and I'm a massive nerd).
>
>It's also possible to use Hans van Kranenburg's btrfs-python to fix
> up this kind of thing, but I've not done it myself. There should be a
> couple of talk-throughs from Hans in various archives -- both this
> list (find it on, say, http://www.spinics.net/lists/linux-btrfs/), and
> on the IRC archives (http://logs.tvrrug.org.uk/logs/%23btrfs/latest.html).
>
>> Sorry for top leveling, not sure how mailing lists work (again sorry
>> if this message is top leveled, how do I ensure it's not?)
>
>Just write your answers _after_ the quoted text that you're
> replying to, not before. It's a convention, rather than a technical
> thing...
>
>Hugo.
>
>> ---
>> Eric Wolf
>> (201) 316-6098
>> 19w...@gmail.com
>>
>>
>> On Thu, Aug 31, 2017 at 2:59 PM, Hugo Mills  wrote:
>> >(Please don't top-post; edited for conversation flow)
>> >
>> > On Thu, Aug 31, 2017 at 02:44:39PM -0400, Eric Wolf wrote:
>> >> On Thu, Aug 31, 2017 at 2:33 PM, Hugo Mills  wrote:
>> >> > On Thu, Aug 31, 2017 at 01:53:58PM -0400, Eric Wolf wrote:
>> >> >> I'm having issues with a bad block(?) on my root ssd.
>> >> >>
>> >> >> dmesg is consistently outputting "BTRFS critical (device sda2):
>> >> >> corrupt leaf, bad key order: block=293438636032, root=1, slot=11"
>> >> >>
>> >> >> "btrfs scrub stat /" outputs "scrub status for 
>> >> >> b2c9ff7b-[snip]-48a02cc4f508
>> >> >> scrub started at Wed Aug 30 11:51:49 2017 and finished after 00:02:55
>> >> >> total bytes scrubbed: 53.41GiB with 2 errors
>> >> >> error details: verify=2
>> >> >> corrected errors: 0, uncorrectable errors: 2, unverified errors: 0"
>> >> >>
>> >> >> Running "btrfs check --repair /dev/sda2" from a live system stalls
>> >> >> after telling me corrupt leaf etc etc then "11 12". CPU usage hits
>> >> >> 100% and disk activity remains at 0.
>> >> >
>> >> >This error is usually attributable to bad hardware. Typically RAM,
>> >> > but might also be marginal power regulation (blown capacitor
>> >> > somewhere) or a slightly broken CPU.
>> >> >
>> >> >Can you show us the output of "btrfs-debug-tree -b 293438636032 
>> >> > /dev/sda2"?
>> >
>> >Here's the culprit:
>> >
>> > [snip]
>> >> item 10 key (890553 EXTENT_DATA 0) itemoff 14685 itemsize 269
>> >>inline extent data size 248 ram 248 compress 0
>> >> item 11 key (890554 INODE_ITEM 0) itemoff 14525 itemsize 160
>> >>inode generation 5386763 transid 5386764 size 135 nbytes 135
>> >>block group 0 mode 100644 links 1 uid 10 gid 10
>> >>rdev 0 flags 0x0
>> >> item 12 key (856762 INODE_REF 31762) itemoff 14496 itemsize 29
>> >>inode ref index 2745 namelen 19 name: dpkg.statoverride.0
>> >> item 13 key (890554 EXTENT_DATA 0) itemoff 14340 itemsize 156
>> >>inline extent data size 135 ram 135 compress 0
>> > [snip]
>> >
>> >Note the objectid field -- the first number in the brackets after
>> > "key" for each item. This sequence of values should be non-decreasing.
>> > Thus, item 12 should have an objectid of 890554 to match the items
>> > either side of it, and instead it has 856762.
>> >
>> >In hex, these are:
>> >
>>  hex(890554)
>> > '0xd96ba'
>>  hex(856762)
>> > '0xd12ba'
>> >
>> >Which means you've had two bitflips close together:
>> >
>>  hex(856762 ^ 890554)
>> > '0x8400'
>> >
>> >Given that everything else is OK, and it's just one byte affected
>> > in the middle of a load of data that's really quite sensitive to
>> > errors, it's very unlikely that it's the result of a misplaced pointer
>> > in the kernel, or some other subsystem accidentally walking over that
>> > piece of RAM. It is, therefore, almost certainly your hardware that's
>> > at fault.
>> >
>> >I would strongly suggest running memtest86 on your machine -- I'd
>> > usually say a minimum of 8 hours, or longer if you possibly can (24
>> > hours), or until you have errors reported. If you get errors reported
>> > in the same place on multiple passes, then it's the RAM. If you have
>> > errors scattered around seemingly at random, then it's probably your
>> > power regulation (PSU or motherboard).
>> >
>> >Sadly, btrfs check on its own won't be able to fix this, 

Re: [btrfs-progs] Bug in mkfs.btrfs -r

2017-09-01 Thread Austin S. Hemmelgarn

On 2017-09-01 08:19, Qu Wenruo wrote:



On 2017年09月01日 20:05, Austin S. Hemmelgarn wrote:

On 2017-09-01 07:49, Qu Wenruo wrote:


On 2017年09月01日 19:28, Austin S. Hemmelgarn wrote:

On 2017-08-31 20:13, Qu Wenruo wrote:


On 2017年09月01日 01:27, Goffredo Baroncelli wrote:

Hi All,

I found a bug in mkfs.btrfs when the option '-r' is used. It seems
that the full disk is not visible.


Despite the new bug you found, -r has several existing bugs.
Is this actually a bug though?  Every other filesystem creation tool
that I know of that offers functionality like this generates the
filesystem just large enough to contain the data you want in it, so I
would argue that making this use the whole device is actually breaking
consistency with other tools, not to mention removing functionality that
is useful (even aside from the system image generation use case I
mentioned, there are other practical applications; seed 'device'
generation comes to mind).


Well, then documentation bug.

And I'm not sure the chunk size is correct or optimized.
Even for btrfs-convert, which will make data chunks very scattered, 
we still try to make a large chunk to cover scattered data extents.
For a one-shot or read-only filesystem though, a maximally sized chunk 
is probably suboptimal.


Not exactly.
The current kernel (and btrfs-progs, which also tries to follow the kernel
chunk allocator's behavior) will not make a chunk larger than 10% of the
RW space.

So for a small filesystem, chunks won't be maximally sized.
Are you sure about this?  I've got a couple of sub 10GB BTRFS volumes 
that definitely have more than one 1GB data chunk.


  Suppose you use this to generate a base image for a system in the 
form of a seed device.  This actually ends up being a pretty easy way 
to get factory reset functionality.  It's also a case where you want 
the base image to take up as little space as possible, so that the 
end-user usable storage space is as much as possible.  In that case, 
if your base image doesn't need an exact multiple of 1GB for data 
chunks, then using 1GB data chunks is not the best choice for at least 
the final data chunk (because the rest of that 1GB gets wasted).  A 
similar argument applies for metadata.


Yes, your example makes sense (despite the 10% limit I mentioned above).

The problem is, no one really knows how the image will be used.
Maybe it will be used as normal btrfs (with fi resize), or with your 
purpose.
We can't save users from making poor choices.  If we could, we wouldn't 
have anywhere near as many e-mails on the list from people who are 
trying to recover data from their broken filesystems because they have 
no backups.


The only case I can find where '-r' is a win is when you need the 
filesystem to be as small as possible with no free space.  The moment 
you need free space, it's actually faster to just create the filesystem, 
resize it to the desired size, and then copy in your data (I've actually 
benchmarked this, and while it's not _much_ difference in time spent, 
there is a measurable difference, with my guess being that the 
allocation code is doing more work in userspace than in the kernel).  At 
a minimum, I think it's probably worth documenting this fact.
For the normal btrfs case, although it may not cause much of a problem,
it will not be the optimized use case and may need an extra manual balance.
Actually, until the first write to the filesystem, it will still be an 
optimal layout.  Once you start writing to any BTRFS filesystem that has 
an optimal layout though, it immediately becomes non-optimal, and 
there's not really anything we can do about that unless we allow chunks 
that are already allocated to be resized on the fly (which is a bad idea 
for multiple reasons).




At least to me, that's not the case for chunks created by the -r option.

BTW, the seed device is RO anyway, so how much or how little spare space
we have is not a problem at all.
That really depends on how you look at it.  Aside from the above 
example, there's the rather specific question of why you would not 
want to avoid wasting space.  The filesystem is read-only, which means 
that any 'free space' on that filesystem is completely unusable, can't 
be reclaimed for anything else, and in general is just a waste.


Still the same problem as above.
What if the seed device is detached and then used as a normal btrfs?



So to me, even if we follow other tools for -r, we should follow the
normal extent allocator behavior to create data/metadata, and then set
the device size to the end of its dev extents.

I don't entirely agree, but I think I've made my point well enough above.


Yes, you did make your point clear, and I agree that the use cases you
mentioned exist and that wasted space also exists.


But since we don't really know how the image will be used, I prefer to
keep everything using the kernel (or btrfs-progs) chunk allocator to
make the behavior consistent.


So my point is more about consistent behavior of btrfs-progs and kernel, 
and less 

Re: [PATCH 1/2] btrfs: clear ordered flag on cleaning up ordered extents

2017-09-01 Thread Josef Bacik
On Fri, Sep 01, 2017 at 05:58:47PM +0900, Naohiro Aota wrote:
> commit 524272607e88 ("btrfs: Handle delalloc error correctly to avoid
> ordered extent hang") introduced btrfs_cleanup_ordered_extents() to clean up
> submitted ordered extents. However, it does not clear the ordered bit
> (Private2) of the corresponding pages. Thus, the following BUG occurs from
> free_pages_check_bad() (on btrfs/125 with nospace_cache).
> 
> BUG: Bad page state in process btrfs  pfn:3fa787
> page:df2acfe9e1c0 count:0 mapcount:0 mapping:  (null) index:0xd
> flags: 0x80002008(uptodate|private_2)
> raw: 80002008  000d 
> raw: df2acf5c1b20 b443802238b0  
> page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> bad because of flags: 0x2000(private_2)
> 
> This patch clears the flag, as other places calling
> btrfs_dec_test_ordered_pending() do, for every page in the specified range.
> 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: finish ordered extent cleaning if no progress is found

2017-09-01 Thread Josef Bacik
On Fri, Sep 01, 2017 at 05:59:07PM +0900, Naohiro Aota wrote:
> __endio_write_update_ordered() repeats the search until it reaches the end
> of the specified range. This works well with the direct IO path, because
> before the function is called, it's ensured that there are ordered extents
> filling the whole range. That's not the case, however, when it's called from
> run_delalloc_range(): it is possible to hit an error in the middle of the
> loop in e.g. run_delalloc_nocow(), so that there exists a range not covered
> by any ordered extents. After cleaning up such an "incomplete" range,
> __endio_write_update_ordered() gets stuck at an offset where there are no
> ordered extents.
> 
> Since the ordered extents are created from head to tail, we can stop the
> search if there is no offset progress.
> 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [btrfs-progs] Bug in mkfs.btrfs -r

2017-09-01 Thread Qu Wenruo



On 2017年09月01日 20:05, Austin S. Hemmelgarn wrote:

On 2017-09-01 07:49, Qu Wenruo wrote:


On 2017年09月01日 19:28, Austin S. Hemmelgarn wrote:

On 2017-08-31 20:13, Qu Wenruo wrote:


On 2017年09月01日 01:27, Goffredo Baroncelli wrote:

Hi All,

I found a bug in mkfs.btrfs when the option '-r' is used. It 
seems that the full disk is not visible.


Despite the new bug you found, -r has several existing bugs.
Is this actually a bug though?  Every other filesystem creation tool 
that I know of that offers functionality like this generates the 
filesystem just large enough to contain the data you want in it, so I 
would argue that making this use the whole device would actually 
break consistency with other tools, not to mention remove useful 
functionality (even aside from the system image generation use case I 
mentioned, there are other practical applications; seed 'device' 
generation comes to mind).


Well, then it's a documentation bug.

And I'm not sure the chunk size is correct or optimal.
Even for btrfs-convert, which makes data chunks very scattered, we 
still try to make a large chunk to cover the scattered data extents.
For a one-shot or read-only filesystem though, a maximally sized chunk 
is probably suboptimal.


Not exactly.
The current kernel (and btrfs-progs, which tries to follow the kernel 
chunk allocator's behavior) will not make a chunk larger than 10% of 
the RW space.

So for a small filesystem, chunks won't be maximally sized.

Suppose you use this to generate a base image 
for a system in the form of a seed device.  This actually ends up being 
a pretty easy way to get factory reset functionality.  It's also a case 
where you want the base image to take up as little space as possible, so 
that as much storage as possible is left usable for the end user.  In 
that case, if your base image doesn't need an exact multiple of 1GB for 
data chunks, then using 1GB data chunks is not the best choice for at 
least the final data chunk (because the rest of that 1GB gets wasted).  
A similar argument applies for metadata.
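
As a concrete (and hedged) illustration of that seed-device pattern, 
with hypothetical device names and paths:

mkfs.btrfs -r /path/to/base-image /dev/sdX
btrfstune -S 1 /dev/sdX         # flag the device as a seed
mount /dev/sdX /mnt
btrfs device add /dev/sdY /mnt  # add a writable device on top
mount -o remount,rw /mnt
# a "factory reset" is then just wiping /dev/sdY and re-adding it
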


Yes, your example makes sense (despite the 10% limit I mentioned above).

The problem is, no one really knows how the image will be used.
Maybe it will be used as a normal btrfs (with fi resize), or for a 
purpose like yours.


For the normal btrfs case it may not cause much of a problem, but it 
will not be the optimal layout and may need an extra manual balance.
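
That manual step would be something like this (hypothetical mount 
point; grow the fs, then compact the chunks laid down by mkfs -r):

btrfs filesystem resize max /mnt
btrfs balance start -dusage=50 -musage=50 /mnt
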




At least to me, that's not the case for chunks created by the -r option.

BTW, the seed device is RO anyway, so how much or how little spare 
space we have is not a problem at all.
That really depends on how you look at it.  Aside from the above 
example, there's the rather specific question of why you would not want 
to avoid wasting space.  The filesystem is read-only, which means that 
any 'free space' on that filesystem is completely unusable, can't be 
reclaimed for anything else, and in general is just a waste.


Still the same problem as above.
What if the seed device is detached and then used as a normal btrfs?



So to me, even if -r follows other tools, we should follow the normal 
extent allocator behavior to create data/metadata, and then set the 
device size to the end of its dev extents.

I don't entirely agree, but I think I've made my point well enough above.


Yes, you did make your point clear, and I agree that the use cases you 
mentioned exist and that the wasted space is real.


But since we don't really know how the image will be used, I prefer to 
keep everything using the kernel (or btrfs-progs) chunk allocator to 
keep the behavior consistent.


So my point is more about consistent behavior between btrfs-progs and 
the kernel, and less maintenance.
(That is to say, my goal for mkfs.btrfs -r is just to do mkfs, mount, 
and cp without privilege.)


Thanks,
Qu



For example, it will create a dev extent starting at physical offset 
0, while the kernel and mkfs will avoid that range, as 0~1M on each 
device is reserved.


According to the code, -r will modify the chunk layout by itself, not 
in the traditional way the kernel does it.


I'll fix them (if I'm not a lazybone); before that fix, please don't 
use the -r option, as it's not well maintained or fully tested.
FWIW, based on my own testing, filesystems generated with '-r' work 
just fine as long as you don't try to embed boot code in the FS itself.


It works fine because the btrfs extent allocator will try to avoid 
the superblocks.

But that doesn't mean we should put dev extents into the 0~1M range.

In fact, there is a deprecated mount option, alloc_start, to set how 
many bytes we should reserve on *each* device.
And since it's deprecated, we'd better stick to the 1M reservation for 
each device.


Anyway, kernel balance and plain mkfs won't create a chunk stripe in 
the 0~1M range of each device; mkfs -r should follow suit.
Agreed, although the comments I made above about wasted space do still 
apply here (albeit to a lesser degree; 1MB is not going to make much 
of a difference for most people).
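
A hedged way to check a given image for this, assuming a reasonably 
current btrfs-progs (tree 4 is the device tree; any DEV_EXTENT key 
offset below 1048576 would show the bug):

btrfs inspect-internal dump-tree -t 4 /dev/sdX | grep DEV_EXTENT
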




Re: [btrfs-progs] Bug in mkfs.btrfs -r

2017-09-01 Thread Austin S. Hemmelgarn

On 2017-09-01 07:49, Qu Wenruo wrote:


On 2017年09月01日 19:28, Austin S. Hemmelgarn wrote:

On 2017-08-31 20:13, Qu Wenruo wrote:


On 2017年09月01日 01:27, Goffredo Baroncelli wrote:

Hi All,

I found a bug in mkfs.btrfs when the option '-r' is used. It 
seems that the full disk is not visible.


Despite the new bug you found, -r has several existing bugs.
Is this actually a bug though?  Every other filesystem creation tool 
that I know of that offers functionality like this generates the 
filesystem just large enough to contain the data you want in it, so I 
would argue that making this use the whole device would actually break 
consistency with other tools, not to mention remove useful 
functionality (even aside from the system image generation use case I 
mentioned, there are other practical applications; seed 'device' 
generation comes to mind).


Well, then it's a documentation bug.

And I'm not sure the chunk size is correct or optimal.
Even for btrfs-convert, which makes data chunks very scattered, we 
still try to make a large chunk to cover the scattered data extents.
For a one-shot or read-only filesystem though, a maximally sized chunk 
is probably suboptimal.  Suppose you use this to generate a base image 
for a system in the form of a seed device.  This actually ends up being 
a pretty easy way to get factory reset functionality.  It's also a case 
where you want the base image to take up as little space as possible, so 
that as much storage as possible is left usable for the end user.  In 
that case, if your base image doesn't need an exact multiple of 1GB for 
data chunks, then using 1GB data chunks is not the best choice for at 
least the final data chunk (because the rest of that 1GB gets wasted).  
A similar argument applies for metadata.


At least to me, that's not the case for chunks created by the -r option.

BTW, the seed device is RO anyway, so how much or how little spare 
space we have is not a problem at all.
That really depends on how you look at it.  Aside from the above 
example, there's the rather specific question of why you would not want 
to avoid wasting space.  The filesystem is read-only, which means that 
any 'free space' on that filesystem is completely unusable, can't be 
reclaimed for anything else, and in general is just a waste.


So to me, even if -r follows other tools, we should follow the normal 
extent allocator behavior to create data/metadata, and then set the 
device size to the end of its dev extents.

I don't entirely agree, but I think I've made my point well enough above.


For example, it will create a dev extent starting at physical offset 
0, while the kernel and mkfs will avoid that range, as 0~1M on each 
device is reserved.


According to the code, -r will modify the chunk layout by itself, not 
in the traditional way the kernel does it.


I'll fix them (if I'm not a lazybone); before that fix, please don't 
use the -r option, as it's not well maintained or fully tested.
FWIW, based on my own testing, filesystems generated with '-r' work 
just fine as long as you don't try to embed boot code in the FS itself.


It works fine because the btrfs extent allocator will try to avoid the 
superblocks.
But that doesn't mean we should put dev extents into the 0~1M range.

In fact, there is a deprecated mount option, alloc_start, to set how 
many bytes we should reserve on *each* device.
And since it's deprecated, we'd better stick to the 1M reservation for 
each device.


Anyway, kernel balance and plain mkfs won't create a chunk stripe in the 
0~1M range of each device; mkfs -r should follow suit.
Agreed, although the comments I made above about wasted space do still 
apply here (albeit to a lesser degree; 1MB is not going to make much of 
a difference for most people).


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [btrfs-progs] Bug in mkfs.btrfs -r

2017-09-01 Thread Qu Wenruo



On 2017年09月01日 19:28, Austin S. Hemmelgarn wrote:

On 2017-08-31 20:13, Qu Wenruo wrote:



On 2017年09月01日 01:27, Goffredo Baroncelli wrote:

Hi All,


I found a bug in mkfs.btrfs when the option '-r' is used. It 
seems that the full disk is not visible.


Despite the new bug you found, -r has several existing bugs.
Is this actually a bug though?  Every other filesystem creation tool 
that I know of that offers functionality like this generates the 
filesystem just large enough to contain the data you want in it,


At least when I tried mkfs.ext4 with an almost empty directory (only one 
512K file):

After mounting the fs, there was still over 900M of available space.

Even though mkfs.ext4 doesn't explain much about its -d option, I don't 
think it behaves that way, at least for the -d option alone.


Thanks,
Qu

so I 
would argue that making this use the whole device would actually break 
consistency with other tools, not to mention remove useful 
functionality (even aside from the system image generation use case I 
mentioned, there are other practical applications; seed 'device' 
generation comes to mind).


For example, it will create a dev extent starting at physical offset 0, 
while the kernel and mkfs will avoid that range, as 0~1M on each device 
is reserved.


According to the code, -r will modify the chunk layout by itself, not 
in the traditional way the kernel does it.


I'll fix them (if I'm not a lazybone); before that fix, please don't 
use the -r option, as it's not well maintained or fully tested.
FWIW, based on my own testing, filesystems generated with '-r' work just 
fine as long as you don't try to embed boot code in the FS itself.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [btrfs-progs] Bug in mkfs.btrfs -r

2017-09-01 Thread Qu Wenruo



On 2017年09月01日 19:28, Austin S. Hemmelgarn wrote:

On 2017-08-31 20:13, Qu Wenruo wrote:



On 2017年09月01日 01:27, Goffredo Baroncelli wrote:

Hi All,


I found a bug in mkfs.btrfs when the option '-r' is used. It 
seems that the full disk is not visible.


Despite the new bug you found, -r has several existing bugs.
Is this actually a bug though?  Every other filesystem creation tool 
that I know of that offers functionality like this generates the 
filesystem just large enough to contain the data you want in it, so I 
would argue that making this use the whole device would actually break 
consistency with other tools, not to mention remove useful 
functionality (even aside from the system image generation use case I 
mentioned, there are other practical applications; seed 'device' 
generation comes to mind).


Well, then it's a documentation bug.

And I'm not sure the chunk size is correct or optimal.
Even for btrfs-convert, which makes data chunks very scattered, we 
still try to make a large chunk to cover the scattered data extents.


At least to me, that's not the case for chunks created by the -r option.

BTW, the seed device is RO anyway, so how much or how little spare 
space we have is not a problem at all.


So to me, even if -r follows other tools, we should follow the normal 
extent allocator behavior to create data/metadata, and then set the 
device size to the end of its dev extents.




For example, it will create a dev extent starting at physical offset 0, 
while the kernel and mkfs will avoid that range, as 0~1M on each device 
is reserved.


According to the code, -r will modify the chunk layout by itself, not 
in the traditional way the kernel does it.


I'll fix them (if I'm not a lazybone); before that fix, please don't 
use the -r option, as it's not well maintained or fully tested.
FWIW, based on my own testing, filesystems generated with '-r' work just 
fine as long as you don't try to embed boot code in the FS itself.


It works fine because the btrfs extent allocator will try to avoid the 
superblocks.
But that doesn't mean we should put dev extents into the 0~1M range.

In fact, there is a deprecated mount option, alloc_start, to set how 
many bytes we should reserve for *each* device.
And since it's deprecated, we'd better follow the 1M reservation for 
each device.


Anyway, kernel balance and plain mkfs won't create a chunk stripe in the 
0~1M range of each device; mkfs -r should follow suit.


Thanks,
Qu



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [btrfs-progs] Bug in mkfs.btrfs -r

2017-09-01 Thread Austin S. Hemmelgarn

On 2017-08-31 16:29, Goffredo Baroncelli wrote:

On 2017-08-31 20:49, Austin S. Hemmelgarn wrote:

On 2017-08-31 13:27, Goffredo Baroncelli wrote:

Hi All,


I found a bug in mkfs.btrfs when the option '-r' is used. It
seems that the full disk is not visible.

$ uname -a
Linux venice.bhome 4.12.8 #268 SMP Thu Aug 17 09:03:26 CEST 2017 x86_64 GNU/Linux
$ btrfs --version
btrfs-progs v4.12

As far as I understand it, this is intended behavior.  Tools that
offer equivalent options (genext2fs for example) are designed to
generate pre-packaged system images that can then be resized to fit
the target device's space.  As an example use case, I do full system
image updates on some of my systems (that is, I keep per-system
configuration to a minimum and replace the entire root filesystem
when I update the system).  I use this option to generate a base image,
which then gets automatically resized by my update scripts to fill
the partition during the update process.
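
Roughly, with hypothetical device and image names, that flow is:

dd if=base-image.img of=/dev/sdX2 bs=1M
mount /dev/sdX2 /mnt
btrfs filesystem resize max /mnt   # grow the fs to fill the partition
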


Sorry, but I am a bit confused. If I run "mkfs.btrfs -r <dir>" on a partition,
how can I detect the end of the filesystem in order to cut off the unused space?

From your explanation I should do

# mkfs.btrfs -r <dir> /dev/sdX

then

# dd if=/dev/sdX of=/tmp/image bs=1M count=<count>

What do I have to put in <count>?
Mount the filesystem, and see what size `btrfs filesystem show` reports 
for the device, then go just over that (to account for rounding).
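
Putting that together, a sketch with placeholder names:

mkfs.btrfs -r /path/to/rootdir /dev/sdX
mount /dev/sdX /mnt
btrfs filesystem show /mnt    # note the size reported for the device
umount /mnt
# round that size up to the next MiB and use it as the dd count:
dd if=/dev/sdX of=/tmp/image bs=1M count=<size-in-MiB>
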


genext2fs in effect works by generating a file. Instead, mkfs.btrfs seems to
work only with disks (or an already-created file)...
Most mkfs tools require the file to already exist because they were 
designed to create fixed-size filesystem images.




Overall, this could probably stand to be documented better though
(I'll look at writing a patch to update the documentation to clarify
this when I have some spare time over the weekend).


This would be great. However, I think some code should be updated to 
generate a file instead of relying on a block device.

BR
G.Baroncelli



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: number of subvolumes

2017-09-01 Thread Austin S. Hemmelgarn

On 2017-09-01 06:21, ein wrote:

Very comprehensive, thank you. I was asking because I'd like to learn
how truly random writes from VMs affect BTRFS (vs XFS, Ext4) performance,
and to try to develop some workaround to reduce/prevent the impact while
keeping csums, cow (snapshots) and compression.


I've actually done some significant testing of VMs using BTRFS as 
backing storage, and would suggest the following:


1. Use raw images in sparse files, and expose them to the VM as SSDs 
(so that DISCARD support works).  Assuming your guests actually support 
DISCARD commands, this will save you just as much space as using 
allocate-on-demand formats like qcow2 or VDI, while producing access 
patterns that BTRFS is a bit better at dealing with.  There are a couple 
of specific caveats to doing this, however: you can't use the regular 
VirtIO block device when using QEMU and have to use VirtIO SCSI instead, 
and you'll need smarter tools to copy images around (ddrescue is what I 
would personally suggest; it handles sparse files correctly and gives 
you a nice progress indicator).  A sketch of such a setup follows after 
this list.


2. Provide separate disks to the VM for any data you need to store in 
them.  By keeping this separate from the root filesystem's disk, you 
reduce what gets fragmented, and also aren't tied to a particular guest 
OS to the same degree.  As a more concrete example, if you're running a 
database server in a VM, give it a separate disk image for storing the 
database from the one the root filesystem is on.


3. Run with autodefrag on the host system.  Unless you're hitting the 
disks hard from inside the VM, this will actually do a reasonably good 
job of handling fragmentation.


4. Make sure the filesystem you're storing the disk images on has extra 
space.  Ideally, I would set things up so that it has enough room to 
store a second copy of the largest disk image you expect to use.  By 
keeping lots of extra space, you give the allocator in BTRFS more 
opportunity to arrange things intelligently.


5. Defragment your disk images regularly (even when using autodefrag), 
both inside and outside the VMs.  Ideally, run a full defrag inside the 
VM, then shut it down and defragment the disk image outside the VM. 
Defragmenting inside the VM first will compact free space, which in turn 
means that defragmenting outside the VM will do a better job, and that 
future fragmentation will be less likely.  In general, I'd suggest doing 
this at least monthly (possibly a lot more frequently if you're running 
databases or similar random-access workloads in the VM).


6. Avoid using a CoW filesystem inside the VMs.  This sounds odd at 
first, but is actually one of the biggest things you can do to reduce 
fragmentation.  In particular, try to avoid using BTRFS or ZFS on a VM 
disk that is stored on BTRFS; both of them will make the usual 
fragmentation problems look nonexistent in comparison.
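
As promised above, a hedged sketch of points 1, 3 and 5 (paths, sizes 
and the QEMU invocation are illustrative, not prescriptive):

# sparse raw image, exposed via VirtIO SCSI with DISCARD enabled
truncate -s 40G /var/lib/vms/guest.img
qemu-system-x86_64 -enable-kvm -m 2G \
    -device virtio-scsi-pci,id=scsi0 \
    -drive file=/var/lib/vms/guest.img,if=none,id=hd0,format=raw,discard=unmap \
    -device scsi-hd,drive=hd0,bus=scsi0.0

# host filesystem holding the images mounted with autodefrag
mount -o autodefrag /dev/sdX /var/lib/vms

# periodic maintenance, after defragmenting inside the guest and
# shutting it down
btrfs filesystem defragment /var/lib/vms/guest.img
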

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: finish ordered extent cleaning if no progress is found

2017-09-01 Thread Qu Wenruo



On 2017年09月01日 16:59, Naohiro Aota wrote:

__endio_write_update_ordered() repeats the search until it reaches the end
of the specified range. This works well with the direct IO path, because
before the function is called, it's ensured that there are ordered extents
filling the whole range. That's not the case, however, when it's called from
run_delalloc_range(): it is possible to hit an error in the middle of the
loop in e.g. run_delalloc_nocow(), so that there exists a range not covered
by any ordered extents. After cleaning up such an "incomplete" range,
__endio_write_update_ordered() gets stuck at an offset where there are no
ordered extents.

Since the ordered extents are created from head to tail, we can stop the
search if there is no offset progress.

Signed-off-by: Naohiro Aota 
Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent 
hang")
Cc:  # 4.12
---
  fs/btrfs/inode.c |4 
  1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ae4c0a1bef38..fd5934121b4b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8309,6 +8309,7 @@ static void __endio_write_update_ordered(struct inode 
*inode,
btrfs_work_func_t func;
u64 ordered_offset = offset;
u64 ordered_bytes = bytes;
+   u64 last_offset;
int ret;
  
  	if (btrfs_is_free_space_inode(BTRFS_I(inode))) {

@@ -8320,6 +8321,7 @@ static void __endio_write_update_ordered(struct inode 
*inode,
}
  
  again:

+   last_offset = ordered_offset;
ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
   &ordered_offset,
   ordered_bytes,
@@ -8330,6 +8332,8 @@ static void __endio_write_update_ordered(struct inode 
*inode,
btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL);
btrfs_queue_work(wq, &ordered->work);
  out_test:
+   if (ordered_offset == last_offset)
+   return;


Such a check seems strange to me.

For the ordered_offset == last_offset case, it means 
btrfs_dec_test_first_ordered_pending() didn't find any ordered extent in 
that range, so we can exit.

It took me a while to dig into btrfs_dec_test_first_ordered_pending().

An extra comment would help here, or 
btrfs_dec_test_first_ordered_pending() could be modified to explicitly 
inform the caller that there is no pending ordered extent left.


Thanks,
Qu


/*
 * our bio might span multiple ordered extents.  If we haven't
 * completed the accounting for the whole dio, go back and try again

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: [btrfs-progs] Bug in mkfs.btrfs -r

2017-09-01 Thread Austin S. Hemmelgarn

On 2017-08-31 20:13, Qu Wenruo wrote:



On 2017年09月01日 01:27, Goffredo Baroncelli wrote:

Hi All,


I found a bug in mkfs.btrfs when the option '-r' is used. It seems 
that the full disk is not visible.


Despite the new bug you found, -r has several existing bugs.
Is this actually a bug though?  Every other filesystem creation tool 
that I know of that offers functionality like this generates the 
filesystem just large enough to contain the data you want in it, so I 
would argue that making this use the whole device would actually break 
consistency with other tools, not to mention remove useful 
functionality (even aside from the system image generation use case I 
mentioned, there are other practical applications; seed 'device' 
generation comes to mind).


For example, it will create a dev extent starting at physical offset 0, 
while the kernel and mkfs will avoid that range, as 0~1M on each device 
is reserved.


According to the code, -r will modify the chunk layout by itself, not 
in the traditional way the kernel does it.


I'll fix them (if I'm not a lazybone); before that fix, please don't use 
the -r option, as it's not well maintained or fully tested.
FWIW, based on my own testing, filesystems generated with '-r' work just 
fine as long as you don't try to embed boot code in the FS itself.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: clear ordered flag on cleaning up ordered extents

2017-09-01 Thread Qu Wenruo



On 2017年09月01日 16:58, Naohiro Aota wrote:

commit 524272607e88 ("btrfs: Handle delalloc error correctly to avoid
ordered extent hang") introduced btrfs_cleanup_ordered_extents() to clean up
submitted ordered extents. However, it does not clear the ordered bit
(Private2) of the corresponding pages. Thus, the following BUG occurs from
free_pages_check_bad() (on btrfs/125 with nospace_cache).

BUG: Bad page state in process btrfs  pfn:3fa787
page:df2acfe9e1c0 count:0 mapcount:0 mapping:  (null) index:0xd
flags: 0x80002008(uptodate|private_2)
raw: 80002008  000d 
raw: df2acf5c1b20 b443802238b0  
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x2000(private_2)

This patch clears the flag, as other places calling
btrfs_dec_test_ordered_pending() do, for every page in the specified range.

Signed-off-by: Naohiro Aota 
Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent 
hang")
Cc:  # 4.12
---
  fs/btrfs/inode.c |   12 
  1 file changed, 12 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 24bcd5cd9cf2..ae4c0a1bef38 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -135,6 +135,18 @@ static inline void btrfs_cleanup_ordered_extents(struct 
inode *inode,
 const u64 offset,
 const u64 bytes)
  {
+   unsigned long index = offset >> PAGE_SHIFT;
+   unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
+   struct page *page;
+
+   while (index <= end_index) {
+   page = find_get_page(inode->i_mapping, index);
+   index++;
+   if (!page)
+   continue;
+   ClearPagePrivate2(page);
+   put_page(page);
+   }


At first glance, explicitly clearing the Private2 flag here seems a 
little strange to me.
However, btrfs_invalidatepage() does the same thing, so I think it's 
fine.


Reviewed-by: Qu Wenruo 

BTW, the Private2 flag is set by extent_clear_unlock_delalloc() with 
page_ops |= PAGE_SET_PRIVATE2, but here we're clearing the page flag 
without any encapsulation; it may be better to use a similar helper to 
clear the Private2 flag.


Thanks,
Qu


return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
bytes - PAGE_SIZE, false);
  }

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: number of subvolumes

2017-09-01 Thread ein
On 08/31/2017 06:18 PM, Duncan wrote:
[...]
> Michał Sokołowski posted on Thu, 31 Aug 2017 16:38:14 +0200 as excerpted:
>> Is there another tool to verify fragments number of given file when
>> using compression?
> AFAIK there isn't an official one, tho someone posted a script (python, 
> IIRC) at one point and may repost it here.
>
> You can actually get the information needed from filefrag -v (and the 
> script does), but it takes a bit more effort than usual, scripted or 
> brain-power, to convert the results into real fragmentation numbers.
>
> The problem is that btrfs compression works in 128 KiB blocks, and 
> filefrag sees each of those as a fragment.  The extra effort involves 
> checking the addresses of the reported 128 KiB blocks to see if they are 
> actually contiguous, that is, one starts just after the previous one 
> ends.  If so it's actually not fragmented at that point.  But if the 
> addresses aren't contiguous, there's fragmentation at that point.
>
> I don't personally worry too much about it here, for two reasons.  First, 
> I /always/ run with the autodefrag mount option, which keeps 
> fragmentation manageable in any case[1], and second, I'm on ssd, where 
> the effects of fragmentation aren't as pronounced.  (On spinning rust 
> it's generally the seek times that dominate.  On ssds that's 0, but 
> there's still an IOPS cost.)
>
> So while I've run filefrag -v and looked at the results a few times out 
> of curiosity, and indeed could see reported fragmentation that was 
> actually contiguous, it was simply a curiosity to me, thus my not 
> grabbing that script and putting it to immediate use.
>
> ---
> [1] AFAIK autodefrag checks fragmentation on writes, and rewrites 16 MiB 
> blocks if necessary.  If like me you always run it from the moment you 
> start putting data on the filesystem, that should work pretty well.  If 
> however you haven't been running it or doing manual defrag, then because 
> defrag only works on writes and the free space may be fragmented enough 
> that there are no 16 MiB blocks to write into, it may take a while to 
> "catch up", and of course it won't defrag anything that's never written 
> to again but is often reread, making its existing fragmentation an issue.

Very comprehensive, thank you. I was asking because I'd like to learn
how truly random writes from VMs affect BTRFS (vs XFS, Ext4) performance,
and to try to develop some workaround to reduce/prevent the impact while
keeping csums, cow (snapshots) and compression.
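
For what it's worth, here is a rough shell sketch of the contiguity 
check Duncan describes above (the parsing of filefrag -v output is 
best-effort and may need tweaking for your e2fsprogs version):

f=/path/to/file
filefrag -v "$f" | awk -F'[[:space:].:]+' '
    $2 ~ /^[0-9]+$/ {                    # extent table rows only
        if (seen && $5 != prev_end + 1)  # physical start vs previous end
            gaps++
        prev_end = $6; seen = 1
    }
    END { print gaps + 1, "real fragment(s)" }'
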
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: joining to contribute

2017-09-01 Thread Hugo Mills
On Fri, Sep 01, 2017 at 01:15:45PM +0800, Qu Wenruo wrote:
> On 2017年09月01日 11:36, Anthony Riley wrote:
> >Hey folks,
> >
> >I thought I would finally take a swing at this; I've wanted to be a
> >kernel/fs dev for a few years now. My current $job is as an
> >Infrastructure Engineer. I'm currently teaching myself C and have a
> >background in shell scripting & python. I love doing deep dives and
> >learning about linux internals.  I've read the btrfs wiki and can't
> >really decide which project to start with.
> >
> >Also, should I go through this https://kernelnewbies.org/FirstKernelPatch
> >first?
> >Or should I start with something in userspace?
> 
> Well, personally I strongly recommend starting with the btrfs on-disk
> format first, then btrfs-progs/test cases, with kernel
> contribution as the final objective.

   Pick a project which bothers you -- is there some feature that you
want to have, or a particular bug, or a sharp corner that needs
rounding off?

   This next bit is purely my opinion. Feel free to ignore it if it
doesn't float your boat.

   For a relatively easy example (from a code point of view), there's
a bug with send-receive where stuff breaks if you try snapshotting and
then sending a subvolume which already has a received-uuid set.
There's probably several ways of dealing with this.

   Alongside this, there's also a requirement for being able to do
round-trip send/receive while preserving the ability to do incremental
sends. This is likely to be related to the above bug-fix. I did a
complete write-up of what's happening, and what needs to happen, here:

http://www.spinics.net/lists/linux-btrfs/msg44089.html

   If you can fix the first bug in a way that doesn't rule out the
round-trip work, that's great. If you can also get the round-trip
stuff working, that's even better (but be warned that it will get
postponed until kdave is ready for all the stream format changes to
happen at once).

> BTW, if you want to start with the btrfs on-disk format, print-tree.c
> from btrfs-progs is a good starting point, and the btrfs wiki has a
> relatively well documented entry for it.

   Seconded on this. For the on-disk format, it's also useful to
create a small test filesystem and use btrfs-debug-tree on it as you
do stuff to it (create files, create subvols, make links, modify
files). Do this in conjunction with the "Data Structures" page, and
you can see how it all actually fits together.
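
   Something like this, on a throwaway image (paths are illustrative):

truncate -s 512M /tmp/btrfs.img
mkfs.btrfs /tmp/btrfs.img
mount -o loop /tmp/btrfs.img /mnt
touch /mnt/hello; btrfs subvolume create /mnt/sv; sync
umount /mnt
btrfs-debug-tree /tmp/btrfs.img | less
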

   It took me a couple of weeks to learn how it all worked the first
time round, but I didn't have much detailed documentation to work from
back then.

   Since you speak python, there's also Hans's python-btrfs:

https://github.com/knorrie/python-btrfs

> https://btrfs.wiki.kernel.org/index.php/Btrfs_design
> https://btrfs.wiki.kernel.org/index.php/Btree_Items

And, more generally: 
https://btrfs.wiki.kernel.org/index.php/Main_Page#Developer_documentation

I'd also point out the Data Structures and Trees pages linked
there. Some of the information is a bit out of date, or represents a
prototype of what it's describing. The source code is canonical -- use
the documentation as a guide to help you see where in the source to
look.

   Hugo.

-- 
Hugo Mills | I'm always right. But I might be wrong about that.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




[PATCH 2/2] btrfs: finish ordered extent cleaning if no progress is found

2017-09-01 Thread Naohiro Aota
__endio_write_update_ordered() repeats the search until it reaches the end
of the specified range. This works well with the direct IO path, because
before the function is called, it's ensured that there are ordered extents
filling the whole range. That's not the case, however, when it's called from
run_delalloc_range(): it is possible to hit an error in the middle of the
loop in e.g. run_delalloc_nocow(), so that there exists a range not covered
by any ordered extents. After cleaning up such an "incomplete" range,
__endio_write_update_ordered() gets stuck at an offset where there are no
ordered extents.

Since the ordered extents are created from head to tail, we can stop the
search if there is no offset progress.

Signed-off-by: Naohiro Aota 
Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered 
extent hang")
Cc:  # 4.12
---
 fs/btrfs/inode.c |4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ae4c0a1bef38..fd5934121b4b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8309,6 +8309,7 @@ static void __endio_write_update_ordered(struct inode 
*inode,
btrfs_work_func_t func;
u64 ordered_offset = offset;
u64 ordered_bytes = bytes;
+   u64 last_offset;
int ret;
 
if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
@@ -8320,6 +8321,7 @@ static void __endio_write_update_ordered(struct inode 
*inode,
}
 
 again:
+   last_offset = ordered_offset;
ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
   &ordered_offset,
   ordered_bytes,
@@ -8330,6 +8332,8 @@ static void __endio_write_update_ordered(struct inode 
*inode,
btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL);
btrfs_queue_work(wq, &ordered->work);
 out_test:
+   if (ordered_offset == last_offset)
+   return;
/*
 * our bio might span multiple ordered extents.  If we haven't
 * completed the accounting for the whole dio, go back and try again

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs: clear ordered flag on cleaning up ordered extents

2017-09-01 Thread Naohiro Aota
commit 524272607e88 ("btrfs: Handle delalloc error correctly to avoid
ordered extent hang") introduced btrfs_cleanup_ordered_extents() to clean up
submitted ordered extents. However, it does not clear the ordered bit
(Private2) of the corresponding pages. Thus, the following BUG occurs from
free_pages_check_bad() (on btrfs/125 with nospace_cache).

BUG: Bad page state in process btrfs  pfn:3fa787
page:df2acfe9e1c0 count:0 mapcount:0 mapping:  (null) index:0xd
flags: 0x80002008(uptodate|private_2)
raw: 80002008  000d 
raw: df2acf5c1b20 b443802238b0  
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x2000(private_2)

This patch clears the flag, as other places calling
btrfs_dec_test_ordered_pending() do, for every page in the specified range.

Signed-off-by: Naohiro Aota 
Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered 
extent hang")
Cc:  # 4.12
---
 fs/btrfs/inode.c |   12 
 1 file changed, 12 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 24bcd5cd9cf2..ae4c0a1bef38 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -135,6 +135,18 @@ static inline void btrfs_cleanup_ordered_extents(struct 
inode *inode,
 const u64 offset,
 const u64 bytes)
 {
+   unsigned long index = offset >> PAGE_SHIFT;
+   unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
+   struct page *page;
+
+   while (index <= end_index) {
+   page = find_get_page(inode->i_mapping, index);
+   index++;
+   if (!page)
+   continue;
+   ClearPagePrivate2(page);
+   put_page(page);
+   }
return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
bytes - PAGE_SIZE, false);
 }

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] Add new common filter function

2017-09-01 Thread Eryu Guan
On Fri, Sep 01, 2017 at 10:14:41AM +0300, Amir Goldstein wrote:
> On Fri, Sep 1, 2017 at 10:04 AM, Eryu Guan  wrote:
> > On Fri, Sep 01, 2017 at 02:39:44PM +0900, Misono, Tomohiro wrote:
> >> Several tests use both _filter_test_dir and _filter_scratch
> >> concatenated by pipe to filter $TEST_DIR and $SCRATCH_MNT. However, this
> >> would fail if the shorter string is a substring of the other (like
> >> "/mnt" and "/mnt2").
> >>
> >> This patch introduces new common filter function to safely call both
> >> _filter_test_dir and _filter_scratch.
> >>
> >> I checked this with btrfs/029, generic/409,410,411, and generic/381,383,
> >> xfs/106,108 (which calls _filter_quota). Thanks Eryu for advice.
> >>
> >> Signed-off-by: Tomohiro Misono 
> >
> > Thanks! Though I don't think we need two separate patches, so I merged
> > the two patches together at commit time.
> >
> 
> FYI, there is still a way for a creative user to mess this up:
> 
> TEST_DEV=/test
> TEST_DIR=/test/ovl-mnt
> SCRATCH_DEV=/ovl
> SCRATCH_MNT=/ovl/ovl-mnt
> 
> $SCRATCH_DEV is a substring of $TEST_DIR
> and _filter_scratch is run first.
> 
> It's not that crazy to get to this config with the 'new'
> -overlay command, e.g.:
> TEST_DEV=/dev/sda
> TEST_DIR=/test
> SCRATCH_DEV=/dev/sdb
> SCRATCH_MNT=/ovl
> 
> Will be auto converted to the values above.

Yeah, overlay makes it more complicated.

> 
> This patch didn't break this use case.

Good to know! I just tested this config with a quick overlay tests run,
it did pass.

Seems like we need some tests for fstests itself, like
./check selftest ? :)

Thanks,
Eryu

> 
> >> ---
> >>  common/filter | 11 +++
> >>  1 file changed, 11 insertions(+)
> >>
> >> diff --git a/common/filter b/common/filter
> >> index 1ef342b..75570f9 100644
> >> --- a/common/filter
> >> +++ b/common/filter
> >> @@ -295,6 +295,17 @@ _filter_scratch()
> >>   -e "/.use_space/d"
> >>  }
> >>
> >> +_filter_testdir_and_scratch()
> >> +{
> >> + # filter both $TEST_DIR and $SCRATCH_MNT, but always filter the 
> >> longer
> >> + # string first if the other string is a substring of the first one
> >> + if echo "$TEST_DIR" | grep -q "$SCRATCH_MNT"; then
> >> +   _filter_test_dir | _filter_scratch
> >> + else
> >> +   _filter_scratch | _filter_test_dir
> >
> > And fixed the indentation here, used a tab :)
> >
> > Thanks,
> > Eryu
> >
> >> + fi
> >> +}
> >> +
> >>  # Turn any device in the scratch pool into SCRATCH_DEV
> >>  _filter_scratch_pool()
> >>  {
> >> --
> >> 2.9.5
> >>
> > --
> > To unsubscribe from this list: send the line "unsubscribe fstests" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] Add new common filter function

2017-09-01 Thread Amir Goldstein
On Fri, Sep 1, 2017 at 10:04 AM, Eryu Guan  wrote:
> On Fri, Sep 01, 2017 at 02:39:44PM +0900, Misono, Tomohiro wrote:
>> Several tests use both _filter_test_dir and _filter_scratch
>> concatenated by pipe to filter $TEST_DIR and $SCRATCH_MNT. However, this
>> would fail if the shorter string is a substring of the other (like
>> "/mnt" and "/mnt2").
>>
>> This patch introduces new common filter function to safely call both
>> _filter_test_dir and _filter_scratch.
>>
>> I checked this with btrfs/029, generic/409,410,411, and generic/381,383,
>> xfs/106,108 (which calls _filter_quota). Thanks Eryu for advice.
>>
>> Signed-off-by: Tomohiro Misono 
>
> Thanks! Though I don't think we need two separate patches, so I merged
> the two patches together at commit time.
>

FYI, there is still a way for a creative user to mess this up:

TEST_DEV=/test
TEST_DIR=/test/ovl-mnt
SCRATCH_DEV=/ovl
SCRATCH_MNT=/ovl/ovl-mnt

$SCRATCH_DEV is a substring of $TEST_DIR
and _filter_scratch is run first.

It's not that crazy to get to this config with the 'new'
-overlay command, e.g.:
TEST_DEV=/dev/sda
TEST_DIR=/test
SCRATCH_DEV=/dev/sdb
SCRATCH_MNT=/ovl

Will be auto converted to the values above.

This patch didn't break this use case.

>> ---
>>  common/filter | 11 +++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/common/filter b/common/filter
>> index 1ef342b..75570f9 100644
>> --- a/common/filter
>> +++ b/common/filter
>> @@ -295,6 +295,17 @@ _filter_scratch()
>>   -e "/.use_space/d"
>>  }
>>
>> +_filter_testdir_and_scratch()
>> +{
>> + # filter both $TEST_DIR and $SCRATCH_MNT, but always filter the longer
>> + # string first if the other string is a substring of the first one
>> + if echo "$TEST_DIR" | grep -q "$SCRATCH_MNT"; then
>> +   _filter_test_dir | _filter_scratch
>> + else
>> +   _filter_scratch | _filter_test_dir
>
> And fixed the indentation here, used a tab :)
>
> Thanks,
> Eryu
>
>> + fi
>> +}
>> +
>>  # Turn any device in the scratch pool into SCRATCH_DEV
>>  _filter_scratch_pool()
>>  {
>> --
>> 2.9.5
>>
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][v3] Btrfs: add an extent ref verify tool

2017-09-01 Thread josef
From: Josef Bacik 

We were having corruption issues that were tied back to problems with the extent
tree.  In order to track them down I built this tool to try to find the
culprit, which was pretty successful.  If you compile with this tool enabled, it
will live-verify every ref update that the fs makes and make sure it is
consistent and valid.  I've run this through xfstests and haven't gotten any
false positives.  Thanks,

Signed-off-by: Josef Bacik 
---
v2->v3:
- fix use after free in an error case

 fs/btrfs/Kconfig   |   10 +
 fs/btrfs/Makefile  |1 +
 fs/btrfs/ctree.c   |2 +-
 fs/btrfs/ctree.h   |   14 +-
 fs/btrfs/disk-io.c |   15 +
 fs/btrfs/extent-tree.c |   44 ++-
 fs/btrfs/file.c|   10 +-
 fs/btrfs/inode.c   |9 +-
 fs/btrfs/ioctl.c   |2 +-
 fs/btrfs/ref-verify.c  | 1019 
 fs/btrfs/ref-verify.h  |   51 +++
 fs/btrfs/relocation.c  |   14 +-
 fs/btrfs/super.c   |   17 +
 fs/btrfs/tree-log.c|2 +-
 14 files changed, 1178 insertions(+), 32 deletions(-)
 create mode 100644 fs/btrfs/ref-verify.c
 create mode 100644 fs/btrfs/ref-verify.h
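
A hedged usage note, not part of the patch: assuming the super.c hunk 
wires this up as a mount option spelled ref_verify, as the later 
upstream version does, the checker would be enabled with:

mount -o ref_verify /dev/sdX /mnt   # needs CONFIG_BTRFS_FS_REF_VERIFY=y
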

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 80e9c18..77d7f74 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -89,3 +89,13 @@ config BTRFS_ASSERT
  any of the assertions trip.  This is meant for btrfs developers only.
 
  If unsure, say N.
+
+config BTRFS_FS_REF_VERIFY
+   bool "Btrfs with the ref verify tool compiled in"
+   depends on BTRFS_FS
+   help
+ Enable run-time extent reference verification instrumentation.  This
+ is meant to be used by btrfs developers for tracking down extent
+ reference problems or verifying they didn't break something.
+
+ If unsure, say N.
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17..3172751 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -13,6 +13,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
+btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
 
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
tests/extent-buffer-tests.o tests/btrfs-tests.o \
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 6d49db7..a4812ce 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -192,7 +192,7 @@ struct extent_buffer *btrfs_lock_root_node(struct 
btrfs_root *root)
  * tree until you end up with a lock on the root.  A locked buffer
  * is returned, with a reference held.
  */
-static struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root)
+struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root)
 {
struct extent_buffer *eb;
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d49b045..4fa3ddd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1098,6 +1098,12 @@ struct btrfs_fs_info {
u32 nodesize;
u32 sectorsize;
u32 stripesize;
+
+#ifdef CONFIG_BTRFS_FS_REF_VERIFY
+   spinlock_t ref_verify_lock;
+   struct rb_root block_tree;
+   bool ref_verify_enabled;
+#endif
 };
 
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
@@ -1336,6 +1342,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
btrfs_fs_info *info)
 #define BTRFS_MOUNT_FRAGMENT_METADATA  (1 << 25)
#define BTRFS_MOUNT_FREE_SPACE_TREE    (1 << 26)
#define BTRFS_MOUNT_NOLOGREPLAY        (1 << 27)
#define BTRFS_MOUNT_REF_VERIFY         (1 << 28)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL  (30)
 #define BTRFS_DEFAULT_MAX_INLINE   (2048)
@@ -2627,7 +2634,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle 
*trans,
   struct extent_buffer *buf,
   u64 parent, int last_ref);
 int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
-u64 root_objectid, u64 owner,
+struct btrfs_root *root, u64 owner,
 u64 offset, u64 ram_bytes,
 struct btrfs_key *ins);
 int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
@@ -2646,7 +2653,7 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle 
*trans,
u64 bytenr, u64 num_bytes, u64 flags,
int level, int is_data);
 int btrfs_free_extent(struct btrfs_trans_handle *trans,
- struct btrfs_fs_info *fs_info,
+ struct btrfs_root *root,
  u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
  u64 owner, u64 offset);
 
@@ -2658,7 +2665,7 @@ void btrfs_prepare_extent_commit(struct btrfs_fs_info 
*fs_info);
 int btrfs_finish_extent_commit(struct 

Re: [PATCH 1/2] Add new common filter function

2017-09-01 Thread Eryu Guan
On Fri, Sep 01, 2017 at 02:39:44PM +0900, Misono, Tomohiro wrote:
> Several tests use both _filter_test_dir and _filter_scratch
> concatenated by pipe to filter $TEST_DIR and $SCRATCH_MNT. However, this
> would fail if the shorter string is a substring of the other (like
> "/mnt" and "/mnt2").
> 
> This patch introduces new common filter function to safely call both
> _filter_test_dir and _filter_scratch.
> 
> I checked this with btrfs/029, generic/409,410,411, and generic/381,383,
> xfs/106,108 (which calls _filter_quota). Thanks Eryu for advice.
> 
> Signed-off-by: Tomohiro Misono 

Thanks! Though I don't think we need two separate patches, so I merged
the two patches together at commit time.
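
To see the hazard being fixed, a minimal sketch (assuming
SCRATCH_MNT=/mnt and TEST_DIR=/mnt2, and that the filters boil down to
a sed substitution as in common/filter):

echo "/mnt2/file" | sed -e 's,/mnt,SCRATCH_MNT,'   # -> SCRATCH_MNT2/file (mangled)
echo "/mnt2/file" | sed -e 's,/mnt2,TEST_DIR,'     # -> TEST_DIR/file (longer first)
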

> ---
>  common/filter | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/common/filter b/common/filter
> index 1ef342b..75570f9 100644
> --- a/common/filter
> +++ b/common/filter
> @@ -295,6 +295,17 @@ _filter_scratch()
>   -e "/.use_space/d"
>  }
>  
> +_filter_testdir_and_scratch()
> +{
> + # filter both $TEST_DIR and $SCRATCH_MNT, but always filter the longer
> + # string first if the other string is a substring of the first one
> + if echo "$TEST_DIR" | grep -q "$SCRATCH_MNT"; then
> +   _filter_test_dir | _filter_scratch
> + else
> +   _filter_scratch | _filter_test_dir

And fixed the indentation here, used a tab :)

Thanks,
Eryu

> + fi
> +}
> +
>  # Turn any device in the scratch pool into SCRATCH_DEV
>  _filter_scratch_pool()
>  {
> -- 
> 2.9.5
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html