[PATCH] btrfs: Better csum error message for data csum mismatch

The original csum error message only outputs the inode number, offset,
checksum and expected checksum. However, no root objectid is output,
which sometimes makes debugging quite painful in multi-subvolume cases
(including relocation). Also the checksums are printed in decimal, which
seldom makes sense to users/developers and is hard to read most of the
time.

This patch adds the root objectid, printed as %lld for rootids larger
than LAST_FREE_OBJECTID, and switches the csum output to hex for better
readability.

Signed-off-by: Qu Wenruo
---
v2: Output mirror number in both inode.c and compression.c
---
 fs/btrfs/btrfs_inode.h | 18 ++++++++++++++++++
 fs/btrfs/compression.c |  6 ++----
 fs/btrfs/inode.c       |  5 ++---
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 1a8fa46ff87e..3cb8e6347b24 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -326,6 +326,24 @@ static inline void btrfs_inode_resume_unlocked_dio(struct inode *inode)
 		  &BTRFS_I(inode)->runtime_flags);
 }
 
+static inline void btrfs_print_data_csum_error(struct inode *inode,
+		u64 logical_start, u32 csum, u32 csum_expected, int mirror_num)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+
+	/* Output minus objectid, which is more meaningful */
+	if (root->objectid >= BTRFS_LAST_FREE_OBJECTID)
+		btrfs_warn_rl(root->fs_info,
+	"csum failed root %lld ino %lld off %llu csum 0x%08x expected csum 0x%08x mirror %d",
+			root->objectid, btrfs_ino(inode), logical_start, csum,
+			csum_expected, mirror_num);
+	else
+		btrfs_warn_rl(root->fs_info,
+	"csum failed root %llu ino %llu off %llu csum 0x%08x expected csum 0x%08x mirror %d",
+			root->objectid, btrfs_ino(inode), logical_start, csum,
+			csum_expected, mirror_num);
+}
+
 bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end);
 
 #endif
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 7f390849343b..a7a770ad93ad 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -124,10 +124,8 @@ static int check_compressed_csum(struct inode *inode,
 		kunmap_atomic(kaddr);
 
 		if (csum != *cb_sum) {
-			btrfs_info(BTRFS_I(inode)->root->fs_info,
-			   "csum failed ino %llu extent %llu csum %u wanted %u mirror %d",
-			   btrfs_ino(inode), disk_start, csum, *cb_sum,
-			   cb->mirror_num);
+			btrfs_print_data_csum_error(inode, disk_start, csum,
+						    *cb_sum, cb->mirror_num);
 			ret = -EIO;
 			goto fail;
 		}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1e861a063721..5cfd904cc6e6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3123,9 +3123,8 @@ static int __readpage_endio_check(struct inode *inode,
 	kunmap_atomic(kaddr);
 	return 0;
 zeroit:
-	btrfs_warn_rl(BTRFS_I(inode)->root->fs_info,
-		"csum failed ino %llu off %llu csum %u expected csum %u",
-		btrfs_ino(inode), start, csum, csum_expected);
+	btrfs_print_data_csum_error(inode, start, csum, csum_expected,
+				    io_bio->mirror_num);
 	memset(kaddr + pgoff, 1, len);
 	flush_dcache_page(page);
 	kunmap_atomic(kaddr);
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
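As an aside for readers unfamiliar with the special root objectids: the roots at or above BTRFS_LAST_FREE_OBJECTID (-256 as an unsigned 64-bit value, e.g. the tree reloc root at -8ULL) are much more readable printed as signed numbers, which is what the %lld branch achieves. A quick illustrative model (Python, not kernel code; the function names here are made up for the sketch):

```python
# Toy model of the message format the patch introduces: special roots at or
# above BTRFS_LAST_FREE_OBJECTID print as signed 64-bit values, and csums
# print in hex. Not kernel code; fmt_objectid/csum_error_msg are invented
# names for illustration only.
U64 = 1 << 64
LAST_FREE_OBJECTID = U64 - 256  # -256ULL in the kernel headers

def fmt_objectid(objectid):
    # the %lld case: reinterpret the u64 as a signed 64-bit value
    if objectid >= LAST_FREE_OBJECTID:
        return str(objectid - U64)
    return str(objectid)  # the ordinary %llu case

def csum_error_msg(root_objectid, ino, off, csum, expected, mirror):
    return ("csum failed root %s ino %d off %d csum 0x%08x "
            "expected csum 0x%08x mirror %d"
            % (fmt_objectid(root_objectid), ino, off, csum, expected, mirror))

# the tree reloc root (-8ULL) now shows up as "root -8" instead of a
# 20-digit unsigned number
print(csum_error_msg(U64 - 8, 2241616, 51580928, 0x12345678, 0x9abcdef0, 1))
```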
Re: understanding disk space usage
At 02/08/2017 05:55 PM, Vasco Visser wrote:
> Thank you for the explanation. What I would still like to know is how
> to relate the chunk level abstraction to the file level abstraction.
> According to the btrfs output there is 2G of data space available and
> 24G of data space being used. Does this mean 24G of data used in files?

Yes, 24G is used to store data. (And space cache, although the space
cache is relatively small, less than 1M for each chunk.)

> How do I know which files take up most space? du seems pretty useless
> as it reports only 9G of files on the volume.

Are you using snapshots? If you are only using 1 subvolume (including
snapshots), then it seems that btrfs data CoW wastes quite a lot of
space.

In the case of btrfs data CoW, say you have a 128M file (one extent)
and then rewrite 64M of it: your data space usage will be 128M + 64M,
as the first 128M will only be freed after *all* its users are freed.

For the single-subvolume, little-to-no-reflink case, "btrfs fi defrag"
should help to free some space.

If you have multiple snapshots or a lot of reflinked files, then I'm
afraid you have to delete some files (including reflink copies or
snapshots) to free some data.

Thanks,
Qu

> --
> Vasco
>
> On Wed, Feb 8, 2017 at 4:48 AM, Qu Wenruo wrote:
>> At 02/08/2017 12:44 AM, Vasco Visser wrote:
>>> Hello,
>>>
>>> My system is or seems to be running out of disk space but I can't
>>> find out how or why. Might be a BTRFS peculiarity, hence posting on
>>> this list. Most indicators seem to suggest I'm filling up, but I
>>> can't trace the disk usage to files on the FS. The issue is on my
>>> root filesystem on a 28GiB ssd partition (commands below issued when
>>> booted into single user mode):
>>>
>>> $ df -h
>>> Filesystem      Size  Used Avail Use% Mounted on
>>> /dev/sda3        28G   26G  2.1G  93% /
>>>
>>> $ btrfs --version
>>> btrfs-progs v4.4
>>>
>>> $ btrfs fi usage /
>>> Overall:
>>>     Device size:          27.94GiB
>>>     Device allocated:     27.94GiB
>>>     Device unallocated:    1.00MiB
>>
>> So from the chunk level, your fs is already full.
>>
>> And balance won't succeed since there is no unallocated space at all.
>> The first 1M of btrfs is always reserved and won't be allocated, and
>> 1M is too small for btrfs to allocate a chunk.
>>
>>>     Device missing:          0.00B
>>>     Used:                 25.03GiB
>>>     Free (estimated):      2.37GiB (min: 2.37GiB)
>>>     Data ratio:               1.00
>>>     Metadata ratio:           1.00
>>>     Global reserve:      256.00MiB (used: 0.00B)
>>>
>>> Data,single: Size:26.69GiB, Used:24.32GiB
>>
>> You still have 2G of data space, so you can still write things.
>>
>>>    /dev/sda3  26.69GiB
>>>
>>> Metadata,single: Size:1.22GiB, Used:731.45MiB
>>
>> Metadata has less space when considering the "Global reserve". In
>> fact the used space would be 987M. But it's still OK for normal
>> writes.
>>
>>>    /dev/sda3   1.22GiB
>>>
>>> System,single: Size:32.00MiB, Used:16.00KiB
>>>    /dev/sda3  32.00MiB
>>
>> The system chunk can hardly be used up.
>>
>>> Unallocated:
>>>    /dev/sda3   1.00MiB
>>>
>>> $ btrfs fi df /
>>> Data, single: total=26.69GiB, used=24.32GiB
>>> System, single: total=32.00MiB, used=16.00KiB
>>> Metadata, single: total=1.22GiB, used=731.48MiB
>>> GlobalReserve, single: total=256.00MiB, used=0.00B
>>>
>>> However:
>>> $ mount -o bind / /mnt
>>> $ sudo du -hs /mnt
>>> 9.3G  /mnt
>>>
>>> Try to balance:
>>> $ btrfs balance start /
>>> ERROR: error during balancing '/': No space left on device
>>>
>>> Am I really filling up? What can explain the huge discrepancy with
>>> the output of du (no open file descriptors on deleted files can
>>> explain this in single user mode) and the FS stats?
>>
>> Just don't believe the vanilla df output for btrfs.
>>
>> For btrfs, unlike other filesystems such as ext4/xfs, which allocates
>> chunks dynamically and has different metadata/data profiles, we can
>> only get a clear view of the fs from both the chunk level
>> (allocated/unallocated) and the extent level (total/used).
>>
>> In your case, your fs doesn't have any unallocated space, which makes
>> balance unable to work at all. And your data/metadata usage is quite
>> high; although both have a small amount of available space left, the
>> fs should be writable for some time, but not long.
>>
>> To proceed, add a larger device to the current fs and do a balance,
>> or just delete the 28G partition; then btrfs will handle the rest
>> well.
>>
>> Thanks,
>> Qu
>>
>>> Any advice on possible causes and how to proceed?
>>>
>>> --
>>> Vasco
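The 128M + 64M example above can be sketched numerically (a toy Python model of the accounting, not btrfs code): a CoW extent is only freed once no file range references any part of it, so a partial rewrite allocates new space without freeing the old, and du (which sees logical file sizes) diverges from the data "used" figure.

```python
# Toy model of btrfs data CoW accounting: an extent stays allocated while
# any file range still references it, so partially rewriting a file grows
# on-disk usage beyond the file's logical size.
MiB = 1024 * 1024

def space_used(extents):
    """extents: list of (size_bytes, live_reference_count) pairs."""
    return sum(size for size, refs in extents if refs > 0)

# a 128 MiB file written as a single extent
extents = [(128 * MiB, 1)]
assert space_used(extents) == 128 * MiB

# rewrite 64 MiB of it: a new 64 MiB extent appears, but the old 128 MiB
# extent is still partially referenced, so nothing is freed yet
extents = [(128 * MiB, 1), (64 * MiB, 1)]
assert space_used(extents) == 192 * MiB  # 128M + 64M, as in the example
```

Defragmenting (or deleting the last snapshot/reflink holding an old extent) is what drops a reference count to zero in this model and lets the space return.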
csum failed, checksum error, questions
I had a file read fail repeatably; in syslog, lines like this:

  kernel: BTRFS warning (device dm-5): csum failed ino 2241616 off 51580928 csum 4redacted expected csum 2redacted

I rmed the file. Another error more recently, 5 instances which look
like this:

  kernel: BTRFS warning (device dm-5): checksum error at logical 16147043602432 on dev /dev/mapper/dev-name-redacted, sector 1177577896, root 4679, inode 2241616, offset 51597312, length 4096, links 1 (path: file/path/redacted)
  kernel: BTRFS error (device dm-5): bdev /dev/mapper/dev-name-redacted errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
  kernel: BTRFS error (device dm-5): unable to fixup (regular) error at logical 16147043602432 on dev /dev/mapper/dev-name-redacted

In this case, I think the file got rmed as well.

I'm assuming this is a problem with the drive, not btrfs. Any opinions
on how likely catastrophic failure of the drive is?

Is rming the problematic file sufficient? How about if the subvolume
containing this bad file was previously snapshotted?

Is there anything else besides "kernel: BTRFS (error|warning)" that I
should grep for in my syslog to watch for filesystem/drive problems?
For example, is there anything in addition to error/warning like
"fatal" or "critical"?

For at least the second error, I was running:

  Linux 4.9.0-1-amd64 #1 SMP Debian 4.9.2-2 (2017-01-12) x86_64 GNU/Linux
  btrfs-progs 4.7.3-1

Thanks,
Ian Kelling
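(Editorial sketch on the grep question: btrfs does also log at "critical" level, via the kernel's btrfs_crit() helper, so a slightly wider pattern is worth matching. The sample log lines below are made up for illustration; this is a Python stand-in for a grep -E over syslog.)

```python
import re

# hypothetical sample syslog lines, modeled on the messages in this thread
lines = [
    'kernel: BTRFS warning (device dm-5): csum failed ino 2241616 ...',
    'kernel: BTRFS error (device dm-5): unable to fixup (regular) error ...',
    'kernel: BTRFS critical (device dm-5): corrupt leaf ...',
    'kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode',
]

# equivalent of: grep -E 'BTRFS (error|warning|critical)' /var/log/syslog
pattern = re.compile(r'BTRFS (error|warning|critical)')
hits = [line for line in lines if pattern.search(line)]
for line in hits:
    print(line)
```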
Re: [PATCH] btrfs: qgroup: Move half of the qgroup accounting time out of commit trans
At 02/08/2017 10:09 PM, Filipe Manana wrote:
> On Wed, Feb 8, 2017 at 1:56 AM, Qu Wenruo wrote:
>> Just as Filipe pointed out, the most time consuming part of qgroup is
>> btrfs_qgroup_account_extents() and
>> btrfs_qgroup_prepare_account_extents().
>
> There's an "and", so the "is" should be "are" and "part" should be
> "parts".
>
>> Which both call btrfs_find_all_roots() to get old_roots and new_roots
>> ulists. However for old_roots, we don't really need to calculate it
>> at transaction commit time.
>>
>> This patch moves the old_roots accounting part out of
>> commit_transaction(), so at least we won't block the transaction too
>> long.
>
> Doing stuff inside btrfs_commit_transaction() is only bad if it's
> within the critical section, that is, after setting the transaction's
> state to TRANS_STATE_COMMIT_DOING and before setting the state to
> TRANS_STATE_UNBLOCKED. This should be explained somehow in the
> changelog.

In this context, only the critical section is under concern.

>> But please note that this won't speed up qgroup overall, it just
>> moves half of the cost out of commit_transaction().
>>
>> Cc: Filipe Manana
>> Signed-off-by: Qu Wenruo
>> ---
>>  fs/btrfs/delayed-ref.c | 20
>>  fs/btrfs/qgroup.c      | 33 ++---
>>  fs/btrfs/qgroup.h      | 14 ++
>>  3 files changed, 60 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>> index ef724a5..0ee927e 100644
>> --- a/fs/btrfs/delayed-ref.c
>> +++ b/fs/btrfs/delayed-ref.c
>> @@ -550,13 +550,14 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>>                      struct btrfs_delayed_ref_node *ref,
>>                      struct btrfs_qgroup_extent_record *qrecord,
>>                      u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
>> -                    int action, int is_data)
>> +                    int action, int is_data, int *qrecord_inserted_ret)
>>  {
>>         struct btrfs_delayed_ref_head *existing;
>>         struct btrfs_delayed_ref_head *head_ref = NULL;
>>         struct btrfs_delayed_ref_root *delayed_refs;
>>         int count_mod = 1;
>>         int must_insert_reserved = 0;
>> +       int qrecord_inserted = 0;
>>
>>         /* If reserved is provided, it must be a data extent. */
>>         BUG_ON(!is_data && reserved);
>> @@ -623,6 +624,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>>                 if (btrfs_qgroup_trace_extent_nolock(fs_info,
>>                                         delayed_refs, qrecord))
>>                         kfree(qrecord);
>> +               else
>> +                       qrecord_inserted = 1;
>>         }
>>
>>         spin_lock_init(&head_ref->lock);
>> @@ -650,6 +653,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>>                 atomic_inc(&delayed_refs->num_entries);
>>                 trans->delayed_ref_updates++;
>>         }
>> +       if (qrecord_inserted_ret)
>> +               *qrecord_inserted_ret = qrecord_inserted;
>>         return head_ref;
>>  }
>>
>> @@ -779,6 +784,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
>>         struct btrfs_delayed_ref_head *head_ref;
>>         struct btrfs_delayed_ref_root *delayed_refs;
>>         struct btrfs_qgroup_extent_record *record = NULL;
>> +       int qrecord_inserted;
>>
>>         BUG_ON(extent_op && extent_op->is_data);
>>         ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
>> @@ -806,12 +812,15 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
>>          * the spin lock
>>          */
>>         head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
>> -                                       bytenr, num_bytes, 0, 0, action, 0);
>> +                                       bytenr, num_bytes, 0, 0, action, 0,
>> +                                       &qrecord_inserted);
>>
>>         add_delayed_tree_ref(fs_info, trans, head_ref, &ref->node, bytenr,
>>                              num_bytes, parent, ref_root, level, action);
>>         spin_unlock(&delayed_refs->lock);
>>
>> +       if (qrecord_inserted)
>> +               return btrfs_qgroup_trace_extent_post(fs_info, record);
>>         return 0;
>>
>>  free_head_ref:
>> @@ -836,6 +845,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
>>         struct btrfs_delayed_ref_head *head_ref;
>>         struct btrfs_delayed_ref_root *delayed_refs;
>>         struct btrfs_qgroup_extent_record *record = NULL;
>> +       int qrecord_inserted;
>>
>>         BUG_ON(extent_op && !extent_op->is_data);
>>         ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
>> @@ -870,13 +880,15 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
>>          */
>>         head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
>>                                         bytenr, num_bytes, ref_root, reserved,
>> -                                       action, 1);
>> +                                       action, 1, &qrecord_inserted);
>>
>>         add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
>>                              num_bytes, parent, ref_root, owner, offset,
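The split the patch makes can be sketched as a toy model (purely illustrative Python, not the kernel data structures; the function names mirror the patch loosely): old_roots can be looked up against the commit root as soon as the qgroup extent record is inserted, outside the commit critical section, while only new_roots must wait for commit time.

```python
# Toy model of the two-phase qgroup accounting split: for each dirty
# extent, old_roots is filled in early (outside the transaction-commit
# critical section), so only the new_roots lookup remains at commit time.

class ExtentRecord:
    def __init__(self, bytenr):
        self.bytenr = bytenr
        self.old_roots = None
        self.new_roots = None

def find_all_roots(view, bytenr):
    # stand-in for btrfs_find_all_roots(); 'view' maps bytenr -> root ids
    return set(view.get(bytenr, ()))

def trace_extent_post(record, commit_root_view):
    # analogous to btrfs_qgroup_trace_extent_post() in the patch:
    # resolve old_roots from the commit root, before commit starts
    record.old_roots = find_all_roots(commit_root_view, record.bytenr)

def account_at_commit(records, current_root_view):
    # only the new_roots lookup is left for the critical section
    for rec in records:
        rec.new_roots = find_all_roots(current_root_view, rec.bytenr)

rec = ExtentRecord(bytenr=4096)
trace_extent_post(rec, {4096: [5]})           # done outside commit
account_at_commit([rec], {4096: [5, 257]})    # done inside commit
assert rec.old_roots == {5} and rec.new_roots == {5, 257}
```

This is why the patch only moves (rather than removes) half of the cost: both lookups still happen once per touched extent, but only one of them now blocks the commit.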
Re: Very slow balance / btrfs-transaction
At 02/08/2017 09:56 PM, Filipe Manana wrote:
> On Wed, Feb 8, 2017 at 12:39 AM, Qu Wenruo wrote:
>> At 02/07/2017 11:55 PM, Filipe Manana wrote:
>>> On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo wrote:
>>>> At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:
>>>>> Hi Qu,
>>>>>
>>>>> On 02/05/2017 07:45 PM, Qu Wenruo wrote:
>>>>>> At 02/04/2017 09:47 AM, Jorg Bornschein wrote:
>>>>>>> February 4, 2017 1:07 AM, "Goldwyn Rodrigues" wrote:
>>>>>>>
>>>>>>> Quota support was indeed active -- and it warned me that the
>>>>>>> qgroup data was inconsistent. Disabling quotas had an immediate
>>>>>>> impact on balance throughput -- it's *much* faster now! From a
>>>>>>> quick glance at iostat I would guess it's at least a factor 100
>>>>>>> faster. Should quota support generally be disabled during
>>>>>>> balances? Or did I somehow push my fs into a weird state where
>>>>>>> it triggered a slow-path?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> j
>>>>>>
>>>>>> Would you please provide the kernel version?
>>>>>>
>>>>>> v4.9 introduced a bad fix for qgroup balance, which doesn't
>>>>>> completely fix qgroup bytes leaking, but also hugely slows down
>>>>>> the balance process:
>>>>>>
>>>>>>   commit 62b99540a1d91e46422f0e04de50fc723812c421
>>>>>>   Author: Qu Wenruo
>>>>>>   Date:   Mon Aug 15 10:36:51 2016 +0800
>>>>>>
>>>>>>       btrfs: relocation: Fix leaking qgroups numbers on data extents
>>>>>>
>>>>>> Sorry for that.
>>>>>>
>>>>>> And in v4.10, a better method is applied to fix the byte leaking
>>>>>> problem, and it should be a little faster than the previous one:
>>>>>>
>>>>>>   commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
>>>>>>   Author: Qu Wenruo
>>>>>>   Date:   Tue Oct 18 09:31:29 2016 +0800
>>>>>>
>>>>>>       btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
>>>>>>
>>>>>> However, balance with qgroup is still slower than balance without
>>>>>> qgroup; the root fix needs us to rework the current backref
>>>>>> iteration.
>>>>>
>>>>> This patch has made the btrfs balance performance worse. The
>>>>> balance task has become more CPU intensive compared to earlier and
>>>>> takes longer to complete, besides hogging resources. While
>>>>> correctness is important, we need to figure out how this can be
>>>>> made more efficient.
>>>>
>>>> The cause is already known. It's find_parent_node(), which takes
>>>> most of the time to find all referencers of an extent.
>>>>
>>>> And it's also the cause of the FIEMAP softlockup (fixed in a recent
>>>> release by quitting early).
>>>>
>>>> The biggest problem is that the current find_parent_node() uses a
>>>> list to iterate, which is quite slow, especially as it's done in a
>>>> loop. In the real world find_parent_node() is about O(n^3).
>>>>
>>>> We can either improve find_parent_node() by using an rb_tree, or
>>>> introduce some cache for find_parent_node().
>>>
>>> Even if anyone is able to reduce that function's complexity from
>>> O(n^3) down to, let's say, O(n^2) or O(n log n) for example, the
>>> current implementation of qgroups will always be a problem. The real
>>> problem is that this more recent rework of qgroups does all this
>>> accounting inside the critical section of a transaction - blocking
>>> any other tasks that want to start a new transaction or attempt to
>>> join the current transaction.
>>>
>>> Not to mention that on systems with small amounts of memory (2Gb or
>>> 4Gb from what I've seen from user reports) we also OOM due to this
>>> allocation of struct btrfs_qgroup_extent_record per delayed data
>>> reference head, which are used for that accounting phase in the
>>> critical section of a transaction commit.
>>>
>>> Let's face it and be realistic: even if someone manages to make
>>> find_parent_node() much, much better, like O(n) for example, it will
>>> always be a problem due to the reasons mentioned before. Many
>>> extents touched per transaction and many subvolumes/snapshots will
>>> always expose that root problem - doing the accounting in the
>>> transaction commit critical section.
>>
>> You must accept the fact that we must call find_parent_node() at
>> least twice to get the correct owner modification for each touched
>> extent, or the qgroup numbers will never be correct: once for
>> old_roots by searching the commit root, and once for new_roots by
>> searching the current root.
>>
>> You can call find_parent_node() as many times as you like, but that's
>> just wasting your CPU time. Only the final find_parent_node() will
>> determine new_roots for that extent, and there is no better timing
>> than commit_transaction().
>
> You're missing my point.
>
> My point is not about needing to call find_parent_nodes() nor how many
> times to call it, or whether it's needed or not.
>
> My point is about doing expensive things inside the critical section
> of a transaction commit, which leads not only to low performance but
> to a system becoming unresponsive and with too high latency - and this
> is not theory or speculation, there are upstream reports about this as
> well as several in suse's bugzilla, all caused when qgroups are
> enabled on 4.2+ kernels (when the last qgroups major changes landed).
>
> Judging from that code and from your reply to this and other threads,
> it seems you didn't understand the consequences of doing all that
> accounting stuff inside the critical section of a transaction commit.

NO, I know what you're talking about. Or I won't send the patch to
Re: [PATCH] Btrfs: fix use-after-free due to wrong order of destroying work queues
On Tue, Feb 07, 2017 at 05:02:53PM +0000, fdman...@kernel.org wrote:
> From: Filipe Manana
>
> Before we destroy all work queues (and wait for their tasks to complete)
> we were destroying the work queues used for metadata I/O operations, which
> can result in a use-after-free problem because most tasks from all work
> queues do metadata I/O operations. For example, the tasks from the caching
> workers work queue (fs_info->caching_workers), which is destroyed only
> after the work queue used for metadata reads (fs_info->endio_meta_workers)
> is destroyed, do metadata reads, which result in attempts to queue tasks
> into the later work queue, triggering a use-after-free with a trace like
> the following:
>
> [23114.613543] general protection fault: [#1] PREEMPT SMP
> [23114.614442] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic
>  acpi_cpufreq tpm_tis tpm_tis_core tpm ppdev parport_pc parport i2c_piix4 processor sg evdev i2c_core psmouse pcspkr serio_raw button loop autofs4 ext4 crc16
>  jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: scsi_debug]
> [23114.616932] CPU: 9 PID: 4537 Comm: kworker/u32:8 Not tainted 4.9.0-rc7-btrfs-next-36+ #1
> [23114.616932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> [23114.616932] Workqueue: btrfs-cache btrfs_cache_helper [btrfs]
> [23114.616932] task: 880221d45780 task.stack: c9000bc5
> [23114.616932] RIP: 0010:[] [] btrfs_queue_work+0x2c/0x190 [btrfs]
> [23114.616932] RSP: 0018:88023f443d60 EFLAGS: 00010246
> [23114.616932] RAX: RBX: 6b6b6b6b6b6b6b6b RCX: 0102
> [23114.616932] RDX: a0419000 RSI: 88011df534f0 RDI: 880101f01c00
> [23114.616932] RBP: 88023f443d80 R08: 000f7000 R09:
> [23114.616932] R10: 88023f443d48 R11: 1000 R12: 88011df534f0
> [23114.616932] R13: 880135963868 R14: 1000 R15: 1000
> [23114.616932] FS: () GS:88023f44() knlGS:
> [23114.616932] CS: 0010 DS: ES: CR0: 80050033
> [23114.616932] CR2: 7f0fb9f8e520 CR3: 01a0b000 CR4: 06e0
> [23114.616932] Stack:
> [23114.616932]  880101f01c00 88011df534f0 880135963868 1000
> [23114.616932]  88023f443da0 a03470af 880149b37200 880135963868
> [23114.616932]  88023f443db8 8125293c 880149b37200 88023f443de0
> [23114.616932] Call Trace:
> [23114.616932]  [] end_workqueue_bio+0xd5/0xda [btrfs]
> [23114.616932]  [] bio_endio+0x54/0x57
> [23114.616932]  [] btrfs_end_bio+0xf7/0x106 [btrfs]
> [23114.616932]  [] bio_endio+0x54/0x57
> [23114.616932]  [] blk_update_request+0x21a/0x30f
> [23114.616932]  [] scsi_end_request+0x31/0x182 [scsi_mod]
> [23114.616932]  [] scsi_io_completion+0x1ce/0x4c8 [scsi_mod]
> [23114.616932]  [] scsi_finish_command+0x104/0x10d [scsi_mod]
> [23114.616932]  [] scsi_softirq_done+0x101/0x10a [scsi_mod]
> [23114.616932]  [] blk_done_softirq+0x82/0x8d
> [23114.616932]  [] __do_softirq+0x1ab/0x412
> [23114.616932]  [] irq_exit+0x49/0x99
> [23114.616932]  [] smp_call_function_single_interrupt+0x24/0x26
> [23114.616932]  [] call_function_single_interrupt+0x89/0x90
> [23114.616932]  [] ? scsi_request_fn+0x13a/0x2a1 [scsi_mod]
> [23114.616932]  [] ? _raw_spin_unlock_irq+0x2c/0x4a
> [23114.616932]  [] ? _raw_spin_unlock_irq+0x32/0x4a
> [23114.616932]  [] ? _raw_spin_unlock_irq+0x2c/0x4a
> [23114.616932]  [] scsi_request_fn+0x13a/0x2a1 [scsi_mod]
> [23114.616932]  [] __blk_run_queue_uncond+0x22/0x2b
> [23114.616932]  [] __blk_run_queue+0x19/0x1b
> [23114.616932]  [] blk_queue_bio+0x268/0x282
> [23114.616932]  [] generic_make_request+0xbd/0x160
> [23114.616932]  [] submit_bio+0x100/0x11d
> [23114.616932]  [] ? __this_cpu_preempt_check+0x13/0x15
> [23114.616932]  [] ? __percpu_counter_add+0x8e/0xa7
> [23114.616932]  [] btrfsic_submit_bio+0x1a/0x1d [btrfs]
> [23114.616932]  [] btrfs_map_bio+0x1f4/0x26d [btrfs]
> [23114.616932]  [] btree_submit_bio_hook+0x74/0xbf [btrfs]
> [23114.616932]  [] ? btrfs_wq_submit_bio+0x160/0x160 [btrfs]
> [23114.616932]  [] submit_one_bio+0x6b/0x89 [btrfs]
> [23114.616932]  [] read_extent_buffer_pages+0x170/0x1ec [btrfs]
> [23114.616932]  [] ? free_root_pointers+0x64/0x64 [btrfs]
> [23114.616932]  [] readahead_tree_block+0x3f/0x4c [btrfs]
> [23114.616932]  [] read_block_for_search.isra.20+0x1ce/0x23d [btrfs]
> [23114.616932]  [] btrfs_search_slot+0x65f/0x774 [btrfs]
> [23114.616932]  [] ? free_extent_buffer+0x73/0x7e [btrfs]
> [23114.616932]  [] btrfs_next_old_leaf+0xa1/0x33c [btrfs]
> [23114.616932]  []
Re: [PULL] Fix ioctls on 32bit/64bit userspace/kernel, for 4.10
On Wed, Feb 08, 2017 at 05:51:28PM +0100, David Sterba wrote:
> Hi,
>
> could you please merge this single-patch pull request, for 4.10 still?
> There are quite a few patches on top of v4.10-rc7, so this IMHO does
> not look too bad even late in the release cycle. Though it's a fix for
> an uncommon usecase of 32bit userspace on a 64bit kernel, it fixes
> basic operation of the ioctls. Thanks.

Hi Dave,

I'll pull this in, thanks.

-chris
Re: BTRFS and cyrus mail server
On 08/02/17 18:38, Libor Klepáč wrote:
> I'm interested in using:
...
> - send/receive for offsite backup

I don't particularly recommend that.

I do use send/receive for onsite backups (I actually use btrbk). But
for offsite I use a traditional backup tool (I use dar). For three
main reasons:

1) Paranoia: I want a backup that does not use btrfs, just in case
there turned out to be some problem with btrfs which could corrupt the
backup. I can't think of anything, but I did say it was paranoia!

2) send/receive in incremental mode (the obvious way to use it for
offsite backups) relies on the target being up to date and properly
synchronised with the source. If, for any reason, it gets out of sync,
you have to start again with sending a full backup - a lot of data.
Traditional backup formats are more forgiving, and having a corrupted
incremental does not normally prevent you getting access to the data
stored in the other incrementals. This would particularly be a risk if
you thought about storing the actual send streams instead of doing the
receive: a single bit error in one could make all the subsequent
streams useless.

3) send/receive doesn't work particularly well with encryption. I
store my offsite backups in a cloud service and I want them encrypted
both in transit and when stored. To get the same with send/receive
requires putting together your own encrypted communication channel
(e.g. using ssh) and requires that you have a remote server with an
encrypted filesystem receiving the data (and it has to be accessible
in the clear on that server). Traditional backups can just be stored
offsite as encrypted files without ever having to be in the clear
anywhere except onsite.

Just my reasons.
Re: understanding disk space usage
> [ ... ] The issue isn't total size, it's the difference between total
> size and the amount of data you want to store on it, and how well you
> manage chunk usage. If you're balancing regularly to compact chunks
> that are less than 50% full, [ ... ] BTRFS on 16GB disk images before
> with absolutely zero issues, and have a handful of fairly active 8GB
> BTRFS volumes [ ... ]

Unfortunately balance operations are quite expensive, especially from
inside VMs. On the other hand, if the system is not much disk
constrained, relatively frequent balances are a good idea indeed.

It is a bit like the advice in the other thread on OLTP to run
frequent data defrags, which are also quite expensive. Both combined
are like running the compactor/cleaner on log-structured (another
variant of "COW") filesystems like NILFS2: running it frequently means
tighter space use and better locality, but is quite expensive too.

>> [ ... ] My impression is that the Btrfs design trades space for
>> performance and reliability.
>
> In general, yes, but a more accurate statement would be that it
> offers a trade-off between space and convenience. [ ... ]

It is not quite "convenience", it is overhead: whole-volume operations
like compacting, defragmenting (or fscking) tend to cost significantly
in IOPS and also in transfer rate, and on flash SSDs they also consume
lifetime. Therefore personally I prefer to have quite a bit of unused
space in Btrfs or NILFS2, at a minimum around double, at 10-20%, than
the 5-10% that I think is the minimum advisable with conventional
designs.
Re: understanding disk space usage
On 2017-02-08 09:46, Peter Grandi wrote: My system is or seems to be running out of disk space but I can't find out how or why. [ ... ] FilesystemSize Used Avail Use% Mounted on /dev/sda3 28G 26G 2.1G 93% / [ ... ] So from chunk level, your fs is already full. And balance won't success since there is no unallocated space at all. To add to this, 28GiB is a bit too small for Btrfs, because at that point chunk size is 1GiB. I have the habit of sizing partitions to an exact number of GiB, and that means that most of 1GiB will never be used by Btrfs because there is a small amount of space allocated that is smaller than 1GiB and thus there will be eventually just less than 1GiB unallocated. Unfortunately the chunk size is not manually settable. 28GB is a perfectly reasonable (if a bit odd) size for a non-mixed-mode volume. The issue isn't total size, it's the difference between total size and the amount of data you want to store on it. and how well you manage chunk usage. If you're balancing regularly to compact chunks that are less than 50% full, you can get away with as little as 4GB of extra space beyond your regular data-set with absolutely zero issues. I've run full Linux installations in VM's with BTRFS on 16GB disk images before with absolutely zero issues, and have a handful of fairly active 8GB BTRFS volumes on both of my primary systems that never have any issues with free space despite averaging 5GB of space usage. Example here from 'btrfs fi usage': Overall: Device size: 88.00GiB Device allocated: 86.06GiB Device unallocated:1.94GiB Device missing: 0.00B Used: 80.11GiB Free (estimated): 6.26GiB (min: 5.30GiB) That means that I should 'btrfs balance' now, because of the 1.94GiB "unallocated", 0.94GiB will never be allocated, and that leaves just 1GiB "unallocated" which is the minimum for running 'btrfs balance'. I have just done so and this is the result: Actually, that 0.94GB would be used. 
BTRFS will create smaller chunks if it has to, so if you allocated two data chunks with that 1.94GB of space, you would get one 1GB chunk and one 0.94GB chunk.
> Overall:
>     Device size:          88.00GiB
>     Device allocated:     82.03GiB
>     Device unallocated:    5.97GiB
>     Device missing:          0.00B
>     Used:                 80.11GiB
>     Free (estimated):      6.26GiB  (min: 3.28GiB)
>
> At some point I had decided to use 'mixedbg' allocation to
> reduce this problem and hopefully improve locality, but that
> means that metadata and data need to have the same profile, and
> I really want metadata to be 'dup' because of checksumming, and
> I don't want data to be 'dup' too.
You could also use larger partitions and keep a better handle on free space.
>> [ ... ] To proceed, add a larger device to current fs, and do
>> a balance or just delete the 28G partition then btrfs will
>> handle the rest well.
> Usually for this I use a USB stick, with a 1-3GiB partition plus
> a bit extra because of that extra bit of space.
If you have a lot of RAM and can guarantee that things won't crash (or don't care about the filesystem too much and are just trying to avoid having to restore a backup), a ramdisk works well for this too.
> https://btrfs.wiki.kernel.org/index.php/FAQ#How_much_free_space_do_I_have.3F
> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_Btrfs_claims_I.27m_out_of_space.2C_but_it_looks_like_I_should_have_lots_left.21
> marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
>
> Unfortunately if it is a single device volume and metadata is
> 'dup' to remove the extra temporary device one has first to
> convert the metadata to 'single' and then back to 'dup' after
> removal.
This shouldn't be needed; if it is, then it's a bug that should be reported and ideally fixed (there was such a bug when converting from multi-device raid profiles to single device, but that got fixed quite a few kernel versions ago; I distinctly remember because I wrote the fix).
> There are also some additional reasons why space used (rather
> than allocated) may be larger than expected, in special but not
> wholly infrequent cases. My impression is that the Btrfs design
> trades space for performance and reliability.
In general, yes, but a more accurate statement would be that it offers a trade-off between space and convenience. If you're not going to take the time to maintain the filesystem properly, then you will need more excess space for it.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
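[Editor's sketch, not part of the thread: the "balancing regularly to compact chunks that are less than 50% full" maintenance described above maps onto balance's usage filters. A minimal sketch, with /mnt standing in for the real mount point:]

```shell
# Rewrite only data/metadata chunks that are at most 50% used, packing
# their contents into fewer chunks and returning the slack to the
# unallocated pool. /mnt is a placeholder mount point.
btrfs balance start -dusage=50 -musage=50 /mnt

# Check the effect: "Device unallocated" should have grown.
btrfs filesystem usage /mnt
```

Lower thresholds finish faster; on nearly-full filesystems it is common to start with -dusage=10 and raise the value stepwise until enough space is reclaimed.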
Re: BTRFS and cyrus mail server
On Wed, 08 Feb 2017 19:38:06 +0100, Libor Klepáč wrote:
> Hello,
> inspired by recent discussion on BTRFS vs. databases i wanted to ask
> on suitability of BTRFS for hosting a Cyrus imap server spool. I
> haven't found any recent article on this topic.
>
> I'm preparing migration of our mailserver to Debian Stretch, ie.
> kernel 4.9 for now. We are using XFS for storage now. I will migrate
> using imapsync to new server. Both are virtual machines running on
> vmware on Dell hardware. Disks are on battery backed hw raid
> controllers over vmfs.
>
> I'm considering using BTRFS, but I'm little concerned because of
> reading this mailing list ;)
>
> I'm interested in using:
> - compression (emails should compress well - right?)
Not really... The small part that's compressible (headers and a few lines of text) is already small, so a sector (maybe 4k) is still a sector. Compression gains you no benefit here. The big parts of mails are already compressed (images, attachments). Mail spools only compress well if you're compressing mails into a solid archive (like 7zip or tgz). If you're compressing each mail individually, there's almost no gain because of file system slack.
> - maybe deduplication (cyrus does it by hardlinking of same content
> messages now) later
It won't work that way. I'd stick to hardlinking. Only offline/nearline deduplication will help you. And it will have a hard time finding the duplicates. This would only properly work if Cyrus separates mail headers and bodies (I don't know if it does; dovecot, which is what I use, doesn't), because delivering to the spool usually adds some headers like "Delivered-To". This changes the byte offsets between similar mails so that deduplication will no longer work.
> - snapshots for history
Don't do snapshots too deep.
I had similar plans but instead decided it would be better to use the following setup as a continuous backup strategy: Deliver mails to two spools, one being the user accessible spool, and one being the backup spool. Once per day you rename the backup spool and let it be recreated. Then store away the old backup store in whatever way you want (snapshots, traditional backup with retention, ...). > - send/receive for offisite backup It's not that stable that I'd use it in production... > - what about data inlining, should it be turned off? How much data can be inlined? I'm not sure, I never thought about that. > Our Cyrus pool consist of ~520GB of data in ~2,5million files, ~2000 > mailboxes. Similar numbers here, just more mailboxes and less space because we take care that customers remove their mails from our servers and store it in their own systems and backups. With a few exceptions, and those have really big mailboxes. > We have message size limit of ~25MB, so emails are not bigger than > that. 50 MB raw size here... (after 3-in-4 decoding this makes around 37 MB worth of attachments) > There are however bigger files, these are per mailbox > caches/index files of cyrus (some of them are around 300MB) - and > these are also files which are most modified. > Rest of files (messages) are usualy just writen once. I'm still struggling if I should try btrfs or stay with xfs. Xfs has a huge benefit of scaling very very well to parallel workloads and accross multiple devices. Btrfs does exactly that not very well yet (because of write-serialization etc). > > --- > I started using btrfs on backup server as a storage for 4 backuppc > run in containers (backups are then send away with btrbk), year ago. > After switching off data inlining i'm satisfied, everything works > (send/ receive is sometime slow, but i guess it's because of sata > disks on receive side). I've started to love borgbackup. It's very fast, efficient, and reliable. 
Not sure how well it works for VM images, but for delta backups in general it's very efficient and fast.

--
Regards,
Kai

Replies to list-only preferred.
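[Editor's sketch, not part of the thread: the "a sector is still a sector" argument above can be checked without btrfs at all. A toy illustration, using gzip as a stand-in for the filesystem's compressor and a synthetic header-heavy mail: the compressed copy is much smaller in bytes, yet both round up to the same number of 4 KiB sectors, so nothing is saved on disk.]

```shell
# Build a ~3.5 KiB synthetic mail made of repetitive header lines.
mail=$(mktemp)
seq 1 70 | sed 's/^/Received: from relay.example.com with ESMTP id /' > "$mail"

orig_bytes=$(wc -c < "$mail")
comp_bytes=$(gzip -c "$mail" | wc -c)

# Round a byte count up to whole 4 KiB sectors.
sectors() { echo $(( ($1 + 4095) / 4096 )); }

echo "original:   $orig_bytes bytes = $(sectors "$orig_bytes") sector(s)"
echo "compressed: $comp_bytes bytes = $(sectors "$comp_bytes") sector(s)"
rm -f "$mail"
```

The byte count drops sharply, the sector count does not; only mails spanning multiple sectors can actually give space back.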
Re: BTRFS and cyrus mail server
On 2017-02-08 13:38, Libor Klepáč wrote: Hello, inspired by recent discussion on BTRFS vs. databases i wanted to ask on suitability of BTRFS for hosting a Cyrus imap server spool. I haven't found any recent article on this topic. I'm preparing migration of our mailserver to Debian Stretch, ie. kernel 4.9 for now. We are using XFS for storage now. I will migrate using imapsync to new server. Both are virtual machines running on vmware on Dell hardware. Disks are on battery backed hw raid controllers over vmfs. I'm considering using BTRFS, but I'm little concerned because of reading this mailing list ;) FWIW, as long as you're using a recent kernel and take the time to do proper maintenance on the filesystem, BTRFS is generally very stable. WRT mail servers specifically, before we went to a cloud service for e-mail where I work, we used Postfix + Dovecot on our internal server, and actually saw a measurable performance improvement when switching from XFS to BTRFS. That was about 3.12-3.18 vintage on the kernel though, so YMMV. I'm interested in using: - compression (emails should compress well - right?) Yes, very well assuming you're storing the actual text form of them (I don't recall if Cyrus does so, but I know Postfix, Sendmail, and most other FOSS mail server software do). The in-line compression will also help reduce fragmentation, and unless you have a really fast storage device, should probably improve performance in general. - maybe deduplication (cyrus does it by hardlinking of same content messages now) later Deduplication beyond what Cyrus does is probably not worth it. In most cases about 10% of an e-mail in text form is going to be duplicated if it's not a copy of an existing message, and that 10% is generally spread throughout the file (stuff like MIME headers and such), so you would probably see near zero space savings for doing anything beyond what Cyrus does while using an insanely larger amount of resources. 
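[Editor's sketch, not part of the thread: the point made in this thread, that a header added at delivery shifts all byte offsets so fixed-offset block deduplication finds nothing, is easy to demonstrate outside btrfs. The files and address below are made up; the two "mails" differ only by one prepended "Delivered-To:" line, yet no 4 KiB-aligned block matches.]

```shell
# Two "mails" with identical bodies; the second gained one header line
# on delivery, shifting every byte after it.
body=$(mktemp); a=$(mktemp); b=$(mktemp)
seq 1 400 | sed 's/^/X-Filler: line number /' > "$body"   # ~10 KiB
cat "$body" > "$a"
{ echo "Delivered-To: someone@example.com"; cat "$body"; } > "$b"

# Compare the files 4 KiB-aligned block by block, as a naive
# fixed-offset deduplicator would.
blocks=$(( ($(wc -c < "$a") + 4095) / 4096 ))
pa=$(mktemp); pb=$(mktemp)
matches=0
i=0
while [ "$i" -lt "$blocks" ]; do
    dd if="$a" of="$pa" bs=4096 skip="$i" count=1 2>/dev/null
    dd if="$b" of="$pb" bs=4096 skip="$i" count=1 2>/dev/null
    cmp -s "$pa" "$pb" && matches=$((matches + 1))
    i=$((i + 1))
done
echo "identical aligned 4 KiB blocks: $matches of $blocks"
rm -f "$a" "$b" "$body" "$pa" "$pb"
```

Block-oriented dedup tools on btrfs face the same alignment problem, which is why shifted near-duplicates are only found by content-defined chunking approaches.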
- snapshots for history Make sure you use a sane exponential thinning system. Once you get past about 300 snapshots, you'll start seeing some serious performance issues, and even double digits might hurt performance at the scale you're talking about. - send/receive for offisite backup This is up to you, but I would probably not use send-receive for off-site backups. Unless you're using reflinking, you can copy all the same attributes that send-receive does using almost any other backup tool, and other tools often have much better security built-in. Send streams also don't compress very well in my experience, so using send-receive has a tendency to require more network resources. - what about data inlining, should it be turned off? Generally no, and especially if you handle lots of small e-mails. Metadata blocks need to be looked up to open and read files anyway, in-lining the data means that you don't need to read in any more blocks for files small enough to fit in the spare space in the metadata block or when you only need to read the first few kilobytes of the file (and if Cyrus' IMAP/POP server works anything like most others I've seen, it will be parsing those first few KB because that's where the headers it indexes are). Our Cyrus pool consist of ~520GB of data in ~2,5million files, ~2000 mailboxes. We have message size limit of ~25MB, so emails are not bigger than that. There are however bigger files, these are per mailbox caches/index files of cyrus (some of them are around 300MB) - and these are also files which are most modified. I would mark these files NOCOW for performance reasons (and because if they're just caches and indexes, they should be pretty simple to regenerate). Rest of files (messages) are usualy just writen once. --- I started using btrfs on backup server as a storage for 4 backuppc run in containers (backups are then send away with btrbk), year ago. 
After switching off data inlining I'm satisfied, everything works (send/receive is sometimes slow, but I guess it's because of sata disks on the receive side).
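[Editor's sketch, not part of the thread: the NOCOW suggestion above for the frequently rewritten cache/index files looks like this in practice; /var/spool/cyrus is a placeholder path.]

```shell
# The C (NOCOW) attribute only takes effect on empty files, so set it on
# the directory: files created inside it afterwards inherit it.
chattr +C /var/spool/cyrus/mailboxes
lsattr -d /var/spool/cyrus/mailboxes   # the 'C' flag should be listed
```

Note that NOCOW files also lose checksumming and compression, which is usually an acceptable trade for caches and indexes that can be regenerated.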
Re: BTRFS for OLTP Databases
On 2017-02-07 22:35, Kai Krakow wrote: [...]
>> Atomicity can be a relative term. If the snapshot atomicity is
>> relative to barriers but not relative to individual writes between
>> barriers then AFAICT it's fine because the filesystem doesn't make
>> any promise it won't keep even in the context of its snapshots.
>> Consider a power loss: the filesystem's atomicity guarantees can't go
>> beyond what the hardware guarantees, which means not all current
>> in-flight writes will reach the disk and partial writes can happen.
>> Modern filesystems will remain consistent though, and if an
>> application using them makes use of f*sync it can provide its own
>> guarantees too. The same should apply to snapshots: all the in-flight
>> writes can complete or not on disk before the snapshot; what matters
>> is that both the snapshot and these writes will be completed after
>> the next barrier (and any robust application will ignore all the
>> in-flight writes it finds in the snapshot if they were part of a
>> batch that should be atomically committed).
>>
>> This is why AFAIK PostgreSQL or MySQL with their default ACID
>> compliant configuration will recover from a BTRFS snapshot in the
>> same way they recover from a power loss.
>
> This is what I meant in my other reply. But this is also why it should
> be documented. Wrongly implying that snapshots are single point in time
> snapshots is a wrong assumption with possibly horrible side effects one
> wouldn't expect.

I don't understand what you are saying. Until now, my understanding was that "all the writes which were passed to btrfs before the snapshot time are in the snapshot. The ones after are not". Am I wrong? Which are the other possible interpretations?

[..]
--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
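[Editor's sketch, not part of the thread: the recovery argument above, that a crash-consistent snapshot looks like a power loss to an ACID database, means a plain read-only snapshot is already a usable backup source; /var/lib/mysql is a placeholder subvolume path.]

```shell
# Take a read-only snapshot of the running database's subvolume...
btrfs subvolume snapshot -r /var/lib/mysql /var/lib/mysql-snap

# ...copy it off-host with any backup tool, then drop it.
btrfs subvolume delete /var/lib/mysql-snap
```

On restore, the database replays its journal/redo log exactly as it would after a power failure, provided it was configured for durable (fsync-on-commit) operation.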
Re: raid1: cannot add disk to replace faulty because can only mount fs as read-only.
On 2017-02-08 08:46, Tomasz Torcz wrote:
> On Wed, Feb 08, 2017 at 07:50:22AM -0500, Austin S. Hemmelgarn wrote:
>> It is exponentially safer in BTRFS to run single data single metadata
>> than half raid1 data half raid1 metadata.
> Why?
>> To convert to profiles _designed_ for a single device and then convert
>> back to raid1 when I got another disk. The issue you've stumbled across
>> is only partial motivation for this, the bigger motivation is that
>> running half a 2 disk array is more risky than running a single disk by
>> itself.
> Again, why? What's the difference? What causes increased risk?
Aside from bugs like the one that sparked this thread, that is? Just off the top of my head:
* You're running with half a System chunk. This is _very_ risky because almost any error in the system chunk runs the risk of nuking entire files and possibly the whole filesystem. This is part of the reason that I explicitly listed -mconvert=dup instead of -mconvert=single.
* It performs significantly better. As odd as this sounds, this actually has an impact on safety. Better overall performance reduces the size of the windows of time during which part of the filesystem is committed. This has less impact than running a traditional filesystem on top of a traditional RAID array, but it still has some impact.
* Single device is exponentially better tested than running a degraded multi-device array. IOW, you're less likely to hit obscure bugs by running a single profile instead of half a raid1 profile.
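[Editor's sketch, not part of the thread: the conversion recommended above for a permanently degraded two-disk raid1 can be expressed as a single balance with convert filters; /mnt is a placeholder mount point.]

```shell
# Convert data to 'single' and metadata to 'dup' on the surviving device.
btrfs balance start -dconvert=single -mconvert=dup /mnt
btrfs filesystem df /mnt   # profiles should now read single / DUP
```

If the System chunk is not converted along with metadata on a given kernel, an explicit pass with -sconvert=dup (which requires the -f force flag) handles it.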
Re: understanding disk space usage
On Wed, Feb 08, 2017 at 02:46:32PM +, Peter Grandi wrote: > >> My system is or seems to be running out of disk space but I > >> can't find out how or why. [ ... ] > >> FilesystemSize Used Avail Use% Mounted on > >> /dev/sda3 28G 26G 2.1G 93% / > [ ... ] > > So from chunk level, your fs is already full. And balance > > won't success since there is no unallocated space at all. > > To add to this, 28GiB is a bit too small for Btrfs, because at > that point chunk size is 1GiB. I have the habit of sizing > partitions to an exact number of GiB, and that means that most > of 1GiB will never be used by Btrfs because there is a small > amount of space allocated that is smaller than 1GiB and thus > there will be eventually just less than 1GiB unallocated. Not true -- the last chunk can be smaller than 1 GiB, to use the available space completely. Hugo. > Unfortunately the chunk size is not manually settable. > > Example here from 'btrfs fi usage': > > Overall: > Device size: 88.00GiB > Device allocated: 86.06GiB > Device unallocated:1.94GiB > Device missing: 0.00B > Used: 80.11GiB > Free (estimated): 6.26GiB (min: 5.30GiB) > > That means that I should 'btrfs balance' now, because of the > 1.94GiB "unallocated", 0.94GiB will never be allocated, and that > leaves just 1GiB "unallocated" which is the minimum for running > 'btrfs balance'. I have just done so and this is the result: > > Overall: > Device size: 88.00GiB > Device allocated: 82.03GiB > Device unallocated:5.97GiB > Device missing: 0.00B > Used: 80.11GiB > Free (estimated): 6.26GiB (min: 3.28GiB) > > At some point I had decided to use 'mixedbg' allocation to > reduce this problem and hopefully improve locality, but that > means that metadata and data need to have the same profile, and > I really want metadata to be 'dup' because of checksumming, > and I don't want data to be 'dup' too. > > > [ ... 
] To proceed, add a larger device to current fs, and do
> > a balance or just delete the 28G partition then btrfs will
> > handle the rest well.
>
> Usually for this I use a USB stick, with a 1-3GiB partition plus
> a bit extra because of that extra bit of space.
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#How_much_free_space_do_I_have.3F
> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_Btrfs_claims_I.27m_out_of_space.2C_but_it_looks_like_I_should_have_lots_left.21
> marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
>
> Unfortunately if it is a single device volume and metadata is
> 'dup' to remove the extra temporary device one has first to
> convert the metadata to 'single' and then back to 'dup' after
> removal.
>
> There are also some additional reasons why space used (rather
> than allocated) may be larger than expected, in special but not
> wholly infrequent cases. My impression is that the Btrfs design
> trades space for performance and reliability.

--
Hugo Mills             | Alert status chocolate viridian: Authorised
hugo@... carfax.org.uk | personnel only. Dogs must be carried on escalator.
http://carfax.org.uk/  | PGP: E2AB1DE4 |
Re: [PATCH v2] Btrfs: create a helper to create em for IO
On Tue, Jan 31, 2017 at 07:50:22AM -0800, Liu Bo wrote:
> We have similar codes to create and insert extent mapping around IO path,
> this merges them into a single helper.

Looks good, comments below.

> +static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
> +				       u64 orig_start, u64 block_start,
> +				       u64 block_len, u64 orig_block_len,
> +				       u64 ram_bytes, int compress_type,
> +				       int type);
>
>  static int btrfs_dirty_inode(struct inode *inode);
>
> @@ -690,7 +690,6 @@ static noinline void submit_compressed_extents(struct inode *inode,
>  	struct btrfs_key ins;
>  	struct extent_map *em;
>  	struct btrfs_root *root = BTRFS_I(inode)->root;
> -	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
>  	struct extent_io_tree *io_tree;
>  	int ret = 0;
>
> @@ -778,46 +777,19 @@ static noinline void submit_compressed_extents(struct inode *inode,
>  	 * here we're doing allocation and writeback of the
>  	 * compressed pages
>  	 */
> -	btrfs_drop_extent_cache(inode, async_extent->start,
> -				async_extent->start +
> -				async_extent->ram_size - 1, 0);
> -
> -	em = alloc_extent_map();
> -	if (!em) {
> -		ret = -ENOMEM;
> -		goto out_free_reserve;
> -	}
> -	em->start = async_extent->start;
> -	em->len = async_extent->ram_size;
> -	em->orig_start = em->start;
> -	em->mod_start = em->start;
> -	em->mod_len = em->len;
> -
> -	em->block_start = ins.objectid;
> -	em->block_len = ins.offset;
> -	em->orig_block_len = ins.offset;
> -	em->ram_bytes = async_extent->ram_size;
> -	em->bdev = fs_info->fs_devices->latest_bdev;
> -	em->compress_type = async_extent->compress_type;
> -	set_bit(EXTENT_FLAG_PINNED, &em->flags);
> -	set_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
> -	em->generation = -1;
> -
> -	while (1) {
> -		write_lock(&em_tree->lock);
> -		ret = add_extent_mapping(em_tree, em, 1);
> -		write_unlock(&em_tree->lock);
> -		if (ret != -EEXIST) {
> -			free_extent_map(em);
> -			break;
> -		}
> -		btrfs_drop_extent_cache(inode, async_extent->start,
> -					async_extent->start +
> -					async_extent->ram_size - 1,
> -					0);
> -	}
> -
> -	if (ret)
> +	em = create_io_em(inode, async_extent->start,
> +			  async_extent->ram_size, /* len */
> +			  async_extent->start, /* orig_start */
> +			  ins.objectid, /* block_start */
> +			  ins.offset, /* block_len */
> +			  ins.offset, /* orig_block_len */
> +			  async_extent->ram_size, /* ram_bytes */
> +			  async_extent->compress_type,
> +			  BTRFS_ORDERED_COMPRESSED);
> +	if (IS_ERR(em))
> +		/* ret value is not necessary due to void function */
>  		goto out_free_reserve;
> +	free_extent_map(em);
>
>  	ret = btrfs_add_ordered_extent_compress(inode,
>  					async_extent->start,
> @@ -952,7 +924,6 @@ static noinline int cow_file_range(struct inode *inode,
>  	u64 blocksize = fs_info->sectorsize;
>  	struct btrfs_key ins;
>  	struct extent_map *em;
> -	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
>  	int ret = 0;
>
>  	if (btrfs_is_free_space_inode(inode)) {
> @@ -1008,39 +979,18 @@ static noinline int cow_file_range(struct inode *inode,
>  	if (ret < 0)
>  		goto out_unlock;
>
> -	em = alloc_extent_map();
> -	if (!em) {
> -		ret = -ENOMEM;
> -		goto out_reserve;
> -	}
> -	em->start = start;
> -	em->orig_start = em->start;
>  	ram_size = ins.offset;
> -	em->len = ins.offset;
> -	em->mod_start = em->start;
> -	em->mod_len = em->len;
> -
> -	em->block_start = ins.objectid;
> -	em->block_len = ins.offset;
> -	em->orig_block_len = ins.offset;
> -	em->ram_bytes = ram_size;
> -	em->bdev = fs_info->fs_devices->latest_bdev;
> -	set_bit(EXTENT_FLAG_PINNED,
BTRFS and cyrus mail server
Hello,
inspired by recent discussion on BTRFS vs. databases I wanted to ask about the suitability of BTRFS for hosting a Cyrus imap server spool. I haven't found any recent article on this topic.

I'm preparing migration of our mailserver to Debian Stretch, i.e. kernel 4.9 for now. We are using XFS for storage now. I will migrate using imapsync to the new server. Both are virtual machines running on vmware on Dell hardware. Disks are on battery-backed hw raid controllers over vmfs.

I'm considering using BTRFS, but I'm a little concerned because of reading this mailing list ;)

I'm interested in using:
- compression (emails should compress well - right?)
- maybe deduplication (cyrus does it by hardlinking of same-content messages now) later
- snapshots for history
- send/receive for offsite backup
- what about data inlining, should it be turned off?

Our Cyrus pool consists of ~520GB of data in ~2.5 million files, ~2000 mailboxes. We have a message size limit of ~25MB, so emails are not bigger than that. There are however bigger files; these are per-mailbox caches/index files of cyrus (some of them are around 300MB) - and these are also the files which are modified most. The rest of the files (messages) are usually just written once.

---
I started using btrfs on a backup server as storage for 4 backuppc instances run in containers (backups are then sent away with btrbk) a year ago. After switching off data inlining I'm satisfied, everything works (send/receive is sometimes slow, but I guess it's because of sata disks on the receive side).

Thanks for your opinions,
Libor
Re: raid1: cannot add disk to replace faulty because can only mount fs as read-only.
On Wed, Feb 08, 2017 at 07:50:22AM -0500, Austin S. Hemmelgarn wrote:
> It is exponentially safer in BTRFS to run single data single metadata
> than half raid1 data half raid1 metadata.

Why?

> To convert to profiles _designed_ for a single device and then convert back
> to raid1 when I got another disk. The issue you've stumbled across is only
> partial motivation for this, the bigger motivation is that running half a 2
> disk array is more risky than running a single disk by itself.

Again, why? What's the difference? What causes increased risk?

--
Tomasz Torcz               Only gods can safely risk perfection,
xmpp: zdzich...@chrome.pl  it's a dangerous thing for a man.  -- Alia
Re: [PATCH] Btrfs: add another missing end_page_writeback on submit_extent_page failure
On Tue, Feb 07, 2017 at 12:14:51PM -0800, Liu Bo wrote:
> > +		end_page_writeback(page);
> > +	}
> >
> >  	cur = cur + iosize;
> >  	pg_offset += iosize;
> > @@ -3767,7 +3770,8 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
> >  	epd->bio_flags = bio_flags;
> >  	if (ret) {
> >  		set_btree_ioerr(p);
> > -		end_page_writeback(p);
> > +		if (PageWriteback(p))
> > +			end_page_writeback(p);
> >  		if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
> >  			end_extent_buffer_writeback(eb);
> >  		ret = -EIO;
> >
> > ---
>
> Looks good, could you please make a comment for the if statement in your
> commit log so that others could know why we put it?

Thank you both. Please resend v2 so I can add it to the 4.11 queue.

> Since you've got a reproducer, baking it into a fstests case is also
> welcome.

AFAICS the reproducer needs a kernel patch so the memory allocation fails reliably; this is not suitable for fstests. We don't have an easy way to inject allocation failures, but some reduced steps to reproduce could be added to the changelog.
Re: [PATCH v2] btrfs: Better csum error message for data csum mismatch
On Tue, Feb 07, 2017 at 02:57:17PM +0800, Qu Wenruo wrote:
> The original csum error message only outputs inode number, offset, check
> sum and expected check sum.
>
> However no root objectid is outputted, which sometimes makes debugging
> quite painful under multi-subvolume case (including relocation).
>
> Also the checksum output is decimal, which seldom makes sense for
> users/developers and is hard to read in most time.
>
> This patch will add root objectid, which will be %lld for rootid larger
> than LAST_FREE_OBJECTID, and hex csum output for better readability.

Ok for the change.

> +	"csum failed root %lld ino %lld off %llu csum 0x%08x expected csum 0x%08x",
> +	"csum failed root %llu ino %llu off %llu csum 0x%08x expected csum 0x%08x",
> -	"csum failed ino %llu extent %llu csum %u wanted %u mirror %d",

so the new code does not print mirror number, I think this still makes sense in cases where we know it. Please extend the helper and callchain that leads to the new print functions so we see the mirror as well.

btrfs_readpage_end_io_hook
  __readpage_endio_check
    (print the csum failed message)
Re: [PATCH 08/24] btrfs: Convert to separately allocated bdi
On Thu, Feb 02, 2017 at 06:34:06PM +0100, Jan Kara wrote:
> Allocate struct backing_dev_info separately instead of embedding it
> inside superblock. This unifies handling of bdi among users.
>
> CC: Chris Mason
> CC: Josef Bacik
> CC: David Sterba
> CC: linux-btrfs@vger.kernel.org
> Signed-off-by: Jan Kara

Reviewed-by: David Sterba
Re: [PATCH] btrfs-progs: Remove unused function arg in delete_extent_records
On Fri, Feb 03, 2017 at 10:15:32AM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues
>
> new_len is not used in delete_extent_records().
>
> Signed-off-by: Goldwyn Rodrigues

Applied, thanks.
Re: [PATCH] btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls
On Mon, Feb 06, 2017 at 07:39:09PM -0500, Jeff Mahoney wrote:
> Commit 4c63c2454ef incorrectly assumed that returning -ENOIOCTLCMD would
> cause the native ioctl to be called. The ->compat_ioctl callback is
> expected to handle all ioctls, not just compat variants. As a result,
> when using 32-bit userspace on 64-bit kernels, everything except those
> three ioctls would return -ENOTTY.
>
> Fixes: 4c63c2454ef ("btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Jeff Mahoney

Reviewed-by: David Sterba
[PULL] Fix ioctls on 32bit/64bit userspace/kernel, for 4.10
Hi,

could you please merge this single-patch pull request, for 4.10 still? There are quite a few patches on top of v4.10-rc7, so this IMHO does not look too bad even late in the release cycle. Though it's a fix for an uncommon usecase of 32bit userspace on 64bit kernel, it fixes basic operation of the ioctls. Thanks.

The following changes since commit 57b59ed2e5b91e958843609c7884794e29e6c4cb:

  Btrfs: remove ->{get, set}_acl() from btrfs_dir_ro_inode_operations (2017-01-26 15:48:56 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git fixes-4.10

for you to fetch changes up to 2a362249187a8d0f6d942d6e1d763d150a296f47:

  btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls (2017-02-08 17:47:30 +0100)

Jeff Mahoney (1):
      btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls

 fs/btrfs/ioctl.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
Re: understanding disk space usage
>> My system is or seems to be running out of disk space but I
>> can't find out how or why. [ ... ]
>> Filesystem  Size  Used Avail Use% Mounted on
>> /dev/sda3    28G   26G  2.1G  93% /
[ ... ]
> So from chunk level, your fs is already full. And balance
> won't succeed since there is no unallocated space at all.

To add to this, 28GiB is a bit too small for Btrfs, because at that point chunk size is 1GiB. I have the habit of sizing partitions to an exact number of GiB, and that means that most of 1GiB will never be used by Btrfs because there is a small amount of space allocated that is smaller than 1GiB and thus there will be eventually just less than 1GiB unallocated. Unfortunately the chunk size is not manually settable.

Example here from 'btrfs fi usage':

Overall:
    Device size:          88.00GiB
    Device allocated:     86.06GiB
    Device unallocated:    1.94GiB
    Device missing:          0.00B
    Used:                 80.11GiB
    Free (estimated):      6.26GiB  (min: 5.30GiB)

That means that I should 'btrfs balance' now, because of the 1.94GiB "unallocated", 0.94GiB will never be allocated, and that leaves just 1GiB "unallocated" which is the minimum for running 'btrfs balance'. I have just done so and this is the result:

Overall:
    Device size:          88.00GiB
    Device allocated:     82.03GiB
    Device unallocated:    5.97GiB
    Device missing:          0.00B
    Used:                 80.11GiB
    Free (estimated):      6.26GiB  (min: 3.28GiB)

At some point I had decided to use 'mixedbg' allocation to reduce this problem and hopefully improve locality, but that means that metadata and data need to have the same profile, and I really want metadata to be 'dup' because of checksumming, and I don't want data to be 'dup' too.

> [ ... ] To proceed, add a larger device to current fs, and do
> a balance or just delete the 28G partition then btrfs will
> handle the rest well.

Usually for this I use a USB stick, with a 1-3GiB partition plus a bit extra because of that extra bit of space.
https://btrfs.wiki.kernel.org/index.php/FAQ#How_much_free_space_do_I_have.3F
https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_Btrfs_claims_I.27m_out_of_space.2C_but_it_looks_like_I_should_have_lots_left.21
marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html

Unfortunately if it is a single device volume and metadata is 'dup' to remove the extra temporary device one has first to convert the metadata to 'single' and then back to 'dup' after removal.

There are also some additional reasons why space used (rather than allocated) may be larger than expected, in special but not wholly infrequent cases. My impression is that the Btrfs design trades space for performance and reliability.
partial quota rescan
I'm trying to use qgroups to keep track of storage occupied by snapshots. I noticed that:

a) no two rescans can run in parallel, and there's no way to schedule another rescan while one is running;
b) it seems to be a whole-disk operation regardless of the path specified on the CLI.

I have only just started to fill my new 24TB btrfs volume using qgroups, but rescans already take a long time, and due to (a) above my scripts each time have to wait for the previous rescan to finish. Can anything be done about it, like trashing and recomputing only the statistics for a specific qgroup?

Linux host 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
btrfs-progs v4.4

--
With Best Regards,
Marat Khalili
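Until something like a per-qgroup rescan exists, the serialization in (a) can at least be scripted around: `btrfs quota rescan -s <path>` reports whether a rescan is running, and `-w` starts one and waits for it to finish. A minimal sketch; the status text grepped for is an assumption, so check it against what your btrfs-progs version actually prints:

```shell
#!/bin/sh
# Serialize qgroup rescans: poll until no rescan is in progress,
# then start a fresh one and block until it completes.
rescan_sync() {
    mnt=$1
    # assumed status wording; adjust to the local 'rescan -s' output
    while btrfs quota rescan -s "$mnt" | grep -q "operation running"; do
        sleep 30
    done
    # -w starts a rescan and waits for completion
    btrfs quota rescan -w "$mnt"
}

# Usage: rescan_sync /mnt/volume && btrfs qgroup show /mnt/volume
```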
Re: BTRFS for OLTP Databases
On 2017-02-08 at 13:14, Martin Raiber wrote:
> Hi,
>
> On 08.02.2017 03:11 Peter Zaitsev wrote:
>> Out of curiosity, I see one problem here:
>> If you're doing snapshots of the live database, each snapshot leaves
>> the database files like killing the database in-flight. Like shutting
>> the system down in the middle of writing data.
>>
>> This is because I think there's no API for user space to subscribe to
>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>> service) in Windows. You should put the database into frozen state to
>> prepare it for a hotcopy before creating the snapshot, then ensure all
>> data is flushed before continuing.
>>
>> I think I've read that btrfs snapshots do not guarantee single
>> point-in-time snapshots - the snapshot may be smeared across a longer
>> period of time while the kernel is still writing data. So parts of
>> your writes may still end up in the snapshot after issuing the
>> snapshot command, instead of in the working copy as expected.
>>
>> How is this going to be addressed? Is there some snapshot-aware API to
>> let user space subscribe to such events and do proper preparation? Is
>> this planned? LVM could be a user of such an API, too. I think this
>> could have nice enterprise-grade value for Linux.
>>
>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots.
>> But still, this also needs to be integrated with MySQL to properly
>> work. I once (years ago) researched this but gave up on my plans when
>> I planned database backups for our web server infrastructure. We moved
>> to creating SQL dumps instead, although there are binlogs which can be
>> used to recover to a clean and stable transactional state after taking
>> snapshots. But I simply didn't want to fiddle around with properly
>> cleaning up binlogs, which accumulate horribly much space usage over
>> time. The cleanup process requires creating a cold copy or dump of the
>> complete database from time to time; only then is it safe to remove
>> all binlogs up to that point in time.
>
> A little bit off topic, but I for one would be on board with such an
> effort. It "just" needs coordination between the backup
> software/snapshot tools, the backed-up software and the various
> snapshot providers. If you look at the Windows VSS API, this would be a
> relatively large undertaking if all the corner cases are taken into
> account, like e.g. a database having the database log on a separate
> volume from the data, dependencies between different components, etc.
>
> You'll know more about this, but databases usually fsync quite often in
> their default configuration, so btrfs snapshots shouldn't be much
> behind the properly snapshotted state, so I see the advantages more
> with usability and taking care of corner cases automatically.
>
> Regards,
> Martin Raiber

xfs_freeze works also for BTRFS...

--
Adrian Brzeziński
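The filesystem-agnostic counterpart that works on btrfs is fsfreeze(8), which issues the same FIFREEZE/FITHAW ioctls. A minimal sketch pairing it with a block-level snapshot taken underneath the filesystem (mount point and LV names are placeholders; this is for snapshots below the fs, such as LVM, not for btrfs's own snapshot command):

```shell
#!/bin/sh
# Quiesce a mounted filesystem, snapshot the underlying block device,
# then thaw. fsfreeze works on btrfs as well as XFS/ext4.
freeze_snapshot() {
    mnt=$1; lv=$2
    fsfreeze --freeze "$mnt" || return 1
    # take the lower-level snapshot while the fs is quiescent
    lvcreate --snapshot --name dbsnap --size 2G "$lv"
    rc=$?
    # always thaw, even if the snapshot failed
    fsfreeze --unfreeze "$mnt"
    return $rc
}

# Usage: freeze_snapshot /srv/db vg0/dblv
```

Keeping the thaw unconditional matters: a frozen filesystem blocks every writer until FITHAW is issued.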
Re: BTRFS for OLTP Databases
Hi,

When it comes to MySQL I'm not really sure what you're trying to achieve, because MySQL manages its own cache; flushing the OS cache to disk and "freezing" the FS does not really do much - it will still need to do crash recovery when such a snapshot is restored.

The reason people would use xfs_freeze with MySQL is when the database is spread across different filesystems - typically log files placed on a different partition than the data, or databases placed on different partitions. In this case you need a consistent single-point-in-time snapshot across the filesystems for the backup to be recoverable.

The more common approach, though, is to keep it KISS and have everything on a single filesystem.

On Wed, Feb 8, 2017 at 8:26 AM, Martin Raiber wrote:
> On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
>> On 2017-02-08 07:14, Martin Raiber wrote:
>>> Hi,
>>>
>>> On 08.02.2017 03:11 Peter Zaitsev wrote:
>>>> Out of curiosity, I see one problem here:
>>>> If you're doing snapshots of the live database, each snapshot leaves
>>>> the database files like killing the database in-flight. [...]
>>>
>>> little bit off topic, but I for one would be on board with such an
>>> effort. It "just" needs coordination between the backup
>>> software/snapshot tools, the backed up software and the various
>>> snapshot providers. If you look at the Windows VSS API, this would be
>>> a relatively large undertaking if all the corner cases are taken into
>>> account, like e.g. a database having the database log on a separate
>>> volume from the data, dependencies between different components etc.
>>>
>>> You'll know more about this, but databases usually fsync quite often
>>> in their default configuration, so btrfs snapshots shouldn't be much
>>> behind the properly snapshotted state, so I see the advantages more
>>> with usability and taking care of corner cases automatically.
>> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
>> reflinking to userspace, and therefore it's fully possible to
>> implement this in userspace. Having a version of the fsfreeze (the
>> generic form of xfs_freeze) stuff that worked on individual sub-trees
>> would be nice from a practical perspective, but implementing it would
>> not be easy by any means, and would be essentially necessary for a
>> VSS-like API. In the meantime though, it is fully possible for the
>> application software to implement this itself without needing anything
>> more from the kernel.
>
> VSS snapshots whole volumes, not individual files (so comparable to an
> LVM snapshot). The sub-folder freeze would be something useful in some
> situations, but duplicating the files+extents might also take too long
> in a lot of situations. You are correct that the kernel features are
> there and what is missing is a user-space daemon, plus a protocol that
> facilitates/coordinates the backups/snapshots.
>
> Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does
> not really help in some situations as e.g. MySQL InnoDB uses O_DIRECT
> and manages its own buffer pool which won't get the FIFREEZE and
> flush, but as said, the
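The O_DIRECT point above is why filesystem-level freezing alone cannot quiesce InnoDB; the coordination has to happen at the database layer. A minimal sketch using FLUSH TABLES WITH READ LOCK, which is only held while the issuing client session stays open - hence the snapshot is taken from inside that same session via the mysql client's `system` command (paths are placeholders; this is an illustration, not a hardened backup tool):

```shell
#!/bin/sh
# Hold MySQL's global read lock across a btrfs snapshot. The lock is
# released when the client session ends, so the snapshot command must
# run from within the same session.
snapshot_mysql() {
    src=$1; dst=$2
    mysql -u root <<EOF
FLUSH TABLES WITH READ LOCK;
system btrfs subvolume snapshot -r $src $dst
UNLOCK TABLES;
EOF
}

# Usage: snapshot_mysql /srv/mysql /srv/mysql-snap
```

Even so, the snapshot is only crash-consistent for InnoDB; restoring it still goes through normal crash recovery, as noted above.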
Re: BTRFS for OLTP Databases
On 2017-02-08 at 14:32, Austin S. Hemmelgarn wrote:
> On 2017-02-08 08:26, Martin Raiber wrote:
>> On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
>>> On 2017-02-08 07:14, Martin Raiber wrote:
>>>> [...]
>>>
>>> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
>>> reflinking to userspace, and therefore it's fully possible to
>>> implement this in userspace. Having a version of the fsfreeze (the
>>> generic form of xfs_freeze) stuff that worked on individual sub-trees
>>> would be nice from a practical perspective, but implementing it would
>>> not be easy by any means, and would be essentially necessary for a
>>> VSS-like API. In the meantime though, it is fully possible for the
>>> application software to implement this itself without needing
>>> anything more from the kernel.
>>
>> VSS snapshots whole volumes, not individual files (so comparable to an
>> LVM snapshot). The sub-folder freeze would be something useful in some
>> situations, but duplicating the files+extents might also take too long
>> in a lot of situations. You are correct that the kernel features are
>> there and what is missing is a user-space daemon, plus a protocol that
>> facilitates/coordinates the backups/snapshots.
>>
>> Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does
>> not really help in some situations as e.g. MySQL InnoDB uses O_DIRECT
>> and manages its own buffer pool which won't get the FIFREEZE and
>> flush, but as said, the default configuration is to flush/fsync on
>> every commit.
> OK, there's part of the misunderstanding. You can't FIFREEZE a BTRFS
> filesystem and then take a snapshot in it, because the snapshot
> requires writing to the filesystem (which the FIFREEZE would prevent,
> so a script that tried to do this would deadlock). A new version of
> the FIFREEZE ioctl would be needed that operates on subvolumes.

You can also put your filesystem on LVM, and take LVM snapshots.

--
Adrian Brzeziński
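For completeness, the btrfs-native path needs no freeze at all (and, per the deadlock point above, must not be combined with FIFREEZE): a read-only snapshot is created within a single btrfs transaction. A minimal sketch; paths are placeholders:

```shell
#!/bin/sh
# Take a read-only btrfs snapshot of a subvolume and make sure it
# reaches stable storage.
snap_db() {
    src=$1; dst=$2
    # -r makes the snapshot read-only (also what btrfs send expects)
    btrfs subvolume snapshot -r "$src" "$dst" || return 1
    # force a transaction commit so the new snapshot is on disk
    btrfs filesystem sync "$src"
}

# Usage: snap_db /srv/db /srv/db-snap
```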
Re: [PATCH] btrfs: qgroup: Move half of the qgroup accounting time out of commit trans
On Wed, Feb 8, 2017 at 1:56 AM, Qu Wenruo wrote:
> Just as Filipe pointed out, the most time consuming part of qgroup is
> btrfs_qgroup_account_extents() and
> btrfs_qgroup_prepare_account_extents().

There's an "and", so the "is" should be "are" and "part" should be "parts".

> Which both call btrfs_find_all_roots() to get old_roots and new_roots
> ulist.
>
> However for old_roots, we don't really need to calculate it at
> transaction commit time.
>
> This patch moves the old_roots accounting part out of
> commit_transaction(), so at least we won't block transaction too long.

Doing stuff inside btrfs_commit_transaction() is only bad if it's within
the critical section, that is, after setting the transaction's state to
TRANS_STATE_COMMIT_DOING and before setting the state to
TRANS_STATE_UNBLOCKED. This should be explained somehow in the
changelog.

>
> But please note that, this won't speed up qgroup overall, it just moves
> half of the cost out of commit_transaction().
>
> Cc: Filipe Manana
> Signed-off-by: Qu Wenruo
> ---
>  fs/btrfs/delayed-ref.c | 20
>  fs/btrfs/qgroup.c      | 33 ++---
>  fs/btrfs/qgroup.h      | 14 ++
>  3 files changed, 60 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index ef724a5..0ee927e 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -550,13 +550,14 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>                  struct btrfs_delayed_ref_node *ref,
>                  struct btrfs_qgroup_extent_record *qrecord,
>                  u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
> -                int action, int is_data)
> +                int action, int is_data, int *qrecord_inserted_ret)
>  {
>         struct btrfs_delayed_ref_head *existing;
>         struct btrfs_delayed_ref_head *head_ref = NULL;
>         struct btrfs_delayed_ref_root *delayed_refs;
>         int count_mod = 1;
>         int must_insert_reserved = 0;
> +       int qrecord_inserted = 0;
>
>         /* If reserved is provided, it must be a data extent. */
>         BUG_ON(!is_data && reserved);
> @@ -623,6 +624,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>                 if (btrfs_qgroup_trace_extent_nolock(fs_info,
>                                         delayed_refs, qrecord))
>                         kfree(qrecord);
> +               else
> +                       qrecord_inserted = 1;
>         }
>
>         spin_lock_init(&head_ref->lock);
> @@ -650,6 +653,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>                 atomic_inc(&delayed_refs->num_entries);
>                 trans->delayed_ref_updates++;
>         }
> +       if (qrecord_inserted_ret)
> +               *qrecord_inserted_ret = qrecord_inserted;
>         return head_ref;
>  }
>
> @@ -779,6 +784,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
>         struct btrfs_delayed_ref_head *head_ref;
>         struct btrfs_delayed_ref_root *delayed_refs;
>         struct btrfs_qgroup_extent_record *record = NULL;
> +       int qrecord_inserted;
>
>         BUG_ON(extent_op && extent_op->is_data);
>         ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
> @@ -806,12 +812,15 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
>          * the spin lock
>          */
>         head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
> -                                       bytenr, num_bytes, 0, 0, action, 0);
> +                                       bytenr, num_bytes, 0, 0, action, 0,
> +                                       &qrecord_inserted);
>
>         add_delayed_tree_ref(fs_info, trans, head_ref, &ref->node, bytenr,
>                              num_bytes, parent, ref_root, level, action);
>         spin_unlock(&delayed_refs->lock);
>
> +       if (qrecord_inserted)
> +               return btrfs_qgroup_trace_extent_post(fs_info, record);
>         return 0;
>
> free_head_ref:
> @@ -836,6 +845,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
>         struct btrfs_delayed_ref_head *head_ref;
>         struct btrfs_delayed_ref_root *delayed_refs;
>         struct btrfs_qgroup_extent_record *record = NULL;
> +       int qrecord_inserted;
>
>         BUG_ON(extent_op && !extent_op->is_data);
>         ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
> @@ -870,13 +880,15 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
>          */
>         head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
>                                         bytenr, num_bytes, ref_root, reserved,
> -                                       action, 1);
> +                                       action, 1, &qrecord_inserted);
>
>         add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
>
Re: Very slow balance / btrfs-transaction
On Wed, Feb 8, 2017 at 12:39 AM, Qu Wenruo wrote:
> At 02/07/2017 11:55 PM, Filipe Manana wrote:
>> On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo wrote:
>>> At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:
>>>> Hi Qu,
>>>>
>>>> On 02/05/2017 07:45 PM, Qu Wenruo wrote:
>>>>> At 02/04/2017 09:47 AM, Jorg Bornschein wrote:
>>>>>> February 4, 2017 1:07 AM, "Goldwyn Rodrigues" wrote:
>>>>>>
>>>>>> Quota support was indeed active -- and it warned me that the
>>>>>> qgroup data was inconsistent.
>>>>>>
>>>>>> Disabling quotas had an immediate impact on balance throughput --
>>>>>> it's *much* faster now! From a quick glance at iostat I would
>>>>>> guess it's at least a factor 100 faster.
>>>>>>
>>>>>> Should quota support generally be disabled during balances? Or did
>>>>>> I somehow push my fs into a weird state where it triggered a
>>>>>> slow-path?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>    j
>>>>>
>>>>> Would you please provide the kernel version?
>>>>>
>>>>> v4.9 introduced a bad fix for qgroup balance, which doesn't
>>>>> completely fix qgroup bytes leaking, but also hugely slows down the
>>>>> balance process:
>>>>>
>>>>> commit 62b99540a1d91e46422f0e04de50fc723812c421
>>>>> Author: Qu Wenruo
>>>>> Date:   Mon Aug 15 10:36:51 2016 +0800
>>>>>
>>>>>     btrfs: relocation: Fix leaking qgroups numbers on data extents
>>>>>
>>>>> Sorry for that.
>>>>>
>>>>> And in v4.10, a better method is applied to fix the byte leaking
>>>>> problem, and it should be a little faster than the previous one.
>>>>>
>>>>> commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
>>>>> Author: Qu Wenruo
>>>>> Date:   Tue Oct 18 09:31:29 2016 +0800
>>>>>
>>>>>     btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
>>>>>
>>>>> However, using balance with qgroup is still slower than balance
>>>>> without qgroup; the root fix needs us to rework current backref
>>>>> iteration.
>>>>
>>>> This patch has made the btrfs balance performance worse. The balance
>>>> task has become more CPU intensive compared to earlier and takes
>>>> longer to complete, besides hogging resources. While correctness is
>>>> important, we need to figure out how this can be made more
>>>> efficient.
>>>
>>> The cause is already known.
>>>
>>> It's find_parent_node() which takes most of the time to find all
>>> referencers of an extent.
>>>
>>> And it's also the cause of the FIEMAP softlockup (fixed in a recent
>>> release by quitting early).
>>>
>>> The biggest problem is, current find_parent_node() uses a list to
>>> iterate, which is quite slow, especially as it's done in a loop.
>>> In the real world find_parent_node() is about O(n^3).
>>> We can either improve find_parent_node() by using an rb_tree, or
>>> introduce some cache for find_parent_node().
>>
>> Even if anyone is able to reduce that function's complexity from
>> O(n^3) down to, let's say, O(n^2) or O(n log n) for example, the
>> current implementation of qgroups will always be a problem. The real
>> problem is that this more recent rework of qgroups does all this
>> accounting inside the critical section of a transaction - blocking any
>> other tasks that want to start a new transaction or attempt to join
>> the current transaction. Not to mention that on systems with small
>> amounts of memory (2Gb or 4Gb from what I've seen from user reports)
>> we also OOM due to this allocation of struct
>> btrfs_qgroup_extent_record per delayed data reference head, which is
>> used for that accounting phase in the critical section of a
>> transaction commit.
>>
>> Let's face it and be realistic: even if someone manages to make
>> find_parent_node() much, much better, like O(n) for example, it will
>> always be a problem due to the reasons mentioned before. Many extents
>> touched per transaction and many subvolumes/snapshots will always
>> expose that root problem - doing the accounting in the transaction
>> commit critical section.
>
> You must accept the fact that we must call find_parent_node() at least
> twice to get correct owner modification for each touched extent.
> Or the qgroup numbers will never be correct.
>
> One for old_roots by searching the commit root, and one for new_roots
> by searching the current root.
>
> You can call find_parent_node() as many times as you like, but that's
> just wasting your CPU time.
>
> Only the final find_parent_node() will determine new_roots for that
> extent, and there is no better timing than commit_transaction().

You're missing my point. My point is not about needing to call
find_parent_nodes() nor how many times to call it, or whether it's
needed or not. My point is about doing expensive things inside the
critical section of a transaction commit, which leads not only to low
performance but to a system becoming unresponsive and
Re: BTRFS for OLTP Databases
On 2017-02-08 08:26, Martin Raiber wrote:
> On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
>> On 2017-02-08 07:14, Martin Raiber wrote:
>>> [...]
>>
>> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
>> reflinking to userspace, and therefore it's fully possible to
>> implement this in userspace. [...]
>
> VSS snapshots whole volumes, not individual files (so comparable to an
> LVM snapshot). The sub-folder freeze would be something useful in some
> situations, but duplicating the files+extents might also take too long
> in a lot of situations. You are correct that the kernel features are
> there and what is missing is a user-space daemon, plus a protocol that
> facilitates/coordinates the backups/snapshots.
>
> Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does
> not really help in some situations as e.g. MySQL InnoDB uses O_DIRECT
> and manages its own buffer pool which won't get the FIFREEZE and
> flush, but as said, the default configuration is to flush/fsync on
> every commit.

OK, there's part of the misunderstanding. You can't FIFREEZE a BTRFS
filesystem and then take a snapshot in it, because the snapshot
requires writing to the filesystem (which the FIFREEZE would prevent,
so a script that tried to do this would deadlock). A new version of
the FIFREEZE ioctl would be needed that operates on subvolumes.
Re: BTRFS for OLTP Databases
On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
> On 2017-02-08 07:14, Martin Raiber wrote:
>> Hi,
>>
>> On 08.02.2017 03:11 Peter Zaitsev wrote:
>>> Out of curiosity, I see one problem here:
>>> If you're doing snapshots of the live database, each snapshot leaves
>>> the database files like killing the database in-flight. [...]
>>
>> little bit off topic, but I for one would be on board with such an
>> effort. It "just" needs coordination between the backup
>> software/snapshot tools, the backed up software and the various
>> snapshot providers. If you look at the Windows VSS API, this would be
>> a relatively large undertaking if all the corner cases are taken into
>> account, like e.g. a database having the database log on a separate
>> volume from the data, dependencies between different components etc.
>>
>> You'll know more about this, but databases usually fsync quite often
>> in their default configuration, so btrfs snapshots shouldn't be much
>> behind the properly snapshotted state, so I see the advantages more
>> with usability and taking care of corner cases automatically.
> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
> reflinking to userspace, and therefore it's fully possible to
> implement this in userspace. Having a version of the fsfreeze (the
> generic form of xfs_freeze) stuff that worked on individual sub-trees
> would be nice from a practical perspective, but implementing it would
> not be easy by any means, and would be essentially necessary for a
> VSS-like API. In the meantime though, it is fully possible for the
> application software to implement this itself without needing anything
> more from the kernel.

VSS snapshots whole volumes, not individual files (so comparable to an
LVM snapshot). The sub-folder freeze would be something useful in some
situations, but duplicating the files+extents might also take too long
in a lot of situations. You are correct that the kernel features are
there and what is missing is a user-space daemon, plus a protocol that
facilitates/coordinates the backups/snapshots.

Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does
not really help in some situations as e.g. MySQL InnoDB uses O_DIRECT
and manages its own buffer pool which won't get the FIFREEZE and
flush, but as said, the default configuration is to flush/fsync on
every commit.
Re: user_subvol_rm_allowed? Is there a user_subvol_create_deny|allowed?
On 2017-02-07 20:49, Nicholas D Steeves wrote: Dear btrfs community, Please accept my apologies in advance if I missed something in recent btrfs development; my MUA tells me I'm ~1500 unread messages out of date. :/ I recently read about "mount -t btrfs -o user_subvol_rm_allowed" while reading up on LXC handling of snapshots with the btrfs backend. Is this mount option per-subvolume, or per-volume? AFAIK, it's per-volume. Also, what mechanisms exist to restrict a user's ability to create an arbitrarily large number of snapshots? Is there a user_subvol_create_deny|allowed? From what I've read about the inverse correlation between the number of subvolumes and performance, a potentially hostile user could cause an I/O denial of service or potentially even trigger an ENOSPC. Currently, there is nothing that restricts this ability. This is one of a handful of outstanding issues that I'd love to see fixed, but don't have the time, patience, or background to fix myself. From what I gather, the following will reproduce the hypothetical issue related to my question:

    # as root
    btrfs sub create /some/dir/subvol
    chown some-user /some/dir/subvol

    # as some-user
    cd /some/dir/subvol
    cp -ar --reflink=always /some/big/files ./
    COUNT=1
    while true; do
        btrfs sub snap ./ ./snapshot-$COUNT
        COUNT=$((COUNT+1))
        sleep 2   # maybe unnecessary
    done

FWIW, this will cause all kinds of other issues too. It will, however, slow down over time as a result of these issues. The two biggest are:
1. Performance for large directories is horrendous, and degrades roughly in proportion to the number of directory entries (with a small exponent near 1). Past a few thousand entries, directory operations (especially stat() and readdir()) start to take long enough for a normal person to notice the latency.
2. Overall filesystem performance with lots of snapshots is horrendous too, and degrades similarly with the number of snapshots and the total amount of data in each. This will start being an issue much sooner than 1, somewhere around 300-400 snapshots most of the time.
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
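As the reply notes, nothing caps the *number* of snapshots a user can create. The closest existing knob is qgroups, which cap space rather than snapshot count. A hedged sketch follows; the paths and the 20G limit are made-up examples, and the run() wrapper defaults to printing the commands rather than executing them, since the real ones need root and a mounted btrfs:

```shell
#!/bin/sh
# Sketch: qgroups limit the *space* a subvolume can reference;
# they do not limit how many snapshots a user may create.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run btrfs quota enable /some/dir            # turn on qgroup accounting
run btrfs qgroup limit 20G /some/dir/subvol # cap referenced space (example size)
run btrfs qgroup show /some/dir             # inspect per-qgroup usage
```

Note that qgroup accounting itself adds per-snapshot overhead, so this mitigates the ENOSPC risk more than the performance risk.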
Re: BTRFS for OLTP Databases
On 2017-02-08 07:14, Martin Raiber wrote: Hi, On 08.02.2017 03:11 Peter Zaitsev wrote: Out of curiosity, I see one problem here: If you're doing snapshots of the live database, each snapshot leaves the database files like killing the database in-flight. Like shutting the system down in the middle of writing data. This is because I think there's no API for user space to subscribe to events like a snapshot - unlike e.g. the VSS API (volume snapshot service) in Windows. You should put the database into frozen state to prepare it for a hotcopy before creating the snapshot, then ensure all data is flushed before continuing. I think I've read that btrfs snapshots do not guarantee single point in time snapshots - the snapshot may be smeared across a longer period of time while the kernel is still writing data. So parts of your writes may still end up in the snapshot after issuing the snapshot command, instead of in the working copy as expected. How is this going to be addressed? Is there some snapshot aware API to let user space subscribe to such events and do proper preparation? Is this planned? LVM could be a user of such an API, too. I think this could have nice enterprise-grade value for Linux. XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But still, also this needs to be integrated with MySQL to properly work. I once (years ago) researched on this but gave up on my plans when I planned database backups for our web server infrastructure. We moved to creating SQL dumps instead, although there're binlogs which can be used to recover to a clean and stable transactional state after taking snapshots. But I simply didn't want to fiddle around with properly cleaning up binlogs which accumulate horribly much space usage over time. The cleanup process requires to create a cold copy or dump of the complete database from time to time, only then it's safe to remove all binlogs up to that point in time. 
A little bit off topic, but I for one would be on board with such an effort. It "just" needs coordination between the backup software/snapshot tools, the backed-up software and the various snapshot providers. If you look at the Windows VSS API, this would be a relatively large undertaking if all the corner cases are taken into account, like e.g. a database having the database log on a separate volume from the data, dependencies between different components etc. You'll know more about this, but databases usually fsync quite often in their default configuration, so btrfs snapshots shouldn't be much behind the properly snapshotted state, so I see the advantages more with usability and taking care of corner cases automatically. Just my perspective, but BTRFS (and XFS, and OCFS2) already provide reflinking to userspace, and therefore it's fully possible to implement this in userspace. Having a version of the fsfreeze (the generic form of xfs_freeze) stuff that worked on individual sub-trees would be nice from a practical perspective, but implementing it would not be easy by any means, and would be essentially necessary for a VSS-like API. In the meantime though, it is fully possible for the application software to implement this itself without needing anything more from the kernel.
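For the freeze-based approach discussed above, the generic tool is fsfreeze(8), which issues the same FIFREEZE/FITHAW ioctls that xfs_freeze does. Below is a hedged sketch for snapshotting *below* the filesystem (LVM here; the device and mount names are invented). One caveat: a btrfs snapshot needs the filesystem writable, so don't freeze the same btrfs you are about to `btrfs subvolume snapshot`. The run() wrapper defaults to printing commands, since the real ones need root:

```shell
#!/bin/sh
# Sketch: quiesce a filesystem around a block-level snapshot.
# DRY_RUN=1 (default) prints commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

MNT=/var/lib/mysql          # example mount point of the database volume
run fsfreeze --freeze "$MNT"                    # flush dirty data, block writers
run lvcreate -s -n db-snap -L 5G /dev/vg0/db    # example LVM snapshot of the LV
run fsfreeze --unfreeze "$MNT"                  # resume writes
```

As the thread notes, this still leaves an O_DIRECT database's own buffer pool unflushed; application-level quiescing is needed for a truly clean copy.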
Re: [PATCH] btrfs-progs: better document btrfs receive security
On Wed, Feb 08, 2017 at 07:29:22AM -0500, Austin S. Hemmelgarn wrote: > On 2017-02-07 13:27, David Sterba wrote: > > On Fri, Feb 03, 2017 at 08:48:58AM -0500, Austin S. Hemmelgarn wrote: > >> This adds some extra documentation to the btrfs-receive manpage that > >> explains some of the security related aspects of btrfs-receive. The > >> first part covers the fact that the subvolume being received is writable > >> until the receive finishes, and the second covers the current lack of > >> sanity checking of the send stream. > >> > >> Signed-off-by: Austin S. Hemmelgarn > > > > Applied, thanks. > > > Didn't get a chance to mention this yesterday, but it looks like you > hadn't seen the updated version I sent on the third. Message ID is: > <20170203193805.96977-1-ahferro...@gmail.com> Ah sorry I missed that. > The only significant difference is that I updated the description for > the writability issue using a much better description from Graham Cobb > (with his permission of course). > > If you want, I can send an incremental patch on top of the original to > update just that description. No need to, I'll replace the patch with the latest version. Thanks.
Re: dup vs raid1 in single disk
On 2017-02-07 17:28, Kai Krakow wrote: On Thu, 19 Jan 2017 15:02:14 -0500, "Austin S. Hemmelgarn" wrote: On 2017-01-19 13:23, Roman Mamedov wrote: On Thu, 19 Jan 2017 17:39:37 +0100 "Alejandro R. Mosteo" wrote: I was wondering, from a point of view of data safety, if there is any difference between using dup or making a raid1 from two partitions in the same disk. The idea is to have some protection against the typical aging HDD that starts to develop bad sectors. RAID1 will write slower compared to DUP, as any optimization to make RAID1 devices work in parallel will cause a total performance disaster for you: you will start trying to write to both partitions at the same time, turning all linear writes into random ones, which are about two orders of magnitude slower than linear on spinning hard drives. DUP shouldn't have this issue, but it will still be twice as slow as single, since you are writing everything twice. As of right now, there will actually be near zero impact on write performance (or at least, it's way less than the theoretical 50%) because there really isn't any optimization to speak of in the multi-device code. That will hopefully change over time, but it's not likely to do so any time soon since nobody appears to be working on multi-device write performance. I think that's only true if you don't account for the seek overhead. In single-device RAID1 mode you will always seek half of the device while writing data, and even when reading between odd and even PIDs. In contrast, DUP mode doesn't guarantee your seeks will be shorter, but from a statistical point of view, on average they should be. So it should yield better performance (though I wouldn't expect it to be observable, depending on your workload). So, on devices having no seek overhead (aka SSD), it is probably true (minus bus bandwidth considerations). For HDD I'd prefer DUP. From a data safety point of view: it's more likely that adjacent and nearby sectors are bad.
So DUP imposes a higher risk of data being written to only bad sectors - which means data loss or even file system loss (if metadata hits this problem). To be realistic: I wouldn't trade space usage for duplicate data on an already failing disk, no matter if it's DUP or RAID1. HDD disk space is cheap, and using such a scenario is just a waste of performance AND space - no matter what. I don't understand the purpose of this. It just results in fake safety. Better get two separate devices half the size. There's a better chance of getting a better cost/space ratio anyway, plus better performance and safety. There's also the fact that you're writing more metadata than data most of the time unless you're dealing with really big files, and metadata is already DUP mode (unless you are using an SSD), so the performance hit isn't 50%, it's actually a bit more than half the ratio of data writes to metadata writes. On a related note, I see this caveat about dup in the manpage: "For example, a SSD drive can remap the blocks internally to a single copy thus deduplicating them. This negates the purpose of increased redunancy (sic) and just wastes space" That ability is vastly overestimated in the man page. There is no miracle content-addressable storage system working at 500 MB/sec speeds all within a little cheap controller on SSDs. Likely most of what it can do is just compress simple stuff, such as runs of zeroes or other repeating byte sequences. Most of those that do in-line compression don't implement it in firmware, they implement it in hardware, and even DEFLATE can get 500 MB/second speeds if properly implemented in hardware. The firmware may control how the hardware works, but it's usually hardware doing the heavy lifting in that case, and getting a good ASIC made that can hit the required performance point for a reasonable compression algorithm like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI work. I still think it's a myth...
The overhead of managing inline deduplication is just way too high to implement it without jumping through expensive hoops. Most workloads have almost zero deduplication potential. And even when they do, the duplicates are spaced so far apart in time that an inline deduplicator won't catch them. Just like the proposed implementation in BTRFS, it's not complete deduplication. In fact, the only devices I've ever seen that do this appear to implement it just like what was proposed for BTRFS, just with a much smaller cache. They were also insanely expensive. If it were all so easy, btrfs would already have it working in mainline. I don't even remember whether those patches are still being worked on. With this in mind, I think dup metadata is still a good thing to have even on SSD and I would always force-enable it. Agreed. Potential for deduplication is only when using snapshots (which already are deduplicated when taken) or when handling
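Force-enabling DUP metadata, as suggested above, is a one-liner either at mkfs time or later via balance. A hedged sketch with made-up device/mount names; the run() wrapper prints commands by default, since the real ones need root and a btrfs filesystem:

```shell
#!/bin/sh
# Sketch: force DUP metadata on a single-device btrfs, even on SSD.
# DRY_RUN=1 (default) prints commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run mkfs.btrfs -m dup -d single /dev/sdX    # at creation time
run btrfs balance start -mconvert=dup /mnt  # or convert an existing filesystem
```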
Re: raid1: cannot add disk to replace faulty because can only mount fs as read-only.
On 2017-02-07 22:21, Hans Deragon wrote: Greetings, On 2017-02-02 10:06, Austin S. Hemmelgarn wrote: On 2017-02-02 09:25, Adam Borowski wrote: On Thu, Feb 02, 2017 at 07:49:50AM -0500, Austin S. Hemmelgarn wrote: This is a severe bug that makes a not all that uncommon (albeit bad) use case fail completely. The fix had no dependencies itself and I don't see what's bad in mounting a RAID degraded. Yeah, it provides no redundancy but that's no worse than using a single disk from the start. And most people not doing storage/server farm don't have a stack of spare disks at hand, so getting a replacement might take a while. Running degraded is bad. Period. If you don't have a disk on hand to replace the failed one (and if you care about redundancy, you should have at least one spare on hand), you should be converting to a single disk, not continuing to run in degraded mode until you get a new disk. The moment you start talking about running degraded long enough that you will be _booting_ the system with the array degraded, you need to be converting to a single disk. This is of course impractical for something like a hardware array or an LVM volume, but it's _trivial_ with BTRFS, and protects you from all kinds of bad situations that can't happen with a single disk but can completely destroy the filesystem if it's a degraded array. Running a single disk is not exactly the same as running a degraded array, it's actually marginally safer (even if you aren't using dup profile for metadata) because there are fewer moving parts to go wrong. It's also exponentially more efficient. Being able to continue to run when a disk fails is the whole point of RAID -- despite what some folks think, RAIDs are not for backups but for uptime. And if your uptime goes to hell because the moment a disk fails you need to drop everything and replace the disk immediately, why would you use RAID? 
Because just replacing a disk and rebuilding the array is almost always much cheaper in terms of time than rebuilding the system from a backup. IOW, even if you have to drop everything and replace the disk immediately, it's still less time consuming than restoring from a backup. It also has the advantage that you don't lose any data. We disagree on letting people run degraded, which I support and you do not. I respect your opinion. However, I have to ask who decides these rules? Obviously not me, since I am a simple btrfs home user. This is a pretty typical stance among seasoned system administrators. It's worth pointing out that I'm not saying you shouldn't run with a single disk for an extended period of time, I'm saying you should _convert_ to single disk profiles until you can get a replacement, and then convert back to raid profiles once you have the replacement. It is exponentially safer in BTRFS to run single data single metadata than half raid1 data half raid1 metadata. This is one of the big reasons that I've avoided MD over the years; it's functionally impossible to do this with MD arrays. Since Oracle is funding btrfs development, is that Oracle's official stance on how to handle a failed disk? Who decides btrfs's roadmap? I have no clue who is who on this mailing list and who influences the features of btrfs. Oracle is obviously using raid systems internally. How do the operators of these raid systems feel about this "not let the system run in degraded mode"? They replace the disks immediately, so it's irrelevant to them. Oracle isn't the sole source of funding (I'm actually not even sure they are anymore; CLM works for Facebook now, last I knew), but you have to understand that it has been developed primarily as an _enterprise_ filesystem. This means that certain perfectly reasonable assumptions are made about the conditions under which it will be used. As a home user, I do not want to have a spare disk always available.
That is paying dearly for a disk, when the raid system can easily run for two years without a disk failure. I want to buy the new disk (asap, of course) once one has died. By that time, the cost of a drive will have fallen drastically. Yes, I can live with running my home system (which has backups) for a day or two in degraded rw mode until I purchase and can install a new disk. Chances are low that both disks will quit at around the same time. You're missing my point. I have zero issue with running with one disk when the other fails. I have issue with not telling the FS that it won't have another disk for a while. IOW, in that situation, I would run: btrfs balance start -dconvert=single -mconvert=dup /whatever to convert to profiles _designed_ for a single device, and then convert back to raid1 when I got another disk. The issue you've stumbled across is only partial motivation for this; the bigger motivation is that running half a 2-disk array is more risky than running a single disk by itself. Simply because I cannot run in degraded mode and cannot add a disk
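The workflow being described can be spelled out as a command sequence. A hedged sketch, assuming /dev/sda2 is the surviving raid1 member and /dev/sdb2 the eventual replacement (both names invented); the run() wrapper prints commands by default, since the real ones need root and a degraded array:

```shell
#!/bin/sh
# Sketch: after losing one raid1 member, convert to single-device
# profiles, then back to raid1 once a replacement disk arrives.
# DRY_RUN=1 (default) prints commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run mount -o degraded /dev/sda2 /mnt                         # surviving member
run btrfs balance start -dconvert=single -mconvert=dup /mnt  # single-disk profiles
run btrfs device delete missing /mnt                         # forget the dead disk
# later, with the replacement installed:
run btrfs device add /dev/sdb2 /mnt
run btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
```

The convert runs before `device delete missing`, since raid1 chunks cannot be rebalanced away while only one device is present.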
Re: [PATCH] btrfs-progs: better document btrfs receive security
On 2017-02-07 13:27, David Sterba wrote: On Fri, Feb 03, 2017 at 08:48:58AM -0500, Austin S. Hemmelgarn wrote: This adds some extra documentation to the btrfs-receive manpage that explains some of the security related aspects of btrfs-receive. The first part covers the fact that the subvolume being received is writable until the receive finishes, and the second covers the current lack of sanity checking of the send stream. Signed-off-by: Austin S. Hemmelgarn

Applied, thanks.

Didn't get a chance to mention this yesterday, but it looks like you hadn't seen the updated version I sent on the third. Message ID is: <20170203193805.96977-1-ahferro...@gmail.com> The only significant difference is that I updated the description for the writability issue using a much better description from Graham Cobb (with his permission of course). If you want, I can send an incremental patch on top of the original to update just that description.
Re: BTRFS for OLTP Databases
Hi, On 08.02.2017 03:11 Peter Zaitsev wrote: > Out of curiosity, I see one problem here: > If you're doing snapshots of the live database, each snapshot leaves > the database files like killing the database in-flight. Like shutting > the system down in the middle of writing data. > > This is because I think there's no API for user space to subscribe to > events like a snapshot - unlike e.g. the VSS API (volume snapshot > service) in Windows. You should put the database into frozen state to > prepare it for a hotcopy before creating the snapshot, then ensure all > data is flushed before continuing. > > I think I've read that btrfs snapshots do not guarantee single point in > time snapshots - the snapshot may be smeared across a longer period of > time while the kernel is still writing data. So parts of your writes > may still end up in the snapshot after issuing the snapshot command, > instead of in the working copy as expected. > > How is this going to be addressed? Is there some snapshot aware API to > let user space subscribe to such events and do proper preparation? Is > this planned? LVM could be a user of such an API, too. I think this > could have nice enterprise-grade value for Linux. > > XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But > still, also this needs to be integrated with MySQL to properly work. I > once (years ago) researched on this but gave up on my plans when I > planned database backups for our web server infrastructure. We moved to > creating SQL dumps instead, although there're binlogs which can be used > to recover to a clean and stable transactional state after taking > snapshots. But I simply didn't want to fiddle around with properly > cleaning up binlogs which accumulate horribly much space usage over > time. The cleanup process requires to create a cold copy or dump of the > complete database from time to time, only then it's safe to remove all > binlogs up to that point in time. 
A little bit off topic, but I for one would be on board with such an effort. It "just" needs coordination between the backup software/snapshot tools, the backed-up software and the various snapshot providers. If you look at the Windows VSS API, this would be a relatively large undertaking if all the corner cases are taken into account, like e.g. a database having the database log on a separate volume from the data, dependencies between different components etc. You'll know more about this, but databases usually fsync quite often in their default configuration, so btrfs snapshots shouldn't be much behind the properly snapshotted state, so I see the advantages more with usability and taking care of corner cases automatically. Regards, Martin Raiber
Re: BTRFS for OLTP Databases
On 2017-02-07 15:54, Kai Krakow wrote: On Tue, 7 Feb 2017 15:27:34 -0500, "Austin S. Hemmelgarn" wrote: I'm not sure about this one. I would assume, based on the fact that many other things don't work with nodatacow and that regular defrag doesn't work on files which are currently mapped as executable code, that it does not, but I could be completely wrong about this too. Technically, there's nothing that prevents autodefrag from working on nodatacow files. The question is: is it really necessary? Standard file systems also have no autodefrag; it's not an issue there because they are essentially nodatacow. Simply defrag the database file once and you're done. Transactional MySQL uses huge data files, probably preallocated. It should simply work with nodatacow. The thing is, I don't have enough knowledge of how defrag is implemented in BTRFS to say for certain that it doesn't use COW semantics somewhere (and I would actually expect it to do so, since that in theory makes many things _much_ easier to handle), and if it uses COW somewhere, then it by definition doesn't work on NOCOW files. A dev would be needed on this. But from a non-dev point of view, the defrag operation itself is CoW: blocks are rewritten to another location in contiguous order. Only metadata CoW should be needed for this operation. It should be nothing else than writing to a nodatacow snapshot... Just that the snapshot is more or less implicit and temporary. Hmm? *curious* The gimmicky part though is that the file has to remain accessible throughout the entire operation, and the defrag can't lose changes that occur while the file is being defragmented. In many filesystems (NTFS on Windows for example), a defrag functions similarly to a pvmove operation in LVM: as each extent gets moved, writes to that region get redirected to the new location, and the areas that were written to are treated as having been moved already.
The thing is, on BTRFS that would result in extents getting split, which means COW is probably involved at some level in the data path too.
Re: understanding disk space usage
Thank you for the explanation. What I would still like to know is how to relate the chunk level abstraction to the file level abstraction. According to the btrfs output there is 2G of data space available and 24G of data space being used. Does this mean 24G of data used in files? How do I know which files take up most space? du seems pretty useless as it reports only 9G of files on the volume. -- Vasco On Wed, Feb 8, 2017 at 4:48 AM, Qu Wenruo wrote: > > > At 02/08/2017 12:44 AM, Vasco Visser wrote: >> >> Hello, >> >> My system is or seems to be running out of disk space but I can't find >> out how or why. Might be a BTRFS peculiarity, hence posting on this >> list. Most indicators seem to suggest I'm filling up, but I can't >> trace the disk usage to files on the FS. >> >> The issue is on my root filesystem on a 28GiB ssd partition (commands >> below issued when booted into single user mode): >> >> >> $ df -h >> Filesystem Size Used Avail Use% Mounted on >> /dev/sda3 28G 26G 2.1G 93% / >> >> >> $ btrfs --version >> btrfs-progs v4.4 >> >> >> $ btrfs fi usage / >> Overall: >> Device size: 27.94GiB >> Device allocated: 27.94GiB >> Device unallocated: 1.00MiB > > > So from the chunk level, your fs is already full. > > And balance won't succeed since there is no unallocated space at all. > The first 1M of btrfs is always reserved and won't be allocated, and 1M is > too small for btrfs to allocate a chunk. > >> Device missing: 0.00B >> Used: 25.03GiB >> Free (estimated): 2.37GiB (min: 2.37GiB) >> Data ratio: 1.00 >> Metadata ratio: 1.00 >> Global reserve: 256.00MiB (used: 0.00B) >> Data,single: Size:26.69GiB, Used:24.32GiB > > > You still have 2G of data space, so you can still write things. > >> /dev/sda3 26.69GiB >> Metadata,single: Size:1.22GiB, Used:731.45MiB > > > Metadata has less space when considering "Global reserve". > In fact the used space would be 987M. > > But it's still OK for normal write.
> >> /dev/sda3 1.22GiB >> System,single: Size:32.00MiB, Used:16.00KiB >> /dev/sda3 32.00MiB > > > System chunk can hardly be used up. > >> Unallocated: >> /dev/sda3 1.00MiB >> >> >> $ btrfs fi df / >> Data, single: total=26.69GiB, used=24.32GiB >> System, single: total=32.00MiB, used=16.00KiB >> Metadata, single: total=1.22GiB, used=731.48MiB >> GlobalReserve, single: total=256.00MiB, used=0.00B >> >> >> However: >> $ mount -o bind / /mnt >> $ sudo du -hs /mnt >> 9.3G /mnt >> >> >> Try to balance: >> $ btrfs balance start / >> ERROR: error during balancing '/': No space left on device >> >> >> Am I really filling up? What can explain the huge discrepancy with the >> output of du (no open file descriptors on deleted files can explain >> this in single user mode) and the FS stats? > > > Just don't believe the vanilla df output for btrfs. > > Btrfs, unlike other filesystems such as ext4/xfs, allocates chunks dynamically > and has different metadata/data profiles, so we can only get a clear view of the > fs from both the chunk level (allocated/unallocated) and the extent > level (total/used). > > In your case, your fs doesn't have any unallocated space, and this makes balance > unable to work at all. > > And your data/metadata usage is quite high; although both have a small amount of > space left, the fs should be writable for some time, but not long. > > To proceed, add a larger device to the current fs and do a balance, or just > delete the 28G partition and btrfs will handle the rest well. > > Thanks, > Qu > >> >> Any advice on possible causes and how to proceed? >> >> >> -- >> Vasco
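Qu's suggestion to add a device can be combined with a filtered balance to get unallocated space back. A hedged sketch; /dev/sdb1 stands in for any temporary second device, and the run() wrapper prints commands by default since the real ones need root:

```shell
#!/bin/sh
# Sketch: regain unallocated space on a fully-allocated filesystem by
# adding a temporary device, compacting half-empty data chunks, then
# removing the device again.
# DRY_RUN=1 (default) prints commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run btrfs device add /dev/sdb1 /      # temporary second device
run btrfs balance start -dusage=50 /  # rewrite data chunks that are <=50% full
run btrfs device delete /dev/sdb1 /   # migrate extents back, then shrink out
```

For the du discrepancy, `btrfs filesystem du` (available in btrfs-progs newer than the v4.4 shown in the thread) accounts for shared extents and answers "which files take up most space" more honestly than plain du.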
Re: dup vs raid1 in single disk
On 07/02/17 23:28, Kai Krakow wrote: To be realistic: I wouldn't trade space usage for duplicate data on an already failing disk, no matter if it's DUP or RAID1. HDD disk space is cheap, and using such a scenario is just a waste of performance AND space - no matter what. I don't understand the purpose of this. It just results in fake safety. The disk is already replaced and is no longer my workstation's main drive. I work with large datasets in my research, and I don't care much about sustained I/O efficiency, since they're only read when needed. Hence, it is a matter of squeezing the last life out of that disk instead of discarding it right away. This way I have one extra local store that may spare me a copy from a remote machine, so I prefer to play with it until it dies. Besides, it affords me a chance to play with btrfs/zfs in ways that I wouldn't normally risk, and I can also assess their behavior with a truly failing disk. In the end, after a destructive write pass with badblocks, the disk's increasing uncorrectable sectors have disappeared... go figure. So right now I have a btrfs filesystem built with the single profile on top of four differently sized partitions. When/if bad blocks reappear I'll test some raid configuration; probably raidz, unless btrfs raid5 is somewhat usable by then (why go with half a disk's worth when you can have 2/3? ;-)) Thanks for your justified concern though. Alex.
On a related note, I see this caveat about dup in the manpage: "For example, a SSD drive can remap the blocks internally to a single copy thus deduplicating them. This negates the purpose of increased redunancy (sic) and just wastes space" That ability is vastly overestimated in the man page. There is no miracle content-addressable storage system working at 500 MB/sec speeds all within a little cheap controller on SSDs. Likely most of what it can do is just compress simple stuff, such as runs of zeroes or other repeating byte sequences. Most of those that do in-line compression don't implement it in firmware, they implement it in hardware, and even DEFLATE can get 500 MB/second speeds if properly implemented in hardware. The firmware may control how the hardware works, but it's usually hardware doing the heavy lifting in that case, and getting a good ASIC made that can hit the required performance point for a reasonable compression algorithm like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI work. I still think it's a myth... The overhead of managing inline deduplication is just way too high to implement it without jumping through expensive hoops. Most workloads have almost zero deduplication potential. And even when they do, the duplicates are spaced so far apart in time that an inline deduplicator won't catch them. If it were all so easy, btrfs would already have it working in mainline. I don't even remember whether those patches are still being worked on. With this in mind, I think dup metadata is still a good thing to have even on SSD and I would always force-enable it. Potential for deduplication is only when using snapshots (which already are deduplicated when taken) or when handling user data on a file server in a multi-user environment. Users tend to copy their files all over the place - multiple directories of multiple gigabytes. Potential is also where you're working with client machine backups or vm images.
I regularly see deduplication efficiency of 30-60% in such scenarios - mostly the file servers which I'm handling. But due to the temporally far-spaced occurrence of duplicate blocks, only offline or nearline deduplication works here. And DUP mode is still useful on SSDs, for cases when one copy of the DUP gets corrupted in-flight due to a bad controller or RAM or cable; you could then restore that block from its good-CRC DUP copy. The only window of time during which bad RAM could result in only one copy of a block being bad is after the first copy is written but before the second is, which is usually an insanely small amount of time. As far as the cabling, the window for errors resulting in a single bad copy of a block is pretty much the same as for RAM, and if they're persistently bad, you're more likely to lose data for other reasons. It depends on the design of the software. That's true if this memory block is simply a single block throughout its lifetime in RAM before being written to storage. But if it is already handled as a duplicate block in memory, odds are different. I hope btrfs is doing this right... ;-) That