[PATCH] btrfs: Better csum error message for data csum mismatch

The original csum error message only outputs the inode number, offset,
checksum and expected checksum. However, no root objectid is output,
which sometimes makes debugging quite painful in multi-subvolume cases
(including relocation). Also the checksums are printed in decimal, which
seldom makes sense to users/developers and is hard to read most of the
time.

This patch adds the root objectid, printed as %lld for rootids larger
than LAST_FREE_OBJECTID, and switches the csum output to hex for better
readability.

Signed-off-by: Qu Wenruo
---
v2: Output mirror number in both inode.c and compression.c
---
 fs/btrfs/btrfs_inode.h | 18 ++++++++++++++++++
 fs/btrfs/compression.c |  6 ++----
 fs/btrfs/inode.c       |  5 ++---
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 1a8fa46ff87e..3cb8e6347b24 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -326,6 +326,24 @@ static inline void btrfs_inode_resume_unlocked_dio(struct inode *inode)
 		  &BTRFS_I(inode)->runtime_flags);
 }
 
+static inline void btrfs_print_data_csum_error(struct inode *inode,
+		u64 logical_start, u32 csum, u32 csum_expected, int mirror_num)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+
+	/* Output minus objectid, which is more meaningful */
+	if (root->objectid >= BTRFS_LAST_FREE_OBJECTID)
+		btrfs_warn_rl(root->fs_info,
+	"csum failed root %lld ino %lld off %llu csum 0x%08x expected csum 0x%08x mirror %d",
+			root->objectid, btrfs_ino(inode), logical_start, csum,
+			csum_expected, mirror_num);
+	else
+		btrfs_warn_rl(root->fs_info,
+	"csum failed root %llu ino %llu off %llu csum 0x%08x expected csum 0x%08x mirror %d",
+			root->objectid, btrfs_ino(inode), logical_start, csum,
+			csum_expected, mirror_num);
+}
+
 bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end);
 
 #endif
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 7f390849343b..a7a770ad93ad 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -124,10 +124,8 @@ static int check_compressed_csum(struct inode *inode,
 		kunmap_atomic(kaddr);
 
 		if (csum != *cb_sum) {
-			btrfs_info(BTRFS_I(inode)->root->fs_info,
-			   "csum failed ino %llu extent %llu csum %u wanted %u mirror %d",
-			   btrfs_ino(inode), disk_start, csum, *cb_sum,
-			   cb->mirror_num);
+			btrfs_print_data_csum_error(inode, disk_start, csum,
+						    *cb_sum, cb->mirror_num);
 			ret = -EIO;
 			goto fail;
 		}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1e861a063721..5cfd904cc6e6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3123,9 +3123,8 @@ static int __readpage_endio_check(struct inode *inode,
 	kunmap_atomic(kaddr);
 	return 0;
 zeroit:
-	btrfs_warn_rl(BTRFS_I(inode)->root->fs_info,
-		"csum failed ino %llu off %llu csum %u expected csum %u",
-		btrfs_ino(inode), start, csum, csum_expected);
+	btrfs_print_data_csum_error(inode, start, csum, csum_expected,
+				    io_bio->mirror_num);
 	memset(kaddr + pgoff, 1, len);
 	flush_dcache_page(page);
 	kunmap_atomic(kaddr);
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
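As an aside for readers unfamiliar with the special root objectids: the roots at or above BTRFS_LAST_FREE_OBJECTID (-256 as an unsigned 64-bit value, e.g. the tree reloc root at -8ULL) are much more readable printed as signed numbers, which is what the %lld branch achieves. A quick illustrative model (Python, not kernel code; the function names here are made up for the sketch):

```python
# Toy model of the message format the patch introduces: special roots at or
# above BTRFS_LAST_FREE_OBJECTID print as signed 64-bit values, and csums
# print in hex. Not kernel code; fmt_objectid/csum_error_msg are invented
# names for illustration only.
U64 = 1 << 64
LAST_FREE_OBJECTID = U64 - 256  # -256ULL in the kernel headers

def fmt_objectid(objectid):
    # the %lld case: reinterpret the u64 as a signed 64-bit value
    if objectid >= LAST_FREE_OBJECTID:
        return str(objectid - U64)
    return str(objectid)  # the ordinary %llu case

def csum_error_msg(root_objectid, ino, off, csum, expected, mirror):
    return ("csum failed root %s ino %d off %d csum 0x%08x "
            "expected csum 0x%08x mirror %d"
            % (fmt_objectid(root_objectid), ino, off, csum, expected, mirror))

# the tree reloc root (-8ULL) now shows up as "root -8" instead of a
# 20-digit unsigned number
print(csum_error_msg(U64 - 8, 2241616, 51580928, 0x12345678, 0x9abcdef0, 1))
```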
Re: understanding disk space usage
At 02/08/2017 05:55 PM, Vasco Visser wrote:
> Thank you for the explanation. What I would still like to know is how
> to relate the chunk level abstraction to the file level abstraction.
> According to the btrfs output there is 2G of data space available and
> 24G of data space being used. Does this mean 24G of data used in files?

Yes, 24G is used to store data. (And space cache, although the space
cache is relatively small, less than 1M for each chunk.)

> How do I know which files take up most space? du seems pretty useless
> as it reports only 9G of files on the volume.

Are you using snapshots? If you are only using 1 subvolume (including
snapshots), then it seems that btrfs data CoW wastes quite a lot of
space.

In the case of btrfs data CoW, say you have a 128M file (one extent)
and then rewrite 64M of it: your data space usage will be 128M + 64M,
as the first 128M will only be freed after *all* its users are freed.

For the single-subvolume, little-to-no-reflink case, "btrfs fi defrag"
should help to free some space.

If you have multiple snapshots or a lot of reflinked files, then I'm
afraid you have to delete some files (including reflink copies or
snapshots) to free some data.

Thanks,
Qu

> --
> Vasco
>
> On Wed, Feb 8, 2017 at 4:48 AM, Qu Wenruo wrote:
>> At 02/08/2017 12:44 AM, Vasco Visser wrote:
>>> Hello,
>>>
>>> My system is or seems to be running out of disk space but I can't
>>> find out how or why. Might be a BTRFS peculiarity, hence posting on
>>> this list. Most indicators seem to suggest I'm filling up, but I
>>> can't trace the disk usage to files on the FS. The issue is on my
>>> root filesystem on a 28GiB ssd partition (commands below issued when
>>> booted into single user mode):
>>>
>>> $ df -h
>>> Filesystem      Size  Used Avail Use% Mounted on
>>> /dev/sda3        28G   26G  2.1G  93% /
>>>
>>> $ btrfs --version
>>> btrfs-progs v4.4
>>>
>>> $ btrfs fi usage /
>>> Overall:
>>>     Device size:          27.94GiB
>>>     Device allocated:     27.94GiB
>>>     Device unallocated:    1.00MiB
>>
>> So from the chunk level, your fs is already full.
>>
>> And balance won't succeed since there is no unallocated space at all.
>> The first 1M of btrfs is always reserved and won't be allocated, and
>> 1M is too small for btrfs to allocate a chunk.
>>
>>>     Device missing:          0.00B
>>>     Used:                 25.03GiB
>>>     Free (estimated):      2.37GiB (min: 2.37GiB)
>>>     Data ratio:               1.00
>>>     Metadata ratio:           1.00
>>>     Global reserve:      256.00MiB (used: 0.00B)
>>>
>>> Data,single: Size:26.69GiB, Used:24.32GiB
>>
>> You still have 2G of data space, so you can still write things.
>>
>>>    /dev/sda3  26.69GiB
>>>
>>> Metadata,single: Size:1.22GiB, Used:731.45MiB
>>
>> Metadata has less space when considering the "Global reserve". In
>> fact the used space would be 987M. But it's still OK for normal
>> writes.
>>
>>>    /dev/sda3   1.22GiB
>>>
>>> System,single: Size:32.00MiB, Used:16.00KiB
>>>    /dev/sda3  32.00MiB
>>
>> The system chunk can hardly be used up.
>>
>>> Unallocated:
>>>    /dev/sda3   1.00MiB
>>>
>>> $ btrfs fi df /
>>> Data, single: total=26.69GiB, used=24.32GiB
>>> System, single: total=32.00MiB, used=16.00KiB
>>> Metadata, single: total=1.22GiB, used=731.48MiB
>>> GlobalReserve, single: total=256.00MiB, used=0.00B
>>>
>>> However:
>>> $ mount -o bind / /mnt
>>> $ sudo du -hs /mnt
>>> 9.3G  /mnt
>>>
>>> Try to balance:
>>> $ btrfs balance start /
>>> ERROR: error during balancing '/': No space left on device
>>>
>>> Am I really filling up? What can explain the huge discrepancy with
>>> the output of du (no open file descriptors on deleted files can
>>> explain this in single user mode) and the FS stats?
>>
>> Just don't believe the vanilla df output for btrfs.
>>
>> For btrfs, unlike other filesystems such as ext4/xfs, which allocates
>> chunks dynamically and has different metadata/data profiles, we can
>> only get a clear view of the fs from both the chunk level
>> (allocated/unallocated) and the extent level (total/used).
>>
>> In your case, your fs doesn't have any unallocated space, which makes
>> balance unable to work at all. And your data/metadata usage is quite
>> high; although both have a small amount of available space left, the
>> fs should be writable for some time, but not long.
>>
>> To proceed, add a larger device to the current fs and do a balance,
>> or just delete the 28G partition; then btrfs will handle the rest
>> well.
>>
>> Thanks,
>> Qu
>>
>>> Any advice on possible causes and how to proceed?
>>>
>>> --
>>> Vasco
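The 128M + 64M example above can be sketched numerically (a toy Python model of the accounting, not btrfs code): a CoW extent is only freed once no file range references any part of it, so a partial rewrite allocates new space without freeing the old, and du (which sees logical file sizes) diverges from the data "used" figure.

```python
# Toy model of btrfs data CoW accounting: an extent stays allocated while
# any file range still references it, so partially rewriting a file grows
# on-disk usage beyond the file's logical size.
MiB = 1024 * 1024

def space_used(extents):
    """extents: list of (size_bytes, live_reference_count) pairs."""
    return sum(size for size, refs in extents if refs > 0)

# a 128 MiB file written as a single extent
extents = [(128 * MiB, 1)]
assert space_used(extents) == 128 * MiB

# rewrite 64 MiB of it: a new 64 MiB extent appears, but the old 128 MiB
# extent is still partially referenced, so nothing is freed yet
extents = [(128 * MiB, 1), (64 * MiB, 1)]
assert space_used(extents) == 192 * MiB  # 128M + 64M, as in the example
```

Defragmenting (or deleting the last snapshot/reflink holding an old extent) is what drops a reference count to zero in this model and lets the space return.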
csum failed, checksum error, questions
I had a file read fail repeatably; in syslog, lines like this:

  kernel: BTRFS warning (device dm-5): csum failed ino 2241616 off 51580928 csum 4redacted expected csum 2redacted

I rmed the file. Another error more recently, 5 instances which look
like this:

  kernel: BTRFS warning (device dm-5): checksum error at logical 16147043602432 on dev /dev/mapper/dev-name-redacted, sector 1177577896, root 4679, inode 2241616, offset 51597312, length 4096, links 1 (path: file/path/redacted)
  kernel: BTRFS error (device dm-5): bdev /dev/mapper/dev-name-redacted errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
  kernel: BTRFS error (device dm-5): unable to fixup (regular) error at logical 16147043602432 on dev /dev/mapper/dev-name-redacted

In this case, I think the file got rmed as well.

I'm assuming this is a problem with the drive, not btrfs. Any opinions
on how likely catastrophic failure of the drive is?

Is rming the problematic file sufficient? How about if the subvolume
containing this bad file was previously snapshotted?

Is there anything else besides "kernel: BTRFS (error|warning)" that I
should grep for in my syslog to watch for filesystem/drive problems?
For example, is there anything in addition to error/warning like
"fatal" or "critical"?

For at least the second error, I was running:

  Linux 4.9.0-1-amd64 #1 SMP Debian 4.9.2-2 (2017-01-12) x86_64 GNU/Linux
  btrfs-progs 4.7.3-1

Thanks,
Ian Kelling
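(Editorial sketch on the grep question: btrfs does also log at "critical" level, via the kernel's btrfs_crit() helper, so a slightly wider pattern is worth matching. The sample log lines below are made up for illustration; this is a Python stand-in for a grep -E over syslog.)

```python
import re

# hypothetical sample syslog lines, modeled on the messages in this thread
lines = [
    'kernel: BTRFS warning (device dm-5): csum failed ino 2241616 ...',
    'kernel: BTRFS error (device dm-5): unable to fixup (regular) error ...',
    'kernel: BTRFS critical (device dm-5): corrupt leaf ...',
    'kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode',
]

# equivalent of: grep -E 'BTRFS (error|warning|critical)' /var/log/syslog
pattern = re.compile(r'BTRFS (error|warning|critical)')
hits = [line for line in lines if pattern.search(line)]
for line in hits:
    print(line)
```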
Re: [PATCH] btrfs: qgroup: Move half of the qgroup accounting time out of commit trans
At 02/08/2017 10:09 PM, Filipe Manana wrote:
> On Wed, Feb 8, 2017 at 1:56 AM, Qu Wenruo wrote:
>> Just as Filipe pointed out, the most time consuming part of qgroup is
>> btrfs_qgroup_account_extents() and
>> btrfs_qgroup_prepare_account_extents().
>
> There's an "and", so the "is" should be "are" and "part" should be
> "parts".
>
>> Which both call btrfs_find_all_roots() to get old_roots and new_roots
>> ulists. However for old_roots, we don't really need to calculate it
>> at transaction commit time.
>>
>> This patch moves the old_roots accounting part out of
>> commit_transaction(), so at least we won't block the transaction too
>> long.
>
> Doing stuff inside btrfs_commit_transaction() is only bad if it's
> within the critical section, that is, after setting the transaction's
> state to TRANS_STATE_COMMIT_DOING and before setting the state to
> TRANS_STATE_UNBLOCKED. This should be explained somehow in the
> changelog.

In this context, only the critical section is under concern.

>> But please note that this won't speed up qgroup overall, it just
>> moves half of the cost out of commit_transaction().
>>
>> Cc: Filipe Manana
>> Signed-off-by: Qu Wenruo
>> ---
>>  fs/btrfs/delayed-ref.c | 20
>>  fs/btrfs/qgroup.c      | 33 ++---
>>  fs/btrfs/qgroup.h      | 14 ++
>>  3 files changed, 60 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>> index ef724a5..0ee927e 100644
>> --- a/fs/btrfs/delayed-ref.c
>> +++ b/fs/btrfs/delayed-ref.c
>> @@ -550,13 +550,14 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>>                      struct btrfs_delayed_ref_node *ref,
>>                      struct btrfs_qgroup_extent_record *qrecord,
>>                      u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
>> -                    int action, int is_data)
>> +                    int action, int is_data, int *qrecord_inserted_ret)
>>  {
>>         struct btrfs_delayed_ref_head *existing;
>>         struct btrfs_delayed_ref_head *head_ref = NULL;
>>         struct btrfs_delayed_ref_root *delayed_refs;
>>         int count_mod = 1;
>>         int must_insert_reserved = 0;
>> +       int qrecord_inserted = 0;
>>
>>         /* If reserved is provided, it must be a data extent. */
>>         BUG_ON(!is_data && reserved);
>> @@ -623,6 +624,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>>                 if (btrfs_qgroup_trace_extent_nolock(fs_info,
>>                                         delayed_refs, qrecord))
>>                         kfree(qrecord);
>> +               else
>> +                       qrecord_inserted = 1;
>>         }
>>
>>         spin_lock_init(&head_ref->lock);
>> @@ -650,6 +653,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>>                 atomic_inc(&delayed_refs->num_entries);
>>                 trans->delayed_ref_updates++;
>>         }
>> +       if (qrecord_inserted_ret)
>> +               *qrecord_inserted_ret = qrecord_inserted;
>>         return head_ref;
>>  }
>>
>> @@ -779,6 +784,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
>>         struct btrfs_delayed_ref_head *head_ref;
>>         struct btrfs_delayed_ref_root *delayed_refs;
>>         struct btrfs_qgroup_extent_record *record = NULL;
>> +       int qrecord_inserted;
>>
>>         BUG_ON(extent_op && extent_op->is_data);
>>         ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
>> @@ -806,12 +812,15 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
>>          * the spin lock
>>          */
>>         head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
>> -                                       bytenr, num_bytes, 0, 0, action, 0);
>> +                                       bytenr, num_bytes, 0, 0, action, 0,
>> +                                       &qrecord_inserted);
>>
>>         add_delayed_tree_ref(fs_info, trans, head_ref, &ref->node, bytenr,
>>                              num_bytes, parent, ref_root, level, action);
>>         spin_unlock(&delayed_refs->lock);
>>
>> +       if (qrecord_inserted)
>> +               return btrfs_qgroup_trace_extent_post(fs_info, record);
>>         return 0;
>>
>>  free_head_ref:
>> @@ -836,6 +845,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
>>         struct btrfs_delayed_ref_head *head_ref;
>>         struct btrfs_delayed_ref_root *delayed_refs;
>>         struct btrfs_qgroup_extent_record *record = NULL;
>> +       int qrecord_inserted;
>>
>>         BUG_ON(extent_op && !extent_op->is_data);
>>         ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
>> @@ -870,13 +880,15 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
>>          */
>>         head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
>>                                         bytenr, num_bytes, ref_root, reserved,
>> -                                       action, 1);
>> +                                       action, 1, &qrecord_inserted);
>>
>>         add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
>>                              num_bytes, parent, ref_root, owner, offset,
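The split the patch makes can be sketched as a toy model (purely illustrative Python, not the kernel data structures; the function names mirror the patch loosely): old_roots can be looked up against the commit root as soon as the qgroup extent record is inserted, outside the commit critical section, while only new_roots must wait for commit time.

```python
# Toy model of the two-phase qgroup accounting split: for each dirty
# extent, old_roots is filled in early (outside the transaction-commit
# critical section), so only the new_roots lookup remains at commit time.

class ExtentRecord:
    def __init__(self, bytenr):
        self.bytenr = bytenr
        self.old_roots = None
        self.new_roots = None

def find_all_roots(view, bytenr):
    # stand-in for btrfs_find_all_roots(); 'view' maps bytenr -> root ids
    return set(view.get(bytenr, ()))

def trace_extent_post(record, commit_root_view):
    # analogous to btrfs_qgroup_trace_extent_post() in the patch:
    # resolve old_roots from the commit root, before commit starts
    record.old_roots = find_all_roots(commit_root_view, record.bytenr)

def account_at_commit(records, current_root_view):
    # only the new_roots lookup is left for the critical section
    for rec in records:
        rec.new_roots = find_all_roots(current_root_view, rec.bytenr)

rec = ExtentRecord(bytenr=4096)
trace_extent_post(rec, {4096: [5]})           # done outside commit
account_at_commit([rec], {4096: [5, 257]})    # done inside commit
assert rec.old_roots == {5} and rec.new_roots == {5, 257}
```

This is why the patch only moves (rather than removes) half of the cost: both lookups still happen once per touched extent, but only one of them now blocks the commit.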
Re: Very slow balance / btrfs-transaction
At 02/08/2017 09:56 PM, Filipe Manana wrote:
> On Wed, Feb 8, 2017 at 12:39 AM, Qu Wenruo wrote:
>> At 02/07/2017 11:55 PM, Filipe Manana wrote:
>>> On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo wrote:
>>>> At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:
>>>>> Hi Qu,
>>>>>
>>>>> On 02/05/2017 07:45 PM, Qu Wenruo wrote:
>>>>>> At 02/04/2017 09:47 AM, Jorg Bornschein wrote:
>>>>>>> February 4, 2017 1:07 AM, "Goldwyn Rodrigues" wrote:
>>>>>>>
>>>>>>> Quota support was indeed active -- and it warned me that the
>>>>>>> qgroup data was inconsistent. Disabling quotas had an immediate
>>>>>>> impact on balance throughput -- it's *much* faster now! From a
>>>>>>> quick glance at iostat I would guess it's at least a factor 100
>>>>>>> faster. Should quota support generally be disabled during
>>>>>>> balances? Or did I somehow push my fs into a weird state where
>>>>>>> it triggered a slow-path?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> j
>>>>>>
>>>>>> Would you please provide the kernel version?
>>>>>>
>>>>>> v4.9 introduced a bad fix for qgroup balance, which doesn't
>>>>>> completely fix qgroup bytes leaking, but also hugely slows down
>>>>>> the balance process:
>>>>>>
>>>>>>   commit 62b99540a1d91e46422f0e04de50fc723812c421
>>>>>>   Author: Qu Wenruo
>>>>>>   Date:   Mon Aug 15 10:36:51 2016 +0800
>>>>>>
>>>>>>       btrfs: relocation: Fix leaking qgroups numbers on data extents
>>>>>>
>>>>>> Sorry for that.
>>>>>>
>>>>>> And in v4.10, a better method is applied to fix the byte leaking
>>>>>> problem, and it should be a little faster than the previous one:
>>>>>>
>>>>>>   commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
>>>>>>   Author: Qu Wenruo
>>>>>>   Date:   Tue Oct 18 09:31:29 2016 +0800
>>>>>>
>>>>>>       btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
>>>>>>
>>>>>> However, balance with qgroup is still slower than balance without
>>>>>> qgroup; the root fix needs us to rework the current backref
>>>>>> iteration.
>>>>>
>>>>> This patch has made the btrfs balance performance worse. The
>>>>> balance task has become more CPU intensive compared to earlier and
>>>>> takes longer to complete, besides hogging resources. While
>>>>> correctness is important, we need to figure out how this can be
>>>>> made more efficient.
>>>>
>>>> The cause is already known. It's find_parent_node(), which takes
>>>> most of the time to find all referencers of an extent.
>>>>
>>>> And it's also the cause of the FIEMAP softlockup (fixed in a recent
>>>> release by quitting early).
>>>>
>>>> The biggest problem is that the current find_parent_node() uses a
>>>> list to iterate, which is quite slow, especially as it's done in a
>>>> loop. In the real world find_parent_node() is about O(n^3).
>>>>
>>>> We can either improve find_parent_node() by using an rb_tree, or
>>>> introduce some cache for find_parent_node().
>>>
>>> Even if anyone is able to reduce that function's complexity from
>>> O(n^3) down to, let's say, O(n^2) or O(n log n) for example, the
>>> current implementation of qgroups will always be a problem. The real
>>> problem is that this more recent rework of qgroups does all this
>>> accounting inside the critical section of a transaction - blocking
>>> any other tasks that want to start a new transaction or attempt to
>>> join the current transaction.
>>>
>>> Not to mention that on systems with small amounts of memory (2Gb or
>>> 4Gb from what I've seen from user reports) we also OOM due to this
>>> allocation of struct btrfs_qgroup_extent_record per delayed data
>>> reference head, which are used for that accounting phase in the
>>> critical section of a transaction commit.
>>>
>>> Let's face it and be realistic: even if someone manages to make
>>> find_parent_node() much, much better, like O(n) for example, it will
>>> always be a problem due to the reasons mentioned before. Many
>>> extents touched per transaction and many subvolumes/snapshots will
>>> always expose that root problem - doing the accounting in the
>>> transaction commit critical section.
>>
>> You must accept the fact that we must call find_parent_node() at
>> least twice to get the correct owner modification for each touched
>> extent, or the qgroup numbers will never be correct: once for
>> old_roots by searching the commit root, and once for new_roots by
>> searching the current root.
>>
>> You can call find_parent_node() as many times as you like, but that's
>> just wasting your CPU time. Only the final find_parent_node() will
>> determine new_roots for that extent, and there is no better timing
>> than commit_transaction().
>
> You're missing my point.
>
> My point is not about needing to call find_parent_nodes() nor how many
> times to call it, or whether it's needed or not.
>
> My point is about doing expensive things inside the critical section
> of a transaction commit, which leads not only to low performance but
> to a system becoming unresponsive and with too high latency - and this
> is not theory or speculation, there are upstream reports about this as
> well as several in suse's bugzilla, all caused when qgroups are
> enabled on 4.2+ kernels (when the last qgroups major changes landed).
>
> Judging from that code and from your reply to this and other threads,
> it seems you didn't understand the consequences of doing all that
> accounting stuff inside the critical section of a transaction commit.

NO, I know what you're talking about. Or I won't send the patch to
Re: [PATCH] Btrfs: fix use-after-free due to wrong order of destroying work queues
On Tue, Feb 07, 2017 at 05:02:53PM +0000, fdman...@kernel.org wrote:
> From: Filipe Manana
>
> Before we destroy all work queues (and wait for their tasks to complete)
> we were destroying the work queues used for metadata I/O operations, which
> can result in a use-after-free problem because most tasks from all work
> queues do metadata I/O operations. For example, the tasks from the caching
> workers work queue (fs_info->caching_workers), which is destroyed only
> after the work queue used for metadata reads (fs_info->endio_meta_workers)
> is destroyed, do metadata reads, which result in attempts to queue tasks
> into the later work queue, triggering a use-after-free with a trace like
> the following:
>
> [23114.613543] general protection fault: [#1] PREEMPT SMP
> [23114.614442] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic
>  acpi_cpufreq tpm_tis tpm_tis_core tpm ppdev parport_pc parport i2c_piix4 processor sg evdev i2c_core psmouse pcspkr serio_raw button loop autofs4 ext4 crc16
>  jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: scsi_debug]
> [23114.616932] CPU: 9 PID: 4537 Comm: kworker/u32:8 Not tainted 4.9.0-rc7-btrfs-next-36+ #1
> [23114.616932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> [23114.616932] Workqueue: btrfs-cache btrfs_cache_helper [btrfs]
> [23114.616932] task: 880221d45780 task.stack: c9000bc5
> [23114.616932] RIP: 0010:[] [] btrfs_queue_work+0x2c/0x190 [btrfs]
> [23114.616932] RSP: 0018:88023f443d60 EFLAGS: 00010246
> [23114.616932] RAX: RBX: 6b6b6b6b6b6b6b6b RCX: 0102
> [23114.616932] RDX: a0419000 RSI: 88011df534f0 RDI: 880101f01c00
> [23114.616932] RBP: 88023f443d80 R08: 000f7000 R09:
> [23114.616932] R10: 88023f443d48 R11: 1000 R12: 88011df534f0
> [23114.616932] R13: 880135963868 R14: 1000 R15: 1000
> [23114.616932] FS: () GS:88023f44() knlGS:
> [23114.616932] CS: 0010 DS: ES: CR0: 80050033
> [23114.616932] CR2: 7f0fb9f8e520 CR3: 01a0b000 CR4: 06e0
> [23114.616932] Stack:
> [23114.616932]  880101f01c00 88011df534f0 880135963868 1000
> [23114.616932]  88023f443da0 a03470af 880149b37200 880135963868
> [23114.616932]  88023f443db8 8125293c 880149b37200 88023f443de0
> [23114.616932] Call Trace:
> [23114.616932]  [] end_workqueue_bio+0xd5/0xda [btrfs]
> [23114.616932]  [] bio_endio+0x54/0x57
> [23114.616932]  [] btrfs_end_bio+0xf7/0x106 [btrfs]
> [23114.616932]  [] bio_endio+0x54/0x57
> [23114.616932]  [] blk_update_request+0x21a/0x30f
> [23114.616932]  [] scsi_end_request+0x31/0x182 [scsi_mod]
> [23114.616932]  [] scsi_io_completion+0x1ce/0x4c8 [scsi_mod]
> [23114.616932]  [] scsi_finish_command+0x104/0x10d [scsi_mod]
> [23114.616932]  [] scsi_softirq_done+0x101/0x10a [scsi_mod]
> [23114.616932]  [] blk_done_softirq+0x82/0x8d
> [23114.616932]  [] __do_softirq+0x1ab/0x412
> [23114.616932]  [] irq_exit+0x49/0x99
> [23114.616932]  [] smp_call_function_single_interrupt+0x24/0x26
> [23114.616932]  [] call_function_single_interrupt+0x89/0x90
> [23114.616932]  [] ? scsi_request_fn+0x13a/0x2a1 [scsi_mod]
> [23114.616932]  [] ? _raw_spin_unlock_irq+0x2c/0x4a
> [23114.616932]  [] ? _raw_spin_unlock_irq+0x32/0x4a
> [23114.616932]  [] ? _raw_spin_unlock_irq+0x2c/0x4a
> [23114.616932]  [] scsi_request_fn+0x13a/0x2a1 [scsi_mod]
> [23114.616932]  [] __blk_run_queue_uncond+0x22/0x2b
> [23114.616932]  [] __blk_run_queue+0x19/0x1b
> [23114.616932]  [] blk_queue_bio+0x268/0x282
> [23114.616932]  [] generic_make_request+0xbd/0x160
> [23114.616932]  [] submit_bio+0x100/0x11d
> [23114.616932]  [] ? __this_cpu_preempt_check+0x13/0x15
> [23114.616932]  [] ? __percpu_counter_add+0x8e/0xa7
> [23114.616932]  [] btrfsic_submit_bio+0x1a/0x1d [btrfs]
> [23114.616932]  [] btrfs_map_bio+0x1f4/0x26d [btrfs]
> [23114.616932]  [] btree_submit_bio_hook+0x74/0xbf [btrfs]
> [23114.616932]  [] ? btrfs_wq_submit_bio+0x160/0x160 [btrfs]
> [23114.616932]  [] submit_one_bio+0x6b/0x89 [btrfs]
> [23114.616932]  [] read_extent_buffer_pages+0x170/0x1ec [btrfs]
> [23114.616932]  [] ? free_root_pointers+0x64/0x64 [btrfs]
> [23114.616932]  [] readahead_tree_block+0x3f/0x4c [btrfs]
> [23114.616932]  [] read_block_for_search.isra.20+0x1ce/0x23d [btrfs]
> [23114.616932]  [] btrfs_search_slot+0x65f/0x774 [btrfs]
> [23114.616932]  [] ? free_extent_buffer+0x73/0x7e [btrfs]
> [23114.616932]  [] btrfs_next_old_leaf+0xa1/0x33c [btrfs]
> [23114.616932]  []
Re: [PULL] Fix ioctls on 32bit/64bit userspace/kernel, for 4.10
On Wed, Feb 08, 2017 at 05:51:28PM +0100, David Sterba wrote:
> Hi,
>
> could you please merge this single-patch pull request, for 4.10 still?
> There are quite a few patches on top of v4.10-rc7, so this IMHO does
> not look too bad even late in the release cycle. Though it's a fix for
> an uncommon usecase of 32bit userspace on a 64bit kernel, it fixes
> basic operation of the ioctls. Thanks.

Hi Dave,

I'll pull this in, thanks.

-chris
Re: BTRFS and cyrus mail server
On 08/02/17 18:38, Libor Klepáč wrote:
> I'm interested in using:
...
> - send/receive for offsite backup

I don't particularly recommend that.

I do use send/receive for onsite backups (I actually use btrbk). But
for offsite I use a traditional backup tool (I use dar). For three
main reasons:

1) Paranoia: I want a backup that does not use btrfs, just in case
there turned out to be some problem with btrfs which could corrupt the
backup. I can't think of anything, but I did say it was paranoia!

2) send/receive in incremental mode (the obvious way to use it for
offsite backups) relies on the target being up to date and properly
synchronised with the source. If, for any reason, it gets out of sync,
you have to start again with sending a full backup - a lot of data.
Traditional backup formats are more forgiving, and having a corrupted
incremental does not normally prevent you getting access to the data
stored in the other incrementals. This would particularly be a risk if
you thought about storing the actual send streams instead of doing the
receive: a single bit error in one could make all the subsequent
streams useless.

3) send/receive doesn't work particularly well with encryption. I
store my offsite backups in a cloud service and I want them encrypted
both in transit and when stored. To get the same with send/receive
requires putting together your own encrypted communication channel
(e.g. using ssh) and requires that you have a remote server with an
encrypted filesystem receiving the data (and it has to be accessible
in the clear on that server). Traditional backups can just be stored
offsite as encrypted files without ever having to be in the clear
anywhere except onsite.

Just my reasons.
Re: understanding disk space usage
> [ ... ] The issue isn't total size, it's the difference between total
> size and the amount of data you want to store on it, and how well you
> manage chunk usage. If you're balancing regularly to compact chunks
> that are less than 50% full, [ ... ] BTRFS on 16GB disk images before
> with absolutely zero issues, and have a handful of fairly active 8GB
> BTRFS volumes [ ... ]

Unfortunately balance operations are quite expensive, especially from
inside VMs. On the other hand, if the system is not much disk
constrained, relatively frequent balances are a good idea indeed.

It is a bit like the advice in the other thread on OLTP to run
frequent data defrags, which are also quite expensive. Both combined
are like running the compactor/cleaner on log-structured (another
variant of "COW") filesystems like NILFS2: running it frequently means
tighter space use and better locality, but is quite expensive too.

>> [ ... ] My impression is that the Btrfs design trades space for
>> performance and reliability.
>
> In general, yes, but a more accurate statement would be that it
> offers a trade-off between space and convenience. [ ... ]

It is not quite "convenience", it is overhead: whole-volume operations
like compacting, defragmenting (or fscking) tend to cost significantly
in IOPS and also in transfer rate, and on flash SSDs they also consume
lifetime. Therefore personally I prefer to have quite a bit of unused
space in Btrfs or NILFS2, at a minimum around double, at 10-20%, than
the 5-10% that I think is the minimum advisable with conventional
designs.
Re: understanding disk space usage
On 2017-02-08 09:46, Peter Grandi wrote: My system is or seems to be running out of disk space but I can't find out how or why. [ ... ] FilesystemSize Used Avail Use% Mounted on /dev/sda3 28G 26G 2.1G 93% / [ ... ] So from chunk level, your fs is already full. And balance won't success since there is no unallocated space at all. To add to this, 28GiB is a bit too small for Btrfs, because at that point chunk size is 1GiB. I have the habit of sizing partitions to an exact number of GiB, and that means that most of 1GiB will never be used by Btrfs because there is a small amount of space allocated that is smaller than 1GiB and thus there will be eventually just less than 1GiB unallocated. Unfortunately the chunk size is not manually settable. 28GB is a perfectly reasonable (if a bit odd) size for a non-mixed-mode volume. The issue isn't total size, it's the difference between total size and the amount of data you want to store on it. and how well you manage chunk usage. If you're balancing regularly to compact chunks that are less than 50% full, you can get away with as little as 4GB of extra space beyond your regular data-set with absolutely zero issues. I've run full Linux installations in VM's with BTRFS on 16GB disk images before with absolutely zero issues, and have a handful of fairly active 8GB BTRFS volumes on both of my primary systems that never have any issues with free space despite averaging 5GB of space usage. Example here from 'btrfs fi usage': Overall: Device size: 88.00GiB Device allocated: 86.06GiB Device unallocated:1.94GiB Device missing: 0.00B Used: 80.11GiB Free (estimated): 6.26GiB (min: 5.30GiB) That means that I should 'btrfs balance' now, because of the 1.94GiB "unallocated", 0.94GiB will never be allocated, and that leaves just 1GiB "unallocated" which is the minimum for running 'btrfs balance'. I have just done so and this is the result: Actually, that 0.94GB would be used. 
BTRFS will create smaller chunks if it has to, so if you allocated two data chunks with that 1.94GB of space, you would get one 1GB chunk and one 0.94GB chunk.
> Overall:
>     Device size:          88.00GiB
>     Device allocated:     82.03GiB
>     Device unallocated:    5.97GiB
>     Device missing:          0.00B
>     Used:                 80.11GiB
>     Free (estimated):      6.26GiB  (min: 3.28GiB)
>
> At some point I had decided to use 'mixedbg' allocation to
> reduce this problem and hopefully improve locality, but that
> means that metadata and data need to have the same profile, and
> I really want metadata to be 'dup' because of checksumming, and
> I don't want data to be 'dup' too.
You could also use larger partitions and keep a better handle on free space.
>> [ ... ] To proceed, add a larger device to current fs, and do
>> a balance or just delete the 28G partition then btrfs will
>> handle the rest well.
> Usually for this I use a USB stick, with a 1-3GiB partition plus
> a bit extra because of that extra bit of space.
If you have a lot of RAM and can guarantee that things won't crash (or don't care about the filesystem too much and are just trying to avoid having to restore a backup), a ramdisk works well for this too.
> https://btrfs.wiki.kernel.org/index.php/FAQ#How_much_free_space_do_I_have.3F
> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_Btrfs_claims_I.27m_out_of_space.2C_but_it_looks_like_I_should_have_lots_left.21
> marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
>
> Unfortunately if it is a single device volume and metadata is
> 'dup' to remove the extra temporary device one has first to
> convert the metadata to 'single' and then back to 'dup' after
> removal.
This shouldn't be needed; if it is, then it's a bug that should be reported and ideally fixed (there was such a bug when converting from multi-device raid profiles to single device, but that got fixed quite a few kernel versions ago; I distinctly remember because I wrote the fix).
> There are also some additional reasons why space used (rather
> than allocated) may be larger than expected, in special but not
> wholly infrequent cases. My impression is that the Btrfs design
> trades space for performance and reliability.
In general, yes, but a more accurate statement would be that it offers a trade-off between space and convenience. If you're not going to take the time to maintain the filesystem properly, then you will need more excess space for it.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
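[Editor's sketch, not part of the thread: the "balancing regularly to compact chunks that are less than 50% full" maintenance described above maps onto balance's usage filters. A minimal sketch, with /mnt standing in for the real mount point:]

```shell
# Rewrite only data/metadata chunks that are at most 50% used, packing
# their contents into fewer chunks and returning the slack to the
# unallocated pool. /mnt is a placeholder mount point.
btrfs balance start -dusage=50 -musage=50 /mnt

# Check the effect: "Device unallocated" should have grown.
btrfs filesystem usage /mnt
```

Lower thresholds finish faster; on nearly-full filesystems it is common to start with -dusage=10 and raise the value stepwise until enough space is reclaimed.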
Re: BTRFS and cyrus mail server
On Wed, 08 Feb 2017 19:38:06 +0100, Libor Klepáč wrote:
> Hello,
> inspired by recent discussion on BTRFS vs. databases i wanted to ask
> on suitability of BTRFS for hosting a Cyrus imap server spool. I
> haven't found any recent article on this topic.
>
> I'm preparing migration of our mailserver to Debian Stretch, ie.
> kernel 4.9 for now. We are using XFS for storage now. I will migrate
> using imapsync to new server. Both are virtual machines running on
> vmware on Dell hardware. Disks are on battery backed hw raid
> controllers over vmfs.
>
> I'm considering using BTRFS, but I'm little concerned because of
> reading this mailing list ;)
>
> I'm interested in using:
> - compression (emails should compress well - right?)
Not really... The small part that's compressible (headers and a few lines of text) is already small, so a sector (maybe 4k) is still a sector. Compression gains you no benefit here. The big parts of mails are already compressed (images, attachments). Mail spools only compress well if you're compressing mails into a solid archive (like 7zip or tgz). If you're compressing each mail individually, there's almost no gain because of file system slack.
> - maybe deduplication (cyrus does it by hardlinking of same content
> messages now) later
It won't work that way. I'd stick to hardlinking. Only offline/nearline deduplication will help you. And it will have a hard time finding the duplicates. This would only properly work if Cyrus separates mail headers and bodies (I don't know if it does; dovecot, which is what I use, doesn't), because delivering to the spool usually adds some headers like "Delivered-To". This changes the byte offsets between similar mails so that deduplication will no longer work.
> - snapshots for history
Don't do snapshots too deep.
I had similar plans but instead decided it would be better to use the following setup as a continuous backup strategy: Deliver mails to two spools, one being the user accessible spool, and one being the backup spool. Once per day you rename the backup spool and let it be recreated. Then store away the old backup store in whatever way you want (snapshots, traditional backup with retention, ...). > - send/receive for offisite backup It's not that stable that I'd use it in production... > - what about data inlining, should it be turned off? How much data can be inlined? I'm not sure, I never thought about that. > Our Cyrus pool consist of ~520GB of data in ~2,5million files, ~2000 > mailboxes. Similar numbers here, just more mailboxes and less space because we take care that customers remove their mails from our servers and store it in their own systems and backups. With a few exceptions, and those have really big mailboxes. > We have message size limit of ~25MB, so emails are not bigger than > that. 50 MB raw size here... (after 3-in-4 decoding this makes around 37 MB worth of attachments) > There are however bigger files, these are per mailbox > caches/index files of cyrus (some of them are around 300MB) - and > these are also files which are most modified. > Rest of files (messages) are usualy just writen once. I'm still struggling if I should try btrfs or stay with xfs. Xfs has a huge benefit of scaling very very well to parallel workloads and accross multiple devices. Btrfs does exactly that not very well yet (because of write-serialization etc). > > --- > I started using btrfs on backup server as a storage for 4 backuppc > run in containers (backups are then send away with btrbk), year ago. > After switching off data inlining i'm satisfied, everything works > (send/ receive is sometime slow, but i guess it's because of sata > disks on receive side). I've started to love borgbackup. It's very fast, efficient, and reliable. 
Not sure how well it works for VM images, but for delta backups in general it's very efficient and fast.

--
Regards,
Kai

Replies to list-only preferred.
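[Editor's sketch, not part of the thread: the "a sector is still a sector" argument above can be checked without btrfs at all. A toy illustration, using gzip as a stand-in for the filesystem's compressor and a synthetic header-heavy mail: the compressed copy is much smaller in bytes, yet both round up to the same number of 4 KiB sectors, so nothing is saved on disk.]

```shell
# Build a ~3.5 KiB synthetic mail made of repetitive header lines.
mail=$(mktemp)
seq 1 70 | sed 's/^/Received: from relay.example.com with ESMTP id /' > "$mail"

orig_bytes=$(wc -c < "$mail")
comp_bytes=$(gzip -c "$mail" | wc -c)

# Round a byte count up to whole 4 KiB sectors.
sectors() { echo $(( ($1 + 4095) / 4096 )); }

echo "original:   $orig_bytes bytes = $(sectors "$orig_bytes") sector(s)"
echo "compressed: $comp_bytes bytes = $(sectors "$comp_bytes") sector(s)"
rm -f "$mail"
```

The byte count drops sharply, the sector count does not; only mails spanning multiple sectors can actually give space back.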
Re: BTRFS and cyrus mail server
On 2017-02-08 13:38, Libor Klepáč wrote: Hello, inspired by recent discussion on BTRFS vs. databases i wanted to ask on suitability of BTRFS for hosting a Cyrus imap server spool. I haven't found any recent article on this topic. I'm preparing migration of our mailserver to Debian Stretch, ie. kernel 4.9 for now. We are using XFS for storage now. I will migrate using imapsync to new server. Both are virtual machines running on vmware on Dell hardware. Disks are on battery backed hw raid controllers over vmfs. I'm considering using BTRFS, but I'm little concerned because of reading this mailing list ;) FWIW, as long as you're using a recent kernel and take the time to do proper maintenance on the filesystem, BTRFS is generally very stable. WRT mail servers specifically, before we went to a cloud service for e-mail where I work, we used Postfix + Dovecot on our internal server, and actually saw a measurable performance improvement when switching from XFS to BTRFS. That was about 3.12-3.18 vintage on the kernel though, so YMMV. I'm interested in using: - compression (emails should compress well - right?) Yes, very well assuming you're storing the actual text form of them (I don't recall if Cyrus does so, but I know Postfix, Sendmail, and most other FOSS mail server software do). The in-line compression will also help reduce fragmentation, and unless you have a really fast storage device, should probably improve performance in general. - maybe deduplication (cyrus does it by hardlinking of same content messages now) later Deduplication beyond what Cyrus does is probably not worth it. In most cases about 10% of an e-mail in text form is going to be duplicated if it's not a copy of an existing message, and that 10% is generally spread throughout the file (stuff like MIME headers and such), so you would probably see near zero space savings for doing anything beyond what Cyrus does while using an insanely larger amount of resources. 
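[Editor's sketch, not part of the thread: the point made in this thread, that a header added at delivery shifts all byte offsets so fixed-offset block deduplication finds nothing, is easy to demonstrate outside btrfs. The files and address below are made up; the two "mails" differ only by one prepended "Delivered-To:" line, yet no 4 KiB-aligned block matches.]

```shell
# Two "mails" with identical bodies; the second gained one header line
# on delivery, shifting every byte after it.
body=$(mktemp); a=$(mktemp); b=$(mktemp)
seq 1 400 | sed 's/^/X-Filler: line number /' > "$body"   # ~10 KiB
cat "$body" > "$a"
{ echo "Delivered-To: someone@example.com"; cat "$body"; } > "$b"

# Compare the files 4 KiB-aligned block by block, as a naive
# fixed-offset deduplicator would.
blocks=$(( ($(wc -c < "$a") + 4095) / 4096 ))
pa=$(mktemp); pb=$(mktemp)
matches=0
i=0
while [ "$i" -lt "$blocks" ]; do
    dd if="$a" of="$pa" bs=4096 skip="$i" count=1 2>/dev/null
    dd if="$b" of="$pb" bs=4096 skip="$i" count=1 2>/dev/null
    cmp -s "$pa" "$pb" && matches=$((matches + 1))
    i=$((i + 1))
done
echo "identical aligned 4 KiB blocks: $matches of $blocks"
rm -f "$a" "$b" "$body" "$pa" "$pb"
```

Block-oriented dedup tools on btrfs face the same alignment problem, which is why shifted near-duplicates are only found by content-defined chunking approaches.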
- snapshots for history Make sure you use a sane exponential thinning system. Once you get past about 300 snapshots, you'll start seeing some serious performance issues, and even double digits might hurt performance at the scale you're talking about. - send/receive for offisite backup This is up to you, but I would probably not use send-receive for off-site backups. Unless you're using reflinking, you can copy all the same attributes that send-receive does using almost any other backup tool, and other tools often have much better security built-in. Send streams also don't compress very well in my experience, so using send-receive has a tendency to require more network resources. - what about data inlining, should it be turned off? Generally no, and especially if you handle lots of small e-mails. Metadata blocks need to be looked up to open and read files anyway, in-lining the data means that you don't need to read in any more blocks for files small enough to fit in the spare space in the metadata block or when you only need to read the first few kilobytes of the file (and if Cyrus' IMAP/POP server works anything like most others I've seen, it will be parsing those first few KB because that's where the headers it indexes are). Our Cyrus pool consist of ~520GB of data in ~2,5million files, ~2000 mailboxes. We have message size limit of ~25MB, so emails are not bigger than that. There are however bigger files, these are per mailbox caches/index files of cyrus (some of them are around 300MB) - and these are also files which are most modified. I would mark these files NOCOW for performance reasons (and because if they're just caches and indexes, they should be pretty simple to regenerate). Rest of files (messages) are usualy just writen once. --- I started using btrfs on backup server as a storage for 4 backuppc run in containers (backups are then send away with btrbk), year ago. 
After switching off data inlining I'm satisfied, everything works (send/receive is sometimes slow, but I guess it's because of sata disks on the receive side).
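[Editor's sketch, not part of the thread: the NOCOW suggestion above for the frequently rewritten cache/index files looks like this in practice; /var/spool/cyrus is a placeholder path.]

```shell
# The C (NOCOW) attribute only takes effect on empty files, so set it on
# the directory: files created inside it afterwards inherit it.
chattr +C /var/spool/cyrus/mailboxes
lsattr -d /var/spool/cyrus/mailboxes   # the 'C' flag should be listed
```

Note that NOCOW files also lose checksumming and compression, which is usually an acceptable trade for caches and indexes that can be regenerated.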
Re: BTRFS for OLTP Databases
On 2017-02-07 22:35, Kai Krakow wrote: [...]
>> Atomicity can be a relative term. If the snapshot atomicity is
>> relative to barriers but not relative to individual writes between
>> barriers then AFAICT it's fine because the filesystem doesn't make
>> any promise it won't keep even in the context of its snapshots.
>> Consider a power loss: the filesystem's atomicity guarantees can't go
>> beyond what the hardware guarantees, which means not all current
>> in-flight writes will reach the disk and partial writes can happen.
>> Modern filesystems will remain consistent though, and if an
>> application using them makes use of f*sync it can provide its own
>> guarantees too. The same should apply to snapshots: all the in-flight
>> writes can complete or not on disk before the snapshot; what matters
>> is that both the snapshot and these writes will be completed after
>> the next barrier (and any robust application will ignore all the
>> in-flight writes it finds in the snapshot if they were part of a
>> batch that should be atomically committed).
>>
>> This is why AFAIK PostgreSQL or MySQL with their default ACID
>> compliant configuration will recover from a BTRFS snapshot in the
>> same way they recover from a power loss.
>
> This is what I meant in my other reply. But this is also why it should
> be documented. Wrongly implying that snapshots are single point in time
> snapshots is a wrong assumption with possibly horrible side effects one
> wouldn't expect.

I don't understand what you are saying. Until now, my understanding was that "all the writes which were passed to btrfs before the snapshot time are in the snapshot. The ones after are not". Am I wrong? Which are the other possible interpretations?

[..]
--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
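[Editor's sketch, not part of the thread: the recovery argument above, that a crash-consistent snapshot looks like a power loss to an ACID database, means a plain read-only snapshot is already a usable backup source; /var/lib/mysql is a placeholder subvolume path.]

```shell
# Take a read-only snapshot of the running database's subvolume...
btrfs subvolume snapshot -r /var/lib/mysql /var/lib/mysql-snap

# ...copy it off-host with any backup tool, then drop it.
btrfs subvolume delete /var/lib/mysql-snap
```

On restore, the database replays its journal/redo log exactly as it would after a power failure, provided it was configured for durable (fsync-on-commit) operation.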
Re: raid1: cannot add disk to replace faulty because can only mount fs as read-only.
On 2017-02-08 08:46, Tomasz Torcz wrote:
> On Wed, Feb 08, 2017 at 07:50:22AM -0500, Austin S. Hemmelgarn wrote:
>> It is exponentially safer in BTRFS to run single data single metadata
>> than half raid1 data half raid1 metadata.
> Why?
>> To convert to profiles _designed_ for a single device and then convert
>> back to raid1 when I got another disk. The issue you've stumbled across
>> is only partial motivation for this, the bigger motivation is that
>> running half a 2 disk array is more risky than running a single disk by
>> itself.
> Again, why? What's the difference? What causes increased risk?
Aside from bugs like the one that sparked this thread, that is? Just off the top of my head:
* You're running with half a System chunk. This is _very_ risky because almost any error in the system chunk runs the risk of nuking entire files and possibly the whole filesystem. This is part of the reason that I explicitly listed -mconvert=dup instead of -mconvert=single.
* It performs significantly better. As odd as this sounds, this actually has an impact on safety. Better overall performance reduces the size of the windows of time during which part of the filesystem is committed. This has less impact than running a traditional filesystem on top of a traditional RAID array, but it still has some impact.
* Single device is exponentially better tested than running a degraded multi-device array. IOW, you're less likely to hit obscure bugs by running a single profile instead of half a raid1 profile.
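[Editor's sketch, not part of the thread: the conversion recommended above for a permanently degraded two-disk raid1 can be expressed as a single balance with convert filters; /mnt is a placeholder mount point.]

```shell
# Convert data to 'single' and metadata to 'dup' on the surviving device.
btrfs balance start -dconvert=single -mconvert=dup /mnt
btrfs filesystem df /mnt   # profiles should now read single / DUP
```

If the System chunk is not converted along with metadata on a given kernel, an explicit pass with -sconvert=dup (which requires the -f force flag) handles it.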
Re: understanding disk space usage
On Wed, Feb 08, 2017 at 02:46:32PM +, Peter Grandi wrote: > >> My system is or seems to be running out of disk space but I > >> can't find out how or why. [ ... ] > >> FilesystemSize Used Avail Use% Mounted on > >> /dev/sda3 28G 26G 2.1G 93% / > [ ... ] > > So from chunk level, your fs is already full. And balance > > won't success since there is no unallocated space at all. > > To add to this, 28GiB is a bit too small for Btrfs, because at > that point chunk size is 1GiB. I have the habit of sizing > partitions to an exact number of GiB, and that means that most > of 1GiB will never be used by Btrfs because there is a small > amount of space allocated that is smaller than 1GiB and thus > there will be eventually just less than 1GiB unallocated. Not true -- the last chunk can be smaller than 1 GiB, to use the available space completely. Hugo. > Unfortunately the chunk size is not manually settable. > > Example here from 'btrfs fi usage': > > Overall: > Device size: 88.00GiB > Device allocated: 86.06GiB > Device unallocated:1.94GiB > Device missing: 0.00B > Used: 80.11GiB > Free (estimated): 6.26GiB (min: 5.30GiB) > > That means that I should 'btrfs balance' now, because of the > 1.94GiB "unallocated", 0.94GiB will never be allocated, and that > leaves just 1GiB "unallocated" which is the minimum for running > 'btrfs balance'. I have just done so and this is the result: > > Overall: > Device size: 88.00GiB > Device allocated: 82.03GiB > Device unallocated:5.97GiB > Device missing: 0.00B > Used: 80.11GiB > Free (estimated): 6.26GiB (min: 3.28GiB) > > At some point I had decided to use 'mixedbg' allocation to > reduce this problem and hopefully improve locality, but that > means that metadata and data need to have the same profile, and > I really want metadata to be 'dup' because of checksumming, > and I don't want data to be 'dup' too. > > > [ ... 
] To proceed, add a larger device to current fs, and do
> > a balance or just delete the 28G partition then btrfs will
> > handle the rest well.
>
> Usually for this I use a USB stick, with a 1-3GiB partition plus
> a bit extra because of that extra bit of space.
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#How_much_free_space_do_I_have.3F
> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_Btrfs_claims_I.27m_out_of_space.2C_but_it_looks_like_I_should_have_lots_left.21
> marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
>
> Unfortunately if it is a single device volume and metadata is
> 'dup' to remove the extra temporary device one has first to
> convert the metadata to 'single' and then back to 'dup' after
> removal.
>
> There are also some additional reasons why space used (rather
> than allocated) may be larger than expected, in special but not
> wholly infrequent cases. My impression is that the Btrfs design
> trades space for performance and reliability.

--
Hugo Mills             | Alert status chocolate viridian: Authorised
hugo@... carfax.org.uk | personnel only. Dogs must be carried on escalator.
http://carfax.org.uk/  | PGP: E2AB1DE4 |
Re: [PATCH v2] Btrfs: create a helper to create em for IO
On Tue, Jan 31, 2017 at 07:50:22AM -0800, Liu Bo wrote:
> We have similar codes to create and insert extent mapping around IO path,
> this merges them into a single helper.

Looks good, comments below.

> +static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
> +				       u64 orig_start, u64 block_start,
> +				       u64 block_len, u64 orig_block_len,
> +				       u64 ram_bytes, int compress_type,
> +				       int type);
>
>  static int btrfs_dirty_inode(struct inode *inode);
>
> @@ -690,7 +690,6 @@ static noinline void submit_compressed_extents(struct inode *inode,
>  	struct btrfs_key ins;
>  	struct extent_map *em;
>  	struct btrfs_root *root = BTRFS_I(inode)->root;
> -	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
>  	struct extent_io_tree *io_tree;
>  	int ret = 0;
>
> @@ -778,46 +777,19 @@ static noinline void submit_compressed_extents(struct inode *inode,
>  	 * here we're doing allocation and writeback of the
>  	 * compressed pages
>  	 */
> -	btrfs_drop_extent_cache(inode, async_extent->start,
> -				async_extent->start +
> -				async_extent->ram_size - 1, 0);
> -
> -	em = alloc_extent_map();
> -	if (!em) {
> -		ret = -ENOMEM;
> -		goto out_free_reserve;
> -	}
> -	em->start = async_extent->start;
> -	em->len = async_extent->ram_size;
> -	em->orig_start = em->start;
> -	em->mod_start = em->start;
> -	em->mod_len = em->len;
> -
> -	em->block_start = ins.objectid;
> -	em->block_len = ins.offset;
> -	em->orig_block_len = ins.offset;
> -	em->ram_bytes = async_extent->ram_size;
> -	em->bdev = fs_info->fs_devices->latest_bdev;
> -	em->compress_type = async_extent->compress_type;
> -	set_bit(EXTENT_FLAG_PINNED, &em->flags);
> -	set_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
> -	em->generation = -1;
> -
> -	while (1) {
> -		write_lock(&em_tree->lock);
> -		ret = add_extent_mapping(em_tree, em, 1);
> -		write_unlock(&em_tree->lock);
> -		if (ret != -EEXIST) {
> -			free_extent_map(em);
> -			break;
> -		}
> -		btrfs_drop_extent_cache(inode, async_extent->start,
> -					async_extent->start +
> -					async_extent->ram_size - 1,
> -					0);
> -	}
> -
> -	if (ret)
> +	em = create_io_em(inode, async_extent->start,
> +			  async_extent->ram_size, /* len */
> +			  async_extent->start, /* orig_start */
> +			  ins.objectid, /* block_start */
> +			  ins.offset, /* block_len */
> +			  ins.offset, /* orig_block_len */
> +			  async_extent->ram_size, /* ram_bytes */
> +			  async_extent->compress_type,
> +			  BTRFS_ORDERED_COMPRESSED);
> +	if (IS_ERR(em))
> +		/* ret value is not necessary due to void function */
>  		goto out_free_reserve;
> +	free_extent_map(em);
>
>  	ret = btrfs_add_ordered_extent_compress(inode,
>  					async_extent->start,
> @@ -952,7 +924,6 @@ static noinline int cow_file_range(struct inode *inode,
>  	u64 blocksize = fs_info->sectorsize;
>  	struct btrfs_key ins;
>  	struct extent_map *em;
> -	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
>  	int ret = 0;
>
>  	if (btrfs_is_free_space_inode(inode)) {
> @@ -1008,39 +979,18 @@ static noinline int cow_file_range(struct inode *inode,
>  	if (ret < 0)
>  		goto out_unlock;
>
> -	em = alloc_extent_map();
> -	if (!em) {
> -		ret = -ENOMEM;
> -		goto out_reserve;
> -	}
> -	em->start = start;
> -	em->orig_start = em->start;
>  	ram_size = ins.offset;
> -	em->len = ins.offset;
> -	em->mod_start = em->start;
> -	em->mod_len = em->len;
> -
> -	em->block_start = ins.objectid;
> -	em->block_len = ins.offset;
> -	em->orig_block_len = ins.offset;
> -	em->ram_bytes = ram_size;
> -	em->bdev = fs_info->fs_devices->latest_bdev;
> -	set_bit(EXTENT_FLAG_PINNED,
BTRFS and cyrus mail server
Hello,
inspired by recent discussion on BTRFS vs. databases I wanted to ask about the suitability of BTRFS for hosting a Cyrus imap server spool. I haven't found any recent article on this topic.

I'm preparing migration of our mailserver to Debian Stretch, i.e. kernel 4.9 for now. We are using XFS for storage now. I will migrate using imapsync to the new server. Both are virtual machines running on vmware on Dell hardware. Disks are on battery-backed hw raid controllers over vmfs.

I'm considering using BTRFS, but I'm a little concerned because of reading this mailing list ;)

I'm interested in using:
- compression (emails should compress well - right?)
- maybe deduplication (cyrus does it by hardlinking of same-content messages now) later
- snapshots for history
- send/receive for offsite backup
- what about data inlining, should it be turned off?

Our Cyrus pool consists of ~520GB of data in ~2.5 million files, ~2000 mailboxes. We have a message size limit of ~25MB, so emails are not bigger than that. There are however bigger files; these are per-mailbox caches/index files of cyrus (some of them are around 300MB) - and these are also the files which are modified most. The rest of the files (messages) are usually just written once.

---
I started using btrfs on a backup server as storage for 4 backuppc instances run in containers (backups are then sent away with btrbk) a year ago. After switching off data inlining I'm satisfied, everything works (send/receive is sometimes slow, but I guess it's because of sata disks on the receive side).

Thanks for your opinions,
Libor
Re: raid1: cannot add disk to replace faulty because can only mount fs as read-only.
On Wed, Feb 08, 2017 at 07:50:22AM -0500, Austin S. Hemmelgarn wrote:
> It is exponentially safer in BTRFS to run single data single metadata
> than half raid1 data half raid1 metadata.

Why?

> To convert to profiles _designed_ for a single device and then convert back
> to raid1 when I got another disk. The issue you've stumbled across is only
> partial motivation for this, the bigger motivation is that running half a 2
> disk array is more risky than running a single disk by itself.

Again, why? What's the difference? What causes increased risk?

--
Tomasz Torcz               Only gods can safely risk perfection,
xmpp: zdzich...@chrome.pl  it's a dangerous thing for a man.  -- Alia
Re: [PATCH] Btrfs: add another missing end_page_writeback on submit_extent_page failure
On Tue, Feb 07, 2017 at 12:14:51PM -0800, Liu Bo wrote:
> > +		end_page_writeback(page);
> > +	}
> >
> >  	cur = cur + iosize;
> >  	pg_offset += iosize;
> > @@ -3767,7 +3770,8 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
> >  	epd->bio_flags = bio_flags;
> >  	if (ret) {
> >  		set_btree_ioerr(p);
> > -		end_page_writeback(p);
> > +		if (PageWriteback(p))
> > +			end_page_writeback(p);
> >  		if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
> >  			end_extent_buffer_writeback(eb);
> >  		ret = -EIO;
> >
> > ---
>
> Looks good, could you please make a comment for the if statement in your
> commit log so that others could know why we put it?

Thank you both. Please resend v2 so I can add it to the 4.11 queue.

> Since you've got a reproducer, baking it into a fstests case is also
> welcome.

AFAICS the reproducer needs a kernel patch so the memory allocation fails reliably; this is not suitable for fstests. We don't have an easy way to inject allocation failures, but some reduced steps to reproduce could be added to the changelog.
Re: [PATCH v2] btrfs: Better csum error message for data csum mismatch
On Tue, Feb 07, 2017 at 02:57:17PM +0800, Qu Wenruo wrote:
> The original csum error message only outputs inode number, offset, check
> sum and expected check sum.
>
> However no root objectid is outputted, which sometimes makes debugging
> quite painful under multi-subvolume case (including relocation).
>
> Also the checksum output is decimal, which seldom makes sense for
> users/developers and is hard to read in most time.
>
> This patch will add root objectid, which will be %lld for rootid larger
> than LAST_FREE_OBJECTID, and hex csum output for better readability.

Ok for the change.

> +	"csum failed root %lld ino %lld off %llu csum 0x%08x expected csum 0x%08x",
> +	"csum failed root %llu ino %llu off %llu csum 0x%08x expected csum 0x%08x",
> -	"csum failed ino %llu extent %llu csum %u wanted %u mirror %d",

so the new code does not print mirror number, I think this still makes sense in cases where we know it. Please extend the helper and callchain that leads to the new print functions so we see the mirror as well.

btrfs_readpage_end_io_hook
  __readpage_endio_check
    (print the csum failed message)
Re: [PATCH 08/24] btrfs: Convert to separately allocated bdi
On Thu, Feb 02, 2017 at 06:34:06PM +0100, Jan Kara wrote:
> Allocate struct backing_dev_info separately instead of embedding it
> inside superblock. This unifies handling of bdi among users.
>
> CC: Chris Mason
> CC: Josef Bacik
> CC: David Sterba
> CC: linux-btrfs@vger.kernel.org
> Signed-off-by: Jan Kara

Reviewed-by: David Sterba
Re: [PATCH] btrfs-progs: Remove unused function arg in delete_extent_records
On Fri, Feb 03, 2017 at 10:15:32AM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues
>
> new_len is not used in delete_extent_records().
>
> Signed-off-by: Goldwyn Rodrigues

Applied, thanks.
Re: [PATCH] btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls
On Mon, Feb 06, 2017 at 07:39:09PM -0500, Jeff Mahoney wrote:
> Commit 4c63c2454ef incorrectly assumed that returning -ENOIOCTLCMD would
> cause the native ioctl to be called. The ->compat_ioctl callback is
> expected to handle all ioctls, not just compat variants. As a result,
> when using 32-bit userspace on 64-bit kernels, everything except those
> three ioctls would return -ENOTTY.
>
> Fixes: 4c63c2454ef ("btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Jeff Mahoney

Reviewed-by: David Sterba
[PULL] Fix ioctls on 32bit/64bit userspace/kernel, for 4.10
Hi,

could you please merge this single-patch pull request, for 4.10 still? There are quite a few patches on top of v4.10-rc7, so this IMHO does not look too bad even late in the release cycle. Though it's a fix for an uncommon usecase of 32bit userspace on 64bit kernel, it fixes basic operation of the ioctls. Thanks.

The following changes since commit 57b59ed2e5b91e958843609c7884794e29e6c4cb:

  Btrfs: remove ->{get, set}_acl() from btrfs_dir_ro_inode_operations (2017-01-26 15:48:56 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git fixes-4.10

for you to fetch changes up to 2a362249187a8d0f6d942d6e1d763d150a296f47:

  btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls (2017-02-08 17:47:30 +0100)

Jeff Mahoney (1):
      btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls

 fs/btrfs/ioctl.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
Re: understanding disk space usage
>> My system is or seems to be running out of disk space but I
>> can't find out how or why. [ ... ]
>> Filesystem  Size  Used Avail Use% Mounted on
>> /dev/sda3    28G   26G  2.1G  93% /
[ ... ]
> So from chunk level, your fs is already full. And balance
> won't succeed since there is no unallocated space at all.

To add to this, 28GiB is a bit too small for Btrfs, because at that point chunk size is 1GiB. I have the habit of sizing partitions to an exact number of GiB, and that means that most of 1GiB will never be used by Btrfs because there is a small amount of space allocated that is smaller than 1GiB and thus there will be eventually just less than 1GiB unallocated. Unfortunately the chunk size is not manually settable.

Example here from 'btrfs fi usage':

Overall:
    Device size:          88.00GiB
    Device allocated:     86.06GiB
    Device unallocated:    1.94GiB
    Device missing:          0.00B
    Used:                 80.11GiB
    Free (estimated):      6.26GiB  (min: 5.30GiB)

That means that I should 'btrfs balance' now, because of the 1.94GiB "unallocated", 0.94GiB will never be allocated, and that leaves just 1GiB "unallocated" which is the minimum for running 'btrfs balance'. I have just done so and this is the result:

Overall:
    Device size:          88.00GiB
    Device allocated:     82.03GiB
    Device unallocated:    5.97GiB
    Device missing:          0.00B
    Used:                 80.11GiB
    Free (estimated):      6.26GiB  (min: 3.28GiB)

At some point I had decided to use 'mixedbg' allocation to reduce this problem and hopefully improve locality, but that means that metadata and data need to have the same profile, and I really want metadata to be 'dup' because of checksumming, and I don't want data to be 'dup' too.

> [ ... ] To proceed, add a larger device to current fs, and do
> a balance or just delete the 28G partition then btrfs will
> handle the rest well.

Usually for this I use a USB stick, with a 1-3GiB partition plus a bit extra because of that extra bit of space.
https://btrfs.wiki.kernel.org/index.php/FAQ#How_much_free_space_do_I_have.3F
https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_Btrfs_claims_I.27m_out_of_space.2C_but_it_looks_like_I_should_have_lots_left.21
marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html

Unfortunately if it is a single device volume and metadata is 'dup' to remove the extra temporary device one has first to convert the metadata to 'single' and then back to 'dup' after removal.

There are also some additional reasons why space used (rather than allocated) may be larger than expected, in special but not wholly infrequent cases. My impression is that the Btrfs design trades space for performance and reliability.
partial quota rescan
I'm trying to use qgroups to keep track of storage occupied by snapshots. I noticed that:

a) no two rescans can run in parallel, and there's no way to schedule another rescan while one is running;
b) it seems to be a whole-disk operation regardless of the path specified on the CLI.

I have only just started to fill my new 24TB btrfs volume using qgroups, but rescans already take a long time, and due to (a) above my scripts each time have to wait for the previous rescan to finish. Can anything be done about it, like trashing and recomputing only the statistics for a specific qgroup?

Linux host 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
btrfs-progs v4.4

--
With Best Regards,
Marat Khalili
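Until something like a per-qgroup rescan exists, the serialization in (a) can at least be scripted around: `btrfs quota rescan -s <path>` reports whether a rescan is running, and `-w` starts one and waits for it to finish. A minimal sketch; the status text grepped for is an assumption, so check it against what your btrfs-progs version actually prints:

```shell
#!/bin/sh
# Serialize qgroup rescans: poll until no rescan is in progress,
# then start a fresh one and block until it completes.
rescan_sync() {
    mnt=$1
    # assumed status wording; adjust to the local 'rescan -s' output
    while btrfs quota rescan -s "$mnt" | grep -q "operation running"; do
        sleep 30
    done
    # -w starts a rescan and waits for completion
    btrfs quota rescan -w "$mnt"
}

# Usage: rescan_sync /mnt/volume && btrfs qgroup show /mnt/volume
```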
Re: BTRFS for OLTP Databases
On 2017-02-08 at 13:14, Martin Raiber wrote:
> Hi,
>
> On 08.02.2017 03:11 Peter Zaitsev wrote:
>> Out of curiosity, I see one problem here:
>> If you're doing snapshots of the live database, each snapshot leaves
>> the database files like killing the database in-flight. Like shutting
>> the system down in the middle of writing data.
>>
>> This is because I think there's no API for user space to subscribe to
>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>> service) in Windows. You should put the database into frozen state to
>> prepare it for a hotcopy before creating the snapshot, then ensure all
>> data is flushed before continuing.
>>
>> I think I've read that btrfs snapshots do not guarantee single
>> point-in-time snapshots - the snapshot may be smeared across a longer
>> period of time while the kernel is still writing data. So parts of
>> your writes may still end up in the snapshot after issuing the
>> snapshot command, instead of in the working copy as expected.
>>
>> How is this going to be addressed? Is there some snapshot-aware API to
>> let user space subscribe to such events and do proper preparation? Is
>> this planned? LVM could be a user of such an API, too. I think this
>> could have nice enterprise-grade value for Linux.
>>
>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots.
>> But still, this also needs to be integrated with MySQL to properly
>> work. I once (years ago) researched this but gave up on my plans when
>> I planned database backups for our web server infrastructure. We moved
>> to creating SQL dumps instead, although there are binlogs which can be
>> used to recover to a clean and stable transactional state after taking
>> snapshots. But I simply didn't want to fiddle around with properly
>> cleaning up binlogs, which accumulate horribly much space usage over
>> time. The cleanup process requires creating a cold copy or dump of the
>> complete database from time to time; only then is it safe to remove
>> all binlogs up to that point in time.
>
> A little bit off topic, but I for one would be on board with such an
> effort. It "just" needs coordination between the backup
> software/snapshot tools, the backed-up software and the various
> snapshot providers. If you look at the Windows VSS API, this would be a
> relatively large undertaking if all the corner cases are taken into
> account, like e.g. a database having the database log on a separate
> volume from the data, dependencies between different components, etc.
>
> You'll know more about this, but databases usually fsync quite often in
> their default configuration, so btrfs snapshots shouldn't be much
> behind the properly snapshotted state, so I see the advantages more
> with usability and taking care of corner cases automatically.
>
> Regards,
> Martin Raiber

xfs_freeze works also for BTRFS...

--
Adrian Brzeziński
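The filesystem-agnostic counterpart that works on btrfs is fsfreeze(8), which issues the same FIFREEZE/FITHAW ioctls. A minimal sketch pairing it with a block-level snapshot taken underneath the filesystem (mount point and LV names are placeholders; this is for snapshots below the fs, such as LVM, not for btrfs's own snapshot command):

```shell
#!/bin/sh
# Quiesce a mounted filesystem, snapshot the underlying block device,
# then thaw. fsfreeze works on btrfs as well as XFS/ext4.
freeze_snapshot() {
    mnt=$1; lv=$2
    fsfreeze --freeze "$mnt" || return 1
    # take the lower-level snapshot while the fs is quiescent
    lvcreate --snapshot --name dbsnap --size 2G "$lv"
    rc=$?
    # always thaw, even if the snapshot failed
    fsfreeze --unfreeze "$mnt"
    return $rc
}

# Usage: freeze_snapshot /srv/db vg0/dblv
```

Keeping the thaw unconditional matters: a frozen filesystem blocks every writer until FITHAW is issued.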
Re: BTRFS for OLTP Databases
Hi,

When it comes to MySQL I'm not really sure what you're trying to achieve, because MySQL manages its own cache; flushing the OS cache to disk and "freezing" the FS does not really do much - it will still need to do crash recovery when such a snapshot is restored.

The reason people would use xfs_freeze with MySQL is when the database is spread across different filesystems - typically log files placed on a different partition than the data, or databases placed on different partitions. In this case you need a consistent single-point-in-time snapshot across the filesystems for the backup to be recoverable.

The more common approach, though, is to keep it KISS and have everything on a single filesystem.

On Wed, Feb 8, 2017 at 8:26 AM, Martin Raiber wrote:
> On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
>> On 2017-02-08 07:14, Martin Raiber wrote:
>>> Hi,
>>>
>>> On 08.02.2017 03:11 Peter Zaitsev wrote:
>>>> Out of curiosity, I see one problem here:
>>>> If you're doing snapshots of the live database, each snapshot leaves
>>>> the database files like killing the database in-flight. [...]
>>>
>>> little bit off topic, but I for one would be on board with such an
>>> effort. It "just" needs coordination between the backup
>>> software/snapshot tools, the backed up software and the various
>>> snapshot providers. If you look at the Windows VSS API, this would be
>>> a relatively large undertaking if all the corner cases are taken into
>>> account, like e.g. a database having the database log on a separate
>>> volume from the data, dependencies between different components etc.
>>>
>>> You'll know more about this, but databases usually fsync quite often
>>> in their default configuration, so btrfs snapshots shouldn't be much
>>> behind the properly snapshotted state, so I see the advantages more
>>> with usability and taking care of corner cases automatically.
>> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
>> reflinking to userspace, and therefore it's fully possible to
>> implement this in userspace. Having a version of the fsfreeze (the
>> generic form of xfs_freeze) stuff that worked on individual sub-trees
>> would be nice from a practical perspective, but implementing it would
>> not be easy by any means, and would be essentially necessary for a
>> VSS-like API. In the meantime though, it is fully possible for the
>> application software to implement this itself without needing anything
>> more from the kernel.
>
> VSS snapshots whole volumes, not individual files (so comparable to an
> LVM snapshot). The sub-folder freeze would be something useful in some
> situations, but duplicating the files+extents might also take too long
> in a lot of situations. You are correct that the kernel features are
> there and what is missing is a user-space daemon, plus a protocol that
> facilitates/coordinates the backups/snapshots.
>
> Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does
> not really help in some situations as e.g. MySQL InnoDB uses O_DIRECT
> and manages its own buffer pool which won't get the FIFREEZE and
> flush, but as said, the
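The O_DIRECT point above is why filesystem-level freezing alone cannot quiesce InnoDB; the coordination has to happen at the database layer. A minimal sketch using FLUSH TABLES WITH READ LOCK, which is only held while the issuing client session stays open - hence the snapshot is taken from inside that same session via the mysql client's `system` command (paths are placeholders; this is an illustration, not a hardened backup tool):

```shell
#!/bin/sh
# Hold MySQL's global read lock across a btrfs snapshot. The lock is
# released when the client session ends, so the snapshot command must
# run from within the same session.
snapshot_mysql() {
    src=$1; dst=$2
    mysql -u root <<EOF
FLUSH TABLES WITH READ LOCK;
system btrfs subvolume snapshot -r $src $dst
UNLOCK TABLES;
EOF
}

# Usage: snapshot_mysql /srv/mysql /srv/mysql-snap
```

Even so, the snapshot is only crash-consistent for InnoDB; restoring it still goes through normal crash recovery, as noted above.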
Re: BTRFS for OLTP Databases
On 2017-02-08 at 14:32, Austin S. Hemmelgarn wrote:
> On 2017-02-08 08:26, Martin Raiber wrote:
>> On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
>>> On 2017-02-08 07:14, Martin Raiber wrote:
>>>> [...]
>>>
>>> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
>>> reflinking to userspace, and therefore it's fully possible to
>>> implement this in userspace. Having a version of the fsfreeze (the
>>> generic form of xfs_freeze) stuff that worked on individual sub-trees
>>> would be nice from a practical perspective, but implementing it would
>>> not be easy by any means, and would be essentially necessary for a
>>> VSS-like API. In the meantime though, it is fully possible for the
>>> application software to implement this itself without needing
>>> anything more from the kernel.
>>
>> VSS snapshots whole volumes, not individual files (so comparable to an
>> LVM snapshot). The sub-folder freeze would be something useful in some
>> situations, but duplicating the files+extents might also take too long
>> in a lot of situations. You are correct that the kernel features are
>> there and what is missing is a user-space daemon, plus a protocol that
>> facilitates/coordinates the backups/snapshots.
>>
>> Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does
>> not really help in some situations as e.g. MySQL InnoDB uses O_DIRECT
>> and manages its own buffer pool which won't get the FIFREEZE and
>> flush, but as said, the default configuration is to flush/fsync on
>> every commit.
> OK, there's part of the misunderstanding. You can't FIFREEZE a BTRFS
> filesystem and then take a snapshot in it, because the snapshot
> requires writing to the filesystem (which the FIFREEZE would prevent,
> so a script that tried to do this would deadlock). A new version of
> the FIFREEZE ioctl would be needed that operates on subvolumes.

You can also put your filesystem on LVM, and take LVM snapshots.

--
Adrian Brzeziński
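For completeness, the btrfs-native path needs no freeze at all (and, per the deadlock point above, must not be combined with FIFREEZE): a read-only snapshot is created within a single btrfs transaction. A minimal sketch; paths are placeholders:

```shell
#!/bin/sh
# Take a read-only btrfs snapshot of a subvolume and make sure it
# reaches stable storage.
snap_db() {
    src=$1; dst=$2
    # -r makes the snapshot read-only (also what btrfs send expects)
    btrfs subvolume snapshot -r "$src" "$dst" || return 1
    # force a transaction commit so the new snapshot is on disk
    btrfs filesystem sync "$src"
}

# Usage: snap_db /srv/db /srv/db-snap
```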
Re: [PATCH] btrfs: qgroup: Move half of the qgroup accounting time out of commit trans
On Wed, Feb 8, 2017 at 1:56 AM, Qu Wenruo wrote:
> Just as Filipe pointed out, the most time consuming part of qgroup is
> btrfs_qgroup_account_extents() and
> btrfs_qgroup_prepare_account_extents().

There's an "and", so the "is" should be "are" and "part" should be "parts".

> Which both call btrfs_find_all_roots() to get old_roots and new_roots
> ulist.
>
> However for old_roots, we don't really need to calculate it at
> transaction commit time.
>
> This patch moves the old_roots accounting part out of
> commit_transaction(), so at least we won't block transaction too long.

Doing stuff inside btrfs_commit_transaction() is only bad if it's within
the critical section, that is, after setting the transaction's state to
TRANS_STATE_COMMIT_DOING and before setting the state to
TRANS_STATE_UNBLOCKED. This should be explained somehow in the
changelog.

>
> But please note that, this won't speed up qgroup overall, it just moves
> half of the cost out of commit_transaction().
>
> Cc: Filipe Manana
> Signed-off-by: Qu Wenruo
> ---
>  fs/btrfs/delayed-ref.c | 20
>  fs/btrfs/qgroup.c      | 33 ++---
>  fs/btrfs/qgroup.h      | 14 ++
>  3 files changed, 60 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index ef724a5..0ee927e 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -550,13 +550,14 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>                  struct btrfs_delayed_ref_node *ref,
>                  struct btrfs_qgroup_extent_record *qrecord,
>                  u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
> -                int action, int is_data)
> +                int action, int is_data, int *qrecord_inserted_ret)
>  {
>         struct btrfs_delayed_ref_head *existing;
>         struct btrfs_delayed_ref_head *head_ref = NULL;
>         struct btrfs_delayed_ref_root *delayed_refs;
>         int count_mod = 1;
>         int must_insert_reserved = 0;
> +       int qrecord_inserted = 0;
>
>         /* If reserved is provided, it must be a data extent. */
>         BUG_ON(!is_data && reserved);
> @@ -623,6 +624,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>                 if (btrfs_qgroup_trace_extent_nolock(fs_info,
>                                         delayed_refs, qrecord))
>                         kfree(qrecord);
> +               else
> +                       qrecord_inserted = 1;
>         }
>
>         spin_lock_init(&head_ref->lock);
> @@ -650,6 +653,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>                 atomic_inc(&delayed_refs->num_entries);
>                 trans->delayed_ref_updates++;
>         }
> +       if (qrecord_inserted_ret)
> +               *qrecord_inserted_ret = qrecord_inserted;
>         return head_ref;
>  }
>
> @@ -779,6 +784,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
>         struct btrfs_delayed_ref_head *head_ref;
>         struct btrfs_delayed_ref_root *delayed_refs;
>         struct btrfs_qgroup_extent_record *record = NULL;
> +       int qrecord_inserted;
>
>         BUG_ON(extent_op && extent_op->is_data);
>         ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
> @@ -806,12 +812,15 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
>          * the spin lock
>          */
>         head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
> -                                       bytenr, num_bytes, 0, 0, action, 0);
> +                                       bytenr, num_bytes, 0, 0, action, 0,
> +                                       &qrecord_inserted);
>
>         add_delayed_tree_ref(fs_info, trans, head_ref, &ref->node, bytenr,
>                              num_bytes, parent, ref_root, level, action);
>         spin_unlock(&delayed_refs->lock);
>
> +       if (qrecord_inserted)
> +               return btrfs_qgroup_trace_extent_post(fs_info, record);
>         return 0;
>
> free_head_ref:
> @@ -836,6 +845,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
>         struct btrfs_delayed_ref_head *head_ref;
>         struct btrfs_delayed_ref_root *delayed_refs;
>         struct btrfs_qgroup_extent_record *record = NULL;
> +       int qrecord_inserted;
>
>         BUG_ON(extent_op && !extent_op->is_data);
>         ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
> @@ -870,13 +880,15 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
>          */
>         head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
>                                         bytenr, num_bytes, ref_root, reserved,
> -                                       action, 1);
> +                                       action, 1, &qrecord_inserted);
>
>         add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
>
Re: Very slow balance / btrfs-transaction
On Wed, Feb 8, 2017 at 12:39 AM, Qu Wenruo wrote:
> At 02/07/2017 11:55 PM, Filipe Manana wrote:
>> On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo wrote:
>>> At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:
>>>> Hi Qu,
>>>>
>>>> On 02/05/2017 07:45 PM, Qu Wenruo wrote:
>>>>> At 02/04/2017 09:47 AM, Jorg Bornschein wrote:
>>>>>> February 4, 2017 1:07 AM, "Goldwyn Rodrigues" wrote:
>>>>>>
>>>>>> Quota support was indeed active -- and it warned me that the
>>>>>> qgroup data was inconsistent.
>>>>>>
>>>>>> Disabling quotas had an immediate impact on balance throughput --
>>>>>> it's *much* faster now! From a quick glance at iostat I would
>>>>>> guess it's at least a factor 100 faster.
>>>>>>
>>>>>> Should quota support generally be disabled during balances? Or did
>>>>>> I somehow push my fs into a weird state where it triggered a
>>>>>> slow-path?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>    j
>>>>>
>>>>> Would you please provide the kernel version?
>>>>>
>>>>> v4.9 introduced a bad fix for qgroup balance, which doesn't
>>>>> completely fix qgroup bytes leaking, but also hugely slows down the
>>>>> balance process:
>>>>>
>>>>> commit 62b99540a1d91e46422f0e04de50fc723812c421
>>>>> Author: Qu Wenruo
>>>>> Date:   Mon Aug 15 10:36:51 2016 +0800
>>>>>
>>>>>     btrfs: relocation: Fix leaking qgroups numbers on data extents
>>>>>
>>>>> Sorry for that.
>>>>>
>>>>> And in v4.10, a better method is applied to fix the byte leaking
>>>>> problem, and it should be a little faster than the previous one.
>>>>>
>>>>> commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
>>>>> Author: Qu Wenruo
>>>>> Date:   Tue Oct 18 09:31:29 2016 +0800
>>>>>
>>>>>     btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
>>>>>
>>>>> However, using balance with qgroup is still slower than balance
>>>>> without qgroup; the root fix needs us to rework current backref
>>>>> iteration.
>>>>
>>>> This patch has made the btrfs balance performance worse. The balance
>>>> task has become more CPU intensive compared to earlier and takes
>>>> longer to complete, besides hogging resources. While correctness is
>>>> important, we need to figure out how this can be made more
>>>> efficient.
>>>
>>> The cause is already known.
>>>
>>> It's find_parent_node() which takes most of the time to find all
>>> referencers of an extent.
>>>
>>> And it's also the cause of the FIEMAP softlockup (fixed in a recent
>>> release by quitting early).
>>>
>>> The biggest problem is, current find_parent_node() uses a list to
>>> iterate, which is quite slow, especially as it's done in a loop.
>>> In the real world find_parent_node() is about O(n^3).
>>> We can either improve find_parent_node() by using an rb_tree, or
>>> introduce some cache for find_parent_node().
>>
>> Even if anyone is able to reduce that function's complexity from
>> O(n^3) down to, let's say, O(n^2) or O(n log n) for example, the
>> current implementation of qgroups will always be a problem. The real
>> problem is that this more recent rework of qgroups does all this
>> accounting inside the critical section of a transaction - blocking any
>> other tasks that want to start a new transaction or attempt to join
>> the current transaction. Not to mention that on systems with small
>> amounts of memory (2Gb or 4Gb from what I've seen from user reports)
>> we also OOM due to this allocation of struct
>> btrfs_qgroup_extent_record per delayed data reference head, which is
>> used for that accounting phase in the critical section of a
>> transaction commit.
>>
>> Let's face it and be realistic: even if someone manages to make
>> find_parent_node() much, much better, like O(n) for example, it will
>> always be a problem due to the reasons mentioned before. Many extents
>> touched per transaction and many subvolumes/snapshots will always
>> expose that root problem - doing the accounting in the transaction
>> commit critical section.
>
> You must accept the fact that we must call find_parent_node() at least
> twice to get correct owner modification for each touched extent.
> Or the qgroup numbers will never be correct.
>
> One for old_roots by searching the commit root, and one for new_roots
> by searching the current root.
>
> You can call find_parent_node() as many times as you like, but that's
> just wasting your CPU time.
>
> Only the final find_parent_node() will determine new_roots for that
> extent, and there is no better timing than commit_transaction().

You're missing my point. My point is not about needing to call
find_parent_nodes() nor how many times to call it, or whether it's
needed or not. My point is about doing expensive things inside the
critical section of a transaction commit, which leads not only to low
performance but to a system becoming unresponsive and
Re: BTRFS for OLTP Databases
On 2017-02-08 08:26, Martin Raiber wrote:
> On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
>> On 2017-02-08 07:14, Martin Raiber wrote:
>>> [...]
>>
>> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
>> reflinking to userspace, and therefore it's fully possible to
>> implement this in userspace. [...]
>
> VSS snapshots whole volumes, not individual files (so comparable to an
> LVM snapshot). The sub-folder freeze would be something useful in some
> situations, but duplicating the files+extents might also take too long
> in a lot of situations. You are correct that the kernel features are
> there and what is missing is a user-space daemon, plus a protocol that
> facilitates/coordinates the backups/snapshots.
>
> Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does
> not really help in some situations as e.g. MySQL InnoDB uses O_DIRECT
> and manages its own buffer pool which won't get the FIFREEZE and
> flush, but as said, the default configuration is to flush/fsync on
> every commit.

OK, there's part of the misunderstanding. You can't FIFREEZE a BTRFS
filesystem and then take a snapshot in it, because the snapshot
requires writing to the filesystem (which the FIFREEZE would prevent,
so a script that tried to do this would deadlock). A new version of
the FIFREEZE ioctl would be needed that operates on subvolumes.
Re: BTRFS for OLTP Databases
On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
> On 2017-02-08 07:14, Martin Raiber wrote:
>> Hi,
>>
>> On 08.02.2017 03:11 Peter Zaitsev wrote:
>>> Out of curiosity, I see one problem here:
>>> If you're doing snapshots of the live database, each snapshot leaves
>>> the database files like killing the database in-flight. [...]
>>
>> little bit off topic, but I for one would be on board with such an
>> effort. It "just" needs coordination between the backup
>> software/snapshot tools, the backed up software and the various
>> snapshot providers. If you look at the Windows VSS API, this would be
>> a relatively large undertaking if all the corner cases are taken into
>> account, like e.g. a database having the database log on a separate
>> volume from the data, dependencies between different components etc.
>>
>> You'll know more about this, but databases usually fsync quite often
>> in their default configuration, so btrfs snapshots shouldn't be much
>> behind the properly snapshotted state, so I see the advantages more
>> with usability and taking care of corner cases automatically.
> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
> reflinking to userspace, and therefore it's fully possible to
> implement this in userspace. Having a version of the fsfreeze (the
> generic form of xfs_freeze) stuff that worked on individual sub-trees
> would be nice from a practical perspective, but implementing it would
> not be easy by any means, and would be essentially necessary for a
> VSS-like API. In the meantime though, it is fully possible for the
> application software to implement this itself without needing anything
> more from the kernel.

VSS snapshots whole volumes, not individual files (so comparable to an
LVM snapshot). The sub-folder freeze would be something useful in some
situations, but duplicating the files+extents might also take too long
in a lot of situations. You are correct that the kernel features are
there and what is missing is a user-space daemon, plus a protocol that
facilitates/coordinates the backups/snapshots.

Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does
not really help in some situations as e.g. MySQL InnoDB uses O_DIRECT
and manages its own buffer pool which won't get the FIFREEZE and
flush, but as said, the default configuration is to flush/fsync on
every commit.
Re: user_subvol_rm_allowed? Is there a user_subvol_create_deny|allowed?
On 2017-02-07 20:49, Nicholas D Steeves wrote: Dear btrfs community, Please accept my apologies in advance if I missed something in recent btrfs development; my MUA tells me I'm ~1500 unread messages out of date. :/ I recently read about "mount -t btrfs -o user_subvol_rm_allowed" while reading up on LXC handling of snapshots with the btrfs backend. Is this mount option per-subvolume, or per-volume? AFAIK, it's per-volume. Also, what mechanisms exist to restrict a user's ability to create an arbitrarily large number of snapshots? Is there a user_subvol_create_deny|allowed? From what I've read about the inverse correlation between the number of subvolumes and performance, a potentially hostile user could cause an I/O denial of service or potentially even trigger an ENOSPC. Currently, there is nothing that restricts this ability. This is one of a handful of outstanding issues that I'd love to see fixed, but don't have the time, patience, or background to fix myself. From what I gather, the following will reproduce the hypothetical issue related to my question:

    # as root
    btrfs sub create /some/dir/subvol
    chown some-user /some/dir/subvol

    # as some-user
    cd /some/dir/subvol
    cp -ar --reflink=always /some/big/files ./
    COUNT=1
    while true; do
        btrfs sub snap ./ ./snapshot-$COUNT
        COUNT=$((COUNT+1))
        sleep 2   # maybe unnecessary
    done

FWIW, this will cause all kinds of other issues too. It will, however, slow down over time as a result of these issues. The two biggest are:
1. Performance for large directories is horrendous, and degrades roughly in proportion to the number of directory entries (with a small exponent near 1). Past a few thousand entries, directory operations (especially stat() and readdir()) start to take long enough for a normal person to notice the latency.
2. Overall filesystem performance with lots of snapshots is horrendous too, and degrades similarly with the number of snapshots and the total amount of data in each. This will start being an issue much sooner than 1, somewhere around 300-400 snapshots most of the time.
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
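As the reply notes, nothing caps the *number* of snapshots a user can create. The closest existing knob is qgroups, which cap space rather than snapshot count. A hedged sketch follows; the paths and the 20G limit are made-up examples, and the run() wrapper defaults to printing the commands rather than executing them, since the real ones need root and a mounted btrfs:

```shell
#!/bin/sh
# Sketch: qgroups limit the *space* a subvolume can reference;
# they do not limit how many snapshots a user may create.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run btrfs quota enable /some/dir            # turn on qgroup accounting
run btrfs qgroup limit 20G /some/dir/subvol # cap referenced space (example size)
run btrfs qgroup show /some/dir             # inspect per-qgroup usage
```

Note that qgroup accounting itself adds per-snapshot overhead, so this mitigates the ENOSPC risk more than the performance risk.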
Re: BTRFS for OLTP Databases
On 2017-02-08 07:14, Martin Raiber wrote: Hi, On 08.02.2017 03:11 Peter Zaitsev wrote: Out of curiosity, I see one problem here: If you're doing snapshots of the live database, each snapshot leaves the database files like killing the database in-flight. Like shutting the system down in the middle of writing data. This is because I think there's no API for user space to subscribe to events like a snapshot - unlike e.g. the VSS API (volume snapshot service) in Windows. You should put the database into frozen state to prepare it for a hotcopy before creating the snapshot, then ensure all data is flushed before continuing. I think I've read that btrfs snapshots do not guarantee single point in time snapshots - the snapshot may be smeared across a longer period of time while the kernel is still writing data. So parts of your writes may still end up in the snapshot after issuing the snapshot command, instead of in the working copy as expected. How is this going to be addressed? Is there some snapshot aware API to let user space subscribe to such events and do proper preparation? Is this planned? LVM could be a user of such an API, too. I think this could have nice enterprise-grade value for Linux. XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But still, also this needs to be integrated with MySQL to properly work. I once (years ago) researched on this but gave up on my plans when I planned database backups for our web server infrastructure. We moved to creating SQL dumps instead, although there're binlogs which can be used to recover to a clean and stable transactional state after taking snapshots. But I simply didn't want to fiddle around with properly cleaning up binlogs which accumulate horribly much space usage over time. The cleanup process requires to create a cold copy or dump of the complete database from time to time, only then it's safe to remove all binlogs up to that point in time. 
A little bit off topic, but I for one would be on board with such an effort. It "just" needs coordination between the backup software/snapshot tools, the backed-up software and the various snapshot providers. If you look at the Windows VSS API, this would be a relatively large undertaking if all the corner cases are taken into account, like e.g. a database having the database log on a separate volume from the data, dependencies between different components etc. You'll know more about this, but databases usually fsync quite often in their default configuration, so btrfs snapshots shouldn't be much behind the properly snapshotted state, so I see the advantages more with usability and taking care of corner cases automatically. Just my perspective, but BTRFS (and XFS, and OCFS2) already provide reflinking to userspace, and therefore it's fully possible to implement this in userspace. Having a version of the fsfreeze (the generic form of xfs_freeze) stuff that worked on individual sub-trees would be nice from a practical perspective, but implementing it would not be easy by any means, and would be essentially necessary for a VSS-like API. In the meantime though, it is fully possible for the application software to implement this itself without needing anything more from the kernel.
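For the freeze-based approach discussed above, the generic tool is fsfreeze(8), which issues the same FIFREEZE/FITHAW ioctls that xfs_freeze does. Below is a hedged sketch for snapshotting *below* the filesystem (LVM here; the device and mount names are invented). One caveat: a btrfs snapshot needs the filesystem writable, so don't freeze the same btrfs you are about to `btrfs subvolume snapshot`. The run() wrapper defaults to printing commands, since the real ones need root:

```shell
#!/bin/sh
# Sketch: quiesce a filesystem around a block-level snapshot.
# DRY_RUN=1 (default) prints commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

MNT=/var/lib/mysql          # example mount point of the database volume
run fsfreeze --freeze "$MNT"                    # flush dirty data, block writers
run lvcreate -s -n db-snap -L 5G /dev/vg0/db    # example LVM snapshot of the LV
run fsfreeze --unfreeze "$MNT"                  # resume writes
```

As the thread notes, this still leaves an O_DIRECT database's own buffer pool unflushed; application-level quiescing is needed for a truly clean copy.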
Re: [PATCH] btrfs-progs: better document btrfs receive security
On Wed, Feb 08, 2017 at 07:29:22AM -0500, Austin S. Hemmelgarn wrote: > On 2017-02-07 13:27, David Sterba wrote: > > On Fri, Feb 03, 2017 at 08:48:58AM -0500, Austin S. Hemmelgarn wrote: > >> This adds some extra documentation to the btrfs-receive manpage that > >> explains some of the security related aspects of btrfs-receive. The > >> first part covers the fact that the subvolume being received is writable > >> until the receive finishes, and the second covers the current lack of > >> sanity checking of the send stream. > >> > >> Signed-off-by: Austin S. Hemmelgarn > > > > Applied, thanks. > > > Didn't get a chance to mention this yesterday, but it looks like you > hadn't seen the updated version I sent on the third. Message ID is: > <20170203193805.96977-1-ahferro...@gmail.com> Ah sorry I missed that. > The only significant difference is that I updated the description for > the writability issue using a much better description from Graham Cobb > (with his permission of course). > > If you want, I can send an incremental patch on top of the original to > update just that description. No need to, I'll replace the patch with the latest version. Thanks.
Re: dup vs raid1 in single disk
On 2017-02-07 17:28, Kai Krakow wrote: On Thu, 19 Jan 2017 15:02:14 -0500, "Austin S. Hemmelgarn" wrote: On 2017-01-19 13:23, Roman Mamedov wrote: On Thu, 19 Jan 2017 17:39:37 +0100 "Alejandro R. Mosteo" wrote: I was wondering, from a point of view of data safety, if there is any difference between using dup or making a raid1 from two partitions in the same disk. The idea is to have some protection against the typical aging HDD that starts to develop bad sectors. RAID1 will write slower compared to DUP, as any optimization to make RAID1 devices work in parallel will cause a total performance disaster for you: you will start trying to write to both partitions at the same time, turning all linear writes into random ones, which are about two orders of magnitude slower than linear on spinning hard drives. DUP shouldn't have this issue, but it will still be twice as slow as single, since you are writing everything twice. As of right now, there will actually be near zero impact on write performance (or at least, it's way less than the theoretical 50%) because there really isn't any optimization to speak of in the multi-device code. That will hopefully change over time, but it's not likely to do so any time soon since nobody appears to be working on multi-device write performance. I think that's only true if you don't account for the seek overhead. In single-device RAID1 mode you will always seek half of the device while writing data, and even when reading between odd and even PIDs. In contrast, DUP mode doesn't guarantee your seeks will be shorter, but from a statistical point of view, on average they should be. So it should yield better performance (though I wouldn't expect it to be observable, depending on your workload). So, on devices having no seek overhead (aka SSD), it is probably true (minus bus bandwidth considerations). For HDD I'd prefer DUP. From a data safety point of view: it's more likely that adjacent and nearby sectors are bad.
So DUP imposes a higher risk of data being written to only bad sectors - which means data loss or even file system loss (if metadata hits this problem). To be realistic: I wouldn't trade space usage for duplicate data on an already failing disk, no matter if it's DUP or RAID1. HDD disk space is cheap, and using such a scenario is just a waste of performance AND space - no matter what. I don't understand the purpose of this. It just results in fake safety. Better get two separate devices half the size. There's a better chance of getting a better cost/space ratio anyway, plus better performance and safety. There's also the fact that you're writing more metadata than data most of the time unless you're dealing with really big files, and metadata is already DUP mode (unless you are using an SSD), so the performance hit isn't 50%, it's actually a bit more than half the ratio of data writes to metadata writes. On a related note, I see this caveat about dup in the manpage: "For example, a SSD drive can remap the blocks internally to a single copy thus deduplicating them. This negates the purpose of increased redunancy (sic) and just wastes space" That ability is vastly overestimated in the man page. There is no miracle content-addressable storage system working at 500 MB/sec speeds all within a little cheap controller on SSDs. Likely most of what it can do is just compress simple stuff, such as runs of zeroes or other repeating byte sequences. Most of those that do in-line compression don't implement it in firmware, they implement it in hardware, and even DEFLATE can get 500 MB/second speeds if properly implemented in hardware. The firmware may control how the hardware works, but it's usually hardware doing the heavy lifting in that case, and getting a good ASIC made that can hit the required performance point for a reasonable compression algorithm like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI work. I still think it's a myth...
The overhead of managing inline deduplication is just way too high to implement it without jumping through expensive hoops. Most workloads have almost zero deduplication potential. And even when they do, the duplicates are spaced so far apart in time that an inline deduplicator won't catch them. Just like the proposed implementation in BTRFS, it's not complete deduplication. In fact, the only devices I've ever seen that do this appear to implement it just like what was proposed for BTRFS, just with a much smaller cache. They were also insanely expensive. If it were all so easy, btrfs would already have it working in mainline. I don't even remember whether those patches are still being worked on. With this in mind, I think dup metadata is still a good thing to have even on SSD and I would always force-enable it. Agreed. Potential for deduplication is only when using snapshots (which already are deduplicated when taken) or when handling
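Force-enabling DUP metadata, as suggested above, is a one-liner either at mkfs time or later via balance. A hedged sketch with made-up device/mount names; the run() wrapper prints commands by default, since the real ones need root and a btrfs filesystem:

```shell
#!/bin/sh
# Sketch: force DUP metadata on a single-device btrfs, even on SSD.
# DRY_RUN=1 (default) prints commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run mkfs.btrfs -m dup -d single /dev/sdX    # at creation time
run btrfs balance start -mconvert=dup /mnt  # or convert an existing filesystem
```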
Re: raid1: cannot add disk to replace faulty because can only mount fs as read-only.
On 2017-02-07 22:21, Hans Deragon wrote: Greetings, On 2017-02-02 10:06, Austin S. Hemmelgarn wrote: On 2017-02-02 09:25, Adam Borowski wrote: On Thu, Feb 02, 2017 at 07:49:50AM -0500, Austin S. Hemmelgarn wrote: This is a severe bug that makes a not all that uncommon (albeit bad) use case fail completely. The fix had no dependencies itself and I don't see what's bad in mounting a RAID degraded. Yeah, it provides no redundancy but that's no worse than using a single disk from the start. And most people not doing storage/server farm don't have a stack of spare disks at hand, so getting a replacement might take a while. Running degraded is bad. Period. If you don't have a disk on hand to replace the failed one (and if you care about redundancy, you should have at least one spare on hand), you should be converting to a single disk, not continuing to run in degraded mode until you get a new disk. The moment you start talking about running degraded long enough that you will be _booting_ the system with the array degraded, you need to be converting to a single disk. This is of course impractical for something like a hardware array or an LVM volume, but it's _trivial_ with BTRFS, and protects you from all kinds of bad situations that can't happen with a single disk but can completely destroy the filesystem if it's a degraded array. Running a single disk is not exactly the same as running a degraded array, it's actually marginally safer (even if you aren't using dup profile for metadata) because there are fewer moving parts to go wrong. It's also exponentially more efficient. Being able to continue to run when a disk fails is the whole point of RAID -- despite what some folks think, RAIDs are not for backups but for uptime. And if your uptime goes to hell because the moment a disk fails you need to drop everything and replace the disk immediately, why would you use RAID? 
Because just replacing a disk and rebuilding the array is almost always much cheaper in terms of time than rebuilding the system from a backup. IOW, even if you have to drop everything and replace the disk immediately, it's still less time consuming than restoring from a backup. It also has the advantage that you don't lose any data. We disagree on letting people run degraded, which I support and you do not. I respect your opinion. However, I have to ask who decides these rules? Obviously not me, since I am a simple btrfs home user. This is a pretty typical stance among seasoned system administrators. It's worth pointing out that I'm not saying you shouldn't run with a single disk for an extended period of time, I'm saying you should _convert_ to single disk profiles until you can get a replacement, and then convert back to raid profiles once you have the replacement. It is exponentially safer in BTRFS to run single data single metadata than half raid1 data half raid1 metadata. This is one of the big reasons that I've avoided MD over the years; it's functionally impossible to do this with MD arrays. Since Oracle is funding btrfs development, is that Oracle's official stance on how to handle a failed disk? Who decides btrfs's roadmap? I have no clue who is who on this mailing list and who influences the features of btrfs. Oracle is obviously using raid systems internally. How do the operators of these raid systems feel about this "not let the system run in degraded mode"? They replace the disks immediately, so it's irrelevant to them. Oracle isn't the sole source of funding (I'm actually not even sure they are anymore; CLM works for Facebook now, last I knew), but you have to understand that it has been developed primarily as an _enterprise_ filesystem. This means that certain perfectly reasonable assumptions are made about the conditions under which it will be used. As a home user, I do not want to have a spare disk always available.
That is paying dearly for a disk, when the raid system can easily run for two years without a disk failure. I want to buy the new disk (asap, of course) once one has died. By that time, the cost of a drive will have fallen drastically. Yes, I can live with running my home system (which has backups) for a day or two in degraded rw mode until I purchase and can install a new disk. Chances are low that both disks will quit at around the same time. You're missing my point. I have zero issue with running with one disk when the other fails. I have issue with not telling the FS that it won't have another disk for a while. IOW, in that situation, I would run: btrfs balance start -dconvert=single -mconvert=dup /whatever to convert to profiles _designed_ for a single device, and then convert back to raid1 when I got another disk. The issue you've stumbled across is only partial motivation for this; the bigger motivation is that running half a 2-disk array is more risky than running a single disk by itself. Simply because I cannot run in degraded mode and cannot add a disk
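The workflow being described can be spelled out as a command sequence. A hedged sketch, assuming /dev/sda2 is the surviving raid1 member and /dev/sdb2 the eventual replacement (both names invented); the run() wrapper prints commands by default, since the real ones need root and a degraded array:

```shell
#!/bin/sh
# Sketch: after losing one raid1 member, convert to single-device
# profiles, then back to raid1 once a replacement disk arrives.
# DRY_RUN=1 (default) prints commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run mount -o degraded /dev/sda2 /mnt                         # surviving member
run btrfs balance start -dconvert=single -mconvert=dup /mnt  # single-disk profiles
run btrfs device delete missing /mnt                         # forget the dead disk
# later, with the replacement installed:
run btrfs device add /dev/sdb2 /mnt
run btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
```

The convert runs before `device delete missing`, since raid1 chunks cannot be rebalanced away while only one device is present.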
Re: [PATCH] btrfs-progs: better document btrfs receive security
On 2017-02-07 13:27, David Sterba wrote: On Fri, Feb 03, 2017 at 08:48:58AM -0500, Austin S. Hemmelgarn wrote: This adds some extra documentation to the btrfs-receive manpage that explains some of the security related aspects of btrfs-receive. The first part covers the fact that the subvolume being received is writable until the receive finishes, and the second covers the current lack of sanity checking of the send stream. Signed-off-by: Austin S. Hemmelgarn

Applied, thanks.

Didn't get a chance to mention this yesterday, but it looks like you hadn't seen the updated version I sent on the third. Message ID is: <20170203193805.96977-1-ahferro...@gmail.com> The only significant difference is that I updated the description for the writability issue using a much better description from Graham Cobb (with his permission of course). If you want, I can send an incremental patch on top of the original to update just that description.
Re: BTRFS for OLTP Databases
Hi, On 08.02.2017 03:11 Peter Zaitsev wrote: > Out of curiosity, I see one problem here: > If you're doing snapshots of the live database, each snapshot leaves > the database files like killing the database in-flight. Like shutting > the system down in the middle of writing data. > > This is because I think there's no API for user space to subscribe to > events like a snapshot - unlike e.g. the VSS API (volume snapshot > service) in Windows. You should put the database into frozen state to > prepare it for a hotcopy before creating the snapshot, then ensure all > data is flushed before continuing. > > I think I've read that btrfs snapshots do not guarantee single point in > time snapshots - the snapshot may be smeared across a longer period of > time while the kernel is still writing data. So parts of your writes > may still end up in the snapshot after issuing the snapshot command, > instead of in the working copy as expected. > > How is this going to be addressed? Is there some snapshot aware API to > let user space subscribe to such events and do proper preparation? Is > this planned? LVM could be a user of such an API, too. I think this > could have nice enterprise-grade value for Linux. > > XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But > still, also this needs to be integrated with MySQL to properly work. I > once (years ago) researched on this but gave up on my plans when I > planned database backups for our web server infrastructure. We moved to > creating SQL dumps instead, although there're binlogs which can be used > to recover to a clean and stable transactional state after taking > snapshots. But I simply didn't want to fiddle around with properly > cleaning up binlogs which accumulate horribly much space usage over > time. The cleanup process requires to create a cold copy or dump of the > complete database from time to time, only then it's safe to remove all > binlogs up to that point in time. 
A little bit off topic, but I for one would be on board with such an effort. It "just" needs coordination between the backup software/snapshot tools, the backed-up software and the various snapshot providers. If you look at the Windows VSS API, this would be a relatively large undertaking if all the corner cases are taken into account, like e.g. a database having the database log on a separate volume from the data, dependencies between different components etc. You'll know more about this, but databases usually fsync quite often in their default configuration, so btrfs snapshots shouldn't be much behind the properly snapshotted state, so I see the advantages more with usability and taking care of corner cases automatically. Regards, Martin Raiber
Re: BTRFS for OLTP Databases
On 2017-02-07 15:54, Kai Krakow wrote: On Tue, 7 Feb 2017 15:27:34 -0500, "Austin S. Hemmelgarn" wrote: I'm not sure about this one. I would assume, based on the fact that many other things don't work with nodatacow and that regular defrag doesn't work on files which are currently mapped as executable code, that it does not, but I could be completely wrong about this too. Technically, there's nothing that prevents autodefrag from working on nodatacow files. The question is: is it really necessary? Standard file systems also have no autodefrag; it's not an issue there because they are essentially nodatacow. Simply defrag the database file once and you're done. Transactional MySQL uses huge data files, probably preallocated. It should simply work with nodatacow. The thing is, I don't have enough knowledge of how defrag is implemented in BTRFS to say for certain that it doesn't use COW semantics somewhere (and I would actually expect it to do so, since that in theory makes many things _much_ easier to handle), and if it uses COW somewhere, then it by definition doesn't work on NOCOW files. A dev would be needed on this. But from a non-dev point of view, the defrag operation itself is CoW: blocks are rewritten to another location in contiguous order. Only metadata CoW should be needed for this operation. It should be nothing else than writing to a nodatacow snapshot... Just that the snapshot is more or less implicit and temporary. Hmm? *curious* The gimmicky part though is that the file has to remain accessible throughout the entire operation, and the defrag can't lose changes that occur while the file is being defragmented. In many filesystems (NTFS on Windows for example), a defrag functions similarly to a pvmove operation in LVM: as each extent gets moved, writes to that region get redirected to the new location, and the areas that were written to are treated as having been moved already.
The thing is, on BTRFS that would result in extents getting split, which means COW is probably involved at some level in the data path too.
Re: understanding disk space usage
Thank you for the explanation. What I would still like to know is how to relate the chunk level abstraction to the file level abstraction. According to the btrfs output there is 2G of data space available and 24G of data space being used. Does this mean 24G of data used in files? How do I know which files take up most space? du seems pretty useless as it reports only 9G of files on the volume. -- Vasco On Wed, Feb 8, 2017 at 4:48 AM, Qu Wenruo wrote: > > > At 02/08/2017 12:44 AM, Vasco Visser wrote: >> >> Hello, >> >> My system is or seems to be running out of disk space but I can't find >> out how or why. Might be a BTRFS peculiarity, hence posting on this >> list. Most indicators seem to suggest I'm filling up, but I can't >> trace the disk usage to files on the FS. >> >> The issue is on my root filesystem on a 28GiB ssd partition (commands >> below issued when booted into single user mode): >> >> >> $ df -h >> Filesystem Size Used Avail Use% Mounted on >> /dev/sda3 28G 26G 2.1G 93% / >> >> >> $ btrfs --version >> btrfs-progs v4.4 >> >> >> $ btrfs fi usage / >> Overall: >> Device size: 27.94GiB >> Device allocated: 27.94GiB >> Device unallocated: 1.00MiB > > > So from the chunk level, your fs is already full. > > And balance won't succeed since there is no unallocated space at all. > The first 1M of btrfs is always reserved and won't be allocated, and 1M is > too small for btrfs to allocate a chunk. > >> Device missing: 0.00B >> Used: 25.03GiB >> Free (estimated): 2.37GiB (min: 2.37GiB) >> Data ratio: 1.00 >> Metadata ratio: 1.00 >> Global reserve: 256.00MiB (used: 0.00B) >> Data,single: Size:26.69GiB, Used:24.32GiB > > > You still have 2G of data space, so you can still write things. > >> /dev/sda3 26.69GiB >> Metadata,single: Size:1.22GiB, Used:731.45MiB > > > Metadata has less space when considering "Global reserve". > In fact the used space would be 987M. > > But it's still OK for normal write.
> >> /dev/sda3 1.22GiB >> System,single: Size:32.00MiB, Used:16.00KiB >> /dev/sda3 32.00MiB > > > System chunk can hardly be used up. > >> Unallocated: >> /dev/sda3 1.00MiB >> >> >> $ btrfs fi df / >> Data, single: total=26.69GiB, used=24.32GiB >> System, single: total=32.00MiB, used=16.00KiB >> Metadata, single: total=1.22GiB, used=731.48MiB >> GlobalReserve, single: total=256.00MiB, used=0.00B >> >> >> However: >> $ mount -o bind / /mnt >> $ sudo du -hs /mnt >> 9.3G /mnt >> >> >> Try to balance: >> $ btrfs balance start / >> ERROR: error during balancing '/': No space left on device >> >> >> Am I really filling up? What can explain the huge discrepancy with the >> output of du (no open file descriptors on deleted files can explain >> this in single user mode) and the FS stats? > > > Just don't believe the vanilla df output for btrfs. > > Btrfs, unlike other filesystems such as ext4/xfs, allocates chunks dynamically > and has different metadata/data profiles, so we can only get a clear view of the > fs from both the chunk level (allocated/unallocated) and the extent > level (total/used). > > In your case, your fs doesn't have any unallocated space, and this makes balance > unable to work at all. > > And your data/metadata usage is quite high; although both have a small amount of > space left, the fs should be writable for some time, but not long. > > To proceed, add a larger device to the current fs and do a balance, or just > delete the 28G partition and btrfs will handle the rest well. > > Thanks, > Qu > >> >> Any advice on possible causes and how to proceed? >> >> >> -- >> Vasco
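Qu's suggestion to add a device can be combined with a filtered balance to get unallocated space back. A hedged sketch; /dev/sdb1 stands in for any temporary second device, and the run() wrapper prints commands by default since the real ones need root:

```shell
#!/bin/sh
# Sketch: regain unallocated space on a fully-allocated filesystem by
# adding a temporary device, compacting half-empty data chunks, then
# removing the device again.
# DRY_RUN=1 (default) prints commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run btrfs device add /dev/sdb1 /      # temporary second device
run btrfs balance start -dusage=50 /  # rewrite data chunks that are <=50% full
run btrfs device delete /dev/sdb1 /   # migrate extents back, then shrink out
```

For the du discrepancy, `btrfs filesystem du` (available in btrfs-progs newer than the v4.4 shown in the thread) accounts for shared extents and answers "which files take up most space" more honestly than plain du.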
Re: dup vs raid1 in single disk
On 07/02/17 23:28, Kai Krakow wrote: To be realistic: I wouldn't trade space usage for duplicate data on an already failing disk, no matter if it's DUP or RAID1. HDD disk space is cheap, and using such a scenario is just a waste of performance AND space - no matter what. I don't understand the purpose of this. It just results in fake safety. The disk is already replaced and is no longer my workstation's main drive. I work with large datasets in my research, and I don't care much about sustained I/O efficiency, since they're only read when needed. Hence, it is a matter of squeezing the last life out of that disk instead of discarding it right away. This way I have one extra local store that may spare me a copy from a remote machine, so I prefer to play with it until it dies. Besides, it affords me a chance to play with btrfs/zfs in ways that I wouldn't normally risk, and I can also assess their behavior with a truly failing disk. In the end, after a destructive write pass with badblocks, the disk's increasing uncorrectable sectors have disappeared... go figure. So right now I have a btrfs filesystem built with the single profile on top of four differently sized partitions. When/if bad blocks reappear I'll test some raid configuration; probably raidz, unless btrfs raid5 is somewhat usable by then (why go with half a disk's worth when you can have 2/3? ;-)) Thanks for your justified concern though. Alex.
On a related note, I see this caveat about dup in the manpage: "For example, a SSD drive can remap the blocks internally to a single copy thus deduplicating them. This negates the purpose of increased redunancy (sic) and just wastes space" That ability is vastly overestimated in the man page. There is no miracle content-addressable storage system working at 500 MB/sec speeds all within a little cheap controller on SSDs. Likely most of what it can do is just compress simple stuff, such as runs of zeroes or other repeating byte sequences. Most of those that do in-line compression don't implement it in firmware, they implement it in hardware, and even DEFLATE can get 500 MB/second speeds if properly implemented in hardware. The firmware may control how the hardware works, but it's usually hardware doing the heavy lifting in that case, and getting a good ASIC made that can hit the required performance point for a reasonable compression algorithm like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI work. I still think it's a myth... The overhead of managing inline deduplication is just way too high to implement it without jumping through expensive hoops. Most workloads have almost zero deduplication potential. And even when they do, the duplicates are spaced so far apart in time that an inline deduplicator won't catch them. If it were all so easy, btrfs would already have it working in mainline. I don't even remember whether those patches are still being worked on. With this in mind, I think dup metadata is still a good thing to have even on SSD and I would always force-enable it. Potential for deduplication is only when using snapshots (which already are deduplicated when taken) or when handling user data on a file server in a multi-user environment. Users tend to copy their files all over the place - multiple directories of multiple gigabytes. Potential is also where you're working with client machine backups or vm images.
I regularly see deduplication efficiency of 30-60% in such scenarios - mostly the file servers which I'm handling. But due to the temporally far-spaced occurrence of duplicate blocks, only offline or nearline deduplication works here. And DUP mode is still useful on SSDs, for cases when one copy of the DUP gets corrupted in-flight due to a bad controller or RAM or cable; you could then restore that block from its good-CRC DUP copy. The only window of time during which bad RAM could result in only one copy of a block being bad is after the first copy is written but before the second is, which is usually an insanely small amount of time. As far as the cabling, the window for errors resulting in a single bad copy of a block is pretty much the same as for RAM, and if they're persistently bad, you're more likely to lose data for other reasons. It depends on the design of the software. That's true if this memory block is simply a single block throughout its lifetime in RAM before being written to storage. But if it is already handled as a duplicate block in memory, odds are different. I hope btrfs is doing this right... ;-) That