Re: understanding disk space usage
At 02/08/2017 12:44 AM, Vasco Visser wrote:
> Hello,
>
> My system is or seems to be running out of disk space, but I can't find
> out how or why. Might be a BTRFS peculiarity, hence posting on this
> list. Most indicators seem to suggest I'm filling up, but I can't trace
> the disk usage to files on the FS. The issue is on my root filesystem
> on a 28GiB SSD partition (commands below issued when booted into single
> user mode):
>
> $ df -h
> Filesystem  Size  Used  Avail  Use%  Mounted on
> /dev/sda3    28G   26G   2.1G   93%  /
>
> $ btrfs --version
> btrfs-progs v4.4
>
> $ btrfs fi usage /
> Overall:
>     Device size:          27.94GiB
>     Device allocated:     27.94GiB
>     Device unallocated:    1.00MiB

So at the chunk level your fs is already full, and balance won't succeed
since there is no unallocated space at all. The first 1M of btrfs is
always reserved and won't be allocated, and 1M is too small for btrfs to
allocate a chunk.

>     Device missing:          0.00B
>     Used:                 25.03GiB
>     Free (estimated):      2.37GiB  (min: 2.37GiB)
>     Data ratio:               1.00
>     Metadata ratio:           1.00
>     Global reserve:      256.00MiB  (used: 0.00B)
>
> Data,single: Size:26.69GiB, Used:24.32GiB

You still have 2G of data space, so you can still write things.

> /dev/sda3  26.69GiB
>
> Metadata,single: Size:1.22GiB, Used:731.45MiB

Metadata has less space when considering the "Global reserve"; in fact
the used space would be 987M. But it's still OK for normal writes.

> /dev/sda3  1.22GiB
>
> System,single: Size:32.00MiB, Used:16.00KiB
> /dev/sda3  32.00MiB

The system chunk can hardly be used up.

> Unallocated:
> /dev/sda3  1.00MiB
>
> $ btrfs fi df /
> Data, single: total=26.69GiB, used=24.32GiB
> System, single: total=32.00MiB, used=16.00KiB
> Metadata, single: total=1.22GiB, used=731.48MiB
> GlobalReserve, single: total=256.00MiB, used=0.00B
>
> However:
>
> $ mount -o bind / /mnt
> $ sudo du -hs /mnt
> 9.3G  /mnt
>
> Try to balance:
>
> $ btrfs balance start /
> ERROR: error during balancing '/': No space left on device
>
> Am I really filling up?
> What can explain the huge discrepancy between the output of du (no
> open file descriptors on deleted files can explain this in single user
> mode) and the FS stats?

Just don't believe the vanilla df output for btrfs. Unlike other
filesystems such as ext4/xfs, btrfs allocates chunks dynamically and has
separate metadata/data profiles, so we can only get a clear view of the
fs from both the chunk level (allocated/unallocated) and the extent
level (total/used).

In your case, your fs doesn't have any unallocated space, which makes
balance unable to work at all. And your data/metadata usage is quite
high; although both have a little available space left, the fs should
stay writable for some time, but not long.

To proceed, add a larger device to the current fs and do a balance, or
just delete the 28G partition; btrfs will handle the rest well.

Thanks,
Qu

> Any advice on possible causes and how to proceed?
>
> --
> Vasco
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
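[Editor's note] For readers hitting the same state, Qu's suggested way out can be sketched as a short command sequence. The helper device /dev/sdb and the usage threshold are assumptions, not from the thread, and all commands need root on a real btrfs mount:

```shell
# All chunks are allocated, so balance has no unallocated space to work
# with. Temporarily add a second device to give the chunk allocator room:
btrfs device add /dev/sdb /

# Compact mostly-empty data chunks; -dusage=50 only touches chunks that
# are at most 50% full, so it finishes quickly:
btrfs balance start -dusage=50 /

# Once unallocated space exists again, the helper device can be removed;
# btrfs migrates its chunks back to the original disk:
btrfs device remove /dev/sdb /
```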
Re: raid1: cannot add disk to replace faulty because can only mount fs as read-only.
Greetings,

On 2017-02-02 10:06, Austin S. Hemmelgarn wrote:
> On 2017-02-02 09:25, Adam Borowski wrote:
>> On Thu, Feb 02, 2017 at 07:49:50AM -0500, Austin S. Hemmelgarn wrote:
>>> This is a severe bug that makes a not all that uncommon (albeit bad)
>>> use case fail completely. The fix had no dependencies itself and
>>
>> I don't see what's bad in mounting a RAID degraded. Yeah, it provides
>> no redundancy, but that's no worse than using a single disk from the
>> start. And most people not doing storage/server farms don't have a
>> stack of spare disks at hand, so getting a replacement might take a
>> while.
>
> Running degraded is bad. Period. If you don't have a disk on hand to
> replace the failed one (and if you care about redundancy, you should
> have at least one spare on hand), you should be converting to a single
> disk, not continuing to run in degraded mode until you get a new disk.
> The moment you start talking about running degraded long enough that
> you will be _booting_ the system with the array degraded, you need to
> be converting to a single disk. This is of course impractical for
> something like a hardware array or an LVM volume, but it's _trivial_
> with BTRFS, and protects you from all kinds of bad situations that
> can't happen with a single disk but can completely destroy the
> filesystem if it's a degraded array. Running a single disk is not
> exactly the same as running a degraded array; it's actually marginally
> safer (even if you aren't using the dup profile for metadata) because
> there are fewer moving parts to go wrong. It's also exponentially more
> efficient.
>
>> Being able to continue to run when a disk fails is the whole point of
>> RAID -- despite what some folks think, RAIDs are not for backups but
>> for uptime. And if your uptime goes to hell because the moment a disk
>> fails you need to drop everything and replace the disk immediately,
>> why would you use RAID?
> Because just replacing a disk and rebuilding the array is almost
> always much cheaper in terms of time than rebuilding the system from a
> backup. IOW, even if you have to drop everything and replace the disk
> immediately, it's still less time consuming than restoring from a
> backup. It also has the advantage that you don't lose any data.

We disagree on letting people run degraded: I support it, you do not. I
respect your opinion. However, I have to ask: who decides these rules?
Obviously not me, since I am a simple btrfs home user. Since Oracle is
funding btrfs development, is that Oracle's official stand on how to
handle a failed disk? Who decides btrfs's roadmap? I have no clue who is
who on this mailing list and who influences the features of btrfs.
Oracle is obviously using raid systems internally. How do the operators
of those raid systems feel about this "don't let the system run in
degraded mode" policy?

As a home user, I do not want to keep a spare disk always available.
That means paying for a disk very expensively, when the raid system can
easily run for two years without a disk failure. I want to buy the new
disk (asap, of course) once one has died; by that moment, the cost of a
drive will have fallen drastically. Yes, I can live with running my home
system (which has backups) for a day or two in degraded rw mode until I
purchase and can install a new disk. Chances are low that both disks
will quit at around the same time.

Simply because I cannot run in degraded mode and cannot add a disk to my
current degraded raid1, despite having my replacement disk in my hands,
I must resort to switching to mdadm or zfs.

Having a policy that limits users' options on the assumption that they
are too stupid to understand the implications is wrong. It's OK for
applications, but not at the operating system level; there should be a
way to force this. A
--yes-i-know-what-i-am-doing-now-please-mount-rw-degraded-so-i-can-install-the-new-disk
parameter must be implemented.
Currently, it is like disallowing root to run mkfs over an existing
filesystem because people could erase data by mistake. Let people do
what they want and let them live with the consequences. hdparm has a
--yes-i-know-what-i-am-doing flag; btrfs needs one. Whoever decides
which btrfs features to add, please consider this one.

Best regards,
Hans Deragon
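[Editor's note] For what it's worth, the replace workflow Hans wants is available whenever the degraded filesystem can still be mounted rw; a hedged sketch follows. The device names and devid are hypothetical, and the single-chunk caveat discussed in this thread can still leave the fs ro-only after one degraded rw mount:

```shell
# Two-disk raid1 where /dev/sdb has died; /dev/sdc is the replacement.
mount -o degraded /dev/sda /mnt

# Replace the missing device by its devid (find the id of the missing
# disk with `btrfs filesystem show /mnt`; "2" here is an assumption):
btrfs replace start -B 2 /dev/sdc /mnt

# Convert back any chunks that were created as single/dup while the
# array was degraded ("soft" skips chunks already in the right profile):
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt
```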
Re: BTRFS for OLTP Databases
Hi Kai,

I guess your message did not make it to me as I'm not subscribed to the
list.

I totally understand that the snapshot is "crash consistent":
consistent with the state of the disk you would find if you cut the
power with no notice. For many applications that is a problem; however,
it is fine for many databases, which already need to be able to recover
correctly from power loss. For MySQL this works well for the InnoDB
storage engine; it does not work for MyISAM.

The great thing about such an "uncoordinated" snapshot is that it is
instant and has very little production impact. If you want to "freeze"
multiple filesystems, or even worse flush MyISAM tables, it can take a
lot of time and can be unacceptable for many 24/7 workloads.

Or are you saying BTRFS snapshots do not provide this kind of
consistency?

> Hi Hugo,
>
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time. Imagine for example snapshot being created every
> hour and several of these snapshots kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago. In such case (as I
> think you also describe) nodatacow does not provide any advantage.

Out of curiosity, I see one problem here:

If you're doing snapshots of the live database, each snapshot leaves the
database files like killing the database in-flight, like shutting the
system down in the middle of writing data. This is because, as far as I
know, there's no API for user space to subscribe to events like a
snapshot, unlike e.g. the VSS API (Volume Shadow Copy Service) in
Windows. You should put the database into a frozen state to prepare it
for a hot copy before creating the snapshot, then ensure all data is
flushed before continuing.

I think I've read that btrfs snapshots do not guarantee single
point-in-time snapshots: the snapshot may be smeared across a longer
period of time while the kernel is still writing data.
So parts of your writes may still end up in the snapshot after issuing
the snapshot command, instead of in the working copy as expected. How is
this going to be addressed? Is there some snapshot-aware API to let user
space subscribe to such events and do proper preparation? Is this
planned? LVM could be a user of such an API, too. I think this could
have nice enterprise-grade value for Linux.

XFS has xfs_freeze and xfs_thaw for this, to prepare for LVM snapshots.
But still, this also needs to be integrated with MySQL to work properly.

I once (years ago) researched this but gave up on my plans when I
planned database backups for our web server infrastructure. We moved to
creating SQL dumps instead, although there are binlogs which can be
used to recover to a clean and stable transactional state after taking
snapshots. But I simply didn't want to fiddle around with properly
cleaning up binlogs, which accumulate horribly much space usage over
time. The cleanup process requires creating a cold copy or dump of the
complete database from time to time; only then is it safe to remove all
binlogs up to that point in time.

--
Regards,
Kai

On Tue, Feb 7, 2017 at 9:00 AM, Hugo Mills wrote:
> On Tue, Feb 07, 2017 at 08:53:35AM -0500, Peter Zaitsev wrote:
>> Hi,
>>
>> I have tried BTRFS from Ubuntu 16.04 LTS for a write intensive OLTP
>> MySQL workload.
>>
>> It did not go very well, ranging from multi-second stalls where no
>> transactions are completed, to finally a kernel OOPS with a "no space
>> left on device" error message and the filesystem going read only.
>>
>> I'm a complete newbie in BTRFS, so I assume I'm doing something wrong.
>>
>> Do you have any advice on how BTRFS should be tuned for an OLTP
>> workload (large files having a lot of random writes)? Or is this the
>> case where one should simply stay away from BTRFS and use something
>> else?
>>
>> One item recommended in some places is "nodatacow"; this however
>> defeats the main purpose I'm looking at BTRFS for -- I am interested
>> in "free" snapshots, which look very attractive to use for database
>> recovery scenarios, allowing instant rollback to the previous state.
>
> Well, nodatacow will still allow snapshots to work, but it also
> allows the data to fragment. Each snapshot made will cause subsequent
> writes to shared areas to be CoWed once (and then it reverts to
> unshared and nodatacow again).
>
> There's another approach which might be worth testing, which is to
> use autodefrag. This will increase data write I/O, because where you
> have one or more small writes in a region, it will also read and write
> the data in a small neighbourhood around those writes, so the
> fragmentation is reduced. This will improve subsequent read
> performance.
>
> I could also suggest getting the latest kernel you can -- 16.04 is
> already getting on for a year old, and there may be performance
> improvements in upstream kernels which affect your workload. There's
> an Ubuntu kernel PPA you can use to get the new kernels without too
> much pain.
>
> Hugo.
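[Editor's note] For a snapshot that is clean rather than merely crash-consistent, the freeze-then-snapshot approach Kai describes can be sketched as below. The paths, the subvolume layout, and running the client-side SYSTEM command from a heredoc are all assumptions; as Peter notes, InnoDB would also recover fine from an uncoordinated snapshot:

```shell
# Quiesce MySQL, take a read-only snapshot of the datadir subvolume,
# then release the lock -- all within one client session so the global
# read lock is held while the snapshot is created:
mysql --user=root <<'SQL'
FLUSH TABLES WITH READ LOCK;
SYSTEM btrfs subvolume snapshot -r /var/lib/mysql /var/lib/mysql-snapshots/hourly
UNLOCK TABLES;
SQL
```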
[PATCH] btrfs: qgroup: Move half of the qgroup accounting time out of commit trans
Just as Filipe pointed out, the most time consuming parts of qgroup are
btrfs_qgroup_account_extents() and
btrfs_qgroup_prepare_account_extents(), which both call
btrfs_find_all_roots() to get the old_roots and new_roots ulists.

However for old_roots, we don't really need to calculate it at
transaction commit time.

This patch moves the old_roots accounting part out of
commit_transaction(), so at least we won't block the transaction for too
long.

But please note that this won't speed up qgroup overall; it just moves
half of the cost out of commit_transaction().

Cc: Filipe Manana
Signed-off-by: Qu Wenruo
---
 fs/btrfs/delayed-ref.c | 20 
 fs/btrfs/qgroup.c      | 33 ++---
 fs/btrfs/qgroup.h      | 14 ++
 3 files changed, 60 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index ef724a5..0ee927e 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -550,13 +550,14 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
		     struct btrfs_delayed_ref_node *ref,
		     struct btrfs_qgroup_extent_record *qrecord,
		     u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
-		     int action, int is_data)
+		     int action, int is_data, int *qrecord_inserted_ret)
 {
	struct btrfs_delayed_ref_head *existing;
	struct btrfs_delayed_ref_head *head_ref = NULL;
	struct btrfs_delayed_ref_root *delayed_refs;
	int count_mod = 1;
	int must_insert_reserved = 0;
+	int qrecord_inserted = 0;

	/* If reserved is provided, it must be a data extent. */
	BUG_ON(!is_data && reserved);
@@ -623,6 +624,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
		if (btrfs_qgroup_trace_extent_nolock(fs_info,
				delayed_refs, qrecord))
			kfree(qrecord);
+		else
+			qrecord_inserted = 1;
	}

	spin_lock_init(&head_ref->lock);
@@ -650,6 +653,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
		atomic_inc(&delayed_refs->num_entries);
		trans->delayed_ref_updates++;
	}
+	if (qrecord_inserted_ret)
+		*qrecord_inserted_ret = qrecord_inserted;
	return head_ref;
 }

@@ -779,6 +784,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
	struct btrfs_delayed_ref_head *head_ref;
	struct btrfs_delayed_ref_root *delayed_refs;
	struct btrfs_qgroup_extent_record *record = NULL;
+	int qrecord_inserted;

	BUG_ON(extent_op && extent_op->is_data);
	ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
@@ -806,12 +812,15 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
	 * the spin lock
	 */
	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
-					bytenr, num_bytes, 0, 0, action, 0);
+					bytenr, num_bytes, 0, 0, action, 0,
+					&qrecord_inserted);

	add_delayed_tree_ref(fs_info, trans, head_ref, &ref->node, bytenr,
			     num_bytes, parent, ref_root, level, action);
	spin_unlock(&delayed_refs->lock);

+	if (qrecord_inserted)
+		return btrfs_qgroup_trace_extent_post(fs_info, record);
	return 0;

free_head_ref:
@@ -836,6 +845,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
	struct btrfs_delayed_ref_head *head_ref;
	struct btrfs_delayed_ref_root *delayed_refs;
	struct btrfs_qgroup_extent_record *record = NULL;
+	int qrecord_inserted;

	BUG_ON(extent_op && !extent_op->is_data);
	ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
@@ -870,13 +880,15 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
	 */
	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
					bytenr, num_bytes, ref_root, reserved,
-					action, 1);
+					action, 1, &qrecord_inserted);

	add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
			     num_bytes, parent, ref_root, owner, offset,
			     action);
	spin_unlock(&delayed_refs->lock);

+	if (qrecord_inserted)
+		return btrfs_qgroup_trace_extent_post(fs_info, record);
	return 0;
 }

@@ -899,7 +911,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
	add_delayed_ref_head(fs_info, trans, &head_ref->node, NULL, bytenr,
			     num_bytes, 0, 0, BTRFS_UPDATE_DELAYED_HEAD,
-			     extent_op->is_data);
+			     extent_op->is_data, NULL);
user_subvol_rm_allowed? Is there a user_subvol_create_deny|allowed?
Dear btrfs community,

Please accept my apologies in advance if I missed something in recent
btrfs development; my MUA tells me I'm ~1500 unread messages
out-of-date. :/

I recently read about "mount -t btrfs -o user_subvol_rm_allowed" while
reading up on LXC's handling of snapshots with the btrfs backend.

Is this mount option per-subvolume, or per-volume?

Also, what mechanisms are there to restrict a user's ability to create
an arbitrarily large number of snapshots? Is there a
user_subvol_create_deny|allowed? Given what I've read about the inverse
correlation between the number of subvolumes and performance, a
potentially hostile user could cause an IO denial of service, or
potentially even trigger an ENOSPC.

From what I gather, the following will reproduce the hypothetical issue
related to my question:

# as root
btrfs sub create /some/dir/subvol
chown some-user /some/dir/subvol

# as some-user
cd /some/dir/subvol
cp -ar --reflink=always /some/big/files ./
COUNT=1
while [ 0 -lt 1 ]; do
    btrfs sub snap ./ ./snapshot-$COUNT
    COUNT=$((COUNT+1))
    sleep 2  # maybe unnecessary
done

I hope there's something I've misunderstood or failed to read! Please CC
me so your reply will hit my main inbox :-)

Nicholas
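[Editor's note] There is, as far as I know, no user_subvol_create_deny knob; the closest existing mechanism is qgroups, which cap the space a user's subvolumes can pin, though not the number of snapshots. A hypothetical mitigation sketch (the subvolume id 0/257 and the 10G limit are assumptions; paths as in the reproducer above):

```shell
# as root: enable quotas and create a parent qgroup with a size cap
btrfs quota enable /some/dir
btrfs qgroup create 1/100 /some/dir
btrfs qgroup limit 10G 1/100 /some/dir

# assign the user's subvolume (level-0 qgroup 0/<subvolid>) to the
# parent; snapshots get their own level-0 qgroups, so they must also be
# placed under 1/100 (e.g. with `btrfs subvolume snapshot -i 1/100`)
# for the cap to cover them:
btrfs qgroup assign 0/257 1/100 /some/dir
```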
Re: [lustre-devel] [PATCH 04/24] fs: Provide infrastructure for dynamic BDIs in filesystems
On Feb 2, 2017, at 10:34, Jan Kara wrote:
>
> Provide helper functions for setting up dynamically allocated
> backing_dev_info structures for filesystems and cleaning them up on
> superblock destruction.
>
> CC: linux-...@lists.infradead.org
> CC: linux-...@vger.kernel.org
> CC: Petr Vandrovec
> CC: linux-ni...@vger.kernel.org
> CC: cluster-de...@redhat.com
> CC: osd-...@open-osd.org
> CC: codal...@coda.cs.cmu.edu
> CC: linux-...@lists.infradead.org
> CC: ecryp...@vger.kernel.org
> CC: linux-c...@vger.kernel.org
> CC: ceph-de...@vger.kernel.org
> CC: linux-btrfs@vger.kernel.org
> CC: v9fs-develo...@lists.sourceforge.net
> CC: lustre-de...@lists.lustre.org
> Signed-off-by: Jan Kara
> ---
> fs/super.c                       | 49 
> include/linux/backing-dev-defs.h |  2 +-
> include/linux/fs.h               |  6 +
> 3 files changed, 56 insertions(+), 1 deletion(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index ea662b0e5e78..31dc4c6450ef 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -446,6 +446,11 @@ void generic_shutdown_super(struct super_block *sb)
> 		hlist_del_init(&sb->s_instances);
> 		spin_unlock(&sb_lock);
> 		up_write(&sb->s_umount);
> +	if (sb->s_iflags & SB_I_DYNBDI) {
> +		bdi_put(sb->s_bdi);
> +		sb->s_bdi = &noop_backing_dev_info;
> +		sb->s_iflags &= ~SB_I_DYNBDI;
> +	}
> }
>
> EXPORT_SYMBOL(generic_shutdown_super);
> @@ -1249,6 +1254,50 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
> }
>
> /*
> + * Setup private BDI for given superblock. I gets automatically cleaned up

(typo) s/I/It/

Looks fine otherwise.

> + * in generic_shutdown_super().
> + */
> +int super_setup_bdi_name(struct super_block *sb, char *fmt, ...)
> +{
> +	struct backing_dev_info *bdi;
> +	int err;
> +	va_list args;
> +
> +	bdi = bdi_alloc(GFP_KERNEL);
> +	if (!bdi)
> +		return -ENOMEM;
> +
> +	bdi->name = sb->s_type->name;
> +
> +	va_start(args, fmt);
> +	err = bdi_register_va(bdi, NULL, fmt, args);
> +	va_end(args);
> +	if (err) {
> +		bdi_put(bdi);
> +		return err;
> +	}
> +	WARN_ON(sb->s_bdi != &noop_backing_dev_info);
> +	sb->s_bdi = bdi;
> +	sb->s_iflags |= SB_I_DYNBDI;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(super_setup_bdi_name);
> +
> +/*
> + * Setup private BDI for given superblock. I gets automatically cleaned up
> + * in generic_shutdown_super().
> + */
> +int super_setup_bdi(struct super_block *sb)
> +{
> +	static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
> +
> +	return super_setup_bdi_name(sb, "%.28s-%ld", sb->s_type->name,
> +				    atomic_long_inc_return(&bdi_seq));
> +}
> +EXPORT_SYMBOL(super_setup_bdi);
> +
> +/*
>  * This is an internal function, please use sb_end_{write,pagefault,intwrite}
>  * instead.
>  */
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index 2ecafc8a2d06..70080b4217f4 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -143,7 +143,7 @@ struct backing_dev_info {
> 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
> 	void *congested_data;	/* Pointer to aux data for congested func */
>
> -	char *name;
> +	const char *name;
>
> 	struct kref refcnt;	/* Reference counter for the structure */
> 	unsigned int registered:1;	/* Is bdi registered? */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c930cbc19342..8ed8b6d1bc54 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1267,6 +1267,9 @@ struct mm_struct;
> /* sb->s_iflags to limit user namespace mounts */
> #define SB_I_USERNS_VISIBLE	0x0010 /* fstype already mounted */
>
> +/* Temporary flag until all filesystems are converted to dynamic bdis */
> +#define SB_I_DYNBDI	0x0100
> +
> /* Possible states of 'frozen' field */
> enum {
> 	SB_UNFROZEN = 0,	/* FS is unfrozen */
> @@ -2103,6 +2106,9 @@ extern int vfs_ustat(dev_t, struct kstatfs *);
> extern int freeze_super(struct super_block *super);
> extern int thaw_super(struct super_block *super);
> extern bool our_mnt(struct vfsmount *mnt);
> +extern __printf(2, 3)
> +int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
> +extern int super_setup_bdi(struct super_block *sb);
>
> extern int current_umask(void);
>
> --
> 2.10.2
>
> ___
> lustre-devel mailing list
> lustre-de...@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation
Troubleshooting crash due to running out of space issues
I'm running BTRFS on Ubuntu 16.04. I was testing intensive database IO,
which ends up with a pretty fragmented data file:

root@blinky:/var/lib/mysql/sbtest# filefrag sbtest1.ibd
sbtest1.ibd: 13415923 extents found

This is a 500G device which is some 60% full:

/dev/nvme0n1  500107608  308009444  189718556  62%  /mnt/data/mysql

root@blinky:/# btrfs fi show
Label: none  uuid: 2a396366-e3c9-4d14-b4cc-3d8992bd1c6b
	Total devices 1 FS bytes used 293.24GiB
	devid    1 size 476.94GiB used 476.94GiB path /dev/nvme0n1

This file (sbtest1.ibd) takes some 250GB, the majority of the space. As
I try to defrag this file with:

btrfs fi defrag sbtest1.ibd

I either get a "no space available" error and the file system goes read
only, or the filesystem completely breaks down and IO errors are
reported.

Note that during my experiments I have mounted this filesystem
repeatedly with and without the nodatacow and autodefrag options. I
assume these should not cause any file system damage, do they?

Here is the portion of the latest log:

Feb  7 19:10:42 blinky kernel: [40722.055010] [ cut here ]
Feb  7 19:10:42 blinky kernel: [40722.055060] WARNING: CPU: 12 PID: 17002 at /build/linux-W6HB68/linux-4.4.0/fs/btrfs/extent-tree.c:6552 __btrfs_free_extent.isra.70+0x2e6/0xd30 [btrfs]()
Feb  7 19:10:42 blinky kernel: [40722.055063] BTRFS: error (device nvme0n1) in __btrfs_free_extent:6552: errno=-28 No space left
Feb  7 19:10:42 blinky kernel: [40722.055066] BTRFS info (device nvme0n1): forced readonly
Feb  7 19:10:42 blinky kernel: [40722.055068] BTRFS: error (device nvme0n1) in btrfs_run_delayed_refs:2927: errno=-28 No space left
Feb  7 19:10:42 blinky kernel: [40722.055086] BTRFS: Transaction aborted (error -28)
Feb  7 19:10:42 blinky kernel: [40722.055087] Modules linked in: snd_hda_codec_hdmi nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm snd_hda_codec_realtek irqbypass snd_hda_codec_generic serio_raw sb_edac snd_hda_intel edac_core snd_hda_codec snd_hda_core snd_hwdep snd_pcm input_leds snd_timer mei_me lpc_ich snd mei soundcore shpchp tpm_infineon mac_hid ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid nouveau crct10dif_pclmul mxm_wmi crc32_pclmul video ghash_clmulni_intel i2c_algo_bit aesni_intel ttm aes_x86_64 lrw drm_kms_helper gf128mul glue_helper ablk_helper syscopyarea psmouse cryptd sysfillrect e1000e sysimgblt fb_sys_fops ahci libahci alx ptp drm mdio pps_core nvme fjes wmi
Feb  7 19:10:42 blinky kernel: [40722.055180] CPU: 12 PID: 17002 Comm: btrfs-transacti Tainted: G W 4.4.0-62-generic #83-Ubuntu
Feb  7 19:10:42 blinky kernel: [40722.055181] Hardware name: Gigabyte Technology Co., Ltd. Default string/X99-Ultra Gaming-CF, BIOS F5 08/29/2016
Feb  7 19:10:42 blinky kernel: [40722.055183] 0286 7c7f47a0 880f1b4c7b00 813f7c63
Feb  7 19:10:42 blinky kernel: [40722.055185] 880f1b4c7b48 c0392498 880f1b4c7b38 810812d2
Feb  7 19:10:42 blinky kernel: [40722.055188] 007021b82000 ffe4 880fe62a4000
Feb  7 19:10:42 blinky kernel: [40722.055190] Call Trace:
Feb  7 19:10:42 blinky kernel: [40722.055197] [] dump_stack+0x63/0x90
Feb  7 19:10:42 blinky kernel: [40722.055202] [] warn_slowpath_common+0x82/0xc0
Feb  7 19:10:42 blinky kernel: [40722.055205] [] warn_slowpath_fmt+0x5c/0x80
Feb  7 19:10:42 blinky kernel: [40722.055219] [] __btrfs_free_extent.isra.70+0x2e6/0xd30 [btrfs]
Feb  7 19:10:42 blinky kernel: [40722.055232] [] __btrfs_run_delayed_refs+0x444/0x11f0 [btrfs]
Feb  7 19:10:42 blinky kernel: [40722.055236] [] ? lock_timer_base.isra.22+0x54/0x70
Feb  7 19:10:42 blinky kernel: [40722.055248] [] btrfs_run_delayed_refs+0x7d/0x2a0 [btrfs]
Feb  7 19:10:42 blinky kernel: [40722.055251] [] ? del_timer_sync+0x48/0x50
Feb  7 19:10:42 blinky kernel: [40722.055266] [] btrfs_commit_transaction+0xac/0xa90 [btrfs]
Feb  7 19:10:42 blinky kernel: [40722.055279] [] transaction_kthread+0x229/0x240 [btrfs]
Feb  7 19:10:42 blinky kernel: [40722.055291] [] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs]
Feb  7 19:10:42 blinky kernel: [40722.055294] [] kthread+0xd8/0xf0
Feb  7 19:10:42 blinky kernel: [40722.055296] [] ? kthread_create_on_node+0x1e0/0x1e0
Feb  7 19:10:42 blinky kernel: [40722.055300] [] ret_from_fork+0x3f/0x70
Feb  7 19:10:42 blinky kernel: [40722.055301] [] ? kthread_create_on_node+0x1e0/0x1e0
Feb  7 19:10:42 blinky kernel: [40722.055303] ---[ end trace 92a6418dcae8a352 ]---
Feb  7 19:10:42 blinky kernel: [40722.055306] BTRFS: error (device nvme0n1) in __btrfs_free_extent:6552: errno=-28 No space left
Feb  7 19:10:42 blinky kernel: [40722.055314] BTRFS: error (device nvme0n1) in btrfs_run_delayed_refs:2927: errno=-28 No space left
Feb  7
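[Editor's note] The `btrfs fi show` output above shows the device fully allocated (size 476.94GiB, used 476.94GiB), the classic precondition for these errno=-28 transaction aborts. A usual first-aid sketch, not a guaranteed fix (the usage thresholds are assumptions):

```shell
# Compact nearly-empty chunks to regain unallocated space; start with a
# low usage threshold and raise it only if nothing is reclaimed:
btrfs balance start -dusage=5 /mnt/data/mysql
btrfs balance start -dusage=25 /mnt/data/mysql

# For the heavily rewritten InnoDB file itself, NOCOW only helps if set
# on the directory *before* the file is created; chattr +C has no
# effect on data already written:
chattr +C /var/lib/mysql/sbtest
```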
Re: Very slow balance / btrfs-transaction
At 02/07/2017 11:55 PM, Filipe Manana wrote:
> On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo wrote:
>> At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:
>>> Hi Qu,
>>>
>>> On 02/05/2017 07:45 PM, Qu Wenruo wrote:
>>>> At 02/04/2017 09:47 AM, Jorg Bornschein wrote:
>>>>> February 4, 2017 1:07 AM, "Goldwyn Rodrigues" wrote:
>>>>>
>>>>> Quota support was indeed active, and it warned me that the qgroup
>>>>> data was inconsistent. Disabling quotas had an immediate impact on
>>>>> balance throughput: it's *much* faster now! From a quick glance at
>>>>> iostat I would guess it's at least a factor of 100 faster.
>>>>>
>>>>> Should quota support generally be disabled during balances? Or did
>>>>> I somehow push my fs into a weird state where it triggered a
>>>>> slow-path?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> j
>>>>
>>>> Would you please provide the kernel version?
>>>>
>>>> v4.9 introduced a bad fix for qgroup balance, which doesn't
>>>> completely fix qgroup byte leaking, but also hugely slows down the
>>>> balance process:
>>>>
>>>> commit 62b99540a1d91e46422f0e04de50fc723812c421
>>>> Author: Qu Wenruo
>>>> Date:   Mon Aug 15 10:36:51 2016 +0800
>>>>
>>>>     btrfs: relocation: Fix leaking qgroups numbers on data extents
>>>>
>>>> Sorry for that.
>>>>
>>>> And in v4.10, a better method is applied to fix the byte leaking
>>>> problem, and it should be a little faster than the previous one:
>>>>
>>>> commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
>>>> Author: Qu Wenruo
>>>> Date:   Tue Oct 18 09:31:29 2016 +0800
>>>>
>>>>     btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
>>>>
>>>> However, balance with qgroup is still slower than balance without
>>>> qgroup; the root fix needs us to rework the current backref
>>>> iteration.
>>>
>>> This patch has made btrfs balance performance worse. The balance
>>> task has become more CPU intensive compared to earlier and takes
>>> longer to complete, besides hogging resources. While correctness is
>>> important, we need to figure out how this can be made more
>>> efficient.
>>
>> The cause is already known. It's find_parent_node(), which takes most
>> of the time to find all referencers of an extent. And it's also the
>> cause of the FIEMAP softlockup (fixed in a recent release by quitting
>> early).
>>
>> The biggest problem is that the current find_parent_node() uses a
>> list to iterate, which is quite slow, especially when it's done in a
>> loop. In the real world find_parent_node() is about O(n^3).
>>
>> We can either improve find_parent_node() by using an rb_tree, or
>> introduce some cache for find_parent_node().
>
> Even if anyone is able to reduce that function's complexity from
> O(n^3) down to, let's say, O(n^2) or O(n log n) for example, the
> current implementation of qgroups will always be a problem. The real
> problem is that this more recent rework of qgroups does all this
> accounting inside the critical section of a transaction, blocking any
> other tasks that want to start a new transaction or attempt to join
> the current transaction. Not to mention that on systems with small
> amounts of memory (2Gb or 4Gb from what I've seen from user reports)
> we also OOM due to this allocation of struct
> btrfs_qgroup_extent_record per delayed data reference head, which are
> used for that accounting phase in the critical section of a
> transaction commit.
>
> Let's face it and be realistic: even if someone manages to make
> find_parent_node() much, much better, like O(n) for example, it will
> always be a problem due to the reasons mentioned before. Many extents
> touched per transaction and many subvolumes/snapshots will always
> expose that root problem of doing the accounting in the transaction
> commit critical section.

You must accept the fact that we must call find_parent_node() at least
twice to get the correct owner modification for each touched extent, or
the qgroup numbers will never be correct: once for old_roots by
searching the commit root, and once for new_roots by searching the
current root.

You can call find_parent_node() as many times as you like, but that's
just wasting your CPU time. Only the final find_parent_node() will
determine new_roots for that extent, and there is no better timing than
commit_transaction().

Or you can waste more time calling find_parent_node() every time you
touch an extent, saving one find_parent_node() in commit_transaction()
at the cost of more find_parent_node() calls elsewhere. Is that what
you want?

I can move the find_parent_node() for old_roots out of
commit_transaction(), but that will only reduce 50% of the time spent
in commit_transaction(). Compared to the O(n^3) find_parent_node(),
that's not even the determining factor.

Thanks,
Qu

>> IIRC the SUSE guys (maybe Jeff?) are working on it with the first
>> method, but I didn't hear anything about it recently.
>>
>> Thanks,
>> Qu
Re: dup vs raid1 in single disk
On 8 February 2017 at 08:28, Kai Krakow wrote: > I still think it's a myth... The overhead of managing inline > deduplication is just way too high to implement it without jumping > through expensive hoops. Most workloads have almost zero deduplication > potential. And even when they do, the duplicates are spaced so far apart > in time that an inline deduplicator won't catch them. > > If it were all so easy, btrfs would already have it working in > mainline. I don't even remember whether those patches are still being > worked on. > > With this in mind, I think dup metadata is still a good thing to have > even on SSD and I would always force-enable it. > > Potential for deduplication exists only when using snapshots (which are > already deduplicated when taken) or when handling user data on a file > server in a multi-user environment. Users tend to copy their files all > over the place - multiple directories of multiple gigabytes. Potential > is also there when you're working with client machine backups or VM images. > I regularly see deduplication efficiency of 30-60% in such scenarios - > mostly the file servers which I'm handling. But due to the temporally > far-spaced occurrence of duplicate blocks, only offline or nearline > deduplication works here. I'm a sysadmin by trade, managing many PB of storage for a media company. Our primary storage is Oracle ZFS appliances, and all of our secondary/nearline storage is Linux+BtrFS. ZFS's inline deduplication is awful. It consumes enormous amounts of RAM that would be orders of magnitude more valuable as ARC/cache, and becomes immediately useless whenever a storage node is rebooted (necessary to apply mandatory security patches) and the in-memory tables are lost (meaning cold data is rarely re-examined, and the inline dedup becomes less efficient). Conversely, I use "duperemove" as a one-shot/offline deduplication tool on all of our BtrFS storage. 
It can be set up as a cron job to run outside of business hours, and uses an SQLite database to store the necessary dedup hash information on disk rather than in RAM. From the point of view of someone who manages large amounts of long-term centralised storage, this is a far superior way to deal with deduplication, as it offers more flexibility and far better space-saving ratios at a lower memory cost. We trialled ZFS dedup for a few months and decided to turn it off, as there was far less benefit to ZFS using all that RAM for dedup than there was for it to be cache. I've been requesting that Oracle offer a similar offline dedup tool for their ZFS appliance for a very long time, and if BtrFS ever did offer inline dedup, I wouldn't bother using it, for all of the reasons above. -Dan
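The offline workflow Dan describes can be sketched roughly like this (paths and schedule are illustrative, not from the original mail):

```shell
# Nightly offline dedup pass over a btrfs mount (illustrative paths).
# --hashfile keeps the block hashes in an SQLite database on disk
# instead of RAM, so later runs only re-hash files that have changed.
# -d submits the actual dedup ioctls; -r recurses into subdirectories.
duperemove -dr --hashfile=/var/cache/duperemove.db /srv/storage
```

Dropping a line like this into a cron schedule keeps the dedup work outside business hours, as described above.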
Re: dup vs raid1 in single disk
On 02/07/2017 11:28 PM, Kai Krakow wrote: > Am Thu, 19 Jan 2017 15:02:14 -0500 > schrieb "Austin S. Hemmelgarn": > >> On 2017-01-19 13:23, Roman Mamedov wrote: >>> On Thu, 19 Jan 2017 17:39:37 +0100 >>> [...] >>> And the DUP mode is still useful on SSDs, for cases when one copy >>> of the DUP gets corrupted in-flight due to a bad controller or RAM >>> or cable, you could then restore that block from its good-CRC DUP >>> copy. >> The only window of time during which bad RAM could result in only one >> copy of a block being bad is after the first copy is written but >> before the second is, which is usually an insanely small amount of >> time. As far as the cabling, the window for errors resulting in a >> single bad copy of a block is pretty much the same as for RAM, and if >> they're persistently bad, you're more likely to lose data for other >> reasons. > > It depends on the design of the software. That's true if this memory > block is simply a single block throughout its lifetime in RAM before > being written to storage. But if it is already handled as a duplicate > block in memory, the odds are different. I hope btrfs is doing this right... ;-) In memory, it's just one copy, happily sitting around, getting corrupted by cosmic rays and other stuff done to it by aliens, after which a valid checksum is calculated for the corrupt data, after which it goes on its way to disk, twice. Yay. >> That said, I do still feel that DUP mode has value on SSDs. The >> primary arguments against it are: >> 1. It wears out the SSD faster. > > I don't think this is a huge factor, even more so when looking at the > TBW capabilities of modern SSDs. And prices are low enough to make it > better to swap early than to wait for disaster to hit you. Instead, you can still > use the old SSD for archival storage (but this has drawbacks, don't > leave them without power for months or years!) or as a shock-resistant > USB mobile drive on the go. > >> 2. 
The blocks are likely to end up in the same erase block, and >> therefore there will be no benefit. > > Oh, this is probably a point to really think about... Would ssd_spread > help here? I think there was another one: SSD firmware deduplicating writes, converting the DUP back into a single copy and giving a false sense of it being DUP. This is one that can be solved by e.g. using disk encryption, which causes identical writes to show up as different data on disk. -- Hans van Kranenburg
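A minimal sketch of the encryption idea: stacking btrfs on dm-crypt makes the two DUP copies reach the SSD as different ciphertext, so firmware-level dedup can't collapse them. Device names and the mapping label are examples only:

```shell
# Sketch, assuming /dev/sda3 is the target partition. With dm-crypt's
# default XTS mode the same plaintext written at two different offsets
# produces different ciphertext, defeating firmware deduplication.
cryptsetup luksFormat /dev/sda3
cryptsetup open /dev/sda3 cryptroot
mkfs.btrfs -m dup /dev/mapper/cryptroot   # force DUP metadata even on an SSD
```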
Re: [PATCH] btrfs-progs: better document btrfs receive security
Am Fri, 3 Feb 2017 08:48:58 -0500 schrieb "Austin S. Hemmelgarn": > +user who is running receive, and then move then into the final > destination Typo? s/then/them/? -- Regards, Kai Replies to list-only preferred.
Re: BTRFS for OLTP Databases
On 02/07/2017 10:35 PM, Kai Krakow wrote: > Am Tue, 7 Feb 2017 22:25:29 +0100 > schrieb Lionel Bouton: > >> Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit : >>> On 2017-02-07 15:36, Kai Krakow wrote: Am Tue, 7 Feb 2017 09:13:25 -0500 schrieb Peter Zaitsev : >> [...] Out of curiosity, I see one problem here: If you're doing snapshots of the live database, each snapshot leaves the database files like killing the database in-flight. Like shutting the system down in the middle of writing data. This is because I think there's no API for user space to subscribe to events like a snapshot - unlike e.g. the VSS API (volume snapshot service) in Windows. You should put the database into frozen state to prepare it for a hotcopy before creating the snapshot, then ensure all data is flushed before continuing. >>> Correct. I think I've read that btrfs snapshots do not guarantee single point in time snapshots - the snapshot may be smeared across a longer period of time while the kernel is still writing data. So parts of your writes may still end up in the snapshot after issuing the snapshot command, instead of in the working copy as expected. >>> Also correct AFAICT, and this needs to be better documented (for >>> most people, the term snapshot implies atomicity of the >>> operation). >> >> Atomicity can be a relative term. If the snapshot atomicity is >> relative to barriers but not relative to individual writes between >> barriers then AFAICT it's fine because the filesystem doesn't make >> any promise it won't keep even in the context of its snapshots. >> Consider a power loss : the filesystems atomicity guarantees can't go >> beyond what the hardware guarantees which means not all current in fly >> write will reach the disk and partial writes can happen. Modern >> filesystems will remain consistent though and if an application using >> them makes uses of f*sync it can provide its own guarantees too. 
The >> same should apply to snapshots: all the in-flight writes may or may not >> complete on disk before the snapshot; what matters is that both the snapshot >> and these writes will be completed after the next barrier (and any >> robust application will ignore all the in-flight writes it finds in the >> snapshot if they were part of a batch that should be atomically >> committed). >> >> This is why AFAIK PostgreSQL or MySQL with their default ACID >> compliant configuration will recover from a BTRFS snapshot in the >> same way they recover from a power loss. > This is what I meant in my other reply. But this is also why it should > be documented. Wrongly implying that snapshots are single-point-in-time > snapshots is a wrong assumption with possibly horrible side effects one > wouldn't expect. It depends on what the definition of time is. (whoa!!) A snapshot is taken of a single point in the lifetime of a filesystem tree (a generation, the point where a transaction commits)...? > Taking a snapshot is like a power loss - even tho there is no power > loss. So the database has to be properly configured. It is simply > short-sighted if you don't think about this fact. The documentation should > really point that fact out. I'd almost say that it would be short-sighted to assume a btrfs snapshot would *not* behave like a power loss. At least, to me (thinking as a sysadmin) it feels really weird to think of it in any other way than that. Oh wait, that's what you mean, or not? What is the thing that the documentation should point out? I'm not trying to troll; the piled-up double negations make this discussion a bit hard to read. Moo -- Hans van Kranenburg
Re: dup vs raid1 in single disk
Am Thu, 19 Jan 2017 15:02:14 -0500 schrieb "Austin S. Hemmelgarn": > On 2017-01-19 13:23, Roman Mamedov wrote: > > On Thu, 19 Jan 2017 17:39:37 +0100 > > "Alejandro R. Mosteo" wrote: > > > >> I was wondering, from a point of view of data safety, if there is > >> any difference between using dup or making a raid1 from two > >> partitions in the same disk. This is thinking on having some > >> protection against the typical aging HDD that starts to have bad > >> sectors. > > > > RAID1 will write slower compared to DUP, as any optimization to > > make RAID1 devices work in parallel will cause a total performance > > disaster for you as you will start trying to write to both > > partitions at the same time, turning all linear writes into random > > ones, which are about two orders of magnitude slower than linear on > > spinning hard drives. DUP shouldn't have this issue, but still it > > will be twice slower than single, since you are writing everything > > twice. > As of right now, there will actually be near zero impact on write > performance (or at least, it's way less than the theoretical 50%) > because there really isn't any optimization to speak of in the > multi-device code. That will hopefully change over time, but it's > not likely to do so any time in the future since nobody appears to be > working on multi-device write performance. I think that's only true if you don't account the seek overhead. In single device RAID1 mode you will always seek half of the device while writing data, and even when reading between odd and even PIDs. In contrast, DUP mode doesn't guarantee your seeks to be shorter but from a statistical point of view, on the average it should be shorter. So it should yield better performance (tho I wouldn't expect it to be observable, depending on your workload). So, on devices having no seek overhead (aka SSD), it is probably true (minus bus bandwidth considerations). For HDD I'd prefer DUP. 
From a data safety point of view: It's more likely that adjacent and nearby sectors are bad. So DUP imposes a higher risk of data being written only to bad sectors - which means data loss or even filesystem loss (if metadata hits this problem). To be realistic: I wouldn't trade space usage for duplicate data on an already failing disk, no matter whether it's DUP or RAID1. HDD space is cheap, and such a scenario is just a waste of performance AND space - no matter what. I don't understand the purpose of this; it just results in fake safety. Better to get two separate devices of half the size. There's a better chance of getting a good cost/space ratio anyway, plus better performance and safety. > There's also the fact that you're writing more metadata than data > most of the time unless you're dealing with really big files, and > metadata is already DUP mode (unless you are using an SSD), so the > performance hit isn't 50%, it's actually a bit more than half the > ratio of data writes to metadata writes. > > > >> On a related note, I see this caveat about dup in the manpage: > >> > >> "For example, a SSD drive can remap the blocks internally to a > >> single copy thus deduplicating them. This negates the purpose of > >> increased redunancy (sic) and just wastes space" > > > > That ability is vastly overestimated in the man page. There is no > > miracle content-addressable storage system working at 500 MB/sec > > speeds all within a little cheap controller on SSDs. Likely most of > > what it can do, is just compress simple stuff, such as runs of > > zeroes or other repeating byte sequences. > Most of those that do in-line compression don't implement it in > firmware, they implement it in hardware, and even DEFLATE can get 500 > MB/second speeds if properly implemented in hardware. 
The firmware > may control how the hardware works, but it's usually hardware doing > heavy lifting in that case, and getting a good ASIC made that can hit > the required performance point for a reasonable compression algorithm > like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI > work. I still think it's a myth... The overhead of managing inline deduplication is just way too high to implement it without jumping through expensive hoops. Most workloads have almost zero deduplication potential. And even when they do, the duplicates are spaced so far apart in time that an inline deduplicator won't catch them. If it were all so easy, btrfs would already have it working in mainline. I don't even remember whether those patches are still being worked on. With this in mind, I think dup metadata is still a good thing to have even on SSDs, and I would always force-enable it. Potential for deduplication exists only when using snapshots (which are already deduplicated when taken) or when handling user data on a file server in a multi-user environment. Users tend to copy their files all over the place - multiple directories of multiple gigabytes.
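For reference, the two single-disk layouts discussed in this thread correspond to mkfs invocations roughly like these (device names are examples; older btrfs-progs may only accept dup for metadata, not data):

```shell
# DUP: both copies live on the one device, written close together.
mkfs.btrfs -m dup -d dup /dev/sda1

# RAID1 over two partitions of the same disk: one copy per partition,
# which turns linear writes into long cross-partition seeks on an HDD.
mkfs.btrfs -m raid1 -d raid1 /dev/sda1 /dev/sda2
```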
Re: BTRFS for OLTP Databases
On 02/07/2017 07:59 PM, Peter Zaitsev wrote: > > So far the most frustrating for me was periodic stalls for many seconds > (running a sysbench workload). What was most puzzling: I get this > even if I run the workload at 50% or less of the full load - i.e. the > database can handle 1000 transactions/sec and I only inject 500/sec, > and I still have those stalls. > > This is where it looks to me like some work is being delayed and then > requires a stall of a few seconds to catch up. I wonder if there > are some configuration options available to play with. What happens during these stalls? Do you mean a 'stall' like it seems nothing is happening at all, or a 'stall' during which something is so busy that something else cannot continue? Is there some kernel thread doing a lot of cpu? What does /proc/<pid>/stack show? Is it huge write spikes with not many writes in between, or do you generate enough action to be writing to disk all the time? If the stalls show the behaviour of huge disk-write spikes, during which applications seem to be blocked from continuing to write more, and if during that time you see btrfs-transaction active in the kernel, and, if your test is doing a lot of writes all over the place (not only simply appending table files sequentially, but changing a lot and touching a lot of metadata) and you're pushing it, it might be space cache related. I think the /proc/<pid>/stack of btrfs-transaction will show you something related to the free space cache in this case. In this case, it might be interesting to test the free space tree (instead of the default free space cache): http://events.linuxfoundation.org/sites/events/files/slides/vault2016_0.pdf Using the free space tree helped me a lot on write-heavy filesystems (like a backup server with concurrent rsync data streaming in, also doing snapshotting) that used to have incoming traffic drop to the ground every time there was a transaction commit. 
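For anyone wanting to try this, switching to the free space tree is a mount-time change (kernel 4.5+); device and mount point here are illustrative:

```shell
# clear_cache drops the old v1 free space cache; space_cache=v2 builds
# the free space tree on this mount and persists on later mounts.
mount -o clear_cache,space_cache=v2 /dev/sdb1 /var/lib/mysql
```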
-- Hans van Kranenburg
Re: BTRFS for OLTP Databases
Am Tue, 7 Feb 2017 22:25:29 +0100 schrieb Lionel Bouton: > Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit : > > On 2017-02-07 15:36, Kai Krakow wrote: > >> Am Tue, 7 Feb 2017 09:13:25 -0500 > >> schrieb Peter Zaitsev : > >> > [...] > >> > >> Out of curiosity, I see one problem here: > >> > >> If you're doing snapshots of the live database, each snapshot > >> leaves the database files like killing the database in-flight. > >> Like shutting the system down in the middle of writing data. > >> > >> This is because I think there's no API for user space to subscribe > >> to events like a snapshot - unlike e.g. the VSS API (volume > >> snapshot service) in Windows. You should put the database into > >> frozen state to prepare it for a hotcopy before creating the > >> snapshot, then ensure all data is flushed before continuing. > > Correct. > >> > >> I think I've read that btrfs snapshots do not guarantee single > >> point in time snapshots - the snapshot may be smeared across a > >> longer period of time while the kernel is still writing data. So > >> parts of your writes may still end up in the snapshot after > >> issuing the snapshot command, instead of in the working copy as > >> expected. > > Also correct AFAICT, and this needs to be better documented (for > > most people, the term snapshot implies atomicity of the > > operation). > > Atomicity can be a relative term. If the snapshot atomicity is > relative to barriers but not relative to individual writes between > barriers then AFAICT it's fine because the filesystem doesn't make > any promise it won't keep even in the context of its snapshots. > Consider a power loss : the filesystems atomicity guarantees can't go > beyond what the hardware guarantees which means not all current in fly > write will reach the disk and partial writes can happen. Modern > filesystems will remain consistent though and if an application using > them makes uses of f*sync it can provide its own guarantees too. 
The > same should apply to snapshots: all the in-flight writes may or may not > complete on disk before the snapshot; what matters is that both the snapshot > and these writes will be completed after the next barrier (and any > robust application will ignore all the in-flight writes it finds in the > snapshot if they were part of a batch that should be atomically > committed). > > This is why AFAIK PostgreSQL or MySQL with their default ACID > compliant configuration will recover from a BTRFS snapshot in the > same way they recover from a power loss. This is what I meant in my other reply. But this is also why it should be documented. Assuming that snapshots are single-point-in-time snapshots is wrong, with possibly horrible side effects one wouldn't expect. Taking a snapshot is like a power loss - even tho there is no power loss. So the database has to be properly configured. It is simply short-sighted if you don't think about this fact. The documentation should really point that fact out. -- Regards, Kai Replies to list-only preferred.
Re: BTRFS for OLTP Databases
Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit : > On 2017-02-07 15:36, Kai Krakow wrote: >> Am Tue, 7 Feb 2017 09:13:25 -0500 >> schrieb Peter Zaitsev: >> >>> Hi Hugo, >>> >>> For the use case I'm looking for I'm interested in having snapshot(s) >>> open at all time. Imagine for example snapshot being created every >>> hour and several of these snapshots kept at all time providing quick >>> recovery points to the state of 1,2,3 hours ago. In such case (as I >>> think you also describe) nodatacow does not provide any advantage. >> >> Out of curiosity, I see one problem here: >> >> If you're doing snapshots of the live database, each snapshot leaves >> the database files like killing the database in-flight. Like shutting >> the system down in the middle of writing data. >> >> This is because I think there's no API for user space to subscribe to >> events like a snapshot - unlike e.g. the VSS API (volume snapshot >> service) in Windows. You should put the database into frozen state to >> prepare it for a hotcopy before creating the snapshot, then ensure all >> data is flushed before continuing. > Correct. >> >> I think I've read that btrfs snapshots do not guarantee single point in >> time snapshots - the snapshot may be smeared across a longer period of >> time while the kernel is still writing data. So parts of your writes >> may still end up in the snapshot after issuing the snapshot command, >> instead of in the working copy as expected. > Also correct AFAICT, and this needs to be better documented (for most > people, the term snapshot implies atomicity of the operation). Atomicity can be a relative term. If the snapshot atomicity is relative to barriers but not relative to individual writes between barriers then AFAICT it's fine because the filesystem doesn't make any promise it won't keep even in the context of its snapshots. 
Consider a power loss: the filesystem's atomicity guarantees can't go beyond what the hardware guarantees, which means not all current in-flight writes will reach the disk and partial writes can happen. Modern filesystems will remain consistent though, and if an application using them makes use of f*sync it can provide its own guarantees too. The same should apply to snapshots: all the in-flight writes may or may not complete on disk before the snapshot; what matters is that both the snapshot and these writes will be completed after the next barrier (and any robust application will ignore all the in-flight writes it finds in the snapshot if they were part of a batch that should be atomically committed). This is why AFAIK PostgreSQL or MySQL with their default ACID-compliant configuration will recover from a BTRFS snapshot in the same way they recover from a power loss. Lionel
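For readers who would rather not rely on crash recovery at all, the database can be quiesced for the instant of the snapshot. A rough MySQL sketch (paths and names are illustrative, not from this thread, and assume the `system` client command is available in batch mode):

```shell
# Hold a global read lock only while the (instant) snapshot is taken;
# the lock is released when the client session ends.
mysql -e "FLUSH TABLES WITH READ LOCK;
          system btrfs subvolume snapshot -r /var/lib/mysql /snap/mysql-$(date +%F);
          UNLOCK TABLES;"
```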
Re: BTRFS for OLTP Databases
Am Tue, 7 Feb 2017 10:43:11 -0500 schrieb "Austin S. Hemmelgarn": > > I mean that: > > You have a 128MB extent, you rewrite random 4k sectors, btrfs will > > not split 128MB extent, and not free up data, (i don't know > > internal algo, so i can't predict when this will hapen), and after > > some time, btrfs will rebuild extents, and split 128 MB exten to > > several more smaller. But when you use compression, allocator > > rebuilding extents much early (i think, it's because btrfs also > > operates with that like 128kb extent, even if it's a continuos > > 128MB chunk of data). > The allocator has absolutely nothing to do with this, it's a function > of the COW operation. Unless you're using nodatacow, that 128MB > extent will get split the moment the data hits the storage device > (either on the next commit cycle (at most 30 seconds with the default > commit cycle), or when fdatasync is called, whichever is sooner). In > the case of compression, it's still one extent (although on disk it > will be less than 128MB) and will be split at _exactly_ the same time > under _exactly_ the same circumstances as an uncompressed extent. > IOW, it has absolutely nothing to do with the extent handling either. I don't think that btrfs splits extents which are part of the snapshot. The extent in a snapshot will stay intact when writing to this extent in another snapshot. Of course, in the just written snapshot, the extent will be represented as a split extent mapping to the original extents data blocks plus the new data in the middle (thus resulting in three extents). This is also why small random writes without autodefrag result in a vast amount of small extents bringing the fs performance to a crawl. 
Do that multiple times on multiple snapshots, delete some of the original snapshots, and you're left with slack space: data blocks that are inaccessible but won't be reclaimed as free space (because they are still part of the original extent), and which can only be reclaimed by a defrag operation - which of course unshares data. Thus, if any of the above-mentioned small extents is still shared with an extent that was originally much bigger, it will still occupy its original space on the filesystem - even when its associated snapshot/subvolume no longer exists. Only when the last remaining tiny block of such an extent gets rewritten and the reference counter drops to zero is the extent given up and freed. To work around this, you can currently only unshare and recombine by doing defrag and dedupe on all snapshots. This will reclaim space sitting in parts of the original extents no longer referenced by a snapshot visible from the VFS layer. This is for performance reasons, because btrfs is extent-based. As far as I know, ZFS works differently: it uses block-based storage for the snapshot feature and can easily throw away unused blocks. Only a second layer on top maps this back into extents. The underlying infrastructure, however, is block-based storage, which also enables the volume pool to create block devices on the fly out of ZFS storage space. PS: All of the above given the fact I understood it right. ;-) -- Regards, Kai Replies to list-only preferred.
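The fragmentation and the unsharing trade-off described above can be observed with ordinary tools; the file path here is just an example:

```shell
# Count the extents a randomly-rewritten file has accumulated
# (filefrag ships with e2fsprogs):
filefrag /var/lib/mysql/ibdata1

# Rewrite it into fewer, larger extents - note this unshares any
# extents still shared with snapshots, as discussed above:
btrfs filesystem defragment -t 32M /var/lib/mysql/ibdata1
```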
Re: [PATCH] Btrfs: add another missing end_page_writeback on submit_extent_page failure
On Tue, Feb 07, 2017 at 08:09:53PM +0900, takafumi-sslab wrote:
> On 2017/02/07 1:34, Liu Bo wrote:
> > One thing to add, we still need to check whether the page has the writeback bit before end_page_writeback.
>
> Ok, I added a PageWriteback check before end_page_writeback.
>
> > Looks like commit 55e3bd2e0c2e1 also has the same problem although I gave it my reviewed-by.
>
> I also added a PageWriteback check in write_one_eb.
> Finally, the diff becomes like below. Is it OK?
>
> ---
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 4ac383a..aa1908a 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3445,8 +3445,11 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
>  					 bdev, &epd->bio, max_nr,
>  					 end_bio_extent_writepage,
>  					 0, 0, 0, false);
> -		if (ret)
> +		if (ret) {
>  			SetPageError(page);
> +			if (PageWriteback(page))
> +				end_page_writeback(page);
> +		}
>
>  		cur = cur + iosize;
>  		pg_offset += iosize;
> @@ -3767,7 +3770,8 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>  		epd->bio_flags = bio_flags;
>  		if (ret) {
>  			set_btree_ioerr(p);
> -			end_page_writeback(p);
> +			if (PageWriteback(p))
> +				end_page_writeback(p);
>  			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
>  				end_extent_buffer_writeback(eb);
>  			ret = -EIO;
>
> ---

Looks good. Could you please explain the reason for the if statement in your commit log, so that others know why we put it there? Since you've got a reproducer, baking it into an fstests case is also welcome. Thanks, -liubo

> Sincerely,
> -takafumi
> > > > Thanks,
> > > > -liubo
> > > > Reviewed-by: Liu Bo
> > > > Thanks,
> > > > -liubo
> > > > Sincerely,
> > > > -takafumi
> > > > > So I don't think the patch is necessary for now.
> > > > > But as I said, the fact (nr == 0 or 1) would be changed if the subpagesize blocksize is supported. 
> > > > > > > > > > Thanks, > > > > > > > > > > -liubo > > > > > > > > > > > Sincerely, > > > > > > > > > > > > -takafumi > > > > > > > Thanks, > > > > > > > > > > > > > > -liubo > > > > > > > > Sincerely, > > > > > > > > > > > > > > > > On 2017/01/31 5:09, Liu Bo wrote: > > > > > > > > > On Fri, Jan 13, 2017 at 03:12:31PM +0900, takafumi-sslab > > > > > > > > > wrote: > > > > > > > > > > Thanks for your replying. > > > > > > > > > > > > > > > > > > > > I understand this bug is more complicated than I expected. > > > > > > > > > > I classify error cases under submit_extent_page() below > > > > > > > > > > > > > > > > > > > > A: ENOMEM error at btrfs_bio_alloc() in submit_extent_page() > > > > > > > > > > I first assumed this case and sent the mail. > > > > > > > > > > When bio_ret is NULL, submit_extent_page() calls > > > > > > > > > > btrfs_bio_alloc(). > > > > > > > > > > Then, btrfs_bio_alloc() may fail and submit_extent_page() > > > > > > > > > > returns -ENOMEM. > > > > > > > > > > In this case, bio_endio() is not called and the page's > > > > > > > > > > writeback bit > > > > > > > > > > remains. > > > > > > > > > > So, there is a need to call end_page_writeback() in the > > > > > > > > > > error handling. > > > > > > > > > > > > > > > > > > > > B: errors under submit_one_bio() of submit_extent_page() > > > > > > > > > > Errors that occur under submit_one_bio() handles at > > > > > > > > > > bio_endio(), and > > > > > > > > > > bio_endio() would call end_page_writeback(). > > > > > > > > > > > > > > > > > > > > Therefore, as you mentioned in the last mail, simply adding > > > > > > > > > > end_page_writeback() like my last email and commit > > > > > > > > > > 55e3bd2e0c2e1 can > > > > > > > > > > conflict in the case of B. > > > > > > > > > > To avoid such conflict, one easy solution is adding > > > > > > > > > > PageWriteback() check > > > > > > > > > > too. > > > > > > > > > > > > > > > > > > > > How do you think of this solution? 
> > > > > > > > > (sorry for the late reply.) > > > > > > > > > > > > > > > > > > I think its caller, "__extent_writepage", has covered the > > > > > > > > > above case > > > > > > > > > by setting page writeback again. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > -liubo > > > > > > > > > > Sincerely, > > > > > > > > > > > > > > > > > > > > On 2016/12/22 15:20, Liu Bo wrote: > > > > > > > > > > > On Fri, Dec 16, 2016 at 03:41:50PM +0900, Takafumi Kubota > >
Re: BTRFS for OLTP Databases
Am Tue, 7 Feb 2017 15:27:34 -0500 schrieb "Austin S. Hemmelgarn": > >> I'm not sure about this one. I would assume based on the fact that > >> many other things don't work with nodatacow and that regular defrag > >> doesn't work on files which are currently mapped as executable code > >> that it does not, but I could be completely wrong about this too. > > > > Technically, there's nothing that prevents autodefrag from working for > > nodatacow files. The question is: is it really necessary? Standard > > file systems also have no autodefrag; it's not an issue there > > because they are essentially nodatacow. Simply defrag the database > > file once and you're done. Transactional MySQL uses huge data > > files, probably preallocated. It should simply work with > > nodatacow. > The thing is, I don't have enough knowledge of how defrag is > implemented in BTRFS to say for certain that it doesn't use COW > semantics somewhere (and I would actually expect it to do so, since > that in theory makes many things _much_ easier to handle), and if it > uses COW somewhere, then it by definition doesn't work on NOCOW files. A dev would be needed on this. But from a non-dev point of view, the defrag operation itself is CoW: blocks are rewritten to another location in contiguous order. Only metadata CoW should be needed for this operation. It should be nothing else than writing to a nodatacow snapshot... Just that the snapshot is more or less implicit and temporary. Hmm? *curious* -- Regards, Kai Replies to list-only preferred.
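As an aside for anyone setting up such a database directory in the first place, nodatacow is usually applied per directory with chattr; the path here is an example:

```shell
# +C must be set while the directory (or file) is still empty; files
# created inside afterwards inherit the No_COW attribute.
mkdir /srv/mysql-data
chattr +C /srv/mysql-data
lsattr -d /srv/mysql-data   # the 'C' flag should now be shown
```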
Re: BTRFS for OLTP Databases
On 2017-02-07 15:36, Kai Krakow wrote:
On Tue, 7 Feb 2017 09:13:25 -0500, Peter Zaitsev wrote: Hi Hugo, For the use case I'm looking for I'm interested in having snapshot(s) open at all time. Imagine for example snapshot being created every hour and several of these snapshots kept at all time providing quick recovery points to the state of 1,2,3 hours ago. In such case (as I think you also describe) nodatacow does not provide any advantage.

Out of curiosity, I see one problem here: If you're doing snapshots of the live database, each snapshot leaves the database files like killing the database in-flight. Like shutting the system down in the middle of writing data. This is because I think there's no API for user space to subscribe to events like a snapshot - unlike e.g. the VSS API (volume snapshot service) in Windows. You should put the database into frozen state to prepare it for a hot copy before creating the snapshot, then ensure all data is flushed before continuing.

Correct.

I think I've read that btrfs snapshots do not guarantee single point in time snapshots - the snapshot may be smeared across a longer period of time while the kernel is still writing data. So parts of your writes may still end up in the snapshot after issuing the snapshot command, instead of in the working copy as expected.

Also correct AFAICT, and this needs to be better documented (for most people, the term snapshot implies atomicity of the operation).

How is this going to be addressed? Is there some snapshot aware API to let user space subscribe to such events and do proper preparation? Is this planned? LVM could be a user of such an API, too. I think this could have nice enterprise-grade value for Linux.

Ideally, such an API should be in the VFS layer, not just BTRFS. Reflinking exists in other filesystems already; it's only a matter of time before they decide to do snapshotting too. XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots.
But still, also this needs to be integrated with MySQL to properly work. I once (years ago) researched on this but gave up on my plans when I planned database backups for our web server infrastructure. We moved to creating SQL dumps instead, although there're binlogs which can be used to recover to a clean and stable transactional state after taking snapshots. But I simply didn't want to fiddle around with properly cleaning up binlogs, which accumulate a horrendous amount of space over time. The cleanup process requires creating a cold copy or dump of the complete database from time to time; only then is it safe to remove all binlogs up to that point in time.

Sadly, freezefs (the generic interface based off of xfs_freeze) only works for block device snapshots. Filesystem level snapshots need the application software to sync all its data and then stop writing until the snapshot is complete. As of right now, the sanest way I can come up with for a database server is to find a way to do a point-in-time SQL dump of the database (this also has the advantage that it works as a backup, and decouples you from the backing storage format).
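For MySQL/InnoDB, the point-in-time dump Austin describes is typically done with `mysqldump --single-transaction`. A minimal sketch, written out as a script and syntax-checked (the output path and the assumption that all tables are InnoDB are hypothetical):

```shell
# Sketch of a point-in-time dump job; paths are assumptions.
cat > /tmp/mysql-backup.sh <<'EOF'
#!/bin/sh
# --single-transaction opens one REPEATABLE READ transaction, so InnoDB
# tables are dumped from a single consistent snapshot without blocking
# writers. (MyISAM tables would not get this guarantee.)
OUT=/backup/mysql-$(date +%Y%m%d-%H%M).sql.gz
mysqldump --single-transaction --all-databases | gzip > "$OUT"
EOF
sh -n /tmp/mysql-backup.sh && echo "script OK"
```

This decouples recovery from the filesystem entirely, at the cost of a full logical dump rather than an instant snapshot.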
Re: BTRFS for OLTP Databases
On 07/02/2017 at 21:36, Kai Krakow wrote:
> [...] > I think I've read that btrfs snapshots do not guarantee single point in > time snapshots - the snapshot may be smeared across a longer period of > time while the kernel is still writing data. So parts of your writes > may still end up in the snapshot after issuing the snapshot command, > instead of in the working copy as expected.

I don't think so, for three reasons:
- it's so far away from admin's expectations that someone would have documented this in "man btrfs-subvolume",
- the CoW nature of Btrfs makes this trivial: it only has to keep old versions of data and the corresponding tree for it to work instead of unlinking them,
- the backup server I referred to has restarted a PostgreSQL system from snapshots about one thousand times now without a single problem, while being almost continuously updated by streaming replication.

Lionel
Re: BTRFS for OLTP Databases
Austin,

I recognize there are other components too. In this case I'm actually comparing BTRFS to XFS and EXT4, so I'm 100% sure it is file system related. Also, I'm using O_DIRECT asynchronous IO with MySQL, which means there is no significant dirty data buildup at the file system level. I'll see if it helps, though.

Also, I assumed this is something well known, as it is documented in Gotchas here: https://btrfs.wiki.kernel.org/index.php/Gotchas (Fragmentation section)

> > It's worth keeping in mind that there is more to the storage stack than just > the filesystem, and BTRFS tends to be more sensitive to the behavior of > other components in the stack than most other filesystems are. The stalls > you're describing sound more like a symptom of the brain-dead writeback > buffering defaults used by the VFS layer than they do an issue with BTRFS > (although BTRFS tends to be a bit more heavily impacted by this than most > other filesystems). Try fiddling with the /proc/sys/vm/dirty_* sysctls > (there is some pretty good documentation in Documentation/sysctl/vm.txt in > the kernel source) and see if that helps. The default values it uses are at > most 20% of RAM, which is an insane amount of data to buffer before starting > writeback when you're talking about systems with 16GB of RAM. >

-- Peter Zaitsev, CEO, Percona Tel: +1 888 401 3401 ext 7360 Skype: peter_zaitsev
Re: BTRFS for OLTP Databases
On Tue, 7 Feb 2017 09:13:25 -0500, Peter Zaitsev wrote:
> Hi Hugo, > > For the use case I'm looking for I'm interested in having snapshot(s) > open at all time. Imagine for example snapshot being created every > hour and several of these snapshots kept at all time providing quick > recovery points to the state of 1,2,3 hours ago. In such case (as I > think you also describe) nodatacow does not provide any advantage.

Out of curiosity, I see one problem here: If you're doing snapshots of the live database, each snapshot leaves the database files like killing the database in-flight. Like shutting the system down in the middle of writing data. This is because I think there's no API for user space to subscribe to events like a snapshot - unlike e.g. the VSS API (volume snapshot service) in Windows. You should put the database into frozen state to prepare it for a hot copy before creating the snapshot, then ensure all data is flushed before continuing.

I think I've read that btrfs snapshots do not guarantee single point in time snapshots - the snapshot may be smeared across a longer period of time while the kernel is still writing data. So parts of your writes may still end up in the snapshot after issuing the snapshot command, instead of in the working copy as expected.

How is this going to be addressed? Is there some snapshot aware API to let user space subscribe to such events and do proper preparation? Is this planned? LVM could be a user of such an API, too. I think this could have nice enterprise-grade value for Linux. XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But still, also this needs to be integrated with MySQL to properly work.

I once (years ago) researched on this but gave up on my plans when I planned database backups for our web server infrastructure. We moved to creating SQL dumps instead, although there're binlogs which can be used to recover to a clean and stable transactional state after taking snapshots.
But I simply didn't want to fiddle around with properly cleaning up binlogs, which accumulate a horrendous amount of space over time. The cleanup process requires creating a cold copy or dump of the complete database from time to time; only then is it safe to remove all binlogs up to that point in time.

-- Regards, Kai Replies to list-only preferred.
Re: BTRFS for OLTP Databases
On 2017-02-07 15:19, Kai Krakow wrote:
On Tue, 7 Feb 2017 14:50:04 -0500, "Austin S. Hemmelgarn" wrote: Also does autodefrag work with nodatacow (ie with snapshot) or are these exclusive ? I'm not sure about this one. I would assume based on the fact that many other things don't work with nodatacow and that regular defrag doesn't work on files which are currently mapped as executable code that it does not, but I could be completely wrong about this too.

Technically, there's nothing that prevents autodefrag to work for nodatacow files. The question is: is it really necessary? Standard file systems also have no autodefrag, it's not an issue there because they are essentially nodatacow. Simply defrag the database file once and you're done. Transactional MySQL uses huge data files, probably preallocated. It should simply work with nodatacow.

The thing is, I don't have enough knowledge of how defrag is implemented in BTRFS to say for certain that it doesn't use COW semantics somewhere (and I would actually expect it to do so, since that in theory makes many things _much_ easier to handle), and if it uses COW somewhere, then it by definition doesn't work on NOCOW files.

On the other hand: Using snapshots clearly introduces fragmentation over time. If autodefrag kicks in (given, it is supported for nodatacow), it will slowly unshare all data over time. This somehow defeats the purpose of having snapshots in the first place for this scenario. In conclusion, I'd recommend to run some maintenance scripts from time to time, one to re-share identical blocks, and one to defragment the current workspace. The bees daemon comes into mind here... I haven't tried it but it sounds like it could fill a gap here: https://github.com/Zygo/bees

Another option comes into mind: XFS now supports shared-extents copies. You could simply do a cold copy of the database with this feature, resulting in the same effect as a snapshot, without seeing the other performance problems of btrfs.
Tho, the fragmentation issue would remain, and I think there's no dedupe application for XFS yet. There isn't, but cp --reflink=auto with a reasonably recent version of coreutils should be able to reflink the file properly.
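Austin's `cp --reflink` suggestion, sketched below. On a filesystem without reflink support, `--reflink=auto` silently falls back to an ordinary copy, so the command is safe anywhere (the file names are made up for illustration):

```shell
# Make a copy that shares extents with the original where the
# filesystem supports it (btrfs, or XFS with reflink enabled).
echo "pretend this is a database file" > /tmp/db.ibd
cp --reflink=auto /tmp/db.ibd /tmp/db.copy.ibd
# The copy is independent at the file level but shares data blocks
# until either side is written to (CoW on first write).
cmp -s /tmp/db.ibd /tmp/db.copy.ibd && echo "contents identical"
```

Use `--reflink=always` instead if you want the command to fail loudly rather than degrade to a full copy.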
Re: BTRFS for OLTP Databases
On Tue, 7 Feb 2017 14:50:04 -0500, "Austin S. Hemmelgarn" wrote:
> > Also does autodefrag works with nodatacow (ie with snapshot) or > > are these exclusive ? > I'm not sure about this one. I would assume based on the fact that > many other things don't work with nodatacow and that regular defrag > doesn't work on files which are currently mapped as executable code > that it does not, but I could be completely wrong about this too.

Technically, there's nothing that prevents autodefrag to work for nodatacow files. The question is: is it really necessary? Standard file systems also have no autodefrag, it's not an issue there because they are essentially nodatacow. Simply defrag the database file once and you're done. Transactional MySQL uses huge data files, probably preallocated. It should simply work with nodatacow.

On the other hand: Using snapshots clearly introduces fragmentation over time. If autodefrag kicks in (given, it is supported for nodatacow), it will slowly unshare all data over time. This somehow defeats the purpose of having snapshots in the first place for this scenario. In conclusion, I'd recommend to run some maintenance scripts from time to time, one to re-share identical blocks, and one to defragment the current workspace. The bees daemon comes into mind here... I haven't tried it but it sounds like it could fill a gap here: https://github.com/Zygo/bees

Another option comes into mind: XFS now supports shared-extents copies. You could simply do a cold copy of the database with this feature, resulting in the same effect as a snapshot, without seeing the other performance problems of btrfs. Tho, the fragmentation issue would remain, and I think there's no dedupe application for XFS yet.

-- Regards, Kai Replies to list-only preferred.
Re: Very slow balance / btrfs-transaction
On 2017-02-07 14:47, Kai Krakow wrote:
On Mon, 6 Feb 2017 08:19:37 -0500, "Austin S. Hemmelgarn" wrote: MDRAID uses stripe selection based on latency and other measurements (like head position). It would be nice if btrfs implemented similar functionality. This would also be helpful for selecting a disk if there're more disks than stripesets (for example, I have 3 disks in my btrfs array). This could write new blocks to the most idle disk always. I think this wasn't covered by the above mentioned patch. Currently, selection is based only on the disk with most free space.

You're confusing read selection and write selection. MDADM and DM-RAID both use a load-balancing read selection algorithm that takes latency and other factors into account. However, they use a round-robin write selection algorithm that only cares about the position of the block in the virtual device modulo the number of physical devices.

Thanks for clearing that up.

As an example, say you have a 3 disk RAID10 array set up using MDADM (this is functionally the same as a 3-disk raid1 mode BTRFS filesystem). Every third block starting from block 0 will be on disks 1 and 2, every third block starting from block 1 will be on disks 3 and 1, and every third block starting from block 2 will be on disks 2 and 3. No latency measurements are taken, literally nothing is factored in except the block's position in the virtual device.

I didn't know MDADM can use RAID10 on odd numbers of disks... Nice. I'll keep that in mind. :-)

It's one of those neat features that I stumbled across by accident a while back that not many people know about. It's kind of ironic when you think about it too, since the MD RAID10 profile with only 2 replicas is actually a more accurate comparison for the BTRFS raid1 profile than the MD RAID1 profile. FWIW, it can (somewhat paradoxically) sometimes get better read and write performance than MD RAID0 across the same number of disks.
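Austin's placement description can be checked with a little arithmetic: in the near-2 RAID10 layout on three disks, copy k of stripe unit i lands on disk ((2i + k) mod 3). A quick sketch reproducing the pattern he gives (disks numbered 1-3):

```shell
# Expected: unit 0 -> disks 1,2; unit 1 -> disks 3,1; unit 2 -> disks 2,3;
# then the pattern repeats from unit 3.
for i in 0 1 2 3 4 5; do
  a=$(( (2 * i)     % 3 + 1 ))   # first copy of stripe unit i
  b=$(( (2 * i + 1) % 3 + 1 ))   # second copy of stripe unit i
  echo "stripe unit $i -> disks $a and $b"
done
```

The formula is purely positional, which is exactly Austin's point: no latency or load information enters write placement.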
Re: BTRFS for OLTP Databases
On Tue, 7 Feb 2017 09:13:25 -0500, Peter Zaitsev wrote:
> Hi Hugo, > > For the use case I'm looking for I'm interested in having snapshot(s) > open at all time. Imagine for example snapshot being created every > hour and several of these snapshots kept at all time providing quick > recovery points to the state of 1,2,3 hours ago. In such case (as I > think you also describe) nodatacow does not provide any advantage.

It still does provide some advantage, as in each write into a new area since the last hourly snapshot is going to be CoW'ed only once, as opposed to every new write getting CoW'ed every time no matter what.

I'm not sold on autodefrag; what I'd suggest instead is to schedule a regular defrag ("btrfs fi defrag") of the database files, e.g. daily. This may increase space usage temporarily as it will partially unmerge extents previously shared across snapshots, but you won't get runaway fragmentation anymore, as you would without nodatacow or with periodic snapshotting.

-- With respect, Roman
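Roman's scheduled defrag could be wired up as a simple cron entry. Everything here is an assumption for illustration: the path, the 03:30 schedule, and the 32M target extent size are not recommendations from this thread.

```
# /etc/cron.d/mysql-defrag  -- hypothetical daily defrag of the DB files.
# -r recurses into the directory; -t 32M sets the target extent size
# below which extents are considered for defragmentation.
# m  h  dom mon dow user command
30  3  *   *   *   root btrfs filesystem defragment -r -t 32M /var/lib/mysql
```

As Roman notes, expect a temporary jump in space usage after each run while extents shared with older snapshots get unshared.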
Re: BTRFS for OLTP Databases
On 2017-02-07 14:39, Kai Krakow wrote:
On Tue, 7 Feb 2017 10:06:34 -0500, "Austin S. Hemmelgarn" wrote: 4. Try using in-line compression. This can actually significantly improve performance, especially if you have slow storage devices and a really nice CPU.

Just a side note: With nodatacow there'll be no compression, I think. At least for files with "chattr +C" there'll be no compression. I thus think "nodatacow" has the same effect.

You're absolutely right, thanks for mentioning this; I completely forgot to point it out myself.
Re: BTRFS for OLTP Databases
On 2017-02-07 13:59, Peter Zaitsev wrote:
Jeff, Thank you very much for the explanations. Indeed it was not clear in the documentation - I read it simply as "if you have snapshots enabled, nodatacow makes no difference". I will rebuild the database in this mode from scratch and see how performance changes.

So far the most frustrating thing for me was periodic stalls for many seconds (running a sysbench workload). What was most puzzling, I get these even if I run the workload at 50% or less of the full load - i.e. the database can handle 1000 transactions/sec, I only inject 500/sec, and I still have those stalls. This is where it looks to me like some work is being delayed and then requires a stall for a few seconds to catch up. I wonder if there are some configuration options available to play with. So far I found BTRFS rather "zero configuration", which is great if it works, but it is also great to have more levers to pull if you're having some troubles.

It's worth keeping in mind that there is more to the storage stack than just the filesystem, and BTRFS tends to be more sensitive to the behavior of other components in the stack than most other filesystems are. The stalls you're describing sound more like a symptom of the brain-dead writeback buffering defaults used by the VFS layer than they do an issue with BTRFS (although BTRFS tends to be a bit more heavily impacted by this than most other filesystems). Try fiddling with the /proc/sys/vm/dirty_* sysctls (there is some pretty good documentation in Documentation/sysctl/vm.txt in the kernel source) and see if that helps. The default values it uses are at most 20% of RAM, which is an insane amount of data to buffer before starting writeback when you're talking about systems with 16GB of RAM.

On Tue, Feb 7, 2017 at 1:27 PM, Jeff Mahoney wrote: On 2/7/17 8:53 AM, Peter Zaitsev wrote: Hi, I have tried BTRFS from Ubuntu 16.04 LTS for write intensive OLTP MySQL Workload.
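Austin's point about the defaults can be made concrete with byte-based overrides, which sidestep the percentage knobs entirely. The values below are illustrative assumptions, not tuned recommendations; apply them for real with `sysctl -w` (or a sysctl.d file) as root:

```shell
# Default vm.dirty_ratio=20 means up to 20% of RAM may be dirty before
# writers are throttled -- on a 16 GiB box that is roughly 3.2 GiB of
# buffered writes that can hit the disk in one burst.
ram_bytes=$((16 * 1024 * 1024 * 1024))
echo "default throttle point: $(( ram_bytes * 20 / 100 )) bytes"

# Byte-based limits override the ratios when set (hypothetical values):
echo "vm.dirty_background_bytes = $(( 256 * 1024 * 1024 ))"  # start background writeback at 256 MiB
echo "vm.dirty_bytes = $(( 1024 * 1024 * 1024 ))"            # throttle writers at 1 GiB
```

Smaller dirty limits trade peak throughput for shorter, more frequent writeback bursts, which is usually the right trade for latency-sensitive OLTP stalls like the ones described here.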
It did not go very well, ranging from multi-second stalls where no transactions are completed to, finally, a kernel OOPS with a "no space left on device" error message and the filesystem going read only. I'm a complete newbie in BTRFS, so I assume I'm doing something wrong. Do you have any advice on how BTRFS should be tuned for an OLTP workload (large files having a lot of random writes)? Or is this the case where one should simply stay away from BTRFS and use something else? One item recommended in some places is "nodatacow"; this however defeats the main purpose I'm looking at BTRFS - I am interested in "free" snapshots, which look very attractive to use for database recovery scenarios, allowing instant rollback to the previous state.

Hi Peter -

There seems to be some misunderstanding around how nodatacow works. Nodatacow doesn't prohibit snapshot use. Snapshots are still allowed and, of course, will cause CoW to happen when a write occurs, but only on the first write. Subsequent writes will not CoW again. This does mean you don't get CRC protection for data, though. Since most databases do this internally, that is probably no great loss. You will get fragmentation, but that's true of any random-write workload on btrfs.

Timothy's comment about how extents are accounted is more-or-less correct. The file extents in the file system trees reference data extents in the extent tree. When portions of the data extent are unreferenced, they're not necessarily released. A balance operation will usually split the data extents so that the unused space is released.

As for the Oopses with ENOSPC, that's something we'd want to look into if it can be reproduced with a more recent kernel. We shouldn't be getting ENOSPC anywhere sensitive anymore.

-Jeff

-- Jeff Mahoney SUSE Labs
Re: BTRFS for OLTP Databases
On Tue, 7 Feb 2017 10:06:34 -0500, "Austin S. Hemmelgarn" wrote:
> 4. Try using in-line compression. This can actually significantly > improve performance, especially if you have slow storage devices and > a really nice CPU.

Just a side note: With nodatacow there'll be no compression, I think. At least for files with "chattr +C" there'll be no compression. I thus think "nodatacow" has the same effect.

-- Regards, Kai Replies to list-only preferred.
Re: BTRFS for OLTP Databases
On 2017-02-07 14:31, Peter Zaitsev wrote:
Hi Hugo, As I re-read it closely (and also other comments in the thread) I now understand there is a difference in how nodatacow works even if snapshots are in place.

On autodefrag, I wonder is there some more detailed documentation about how autodefrag works? The manual https://btrfs.wiki.kernel.org/index.php/Mount_options has a very general statement. What does "detect random IO" really mean? It also talks about defragmenting the file - is it really the whole file which is triggered for defrag, or is defrag local? I.e. I could understand if, as writes happen, the 1MB block is checked and, if it is more than X fragments, it is defragmented, or something like that.

I don't know the exact algorithm, but I'm pretty sure it's similar to what bcache uses to bypass the cache device for sequential I/O. In essence, it's going to trigger for database usage.

Also does autodefrag work with nodatacow (ie with snapshot) or are these exclusive?

I'm not sure about this one. I would assume based on the fact that many other things don't work with nodatacow and that regular defrag doesn't work on files which are currently mapped as executable code that it does not, but I could be completely wrong about this too.

There's another approach which might be worth testing, which is to use autodefrag. This will increase data write I/O, because where you have one or more small writes in a region, it will also read and write the data in a small neighbourhood around those writes, so the fragmentation is reduced. This will improve subsequent read performance. I could also suggest getting the latest kernel you can -- 16.04 is already getting on for a year old, and there may be performance improvements in upstream kernels which affect your workload. There's an Ubuntu kernel PPA you can use to get the new kernels without too much pain.
Re: Very slow balance / btrfs-transaction
On Mon, 6 Feb 2017 08:19:37 -0500, "Austin S. Hemmelgarn" wrote:
> > MDRAID uses stripe selection based on latency and other measurements > > (like head position). It would be nice if btrfs implemented similar > > functionality. This would also be helpful for selecting a disk if > > there're more disks than stripesets (for example, I have 3 disks in > > my btrfs array). This could write new blocks to the most idle disk > > always. I think this wasn't covered by the above mentioned patch. > > Currently, selection is based only on the disk with most free > > space. > You're confusing read selection and write selection. MDADM and > DM-RAID both use a load-balancing read selection algorithm that takes > latency and other factors into account. However, they use a > round-robin write selection algorithm that only cares about the > position of the block in the virtual device modulo the number of > physical devices.

Thanks for clearing that up.

> As an example, say you have a 3 disk RAID10 array set up using MDADM > (this is functionally the same as a 3-disk raid1 mode BTRFS > filesystem). Every third block starting from block 0 will be on disks > 1 and 2, every third block starting from block 1 will be on disks 3 > and 1, and every third block starting from block 2 will be on disks 2 > and 3. No latency measurements are taken, literally nothing is > factored in except the block's position in the virtual device.

I didn't know MDADM can use RAID10 on odd numbers of disks... Nice. I'll keep that in mind. :-)

-- Regards, Kai Replies to list-only preferred.
Re: BTRFS for OLTP Databases
Hi Hugo,

As I re-read it closely (and also other comments in the thread) I now understand there is a difference in how nodatacow works even if snapshots are in place.

On autodefrag, I wonder is there some more detailed documentation about how autodefrag works? The manual https://btrfs.wiki.kernel.org/index.php/Mount_options has a very general statement. What does "detect random IO" really mean? It also talks about defragmenting the file - is it really the whole file which is triggered for defrag, or is defrag local? I.e. I could understand if, as writes happen, the 1MB block is checked and, if it is more than X fragments, it is defragmented, or something like that.

Also does autodefrag work with nodatacow (ie with snapshot) or are these exclusive?

> >There's another approach which might be worth testing, which is to > use autodefrag. This will increase data write I/O, because where you > have one or more small writes in a region, it will also read and write > the data in a small neighbourhood around those writes, so the > fragmentation is reduced. This will improve subsequent read > performance. > >I could also suggest getting the latest kernel you can -- 16.04 is > already getting on for a year old, and there may be performance > improvements in upstream kernels which affect your workload. There's > an Ubuntu kernel PPA you can use to get the new kernels without too > much pain. > >
Re: BTRFS for OLTP Databases
Jeff,

Thank you very much for the explanations. Indeed it was not clear in the documentation - I read it simply as "if you have snapshots enabled, nodatacow makes no difference". I will rebuild the database in this mode from scratch and see how performance changes.

So far the most frustrating thing for me was periodic stalls for many seconds (running a sysbench workload). What was most puzzling, I get these even if I run the workload at 50% or less of the full load - i.e. the database can handle 1000 transactions/sec, I only inject 500/sec, and I still have those stalls. This is where it looks to me like some work is being delayed and then requires a stall for a few seconds to catch up. I wonder if there are some configuration options available to play with. So far I found BTRFS rather "zero configuration", which is great if it works, but it is also great to have more levers to pull if you're having some troubles.

On Tue, Feb 7, 2017 at 1:27 PM, Jeff Mahoney wrote: > On 2/7/17 8:53 AM, Peter Zaitsev wrote: >> Hi, >> >> I have tried BTRFS from Ubuntu 16.04 LTS for write intensive OLTP MySQL >> Workload. >> >> It did not go very well ranging from multi-seconds stalls where no >> transactions are completed to the finally kernel OOPS with "no space left >> on device" error message and filesystem going read only. >> >> I'm complete newbie in BTRFS so I assume I'm doing something wrong. >> >> Do you have any advice on how BTRFS should be tuned for OLTP workload >> (large files having a lot of random writes)? Or is this the case where >> one should simply stay away from BTRFS and use something else? >> >> One item recommended in some places is "nodatacow" this however defeats >> the main purpose I'm looking at BTRFS - I am interested in "free" >> snapshots which look very attractive to use for database recovery scenarios >> allow instant rollback to the previous state. >> > > Hi Peter - > > There seems to be some misunderstanding around how nodatacow works.
> Nodatacow doesn't prohibit snapshot use. Snapshots are still allowed > and, of course, will cause CoW to happen when a write occurs, but only > on the first write. Subsequent writes will not CoW again. This does > mean you don't get CRC protection for data, though. Since most > databases do this internally, that is probably no great loss. You will > get fragmentation, but that's true of any random-write workload on btrfs. > > Timothy's comment about how extents are accounted is more-or-less > correct. The file extents in the file system trees reference data > extents in the extent tree. When portions of the data extent are > unreferenced, they're not necessarily released. A balance operation > will usually split the data extents so that the unused space is released. > > As for the Oopses with ENOSPC, that's something we'd want to look into > if it can be reproduced with a more recent kernel. We shouldn't be > getting ENOSPC anywhere sensitive anymore. > > -Jeff > > -- > Jeff Mahoney > SUSE Labs >

-- Peter Zaitsev, CEO, Percona Tel: +1 888 401 3401 ext 7360 Skype: peter_zaitsev
Re: BTRFS for OLTP Databases
On 2/7/17 8:53 AM, Peter Zaitsev wrote: > Hi, > > I have tried BTRFS from Ubuntu 16.04 LTS for write intensive OLTP MySQL > Workload. > > It did not go very well ranging from multi-seconds stalls where no > transactions are completed to the finally kernel OOPS with "no space left > on device" error message and filesystem going read only. > > I'm complete newbie in BTRFS so I assume I'm doing something wrong. > > Do you have any advice on how BTRFS should be tuned for OLTP workload > (large files having a lot of random writes)? Or is this the case where > one should simply stay away from BTRFS and use something else? > > One item recommended in some places is "nodatacow" this however defeats > the main purpose I'm looking at BTRFS - I am interested in "free" > snapshots which look very attractive to use for database recovery scenarios > allow instant rollback to the previous state. >

Hi Peter -

There seems to be some misunderstanding around how nodatacow works. Nodatacow doesn't prohibit snapshot use. Snapshots are still allowed and, of course, will cause CoW to happen when a write occurs, but only on the first write. Subsequent writes will not CoW again. This does mean you don't get CRC protection for data, though. Since most databases do this internally, that is probably no great loss. You will get fragmentation, but that's true of any random-write workload on btrfs.

Timothy's comment about how extents are accounted is more-or-less correct. The file extents in the file system trees reference data extents in the extent tree. When portions of the data extent are unreferenced, they're not necessarily released. A balance operation will usually split the data extents so that the unused space is released.

As for the Oopses with ENOSPC, that's something we'd want to look into if it can be reproduced with a more recent kernel. We shouldn't be getting ENOSPC anywhere sensitive anymore.

-Jeff

-- Jeff Mahoney SUSE Labs
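The nodatacow-plus-snapshots setup Jeff describes is usually done per-directory with `chattr +C`. A minimal sketch; the directory path is a stand-in, and note that `+C` only takes effect on files created after the flag is set, so it must be applied to the empty data directory before the database populates it:

```shell
# Mark an (empty) data directory NOCOW so newly created database files
# inherit the attribute. Uses a temp dir as a stand-in for something
# like /var/lib/mysql.
d=$(mktemp -d)
chattr +C "$d" 2>/dev/null || true  # silently a no-op on non-btrfs filesystems
lsattr -d "$d" 2>/dev/null || true  # on btrfs, the 'C' flag shows up here
echo "prepared $d"
```

Snapshots of a subvolume containing this directory still work; as Jeff explains, each block is CoW'ed once after a snapshot and then written in place again.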
Re: [PATCH] btrfs-progs: better document btrfs receive security
On Fri, Feb 03, 2017 at 08:48:58AM -0500, Austin S. Hemmelgarn wrote:
> This adds some extra documentation to the btrfs-receive manpage that explains some of the security related aspects of btrfs-receive. The first part covers the fact that the subvolume being received is writable until the receive finishes, and the second covers the current lack of sanity checking of the send stream.
>
> Signed-off-by: Austin S. Hemmelgarn

Applied, thanks.
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: fix use-after-free due to wrong order of destroying work queues
From: Filipe Manana

Before we destroy all work queues (and wait for their tasks to complete) we were destroying the work queues used for metadata I/O operations, which can result in a use-after-free problem because most tasks from all work queues do metadata I/O operations. For example, the tasks from the caching workers work queue (fs_info->caching_workers), which is destroyed only after the work queue used for metadata reads (fs_info->endio_meta_workers) is destroyed, do metadata reads, which result in attempts to queue tasks into the latter work queue, triggering a use-after-free with a trace like the following:

[23114.613543] general protection fault: [#1] PREEMPT SMP
[23114.614442] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic acpi_cpufreq tpm_tis tpm_tis_core tpm ppdev parport_pc parport i2c_piix4 processor sg evdev i2c_core psmouse pcspkr serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: scsi_debug]
[23114.616932] CPU: 9 PID: 4537 Comm: kworker/u32:8 Not tainted 4.9.0-rc7-btrfs-next-36+ #1
[23114.616932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[23114.616932] Workqueue: btrfs-cache btrfs_cache_helper [btrfs]
[23114.616932] task: 880221d45780 task.stack: c9000bc5
[23114.616932] RIP: 0010:[] [] btrfs_queue_work+0x2c/0x190 [btrfs]
[23114.616932] RSP: 0018:88023f443d60 EFLAGS: 00010246
[23114.616932] RAX: RBX: 6b6b6b6b6b6b6b6b RCX: 0102
[23114.616932] RDX: a0419000 RSI: 88011df534f0 RDI: 880101f01c00
[23114.616932] RBP: 88023f443d80 R08: 000f7000 R09:
[23114.616932] R10: 88023f443d48 R11: 1000 R12: 88011df534f0
[23114.616932] R13: 880135963868 R14: 1000 R15: 1000
[23114.616932] FS: () GS:88023f44() knlGS:
[23114.616932] CS: 0010 DS: ES: CR0: 80050033
[23114.616932] CR2: 7f0fb9f8e520 CR3: 01a0b000 CR4: 06e0
[23114.616932] Stack:
[23114.616932] 880101f01c00 88011df534f0 880135963868 1000
[23114.616932] 88023f443da0 a03470af 880149b37200 880135963868
[23114.616932] 88023f443db8 8125293c 880149b37200 88023f443de0
[23114.616932] Call Trace:
[23114.616932]
[23114.616932] [] end_workqueue_bio+0xd5/0xda [btrfs]
[23114.616932] [] bio_endio+0x54/0x57
[23114.616932] [] btrfs_end_bio+0xf7/0x106 [btrfs]
[23114.616932] [] bio_endio+0x54/0x57
[23114.616932] [] blk_update_request+0x21a/0x30f
[23114.616932] [] scsi_end_request+0x31/0x182 [scsi_mod]
[23114.616932] [] scsi_io_completion+0x1ce/0x4c8 [scsi_mod]
[23114.616932] [] scsi_finish_command+0x104/0x10d [scsi_mod]
[23114.616932] [] scsi_softirq_done+0x101/0x10a [scsi_mod]
[23114.616932] [] blk_done_softirq+0x82/0x8d
[23114.616932] [] __do_softirq+0x1ab/0x412
[23114.616932] [] irq_exit+0x49/0x99
[23114.616932] [] smp_call_function_single_interrupt+0x24/0x26
[23114.616932] [] call_function_single_interrupt+0x89/0x90
[23114.616932]
[23114.616932] [] ? scsi_request_fn+0x13a/0x2a1 [scsi_mod]
[23114.616932] [] ? _raw_spin_unlock_irq+0x2c/0x4a
[23114.616932] [] ? _raw_spin_unlock_irq+0x32/0x4a
[23114.616932] [] ? _raw_spin_unlock_irq+0x2c/0x4a
[23114.616932] [] scsi_request_fn+0x13a/0x2a1 [scsi_mod]
[23114.616932] [] __blk_run_queue_uncond+0x22/0x2b
[23114.616932] [] __blk_run_queue+0x19/0x1b
[23114.616932] [] blk_queue_bio+0x268/0x282
[23114.616932] [] generic_make_request+0xbd/0x160
[23114.616932] [] submit_bio+0x100/0x11d
[23114.616932] [] ? __this_cpu_preempt_check+0x13/0x15
[23114.616932] [] ? __percpu_counter_add+0x8e/0xa7
[23114.616932] [] btrfsic_submit_bio+0x1a/0x1d [btrfs]
[23114.616932] [] btrfs_map_bio+0x1f4/0x26d [btrfs]
[23114.616932] [] btree_submit_bio_hook+0x74/0xbf [btrfs]
[23114.616932] [] ? btrfs_wq_submit_bio+0x160/0x160 [btrfs]
[23114.616932] [] submit_one_bio+0x6b/0x89 [btrfs]
[23114.616932] [] read_extent_buffer_pages+0x170/0x1ec [btrfs]
[23114.616932] [] ? free_root_pointers+0x64/0x64 [btrfs]
[23114.616932] [] readahead_tree_block+0x3f/0x4c [btrfs]
[23114.616932] [] read_block_for_search.isra.20+0x1ce/0x23d [btrfs]
[23114.616932] [] btrfs_search_slot+0x65f/0x774 [btrfs]
[23114.616932] [] ? free_extent_buffer+0x73/0x7e [btrfs]
[23114.616932] [] btrfs_next_old_leaf+0xa1/0x33c [btrfs]
[23114.616932] [] btrfs_next_leaf+0x10/0x12 [btrfs]
[23114.616932] [] caching_thread+0x22d/0x416 [btrfs]
[23114.616932] [] btrfs_scrubparity_helper+0x187/0x3b6 [btrfs]
[23114.616932] [] btrfs_cache_helper+0xe/0x10 [btrfs]
[23114.616932] [] process_one_work+0x273/0x4e4
[23114.616932] []
understanding disk space usage
Hello,

My system is or seems to be running out of disk space, but I can't find out how or why. It might be a BTRFS peculiarity, hence posting on this list. Most indicators seem to suggest I'm filling up, but I can't trace the disk usage to files on the FS. The issue is on my root filesystem on a 28GiB SSD partition (commands below issued when booted into single-user mode):

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        28G   26G  2.1G  93% /

$ btrfs --version
btrfs-progs v4.4

$ btrfs fi usage /
Overall:
    Device size:          27.94GiB
    Device allocated:     27.94GiB
    Device unallocated:    1.00MiB
    Device missing:          0.00B
    Used:                 25.03GiB
    Free (estimated):      2.37GiB  (min: 2.37GiB)
    Data ratio:               1.00
    Metadata ratio:           1.00
    Global reserve:      256.00MiB  (used: 0.00B)

Data,single: Size:26.69GiB, Used:24.32GiB
   /dev/sda3  26.69GiB

Metadata,single: Size:1.22GiB, Used:731.45MiB
   /dev/sda3   1.22GiB

System,single: Size:32.00MiB, Used:16.00KiB
   /dev/sda3  32.00MiB

Unallocated:
   /dev/sda3   1.00MiB

$ btrfs fi df /
Data, single: total=26.69GiB, used=24.32GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=1.22GiB, used=731.48MiB
GlobalReserve, single: total=256.00MiB, used=0.00B

However:

$ mount -o bind / /mnt
$ sudo du -hs /mnt
9.3G /mnt

Try to balance:

$ btrfs balance start /
ERROR: error during balancing '/': No space left on device

Am I really filling up? What can explain the huge discrepancy between the output of du and the FS stats (open file descriptors on deleted files cannot explain this in single-user mode)? Any advice on possible causes and how to proceed?

-- 
Vasco
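The filesystem above is fully allocated at the chunk level (only 1.00MiB unallocated), which is exactly why balance fails immediately: it has no unallocated space to write relocated chunks into. The usual way out is to give balance somewhere to work temporarily. A sketch with hypothetical device names (any spare partition, USB stick, or even a loop device will do; this needs root and is not something to run blindly on production):

```shell
# Add a temporary second device so the allocator has unallocated space.
btrfs device add /dev/sdb1 /

# Compact nearly-empty data chunks first, raising the usage cutoff
# gradually instead of balancing everything at once.
btrfs balance start -dusage=10 /
btrfs balance start -dusage=50 /

# Migrate everything back and return to a single device.
btrfs device remove /dev/sdb1 /
```

After this, `btrfs fi usage /` should show real unallocated space again, and the gap between allocated chunks and data actually used should shrink.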
Re: BTRFS for OLTP Databases
Hi Peter,

On 07/02/2017 15:13, Peter Zaitsev wrote:
> Hi Hugo,
>
> For the use case I'm looking for I'm interested in having snapshot(s) open at all times. Imagine for example a snapshot being created every hour and several of these snapshots kept at all times, providing quick recovery points to the state of 1, 2, 3 hours ago. In such a case (as I think you also describe) nodatacow does not provide any advantage.
>
> I have not seen autodefrag helping much but I will try again. Is there any autodefrag documentation available about how it is expected to work and if it can be tuned in any way?

There's not much that can be done if the same file is modified in two different subvolumes (typically the original and a R/W snapshot). You either break the reflink around the modification to limit the amount of fragmentation (which will use disk space and write I/O) or get fragmentation on at least one subvolume (which will add seeks). So the only options are either to flatten the files (which can be done incrementally by defragmenting them on both sides when they change) or to defragment only the most used volume (especially if the other is a relatively short-lived snapshot where performance won't degrade much until it is removed and won't matter much).

I just modified our defragmentation scheduler to be aware of multiple subvolumes and support ignoring some of them. The previous version (not tagged, sorry) was battle-tested on a Ceph cluster and was designed for it. Autodefrag didn't work with Ceph under our workload (latency went through the roof, OSDs were timing out requests, ...) and our scheduler, with some simple Ceph BTRFS-related tunings, gave us even better performance than XFS (which is usually the recommended choice with current Ceph versions).
The current version is probably still rough around the edges as it is brand new (most of the work was done last Sunday) and only running on a backup server with a situation not much different from yours: a large PostgreSQL slave (>50GB) which is snapshotted hourly and daily, with a daily snapshot used to start a PostgreSQL instance for "tests on real data" purposes, plus a copy of a <10TB NFS server with similar snapshots in place. All of this is on a single 13-14TB RAID10 BTRFS filesystem. In our case, using autodefrag on this slowly degraded performance to the point where off-site backups became slow enough to warrant preventive measures.

The current scheduler looks for the mountpoints of top BTRFS volumes (so you have to mount the top volume somewhere), and defragments them avoiding:
- read-only snapshots,
- all data below configurable subdirs (including read-write subvolumes even if they are mounted elsewhere); see README.md for instructions.

It slowly walks all files eligible for defragmentation and in parallel detects writes to the same filesystem, including writes to read-write subvolumes mounted elsewhere, to trigger defragmentation. The scheduler uses an estimated "cost" for each file to prioritize defragmentation tasks, and with default settings it tries to keep I/O activity low enough that it doesn't slow down other tasks too much. However, it defragments files whole, which might put some strain on huge ibdata* files if you didn't switch to file-per-table. In our case defragmenting 1GB files is OK and doesn't have a major impact.

We are already seeing better performance (our total daily backup time is below worrying levels again) and the scheduler hasn't even finished walking the whole filesystem (there are approximately 8 million files and it is configured to evaluate them over a week). This is probably because it follows the most write-active files (which are in the PostgreSQL slave directory) and defragmented most of them early.
Note that it is tuned for filesystems using ~2TB 7200rpm drives (there are some options that will adapt it to subsystems with more I/O capacity). Using drives with different capacities shouldn't need tuning, but it probably will not work well on SSDs (it should be configured to speed up significantly).

See https://github.com/jtek/ceph-utils; you want btrfs-defrag-scheduler.rb. Some parameters are available (start it with --help). You should probably start it with --verbose, at least until you are comfortable with it, to get a list of which files are defragmented, along with many debug messages you will probably want to ignore (or you'll have to read the Ruby code to fully understand what they mean).

I don't provide any warranty for it, but the worst I believe can happen is no performance improvement, or performance degradation until you stop it. If you don't blacklist read-write snapshots with the .no-defrag file (see README.md) defragmentation will probably eat more disk space than usual. Space usage will go up rapidly during defragmentation if you have snapshots; it is supposed to go down after all snapshots referring to fragmented files are removed and replaced by new snapshots (where fragmentation should be more stable).

Best regards,
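A minimal way to try the scheduler described above (a sketch; the mountpoint path is hypothetical, and only --help, --verbose, and the .no-defrag marker are mentioned in the message, so check README.md and the --help output before relying on anything else):

```shell
# Fetch the tools and inspect the scheduler's options before running it.
git clone https://github.com/jtek/ceph-utils.git
ruby ceph-utils/btrfs-defrag-scheduler.rb --help

# Blacklist a read-write snapshot directory by dropping a .no-defrag
# marker file in it (the README.md convention mentioned above),
# then run verbosely against the mounted top volume.
touch /mnt/btrfs-top/snapshots/.no-defrag
ruby ceph-utils/btrfs-defrag-scheduler.rb --verbose
```

Remember that the top volume (subvolid 5) has to be mounted somewhere for the scheduler to find it.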
Re: Very slow balance / btrfs-transaction
On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo wrote:
>
> At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:
>>
>> Hi Qu,
>>
>> On 02/05/2017 07:45 PM, Qu Wenruo wrote:
>>>
>>> At 02/04/2017 09:47 AM, Jorg Bornschein wrote:
>>>> February 4, 2017 1:07 AM, "Goldwyn Rodrigues" wrote:
>>>>
>>>> Quota support was indeed active -- and it warned me that the qgroup data was inconsistent. Disabling quotas had an immediate impact on balance throughput -- it's *much* faster now! From a quick glance at iostat I would guess it's at least a factor of 100 faster. Should quota support generally be disabled during balances? Or did I somehow push my fs into a weird state where it triggered a slow path? Thanks! j
>>>
>>> Would you please provide the kernel version?
>>>
>>> v4.9 introduced a bad fix for qgroup balance, which didn't completely fix qgroup byte leaking, and also hugely slowed down the balance process:
>>>
>>> commit 62b99540a1d91e46422f0e04de50fc723812c421
>>> Author: Qu Wenruo
>>> Date: Mon Aug 15 10:36:51 2016 +0800
>>>
>>>     btrfs: relocation: Fix leaking qgroups numbers on data extents
>>>
>>> Sorry for that.
>>>
>>> And in v4.10, a better method is applied to fix the byte leaking problem, and it should be a little faster than the previous one:
>>>
>>> commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
>>> Author: Qu Wenruo
>>> Date: Tue Oct 18 09:31:29 2016 +0800
>>>
>>>     btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
>>>
>>> However, using balance with qgroups is still slower than balance without qgroups; the root fix needs us to rework the current backref iteration.
>>
>> This patch has made btrfs balance performance worse. The balance task has become more CPU intensive compared to earlier and takes longer to complete, besides hogging resources. While correctness is important, we need to figure out how this can be made more efficient.
>
> The cause is already known.
> It's find_parent_node() which takes most of the time, finding all referencers of an extent.
>
> And it's also the cause of the FIEMAP softlockup (fixed in a recent release by quitting early).
>
> The biggest problem is that the current find_parent_node() uses a list to iterate, which is quite slow, especially as it's done in a loop. In the real world, find_parent_node() is about O(n^3).
>
> We can either improve find_parent_node() by using an rb_tree, or introduce some cache for find_parent_node().

Even if anyone is able to reduce that function's complexity from O(n^3) down to, let's say, O(n^2) or O(n log n) for example, the current implementation of qgroups will always be a problem. The real problem is that this more recent rework of qgroups does all this accounting inside the critical section of a transaction - blocking any other tasks that want to start a new transaction or attempt to join the current transaction. Not to mention that on systems with small amounts of memory (2GB or 4GB from what I've seen in user reports) we also OOM due to this allocation of one struct btrfs_qgroup_extent_record per delayed data reference head, which are used for that accounting phase in the critical section of a transaction commit.

Let's face it and be realistic: even if someone manages to make find_parent_node() much, much better, like O(n) for example, it will always be a problem due to the reasons mentioned before. Many extents touched per transaction and many subvolumes/snapshots will always expose that root problem - doing the accounting in the transaction commit critical section.

> IIRC SUSE guys (maybe Jeff?) are working on it with the first method, but I didn't hear anything about it recently.
> Thanks,
> Qu

-- 
Filipe David Manana,

"People will forget what you said, people will forget what you did, but people will never forget how you made them feel."
Re: BTRFS for OLTP Databases
On 2017-02-07 10:20, Timofey Titovets wrote:
> I think that you have a problem with extent bookkeeping (if I understand how btrfs manages extents). So to deal with it, try enabling compression, as compression will force all extents to be fragmented at ~128kb.

No, it will compress everything in chunks of 128kB, but it will not fragment things any more than they already would have been (it may actually _reduce_ fragmentation because there is less data being stored on disk). This representation is a bug in the FIEMAP ioctl; it doesn't understand the way BTRFS represents things properly. IIRC, there was a patch to fix this, but I don't remember what happened with it. That said, in-line compression can help significantly, especially if you have slow storage devices.

> I mean that: You have a 128MB extent, you rewrite random 4k sectors, btrfs will not split the 128MB extent and will not free up data (I don't know the internal algorithm, so I can't predict when this will happen), and after some time btrfs will rebuild extents, and split the 128MB extent into several smaller ones. But when you use compression, the allocator rebuilds extents much earlier (I think it's because btrfs also operates on that as 128kb extents, even if it's a continuous 128MB chunk of data).

The allocator has absolutely nothing to do with this; it's a function of the COW operation. Unless you're using nodatacow, that 128MB extent will get split the moment the data hits the storage device (either on the next commit cycle (at most 30 seconds with the default commit cycle), or when fdatasync is called, whichever is sooner). In the case of compression, it's still one extent (although on disk it will be less than 128MB) and will be split at _exactly_ the same time under _exactly_ the same circumstances as an uncompressed extent. IOW, it has absolutely nothing to do with the extent handling either.
The difference arises in that compressed data effectively has an on-media block size of 128k, not 16k (the current default block size) or 4k (the old default). This means that the smallest possible fragment for a file with in-line compression enabled is 128k, while for a file without it, it's equal to the filesystem block size. A larger minimum fragment size means that the maximum number of fragments a given file can have is smaller (8 times smaller, in fact, than without compression when using the current default block size), which means that there will be less fragmentation. Some rather complex and tedious math indicates that this is not the _only_ thing improving performance when using in-line compression, but it's probably the biggest thing doing so for the workload being discussed.
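The "8 times smaller" figure above can be checked directly with shell arithmetic (a sketch; the sizes follow the 16k block size and 128k compressed-extent size quoted in the message):

```shell
# Worst-case fragment counts for a 1 GiB file.
FILE=$((1024 * 1024 * 1024))                 # 1 GiB in bytes
echo $((FILE / (16 * 1024)))                 # 16 KiB blocks   -> 65536 fragments
echo $((FILE / (128 * 1024)))                # 128 KiB extents -> 8192 fragments
echo $(((FILE / (16 * 1024)) / (FILE / (128 * 1024))))   # ratio -> 8
```

The ratio is just 128k / 16k = 8, independent of the file size.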
Re: BTRFS for OLTP Databases
>> I think that you have a problem with extent bookkeeping (if I understand how btrfs manages extents). So to deal with it, try enabling compression, as compression will force all extents to be fragmented at ~128kb.
>
> No, it will compress everything in chunks of 128kB, but it will not fragment things any more than they already would have been (it may actually _reduce_ fragmentation because there is less data being stored on disk). This representation is a bug in the FIEMAP ioctl; it doesn't understand the way BTRFS represents things properly. IIRC, there was a patch to fix this, but I don't remember what happened with it.
>
> That said, in-line compression can help significantly, especially if you have slow storage devices.

I mean that: You have a 128MB extent, you rewrite random 4k sectors, btrfs will not split the 128MB extent and will not free up data (I don't know the internal algorithm, so I can't predict when this will happen), and after some time btrfs will rebuild extents, and split the 128MB extent into several smaller ones. But when you use compression, the allocator rebuilds extents much earlier (I think it's because btrfs also operates on that as 128kb extents, even if it's a continuous 128MB chunk of data).

-- 
Have a nice day,
Timofey.
Re: BTRFS for OLTP Databases
On 2017-02-07 10:00, Timofey Titovets wrote: 2017-02-07 17:13 GMT+03:00 Peter Zaitsev: Hi Hugo, For the use case I'm looking for I'm interested in having snapshot(s) open at all times. Imagine for example a snapshot being created every hour and several of these snapshots kept at all times, providing quick recovery points to the state of 1, 2, 3 hours ago. In such a case (as I think you also describe) nodatacow does not provide any advantage. I have not seen autodefrag helping much but I will try again. Is there any autodefrag documentation available about how it is expected to work and if it can be tuned in any way? I noticed that remounting an already fragmented filesystem with autodefrag and putting on a workload which does more fragmentation does not seem to improve things over time. Well, nodatacow will still allow snapshots to work, but it also allows the data to fragment. Each snapshot made will cause subsequent writes to shared areas to be CoWed once (and then it reverts to unshared and nodatacow again). There's another approach which might be worth testing, which is to use autodefrag. This will increase data write I/O, because where you have one or more small writes in a region, it will also read and write the data in a small neighbourhood around those writes, so the fragmentation is reduced. This will improve subsequent read performance. I could also suggest getting the latest kernel you can -- 16.04 is already getting on for a year old, and there may be performance improvements in upstream kernels which affect your workload. There's an Ubuntu kernel PPA you can use to get the new kernels without too much pain. -- Peter Zaitsev, CEO, Percona Tel: +1 888 401 3401 ext 7360 Skype: peter_zaitsev I think that you have a problem with extent bookkeeping (if I understand how btrfs manages extents).
So to deal with it, try enabling compression, as compression will force all extents to be fragmented at ~128kb. No, it will compress everything in chunks of 128kB, but it will not fragment things any more than they already would have been (it may actually _reduce_ fragmentation because there is less data being stored on disk). This representation is a bug in the FIEMAP ioctl; it doesn't understand the way BTRFS represents things properly. IIRC, there was a patch to fix this, but I don't remember what happened with it. That said, in-line compression can help significantly, especially if you have slow storage devices. I did have a similar problem with MySQL (Zabbix as a workload, i.e. most of the time the load is random writes), and I fixed it by enabling compression. (I use Debian with the latest kernel from backports.) Now it just works with stable speed under stable load.
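Enabling compression as described here is a mount option (a sketch; the subvolume path and the choice of lzo over zlib are assumptions, and existing data is only compressed as it gets rewritten):

```shell
# Remount the database volume with transparent compression;
# lzo favours speed over compression ratio.
mount -o remount,compress=lzo /var/lib/mysql

# Or persistently, as an /etc/fstab entry:
# UUID=...  /var/lib/mysql  btrfs  compress=lzo  0  0

# Force-compress data that is already on disk (also defragments it):
btrfs filesystem defragment -r -clzo /var/lib/mysql
```

Note that defragmenting files that are shared with snapshots breaks the reflinks and can temporarily increase space usage, as discussed elsewhere in this thread.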
Re: BTRFS for OLTP Databases
2017-02-07 17:13 GMT+03:00 Peter Zaitsev:
> Hi Hugo,
>
> For the use case I'm looking for I'm interested in having snapshot(s) open at all times. Imagine for example a snapshot being created every hour and several of these snapshots kept at all times, providing quick recovery points to the state of 1, 2, 3 hours ago. In such a case (as I think you also describe) nodatacow does not provide any advantage.
>
> I have not seen autodefrag helping much but I will try again. Is there any autodefrag documentation available about how it is expected to work and if it can be tuned in any way?
>
> I noticed that remounting an already fragmented filesystem with autodefrag and putting on a workload which does more fragmentation does not seem to improve things over time.
>
>> Well, nodatacow will still allow snapshots to work, but it also allows the data to fragment. Each snapshot made will cause subsequent writes to shared areas to be CoWed once (and then it reverts to unshared and nodatacow again).
>>
>> There's another approach which might be worth testing, which is to use autodefrag. This will increase data write I/O, because where you have one or more small writes in a region, it will also read and write the data in a small neighbourhood around those writes, so the fragmentation is reduced. This will improve subsequent read performance.
>>
>> I could also suggest getting the latest kernel you can -- 16.04 is already getting on for a year old, and there may be performance improvements in upstream kernels which affect your workload. There's an Ubuntu kernel PPA you can use to get the new kernels without too much pain.
> -- 
> Peter Zaitsev, CEO, Percona
> Tel: +1 888 401 3401 ext 7360 Skype: peter_zaitsev

I think that you have a problem with extent bookkeeping (if I understand how btrfs manages extents). So to deal with it, try enabling compression, as compression will force all extents to be fragmented at ~128kb. I did have a similar problem with MySQL (Zabbix as a workload, i.e. most of the time the load is random writes), and I fixed it by enabling compression. (I use Debian with the latest kernel from backports.) Now it just works with stable speed under stable load.

P.S. (And I also use your Percona MySQL from time to time; it's cool.)

-- 
Have a nice day,
Timofey.
Re: BTRFS for OLTP Databases
> I have tried BTRFS from Ubuntu 16.04 LTS for a write-intensive OLTP MySQL workload.

This has a lot of interesting and mostly agreeable information: https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp

The main target of Btrfs is where one wants checksums and the occasional snapshot for backup (rather than rollback), and applications do whole-file rewrites or appends.

> It did not go very well, ranging from multi-second stalls where no transactions are completed

That usually is more because of the "clever" design and defaults of the Linux page cache and block IO subsystem, which are astutely pessimized for every workload, but especially for read-modify-write ones, never mind for RMW workloads on copy-on-write filesystems. That most OS designs are pessimized for anything like a "write intensive OLTP" workload is not new; M Stonebraker complained about that 35 years ago, and nothing much has changed: http://www.sabi.co.uk/blog/anno05-4th.html?051012d#051012d

> to, finally, a kernel OOPS with a "no space left on device" error message and the filesystem going read-only.

That's because Btrfs has a two-level allocator, where space is allocated in 1GiB chunks (distinct as to data and metadata) and then in 16KiB nodes, and this makes it far more likely for free space fragmentation to occur. Therefore Btrfs has a free space compactor ('btrfs balance') that must be used the more often the more updates happen.

> interested in "free" snapshots which look very attractive

The general problem is that it is pretty much impossible to have read-modify-write rollbacks for cheap, because the writes in general are scattered (that is, their time coherence is very different from their spatial coherence). That means either heavy spatial fragmentation or huge write amplification. The 'snapshot' type of DM/LVM2 device delivers heavy spatial fragmentation; Btrfs does a balance of both.
Another commenter has mentioned the use of 'nodatacow' to prevent RMW resulting in huge write amplification.

> to use for database recovery scenarios, allowing instant rollback to the previous state.

You may be more interested in NILFS2 for that, but there are significant tradeoffs there too: NILFS2 requires a free space compactor as well, plus, since NILFS2 gives up on short-term spatial coherence, the compactor also needs to compact data space.
Re: BTRFS for OLTP Databases
On 2017-02-07 08:53, Peter Zaitsev wrote:
> Hi,
>
> I have tried BTRFS from Ubuntu 16.04 LTS for a write-intensive OLTP MySQL workload.
>
> It did not go very well, ranging from multi-second stalls where no transactions are completed to, finally, a kernel OOPS with a "no space left on device" error message and the filesystem going read-only.

How much spare space did you have allocated in the filesystem? At a minimum, you want at least a few GB beyond what you expect to be the maximum size of your data-set, times the number of snapshots you plan to keep around at any given time.

> I'm a complete newbie with BTRFS, so I assume I'm doing something wrong.

Not exactly wrong, but getting this to work efficiently is more art than engineering.

> Do you have any advice on how BTRFS should be tuned for an OLTP workload (large files having a lot of random writes)? Or is this a case where one should simply stay away from BTRFS and use something else?

The general recommendation is usually to avoid BTRFS for such things. There are however a number of things you can do to improve performance:

1. Use a backing storage format that has the minimal amount of complexity. The more data structures that get updated when a record changes, the worse the performance will be. I don't have enough experience with MySQL to give a specific recommendation on what backing storage format to use, but someone else might.

2. Avoid large numbers of small transactions. The smaller the transaction, the worse it will fragment things.

3. Use autodefrag. This will increase write load on the storage device, but it should improve performance for reads.

4. Try using in-line compression. This can actually significantly improve performance, especially if you have slow storage devices and a really nice CPU.

5. If you're running raid10 mode for BTRFS, run raid1 on top of two LVM or MD RAID0 devices instead. This sounds stupid, but it actually will hugely improve both read and write performance without sacrificing any data safety.

6.
Look at I/O scheduler tuning. This can have a huge impact, especially considering that most of the defaults for the various schedulers are somewhat poor for most modern systems. I won't go into the details here, since there are a huge number of online resources about this. One item recommended in some places is "nodatacow" this however defeats the main purpose I'm looking at BTRFS - I am interested in "free" snapshots which look very attractive to use for database recovery scenarios allow instant rollback to the previous state. Snapshots aren't free. They are quick, but they aren't free by any means. If you're going to be using snapshots, keep them to a minimum, performance scales inversely proportionate to the number of snapshots, and this has a much bigger impact the more you're trying to do on the filesystem. Also, consider whether or not you _actually_ need filesystem level snapshots. I don't know about your full software stack, but most good OLTP software supports rollback segments (or an equivalent with a different name), and those are probably what you want to use, not filesystem snapshots. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
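As a concrete sketch of items 3-6 above: the device names, mount point, and option values below are hypothetical placeholders, not a tested recipe, and the `run` helper echoes each command instead of executing it, so the whole script is a dry run.

```shell
#!/bin/sh
# Dry-run sketch of the tuning suggestions above. Device names and the
# mount point are placeholders -- adapt before running for real, and
# drop the `run` wrapper only once the commands look right.
run() { echo "+ $*"; }

# Items 3+4: autodefrag and in-line compression as mount options
# (compress=lzo assumes an lzo-capable kernel of that era).
run mount -o remount,autodefrag,compress=lzo,noatime /var/lib/mysql

# Item 5: BTRFS raid1 across two MD RAID0 legs instead of BTRFS raid10.
run mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
run mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc1 /dev/sdd1
run mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1

# Item 6: pick a simpler I/O scheduler for the database disk
# (deadline here; cfq was the common default at the time).
run sh -c 'echo deadline > /sys/block/sdX/queue/scheduler'
```

Because every command is only echoed, the sketch can be pasted and reviewed safely before committing to any of it.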
Re: BTRFS for OLTP Databases
Hi Hugo,

For the use case I'm looking at, I'm interested in having snapshot(s) open at all times. Imagine, for example, a snapshot being created every hour and several of these snapshots kept at all times, providing quick recovery points to the state of 1, 2, or 3 hours ago. In such a case (as I think you also describe), nodatacow does not provide any advantage.

I have not seen autodefrag help much, but I will try again. Is there any autodefrag documentation available about how it is expected to work and whether it can be tuned in any way? I noticed that remounting an already fragmented filesystem with autodefrag and applying a workload that causes more fragmentation does not seem to improve things over time.

> Well, nodatacow will still allow snapshots to work, but it also
> allows the data to fragment. Each snapshot made will cause subsequent
> writes to shared areas to be CoWed once (and then it reverts to
> unshared and nodatacow again).
>
> There's another approach which might be worth testing, which is to
> use autodefrag. This will increase data write I/O, because where you
> have one or more small writes in a region, it will also read and write
> the data in a small neighbourhood around those writes, so the
> fragmentation is reduced. This will improve subsequent read
> performance.
>
> I could also suggest getting the latest kernel you can -- 16.04 is
> already getting on for a year old, and there may be performance
> improvements in upstream kernels which affect your workload. There's
> an Ubuntu kernel PPA you can use to get the new kernels without too
> much pain.

-- Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360
Skype: peter_zaitsev
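The hourly-snapshot scheme described above could be rotated with a small job like the following sketch. All paths, the naming convention, and the retention count are hypothetical; the `run` helper echoes the btrfs commands instead of executing them, and a few fake snapshot directories are pre-seeded so the rotation logic has something to do.

```shell
#!/bin/sh
# Dry-run sketch of hourly snapshot rotation. Paths and names are
# hypothetical; `run` echoes btrfs commands rather than executing them.
run() { echo "+ $*"; }

DB_SUBVOL=/var/lib/mysql          # subvolume holding the database
SNAP_DIR=/tmp/mysql-snapshots     # where hourly snapshots live
KEEP=3                            # how many recovery points to retain

mkdir -p "$SNAP_DIR"
# Pre-seed fake hourly snapshots (YYYYMMDDHH) purely for illustration.
for h in 2017020700 2017020701 2017020702 2017020703; do
    mkdir -p "$SNAP_DIR/$h"
done

# Take a read-only snapshot named after the current hour.
run btrfs subvolume snapshot -r "$DB_SUBVOL" "$SNAP_DIR/$(date +%Y%m%d%H)"

# Keep only the newest $KEEP snapshots; performance degrades as the
# snapshot count grows, so older recovery points are dropped.
ls -1 "$SNAP_DIR" | sort -r | tail -n +$((KEEP + 1)) | while read -r old; do
    run btrfs subvolume delete "$SNAP_DIR/$old"
done
```

With KEEP=3 and the four seeded directories, the rotation selects only the oldest entry (2017020700) for deletion.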
Re: BTRFS for OLTP Databases
On Tue, Feb 07, 2017 at 08:53:35AM -0500, Peter Zaitsev wrote:
> Hi,
>
> I have tried BTRFS from Ubuntu 16.04 LTS for a write-intensive OLTP MySQL
> workload.
>
> It did not go very well, ranging from multi-second stalls where no
> transactions are completed to, finally, a kernel oops with a "no space left
> on device" error message and the filesystem going read-only.
>
> I'm a complete newbie with BTRFS, so I assume I'm doing something wrong.
>
> Do you have any advice on how BTRFS should be tuned for an OLTP workload
> (large files having a lot of random writes)? Or is this a case where
> one should simply stay away from BTRFS and use something else?
>
> One item recommended in some places is "nodatacow"; this, however, defeats
> the main purpose I'm looking at BTRFS for - I am interested in "free"
> snapshots, which look very attractive to use for database recovery scenarios,
> allowing instant rollback to the previous state.

Well, nodatacow will still allow snapshots to work, but it also allows the data to fragment. Each snapshot made will cause subsequent writes to shared areas to be CoWed once (and then it reverts to unshared and nodatacow again).

There's another approach which might be worth testing, which is to use autodefrag. This will increase data write I/O, because where you have one or more small writes in a region, it will also read and write the data in a small neighbourhood around those writes, so the fragmentation is reduced. This will improve subsequent read performance.

I could also suggest getting the latest kernel you can -- 16.04 is already getting on for a year old, and there may be performance improvements in upstream kernels which affect your workload. There's an Ubuntu kernel PPA you can use to get the new kernels without too much pain.

Hugo.

-- Hugo Mills        | I don't care about "it works on my machine". We are
hugo@... carfax.org.uk | not shipping your machine.
http://carfax.org.uk/  | PGP: E2AB1DE4
BTRFS for OLTP Databases
Hi,

I have tried BTRFS from Ubuntu 16.04 LTS for a write-intensive OLTP MySQL workload.

It did not go very well, ranging from multi-second stalls where no transactions are completed to, finally, a kernel oops with a "no space left on device" error message and the filesystem going read-only.

I'm a complete newbie with BTRFS, so I assume I'm doing something wrong.

Do you have any advice on how BTRFS should be tuned for an OLTP workload (large files having a lot of random writes)? Or is this a case where one should simply stay away from BTRFS and use something else?

One item recommended in some places is "nodatacow"; this, however, defeats the main purpose I'm looking at BTRFS for - I am interested in "free" snapshots, which look very attractive to use for database recovery scenarios, allowing instant rollback to the previous state.

-- Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360
Skype: peter_zaitsev
Re: [PATCH] Btrfs: add another missing end_page_writeback on submit_extent_page failure
On 2017/02/07 1:34, Liu Bo wrote:
> One thing to add: we still need to check whether the page has the writeback bit before end_page_writeback().

OK, I added a PageWriteback() check before end_page_writeback(). It looks like commit 55e3bd2e0c2e1 also has the same problem, although I gave it my Reviewed-by. So I also added a PageWriteback() check in write_one_eb(). Finally, the diff becomes like below. Is it OK?

---
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4ac383a..aa1908a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3445,8 +3445,11 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 					 bdev, &epd->bio, max_nr,
 					 end_bio_extent_writepage,
 					 0, 0, 0, false);
-		if (ret)
+		if (ret) {
 			SetPageError(page);
+			if (PageWriteback(page))
+				end_page_writeback(page);
+		}
 
 		cur = cur + iosize;
 		pg_offset += iosize;
@@ -3767,7 +3770,8 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 		epd->bio_flags = bio_flags;
 		if (ret) {
 			set_btree_ioerr(p);
-			end_page_writeback(p);
+			if (PageWriteback(p))
+				end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
 			ret = -EIO;
---

Sincerely,
-takafumi

> Reviewed-by: Liu Bo
>
> So I don't think the patch is necessary for now. But as I said, the fact (nr == 0 or 1) would change if subpagesize blocksize support is added.
>
> Thanks,
> -liubo

On 2017/01/31 5:09, Liu Bo wrote:
On Fri, Jan 13, 2017 at 03:12:31PM +0900, takafumi-sslab wrote:
> Thanks for your reply. I understand this bug is more complicated than I expected. I classify the error cases under submit_extent_page() below:
>
> A: ENOMEM error at btrfs_bio_alloc() in submit_extent_page()
> I first assumed this case and sent the mail. When bio_ret is NULL, submit_extent_page() calls btrfs_bio_alloc(). Then btrfs_bio_alloc() may fail and submit_extent_page() returns -ENOMEM. In this case, bio_endio() is not called and the page's writeback bit remains, so there is a need to call end_page_writeback() in the error handling.
>
> B: errors under submit_one_bio() of submit_extent_page()
> Errors that occur under submit_one_bio() are handled in bio_endio(), and bio_endio() will call end_page_writeback(). Therefore, as you mentioned in the last mail, simply adding end_page_writeback() like my last email and commit 55e3bd2e0c2e1 can conflict in case B. To avoid such a conflict, one easy solution is to also add a PageWriteback() check. What do you think of this solution?

(sorry for the late reply.)

I think its caller, __extent_writepage(), has covered the above case by setting page writeback again.

Thanks,
-liubo

On 2016/12/22 15:20, Liu Bo wrote:
On Fri, Dec 16, 2016 at 03:41:50PM +0900, Takafumi Kubota wrote:
> This is actually inspired by Filipe's patch (55e3bd2e0c2e1).
>
> When submit_extent_page() in __extent_writepage_io() fails, Btrfs misses clearing the writeback bit of the failed page. This leaves the page falsely under writeback. Another sync task then hangs in filemap_fdatawait_range(), because it waits on the falsely under-writeback page:
>
> CPU0                                   CPU1
>
> __extent_writepage_io()
>   ret = submit_extent_page() // fail
>   if (ret)
>     SetPageError(page)
>     // miss clearing the writeback bit
>                                        sync()
>                                        ...
>                                        filemap_fdatawait_range()
>                                          wait_on_page_writeback(page);
>                                          // wait on the false under-writeback page
>
> Signed-off-by: Takafumi Kubota
> ---
> fs/btrfs/extent_io.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 1e67723..ef9793b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3443,8 +3443,10 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
>  					 bdev, &epd->bio, max_nr,
>  					 end_bio_extent_writepage,
>  					 0, 0, 0, false);
> -		if (ret)
> +		if (ret) {
>  			SetPageError(page);
> +			end_page_writeback(page);
> +		}

OK... this could be complex, as we don't know which part in
Re: btrfs/125 deadlock using nospace_cache or space_cache=v2
At 02/07/2017 04:02 PM, Anand Jain wrote:
> Hi Qu,
>
> I don't think I have seen this before. I don't know the reason why I wrote this - maybe to test encryption - however it was all with default options.

Forgot to mention: thanks for the test case. Otherwise we would never have found it.

Thanks,
Qu

> But now I can reproduce it, and it looks like balance fails to start with an IO error, though the mount is successful.
>
> --
> # tail -f ./results/btrfs/125.full
> intense and takes potentially very long. It is recommended to use the
> balance filters to narrow down the balanced data.
> Use 'btrfs balance start --full-balance' option to skip this warning.
> The operation will start in 10 seconds.
> Use Ctrl-C to stop it.
> 10 9 8 7 6 5 4 3 2 1
> ERROR: error during balancing '/scratch': Input/output error
> There may be more info in syslog - try dmesg | tail
> Starting balance without any filters.
> failed: '/root/bin/btrfs balance start /scratch'
> --
>
> This must be fixed.
>
> For debugging: if I add a sync before the previous unmount, the problem isn't reproduced. Just FYI.

Strange.

> ---
> diff --git a/tests/btrfs/125 b/tests/btrfs/125
> index 91aa8d8c3f4d..4d4316ca9f6e 100755
> --- a/tests/btrfs/125
> +++ b/tests/btrfs/125
> @@ -133,6 +133,7 @@
>  echo "-Mount normal-" >> $seqres.full
>  echo
>  echo "Mount normal and balance"
> +_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
>  _scratch_unmount
>  _run_btrfs_util_prog device scan
>  _scratch_mount >> $seqres.full 2>&1
> --
>
> HTH.
> Thanks, Anand
>
> On 02/07/17 14:09, Qu Wenruo wrote:
>> Hi Anand,
>>
>> I found that the btrfs/125 test case can only pass if space cache is enabled. If using the nospace_cache or space_cache=v2 mount option, it gets blocked forever with the following call stack (the only blocked process):
>>
>> [11382.046978] btrfs D 11128 6705 6057 0x
>> [11382.047356] Call Trace:
>> [11382.047668]  __schedule+0x2d4/0xae0
>> [11382.047956]  schedule+0x3d/0x90
>> [11382.048283]  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
>> [11382.048630]  ? wake_atomic_t_function+0x60/0x60
>> [11382.048958]  btrfs_wait_ordered_range+0x113/0x210 [btrfs]
>> [11382.049360]  btrfs_relocate_block_group+0x260/0x2b0 [btrfs]
>> [11382.049703]  btrfs_relocate_chunk+0x51/0xf0 [btrfs]
>> [11382.050073]  btrfs_balance+0xaa9/0x1610 [btrfs]
>> [11382.050404]  ? btrfs_ioctl_balance+0x3a0/0x3b0 [btrfs]
>> [11382.050739]  btrfs_ioctl_balance+0x3a0/0x3b0 [btrfs]
>> [11382.051109]  btrfs_ioctl+0xbe7/0x27f0 [btrfs]
>> [11382.051430]  ? trace_hardirqs_on+0xd/0x10
>> [11382.051747]  ? free_object+0x74/0xa0
>> [11382.052084]  ? debug_object_free+0xf2/0x130
>> [11382.052413]  do_vfs_ioctl+0x94/0x710
>> [11382.052750]  ? enqueue_hrtimer+0x160/0x160
>> [11382.053090]  ? do_nanosleep+0x71/0x130
>> [11382.053431]  SyS_ioctl+0x79/0x90
>> [11382.053735]  entry_SYSCALL_64_fastpath+0x18/0xad
>> [11382.054570] RIP: 0033:0x7f397d7a6787
>>
>> I also found that in the test case we only have 3 contiguous data extents, whose sizes are 1M, 68.5M and 31.5M respectively.
>>
>> Original data block group:
>>
>> 0      1M                  69.5M              101M    128M
>> | Ext A |  Extent B (68.5M)  | Extent C (31.5M) |
>>
>> While relocation writes them in 4 extents:
>> 0 ~ 1M:            same as Extent A       (1st)
>> 1M ~ 68.3438M:     smaller than Extent B  (2nd)
>> 68.3438M ~ 69.5M:  tail part of Extent B  (3rd)
>> 69.5M ~ 101M:      same as Extent C       (4th)
>>
>> However, only the ordered extents of (3rd) and (4th) get finished, while the ordered extents of (1st) and (2nd) never reach finish_ordered_io(). So relocation waits for no one to finish these two ordered extents, and gets blocked.
>>
>> Did you experience the same bug when submitting the test case? Is there any known fix for it?
>>
>> Thanks,
>> Qu
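A quick sanity check of the extent bookkeeping in the analysis above - that the four relocated extents exactly tile the original 0~101M range with no gap or overlap - can be done with a few lines of awk (offsets in MiB, taken straight from the message):

```shell
#!/bin/sh
# Verify that the 4 relocated extents (start/end offsets in MiB, from
# the analysis above) cover 0..101M contiguously: each extent must
# start exactly where the previous one ended.
res=$(awk 'BEGIN {
    n = split("0 1 1 68.3438 68.3438 69.5 69.5 101", p, " ");
    ok = (p[1] == 0 && p[n] == 101);        # range endpoints
    for (i = 3; i < n; i += 2)
        if (p[i] != p[i-1]) ok = 0;         # gap or overlap found
    if (ok) print "contiguous"; else print "broken";
}')
echo "$res"
```

This confirms the split is purely a re-chunking of the same 101M of data, so the hang is in ordered-extent completion, not in lost ranges.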
Re: btrfs/125 deadlock using nospace_cache or space_cache=v2
Hi Anand,

At 02/07/2017 04:02 PM, Anand Jain wrote:
> Hi Qu,
>
> I don't think I have seen this before. I don't know the reason why I wrote this - maybe to test encryption - however it was all with default options.
>
> But now I can reproduce it, and it looks like balance fails to start with an IO error, though the mount is successful.
>
> For debugging: if I add a sync before the previous unmount, the problem isn't reproduced. Just FYI.

Strange. Thanks for the extra info; this seems to be a clue to dig further.

Thanks,
Qu

> [snip]