Re: understanding disk space usage

2017-02-07 Thread Qu Wenruo



At 02/08/2017 12:44 AM, Vasco Visser wrote:

Hello,

My system is or seems to be running out of disk space but I can't find
out how or why. Might be a BTRFS peculiarity, hence posting on this
list. Most indicators seem to suggest I'm filling up, but I can't
trace the disk usage to files on the FS.

The issue is on my root filesystem on a 28GiB ssd partition (commands
below issued when booted into single user mode):


$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3  28G   26G  2.1G  93% /


$ btrfs --version
btrfs-progs v4.4


$ btrfs fi usage /
Overall:
Device size:  27.94GiB
Device allocated:  27.94GiB
Device unallocated:   1.00MiB


So at the chunk level, your fs is already full.

And balance won't succeed since there is no unallocated space at all.
The first 1MiB of a btrfs device is always reserved and won't be allocated,
and 1MiB is too small for btrfs to allocate a new chunk anyway.



Device missing: 0.00B
Used:  25.03GiB
Free (estimated):   2.37GiB (min: 2.37GiB)
Data ratio:  1.00
Metadata ratio:  1.00
Global reserve: 256.00MiB (used: 0.00B)
Data,single: Size:26.69GiB, Used:24.32GiB


You still have about 2.4GiB of free data space, so you can still write things.


   /dev/sda3  26.69GiB
Metadata,single: Size:1.22GiB, Used:731.45MiB


Metadata has less space than it appears once the "Global reserve" is taken
into account. In fact the effective used space would be about 987MiB
(731MiB used + 256MiB reserve).

But it's still OK for normal writes.


   /dev/sda3   1.22GiB
System,single: Size:32.00MiB, Used:16.00KiB
   /dev/sda3  32.00MiB


System chunk can hardly be used up.


Unallocated:
   /dev/sda3   1.00MiB


$ btrfs fi df /
Data, single: total=26.69GiB, used=24.32GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=1.22GiB, used=731.48MiB
GlobalReserve, single: total=256.00MiB, used=0.00B


However:
$ mount -o bind / /mnt
$ sudo du -hs /mnt
9.3G /mnt


Try to balance:
$ btrfs balance start /
ERROR: error during balancing '/': No space left on device


Am I really filling up? What can explain the huge discrepancy between the
output of du and the FS stats (open file descriptors on deleted files
cannot explain this in single-user mode)?


Just don't believe the vanilla df output for btrfs.

Unlike other filesystems such as ext4/xfs, btrfs allocates chunks
dynamically and can use different metadata/data profiles, so you only get
a clear view of the fs by looking at both the chunk level
(allocated/unallocated) and the extent level (total/used).


In your case, the fs doesn't have any unallocated space, which makes
balance unable to work at all.


And your data/metadata usage is quite high. Although both still have a small
amount of available space left, the fs will only remain writable for a while,
not for long.


To proceed, add a larger device to the current fs and then either run a
balance, or just delete (remove) the 28GiB device from the fs and btrfs
will handle the rest well.
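
A minimal sketch of that path (run as root; the added device /dev/sdb1 is
just a placeholder for whatever spare device is available):

# add a second, larger device to the filesystem
$ btrfs device add /dev/sdb1 /
# then rebalance across both devices
$ btrfs balance start /
# or, instead of balancing, migrate everything off the old partition
$ btrfs device delete /dev/sda3 /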


Thanks,
Qu



Any advice on possible causes and how to proceed?


--
Vasco
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html







Re: raid1: cannot add disk to replace faulty because can only mount fs as read-only.

2017-02-07 Thread Hans Deragon
Greetings,

On 2017-02-02 10:06, Austin S. Hemmelgarn wrote:
> On 2017-02-02 09:25, Adam Borowski wrote:
>> On Thu, Feb 02, 2017 at 07:49:50AM -0500, Austin S. Hemmelgarn wrote:
>>> This is a severe bug that makes a not all that uncommon (albeit bad) use
>>> case fail completely.  The fix had no dependencies itself and
>>
>> I don't see what's bad in mounting a RAID degraded.  Yeah, it provides no
>> redundancy but that's no worse than using a single disk from the start.
>> And most people not doing storage/server farm don't have a stack of spare
>> disks at hand, so getting a replacement might take a while.
> Running degraded is bad. Period.  If you don't have a disk on hand to
> replace the failed one (and if you care about redundancy, you should
> have at least one spare on hand), you should be converting to a single
> disk, not continuing to run in degraded mode until you get a new disk.
> The moment you start talking about running degraded long enough that you
> will be _booting_ the system with the array degraded, you need to be
> converting to a single disk.  This is of course impractical for
> something like a hardware array or an LVM volume, but it's _trivial_
> with BTRFS, and protects you from all kinds of bad situations that can't
> happen with a single disk but can completely destroy the filesystem if
> it's a degraded array.  Running a single disk is not exactly the same as
> running a degraded array, it's actually marginally safer (even if you
> aren't using dup profile for metadata) because there are fewer moving
> parts to go wrong.  It's also exponentially more efficient.
>>
>> Being able to continue to run when a disk fails is the whole point of
>> RAID
>> -- despite what some folks think, RAIDs are not for backups but for
>> uptime.
>> And if your uptime goes to hell because the moment a disk fails you
>> need to
>> drop everything and replace the disk immediately, why would you use RAID?
> Because just replacing a disk and rebuilding the array is almost always
> much cheaper in terms of time than rebuilding the system from a backup.
> IOW, even if you have to drop everything and replace the disk
> immediately, it's still less time consuming than restoring from a
> backup.  It also has the advantage that you don't lose any data.

We disagree on letting people run degraded, which I support and you do not.
I respect your opinion.  However, I have to ask who decides these rules?
Obviously not me, since I am a simple btrfs home user.

Since Oracle is funding btrfs development, is that Oracle's official
stand on how to handle a failed disk?  Who decides of btrfs's roadmap?
I have no clue who is who on this mailing list and who influences the
features of btrfs.

Oracle is obviously using raid systems internally.  How do the operators
of these raid systems feel about this "not let the system run in
degraded mode"?

As a home user, I do not want to keep a spare disk always available.  That
means paying for a disk very early when the raid system can easily run for
two years without a disk failure.  I want to buy the new disk (asap, of
course) once one dies.  By that time, the cost of a drive will have
fallen drastically.  Yes, I can live with running my home system (which
has backups) for a day or two in degraded rw mode, until I purchase and
can install a new disk.  Chances are low that both disks will quit at
around the same time.

Simply because I cannot run in degraded mode and cannot add a disk to my
current degraded raid1, despite having my replacement disk in my hands,
I must resort to switching to mdadm or zfs.

Having a policy that limits users' options on the assumption that they are
too stupid to understand the implications is wrong.  It's ok for
applications, but not at the operating system level; there should be a way
to force this.  A
--yes-i-know-what-i-am-doing-now-please-mount-rw-degraded-so-i-can-install-the-new-disk
parameter must be implemented.  Currently, it is like disallowing root
to run mkfs over an existing filesystem because people could erase data
by mistake.  Let people do what they want and let them live with the
consequences.

hdparm has a --yes-i-know-what-i-am-doing flag.  btrfs needs one.

Whoever decides about btrfs features to add, please consider this one.

Best regards,
Hans Deragon
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS for OLTP Databases

2017-02-07 Thread Peter Zaitsev
Hi Kai,

I guess your message did not make it to me as I'm not subscribed to the list.

I totally understand that the snapshot is "crash consistent" -
consistent with the state of the disk you would find if you cut
the power with no notice.  For many applications that is a problem;
however, it is fine for many databases, which already need to be able
to recover correctly from power loss.

For MySQL this works well for the InnoDB storage engine; it does not
work for MyISAM.

The great thing about such an "uncoordinated" snapshot is that it is
instant and has very little production impact - if you want to "freeze"
multiple filesystems, or even worse flush MyISAM tables, it can take a
lot of time and can be unacceptable for many 24/7 workloads.

Or are you saying BTRFS snapshots do not provide this kind of consistency ?
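
As a rough sketch, the "uncoordinated" hourly snapshot described above can be
as simple as the following (paths are placeholders, the MySQL datadir is
assumed to be its own subvolume, the snapshot target must be on the same
filesystem, and InnoDB's crash recovery is what makes the copy usable):

# as root
$ mkdir -p /var/lib/mysql-snapshots
$ btrfs subvolume snapshot -r /var/lib/mysql /var/lib/mysql-snapshots/$(date +%Y%m%d-%H)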

> Hi Hugo,
>
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time.  Imagine  for example snapshot being created every
> hour and several of these snapshots  kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
> think you also describe)  nodatacow  does not provide any advantage.

Out of curiosity, I see one problem here:

If you're doing snapshots of the live database, each snapshot leaves
the database files like killing the database in-flight. Like shutting
the system down in the middle of writing data.

This is because I think there's no API for user space to subscribe to
events like a snapshot - unlike e.g. the VSS API (volume snapshot
service) in Windows. You should put the database into frozen state to
prepare it for a hotcopy before creating the snapshot, then ensure all
data is flushed before continuing.

I think I've read that btrfs snapshots do not guarantee single point in
time snapshots - the snapshot may be smeared across a longer period of
time while the kernel is still writing data. So parts of your writes
may still end up in the snapshot after issuing the snapshot command,
instead of in the working copy as expected.

How is this going to be addressed? Is there some snapshot aware API to
let user space subscribe to such events and do proper preparation? Is
this planned? LVM could be a user of such an API, too. I think this
could have nice enterprise-grade value for Linux.

XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
still, also this needs to be integrated with MySQL to properly work. I
once (years ago) researched on this but gave up on my plans when I
planned database backups for our web server infrastructure. We moved to
creating SQL dumps instead, although there're binlogs which can be used
to recover to a clean and stable transactional state after taking
snapshots. But I simply didn't want to fiddle around with properly
cleaning up binlogs which accumulate horribly much space usage over
time. The cleanup process requires to create a cold copy or dump of the
complete database from time to time, only then it's safe to remove all
binlogs up to that point in time.

-- 
Regards,
Kai

On Tue, Feb 7, 2017 at 9:00 AM, Hugo Mills  wrote:
> On Tue, Feb 07, 2017 at 08:53:35AM -0500, Peter Zaitsev wrote:
>> Hi,
>>
>> I have tried BTRFS from Ubuntu 16.04 LTS   for write intensive OLTP MySQL
>> Workload.
>>
>> It did not go very well ranging from multi-seconds stalls where no
>> transactions are completed to the finally kernel OOPS with "no space left
>> on device" error message and filesystem going read only.
>>
>> I'm complete newbie in BTRFS so  I assume  I'm doing something wrong.
>>
>> Do you have any advice on how BTRFS should be tuned for OLTP workload
>> (large files having a lot of random writes)  ?Or is this the case where
>> one should simply stay away from BTRFS and use something else ?
>>
>> One item recommended in some places is "nodatacow"  this however defeats
>> the main purpose I'm looking at BTRFS -  I am interested in "free"
>> snapshots which look very attractive to use for database recovery scenarios
>> allow instant rollback to the previous state.
>
>Well, nodatacow will still allow snapshots to work, but it also
> allows the data to fragment. Each snapshot made will cause subsequent
> writes to shared areas to be CoWed once (and then it reverts to
> unshared and nodatacow again).
>
>There's another approach which might be worth testing, which is to
> use autodefrag. This will increase data write I/O, because where you
> have one or more small writes in a region, it will also read and write
> the data in a small neighbourhood around those writes, so the
> fragmentation is reduced. This will improve subsequent read
> performance.
>
>I could also suggest getting the latest kernel you can -- 16.04 is
> already getting on for a year old, and there may be performance
> improvements in upstream kernels which affect your workload. There's
> an Ubuntu kernel PPA you can use to get the new kernels without too
> much pain.
>
>Hugo.
>
> 
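
Following up on the autodefrag suggestion above, enabling it is just a mount
option - a minimal sketch (device and mount point are placeholders; it can
also be made permanent in /etc/fstab):

$ mount -o autodefrag /dev/nvme0n1 /var/lib/mysql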

[PATCH] btrfs: qgroup: Move half of the qgroup accounting time out of commit trans

2017-02-07 Thread Qu Wenruo
Just as Filipe pointed out, the most time consuming part of qgroup is
btrfs_qgroup_account_extents() and
btrfs_qgroup_prepare_account_extents().
Which both call btrfs_find_all_roots() to get old_roots and new_roots
ulist.

However for old_roots, we don't really need to calculate it at transaction
commit time.

This patch moves the old_roots accounting part out of
commit_transaction(), so at least we won't block transaction too long.

But please note that, this won't speedup qgroup overall, it just moves
half of the cost out of commit_transaction().

Cc: Filipe Manana 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.c | 20 
 fs/btrfs/qgroup.c  | 33 ++---
 fs/btrfs/qgroup.h  | 14 ++
 3 files changed, 60 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index ef724a5..0ee927e 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -550,13 +550,14 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
 struct btrfs_delayed_ref_node *ref,
 struct btrfs_qgroup_extent_record *qrecord,
 u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
-int action, int is_data)
+int action, int is_data, int *qrecord_inserted_ret)
 {
struct btrfs_delayed_ref_head *existing;
struct btrfs_delayed_ref_head *head_ref = NULL;
struct btrfs_delayed_ref_root *delayed_refs;
int count_mod = 1;
int must_insert_reserved = 0;
+   int qrecord_inserted = 0;
 
/* If reserved is provided, it must be a data extent. */
BUG_ON(!is_data && reserved);
@@ -623,6 +624,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
if(btrfs_qgroup_trace_extent_nolock(fs_info,
delayed_refs, qrecord))
kfree(qrecord);
+   else
+   qrecord_inserted = 1;
}
 
	spin_lock_init(&head_ref->lock);
@@ -650,6 +653,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
	atomic_inc(&delayed_refs->num_entries);
trans->delayed_ref_updates++;
}
+   if (qrecord_inserted_ret)
+   *qrecord_inserted_ret = qrecord_inserted;
return head_ref;
 }
 
@@ -779,6 +784,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
*fs_info,
struct btrfs_delayed_ref_head *head_ref;
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
+   int qrecord_inserted;
 
BUG_ON(extent_op && extent_op->is_data);
ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
@@ -806,12 +812,15 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
*fs_info,
 * the spin lock
 */
	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
-   bytenr, num_bytes, 0, 0, action, 0);
+   bytenr, num_bytes, 0, 0, action, 0,
+   &qrecord_inserted);

	add_delayed_tree_ref(fs_info, trans, head_ref, &ref->node, bytenr,
 num_bytes, parent, ref_root, level, action);
	spin_unlock(&delayed_refs->lock);
 
+   if (qrecord_inserted)
+   return btrfs_qgroup_trace_extent_post(fs_info, record);
return 0;
 
 free_head_ref:
@@ -836,6 +845,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
struct btrfs_delayed_ref_head *head_ref;
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
+   int qrecord_inserted;
 
BUG_ON(extent_op && !extent_op->is_data);
ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
@@ -870,13 +880,15 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
 */
	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
bytenr, num_bytes, ref_root, reserved,
-   action, 1);
+   action, 1, &qrecord_inserted);

	add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
   num_bytes, parent, ref_root, owner, offset,
   action);
	spin_unlock(&delayed_refs->lock);
 
+   if (qrecord_inserted)
+   return btrfs_qgroup_trace_extent_post(fs_info, record);
return 0;
 }
 
@@ -899,7 +911,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info 
*fs_info,
 
	add_delayed_ref_head(fs_info, trans, &head_ref->node, NULL, bytenr,
 num_bytes, 0, 0, BTRFS_UPDATE_DELAYED_HEAD,
-extent_op->is_data);
+extent_op->is_data, NULL);
 
   

user_subvol_rm_allowed? Is there a user_subvol_create_deny|allowed?

2017-02-07 Thread Nicholas D Steeves
Dear btrfs community,

Please accept my apologies in advance if I missed something in recent
btrfs development; my MUA tells me I'm ~1500 unread messages
out-of-date. :/

I recently read about "mount -t btrfs -o user_subvol_rm_allowed" while
doing reading up on LXC handling of snapshots with the btrfs backend.
Is this mount option per-subvolume, or per volume?

Also, what mechanisms exist to restrict a user's ability to create an
arbitrarily large number of snapshots?  Is there a
user_subvol_create_deny|allowed?  From what I've read about the inverse
correlation between the number of subvols and performance, a potentially
hostile user could cause an IO denial of service or potentially even
trigger an ENOSPC.

From what I gather, the following will reproduce the hypothetical
issue related to my question:

# as root
btrfs sub create /some/dir/subvol
chown some-user /some/dir/subvol

# as some-user
cd /some/dir/subvol
cp -ar --reflink=always /some/big/files ./
COUNT=1
while [ 0 -lt 1 ]; do
  btrfs sub snap ./ ./snapshot-$COUNT
  COUNT=$((COUNT+1))
  sleep 2   # --maybe unnecessary
done

--

I hope there's something I've misunderstood or failed to read!

Please CC me so your reply will hit my main inbox :-)
Nicholas


signature.asc
Description: Digital signature


Re: [lustre-devel] [PATCH 04/24] fs: Provide infrastructure for dynamic BDIs in filesystems

2017-02-07 Thread Dilger, Andreas
On Feb 2, 2017, at 10:34, Jan Kara  wrote:
> 
> Provide helper functions for setting up dynamically allocated
> backing_dev_info structures for filesystems and cleaning them up on
> superblock destruction.
> 
> CC: linux-...@lists.infradead.org
> CC: linux-...@vger.kernel.org
> CC: Petr Vandrovec 
> CC: linux-ni...@vger.kernel.org
> CC: cluster-de...@redhat.com
> CC: osd-...@open-osd.org
> CC: codal...@coda.cs.cmu.edu
> CC: linux-...@lists.infradead.org
> CC: ecryp...@vger.kernel.org
> CC: linux-c...@vger.kernel.org
> CC: ceph-de...@vger.kernel.org
> CC: linux-btrfs@vger.kernel.org
> CC: v9fs-develo...@lists.sourceforge.net
> CC: lustre-de...@lists.lustre.org
> Signed-off-by: Jan Kara 
> ---
> fs/super.c   | 49 
> include/linux/backing-dev-defs.h |  2 +-
> include/linux/fs.h   |  6 +
> 3 files changed, 56 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index ea662b0e5e78..31dc4c6450ef 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -446,6 +446,11 @@ void generic_shutdown_super(struct super_block *sb)
>   hlist_del_init(&sb->s_instances);
>   spin_unlock(&sb_lock);
>   up_write(&sb->s_umount);
> + if (sb->s_iflags & SB_I_DYNBDI) {
> + bdi_put(sb->s_bdi);
> + sb->s_bdi = &noop_backing_dev_info;
> + sb->s_iflags &= ~SB_I_DYNBDI;
> + }
> }
> 
> EXPORT_SYMBOL(generic_shutdown_super);
> @@ -1249,6 +1254,50 @@ mount_fs(struct file_system_type *type, int flags, 
> const char *name, void *data)
> }
> 
> /*
> + * Setup private BDI for given superblock. I gets automatically cleaned up

(typo) s/I/It/

Looks fine otherwise.

> + * in generic_shutdown_super().
> + */
> +int super_setup_bdi_name(struct super_block *sb, char *fmt, ...)
> +{
> + struct backing_dev_info *bdi;
> + int err;
> + va_list args;
> +
> + bdi = bdi_alloc(GFP_KERNEL);
> + if (!bdi)
> + return -ENOMEM;
> +
> + bdi->name = sb->s_type->name;
> +
> + va_start(args, fmt);
> + err = bdi_register_va(bdi, NULL, fmt, args);
> + va_end(args);
> + if (err) {
> + bdi_put(bdi);
> + return err;
> + }
> + WARN_ON(sb->s_bdi != &noop_backing_dev_info);
> + sb->s_bdi = bdi;
> + sb->s_iflags |= SB_I_DYNBDI;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(super_setup_bdi_name);
> +
> +/*
> + * Setup private BDI for given superblock. I gets automatically cleaned up
> + * in generic_shutdown_super().
> + */
> +int super_setup_bdi(struct super_block *sb)
> +{
> + static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
> +
> + return super_setup_bdi_name(sb, "%.28s-%ld", sb->s_type->name,
> + atomic_long_inc_return(&bdi_seq));
> +}
> +EXPORT_SYMBOL(super_setup_bdi);
> +
> +/*
>  * This is an internal function, please use sb_end_{write,pagefault,intwrite}
>  * instead.
>  */
> diff --git a/include/linux/backing-dev-defs.h 
> b/include/linux/backing-dev-defs.h
> index 2ecafc8a2d06..70080b4217f4 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -143,7 +143,7 @@ struct backing_dev_info {
>   congested_fn *congested_fn; /* Function pointer if device is md/dm */
>   void *congested_data;   /* Pointer to aux data for congested func */
> 
> - char *name;
> + const char *name;
> 
>   struct kref refcnt; /* Reference counter for the structure */
>   unsigned int registered:1;  /* Is bdi registered? */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c930cbc19342..8ed8b6d1bc54 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1267,6 +1267,9 @@ struct mm_struct;
> /* sb->s_iflags to limit user namespace mounts */
> #define SB_I_USERNS_VISIBLE   0x0010 /* fstype already mounted */
> 
> +/* Temporary flag until all filesystems are converted to dynamic bdis */
> +#define SB_I_DYNBDI  0x0100
> +
> /* Possible states of 'frozen' field */
> enum {
>   SB_UNFROZEN = 0,/* FS is unfrozen */
> @@ -2103,6 +2106,9 @@ extern int vfs_ustat(dev_t, struct kstatfs *);
> extern int freeze_super(struct super_block *super);
> extern int thaw_super(struct super_block *super);
> extern bool our_mnt(struct vfsmount *mnt);
> +extern __printf(2, 3)
> +int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
> +extern int super_setup_bdi(struct super_block *sb);
> 
> extern int current_umask(void);
> 
> -- 
> 2.10.2
> 
> ___
> lustre-devel mailing list
> lustre-de...@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Troubleshooting crash due to running out of space issues

2017-02-07 Thread Peter Zaitsev
I'm running BTRFS on Ubuntu 16.04 - I was testing intensive database
IO which ends up with a pretty fragmented data file:

root@blinky:/var/lib/mysql/sbtest# filefrag sbtest1.ibd

sbtest1.ibd: 13415923 extents found

This is a 500G device which is some 60% full:

/dev/nvme0n1500107608 308009444 189718556  62% /mnt/data/mysql

root@blinky:/# btrfs fi show
Label: none  uuid: 2a396366-e3c9-4d14-b4cc-3d8992bd1c6b
Total devices 1 FS bytes used 293.24GiB
devid 1 size 476.94GiB used 476.94GiB path /dev/nvme0n1


This file (sbtest1.ibd) takes some 250GB - the majority of the space.

As I try to defrag this file with:

btrfs fi defrag sbtest1.ibd

I either get a "no space available" error and the filesystem goes
read only, or the filesystem completely breaks down and IO errors are
reported.

Note that during my experiments I have mounted this filesystem repeatedly
with and without the nodatacow and autodefrag options; I assume these
should not cause any filesystem damage, do they?

Here is the portion of the latest log:

Feb  7 19:10:42 blinky kernel: [40722.055010] [ cut here
]
Feb  7 19:10:42 blinky kernel: [40722.055060] WARNING: CPU: 12 PID:
17002 at /build/linux-W6HB68/linux-4.4.0/fs/btrfs/extent-tree.c:6552
__btrfs_free_extent.isr
a.70+0x2e6/0xd30 [btrfs]()
Feb  7 19:10:42 blinky kernel: [40722.055063] BTRFS: error (device
nvme0n1) in __btrfs_free_extent:6552: errno=-28 No space left
Feb  7 19:10:42 blinky kernel: [40722.055066] BTRFS info (device
nvme0n1): forced readonly
Feb  7 19:10:42 blinky kernel: [40722.055068] BTRFS: error (device
nvme0n1) in btrfs_run_delayed_refs:2927: errno=-28 No space left
Feb  7 19:10:42 blinky kernel: [40722.055086] BTRFS: Transaction
aborted (error -28)
Feb  7 19:10:42 blinky kernel: [40722.055087] Modules linked in:
snd_hda_codec_hdmi nls_iso8859_1 intel_rapl x86_pkg_temp_thermal
intel_powerclamp coretemp kvm
snd_hda_codec_realtek irqbypass snd_hda_codec_generic serio_raw
sb_edac snd_hda_intel edac_core snd_hda_codec snd_hda_core snd_hwdep
snd_pcm input_leds snd_time
r mei_me lpc_ich snd mei soundcore shpchp tpm_infineon mac_hid ib_iser
rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp
libiscsi_tcp libiscsi scsi_tra
nsport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
raid0 multipath linear hid_gen
eric usbhid hid nouveau crct10dif_pclmul mxm_wmi crc32_pclmul video
ghash_clmulni_intel i2c_algo_bit aesni_intel ttm aes_x86_64 lrw
drm_kms_helper gf128mul glue
_helper ablk_helper syscopyarea psmouse cryptd sysfillrect e1000e
sysimgblt fb_sys_fops ahci libahci alx ptp drm mdio pps_core nvme fjes
wmi
Feb  7 19:10:42 blinky kernel: [40722.055180] CPU: 12 PID: 17002 Comm:
btrfs-transacti Tainted: GW   4.4.0-62-generic #83-Ubuntu
Feb  7 19:10:42 blinky kernel: [40722.055181] Hardware name: Gigabyte
Technology Co., Ltd. Default string/X99-Ultra Gaming-CF, BIOS F5
08/29/2016
Feb  7 19:10:42 blinky kernel: [40722.055183]  0286
7c7f47a0 880f1b4c7b00 813f7c63
Feb  7 19:10:42 blinky kernel: [40722.055185]  880f1b4c7b48
c0392498 880f1b4c7b38 810812d2
Feb  7 19:10:42 blinky kernel: [40722.055188]  007021b82000
ffe4  880fe62a4000
Feb  7 19:10:42 blinky kernel: [40722.055190] Call Trace:
Feb  7 19:10:42 blinky kernel: [40722.055197]  []
dump_stack+0x63/0x90
Feb  7 19:10:42 blinky kernel: [40722.055202]  []
warn_slowpath_common+0x82/0xc0
Feb  7 19:10:42 blinky kernel: [40722.055205]  []
warn_slowpath_fmt+0x5c/0x80
Feb  7 19:10:42 blinky kernel: [40722.055219]  []
__btrfs_free_extent.isra.70+0x2e6/0xd30 [btrfs]
Feb  7 19:10:42 blinky kernel: [40722.055232]  []
__btrfs_run_delayed_refs+0x444/0x11f0 [btrfs]
Feb  7 19:10:42 blinky kernel: [40722.055236]  [] ?
lock_timer_base.isra.22+0x54/0x70
Feb  7 19:10:42 blinky kernel: [40722.055248]  []
btrfs_run_delayed_refs+0x7d/0x2a0 [btrfs]
Feb  7 19:10:42 blinky kernel: [40722.055251]  [] ?
del_timer_sync+0x48/0x50
Feb  7 19:10:42 blinky kernel: [40722.055266]  []
btrfs_commit_transaction+0xac/0xa90 [btrfs]
Feb  7 19:10:42 blinky kernel: [40722.055279]  []
transaction_kthread+0x229/0x240 [btrfs]
Feb  7 19:10:42 blinky kernel: [40722.055291]  [] ?
btrfs_cleanup_transaction+0x570/0x570 [btrfs]
Feb  7 19:10:42 blinky kernel: [40722.055294]  []
kthread+0xd8/0xf0
Feb  7 19:10:42 blinky kernel: [40722.055296]  [] ?
kthread_create_on_node+0x1e0/0x1e0
Feb  7 19:10:42 blinky kernel: [40722.055300]  []
ret_from_fork+0x3f/0x70
Feb  7 19:10:42 blinky kernel: [40722.055301]  [] ?
kthread_create_on_node+0x1e0/0x1e0
Feb  7 19:10:42 blinky kernel: [40722.055303] ---[ end trace
92a6418dcae8a352 ]---
Feb  7 19:10:42 blinky kernel: [40722.055306] BTRFS: error (device
nvme0n1) in __btrfs_free_extent:6552: errno=-28 No space left
Feb  7 19:10:42 blinky kernel: [40722.055314] BTRFS: error (device
nvme0n1) in btrfs_run_delayed_refs:2927: errno=-28 No space left
Feb  7 

Re: Very slow balance / btrfs-transaction

2017-02-07 Thread Qu Wenruo



At 02/07/2017 11:55 PM, Filipe Manana wrote:

On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo  wrote:



At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:



Hi Qu,

On 02/05/2017 07:45 PM, Qu Wenruo wrote:




At 02/04/2017 09:47 AM, Jorg Bornschein wrote:


February 4, 2017 1:07 AM, "Goldwyn Rodrigues"  wrote:








Quota support was indeed active -- and it warned me that the qgroup
data was inconsistent.

Disabling quotas had an immediate impact on balance throughput -- it's
*much* faster now!
From a quick glance at iostat I would guess it's at least a factor 100
faster.


Should quota support generally be disabled during balances? Or did I
somehow push my fs into a weird state where it triggered a slow-path?



Thanks!

   j



Would you please provide the kernel version?

v4.9 introduced a bad fix for qgroup balance, which doesn't completely
fix qgroup bytes leaking, but also hugely slows down the balance process:

commit 62b99540a1d91e46422f0e04de50fc723812c421
Author: Qu Wenruo 
Date:   Mon Aug 15 10:36:51 2016 +0800

btrfs: relocation: Fix leaking qgroups numbers on data extents

Sorry for that.

And in v4.10, a better method is applied to fix the byte leaking
problem, and should be a little faster than previous one.

commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
Author: Qu Wenruo 
Date:   Tue Oct 18 09:31:29 2016 +0800

btrfs: qgroup: Fix qgroup data leaking by using subtree tracing


However, using balance with qgroup is still slower than balance without
qgroup; the real fix needs us to rework the current backref iteration.
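
As a practical aside, the workaround Jorg describes above boils down to
disabling quotas for the duration of the balance and re-enabling them
afterwards - a minimal sketch (the mount point is a placeholder):

$ btrfs quota disable /mnt
$ btrfs balance start /mnt
$ btrfs quota enable /mnt
$ btrfs quota rescan -w /mnt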



This patch has made the btrfs balance performance worse. The balance
task has become more CPU intensive compared to earlier and takes longer
to complete, besides hogging resources. While correctness is important,
we need to figure out how this can be made more efficient.


The cause is already known.

It's find_parent_node() which takes most of the time to find all referencers
of an extent.

And it's also the cause for FIEMAP softlockup (fixed in recent release by
early quit).

The biggest problem is that the current find_parent_node() uses a list to
iterate, which is quite slow, especially as it's done in a loop.
In the real world find_parent_node() is about O(n^3).
We can either improve find_parent_node() by using rb_tree, or introduce some
cache for find_parent_node().


Even if anyone is able to reduce that function's complexity from
O(n^3) down to lets say O(n^2) or O(n log n) for example, the current
implementation of qgroups will always be a problem. The real problem
is that this more recent rework of qgroups does all this accounting
inside the critical section of a transaction - blocking any other
tasks that want to start a new transaction or attempt to join the
current transaction. Not to mention that on systems with small amounts
of memory (2Gb or 4Gb from what I've seen from user reports) we also
OOM due this allocation of struct btrfs_qgroup_extent_record per
delayed data reference head, that are used for that accounting phase
in the critical section of a transaction commit.

Let's face it and be realistic, even if someone manages to make
find_parent_node() much much better, like O(n) for example, it will
always be a problem due to the reasons mentioned before. Many extents
touched per transaction and many subvolumes/snapshots, will always
expose that root problem - doing the accounting in the transaction
commit critical section.


You must accept the fact that we must call find_parent_node() at least 
twice to get correct owner modification for each touched extent.

Or qgroup number will never be correct.

One for old_roots by searching commit root, and one for new_roots by 
searching current root.


You can call find_parent_node() as many times as you like, but that's 
just wasting your CPU time.


Only the final find_parent_node() will determine new_roots for that 
extent, and there is no better timing than commit_transaction().


Or you can waste more time calling find_parent_node() every time you
touch an extent, saving one find_parent_node() in commit_transaction()
at the cost of more find_parent_node() calls elsewhere.

Is that what you want?

I can move the find_parent_node() for old_roots out of commit_transaction().
But that will only reduce 50% of the time spent on commit_transaction().

Compared to the O(n^3) find_parent_node(), that's not even the determining factor.

Thanks,
Qu






IIRC the SUSE guys (maybe Jeff?) are working on it with the first method, but I
haven't heard anything about it recently.

Thanks,
Qu



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html








Re: dup vs raid1 in single disk

2017-02-07 Thread Dan Mons
On 8 February 2017 at 08:28, Kai Krakow  wrote:
> I still thinks it's a myth... The overhead of managing inline
> deduplication is just way too high to implement it without jumping
> through expensive hoops. Most workloads have almost zero deduplication
> potential. And even when, their temporal occurrence is spaced so far
> that an inline deduplicator won't catch it.
>
> If it would be all so easy, btrfs would already have it working in
> mainline. I don't even remember that those patches is still being
> worked on.
>
> With this in mind, I think dup metadata is still a good think to have
> even on SSD and I would always force to enable it.
>
> Potential for deduplication is only when using snapshots (which already
> are deduplicated when taken) or when handling user data on a file
> server in a multi-user environment. Users tend to copy their files all
> over the place - multiple directories of multiple gigabytes. Potential
> is also where you're working with client machine backups or vm images.
> I regularly see deduplication efficiency of 30-60% in such scenarios -
> file servers mostly which I'm handling. But due to temporally far
> spaced occurrence of duplicate blocks, only offline or nearline
> deduplication works here.

I'm a sysadmin by trade, managing many PB of storage for a media
company.  Our primary storage are Oracle ZFS appliances, and all of
our secondary/nearline storage is Linux+BtrFS.

ZFS's inline deduplication is awful.  It consumes enormous amounts of
RAM that is orders of magnitude more valuable as ARC/Cache, and
becomes immediately useless whenever a storage node is rebooted
(necessary to apply mandatory security patches) and the in-memory
tables are lost (meaning cold data is rarely re-examined, and the
inline dedup becomes less efficient).

Conversely, I use "duperemove" as a one-shot/offline deduplication
tool on all of our BtrFS storage.  It can be set as a cron job to be
run outside of business hours, and uses an SQLite database to store
the necessary dedup hash information on disk, rather than in RAM.
From the point of view of someone who manages large amounts of long
term centralised storage, this is a far superior way to deal with
deduplication, as it offers more flexibility and far better
space-saving ratios at a lower memory cost.
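
A minimal sketch of such a cron-driven run (paths are placeholders; the
hashfile keeps the hash database on disk between runs instead of in RAM):

$ duperemove -dr --hashfile=/var/cache/duperemove.hash /srv/btrfs-data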

We trialled ZFS dedup for a few months, and decided to turn it off, as
there was far less benefit to ZFS using all that RAM for dedup than
there was for it to be cache.  I've been requesting Oracle offer a
similar offline dedup tool for their ZFS appliance for a very long
time, and if BtrFS ever did offer inline dedup, I wouldn't bother
using it for all of the reasons above.

-Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: dup vs raid1 in single disk

2017-02-07 Thread Hans van Kranenburg
On 02/07/2017 11:28 PM, Kai Krakow wrote:
> Am Thu, 19 Jan 2017 15:02:14 -0500
> schrieb "Austin S. Hemmelgarn" :
> 
>> On 2017-01-19 13:23, Roman Mamedov wrote:
>>> On Thu, 19 Jan 2017 17:39:37 +0100
>>> [...]
>>> And the DUP mode is still useful on SSDs, for cases when one copy
>>> of the DUP gets corrupted in-flight due to a bad controller or RAM
>>> or cable, you could then restore that block from its good-CRC DUP
>>> copy.  
>> The only window of time during which bad RAM could result in only one 
>> copy of a block being bad is after the first copy is written but
>> before the second is, which is usually an insanely small amount of
>> time.  As far as the cabling, the window for errors resulting in a
>> single bad copy of a block is pretty much the same as for RAM, and if
>> they're persistently bad, you're more likely to lose data for other
>> reasons.
> 
> It depends on the design of the software. You're true if this memory
> block is simply a single block throughout its lifetime in RAM before
> written to storage. But if it is already handled as duplicate block in
> memory, odds are different. I hope btrfs is doing this right... ;-)

In memory, it's just one copy, happily sitting around, getting corrupted
by cosmic rays and other stuff done to it by aliens, after which a valid
checksum is calculated for the corrupt data, after which it goes on its
way to disk, twice. Yay.

>> That said, I do still feel that DUP mode has value on SSD's.  The 
>> primary arguments against it are:
>> 1. It wears out the SSD faster.
> 
> I don't think this is a huge factor, even more when looking at TBW
> capabilities of modern SSDs. And prices are low enough to better swap
> early than waiting for the disaster hitting you. Instead, you can still
> use the old SSD for archival storage (but this has drawbacks, don't
> leave them without power for months or years!) or as a shock resistent
> USB mobile drive on the go.
> 
>> 2. The blocks are likely to end up in the same erase block, and 
>> therefore there will be no benefit.
> 
> Oh, this is probably a point to really think about... Would ssd_spread
> help here?

I think there was another one, SSD firmware deduplicating writes,
converting the DUP into single again, giving a false idea of it being DUP.

This is one that can be solved by e.g. using disk encryption, which
causes same writes to show up as different data on disk.

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: better document btrfs receive security

2017-02-07 Thread Kai Krakow
Am Fri,  3 Feb 2017 08:48:58 -0500
schrieb "Austin S. Hemmelgarn" :

> +user who is running receive, and then move then into the final
> destination 

Typo? s/then/them/?

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS for OLTP Databases

2017-02-07 Thread Hans van Kranenburg
On 02/07/2017 10:35 PM, Kai Krakow wrote:
> Am Tue, 7 Feb 2017 22:25:29 +0100
> schrieb Lionel Bouton :
> 
>> Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit :
>>> On 2017-02-07 15:36, Kai Krakow wrote:  
 Am Tue, 7 Feb 2017 09:13:25 -0500
 schrieb Peter Zaitsev :
  
>>  [...]  

 Out of curiosity, I see one problem here:

 If you're doing snapshots of the live database, each snapshot
 leaves the database files like killing the database in-flight.
 Like shutting the system down in the middle of writing data.

 This is because I think there's no API for user space to subscribe
 to events like a snapshot - unlike e.g. the VSS API (volume
 snapshot service) in Windows. You should put the database into
 frozen state to prepare it for a hotcopy before creating the
 snapshot, then ensure all data is flushed before continuing.  
>>> Correct.  

 I think I've read that btrfs snapshots do not guarantee single
 point in time snapshots - the snapshot may be smeared across a
 longer period of time while the kernel is still writing data. So
 parts of your writes may still end up in the snapshot after
 issuing the snapshot command, instead of in the working copy as
 expected.  
>>> Also correct AFAICT, and this needs to be better documented (for
>>> most people, the term snapshot implies atomicity of the
>>> operation).  
>>
>> Atomicity can be a relative term. If the snapshot atomicity is
>> relative to barriers but not relative to individual writes between
>> barriers then AFAICT it's fine because the filesystem doesn't make
>> any promise it won't keep even in the context of its snapshots.
>> Consider a power loss : the filesystems atomicity guarantees can't go
>> beyond what the hardware guarantees which means not all current in fly
>> write will reach the disk and partial writes can happen. Modern
>> filesystems will remain consistent though and if an application using
>> them makes uses of f*sync it can provide its own guarantees too. The
>> same should apply to snapshots : all the writes in fly can complete or
>> not on disk before the snapshot what matters is that both the snapshot
>> and these writes will be completed after the next barrier (and any
>> robust application will ignore all the in fly writes it finds in the
>> snapshot if they were part of a batch that should be atomically
>> commited).
>>
>> This is why AFAIK PostgreSQL or MySQL with their default ACID
>> compliant configuration will recover from a BTRFS snapshot in the
>> same way they recover from a power loss.
> 
> This is what I meant in my other reply. But this is also why it should
> be documented. Wrongly implying that snapshots are single point in time
> snapshots is a wrong assumption with possibly horrible side effects one
> wouldn't expect.

It depends on what the definition of time is. (whoa!!) A snapshot is
taken of a single point in the lifetime of a filesystem tree (a
generation, the point where a transaction commits)...?

> Taking a snapshot is like a power loss - even tho there is no power
> loss. So the database has to be properly configured. It is simply short
> sighted if you don't think about this fact. The documentation should
> really point that fact out.

I'd almost say that it would be short sighted to assume a btrfs snapshot
would *not* behave like a power loss. At least, to me (thinking as a
sysadmin) it feels really weird to think of it in any other way than that.

Oh wait, is that what you mean, or not? What is the thing that the
documentation should point out? I'm not trying to troll; the piled-up
double negations make this discussion a bit hard to read.

Moo

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: dup vs raid1 in single disk

2017-02-07 Thread Kai Krakow
Am Thu, 19 Jan 2017 15:02:14 -0500
schrieb "Austin S. Hemmelgarn" :

> On 2017-01-19 13:23, Roman Mamedov wrote:
> > On Thu, 19 Jan 2017 17:39:37 +0100
> > "Alejandro R. Mosteo"  wrote:
> >  
> >> I was wondering, from a point of view of data safety, if there is
> >> any difference between using dup or making a raid1 from two
> >> partitions in the same disk. This is thinking on having some
> >> protection against the typical aging HDD that starts to have bad
> >> sectors.  
> >
> > RAID1 will write slower compared to DUP, as any optimization to
> > make RAID1 devices work in parallel will cause a total performance
> > disaster for you as you will start trying to write to both
> > partitions at the same time, turning all linear writes into random
> > ones, which are about two orders of magnitude slower than linear on
> > spinning hard drives. DUP shouldn't have this issue, but still it
> > will be twice slower than single, since you are writing everything
> > twice.  
> As of right now, there will actually be near zero impact on write 
> performance (or at least, it's way less than the theoretical 50%) 
> because there really isn't any optimization to speak of in the 
> multi-device code.  That will hopefully change over time, but it's
> not likely to do so any time in the future since nobody appears to be 
> working on multi-device write performance.

I think that's only true if you don't account for the seek overhead. In
single device RAID1 mode you will always seek half of the device while
writing data, and even when reading between odd and even PIDs. In
contrast, DUP mode doesn't guarantee your seeks to be shorter, but from
a statistical point of view, on average they should be shorter. So it
should yield better performance (tho I wouldn't expect it to be
observable, depending on your workload).

So, on devices having no seek overhead (aka SSD), it is probably true
(minus bus bandwidth considerations). For HDD I'd prefer DUP.

From a data safety point of view: it's more likely that adjacent
and nearby sectors are bad. So DUP imposes a higher risk of both copies
being written to bad sectors - which means data loss or even
file system loss (if metadata hits this problem).

To be realistic: I wouldn't trade space usage for duplicate data on an
already failing disk, no matter if it's DUP or RAID1. HDD disk space is
cheap, and such a scenario is just a waste of performance AND
space - no matter what. I don't understand the purpose of this. It just
results in false safety.

Better get two separate devices half the size. There's a better chance
of getting a better cost/space ratio anyways, plus better performance
and safety.

> There's also the fact that you're writing more metadata than data
> most of the time unless you're dealing with really big files, and
> metadata is already DUP mode (unless you are using an SSD), so the
> performance hit isn't 50%, it's actually a bit more than half the
> ratio of data writes to metadata writes.
> >  
> >> On a related note, I see this caveat about dup in the manpage:
> >>
> >> "For example, a SSD drive can remap the blocks internally to a
> >> single copy thus deduplicating them. This negates the purpose of
> >> increased redunancy (sic) and just wastes space"  
> >
> > That ability is vastly overestimated in the man page. There is no
> > miracle content-addressable storage system working at 500 MB/sec
> > speeds all within a little cheap controller on SSDs. Likely most of
> > what it can do, is just compress simple stuff, such as runs of
> > zeroes or other repeating byte sequences.  
> Most of those that do in-line compression don't implement it in 
> firmware, they implement it in hardware, and even DEFLATE can get 500 
> MB/second speeds if properly implemented in hardware.  The firmware
> may control how the hardware works, but it's usually hardware doing
> heavy lifting in that case, and getting a good ASIC made that can hit
> the required performance point for a reasonable compression algorithm
> like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI
> work.

I still think it's a myth... The overhead of managing inline
deduplication is just way too high to implement it without jumping
through expensive hoops. Most workloads have almost zero deduplication
potential. And even when they do, the duplicates are spaced so far apart
in time that an inline deduplicator won't catch them.

If it were all so easy, btrfs would already have it working in
mainline. I don't even remember whether those patches are still being
worked on.

With this in mind, I think dup metadata is still a good thing to have
even on SSD, and I would always force-enable it.
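
A minimal sketch of forcing it (the device and mount point are placeholders;
one command for a new filesystem, one for converting an existing one):

$ mkfs.btrfs -m dup /dev/sdX
$ btrfs balance start -mconvert=dup /mnt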

Potential for deduplication is only when using snapshots (which already
are deduplicated when taken) or when handling user data on a file
server in a multi-user environment. Users tend to copy their files all
over the place - multiple directories of multiple gigabytes. 

Re: BTRFS for OLTP Databases

2017-02-07 Thread Hans van Kranenburg
On 02/07/2017 07:59 PM, Peter Zaitsev wrote:
> 
> So far the most frustating for me was periodic stalls for many seconds
>  (running sysbench workload).  What was the most puzzling  I get this
> even if I run workload at the  50% or less of the full load  -  Ie
> database can handle 1000 transactions/sec and I only inject 500/sec
> and I still have those stalls.
> 
> This is where it looks to me like some work is being delayed and when
> it requires stall for a few seconds to catch up.I wonder  if there
> are some configuration options available to play with.

What happens during these stalls? Do you mean a 'stall' like it seems
nothing is happening at all, or a 'stall' during which something is so
busy that something else cannot continue?

Is there some kernel thread doing a lot of CPU? What does
/proc/<pid>/stack show?

Is it huge write spikes with not many writes in between, or do you
generate enough action to be writing to disk all the time?

If the stalls show the behaviour of huge disk-write spikes, during which
applications seem to be blocked from continuing to write more, and if
during that time you see btrfs-transaction active in the kernel, and,
if your test is doing a lot of writes all over the place (not only
simply appending table files sequentially, but changing a lot and
touching a lot of metadata) and you're pushing it, it might be space
cache related.

I think the /proc/<pid>/stack of the btrfs-transaction thread will show you
something related to the free space cache in this case.

In this case, it might be interesting to test the free space tree
(instead of the default free space cache):

http://events.linuxfoundation.org/sites/events/files/slides/vault2016_0.pdf
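
A minimal sketch of switching over (device and mount point are placeholders;
the free space tree needs kernel 4.5 or newer, and once created it keeps
being used on subsequent mounts):

$ mount -o clear_cache,space_cache=v2 /dev/nvme0n1 /mnt/data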

Using free space tree helped me a lot on write-heavy filesystems (like a
backup server with concurrent rsync data streaming in, also doing
snapshotting) from having incoming traffic drop to the ground every time
there was a transaction commit.

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS for OLTP Databases

2017-02-07 Thread Kai Krakow
Am Tue, 7 Feb 2017 22:25:29 +0100
schrieb Lionel Bouton :

> Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit :
> > On 2017-02-07 15:36, Kai Krakow wrote:  
> >> Am Tue, 7 Feb 2017 09:13:25 -0500
> >> schrieb Peter Zaitsev :
> >>  
>  [...]  
> >>
> >> Out of curiosity, I see one problem here:
> >>
> >> If you're doing snapshots of the live database, each snapshot
> >> leaves the database files like killing the database in-flight.
> >> Like shutting the system down in the middle of writing data.
> >>
> >> This is because I think there's no API for user space to subscribe
> >> to events like a snapshot - unlike e.g. the VSS API (volume
> >> snapshot service) in Windows. You should put the database into
> >> frozen state to prepare it for a hotcopy before creating the
> >> snapshot, then ensure all data is flushed before continuing.  
> > Correct.  
> >>
> >> I think I've read that btrfs snapshots do not guarantee single
> >> point in time snapshots - the snapshot may be smeared across a
> >> longer period of time while the kernel is still writing data. So
> >> parts of your writes may still end up in the snapshot after
> >> issuing the snapshot command, instead of in the working copy as
> >> expected.  
> > Also correct AFAICT, and this needs to be better documented (for
> > most people, the term snapshot implies atomicity of the
> > operation).  
> 
> Atomicity can be a relative term. If the snapshot atomicity is
> relative to barriers but not relative to individual writes between
> barriers then AFAICT it's fine because the filesystem doesn't make
> any promise it won't keep even in the context of its snapshots.
> Consider a power loss : the filesystems atomicity guarantees can't go
> beyond what the hardware guarantees which means not all current in fly
> write will reach the disk and partial writes can happen. Modern
> filesystems will remain consistent though and if an application using
> them makes uses of f*sync it can provide its own guarantees too. The
> same should apply to snapshots : all the writes in fly can complete or
> not on disk before the snapshot what matters is that both the snapshot
> and these writes will be completed after the next barrier (and any
> robust application will ignore all the in fly writes it finds in the
> snapshot if they were part of a batch that should be atomically
> commited).
> 
> This is why AFAIK PostgreSQL or MySQL with their default ACID
> compliant configuration will recover from a BTRFS snapshot in the
> same way they recover from a power loss.

This is what I meant in my other reply. But this is also why it should
be documented. Wrongly assuming that snapshots are single-point-in-time
snapshots can have possibly horrible side effects one wouldn't expect.

Taking a snapshot is like a power loss - even tho there is no power
loss. So the database has to be properly configured. It is simply short
sighted if you don't think about this fact. The documentation should
really point that fact out.


-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS for OLTP Databases

2017-02-07 Thread Lionel Bouton
Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit :
> On 2017-02-07 15:36, Kai Krakow wrote:
>> Am Tue, 7 Feb 2017 09:13:25 -0500
>> schrieb Peter Zaitsev :
>>
>>> Hi Hugo,
>>>
>>> For the use case I'm looking for I'm interested in having snapshot(s)
>>> open at all time.  Imagine  for example snapshot being created every
>>> hour and several of these snapshots  kept at all time providing quick
>>> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
>>> think you also describe)  nodatacow  does not provide any advantage.
>>
>> Out of curiosity, I see one problem here:
>>
>> If you're doing snapshots of the live database, each snapshot leaves
>> the database files like killing the database in-flight. Like shutting
>> the system down in the middle of writing data.
>>
>> This is because I think there's no API for user space to subscribe to
>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>> service) in Windows. You should put the database into frozen state to
>> prepare it for a hotcopy before creating the snapshot, then ensure all
>> data is flushed before continuing.
> Correct.
>>
>> I think I've read that btrfs snapshots do not guarantee single point in
>> time snapshots - the snapshot may be smeared across a longer period of
>> time while the kernel is still writing data. So parts of your writes
>> may still end up in the snapshot after issuing the snapshot command,
>> instead of in the working copy as expected.
> Also correct AFAICT, and this needs to be better documented (for most
> people, the term snapshot implies atomicity of the operation).

Atomicity can be a relative term. If the snapshot atomicity is relative
to barriers but not relative to individual writes between barriers, then
AFAICT it's fine, because the filesystem doesn't make any promise it
won't keep even in the context of its snapshots.
Consider a power loss: the filesystem's atomicity guarantees can't go
beyond what the hardware guarantees, which means not all in-flight
writes will reach the disk and partial writes can happen. Modern
filesystems will remain consistent though, and if an application using
them makes use of f*sync it can provide its own guarantees too. The
same should apply to snapshots: all the in-flight writes can complete or
not on disk before the snapshot; what matters is that both the snapshot
and these writes will be completed after the next barrier (and any
robust application will ignore all the in-flight writes it finds in the
snapshot if they were part of a batch that should be atomically committed).

This is why AFAIK PostgreSQL or MySQL with their default ACID compliant
configuration will recover from a BTRFS snapshot in the same way they
recover from a power loss.

Lionel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS for OLTP Databases

2017-02-07 Thread Kai Krakow
Am Tue, 7 Feb 2017 10:43:11 -0500
schrieb "Austin S. Hemmelgarn" :

> > I mean that:
> > You have a 128MB extent, you rewrite random 4k sectors, and btrfs will
> > not split the 128MB extent and not free up data (I don't know the
> > internal algo, so I can't predict when this will happen); after
> > some time, btrfs will rebuild extents and split the 128MB extent into
> > several smaller ones. But when you use compression, the allocator
> > rebuilds extents much earlier (I think it's because btrfs also
> > operates on something like 128KB extents, even if it's a continuous
> > 128MB chunk of data).
> The allocator has absolutely nothing to do with this, it's a function
> of the COW operation.  Unless you're using nodatacow, that 128MB
> extent will get split the moment the data hits the storage device
> (either on the next commit cycle (at most 30 seconds with the default
> commit cycle), or when fdatasync is called, whichever is sooner).  In
> the case of compression, it's still one extent (although on disk it
> will be less than 128MB) and will be split at _exactly_ the same time
> under _exactly_ the same circumstances as an uncompressed extent.
> IOW, it has absolutely nothing to do with the extent handling either.

I don't think that btrfs splits extents which are part of the snapshot.
The extent in a snapshot will stay intact when that extent is written to
in another snapshot. Of course, in the snapshot that was just written to, the
extent will be represented as a split mapping to the original extent's data
blocks plus the new data in the middle (thus resulting in three extents).
This is also why small random writes without autodefrag result in a vast
number of small extents, bringing fs performance to a crawl.

Do that multiple times on multiple snapshots, delete some of the
original snapshots, and you're left with slack space: data blocks that
are inaccessible yet won't be reclaimed as free space (because they
are still part of the original extent), and that can only be
reclaimed by a defrag operation - which of course unshares data.

Thus, if any of the above mentioned small extents is still shared with
an extent that was originally much bigger, it will still occupy its
original space on the filesystem - even when its associated
snapshot/subvolume no longer exists. Only when the last remaining
tiny block of such an extent gets rewritten and the reference count
drops to zero is the extent given up and freed.

To work around this, you can currently only unshare and recombine by
doing defrag and dedupe on all snapshots. This will reclaim space
sitting in parts of the original extents no longer referenced by a
snapshot visible from the VFS layer.

This is for performance reasons because btrfs is extent based.

As far as I know, ZFS works differently here. It uses block-based
storage for the snapshot feature and can easily throw away unused
blocks; only a second layer on top maps this back into extents. The
underlying infrastructure, however, is block-based storage, which also
enables the volume pool to create block devices on the fly out of ZFS
storage space.

PS: All above given the fact I understood it right. ;-)

-- 
Regards,
Kai

Replies to list-only preferred.



Re: [PATCH] Btrfs: add another missing end_page_writeback on submit_extent_page failure

2017-02-07 Thread Liu Bo
On Tue, Feb 07, 2017 at 08:09:53PM +0900, takafumi-sslab wrote:
> 
> On 2017/02/07 1:34, Liu Bo wrote:
> 
> > 
> > One thing to add, we still need to check whether page has writeback bit 
> > before
> > end_page_writeback.
> 
> Ok, I add PageWriteback check before end_page_writeback.
> 
> > > > > > > > > > > > Looks like commit 55e3bd2e0c2e1 also has the same 
> > > > > > > > > > > > problem although I
> > > > > > > > > > > > gave it my reviewed-by.
> 
> I also add PageWriteback check in write_one_eb.
> 
> Finally, the diff becomes like below.
> Is it Ok ?
> 
> ---
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 4ac383a..aa1908a 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3445,8 +3445,11 @@ static noinline_for_stack int
> __extent_writepage_io(struct inode *inode,
> bdev, &epd->bio, max_nr,
>end_bio_extent_writepage,
>0, 0, 0, false);
> - if (ret)
> + if (ret) {
>   SetPageError(page);
> + if (PageWriteback(page))
> + end_page_writeback(page);
> + }
> 
>   cur = cur + iosize;
>   pg_offset += iosize;
> @@ -3767,7 +3770,8 @@ static noinline_for_stack int write_one_eb(struct
> extent_buffer *eb,
>   epd->bio_flags = bio_flags;
>   if (ret) {
>   set_btree_ioerr(p);
> - end_page_writeback(p);
> + if (PageWriteback(p))
> + end_page_writeback(p);
>   if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
>   end_extent_buffer_writeback(eb);
>   ret = -EIO;
> 
> ---
> 

Looks good. Could you please add a comment about the if statement in your
commit log, so that others know why we put it there?

Since you've got a reproducer, baking it into an fstests case is also
welcome.

Thanks,

-liubo

> 
> Sincerely,
> 
> -takafumi
> 
> 
> > 
> > Thanks,
> > 
> > -liubo
> > 
> > > 
> > > Reviewed-by: Liu Bo 
> > > 
> > > Thanks,
> > > 
> > > -liubo
> > > > 
> > > > Sincerely,
> > > > 
> > > > -takafumi
> > > > > 
> > > > > So I don't think the patch is necessary for now.
> > > > > 
> > > > > But as I said, the fact (nr == 0 or 1) would be changed if the
> > > > > subpagesize blocksize is supported.
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > -liubo
> > > > > 
> > > > > > Sincerely,
> > > > > > 
> > > > > > -takafumi
> > > > > > > Thanks,
> > > > > > > 
> > > > > > > -liubo
> > > > > > > > Sincerely,
> > > > > > > > 
> > > > > > > > On 2017/01/31 5:09, Liu Bo wrote:
> > > > > > > > > On Fri, Jan 13, 2017 at 03:12:31PM +0900, takafumi-sslab 
> > > > > > > > > wrote:
> > > > > > > > > > Thanks for your replying.
> > > > > > > > > > 
> > > > > > > > > > I understand this bug is more complicated than I expected.
> > > > > > > > > > I classify error cases under submit_extent_page() below
> > > > > > > > > > 
> > > > > > > > > > A: ENOMEM error at btrfs_bio_alloc() in submit_extent_page()
> > > > > > > > > > I first assumed this case and sent the mail.
> > > > > > > > > > When bio_ret is NULL, submit_extent_page() calls 
> > > > > > > > > > btrfs_bio_alloc().
> > > > > > > > > > Then, btrfs_bio_alloc() may fail and submit_extent_page() 
> > > > > > > > > > returns -ENOMEM.
> > > > > > > > > > In this case, bio_endio() is not called and the page's 
> > > > > > > > > > writeback bit
> > > > > > > > > > remains.
> > > > > > > > > > So, there is a need to call end_page_writeback() in the 
> > > > > > > > > > error handling.
> > > > > > > > > > 
> > > > > > > > > > B: errors under submit_one_bio() of submit_extent_page()
> > > > > > > > > > Errors that occur under submit_one_bio() handles at 
> > > > > > > > > > bio_endio(), and
> > > > > > > > > > bio_endio() would call end_page_writeback().
> > > > > > > > > > 
> > > > > > > > > > Therefore, as you mentioned in the last mail, simply adding
> > > > > > > > > > end_page_writeback() like my last email and commit 
> > > > > > > > > > 55e3bd2e0c2e1 can
> > > > > > > > > > conflict in the case of B.
> > > > > > > > > > To avoid such conflict, one easy solution is adding 
> > > > > > > > > > PageWriteback() check
> > > > > > > > > > too.
> > > > > > > > > > 
> > > > > > > > > > How do you think of this solution?
> > > > > > > > > (sorry for the late reply.)
> > > > > > > > > 
> > > > > > > > > I think its caller, "__extent_writepage", has covered the 
> > > > > > > > > above case
> > > > > > > > > by setting page writeback again.
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > 
> > > > > > > > > -liubo
> > > > > > > > > > Sincerely,
> > > > > > > > > > 
> > > > > > > > > > On 2016/12/22 15:20, Liu Bo wrote:
> > > > > > > > > > > On Fri, Dec 16, 2016 at 03:41:50PM +0900, Takafumi Kubota 
> > 

Re: BTRFS for OLTP Databases

2017-02-07 Thread Kai Krakow
On Tue, 7 Feb 2017 15:27:34 -0500,
"Austin S. Hemmelgarn" wrote:

> >> I'm not sure about this one.  I would assume based on the fact that
> >> many other things don't work with nodatacow and that regular defrag
> >> doesn't work on files which are currently mapped as executable code
> >> that it does not, but I could be completely wrong about this too.  
> >
> > Technically, there's nothing that prevents autodefrag to work for
> > nodatacow files. The question is: is it really necessary? Standard
> > file systems also have no autodefrag, it's not an issue there
> > because they are essentially nodatacow. Simply defrag the database
> > file once and you're done. Transactional MySQL uses huge data
> > files, probably preallocated. It should simply work with
> > nodatacow.  
> The thing is, I don't have enough knowledge of how defrag is
> implemented in BTRFS to say for certain that it doesn't use COW
> semantics somewhere (and I would actually expect it to do so, since
> that in theory makes many things _much_ easier to handle), and if it
> uses COW somewhere, then it by definition doesn't work on NOCOW files.

A dev would be needed to confirm this. But from a non-dev point of view, the
defrag operation itself is CoW: blocks are rewritten to another
location in contiguous order. Only metadata CoW should be needed for
this operation.

It should be nothing more than writing to a nodatacow snapshot... just
that the snapshot is more or less implicit and temporary.

Hmm? *curious*

-- 
Regards,
Kai

Replies to list-only preferred.



Re: BTRFS for OLTP Databases

2017-02-07 Thread Austin S. Hemmelgarn

On 2017-02-07 15:36, Kai Krakow wrote:

On Tue, 7 Feb 2017 09:13:25 -0500,
Peter Zaitsev wrote:


Hi Hugo,

For the use case I'm looking for I'm interested in having snapshot(s)
open at all time.  Imagine  for example snapshot being created every
hour and several of these snapshots  kept at all time providing quick
recovery points to the state of 1,2,3 hours ago.  In  such case (as I
think you also describe)  nodatacow  does not provide any advantage.


Out of curiosity, I see one problem here:

If you're doing snapshots of the live database, each snapshot leaves
the database files like killing the database in-flight. Like shutting
the system down in the middle of writing data.

This is because I think there's no API for user space to subscribe to
events like a snapshot - unlike e.g. the VSS API (volume snapshot
service) in Windows. You should put the database into frozen state to
prepare it for a hotcopy before creating the snapshot, then ensure all
data is flushed before continuing.

Correct.


I think I've read that btrfs snapshots do not guarantee single point in
time snapshots - the snapshot may be smeared across a longer period of
time while the kernel is still writing data. So parts of your writes
may still end up in the snapshot after issuing the snapshot command,
instead of in the working copy as expected.
Also correct AFAICT, and this needs to be better documented (for most 
people, the term snapshot implies atomicity of the operation).


How is this going to be addressed? Is there some snapshot aware API to
let user space subscribe to such events and do proper preparation? Is
this planned? LVM could be a user of such an API, too. I think this
could have nice enterprise-grade value for Linux.
Ideally, such an API should be in the VFS layer, not just BTRFS.
Reflinking exists in other filesystems already; it's only a matter of
time before they decide to do snapshotting too.


XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
still, this also needs to be integrated with MySQL to work properly. I
once (years ago) researched this but gave up on my plans when I
planned database backups for our web server infrastructure. We moved to
creating SQL dumps instead, although there are binlogs which can be used
to recover to a clean and stable transactional state after taking
snapshots. But I simply didn't want to fiddle around with properly
cleaning up binlogs, which accumulate a horrible amount of space over
time. The cleanup process requires creating a cold copy or dump of the
complete database from time to time; only then is it safe to remove all
binlogs up to that point in time.
Sadly, freezefs (the generic interface based off of xfs_freeze) only 
works for block device snapshots.  Filesystem level snapshots need the 
application software to sync all its data and then stop writing until 
the snapshot is complete.


As of right now, the sanest way I can come up with for a database server 
is to find a way to do a point-in-time SQL dump of the database (this 
also has the advantage that it works as a backup, and decouples you from 
the backing storage format).
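
(A sketch of what that could look like - assuming the stock client tools; with
InnoDB, --single-transaction gives a consistent point-in-time dump without
blocking writers:)

# MySQL/InnoDB point-in-time logical dump
$ mysqldump --single-transaction --all-databases | gzip > /backup/mysql-$(date +%F).sql.gz

# PostgreSQL equivalent
$ pg_dumpall | gzip > /backup/pgsql-$(date +%F).sql.gz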




Re: BTRFS for OLTP Databases

2017-02-07 Thread Lionel Bouton
On 07/02/2017 at 21:36, Kai Krakow wrote:
> [...]
> I think I've read that btrfs snapshots do not guarantee single point in
> time snapshots - the snapshot may be smeared across a longer period of
> time while the kernel is still writing data. So parts of your writes
> may still end up in the snapshot after issuing the snapshot command,
> instead of in the working copy as expected.


I don't think so, for three reasons:
- it's so far away from an admin's expectations that someone would have
documented this in "man btrfs-subvolume",
- the CoW nature of Btrfs makes this trivial: it only has to keep the old
versions of the data and the corresponding tree for it to work, instead of
unlinking them,
- the backup server I referred to has restarted a PostgreSQL system from
snapshots about one thousand times now without a single problem, while
being almost continuously updated by streaming replication.

Lionel


Re: BTRFS for OLTP Databases

2017-02-07 Thread Peter Zaitsev
Austin,

I recognize there are other components too.  In this case I'm actually
comparing BTRFS to XFS and EXT4, so I'm 100% sure it is file system
related.  Also, I'm using O_DIRECT asynchronous IO with MySQL, which
means there is no significant amount of dirty blocks at the file system
level.

I'll see if it helps though

Also I assumed this is something well known as it is documented in Gotchas here:

https://btrfs.wiki.kernel.org/index.php/Gotchas

(Fragmentation section)




>
> It's worth keeping in mind that there is more to the storage stack than just
> the filesystem, and BTRFS tends to be more sensitive to the behavior of
> other components in the stack than most other filesystems are.  The stalls
> you're describing sound more like a symptom of the brain-dead writeback
> buffering defaults used by the VFS layer than they do an issue with BTRFS
> (although BTRFS tends to be a  bit more heavily impacted by this than most
> other filesystems).  Try fiddling with the /proc/sys/vm/dirty_* sysctls
> (there is some pretty good documentation in Documentation/sysctl/vm.txt in
> the kernel source) and see if that helps.  The default values it uses are at
> most 20% of RAM, which is an insane amount of data to buffer before starting
> writeback when you're talking about systems with 16GB of RAM.
>


-- 
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev


Re: BTRFS for OLTP Databases

2017-02-07 Thread Kai Krakow
On Tue, 7 Feb 2017 09:13:25 -0500,
Peter Zaitsev wrote:

> Hi Hugo,
> 
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time.  Imagine  for example snapshot being created every
> hour and several of these snapshots  kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
> think you also describe)  nodatacow  does not provide any advantage.

Out of curiosity, I see one problem here:

If you're doing snapshots of the live database, each snapshot leaves
the database files like killing the database in-flight. Like shutting
the system down in the middle of writing data.

This is because I think there's no API for user space to subscribe to
events like a snapshot - unlike e.g. the VSS API (volume snapshot
service) in Windows. You should put the database into frozen state to
prepare it for a hotcopy before creating the snapshot, then ensure all
data is flushed before continuing.

I think I've read that btrfs snapshots do not guarantee single point in
time snapshots - the snapshot may be smeared across a longer period of
time while the kernel is still writing data. So parts of your writes
may still end up in the snapshot after issuing the snapshot command,
instead of in the working copy as expected.

How is this going to be addressed? Is there some snapshot aware API to
let user space subscribe to such events and do proper preparation? Is
this planned? LVM could be a user of such an API, too. I think this
could have nice enterprise-grade value for Linux.

XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
still, this also needs to be integrated with MySQL to work properly. I
once (years ago) researched this but gave up on my plans when I
planned database backups for our web server infrastructure. We moved to
creating SQL dumps instead, although there are binlogs which can be used
to recover to a clean and stable transactional state after taking
snapshots. But I simply didn't want to fiddle around with properly
cleaning up binlogs, which accumulate a horrible amount of space over
time. The cleanup process requires creating a cold copy or dump of the
complete database from time to time; only then is it safe to remove all
binlogs up to that point in time.
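
(For reference, a sketch of that freeze/snapshot sequence - fsfreeze is the
filesystem-agnostic successor to xfs_freeze; volume names and sizes are
illustrative:)

# quiesce the filesystem so the block-level snapshot is consistent
$ fsfreeze -f /var/lib/mysql
$ lvcreate -s -n mysql-snap -L 5G /dev/vg0/mysql
$ fsfreeze -u /var/lib/mysql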

-- 
Regards,
Kai

Replies to list-only preferred.



Re: BTRFS for OLTP Databases

2017-02-07 Thread Austin S. Hemmelgarn

On 2017-02-07 15:19, Kai Krakow wrote:

On Tue, 7 Feb 2017 14:50:04 -0500,
"Austin S. Hemmelgarn" wrote:


Also, does autodefrag work with nodatacow (i.e. with snapshots), or
are these exclusive?

I'm not sure about this one.  I would assume based on the fact that
many other things don't work with nodatacow and that regular defrag
doesn't work on files which are currently mapped as executable code
that it does not, but I could be completely wrong about this too.


Technically, there's nothing that prevents autodefrag from working on
nodatacow files. The question is: is it really necessary? Standard file
systems also have no autodefrag; it's not an issue there because they
are essentially nodatacow. Simply defrag the database file once and
you're done. Transactional MySQL uses huge data files, probably
preallocated. It should simply work with nodatacow.
The thing is, I don't have enough knowledge of how defrag is implemented 
in BTRFS to say for certain that it doesn't use COW semantics somewhere 
(and I would actually expect it to do so, since that in theory makes 
many things _much_ easier to handle), and if it uses COW somewhere, then 
it by definition doesn't work on NOCOW files.


On the other hand: Using snapshots clearly introduces fragmentation over
time. If autodefrag kicks in (given, it is supported for nodatacow), it
will slowly unshare all data over time. This somehow defeats the
purpose of having snapshots in the first place for this scenario.

In conclusion, I'd recommend to run some maintenance scripts from time
to time, one to re-share identical blocks, and one to defragment the
current workspace.

The bees daemon comes to mind here... I haven't tried it, but it
sounds like it could fill a gap here:

https://github.com/Zygo/bees

Another option that comes to mind: XFS now supports shared-extent
copies. You could simply do a cold copy of the database with this
feature, resulting in the same effect as a snapshot, without seeing the
other performance problems of btrfs. Though the fragmentation issue would
remain, and I think there's no dedupe application for XFS yet.
There isn't, but cp --reflink=auto with a reasonably recent version of 
coreutils should be able to reflink the file properly.
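
(i.e. something along these lines - paths are illustrative, and
--reflink=always makes cp fail loudly if the filesystem can't share extents:)

# shared-extent copy of a database file (XFS with reflink support, or btrfs)
$ cp --reflink=always /var/lib/mysql/ibdata1 /backup/ibdata1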




Re: BTRFS for OLTP Databases

2017-02-07 Thread Kai Krakow
On Tue, 7 Feb 2017 14:50:04 -0500,
"Austin S. Hemmelgarn" wrote:

> > Also does autodefrag works with nodatacow (ie with snapshot)  or
> > are these exclusive ?  
> I'm not sure about this one.  I would assume based on the fact that
> many other things don't work with nodatacow and that regular defrag
> doesn't work on files which are currently mapped as executable code
> that it does not, but I could be completely wrong about this too.

Technically, there's nothing that prevents autodefrag from working on
nodatacow files. The question is: is it really necessary? Standard file
systems also have no autodefrag; it's not an issue there because they
are essentially nodatacow. Simply defrag the database file once and
you're done. Transactional MySQL uses huge data files, probably
preallocated. It should simply work with nodatacow.

On the other hand: Using snapshots clearly introduces fragmentation over
time. If autodefrag kicks in (given, it is supported for nodatacow), it
will slowly unshare all data over time. This somehow defeats the
purpose of having snapshots in the first place for this scenario.

In conclusion, I'd recommend to run some maintenance scripts from time
to time, one to re-share identical blocks, and one to defragment the
current workspace.

The bees daemon comes to mind here... I haven't tried it, but it
sounds like it could fill a gap here:

https://github.com/Zygo/bees

Another option that comes to mind: XFS now supports shared-extent
copies. You could simply do a cold copy of the database with this
feature, resulting in the same effect as a snapshot, without seeing the
other performance problems of btrfs. Though the fragmentation issue would
remain, and I think there's no dedupe application for XFS yet.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: Very slow balance / btrfs-transaction

2017-02-07 Thread Austin S. Hemmelgarn

On 2017-02-07 14:47, Kai Krakow wrote:

On Mon, 6 Feb 2017 08:19:37 -0500,
"Austin S. Hemmelgarn" wrote:


MDRAID uses stripe selection based on latency and other measurements
(like head position). It would be nice if btrfs implemented similar
functionality. This would also be helpful for selecting a disk if
there're more disks than stripesets (for example, I have 3 disks in
my btrfs array). This could write new blocks to the most idle disk
always. I think this wasn't covered by the above mentioned patch.
Currently, selection is based only on the disk with most free
space.

You're confusing read selection and write selection.  MDADM and
DM-RAID both use a load-balancing read selection algorithm that takes
latency and other factors into account.  However, they use a
round-robin write selection algorithm that only cares about the
position of the block in the virtual device modulo the number of
physical devices.


Thanks for clearing that point.


As an example, say you have a 3 disk RAID10 array set up using MDADM
(this is functionally the same as a 3-disk raid1 mode BTRFS
filesystem). Every third block starting from block 0 will be on disks
1 and 2, every third block starting from block 1 will be on disks 3
and 1, and every third block starting from block 2 will be on disks 2
and 3.  No latency measurements are taken, literally nothing is
factored in except the block's position in the virtual device.


I didn't know MDADM can use RAID10 on odd amounts of disks...
Nice. I'll keep that in mind. :-)
It's one of those neat features that I stumbled across by accident a 
while back that not many people know about.  It's kind of ironic when 
you think about it too, since the MD RAID10 profile with only 2 replicas 
is actually a more accurate comparison for the BTRFS raid1 profile than 
the MD RAID1 profile.  FWIW, it can (somewhat paradoxically) sometimes 
get better read and write performance than MD RAID0 across the same 
number of disks.
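
(For the curious, a sketch of creating such an array - device names are
illustrative; the "near" layout keeps two copies of every block spread
across the three disks:)

$ mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=3 \
    /dev/sda /dev/sdb /dev/sdc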




Re: BTRFS for OLTP Databases

2017-02-07 Thread Roman Mamedov
On Tue, 7 Feb 2017 09:13:25 -0500
Peter Zaitsev  wrote:

> Hi Hugo,
> 
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time.  Imagine  for example snapshot being created every
> hour and several of these snapshots  kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
> think you also describe)  nodatacow  does not provide any advantage.

It still does provide some advantage, as each write into a new area since the
last hourly snapshot is going to be CoW'ed only once, as opposed to every new
write getting CoW'ed every time no matter what.

I'm not sold on autodefrag; what I'd suggest instead is to schedule a regular
defrag ("btrfs fi defrag") of the database files, e.g. daily. This may increase
space usage temporarily, as it will partially unmerge extents previously shared
across snapshots, but you won't get runaway fragmentation anymore, as you
would without nodatacow or with periodic snapshotting.
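
(A sketch of such a schedule - the target extent size and path are
illustrative, and as noted above this temporarily unshares extents that
snapshots still reference:)

# run nightly from cron or a systemd timer
$ btrfs fi defrag -r -t 32M /var/lib/mysql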

-- 
With respect,
Roman


Re: BTRFS for OLTP Databases

2017-02-07 Thread Austin S. Hemmelgarn

On 2017-02-07 14:39, Kai Krakow wrote:

On Tue, 7 Feb 2017 10:06:34 -0500,
"Austin S. Hemmelgarn" wrote:


4. Try using in-line compression.  This can actually significantly
improve performance, especially if you have slow storage devices and
a really nice CPU.


Just a side note: With nodatacow there'll be no compression, I think.
At least for files with "chattr +C" there'll be no compression. I thus
think "nodatacow" has the same effect.
You're absolutely right, thanks for mentioning this, I completely forgot 
to point it out myself.




Re: BTRFS for OLTP Databases

2017-02-07 Thread Austin S. Hemmelgarn

On 2017-02-07 13:59, Peter Zaitsev wrote:

Jeff,

Thank you very much for explanations. Indeed it was not clear in the
documentation - I read it simply as "if you have snapshots enabled
nodatacow makes no difference"

I will rebuild the database in this mode from scratch and see how
performance changes.

So far the most frustrating thing for me has been periodic stalls of many
seconds (running a sysbench workload).  What is most puzzling is that I get
this even if I run the workload at 50% or less of the full load - i.e. the
database can handle 1000 transactions/sec, I only inject 500/sec,
and I still get those stalls.

It looks to me like some work is being delayed until it requires a stall of
a few seconds to catch up.  I wonder if there are some configuration options
available to play with.

So far I found BTRFS rather  "zero configuration" which is great if it
works but it is also great to have more levers to pull if you're
having some troubles.
It's worth keeping in mind that there is more to the storage stack than 
just the filesystem, and BTRFS tends to be more sensitive to the 
behavior of other components in the stack than most other filesystems 
are.  The stalls you're describing sound more like a symptom of the 
brain-dead writeback buffering defaults used by the VFS layer than they 
do an issue with BTRFS (although BTRFS tends to be a  bit more heavily 
impacted by this than most other filesystems).  Try fiddling with the 
/proc/sys/vm/dirty_* sysctls (there is some pretty good documentation in 
Documentation/sysctl/vm.txt in the kernel source) and see if that helps. 
 The default values it uses are at most 20% of RAM, which is an insane 
amount of data to buffer before starting writeback when you're talking 
about systems with 16GB of RAM.
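
(For example - a sketch only, and the values are just a starting point to
experiment with, not a recommendation:)

# switch writeback thresholds from a percentage of RAM to absolute byte limits
$ sysctl -w vm.dirty_background_bytes=268435456   # start background writeback at 256MB
$ sysctl -w vm.dirty_bytes=1073741824             # throttle writers at 1GB of dirty data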



On Tue, Feb 7, 2017 at 1:27 PM, Jeff Mahoney  wrote:

On 2/7/17 8:53 AM, Peter Zaitsev wrote:

Hi,

I have tried BTRFS from Ubuntu 16.04 LTS   for write intensive OLTP MySQL
Workload.

It did not go very well ranging from multi-seconds stalls where no
transactions are completed to the finally kernel OOPS with "no space left
on device" error message and filesystem going read only.

I'm complete newbie in BTRFS so  I assume  I'm doing something wrong.

Do you have any advice on how BTRFS should be tuned for OLTP workload
(large files having a lot of random writes)  ?Or is this the case where
one should simply stay away from BTRFS and use something else ?

One item recommended in some places is "nodatacow"  this however defeats
the main purpose I'm looking at BTRFS -  I am interested in "free"
snapshots which look very attractive to use for database recovery scenarios
allow instant rollback to the previous state.



Hi Peter -

There seems to be some misunderstanding around how nodatacow works.
Nodatacow doesn't prohibit snapshot use.  Snapshots are still allowed
and, of course, will cause CoW to happen when a write occurs, but only
on the first write.  Subsequent writes will not CoW again.  This does
mean you don't get CRC protection for data, though.  Since most
databases do this internally, that is probably no great loss.  You will
get fragmentation, but that's true of any random-write workload on btrfs.

Timothy's comment about how extents are accounted is more-or-less
correct.  The file extents in the file system trees reference data
extents in the extent tree.  When portions of the data extent are
unreferenced, they're not necessarily released.  A balance operation
will usually split the data extents so that the unused space is released.

As for the Oopses with ENOSPC, that's something we'd want to look into
if it can be reproduced with a more recent kernel.  We shouldn't be
getting ENOSPC anywhere sensitive anymore.

-Jeff

--
Jeff Mahoney
SUSE Labs









Re: BTRFS for OLTP Databases

2017-02-07 Thread Kai Krakow
On Tue, 7 Feb 2017 10:06:34 -0500,
"Austin S. Hemmelgarn" wrote:

> 4. Try using in-line compression.  This can actually significantly 
> improve performance, especially if you have slow storage devices and
> a really nice CPU.

Just a side note: With nodatacow there'll be no compression, I think.
At least for files with "chattr +C" there'll be no compression. I thus
think "nodatacow" has the same effect.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: BTRFS for OLTP Databases

2017-02-07 Thread Austin S. Hemmelgarn

On 2017-02-07 14:31, Peter Zaitsev wrote:

Hi Hugo,

As I re-read it closely (and also other comments in the thread) I know
understand there is a difference how nodatacow works even if snapshot are
in place.

On autodefrag I wonder is there some more detailed documentation about how
autodefrag works.

The manual at https://btrfs.wiki.kernel.org/index.php/Mount_options has a
very general statement.

What does "detect random IO" really mean?  It also talks about
defragmenting the file - is it really the whole file that gets queued
for defrag, or is the defrag local?  I.e. I would understand it as: as
writes happen, the 1MB block is checked and, if it has more than X
fragments, it is defragmented - or something like that.
I don't know the exact algorithm, but I'm pretty sure it's similar to 
what bcache uses to bypass the cache device for sequential I/O.  In 
essence, it's going to trigger for database usage.


Also, does autodefrag work with nodatacow (i.e. with snapshots), or are these
exclusive?
I'm not sure about this one.  I would assume based on the fact that many 
other things don't work with nodatacow and that regular defrag doesn't 
work on files which are currently mapped as executable code that it does 
not, but I could be completely wrong about this too.





   There's another approach which might be worth testing, which is to
use autodefrag. This will increase data write I/O, because where you
have one or more small writes in a region, it will also read and write
the data in a small neighbourhood around those writes, so the
fragmentation is reduced. This will improve subsequent read
performance.

   I could also suggest getting the latest kernel you can -- 16.04 is
already getting on for a year old, and there may be performance
improvements in upstream kernels which affect your workload. There's
an Ubuntu kernel PPA you can use to get the new kernels without too
much pain.






Re: Very slow balance / btrfs-transaction

2017-02-07 Thread Kai Krakow
On Mon, 6 Feb 2017 08:19:37 -0500,
"Austin S. Hemmelgarn" wrote:

> > MDRAID uses stripe selection based on latency and other measurements
> > (like head position). It would be nice if btrfs implemented similar
> > functionality. This would also be helpful for selecting a disk if
> > there're more disks than stripesets (for example, I have 3 disks in
> > my btrfs array). This could write new blocks to the most idle disk
> > always. I think this wasn't covered by the above mentioned patch.
> > Currently, selection is based only on the disk with most free
> > space.  
> You're confusing read selection and write selection.  MDADM and
> DM-RAID both use a load-balancing read selection algorithm that takes
> latency and other factors into account.  However, they use a
> round-robin write selection algorithm that only cares about the
> position of the block in the virtual device modulo the number of
> physical devices.

Thanks for clearing that point.

> As an example, say you have a 3 disk RAID10 array set up using MDADM 
> (this is functionally the same as a 3-disk raid1 mode BTRFS
> filesystem). Every third block starting from block 0 will be on disks
> 1 and 2, every third block starting from block 1 will be on disks 3
> and 1, and every third block starting from block 2 will be on disks 2
> and 3.  No latency measurements are taken, literally nothing is
> factored in except the block's position in the virtual device.

I didn't know MDADM can use RAID10 on odd amounts of disks...
Nice. I'll keep that in mind. :-)


-- 
Regards,
Kai

Replies to list-only preferred.



Re: BTRFS for OLTP Databases

2017-02-07 Thread Peter Zaitsev
Hi Hugo,

As I re-read it closely (and also other comments in the thread) I know
understand there is a difference how nodatacow works even if snapshot are
in place.

On autodefrag I wonder is there some more detailed documentation about how
autodefrag works.

The manual at https://btrfs.wiki.kernel.org/index.php/Mount_options has a
very general statement.

What does "detect random IO" really mean?  It also talks about
defragmenting the file - is it really the whole file that gets queued
for defrag, or is the defrag local?  I.e. I would understand it as: as
writes happen, the 1MB block is checked and, if it has more than X
fragments, it is defragmented - or something like that.

Also, does autodefrag work with nodatacow (i.e. with snapshots), or are these
exclusive?


>
>There's another approach which might be worth testing, which is to
> use autodefrag. This will increase data write I/O, because where you
> have one or more small writes in a region, it will also read and write
> the data in a small neighbourhood around those writes, so the
> fragmentation is reduced. This will improve subsequent read
> performance.
>
>I could also suggest getting the latest kernel you can -- 16.04 is
> already getting on for a year old, and there may be performance
> improvements in upstream kernels which affect your workload. There's
> an Ubuntu kernel PPA you can use to get the new kernels without too
> much pain.
>
>
>


Re: BTRFS for OLTP Databases

2017-02-07 Thread Peter Zaitsev
Jeff,

Thank you very much for explanations. Indeed it was not clear in the
documentation - I read it simply as "if you have snapshots enabled
nodatacow makes no difference"

I will rebuild the database in this mode from scratch and see how
performance changes.

So far the most frustrating thing for me has been periodic stalls of many
seconds (running a sysbench workload).  What is most puzzling is that I get
this even if I run the workload at 50% or less of the full load - i.e. the
database can handle 1000 transactions/sec, I only inject 500/sec,
and I still get those stalls.

It looks to me like some work is being delayed until it requires a stall of
a few seconds to catch up.  I wonder if there are some configuration options
available to play with.

So far I found BTRFS rather  "zero configuration" which is great if it
works but it is also great to have more levers to pull if you're
having some troubles.


On Tue, Feb 7, 2017 at 1:27 PM, Jeff Mahoney  wrote:
> On 2/7/17 8:53 AM, Peter Zaitsev wrote:
>> Hi,
>>
>> I have tried BTRFS from Ubuntu 16.04 LTS   for write intensive OLTP MySQL
>> Workload.
>>
>> It did not go very well ranging from multi-seconds stalls where no
>> transactions are completed to the finally kernel OOPS with "no space left
>> on device" error message and filesystem going read only.
>>
>> I'm complete newbie in BTRFS so  I assume  I'm doing something wrong.
>>
>> Do you have any advice on how BTRFS should be tuned for OLTP workload
>> (large files having a lot of random writes)  ?Or is this the case where
>> one should simply stay away from BTRFS and use something else ?
>>
>> One item recommended in some places is "nodatacow"  this however defeats
>> the main purpose I'm looking at BTRFS -  I am interested in "free"
>> snapshots which look very attractive to use for database recovery scenarios
>> allow instant rollback to the previous state.
>>
>
> Hi Peter -
>
> There seems to be some misunderstanding around how nodatacow works.
> Nodatacow doesn't prohibit snapshot use.  Snapshots are still allowed
> and, of course, will cause CoW to happen when a write occurs, but only
> on the first write.  Subsequent writes will not CoW again.  This does
> mean you don't get CRC protection for data, though.  Since most
> databases do this internally, that is probably no great loss.  You will
> get fragmentation, but that's true of any random-write workload on btrfs.
>
> Timothy's comment about how extents are accounted is more-or-less
> correct.  The file extents in the file system trees reference data
> extents in the extent tree.  When portions of the data extent are
> unreferenced, they're not necessarily released.  A balance operation
> will usually split the data extents so that the unused space is released.
>
> As for the Oopses with ENOSPC, that's something we'd want to look into
> if it can be reproduced with a more recent kernel.  We shouldn't be
> getting ENOSPC anywhere sensitive anymore.
>
> -Jeff
>
> --
> Jeff Mahoney
> SUSE Labs
>



-- 
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev


Re: BTRFS for OLTP Databases

2017-02-07 Thread Jeff Mahoney
On 2/7/17 8:53 AM, Peter Zaitsev wrote:
> Hi,
> 
> I have tried BTRFS from Ubuntu 16.04 LTS   for write intensive OLTP MySQL
> Workload.
> 
> It did not go very well ranging from multi-seconds stalls where no
> transactions are completed to the finally kernel OOPS with "no space left
> on device" error message and filesystem going read only.
> 
> I'm complete newbie in BTRFS so  I assume  I'm doing something wrong.
> 
> Do you have any advice on how BTRFS should be tuned for OLTP workload
> (large files having a lot of random writes)  ?Or is this the case where
> one should simply stay away from BTRFS and use something else ?
> 
> One item recommended in some places is "nodatacow"  this however defeats
> the main purpose I'm looking at BTRFS -  I am interested in "free"
> snapshots which look very attractive to use for database recovery scenarios
> allow instant rollback to the previous state.
> 

Hi Peter -

There seems to be some misunderstanding around how nodatacow works.
Nodatacow doesn't prohibit snapshot use.  Snapshots are still allowed
and, of course, will cause CoW to happen when a write occurs, but only
on the first write.  Subsequent writes will not CoW again.  This does
mean you don't get CRC protection for data, though.  Since most
databases do this internally, that is probably no great loss.  You will
get fragmentation, but that's true of any random-write workload on btrfs.
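
(For anyone wanting to try this, a minimal sketch - the path is illustrative;
the +C attribute has to be set while the directory is still empty so that
newly created database files inherit it:)

# mark the (still empty) data directory NOCOW; new files inherit the flag
$ mkdir /srv/db/mysql
$ chattr +C /srv/db/mysql
$ lsattr -d /srv/db/mysql    # the 'C' flag should show up here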

Timothy's comment about how extents are accounted is more-or-less
correct.  The file extents in the file system trees reference data
extents in the extent tree.  When portions of the data extent are
unreferenced, they're not necessarily released.  A balance operation
will usually split the data extents so that the unused space is released.

As for the Oopses with ENOSPC, that's something we'd want to look into
if it can be reproduced with a more recent kernel.  We shouldn't be
getting ENOSPC anywhere sensitive anymore.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH] btrfs-progs: better document btrfs receive security

2017-02-07 Thread David Sterba
On Fri, Feb 03, 2017 at 08:48:58AM -0500, Austin S. Hemmelgarn wrote:
> This adds some extra documentation to the btrfs-receive manpage that
> explains some of the security related aspects of btrfs-receive.  The
> first part covers the fact that the subvolume being received is writable
> until the receive finishes, and the second covers the current lack of
> sanity checking of the send stream.
> 
> Signed-off-by: Austin S. Hemmelgarn 

Applied, thanks.




[PATCH] Btrfs: fix use-after-free due to wrong order of destroying work queues

2017-02-07 Thread fdmanana
From: Filipe Manana 

Before we destroy all work queues (and wait for their tasks to complete)
we were destroying the work queues used for metadata I/O operations, which
can result in a use-after-free problem because most tasks from all work
queues do metadata I/O operations. For example, the tasks from the caching
workers work queue (fs_info->caching_workers), which is destroyed only
after the work queue used for metadata reads (fs_info->endio_meta_workers)
is destroyed, do metadata reads, which result in attempts to queue tasks
into the later work queue, triggering a use-after-free with a trace like
the following:

[23114.613543] general protection fault:  [#1] PREEMPT SMP
[23114.614442] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison 
dm_bufio libcrc32c btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic
acpi_cpufreq tpm_tis tpm_tis_core tpm ppdev parport_pc parport i2c_piix4 
processor sg evdev i2c_core psmouse pcspkr serio_raw button loop autofs4 ext4 
crc16
jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci 
libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: scsi_debug]
[23114.616932] CPU: 9 PID: 4537 Comm: kworker/u32:8 Not tainted 
4.9.0-rc7-btrfs-next-36+ #1
[23114.616932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[23114.616932] Workqueue: btrfs-cache btrfs_cache_helper [btrfs]
[23114.616932] task: 880221d45780 task.stack: c9000bc5
[23114.616932] RIP: 0010:[]  [] 
btrfs_queue_work+0x2c/0x190 [btrfs]
[23114.616932] RSP: 0018:88023f443d60  EFLAGS: 00010246
[23114.616932] RAX:  RBX: 6b6b6b6b6b6b6b6b RCX: 0102
[23114.616932] RDX: a0419000 RSI: 88011df534f0 RDI: 880101f01c00
[23114.616932] RBP: 88023f443d80 R08: 000f7000 R09: 
[23114.616932] R10: 88023f443d48 R11: 1000 R12: 88011df534f0
[23114.616932] R13: 880135963868 R14: 1000 R15: 1000
[23114.616932] FS:  () GS:88023f44() 
knlGS:
[23114.616932] CS:  0010 DS:  ES:  CR0: 80050033
[23114.616932] CR2: 7f0fb9f8e520 CR3: 01a0b000 CR4: 06e0
[23114.616932] Stack:
[23114.616932]  880101f01c00 88011df534f0 880135963868 
1000
[23114.616932]  88023f443da0 a03470af 880149b37200 
880135963868
[23114.616932]  88023f443db8 8125293c 880149b37200 
88023f443de0
[23114.616932] Call Trace:
[23114.616932]   [23114.616932]  [] 
end_workqueue_bio+0xd5/0xda [btrfs]
[23114.616932]  [] bio_endio+0x54/0x57
[23114.616932]  [] btrfs_end_bio+0xf7/0x106 [btrfs]
[23114.616932]  [] bio_endio+0x54/0x57
[23114.616932]  [] blk_update_request+0x21a/0x30f
[23114.616932]  [] scsi_end_request+0x31/0x182 [scsi_mod]
[23114.616932]  [] scsi_io_completion+0x1ce/0x4c8 [scsi_mod]
[23114.616932]  [] scsi_finish_command+0x104/0x10d [scsi_mod]
[23114.616932]  [] scsi_softirq_done+0x101/0x10a [scsi_mod]
[23114.616932]  [] blk_done_softirq+0x82/0x8d
[23114.616932]  [] __do_softirq+0x1ab/0x412
[23114.616932]  [] irq_exit+0x49/0x99
[23114.616932]  [] 
smp_call_function_single_interrupt+0x24/0x26
[23114.616932]  [] call_function_single_interrupt+0x89/0x90
[23114.616932]   [23114.616932]  [] ? 
scsi_request_fn+0x13a/0x2a1 [scsi_mod]
[23114.616932]  [] ? _raw_spin_unlock_irq+0x2c/0x4a
[23114.616932]  [] ? _raw_spin_unlock_irq+0x32/0x4a
[23114.616932]  [] ? _raw_spin_unlock_irq+0x2c/0x4a
[23114.616932]  [] scsi_request_fn+0x13a/0x2a1 [scsi_mod]
[23114.616932]  [] __blk_run_queue_uncond+0x22/0x2b
[23114.616932]  [] __blk_run_queue+0x19/0x1b
[23114.616932]  [] blk_queue_bio+0x268/0x282
[23114.616932]  [] generic_make_request+0xbd/0x160
[23114.616932]  [] submit_bio+0x100/0x11d
[23114.616932]  [] ? __this_cpu_preempt_check+0x13/0x15
[23114.616932]  [] ? __percpu_counter_add+0x8e/0xa7
[23114.616932]  [] btrfsic_submit_bio+0x1a/0x1d [btrfs]
[23114.616932]  [] btrfs_map_bio+0x1f4/0x26d [btrfs]
[23114.616932]  [] btree_submit_bio_hook+0x74/0xbf [btrfs]
[23114.616932]  [] ? btrfs_wq_submit_bio+0x160/0x160 [btrfs]
[23114.616932]  [] submit_one_bio+0x6b/0x89 [btrfs]
[23114.616932]  [] read_extent_buffer_pages+0x170/0x1ec 
[btrfs]
[23114.616932]  [] ? free_root_pointers+0x64/0x64 [btrfs]
[23114.616932]  [] readahead_tree_block+0x3f/0x4c [btrfs]
[23114.616932]  [] read_block_for_search.isra.20+0x1ce/0x23d 
[btrfs]
[23114.616932]  [] btrfs_search_slot+0x65f/0x774 [btrfs]
[23114.616932]  [] ? free_extent_buffer+0x73/0x7e [btrfs]
[23114.616932]  [] btrfs_next_old_leaf+0xa1/0x33c [btrfs]
[23114.616932]  [] btrfs_next_leaf+0x10/0x12 [btrfs]
[23114.616932]  [] caching_thread+0x22d/0x416 [btrfs]
[23114.616932]  [] btrfs_scrubparity_helper+0x187/0x3b6 
[btrfs]
[23114.616932]  [] btrfs_cache_helper+0xe/0x10 [btrfs]
[23114.616932]  [] process_one_work+0x273/0x4e4
[23114.616932]  [] 

understanding disk space usage

2017-02-07 Thread Vasco Visser
Hello,

My system is or seems to be running out of disk space but I can't find
out how or why. Might be a BTRFS peculiarity, hence posting on this
list. Most indicators seem to suggest I'm filling up, but I can't
trace the disk usage to files on the FS.

The issue is on my root filesystem on a 28GiB ssd partition (commands
below issued when booted into single user mode):


$ df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/sda3  28G   26G  2.1G  93% /


$ btrfs --version
btrfs-progs v4.4


$ btrfs fi usage /
Overall:
Device size:  27.94GiB
Device allocated:  27.94GiB
Device unallocated:   1.00MiB
Device missing: 0.00B
Used:  25.03GiB
Free (estimated):   2.37GiB (min: 2.37GiB)
Data ratio:  1.00
Metadata ratio:  1.00
Global reserve: 256.00MiB (used: 0.00B)
Data,single: Size:26.69GiB, Used:24.32GiB
   /dev/sda3  26.69GiB
Metadata,single: Size:1.22GiB, Used:731.45MiB
   /dev/sda3   1.22GiB
System,single: Size:32.00MiB, Used:16.00KiB
   /dev/sda3  32.00MiB
Unallocated:
   /dev/sda3   1.00MiB


$ btrfs fi df /
Data, single: total=26.69GiB, used=24.32GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=1.22GiB, used=731.48MiB
GlobalReserve, single: total=256.00MiB, used=0.00B


However:
$ mount -o bind / /mnt
$ sudo du -hs /mnt
9.3G /mnt


Try to balance:
$ btrfs balance start /
ERROR: error during balancing '/': No space left on device


Am I really filling up? What can explain the huge discrepancy with the
output of du (no open file descriptors on deleted files can explain
this in single user mode) and the FS stats?

Any advice on possible causes and how to proceed?


--
Vasco


Re: BTRFS for OLTP Databases

2017-02-07 Thread Lionel Bouton
Hi Peter,

On 07/02/2017 at 15:13, Peter Zaitsev wrote:
> Hi Hugo,
>
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time.  Imagine  for example snapshot being created every
> hour and several of these snapshots  kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
> think you also describe)  nodatacow  does not provide any advantage.
>
> I have not seen autodefrag helping much but I will try again. Is
> there any autodefrag documentation available about how it is expected
> to work and whether it can be tuned in any way?

There's not much that can be done if the same file is modified in 2
different subvolumes (typically the original and a R/W snapshot). You
either break the reflink around the modification to limit the amount of
fragmentation (which will use disk space and write I/O) or get
fragmentation on at least one subvolume (which will add seeks).
So the only options are either to flatten the files (which can be done
incrementally by defragmenting them on both sides when they change) or
only defragment the most used volume (especially if the other is a
relatively short-lived snapshot where performance won't degrade much
until it is removed and won't matter much).

I just modified our defragmenter scheduler to be aware of multiple
subvolumes and support ignoring some of them. The previous version (not
tagged, sorry) was battle tested on a Ceph cluster and was designed for
it. Autodefrag didn't work with Ceph with our workload (latency went
through the roof, OSDs were timing out requests, ...) and our scheduler
with some simple Ceph BTRFS related tunings gave us even better
performance than XFS (which is usually the recommended choice with
current Ceph versions).

The current version is probably still rough around the edges as it is
brand new (most of the work was done last Sunday) and only running on a
backup server with a situation not much different from yours: a large
PostgreSQL slave (>50GB) which is snapshotted hourly and daily, with a
daily snapshot used to start a PostgreSQL instance for "tests on real
data" purposes, plus a copy of a <10TB NFS server with similar snapshots in
place. All of this is on a single 13-14TB RAID10 BTRFS filesystem.
In our case, using autodefrag on this setup slowly degraded performance to the
point where off-site backups became slow enough to warrant preventive
measures.
The current scheduler looks for the mountpoints of top BTRFS volumes (so
you have to mount the top volume somewhere), and defragments them avoiding :
- read-only snapshots,
- all data below configurable subdirs (including read-write subvolumes
even if they are mounted elsewhere), see README.md for instructions.

It slowly walks all files eligible for defragmentation and in parallel
detects writes to the same filesystem, including writes to read-write
subvolumes mounted elsewhere to trigger defragmentation. The scheduler
uses an estimated "cost" for each file to prioritize defragmentation
tasks and with default settings tries to keep I/O activity low enough
that it doesn't slow down other tasks too much. However it defragments
files whole, which might put some strain for huge ibdata* files if you
didn't switch to file per table. In our case defragmenting 1GB files is
OK and doesn't have a major impact.

We are already seeing better performance (our total daily backup time is
below worrying levels again) and the scheduler didn't even finish
walking the whole filesystem (there are approximately 8 millions files
and it is configured to evaluate them over a week). This is probably
because it follows the most write-active files (which are in the
PostgreSQL slave directory) and defragmented most of them early.

Note that it is tuned for filesystems using ~2TB 7200rpm drives (there
are some options that will adapt it to subsystems with more I/O
capacity). Using drives with different capacities shouldn't need tuning,
but it probably will not work well on SSD (it should be configured to
speed up significantly).

See https://github.com/jtek/ceph-utils - you want btrfs-defrag-scheduler.rb.

Some parameters are available (start it with --help). You should
probably start it with --verbose at least until you are comfortable with
it to get a list of which files are defragmented and many debug messages
you probably want to ignore (or you'll probably have to read the Ruby
code to fully understand what they mean).
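
(A sketch of getting it running, assuming it is launched directly with the
system Ruby interpreter - the flags are the ones mentioned above:)

$ git clone https://github.com/jtek/ceph-utils
$ cd ceph-utils
$ ruby btrfs-defrag-scheduler.rb --help      # list the available tunables
$ ruby btrfs-defrag-scheduler.rb --verbose   # run and log each defragmented file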

I don't provide any warranty for it, but the worst I believe can happen
is no performance improvement, or performance degradation until you stop
it. If you don't blacklist read-write snapshots with the .no-defrag file
(see README.md), defragmentation will probably eat more disk space than
usual. Space usage will go up rapidly during defragmentation if you have
snapshots; it is supposed to go down after all snapshots referring to
fragmented files are removed and replaced by new snapshots (where
fragmentation should be more stable).
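
In case it helps, here is roughly how it runs here (a sketch: paths are
examples, and the .no-defrag handling is as I understand README.md, so
double-check with --help and the README before relying on it):

$ sudo mount /dev/sdX /mnt/btrfs-top      # the top volume has to be mounted somewhere
$ ruby btrfs-defrag-scheduler.rb --verbose
$ touch /mnt/btrfs-top/test-snapshots/.no-defrag   # skip everything below this subdir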

Best regards,

Re: Very slow balance / btrfs-transaction

2017-02-07 Thread Filipe Manana
On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo  wrote:
>
>
> At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:
>>
>>
>> Hi Qu,
>>
>> On 02/05/2017 07:45 PM, Qu Wenruo wrote:
>>>
>>>
>>>
>>> At 02/04/2017 09:47 AM, Jorg Bornschein wrote:

 February 4, 2017 1:07 AM, "Goldwyn Rodrigues"  wrote:
>>
>>
>> 
>>


 Quota support was indeed active -- and it warned me that the qgroup
 data was inconsistent.

 Disabling quotas had an immediate impact on balance throughput -- it's
 *much* faster now!
 From a quick glance at iostat I would guess it's at least a factor 100
 faster.


 Should quota support generally be disabled during balances? Or did I
 somehow push my fs into a weird state where it triggered a slow-path?



 Thanks!

j
>>>
>>>
>>> Would you please provide the kernel version?
>>>
>>> v4.9 introduced a bad fix for qgroup balance, which not only fails to
>>> completely fix the qgroup byte leaking, but also hugely slows down the
>>> balance process:
>>>
>>> commit 62b99540a1d91e46422f0e04de50fc723812c421
>>> Author: Qu Wenruo 
>>> Date:   Mon Aug 15 10:36:51 2016 +0800
>>>
>>> btrfs: relocation: Fix leaking qgroups numbers on data extents
>>>
>>> Sorry for that.
>>>
>>> And in v4.10, a better method is applied to fix the byte leaking
>>> problem, and should be a little faster than previous one.
>>>
>>> commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
>>> Author: Qu Wenruo 
>>> Date:   Tue Oct 18 09:31:29 2016 +0800
>>>
>>> btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
>>>
>>>
>>> However, using balance with qgroup is still slower than balance without
>>> qgroup, the root fix needs us to rework current backref iteration.
>>>
>>
>> This patch has made the btrfs balance performance worse. The balance
>> task has become more CPU intensive compared to earlier and takes longer
>> to complete, besides hogging resources. While correctness is important,
>> we need to figure out how this can be made more efficient.
>>
> The cause is already known.
>
> It's find_parent_node() which takes most of the time to find all
> referencers of an extent.
>
> And it's also the cause for FIEMAP softlockup (fixed in recent release by
> early quit).
>
> The biggest problem is that the current find_parent_node() uses a list to
> iterate, which is quite slow, especially since it's done in a loop.
> In the real world find_parent_node() is about O(n^3).
> We can either improve find_parent_node() by using an rb_tree, or introduce
> some cache for find_parent_node().

Even if anyone is able to reduce that function's complexity from
O(n^3) down to, let's say, O(n^2) or O(n log n) for example, the current
implementation of qgroups will always be a problem. The real problem
is that this more recent rework of qgroups does all this accounting
inside the critical section of a transaction - blocking any other
tasks that want to start a new transaction or attempt to join the
current transaction. Not to mention that on systems with small amounts
of memory (2GB or 4GB from what I've seen from user reports) we also
OOM due to the allocation of a struct btrfs_qgroup_extent_record per
delayed data reference head, which are used for that accounting phase
in the critical section of a transaction commit.

Let's face it and be realistic, even if someone manages to make
find_parent_node() much much better, like O(n) for example, it will
always be a problem due to the reasons mentioned before. Many extents
touched per transaction and many subvolumes/snapshots will always
expose that root problem: doing the accounting in the transaction
commit critical section.
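
For anyone who just needs a balance to complete in the meantime, the
workaround Jorg stumbled on earlier in the thread (disabling quotas)
boils down to something like this (a sketch; only re-enable and rescan
if you actually rely on qgroup numbers, since the rescan is what makes
them consistent again):

$ sudo btrfs quota disable /mnt
$ sudo btrfs balance start /mnt
$ sudo btrfs quota enable /mnt
$ sudo btrfs quota rescan /mnt       # rebuild the qgroup numbers
$ sudo btrfs quota rescan -s /mnt    # watch rescan progress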

>
>
> IIRC SUSE guys(maybe Jeff?) are working on it with the first method, but I
> didn't hear anything about it recently.
>
> Thanks,
> Qu
>
>
>



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."


Re: BTRFS for OLTP Databases

2017-02-07 Thread Austin S. Hemmelgarn

On 2017-02-07 10:20, Timofey Titovets wrote:

I think that you have a problem with extent bookkeeping (if I
understand how btrfs manages extents).
So to deal with it, try enabling compression, as compression will force
all extents to be fragmented at ~128kb.


No, it will compress everything in chunks of 128kB, but it will not fragment
things any more than they already would have been (it may actually _reduce_
fragmentation because there is less data being stored on disk).  This
representation is a bug in the FIEMAP ioctl, it doesn't understand the way
BTRFS represents things properly.  IIRC, there was a patch to fix this, but
I don't remember what happened with it.

That said, in-line compression can help significantly, especially if you
have slow storage devices.



I mean that:
You have a 128MB extent and you rewrite random 4k sectors; btrfs will not
split the 128MB extent and will not free up the data (I don't know the
internal algorithm, so I can't predict when this will happen), and after
some time btrfs will rebuild the extents and split the 128MB extent into
several smaller ones.
But when you use compression, the allocator rebuilds extents much earlier
(I think it's because btrfs then operates on something like 128kb
extents, even if it's a contiguous 128MB chunk of data).

The allocator has absolutely nothing to do with this; it's a function of 
the COW operation.  Unless you're using nodatacow, that 128MB extent 
will get split the moment the data hits the storage device (either on 
the next commit cycle (at most 30 seconds with the default commit 
cycle), or when fdatasync is called, whichever is sooner).  In the case 
of compression, it's still one extent (although on disk it will be less 
than 128MB) and will be split at _exactly_ the same time under _exactly_ 
the same circumstances as an uncompressed extent.  IOW, it has 
absolutely nothing to do with the extent handling either.


The difference arises in that compressed data effectively has an on-media 
block size of 128k, not 16k (the current default block size) or 4k (the 
old default).  This means that the smallest fragment possible for a file 
with in-line compression enabled is 128k, while for a file without it 
it's equal to the filesystem block size.  A larger minimum fragment size 
means that the maximum number of fragments a given file can have is 
smaller (8 times smaller in fact than without compression when using the 
current default block size), which means that there will be less 
fragmentation.
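
To put rough numbers on that, the worst-case fragment count for a single
1 GiB file works out as follows (illustrative arithmetic only, ignoring
extent merging and the actual on-disk layout):

#   4 KiB block size          : 1 GiB / 4 KiB   = 262144 fragments
#   16 KiB block size         : 1 GiB / 16 KiB  =  65536 fragments
#   128 KiB compressed chunks : 1 GiB / 128 KiB =   8192 fragments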


Some rather complex and tedious math indicates that this is not the 
_only_ thing improving performance when using in-line compression, but 
it's probably the biggest thing doing so for the workload being discussed.



Re: BTRFS for OLTP Databases

2017-02-07 Thread Timofey Titovets
>> I think that you have a problem with extent bookkeeping (if I
>> understand how btrfs manages extents).
>> So to deal with it, try enabling compression, as compression will force
>> all extents to be fragmented at ~128kb.
>
> No, it will compress everything in chunks of 128kB, but it will not fragment
> things any more than they already would have been (it may actually _reduce_
> fragmentation because there is less data being stored on disk).  This
> representation is a bug in the FIEMAP ioctl, it doesn't understand the way
> BTRFS represents things properly.  IIRC, there was a patch to fix this, but
> I don't remember what happened with it.
>
> That said, in-line compression can help significantly, especially if you
> have slow storage devices.


I mean that:
You have a 128MB extent and you rewrite random 4k sectors; btrfs will not
split the 128MB extent and will not free up the data (I don't know the
internal algorithm, so I can't predict when this will happen), and after
some time btrfs will rebuild the extents and split the 128MB extent into
several smaller ones.
But when you use compression, the allocator rebuilds extents much earlier
(I think it's because btrfs then operates on something like 128kb
extents, even if it's a contiguous 128MB chunk of data).

-- 
Have a nice day,
Timofey.


Re: BTRFS for OLTP Databases

2017-02-07 Thread Austin S. Hemmelgarn

On 2017-02-07 10:00, Timofey Titovets wrote:

2017-02-07 17:13 GMT+03:00 Peter Zaitsev :

Hi Hugo,

For the use case I'm looking for I'm interested in having snapshot(s)
open at all time.  Imagine  for example snapshot being created every
hour and several of these snapshots  kept at all time providing quick
recovery points to the state of 1,2,3 hours ago.  In  such case (as I
think you also describe)  nodatacow  does not provide any advantage.

I have not seen autodefrag helping much but I will try again. Is
there any autodefrag documentation available about how is it expected
to work and if it can be tuned in any way

I noticed remounting already fragmented filesystem with autodefrag
and putting workload  which does more fragmentation does not seem to
improve over time




   Well, nodatacow will still allow snapshots to work, but it also
allows the data to fragment. Each snapshot made will cause subsequent
writes to shared areas to be CoWed once (and then it reverts to
unshared and nodatacow again).

   There's another approach which might be worth testing, which is to
use autodefrag. This will increase data write I/O, because where you
have one or more small writes in a region, it will also read and write
the data in a small neighbourhood around those writes, so the
fragmentation is reduced. This will improve subsequent read
performance.

   I could also suggest getting the latest kernel you can -- 16.04 is
already getting on for a year old, and there may be performance
improvements in upstream kernels which affect your workload. There's
an Ubuntu kernel PPA you can use to get the new kernels without too
much pain.








--
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev


I think that you have a problem with extent bookkeeping (if I
understand how btrfs manages extents).
So to deal with it, try enabling compression, as compression will force
all extents to be fragmented at ~128kb.
No, it will compress everything in chunks of 128kB, but it will not 
fragment things any more than they already would have been (it may 
actually _reduce_ fragmentation because there is less data being stored 
on disk).  This representation is a bug in the FIEMAP ioctl, it doesn't 
understand the way BTRFS represents things properly.  IIRC, there was a 
patch to fix this, but I don't remember what happened with it.


That said, in-line compression can help significantly, especially if you 
have slow storage devices.


I did have a similar problem with MySQL (Zabbix as the workload, i.e.
most of the time the load is random writes), and I fixed it by enabling
compression. (I use Debian with the latest kernel from backports.)
Now it just works with stable speed under stable load.




Re: BTRFS for OLTP Databases

2017-02-07 Thread Timofey Titovets
2017-02-07 17:13 GMT+03:00 Peter Zaitsev :
> Hi Hugo,
>
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time.  Imagine  for example snapshot being created every
> hour and several of these snapshots  kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
> think you also describe)  nodatacow  does not provide any advantage.
>
> I have not seen autodefrag helping much but I will try again. Is
> there any autodefrag documentation available about how is it expected
> to work and if it can be tuned in any way
>
> I noticed remounting already fragmented filesystem with autodefrag
> and putting workload  which does more fragmentation does not seem to
> improve over time
>
>
>
>>Well, nodatacow will still allow snapshots to work, but it also
>> allows the data to fragment. Each snapshot made will cause subsequent
>> writes to shared areas to be CoWed once (and then it reverts to
>> unshared and nodatacow again).
>>
>>There's another approach which might be worth testing, which is to
>> use autodefrag. This will increase data write I/O, because where you
>> have one or more small writes in a region, it will also read and write
>> the data in a small neighbourhood around those writes, so the
>> fragmentation is reduced. This will improve subsequent read
>> performance.
>>
>>I could also suggest getting the latest kernel you can -- 16.04 is
>> already getting on for a year old, and there may be performance
>> improvements in upstream kernels which affect your workload. There's
>> an Ubuntu kernel PPA you can use to get the new kernels without too
>> much pain.
>
>
>
>
>
>
>
> --
> Peter Zaitsev, CEO, Percona
> Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev

I think that you have a problem with extent bookkeeping (if I
understand how btrfs manages extents).
So to deal with it, try enabling compression, as compression will force
all extents to be fragmented at ~128kb.

I did have a similar problem with MySQL (Zabbix as the workload, i.e.
most of the time the load is random writes), and I fixed it by enabling
compression. (I use Debian with the latest kernel from backports.)
Now it just works with stable speed under stable load.

P.S.
(And I also use your Percona MySQL sometimes, it's cool.)

-- 
Have a nice day,
Timofey.


Re: BTRFS for OLTP Databases

2017-02-07 Thread Peter Grandi
> I have tried BTRFS from Ubuntu 16.04 LTS for write intensive
> OLTP MySQL Workload.

This has a lot of interesting and mostly agreeable information:

https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp

The main target of Btrfs is where one wants checksums and
occasional snapshots for backup (rather than rollback) and where
applications do whole-file rewrites or appends.

> It did not go very well ranging from multi-seconds stalls
> where no transactions are completed

That usually is more because of the "clever" design and defaults
of the Linux page cache and block IO subsystem, which are
astutely pessimized for every workload, but especially for
read-modify-write ones, never mind for RMW workloads on
copy-on-write filesystems.
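
If you want to blunt the worst of that behaviour on the page-cache side,
the usual knobs are the dirty-writeback sysctls; the values below are
purely illustrative and need tuning for the actual hardware:

$ sudo sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at 64MiB of dirty data
$ sudo sysctl -w vm.dirty_bytes=268435456             # throttle writers at 256MiB of dirty data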

That most OS designs are pessimized for anything like a "write
intensive OLTP" workload is not new, M Stonebraker complained
about that 35 years ago, and nothing much has changed:

  http://www.sabi.co.uk/blog/anno05-4th.html?051012d#051012d

> to the finally kernel OOPS with "no space left on device"
> error message and filesystem going read only.

That's because Btrfs has a two-level allocator, where space is
allocated in 1GiB chunks (distinct as to data and metadata) and
then in 16KiB nodes, and this makes it far more likely for free
space fragmentation to occur. Therefore Btrfs has a free space
compactor ('btrfs balance') that must be used the more often the
more updates happen.
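
In practice that means periodically running something along these lines
(the usage thresholds are arbitrary examples, and the path should be the
filesystem's mount point):

$ sudo btrfs balance start -dusage=50 -musage=50 /var/lib/mysql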

> interested in "free" snapshots which look very attractive

The general problem is that it is pretty much impossible to have
read-modify-write rollbacks for cheap, because the writes in
general are scattered (that is their time coherence is very
different from their spatial coherence). That means either heavy
spatial fragmentation or huge write amplification.

The 'snapshot' type of DM/LVM2 device delivers heavy spatial
fragmentation; Btrfs ends up with a bit of both. Another commenter
has mentioned the use of 'nodatacow' to prevent RMW resulting in
huge write-amplification.

> to use for database recovery scenarios allow instant rollback
> to the previous state.

You may be more interested in NILFS2 for that, but there are
significant tradeoffs there too, and NILFS2 requires a free
space compactor too, plus since NILFS2 gives up on short-term
spatial coherence, the compactor also needs to compact data
space.


Re: BTRFS for OLTP Databases

2017-02-07 Thread Austin S. Hemmelgarn

On 2017-02-07 08:53, Peter Zaitsev wrote:

Hi,

I have tried BTRFS from Ubuntu 16.04 LTS   for write intensive OLTP MySQL
Workload.

It did not go very well ranging from multi-seconds stalls where no
transactions are completed to the finally kernel OOPS with "no space left
on device" error message and filesystem going read only.
How much spare space did you have allocated in the filesystem?  At a 
minimum, you want at least a few GB beyond what you expect to be the 
maximum size of your data-set times the number of snapshots you plan to 
keep around at any given time.
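
As a rough illustration of that rule of thumb (worst case, i.e. nearly
every block gets rewritten between snapshots):

#   data set size          : 50 GiB (example)
#   snapshots kept at once : 3
#   space to plan for      : 3 * 50 GiB + a few GiB of headroom ~= 160 GiB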


I'm complete newbie in BTRFS so  I assume  I'm doing something wrong.
Not exactly wrong, but getting this to work efficiently is more art than 
engineering.


Do you have any advice on how BTRFS should be tuned for OLTP workload
(large files having a lot of random writes)  ?Or is this the case where
one should simply stay away from BTRFS and use something else ?
The general recommendation is usually to avoid BTRFS for such things. 
There are however a number of things you can do to improve performance:
1. Use a backing storage format that has the minimal amount of 
complexity.  The more data structures that get updated when a record 
changes, the worse the performance will be.  I don't have enough 
experience with MySQL to give a specific recommendation on what backing 
storage format to use, but someone else might.
2. Avoid large numbers of small transactions.  The smaller the 
transaction, the worse it will fragment things.
3. Use autodefrag.  This will increase write load on the storage device, 
but it should improve performance for reads.
4. Try using in-line compression.  This can actually significantly 
improve performance, especially if you have slow storage devices and a 
really nice CPU.
5. If you're running raid10 mode for BTRFS, run raid1 on top of two LVM 
or MD RAID0 devices instead.  This sounds stupid, but it actually will 
hugely improve both read and write performance without sacrificing any 
data safety (see the sketch of points 3 to 5 after this list).
6. Look at I/O scheduler tuning.  This can have a huge impact, 
especially considering that most of the defaults for the various 
schedulers are somewhat poor for most modern systems.  I won't go into 
the details here, since there are a huge number of online resources 
about this.
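
A compact sketch of points 3 to 5; device names, mount point and the
compression algorithm are just examples, adjust to taste:

$ sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc   # point 5: two RAID0 legs...
$ sudo mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdd /dev/sde
$ sudo mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1                         # ...with BTRFS raid1 on top
$ sudo mount -o autodefrag,compress=lzo /dev/md0 /var/lib/mysql               # points 3 and 4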


One item recommended in some places is "nodatacow"  this however defeats
the main purpose I'm looking at BTRFS -  I am interested in "free"
snapshots which look very attractive to use for database recovery scenarios
allow instant rollback to the previous state.
Snapshots aren't free.  They are quick, but they aren't free by any 
means.  If you're going to be using snapshots, keep them to a minimum: 
performance scales inversely with the number of snapshots, 
and this has a much bigger impact the more you're trying to do on the 
filesystem.  Also, consider whether or not you _actually_ need 
filesystem level snapshots.  I don't know about your full software 
stack, but most good OLTP software supports rollback segments (or an 
equivalent with a different name), and those are probably what you want 
to use, not filesystem snapshots.
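
If you do go the snapshot route for hourly recovery points, a minimal
rotation driven from cron could look like this (paths, naming and
retention are made up for illustration):

$ sudo btrfs subvolume snapshot -r /data /snapshots/data-$(date +%Y%m%d-%H)   # new read-only hourly snapshot
$ sudo btrfs subvolume delete /snapshots/data-20170207-05                     # drop whichever one is oldest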



Re: BTRFS for OLTP Databases

2017-02-07 Thread Peter Zaitsev
Hi Hugo,

For the use case I'm looking at I'm interested in having snapshot(s)
open at all times.  Imagine, for example, a snapshot being created every
hour and several of these snapshots kept at all times, providing quick
recovery points to the state of 1, 2, 3 hours ago.  In such a case (as I
think you also describe) nodatacow does not provide any advantage.

I have not seen autodefrag helping much, but I will try again.  Is
there any autodefrag documentation available about how it is expected
to work and whether it can be tuned in any way?

I noticed that remounting an already fragmented filesystem with autodefrag
and putting a workload on it which does more fragmentation does not seem to
improve things over time.



>Well, nodatacow will still allow snapshots to work, but it also
> allows the data to fragment. Each snapshot made will cause subsequent
> writes to shared areas to be CoWed once (and then it reverts to
> unshared and nodatacow again).
>
>There's another approach which might be worth testing, which is to
> use autodefrag. This will increase data write I/O, because where you
> have one or more small writes in a region, it will also read and write
> the data in a small neighbourhood around those writes, so the
> fragmentation is reduced. This will improve subsequent read
> performance.
>
>I could also suggest getting the latest kernel you can -- 16.04 is
> already getting on for a year old, and there may be performance
> improvements in upstream kernels which affect your workload. There's
> an Ubuntu kernel PPA you can use to get the new kernels without too
> much pain.







-- 
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev


Re: BTRFS for OLTP Databases

2017-02-07 Thread Hugo Mills
On Tue, Feb 07, 2017 at 08:53:35AM -0500, Peter Zaitsev wrote:
> Hi,
> 
> I have tried BTRFS from Ubuntu 16.04 LTS   for write intensive OLTP MySQL
> Workload.
> 
> It did not go very well ranging from multi-seconds stalls where no
> transactions are completed to the finally kernel OOPS with "no space left
> on device" error message and filesystem going read only.
> 
> I'm complete newbie in BTRFS so  I assume  I'm doing something wrong.
> 
> Do you have any advice on how BTRFS should be tuned for OLTP workload
> (large files having a lot of random writes)  ?Or is this the case where
> one should simply stay away from BTRFS and use something else ?
> 
> One item recommended in some places is "nodatacow"  this however defeats
> the main purpose I'm looking at BTRFS -  I am interested in "free"
> snapshots which look very attractive to use for database recovery scenarios
> allow instant rollback to the previous state.

   Well, nodatacow will still allow snapshots to work, but it also
allows the data to fragment. Each snapshot made will cause subsequent
writes to shared areas to be CoWed once (and then it reverts to
unshared and nodatacow again).
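
For reference, nodatacow doesn't have to be filesystem-wide; it can be
applied per directory, and it only takes effect for files created after
the attribute is set (a sketch, the path is just an example):

$ sudo chattr +C /var/lib/mysql     # new files under here are created nodatacow
$ lsattr -d /var/lib/mysql          # verify the 'C' (no copy-on-write) attribute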

   There's another approach which might be worth testing, which is to
use autodefrag. This will increase data write I/O, because where you
have one or more small writes in a region, it will also read and write
the data in a small neighbourhood around those writes, so the
fragmentation is reduced. This will improve subsequent read
performance.

   I could also suggest getting the latest kernel you can -- 16.04 is
already getting on for a year old, and there may be performance
improvements in upstream kernels which affect your workload. There's
an Ubuntu kernel PPA you can use to get the new kernels without too
much pain.

   Hugo.

-- 
Hugo Mills | I don't care about "it works on my machine". We are
hugo@... carfax.org.uk | not shipping your machine.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




BTRFS for OLTP Databases

2017-02-07 Thread Peter Zaitsev
Hi,

I have tried BTRFS from Ubuntu 16.04 LTS for a write-intensive OLTP MySQL
workload.

It did not go very well, ranging from multi-second stalls where no
transactions are completed to, finally, a kernel OOPS with a "no space left
on device" error message and the filesystem going read-only.

I'm a complete newbie in BTRFS, so I assume I'm doing something wrong.

Do you have any advice on how BTRFS should be tuned for an OLTP workload
(large files with a lot of random writes)?  Or is this a case where
one should simply stay away from BTRFS and use something else?

One item recommended in some places is "nodatacow"; this however defeats
the main purpose of my looking at BTRFS: I am interested in "free"
snapshots, which look very attractive to use for database recovery scenarios,
allowing instant rollback to a previous state.

-- 
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev


Re: [PATCH] Btrfs: add another missing end_page_writeback on submit_extent_page failure

2017-02-07 Thread takafumi-sslab


On 2017/02/07 1:34, Liu Bo wrote:



One thing to add: we still need to check whether the page has the writeback
bit set before calling end_page_writeback().


OK, I added a PageWriteback() check before end_page_writeback().


Looks like commit 55e3bd2e0c2e1 also has the same problem although I
gave it my reviewed-by.


I also added a PageWriteback() check in write_one_eb().

Finally, the diff becomes like the one below.
Is it OK?

---

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4ac383a..aa1908a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3445,8 +3445,11 @@ static noinline_for_stack int 
__extent_writepage_io(struct inode *inode,

 bdev, &epd->bio, max_nr,
 end_bio_extent_writepage,
 0, 0, 0, false);
-   if (ret)
+   if (ret) {
SetPageError(page);
+   if (PageWriteback(page))
+   end_page_writeback(page);
+   }

cur = cur + iosize;
pg_offset += iosize;
@@ -3767,7 +3770,8 @@ static noinline_for_stack int write_one_eb(struct 
extent_buffer *eb,

epd->bio_flags = bio_flags;
if (ret) {
set_btree_ioerr(p);
-   end_page_writeback(p);
+   if (PageWriteback(p))
+   end_page_writeback(p);
if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
end_extent_buffer_writeback(eb);
ret = -EIO;

---


Sincerely,

-takafumi




Thanks,

-liubo



Reviewed-by: Liu Bo 

Thanks,

-liubo


Sincerely,

-takafumi


So I don't think the patch is necessary for now.

But as I said, the fact (nr == 0 or 1) would be changed if the
subpagesize blocksize is supported.

Thanks,

-liubo


Sincerely,

-takafumi

Thanks,

-liubo

Sincerely,

On 2017/01/31 5:09, Liu Bo wrote:

On Fri, Jan 13, 2017 at 03:12:31PM +0900, takafumi-sslab wrote:

Thanks for your replying.

I understand this bug is more complicated than I expected.
I classify error cases under submit_extent_page() below

A: ENOMEM error at btrfs_bio_alloc() in submit_extent_page()
I first assumed this case and sent the mail.
When bio_ret is NULL, submit_extent_page() calls btrfs_bio_alloc().
Then, btrfs_bio_alloc() may fail and submit_extent_page() returns -ENOMEM.
In this case, bio_endio() is not called and the page's writeback bit
remains.
So, there is a need to call end_page_writeback() in the error handling.

B: errors under submit_one_bio() of submit_extent_page()
Errors that occur under submit_one_bio() are handled at bio_endio(), and
bio_endio() would call end_page_writeback().

Therefore, as you mentioned in the last mail, simply adding
end_page_writeback() like my last email and commit 55e3bd2e0c2e1 can
conflict in the case of B.
To avoid such a conflict, one easy solution is adding a PageWriteback() check
too.

What do you think of this solution?

(sorry for the late reply.)

I think its caller, "__extent_writepage", has covered the above case
by setting page writeback again.

Thanks,

-liubo

Sincerely,

On 2016/12/22 15:20, Liu Bo wrote:

On Fri, Dec 16, 2016 at 03:41:50PM +0900, Takafumi Kubota wrote:

This is actually inspired by Filipe's patch(55e3bd2e0c2e1).

When submit_extent_page() in __extent_writepage_io() fails,
Btrfs misses clearing the writeback bit of the failed page.
This leaves the page falsely marked as under writeback.
Then, another sync task hangs in filemap_fdatawait_range(),
because it waits on the falsely under-writeback page.

CPU0CPU1

__extent_writepage_io()
  ret = submit_extent_page() // fail

  if (ret)
SetPageError(page)
// miss clearing the writeback bit

sync()
  ...
  filemap_fdatawait_range()
wait_on_page_writeback(page);
// wait the false under-writeback page

Signed-off-by: Takafumi Kubota 
---
 fs/btrfs/extent_io.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1e67723..ef9793b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3443,8 +3443,10 @@ static noinline_for_stack int 
__extent_writepage_io(struct inode *inode,
 bdev, &epd->bio, max_nr,
 end_bio_extent_writepage,
 0, 0, 0, false);
-   if (ret)
+   if (ret) {
SetPageError(page);
+   end_page_writeback(page);
+   }

OK...this could be complex as we don't know which part in

Re: btrfs/125 deadlock using nospace_cache or space_cache=v2

2017-02-07 Thread Qu Wenruo



At 02/07/2017 04:02 PM, Anand Jain wrote:


Hi Qu,

 I don't think I have seen this before. I don't know the reason
 why I wrote this, maybe to test encryption; however, it was all
 with default options.


Forgot to mention: thanks for the test case.
Or we would never have found it.

Thanks,
Qu


 But now I could reproduce it, and it looks like balance fails to
 start with an IO error, though the mount is successful.
--
# tail -f ./results/btrfs/125.full
intense and takes potentially very long. It is recommended to
use the balance filters to narrow down the balanced data.
Use 'btrfs balance start --full-balance' option to skip this
warning. The operation will start in 10 seconds.
Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1ERROR: error during balancing '/scratch':
Input/output error
There may be more info in syslog - try dmesg | tail

Starting balance without any filters.
failed: '/root/bin/btrfs balance start /scratch'


 This must be fixed. For debugging: if I add a sync before the previous
 unmount, the problem isn't reproduced. Just FYI. Strange.

---
diff --git a/tests/btrfs/125 b/tests/btrfs/125
index 91aa8d8c3f4d..4d4316ca9f6e 100755
--- a/tests/btrfs/125
+++ b/tests/btrfs/125
@@ -133,6 +133,7 @@ echo "-Mount normal-" >> $seqres.full
 echo
 echo "Mount normal and balance"

+_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
 _scratch_unmount
 _run_btrfs_util_prog device scan
 _scratch_mount >> $seqres.full 2>&1
--

 HTH.

Thanks, Anand


On 02/07/17 14:09, Qu Wenruo wrote:

Hi Anand,

I found that btrfs/125 test case can only pass if we enabled space cache.

If using nospace_cache or space_cache=v2 mount option, it will get
blocked forever with the following callstack(the only blocked process):

[11382.046978] btrfs   D11128  6705   6057 0x
[11382.047356] Call Trace:
[11382.047668]  __schedule+0x2d4/0xae0
[11382.047956]  schedule+0x3d/0x90
[11382.048283]  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
[11382.048630]  ? wake_atomic_t_function+0x60/0x60
[11382.048958]  btrfs_wait_ordered_range+0x113/0x210 [btrfs]
[11382.049360]  btrfs_relocate_block_group+0x260/0x2b0 [btrfs]
[11382.049703]  btrfs_relocate_chunk+0x51/0xf0 [btrfs]
[11382.050073]  btrfs_balance+0xaa9/0x1610 [btrfs]
[11382.050404]  ? btrfs_ioctl_balance+0x3a0/0x3b0 [btrfs]
[11382.050739]  btrfs_ioctl_balance+0x3a0/0x3b0 [btrfs]
[11382.051109]  btrfs_ioctl+0xbe7/0x27f0 [btrfs]
[11382.051430]  ? trace_hardirqs_on+0xd/0x10
[11382.051747]  ? free_object+0x74/0xa0
[11382.052084]  ? debug_object_free+0xf2/0x130
[11382.052413]  do_vfs_ioctl+0x94/0x710
[11382.052750]  ? enqueue_hrtimer+0x160/0x160
[11382.053090]  ? do_nanosleep+0x71/0x130
[11382.053431]  SyS_ioctl+0x79/0x90
[11382.053735]  entry_SYSCALL_64_fastpath+0x18/0xad
[11382.054570] RIP: 0033:0x7f397d7a6787

I also found that in the test case we only have 3 contiguous data extents,
whose sizes are 1M, 68.5M and 31.5M respectively.

Original data block group:
0   1M   64M   69.5M   101M   128M
| Ext A | Extent B (68.5M) | Extent C (31.5M) |


While relocation writes them as 4 extents:
0~1M:same as Extent A. (1st)
1M~68.3438M :smaller than Extent B (2nd)
68.3438M~69.5M  :tail part of Extent B (3rd)
69.5M~ 101M :same as Extent C. (4th)

However, only the ordered extents of (3rd) and (4th) get finished,
while the ordered extents of (1st) and (2nd) never reach
finish_ordered_io().

So relocation waits for these two ordered extents, which no one will ever
finish, and gets blocked.

Did you experience the same bug when submitting the test case?
Is there any known fix for it?

Thanks,
Qu










Re: btrfs/125 deadlock using nospace_cache or space_cache=v2

2017-02-07 Thread Qu Wenruo

Hi Anand,

At 02/07/2017 04:02 PM, Anand Jain wrote:


Hi Qu,

 I don't think I have seen this before. I don't know the reason
 why I wrote this, maybe to test encryption; however, it was all
 with default options.

 But now I could reproduce it, and it looks like balance fails to
 start with an IO error, though the mount is successful.
--
# tail -f ./results/btrfs/125.full
intense and takes potentially very long. It is recommended to
use the balance filters to narrow down the balanced data.
Use 'btrfs balance start --full-balance' option to skip this
warning. The operation will start in 10 seconds.
Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1ERROR: error during balancing '/scratch':
Input/output error
There may be more info in syslog - try dmesg | tail

Starting balance without any filters.
failed: '/root/bin/btrfs balance start /scratch'


 This must be fixed. For debugging: if I add a sync before the previous
 unmount, the problem isn't reproduced. Just FYI. Strange.


Thanks for the extra info, this seems to be a clue to dig further.

Thanks,
Qu


---
diff --git a/tests/btrfs/125 b/tests/btrfs/125
index 91aa8d8c3f4d..4d4316ca9f6e 100755
--- a/tests/btrfs/125
+++ b/tests/btrfs/125
@@ -133,6 +133,7 @@ echo "-Mount normal-" >> $seqres.full
 echo
 echo "Mount normal and balance"

+_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
 _scratch_unmount
 _run_btrfs_util_prog device scan
 _scratch_mount >> $seqres.full 2>&1
--

 HTH.

Thanks, Anand


On 02/07/17 14:09, Qu Wenruo wrote:

Hi Anand,

I found that btrfs/125 test case can only pass if we enabled space cache.

If using nospace_cache or space_cache=v2 mount option, it will get
blocked forever with the following callstack(the only blocked process):

[11382.046978] btrfs   D11128  6705   6057 0x
[11382.047356] Call Trace:
[11382.047668]  __schedule+0x2d4/0xae0
[11382.047956]  schedule+0x3d/0x90
[11382.048283]  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
[11382.048630]  ? wake_atomic_t_function+0x60/0x60
[11382.048958]  btrfs_wait_ordered_range+0x113/0x210 [btrfs]
[11382.049360]  btrfs_relocate_block_group+0x260/0x2b0 [btrfs]
[11382.049703]  btrfs_relocate_chunk+0x51/0xf0 [btrfs]
[11382.050073]  btrfs_balance+0xaa9/0x1610 [btrfs]
[11382.050404]  ? btrfs_ioctl_balance+0x3a0/0x3b0 [btrfs]
[11382.050739]  btrfs_ioctl_balance+0x3a0/0x3b0 [btrfs]
[11382.051109]  btrfs_ioctl+0xbe7/0x27f0 [btrfs]
[11382.051430]  ? trace_hardirqs_on+0xd/0x10
[11382.051747]  ? free_object+0x74/0xa0
[11382.052084]  ? debug_object_free+0xf2/0x130
[11382.052413]  do_vfs_ioctl+0x94/0x710
[11382.052750]  ? enqueue_hrtimer+0x160/0x160
[11382.053090]  ? do_nanosleep+0x71/0x130
[11382.053431]  SyS_ioctl+0x79/0x90
[11382.053735]  entry_SYSCALL_64_fastpath+0x18/0xad
[11382.054570] RIP: 0033:0x7f397d7a6787

I also found that in the test case we only have 3 contiguous data extents,
whose sizes are 1M, 68.5M and 31.5M respectively.

Original data block group:
0   1M   64M   69.5M   101M   128M
| Ext A | Extent B (68.5M) | Extent C (31.5M) |


While relocation writes them as 4 extents:
0~1M:same as Extent A. (1st)
1M~68.3438M :smaller than Extent B (2nd)
68.3438M~69.5M  :tail part of Extent B (3rd)
69.5M~ 101M :same as Extent C. (4th)

However, only the ordered extents of (3rd) and (4th) get finished,
while the ordered extents of (1st) and (2nd) never reach
finish_ordered_io().

So relocation waits for these two ordered extents, which no one will ever
finish, and gets blocked.

Did you experience the same bug when submitting the test case?
Is there any known fix for it?

Thanks,
Qu







