Re: [PATCH 1/3] btrfs-progs: convert: properly handle reserved ranges while iterating files

2017-07-26 Thread Qu Wenruo



On 2017-07-26 04:54, je...@suse.com wrote:

From: Jeff Mahoney 

Commit 522ef705e38 (btrfs-progs: convert: Introduce function to calculate
the available space) changed how we handle migrating file data so that
we never have btrfs space associated with the reserved ranges.  This
works pretty well and when we iterate over the file blocks, the
associations are redirected to the migrated locations.

This commit missed the case in block_iterate_proc where we just check
for intersection with a superblock location before looking up a block
group.  intersect_with_sb checks to see if the range intersects with
a stripe containing a superblock but, in fact, we've reserved the
full 0-1MB range at the start of the disk.  So a file block located
at e.g. 160kB will fall in the reserved region but won't be excepted
in block_iterate_block.  We ultimately hit a BUG_ON when we fail
to look up the block group for that location.


The description of the problem  is indeed correct.



This is reproducible using convert-tests/003-ext4-basic.


Thanks for pointing this out, I also reproduced it.

It would be even nicer if you could upload a specially crafted image as 
a dedicated test case.
IIRC the test passed without problem several versions ago, so there may 
be some factor that prevented the bug from being exposed.




The fix is to have intersect_with_sb and block_iterate_proc understand
the full size of the reserved ranges.  Since we use the range to
determine the boundary for the block iterator, let's just return the
boundary.  0 isn't a valid boundary and means that we proceed normally
with block group lookup.


I'm OK with the current fix, as it indeed fixes the bug and has minimal 
impact on the current code.


So feel free to add:
Reviewed-by: Qu Wenruo 

That said, I think there is a better way to solve this more completely.

By the time we run into block_iterate_proc(), we have already created 
ext2_save/image.
So we can use the image as the ext2 <-> btrfs position mapping, just as 
we already do in record_file_blocks().


That is to say, we don't need to care much about the intersection with 
the reserved ranges; just letting record_file_blocks() handle it should 
be good enough.


What do you think about this idea?

Thanks,
Qu



Cc: Qu Wenruo 
Signed-off-by: Jeff Mahoney 
---
  convert/source-fs.c | 25 +++++++++++--------------
  1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/convert/source-fs.c b/convert/source-fs.c
index 80e4e41..09f6995 100644
--- a/convert/source-fs.c
+++ b/convert/source-fs.c
@@ -28,18 +28,16 @@ const struct simple_range btrfs_reserved_ranges[3] = {
 	{ BTRFS_SB_MIRROR_OFFSET(2), SZ_64K }
 };
 
-static int intersect_with_sb(u64 bytenr, u64 num_bytes)
+static u64 intersect_with_reserved(u64 bytenr, u64 num_bytes)
 {
 	int i;
-	u64 offset;
 
-	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
-		offset = btrfs_sb_offset(i);
-		offset &= ~((u64)BTRFS_STRIPE_LEN - 1);
+	for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
+		const struct simple_range *range = &btrfs_reserved_ranges[i];
 
-		if (bytenr < offset + BTRFS_STRIPE_LEN &&
-		    bytenr + num_bytes > offset)
-			return 1;
+		if (bytenr < range_end(range) &&
+		    bytenr + num_bytes >= range->start)
+			return range_end(range);
 	}
 	return 0;
 }
@@ -64,14 +62,14 @@ int block_iterate_proc(u64 disk_block, u64 file_block,
 		       struct blk_iterate_data *idata)
 {
 	int ret = 0;
-	int sb_region;
+	u64 reserved_boundary;
 	int do_barrier;
 	struct btrfs_root *root = idata->root;
 	struct btrfs_block_group_cache *cache;
 	u64 bytenr = disk_block * root->sectorsize;
 
-	sb_region = intersect_with_sb(bytenr, root->sectorsize);
-	do_barrier = sb_region || disk_block >= idata->boundary;
+	reserved_boundary = intersect_with_reserved(bytenr, root->sectorsize);
+	do_barrier = reserved_boundary || disk_block >= idata->boundary;
 	if ((idata->num_blocks > 0 && do_barrier) ||
 	    (file_block > idata->first_block + idata->num_blocks) ||
 	    (disk_block != idata->disk_block + idata->num_blocks)) {
@@ -91,9 +89,8 @@ int block_iterate_proc(u64 disk_block, u64 file_block,
 			goto fail;
 		}
 
-		if (sb_region) {
-			bytenr += BTRFS_STRIPE_LEN - 1;
-			bytenr &= ~((u64)BTRFS_STRIPE_LEN - 1);
+		if (reserved_boundary) {
+			bytenr = reserved_boundary;
 		} else {
 			cache = btrfs_lookup_block_group(root->fs_info, bytenr);
 			BUG_ON(!cache);


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: btrfs raid assurance

2017-07-26 Thread Hugo Mills
On Wed, Jul 26, 2017 at 08:36:54AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-07-26 08:27, Hugo Mills wrote:
> >On Wed, Jul 26, 2017 at 08:12:19AM -0400, Austin S. Hemmelgarn wrote:
> >>On 2017-07-25 17:45, Hugo Mills wrote:
> >>>On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:
> 
> 
> Hugo Mills wrote:
> >
> >>>You can see about the disk usage in different scenarios with the
> >>>online tool at:
> >>>
> >>>http://carfax.org.uk/btrfs-usage/
> >>>
> >>>Hugo.
> >>>
> As a side note, have you ever considered making this online tool
> (which should never go away, just for the record) part of btrfs-progs,
> e.g. a proper tool? I use it quite often (at least several times
> per month) and I would love for this to be a visual tool.
> 'btrfs-space-calculator' would be a great name for it, I think.
> 
> Imagine how nice it would be to run
> 
> btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
> /dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
> something similar to my example below (no accuracy intended)
> >>>
> >>>It's certainly a thought. I've already got the algorithm written
> >>>up. I'd have to resurrect my C skills, though, and it's a long way
> >>>down my list of things to do. :/
> >>>
> >>>Also on the subject of this tool, I'd like to make it so that the
> >>>parameters get set in the URL, so that people can copy-paste the URL
> >>>of the settings they've got into IRC for discussion. However, that
> >>>would involve doing more JavaScript, which is possibly even lower down
> >>>my list of things to do than starting doing C again...
> >
> >>Is the core logic posted somewhere?  Because if I have some time, I
> >>might write up a quick Python script to do this locally (it may not
> >>be as tightly integrated with the regular tools, but I can count on
> >>half a hand how many distros don't include Python by default).
> >
> >If it's going to be done in python, I might as well do it myself --
> >I can do python with my eyes closed. It's just C and JS I'm rusty with.
> Same here ironically :)
> >
> >There is a write-up of the usable-space algorithm somewhere. I
> >wrote it up in detail (with pseudocode) in a mail on this list. I've
> >also got several pages of LaTeX somewhere where I tried and failed to
> >prove the correctness of the formula. I'll see if I can dig them out
> >this evening.
> It looks like the Message-ID for the one on the mailing list is
> <20160311221703.gj17...@carfax.org.uk>
> I had forgotten that I'd archived that with the intent of actually
> doing something with it eventually...

   Here's the write-up of my attempted proof of the optimality of the
current allocator algorithm:

http://carfax.org.uk/files/temp/btrfs-allocator-draft.pdf

   Section 1 is a general (allocator-agnostic) description of the
process. Section 2 finds a bound on how well _any_ allocator can
do. That's the formula (eq 9) used in the online btrfs-usage
tool. Section 3 describes the current allocator. Section 4 is a failed
attempt at proving that the algorithm achieves the bound from section
2. I wasn't able to complete the proof.

   Hugo.

-- 
Hugo Mills | Great films about cricket: Interview with the Umpire
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |





[PATCH] btrfs: Remove extra parentheses from condition in copy_items()

2017-07-26 Thread Matthias Kaehlcke
There is no need for the extra pair of parentheses, remove it. This
fixes the following warning when building with clang:

fs/btrfs/tree-log.c:3694:10: warning: equality comparison with extraneous
  parentheses [-Wparentheses-equality]
if ((i == (nr - 1)))
 ~~^~~

Signed-off-by: Matthias Kaehlcke 
---
 fs/btrfs/tree-log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index f20ef211a73d..b92408a3f834 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3691,7 +3691,7 @@ static noinline int copy_items(struct btrfs_trans_handle 
*trans,
 
src_offset = btrfs_item_ptr_offset(src, start_slot + i);
 
-   if ((i == (nr - 1)))
+   if (i == (nr - 1))
last_key = ins_keys[i];
 
if (ins_keys[i].type == BTRFS_INODE_ITEM_KEY) {
-- 
2.14.0.rc0.400.g1c36432dff-goog



Re: [PATCH v2] Btrfs: Do not use data_alloc_cluster in ssd mode

2017-07-26 Thread Hans van Kranenburg
Ah, great: while doing the last git format-patch, the changes-since-v1
notes I had written earlier were lost again:

Changes since v1:
* Keep ssd_spread behaviour unchanged
* Add summary at the beginning of the commit message

Thanks,


[PATCH v2] Btrfs: Do not use data_alloc_cluster in ssd mode

2017-07-26 Thread Hans van Kranenburg
The purpose of this patch is providing a band aid to improve the
'out of the box' behaviour of btrfs for disks that are detected as being
an ssd.  In a general purpose mixed workload scenario, the current ssd
mode causes overallocation of available raw disk space for data, while
leaving behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.

This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode.  The metadata behaviour and
additional ssd_spread option stay untouched so far.

Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings.  The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.

The SSD story...

In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurrence of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".

The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).

A somewhat simplified explanation of what happens is that, when for
example, the 'cluster' size is set to 2MiB, when we do some writes, the
data allocator will search for a free space block that is 2MiB big, and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option kills off this
option and requires fully free space.

The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.

Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.

However, there's a number of problems with this approach.

First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.

Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.

Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free.  There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data.  The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space 

Re: [PATCH] btrfs: Make flush_space return void

2017-07-26 Thread David Sterba
On Tue, Jul 25, 2017 at 05:48:28PM +0300, Nikolay Borisov wrote:
> The return value of flush_space was used to have significance in the early 
> days
> when the code was first introduced and before the ticketed enospc rework. 
> Since
> the latter got introduced the return value lost any significance whatsoever to
> its callers. So let's remove it. While at it also remove the unused ticket
> variable in btrfs_async_reclaim_metadata_space. It was used in the initial
> version of the ticketed ENOSPC work, however Wang Xiaoguang detected a problem
> with this and fixed it in ce129655c9d9 ("btrfs: introduce tickets_id to
> determine whether asynchronous metadata reclaim work makes progress").
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 

I've added a comment to the function, as it's not obvious why all the
error conditions inside get ignored in the end.


Re: [PATCH v2] btrfs: Deprecate userspace transaction ioctls

2017-07-26 Thread David Sterba
On Wed, Jul 26, 2017 at 11:26:28AM +0300, Nikolay Borisov wrote:
> Userspace transactions were introduced in commit
> 6bf13c0cc833 ("Btrfs: transaction ioctls") to provide semantics that Ceph's
> object store required. However, things have changed significantly since then,
> to the point where btrfs is no longer suitable as a backend for ceph and in 
> fact
> it's actively advised against such usages. Considering this, there doesn't
> seem to be a widespread, legit use case of userspace transaction. They also
> clutter the file->private pointer.
> 
> So to end the agony let's nuke the userspace transaction ioctls. As a first
> step let's give time for people to voice their objection by just WARN()ining
> when the userspace transaction is used.
> 
> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/ioctl.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index fa1b78cf25f6..10d78d71df96 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4000,6 +4000,12 @@ static long btrfs_ioctl_trans_start(struct file *file)
>   struct btrfs_trans_handle *trans;
>   int ret;
>  
> + btrfs_warn(fs_info, "Userspace transaction mechanism is considered
> +deprecated and slated to be removed in the near future "
> +"(1 or 2 releases). If you have a valid use case please
> +speak up on the mailing list");
> + WARN_ON_ONCE(1);
> +
>   ret = -EPERM;
>   if (!capable(CAP_SYS_ADMIN))
>   goto out;

The warning must go after the permission checks and really warn just once,
so we need to keep the local state anyway.

--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3967,17 +3967,22 @@ static long btrfs_ioctl_trans_start(struct file *file)
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_trans_handle *trans;
int ret;
-
-   btrfs_warn(fs_info, "Userspace transaction mechanism is considered
-  deprecated and slated to be removed in the near future "
-  "(1 or 2 releases). If you have a valid use case please
-  speak up on the mailing list");
-   WARN_ON_ONCE(1);
+   static bool warned = false;

ret = -EPERM;
if (!capable(CAP_SYS_ADMIN))
goto out;

+   if (!warned) {
+   btrfs_warn(fs_info,
+   "Userspace transaction mechanism is considered "
+   "deprecated and slated to be removed in 4.17. "
+   "If you have a valid use case please "
+   "speak up on the mailing list");
+   WARN_ON(1);
+   warned = true;
+   }
+
ret = -EINPROGRESS;
if (file->private_data)
goto out;


Re: write corruption due to bio cloning on raid5/6

2017-07-26 Thread Liu Bo
On Mon, Jul 24, 2017 at 10:22:53PM +0200, Janos Toth F. wrote:
> I accidentally ran into this problem (it's pretty silly because I
> almost never run RC kernels or do dio writes but somehow I just
> happened to do both at once, exactly before I read your patch notes).
> I didn't initially catch any issues (I see no related messages in the
> kernel log) but after seeing your patch, I started a scrub (*) and it
> hung.
> 
> Is there a way to fix a filesystem corrupted by this bug or does it
> need to be destroyed and recreated? (It's m=s=raid10, d=raid5 with
> 5x4Tb HDDs.) There is a partial backup (of everything really
> important, the rest is not important enough to be kept in multiple
> copies, hence the desire for raid5...) and everything seems to be
> readable anyway (so could be saved if needed) but nuking a big fs is
> never fun...

It should only affect the dio-written files: the mentioned bug makes
btrfs write garbage into those files, so checksums fail when reading
them, but nothing else results from this bug.

As you use m=s=raid10, the filesystem metadata is OK, so I think the
scrub hang could be a separate problem.


> 
> Scrub just hangs and pretty much makes the whole system hanging (it
> needs a power cycling for a reboot). Although everything runs smooth
> besides this. Btrfs check (read-only normal-mem mode) finds no errors,
> the kernel log is clean, etc.
> 
> I think I deleted all the affected dio-written test-files even before
> I started scrubbing, so that doesn't seem to do the trick. Any other
> ideas?
>

A hang can normally be caught by sysrq-w; could you please try it
and see if there is a difference in the kernel log?

Thanks,

-liubo
> 
> * By the way, I see raid56 scrub is still painfully slow (~30Mb/s /
> disk with raw disk speeds of >100 Mb/s). I forgot about this issue
> since I last used raid5 a few years ago.


Re: [PATCH v2] btrfs: Deprecate userspace transaction ioctls

2017-07-26 Thread David Sterba
On Wed, Jul 26, 2017 at 11:26:28AM +0300, Nikolay Borisov wrote:
> Userspace transactions were introduced in commit
> 6bf13c0cc833 ("Btrfs: transaction ioctls") to provide semantics that Ceph's
> object store required. However, things have changed significantly since then,
> to the point where btrfs is no longer suitable as a backend for ceph and in 
> fact
> it's actively advised against such usages. Considering this, there doesn't
> seem to be a widespread, legit use case of userspace transaction. They also
> clutter the file->private pointer.
> 
> So to end the agony let's nuke the userspace transaction ioctls. As a first
> step let's give time for people to voice their objection by just WARN()ining
> when the userspace transaction is used.
> 
> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/ioctl.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index fa1b78cf25f6..10d78d71df96 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4000,6 +4000,12 @@ static long btrfs_ioctl_trans_start(struct file *file)
>   struct btrfs_trans_handle *trans;
>   int ret;
>  
> + btrfs_warn(fs_info, "Userspace transaction mechanism is considered
> +deprecated and slated to be removed in the near future "

The compiler complains about the string:

fs/btrfs/ioctl.c:5635:0: error: unterminated argument list invoking macro 
"btrfs_warn"

As it is trivial, I'll fix it up.


> +"(1 or 2 releases). If you have a valid use case please

So this patch will get released in 4.14; the last release that will have
the ioctl is 4.16. I'll make that explicit in the message.

> +speak up on the mailing list");
> + WARN_ON_ONCE(1);
> +
>   ret = -EPERM;
>   if (!capable(CAP_SYS_ADMIN))
>   goto out;
> -- 
> 2.7.4


Re: [PATCH] Btrfs: Do not use data_alloc_cluster in ssd mode

2017-07-26 Thread David Sterba
On Mon, Jul 24, 2017 at 02:53:52PM -0400, Chris Mason wrote:
> On 07/24/2017 02:41 PM, David Sterba wrote:
> > On Mon, Jul 24, 2017 at 02:01:07PM -0400, Chris Mason wrote:
> >> On 07/24/2017 10:25 AM, David Sterba wrote:
> >>
> >>> Thanks for the extensive historical summary, this change really deserves
> >>> it.
> >>>
> >>> Decoupling the assumptions about the device's block management is really
> >>> a good thing, mount option 'ssd' should mean that the device just has
> >>> cheap seeks. Moving the the allocation tweaks to ssd_spread provides a
> >>> way to keep the behaviour for anybody who wants it.
> >>>
> >>> I'd like to push this change to 4.13-rc3, as I don't think we need more
> >>> time to let other users to test this. The effects of current ssd
> >>> implementation have been debated and debugged on IRC for a long time.
> >>
> >> The description is great, but I'd love to see many more benchmarks.  At
> >> Facebook we use the current ssd_spread mode in production on top of
> >> hardware raid5/6 (spinning storage) because it dramatically reduces the
> >> read/modify/write cycles done for metadata writes.
> > 
> > Well, I think this is an example that ssd got misused because of the
> > side effects of the allocation. If you observe good patterns for raid5,
> > then the allocator should be adapted for that case, otherwise
> > ssd/ssd_spread should be independent of the raid level.
> 
> Absolutely.  The optimizations that made ssd_spread useful for first 
> generation flash are the same things that raid5/6 need.  Big writes, or 
> said differently a minimum size for fast writes.

Actually, you can do the alignments if the block group is raid56
automatically, and don't rely on ssd_spread. This should be equivalent
in function, but a bit cleaner from the code and interface side.

> >> If we're going to play around with these, we need a good way to measure
> >> free space fragmentation as part of benchmarks, as well as the IO
> >> patterns coming out of the allocator.
> > 
> > Hans has a tool that visualizes the fragmentation. Most complaints I've
> > seen were about 'ssd' itself, excessive fragmentation, early ENOSPC. Not
> > many people use ssd_spread, 'ssd' gets turned on automatically so it has
> > much wider impact.
> > 
> >> At least for our uses, ssd_spread matters much more for metadata than
> >> data (the data writes are large and metadata is small).
> > 
> >  From the changes overview:
> > 
> >> 1. Throw out the current ssd_spread behaviour.
> > 
> > would it be ok for you to keep ssd_spread working as before?
> > 
> > I'd really like to get this patch merged soon because "do not use ssd
> > mode for ssd" has started to be the recommended workaround. Once this
> > sticks, we won't need to have any ssd mode anymore ...
> 
> Works for me.  I do want to make sure that commits in this area include 
> the workload they were targeting, how they were measured and what 
> impacts they had.  That way when we go back to try and change this again 
> we'll understand what profiles we want to preserve.

So there are at least two ways how to look at the change:

performance - ssd + alignments gives better results under some
conditions, especially when there's enough space, rough summary of my
recent measurements with dbench, mailserver workload, fio jobs with
random readwrite

early ENOSPC (user experience) - ie. when a fragmented filesystem can't
satisfy an allocation due to the constraints, although it would be
possible without the alignments; an aged filesystem, near-full

The target here is not a particular workload or profile, but the
behaviour in the near-full conditions, on a filesystem that's likely
fragmented and aged, mixed workload. The patch should fix it in a way
that will make it work at all. There will be some performance impact,
hard to measure or predict under the conditions.

The reports where nossd fixed the ENOSPC problem IMO validate the
change. We don't have performance characteristics attached to them, but
I don't think they're relevant when comparing 'balance finished in a few
hours' with 'balance failed too early, what next'.

Technically, this is an ENOSPC fix, and it took quite some time to
identify the cause.
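As a toy illustration of the "can't satisfy allocation due to the constraints" case described above: the fragment below is a hypothetical first-fit search over a tiny free/used map, not btrfs's allocator, with a start-alignment parameter standing in for the ssd mode's 2MiB clustering. All names and sizes are invented.

```c
/* Toy model, not btrfs code: first-fit search over a tiny "disk" of
 * free/used units, with a start-alignment constraint standing in for
 * the ssd mode's clustering. */
#include <assert.h>

#define DISK 16

/* Return the start of the first free run of 'len' units whose start
 * is a multiple of 'align', or -1 if no such run exists. */
static int first_fit(const char *used, int len, int align)
{
	for (int start = 0; start + len <= DISK; start += align) {
		int ok = 1;

		for (int i = 0; i < len; i++) {
			if (used[start + i]) {
				ok = 0;
				break;
			}
		}
		if (ok)
			return start;
	}
	return -1;
}
```

With free runs of two units sitting only at odd offsets, the unaligned search succeeds while the aligned one reports failure for the very same request, which is the shape of the early-ENOSPC reports.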
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/7] btrfs-progs: backref: add list_first_pref helper

2017-07-26 Thread Jeff Mahoney
On 7/26/17 9:22 AM, Jeff Mahoney wrote:
> On 7/26/17 3:08 AM, Nikolay Borisov wrote:
>>
>>
>> On 25.07.2017 23:51, je...@suse.com wrote:
>>> From: Jeff Mahoney 
>>>
>>> ---
>>>  backref.c | 11 +++
>>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/backref.c b/backref.c
>>> index ac1b506..be3376a 100644
>>> --- a/backref.c
>>> +++ b/backref.c
>>> @@ -130,6 +130,11 @@ struct __prelim_ref {
>>> u64 wanted_disk_byte;
>>>  };
>>>  
>>> +static struct __prelim_ref *list_first_pref(struct list_head *head)
>>> +{
>>> +   return list_first_entry(head, struct __prelim_ref, list);
>>> +}
>>> +
>>
>> I think this just adds one more level of abstraction with no real
>> benefit whatsoever. Why not drop the patch entirely.
> 
> Ack.  I thought it might be more readable but it ends up taking the same
> number of characters.

Actually, no, it doesn't.  That's only true if using 'head' as the list head
as in the helper.

It ends up being

ref = list_first_pref(&prefstate->pending_missing_keys);
vs
ref = list_first_entry(&prefstate->pending_missing_keys,
       struct __prelim_ref, list);

and I have to say I prefer reading the former.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 5/7] btrfs-progs: backref: add list_first_pref helper

2017-07-26 Thread Jeff Mahoney
On 7/26/17 3:08 AM, Nikolay Borisov wrote:
> 
> 
> On 25.07.2017 23:51, je...@suse.com wrote:
>> From: Jeff Mahoney 
>>
>> ---
>>  backref.c | 11 +++
>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/backref.c b/backref.c
>> index ac1b506..be3376a 100644
>> --- a/backref.c
>> +++ b/backref.c
>> @@ -130,6 +130,11 @@ struct __prelim_ref {
>>  u64 wanted_disk_byte;
>>  };
>>  
>> +static struct __prelim_ref *list_first_pref(struct list_head *head)
>> +{
>> +return list_first_entry(head, struct __prelim_ref, list);
>> +}
>> +
> 
> I think this just adds one more level of abstraction with no real
> benefit whatsoever. Why not drop the patch entirely.

Ack.  I thought it might be more readable but it ends up taking the same
number of characters.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 3/7] btrfs-progs: extent-cache: actually cache extent buffers

2017-07-26 Thread Jeff Mahoney
On 7/26/17 3:00 AM, Nikolay Borisov wrote:
> 
> 
> On 25.07.2017 23:51, je...@suse.com wrote:
>> From: Jeff Mahoney 
>>
>> We have the infrastructure to cache extent buffers but we don't actually
>> do the caching.  As soon as the last reference is dropped, the buffer
>> is dropped.  This patch keeps the extent buffers around until the max
>> cache size is reached (defaults to 25% of memory) and then it drops
>> the last 10% of the LRU to free up cache space for reallocation.  The
>> cache size is configurable (for use by e.g. lowmem) when the cache is
>> initialized.
>>
>> Signed-off-by: Jeff Mahoney 

>> @@ -567,7 +580,21 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct 
>> extent_buffer *src)
>>  return new;
>>  }
>>  
>> -void free_extent_buffer(struct extent_buffer *eb)
>> +static void free_extent_buffer_final(struct extent_buffer *eb)
>> +{
>> +struct extent_io_tree *tree = eb->tree;
>> +
>> +BUG_ON(eb->refs);
>> +BUG_ON(tree->cache_size < eb->len);
>> +list_del_init(&eb->lru);
>> +if (!(eb->flags & EXTENT_BUFFER_DUMMY)) {
>> +remove_cache_extent(&tree->cache, &eb->cache_node);
>> +tree->cache_size -= eb->len;
>> +}
>> +free(eb);
>> +}
>> +
>> +static void free_extent_buffer_internal(struct extent_buffer *eb, int 
>> free_now)
> 
nit: free_now -> boolean

Ack.  There should be a bunch of int -> bool conversions elsewhere too.

>> @@ -619,6 +650,21 @@ struct extent_buffer *find_first_extent_buffer(struct 
>> extent_io_tree *tree,
>>  return eb;
>>  }
>>  
>> +static void
>> +trim_extent_buffer_cache(struct extent_io_tree *tree)
>> +{
>> +struct extent_buffer *eb, *tmp;
>> +u64 count = 0;
> 
> count seems to be a leftover from something, so you could remove it

Yep, that was during debugging.  Removed.

>> @@ -2521,3 +2522,14 @@ u8 rand_u8(void)
>>  void btrfs_config_init(void)
>>  {
>>  }
>> +
>> +unsigned long total_memory(void)
> 
> perhaps rename to total_memory_bytes and return the memory size in
> bytes. Returning them in kilobytes seems rather arbitrary. That way
> you'd save the constant *1024 to turn the kbs in bytes in the callers
> (currently only in extent_io_tree_init())
> 

Ack.

Thanks,

-Jeff


-- 
Jeff Mahoney
SUSE Labs
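A user-space sketch of the trim-on-threshold policy discussed above (keep buffers until a size limit is hit, then drop the oldest ~10% of the LRU). The structure, names, and fixed capacities here are invented for illustration, not the btrfs-progs implementation:

```c
/* Toy LRU cache of buffer lengths; lru[0] is the oldest entry. */
#include <assert.h>
#include <string.h>

struct toy_cache {
	int lru[64];		/* entry lengths, oldest first */
	int count;
	int cache_size;		/* sum of entry lengths */
	int max_size;
};

/* Once over the limit, drop the oldest tenth (at least one entry),
 * mirroring the "drop the last 10% of the LRU" policy. */
static void toy_trim(struct toy_cache *c)
{
	int drop, i;

	if (c->cache_size <= c->max_size)
		return;
	drop = c->count / 10 + 1;
	for (i = 0; i < drop; i++)
		c->cache_size -= c->lru[i];
	memmove(c->lru, c->lru + drop, (c->count - drop) * sizeof(int));
	c->count -= drop;
}

static void toy_insert(struct toy_cache *c, int len)
{
	c->lru[c->count++] = len;	/* newest entries go to the tail */
	c->cache_size += len;
	toy_trim(c);
}
```

The point of trimming a batch rather than a single entry is to avoid evicting on every insert once the cache sits at the limit.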





[PATCH v2 2/3] Btrfs: heuristic add byte set calculation

2017-07-26 Thread Timofey Titovets
Calculate the byte set size for the data sample:
calculate how many unique bytes appear in the sample
by counting all bytes in the bucket with count > 0.
If the byte set is small (~25%), the data is easily compressible.

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/compression.c | 27 +++
 fs/btrfs/compression.h |  1 +
 2 files changed, 28 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ca7cfaad6e2f..1429b11f2c5f 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1048,6 +1048,27 @@ int btrfs_decompress_buf2page(const char *buf, unsigned 
long buf_start,
return 1;
 }

+static inline int byte_set_size(const struct heuristic_bucket_item *bucket)
+{
+   int a = 0;
+   int byte_set_size = 0;
+
+   for (; a < BTRFS_HEURISTIC_BYTE_SET_THRESHOLD; a++) {
+   if (bucket[a].count > 0)
+   byte_set_size++;
+   }
+
+   for (; a < BTRFS_HEURISTIC_BUCKET_SIZE; a++) {
+   if (bucket[a].count > 0) {
+   byte_set_size++;
+   if (byte_set_size > BTRFS_HEURISTIC_BYTE_SET_THRESHOLD)
+   return byte_set_size;
+   }
+   }
+
+   return byte_set_size;
+}
+
 /*
  * Compression heuristic.
  *
@@ -1096,6 +1117,12 @@ int btrfs_compress_heuristic(struct inode *inode, u64 
start, u64 end)
index++;
}

+   a = byte_set_size(bucket);
+   if (a > BTRFS_HEURISTIC_BYTE_SET_THRESHOLD) {
+   ret = 1;
+   goto out;
+   }
+
 out:
kfree(bucket);
return ret;
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index e30a9df1937e..03857967815a 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -138,6 +138,7 @@ struct heuristic_bucket_item {
 #define BTRFS_HEURISTIC_READ_SIZE 16
 #define BTRFS_HEURISTIC_ITER_OFFSET 256
 #define BTRFS_HEURISTIC_BUCKET_SIZE 256
+#define BTRFS_HEURISTIC_BYTE_SET_THRESHOLD 64

 int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end);

--
2.13.3
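For experimentation outside the kernel, the counting step above is easy to reproduce; the sketch below is a user-space stand-in (only the 64-symbol threshold and 256-entry bucket match the patch's constants, the rest is illustrative):

```c
/* Count distinct byte values in a sample; a small result means the
 * data uses a narrow alphabet and is likely easy to compress. */
#include <assert.h>
#include <stddef.h>

#define BUCKET_SIZE 256
#define BYTE_SET_THRESHOLD 64	/* matches the patch's threshold */

static int byte_set_size(const unsigned char *data, size_t len)
{
	unsigned int count[BUCKET_SIZE] = {0};
	int set = 0;

	for (size_t i = 0; i < len; i++)
		count[data[i]]++;
	for (int i = 0; i < BUCKET_SIZE; i++)
		if (count[i])
			set++;
	return set;
}
```

A text-like buffer with four distinct bytes lands well under the 64-symbol threshold (compress), while a buffer containing every byte value lands far above it.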


[PATCH v2 3/3] Btrfs: heuristic add byte core set calculation

2017-07-26 Thread Timofey Titovets
Calculate the byte core set for the data sample:
sort the bucket's counts in decreasing order and
count how many symbols cover 90% of the sample.
If the core set is small (<=25%), the data is easily compressible.
If the core set is large (>=80%), the data is not compressible.

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/compression.c | 57 ++
 fs/btrfs/compression.h |  2 ++
 2 files changed, 59 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 1429b11f2c5f..314cbdd8d175 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1069,6 +1069,42 @@ static inline int byte_set_size(const struct 
heuristic_bucket_item *bucket)
return byte_set_size;
 }

+/* For bucket sorting */
+static inline int heuristic_bucket_compare(const void *lv, const void *rv)
+{
+   struct heuristic_bucket_item *l = (struct heuristic_bucket_item *)(lv);
+   struct heuristic_bucket_item *r = (struct heuristic_bucket_item *)(rv);
+
+   return r->count - l->count;
+}
+
+/*
+ * Byte Core set size
+ * How many bytes use 90% of sample
+ */
+static inline int byte_core_set_size(struct heuristic_bucket_item *bucket,
+u32 core_set_threshold)
+{
+   int a = 0;
+   u32 coreset_sum = 0;
+
+   for (; a < BTRFS_HEURISTIC_BYTE_CORE_SET_LOW; a++)
+   coreset_sum += bucket[a].count;
+
+   if (coreset_sum > core_set_threshold)
+   return a;
+
+   for (; a < BTRFS_HEURISTIC_BYTE_CORE_SET_HIGH; a++) {
+   if (bucket[a].count == 0)
+   break;
+   coreset_sum += bucket[a].count;
+   if (coreset_sum > core_set_threshold)
+   break;
+   }
+
+   return a;
+}
+
 /*
  * Compression heuristic.
  *
@@ -1092,6 +1128,8 @@ int btrfs_compress_heuristic(struct inode *inode, u64 
start, u64 end)
struct heuristic_bucket_item *bucket;
int a, b, ret;
u8 symbol, *input_data;
+   u32 core_set_threshold;
+   u64 input_size = start - end;

ret = 1;

@@ -1123,6 +1161,25 @@ int btrfs_compress_heuristic(struct inode *inode, u64 
start, u64 end)
goto out;
}

+   /* Sort in reverse order */
+   sort(bucket, BTRFS_HEURISTIC_BUCKET_SIZE,
+sizeof(struct heuristic_bucket_item), _bucket_compare,
+NULL);
+
+   core_set_threshold = input_size*90/BTRFS_HEURISTIC_ITER_OFFSET/100;
+   core_set_threshold *= BTRFS_HEURISTIC_READ_SIZE;
+
+   a = byte_core_set_size(bucket, core_set_threshold);
+   if (a <= BTRFS_HEURISTIC_BYTE_CORE_SET_LOW) {
+   ret = 2;
+   goto out;
+   }
+
+   if (a >= BTRFS_HEURISTIC_BYTE_CORE_SET_HIGH) {
+   ret = 0;
+   goto out;
+   }
+
 out:
kfree(bucket);
return ret;
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 03857967815a..0fcd1a485adb 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -139,6 +139,8 @@ struct heuristic_bucket_item {
 #define BTRFS_HEURISTIC_ITER_OFFSET 256
 #define BTRFS_HEURISTIC_BUCKET_SIZE 256
 #define BTRFS_HEURISTIC_BYTE_SET_THRESHOLD 64
+#define BTRFS_HEURISTIC_BYTE_CORE_SET_LOW  BTRFS_HEURISTIC_BYTE_SET_THRESHOLD
+#define BTRFS_HEURISTIC_BYTE_CORE_SET_HIGH 200 // 80%

 int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end);

--
2.13.3
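The sort-then-accumulate step can likewise be tried in user space with qsort(); this is an illustrative stand-in, not the kernel code (which uses sort() over fixed 256-entry buckets):

```c
/* How many of the most frequent symbols cover 'threshold' of the
 * sample.  Few symbols => repetitive, compressible data; many => flat
 * distribution, likely incompressible. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Descending comparison on counts, like heuristic_bucket_compare. */
static int cmp_desc(const void *lv, const void *rv)
{
	unsigned int l = *(const unsigned int *)lv;
	unsigned int r = *(const unsigned int *)rv;

	return (r > l) - (r < l);
}

static int byte_core_set_size(unsigned int *count, size_t nsyms,
			      unsigned int threshold)
{
	unsigned int sum = 0;
	size_t i;

	qsort(count, nsyms, sizeof(*count), cmp_desc);
	for (i = 0; i < nsyms; i++) {
		sum += count[i];
		if (sum > threshold)
			return (int)(i + 1);
	}
	return (int)nsyms;
}
```

A skewed histogram (one symbol dominating) yields a tiny core set, while a flat histogram needs nearly every symbol to reach the 90% mark.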


[PATCH v2 1/3] Btrfs: heuristic add simple sampling logic

2017-07-26 Thread Timofey Titovets
Get a small sample from the input data and calculate
per-byte counts for that sample into a bucket.
The bucket stores info about which bytes,
and how many of each, have been detected in the sample.

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/compression.c | 24 ++--
 fs/btrfs/compression.h | 10 ++
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 63f54bd2d5bb..ca7cfaad6e2f 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1068,15 +1068,35 @@ int btrfs_compress_heuristic(struct inode *inode, u64 
start, u64 end)
u64 index = start >> PAGE_SHIFT;
u64 end_index = end >> PAGE_SHIFT;
struct page *page;
-   int ret = 1;
+   struct heuristic_bucket_item *bucket;
+   int a, b, ret;
+   u8 symbol, *input_data;
+
+   ret = 1;
+
+   bucket = kcalloc(BTRFS_HEURISTIC_BUCKET_SIZE,
+   sizeof(struct heuristic_bucket_item), GFP_NOFS);
+
+   if (!bucket)
+   goto out;

while (index <= end_index) {
page = find_get_page(inode->i_mapping, index);
-   kmap(page);
+   input_data = kmap(page);
+   a = 0;
+   while (a < PAGE_SIZE) {
+   for (b = 0; b < BTRFS_HEURISTIC_READ_SIZE; b++) {
+   symbol = input_data[a+b];
+   bucket[symbol].count++;
+   }
+   a += BTRFS_HEURISTIC_ITER_OFFSET;
+   }
kunmap(page);
put_page(page);
index++;
}

+out:
+   kfree(bucket);
return ret;
 }
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index d1f4eee2d0af..e30a9df1937e 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -129,6 +129,16 @@ struct btrfs_compress_op {
 extern const struct btrfs_compress_op btrfs_zlib_compress;
 extern const struct btrfs_compress_op btrfs_lzo_compress;

+struct heuristic_bucket_item {
+   u8  padding;
+   u8  symbol;
+   u16 count;
+};
+
+#define BTRFS_HEURISTIC_READ_SIZE 16
+#define BTRFS_HEURISTIC_ITER_OFFSET 256
+#define BTRFS_HEURISTIC_BUCKET_SIZE 256
+
 int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end);

 #endif
--
2.13.3
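The sampling pattern (a 16-byte read every 256 bytes) can be sketched in user space as follows; only the two stride constants come from the patch, the function name and return value are illustrative:

```c
/* Fill count[] from 16-byte samples taken every 256 bytes, the same
 * stride the patch uses; returns the number of bytes sampled. */
#include <assert.h>
#include <stddef.h>

#define READ_SIZE   16	/* bytes per sample */
#define ITER_OFFSET 256	/* distance between sample starts */

static size_t sample_bytes(const unsigned char *data, size_t len,
			   unsigned int count[256])
{
	size_t sampled = 0;

	for (size_t off = 0; off + READ_SIZE <= len; off += ITER_OFFSET) {
		for (size_t i = 0; i < READ_SIZE; i++) {
			count[data[off + i]]++;
			sampled++;
		}
	}
	return sampled;
}
```

So for a 1 KiB buffer, four samples of 16 bytes each are inspected, i.e. 64 of 1024 bytes, which is what keeps the heuristic cheap.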


[PATCH v2 0/3] Btrfs: populate heuristic with detection logic

2017-07-26 Thread Timofey Titovets
Based on kdave for-next
As heuristic skeleton already merged
Populate heuristic with basic code.

First patch: add simple sampling code.
It gets 16-byte samples with 256-byte shifts
over the input data and collects info about how many
different bytes (symbols) have been found in the sample data.

Second patch: add code to calculate
how many unique bytes have been
found in the sample data.
That can quickly detect easily compressible data.

Third patch: add code to calculate the byte core set size,
i.e. how many unique bytes cover 90% of the sample data.
That code requires that the counts in the bucket be sorted.
It can detect easily compressible data with many repeated bytes,
and not-compressible data with evenly distributed bytes.

Changes v1 -> v2:
  - Change input data iterator shift 512 -> 256
  - Replace magic macro numbers with direct values
  - Drop useless symbol population in bucket,
as no one cares where and which symbol is stored
in the bucket for now

Timofey Titovets (3):
  Btrfs: heuristic add simple sampling logic
  Btrfs: heuristic add byte set calculation
  Btrfs: heuristic add byte core set calculation

 fs/btrfs/compression.c | 108 -
 fs/btrfs/compression.h |  13 ++
 2 files changed, 119 insertions(+), 2 deletions(-)

--
2.13.3


Re: btrfs raid assurance

2017-07-26 Thread Austin S. Hemmelgarn

On 2017-07-26 08:27, Hugo Mills wrote:

On Wed, Jul 26, 2017 at 08:12:19AM -0400, Austin S. Hemmelgarn wrote:

On 2017-07-25 17:45, Hugo Mills wrote:

On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:



Hugo Mills wrote:



You can see about the disk usage in different scenarios with the
online tool at:

http://carfax.org.uk/btrfs-usage/

Hugo.


As a side note, have you ever considered making this online tool
(that should never go away just for the record) part of btrfs-progs
e.g. a proper tool? I use it quite often (at least several times
per month) and I would love for this to be a visual tool
'btrfs-space-calculator' would be a great name for it I think.

Imagine how nice it would be to run

btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
/dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
something similar to my example below (no accuracy intended)


It's certainly a thought. I've already got the algorithm written
up. I'd have to resurrect my C skills, though, and it's a long way
down my list of things to do. :/

Also on the subject of this tool, I'd like to make it so that the
parameters get set in the URL, so that people can copy-paste the URL
of the settings they've got into IRC for discussion. However, that
would involve doing more JavaScript, which is possibly even lower down
my list of things to do than starting doing C again...



Is the core logic posted somewhere?  Because if I have some time, I
might write up a quick Python script to do this locally (it may not
be as tightly integrated with the regular tools, but I can count on
half a hand how many distros don't include Python by default).


If it's going to be done in python, I might as well do it myself --
I can do python with my eyes closed. It's just C and JS I'm rusty with.

Same here ironically :)


There is a write-up of the usable-space algorithm somewhere. I
wrote it up in detail (with pseudocode) in a mail on this list. I've
also got several pages of LaTeX somewhere where I tried and failed to
prove the correctness of the formula. I'll see if I can dig them out
this evening.
It looks like the Message-ID for the one on the mailing list is 
<20160311221703.gj17...@carfax.org.uk>
I had forgotten that I'd archived that with the intent of actually doing 
something with it eventually...



Re: btrfs raid assurance

2017-07-26 Thread Hugo Mills
On Wed, Jul 26, 2017 at 12:27:20PM +, Hugo Mills wrote:
> On Wed, Jul 26, 2017 at 08:12:19AM -0400, Austin S. Hemmelgarn wrote:
> > On 2017-07-25 17:45, Hugo Mills wrote:
> > >On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:
> > >>
> > >>
> > >>Hugo Mills wrote:
> > >>>
> > >You can see about the disk usage in different scenarios with the
> > >online tool at:
> > >
> > >http://carfax.org.uk/btrfs-usage/
> > >
> > >Hugo.
> > >
> > >>As a side note, have you ever considered making this online tool
> > >>(that should never go away just for the record) part of btrfs-progs
> > >>e.g. a proper tool? I use it quite often (at least several times
> > >>per month) and I would love for this to be a visual tool
> > >>'btrfs-space-calculator' would be a great name for it I think.
> > >>
> > >>Imagine how nice it would be to run
> > >>
> > >>btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
> > >>/dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
> > >>something similar to my example below (no accuracy intended)
> > >
> > >It's certainly a thought. I've already got the algorithm written
> > >up. I'd have to resurrect my C skills, though, and it's a long way
> > >down my list of things to do. :/
> > >
> > >Also on the subject of this tool, I'd like to make it so that the
> > >parameters get set in the URL, so that people can copy-paste the URL
> > >of the settings they've got into IRC for discussion. However, that
> > >would involve doing more JavaScript, which is possibly even lower down
> > >my list of things to do than starting doing C again...
> 
> > Is the core logic posted somewhere?  Because if I have some time, I
> > might write up a quick Python script to do this locally (it may not
> > be as tightly integrated with the regular tools, but I can count on
> > half a hand how many distros don't include Python by default).
> 
>If it's going to be done in python, I might as well do it myself --
> I can do python with my eyes closed. It's just C and JS I'm rusty with.
> 
>There is a write-up of the usable-space algorithm somewhere. I
> wrote it up in detail (with pseudocode) in a mail on this list. I've
> also got several pages of LaTeX somewhere where I tried and failed to
> prove the correctness of the formula. I'll see if I can dig them out
> this evening.

   Oh, and of course there's the JS from the website... that's not
minified, and should be readable (if not particularly well-commented).

   Hugo.

-- 
Hugo Mills | How do you become King? You stand in the marketplace
hugo@... carfax.org.uk | and announce you're going to tax everyone. If you
http://carfax.org.uk/  | get out alive, you're King.
PGP: E2AB1DE4  |Harry Harrison




Re: btrfs raid assurance

2017-07-26 Thread Hugo Mills
On Wed, Jul 26, 2017 at 08:12:19AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-07-25 17:45, Hugo Mills wrote:
> >On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:
> >>
> >>
> >>Hugo Mills wrote:
> >>>
> >You can see about the disk usage in different scenarios with the
> >online tool at:
> >
> >http://carfax.org.uk/btrfs-usage/
> >
> >Hugo.
> >
> >>As a side note, have you ever considered making this online tool
> >>(that should never go away just for the record) part of btrfs-progs
> >>e.g. a proper tool? I use it quite often (at least several times
> >>per month) and I would love for this to be a visual tool
> >>'btrfs-space-calculator' would be a great name for it I think.
> >>
> >>Imagine how nice it would be to run
> >>
> >>btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
> >>/dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
> >>something similar to my example below (no accuracy intended)
> >
> >It's certainly a thought. I've already got the algorithm written
> >up. I'd have to resurrect my C skills, though, and it's a long way
> >down my list of things to do. :/
> >
> >Also on the subject of this tool, I'd like to make it so that the
> >parameters get set in the URL, so that people can copy-paste the URL
> >of the settings they've got into IRC for discussion. However, that
> >would involve doing more JavaScript, which is possibly even lower down
> >my list of things to do than starting doing C again...

> Is the core logic posted somewhere?  Because if I have some time, I
> might write up a quick Python script to do this locally (it may not
> be as tightly integrated with the regular tools, but I can count on
> half a hand how many distros don't include Python by default).

   If it's going to be done in python, I might as well do it myself --
I can do python with my eyes closed. It's just C and JS I'm rusty with.

   There is a write-up of the usable-space algorithm somewhere. I
wrote it up in detail (with pseudocode) in a mail on this list. I've
also got several pages of LaTeX somewhere where I tried and failed to
prove the correctness of the formula. I'll see if I can dig them out
this evening.

   Hugo.

-- 
Hugo Mills | How do you become King? You stand in the marketplace
hugo@... carfax.org.uk | and announce you're going to tax everyone. If you
http://carfax.org.uk/  | get out alive, you're King.
PGP: E2AB1DE4  |Harry Harrison




Re: btrfs raid assurance

2017-07-26 Thread Austin S. Hemmelgarn

On 2017-07-25 17:45, Hugo Mills wrote:

On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:



Hugo Mills wrote:



You can see about the disk usage in different scenarios with the
online tool at:

http://carfax.org.uk/btrfs-usage/

Hugo.


As a side note, have you ever considered making this online tool
(that should never go away just for the record) part of btrfs-progs
e.g. a proper tool? I use it quite often (at least several times
per month) and I would love for this to be a visual tool
'btrfs-space-calculator' would be a great name for it I think.

Imagine how nice it would be to run

btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
/dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
something similar to my example below (no accuracy intended)


It's certainly a thought. I've already got the algorithm written
up. I'd have to resurrect my C skills, though, and it's a long way
down my list of things to do. :/

Also on the subject of this tool, I'd like to make it so that the
parameters get set in the URL, so that people can copy-paste the URL
of the settings they've got into IRC for discussion. However, that
would involve doing more JavaScript, which is possibly even lower down
my list of things to do than starting doing C again...
Is the core logic posted somewhere?  Because if I have some time, I 
might write up a quick Python script to do this locally (it may not be 
as tightly integrated with the regular tools, but I can count on half a 
hand how many distros don't include Python by default).


Hugo.


d=data
m=metadata
.=unusable

{  500mb} [|d|] /dev/sda1
{ 3000mb} [|d|m|m|m|m|mm...|] /dev/sdb1
{ 3000mb} [|d|m|m|m|m|mmm..|] /dev/sdc2
{ 5000mb} [|d|m|m|m|m|m|m|m|m|m|] /dev/sdb1

{11500mb} Total space

usable for data (raid10): 1000mb / 2000mb
usable for metadata (raid1): 4500mb / 9000mb
unusable: 500mb

Of course this would have to change one (if ever) subvolumes can
have different raid levels etc, but I would have loved using
something like this instead of jumping around carfax abbey (!) at
night.


The core algorithm for the tool actually works pretty well for
dealing with different RAID levels, as long as you know how much of
each kind of data you're going to be using. (Although it's actually
path-dependent -- write 100 GB of RAID-0 then 100 GB of RAID-1 can
have different results than if you write them in the opposite order --
but that's a kind of edge effect).

Hugo.
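For reference, the RAID-1 case of the usable-space calculation can be approximated with a greedy loop: repeatedly place one mirrored unit on the two devices with the most free space until no pair remains. This is a simplified sketch (unit granularity, RAID-1 only, invented names), not Hugo's exact published algorithm, which handles all the profiles and chunk sizes:

```c
/* Greedy RAID-1 usable-space estimate over per-device free space. */
#include <assert.h>

static long raid1_usable(long *free, int ndev)
{
	long usable = 0;

	for (;;) {
		int a = -1, b = -1;	/* two devices with most free space */

		for (int i = 0; i < ndev; i++) {
			if (free[i] <= 0)
				continue;
			if (a < 0 || free[i] > free[a]) {
				b = a;
				a = i;
			} else if (b < 0 || free[i] > free[b]) {
				b = i;
			}
		}
		if (b < 0)		/* fewer than two devices left */
			return usable;
		free[a]--;		/* place one mirrored unit */
		free[b]--;
		usable++;
	}
}
```

This reproduces the intuitive results: with devices of 1 and 3 units only 1 unit is usable (the rest of the larger device has no mirror partner), while 3+3+2 yields the full 4 units.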





[PATCH v2] btrfs: Deprecate userspace transaction ioctls

2017-07-26 Thread Nikolay Borisov
Userspace transactions were introduced in commit
6bf13c0cc833 ("Btrfs: transaction ioctls") to provide semantics that Ceph's
object store required. However, things have changed significantly since then,
to the point where btrfs is no longer suitable as a backend for ceph and in fact
it's actively advised against such usages. Considering this, there doesn't
seem to be a widespread, legit use case of userspace transaction. They also
clutter the file->private pointer.

So to end the agony let's nuke the userspace transaction ioctls. As a first
step let's give people time to voice their objections by just WARN()ing
when a userspace transaction is used.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ioctl.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index fa1b78cf25f6..10d78d71df96 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4000,6 +4000,12 @@ static long btrfs_ioctl_trans_start(struct file *file)
struct btrfs_trans_handle *trans;
int ret;
 
+   btrfs_warn(fs_info, "Userspace transaction mechanism is considered "
+  "deprecated and slated to be removed in the near future "
+  "(1 or 2 releases). If you have a valid use case please "
+  "speak up on the mailing list");
+   WARN_ON_ONCE(1);
+
ret = -EPERM;
if (!capable(CAP_SYS_ADMIN))
goto out;
-- 
2.7.4



Re: [PATCH 5/7] btrfs-progs: backref: add list_first_pref helper

2017-07-26 Thread Nikolay Borisov


On 25.07.2017 23:51, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> ---
>  backref.c | 11 +++
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/backref.c b/backref.c
> index ac1b506..be3376a 100644
> --- a/backref.c
> +++ b/backref.c
> @@ -130,6 +130,11 @@ struct __prelim_ref {
>   u64 wanted_disk_byte;
>  };
>  
> +static struct __prelim_ref *list_first_pref(struct list_head *head)
> +{
> + return list_first_entry(head, struct __prelim_ref, list);
> +}
> +

I think this just adds one more level of abstraction with no real
benefit whatsoever. Why not drop the patch entirely.

>  struct pref_state {
>   struct list_head pending;
>  };
> @@ -804,8 +809,7 @@ static int find_parent_nodes(struct btrfs_trans_handle 
> *trans,
>   __merge_refs(&prefstate, 2);
>  
>   while (!list_empty(&prefstate.pending)) {
> - ref = list_first_entry(&prefstate.pending,
> -struct __prelim_ref, list);
> + ref = list_first_pref(&prefstate.pending);
>   WARN_ON(ref->count < 0);
>   if (roots && ref->count && ref->root_id && ref->parent == 0) {
>   /* no parent == root of tree */
> @@ -857,8 +861,7 @@ static int find_parent_nodes(struct btrfs_trans_handle 
> *trans,
>  out:
>   btrfs_free_path(path);
>   while (!list_empty(&prefstate.pending)) {
> - ref = list_first_entry(&prefstate.pending,
> -struct __prelim_ref, list);
> + ref = list_first_pref(&prefstate.pending);
>   list_del(>list);
>   kfree(ref);
>   }
> 


Re: [PATCH 3/7] btrfs-progs: extent-cache: actually cache extent buffers

2017-07-26 Thread Nikolay Borisov


On 25.07.2017 23:51, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> We have the infrastructure to cache extent buffers but we don't actually
> do the caching.  As soon as the last reference is dropped, the buffer
> is dropped.  This patch keeps the extent buffers around until the max
> cache size is reached (defaults to 25% of memory) and then it drops
> the last 10% of the LRU to free up cache space for reallocation.  The
> cache size is configurable (for use by e.g. lowmem) when the cache is
> initialized.
> 
> Signed-off-by: Jeff Mahoney 
> ---
>  extent_io.c | 74 
> ++---
>  extent_io.h |  4 
>  utils.c | 12 ++
>  utils.h |  3 +++
>  4 files changed, 80 insertions(+), 13 deletions(-)
> 
> diff --git a/extent_io.c b/extent_io.c
> index 915c6ed..937ff90 100644
> --- a/extent_io.c
> +++ b/extent_io.c
> @@ -27,6 +27,7 @@
>  #include "list.h"
>  #include "ctree.h"
>  #include "volumes.h"
> +#include "utils.h"
>  #include "internal.h"
>  
>  void extent_io_tree_init(struct extent_io_tree *tree)
> @@ -35,6 +36,14 @@ void extent_io_tree_init(struct extent_io_tree *tree)
>   cache_tree_init(&tree->cache);
>   INIT_LIST_HEAD(&tree->lru);
>   tree->cache_size = 0;
> + tree->max_cache_size = (u64)(total_memory() * 1024) / 4;
> +}
> +
> +void extent_io_tree_init_cache_max(struct extent_io_tree *tree,
> +u64 max_cache_size)
> +{
> + extent_io_tree_init(tree);
> + tree->max_cache_size = max_cache_size;
>  }
>  
>  static struct extent_state *alloc_extent_state(void)
> @@ -67,16 +76,20 @@ static void free_extent_state_func(struct cache_extent 
> *cache)
>   btrfs_free_extent_state(es);
>  }
>  
> +static void free_extent_buffer_final(struct extent_buffer *eb);
>  void extent_io_tree_cleanup(struct extent_io_tree *tree)
>  {
>   struct extent_buffer *eb;
>  
>   while(!list_empty(&tree->lru)) {
>   eb = list_entry(tree->lru.next, struct extent_buffer, lru);
> - fprintf(stderr, "extent buffer leak: "
> - "start %llu len %u\n",
> - (unsigned long long)eb->start, eb->len);
> - free_extent_buffer(eb);
> + if (eb->refs) {
> + fprintf(stderr, "extent buffer leak: "
> + "start %llu len %u\n",
> + (unsigned long long)eb->start, eb->len);
> + free_extent_buffer_nocache(eb);
> + } else
> + free_extent_buffer_final(eb);
>   }
>  
>   cache_tree_free_extents(&tree->state, free_extent_state_func);
> @@ -567,7 +580,21 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct 
> extent_buffer *src)
>   return new;
>  }
>  
> -void free_extent_buffer(struct extent_buffer *eb)
> +static void free_extent_buffer_final(struct extent_buffer *eb)
> +{
> + struct extent_io_tree *tree = eb->tree;
> +
> + BUG_ON(eb->refs);
> + BUG_ON(tree->cache_size < eb->len);
> + list_del_init(&eb->lru);
> + if (!(eb->flags & EXTENT_BUFFER_DUMMY)) {
> + remove_cache_extent(&tree->cache, &eb->cache_node);
> + tree->cache_size -= eb->len;
> + }
> + free(eb);
> +}
> +
> +static void free_extent_buffer_internal(struct extent_buffer *eb, int 
> free_now)

nit: free_now -> boolean

>  {
>   if (!eb || IS_ERR(eb))
>   return;
> @@ -575,19 +602,23 @@ void free_extent_buffer(struct extent_buffer *eb)
>   eb->refs--;
>   BUG_ON(eb->refs < 0);
>   if (eb->refs == 0) {
> - struct extent_io_tree *tree = eb->tree;
>   BUG_ON(eb->flags & EXTENT_DIRTY);
> - list_del_init(&eb->lru);
>   list_del_init(&eb->recow);
> - if (!(eb->flags & EXTENT_BUFFER_DUMMY)) {
> - BUG_ON(tree->cache_size < eb->len);
> - remove_cache_extent(&tree->cache, &eb->cache_node);
> - tree->cache_size -= eb->len;
> - }
> - free(eb);
> + if (eb->flags & EXTENT_BUFFER_DUMMY || free_now)
> + free_extent_buffer_final(eb);
>   }
>  }
>  
> +void free_extent_buffer(struct extent_buffer *eb)
> +{
> + free_extent_buffer_internal(eb, 0);
> +}
> +
> +void free_extent_buffer_nocache(struct extent_buffer *eb)
> +{
> + free_extent_buffer_internal(eb, 1);
> +}
> +
>  struct extent_buffer *find_extent_buffer(struct extent_io_tree *tree,
>u64 bytenr, u32 blocksize)
>  {
> @@ -619,6 +650,21 @@ struct extent_buffer *find_first_extent_buffer(struct 
> extent_io_tree *tree,
>   return eb;
>  }
>  
> +static void
> +trim_extent_buffer_cache(struct extent_io_tree *tree)
> +{
> + struct extent_buffer *eb, *tmp;
> + u64 count = 0;

count seems to be a leftover from something, so you could remove it

> + list_for_each_entry_safe(eb, tmp, &tree->lru, lru) {
> +