Re: updatedb does not index /home when /home is Btrfs

2017-11-03 Thread Adam Borowski
On Fri, Nov 03, 2017 at 06:15:53PM -0600, Chris Murphy wrote:
> Ancient bug, still seems to be a bug.
> https://bugzilla.redhat.com/show_bug.cgi?id=906591
> 
> The issue is that updatedb by default will not index bind mounts, but
> Fedora (and probably other distros) by default puts /home on a
> subvolume and then mounts that subvolume, which is in effect a bind
> mount.
> 
> There's a lot of early discussion about it in 2013, but then it
> dropped off the radar as nobody had any ideas how to fix this in
> mlocate.

I don't see how this would be a bug in btrfs.  The same happens if you
bind-mount /home (or individual homes), which is a valid and non-rare setup.
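
That said, for anyone bitten by this, mlocate's updatedb.conf already
has a knob for it; a minimal workaround sketch (assuming the config
lives at /etc/updatedb.conf, as on Fedora):

  # /etc/updatedb.conf: descend into bind mounts, which also covers
  # btrfs subvolumes mounted via -o subvol=...
  PRUNE_BIND_MOUNTS = "no"

With that set, the next updatedb run should index /home again.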


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with file system

2017-11-03 Thread Adam Borowski
On Fri, Nov 03, 2017 at 04:03:44PM -0600, Chris Murphy wrote:
> On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn
>  wrote:
> 
> > If you're running on an SSD (or thinly provisioned storage, or something
> > else which supports discards) and have the 'discard' mount option enabled,
> > then there is no backup metadata tree (this issue was mentioned on the list
> > a while ago, but nobody ever replied),
> 
> 
> This is a really good point. I've been running the discard mount
> option for some time now without problems, on a laptop with a Samsung
> Electronics Co Ltd NVMe SSD Controller SM951/PM951.
> 
> However, when I try btrfs-debug-tree -b on the specific block
> addresses of the backup root trees listed in the super, only the
> current one returns a valid result.  All others fail with checksum
> errors. And even the good one fails with checksum errors within
> seconds, as a new tree is created, the super is updated, and Btrfs
> considers the old root tree disposable and subject to discard.
> 
> So if I were to have a problem, there would almost certainly be no
> rollback for me. This seems to defeat a fundamental part of the Btrfs
> design.

How is this an issue?  Discard is issued only once we're positive there's no
reference to the freed blocks anywhere.  At that point, they're also open
for reuse, thus they can be arbitrarily scribbled upon.

Unless your hardware is seriously broken (such as lying about barriers,
which is nearly-guaranteed data loss on btrfs anyway), there's no way the
filesystem will ever reference such blocks.  The corpses of old trees that
are left lying around with no discard can at most be used for manual
forensics, but whether a given block will have been overwritten or not is
a matter of pure luck.

For rollbacks, there are snapshots.  Once a transaction has been fully
committed, the old version is considered gone.
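
A minimal sketch of the snapshot approach (paths hypothetical):

  # take a read-only snapshot before doing anything risky
  btrfs subvolume snapshot -r /home /home/.snap-before-upgrade
  # if things go wrong, the old state is still there to copy back
  cp -a --reflink=always /home/.snap-before-upgrade/user/file /home/user/file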

> > because it's already been discarded.
> > This is ideally something which should be addressed (we need some sort of
> > discard queue for handling in-line discards), but it's not easy to address.
> 
> Discard data extents, don't discard metadata extents? Or put them on a
> substantial delay.

Why would you special-case metadata?  Metadata that points to overwritten or
discarded blocks is of no use either.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS and MDADM (Dec 2014) – Ronny Egners Blog

2017-11-03 Thread Chris Murphy
For what it's worth, cryptsetup 2 now offers a UI for setting up both
dm-verity and dm-integrity.
https://www.kernel.org/pub/linux/utils/cryptsetup/v2.0/v2.0.0-rc0-ReleaseNotes

While more complicated than Btrfs, it's possible to first make an
integrity device on each drive, and then add those integrity block
devices to mdadm or lvm as physical devices to create the
raid1/10/5/6 array. You could do it the other way around, but if you
do it as described, a sector read that fails checksum verification
causes a read error to be handed off to the md driver, which then
reconstructs the data from parity. If you instead make a single
integrity volume on top of an array, then your file system just gets
a read error whenever there's a checksum mismatch; reconstruction
isn't possible, but at least you're warned.
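
A rough sketch of the per-drive layering described above (device names
hypothetical; assumes the integritysetup tool from cryptsetup 2.0):

  # format and open a standalone dm-integrity device on each drive
  integritysetup format /dev/sdb
  integritysetup open /dev/sdb int-sdb
  integritysetup format /dev/sdc
  integritysetup open /dev/sdc int-sdc
  # build the md array on top of the integrity devices
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/mapper/int-sdb /dev/mapper/int-sdc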

---
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


updatedb does not index /home when /home is Btrfs

2017-11-03 Thread Chris Murphy
Ancient bug, still seems to be a bug.
https://bugzilla.redhat.com/show_bug.cgi?id=906591

The issue is that updatedb by default will not index bind mounts, but
Fedora (and probably other distros) by default puts /home on a
subvolume and then mounts that subvolume, which is in effect a bind
mount.

There's a lot of early discussion about it in 2013, but then it
dropped off the radar as nobody had any ideas how to fix this in
mlocate.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with file system

2017-11-03 Thread Chris Murphy
On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn
 wrote:

> If you're running on an SSD (or thinly provisioned storage, or something
> else which supports discards) and have the 'discard' mount option enabled,
> then there is no backup metadata tree (this issue was mentioned on the list
> a while ago, but nobody ever replied),


This is a really good point. I've been running the discard mount
option for some time now without problems, on a laptop with a Samsung
Electronics Co Ltd NVMe SSD Controller SM951/PM951.

However, when I try btrfs-debug-tree -b on the specific block
addresses of the backup root trees listed in the super, only the
current one returns a valid result.  All others fail with checksum
errors. And even the good one fails with checksum errors within
seconds, as a new tree is created, the super is updated, and Btrfs
considers the old root tree disposable and subject to discard.

So if I were to have a problem, there would almost certainly be no
rollback for me. This seems to defeat a fundamental part of the Btrfs
design.
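
For the curious, a sketch of one way to repeat the check (device path
hypothetical):

  # list the backup root bytenrs recorded in the superblock
  btrfs inspect-internal dump-super -f /dev/nvme0n1p3 | grep backup_tree_root
  # then try to print the tree at each bytenr
  btrfs-debug-tree -b <bytenr> /dev/nvme0n1p3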


> because it's already been discarded.
> This is ideally something which should be addressed (we need some sort of
> discard queue for handling in-line discards), but it's not easy to address.

Discard data extents, don't discard metadata extents? Or put them on a
substantial delay.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] Btrfs: add support for fallocate's zero range operation

2017-11-03 Thread Edmund Nadolski


On 11/03/2017 11:20 AM, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> This implements support for the zero range operation of fallocate. For now
> at least it's as simple as possible while reusing most of the existing
> fallocate and hole punching infrastructure.
> 
> Signed-off-by: Filipe Manana 
> ---
> 
> V2: Removed double inode unlock on error path from failure to lock range.
> V3: Factored common code to update isize and inode item into a helper
> function, plus some minor cleanup.
> 
>  fs/btrfs/file.c | 351 
> +---
>  1 file changed, 285 insertions(+), 66 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index aafcc785f840..2cc1aed1c564 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2448,7 +2448,48 @@ static int find_first_non_hole(struct inode *inode, 
> u64 *start, u64 *len)
>   return ret;
>  }
>  
> -static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> +static int btrfs_punch_hole_lock_range(struct inode *inode,
> +const u64 lockstart,
> +const u64 lockend,
> +struct extent_state **cached_state)
> +{
> + while (1) {
> + struct btrfs_ordered_extent *ordered;
> + int ret;
> +
> + truncate_pagecache_range(inode, lockstart, lockend);
> +
> + lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> +  cached_state);
> + ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
> +
> + /*
> +  * We need to make sure we have no ordered extents in this range
> +  * and nobody raced in and read a page in this range, if we did
> +  * we need to try again.
> +  */
> + if ((!ordered ||
> + (ordered->file_offset + ordered->len <= lockstart ||
> +  ordered->file_offset > lockend)) &&
> +  !btrfs_page_exists_in_range(inode, lockstart, lockend)) {
> + if (ordered)
> + btrfs_put_ordered_extent(ordered);
> + break;
> + }
> + if (ordered)
> + btrfs_put_ordered_extent(ordered);
> + unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
> +  lockend, cached_state, GFP_NOFS);
> + ret = btrfs_wait_ordered_range(inode, lockstart,
> +lockend - lockstart + 1);
> + if (ret)
> + return ret;
> + }
> + return 0;
> +}
> +
> +static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len,
> + bool lock_inode)

The lock_inode parameter may no longer be needed, since it looks to be
always true in this version of the patch.

Ed

>  {
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   struct btrfs_root *root = BTRFS_I(inode)->root;
> @@ -2477,7 +2518,8 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
> offset, loff_t len)
>   if (ret)
>   return ret;
>  
> - inode_lock(inode);
> + if (lock_inode)
> + inode_lock(inode);
>   ino_size = round_up(inode->i_size, fs_info->sectorsize);
>   ret = find_first_non_hole(inode, &offset, &len);
>   if (ret < 0)
> @@ -2516,7 +2558,8 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
> offset, loff_t len)
>   truncated_block = true;
>   ret = btrfs_truncate_block(inode, offset, 0, 0);
>   if (ret) {
> - inode_unlock(inode);
> + if (lock_inode)
> + inode_unlock(inode);
>   return ret;
>   }
>   }
> @@ -2564,38 +2607,12 @@ static int btrfs_punch_hole(struct inode *inode, 
> loff_t offset, loff_t len)
>   goto out_only_mutex;
>   }
>  
> - while (1) {
> - struct btrfs_ordered_extent *ordered;
> -
> - truncate_pagecache_range(inode, lockstart, lockend);
> -
> - lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> -  &cached_state);
> - ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
> -
> - /*
> -  * We need to make sure we have no ordered extents in this range
> -  * and nobody raced in and read a page in this range, if we did
> -  * we need to try again.
> -  */
> - if ((!ordered ||
> - (ordered->file_offset + ordered->len <= lockstart ||
> -  ordered->file_offset > lockend)) &&
> -  !btrfs_page_exists_in_range(inode, lockstart, lockend)) {
> - if (ordered)
> - 

[PATCH v3] Btrfs: add support for fallocate's zero range operation

2017-11-03 Thread fdmanana
From: Filipe Manana 

This implements support for the zero range operation of fallocate. For now
at least it's as simple as possible while reusing most of the existing
fallocate and hole punching infrastructure.

Signed-off-by: Filipe Manana 
---

V2: Removed double inode unlock on error path from failure to lock range.
V3: Factored common code to update isize and inode item into a helper
function, plus some minor cleanup.

 fs/btrfs/file.c | 351 +---
 1 file changed, 285 insertions(+), 66 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index aafcc785f840..2cc1aed1c564 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2448,7 +2448,48 @@ static int find_first_non_hole(struct inode *inode, u64 
*start, u64 *len)
return ret;
 }
 
-static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+static int btrfs_punch_hole_lock_range(struct inode *inode,
+  const u64 lockstart,
+  const u64 lockend,
+  struct extent_state **cached_state)
+{
+   while (1) {
+   struct btrfs_ordered_extent *ordered;
+   int ret;
+
+   truncate_pagecache_range(inode, lockstart, lockend);
+
+   lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
+cached_state);
+   ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
+
+   /*
+* We need to make sure we have no ordered extents in this range
+* and nobody raced in and read a page in this range, if we did
+* we need to try again.
+*/
+   if ((!ordered ||
+   (ordered->file_offset + ordered->len <= lockstart ||
+ordered->file_offset > lockend)) &&
+!btrfs_page_exists_in_range(inode, lockstart, lockend)) {
+   if (ordered)
+   btrfs_put_ordered_extent(ordered);
+   break;
+   }
+   if (ordered)
+   btrfs_put_ordered_extent(ordered);
+   unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
+lockend, cached_state, GFP_NOFS);
+   ret = btrfs_wait_ordered_range(inode, lockstart,
+  lockend - lockstart + 1);
+   if (ret)
+   return ret;
+   }
+   return 0;
+}
+
+static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len,
+   bool lock_inode)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -2477,7 +2518,8 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
offset, loff_t len)
if (ret)
return ret;
 
-   inode_lock(inode);
+   if (lock_inode)
+   inode_lock(inode);
ino_size = round_up(inode->i_size, fs_info->sectorsize);
	ret = find_first_non_hole(inode, &offset, &len);
if (ret < 0)
@@ -2516,7 +2558,8 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
offset, loff_t len)
truncated_block = true;
ret = btrfs_truncate_block(inode, offset, 0, 0);
if (ret) {
-   inode_unlock(inode);
+   if (lock_inode)
+   inode_unlock(inode);
return ret;
}
}
@@ -2564,38 +2607,12 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
offset, loff_t len)
goto out_only_mutex;
}
 
-   while (1) {
-   struct btrfs_ordered_extent *ordered;
-
-   truncate_pagecache_range(inode, lockstart, lockend);
-
-   lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
-  &cached_state);
-   ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
-
-   /*
-* We need to make sure we have no ordered extents in this range
-* and nobody raced in and read a page in this range, if we did
-* we need to try again.
-*/
-   if ((!ordered ||
-   (ordered->file_offset + ordered->len <= lockstart ||
-ordered->file_offset > lockend)) &&
-!btrfs_page_exists_in_range(inode, lockstart, lockend)) {
-   if (ordered)
-   btrfs_put_ordered_extent(ordered);
-   break;
-   }
-   if (ordered)
-   btrfs_put_ordered_extent(ordered);
-   unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
-
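
For testing, the new mode can be exercised from userspace with
util-linux's fallocate, assuming a build that exposes zero range:

  # zero a byte range in place: 8 KiB starting at offset 4 KiB
  fallocate --zero-range --offset 4096 --length 8192 testfile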

Mein Liebster

2017-11-03 Thread Isabelle Seyyed


Fund Transfer
From Isabelle Seyyed.

My Dearest,

I have sent you this e-mail to have an open conversation with you. I do
not want you to misunderstand this offer in any way ... if it is well
with you, I ask for your full cooperation. I have contacted you in
trust, to handle an investment in your country/company on my behalf as
a potential partner. My name is Isabelle Seyyed, a 22-year-old girl
from Cote D'Ivoire. My father and I escaped from our country in the
heat of the civil war, after I lost my mother and two of my older
brothers in the war. As a result of the political instability in my
country even after the war, my father established his cocoa and coffee
export business in my country, in Abidjan, Ivory Coast.

He was in Burke, a northern town, negotiating the purchase of a cocoa
plantation when he was struck down by the rebels who were fighting to
take over the government of the country. My father's death has now made
me an orphan and thereby exposed me to danger.
Before his unfortunate death, my late father called me to his sick bed
and told me, as his only surviving daughter, that he had deposited the
sum of 4.6 million euros in one of the prominent banks here in our
country, with my name as the next of kin.

As a result of the current insecurity of life and property in this
country, I want to relocate to another country, because there is no
more good security in this Cote d'Ivoire and no more good universities
since this rebel, political and civil war began; I hope you have heard
about the war in Cote d'Ivoire.

My reasons for contacting you are listed below:
1. I want you to help me secure and invest the sum of four million six
hundred thousand euros (€4,600,000.00), which I inherited from my late
father before he died.
2. I want you to help me obtain university admission as soon as I
arrive in your country after the transfer of the money.
3. I want you to be my guardian, since my father is dead.
4. I want you to help me find good accommodation in your country.

I am prepared to offer you 20% of the total sum as compensation for
your efforts after the successful transfer of my inherited money into
your nominated account. I would appreciate it most if you could contact
me as soon as you receive this message, so that we can discuss this
further.

Yours faithfully,
Isabelle Seyyed
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs_cleaner lockdep warning in v4.9.56

2017-11-03 Thread Petr Janecek
Hello,
  this warning happened during "btrfs subvolume remove" of a
readonly snapshot after the newest in the snapshot series was
"btrfs received".

[96857.000284] [ cut here ]
[96857.000307] WARNING: CPU: 1 PID: 371 at kernel/locking/lockdep.c:704 
register_lock_class+0x4c8/0x530
[96857.000322] Modules linked in: fuse vfat msdos fat dm_mod nfsd auth_rpcgss 
oid_registry nfs_acl nfs lockd grace fscache sunrpc xfs ipmi_watchdog libcrc32c 
raid1 iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal coretemp kvm_intel kvm 
evdev irqbypass serio_raw hpilo hpwdt tpm_tis tpm_tis_core acpi_power_meter tpm 
button lpc_ich mfd_core md_mod ipmi_si ipmi_poweroff ipmi_devintf 
ipmi_msghandler autofs4 btrfs xor raid6_pq sg sd_mod uas usb_storage 
crc32c_intel ahci libahci psmouse libata scsi_mod uhci_hcd xhci_pci xhci_hcd 
tg3 ptp pps_core libphy thermal ehci_pci ehci_hcd usbcore usb_common
[96857.000744] CPU: 1 PID: 371 Comm: btrfs-cleaner Not tainted 4.9.56 #13
[96857.000769] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 06/06/2014
[96857.000795]  c9a4fa48 81309da5  

[96857.000849]  c9a4fa88 81059b7c 02c4 

[96857.000901]   8234baf0 880041fca450 

[96857.000954] Call Trace:
[96857.000979]  [] dump_stack+0x67/0x92
[96857.001004]  [] __warn+0xcc/0xf0
[96857.001029]  [] warn_slowpath_null+0x18/0x20
[96857.001054]  [] register_lock_class+0x4c8/0x530
[96857.001080]  [] __lock_acquire+0x76/0x7f0
[96857.001105]  [] lock_acquire+0xbe/0x1f0
[96857.001161]  [] ? btrfs_tree_lock+0x89/0x250 [btrfs]
[96857.001188]  [] _raw_write_lock+0x33/0x50
[96857.001233]  [] ? btrfs_tree_lock+0x89/0x250 [btrfs]
[96857.001276]  [] btrfs_tree_lock+0x89/0x250 [btrfs]
[96857.001322]  [] ? find_extent_buffer+0xda/0x1e0 [btrfs]
[96857.001367]  [] ? release_extent_buffer+0xc0/0xc0 [btrfs]
[96857.001409]  [] do_walk_down+0xf0/0x930 [btrfs]
[96857.001450]  [] walk_down_tree+0xb2/0xe0 [btrfs]
[96857.001491]  [] btrfs_drop_snapshot+0x3a9/0x780 [btrfs]
[96857.001517]  [] ? _raw_spin_unlock+0x22/0x30
[96857.001561]  [] ? btrfs_kill_all_delayed_nodes+0xbd/0xd0 
[btrfs]
[96857.001617]  [] btrfs_clean_one_deleted_snapshot+0xad/0xe0 
[btrfs]
[96857.001672]  [] cleaner_kthread+0x16f/0x1e0 [btrfs]
[96857.001713]  [] ? btree_invalidatepage+0xa0/0xa0 [btrfs]
[96857.001741]  [] kthread+0x116/0x130
[96857.001765]  [] ? kthread_park+0x60/0x60
[96857.001790]  [] ret_from_fork+0x27/0x40
[96857.001814] ---[ end trace ff21435da4cc1bc5 ]---


Regards,

Petr

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with file system

2017-11-03 Thread Austin S. Hemmelgarn

On 2017-11-03 03:42, Kai Krakow wrote:

On Tue, 31 Oct 2017 07:28:58 -0400,
"Austin S. Hemmelgarn"  wrote:


On 2017-10-31 01:57, Marat Khalili wrote:

On 31/10/17 00:37, Chris Murphy wrote:

But off hand it sounds like hardware was sabotaging the expected
write ordering. How to test a given hardware setup for that, I
think, is really overdue. It affects literally every file system,
and Linux storage technology.

It kinda sounds to me like something other than supers is being
overwritten too soon, and that's why it's possible for none of the
backup roots to find a valid root tree, because all four possible
root trees either haven't actually been written yet (still) or
they've been overwritten, even though the super is updated. But
again, it's speculation, we don't actually know why your system
was no longer mountable.

Just a detached view: I know hardware should respect
ordering/barriers and such, but how hard is it really to avoid
overwriting at least one complete metadata tree for half an hour
(even better, yet another one for a day)? Just metadata, not data
extents.

If you're running on an SSD (or thinly provisioned storage, or
something else which supports discards) and have the 'discard' mount
option enabled, then there is no backup metadata tree (this issue was
mentioned on the list a while ago, but nobody ever replied), because
it's already been discarded.  This is ideally something which should
be addressed (we need some sort of discard queue for handling in-line
discards), but it's not easy to address.

Otherwise, it becomes a question of space usage on the filesystem,
and this is just another reason to keep some extra slack space on the
FS (though that doesn't help _much_, it does help).  This, in theory,
could be addressed, but it probably can't be applied across mounts of
a filesystem without an on-disk format change.


Well, maybe inline discard is working at the wrong level. It should
kick in when the reference through any of the backup roots is dropped,
not when the current instance is dropped.

Indeed.


Without knowledge of the internals, I guess discards could be added to
a queue within a new tree in btrfs, and only added to that queue when
dropped from the last backup root referencing it. But this will
probably add some bad performance spikes.

Inline discards can already cause bad performance spikes.


I wonder how a regular fstrim run through cron applies to this problem?
You still functionally lose any old (freed) trees; they just get kept
around until you call fstrim.
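
A minimal sketch of that approach (the timer name assumes a
systemd-based distro shipping util-linux's fstrim units):

  # mount without 'discard', then trim on a schedule instead
  fstrim -v /
  # or just enable the stock weekly timer
  systemctl enable --now fstrim.timer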


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: defragmenting best practice?

2017-11-03 Thread Austin S. Hemmelgarn

On 2017-11-03 03:26, Kai Krakow wrote:

On Thu, 2 Nov 2017 22:47:31 -0400,
Dave  wrote:


On Thu, Nov 2, 2017 at 5:16 PM, Kai Krakow 
wrote:



You may want to try btrfs autodefrag mount option and see if it
improves things (tho, the effect may take days or weeks to apply if
you didn't enable it right from the creation of the filesystem).

Also, autodefrag will probably unshare reflinks on your snapshots.
You may be able to use bees[1] to work against this effect. Its
interaction with autodefrag is not well tested but it works fine
for me. Also, bees is able to reduce some of the fragmentation
during deduplication because it will rewrite extents back into
bigger chunks (but only for duplicated data).

[1]: https://github.com/Zygo/bees


I will look into bees. And yes, I plan to try autodefrag. (I already
have it enabled now.) However, I need to understand something about
how btrfs send-receive works in regard to reflinks and fragmentation.

Say I have 2 snapshots on my live volume. The earlier one of them has
already been sent to another block device by btrfs send-receive (full
backup). Now defrag runs on the live volume and breaks some percentage
of the reflinks. At this point I do an incremental btrfs send-receive
using "-p" (or "-c") with the diff going to the same other block
device where the prior snapshot was already sent.

Will reflinks be "made whole" (restored) on the receiving block
device? Or is the state of the source volume replicated so closely
that reflink status is the same on the target?

Also, is fragmentation reduced on the receiving block device?

My expectation is that fragmentation would be reduced and duplication
would be reduced too. In other words, does send-receive result in
defragmentation and deduplication too?


As far as I understand, btrfs send/receive doesn't create an exact
mirror. It just replays the block operations between generation
numbers. That is: If it finds new blocks referenced between
generations, it will write a _new_ block to the destination.
That is mostly correct, except it's not a block-level copy.  To put it 
in a heavily simplified manner, send/receive will recreate the subvolume 
using nothing more than basic file manipulation syscalls (write(), 
chown(), chmod(), etc), the clone ioctl, and some extra logic to figure 
out the correct location to clone from.  IOW, it's functionally 
equivalent to using rsync to copy the data, and then deduplicating, 
albeit a bit smarter about when to deduplicate (and more efficient in 
that respect).
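
For concreteness, the incremental flow being discussed looks roughly
like this (paths hypothetical):

  # full send of the first read-only snapshot
  btrfs send /snaps/home.1 | btrfs receive /backup
  # later, send only the delta against the parent snapshot
  btrfs send -p /snaps/home.1 /snaps/home.2 | btrfs receive /backup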


So, no, it won't reduce fragmentation or duplication. It just keeps
reflinks intact as long as such extents weren't touched within the
generation range. Otherwise they are rewritten as new extents.
A received subvolume will almost always be less fragmented than the 
source, since everything is received serially, and each file is written 
out one at a time.


Autodefrag and deduplication processes will thus probably increase
duplication at the destination. A developer may have a better clue, tho.
In theory, yes, but in practice, not so much.  Autodefrag generally 
operates on very small blocks of data (64k IIRC), and I'm pretty sure it 
has some heuristic that only triggers it on small random writes, so 
depending on the workload, it may not be triggering much (for example, 
it often won't trigger on cache directories, since those almost never 
have files rewritten in place).
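
For reference, enabling autodefrag is just a mount option; a
hypothetical fstab line:

  UUID=...  /home  btrfs  defaults,autodefrag,subvol=home  0 0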

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Move leaf verification to correct timing to avoid false panic for sanity test

2017-11-03 Thread Qu Wenruo


On 2017年11月03日 18:59, Filipe Manana wrote:
> On Thu, Nov 2, 2017 at 7:04 AM, Qu Wenruo  wrote:
>> [BUG]
>> If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will
>> instantly cause kernel panic like:
>>
>> --
>> ...
>> assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853
>> ...
>> Call Trace:
>>  btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs]
>>  setup_items_for_insert+0x385/0x650 [btrfs]
>>  __btrfs_drop_extents+0x129a/0x1870 [btrfs]
>> ...
>> --
>>
>> [Cause]
>> Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
>> if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.
>>
>> However some btrfs_mark_buffer_dirty() callers, like
>> setup_items_for_insert(), don't really initialize the item data but
>> only initialize the item pointers, leaving the item data uninitialized.
> 
> So instead of doing this juggling, the best option would be to have it
> not call mark_buffer_dirty(), and leave that responsibility to the
> caller after it initializes the item data. I'll give you a very good
> reason for that below.

However, setup_items_for_insert() is just one of the possible causes;
unless we overhaul all btrfs_mark_buffer_dirty() callers, it will be
whack-a-mole.

> 
>>
>> This makes the tree-checker catch uninitialized data as an error,
>> causing such a panic.
>>
>> [Fix]
>> The correct time to check leaf validity is before write IO or
>> after read IO.
>>
>> Just like we have already done the tree validation check at the btree
>> readpage end io hook, this patch will move the write-time tree checker
>> to csum_dirty_buffer().
>>
>> csum_dirty_buffer() is called just before submitting the btree write
>> bio, as the call path shows:
>>
>> btree_submit_bio_hook()
>> |- __btree_submit_bio_start()
>>|- btree_csum_one_bio()
>>   |- csum_dirty_buffer()
>>  |- btrfs_check_leaf()
>>
>> By this we can ensure the leaf passed in is in a consistent state, and
>> can check it without causing tons of false alerts.
>>
>> Reported-by: Lakshmipathi.G 
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/disk-io.c | 26 +++---
>>  1 file changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index efce9a2fa9be..6c17bce2a05e 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -506,6 +506,7 @@ static int csum_dirty_buffer(struct btrfs_fs_info 
>> *fs_info, struct page *page)
>> u64 start = page_offset(page);
>> u64 found_start;
>> struct extent_buffer *eb;
>> +   int ret;
>>
>> eb = (struct extent_buffer *)page->private;
>> if (page != eb->pages[0])
>> @@ -524,7 +525,24 @@ static int csum_dirty_buffer(struct btrfs_fs_info 
>> *fs_info, struct page *page)
>> ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
>> btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
>>
>> -   return csum_tree_block(fs_info, eb, 0);
>> +   ret = csum_tree_block(fs_info, eb, 0);
>> +   if (ret)
>> +   return ret;
>> +
>> +#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
>> +   /*
>> +* Do extra check before we write the tree block into disk.
>> +*/
>> +   if (btrfs_header_level(eb) == 0) {
>> +   ret = btrfs_check_leaf(fs_info->tree_root, eb);
>> +   if (ret) {
>> +   btrfs_print_leaf(eb);
>> +   ASSERT(0);
>> +   return ret;
>> +   }
>> +   }
>> +#endif
>> +   return 0;
>>  }
>>
>>  static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
>> @@ -3847,12 +3865,6 @@ void btrfs_mark_buffer_dirty(struct extent_buffer 
>> *buf)
>> percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
>>  buf->len,
>>  fs_info->dirty_metadata_batch);
>> -#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
>> -   if (btrfs_header_level(buf) == 0 && btrfs_check_leaf(root, buf)) {
>> -   btrfs_print_leaf(buf);
>> -   ASSERT(0);
>> -   }
>> -#endif
> 
> So there's a reason why btrfs_check_leaf() was called here, at
> mark_buffer_dirty(), instead of somewhere else like csum_dirty_buffer().
> 
> The reason is that once some bad code inserts a key out of order, for
> example (or does any other bad stuff that check_leaf() caught before
> you added the tree-checker thing), we would get a trace that pinpoints
> exactly where the bad code is. With this change, we will only know
> something is bad when writeback of the leaf starts, and before that
> happens, the leaf might have been changed dozens of times by many
> different functions (and this happens very often, it's far from being
> an unusual case), in which case the given trace won't tell you which
> code misbehaved. This makes it harder to find bugs, and as it used to
> be it certainly helped me several times in the past. IOW, I would
> prefer what I mentioned earlier or, at the very least, to do those new
> checks that validate data only at writeback start time.

Re: [PATCH 06/11] btrfs: document device locking

2017-11-03 Thread Anand Jain



Thanks for writing this.


+ * - fs_devices::device_list_mutex (per-fs, with RCU)
+ *
+ *   protects updates to fs_devices::devices, ie. adding and deleting
+ *
+ *   simple list traversal with read-only actions can be done with RCU
+ *   protection
+ *
+ *   may be used to exclude some operations from running concurrently without
+ *   any modifications to the list (see write_all_supers)



+ * - volume_mutex
+ *
+ *   coarse lock owned by a mounted filesystem; used to exclude some operations
+ *   that cannot run in parallel and affect the higher-level properties of the
+ *   filesystem like: device add/deleting/resize/replace, or balance



+ * - chunk_mutex
+ *
+ *   protects chunks, adding or removing during allocation, trim or when
+ *   a new device is added/removed


::


+ * Lock nesting
+ * 
+ *
+ * uuid_mutex
+ *   volume_mutex
+ * device_list_mutex
+ *   chunk_mutex
+ * balance_mutex



If we had a list of the operations that take these locks, we could
map them accordingly for better clarity.


To me it looks like we have too many locks.
 - we don't have to differentiate the mounted and unmounted context
   for device locks.
 - Two locks would be sufficient: one for the device list
   (add/rm, replace, ...) and another for device property changes
   (resize, trim, ...).

Thanks, Anand

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Move leaf verification to correct timing to avoid false panic for sanity test

2017-11-03 Thread Filipe Manana
On Thu, Nov 2, 2017 at 7:04 AM, Qu Wenruo  wrote:
> [BUG]
> If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will
> instantly cause kernel panic like:
>
> --
> ...
> assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853
> ...
> Call Trace:
>  btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs]
>  setup_items_for_insert+0x385/0x650 [btrfs]
>  __btrfs_drop_extents+0x129a/0x1870 [btrfs]
> ...
> --
>
> [Cause]
> Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
> if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.
>
> However some btrfs_mark_buffer_dirty() callers, like
> setup_items_for_insert(), don't really initialize the item data but
> only initialize the item pointers, leaving the item data uninitialized.

So instead of doing this juggling, the best option would be to have it
not call mark_buffer_dirty(), and leave that responsibility to the
caller after it initializes the item data. I'll give you a very good
reason for that below.

>
> This makes the tree-checker catch uninitialized data as an error,
> causing such a panic.
>
> [Fix]
> The correct time to check leaf validity is before write IO or
> after read IO.
>
> Just like we have already done the tree validation check at the btree
> readpage end io hook, this patch will move the write-time tree checker
> to csum_dirty_buffer().
>
> csum_dirty_buffer() is called just before submitting the btree write
> bio, as the call path shows:
>
> btree_submit_bio_hook()
> |- __btree_submit_bio_start()
>|- btree_csum_one_bio()
>   |- csum_dirty_buffer()
>  |- btrfs_check_leaf()
>
> By this we can ensure the leaf passed in is in a consistent state, and
> can check it without causing tons of false alerts.
>
> Reported-by: Lakshmipathi.G 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/disk-io.c | 26 +++---
>  1 file changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index efce9a2fa9be..6c17bce2a05e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -506,6 +506,7 @@ static int csum_dirty_buffer(struct btrfs_fs_info 
> *fs_info, struct page *page)
> u64 start = page_offset(page);
> u64 found_start;
> struct extent_buffer *eb;
> +   int ret;
>
> eb = (struct extent_buffer *)page->private;
> if (page != eb->pages[0])
> @@ -524,7 +525,24 @@ static int csum_dirty_buffer(struct btrfs_fs_info 
> *fs_info, struct page *page)
> ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
> btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
>
> -   return csum_tree_block(fs_info, eb, 0);
> +   ret = csum_tree_block(fs_info, eb, 0);
> +   if (ret)
> +   return ret;
> +
> +#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
> +   /*
> +* Do extra check before we write the tree block into disk.
> +*/
> +   if (btrfs_header_level(eb) == 0) {
> +   ret = btrfs_check_leaf(fs_info->tree_root, eb);
> +   if (ret) {
> +   btrfs_print_leaf(eb);
> +   ASSERT(0);
> +   return ret;
> +   }
> +   }
> +#endif
> +   return 0;
>  }
>
>  static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
> @@ -3847,12 +3865,6 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
> percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
>  buf->len,
>  fs_info->dirty_metadata_batch);
> -#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
> -   if (btrfs_header_level(buf) == 0 && btrfs_check_leaf(root, buf)) {
> -   btrfs_print_leaf(buf);
> -   ASSERT(0);
> -   }
> -#endif

So there's a reason why btrfs_check_leaf() was called here, at
mark_buffer_dirty(), instead of somewhere else like csum_dirty_buffer().

The reason is that once some bad code inserts a key out of order, for
example (or does any other bad stuff that check_leaf() caught before
you added the tree-checker thing), we would get a trace that pinpoints
exactly where the bad code is. With this change, we will only know
something is bad when writeback of the leaf starts, and before that
happens, the leaf might have been changed dozens of times by many
different functions (and this happens very often, it's far from being
an unusual case), in which case the given trace won't tell you which
code misbehaved. This makes it harder to find bugs, and as it used to
be it certainly helped me several times in the past. IOW, I would
prefer what I mentioned earlier or, at the very least, to do those new
checks that validate data only at writeback start time.

>  }
>
>  static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
> --
> 2.14.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo 

Re: [PATCH v2] Btrfs: add support for fallocate's zero range operation

2017-11-03 Thread Filipe Manana
On Fri, Nov 3, 2017 at 10:29 AM, Filipe Manana  wrote:
> On Fri, Nov 3, 2017 at 9:30 AM, Nikolay Borisov  wrote:
>>
>>
>> On 25.10.2017 17:59, fdman...@kernel.org wrote:
>>> From: Filipe Manana 
>>>
>>> This implements support for the zero range operation of fallocate. For now
>>> at least it's as simple as possible while reusing most of the existing
>>> fallocate and hole punching infrastructure.
>>>
>>> Signed-off-by: Filipe Manana 
>>> ---
>>>
>>> V2: Removed double inode unlock on error path from failure to lock range.
>>>
>>>  fs/btrfs/file.c | 332 
>>> +---
>>>  1 file changed, 290 insertions(+), 42 deletions(-)
>>>
>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>> index aafcc785f840..e0d15c0d1641 100644
>>> --- a/fs/btrfs/file.c
>>> +++ b/fs/btrfs/file.c
>>> @@ -2448,7 +2448,48 @@ static int find_first_non_hole(struct inode *inode, 
>>> u64 *start, u64 *len)
>>>   return ret;
>>>  }
>>>
>>> -static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>>> +static int btrfs_punch_hole_lock_range(struct inode *inode,
>>> +const u64 lockstart,
>>> +const u64 lockend,
>>> +struct extent_state **cached_state)
>>> +{
>>> + while (1) {
>>> + struct btrfs_ordered_extent *ordered;
>>> + int ret;
>>> +
>>> + truncate_pagecache_range(inode, lockstart, lockend);
>>> +
>>> + lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
>>> +  cached_state);
>>> + ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
>>> +
>>> + /*
>>> +  * We need to make sure we have no ordered extents in this 
>>> range
>>> +  * and nobody raced in and read a page in this range, if we 
>>> did
>>> +  * we need to try again.
>>> +  */
>>> + if ((!ordered ||
>>> + (ordered->file_offset + ordered->len <= lockstart ||
>>> +  ordered->file_offset > lockend)) &&
>>> +  !btrfs_page_exists_in_range(inode, lockstart, lockend)) {
>>> + if (ordered)
>>> + btrfs_put_ordered_extent(ordered);
>>> + break;
>>> + }
>>> + if (ordered)
>>> + btrfs_put_ordered_extent(ordered);
>>> + unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
>>> +  lockend, cached_state, GFP_NOFS);
>>> + ret = btrfs_wait_ordered_range(inode, lockstart,
>>> +lockend - lockstart + 1);
>>> + if (ret)
>>> + return ret;
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> +static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len,
>>> + bool lock_inode)
>>>  {
>>>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>>   struct btrfs_root *root = BTRFS_I(inode)->root;
>>> @@ -2477,7 +2518,8 @@ static int btrfs_punch_hole(struct inode *inode, 
>>> loff_t offset, loff_t len)
>>>   if (ret)
>>>   return ret;
>>>
>>> - inode_lock(inode);
>>> + if (lock_inode)
>>> + inode_lock(inode);
>>>   ino_size = round_up(inode->i_size, fs_info->sectorsize);
>>>   ret = find_first_non_hole(inode, &offset, &len);
>>>   if (ret < 0)
>>> @@ -2516,7 +2558,8 @@ static int btrfs_punch_hole(struct inode *inode, 
>>> loff_t offset, loff_t len)
>>>   truncated_block = true;
>>>   ret = btrfs_truncate_block(inode, offset, 0, 0);
>>>   if (ret) {
>>> - inode_unlock(inode);
>>> + if (lock_inode)
>>> + inode_unlock(inode);
>>>   return ret;
>>>   }
>>>   }
>>> @@ -2564,38 +2607,12 @@ static int btrfs_punch_hole(struct inode *inode, 
>>> loff_t offset, loff_t len)
>>>   goto out_only_mutex;
>>>   }
>>>
>>> - while (1) {
>>> - struct btrfs_ordered_extent *ordered;
>>> -
>>> - truncate_pagecache_range(inode, lockstart, lockend);
>>> -
>>> - lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
>>> -  &cached_state);
>>> - ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
>>> -
>>> - /*
>>> -  * We need to make sure we have no ordered extents in this 
>>> range
>>> -  * and nobody raced in and read a page in this range, if we 
>>> did
>>> -  * we need to try again.
>>> -  */
>>> - if ((!ordered ||
>>> - (ordered->file_offset + ordered->len <= lockstart ||
>>> -  

Re: [PATCH v2] Btrfs: add support for fallocate's zero range operation

2017-11-03 Thread Filipe Manana
On Fri, Nov 3, 2017 at 9:30 AM, Nikolay Borisov  wrote:
>
>
> On 25.10.2017 17:59, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> This implements support for the zero range operation of fallocate. For now
>> at least it's as simple as possible while reusing most of the existing
>> fallocate and hole punching infrastructure.
>>
>> Signed-off-by: Filipe Manana 
>> ---
>>
>> V2: Removed double inode unlock on error path from failure to lock range.
>>
>>  fs/btrfs/file.c | 332 
>> +---
>>  1 file changed, 290 insertions(+), 42 deletions(-)
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index aafcc785f840..e0d15c0d1641 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -2448,7 +2448,48 @@ static int find_first_non_hole(struct inode *inode, 
>> u64 *start, u64 *len)
>>   return ret;
>>  }
>>
>> -static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>> +static int btrfs_punch_hole_lock_range(struct inode *inode,
>> +const u64 lockstart,
>> +const u64 lockend,
>> +struct extent_state **cached_state)
>> +{
>> + while (1) {
>> + struct btrfs_ordered_extent *ordered;
>> + int ret;
>> +
>> + truncate_pagecache_range(inode, lockstart, lockend);
>> +
>> + lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
>> +  cached_state);
>> + ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
>> +
>> + /*
>> +  * We need to make sure we have no ordered extents in this 
>> range
>> +  * and nobody raced in and read a page in this range, if we did
>> +  * we need to try again.
>> +  */
>> + if ((!ordered ||
>> + (ordered->file_offset + ordered->len <= lockstart ||
>> +  ordered->file_offset > lockend)) &&
>> +  !btrfs_page_exists_in_range(inode, lockstart, lockend)) {
>> + if (ordered)
>> + btrfs_put_ordered_extent(ordered);
>> + break;
>> + }
>> + if (ordered)
>> + btrfs_put_ordered_extent(ordered);
>> + unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
>> +  lockend, cached_state, GFP_NOFS);
>> + ret = btrfs_wait_ordered_range(inode, lockstart,
>> +lockend - lockstart + 1);
>> + if (ret)
>> + return ret;
>> + }
>> + return 0;
>> +}
>> +
>> +static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len,
>> + bool lock_inode)
>>  {
>>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>   struct btrfs_root *root = BTRFS_I(inode)->root;
>> @@ -2477,7 +2518,8 @@ static int btrfs_punch_hole(struct inode *inode, 
>> loff_t offset, loff_t len)
>>   if (ret)
>>   return ret;
>>
>> - inode_lock(inode);
>> + if (lock_inode)
>> + inode_lock(inode);
>>   ino_size = round_up(inode->i_size, fs_info->sectorsize);
>>   ret = find_first_non_hole(inode, &offset, &len);
>>   if (ret < 0)
>> @@ -2516,7 +2558,8 @@ static int btrfs_punch_hole(struct inode *inode, 
>> loff_t offset, loff_t len)
>>   truncated_block = true;
>>   ret = btrfs_truncate_block(inode, offset, 0, 0);
>>   if (ret) {
>> - inode_unlock(inode);
>> + if (lock_inode)
>> + inode_unlock(inode);
>>   return ret;
>>   }
>>   }
>> @@ -2564,38 +2607,12 @@ static int btrfs_punch_hole(struct inode *inode, 
>> loff_t offset, loff_t len)
>>   goto out_only_mutex;
>>   }
>>
>> - while (1) {
>> - struct btrfs_ordered_extent *ordered;
>> -
>> - truncate_pagecache_range(inode, lockstart, lockend);
>> -
>> - lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
>> -  &cached_state);
>> - ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
>> -
>> - /*
>> -  * We need to make sure we have no ordered extents in this 
>> range
>> -  * and nobody raced in and read a page in this range, if we did
>> -  * we need to try again.
>> -  */
>> - if ((!ordered ||
>> - (ordered->file_offset + ordered->len <= lockstart ||
>> -  ordered->file_offset > lockend)) &&
>> -  !btrfs_page_exists_in_range(inode, lockstart, lockend)) {
>> - if (ordered)
>> - btrfs_put_ordered_extent(ordered);
>> 

Re: [PATCH v2] Btrfs: add support for fallocate's zero range operation

2017-11-03 Thread Nikolay Borisov


On 25.10.2017 17:59, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> This implements support for the zero range operation of fallocate. For now
> at least it's as simple as possible while reusing most of the existing
> fallocate and hole punching infrastructure.
> 
> Signed-off-by: Filipe Manana 
> ---
> 
> V2: Removed double inode unlock on error path from failure to lock range.
> 
>  fs/btrfs/file.c | 332 
> +---
>  1 file changed, 290 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index aafcc785f840..e0d15c0d1641 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2448,7 +2448,48 @@ static int find_first_non_hole(struct inode *inode, 
> u64 *start, u64 *len)
>   return ret;
>  }
>  
> -static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> +static int btrfs_punch_hole_lock_range(struct inode *inode,
> +const u64 lockstart,
> +const u64 lockend,
> +struct extent_state **cached_state)
> +{
> + while (1) {
> + struct btrfs_ordered_extent *ordered;
> + int ret;
> +
> + truncate_pagecache_range(inode, lockstart, lockend);
> +
> +   lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> +  cached_state);
> + ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
> +
> + /*
> +  * We need to make sure we have no ordered extents in this range
> +  * and nobody raced in and read a page in this range, if we did
> +  * we need to try again.
> +  */
> + if ((!ordered ||
> + (ordered->file_offset + ordered->len <= lockstart ||
> +  ordered->file_offset > lockend)) &&
> +  !btrfs_page_exists_in_range(inode, lockstart, lockend)) {
> + if (ordered)
> + btrfs_put_ordered_extent(ordered);
> + break;
> + }
> + if (ordered)
> + btrfs_put_ordered_extent(ordered);
> +   unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
> +  lockend, cached_state, GFP_NOFS);
> + ret = btrfs_wait_ordered_range(inode, lockstart,
> +lockend - lockstart + 1);
> + if (ret)
> + return ret;
> + }
> + return 0;
> +}
> +
> +static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len,
> + bool lock_inode)
>  {
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   struct btrfs_root *root = BTRFS_I(inode)->root;
> @@ -2477,7 +2518,8 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
> offset, loff_t len)
>   if (ret)
>   return ret;
>  
> - inode_lock(inode);
> + if (lock_inode)
> + inode_lock(inode);
>   ino_size = round_up(inode->i_size, fs_info->sectorsize);
>   ret = find_first_non_hole(inode, &offset, &len);
>   if (ret < 0)
> @@ -2516,7 +2558,8 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
> offset, loff_t len)
>   truncated_block = true;
>   ret = btrfs_truncate_block(inode, offset, 0, 0);
>   if (ret) {
> - inode_unlock(inode);
> + if (lock_inode)
> + inode_unlock(inode);
>   return ret;
>   }
>   }
> @@ -2564,38 +2607,12 @@ static int btrfs_punch_hole(struct inode *inode, 
> loff_t offset, loff_t len)
>   goto out_only_mutex;
>   }
>  
> - while (1) {
> - struct btrfs_ordered_extent *ordered;
> -
> - truncate_pagecache_range(inode, lockstart, lockend);
> -
> -   lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> -  &cached_state);
> - ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
> -
> - /*
> -  * We need to make sure we have no ordered extents in this range
> -  * and nobody raced in and read a page in this range, if we did
> -  * we need to try again.
> -  */
> - if ((!ordered ||
> - (ordered->file_offset + ordered->len <= lockstart ||
> -  ordered->file_offset > lockend)) &&
> -  !btrfs_page_exists_in_range(inode, lockstart, lockend)) {
> - if (ordered)
> - btrfs_put_ordered_extent(ordered);
> - break;
> - }
> - if (ordered)
> - btrfs_put_ordered_extent(ordered);
> - unlock_extent_cached(&BTRFS_I(inode)->io_tree, 

Re: [PATCH 5/8] btrfs-progs: ctree: Introduce function to create an empty tree

2017-11-03 Thread Lu Fengqi

On 10/27/2017 03:29 PM, Qu Wenruo wrote:

Introduce a new function, btrfs_create_tree(), to create an empty tree.

Currently there is only one caller that creates a new tree, namely
for the data reloc tree in mkfs.
However, it copies the fs tree to create the new root.

Copying the fs tree is not a good idea if we only need an empty
tree.

So here introduce a new function, btrfs_create_tree(), to create a new
tree, which will handle the following things:
1) New tree root leaf
Using generic tree allocation

2) New root item in tree root

3) Modify special tree root pointers in fs_info
Only quota_root is supported yet, but this can be expanded easily

This patch provides the basis to implement quota support in mkfs.

Signed-off-by: Qu Wenruo 
---
  ctree.c | 109 
  ctree.h |   2 ++
  2 files changed, 111 insertions(+)

diff --git a/ctree.c b/ctree.c
index 4fc33b14000a..c707be58c413 100644
--- a/ctree.c
+++ b/ctree.c
@@ -22,6 +22,7 @@
  #include "repair.h"
  #include "internal.h"
  #include "sizes.h"
+#include "utils.h"
  
  static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
  *root, struct btrfs_path *path, int level);
@@ -136,6 +137,114 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
return 0;
  }
  
+/*
+ * Create a new tree root, with root objectid set to @objectid.
+ *
+ * NOTE: Doesn't support tree with non-zero offset, like tree reloc tree.
+ */
+int btrfs_create_root(struct btrfs_trans_handle *trans,
+ struct btrfs_fs_info *fs_info, u64 objectid)
+{
+   struct extent_buffer *node;
+   struct btrfs_root *new_root;
+   struct btrfs_disk_key disk_key;
+   struct btrfs_key location;
+   struct btrfs_root_item root_item = { 0 };
+   int ret;
+
+   new_root = malloc(sizeof(*new_root));
+   if (!new_root)
+   return -ENOMEM;
+
+   btrfs_setup_root(new_root, fs_info, objectid);
+   if (!is_fstree(objectid))
+   new_root->track_dirty = 1;
+   add_root_to_dirty_list(new_root);


Since add_root_to_dirty_list() only adds roots with track_dirty != 0 to
the dirty list, why not write it like the following?


if (!is_fstree(objectid)) {
new_root->track_dirty = 1;
add_root_to_dirty_list(new_root);
}


+
+   new_root->objectid = objectid;
+   new_root->root_key.objectid = objectid;


These have been initialized in btrfs_setup_root(), so we don't need to
initialize them again.



+   new_root->root_key.type = BTRFS_ROOT_ITEM_KEY;
+   new_root->root_key.offset = 0;
+
+   node = btrfs_alloc_free_block(trans, new_root, fs_info->nodesize,
+ objectid, &disk_key, 0, 0, 0);
+   if (IS_ERR(node)) {
+   ret = PTR_ERR(node);
+   error("failed to create root node for tree %llu: %d (%s)",
+ objectid, ret, strerror(-ret));
+   return ret;
+   }
+   new_root->node = node;
+
+   btrfs_set_header_generation(node, trans->transid);
+   btrfs_set_header_backref_rev(node, BTRFS_MIXED_BACKREF_REV);
+   btrfs_clear_header_flag(node, BTRFS_HEADER_FLAG_RELOC |
+ BTRFS_HEADER_FLAG_WRITTEN);
+   btrfs_set_header_owner(node, objectid);
+   btrfs_set_header_nritems(node, 0);
+   btrfs_set_header_level(node, 0);
+   write_extent_buffer(node, fs_info->fsid, btrfs_header_fsid(),
+   BTRFS_FSID_SIZE);
+   ret = btrfs_inc_ref(trans, new_root, node, 0);
+   if (ret < 0)
+   goto free;
+
+   /*
+* Special tree roots may need to modify pointers in @fs_info
+* Only quota is supported yet.
+*/
+   switch (objectid) {
+   case BTRFS_QUOTA_TREE_OBJECTID:
+   if (fs_info->quota_root) {
+   error("quota root already exists");
+   ret = -EEXIST;
+   goto free;
+   }
+   fs_info->quota_root = new_root;
+   fs_info->quota_enabled = 1;
+   break;
+   /*
+* Essential trees can't be created by this function, yet.
+* As we expect such skeleton exists, or a lot of functions like
+* btrfs_alloc_free_block() doesn't work at all
+*/
+   case BTRFS_ROOT_TREE_OBJECTID:
+   case BTRFS_EXTENT_TREE_OBJECTID:
+   case BTRFS_CHUNK_TREE_OBJECTID:
+   case BTRFS_FS_TREE_OBJECTID:
+   ret = -EEXIST;
+   goto free;
+   default:
+   /* Subvolume trees don't need special handles */
+   if (is_fstree(objectid))
+   break;
+   /* Other special trees are not supported yet */
+   ret = -ENOTTY;
+   goto free;
+   }
+   btrfs_mark_buffer_dirty(node);
btrfs_set_root_bytenr(&root_item, 

[PATCH RESEND 4/4] btrfs-progs: test: Add test image for lowmem mode referencer count mismatch false alert

2017-11-03 Thread Lu Fengqi
Add an image which can reproduce the extent item referencer count
mismatch false alert for lowmem mode.

Reported-by: Marc MERLIN 
Signed-off-by: Lu Fengqi 
---
 .../ref_count_mismatch_false_alert.img   | Bin 0 -> 4096 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 
tests/fsck-tests/020-extent-ref-cases/ref_count_mismatch_false_alert.img

diff --git 
a/tests/fsck-tests/020-extent-ref-cases/ref_count_mismatch_false_alert.img 
b/tests/fsck-tests/020-extent-ref-cases/ref_count_mismatch_false_alert.img
new file mode 100644
index 
..85110a813b5d00cb35d23babc70d57510cae19b0
GIT binary patch
literal 4096
zcmeH}c|6oxAIE>Q7>1#-&$wtTF}mEwgG?nemXMMqOIb3S>}eQX#*#aBGA>1k8`yK>to-!TQWGJDmK5OLv(4g?)FJ-uijX$PT~x!ht)C
zY5P1L4Blq?7v2ronfPws75J{e|40Gapo2(xdkdYH52nfE^?_C$uX-3`wy=*GP3mJul9nzPWam{Dy8Jm@HHSh;$0D%?#(CLV)}m
z4=I4Q9U&_*-d#g>Vt=XgLnU`1QDX$?18gfj8HGr{0^h5Eif=UgaWL?esD!xAzoX
zSPYMeybi82L>KA))qZ@PiatGF(7mnZRyApaxavmP17%d2>L@Te*1{-nwKu1W&2h)^EItwUC=(6YpcF!`TVtS0q;I(2VAvqY@;m@EcAW6pd@gCaAJWrm=LDD!yE;pNYVL%W@MuW`_Z;`{%KeXc
z6yV)hwUaVf1fT=6WYY$FN?k9wupm%KCipH*$G#z5H%XvAY#nOQCTEn4S1`={@Gxr=
zt~K2#@Pm%O&0L?%-m|gq$XuO~V6@K7~nFvdMFs;1g@2c>FiCFkFwOPj
zJ@{D_aGpeL|`wvq=c0(z<`nlQb)97j(gm&49Ve)kpX|
z<;B$0X?oL^)zQAzKjaU{M{UWA^-ua~yn7ruy@aAhosds;P119BS?So?C%gCglEl8$
z!q`l|sGp3QukgjlxR6qk7a0or0O>(q(`Mo3{wGxwNaFIDpC3^<@r`gzm*=R
z(~pa6-#RgJ49%V|*lxGLiBV!o%B4?x1D`dX3BolS52|cfnP<$dmrPzBi`{wY&3LKa
z^*j)&6)GvZYgkKjutQN{?yTvtlK~nv(x3YF=`p%HO0B^Mu2$SrFW>L|YDR+WSau}8
zGLxS(4p|G=KszkW2)*cgKmTUKZ|M)8_OesX>$?2jYiGPp=H$$4dtF)N)MQ;+U13
z`)AJ5wzSW=pI@2Vf2Pllk4;|F*k0jSy-7Gq`O|pI`P!Oy(LcCB!h09GCJS$6atEv*
zH@lZkR9r7zb2jVUR4(!g6YmWL@*r;bEA4Np)V
zpnjftsG)f^C+HBxJwT~T^|!1&$(M4_6-Z3IBDMBkp472TXb=iM^wPP8G#nv
zC6qWNQ*1w+^Y~ZT^et>M?!T0nVBQK+{Krai+mmboyL+aKu%`-FD=Vc6
zte)<=7aT3k0(l$=dFZ!PfHy4!dSx2@inv~XylqQqdaT@+G9X;+f_43L!#JKCUAHh8

[PATCH RESEND 3/4] btrfs-progs: lowmem check: Fix false alert about referencer count mismatch

2017-11-03 Thread Lu Fengqi
The normal back reference counting doesn't care about the extent
referred to by the extent data in a shared leaf. The
check_extent_data_backref() function needs to skip leaves whose owner
doesn't match the root_id.

Reported-by: Marc MERLIN 
Signed-off-by: Lu Fengqi 
---
 cmds-check.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/cmds-check.c b/cmds-check.c
index 5750bb72..a93ac2c8 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -12468,7 +12468,8 @@ static int check_extent_data_backref(struct btrfs_fs_info *fs_info,
 		leaf = path.nodes[0];
 		slot = path.slots[0];
 
-		if (slot >= btrfs_header_nritems(leaf))
+		if (slot >= btrfs_header_nritems(leaf) ||
+		    btrfs_header_owner(leaf) != root_id)
 			goto next;
 		btrfs_item_key_to_cpu(leaf, &key, slot);
 		if (key.objectid != objectid || key.type != BTRFS_EXTENT_DATA_KEY)
-- 
2.15.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs seed question

2017-11-03 Thread Kai Krakow
On Thu, 12 Oct 2017 09:20:28 -0400, Joseph Dunn wrote:

> On Thu, 12 Oct 2017 12:18:01 +0800
> Anand Jain  wrote:
> 
> > On 10/12/2017 08:47 AM, Joseph Dunn wrote:  
> > > After seeing how btrfs seeds work I wondered if it was possible
> > > to push specific files from the seed to the rw device.  I know
> > > that removing the seed device will flush all the contents over to
> > > the rw device, but what about flushing individual files on demand?
> > > 
> > > I found that opening a file, reading the contents, seeking back
> > > to 0, and writing out the contents does what I want, but I was
> > > hoping for a bit less of a hack.
> > > 
> > > Is there maybe an ioctl or something else that might trigger a
> > > similar action?
> > 
> >You mean to say - seed-device delete to trigger copy of only the 
> > specified or the modified files only, instead of whole of
> > seed-device ? What's the use case around this ?
> >   
> 
> Not quite.  While the seed device is still connected I would like to
> force some files over to the rw device.  The use case is basically a
> much slower link to a seed device holding significantly more data than
> we currently need.  An example would be a slower iscsi link to the
> seed device and a local rw ssd.  I would like fast access to a
> certain subset of files, likely larger than the memory cache will
> accommodate.  If at a later time I want to discard the image as a
> whole I could unmount the file system or if I want a full local copy
> I could delete the seed-device to sync the fs.  In the mean time I
> would have access to all the files, with some slower (iscsi) and some
> faster (ssd) and the ability to pick which ones are in the faster
> group at the cost of one content transfer.
> 
> I'm not necessarily looking for a new feature addition, just if there
> is some existing call that I can make to push specific files from the
> slow mirror to the fast one.  If I had to push a significant amount of
> metadata that would be fine, but the file contents feeding some
> computations might be large and useful only to certain clients.
> 
> So far I found that I can re-write the file with the same contents and
> thanks to the lack of online dedupe these writes land on the rw mirror
> so later reads to that file should not hit the slower mirror.  By the
> way, if I'm misunderstanding how the read process would work after the
> file push please correct me.
> 
> I hope this makes sense but I'll try to clarify further if you have
> more questions.
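
For what it's worth, the read-and-rewrite hack described above can be
scripted per file with plain dd; a rough sketch (path hypothetical),
where conv=notrunc makes dd rewrite the file in place so the new,
copied-on-write extents land on the writable device:

# dd if=/mnt/fs/some/file of=/mnt/fs/some/file conv=notrunc bs=1M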

You could try to wrap something like bcache on top of the iscsi device,
then make it a read-mostly cache (like bcache's write-around mode). This
probably involves rewriting the iscsi contents to add a bcache header.
You could try mdcache instead.

Then you sacrifice a few gigabytes of local SSD storage for the caching
layer.

I guess that you're sharing the seed device with different machines. As
bcache will add a protective superblock, you may need to thin-clone the
seed image on the source to have independent superblocks for each
bcache instance. Not sure how this applies to mdcache as I never used
it.

But the caching approach is probably the easiest way to go for you. And
it's mostly automatic once deployed: you don't have to manually
choose which files to move to the sprout...
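
For illustration, a rough bcache setup along those lines might look like
this sketch (device names are examples only; make-bcache -B reformats
the iscsi-backed device, matching the caveat above about rewriting its
contents; the cache set uuid comes from bcache-super-show):

# make-bcache -B /dev/sdb            (iscsi-backed device)
# make-bcache -C /dev/nvme0n1p3      (local SSD partition as cache)
# echo $CSET_UUID > /sys/block/bcache0/bcache/attach
# echo writearound > /sys/block/bcache0/bcache/cache_mode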


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with file system

2017-11-03 Thread Kai Krakow
On Tue, 31 Oct 2017 07:28:58 -0400, "Austin S. Hemmelgarn" wrote:

> On 2017-10-31 01:57, Marat Khalili wrote:
> > On 31/10/17 00:37, Chris Murphy wrote:  
> >> But off hand it sounds like hardware was sabotaging the expected
> >> write ordering. How to test a given hardware setup for that, I
> >> think, is really overdue. It affects literally every file system,
> >> and Linux storage technology.
> >>
> >> It kinda sounds like to me something other than supers is being
> >> overwritten too soon, and that's why it's possible for none of the
> >> backup roots to find a valid root tree, because all four possible
> >> root trees either haven't actually been written yet (still) or
> >> they've been overwritten, even though the super is updated. But
> >> again, it's speculation, we don't actually know why your system
> >> was no longer mountable.  
> > Just a detached view: I know hardware should respect
> > ordering/barriers and such, but how hard is it really to avoid
> > overwriting at least one complete metadata tree for half an hour
> > (even better, yet another one for a day)? Just metadata, not data
> > extents.  
> If you're running on an SSD (or thinly provisioned storage, or
> something else which supports discards) and have the 'discard' mount
> option enabled, then there is no backup metadata tree (this issue was
> mentioned on the list a while ago, but nobody ever replied), because
> it's already been discarded.  This is ideally something which should
> be addressed (we need some sort of discard queue for handling in-line
> discards), but it's not easy to address.
> 
> Otherwise, it becomes a question of space usage on the filesystem,
> and this is just another reason to keep some extra slack space on the
> FS (though that doesn't help _much_, it does help).  This, in theory,
> could be addressed, but it probably can't be applied across mounts of
> a filesystem without an on-disk format change.

Well, maybe inline discard is working at the wrong level. It should
kick in when the reference through any of the backup roots is dropped,
not when the current instance is dropped.

Without knowledge of the internals, I guess discards could be queued in
a new tree in btrfs, and issued only once the last backup root
referencing an extent has dropped it. But this will probably add some
bad performance spikes.

I wonder how a regular fstrim run through cron applies to this problem?
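
(For reference, such a periodic trim can be as simple as enabling the
fstrim.timer unit shipped by util-linux on most distros, or a weekly
cron job running a one-liner; the mount point is an example:)

# systemctl enable fstrim.timer
# fstrim -v /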


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: defragmenting best practice?

2017-11-03 Thread Kai Krakow
On Thu, 2 Nov 2017 22:47:31 -0400, Dave wrote:

> On Thu, Nov 2, 2017 at 5:16 PM, Kai Krakow 
> wrote:
> 
> >
> > You may want to try btrfs autodefrag mount option and see if it
> > improves things (tho, the effect may take days or weeks to apply if
> > you didn't enable it right from the creation of the filesystem).
> >
> > Also, autodefrag will probably unshare reflinks on your snapshots.
> > You may be able to use bees[1] to work against this effect. Its
> > interaction with autodefrag is not well tested but it works fine
> > for me. Also, bees is able to reduce some of the fragmentation
> > during deduplication because it will rewrite extents back into
> > bigger chunks (but only for duplicated data).
> >
> > [1]: https://github.com/Zygo/bees  
> 
> I will look into bees. And yes, I plan to try autodefrag. (I already
> have it enabled now.) However, I need to understand something about
> how btrfs send-receive works in regard to reflinks and fragmentation.
> 
> Say I have 2 snapshots on my live volume. The earlier one of them has
> already been sent to another block device by btrfs send-receive (full
> backup). Now defrag runs on the live volume and breaks some percentage
> of the reflinks. At this point I do an incremental btrfs send-receive
> using "-p" (or "-c") with the diff going to the same other block
> device where the prior snapshot was already sent.
> 
> Will reflinks be "made whole" (restored) on the receiving block
> device? Or is the state of the source volume replicated so closely
> that reflink status is the same on the target?
> 
> Also, is fragmentation reduced on the receiving block device?
> 
> My expectation is that fragmentation would be reduced and duplication
> would be reduced too. In other words, does send-receive result in
> defragmentation and deduplication too?

As far as I understand, btrfs send/receive doesn't create an exact
mirror. It just replays the extent operations between generation
numbers. That is: if it finds new extents referenced between
generations, it will write _new_ extents to the destination.

So, no, it won't reduce fragmentation or duplication. It just keeps
reflinks intact as long as such extents weren't touched within the
generation range. Otherwise they are rewritten as new extents.

Autodefrag and deduplication processes will as such probably increase
duplication at the destination. A developer may have a better clue, tho.
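
As a concrete sketch of that incremental flow (paths and snapshot names
are hypothetical; both snapshots must be read-only and the parent
snapshot must already exist on the destination):

# btrfs send -p /mnt/live/snap.old /mnt/live/snap.new | \
#       btrfs receive /mnt/backup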


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: defragmenting best practice?

2017-11-03 Thread Kai Krakow
On Fri, 3 Nov 2017 08:58:22 +0300, Marat Khalili wrote:

> On 02/11/17 04:39, Dave wrote:
> > I'm going to make this change now. What would be a good way to
> > implement this so that the change applies to the $HOME/.cache of
> > each user?  
> I'd make each user's .cache a symlink (that should work; if it
> doesn't, use a bind mount) to a per-user directory on some separately
> mounted volume with the necessary options.

On a systemd system, each user already has a private tmpfs location
at /run/user/$(id -u).

You could add to the central login script:

# CACHE_DIR="/run/user/$(id -u)/cache"
# mkdir -p "$CACHE_DIR" && ln -snf "$CACHE_DIR" "$HOME/.cache"

You should not run this as root (because of mkdir -p).

You could wrap it into an if statement:

# if [ "$(whoami)" != "root" ]; then
#   ...
# fi
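
Putting the two pieces together, the whole fragment might read:

# if [ "$(whoami)" != "root" ]; then
#   CACHE_DIR="/run/user/$(id -u)/cache"
#   mkdir -p "$CACHE_DIR" && ln -snf "$CACHE_DIR" "$HOME/.cache"
# fi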


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: defragmenting best practice?

2017-11-03 Thread Kai Krakow
On Thu, 2 Nov 2017 22:59:36 -0400, Dave wrote:

> On Thu, Nov 2, 2017 at 7:07 AM, Austin S. Hemmelgarn
>  wrote:
> > On 2017-11-01 21:39, Dave wrote:  
> >> I'm going to make this change now. What would be a good way to
> >> implement this so that the change applies to the $HOME/.cache of
> >> each user?
> >>
> >> The simple way would be to create a new subvolume for each existing
> >> user and mount it at $HOME/.cache in /etc/fstab, hard coding that
> >> mount location for each user. I don't mind doing that as there are
> >> only 4 users to consider. One minor concern is that it adds an
> >> unexpected step to the process of creating a new user. Is there a
> >> better way?
> >>  
> > The easiest option is to just make sure nobody is logged in and run
> > the following shell script fragment:
> >
> > for dir in /home/* ; do
> >     rm -rf "$dir/.cache"
> >     btrfs subvolume create "$dir/.cache"
> > done
> >
> > And then add something to the user creation scripts to create that
> > subvolume.  This approach won't pollute /etc/fstab, will still
> > exclude the directory from snapshots, and doesn't require any
> > hugely creative work to integrate with user creation and deletion.
> >
> > In general, the contents of the .cache directory are just that,
> > cached data. Provided nobody is actively accessing it, it's
> > perfectly safe to just nuke the entire directory...  
> 
> I like this suggestion. Thank you. I had intended to mount the .cache
> subvolumes with the NODATACOW option. However, with this approach, I
> won't be explicitly mounting the .cache subvolumes. Is it possible to
> use "chattr +C $dir/.cache" in that loop even though it is a
> subvolume? And, is setting the .cache directory to NODATACOW the right
> choice given this scenario? From earlier comments, I believe it is,
> but I want to be sure I understood correctly.

It is important to apply "chattr +C" to the directory while it is still
_empty_: even when used recursively, the flag won't take effect on
already existing, non-empty files. But the +C attribute is inherited by
newly created files and directories, so simply follow the "chattr +C on
an empty directory" rule and you're all set.

BTW: You cannot mount subvolumes from an already mounted btrfs device
with different mount options. That is currently not implemented (except
for maybe a very few options). So the fstab approach probably wouldn't
have helped you (depending on your partition layout).

You can simply create subvolumes in the locations needed and they are
implicitly mounted. Then change the particular subvolume's cow behavior
with chattr.
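
In other words, for a single user the sequence would be something like
this (path is an example; lsattr merely verifies the flag):

# btrfs subvolume create /home/user/.cache
# chattr +C /home/user/.cache
# lsattr -d /home/user/.cache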


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)

2017-11-03 Thread Kai Krakow
On Thu, 2 Nov 2017 23:24:29 -0400, Dave wrote:

> On Thu, Nov 2, 2017 at 4:46 PM, Kai Krakow 
> wrote:
> > Am Wed, 1 Nov 2017 02:51:58 -0400
> > schrieb Dave :
> >  
>  [...]  
>  [...]  
>  [...]  
> >>
> >> Thanks for confirming. I must have missed those reports. I had
> >> never considered this idea until now -- but I like it.
> >>
> >> Are there any blogs or wikis where people have done something
> >> similar to what we are discussing here?  
> >
> > I used rsync before, backup source and destination both were btrfs.
> > I was experiencing the same btrfs bug from time to time on both
> > devices, luckily not at the same time.
> >
> > I instead switched to using borgbackup, and xfs as the destination
> > (to not fall the same-bug-in-two-devices pitfall).  
> 
> I'm going to stick with btrfs everywhere. My reasoning is that my
> biggest pitfalls will be related to lack of knowledge. So focusing on
> learning one filesystem better (vs poorly learning two) is the better
> strategy for me, given my limited time. (I'm not an IT professional of
> any sort.)
> 
> Is there any problem with the Borgbackup repository being on btrfs?

No. I just wanted to point out that keeping backup and source on
different media (which includes different technology, too) is common
best practice and adheres to the 3-2-1 backup strategy.


> > Borgbackup achieves a
> > much higher deduplication density and compression, and as such also
> > is able to store much more backup history in the same storage
> > space. The first run is much slower than rsync (due to enabled
> > compression) but successive runs are much faster (like 20 minutes
> > per backup run instead of 4-5 hours).
> >
> > I'm currently storing 107 TB of backup history in just 2.2 TB backup
> > space, which counts a little more than one year of history now,
> > containing 56 snapshots. This is my retention policy:
> >
> >   * 5 yearly snapshots
> >   * 12 monthly snapshots
> >   * 14 weekly snapshots (worth around 3 months)
> >   * 30 daily snapshots
> >
> > Restore is fast enough, and a snapshot can even be fuse-mounted
> > (tho, in that case mounted access can be very slow navigating
> > directories).
> >
> > With latest borgbackup version, the backup time increased to around
> > 1 hour from 15-20 minutes in the previous version. That is due to
> > switching the file cache strategy from mtime to ctime. This can be
> > tuned to get back to old performance, but it may miss some files
> > during backup if you're doing awkward things to file timestamps.
> >
> > I'm also backing up some servers with it now, then use rsync to sync
> > the borg repository to an offsite location.
> >
> > Combined with same-fs local btrfs snapshots with short retention
> > times, this could be a viable solution for you.  
> 
> Yes, I appreciate the idea. I'm going to evaluate both rsync and
> Borgbackup.
> 
> The advantage of rsync, I think, is that it will likely run in just a
> couple minutes. That will allow me to run it hourly and to keep my
> live volume almost entirely free of snapshots and fully defragmented.
> It's also very simple as I already have rsync. And since I'm going to
> run btrfs on the backup volume, I can perform hourly snapshots there
> and use Snapper to manage retention. It's all very simple and relies
> on tools I already have and know.
> 
> However, the advantages of Borgbackup you mentioned (much higher
> deduplication density and compression) make it worth considering.
> Maybe Borgbackup won't take long to complete successive (incremental)
> backups on my system.

Once a full backup has been taken, incremental backups are extremely
fast. At least for me, it works much faster than rsync. And as with
btrfs snapshots, each incremental backup is also a full backup. It's not
like traditional backup software that needs the backup parent and
grandparent to make use of differential and/or incremental backups.
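
For illustration, a borg run implementing a retention policy like the
one quoted above could be as small as this sketch (the repository path
is hypothetical):

# borg create --compression lz4 /mnt/backup/repo::'{hostname}-{now}' /home
# borg prune --keep-daily 30 --keep-weekly 14 \
#       --keep-monthly 12 --keep-yearly 5 /mnt/backup/repo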

There's one caveat, tho: only one process can access a repository at a
time; that is, you need to serialize different backup jobs if you want
them to go into the same repository. Deduplication is done only within
the same repository. Tho, you might be able to leverage btrfs
deduplication (e.g. using bees) across multiple repositories if you're
not using encrypted repositories.

But since you're currently using send/receive and/or rsync, encrypted
storage of the backup doesn't seem to be an important point to you.

Burp with its client/server approach may have an advantage here, though
its setup seems to be more complicated. Borg is really easy to use. I
never tried burp, tho.


> I'll have to try it to see. It's a very nice
> looking project. I'm surprised I never heard of it before.

It seems to follow similar principles to burp (which I had never heard
of previously). It seems like the really good backup software has some
sort of PR problem... ;-)


-- 
Regards,
Kai

Replies to list-only preferred.

--

Re: kernel BUG at fs/btrfs/ctree.h:3457!

2017-11-03 Thread Lakshmipathi.G
Yes, the patch works. I enabled both CONFIG_BTRFS_FS_CHECK_INTEGRITY
and CONFIG_BTRFS_FS_RUN_SANITY_TESTS and applied the above patch; this
method also resolved the issue. Thanks.
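
(To double-check those options on a running kernel, something like the
following works on most distros:)

# grep -E 'CONFIG_BTRFS_FS_(CHECK_INTEGRITY|RUN_SANITY_TESTS)' \
#       /boot/config-$(uname -r)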

Cheers,
Lakshmipathi.G
http://www.giis.co.in http://www.webminal.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html