Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-05 Thread Roman Mamedov
On Thu, 6 Dec 2018 06:11:46 +
Robert White  wrote:

> So it would be dog-slow, but it would be neat if BTRFS had a mount 
> option to convert any TRIM command from above into the write of a zero, 
> 0xFF, or trash block to the device below if that device doesn't support 
> TRIM. Real TRIM support would override the block write.

There is such a project:

  "dm-linear like target which provides discard, but replaces it with write of
  random data to a discarded region. Thus, discarded data is securely deleted."

  https://github.com/vt-alt/dm-secdel

-- 
With respect,
Roman


btrfs progs always assume devid 1?

2018-12-05 Thread Roman Mamedov
Hello,

To migrate my FS to a different physical disk, I have added a new empty device
to the FS, then ran the remove operation on the original one.

Now my FS has only devid 2:

Label: 'p1'  uuid: d886c190-b383-45ba-9272-9f00c6a10c50
Total devices 1 FS bytes used 36.63GiB
devid 2 size 50.00GiB used 45.06GiB path /dev/mapper/vg-p1

And all the operations of btrfs-progs now fail to work in their default
invocation, such as:

# btrfs fi resize max .
Resize '.' of 'max'
ERROR: unable to resize '.': No such device

[768813.414821] BTRFS info (device dm-5): resizer unable to find device 1

Of course this works:

# btrfs fi resize 2:max .
Resize '.' of '2:max'

But this is inconvenient and seems to be a rather simple oversight. If what I
got is normal (the device staying as ID 2 after such an operation), then count
this as a suggestion that btrfs-progs should use the first existing devid,
rather than always looking for a hard-coded devid 1.
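
Until then, a hedged workaround sketch that picks up whatever the first
existing devid happens to be (the mountpoint is illustrative; the awk fields
assume the usual "btrfs filesystem show" output layout):

  # devid=$(btrfs filesystem show /mnt/p1 | awk '/devid/ {print $2; exit}')
  # btrfs filesystem resize "$devid:max" /mnt/p1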

-- 
With respect,
Roman


Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Roman Mamedov
On Thu, 22 Nov 2018 22:07:25 +0900
Tomasz Chmielewski  wrote:

> Spot on!
> 
> Removed "discard" from fstab and added "ssd", rebooted - no more 
> btrfs-cleaner running.

Recently there has been a bugfix for TRIM in Btrfs:
  
  btrfs: Ensure btrfs_trim_fs can trim the whole fs
  https://patchwork.kernel.org/patch/10579539/

Perhaps your upgraded kernel is the first one to contain it, and for the first
time you're seeing TRIM actually *work*, with its real performance impact on a
large fragmented FS, instead of on just a few contiguous unallocated areas.

-- 
With respect,
Roman


Re: BTRFS on production: NVR 16+ IP Cameras

2018-11-15 Thread Roman Mamedov
On Thu, 15 Nov 2018 11:39:58 -0700
Juan Alberto Cirez  wrote:

> Is BTRFS mature enough to be deployed on a production system to underpin 
> the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)?

What are you looking to gain from using Btrfs on an NVR system? It doesn't
sound like any of its prime-time features -- such as snapshots, checksumming,
compression or reflink copying -- are a must for a bulk video recording
scenario.

And even if you don't need or use those features, you still pay the price for
having them -- Btrfs can be 2x to 10x slower than its simpler competitors:
https://www.phoronix.com/scan.php?page=article=linux418-nvme-raid=2

If you just meant to use its multi-device support instead of a separate RAID
layer, IMO that can't be considered prime-time or depended upon in production
(yes, not even RAID 1 or 10).

-- 
With respect,
Roman


Re: unable to mount btrfs after upgrading from 4.16.1 to 4.19.1

2018-11-09 Thread Roman Mamedov
On Sat, 10 Nov 2018 03:08:01 +0900
Tomasz Chmielewski  wrote:

> After upgrading from kernel 4.16.1 to 4.19.1 and a clean restart, the fs 
> no longer mounts:

Did you try rebooting back to 4.16.1 to see if it still mounts there?

-- 
With respect,
Roman


Re: CoW behavior when writing same content

2018-10-09 Thread Roman Mamedov
On Tue, 9 Oct 2018 09:52:00 -0600
Chris Murphy  wrote:

> You'll be left with three files. /big_file and root/big_file will
> share extents, and snapshot/big_file will have its own extents. You'd
> need to copy with --reflink for snapshot/big_file to have shared
> extents with /big_file - or deduplicate.

Or use rsync for copying, in the mode where it reads and checksums blocks of
both files, to copy only the non-matching portions.

rsync --inplace

  This  option  is  useful  for  transferring  large  files   with
  block-based  changes  or appended data, and also on systems that
  are disk bound, not network bound.  It  can  also  help  keep  a
  copy-on-write filesystem snapshot from diverging the entire con‐
  tents of a file that only has minor changes.
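
Note that for local copies rsync defaults to --whole-file, so the delta
algorithm has to be requested explicitly. A hedged example (paths are
illustrative):

  rsync --inplace --no-whole-file /data/big_file /mnt/btrfs/root/big_file

Only the blocks that actually differ get rewritten, so the unchanged extents
stay shared with the snapshot.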

-- 
With respect,
Roman


Re: Problem with BTRFS

2018-09-14 Thread Roman Mamedov
On Fri, 14 Sep 2018 19:27:04 +0200
Rafael Jesús Alcántara Pérez  wrote:

> BTRFS info (device sdc1): use lzo compression, level 0
> BTRFS warning (device sdc1): 'recovery' is deprecated, use
> 'usebackuproot' instead
> BTRFS info (device sdc1): trying to use backup root at mount time
> BTRFS info (device sdc1): disk space caching is enabled
> BTRFS info (device sdc1): has skinny extents
> BTRFS error (device sdc1): super_total_bytes 601020864 mismatch with
> fs_devices total_rw_bytes 601023424

There is a recent feature added to "btrfs rescue" to fix this kind of
condition: https://patchwork.kernel.org/patch/10011399/

You need a recent version of the Btrfs tools for it; I'm not sure exactly
which, but I see that it's not in version 4.13 and is present in 4.17.
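
If I recall correctly, the new subcommand is used roughly like this, against
the unmounted device (double-check "btrfs rescue --help" on your version
first):

  # btrfs rescue fix-device-size /dev/sdc1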

-- 
With respect,
Roman


Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-18 Thread Roman Mamedov
On Fri, 17 Aug 2018 23:17:33 +0200
Martin Steigerwald  wrote:

> > Do not consider SSD "compression" as a factor in any of your
> > calculations or planning. Modern controllers do not do it anymore,
> > the last ones that did are SandForce, and that's 2010 era stuff. You
> > can check for yourself by comparing write speeds of compressible vs
> > incompressible data, it should be the same. At most, the modern ones
> > know to recognize a stream of binary zeroes and have a special case
> > for that.
> 
> Interesting. Do you have any backup for your claim?

Just "something I read". I follow quote a bit of SSD-related articles and
reviews which often also include a section to talk about the controller
utilized, its background and technological improvements/changes -- and the
compression going out of fashion after SandForce seems to be considered a
well-known fact.

Incidentally, your old Intel 320 SSDs actually seem to be based on that old
SandForce controller (or at least license some of that IP to extend on it),
and hence those indeed might perform compression.

> As the data still needs to be transferred to the SSD at least when the 
> SATA connection is maxed out I bet you won´t see any difference in write 
> speed whether the SSD compresses in real time or not.

Most controllers expose two readings in SMART:

  - Lifetime writes from host (SMART attribute 241)
  - Lifetime writes to flash (attribute 233, or 177, or 173...)

It might be difficult to get the second one, as often it needs to be decoded
from others such as "Average block erase count" or "Wear leveling count".
(And it seems to be impossible on Samsung NVMe drives, for example.)

But if you have numbers for both, you know the write amplification of the
drive (and its past workload).

If there is compression at work, you'd see the second number being somewhat or
significantly lower -- and barely increasing at all if you write highly
compressible data. This is not typically observed on modern SSDs, except maybe
when writing zeroes. Writes to flash will be the same as writes from host, or
most often somewhat higher, as the hardware can typically erase flash only in
chunks of 2MB or so, hence there's quite a bit of under-the-hood reorganizing
going on. As a result, depending on the workload, the "to flash" number can
also be much higher than "from host".

The point is, even when the SATA link is maxed out in both cases, you can still
check whether there's compression at work by using those SMART attributes.
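
A hedged way to read both counters with smartctl (attribute IDs and names vary
by vendor, and the raw values usually need converting -- e.g. LBAs written
multiplied by the sector size -- before the two can be compared):

  # smartctl -A /dev/sda | awk '$1==241 || $1==233 || $1==177 || $1==173'

Write amplification is then roughly (lifetime writes to flash) divided by
(lifetime writes from host), once both are expressed in the same unit.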

> In any case: It was a experience report, no request for help, so I don´t 
> see why exact error messages are absolutely needed. If I had a support 
> inquiry that would be different, I agree.

Well, when reading such stories (involving software that I also use), I imagine
what I would do if I had been in that situation myself: would I have anything
else to try, do I know about any workaround for this? And without any technical
details to go on, those are all questions left unanswered.

-- 
With respect,
Roman


Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Roman Mamedov
On Fri, 17 Aug 2018 14:28:25 +0200
Martin Steigerwald  wrote:

> > First off, keep in mind that the SSD firmware doing compression only
> > really helps with wear-leveling.  Doing it in the filesystem will help
> > not only with that, but will also give you more space to work with.
> 
> While also reducing the ability of the SSD to wear-level. The more data 
> I fit on the SSD, the less it can wear-level. And the better I compress 
> that data, the less it can wear-level.

Do not consider SSD "compression" as a factor in any of your calculations or
planning. Modern controllers do not do it anymore, the last ones that did are
SandForce, and that's 2010 era stuff. You can check for yourself by comparing
write speeds of compressible vs incompressible data, it should be the same. At
most, the modern ones know to recognize a stream of binary zeroes and have a
special case for that.

As a general comment on this thread, always try to save the exact messages you
get when troubleshooting or hitting failures on your system. Saying just
"was not able to add" or "btrfs replace not working" without any exact details
isn't really helpful as a bug report, or even as a general "experiences" story,
since we don't know what the exact cause was, whether it could have been
avoided or worked around, not to mention what your FS state was at the time
(as in "btrfs fi show" and "fi df").

-- 
With respect,
Roman


Re: trouble mounting btrfs filesystem....

2018-08-14 Thread Roman Mamedov
On Tue, 14 Aug 2018 16:41:11 +0300
Dmitrii Tcvetkov  wrote:

> If usebackuproot doesn't help then filesystem is beyond repair and you
> should try to refresh your backups with "btrfs restore" and restore from 
> them[1].
> 
> [1] 
> https://btrfs.wiki.kernel.org/index.php/FAQ#How_do_I_recover_from_a_.22parent_transid_verify_failed.22_error.3F

This is really the worst unfixed Btrfs issue today. It happens a lot on
unclean shutdowns or reboots, and the only advice is usually to "start over"
-- even if your FS is 40 TB and the discrepancy is just half a dozen
transids. There needs to be a way in fsck to accept (likely minor!) FS damage
and forcefully fix up the transids to what they should be -- or even to nuke
the affected portions entirely.

-- 
With respect,
Roman


"error inheriting props for ino": Btrfs "compression" property

2018-07-25 Thread Roman Mamedov
Hello,

On two machines I have subvolumes where I backup other hosts' root filesystems
via rsync. These subvolumes have the +c attribute on them.

During the backup, sometimes I get tons of messages like these in dmesg:

[Wed Jul 25 20:58:22 2018] BTRFS error (device dm-8): error inheriting props 
for ino 1213720 (root 1301): -28
[Wed Jul 25 20:58:22 2018] BTRFS error (device dm-8): error inheriting props 
for ino 1213723 (root 1301): -28
[Wed Jul 25 20:58:22 2018] BTRFS error (device dm-8): error inheriting props 
for ino 1213724 (root 1301): -28
[Wed Jul 25 20:58:22 2018] BTRFS error (device dm-8): error inheriting props 
for ino 1213725 (root 1301): -28

# btrfs inspect inode-resolve 1213720 .
./gemini/lib/modules/4.14.58-rm2+/kernel/virt

This seems to be related to the "compression" property in Btrfs:

# btrfs property get ./gemini/lib/modules/4.14.58-rm2+/
compression=zlib

# btrfs property get ./gemini/lib/modules/4.14.58-rm2+/kernel/virt

(no output)

Why would it fail like that? This does seem harmless, but the messages are 
annoying and it's puzzling why 
this happens in the first place.

-- 
With respect,
Roman


Re: So, does btrfs check lowmem take days? weeks?

2018-07-02 Thread Roman Mamedov
On Mon, 2 Jul 2018 08:19:03 -0700
Marc MERLIN  wrote:

> I actually have fewer snapshots than this per filesystem, but I backup
> more than 10 filesystems.
> If I used as many snapshots as you recommend, that would already be 230
> snapshots for 10 filesystems :)

(...once again me with my rsync :)

If you didn't use send/receive, you wouldn't be required to keep a separate
snapshot trail per filesystem backed up; one trail of snapshots for the entire
backup server would be enough. Rsync everything to subdirs within one
subvolume, then take timed or event-based snapshots of it. You only need more
than one trail if you want different retention policies for different datasets
(e.g. in my case I have 91 and 31 days).

-- 
With respect,
Roman


Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Roman Mamedov
On Fri, 29 Jun 2018 00:22:10 -0700
Marc MERLIN  wrote:

> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> > On Thu, 28 Jun 2018 23:59:03 -0700
> > Marc MERLIN  wrote:
> > 
> > > I don't waste a week recreating the many btrfs send/receive relationships.
> > 
> > Consider not using send/receive, and switching to regular rsync instead.
> > Send/receive is very limiting and cumbersome, including because of what you
> > described. And it doesn't gain you much over an incremental rsync. As for
> 
> Err, sorry but I cannot agree with you here, at all :)
> 
> btrfs send/receive is pretty much the only reason I use btrfs. 
> rsync takes hours on big filesystems scanning every single inode on both
> sides and then seeing what changed, and only then sends the differences

I use it for backing up root filesystems of about 20 hosts, and for syncing
large multi-terabyte media collections -- it's fast enough in both.
Admittedly, neither of those cases has millions of subdirs or files where
scanning may take a long time. And in the former case it's also all from and
to SSDs. Maybe your use case is different and it doesn't work as well there.
But perhaps then general day-to-day performance is not great either, so I'd
suggest looking into SSD-based LVM caching -- it really works wonders with
Btrfs.

-- 
With respect,
Roman


Re: So, does btrfs check lowmem take days? weeks?

2018-06-29 Thread Roman Mamedov
On Thu, 28 Jun 2018 23:59:03 -0700
Marc MERLIN  wrote:

> I don't waste a week recreating the many btrfs send/receive relationships.

Consider not using send/receive, and switching to regular rsync instead.
Send/receive is very limiting and cumbersome, including because of what you
described. And it doesn't gain you much over an incremental rsync. As for
snapshots on the backup server, you can either automate making one as soon as a
backup has finished, or simply make them once/twice a day, during a period
when no backups are ongoing.

-- 
With respect,
Roman


Re: [PATCH] btrfs: inode: Don't compress if NODATASUM or NODATACOW set

2018-05-14 Thread Roman Mamedov
On Mon, 14 May 2018 11:36:26 +0300
Nikolay Borisov  wrote:

> So what made you have these expectation, is it codified somewhere
> (docs/man pages etc)? I'm fine with that semantics IF this is what
> people expect.

"Compression ...does not work for NOCOW files":
https://btrfs.wiki.kernel.org/index.php/Compression

The mount options man page does not say that the NOCOW attribute of files will
be disregarded with compress-force.  It only mentions interaction with the
nodatacow and nodatasum mount options. So I'd expect the attribute to still
work and prevent compression of NOCOW files.

> Now the question is why people grew up to have this expectation and not the
> other way round? IMO force_compress should really disregard everything else

Both are knobs that the user needs to explicitly set, the difference is that
the +C attribute is fine-grained and the mount option is global. If they are
set by the user to conflicting values, it seems more useful to have the
fine-grained control override the global one, not the other way round.

-- 
With respect,
Roman


Re: [PATCH] btrfs: inode: Don't compress if NODATASUM or NODATACOW set

2018-05-14 Thread Roman Mamedov
On Mon, 14 May 2018 11:10:34 +0300
Nikolay Borisov  wrote:

> But if we have mounted the fs with FORCE_COMPRESS shouldn't we disregard
> the inode flags, presumably the admin knows what he is doing?

Please don't. Personally I always assumed chattr +C would prevent both CoW and
compression, and used that as a way to override volume-wide compress-force for
a particular folder. Now that it turns out this wasn't working, the patch
would fix it to behave in line with prior expectations.

-- 
With respect,
Roman


Re: zerofree btrfs support?

2018-03-10 Thread Roman Mamedov
On Sat, 10 Mar 2018 16:50:22 +0100
Adam Borowski  wrote:

> Since we're on a btrfs mailing list, if you use qemu, you really want
> sparse format:raw instead of qcow2 or preallocated raw.  This also works
> great with TRIM.

Agreed, that's why I use RAW. QCOW2 would add a second layer of COW on top of
Btrfs, which sounds like a nightmare. Even if you run those files as NOCOW in
Btrfs, somehow I feel FS-native COW is more efficient than emulating it in
userspace with special format files.

> > It works, just not with some of the QEMU virtualized disk device drivers.
> > You don't need to use qemu-img to manually dig holes either, it's all
> > automatic.
> 
> It works only with scsi and virtio-scsi drivers.  Most qemu setups use
> either ide (ouch!) or virtio-blk.

It works with IDE as well.

-- 
With respect,
Roman


Re: zerofree btrfs support?

2018-03-10 Thread Roman Mamedov
On Sat, 10 Mar 2018 15:19:05 +0100
Christoph Anton Mitterer  wrote:

> TRIM/discard... not sure how far this is really a solution.

It is the solution in a great many usage scenarios; I don't know enough about
your particular one, though.

Note you can use it on HDDs too, even without QEMU and the like: via using LVM
"thin" volumes. I use that on a number of machines, the benefit is that since
TRIMed areas are "stored nowhere", those partitions allow for incredibly fast
block-level backups, as it doesn't have to physically read in all the free
space, let alone any stale data in there. LVM snapshots are also way more
efficient with thin volumes, which helps during backup.
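
A hedged sketch of such a setup (VG/LV names and sizes are illustrative):

  # lvcreate --type thin-pool -L 500G -n pool0 vg0
  # lvcreate --type thin -V 1T --thinpool pool0 -n data vg0
  # mkfs.btrfs /dev/vg0/data
  # mount -o discard /dev/vg0/data /mnt/data   (or run fstrim periodically)

With discard (or a periodic fstrim) in place, blocks freed by the filesystem
are returned to the thin pool, so a block-level copy only has to read the
space that is actually allocated.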

> dm-crypt per default blocks discard.

Out of misguided paranoia. If your crypto is any good (and last I checked, AES
was good enough), there's really not a lot for the "attacker" to gain from
knowing which areas of the disk are used and which are not.

> Some longer time ago I had a look at whether qemu would support that on
> it's own,... i.e. the guest and it's btrfs would normally use discard,
> but the image file below would mark the block as discarded and later on
> e can use some qemu-img command to dig holes into exactly those
> locations.
> Back then it didn't seem to work.

It works, just not with some of the QEMU virtualized disk device drivers.
You don't need to use qemu-img to manually dig holes either, it's all
automatic.

> But even if it would in the meantime, a proper zerofree implementation
> would be beneficial for all non-qemu/qcow2 users (e.g. if one uses raw
> images in qemu, the whole thing couldn't work but with really zeroing
> the blocks inside the guest.

QEMU deallocates parts of its raw images for those areas which have been
TRIM'ed by the guest. In fact I never use qcow2, always raw images only.
Yet, boot a guest, issue fstrim, and watch the raw file -- while still having
the same apparent size -- show much lower actual disk usage in "du".
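
A hedged sketch of wiring that up (the virtio-scsi route is shown since it has
supported discard the longest; option spellings can vary between QEMU
versions):

  qemu-system-x86_64 ... \
    -device virtio-scsi-pci,id=scsi0 \
    -drive if=none,id=hd0,file=guest.raw,format=raw,discard=unmap \
    -device scsi-hd,drive=hd0,bus=scsi0.0

Then run "fstrim -av" in the guest and compare "ls -lsh guest.raw" on the
host: the apparent size stays the same while the allocated size drops.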

-- 
With respect,
Roman


Re: btrfs subvolume mount with different options

2018-01-12 Thread Roman Mamedov
On Fri, 12 Jan 2018 17:49:38 + (GMT)
"Konstantin V. Gavrilenko"  wrote:

> Hi list,
> 
> just wondering whether it is possible to mount two subvolumes with different 
> mount options, i.e.
> 
> |
> |- /a  defaults,compress-force=lza

You can use different compression algorithms across the filesystem
(including none), via "btrfs properties" on directories or subvolumes. They
are inherited down the tree.

$ mkdir test
$ sudo btrfs prop set test compression zstd
$ echo abc > test/def
$ sudo btrfs prop get test/def compression
compression=zstd

But it appears this doesn't provide a way to apply compress-force.

> |- /b  defaults,nodatacow

Nodatacow can be applied to any dir/subvolume recursively, or any file
(as long as it's created but not written yet) via chattr +C.
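
A hedged example of both forms (paths are illustrative); as noted, the flag
only has a well-defined effect if set while the file is still empty:

  # chattr +C /mnt/fs/images            (new files created inside inherit NOCOW)
  # touch /mnt/fs/images/vm.img
  # lsattr /mnt/fs/images/vm.img        (should show the 'C' flag)

  # touch /mnt/fs/standalone.img
  # chattr +C /mnt/fs/standalone.img    (file is still empty, so this is fine)
  # fallocate -l 10G /mnt/fs/standalone.img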

-- 
With respect,
Roman


Re: [4.14.3] btrfs out of space error

2017-12-14 Thread Roman Mamedov
On Fri, 15 Dec 2017 01:39:03 +0100
Ian Kumlien  wrote:

> Hi,
> 
> Running a 4.14.3 kernel, this just happened, but there should have
> been another 20 gigs or so available.
> 
> The filesystem seems fine after a reboot though

What are your mount options, and can you show the output of "btrfs fi df" and
"btrfs fi us" for the filesystem? Also, what does
"cat /sys/block/sdb/queue/rotational" return?

I wonder if it's the same old "ssd allocation scheme" problem, and no
balancing done in a long time or at all.

-- 
With respect,
Roman


Re: 4.14 balance: kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!

2017-11-18 Thread Roman Mamedov
On Sat, 18 Nov 2017 02:08:46 +0100
Hans van Kranenburg  wrote:

> It's using send + balance at the same time. There's something that makes
> btrfs explode when you do that.
> 
> It's not new in 4.14, I have seen it in 4.7 and 4.9 also, various
> different explosions in kernel log. Since that happened, I made sure I
> never did those two things at the same time.

Shouldn't it prevent send during balance, or balance during send then, if
that's the case?

You talk about it "exploding" like it's a normal thing to have "invalid opcode"
BUGs in the kernel log, and for the user to have to take care not to use two
of the regular FS features at the same time.

This seems to be a bug which should be fixed, rather than warning everyone
"not to send during balance".

-- 
With respect,
Roman


Re: 4.13.12: kernel BUG at fs/btrfs/ctree.h:1802!

2017-11-16 Thread Roman Mamedov
On Thu, 16 Nov 2017 16:12:56 -0800
Marc MERLIN  wrote:

> On Thu, Nov 16, 2017 at 11:32:33PM +0100, Holger Hoffstätte wrote:
> > Don't pop the champagne just yet, I just read that apprently 4.14 broke
> > bcache for some people [1]. Not sure how much that affects you, but it might
> > well make things worse. Yeah, I know, wonderful.
> 
> Oh my, that's actually pretty terrible.
> I've just reverted both my machines to 3.13, the last thing I need is more
> btrfs corruption.

Why so far back, though? The latest 4.4 and 4.9 are both good series and have
been running without issues for me for a long time. Or perhaps you meant 4.13 :)

> I'm also starting to question if I should just drop bcache. It does help
> access to a big and slow-ish array, but corruption and periodic btrfs
> full rebuilds is not something I can afford to do timewise :-/

I suggest that you try lvmcache instead. It's much more flexible than bcache,
does pretty much the same job, and has much less of the "hacky" feel to it.

-- 
With respect,
Roman


Re: A partially failing disk in raid0 needs replacement

2017-11-14 Thread Roman Mamedov
On Tue, 14 Nov 2017 15:09:52 +0100
Klaus Agnoletti  wrote:

> Hi Roman
> 
> I almost understand :-) - however, I need a bit more information:
> 
> How do I copy the image file to the 6TB without screwing the existing
> btrfs up when the fs is not mounted? Should I remove it from the raid
> again?

Oh, you already added it to your FS -- that's unfortunate. For my scenario I
assumed you had a spare 6TB (or any 2TB+) disk you could use as temporary space.

You could try removing it, but with one of the existing member drives
malfunctioning, I wonder if any operation on that FS will cause further
damage. For example, if you remove the 6TB one, how do you prevent Btrfs from
using the bad 2TB drive as the destination to relocate data from the 6TB drive?
Or from using it for one of the metadata mirrors, which will fail to write
properly, leading to transid failures later, etc.

-- 
With respect,
Roman


Re: A partially failing disk in raid0 needs replacement

2017-11-14 Thread Roman Mamedov
On Tue, 14 Nov 2017 10:36:22 +0200
Klaus Agnoletti  wrote:

> Obviously, I want /dev/sdd emptied and deleted from the raid.

  * Unmount the RAID0 FS

  * copy the bad drive using `dd_rescue`[1] into a file on the 6TB drive
(noting how much of it is actually unreadable -- chances are it's mostly
intact)

  * physically remove the bad drive (power down or reboot at this point, to be
sure Btrfs doesn't remember it somewhere)

  * set up a loop device from the dd_rescue'd 2TB file

  * run `btrfs device scan`

  * mount the RAID0 filesystem

  * run the delete command on the loop device, it will not encounter I/O
errors anymore.


[1] Note that "ddrescue" and "dd_rescue" are two different programs for the
same purpose; one may work better than the other, I don't remember which. :)
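
A hedged command-level sketch of the steps above (device names and paths are
illustrative; GNU ddrescue syntax shown, the dd_rescue variant differs):

  # mount /dev/sde1 /mnt/spare                 (the spare 6TB drive, own FS)
  # ddrescue /dev/sdd /mnt/spare/sdd.img /mnt/spare/sdd.map
  (power down, physically remove /dev/sdd, boot again)
  # losetup -f --show /mnt/spare/sdd.img
  /dev/loop0
  # btrfs device scan
  # mount /dev/sdb /mnt/raid0                  (any surviving member device)
  # btrfs device delete /dev/loop0 /mnt/raid0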

-- 
With respect,
Roman


Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-11-14 Thread Roman Mamedov
On Mon, 13 Nov 2017 22:39:44 -0500
Dave  wrote:

> I have my live system on one block device and a backup snapshot of it
> on another block device. I am keeping them in sync with hourly rsync
> transfers.
> 
> Here's how this system works in a little more detail:
> 
> 1. I establish the baseline by sending a full snapshot to the backup
> block device using btrfs send-receive.
> 2. Next, on the backup device I immediately create a rw copy of that
> baseline snapshot.
> 3. I delete the source snapshot to keep the live filesystem free of
> all snapshots (so it can be optimally defragmented, etc.)
> 4. hourly, I take a snapshot of the live system, rsync all changes to
> the backup block device, and then delete the source snapshot. This
> hourly process takes less than a minute currently. (My test system has
> only moderate usage.)
> 5. hourly, following the above step, I use snapper to take a snapshot
> of the backup subvolume to create/preserve a history of changes. For
> example, I can find the version of a file 30 hours prior.

Sounds a bit complex; I still don't get why you need all these snapshot
creations and deletions, or why you're still using btrfs send-receive.

Here is my scheme:

/mnt/dst <- mounted backup storage volume
/mnt/dst/backup  <- a subvolume 
/mnt/dst/backup/host1/ <- rsync destination for host1, regular directory
/mnt/dst/backup/host2/ <- rsync destination for host2, regular directory
/mnt/dst/backup/host3/ <- rsync destination for host3, regular directory
etc.

/mnt/dst/backup/host1/bin/
/mnt/dst/backup/host1/etc/
/mnt/dst/backup/host1/home/
...
Self explanatory. All regular directories, not subvolumes.

Snapshots:
/mnt/dst/snaps/backup <- a regular directory
/mnt/dst/snaps/backup/2017-11-14T12:00/ <- snapshot 1 of /mnt/dst/backup
/mnt/dst/snaps/backup/2017-11-14T13:00/ <- snapshot 2 of /mnt/dst/backup
/mnt/dst/snaps/backup/2017-11-14T14:00/ <- snapshot 3 of /mnt/dst/backup

Accessing historic data:
/mnt/dst/snaps/backup/2017-11-14T12:00/host1/bin/bash
...
/bin/bash for host1 as of 2017-11-14 12:00 (time on the backup system).


No need for btrfs send-receive, only plain rsync is used, directly from
hostX:/ to /mnt/dst/backup/host1/;

No need to create or delete snapshots during the actual backup process;

A single common timeline is kept for all hosts being backed up, so the snapshot
count is not multiplied by the number of hosts (in my case the backup location
is multi-purpose, so I do somewhat care about the total number of snapshots
there as well);

Also, all of this works even with source hosts which do not use Btrfs.
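
A hedged example of the snapshot step, matching the layout above (the
timestamp format is only an example):

  # btrfs subvolume snapshot -r /mnt/dst/backup \
        /mnt/dst/snaps/backup/$(date +%Y-%m-%dT%H:%M)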

-- 
With respect,
Roman


Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-11-14 Thread Roman Mamedov
On Tue, 14 Nov 2017 10:14:55 +0300
Marat Khalili  wrote:

> Don't keep snapshots under rsync target, place them under ../snapshots 
> (if snapper supports this):

> Or, specify them in --exclude and avoid using --delete-excluded.

Both are good suggestions. In my case each system does have its own snapshots
as well, but they are retained for a much shorter time. So I both use --exclude
to avoid fetching the entire /snaps tree from the source system, and store
snapshots of the destination system outside of the rsync target dirs.

>Or keep using -x if it works, why not?

-x will exclude the content of all subvolumes down the tree on the source side
-- not only the time-based ones. If you take care to never casually create any
subvolumes whose content you'd still want backed up, then I guess it can work.

-- 
With respect,
Roman


Re: [PATCH] btrfs: Move loop termination condition in while()

2017-11-01 Thread Roman Mamedov
On Wed,  1 Nov 2017 11:32:18 +0200
Nikolay Borisov  wrote:

> Fallocating a file in btrfs goes through several stages. The one before 
> actually
> inserting the fallocated extents is to create a qgroup reservation, covering
> the desired range. To this end there is a loop in btrfs_fallocate which checks
> to see if there are holes in the fallocated range or !PREALLOC extents past 
> EOF
> and if so create qgroup reservations for them. Unfortunately, the main 
> condition
> of the loop is burried right at the end of its body rather than in the actual
> while statement which makes it non-obvious. Fix this by moving the condition
> in the while statement where it belongs. No functional changes.

If it turns out that "cur_offset >= alloc_end" from the get-go, previously the
loop body would be entered and executed once. With this change, it no longer
will be.

I did not examine the context to see whether such a case is possible, likely,
beneficial or harmful. But if you wanted 100% no functional changes no matter
what, maybe it would be better to use a "do ... while" loop?

> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/file.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index e0d15c0d1641..ecbe186cb5da 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3168,7 +3168,7 @@ static long btrfs_fallocate(struct file *file, int mode,
>  
>   /* First, check if we exceed the qgroup limit */
>   INIT_LIST_HEAD(_list);
> - while (1) {
> + while (cur_offset < alloc_end) {
>   em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, cur_offset,
> alloc_end - cur_offset, 0);
>   if (IS_ERR(em)) {
> @@ -3204,8 +3204,6 @@ static long btrfs_fallocate(struct file *file, int mode,
>   }
>   free_extent_map(em);
>   cur_offset = last_byte;
> - if (cur_offset >= alloc_end)
> - break;
>   }
>  
>   /*



-- 
With respect,
Roman


Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-10-31 Thread Roman Mamedov
On Wed, 1 Nov 2017 01:00:08 -0400
Dave  wrote:

> To reconcile those conflicting goals, the only idea I have come up
> with so far is to use btrfs send-receive to perform incremental
> backups as described here:
> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup .

Another option is to just use the regular rsync to a designated destination
subvolume on the backup host, AND snapshot that subvolume on that host from
time to time (or on backup completions, if you can synchronize that).

rsync --inplace will keep space usage low as it will not reupload entire files
in case of changes/additions to them.

Yes, rsync has to traverse both directory trees to find changes, but that's
pretty fast (a couple of minutes at most for a typical root filesystem),
especially if you use SSDs or SSD caching.
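
A hedged example of such an rsync run for one host (the flag set is one
reasonable choice, not a prescription), with the destination being the
designated subvolume that gets snapshotted afterwards:

  # rsync -aHAXx --numeric-ids --delete --inplace \
        root@host1:/ /mnt/backup/host1/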

-- 
With respect,
Roman


Re: Need some assistance/direction in determining a system hang during heavy IO

2017-10-26 Thread Roman Mamedov
On Thu, 26 Oct 2017 09:40:19 -0600
Cheyenne Wills  wrote:

> Briefly when I upgraded a system from 4.0.5 kernel to 4.9.5 (and
> later) I'm seeing a blocked task timeout with heavy IO against a
> multi-lun btrfs filesystem.  I've tried a 4.12.12 kernel and am still
> getting the hang.

There is now 4.9.58 (fifty-three versions later!), and the 4.12 series is long
abandoned and gone from the charts altogether. So just in case, did you check
with the latest kernels?

Also, keep in mind the 120 second warnings are just that, and not an error
condition by themselves. You can disable them or increase the maximum timeout
in sysctl settings. And it is not clear from your reports whether you only get
the warnings and everything is back to normal after the load subsides, or the
FS locks up "for good", i.e. with all access attempts hanging indefinitely and
no way to unmount the FS or otherwise recover.
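
A hedged example of adjusting that (600 is an arbitrary value; 0 disables the
warning entirely):

  # sysctl kernel.hung_task_timeout_secs
  kernel.hung_task_timeout_secs = 120
  # sysctl -w kernel.hung_task_timeout_secs=600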

-- 
With respect,
Roman


Re: Mount failing - unable to find logical

2017-10-17 Thread Roman Mamedov
On Wed, 18 Oct 2017 09:24:01 +0800
Qu Wenruo  wrote:

> 
> 
> On 2017年10月18日 04:43, Cameron Kelley wrote:
> > Hey btrfs gurus,
> > 
> > I have a 4 disk btrfs filesystem that has suddenly stopped mounting
> > after a recent reboot. The data is in an odd configuration due to
> > originally being in a 3 disk RAID1 before adding a 4th disk and running
> > a balance to convert to RAID10. There wasn't enough free space to
> > completely convert, so about half the data is still in RAID1 while the
> > other half is in RAID10. Both metadata and system are RAID10. It has
> > been in this configuration for 6 months or so now since adding the 4th
> > disk. It just holds archived media and hasn't had any data added or
> > modified in quite some time. I feel pretty stupid now for not correcting
> > that sooner though.
> > 
> > I have tried mounting with different mount options for recovery, ro,
> > degraded, etc. Log shows errors about "unable to find logical
> > 3746892939264 length 4096"
> > 
> > When I do a btrfs check, it doesn't find any issues. Running
> > btrfs-find-root comes up with a message about a block that the
> > generation doesn't match. If I specify that block on the btrfs check, I
> > get transid verify failures.
> > 
> > I ran a dry run of a recovery of the entire filesystem which runs
> > through every file with no errors. I would just restore the data and
> > start fresh, but unfortunately I don't have the free space at the moment
> > for the ~4.5TB of data.
> > 
> > I also ran full smart self tests on all 4 disks with no errors.
> > 
> > root@nas2:~# uname -a
> > Linux nas2 4.13.7-041307-generic #201710141430 SMP Sat Oct 14 14:39:06
> > UTC 2017 i686 i686 i686 GNU/Linux
> 
> I don't think i686 kernel will cause any difference, but considering
> most of us are using x86_64 to develop/test, maybe it will be a good
> idea to upgrade to x86_64 kernel?

Indeed a problem with mounting on 32-bit in 4.13 has been reported recently:
https://www.spinics.net/lists/linux-btrfs/msg69734.html
with the same error message.

I believe it's this patchset that is supposed to fix that.
https://www.spinics.net/lists/linux-btrfs/msg70001.html

@Cameron maybe you didn't just reboot, but also upgraded your kernel at the
same time? In any case, try a 4.9 series kernel, or a 64-bit machine if you
want to stay with 4.13.

-- 
With respect,
Roman


Re: Lost about 3TB

2017-10-03 Thread Roman Mamedov
On Tue, 3 Oct 2017 10:54:05 +
Hugo Mills  wrote:

>There are other possibilities for missing space, but let's cover
> the obvious ones first.

One more obvious thing would be files that are deleted, but still kept open by
some app (possibly even from network, via NFS or SMB!). @Frederic, did you try
rebooting the system?

-- 
With respect,
Roman


Re: Give up on bcache?

2017-09-26 Thread Roman Mamedov
On Tue, 26 Sep 2017 16:50:00 + (UTC)
Ferry Toth  wrote:

> https://www.phoronix.com/scan.php?page=article=linux414-bcache-
> raid=2
> 
> I think it might be idle hopes to think bcache can be used as a ssd cache 
> for btrfs to significantly improve performance..

My personal real-world experience shows that SSD caching -- with lvmcache --
does indeed significantly improve performance of a large Btrfs filesystem with
slowish base storage.

And that article, sadly, only demonstrates once again the generally mediocre
quality of Phoronix content: it is an astonishing oversight not to check out
lvmcache in the same setup, to at least try to draw some useful conclusion --
is it Bcache that is strangely deficient, or does SSD caching as a general
concept not work well in the hardware setup utilized?

-- 
With respect,
Roman


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-12 Thread Roman Mamedov
On Tue, 12 Sep 2017 12:32:14 +0200
Adam Borowski  wrote:

> discard in the guest (not supported over ide and virtio, supported over scsi
> and virtio-scsi)

IDE does support discard in QEMU; I use that all the time.

It got broken briefly in QEMU 2.1 [1], but was then fixed again.

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=757927


-- 
With respect,
Roman


Re: mount time for big filesystems

2017-08-31 Thread Roman Mamedov
On Thu, 31 Aug 2017 07:45:55 -0400
"Austin S. Hemmelgarn"  wrote:

> If you use dm-cache (what LVM uses), you need to be _VERY_ careful and 
> can't use it safely at all with multi-device volumes because it leaves 
> the underlying block device exposed.

It locks the underlying device so it can't be seen by Btrfs and cause problems.

# btrfs dev scan
Scanning for Btrfs filesystems

# btrfs fi show
Label: none  uuid: 62ff7619-8202-47f6-8c7e-cef6f082530e
Total devices 1 FS bytes used 112.00KiB
devid1 size 16.00GiB used 2.02GiB path /dev/mapper/vg-OriginLV

# ls -la /dev/mapper/
total 0
drwxr-xr-x  2 root root 140 Aug 31 12:01 .
drwxr-xr-x 16 root root2980 Aug 31 12:01 ..
crw---  1 root root 10, 236 Aug 31 11:59 control
lrwxrwxrwx  1 root root   7 Aug 31 12:01 vg-CacheDataLV_cdata -> ../dm-1
lrwxrwxrwx  1 root root   7 Aug 31 12:01 vg-CacheDataLV_cmeta -> ../dm-2
lrwxrwxrwx  1 root root   7 Aug 31 12:06 vg-OriginLV -> ../dm-0
lrwxrwxrwx  1 root root   7 Aug 31 12:01 vg-OriginLV_corig -> ../dm-3

# btrfs dev scan /dev/dm-0
Scanning for Btrfs filesystems in '/dev/mapper/vg-OriginLV'

# btrfs dev scan /dev/dm-3
Scanning for Btrfs filesystems in '/dev/mapper/vg-OriginLV_corig'
ERROR: device scan failed on '/dev/mapper/vg-OriginLV_corig': Device or 
resource busy

-- 
With respect,
Roman


Re: mount time for big filesystems

2017-08-31 Thread Roman Mamedov
On Thu, 31 Aug 2017 12:43:19 +0200
Marco Lorenzo Crociani  wrote:

> Hi,
> this 37T filesystem took some times to mount. It has 47 
> subvolumes/snapshots and is mounted with 
> noatime,compress=zlib,space_cache. Is it normal, due to its size?

If you could implement SSD caching in front of your FS (such as lvmcache or
bcache), that would work wonders for performance in general, and especially
for mount times. I have seen amazing results with lvmcache (of just 32 GB) for
a 14 TB FS.

In general, with your FS size perhaps you should also be using
"space_cache=v2" for better performance, but I'm not sure if that will have
any effect on mount time (aside from slowing down the first mount with it).
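
A hedged sketch of adding lvmcache to an existing logical volume (VG/LV names,
the SSD device and the size are illustrative; the SSD has to be added to the
same VG first):

  # vgextend vg0 /dev/nvme0n1p2
  # lvcreate --type cache-pool -L 32G -n cpool0 vg0 /dev/nvme0n1p2
  # lvconvert --type cache --cachepool vg0/cpool0 vg0/bigfs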

-- 
With respect,
Roman


Re: deleted subvols don't go away?

2017-08-28 Thread Roman Mamedov
On Mon, 28 Aug 2017 15:03:47 +0300
Nikolay Borisov  wrote:

> when the cleaner thread runs again the snapshot's root item is going to
> be deleted for good and you no longer will see it.

Oh, that's pretty sweet -- it means there's actually a way to reliably wait
for the cleaner's work to be done on all deleted snapshots before unmounting
the FS. I was wondering about that recently for some transient filesystems
(which get mounted, synced to, snapshot-created/removed, then unmounted). Now I
can just loop with a few-second sleeps until `btrfs sub list -d $PATH` comes up
empty.
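
A hedged one-liner version of that loop (the mountpoint is illustrative):

  while [ -n "$(btrfs subvolume list -d /mnt/transient)" ]; do sleep 5; done
  umount /mnt/transient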

-- 
With respect,
Roman


Re: netapp-alike snapshots?

2017-08-22 Thread Roman Mamedov
On Tue, 22 Aug 2017 18:57:25 +0200
Ulli Horlacher <frams...@rus.uni-stuttgart.de> wrote:

> On Tue 2017-08-22 (21:45), Roman Mamedov wrote:
> 
> > It is beneficial to not have snapshots in-place. With a local directory of
> > snapshots, issuing things like "find", "grep -r" or even "du" will take an
> > inordinate amount of time and will produce a result you do not expect.
> 
> Netapp snapshots are invisible for tools doing opendir()/readdir()
> One could simulate this with symlinks for the snapshot directory:
> store the snapshot elsewhere (not inplace) and create a symlink to it, in
> every directory.
> 
> 
> > Personally I prefer to have a /snapshots directory on every FS
> 
> My users want the snapshots locally in a .snapshot subdirectory.
> Because Netapp do it this way - for at least 20 years and we have a
> multi-PB Netapp storage environment.
> No chance to change this.

Just a side note, you do know that only subvolumes can be snapshotted on Btrfs,
not any regular directory? And that snapshots are not recursive, i.e. if a
subvolume "contains" other subvolumes (hint: it really doesn't), snapshots of
the parent one will not include content of subvolumes below that in the tree.

I don't know how Netapp does this, but from the way you describe that setup it
feels like with Btrfs you're still in for some bad surprises, and a part of
your expectations will not be met.

Do you plan to make each and every directory and subdirectory a subvolume (so
that it could have a trail of its own snapshots)? There will be performance
implications to that. Also, deleting subvolumes can only be done via the
"btrfs" tool; they won't delete like normal dirs, e.g. when trying to do that
remotely via an NFS or Samba share.

-- 
With respect,
Roman


Re: finding root filesystem of a subvolume?

2017-08-22 Thread Roman Mamedov
On Tue, 22 Aug 2017 17:45:37 +0200
Ulli Horlacher  wrote:

> In perl I have now:
> 
> $root = $volume;
> while (`btrfs subvolume show "$root" 2>/dev/null` !~ /toplevel subvolume/) {
>   $root = dirname($root);
>   last if $root eq '/';
> }
> 
> 

If you are okay with rolling your own solutions like this, take a look at
"btrfs filesystem usage <path>". It will print the block device used for
mounting the base FS. From that you can find the mountpoint via /proc/mounts.

Performance-wise it seems to work instantly on an almost full 2TB FS.
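
A rough, hedged sketch of that lookup (the parsing is fragile and the device
path reported by the two sources may not always match, so treat it as a
starting point only):

  # btrfs filesystem usage /path/to/subvol | grep -om1 '/dev/[^ ]*'
  /dev/sdb1
  # grep '^/dev/sdb1 ' /proc/mounts
  (pick the line that mounts the top level, e.g. the one with subvolid=5)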

-- 
With respect,
Roman


Re: netapp-alike snapshots?

2017-08-22 Thread Roman Mamedov
On Tue, 22 Aug 2017 16:24:51 +0200
Ulli Horlacher  wrote:

> On Tue 2017-08-22 (15:44), Peter Becker wrote:
> > Is use: https://github.com/jf647/btrfs-snap
> > 
> > 2017-08-22 15:22 GMT+02:00 Ulli Horlacher :
> > > With Netapp/waffle you have automatic hourly/daily/weekly snapshots.
> > > You can find these snapshots in every local directory (readonly).
> > > Example:
> > >
> > > framstag@fex:/sw/share: ll .snapshot/
> > > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> > > .snapshot/daily.2017-08-15_0010
> > > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> > > .snapshot/daily.2017-08-16_0010
> > > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> > > .snapshot/daily.2017-08-17_0010
> > > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> > > .snapshot/daily.2017-08-18_0010
> > > drwxr-xr-x  framstag root - 2017-08-18 23:59:29 
> > > .snapshot/daily.2017-08-19_0010
> > > drwxr-xr-x  framstag root - 2017-08-19 21:01:25 
> > > .snapshot/daily.2017-08-20_0010
> > > drwxr-xr-x  framstag root - 2017-08-20 19:48:40 
> > > .snapshot/daily.2017-08-21_0010
> > > drwxr-xr-x  framstag root - 2017-08-20 02:50:18 
> > > .snapshot/hourly.2017-08-20_1210
> > > drwxr-xr-x  framstag root - 2017-08-20 02:50:18 
> > > .snapshot/hourly.2017-08-20_1610
> > > drwxr-xr-x  framstag root - 2017-08-20 19:48:40 
> > > .snapshot/hourly.2017-08-20_2010
> > > drwxr-xr-x  framstag root - 2017-08-21 00:42:28 
> > > .snapshot/hourly.2017-08-21_0810
> > > drwxr-xr-x  framstag root - 2017-08-21 00:42:28 
> > > .snapshot/hourly.2017-08-21_1210
> > > drwxr-xr-x  framstag root - 2017-08-21 13:05:28 
> > > .snapshot/hourly.2017-08-21_1610
> 
> btrfs-snap does not create local .snapshot/ sub-directories, but saves the
> snapshots in the toplevel root volume directory.

It is beneficial to not have snapshots in-place. With a local directory of
snapshots, issuing things like "find", "grep -r" or even "du" will take an
inordinate amount of time and will produce a result you do not expect.

For some of those tools the problem can be avoided (by always keeping in mind
to use "-x" with du, or "--one-file-system" with tar), but not for all of them.

Personally I prefer to have a /snapshots directory on every FS, and e.g. timed
snapshots of /home/username/src will live in /snapshots/home-username-src/. No
point to hide it there with a dot either, as it's convenient to be able to
browse older snapshots with GUI filemanagers (which hide dot-files by default).

-- 
With respect,
Roman


Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Roman Mamedov
On Wed, 16 Aug 2017 12:48:42 +0100 (BST)
"Konstantin V. Gavrilenko"  wrote:

> I believe the chunk size of 512kb is even worth for performance then the 
> default settings on my HW RAID of  256kb.

It might be, but that does not explain the original problem reported, at all.
If mdraid performance were the bottleneck, you would see high iowait and
possibly some CPU load from the mdX_raidY threads -- but not a single Btrfs
thread pegged at 100% CPU.

> So now I am moving the data from the array and will be rebuilding it with 64
> or 32 chunk size and checking the performance.

64K is the sweet spot for RAID5/6:
http://louwrentius.com/linux-raid-level-and-chunk-size-the-benchmarks.html

-- 
With respect,
Roman


Re: csum errors on top of dm-crypt

2017-08-04 Thread Roman Mamedov
On Fri, 4 Aug 2017 12:44:44 +0500
Roman Mamedov <r...@romanrm.net> wrote:

> > What is 0x98f94189, is it not a csum of a block of zeroes by any chance?
> 
> It does seem to be something of that sort

Actually, I think I know what happened.

I used "dd bs=1M conv=sparse" to copy source FS onto a LUKS device, which
skipped copying 1M-sized areas of zeroes from the source device by seeking
over those areas on the destination device.

This only works OK if the destination device is entirely zeroed beforehand.

But I also use --allow-discards for the LUKS device; so it may be that after a
discard is passed through to the underlying SSD, which will then return zeroes
for the discarded areas, LUKS does not take care to pass zeroes back "upwards"
when reading from those areas; instead it may attempt to decrypt them with its
crypto process, making them read back to userspace as random data.

So after an initial TRIM the destination crypto device was not actually zeroed,
far from it. :)

As a result, every large non-sparse file with at least 1MB-long run of zeroes
in it (those sqlite ones appear to fit the bill) was not written out entirely
onto the destination device by dd, and the intended zero areas were left full
of crypto-randomness instead.
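
For reference, a hedged sketch of the two safer variants of that copy (device
names are illustrative). Either zero the destination first, so that skipped
areas really do read back as zeroes:

  # blkdiscard -z /dev/mapper/newhome    (needs a reasonably recent util-linux)
  # dd if=/dev/mapper/oldhome of=/dev/mapper/newhome bs=1M conv=sparse

or simply drop conv=sparse and let dd write every block through dm-crypt.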

Sorry for the noise, I hope at least this catch was somewhat entertaining.

And Btrfs saves the day once again. :)

-- 
With respect,
Roman


SQLite Re: csum errors on top of dm-crypt

2017-08-04 Thread Roman Mamedov
On Fri, 4 Aug 2017 12:18:58 +0500
Roman Mamedov <r...@romanrm.net> wrote:

> What I find weird is why the expected csum is the same on all of these.
> Any idea what this might point to as the cause?
> 
> What is 0x98f94189, is it not a csum of a block of zeroes by any chance?

It does seem to be something of that sort, as it appears in
https://www.spinics.net/lists/linux-btrfs/msg67281.html
(though as the actual csum, not the expected one).

> a few files turned out to be unreadable

Actually, it turns out ALL of those are sqlite files(!)

.mozilla/firefox/.../places.sqlite <- 4 instances (for 4 users)
.moonchild productions/pale moon/.../urlclassifier3.sqlite
.config/chromium/Default/Application Cache/Cache/data_3 <- twice (for 2 users)
.config/chromium/Default/History
.config/chromium/Default/Top Sites

nothing else affected.

Forgot to mention that the kernel version is 4.9.40.

-- 
With respect,
Roman


csum errors on top of dm-crypt

2017-08-04 Thread Roman Mamedov
Hello,

I migrated my home dir to a LUKS dm-crypt device some time ago, and today
during a scheduled backup a few files turned out to be unreadable, with csum
errors from Btrfs in dmesg.

What I find weird is why the expected csum is the same on all of these.
Any idea what this might point to as the cause?

What is 0x98f94189, is it not a csum of a block of zeroes by any chance?

(I use a patch from Qu Wenruo to improve the error reporting).

[483575.992252] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 
32 csum 0xe2a2e6eb expected csum 0x98f94189 mirror 1
[483575.994518] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 
32 csum 0xe2a2e6eb expected csum 0x98f94189 mirror 1
[483575.995640] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 
2785280 csum 0x7f97f4a6 expected csum 0x98f94189 mirror 1
[483575.996599] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 
1736704 csum 0x7476ddf8 expected csum 0x98f94189 mirror 1
[483585.020047] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 
1011712 csum 0xbadf2d3e expected csum 0x98f94189 mirror 1
[483585.023036] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 
1011712 csum 0xbadf2d3e expected csum 0x98f94189 mirror 1
[483585.023702] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 
1900544 csum 0x26c571dc expected csum 0x98f94189 mirror 1
[483585.023761] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 
2949120 csum 0x27726fbe expected csum 0x98f94189 mirror 1
[483599.026289] BTRFS warning (device dm-1): csum failed root 5 ino 14988 off 
17645568 csum 0xdd5bf4de expected csum 0x98f94189 mirror 1
[483599.027425] BTRFS warning (device dm-1): csum failed root 5 ino 14988 off 
17465344 csum 0x42bf4f44 expected csum 0x98f94189 mirror 1
[483599.032396] BTRFS warning (device dm-1): csum failed root 5 ino 14988 off 
17465344 csum 0x42bf4f44 expected csum 0x98f94189 mirror 1
[483599.092709] BTRFS warning (device dm-1): csum failed root 5 ino 15002 off 
1110016 csum 0xbca8fc65 expected csum 0x98f94189 mirror 1
[483599.093080] BTRFS warning (device dm-1): csum failed root 5 ino 15002 off 
1110016 csum 0xbca8fc65 expected csum 0x98f94189 mirror 1
[483599.093242] BTRFS warning (device dm-1): csum failed root 5 ino 15002 off 
1736704 csum 0x1d4087fc expected csum 0x98f94189 mirror 1
[483627.708625] BTRFS warning (device dm-1): csum failed root 5 ino 29039 off 
2613248 csum 0xe1952338 expected csum 0x98f94189 mirror 1
[483627.709459] BTRFS warning (device dm-1): csum failed root 5 ino 29039 off 
2613248 csum 0xe1952338 expected csum 0x98f94189 mirror 1
[483627.709799] BTRFS warning (device dm-1): csum failed root 5 ino 29039 off 
2965504 csum 0xfaff212d expected csum 0x98f94189 mirror 1
[483634.462684] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 
5062656 csum 0x8c7df392 expected csum 0x98f94189 mirror 1
[483634.462703] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 
4108288 csum 0x6005cecd expected csum 0x98f94189 mirror 1
[483634.466602] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 
7159808 csum 0xfc06d954 expected csum 0x98f94189 mirror 1
[483634.466604] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 
6111232 csum 0xc802b3b4 expected csum 0x98f94189 mirror 1
[483634.470118] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 
4108288 csum 0x6005cecd expected csum 0x98f94189 mirror 1
[483634.470257] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 
10305536 csum 0x3d8c1843 expected csum 0x98f94189 mirror 1
[483634.471085] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 
9256960 csum 0xba3fede3 expected csum 0x98f94189 mirror 1
[483634.471128] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 
8302592 csum 0x7de15198 expected csum 0x98f94189 mirror 1
[484152.178497] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 
1163264 csum 0x341f3c2a expected csum 0x98f94189 mirror 1
[484152.180422] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 
1736704 csum 0xf01ac658 expected csum 0x98f94189 mirror 1
[484152.181598] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 
1163264 csum 0x341f3c2a expected csum 0x98f94189 mirror 1
[484152.182242] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 
2785280 csum 0xc78988ec expected csum 0x98f94189 mirror 1
[484158.569489] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 
2138112 csum 0xab34e90e expected csum 0x98f94189 mirror 1
[484158.571885] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 
2785280 csum 0xd611911e expected csum 0x98f94189 mirror 1
[484158.575191] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 
3833856 csum 0x6277c8a6 expected csum 0x98f94189 mirror 1
[484158.575620] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 
4882432 csum 0x3293c3e7 expected csum 0x98f94189 mirror 1
[484158.578637] BTRFS warning 

Re: Crashed filesystem, nothing helps

2017-08-02 Thread Roman Mamedov
On Wed, 02 Aug 2017 11:17:04 +0200
Thomas Wurfbaum  wrote:
 
> A restore does also not help:
> mainframe:~ # btrfs restore /dev/sdb1 /mnt
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> Ignoring transid failure
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> Ignoring transid failure
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> Ignoring transid failure

Did it just abruptly exit there? Or you terminated it?

IIRC these messages (about ignoring) are not a problem for restore, it should
be able to continue. Or if not, it would print a more definitive error
message, e.g. "Couldn't read tree root" or such.

-- 
With respect,
Roman


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Roman Mamedov
On Tue,  1 Aug 2017 10:14:23 -0600
Liu Bo  wrote:

> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> separate disk as a journal (aka raid5/6 log), so that after unclean
> shutdown we can make sure data and parity are consistent on the raid
> array by replaying the journal.

Could it be possible to designate areas on the in-array devices to be used as
journal?

While md doesn't have much spare room in its metadata for extraneous things
like this, Btrfs could use almost as much as it wants to, adding to the size of
the FS metadata areas. Reliability-wise, the log could be stored as RAID1 chunks.

It doesn't seem convenient to need an additional storage device around just for
the log, and to have to maintain its fault tolerance yourself (the log device
would better be on a mirror, such as mdadm RAID1: more expense and maintenance
complexity).

-- 
With respect,
Roman


Re: BTRFS error: bad tree block start 0 623771648

2017-08-01 Thread Roman Mamedov
On Sun, 30 Jul 2017 18:14:35 +0200
"marcel.cochem"  wrote:

> I am pretty sure that not all data is lost as i can grep thorugh the
> 100 GB SSD partition. But my question is, if there is a tool to rescue
> all (intact) data and maybe have only a few corrupt files which can't
> be recovered.

There is such a tool, see https://btrfs.wiki.kernel.org/index.php/Restore

-- 
With respect,
Roman


Re: BTRFS error: bad tree block start 0 623771648

2017-08-01 Thread Roman Mamedov
On Mon, 31 Jul 2017 11:12:01 -0700
Liu Bo  wrote:

> Superblock and chunk tree root is OK, looks like the header part of
> the tree root is now all-zero, but I'm unable to think of a btrfs bug
> which can lead to that (if there is, it is a serious enough one)

I see that the FS is being mounted with "discard". So maybe it was a TRIM gone
bad (wrong location or in a wrong sequence).

Generally, using "discard" appears not to be recommended by now (because of
its performance impact, and possibly issues like this); instead schedule a call
to "fstrim <mountpoint>" once a day or so, and/or on boot-up.

> on ssd like disks, by default there is only one copy for metadata.

Time and time again, the default of "single" metadata for SSD is a terrible
idea. Most likely DUP metadata would have saved the FS in this case.

-- 
With respect,
Roman


Re: Btrfs + compression = slow performance and high cpu usage

2017-07-28 Thread Roman Mamedov
On Fri, 28 Jul 2017 17:40:50 +0100 (BST)
"Konstantin V. Gavrilenko"  wrote:

> Hello list, 
> 
> I am stuck with a problem of btrfs slow performance when using compression.
> 
> when the compress-force=lzo mount flag is enabled, the performance drops to 
> 30-40 mb/s and one of the btrfs processes utilises 100% cpu time.
> mount options: btrfs 
> relatime,discard,autodefrag,compress=lzo,compress-force,space_cache=v2,commit=10

It does not work like that, you need to set compress-force=lzo (and remove
compress=).

With your setup I believe you currently use compress-force[=zlib](default),
overriding compress=lzo, since it's later in the options order.

Secondly,

> autodefrag

This sure sounded like a good thing to enable? on paper? right?...

The moment you see anything remotely weird about btrfs, this is the first
thing you have to disable and retest without. Oh wait, the first would be
qgroups, this one is second.

Finally, what is the reasoning behind "commit=10", and did you check with the
default value of 30?
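
Putting the above together, a saner set of options to retest with might be
something like (just a sketch, keep whatever else you actually need):

  relatime,compress-force=lzo,space_cache=v2

and then re-add the other options one by one, if they prove to actually help.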

-- 
With respect,
Roman


Re: Best Practice: Add new device to RAID1 pool

2017-07-24 Thread Roman Mamedov
On Mon, 24 Jul 2017 09:46:34 -0400
"Austin S. Hemmelgarn"  wrote:

> > I am a little bit confused because the balance command is running since
> > 12 hours and only 3GB of data are touched. This would mean the whole
> > balance process (new disc has 8TB) would run a long, long time... and
> > is using one cpu by 100%.
> 
> Based on what you're saying, it sounds like you've either run into a 
> bug, or have a huge number of snapshots

...and possibly quotas (qgroups) enabled. (perhaps automatically by some tool,
and not by you). Try:

  btrfs quota disable <path>

With respect,
Roman


Re: [PATCH v3] Btrfs: add skeleton code for compression heuristic

2017-07-21 Thread Roman Mamedov
On Fri, 21 Jul 2017 13:00:56 +0800
Anand Jain  wrote:

> 
> 
> On 07/18/2017 02:30 AM, David Sterba wrote:
> > So it basically looks good, I could not resist and rewrote the changelog
> > and comments. There's one code fix:
> > 
> > On Mon, Jul 17, 2017 at 04:52:58PM +0300, Timofey Titovets wrote:
> >> -static inline int inode_need_compress(struct inode *inode)
> >> +static inline int inode_need_compress(struct inode *inode, u64 start, u64 
> >> end)
> >>   {
> >>struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >>
> >>/* force compress */
> >>if (btrfs_test_opt(fs_info, FORCE_COMPRESS))
> >> -  return 1;
> >> +  return btrfs_compress_heuristic(inode, start, end);
> 
> 
> 
> > This must stay 'return 1', if force-compress is on, so the change is
> > reverted.
> 
>   Initially I thought 'return 1' is correct, but looking in depth,
>   it is not correct as below..
> 
>   The biggest beneficiary of the estimating the compression ratio
>   in advance (heuristic) is when customers are using the
>   -o compress-force. But 'return 1' here is making them not to
>   use heuristic. So definitely something is wrong.

man mount says for btrfs:

If compress-force is specified, all files will be compressed,  whether
or  not they compress well.

So compress-force by definition should always compress all files no matter
what, and not use any heuristic. In fact it has no right to, as user forced
compression to always on. Returning 1 up there does seem right to me.

>   -o compress is about the whether each of the compression-granular bytes
>   (BTRFS_MAX_UNCOMPRESSED) of the inode should be tried to compress OR
>   just give up for the whole inode by looking at the compression ratio
>   of the current compression-granular.
>   This approach can be overridden by -o compress-force. So in
>   -o compress-force there will be a lot more efforts in _trying_
>   to compression than in -o compress. We must use heuristic for
>   -o compress-force.

The semantics and the user expectation of compress-force dictate that it always
compresses without giving up, even if that turns out to be slower and to provide
little benefit.

-- 
With respect,
Roman


Re: "detect-zeroes=unmap" support in Btrfs?

2017-07-18 Thread Roman Mamedov
On Tue, 18 Jul 2017 16:57:10 +0500
Roman Mamedov <r...@romanrm.net> wrote:

> if a block written consists of zeroes entirely, instead of writing zeroes to
> the backing storage, converts that into an "unmap" operation
> (FALLOC_FL_PUNCH_HOLE[1]).

BTW I found that it is very easy to "offline" process preexisting files for
this, using "fallocate -d".

   -d, --dig-holes
  Detect  and  dig  holes. Makes the file sparse in-place, without
  using extra disk space. The minimal size of the hole depends  on
  filesystem I/O block size (usually 4096 bytes). Also, when using
  this option, --keep-size is implied. If no range is specified by
  --offset and --length, then all file is analyzed for holes.

  You  can think of this as doing a "cp --sparse" and renaming the
  dest file as the original,  without  the  need  for  extra  disk
  space.
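
For example, to sparsify a whole directory tree of existing files in place
(the path is just an example):

  find /path/to/files -type f -exec fallocate -d {} \;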

So my suggestion is to implement an "online" counterpart to such
forced-sparsifying, i.e. the same thing done on FS I/O in-band.
(the analogy is with offline vs in-band dedup).

-- 
With respect,
Roman


"detect-zeroes=unmap" support in Btrfs?

2017-07-18 Thread Roman Mamedov
Hello,

Qemu/KVM has this nice feature in its storage layer, "detect-zeroes=unmap".
Basically the VM host detects if a block written by the guest consists of
zeroes entirely, and instead of writing zeroes to the backing storage,
converts that into an "unmap" operation (FALLOC_FL_PUNCH_HOLE[1]).
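
For reference, on the qemu side this is enabled with a drive option along these
lines (the backing device path is just an example):

  -drive file=/dev/vg/guest,format=raw,discard=unmap,detect-zeroes=unmap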

I wonder if the same could be added into Btrfs directly? With a CoW filesystem
there is really no reason to store long runs of zeroes, or even spend
compression cycles on them (even if they compress really well), it would be
more efficient to turn all zero-filled blocks into file holes. (In effect
forcing all files with zero blocks into always being "sparse" files.)

You could say this increases fragmentation, but given the CoW nature of Btrfs,
any write to a file increases fragmentation already (except with "nocow"), and
converting zeroes into holes would be beneficial due to not requiring any
actual IO when those need to be read (reading back zeroes which are not stored
anywhere, as opposed to reading actual zeroes from disk).

[1] http://man7.org/linux/man-pages/man2/fallocate.2.html

-- 
With respect,
Roman


Re: Chunk root problem

2017-07-06 Thread Roman Mamedov
On Wed, 5 Jul 2017 22:10:35 -0600
Daniel Brady  wrote:

> parent transid verify failed

Typically in Btrfs terms this means "you're screwed", fsck will not fix it, and
nobody will know how to fix or what is the cause either. Time to restore from
backups! Or look into "btrfs restore" if you don't have any.

In your case it's especially puzzling as the difference in transid numbers is
really significant (about 100K), almost like the FS was operating for months
without updating some parts of itself -- and no checksum errors either, so
all looks correct, except that everything is horribly wrong.

This kind of error seems to occur more often in RAID setups, either Btrfs
native RAID or Btrfs on top of another RAID layer -- i.e. where ensuring that
writes to multiple devices all complete in order becomes a complex issue in
case of an unclean shutdown (it is much simpler on a single-device FS).

Also one of your disks or cables is failing (was /dev/sde on that boot, but may
get a different index next boot), check SMART data for it and replace.

> [   21.230919] BTRFS info (device sdf): bdev /dev/sde errs: wr 402545, rd
> 234683174, flush 194501, corrupt 0, gen 0

-- 
With respect,
Roman


Re: About free space fragmentation, metadata write amplification and (no)ssd

2017-06-08 Thread Roman Mamedov
On Thu, 8 Jun 2017 19:57:10 +0200
Hans van Kranenburg  wrote:

> There is an improvement with subvolume delete + nossd that is visible
> between 4.7 and 4.9.

I don't remember if I asked before, but did you test on 4.4? The two latest
longterm series are 4.9 and 4.4. 4.7 should be abandoned and forgotten by now
really, certainly not used daily in production, it's not even listed on
kernel.org anymore. Also it's possible that the 4.7 branch you tested did not
receive all the bugfix backports from mainline the way the longterm series do.

> I have no idea what change between 4.7 and 4.9 is responsible for this, but
> it's good.  

FWIW, this appears to be the big Btrfs change between 4.7 and 4.9 (in 4.8):

Btrfs: introduce ticketed enospc infrastructure
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=957780eb2788d8c218d539e19a85653f51a96dc1

-- 
With respect,
Roman


Re: getting rid of "csum failed" on a hw raid

2017-06-07 Thread Roman Mamedov
On Wed, 7 Jun 2017 15:09:02 +0200
Adam Borowski  wrote:

> On Wed, Jun 07, 2017 at 01:10:26PM +0300, Timofey Titovets wrote:
> > 2017-06-07 13:05 GMT+03:00 Stefan G. Weichinger :
> > > Am 2017-06-07 um 11:37 schrieb Timofey Titovets:
> > >
> > >> btrfs scrub start /mnt_path do this trick
> > >>
> > >> After, you can find info with paths in dmesg
> > >
> > > thank you, I think I have the file, it's a qemu-img-file.
> > > I try cp-ing it to another fs first, but assume this will fail, right?
> > 
> > Yes, because btrfs will return -EIO
> > So try dd_rescue
> 
> Or even plain dd conv=noerror.  Both will do a faithful analogue of a
> physical disk with a silent data corruption on the affected sectors.

Yeah, except "plain dd conv=noerror" will produce a useless corrupted image,
because it will be shifted forward by the number of unreadable bytes after the
first error.

You also need the "sync" flag in there.
https://superuser.com/questions/622541/what-does-dd-conv-sync-noerror-do
http://www.debianadmin.com/recover-data-from-a-dead-hard-drive-using-dd.html
https://wiki.archlinux.org/index.php/disk_cloning
Or just stick with dd_rescue and not try to correct people's perfectly good
suggestions with completely wrong and harmful ones.
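
For example, something along these lines keeps the offsets in the image intact
(device and output file names are just examples):

  dd if=/dev/sdX of=image.img bs=4096 conv=noerror,sync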

-- 
With respect,
Roman


Re: [PATCH v2 2/2] Btrfs: compression must free at least PAGE_SIZE

2017-05-21 Thread Roman Mamedov
On Sun, 21 May 2017 19:54:05 +0300
Timofey Titovets  wrote:

> Sorry, but i know about subpagesize-blocksize patch set, but i don't
> understand where you see conflict?
> 
> Can you explain what you mean?
> 
> By PAGE_SIZE i mean fs cluster size in my patch set.

This appears to be exactly the conflict. Subpagesize blocksize patchset would
make it possible to use e.g. Btrfs with 4K block (cluster) size on a MIPS
machine with 64K-sized pages. Would your checking for PAGE_SIZE still be
correct then?

> So if and when subpage patch set would merged, PAGE_SIZE should be
> replaced with sector size, and all continue work correctly.

I guess Duncan's question was why not compare against block size from the get
go, rather than create more places for Chandan to scour through to eliminate
all "blocksize = pagesize" assumptions...

-- 
With respect,
Roman


Re: RAID 6 corrupted

2017-05-19 Thread Roman Mamedov
On Fri, 19 May 2017 11:55:27 +0300
Pasi Kärkkäinen  wrote:

> > > Try saving your data with "btrfs restore" first 
> > 
> > First post, he tried that.  No luck.  Tho that was with 4.4 userspace.  
> > It might be worth trying with the 4.11-rc or soon to be released 4.11 
> > userspace, tho...
> > 
> 
> Try with 4.12-rc, I assume :)

No, actually I missed that this was already tried, and a newer kernel will not
help with "btrfs restore"; AFAIU it works entirely in userspace, not in the kernel.

Newer btrfs-progs could be something to try though, as the version used seems
pretty old -- btrfs-progs v4.4.1, while the current one is v4.11.

-- 
With respect,
Roman


Re: RAID 6 corrupted

2017-05-17 Thread Roman Mamedov
On Thu, 18 May 2017 04:09:38 +0200
Łukasz Wróblewski  wrote:

> I will try when stable 4.12 comes out.
> Unfortunately I do not have a backup.
> Fortunately, these data are not so critical.
> Some private photos and videos of youth.
> However, I would be very happy if I could get it back.

Try saving your data with "btrfs restore" first (i.e. you can do it right now,
as it doesn't depend on kernel versions), after you have your data recovered
and reliably backed up, then you can proceed with experiments on new kernel,
patches and whatnot.
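
The basic invocation would be something like (device and target are just
examples; the target must be on a different, healthy filesystem):

  btrfs restore /dev/sdX /mnt/recovery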

-- 
With respect,
Roman


Re: Btrfs/SSD

2017-05-13 Thread Roman Mamedov
On Fri, 12 May 2017 20:36:44 +0200
Kai Krakow  wrote:

> My concern is with fail scenarios of some SSDs which die unexpected and
> horribly. I found some reports of older Samsung SSDs which failed
> suddenly and unexpected, and in a way that the drive completely died:
> No more data access, everything gone. HDDs start with bad sectors and
> there's a good chance I can recover most of the data except a few
> sectors.

Just have your backups up-to-date, doesn't matter if it's SSD, HDD or any sort
of RAID.

In a way it's even better that SSDs [are said to] fail abruptly and entirely.
You can then just restore from backups and go on. Whereas a failing HDD can
leave you puzzled on e.g. whether it's a cable or controller problem instead,
and possibly can even cause some data corruption which you won't notice until
too late.

-- 
With respect,
Roman


Re: Backing up BTRFS metadata

2017-05-11 Thread Roman Mamedov
On Thu, 11 May 2017 09:19:28 -0600
Chris Murphy  wrote:

> On Thu, May 11, 2017 at 8:56 AM, Marat Khalili  wrote:
> > Sorry if question sounds unorthodox, Is there some simple way to read (and
> > backup) all BTRFS metadata from volume?
> 
> btrfs-image

Hm, I thought that's for debugging only, and that you can't actually restore
metadata onto a data-containing FS and have anything mountable/readable as a
result.

Seems not to be the case, and in fact, could this be one of the "missing
links" in the Fsck story, 

   -w
   Walk all the trees manually and copy any blocks that are
   referenced. Use this option if your extent tree is corrupted to
   make sure that all of the metadata is captured.

This certainly does sound like something to try for some of those broken
filesystems where Btrfsck refuses to do anything. Save image with this manual
walking/reconstruction of the trees, then restore. Too bad I already nuked
mine, so can't experiment with that.
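
For anyone who still has such a filesystem to experiment on, the round trip
would presumably be something like (devices are just examples):

  btrfs-image -w /dev/sdX metadata.img
  btrfs-image -r metadata.img /dev/sdY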

-- 
With respect,
Roman


Re: runtime btrfsck

2017-05-10 Thread Roman Mamedov
On Wed, 10 May 2017 09:48:07 +0200
Martin Steigerwald  wrote:

> Yet, when it comes to btrfs check? Its still quite rudimentary if you ask me. 
>  

Indeed it is. It may or may not be possible to build a perfect Fsck, but IMO
for the time being, what's most sorely missing is some sort of a knowingly
destructive repair mode, as in "I don't care about partial user data loss,
just whack the FS metadata into full logical consistency at any means
necessary".

It also feels like it doesn't currently deal with the majority of actual
real-world corruptions, notably the "parent transid failure" (even by a few
dozen increments), which it can only helpfully "Ignore" during repair.

So even with a minor corruption (something wonky in just ONE block of a
multi-terabyte FS) the answer is way too often "nuke the entire thing and
restore from backups".

-- 
With respect,
Roman


Re: runtime btrfsck

2017-05-10 Thread Roman Mamedov
On Wed, 10 May 2017 09:02:46 +0200
Stefan Priebe - Profihost AG  wrote:

> how to fix bad key ordering?

You should clarify does the FS in question mount (read-write? read-only?)
and what are the kernel messages if it does not.

-- 
With respect,
Roman


Re: "corrupt leaf, invalid item offset size pair"

2017-05-09 Thread Roman Mamedov
On Mon, 8 May 2017 20:05:44 +0200
"Janos Toth F."  wrote:

> May be someone more talented will be able to assist you but in my
> experience this kind of damage is fatal in practice (even if you could
> theoretically fix it, it's probably easier to recreate the fs and
> restore the content from backup, or use the rescue tool to save some
> of the old content which you never had copies from and restore that).
> I think the problem is that the disturbed disk gets out of sync
> (obviously, it misses some queued/buffered writes) from the rest of
> the fs/disk(s) but later gets accepted back like it's in a perfectly
> fine state (and/or Btrfs is ready to deal with problems like this,
> though it looks like it is not), and then some fatal corruption starts
> developing (due to the problematic disk being treated like it has
> correct data, even though it has some errors). If you have it mounted
> RW long enough, it will probably get worse and gets unmountable at
> some point (and thus harder, if not impossible to rescure any data).
> This is how I usually lost my RAID-5 mode Btrfs filesystems before I
> stopped experimenting with that. I never had this problem since I
> disabled SATA HotPlug (in the firmware setup of the motherboard) and
> switched to RAID-10 mode (and eventually replaced both faulty SATA
> cables in the system, one at a time after an incident...).

Yeah, I scrapped the FS and am now restoring from backups.  For some of the
stuff that wasn't backed up, "btrfs restore" worked remarkably well.

This was my primary 9x2TB mdadm RAID6 with Btrfs on top. But after all, it
appears to be too risky to run all storage as a huge SPOF like that. And
since I had almost everything backed up elsewhere, there's seems to be little
justification for the protections of RAID6 (the machine does not need 100.00%
uptime and does not even have hot-swap drive bays).

So I will now switch to using individual drives with single-device Btrfs on
each, joined for convenience with mhddfs/unionfs/aufs on the directory tree
level.

-- 
With respect,
Roman


"corrupt leaf, invalid item offset size pair"

2017-05-07 Thread Roman Mamedov
Hello,

It appears like during some trouble with HDD cables and controllers, I got some 
disk corruption.
As a result, after a short period of time my Btrfs went read-only, and now does 
not mount anymore.

[Sun May  7 23:08:02 2017] BTRFS error (device dm-8): parent transid verify 
failed on 13799442505728 wanted 625048 found 624487
[Sun May  7 23:08:02 2017] BTRFS info (device dm-8): read error corrected: ino 
1 off 13799442505728 (dev /dev/mapper/vg-r6p1 sector 6736670512)
[Sun May  7 23:08:33 2017] BTRFS error (device dm-8): parent transid verify 
failed on 13799589576704 wanted 625088 found 624488
[Sun May  7 23:08:33 2017] BTRFS error (device dm-8): parent transid verify 
failed on 13799589576704 wanted 625088 found 624402
[Sun May  7 23:08:33 2017] [ cut here ]
[Sun May  7 23:08:33 2017] WARNING: CPU: 3 PID: 2022 at 
fs/btrfs/extent-tree.c:6555 __btrfs_free_extent.isra.67+0x2c2/0xd40 [btrfs]()
[Sun May  7 23:08:33 2017] BTRFS: Transaction aborted (error -5)
[Sun May  7 23:08:33 2017] Modules linked in: dm_mirror dm_region_hash dm_log 
ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_nat xt_limit 
xt_length nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6t_rpfilter 
ipt_rpfilter xt_multiport iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack ip6table_raw iptable_raw ip6table_mangle 
iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables x_tables 
cpufreq_userspace cpufreq_conservative cpufreq_stats cpufreq_powersave nbd nfsd 
nfs_acl rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd 
grace sunrpc fscache 8021q garp mrp bridge stp llc bonding tcp_illinois aoe 
crc32 loop it87 hwmon_vid fuse kvm_amd kvm irqbypass crct10dif_pclmul eeepc_wmi 
crc32_pclmul ghash_clmulni_intel asus_wmi sparse_keymap rfkill
[Sun May  7 23:08:33 2017]  video sha256_ssse3 sha256_generic hmac mxm_wmi drbg 
ansi_cprng snd_hda_codec_realtek aesni_intel snd_hda_codec_generic aes_x86_64 
snd_hda_intel lrw gf128mul snd_hda_codec glue_helper snd_pcsp snd_hda_core 
ablk_helper snd_hwdep cryptd snd_pcm snd_timer cp210x joydev snd serio_raw 
k10temp evdev usbserial edac_mce_amd edac_core soundcore fam15h_power 
sp5100_tco acpi_cpufreq tpm_infineon wmi i2c_piix4 tpm_tis tpm 8250_fintek 
shpchp processor button ext4 crc16 mbcache jbd2 btrfs dm_cache_smq raid10 raid1 
raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
crc32c_generic md_mod dm_cache_mq dm_cache dm_persistent_data dm_bio_prison 
dm_bufio libcrc32c dm_mod sg sd_mod ata_generic hid_generic usbhid hid ohci_pci 
sata_mv ahci pata_jmicron libahci crc32c_intel sata_sil24
[Sun May  7 23:08:33 2017]  ehci_pci ohci_hcd xhci_pci psmouse xhci_hcd 
ehci_hcd libata usbcore scsi_mod e1000e usb_common ptp pps_core
[Sun May  7 23:08:33 2017] CPU: 3 PID: 2022 Comm: btrfs-transacti Not tainted 
4.4.66-rm1+ #181
[Sun May  7 23:08:33 2017] Hardware name: To be filled by O.E.M. To be filled 
by O.E.M./M5A97 LE R2.0, BIOS 2601 03/24/2015
[Sun May  7 23:08:33 2017]  0286 2595262f 8800d675baf0 
812ff351
[Sun May  7 23:08:33 2017]  8800d675bb38 a03929d2 8800d675bb28 
8107eb95
[Sun May  7 23:08:33 2017]  0c42b8ffb000 fffb 8805b6c60800 

[Sun May  7 23:08:33 2017] Call Trace:
[Sun May  7 23:08:33 2017]  [] dump_stack+0x63/0x82
[Sun May  7 23:08:33 2017]  [] warn_slowpath_common+0x95/0xe0
[Sun May  7 23:08:33 2017]  [] warn_slowpath_fmt+0x5c/0x80
[Sun May  7 23:08:33 2017]  [] 
__btrfs_free_extent.isra.67+0x2c2/0xd40 [btrfs]
[Sun May  7 23:08:33 2017]  [] ? 
btrfs_merge_delayed_refs+0x67/0x610 [btrfs]
[Sun May  7 23:08:33 2017]  [] 
__btrfs_run_delayed_refs+0x99c/0x1260 [btrfs]
[Sun May  7 23:08:33 2017]  [] ? dequeue_task_fair+0x597/0x870
[Sun May  7 23:08:33 2017]  [] ? put_prev_entity+0x42/0x760
[Sun May  7 23:08:33 2017]  [] 
btrfs_run_delayed_refs+0x7e/0x2b0 [btrfs]
[Sun May  7 23:08:33 2017]  [] ? del_timer_sync+0x48/0x50
[Sun May  7 23:08:33 2017]  [] 
btrfs_commit_transaction+0x5d/0xa60 [btrfs]
[Sun May  7 23:08:33 2017]  [] ? start_transaction+0x99/0x4d0 
[btrfs]
[Sun May  7 23:08:33 2017]  [] 
transaction_kthread+0x1dc/0x250 [btrfs]
[Sun May  7 23:08:33 2017]  [] ? 
btrfs_cleanup_transaction+0x560/0x560 [btrfs]
[Sun May  7 23:08:33 2017]  [] kthread+0xfa/0x110
[Sun May  7 23:08:33 2017]  [] ? kthread_park+0x60/0x60
[Sun May  7 23:08:33 2017]  [] ret_from_fork+0x3f/0x70
[Sun May  7 23:08:33 2017]  [] ? kthread_park+0x60/0x60
[Sun May  7 23:08:33 2017] ---[ end trace 13439d259c35afcf ]---
[Sun May  7 23:08:33 2017] BTRFS: error (device dm-8) in 
__btrfs_free_extent:6555: errno=-5 IO failure
[Sun May  7 23:08:33 2017] BTRFS info (device dm-8): forced readonly
[Sun May  7 23:08:33 2017] BTRFS: error (device dm-8) in 
btrfs_run_delayed_refs:2930: errno=-5 IO failure
[Sun May  7 23:14:26 2017] BTRFS error (device dm-8): cleaner transaction 
attach returned -30

Unmounted and 

Re: btrfs check --repair: failed to repair damaged filesystem, aborting

2017-05-03 Thread Roman Mamedov
On Tue, 2 May 2017 23:17:11 -0700
Marc MERLIN  wrote:

> On Tue, May 02, 2017 at 11:00:08PM -0700, Marc MERLIN wrote:
> > David,
> > 
> > I think you maintain btrfs-progs, but I'm not sure if you're in charge 
> > of check --repair.
> > Could you comment on the bottom of the mail, namely:
> > > failed to repair damaged filesystem, aborting
> > > So, I'm out of luck now, full wipe and 3-5 day rebuild?
>   
> Actually, another thought:
> Is there or should there be a way to repair around the bit that cannot
> be repaired?
> Separately, or not, can I locate which bits are causing the repair to
> fail and maybe get a pointer to the path/inode so that I can hopefully
> just delete those bad data structures (assuming deleting them is even
> possible and that the FS won't just go read only as I try to do that)

There is the "btrfs-corrupt-block" tool which helped me to kick Btrfsck
further along its course in a similar "unrepairable" situation.
https://www.spinics.net/lists/linux-btrfs/msg53061.html

In your case it appears like the block 2899180224512 is giving it the most
trouble, so you could start with killing that one. From what I can tell this
tool zeroes out the entire block, so Btrfsck can simply delete the reference
and forget it, rather than repeatedly trying to figure out solutions and
bailing out with "failed to repair damaged filesystem, aborting".

Depending on what was stored in it, you may have either no visible effect, or
a complete filesystem failure, or anything in between. Hence if you want to
experiment with this, find a way to work on writable overlay snapshots (also
described in the linked message).

> Here is the full run if that helps:
> https://pastebin.com/STMFHty4
> 
> > Thanks,
> > Marc
> > 
> > Rest:
> > On Tue, May 02, 2017 at 11:47:22AM -0700, Marc MERLIN wrote:
> > > (cc trimmed)
> > > 
> > > The one in debian/unstable crashed:
> > > gargamel:~# btrfs --version
> > > btrfs-progs v4.7.3
> > > gargamel:~# btrfs check --repair /dev/mapper/dshelf2
> > > bytenr mismatch, want=2899180224512, have=3981076597540270796
> > > extent-tree.c:2721: alloc_reserved_tree_block: Assertion `ret` failed.
> > > btrfs[0x43e418]
> > > btrfs[0x43e43f]
> > > btrfs[0x43f276]
> > > btrfs[0x43f46f]
> > > btrfs[0x4407ef]
> > > btrfs[0x440963]
> > > btrfs(btrfs_inc_extent_ref+0x513)[0x44107a]
> > > btrfs[0x420053]
> > > btrfs[0x4265eb]
> > > btrfs(cmd_check+0x)[0x427d6d]
> > > btrfs(main+0x12f)[0x40a341]
> > > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f6b632e82b1]
> > > btrfs(_start+0x2a)[0x40a37a]
> > > 
> > > Ok, it's old, let's take git from today:
> > > gargamel:~# btrfs --version
> > > btrfs-progs v4.10.2
> > > As a note, 
> > > gargamel:~# btrfs check --mode=lowmem --repair /dev/mapper/dshelf2
> > > enabling repair mode
> > > ERROR: low memory mode doesn't support repair yet
> > > 
> > > As a note, a 32bit binary on a 64bit kernel:
> > > gargamel:~# btrfs check --repair /dev/mapper/dshelf2
> > > enabling repair mode
> > > Checking filesystem on /dev/mapper/dshelf2
> > > UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653
> > > checking extents
> > > checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> > > checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> > > checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
> > > checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> > > bytenr mismatch, want=2899180224512, have=3981076597540270796
> > > checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
> > > checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
> > > checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
> > > checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
> > > parent transid verify failed on 1671538819072 wanted 293964 found 293902
> > > parent transid verify failed on 1671538819072 wanted 293964 found 293902
> > > checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
> > > checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
> > > cmds-check.c:6291: add_data_backref: BUG_ON `!back` triggered, value 1
> > > Aborted
> > > 
> > > let's try again with a 64bit binary built from git:
> > > (...)
> > > Repaired extent references for 4227617038336
> > > ref mismatch on [4227872751616 4096] extent item 1, found 0
> > > Incorrect local backref count on 4227872751616 parent 3493071667200 owner > > > 0
> > > offset 0 found 0 wanted 1 back 0x56470b18e7f0  
> > > Backref disk bytenr does not match extent record, bytenr=4227872751616, 
> > > ref
> > > bytenr=0
> > > backpointer mismatch on [4227872751616 4096]
> > > owner ref check failed [4227872751616 4096]
> > > repair deleting extent record: key 4227872751616 168 4096
> > > checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> > > checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> > > checksum 

Re: [PATCH 3/3] Make max_size consistent with nr

2017-04-28 Thread Roman Mamedov
On Fri, 28 Apr 2017 11:13:36 +0200
Christophe de Dinechin  wrote:

> Since we memset tmpl, max_size==0. This does not seem consistent with nr = 1.
> In check_extent_refs, we will call:
> 
>   set_extent_dirty(root->fs_info->excluded_extents,
>rec->start,
>rec->start + rec->max_size - 1);
> 
> This ends up with BUG_ON(end < start) in insert_state.
> 
> Signed-off-by: Christophe de Dinechin 
> ---
>  cmds-check.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/cmds-check.c b/cmds-check.c
> index 58e65d6..774e9b6 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -6193,6 +6193,7 @@ static int add_tree_backref(struct cache_tree 
> *extent_cache, u64 bytenr,
>   tmpl.start = bytenr;
>   tmpl.nr = 1;
>   tmpl.metadata = 1;
> +tmpl.max_size = 1;
>  
>   ret = add_extent_rec_nolookup(extent_cache, &tmpl);
>   if (ret)

The original code uses Tab characters for indent, but your addition uses
spaces. Also same problem in patch 2/3.

-- 
With respect,
Roman


Re: No space left on device when doing "mkdir"

2017-04-27 Thread Roman Mamedov
On Thu, 27 Apr 2017 08:52:30 -0500
Gerard Saraber  wrote:

> I could just reboot the system and be fine for a week or so, but is
> there any way to diagnose this?

`btrfs fi df` for a start.

Also obligatory questions: do you have a lot of snapshots, and do you use
qgroups?
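
That is, something like (the mount point is just an example):

  btrfs fi df /mountpoint
  btrfs subvolume list /mountpoint | wc -l
  btrfs qgroup show /mountpoint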

-- 
With respect,
Roman


Re: Btrfs/SSD

2017-04-17 Thread Roman Mamedov
On Tue, 18 Apr 2017 03:23:13 + (UTC)
Duncan <1i5t5.dun...@cox.net> wrote:

> Without reading the links...
> 
> Are you /sure/ it's /all/ ssds currently on the market?  Or are you 
> thinking narrowly, those actually sold as ssds?
> 
> Because all I've read (and I admit I may not actually be current, but...) 
> on for instance sd cards, certainly ssds by definition, says they're 
> still very write-cycle sensitive -- very simple FTL with little FTL wear-
> leveling.
> 
> And AFAIK, USB thumb drives tend to be in the middle, moderately complex 
> FTL with some, somewhat simplistic, wear-leveling.
> 

If I have to clarify, yes, it's all about SATA and NVMe SSDs. SD cards may be
SSDs "by definition", but nobody will think of an SD card when you say "I
bought an SSD for my computer". And yes, SD card and USB flash sticks are
commonly understood to be much simpler and more brittle devices than full
blown desktop (not to mention server) SSDs.

> While the stuff actually marketed as SSDs, generally SATA or direct PCIE/
> NVME connected, may indeed match your argument, no real end-user concern 
> necessary any more as the FTLs are advanced enough that user or 
> filesystem level write-cycle concerns simply aren't necessary these days.
> 
> 
> So does that claim that write-cycle concerns simply don't apply to modern 
> ssds, also apply to common thumb drives and sd cards?  Because these are 
> certainly ssds both technically and by btrfs standards.
> 


-- 
With respect,
Roman


Re: Btrfs/SSD

2017-04-17 Thread Roman Mamedov
On Mon, 17 Apr 2017 07:53:04 -0400
"Austin S. Hemmelgarn"  wrote:

> General info (not BTRFS specific):
> * Based on SMART attributes and other factors, current life expectancy 
> for light usage (normal desktop usage) appears to be somewhere around 
> 8-12 years depending on specifics of usage (assuming the same workload, 
> F2FS is at the very top of the range, BTRFS and NILFS2 are on the upper 
> end, XFS is roughly in the middle, ext4 and NTFS are on the low end 
> (tested using Windows 7's NTFS driver), and FAT32 is an outlier at the 
> bottom of the barrel).

Life expectancy for an SSD is defined not in years, but in TBW (terabytes
written), and AFAICT that's not "from host", but "to flash" (some SSDs will
show you both values in two separate SMART attributes out of the box, on some
it can be unlocked). Filesystems come into play only through the amount of write
amplification (WA) they cause (how much "to flash" is greater than "from host").
Do you have any test data to show that FSes are ranked in that order by the WA
they cause, or is it all about "general feel" and how they are branded (F2FS
says so, so it must be the best)?

> * Queued DISCARD support is still missing in most consumer SATA SSD's, 
> which in turn makes the trade-off on those between performance and 
> lifetime much sharper.

My choice was to make a script to run from crontab, using "fstrim" on all
mounted SSDs nightly, and aside from that all FSes are mounted with
"nodiscard". Best of the both worlds, and no interference with actual IO
operation.
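
Roughly a sketch like this, run nightly from cron (assuming findmnt is
available):

  #!/bin/sh
  # trim every mounted btrfs filesystem
  for m in $(findmnt -n -t btrfs -o TARGET); do
      fstrim "$m"
  done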

> * Modern (2015 and newer) SSD's seem to have better handling in the FTL 
> for the journaling behavior of filesystems like ext4 and XFS.  I'm not 
> sure if this is actually a result of the FTL being better, or some 
> change in the hardware.

Again, what makes you think this? Did you observe the write amplification
readings, and are those now demonstrably lower than on "2014 and older" SSDs?
So, by how much, and which models did you compare?

> * In my personal experience, Intel, Samsung, and Crucial appear to be 
> the best name brands (in relative order of quality).  I have personally 
> had bad experiences with SanDisk and Kingston SSD's, but I don't have 
> anything beyond circumstantial evidence indicating that it was anything 
> but bad luck on both counts.

Why not think in terms of platforms rather than "name brands", i.e. a controller
model + flash combination? For instance, Intel has been using some other
companies' controllers in their SSDs. Kingston uses tons of various
controllers (Sandforce/Phison/Marvell/more?) depending on the model and range.

> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt 
> performance for BTRFS on SSD's, and appear to reduce the lifetime of the 
> SSD.

"Appear to"? Just... what. So how many SSDs did you have fail under nocow?

Or maybe we can get serious in a technical discussion? Did you by any chance
mean that it causes more writes to the SSD and more "to flash" writes (resulting
in a higher WA)? If so, then by how much, and what was your test scenario
comparing the same usage with and without nocow?

> * Compression should help performance and device lifetime most of the 
> time, unless your CPU is fully utilized on a regular basis (in which 
> case it will hurt performance, but still improve device lifetimes).

The days when the end user had to think about device lifetimes with SSDs are
long gone. Refer to endurance studies such as:
http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead
http://ssdendurancetest.com/
https://3dnews.ru/938764/
It has been demonstrated that all SSDs on the market tend to overshoot even
their rated TBW by several times; as a result it will take any user literally
dozens of years to wear out the flash, no matter which filesystem or what
settings are used. And most certainly it's not worth changing anything
significant in your workflow (such as enabling compression if it's otherwise
inconvenient or not needed) just to save SSD lifetime.

On Mon, 17 Apr 2017 13:13:39 -0400
"Austin S. Hemmelgarn"  wrote:

> > What is a high end SSD these days? Built-in NVMe?
> One with a good FTL in the firmware.  At minimum, the good Samsung EVO 
> drives, the high quality Intel ones

As opposed to bad Samsung EVO drives and low-quality Intel ones?

> and the Crucial MX series, but 
> probably some others.  My choice of words here probably wasn't the best 
> though.

Again, which controller? Crucial does not manufacture SSD controllers on their
own, they just pack and brand stuff manufactured by someone else. So if you
meant Marvell based SSDs, then that's many brands, not just Crucial.

> For a normal filesystem or BTRFS with nodatacow or NOCOW, the block gets 
> rewritten in-place.  This means that cheap FTL's will rewrite that erase 
> block in-place (which won't hurt performance but will impact device 
> lifetime), and good ones will rewrite into a free 

Re: About free space fragmentation, metadata write amplification and (no)ssd

2017-04-09 Thread Roman Mamedov
On Sun, 9 Apr 2017 06:38:54 +
Paul Jones  wrote:

> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org 
> [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Hans van Kranenburg
> Sent: Sunday, 9 April 2017 6:19 AM
> To: linux-btrfs 
> Subject: About free space fragmentation, metadata write amplification and 
> (no)ssd
> 
> > So... today a real life story / btrfs use case example from the trenches at 
> > work...
> 
> Snip!!
> 
> Great read. I do the same thing for backups on a much smaller scale and it 
> works brilliantly.  Two 4T drives in btrfs raid1.
> I will mention that I recently setup caching using LLVM (1 x 300G ssd for 
> each 4T drive), and it's extraordinary how much of a difference it makes. 
> Especially when running deduplication. If it's feasible perhaps you could try 
> it with a nvme drive.

You mean LVM, not LLVM :)

I was actually going to suggest that as well, in my case I use a 32GB SSD
cache for my entire 14TB filesystem with 15 GB metadata (*2 in DUP). In fact
you should check the metadata size on yours, most likely you can get by with
an order of magnitude smaller cache for exactly the same benefit (and have the
rest of 2x300GB for other interesting uses).
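
For anyone curious, a minimal lvmcache setup sketch looks roughly like this
(VG/LV names, sizes and the SSD partition are just examples):

  lvcreate -L 30G -n cachedata vg /dev/sdb1
  lvcreate -L 1G  -n cachemeta vg /dev/sdb1
  lvconvert --type cache-pool --poolmetadata vg/cachemeta vg/cachedata
  lvconvert --type cache --cachepool vg/cachedata --cachemode writethrough vg/bigfs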

And yeah it's amazing, especially when deleting old snapshots or doing
backups. In my case I back up the entire root FS from about 30 hosts, and keep
that in periodic snapshots for a month. Previously I would also stagger rsync
runs so that no more than 4 or 5 hosts get backed up at the same time (and
still there would be tons of trashing in seeks and iowait), now it's no
problem whatsoever.

The only issue that I have with this setup is you need to "cleanly close" the
cached LVM device on shutdown/reboot, and apparently there is no init script
in Debian that would do that (experimenting with adding some hacks, but no
success yet). So on every boot the entire cache is marked dirty and data is
being copied from cache to the actual storage, which takes some time, since
this appears to be done in a random IO pattern.

-- 
With respect,
Roman


Re: mix ssd and hdd in single volume

2017-04-03 Thread Roman Mamedov
On Mon, 3 Apr 2017 11:30:44 +0300
Marat Khalili  wrote:

> You may want to look here: https://www.synology.com/en-global/dsm/Btrfs 
> . Somebody forgot to tell Synology, which already supports btrfs in all 
> hardware-capable devices. I think Rubicon has been crossed in 
> 'mass-market NAS[es]', for good or not.

AFAIR Synology did not come to this list asking for (any kind of) advice
prior to implementing that (else they would have gotten the same kind of post
from Duncan and others), and it's not Btrfs developers job to have an outreach
program to contact vendors and educate them to not use Btrfs.

I don't remember seeing them actively contribute improvements or fixes
especially for the RAID5 or RAID6 features (which they ADVERTISE on that page
as a fully working part of their product). That doesn't seem honest to end
users or playing nicely with the upstream developers. What the upstream gets
instead is just those end-users coming here one by one some years later,
asking how to fix a broken Btrfs RAID5 on an embedded box running some 3.10 or
3.14 kernel.

-- 
With respect,
Roman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is btrfs-convert able to deal with sparse files in a ext4 filesystem?

2017-04-03 Thread Roman Mamedov
On Sun, 2 Apr 2017 09:30:46 +0300
Andrei Borzenkov  wrote:

02.04.2017 03:59, Duncan wrote:
> > 
> > 4) In fact, since an in-place convert is almost certainly going to take 
> > more time than a blow-away and restore from backup,
> 
> This caught my eyes. Why? In-place convert just needs to recreate
> metadata. If you have multi-terabyte worth of data copying them twice
> hardly can be faster.

In-place convert is most certainly faster than copy-away and restore, in fact
it can be very fast if you use the option to not calculate checksums for the
entire filesystem's data (btrfs-convert -d).
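
That is, something along the lines of (device is just an example):

  btrfs-convert -d /dev/sdX1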

-- 
With respect,
Roman


Re: Qgroups are not applied when snapshotting a subvol?

2017-03-27 Thread Roman Mamedov
On Mon, 27 Mar 2017 13:32:47 -0600
Chris Murphy  wrote:

> How about if qgroups are enabled, then non-root user is prevented from
> creating new subvolumes?

That sounds like: if you turn your headlights on in a car, then the in-vehicle
air conditioner randomly stops working. :)

Two things only vaguely related from the end user's point of view.

> Or is there a way for a new nested subvolume to be included in its
> parent's quota, rather than the new subvolume having a whole new quota
> limit?

Either that, or a separate "allow non-root user subvolumes/snapshots creation"
mount option. There is already one for deletion, after all.

   user_subvol_rm_allowed
  Allow subvolumes to be deleted by a  non-root  user.   Use  with
  caution.

-- 
With respect,
Roman


Re: Shrinking a device - performance?

2017-03-27 Thread Roman Mamedov
On Mon, 27 Mar 2017 16:49:47 +0200
Christian Theune  wrote:

> Also: the idea of migrating on btrfs also has its downside - the performance 
> of “mkdir” and “fsync” is abysmal at the moment. I’m waiting for the current 
> shrinking job to finish but this is likely limited to the “find free space” 
> algorithm. We’re talking about a few megabytes converted per second. Sigh.

Btw since this is all on LVM already, you could set up lvmcache with a small
SSD-based cache volume. Even some old 60GB SSD would work wonders for
performance, and with the cache policy of "writethrough" you don't have to
worry about its reliability (much).

-- 
With respect,
Roman


Re: Shrinking a device - performance?

2017-03-27 Thread Roman Mamedov
On Mon, 27 Mar 2017 15:20:37 +0200
Christian Theune  wrote:

> (Background info: we’re migrating large volumes from btrfs to xfs and can
> only do this step by step: copying some data, shrinking the btrfs volume,
> extending the xfs volume, rinse repeat. If someone should have any
> suggestions to speed this up and not having to think in terms of _months_
> then I’m all ears.)

I would only suggest that you reconsider XFS. You can't shrink XFS, therefore
you won't have the flexibility to migrate in the same way to anything better
that comes along in the future (ZFS perhaps? or even Bcachefs?). XFS does not
perform that much better than Ext4, and very importantly, Ext4 can be shrunk.

From the looks of it Ext4 has also overcome its 16TB limitation:
http://askubuntu.com/questions/779754/how-do-i-resize-an-ext4-partition-beyond-the-16tb-limit

-- 
With respect,
Roman


Re: backing up a file server with many subvolumes

2017-03-26 Thread Roman Mamedov
On Sat, 25 Mar 2017 23:00:20 -0400
"J. Hart"  wrote:

> I have a Btrfs filesystem on a backup server.  This filesystem has a 
> directory to hold backups for filesystems from remote machines.  In this 
> directory is a subdirectory for each machine.  Under each machine 
> subdirectory is one directory for each filesystem (ex /boot, /home, etc) 
> on that machine.  In each filesystem subdirectory are incremental 
> snapshot subvolumes for that filesystem.  The scheme is something like 
> this:
> 
> /backup/<machine>/<filesystem>/<snapshot>
> 
> I'd like to try to back up (duplicate) the file server filesystem 
> containing these snapshot subvolumes for each remote machine.  The 
> problem is that I don't think I can use send/receive to do this. "Btrfs 
> send" requires "read-only" snapshots, and snapshots are not recursive as 
> yet.  I think there are too many subvolumes which change too often to 
> make doing this without recursion practical.

You could do time-based snapshots on the top level (for /backup/), say,
every 6 hours, and keep those for e.g. a month. Then don't bother with any
other kind of subvolumes/snapshots on the backup machine, and do backups from
remote machines into their respective subdirectories using simple 'rsync'.
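
A rough sketch of that, from cron (paths, host and schedule are just examples;
assuming /backup is a subvolume and /snapshots is a directory on the same FS):

  # every 6 hours: read-only snapshot of the whole backup area
  btrfs subvolume snapshot -r /backup /snapshots/backup-$(date +%Y%m%d-%H%M)
  # per host: plain rsync into a plain directory
  rsync -aHAX --delete remotehost:/home/ /backup/remotehost/home/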

That's what a sensible scheme looks like IMO, as opposed to a Btrfs-induced
exercise in futility that you have (there are subvolumes? must use them for
everything, even the frigging /boot/; there is send/receive? absolutely must
use it for backing up; etc.)

-- 
With respect,
Roman


Re: help : "bad tree block start" -> btrfs forced readonly

2017-03-17 Thread Roman Mamedov
On Fri, 17 Mar 2017 10:27:11 +0100
Lionel Bouton  wrote:

> Hi,
> 
> Le 17/03/2017 à 09:43, Hans van Kranenburg a écrit :
> > btrfs-debug-tree -b 3415463870464
> 
> Here is what it gives me back :
> 
> btrfs-debug-tree -b 3415463870464 /dev/sdb
> btrfs-progs v4.6.1
> checksum verify failed on 3415463870464 found A85405B7 wanted 01010101
> checksum verify failed on 3415463870464 found A85405B7 wanted 01010101
> bytenr mismatch, want=3415463870464, have=72340172838076673
> ERROR: failed to read 3415463870464
> 
> Is there a way to remove part of the tree and keep the rest ? It could
> help minimize the time needed to restore data.

If you are able to experiment with writable snapshots, you could try using
"btrfs-corrupt-block" to kill the bad block, and see what btrfsck makes out of
the rest. In a similar case I got little to no damage to the overall FS.
http://www.spinics.net/lists/linux-btrfs/msg53061.html

-- 
With respect,
Roman


Re: 4.10/4.11 Experiences

2017-02-17 Thread Roman Mamedov
On Thu, 16 Feb 2017 13:37:53 +0200
Imran Geriskovan  wrote:

> What are your experiences for btrfs regarding 4.10 and 4.11 kernels?
> I'm still on 4.8.x. I'd be happy to hear from anyone using 4.1x for
> a very typical single disk setup. Are they reasonably stable/good
> enough for this case?

You should always check with https://www.kernel.org/ what are the current
versions and what is their status. As you can see, 4.8 is basically dead in
the water, nowhere seen on the website, it does not get any updates anymore
by the kernel devs. If yours is a distro kernel, you now have to rely on
whatever fixes (and with what kind of quality) the distro maintainers are able
to backport.

Personally I took a liking to always running the latest longterm series, i.e.
right now staying on 4.4 and after a few initial hiccups it appears rock-solid
Btrfs-wise (as you said for single device, no multi-devices, no qgroup etc).

I'd suggest that you either upgrade to 4.9 (from the news it appears that one
will be granted the next longterm series status), or switch to 4.4, which may
or may not be less preferable, given there are some scary sounding reports
about 4.9 (if you have this list's archive, search for "4.9" in thread
titles) with little to no conclusive resolutions.

-- 
With respect,
Roman


Re: Unexpected behavior involving file attributes and snapshots.

2017-02-14 Thread Roman Mamedov
On Tue, 14 Feb 2017 10:30:43 -0500
"Austin S. Hemmelgarn"  wrote:

> I was just experimenting with snapshots on 4.9.0, and came across some 
> unexpected behavior.
> 
> The simple explanation is that if you snapshot a subvolume, any files in 
> the subvolume that have the NOCOW attribute will not have that attribute 
> in the snapshot.  Some further testing indicates that this is the only 
> file attribute that isn't preserved (I checked all the chattr flags that 
> BTRFS supports).
> 
> I'm kind of curious whether:
> 1. This is actually documented somewhere, as it's somewhat unexpected 
> given that everything else is preserved when snapshotting.
> 2. This is intended behavior, or just happens to be a side effect of the 
> implementation.

I don't seem to get this on 4.4.45 and 4.4.47.

$ btrfs sub create test
Create subvolume './test'
$ touch test/abc
$ chattr +C test/abc 
$ echo def > test/abc
$ ls -la test/abc 
-rw-r--r-- 1 rm rm 4 Feb 14 20:52 test/abc
$ lsattr test/abc 
---C test/abc
$ btrfs sub snap test test2
Create a snapshot of 'test' in './test2'
$ ls -la test2/abc 
-rw-r--r-- 1 rm rm 4 Feb 14 20:52 test2/abc
$ lsattr test2/abc 
---C test2/abc

-- 
With respect,
Roman


Re: BTRFS for OLTP Databases

2017-02-07 Thread Roman Mamedov
On Tue, 7 Feb 2017 09:13:25 -0500
Peter Zaitsev  wrote:

> Hi Hugo,
> 
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time.  Imagine  for example snapshot being created every
> hour and several of these snapshots  kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
> think you also describe)  nodatacow  does not provide any advantage.

It still does provide some advantage: each write into a new area since the
last hourly snapshot is going to be CoW'ed only once, as opposed to every new
write getting CoW'ed every time no matter what.

I'm not sold on autodefrag; what I'd suggest instead is to schedule a regular
defrag ("btrfs fi defrag") of the database files, e.g. daily. This may increase
space usage temporarily, as it will partially unmerge extents previously shared
across snapshots, but you won't get runaway fragmentation anymore, as you
would without nodatacow or with periodic snapshotting.
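
A minimal sketch of such a schedule (the path and target extent size are just
examples):

  # /etc/cron.daily/defrag-db
  #!/bin/sh
  btrfs filesystem defragment -r -t 32M /var/lib/mysql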

-- 
With respect,
Roman


Re: Is it possible to have metadata-only device with no data?

2017-02-05 Thread Roman Mamedov
On Sun, 5 Feb 2017 22:55:42 +0100
Hans van Kranenburg  wrote:

> On 02/05/2017 10:42 PM, Alexander Tomokhov wrote:
> > Is it possible, having two drives to do raid1 for metadata but keep data on 
> > a single drive only?
> 
> Nope.
> 
> Would be a really nice feature though... Putting metadata on SSD and
> bulk data on HDD...
> 

You can play around with this hack just to see how that would perform, but it
comes with no warranty and is untested even by me. I was going to try it, but
put it on hold, since you'd also need to make sure the SSD (and not the HDD) is
preferred for metadata reads, and so far I have not figured out a simple way of
ensuring that.

--- linux-amd64-4.4/fs/btrfs/volumes.c.orig  2016-11-01 22:41:41.970978721 +0500
+++ linux-amd64-4.4/fs/btrfs/volumes.c       2016-11-01 22:58:45.958977731 +0500
@@ -4597,6 +4597,14 @@
 		if (total_avail == 0)
 			continue;
 
+		/* If we have two devices and one is less than 25% of the total FS size,
+		 * then presumably it's a small device just for metadata RAID1, don't use
+		 * it for new data chunks. */
+		if ((fs_devices->num_devices == 2) &&
+		    (device->total_bytes * 4 < fs_devices->total_rw_bytes) &&
+		    (type & BTRFS_BLOCK_GROUP_DATA))
+			continue;
+
 		ret = find_free_dev_extent(trans, device,
 					   max_stripe_size * dev_stripes,
 					   &dev_offset, &max_avail);


-- 
With respect,
Roman


Re: RAID5: btrfs rescue chunk-recover segfaults.

2017-01-23 Thread Roman Mamedov
On Mon, 23 Jan 2017 14:15:55 +0100
Simon Waid  wrote:

> I have a btrfs raid5 array that has become unmountable.

That's the third time you send this today. Will you keep resending every few
hours until you get a reply? That's not how mailing lists work.

-- 
With respect,
Roman


Re: dup vs raid1 in single disk

2017-01-19 Thread Roman Mamedov
On Thu, 19 Jan 2017 17:39:37 +0100
"Alejandro R. Mosteo"  wrote:

> I was wondering, from a point of view of data safety, if there is any
> difference between using dup or making a raid1 from two partitions in
> the same disk. This is thinking on having some protection against the
> typical aging HDD that starts to have bad sectors.

RAID1 will write slower compared to DUP, as any optimization to make RAID1
devices work in parallel will cause a total performance disaster for you as
you will start trying to write to both partitions at the same time, turning
all linear writes into random ones, which are about two orders of magnitude
slower than linear on spinning hard drives. DUP shouldn't have this issue, but
still it will be twice slower than single, since you are writing everything
twice.

You could consider DUP data for when a disk is already known to be getting bad
sectors from time to time -- but then it's a fringe exercise to try and keep
using such a disk in the first place. Yeah, with DUP data and DUP metadata you
can likely get some more life out of such a disk as throwaway storage space for
non-essential data, at half capacity, but is it worth the effort, given that
it's likely to keep failing progressively worse over time?

In all other cases the performance and storage space penalty of DUP within a
single device are way too great (and gained redundancy is too low) compared
to a proper system of single profile data + backups, or a RAID5/6 system (not
Btrfs-based) + backups.

> On a related note, I see this caveat about dup in the manpage:
> 
> "For example, a SSD drive can remap the blocks internally to a single
> copy thus deduplicating them. This negates the purpose of increased
> redunancy (sic) and just wastes space"

That ability is vastly overestimated in the man page. There is no miracle
content-addressable storage system working at 500 MB/sec speeds all within a
little cheap controller on SSDs. Likely most of what it can do, is just
compress simple stuff, such as runs of zeroes or other repeating byte
sequences.

And the DUP mode is still useful on SSDs, for cases when one copy of the DUP
gets corrupted in-flight due to a bad controller or RAM or cable, you could
then restore that block from its good-CRC DUP copy.

-- 
With respect,
Roman


Re: Can't add/replace a device on degraded filesystem

2016-12-31 Thread Roman Mamedov
On Thu, 29 Dec 2016 19:27:30 -0500
Rich Gannon  wrote:

> I can mount my filesystem with -o degraded, but I can not do btrfs 
> replace or btrfs device add as the filesystem is in read-only mode, and 
> I can not mount read-write.

You can try my patch which removes that limitation 
https://patchwork.kernel.org/patch/9419189/

Also as Duncan said there's a "more proper" patch to fix this in the works
somewhere, which does a per-chunk check for degraded, and would also allow to
mount the FS read-write in your case.

-- 
With respect,
Roman


Re: problems with btrfs filesystem loading

2016-12-29 Thread Roman Mamedov
On Thu, 29 Dec 2016 16:42:09 +0100
Michał Zegan  wrote:

> I have odroid c2, processor architecture aarch64, linux kernel from
> master as of today from http://github.com/torwalds/linux.git.
> It seems that the btrfs module cannot be loaded. The only thing that
> happens is that after modprobe i see:
> modprobe: can't load module btrfs (kernel/fs/btrfs/btrfs.ko.gz): unknown
> symbol in module, or unknown parameter
> No errors in dmesg, like I have ignore_loglevel in kernel cmdline and no
> logs in console appear except logs for loading dependencies like xor
> module, but that is probably not important.
> The kernel has been recompiled few minutes ago from scratch, the only
> thing left was .config file. What is that? other modules load correctly
> from what I can see.

In the past there's been some trouble with crc32 dependencies:
https://www.spinics.net/lists/linux-btrfs/msg32104.html
Not sure if that's relevant anymore, but in any case, check if you have
crc32-related stuff either built-in or compiled as modules, if latter, try
loading those before btrfs (/lib/modules/*/kernel/crypto/crc32*)
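
For example (treat this as a sketch, the exact module names depend on the
kernel config):

  ls /lib/modules/$(uname -r)/kernel/crypto/ | grep -i crc32
  modprobe crc32c
  modprobe btrfs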

-- 
With respect,
Roman


Re: Convert from RAID 5 to 10

2016-11-30 Thread Roman Mamedov
On Wed, 30 Nov 2016 07:50:17 -0500
"Austin S. Hemmelgarn"  wrote:

> > *) Read performance is not optimized: all metadata is always read from the
> > first device unless it has failed, data reads are supposedly balanced 
> > between
> > devices per PID of the process reading. Better implementations dispatch 
> > reads
> > per request to devices that are currently idle.
> Based on what I've seen, the metadata reads get balanced too.

https://github.com/torvalds/linux/blob/v4.8/fs/btrfs/disk-io.c#L451
This starts from the mirror number 0 and tries others in an incrementing
order, until succeeds. It appears that as long as the mirror with copy #0 is up
and not corrupted, all reads will simply get satisfied from it.

> > *) Write performance is not optimized, during long full bandwidth sequential
> > writes it is common to see devices writing not in parallel, but with a long
> > periods of just one device writing, then another. (Admittedly have been some
> > time since I tested that).
> I've never seen this be an issue in practice, especially if you're using 
> transparent compression (which caps extent size, and therefore I/O size 
> to a given device, at 128k).  I'm also sane enough that I'm not doing 
> bulk streaming writes to traditional HDD's or fully saturating the 
> bandwidth on my SSD's (you should be over-provisioning whenever 
> possible).  For a desktop user, unless you're doing real-time video 
> recording at higher than HD resolution with high quality surround sound, 
> this probably isn't going to hit you (and even then you should be 
> recording to a temporary location with much faster write speeds (tmpfs 
> or ext4 without a journal for example) because you'll likely get hit 
> with fragmentation).

I did not use compression while observing this;

Also I don't know what is particularly insane about copying a 4-8 GB file onto
a storage array. I'd expect both disks to write at the same time (like they
do in pretty much any other RAID1 system), not one-after-another, effectively
slowing down the entire operation by as much as 2x in extreme cases.

> As far as not mounting degraded by default, that's a conscious design 
> choice that isn't going to change.  There's a switch (adding 'degraded' 
> to the mount options) to enable this behavior per-mount, so we're still 
> on-par in that respect with LVM and MD, we just picked a different 
> default.  In this case, I actually feel it's a better default for most 
> cases, because most regular users aren't doing exhaustive monitoring, 
> and thus are not likely to notice the filesystem being mounted degraded 
> until it's far too late.  If the filesystem is degraded, then 
> _something_ has happened that the user needs to know about, and until 
> some sane monitoring solution is implemented, the easiest way to ensure 
> this is to refuse to mount.

The easiest would be to write to dmesg and syslog; if a user doesn't monitor
those either, it's their own fault. The more user-friendly option would be to
still auto-mount degraded, but read-only.

Comparing to Ext4, that one appears to have the "errors=continue" behavior by
default, the user has to explicitly request "errors=remount-ro", and I have
never seen anyone use or recommend the third option of "errors=panic", which
is basically the equivalent of the current Btrfs practice.

> > *) It does not properly handle a device disappearing during operation. 
> > (There
> > is a patchset to add that).
> >
> > *) It does not properly handle said device returning (under a
> > different /dev/sdX name, for bonus points).
> These are not an easy problem to fix completely, especially considering 
> that the device is currently guaranteed to reappear under a different 
> name because BTRFS will still have an open reference on the original 
> device name.
> 
> On top of that, if you've got hardware that's doing this without manual 
> intervention, you've got much bigger issues than how BTRFS reacts to it. 
>   No correctly working hardware should be doing this.

Unplugging and replugging a SATA cable of a RAID1 member should never put your
system at risk of massive filesystem corruption; you cannot say it
absolutely doesn't with the current implementation.

-- 
With respect,
Roman


Re: Convert from RAID 5 to 10

2016-11-29 Thread Roman Mamedov
On Wed, 30 Nov 2016 00:16:48 +0100
Wilson Meier  wrote:

> That said, btrfs shouldn't be used for other then raid1 as every other
> raid level has serious problems or at least doesn't work as the expected
> raid level (in terms of failure recovery).

RAID1 shouldn't be used either:

*) Read performance is not optimized: all metadata is always read from the
first device unless it has failed, data reads are supposedly balanced between
devices per PID of the process reading. Better implementations dispatch reads
per request to devices that are currently idle.

*) Write performance is not optimized, during long full bandwidth sequential
writes it is common to see devices writing not in parallel, but with a long
periods of just one device writing, then another. (Admittedly have been some
time since I tested that).

*) A degraded RAID1 won't mount by default.

If this was the root filesystem, the machine won't boot.

To mount it, you need to add the "degraded" mount option.
However you have exactly a single chance at that, you MUST restore the RAID to
non-degraded state while it's mounted during that session, since it won't ever
mount again in the r/w+degraded mode, and in r/o mode you can't perform any
operations on the filesystem, including adding/removing devices (a sketch of
this one-shot recovery sequence follows after this list).

*) It does not properly handle a device disappearing during operation. (There
is a patchset to add that).

*) It does not properly handle said device returning (under a
different /dev/sdX name, for bonus points).

Most of these also apply to all other RAID levels.
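
For the degraded-mount point above, the single-session recovery would look
roughly like this (device names and the devid of the missing disk are just
examples):

  mount -o degraded /dev/sdb1 /mnt
  # either replace the missing device in place (2 = devid of the missing disk):
  btrfs replace start 2 /dev/sdc1 /mnt
  # ...or add a new device and drop the missing one:
  btrfs device add /dev/sdc1 /mnt
  btrfs device delete missing /mnt
  # then convert back any "single" chunks created while degraded:
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt

All of this has to complete before the filesystem is unmounted again.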

-- 
With respect,
Roman


Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents

2016-11-28 Thread Roman Mamedov
On Mon, 28 Nov 2016 00:03:12 -0500
Zygo Blaxell  wrote:

> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8e3a5a2..b1314d6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6803,6 +6803,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
>   max_size = min_t(unsigned long, PAGE_SIZE, max_size);
>   ret = btrfs_decompress(compress_type, tmp, page,
>  extent_offset, inline_size, max_size);
> + WARN_ON(max_size > PAGE_SIZE);
> + if (max_size < PAGE_SIZE) {
> + char *map = kmap(page);
> + memset(map + max_size, 0, PAGE_SIZE - max_size);
> + kunmap(page);
> + }
>   kfree(tmp);
>   return ret;
>  }

Wasn't this already posted as:

btrfs: fix silent data corruption while reading compressed inline extents
https://patchwork.kernel.org/patch/9371971/

but you don't indicate that's a V2 or something, and in fact the patch seems
exactly the same, just the subject and commit message are entirely different.
Quite confusing.

-- 
With respect,
Roman


Re: mount option nodatacow for VMs on SSD?

2016-11-25 Thread Roman Mamedov
On Fri, 25 Nov 2016 12:01:37 + (UTC)
Duncan <1i5t5.dun...@cox.net> wrote:

> Obviously this can be a HUGE problem on spinning rust due to its seek times,
> a problem zero-seek-time ssds don't have

They are not strictly zero seek time either. Sure you don't have the issue of
moving the physical head around, but still, sequential reads are way faster
even on SSDs, compared to random reads. Somewhat typical result for a
consumer SSD:

           Sequential Read :   382.301 MB/s
          Sequential Write :   315.124 MB/s
         Random Read 512KB :   261.751 MB/s
        Random Write 512KB :   334.615 MB/s
    Random Read 4KB (QD=1) :    19.859 MB/s [  4848.5 IOPS]
   Random Write 4KB (QD=1) :    61.794 MB/s [ 15086.3 IOPS]
   Random Read 4KB (QD=32) :   132.415 MB/s [ 32327.9 IOPS]
  Random Write 4KB (QD=32) :   203.051 MB/s [ 49573.0 IOPS]

If you have tons of 4K fragments, reading them in can go as low as 20 MB/sec,
compared to 382 MB/sec if they were all in one piece.

-- 
With respect,
Roman


Re: My system mounts the wrong btrfs partition, from the wrong disk!

2016-11-25 Thread Roman Mamedov
On Fri, 25 Nov 2016 12:05:57 +0100
Niccolò Belli  wrote:

> This is something pretty unbelievable, so I had to repeat it several times 
> before finding the courage to actually post it to the mailing list :)
> 
> After dozens of data loss I don't trust my btrfs partition that much, so I 
> make a backup copy with dd weekly.

https://btrfs.wiki.kernel.org/index.php/Gotchas#Block-level_copies_of_devices

"don't make copies with dd."

-- 
With respect,
Roman


Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Roman Mamedov
On Wed, 16 Nov 2016 11:55:32 +0100
Martin Steigerwald  wrote:

> I do think that above kernel messages invite such a kind of interpretation
> tough. I took the "BTRFS: open_ctree failed" message as indicative to some
> structural issue with the filesystem.

For the reason as to why the writable mount didn't work, check "btrfs fi df"
for the filesystem to see if you have any "single" profile chunks on it: quite
likely you did already mount it "degraded,rw" in the past *once*, after which
those "single" chunks get created, and consequently it won't mount r/w anymore
(without lifting the restriction on the number of missing devices as proposed).
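
For illustration, the output to look for would be along these lines (the
numbers are made up):

  # btrfs fi df /mnt/backup
  Data, RAID1: total=120.00GiB, used=118.42GiB
  Data, single: total=1.00GiB, used=512.00MiB    <- leftover from a degraded,rw mount
  Metadata, RAID1: total=2.00GiB, used=1.10GiB
  System, RAID1: total=32.00MiB, used=16.00KiB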

-- 
With respect,
Roman


Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Roman Mamedov
On Wed, 16 Nov 2016 11:25:00 +0100
Martin Steigerwald  wrote:

> merkaba:~> mount -o degraded,clear_cache /dev/satafp1/backup /mnt/zeit
> mount: wrong filesystem type, invalid options, the
> superblock of /dev/mapper/satafp1-backup is damaged, missing
> codepage, or some other error
> 
>   Sometimes the system log provides valuable information --
>   try  dmesg | tail  or similar
> merkaba:~#32> dmesg | tail -6
> [ 3080.120687] BTRFS info (device dm-13): allowing degraded mounts
> [ 3080.120699] BTRFS info (device dm-13): force clearing of disk cache
> [ 3080.120703] BTRFS info (device dm-13): disk space caching is enabled
> [ 3080.120706] BTRFS info (device dm-13): has skinny extents
> [ 3080.150957] BTRFS warning (device dm-13): missing devices (1) exceeds 
> the limit (0), writeable mount is not allowed
> [ 3080.195941] BTRFS: open_ctree failed

I have to wonder did you read the above message? What you need at this point
is simply "-o degraded,ro". But I don't see that tried anywhere down the line.

See also (or try): https://patchwork.kernel.org/patch/9419189/

-- 
With respect,
Roman


Re: when btrfs scrub reports errors and btrfs check --repair does not

2016-11-13 Thread Roman Mamedov
On Sun, 13 Nov 2016 07:06:30 -0800
Marc MERLIN  wrote:

> So first:
> a) find -inum returns some inodes that don't match
> b) but argh, multiple files (very different) have the same inode number, so 
> finding
> files by inode number after scrub flagged an inode bad, isn't going to work :(

I wonder why do you even need scrub to verify file readability. Just try
reading all files by using e.g. "cfv -Crr", the read errors produced will
point you directly to the files which are unreadable, without the need to look
them up backwards via inum. Then just restore those from backups.
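
If cfv is not at hand, simply forcing a read of every file works just as well
(a sketch; the path is an example):

  find /mnt/fs -type f -exec cat {} + > /dev/null

Any file with a bad checksum will show up as a read error from cat, along with
the usual "csum failed" lines in dmesg pointing at the same inode.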

-- 
With respect,
Roman


[RFC] [PATCH] Mounting "degraded,rw" should allow for any number of devices missing

2016-11-09 Thread Roman Mamedov
Hello,

Mounting "degraded,rw" should allow for any number of devices missing, as in
many cases the current check seems overly strict and not helpful during what
is already a manual recovery scenario. Let's assume the user applying the
"degraded" option knows best what condition their FS is in and what are the
next steps required to recover from the degraded state.

Specifically this would allow salvaging "JBOD-style" arrays of data=single
metadata=RAID1, if the user is ready to accept loss of data portions which
were on the removed drive. Currently, if one of the disks is removed, it is not
possible for such an array to be mounted rw at all -- hence it is not possible
to "dev delete missing", and the only solution is to recreate the FS.

Besides, I am currently testing a concept of SSD+HDD array with data=single
and metadata=RAID1, where the SSD is used for RAID1 metadata chunks only.
E.g. my 13 TB FS only has about 14 GB of metadata at the moment, so I could
comfortably use a spare 60GB SSD as a metadata-only device for it.
(Making all metadata reads prefer SSD could be the next step.)
It would be nice to be able to just lose/fail/forget that SSD, without having
to redo the entire FS. But again, since the remaining device has data=single,
currently it won't be write-mountable in the degraded state, even though the
missing device had only ever contained RAID1 chunks.

Maybe someone has other ideas how to solve the above scenarios?

Thanks

--- linux-amd64-4.4/fs/btrfs/disk-io.c.orig  2016-11-09 16:19:50.431117913 +0500
+++ linux-amd64-4.4/fs/btrfs/disk-io.c       2016-11-09 16:20:31.567117874 +0500
@@ -2992,7 +2992,8 @@
 		btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
 	if (fs_info->fs_devices->missing_devices >
 	     fs_info->num_tolerated_disk_barrier_failures &&
-	    !(sb->s_flags & MS_RDONLY)) {
+	    !(sb->s_flags & MS_RDONLY) &&
+	    !btrfs_raw_test_opt(fs_info->mount_opt, DEGRADED)) {
 		pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d), writeable mount is not allowed\n",
 			fs_info->fs_devices->missing_devices,
 			fs_info->num_tolerated_disk_barrier_failures);


Re: btrfs check --repair: ERROR: cannot read chunk root

2016-11-04 Thread Roman Mamedov
On Fri, 4 Nov 2016 01:01:13 -0700
Marc MERLIN  wrote:

> Basically I have this:
> sde                           8:64   0   3.7T  0
> └─sde1                        8:65   0   3.7T  0
>   └─md5                       9:5    0  14.6T  0
>     └─bcache0               252:0    0  14.6T  0
>       └─crypt_bcache0 (dm-0) 253:0   0  14.6T  0
> 
> I'll try dd'ing the md5 directly now, but that's going to take another 2 days 
> :(
> 
> That said, given that almost half the device is not readable from user space
> for some reason, that would explain why btrfs check is failing. Obviously it
> can't do its job if it can't read blocks.

I don't see anything to support the notion that "half is unreadable", maybe
just a 512-byte sector is unreadable -- but that would be enough to make
regular dd bail out -- which is why you should be using dd_rescue for this,
not regular dd. Assuming you just want to copy over as much data as possible,
and not simply test if dd fails or not (but in any case dd_rescue at least
would not fail instantly and would tell you precise count of how much is
unreadable).

There is "GNU ddrescue" and "dd_rescue", I liked the first one better, but
they both work on a similar principle.
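
A minimal GNU ddrescue run would be along these lines (device and mapfile
names are just examples):

  # first pass: copy everything that reads cleanly, skipping bad areas quickly
  ddrescue -f -n /dev/md5 /dev/sdX /root/md5.map
  # second pass: go back and retry the bad areas a few more times
  ddrescue -f -r3 /dev/md5 /dev/sdX /root/md5.map

The mapfile records what has been recovered, so runs can be interrupted and
resumed, and at the end it tells you exactly how much (if anything) remained
unreadable.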

Also, didn't you recently have issues with bad block lists on mdadm? This
mysterious "unreadable and nothing in dmesg" does appear to be a continuation
of that.

-- 
With respect,
Roman


Re: Is it possible to speed up unlink()?

2016-10-20 Thread Roman Mamedov
On Thu, 20 Oct 2016 08:09:14 -0400
"Austin S. Hemmelgarn"  wrote:

> > So, it's possible to return unlink() early? or this a bad idea(and why)?
> I may be completely off about this, but I could have sworn that unlink() 
> returns when enough info is on the disk that both:
> 1. The file isn't actually visible in the directory.
> 2. If the system crashes, the filesystem will know to finish the cleanup.

As I understand it there is no fundamental reason why rm of a heavily
fragmented file couldn't be exactly as fast as deleting a subvolume with
only that single file in it. Remove the directory reference and instantly
return success to userspace, continuing to clean up extents in the background.

However for many uses that could be counter-productive, as scripts might
expect the disk space to be freed up completely after the rm command returns
(as they might need to start filling up the partition with new data). 

In snapshot deletion there are various commit modes built in for that purpose,
but I'm not sure if you can easily extend POSIX file deletion to implement
synchronous and non-synchronous deletion modes.

* Try the 'unlink' program instead of 'rm'; if "just remove the dir entry for
  now" was implemented anywhere, I'd expect it to be via that.
* Try doing 'eatmydata rm', but that's more of a crazy idea than anything else,
  as eatmydata only affects fsyncs, and I don't think rm is necessarily
  invoking those.
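
For reference, the two would look like (the path is an example):

  unlink /path/to/huge-file        # coreutils unlink: just a single unlink() call
  eatmydata rm /path/to/huge-file  # LD_PRELOAD wrapper that turns fsync() into a no-op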

-- 
With respect,
Roman


Re: [PATCH 2/3] btrfs-progs: Add a command to show bg info

2016-10-17 Thread Roman Mamedov
On Tue, 18 Oct 2016 09:39:32 +0800
Qu Wenruo  wrote:

> >  static const char * const cmd_inspect_inode_resolve_usage[] = {
> > "btrfs inspect-internal inode-resolve [-v]  ",
> > "Get file system paths for the given inode",
> > @@ -702,6 +814,8 @@ const struct cmd_group inspect_cmd_group = {
> > 0 },
> > { "min-dev-size", cmd_inspect_min_dev_size,
> > cmd_inspect_min_dev_size_usage, NULL, 0 },
> > +   { "bg_analysis", cmd_inspect_bg_analysis,
> > +   cmd_inspect_bg_analysis_usage, NULL, 0 },
> 
> Just naming preference, IMHO show-block-groups or dump-block-groups 
> seems better for me.

And in any case, please don't mix "-" and "_" as word separators in the same
command string. In the btrfs tool, the convention is to separate words in
subcommand names using "-".

-- 
With respect,
Roman


Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Roman Mamedov
On Wed, 12 Oct 2016 15:19:16 -0400
Zygo Blaxell  wrote:

> I'm not even sure btrfs does this--I haven't checked precisely what
> it does in dup mode.  It could send both copies of metadata to the
> disks with a single barrier to separate both metadata updates from
> the superblock updates.  That would be bad in this particular case.

It would be bad in any case, including a single physical disk and no RAID, and
I don't think there's any basis to speculate that mdadm doesn't implement
write barriers properly.

> In degraded RAID5/6 mode, all writes temporarily corrupt data, so if there
> is an interruption (system crash, a disk times out, etc) in degraded mode,

Moreover, in any non-COW system writes temporarily corrupt data. So again,
writing to a (degraded or not) mdadm RAID5 is not much different than writing
to a single physical disk. However I believe in the Btrfs case metadata is
always COW, so this particular problem may be not as relevant here in the
first place.

-- 
With respect,
Roman




Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Roman Mamedov
On Tue, 11 Oct 2016 17:58:22 -0600
Chris Murphy  wrote:

> But consider the identical scenario with md or LVM raid5, or any
> conventional hardware raid5. A scrub check simply reports a mismatch.
> It's unknown whether data or parity is bad, so the bad data strip is
> propagated upward to user space without error. On a scrub repair, the
> data strip is assumed to be good, and good parity is overwritten with
> bad.

That's why I love to use Btrfs on top of mdadm RAID5/6 -- combining a mature
and stable RAID implementation with Btrfs anti-corruption checksumming
"watchdog". In the case that you described, no silent corruption will occur,
as Btrfs will report an uncorrectable read error -- and I can just restore the
file in question from backups.


On Wed, 12 Oct 2016 00:37:19 -0400
Zygo Blaxell  wrote:

> A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a
> snowball's chance in hell of surviving a disk failure on a live array
> with only data losses.  This would work if mdadm and btrfs successfully
> arrange to have each dup copy of metadata updated separately, and one
> of the copies survives the raid5 write hole.  I've never tested this
> configuration, and I'd test the heck out of it before considering
> using it.

Not sure what you mean here: a non-fatal disk failure (i.e. one that is fully
compensated for by redundancy) is invisible to the upper layers on mdadm arrays.
They do not need to "arrange" anything, on such failure from the point of view
of Btrfs nothing whatsoever has happened to the /dev/mdX block device, it's
still perfectly and correctly readable and writable.

-- 
With respect,
Roman




Re: csum failed during copy/compare

2016-10-10 Thread Roman Mamedov
On Mon, 10 Oct 2016 10:44:39 +0100
Martin Dev  wrote:

> I work for system verification of SSDs and we've recently come up
> against an issue with BTRFS on Ubuntu 16.04

> This seems to be a recent change

...well, a change in what? 

If you really didn't change anything on your machines or in the process used,
there is no reason for anything to start breaking, other than obvious hardware
issues from age, etc. (likely not what's happening here).

So you most likely did change something yourself, and perhaps the change was
upgrading OS version, kernel version(!!!), or versions of software in general.

As such, the first suggestion would be go through the recent software updates
history, maybe even restore an OS image you used three months ago (if
available) and confirm that the problem doesn't occur there. After that it's a
process called bisecting, there are tools for that, but likely you don't even
need those yet, just carefully note when you got which upgrades, paying
highest attention to the kernel version, and note at which point the
corruptions start to occur.

> as the same process has been used for the last 2 years

-- 
With respect,
Roman



