Re: [PATCH] Fix typos

2018-11-28 Thread Brendan Hide




On 11/28/18 1:23 PM, Nikolay Borisov wrote:



On 28.11.18 г. 13:05 ч., Andrea Gelmini wrote:

Signed-off-by: Andrea Gelmini 
---






diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bab2f1983c07..babbd75d91d2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -104,7 +104,7 @@ static void __endio_write_update_ordered(struct inode *inode,
  
  /*

   * Cleanup all submitted ordered extents in specified range to handle errors
- * from the fill_dellaloc() callback.
+ * from the fill_delalloc() callback.


This is a pure whitespace fix which is generally frowned upon. What you
can do though, is replace 'fill_delalloc callback' with
'btrfs_run_delalloc_range' since the callback is gone already.


   *
   * NOTE: caller must ensure that when an error happens, it can not call
   * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
@@ -1831,7 +1831,7 @@ void btrfs_clear_delalloc_extent(struct inode *vfs_inode,





diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 410c7e007ba8..d7b6c2b09a0c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -892,7 +892,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
   *  7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
   *  8. If we were asked to remove a non-directory and victim isn't one - EISDIR.
   *  9. We can't remove a root or mountpoint.
- * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
+ * 10. We don't allow removal of NFS silly renamed files; it's handled by
   * nfs_async_unlink().
   */
  
@@ -3522,7 +3522,7 @@ static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 olen,

   false);
/*
 * If one of the inodes has dirty pages in the respective range or
-* ordered extents, we need to flush dellaloc and wait for all ordered
+* ordered extents, we need to flush delalloc and wait for all ordered


Just whitespace fix, drop it.





If the spelling is changed, surely that is not a whitespace fix?


Re: NVMe SSD + compression - benchmarking

2018-04-28 Thread Brendan Hide


On 04/28/2018 04:05 AM, Qu Wenruo wrote:



On 2018年04月28日 01:41, Brendan Hide wrote:

Hey, all

I'm following up on the queries I had last week since I have installed
the NVMe SSD into the PCI-e adapter. I'm having difficulty knowing
whether or not I'm doing these benchmarks correctly.

As a first test, I put together a 4.7GB .tar containing mostly
duplicated copies of the kernel source code (rather compressible).
Writing this to the SSD I was seeing repeatable numbers - but noted that
the new (supposedly faster) zstd compression is noticeably slower than
all other methods. Perhaps this is partly due to lack of
multi-threading? No matter, I did also notice a supposedly impossible
stat when there is no compression, in that it seems to be faster than
the PCI-E 2.0 bus theoretically can deliver:


I'd say the test method is more like real-world usage than a benchmark.
Moreover, copying the kernel source is not that good for compression, as
most of the files are smaller than 128K, which means they can't take
much advantage of the multi-threaded split based on 128K blocks.

And the kernel source consists of many small files, and btrfs is
really slow for metadata-heavy workloads.

I'd recommend starting with a simpler workload, then going step by step
towards more complex workloads.

A large-file sequential write with a large block size would be a nice
starting point, as it could take full advantage of multi-threaded compression.
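
Something along these lines would do (device path, mountpoint and source
file are only placeholders):

$ mount -o compress=zstd /dev/nvme0n1p1 /mnt/test
$ dd if=/ramdisk/kernel-sources.tar of=/mnt/test/kernel-sources.tar bs=1M conv=fsync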


Thanks, Qu

I did also test the folder tree, where I realised it is intense / far 
from a regular use-case. It gives far slower results, with zlib being the 
slowest. The source's average file size is near 13KiB. However, in the 
test where I gave the results below, the .tar is a large (4.7GB) 
single file - I'm not unpacking it at all.


Average results from source tree:
compression type / write speed / read speed
no / 0.29 GBps / 0.20 GBps
lzo / 0.21 GBps / 0.17 GBps
zstd / 0.13 GBps / 0.14 GBps
zlib / 0.06 GBps / 0.10 GBps

Average results from .tar:
compression type / write speed / read speed
no / 1.42 GBps / 2.79 GBps
lzo / 1.17 GBps / 2.04 GBps
zstd / 0.75 GBps / 1.97 GBps
zlib / 1.24 GBps / 2.07 GBps


Another piece of advice: if you really want super fast storage, and
there is plenty of memory, the brd module will be your best friend.
And on modern mainstream hardware, brd can provide performance over
1GiB/s:
$ sudo modprobe brd rd_nr=1 rd_size=2097152
$ LANG=C dd if=/dev/zero  bs=1M of=/dev/ram0  count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.45593 s, 1.5 GB/s


My real worry is that I'm currently reading at 2.79GB/s (see result 
above and below) without compression when my hardware *should* limit it 
to 2.0GB/s. This tells me either `sync` is not working or my benchmark 
method is flawed.
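
For reference, the back-of-the-envelope numbers for the link itself (from
the PCI-E spec, not measured here):

PCI-E 2.0: 5 GT/s per lane with 8b/10b encoding ~= 500 MB/s usable per lane
4 lanes x ~500 MB/s ~= 2.0 GB/s ceiling, before protocol overhead

So a sustained 2.79 GB/s read can really only be coming out of the page
cache, which fits the suspicion above that either the sync or the
cache-dropping isn't doing what the script expects.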



Thanks,
Qu



compression type / write speed / read speed (in GBps)
zlib / 1.24 / 2.07
lzo / 1.17 / 2.04
zstd / 0.75 / 1.97
no / 1.42 / 2.79

The SSD is PCI-E 3.0 4-lane capable and is connected to a PCI-E 2.0
16-lane slot. lspci -vv confirms it is using 4 lanes. This means its
peak throughput *should* be 2.0 GBps - but above you can see the average
read benchmark is 2.79GBps. :-/

The crude timing script I've put together does the following:
- Format the SSD anew with btrfs and no custom settings
- wait 180 seconds for possible hardware TRIM to settle (possibly
overkill since the SSD is new)
- Mount the fs using all defaults except for compression, which could be
of zlib, lzo, zstd, or no
- sync
- Drop all caches
- Time the following
  - Copy the file to the test fs (source is a ramdisk)
  - sync
- Drop all caches
- Time the following
  - Copy back from the test fs to ramdisk
  - sync
- unmount

I can see how, with compression, it *can* be faster than 2 GBps (though
it isn't). But I cannot see how having no compression could possibly be
faster than 2 GBps. :-/



NVMe SSD + compression - benchmarking

2018-04-27 Thread Brendan Hide

Hey, all

I'm following up on the queries I had last week since I have installed 
the NVMe SSD into the PCI-e adapter. I'm having difficulty knowing 
whether or not I'm doing these benchmarks correctly.


As a first test, I put together a 4.7GB .tar containing mostly 
duplicated copies of the kernel source code (rather compressible). 
Writing this to the SSD I was seeing repeatable numbers - but noted that 
the new (supposedly faster) zstd compression is noticeably slower than 
all other methods. Perhaps this is partly due to lack of 
multi-threading? No matter, I did also notice a supposedly impossible 
stat when there is no compression, in that it seems to be faster than 
the PCI-E 2.0 bus theoretically can deliver:


compression type / write speed / read speed (in GBps)
zlib / 1.24 / 2.07
lzo / 1.17 / 2.04
zstd / 0.75 / 1.97
no / 1.42 / 2.79

The SSD is PCI-E 3.0 4-lane capable and is connected to a PCI-E 2.0 
16-lane slot. lspci -vv confirms it is using 4 lanes. This means its 
peak throughput *should* be 2.0 GBps - but above you can see the average 
read benchmark is 2.79GBps. :-/


The crude timing script I've put together does the following (a rough sketch follows the list):
- Format the SSD anew with btrfs and no custom settings
- wait 180 seconds for possible hardware TRIM to settle (possibly 
overkill since the SSD is new)
- Mount the fs using all defaults except for compression, which could be 
of zlib, lzo, zstd, or no

- sync
- Drop all caches
- Time the following
 - Copy the file to the test fs (source is a ramdisk)
 - sync
- Drop all caches
- Time the following
 - Copy back from the test fs to ramdisk
 - sync
- unmount
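
A rough sketch of the above, with the device, mountpoint and test file as
placeholders (not the actual script):

mkfs.btrfs -f /dev/nvme0n1p1
sleep 180
mount -o compress=zstd /dev/nvme0n1p1 /mnt/test   # or compress=lzo / compress=zlib, or no compress option at all
sync; echo 3 > /proc/sys/vm/drop_caches
time sh -c 'cp /ramdisk/test.tar /mnt/test/ && sync'
echo 3 > /proc/sys/vm/drop_caches
time sh -c 'cp /mnt/test/test.tar /ramdisk/ && sync'
umount /mnt/test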

I can see how, with compression, it *can* be faster than 2 GBps (though 
it isn't). But I cannot see how having no compression could possibly be 
faster than 2 GBps. :-/


I can of course get more info if it'd help figure out this puzzle:

Kernel info:
Linux localhost.localdomain 4.16.3-1-vfio #1 SMP PREEMPT Sun Apr 22 
12:35:45 SAST 2018 x86_64 GNU/Linux
^ Close to the regular ArchLinux kernel - but with vfio, and compiled 
with -march=native. See https://aur.archlinux.org/pkgbase/linux-vfio/


CPU model:
model name: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

Motherboard model:
Product Name: Z68MA-G45 (MS-7676)

lspci output for the slot:
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe 
SSD Controller SM961/PM961

^ The disk id sans serial is Samsung_SSD_960_EVO_1TB

dmidecode output for the slot:
Handle 0x001E, DMI type 9, 17 bytes
System Slot Information
Designation: J8B4
Type: x16 PCI Express
Current Usage: In Use
Length: Long
ID: 4
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: :02:01.1


Re: nvme+btrfs+compression sensibility and benchmark

2018-04-18 Thread Brendan Hide

Thank you, all

Though the info is useful, there's not a clear consensus on what I 
should expect. For interest's sake, I'll post benchmarks from the device 
itself when it arrives.


I'm expecting at least that I'll be blown away :)

On 04/18/2018 09:23 PM, Chris Murphy wrote:



On Wed, Apr 18, 2018 at 10:38 AM, Austin S. Hemmelgarn wrote:


For reference, the zstd compression in BTRFS uses level 3 by default
(as does zlib compression IIRC), though I'm not sure about lzop (I
think it uses the lowest compression setting).



The user space tool, zstd, does default to 3, according to its man page.

    -# # compression level [1-19] (default: 3)


However, the kernel is claiming it's level 0, which doesn't exist in the 
man page. So I have no idea what we're using. This is what I get with 
mount option compress=zstd


[    4.097858] BTRFS info (device nvme0n1p9): use zstd compression, level 0




--
Chris Murphy



nvme+btrfs+compression sensibility and benchmark

2018-04-18 Thread Brendan Hide

Hi, all

I'm looking for some advice re compression with NVME. Compression helps 
performance with a minor CPU hit - but is it still worth it with the far 
higher throughputs offered by newer PCI and NVME-type SSDs?


I've ordered a PCIe-to-M.2 adapter along with a 1TB 960 Evo drive for my 
home desktop. I previously used compression on an older SATA-based Intel 
520 SSD, where compression made sense.


However, the wisdom isn't so clear-cut if the SSD is potentially faster 
than the compression algorithm with my CPU (aging i7 3770).


Testing using a copy of the kernel source tarball in tmpfs, it seems my 
system can compress/decompress at about 670MB/s using zstd with 8 
threads. lzop isn't that far behind. But I'm not sure if the benchmark 
I'm running is the same as how btrfs would be using it internally.
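
For what it's worth, a quick way to sanity-check the raw compressor
throughput on the same data is to time the user-space tools directly (the
level and thread count here are just examples, and these numbers are only
a rough upper bound since btrfs compresses in 128K chunks):

$ time zstd -3 -T8 -c linux-sources.tar > /dev/null
$ time lzop -c linux-sources.tar > /dev/null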


Given these numbers I'm inclined to believe compression will make things 
slower - but can't be sure without knowing if I'm testing correctly.


What is the best practice with benchmarking and with NVME/PCI storage?



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-03 Thread Brendan Hide



On 08/03/2017 09:22 PM, Austin S. Hemmelgarn wrote:

On 2017-08-03 14:29, Christoph Anton Mitterer wrote:

On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote:
There are no higher-level management tools (e.g. RAID
management/monitoring, etc.)...

[snip]

As far as 'higher-level' management tools go, you're using your system 
wrong if you _need_ them.  There is no need for a GUI, a web interface, 
a DBus interface, or any other such bloat in the main management tools; 
they work just fine as is and are mostly on par with the interfaces 
provided by LVM, MD, and ZFS (other than the lack of machine-parseable 
output).  I'd also argue that if you can't reassemble your storage stack 
by hand without using 'higher-level' tools, you should not be using that 
storage stack, as you don't properly understand it.


On the subject of monitoring specifically, part of the issue there is 
kernel side, any monitoring system currently needs to be polling-based, 
not event-based, and as a result monitoring tends to be a very system 
specific affair based on how much overhead you're willing to tolerate. 
The limited stuff that does exist is also trivial to integrate with many 
pieces of existing monitoring infrastructure (like Nagios or monit), and 
therefore the people who care about it a lot (like me) are either 
monitoring by hand, or are just using the tools with their existing 
infrastructure (for example, I use monit already on all my systems, so I 
just make sure to have entries in the config for that to check error 
counters and scrub results), so there's not much in the way of incentive 
for the concerned parties to reinvent the wheel.


To counter, I think this is a big problem with btrfs, especially in 
terms of user attrition. We don't need "GUI" tools. At all. But we do 
need btrfs to be self-sufficient enough that regular users don't get 
burnt by what they would view as unexpected behaviour.  We currently 
have a situation where btrfs is too demanding of inexperienced users.


I feel we need better worst-case behaviours. For example, if *I* have a 
btrfs on its second-to-last-available chunk, it means I'm not 
micro-managing properly. But users shouldn't have to micro-manage in the 
first place. Btrfs (or a management tool) should just know to balance 
the least-used chunk and/or delete the lowest-priority snapshot, etc. It 
shouldn't cause my services/apps to give diskspace errors when, clearly, 
there is free space available.


The other "high-level" aspect would be along the lines of better 
guidance and standardisation for distros on how best to configure btrfs. 
This would include guidance/best practices for things like appropriate 
subvolume mountpoints and snapshot paths, sensible schedules or logic 
(or perhaps even example tools/scripts) for balancing and scrubbing the 
filesystem.
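
As a strawman for the kind of guidance I mean, even something as small as
a distro-shipped cron entry would help (mountpoint, schedule and usage
thresholds below are only examples):

# /etc/cron.d/btrfs-maintenance - weekly scrub, monthly light balance
0 3 * * 0   root   btrfs scrub start -Bq /
0 4 1 * *   root   btrfs balance start -dusage=25 -musage=25 /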


I don't have all the answers. But I also don't want to have to tell 
people they can't adopt it because a) they don't (or never will) 
understand it; and b) they're going to resent me when they irresponsibly 
lose their own data.


--
______
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-02 Thread Brendan Hide
The title seems alarmist to me - and I suspect it is going to be 
misconstrued. :-/


From the release notes at 
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html


"Btrfs has been deprecated

The Btrfs file system has been in Technology Preview state since the 
initial release of Red Hat Enterprise Linux 6. Red Hat will not be 
moving Btrfs to a fully supported feature and it will be removed in a 
future major release of Red Hat Enterprise Linux.


The Btrfs file system did receive numerous updates from the upstream in 
Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat 
Enterprise Linux 7 series. However, this is the last planned update to 
this feature.


Red Hat will continue to invest in future technologies to address the 
use cases of our customers, specifically those related to snapshots, 
compression, NVRAM, and ease of use. We encourage feedback through your 
Red Hat representative on features and requirements you have for file 
systems and storage technology."





Re: RAID56 status?

2017-01-23 Thread Brendan Hide


Hey, all

Long-time lurker/commenter here. Production-ready RAID5/6 and N-way 
mirroring are the two features I've been anticipating most, so I've 
commented regularly when this sort of thing pops up. :)


I'm only addressing some of the RAID-types queries as Qu already has a 
handle on the rest.


Small-yet-important hint: If you don't have a backup of it, it isn't 
important.


On 01/23/2017 02:25 AM, Jan Vales wrote:

[ snip ]
Correct me, if im wrong...
* It seems, raid1(btrfs) is actually raid10, as there are no more than 2
copies of data, regardless of the count of devices.


The original "definition" of raid1 is two mirrored devices. The *nix 
industry standard implementation (mdadm) extends this to any number of 
mirrored devices. Thus confusion here is understandable.



** Is there a way to duplicate data n-times?


This is a planned feature, especially in lieu of feature-parity with 
mdadm, though the priority isn't particularly high right now. This has 
been referred to as "N-way mirroring". The last time I recall discussion 
over this, it was hoped to get work started on it after raid5/6 was stable.



** If there are only 3 devices and the wrong device dies... is it dead?


Qu has the right answers. Generally if you're using anything other than 
dup, raid0, or single, one disk failure is "okay". More than one failure 
is closer to "undefined". Except with RAID6, where you need to have more 
than two disk failures before you have lost data.



* Whats the diffrence of raid1(btrfs) and raid10(btrfs)?


Some nice illustrations from Qu there. :)


** After reading like 5 diffrent wiki pages, I understood, that there
are diffrences ... but not what they are and how they affect me :/
* Whats the diffrence of raid0(btrfs) and "normal" multi-device
operation which seems like a traditional raid0 to me?


raid0 stripes data in 64k chunks (I think this size is tunable) across 
all devices, which is generally far faster in terms of throughput in 
both writing and reading data.


By '"normal" multi-device' I will assume this means "single" with 
multiple devices. New writes with "single" will use a 1GB chunk on one 
device until the chunk is full, at which point it allocates a new chunk, 
which will usually be put on the disk with the most available free 
space. There is no particular optimisation in place comparable to raid0 
here.
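
To make the distinction concrete, the two are just different data profiles
at mkfs time (device names are placeholders):

mkfs.btrfs -d raid0 -m raid1 /dev/sdb /dev/sdc    # data striped across both devices
mkfs.btrfs -d single -m raid1 /dev/sdb /dev/sdc   # data chunks allocated to one device at a time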




Maybe rename/alias raid-levels that do not match traditional
raid-levels, so one cannot expect some behavior that is not there.



The extreme example is imho raid1(btrfs) vs raid1.
I would expect that if i have 5 btrfs-raid1-devices, 4 may die and btrfs
should be able to fully recover, which, if i understand correctly, by
far does not hold.
If you named that raid-level say "george" ... I would need to consult
the docs and I obviously would not expect any behavior. :)


We've discussed this a couple of times. Hugo came up with a notation 
since dubbed "csp" notation: c->Copies, s->Stripes, and p->Parities.


Examples of this would be:
raid1: 2c
3-way mirroring across 3 (or more*) devices: 3c
raid0 (2-or-more-devices): 2s
raid0 (3-or-more): 3s
raid5 (5-or-more): 4s1p
raid16 (12-or-more): 2c4s2p

* note the "or more": Mdadm *cannot* have fewer mirrors or stripes than 
devices, whereas there is no particular reason why btrfs won't be able 
to do this.


A minor problem with csp notation is that it implies a complete 
implementation of *any* combination of these, whereas the idea was 
simply to create a way to refer to the "raid" levels in a consistent way.


I hope this brings some clarity. :)



regards,
Jan Vales
--
I only read plaintext emails.



--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


Re: [RFC] Preliminary BTRFS Encryption

2016-09-16 Thread Brendan Hide
For the most part, I agree with you, especially about the strategy being 
backward - and file encryption being a viable more-easily-implementable 
direction.


However, you are doing yourself a disservice to compare btrfs' features 
as a "re-implementation" of existing tools. The existing tools cannot do 
what btrfs' devs want to implement. See below inline.


On 09/16/2016 03:12 AM, Dave Chinner wrote:

On Tue, Sep 13, 2016 at 09:39:46PM +0800, Anand Jain wrote:


This patchset adds btrfs encryption support.

The main objective of this series is to have bugs fixed and stability.
I have verified with fstests to confirm that there is no regression.

A design write-up is coming next, however here below is the quick example
on the cli usage. Please try out, let me know if I have missed something.


Yup, best practices say "do not roll your own encryption
infrastructure".


100% agreed



This is just my 2c worth - take it or leave it, don't bother flaming.
Keep in mind that I'm not picking on btrfs here - I asked similar
hard questions about the proposed f2fs encryption implementation.
That was a "copy and snowflake" version of the ext4 encryption code -
they made changes and now we have generic code and common
functionality between ext4 and f2fs.


Also would like to mention that a review from the security experts is due,
which is important and I believe those review comments can be accommodated
without major changes from here.


That's a fairly significant red flag to me - security reviews need
to be done at the design phase against specific threat models -
security review is not a code/implementation review...


Also agreed. This is a bit backward.



The ext4 developers got this right by publishing threat models and
design docs, which got quite a lot of review and feedback before
code was published for review.

https://docs.google.com/document/d/1ft26lUQyuSpiu6VleP70_npaWdRfXFoNnB8JYnykNTg/edit#heading=h.qmnirp22ipew

[small reorder of comments]


As of now this patch set supports encryption per subvolume, as
managing properties per subvolume is core to btrfs, which is
easier for data center solutioning, seamlessly persistent and easy to
manage.


We've got dmcrypt for this sort of transparent "device level"
encryption. Do we really need another btrfs layer that re-implements ...


[snip]
Woah, woah. This is partly addressed by Roman's reply - but ...

Subvolumes:
Subvolumes are not comparable to block devices. This thinking is flawed 
at best; cancerous at worst.


As a user I tend to think of subvolumes simply as directly-mountable 
folders.


As a sysadmin I also think of them as snapshottable/send-receiveable 
folders.


And as a dev I know they're actually not that different from regular 
folders. They have some extra metadata so aren't as lightweight - but of 
course they expose very useful flexibility not available in a regular 
folder.


MD/raid comparison:
In much the same way, comparing btrfs' raid features to md directly is 
also flawed. Btrfs even re-uses code in md to implement raid-type 
features in ways that md cannot.


I can't answer for the current raid5/6 stability issues - but I am 
confident that the overall design is good, and that it will be fixed.




The generic file encryption code is solid, reviewed, tested and
already widely deployed via two separate filesystems. There is a
much wider pool of developers who will maintain it, review changes
and know all the traps that a new implementation might fall into.
There's a much bigger safety net here, which significantly lowers
the risk of zero-day fatal flaws in a new implementation and of
flaws in future modifications and enhancements.

Hence, IMO, the first thing to do is implement and make the generic
file encryption support solid and robust, not tack it on as an
afterthought for the magic btrfs encryption pixies to take care of.

Indeed, with the generic file encryption, btrfs may not even need
the special subvolume encryption pixies. i.e. you can effectively
implement subvolume encryption via configuration of a multi-user
encryption key for each subvolume and apply it to the subvolume tree
root at creation time. Then only users with permission to unlock the
subvolume key can access it.

Once the generic file encryption is solid and fulfils the needs of
most users, then you can look to solving the less common threat
models that neither dmcrypt or per-file encryption address. Only if
the generic code cannot be expanded to address specific threat
models should you then implement something that is unique to
btrfs



Agreed, this sounds like a far safer and achievable implementation process.


Cheers,

Dave.



--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


Re: Allocator behaviour during device delete

2016-06-09 Thread Brendan Hide



On 06/09/2016 03:07 PM, Austin S. Hemmelgarn wrote:

On 2016-06-09 08:34, Brendan Hide wrote:

Hey, all

I noticed this odd behaviour while migrating from a 1TB spindle to SSD
(in this case on a LUKS-encrypted 200GB partition) - and am curious if
this behaviour I've noted below is expected or known. I figure it is a
bug. Depending on the situation, it *could* be severe. In my case it was
simply annoying.

---
Steps

After having added the new device (btrfs dev add), I deleted the old
device (btrfs dev del)

Then, whilst waiting for that to complete, I started a watch of "btrfs
fi show /". Note that the below is very close to the output at the time
- but is not actually copy/pasted from the output.


Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
Total devices 2 FS bytes used 115.03GiB
devid1 size 0.00GiB used 298.06GiB path /dev/sda2
devid2 size 200.88GiB used 0.00GiB path
/dev/mapper/cryptroot



devid1 is the old disk while devid2 is the new SSD

After a few minutes, I saw that the numbers have changed - but that the
SSD still had no data:


Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
Total devices 2 FS bytes used 115.03GiB
devid1 size 0.00GiB used 284.06GiB path /dev/sda2
devid2 size 200.88GiB used 0.00GiB path
/dev/mapper/cryptroot


The "FS bytes used" amount was changing a lot - but mostly stayed near
the original total, which is expected since there was very little
happening other than the "migration".

I'm not certain of the exact point where it started using the new disk's
space. I figure that may have been helpful to pinpoint. :-/

OK, I'm pretty sure I know what was going on in this case.  Your
assumption that device delete uses the balance code is correct, and that
is why you're seeing what you're seeing.  There are two key bits that
are missing though:
1. Balance will never allocate chunks when it doesn't need to.
2. The space usage listed in fi show is how much space is allocated to
chunks, not how much is used in those chunks.

In this case, based on what you've said, you had a lot of empty or
mostly empty chunks.  As a result of this, the device delete was both
copying data, and consolidating free space.  If you have a lot of empty
or mostly empty chunks, it's not unusual for a device delete to look
like this until you start hitting chunks that have actual data in them.
The primary point of this behavior is that it makes it possible to
directly switch to a smaller device without having to run a balance and
then a resize before replacing the device, and then resize again
afterwards.


Thanks, Austin. Your explanation is along the lines of my thinking though.

The new disk should have had *some* data written to it at that point, as 
it started out at over 600GiB in allocation (should have probably 
mentioned that already). Consolidating or not, I would consider data 
being written to the old disk to be a bug, even if it is considered minor.


I'll set up a reproducible test later today to prove/disprove the theory. :)
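
A rough outline of the repro I have in mind, using loop devices so it is
cheap to rerun (sizes and paths are arbitrary):

truncate -s 10G /tmp/old.img && truncate -s 4G /tmp/new.img
losetup /dev/loop0 /tmp/old.img && losetup /dev/loop1 /tmp/new.img
mkfs.btrfs -f /dev/loop0 && mount /dev/loop0 /mnt/test
# ... write a few GB, ideally leaving lots of mostly-empty chunks ...
btrfs device add /dev/loop1 /mnt/test
btrfs device delete /dev/loop0 /mnt/test &
watch -n 5 btrfs fi show /mnt/test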

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


Allocator behaviour during device delete

2016-06-09 Thread Brendan Hide

Hey, all

I noticed this odd behaviour while migrating from a 1TB spindle to SSD 
(in this case on a LUKS-encrypted 200GB partition) - and am curious if 
this behaviour I've noted below is expected or known. I figure it is a 
bug. Depending on the situation, it *could* be severe. In my case it was 
simply annoying.


---
Steps

After having added the new device (btrfs dev add), I deleted the old 
device (btrfs dev del)


Then, whilst waiting for that to complete, I started a watch of "btrfs 
fi show /". Note that the below is very close to the output at the time 
- but is not actually copy/pasted from the output.


> Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
> Total devices 2 FS bytes used 115.03GiB
> devid1 size 0.00GiB used 298.06GiB path /dev/sda2
> devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot


devid1 is the old disk while devid2 is the new SSD

After a few minutes, I saw that the numbers have changed - but that the 
SSD still had no data:


> Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
> Total devices 2 FS bytes used 115.03GiB
> devid1 size 0.00GiB used 284.06GiB path /dev/sda2
> devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot

The "FS bytes used" amount was changing a lot - but mostly stayed near 
the original total, which is expected since there was very little 
happening other than the "migration".


I'm not certain of the exact point where it started using the new disk's 
space. I figure that may have been helpful to pinpoint. :-/


---
Educated guess as to what was happening:

Key: Though the available space on devid1 is displayed as 0 GiB, 
internally the allocator still sees most of the device's space as 
available. The allocator will continue writing to the old disk even 
though the intention is to remove it.


The dev delete operation goes through the chunks in sequence and does a 
"normal" balance operation on each, which the kernel simply sends to the 
"normal" single allocator. At the start of the operation, the allocator 
will see that the device of 1TB has more space available than the 200GB 
device, thus it writes the data to a new chunk on the 1TB spindle.


Only after the chunk is balanced away, does the operation mark *only* 
that "source" chunk as being unavailable. As each chunk is subsequently 
balanced away, eventually the allocator will see that there is more 
space available on the new device than on the old device (1:199/2:200), 
thus the next chunk gets allocated to the new device. The same occurs 
for the next chunk (1:198/2:199) and so on, until the device finally has 
zero usage and is removed completely.


---
Naive approach for a fix (assuming my assessment above is correct)

At the start:
1. "Balance away"/Mark-as-Unavailable empty space
2. Balance away the *current* chunks (data+metadata) that would 
otherwise be written to if the device was still available

3. As before, balance in whatever order is applicable.

---
Severity

I figure that, for my use-case, this isn't a severe issue. However, in 
the case where you want quickly to remove a potentially failing disk 
(common use case for dev delete), I'd much rather that btrfs does *not* 
write data to the disk I'm trying to remove, making this a potentially 
severe bug.



--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


Re: RAID1 vs RAID10 and best way to set up 6 disks

2016-06-04 Thread Brendan Hide



On 06/03/16 20:59, Christoph Anton Mitterer wrote:

On Fri, 2016-06-03 at 13:42 -0500, Mitchell Fossen wrote:

Thanks for pointing that out, so if I'm thinking correctly, with
RAID1
it's just that there is a copy of the data somewhere on some other
drive.

With RAID10, there's still only 1 other copy, but the entire
"original"
disk is mirrored to another one, right?
As Justin mentioned, btrfs doesn't raid whole disks/devices. Instead, it 
works with chunks.




To be honest, I couldn't tell you for sure :-/ ... IMHO the btrfs
documentation has some "issues".

mkfs.btrfs(8) says: 2 copies for RAID10, so I'd assume it's just the
striped version of what btrfs - for whichever questionable reason -
calls "RAID1".


The "questionable reason" is simply the fact that it is, now as well as 
at the time the features were added, the closest existing terminology 
that best describes what it does. Even now, it would be difficult to 
adequately explain on the spot what it means for redundancy without also 
mentioning "RAID".


Btrfs does not raid disks/devices. It works with chunks that are 
allocated to devices when the previous chunk/chunk-set is full.


We're all very aware of the inherent problem of language - and have 
discussed various ways to address it. You will find that some on the 
list (but not everyone) are very careful to never call it "RAID" - but 
instead raid (very small difference, I know). Hugo Mills previously made 
headway in getting discussion and consensus of proper nomenclature. *



Especially, when you have an odd number devices (or devices with
different sizes), its not clear to me, personally, at all how far that
redundancy actually goes respectively what btrfs actually does... could
be that you have your 2 copies, but maybe on the same device then?


No, btrfs' raid1 naively guarantees that the two copies will *never* be 
on the same device. raid10 does the same thing - but in stripes on as 
many devices as possible.


The reason I say "naively" is that there is little to stop you from 
creating a 2-device "raid1" using two partitions on the same physical 
device. This is especially difficult to detect if you add abstraction 
layers (lvm, dm-crypt, etc). The same problem applies to mdadm as well, however.
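
As a contrived illustration of that pitfall (an example only - don't do
this with data you care about):

mkfs.btrfs -d raid1 -m raid1 /dev/sda1 /dev/sda2   # "two devices", but one physical disk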


Though it won't necessarily answer all questions about allocation, I 
strongly suggest checking out Hugo's btrfs calculator **


I hope this is helpful.

* http://comments.gmane.org/gmane.comp.file-systems.btrfs/34717 / 
https://www.spinics.net/lists/linux-btrfs/msg33742.html

* http://comments.gmane.org/gmane.comp.file-systems.btrfs/34792
** http://carfax.org.uk/btrfs-usage/




Cheers,
Chris.



--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


Re: USB memory sticks wear & speed: btrfs vs f2fs?

2016-02-09 Thread Brendan Hide

On 2/9/2016 1:13 PM, Martin wrote:

How does btrfs compare to f2fs for use on (128GByte) USB memory sticks?

Particularly for wearing out certain storage blocks?

Does btrfs heavily use particular storage blocks that will prematurely
"wear out"?

(That is, could the whole 128GBytes be lost due to one 4kByte block
having been re-written excessively too many times due to a fixed
repeatedly used filesystem block?)

Any other comparisons/thoughts for btrfs vs f2fs?
Copy-on-write (CoW) designs tend naturally to work well with flash 
media. F2fs is *specifically* designed to work well with flash, whereas 
for btrfs it is a natural consequence of the copy-on-write design. With 
both filesystems, if you randomly generate a 1GB file and delete it 1000 
times, onto a 1TB flash, you are *very* likely to get exactly one write 
to *every* block on the flash (possibly two writes to <1% of the blocks) 
rather than, as would be the case with non-CoW filesystems, 1000 writes 
to a small chunk of blocks.


I haven't found much reference or comparison information online wrt wear 
leveling - mostly performance benchmarks that don't really address your 
request. Personally I will likely never bother with f2fs unless I 
somehow end up working on a project requiring relatively small storage 
in Flash (as that is what f2fs was designed for).


If someone can provide or link to some proper comparison data, that 
would be nice. :)


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [RFC] Btrfs device and pool management (wip)

2015-12-01 Thread Brendan Hide

On 11/30/2015 11:09 PM, Chris Murphy wrote:

On Mon, Nov 30, 2015 at 1:37 PM, Austin S Hemmelgarn
<ahferro...@gmail.com> wrote:


I've had multiple cases of disks that got one write error then were fine for
more than a year before any further issues.  My thought is add an option to
retry that single write after some short delay (1-2s maybe), and if it still
fails, then mark the disk as failed.

Seems reasonable.
I think I added this to the Project Ideas page on the wiki a *very* long 
time ago

https://btrfs.wiki.kernel.org/index.php/Project_ideas#False_alarm_on_bad_disk_-_rebuild_mitigation

"After a device is marked as unreliable, maintain the device within the 
FS in order to confirm the issue persists. The device will still 
contribute toward fs performance but will not be treated as if 
contributing towards replication/reliability. If the device shows that 
the given errors were a once-off issue then the device can be marked as 
reliable once again. This will mitigate further unnecessary rebalance. 
See http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/ - 
"[Drive Resurrection]" as an example of where this is a significant 
feature for storage vendors."



Agreed. Maybe it would be an error rate (set by ratio)?


I was thinking of either:
a. A running count, using the current error counting mechanisms, with some
max number allowed before the device gets kicked.
b. A count that decays over time, this would need two tunables (how long an
error is considered, and how many are allowed).


OK.






--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [RFC] Btrfs device and pool management (wip)

2015-12-01 Thread Brendan Hide

On 12/1/2015 12:05 PM, Brendan Hide wrote:

On 11/30/2015 11:09 PM, Chris Murphy wrote:

On Mon, Nov 30, 2015 at 1:37 PM, Austin S Hemmelgarn
<ahferro...@gmail.com> wrote:

I've had multiple cases of disks that got one write error then were 
fine for
more than a year before any further issues.  My thought is add an 
option to
retry that single write after some short delay (1-2s maybe), and if 
it still

fails, then mark the disk as failed.

Seems reasonable.
I think I added this to the Project Ideas page on the wiki a *very* 
long time ago
https://btrfs.wiki.kernel.org/index.php/Project_ideas#False_alarm_on_bad_disk_-_rebuild_mitigation 



"After a device is marked as unreliable, maintain the device within 
the FS in order to confirm the issue persists. The device will still 
contribute toward fs performance but will not be treated as if 
contributing towards replication/reliability. If the device shows that 
the given errors were a once-off issue then the device can be marked 
as reliable once again. This will mitigate further unnecessary 
rebalance. See 
http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/ - "[Drive 
Resurrection]" as an example of where this is a significant feature 
for storage vendors."
Related, a separate section on that same page mentions a Jeff Mahoney. 
Perhaps he should be consulted or his work should be looked into:

Take device with heavy IO errors offline or mark as "unreliable"
"Devices should be taken offline after they reach a given threshold of 
IO errors. Jeff Mahoney works on handling EIO errors (among others), 
this project can build on top of it."





Agreed. Maybe it would be an error rate (set by ratio)?


I was thinking of either:
a. A running count, using the current error counting mechanisms, 
with some

max number allowed before the device gets kicked.
b. A count that decays over time, this would need two tunables (how 
long an

error is considered, and how many are allowed).


OK.









--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: raid6 + hot spare question

2015-09-09 Thread Brendan Hide

Things can be a little more nuanced.

First off, I'm not even sure btrfs supports a hot spare currently. I 
haven't seen anything along those lines recently in the list - and don't 
recall anything along those lines before either. The current mention of 
it in the Project Ideas page on the wiki implies it hasn't been looked 
at yet.


Also, depending on your experience with btrfs, some of the tasks 
involved in fixing up a missing/dead disk might be daunting.


See further (queries for btrfs-devs too) inline below:

On 2015-09-08 14:12, Hugo Mills wrote:

On Tue, Sep 08, 2015 at 01:59:19PM +0200, Peter Keše wrote:


However I'd like to be prepared for a disk failure. Because my
server is not easily accessible and disk replacement times can be
long, I'm considering the idea of making a 5-drive raid6, thus
getting 12TB useable space + parity. In this case, the extra 4TB
drive would serve as some sort of a hot spare.

From the above I'm reading one of two situations:
a) 6 drives, raid6 across 5 drives and 1 unused/hot spare
b) 5 drives, raid6 across 5 drives and zero unused/hot spare


My assumption is that if one hard drive fails before the volume is
more than 8TB full, I can just rebalance and resize the volume from
12 TB back to 8 TB essentially going from 5-drive raid6 to 4-drive
raid6).

Can anyone confirm my assumption? Can I indeed rebalance from
5-drive raid6 to 4-drive raid6 if the volume is not too big?

Yes, you can, provided, as you say, the data is small enough to fit
into the reduced filesystem.

Hugo.

This is true - however, I'd be hesitant to build this up due to the 
current process not being very "smooth" depending on how unlucky you 
are. If you have scenario b above, will the filesystem still be 
read/write or read-only post-reboot? Will it "just work" with the only 
requirement being free space on the four working disks?


RAID6 is intended to be tolerant of two disk failures. In the case of 
there being a double failure and only 5 disks, the ease with which the 
user can balance/convert to a 3-disk raid5 is also important.
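
For scenario b above, my understanding is that the recovery would boil
down to something like the following (device names are placeholders, and
this is exactly the part I'd like confirmation on):

mount -o degraded /dev/sdb /mnt
btrfs device delete missing /mnt    # only possible if the data fits on the remaining four drives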


Please shoot down my concerns. :)

--
______
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH 2/3] btrfs: add replace missing and replace RAID 5/6 to profile configs

2015-07-27 Thread Brendan Hide

On 2015/07/24 07:50 PM, Omar Sandoval wrote:
 On Fri, Jul 24, 2015 at 02:09:46PM +0200, David Sterba wrote:
  On Thu, Jul 23, 2015 at 01:51:50PM -0700, Omar Sandoval wrote:
   +   # We can't do replace with these profiles because they
   +   # imply only one device ($SCRATCH_DEV), and we need to
   +   # keep $SCRATCH_DEV around for _scratch_mount
   +   # and _check_scratch_fs.
   +   local unsupported=(
   +       single
   +       dup

  DUP does imply single device, but why does 'single'?

 It does not, I apparently forgot that you could use single to
 concatenate multiple devices. I'll fix that in v2.

 Thanks for reviewing!
 
Late to the party. DUP *implies* single device but there are cases
where dup is used on a multi-device fs. Even if the use-cases aren't
good or intended to be long-term, they are still valid, right?


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


Re: OT raid-1 from raid-10?

2015-02-22 Thread Brendan Hide

On 2015/02/22 03:02, Dave Stevens wrote:

If there's a better list please say so.


Either way, there is definitely not enough information here for us to 
give any practical advice. This is a btrfs-related mailing list and 
there's no indication even that you are using btrfs as your filesystem.


Typically, "linux raid autodetect" refers to mdraid, which is not btrfs 
- and changing the partition's type won't change the underlying issue 
of the data being unavailable.




I have a raid-10 array with two dirty drives and (according to the 
kernel) not enough mirrors to repair the raid-10. But I think drives 
sda and sdb are mirrored and maybe I could read the data off them if I 
changed the fs type from linux raid autodetect to ext3. is that 
reasonable?


D



If at least 3 disks are still in good condition, there's a chance that 
you can recover all your data. If not, my advice is to restore from 
backup. If you don't have backups ... well ... we have a couple of 
sayings about that, mostly along the lines of: If you don't have backups 
of it, it wasn't that important to begin with.


Run the following commands (as root) and send us the output - then maybe 
someone will be kind enough to point you in the right direction:

uname -a
cat /etc/*release
btrfs --version
btrfs fi show
cat /proc/mdstat
fdisk -l
gdisk -l


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Replacing a (or two?) failed drive(s) in RAID-1 btrfs filesystem

2015-02-09 Thread Brendan Hide

On 2015/02/09 10:30 PM, Kai Krakow wrote:
 Brendan Hide bren...@swiftspirit.co.za schrieb:
 
 I have the following two lines in 
 /etc/udev/rules.d/61-persistent-storage.rules for two old 250GB
[snip]
 Wouldn't it be easier and more efficient to use this:
 
 ACTION=="add|change", KERNEL=="sd[a-z]", ENV{ID_SERIAL}=="...",
 ATTR{device/timeout}="120"
 
 Otherwise you always spawn a shell and additional file descriptors,
 and you could spare a variable interpolation. Tho it probably
 depends on your udev version...
 
 I'm using this and it works setting the attributes (set deadline on
 SSD):
 
 ACTION=="add|change", KERNEL=="sd[a-z]",
 ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"
 
 And, I think you missed the double-equal == behind ENV{}...
 Right? Otherwise you just assign a value. Tho, you could probably
 match on ATTR{devices/model} instead to be more generic (the serial
 is probably too specific). You can get those from the
 /sys/block/sd* subtree.
 

It is certainly possible that it isn't 100% the right way - but it has
been working. Your suggestions certainly sound more
efficient/canonical. I was following what I found online until it
worked.  :)

I'll make the appropriate adjustments and test.

Thanks!

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


Re: Replacing a (or two?) failed drive(s) in RAID-1 btrfs filesystem

2015-02-08 Thread Brendan Hide

On 2015/02/09 01:58, constantine wrote:

Second, SMART is only saying its internal test is good. The errors are
related to data transfer, so that implicates the enclosure (bridge
chipset or electronics), the cable, or the controller interface.
Actually it could also be a flaky controller or RAM on the drive
itself too which I don't think get checked with SMART tests.

Which test should I do from now on (on a weekly basis?) so as to
prevent similar things from happening?
Chris has given some very good info so far. I've also had to learn some 
of this stuff the hard way (failed/unreliable drives, data 
unavailable/lost, etc). The info below will be of help to you in 
following some of the advice already given. Unfortunately, the best 
course of action I see so far is to follow Chris' advice and to purchase 
more disks so you can make a backup ASAP.


I have the following two lines in 
/etc/udev/rules.d/61-persistent-storage.rules for two old 250GB 
spindles. It sets the timeout to 120 seconds because these two disks 
don't support SCT ERC. This may very well apply without modification to 
other distros - but this is only tested in Arch:
ACTION=="add", KERNEL=="sd*", SUBSYSTEM=="block", 
ENV{ID_SERIAL}="ST3250410AS_6RYF5NP7" RUN+="/bin/sh -c 'echo 120 > 
/sys$devpath/device/timeout'"
ACTION=="add", KERNEL=="sd*", SUBSYSTEM=="block", 
ENV{ID_SERIAL}="ST3250820AS_9QE2CQWC" RUN+="/bin/sh -c 'echo 120 > 
/sys$devpath/device/timeout'"


I have a smart_scan script* that does a check of all disks using 
smartctl. The meat of the script is in main(). The rest of the script is 
from a template of mine. The script, with no parameters, will do a short 
and then a long test on all drives. It does not give any output - 
however if you have smartd running and configured appropriately, smartd 
will pick up on any issues found and send appropriate alerts 
(email/system log/etc).


It is configured in /etc/cron.d/smart. It runs a short test every 
morning and a long test every Saturday evening:

25 5 * * *    root    /usr/local/sbin/smart_scan short
25 18 * * 6   root    /usr/local/sbin/smart_scan long

Then, scrubbing**:
This relatively simple script runs a scrub on all disks and prints the 
results *only* if there were errors.
I've scheduled this in a cron as well to execute *every* morning shortly 
after 2am. Cron is configured to send me an email if there is any output 
- so I only get an email if there's something to look into.
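
Simplified, the idea is along these lines (mountpoints are examples, and
the exact scrub output format differs between btrfs-progs versions):

#!/bin/sh
# scrub each filesystem; print the output only if errors were found
for mnt in / /home /data; do
    out="$(btrfs scrub start -B "$mnt" 2>&1)"
    echo "$out" | grep -q 'with 0 errors' || printf '%s\n%s\n' "$mnt" "$out"
done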


And finally, I have btsync configured to synchronise my Arch desktop's 
system journal to a couple of local and remote servers of mine. A much 
cleaner way to do this would be to use an external syslog server - I 
haven't yet looked into doing that properly, however.


* http://swiftspirit.co.za/down/smart_scan
** http://swiftspirit.co.za/down/btrfs-scrub-all

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Recovery Operation With Multiple Devices

2015-01-23 Thread Brendan Hide

On 2015/01/23 09:53, Brett King wrote:

Hi All,
Just wondering how 'btrfs recovery' operates
I'm assuming you're referring to a different set of commands or general 
scrub/recovery processes. AFAIK there is no btrfs recovery command.



, when the source device given is one of many in an MD array - I can't find 
any documentation beyond a single-device use case.
btrfs doesn't know what an md array or member is, therefore your results 
aren't going to be well-defined. Depending on the type of md array the 
member was in, your data may be mostly readable (RAID1) or 
completely/mostly non-interpretable (RAID5/6/10/0) until md fixes the array.



Does it automatically include all devices in the relevant MD array as occurs 
when mounting, or does it only restore the data which happened to be written to 
the specific, single device given ?
As above, btrfs is not md-aware. It will attempt to work with what it is 
given. It might not understand anything it sees as it will not have a 
good description of what it is looking at. Imagine being given 
instructions on how to get somewhere only to find that the first 20 
instructions and every second instruction thereafter was skipped and 
there's a 50% chance the destination doesn't exist.



 From an inverse perspective, how can I restore all data including snapshots, 
which are spread across a damaged MD FS to a new (MD) FS ?
Your best bet is to restore the md array. More details are needed for 
anyone to assist - for example what RAID-type was the array set up with, 
how many disks were in the array, and how it failed. Also, technically 
this is the wrong place to ask for advice about restoring md arrays. ;)



Can send / receive do this perhaps ?
Send/receive is for sending good data to a destination that can accept 
it. This, as above, depends on the data being readable/available. Very 
likely the data will be unreadable from a single disk unless the md 
array was RAID1.



Thanks in advance !



--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: btrfs send and interrupts

2015-01-23 Thread Brendan Hide

On 2015/01/23 11:58, Matthias Urlichs wrote:

Hi,

root@data:/daten/backup/email/xmit# btrfs send foo | ssh wherever btrfs receive 
/mnt/mail/xmit/
At subvol foo
At subvol foo
[ some time passes and I need to do something else on that volume ]

^Z
[1]+  Stopped btrfs send foo | ssh -p50022 surf btrfs receive 
/mnt/mail/xmit/
root@data:/daten/backup/email/xmit# bg
[1]+ btrfs send foo | ssh -p50022 surf btrfs receive /mnt/mail/xmit/ 
root@data:/daten/backup/email/xmit#
[ Immediately afterwards, this happens: ]

ERROR: crc32 mismatch in command.

At subvol foo

At subvol foo
ERROR: creating subvolume foo failed. File exists
[1]+  Exit 1  btrfs send foo | ssh -p50022 surf btrfs receive 
/mnt/mail/xmit/
root@data:/daten/backup/email/xmit#

Yowch. Please make sure that the simple act of backgrounding a data
transfer doesn't abort it. That was ten hours in, now I have to repeat
the whole thing. :-/

Thank you.
Interesting case. I'm not sure of the merits/workaround needed to do 
this. It appears even using cat into netcat (nc) causes netcat to quit 
if you background the operation.


A workaround for future: I *strongly* recommend using screen for 
long-lived operations. This would have avoided the problem. Perhaps you 
were sitting in front of the server and it wasn't much of a concern at 
the time - but most admins work remotely. Never mind ^z, what about 
other occurrences such as if the power/internet goes out at your 
office/home and the server is on another continent? Your session dies 
and you lose 10 hours of work/waiting. With a screen session, that is no 
longer true.
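
Something along these lines - the session name is just an example:

screen -S btrfs-send                    # start a named session
btrfs send foo | ssh wherever btrfs receive /mnt/mail/xmit/
# detach with Ctrl-a d; the pipeline keeps running even if your connection drops
screen -r btrfs-send                    # reattach later to check on it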


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: I need to P. are we almost there yet?

2015-01-02 Thread Brendan Hide

On 2015/01/02 15:42, Austin S Hemmelgarn wrote:

On 2014-12-31 12:27, ashf...@whisperpc.com wrote:
I see this as a CRITICAL design flaw.  The reason for calling it 
CRITICAL
is that System Administrators have been trained for 20 years that 
RAID-10

can usually handle a dual-disk failure, but the BTRFS implementation has
effectively ZERO chance of doing so.

No, some rather simple math

That's the problem. The math isn't as simple as you'd expect:

The example below is probably a pathological case - but here goes. Let's 
say in this 4-disk example that chunks are striped as d1,d2,d1,d2 where 
d1 is the first bit of data and d2 is the second:

Chunk 1 might be striped across disks A,B,C,D d1,d2,d1,d2
Chunk 2 might be striped across disks B,C,A,D d3,d4,d3,d4
Chunk 3 might be striped across disks D,A,C,B d5,d6,d5,d6
Chunk 4 might be striped across disks A,C,B,D d7,d8,d7,d8
Chunk 5 might be striped across disks A,C,D,B d9,d10,d9,d10

Lose any two disks and each chunk independently has a one-in-three chance of 
having lost data. With traditional RAID-10 you have a one-in-three chance of 
losing the array entirely. With btrfs, the more data you have stored, the 
closer the chance of losing *some* data in a two-disk failure gets to 100%.


In the above example, losing A and B means you lose d3, d6, and d7 
(which ends up being 60% of all chunks).

Losing A and C means you lose d1 (20% of all chunks).
Losing A and D means you lose d9 (20% of all chunks).
Losing B and C means you lose d10 (20% of all chunks).
Losing B and D means you lose d2 (20% of all chunks).
Losing C and D means you lose d4,d5, AND d8 (60% of all chunks)

The above skewed example has an average of one third of all chunks affected. As 
you add more data and randomise the allocation, that average will stay at around 
one third - BUT the chance of losing *some* data is already clearly shown to be 
very close to 100% (with N chunks, the chance of escaping unscathed is roughly (2/3)^N).


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Another No space left balance failure with plenty of space available

2014-12-02 Thread Brendan Hide

Hey, guys

This is on my ArchLinux desktop. Current values as follows and the exact 
error is currently reproducible. Let me know if you want me to run any 
tests/etc. I've made an image (76MB) and can send the link to interested 
parties.


I have come across this once before in the last few weeks. The 
workaround at that time was to run multiple balances with incrementing 
-musage and -dusage values. Whether or not that was a real, imaginary, 
or temporary fix is another story. I have backups but the issue doesn't 
yet appear to cause any symptoms other than these errors.


The drive is a second-hand 60GB Intel 330 recycled from a decommissioned 
server. The mkfs was run on Nov 4th before I started a migration from 
spinning rust. According to my pacman logs btrfs-progs was on 3.17-1 and 
kernel was 3.17.1-1


root ~ $ uname -a
Linux watricky.invalid.co.za 3.17.4-1-ARCH #1 SMP PREEMPT Fri Nov 21 
21:14:42 CET 2014 x86_64 GNU/Linux

root ~ $ btrfs fi show /
Label: 'arch-btrfs-root'  uuid: 782a0edc-1848-42ea-91cb-de8334f0c248
Total devices 1 FS bytes used 17.44GiB
devid1 size 40.00GiB used 20.31GiB path /dev/sdc1

Btrfs v3.17.2
root ~ $ btrfs fi df /
Data, single: total=18.00GiB, used=16.72GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=1.12GiB, used=738.67MiB
GlobalReserve, single: total=256.00MiB, used=0.00B

Relevant kernel lines:
root ~ $ journalctl -k | grep ^Dec\ 01\ 21\:
Dec 01 21:10:01 watricky.invalid.co.za kernel: BTRFS info (device sdc1): 
relocating block group 46166704128 flags 36
Dec 01 21:10:03 watricky.invalid.co.za kernel: BTRFS info (device sdc1): 
found 2194 extents
Dec 01 21:10:03 watricky.invalid.co.za kernel: BTRFS info (device sdc1): 
relocating block group 45059407872 flags 34
Dec 01 21:10:03 watricky.invalid.co.za kernel: BTRFS info (device sdc1): 
found 1 extents
Dec 01 21:10:03 watricky.invalid.co.za kernel: BTRFS info (device sdc1): 
1 enospc errors during balance



 Original Message 
Subject: 	Cron <root@watricky> /usr/bin/btrfs balance start -musage=90 / 
2>&1 >/dev/null

Date:   Mon, 01 Dec 2014 21:10:03 +0200
From:   (Cron Daemon) r...@watricky.valid.co.za
To: bren...@swiftspirit.co.za



ERROR: error during balancing '/' - No space left on device
There may be more info in syslog - try dmesg | tail





Re: BTRFS equivalent for tune2fs?

2014-12-01 Thread Brendan Hide

On 2014/12/02 07:54, MegaBrutal wrote:

Hi all,

I know there is a btrfstune, but it doesn't provide all the
functionality I'm thinking of.

For ext2/3/4 file systems I can get a bunch of useful data with
tune2fs -l. How can I retrieve the same type of information about a
BTRFS file system? (E.g., last mount time, last checked time, blocks
reserved for superuser*, etc.)

* Anyway, does BTRFS even have an option to reserve X% for the superuser?
Btrfs does not yet have this option. I'm certain that specific feature 
is in mind for the future however.


As regards other equivalents, the same/similar answer applies. There 
simply aren't a lot of tuneables available right now.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: BTRFS equivalent for tune2fs?

2014-12-01 Thread Brendan Hide

On 2014/12/02 09:31, Brendan Hide wrote:

On 2014/12/02 07:54, MegaBrutal wrote:

Hi all,

I know there is a btrfstune, but it doesn't provide all the
functionality I'm thinking of.

For ext2/3/4 file systems I can get a bunch of useful data with
tune2fs -l. How can I retrieve the same type of information about a
BTRFS file system? (E.g., last mount time, last checked time, blocks
reserved for superuser*, etc.)

* Anyway, does BTRFS even have an option to reserve X% for the 
superuser?
Btrfs does not yet have this option. I'm certain that specific feature 
is in mind for the future however.


As regards other equivalents, the same/similar answer applies. There 
simply aren't a lot of tuneables available right now.



Almost forgot about this: btrfs property (get|set)

Again, there are a lot of features still to be added.

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-26 Thread Brendan Hide

On 2014/11/25 13:30, Liu Bo wrote:

This is actually inspired by ZFS, who offers checksum functions ranging
from the simple-and-fast fletcher2 to the slower-but-secure sha256.

Back to btrfs, crc32c is the only choice.

And also for the slowness of sha256, Intel has a set of instructions for
it, Intel SHA Extensions, that may help a lot.


I think the advantage will be in giving a choice with some strong 
suggestions:


An example of suggestions - if using sha256 on an old or low-power 
CPU, detect that the CPU doesn't support the appropriate acceleration 
functions and print a warning at mount or a warning-and-prompt at mkfs-time.
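
Detecting it from userspace is already trivial - something like the 
following, keying off the sha_ni cpuinfo flag used for the Intel SHA 
Extensions:

grep -m1 -o sha_ni /proc/cpuinfo || echo "no SHA extensions - sha256 checksums would take the slow path"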


The default could even be changed based on the architecture - though I 
suspect crc32c is already a good default on most architectures.


Allowing the choice gives flexibility where admins know it could be used 
optimally - and David's suggestion (separate thread) would be able 
to take advantage of that.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-26 Thread Brendan Hide

On 2014/11/25 18:47, David Sterba wrote:

We could provide an interface for external applications that would make
use of the strong checksums. Eg. external dedup, integrity db. The
benefit here is that the checksum is always up to date, so there's no
need to compute the checksums again. At the obvious cost.


I can imagine some use-cases where you might even want more than one 
algorithm to be used and stored. Not sure if that makes me a madman, 
though. ;)


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Fixing Btrfs Filesystem Full Problems typo?

2014-11-24 Thread Brendan Hide

On 2014/11/23 03:07, Marc MERLIN wrote:

On Sun, Nov 23, 2014 at 12:05:04AM +, Hugo Mills wrote:

Which is correct?

Less than or equal to 55% full.
  
This confuses me. Does that mean that the fullest blocks do not get

rebalanced?


Balance has three primary benefits:
- free up some space for new allocations
- change storage profile
- balance/migrate data to or away from new or failing disks (the 
original purpose of balance)

and one fringe benefit:
- force a data re-write (good if you think your spinning-rust needs to 
re-allocate sectors)


In the regular case where you're not changing the storage profile or 
migrating data between disks, there isn't much to gain from balancing 
full chunks - and it involves a lot of work. For SSDs, it is 
particularly bad for wear. For spinning rust it is merely a lot of 
unnecessary work.



I guess I was under the mistaken impression that the more data you had the
more you could be out of balance.


A chunk is the part of a block group that lives on one device, so
in RAID-1, every block group is precisely two chunks; in RAID-0, every
block group is 2 or more chunks, up to the number of devices in the
FS. A chunk is usually 1 GiB in size for data and 250 MiB for
metadata, but can be smaller under some circumstances.

Right. So, why would you rebalance empty chunks or near empty chunks?
Don't you want to rebalance almost full chunks first, and work you way to
less and less full as needed?


Balancing empty chunks makes them available for re-allocation - so that 
is directly useful and light on workload.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH v2 5/5] btrfs: enable swap file support

2014-11-24 Thread Brendan Hide

On 2014/11/25 00:03, Omar Sandoval wrote:

[snip]

The snapshot issue is a little tricker to resolve. I see a few options:

1. Just do the COW and hope for the best
2. As part of btrfs_swap_activate, COW any shared extents. If a snapshot
happens while a swap file is active, we'll fall back to 1.
3. Clobber any swap file extents which are in a snapshot, i.e., always use the
existing extent.

I'm partial to 3, as it's the simplest approach, and I don't think it makes
much sense for a swap file to be in a snapshot anyways. I'd appreciate any
comments that anyone might have.


Personally, 3 seems pragmatic - but not necessarily correct. :-/

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-20 Thread Brendan Hide

On 2014/11/21 06:58, Zygo Blaxell wrote:

You have one reallocated sector, so the drive has lost some data at some
time in the last 49000(!) hours.  Normally reallocations happen during
writes so the data that was lost was data you were in the process of
overwriting anyway; however, the reallocated sector count could also be
a sign of deteriorating drive integrity.

In /var/lib/smartmontools there might be a csv file with logged error
attribute data that you could use to figure out whether that reallocation
was recent.

I also notice you are not running regular SMART self-tests (e.g.
by smartctl -t long) and the last (and first, and only!) self-test the
drive ran was ~12000 hours ago.  That means most of your SMART data is
about 18 months old.  The drive won't know about sectors that went bad
in the last year and a half unless the host happens to stumble across
them during a read.

The drive is over five years old in operating hours alone.  It is probably
so fragile now that it will break if you try to move it.
All interesting points. Do you schedule SMART self-tests on your own 
systems? I have smartd running. In theory it tracks changes and sends 
alerts if it figures a drive is going to fail. But, based on what you've 
indicated, that isn't good enough.
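
For anyone else following along, the low-tech way to make sure long 
self-tests actually happen is a cron entry along these lines (schedule 
and device are only examples; /etc/cron.d format with a user field):

# weekly long SMART self-test, Sundays at 03:00
0 3 * * 0   root   /usr/sbin/smartctl -t long /dev/sdb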



WARNING: errors detected during scrubbing, corrected.
[snip]
scrub device /dev/sdb2 (id 2) done
scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 
seconds
total bytes scrubbed: 189.49GiB with 5420 errors
error details: read=5 csum=5415
corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
That seems a little off.  If there were 5 read errors, I'd expect the drive to
have errors in the SMART error log.

Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem.
There have been a number of fixes to csums in btrfs pulled into the kernel
recently, and I've retired two five-year-old computers this summer due
to RAM/CPU failures.
The difference here is that the issue only affects the one drive. This 
leaves the probable cause at:

- the drive itself
- the cable/ports

with a negligibly-possible cause at the motherboard chipset.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Brendan Hide

On 2014/11/18 09:36, Roman Mamedov wrote:

On Tue, 18 Nov 2014 09:29:54 +0200
Brendan Hide bren...@swiftspirit.co.za wrote:


Hey, guys

See further below extracted output from a daily scrub showing csum
errors on sdb, part of a raid1 btrfs. Looking back, it has been getting
errors like this for a few days now.

The disk is patently unreliable but smartctl's output implies there are
no issues. Is this somehow standard faire for S.M.A.R.T. output?

Not necessarily the disk's fault, could be a SATA controller issue. How are
your disks connected, which controller brand and chip? Add lspci output, at
least if it's something other than the ordinary motherboard chipset's
built-in ports.


In this case, yup, it's directly on the motherboard chipset's built-in ports. 
This is a very old desktop, and the other 3 disks don't have any issues. I'm 
checking out the alternative pointed out by Austin.

SATA-relevant lspci output:
00:1f.2 SATA controller: Intel Corporation 82801JD/DO (ICH10 Family) SATA AHCI 
Controller (rev 02)


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Brendan Hide

On 2014/11/18 14:08, Austin S Hemmelgarn wrote:

[snip] there are some parts of the drive that aren't covered by SMART 
attributes on most disks, most notably the on-drive cache. There really isn't a 
way to disable the read cache on the drive, but you can disable write-caching.
It's an old and replaceable disk - but if the cable replacement doesn't 
work I'll try this for kicks. :)
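
For the record, the knob for that is hdparm (using the device from this 
thread as the example):

hdparm -W 0 /dev/sdb    # turn the drive's write cache off
hdparm -W 1 /dev/sdb    # ...and back on afterwards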

The other thing I would suggest trying is a different data cable to the drive 
itself, I've had issues with some SATA cables (the cheap red ones you get in 
the retail packaging for some hard disks in particular) having either bad 
connectors, or bad strain-reliefs, and failing after only a few hundred hours 
of use.

Thanks. I'll try this first. :)

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: BTRFS messes up snapshot LV with origin

2014-11-17 Thread Brendan Hide

On 2014/11/17 09:35, Daniel Dressler top-posted:

If a UUID is not unique enough how will adding a second UUID or
unique drive identifier help?
A UUID is *supposed* to be unique by design. Isolated, the design is 
adequate.


But the bigger picture clearly shows the design is naive. And broken.

A second per-disk id (note I said unique - I never said universal 
as in UUID) would allow for better-defined behaviour; at present we 
simply say the behaviour is undefined and you're likely to 
get corruption.


On the other hand, I asked already if we have IDs of some sort (how else 
do we know which disk a chunk is stored on?), thus I don't think we need 
to add anything to the format.


A simple scenario similar to the one the OP introduced:

Disk sda - says it is UUID Z with diskid 0
Disk sdb - says it is UUID Z with diskid 0

If we're ignoring the fact that there are two disks with the same UUID 
and diskid and it causes corruption, then the kernel is doing something 
stupid but fixable. We have some choices:
- give a clear warning and ignore one of the disks (could just pick the 
first one - or be a little smarter and pick one based on some heuristic 
- for example extent generation number)

- give a clear error and panic

Normal multi-disk scenario:
Disk sda - UUID Z with diskid 1
Disk sdb - UUID Z with diskid 2

These two disks are in the same filesystem and are supposed to work 
together - no issues.


My second suggestion covers another scenario as well:

Disk sda - UUID Z with diskid 1; root block indicates that only diskid 
1 is recorded as being part of the filesystem
Disk sdb - UUID Z with diskid 3; root block indicates that only diskid 
3 is recorded as being part of the filesystem


Again, based on the existing featureset, it seems reasonable that this 
information should already be recorded in the fs metadata. If the 
behaviour is undefined and causing corruption, again the kernel is 
currently doing something stupid but fixable. Again, we have similar 
choices:

- give a clear warning and ignore bad disk(s)
- give a clear error and panic


2014-11-17 15:59 GMT+09:00 Brendan Hide bren...@swiftspirit.co.za:

cc'd bug-g...@gnu.org for FYI

On 2014/11/17 03:42, Duncan wrote:

MegaBrutal posted on Sun, 16 Nov 2014 22:35:26 +0100 as excerpted:


Hello guys,

I think you'll like this...
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429

UUID is an initialism for Universally Unique IDentifier.[1]

If the UUID isn't unique, by definition, then, it can't be a UUID, and
that's a bug in whatever is making the non-unique would-be UUID that
isn't unique and thus cannot be a universally unique ID.  In this case
that would appear to be LVM.

Perhaps the right question to ask is Where should this bug be fixed?.

TL;DR: This needs more thought and input from btrfs devs. To LVM, the bug is
likely seen as being out of scope. The correct fix probably lies in the
ecosystem design, which requires co-operation from btrfs.

Making a snapshot in LVM is a fundamental thing - and I feel LVM, in making
its snapshot, is doing its job exactly as expected.

Additionally, there are other ways to get to a similar state without LVM:
ddrescue backup, SAN snapshot, old missing disk re-introduced, etc.

That leaves two places where this can be fixed: grub and btrfs

Grub is already a little smart here - it avoids snapshots. But in this case
it is relying on the UUID and only finding it in the snapshot. So possibly
this is a bug in grub affecting the bug reporter specifically - but perhaps
the bug is in btrfs where grub is relying on btrfs code.

Yes, I'd rather use btrfs' snapshot mechanism - but this is often a choice
that is left to the user/admin/distro. I don't think saying LVM snapshots
are incompatible with btrfs is the right way to go either.

That leaves two aspects of this issue which I view as two separate bugs:
a) Btrfs cannot gracefully handle separate filesystems that have the same
UUID. At all.
b) Grub appears to pick the wrong filesystem when presented with two
filesystems with the same UUID.

I feel a) is a btrfs bug.
I feel b) is a bug that is more about ecosystem design than grub being
silly.

I imagine a couple of aspects that could help fix a):
- Utilise a unique drive identifier in the btrfs metadata (surely this
exists already?). This way, any two filesystems will always have different
drive identifiers *except* in cases like a ddrescue'd copy or a block-level
snapshot. This will provide a sensible mechanism for defined behaviour,
preventing corruption - even if that defined behaviour is to simply give
out lots of PEBKAC errors and panic.
- Utilise a drive list to ensure that two unrelated filesystems with the
same UUID cannot get mixed up. Yes, the user/admin would likely be the
culprit here (perhaps a VM rollout process that always gives out the same
UUID in all its filesystems). Again, does btrfs not already have something
like this built-in that we're simply not utilising

scrub implies failing drive - smartctl blissfully unaware

2014-11-17 Thread Brendan Hide

Hey, guys

See further below extracted output from a daily scrub showing csum 
errors on sdb, part of a raid1 btrfs. Looking back, it has been getting 
errors like this for a few days now.


The disk is patently unreliable but smartctl's output implies there are 
no issues. Is this somehow standard faire for S.M.A.R.T. output?


Here are (I think) the important bits of the smartctl output for 
$(smartctl -a /dev/sdb) (the full results are attached):
ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE UPDATED  
WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate 0x000f   100   253   006Pre-fail 
Always   -   0
  5 Reallocated_Sector_Ct   0x0033   100   100   036Pre-fail 
Always   -   1
  7 Seek_Error_Rate 0x000f   086   060   030Pre-fail 
Always   -   440801014
197 Current_Pending_Sector  0x0012   100   100   000Old_age 
Always   -   0
198 Offline_Uncorrectable   0x0010   100   100   000Old_age 
Offline  -   0
199 UDMA_CRC_Error_Count0x003e   200   200   000Old_age 
Always   -   0
200 Multi_Zone_Error_Rate   0x   100   253   000Old_age 
Offline  -   0
202 Data_Address_Mark_Errs  0x0032   100   253   000Old_age 
Always   -   0




 Original Message 
Subject:Cron root@watricky /usr/local/sbin/btrfs-scrub-all
Date:   Tue, 18 Nov 2014 04:19:12 +0200
From:   (Cron Daemon) root@watricky
To: brendan@watricky



WARNING: errors detected during scrubbing, corrected.
[snip]
scrub device /dev/sdb2 (id 2) done
scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 
seconds
total bytes scrubbed: 189.49GiB with 5420 errors
error details: read=5 csum=5415
corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
[snip]

smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.17.2-1-ARCH] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10
Device Model: ST3250410AS
Serial Number:6RYF5NP7
Firmware Version: 4.AAA
User Capacity:250,059,350,016 bytes [250 GB]
Sector Size:  512 bytes logical/physical
Device is:In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:Tue Nov 18 09:16:03 2014 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status:  (   0) The previous self-test routine completed
without error or no self-test has ever 
been run.
Total time to complete Offline 
data collection:(  430) seconds.
Offline data collection
capabilities:(0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off 
support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:(0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:(0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine 
recommended polling time:(   1) minutes.
Extended self-test routine
recommended polling time:(  64) minutes.
SCT capabilities:  (0x0001) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE  UPDATED  
WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate 0x000f   100   253   006Pre-fail  Always   
-   0
  3 Spin_Up_Time0x0003   099   097   000Pre-fail  Always   
-   0
  4 Start_Stop_Count0x0032   100   100   020Old_age   Always   
-   68
  5 Reallocated_Sector_Ct   0x0033   100   100   036Pre-fail  Always   
-   1
  7 Seek_Error_Rate 0x000f   086   060   030Pre-fail  Always   
-   

Re: BTRFS messes up snapshot LV with origin

2014-11-16 Thread Brendan Hide

cc'd bug-g...@gnu.org for FYI

On 2014/11/17 03:42, Duncan wrote:

MegaBrutal posted on Sun, 16 Nov 2014 22:35:26 +0100 as excerpted:


Hello guys,

I think you'll like this...
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429

UUID is an initialism for Universally Unique IDentifier.[1]

If the UUID isn't unique, by definition, then, it can't be a UUID, and
that's a bug in whatever is making the non-unique would-be UUID that
isn't unique and thus cannot be a universally unique ID.  In this case
that would appear to be LVM.

Perhaps the right question to ask is Where should this bug be fixed?.

TL;DR: This needs more thought and input from btrfs devs. To LVM, the 
bug is likely seen as being out of scope. The correct fix probably 
lies in the ecosystem design, which requires co-operation from btrfs.


Making a snapshot in LVM is a fundamental thing - and I feel LVM, in 
making its snapshot, is doing its job exactly as expected.


Additionally, there are other ways to get to a similar state without 
LVM: ddrescue backup, SAN snapshot, old missing disk re-introduced, etc.


That leaves two places where this can be fixed: grub and btrfs

Grub is already a little smart here - it avoids snapshots. But in this 
case it is relying on the UUID and only finding it in the snapshot. So 
possibly this is a bug in grub affecting the bug reporter specifically - 
but perhaps the bug is in btrfs where grub is relying on btrfs code.


Yes, I'd rather use btrfs' snapshot mechanism - but this is often a 
choice that is left to the user/admin/distro. I don't think saying LVM 
snapshots are incompatible with btrfs is the right way to go either.


That leaves two aspects of this issue which I view as two separate bugs:
a) Btrfs cannot gracefully handle separate filesystems that have the 
same UUID. At all.
b) Grub appears to pick the wrong filesystem when presented with two 
filesystems with the same UUID.


I feel a) is a btrfs bug.
I feel b) is a bug that is more about ecosystem design than grub being 
silly.


I imagine a couple of aspects that could help fix a):
- Utilise a unique drive identifier in the btrfs metadata (surely this 
exists already?). This way, any two filesystems will always have 
different drive identifiers *except* in cases like a ddrescue'd copy or 
a block-level snapshot. This will provide a sensible mechanism for 
defined behaviour, preventing corruption - even if that defined 
behaviour is to simply give out lots of PEBKAC errors and panic.
- Utilise a drive list to ensure that two unrelated filesystems with 
the same UUID cannot get mixed up. Yes, the user/admin would likely be 
the culprit here (perhaps a VM rollout process that always gives out the 
same UUID in all its filesystems). Again, does btrfs not already have 
something like this built-in that we're simply not utilising fully?


I'm not exactly sure of the correct way to fix b) except that I 
imagine it would be trivial to fix once a) is fixed.
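
For anyone who wants to see the failure mode without waiting for it to 
bite, it is trivial to reproduce on a btrfs LV (VG/LV names are just 
examples):

lvcreate --snapshot --name root-snap --size 5G /dev/vg0/root
blkid /dev/vg0/root /dev/vg0/root-snap    # both now report the same btrfs UUID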


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: btrfs check segfaults after flipping 2 Bytes

2014-10-02 Thread Brendan Hide

On 2014/10/02 01:31, Duncan wrote:

Niklas Fischer posted on Wed, 01 Oct 2014 22:29:55 +0200 as excerpted:


I was trying to determine how btrfs reacts to disk errors, when I
discovered, that flipping two Bytes, supposedly inside of a file can
render the filesystem unusable. Here is what I did:

1. dd if=/dev/zero of=/dev/sdg2 bs=1M
2. mkfs.btrfs /dev/sdg2
3. mount /dev/sdg2 /tmp/btrfs
4. echo hello world this is some text > /tmp/btrfs/hello
5. umount /dev/sdg2

Keep in mind that on btrfs, small enough files will not be written to
file extents but instead will be written directly into the metadata.

That's a small enough file I guess that's what you were seeing, which
would explain the two instances of the string, since on a single device
btrfs, metadata is dup mode by default.

That metadata block would then fail checksum, and an attempt would be
made to use the second copy, which of course would fail it the same way.

At least a very unlikely scenario in production.

And that being the only file in the filesystem, I'd /guess/ (not being a
developer myself, just a btrfs testing admin and list regular) that
metadata block is still the original one, which very likely contains
critical filesystem information as well, thus explaining the mount
failure when the block failed checksum verify.
This is a possible use-case for an equivalent to ZFS's ditto blocks. An 
alternative strategy would be to purposefully sparsify early metadata 
blocks (this is thinking out loud - whether or not that is a viable or 
easy strategy is debatable).

In theory at least, with a less synthetic test case there'd be enough
more metadata on the filesystem that the affected metadata block would be
further down the chain, and corrupting it wouldn't corrupt critical
filesystem information as it wouldn't be in the same block.

That might explain the problem, but I don't know enough about btrfs to
know how reasonable a solution would be.
[snip]
A reasonable workaround to get the filesystem back into a usable or 
recoverable state might be to mount read-only and ignore checksums. That 
would keep the filesystem intact, though the system has no way to know 
whether or not the folder structures are also corrupt.


I'm not sure if there is a mount option for this use case however. The 
option descriptions for nodatasum and nodatacow imply that *new* 
checksums are not generated. In this case the checksums already exist.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: btrfs check segfaults after flipping 2 Bytes

2014-10-02 Thread Brendan Hide

On 2014/10/02 07:51, Brendan Hide wrote:

On 2014/10/02 01:31, Duncan wrote:
[snip]

I'm not sure if there is a mount option for this use case however. The 
option descriptions for nodatasum and nodatacow imply that *new* 
checksums are not generated. In this case the checksums already exist.


Looks like btrfsck has a relevant option, albeit likely more destructive 
than absolutely necessary:

--init-csum-tree
   create a new CRC tree.

^ Also, mail was sent as HTML 12 hours ago thus was never delivered. 
Thunderbird has been disciplined.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Ideas for a feature implementation

2014-08-13 Thread Brendan Hide

On 2014/08/12 17:52, David Pottage wrote:


[snip] ... if it does not then the file-system has broken the contract 
to secure delete a file when you asked it to.

This is a technicality - and it has not necessarily broken the contract.

I think the correct thing to do would be to securely delete the metadata 
referring to that file. That would satisfy the requirement that there is no 
evidence the file ever existed in that location. The fact that the data 
still legitimately exists elsewhere is not a caveat - the filesystem is 
simply acting within its standard behaviour.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: ENOSPC errors during balance

2014-07-21 Thread Brendan Hide

On 20/07/14 14:59, Duncan wrote:

Marc Joliet posted on Sun, 20 Jul 2014 12:22:33 +0200 as excerpted:


On the other hand, the wiki [0] says that defragmentation (and
balancing) is optional, and the only reason stated for doing either is
because they will have impact on performance.

Yes.  That's what threw off the other guy as well.  He decided to skip it
for the same reason.

If I had a wiki account I'd change it, but for whatever reason I tend to
be far more comfortable writing list replies, sometimes repeatedly, than
writing anything on the web, which I tend to treat as read-only.  So I've
never gotten a wiki account and thus haven't changed it, and apparently
the other guy with the problem and anyone else that knows hasn't changed
it either, so the conversion page still continues to underemphasize the
importance of completing the conversion steps, including the defrag, in
proper order.

I've inserted information specific to this in the wiki. Others with wiki 
accounts, feel free to review:

https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3#Before_first_use

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH] generic/017: skip invalid block sizes for btrfs

2014-06-23 Thread Brendan Hide
Not subscribed to fstests so not sure if this will reach that mailing 
list...


I feel Takeuchi's instincts are right, even if the analysis *may* be 
wrong. As it is, it looks like there should be a btrfs) selector inside the 
case.
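
Roughly what I mean, as an untested sketch - the echo value is lifted from 
the patch quoted below so the golden output stays intact, and I'm assuming 
a plain _scratch_mkfs is acceptable for btrfs here:

case $FSTYP in
xfs)
	_scratch_mkfs -b size=$BSIZE >> $seqres.full 2>&1
	;;
ext4)
	_scratch_mkfs -b $BSIZE >> $seqres.full 2>&1
	;;
btrfs)
	if (( BSIZE < $(getconf PAGE_SIZE) )); then
		echo 80		# sub-page sector sizes aren't supported
		continue
	fi
	_scratch_mkfs >> $seqres.full 2>&1
	;;
esac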


On 23/06/14 12:48, Satoru Takeuchi wrote:

Hi Filipe,

(2014/06/23 19:28), Filipe David Borba Manana wrote:

In btrfs the block size (called sector size in btrfs) can not be
smaller then the page size. Therefore skip block sizes smaller
then page size if the fs is btrfs, so that the test can succeed
on btrfs (testing only with block sizes of 4kb on systems with a
page size of 4Kb).

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com

I consider it doesn't work since this test is not for Btrfs.
Please see the following code.

tests/generic/017:
===
for (( BSIZE = 1024; BSIZE <= 4096; BSIZE *= 2 )); do

length=$(($BLOCKS * $BSIZE))
case $FSTYP in
xfs)
_scratch_mkfs -b size=$BSIZE >> $seqres.full 2>&1
;;
ext4)
_scratch_mkfs -b $BSIZE >> $seqres.full 2>&1
;;
esac
_scratch_mount >> $seqres.full 2>&1
===

There is no btrfs here.

This test was moved to shared/005 to generic/017
at 21723cdbf303e031d6429f67fec9768750a5db7d.

Original supported fs is here.
===
supported_fs xfs ext4
===

I suspect that Lukas moved this test to generic/ by mistake or forgot to
add $FSTYP == btrfs case.

Thanks,
Satoru


---
   tests/generic/017 | 8 
   1 file changed, 8 insertions(+)

diff --git a/tests/generic/017 b/tests/generic/017
index 13b7254..6495be5 100755
--- a/tests/generic/017
+++ b/tests/generic/017
@@ -51,6 +51,14 @@ BLOCKS=10240
   
for (( BSIZE = 1024; BSIZE <= 4096; BSIZE *= 2 )); do
   
+	# btrfs doesn't support block size smaller then page size

+   if [ $FSTYP == btrfs ]; then
+   if (( $BSIZE < `getconf PAGE_SIZE` )); then
+   echo 80
+   continue
+   fi
+   fi
+
length=$(($BLOCKS * $BSIZE))
case $FSTYP in
xfs)





--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH v3] lib: add size unit t/p/e to memparse

2014-06-13 Thread Brendan Hide

On 12/06/14 23:15, Andrew Morton wrote:

On Wed, 2 Apr 2014 16:54:37 +0800 Gui Hecheng guihc.f...@cn.fujitsu.com wrote:


For modern filesystems such as btrfs, t/p/e size level operations
are common.
add size unit t/p/e parsing to memparse

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
changelog
v1-v2: replace kilobyte with kibibyte, and others
v2-v3: add missing unit bytes in comment
---
  lib/cmdline.c | 25 -
  1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/lib/cmdline.c b/lib/cmdline.c
index eb67911..511b9be 100644
--- a/lib/cmdline.c
+++ b/lib/cmdline.c
@@ -119,11 +119,17 @@ char *get_options(const char *str, int nints, int *ints)
   *@retptr: (output) Optional pointer to next char after parse completes
   *
   *Parses a string into a number.  The number stored at @ptr is
- * potentially suffixed with %K (for kilobytes, or 1024 bytes),
- * %M (for megabytes, or 1048576 bytes), or %G (for gigabytes, or
- * 1073741824).  If the number is suffixed with K, M, or G, then
- * the return value is the number multiplied by one kilobyte, one
- * megabyte, or one gigabyte, respectively.
+ * potentially suffixed with
+ * %K (for kibibytes, or 1024 bytes),
+ * %M (for mebibytes, or 1048576 bytes),
+ * %G (for gibibytes, or 1073741824 bytes),
+ * %T (for tebibytes, or 1099511627776 bytes),
+ * %P (for pebibytes, or 1125899906842624 bytes),
+ * %E (for exbibytes, or 1152921504606846976 bytes).

I'm afraid I find these names quite idiotic - we all know what the
traditional terms mean so why go and muck with it.

Also, kibibytes sounds like cat food.

Hi, Andrew

While I agree it sounds like cat food, it seemed like a good opportunity 
to fix a minor issue that is otherwise unlikely to be fixed for a very 
long time. Should we feel uncomfortable with the patch, as is, because 
of language/correctness friction? Pedantry included, the patch is 
correct. ;)


Thanks

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH v4] lib: add size unit t/p/e to memparse

2014-06-13 Thread Brendan Hide

On 13/06/14 03:42, Gui Hecheng wrote:

For modern filesystems such as btrfs, t/p/e size level operations
are common.
add size unit t/p/e parsing to memparse

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
changelog
v1-v2: replace kilobyte with kibibyte, and others
v2-v3: add missing unit bytes in comment
v3-v4: remove idiotic name for K,M,G,P,T,E
---
  lib/cmdline.c | 15 ++-
  1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/lib/cmdline.c b/lib/cmdline.c
index d4932f7..76a712e 100644
--- a/lib/cmdline.c
+++ b/lib/cmdline.c
@@ -121,11 +121,7 @@ EXPORT_SYMBOL(get_options);
   *@retptr: (output) Optional pointer to next char after parse completes
   *
   *Parses a string into a number.  The number stored at @ptr is
- * potentially suffixed with %K (for kilobytes, or 1024 bytes),
- * %M (for megabytes, or 1048576 bytes), or %G (for gigabytes, or
- * 1073741824).  If the number is suffixed with K, M, or G, then
- * the return value is the number multiplied by one kilobyte, one
- * megabyte, or one gigabyte, respectively.
+ * potentially suffixed with K, M, G, T, P, E.
   */
  
  unsigned long long memparse(const char *ptr, char **retptr)

@@ -135,6 +131,15 @@ unsigned long long memparse(const char *ptr, char **retptr)
unsigned long long ret = simple_strtoull(ptr, endptr, 0);
  
  	switch (*endptr) {

+   case 'E':
+   case 'e':
+   ret <<= 10;
+   case 'P':
+   case 'p':
+   ret <<= 10;
+   case 'T':
+   case 't':
+   ret <<= 10;
case 'G':
case 'g':
ret <<= 10;

Ah, I see - you've removed all reference to their names. That's good too. :)

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: What to do about snapshot-aware defrag

2014-05-31 Thread Brendan Hide

On 2014/05/31 12:00 AM, Martin wrote:

OK... I'll jump in...

On 30/05/14 21:43, Josef Bacik wrote:

[snip]
Option 1: Only relink inodes that haven't changed since the snapshot was
taken.

Pros:
-Faster
-Simpler
-Less duplicated code, uses existing functions for tricky operations so
less likely to introduce weird bugs.

Cons:
-Could possibly lost some of the snapshot-awareness of the defrag.  If
you just touch a file we would not do the relinking and you'd end up
with twice the space usage.

[...]


Obvious way to go for fast KISS.


I second this - KISS is better.

Would in-band dedupe resolve the issue with losing the 
snapshot-awareness of the defrag? I figure that if someone absolutely 
wants everything deduped efficiently they'd put in the necessary 
resources (memory/dedicated SSD/etc) to have in-band dedupe work well.

One question:

Will option one mean that we always need to mount with noatime or
read-only to allow snapshot defragging to do anything?


That is a very good question. I very rarely have mounts without noatime 
- and usually only because I hadn't thought of it.



Regards,
Martin


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [RFC PATCH 1/2] btrfs: Add missing device check in dev_info/rm_dev ioctl

2014-05-21 Thread Brendan Hide

On 2014/05/21 06:15 AM, Qu Wenruo wrote:

[snip]

 Further on top of your check_missing patch I am writing
 code to to handle disk reappear. I should be sending them
 all soon.

Disk reappear problem is also reproduce here.

I am intersting about how will your patch to deal with.
Is your patch going to check super genertion to determing previously 
missing device and

wipe reappeared superblock?(Wang mentioned it in the mail in Jan.)



With md we have the bitmap feature that helps prevent resynchronising 
the entire disk when doing a re-add. Wiping the superblock is *better* 
than what we currently have (corruption) - but hopefully the end goal is 
to be able to have it re-add *without* introducing corruption.
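
For reference, the md feature in question is the write-intent bitmap, e.g.:

mdadm --grow --bitmap=internal /dev/md0    # a re-added member then only resyncs the dirty regions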




IMO the reappear disk problem can also be resolved by not swap 
tgtdev-uuid and srcdev-uuid,

which means tgtdev will not use the same uuid of srcdev.

Thanks,
Qu


Thanks, Anand




--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: ditto blocks on ZFS

2014-05-20 Thread Brendan Hide

On 2014/05/20 04:07 PM, Austin S Hemmelgarn wrote:

On 2014-05-19 22:07, Russell Coker wrote:

[snip]
As an aside, I'd really like to be able to set RAID levels by subtree.  I'd
like to use RAID-1 with ditto blocks for my important data and RAID-0 for
unimportant data.


But the proposed changes for n-way replication would already handle
this.
[snip]

Russell's specific request above is probably best handled by being able 
to change replication levels per subvolume - this won't be handled by 
N-way replication.


Extra replication on leaf nodes will make relatively little difference 
in the scenarios laid out in this thread - but on trunk nodes (folders 
or subvolumes closer to the filesystem root) it makes a significant 
difference. Plain N-way replication doesn't treat these two kinds of 
node differently.


As an example, Russell might have a server with two disks - yet he wants 
6 copies of all metadata for subvolumes and their immediate subfolders. 
At three folders deep he only wants to have 4 copies. At six folders 
deep, only 2. Ditto blocks add an attractive safety net without 
unnecessarily doubling or tripling the size of *all* metadata.


It is a good idea. The next question to me is whether or not it is 
something that can be implemented elegantly and whether or not a 
talented *dev* thinks it is a good idea.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: send/receive and bedup

2014-05-19 Thread Brendan Hide

On 19/05/14 15:00, Scott Middleton wrote:

On 19 May 2014 09:07, Marc MERLIN m...@merlins.org wrote:

On Wed, May 14, 2014 at 11:36:03PM +0800, Scott Middleton wrote:

I read so much about BtrFS that I mistaked Bedup with Duperemove.
Duperemove is actually what I am testing.

I'm currently using programs that find files that are the same, and
hardlink them together:
http://marc.merlins.org/perso/linux/post_2012-05-01_Handy-tip-to-save-on-inodes-and-disk-space_-finddupes_-fdupes_-and-hardlink_py.html

hardlink.py actually seems to be the faster (memory and CPU) one event
though it's in python.
I can get others to run out of RAM on my 8GB server easily :(


Interesting app.

An issue with hardlinking (though with the backups use-case this problem isn't 
likely to come up) is that if you modify a file, all the hardlinks change along 
with it - including the ones that you don't want changed.

@Marc: Since you've been using btrfs for a while now I'm sure you've already 
considered whether or not a reflink copy is the better/worse option.
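
For the archives, the reflink equivalent of the hardlink trick is a 
one-liner - the copies share extents but diverge safely when one of them 
is modified:

cp --reflink=always original-file deduped-copy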



Bedup should be better, but last I tried I couldn't get it to work.
It's been updated since then, I just haven't had the chance to try it
again since then.

Please post what you find out, or if you have a hardlink maker that's
better than the ones I found :)



Thanks for that.

I may be  completely wrong in my approach.

I am not looking for a file level comparison. Bedup worked fine for
that. I have a lot of virtual images and shadow protect images where
only a few megabytes may be the difference. So a file level hash and
comparison doesn't really achieve my goals.

I thought duperemove may be on a lower level.

https://github.com/markfasheh/duperemove

Duperemove is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it will
hash their contents on a block by block basis and compare those hashes
to each other, finding and categorizing extents that match each
other. When given the -d option, duperemove will submit those
extents for deduplication using the btrfs-extent-same ioctl.

It defaults to 128k but you can make it smaller.

I hit a hurdle though. The 3TB HDD  I used seemed OK when I did a long
SMART test but seems to die every few hours. Admittedly it was part of
a failed mdadm RAID array that I pulled out of a clients machine.

The only other copy I have of the data is the original mdadm array
that was recently replaced with a new server, so I am loathe to use
that HDD yet. At least for another couple of weeks!


I am still hopeful duperemove will work.
Duperemove does look exactly like what you are looking for. The last 
traffic on the mailing list regarding that was in August last year. It 
looks like it was pulled into the main kernel repository on September 1st.


The last commit to the duperemove application was on April 20th this 
year. Maybe Mark (cc'd) can provide further insight on its current status.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: ditto blocks on ZFS

2014-05-19 Thread Brendan Hide

On 2014/05/19 10:36 PM, Martin wrote:

On 18/05/14 17:09, Russell Coker wrote:

On Sat, 17 May 2014 13:50:52 Martin wrote:

[...]

Do you see or measure any real advantage?

[snip]
This is extremely difficult to measure objectively. Subjectively ... see 
below.

[snip]

*What other failure modes* should we guard against?


I know I'd sleep a /little/ better at night knowing that a double disk 
failure on a raid5/1/10 configuration might ruin a ton of data along 
with an obscure set of metadata in some long tree paths - but not the 
entire filesystem.


The other use-case/failure mode - where you are somehow unlucky enough 
to have sets of bad sectors/bitrot on multiple disks that simultaneously 
affect the only copies of the tree roots - is an extremely unlikely 
scenario. As unlikely as it may be, the scenario is a very painful 
consequence in spite of VERY little corruption. That is where the 
peace-of-mind/bragging rights come in.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH 00/27] Replace the old man page with asciidoc and man page for each btrfs subcommand.

2014-05-18 Thread Brendan Hide

On 2014/05/18 02:05 PM, Hugo Mills wrote:

On Sun, May 18, 2014 at 03:04:33PM +0800, Qu Wenruo wrote:
I don't have any real suggestions for alternatives coming from my
experience, other than not this. I've used docbook for man pages
briefly, many years ago. Looking around on the web, reStructuredText
might be a good option. Personally, I'd like to write docs in LaTeX,
but I'm not sure how easy it is to convert that to man pages.

Hugo.

What I have read so far indicates that LaTeX is the simplest, most 
beautiful way to create portable documentation - and that exporting to a 
man page is straightforward. I can't vouch for it except to say that it is 
worth investigating.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: RAID-1 - suboptimal write performance?

2014-05-18 Thread Brendan Hide

On 2014/05/16 11:36 PM, Austin S Hemmelgarn wrote:

On 05/16/2014 04:41 PM, Tomasz Chmielewski wrote:

On Fri, 16 May 2014 14:06:24 -0400
Calvin Walton calvin.wal...@kepstin.ca wrote:


No comment on the performance issue, other than to say that I've seen
similar on RAID-10 before, I think.


Also, what happens when the system crashes, and one drive has
several hundred megabytes data more than the other one?

This shouldn't be an issue as long as you occasionally run a scrub or
balance. The scrub should find it and fix the missing data, and a
balance would just rewrite it as proper RAID-1 as a matter of course.

It's similar (writes to just one drive, while the other is idle) when
removing (many) snapshots.

Not sure if that's optimal behaviour.


[snip]

Ideally, BTRFS should dispatch the first write for a block in a
round-robin fashion among available devices.  This won't fix the
underlying issue, but it will make it less of an issue for BTRFS.



More ideally, btrfs should dispatch them in parallel. This will likely 
be looked into for N-way mirroring. Having 3 or more copies and working 
in the current way would be far from optimal.




--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: staggered stripes

2014-05-15 Thread Brendan Hide

On 2014/05/15 04:38 PM, Russell Coker wrote:

On Thu, 15 May 2014 09:31:42 Duncan wrote:

Does the BTRFS RAID functionality do such staggered stripes?  If not
could it be added?

AFAIK nothing like that yet, but it's reasonably likely to be implemented
later.  N-way-mirroring is roadmapped for next up after raid56
completion, however.

It's RAID-5/6 when we really need such staggering.  It's a reasonably common
configuration choice to use two different brands of disk for a RAID-1 array.
As the correlation between parts of the disks with errors only applied to
disks of the same make and model (and this is expected due to
firmware/manufacturing issues) the people who care about such things on RAID-1
have probably already dealt with the issue.


You do mention the partition alternative, but not as I'd do it for such a
case.  Instead of doing a different sized buffer partition (or using the
mkfs.btrfs option to start at some offset into the device) on each
device, I'd simply do multiple partitions and reorder them on each
device.

If there are multiple partitions on a device then that will probably make
performance suck.  Also does BTRFS even allow special treatment of them or
will it put two copies from a RAID-10 on the same disk?


I suspect the approach is similar to the following:
sd[abcd][1234] each configured as LVM PVs
sda[1234] as an LVM VG
sdb[2345] as an LVM VG
sdc[3456] as an LVM VG
sdd[4567] as an LVM VG
btrfs across all four VGs

^ Um - the above is ignoring DOS-style partition limitations

Tho N-way-mirroring would sure help here too, since if a given
area around the same address is assumed to be weak on each device, I'd
sure like greater than the current 2-way-mirroring, even if if I had a
different filesystem/partition at that spot on each one, since with only
two-way-mirroring if one copy is assumed to be weak, guess what, you're
down to only one reasonably reliable copy now, and that's not a good spot
to be in if that one copy happens to be hit by a cosmic ray or otherwise
fail checksum, without another reliable copy to fix it since that other
copy is in the weak area already.

Another alternative would be using something like mdraid's raid10 far
layout, with btrfs on top of that...

In the copies= option thread Brendan Hide stated that this sort of thing is
planned.



--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH] mkfs.btrfs: allow UUID specification at mkfs time

2014-05-14 Thread Brendan Hide

On 14/05/14 09:31, Wang Shilong wrote:

On 05/14/2014 09:18 AM, Eric Sandeen wrote:

Allow the specification of the filesystem UUID at mkfs time.

(Implemented only for mkfs.btrfs, not btrfs-convert).

Just out of curiosity, this option is used for what kind of use case?
I notice Ext4 also has this option. :-)

Personally I can't think of any average or normal use case. The
simplest case, however, is using predictable/predetermined UUIDs.


Certain things, such as testing or perhaps even large-scale automation, 
are likely simpler to implement with a predictable UUID.
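
For example, something along these lines (assuming the option ends up
exposed as -U/--uuid, and using a made-up UUID):

# a fixed, predetermined UUID (the value is just an example)
mkfs.btrfs -U 12345678-abcd-ef01-2345-6789abcdef01 /dev/sdX

# or generate one up front so other tooling can reference it later
uuid=$(uuidgen)
mkfs.btrfs -U "$uuid" /dev/sdX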


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: -musage=0 means always reporting relocation

2014-05-11 Thread Brendan Hide

On 2014/05/11 11:52 AM, Russell Coker wrote:

On Sun, 11 May 2014, Russell Coker russ...@coker.com.au wrote:

Below is the output of running a balance a few times on a 120G SSD.

Sorry forgot to mention that's kernel 3.14.1 Debian package.


Please send the output of the following command:
btrfs fi df /

This will give more information on your current chunk situation. I
suspect this is a case where a system chunk (which is included when
specifying metadata) is being counted even though it is not actually
being relocated. This is a bug that I believe was already fixed, though
I'm not sure in which version.


The pathological case is where you have a chunk that is 1% full and 
*every* other in-use chunk on the device is 100% full. In that 
situation, a balance will simply move that data into a new chunk (which 
will only ever reach 1% full). Thus, all subsequent balances will 
relocate that same data again to another new chunk.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: How does btrfs fi show show full?

2014-05-07 Thread Brendan Hide

On 2014/05/07 09:59 AM, Marc MERLIN wrote:

[snip]

Did I get this right?
I'm not sure I did, since it seems the bigger the -dusage number, the
more work balance has to do.

If I asked -dusage=85, it would do all chunks that are more than 15%
full?


-dusage=85 balances all chunks that are up to 85% full. The higher the
number, the more work that needs to be done.

So, do I need to change the text above to say more than 45% full ?

More generally, does it not make sense to just use the same percentage
in -dusage than the percentage of total filesytem full?

Thanks,
Marc


Separately, Duncan has made me realise my halfway up algorithm is not 
very good - it was probably just good enough at the time and worked 
well enough that I wasn't prompted to analyse it further.


Doing a simulation with randomly-semi-filled chunks, df at 55%, and 
chunk utilisation at 86%, -dusage=55 balances 30% of the chunks, almost 
perfectly bringing chunk utilisation down to 56%. In my algorithm I 
would have used -dusage=70 which in my simulation would have balanced 
34% of the chunks - but bringing chunk utilisation down to 55% - a bit 
of wasted effort and unnecessary SSD wear.


I think now that I need to experiment with a much lower -dusage value 
and perhaps to repeat the balance with the df value (55 in the example) 
if the chunk usage is still too high. Getting an optimal first value 
algorithmically might prove a challenge - I might just end up picking 
some arbitrary percentage point below the df value.


Pathological use-cases still apply however (for example if all chunks 
except one are exactly 54% full). The up-side is that if the algorithm 
is applied regularly (as in scripted and scheduled) then the situation 
will always be that the majority of chunks are going to be relatively 
full, avoiding the pathological use-case.
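
As a rough, hypothetical sketch of the sort of helper I have in mind
(the two percentages are read off 'btrfs fi df' and 'btrfs fi show' by
hand, and the margin/threshold values are arbitrary assumptions):

#!/bin/bash
# usage: pick-dusage <mountpoint> <data-used-percent> <chunk-allocated-percent>
mnt=$1; data_pct=$2; alloc_pct=$3
margin=5      # start a few points below the data percentage
threshold=10  # only rebalance when allocation is well above actual usage

if (( alloc_pct - data_pct > threshold )); then
    dusage=$(( data_pct > margin ? data_pct - margin : 0 ))
    btrfs balance start -dusage="$dusage" "$mnt"
    # if chunk allocation is still too high, repeat with -dusage="$data_pct"
fi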


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Please review and comment, dealing with btrfs full issues

2014-05-06 Thread Brendan Hide

Hi, Marc. Inline below. :)

On 2014/05/06 02:19 PM, Marc MERLIN wrote:

On Mon, May 05, 2014 at 07:07:29PM +0200, Brendan Hide wrote:

In the case above, because the filesystem is only 55% full, I can
ask balance to rewrite all chunks that are more than 55% full:

legolas:~# btrfs balance start -dusage=50 /mnt/btrfs_pool1

-dusage=50 will balance all chunks that are 50% *or less* used,

Sorry, I actually meant to write 55 there.


not more. The idea is that full chunks are better left alone while
emptyish chunks are bundled together to make new full chunks,
leaving big open areas for new chunks. Your process is good however
- just the explanation that needs the tweak. :)

Mmmh, so if I'm 55% full, should I actually use -dusage=45 or 55?


As usual, it depends on what end-result you want. Paranoid rebalancing - 
always ensuring there are as many free chunks as possible - is totally 
unnecessary. There may be more good reasons to rebalance - but I'm only 
aware of two: a) to avoid ENOSPC due to running out of free chunks; and 
b) to change allocation type.


If you want all chunks either full or empty (except for that last chunk,
which will be somewhere in between), -dusage=55 will get you 99% there.

In your last example, a full rebalance is not necessary. If you want
to clear all unnecessary chunks you can run the balance with
-dusage=80 (636GB/800GB~=79%). That will cause a rebalance only of
the data chunks that are 80% and less used, which would by necessity
get about ~160GB worth chunks back out of data and available for
re-use.

So in my case when I hit that case, I had to use dusage=0 to recover.
Anything above that just didn't work.


I suspect that, when using more than zero, the first chunk it wanted to
balance wasn't empty - and there was nowhere to put its data. When you
then did dusage=0, no destination was needed for the data. That is
actually an interesting workaround for that case.

On Mon, May 05, 2014 at 07:09:22PM +0200, Brendan Hide wrote:

Forgot this part: Also in your last example, you used -dusage=0
and it balanced 91 chunks. That means you had 91 empty or
very-close-to-empty chunks. ;)

Correct. That FS was very mis-balanced.

On Mon, May 05, 2014 at 02:36:09PM -0400, Calvin Walton wrote:

The standard response on the mailing list for this issue is to
temporarily add an additional device to the filesystem (even e.g. a 4GB
USB flash drive is often enough) - this will add space to allocate a few
new chunks, allowing the balance to proceed. You can remove the extra
device after the balance completes.

I just added that tip, thank you.
  
On Tue, May 06, 2014 at 02:41:16PM +1000, Russell Coker wrote:

Recently kernel 3.14 allowed fixing a metadata space error that seemed to be
impossible to solve with 3.13.  So it's possible that some of my other
problems with a lack of metadata space could have been solved with kernel 3.14
too.

Good point. I added that tip too.

Thanks,
Marc



--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Using mount -o bind vs mount -o subvol=vol

2014-05-05 Thread Brendan Hide

On 05/05/14 06:36, Roman Mamedov wrote:

On Mon, 05 May 2014 06:13:30 +0200
Brendan Hide bren...@swiftspirit.co.za wrote:


1) There will be a *very* small performance penalty (negligible, really)

Oh, really, it's slower to mount the device directly? Not that I really
care, but that's unexpected.

Um ... the penalty is if you're mounting indirectly. ;)

I feel that's on about the same scale as giving your files shorter filenames,
so that they open faster. Or have you looked at the actual kernel code with
regard to how it's handled, or maybe even have any benchmarks, other than a
general thought of it's indirect, so it probably must be slower?


My apologies - not everyone here is a native English-speaker.

You are 100% right, though. The scale is very small. By "negligible", I
mean the penalty is at most a few CPU cycles. When compared to the wait
time on a spindle, it really doesn't matter much.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: How does btrfs fi show show full?

2014-05-05 Thread Brendan Hide

On 05/05/14 07:50, Marc MERLIN wrote:

On Mon, May 05, 2014 at 06:11:28AM +0200, Brendan Hide wrote:

The per-device used amount refers to the amount of space that has
been allocated to chunks. That first one probably needs a balance.
Btrfs doesn't behave very well when available diskspace is so low
due to the fact that it cannot allocate any new chunks. An attempt
to allocate a new chunk will result in ENOSPC errors.

The Total bytes used refers to the total actual data that is stored.

Right. So 'Total used' is what I'm really using, whereas 'devid used' is
actually what is being used due to the way btrfs doesn't seem to reclaim
chunks after they're not used anymore, or some similar problem.

In the second FS:

Label: btrfs_pool1  uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
Total devices 1 FS bytes used 442.17GiB
devid1 size 865.01GiB used 751.04GiB path /dev/mapper/cryptroot

The difference is huge between 'Total used' and 'devid used'.

Is btrfs going to fix this on its own, or likely not and I'm stuck doing
a full balance (without filters since I'm balancing data and not
metadata)?

If that helps.
legolas:~# btrfs fi df /mnt/btrfs_pool1
Data, single: total=734.01GiB, used=435.29GiB
System, DUP: total=8.00MiB, used=96.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=8.50GiB, used=6.74GiB
Metadata, single: total=8.00MiB, used=0.00

Thanks,
Marc

What I typically do in snapshot cleanup scripts is to use the usage=
filter, with the percentage calculated as halfway between the actual
data usage and the maximum chunk allocation I'm comfortable with. As an
example, one of my servers' diskspace was at 65% and last night the 
chunk allocation reached the 90% mark. So it automatically ran the 
balance with dusage=77. This would have cleared out about half of the 
unnecessary chunks.
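
For reference, that halfway pick as a one-liner (65 and 90 being the
figures from that run; the mountpoint is a placeholder):

data_pct=65; alloc_pct=90   # actual data usage vs. the allocation mark
btrfs balance start -dusage=$(( (data_pct + alloc_pct) / 2 )) /mnt/btrfs_pool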


Balance causes a lot of load with spinning rust - of course, 
after-hours, nobody really cares. With SSDs it causes wear. This is just 
one method I felt was a sensible way for me to avoid ENOSPC issues while 
also ensuring I'm not rebalancing the entire system unnecessarily.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Please review and comment, dealing with btrfs full issues

2014-05-05 Thread Brendan Hide

On 05/05/14 14:16, Marc MERLIN wrote:

I've just written this new page:
http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html

First, are there problems in it?

Second, are there other FS full issues I should mention in it?

Thanks,
Marc
In the case above, because the filesystem is only 55% full, I can ask 
balance to rewrite all chunks that are more than 55% full:


legolas:~# btrfs balance start -dusage=50 /mnt/btrfs_pool1

-dusage=50 will balance all chunks that are 50% *or less* used, not 
more. The idea is that full chunks are better left alone while emptyish 
chunks are bundled together to make new full chunks, leaving big open 
areas for new chunks. Your process is good however - just the 
explanation that needs the tweak. :)


In your last example, a full rebalance is not necessary. If you want to 
clear all unnecessary chunks you can run the balance with -dusage=80 
(636GB/800GB~=79%). That will cause a rebalance only of the data chunks 
that are 80% and less used, which would by necessity get about ~160GB 
worth chunks back out of data and available for re-use.


The issue I'm not sure of how to get through is if you can't balance 
*because* of ENOSPC errors. I'd probably start scouring the mailing list 
archives if I ever come across that.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Please review and comment, dealing with btrfs full issues

2014-05-05 Thread Brendan Hide

On 05/05/14 19:07, Brendan Hide wrote:

On 05/05/14 14:16, Marc MERLIN wrote:

I've just written this new page:
http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html 



First, are there problems in it?

Second, are there other FS full issues I should mention in it?

Thanks,
Marc
In the case above, because the filesystem is only 55% full, I can ask 
balance to rewrite all chunks that are more than 55% full:


legolas:~# btrfs balance start -dusage=50 /mnt/btrfs_pool1

-dusage=50 will balance all chunks that are 50% *or less* used, not 
more. The idea is that full chunks are better left alone while 
emptyish chunks are bundled together to make new full chunks, leaving 
big open areas for new chunks. Your process is good however - just the 
explanation that needs the tweak. :)


In your last example, a full rebalance is not necessary. If you want 
to clear all unnecessary chunks you can run the balance with 
-dusage=80 (636GB/800GB~=79%). That will cause a rebalance only of the 
data chunks that are 80% and less used, which would by necessity get 
about ~160GB worth chunks back out of data and available for re-use.


The issue I'm not sure of how to get through is if you can't balance 
*because* of ENOSPC errors. I'd probably start scouring the mailing 
list archives if I ever come across that.


Forgot this part: Also in your last example, you used -dusage=0 and it 
balanced 91 chunks. That means you had 91 empty or very-close-to-empty 
chunks. ;)


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Thoughts on RAID nomenclature

2014-05-05 Thread Brendan Hide

On 2014/05/05 11:17 PM, Hugo Mills wrote:

A passing remark I made on this list a day or two ago set me to
thinking. You may all want to hide behind your desks or in a similar
safe place away from the danger zone (say, Vladivostok) at this
point...


I feel like I can brave some mild horrors. Of course, my C skills
aren't up to scratch so it's all just bravado. ;)

If we switch to the NcMsPp notation for replication, that
comfortably describes most of the plausible replication methods, and
I'm happy with that. But, there's a wart in the previous proposition,
which is putting d for 2cd to indicate that there's a DUP where
replicated chunks can go on the same device. This was the jumping-off
point to consider chunk allocation strategies in general.

At the moment, we have two chunk allocation strategies: dup and
spread (for want of a better word; not to be confused with the
ssd_spread mount option, which is a whole different kettle of
borscht). The dup allocation strategy is currently only available for
2c replication, and only on single-device filesystems. When a
filesystem with dup allocation has a second device added to it, it's
automatically upgraded to spread.


I thought this step was manual - but okay! :)

The general operation of the chunk allocator is that it's asked for
locations for n chunks for a block group, and makes a decision about
where those chunks go. In the case of spread, it sorts the devices in
decreasing order of unchunked space, and allocates the n chunks in
that order. For dup, it allocates both chunks on the same device (or,
generalising, may allocate the chunks on the same device if it has
to).

Now, there are other variations we could consider. For example:

  - linear, which allocates on the n smallest-numbered devices with
free space. This goes halfway towards some people's goal of
minimising the file fragments damaged in a device failure on a 1c
FS (again, see (*)). [There's an open question on this one about
what happens when holes open up through, say, a balance.]

  - grouped, which allows the administrator to assign groups to the
devices, and allocates each chunk from a different group. [There's
a variation here -- we could look instead at ensuring that
different _copies_ go in different groups.]

Given these four (spread, dup, linear, grouped), I think it's
fairly obvious that spread is a special case of grouped, where each
device is its own group. Then dup is the opposite of grouped (i.e. you
must have one or the other but not both). Finally, linear is a
modifier that changes the sort order.

All of these options run completely independently of the actual
replication level selected, so we could have 3c:spread,linear
(allocates on the first three devices only, until one fills up and
then it moves to the fourth device), or 2c2s:grouped, with a device
mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
different copies on different device controllers.

Does this all make sense? Are there any other options or features
that we might consider for chunk allocation at this point? Having had
a look at the chunk allocator, I think most if not all of this is
fairly easily implementable, given a sufficiently good method of
describing it all, which is what I'm trying to get to the bottom of in
this discussion.


I think I get most of what you're saying. If it's not too difficult,
perhaps you could update (or duplicate to another URL) your 
/btrfs-usage/ calculator to reflect the idea. It'd definitely make it 
easier for everyone (including myself) to know we're on the same page.


I like the idea that the administrator would have more granular control 
over where data gets allocated first or where copies belong. 
Splitting data across different controllers as you mentioned can help with
both redundancy and performance.


Note: I've always thought of dup as a special form of spread where we 
just write things out twice - but yes, there's no need for it to be 
compatible with any other allocation type.

Hugo.

(*) The missing piece here is to deal with extent allocation in a
similar way, which would offer better odds again on the number of
files damaged in a device-loss situation on a 1c FS. This is in
general a much harder problem, though. The only change we have in this
area at the moment is ssd_spread, which doesn't do very much. It also
has the potential for really killing performance and/or file
fragmentation.




--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Is metadata redundant over more than one drive with raid0 too?

2014-05-04 Thread Brendan Hide

Hi, Marc

Raid0 is not redundant in any way. See inline below.

On 2014/05/04 01:27 AM, Marc MERLIN wrote:

So, I was thinking. In the past, I've done this:
mkfs.btrfs -d raid0 -m raid1 -L btrfs_raid0 /dev/mapper/raid0d*

My rationale at the time was that if I lose a drive, I'll still have
full metadata for the entire filesystem and only missing files.
If I have raid1 with 2 drives, I should end up with 4 copies of each
file's metadata, right?

But now I have 2 questions
1) btrfs has two copies of all metadata on even a single drive, correct?


Only when *specifically* using -m dup (which is the default on a single 
non-SSD device), will there be two copies of the metadata stored on a 
single device. This is not recommended when using multiple devices as it 
means one device failure will likely cause critical loss of metadata. 
When using -m raid1 (as is the case in your first example above, and as
is the default with multiple devices), the two copies of the metadata are
distributed across two different devices, with each device holding just
one copy.

If so, and I have a -d raid0 -m raid0 filesystem, are both copies of the
metadata on the same drive or is btrfs smart enough to spread out
metadata copies so that they're not on the same drive?


This will mean there is only a single copy, albeit striped across the 
drives.


2) does btrfs lay out files on raid0 so that files aren't striped across
more than one drive, so that if I lose a drive, I only lose whole files,
but not little chunks of all my files, making my entire FS toast?


raid0 currently allocates a single chunk on each device and then makes 
use of RAID0-like stripes across these chunks until a new chunk needs 
to be allocated. This is good for performance but not good for 
redundancy. A total failure of a single device will mean any large files
will be lost; only files smaller than the default per-disk stripe width
(I believe this used to be 4K and is now 16K - I could be wrong) that
happen to be stored entirely on the remaining disk will still be
available.


The scenario you mentioned at the beginning - "if I lose a drive, I'll
still have full metadata for the entire filesystem and only missing
files" - is more applicable to using -m raid1 -d single. Single is not
geared towards performance and, though it doesn't guarantee a file is
only on a single disk, the allocation does mean that the majority of all
files smaller than a chunk will be stored on only one disk or the other
- not both.
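
Concretely, that would be something like the following (device names and
mountpoint are placeholders):

# a new filesystem with mirrored metadata but unstriped data
mkfs.btrfs -m raid1 -d single /dev/sdX /dev/sdY

# or converting an existing filesystem in place
btrfs balance start -dconvert=single -mconvert=raid1 /mnt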


Thanks,
Marc


I hope the above is helpful.

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Using mount -o bind vs mount -o subvol=vol

2014-05-04 Thread Brendan Hide

On 2014/05/04 02:47 AM, Marc MERLIN wrote:

Is there any functional difference between

mount -o subvol=usr /dev/sda1 /usr
and
mount /dev/sda1 /mnt/btrfs_pool
mount -o bind /mnt/btrfs_pool/usr /usr

?

Thanks,
Marc

There are two issues with this.
1) There will be a *very* small performance penalty (negligible, really)

2) Old snapshots and other supposedly-hidden subvolumes will be 
accessible under /mnt/btrfs_pool. This is a minor security concern 
(which of course may not concern you, depending on your use-case).


There are a few similar minor security concerns - the 
recently-highlighted issue with old snapshots is the potential that old 
vulnerable binaries within a snapshot are still accessible and/or 
executable.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Copying related snapshots to another server with btrfs send/receive?

2014-05-04 Thread Brendan Hide

On 2014/05/04 05:12 AM, Marc MERLIN wrote:

Another question I just came up with.

If I have historical snapshots like so:
backup
backup.sav1
backup.sav2
backup.sav3

If I want to copy them up to another server, can btrfs send/receive
let me copy all of the to another btrfs pool while keeping the
duplicated block relationship between all of them?
Note that the backup.sav dirs will never change, so I won't need
incremental backups on those, just a one time send.
I believe this is supposed to work, correct?

The only part I'm not clear about is am I supposed to copy them all at
once in the same send command, or one by one?

If they had to be copied together and if I create a new snapshot of
backup: backup.sav4

If I use btrfs send to that same destination, is btrfs send/receive indeed
able to keep the shared block relationship?

Thanks,
Marc

I'm not sure if they can be sent in one go. :-/

Sending one-at-a-time, the shared-data relationship will be kept by 
using the -p (parent) parameter. Send will only send the differences and 
receive will create a new snapshot, adjusting for those differences, 
even when the receive is run on a remote server.


$ btrfs send backup | btrfs receive $path/
$ btrfs send -p backup backup.sav1 | btrfs receive $path/
$ btrfs send -p backup.sav1 backup.sav2 | btrfs receive $path/
$ btrfs send -p backup.sav2 backup.sav3 | btrfs receive $path/
$ btrfs send -p backup.sav3 backup.sav4 | btrfs receive $path/

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: copies= option

2014-05-04 Thread Brendan Hide

On 2014/05/04 05:27 AM, Duncan wrote:

Russell Coker posted on Sun, 04 May 2014 12:16:54 +1000 as excerpted:


Are there any plans for a feature like the ZFS copies= option?

I'd like to be able to set copies= separately for data and metadata.  In
most cases RAID-1 provides adequate data protection but I'd like to have
RAID-1 and copies=2 for metadata so that if one disk dies and another
has some bad sectors during recovery I'm unlikely to lose metadata.

Hugo's the guy with the better info on this one, but until he answers...

The zfs license issues mean it's not an option for me and I'm thus not
familiar with its options in any detail, but if I understand the question
correctly, yes.

And of course since btrfs treats data and metadata separately, it's
extremely unlikely that any sort of copies= option wouldn't be separately
configurable for each.

There was a discussion of a very nice multi-way-configuration schema that
I deliberately stayed out of as both a bit above my head and far enough
in the future that I didn't want to get my hopes up too high about it
yet.  I already want N-way-mirroring so bad I can taste it, and this was
that and way more... if/when it ever actually gets coded and committed to
the mainline kernel btrfs.  As I said, Hugo should have more on it, as he
was active in that discussion as it seemed to line up perfectly with his
area of interest.

The simple answer is yes, this is planned. As Duncan implied, however, 
it is not on the immediate roadmap. Internally we appear to be referring 
to this feature as N-way redundancy or N-way mirroring.


My understanding is that the biggest hurdle before the primary devs will 
look into N-way redundancy is to finish the Raid5/6 implementation to 
include self-healing/scrubbing support - a critical issue before it can 
be adopted further.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Is metadata redundant over more than one drive with raid0 too?

2014-05-04 Thread Brendan Hide

On 2014/05/04 09:24 AM, Marc MERLIN wrote:

On Sun, May 04, 2014 at 08:57:19AM +0200, Brendan Hide wrote:

Hi, Marc

Raid0 is not redundant in any way. See inline below.
  
Thanks for clearing things up.



But now I have 2 questions
1) btrfs has two copies of all metadata on even a single drive, correct?

Only when *specifically* using -m dup (which is the default on a
single non-SSD device), will there be two copies of the metadata
stored on a single device. This is not recommended when using

Ah, so -m dup is default like I thought, but not on SSD?
Ooops, that means that my laptop does not have redundant metadata on its
SSD like I thought. Thanks for the heads up.
Ah, I see the man page now: "This is because SSDs can remap blocks
internally so duplicate blocks could end up in the same erase block
which negates the benefits of doing metadata duplication."


You can force dup but, per the man page, whether or not that is 
beneficial is questionable.



multiple devices as it means one device failure will likely cause
critical loss of metadata.

That's the part where I'm not clear:

What's the difference between -m dup and -m raid1
Don't they both say 2 copies of the metadata?
Is -m dup only valid for a single drive, while -m raid1 for 2+ drives?


The issue is that -m dup will always put both copies on a single device. 
If you lose that device, you've lost both (all) copies of that metadata. 
With -m raid1 the second copy is on a *different* device.


I believe dup *can* be used with multiple devices, but mkfs.btrfs might
not let you do it from the get-go. The way most have gotten there is by
having dup on a single device and then not converting the metadata to
raid1 after adding another device.
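
If you do end up in that state, the fix afterwards is a single balance
with the convert filter - a hedged example (the mountpoint is a
placeholder, and depending on the tool version the system chunks may
need an explicit -sconvert=raid1 -f as well):

btrfs balance start -mconvert=raid1 /mnt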



If so, and I have a -d raid0 -m raid0 filesystem, are both copies of the
metadata on the same drive or is btrfs smart enough to spread out
metadata copies so that they're not on the same drive?

This will mean there is only a single copy, albeit striped across
the drives.

Ok, so -m raid0 only means a single copy of metadata, thanks for
explaining.


good for redundancy. A total failure of a single device will mean
any large files will be lost and only files smaller than the default
per-disk stripe width (I believe this used to be 4K and is now 16K -
I could be wrong) stored only on the remaining disk will be
available.
  
Gotcha, thanks for confirming, so -m raid1 -d raid0 really only protects

against metadata corruption or a single block loss, but otherwise if you
lost a drive in a 2 drive raid0, you'll have lost more than just half
your files.


The scenario you mentioned at the beginning - "if I lose a drive,
I'll still have full metadata for the entire filesystem and only
missing files" - is more applicable to using -m raid1 -d single.
Single is not geared towards performance and, though it doesn't
guarantee a file is only on a single disk, the allocation does mean
that the majority of all files smaller than a chunk will be stored
on only one disk or the other - not both.

Ok, so in other words:
-d raid0: if you lose 1 drive out of 2, you may end up with small files
and the rest will be lost

-d single: you're more likely to have files be on one drive or the
other, although there is no guarantee there either.

Correct?


Correct


Thanks,
Marc



--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Copying related snapshots to another server with btrfs send/receive?

2014-05-04 Thread Brendan Hide

On 2014/05/04 09:28 AM, Marc MERLIN wrote:

On Sun, May 04, 2014 at 09:16:02AM +0200, Brendan Hide wrote:

Sending one-at-a-time, the shared-data relationship will be kept by
using the -p (parent) parameter. Send will only send the differences
and receive will create a new snapshot, adjusting for those
differences, even when the receive is run on a remote server.

$ btrfs send backup | btrfs receive $path/
$ btrfs send -p backup backup.sav1 | btrfs receive $path/
$ btrfs send -p backup.sav1 backup.sav2 | btrfs receive $path/
$ btrfs send -p backup.sav2 backup.sav3 | btrfs receive $path/
$ btrfs send -p backup.sav3 backup.sav4 | btrfs receive $path/

So this is exactly the same as what I do for incremental backups with
btrfs send, but -p only works if the snapshot is read only, does it not?
I do use that for my incremental syncs and don't mind read only
snapshots there, but if I have read/write snapshots that are there for
other reasons than btrfs send incrementals, can I still send them that
way with -p?
(I thought that wouldn't work)

Thanks,
Marc

Yes, -p (parent) and -c (clone source) are the only ways I'm aware of to
push subvolumes across while ensuring the data-sharing relationship
remains intact. This will end up being much the same as doing
incremental backups:

From the man page section on -c:
You must not specify clone sources unless you guarantee that these 
snapshots are exactly in the same state on both sides, the sender and 
the receiver. It is allowed to omit the '-p parent' option when '-c 
clone-src' options are given, in which case 'btrfs send' will 
determine a suitable parent among the clone sources itself.


-p does require that the sources be read-only. I suspect -c does as
well. This means it won't be quite so simple, since you want your
sources to be read-write. Probably the only way then would be to make
read-only snapshots whenever you want to sync these over, while also
ensuring that you keep at least one read-only snapshot intact - again,
much like incremental backups.
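
A minimal sketch of that pattern (paths and the remote host are
placeholders):

# take this round's read-only sync point
btrfs subvolume snapshot -r /home /home/.sync-new

# send only the difference relative to the previous read-only sync point
btrfs send -p /home/.sync-prev /home/.sync-new | ssh backuphost btrfs receive /backups/home/

# keep .sync-new around: it becomes the parent for the next run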


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: How does btrfs fi show show full?

2014-05-04 Thread Brendan Hide

On 2014/05/05 02:54 AM, Marc MERLIN wrote:

More slides, more questions, sorry :)
(thanks for the other answers, I'm still going through them)

If I have:
gandalfthegreat:~# btrfs fi show
Label: 'btrfs_pool1'  uuid: 873d526c-e911-4234-af1b-239889cd143d
Total devices 1 FS bytes used 214.44GB
devid1 size 231.02GB used 231.02GB path /dev/dm-0

I'm a bit confused.

It tells me
1) FS uses 214GB out of 231GB
2) Device uses 231GB out of 231GB

I understand how the device can use less than the FS if you have
multiple devices that share a filesystem.
But I'm not sure how a filesystem can use less than what's being used on
a single device.

Similarly, my current laptop shows:
legolas:~# btrfs fi show
Label: btrfs_pool1  uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
Total devices 1 FS bytes used 442.17GiB
devid1 size 865.01GiB used 751.04GiB path /dev/mapper/cryptroot

So, am I 100GB from being full, or am I really only using 442GB out of 865GB?

If so, what does the device used value really mean if it can be that
much higher than the filesystem used value?

Thanks,
Marc
The per-device used amount refers to the amount of space that has been 
allocated to chunks. That first one probably needs a balance. Btrfs 
doesn't behave very well when available diskspace is so low due to the 
fact that it cannot allocate any new chunks. An attempt to allocate a 
new chunk will result in ENOSPC errors.


The Total bytes used refers to the total actual data that is stored.

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Using mount -o bind vs mount -o subvol=vol

2014-05-04 Thread Brendan Hide

On 2014/05/05 02:56 AM, Marc MERLIN wrote:

On Sun, May 04, 2014 at 09:07:55AM +0200, Brendan Hide wrote:

On 2014/05/04 02:47 AM, Marc MERLIN wrote:

Is there any functional difference between

mount -o subvol=usr /dev/sda1 /usr
and
mount /dev/sda1 /mnt/btrfs_pool
mount -o bind /mnt/btrfs_pool/usr /usr

?

Thanks,
Marc

There are two issues with this.
1) There will be a *very* small performance penalty (negligible, really)

Oh, really, it's slower to mount the device directly? Not that I really
care, but that's unexpected.


Um ... the penalty is if you're mounting indirectly. ;)
  

2) Old snapshots and other supposedly-hidden subvolumes will be
accessible under /mnt/btrfs_pool. This is a minor security concern
(which of course may not concern you, depending on your use-case).
There are a few similar minor security concerns - the
recently-highlighted issue with old snapshots is the potential that
old vulnerable binaries within a snapshot are still accessible
and/or executable.

That's a fair point. I can of course make that mountpoint 0700, but it's
a valid concern in some cases (not for me though).

So thanks for confirming my understanding, it sounds like both are valid
and if you're already mounting the main pool like I am, that's the
easiest way.

Thanks,
Marc

All good. :)

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Help with space

2014-05-02 Thread Brendan Hide

On 02/05/14 10:23, Duncan wrote:

Russell Coker posted on Fri, 02 May 2014 11:48:07 +1000 as excerpted:


On Thu, 1 May 2014, Duncan 1i5t5.dun...@cox.net wrote:
[snip]
http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf

Whether a true RAID-1 means just 2 copies or N copies is a matter of
opinion. Papers such as the above seem to clearly imply that RAID-1 is
strictly 2 copies of data.

Thanks for that link. =:^)

My position would be that reflects the original, but not the modern,
definition.  The paper seems to describe as raid1 what would later come
to be called raid1+0, which quickly morphed into raid10, leaving the
raid1 description only covering pure mirror-raid.
Personally I'm flexible on using the terminology in day-to-day 
operations and discussion due to the fact that the end-result is close 
enough. But ...


The definition of RAID 1 is still only a mirror of two devices. As far 
as I'm aware, Linux's mdraid is the only raid system in the world that 
allows N-way mirroring while still referring to it as RAID1. Due to 
the way it handles data in chunks, and also due to its rampant layering 
violations, *technically* btrfs's RAID-like features are not RAID.


To differentiate from RAID, we're already using lowercase raid and, 
in the long term, some of us are also looking to do away with raid{x} 
terms altogether with what Hugo and I last termed as csp notation. 
Changing the terminology is important - but it is particularly non-urgent.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH 1/2] btrfs: protect snapshots from deleting during send

2014-04-26 Thread Brendan Hide

On 2014/04/16 05:22 PM, David Sterba wrote:

On Wed, Apr 16, 2014 at 04:59:09PM +0200, Brendan Hide wrote:

On 2014/04/16 03:40 PM, Chris Mason wrote:

So in my example with the automated tool, the tool really shouldn't be
deleting a snapshot where send is in progress.  The tool should be told
that snapshot is busy and try to delete it again later.

It makes more sense now, 'll queue this up for 3.16 and we can try it out
in -next.

-chris

So ... does this mean the plan is to a) have userland tool give an error; or
b) a deletion would be scheduled in the background for as soon as the send
has completed?

b) is current state, a) is the plan

with the patch, 'btrfs subvol delete' would return EPERM/EBUSY

My apologies, I should have followed up on this a while ago already. :-/

Would having something closer to b) be more desirable if the resource 
simply disappears but continues in the background? This would be as in a 
lazy umount, where presently-open files are left open and writable but 
the directory tree has disappeared.


I submit that, with a), the actual status is more obvious/concrete 
whereas with b+lazy), current issues would flow smoothly with no errors 
and no foreseeable future issues.


I reserve the right to be wrong, of course. ;)

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Cycle of send/receive for backup/restore is incomplete...

2014-04-23 Thread Brendan Hide

Replied inline:

On 2014/04/24 12:30 AM, Robert White wrote:
So the backup/restore system described using snapshots is incomplete 
because the final restore is a copy operation. As such, the act of 
restoring from the backup will require restarting the entire backup 
cycle because the copy operation will scramble the metadata 
consanguinity.


The real choice is to restore by sending the snapshot back via send 
and receive so that all the UIDs and metadata continue to match up.


But there's no way to promote the final snapshot to a non-snapshot 
subvolume identical to the one made by the original btrfs subvolume 
create operation.


btrfs doesn't differentiate snapshots and subvolumes. They're the same 
first-class citizen. A snapshot is a subvolume that just happens to have 
some data (automagically/naturally) deduplicated with another subvolume.


Consider a file system with __System as the default mount (e.g. btrfs 
subvolume create /__System). You make a snapshot (btrfs sub snap -r 
/__System /__System_BACKUP). Then you send the backup to another file 
system with send receive. Nothing new here.


The thing is, if you want to restore from that backup, you'd 
send/receive /__System_BACKUP to the new/restore drive. But that 
snapshot is _forced_ to be read only. So then your only choice is to 
make a writable snapshot called /__System. At this point you have a 
tiny problem, the three drives aren't really the same.


The __System and __System_BACKUP on the final drive are subvolumes of 
/, while on the original system / and /__System were full subvolumes.


There's no such thing as a full subvolume. Again, they're all 
first-class citizens. The real root of a btrfs is always treated as a 
subvolume, as are the subvolumes inside it too. Just because other 
subvolumes are contained therein it doesn't mean they're diminished 
somehow. You cannot have multiple subvolumes *without* having them be a 
subvolume of the real root subvolume.


It's dumb, it's a tiny difference, but it's annoying. There needs to 
be a way to promote /__System to a non-snapshot status.


If you look at the output of btrfs subvolume list -s / on the 
various drives it's not possible to end up with the exact same system 
as the original.


From a user application perspective, the system *is* identical to the 
original. That's the important part.


If you want the disk to be identical bit for bit then you want a 
different backup system entirely, one that backs up the hard disk, not 
the files/content.


On the other hand if you just want to have all your snapshots restored 
as well, that's not too difficult. It's pointless from most perspectives
- but not difficult.


There needs to be either an option to btrfs subvolume create that 
takes a snapshot as an argument to base the new device on, or an 
option to receive that will make a read-write non-snapshot subvolume.


This feature already exists. This is a very important aspect of how 
snapshots work with send / receive and why it makes things very 
efficient. They work just as well for a restore as they do for a backup. 
The flag you are looking for is -p for parent, which you should 
already be using for the backups in the first place:


From backup host:
$ btrfs send -p /backup/path/yesterday /backup/path/last_backup | 
netcat or whatever you choose


From restored host:
$ netcat or whatever you choose | btrfs receive /tmp/btrfs_root/

Then you make the non-read-only snapshot of the restored subvolume.
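
i.e. something like the following, sticking with the __System naming
from the original example (adjust the path to whatever snapshot was
actually received):

# the received snapshot is read-only; "promote" it with a writable snapshot
btrfs subvolume snapshot /tmp/btrfs_root/__System_BACKUP /tmp/btrfs_root/__System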


[snip]




--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH 1/2] btrfs: protect snapshots from deleting during send

2014-04-16 Thread Brendan Hide

On 2014/04/16 03:40 PM, Chris Mason wrote:
So in my example with the automated tool, the tool really shouldn't be 
deleting a snapshot where send is in progress.  The tool should be 
told that snapshot is busy and try to delete it again later.


It makes more sense now, 'll queue this up for 3.16 and we can try it 
out in -next.


-chris
So ... does this mean the plan is to a) have userland tool give an 
error; or b) a deletion would be scheduled in the background for as 
soon as the send has completed?


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Copying a disk containing a btrfs filesystem

2014-04-11 Thread Brendan Hide

Hi, Michael

Btrfs send/receive can transfer incremental snapshots as well - you're 
looking for the -p or parent parameter. On the other hand, it might 
not be the right tool for the job.


If you're 100% happy with your old disk's *content*/layout/etc (just not 
happy with the disk's reliability), try an overnight/over-weekend 
ddrescue instead:

http://www.forensicswiki.org/wiki/Ddrescue

What I've done in the past is scripted ddrescue to recover as much data 
as possible. Its like using dd between two disks except that it can keep 
a log of bad sectors that can be retried later. The log also helps by 
ensuring that if you cancel the operation, you can start it again and it 
will continue where it left off.


Additionally, you can have it skip big sections of the disk when it 
comes across bad sectors - and it can trim these sections on 
subsequent runs.


Btrfs send/receive has the advantage that you can run it while your 
system is still active. DDrescue has the advantage that it is very good 
at recovering 99% of your data where a disk has lots of bad sectors.


For btrfs, send/receive the main subvolumes then, afterward, 
send/receive the snapshots using the parent parameter, -p. There 
*is* the possibility that this needs to be reversed - as in, the backup 
should be treated as the parent instead of the other way around:


 btrfs send /home | btrfs receive /mnt/new-disk/home
 btrfs send -p /home /backups/home-2014-04-08 | btrfs receive 
/mnt/new-disk/backups/.




Below is from the last scriptlet I made when I last did the ddrescue 
route (in that case I was recovering a failing NTFS drive). It was 
particularly bad and took a whole weekend to recover. The new disk 
worked flawlessly however. :)


How I've used ddrescue in the past is to connect the failing and new 
disk to a server. Alternatively, using USB, you could boot from a rescue 
CD/flash drive and do the rescue there.


a) Identify the disks in /dev/disk/by-whatever-you-choose and put
those values into the bash script below. Ensure that it refers to the
disk as a whole (not a partition, for example). This ensures that
re-ordering of the drives after a reboot won't affect the process.
b) Set up a log file location on a separate filesystem - a flash drive
is ideal unless you've gone the server route, where I normally just
put the log into a path on the server, like so:
/root/brendan-laptop.recovery.log


#!/bin/bash
src_disk=/dev/disk/by-id/ata-ST3250410AS_6RYF5NP7
dst_disk=/dev/disk/by-id/ata-ST3500418AS_9VM2ZZQS
#log=/path/to/log
log=~/brendan-laptop.recovery.log

# Sector size (default is 512 - newer disks should probably be 4096)
sector_size=4096

# Force writing to a block device - disable if you're backing up to an image file
#force=
force=-f

# We want to skip bigger chunks to get as much data as possible before the
# source disk really dies. For the same reason, we also want to start by
# (attempting to) maintain a high read rate.

# Minimum read rate in bytes/s (10485760 = 10MiB/s) before skipping
min_readrate=10485760

# ddrescue's default skip size is 64k; start with large skips and work down
for skip_size in 65536 16384; do
 # Going forward
 ddrescue -r 1 -a $min_readrate -b $sector_size -d $force -K $skip_size \
  $src_disk $dst_disk $log

 # Going in reverse
 ddrescue -r 1 -a $min_readrate -b $sector_size -d $force -K $skip_size \
  -R $src_disk $dst_disk $log
done

# This time re-trying all failed/skipped sections (-A marks them as untried)
for skip_size in 4096; do
 # Going forward
 ddrescue -r 1 -a $min_readrate -b $sector_size -d $force -K $skip_size \
  -A $src_disk $dst_disk $log

 # Going in reverse
 ddrescue -r 1 -a $min_readrate -b $sector_size -d $force -K $skip_size \
  -R -A $src_disk $dst_disk $log
done

# Final passes: no minimum read rate, progressively smaller skip sizes
for skip_size in 1024 256 64; do
 # Going forward
 ddrescue -r 1 -b $sector_size -d $force -K $skip_size \
  $src_disk $dst_disk $log

 # Going in reverse
 ddrescue -r 1 -b $sector_size -d $force -K $skip_size \
  -R $src_disk $dst_disk $log
done

echo "Done. Run a chkdsk/fsck/whatever might be appropriate for the new disk's filesystem(s)"



On 10/04/14 15:21, Michael Schuerig wrote:

SMART indicates that my notebook disk may soon be failing (an
unreadable/uncorrectable sector), therefore I intend to exchange it. The
disk contains a single btrfs filesystem with several nested(!)
subvolumes, each with several read-only snapshots in a .snapshots
subdirectory.

As far as I can tell, btrfs currently does not offer a sensible way to
duplicate the entire contents of the old disk onto a new one. I can use
cp, rsync, or send/receive to copy the main subvolumes. But unless I'm
missing something obvious, the snapshots are effectively lost. btrfs
send optionally takes multiple clone sources, but I've never seen an
example of its usage.

If that's what experimental means, I'm willing to accept it. However,
I'd like to emphasize that there's still something missing. Of course,
most of all I'd like to be proved wrong.

Michael




--
__
Brendan Hide
http

Re: [PATCH] lib: add size unit t/p/e to memparse

2014-03-31 Thread Brendan Hide

On 31/03/14 12:03, Gui Hecheng wrote:

- * potentially suffixed with %K (for kilobytes, or 1024 bytes),
- * %M (for megabytes, or 1048576 bytes), or %G (for gigabytes, or
- * 1073741824).  If the number is suffixed with K, M, or G, then
+ * potentially suffixed with
+ * %K (for kilobytes, or 1024 bytes),
+ * %M (for megabytes, or 1048576 bytes),
+ * %G (for gigabytes, or 1073741824),
+ * %T (for terabytes, or 1099511627776),
+ * %P (for petabytes, or 1125899906842624 bytes),
+ * %E (for exabytes, or 1152921504606846976 bytes).


My apologies, I should have noticed this in your earlier mail. This 
could be updated to specifically refer to the binary prefixes rather 
than the old SI-conflicting names:

kibibyte, mebibyte, gibibyte, tebibyte, pebibyte, and exbibyte

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH] btrfs: add btrfs resize unit t/p/e support

2014-03-27 Thread Brendan Hide

On 2014/03/27 04:51 AM, Gui Hecheng wrote:

[snip]

We add t/p/e support by replacing lib/cmdline.c:memparse
with btrfs_memparse. The btrfs_memparse copies memparse's code
and add unit t/p/e parsing.

Is there a conflict preventing adding this to memparse directly?

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Asymmetric RAID0

2014-03-25 Thread Brendan Hide

On 25/03/14 07:15, Slava Barinov wrote:

Hello,

  I've been using a single drive btrfs for some time and when free space
  became too low I've added an additional drive and rebalanced FS with
  RAID0 data and RAID1 System and Metadata storage.

  Now I have the following configuration:

# btrfs fi show /btr
Label: none  uuid: f9d78880-10a7-439b-8ebd-14d815edbc19
 Total devices 2 FS bytes used 415.45GiB
 devid1 size 931.51GiB used 222.03GiB path /dev/sdc
 devid2 size 431.51GiB used 222.03GiB path /dev/sdb

# btrfs fi df /btr
Data, RAID0: total=424.00GiB, used=406.81GiB
System, RAID1: total=32.00MiB, used=40.00KiB
Metadata, RAID1: total=10.00GiB, used=8.64GiB

# df -h
Filesystem   Size  Used Avail Use% Mounted on
/dev/sdb 1.4T  424G  437G  50% /btr

  I suppose I should trust to btrfs fi df, not df utility.

  So the main question is if such asymmetric RAID0 configuration
  possible at all and why does btrfs ignore ~500 GB of free space on
  /dev/sdc drive?

  Also it's interesting what will happen when I add 20 GB more data to
  my FS. Should I be prepared to usual btrfs low-space problems?

Best regards,
Slava Barinov.
The raid0 will always distribute data to each disk relatively equally. 
There are exceptions of course. The way to have it better utilise the 
diskspace is to use either single (which won't get the same 
performance as raid0) or to add a third disk.


In any raided configuration, the largest disk won't be fully utilised 
unless the other disks add up to be equal to or more than that largest disk.
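
If using that extra ~500GB matters more than raid0's striping, a hedged
illustration of those two options (the extra device name is a
placeholder):

# either convert data chunks to 'single' so the unpaired space on sdc is usable:
btrfs balance start -dconvert=single /btr

# or add a third device so raid0 has somewhere to pair the remaining space:
btrfs device add /dev/sdX /btr
btrfs balance start /btr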


Play around with Hugo's disk usage calculator to get a better idea of 
what the different configurations will do: http://carfax.org.uk/btrfs-usage/


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Btrfs and raid5 status with kernel 3.14, documentation, and howto

2014-03-25 Thread Brendan Hide

On 25/03/14 03:29, Marc MERLIN wrote:

On Tue, Mar 25, 2014 at 01:11:43AM +, Martin wrote:

There's a big thread a short while ago about using parity across
n-devices where the parity is spread such that you can have 1, 2, and up
to 6 redundant devices. Well beyond just raid5 and raid6:

http://lwn.net/Articles/579034/
  
Aah, ok. I didn't understand you meant that. I know nothing about that, but

to be honest, raid6 feels like it's enough for me :)


There are a few of us who are very much looking forward to these 
special/flexible RAID types - for example RAID15 (very good performance, 
very high redundancy, less than 50% diskspace efficiency). The csp 
notation will probably make it easier to develop the flexible raid types 
and is very much required in order to better manage these more flexible 
raid types.


A typical RAID15 with 12 disks would, in csp notation, be written as:
2c5s1p

And some would like to be able to use the exact same redundancy scheme 
even with extra disks:
2c5s1p on 16 disks (note, the example is not 2c7s1p, though that would 
also be a valid scheme with 16 disks being the minimum number of disks 
required)


The last thread on this (I think) can be viewed here, 
http://www.spinics.net/lists/linux-btrfs/msg23137.html where Hugo also 
explains and lists the notation for the existing schemes.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Send/Receive howto and script for others to use (was Re: Is anyone using btrfs send/receive)

2014-03-23 Thread Brendan Hide

On 2014/03/22 11:11 PM, Marc MERLIN wrote:

Please consider adding a blank line between quotes, it makes them just a bit
more readable :)


Np.


On Sat, Mar 22, 2014 at 11:02:24PM +0200, Brendan Hide wrote:

- it doesn't create writeable snapshots on the destination in case you want
to use the copy as a live filesystem

One of the issues with doing writeable snapshots by default is that the
backup is (ever so slightly) less safe from fat-finger syndrome. If I
want a writeable snapshot, I'll make it from the read-only snapshot,
thereby reducing the chances of accidentally tainting or deleting data
in the backup. I actually *did* accidentally delete my entire filesystem
(hence the paranoid umounts). But, of course, my script *first* created
read-only snapshots from which recovery took only a few minutes. ;)

The writeable snapshot I create is on top of the read only one used by btrfs
receive. So, I can play with it, but it won't upset/break anything for
the backup.

The historical snapshots I keep give me cheap backups to go back to do get a
file I may have deleted 3 days ago and want back now even though my btrfs
send/receive runs hourly.


Ah. In that case my comment is moot. I could add support for something 
like this but I'm unlikely to use it.



[snip]

- Your comments say shlock isn't safe and that's documented. I don't see
that in the man page
http://manpages.ubuntu.com/manpages/trusty/man1/shlock.1.html

That man page looks newer than the one I last looked at - specifically
the part saying improved by Berend Reitsma to solve a race condition.
The previous documentation on shlock indicated it was safe for hourly
crons - but not in the case where a cron might be executed twice
simultaneously. Shlock was recommended by a colleague until I realised
this potential issue, thus my template doesn't use it. I should update
the comment with some updated information.

It's not super important, it was more my curiosity.
If a simple lock program in C isn't atomic, what's the point of it?
I never looked at the source code, but maybe I should...


Likely the INN devs needed something outside of a shell environment. 
Based on the man page, shlock should be atomic now.



I'd love to have details on this if I shouldn't be using it
- Is "set -o noclobber; echo $$ > $lockfile" really atomic and safer than
shlock? If so, great, although I would then wonder why shlock even exists
:)

The part that brings about an atomic lock is noclobber, which sets it
so that we are not allowed to clobber/overwrite an existing file.
Thus, if the file exists, the command fails. If it successfully creates
the new file, the command returns true.
  
I understand how it's supposed to work, I just wondered if it was really

atomic as it should be since there would be no reason for shlock to even
exist with that line of code you wrote.


When I originally came across the feature I wasn't sure it would work 
and did extensive testing: For example, spawn 30 000 processes, each of 
which tried to take the lock. After the machine became responsive again 
;) only 1 lock ever turned out to have succeeded.


Since then its been in production use across various scripts on hundreds 
of servers. My guess (see above) is that the INN devs couldn't or didn't 
want to use it.


The original page where I learned about noclobber: 
http://www.davidpashley.com/articles/writing-robust-shell-scripts/


Thanks for the info Marc 


No problem - and thanks. :)

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Send/Receive howto and script for others to use (was Re: Is anyone using btrfs send/receive)

2014-03-22 Thread Brendan Hide
 $src_newsnap | $ssh btrfs receive $dest_pool/
fi

# We make a read-write snapshot in case you want to use it for a chroot
# and some testing with a writeable filesystem or want to boot from a
# last good known snapshot.
btrfs subvolume snapshot $src_newsnap $src_newsnaprw
$ssh btrfs subvolume snapshot $dest_pool/$src_newsnap 
$dest_pool/$src_newsnaprw

# Keep track of the last snapshot to send a diff against.
ln -snf $src_newsnap ${vol}_last
# The rw version can be used for mounting with subvol=vol_last_rw
ln -snf $src_newsnaprw ${vol}_last_rw
$ssh ln -snf $src_newsnaprw $dest_pool/${vol}_last_rw

# How many snapshots to keep on the source btrfs pool (both read
# only and read-write).
ls -rd ${vol}_ro* | tail -n +$(( $keep + 1 ))| while read snap
do
 btrfs subvolume delete $snap
done
ls -rd ${vol}_rw* | tail -n +$(( $keep + 1 ))| while read snap
do
 btrfs subvolume delete $snap
done

# Same thing for destination (assume the same number of snapshots to keep,
# you can change this if you really want).
$ssh ls -rd $dest_pool/${vol}_ro* | tail -n +$(( $keep + 1 ))| while read snap
do
 $ssh btrfs subvolume delete $snap
done
$ssh ls -rd $dest_pool/${vol}_rw* | tail -n +$(( $keep + 1 ))| while read snap
do
 $ssh btrfs subvolume delete $snap
done

rm $lock



Hi, Marc

Feel free to use ideas from my own script. Some aspects in my script are 
more mature and others are frankly pathetic. ;)


There are also quite a lot of TODOs throughout my script that aren't 
likely to get the urgent attention they deserve. It has been slowly 
evolving over the last two weeks.


http://swiftspirit.co.za/scripts/btrfs-snd-rcv-backup

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Send/Receive howto and script for others to use (was Re: Is anyone using btrfs send/receive)

2014-03-22 Thread Brendan Hide

On 2014/03/22 09:44 PM, Brendan Hide wrote:

On 2014/03/21 07:29 PM, Marc MERLIN wrote:




Hi, Marc

Feel free to use ideas from my own script. Some aspects in my script 
are more mature and others are frankly pathetic. ;)


There are also quite a lot of TODOs throughout my script that aren't 
likely to get the urgent attention they deserve. It has been slowly 
evolving over the last two weeks.


http://swiftspirit.co.za/scripts/btrfs-snd-rcv-backup


I forgot to include some notes:

The script depends on a config file at /etc/btrfs-backup/paths.conf 
which is supposed to contain the paths as well as some parameters. At 
present the file consists solely of the following paths as these are all 
separate subvolumes in my test system:

/
/home
/usr
/var


The snapshot names on source and backup are formatted as below. This 
way, daylight savings doesn't need any special treatment:

__2014-03-17-23h00m01s+0200
__2014-03-18-23h00m01s+0200
__2014-03-19-23h00m01s+0200
__2014-03-20-23h00m02s+0200
__2014-03-21-23h00m01s+0200
_home_2014-03-17-23h00m01s+0200
_home_2014-03-18-23h00m01s+0200
_home_2014-03-19-23h00m01s+0200
_home_2014-03-20-23h00m02s+0200
_home_2014-03-21-23h00m01s+0200
_usr_2014-03-17-23h00m01s+0200
_usr_2014-03-18-23h00m01s+0200
_usr_2014-03-19-23h00m01s+0200
_usr_2014-03-20-23h00m02s+0200
_usr_2014-03-21-23h00m01s+0200
_var_2014-03-17-23h00m01s+0200
_var_2014-03-18-23h00m01s+0200
_var_2014-03-19-23h00m01s+0200
_var_2014-03-20-23h00m02s+0200
_var_2014-03-21-23h00m01s+0200
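
(For reference, timestamps of that shape can be generated with something 
like the following - shown here only as an illustration, the script may 
build them differently:)

  date "+%Y-%m-%d-%Hh%Mm%Ss%z"
  # -> 2014-03-17-23h00m01s+0200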



--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Send/Receive howto and script for others to use (was Re: Is anyone using btrfs send/receive)

2014-03-22 Thread Brendan Hide

On 2014/03/22 10:00 PM, Marc MERLIN wrote:

On Sat, Mar 22, 2014 at 09:44:05PM +0200, Brendan Hide wrote:

Hi, Marc

Feel free to use ideas from my own script. Some aspects in my script are
more mature and others are frankly pathetic. ;)

There are also quite a lot of TODOs throughout my script that aren't
likely to get the urgent attention they deserve. It has been slowly
evolving over the last two weeks.

http://swiftspirit.co.za/scripts/btrfs-snd-rcv-backup

I figured I likely wasn't the only one working on a script like this :)

 From a quick read, it looks even more complex than mine :) but
Well ... I did say some things are pathetic on my side. ;) I also use a 
template (it's about 3 years old now) when I make a new script, hence the 
options such as being able to ignore the mutex checks and also having a 
random delay at start. These obviously add some unnecessary complexity.

- it doesn't do ssh to a destination for a remote backup
There should be a TODO for this on my side. Presently in testing I'm 
only using it for device-local backup to a separate disk and not to a 
proper remote backup.

- it doesn't seem to keep a list of configurable snapshots not necessary for
send/restore but useful for getting historical data
I'm not sure what this is useful for. :-/ If related, I plan on creating 
a separate script to move snapshots around into _var_daily_$date, 
_var_weekly_$date, etc.

- it doesn't seem to use a symlink to keep track of the last complete
snapshot on the source and destination, and does more work to compensate
when recovering from an incomplete backup/restore.

Yes, a symlink would make this smoother.

- it doesn't create writeable snapshots on the destination in case you want
to use the copy as a live filesystem
One of the issues with doing writeable snapshots by default is that the 
backup is (ever so slightly) less safe from fat-finger syndrome. If I 
want a writeable snapshot, I'll make it from the read-only snapshot, 
thereby reducing the chances of accidentally tainting or deleting data 
in the backup. I actually *did* accidentally delete my entire filesystem 
(hence the paranoid umounts). But, of course, my script *first* created 
read-only snapshots from which recovery took only a few minutes. ;)

Things I noticed:
- I don't use ionice, maybe I should. Did you find that it actually made a
difference with send/receive?
This is just a habit I've developed over time in all my scripts. I 
figure that if I'm using the machine at the time and the snapshot has a 
large churn, I'd prefer the ionice. That said, the main test system is a 
desktop which is likely to have much less churn than a server. In the 
last two weeks the longest daily incremental backup took about 5 minutes 
to complete, while it typically takes about 30 seconds only.

- Your comments say shlock isn't safe and that's documented. I don't see
that in the man page
http://manpages.ubuntu.com/manpages/trusty/man1/shlock.1.html
That man page looks newer than the one I last looked at - specifically 
the part saying improved by Berend Reitsma to solve a race condition. 
The previous documentation on shlock indicated it was safe for hourly 
crons - but not in the case where a cron might be executed twice 
simultaneously. Shlock was recommended by a colleague until I realised 
this potential issue, thus my template doesn't use it. I should update 
the comment with some updated information.


My only two worries then would be a) if it is outdated on other distros 
and b) that it appears that it is not installed by default. On my Arch 
desktop it seems to be available with inn[1] (Usenet server and 
related software) and nowhere else. It seems the same on Ubuntu (Google 
pointed me to inn2-dev). Do you have INN installed? If not, where did 
you get shlock from?

I'd love to have details on this if I shouldn't be using it
- Is "set -o noclobber; echo $$ > $lockfile" really atomic and safer than
shlock? If so, great, although I would then wonder why shlock even exists :)
The part that brings about an atomic lock is noclobber, which sets it 
so that we are not allowed to clobber/overwrite an existing file. 
Thus, if the file exists, the command fails. If it successfully creates 
the new file, the command returns true.
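
A minimal sketch of the pattern (paths and names here are placeholders, not 
lifted from my script):

  lockfile=/var/run/btrfs-backup.lock
  if ( set -o noclobber; echo "$$" > "$lockfile" ) 2>/dev/null; then
      trap 'rm -f "$lockfile"' EXIT   # release the lock on exit
      # ... do the real work here ...
  else
      echo "Lock held by PID $(cat "$lockfile"), exiting." >&2
      exit 1
  fi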


I'd consider changing this, mostly because depending on INN is a very 
big dependency. There are other options as well, though I don't 
think they're as portable as noclobber.


Thanks,
Marc


Thanks for your input. It has already given me some direction. :)

[1] https://www.archlinux.org/packages/community/x86_64/inn/files/

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [systemd-devel] [HEADS-UP] Discoverable Partitions Spec

2014-03-12 Thread Brendan Hide

On 2014/03/12 09:31 PM, Chris Murphy wrote:

On Mar 12, 2014, at 1:12 PM, Goffredo Baroncelli kreij...@inwind.it wrote:

On 03/12/2014 06:24 PM, Chris Mason wrote:

Your suggestion also sounds like it places snapshots outside of their parent 
subvolume? If so it mitigates a possible security concern if the snapshot 
contains (old) binaries with vulnerabilities. I asked about how to go about 
assessing this on the Fedora security list:
https://lists.fedoraproject.org/pipermail/security/2014-February/001748.html

There aren't many replies but the consensus is that it's a legitimate concern, 
so either the snapshots shouldn't be persistently available (which is typical 
with e.g. snapper, and also yum-plugin-fs-snapshot), and/or when the subvolume 
containing snapshots is mounted, it's done with either mount option noexec or 
nosuid (no consensus on which one, although Gnome Shell uses nosuid by default 
when automounting removable media).
This is exactly the same result as when following the previously-recommended 
subvolume layout given on the Arch wiki. It seems this wiki advice has 
since disappeared, so I can't give a link for it ...


My apologies if the rest of my mail is off-topic.

Though not specifically for rollback, my snapshots prior to btrfs 
send|receive backup are created via a temporary mountpoint. Until two days ago 
I was still using rsync to a secondary btrfs volume and the __snapshots 
folder had been sitting empty for about a year. The performance 
difference with send|receive is orders of magnitude apart: A daily backup to the 
secondary disk now takes between 30 and 40 seconds whereas it took 20 to 
30 minutes with rsync.


Here are my current subvolumes:
__active
__active/home
__active/usr
__active/var
__snapshots/__2014-03-12-23h00m01s+0200
__snapshots/_home_2014-03-12-23h00m01s+0200
__snapshots/_usr_2014-03-12-23h00m01s+0200
__snapshots/_var_2014-03-12-23h00m01s+0200

I hadn't thought of noexec or nosuid. On a single-user system you don't 
really expect that type of incursion. I will put up my work after I've 
properly automated cleanup.
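
For illustration, mounting the snapshot subvolume with those options could 
look something like this (device and mountpoint are placeholders):

  mount -o subvol=__snapshots,nosuid,noexec,noatime /dev/sdXn /mnt/snapshots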


The only minor gripe I have with the temporary mount is that I feel it 
should be possible to perform snapshots and use send|receive without the 
requirement of having the subvolumes be visible in userspace.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Understanding btrfs and backups

2014-03-06 Thread Brendan Hide

On 2014/03/06 09:27 PM, Eric Mesa wrote:

Brian Wong wrote: a snapshot is different than a backup
[snip]

...

Three hard drives: A, B, and C.

Hard drives A and B - btrfs RAID-1 so that if one drive dies I can keep
using my system until the replacement for the raid arrives.

Hard drive C - gets (hourly/daily/weekly/or some combination of the above)
snapshots from the RAID. (Starting with the initial state snapshot) Each
timepoint another snapshot is copied to hard drive C.

[snip]...

So if that's what I'm doing, do snapshots become a way to do backups?
An important distinction for anyone joining the conversation is that 
snapshots are *not* backups, in a similar way that you mentioned that 
RAID is not a backup. If a hard drive implodes, its snapshots go with it.


Snapshots can (and should) be used as part of a backup methodology - and 
your example is almost exactly the same as previous good backup 
examples. I think most of the time there's mention of an external 
backup server keeping the backups, which is the only major difference 
compared to the process you're looking at. Btrfs send/receive with 
snapshots can make the process far more efficient compared to rsync. 
Rsync doesn't have any record as to what information has changed so it 
has to compare all the data (causing heavy I/O). Btrfs keeps a record 
and can skip to the part of sending the data.


I do something similar to what you have described on my Archlinux 
desktop - however I haven't updated my (very old) backup script to take 
advantage of btrfs' send/receive functionality. I'm still using rsync. :-/

/ and /home are on btrfs-raid1 on two smallish disks
/mnt/btrfs-backup is on btrfs single/dup on a single larger disk

See https://btrfs.wiki.kernel.org/index.php/Incremental_Backup for a 
basic incremental methodology using btrfs send/receive
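
A bare-bones version of that cycle, with example paths (see the wiki page 
for the full details):

  # initial full send
  btrfs subvolume snapshot -r /home /home/.snapshots/home.1
  btrfs send /home/.snapshots/home.1 | btrfs receive /mnt/btrfs-backup/
  # later runs send only the difference against the previous snapshot
  btrfs subvolume snapshot -r /home /home/.snapshots/home.2
  btrfs send -p /home/.snapshots/home.1 /home/.snapshots/home.2 \
      | btrfs receive /mnt/btrfs-backup/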


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Possible to wait for snapshot deletion?

2014-02-13 Thread Brendan Hide

On 2014/02/13 09:02 PM, Kai Krakow wrote:

Hi!

Is it technically possible to wait for a snapshot completely purged from
disk? I imagine an option like --wait for btrfs delete subvolume.

This would fit some purposes I'm planning to implement:

* In a backup scenario
I have a similar use-case for this also involving backups. In my case I 
have a script that uses a btrfs filesystem for the backup store using 
snapshots. At the end of each run, if diskspace usage is below a 
predefined threshold, it will delete old snapshots until the diskspace 
usage is below that threshold again.


Of course, the first time I added the automatic deletion, it deleted far 
more than was necessary due to the fact that the actual freeing of 
diskspace is asynchronous from the command completion. I ended up 
setting a small delay (of about 60 seconds) between each iteration and 
also set it to monitor system load. If load is not low enough after the 
delay then it waits another 60 seconds.


This complicated (frankly broken) workaround would be completely 
unnecessary with a --wait switch.
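
For reference, a very rough sketch of that kind of cleanup loop (the 
threshold, paths and snapshot naming are placeholders, not my actual script):

  threshold=90   # percent used
  used() { df /mnt/backup | awk 'NR==2 {sub("%","",$5); print $5}'; }
  while [ "$(used)" -gt "$threshold" ]; do
      oldest=$(ls -d /mnt/backup/*.snap | head -n 1)  # assumes sortable names
      btrfs subvolume delete "$oldest"
      sleep 60   # space is freed asynchronously, so wait before re-checking
  done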


Alternatively, perhaps a knob where we can see if a subvolume deletion 
is in progress could help.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Provide a better free space estimate on RAID1

2014-02-05 Thread Brendan Hide

On 2014/02/05 10:15 PM, Roman Mamedov wrote:

Hello,

On a freshly-created RAID1 filesystem of two 1TB disks:

# df -h /mnt/p2/
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda2   1.8T  1.1M  1.8T   1% /mnt/p2

I cannot write 2TB of user data to that RAID1, so this estimate is clearly
misleading. I got tired of looking at the bogus disk free space on all my
RAID1 btrfs systems, so today I decided to do something about this:

...

After:

# df -h /mnt/p2/
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda2   1.8T  1.1M  912G   1% /mnt/p2

Until per-subvolume RAID profiles are implemented, this estimate will be
correct, and even after, it should be closer to the truth than assuming the
user will fill their RAID1 FS only with subvolumes of single or raid0 profiles.
This is a known issue: 
https://btrfs.wiki.kernel.org/index.php/FAQ#Why_does_df_show_incorrect_free_space_for_my_RAID_volume.3F


Btrfs is still considered experimental - this is just one of those 
caveats we've learned to adjust to.


The change could work well for now and I'm sure it has been considered. 
I guess the biggest end-user issue is that you can, at a whim, change 
the allocation profile for new blocks - raid0/5/6, single, etc. - and the 
value from 5 minutes ago will be far off from your new value without 
anything having been written or any space taken up. Not a show-stopper 
problem, really.


The biggest dev issue is that future features will break this behaviour, 
such as the per-subvolume RAID profiles you mentioned. It is difficult 
to motivate including code (for which there's a known workaround) where 
we know it will be obsoleted.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: btrfs-transaction blocked for more than 120 seconds

2014-01-07 Thread Brendan Hide

On 2014/01/06 12:57 AM, Roman Mamedov wrote:

Did you align your partitions to accommodate for the 4K sector of the EARS?
I had, yes. I had to do a lot of research to get the array working 
optimally. I didn't need to repartition the spare so this carried over 
to its being used as an OS disk.


I actually lost the Green array twice - and learned some valuable lessons:

1. I had an 8-port SCSI card which was dropping the disks due to the 
timeout issue mentioned by Chris. That caused the first array failure. 
Technically all the data was on the disks - but temporarily 
irrecoverable as disks were constantly being dropped. I made a mistake 
during ddrescue which simultaneously destroyed two disks' data, meaning 
that the recovery operation was finally for nought. The only consolation 
was that I had very little data at the time and none of it was 
irreplaceable.


2. After replacing the SCSI card with two 4-port SATA cards, a few 
months later I still had a double-failure (the second failure being 
during the RAID5 rebuild). This time it was only due to bad disks and a 
lack of scrubbing/early warning - clearly my own fault.


Having learnt these lessons, I'm now a big fan of scrubbing and backups. ;)

I'm also pushing for RAID15 wherever data is mission-critical. I simply 
don't trust the reliability of disks any more and I also better 
understand how, by having more and/or larger disks in a RAID5/6 array, 
the overall reliability of that array plummets.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: btrfs-transaction blocked for more than 120 seconds

2014-01-05 Thread Brendan Hide

On 2014/01/05 11:17 PM, Sulla wrote:

Certainly: I have 3 HDDs, all of which WD20EARS.

Maybe/maybe-not off-topic:
Poor hardware performance, though not necessarily the root cause, can be 
a major factor with these errors.


WD Greens (Reds too, for that matter) have poor non-sequential 
performance. As an educated guess, I'd say there's a 15% chance this is a 
major factor in the problem and, perhaps, a 60% chance it is merely a 
small contributor. Greens are aimed at consumers 
wanting high capacity and a low pricepoint. The result is poor 
performance. See footnote * re my experience.


My general recommendation (use cases vary of course) is to install a 
tiny SSD (60GB, for example) just for the OS. It is typically cheaper 
than the larger drives and will be *much* faster. WD Greens and Reds 
have good *sequential* throughput but comparatively abysmal random 
throughput even in comparison to regular non-SSD consumer drives.


*
I had 8x 1.5TB WD1500EARS drives in an mdRAID5 array. With it I had a 
single 250GB IDE disk for the OS. When the very old IDE disk inevitably 
died, I decided to use a spare 1.5TB drive for the OS. Performance was 
bad enough that I simply bought my first SSD the same week.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: No space left on device, problem

2013-10-27 Thread Brendan Hide

On 2013/10/27 10:50 AM, Igor M wrote:

On Sun, Oct 27, 2013 at 2:00 AM, Tomasz Chmielewski t...@virtall.com wrote:

Still no messages. Parameter seems to be active as
/sys/module/printk/parameters/ignore_loglevel is Y, but there are no
messages in log files or dmesg. Maybe I need to turn on some kernel
debugging option and recompile kernel ?
Also I should mention that cca 230G+ data was copied before this error
started to occur.

I think I saw a similar issue before.

Can you try using rsync with --bwlimit XY option to copy the files?

The option will limit the speed, in kB, at which the file is being
copied; it will work even when source and destination files are on a
local machine.
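
(Concretely, a rate-limited local copy of that kind might look like the 
following - the rate and paths are only examples:)

  rsync -a --bwlimit=8192 /source/dir/ /mnt/btrfs/dest/   # ~8 MB/s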


Also I run strace cp -a ..
...
read(3, "350348f07$0$24520$c3e8da3$fb4835"..., 65536) = 65536
write(4, "350348f07$0$24520$c3e8da3$fb4835"..., 65536) = 65536
read(3, "62.76C52BF412E849CB86D4FF3898B94"..., 65536) = 65536
write(4, "62.76C52BF412E849CB86D4FF3898B94"..., 65536) = -1 ENOSPC (No
space left on device)

Last two write calls take a lot more time, and then last one returns
ENOSPC. But if this write is retryed, then it succeeds.
I tried with midnight commander and when error occurs, if I Retry
operation then it finishes copying this file until error occurs again
at next file.

With --bwlimit it seems to be better, lower the speed later the error
occurs, and if it's slow enough copy is successfull.
But now I'm not sure anymore. I copied a few files with bwlimit, and
now sudenly error doesn't occur anymore, even with no bwlimit.
I'll do some more tests.
--
This sounds to me like the problem is related to read performance 
causing a bork. This would explain why bwlimit helps, as well as why cp 
works the second time around (since it is cached).


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Non-intelligent behaviour on device delete with multiple devices?

2013-10-27 Thread Brendan Hide

On 2013/10/27 07:33 PM, Hans-Kristian Bakke wrote:

Hi

Today I tried removing two devices from a multidevice btrfs RAID10
volume using the following command:
---
btrfs device delete /dev/sdl /dev/sdk /btrfs
---

It first removed device sdl and then sdk. What I did not expect
however was that btrfs didn't remove sdk from the available drives
when removing and rebalancing data from the first device. This
resulted in over 300GB of data actually being added to sdk during
removal of sdl, only to make the removal process of sdk longer.

This seems to me a rather non-intelligent way to do this. I would
expect all drives given as input to the btrfs device delete
command to be removed from the list of drives available for rebalancing
of the data during removal of the drives.
This is a known issue I'm sure will be addressed. It has annoyed me in 
the past as well. Perhaps add it to the wiki: 
https://btrfs.wiki.kernel.org/index.php/Project_ideas


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Total fs size does not match with the actual size of the setup

2013-10-27 Thread Brendan Hide

On 2013/10/27 10:27 PM, Lester B wrote:

2013/10/28 Hugo Mills h...@carfax.org.uk:

On Mon, Oct 28, 2013 at 04:09:18AM +0800, Lester B wrote:

The btrfs setup only has one device of size 7 GiB, but
when I run df, the total size shown is 15 GiB. Running
btrfs --repair

I'd recommend not running btrfs check --repair unless you really
know what you're doing, or you've checked with someone knowledgable
and they say you should try it. On a non-broken filesystem (as here),
it's probably OK, though.


displays an error "cache and super
generation don't match, space cache will be invalidated".

This is harmless.


How can I correct the total fs size as shown in df?

You can't. It's an artefact of the fact that you've got a RAID-1
(or RAID-10, or --mixed and DUP) filesystem, and that the standard
kernel interface for df doesn't allow us to report the correct figures
-- see [1] (and the subsequent entry as well) for a more detailed
description.

Hugo.

[1] 
https://btrfs.wiki.kernel.org/index.php/FAQ#Why_does_df_show_incorrect_free_space_for_my_RAID_volume.3F

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
   PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- Nothing right in my left brain. Nothing left in ---
  my right brain.


But my setup is a simple one without any RAID levels or other things, so at least
the df size column should show the actual size of my setup.

Could you send us the output of the following?:
 btrfs fi df mountpoint
(where mountpoint is the path where the btrfs is mounted.)

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: btrfs raid5

2013-10-22 Thread Brendan Hide

On 2013/10/22 07:18 PM, Alexandre Oliva wrote:

... and
it is surely an improvement over the current state of raid56 in btrfs,
so it might be a good idea to put it in.
I suspect the issue is that, while it sort of works, we don't really want 
to push people to use it half-baked. This is reassuring work, however. 
Maybe it would be nice to have some half-baked code *anyway*, even if 
Chris doesn't put it in his pull requests juuust yet. ;)

So far, I've put more than
1TB of data on that failing disk with 16 partitions on raid6, and
somehow I got all the data back successfully: every file passed an
md5sum check, in spite of tons of I/O errors in the process.

Is this all on a single disk? If so it must be seeking like mad! haha

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH 0/5] [RFC] RAID-level terminology change

2013-03-26 Thread Brendan Hide
Bit late - but that could be explored in future. The main downside I see 
with automatic redundancy/optimisation is the complexity it 
introduces. Likely this would be best served with user-space tools.


On 11/03/13 02:21, Roger Binns wrote:

On 10/03/13 15:04, Hugo Mills wrote:

Given that this is going to end up rewriting *all* of the data on the
FS,

Why does all data have to be rewritten?  Why does every piece of data have
to have exactly the same storage parameters in terms of
non-redundancy/performance/striping options?
This is a good point. You don't have to rewrite everything 
all at once, so the performance penalty is not necessarily that bad. More 
importantly, some restripe operations actually don't need much change 
on-disk (in theory).


Let's say we have disks with 3c chunk allocation and this needs to be 
reallocated into 2c chunks.


In practice at present what would actually happen is that it would first 
create a *new* 2c chunk and migrate the data over from the 3c chunk. 
Once the data is moved across we finally mark the space taken up by the 
original 3c chunk as available for use. Rinse; Repeat.


In theory we can skip this rebalance/migration step by thinning out 
the chunks in-place: relabel the chunk as 2c and mark the unneeded 
copies as available diskspace.


A similar situation applies to other types of conversions in that they 
could be converted in-place with much less I/O or that the I/O could be 
optimised (for example sequential I/O between disks with minimal 
buffering needed vs moving data between two locations on the same disk). 
I'm sure there are other possibilities for in-place conversions too, 
such as moving from 4c to 2c2s or 2c to 2s.


xC -> (x-1)C
xCmS -> (x/2)C(m*2)S

The complexity of the different types of conversions hasn't escaped me, 
and I do see another downside as well. With the 3C->2C conversion there 
is the inevitability of macro fragmentation. Again, there could be 
long-term performance implications, or it might even be negligible.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: User feedback: raise the default leaf size to 16k

2013-03-03 Thread Brendan Hide

On 2013/02/13 12:33 PM, Holger Hoffstaette wrote:

- raise the leaf size to 16k
- use single metadata profile

...

the difference in behaviour on a single disk is *very* noticeable.
Did you try an isolated change of leaf size? I think the devs would be 
willing to look into the default size if it makes a dramatic difference 
on its own. Personally I think you are seeing an improvement more as a 
result of the metadata profile than of the leafsize.
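
For an isolated test, something along these lines would separate the two 
changes (the device is a placeholder; older btrfs-progs releases spelled 
the leaf size flag -l/--leafsize):

  mkfs.btrfs -n 16384 /dev/sdX             # 16k leaves, metadata profile left at default
  mkfs.btrfs -n 16384 -m single /dev/sdX   # 16k leaves plus single metadata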


I don't think changing the default profile for metadata will be easily 
entertained as this is very important for protecting against corruption 
due to bitrot.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [btrfs] Periodic write spikes while idling, on btrfs root

2013-03-03 Thread Brendan Hide

On 2013/02/14 12:15 PM, Vedant Kumar wrote:

Hello,

I'm experiencing periodic write spikes while my system is idle.

...

turned out to be some systemd log in
/var/log/journal. I turned off journald and rebooted, but the write spike
behavior remained.

...

best,
-vk

I believe btrfs syncs every 30 seconds (if anything's changed).

This sounds like systemd's journal is not actually disabled and that it 
is simply logging new information every few seconds and forcing it to be 
synced to disk. Have you tried following the journal as root to see what 
is being logged?

journalctl -f

Alternatively, as another measure to troubleshoot, in 
/etc/systemd/journald.conf, change the Storage= option either to none 
(which disables journal storage completely) or to volatile (which keeps 
the journal in a tmpfs under /run), thereby eliminating btrfs' involvement.
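
A minimal excerpt of that change (pick whichever value suits you, then 
restart systemd-journald):

  # /etc/systemd/journald.conf
  [Journal]
  Storage=volatile   # keep the journal in /run (tmpfs); Storage=none discards it entirely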


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH v3] Btrfs-progs: check out if the swap device

2013-03-03 Thread Brendan Hide

On 2013/02/14 09:53 AM, Tsutomu Itoh wrote:

+   if (ret < 0) {
+   fprintf(stderr, "error checking %s status: %s\n", file,
+   strerror(-ret));
+   exit(1);
+   }

...

+   /* check if the device is busy */
+   fd = open(file, O_RDWR|O_EXCL);
+   if (fd < 0) {
+   fprintf(stderr, "unable to open %s: %s\n", file,
+   strerror(errno));
+   exit(1);
+   }
This is fine and works (as tested by David) - but I'm not sure if the 
below suggestions from Zach were taken into account.


1. If the check with open(file, O_RDWR|O_EXCL) shows that the device 
is available, there's no point in checking if it is mounted as a swap 
device. A preliminary check using this could precede all other checks 
which should be skipped if it shows success.


2. If there's an error checking the status (for example, let's say 
/proc/swaps is deprecated), we should print the informational message 
but not error out.


On 2013/02/13 11:58 AM, Zach Brown wrote:

- First always open with O_EXCL.  If it succeeds then there's no reason
   to check /proc/swaps at all.  (Maybe it doesn't need to try
   check_mounted() there either?  Not sure if it's protecting against
   accidentally mounting mounted shared storage or not.)

...

- At no point is failure of any of the /proc/swaps parsing fatal.  It'd
   carry on ignoring errors until it doesn't have work to do.  It'd only
   ever print the nice message when it finds a match.



--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: btrfs send receive produces Too many open files in system

2013-03-03 Thread Brendan Hide

On 2013/02/18 12:37 PM, Adam Ryczkowski wrote:

...
to migrate btrfs from one partition layout to another.
...
source sits on top of an lvm2 logical volume, which sits on top of a 
cryptsetup LUKS device which subsequently sits on top of mdadm RAID-6 
spanning a partition on each of 4 hard drives ... is a read-only 
snapshot which I estimate contains ca. 100GB of data.

...
destination is btrfs multidevice raid10 filesystem, which is based 
on 4 cryptsetup Luks devices, each live as a separate partition on the 
same 4 physical hard drives ...

...
about 8MB/sec read (and the same speed of write) from each of all 4 
hard drives).



I hope you've solved this already - but if not:

The unnecessarily complex setup aside, a 4-disk RAID6 is going to be 
slow - most would have gone for a RAID10 configuration, albeit that it 
has less redundancy.


Another real problem here is that you are copying data from these disks 
to themselves. This means that for every read and write, all four of 
the disks have to do two seeks. This is time-consuming - of the order of 
7ms per seek depending on the disks you have. The way to avoid these 
unnecessary seeks is to first copy the data to a separate unrelated 
device and then to copy from that device to your final destination device.


To increase RAID6 write performance (Perhaps irrelevant here) you can 
try optimising the stripe_cache_size value. It can use a ton of memory 
depending on how large a stripe cache setting you end up with. Search 
online for mdraid stripe_cache_size.


To increase the read performance you can try optimising the md arrays' 
readahead. As above, search online for blockdev setra. This should 
hopefully make a noticeable difference.
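
For example (values are illustrative only - measure before settling on 
anything):

  echo 8192 > /sys/block/md0/md/stripe_cache_size   # RAID5/6 stripe cache; RAM cost is entries x 4KiB x nr_disks
  blockdev --setra 16384 /dev/md0                   # readahead in 512-byte sectors (8MiB here)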


Good luck.

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: inconsistent output on sub list

2013-01-25 Thread Brendan Hide

Confirmed the fix is working.

^ TLDR can stop here :)

Recompiled the original Archlinux package from newly-synced ABS but also 
with your patch. I then tested the newly-compiled btrfs with the patch. 
I realised afterwards that I should also have tested immediately before installing 
the patched version, so I reinstalled the original unpatched version for 
the final test to confirm the problem was rectified entirely by the 
one-line change. The attached log shows the detail.


On 24/01/13 12:00, Anand Jain wrote:



Brendan,

 ---
  [root@watricky mnt]# btrfs subvolume list / -a
  ID 258 gen 4226 top level 384 path media/smbshare
 ::
  [root@watricky mnt]# btrfs subvolume list /home -a
  ID 258 gen 4226 top level 5 path 
FS_TREE/__active/media/smbshare4.snap

 ---
 This is definitely a bug. Thanks for reporting.

 I have made some fair assumptions, and have sent out the
 patch[1] to fix this bug (ref this email thread). Could you
 kindly test it and report the result?

[1]
[PATCH] Btrfs-progs: we need to have the string null terminated

Thanks,  Anand


On 01/23/2013 03:42 AM, Brendan Hide wrote:

Linux watricky 3.6.11-1-ARCH #1 SMP PREEMPT Tue Dec 18 08:57:15 CET 2012
x86_64 GNU/Linux

In working on a snapshot maintenance script I've noticed some odd
behaviour. Note the smbshare path. I've put this into its own subvolume
as I don't plan on snapshotting it.

In the first command's output, this path is printed correctly, however
in the second output it has 4.snap appended, similar to the names of
the snapshots I made 22 hours ago.

If this is a documented issue with a fix then no worries. But if not and
anyone wants me to check into any further specifics, please let me know.

  [root@watricky mnt]# btrfs subvolume list / -a
  ID 258 gen 4226 top level 384 path media/smbshare
  ID 259 gen 4337 top level 384 path home
  ID 384 gen 4321 top level 5 path FS_TREE/__active
  ID 392 gen 4337 top level 384 path var
  ID 393 gen 4267 top level 384 path usr
  ID 428 gen 4267 top level 5 path
FS_TREE/__snapshot/__active.20130121-23h44.snap
  ID 429 gen 3980 top level 5 path
FS_TREE/__snapshot/__active_home.20130121-23h45.snap
  ID 430 gen 4043 top level 5 path
FS_TREE/__snapshot/__active_var.20130121-23h45.snap
  ID 431 gen 4267 top level 5 path
FS_TREE/__snapshot/__active_usr.20130121-23h45.snap
  [root@watricky mnt]# btrfs subvolume list /home -a
  ID 258 gen 4226 top level 5 path 
FS_TREE/__active/media/smbshare4.snap

  ID 259 gen 4337 top level 5 path FS_TREE/__active/home
  ID 384 gen 4321 top level 5 path FS_TREE/__active
  ID 392 gen 4337 top level 5 path FS_TREE/__active/var
  ID 393 gen 4267 top level 5 path FS_TREE/__active/usr
  ID 428 gen 4267 top level 5 path
FS_TREE/__snapshot/__active.20130121-23h44.snap
  ID 429 gen 3980 top level 5 path
FS_TREE/__snapshot/__active_home.20130121-23h45.snap
  ID 430 gen 4043 top level 5 path
FS_TREE/__snapshot/__active_var.20130121-23h45.snap
  ID 431 gen 4267 top level 5 path
FS_TREE/__snapshot/__active_usr.20130121-23h45.snap
  [root@watricky mnt]#

Note that the only directly mounted share is __active, mounted at /.




--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97

[ bren...@watricky.invalid.co.za : 15:03:10 : ~/build/btrfs-progs ]
:) sudo pacman -U btrfs-progs-0.19.20121005-4-x86_64.pkg.tar.xz 
[sudo] password for brendan: 
loading packages...
warning: btrfs-progs-0.19.20121005-4 is up to date -- reinstalling
resolving dependencies...
looking for inter-conflicts...

Targets (1): btrfs-progs-0.19.20121005-4

Total Installed Size:   2.43 MiB
Net Upgrade Size:   -0.04 MiB

Proceed with installation? [Y/n] 
(1/1) checking package integrity [###] 100%
(1/1) loading package files  [###] 100%
(1/1) checking for file conflicts[###] 100%
(1/1) checking available disk space  [###] 100%
(1/1) upgrading btrfs-progs  [###] 100%
[ bren...@watricky.invalid.co.za : 15:03:33 : ~/build/btrfs-progs ]
:) sudo su -
[root@watricky ~]# btrfs subvolume list / -a
ID 258 gen 5034 top level 384 path media/smbshare
ID 259 gen 5161 top level 384 path home
ID 384 gen 5161 top level 5 path FS_TREE/__active
ID 392 gen 5161 top level 384 path var
ID 393 gen 5161 top level 384 path usr
ID 428 gen 5043 top level 5 path FS_TREE/__snapshot

Re: [PATCH] Btrfs-progs: Exit if not running as root

2013-01-25 Thread Brendan Hide

On 25/01/13 14:43, Hugo Mills wrote:

On Fri, Jan 25, 2013 at 07:29:44AM -0500, Gene Czarcinski wrote:

On 01/25/2013 06:55 AM, Roman Mamedov wrote:

On Fri, 25 Jan 2013 06:32:30 -0500
Gene Czarcinski g...@czarc.net wrote:


This patch hits a lot of files but adds little code.  It
could be considered a bugfix.  Currently, when one of the
btrfs user-space programs is executed by a regular user,
the result is often a number of strange error messages
which do not indicate the real problem.  This patch changes
that situation.

A test is performed as to whether the program is running
as root.  If it is not, issue an error message and exit.
Signed-off-by: Gene Czarcinski g...@czarc.net

$ ls -la /dev/sda
brw-rw---T 1 root disk 8, 0 Jan 15 12:11 /dev/sda

The user does not have to be root, they can be a member of the group disk to
manage this device.

Also some or all of the tools accept not just a block device, but also a
regular file as their parameter.

Wouldn't it be better to check whether or not the running user has
*write access* to the device or file to be operated on, before failing?

I knew there would be corner cases where root was not required for
execution.  After all, I do not need to be root to execute btrfs
--version.  Now, is it worth the effort to determine the corner
cases and do you have a proposed solution as to determining what
privileges are needed when?  I can understand when it could be a
regular file but is it all that common for users to be part of group
disk?

Don't try to check all the possible success conditions beforehand
-- that's what leads to websites that fail to work because your
browser is not IE, but work perfectly when you change your user-agent
string to MSIE. This is highly frustrating for users.

Instead, try whatever it is you were trying to do (open a file,
send an ioctl), and determine, as well as you can, why it failed by
looking at the error codes that you get back, and report that.
Permission denied - means you don't have permissions - you need to
be root, or have yourself put in the disk group, or get the
disk-management-capability. Let the user work out which of those
solutions they need, rather than forcing them to use the one you
thought of.

Hugo.
As Hugo suggested, I'd rather that we fix or refine the code in order to 
get better error messages. All the different exceptions to requiring or 
not requiring root overly complicate things that, strictly speaking, 
shouldn't need to be handled in advance.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



inconsistent output on sub list

2013-01-22 Thread Brendan Hide
Linux watricky 3.6.11-1-ARCH #1 SMP PREEMPT Tue Dec 18 08:57:15 CET 2012 
x86_64 GNU/Linux


In working on a snapshot maintenance script I've noticed some odd 
behaviour. Note the smbshare path. I've put this into its own subvolume 
as I don't plan on snapshotting it.


In the first command's output, this path is printed correctly, however 
in the second output it has 4.snap appended, similar to the names of 
the snapshots I made 22 hours ago.


If this is a documented issue with a fix then no worries. But if not and 
anyone wants me to check into any further specifics, please let me know.


 [root@watricky mnt]# btrfs subvolume list / -a
 ID 258 gen 4226 top level 384 path media/smbshare
 ID 259 gen 4337 top level 384 path home
 ID 384 gen 4321 top level 5 path FS_TREE/__active
 ID 392 gen 4337 top level 384 path var
 ID 393 gen 4267 top level 384 path usr
 ID 428 gen 4267 top level 5 path 
FS_TREE/__snapshot/__active.20130121-23h44.snap
 ID 429 gen 3980 top level 5 path 
FS_TREE/__snapshot/__active_home.20130121-23h45.snap
 ID 430 gen 4043 top level 5 path 
FS_TREE/__snapshot/__active_var.20130121-23h45.snap
 ID 431 gen 4267 top level 5 path 
FS_TREE/__snapshot/__active_usr.20130121-23h45.snap

 [root@watricky mnt]# btrfs subvolume list /home -a
 ID 258 gen 4226 top level 5 path FS_TREE/__active/media/smbshare4.snap
 ID 259 gen 4337 top level 5 path FS_TREE/__active/home
 ID 384 gen 4321 top level 5 path FS_TREE/__active
 ID 392 gen 4337 top level 5 path FS_TREE/__active/var
 ID 393 gen 4267 top level 5 path FS_TREE/__active/usr
 ID 428 gen 4267 top level 5 path 
FS_TREE/__snapshot/__active.20130121-23h44.snap
 ID 429 gen 3980 top level 5 path 
FS_TREE/__snapshot/__active_home.20130121-23h45.snap
 ID 430 gen 4043 top level 5 path 
FS_TREE/__snapshot/__active_var.20130121-23h45.snap
 ID 431 gen 4267 top level 5 path 
FS_TREE/__snapshot/__active_usr.20130121-23h45.snap

 [root@watricky mnt]#

Note that the only directly mounted share is __active, mounted at /.

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH 6/6] Btrfs-progs: detect if the disk we are formatting is a ssd

2013-01-19 Thread Brendan Hide

On 2013/01/19 08:06 PM, Gene Czarcinski wrote:

Signed-off-by: Josef Bacik jba...@fusionio.com
Signed-off-by: Gene Czarcinski g...@czarc.net
---
-values are raid0, raid1, raid10 or single.
+values are raid0, raid1, raid10, single or dup.  Single device will have dup
+set by default except in the case of SSDs which will default to single.  This is
+because SSDs can remap blocks internally so duplicate blocks could end up in the
+same erase block which negates the benefits of doing metadata duplication.
Can't help but suggest that a NO_DEDUP command could be added to the 
SATA Transport Protocol/SCSI Command set. Not sure where to submit that 
idea ... :-/
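
(For what it's worth, SSD detection of this kind is normally based on the 
block layer's rotational flag - I'm assuming that's what this patch checks:)

  cat /sys/block/sdX/queue/rotational   # 0 = non-rotational (SSD), 1 = rotational; sdX is a placeholder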


--
Brendan Hide

http://swiftspirit.co.za/


