Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-14 Thread Goffredo Baroncelli
On 09/15/2017 05:55 AM, Andrei Borzenkov wrote:
> On 15.09.2017 01:00, Goffredo Baroncelli wrote:
>>
>> 2) The second bug, is a more severe bug. If during a writing of a buffer 
>> with O_DIRECT, the buffer is updated at the same time by a second process, 
>> the checksum may be incorrect.
>>
> 
> Is it btrfs specific ? If buffer is updated before it was actually
> consumed by kernel, this likely means data corruption on any filesystem.

I don't see any corruption in other filesystems. The fact that an application
pushes garbage to the filesystem doesn't mean the filesystem itself may become
corrupted. In this case the filesystem does become corrupted, because another
application which tries to read the data (without O_DIRECT) may get -EIO.

I repeat: the problem is a data race while the data is already in the
filesystem's hands, and the kernel computes the wrong checksum.


IMHO, BTRFS should disallow O_DIRECT (which is what ZFS on Linux does); I think
it should be allowed only for nodatasum files.
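
For what it's worth, the nodatasum route already exists per file today via the
NOCOW attribute (chattr +C). A minimal sketch of doing the same from C, shown
only as an illustration and assuming the file is created empty first (the flag
has no effect once the file already has data):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Create a file with the NOCOW attribute (which on btrfs also means no data
 * checksums for it), then reopen it for O_DIRECT I/O. */
int open_nocow_for_direct_io(const char *path)
{
	int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0644);
	if (fd < 0)
		return -1;

	int flags = 0;
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
		flags |= FS_NOCOW_FL;              /* the bit chattr +C sets */
		if (ioctl(fd, FS_IOC_SETFLAGS, &flags) != 0)
			perror("FS_IOC_SETFLAGS");
	}
	close(fd);

	return open(path, O_RDWR | O_DIRECT);
}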

> I.e. there should be clear indication from kernel that buffer can be
> reused by application, in your example - when pwrite returns. So when
> data corruption happens - during pwrite or after? 
> If data is corrupted
> during pwrite, it is arguably application fault - it should disallow
> concurrent access.





> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-14 Thread Andrei Borzenkov
On 15.09.2017 01:00, Goffredo Baroncelli wrote:
> 
> 2) The second bug, is a more severe bug. If during a writing of a buffer with 
> O_DIRECT, the buffer is updated at the same time by a second process, the 
> checksum may be incorrect.
> 

Is it btrfs specific? If buffer is updated before it was actually
consumed by kernel, this likely means data corruption on any filesystem.
I.e. there should be clear indication from kernel that buffer can be
reused by application, in your example - when pwrite returns. So when
data corruption happens - during pwrite or after? If data is corrupted
during pwrite, it is arguably application fault - it should disallow
concurrent access.
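
As an illustration of that rule (a sketch of my own, not code from this
thread): if the data may be modified concurrently, the application can copy it
into a private, properly aligned bounce buffer that nothing else touches until
pwrite() has returned. Alignment of 4096 and a block-multiple length are
assumptions here; the real requirement depends on the device and filesystem.

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Snapshot a possibly concurrently-modified region, then hand only the
 * private copy to the kernel for the O_DIRECT write. */
ssize_t stable_direct_pwrite(int fd, const void *shared, size_t len, off_t off)
{
	void *bounce;
	if (posix_memalign(&bounce, 4096, len) != 0)
		return -1;

	memcpy(bounce, shared, len);
	ssize_t ret = pwrite(fd, bounce, len, off);

	free(bounce);
	return ret;
}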


Re: snapshots of encrypted directories?

2017-09-14 Thread Andrei Borzenkov
On 14.09.2017 18:32, Hugo Mills wrote:
> On Thu, Sep 14, 2017 at 04:57:39PM +0200, Ulli Horlacher wrote:
>> I use encfs on top of btrfs.
>> I can create btrfs snapshots, but I have no meaningful access to the files
>> in these snapshots, because they look like:
>>
>> drwx--  framstag users- 2017-09-08 11:47:18 
>> uHjprldmxo3-nSfLmcH54HMW
>> drwxr-xr-x  framstag users- 2017-09-08 11:47:18 
>> wNEWaDCgyXTj0d-Myk8wXZfh
>> -rw-r--r--  framstag users  377 2015-06-12 14:02:53 
>> -zDmc7xfobKDkbl8z7oKOHxv
>> -rw-r--r--  framstag users2,367 2012-07-10 14:32:30 
>> 7pfKs27K9k5zANE4WOQEuFa2
>> -rw---  framstag users  692 2009-10-20 13:45:41 
>> 8SQElYCph85kDdcFasUHybVr
>> -rw---  framstag users2,872 2017-08-31 16:21:52 
>> bm,yNi1e4fsAClDv7lNxxSfJ
>> lrwxrwxrwx  framstag users- 2017-06-01 15:53:00 
>> GZxNYI0Gy96R18fz40f7k5rl -> 
>> wvuQKHYzdFbar18fW6jjOerXk2IsS4OAA2fnHalBZjMQ,7Kw0j-zE3IJqxhmmGBN8G9
>> -rw-r--r--  framstag users  182 2016-12-01 13:34:31 
>> rqtNBbiYDym0hPMbBL-VLJZcFZu6nkNxlsjTX-sU88I4I1
>>
>> I have to mount the snapshot with encfs, to have access to the (decrypted)
>> files. 
>>
>> Any better ideas?
> 
>I'd say it's doing exactly what it should be doing. You're making a
> copy of an encrypted data store,

With all respect, a snapshot is not a copy.

> and the result is encrypted. In order
> to read it, it needs to have the decryption layer applied to it with
> the correct key (which is the need to mount the snapshot with encfs).
> 

But a snapshot *is* mounted implicitly, as it is part of the mounted btrfs
filesystem. So I can see that this behavior could be rather unexpected.

>Would you _really_ want a system where the encrypted contents of a
> subvolume can be decrypted by simply snapshotting it?

The actual question is - do you need to mount each individual btrfs
subvolume when using encfs? If yes, this behavior is at least
consistent. If not - how are snapshots different?





Re: [PATCH 1/2] btrfs-progs: build: generate all dependency files

2017-09-14 Thread Naohiro Aota
On 2017-09-14 21:41, David Sterba wrote:
> On Thu, Sep 14, 2017 at 07:10:46PM +0900, Naohiro Aota wrote:
>> We're missing several dependency files like:
>>
>> $ diff -u <(find -name '*.o'|cut -d. -f2|sort) <(find -name '*.o.d'|cut -d. 
>> -f2|sort)
>>--- /proc/self/fd/11    2017-09-14 18:17:44.460564620 +0900
>>+++ /proc/self/fd/12    2017-09-14 18:17:44.460564620 +0900
> 
> Please note that an actual diff in the changelog is understood as start
> of the patch by git-am, indenting the --- or +++ lines makes it work
> again.

Oops, I forgot about that limitation. Thank you for the fix.

> 
>> @@ -3,7 +3,6 @@
>>  /btrfs-corrupt-block
>>  /btrfs-debug-tree
>>  /btrfs-find-root
>> -/btrfs-list
>>  /btrfs-map-logical
>>  /btrfs-select-super
>>  /btrfstune
>> @@ -29,11 +28,6 @@
>>  /cmds-scrub
>>  /cmds-send
>>  /cmds-subvolume
>> -/convert/common
>> -/convert/main
>> -/convert/source-ext2
>> -/convert/source-fs
>> -/convert/source-reiserfs
>>  /ctree
>>  /dir-item
>>  /disk-io
>> 
>>
>> This is due to moving things out of objects and cmds_objects variables. Such
>> missing dependency files cause mis-building of some source files (try touch
>> utils.h; make mkfs/main.o).
>>
>> This patch introduces a new variable "all_objects" to keep all the objects and
>> use the variable to generate proper dependency file building rules.
>>
>> Signed-off-by: Naohiro Aota 
> 
> Applied, thanks.
> 


Re: defragmenting best practice?

2017-09-14 Thread Tomasz Kłoczko
On 14 September 2017 at 19:53, Austin S. Hemmelgarn
 wrote:
[..]
> While it's not for BTRFS< a tool called e4rat might be of interest to you
> regarding this.  It reorganizes files on an ext4 filesystem so that stuff
> used by the boot loader is right at the beginning of the device, and I've
> know people to get insane performance improvements (on the order of 20x in
> some pathologicallyb ad cases) in the time taken from the BIOS handing
> things off to GRUB to GRUB handing execution off to the kernel.

Do you know that what you've just written has nothing to do with fragmentation?
Intentionally or not, you are just trying to change the subject.

[..]
> This shouldn't need examples.  It's trivial math combined with basic
> knowledge of hardware behavior.  Every request to a device has a minimum
> amount of overhead.  On traditional hard drives, this is usually dominated
> by seek latency, but on SSD's, the request setup, dispatch, and completion
> are the dominant factor.  Assuming you have a 2 micro-second overhead
> per-request (not an exact number, just chosen for demonstration purposes
> because it makes the math easy), and a 1GB file, the time difference between
> reading ten 100MB extents and reading ten thousand 100kB extents is just
> short of 0.02 seconds, or a factor of about one thousand (which, no surprise
> here, is the factor of difference between the number of extents).

So to produce a few seconds of delay during boot you would need to make a few
hundred thousand, if not millions, more IOs than when reading everything
using ideal long sequential reads.
Almost every package upgrade rewrites some files completely, and with COW a
fully rewritten file ends up as a contiguous area.
You know .. there are not that many files in a typical distribution
installation to produce such a measurable impact.
On my current laptop I have a lot of devel and debug stuff installed
and still I have only

$ rpm -qal | wc -l
276489

files (of which only a small fraction are ELF DSOs or executables)
installed by:

$ rpm -qa | wc -l
2314

packages.

I can bet that even during a very complicated boot process only a few hundred
files will be touched (by read IOs). None of those files will be read
sequentially, because this is not how executable content is usually loaded
into the buffer cache. Simply changing the block device read-ahead may improve
boot time enough without putting all blocks in perfect order. All you need is
to run "blockdev --setra N" early enough, where N is greater than the default
256 blocks. All this can be done without thinking about fragmentation.
It seems you don't know that Linux by default reads data from a block device
in chunks of at least 256 blocks (1KB each), because such an IO size is part
of the default read-ahead settings. You can change those settings just for
boot time and you will have a far lower number of IOs, and still no
significant improvement like a few times shorter boot. Fragmentation will in
such a case be a secondary factor.
All this could be done without bothering about fragmentation.
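
For reference, "blockdev --setra N" is a thin wrapper around the BLKRAGET/
BLKRASET ioctls; a minimal sketch in C (the device path and the example value
are placeholders, and setting the value needs CAP_SYS_ADMIN):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/sda", O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	long ra = 0;
	if (ioctl(fd, BLKRAGET, &ra) == 0)      /* value is in 512-byte sectors */
		printf("current readahead: %ld sectors\n", ra);

	if (ioctl(fd, BLKRASET, 4096UL) != 0)   /* e.g. 4096 sectors = 2 MiB */
		perror("BLKRASET");

	close(fd);
	return 0;
}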

In other words, you are still talking about some theoretically possible
results which would be falsified if you tried, at least once, to do some real
tests and measurements.
The last time I did some boot time measurements, it was about sequential start
of all services vs. maximum parallelization. And yes, by this it was possible
to improve boot time by a few times. All without bothering about
fragmentation.

The current Fedora systemd base service definitions can be improved in
many places by adding more dependencies and executing many small services
in parallel. All those corrections can be done without even thinking
about fragmentation. Because this base set of systemd services comes
with the systemd source code, those improvements can be done for almost all
systemd-based Linux distros.

kloczek


[PATCH] Btrfs: cleanup 'start' subtraction from try uncompressed inline extent

2017-09-14 Thread Timofey Titovets
The subtraction was added in:
  c8b978188c9a0fd3d535c13debd19d522b726f1f
  "Btrfs: Add zlib compression support"
and has survived to the present day (since 08.10.2008).

Because 'start' is checked for zero before this branch, it's
safe to remove the subtraction.

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 02ef32149c15..81123408e82e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -570,7 +570,7 @@ static noinline void compress_file_range(struct inode 
*inode,
 cont:
if (start == 0) {
/* lets try to make an inline extent */
-   if (ret || total_in < (actual_end - start)) {
+   if (ret || total_in < actual_end) {
/* we didn't compress the entire range, try
 * to make an uncompressed inline extent.
 */
-- 
2.14.1



Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-14 Thread Hugo Mills
   As far as I know, both of these are basically known issues, with no
good solution, other than not using O_DIRECT. Certainly the first
issue is one I recognise. The second isn't one I recognise directly,
but is unsurprising to me.

   There have been discussions -- including developers -- on this list
as recently as a month or so ago. The general outcome seems to be that
any problems with O_DIRECT are not going to be fixed.

   Hugo.

On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
> Hi all,
> 
> I discovered two bugs when O_DIRECT is used...
> 
> 1) a corrupted file doesn't return -EIO when O_DIRECT is used
> 
> Normally BTRFS prevents to access the contents of a corrupted file; however I 
> was able read the content of a corrupted file simply using O_DIRECT
> 
> # in a new btrfs filesystem, create a file
> $ sudo mkfs.btrfs -f /dev/sdd5
> $ mount /dev/sdd5 t
> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd 
> bs=$((16*1024)) iflag=fullblock count=1024
> 
> # corrupt the file
> $ sudo filefrag -v t/abcd 
> Filesystem type is: 9123683e
> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..3475:  70656.. 74131:   3476:
>1: 3476..4095:  74212.. 74831:620:  74132: last,eof
> t/abcd: 2 extents found
> $ sudo umount t
> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
> /dev/sdd5
> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
> corrupting 289406976 copy 1
> 
> # try to access the file; expected result: -EIO
> $ sudo mount /dev/sdd5 t
> $ dd if=t/abcd | hexdump -c | head
> dd: error reading 't/abcd': Input/output error
> 0+0 records in
> 0+0 records out
> 0 bytes copied, 0.000477413 s, 0.0 kB/s
> 
> 
> # try to access the file using O_DIRECT; expected result: -EIO, instead the 
> file is accessible
> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
> 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
> *
> 0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> 0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> 0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> 0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> 0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> 0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> 0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> 0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> 
> (dmesg report the checksum mismatch)
> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 
> csum 0x98f94189 expected csum 0x0ab6be80 mirror 1
> 
> Note the first 4k filled by 0x01 !
> 
> Conclusion: even if the file is corrupted and normally BTRFS prevent to 
> access it, using O_DIRECT
> a) no error is returned to the caller
> b) instead of the page stored on the disk, it is returned a page filled with 
> 0x01 (according also with the function __readpage_endio_check())
> 
> 
> 2) The second bug, is a more severe bug. If during a writing of a buffer with 
> O_DIRECT, the buffer is updated at the same time by a second process, the 
> checksum may be incorrect.
> 
> At the end of the email there is the code which shows the problem: two 
> process share the same memory: the first write it to the disk, the second 
> update the buffer continuously. A third process try to read the file, but it 
> got time to time -EIO
> 
> If you ran my code in a btrfs filesystem you got a lot of 
> 
> ERROR: read thread; r = 8192, expected = 16384
> ERROR: read thread; r = 8192, expected = 16384
> ERROR: read thread; e = 5 - Input/output error
> ERROR: read thread; e = 5 - Input/output error
> 
> The firsts lines are related to a shorter read (which may happens). The lasts 
> lines are related to a checksum mismatch. The dmesg is filled by lines like
> [...]
> [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off 
> 4096 csum 0x0683c6df expected csum 0x55eb85e6 mirror 1
> [...]
> 
> This is definitely a bug. 
> 
> I think that using O_DIRECT and updating a page at the same time could happen 
> in a VM. In BTRFS this  could lead to a wrong checksum. The problem is that 
> if BTRFS detects a checksum error during a reading:
> a) if O_DIRECT is not used in the read
>   * -EIO is returned
> Definitely BAD
> 
> b) if O_DIRECT is used in the read
>   * it doesn't return the error to the caller
>   * it returns a page filled by 0x01 instead of the data from the disk
> Even worse than a)
> 
> Note1: even using O_DIRECT with O_SYNC, the problem still persist.
> Note2: the man page of open(2) is filled by a lot of notes about O_DIRECT, 
> but also it stated that using O_DIRECT+fork()+mmap(... MAP_SHARED) is legally.
> Note3: even 

BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-14 Thread Goffredo Baroncelli
Hi all,

I discovered two bugs when O_DIRECT is used...

1) a corrupted file doesn't return -EIO when O_DIRECT is used

Normally BTRFS prevents access to the contents of a corrupted file; however I
was able to read the content of a corrupted file simply by using O_DIRECT

# in a new btrfs filesystem, create a file
$ sudo mkfs.btrfs -f /dev/sdd5
$ mount /dev/sdd5 t
$ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd bs=$((16*1024)) 
iflag=fullblock count=1024

# corrupt the file
$ sudo filefrag -v t/abcd 
Filesystem type is: 9123683e
File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..3475:  70656.. 74131:   3476:
   1: 3476..4095:  74212.. 74831:620:  74132: last,eof
t/abcd: 2 extents found
$ sudo umount t
$ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
/dev/sdd5
mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
corrupting 289406976 copy 1

# try to access the file; expected result: -EIO
$ sudo mount /dev/sdd5 t
$ dd if=t/abcd | hexdump -c | head
dd: error reading 't/abcd': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.000477413 s, 0.0 kB/s


# try to access the file using O_DIRECT; expected result: -EIO, instead the 
file is accessible
$ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
*
0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g

(dmesg reports the checksum mismatch)
[13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 
csum 0x98f94189 expected csum 0x0ab6be80 mirror 1

Note the first 4k is filled with 0x01!

Conclusion: even if the file is corrupted and normally BTRFS prevents access
to it, using O_DIRECT
a) no error is returned to the caller
b) instead of the page stored on the disk, a page filled with 0x01 is
returned (consistent with the function __readpage_endio_check())
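
The dd invocation above can also be reproduced with a minimal C reader, shown
here only as a sketch (O_DIRECT needs an aligned buffer and block-multiple
sizes; the path is the test file from the example):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	int fd = open("t/abcd", O_RDONLY | O_DIRECT);
	if (fd < 0) { perror("open"); return 1; }

	void *buf;
	if (posix_memalign(&buf, 4096, 4096)) return 1;

	ssize_t n = read(fd, buf, 4096);
	if (n < 0)
		perror("read");   /* the expected -EIO */
	else
		printf("read %zd bytes, first byte 0x%02x\n",
		       n, ((unsigned char *)buf)[0]);   /* observed: 0x01 */

	free(buf);
	close(fd);
	return 0;
}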


2) The second bug is more severe. If, during a write of a buffer with
O_DIRECT, the buffer is updated at the same time by a second process, the
checksum may be incorrect.

At the end of the email there is the code which shows the problem: two
processes share the same memory: the first writes it to the disk, the second
updates the buffer continuously. A third process tries to read the file, but
from time to time it gets -EIO

If you run my code on a btrfs filesystem you get a lot of 

ERROR: read thread; r = 8192, expected = 16384
ERROR: read thread; r = 8192, expected = 16384
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error

The first lines are related to a short read (which may happen). The last
lines are related to a checksum mismatch. dmesg is filled with lines like
[...]
[14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off 4096 
csum 0x0683c6df expected csum 0x55eb85e6 mirror 1
[...]

This is definitely a bug. 

I think that using O_DIRECT and updating a page at the same time could happen
in a VM. In BTRFS this could lead to a wrong checksum. The problem is that if
BTRFS detects a checksum error during a read:
a) if O_DIRECT is not used in the read
* -EIO is returned
Definitely BAD

b) if O_DIRECT is used in the read
* it doesn't return the error to the caller
* it returns a page filled with 0x01 instead of the data from the disk
Even worse than a)

Note1: even using O_DIRECT with O_SYNC, the problem still persists.
Note2: the man page of open(2) is filled with a lot of notes about O_DIRECT,
but it also states that using O_DIRECT+fork()+mmap(... MAP_SHARED) is legal.
Note3: even "ZFS on Linux" has its troubles with O_DIRECT: in fact ZFS doesn't
support it; see https://github.com/zfsonlinux/zfs/issues/224

BR
G.Baroncelli

- cut --- cut --- cut 

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <assert.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define FILESIZE (4096*4)

int fd;
char *buffer = NULL;

void read_thread(const char *nf) {

void *data = mmap(NULL,  FILESIZE,
PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

assert(data);
fprintf(stderr, "read_thread:  data = %p\n", data);
int rfd;
rfd = open(nf, O_RDONLY);
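
As a shorter, self-contained sketch of the same race (an illustration only,
not the program above): a child keeps dirtying a shared buffer while the
parent writes that same buffer with O_DIRECT; reading the file back afterwards
without O_DIRECT (e.g. with dd) is what then hits the csum mismatch.

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SZ (4096 * 4)

int main(void)
{
	char *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	int fd = open("testfile", O_CREAT | O_TRUNC | O_WRONLY | O_DIRECT, 0644);
	if (buf == MAP_FAILED || fd < 0)
		return 1;

	pid_t child = fork();
	if (child == 0)                          /* updater: keep dirtying the buffer */
		for (unsigned char c = 0;; c++)
			memset(buf, c, SZ);

	for (int i = 0; i < 1000; i++)           /* writer: O_DIRECT writes of that buffer */
		pwrite(fd, buf, SZ, 0);

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	close(fd);
	return 0;
}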

Re: defragmenting best practice?

2017-09-14 Thread Kai Krakow
On Thu, 14 Sep 2017 18:48:54 +0100,
Tomasz Kłoczko wrote:

> On 14 September 2017 at 16:24, Kai Krakow 
> wrote: [..]
> > Getting e.g. boot files into read order or at least nearby improves
> > boot time a lot. Similar for loading applications.  
> 
> By how much it is possible to improve boot time?
> Just please some example which I can try to replay which ill be
> showing that we have similar results.
> I still have one one of my laptops with spindle on btrfs root fs ( and
> no other FSess in use) so I could be able to confirm that my numbers
> are enough close to your numbers.

I need to create a test setup because this system uses bcache. The
difference (according to systemd-analyze) between warm bcache and no
bcache at all ranges from 16-30s boot time vs. 3+ minutes boot time.

I could turn off bcache, do a boot trace, try to rearrange boot files,
boot again. However, that is not very reproducible as the current file
layout is not defined. It'd be better to setup a separate machine where
I could start over from a "well defined" state before applying
optimization steps to see the differences between different strategies.
At least readahead is not very helpful, I tested that in the past. It
reduces boot time just by a few seconds, maybe 20-30, thus going from
3+ minutes to 2+ minutes.

I still have an old laptop lying around: Single spindle, should make a
good test scenario. I'll have to see if I can get it back into shape.
It will take me some time.


> > Shake tries to
> > improve this by rewriting the files - and this works because file
> > systems (given enough free space) already do a very good job at
> > doing this. But constant system updates degrade this order over
> > time.  
> 
> OK. Please prepare some database, import some data which size will be
> few times of not used RAM (best if this multiplication factor will be
> at least 10). Then do some batch of selects measuring distribution
> latencies of those queries.

Well, this is pretty easy. Systemd-journald is a real beast when it
comes to cow fragmentation. Results can be easily generated and
reproduced. There are long traces of discussions in the systemd mailing
list and I simply decided to make the files nocow right from the start
and that fixed it for me. I can simply revert it and create benchmarks.


> This will give you some data about. not fragmented data.

Well, I would probably do it the other way around: Generate a
fragmented journal file (as that is how journald creates the file over
time), then rewrite it by some manner to reduce extents, then run
journal operations again on this file. Does it bother you to turn this
around?
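
For quantifying the before/after of such a rewrite, the extent count is easy
to get at programmatically; just a sketch (filefrag reports roughly the same
number, so this is only the C variant of that measurement):

#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) return 1;
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	struct fiemap fm;
	memset(&fm, 0, sizeof(fm));
	fm.fm_length = FIEMAP_MAX_OFFSET;   /* whole file */
	fm.fm_extent_count = 0;             /* only count, do not return extents */

	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) { perror("FIEMAP"); return 1; }
	printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
	close(fd);
	return 0;
}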


> Then on next stage try to apply some number of update queries and
> after reboot the system or drop all caches. and repeat the same set of
> selects.
> After this all what you need to do is compare distribution of the
> latencies.

Which tool to use to measure which latencies?

Speaking of latencies: What's of interest here is perceived
performance resulting mostly from seek overhead (except probably in the
journal file case, which just overwhelms everything by the pure amount of
extents). I'm not sure if measuring VFS latencies would provide any
useful insights here. VFS probably works fast enough still in this
case.


> > It really doesn't matter if some big file is laid out in 1
> > allocation of 1 GB or in 250 allocations of 4MB: It really doesn't
> > make a big difference.
> >
> > Recombining extents into bigger once, tho, can make a big
> > difference in an aging btrfs, even on SSDs.  
> 
> That it may be an issue with using extents.

I can't follow why you argue that a file with thousands of extents vs
a file of the same size but only a few extents would make no difference to
operate on. And of course this has to do with extents. But btrfs uses
extents. Do you suggest to use ZFS instead?

Due to how cow works, the effect would probably be less or barely
noticeable for writes, but read scanning through the file becomes slow
with clearly more "noise" from the moving heads.


> Again: please show some results of some test unit which anyone will be
> able to reply and confirm or not that this effect really exist.
> 
> If problem really exist and is related ot extents you should have real
> scenario explanation why ZFS is not using extents.

That was never the discussion. You brought in the ZFS point. I read
about the design reasoning behind ZFS when it appeared and started gain
public interest years back.


> btrfs is not to far from classic approach do FS because it srill uses
> allocation structures.
> This is not the case in context of ZFS because this technology has no
> information about what is already allocates.

What about btrfs free space tree? Isn't that more or less the same? But
I don't believe that makes a significant difference for desktop-sized
storages. I think introduction of free space tree was due to
performance of many-TB file systems up to 

Re: defragmenting best practice?

2017-09-14 Thread Austin S. Hemmelgarn

On 2017-09-14 13:48, Tomasz Kłoczko wrote:

On 14 September 2017 at 16:24, Kai Krakow  wrote:
[..]

Getting e.g. boot files into read order or at least nearby improves
boot time a lot. Similar for loading applications.


By how much it is possible to improve boot time?
Just please some example which I can try to replay which ill be
showing that we have similar results.
I still have one one of my laptops with spindle on btrfs root fs ( and
no other FSess in use) so I could be able to confirm that my numbers
are enough close to your numbers.
While it's not for BTRFS, a tool called e4rat might be of interest to 
you regarding this.  It reorganizes files on an ext4 filesystem so that 
stuff used by the boot loader is right at the beginning of the device, 
and I've known people to get insane performance improvements (on the 
order of 20x in some pathologically bad cases) in the time taken from 
the BIOS handing things off to GRUB to GRUB handing execution off to the 
kernel.



Shake tries to
improve this by rewriting the files - and this works because file
systems (given enough free space) already do a very good job at doing
this. But constant system updates degrade this order over time.


OK. Please prepare some database, import some data which size will be
few times of not used RAM (best if this multiplication factor will be
at least 10). Then do some batch of selects measuring distribution
latencies of those queries.
This will give you some data about. not fragmented data.
Then on next stage try to apply some number of update queries and
after reboot the system or drop all caches. and repeat the same set of
selects.
After this all what you need to do is compare distribution of the latencies.


It really doesn't matter if some big file is laid out in 1 allocation
of 1 GB or in 250 allocations of 4MB: It really doesn't make a big
difference.

Recombining extents into bigger once, tho, can make a big difference in
an aging btrfs, even on SSDs.


That it may be an issue with using extents.
Again: please show some results of some test unit which anyone will be
able to reply and confirm or not that this effect really exist.
This shouldn't need examples.  It's trivial math combined with basic 
knowledge of hardware behavior.  Every request to a device has a minimum 
amount of overhead.  On traditional hard drives, this is usually 
dominated by seek latency, but on SSD's, the request setup, dispatch, 
and completion are the dominant factor.  Assuming you have a 2 
micro-second overhead per-request (not an exact number, just chosen for 
demonstration purposes because it makes the math easy), and a 1GB file, 
the time difference between reading ten 100MB extents and reading ten 
thousand 100kB extents is just short of 0.02 seconds, or a factor of 
about one thousand (which, no surprise here, is the factor of difference 
between the number of extents).
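
Spelled out, with the 2 micro-second figure kept purely as the illustrative
assumption from the paragraph above:

#include <stdio.h>

int main(void)
{
	const double per_request_s = 2e-6;   /* assumed fixed per-request overhead */
	const long extents_few  = 10;        /* 10 x 100 MB  = 1 GB */
	const long extents_many = 10000;     /* 10000 x 100 kB = 1 GB */

	double few  = extents_few  * per_request_s;    /* 0.00002 s */
	double many = extents_many * per_request_s;    /* 0.02 s    */

	printf("request overhead: %.5f s vs %.5f s (factor %.0fx)\n",
	       few, many, (double)extents_many / extents_few);
	return 0;
}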


If problem really exist and is related ot extents you should have real
scenario explanation why ZFS is not using extents.
Extents have nothing to do with it.  What matters is how much of the 
file data is contiguous (and therefore can be read as a single request) 
and how smart the FS is about figuring that out.  Extents help figure 
that out, but the primary reason to use them is to save space encoding 
block allocations within a file (go take a look at how ext2 handles 
allocations, and then compare that to ext4, the difference is insane in 
terms of space savings).

btrfs is not to far from classic approach do FS because it srill uses
allocation structures.
This is not the case in context of ZFS because this technology has no
information about what is already allocates.
ZFS uses free lists so by negation whatever is not on free list is
already allocated.
I'm not trying to point that ZFS is better but only point that by
changing allocation strategy you may not be blasted by something like
some extents bottleneck (which sill needs to be proven)

There are at least few very good reason why it is even necessary to
change sometimes strategy from allocations structures to free lists.
First: ZFS free list management is very similar to known from Linux
memory SLAB allocator.
Did you heard that someone needs to do system memory defragnentation
because fragmented memory adds some additional latency to memory
access?
Other consequence is that with growing size of the files and number of
files or directories FS metadata are growing exponentially with size
and numbers of such objects. In case of free lists there is no such
growth and all structures are growing with linear correlation.
Caching in memory free list data takes much less than caching b-trees.
Last thing is effort on deallocating something in FS with allocation
structure and with free lists.
In classic approach number of such operations is growing with depth of b-trees.
In case free list all hat you need to do is compare ctime of the
allocated block with volume or snapshot ctime to make 

Re: [PATCH 02/15] btrfs: Use pagevec_lookup_range_tag()

2017-09-14 Thread David Sterba
On Thu, Sep 14, 2017 at 03:18:06PM +0200, Jan Kara wrote:
> We want only pages from given range in btree_write_cache_pages() and
> extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of
> pagevec_lookup_tag() and remove unnecessary code.
> 
> CC: linux-btrfs@vger.kernel.org
> CC: David Sterba 
> Signed-off-by: Jan Kara 

Reviewed-by: David Sterba 


Re: defragmenting best practice?

2017-09-14 Thread Tomasz Kłoczko
On 14 September 2017 at 16:24, Kai Krakow  wrote:
[..]
> Getting e.g. boot files into read order or at least nearby improves
> boot time a lot. Similar for loading applications.

By how much is it possible to improve boot time?
Just give some example which I can try to replay, so it can be
shown that we get similar results.
I still have one of my laptops with a spindle on a btrfs root fs (and
no other FSes in use) so I would be able to confirm that my numbers
are close enough to your numbers.

> Shake tries to
> improve this by rewriting the files - and this works because file
> systems (given enough free space) already do a very good job at doing
> this. But constant system updates degrade this order over time.

OK. Please prepare some database, import some data whose size is a few
times the unused RAM (best if this multiplication factor is at least 10).
Then do some batch of selects, measuring the latency distribution of those
queries.
This will give you some data about non-fragmented data.
Then, in the next stage, apply some number of update queries, reboot the
system or drop all caches, and repeat the same set of selects.
After this, all you need to do is compare the latency distributions.

> It really doesn't matter if some big file is laid out in 1 allocation
> of 1 GB or in 250 allocations of 4MB: It really doesn't make a big
> difference.
>
> Recombining extents into bigger once, tho, can make a big difference in
> an aging btrfs, even on SSDs.

That may be an issue with using extents.
Again: please show some results from some test unit which anyone will be
able to replay, to confirm or not that this effect really exists.

If the problem really exists and is related to extents, you should have a
real-scenario explanation of why ZFS is not using extents.
btrfs is not too far from the classic approach to FS design because it still
uses allocation structures.
This is not the case for ZFS, because this technology has no
information about what is already allocated.
ZFS uses free lists, so by negation whatever is not on the free list is
already allocated.
I'm not trying to claim that ZFS is better, but only to point out that by
changing the allocation strategy you may not be hit by something like
an extents bottleneck (which still needs to be proven).

There are at least a few very good reasons why it is sometimes necessary to
change strategy from allocation structures to free lists.
First: ZFS free list management is very similar to the Linux kernel
memory SLAB allocator.
Did you ever hear that someone needs to do system memory defragmentation
because fragmented memory adds some additional latency to memory
access?
Another consequence is that, with growing file sizes and numbers of
files or directories, FS metadata grows exponentially with the size
and number of such objects. In the case of free lists there is no such
growth, and all structures grow with linear correlation.
Caching free list data in memory takes much less than caching b-trees.
The last thing is the effort of deallocating something in an FS with
allocation structures versus with free lists.
In the classic approach the number of such operations grows with the depth of
the b-trees.
In the case of a free list, all you need to do is compare the ctime of the
allocated block with the volume or snapshot ctime to make the decision about
returning the block to the free list or not.
No matter how many snapshots, volumes, files or directories, it will always
be *just one compare* of the block and the vol/snapshot ctime.
With the necessity to do just one compare comes much more predictable
behavior of the whole FS and simplicity of the code making such decisions.
In other words, ZFS internally uses the well-known SLAB allocator, caching
some data about the best possible locations for allocation units of sizes
multiplied by powers of two, like you can see on Linux in /proc/slabinfo in
the case of the *kmalloc* SLABs.
This is why in the case of ZFS the number of volumes and snapshots has zero
impact on the average speed of interactions through the VFS layer.
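
The "just one compare" rule can be sketched roughly like this (a hypothetical
illustration of the idea only; ZFS actually compares block birth transaction
groups rather than literal ctimes):

#include <stdint.h>

struct block_ptr {
	uint64_t birth;   /* when this block was written */
};

/* A freed block may go straight back to the free list only if it was born
 * after the most recent snapshot; otherwise a snapshot may still reference
 * it and it must be kept. */
static int can_return_to_free_list(const struct block_ptr *bp,
				   uint64_t latest_snapshot_birth)
{
	return bp->birth > latest_snapshot_birth;   /* exactly one compare */
}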

If you are able to present a real impact of fragmentation (again,
*if*), this may trigger other actions.
So far, AFAIK, no one has been able to deliver real numbers or scenarios
about such an impact.
And *if* such an impact really exists, one of the solutions may be to just
mimic what ZFS is doing (maybe there are other solutions).

So please show us a test unit exposing the problem, with a measurement
methodology, presenting the pathology related to fragmentation.

> Bees is, btw, not about defragmentation: I have some OS containers
> running and I want to deduplicate data after updates.

Deduplication done in userspace has natural consequences in the form of
security issues.
An executable doing such things needs full access to everything and
needs some API/ABI exposed that allows it to fiddle with the content of the
btrfs, which adds a second batch of security-related risks.

Try to have a look at how deduplication works in the case of ZFS without
offline deduplication.

>> In other words if someone 

Re: [PATCH] Btrfs: fix confusing worker helper info

2017-09-14 Thread David Sterba
On Wed, Sep 13, 2017 at 12:09:28PM -0600, Liu Bo wrote:
> We've seen the following backtrace stack in ftrace or dmesg log,
> 
>   kworker/u16:10-4244  [000] 241942.480955: function: 
> btrfs_put_ordered_extent
>   kworker/u16:10-4244  [000] 241942.480956: kernel_stack: <stack trace>
> => finish_ordered_fn (a0384475)
> => btrfs_scrubparity_helper (a03ca577)<-"incorrect"
> => btrfs_freespace_write_helper (a03ca98e)<-"correct"
> => process_one_work (81117b2f)
> => worker_thread (81118c2a)
> => kthread (81121de0)
> => ret_from_fork (81d7087a)
> 
> btrfs_freespace_write_helper is actually calling normal_worker_helper
> instead of btrfs_scrubparity_helper, so somehow kernel has parsed the
> incorrect function address while unwinding the stack,
> btrfs_scrubparity_helper really shouldn't be shown up.
> 
> It's caused by compiler doing inline for our helper function, adding a
> noinline tag can fix that.
> 
> Signed-off-by: Liu Bo 
> cc: David Sterba 

Ok, understood now, thanks. I suggest to use noinline_for_stack, that is
made exactly for this situation (I'll change it so you don't need to
resend).

Reviewed-by: David Sterba 
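
The effect involved here can also be illustrated in userspace (a sketch, not
the btrfs code): once the compiler inlines a helper, its frame disappears from
the unwound stack and the reported call chain becomes misleading, while a
noinline helper keeps showing up.

#include <execinfo.h>
#include <stdio.h>

static void dump_stack_here(void)
{
	void *frames[16];
	int n = backtrace(frames, 16);
	backtrace_symbols_fd(frames, n, 1);      /* print to stdout */
}

static void helper_maybe_inlined(void) { dump_stack_here(); }

static __attribute__((noinline)) void helper_kept(void) { dump_stack_here(); }

int main(void)
{
	puts("-- helper the compiler is free to inline:");
	helper_maybe_inlined();
	puts("-- noinline helper:");
	helper_kept();
	return 0;
}

Built with optimization (e.g. -O2), the first trace typically goes straight
from main() to dump_stack_here(), which is the same kind of confusion as the
btrfs_scrubparity_helper frame in the backtrace above.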


Re: [PATCH 2/2 v2] Btrfs: remove bio_flags which indicates a meta block of log-tree

2017-09-14 Thread David Sterba
On Wed, Sep 13, 2017 at 12:18:22PM -0600, Liu Bo wrote:
> Since both committing transaction and writing log-tree are doing
> plugging on metadata IO, we can unify to use %sync_writers to benefit
> both cases, instead of checking bio_flags while writing meta blocks of
> log-tree.
> 
> We can remove this bio_flags because in order to write dirty blocks,
> log tree also uses btrfs_write_marked_extents(), inside which we
> has enabled %sync_writers, therefore, every write goes in a
> synchronous way, so does checksuming.
> 
> Please also note that, bio_flags is applied per-context while
> %sync_writers is applied per-inode, so this might incur some overhead, ie.
> 
> 1) while log tree is flushing its dirty blocks via
>btrfs_write_marked_extents(), in which %sync_writers is increased
>by one.
> 
> 2) in the meantime, some writeback operations may happen upon btrfs's
>metadata inode, so these writes go synchronously, too.
> 
> However, AFAICS, the overhead is not a big one while the win is that
> we unify the two places that needs synchronous way and remove a
> special hack/flag.
> 
> This removes the bio_flags related stuff for writing log-tree.
> 
> Signed-off-by: Liu Bo 

Much better, thanks.

Reviewed-by: David Sterba 


Re: defragmenting best practice?

2017-09-14 Thread Kai Krakow
On Thu, 14 Sep 2017 17:24:34 +0200,
Kai Krakow wrote:

Errors corrected, see below...


> On Thu, 14 Sep 2017 14:31:48 +0100,
> Tomasz Kłoczko wrote:
> 
> > On 14 September 2017 at 12:38, Kai Krakow 
> > wrote: [..]  
> > >
> > > I suggest you only ever defragment parts of your main subvolume or
> > > rely on autodefrag, and let bees do optimizing the snapshots.  
> 
> Please read that again including the parts you omitted.
> 
> 
> > > Also, I experimented with adding btrfs support to shake, still
> > > working on better integration but currently lacking time... :-(
> > >
> > > Shake is an adaptive defragger which rewrites files. With my
> > > current patches it clones each file, and then rewrites it to its
> > > original location. This approach is currently not optimal as it
> > > simply bails out if some other process is accessing the file and
> > > leaves you with an (intact) temporary copy you need to move back
> > > in place manually.
> > 
> > If you really want to have real and *ideal* distribution of the data
> > across physical disk first you need to build time travel device.
> > This device will allow you to put all blocks which needs to be read
> > in perfect order (to read all data only sequentially without seek).
> > However it will be working only in case of spindles because in case
> > of SSDs there is no seek time.
> > Please let us know when you will write drivers/timetravel/ Linux
> > kernel driver. When such driver will be available I promise I'll
> > write all necessary btrfs code by myself in matter of few days (it
> > will be piece of cake compare to build such device).
> > 
> > But seriously ..  
> 
> Seriously: Defragmentation on spindles is IMHO not about getting the
> perfect continuous allocation but providing better spatial layout of
> the files you work with.
> 
> Getting e.g. boot files into read order or at least nearby improves
> boot time a lot. Similar for loading applications. Shake tries to
> improve this by rewriting the files - and this works because file
> systems (given enough free space) already do a very good job at doing
> this. But constant system updates degrade this order over time.
> 
> It really doesn't matter if some big file is laid out in 1 allocation
> of 1 GB or in 250 allocations of 4MB: It really doesn't make a big
> difference.
> 
> Recombining extents into bigger once, tho, can make a big difference
> in an aging btrfs, even on SSDs.
> 
> Bees is, btw, not about defragmentation: I have some OS containers
> running and I want to deduplicate data after updates. It seems to do a
> good job here, better than other deduplicators I found. And if some
> defrag tools destroyed your snapshot reflinks, bees can also help
> here. On its way it may recombine extents so it may improve
> fragmentation. But usually it probably defragments because it needs
 ^^^
It fragments!

> to split extents that a defragger combined.
> 
> But well, I think getting 100% continuous allocation is really not the
> achievement you want to get, especially when reflinks are a primary
> concern.
> 
> 
> > Only context/scenario when you may want to lower defragmentation is
> > when you are something needs to allocate continuous area lower than
> > free space and larger than largest free chunk. Something like this
> > happens only when volume is working on almost 100% allocated space.
> > In such scenario even you bees cannot do to much as it may be not
> > enough free space to move some other data in larger chunks to
> > defragment FS physical space.  
> 
> Bees does not do that.
> 
> 
> > If your workload will be still writing
> > new data to FS such defragmentation may give you (maybe) few more
> > seconds and just after this FS will be 100% full,
> > 
> > In other words if someone is thinking that such defragmentation
> > daemon is solving any problems he/she may be 100% right .. such
> > person is only *thinking* that this is truth.  
> 
> Bees is not about that.
> 
> 
> > kloczek
> > PS. Do you know first McGyver rule? -> "If it ain't broke, don't fix
> > it".  
> 
> Do you know the saying "think first, then act"?
> 
> 
> > So first show that fragmentation is hurting latency of the
> > access to btrfs data and it will be possible to measurable such
> > impact. Before you will start measuring this you need to learn how o
> > sample for example VFS layer latency. Do you know how to do this to
> > deliver such proof?  
> 
> You didn't get the point. You only read "defragmentation" and your
> alarm lights lit up. You even think bees would be a defragmenter. It
> probably is more the opposite because it introduces more fragments in
> exchange for more reflinks.
> 
> 
> > PS2. The same "discussions" about fragmentation where in the past
> > about +10 years ago after ZFS has been introduced. Just to let you
> > know that after initial ZFS introduction up to now was 

Re: snapshots of encrypted directories?

2017-09-14 Thread Hugo Mills
On Thu, Sep 14, 2017 at 04:57:39PM +0200, Ulli Horlacher wrote:
> I use encfs on top of btrfs.
> I can create btrfs snapshots, but I have no meaningful access to the files
> in these snapshots, because they look like:
> 
> drwx--  framstag users- 2017-09-08 11:47:18 
> uHjprldmxo3-nSfLmcH54HMW
> drwxr-xr-x  framstag users- 2017-09-08 11:47:18 
> wNEWaDCgyXTj0d-Myk8wXZfh
> -rw-r--r--  framstag users  377 2015-06-12 14:02:53 
> -zDmc7xfobKDkbl8z7oKOHxv
> -rw-r--r--  framstag users2,367 2012-07-10 14:32:30 
> 7pfKs27K9k5zANE4WOQEuFa2
> -rw---  framstag users  692 2009-10-20 13:45:41 
> 8SQElYCph85kDdcFasUHybVr
> -rw---  framstag users2,872 2017-08-31 16:21:52 
> bm,yNi1e4fsAClDv7lNxxSfJ
> lrwxrwxrwx  framstag users- 2017-06-01 15:53:00 
> GZxNYI0Gy96R18fz40f7k5rl -> 
> wvuQKHYzdFbar18fW6jjOerXk2IsS4OAA2fnHalBZjMQ,7Kw0j-zE3IJqxhmmGBN8G9
> -rw-r--r--  framstag users  182 2016-12-01 13:34:31 
> rqtNBbiYDym0hPMbBL-VLJZcFZu6nkNxlsjTX-sU88I4I1
> 
> I have to mount the snapshot with encfs, to have access to the (decrypted)
> files. 
> 
> Any better ideas?

   I'd say it's doing exactly what it should be doing. You're making a
copy of an encrypted data store, and the result is encrypted. In order
to read it, it needs to have the decryption layer applied to it with
the correct key (which is the need to mount the snapshot with encfs).

   Would you _really_ want a system where the encrypted contents of a
subvolume can be decrypted by simply snapshotting it?

   Hugo.

-- 
Hugo Mills | Great films about cricket: Umpire of the Rising Sun
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: defragmenting best practice?

2017-09-14 Thread Kai Krakow
On Thu, 14 Sep 2017 14:31:48 +0100,
Tomasz Kłoczko wrote:

> On 14 September 2017 at 12:38, Kai Krakow 
> wrote: [..]
> >
> > I suggest you only ever defragment parts of your main subvolume or
> > rely on autodefrag, and let bees do optimizing the snapshots.

Please read that again including the parts you omitted.


> > Also, I experimented with adding btrfs support to shake, still
> > working on better integration but currently lacking time... :-(
> >
> > Shake is an adaptive defragger which rewrites files. With my current
> > patches it clones each file, and then rewrites it to its original
> > location. This approach is currently not optimal as it simply bails
> > out if some other process is accessing the file and leaves you with
> > an (intact) temporary copy you need to move back in place
> > manually.  
> 
> If you really want to have real and *ideal* distribution of the data
> across physical disk first you need to build time travel device. This
> device will allow you to put all blocks which needs to be read in
> perfect order (to read all data only sequentially without seek).
> However it will be working only in case of spindles because in case of
> SSDs there is no seek time.
> Please let us know when you will write drivers/timetravel/ Linux
> kernel driver. When such driver will be available I promise I'll
> write all necessary btrfs code by myself in matter of few days (it
> will be piece of cake compare to build such device).
> 
> But seriously ..

Seriously: Defragmentation on spindles is IMHO not about getting the
perfect continuous allocation but providing better spatial layout of
the files you work with.

Getting e.g. boot files into read order or at least nearby improves
boot time a lot. Similar for loading applications. Shake tries to
improve this by rewriting the files - and this works because file
systems (given enough free space) already do a very good job at doing
this. But constant system updates degrade this order over time.

It really doesn't matter if some big file is laid out in 1 allocation
of 1 GB or in 250 allocations of 4MB: It really doesn't make a big
difference.

Recombining extents into bigger once, tho, can make a big difference in
an aging btrfs, even on SSDs.

Bees is, btw, not about defragmentation: I have some OS containers
running and I want to deduplicate data after updates. It seems to do a
good job here, better than other deduplicators I found. And if some
defrag tools destroyed your snapshot reflinks, bees can also help here.
On its way it may recombine extents so it may improve fragmentation.
But usually it probably defragments because it needs to split extents
that a defragger combined.

But well, I think getting 100% continuous allocation is really not the
achievement you want to get, especially when reflinks are a primary
concern.


> Only context/scenario when you may want to lower defragmentation is
> when you are something needs to allocate continuous area lower than
> free space and larger than largest free chunk. Something like this
> happens only when volume is working on almost 100% allocated space.
> In such scenario even you bees cannot do to much as it may be not
> enough free space to move some other data in larger chunks to
> defragment FS physical space.

Bees does not do that.


> If your workload will be still writing
> new data to FS such defragmentation may give you (maybe) few more
> seconds and just after this FS will be 100% full,
> 
> In other words if someone is thinking that such defragmentation daemon
> is solving any problems he/she may be 100% right .. such person is
> only *thinking* that this is truth.

Bees is not about that.


> kloczek
> PS. Do you know first McGyver rule? -> "If it ain't broke, don't fix
> it".

Do you know the saying "think first, then act"?


> So first show that fragmentation is hurting latency of the
> access to btrfs data and it will be possible to measurable such
> impact. Before you will start measuring this you need to learn how o
> sample for example VFS layer latency. Do you know how to do this to
> deliver such proof?

You didn't get the point. You only read "defragmentation" and your
alarm lights lit up. You even think bees would be a defragmenter. It
probably is more the opposite because it introduces more fragments in
exchange for more reflinks.


> PS2. The same "discussions" about fragmentation where in the past
> about +10 years ago after ZFS has been introduced. Just to let you
> know that after initial ZFS introduction up to now was not written
> even single line of ZFS code to handle active fragmentation and no one
> been able to prove that something about active defragmentation needs
> to be done in case of ZFS.

Btrfs has autodefrag to reduce the number of fragments by rewriting
small portions of the file being written to. This is needed, otherwise
the feature won't be there. Why? Have you tried working with 1GB files
broken into 

snapshots of encrypted directories?

2017-09-14 Thread Ulli Horlacher
I use encfs on top of btrfs.
I can create btrfs snapshots, but I have no meaningful access to the files
in these snapshots, because they look like:

drwx--  framstag users- 2017-09-08 11:47:18 uHjprldmxo3-nSfLmcH54HMW
drwxr-xr-x  framstag users- 2017-09-08 11:47:18 wNEWaDCgyXTj0d-Myk8wXZfh
-rw-r--r--  framstag users  377 2015-06-12 14:02:53 -zDmc7xfobKDkbl8z7oKOHxv
-rw-r--r--  framstag users2,367 2012-07-10 14:32:30 7pfKs27K9k5zANE4WOQEuFa2
-rw---  framstag users  692 2009-10-20 13:45:41 8SQElYCph85kDdcFasUHybVr
-rw---  framstag users2,872 2017-08-31 16:21:52 bm,yNi1e4fsAClDv7lNxxSfJ
lrwxrwxrwx  framstag users- 2017-06-01 15:53:00 
GZxNYI0Gy96R18fz40f7k5rl -> 
wvuQKHYzdFbar18fW6jjOerXk2IsS4OAA2fnHalBZjMQ,7Kw0j-zE3IJqxhmmGBN8G9
-rw-r--r--  framstag users  182 2016-12-01 13:34:31 
rqtNBbiYDym0hPMbBL-VLJZcFZu6nkNxlsjTX-sU88I4I1

I have to mount the snapshot with encfs, to have access to the (decrypted)
files. 

Any better ideas?

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<20170914145739.ga32...@rus.uni-stuttgart.de>


Re: defragmenting best practice?

2017-09-14 Thread Tomasz Kłoczko
On 14 September 2017 at 12:38, Kai Krakow  wrote:
[..]
>
> I suggest you only ever defragment parts of your main subvolume or rely
> on autodefrag, and let bees do optimizing the snapshots.
>
> Also, I experimented with adding btrfs support to shake, still working
> on better integration but currently lacking time... :-(
>
> Shake is an adaptive defragger which rewrites files. With my current
> patches it clones each file, and then rewrites it to its original
> location. This approach is currently not optimal as it simply bails out
> if some other process is accessing the file and leaves you with an
> (intact) temporary copy you need to move back in place manually.

If you really want to have a real and *ideal* distribution of the data
across the physical disk, first you need to build a time travel device. This
device would allow you to put all blocks which need to be read in
perfect order (to read all data only sequentially, without seeks).
However, it would only work in the case of spindles, because in the case of
SSDs there is no seek time.
Please let us know when you will write the drivers/timetravel/ Linux kernel driver.
When such a driver is available I promise I'll write all the necessary
btrfs code myself in a matter of a few days (it will be a piece of cake
compared to building such a device).

But seriously ..
The only context/scenario where you may want to reduce fragmentation is
when something needs to allocate a contiguous area smaller than the
free space and larger than the largest free chunk. Something like this
happens only when the volume is working at almost 100% allocated space.
In such a scenario even your bees cannot do too much, as there may not be
enough free space to move some other data in larger chunks to
defragment the FS physical space. If your workload keeps writing
new data to the FS, such defragmentation may give you (maybe) a few more
seconds and just after this the FS will be 100% full.

In other words, if someone is thinking that such a defragmentation daemon
is solving any problems, he/she may be 100% right .. such a person is
only *thinking* that this is the truth.

kloczek
PS. Do you know the first MacGyver rule? -> "If it ain't broke, don't fix it".
So first show that fragmentation is hurting the latency of access to
btrfs data, and that it is possible to measure such an impact.
Before you start measuring this you need to learn how to sample,
for example, VFS layer latency. Do you know how to do this to deliver
such proof?
PS2. The same "discussions" about fragmentation happened in the past,
10+ years ago, after ZFS was introduced. Just to let you
know: from the initial ZFS introduction up to now, not a single line of
ZFS code has been written to handle active defragmentation, and no one
has been able to prove that something about active defragmentation needs
to be done in the case of ZFS.
Why? Because it all stands on the shoulders of a clever enough *allocation
algorithm*. Only this and nothing more.
PS3. Please can we stop this/EOT?
--
Tomasz Kłoczko | LinkedIn: http://lnkd.in/FXPWxH


[PATCH 02/15] btrfs: Use pagevec_lookup_range_tag()

2017-09-14 Thread Jan Kara
We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.

CC: linux-btrfs@vger.kernel.org
CC: David Sterba 
Signed-off-by: Jan Kara 
---
 fs/btrfs/extent_io.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0f077c5db58e..9b7936ea3a88 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3819,8 +3819,8 @@ int btree_write_cache_pages(struct address_space *mapping,
if (wbc->sync_mode == WB_SYNC_ALL)
tag_pages_for_writeback(mapping, index, end);
while (!done && !nr_to_write_done && (index <= end) &&
-  (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+  (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE))) {
unsigned i;
 
scanned = 1;
@@ -3830,11 +3830,6 @@ int btree_write_cache_pages(struct address_space *mapping,
if (!PagePrivate(page))
continue;
 
-   if (!wbc->range_cyclic && page->index > end) {
-   done = 1;
-   break;
-   }
-
spin_lock(&mapping->private_lock);
if (!PagePrivate(page)) {
spin_unlock(&mapping->private_lock);
@@ -3966,8 +3961,8 @@ static int extent_write_cache_pages(struct address_space *mapping,
tag_pages_for_writeback(mapping, index, end);
done_index = index;
while (!done && !nr_to_write_done && (index <= end) &&
-  (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+  (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE))) {
unsigned i;
 
scanned = 1;
@@ -3992,12 +3987,6 @@ static int extent_write_cache_pages(struct address_space *mapping,
continue;
}
 
-   if (!wbc->range_cyclic && page->index > end) {
-   done = 1;
-   unlock_page(page);
-   continue;
-   }
-
if (wbc->sync_mode != WB_SYNC_NONE) {
if (PageWriteback(page))
flush_fn(data);
-- 
2.12.3



Re: [PATCH] Btrfs: do not backup tree roots when fsync

2017-09-14 Thread David Sterba
On Thu, Sep 14, 2017 at 09:55:48AM +0800, Qu Wenruo wrote:
> 
> 
> On 2017年09月14日 02:25, Liu Bo wrote:
> > It doesn't make sense to back up tree roots when doing fsync, since
> > during fsync those tree roots are not yet consistent on disk.
> > 
> > Signed-off-by: Liu Bo 
> 
> Reviewed-by: Qu Wenruo 
> 
> With a nit that can be improved.
> > ---
> >   fs/btrfs/disk-io.c | 9 -
> >   1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 79ac228..a145a88 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -3668,7 +3668,14 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
> > u64 flags;
> >   
> > do_barriers = !btrfs_test_opt(fs_info, NOBARRIER);
> > -   backup_super_roots(fs_info);
> > +
> > +   /*
> > +* max_mirrors == 0 indicates we're from commit_transaction,
> > +* not from fsync where the tree roots in fs_info have not
> > +* been consistent on disk.
> > +*/
> > +   if (max_mirrors == 0)
> > +   backup_super_roots(fs_info);
> 
> BTW, the @max_mirrors naming here is really confusing.
> Normally I would expect max_mirrors == 0 means we don't need to backup 
> super roots...

Agreed it's confusing, could be something like "bool write_backups" (in a
separate patch).


Re: [PATCH 2/2] btrfs: build: omit unnecessary -MD flag

2017-09-14 Thread David Sterba
On Thu, Sep 14, 2017 at 07:10:56PM +0900, Naohiro Aota wrote:
> According to gcc(1), "-MD is equivalent to -M -MF file, except that -E is not
> implied." Since the rule in the Makefile only generates the dependency file
> and does not build the object file, there is no point in having "-MD" here.
> Also, it's overridden by and conflicts with the following "-MM" flag. I guess
> we can drop it.
> 
> Signed-off-by: Naohiro Aota 

Applied, thanks.


Re: [PATCH 1/2] btrfs-progs: build: generate all dependency files

2017-09-14 Thread David Sterba
On Thu, Sep 14, 2017 at 07:10:46PM +0900, Naohiro Aota wrote:
> We're missing several dependency files like:
> 
> $ diff -u <(find -name '*.o'|cut -d. -f2|sort) <(find -name '*.o.d'|cut -d. 
> -f2|sort)
> --- /proc/self/fd/11    2017-09-14 18:17:44.460564620 +0900
> +++ /proc/self/fd/12    2017-09-14 18:17:44.460564620 +0900

Please note that an actual diff in the changelog is understood as the
start of the patch by git-am; indenting the --- or +++ lines makes it
work again.

> @@ -3,7 +3,6 @@
>  /btrfs-corrupt-block
>  /btrfs-debug-tree
>  /btrfs-find-root
> -/btrfs-list
>  /btrfs-map-logical
>  /btrfs-select-super
>  /btrfstune
> @@ -29,11 +28,6 @@
>  /cmds-scrub
>  /cmds-send
>  /cmds-subvolume
> -/convert/common
> -/convert/main
> -/convert/source-ext2
> -/convert/source-fs
> -/convert/source-reiserfs
>  /ctree
>  /dir-item
>  /disk-io
> 
> 
> This is due to moving things out of the objects and cmds_objects
> variables. Such missing dependency files cause mis-building of some
> source files (try touch utils.h; make mkfs/main.o).
> 
> This patch introduces a new variable "all_objects" to hold all the
> objects and uses it to generate the proper dependency file build rules.
> 
> Signed-off-by: Naohiro Aota 

Applied, thanks.


Re: defragmenting best practice?

2017-09-14 Thread Austin S. Hemmelgarn

On 2017-09-14 03:54, Duncan wrote:

> Austin S. Hemmelgarn posted on Tue, 12 Sep 2017 13:27:00 -0400 as
> excerpted:
> 
>> The tricky part though is that differing workloads are impacted
>> differently by fragmentation.  Using just four generic examples:
>>
>> * Mostly sequential write focused workloads (like security recording
>> systems) tend to be impacted by free space fragmentation more than data
>> fragmentation.  Balancing filesystems used for such workloads is likely
>> to give a noticeable improvement, but defragmenting probably won't give
>> much.
>> * Mostly sequential read focused workloads (like a streaming media
>> server)
>> tend to be the most impacted by data fragmentation, but aren't generally
>> impacted by free space fragmentation.  As a result, defrag will help
>> here a lot, but balance won't as much.
>> * Mostly random write focused workloads (like most database systems or
>> virtual machines) are often impacted by both free space and data
>> fragmentation, and are a pathological case for CoW filesystems.  Balance
>> and defrag will help here, but they won't help for long.
>> * Mostly random read focused workloads (like most non-multimedia desktop
>> usage) are not impacted much by either aspect, but if you're on a
>> traditional hard drive they can be impacted significantly by how the
>> data is spread across the disk.  Balance can help here, but only because
>> it improves data locality, not because it compacts free space.
> 
> This is a very useful analysis, particularly given the examples.  Maybe
> put it on the wiki under the defrag discussion?  (Assuming something like
> it isn't already there.  I've not looked in awhile.)

I've actually been meaning to write up something more thoroughly about 
this online (probably as a Gist).  When I finally get around to that 
(probably in the next few weeks), I'll try to make sure a link ends up 
on the defrag page on the wiki.
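
Until then, a quick way to check whether data fragmentation is even
present for a given file is to count its extents.  The sketch below is
a minimal C example of the FIEMAP ioctl (roughly the same information
filefrag already prints); the file to inspect is passed on the command
line:

/* Hedged sketch: count how many extents a file is split into via the
 * FIEMAP ioctl, as a quick fragmentation check before deciding whether
 * a defrag run is worth it. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct fiemap fm;
	memset(&fm, 0, sizeof(fm));
	fm.fm_start = 0;
	fm.fm_length = FIEMAP_MAX_OFFSET;
	fm.fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc so the map is current */
	fm.fm_extent_count = 0;		/* 0: only report how many extents exist */

	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
		perror("FIEMAP");
		close(fd);
		return 1;
	}
	printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
	close(fd);
	return 0;
}

A file with a handful of extents won't gain much from defrag regardless
of workload; one split into thousands of small extents is a candidate
for the sequential-read case above.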



Re: defragmenting best practice?

2017-09-14 Thread Kai Krakow
Am Tue, 12 Sep 2017 18:28:43 +0200
schrieb Ulli Horlacher :

> On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
> > When I do a 
> > btrfs filesystem defragment -r /directory
> > does it defragment really all files in this directory tree, even if
> > it contains subvolumes?
> > The man page does not mention subvolumes on this topic.  
> 
> No answer so far :-(
> 
> But I found another problem in the man-page:
> 
>   Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as
> well as with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or
> >= 3.13.4 will break up the ref-links of COW data (for example files
> >copied with
>   cp --reflink, snapshots or de-duplicated data). This may cause
>   considerable increase of space usage depending on the broken up
>   ref-links.
> 
> I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
> snapshots.
> Therefore, I better should avoid calling "btrfs filesystem defragment
> -r"?
> 
> What is the defragmenting best practice?
> Avoid it completly?

You may want to try https://github.com/Zygo/bees. It is a daemon that
watches the filesystem generation changes, scans the new blocks, and
then recombines duplicates by re-sharing their extents. Of course, this
process somewhat defeats the purpose of defragging in the first place,
as it will undo some of the defragmenting.
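
The "recombining" step boils down to the kernel's extent-same (dedupe)
interface.  Purely as an illustration -- this is not bees' code, and a
real daemon also does the chunking and hashing that is skipped here --
a minimal C sketch that asks the kernel to dedupe one whole file
against another, assuming both already hold identical data:

/* Hedged sketch: dedupe all of <src> against <dst> via FIDEDUPERANGE,
 * so both files end up sharing the same extents.  File names come from
 * the command line. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>		/* FIDEDUPERANGE, struct file_dedupe_range */
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return 1;
	}

	int src = open(argv[1], O_RDONLY);
	int dst = open(argv[2], O_RDWR);
	struct stat st;

	if (src < 0 || dst < 0 || fstat(src, &st) < 0) {
		perror("open/fstat");
		return 1;
	}

	/* one destination range, covering the whole source file */
	struct file_dedupe_range *range =
		calloc(1, sizeof(*range) + sizeof(struct file_dedupe_range_info));
	if (!range)
		return 1;
	range->src_offset = 0;
	range->src_length = st.st_size;
	range->dest_count = 1;
	range->info[0].dest_fd = dst;
	range->info[0].dest_offset = 0;

	if (ioctl(src, FIDEDUPERANGE, range) < 0) {
		perror("FIDEDUPERANGE");
		return 1;
	}

	if (range->info[0].status == FILE_DEDUPE_RANGE_SAME)
		printf("deduped %llu bytes, now sharing extents\n",
		       (unsigned long long)range->info[0].bytes_deduped);
	else
		printf("ranges differ, nothing shared\n");

	free(range);
	close(src);
	close(dst);
	return 0;
}

Note that the kernel may clamp how much a single request dedupes, so
real tools loop over the file in block-aligned chunks.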

I suggest you only ever defragment parts of your main subvolume or rely
on autodefrag, and let bees optimize the snapshots.

Also, I experimented with adding btrfs support to shake, still working
on better integration but currently lacking time... :-(

Shake is an adaptive defragger which rewrites files. With my current
patches it clones each file, and then rewrites it to its original
location. This approach is currently not optimal as it simply bails out
if some other process is accessing the file and leaves you with an
(intact) temporary copy you need to move back in place manually.
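
To make the clone-then-rewrite idea concrete, here is a heavily
simplified sketch of it -- not shake's actual code.  It assumes a
reflink-capable filesystem (FICLONE), takes the file on the command
line, uses a naive temporary name, and does no locking against
concurrent writers or error cleanup:

/* Hedged sketch of clone-then-rewrite: reflink-clone the file to a
 * temporary name, then stream the clone's data back over the original
 * so new (ideally contiguous) extents get allocated. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FICLONE */
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	char tmp[4096];
	snprintf(tmp, sizeof(tmp), "%s.defrag-tmp", argv[1]);

	int orig = open(argv[1], O_RDWR);
	int clone = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (orig < 0 || clone < 0) {
		perror("open");
		return 1;
	}

	/* cheap CoW copy: the clone shares the original's (fragmented) extents */
	if (ioctl(clone, FICLONE, orig) < 0) {
		perror("FICLONE");
		return 1;
	}
	close(clone);

	/* rewrite the original from the clone; CoW allocates fresh extents */
	int from = open(tmp, O_RDONLY);
	char *buf = malloc(1 << 20);	/* 1 MiB chunks */
	ssize_t r;
	off_t off = 0;

	if (from < 0 || !buf)
		return 1;
	while ((r = pread(from, buf, 1 << 20, off)) > 0) {
		if (pwrite(orig, buf, r, off) != r) {
			perror("pwrite");
			return 1;
		}
		off += r;
	}
	fsync(orig);

	free(buf);
	close(from);
	close(orig);
	unlink(tmp);	/* old extents are released once the clone is gone */
	return 0;
}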

Shake works very well with the idea of detecting how fragmented, how
old, and how far away from an "ideal" position a file is, and exploits
standard Linux filesystem behavior to optimally place files by
rewriting them. It then records its status per file in extended
attributes. It also works with non-btrfs file systems. My patches try
to avoid defragging files with shared extents, so this may help your
situation. However, it will still shuffle files around if they are too
far from their ideal position, thus destroying shared extents. A future
patch could use extent recombining and skip shared extents in that
process. But first I'd like to clean out some of the rough edges
together with the original author of shake.

Look here: https://github.com/unbrice/shake and also check out the pull
requests and comments there. You shouldn't currently run shake
unattended, and you should only run it on specific parts of your FS
that you feel need defragmenting.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]

2017-09-14 Thread Marco Lorenzo Crociani

On 07/09/2017 16:43, Peter Becker wrote:

> 2017-09-07 16:37 GMT+02:00 Marco Lorenzo Crociani:
> [...]
>>
>> I got:
>>
>> 00-49:      1
>> 50-79:      0
>> 80-89:      0
>> 90-99:      1
>> 100:    25540
>>
>> does this mean that the fs has only one block group used under 50% and
>> one between 90 and 99%, while the rest are all full?
>>
> 
> yes .. imo, balance wouldn't help



Hi,
after
btrfs balance start -musage=50 /data/R6HW/
and
btrfs balance start -musage=99 /data/R6HW/

I wasn't able to reproduce those messages.

Regards,

--
Marco Crociani


[PATCH 1/2] btrfs-progs: build: generate all dependency files

2017-09-14 Thread Naohiro Aota
We're missing several dependency files like:

$ diff -u <(find -name '*.o'|cut -d. -f2|sort) <(find -name '*.o.d'|cut -d. 
-f2|sort)
--- /proc/self/fd/11    2017-09-14 18:17:44.460564620 +0900
+++ /proc/self/fd/12    2017-09-14 18:17:44.460564620 +0900
@@ -3,7 +3,6 @@
 /btrfs-corrupt-block
 /btrfs-debug-tree
 /btrfs-find-root
-/btrfs-list
 /btrfs-map-logical
 /btrfs-select-super
 /btrfstune
@@ -29,11 +28,6 @@
 /cmds-scrub
 /cmds-send
 /cmds-subvolume
-/convert/common
-/convert/main
-/convert/source-ext2
-/convert/source-fs
-/convert/source-reiserfs
 /ctree
 /dir-item
 /disk-io


This is due to moving things out of the objects and cmds_objects
variables. Such missing dependency files cause mis-building of some
source files (try touch utils.h; make mkfs/main.o).

This patch introduces a new variable "all_objects" to hold all the
objects and uses it to generate the proper dependency file build rules.

Signed-off-by: Naohiro Aota 
---
 Makefile | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index a114eca..c00dff6 100644
--- a/Makefile
+++ b/Makefile
@@ -121,6 +121,9 @@ libbtrfs_headers = send-stream.h send-utils.h send.h kernel-lib/rbtree.h btrfs-l
 convert_objects = convert/main.o convert/common.o convert/source-fs.o \
  convert/source-ext2.o convert/source-reiserfs.o
 mkfs_objects = mkfs/main.o mkfs/common.o
+image_objects = image/main.o
+all_objects = $(objects) $(cmds_objects) $(libbtrfs_objects) $(convert_objects) \
+  $(mkfs_objects) $(image_objects)
 
 TESTS = fsck-tests.sh convert-tests.sh
 
@@ -591,5 +594,5 @@ uninstall:
cd $(DESTDIR)$(bindir); $(RM) -f -- btrfsck fsck.btrfs $(progs_install)
 
 ifneq ($(MAKECMDGOALS),clean)
--include $(objects:.o=.o.d) $(cmds_objects:.o=.o.d) $(subst .btrfs,, $(filter-out btrfsck.o.d, $(progs:=.o.d)))
+-include $(all_objects:.o=.o.d) $(subst .btrfs,, $(filter-out btrfsck.o.d, $(progs:=.o.d)))
 endif



[PATCH 2/2] btrfs: build: omit unnecessary -MD flag

2017-09-14 Thread Naohiro Aota
According to gcc(1), "-MD is equivalent to -M -MF file, except that -E is not
implied." Since the rule in the Makefile only generates the dependency file
and does not build the object file, there is no point in having "-MD" here.
Also, it's overridden by and conflicts with the following "-MM" flag. I guess
we can drop it.

Signed-off-by: Naohiro Aota 
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index c00dff6..60c802a 100644
--- a/Makefile
+++ b/Makefile
@@ -264,7 +264,7 @@ else
 endif
 
 %.o.d: %.c
-   $(Q)$(CC) -MD -MM -MG -MF $@ -MT $(@:.o.d=.o) -MT $(@:.o.d=.static.o) -MT $@ $(CFLAGS) $<
+   $(Q)$(CC) -MM -MG -MF $@ -MT $(@:.o.d=.o) -MT $(@:.o.d=.static.o) -MT $@ $(CFLAGS) $<
 
 #
 # Pick from per-file variables, btrfs_*_cflags



Re: defragmenting best practice?

2017-09-14 Thread Duncan
Austin S. Hemmelgarn posted on Tue, 12 Sep 2017 13:27:00 -0400 as
excerpted:

> The tricky part though is that differing workloads are impacted
> differently by fragmentation.  Using just four generic examples:
> 
> * Mostly sequential write focused workloads (like security recording
> systems) tend to be impacted by free space fragmentation more than data
> fragmentation.  Balancing filesystems used for such workloads is likely
> to give a noticeable improvement, but defragmenting probably won't give
> much.
> * Mostly sequential read focused workloads (like a streaming media
> server)
> tend to be the most impacted by data fragmentation, but aren't generally
> impacted by free space fragmentation.  As a result, defrag will help
> here a lot, but balance won't as much.
> * Mostly random write focused workloads (like most database systems or
> virtual machines) are often impacted by both free space and data
> fragmentation, and are a pathological case for CoW filesystems.  Balance
> and defrag will help here, but they won't help for long.
> * Mostly random read focused workloads (like most non-multimedia desktop
> usage) are not impacted much by either aspect, but if you're on a
> traditional hard drive they can be impacted significantly by how the
> data is spread across the disk.  Balance can help here, but only because
> it improves data locality, not because it compacts free space.

This is a very useful analysis, particularly given the examples.  Maybe 
put it on the wiki under the defrag discussion?  (Assuming something like 
it isn't already there.  I've not looked in awhile.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
