Re: [PATCH v2 7/7] btrfs-progs: Doc/mkfs: Add extra condition for rootdir option

2017-09-15 Thread Qu Wenruo



On 2017-09-15 23:48, David Sterba wrote:

On Fri, Sep 15, 2017 at 09:24:19PM +0800, Qu Wenruo wrote:

I'm going to review & merge this series to devel. Tests and
documentation should be updated to make the use case clear.


I'm happy to address any comments, on both code and doc/test.


For the tests I'd like to see:

* with file target, try non-existent file, zero-length file, too-small
   file and large-enough


No problem.

But it would be better to define the expected behavior first.
I'll show the behavior difference below to see if you're OK with it.

For a non-existent file:
New: Fail with "no such file" at the very beginning
Old: Create a new one

Zero-length:
New: Fail with "file too small" at the very beginning
Old: Expand the file and continue mkfs

Too-small:
New: Fail with errno -28 (ENOSPC) halfway through
Old: Same as zero-length

Large-enough:
New: Success, without shrinking the image
Old: Success, with shrink
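
A rough sketch of what those file-target cases could look like (names,
sizes and layout are illustrative only; the real tests would use the
btrfs-progs test harness helpers):

  mkdir -p rootdir && echo data > rootdir/file

  # non-existent image: new behavior fails up front instead of creating it
  mkfs.btrfs --rootdir rootdir no-such.img && exit 1

  # zero-length image: new behavior fails up front instead of expanding it
  truncate -s 0 small.img
  mkfs.btrfs --rootdir rootdir small.img && exit 1

  # too-small image: fails partway through with ENOSPC (-28)
  truncate -s 16M small.img
  mkfs.btrfs --rootdir rootdir small.img && exit 1

  # large enough: succeeds, and the image is not shrunk afterwards
  truncate -s 1G big.img
  mkfs.btrfs --rootdir rootdir big.img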



* for block device target, I think loop device would do, too-small and
   large-enough


Same as the too-small and large-enough cases above.



* after the test, verify that the files have same size and contents


I'll reuse the convert test facilities to do that.

And I'll add one extra test case:
Not enough privilege to read some files in the directory:
New: Fail gracefully
Old: Fail with BUG_ON.

If you're OK with the behavior difference, I'll create new test cases 
for mkfs --rootdir.


Thanks,
Qu


Re: A user cannot remove his readonly snapshots?!

2017-09-15 Thread Kai Krakow
On Sat, 16 Sep 2017 00:02:01 +0200, Ulli Horlacher wrote:

> On Fri 2017-09-15 (23:44), Ulli Horlacher wrote:
> > On Fri 2017-09-15 (22:07), Peter Grandi wrote:
> >   
>  [...]  
> > > 
> > > Ordinary permissions still apply both to 'create' and 'delete':  
> > 
> > My user tux is the owner of the snapshot directory, because he has
> > created it!  
> 
> I can delete normal subvolumes but not the readonly snapshots:
> 
> tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume create test
> Create subvolume './test'
> 
> tux@xerus:/test/tux/zz/.snapshot: ll
> drwxr-xr-x  tux  users- 2017-09-15 18:22:26 2017-09-15_1822.test
> drwxr-xr-x  tux  users- 2017-09-15 18:22:26 2017-09-15_1824.test
> drwxr-xr-x  tux  users- 2017-09-15 18:57:39 2017-09-15_1859.test
> drwxr-xr-x  tux  users- 2017-09-15 23:58:51 test
> 
> tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume delete test
> Delete subvolume (no-commit): '/test/tux/zz/.snapshot/test'
> 
> tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume delete 2017-09-15_1859.test
> Delete subvolume (no-commit): '/test/tux/zz/.snapshot/2017-09-15_1859.test'
> ERROR: cannot delete '/test/tux/zz/.snapshot/2017-09-15_1859.test': Read-only file system

See "man mount" in section btrfs mount options: There is a mount option
to allow normal user to delete snapshots. But this is said to has
security implication I cannot currently tell. Maybe someone else knows.
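
Presumably the option meant here is user_subvol_rm_allowed (it is named
elsewhere in this thread), e.g.:

  mount -o remount,user_subvol_rm_allowed /test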


-- 
Regards,
Kai

Replies to list-only preferred.



Re: [PATCH] Btrfs: do not backup tree roots when fsync

2017-09-15 Thread Liu Bo
On Thu, Sep 14, 2017 at 02:49:03PM +0200, David Sterba wrote:
> On Thu, Sep 14, 2017 at 09:55:48AM +0800, Qu Wenruo wrote:
> > 
> > 
> > On 2017-09-14 02:25, Liu Bo wrote:
> > > It doesn't make sense to back up tree roots when doing fsync, since
> > > during fsync those tree roots are not yet consistent on disk.
> > > 
> > > Signed-off-by: Liu Bo 
> > 
> > Reviewed-by: Qu Wenruo 
> > 
> > With a nit that could be improved below.
> > > ---
> > >   fs/btrfs/disk-io.c | 9 ++++++++-
> > >   1 file changed, 8 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > > index 79ac228..a145a88 100644
> > > --- a/fs/btrfs/disk-io.c
> > > +++ b/fs/btrfs/disk-io.c
> > > @@ -3668,7 +3668,14 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
> > >   u64 flags;
> > >   
> > >   do_barriers = !btrfs_test_opt(fs_info, NOBARRIER);
> > > - backup_super_roots(fs_info);
> > > +
> > > + /*
> > > +  * max_mirrors == 0 indicates we're from commit_transaction,
> > > +  * not from fsync where the tree roots in fs_info have not
> > > +  * been consistent on disk.
> > > +  */
> > > + if (max_mirrors == 0)
> > > + backup_super_roots(fs_info);
> > 
> > BTW, the @max_mirrors naming here is really confusing.
> > Normally I would expect max_mirrors == 0 means we don't need to backup 
> > super roots...
> 
> Agreed it's confusing, could be something like "bool write_backups" (in a
> separate patch).

Good point, will do in a separate one, thanks to both for the
comments.
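
Something along these lines, as a purely illustrative sketch (not the
actual follow-up patch, which may end up looking different):

  -int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
  +int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors,
  +                     bool write_backups)
   ...
  -       if (max_mirrors == 0)
  +       if (write_backups)
                  backup_super_roots(fs_info);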

Thanks,

-liubo


[PATCH] Btrfs: fix unexpected result when dio reading corrupted blocks

2017-09-15 Thread Liu Bo
commit 4246a0b63bd8 ("block: add a bi_error field to struct bio")
changed the logic of how dio read endio reports errors.

For a single-stripe dio read, %bio->bi_status reflects the error before
verifying the checksum; currently we update it when the data block matches
its checksum, while in the mismatch case %bio->bi_status is not updated to
reflect that.

When some blocks in a file have been corrupted on disk, reading such a
file ends up with

1) checksum errors are reported in the kernel log
2) read(2) returns successfully with some content being 0x01.

In order to fix it, we need to report its checksum mismatch error to
the upper layer (dio layer in this case) as well.

Signed-off-by: Liu Bo 
Reported-by: Goffredo Baroncelli 
cc: Goffredo Baroncelli 
---
 fs/btrfs/inode.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 24bcd5c..a46799e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8267,11 +8267,8 @@ static void btrfs_endio_direct_read(struct bio *bio)
struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
blk_status_t err = bio->bi_status;
 
-   if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED) {
+   if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED)
err = btrfs_subio_endio_read(inode, io_bio, err);
-   if (!err)
-   bio->bi_status = 0;
-   }
 
	unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
  dip->logical_offset + dip->bytes - 1);
@@ -8279,7 +8276,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
 
kfree(dip);
 
-   dio_bio->bi_status = bio->bi_status;
+   dio_bio->bi_status = err;
dio_end_io(dio_bio);
 
if (io_bio->end_io)
-- 
1.8.3.1



Re: A user cannot remove his readonly snapshots?!

2017-09-15 Thread Ulli Horlacher
On Fri 2017-09-15 (23:44), Ulli Horlacher wrote:
> On Fri 2017-09-15 (22:07), Peter Grandi wrote:
> 
> > >  [ ... ] mounted with option user_subvol_rm_allowed [ ... ]
> > > root can delete this snapshot, but not the user. Why? [ ... ]
> > 
> > Ordinary permissions still apply both to 'create' and 'delete':
> 
> My user tux is the owner of the snapshot directory, because he has created it!

I can delete normal subvolumes but not the readonly snapshots:

tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume create test
Create subvolume './test'

tux@xerus:/test/tux/zz/.snapshot: ll
drwxr-xr-x  tux  users- 2017-09-15 18:22:26 2017-09-15_1822.test
drwxr-xr-x  tux  users- 2017-09-15 18:22:26 2017-09-15_1824.test
drwxr-xr-x  tux  users- 2017-09-15 18:57:39 2017-09-15_1859.test
drwxr-xr-x  tux  users- 2017-09-15 23:58:51 test

tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume delete test
Delete subvolume (no-commit): '/test/tux/zz/.snapshot/test'

tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume delete 2017-09-15_1859.test
Delete subvolume (no-commit): '/test/tux/zz/.snapshot/2017-09-15_1859.test'
ERROR: cannot delete '/test/tux/zz/.snapshot/2017-09-15_1859.test': Read-only file system

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<20170915214434.gg32...@rus.uni-stuttgart.de>


Re: A user cannot remove his readonly snapshots?!

2017-09-15 Thread Ulli Horlacher
On Fri 2017-09-15 (22:07), Peter Grandi wrote:
> >  [ ... ] mounted with option user_subvol_rm_allowed [ ... ]
> > root can delete this snapshot, but not the user. Why? [ ... ]
> 
> Ordinary permissions still apply both to 'create' and 'delete':

My user tux is the owner of the snapshot directory, because he has created it!

tux@xerus:/test/tux/zz: id
uid=1000(tux) gid=100(users) groups=100(users)

tux@xerus:/test/tux/zz: ll .snapshot/
drwxr-xr-x  tux  users- 2017-09-15 18:22:26 .snapshot/2017-09-15_1822.test
drwxr-xr-x  tux  users- 2017-09-15 18:22:26 .snapshot/2017-09-15_1824.test
drwxr-xr-x  tux  users- 2017-09-15 18:57:39 .snapshot/2017-09-15_1859.test

tux@xerus:/test/tux/zz: btrfs subvolume delete .snapshot/2017-09-15_1859.test
Delete subvolume (no-commit): '/test/tux/zz/.snapshot/2017-09-15_1859.test'
ERROR: cannot delete '/test/tux/zz/.snapshot/2017-09-15_1859.test': Read-only file system

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<22972.16657.822204.646...@tree.ty.sabi.co.uk>


Re: A user cannot remove his readonly snapshots?!

2017-09-15 Thread Peter Grandi
>  [ ... ] mounted with option user_subvol_rm_allowed [ ... ]
> root can delete this snapshot, but not the user. Why? [ ... ]

Ordinary permissions still apply both to 'create' and 'delete':

  tree$  sudo mkdir /fs/sda7/dir
  tree$  btrfs sub create /fs/sda7/dir/sub
  ERROR: cannot access /fs/sda7/dir/sub: Permission denied

  tree$  sudo chmod a+rwx /fs/sda7/dir
  tree$  btrfs sub create /fs/sda7/dir/sub
  Create subvolume '/fs/sda7/dir/sub'

  tree$  btrfs sub delete /fs/sda7/dir/sub
  Delete subvolume (no-commit): '/fs/sda7/dir/sub'
  ERROR: cannot delete '/fs/sda7/dir/sub': Operation not permitted
  tree$  sudo mount -o remount,user_subvol_rm_allowed /fs/sda7
  tree$  btrfs sub delete /fs/sda7/dir/sub
  Delete subvolume (no-commit): '/fs/sda7/dir/sub'

  tree$  btrfs sub create /fs/sda7/dir/sub
  Create subvolume '/fs/sda7/dir/sub'

  tree$  sudo chmod a-w /fs/sda7/dir
  tree$  btrfs sub delete /fs/sda7/dir/sub
  Delete subvolume (no-commit): '/fs/sda7/dir/sub'
  ERROR: cannot delete '/fs/sda7/dir/sub': Permission denied

  tree$  sudo chmod a+w /fs/sda7/dir
  tree$  btrfs sub delete /fs/sda7/dir/sub
  Delete subvolume (no-commit): '/fs/sda7/dir/sub'
  tree$  sudo rmdir /fs/sda7/dir


Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Liu Bo
On Fri, Sep 15, 2017 at 08:57:41PM +0200, Goffredo Baroncelli wrote:
> On 09/15/2017 07:01 PM, Liu Bo wrote:
> >> Conclusion: even if the file is corrupted and normally BTRFS prevents
> >> access to it, using O_DIRECT
> >> a) no error is returned to the caller
> >> b) instead of the page stored on the disk, a page filled with 0x01 is
> >> returned (in accordance with the function __readpage_endio_check())
> >>
> > We've queued a patch[1] to fix it, the behavior was introduced a long
> > time ago and we guess it was for testing purpose.
> > 
> > With the patch, you'll get -EIO from dio read, which is consistent
> > with buffered read.
> 
> I tried your patch, but it doesn't seem to address this issue. I still got 
> -EIO with a normal file access, and a page filled by 0x01 with O_DIRECT.

Oh, nice, thanks for trying it out.

I see it now; the regression was actually introduced by a cleanup
patch,

commit 4246a0b63bd8f56a1469b12eafeb875b1041a451
block: add a bi_error field to struct bio

It overrides our -EIO which should be returned by dio_end_io().


But I think we need further clarification on the correct behavior,
i.e.

If a file got corrupted (wrong) data, both buffered read and O_DIRECT
read will return -EIO to userspace, and the particular corrupted page
will be reset to 0x01 so as not to expose stale data to userspace.  No
exception should exist.

> 
> If I understand your patch correctly, it addresses the case where there is
> no checksum at all, which is not the case here. The checksum exists and is
> valid; it is the data that is wrong.
>

It's not for the nodatasum case; it's actually for the case where some
checksum somehow got zeroed, and btrfs was not returning -EIO even
though it got a mismatch.

Anyway, your problem is surely a regression, thanks for reporting it.

Thanks,

-liubo


Re: snapshots of encrypted directories?

2017-09-15 Thread Ulli Horlacher
On Fri 2017-09-15 (13:16), Austin S. Hemmelgarn wrote:

> >> And then mount enryptfs:
> >>
> >> mount.ecryptfs <lower dir> <mount dir>
> > 
> > This is only possible as root.
> > For a user it is not possible to have access to his own snapshots.
> > Bad.
> 
> Which is why you use EncFS (which is a FUSE module that runs in 
> userspace and requires no root privileges) instead of eCryptFS (which is 
> a kernel assisted filesystem that doesn't use FUSE, has more complicated 
> setup constraints, and requires CAP_SYS_ADMIN or root access).

I use both, encfs and ecryptfs, for different use cases.
I use ecryptfs on my notebooks for $HOME, which has some kind of
automounter on login (via pam).
This setup is not possible with encfs, which is also much slower and has
a lower security level.

But even with encfs it is very cumbersome for a user to get access to his
snapshots.

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<6cd1ef22-7cab-4c8c-0b73-d254aeca8...@gmail.com>


Re: A user cannot remove his readonly snapshots?!

2017-09-15 Thread Ulli Horlacher
On Fri 2017-09-15 (13:08), Austin S. Hemmelgarn wrote:
> On 2017-09-15 12:37, Ulli Horlacher wrote:
> 
> > I have my btrfs filesystem mounted with option user_subvol_rm_allowed
> > tux@xerus: btrfs subvolume delete 
> > /test/tux/zz/.snapshot/2017-09-15_1824.test
> > Delete subvolume (no-commit): '/test/tux/zz/.snapshot/2017-09-15_1824.test'
> > ERROR: cannot delete '/test/tux/zz/.snapshot/2017-09-15_1824.test': 
> > Read-only file system
> > 
> > root can delete this snapshot, but not the user. Why?
> > 
> 
> Add 'user_subvol_rm' to the mount options and try again.

root@xerus:~# mount -vo user_subvol_rm /dev/sdd4 /test
mount: wrong fs type, bad option, bad superblock on /dev/sdd4,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.

root@xerus:~# dmesg | tail -2
[1514588.018991] BTRFS info (device sdd4): unrecognized mount option 
'user_subvol_rm'
[1514588.028430] BTRFS: open_ctree failed

user_subvol_rm is not listed on
https://btrfs.wiki.kernel.org/index.php/Mount_options

Did you mean user_subvol_rm_allowed?
I have already used this option, without success. See above.


-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<38b35e41-313a-1eed-667a-ee3138c81...@gmail.com>


Re: defragmenting best practice?

2017-09-15 Thread Tomasz Kłoczko
On 15 September 2017 at 18:08, Kai Krakow  wrote:
[..]
> According to Tomasz, your tests should not run at vastly different
> speeds because fragmentation has no impact on performance, quod est
> demonstrandum... I think we will not get to the "erat" part.

No, that is not precisely what I'm trying to say.
However, seeing that there is no precise, fully repeatable methodology
for performing the proposed test, I have huge doubts about whether the
reported effect has anything to do with fragmentation, or whether it is
a side effect of using COW (which allows gluing a number of random
updates into larger sequential write IOs).

kloczek
-- 
Tomasz Kłoczko  LinkedIn: http://lnkd.in/FXPWxH


Re: \o/ compsize

2017-09-15 Thread David Sterba
On Mon, Sep 04, 2017 at 08:42:29PM +0200, Adam Borowski wrote:
> On Mon, Sep 04, 2017 at 07:07:25PM +0300, Timofey Titovets wrote:
> > 2017-09-04 18:11 GMT+03:00 Adam Borowski :
> > > Here's a utility to measure used compression type + ratio on a set of 
> > > files
> > > or directories: https://github.com/kilobyte/compsize
> > >
> > > It should be of great help for users, and also if you:
> > > * muck with compression levels
> > > * add new compression types
> > > * add heuristics that could err on withholding compression too much
> > 
> > Packaged to AUR:
> > https://aur.archlinux.org/packages/compsize-git/
> 
> Cool!  I'd wait until people say the code is sane (I don't really know these
> ioctls) but if you want to make poor AUR folks our beta testers, that's ok.
> 
> However, one issue: I did not set a license; your packaging says GPL3. 
> It would be better to have something compatible with btrfs-progs which are
> GPL2-only.  What about GPL2-or-higher?
> 
> After adding some related info (like wasted space in pinned extents, reuse
> of extents), it'd be nice to have this tool inside btrfs-progs, either as a
> part of "fi du" or another command.

I've now implemented a prototype that calculates the compressed size of
extents per-file. As 'fi du' knows about what extents are shared, the
compression can be also calculated shared/exclusive.

There's no summary like compsize produces; this would need a bit more
precise tracking of the extents, not just the compressed size but also
which algo was used. I can imagine all sorts of output enhancements,
like summarizing the inline-compressed extents or printing the algo
summary per-file. This should be easy once the calculation code is
there. I haven't reused compsize.c, as I needed only the ioctl part and
wired it to 'fi du', but the search ioctl is the same.
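
For anyone wanting to experiment, a minimal userspace sketch of the same
idea (not the prototype above, just an illustration: single search pass,
little-endian host assumed for direct field access, needs root for
BTRFS_IOC_TREE_SEARCH):

  #include <stdio.h>
  #include <string.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <sys/stat.h>
  #include <linux/btrfs.h>
  #include <linux/btrfs_tree.h>

  int main(int argc, char **argv)
  {
          unsigned long long disk = 0, ram = 0;
          struct btrfs_ioctl_search_args args;
          struct btrfs_ioctl_search_key *sk = &args.key;
          unsigned long off = 0;
          struct stat st;
          int fd, i;

          if (argc != 2)
                  return 1;
          fd = open(argv[1], O_RDONLY);
          if (fd < 0 || fstat(fd, &st) < 0) {
                  perror(argv[1]);
                  return 1;
          }

          memset(&args, 0, sizeof(args));
          sk->tree_id = 0;        /* 0 = the subvolume containing fd */
          sk->min_objectid = sk->max_objectid = st.st_ino;
          sk->min_type = sk->max_type = BTRFS_EXTENT_DATA_KEY;
          sk->max_offset = (__u64)-1;
          sk->max_transid = (__u64)-1;
          sk->nr_items = 4096;

          /* one pass for brevity; a real tool loops until nr_items == 0 */
          if (ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args) < 0) {
                  perror("BTRFS_IOC_TREE_SEARCH");
                  return 1;
          }

          for (i = 0; i < sk->nr_items; i++) {
                  struct btrfs_ioctl_search_header *sh =
                          (void *)(args.buf + off);
                  struct btrfs_file_extent_item *fi =
                          (void *)(args.buf + off + sizeof(*sh));

                  /* count regular, non-hole extents only */
                  if (sh->type == BTRFS_EXTENT_DATA_KEY &&
                      fi->type == BTRFS_FILE_EXTENT_REG &&
                      fi->disk_bytenr != 0) {
                          disk += fi->disk_num_bytes;
                          ram += fi->ram_bytes;
                  }
                  off += sizeof(*sh) + sh->len;
          }
          printf("on-disk %llu / uncompressed %llu bytes\n", disk, ram);
          return 0;
  }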


Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Goffredo Baroncelli
On 09/15/2017 07:01 PM, Liu Bo wrote:
>> Conclusion: even if the file is corrupted and normally BTRFS prevents
>> access to it, using O_DIRECT
>> a) no error is returned to the caller
>> b) instead of the page stored on the disk, a page filled with 0x01 is
>> returned (in accordance with the function __readpage_endio_check())
>>
> We've queued a patch[1] to fix it, the behavior was introduced a long
> time ago and we guess it was for testing purpose.
> 
> With the patch, you'll get -EIO from dio read, which is consistent
> with buffered read.

I tried your patch, but it doesn't seem to address this issue. I still get -EIO
with normal file access, and a page filled with 0x01 with O_DIRECT.

If I understand your patch correctly, it addresses the case where there is no
checksum at all, which is not the case here. The checksum exists and is valid;
it is the data that is wrong.

BR
G.Baroncelli

> 
> [1],
> https://patchwork.kernel.org/patch/9835505/
> Btrfs: report errors when checksum is not found
> 
> Thanks,
> 
> -liubo
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


[PATCH 2/4 v3] btrfs-progs: add a parameter to btrfs_mksubvol

2017-09-15 Thread Yingyi Luo
From: yingyil 

A convert parameter is added as a flag to indicate if btrfs_mksubvol()
is used for btrfs-convert. The change cascades down to the callchain.

Signed-off-by: yingyil 
---
v3: changed the convert flag from int type to bool type.

 convert/main.c |  2 +-
 ctree.h        |  4 +++-
 inode.c        | 26 ++++++++++++++++----------
 3 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/convert/main.c b/convert/main.c
index 0babf0e8..882daf7c 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -1197,7 +1197,7 @@ static int do_convert(const char *devname, u32 convert_flags, u32 nodesize,
}
 
image_root = btrfs_mksubvol(root, subvol_name,
-   CONV_IMAGE_SUBVOL_OBJECTID);
+   CONV_IMAGE_SUBVOL_OBJECTID, true);
if (!image_root) {
error("unable to link subvolume %s", subvol_name);
goto fail;
diff --git a/ctree.h b/ctree.h
index 22313576..ae25eed5 100644
--- a/ctree.h
+++ b/ctree.h
@@ -19,6 +19,8 @@
 #ifndef __BTRFS_CTREE_H__
 #define __BTRFS_CTREE_H__
 
+#include <stdbool.h>
+
 #if BTRFS_FLAT_INCLUDES
 #include "list.h"
 #include "kerncompat.h"
@@ -2751,7 +2753,7 @@ int btrfs_add_orphan_item(struct btrfs_trans_handle *trans,
 int btrfs_mkdir(struct btrfs_trans_handle *trans, struct btrfs_root *root,
char *name, int namelen, u64 parent_ino, u64 *ino, int mode);
 struct btrfs_root *btrfs_mksubvol(struct btrfs_root *root, const char *base,
- u64 root_objectid);
+ u64 root_objectid, bool convert);
 
 /* file.c */
 int btrfs_get_extent(struct btrfs_trans_handle *trans,
diff --git a/inode.c b/inode.c
index 5dd816bf..ea812ce8 100644
--- a/inode.c
+++ b/inode.c
@@ -573,7 +573,8 @@ out:
 }
 
 struct btrfs_root *btrfs_mksubvol(struct btrfs_root *root,
- const char *base, u64 root_objectid)
+ const char *base, u64 root_objectid,
+ bool convert)
 {
struct btrfs_trans_handle *trans;
struct btrfs_fs_info *fs_info = root->fs_info;
@@ -639,16 +640,21 @@ struct btrfs_root *btrfs_mksubvol(struct btrfs_root *root,
key.type = BTRFS_ROOT_ITEM_KEY;
 
memcpy(buf, base, len);
-   for (i = 0; i < 1024; i++) {
-   ret = btrfs_insert_dir_item(trans, root, buf, len,
-   dirid, &key, BTRFS_FT_DIR, index);
-   if (ret != -EEXIST)
-   break;
-   len = snprintf(buf, ARRAY_SIZE(buf), "%s%d", base, i);
-   if (len < 1 || len > BTRFS_NAME_LEN) {
-   ret = -EINVAL;
-   break;
+   if (convert) {
+   for (i = 0; i < 1024; i++) {
+   ret = btrfs_insert_dir_item(trans, root, buf, len,
+   dirid, &key, BTRFS_FT_DIR, index);
+   if (ret != -EEXIST)
+   break;
+   len = snprintf(buf, ARRAY_SIZE(buf), "%s%d", base, i);
+   if (len < 1 || len > BTRFS_NAME_LEN) {
+   ret = -EINVAL;
+   break;
+   }
}
+   } else {
+   ret = btrfs_insert_dir_item(trans, root, buf, len, dirid, &key,
+   BTRFS_FT_DIR, index);
}
if (ret)
goto fail;
-- 
2.14.1.690.gbb1197296e-goog



Re: [PATCH preview] btrfs: allow to set compression level for zlib

2017-09-15 Thread Nick Terrell
On 9/15/17, 7:53 AM, "David Sterba"  wrote:
> On Sun, Aug 20, 2017 at 06:38:50PM -0600, Chris Murphy wrote:
> > On Fri, Aug 18, 2017 at 10:08 AM, David Sterba  wrote:
> > 
> > > That's quite a lot, in kernel. IIRC zlib and lzo use less than 200kb,
> > > zstd wants 800kb for level 1. And this needs to be contiguous memory, so
> > > if we're lucky and get the memory at the mount time, fine. In general
> > > the memory can be fragmented (in the worst case, there are only 4k
> > > chunks available), so we'd have to vmalloc and consume the virtual
> > > mappings in great numbers.
> > 
> > Any thoughts on bootloader support, both in general, and as it relates
> > to levels of compression and memory constraints? GRUB switches to
> > protected mode early on but other bootloaders might have more
> > limitations. I guess really it's just GRUB and extlinux right now,
> > there were patches some time ago for Das U-Boot but they still aren't
> > merged.
> 
> I'm not sure if the memory requirements are the same for compression and
> decompression. The table is related to compression. I've looked around the
> web to find out what the decompression requirements are but haven't found
> anything.

In the BtrFS implementation, the same workspaces are used both for
compression and decompression, so they allocate enough memory for
compression. Decompression requires a constant 550 KB with a 128 KB window
size (which BtrFS has). You can find the numbers using the functions
ZSTD_estimateCStreamSize() and ZSTD_estimateDStreamSize().
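
For reference, a userspace probe of those estimates looks roughly like
this (the ZSTD_estimate*() functions live in zstd's experimental API, so
the exact signatures vary by libzstd version; build with `cc probe.c -lzstd`):

  #define ZSTD_STATIC_LINKING_ONLY
  #include <stdio.h>
  #include <zstd.h>

  int main(void)
  {
          int level;

          for (level = 1; level <= 15; level++)
                  printf("level %2d: cstream workspace ~%zu bytes\n",
                         level, ZSTD_estimateCStreamSize(level));
          /* btrfs compresses in 128 KB chunks, so a 128 KB window suffices */
          printf("dstream workspace (128 KB window): ~%zu bytes\n",
                 ZSTD_estimateDStreamSize(128 * 1024));
          return 0;
  }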



Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Liu Bo
On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
> Hi all,
> 
> I discovered two bugs when O_DIRECT is used...
> 
> 1) a corrupted file doesn't return -EIO when O_DIRECT is used
> 
> Normally BTRFS prevents access to the contents of a corrupted file; however I
> was able to read the content of a corrupted file simply by using O_DIRECT
> 
> # in a new btrfs filesystem, create a file
> $ sudo mkfs.btrfs -f /dev/sdd5
> $ mount /dev/sdd5 t
> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd 
> bs=$((16*1024)) iflag=fullblock count=1024
> 
> # corrupt the file
> $ sudo filefrag -v t/abcd 
> Filesystem type is: 9123683e
> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..3475:  70656.. 74131:   3476:
>1: 3476..4095:  74212.. 74831:620:  74132: last,eof
> t/abcd: 2 extents found
> $ sudo umount t
> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
> /dev/sdd5
> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
> corrupting 289406976 copy 1 
> 
> # try to access the file; expected result: -EIO
> $ sudo mount /dev/sdd5 t
> $ dd if=t/abcd | hexdump -c | head
> dd: error reading 't/abcd': Input/output error
> 0+0 records in
> 0+0 records out
> 0 bytes copied, 0.000477413 s, 0.0 kB/s
> 
> 
> # try to access the file using O_DIRECT; expected result: -EIO, instead the 
> file is accessible
> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
> 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
> *
> 0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> 0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> 0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> 0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> 0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> 0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> 0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> 0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> 
> (dmesg report the checksum mismatch)
> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 
> csum 0x98f94189 expected csum 0x0ab6be80 mirror 1
> 
> Note the first 4k filled by 0x01 !
> 
> Conclusion: even if the file is corrupted and normally BTRFS prevents
> access to it, using O_DIRECT
> a) no error is returned to the caller
> b) instead of the page stored on the disk, a page filled with 0x01 is
> returned (in accordance with the function __readpage_endio_check())
>

We've queued a patch[1] to fix it, the behavior was introduced a long
time ago and we guess it was for testing purpose.

With the patch, you'll get -EIO from dio read, which is consistent
with buffered read.

[1],
https://patchwork.kernel.org/patch/9835505/
Btrfs: report errors when checksum is not found

Thanks,

-liubo

> 
> 2) The second bug is more severe. If, during a write of a buffer with
> O_DIRECT, the buffer is updated at the same time by a second process, the
> checksum may be incorrect.
> 
> At the end of the email there is the code which shows the problem: two
> processes share the same memory: the first writes it to the disk, the second
> updates the buffer continuously. A third process tries to read the file, but
> from time to time it gets -EIO
> 
> If you run my code on a btrfs filesystem you get a lot of 
> 
> ERROR: read thread; r = 8192, expected = 16384
> ERROR: read thread; r = 8192, expected = 16384
> ERROR: read thread; e = 5 - Input/output error
> ERROR: read thread; e = 5 - Input/output error
> 
> The first lines relate to a short read (which may happen). The last lines
> relate to a checksum mismatch. dmesg is filled with lines like
> [...]
> [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off 
> 4096 csum 0x0683c6df expected csum 0x55eb85e6 mirror 1
> [...]
> 
> This is definitely a bug. 
> 
> I think that using O_DIRECT and updating a page at the same time could happen
> in a VM. In BTRFS this could lead to a wrong checksum. The problem is that
> if BTRFS detects a checksum error during a read:
> a) if O_DIRECT is not used in the read
>   * -EIO is returned
> Definitely BAD
> 
> b) if O_DIRECT is used in the read
>   * it doesn't return the error to the caller
>   * it returns a page filled by 0x01 instead of the data from the disk
> Even worse than a)
> 
> Note1: even using O_DIRECT with O_SYNC, the problem still persists.
> Note2: the man page of open(2) is filled with a lot of notes about O_DIRECT,
> but it also states that using O_DIRECT+fork()+mmap(... MAP_SHARED) is legal.
> Note3: even "ZFS on Linux" has its troubles with O_DIRECT: in fact ZFS doesn't
> support it; see 
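
[The reproducer code was truncated from this archive. A hypothetical
reconstruction from the description above -- file name, sizes and loop
counts are illustrative -- would look roughly like this:]

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <signal.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  #define BUF_SIZE (16 * 1024)  /* page-aligned via mmap, as O_DIRECT needs */

  int main(void)
  {
          char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
          int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
          pid_t pid;
          int i;

          if (buf == MAP_FAILED || fd < 0) {
                  perror("setup");
                  return 1;
          }

          pid = fork();
          if (pid == 0) {                 /* child: scribble over the buffer */
                  unsigned char c = 0;
                  for (;;)
                          memset(buf, c++, BUF_SIZE);
          }

          /* parent: write the same buffer out with O_DIRECT while it changes */
          for (i = 0; i < 100000; i++)
                  if (pwrite(fd, buf, BUF_SIZE, 0) != BUF_SIZE)
                          perror("pwrite");

          kill(pid, SIGKILL);
          waitpid(pid, NULL, 0);
          close(fd);
          return 0;   /* now read testfile without O_DIRECT: -EIO is likely */
  }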

[PATCH 1/4 v3] btrfs-progs: convert: move link_subvol out of main

2017-09-15 Thread Yingyi Luo
From: yingyil 

link_subvol() is moved to inode.c and renamed as btrfs_mksubvol().
The change cascades down to the callchain.

Signed-off-by: yingyil 
---
v3: moved link_subvol() to inode.c and put its declaration in ctree.h. The name
is changed to btrfs_mksubvol(). Thanks for your comments.

 convert/main.c | 126 +-
 ctree.h        |   2 +
 inode.c        | 124 +
 3 files changed, 128 insertions(+), 124 deletions(-)

diff --git a/convert/main.c b/convert/main.c
index 7ec6202d..0babf0e8 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -838,129 +838,6 @@ out:
return ret;
 }
 
-static struct btrfs_root* link_subvol(struct btrfs_root *root,
-   const char *base, u64 root_objectid)
-{
-   struct btrfs_trans_handle *trans;
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   struct btrfs_root *tree_root = fs_info->tree_root;
-   struct btrfs_root *new_root = NULL;
-   struct btrfs_path path;
-   struct btrfs_inode_item *inode_item;
-   struct extent_buffer *leaf;
-   struct btrfs_key key;
-   u64 dirid = btrfs_root_dirid(&root->root_item);
-   u64 index = 2;
-   char buf[BTRFS_NAME_LEN + 1]; /* for snprintf null */
-   int len;
-   int i;
-   int ret;
-
-   len = strlen(base);
-   if (len == 0 || len > BTRFS_NAME_LEN)
-   return NULL;
-
-   btrfs_init_path(&path);
-   key.objectid = dirid;
-   key.type = BTRFS_DIR_INDEX_KEY;
-   key.offset = (u64)-1;
-
-   ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0);
-   if (ret <= 0) {
-   error("search for DIR_INDEX dirid %llu failed: %d",
-   (unsigned long long)dirid, ret);
-   goto fail;
-   }
-
-   if (path.slots[0] > 0) {
-   path.slots[0]--;
-   btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
-   if (key.objectid == dirid && key.type == BTRFS_DIR_INDEX_KEY)
-   index = key.offset + 1;
-   }
-   btrfs_release_path(&path);
-
-   trans = btrfs_start_transaction(root, 1);
-   if (IS_ERR(trans)) {
-   error("unable to start transaction");
-   goto fail;
-   }
-
-   key.objectid = dirid;
-   key.offset = 0;
-   key.type =  BTRFS_INODE_ITEM_KEY;
-
-   ret = btrfs_lookup_inode(trans, root, &path, &key, 1);
-   if (ret) {
-   error("search for INODE_ITEM %llu failed: %d",
-   (unsigned long long)dirid, ret);
-   goto fail;
-   }
-   leaf = path.nodes[0];
-   inode_item = btrfs_item_ptr(leaf, path.slots[0],
-   struct btrfs_inode_item);
-
-   key.objectid = root_objectid;
-   key.offset = (u64)-1;
-   key.type = BTRFS_ROOT_ITEM_KEY;
-
-   memcpy(buf, base, len);
-   for (i = 0; i < 1024; i++) {
-   ret = btrfs_insert_dir_item(trans, root, buf, len,
-   dirid, &key, BTRFS_FT_DIR, index);
-   if (ret != -EEXIST)
-   break;
-   len = snprintf(buf, ARRAY_SIZE(buf), "%s%d", base, i);
-   if (len < 1 || len > BTRFS_NAME_LEN) {
-   ret = -EINVAL;
-   break;
-   }
-   }
-   if (ret)
-   goto fail;
-
-   btrfs_set_inode_size(leaf, inode_item, len * 2 +
-btrfs_inode_size(leaf, inode_item));
-   btrfs_mark_buffer_dirty(leaf);
-   btrfs_release_path(&path);
-
-   /* add the backref first */
-   ret = btrfs_add_root_ref(trans, tree_root, root_objectid,
-BTRFS_ROOT_BACKREF_KEY,
-root->root_key.objectid,
-dirid, index, buf, len);
-   if (ret) {
-   error("unable to add root backref for %llu: %d",
-   root->root_key.objectid, ret);
-   goto fail;
-   }
-
-   /* now add the forward ref */
-   ret = btrfs_add_root_ref(trans, tree_root, root->root_key.objectid,
-BTRFS_ROOT_REF_KEY, root_objectid,
-dirid, index, buf, len);
-   if (ret) {
-   error("unable to add root ref for %llu: %d",
-   root->root_key.objectid, ret);
-   goto fail;
-   }
-
-   ret = btrfs_commit_transaction(trans, root);
-   if (ret) {
-   error("transaction commit failed: %d", ret);
-   goto fail;
-   }
-
-   new_root = btrfs_read_fs_root(fs_info, &key);
-   if (IS_ERR(new_root)) {
-   error("unable to fs read root: %lu", PTR_ERR(new_root));
-   new_root = NULL;
-   }
-fail:
-   btrfs_init_path(&path);
-   return new_root;
-}

Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Goffredo Baroncelli
On 09/15/2017 11:50 AM, Marat Khalili wrote:
> May I state my user's point of view:
> 
> I know one applications that uses O_DIRECT, and it is subtly broken
> on BTRFS. I know no applications that use O_DIRECT and are not
> broken. (Really more statistics would help here, probably some exist
> that provably work.) According to developers making O_DIRECT work on
> BTRFS is difficult if not impossible. Isn't it time to disable
> O_DIRECT like ZFS does AFAIU? Data safety is certainly more important
> than performance gain it may or may not give some applications.

I agree with you, but it should be sufficient to disable O_DIRECT when the file
has checksums. I was unable to observe filesystem corruption when the
checksums are disabled.

The use cases where O_DIRECT is useful are VMs and databases, and in these
cases nodatasum/nodatacow are also useful.

BR
G.Baroncelli

> 
> --
> 
> With Best Regards, Marat Khalili
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Goffredo Baroncelli
On 09/15/2017 07:01 PM, Andrei Borzenkov wrote:
> 15.09.2017 08:50, Goffredo Baroncelli wrote:
>> On 09/15/2017 05:55 AM, Andrei Borzenkov wrote:
>>> 15.09.2017 01:00, Goffredo Baroncelli пишет:

 2) The second bug is more severe. If, during a write of a buffer
 with O_DIRECT, the buffer is updated at the same time by a second process,
 the checksum may be incorrect.

>>>
>>> Is it btrfs specific ? If buffer is updated before it was actually
>>> consumed by kernel, this likely means data corruption on any filesystem.
>>
>> I don't see any corruption on other filesystems. The fact that an application
>> pushes garbage to the filesystem shouldn't allow the filesystem to become corrupted.
> 
> I did not say "filesystem corruption", 
I did, because there is filesystem corruption only in BTRFS.
I agree with you that concurrent access by different processes to the same
page while it is being written is bad. However, only in BTRFS does this lead
to filesystem corruption. This is the first problem. The second one is that
BTRFS doesn't warn the user about it (read more below); instead it returns
wrong data (a page filled with 0x01).

> I said "data corruption".
And are you sure that there is data corruption? I don't have evidence of that
on other filesystems. The proof is that there are a lot of problems in BTRFS
when O_DIRECT is used, but not on other filesystems.
Whether this is due to better serialization, or due to the application touching
the page which is about to be written without caring about the data, I don't
know.

Anyway from open(2) man page:

[...]
O_DIRECT  I/Os should never be run concurrently with the fork(2)[...] if the 
memory buffer is a private mapping (i.e., any mapping created  with the mmap(2) 
MAP_PRIVATE flag.
[...]This restriction does not apply when the  memory buffer for the O_DIRECT 
I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag.
[...]

So O_DIRECT+mmap(MAP_SHARED)+fork() is a valid and accepted use case.


> 
>> In this case the filesystem became corrupted, because another application
>> which tries to read the data (without O_DIRECT) may get -EIO.
>>
> 
> No. *Data* on this filesystem was corrupted and luckily btrfs makes you
> aware of it. 

Please re-read my messages. BTRFS doesn't make the user aware of it: if the
data is read with O_DIRECT, it doesn't return an error in case of a checksum
mismatch (this is the second problem).

> On different filesystem you still may have the same data
> corruption, but silent.
> 
>> I repeat, the problem is a data race once the data is in the filesystem's
>> hands, and the kernel computes a wrong checksum.
>>
> 
> Of course it is a race. But again - I expect that when pwrite() returns it
> means the data buffer can be reused. Otherwise I cannot see how O_DIRECT can
> be sensibly used at all. In this case you need to demonstrate that data
> corruption happens after pwrite() returns - that would make it a btrfs issue
> indeed. If data corruption happens while the thread is waiting for pwrite()
> to return, I say this is expected behavior and the application's fault - it
> needs to protect against concurrent write and modification.
> 
>>
>> IMHO, BTRFS should disallow O_DIRECT (which is the same thing ZFS on Linux
>> does); I think it could be allowed only for nodatasum files.
>>
>>> I.e. there should be clear indication from kernel that buffer can be
>>> reused by application, in your example - when pwrite returns. So when
>>> data corruption happens - during pwrite or after? 
>>> If data is corrupted
>>> during pwrite, it is arguably application fault - it should disallow
>>> concurrent access.


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: snapshots of encrypted directories?

2017-09-15 Thread Andrei Borzenkov
15.09.2017 15:35, Austin S. Hemmelgarn wrote:
> On 2017-09-14 23:45, Andrei Borzenkov wrote:
>> 14.09.2017 18:32, Hugo Mills wrote:
>>> On Thu, Sep 14, 2017 at 04:57:39PM +0200, Ulli Horlacher wrote:
 I use encfs on top of btrfs.
 I can create btrfs snapshots, but I have no suggestive access to the
 files
 in these snaspshots, because they look like:

 drwx------  framstag users      - 2017-09-08 11:47:18 uHjprldmxo3-nSfLmcH54HMW
 drwxr-xr-x  framstag users      - 2017-09-08 11:47:18 wNEWaDCgyXTj0d-Myk8wXZfh
 -rw-r--r--  framstag users    377 2015-06-12 14:02:53 -zDmc7xfobKDkbl8z7oKOHxv
 -rw-r--r--  framstag users  2,367 2012-07-10 14:32:30 7pfKs27K9k5zANE4WOQEuFa2
 -rw-------  framstag users    692 2009-10-20 13:45:41 8SQElYCph85kDdcFasUHybVr
 -rw-------  framstag users  2,872 2017-08-31 16:21:52 bm,yNi1e4fsAClDv7lNxxSfJ
 lrwxrwxrwx  framstag users      - 2017-06-01 15:53:00 GZxNYI0Gy96R18fz40f7k5rl -> wvuQKHYzdFbar18fW6jjOerXk2IsS4OAA2fnHalBZjMQ,7Kw0j-zE3IJqxhmmGBN8G9
 -rw-r--r--  framstag users    182 2016-12-01 13:34:31 rqtNBbiYDym0hPMbBL-VLJZcFZu6nkNxlsjTX-sU88I4I1

 I have to mount the snapshot with encfs, to have access to the
 (decrypted)
 files.

 Any better ideas?
>>>
>>>     I'd say it's doing exactly what it should be doing. You're making a
>>> copy of an encrypted data store,
>>
>> With all respect - snapshot is not a copy.
> From a userspace perspective, it is, and that's what matters since EncFS
> is a userspace tool.  In fact, part of the point of a snapshot is that
> it's functionally indistinguishable from a direct copy of the data
> unless you start looking at block layouts (which nothing in userspace
> that isn't an administration tool should be doing).
>>
>>> and the result is encrypted. In order
>>> to read it, it needs to have the decrpytion layer applied to it with
>>> the correct key (which is the need to mount the snapshot with encfs).
>>>
>>
>> But snapshot *is* mounted implicitly as it is part of mounted btrfs
>> filesystem. So I can see that this behavior could be rather unexpected.
>>
>>>     Would you _really_ want a system where the encrypted contents of a
>>> subvolume can be decrypted by simply snapshotting it?
>>
>> The actual question is - do you need to mount each individual btrfs
>> subvolume when using encfs? If yes, this behavior is at least
>> consistent. If not - how are snapshots different?
> I think you're not understanding the layering here.  EncFS is a FUSE
> filesystem that is run as a separate layer on top of another filesystem.
>  It is completely agnostic of the underlying data storage
> implementation, provided that said data storage enforces POSIX I/O
> semantics.
> 

I did understand the layering, but I was not familiar with the subtleties. I
have now tested encfs with subvolumes: subvolumes present in the encrypted
directory are not visible in the unencrypted directory at all (and of course
I cannot create a subvolume on the overlay mount). But if I know the
encrypted name, I can create a subvolume with that name - and it becomes
visible under the cleartext mount:

bor@10:~> /usr/sbin/btrfs sub cre .encfs/CxRowQNCTNP3Bm0DlEFADnj5
Create subvolume '.encfs/CxRowQNCTNP3Bm0DlEFADnj5'
bor@10:~> ll .encfs/
total 0
drwxr-xr-x 1 bor users 0 Sep 15 20:18 CxRowQNCTNP3Bm0DlEFADnj5
drwxr-xr-x 1 bor users 0 Sep 15 20:11 sub2
bor@10:~> encfs ~/.encfs ~/encfs
EncFS Password:
bor@10:~> ll encfs/
total 0
drwxr-xr-x 1 bor users 0 Sep 15 20:18 sub1
bor@10:~> mkdir encfs/sub1/even-deeper
bor@10:~> ll .encfs/
total 0
drwxr-xr-x 1 bor users 48 Sep 15 20:19 CxRowQNCTNP3Bm0DlEFADnj5
drwxr-xr-x 1 bor users  0 Sep 15 20:11 sub2
bor@10:~> ll .encfs/CxRowQNCTNP3Bm0DlEFADnj5/
total 0
drwxr-xr-x 1 bor users 0 Sep 15 20:19 Z4yelS9xOwbOxZ0tGxopma8K
bor@10:~>

That is what I meant - subvolumes themselves are transparent as long as
you can traverse them. So if they remain under the encfs mount root and
have names that encfs can decode, it is possible to reach them.

> To clarify, the procedure for mounting an EncFS volume is:
> 1. Mount the underlying filesystem (usually done at boot by the init
> system).
> 2. Mount the EncFS instance that is using that underlying filesystem as
> storage (usually done on user log-in by the session manager or PAM).
> 
> On BTRFS, step 1 is implicit if it's a subvolume, but step 2 is never
> implicit, regardless of the filesystem.  Hugo's mention of needing
> mounting the snapshot with EncFS refers to the second step here, not the
> first.



Re: snapshots of encrypted directories?

2017-09-15 Thread Austin S. Hemmelgarn

On 2017-09-15 12:28, Ulli Horlacher wrote:

On Fri 2017-09-15 (12:15), Peter Becker wrote:

2017-09-15 12:01 GMT+02:00 Ulli Horlacher :


On Fri 2017-09-15 (06:45), Andrei Borzenkov wrote:


The actual question is - do you need to mount each individual btrfs
subvolume when using encfs?


It gets even worse with ecryptfs: I do not know at all how to mount a
snapshot so that the user has access to it.


A snapshot is simply a subvolume.

Get the ID of the snapshot and mount it:

btrfs subvolume list /btrfs
mount -o subvolid=<ID> /dev/<device> <mountpoint>

Or mount the snapshot directly by path:

mount -o subvol=/snapshots/home/2015-12-01 /dev/<device> <mountpoint>

And then mount enryptfs:

mount.ecryptfs <lower dir> <mount dir>


This is only possible as root.
For a user it is not possible to have access to his own snapshots.
Bad.

Which is why you use EncFS (which is a FUSE module that runs in 
userspace and requires no root privileges) instead of eCryptFS (which is 
a kernel assisted filesystem that doesn't use FUSE, has more complicated 
setup constraints, and requires CAP_SYS_ADMIN or root access).




Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Goffredo Baroncelli
On 09/15/2017 10:26 AM, Hugo Mills wrote:
> On Fri, Sep 15, 2017 at 08:04:35AM +0200, Goffredo Baroncelli wrote:
>> On 09/15/2017 12:18 AM, Hugo Mills wrote:
>>>As far as I know, both of these are basically known issues, with no
>>> good solution, other than not using O_DIRECT. Certainly the first
>>> issue is one I recognise. The second isn't one I recognise directly,
>>> but is unsurprising to me.
>>>
>>>There have been discussions -- including developers -- on this list
>>> as recent as a month or so ago. The general outcome seems to be that
>>> any problems with O_DIRECT are not going to be fixed.
>>
>> I missed this thread; could you point it to me ?
> 
>No, you didn't miss it -- you were part of it. :)
> 
>http://www.spinics.net/lists/linux-btrfs/msg68244.html
> 
I had hoped that there was a deeper analysis. That message was more of an
acknowledgement than an analysis :(


>Hugo.
> 
>> If csum and O_DIRECT are not reliable, why not disallow one of them, i.e.
>> allow O_DIRECT only on nodatasum files... ZFS (on Linux) does not support
>> O_DIRECT at all...
>>
>> In fact most of the applications which benefit from O_DIRECT (VMs and DBs
>> come to mind) are the ones which also need nodatasum to get good performance.
>>
>> One of the strongest points of BTRFS is its checksums; but these are not
>> effective when the file is opened with O_DIRECT; worse, there are cases where
>> the file is corrupted and the application gets -EIO; not to mention that
>> dmesg is filled with "csum failed ..." messages
>>
>>
>>>
>>>Hugo.
>>>
>>> On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
 Hi all,

 I discovered two bugs when O_DIRECT is used...

 1) a corrupted file doesn't return -EIO when O_DIRECT is used

 Normally BTRFS prevents access to the contents of a corrupted file;
 however I was able to read the content of a corrupted file simply by using
 O_DIRECT

 # in a new btrfs filesystem, create a file
 $ sudo mkfs.btrfs -f /dev/sdd5
 $ mount /dev/sdd5 t
 $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd 
 bs=$((16*1024)) iflag=fullblock count=1024

 # corrupt the file
 $ sudo filefrag -v t/abcd 
 Filesystem type is: 9123683e
 File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
  ext: logical_offset:physical_offset: length:   expected: 
 flags:
0:0..3475:  70656.. 74131:   3476:
1: 3476..4095:  74212.. 74831:620:  74132: 
 last,eof
 t/abcd: 2 extents found
 $ sudo umount t
 $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
 /dev/sdd5
 mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
 corrupting 289406976 copy 1

 # try to access the file; expected result: -EIO
 $ sudo mount /dev/sdd5 t
 $ dd if=t/abcd | hexdump -c | head
 dd: error reading 't/abcd': Input/output error
 0+0 records in
 0+0 records out
 0 bytes copied, 0.000477413 s, 0.0 kB/s


 # try to access the file using O_DIRECT; expected result: -EIO, instead 
 the file is accessible
 $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
 *
 0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
 0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
 0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
 0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
 0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
 0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
 0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
 0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g

 (dmesg report the checksum mismatch)
 [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 
 0 csum 0x98f94189 expected csum 0x0ab6be80 mirror 1

 Note the first 4k filled by 0x01 !

 Conclusion: even if the file is corrupted and normally BTRFS prevents
 access to it, using O_DIRECT
 a) no error is returned to the caller
 b) instead of the page stored on the disk, a page filled with 0x01 is
 returned (in accordance with the function __readpage_endio_check())


 2) The second bug is more severe. If, during a write of a buffer
 with O_DIRECT, the buffer is updated at the same time by a second process,
 the checksum may be incorrect.

 At the end of the email there is the code which shows the problem: two
 processes share the same memory: the first writes it to the disk, the second
 updates the buffer continuously. A third process tries to read the file, but
 it got time to 

Re: [RFC 0/3]: settable compression level for zstd

2017-09-15 Thread Austin S. Hemmelgarn

On 2017-09-15 11:34, Adam Borowski wrote:

Hi!
Here's a patch set that allows changing the compression level for zstd,
currently at mount time only.  I've played with it for a month, so despite
being a quick hack, it's reasonably well tested.  Tested on 4.13 +
btrfs-for-4.14 only, though -- I've booted 11th-day-of-merge-window only an
hour ago on one machine, no explosions yet.

As a quick hack, it doesn't conserve memory as it should: all workspace
allocations assume level 15 and waste space otherwise.

Because of an (easily changeable) quirk of compression level encoding, the
max is set at 15, but I guess higher levels are pointless for 128KB blocks.
Nick and co can tell us more -- for me zstd is mostly a black box so it's
you who knows anything about tuning it.
I've got limited knowledge of the zstandard algorithm, but based on my 
(probably incomplete and possibly inaccurate) analysis of the code in 
the kernel, I think this assessment is correct.  At a minimum, higher 
levels are likely to provide no benefit considering how slow things get 
(even the speed of level 15 means it's probably not going to be used much).


There are three patches:
* [David Sterba] btrfs: allow to set compression level for zlib
   Unmodified version of the patch from Jul 24, I'm re-sending it for
   convenience.
* btrfs: allow setting zlib compression level via :9
   Some bikeshedding: it looks like Chris Mason also favours zlib:9 over
   zlib9 as the former is more readable.  If you disagree... well, it's up
   to you to decide anyway.  This patch accepts both syntaxes.
On this in particular, I think there _might_ be a potential issue with
option parsing; I'd advocate something less likely to be a metacharacter
(like '-') instead of ':', but I do agree that it's a bit more concise
when the algorithm and level are separate.

* btrfs: allow setting zstd level
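
For reference, with the series applied the mount syntax would look like,
e.g. (assuming zstd uses the same colon form as zlib):

  mount -o compress=zlib:9 /dev/sdb /mnt
  mount -o compress=zstd:7 /dev/sdb /mnt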




Re: A user cannot remove his readonly snapshots?!

2017-09-15 Thread Austin S. Hemmelgarn

On 2017-09-15 12:37, Ulli Horlacher wrote:

I have my btrfs filesystem mounted with option user_subvol_rm_allowed

tux@xerus: btrfs --version
btrfs-progs v4.4

tux@xerus: uname -a
Linux xerus 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 2017 
x86_64 x86_64 x86_64 GNU/Linux

tux@xerus: id
uid=1000(tux) gid=100(users) groups=100(users)

tux@xerus: btrfs subvolume snapshot -r /test/tux/zz 
/test/tux/zz/.snapshot/2017-09-15_1824.test
tux@xerus: btrfs subvolume delete /test/tux/zz/.snapshot/2017-09-15_1824.test
Delete subvolume (no-commit): '/test/tux/zz/.snapshot/2017-09-15_1824.test'
ERROR: cannot delete '/test/tux/zz/.snapshot/2017-09-15_1824.test': Read-only file system

root can delete this snapshot, but not the user. Why?


Add 'user_subvol_rm' to the mount options and try again.

It's because of a poor decision made regarding permissions surrounding 
subvolume operations.  The original argument was to prevent accidental 
data loss (stupid argument, rm -rf is just as bad and nearly as fast on 
most modern systems), and that behavior has never been changed.  In 
fact, I'd advocate using that option on everything, because otherwise 
you have trivial resource exhaustion issues (users can create a resource 
they then can't remove).  Ideally, normal users wouldn't be able to 
create or delete subvolumes unless explicitly allowed (possibly through 
delegation by ownership or something).



Re: defragmenting best practice?

2017-09-15 Thread Kai Krakow
On Fri, 15 Sep 2017 16:11:50 +0200, Michał Sokołowski wrote:

> On 09/15/2017 03:07 PM, Tomasz Kłoczko wrote:
> > [...]
> > Case #1
> > 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu qcow2
> > storage -> guest BTRFS filesystem  
> > SQL table row insertions per second: 1-2
> >
> > Case #2
> > 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw
> > storage -> guest EXT4 filesystem
> > SQL table row insertions per second: 10-15
> > Q-1) why are you comparing btrfs against ext4 on top of btrfs, which
> > is doing its own COW operations at the bottom of such a sandwich, if
> > we are supposed to be talking about the impact of fragmentation on
> > top of btrfs?  
> 
> Tomasz,
> you seem to be convinced that fragmentation does not matter. I found
> this (extremely bad, true) example says otherwise.

Sorry to jump in, but did you at least set the qemu image to nocow?
Otherwise this example is totally flawed because you're testing qemu
storage layer mostly and not btrfs.

A better test would've been to test qemu raw on btrfs cow vs on btrfs
nocow, with both the same file system inside the qemu image.

But you are modifying multiple parameters at once during the test, and
I expect each one has a huge impact on performance, but only one is
specific to btrfs, which you apparently did not test in isolation.

Personally, I find that running qemu qcow2 on btrfs cow buys you nothing
except really bad performance. Make one of the two layers nocow and it
should get better.

If you want to give some better numbers, please reduce this test to
just one cow layer, the one at the top layer: btrfs host fs. Copy the
image somewhere else to restore from, and ensure (using filefrag) that
the starting situation matches each test run.

Don't change any parameters of the qemu layer at each test. And run a
file system inside which doesn't do any fancy stuff, like ext2 or ext3
without journal. Use qemu raw storage.

Then test again with cow vs nocow on the host side.

Create a nocow copy of your image (if you use truncate instead of dd
below, use the size of the source image):

# rm -f qemu-image-nocow.raw
# touch qemu-image-nocow.raw
# chattr +C -c qemu-image-nocow.raw
# dd if=source-image.raw of=qemu-image-nocow.raw bs=1M
# btrfs fi defrag -f qemu-image-nocow.raw
# filefrag -v qemu-image-nocow.raw

Create a cow copy of your image:

# rm -f qemu-image-cow.raw
# touch qemu-image-cow.raw
# chattr -C -c qemu-image-cow.raw
# dd if=source-image.raw of=qemu-image-cow.raw bs=1M
# btrfs fi defrag -f qemu-image-cow.raw
# filefrag -v qemu-image-cow.raw

Given that host btrfs is mounted datacow,compress=none and without
autodefrag, and you don't touch the source image contents during tests.

Now run your test script inside both qemu machines, take your
measurements and check fragmentation again after the run.

filefrag should report no more fragments than before the test for the
first run, but an order of magnitude more for the second.

Now copy (cp) both one at a time to a new file and measure the time. It
should be slower for the highly fragmented version.

Don't forget to run tests with and without flushed caches so we get
cold and warm numbers.
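
A minimal way to take the cold-cache measurement, assuming you run as root:

# sync
# echo 3 > /proc/sys/vm/drop_caches
# time cat qemu-image-nocow.raw > /dev/null

The warm run is the same command repeated without dropping the caches.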

In this scenario, qemu would only be the application to modify the raw
image files and you're actually testing the impact of fragmentation of
btrfs.

You could also make a reflink copy of the nocow test image and do a
third test to see that it introduces fragmentation now, tho probably
much lower than for the cow test image. You can verify the numbers with
filefrag.
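
Sticking with the file names above, that reflink copy would be:

# cp --reflink=always qemu-image-nocow.raw qemu-image-reflink.raw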

According to Tomasz, your tests should not run at vastly different
speeds because fragmentation has no impact on performance, quod est
demonstrandum... I think we will not get to the "erat" part.


-- 
Regards,
Kai

Replies to list-only preferred.




Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Andrei Borzenkov
On 15.09.2017 08:50, Goffredo Baroncelli wrote:
> On 09/15/2017 05:55 AM, Andrei Borzenkov wrote:
>> On 15.09.2017 01:00, Goffredo Baroncelli wrote:
>>>
>>> 2) The second bug, is a more severe bug. If during a writing of a buffer 
>>> with O_DIRECT, the buffer is updated at the same time by a second process, 
>>> the checksum may be incorrect.
>>>
>>
>> Is it btrfs specific ? If buffer is updated before it was actually
>> consumed by kernel, this likely means data corruption on any filesystem.
> 
> I don't see any corruption in other filesystems. The fact that an application
> pushes garbage to the filesystem doesn't mean the filesystem gets corrupted.

I did not say "filesystem corruption", I said "data corruption".

> In this case the filesystem became corrupted, because another application
> which tries to read the data (without O_DIRECT) may get -EIO.
> 

No. *Data* on this filesystem was corrupted and luckily btrfs makes you
aware of it. On different filesystem you still may have the same data
corruption, but silent.

> I repeat, the problem is a data race once the data is in the filesystem's
> hands, and the kernel computes a wrong checksum.
> 

Of course it is a race. But again - I expect that when pwrite() returns it
means the data buffer can be reused. Otherwise I cannot see how O_DIRECT can
be sensibly used at all. In this case you need to demonstrate that data
corruption happens after pwrite() returns - that would make it a btrfs issue
indeed. If data corruption happens while a thread is waiting for pwrite()
to return, I say this is expected behavior and the application's fault - it
needs to protect against concurrent write and modification.

> 
> IMHO, BTRFS should disallow O_DIRECT (which is what ZFS on Linux does); I
> think that it could be allowed only for nodatasum files.
> 
>> I.e. there should be clear indication from kernel that buffer can be
>> reused by application, in your example - when pwrite returns. So when
>> data corruption happens - during pwrite or after? 
>> If data is corrupted
>> during pwrite, it is arguably application fault - it should disallow
>> concurrent access.



A user cannot remove his readonly snapshots?!

2017-09-15 Thread Ulli Horlacher
I have my btrfs filesystem mounted with option user_subvol_rm_allowed

tux@xerus: btrfs --version
btrfs-progs v4.4

tux@xerus: uname -a
Linux xerus 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 2017 
x86_64 x86_64 x86_64 GNU/Linux

tux@xerus: id
uid=1000(tux) gid=100(users) groups=100(users)

tux@xerus: btrfs subvolume snapshot -r /test/tux/zz 
/test/tux/zz/.snapshot/2017-09-15_1824.test
tux@xerus: btrfs subvolume delete /test/tux/zz/.snapshot/2017-09-15_1824.test
Delete subvolume (no-commit): '/test/tux/zz/.snapshot/2017-09-15_1824.test'
ERROR: cannot delete '/test/tux/zz/.snapshot/2017-09-15_1824.test': Read-only 
file system

root can delete this snapshot, but not the user. Why?

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<20170915163736.gd32...@rus.uni-stuttgart.de>


Re: defragmenting best practice?

2017-09-15 Thread Peter Grandi
[ ... ]
 Case #1
 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs
 -> qemu cow2 storage -> guest BTRFS filesystem
 SQL table row insertions per second: 1-2

 Case #2
 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs
 -> qemu raw storage -> guest EXT4 filesystem
 SQL table row insertions per second: 10-15
[ ... ]

>> Q 0) what do you think that you measure here?

> Cow's fragmentation impact on SQL write performance.

That's not what you are measuring: you are measuring the impact on
speed of configurations "designed" (perhaps unintentionally) for
maximum flexibility, lowest cost, and complete disregard for
speed.

[ ... ]

> It was quick and dirty task to find, prove and remove
> performance bottleneck at minimal cost.

This is based on the usual confusion between "performance" (the
result of several tradeoffs) and "speed". When you report "row
insertions per second" you are reporting a rate, that is a
"speed", not "performance", which is always multi-dimensional.
http://www.sabi.co.uk/blog/15-two.html?151023#151023

In the cases above speed is low, but I think that, taking into
account flexibility and cost, performance is pretty good.

> AFAIR removing storage cow2 and guest BTRFS storage gave us ~
> 10 times boost.

"Oh doctor, if I stop stabbing my hand with a fork it no longer
hurts, but running while carrying a rucksack full of bricks is
still slower than with a rucksack full of feathers".

[ ... ]


Re: snapshots of encrypted directories?

2017-09-15 Thread Ulli Horlacher
On Fri 2017-09-15 (12:15), Peter Becker wrote:
> 2017-09-15 12:01 GMT+02:00 Ulli Horlacher :
> 
> > On Fri 2017-09-15 (06:45), Andrei Borzenkov wrote:
> >
> >> The actual question is - do you need to mount each individual btrfs
> >> subvolume when using encfs?
> >
> > And even worse it goes with ecryptfs: I do not know at all how to mount a
> > snapshot, so that the user has access to it.
> 
> A snapshot is simply a subvolume.
> 
> Get the ID of the snapshot and mount it:
> 
> btrfs subvolume list /btrfs
> mount -o subvolid=<id> /dev/<device> /<mountpoint>
> 
> Or mount the snapshot directly by path:
> 
> mount -o subvol=/snapshots/home/2015-12-01 /dev/<device> /<mountpoint>
> 
> And then mount ecryptfs:
> 
> mount.ecryptfs /<lower dir> /<mountpoint>

This is only possible as root.
A user has no way to access his own snapshots.
Bad.


-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:


Re: [PATCH v2 7/7] btrfs-progs: Doc/mkfs: Add extra condition for rootdir option

2017-09-15 Thread David Sterba
On Fri, Sep 15, 2017 at 09:24:19PM +0800, Qu Wenruo wrote:
> > I'm going to review & merge this series to devel. Tests and
> > documentation should be updated to make the usecase clear.
> 
> I'm happy to address any comment, both code and doc/test.

For the tests I'd like to see:

* with file target, try non-existent file, zero-length file, too-small
  file and large-enough

* for block device target, I think loop device would do, too-small and
  large-enough

* after the test, verify that the files have same size and contents
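
A rough sketch of the file-target cases (file names, sizes and the source
directory are illustrative, not the final test):

  truncate -s 0 img.zero
  truncate -s 16M img.small
  truncate -s 1G img.large
  mkfs.btrfs --rootdir /some/dir img.missing   # non-existent file
  mkfs.btrfs --rootdir /some/dir img.zero      # zero-length file
  mkfs.btrfs --rootdir /some/dir img.small     # too-small file
  mkfs.btrfs --rootdir /some/dir img.large     # large-enough file

with only the last invocation expected to succeed.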


[RFC PATCH 1/3] btrfs: allow to set compression level for zlib

2017-09-15 Thread Adam Borowski
From: David Sterba 

Preliminary support for setting compression level for zlib, the
following works:

$ mount -o compress=zlib# default
$ mount -o compress=zlib0   # same
$ mount -o compress=zlib9   # level 9, slower sync, less data
$ mount -o compress=zlib1   # level 1, faster sync, more data
$ mount -o remount,compress=zlib3   # level set by remount

The level is visible in the same format in /proc/mounts. Level set via
file property does not work yet.

Required patch: "btrfs: prepare for extensions in compression options"

Signed-off-by: David Sterba 
---
 fs/btrfs/compression.c | 20 +++-
 fs/btrfs/compression.h |  6 +-
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/inode.c   |  5 -
 fs/btrfs/lzo.c |  5 +
 fs/btrfs/super.c   |  7 +--
 fs/btrfs/zlib.c| 12 +++-
 7 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b51d23f5cafa..70a50194fcf5 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -867,6 +867,11 @@ static void free_workspaces(void)
  * Given an address space and start and length, compress the bytes into @pages
  * that are allocated on demand.
  *
+ * @type_level is encoded algorithm and level, where level 0 means whatever
+ * default the algorithm chooses and is opaque here;
+ * - compression algo are 0-3
+ * - the level are bits 4-7
+ *
  * @out_pages is an in/out parameter, holds maximum number of pages to allocate
  * and returns number of actually allocated pages
  *
@@ -881,7 +886,7 @@ static void free_workspaces(void)
  * @max_out tells us the max number of bytes that we're allowed to
  * stuff into pages
  */
-int btrfs_compress_pages(int type, struct address_space *mapping,
+int btrfs_compress_pages(unsigned int type_level, struct address_space 
*mapping,
 u64 start, struct page **pages,
 unsigned long *out_pages,
 unsigned long *total_in,
@@ -889,9 +894,11 @@ int btrfs_compress_pages(int type, struct address_space 
*mapping,
 {
struct list_head *workspace;
int ret;
+   int type = type_level & 0xF;
 
workspace = find_workspace(type);
 
+   btrfs_compress_op[type - 1]->set_level(workspace, type_level);
ret = btrfs_compress_op[type-1]->compress_pages(workspace, mapping,
  start, pages,
  out_pages,
@@ -1081,3 +1088,14 @@ int btrfs_compress_heuristic(struct inode *inode, u64 
start, u64 end)
 
return ret;
 }
+
+unsigned int btrfs_compress_str2level(const char *str)
+{
+   if (strncmp(str, "zlib", 4) != 0)
+   return 0;
+
+   if ('1' <= str[4] && str[4] <= '9' )
+   return str[4] - '0';
+
+   return 0;
+}
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index d2781ff8f994..da20755ebf21 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -76,7 +76,7 @@ struct compressed_bio {
 void btrfs_init_compress(void);
 void btrfs_exit_compress(void);
 
-int btrfs_compress_pages(int type, struct address_space *mapping,
+int btrfs_compress_pages(unsigned int type_level, struct address_space 
*mapping,
 u64 start, struct page **pages,
 unsigned long *out_pages,
 unsigned long *total_in,
@@ -95,6 +95,8 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
*inode, u64 start,
 blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 int mirror_num, unsigned long bio_flags);
 
+unsigned btrfs_compress_str2level(const char *str);
+
 enum btrfs_compression_type {
BTRFS_COMPRESS_NONE  = 0,
BTRFS_COMPRESS_ZLIB  = 1,
@@ -124,6 +126,8 @@ struct btrfs_compress_op {
  struct page *dest_page,
  unsigned long start_byte,
  size_t srclen, size_t destlen);
+
+   void (*set_level)(struct list_head *ws, unsigned int type);
 };
 
 extern const struct btrfs_compress_op btrfs_zlib_compress;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5a8933da39a7..dd07a7ef234c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -791,6 +791,7 @@ struct btrfs_fs_info {
 */
unsigned long pending_changes;
unsigned long compress_type:4;
+   unsigned int compress_level;
int commit_interval;
/*
 * It is a suggestive number, the read side is safe even it gets a
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 128f3e58634f..28201b924575 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -530,7 +530,10 @@ static noinline void compress_file_range(struct inode 
*inode,
 */
extent_range_clear_dirty_for_io(inode, 

[RFC PATCH 2/3] btrfs: allow setting zlib compression level via :9

2017-09-15 Thread Adam Borowski
This is bikeshedding, but it seems people are drastically more likely to
understand "zlib:9" as compression level rather than an algorithm version
compared to "zlib9".
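
For example, both of these are accepted with this patch (device and mount
point illustrative):

$ mount -o compress=zlib:9 /dev/sdx /mnt
$ mount -o compress=zlib9 /dev/sdx /mnt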

Signed-off-by: Adam Borowski 
---
 fs/btrfs/compression.c | 2 ++
 fs/btrfs/super.c   | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 70a50194fcf5..71782ec976c7 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1096,6 +1096,8 @@ unsigned int btrfs_compress_str2level(const char *str)
 
if ('1' <= str[4] && str[4] <= '9' )
return str[4] - '0';
+   if (str[4] == ':' && '1' <= str[5] && str[5] <= '9')
+   return str[5] - '0';
 
return 0;
 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 5467187701ef..537e04120457 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1248,7 +1248,7 @@ static int btrfs_show_options(struct seq_file *seq, 
struct dentry *dentry)
else
seq_printf(seq, ",compress=%s", compress_type);
if (info->compress_level)
-   seq_printf(seq, "%d", info->compress_level);
+   seq_printf(seq, ":%d", info->compress_level);
}
if (btrfs_test_opt(info, NOSSD))
seq_puts(seq, ",nossd");
-- 
2.14.1



[RFC PATCH 3/3] btrfs: allow setting zstd level

2017-09-15 Thread Adam Borowski
Capped at 15 because of currently used encoding, which is also a reasonable
limit because highest levels shine only on blocks much bigger than btrfs'
128KB.

Memory is allocated for the biggest supported level rather than for
what is actually used.
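
For example (device and mount point illustrative):

$ mount -o compress=zstd:7 /dev/sdx /mnt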

Signed-off-by: Adam Borowski 
---
 fs/btrfs/compression.c | 21 +++--
 fs/btrfs/props.c   |  2 +-
 fs/btrfs/super.c   |  3 ++-
 fs/btrfs/zstd.c| 24 +++-
 4 files changed, 37 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 71782ec976c7..2d4337756fef 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1091,13 +1091,22 @@ int btrfs_compress_heuristic(struct inode *inode, u64 
start, u64 end)
 
 unsigned int btrfs_compress_str2level(const char *str)
 {
-   if (strncmp(str, "zlib", 4) != 0)
+   long level;
+   int max;
+
+   if (strncmp(str, "zlib", 4) == 0)
+   max = 9;
+   else if (strncmp(str, "zstd", 4) == 0)
+   max = 15; // encoded on 4 bits, real max is 22
+   else
return 0;
 
-   if ('1' <= str[4] && str[4] <= '9' )
-   return str[4] - '0';
-   if (str[4] == ':' && '1' <= str[5] && str[5] <= '9')
-   return str[5] - '0';
+   str += 4;
+   if (*str == ':')
+   str++;
 
-   return 0;
+   if (kstrtoul(str, 10, &level))
+   return 0;
+
+   return (level > max) ? 0 : level;
 }
diff --git a/fs/btrfs/props.c b/fs/btrfs/props.c
index f6a05f836629..2e35aa2b2d79 100644
--- a/fs/btrfs/props.c
+++ b/fs/btrfs/props.c
@@ -414,7 +414,7 @@ static int prop_compression_apply(struct inode *inode,
type = BTRFS_COMPRESS_LZO;
else if (!strncmp("zlib", value, 4))
type = BTRFS_COMPRESS_ZLIB;
-   else if (!strncmp("zstd", value, len))
+   else if (!strncmp("zstd", value, 4))
type = BTRFS_COMPRESS_ZSTD;
else
return -EINVAL;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 537e04120457..f9d4522336db 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -515,9 +515,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
*options,
btrfs_clear_opt(info->mount_opt, NODATASUM);
btrfs_set_fs_incompat(info, COMPRESS_LZO);
no_compress = 0;
-   } else if (strcmp(args[0].from, "zstd") == 0) {
+   } else if (strncmp(args[0].from, "zstd", 4) == 0) {
compress_type = "zstd";
info->compress_type = BTRFS_COMPRESS_ZSTD;
+   info->compress_level = 
btrfs_compress_str2level(args[0].from);
btrfs_set_opt(info->mount_opt, COMPRESS);
btrfs_clear_opt(info->mount_opt, NODATACOW);
btrfs_clear_opt(info->mount_opt, NODATASUM);
diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
index 607ce47b483a..99e11cb2d60e 100644
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -26,11 +26,14 @@
 #define ZSTD_BTRFS_MAX_WINDOWLOG 17
 #define ZSTD_BTRFS_MAX_INPUT (1 << ZSTD_BTRFS_MAX_WINDOWLOG)
 #define ZSTD_BTRFS_DEFAULT_LEVEL 3
+// Max supported by the algorithm is 22, but gains for small blocks (128KB)
+// are limited, thus we cap at 15.
+#define ZSTD_BTRFS_MAX_LEVEL 15
 
-static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len)
+static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len, int level)
 {
-   ZSTD_parameters params = ZSTD_getParams(ZSTD_BTRFS_DEFAULT_LEVEL,
-   src_len, 0);
+   ZSTD_parameters params = ZSTD_getParams(level, src_len, 0);
+   BUG_ON(level > ZSTD_maxCLevel());
 
if (params.cParams.windowLog > ZSTD_BTRFS_MAX_WINDOWLOG)
params.cParams.windowLog = ZSTD_BTRFS_MAX_WINDOWLOG;
@@ -43,6 +46,7 @@ struct workspace {
size_t size;
char *buf;
struct list_head list;
+   int level;
 };
 
 static void zstd_free_workspace(struct list_head *ws)
@@ -57,7 +61,8 @@ static void zstd_free_workspace(struct list_head *ws)
 static struct list_head *zstd_alloc_workspace(void)
 {
ZSTD_parameters params =
-   zstd_get_btrfs_parameters(ZSTD_BTRFS_MAX_INPUT);
+   zstd_get_btrfs_parameters(ZSTD_BTRFS_MAX_INPUT,
+ ZSTD_BTRFS_MAX_LEVEL);
struct workspace *workspace;
 
workspace = kzalloc(sizeof(*workspace), GFP_KERNEL);
@@ -101,7 +106,7 @@ static int zstd_compress_pages(struct list_head *ws,
unsigned long len = *total_out;
const unsigned long nr_dest_pages = *out_pages;
unsigned long max_out = nr_dest_pages * PAGE_SIZE;
-   ZSTD_parameters params = zstd_get_btrfs_parameters(len);
+   

[RFC 0/3]: settable compression level for zstd

2017-09-15 Thread Adam Borowski
Hi!
Here's a patch set that allows changing the compression level for zstd,
currently at mount time only.  I've played with it for a month, so despite
being a quick hack, it's reasonably well tested.  Tested on 4.13 +
btrfs-for-4.14 only, though -- I've booted 11th-day-of-merge-window only an
hour ago on one machine, no explosions yet.

As a quick hack, it doesn't conserve memory as it should: all workspace
allocations assume level 15 and waste space otherwise.

Because of an (easily changeable) quirk of compression level encoding, the
max is set at 15, but I guess higher levels are pointless for 128KB blocks. 
Nick and co can tell us more -- for me zstd is mostly a black box so it's
you who knows anything about tuning it.

There are three patches:
* [David Sterba] btrfs: allow to set compression level for zlib
  Unmodified version of the patch from Jul 24, I'm re-sending it for
  convenience.
* btrfs: allow setting zlib compression level via :9
  Some bikeshedding: it looks like Chris Mason also favours zlib:9 over
  zlib9 as the former is more readable.  If you disagree... well, it's up
  to you to decide anyway.  This patch accepts both syntaxes.
* btrfs: allow setting zstd level


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄ (funeral doom metal).


Re: [PATCH preview] btrfs: allow to set compression level for zlib

2017-09-15 Thread David Sterba
On Sun, Aug 20, 2017 at 06:38:50PM -0600, Chris Murphy wrote:
> On Fri, Aug 18, 2017 at 10:08 AM, David Sterba  wrote:
> 
> > That's quite a lot, in kernel. IIRC zlib and lzo use less than 200kb,
> > zstd wants 800kb for level 1. And this needs to be contiguous memory, so
> > if we're lucky and get the memory at the mount time, fine. In general
> > the memory can be fragmented (in the worst case, there are only 4k
> > chunks available), so we'd have to vmalloc and consume the virtual,
> > mappings in great numbers.
> 
> Any thoughts on bootloader support, both in general, and as it relates
> to levels of compression and memory constraints? GRUB switches to
> protected mode early on but other bootloaders might have more
> limitations. I guess really it's just GRUB and extlinux right now,
> there were patches some time ago for Das U-Boot but they still aren't
> merged.

I'm not sure if the memory requirements are the same for compression and
decompression. The table is related to compression. I've looked around the
web to find what the decompression requirements are but haven't found
anything.


Re: defragmenting best practice?

2017-09-15 Thread Michał Sokołowski
On 09/15/2017 03:07 PM, Tomasz Kłoczko wrote:
> [...]
> Case #1
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2 storage
> -> guest BTRFS filesystem
> SQL table row insertions per second: 1-2
>
> Case #2
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw storage ->
> guest EXT4 filesystem
> SQL table row insertions per second: 10-15
> Q -1) why are you comparing btrfs against ext4 on top of btrfs,
> which is doing its own COW operations at the bottom of such sandwiches,
> if we are SUPPOSED to be talking about the impact of fragmentation on
> top of btrfs?

Tomasz,
you seem to be convinced that fragmentation does not matter. I found
this (extremely bad, true) example says otherwise.

> Q 0) what do you think that you measure here?

COW fragmentation's impact on SQL write performance.

> Q 1) how did you produce those time measurements? the time command?
> looking at a watch?

The time command (real) on a bash script inserting 1000 rows (an index and
a 128B random string).
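
Roughly this shape, with the table schema and database details made up here:

  str=$(tr -dc 'a-z0-9' < /dev/urandom | head -c 128)
  time for i in $(seq 1000); do
    psql -d testdb -c "INSERT INTO t (id, payload) VALUES ($i, '$str')"
  done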

> Q 2) why are there ranges of timings? did you repeat some operations
> a few times (how many times, and with or without dropping caches or doing
> reboots?)

Yes, we've repeated it, with and without flushing the cache (it didn't seem
to have any impact). I cannot remember whether there were any reboots.
Those big time ranges are because I don't have the exact numbers on me. It
was a quick and dirty task to find, prove and remove a performance
bottleneck at minimal cost. AFAIR removing the qcow2 storage and the guest
BTRFS filesystem gave us a ~10 times boost. Surprisingly for us, this boost
seems to be consistent (it does not degrade noticeably over time - 2 months
since the change).

> Q 3) What kind of SQL engine? with what kind of settings? with what
> kind of tables? (indexes? foreign keys?) What kind of transaction
> semantics?

PostgreSQL and MySQL both gave us those results. *

> Q 4) where is the example set of inserts which I can replay in my
> setup? did you drop caches before each batch of inserts? (do you know that
> every insert generates some number of read IOs as well, so whether something
> is already cached before a batch of inserts is *crucial*)
> Did you restart the SQL engine?
> Q 5) have both tests been executed on the same box? if not, which
> versions of the kernel have been used?

Same distribution, machine and kernel. *

> Q 6) effectively how many IOs have been done during those tests? how
> did you measure those numbers (dtrace? perf? systemtap?)

I didn't check that. *

> Q 7) why are you running your tests over qemu? Was anything more
> running on the host system during those tests?

Because of "production" environment location. No, there was not.

*) If you're really interested (which I doubt), then I can put an example
environment somewhere and gather more data.





Re: [PATCH v2 7/7] btrfs-progs: Doc/mkfs: Add extra condition for rootdir option

2017-09-15 Thread Qu Wenruo



On 2017年09月15日 20:56, David Sterba wrote:

On Wed, Sep 13, 2017 at 08:32:50AM +0800, Qu Wenruo wrote:

On 2017年09月13日 02:07, David Sterba wrote:

On Tue, Sep 12, 2017 at 07:50:19PM +0200, Goffredo Baroncelli wrote:

On 09/12/2017 07:03 PM, David Sterba wrote:

Say I want to prepare a minimal image but will provide a large file

  ^^

I think that if the target is a file AND --minimize is passed, it is a
reasonable expectation that the file is created "on the fly" and grown
up to what needed.

What I mean is that "--minimize" is passed (and a file is passed), mkfs.btrfs 
should
a) create the file if it doesn't exist, and avoid any check about its length
b) truncate the file at the end


I think the farthest-offset write will extend the file size, so we
should not truncate it, in the sense 'make it smaller than it is', but
rather 'align it to the size where the filesystem expects live data".


Unfortunately, that's not the case.

Current implementation must determine its size at the very beginning.

That's to say, if mkfs determines that it's going to use a 1G size, then
current btrfs-progs guarantees that no write goes over the 1G boundary.
(Unless there is some hidden bug)

So if we are really going to implement the "--minimize" option, we are
doing asymmetric behavior:

1) For fs smaller than file
 Good, we truncate it

2) For fs larger than file
 Bad, we will get ENOSPC error.

One workaround I mentioned in V1 is to make a large enough sparse file
first and then truncate it.

But such workaround has some problem, especially related to the chunk
allocator.

For a 1G sparse file, as an example, we will have about a 100M data chunk,
while for a 128M sparse file we will have about a 10M data chunk.
That will waste a lot of space when using a large sparse file and then
truncating it, compared to using a small existing file.

Considering all the variables involved, I chose the KISS method:
let the user handle it, and mkfs.btrfs --rootdir will just do the copy work.

If the user really wants a minimal image, the best (AFAIK the only
correct) method is to implement btrfs-progs balance (offline balance)
to compact all the metadata and data, then truncate it (offline resize).

The initial --rootdir is doing too many things at the same time in mkfs.
And that's why I want to make it simple.

BTW, just as I stated in the cover letter, "mkfs.ext4 -d" keeps it as simple
as what I did here, and the complex work is handled by a utility outside of
e2fsprogs, genext2fs.



And there may be some use case relying on this truncation, but hey, if
the need were really high, there would already be a lot of complaints
about --rootdir shrinking the fs, and a more tested/documented --rootdir
implementation.

So it's not really worth it to implement a complex code base to handle
something while there is not much user base.


I think all points have been covered, thanks.

The backward compatibility does not seem to be broken, the difference
between old and new --rootdir is only the size of the resulting
filesystem, otherwise the file size is left untouched and sufficient
size is required.

Minimizing can come as a separate step, either using off-line balance,
or possibly also as an option to mkfs.


Glad to hear that.



I'm going to review & merge this series to devel. Tests and
documentation should be updated to make the usecase clear.


I'm happy to address any comment, both code and doc/test.

Thanks,
Qu


Re: defragmenting best practice?

2017-09-15 Thread Tomasz Kłoczko
On 15 September 2017 at 11:54, Michał Sokołowski  wrote:
[..]
>> Just please some example which I can try to replay which ill be
>> showing that we have similar results.
>
> Case #1
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2 storage
> -> guest BTRFS filesystem
> SQL table row insertions per second: 1-2
>
> Case #2
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw storage ->
> guest EXT4 filesystem
> SQL table row insertions per second: 10-15

Q -1) why are you comparing btrfs against ext4 on top of btrfs, which is
doing its own COW operations at the bottom of such sandwiches, if we are
SUPPOSED to be talking about the impact of fragmentation on top of
btrfs?
Q 0) what do you think that you measure here?
Q 1) how did you produce those time measurements? the time command?
looking at a watch?
Q 2) why are there ranges of timings? did you repeat some operations
a few times (how many times, and with or without dropping caches or doing
reboots?)
Q 3) What kind of SQL engine? with what kind of settings? with what
kind of tables? (indexes? foreign keys?) What kind of transaction
semantics?
Q 4) where is the example set of inserts which I can replay in my
setup? did you drop caches before each batch of inserts? (do you know that
every insert generates some number of read IOs as well, so whether something
is already cached before a batch of inserts is *crucial*)
Did you restart the SQL engine?
Q 5) have both tests been executed on the same box? if not, which
versions of the kernel have been used?
Q 6) effectively how many IOs have been done during those tests? how
did you measure those numbers (dtrace? perf? systemtap?)
Q 7) why are you running your tests over qemu? Was anything more
running on the host system during those tests?
.
.
.
I can probably make this list of questions 2 or 3 times longer.

koczek
--
Tomasz Kłoczko | LinkedIn: http://lnkd.in/FXPWxH


Re: [PATCH v2 7/7] btrfs-progs: Doc/mkfs: Add extra condition for rootdir option

2017-09-15 Thread David Sterba
On Wed, Sep 13, 2017 at 08:32:50AM +0800, Qu Wenruo wrote:
> On 2017年09月13日 02:07, David Sterba wrote:
> > On Tue, Sep 12, 2017 at 07:50:19PM +0200, Goffredo Baroncelli wrote:
> >> On 09/12/2017 07:03 PM, David Sterba wrote:
> >>> Say I want to prepare a minimal image but will provide a large file
> >>  ^^
> >>
> >> I think that if the target is a file AND --minimize is passed, it is a
> >> reasonable expectation that the file is created "on the fly" and grown
> >> up to what needed.
> >>
> >> What I mean is that "--minimize" is passed (and a file is passed), 
> >> mkfs.btrfs should
> >> a) create the file if it doesn't exist, and avoid any check about its 
> >> length
> >> b) truncate the file at the end
> > 
> > I think the farthest-offset write will extend the file size, so we
> > should not truncate it, in the sense 'make it smaller than it is', but
> > rather 'align it to the size where the filesystem expects live data".
> 
> Unfortunately, that's not the case.
> 
> Current implementation must determine its size at the very beginning.
> 
> That's to say, if mkfs determines that it's going to use a 1G size, then
> current btrfs-progs guarantees that no write goes over the 1G boundary.
> (Unless there is some hidden bug)
> 
> So if we are really going to implement the "--minimize" option, we are
> doing asymmetric behavior:
> 
> 1) For fs smaller than file
> Good, we truncate it
> 
> 2) For fs larger than file
> Bad, we will get ENOSPC error.
> 
> One workaround I mentioned in V1 is to make a large enough sparse file 
> first and then truncate it.
> 
> But such workaround has some problem, especially related to the chunk 
> allocator.
> 
> For a 1G sparse file, as an example, we will have about a 100M data chunk,
> while for a 128M sparse file we will have about a 10M data chunk.
> That will waste a lot of space when using a large sparse file and then
> truncating it, compared to using a small existing file.
> 
> Considering all the variables involved, I chose the KISS method:
> let the user handle it, and mkfs.btrfs --rootdir will just do the copy work.
> 
> If the user really wants a minimal image, the best (AFAIK the only
> correct) method is to implement btrfs-progs balance (offline balance)
> to compact all the metadata and data, then truncate it (offline resize).
> 
> The initial --rootdir is doing too many things at the same time in mkfs.
> And that's why I want to make it simple.
> 
> BTW, just as I stated in the cover letter, "mkfs.ext4 -d" keeps it as simple
> as what I did here, and the complex work is handled by a utility outside of
> e2fsprogs, genext2fs.
> 
> 
> 
> And there may be some use case relying on this truncation, but hey, if
> the need were really high, there would already be a lot of complaints
> about --rootdir shrinking the fs, and a more tested/documented --rootdir
> implementation.
> 
> So it's not really worth it to implement a complex code base to handle 
> something while there is not much user base.

I think all points have been covered, thanks.

The backward compatibility does not seem to be broken, the difference
between old and new --rootdir is only the size of the resulting
filesystem, otherwise the file size is left untouched and sufficient
size is required.

Minimizing can come as a separate step, either using off-line balance,
or possibly also as an option to mkfs.

I'm going to review & merge this series to devel. Tests and
documentation should be updated to make the usecase clear.


Re: snapshots of encrypted directories?

2017-09-15 Thread Austin S. Hemmelgarn

On 2017-09-14 23:45, Andrei Borzenkov wrote:

On 14.09.2017 18:32, Hugo Mills wrote:

On Thu, Sep 14, 2017 at 04:57:39PM +0200, Ulli Horlacher wrote:

I use encfs on top of btrfs.
I can create btrfs snapshots, but I have no suggestive access to the files
in these snaspshots, because they look like:

drwx--  framstag users- 2017-09-08 11:47:18 uHjprldmxo3-nSfLmcH54HMW
drwxr-xr-x  framstag users- 2017-09-08 11:47:18 wNEWaDCgyXTj0d-Myk8wXZfh
-rw-r--r--  framstag users  377 2015-06-12 14:02:53 -zDmc7xfobKDkbl8z7oKOHxv
-rw-r--r--  framstag users2,367 2012-07-10 14:32:30 7pfKs27K9k5zANE4WOQEuFa2
-rw---  framstag users  692 2009-10-20 13:45:41 8SQElYCph85kDdcFasUHybVr
-rw---  framstag users2,872 2017-08-31 16:21:52 bm,yNi1e4fsAClDv7lNxxSfJ
lrwxrwxrwx  framstag users- 2017-06-01 15:53:00 GZxNYI0Gy96R18fz40f7k5rl 
-> wvuQKHYzdFbar18fW6jjOerXk2IsS4OAA2fnHalBZjMQ,7Kw0j-zE3IJqxhmmGBN8G9
-rw-r--r--  framstag users  182 2016-12-01 13:34:31 
rqtNBbiYDym0hPMbBL-VLJZcFZu6nkNxlsjTX-sU88I4I1

I have to mount the snapshot with encfs, to have access to the (decrypted)
files.

Any better ideas?


I'd say it's doing exactly what it should be doing. You're making a
copy of an encrypted data store,


With all respect - a snapshot is not a copy.
From a userspace perspective, it is, and that's what matters since 
EncFS is a userspace tool.  In fact, part of the point of a snapshot is 
that it's functionally indistinguishable from a direct copy of the data 
unless you start looking at block layouts (which nothing in userspace 
that isn't an administration tool should be doing).



and the result is encrypted. In order
to read it, it needs to have the decrpytion layer applied to it with
the correct key (which is the need to mount the snapshot with encfs).



But snapshot *is* mounted implicitly as it is part of mounted btrfs
filesystem. So I can see that this behavior could be rather unexpected.


Would you _really_ want a system where the encrypted contents of a
subvolume can be decrypted by simply snapshotting it?


The actual question is - do you need to mount each individual btrfs
subvolume when using encfs? If yes, this behavior is at least
consistent. If not - how are snapshots different?
I think you're not understanding the layering here.  EncFS is a FUSE 
filesystem that is run as a separate layer on top of another filesystem. 
 It is completely agnostic of the underlying data storage 
implementation, provided that said data storage enforces POSIX I/O 
semantics.


To clarify, the procedure for mounting an EncFS volume is:
1. Mount the underlying filesystem (usually done at boot by the init 
system).
2. Mount the EncFS instance that is using that underlying filesystem as 
storage (usually done on user log-in by the session manager or PAM).
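
In command form, with all paths and device names hypothetical, that is
roughly:

mount -o subvol=home /dev/sda2 /home         # step 1
encfs /home/user/.crypt /home/user/Private   # step 2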


On BTRFS, step 1 is implicit if it's a subvolume, but step 2 is never 
implicit, regardless of the filesystem.  Hugo's mention of needing 
mounting the snapshot with EncFS refers to the second step here, not the 
first.



Re: defragmenting best practice?

2017-09-15 Thread Austin S. Hemmelgarn

On 2017-09-14 22:26, Tomasz Kłoczko wrote:

On 14 September 2017 at 19:53, Austin S. Hemmelgarn
 wrote:
[..]

While it's not for BTRFS, a tool called e4rat might be of interest to you
regarding this.  It reorganizes files on an ext4 filesystem so that stuff
used by the boot loader is right at the beginning of the device, and I've
known people to get insane performance improvements (on the order of 20x in
some pathologically bad cases) in the time taken from the BIOS handing
things off to GRUB to GRUB handing execution off to the kernel.


Do you know that what you've just written has nothing to do with fragmentation?
Intentionally or not, you are just trying to change the subject.
As hard as it may be to believe, this _is_ relevant to the part of your 
reply that I was responding to, namely:


> By how much it is possible to improve boot time?

Note that discussion of file ordering impacting boot times is almost 
always centered around the boot loader, and _not_ userspace (because as 
you choose to focus on in changing the subject for the rest of this 
message, it's trivially possible to improve performance in userspace 
with some really simple tweaks).


You wanted examples regarding reordering of data in a localized manner 
improving boot time, so I gave _the_ reference for this on Linux (e4rat 
is the only publicly available tool I know of that does this).


[..]

This shouldn't need examples.  It's trivial math combined with basic
knowledge of hardware behavior.  Every request to a device has a minimum
amount of overhead.  On traditional hard drives, this is usually dominated
by seek latency, but on SSD's, the request setup, dispatch, and completion
are the dominant factor.  Assuming you have a 2 micro-second overhead
per-request (not an exact number, just chosen for demonstration purposes
because it makes the math easy), and a 1GB file, the time difference between
reading ten 100MB extents and reading ten thousand 100kB extents is just
short of 0.02 seconds, or a factor of about one thousand (which, no surprise
here, is the factor of difference between the number of extents).


So to produce a few seconds of delay during boot you need to make a few
hundred thousand if not millions more IOs than when reading everything
using ideal long sequential reads.
No, that isn't what I was talking about.  Quit taking things out of 
context and assuming all of someone's reply is about only part of yours.


This was responding solely to this:

> That it may be an issue with using extents.
> Again: please show some results of some test unit which anyone will be
> able to reply and confirm or not that this effect really exist.

And has nothing to do with boot time.


Almost every package upgrade rewrites some files completely, which with
COW will produce fully contiguous areas per file.
You know... there are not so many files in a typical distribution
installation to produce such a measurable impact. On my current laptop I have a
lot of devel and debug stuff installed
and still I have only

$ rpm -qal | wc -l
276489

files (from which only small fractions are ELF DSOs or executables)
installed by:

$ rpm -qa | wc -l
2314

packages.

I can bet that during even a very complicated boot process only a few
hundred files will be touched (by read IOs). None of those files
will be read sequentially, because this is not how executable content
is usually loaded into the buffer cache. Simply changing the block device
read-ahead may improve boot time enough without putting all blocks in
perfect order. All you need is to run "blockdev --setra N" early enough,
where N is greater than the default of 256. All this can be
done without thinking about fragmentation.
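
For reference, a concrete invocation, with the device name illustrative:

blockdev --setra 4096 /dev/sda
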
As I mentioned above, the primary argument for reordering data for boot 
is largely related to the boot-loader, which doesn't have intelligent 
I/O scheduling and doesn't do read ahead, and is primarily about usage 
with traditional hard drives, where seek latency caused by lack of data 
locality actually does have a significant (and well documented) impact.



Seems you don't know that Linux by default reads data from a block
dev in chunks of at least 256 sectors, because such an IO
size is part of the default RA settings. You can change those settings
just for boot time and you will have a way lower number of IOs and still
no significant improvement like a few times shorter boot. Fragmentation
will in such a case be a secondary factor.
All this could be done without bothering about fragmentation.
The block-level read-ahead done by the kernel has near zero impact on 
performance unless your data is already highly local (not necessarily 
ordered, but at least all in the same place), which will almost never be 
the case on BTRFS when dealing with an active data set because of its 
copy on write semantics.


In other words, you are still talking about some in-principle possible
results which would be falsified if you tried at least once to do some
real tests and

Re: defragmenting best practice?

2017-09-15 Thread Peter Grandi
> Case #1
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2 storage
> -> guest BTRFS filesystem
> SQL table row insertions per second: 1-2

"Doctor, if I stab my hand with a fork it hurts a lot: can you
cure that?"

> Case #2
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw
> storage -> guest EXT4 filesystem
> SQL table row insertions per second: 10-15

"Doctor, I can't run as fast with a backpack full of bricks as
without it: can you cure that?"

:-)


Re: defragmenting best practice?

2017-09-15 Thread Michał Sokołowski
On 09/14/2017 07:48 PM, Tomasz Kłoczko wrote:
> On 14 September 2017 at 16:24, Kai Krakow  wrote:
> [..]
>> > Getting e.g. boot files into read order or at least nearby improves
>> > boot time a lot. Similar for loading applications.
> [...]
> Just please some example which I can try to replay which ill be
> showing that we have similar results.

Case #1
2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2 storage
-> guest BTRFS filesystem
SQL table row insertions per second: 1-2

Case #2
2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw storage ->
guest EXT4 filesystem
SQL table row insertions per second: 10-15






Re: snapshots of encrypted directories?

2017-09-15 Thread Peter Becker
2017-09-15 12:01 GMT+02:00 Ulli Horlacher :
> On Fri 2017-09-15 (06:45), Andrei Borzenkov wrote:
>
>> The actual question is - do you need to mount each individual btrfs
>> subvolume when using encfs?
>
> And even worse it goes with ecryptfs: I do not know at all how to mount a
> snapshot, so that the user has access to it.

A snapshot is simply a subvolume.

Get the ID of the snapshot and mount it:

btrfs subvolume list /btrfs
mount -o subvolid=<id> /dev/<device> /<mountpoint>

Or mount the snapshot directly by path:

mount -o subvol=/snapshots/home/2015-12-01 /dev/<device> /<mountpoint>

And then mount ecryptfs:

mount.ecryptfs /<lower dir> /<mountpoint>


RE: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Paul Jones
> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Marat Khalili
> Sent: Friday, 15 September 2017 7:50 PM
> To: Hugo Mills ; Goffredo Baroncelli
> ; linux-btrfs 
> Subject: Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and
> wrong data
> 
> May I state my user's point of view:
> 
> I know one application that uses O_DIRECT, and it is subtly broken on BTRFS.
> I know no applications that use O_DIRECT and are not broken.
> (Really more statistics would help here, probably some exist that provably
> work.) According to developers making O_DIRECT work on BTRFS is difficult if
> not impossible. Isn't it time to disable O_DIRECT like ZFS does AFAIU? Data
> safety is certainly more important than performance gain it may or may not
> give some applications.

I agree - I've had trouble with this before because I didn't understand the 
consequences of using it.
It would be better to hide it behind a mount option or similar IMHO.

Paul.





Re: snapshots of encrypted directories?

2017-09-15 Thread Ulli Horlacher
On Fri 2017-09-15 (06:45), Andrei Borzenkov wrote:

> The actual question is - do you need to mount each individual btrfs
> subvolume when using encfs? 

It goes even worse with ecryptfs: I do not know at all how to mount a
snapshot so that the user has access to it.

It seems snapshots are incompatible with encrypted filesystems :-(


-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:


Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Marat Khalili

May I state my user's point of view:

I know one application that uses O_DIRECT, and it is subtly broken on
BTRFS. I know no applications that use O_DIRECT and are not broken.
(Really, more statistics would help here; probably some exist that
provably work.) According to developers, making O_DIRECT work on BTRFS is
difficult if not impossible. Isn't it time to disable O_DIRECT like ZFS
does, AFAIU? Data safety is certainly more important than the performance
gain it may or may not give some applications.


--

With Best Regards,
Marat Khalili



Re: [PATCH] Btrfs: incremental send, apply asynchronous page cache readahead

2017-09-15 Thread peterh

I much appreciate your suggestion. I've modified the patch based on your
advice and sent out a new patch with the new subject "Btrfs: send, apply
asynchronous page cache readahead to enhance page read".

On 2017-09-13 18:45, Filipe Manana wrote:

On Wed, Sep 13, 2017 at 7:38 AM, peterh  wrote:

From: Kuanling Huang 

By analyzing the perf on btrfs send, we found it takes a large
amount of cpu time in page_cache_sync_readahead. This effort
can be reduced after switching to the asynchronous one. Overall
performance gain on HDD and SSD were 9 and 15 respectively if
simply sending a large file.


Besides what was pointed before, about saying what those 9 and 15 are,
the subject mentions incremental send, but there's nothing here that
is specific to incremental sends, as it applies to full send
operations as well, so please also change the subject.


Signed-off-by: Kuanling Huang 
---
 fs/btrfs/send.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 63a6152..ac67ff6 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -4475,16 +4475,27 @@ static ssize_t fill_read_buf(struct send_ctx 
*sctx, u64 offset, u32 len)

/* initial readahead */
memset(&sctx->ra, 0, sizeof(struct file_ra_state));
file_ra_state_init(&sctx->ra, inode->i_mapping);
-   btrfs_force_ra(inode->i_mapping, &sctx->ra, NULL, index,
-  last_index - index + 1);

while (index <= last_index) {
unsigned cur_len = min_t(unsigned, len,
 PAGE_CACHE_SIZE - pg_offset);
-   page = find_or_create_page(inode->i_mapping, index, 
GFP_NOFS);


You based this patch on an old code base. Currently it is GFP_KERNEL
and not GFP_NOFS anymore.
Please update the patch.


+   page = find_lock_page(inode->i_mapping, index);
if (!page) {
-   ret = -ENOMEM;
-   break;
+   page_cache_sync_readahead(inode->i_mapping,
+   &sctx->ra, NULL, index,
+   last_index + 1 - index);
+
+   page = find_or_create_page(inode->i_mapping, 
index, GFP_NOFS);

+   if (unlikely(!page)) {


Please avoid polluting the code with unlikely/likely macros (unless
there's really a significant performance win, which isn't the case
here I bet).



+   ret = -ENOMEM;
+   break;
+   }
+   }
+
+   if (PageReadahead(page)) {
+   page_cache_async_readahead(inode->i_mapping,
+   &sctx->ra, NULL, page, index,
+   last_index + 1 - index);
}

if (!PageUptodate(page)) {
--
1.9.1





[PATCH] Btrfs: send, apply asynchronous page cache readahead to enhance page read

2017-09-15 Thread peterh
From: Kuanling Huang 

By analyzing the perf on btrfs send, we found it takes a large
amount of cpu time in page_cache_sync_readahead. This effort
can be reduced after switching to the asynchronous one. Overall
performance gains on HDD and SSD were 9 and 15 percent
respectively when simply sending a large file.

Signed-off-by: Kuanling Huang 
---
 fs/btrfs/send.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 63a6152..7a5eb66 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -4475,16 +4475,27 @@ static ssize_t fill_read_buf(struct send_ctx *sctx, u64 
offset, u32 len)
/* initial readahead */
memset(&sctx->ra, 0, sizeof(struct file_ra_state));
file_ra_state_init(&sctx->ra, inode->i_mapping);
-   btrfs_force_ra(inode->i_mapping, &sctx->ra, NULL, index,
-  last_index - index + 1);
 
while (index <= last_index) {
unsigned cur_len = min_t(unsigned, len,
 PAGE_CACHE_SIZE - pg_offset);
-   page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+   page = find_lock_page(inode->i_mapping, index);
if (!page) {
-   ret = -ENOMEM;
-   break;
+   page_cache_sync_readahead(inode->i_mapping,
+   &sctx->ra, NULL, index,
+   last_index + 1 - index);
+
+   page = find_or_create_page(inode->i_mapping, index, 
GFP_KERNEL);
+   if (!page) {
+   ret = -ENOMEM;
+   break;
+   }
+   }
+
+   if (PageReadahead(page)) {
+   page_cache_async_readahead(inode->i_mapping,
+   &sctx->ra, NULL, page, index,
+   last_index + 1 - index);
}
 
if (!PageUptodate(page)) {
-- 
1.9.1



Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Hugo Mills
On Fri, Sep 15, 2017 at 08:04:35AM +0200, Goffredo Baroncelli wrote:
> On 09/15/2017 12:18 AM, Hugo Mills wrote:
> >As far as I know, both of these are basically known issues, with no
> > good solution, other than not using O_DIRECT. Certainly the first
> > issue is one I recognise. The second isn't one I recognise directly,
> > but is unsurprising to me.
> > 
> >There have been discussions -- including developers -- on this list
> > as recent as a month or so ago. The general outcome seems to be that
> > any problems with O_DIRECT are not going to be fixed.
> 
> I missed this thread; could you point it to me ?

   No, you didn't miss it -- you were part of it. :)

   http://www.spinics.net/lists/linux-btrfs/msg68244.html

   Hugo.

> If csum and O_DIRECT are not reliable together, why not disallow one of them: i.e.
> allow O_DIRECT only on nodatasum files... ZFS (on linux) does not support
> O_DIRECT at all...
> 
> In fact most of the applications which benefit from O_DIRECT (VMs and DBs
> come to mind) are the ones which also need nodatasum to have good performance.
> 
> One of the strongest points of BTRFS was the checksums; but these are not
> effective when the file is opened with O_DIRECT; worse, there are cases where
> the file is corrupted and the application gets -EIO; not to mention that
> dmesg is filled with "csum failed ..."
> 
> 
> > 
> >Hugo.
> > 
> > On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
> >> Hi all,
> >>
> >> I discovered two bugs when O_DIRECT is used...
> >>
> >> 1) a corrupted file doesn't return -EIO when O_DIRECT is used
> >>
> >> Normally BTRFS prevents to access the contents of a corrupted file; 
> >> however I was able read the content of a corrupted file simply using 
> >> O_DIRECT
> >>
> >> # in a new btrfs filesystem, create a file
> >> $ sudo mkfs.btrfs -f /dev/sdd5
> >> $ mount /dev/sdd5 t
> >> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd 
> >> bs=$((16*1024)) iflag=fullblock count=1024
> >>
> >> # corrupt the file
> >> $ sudo filefrag -v t/abcd 
> >> Filesystem type is: 9123683e
> >> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
> >>  ext: logical_offset:physical_offset: length:   expected: 
> >> flags:
> >>0:0..3475:  70656.. 74131:   3476:
> >>1: 3476..4095:  74212.. 74831:620:  74132: 
> >> last,eof
> >> t/abcd: 2 extents found
> >> $ sudo umount t
> >> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
> >> /dev/sdd5
> >> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
> >> corrupting 289406976 copy 1
> >>
> >> # try to access the file; expected result: -EIO
> >> $ sudo mount /dev/sdd5 t
> >> $ dd if=t/abcd | hexdump -c | head
> >> dd: error reading 't/abcd': Input/output error
> >> 0+0 records in
> >> 0+0 records out
> >> 0 bytes copied, 0.000477413 s, 0.0 kB/s
> >>
> >>
> >> # try to access the file using O_DIRECT; expected result: -EIO; instead 
> >> the file is accessible
> >> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
> >> 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
> >> *
> >> 0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> >> 0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> >> 0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> >> 0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> >> 0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> >> 0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> >> 0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> >> 0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> >>
> >> (dmesg reports the checksum mismatch)
> >> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 
> >> 0 csum 0x98f94189 expected csum 0x0ab6be80 mirror 1
> >>
> >> Note the first 4k is filled with 0x01!
> >>
> >> Conclusion: even though the file is corrupted and BTRFS normally prevents 
> >> access to it, when O_DIRECT is used:
> >> a) no error is returned to the caller
> >> b) instead of the page stored on disk, a page filled with 0x01 is returned 
> >> (consistent with the function __readpage_endio_check())
> >>
> >>
> >> 2) The second bug is more severe. If, while a buffer is being written 
> >> with O_DIRECT, the buffer is updated at the same time by a second 
> >> process, the checksum may be incorrect.
> >>
> >> At the end of the email there is code which shows the problem: two 
> >> processes share the same memory; the first writes it to disk, the second 
> >> updates the buffer continuously. A third process tries to read the file, 
> >> but from time to time it gets -EIO
> >>
> >> If you run my code on a btrfs filesystem you get a lot of 
> >>
> >> ERROR: read thread; r = 8192, expected = 16384
> >> ERROR: read thread; r = 8192, expected = 16384
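
For the first bug, a minimal userspace sketch (illustrative only, not code 
from this thread; the default path and the 4 KiB block size are assumptions) 
that reads a file with O_DIRECT and flags blocks coming back as all 0x01, the 
substitution pattern described in the report:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        /* default to the file used in the transcript above */
        const char *path = argc > 1 ? argv[1] : "t/abcd";
        char *blk;
        off_t off;
        ssize_t i, r;
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0 || posix_memalign((void **)&blk, 4096, 4096))
                return 1;

        for (off = 0; (r = pread(fd, blk, 4096, off)) > 0; off += 4096) {
                /* count leading 0x01 bytes; a full block of them is the
                 * placeholder returned instead of -EIO */
                for (i = 0; i < r && blk[i] == 0x01; i++)
                        ;
                if (i == r)
                        printf("all-0x01 block at offset %lld\n",
                               (long long)off);
        }
        close(fd);
        return 0;
}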

Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Goffredo Baroncelli
On 09/15/2017 12:18 AM, Hugo Mills wrote:
>As far as I know, both of these are basically known issues, with no
> good solution, other than not using O_DIRECT. Certainly the first
> issue is one I recognise. The second isn't one I recognise directly,
> but is unsurprising to me.
> 
>There have been discussions -- including developers -- on this list
> as recent as a month or so ago. The general outcome seems to be that
> any problems with O_DIRECT are not going to be fixed.

I missed this thread; could you point me to it?


If csum and O_DIRECT are not reliable, why not disallow one of them, i.e. 
allow O_DIRECT only on nodatasum files? ZFS (on Linux) does not support 
O_DIRECT at all...

In fact, most of the applications which benefit from O_DIRECT (VMs and 
databases come to mind) are the ones which also need nodatasum to get good 
performance.

One of the strongest points of BTRFS is its checksums; but these are not 
effective when the file is opened with O_DIRECT. Worse, there are cases where 
the file is corrupted and the application gets -EIO, not to mention that 
dmesg is filled with "csum failed ..." messages.

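As a sketch of what that restriction could look like from the application 
side (not an existing helper; the function name is hypothetical): setting the 
NOCOW file flag -- the ioctl behind "chattr +C" -- on a still-empty file also 
disables data checksums (nodatasum) on btrfs, after which O_DIRECT no longer 
fights the csum machinery:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/* create 'path' as NOCOW (and therefore nodatasum on btrfs),
 * then reopen it for direct IO */
int open_nodatasum_direct(const char *path)
{
        int flags = 0;
        int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0644);

        if (fd < 0)
                return -1;
        /* FS_NOCOW_FL only takes effect while the file is empty */
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
                flags |= FS_NOCOW_FL;
                ioctl(fd, FS_IOC_SETFLAGS, &flags);
        }
        close(fd);

        return open(path, O_RDWR | O_DIRECT);
}

As with chattr +C, existing data cannot be converted in place; the flag has 
to be applied before anything is written to the file.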

> 
>Hugo.
> 
> On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
>> Hi all,
>>
>> I discovered two bugs when O_DIRECT is used...
>>
>> 1) reading a corrupted file doesn't return -EIO when O_DIRECT is used
>>
>> Normally BTRFS prevents access to the contents of a corrupted file; however, 
>> I was able to read the content of a corrupted file simply by using O_DIRECT
>>
>> # in a new btrfs filesystem, create a file
>> $ sudo mkfs.btrfs -f /dev/sdd5
>> $ mount /dev/sdd5 t
>> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd 
>> bs=$((16*1024)) iflag=fullblock count=1024
>>
>> # corrupt the file
>> $ sudo filefrag -v t/abcd 
>> Filesystem type is: 9123683e
>> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
>>  ext: logical_offset:physical_offset: length:   expected: flags:
>>0:0..3475:  70656.. 74131:   3476:
>>1: 3476..4095:  74212.. 74831:620:  74132: 
>> last,eof
>> t/abcd: 2 extents found
>> $ sudo umount t
>> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
>> /dev/sdd5
>> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
>> corrupting 289406976 copy 1
>>
>> # try to access the file; expected result: -EIO
>> $ sudo mount /dev/sdd5 t
>> $ dd if=t/abcd | hexdump -c | head
>> dd: error reading 't/abcd': Input/output error
>> 0+0 records in
>> 0+0 records out
>> 0 bytes copied, 0.000477413 s, 0.0 kB/s
>>
>>
>> # try to access the file using O_DIRECT; expected result: -EIO; instead the 
>> file is accessible
>> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
>> 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
>> *
>> 0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
>> 0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
>> 0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
>> 0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
>> 0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
>> 0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
>> 0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
>> 0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
>>
>> (dmesg reports the checksum mismatch)
>> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 
>> csum 0x98f94189 expected csum 0x0ab6be80 mirror 1
>>
>> Note the first 4k is filled with 0x01!
>>
>> Conclusion: even though the file is corrupted and BTRFS normally prevents 
>> access to it, when O_DIRECT is used:
>> a) no error is returned to the caller
>> b) instead of the page stored on disk, a page filled with 0x01 is returned 
>> (consistent with the function __readpage_endio_check())
>>
>>
>> 2) The second bug is more severe. If, while a buffer is being written with 
>> O_DIRECT, the buffer is updated at the same time by a second process, the 
>> checksum may be incorrect.
>>
>> At the end of the email there is code which shows the problem: two processes 
>> share the same memory; the first writes it to disk, the second updates the 
>> buffer continuously. A third process tries to read the file, but from time 
>> to time it gets -EIO
>>
>> If you run my code on a btrfs filesystem you get a lot of 
>>
>> ERROR: read thread; r = 8192, expected = 16384
>> ERROR: read thread; r = 8192, expected = 16384
>> ERROR: read thread; e = 5 - Input/output error
>> ERROR: read thread; e = 5 - Input/output error
>>
>> The first lines are related to a short read (which may happen). The last 
>> lines are related to a checksum mismatch. dmesg is filled with lines like
>> [...]
>> [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off 
>> 4096 csum 0x0683c6df expected csum 0x55eb85e6
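
The test program referenced above is cut off in the archive. Below is a 
minimal sketch of the scenario it describes, not the original code: it uses 
two threads sharing one buffer instead of cooperating processes, and the 
file name and sizes are illustrative. One thread writes the buffer with 
O_DIRECT while the other keeps scribbling over it, so the csum btrfs stores 
can disagree with the data that actually hit the disk:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ 16384

static char *buf;               /* shared, deliberately racy */
static volatile int stop;

static void *scribbler(void *arg)
{
        (void)arg;
        while (!stop)           /* keep dirtying the buffer under the writer */
                memset(buf, rand() & 0xff, BUFSZ);
        return NULL;
}

int main(void)
{
        pthread_t t;
        int fd, i;

        if (posix_memalign((void **)&buf, 4096, BUFSZ))
                return 1;
        memset(buf, 'x', BUFSZ);

        fd = open("racy.dat", O_CREAT | O_TRUNC | O_WRONLY | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        pthread_create(&t, NULL, scribbler, NULL);
        for (i = 0; i < 1000; i++)
                if (pwrite(fd, buf, BUFSZ, 0) != BUFSZ)
                        perror("pwrite");
        stop = 1;
        pthread_join(t, NULL);
        close(fd);
        return 0;
}

A buffered read of "racy.dat" afterwards (e.g. plain dd without 
iflag=direct) should then hit -EIO and produce "csum failed" lines in dmesg 
like the ones quoted above.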