Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl

2017-05-13 Thread Darrick J. Wong
On Sat, May 13, 2017 at 07:41:24PM -0600, Andreas Dilger wrote:
> On May 10, 2017, at 11:10 PM, Eric Biggers  wrote:
> > 
> > On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
> >> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
> >> 
> >> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
> >>> Theodore Ts'o  writes:
> >>> 
>  On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> > 1.) Privacy implications.  Say the filesystem is being shared between multiple
> >users, and one user unpacks foo.tar.gz into their home directory, which
> >they've set to mode 700 to hide from other users.  Because of this new
> >ioctl, all users will be able to see every (inode number, size in blocks)
> >pair that was added to the filesystem, as well as the exact layout of the
> >physical block allocations which might hint at how the files were created.
> >If there is a known "fingerprint" for the unpacked foo.tar.gz in this
> >regard, its presence on the filesystem will be revealed to all users.  And
> >if any filesystems happen to prefer allocating blocks near the containing
> >directory, the directory the files are in would likely be revealed too.
> >> 
> >> Frankly, why are container users even allowed to make unrestricted ioctl
> >> calls?  I thought we had a bunch of security infrastructure to constrain
> >> what userspace can do to a system, so why don't ioctls fall under these
> >> same protections?  If your containers are really that adversarial, you
> >> ought to be blacklisting as much as you can.
> >> 
> > 
> > Personally I don't find the presence of sandboxing features to be a very good
> > excuse for introducing random insecure ioctls.  Not everyone has everything
> > perfectly "sandboxed" all the time, for obvious reasons.  It's easy to forget
> > about the filesystem ioctls, too, since they can be executed on any regular
> > file, without having to open some device node in /dev.
> > 
> > (And this actually does happen; the SELinux policy in Android, for example,
> > still allows apps to call any ioctl on their data files, despite all the effort
> > that has gone into whitelisting other types of ioctls.  Which should be fixed,
> > of course, but it shows that this kind of mistake is very easy to make.)
> > 
>  Unix/Linux has historically not been terribly concerned about trying
>  to protect this kind of privacy between users.  So for example, in
>  order to do this, you would have to call GETFSMAP continuously to track
>  this sort of thing.  Someone who wanted to do this could probably get
>  this information (and much, much more) by continuously running "ps" to
>  see what processes are running.
>  
>  (I will note, wryly, that in the bad old days, when dozens of users
>  were sharing a one MIPS Vax/780, it was considered a *good* thing
>  that social pressure could be applied when it was found that someone
>  was running a CPU or memory hogger on a time sharing system.  The
>  privacy right of someone running "xtrek" to be able to hide this from
>  other users on the system was never considered important at all.  :-)
> >> 
> >> Not to mention someone running GETFSMAP in a loop will be pretty obvious
> >> both from the high kernel cpu usage and the huge number of metadata
> >> operations.
> > 
> > Well, only if that someone running GETFSMAP actually wants to watch things in
> > real-time (it's not necessary for all scenarios that have been mentioned), *and*
> > there is monitoring in place which actually detects it and can do something
> > about it.
> > 
> > Yes, PIDs have traditionally been global, but today we have PID namespaces, and
> > many other isolation features such as mount namespaces.  Nothing is perfect, of
> > course, and containers are a lot worse than VMs, but it seems weird to use that
> > as an excuse to knowingly make things worse...
> > 
> >> 
>  Fortunately, the days of timesharing seem to be well behind us.  For
>  those people who think that containers are as secure as VM's (hah,
>  hah, hah), it might be that the best way to handle this is to have a mount
>  option that requires root access to this functionality.  For those
>  people who really care about this, they can disable access.
> >> 
> >> Or use separate filesystems for each container so that exploitable bugs
> >> that shut down the filesystem can't be used to kill the other
> >> containers.  You could use a torrent of metadata-heavy operations
> >> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
> >> the other containers.
> >> 
> >>> What would be the reason for not putting this behind
> >>> capable(CAP_SYS_ADMIN)?
> >>> 
> 

Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl

2017-05-13 Thread Andreas Dilger
On May 10, 2017, at 11:10 PM, Eric Biggers  wrote:
> 
> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
>> 
>> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
>>> Theodore Ts'o  writes:
>>> 
 On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> 1.) Privacy implications.  Say the filesystem is being shared between multiple
>users, and one user unpacks foo.tar.gz into their home directory, which
>they've set to mode 700 to hide from other users.  Because of this new
>ioctl, all users will be able to see every (inode number, size in blocks)
>pair that was added to the filesystem, as well as the exact layout of the
>physical block allocations which might hint at how the files were created.
>If there is a known "fingerprint" for the unpacked foo.tar.gz in this
>regard, its presence on the filesystem will be revealed to all users.  And
>if any filesystems happen to prefer allocating blocks near the containing
>directory, the directory the files are in would likely be revealed too.
>> 
>> Frankly, why are container users even allowed to make unrestricted ioctl
>> calls?  I thought we had a bunch of security infrastructure to constrain
>> what userspace can do to a system, so why don't ioctls fall under these
>> same protections?  If your containers are really that adversarial, you
>> ought to be blacklisting as much as you can.
>> 
> 
> Personally I don't find the presence of sandboxing features to be a very good
> excuse for introducing random insecure ioctls.  Not everyone has everything
> perfectly "sandboxed" all the time, for obvious reasons.  It's easy to forget
> about the filesystem ioctls, too, since they can be executed on any regular
> file, without having to open some device node in /dev.
> 
> (And this actually does happen; the SELinux policy in Android, for example,
> still allows apps to call any ioctl on their data files, despite all the effort
> that has gone into whitelisting other types of ioctls.  Which should be fixed,
> of course, but it shows that this kind of mistake is very easy to make.)
> 
 Unix/Linux has historically not been terribly concerned about trying
 to protect this kind of privacy between users.  So for example, in
 order to do this, you would have to call GETFSMAP continuously to track
 this sort of thing.  Someone who wanted to do this could probably get
 this information (and much, much more) by continuously running "ps" to
 see what processes are running.
 
 (I will note, wryly, that in the bad old days, when dozens of users
 were sharing a one MIPS Vax/780, it was considered a *good* thing
 that social pressure could be applied when it was found that someone
 was running a CPU or memory hogger on a time sharing system.  The
 privacy right of someone running "xtrek" to be able to hide this from
 other users on the system was never considered important at all.  :-)
>> 
>> Not to mention someone running GETFSMAP in a loop will be pretty obvious
>> both from the high kernel cpu usage and the huge number of metadata
>> operations.
> 
> Well, only if that someone running GETFSMAP actually wants to watch things in
> real-time (it's not necessary for all scenarios that have been mentioned), *and*
> there is monitoring in place which actually detects it and can do something
> about it.
> 
> Yes, PIDs have traditionally been global, but today we have PID namespaces, and
> many other isolation features such as mount namespaces.  Nothing is perfect, of
> course, and containers are a lot worse than VMs, but it seems weird to use that
> as an excuse to knowingly make things worse...
> 
>> 
 Fortunately, the days of timesharing seem to be well behind us.  For
 those people who think that containers are as secure as VM's (hah,
 hah, hah), it might be that the best way to handle this is to have a mount
 option that requires root access to this functionality.  For those
 people who really care about this, they can disable access.
>> 
>> Or use separate filesystems for each container so that exploitable bugs
>> that shut down the filesystem can't be used to kill the other
>> containers.  You could use a torrent of metadata-heavy operations
>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
>> the other containers.
>> 
>>> What would be the reason for not putting this behind
>>> capable(CAP_SYS_ADMIN)?
>>> 
>>> What possible legitimate function could this functionality serve to
>>> users who don't own your filesystem?
>> 
>> As I've said before, it's to enable dedupe tools to decide, given a set
>> of files with shareable blocks, roughly how many other times each of
>> those shareable blocks are shared 
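
For readers who have not seen the interface under discussion, here is a minimal
sketch, written for this digest rather than taken from the thread, of how a
dedupe or space-reporting tool might walk the reverse-mapping records with
FS_IOC_GETFSMAP. Structure, helper, and flag names follow <linux/fsmap.h> and
the proposed ioctl_getfsmap.2 man page, so treat the details as approximate:

/*
 * Illustrative only -- not from this thread.  Walk every reverse-mapping
 * record on the filesystem containing `path' and print it, marking records
 * the kernel flags as shared (FMR_OF_SHARED).
 */
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/sysmacros.h>
#include <unistd.h>
#include <linux/fsmap.h>

#define NR_RECS 128

int main(int argc, char *argv[])
{
	struct fsmap_head *head;
	unsigned int i;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s path\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror(argv[1]);
		return 1;
	}

	head = calloc(1, fsmap_sizeof(NR_RECS));
	if (!head)
		return 1;
	head->fmh_count = NR_RECS;
	/* Low key is all zeroes (from calloc); high key covers everything. */
	head->fmh_keys[1].fmr_device = UINT_MAX;
	head->fmh_keys[1].fmr_flags = UINT_MAX;
	head->fmh_keys[1].fmr_physical = ULLONG_MAX;
	head->fmh_keys[1].fmr_owner = ULLONG_MAX;
	head->fmh_keys[1].fmr_offset = ULLONG_MAX;

	for (;;) {
		if (ioctl(fd, FS_IOC_GETFSMAP, head) < 0) {
			perror("FS_IOC_GETFSMAP");
			break;
		}
		if (head->fmh_entries == 0)
			break;
		for (i = 0; i < head->fmh_entries; i++) {
			struct fsmap *r = &head->fmh_recs[i];

			printf("dev %u:%u phys %llu len %llu owner %llu%s\n",
			       major(r->fmr_device), minor(r->fmr_device),
			       (unsigned long long)r->fmr_physical,
			       (unsigned long long)r->fmr_length,
			       (unsigned long long)r->fmr_owner,
			       (r->fmr_flags & FMR_OF_SHARED) ? " (shared)" : "");
		}
		/* Stop after the record flagged as the last one ... */
		if (head->fmh_recs[head->fmh_entries - 1].fmr_flags & FMR_OF_LAST)
			break;
		/* ... otherwise continue the query where this batch ended. */
		fsmap_advance(head);
	}
	free(head);
	close(fd);
	return 0;
}

Roughly speaking, counting how many owner records come back for the same
physical range (and checking FMR_OF_SHARED) is what would give such a tool its
sharing estimate.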

balancing every night broke balancing so now I can't balance anymore?

2017-05-13 Thread Marc MERLIN
Kernel 4.11, btrfs-progs v4.7.3

I run scrub and balance every night, been doing this for 1.5 years on this
filesystem.
But it has just started failing:
saruman:~# btrfs balance start -musage=0  /mnt/btrfs_pool1
Done, had to relocate 0 out of 235 chunks
saruman:~# btrfs balance start -dusage=0  /mnt/btrfs_pool1
Done, had to relocate 0 out of 235 chunks

saruman:~# btrfs balance start -musage=1  /mnt/btrfs_pool1
ERROR: error during balancing '/mnt/btrfs_pool1': No space left on device
saruman:~# btrfs balance start -dusage=10  /mnt/btrfs_pool1
Done, had to relocate 0 out of 235 chunks
saruman:~# btrfs balance start -dusage=20  /mnt/btrfs_pool1
ERROR: error during balancing '/mnt/btrfs_pool1': No space left on device
There may be more info in syslog - try dmesg | tail

BTRFS info (device dm-2): 1 enospc errors during balance
BTRFS info (device dm-2): relocating block group 598566305792 flags data
BTRFS info (device dm-2): 1 enospc errors during balance
BTRFS info (device dm-2): 1 enospc errors during balance
BTRFS info (device dm-2): relocating block group 598566305792 flags data
BTRFS info (device dm-2): 1 enospc errors during balance

saruman:~# btrfs fi show /mnt/btrfs_pool1/
Label: 'btrfs_pool1'  uuid: bc115001-a8d1-445c-9ec9-6050620efd0a
    Total devices 1 FS bytes used 169.73GiB
    devid    1 size 228.67GiB used 228.67GiB path /dev/mapper/pool1

saruman:~# btrfs fi usage /mnt/btrfs_pool1/
Overall:
    Device size:                 228.67GiB
    Device allocated:            228.67GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Used:                        171.25GiB
    Free (estimated):             55.32GiB      (min: 55.32GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:221.60GiB, Used:166.28GiB
   /dev/mapper/pool1     221.60GiB

Metadata,single: Size:7.03GiB, Used:4.96GiB
   /dev/mapper/pool1       7.03GiB

System,single: Size:32.00MiB, Used:48.00KiB
   /dev/mapper/pool1      32.00MiB

Unallocated:
   /dev/mapper/pool1       1.00MiB


How did I get into such a misbalanced state when I balance every night?

My filesystem is not full, I can write just fine, but I sure cannot
rebalance now.

Besides adding another device to add space, is there a way around this
and more generally not getting into that state anymore considering that
I already rebalance every night?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


Re: Creating btrfs RAID on LUKS devs makes devices disappear

2017-05-13 Thread Andrei Borzenkov
13.05.2017 18:28, Ochi wrote:
> Hello,
> 
> okay, I think I now have a repro that is stupidly simple, I'm not even
> sure if I overlook something here. No multi-device btrfs involved, but
> notably it does happen with btrfs, but not with e.g. ext4.
> 

I could not reproduce it with a single device, but I finally was able to
reliably reproduce your problem with multiple devices. It looks a bit
different; I think the difference is due to the encrypted root in your case. Anyway

https://github.com/systemd/systemd/issues/5955



Re: [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list

2017-05-13 Thread Alex Lyakas
Hi Liu,

On Wed, Mar 22, 2017 at 1:40 AM, Liu Bo  wrote:
> On Sun, Mar 19, 2017 at 07:18:59PM +0200, Alex Lyakas wrote:
>> We have a commit_root_sem, which is a read-write semaphore that protects the
>> commit roots.
>> But it is also used to protect the list of caching block groups.
>>
>> As a result, while doing "slow" caching, the following issue is seen:
>>
>> Some of the caching threads are scanning the extent tree with
>> commit_root_sem
>> acquired in shared mode, with stack like:
>> [] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
>> [] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0 [btrfs]
>> [] read_tree_block+0x40/0x70 [btrfs]
>> [] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
>> [] btrfs_search_slot+0x3c6/0xb10 [btrfs]
>> [] caching_thread+0x1b9/0x820 [btrfs]
>> [] normal_work_helper+0xc6/0x340 [btrfs]
>> [] btrfs_cache_helper+0x12/0x20 [btrfs]
>>
>> IO requests that want to allocate space are waiting in cache_block_group()
>> to acquire the commit_root_sem in exclusive mode. But they only want to add
>> the caching control structure to the list of caching block-groups:
>> [] schedule+0x29/0x70
>> [] rwsem_down_write_failed+0x145/0x320
>> [] call_rwsem_down_write_failed+0x13/0x20
>> [] cache_block_group+0x25b/0x450 [btrfs]
>> [] find_free_extent+0xd16/0xdb0 [btrfs]
>> [] btrfs_reserve_extent+0xaf/0x160 [btrfs]
>>
>> Other caching threads want to continue their scanning, and for that they
>> are waiting to acquire commit_root_sem in shared mode. But since there are
>> IO threads that want the exclusive lock, the caching threads are unable
>> to continue the scanning, because (I presume) rw_semaphore guarantees some
>> fairness:
>> [] schedule+0x29/0x70
>> [] rwsem_down_read_failed+0xc5/0x120
>> [] call_rwsem_down_read_failed+0x14/0x30
>> [] caching_thread+0x1a1/0x820 [btrfs]
>> [] normal_work_helper+0xc6/0x340 [btrfs]
>> [] btrfs_cache_helper+0x12/0x20 [btrfs]
>> [] process_one_work+0x146/0x410
>>
>> This causes slowness of the IO, especially when there are many block groups
>> that need to be scanned for free space. In some cases it takes minutes
>> until a single IO thread is able to allocate free space.
>>
>> I don't see a deadlock here, because the caching threads that were able to
>> acquire the commit_root_sem will call rwsem_is_contended() and should give
>> up the semaphore, so that IO threads are able to acquire it in exclusive mode.
>>
>> However, introducing a separate mutex that protects only the list of caching
>> block groups makes things move forward much faster.
>>
>
> The problem did exist and the patch looks good to me.
>
>> This patch is based on kernel 3.18.
>> Unfortunately, I am not able to submit a patch based on one of the latest
>> kernels, because here btrfs is part of the larger system, and upgrading
>> the kernel is a significant effort.
>> Hence marking the patch as RFC.
>> Hopefully, this patch still has some value to the community.
>>
>> Signed-off-by: Alex Lyakas 
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 42d11e7..74feacb 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
>> struct list_head trans_list;
>> struct list_head dead_roots;
>> struct list_head caching_block_groups;
>> +/* protects the above list */
>> +struct mutex caching_block_groups_mutex;
>>
>> spinlock_t delayed_iput_lock;
>> struct list_head delayed_iputs;
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 5177954..130ec58 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
>> INIT_LIST_HEAD(&fs_info->delayed_iputs);
>> INIT_LIST_HEAD(&fs_info->delalloc_roots);
>> INIT_LIST_HEAD(&fs_info->caching_block_groups);
>> +mutex_init(&fs_info->caching_block_groups_mutex);
>> spin_lock_init(&fs_info->delalloc_root_lock);
>> spin_lock_init(&fs_info->trans_lock);
>> spin_lock_init(&fs_info->fs_roots_radix_lock);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index a067065..906fb08 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -637,10 +637,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
>> return 0;
>> }
>>
>> -down_write(&fs_info->commit_root_sem);
>> +mutex_lock(&fs_info->caching_block_groups_mutex);
>> atomic_inc(&caching_ctl->count);
>> list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
>> -up_write(&fs_info->commit_root_sem);
>> +mutex_unlock(&fs_info->caching_block_groups_mutex);
>>
>> btrfs_get_block_group(cache);
>>
>> @@ -5693,6 +5693,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
>>
>> down_write(&fs_info->commit_root_sem);
>>
>
> With the new mutex, it's not necessary to take commit_root_sem here
> because a) pinned_extents could not be modified outside of a
> transaction, and b) while at 

Re: Creating btrfs RAID on LUKS devs makes devices disappear

2017-05-13 Thread Ochi

Hello,

okay, I think I now have a repro that is stupidly simple, I'm not even 
sure if I overlook something here. No multi-device btrfs involved, but 
notably it does happen with btrfs, but not with e.g. ext4.


[Sidenote: At first I thought it had to do with systemd-cryptsetup 
opening multiple devices with the same key that makes a difference. 
Rationale: I think the whole systemd machinery for opening crypt devices 
is able to try the same password on multiple devices when manual 
keyphrase input is used, and I thought maybe the same is true for 
keyfiles which may cause race conditions, but after all it doesn't seem 
to matter much. Also it seemed to relate to multi-device btrfs volumes, 
but now it appears to be simpler than that. That said, I can't be sure 
whether there are more problems hidden when actually using RAID.]


I have tried to repro the issue on a completely fresh Arch Linux in a 
VirtualBox VM. No custom systemd magic involved whatsoever, all stock 
services, generators, etc. In addition to the root volume (no crypto), 
there is another virtual HDD with one partition. This is a LUKS 
partition with a keyfile added to open it automatically on boot. I added 
a corresponding /etc/crypttab line as follows:


storage0    /dev/sdb1    /etc/crypto/keyfile

Let's suppose we open the crypt device manually the first time and 
perform mkfs.btrfs on the /dev/mapper/storage0 device. Reboot the system 
such that systemd-cryptsetup can do its magic to open the dm device.


After reboot, log in. /dev/mapper/storage0 should be there, and of 
course the corresponding /dev/dm-*. Perform another mkfs.btrfs on 
/dev/mapper/storage0. What I observe is (possibly try multiple times, 
but it has been pretty reliable in my testing):


- /dev/mapper/storage0 and the /dev/dm-* device are gone.

- A process systemd-cryptsetup is using 100% CPU (haven't noticed 
before, but now on my laptop I can actually hear it)


- The dm-device was eliminated by systemd, see the logs below.

- Logging out and in again (as root in my case) solves the issue, the 
device is back.


I have prepared outputs of journalctl and udevadm info --export-db 
produced after the last step (logging out and back in). Since the logs 
are quite large, I link them here, I hope that is okay:


https://pastebin.com/1r6j1Par
https://pastebin.com/vXLGFQ0Z

In the journal, the interesting spots are after the two "ROOT LOGIN ON 
tty1". A few seconds after the first one, I performed the mkfs.


Notably, it doesn't seem to happen when using e.g. ext4 instead of 
btrfs. Also, it doesn't happen when opening the crypt device manually, 
without crypttab and thus without systemd-cryptsetup, 
systemd-cryptsetup-generator, etc. which parses crypttab.


So after all, I suspect the systemd-cryptsetup to be the culprit in 
combination with btrfs volumes. Maybe someone can repro that.


Versions used in the VM:
- Current Arch Linux
- Kernel 4.10.13
- btrfs-progs 4.10.2
- systemd v232 (also tested v233 from testing repo with same results)

Hope this helps
Sebastian


Re: [PATCH v3 00/19] Btrfs-progs offline scrub

2017-05-13 Thread Lakshmipathi.G
>
> Ping?
>
> Any comments?
>
> Thanks,
> Qu

Can I inject corruption with the existing script [1] and expect offline
scrub to fix it? If so, I'll give it a try and let you know the results.

[1] https://patchwork.kernel.org/patch/9583455/


Cheers,
Lakshmipathi.G


[PATCH v2] btrfs-progs: btrfs-convert: Add larger device support

2017-05-13 Thread Lakshmipathi.G
With a larger file system (in this case 22TB), opening it with ext2fs_open()
and then calling ext2fs_read_block_bitmap() fails with the
EXT2_ET_CANT_USE_LEGACY_BITMAPS error.

To overcome this issue, we need to pass the EXT2_FLAG_64BITS flag to
ext2fs_open() and also use the 64-bit functions
ext2fs_get_block_bitmap_range2(), ext2fs_inode_data_blocks2(), and
ext2fs_read_ext_attr2().

bug: https://bugzilla.kernel.org/show_bug.cgi?id=194795

Signed-off-by: Lakshmipathi.G 
---
 convert/source-ext2.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/convert/source-ext2.c b/convert/source-ext2.c
index 1b0576b..275cb89 100644
--- a/convert/source-ext2.c
+++ b/convert/source-ext2.c
@@ -34,8 +34,9 @@ static int ext2_open_fs(struct btrfs_convert_context *cctx, const char *name)
ext2_filsys ext2_fs;
ext2_ino_t ino;
u32 ro_feature;
+   int open_flag = EXT2_FLAG_SOFTSUPP_FEATURES | EXT2_FLAG_64BITS;
 
-   ret = ext2fs_open(name, 0, 0, 0, unix_io_manager, &ext2_fs);
+   ret = ext2fs_open(name, open_flag, 0, 0, unix_io_manager, &ext2_fs);
if (ret) {
fprintf(stderr, "ext2fs_open: %s\n", error_message(ret));
return -1;
@@ -148,7 +149,7 @@ static int ext2_read_used_space(struct btrfs_convert_context *cctx)
return -ENOMEM;
 
for (i = 0; i < fs->group_desc_count; i++) {
-   ret = ext2fs_get_block_bitmap_range(fs->block_map, blk_itr,
+   ret = ext2fs_get_block_bitmap_range2(fs->block_map, blk_itr,
block_nbytes * 8, block_bitmap);
if (ret) {
error("fail to get bitmap from ext2, %s",
@@ -353,7 +354,7 @@ static int ext2_create_symlink(struct btrfs_trans_handle *trans,
int ret;
char *pathname;
u64 inode_size = btrfs_stack_inode_size(btrfs_inode);
-   if (ext2fs_inode_data_blocks(ext2_fs, ext2_inode)) {
+   if (ext2fs_inode_data_blocks2(ext2_fs, ext2_inode)) {
btrfs_set_stack_inode_size(btrfs_inode, inode_size + 1);
ret = ext2_create_file_extents(trans, root, objectid,
btrfs_inode, ext2_fs, ext2_ino,
@@ -627,9 +628,9 @@ static int ext2_copy_extended_attrs(struct btrfs_trans_handle *trans,
ret = -ENOMEM;
goto out;
}
-   err = ext2fs_read_ext_attr(ext2_fs, ext2_inode->i_file_acl, buffer);
+   err = ext2fs_read_ext_attr2(ext2_fs, ext2_inode->i_file_acl, buffer);
if (err) {
-   fprintf(stderr, "ext2fs_read_ext_attr: %s\n",
+   fprintf(stderr, "ext2fs_read_ext_attr2: %s\n",
error_message(err));
ret = -1;
goto out;
-- 
2.7.4



[OT] SSD performance patterns (was: Btrfs/SSD)

2017-05-13 Thread Kai Krakow
On Sat, 13 May 2017 09:39:39 + (UTC),
Duncan <1i5t5.dun...@cox.net> wrote:

> Kai Krakow posted on Fri, 12 May 2017 20:27:56 +0200 as excerpted:
> 
> > In the end, the more continuous blocks of free space there are, the
> > better the chance for proper wear leveling.  
> 
> Talking about which...
> 
> When I was doing my ssd research the first time around, the going 
> recommendation was to keep 20-33% of the total space on the ssd
> entirely unallocated, allowing it to use that space as an FTL
> erase-block management pool.
> 
> At the time, I added up all my "performance matters" data dirs and 
> allowing for reasonable in-filesystem free-space, decided I could fit
> it in 64 GB if I had to, tho 80 GB would be a more comfortable fit,
> so allowing for the above entirely unpartitioned/unused slackspace 
> recommendations, had a target of 120-128 GB, with a reasonable range 
> depending on actual availability of 100-160 GB.
> 
> It turned out, due to pricing and availability, I ended up spending 
> somewhat more and getting 256 GB (238.5 GiB).  Of course that allowed
> me much more flexibility than I had expected and I ended up with
> basically everything but the media partition on the ssds, PLUS I
> still left them at only just over 50% partitioned, (using the gdisk
> figures, 51%- partitioned, 49%+ free).

I put my ESP (for UEFI) onto the SSD and also played with putting swap
onto it dedicated to hibernation. But I discarded the hibernation idea
and removed the swap because it didn't work well: It wasn't much faster
than waking from HDD, and hibernation is not that reliable anyways.
Also, hybrid hibernation is not yet integrated into KDE so I stick to
sleep mode currently.

The rest of my SSD (also 500GB) is dedicated to bcache. This fits my
complete work set of daily work with hit ratios going up to 90% and
beyond. My filesystem boots and feels like SSD, the HDDs are almost
silent and still my file system is 3TB on 3x 1TB HDD.


> Given that, I've not enabled btrfs trim/discard (which saved me from
> the bugs with it a few kernel cycles ago), and while I do have a
> weekly fstrim systemd timer setup, I've not had to be too concerned
> about btrfs bugs (also now fixed, I believe) when fstrim on btrfs was
> known not to be trimming everything it really should have been.

This is a good recommendation as TRIM is still a slow operation because
Queued TRIM is not used for most drives due to buggy firmware. So you
not only circumvent kernel and firmware bugs, but also get better
performance that way.


> Anyway, that 20-33% left entirely unallocated/unpartitioned 
> recommendation still holds, right?  Am I correct in asserting that if
> one is following that, the FTL already has plenty of erase-blocks
> available for management and the discussion about filesystem level
> trim and free space management becomes much less urgent, tho of
> course it's still worth considering if it's convenient to do so?
> 
> And am I also correct in believing that while it's not really worth 
> spending more to over-provision to the near 50% as I ended up doing,
> if things work out that way as they did with me because the
> difference in price between 30% overprovisioning and 50%
> overprovisioning ends up being trivial, there's really not much need
> to worry about active filesystem trim at all, because the FTL has
> effectively half the device left to play erase-block musical chairs
> with as it decides it needs to?

I think, things may have changed since long ago. See below. But it
certainly depends on which drive manufacturer you chose, I guess.

I can at least confirm that bigger drives wear their write cycles much
slower, even when filled up. My old 128MB Crucial drive was worn out after
only 1 year (I swapped it early, I kept an eye on SMART numbers). My
500GB Samsung drive is around 1 year old now, I do write a lot more
data to it, but according to SMART it should work for at least 5 to 7
more years. By that time, I probably already swapped it for a bigger
drive.

So I guess you should maybe look at your SMART numbers and calculate
the expected life time:

Power_on_Hours(RAW) * WLC(VALUE) / (100-WLC(VALUE))
with WLC = Wear_Leveling_Count

should get you the expected remaining power on hours. My drive is
powered on 24/7 most of the time but if you power your drive only 8
hours per day, you can easily ramp up the life time by three times of
days vs. me. ;-)
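
To make the arithmetic concrete, here is a tiny sketch of that estimate; the
SMART readings below are invented placeholders, not values from this thread:

#include <stdio.h>

int main(void)
{
	/* Invented example readings -- substitute your own smartctl output. */
	double power_on_hours = 8760.0; /* Power_On_Hours raw value: ~1 year at 24/7 */
	double wlc = 85.0;              /* Wear_Leveling_Count normalized VALUE (100 = new) */

	/* Expected remaining power-on hours per the formula above. */
	double remaining = power_on_hours * wlc / (100.0 - wlc);

	printf("~%.0f hours (~%.1f years at 24/7 uptime) of wear budget left\n",
	       remaining, remaining / 8760.0);
	return 0;
}

With these placeholder numbers the estimate comes out around 5.7 years of
continuous uptime, which is the same ballpark as the 5-to-7-year figure above.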

There is also Total_LBAs_Written but that, at least for me, usually
gives much higher lifetime values so I'd stick with the pessimistic
ones.

Even when WLC goes to zero, the drive should still have reserved blocks
available. My drive sets the threshold to 0 for WLC which makes me
think that it is not fatal when it hits 0 because the drive still has
reserved blocks. And for reserved blocks, the threshold is 10%.

Now combine that with your planning of getting a new drive, and you can
optimize space efficiency vs. lifetime better.


> Of course the higher per-GiB cost 

Re: Btrfs/SSD

2017-05-13 Thread Janos Toth F.
> Anyway, that 20-33% left entirely unallocated/unpartitioned
> recommendation still holds, right?

I never liked that idea. And I really disliked how people considered
it to be (and even passed it down as) some magical, absolute
stupid-proof fail-safe thing (because it's not).

1: Unless you reliably trim the whole LBA space (and/or run
ata_secure_erase on the whole drive) before you (re-)partition the LBA
space, you have zero guarantee that the drive's controller/firmware
will treat the unallocated space as empty or will keep its content
around as useful data (even if it's full of zeros because zero could
be very useful data unless it's specifically marked as "throwaway" by
trim/erase). On the other hand, a trim-compatible filesystem should
properly mark (trim) all (or at least most of) the free space as free
(= free to erase internally by the controller's discretion). And even
if trim isn't fail-proof either, those bugs should be temporary (and
it's not like a sane SSD will die in a few weeks due to these kinds of
issues during sane usage, and crazy drives will often fail under crazy
usage regardless of trim and spare space).

2: It's not some daemon-summoning, world-ending catastrophe if you
occasionally happen to fill your SSD to ~100%. It probably won't like
it (it will probably get slow by the end of the writes and the
internal write amplification might skyrocket at it's peak) but nothing
extraordinary will happen and normal operation (high write speed,
normal internal write amplification, etc) should resume soon after you
make some room (for example, you delete your temporary files or move
some old content to an archive storage and you properly trim that
space). That space is there to be used, just don't leave it close to
100% all the time and try never leaving it close to 100% when you plan
to keep it busy with many small random writes.

3: Some drives have plenty of hidden internal spare space (especially
the expensive kinds offered for datacenters or "enthusiast" consumers
by big companies like Intel and such). Even some cheap drives might
have plenty of erased space at 100% LBA allocation if they use
compression internally (and you don't fill it up to 100% with
in-compressible content).


Re: Btrfs/SSD

2017-05-13 Thread Kai Krakow
On Sat, 13 May 2017 14:52:47 +0500,
Roman Mamedov  wrote:

> On Fri, 12 May 2017 20:36:44 +0200
> Kai Krakow  wrote:
> 
> > My concern is with fail scenarios of some SSDs which die unexpected
> > and horribly. I found some reports of older Samsung SSDs which
> > failed suddenly and unexpected, and in a way that the drive
> > completely died: No more data access, everything gone. HDDs start
> > with bad sectors and there's a good chance I can recover most of
> > the data except a few sectors.  
> 
> Just have your backups up-to-date, doesn't matter if it's SSD, HDD or
> any sort of RAID.
> 
> In a way it's even better, that SSDs [are said to] fail abruptly and
> entirely. You can then just restore from backups and go on. Whereas a
> failing HDD can leave you puzzled on e.g. whether it's a cable or
> controller problem instead, and possibly can even cause some data
> corruption which you won't notice until too late.

My current backup strategy can handle this. I never back up a file from
the source again if its timestamp didn't change. That way, silent data
corruption won't creep into the backup. Additionally, I keep a backlog
of 5 years of file history. Even if a corrupted file creeps into the
backup, there is enough time to get a good copy back. If it's older, it
probably doesn't hurt so much anyway.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Btrfs/SSD

2017-05-13 Thread Roman Mamedov
On Fri, 12 May 2017 20:36:44 +0200
Kai Krakow  wrote:

> My concern is with fail scenarios of some SSDs which die unexpected and
> horribly. I found some reports of older Samsung SSDs which failed
> suddenly and unexpected, and in a way that the drive completely died:
> No more data access, everything gone. HDDs start with bad sectors and
> there's a good chance I can recover most of the data except a few
> sectors.

Just have your backups up-to-date, doesn't matter if it's SSD, HDD or any sort
of RAID.

In a way it's even better, that SSDs [are said to] fail abruptly and entirely.
You can then just restore from backups and go on. Whereas a failing HDD can
leave you puzzled on e.g. whether it's a cable or controller problem instead,
and possibly can even cause some data corruption which you won't notice until
too late.

-- 
With respect,
Roman


Re: Btrfs/SSD

2017-05-13 Thread Duncan
Kai Krakow posted on Fri, 12 May 2017 20:27:56 +0200 as excerpted:

> In the end, the more continuous blocks of free space there are, the
> better the chance for proper wear leveling.

Talking about which...

When I was doing my ssd research the first time around, the going 
recommendation was to keep 20-33% of the total space on the ssd entirely 
unallocated, allowing it to use that space as an FTL erase-block 
management pool.

At the time, I added up all my "performance matters" data dirs and 
allowing for reasonable in-filesystem free-space, decided I could fit it 
in 64 GB if I had to, tho 80 GB would be a more comfortable fit, so 
allowing for the above entirely unpartitioned/unused slackspace 
recommendations, had a target of 120-128 GB, with a reasonable range 
depending on actual availability of 100-160 GB.

It turned out, due to pricing and availability, I ended up spending 
somewhat more and getting 256 GB (238.5 GiB).  Of course that allowed me 
much more flexibility than I had expected and I ended up with basically 
everything but the media partition on the ssds, PLUS I still left them at 
only just over 50% partitioned, (using the gdisk figures, 51%- 
partitioned, 49%+ free).

Given that, I've not enabled btrfs trim/discard (which saved me from the 
bugs with it a few kernel cycles ago), and while I do have a weekly fstrim 
systemd timer setup, I've not had to be too concerned about btrfs bugs 
(also now fixed, I believe) when fstrim on btrfs was known not to be 
trimming everything it really should have been.


Anyway, that 20-33% left entirely unallocated/unpartitioned 
recommendation still holds, right?  Am I correct in asserting that if one 
is following that, the FTL already has plenty of erase-blocks available 
for management and the discussion about filesystem level trim and free 
space management becomes much less urgent, tho of course it's still worth 
considering if it's convenient to do so?

And am I also correct in believing that while it's not really worth 
spending more to over-provision to the near 50% as I ended up doing, if 
things work out that way as they did with me because the difference in 
price between 30% overprovisioning and 50% overprovisioning ends up being 
trivial, there's really not much need to worry about active filesystem 
trim at all, because the FTL has effectively half the device left to play 
erase-block musical chairs with as it decides it needs to?


Of course the higher per-GiB cost of ssd as compared to spinning rust 
does mean that the above overprovisioning recommendation really does 
hurt, most of the time, driving per-usable-GB costs even higher, and as I 
recall that was definitely the case back then between 80 GiB and 160 GiB, 
and it was basically an accident of timing, that I was buying just as the 
manufactures flooded the market with newly cost-effective 256 GB devices, 
that meant they were only trivially more expensive than the 128 or 160 
GB, AND unlike the smaller devices, actually /available/ in the 500-ish 
MB/sec performance range that (for SATA-based SSDs) is actually capped by 
SATA-600 bus speeds more than the chips themselves.  (There were lower 
cost 128 GB devices, but they were lower speed than I wanted, too.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Creating btrfs RAID on LUKS devs makes devices disappear

2017-05-13 Thread Andrei Borzenkov
12.05.2017 20:07, Chris Murphy wrote:
> On Thu, May 11, 2017 at 5:24 PM, Ochi  wrote:
>> Hello,
>>
>> here is the journal.log (I hope). It's quite interesting. I rebooted the
>> machine, performed a mkfs.btrfs on dm-{2,3,4} and dm-3 was missing
>> afterwards (around timestamp 66.*). However, I then logged into the machine
>> from another terminal (around timestamp 118.*) which triggered something to
>> make the device appear again :O Indeed, dm-3 was once again there after
>> logging in. Does systemd mix something up?
> 

Yes :) Did you doubt?

Please, try to reproduce it and provide both journalctl and "udevadm
info --export-db" output. I have my theory what happens here.

> I don't see any Btrfs complaints. If dm-3 is the device Btrfs is
> expecting and it vanishes, then Btrfs would mention it on a read or
> write. So either nothing is happening and Btrfs isn't yet aware that
> dm-3 is gone, or it's looking at some other instance of that encrypted
> volume, maybe it's using /dev/mapper/storage1. You can find out with
> btrfs fi show.
> 

/dev/mapper/xxx should be a link to /dev/dm-NN (although you are never
sure with Linux). dm-NN is *the* device. /dev/mapper/storage1 cannot
exist without /dev/dm-3, irrespectively of what btrfs shows.

> Whenever I use Btrfs on LUKS I invariably see fi show, show me one
> device using /dev/dm-0 notation and the other device is
> /dev/mapper/blah notation. I think this is just an artifact of all the
> weird behind the scenes shit with symlinks and such, and that is a
> systemd thing as far as I know.
> 

Not quite.

As implemented today, when a device appears (i.e. udev gets an ADD uevent)
and it is detected as part of btrfs, a udev rule scans the device (with the
equivalent of "btrfs device ready"). At the time the event is being
processed, the only name available is the canonical kernel /dev/dm-0;
all convenience symlinks are created later, after all rules have been
processed. btrfs remembers the device name that was passed to it.

What makes it even more confusing - some btrfs utilities seem to resolve
/dev/dm-0 to /dev/mapper/blah by themselves, and some do not. I am not sure
which ones.

Recently btrfsprogs got an extra rule that repeats "btrfs device ready", but
now *after* the symlinks have been created (using a RUN directive with
/dev/mapper/blah). It updates the kernel with the new name. So my guess is that
for this device this rule is missing (probably the device gets created in the
initrd and the rule is not added to it).

> Anyway, more information is needed. First see if the device is really
> missing per Btrfs (read or write something and also check with 'btrfs
> fi show') You can add systemd.log_level=debug as a boot parameter to
> get more information in the journal, although it's rather verbose. You
> could combine it with rd.udev.debug but that gets really crazy verbose
> so I tend not to use them at the same time.
> 

@ochi: Please, before running with debug, repeat your test as you did and
provide udevadm info --export-db. This may be enough. In my experience,
debug output, while being extremely verbose, contains very little
additional useful information, but has a tendency to skew relative
timing, thus changing behavior.

> The other possibility is there's a conflict with dracut which may be
> doing some things, and the debug switch for that is rd.debug and is
> likewise extremely verbose, producing huge logs, so I would start out
> with just the systemd debug and see if that reveals anything
> *assuming* Btrfs is complaining. If Btrfs doesn't complain about a
> device going away then I wouldn't worry about it.
> 
> 

This is not a btrfs issue in any case (except that btrfs folks should really
work together with systemd folks and finally come to a common
implementation of multi-device filesystems).
