date:20141201

Re: pro/cons of raid1 with mdadm/lvm2

2014-12-01 Thread Russell Coker

On Mon, 1 Dec 2014, Chris Murphy  wrote:
> On Sun, Nov 30, 2014 at 3:06 PM, Russell Coker  wrote:
> > When the 2 disks have different data mdadm has no way of knowing which
> > one is correct and has a 50% chance of overwriting good data. But BTRFS
> > does checksums on all reads and solves the problem of corrupt data - as
> > long as you don't have 2 corrupt sectors in matching blocks.
> 
> Yeah. I'm not sure though if openSUSE 13.2 prevents users from
> creating btrfs raid1 volumes entirely, or if it's just an install time
> limitation.

With BTRFS you can make it RAID-1 afterwards.  The possibility of data loss 
during system install usually isn't something you are concerned about so this 
shouldn't be a problem.

> I know that Fedora's installer won't allow the user to create Btrfs on
> LVM, and it probably doesn't allow it on md raid either.

For LVM that's reasonable, for MD-RAID that would be a bug IMHO.

On Mon, 1 Dec 2014, Roman Mamedov  wrote:
>   * mdadm RAID has much better read balancing;
> Btrfs reads are satisfied from what's in effect a random drive
> (PID-based balancing of threads to drives), mdadm reads from the
> less-loaded drive. Also mdadm has a way to specify some RAID1 array
> members as to be never used for reads if at all possible ("write-mostly"),
> which helps in RAID1 of HDD and SSD.

True.  But that's just a lack of performance tuning in the current code; it 
will be fixed at some future time.
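
To make the two read-balancing policies described above concrete, here is a
rough userspace sketch in C. It is not the btrfs or md code: the function
names, the in-flight counters and the write-mostly flags are assumptions
invented for this illustration, and a real implementation may weigh further
factors.

```c
#include <sys/types.h>

/*
 * PID-based selection: the mirror depends only on the reader's PID, so it
 * is effectively random per process and ignores how busy each drive is.
 */
static int pick_mirror_by_pid(pid_t pid, int nmirrors)
{
	return (int)(pid % nmirrors);
}

/*
 * Load-based selection (simplified): prefer the member with the fewest
 * in-flight requests, and avoid members flagged write-mostly unless every
 * member carries that flag.
 */
static int pick_mirror_by_load(const int inflight[], const int write_mostly[],
			       int nmirrors)
{
	int best = -1;

	/* First pass: only members that are not write-mostly. */
	for (int i = 0; i < nmirrors; i++) {
		if (write_mostly[i])
			continue;
		if (best < 0 || inflight[i] < inflight[best])
			best = i;
	}
	if (best >= 0)
		return best;

	/* Fallback: every member is write-mostly, so take the least loaded. */
	for (int i = 0; i < nmirrors; i++)
		if (best < 0 || inflight[i] < inflight[best])
			best = i;
	return best;
}
```

With an HDD plus SSD pair, flagging the HDD as write-mostly in this sketch
sends all reads to the SSD, whereas the PID-based policy keeps splitting
readers between both drives.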

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: pro/cons of raid1 with mdadm/lvm2

2014-12-01 Thread Gour

On Mon, 01 Dec 2014 09:06:19 +1100
Russell Coker  wrote:

> When the 2 disks have different data mdadm has no way of knowing
> which one is correct and has a 50% chance of overwriting good data.
> But BTRFS does checksums on all reads and solves the problem of
> corrupt data - as long as you don't have 2 corrupt sectors in
> matching blocks.

Hmm, this is very interesting and valuable info. Thank you.
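
The point in the quoted paragraph can be sketched in a few lines of C. This
is only an illustration under stated assumptions: toy_csum() is an FNV-1a
stand-in rather than the crc32c that Btrfs uses, and pick_good_mirror() is a
hypothetical helper, not a Btrfs function. With a stored checksum the reader
can tell which copy is intact; without one (the mdadm case) two differing
copies are indistinguishable.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy checksum (FNV-1a) standing in for the real per-block checksum. */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t h = 2166136261u;		/* FNV-1a offset basis */

	for (size_t i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 16777619u;			/* FNV-1a prime */
	}
	return h;
}

/*
 * Return the index of the first mirror whose contents match the stored
 * checksum, or -1 if every copy is corrupt (the "2 corrupt sectors in
 * matching blocks" case, which checksums cannot repair).
 */
static int pick_good_mirror(const uint8_t *const mirrors[], int nmirrors,
			    size_t len, uint32_t stored_csum)
{
	for (int i = 0; i < nmirrors; i++)
		if (toy_csum(mirrors[i], len) == stored_csum)
			return i;
	return -1;
}
```

A caller would then rewrite the bad copy from the mirror that was returned,
instead of guessing which side to trust.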


Sincerely,
Gour

-- 
Many, many births both you and I have passed. I can remember 
all of them, but you cannot, O subduer of the enemy!

-- 
One is understood to be in full knowledge whose every endeavor 
is devoid of desire for sense gratification. He is said by sages 
to be a worker for whom the reactions of work have been burned 
up by the fire of perfect knowledge.

http://www.atmarama.net | Hlapicina (Croatia) | GPG: 52B5C810



Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-12-01 Thread Christoph Hellwig

On Thu, Nov 27, 2014 at 03:27:31PM -0500, Theodore Ts'o wrote:
> I can do that, but part of the reason why we were doing this rather
> involved set of changes was to allow other file systems to be able to
> take advantage of lazytime.  I suppose there is value in allowing
> other file systems, such as jfs, f2fs, etc., to use it, but still,
> it's a bit of a shame to drop btrfs and xfs support for this feature.

I want to see xfs and btrfs support, but I think we're running into some
conceptual problems here.  I don't have the time right now to fully
review the XFS changes for correctness and test them, and I'd rather
keep things as-is for a while and then add properly designed and fully
tested support rather than something possibly broken.

> I'll note by the way that ext3 and ext4 doesn't really use VFS dirty
> tracking either --- see my other comments about the naming of
> "mark_inode_dirty" being a bit misleading, at least for all/most of
> the major file systems.  The problem seems to be that replacement
> schemes that we're all using are slightly different.  :-/

Indeed.  It seems all existing ->dirty_inode instances basically
just try to work around the problem that the VFS simply updates
timestamps by writing into the inode without involving the filesystem.
There are all kinds of bugs in different instances, as well as comments
mentioning an assumption that this only happens for atime although
the VFS also does this "trick" for c/mtime, including a caller from
the page fault code that the filesystems can't even avoid by providing
non-default methods everywhere.

> I suppose I should let the btrfs folks decide whether they want to add
> is_readonly() and write_time() functions --- or maybe help with the
> cleanup work so that mark_inode_dirty() can reflect an error to its
> callers.   Chris, David, what do you think?

The ->is_readonly method seems like a clear winner to me, I'm all for
adding it, and thus suggested moving it first in the series.

I've read a bit more through the series and would like to suggest
the following approach for the rest:

 - convert ext3/4 to use ->update_time instead of the ->dirty_time
   callout so it gets exact notifications (preferably the few
   remaining filesystems as well, although that shouldn't really be a
   blocker)
 - defer timestamp updates for any filesystems not defining
   ->update_time (or ->dirty_time for now), and allow filesystems
   using ->update_time to defer the update as well by calling
   mark_inode_dirty with the I_DIRTY_TIME flag so that XFS and btrfs
   don't have to opt-in without testing.
 - Convert xfs, btrfs and the remaining filesystems using ->dirty_inode
   incrementally.


Re: pro/cons of raid1 with mdadm/lvm2

2014-12-01 Thread Gour

On Mon, 1 Dec 2014 10:42:36 +0500
Roman Mamedov  wrote:

> Pros:

[...]

> Con:
> 
>   * You only get the ability to recover from a checksum failure with
> Btrfs RAID1, not with mdadm RAID1 (see Russell's reply).

For the reasons you mentioned I'll keep my root under btrfs' native raid
and do some more thinking about /home and the possibility of using XFS
for it.


Sincerely,
Gour

-- 
Abandoning all attachment to the results of his activities, 
ever satisfied and independent, he performs no fruitive action, 
although engaged in all kinds of undertakings.



Re: pro/cons of raid1 with mdadm/lvm2

2014-12-01 Thread Gour

On Sun, 30 Nov 2014 18:00:37 -0700
Chris Murphy  wrote:

> Yeah. I'm not sure though if openSUSE 13.2 prevents users from
> creating btrfs raid1 volumes entirely, or if it's just an install time
> limitation.

I was able to create btrfs raid1 volumes under lvm, but the installer failed
at installing Grub2.

I don't know whether that is an installer limitation, so I decided to use
btrfs' native raid1, at least for root. I will see whether to use it for
/home as well or use XFS as suggested by SUSE.


Sincerely,
Gour

-- 
In the material world, one who is unaffected by whatever good 
or evil he may obtain, neither praising it nor despising it, 
is firmly fixed in perfect knowledge.

-- 
The humble sages, by virtue of true knowledge, see with equal 
vision a learned and gentle brāhmana, a cow, an elephant, a dog 
and a dog-eater.

http://www.atmarama.net | Hlapicina (Croatia) | GPG: 52B5C810



[PATCH] Btrfs: fix wrong list access on the failure of reading out checksum

2014-12-01 Thread Miao Xie

If we failed to read out the checksum, we would free all the checksums
in the list. But the current code accessed the list head, not the entry
in the list. Fix it.

Signed-off-by: Miao Xie 
---
 fs/btrfs/file-item.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 783a943..c26b58f 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -413,7 +413,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 	ret = 0;
 fail:
 	while (ret < 0 && !list_empty(&tmplist)) {
-		sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
+		sums = list_first_entry(&tmplist, struct btrfs_ordered_sum,
+					list);
 		list_del(&sums->list);
 		kfree(sums);
 	}
-- 
1.9.3
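
For readers unfamiliar with the kernel list helpers, the difference the patch
relies on can be shown with a small userspace sketch. The macros below are
simplified stand-ins for those in <linux/list.h>, and struct sum_item,
push_item() and drain() are hypothetical names for this example only.
list_entry() converts a pointer to an embedded list_head into its containing
structure, so applying it to the list head itself yields a pointer that does
not refer to any real entry; list_first_entry() applies the same conversion
to head->next, which is the first real element.

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's list helpers. */
struct list_head { struct list_head *next, *prev; };

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
	list_entry((head)->next, type, member)

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Hypothetical payload; "list" is deliberately not the first member. */
struct sum_item {
	int value;
	struct list_head list;
};

/* Append a heap-allocated item; returns 0 on success, -1 on failure. */
static int push_item(struct list_head *head, int value)
{
	struct sum_item *it = malloc(sizeof(*it));

	if (!it)
		return -1;
	it->value = value;
	list_add_tail(&it->list, head);
	return 0;
}

/* Free every item the way the fixed cleanup loop does; returns the count. */
static int drain(struct list_head *head)
{
	int freed = 0;

	while (!list_empty(head)) {
		struct sum_item *it =
			list_first_entry(head, struct sum_item, list);

		list_del(&it->list);
		free(it);
		freed++;
	}
	return freed;
}
```

Had drain() used list_entry(head, ...) instead, it would have unlinked and
freed memory computed from the head's own address rather than an allocated
item.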


Re: [Lsf-pc] [LSF/MM TOPIC]: Btrfs: Decoupling block-size and page-size in BTRFS.

2014-12-01 Thread Christoph Hellwig

Is this topic relevant for the broader FS community?  Maybe the btrfs
community should look into organizing a meeting co-hosted with Vault
similar to what we did for ext4 and XFS in the past?


Re: [PATCH] Btrfs: fix wrong list access on the failure of reading out checksum

2014-12-01 Thread Miao Xie

Please ignore this patch, Chris has fixed this problem.

Thanks
Miao

On Mon, 1 Dec 2014 18:04:13 +0800, Miao Xie wrote:
> If we failed to read out the checksum, we would free all the checksums
> in the list. But the current code accessed the list head, not the entry
> in the list. Fix it.
> 
> Signed-off-by: Miao Xie 
> ---
>  fs/btrfs/file-item.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 783a943..c26b58f 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -413,7 +413,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
>   ret = 0;
>  fail:
>   while (ret < 0 && !list_empty(&tmplist)) {
> - sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
> + sums = list_first_entry(&tmplist, struct btrfs_ordered_sum,
> + list);
>   list_del(&sums->list);
>   kfree(sums);
>   }
> 


BTRFS balance crash

2014-12-01 Thread Swâmi Petaramesh

Hi,

I'm running Fedora Core 21 beta with kernel 3.17.4-300.fc21.x86_64 and
btrfs-progs-3.17-1.fc21.x86_64.

As my SSD was pretty full, I started:

btrfs balance start -dusage=75 -musage=75 /

This ended in "segmentation fault".

Afterwards my system wouldn't access the disk anymore, and needed a hard 
reboot.

After the reboot, the balance seemingly resumed and then finished:

[root@vajra ~]# btrfs balance status /
Balance on '/' is running
17 out of about 24 chunks balanced (123 considered),  29% left

(later)

[root@vajra ~]# btrfs balance status /
No balance found on '/'


...but my system has logged a very high number of kernel WARNINGs such as:

Dec 01 12:11:26 vajra kernel: [ cut here ]
Dec 01 12:11:26 vajra kernel: WARNING: CPU: 2 PID: 966 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x39/0x40 [btrfs]()
Dec 01 12:11:26 vajra kernel: Modules linked in: ccm rfcomm ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables i
Dec 01 12:11:26 vajra kernel:  rfkill wmi snd_pcm parport_pc parport hp_accel snd_timer lis3lv02d mei_me input_polldev tpm_tis mei snd hp_wireless shpchp i2c_i801 sou
Dec 01 12:11:26 vajra kernel: CPU: 2 PID: 966 Comm: btrfs-balance Tainted: G        W  OE  3.17.4-300.fc21.x86_64 #1
Dec 01 12:11:26 vajra kernel: Hardware name: Hewlett-Packard HP EliteBook 820 G1/1991, BIOS L71 Ver. 01.12 06/25/2014
Dec 01 12:11:26 vajra kernel:   9fd55060 8800a8467ac8 8173f929
Dec 01 12:11:26 vajra kernel:   8800a8467b00 810970ad 88014b7ee780
Dec 01 12:11:26 vajra kernel:   88022fcec800 88022f9d1a40
Dec 01 12:11:26 vajra kernel: Call Trace:
Dec 01 12:11:26 vajra kernel:  [] dump_stack+0x45/0x56
Dec 01 12:11:26 vajra kernel:  [] warn_slowpath_common+0x7d/0xa0
Dec 01 12:11:26 vajra kernel:  [] warn_slowpath_null+0x1a/0x20
Dec 01 12:11:26 vajra kernel:  [] btrfs_assert_delayed_root_empty+0x39/0x40 [btrfs]
Dec 01 12:11:26 vajra kernel:  [] btrfs_commit_transaction+0x388/0x950 [btrfs]
Dec 01 12:11:26 vajra kernel:  [] prepare_to_merge+0x20d/0x230 [btrfs]
Dec 01 12:11:26 vajra kernel:  [] relocate_block_group+0x403/0x6d0 [btrfs]
Dec 01 12:11:26 vajra kernel:  [] btrfs_relocate_block_group+0x1e6/0x2f0 [btrfs]
Dec 01 12:11:26 vajra kernel:  [] btrfs_relocate_chunk.isra.27+0x6a/0x750 [btrfs]
Dec 01 12:11:26 vajra kernel:  [] ? btrfs_set_path_blocking+0x41/0x80 [btrfs]
Dec 01 12:11:26 vajra kernel:  [] ? btrfs_search_slot+0x4ad/0xa70 [btrfs]
Dec 01 12:11:26 vajra kernel:  [] ? btrfs_get_token_64+0x119/0x140 [btrfs]
Dec 01 12:11:26 vajra kernel:  [] ? free_extent_buffer+0x4f/0xa0 [btrfs]
Dec 01 12:11:26 vajra kernel:  [] btrfs_balance+0x980/0xf40 [btrfs]
Dec 01 12:11:26 vajra kernel:  [] balance_kthread+0x5d/0x80 [btrfs]
Dec 01 12:11:26 vajra kernel:  [] ? btrfs_balance+0xf40/0xf40 [btrfs]
Dec 01 12:11:26 vajra kernel:  [] kthread+0xea/0x100
Dec 01 12:11:26 vajra kernel:  [] ? kthread_create_on_node+0x1a0/0x1a0
Dec 01 12:11:26 vajra kernel:  [] ret_from_fork+0x7c/0xb0
Dec 01 12:11:26 vajra kernel:  [] ? kthread_create_on_node+0x1a0/0x1a0
Dec 01 12:11:26 vajra kernel: ---[ end trace f1f27f48a0abdf6f ]---

Dec 01 12:11:27 vajra kernel: [ cut here ]
Dec 01 12:11:27 vajra kernel: WARNING: CPU: 2 PID: 966 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x39/0x40 [btrfs]()
Dec 01 12:11:27 vajra kernel: Modules linked in: ccm rfcomm ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables i
Dec 01 12:11:27 vajra kernel:  rfkill wmi snd_pcm parport_pc parport hp_accel snd_timer lis3lv02d mei_me input_polldev tpm_tis mei snd hp_wireless shpchp i2c_i801 sou
Dec 01 12:11:27 vajra kernel: CPU: 2 PID: 966 Comm: btrfs-balance Tainted: G        W  OE  3.17.4-300.fc21.x86_64 #1
Dec 01 12:11:27 vajra kernel: Hardware name: Hewlett-Packard HP EliteBook 820 G1/1991, BIOS L71 Ver. 01.12 06/25/2014
Dec 01 12:11:27 vajra kernel:   9fd55060 8800a8467b20 8173f929
Dec 01 12:11:27 vajra kernel:   8800a8467b58 810970ad 880183e6b820
Dec 01 12:11:27 vajra kernel:   88022fcec800 88022f9d1950
Dec 01 12:11:27 vajra kernel: Call Trace:
Dec 01 12:11:27 vajra kernel:  [] dump_stack+0x45/0x56
Dec 01 12:11:27 vajra kernel:  [] warn_slowpath_common+0x7d/0xa0
Dec 01 12:11:27 vajra kernel:  [] warn_slowpath_null+0x1a/0x20
Dec 01 12:11:27 vajra kernel:  [] btrfs_assert_delayed_root_empty+0x39/0x40 [btrfs]
Dec 01 12:11:27 vajra kernel:  [] btrfs_commit_transaction+0x388/0x950 [btrfs]
Dec 01 12:11:27 vajra kernel:  [] relocate_block_group+0x454/0x6d0 [btrfs]
Dec 01 12:11:27 vajra kernel:  [] btrfs_relocate_block_group+0x1e6/0x2f0 [btrfs]
Dec 01 12:11:27 vajra kerne

btrfs stuck with lots of files

2014-12-01 Thread Peter Volkov

Hi, guys.

We have a problem with a btrfs file system: sometimes it gets stuck
without leaving me any way to interrupt it (shutdown -r now is unable to
restart the server). By stuck I mean that some processes that were
previously able to write to disk can no longer cope with the load, and the
load average goes up:

top - 13:10:58 up 1 day,  9:26,  5 users,  load average: 157.76, 156.61, 149.29
Tasks: 235 total,   2 running, 233 sleeping,   0 stopped,   0 zombie
%Cpu(s): 19.8 us, 15.0 sy,  0.0 ni, 60.7 id,  3.9 wa,  0.0 hi,  0.6 si,  0.0 st
KiB Mem:  65922104 total, 65414856 used,   507248 free,     1844 buffers
KiB Swap:        0 total,        0 used,        0 free. 62570804 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 8644 root      20   0       0      0      0 R  96.5  0.0 127:21.95 kworker/u16:16
 5047 dvr       20   0 6884292 122668   4132 S   6.4  0.2 258:59.49 dvrserver
30223 root      20   0   20140   2600   2132 R   6.4  0.0   0:00.01 top
    1 root      20   0    4276   1628   1524 S   0.0  0.0   0:40.19 init



There are about 300 threads on the server, some of which are writing to disk.
Some information about this btrfs filesystem: it is a 22-disk file
system with raid1 for metadata and raid0 for data:

 # btrfs filesystem df /store/
Data, single: total=11.92TiB, used=10.86TiB
System, RAID1: total=8.00MiB, used=1.27MiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=46.00GiB, used=33.49GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=128.00KiB
 # btrfs property get /store/
ro=false
label=store
 # btrfs device stats /store/
(shows all zeros)
 # btrfs balance status /store/
No balance found on '/store/'
 # btrfs filesystem show /store/
Btrfs v3.17.1
(btw, is it supposed to have only version here?)

As for the load, we write quite small files (some of 313K, some of
800K), which is why metadata takes that much space. So, back to the problem:
iostat 1 exposes the following:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          16.96    0.00   17.09   65.95    0.00    0.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sdf               0.00         0.00         0.00          0          0
sdg               0.00         0.00         0.00          0          0
sdj               0.00         0.00         0.00          0          0
sdh               0.00         0.00         0.00          0          0
sdk               0.00         0.00         0.00          0          0
sdi               1.00         0.00       200.00          0        200
sdl               0.00         0.00         0.00          0          0
sdn              48.00         0.00     17260.00          0      17260
sdm               0.00         0.00         0.00          0          0
sdp               0.00         0.00         0.00          0          0
sdo               0.00         0.00         0.00          0          0
sdq               0.00         0.00         0.00          0          0
sdr               0.00         0.00         0.00          0          0
sds               0.00         0.00         0.00          0          0
sdt               0.00         0.00         0.00          0          0
sdv               0.00         0.00         0.00          0          0
sdw               0.00         0.00         0.00          0          0
sdu               0.00         0.00         0.00          0          0


Writes go to only one disk. I tried to debug what is going on in the kworker
and ran:

$ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace_pipe > trace_pipe.out2

trace_pipe2.out.xz is attached. Could you comment on what goes wrong
here?

The server has 64 GB of RAM. Is it possible that it is unable to keep all
metadata in memory? If such a memory limit exists, can we increase it?


Thanks in advance for any pointers,
--
Peter.


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Austin S Hemmelgarn


On 2014-11-29 16:21, John Williams wrote:
> On Sat, Nov 29, 2014 at 1:07 PM, Alex Elsayed  wrote:
>> I'd suggest looking more closely at the crypto api section of menuconfig -
>> it already has crc32c, among others. Just because it's called the "crypto
>> api" doesn't mean it only has cryptographically-strong algorithms.
>
> I have looked. What 128- or 256-bit hash functions in "crypto api" are
> you referring to that are as fast as Spooky2 or CityHash?

Just because it's a filesystem doesn't always mean that speed is the 
most important thing.  Personally, I can think of multiple cases where 
using a cryptographically strong hash would be preferable, for example:

 * On an fs used solely for backup purposes
 * On an fs used for /boot
 * On an fs spread across a very large near-line disk array and mounted
   by a system with a powerful CPU
 * Almost any other case where data integrity is more important than
   speed

The biggest reason to use the in-kernel Crypto API though, is that it 
gives a huge amount of flexibility, and provides pretty much transparent 
substitution of CPU optimized versions of the exported hash functions 
(for example, you don't have to know whether or not your processor 
supports Intel's CRC32 ISA extensions).
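
The "transparent substitution" mentioned above can be pictured with a small
userspace sketch: callers ask for an algorithm by name and receive the
highest-priority implementation that is usable on the running machine. This
is only an illustration of the idea, not the kernel Crypto API itself; struct
hash_impl, find_best() and the priority values are assumptions made for the
example.

```c
#include <stddef.h>
#include <string.h>

/* One registered implementation of a named algorithm (illustrative only). */
struct hash_impl {
	const char *alg;	/* generic name the caller asks for */
	const char *driver;	/* specific implementation behind it */
	int priority;		/* higher wins when several are usable */
	int usable;		/* nonzero if the CPU supports this driver */
};

/*
 * Return the usable implementation of "alg" with the highest priority,
 * or NULL if none is registered.
 */
static const struct hash_impl *find_best(const struct hash_impl *tbl,
					 size_t n, const char *alg)
{
	const struct hash_impl *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!tbl[i].usable || strcmp(tbl[i].alg, alg) != 0)
			continue;
		if (!best || tbl[i].priority > best->priority)
			best = &tbl[i];
	}
	return best;
}
```

The caller only ever names "crc32c"; whether a generic or a CPU-accelerated
driver answers is decided by the table, which is why the filesystem code does
not need to know about the processor's instruction set.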





Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?

2014-12-01 Thread Austin S Hemmelgarn


On 2014-11-30 20:58, Qu Wenruo wrote:
> [BACKGROUND]
> I'm trying to implement the function to repair missing inode item.
> Under that case, inode type must be salvaged(although it can be fallback to
> FILE).
>
> One case should be, if there is any dir_item/index or inode_ref refers the
> inode as parent, the type of that inode must be DIR.
>
> However, currently btrfsck implement (inode_record only records backref), we
> are unable to search the inode_backref whose parent is given inode number.
>
> [FIRST IMPLEMENT DESIGN]
> My first thought is to implement an generic inode-relation structure,
> recording parent ino, child ino, name and namelen, and restore the structure
> in a rbtree, not in the child/parent's list.
>
> But I soon recognize that this is a perfect use case for relational database,
> as 'ino' as the primary key for INODE table,
> ('parent_ino', 'child_ino', 'name') as the primary key for INODE_REF table.
>
> [CRAZY IDEA]
> So why not using SQL to implement the btrfsck inode-record things?
>
> With such crazy idea, it will be much much easier to do any iteration from a
> given ino, and with the already mature RDB implement, like sqlite3, we can
> save hundreds of lines of codes implementing the rb-tree or list.
>
> [PROS]
> 1. Easy to maintain
> Now we don't need to maintain the rbtree searching or list iteration, but
> easy SQL lines and its wrapper.
>
> 2. Easy to extend
> If we need to record something more, like extents and its relation to
> inode, we only need to create 2 tables and several SQL and wrappers.
>
> 3. Reduced memory usage for HUGE fs.
> When metadata grows to several TB or even more, current rb-tree based
> implement may run short of memory since they are all stored in memory.
> But if use SQL, RDBMS like sqlite3 can restore things in either memory or
> disk, which may hugely reduce the memory usage for huge btrfs.
>
> If not use existing RDBMS, we need to implement complicated memory control
> system to manage memory in userland.
>
> [CONS]
> 1. Heavy implement
> SQL hide the rb-tree or B+ tree implement but costs more memory(if not
> compressed) and CPU cycles, which will be slower than the simple rb-tree
> implement even using lightweight RDBMS like sqlite3.
>
> 2. Heavy dependency
> If use it, btrfs-progs will include RDBMS as the make and runtime
> dependency.
> Such low level progs depend on high level programs like sqlite3 may be very
> strange.
>
> 3. A lot of rework on existing codes.
> Even SQL is easier to maintain and extend, if we use it, we still need to
> reimplement several hundreds or even thousands lines of code to implement
> it, not to mention the regression tests.
>
> 4. Copyright
> Will it cause any copyright problem if using non-GPL RDBMS like sqlite3 in
> GPLv2 btrfs-progs?
>
> [NEED FEEDBACK]
> Any feedback or discussion on the crazy idea is welcomed, since this may needs
> a lot of work, it definitely needs a lot review on the idea before it comes to
> codes.

So, I think this does a good job of highlighting one of the bigger 
issues with btrfsck when it is compared to ext* and/or xfs.  Despite 
this being a problem, I really don't think using an RDBMS is the way to 
fix it, both for reasons outlined in other responses, and because fsck 
should be as fast as possible when nothing is wrong with the fs.
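
For reference, the lookup the proposal needs (is a given inode referenced as
a parent anywhere?) does not require a database in the simplest case. The
sketch below is a minimal illustration in plain C, using a flat array in
place of either an rb-tree or an SQL table; struct inode_ref_rec and
guess_inode_type() are hypothetical names, not btrfs-progs code. It
corresponds roughly to a query such as
"SELECT 1 FROM INODE_REF WHERE parent_ino = ? LIMIT 1".

```c
#include <stddef.h>
#include <stdint.h>

enum guessed_type { GUESS_FILE, GUESS_DIR };

/* One (parent, child, name) relation, as in the proposed INODE_REF table. */
struct inode_ref_rec {
	uint64_t parent_ino;
	uint64_t child_ino;
	const char *name;
};

/*
 * If any record names "ino" as a parent, the inode must be a directory;
 * otherwise fall back to a regular file, as the proposal suggests.
 * This is a linear scan; an index keyed on parent_ino would make it faster.
 */
static enum guessed_type guess_inode_type(const struct inode_ref_rec *refs,
					  size_t n, uint64_t ino)
{
	for (size_t i = 0; i < n; i++)
		if (refs[i].parent_ino == ino)
			return GUESS_DIR;
	return GUESS_FILE;
}
```

The trade-off under discussion is therefore about indexing and memory
management at scale rather than about whether the query can be expressed.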






PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-01 Thread MegaBrutal

Hi all,

I've reported the bug I've previously posted about in "BTRFS messes up
snapshot LV with origin" in the Kernel Bug Tracker.
https://bugzilla.kernel.org/show_bug.cgi?id=89121

Since the other thread went off into theoretical debates about UUIDs
and their generic relation to BTRFS, their everyday use cases, and the
philosophical meaning behind uniqueness of copies and UUIDs; I'd like
to specifically ask you to only post here about the ACTUAL problem at
hand. Don't get me wrong, I find the discussion in the other thread
really interesting, I'm following it, but it is only very remotely
related to the original issue, so please keep it there! If you're
interested to catch up about the actual bug symptoms, please read the
bug report linked above, and (optionally) reproduce the problem
yourself!

A virtual machine image on which I've already reproduced the
conditions can be downloaded here:
http://undead.megabrutal.com/kvm-reproduce-1391429.img.xz
(Download size: 113 MB; Unpacked image size: 2 GB.)

Re-tested with mainline kernel 3.18.0-rc7 just today.


Regards,
MegaBrutal

Re: Possible to undo subvol delete?

2014-12-01 Thread Austin S Hemmelgarn


On 2014-11-29 23:23, Marc MERLIN wrote:
> On Sun, Nov 30, 2014 at 09:03:14AM +0530, Shriramana Sharma wrote:
>> IIUC with BtrFS while it is possible to easily undelete a file or
>> ordinary directory if a snapshot of the containing subvol exists, it
>> seems that it's not elementary to undelete a subvol itself, because
>> all subvols are under the root-level subvol (id 0 or 5, see my other
>> q) but even snapshotting the root subvol will not snapshot any subvols
>> under it.
>>
>> So is there any way to undo a subvol delete?
>
> If you didn't snapshot that volume before deleting it, you're SOL.
> If you snapshotted it, rename that snapshot to the other name, and
> you're done.
>
> Btrfs doesn't offer undelete, it only lets you keep multiple copies of
> your data at very little cost, so you can retrieve a snapshot copy if
> you deleted your current volume's data.
>
> Marc

Well, in theory, if you unmount the FS _immediately_ after the subvol 
delete, without writing _anything_ else to it, it _might_ be possible to 
recover the data using some (probably almost incomprehensible) 
incantation of btrfs-find-root and btrfs recover/restore.


In practice though, for anyone who doesn't have expert level knowledge 
of the on-disk structure and fs internals, deleting a subvolume can't be 
undone.


We might want to consider adding an option to btrfs subvol del to ask 
for confirmation (or make it do so by default and add an option to 
disable asking for confirmation).





Re: Possible to undo subvol delete?

2014-12-01 Thread Shriramana Sharma

On Mon, Dec 1, 2014 at 6:42 PM, Austin S Hemmelgarn
 wrote:
>
> We might want to consider adding an option to btrfs subvol del to ask for
> confirmation (or make it do so by default and add an option to disable
> asking for confirmation).

I already reported: https://bugzilla.kernel.org/show_bug.cgi?id=89091.
As I requested there, I prefer for confirmation by default and -f to
force otherwise, rather than behaviour of rm which requires -i to ask
confirmation.

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

Re: Possible to undo subvol delete?

2014-12-01 Thread MegaBrutal

2014-12-01 14:12 GMT+01:00 Austin S Hemmelgarn :
>
> We might want to consider adding an option to btrfs subvol del to ask for
> confirmation (or make it do so by default and add an option to disable
> asking for confirmation).
>

I've also noticed, a subvolume can just be deleted with an "rm -r",
just like an ordinary directory. I'd consider to only allow subvolume
deletions with exact "btrfs subvolume delete" commands, and they
should be protected against an ordinary "rm". There also could be a
tunable FS feature to allow or disable ordinary subvolume deletions,
which could be set or unset by btrfstune. I think a subvolume really
deserves to be treated specially over an ordinary directory.

As for undeletion, while I have no idea how to do that, I noticed they
don't get deleted immediately. With older btrfs tools (3.12), I
noticed, if I delete a subvolume and then immediately issue a "btrfs
subvolume list", my deleted subvolume is still listed with a "DELETED"
caption, and allocated space doesn't immediately get freed if I check
with "df -m". It takes a few seconds for the DELETED entry to
disappear and the allocated space to be freed.

Re: Possible to undo subvol delete?

2014-12-01 Thread Roman Mamedov

On Mon, 1 Dec 2014 18:49:23 +0530
Shriramana Sharma  wrote:

> As I requested there, I prefer for confirmation by default and -f to
> force otherwise, rather than behaviour of rm which requires -i to ask
> confirmation.

And I prefer the current behavior (also replied on the bug).

A more sensible idea could be adding a global-level '-i' switch, same as in
'rm', so that you or distros could then alias 'btrfs' to 'btrfs -i' (ask
confirmation on any irreversible action).

-- 
With respect,
Roman

Re: Possible to undo subvol delete?

2014-12-01 Thread Roman Mamedov

On Mon, 1 Dec 2014 14:38:16 +0100
MegaBrutal  wrote:

> I've also noticed, a subvolume can just be deleted with an "rm -r",
> just like an ordinary directory. I'd consider to only allow subvolume
> deletions with exact "btrfs subvolume delete" commands, and they

This is already the case. 'rm -r' will remove all files in a subvolume, but
the empty subvolume itself is only deletable via the 'btrfs' command.

If you want to make snapshots which can't be removed by ordinary tools, use
the 'read-only' mode when creating them.

-- 
With respect,
Roman

Re: Possible to undo subvol delete?

2014-12-01 Thread Austin S Hemmelgarn


On 2014-12-01 08:38, MegaBrutal wrote:
> 2014-12-01 14:12 GMT+01:00 Austin S Hemmelgarn :
>> We might want to consider adding an option to btrfs subvol del to ask for
>> confirmation (or make it do so by default and add an option to disable
>> asking for confirmation).
>
> I've also noticed, a subvolume can just be deleted with an "rm -r",
> just like an ordinary directory. I'd consider to only allow subvolume
> deletions with exact "btrfs subvolume delete" commands, and they
> should be protected against an ordinary "rm". There also could be a
> tunable FS feature to allow or disable ordinary subvolume deletions,
> which could be set or unset by btrfstune. I think a subvolume really
> deserves to be treated specially over an ordinary directory.

I don't know what distro/kernel version you might be using, but every 
version of btrfs I have used required the use of 'btrfs subvol del' to 
actually delete a subvolume, even an empty one.  It would not surprise 
me though if RHEL or SuSE had patched the kernel to allow using rm on a 
subvolume.





Re: Possible to undo subvol delete?

2014-12-01 Thread Holger Hoffstätte

On Mon, 01 Dec 2014 14:38:16 +0100, MegaBrutal wrote:

> I've also noticed, a subvolume can just be deleted with an "rm -r",
> just like an ordinary directory. I'd consider to only allow subvolume

Nope:

root>btrfs subvolume create foo
Create subvolume './foo'
root>touch foo/bla
root>ll foo 
total 0
-rw-r--r-- 1 root root 0 Dec  1 14:47 bla
root>rm -rf foo 
rm: cannot remove ‘foo’: Operation not permitted
root>

The files of the subvolume do get deleted though, but that's correct.

Not sure what you did, but what you want is already the case.

-h


Re: Possible to undo subvol delete?

2014-12-01 Thread MegaBrutal

2014-12-01 14:47 GMT+01:00 Roman Mamedov :
> On Mon, 1 Dec 2014 14:38:16 +0100
> MegaBrutal  wrote:
>
>> I've also noticed, a subvolume can just be deleted with an "rm -r",
>> just like an ordinary directory. I'd consider to only allow subvolume
>> deletions with exact "btrfs subvolume delete" commands, and they
>
> This is already the case. 'rm -r' will remove all files in a subvolume, but
> the empty subvolume itself is only deletable via the 'btrfs' command.

That's great! And there is no way to protect against recursive
deletions (besides setting the subvolume read-only, as you suggested
below), as files are processed individually by "rm". But it's OK,
people should always be very careful with "rm", and it doesn't change
with btrfs. ;)


> If you want to make snapshots which can't be removed by ordinary tools, use
> the 'read-only' mode when creating them.

Yeah, good idea! Anyway, is it possible to change a read-only snapshot
to read-write and vice versa, or can you only specify read-only while
creating them?

Re: [Lsf-pc] [LSF/MM TOPIC]: Btrfs: Decoupling block-size and page-size in BTRFS.

2014-12-01 Thread Chris Mason




On Mon, Dec 1, 2014 at 5:13 AM, Christoph Hellwig wrote:
> Is this topic relevant for the broader FS community?  Maybe the btrfs
> community should look into organizing a meeting co-hosted with Vault
> similar to what we did for ext4 and XFS in the past?


Yeah, this is a very btrfs specific topic.  I'm happy to organize an 
official Btrfs sub-topic as part of vault or LSF.


-chris




Online Drive Replacement: BTRFS with RAID 6

2014-12-01 Thread Oliver


Hi All,

on a testing machine I installed four HDDs and they are configured as 
RAID6. For a test I removed one of the drives (/dev/sdk) while the 
volume was mounted and data was written to it. This worked well, as far 
as I can see. Some I/O errors were written to /var/log/syslog, but the 
volume kept working. Unfortunately the command "btrfs fi sh" did not 
show any missing drives. So I remounted the volume in degraded mode: 
"mount -t btrfs /dev/sdx1 -o remount,rw,degraded,noatime /mnt". This way 
the drive in question was reported as missing. Then I plugged in the HDD 
again (it is of course /dev/sdk again) and started a balancing in hope 
that this will restore RAID6: "btrfs filesystem balance start /mnt". Now 
the volume looks like this:


$ btrfs fi sh
Label: none  uuid: 28410e37-77c1-4c01-8075-0d5068d9ffc2
	Total devices 4 FS bytes used 257.05GiB
	devid    1 size 465.76GiB used 262.03GiB path /dev/sdi1
	devid    2 size 465.76GiB used 262.00GiB path /dev/sdj1
	devid    3 size 465.76GiB used 261.03GiB path /dev/sdh1
	devid    4 size 465.76GiB used 0.00 path /dev/sdk1

How do I reinitiate /dev/sdk1? As running "btrfs fi ba start /mnt" does 
not help, I tried to remove the hdd, but


$ btrfs de de /dev/sdk1 /mnt/
ERROR: error removing the device '/dev/sdk1' - unable to go below four devices on raid6


A replacement does not work this way either:

$ btrfs replace start -f -r /dev/sdk1 /dev/sdk1 /mnt
/dev/sdk1 is mounted

Are there other ways to replace/reinitiate the HDD than converting to
RAID 5?



Here is some more information about my configuration:

$   uname -a
Linux hostname 3.13.0-30-generic #55-Ubuntu SMP Fri Jul 4 21:40:53 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

$   btrfs --version
Btrfs v3.12
$ btrfs fi df /mnt
Data, RAID6: total=263.00GiB, used=256.82GiB
System, RAID1: total=32.00MiB, used=36.00KiB
Metadata, RAID1: total=1.00GiB, used=271.13MiB

Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-12-01 Thread Theodore Ts'o

On Mon, Dec 01, 2014 at 01:28:10AM -0800, Christoph Hellwig wrote:
> 
> The ->is_readonly method seems like a clear winner to me, I'm all for
> adding it, and thus suggested moving it first in the series.

It's a real winner for me as well, but the reason why I dropped it is
because if btrfs has to keep its ->update_time function, we wouldn't
actually have a user for is_readonly().  I suppose we could have
update_time() call ->is_readonly() and then ->update_time() if they
exist, but it only seemed to add an extra call and a bit of extra
overhead without really simplifying things for btrfs.

If there were other users of ->is_readonly, then it would make sense,
but it seemed better to move into a separate code refactoring series.

> I've read a bit more through the series and would like to suggest
> the following approach for the rest:
> 
>  - convert ext3/4 to use ->update_time instead of the ->dirty_time
>callout so it gets and exact notifications (preferably the few
>remaining filesystems as well, although that shouldn't really be a
>blocker)

We could do that, although ext3/4's ->update_time() would be exactly
the same as the generic update_time() function, so there would be code
duplication.  If the goal is to get rid of the magic in
->dirty_inode() being used to work around how the VFS makes changes
to fields that end up in the on-disk inode, we would need to audit a
lot of extra code paths; at the very least, in how the generic quota
code handles updates to i_size and i_blocks (for example).

And BTW, we don't actually have a dirty_time() function any more in
the current patch series.  update_time() is currently looking like
this:

static int update_time(struct inode *inode, struct timespec *time, int flags)
{
	if (inode->i_op->update_time)
		return inode->i_op->update_time(inode, time, flags);

	if (flags & S_ATIME)
		inode->i_atime = *time;
	if (flags & S_VERSION)
		inode_inc_iversion(inode);
	if (flags & S_CTIME)
		inode->i_ctime = *time;
	if (flags & S_MTIME)
		inode->i_mtime = *time;

	if ((inode->i_sb->s_flags & MS_LAZYTIME) && !(flags & S_VERSION) &&
	    !(inode->i_state & I_DIRTY))
		__mark_inode_dirty(inode, I_DIRTY_TIME);
	else
		__mark_inode_dirty(inode, I_DIRTY_SYNC);
	return 0;
}

>  - Convert xfs, btrfs and the remaining filesystems using ->dirty_inode
>    incrementally.

Right, so xfs and btrfs (which are the two file systems that have
update_time at the moment) can just drop update_time() and then check
the ->dirty_time() for (flags & I_DIRTY_TIME).  Hmm, I suspect this
might be better for xfs, yes?

	if ((inode->i_sb->s_flags & MS_LAZYTIME) && !(flags & S_VERSION) &&
	    !(inode->i_state & I_DIRTY))
		__mark_inode_dirty(inode, I_DIRTY_TIME);
	else
		__mark_inode_dirty(inode, I_DIRTY_SYNC | I_DIRTY_TIME);

XFS doesn't have a ->dirty_time yet, but that way XFS would be able to
use the I_DIRTY_TIME flag to log the journal timestamps if it so
desires, and perhaps drop the need for it to use update_time().  (And
with XFS doing logical journalling, it may be that you might want to
include the timestamp update in the journal if you have a journal
transaction open already, so the disk is spun up or likely to be spun
up anyway, right?)
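
The difference between the two snippets above is easier to see outside the
kernel. What follows is only a userspace sketch: the flag values are
stand-ins chosen for this example (the kernel's real values differ), and
time_dirty_flags() and its tag_sync_path parameter are hypothetical names
that do not appear in the patch series. It returns the flags that would be
handed to __mark_inode_dirty() in each variant.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag values for this sketch only; the kernel's real values differ. */
enum { S_ATIME = 1, S_MTIME = 2, S_CTIME = 4, S_VERSION = 8 };
enum { I_DIRTY_SYNC = 1, I_DIRTY_TIME = 2 };

/*
 * Return the flags that would be passed to __mark_inode_dirty() for a
 * timestamp update.
 *   lazytime:      the superblock has MS_LAZYTIME set
 *   update_flags:  S_* bits describing what is being updated
 *   already_dirty: inode->i_state already has an I_DIRTY bit set
 *   tag_sync_path: false models the first snippet, true the second one
 */
static int time_dirty_flags(bool lazytime, int update_flags,
			    bool already_dirty, bool tag_sync_path)
{
	assert((update_flags & ~(S_ATIME | S_MTIME | S_CTIME | S_VERSION)) == 0);

	/* Lazy path: a pure timestamp change on an otherwise clean inode. */
	if (lazytime && !(update_flags & S_VERSION) && !already_dirty)
		return I_DIRTY_TIME;

	/* Everything else is written back as usual. */
	return tag_sync_path ? (I_DIRTY_SYNC | I_DIRTY_TIME) : I_DIRTY_SYNC;
}
```

With tag_sync_path set, a filesystem callback sees I_DIRTY_TIME on every
timestamp update, not only on the lazy ones, which is the property the
second snippet is after.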

- Ted

Re: BTRFS messes up snapshot LV with origin

2014-12-01 Thread Zygo Blaxell

On Fri, Nov 28, 2014 at 11:55:07PM -0800, Robert White wrote:
> On 11/28/2014 08:59 PM, Zygo Blaxell wrote:
> >On Fri, Nov 28, 2014 at 06:05:48PM +0100, Goffredo Baroncelli wrote:
> >>On 11/27/2014 05:15 AM, Zygo Blaxell wrote:
> >>>This is a weakness of the current udev and asynchronous device hotplug
> >>>concept:  there is no notion of bus enumeration in progress, so we can be
> >>>trying to assemble multi-device storage before we have all the devices
> >>>visible.  Assembly of aggregate storage (whatever it is--btrfs, md,
> >>>lvm2...) has to wait until all known storage buses are fully enumerated
> >>>in order to detect if there are duplicates.
> >>
> >>It is more complex than that. Some devices may appear after the "1st" bus
> >>enumeration.
> >
> >That case is well handled already--a new enumeration will start with the
> >second (and all later) hotplug events.
> >
> >The problem arises when we try to assemble disk arrays before the
> >known end of the "1st" (or any) enumeration.  There is no way for an
> >enumerating agent to tell other agents "this is definitely not the
> >complete list of devices yet, other devices may be inserted imminently"
> >and defer all the multi-device assembly until the address space of the
> >enumerating bus is fully covered.
> >
> MDADM has an "attached" but not "started" state for arrays that
> handles this condition during incremental assembly. (see "mdadm
> --incremental /dev/whatever"),

> [...very complicated mdadm-architecture-invades-the-filesystem-layer
> thing snipped...]

I don't see why it can't all be done in user-space more or less the same
way LVM does.  Scan all the partitions known to be available, build a
table of devices with UUIDs matching the target filesystem, check for
sufficiency, check for uniqueness, and if the configuration passes all the
sanity checks (or we have hints from the user that resolve ambiguity),
submit the entire list of devices to the kernel as a BTRFS filesystem.
If there are UUID duplicates or missing devices, submit nothing to the
kernel at all.

initramfs-less multi-disk configurations can calculate all that in
advance and generate a rootflags parameter for the kernel command line.
It's not necessary to resolve every possible situation in the kernel.
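
A rough userspace sketch of the checks described above, under stated
assumptions: struct scanned_dev, assemble() and the result codes are
hypothetical names for this example, not btrfs-progs or kernel interfaces,
and the scan that fills the table is left out. The point is only the
ordering: filter by filesystem UUID, reject duplicates, reject an
incomplete set, and hand over a device list only if everything passes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one scanned partition; not a btrfs-progs type. */
struct scanned_dev {
	const char *path;
	const char *fsid;     /* filesystem UUID read from the superblock */
	unsigned devid;       /* device number within that filesystem */
	unsigned num_devices; /* device count the superblock expects */
};

enum assemble_result { ASSEMBLE_OK, ASSEMBLE_MISSING, ASSEMBLE_DUPLICATE };

/*
 * Collect the devices whose fsid matches into out[], which must have room
 * for n entries. Nothing is reported as usable (*out_len stays 0) if a
 * devid shows up twice or if fewer devices are present than expected.
 */
static enum assemble_result
assemble(const struct scanned_dev *devs, size_t n, const char *fsid,
	 const struct scanned_dev **out, size_t out_cap, size_t *out_len)
{
	size_t found = 0;
	unsigned expected = 0;

	assert(devs && fsid && out && out_len && out_cap >= n);
	*out_len = 0;

	for (size_t i = 0; i < n; i++) {
		if (strcmp(devs[i].fsid, fsid) != 0)
			continue;
		/* Uniqueness: the same devid twice means a clone or snapshot. */
		for (size_t j = 0; j < found; j++)
			if (out[j]->devid == devs[i].devid)
				return ASSEMBLE_DUPLICATE;
		out[found++] = &devs[i];
		expected = devs[i].num_devices;
	}

	/* Sufficiency: every device the superblock expects must be present. */
	if (found == 0 || found < expected)
		return ASSEMBLE_MISSING;

	*out_len = found;
	return ASSEMBLE_OK;
}
```

An LVM snapshot of a btrfs device would show up here as a second entry
with the same fsid and devid, so the whole set is refused instead of one
of the two copies being picked arbitrarily.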


Re: BRFS balance crash

2014-12-01 Thread Swâmi Petaramesh

Hi,

I got another kernel crash during a balance, this time with a nice "kernel 
bug"...:

déc. 01 16:19:09 vajra kernel: [ cut here ]
déc. 01 16:19:09 vajra kernel: WARNING: CPU: 0 PID: 5396 at fs/btrfs/extent-tree.c:876 btrfs_lookup_extent_info+0x4c6/0x4e0 [btrfs]()
déc. 01 16:19:09 vajra kernel: Modules linked in: vfat fat uas usb_storage ccm rfcomm ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc
déc. 01 16:19:09 vajra kernel:  rfkill snd_seq wmi snd_seq_device snd_pcm parport_pc parport snd_timer mei_me hp_accel lis3lv02d i2c_i801 mei input_polldev snd tpm_tis
déc. 01 16:19:09 vajra kernel: CPU: 0 PID: 5396 Comm: btrfs Tainted: G           OE  3.17.4-300.fc21.x86_64 #1
déc. 01 16:19:09 vajra kernel: Hardware name: Hewlett-Packard HP EliteBook 820 G1/1991, BIOS L71 Ver. 01.12 06/25/2014
déc. 01 16:19:09 vajra kernel:   d0ba426b 880193fdb7e0 8173f929
déc. 01 16:19:09 vajra kernel:   880193fdb818 810970ad 88017913b2d0
déc. 01 16:19:09 vajra kernel:  8802298f9800 00ccd2424000 0001 88011c675320
déc. 01 16:19:09 vajra kernel: Call Trace:
déc. 01 16:19:09 vajra kernel:  [] dump_stack+0x45/0x56
déc. 01 16:19:09 vajra kernel:  [] warn_slowpath_common+0x7d/0xa0
déc. 01 16:19:09 vajra kernel:  [] warn_slowpath_null+0x1a/0x20
déc. 01 16:19:09 vajra kernel:  [] btrfs_lookup_extent_info+0x4c6/0x4e0 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] walk_down_proc+0x1cd/0x340 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] walk_down_tree+0x73/0x110 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] btrfs_drop_snapshot+0x414/0x880 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] merge_reloc_roots+0x109/0x260 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] relocate_block_group+0x40e/0x6d0 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] btrfs_relocate_block_group+0x1e6/0x2f0 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] btrfs_relocate_chunk.isra.27+0x6a/0x750 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] ? btrfs_set_path_blocking+0x41/0x80 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] ? btrfs_search_slot+0x4ad/0xa70 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] ? btrfs_get_token_64+0x119/0x140 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] ? free_extent_buffer+0x4f/0xa0 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] btrfs_balance+0x980/0xf40 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] btrfs_ioctl_balance+0x168/0x3c0 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] btrfs_ioctl+0x558/0x27d0 [btrfs]
déc. 01 16:19:09 vajra kernel:  [] ? handle_mm_fault+0xa88/0x1010
déc. 01 16:19:09 vajra kernel:  [] ? path_openat+0xcb/0x6d0
déc. 01 16:19:09 vajra kernel:  [] ? final_putname+0x22/0x50
déc. 01 16:19:09 vajra kernel:  [] ? putname+0x29/0x40
déc. 01 16:19:09 vajra kernel:  [] ? __do_page_fault+0x29c/0x580
déc. 01 16:19:09 vajra kernel:  [] ? __vma_link_rb+0xb8/0xe0
déc. 01 16:19:09 vajra kernel:  [] do_vfs_ioctl+0x2d0/0x4b0
déc. 01 16:19:09 vajra kernel:  [] SyS_ioctl+0x81/0xa0
déc. 01 16:19:09 vajra kernel:  [] system_call_fastpath+0x16/0x1b
déc. 01 16:19:09 vajra kernel: ---[ end trace b3c094f5b3ca386b ]---
déc. 01 16:19:09 vajra kernel: [ cut here ]
déc. 01 16:19:09 vajra kernel: kernel BUG at fs/btrfs/extent-tree.c:7733!
déc. 01 16:19:09 vajra kernel: invalid opcode:  [#1] SMP 
déc. 01 16:19:09 vajra kernel: Modules linked in: vfat fat uas usb_storage ccm rfcomm ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc
déc. 01 16:19:09 vajra kernel:  rfkill snd_seq wmi snd_seq_device snd_pcm parport_pc parport snd_timer mei_me hp_accel lis3lv02d i2c_i801 mei input_polldev snd tpm_tis
déc. 01 16:19:09 vajra kernel: CPU: 0 PID: 5396 Comm: btrfs Tainted: G        W  OE  3.17.4-300.fc21.x86_64 #1
déc. 01 16:19:09 vajra kernel: Hardware name: Hewlett-Packard HP EliteBook 820 G1/1991, BIOS L71 Ver. 01.12 06/25/2014
déc. 01 16:19:09 vajra kernel: task: 880088586220 ti: 880193fd8000 task.ti: 880193fd8000
déc. 01 16:19:09 vajra kernel: RIP: 0010:[]  [] walk_down_proc+0x31a/0x340 [btrfs]
déc. 01 16:19:09 vajra kernel: RSP: 0018:880193fdb8e8  EFLAGS: 00010246
déc. 01 16:19:09 vajra kernel: RAX:  RBX: 880182abbb40 RCX:
déc. 01 16:19:09 vajra kernel: RDX: 008cf241 RSI: 88017913b2d0 RDI: 880232924400
déc. 01 16:19:09 vajra kernel: RBP: 880193fdb928 R08: 60fdc12005c0 R09: a0238926
déc. 01 16:19:09 vajra kernel: R10: 8802298f9800 R11: 88011c675320 R12: 0003
déc. 01 16:19:09 vajra kernel: R13: 88017913b870 R14: 8800605455f8 R15: 0003
déc. 01 16:19:09 vajra kernel: FS:  7f96e0bd78c0() GS:88023ea0() knlGS:
déc. 01 16:19:09 vajra kernel: CS:  0010 DS:  ES:  CR0: 80050033
déc. 01 16:19:09 vajra kernel: CR2: 7f7172b51000 CR3: 0001348b1000 CR4: 001407f0
déc. 01 16:19:09 v

[PATCH RFC v2] btrfs: add sysfs layout to show volume info

2014-12-01 Thread Anand Jain

From: Anand Jain 

Not yet ready for integration, but for review and testing of the new sysfs
layout which is currently under /sys/fs/btrfs/by_fsid

This patch makes btrfs_fs_devices and btrfs_device information readable
from sysfs. This uses the sysfs group visible entry point to mark
certain attributes visible/hidden depending on the FS state
(mounted/unmounted).

The new layout is as shown below.

/sys/fs/btrfs/by_fsid*
  ./7b047f4d-c2ce-4f22-94a3-68c09057f1bf*
    status
    fsid*
    missing_devices
    num_devices*
    open_devices
    opened*
    rotating
    rw_devices
    seeding
    total_devices*
    total_rw_bytes
    ./e6701882-220a-4416-98ac-a99f095bddcc*
      active_pending
      bdev
      bytes_used
      can_discard
      devid*
      dev_root_fsid
      devstats_valid
      dev_totalbytes
      generation*
      in_fs_metadata
      io_align
      io_width
      missing
      name*
      nobarriers
      replace_tgtdev
      sector_size
      total_bytes
      type
      uuid*
      writeable

(* indicates that attribute will be visible even when device is
unmounted but registered with btrfs kernel)

The old kobject  will be merged into this new 'by_fsid' kobject,
so that older attributes under  and newer attributes under by_fsid
will be merged together as well.

v2: added support for device add/delete/replace
rebase on the latest integration branch

Signed-off-by: Anand Jain 
---
 fs/btrfs/dev-replace.c |   7 +
 fs/btrfs/super.c   |  15 ++
 fs/btrfs/sysfs.c   | 383 +
 fs/btrfs/sysfs.h   |   6 +
 fs/btrfs/volumes.c |  42 ++
 fs/btrfs/volumes.h |   6 +
 6 files changed, 459 insertions(+)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 715a115..31ce3a9 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -474,6 +474,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info 
*fs_info,
u8 uuid_tmp[BTRFS_UUID_SIZE];
struct btrfs_trans_handle *trans;
int ret = 0;
+   char uuid_buf[BTRFS_UUID_UNPARSED_SIZE];
 
/* don't allow cancel or unmount to disturb the finishing procedure */
mutex_lock(&dev_replace->lock_finishing_cancel_unmount);
@@ -595,7 +596,13 @@ static int btrfs_dev_replace_finishing(struct 
btrfs_fs_info *fs_info,
/* replace the sysfs entry */
btrfs_kobj_rm_device(fs_info, src_device);
btrfs_kobj_add_device(fs_info, tgt_device);
+   btrfs_destroy_dev_sysfs(src_device);
btrfs_rm_dev_replace_free_srcdev(fs_info, src_device);
+   snprintf(uuid_buf, BTRFS_UUID_UNPARSED_SIZE, "%pU",
+   tgt_device->uuid);
+   if (kobject_rename(&tgt_device->dev_kobj, uuid_buf))
+   printk(KERN_ERR "BTRFS: sysfs uuid %s rename error\n",
+   uuid_buf);
 
/* write back the superblocks */
trans = btrfs_start_transaction(root, 0);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 017d92d..918eb9d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1389,6 +1389,11 @@ static struct dentry *btrfs_mount(struct 
file_system_type *fs_type, int flags,
goto error_sec_opts;
}
 
+   error = btrfs_update_by_fsid_sysfs_group(fs_devices);
+   if (error)
+   btrfs_warn(fs_info, "sysfs update error during mount: %d",
+   error);
+
return root;
 
 error_close_devices:
@@ -1885,8 +1890,18 @@ static int btrfs_statfs(struct dentry *dentry, struct 
kstatfs *buf)
 static void btrfs_kill_super(struct super_block *sb)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+   struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;
+   int error;
+
+   set_bit(BTRFS_FS_DEVS_UNMOUNTING, &fs_devs->flags);
+   error = btrfs_update_by_fsid_sysfs_group(fs_devs);
+   if (error)
+   btrfs_warn(fs_info, "sysfs update error during unmount: %d",
+   error);
+
kill_anon_super(sb);
free_fs_info(fs_info);
+   clear_bit(BTRFS_FS_DEVS_UNMOUNTING, &fs_devs->flags);
 }
 
 static struct file_system_type btrfs_fs_type = {
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 92db3f6..b658812 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ctree.h"
 #include "disk-io.h"
@@ -32,6 +3

Re: Possible to undo subvol delete?

2014-12-01 Thread Shriramana Sharma

On Mon, Dec 1, 2014 at 7:16 PM, Roman Mamedov  wrote:
>
> A more sensible idea could be adding a global-level '-i' switch, same as in
> 'rm', so that you or distros could then alias 'btrfs' to 'btrfs -i' (ask
> confirmation on any irreversible action).

Well the difference being that there doesn't seem to be any other
irreversible action from my scan of man btrfs -- am I missing
anything? This is the only thing that actually leads to loss of data.

When btrfs has so many features (esp snapshots) to prevent user
accidentally deleting data (I liked especially
http://www.youtube.com/v/9H7e6BcI5Fo?start=209) I think there has to
be *some* modicum of support for warning against deleting a subvolume
(and it seems others agree too).

But I see what you mean in the bugzilla comment about not wanting your
existing backup snapshot scripts to fail because they don't have a -f.
At the same time, aliasing via -i on top level btrfs binary may not be
so practical here because this is the only command which will actually
use it (again, correct if wrong).

Perhaps exporting some envvar in the default shell's rc file (or
whichever file will be read only if the shell is interactive) would
work? Like in ~/.bashrc:

export BTRFS_SUBVOLUME_DELETE_CONFIRM=1

Ideas?
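
A sketch of how such a check might look on the tool side. To be clear,
this is only an illustration of the proposal: delete_needs_confirm() is a
hypothetical helper, and BTRFS_SUBVOLUME_DELETE_CONFIRM is the variable
name suggested above, not something this sketch assumes btrfs-progs
supports. Treating only the value "1" as "on" and skipping the prompt when
stdin is not a terminal are both assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Decide whether a destructive command should prompt before acting.
 *   env_value:    value of the opt-in environment variable, or NULL if unset
 *   stdin_is_tty: whether there is a terminal to answer the prompt
 */
static bool delete_needs_confirm(const char *env_value, bool stdin_is_tty)
{
	/* Never block scripts or cron jobs on a prompt nobody can answer. */
	if (!stdin_is_tty)
		return false;

	/* Opt-in only: unset or any value other than "1" keeps old behaviour. */
	return env_value != NULL && strcmp(env_value, "1") == 0;
}
```

A caller would pass something like getenv("BTRFS_SUBVOLUME_DELETE_CONFIRM")
and isatty(STDIN_FILENO). Because the default is unchanged, existing backup
scripts that delete snapshots would keep working without a -f.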

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

Re: Possible to undo subvol delete?

2014-12-01 Thread Shriramana Sharma

On Mon, Dec 1, 2014 at 7:24 PM, MegaBrutal  wrote:
>
>> If you want to make snapshots which can't be removed by ordinary tools, use
>> the 'read-only' mode when creating them.
>
> Yeah, good idea! Anyway, is it possible to change a read-only snapshot
> to read-write and vice versa, or can you only specify read-only while
> creating them?

IIUC you can only specify RO while creating but you can always cheaply
create a RW snapshot of an RO one or an RO snapshot of an RW one...

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

[PATCH v2 5/6] Btrfs: fix race between writing free space cache and trimming

2014-12-01 Thread Filipe Manana

Trimming is completely transactionless, and the way it operates consists
of hiding free space entries from a block group, performing the trim/discard
and then making the free space entries visible again.
Therefore while a free space entry is being trimmed, we can have free space
cache writing running in parallel (as part of a transaction commit) which
will miss the free space entry. This means that if we unmount (or crash/reboot)
after that transaction commit and mount again before another transaction
starts/commits after the discard finishes, we will have some free space
that won't be used again unless the free space cache is rebuilt. After the
unmount, fsck (btrfsck, btrfs check) reports the issue like the following
example:

*** fsck.btrfs output ***
checking extents
checking free space cache
There is no free space entry for 521764864-521781248
There is no free space entry for 521764864-1103101952
cache appears valid but isnt 29360128
Checking filesystem on /dev/sdc
UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
found 1235681286 bytes used err is -22
(...)

Another issue caused by this race is a crash while writing bitmap entries
to the cache, because while the cache writeout task accesses the bitmaps,
the trim task can be concurrently modifying the bitmap or worse might
be freeing the bitmap. The latter case results in the following crash:

[55650.804460] general protection fault:  [#1] SMP DEBUG_PAGEALLOC
[55650.804835] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor 
raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop 
parport_pc parport i2c_piix4 psmouse evdev pcspkr microcode processor i2ccore 
serio_raw thermal_sys button ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif 
sr_mod cdrom crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy 
ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 [last unloaded: 
btrfs]
[55650.806169] CPU: 1 PID: 31002 Comm: btrfs-transacti Tainted: GW  
3.17.0-rc5-btrfs-next-1+ #1
[55650.806493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[55650.806867] task: 8800b12f6410 ti: 880071538000 task.ti: 
880071538000
[55650.807166] RIP: 0010:[]  [] 
write_bitmap_entries+0x65/0xbb [btrfs]
[55650.807514] RSP: 0018:88007153bc30  EFLAGS: 00010246
[55650.807687] RAX: 5d1ec000 RBX: 8800a665df08 RCX: 0400
[55650.807885] RDX: 88005d1ec000 RSI: 6b6b6b6b6b6b6b6b RDI: 88005d1ec000
[55650.808017] RBP: 88007153bc58 R08: ddd51536 R09: 01e0
[55650.808017] R10:  R11: 0037 R12: 6b6b6b6b6b6b6b6b
[55650.808017] R13: 88007153bca8 R14: 6b6b6b6b6b6b6b6b R15: 88007153bc98
[55650.808017] FS:  () GS:88023ec8() 
knlGS:
[55650.808017] CS:  0010 DS:  ES:  CR0: 8005003b
[55650.808017] CR2: 02273b88 CR3: b18f6000 CR4: 06e0
[55650.808017] Stack:
[55650.808017]  88020e834e00 880172d68db0  
88019257c800
[55650.808017]  8801d42ea720 88007153bd10 a037d2fa 
880224e99180
[55650.808017]  8801469a6188 880224e99140 880172d68c50 
000300b7
[55650.808017] Call Trace:
[55650.808017]  [] __btrfs_write_out_cache+0x1ea/0x37f [btrfs]
[55650.808017]  [] btrfs_write_out_cache+0xa1/0xd8 [btrfs]
[55650.808017]  [] btrfs_write_dirty_block_groups+0x4b5/0x505 
[btrfs]
[55650.808017]  [] commit_cowonly_roots+0x15e/0x1f7 [btrfs]
[55650.808017]  [] ? _raw_spin_lock+0xe/0x10
[55650.808017]  [] btrfs_commit_transaction+0x411/0x882 
[btrfs]
[55650.808017]  [] transaction_kthread+0xf2/0x1a4 [btrfs]
[55650.808017]  [] ? btrfs_cleanup_transaction+0x3d8/0x3d8 
[btrfs]
[55650.808017]  [] kthread+0xb7/0xbf
[55650.808017]  [] ? __kthread_parkme+0x67/0x67
[55650.808017]  [] ret_from_fork+0x7c/0xb0
[55650.808017]  [] ? __kthread_parkme+0x67/0x67
[55650.808017] Code: 4c 89 ef 8d 70 ff e8 d4 fc ff ff 41 8b 45 34 41 39 45 30 
7d 5c 31 f6 4c 89 ef e8 80 f6 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00  a5 
4c 89 ef 41 8b 45 30 8d 70 ff e8 a3 fc ff ff 41 8b 45 34
[55650.808017] RIP  [] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.808017]  RSP 
[55650.815725] ---[ end trace 1c032e96b149ff86 ]---

Fix this by serializing both tasks in such a way that cache writeout
doesn't wait for the trim/discard of free space entries to finish and
doesn't miss any free space entry.
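
The patch body is cut off in this archive, so the following is only a toy,
single-threaded model of the first problem described above, not the actual
fix. All names (toy_ctl, trim_begin, trim_end, the two writeout functions)
are hypothetical, and locking is left out entirely. It shows why a writer
that looks only at currently visible entries loses space while a trim is
in flight, and one way a writer could avoid that without waiting: by also
accounting for the ranges being trimmed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy free-space bookkeeping: byte counts only, no locking. */
struct toy_ctl {
	size_t visible;  /* bytes in entries the cache writer can see */
	size_t trimming; /* bytes in entries hidden while being discarded */
};

/* Trim step 1: hide an entry while the discard is in flight. */
static void trim_begin(struct toy_ctl *c, size_t bytes)
{
	assert(bytes <= c->visible);
	c->visible -= bytes;
	c->trimming += bytes;
}

/* Trim step 2: make the entry visible again after the discard. */
static void trim_end(struct toy_ctl *c, size_t bytes)
{
	assert(bytes <= c->trimming);
	c->trimming -= bytes;
	c->visible += bytes;
}

/* Writeout as described in the bug: records only what is visible now. */
static size_t writeout_visible_only(const struct toy_ctl *c)
{
	return c->visible;
}

/* Writeout that also records in-flight ranges, so nothing is missed. */
static size_t writeout_with_trimming(const struct toy_ctl *c)
{
	return c->visible + c->trimming;
}
```

If a commit runs between trim_begin() and trim_end(), the first writer
records less free space than the block group really has, which matches
the "There is no free space entry for ..." reports from fsck.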

Signed-off-by: Filipe Manana 
---

V2: Extended the critical section to include the cache writeout of bitmaps,
since I ran into this recently. The issue is that a concurrent trim can
modify or free the bitmaps while the cache writeout task is using them.
Updated commit message with crash trace example.

 fs/btrfs/free-space-cache.c | 73 +
 fs/btrfs/f

[PATCH] Btrfs: fix extent map leak on chunk allocation failure

2014-12-01 Thread Filipe Manana

On error, after adding the extent map to the tree and to the pending
chunks list, we would return having decremented the extent map's refcount
by 2 instead of 3 (our allocation + tree reference + list reference).
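
The arithmetic can be replayed with a toy counter. This is a userspace
sketch only: struct toy_em and the toy_* helpers are hypothetical stand-ins
for the extent map and its get/put calls, modelling nothing except the
reference count.

```c
#include <assert.h>

/* Toy stand-in for struct extent_map: only the reference count is modeled. */
struct toy_em {
	int refs;
};

static void toy_get(struct toy_em *em)
{
	em->refs++;
}

/* Drop one reference; returns 1 when the object would be freed. */
static int toy_put(struct toy_em *em)
{
	assert(em->refs > 0);
	return --em->refs == 0;
}

/*
 * Replay the error path: one reference from the allocation, one taken by
 * the tree, one taken by the pending chunks list, then `puts` drops.
 * Returns the number of references left behind (0 means no leak).
 */
static int refs_left_after_error(int puts)
{
	struct toy_em em = { .refs = 1 }; /* our allocation */

	toy_get(&em); /* tree reference */
	toy_get(&em); /* pending chunks list reference */
	while (puts-- > 0)
		toy_put(&em);
	return em.refs;
}
```

refs_left_after_error(2) leaves one reference, so the object is never
freed; with the third put it reaches zero.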

Detected by 'rmmod btrfs':

[20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
[20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: GWL 
3.17.0-rc5-btrfs-next-1+ #1
[20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20770.106130]   8800ba867eb8 813e7a13 
8800a2e11040
[20770.106132]  8800ba867ed0 81105d0c  
8800ba867ee0
[20770.106134]  a035d65e 8800ba867ef0 a03b0654 
8800ba867f78
[20770.106136] Call Trace:
[20770.106142]  [] dump_stack+0x45/0x56
[20770.106145]  [] kmem_cache_destroy+0x4b/0x90
[20770.106164]  [] extent_map_exit+0x1a/0x1c [btrfs]
[20770.106176]  [] exit_btrfs_fs+0x27/0x9d3 [btrfs]
[20770.106179]  [] SyS_delete_module+0x153/0x1c4
[20770.106182]  [] ? trace_hardirqs_on_thunk+0x3a/0x3c
[20770.106184]  [] system_call_fastpath+0x16/0x1b

Signed-off-by: Filipe Manana 
---
 fs/btrfs/volumes.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 66a5a1e..e936fe3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4496,6 +4496,8 @@ error_del_extent:
free_extent_map(em);
/* One for the tree reference */
free_extent_map(em);
+   /* One for the pending_chunks list reference */
+   free_extent_map(em);
 error:
kfree(devices_info);
return ret;
-- 
2.1.3


[PATCH] Btrfs: fix memory leak after block remove + trimming

2014-12-01 Thread Filipe Manana

There was a free space entry structure memory leak if a block
group is removed while a free space entry is being trimmed, which
the following diagram explains:

           CPU 1                                   CPU 2

  btrfs_trim_block_group()
  trim_no_bitmap()
  remove free space entry from
  block group cache's rbtree
  do_trimming()

                                        btrfs_remove_block_group()

                                        btrfs_remove_free_space_cache()

  add back free space entry to
  block group's cache rbtree
  btrfs_put_block_group()

                                        (...)
                                        btrfs_put_block_group()

                                        kfree(bg->free_space_ctl)
                                        kfree(bg)

The free space entry added after doing the discard of its respective
range ends up never being freed.
Detected after doing an "rmmod btrfs" after running the stress test
recently submitted for fstests:

[ 8234.642212] kmem_cache_destroy btrfs_free_space: Slab cache still has objects
[ 8234.642657] CPU: 1 PID: 32276 Comm: rmmod Tainted: GWL 
3.17.0-rc5-btrfs-next-2+ #1
[ 8234.642660] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 8234.642664]   8801af1b3eb8 8140c7b6 
8801dbedd0c0
[ 8234.642670]  8801af1b3ed0 811149ce  
8801af1b3ee0
[ 8234.642676]  a042dbe7 8801af1b3ef0 a0487422 
8801af1b3f78
[ 8234.642682] Call Trace:
[ 8234.642692]  [] dump_stack+0x4d/0x66
[ 8234.642699]  [] kmem_cache_destroy+0x4d/0x92
[ 8234.642731]  [] btrfs_destroy_cachep+0x63/0x76 [btrfs]
[ 8234.642757]  [] exit_btrfs_fs+0x9/0xbe7 [btrfs]
[ 8234.642762]  [] SyS_delete_module+0x155/0x1c6
[ 8234.642768]  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 8234.642773]  [] system_call_fastpath+0x16/0x1b

This applies on top of (depends on) my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

Signed-off-by: Filipe Manana 
---
 fs/btrfs/free-space-cache.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 2ee73c2..030847b 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3200,6 +3200,12 @@ out:
/* once for us and once for the tree */
free_extent_map(em);
free_extent_map(em);
+
+   /*
+* We've left one free space entry and other tasks trimming
+* this block group have left 1 entry each one. Free them.
+*/
+   __btrfs_remove_free_space_cache(block_group->free_space_ctl);
} else {
spin_unlock(&block_group->lock);
}
-- 
2.1.3


[PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

2014-12-01 Thread Filipe Manana

Stress btrfs' block group allocation and deallocation while running
fstrim in parallel. Part of the goal is also to get data block groups
deallocated so that new metadata block groups, using the same physical
device space ranges, get allocated while fstrim is running. This caused
several issues ranging from invalid memory accesses, kernel crashes,
metadata or data corruption, free space cache inconsistencies, free
space leaks and memory leaks.

Signed-off-by: Filipe Manana 
---

V2: Addressed Dave's comments.

 tests/generic/038 | 152 ++
 tests/generic/038.out |   2 +
 tests/generic/group   |   1 +
 3 files changed, 155 insertions(+)
 create mode 100755 tests/generic/038
 create mode 100644 tests/generic/038.out

diff --git a/tests/generic/038 b/tests/generic/038
new file mode 100755
index 000..217aa7a
--- /dev/null
+++ b/tests/generic/038
@@ -0,0 +1,152 @@
+#! /bin/bash
+# FSQA Test No. 038
+#
+# This test was motivated by btrfs issues, but it's generic enough as it
+# doesn't use any btrfs specific features.
+#
+# Stress btrfs' block group allocation and deallocation while running fstrim in
+# parallel. Part of the goal is also to get data block groups deallocated so
+# that new metadata block groups, using the same physical device space ranges,
+# get allocated while fstrim is running. This caused several issues ranging
+# from invalid memory accesses, kernel crashes, metadata or data corruption,
+# free space cache inconsistencies, free space leaks and memory leaks.
+#
+# These issues were fixed by the following btrfs linux kernel patches:
+#
+#   Btrfs: fix invalid block group rbtree access after bg is removed
+#   Btrfs: fix crash caused by block group removal
+#   Btrfs: fix freeing used extents after removing empty block group
+#   Btrfs: fix race between fs trimming and block group remove/allocation
+#   Btrfs: fix race between writing free space cache and trimming
+#   Btrfs: make btrfs_abort_transaction consider existence of new block groups
+#   Btrfs: fix memory leak after block remove + trimming
+#   Btrfs: fix extent map leak on chunk allocation failure
+#
+# The issues were found on a qemu/kvm guest with 4 virtual CPUs, 4Gb of ram and
+# scsi-hd devices with discard support enabled (that means hole punching in the
+# disk's image file is performed by the host).
+#
+#---
+#
+# Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_need_to_be_root
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_fstrim
+
+rm -f $seqres.full
+
+# Keep allocating and deallocating 1G of data space with the goal of creating
+# and deleting 1 block group constantly. The intention is to race with the
+# fstrim loop below.
+fallocate_loop()
+{
+   local name=$1
+   while true; do
+   $XFS_IO_PROG -f -c "falloc -k 0 1G" \
+   $SCRATCH_MNT/$name &> /dev/null
+   sleep 3
+   $XFS_IO_PROG -c "truncate 0" \
+   $SCRATCH_MNT/$name &> /dev/null
+   sleep 3
+   done
+}
+
+trim_loop()
+{
+   while true; do
+   $FSTRIM_PROG $SCRATCH_MNT
+   done
+}
+
+# Create a bunch of small files that get their single extent inlined in the
+# btree, so that we consume a lot of metadata space and get a chance of a
+# data block group getting deleted and reused for metadata later. Sometimes
+# the creation of all these files succeeds other times we get ENOSPC failures
+# at some point - this depends on how fast the btrfs' cleaner kthread is
+# notified about empty block groups, how fast it deletes them and how fast
+# the fallocate calls happen. So we don't really care if they all succeed or
+# not, the goal is just to keep metadata space usage growing while data block
+# groups are deleted.
+create_files()
+{
+

Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-12-01 Thread David Sterba

On Mon, Dec 01, 2014 at 10:04:50AM -0500, Theodore Ts'o wrote:
> On Mon, Dec 01, 2014 at 01:28:10AM -0800, Christoph Hellwig wrote:
> > 
> > The ->is_readonly method seems like a clear winner to me, I'm all for
> > adding it, and thus suggested moving it first in the series.
> 
> It's a real winner for me as well, but the reason why I dropped it is
> because if btrfs() has to keep its ->update_time function, we wouldn't
> actually have a user for is_readonly().  I suppose we could have
> update_time() call ->is_readonly() and then ->update_time() if they
> exist, but it only seemed to add an extra call and a bit of extra
> overhead without really simplifying things for btrfs.

We would use is_readonly in order to remove some extra checks from btrfs
(setxattr, removexattr, possibly setsize).

> If there were other users of ->is_readonly, then it would make sense,
> but it seemed better to move into a separate code refactoring series.

Yeah it would be better addressed separately as it's not the point of
lazytime patchset and only turned out to be a good idea during the
iterations.

Re: Possible to undo subvol delete?

2014-12-01 Thread Robert White


On 12/01/2014 08:40 AM, Shriramana Sharma wrote:

IIUC you can only specify RO while creating but you can always cheaply
create a RW snapshot of an RO one or an RO snapshot of an RW one...


You can turn ReadOnly status on and off (er, "true" and "false") with 
btrfs property set /path/to/subvolume ro true (or false); 
btrfs property get /path/to/subvolume ro shows the current value.





Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread John Williams

On Mon, Dec 1, 2014 at 4:39 AM, Austin S Hemmelgarn
 wrote:

> Just because it's a filesystem doesn't always mean that speed is the most
> important thing.  Personally, I can think of multiple cases where using a
> cryptographically strong hash would be preferable, for example:
>  * On an fs used solely for backup purposes
>  * On a fs used for /boot
>  * On an fs spread across a very large near-line disk array and mounted
>by a system with a powerful CPU
>  * Almost any other case where data integrity is more important than
>speed

What does data integrity have to do with whether the hash is
cryptographic or not? The primary difference between a cryptographic
and non-cryptographic hash is that the non-cryptographic hash can be
easily guessed / predicted (eg., an attack to deliberately create
collisions) whereas the cryptographic hash cannot (given reasonable
assumptions of CPU power).

For filesystem checksums it is difficult to imagine a deliberate
attack on the checksums. Consequently, the only really important
quality for the hash besides speed is collision resistance. The
non-crypto hashes that I have mentioned in this thread have excellent
collision resistant properties.

> The biggest reason to use the in-kernel Crypto API though, is that it gives
> a huge amount of flexibility, and provides pretty much transparent
> substitution of CPU optimized versions of the exported hash functions (for
> example, you don't have to know whether or not your processor supports
> Intel's CRC32 ISA extensions).

Which is worse than useless if the CPU-optimized crypto hash is slower
than the default non-crypto hash, and that will almost always be the
case. Besides, there is nothing magic happening in the Crypto API
library. If you implement your own hash, you can easily do a few
checks and choose the best code for the CPU.

Re: Possible to undo subvol delete?

2014-12-01 Thread Austin S Hemmelgarn


On 2014-12-01 08:54, MegaBrutal wrote:

2014-12-01 14:47 GMT+01:00 Roman Mamedov :

On Mon, 1 Dec 2014 14:38:16 +0100
MegaBrutal  wrote:


I've also noticed, a subvolume can just be deleted with an "rm -r",
just like an ordinary directory. I'd consider to only allow subvolume
deletions with exact "btrfs subvolume delete" commands, and they


This is already the case. 'rm -r' will remove all files in a subvolume, but
the empty subvolume itself is only deletable via the 'btrfs' command.


That's great! And there is no way to protect against recursive
deletions (besides setting the subvolume read-only, as you suggested
below), as files are processed individually by "rm". But it's OK,
people should always be very careful with "rm", and it doesn't change
with btrfs. ;)



If you want to make snapshots which can't be removed by ordinary tools, use
the 'read-only' mode when creating them.


Yeah, good idea! Anyway, is it possible to change a read-only snapshot
to read-write and vice versa, or can you only specify read-only while
creating them?


IIRC, there is something that you can do with the properties interface.
Personally though, I just make the snapshot RW to start with, and then 
recursively make it immutable (chattr -R +i), as I never use immutable 
files for anything else, and it works on any _sane_ filesystem, not just 
btrfs.





Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-01 Thread Robert White


On 12/01/2014 04:56 AM, MegaBrutal wrote:

Since the other thread went off into theoretical debates about UUIDs
and their generic relation to BTRFS, their everyday use cases, and the
philosophical meaning behind uniqueness of copies and UUIDs; I'd like
to specifically ask you to only post here about the ACTUAL problem at
hand. Don't get me wrong, I find the discussion in the other thread
really interesting, I'm following it, but it is only very remotely
related to the original issue, so please keep it there! If you're
interested to catch up about the actual bug symptoms, please read the
bug report linked above, and (optionally) reproduce the problem
yourself!


That discussion _was_ the actual discussion of the actual problem. A 
problem that is not particularly theoretical, a problem that is common 
to block-level snapshots, and a discussion that contained the actual 
work-arounds.


I suggest a re-read. 8-)


Re: Possible to undo subvol delete?

2014-12-01 Thread David Sterba

On Mon, Dec 01, 2014 at 08:50:09AM -0500, Austin S Hemmelgarn wrote:
> It would not surprise 
> me though if RHEL or SuSE had patched the kernel to allow using rm on a 
> subvolume.

This would be quite a big change in behaviour that we would not do
without taking it upstream first.

Re: [PATCH RFC v2] btrfs: add sysfs layout to show volume info

2014-12-01 Thread Goffredo Baroncelli

Hi Anand,

On 12/01/2014 06:33 PM, Anand Jain wrote:
> From: Anand Jain 
> 
> Not yet ready for integration, but for review and testing of the new sysfs 
> layout
> which is currently under /sys/fs/btrfs/by_fsid
> 
> This patch makes btrfs_fs_devices and btrfs_device information readable
> from sysfs. This uses the sysfs group visible entry point to mark
> certain attributes visible/hidden depending the FS state (mount/unmounted).
> 
> The new layout is as shown below.
> 
> /sys/fs/btrfs/by_fsid*
>   ./7b047f4d-c2ce-4f22-94a3-68c09057f1bf*
>   status
>   fsid*
>   missing_devices
>   num_devices*
>   open_devices
>   opened*
>   rotating
>   rw_devices
>   seeding
>   total_devices*
>   total_rw_bytes
>   ./e6701882-220a-4416-98ac-a99f095bddcc*
>   active_pending
>   bdev
>   bytes_used
>   can_discard
>   devid*
>   dev_root_fsid
>   devstats_valid
>   dev_totalbytes
>   generation*
>   in_fs_metadata
>   io_align
>   io_width
>   missing
>   name*
>   nobarriers
>   replace_tgtdev
>   sector_size
>   total_bytes
>   type
>   uuid*
>   writeable
> 
> (* indicates that attribute will be visible even when device is
> unmounted but registered with btrfs kernel)

Thanks for working on that; I really like the idea of exporting more information.
- Is it possible to put the device UUIDs under a directory such as by_dev_uuid/?
  This would make parsing from scripts easier.
- Is it possible to create a directory /sys/fs/btrfs/by_dev_uuid containing a
  link to each related device? For example:
/sys/fs/btrfs/by_dev_uuid/e6701882-220a-4416-98ac-a99f095bddcc ->
../by_fsid/7b047f4d-c2ce-4f22-94a3-68c09057f1bf/by_dev_uuid/e6701882-220a-4416-98ac-a99f095bddc

This would make it easy to see which devices are registered with the kernel.


> 
> The old kobject  will be merged into this new 'by_fsid' kobject,
> so that older attributes under  and newer attributed under by_fsid
> will be merged together as well.

Would it be fully backward compatible? I really like your layout more
than the current one, but I think the current sysfs layout is effectively
a binary API, so it has to be maintained forever.

> 
> v2: added support for device add/delete/replace
> rebase on the latest integration branch
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/dev-replace.c |   7 +
>  fs/btrfs/super.c   |  15 ++
>  fs/btrfs/sysfs.c   | 383 
> +
>  fs/btrfs/sysfs.h   |   6 +
>  fs/btrfs/volumes.c |  42 ++
>  fs/btrfs/volumes.h |   6 +
>  6 files changed, 459 insertions(+)
> 
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 715a115..31ce3a9 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -474,6 +474,7 @@ static int btrfs_dev_replace_finishing(struct 
> btrfs_fs_info *fs_info,
>   u8 uuid_tmp[BTRFS_UUID_SIZE];
>   struct btrfs_trans_handle *trans;
>   int ret = 0;
> + char uuid_buf[BTRFS_UUID_UNPARSED_SIZE];
>  
>   /* don't allow cancel or unmount to disturb the finishing procedure */
>   mutex_lock(&dev_replace->lock_finishing_cancel_unmount);
> @@ -595,7 +596,13 @@ static int btrfs_dev_replace_finishing(struct 
> btrfs_fs_info *fs_info,
>   /* replace the sysfs entry */
>   btrfs_kobj_rm_device(fs_info, src_device);
>   btrfs_kobj_add_device(fs_info, tgt_device);
> + btrfs_destroy_dev_sysfs(src_device);
>   btrfs_rm_dev_replace_free_srcdev(fs_info, src_device);
> + snprintf(uuid_buf, BTRFS_UUID_UNPARSED_SIZE, "%pU",
> + tgt_device->uuid);
> + if (kobject_rename(&tgt_device->dev_kobj, uuid_buf))
> + printk(KERN_ERR "BTRFS: sysfs uuid %s rename error\n",
> + uuid_buf);
>  
>   /* write back the superblocks */
>   trans = btrfs_start_transaction(root, 0);
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 017d92d..918eb9d 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1389,6 +1389,11 @@ static struct dentry *btrfs_mount(struct 
> file_system_type *fs_type, int flags,
>   goto error_sec_opts;
>   }
>  
> + error = btrfs_update_by_fsid_sysfs_group(fs_devices);
> + if (error)
> + btrfs_warn(fs_info, "sysfs update error during mount: %d",
> + error);
> +
>   return root;
>  
>  error_close_devices:
> @@ -1885,8 +1890,18 @@ static int btrfs_sta

Re: Possible to undo subvol delete?

2014-12-01 Thread David Sterba

On Mon, Dec 01, 2014 at 08:12:02AM -0500, Austin S Hemmelgarn wrote:
> On 2014-11-29 23:23, Marc MERLIN wrote:
> > On Sun, Nov 30, 2014 at 09:03:14AM +0530, Shriramana Sharma wrote:
> >> IIUC with BtrFS while it is possible to easily undelete a file or
> >> ordinary directory if a snapshot of the containing subvol exists, it
> >> seems that it's not elementary to undelete a subvol itself, because
> >> all subvols are under the root-level subvol (id 0 or 5, see my other
> >> q) but even snapshotting the root subvol will not snapshot any subvols
> >> under it.
> >>
> >> So is there any way to undo a subvol delete?
> >
> > If you didn't snapshot that volume before deleting it, you're SOL.
> > If you snapshotted it, rename that snapshot to the other name, and
> > you're done.
> >
> > Btrfs doesn't offer undelete, it only lets you keep multiple copies of
> > your data at very little cost, so you can retrieve a snapshot copy if
> > you deleted your current volume's data.
> >
> > Marc
> >
> Well, in theory, if you unmount the FS _immediately_ after the subvol 
> delete, without writing _anything_ else to it, it _might_ be possible to 
> recover the data using some (probably almost incomprehensible) 
> incantation of btrfs-find-root and btrfs recover/restore.
> 
> In practice though, for anyone who doesn't have expert level knowledge 
> of the on-disk structure and fs internals, deleting a subvolume can't be 
> undone.

Agreed, though there's not so much magic involved. Deleting a subvolume
means removing the directory entry and the backreference. Undoing that
can make the subvolume live again, although the dir/name and original
parent tree cannot be reconstructed.

I'll add it to the project ideas.

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Austin S Hemmelgarn


On 2014-12-01 12:22, John Williams wrote:

On Mon, Dec 1, 2014 at 4:39 AM, Austin S Hemmelgarn
 wrote:


Just because it's a filesystem doesn't always mean that speed is the most
important thing.  Personally, I can think of multiple cases where using a
cryptographically strong hash would be preferable, for example:
  * On an fs used solely for backup purposes
  * On a fs used for /boot
  * On an fs spread across a very large near-line disk array and mounted
by a system with a powerful CPU
  * Almost any other case where data integrity is more important than
speed


What does data integrity have to do with whether the hash is
cryptographic or not? The primary difference between a cryptographic
and non-cryptographic hash is that the non-cryptographic hash can be
easily guessed / predicted (eg., an attack to deliberately create
collisions) whereas the cryptographic hash cannot (given reasonable
assumptions of CPU power).

For filesystem checksums it is difficult to imagine a deliberate
attack on the checksums. Consequently, the only really important
quality for the hash besides speed is collision resistance. The
non-crypto hashes that I have mentioned in this thread have excellent
collision resistant properties.
I'm not saying they don't have excellent collision resistance 
properties.  I'm also not saying that we shouldn't support such 
non-cryptographic hashes, just that we shouldn't explicitly NOT support 
other hashes, and that if we are going to support more than one hash 
algorithm, we should use the infrastructure already in place in the 
kernel for such things because it greatly simplifies maintaining the code.


In fact, if I had the time, I'd just write CryptoAPI implementations of 
those hashes myself.



The biggest reason to use the in-kernel Crypto API though, is that it gives
a huge amount of flexibility, and provides pretty much transparent
substitution of CPU optimized versions of the exported hash functions (for
example, you don't have to know whether or not your processor supports
Intel's CRC32 ISA extensions).


Which is worse than useless if the CPU-optimized crypto hash is slower
than the default non-crypto hash, and that will almost always be the
case. Besides, there is nothing magic happening in the Crypto API
library. If you implement your own hash, you can easily do a few
checks and choose the best code for the CPU.

Except most of the CPU optimized hashes aren't crypto hashes (other than 
the various SHA implementations).  Furthermore, I've actually tested the 
speed of a generic CRC32c implementation versus SHA-1 using the SHA 
instructions on an UltraSPARC processor, and the difference amounts to 
a few microseconds in _favor_ of the optimized crypto hash; and I've run 
the math for every other ISA that has instructions for computing SHA 
hashes (I don't have the hardware for any of the others), and expect 
similar results for those as well.
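
As a rough illustration of the kind of measurement described above, here is a minimal userspace sketch in Python. It is not the benchmark referred to in this message: zlib.crc32 is plain CRC-32 rather than CRC32c, the 4 KiB block size and round count are assumptions, and userspace numbers say little about what the in-kernel Crypto API would achieve on a given CPU.

```python
import hashlib
import time
import zlib


def throughput_mib_s(func, payload, rounds):
    """Return the measured throughput of func(payload) in MiB/s."""
    start = time.perf_counter()
    for _ in range(rounds):
        func(payload)
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (len(payload) * rounds) / (1024 * 1024) / elapsed


def compare(block_size=4096, rounds=2000):
    """Time a few stdlib hashes over one block-sized payload."""
    payload = bytes(range(256)) * (block_size // 256)
    return {
        "crc32": throughput_mib_s(zlib.crc32, payload, rounds),
        "sha1": throughput_mib_s(lambda b: hashlib.sha1(b).digest(), payload, rounds),
        "sha256": throughput_mib_s(lambda b: hashlib.sha256(b).digest(), payload, rounds),
    }


if __name__ == "__main__":
    for name, rate in compare().items():
        print(f"{name:8s} {rate:10.1f} MiB/s")
```

The results vary widely with the CPU and with how the Python build links its hash implementations, so they are only a starting point for a real comparison.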






Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread John Williams

On Mon, Dec 1, 2014 at 9:42 AM, Austin S Hemmelgarn wrote:
> Except most of the CPU optimized hashes aren't crypto hashes (other than the
> various SHA implementations).  Furthermore, I've actually tested the speed
> of a generic CRC32c implementation versus SHA-1 using the SHA instructions
> on an UltraSPARC processor, and the difference amounts to a few
> microseconds in _favor_ of the optimized crypto hash; and I've run the math
> for every other ISA that has instructions for computing SHA hashes (I don't
> have the hardware for any of the others), and expect similar results for
> those as well.

I think the confusion here is that I am talking about 128-bit and
256-bit hashes, which is what you would choose for filesystem
checksums if you want to have extremely strong collision resistance
(eg., you could also use it for dedup).

You seem to be talking about 32-bit (and maybe 64-bit) hashes.

The speed difference between crypto 128- and 256-bit hashes and
non-crypto equivalents that I have mentioned is an order of magnitude
or more.
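
To put numbers on why digest width matters for accidental collisions, here is a small Python sketch of the standard birthday-bound estimate. It assumes an idealised hash whose outputs are uniformly distributed, which is an approximation for any real function, and the block count is an arbitrary example rather than a figure from this thread.

```python
import math


def collision_probability(num_blocks, hash_bits):
    """Birthday-bound estimate: P ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = num_blocks * (num_blocks - 1) / 2 ** (hash_bits + 1)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    blocks = 2 ** 32  # e.g. 16 TiB of data in 4 KiB blocks
    for bits in (32, 64, 128, 256):
        print(f"{bits:3d}-bit: {collision_probability(blocks, bits):.3e}")
```

This only matters when the hash is used to identify blocks, as in dedup. For detecting corruption of a single block, the relevant figure is the chance that one damaged block happens to match its stored checksum, which is about 2^-bits under the same assumption.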

Re: ToS page does not exist?

2014-12-01 Thread David Sterba

On Sun, Nov 30, 2014 at 09:21:26AM +0530, Shriramana Sharma wrote:
> I am asked to read the ToS before signing up on the wiki:
> 
> Make sure that you first read the Terms of Service before requesting an 
> account.
> 
> ... but the link is red and the page does not exist.

Reported to kernel.org maintainers.

Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?

2014-12-01 Thread Robert White


On 11/30/2014 10:18 PM, Qu Wenruo wrote:

(advocacy for using SQL internally for btrfsck)


All of these ideas you want to toss an entire SQL front end on are more 
simply handled with simple data structures.


In C++ terms "map" and/or "map>" 
beats the heck out of including all of SQL and its related indexes and 
type conversions (sqlite, for example, stores integers as doubles, or 
decimal numbers depending on version).


RDBMS _are_ good at representing things, so noticing that a thing _can_ 
be represented with an RDBMS is very common.


But by the time you put two or three indexes on relation->(parent, 
child, name) you've given yourself three or four copies of the core data 
in three or four different places. And those copies are largely 
immutable and randomly distributed and will include the overhead in 
memory for fairly sparse trees.


It's not that it's an unworkable idea.

But it is unnecessarily generic and adds an order of magnitude of 
complexity to your problems.


For instance, if I boot from a CD to run a btrfsck where will the 
database files be written to?


If it is an in-memory table why do I want the overhead of SQL to look up 
something indexed by integer?


If the sparse vectors of integers don't fit in memory why would the SQL 
tables of integers fit "better"?


SQL would be the second slowest possible for representing this data -- 
The slowest would be an XML schema stored as flat text.


So your crazy idea is also a pretty bad one compared to most if not all 
sparse data representations and techniques that come to bear on this 
problem set. All you are really doing is pushing the same work (walking 
a tree to find an integer) into a difficult "spell it out in SQL" space.


Is prepare_sql(cursor,"SELECT parent FROM parentage_tree WHERE child = 
%d"); execute_sql(cursor,child); and its possible error returns actually 
clearer or better than "parent=inheritance.find(child); if 
(parent!=inheritance.end()) {...}" (as it might be written in C++)?
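
For a side-by-side view of the two styles, here is a minimal sketch in Python (not the C++ of the question above). The inode numbers, the table name parentage_tree and the helper names are all made up for illustration; nothing here is taken from btrfsck.

```python
import sqlite3

# Hypothetical child-inode -> parent-inode pairs, made up for illustration.
PARENTAGE = {257: 256, 258: 256, 259: 257}


def parent_via_dict(child):
    """Native lookup: returns the parent inode, or None if absent."""
    return PARENTAGE.get(child)


def build_db():
    """Load the same pairs into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE parentage_tree (child INTEGER PRIMARY KEY, parent INTEGER)"
    )
    conn.executemany("INSERT INTO parentage_tree VALUES (?, ?)", PARENTAGE.items())
    return conn


def parent_via_sql(conn, child):
    """The same lookup through SQL; returns None if no row matches."""
    row = conn.execute(
        "SELECT parent FROM parentage_tree WHERE child = ?", (child,)
    ).fetchone()
    return row[0] if row else None
```

Both functions answer the same question. The SQL path needs the table to be created and fully populated first, and a miss comes back as "no rows" rather than a distinct "not found" value.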


Do you want to know if (keep track of whether) an inode is allocated and 
referenced? There's a sparse bit-vector for that...


Want to be able to get back to an inode's location on disk, a sparse 
array of disk offsets exists (among other options).


Before you can even access the RDBMS you'd have to fill it completely; 
otherwise you wouldn't know if a select returning zero rows was an 
authoritative indication that the datum didn't exist or if it was 
instead an indication that the datum hadn't been populated yet.


THIS IS NOT SARCASM: If you strongly disagree, I suggest you start 
coding. Seriously, don't ask, do... And in a month really check to see 
if your solution is any smaller, faster, easier, or in _any_ _way_ more 
optimal than using native data structures. The attempt will answer the 
question definitively and then we'll all know...


Re: [PATCH v2 5/5] btrfs: enable swap file support

2014-12-01 Thread David Sterba

On Mon, Nov 24, 2014 at 02:03:02PM -0800, Omar Sandoval wrote:
> Alright, I took a look at this. My understanding is that a PREALLOC extent
> represents a region on disk that has already been allocated but isn't in use
> yet, but please correct me if I'm wrong. Judging by this comment in
> btrfs_get_blocks_direct, we don't have to worry about PREALLOC extents in
> general:
> 
> /*
>  * We don't allocate a new extent in the following cases
>  *
>  * 1) The inode is marked as NODATACOW.  In this case we'll just use the
>  * existing extent.
>  * 2) The extent is marked as PREALLOC.  We're good to go here and can
>  * just use the extent.
>  *
>  */

Ok, thanks for checking.

> A couple of other considerations that cropped up:
> 
> - btrfs_get_extent does a bunch of extra work if the extent is not cached in
>   the extent map tree that would be nice to avoid when swapping
> - We might still have to do a COW if the swap file is in a snapshot
> 
> We can avoid the btrfs_get_extent by pinning the extents in memory one way or
> another in btrfs_swap_activate.

That's preferable, the overhead should be small.

> The snapshot issue is a little trickier to resolve. I see a few options:
> 
> 1. Just do the COW and hope for the best

Snapshotting of NODATACOW files works: the extents are COWed on the first
write, and the original owner sees no change.

> 2. As part of btrfs_swap_activate, COW any shared extents. If a snapshot
> happens while a swap file is active, we'll fall back to 1.

So we should make sure that no write to the swapfile triggers that
first-write COW. This would require scanning all the extents, checking
whether we're the sole owner, and COWing where needed.

Doing that automatically is IMO better from the user perspective,
compared to erroring out and requesting a manual "dd" over the file.

Possibly, the implied COW could create preallocated extents so we do not
have to rewrite the data.

> 3. Clobber any swap file extents which are in a snapshot, i.e., always use the
> existing extent.

Easy to implement but could lead to bad surprises.

More thoughts:

There are some guards in the generic code to prevent unwanted
modifications to the swapfiles (eg. no truncate, no fallocate, no
deletion). Further we should audit btrfs ioctls that may interfere with
swap and forbid any action (notably the cloning and deduplication ones).

Re: pro/cons of raid1 with mdadm/lvm2

2014-12-01 Thread Robert White


On 12/01/2014 01:26 AM, Gour wrote:

On Mon, 01 Dec 2014 09:06:19 +1100
Russell Coker  wrote:


When the 2 disks have different data mdadm has no way of knowing
which one is correct and has a 50% chance of overwriting good data.
But BTRFS does checksums on all reads and solves the problem of
corrupt data - as long as you don't have 2 corrupt sectors in
matching blocks.


Hmm, this is very interesting and valuable info. Thank you.



Don't get too excited. If the data disagrees in either system because of 
partial starts (e.g. one mount /dev/sda1 is hot, the next mount 
/dev/sdb1 is hot) leaving both disks with equally valid but disagreeing 
data, "magical guessing" _will_ ensue.


In the BTRFS case the system will guess based on generation numbers.

In the MDADM case the system will guess based on the write intent bitmaps.

In both cases, you are _way_ better off invalidating any missing array 
elements if you find yourself reaching a write-available event with said 
missing elements.


BTRFS checksums really help when a write goes to both devices (e.g. 
/dev/sda1 and /dev/sdb1) and the disk subsystem silently fails one of 
the necessary writes, creating a disagreement between the checksum and 
the data block on the drive in question.
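
The silent-failed-write case can be sketched as follows. This is a simplified Python model, not btrfs code: zlib.crc32 stands in for the crc32c that btrfs uses by default, and the function names and in-memory list of mirror copies are assumptions made for the example.

```python
import zlib


def checksum(block):
    # zlib.crc32 is a stand-in; btrfs' default data checksum is crc32c.
    return zlib.crc32(block)


def read_mirrored(copies, stored_checksum):
    """Return (good_data, repaired_indexes), or raise if no copy verifies.

    copies is a mutable list holding each mirror's version of one block.
    """
    good = next((c for c in copies if checksum(c) == stored_checksum), None)
    if good is None:
        raise OSError("no mirror matches the stored checksum")
    repaired = []
    for index, copy in enumerate(copies):
        if checksum(copy) != stored_checksum:
            copies[index] = good  # rewrite the bad mirror from the good one
            repaired.append(index)
    return good, repaired
```

Without a stored checksum there is nothing to compare the two copies against, which is the mdadm situation described at the top of this message. The sketch also shows the limit: if both copies verify but differ because of a partial start, the checksum alone cannot pick a winner.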


It's an outstanding feature, but it doesn't protect you from weak 
maintenance after a partial start.


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread David Sterba

On Wed, Nov 26, 2014 at 08:58:50AM -0500, Austin S Hemmelgarn wrote:
> On 2014-11-26 08:38, Brendan Hide wrote:
> > On 2014/11/25 18:47, David Sterba wrote:
> >> We could provide an interface for external applications that would make
> >> use of the strong checksums. Eg. external dedup, integrity db. The
> >> benefit here is that the checksum is always up to date, so there's no
> >> need to compute the checksums again. At the obvious cost.
> >
> > I can imagine some use-cases where you might even want more than one
> > algorithm to be used and stored. Not sure if that makes me a madman,
> > though. ;)
> >
> Not crazy at all, I would love to have the ability to store multiple 
> different weak but fast hash values.  For example, on my laptop, it is 
> actually faster to compute crc32c, adler32, and md5 hashes together than 
> it is to compute pretty much any 256-bit hash I've tried.

Well, this is doable :) there's space for 256 bits in general; the order of
checksum bytes within one "checksum word" would be given by the fixed order
in which the algorithms are defined. The code complexity would increase, but
not that much I think.
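
A minimal Python sketch of what such a combined "checksum word" could look like, using the three hashes mentioned in the quoted message. The byte order, the zero padding and the function names are assumptions for illustration only, and zlib.crc32 is plain CRC-32 rather than crc32c; no on-disk format is implied.

```python
import hashlib
import struct
import zlib

CSUM_FIELD_SIZE = 32  # 256 bits, the space mentioned above


def pack_checksums(block):
    """Concatenate crc32, adler32 and md5 in a fixed order, zero-padded."""
    # 4 bytes of CRC-32 followed by 4 bytes of Adler-32, little-endian.
    packed = struct.pack("<II", zlib.crc32(block), zlib.adler32(block))
    # 16 bytes of MD5 bring the total to 24 of the 32 available bytes.
    packed += hashlib.md5(block).digest()
    return packed.ljust(CSUM_FIELD_SIZE, b"\0")


def verify_checksums(block, field):
    """True if every checksum in the packed field matches the block."""
    return pack_checksums(block) == field
```

With this layout the three digests use 192 of the 256 bits, so the fixed order also fixes where each algorithm's bytes start.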

> This then brings up the issue of what to do when we try to mount such a 
> fs on a system that doesn't support some or all of the hashes used.

I see two modes: a strict one that fails unless all of them are present,
and one relaxed by a mount option to accept at least one.

But let's keep this open, I'm not yet convinced that combining more weak
algos makes sense from the crypto POV. If this should protect against
random bitflips, would one fast-but-weak be comparable to a combination?
Or other expectations.

Re: btrfs stuck with lot's of files

2014-12-01 Thread Robert White


On 12/01/2014 03:46 AM, Peter Volkov wrote:

Hi, guys.

> (stuff about getting hung up trying to write to one drive)

That drive (/dev/sdn) is probably starting to fail. Some older drives 
basically go unresponsive when they start to go bad. Particularly if 
they've gone bad enough to have run out of spare tracks/sectors. 
Sometimes they will just refuse to answer. Sometimes they will go into 
"try again" mode, and the same activity will be retried indefinitely. 
This will then fill up your write queues and jam up all sorts of subsystems.


Step 1: Backup your data. Since you didn't RAID your data at all, when 
that drive dies your data is going to go away in fascinating and 
unpredictable ways. (RAID1 metadata with no RAID1 or RAID5 of the data 
means you have essentially no media failure protection.)


Step 2: Turn on SMART (if you can, and you probably can) and check whether the 
drive is in its final moments of life. If your disk is all green lights 
according to SMART, you may be able to un-jam it by just doing a 
balance as described and explained after the next time I quote you.


Step 3: Switch your data mode to RAID5. It will cost you about half of 
your currently free data space, but it won't leave you _as_ _vulnerable_ 
to complete data loss as you are now. SMART might be wrong about your 
drive being fine if it says it is.



  # btrfs filesystem df /store/
Data, single: total=11.92TiB, used=10.86TiB


Regardless of the above...

You have a terabyte of allocated but unused data storage. You probably 
need to balance your system to un-jam that. That's a lot of space that 
is unavailable to the metadata (etc).
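
For reference, the "terabyte" figure is just total minus used from the df line quoted above. A minimal sketch of that arithmetic, assuming the `total=...`/`used=...` output format shown in the quote (the helper name and the regular expression are mine, not part of btrfs-progs):

```python
import re

# Binary unit multipliers as they appear in the quoted output.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30,
         "TiB": 2**40, "PiB": 2**50}

DF_LINE = re.compile(
    r"\s*(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tunit>(?:[KMGTP]i)?B), "
    r"used=(?P<used>[\d.]+)(?P<uunit>(?:[KMGTP]i)?B)"
)


def allocated_but_unused(line: str) -> float:
    """Return the bytes allocated to chunks but not holding any data."""
    m = DF_LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognised df line: {line!r}")
    total = float(m["total"]) * UNITS[m["tunit"]]
    used = float(m["used"]) * UNITS[m["uunit"]]
    return total - used


line = "Data, single: total=11.92TiB, used=10.86TiB"
slack = allocated_but_unused(line)
print(f"{slack / UNITS['TiB']:.2f} TiB allocated but unused")
```

For the quoted line this reports 1.06 TiB, which is the space a balance could return to the unallocated pool.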


ASIDE: Having your metadata set to RAID1 (as opposed to the default of 
DUP) seems a little iffy since your data is still set to single. This 
configuration is not going to leave you with a mountable filesystem if 
you lose a disk. I'm not sure if the RAID1 layout is going to want to 
put specific data in specific places, but it might, and if it does that 
might leave you in an irreconcilable position.


Either way, you will probably un-jam your system in the short run by 
doing a balance. A full balance (no filter args at all) would be your 
best bet.


FURTHER ASIDE: raid1 metadata and raid5 data might be good for you. Given 
22 volumes and 10% empty space, it would only cost you half of your 
existing empty space. If you don't RAID your data, there is no real 
point to putting your metadata in RAID.
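
The "half of your empty space" estimate can be checked with a little arithmetic. The sketch below assumes 22 equally sized devices, stripes that span all of them (one parity strip per 21 data strips), and a filesystem that is 90% full; none of these are confirmed by the original report beyond the "22 volumes and 10% empty" figures.

```python
def raid5_parity_cost(used_fraction: float, devices: int) -> float:
    """Extra raw space needed for RAID5 parity, as a fraction of capacity.

    Assumes full-width stripes: every (devices - 1) data strips need
    one additional parity strip.
    """
    if devices < 2:
        raise ValueError("RAID5 needs at least two devices")
    return used_fraction / (devices - 1)


used = 0.90          # assumed: 90% of raw capacity holds data
free = 1.0 - used    # the 10% empty space mentioned above
cost = raid5_parity_cost(used, devices=22)
print(f"parity needs {cost:.1%} of capacity, {cost / free:.0%} of the free space")
```

Under these assumptions parity takes a bit over 4% of raw capacity, which is somewhat less than half of the free space, in line with the rough estimate in the text.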


[Yes, I said my basic points about your current layout two different 
ways and times. You are either "just a little over-committed on space" 
or you are "about to lose all your data" and it's impossible to tell 
which is the case from here.]


Back up your data. NOW!


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread David Sterba

On Thu, Nov 27, 2014 at 11:52:20AM +0800, Liu Bo wrote:
> > There are several checksum algorithms that trade off speed and strength
> > so we may want to support more than just sha256. Easy to add but I'd
> > rather see them added in all at once than one by one.
> > 
> > Another question is if we'd like to use different checksum for data and
> > metadata. This would not cost any format change if we use the 2 bytes in
> > super block csum_type.
> 
> Yes, but breaking it into meta_csum_type and data_csum_type will need an
> incompat flag.

Not necessarily a new bit. If we read the field as-is and see that it's zero,
we know it's the previous version; otherwise it's the new one, and then we set
only in-memory fields for data and metadata.

The backward compatibility is fine, old kernels will refuse to mount
with csum_type != 0.
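
A userspace sketch of that decoding, to make the compatibility argument concrete. The split of the 16-bit field into a low byte for data and a high byte for metadata, and the algorithm id 1 for sha256, are assumptions made for illustration; only "zero means the existing crc32c format" comes from the discussion.

```python
import struct

# Algorithm ids are hypothetical apart from 0, which is today's crc32c.
CSUM_NAMES = {0: "crc32c", 1: "sha256"}


def decode_csum_type(raw: bytes) -> dict:
    """Split the 2-byte little-endian csum_type into in-memory fields."""
    (value,) = struct.unpack("<H", raw)
    if value == 0:
        # Previous format: one algorithm for both data and metadata.
        return {"data": "crc32c", "metadata": "crc32c", "legacy": True}
    data_id = value & 0xFF         # assumed: low byte selects the data checksum
    meta_id = (value >> 8) & 0xFF  # assumed: high byte selects the metadata checksum
    return {
        "data": CSUM_NAMES[data_id],
        "metadata": CSUM_NAMES[meta_id],
        "legacy": False,
    }


def old_kernel_mounts(raw: bytes) -> bool:
    """Older kernels refuse any csum_type other than zero."""
    return struct.unpack("<H", raw)[0] == 0
```

With this layout a filesystem that uses crc32c for both data and metadata still encodes as zero, so it remains mountable by old kernels without a new incompat bit.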

> > Optional/crazy/format change stuff:
> > 
> > * per-file checksum algorithm - unlike compression, the whole file would
> >   have to use the same csum algo
> >   reflink would work iff the algos match
> >   snapshotting is unaffected
> > 
> > * per-subvolume checksum algorithm - specify the csum type at creation
> >   time, or afterwards unless it's modified
> 
> I thought about this before, if we enable this, a few cases need to be dealt
> with(at least),
> 1. convert file data's csum from one algorithm to another

On-line or offline? I'd rather avoid doing that on a mounted filesystem.

> 2. to make different checksum length co-exist, we can either use different
>    key.type for different algorithms, or pack checksum into a new structure
>    that has algorithm types(and length).

Oh right, the mixed sizes of checksums could be a problem and would
require a format change (and thus the incompatibility bit).

The key.type approach looks better, we'd encode the algorithm type
effectively, the item bytes contain only fixed-size checksums.
(Here I'm thinking a new BTRFS_EXTENT_CSUM_KEY per checksum type.)

OTOH storing the algo type (size is not needed) would add overhead
per-checksum (probably only a single byte but still).

Re: scrub implies failing drive - smartctl blissfully unaware

2014-12-01 Thread Phillip Susi

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 11/25/2014 6:13 PM, Chris Murphy wrote:
> The drive will only issue a read error when its ECC absolutely
> cannot recover the data, hard fail.
> 
> A few years ago companies including Western Digital started
> shipping large cheap drives, think of the "green" drives. These had
> very high TLER (Time Limited Error Recovery) settings, a.k.a. SCT
> ERC. Later they completely took out the ability to configure this
> error recovery timing so you only get the upward of 2 minutes to
> actually get a read error reported by the drive. Presumably if the
> ECC determines it's a hard fail and no point in reading the same
> sector 14000 times, it would issue a read error much sooner. But
> again, the linux-raid list is full of cases where this doesn't
> happen, and merely by changing the linux SCSI command timer from 30
> to 121 seconds, now the drive reports an explicit read error with
> LBA information included, and now md can correct the problem.

I have one of those and took it out of service when it started reporting
read errors ( not timeouts ).  I tried several times to write over the
bad sectors to force reallocation and it worked again for a while...
then the bad sectors kept coming back.  Oddly, the SMART values never
indicated anything had been reallocated.

> That's my whole point. When the link is reset, no read error is 
> submitted by the drive, the md driver has no idea what the drive's 
> problem was, no idea that it's a read problem, no idea what LBA is 
> affected, and thus no way of writing over the affected bad sector.
> If the SCSI command timer is raised well above 30 seconds, this
> problem is resolved. Also replacing the drive with one that
> definitively errors out (or can be configured with smartctl -l
> scterc) before 30 seconds is another option.
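
The relationship described in the quote can be expressed as a simple check. This is a sketch: the kernel's per-device command timer is read from `/sys/block/<dev>/device/timeout` (in seconds), and `smartctl -l scterc` reports its limits in tenths of a second. The function names and the `sysfs_root` parameter (there only so the code can be exercised without real hardware) are my own.

```python
from pathlib import Path


def read_command_timer(device: str, sysfs_root: str = "/sys") -> int:
    """Return the SCSI command timer for e.g. 'sdn', in seconds."""
    path = Path(sysfs_root, "block", device, "device", "timeout")
    return int(path.read_text().strip())


def drive_reports_before_reset(erc_deciseconds, command_timer_seconds) -> bool:
    """True if the drive gives up and reports a read error before the
    kernel's command timer expires and the link is reset.

    erc_deciseconds: SCT ERC read limit in tenths of a second, or
                     None/0 when the drive has no configurable limit.
    """
    if not erc_deciseconds:
        # No bounded recovery time: the drive may outlast any short timer.
        return False
    return erc_deciseconds / 10 < command_timer_seconds
```

For example, a drive limited to 7 seconds (`scterc,70,70`) is fine with the default 30 second timer, while a drive that may spend up to 120 seconds in recovery only reports its error if the timer is raised above that.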

It doesn't know why or exactly where, but it does know *something* went
wrong.

> It doesn't really matter, clearly its time out for drive commands
> is much higher than the linux default of 30 seconds.

Only if you are running linux and can see the timeouts.  You can't
assume that's what is going on under windows just because the desktop
stutters.

> OK that doesn't actually happen and it would be completely f'n
> wrong behavior if it were happening. All the kernel knows is the
> command timer has expired, it doesn't know why the drive isn't
> responding. It doesn't know there are uncorrectable sector errors
> causing the problem. To just assume link resets are the same thing
> as bad sectors and to just wholesale start writing possibly a
> metric shit ton of data when you don't know what the problem is
> would be asinine. It might even be sabotage. Jesus...

In normal single disk operation sure: the kernel resets the drive and
retries the request.  But like I said before, I could have sworn there
was an early failure flag that md uses to tell the lower layers NOT to
attempt that kind of normal recovery, and instead just to return the
failure right away so md can just go grab the data from the drive that
isn't wigging out.  That prevents the system from stalling on paging IO
while the drive plays around with its deep recovery, and copying back
512k to the drive with the one bad sector isn't really that big of a
deal.

> Then there is one option which is to increase the value of the
> SCSI command timer. And that applies to all raid: md, lvm, btrfs,
> and hardware.

And then you get stupid hanging when you could just get the data from
the other drive immediately.
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (MingW32)

iQEcBAEBAgAGBQJUfL04AAoJENRVrw2cjl5RFW0H/Rtz4Y8bynWAP2yjiqZMsic+
vXCxuJAFGpOKVyV1FboCuLStp8TQ5aIiJyHrprsCiy4UAY0bFQjzaHOo4jBlCdV/
YaD3HSWGKAFUbIiByCnMfIDMxWSPP8rOeFpotoywAkNe0vIsIKg955IX96+jNMy2
IAjKGQahzp2UW6ggnwwdA/JayUmb1jZ8LvmV58rDVdhTnGPgrrYZnIyf/OphrXqd
R/WJtFDuUBUhtsmXYrY2wGUQNi+3zp+I9YburmeDtEcrbwDLDCiVdE6ChmoCrNBS
nbcfqoWPEk1DsiI9GC/Yu/sXLq2iD0n53e/DHa36z4zc4uWtUjBwSYyCubJfkyI=
=FrB9
-END PGP SIGNATURE-

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

John Williams wrote:

> On Mon, Dec 1, 2014 at 9:42 AM, Austin S Hemmelgarn wrote:
>> Except most of the CPU optimized hashes aren't crypto hashes (other than
>> the various SHA implementations).  Furthermore, I've actually tested the
>> speed of a generic CRC32c implementation versus SHA-1 using the SHA
>> instructions on an UltraSPARC processor, and the difference amounts to a
>> few microseconds in _favor_ of the optimized crypto hash; and I've run
>> the math for every other ISA that has instructions for computing SHA
>> hashes (I don't have the hardware for any of the others), and expect
>> similar results for those as well.
> 
> I think the confusion here is that I am talking about 128-bit and
> 256-bit hashes, which is what you would choose for filesystem
> checksums if you want to have extremely strong collision resistance
> (eg., you could also use it for dedup).
> 
> You seem to be talking about 32-bit (and maybe 64-bit) hashes.
> 
> The speed difference between crypto 128- and 256-bit hashes and
> non-crypto equivalents that I have mentioned is an order of magnitude
> or more.

I think there's a fundamental set of points being missed.

* The Crypto API can be used to access non-cryptographic hashes. Full stop.

* He was comparing CRC32 (a 32-bit non-cryptographic hash, *via the Crypto 
API*) against SHA-1 (a 160-bit cryptographic hash, via the Crypto API), and 
SHA-1 _still_ won. CRC32 tends to beat the pants off 128-bit 
non-cryptographic hashes simply because those require multiple registers to 
store the state if nothing else; which makes this a rather strong argument 
that _hardware matters a heck of a lot_, quite possibly _more_ than the 
algorithm.

Even if SHA-1 in software is vastly slower than CityHash or whatever in 
software, the Crypto API implementation *may not be purely software*.

* The main benefit of the Crypto API is not any specific hash, it's that 
it's a _common API_ for _using any supported hash_.

* Your preferred non-cryptographic hashes can, thus, be used _via_ the 
Crypto API.

* This has benefits of:

  * Code reuse (for anyone else who wants to use such a hash).

  * Optimization opportunities (if a CPU implements some primitive, it can
    be leveraged in an arch-specific implementation, which the Crypto API will
    use _automatically_).

  * Flexibility (by using the Crypto API, _any_ supported hash can be used
    generically, so the _user_ can decide what they want rather than picking
    from a small, hard-coded menu of options in btrfs).
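
As a userspace analogy only (this is Python's hashlib, not the kernel Crypto API), the "common API, any supported hash" point looks like this: the caller picks an algorithm by name, and non-cryptographic checksums can sit behind the same interface. The `digest` helper and the `NON_CRYPTO` table are hypothetical names for this sketch.

```python
import hashlib
import zlib


def _crc32(data: bytes) -> bytes:
    return (zlib.crc32(data) & 0xFFFFFFFF).to_bytes(4, "big")


def _adler32(data: bytes) -> bytes:
    return (zlib.adler32(data) & 0xFFFFFFFF).to_bytes(4, "big")


# Non-cryptographic checksums registered next to the cryptographic ones.
NON_CRYPTO = {"crc32": _crc32, "adler32": _adler32}


def digest(name: str, data: bytes) -> bytes:
    """Compute a digest by algorithm name through one entry point."""
    if name in NON_CRYPTO:
        return NON_CRYPTO[name](data)
    # hashlib.new() raises ValueError for names it does not support.
    return hashlib.new(name, data).digest()


if __name__ == "__main__":
    block = b"\0" * 4096
    for name in ("crc32", "md5", "sha1", "sha256"):
        print(name, len(digest(name, block)) * 8, "bits")
```

The caller never needs to know which implementation backs a given name, which is the property being argued for above.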


Re: Online Drive Replacement: BTRFS with RAID 6

2014-12-01 Thread Robert White


On 12/01/2014 06:47 AM, Oliver wrote:

> Hi All,
>
> on a testing machine I installed four HDDs and they are configured as
> RAID6. For a test I removed one of the drives (/dev/sdk) while the
> volume was mounted and data was written to it. This worked well, as far
> as I can see. Some I/O errors were written to /var/log/syslog, but the
> volume kept working. Unfortunately the command "btrfs fi sh" did not
> show any missing drives. So I remounted the volume in degraded mode:
> "mount -t btrfs /dev/sdx1 -o remount,rw,degraded,noatime /mnt". This way
> the drive in question was reported as missing. Then I plugged in the HDD
> again (it is of course /dev/sdk again) and started a balancing in hope
> that this will restore RAID6: "btrfs filesystem balance start /mnt". Now
> the volume looks like this:


Since it was already running and such, remounting it as degraded was 
probably not a good thing (or even vaguely necessary).


The WIKI, in discussing add/remove and failed drives goes to great 
lengths (big red box) to discuss the current instability of RAID5/6 format.


I am guessing here but I _think_ you should do the following...

(0) Back up your data. [okay this is a test system that you deliberately 
perturbed but still... 8-) ]


Option 1:

(reasonable, but undocumented .. Either balance or scrub _ought_ to look 
at the disk sectors and trigger some re-copying from the good parts.)


The disk is in the array (again), you may just need to re-balance or 
scrub the array to get the data on the drive back in harmony with the 
state of the array overall.


Option 2:

(unlikely :: add and remove are about making the geometry smaller/larger 
and, as stated, a RAID 6 cannot be less than 4 drives by definition, so 
there is no three-drive geometry for a RAID 6.)


re-unplug the device, then use btrfs device delete /dev/sdk /mnt
then re-plug-in the device and use btrfs device add /dev/sdk /mnt


Option 3:

(reasonable, but undocumented :: replace by device id -- 4 in your 
example case -- instead of system path. This would, I should think, skip 
the check of /dev/sdk1's separate status)


btrfs replace start -f 4 /dev/sdk1 /mnt

Option 3a:

(got to get /dev/sdk1 back out of the list of active devices for /mnt so 
the system won't see /dev/sdk1 as "mounted" (e.g. held by a subsystem))


unplug device.
mount -o remount,degraded etc...
plug in device.
btrfs replace start -f 4 /dev/sdk1 /mnt

Option 4:

(most likely, most time consuming)

Unplug /dev/sdk. Plug it into another computer and zero a decent chunk 
of partition 1.

Plug it back into the original computer
do the replace operation as in Option 3.

This is the most-likely correct option if a simple rebalance or scrub 
doesn't work, as you will be presenting the system with three attached 
drives, one "missing" drive that will not match any necessary 
signatures, and a "new, blank" drive in its place.


===

In all cases, you may need to unmount, remount, or remount degraded 
somewhere in there, particularly because you have already done so at least 
once.




Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

Alex Elsayed wrote:

> * He was comparing CRC32 (a 32-bit non-cryptographic hash, *via the Crypto
> API*) against SHA-1 (a 160-bit cryptographic hash, via the Crypto API),
> and SHA-1 _still_ won. CRC32 tends to beat the pants off 128-bit non-
> cryptographic hashes simply because those require multiple registers to
> store the state if nothing else; which makes this a rather strong argument
> that _hardware matters a heck of a lot_, quite possibly _more_ than the
> algorithm.

Ah, correction - it seems he was comparing his own implementations, rather 
than the Crypto API ones - but the points still hold, seeing as the Crypto 
API does provide both algorithms.


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread John Williams

On Mon, Dec 1, 2014 at 11:28 AM, Alex Elsayed  wrote:

> I think there's a fundamental set of points being missed.

That may be true, but it is not me who is missing them.

> * The Crypto API can be used to access non-cryptographic hashes. Full stop.

Irrelevant to my point. I am talking about specific non-cryptographic
hashes, and they are not currently in the Crypto API.

> * He was comparing CRC32 (a 32-bit non-cryptographic hash, *via the Crypto
> API*) against SHA-1 (a 160-bit cryptographic hash, via the Crypto API), and
> SHA-1 _still_ won. CRC32 tends to beat the pants off 128-bit non-
> cryptographic hashes simply because those require multiple registers to
> store the state if nothing else; which makes this a rather strong argument
> that _hardware matters a heck of a lot_, quite possibly _more_ than the
> algorithm.

Again, irrelevant. The Spooky2, CityHash256, and Murmur3 hashes that I
am talking about can and do take advantage of CPU architecture. For
128- and 256-bit hashes, one (or more) of those three will be
significantly faster than any crypto hash in the Crypto API,
regardless of the CPU it is run on.

As for the possibility of adding more hash functions to Crypto API for
btrfs to use, I do not believe I have argued against it, so I am not
sure why you repeated the point. It seems to me that is a discussion
that must be had with the maintainer(s) of Crypto API (will they
accept additional non-crypto 128- and 256-bit hash functions, etc.)

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

John Williams wrote:

> On Mon, Dec 1, 2014 at 11:28 AM, Alex Elsayed 
> wrote:
> 
>> I think there's a fundamental set of points being missed.
> 
> That may be true, but it is not me who is missing them.
>> * The Crypto API can be used to access non-cryptographic hashes. Full
>> stop.
> 
> Irrelevant to my point. I am talking about specific non-cryptographic
> hashes, and they are not currently in the Crypto API.

Yes, but they're not anywhere else in the kernel either.

>> * He was comparing CRC32 (a 32-bit non-cryptographic hash, *via the
>> Crypto API*) against SHA-1 (a 160-bit cryptographic hash, via the Crypto
>> API), and SHA-1 _still_ won. CRC32 tends to beat the pants off 128-bit
>> non-cryptographic hashes simply because those require multiple registers
>> to store the state if nothing else; which makes this a rather strong
>> argument that _hardware matters a heck of a lot_, quite possibly _more_
>> than the algorithm.
> 
> Again, irrelevant. The Spooky2, CityHash256, and Murmur3 hashes that I
> am talking about can and do take advantage of CPU architecture. For
> 128- and 256-bit hashes, one (or more) of those three will be
> significantly faster than any crypto hash in the Crypto API,
> regardless of the CPU it is run on.

Sure.

> As for the possibility of adding more hash functions to Crypto API for
> btrfs to use, I do not believe I have argued against it, so I am not
> sure why you repeated the point. It seems to me that is a discussion
> that must be had with the maintainer(s) of Crypto API (will they
> accept additional non-crypto 128- and 256-bit hash functions, etc.)

In that case, I'm not sure what the reason for the thread continuing is? If 
they go in the Crypto API, there's no need to argue against cryptographic 
hashes either - it becomes the user's choice. That's pretty much the entire 
reason I kept responding; I figured that arguing against the cryptographic 
hashes _was_ an objection to the Crypto API, since they're basically a 
freebie for no effort if we use it.


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

Alex Elsayed wrote:

> John Williams wrote:
>> Again, irrelevant. The Spooky2, CityHash256, and Murmur3 hashes that I
>> am talking about can and do take advantage of CPU architecture. For
>> 128- and 256-bit hashes, one (or more) of those three will be
>> significantly faster than any crypto hash in the Crypto API,
>> regardless of the CPU it is run on.
> 
> Sure.

Actually, I said "Sure" here, but this isn't strictly true. At some point, 
you're more memory-bound than CPU-bound, and with CPU intrinsic instructions 
(like SPARC and recent x86 have for SHA) you're often past that. Then, 
you're not going to see any real difference - and the accelerated 
cryptographic hashes may even win out, because the intrinsics may be faster 
(less stuff in the I$, pipelined single instruction beating multiple simpler 
instructions, etc) than the software non-cryptographic hash.
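
Claims like this are easy to test on a given machine. The following is a rough userspace sketch, not a kernel benchmark: it times a few hashes from Python's zlib and hashlib over random blocks. The candidate list, block sizes, and byte budget are arbitrary choices, and for small blocks the interpreter's call overhead will dominate, so only the large-block numbers say much about the hash itself.

```python
import hashlib
import os
import time
import zlib

CANDIDATES = {
    "crc32": zlib.crc32,
    "md5": lambda b: hashlib.md5(b).digest(),
    "sha1": lambda b: hashlib.sha1(b).digest(),
    "sha256": lambda b: hashlib.sha256(b).digest(),
}


def throughput_mib_s(fn, block: bytes, rounds: int) -> float:
    """Hash `block` `rounds` times and return the rate in MiB/s."""
    start = time.perf_counter()
    for _ in range(rounds):
        fn(block)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(block) * rounds / elapsed / 2**20


def compare(block_size: int = 4096, total_bytes: int = 16 * 2**20) -> dict:
    """Measure every candidate over roughly `total_bytes` of input."""
    block = os.urandom(block_size)
    rounds = max(1, total_bytes // block_size)
    return {name: throughput_mib_s(fn, block, rounds)
            for name, fn in CANDIDATES.items()}


if __name__ == "__main__":
    for size in (4096, 1 << 20):
        print(f"block size {size} bytes")
        for name, rate in compare(size).items():
            print(f"  {name:8s} {rate:10.1f} MiB/s")
```

Whether the underlying library uses hardware SHA or CRC instructions depends on the build and the CPU, so results from one machine should not be generalized.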


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Austin S Hemmelgarn


On 2014-12-01 14:34, Alex Elsayed wrote:
> Alex Elsayed wrote:
>
>> * He was comparing CRC32 (a 32-bit non-cryptographic hash, *via the Crypto
>> API*) against SHA-1 (a 160-bit cryptographic hash, via the Crypto API),
>> and SHA-1 _still_ won. CRC32 tends to beat the pants off 128-bit
>> non-cryptographic hashes simply because those require multiple registers to
>> store the state if nothing else; which makes this a rather strong argument
>> that _hardware matters a heck of a lot_, quite possibly _more_ than the
>> algorithm.
>
> Ah, correction - it seems he was comparing his own implementations, rather
> than the Crypto API ones - but the points still hold, seeing as the Crypto
> API does provide both algorithms.

Actually, I did the tests using the userspace interface to the kernel's 
Crypto API.






Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Austin S Hemmelgarn


On 2014-12-01 13:37, David Sterba wrote:
> On Wed, Nov 26, 2014 at 08:58:50AM -0500, Austin S Hemmelgarn wrote:
>> On 2014-11-26 08:38, Brendan Hide wrote:
>>> On 2014/11/25 18:47, David Sterba wrote:
>>>> We could provide an interface for external applications that would make
>>>> use of the strong checksums. Eg. external dedup, integrity db. The
>>>> benefit here is that the checksum is always up to date, so there's no
>>>> need to compute the checksums again. At the obvious cost.
>>>
>>> I can imagine some use-cases where you might even want more than one
>>> algorithm to be used and stored. Not sure if that makes me a madman,
>>> though. ;)
>>
>> Not crazy at all, I would love to have the ability to store multiple
>> different weak but fast hash values.  For example, on my laptop, it is
>> actually faster to compute crc32c, adler32, and md5 hashes together than
>> it is to compute pretty much any 256-bit hash I've tried.
>
> Well, this is doable :) there's space for 256 bits in general, the order of
> checksum bytes in one "checksum word" would be given by the fixed order in
> which the algorithms are defined. The code complexity would increase, but
> not that much I think.
>
>> This then brings up the issue of what to do when we try to mount such a
>> fs on a system that doesn't support some or all of the hashes used.
>
> I see two modes: strict, which fails the mount unless all of the hashes are
> supported, or relaxed by a mount option to accept at least one.
>
> But let's keep this open, I'm not yet convinced that combining more weak
> algos makes sense from the crypto POV. If this should protect against
> random bitflips, would one fast-but-weak be comparable to a combination?
> Or are there other expectations?

My only reasoning is that with this set of hashes (crc32c, adler32, and 
md5), the statistical likelihood of running into a hash collision with 
more than one of them at a time is infinitesimally small compared to the 
likelihood of any one of them having a collision (or even compared to 
something ridiculous like the probability of being killed by a meteor 
strike), and the combination is faster on most systems that I have tried 
than many 256-bit crypto hashes.
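
The arithmetic behind that intuition can be sketched as follows, under an idealized assumption that should be stated loudly: each k-bit hash is treated as an independent, uniformly random function, so a corrupted block slips past it with probability 2^-k. Real CRC and Adler checksums are not random functions and are not independent of each other for structured errors, so the combined figure is a best case, not a guarantee.

```python
from fractions import Fraction

# Digest sizes in bits for the three hashes named above.
BITS = {"crc32c": 32, "adler32": 32, "md5": 128}


def undetected_probability(bits: int) -> Fraction:
    """Chance that a random corruption matches an ideal `bits`-bit hash."""
    return Fraction(1, 2**bits)


def combined_probability(names) -> Fraction:
    """Chance that all hashes match at once, ASSUMING independence."""
    p = Fraction(1)
    for name in names:
        p *= undetected_probability(BITS[name])
    return p


if __name__ == "__main__":
    combo = combined_probability(["crc32c", "adler32", "md5"])
    print("combined:", f"2^-{combo.denominator.bit_length() - 1}")
    print("md5 alone:", f"2^-{BITS['md5']}")
```

Under the independence assumption the three together behave like a single 192-bit check; how close real checksums come to that is exactly the open question.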


It's still a tradeoff though, I also think that the idea mentioned 
elsewhere in this thread of having separate hashes stored for 
subsections of the same block is also worth looking at.





Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread John Williams

On Mon, Dec 1, 2014 at 12:08 PM, Alex Elsayed  wrote:
> Actually, I said "Sure" here, but this isn't strictly true. At some point,
> you're more memory-bound than CPU-bound, and with CPU intrinsic instructions
> (like SPARC and recent x86 have for SHA) you're often past that. Then,
> you're not going to see any real difference - and the accelerated
> cryptographic hashes may even win out, because the intrinsics may be faster
> (less stuff in the I$, pipelined single instruction beating multiple simpler
> instructions, etc) than the software non-cryptographic hash.

In practice, I am skeptical whether any 128- or 256-bit crypto hashes
will be as fast as the non-crypto hashes I mentioned, even on CPUs
with specific instructions for the crypto hashes. The non-crypto
hashes can (and do) take advantage of special CPU instructions as
well.

But even if it is true that the crypto hashes approach the speed of
non-crypto hashes on certain CPUs, that does not provide a strong
argument for using the crypto hashes, since on the common x64 CPUs,
the non-crypto hashes I mentioned are significantly faster than the
equivalent crypto hashes.

So, you have some rare architectures where the crypto hashes may
almost be as fast as the non-crypto, and common CPUs where the
non-crypto are much faster. That makes the non-crypto hash functions I
mentioned the obvious choice in the vast majority of systems.

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread John Williams

On Mon, Dec 1, 2014 at 12:35 PM, Austin S Hemmelgarn
 wrote:
> My only reasoning is that with this set of hashes (crc32c, adler32, and
> md5), the statistical likelihood of running into a hash collision with more
> than one of them at a time is infinitesimally small compared to the
> likelihood of any one of them having a collision (or even compared to
> something ridiculous like the probability of being killed by a meteor
> strike), and the combination is faster on most systems that I have tried
> than many 256-bit crypto hashes.

I have not seen any evidence that combining hashes like that actually
reduces the chances of collision, but if we assume it does, then
again, the non-crypto hashes would be faster. For example, 128-bit
Spooky2 combined with 128-bit CityHash would produce a 256-bit hash
and would be faster than MD5 + whatever.

Kernel lockup: "fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty"

2014-12-01 Thread Bernardo Donadio


Hi,

I'm having fairly frequent kernel lockups caused by btrfs, which I think 
might be a serious bug. I'm using kernel 3.17.3-200.fc20.x86_64. It 
freezes the whole system, and spits the error trace in the journal a few 
seconds later.


Here's the journal log; notice that there are 2 stack traces with little 
time between them:


Dez 01 19:15:12 darwin.donadio.be kernel: ------------[ cut here ]------------
Dez 01 19:15:12 darwin.donadio.be kernel: WARNING: CPU: 3 PID: 494 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x34/0x40 [btrfs]()
Dez 01 19:15:12 darwin.donadio.be kernel: Modules linked in: ufs hfsplus hfs minix vfat msdos fat jfs xfs libcrc32c reiserfs rfcomm fuse xt_CHECKSUM iptable_mangle tun bridge stp llc ip6table_filter ip6_tables ebtable_nat ebtables cfg80211 bnep usblp btusb bluetooth rfkill uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core snd_usb_audio v4l2_common videodev snd_usbmidi_lib snd_rawmidi media joydev kvm_amd kvm serio_raw k10temp edac_core snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic edac_mce_amd snd_hda_intel snd_hda_controller sp5100_tco snd_hda_codec snd_seq i2c_piix4 snd_hwdep snd_seq_device snd_pcm snd_timer snd soundcore shpchp acpi_cpufreq binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs xor raid6_pq r8169 mii radeon i2c_algo_bit drm_kms_helper ttm drm
Dez 01 19:15:12 darwin.donadio.be kernel: CPU: 3 PID: 494 Comm: btrfs-transacti Tainted: GW  3.17.3-200.fc20.x86_64 #1
Dez 01 19:15:12 darwin.donadio.be kernel: Hardware name: ECS A890GXM-A/A890GXM-A, BIOS 080015  03/24/2010
Dez 01 19:15:12 darwin.donadio.be kernel:  314b2122 8802213e3db8 81728acc
Dez 01 19:15:12 darwin.donadio.be kernel:  8802213e3df0 81094e6d 8801e6d78000
Dez 01 19:15:12 darwin.donadio.be kernel:  88022101e800 8802097d9f00
Dez 01 19:15:12 darwin.donadio.be kernel: Call Trace:
Dez 01 19:15:12 darwin.donadio.be kernel:  [] dump_stack+0x45/0x56
Dez 01 19:15:12 darwin.donadio.be kernel:  [] warn_slowpath_common+0x7d/0xa0
Dez 01 19:15:12 darwin.donadio.be kernel:  [] warn_slowpath_null+0x1a/0x20
Dez 01 19:15:12 darwin.donadio.be kernel:  [] btrfs_assert_delayed_root_empty+0x34/0x40 [btrfs]
Dez 01 19:15:12 darwin.donadio.be kernel:  [] btrfs_commit_transaction+0x3a2/0x9c0 [btrfs]
Dez 01 19:15:12 darwin.donadio.be kernel:  [] transaction_kthread+0x1c5/0x250 [btrfs]
Dez 01 19:15:12 darwin.donadio.be kernel:  [] ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
Dez 01 19:15:12 darwin.donadio.be kernel:  [] kthread+0xd8/0xf0
Dez 01 19:15:12 darwin.donadio.be kernel:  [] ? kthread_create_on_node+0x190/0x190
Dez 01 19:15:12 darwin.donadio.be kernel:  [] ret_from_fork+0x7c/0xb0
Dez 01 19:15:12 darwin.donadio.be kernel:  [] ? kthread_create_on_node+0x190/0x190
Dez 01 19:15:12 darwin.donadio.be kernel: ---[ end trace 81e909e70c8984e0 ]---
Dez 01 19:15:42 darwin.donadio.be kernel: ------------[ cut here ]------------
Dez 01 19:15:42 darwin.donadio.be kernel: WARNING: CPU: 2 PID: 494 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x34/0x40 [btrfs]()
Dez 01 19:15:42 darwin.donadio.be kernel: Modules linked in: ufs hfsplus hfs minix vfat msdos fat jfs xfs libcrc32c reiserfs rfcomm fuse xt_CHECKSUM iptable_mangle tun bridge stp llc ip6table_filter ip6_tables ebtable_nat ebtables cfg80211 bnep usblp btusb bluetooth rfkill uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core snd_usb_audio v4l2_common videodev snd_usbmidi_lib snd_rawmidi media joydev kvm_amd kvm serio_raw k10temp edac_core snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic edac_mce_amd snd_hda_intel snd_hda_controller sp5100_tco snd_hda_codec snd_seq i2c_piix4 snd_hwdep snd_seq_device snd_pcm snd_timer snd soundcore shpchp acpi_cpufreq binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs xor raid6_pq r8169 mii radeon i2c_algo_bit drm_kms_helper ttm drm
Dez 01 19:15:42 darwin.donadio.be kernel: CPU: 2 PID: 494 Comm: btrfs-transacti Tainted: GW  3.17.3-200.fc20.x86_64 #1
Dez 01 19:15:42 darwin.donadio.be kernel: Hardware name: ECS A890GXM-A/A890GXM-A, BIOS 080015  03/24/2010
Dez 01 19:15:42 darwin.donadio.be kernel:  314b2122 8802213e3db8 81728acc
Dez 01 19:15:42 darwin.donadio.be kernel:  8802213e3df0 81094e6d 8801eb8fc140
Dez 01 19:15:42 darwin.donadio.be kernel:  88022101e800 88009da84b40
Dez 01 19:15:42 darwin.donadio.be kernel: Call Trace:
Dez 01 19:15:42 darwin.donadio.be kernel:  [] dump_stack+0x45/0x56
Dez 01 19:15:42 darwin.donadio.be kernel:  [] warn_slowpath_common+0x7d/0xa0
Dez 01 19:15:42 darwin.donadio.be kernel:  [] warn_slowpath_null+0x1a/0x20
Dez 01 19:15:42 darwin.donadio.be kernel:  [] btrfs_assert_delayed_root_empty+0x34/0x40 [btrfs]
Dez 01 19:15:42

Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-01 Thread Konstantin


MegaBrutal wrote on 01.12.2014 at 13:56:
> Hi all,
>
> I've reported the bug I've previously posted about in "BTRFS messes up
> snapshot LV with origin" in the Kernel Bug Tracker.
> https://bugzilla.kernel.org/show_bug.cgi?id=89121
Hi MegaBrutal. If I understand your report correctly, I can give you
another example where this bug is appearing. It is so bad that it leads
to freezing the system and I'm quite sure it's the same thing. I was
thinking about filing a bug but didn't have the time for that yet. Maybe
you could add this case to your bug report as well.

The bug also appears when using mdadm RAID1: when one of the drives is
detached from the array, the OS discovers it and after a while (not
immediately, it takes several minutes) it shows up in /proc/mounts:
instead of /dev/md0p1 I see /dev/sdb1 there. Usually after an hour or
so (depending on system workload) the PC freezes completely. So whatever
the outcome of the discussion about the uniqueness of UUIDs, a crashing
kernel tells me that there is a serious bug.

While in my case detaching was intentional, there are several realistic
ways a RAID1 disk can get detached, and currently this crashes the
server when using BTRFS. That's not what is intended when using
RAID ;-).

In my case I wanted to do something which was working perfectly all the
years before with all other file systems - checking the file system of
the root disk while the server is running. The procedure is simple:

1. detach one of the disks
2. do fsck on the disk device
3. mdadm --zero-superblock on the device so it gets completely rewritten
4. mdadm --add it to the array
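
The four steps can be written down as a dry-run command plan. This is
only a sketch: the device names (/dev/md0, /dev/sdb, /dev/sdb1) are
assumptions based loosely on the example above, "btrfs check" stands in
for whichever fsck is used, the function name is made up for
illustration, and nothing is executed.

```python
def raid1_offline_check_plan(array, member, fs_device):
    """Return the detach/check/re-add procedure as a list of argv lists.

    Nothing is run here; the caller decides whether to print or execute.
    All device names are supplied by the caller and are assumptions.
    """
    return [
        # 1. detach one disk: mark it failed, then remove it from the array
        ["mdadm", array, "--fail", member, "--remove", member],
        # 2. check the file system on the detached device
        ["btrfs", "check", fs_device],
        # 3. wipe the md superblock so the member gets completely rewritten
        ["mdadm", "--zero-superblock", member],
        # 4. add the device back to the array
        ["mdadm", array, "--add", member],
    ]


if __name__ == "__main__":
    for argv in raid1_offline_check_plan("/dev/md0", "/dev/sdb", "/dev/sdb1"):
        print(" ".join(argv))
```

Keeping the plan as data makes it easy to review the exact commands
before anything touches the array.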

There were some surprises with BTRFS: if step 2 is not done directly
after step 1, btrfsck refuses to check the disk because /proc/mounts
reports it as mounted. And during step 2, or even after it had finished,
the system froze. If I managed to get to step 4 fast enough everything
was OK, but again, that's not what I expect from a good operating
system. Any objections?

Konstantin


Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-01 Thread MegaBrutal

2014-12-01 18:27 GMT+01:00 Robert White :
> On 12/01/2014 04:56 AM, MegaBrutal wrote:
>>
>> Since the other thread went off into theoretical debates about UUIDs
>> and their generic relation to BTRFS, their everyday use cases, and the
>> philosophical meaning behind uniqueness of copies and UUIDs; I'd like
>> to specifically ask you to only post here about the ACTUAL problem at
>> hand. Don't get me wrong, I find the discussion in the other thread
>> really interesting, I'm following it, but it is only very remotely
>> related to the original issue, so please keep it there! If you're
>> interested to catch up about the actual bug symptoms, please read the
>> bug report linked above, and (optionally) reproduce the problem
>> yourself!
>
>
> That discussion _was_ the actual discussion of the actual problem. A problem
> that is not particularly theoretical, a problem that is common to
> block-level snapshots, and a discussion that contained the actual
> work-arounds.
>
> I suggest a re-read. 8-)
>

The majority of the discussion was about how the kernel should react
UPON mounting a file system when more than one device with the same UUID
exists on the system. While that is a legitimate problem worth
discussing and mitigating, it is not the same situation as how the
kernel behaves when an identical device appears WHILE the file system is
already mounted.

Actually, I would not identify devices by UUIDs when I know that
duplicates could exist due to snapshots, therefore I mount devices by
LVM paths. And when a file system is already mounted with all its
devices, that is a clear situation: all devices are open and locked by
the kernel, any mixup at that point is an error. What is the case with
multiple-device file systems? Supply all their devices with device=
mount options. Just don't identify devices by UUIDs when you know
there could be duplicates. Use UUIDs when you don't use LVM.
Identifying file systems by UUIDs was invented because classic
/dev/sdXX device names might change. But LVM names don't change. They
only change when you intentionally change them e.g. with lvrename.

Since having duplicate UUIDs on devices is not a problem for me since
I can tell them apart by LVM names, the discussion is of little
relevance to my use case. Of course it's interesting and I like to
read it along, it is not about the actual problem at hand.

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

John Williams wrote:

> On Mon, Dec 1, 2014 at 12:08 PM, Alex Elsayed 
> wrote:
>> Actually, I said "Sure" here, but this isn't strictly true. At some
>> point, you're more memory-bound than CPU-bound, and with CPU intrinsic
>> instructions (like SPARC and recent x86 have for SHA) you're often past
>> that. Then, you're not going to see any real difference - and the
>> accelerated cryptographic hashes may even win out, because the intrinsics
>> may be faster (less stuff of the I$, pipelined single instruction beating
>> multiple simpler instructions, etc) than the software non-cryptographic
>> hash.
> 
> In practice, I am skeptical whether any 128- or 256-bit crypto hashes
> will be as fast as the non-crypto hashes I mentioned, even on CPUs
> with specific instructions for the crypto hashes. The non-crypto
> hashes can (and do) take advantage of special CPU instructions as
> well.
> 
> But even if true that the crypto hashes approach the speed of
> non-crypto hashes on certain CPUs, that does not provide a strong
> argument for using the crypto hashes, since on the common x64 CPUs,
> the non-crypto hashes I mentioned are significantly faster than the
> equivalent crypto hashes.
> 
> So, you have some rare architectures where the crypto hashes may
> almost be as fast as the non-crypto, and common CPUs where the
> non-crypto are much faster. That makes the non-crypto hash functions I
> mentioned the obvious choice in the vast majority of systems.

And as I said upthread, one benefit of the Crypto API is that the filesystem 
developers _no longer have to choose_. By using the shash or ahash interface 
to the Crypto API, the _user_ can choose *any* hash the kernel supports. And 
the default is (and will almost certainly continue to be) crc32, so the user 
would need to specify a hash anyway - making whether some other non-
cryptographic hash is the "obvious choice" a completely moot point.
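
As a userspace analogy only (this is Python's hashlib, not the kernel
Crypto API or its shash/ahash interfaces), selecting the algorithm by
name at run time might look like the sketch below. The function name
and the "crc32" default are assumptions for illustration; zlib's CRC-32
stands in for the default and is not the CRC-32C that btrfs uses.

```python
import hashlib
import zlib


def checksum(data, algo="crc32"):
    """Checksum `data` with an algorithm chosen by name at run time."""
    if algo == "crc32":
        # stand-in default; zlib implements CRC-32, not CRC-32C
        return zlib.crc32(data).to_bytes(4, "big")
    # any other name is looked up in hashlib, e.g. "sha1" or "sha256"
    return hashlib.new(algo, data).digest()


if __name__ == "__main__":
    block = b"\0" * 4096
    for name in ("crc32", "sha1", "sha256"):
        print(name, checksum(block, name).hex())
```

The caller of checksum() never hard-codes an algorithm, which is the
point being made about leaving the choice to the user.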


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

John Williams wrote:

> On Mon, Dec 1, 2014 at 12:08 PM, Alex Elsayed 
> wrote:
>> Actually, I said "Sure" here, but this isn't strictly true. At some
>> point, you're more memory-bound than CPU-bound, and with CPU intrinsic
>> instructions (like SPARC and recent x86 have for SHA) you're often past
>> that. Then, you're not going to see any real difference - and the
>> accelerated cryptographic hashes may even win out, because the intrinsics
>> may be faster (less stuff of the I$, pipelined single instruction beating
>> multiple simpler instructions, etc) than the software non-cryptographic
>> hash.
> 
> In practice, I am skeptical whether any 128- or 256-bit crypto hashes
> will be as fast as the non-crypto hashes I mentioned, even on CPUs
> with specific instructions for the crypto hashes. The non-crypto
> hashes can (and do) take advantage of special CPU instructions as
> well.
> 
> But even if true that the crypto hashes approach the speed of
> non-crypto hashes on certain CPUs, that does not provide a strong
> argument for using the crypto hashes, since on the common x64 CPUs,
> the non-crypto hashes I mentioned are significantly faster than the
> equivalent crypto hashes.
> 
> So, you have some rare architectures where the crypto hashes may
> almost be as fast as the non-crypto, and common CPUs where the
> non-crypto are much faster. That makes the non-crypto hash functions I
> mentioned the obvious choice in the vast majority of systems.

Incidentally, you can be 'skeptical' all you like - per Austin's message 
upthread, he was testing the Crypto API. Thus, skeptical as you may be, hard 
evidence shows that SHA-1 was equal to or faster than CRC32, which is 
unequivocally simpler and faster than CityHash (though CityHash comes 
close).

And the CPUs in question are *not* particularly rare - Intel since Sandy 
Bridge or so, the majority of SPARC systems, a goodly number of ARM systems 
via coprocessors...



Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

John Williams wrote:

> On Mon, Dec 1, 2014 at 12:35 PM, Austin S Hemmelgarn
>  wrote:
>> My only reasoning is that with this set of hashes (crc32c, adler32, and
>> md5), the statistical likely-hood of running into a hash collision with
>> more than one of them at a time is infinitesimally small compared to the
>> likely-hood of any one of them having a collision (or even compared to
>> something ridiculous like the probability of being killed by a meteor
>> strike), and the combination is faster on most systems that I have tried
>> than many 256-bit crypto hashes.
> 
> I have not seen any evidence that combining hashes like that actually
> reduces the chances of collision, but if we assume it does, then
> again, the non-crypto hashes would be faster. For example, 128-bit
> Spooky2 combined with 128-bit CityHash would produce a 256-bit hash
> and would be faster than MD5 + whatever.

It has no real benefit, but _why_ depends on what your model is.

There's a saying that engineers worry about stochastic failure; security 
professionals have to worry about malicious failure.

If your only concern is stochastic failure (random bitflips, etc), then the 
chances of collision with 128-bit CityHash or MurmurHash or SipHash or what-
have-you are already so small that every single component in your laptop 
dying simultaneously is more likely. Adding another hash is thus just a 
waste of cycles.
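
To put a rough number on "so small", here is a sketch of the standard
birthday-bound estimate, assuming the hash output behaves like uniformly
random bits. The block count of 2**40 (about 4 PiB of 4 KiB blocks) is
an arbitrary assumption chosen for illustration.

```python
import math


def collision_probability(num_blocks, hash_bits):
    """Birthday-bound estimate that any two of num_blocks hashes collide."""
    pairs = num_blocks * (num_blocks - 1) / 2
    # 1 - exp(-pairs / 2^bits); expm1 keeps precision when the result is tiny
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    blocks = 2 ** 40   # assumed: ~4 PiB of 4 KiB blocks
    for bits in (32, 64, 128):
        print("%3d-bit hash: %.3g" % (bits, collision_probability(blocks, bits)))
```

For a 128-bit hash the estimate is on the order of 1e-15 even at that
size, while a 32-bit checksum is certain to have colliding pairs.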

If your concern is malicious failure (in-band deduplication attack or 
similar, ignoring for now that btrfs actually compares the extent data as 
well IIRC), then it's well-known in the cryptographic community that the 
concatenation of multiple hashes is as strong as the strongest hash, _but no 
stronger_ [1].

Since the strongest hash in the above list is either a non-cryptographic 
hash or MD5, which is known-weak to the point of there being numerous toy 
programs finding collisions for arbitrary data, it would not be worth much.

The only place this might be of use is if you used N strong/unbroken hashes, 
in order to hedge against up to N-1 of them being broken. However, the gain 
of that is (again) infinitesimal, and the performance cost quite large 
indeed.
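
For reference, the kind of concatenation being discussed could be
sketched as below. This is an illustration only: zlib's CRC-32 stands in
for crc32c (which the Python standard library does not provide), and the
function name is made up. It shows that concatenation widens the output
to 192 bits; it says nothing about the result being any stronger.

```python
import hashlib
import zlib


def combined_checksum(data):
    """Concatenate three independent checksums of the same data."""
    crc = zlib.crc32(data).to_bytes(4, "big")       # stand-in for crc32c
    adler = zlib.adler32(data).to_bytes(4, "big")   # 4 bytes
    md5 = hashlib.md5(data).digest()                # 16 bytes
    return crc + adler + md5


if __name__ == "__main__":
    digest = combined_checksum(b"\0" * 4096)
    print(len(digest) * 8, "bits:", digest.hex())
```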

[1] http://eprint.iacr.org/2008/075


Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-01 Thread Robert White


On 12/01/2014 02:10 PM, MegaBrutal wrote:
> Since having duplicate UUIDs on devices is not a problem for me since
> I can tell them apart by LVM names, the discussion is of little
> relevance to my use case. Of course it's interesting and I like to
> read it along, it is not about the actual problem at hand.

Which is why you use the device= mount option, which would take LVM
names and which was repeatedly discussed as solving this very problem.

Once you decide to duplicate the UUIDs with LVM snapshots you take up
the burden of disambiguating your storage.

Which is part of why re-reading was suggested as this was covered in
some depth and _is_ _exactly_ about the problem at hand.


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread John Williams

On Mon, Dec 1, 2014 at 3:05 PM, Alex Elsayed  wrote:
> Incidentally, you can be 'skeptical' all you like - per Austin's message
> upthread, he was testing the Crypto API. Thus, skeptical as you may be, hard
> evidence shows that SHA-1 was equal to or faster than CRC32, which is
> unequivocally simpler and faster than CityHash (though CityHash comes
> close).
>
> And the CPUs in question are *not* particularly rare - Intel since Sandy
> Bridge or so, the majority of SPARC systems, a goodly number of ARM systems
> via coprocessors...

You can make convoluted, incorrect claims all you like, but the fact
is that SHA-1 is not as fast as Spooky2 or CityHash128 on x64 Intel
CPUs, and Murmur3 is faster on ARM systems. And it is not even close.
Your claims are absurd.

Re: [PATCH RFC v2] btrfs: add sysfs layout to show volume info

2014-12-01 Thread anand jain



Hi Goffredo,

 inline below..

On 02/12/2014 01:29, Goffredo Baroncelli wrote:

Hi Anand,

On 12/01/2014 06:33 PM, Anand Jain wrote:

From: Anand Jain 

Not yet ready for integration, but for review and testing of the new sysfs 
layout
which is currently under /sys/fs/btrfs/by_fsid

This patch makes btrfs_fs_devices and btrfs_device information readable
from sysfs. This uses the sysfs group visible entry point to mark
certain attributes visible/hidden depending on the FS state (mounted/unmounted).

The new layout is as shown below.

/sys/fs/btrfs/by_fsid*
  ./7b047f4d-c2ce-4f22-94a3-68c09057f1bf*
    status
    fsid*
    missing_devices
    num_devices*
    open_devices
    opened*
    rotating
    rw_devices
    seeding
    total_devices*
    total_rw_bytes
    ./e6701882-220a-4416-98ac-a99f095bddcc*
      active_pending
      bdev
      bytes_used
      can_discard
      devid*
      dev_root_fsid
      devstats_valid
      dev_totalbytes
      generation*
      in_fs_metadata
      io_align
      io_width
      missing
      name*
      nobarriers
      replace_tgtdev
      sector_size
      total_bytes
      type
      uuid*
      writeable

(* indicates that attribute will be visible even when device is
unmounted but registered with btrfs kernel)
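
A script consuming such a layout might look like the sketch below. It
assumes the proposed tree is a per-fsid directory holding one-value
attribute files plus per-device subdirectories; that reading of the
layout, the by_fsid path (taken from this RFC and not a merged
interface), and the function name are all assumptions.

```python
import os


def read_sysfs_tree(root):
    """Read a directory of one-value-per-file attributes into nested dicts."""
    tree = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            tree[entry] = read_sysfs_tree(path)   # e.g. a per-device directory
        else:
            try:
                with open(path) as f:
                    tree[entry] = f.read().strip()
            except OSError:
                tree[entry] = None                # hidden or unreadable attribute
    return tree


if __name__ == "__main__":
    base = "/sys/fs/btrfs/by_fsid"   # path from the RFC; absent on other kernels
    if os.path.isdir(base):
        for fsid, attrs in read_sysfs_tree(base).items():
            if isinstance(attrs, dict):
                print(fsid, attrs.get("num_devices"))
```

Symlinks are deliberately not followed, so by-uuid style links like the
ones suggested in the reply would not cause the walk to loop.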


Thanks, for working on that; I really like the idea to export more information.
- it is possible to put the device uuid under a directory like: by_dev_uuid/,
this will help the parsing via script
- it is possible to make a directory under /sys/fs/btrfs/by_dev_uuid where
a link links to the related device; i.e.:
/sys/fs/btrfs/by_dev_uuid/e6701882-220a-4416-98ac-a99f095bddcc ->

../by_fsid/7b047f4d-c2ce-4f22-94a3-68c09057f1bf/by_dev_uuid/e6701882-220a-4416-98ac-a99f095bddc


This would help to know which devices are registered by the kernel



Firstly we want the actual file layout, so that we can create further
links as we find suitable. It can be done.




The old kobject  will be merged into this new 'by_fsid' kobject,
so that older attributes under  and newer attributes under by_fsid
will be merged together as well.


Would it be fully backward compatible? I really like your layout more
than the current one, but I think that the current sysfs is like a
binary API and so it has to be maintained forever


That was a big challenge in this whole effort; yes, it will be backward
compatible.


Thanks, Anand




v2: added support for device add/delete/replace
 rebase on the latest integration branch

Signed-off-by: Anand Jain 
---
  fs/btrfs/dev-replace.c |   7 +
  fs/btrfs/super.c   |  15 ++
  fs/btrfs/sysfs.c   | 383 +
  fs/btrfs/sysfs.h   |   6 +
  fs/btrfs/volumes.c |  42 ++
  fs/btrfs/volumes.h |   6 +
  6 files changed, 459 insertions(+)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 715a115..31ce3a9 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -474,6 +474,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info 
*fs_info,
u8 uuid_tmp[BTRFS_UUID_SIZE];
struct btrfs_trans_handle *trans;
int ret = 0;
+   char uuid_buf[BTRFS_UUID_UNPARSED_SIZE];

/* don't allow cancel or unmount to disturb the finishing procedure */
mutex_lock(&dev_replace->lock_finishing_cancel_unmount);
@@ -595,7 +596,13 @@ static int btrfs_dev_replace_finishing(struct 
btrfs_fs_info *fs_info,
/* replace the sysfs entry */
btrfs_kobj_rm_device(fs_info, src_device);
btrfs_kobj_add_device(fs_info, tgt_device);
+   btrfs_destroy_dev_sysfs(src_device);
btrfs_rm_dev_replace_free_srcdev(fs_info, src_device);
+   snprintf(uuid_buf, BTRFS_UUID_UNPARSED_SIZE, "%pU",
+   tgt_device->uuid);
+   if (kobject_rename(&tgt_device->dev_kobj, uuid_buf))
+   printk(KERN_ERR "BTRFS: sysfs uuid %s rename error\n",
+   uuid_buf);

/* write back the superblocks */
trans = btrfs_start_transaction(root, 0);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 017d92d..918eb9d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1389,6 +1389,11 @@ static struct dentry *btrfs_mount(struct 
file_system_type *fs_type, int flags,
goto error_sec_opts;
}

+   error = btrfs_update_by_fsid_sysfs_group(fs_devices);
+   if (error)
+

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

John Williams wrote:

> On Mon, Dec 1, 2014 at 3:05 PM, Alex Elsayed  wrote:
>> Incidentally, you can be 'skeptical' all you like - per Austin's message
>> upthread, he was testing the Crypto API. Thus, skeptical as you may be,
>> hard evidence shows that SHA-1 was equal to or faster than CRC32, which
>> is unequivocally simpler and faster than CityHash (though CityHash comes
>> close).
>>
>> And the CPUs in question are *not* particularly rare - Intel since Sandy
>> Bridge or so, the majority of SPARC systems, a goodly number of ARM
>> systems via coprocessors...
> 
> You can make convoluted, incorrect claims all you like, but the fact
> is that SHA-1 is not as fast as Spooky2 or CityHash128 on x64 Intel
> CPUs, and Murmur3 is faster on ARM systems. And it is not even close.
> Your claims are absurd.
And that is t



Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread John Williams

On Mon, Dec 1, 2014 at 3:05 PM, Alex Elsayed  wrote:
> hard evidence shows that SHA-1 was equal to or faster than CRC32, which is
> unequivocally simpler and faster than CityHash (though CityHash comes
> close).
>
> And the CPUs in question are *not* particularly rare - Intel since Sandy
> Bridge or so, the majority of SPARC systems, a goodly number of ARM systems
> via coprocessors...

By the way, your "hard evidence" is imaginary.

Here you can see that SHA-1 is about 5 cycles per byte on Sandybridge:

https://blake2.net/

While SpookyHash (and CityHash) are about 3 bytes per cycle (on long
keys) which is about 0.33 cycles per byte. More than 10 times faster
than SHA-1.

http://burtleburtle.net/bob/hash/spooky.html
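
The unit conversion behind that comparison, as a small sketch. The two
input figures are the ones quoted above; the function names are made up
for illustration.

```python
def cycles_per_byte(bytes_per_cycle):
    """Convert a bytes/cycle rate into cycles/byte."""
    return 1.0 / bytes_per_cycle


def speed_ratio(slow_cpb, fast_cpb):
    """How many times faster the hash with the lower cycles/byte is."""
    return slow_cpb / fast_cpb


if __name__ == "__main__":
    sha1_cpb = 5.0                      # SHA-1 figure cited above
    spooky_cpb = cycles_per_byte(3.0)   # about 3 bytes/cycle on long keys
    print("SpookyHash: %.2f cycles/byte" % spooky_cpb)
    print("ratio: %.1f" % speed_ratio(sha1_cpb, spooky_cpb))
```

With these inputs the ratio comes out at about 15, consistent with
"more than 10 times faster".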

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

John Williams wrote:

> On Mon, Dec 1, 2014 at 3:05 PM, Alex Elsayed  wrote:
>> Incidentally, you can be 'skeptical' all you like - per Austin's message
>> upthread, he was testing the Crypto API. Thus, skeptical as you may be,
>> hard evidence shows that SHA-1 was equal to or faster than CRC32, which
>> is unequivocally simpler and faster than CityHash (though CityHash comes
>> close).
>>
>> And the CPUs in question are *not* particularly rare - Intel since Sandy
>> Bridge or so, the majority of SPARC systems, a goodly number of ARM
>> systems via coprocessors...
> 
> You can make convoluted, incorrect claims all you like, but the fact
> is that SHA-1 is not as fast as Spooky2 or CityHash128 on x64 Intel
> CPUs, and Murmur3 is faster on ARM systems. And it is not even close.
> Your claims are absurd.

And that _is_ the case; they are faster... *when both are software 
implementations*

And I'm not sure what is "convoluted" or "incorrect" about saying "Look, 
empirical evidence!"


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread John Williams

On Mon, Dec 1, 2014 at 3:46 PM, Alex Elsayed  wrote:
> And I'm not sure what is "convoluted" or "incorrect" about saying "Look,
> empirical evidence!"

No empirical evidence of the speed of SpookyHash or CityHash versus
SHA-1 was cited. The only empirical data mentioned was on an
UltraSPARC CPU, and did not include any SpookyHash or CityHash
measurements, and yet you made a claim about the speeds on Intel and
ARM CPUs.

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

John Williams wrote:

> On Mon, Dec 1, 2014 at 3:05 PM, Alex Elsayed  wrote:
>> hard evidence shows that SHA-1 was equal to or faster than CRC32, which
>> is unequivocally simpler and faster than CityHash (though CityHash comes
>> close).
>>
>> And the CPUs in question are *not* particularly rare - Intel since Sandy
>> Bridge or so, the majority of SPARC systems, a goodly number of ARM
>> systems via coprocessors...
> 
> By the way, your "hard evidence" is imaginary.
> 
> Here you can see that SHA-1 is about 5 cycles per byte on Sandybridge:
> 
> https://blake2.net/
> 
> While SpookyHash (and CityHash) are about 3 bytes per cycle (on long
> keys) which is about 0.33 cycles per byte. More than 10 times faster
> than SHA-1.
> 
> http://burtleburtle.net/bob/hash/spooky.html

On further examination, I did indeed make a mistake - the hardware 
acceleration for SHA on Intel will be in Skylake; only the AES acceleration 
was added in Sandy Bridge. So you are correct to some degree with the rarity 
argument.

However, performance-wise, that means SHA-1 on Intel is still a software 
implementation. Let's look at ARMv8.

The ARM v8 architecture added a few cryptographic instructions, including 
for SHA-1. The results:

https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha1-armv8.pl

#                 hardware-assisted     software(*)
# Apple A7        2.31                  4.13 (+14%)
# Cortex-A53      2.19                  8.73 (+108%)
# Cortex-A57      2.35                  7.88 (+74%)

From the CityHash readme, on a Xeon X5550 (which is _considerably_ more 
powerful than any of the above):

On a single core of a 2.67GHz Intel Xeon X5550, CityHashCrc256 peaks at 
about 5 to 5.5 bytes/cycle. The other CityHashCrc functions are wrappers 
around CityHashCrc256 and should have similar performance on long strings.
(CityHashCrc256 in v1.0.3 was even faster, but we decided it wasn't as 
thorough as it should be.) CityHash128 peaks at about 4.3 bytes/cycle. The 
fastest Murmur variant on that hardware, Murmur3F, peaks at about 2.4 
bytes/cycle. We expect the peak speed of CityHash128 to dominate CityHash64, 
which is aimed more toward short strings or use in hash tables.

So CityHash is - at best - half as fast as SHA1 with acceleration.

In fact, on the Apple A7, it would likely be slower than _software_ SHA-1.


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread John Williams

On Mon, Dec 1, 2014 at 3:46 PM, Alex Elsayed  wrote:

> And that _is_ the case; they are faster... *when both are software
> implementations*

They are also faster when both are optimized to use special
instructions of the CPU.

According to this Intel whitepaper, SHA-1 does not achieve less than 1
cycle/byte in any of the situations they tested:

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/haswell-cryptographic-performance-paper.pdf

SpookyHash and CityHash obtain better than 0.5 cycle/byte, and in the
case of CityHash256, better than 0.2 cycle/byte

https://code.google.com/p/cityhash/source/browse/trunk/README

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

Alex Elsayed wrote:

> So CityHash is - at best - half as fast as SHA1 with acceleration.
> 
> In fact, on the Apple A7, it would likely be slower than _software_ SHA-1.

Argh, ignore this. The CityHash readme is in bytes/cycle, which I missed on 
first readthrough (why on earth they are not using either MB/s for rate, or 
cycles/byte, eludes me completely.)
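
Putting both units on a common MB/s scale avoids this kind of mix-up. A
rough sketch: the 2.67 GHz clock and the 4.3 bytes/cycle figure come
from the CityHash readme quoted earlier; applying the same clock to the
2.31 cycles/byte ARMv8 figure is purely an assumption to compare units,
since the real ARM parts run at different clocks.

```python
def mb_per_s_from_bytes_per_cycle(clock_hz, bytes_per_cycle):
    """Throughput in MB/s (10^6 bytes) for a rate given in bytes/cycle."""
    return clock_hz * bytes_per_cycle / 1e6


def mb_per_s_from_cycles_per_byte(clock_hz, cycles_per_byte):
    """Throughput in MB/s (10^6 bytes) for a rate given in cycles/byte."""
    return clock_hz / cycles_per_byte / 1e6


if __name__ == "__main__":
    clock_hz = 2.67e9   # Xeon X5550 clock from the CityHash readme
    print("4.3 bytes/cycle:  %.0f MB/s"
          % mb_per_s_from_bytes_per_cycle(clock_hz, 4.3))
    # same clock assumed here only so that the two units can be compared
    print("2.31 cycles/byte: %.0f MB/s"
          % mb_per_s_from_cycles_per_byte(clock_hz, 2.31))
```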


Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-01 Thread MegaBrutal

2014-12-02 0:24 GMT+01:00 Robert White :
> On 12/01/2014 02:10 PM, MegaBrutal wrote:
>>
>> Since having duplicate UUIDs on devices is not a problem for me since
>> I can tell them apart by LVM names, the discussion is of little
>> relevance to my use case. Of course it's interesting and I like to
>> read it along, it is not about the actual problem at hand.
>>
>
> Which is why you use the device= mount option, which would take LVM names
> and which was repeatedly discussed as solving this very problem.
>
> Once you decide to duplicate the UUIDs with LVM snapshots you take up the
> burden of disambiguating your storage.
>
> Which is part of why re-reading was suggested as this was covered in some
> depth and _is_ _exactly_ about the problem at hand.

Nope.

root@reproduce-1391429:~# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.18.0-031800rc5-generic
root=/dev/mapper/vg-rootlv ro
rootflags=device=/dev/mapper/vg-rootlv,subvol=@

Observe, device= mount option is added.


root@reproduce-1391429:~# ./reproduce-1391429.sh
#!/bin/sh -v
lvs
  LV VG   Attr  LSize   Pool Origin Data%  Move Log Copy%  Convert
  rootlv vg   -wi-ao---   1.00g
  swap0  vg   -wi-ao--- 256.00m

grub-probe --target=device /
/dev/mapper/vg-rootlv

grep " / " /proc/mounts
rootfs / rootfs rw 0 0
/dev/dm-1 / btrfs rw,relatime,space_cache 0 0

lvcreate --snapshot --size=128M --name z vg/rootlv
  Logical volume "z" created

lvs
  LV VG   Attr  LSize   Pool Origin Data%  Move Log Copy%  Convert
  rootlv vg   owi-aos--   1.00g
  swap0  vg   -wi-ao--- 256.00m
  z  vg   swi-a-s-- 128.00m  rootlv   0.11

ls -l /dev/vg/
total 0
lrwxrwxrwx 1 root root 7 Dec  2 00:12 rootlv -> ../dm-1
lrwxrwxrwx 1 root root 7 Dec  2 00:12 swap0 -> ../dm-0
lrwxrwxrwx 1 root root 7 Dec  2 00:12 z -> ../dm-2

grub-probe --target=device /
/dev/mapper/vg-z

grep " / " /proc/mounts
rootfs / rootfs rw 0 0
/dev/dm-2 / btrfs rw,relatime,space_cache 0 0

lvremove --force vg/z
  Logical volume "z" successfully removed

grub-probe --target=device /
/dev/mapper/vg-rootlv

grep " / " /proc/mounts
rootfs / rootfs rw 0 0
/dev/dm-1 / btrfs rw,relatime,space_cache 0 0


Problem still reproduces.
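
The symptom can also be checked mechanically by comparing which source
device /proc/mounts reports for a mount point before and after the
snapshot is created. A minimal sketch; the function names are made up,
and the sample lines are the ones from the transcript above.

```python
def mount_sources(mounts_text, mountpoint):
    """Return the source devices listed for mountpoint, skipping rootfs."""
    sources = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mountpoint and fields[2] != "rootfs":
            sources.append(fields[0])
    return sources


def source_changed(before, after, mountpoint="/"):
    """True if the reported source device for mountpoint differs."""
    return mount_sources(before, mountpoint) != mount_sources(after, mountpoint)


if __name__ == "__main__":
    before = "rootfs / rootfs rw 0 0\n/dev/dm-1 / btrfs rw,relatime,space_cache 0 0\n"
    after = "rootfs / rootfs rw 0 0\n/dev/dm-2 / btrfs rw,relatime,space_cache 0 0\n"
    print(mount_sources(before, "/"), mount_sources(after, "/"))
    print("changed:", source_changed(before, after))
```

On a live system the two snapshots of the text would come from reading
/proc/mounts before and after running lvcreate.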

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread John Williams

On Mon, Dec 1, 2014 at 4:06 PM, Alex Elsayed  wrote:
> https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha1-armv8.pl
>
> #                 hardware-assisted     software(*)
> # Apple A7        2.31                  4.13 (+14%)
> # Cortex-A53      2.19                  8.73 (+108%)
> # Cortex-A57      2.35                  7.88 (+74%)


Note that those are showing 2 cycles per byte.

> From the CityHash readme, on a Xeon X5550 (which is _considerably_ more
> powerful than any of the above):
>
> On a single core of a 2.67GHz Intel Xeon X5550, CityHashCrc256 peaks at
> about 5 to 5.5 bytes/cycle.

5 bytes per cycle is 0.2 cycles per byte. So your own citation shows
that CityHash is 10 times faster.

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

John Williams wrote:

> On Mon, Dec 1, 2014 at 3:46 PM, Alex Elsayed  wrote:
>> And I'm not sure what is "convoluted" or "incorrect" about saying "Look,
>> empirical evidence!"
> 
> No empirical evidence of the speed of SpookyHash or CityHash versus
> SHA-1 was cited. The only empirical data mentioned was on an
> UltraSPARC CPU, and did not include any SpookyHash or CityHash
> measurements, and yet you made a claim about the speeds on Intel and
> ARM CPUs.

There's a thing called the transitive property. When CRC32 is faster than 
SpookyHash and CityHash (while admittedly weaker), and SHA-1 on SPARC is 
faster than CRC32, there are comparisons that can be made.

And what I've been trying to say this whole time is not some point about an 
individual architecture.

It's that the flat assertion that "CityHash/SpookyHash/etc is always faster" 
is _unwarranted_, as hardware acceleration _has a huge effect_.

On SPARC, it's empirically enough for SHA-1 to match CRC32.
On ARMv8, it brings SHA-1 from 4-8 cycles per byte down to _2_.
On Intel, when the Skylake SHA extensions land, it will likely have an 
enormous impact as well.
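
The size of the ARMv8 effect can be read off the sha1-armv8.pl table
quoted earlier in the thread. A small sketch, reading the first row of
that table as Apple A7 at 2.31 cycles/byte; the dictionary and function
names are made up for illustration.

```python
# (hardware-assisted, software) SHA-1 cycles per byte, from the quoted table
ARMV8_SHA1_CPB = {
    "Apple A7": (2.31, 4.13),
    "Cortex-A53": (2.19, 8.73),
    "Cortex-A57": (2.35, 7.88),
}


def acceleration_factor(hardware_cpb, software_cpb):
    """How many times fewer cycles per byte the accelerated path needs."""
    return software_cpb / hardware_cpb


if __name__ == "__main__":
    for cpu, (hw, sw) in ARMV8_SHA1_CPB.items():
        print("%-11s %.2fx" % (cpu, acceleration_factor(hw, sw)))
```

The factors range from roughly 1.8x on the A7 to about 4x on the
Cortex-A53.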

Broad, sweeping generalizations are great - so long as they are _properly 
qualified_.

For instance, I would agree *wholeheartedly* that a good software 
implementation of CityHash/SpookyHash/etc would beat the *pants* off a good 
software implementation of SHA-1. No question.


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Christoph Anton Mitterer

On Sat, 2014-11-29 at 13:00 -0800, John Williams wrote: 
> On Sat, Nov 29, 2014 at 12:38 PM, Alex Elsayed  wrote:
> > Why not just use the kernel crypto API? Then the user can just specify any
> > hash the kernel supports.
> 
> One reason is that cryptographic hashes are an order of magnitude
> slower than the fastest non-cryptographic hashes. And for filesystem
> checksums, I do not see a need for cryptographic hashes.

I'm no crypto expert, but wouldn't a cryptographic hash, combined with
e.g. dm-crypt below the filesystem, give us what dm-crypt alone cannot
really give us (authenticated integrity)?

Would that combination of hash+encrypt basically work like a MAC?

Cheers,
Chris.


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread John Williams

On Mon, Dec 1, 2014 at 4:15 PM, Alex Elsayed  wrote:
> There's a thing called the transitive property. When CRC32 is faster than
> SpookyHash and CityHash (while admittedly weaker), and SHA-1 on SPARC is
> faster than CRC32, there are comparisons that can be made.

And yet you applied the transitive property with poor assumptions and
in a convoluted way to come up with an incorrect conclusion.


> It's that the flat assertion that "CityHash/SpookyHash/etc is always faster"
> is _unwarranted_, as hardware acceleration _has a huge effect_.

Actually, the assertion is true and backed up by evidence that I
cited. I'm not sure why you think hardware acceleration only helps
SHA-1 and does not help CityHash or SpookyHash.

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

John Williams wrote:

> On Mon, Dec 1, 2014 at 4:15 PM, Alex Elsayed  wrote:
>> There's a thing called the transitive property. When CRC32 is faster than
>> SpookyHash and CityHash (while admittedly weaker), and SHA-1 on SPARC is
>> faster than CRC32, there are comparisons that can be made.
> 
> And yet you applied the transitive property with poor assumptions and
> in a convoluted way to come up with an incorrect conclusion.
> 
> 
>> It's that the flat assertion that "CityHash/SpookyHash/etc is always
>> faster" is _unwarranted_, as hardware acceleration _has a huge effect_.
> 
> Actually, the assertion is true and backed up by evidence that I
> cited. I'm not sure why you think hardware acceleration only helps
> SHA-1 and does not help CityHash or SpookyHash.

...because the hardware acceleration is in the form of instructions like 
"Update SHA1 state" ?

https://software.intel.com/en-us/articles/intel-sha-extensions

https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf
(page 99, the SHA1{C,P,M,H,SU0,SU1} instructions)

On SPARC it's a full-on crypto coprocessor.


Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?

2014-12-01 Thread Qu Wenruo

 Original Message 
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Austin S Hemmelgarn 
To: Qu Wenruo , linux-btrfs 

Date: 2014-12-01 20:53

> On 2014-11-30 20:58, Qu Wenruo wrote:
>> [snipped]
>
> So, I think this does a good job of highlighting one of the bigger
> issues with btrfsck when it is compared to ext* and/or xfs. Despite
> this being a problem, I really don't think using an RDBMS is the way to
> fix it, both for reasons outlined in other responses, and because fsck
> should be as fast as possible when nothing is wrong with the fs.

Although I am not wedded to the crazy idea, I think it is still worth
pointing out that even when btrfsck is run on a clean btrfs, it still
needs to iterate over all the extents and metadata.

So it may not be as fast as you think, even with the current
implementation.

Anyway, thanks for the feedback.

Thanks,
Qu

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

Christoph Anton Mitterer wrote:

> On Sat, 2014-11-29 at 13:00 -0800, John Williams wrote:
>> On Sat, Nov 29, 2014 at 12:38 PM, Alex Elsayed 
>> wrote:
>> > Why not just use the kernel crypto API? Then the user can just specify
>> > any hash the kernel supports.
>> 
>> One reason is that cryptographic hashes are an order of magnitude
>> slower than the fastest non-cryptographic hashes. And for filesystem
>> checksums, I do not see a need for cryptographic hashes.
> 
> I'm no crypto expert, but wouldn't a cryptographic hash, combined
> with e.g. dm-crypt below the filesystem, give us what dm-crypt
> alone cannot really give us (authenticated integrity)?
> 
> Would that combination of hash+encrypt basically work like a MAC?

Sadly, no. Partially because in order for an encrypted hash to be a secure 
MAC, the encryption must be nonmalleable, which would require CMC or EME - 
encryption modes which Linux does not presently support as I understand it. 
There are other issues as well, including that MAC-then-encrypt is fragile 
against a number of attacks, mainly in the padding-oracle category (See: TLS 
BEAST attack).

AEAD modes are also nonmalleable, but as they are length-expanding they 
cannot be used for LUKS. However, as eCryptFS and possibly the recent ext4 
encryption work shows, using them at a higher-level (encrypting extents or 
files) does work. Of course, if you're using an AEAD mode in the filesystem 
anyway, just use it directly and have done with it.
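
The malleability point can be illustrated with a toy example. The sketch below is not from the thread and is not how dm-crypt works: it assumes a deliberately malleable XOR stream cipher (a hypothetical `keystream` built from SHA-256 in counter mode) purely to show why "hash, then encrypt" fails as a MAC when the attacker knows the plaintext, and why encrypt-then-MAC with a keyed HMAC rejects the same tampering. All function names are made up for illustration.

```python
import hashlib
import hmac

TAG_LEN = 32  # SHA-256 output size in bytes


def keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256 over key || counter. Illustration only."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def hash_then_encrypt(key: bytes, msg: bytes) -> bytes:
    """Append an unkeyed hash, then encrypt everything (the fragile scheme)."""
    plain = msg + hashlib.sha256(msg).digest()
    return xor_bytes(plain, keystream(key, len(plain)))


def decrypt_and_check_hash(key: bytes, blob: bytes):
    plain = xor_bytes(blob, keystream(key, len(blob)))
    msg, tag = plain[:-TAG_LEN], plain[-TAG_LEN:]
    ok = hmac.compare_digest(hashlib.sha256(msg).digest(), tag)
    return msg if ok else None


def forge_hash_then_encrypt(blob: bytes, known_msg: bytes, new_msg: bytes) -> bytes:
    """Attacker without the key swaps a known plaintext for one of equal length."""
    assert len(known_msg) == len(new_msg)
    old_plain = known_msg + hashlib.sha256(known_msg).digest()
    new_plain = new_msg + hashlib.sha256(new_msg).digest()
    return xor_bytes(blob, xor_bytes(old_plain, new_plain))


def encrypt_then_mac(enc_key: bytes, mac_key: bytes, msg: bytes) -> bytes:
    """Encrypt, then append an HMAC computed over the ciphertext."""
    ct = xor_bytes(msg, keystream(enc_key, len(msg)))
    return ct + hmac.new(mac_key, ct, hashlib.sha256).digest()


def verify_then_decrypt(enc_key: bytes, mac_key: bytes, blob: bytes):
    ct, tag = blob[:-TAG_LEN], blob[-TAG_LEN:]
    expected = hmac.new(mac_key, ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # reject before touching the ciphertext
    return xor_bytes(ct, keystream(enc_key, len(ct)))
```

With the first scheme the forged blob decrypts and passes the hash check; with the second, flipping any ciphertext bit makes verification fail because the attacker cannot recompute the HMAC without the MAC key.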


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Christoph Anton Mitterer

On Mon, 2014-12-01 at 16:43 -0800, Alex Elsayed wrote: 
> including that MAC-then-encrypt is fragile 
> against a number of attacks, mainly in the padding-oracle category (See: TLS 
> BEAST attack).
Well but here we talk about disk encryption... how would the MtE oracle
problems apply to that? Either you're already in the system, i.e. beyond
disk encryption (and can measure any timing difference)... or you're
not, but then you cannot measure anything.


Cheers,
Chris.



Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?

2014-12-01 Thread Qu Wenruo

 Original Message 
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Robert White 
To: Qu Wenruo , linux-btrfs 

Date: 2014-12-02 02:10

> On 11/30/2014 10:18 PM, Qu Wenruo wrote:
>> (advocacy for using SQL internally for btrfsck)
>
> All of these ideas you want to toss an entire SQL front end on are more
> simply handled with simple data structures.
>
> In C++ terms "map" and/or "map>"
> beats the heck out of including all of SQL and its related indexes and
> type conversions (sqlite, for example, stores integers as doubles, or
> decimal numbers depending on version).
>
> RDBMS _are_ good at representing things, so noticing that a thing
> _can_ be represented with an RDBMS is very common.
>
> But by the time you put two or three indexes on relation->(parent,
> child, name) you've given yourself three or four copies of the core
> data in three or four different places. And those copies are largely
> immutable and randomly distributed and will include the overhead in
> memory for fairly sparse trees.
>
> It's not that it's an unworkable idea.
>
> But it is unnecessarily generic and adds an order of magnitude of
> complexity to your problems.
>
> For instance, if I boot from a CD to run a btrfsck where will the
> database files be written to?

This is easy: memory.
Only when we judge the fs's metadata to be too large would we use a file.

One of the problems with the current inode_record is that btrfsck can
only record everything in memory; when the metadata of the file system
is too big, the sysadmin can only add swap space or memory to handle it.

It is not an urgent problem, though, since a 1T btrfs fs with about 5G
of metadata takes only about 500M for checking chunks and extents, and
even less for checking fs roots.

> If it is an in-memory table why do I want the overhead of SQL to look
> up something indexed by integer?
>
> If the sparse vectors of integers don't fit in memory why would the
> SQL tables of integers fit "better"?
>
> SQL would be the second slowest possible for representing this data --
> The slowest would be an XML schema stored as flat text.
>
> So your crazy idea is also a pretty bad one compared to most if not
> all sparse data representations and techniques that come to bear on
> this problem set. All you are really doing is pushing the same work
> (walking a tree to find an integer) into a difficult "spell it out in
> SQL" space.
>
> Is prepare_sql(cursor,"SELECT parent FROM parentage_tree WHERE child =
> %d"); execute_sql(cursor,child); and its possible error returns
> actually clearer or better than "parent=inheritance.find(child); if
> (parent!=inheritance.end()) {...}" (as it might be written in C++)?
>
> Do you want to know if (keep track of whether) an inode is allocated
> and referenced? There's a sparse bit-vector for that...
>
> Want to be able to get back to an inode's location on disk, a sparse
> array of disk offsets exists (among other options).
>
> Before you can even access the RDBMS you'd have to fill it completely;
> otherwise you wouldn't know if a select returning zero rows was an
> authoritative indication that the datum didn't exist or if it was
> instead an indication that the datum hadn't been populated yet.
>
> THIS IS NOT SARCASM: If you strongly disagree, I suggest you start
> coding. Seriously, don't ask, do... And in a month really check to see
> if your solution is any smaller, faster, easier, or in _any_ _way_
> more optimal than using native data structures. The attempt will
> answer the question definitively and then we'll all know...

I know this is a crazy idea, and I don't disagree with your opinion.
But I am also somewhat tired of adding new structures and new search
functions, or even making larger changes to the btrfsck record
infrastructure, whenever I find it can't provide what a new recovery
function needs.

In fact, after I finish the whole corrupted-leaf recovery patchset, I
may try implementing it as a trial-and-error experiment to clean up and
enhance the inode_record infrastructure, and see whether there is a huge
performance drop or a reduction in lines of code. (It is just a personal
experiment; I will not send the patches if there is no interesting
result, and it may well turn out to be a disaster, as you mentioned.)

Thanks,
Qu
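
For readers who want to try the comparison being argued about, here is a minimal sketch that is not code from either poster. It stores the same child-to-parent relation in a native map and in an in-memory SQLite table using Python's standard `sqlite3` module. The table name `parentage` and the helper functions are hypothetical, and the inode numbers are made-up sample data.

```python
import sqlite3


def build_dict(pairs):
    """Native structure: a plain child -> parent map."""
    return dict(pairs)


def build_sqlite(pairs):
    """The same relation as an in-memory SQL table indexed by child."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE parentage (child INTEGER PRIMARY KEY, parent INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO parentage (child, parent) VALUES (?, ?)", pairs)
    return conn


def parent_from_dict(table, child):
    # None means "not recorded"; the caller must know whether the map is complete.
    return table.get(child)


def parent_from_sqlite(conn, child):
    row = conn.execute(
        "SELECT parent FROM parentage WHERE child = ?", (child,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    pairs = [(257, 256), (258, 256), (259, 257)]
    d = build_dict(pairs)
    db = build_sqlite(pairs)
    print(parent_from_dict(d, 259), parent_from_sqlite(db, 259))
```

Both lookups return the same answer; the difference under discussion is the setup, error handling, and memory overhead around the SQL version, which a timing loop over a realistic number of records would make visible.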


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

Christoph Anton Mitterer wrote:

> On Mon, 2014-12-01 at 16:43 -0800, Alex Elsayed wrote:
>> including that MAC-then-encrypt is fragile
>> against a number of attacks, mainly in the padding-oracle category (See:
>> TLS BEAST attack).
> Well but here we talk about disk encryption... how would the MtE oracle
> problems apply to that? Either you're already in the system, i.e. beyond
> disk encryption (and can measure any timing difference)... or you're
> not, but then you cannot measure anything.

Arguable. On a system with sufficiently little noise in the signal (say... 
systemd, on SSD, etc) you could possibly get some real information from 
corrupting padding on a relatively long extent used early in the boot 
process, by measuring how it affects time-to-boot.

And padding oracles are just one issue. Overall, the problem is that MtE 
isn't generically secure. EtM or pure AEAD modes are, which means you can 
simply mark any attack that doesn't rely on one of the underlying primitives 
being weak as "Not applicable." It also means you can compose it out of 
arbitrary secure primitives, rather than needing to do your proof of 
security over again for every combination.

That's an _enormous_ win in terms of how easy it is to be sure a system is 
secure. Without it, you can't really be sure there isn't Yet Another Vector 
You Missed.


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-12-01 Thread Alex Elsayed

Alex Elsayed wrote:

> Christoph Anton Mitterer wrote:
> 
>> On Mon, 2014-12-01 at 16:43 -0800, Alex Elsayed wrote:
>>> including that MAC-then-encrypt is fragile
>>> against a number of attacks, mainly in the padding-oracle category (See:
>>> TLS BEAST attack).
>> Well but here we talk about disk encryption... how would the MtE oracle
>> problems apply to that? Either you're already in the system, i.e. beyond
>> disk encryption (and can measure any timing difference)... or you're
>> not, but then you cannot measure anything.
> 
> Arguable. On a system with sufficiently little noise in the signal (say...
> systemd, on SSD, etc) you could possibly get some real information from
> corrupting padding on a relatively long extent used early in the boot
> process, by measuring how it affects time-to-boot.

To make this more concrete:

Alice owns the computer, and has root. /etc/shadow has the correct 
permissions.

Eve has _an_ account, but does not have root - and she wants it.

For simplicity, let's presume this is a laptop, Alice and Eve are sisters, 
and Eve wants to peek at Alice's diary.

Eve can boot into a livecd, selectively corrupt blocks, and get Alice to 
unlock the drive for a normal boot.

With this, she can execute the padding oracle attack against /etc/shadow, 
and deduce its contents.

The first rule of crypto is "Don't roll your own" largely because it is 
_brutally_ unforgiving of minor mistakes.


Re: btrfs stuck with lot's of files

2014-12-01 Thread Qu Wenruo

 Original Message 
Subject: btrfs stuck with lot's of files
From: Peter Volkov 
To: linux-btrfs@vger.kernel.org 
Date: 2014-12-01 19:46

> Hi, guys.
>
> We have a problem with btrfs file system: sometimes it became stuck
> without leaving me any way to interrupt it (shutdown -r now is unable to
> restart server). By stuck I mean some processes that previously were
> able to write on disk are unable to cope with load and load average goes
> up:
>
> top - 13:10:58 up 1 day,  9:26,  5 users,  load average: 157.76, 156.61, 149.29
> Tasks: 235 total,   2 running, 233 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 19.8 us, 15.0 sy,  0.0 ni, 60.7 id,  3.9 wa,  0.0 hi,  0.6 si,  0.0 st
> KiB Mem:  65922104 total, 65414856 used,   507248 free,     1844 buffers
> KiB Swap:        0 total,        0 used,        0 free. 62570804 cached Mem
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  8644 root      20   0       0      0      0 R  96.5  0.0 127:21.95 kworker/u16:16
>  5047 dvr       20   0 6884292 122668   4132 S   6.4  0.2 258:59.49 dvrserver
> 30223 root      20   0   20140   2600   2132 R   6.4  0.0   0:00.01 top
>     1 root      20   0    4276   1628   1524 S   0.0  0.0   0:40.19 init
>
> There are about 300 threads on server, some of which are writing on disk.
> A bit information about this btrfs filesystem: this is 22 disk file
> system with raid1 for metadata and raid0 for data:
>
>   # btrfs filesystem df /store/
> Data, single: total=11.92TiB, used=10.86TiB
> System, RAID1: total=8.00MiB, used=1.27MiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=46.00GiB, used=33.49GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=128.00KiB
>   # btrfs property get /store/
> ro=false
> label=store
>   # btrfs device stats /store/
> (shows all zeros)
>   # btrfs balance status /store/
> No balance found on '/store/'
>   # btrfs filesystem show /store/
> Btrfs v3.17.1
> (btw, is it supposed to have only version here?)

This is a small bug: if there is a trailing '/' in the path given to
'btrfs fi show', it can't recognize the path.

A patch has already been sent and may be included in the next version.

> As for load we write quite small files of size (some of 313K, some of
> 800K), that's why metadata takes that much. So back to the problem.
> iostat 1 exposes following problem:
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           16.96    0.00   17.09   65.95    0.00    0.00
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sda               0.00         0.00         0.00          0          0
> sdc               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sde               0.00         0.00         0.00          0          0
> sdd               0.00         0.00         0.00          0          0
> sdf               0.00         0.00         0.00          0          0
> sdg               0.00         0.00         0.00          0          0
> sdj               0.00         0.00         0.00          0          0
> sdh               0.00         0.00         0.00          0          0
> sdk               0.00         0.00         0.00          0          0
> sdi               1.00         0.00       200.00          0        200
> sdl               0.00         0.00         0.00          0          0
> sdn              48.00         0.00     17260.00          0      17260
> sdm               0.00         0.00         0.00          0          0
> sdp               0.00         0.00         0.00          0          0
> sdo               0.00         0.00         0.00          0          0
> sdq               0.00         0.00         0.00          0          0
> sdr               0.00         0.00         0.00          0          0
> sds               0.00         0.00         0.00          0          0
> sdt               0.00         0.00         0.00          0          0
> sdv               0.00         0.00         0.00          0          0
> sdw               0.00         0.00         0.00          0          0
> sdu               0.00         0.00         0.00          0          0
>
> write goes to one disk. I've tried to debug what's going in kworker and
> did
>
> $ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
> $ cat /sys/kernel/debug/tracing/trace_pipe > trace_pipe.out2
>
> trace_pipe2.out.xz in attachment. Could you comment, what goes wrong
> here?

It seems the attachment was blocked by the mailing list, so I didn't
see it.

> Server has 64Gb of RAM. Is it possible that it is unable to keep all
> metadata in memory, can we increase this memory limit, if exists?

Not possible; that will never happen (if nothing goes wrong).
The kernel has an excellent page cache mechanism: when memory runs short,
some cached metadata/data can be flushed back to disk (if dirty) to free
space, and re-read from disk later if needed.

So the kernel doesn't need to load all the metadata/data into memory, and
that's mostly

Re: btrfs stuck with lot's of files

2014-12-01 Thread Peter Volkov

On Mon, 01/12/2014 at 10:47 -0800, Robert White wrote:
> On 12/01/2014 03:46 AM, Peter Volkov wrote:
>  > (stuff about getting hung up trying to write to one drive)
> 
> That drive (/dev/sdn) is probably starting to fail.
> (about failed drive)

Thank you, Robert, for the answer. It is unlikely that the drive is
failing here. The same condition (writes going to a single drive)
happens with other drives too, i.e. this write pattern may occur with
any drive.

After watching it for longer I see the following. While it is stuck, a
single processor core is 100% busy in kernel space (some kworker is
taking 100% CPU). Ftrace reveals that btrfs_async_reclaim_metadata_space
is the most frequently called function. So it looks like btrfs is doing
some operation on metadata, and until that finishes everything is stuck
(practically no writes reach the disk). So I'm looking for suggestions
on how to cope with this.

> >   # btrfs filesystem df /store/
> > Data, single: total=11.92TiB, used=10.86TiB
> 
> Regardless of the above...
> 
> You have a terabyte of unused but allocated data storage. You probably 
> need to balance your system to un-jam that. That's a lot of space that 
> is unavailable to the metadata (etc).

Well, I'm afraid that balance will put fs into even longer "stuck".
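
The "terabyte of unused but allocated" figure comes straight from the `btrfs filesystem df` output: for each block-group type it is `total` minus `used`. A minimal sketch of that arithmetic follows. It is not a btrfs tool; the function names are hypothetical and the regular expression only assumes the line format shown in this thread.

```python
import re

UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

# Matches lines such as "Data, single: total=11.92TiB, used=10.86TiB".
LINE_RE = re.compile(
    r"^\s*(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tu>[KMGT]iB|B), "
    r"used=(?P<used>[\d.]+)(?P<uu>[KMGT]iB|B)\s*$"
)


def to_bytes(value: str, unit: str) -> float:
    return float(value) * UNITS[unit]


def slack_by_group(df_output: str) -> dict:
    """Map (type, profile) -> allocated-but-unused bytes."""
    result = {}
    for line in df_output.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip prompts, blank lines, anything unrecognized
        key = (m["kind"], m["profile"])
        result[key] = to_bytes(m["total"], m["tu"]) - to_bytes(m["used"], m["uu"])
    return result


SAMPLE = """\
Data, single: total=11.92TiB, used=10.86TiB
System, RAID1: total=8.00MiB, used=1.27MiB
Metadata, RAID1: total=46.00GiB, used=33.49GiB
"""

if __name__ == "__main__":
    for (kind, profile), slack in slack_by_group(SAMPLE).items():
        print(f"{kind} ({profile}): {slack / 1024**3:.2f} GiB allocated but unused")
```

For the values posted above this gives roughly 1.06 TiB of slack in data chunks and about 12.5 GiB in metadata chunks.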

> ASIDE: Having your metadata set to RAID1 (as opposed to the default of 
> DUP) seems a little iffy since your data is still set to DUP.

That's true. But why would the data be duplicated? When creating the
btrfs volume I explicitly set the data profile to single (-d single).

> FURTHER ASIDE: raid1 metadata and raid5 data might be good for you given 
> 22 volumes and 10% empty space; it would only cost you half of your 
> existing empty space. If you don't RAID your data, there is no real 
> point to putting your metadata in RAID.

Is raid5 ready for use? According to the post [1] mentioned on [2], it
still has some way to go before it is stable.

[1]
http://marc.merlins.org/perso/btrfs/post_2014-03-23_Btrfs-Raid5-Status.html
[2] https://btrfs.wiki.kernel.org/index.php/RAID56

--
Peter.


Re: btrfs stuck with lot's of files

2014-12-01 Thread Peter Volkov

On Tue, 02/12/2014 at 09:33 +0800, Qu Wenruo wrote:
>  Original Message 
> Subject: btrfs stuck with lot's of files
> From: Peter Volkov 
> To: linux-btrfs@vger.kernel.org 
> Date: 2014-12-01 19:46
> > Hi, guys.
> >
> > We have a problem with btrfs file system: sometimes it became stuck
> > without leaving me any way to interrupt it (shutdown -r now is unable to
> > restart server). By stuck I mean some processes that previously were
> > able to write on disk are unable to cope with load and load average goes
> > up:
> >
> > top - 13:10:58 up 1 day,  9:26,  5 users,  load average: 157.76, 156.61, 149.29
> > Tasks: 235 total,   2 running, 233 sleeping,   0 stopped,   0 zombie
> > %Cpu(s): 19.8 us, 15.0 sy,  0.0 ni, 60.7 id,  3.9 wa,  0.0 hi,  0.6 si,  0.0 st
> > KiB Mem:  65922104 total, 65414856 used,   507248 free,     1844 buffers
> > KiB Swap:        0 total,        0 used,        0 free. 62570804 cached Mem
> >
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> >  8644 root      20   0       0      0      0 R  96.5  0.0 127:21.95 kworker/u16:16
> >  5047 dvr       20   0 6884292 122668   4132 S   6.4  0.2 258:59.49 dvrserver
> > 30223 root      20   0   20140   2600   2132 R   6.4  0.0   0:00.01 top
> >     1 root      20   0    4276   1628   1524 S   0.0  0.0   0:40.19 init
> >
> >
> >
> > There are about 300 threads on server, some of which are writing on disk.
> > A bit information about this btrfs filesystem: this is 22 disk file
> > system with raid1 for metadata and raid0 for data:
> >
> >   # btrfs filesystem df /store/
> > Data, single: total=11.92TiB, used=10.86TiB
> > System, RAID1: total=8.00MiB, used=1.27MiB
> > System, single: total=4.00MiB, used=0.00B
> > Metadata, RAID1: total=46.00GiB, used=33.49GiB
> > Metadata, single: total=8.00MiB, used=0.00B
> > GlobalReserve, single: total=512.00MiB, used=128.00KiB
> >   # btrfs property get /store/
> > ro=false
> > label=store
> >   # btrfs device stats /store/
> > (shows all zeros)
> >   # btrfs balance status /store/
> > No balance found on '/store/'
> >   # btrfs filesystem show /store/
> > Btrfs v3.17.1
> > (btw, is it supposed to have only version here?)
> This is a small bug: if there is a trailing '/' in the path given to
> 'btrfs fi show', it can't recognize the path.
> A patch has already been sent and may be included in the next version.
> >
> > As for load we write quite small files of size (some of 313K, some of
> > 800K), that's why metadata takes that much. So back to the problem.
> > iostat 1 exposes following problem:
> >
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           16.96    0.00   17.09   65.95    0.00    0.00
> >
> > Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> > sda               0.00         0.00         0.00          0          0
> > sdc               0.00         0.00         0.00          0          0
> > sdb               0.00         0.00         0.00          0          0
> > sde               0.00         0.00         0.00          0          0
> > sdd               0.00         0.00         0.00          0          0
> > sdf               0.00         0.00         0.00          0          0
> > sdg               0.00         0.00         0.00          0          0
> > sdj               0.00         0.00         0.00          0          0
> > sdh               0.00         0.00         0.00          0          0
> > sdk               0.00         0.00         0.00          0          0
> > sdi               1.00         0.00       200.00          0        200
> > sdl               0.00         0.00         0.00          0          0
> > sdn              48.00         0.00     17260.00          0      17260
> > sdm               0.00         0.00         0.00          0          0
> > sdp               0.00         0.00         0.00          0          0
> > sdo               0.00         0.00         0.00          0          0
> > sdq               0.00         0.00         0.00          0          0
> > sdr               0.00         0.00         0.00          0          0
> > sds               0.00         0.00         0.00          0          0
> > sdt               0.00         0.00         0.00          0          0
> > sdv               0.00         0.00         0.00          0          0
> > sdw               0.00         0.00         0.00          0          0
> > sdu               0.00         0.00         0.00          0          0
> >
> >
> > write goes to one disk. I've tried to debug what's going in kworker and
> > did
> >
> > $ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
> > $ cat /sys/kernel/debug/tracing/trace_pipe > trace_pipe.out2
> >
> > trace_pipe2.out.xz in attachment. Could you comment, what goes wrong
> > here?
> It seems the attachment was blocked by the mailing list, so I didn't
> see it.

I've put it here:
https://drive.google.com/file/d/0Byg

Re: [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

2014-12-01 Thread Eryu Guan

On Mon, Dec 01, 2014 at 05:11:29PM +, Filipe Manana wrote:
> Stress btrfs' block group allocation and deallocation while running
> fstrim in parallel. Part of the goal is also to get data block groups
> deallocated so that new metadata block groups, using the same physical
> device space ranges, get allocated while fstrim is running. This caused
> several issues ranging from invalid memory accesses, kernel crashes,
> metadata or data corruption, free space cache inconsistencies, free
> space leaks and memory leaks.
> 
> Signed-off-by: Filipe Manana 
> ---
> 
> V2: Addressed Dave's comments.
> 
>  tests/generic/038 | 152 
> ++
>  tests/generic/038.out |   2 +
>  tests/generic/group   |   1 +
>  3 files changed, 155 insertions(+)
>  create mode 100755 tests/generic/038
>  create mode 100644 tests/generic/038.out
> 
> diff --git a/tests/generic/038 b/tests/generic/038
> new file mode 100755
> index 000..217aa7a
> --- /dev/null
> +++ b/tests/generic/038
> @@ -0,0 +1,152 @@
> +#! /bin/bash
> +# FSQA Test No. 038
> +#
> +# This test was motivated by btrfs issues, but it's generic enough as it
> +# doesn't use any btrfs specific features.
> +#
> +# Stress btrfs' block group allocation and deallocation while running fstrim in
> +# parallel. Part of the goal is also to get data block groups deallocated so
> +# that new metadata block groups, using the same physical device space ranges,
> +# get allocated while fstrim is running. This caused several issues ranging
> +# from invalid memory accesses, kernel crashes, metadata or data corruption,
> +# free space cache inconsistencies, free space leaks and memory leaks.
> +#
> +# These issues were fixed by the following btrfs linux kernel patches:
> +#
> +#   Btrfs: fix invalid block group rbtree access after bg is removed
> +#   Btrfs: fix crash caused by block group removal
> +#   Btrfs: fix freeing used extents after removing empty block group
> +#   Btrfs: fix race between fs trimming and block group remove/allocation
> +#   Btrfs: fix race between writing free space cache and trimming
> +#   Btrfs: make btrfs_abort_transaction consider existence of new block groups
> +#   Btrfs: fix memory leak after block remove + trimming
> +#   Btrfs: fix extent map leak on chunk allocation failure
> +#
> +# The issues were found on a qemu/kvm guest with 4 virtual CPUs, 4Gb of ram and
> +# scsi-hd devices with discard support enabled (that means hole punching in the
> +# disk's image file is performed by the host).
> +#
> +#---
> +#
> +# Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved.
> +# Author: Filipe Manana 
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + rm -fr $tmp
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +_need_to_be_root
> +_supported_fs btrfs

This should be "_supported_fs generic"

Thanks,
Eryu

Re: Possible to undo subvol delete?

2014-12-01 Thread Zygo Blaxell

On Mon, Dec 01, 2014 at 10:09:44PM +0530, Shriramana Sharma wrote:
> On Mon, Dec 1, 2014 at 7:16 PM, Roman Mamedov  wrote:
> >
> > A more sensible idea could be adding a global-level '-i' switch, same as in
> > 'rm', so that you or distros could then alias 'btrfs' to 'btrfs -i' (ask
> > confirmation on any irreversible action).
> 
> Well the difference being that there doesn't seem to be any other
> irreversible action from my scan of man btrfs -- am I missing
> anything? This is the only thing that actually leads to loss of data.
> 
> When btrfs has so many features (esp snapshots) to prevent user
> accidentally deleting data (I liked especially
> http://www.youtube.com/v/9H7e6BcI5Fo?start=209) I think there has to
> be *some* modicum of support for warning against deleting a subvolume
> (and it seems others agree too).
> 
> But I see what you mean in the bugzilla comment about not wanting your
> existing backup snapshot scripts to fail because they don't have a -f.
> At the same time, aliasing via -i on top level btrfs binary may not be
> so practical here because this is the only command which will actually
> use it (again, correct if wrong).
> 
> Perhaps exporting some envvar in the default shell's rc file (or
> whichever file will be read only if the shell is interactive) would
> work? Like in ~/.bashrc:
> 
> export BTRFS_SUBVOLUME_DELETE_CONFIRM=1
> 
> Ideas?

Never rely on aliasing or environment variables for defaults, and never
change default behavior if your releases are old enough that someone
has built scripts on top of them.  ;)

fprintf(stderr, "Deleting subvolume '%s' in 5 seconds.\n", subvol_path);
if (!f_flag_on_cmd_line) {
	fprintf(stderr, "If this is not what you want,\n");
	fprintf(stderr, "*** PRESS Ctrl-C TO ABORT NOW!!! ***\a\n");
	sleep(5);
}

Of course, in an init-shell-type environment, Ctrl-C doesn't work
either...

If I had to pick the least evil, I'd go for interactive prompting by
default (do nothing if the interaction fails, e.g. no TTY) and add a
'-f'/'--force' flag to bypass the prompt.  This is consistent with the
way lvm2 and mdadm work when presented with data-losing or otherwise
questionable commands and parameters.  It will break scripts, but btrfs
users should still be expecting that for a while as undesirable default
behaviors are identified.
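
The prompt-by-default policy described above can be sketched as a small decision function. This is only an illustration of the idea, not btrfs-progs code: `should_delete` and its parameters are hypothetical names, and the exact prompt wording is made up.

```python
import sys


def should_delete(path, force, stdin=None, stderr=None):
    """Decide whether a destructive delete may proceed.

    force=True bypasses the prompt (the '-f'/'--force' case).
    Without a TTY the interaction cannot happen, so nothing is deleted.
    """
    stdin = sys.stdin if stdin is None else stdin
    stderr = sys.stderr if stderr is None else stderr
    if force:
        return True
    if not stdin.isatty():
        print(
            f"Refusing to delete '{path}' without a terminal; use --force.",
            file=stderr,
        )
        return False
    print(f"Delete subvolume '{path}'? [y/N] ", end="", file=stderr, flush=True)
    answer = stdin.readline().strip().lower()
    # Anything other than an explicit yes counts as "no".
    return answer in ("y", "yes")


if __name__ == "__main__":
    ok = should_delete("/mnt/pool/snap", force="--force" in sys.argv[1:])
    print("would delete" if ok else "aborted")
```

Defaulting to "no" on empty input and on a missing TTY means an unattended script without `--force` fails safely instead of hanging or deleting data.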

OTOH maybe there is no issue with the current behavior.  Only root can
delete subvolumes, and maybe we assume root knows what they're doing?

On a side note...only root can delete subvolumes, but non-root users
can create them, which results in...this:

$ /sbin/btrfs sub create foo
Create subvolume './foo'
$ date > foo/bar
$ /sbin/btrfs sub delete foo
Transaction commit: none (default)
Delete subvolume '/home/testuser/foo'
ERROR: cannot delete '/home/testuser/foo' - Operation not permitted
$ rm -rf foo
rm: cannot remove `foo': Operation not permitted
$ cat /proc/version
Linux version 3.17.1-zb64+ (root@buildbot) (gcc version 4.7.2 (Debian 
4.7.2-5) ) #1 SMP PREEMPT Tue Oct 21 00:17:49 EDT 2014

...uh oh?

> -- 
> Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा


Re: Moving an entire subvol?

2014-12-01 Thread Shriramana Sharma

On Mon, Dec 1, 2014 at 6:24 AM, Chris Murphy  wrote:
>> But isn't it just possible to move i.e. reparent a
>> subvol so I can move these two under another subvol and have that as
>> default?
>
> You can move subvolumes.

OK, so I just found out that a plain mv test1/foo test2/ (where test1,
test2 and foo are all subvolumes) is sufficient to reparent foo to
test2, provided that what btrfs sub list shows as "top level" is indeed
the parent subvolume.

Is that correct: is what btrfs sub list shows as "top level" indeed the
parent subvolume?
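
For reference, when source and destination are on the same filesystem,
mv boils down to a single rename(2) call. A minimal sketch of that step
in C follows; the helper names dest_path and move_into are made up for
illustration, and the sketch says nothing about how btrfs sub list
reports the result.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<dst_dir>/<last component of src>" in buf.
 * Returns 0 on success, -1 if src has no usable last component
 * or buf is too small.
 */
int dest_path(char *buf, size_t size, const char *src, const char *dst_dir)
{
    const char *slash, *base;
    int n;

    assert(buf != NULL && src != NULL && dst_dir != NULL);
    slash = strrchr(src, '/');
    base = slash ? slash + 1 : src;
    if (*base == '\0')
        return -1;              /* e.g. "test1/foo/": trailing slash not handled */
    n = snprintf(buf, size, "%s/%s", dst_dir, base);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Move src into dst_dir with a single rename(2) call, as mv does when
 * both paths are on the same filesystem. Returns 0 on success, -1 on
 * failure (errno is set when rename itself fails).
 */
int move_into(const char *src, const char *dst_dir)
{
    char dst[4096];

    if (dest_path(dst, sizeof dst, src, dst_dir) != 0)
        return -1;
    return rename(src, dst);
}
```

With the names from above, move_into("test1/foo", "test2") would issue
rename("test1/foo", "test2/foo").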

> My suggestion is subvolumes containing
> binaries shouldn't be located within another subvolume that ends up
> being mounted, that way old binaries with possible vulnerabilities
> aren't exposed in the normal search path.

I'm not sure what you mean. Are you saying that for example /usr/bin should be:

1) a separate subvolume from / or /usr,
2) not a child subvolume of / or /usr?

> openSUSE uses subvol id 5 for installing the OS to, and some
> directories are made subvolumes such as home var and maybe usr.
> Therefore when subvolid 5 is snapshot, those are exempt, and have to
> be individually snapshot.

Yes I also noticed that openSUSE creates such separate subvols, but is
there any particular benefit to making it so?

> Fedora uses subvolumes root and home by default, and fstab uses
> subvol=root and subvol=home to mount them at / and /home respectively.

This seems similar to Ubuntu's @ and @home setup.

Is there any advantage to either? That is, one model installs root to
the topmost subvol and makes usr, home etc nested subvols, whereas
another makes root a nested subvol under the topmost just like usr
home etc, and then mounts it to /...

In general it seems people (or at least distros) prefer to avoid
nesting subvolumes. Is there any particular reason for this? Especially
with regard to /usr etc., it would seem that if they are nested within
the subvol for /, then just mounting that subvol would automatically
mount all nested subvolumes, right? So the extra effort needed to mount
the nested subvols would not be necessary, no?

Shriramana.

Re: Possible to undo subvol delete?

2014-12-01 Thread Shriramana Sharma

On Tue, Dec 2, 2014 at 8:44 AM, Zygo Blaxell wrote:
> This is consistent with the
> way lvm2 and mdadm work when presented with data-losing or otherwise
> questionable commands and parameters.  It will break scripts, but btrfs
> users should still be expecting that for a while as undesirable default
> behaviors are identified.

Ah so there *is* precedent for my hunch that deleting subvols should
be different than deleting ordinary files and folders... :-)

> OTOH maybe there is no issue with the current behavior.  Only root can
> delete subvolumes, and maybe we assume root knows what they're doing?

Well, in office environments, where only one person has the root
password, that's fine, because that person is going to be wary of
doing anything that would make others angry at them. But on single-user
systems, one's regular password *is* the root password, and ordinary
(and mostly non-destructive) things like installing software require
entering it. So one gets accustomed to entering it without too much
thought, which is what creates the need for such safety nets.

(Perhaps like in banks, we should have a two-password system, one for
destructive actions, so the user is forced to apply thought to what
they are approving!)

> On a side note...only root can delete subvolumes, but non-root users
> can create them, which results in...this:

Not sure about your Debian system, but my openSUSE Tumbleweed (with
kernel 3.17.2 and btrfsprogs 3.17) requires me to enter the root
password before creating a subvol (or in fact running anything under
/sbin or /usr/sbin).

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

Re: Possible to undo subvol delete?

2014-12-01 Thread MegaBrutal

2014-12-01 17:39 GMT+01:00 Shriramana Sharma :
>
> When btrfs has so many features (esp snapshots) to prevent user
> accidentally deleting data (I liked especially
> http://www.youtube.com/v/9H7e6BcI5Fo?start=209) I think there has to
> be *some* modicum of support for warning against deleting a subvolume
> (and it seems others agree too).
>

WOW, this is pretty neat! How can I do the same from the
command line? E.g. I would be curious whether a file has changed since
the last snapshot. Today I have to use traditional methods like a plain
"ls -l" (if I trust the time and file size), or "diff". But in the
video we could see a directory in a view where only the changed
files were shown. How does that YaST application do that? And is there
a more elegant way to restore files to their originals than a plain "mv"?

Re: Possible to undo subvol delete?

2014-12-01 Thread MegaBrutal

2014-12-02 4:40 GMT+01:00 Shriramana Sharma :
>
> Well, in office environments, where only one person has the root
> password, that's fine, because that person is going to be wary of
> doing anything that would make others angry at them. But on single-user
> systems, one's regular password *is* the root password, and ordinary
> (and mostly non-destructive) things like installing software require
> entering it. So one gets accustomed to entering it without too much
> thought, which is what creates the need for such safety nets.
>

It reminds me of this accidental deletion:
http://serverfault.com/questions/587102/monday-morning-mistake-sudo-rm-rf-no-preserve-root

LOL at "How do you even type --no-preserve-root accidentally?! :-o".

Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-01 Thread MegaBrutal

2014-12-01 22:45 GMT+01:00 Konstantin :
>
> MegaBrutal schrieb am 01.12.2014 um 13:56:
>> Hi all,
>>
>> I've reported the bug I've previously posted about in "BTRFS messes up
>> snapshot LV with origin" in the Kernel Bug Tracker.
>> https://bugzilla.kernel.org/show_bug.cgi?id=89121
> Hi MegaBrutal. If I understand your report correctly, I can give you
> another example where this bug is appearing. It is so bad that it leads
> to freezing the system and I'm quite sure it's the same thing. I was
> thinking about filing a bug but didn't have the time for that yet. Maybe
> you could add this case to your bug report as well.
>
> The bug also appears when using mdadm RAID1: when one of the drives is
> detached from the array, the OS discovers it, and after a while (not
> immediately, it takes several minutes) it appears under /proc/mounts:
> instead of /dev/md0p1 I see /dev/sdb1 there. And usually after an hour
> or so (depending on system workload) the PC freezes completely. So
> whatever the outcome of the discussion about the uniqueness of UUIDs, a
> crashing kernel tells me that there is a serious bug.
>

Hmm, I also suspect our symptoms have the same root cause. It seems
the same thing happens: the BTRFS module notices another device with
the same file system and starts to report it as the root device. It
seems like it has no idea that it's part of a RAID configuration or
anything.

Re: Possible to undo subvol delete?

2014-12-01 Thread Marc MERLIN

On Tue, Dec 02, 2014 at 06:33:38AM +0100, MegaBrutal wrote:
> 2014-12-01 17:39 GMT+01:00 Shriramana Sharma :
> >
> > When btrfs has so many features (esp snapshots) to prevent user
> > accidentally deleting data (I liked especially
> > http://www.youtube.com/v/9H7e6BcI5Fo?start=209) I think there has to
> > be *some* modicum of support for warning against deleting a subvolume
> > (and it seems others agree too).
> >
> 
> WOW, this is pretty neat! How can I do the same from the
> command line? E.g. I would be curious whether a file has changed since
> the last snapshot. Today I have to use traditional methods like a plain "ls

It's not trivial. See
http://marc.merlins.org/perso/btrfs/2014-05.html#Btrfs-diff-Between-Snapshots

From what I understand, the only way to do this better is to use
btrfs-send and parse the output, but that's not trivial either since
btrfs-send has lots of intermediate renames.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
