date:20141130

Hi everyone,

These patches clean up the big stack of sparse RCU errors I introduced into the
integration tree as reported by the kbuild test robot:

On Thu, Nov 27, 2014 at 06:45:20AM +0800, kbuild test robot wrote:
> tree:   git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git integration
> head:   c7a37618b60026121255c69e042d74ae5631470c
> commit: 37aad79d90a0cbf82a5eda62dfe3af4241f5aca3 [38/39] Move BTRFS RCU string to common library
> reproduce:
>   # apt-get install sparse
>   git checkout 37aad79d90a0cbf82a5eda62dfe3af4241f5aca3
>   make ARCH=x86_64 allmodconfig
>   make C=1 CF=-D__CHECK_ENDIAN__
>
>
> sparse warnings: (new ones prefixed by >>)
>
> >> fs/btrfs/check-integrity.c:848:25: sparse: incorrect type in argument 1 (different address spaces)
>    fs/btrfs/check-integrity.c:848:25:    expected struct rcu_string [noderef] <asn:4>*rcu_str
>    fs/btrfs/check-integrity.c:848:25:    got struct rcu_string *name
[snip, there's a lot of these]

As payment for my transgressions, this also cleans up the existing rcu_string
usage to get rid of the preexisting noise.

The first patch fixes the __rcu annotations which I got wrong on the first go.
The second fixes an incorrect use of RCU in the BTRFS_IOC_DEV_INFO ioctl. The
third refactors the volume code's usage of rcu_string, fixing a questionable
RCU or two in the process.

This patch series applies to Chris' integration branch.

Thanks!

Omar Sandoval (3):
  rcustring: clean up botched __rcu annotations
  btrfs: fix suspicious RCU in BTRFS_IOC_DEV_INFO
  btrfs: refactor btrfs_device->name updates

 fs/btrfs/ioctl.c  | 10 ++---
 fs/btrfs/volumes.c| 93 ---
 fs/btrfs/volumes.h|  2 +-
 include/linux/rcustring.h |  5 +--
 4 files changed, 72 insertions(+), 38 deletions(-)

-- 
2.1.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/3] btrfs: refactor btrfs_device->name updates

The rcu_string API introduced some new sparse errors but also revealed existing
ones. First of all, the name in struct btrfs_device should be annotated as
__rcu to prevent unsafe reads. Additionally, updates should go through
rcu_dereference_protected to make it clear what's going on. This introduces
some helper functions that factor out this functionality.

Signed-off-by: Omar Sandoval <osan...@osandov.com>
---
 fs/btrfs/volumes.c | 93 +-
 fs/btrfs/volumes.h |  2 +-
 2 files changed, 65 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d13b253..6913bed 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -53,6 +53,45 @@ static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
 DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
 
+/*
+ * Dereference the device name under the uuid_mutex.
+ */
+static inline struct rcu_string *
+btrfs_dev_rcu_protected_name(struct btrfs_device *dev)
+__must_hold(&uuid_mutex)
+{
+	return rcu_dereference_protected(dev->name,
+					 lockdep_is_held(&uuid_mutex));
+}
+
+/*
+ * Use when the caller is the only possible updater.
+ */
+static inline struct rcu_string *
+btrfs_dev_rcu_only_name(struct btrfs_device *dev)
+{
+	return rcu_dereference_protected(dev->name, 1);
+}
+
+/*
+ * Rename a device under the uuid_mutex.
+ */
+static inline int btrfs_dev_rename(struct btrfs_device *dev, const char *name)
+__must_hold(&uuid_mutex)
+{
+	struct rcu_string *old_name, *new_name;
+
+	new_name = rcu_string_strdup(name, GFP_NOFS);
+	if (!new_name)
+		return -ENOMEM;
+
+	old_name = btrfs_dev_rcu_protected_name(dev);
+	rcu_assign_pointer(dev->name, new_name);
+	rcu_string_free(old_name);
+
+	return 0;
+}
+
 static void lock_chunks(struct btrfs_root *root)
 {
 	mutex_lock(&root->fs_info->chunk_mutex);
@@ -114,7 +153,7 @@ static void free_fs_devices(struct btrfs_fs_devices *fs_devices)
 		device = list_entry(fs_devices->devices.next,
 				    struct btrfs_device, dev_list);
 		list_del(&device->dev_list);
-		rcu_string_free(device->name);
+		rcu_string_free(btrfs_dev_rcu_only_name(device));
 		kfree(device);
 	}
 	kfree(fs_devices);
@@ -495,12 +534,10 @@ static noinline int device_list_add(const char *path,
 			return PTR_ERR(device);
 		}
 
-		name = rcu_string_strdup(path, GFP_NOFS);
-		if (!name) {
+		if (btrfs_dev_rename(device, path)) {
 			kfree(device);
 			return -ENOMEM;
 		}
-		rcu_assign_pointer(device->name, name);
 
 		mutex_lock(&fs_devices->device_list_mutex);
 		list_add_rcu(&device->dev_list, &fs_devices->devices);
@@ -509,7 +546,11 @@ static noinline int device_list_add(const char *path,
 
 		ret = 1;
 		device->fs_devices = fs_devices;
-	} else if (!device->name || strcmp(device->name->str, path)) {
+	} else {
+		name = btrfs_dev_rcu_protected_name(device);
+		if (name && strcmp(name->str, path) == 0)
+			goto out;
+
 		/*
 		 * When FS is already mounted.
 		 * 1. If you are here and if the device->name is NULL that
@@ -547,17 +588,15 @@ static noinline int device_list_add(const char *path,
 			return -EEXIST;
 		}
 
-		name = rcu_string_strdup(path, GFP_NOFS);
-		if (!name)
+		if (btrfs_dev_rename(device, path))
 			return -ENOMEM;
-		rcu_string_free(device->name);
-		rcu_assign_pointer(device->name, name);
 		if (device->missing) {
 			fs_devices->missing_devices--;
 			device->missing = 0;
 		}
 	}
 
+out:
 	/*
 	 * Unmount does not free the btrfs_device struct but would zero
 	 * generation along with most of the other members. So just update
@@ -594,17 +633,12 @@ static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
 		if (IS_ERR(device))
 			goto error;
 
-		/*
-		 * This is ok to do without rcu read locked because we hold the
-		 * uuid mutex so nothing we touch in here is going to disappear.
-		 */
-		if (orig_dev->name) {
-			name = rcu_string_strdup(orig_dev->name->str, GFP_NOFS);
-			if (!name) {
+		name = btrfs_dev_rcu_protected_name(orig_dev);
+		if (name) {
+			if (btrfs_dev_rename(device, name->str)) {
 				kfree(device);

[PATCH 2/3] btrfs: fix suspicious RCU in BTRFS_IOC_DEV_INFO

A naked read of the value of an RCU pointer isn't safe. Put the whole access in
an RCU critical section, not just the pointer dereference.

Signed-off-by: Omar Sandoval <osan...@osandov.com>
---
 fs/btrfs/ioctl.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ecdf68f..dd55844 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2706,6 +2706,7 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
 	struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
 	int ret = 0;
 	char *s_uuid = NULL;
+	struct rcu_string *name;
 
 	di_args = memdup_user(arg, sizeof(*di_args));
 	if (IS_ERR(di_args))
@@ -2726,17 +2727,16 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
 	di_args->bytes_used = btrfs_device_get_bytes_used(dev);
 	di_args->total_bytes = btrfs_device_get_total_bytes(dev);
 	memcpy(di_args->uuid, dev->uuid, sizeof(di_args->uuid));
-	if (dev->name) {
-		struct rcu_string *name;
 
-		rcu_read_lock();
-		name = rcu_dereference(dev->name);
+	rcu_read_lock();
+	name = rcu_dereference(dev->name);
+	if (name) {
 		strncpy(di_args->path, name->str, sizeof(di_args->path));
-		rcu_read_unlock();
 		di_args->path[sizeof(di_args->path) - 1] = 0;
 	} else {
 		di_args->path[0] = '\0';
 	}
+	rcu_read_unlock();
 
 out:
 	mutex_unlock(&fs_devices->device_list_mutex);
-- 
2.1.3


[PATCH 1/3] rcustring: clean up botched __rcu annotations

The rcu_string returned by rcu_string_strdup isn't technically under RCU yet,
and it makes more sense not to treat it as such. Additionally, an rcu_string
passed to rcu_string_free should already be rcu_dereferenced and therefore not
in the __rcu address space.

Signed-off-by: Omar Sandoval <osan...@osandov.com>
---
 include/linux/rcustring.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcustring.h b/include/linux/rcustring.h
index 67277ab..28bd9bc 100644
--- a/include/linux/rcustring.h
+++ b/include/linux/rcustring.h
@@ -37,8 +37,7 @@ struct rcu_string {
  * @src: The string to copy
  * @flags: Flags for kmalloc
  */
-static inline struct rcu_string __rcu *rcu_string_strdup(const char *src,
-							  gfp_t flags)
+static inline struct rcu_string *rcu_string_strdup(const char *src, gfp_t flags)
 {
 	struct rcu_string *ret;
 	size_t len = strlen(src) + 1;
@@ -54,7 +53,7 @@ static inline struct rcu_string __rcu *rcu_string_strdup(const char *src,
  * rcu_string_free() - free an RCU string
  * @str: The string
  */
-static inline void rcu_string_free(struct rcu_string __rcu *str)
+static inline void rcu_string_free(struct rcu_string *str)
 {
 	if (str)
 		kfree_rcu(str, rcu);
-- 
2.1.3


Re: Moving an entire subvol?

On Sun, Nov 30, 2014 at 9:51 AM, Marc MERLIN <m...@merlins.org> wrote:

>> So the Ubuntu Wiki BtrFS entry advises against using subvol
>> set-default because it boots its kernel using root=subvol=@ and home
>> as subvol=@home, and these two subvols are only present under the
>> subvol with ID 5. But isn't it just possible to move i.e. reparent a
>> subvol so I can move these two under another subvol and have that as
>> default?
>
> Make a new subvolume called /root and just mount subvol=root

Sorry if my question wasn't clear: I wanted to know how to move a
subvol to appear under another subvol other than its original parent.
Turns out that sudo mv @ @home target/ is quite sufficient. If so why
would the Ubuntu wiki require that set-default not be used? Just @
@home need to be moved to the new place, no?

> Note that you can't mount subvols recursively in one mount AFAIK.

I'm not sure what you mean. I have a few subvols in my external HDD
which is entirely formatted as BtrFS and if I just mount the external
HDD /dev/sdc1 I am able to access all the subvols' contents as well.

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

Re: Skript for backup btrfs on external HD

2014-11-30 Thread Jakob Schürz


On 2014-11-29 at 23:18, Marc MERLIN wrote:
> On Sat, Nov 29, 2014 at 10:51:08PM +0100, Jakob Schürz wrote:
>> On 2014-11-29 at 22:11, Marc MERLIN wrote:
>>> On Sat, Nov 29, 2014 at 09:34:01PM +0100, Jakob Schürz wrote:
>>>> Hi there!
>>>>
>>>> I made a script to do backups with btrfs to an external HD.
>>>> You can see what it does, how it works, and how to use it on
>>>> my site http://linux.xundeenergie.at/doku.php?id=mkbtrbackup
>>>> The site is in German. An English one will follow later.
>>>>
>>>> Do you want some explanations?
>>>
>>> Sure, how is it different from those 3?
>>> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup#Available_Backup_Tools
>>
>> Either I haven't seen it, or these scripts can't do recursive backups...
>
> That's probably right, at least not automatically.

And that's why I made the script. :)

>> If you have subvolumes in subvolumes (for example: /home,
>> /home/user1, /home/user2, /var, /var/spool, /var/lib are extra
>> subvolumes IN the normal Linux file tree), my script takes them
>> all.
>
> For me, they are all subvolumes also mounted on /mnt/btrfs_poolx
> so I backup from there.

That's also possible with my script, because you can control it with a
config file.

For example, you have
/
|-@
|-@home
`-@var

and you want all your snapshots of these 3 subvolumes in separate
directories with a timestamp (and maybe a .hourly_X tag).

Put in the config:

SNPMNT=/path/to/btrfs-poolmount
BKPMNT=/path/to/external/HD/mountpoint

backup  @       roots   backup/roots
backup  @home   homes   backup/homes
backup  @var    vars    backup/vars

Start the script with
mkbtrbackup create --interval hourly -c /path/to/backupconfig

You get 3 directories in /path/to/btrfs-poolmount (roots, homes and
vars), and one directory, backup, on /path/to/external/HD/mountpoint,
which also contains the three subdirectories given in the 4th column
(leave this column blank and there is no auto-transfer to the external HD!).

In these subdirectories you get subvolumes like
@.20141130-115001.hourly_0
@home.20141130-115001.hourly_0
@var.20141130-115001.hourly_0

AND they are rotated automatically.
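
To make the naming scheme above concrete, here is a minimal C sketch that builds a name of the form <subvol>.<YYYYmmdd-HHMMSS>.<interval>_<n>. It is only an illustration: the script itself is written in bash, and the helper name snapshot_name() and its parameters are invented for this example. The layout is inferred from the three example names shown above.

```c
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (not part of mkbtrbackup): build
 * "<subvol>.<YYYYmmdd-HHMMSS>.<interval>_<index>" into buf.
 * Returns 0 on success, -1 if a buffer is too small.
 */
int snapshot_name(char *buf, size_t len, const char *subvol,
		  const struct tm *when, const char *interval, int index)
{
	char stamp[16]; /* "YYYYmmdd-HHMMSS" is 15 characters plus NUL */
	int n;

	if (strftime(stamp, sizeof(stamp), "%Y%m%d-%H%M%S", when) == 0)
		return -1;

	n = snprintf(buf, len, "%s.%s.%s_%d", subvol, stamp, interval, index);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Because the timestamp is zero-padded and ordered from year down to second, plain string comparison of two names for the same subvolume also orders them by age, which is convenient for any rotation logic.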




>> And my script changes the fstab entry in the new snapshot.
>> The original has the option subvol=@SUBVOL, where @SUBVOL is the
>> name of the original system.
>
> I don't need to do that, my script updates a symlink pointing to the
> last snapshot, and you can use subvol=symlink-name

I'm working on this; it's not finished. There are many discussions about
which is better: modifying grub.cfg on each snapshot, working with symlinks...

I create one symlink, @*.CURRENT. I will rename it to .LAST so I can
do the same with a static grub entry.

>> You get a systemd unit in the tarball, which makes a snapshot of
>> your system on successful boot, so you can switch back fast if an
>> update destroyed your system.
>>
>> And it is for minimal systems... no python, no perl, no java... only
>> shell (bash) :-)
>
> That makes sense, thanks for explaining.

For example, on a Raspberry Pi it would be a good thing. :)

I hope you try it and give me some feedback. ;-)

Jakob

pro/cons of raid1 with mdadm/lvm2

2014-11-30 Thread Gour

Hello,

recently I was migrating from Debian to openSUSE and wanted to make it
smooth by dismantling my 2x1TB raid-1 array, install SUSE on the 1st
disk, cp /home data, check everything is OK and then add 2nd disk
containing Debian into raid-1 array.

However, in order to accomplish it I learnt that one cannot simply
mount a degraded array and keep using it as with mdadm; I had to convert
the filesystem with -dconvert=single -mconvert=single, which takes some time.

That's why I'm considering putting all my partitions (swap, root with
several subvolumes, home) under LVM2 volumes and then creating a raid-1
array with mdadm, since that would let me more easily and quickly
dismantle the raid-1 array temporarily, move data from one disk to
another (I sometimes use the 2nd disk as temporary storage when
restoring archived data from tapes etc.) and then resilver the raid-1
array.

However, I wonder if there are some 'cons' in having raid-1 partition
under mdadm and not using native mirroring capabilities of btrfs fs?

Let me add that I also want to take advantage of using SUSE's snapshots
features, but I hope that's not the obstacle for the above-mentioned
layout - I'd still use btrfs' snapshot facility.


Sincerely,
Gour

-- 
Whenever and wherever there is a decline in religious practice, 
O descendant of Bharata, and a predominant rise of irreligion — 
at that time I descend Myself.



Re: root subvol id is 0 or 5?

2014-11-30 Thread Hugo Mills

On Sun, Nov 30, 2014 at 09:01:37AM +0530, Shriramana Sharma wrote:
> I am confused with this: should I call it the root subvol or
> top-level subvol or default subvol or doesn't it matter? Are all
> subvols equal, or some are more equal than others [hark to Orwell's
> Animal Farm ;-)]?

I try to use "top level" for subvolid=5.

"root subvol" is hugely confusing, as it could be one of several
things. If you mean the subvol mounted at /, then I call that "/" or
"the / subvol".

"default subvol" is the one marked as default. This starts out as
subvolid=5, but can be set to any other subvol.

> And more importantly, is the ID of the root subvol 0 or 5?

In the data structures on disk, it's 5. The kernel aliases 0 to
mean subvolid 5.

> The Oracle guide
> (https://docs.oracle.com/cd/E37670_01/E37355/html/ol_use_case3_btrfs.html)
> seems to say it's 0 :
>
> By default, the operating system mounts the parent btrfs volume,
> which has an ID of 0
>
> but the BtrFS wiki (and btrfs subvol manpage) reads 5:
>
> every btrfs filesystem has a default subvolume as its initially
> top-level subvolume, whose subvolume id is 5(FS_TREE).
>
> as also the Ubuntu Wiki:
>
> The default subvolume to mount is always the top of the btrfs tree
> (subvolid=5).

As above, both are correct here.

> Now this Oracle page
> http://www.oracle.com/technetwork/articles/servers-storage-admin/advanced-btrfs-1734952.html
> says:
>
> The only clean way to destroy the default subvolume is to rerun the
> mkfs.btrfs command, which would destroy existing data.

OK, this is actually wrong. It's not the default subvolume if
someone's run set-default on the FS. They're correct that you can't
delete the top-level subvol. You can't delete the subvol marked as
default, either. Assuming (or implying) that the two are the same is
just plain wrong.

> So from what I've (confusedly) understood so far, 0 refers to the
> superstructure (or whatchamacallit) of the entire BtrFS-based contents
> of the device(s) and hence cannot be deleted but only reset by a
> mkfs.btrfs, but 5 is only the default subvol (mounted when the FS as a
> whole is mounted without subvol spec) provided by mkfs.btrfs, and
> subvol set-default can have another subvol mounted as default instead,
> after which 5 can actually be deleted?

You can't delete subvolid=5. It's part of the fundamental
whatchamacallit of the FS (a good name). Even if you change the
default subvol, you still can't delete it.

Hugo.

--
Hugo Mills | People are too unreliable to be replaced by
hugo@... carfax.org.uk | machines.
http://carfax.org.uk/ |
PGP: 65E74AC0 | Nathan Spring, Star Cops


Re: Skript for backup btrfs on external HD

2014-11-30 Thread Jakob Schürz


On 2014-11-29 at 22:11, Marc MERLIN wrote:
> On Sat, Nov 29, 2014 at 09:34:01PM +0100, Jakob Schürz wrote:
>> Hi there!
>>
>> I made a script to do backups with btrfs to an external HD.
>> You can see what it does, how it works, and how to use it on
>> my site http://linux.xundeenergie.at/doku.php?id=mkbtrbackup
>> The site is in German. An English one will follow later.
>>
>> Do you want some explanations?
>
> Sure, how is it different from those 3?
> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup#Available_Backup_Tools

Either I haven't seen it, or these scripts can't do recursive backups...

If you have subvolumes in subvolumes (for example: /home, /home/user1,
/home/user2, /var, /var/spool, /var/lib are extra subvolumes IN the
normal Linux file tree), my script takes them all.
It checks the external storage for an older snapshot (in this case I call
all the subvolumes together a snapshot!) which is also on the
local machine. If there is one, it makes an incremental backup. If not, an
initial transfer is started. For each subvolume in the snapshot!
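
The incremental-or-initial decision described above can be sketched in C as a search for the newest snapshot name present on both sides. This is an illustration only, not the script's code (which is bash); the helper name newest_common() is invented, and the sketch assumes that snapshot names for one subvolume embed a sortable timestamp so that strcmp() order matches age.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the newest snapshot name that exists both
 * locally and on the backup disk, or NULL if the two lists share none.
 * Assumption: for a single subvolume, strcmp() order equals age order.
 */
const char *newest_common(const char *const *local, size_t nlocal,
			  const char *const *remote, size_t nremote)
{
	const char *best = NULL;

	for (size_t i = 0; i < nlocal; i++)
		for (size_t j = 0; j < nremote; j++)
			if (strcmp(local[i], remote[j]) == 0 &&
			    (!best || strcmp(local[i], best) > 0))
				best = local[i];
	return best;
}
```

A non-NULL result would be the parent for an incremental transfer (for example with btrfs send -p, although the mail does not say which command the script uses); NULL means an initial full transfer is needed.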


And my script changes the fstab entry in the new snapshot.
The original has the option subvol=@SUBVOL, where @SUBVOL is the name
of the original system.
It changes the @SUBVOL to the subvolume ID, so you can mount your
snapshot easily.


One point is missing: modifying grub to provide boot menu entries for
older snapshots.


You get a systemd unit in the tarball, which makes a snapshot of your
system on successful boot, so you can switch back fast if an update
destroyed your system.


And it is for minimal-systems... no python, no perl, no java... only 
shell(bash) :-)


regards
jakob


--
http://xundeenergie.at
http://verkehrsloesungen.wordpress.com/
http://cogitationum.wordpress.com/


Re: Moving contents from one subvol to another

On Sun, Nov 30, 2014 at 9:23 AM, Shriramana Sharma <samj...@gmail.com> wrote:

> Why should noCoW affect cp --reflink anyhow? I just created a 500 MiB
> file from /dev/urandom under a chattr +C-ed dir, and copied to another
> subvol using cp --reflink, and fi df still shows 500 MiB, not 1 GiB.

Looks like I might have spoken too soon (because I've read that some
changes aren't visible until the next FS commit) so right now it
actually says 1 GiB used, which I can't grok because why should a
nocow file be physically copied (to new blocks) just because it's
nocow? Is it because it is possible that the two copies are
overwritten separately at the same time?

But still, it seems to me that mv should make it so that the nocow
attr is temporarily (atomically?) suspended/ignored just for the
duration of the relocation, since there aren't going to be any two
copies to be overwritten at the same time.

Comments?

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

Re: Change total in btrfs filesystem df output to alloc

Attached patch.

On Sun, Nov 30, 2014 at 9:30 AM, Shriramana Sharma <samj...@gmail.com> wrote:
> On Sun, Aug 31, 2014 at 7:25 AM, Shriramana Sharma <samj...@gmail.com> wrote:
>> Hello. There seem to be lots of questions in various forums re the
>> output of btrfs fi df -- especially w.r.t. the usage of the word
>> "total". For example see https://community.oracle.com/thread/2459838
>>
>> I feel it would make the intent clearer if "total" were changed to
>> "alloc" or "allocated" (if the short form is felt unclear). It would
>> also help people understand the output of regular df on a btrfs system
>> since one can understand easier that pre-allocated space would count
>> as used space as it is not free!
>>
>> Where should I report a bug to get this fixed? Thanks.
>>
>> --
>> Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा



-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा
From 3d386053105ef7c2dba3643530dffe3ecd4dcf49 Mon Sep 17 00:00:00 2001
From: Shriramana Sharma <samj...@gmail.com>
Date: Sun, 30 Nov 2014 19:00:38 +0530
Subject: [PATCH] df: change total to alloc

---
 cmds-filesystem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index cd6b3c6..05f6235 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -233,7 +233,7 @@ static void print_df(struct btrfs_ioctl_space_args *sargs, unsigned unit_mode)
 	struct btrfs_ioctl_space_info *sp = sargs->spaces;
 
 	for (i = 0; i < sargs->total_spaces; i++, sp++) {
-		printf("%s, %s: total=%s, used=%s\n",
+		printf("%s, %s: alloc=%s, used=%s\n",
 			group_type_str(sp->flags),
 			group_profile_str(sp->flags),
 			pretty_size_mode(sp->total_bytes, unit_mode),
-- 
2.1.3

Re: root subvol id is 0 or 5?

On Sun, Nov 30, 2014 at 5:29 PM, Hugo Mills <h...@carfax.org.uk> wrote:

> In the data structures on disk, it's 5. The kernel aliases 0 to
> mean subvolid 5.

So why 5 and not just 0 which seems a logical choice? On top of this,
one needs to alias 0 to 5!

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

Re: root subvol id is 0 or 5?

On Sun, Nov 30, 2014 at 7:08 PM, Shriramana Sharma <samj...@gmail.com> wrote:

> So why 5 and not just 0 which seems a logical choice? On top of this,
> one needs to alias 0 to 5!

Attached patch clarifying this in the documentation. (Should have done
this with the previous mail. Sorry for multiple mails.)

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा
From 54387ff2155423d990b5a9aca95315fe6e649303 Mon Sep 17 00:00:00 2001
From: Shriramana Sharma <samj...@gmail.com>
Date: Sun, 30 Nov 2014 19:11:39 +0530
Subject: [PATCH 2/2] btrfs subvolume doc clarifications

---
 Documentation/btrfs-subvolume.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/btrfs-subvolume.txt b/Documentation/btrfs-subvolume.txt
index 1360aba..34abdef 100644
--- a/Documentation/btrfs-subvolume.txt
+++ b/Documentation/btrfs-subvolume.txt
@@ -31,7 +31,7 @@ When `mount`(8) using 'subvol' or 'subvolid' mount option, one can access
 files/directories/subvolumes inside it, but nothing in parent subvolumes.
 
 Also every btrfs filesystem has a default subvolume as its initially top-level
-subvolume, whose subvolume id is 5(FS_TREE).
+subvolume, whose subvolume id is 5. (0 is also acceptable as an alias.)
 
 A btrfs snapshot is much like a subvolume, but shares its data(and metadata)
 with other subvolume/snapshot. Due to the capabilities of COW, modifications
@@ -166,7 +166,7 @@ sleep N seconds between checks (default: 1)
 
 EXIT STATUS
 ---
-*btrfs subvolume* returns a zero exit status if it succeeds. Non zero is
+*btrfs subvolume* returns a zero exit status if it succeeds. A non-zero value is
 returned in case of failure.
 
 AVAILABILITY
-- 
2.1.3

Considerations in snapshotting and send/receive of nocow files?

Given that snapshotting effectively reduces the usefulness of nocow, I
suppose the preferable model to snapshotting and send/receiving such
files would be different than other files.

Should nocow files (for me only VBox images) preferably be:

1) under a separate subvolume?

2) said subvol snapshotted less often?

3) sent/received any differently?

Thanks.

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

Re: root subvol id is 0 or 5?

2014-11-30 Thread Hugo Mills

On Sun, Nov 30, 2014 at 07:08:51PM +0530, Shriramana Sharma wrote:
> On Sun, Nov 30, 2014 at 5:29 PM, Hugo Mills <h...@carfax.org.uk> wrote:
>
>> In the data structures on disk, it's 5. The kernel aliases 0 to
>> mean subvolid 5.
>
> So why 5 and not just 0 which seems a logical choice? On top of this,
> one needs to alias 0 to 5!

   All of the trees used in the FS metadata have an ID number. The
well-known trees have small, fixed IDs:

#define BTRFS_ROOT_TREE_OBJECTID 1ULL
#define BTRFS_EXTENT_TREE_OBJECTID 2ULL
#define BTRFS_CHUNK_TREE_OBJECTID 3ULL
#define BTRFS_DEV_TREE_OBJECTID 4ULL
#define BTRFS_FS_TREE_OBJECTID 5ULL
#define BTRFS_ROOT_TREE_DIR_OBJECTID 6ULL
#define BTRFS_CSUM_TREE_OBJECTID 7ULL
#define BTRFS_QUOTA_TREE_OBJECTID 8ULL
#define BTRFS_UUID_TREE_OBJECTID 9ULL

   Note that the FS tree has ID 5. A subvolume is basically another
FS tree. Subvolumes other than the top level are given dynamically-
allocated IDs starting from 256. Note also that the root, chunk,
device and extent trees are all more important, lower level
information than any FS tree, so they logically have lower numbers
(and were probably implemented earlier).
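
   As a rough illustration of the numbering just described, a check for "does this ID name an FS tree, i.e. the top level or some other subvolume?" might look like the sketch below. The helper name id_is_fs_tree() and the constant FIRST_SUBVOL_ID are invented for this example; only the values 5 and 256 come from the text above, and the sketch ignores any reserved IDs at the top of the range that real code may need to exclude.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fixed ID of the top-level FS tree, as in the list above. */
#define FS_TREE_OBJECTID 5ULL
/* Assumed name: first dynamically allocated subvolume ID (256 per the text). */
#define FIRST_SUBVOL_ID 256ULL

/* Hypothetical helper: true for the top level and for any other subvolume. */
bool id_is_fs_tree(uint64_t id)
{
	return id == FS_TREE_OBJECTID || id >= FIRST_SUBVOL_ID;
}
```

   With this, IDs 1 to 4 and 6 to 255 are rejected, which matches the idea that those small numbers are reserved for the well-known internal trees.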

   There's no particular reason that it couldn't have been designed
with the initial FS tree as ID 0, but it simply wasn't. However,
changing this value now would result in two incompatible versions of
btrfs -- neither one would be able to deal with the other's
filesystems, because the FS tree has a different ID. (And writing code
to cope with both would be painful, disruptive and error-prone.)

   The cost of fixing this minor nit would, I think, far outweigh any
benefits you'd get from it. Hence the alias for 0, which is (IIRC)
done up-front in the ioctl interface, and therefore has few places
that it could go wrong or affect the main code of the FS.
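
   The aliasing itself amounts to a one-line normalisation at the boundary, so that everything behind it only ever sees 5. The following is a sketch of the idea, not the kernel's code; the function name normalize_subvolid() is made up for this example.

```c
#include <stdint.h>

#define FS_TREE_OBJECTID 5ULL

/*
 * Hypothetical helper: map the user-visible alias 0 to the on-disk ID 5
 * and pass every other ID through unchanged.
 */
uint64_t normalize_subvolid(uint64_t requested)
{
	return requested == 0 ? FS_TREE_OBJECTID : requested;
}
```

   Doing the mapping once, up front, is what keeps the alias from leaking into the rest of the filesystem code.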

   Hugo.

-- 
Hugo Mills | How deep will this sub go?
hugo@... carfax.org.uk | Oh, she'll go all the way to the bottom if we don't
http://carfax.org.uk/  | stop her.
PGP: 65E74AC0  |  U571



Re: Running out of disk space during BTRFS_IOC_CLONE - rebalance doesn't help

2014-11-30 Thread Liu Bo

On Sun, Nov 30, 2014 at 08:29:42AM +0100, Guenther Starnberger wrote:
 I'm having an issue with a filesystem where I'm regularly running out of disk
 space during deduplication with bedup. Rebalancing does not help and the same
 issue occurs even after a full rebalance.
 
 Main use-case for this filesystem is a 3 TB backup disk where I'm creating
 backups by copying a newer version of the data into a new directory and then
 afterwards running bedup to deduplicate the data (using the older already
 existing data).
 
 What happens is that bedup will deduplicate some files successfully, but at
 some point fails with an errno 28 (no space left on device) during
 deduplication. I had some very limited success with running a balance, but
 afterwards the same issue happens again after a few more files are
 deduplicated (applies to balances with and without filters). According to fsck
 the filesystem appears to be OK.
 
 Is there anything else that I can try out in order to fix this issue? Or should
 I try to create a new filesystem and copy the existing data?
 
 Here's the log output:
 
 dmesg:
 
 [235491.227888] [ cut here ]
 [235491.227912] WARNING: CPU: 0 PID: 14837 at fs/btrfs/super.c:259 
 __btrfs_abort_transaction+0x50/0x110 [btrfs]()
 [235491.227914] BTRFS: Transaction aborted (error -28)

Something is wrong in this code: clone_finish_inode_update() is supposed
to succeed, since we've reserved some space for it in
btrfs_start_transaction().

Thanks,

-liubo

 [235491.227916] Modules linked in: fuse btrfs xor raid6_pq uas usb_storage 
 ctr ccm toshiba_acpi sparse_keymap toshiba_haps joydev hp_accel lis3lv02d 
 input_polldev hdaps(O) btusb bluetooth uvcvideo videobuf2_vmalloc 
 videobuf2_memops videobuf2_core v4l2_common videodev qcserial media usb_wwan 
 usbserial arc4 iwldvm snd_hda_codec_hdmi mousedev snd_hda_codec_conexant 
 snd_hda_codec_generic mac80211 iTCO_wdt iTCO_vendor_support coretemp 
 intel_powerclamp snd_hda_intel snd_hda_controller snd_hda_codec kvm_intel 
 snd_hwdep iwlwifi thinkpad_acpi mei_me mei cfg80211 snd_pcm nvram lpc_ich kvm 
 evdev snd_timer i915 snd mac_hid ac serio_raw e1000e psmouse led_class wmi 
 rfkill shpchp drm_kms_helper intel_ips i2c_i801 soundcore drm battery hwmon 
 ptp thermal pps_core i2c_algo_bit i2c_core video intel_agp intel_gtt button
 [235491.227968]  acpi_cpufreq processor sch_fq_codel tp_smapi(O) 
 thinkpad_ec(O) nfs lockd sunrpc fscache ext4 crc16 mbcache jbd2 
 algif_skcipher af_alg dm_crypt dm_mod atkbd libps2 crc32_pclmul crc32c_intel 
 ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper 
 ablk_helper cryptd ehci_pci ehci_hcd usbcore usb_common i8042 serio ata_piix 
 sd_mod crct10dif_generic crct10dif_pclmul crc_t10dif crct10dif_common ahci 
 libahci ata_generic libata scsi_mod
 [235491.228001] CPU: 0 PID: 14837 Comm: bedup Tainted: GW  O   
 3.17.4-1-ARCH #1
 [235491.228003] Hardware name: LENOVO 3680U4M/3680U4M, BIOS 6QET68WW (1.38 ) 
 12/01/2011
 [235491.228004]   5deed0d1 880144a57a90 
 81537b0e
 [235491.228006]  880144a57ad8 880144a57ac8 8107078d 
 ffe4
 [235491.228008]  8801719dcaa0 88009e273800 a09f7630 
 0c46
 [235491.228010] Call Trace:
 [235491.228017]  [81537b0e] dump_stack+0x4d/0x6f
 [235491.228021]  [8107078d] warn_slowpath_common+0x7d/0xa0
 [235491.228024]  [8107080c] warn_slowpath_fmt+0x5c/0x80
 [235491.228029]  [a0949d10] __btrfs_abort_transaction+0x50/0x110 
 [btrfs]
 [235491.228040]  [a09aa9ba] clone_finish_inode_update+0xda/0xf0 
 [btrfs]
 [235491.228046]  [a09ad0de] btrfs_clone+0x6ae/0xcc0 [btrfs]
 [235491.228053]  [a09ade69] btrfs_ioctl_clone+0x779/0x7b0 [btrfs]
 [235491.228059]  [a09b18b7] btrfs_ioctl+0x10d7/0x2810 [btrfs]
 [235491.228063]  [81193b19] ? free_pages_and_swap_cache+0xb9/0xe0
 [235491.228066]  [8117d14c] ? tlb_flush_mmu_free+0x2c/0x50
 [235491.228068]  [8117dd2d] ? tlb_finish_mmu+0x4d/0x50
 [235491.228070]  [81185cd2] ? unmap_region+0xe2/0x130
 [235491.228073]  [811ac539] ? kmem_cache_free+0x199/0x1d0
 [235491.228075]  [811da5f0] do_vfs_ioctl+0x2d0/0x4b0
 [235491.228076]  [81187fd0] ? do_munmap+0x260/0x400
 [235491.228078]  [811da851] SyS_ioctl+0x81/0xa0
 [235491.228081]  [8153db29] system_call_fastpath+0x16/0x1b
 [235491.228082] ---[ end trace 636d52c4c1dff6bc ]---
 
 btrfs fi show:
 
Label: none  uuid: 36c795fe-acb8-458e-87f4-721fedd81b8e
        Total devices 1 FS bytes used 2.14TiB
        devid    1 size 2.73TiB used 2.17TiB path /dev/mapper/crypt
 
 btrfs fi df:
 
 Data, single: total=2.12TiB, used=2.12TiB
 System, DUP: total=32.00MiB, used=248.00KiB
 Metadata, DUP: total=25.00GiB, used=23.64GiB
 GlobalReserve, single: total=512.00MiB, used=0.00B
 
 I reported the same issue a year ago in 20131202081543.ga1...@gst.name and

Re: [PATCH 2/3] btrfs: fix suspicious RCU in BTRFS_IOC_DEV_INFO

2014-11-30 Thread Pranith Kumar

On Sun, Nov 30, 2014 at 3:26 AM, Omar Sandoval <osan...@osandov.com> wrote:
 A naked read of the value of an RCU pointer isn't safe. Put the whole access in
 an RCU critical section, not just the pointer dereference.

 Signed-off-by: Omar Sandoval <osan...@osandov.com>

You can use rcu_access_pointer() in the if() condition check rather
than increasing the read critical section. We should try to keep the
critical section as small as possible.

Also, since we have rcu_str_deref() we can use that instead of
rcu_dereference() on device->name. Thoughts?

 ---
  fs/btrfs/ioctl.c | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

 diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
 index ecdf68f..dd55844 100644
 --- a/fs/btrfs/ioctl.c
 +++ b/fs/btrfs/ioctl.c
 @@ -2706,6 +2706,7 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
         struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
         int ret = 0;
         char *s_uuid = NULL;
 +       struct rcu_string *name;

         di_args = memdup_user(arg, sizeof(*di_args));
         if (IS_ERR(di_args))
 @@ -2726,17 +2727,16 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
         di_args->bytes_used = btrfs_device_get_bytes_used(dev);
         di_args->total_bytes = btrfs_device_get_total_bytes(dev);
         memcpy(di_args->uuid, dev->uuid, sizeof(di_args->uuid));
 -       if (dev->name) {
 -               struct rcu_string *name;

 -               rcu_read_lock();
 -               name = rcu_dereference(dev->name);
 +       rcu_read_lock();
 +       name = rcu_dereference(dev->name);
 +       if (name) {
                 strncpy(di_args->path, name->str, sizeof(di_args->path));
 -               rcu_read_unlock();
                 di_args->path[sizeof(di_args->path) - 1] = 0;
         } else {
                 di_args->path[0] = '\0';
         }
 +       rcu_read_unlock();

  out:
         mutex_unlock(&fs_devices->device_list_mutex);
 --
 2.1.3

 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/3] btrfs: refactor btrfs_device->name updates

2014-11-30 Thread Pranith Kumar

On Sun, Nov 30, 2014 at 3:26 AM, Omar Sandoval <osan...@osandov.com> wrote:
 The rcu_string API introduced some new sparse errors but also revealed existing
 ones. First of all, the name in struct btrfs_device should be annotated as
 __rcu to prevent unsafe reads. Additionally, updates should go through
 rcu_dereference_protected to make it clear what's going on. This introduces
 some helper functions that factor out this functionality.

 Signed-off-by: Omar Sandoval <osan...@osandov.com>
 diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
 index 6e04f27..2298a70 100644
 --- a/fs/btrfs/volumes.h
 +++ b/fs/btrfs/volumes.h
 @@ -54,7 +54,7 @@ struct btrfs_device {

 struct btrfs_root *dev_root;

 -   struct rcu_string *name;
 +   struct rcu_string __rcu *name;

 u64 generation;


Since rcu_strings are rcu specific, why not annotate the char pointer
in 'struct rcu_string' with __rcu annotation? That should catch all
error-prone users of rcu_string.

-- 
Pranith

[RFC][PATCH v2] mount.btrfs helper

2014-11-30 Thread Goffredo Baroncelli

Hi all,

this patch provides a mount.btrfs helper for the mount command.
A btrfs filesystem can span several disks. This helper scans all the
partitions to discover the disks required to mount a filesystem, so it
is no longer necessary to scan the partitions beforehand.

mount.btrfs passes the devices required to mount a filesystem in the
option parameters. Suppose that a filesystem is composed of several disks
(/dev/sd[cdef]): when the user runs "mount /dev/sdd /mnt", mount.btrfs is
called and it executes the mount(2) syscall as below:

mount("/dev/sdd", "/mnt", "btrfs", 0,
      "device=/dev/sdc,device=/dev/sde,device=/dev/sdf");
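
To make the option string concrete, here is a minimal sketch in C of how the
"device=" list could be assembled before the mount(2) call. The function name
build_device_opts and the fixed-buffer handling are assumptions made for
illustration; they are not taken from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): join every device except the
 * one used as the mount source into "device=A,device=B,...".
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int build_device_opts(const char *source, const char *const *devs,
                             size_t ndevs, char *buf, size_t buflen)
{
    size_t used = 0;

    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < ndevs; i++) {
        int n;

        if (strcmp(devs[i], source) == 0)
            continue;   /* the source is already argument 1 of mount(2) */

        n = snprintf(buf + used, buflen - used, "%sdevice=%s",
                     used ? "," : "", devs[i]);
        if (n < 0 || (size_t)n >= buflen - used)
            return -1;  /* output would have been truncated */
        used += (size_t)n;
    }
    return 0;
}
```

The resulting buffer would then be passed as the last argument of mount(2),
after any other filesystem options.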

This helper uses both libblkid and libmount to discover the devices, to
manipulate the parameters and to update the mtab file.

I got the idea from the btrfs wiki; the initial goal was to avoid the
separation of the scanning phase (at boot time or during block device
discovery) from the mounting.
But now I think that its biggest advantage is that it makes possible
some actions that were not possible before, like:
- check that all the disks have a different disk_uuid
  Before mounting the filesystem, the helper checks that all the disks have
  different uuids; otherwise it stops, because it is impossible
  to guarantee that the right disks are used (i.e. some disks may be
  snapshotted by lvm...)
- wait for the availability of all disks
  It may be that not all the disks are available when mount is called. This
  helper waits a few seconds (currently 10, tunable via the device_timeout
  option) for the disks to appear.
  If the timeout expires, there are two possibilities:
  1) if the option degraded is passed, the filesystem is mounted in
     degraded mode
  2) otherwise the filesystem is NOT mounted and an error message is printed
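
The disk_uuid check described above could look roughly like the following
sketch. The name all_uuids_distinct and the flat array layout are
assumptions for illustration only; the helper in the patch may be organised
differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UUID_SIZE 16    /* raw UUID length in bytes */

/*
 * Hypothetical check (names are not from the patch): `uuids` holds n raw
 * UUIDs back to back. Returns 1 if all of them differ, 0 if any two match.
 * The quadratic scan is fine for the handful of disks in one filesystem.
 */
static int all_uuids_distinct(const unsigned char *uuids, size_t n)
{
    assert(uuids != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (memcmp(uuids + i * UUID_SIZE,
                       uuids + j * UUID_SIZE, UUID_SIZE) == 0)
                return 0;   /* duplicate found: refuse to mount */
    return 1;
}
```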

All the checks above may be skipped by passing the disks explicitly:

mount /dev/sdb -o device=/dev/sdc,device=/dev/sdd /mnt

Of course all the existing kernel checks are still present.

Below is an example of use:

ghigo@emulato:~$ sudo mkfs.btrfs /dev/vdb /dev/vdc /dev/vdd /dev/vde
ghigo@emulato:~$ sudo mount -v /dev/vdb /mnt/btrfs1/
mount: you didn't specify a filesystem type for /dev/vdb
   I will try type btrfs
INFO: scan the first device
INFO: find filesystem 'test1' [d43585b9-233e-4ce3-9201-81d68ec8e538]
INFO: source: /dev/vdb
INFO: target: /mnt/btrfs1/
INFO: vfs_opts: 0x - rw
INFO: fs_opts: (null)
INFO:dev='/dev/vdb' UUID='9e83d673-a76c-4b56-8daa-0a0659897d8c' gen=6
INFO:dev='/dev/vde' UUID='53647bb0-9c39-445a-ba3f-ce31e35026a7' gen=6
INFO:dev='/dev/vdd' UUID='8396ee54-fba1-46b3-801c-1918a9812603' gen=6
INFO:dev='/dev/vdc' UUID='577b77df-2c95-4087-90d7-2331ee10a59d' gen=6
INFO: mtab updated
INFO: mount succeded
ghigo@emulato:~$ 

you can pull the source from:

https://github.com/kreijack/btrfs-progs.git

branch
mount.btrfs

As a bonus you also get the test suite (under test/mount.btrfs-tests).

Comments are welcome

BR
G.Baroncelli

--

diff --git a/Makefile b/Makefile
index 4cae30c..8d38138 100644
--- a/Makefile
+++ b/Makefile
@@ -48,7 +48,7 @@ MAKEOPTS = --no-print-directory Q=$(Q)
 
 progs = mkfs.btrfs btrfs-debug-tree btrfsck \
btrfs btrfs-map-logical btrfs-image btrfs-zero-log btrfs-convert \
-   btrfs-find-root btrfstune btrfs-show-super
+   btrfs-find-root btrfstune btrfs-show-super mount.btrfs
 
 progs_extra = btrfs-corrupt-block btrfs-fragments btrfs-calc-size \
  btrfs-select-super
@@ -239,6 +239,12 @@ ioctl-test: $(objects) $(libs) ioctl-test.o
 	@echo [LD] $@
 	$(Q)$(CC) $(CFLAGS) -o ioctl-test $(objects) ioctl-test.o $(LDFLAGS) $(LIBS)
 
+mount.btrfs: btrfs-mount.o btrfs-mount-find-disks.o crc32c.o utils.o
+	@echo [LD] $@
+	$(Q)$(CC) $(CFLAGS) -o mount.btrfs -lmount -lblkid -luuid \
+		crc32c.o \
+		btrfs-mount.o btrfs-mount-find-disks.o $(LDFLAGS)
+
 send-test: $(objects) $(libs) send-test.o
 	@echo [LD] $@
 	$(Q)$(CC) $(CFLAGS) -o send-test $(objects) send-test.o $(LDFLAGS) $(LIBS) -lpthread
diff --git a/btrfs-mount-find-disks.c b/btrfs-mount-find-disks.c
new file mode 100644
index 0000000..89aac8b
--- /dev/null
+++ b/btrfs-mount-find-disks.c
@@ -0,0 +1,446 @@
+#define _XOPEN_SOURCE 500
+#define _GNU_SOURCE 1
+
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <sys/mount.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+#include <blkid/blkid.h>
+#include <uuid/uuid.h>
+#include <libmount/libmount.h>
+
+#include "crc32c.h"
+
+#include "kerncompat.h"
+#include "extent_io.h"
+#include "ctree.h"
+#include "disk-io.h"
+#include "btrfs-mount.h"
+
+#define BTRFS_UUID_UNPARSED_SIZE 37
+
+/*
+ * checks if a path is a block device node
+ * Returns negative errno on failure, otherwise
+ * returns 1 for blockdev, 0 for not-blockdev
+ */
+static int

Re: pro/cons of raid1 with mdadm/lvm2

2014-11-30 Thread Russell Coker

When the 2 disks have different data mdadm has no way of knowing which one is 
correct and has a 50% chance of overwriting good data. But BTRFS does checksums 
on all reads and solves the problem of corrupt data - as long as you don't have 
2 corrupt sectors in matching blocks.
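
As a rough illustration of the point above, here is a small C sketch of how a
stored checksum lets a reader pick the good copy of a mirrored block. The
function names and the toy checksum are assumptions for the example; btrfs
itself uses crc32c and its own read-repair code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy checksum for illustration only; it stands in for crc32c. */
static uint32_t toy_sum(const unsigned char *buf, size_t len)
{
    uint32_t s = 0;

    for (size_t i = 0; i < len; i++)
        s = s * 31 + buf[i];
    return s;
}

/*
 * Return 0 or 1 for the first mirror whose data matches the stored
 * checksum, or -1 if both copies are corrupt (the "2 corrupt sectors in
 * matching blocks" case, which cannot be repaired).
 */
static int pick_good_mirror(const unsigned char *m0, const unsigned char *m1,
                            size_t len, uint32_t stored)
{
    assert(m0 != NULL && m1 != NULL);

    if (toy_sum(m0, len) == stored)
        return 0;
    if (toy_sum(m1, len) == stored)
        return 1;
    return -1;
}
```

Without the stored checksum there is nothing to compare against, which is
the situation described for mdadm when the two disks disagree.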
-- 
Sent from my Samsung Galaxy Note 3 with K-9 Mail.

Re: [RFC][PATCH v2] mount.btrfs helper

2014-11-30 Thread Dimitri John Ledkov

Hello,

On 30 November 2014 at 17:43, Goffredo Baroncelli <kreij...@libero.it> wrote:
 Hi all,

 this patch provides a mount.btrfs helper for the mount command.
 A btrfs filesystem could span several disks. This helper scans all the
 partitions to discover all the disks required to mount a filesystem.
 So it would not necessary any-more to scan the partitions to mount a 
 filesystem.


I would welcome this, as a general idea. At the moment in debian &
ubuntu, btrfs tools package ships udev rules to call btrfs scan
whenever device nodes appear.

If scan is built into mount, I would be able to drop that udev rule.
There are also some reports (not yet re-verified) that such udev rule
is not effective, that is btrfs mount fails when attempted before udev
has attempted to be run - e.g. from initrdless boot trying to mount
btrfs systems before udev-trigger has been run (to process cold-plug
events).

-- 
Regards,

Dimitri.

Re: [RFC][PATCH v2] mount.btrfs helper

2014-11-30 Thread cwillu

In ubuntu, the initfs runs a btrfs dev scan, which should catch
anything that would be missed there.

On Sun, Nov 30, 2014 at 4:11 PM, Dimitri John Ledkov <x...@debian.org> wrote:
 Hello,

 On 30 November 2014 at 17:43, Goffredo Baroncelli <kreij...@libero.it> wrote:
 Hi all,

 this patch provides a mount.btrfs helper for the mount command.
 A btrfs filesystem could span several disks. This helper scans all the
 partitions to discover all the disks required to mount a filesystem.
 So it would not necessary any-more to scan the partitions to mount a 
 filesystem.


 I would welcome this, as a general idea. At the moment in debian &
 ubuntu, btrfs tools package ships udev rules to call btrfs scan
 whenever device nodes appear.

 If scan is built into mount, I would be able to drop that udev rule.
 There are also some reports (not yet re-verified) that such udev rule
 is not effective, that is btrfs mount fails when attempted before udev
 has attempted to be run - e.g. from initrdless boot trying to mount
 btrfs systems before udev-trigger has been run (to process cold-plug
 events).

 --
 Regards,

 Dimitri.

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-30 Thread Christoph Anton Mitterer

Agree with others about -C 256...-C sha256 is only three
letters more ;)

Ideally, sha2-256 would be used, since there will be (are) other
versions of sha which have 256 bits size.


Cheers,
Chris.





Re: [RFC][PATCH v2] mount.btrfs helper

2014-11-30 Thread Dimitri John Ledkov

On 30 November 2014 at 22:31, cwillu <cwi...@cwillu.com> wrote:

 In ubuntu, the initfs runs a btrfs dev scan, which should catch
 anything that would be missed there.


I'm sorry, udev rule(s) is not sufficient in the initramfs-less case,
as outlined.

In case of booting with initramfs, indeed, both Debian & Ubuntu
include snippets there to run btrfs scan.

-- 
Regards,

Dimitri.

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-30 Thread Dimitri John Ledkov

On 30 November 2014 at 22:59, Christoph Anton Mitterer
<cales...@scientia.net> wrote:
 Agree with others about -C 256...-C sha256 is only three
 letters more ;)

 Ideally, sha2-256 would be used, since there will be (are) other
 versions of sha which have 256 bits size.


Nope, we should use standard names. SHA-2 256 was the first SHA algo
to use 256 bits, thus it's commonly referred to as sha256 across the
board in multiple pieces of software.
SHA-3 family of hashes started to have the same length and thus will
be known as sha3-256 etc.

The shorthand variant names in the table at
http://en.wikipedia.org/wiki/SHA-1#Comparison_of_SHA_functions appear
to me to be how SHA hashes are currently referred to.

-- 
Regards,

Dimitri.

Re: [RFC][PATCH v2] mount.btrfs helper

2014-11-30 Thread cwillu

Sorry, misread initrdless as initramfs.

In #btrfs, I usually say something like "do you gain enough by not
using an initfs for this to be worth the hassle?", but of course,
that's not an argument against making mount smarter.

On Sun, Nov 30, 2014 at 4:57 PM, Dimitri John Ledkov <x...@debian.org> wrote:
 On 30 November 2014 at 22:31, cwillu <cwi...@cwillu.com> wrote:

 In ubuntu, the initfs runs a btrfs dev scan, which should catch
 anything that would be missed there.


 I'm sorry, udev rule(s) is not sufficient in the initramfs-less case,
 as outlined.

 In case of booting with initramfs, indeed, both Debian & Ubuntu
 include snippets there to run btrfs scan.

 --
 Regards,

 Dimitri.

Re: Moving an entire subvol?

2014-11-30 Thread Marc MERLIN

On Sun, Nov 30, 2014 at 03:57:06PM +0530, Shriramana Sharma wrote:
 On Sun, Nov 30, 2014 at 9:51 AM, Marc MERLIN <m...@merlins.org> wrote:
 
  So the Ubuntu Wiki BtrFS entry advises against using subvol
  set-default because it boots its kernel using root=subvol=@ and home
  as subvol=@home, and these two subvols are only present under the
  subvol with ID 5. But isn't it just possible to move i.e. reparent a
  subvol so I can move these two under another subvol and have that as
  default?
 
  Make a new subvolume called /root and just mount subvol=root
 
 Sorry if my question wasn't clear: I wanted to know how to move a
 subvol to appear under another subvol other than its original parent.
 Turns out that sudo mv @ @home target/ is quite sufficient. If so why
 would the Ubuntu wiki require that set-default not be used? Just @
 @home need to be moved to the new place, no?

I've never done that. If I had to move them, I'd just change the
mountpoint.
 
  Note that you can't mount subvols recursively in one mount AFAIK.
 
 I'm not sure what you mean. I have a few subvols in my external HDD
 which is entirely formatted as BtrFS and if I just mount the external
 HDD /dev/sdc1 I am able to access all the subvols' contents as well.

Yes, if you mount the root, it works of course.
If you mount a subvol, you cannot have it automatically mount
other subvols.
Subvols don't really know or care where they are mounted compared to one
another, and who is under whom. It's just mount setup.

Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

Re: Moving an entire subvol?

2014-11-30 Thread Chris Murphy

On Sat, Nov 29, 2014 at 8:31 PM, Shriramana Sharma <samj...@gmail.com> wrote:
 So the Ubuntu Wiki BtrFS entry advises against using subvol
 set-default because it boots its kernel using root=subvol=@ and home
 as subvol=@home, and these two subvols are only present under the
 subvol with ID 5.

The advice may have had to do with GRUB behavior prior to 2.02.
Previously GRUB attempted to honor the btrfs default subvolume, and
therefore treated any path in grub.cfg relative to the default
subvolume. Now, GRUB behaves the same as the subvol= mount option, it
is always treated as an absolute path from subvol id 5, hence the
default subvolume is ignored.

Since the default subvolume is set by a user space program I think
it's a domain violation for anything to subvert this; it really should
remain a shortcut for the user's benefit only, so they can use mount
without -o subvol=. Everything else should explicitly pass subvol=



 But isn't it just possible to move i.e. reparent a
 subvol so I can move these two under another subvol and have that as
 default?

You can move subvolumes. My suggestion is subvolumes containing
binaries shouldn't be located within another subvolume that ends up
being mounted, that way old binaries with possible vulnerabilities
aren't exposed in the normal search path.


 Possibly this is a hypothetical question as I'm not sure whether it
 would be actually practically required but looking at the specific
 Ubuntu advice on this I thought I should ask.

 I'm also not sure what openSUSE (or other distros) do about this... Do
 they mount root using subvolid, or subvol name or such?

openSUSE uses subvol id 5 for installing the OS to, and some
directories are made subvolumes such as home var and maybe usr.
Therefore when subvolid 5 is snapshot, those are exempt, and have to
be individually snapshot. The snapshots are found in the same root
directory everything else is, in a . directory (I think .snapshots ?)

Fedora uses subvolumes root and home by default, and fstab uses
subvol=root and subvol=home to mount them at / and /home respectively.

I don't know any distro using subvolid right now but that might be
prudent as it's far less user domain than subvolume names.


-- 
Chris Murphy

Re: pro/cons of raid1 with mdadm/lvm2

2014-11-30 Thread Chris Murphy

On Sun, Nov 30, 2014 at 3:06 PM, Russell Coker <russ...@coker.com.au> wrote:
 When the 2 disks have different data mdadm has no way of knowing which one is 
 correct and has a 50% chance of overwriting good data. But BTRFS does 
 checksums on all reads and solves the problem of corrupt data - as long as 
 you don't have 2 corrupt sectors in matching blocks.

Yeah. I'm not sure though if openSUSE 13.2 prevents users from
creating btrfs raid1 volumes entirely, or if it's just an install time
limitation.

I know that Fedora's installer won't allow the user to create Btrfs on
LVM, and it probably doesn't allow it on md raid either.

-- 
Chris Murphy

Crazy idea of cleanup the inode_record btrfsck things with SQL?


[BACKGROUND]
I'm trying to implement a function to repair a missing inode item.
In that case, the inode type must be salvaged (although it can fall back to
FILE).

One case: if any dir_item/index or inode_ref refers to the inode as its
parent, the type of that inode must be DIR.

However, with the current btrfsck implementation (inode_record only records
backrefs), we are unable to search for the inode_backrefs whose parent is a
given inode number.

[FIRST IMPLEMENT DESIGN]
My first thought was to implement a generic inode-relation structure,
recording parent ino, child ino, name and namelen, and to store the structures
in an rbtree, not in the child's/parent's list.
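
A minimal C sketch of such a relation record, and of the "must be DIR" rule
from the background section, might look like this. The struct layout, the
field types and the function name are assumptions for illustration, and a
linear scan stands in for the rbtree lookup keyed on the parent ino.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one parent/child link (not btrfsck code). */
struct inode_rel {
    uint64_t parent_ino;
    uint64_t child_ino;
    const char *name;
    uint16_t namelen;
};

/*
 * Return 1 if any relation names `ino` as its parent, which means the
 * missing inode item must have been a directory; return 0 otherwise, in
 * which case the caller may fall back to FILE.
 */
static int inode_must_be_dir(const struct inode_rel *rels, size_t n,
                             uint64_t ino)
{
    assert(rels != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        if (rels[i].parent_ino == ino)
            return 1;
    return 0;
}
```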

But I soon recognized that this is a perfect use case for a relational
database, with 'ino' as the primary key for the INODE table and
('parent_ino', 'child_ino', 'name') as the primary key for the INODE_REF table.

[CRAZY IDEA]
So why not use SQL to implement the btrfsck inode-record things?

With such a crazy idea, it would be much easier to do any iteration from a
given ino, and with an already mature RDB implementation like sqlite3, we could
save hundreds of lines of code implementing the rb-tree or list.

[PROS]
1. Easy to maintain
   We no longer need to maintain the rbtree searching or list iteration, only
   simple SQL statements and their wrappers.

2. Easy to extend
   If we need to record something more, like extents and their relation to
   inodes, we only need to create 2 tables and several SQL statements and
   wrappers.

3. Reduced memory usage for HUGE fs.
   When metadata grows to several TB or even more, the current rb-tree based
   implementation may run short of memory since everything is stored in memory.
   But with SQL, an RDBMS like sqlite3 can store things either in memory or on
   disk, which may hugely reduce the memory usage for a huge btrfs.

   Without an existing RDBMS, we would need to implement a complicated memory
   control system to manage memory in userland.

[CONS]
1. Heavy implementation
   SQL hides the rb-tree or B+ tree implementation but costs more memory (if not
   compressed) and CPU cycles, so it will be slower than the simple rb-tree
   implementation even with a lightweight RDBMS like sqlite3.

2. Heavy dependency
   If we use it, btrfs-progs will have an RDBMS as a build and runtime
   dependency.
   Having such low-level progs depend on a high-level program like sqlite3 may
   be very strange.

3. A lot of rework on existing code.
   Even though SQL is easier to maintain and extend, if we use it we still need
   to reimplement several hundred or even thousands of lines of code, not to
   mention the regression tests.

4. Copyright
   Will it cause any copyright problem to use a non-GPL RDBMS like sqlite3 in
   GPLv2 btrfs-progs?

[NEED FEEDBACK]
Any feedback or discussion on this crazy idea is welcome. Since this may need
a lot of work, the idea definitely needs a lot of review before it turns into
code.

Thanks,
Qu


Re: [PATCH 3/3] btrfs: refactor btrfs_device->name updates

On Sun, Nov 30, 2014 at 10:26:43AM -0500, Pranith Kumar wrote:
 On Sun, Nov 30, 2014 at 3:26 AM, Omar Sandoval <osan...@osandov.com> wrote:
  The rcu_string API introduced some new sparse errors but also revealed existing
  ones. First of all, the name in struct btrfs_device should be annotated as
  __rcu to prevent unsafe reads. Additionally, updates should go through
  rcu_dereference_protected to make it clear what's going on. This introduces
  some helper functions that factor out this functionality.
 
  Signed-off-by: Omar Sandoval <osan...@osandov.com>
  diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
  index 6e04f27..2298a70 100644
  --- a/fs/btrfs/volumes.h
  +++ b/fs/btrfs/volumes.h
  @@ -54,7 +54,7 @@ struct btrfs_device {
 
  struct btrfs_root *dev_root;
 
  -   struct rcu_string *name;
  +   struct rcu_string __rcu *name;
 
  u64 generation;
 
 
 Since rcu_strings are rcu specific, why not annotate the char pointer
 in 'struct rcu_string' with __rcu annotation? That should catch all
 error-prone users of rcu_string.
 
Because the whole structure is RCU'd, not just the str part of it. If str is
annotated as __rcu, when we (correctly) rcu_dereference an rcu_string and then
access the str member, we'll still get sparse warnings.

In any case, the above code does what I want it to do. See the following
(nonsense but illustrative) example:

#include <linux/rcustring.h>

static void example_func(void)
{
	struct rcu_string __rcu *example;
	char *str;
	str = example->str;
}

  CHECK   /home/osandov/linux/example/example.c
/home/osandov/linux/example/example.c:7:13: warning: incorrect type in assignment (different address spaces)
/home/osandov/linux/example/example.c:7:13:    expected char *str
/home/osandov/linux/example/example.c:7:13:    got char [noderef] <asn:4>*<noident>

-- 
Omar

Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-30 Thread Christoph Anton Mitterer

On Sun, 2014-11-30 at 23:05 +, Dimitri John Ledkov wrote: 
 Nope, we should use standard names.
Well, I don't know of any truly standardised name, in the sense of one that
is mandatory.
People use SHA2-xxx, SHA-xxx, SHAxxx and probably even more
combinations.

And just because something was started short-sighted and in a wrong way
it doesn't mean one cannot correct it, which is why we try to no longer
use e.g. KB but kB or KiB.

Cheers,
Chris.



Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?

Qu Wenruo posted on Mon, 01 Dec 2014 09:58:27 +0800 as excerpted:

 [CRAZY IDEA]
 So why not using SQL to implement the btrfsck inode-record things?

 2. Heavy dependency
 If use it, btrfs-progs will include RDBMS as the make and runtime
 dependency.  Such low level progs depend on high level programs
 like sqlite3 may be very strange.

I expect this will turn many of the traditionalists off, at least.  I 
could see a lot of traditional sysadmins lumping btrfs in with systemd if 
it started requiring a db, much as one of the big objections to systemd 
is the dbus requirement... even for headless servers that have never 
required it before.  Of course they could be ignored, but do we really 
want to go there?

(Personally, my gut reaction is "eew", and of course getting database
file handling correct after an ungraceful shutdown/reboot is one of the
big challenges for a filesystem as it is, so I'm not entirely sure
storing information in a database file in order to use it to help fix
the filesystem is a good idea, since it could well be that you end up
needing an fsck to restore the file... to do the fsck, but I could be
convinced.  I'm worried about the ones that can't be.)

 4. Copyright
 Will it cause any copyright problem if using non-GPL RDBMS like
 sqlite3 in GPLv2 btrfs-progs?

I just checked and at least on gentoo, sqlite's license is registered as 
public domain, which is legally mergeable with code under any other 
license free or proprietary, so there should be no problem with it.  If 
something else is used of course it would depend on its license.

I believe the general kernel-rules practice for such compatible license 
merging is to keep code under compatible licenses in separate files and 
keep the individual files under their individual licenses.  While if it's 
compatible I don't believe that's generally an actual legal requirement, 
the BSD folks in particular tend to be /very/ sensitive about code 
formerly under the BSD license, for instance, merged directly into GPL 
headlined files, because in that case they can't reverse the process.  
Personally I don't see the big deal, since they seem to have /no/ problem 
with their code being taken proprietary where they can't even /look/ at 
it, but a /huge/ problem with it being taken GPL, where they may not be 
able to directly copy back to BSD, but they obviously have the code 
available to look at anyway and still mergeable provided they dual 
license, and that makes absolutely no sense to me. <shrug>


So while I don't think it'll go over well, there should be no license 
issues at least with sqlite.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman


Re: [PATCH 2/3] btrfs: fix suspicious RCU in BTRFS_IOC_DEV_INFO

On Sun, Nov 30, 2014 at 10:11:41AM -0500, Pranith Kumar wrote:
 On Sun, Nov 30, 2014 at 3:26 AM, Omar Sandoval <osan...@osandov.com> wrote:
  A naked read of the value of an RCU pointer isn't safe. Put the whole access in
  an RCU critical section, not just the pointer dereference.
 
  Signed-off-by: Omar Sandoval <osan...@osandov.com>
 
 You can use rcu_access_pointer() in the if() condition check rather
 than increasing the read critical section. We should try to keep the
 critical section as small as possible.
 
 Also, since we have rcu_str_deref() we can use that instead of
 rcu_dereference() on device->name. Thoughts?
 
That's right, I forgot about rcu_access_pointer. The difference is probably
negligible, and I doubt the performance of this ioctl is very important. Since
we're going to be dereferencing the pointer anyways in some (most?) cases, I
think this is a bit more readable.

-- 
Omar

Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?

-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Duncan <1i5t5.dun...@cox.net>
To: linux-btrfs@vger.kernel.org
Date: 2014-12-01 11:08

> Qu Wenruo posted on Mon, 01 Dec 2014 09:58:27 +0800 as excerpted:
>
>> [CRAZY IDEA]
>> So why not using SQL to implement the btrfsck inode-record things?
>> 2. Heavy dependency
>>    If use it, btrfs-progs will include RDBMS as the make and runtime
>>    dependency.  Such low level progs depend on high level programs
>>    like sqlite3 may be very strange.
>
> I expect this will turn many of the traditionalists off, at least.  I
> could see a lot of traditional sysadmins lumping btrfs in with systemd if
> it started requiring a db, much as one of the big objections to systemd
> is the dbus requirement... even for headless servers that have never
> required it before.  Of course they could be ignored, but do we really
> want to go there?

Oh, the systemd warfare is so terrible. :(
That objection sounds very solid now.

Anyway, this is a crazy idea...
(Maybe it is only me being too lazy to implement the rb-tree based things.)

> (Personally, my gut reaction is "eew", and of course getting database
> file handling correct after an ungraceful shutdown/reboot is one of the
> big challenges for a filesystem as it is, so I'm not entirely sure
> storing information in a database file in order to use it to help fix
> the filesystem is a good idea since it could well be that you end up
> needing an fsck to restore the file... to do the fsck, but I could be
> convinced.  I'm worried about the ones that can't be.)

The db file is mostly used in memory. Only when the metadata is really,
really big (maybe when the fs tree's level is 7 or 8) might we need to use
an on-disk db file.
And the db file should be anonymous (unlinked but open), since it is only
used within one btrfsck session and is never reused or needed afterwards.
So it will not be a problem of restoring the file or anything like that.

>> 4. Copyright
>>    Will it cause any copyright problem if using non-GPL RDBMS like
>>    sqlite3 in GPLv2 btrfs-progs?
>
> I just checked and at least on gentoo, sqlite's license is registered as
> public domain, which is legally mergeable with code under any other
> license free or proprietary, so there should be no problem with it.  If
> something else is used of course it would depend on its license.
>
> I believe the general kernel-rules practice for such compatible license
> merging is to keep code under compatible licenses in separate files and
> keep the individual files under their individual licenses.  While if it's
> compatible I don't believe that's generally an actual legal requirement,
> the BSD folks in particular tend to be /very/ sensitive about code
> formerly under the BSD license, for instance, merged directly into GPL
> headlined files, because in that case they can't reverse the process.
> Personally I don't see the big deal, since they seem to have /no/ problem
> with their code being taken proprietary where they can't even /look/ at
> it, but a /huge/ problem with it being taken GPL, where they may not be
> able to directly copy back to BSD, but they obviously have the code
> available to look at anyway and still mergeable provided they dual
> license, and that makes absolutely no sense to me. <shrug>
>
> So while I don't think it'll go over well, there should be no license
> issues at least with sqlite.

Thanks for the license check anyway.

Thanks,
Qu.

[PATCH v2] btrfs: remove empty fs_devices to prevent memory runout

2014-11-30 Thread Gui Hecheng

There is a global list @fs_uuids that keeps an @fs_devices object
for each created btrfs. But when a btrfs becomes empty
(all devices belonging to it are gone), its @fs_devices remains
in the @fs_uuids list until module exit.
If we keep running mkfs.btrfs on the same device again and again,
all the empty @fs_devices produced are sure to eat up our memory.
So this case had better be prevented.

I think that each time we set up btrfs on a device, we should
check whether we are stealing that device from another btrfs
seen before. To facilitate the search, we could insert
all @btrfs_device objects in an rb_root, one @btrfs_device per physical
device, with @bdev->bd_dev as the key. Each time device stealing happens,
we should replace the corresponding @btrfs_device in the rb_root with
an up-to-date version.
If the stolen device is the last device in its @fs_devices,
then we have an empty btrfs to be deleted.

Actually there are 3 ways to steal devices and end up with an empty btrfs:
1. mkfs, with -f option
2. device add, with -f option
3. device replace, with -f option
We should act in these cases.

Moreover, there are special cases to consider:
o If there are seed devices, then it is assured that
  the devices in cloned @fs_devices are not treated as valid devices.
o If a device disappears and reappears without any touch, its
  @bdev->bd_dev may change, so we have to re-insert it into the rb_root.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
changelog
v1->v2: add handling for the device disappears-and-reappears event

*Note*
Actually this handles the case when a device disappears and
reappears without any touch.
We are going to recycle all dead btrfs_device objects in another patch.
Two events lead to dead devices:
1) device disappears and never returns again
2) device disappears and returns with a new fs on it
A shrinker shall kill the dead ones.
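
For readers following along, the insert-or-replace lookup described in the
commit message can be sketched in plain userspace C. This is only an
illustration under stated assumptions: struct dev_entry and
insert_or_replace() are hypothetical names, and a simple unbalanced binary
search tree stands in for the kernel rb-tree that the patch itself uses.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct btrfs_device; only the lookup key matters. */
struct dev_entry {
	unsigned int devnum;            /* key, playing the role of bdev->bd_dev */
	int fs_id;                      /* which filesystem currently owns the device */
	struct dev_entry *left, *right;
};

/*
 * Insert new_dev into the tree rooted at *root.  If an entry with the same
 * devnum already exists, new_dev takes over its place in the tree and the
 * displaced entry is returned (the "stolen" device); otherwise NULL.
 * Plain unbalanced BST for brevity, not the kernel rb-tree.
 */
static struct dev_entry *insert_or_replace(struct dev_entry **root,
					   struct dev_entry *new_dev)
{
	struct dev_entry **p = root;

	new_dev->left = new_dev->right = NULL;
	while (*p) {
		struct dev_entry *old = *p;

		if (new_dev->devnum < old->devnum) {
			p = &old->left;
		} else if (new_dev->devnum > old->devnum) {
			p = &old->right;
		} else {
			/* Same device number: swap the node in place. */
			new_dev->left = old->left;
			new_dev->right = old->right;
			old->left = old->right = NULL;
			*p = new_dev;
			return old;
		}
	}
	*p = new_dev;
	return NULL;
}
```

A non-NULL return value is the point where the caller would check whether
the displaced entry was the last device of its old filesystem.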
---
 fs/btrfs/super.c   |   1 +
 fs/btrfs/volumes.c | 281 ++---
 fs/btrfs/volumes.h |   6 ++
 3 files changed, 230 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 54bd91e..ee09a56 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2154,6 +2154,7 @@ static void __exit exit_btrfs_fs(void)
 	btrfs_end_io_wq_exit();
 	unregister_filesystem(&btrfs_fs_type);
 	btrfs_exit_sysfs();
+	btrfs_cleanup_valid_dev_root();
 	btrfs_cleanup_fs_uuids();
 	btrfs_exit_compress();
 	btrfs_hash_exit();
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0192051..7093cce 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
 #include <linux/kthread.h>
 #include <linux/raid/pq.h>
 #include <linux/semaphore.h>
+#include <linux/rbtree.h>
 #include <asm/div64.h>
 #include "ctree.h"
 #include "extent_map.h"
@@ -52,6 +53,126 @@ static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
 
 DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
+static struct rb_root valid_dev_root = RB_ROOT;
+
+static struct btrfs_device *insert_valid_device(struct btrfs_device *new_dev)
+{
+	struct rb_node **p;
+	struct rb_node *parent;
+	struct rb_node *new;
+	struct btrfs_device *old_dev;
+
+	WARN_ON(!mutex_is_locked(&uuid_mutex));
+
+	parent = NULL;
+	new = &new_dev->rb_node;
+
+	p = &valid_dev_root.rb_node;
+	while (*p) {
+		parent = *p;
+		old_dev = rb_entry(parent, struct btrfs_device, rb_node);
+
+		if (new_dev->devnum < old_dev->devnum)
+			p = &parent->rb_left;
+		else if (new_dev->devnum > old_dev->devnum)
+			p = &parent->rb_right;
+		else {
+			rb_replace_node(parent, new, &valid_dev_root);
+			RB_CLEAR_NODE(parent);
+
+			goto out;
+		}
+	}
+
+	old_dev = NULL;
+	rb_link_node(new, parent, p);
+	rb_insert_color(new, &valid_dev_root);
+
+out:
+	return old_dev;
+}
+
+static void free_fs_devices(struct btrfs_fs_devices *fs_devices)
+{
+	struct btrfs_device *device;
+	WARN_ON(fs_devices->opened);
+	while (!list_empty(&fs_devices->devices)) {
+		device = list_entry(fs_devices->devices.next,
+				    struct btrfs_device, dev_list);
+		list_del(&device->dev_list);
+		rcu_string_free(device->name);
+		kfree(device);
+	}
+	kfree(fs_devices);
+}
+
+static void remove_empty_fs_if_need(struct btrfs_fs_devices *old_fs)
+{
+	struct btrfs_fs_devices *seed_fs;
+
+	if (!list_empty(&old_fs->devices))
+		return;
+
+	list_del(&old_fs->list);
+
+	/* free the seed clones */
+	seed_fs = old_fs->seed;
+	free_fs_devices(old_fs);
+	while (seed_fs) {
+		old_fs = seed_fs;
+

Re: root subvol id is 0 or 5?

Hugo Mills posted on Sun, 30 Nov 2014 13:53:28 +0000 as excerpted:

> On Sun, Nov 30, 2014 at 07:08:51PM +0530, Shriramana Sharma wrote:
>> On Sun, Nov 30, 2014 at 5:29 PM, Hugo Mills h...@carfax.org.uk wrote:
>>>
>>> In the data structures on disk, it's 5. The kernel aliases 0 to
>>> mean subvolid 5.
>>
>> So why 5 and not just 0 which seems a logical choice? On top of this,
>> one needs to alias 0 to 5!
>
> All of the trees used in the FS metadata have an ID number. The
> well-known trees have small, fixed IDs:

Thanks, Hugo.

You might wish to find a place in the wiki (probably in the FAQ) for
that, since your explanation was both the clearest I can imagine and
cleared up some lingering "but why?" questions along that line for me, as
well.

And if an answer to that basic a btrfs question is still clearing stuff 
up for me, I expect it could be useful to well over 90% of potential 
btrfs wiki FAQ readers...

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman


Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?

2014-11-30 Thread Robert White


On 11/30/2014 05:58 PM, Qu Wenruo wrote:

(why not use SQL to... suggestion)

SQL, as in Structured Query Language, is _terrible_ for recursion. It
expresses all of its elements in terms of set theory and really can only
implement union and intersection of flat sets.

Several companies offer extensions to SQL in their implementations to
help with this lack of recursion, such as "prior" in Oracle's PL/SQL, but
they are all stateful beyond reason.

Several companies, including Microsoft, have proposed and partially
implemented a "relational database as a file system" paradigm and then
crashed into the fact that dealing with the parent of the parent of
something is different than dealing with the parent of the parent of the
parent of something.

There is a humorous-but-true saying: "If you have a problem, and you
decide to solve it with (regex or xml or uml or sql etc) you now have
two problems."


Writing the SQL to walk the tree is harder than allocating the memory as 
a vector, filling it with the data, and then walking the pointers.
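
To make the "vector plus pointers" alternative concrete, here is a
minimal userspace sketch. It assumes a hypothetical struct node_rec stored
in a flat array, with each record holding the array index of its parent;
none of these names come from btrfs-progs.

```c
#include <stddef.h>

/* Hypothetical in-memory record: one entry per inode, stored in a flat array. */
struct node_rec {
	unsigned long ino;
	long parent_idx;   /* index of the parent record in the array, -1 for the root */
};

/* Count how many parent hops separate recs[idx] from the root. */
static size_t depth_of(const struct node_rec *recs, size_t idx)
{
	size_t depth = 0;

	while (recs[idx].parent_idx >= 0) {
		idx = (size_t)recs[idx].parent_idx;
		depth++;
	}
	return depth;
}
```

In this form, "the parent of the parent of the parent" is just one more
trip around the loop, with no extra query or join per level.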


Your suggestion is the first step on the road to The Inner Platform
Effect™. You have a specialized database (parent, inode, name) and now
you want to put a generic database engine over the specialized database
so that you can re-implement the specialized database with generic
primitives.

http://en.wikipedia.org/wiki/Inner-platform_effect

Things need to be only as generic as they need to be, and no more
generic than that.

Replacing a pointer to a record with a pointer to a cursor's result
table that will give you the name of the next result to query is not a
win. Even as you spell it out you can see that it is _not_ a reduction
in memory or processing.

And the "easy" SQL lines stop being that easy when "name" stops being
unique.

(I've been down this road before. Not with file systems but with
managed objects in a network management system. Nodes, Parent nodes,
etc. Just referring to distributed things like network switches instead
of file system inodes. ... It doesn't end well. 8-) )



Re: [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

2014-11-30 Thread Dave Chinner

On Wed, Nov 26, 2014 at 03:30:39PM +0000, Filipe Manana wrote:
> Stress btrfs' block group allocation and deallocation while running
> fstrim in parallel. Part of the goal is also to get data block groups
> deallocated so that new metadata block groups, using the same physical
> device space ranges, get allocated while fstrim is running. This caused
> several issues ranging from invalid memory accesses, kernel crashes,
> metadata or data corruption, free space cache inconsistencies and free
> space leaks.
>
> Signed-off-by: Filipe Manana fdman...@suse.com

There's nothing btrfs specific about this test. Please make it
generic.



> +
> +# real QA test starts here
> +_need_to_be_root
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch_nocheck
> +_require_fstrim
> +
> +rm -f $seqres.full

# needs 40GB of space in the filesystem
_scratch_mkfs
_require_fs_space $SCRATCH_MNT $((40 * 1024 * 1024))

However, does it really need 40GB? It needs 2GB for the large alloc,
and then 400,000 * 4k is only 1.6GB. So this would fit in a 10GB
filesystem without a problem, right? And if it's a generic test,
keeping it under 10GB would mean it runs on the majority of
filesystem developers' test VMs, small or large.
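
The arithmetic above can be checked with a few lines of C. This is a
rough sketch using only the figures quoted in this review (one 2GB
allocation, 400,000 files of 4k each); the helper names are hypothetical
and metadata overhead is deliberately ignored.

```c
#include <stdint.h>

/* Figures taken from the review: one 2GB allocation plus 400,000 files of 4k each. */
#define LARGE_ALLOC_BYTES (2ULL * 1024 * 1024 * 1024)
#define NR_FILES          400000ULL
#define FILE_BLOCK_BYTES  4096ULL

/* Total bytes the test is estimated to write under those assumptions. */
static uint64_t estimated_footprint(void)
{
	return LARGE_ALLOC_BYTES + NR_FILES * FILE_BLOCK_BYTES;
}

/* Does that estimate fit in a filesystem of fs_gib GiB? */
static int fits_in(uint64_t fs_gib)
{
	return estimated_footprint() <= fs_gib * 1024 * 1024 * 1024;
}
```

Under these assumptions the estimate comes to 3,785,883,648 bytes, so a
10GB filesystem leaves plenty of headroom.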


> +# Create a bunch of small files that get their single extent inlined in the
> +# btree, so that we consume a lot of metadata space and get a chance of a
> +# data block group getting deleted and reused for metadata later. Sometimes
> +# the creation of all these files succeeds other times we get ENOSPC failures
> +# at some point - this depends on how fast the btrfs' cleaner kthread is
> +# notified about empty block groups, how fast it deletes them and how fast
> +# the fallocate calls happen. So we don't really care if they all succeed or
> +# not, the goal is just to keep metadata space usage growing while data block
> +# groups are deleted.
> +create_files()
> +{
> +	local prefix=$1
> +
> +	for ((i = 1; i <= 400000; i++)); do
> +		echo "Creating file ${prefix}_$i" >> $seqres.full 2>&1
> +		$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 3900" \
> +			$SCRATCH_MNT/${prefix}_$i >> $seqres.full 2>&1

You don't need to echo 400,000 file creates to $seqres.full.

This is one of those times that directing output to /dev/null makes
sense, especially as:

> +		ret=$?
> +		if [ $ret -ne 0 ]; then
> +			break
> +		fi

you can do this:

		if [ $? -ne 0 ]; then
			echo failed creating file $prefix.$i >> $seqres.full
			break
		fi

> +	done
> +
> +}
> +
> +fsz=`expr 40 \* 1024 \* 1024 \* 1024`
> +_scratch_mkfs_sized $fsz >> $seqres.full 2>&1 || \
> +	_fail "size=$fsz mkfs failed"
> +_scratch_mount
> +
> +for ((i = 0; i < 4; i++)); do
> +	trim_loop &
> +	trim_pids[$i]=$!
> +done
> +
> +fallocate_loop "falloc_file" &
> +fallocate_pid=$!
> +
> +create_files "foobar"
> +
> +kill $fallocate_pid
> +kill ${trim_pids[@]}
> +wait
> +
> +# Sleep a bit, otherwise umount fails often with EBUSY (TODO: investigate why).
> +sleep 3
> +
> +# Check for fs consistency. The trimming was racy and caused some btree nodes
> +# to get full of zeroes on disk, which obviously caused fs metadata corruption.
> +# The race often lead to missing free space entries in a block group's free
> +# space cache too.
> +_check_scratch_fs

Ummm, if you just use _require_scratch, you don't need to do this.
The test harness will check it for you.

> index e79b848..6608005 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -84,3 +84,4 @@
>  079 auto
>  080 auto
>  081 auto quick
> +082 auto

I'd suggest that for a generic test we'd want to add the "stress"
group to this, and allow the test to be scaled in terms of
filesystem size and the number of concurrent trim and fallocate
loops by $LOAD_FACTOR.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: Considerations in snapshotting and send/receive of nocow files?

Shriramana Sharma posted on Sun, 30 Nov 2014 19:17:42 +0530 as excerpted:

> Given that snapshotting effectively reduces the usefulness of nocow, I
> suppose the preferable model to snapshotting and send/receiving such
> files would be different than other files.
>
> Should nocow files (for me only VBox images) preferably be:
>
> 1) under a separate subvolume?
>
> 2) said subvol snapshotted less often?
>
> 3) sent/received any differently?

If you look back in the list history at the nocow threads, you'll see a 
lot of my answers to exactly this sort of question.

In general I'd say yes to 1 and 2, separate subvolume, in part to allow 
snapshotting it less often.  For 3, I don't deal directly with send/
receive for my own use case and it's complex enough that I've not become 
as familiar with it as I have the general fragmentation issue, but 
because send does require creating a read-only snapshot, I'd characterize 
#3 as depending on #2, and would thus suggest treating it differently to 
the extent that you keep send and therefore snapshotting to the low side 
of your reasonable range.

Here's the reasoning in a more detailed step-by-step fashion.  (I'll use 
lettered points here to avoid confusing them with your numbered points 
above, which I may wish to reference below as well.)

A) The basic issue in principle: As you've apparently found from your 
research, snapshotting and nocow can be used together but disrupt 
absolute nocow, because a snapshot locks in place the existing version of 
the file, forcing a COW on the initial change written to a (4 KiB) file 
block after a snapshot covering the same file.  The file does remain 
nocow, however, and further changes written to the same file block will 
be nocow -- until the next snapshot forces another lock-in-place, of 
course.

B) The biggest immediate practical problem leading from A is that of high-
frequency automated snapshotting -- some people are going wild and 
snapshotting as often as once a minute... at least until they see some of 
the issues that can cause (like snapshots happening nearly instantly but 
snapshot deletion often taking longer than a minute, and the current 
scaling issues involved once there's several hundreds or thousands of 
snapshots to deal with).  On a busy VM triggering change-writes with a 
similar 1 minute or lower frequency, the snapshotting very quickly 
eliminates much of the anti-fragmentation benefit of nocow in the first 
place.

C) On a more general level once again, it should be easily apparent that 
the more change-writes you can squeeze between snapshots, the more 
effective the nocow is going to be, because a higher percentage of them 
will still be nocow.
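
Points A and C can be illustrated with a toy model. The sketch below is
not btrfs code: it assumes a hypothetical file of at most MAX_BLOCKS
blocks, a snapshot taken before the first write and again after every
`interval` writes, and counts a write as COW only when it is the first
write to that block since the latest snapshot.

```c
#include <stddef.h>
#include <string.h>

#define MAX_BLOCKS 64

/*
 * Count how many writes in `writes` are forced to COW when a snapshot is
 * taken before the first write and again after every `interval` writes.
 * Each entry in `writes` is a block index below MAX_BLOCKS; `interval`
 * must be greater than zero.
 */
static size_t count_cow_writes(const unsigned *writes, size_t nwrites,
			       size_t interval)
{
	unsigned char shared[MAX_BLOCKS]; /* 1: block still shared with the latest snapshot */
	size_t cow = 0;

	for (size_t i = 0; i < nwrites; i++) {
		if (i % interval == 0)
			memset(shared, 1, sizeof(shared)); /* snapshot pins every block */
		if (shared[writes[i]]) {
			cow++;                  /* first write since the snapshot */
			shared[writes[i]] = 0;  /* later writes go in place */
		}
	}
	return cow;
}
```

With eight writes alternating between two blocks, a snapshot every two
writes makes all eight of them COW, while a single snapshot up front
leaves only the first two as COW and the other six as in-place writes.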

D) That leads pretty directly to your points 1 and 2, put the nocow files 
on their own subvolume so snapshotting the parent doesn't affect them, 
and then snapshot the nocow subvolume at a lower frequency, as low a 
frequency as can reasonably fit within your use-case target range.  For 
example, for a normally daily snapshot scenario you might snapshot the 
parent daily and the nocow subvolume every other day or twice a week.  
For a normal 4X-daily snapshot scenario (every six hours on a 24-hour 
schedule or every two hours on an 8-hour-shift schedule), you might 
snapshot the nocow subvolume only once or twice a day.  Tho of course if 
the primary goal is the snapshotting of the nocow files (the VMs in your 
case), then you may still be snapshotting it at a higher frequency than 
the parent, which you may not in fact be snapshotting at all.  The point 
remains, snapshot the nocow subvolume at as low a frequency as can 
reasonably fit your use-case/goals.

E) Regarding your point #3, since send must be done from a read-only 
snapshot, obviously you'll need to snapshot at a frequency that at 
minimum equals that of your sends.  However, if your VMs are low activity 
enough that there's a reasonable chance they won't have written any 
changes during the send, and the send is the primary reason for the 
snapshot in the first place, you may avoid /some/ of the issue by 
deleting most snapshots as soon after the send as possible.

It would work like this.  You'd do your initial full send, creating an 
initial reference on both sides, with that snapshot retained on both 
sides /as/ that initial reference.  At your primary sending frequency, 
say once a day, you'd do the send against the original parent and delete 
the sending snapshot as soon as the send completed, thus making each 
daily incremental against the original.  At a lower frequency, perhaps 
once a week or once a month, you'd retain the sending snapshot but use 
the mitigation measures discussed in F below, and could then delete older 
initially-retained-weeklies and the original full-reference, perhaps 
keeping say two quarterly snapshots on the send side.

Then if you needed to reverse the send/receive, you'd still have the last 
weekly as a reference on both sides and could replay the last daily

Re: pro/cons of raid1 with mdadm/lvm2

2014-11-30 Thread Roman Mamedov

On Sun, 30 Nov 2014 12:11:47 +0100
Gour g...@atmarama.net wrote:

> However, I wonder if there are some 'cons' in having raid-1 partition
> under mdadm and not using native mirroring capabilities of btrfs fs?

Pros:

  * mdadm RAID has much better read balancing;
Btrfs reads are satisfied from what's in effect a random drive (PID-based
balancing of threads to drives), mdadm reads from the less-loaded drive.
Also mdadm has a way to specify some RAID1 array members as to be never
used for reads if at all possible (write-mostly), which helps in RAID1 of
HDD and SSD.

  * mdadm RAID has much better write submission;
In my experience [1] Btrfs RAID1 on heavy write operations first writes to
one drive, then to another. The whole process takes up to 2x longer than
with a single drive. On the other hand mdadm writes to both drives
simultaneously.

[1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg34103.html
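
To illustrate the read-balancing point, here is a small sketch of the two
policies. Both helpers are hypothetical: the first only mirrors the idea
of choosing a drive from the reader's PID, and the second is a generic
least-loaded pick, not md's actual heuristic.

```c
/* Hypothetical helper: pick a mirror purely from the reader's PID. */
static int pick_mirror_by_pid(int pid, int num_mirrors)
{
	return pid % num_mirrors;
}

/* Hypothetical helper: pick the mirror with the fewest requests in flight. */
static int pick_least_loaded(const int *inflight, int num_mirrors)
{
	int best = 0;

	for (int i = 1; i < num_mirrors; i++)
		if (inflight[i] < inflight[best])
			best = i;
	return best;
}
```

With two mirrors, readers with PIDs 1000 and 1002 both land on mirror 0
under the PID policy regardless of how busy it is, whereas the load-aware
pick would steer the second reader to the idle drive.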

Con:

  * You only get the ability to recover from a checksum failure with Btrfs
RAID1, not with mdadm RAID1 (see Russell's reply).

-- 
With respect,
Roman

Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?

Qu Wenruo posted on Mon, 01 Dec 2014 11:24:50 +0800 as excerpted:

> The db file is mostly used in memory, only when the metadata is really
> really big, maybe when the fs tree's level is 7 or 8 we may need to use
> db file.

So fscking the database in order to fsck the filesystem isn't an issue.
One objection down! =:^)

But seriously, the politics of the idea remains its biggest nemesis in my
opinion.  And in systemd we've unfortunately a live demonstration of just
how big a nemesis that can be.  =:^(  If the technical reasoning for it
is sound and the benefit high enough, great, but IMO the benefit will
need to be pretty high to justify the risk of political fallout, and I
doubt it's anything close to that high.  But it's not my call, so we'll
see.  Things could certainly get interesting if it's judged to be worth
it.  *checking popcorn stash*

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman


Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?

 Original Message 
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Robert White rwh...@pobox.com
To: Qu Wenruo quwen...@cn.fujitsu.com, linux-btrfs 
linux-btrfs@vger.kernel.org

Date: 2014-12-01 12:03

> On 11/30/2014 05:58 PM, Qu Wenruo wrote:
>
> (why not use SQL to... suggestion)
>
> SQL, as in Structured Query Language, is _terrible_ for recursion. It
> expresses all of its elements in terms of set theory and really can
> only implement union and intersection of flat sets.
>
> Several companies offer extensions to SQL in their implementations to
> help with this lack of recursion, such as "prior" in Oracle's PL/SQL, but
> they are all stateful beyond reason.
>
> Several companies, including Microsoft, have proposed and partially
> implemented a "relational database as a file system" paradigm and then
> crashed into the fact that dealing with the parent of the parent of
> something is different than dealing with the parent of the parent of
> the parent of something.
>
> There is a humorous-but-true saying: "If you have a problem, and you
> decide to solve it with (regex or xml or uml or sql etc) you now have
> two problems."

Wait, regex, UML and XML I can see, but I never heard that SQL is one of
them...

> Writing the SQL to walk the tree is harder than allocating the memory
> as a vector, filling it with the data, and then walking the pointers.

In fact, such INODE and INODE_REF tables are not (completely or mainly)
used to walk the tree; they are mainly used to search for:

1. Is there any inode_ref that refers to a given ino as its parent?
   This is not even a problem when the fs is *OK*, since a simple
   btrfs_search_slot() with key (objectid = ino,
   type = BTRFS_DIR_INDEX/ITEM_KEY, offset = 0) will do it.

   However, when it comes to a corrupted leaf, the whole INODE_ITEM with
   its DIR_INDEX/ITEM is gone with the leaf, so the old search way is not
   usable and btrfs-progs will have to rely on another mechanism to
   determine that.
   And unfortunately, there is no such mechanism.

2. Is there any dir_index/dir_item that refers to a given ino as its child?
   The current inode_record works fine for this purpose.

So when the crazy idea disappears and sane ideas come back, it will
probably be rb-tree based (parent, ino, name, namelen) entries to record
the parent-child relation (currently it is a list_head that only records
backrefs inside the inode_record), and another rb-tree of (ino) entries
(same as the current inode_record structure).
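
As a rough illustration of what such a tree would need, here is a sketch
of a comparison function over a (parent, ino, name, namelen) key. The
struct and function names are hypothetical and not taken from
btrfs-progs; the only point is the ordering.

```c
#include <string.h>

/* Hypothetical key for the proposed parent-child tree. */
struct dir_ref_key {
	unsigned long long parent;  /* inode number of the directory */
	unsigned long long ino;     /* inode number of the child */
	const char *name;
	size_t namelen;
};

/* Order keys by parent, then ino, then name bytes, then name length. */
static int dir_ref_cmp(const struct dir_ref_key *a, const struct dir_ref_key *b)
{
	size_t min_len;
	int ret;

	if (a->parent != b->parent)
		return a->parent < b->parent ? -1 : 1;
	if (a->ino != b->ino)
		return a->ino < b->ino ? -1 : 1;

	min_len = a->namelen < b->namelen ? a->namelen : b->namelen;
	ret = memcmp(a->name, b->name, min_len);
	if (ret)
		return ret < 0 ? -1 : 1;
	if (a->namelen != b->namelen)
		return a->namelen < b->namelen ? -1 : 1;
	return 0;
}
```

Because the parent is compared first, all entries of one directory sit
next to each other in the tree, so question 1 above reduces to finding
the first key with that parent.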

> Your suggestion is the first step on the road to The Inner Platform
> Effect™. You have a specialized database (parent, inode, name) and now
> you want to put a generic database engine over the specialized
> database so that you can re-implement the specialized database with
> generic primitives.
>
> http://en.wikipedia.org/wiki/Inner-platform_effect
>
> Things need to be only as generic as they need to be, and no more
> generic than that.
>
> Replacing a pointer to a record with a pointer to a cursor's result
> table that will give you the name of the next result to query is not a
> win. Even as you spell it out you can see that it is _not_ a reduction
> in memory or processing.
>
> And the "easy" SQL lines stop being that easy when "name" stops being
> unique.

Name is still unique when the parent ino is given, so the INODE_REF table's
primary key is not name but the (parent, ino, name) combination.

But the inner platform effect still seems valid for my crazy idea.
Anyway, the crazy idea came to me when I saw the RDB-like features in
the inode_record structure,
-and I just want to save some time coding the new (parent, ino, name,
namelen) rb-tree-.

> (I've been down this road before. Not with file systems but with
> managed objects in a network management system. Nodes, Parent nodes,
> etc. Just referring to distributed things like network switches
> instead of file system inodes. ... It doesn't end well. 8-) )

The RDB idea must have come to you just as it did to me, from wanting to
write less code, right?

So it seems the end may be the same. :-(

Thanks,
Qu

Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?