[PATCH 0/3] btrfs: fix RCU string sparse noise
Hi everyone,

These patches clean up the big stack of sparse RCU errors I introduced
into the integration tree as reported by the kbuild test robot:

On Thu, Nov 27, 2014 at 06:45:20AM +0800, kbuild test robot wrote:
> tree:   git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git integration
> head:   c7a37618b60026121255c69e042d74ae5631470c
> commit: 37aad79d90a0cbf82a5eda62dfe3af4241f5aca3 [38/39] Move BTRFS RCU string to common library
> reproduce:
>   # apt-get install sparse
>   git checkout 37aad79d90a0cbf82a5eda62dfe3af4241f5aca3
>   make ARCH=x86_64 allmodconfig
>   make C=1 CF=-D__CHECK_ENDIAN__
>
> sparse warnings: (new ones prefixed by >>)
>
> >> fs/btrfs/check-integrity.c:848:25: sparse: incorrect type in argument 1 (different address spaces)
>    fs/btrfs/check-integrity.c:848:25:    expected struct rcu_string [noderef] <asn:4>*rcu_str
>    fs/btrfs/check-integrity.c:848:25:    got struct rcu_string *name

[snip, there's a lot of these]

As payment for my transgressions, this also cleans up the existing
rcu_string usage to get rid of the preexisting noise. The first patch
fixes the __rcu annotations which I got wrong on the first go. The
second fixes an incorrect use of RCU in the BTRFS_IOC_DEV_INFO ioctl.
The third refactors the volume code's usage of rcu_string, fixing a
questionable RCU or two in the process.

This patch series applies to Chris' integration branch.

Thanks!

Omar Sandoval (3):
  rcustring: clean up botched __rcu annotations
  btrfs: fix suspicious RCU in BTRFS_IOC_DEV_INFO
  btrfs: refactor btrfs_device->name updates

 fs/btrfs/ioctl.c          | 10 ++---
 fs/btrfs/volumes.c        | 93 ---
 fs/btrfs/volumes.h        |  2 +-
 include/linux/rcustring.h |  5 +--
 4 files changed, 72 insertions(+), 38 deletions(-)

-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] btrfs: refactor btrfs_device->name updates
The rcu_string API introduced some new sparse errors but also revealed
existing ones. First of all, the name in struct btrfs_device should be
annotated as __rcu to prevent unsafe reads. Additionally, updates should
go through rcu_dereference_protected to make it clear what's going on.
This introduces some helper functions that factor out this
functionality.

Signed-off-by: Omar Sandoval <osan...@osandov.com>
---
 fs/btrfs/volumes.c | 93 +-
 fs/btrfs/volumes.h |  2 +-
 2 files changed, 65 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d13b253..6913bed 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -53,6 +53,45 @@ static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
 DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
 
+/*
+ * Dereference the device name under the uuid_mutex.
+ */
+static inline struct rcu_string *
+btrfs_dev_rcu_protected_name(struct btrfs_device *dev)
+__must_hold(&uuid_mutex)
+{
+	return rcu_dereference_protected(dev->name,
+					 lockdep_is_held(&uuid_mutex));
+}
+
+/*
+ * Use when the caller is the only possible updater.
+ */
+static inline struct rcu_string *
+btrfs_dev_rcu_only_name(struct btrfs_device *dev)
+{
+	return rcu_dereference_protected(dev->name, 1);
+}
+
+/*
+ * Rename a device under the uuid_mutex.
+ */
+static inline int btrfs_dev_rename(struct btrfs_device *dev, const char *name)
+__must_hold(&uuid_mutex)
+{
+	struct rcu_string *old_name, *new_name;
+
+	new_name = rcu_string_strdup(name, GFP_NOFS);
+	if (!new_name)
+		return -ENOMEM;
+
+	old_name = btrfs_dev_rcu_protected_name(dev);
+	rcu_assign_pointer(dev->name, new_name);
+	rcu_string_free(old_name);
+
+	return 0;
+}
+
 static void lock_chunks(struct btrfs_root *root)
 {
 	mutex_lock(&root->fs_info->chunk_mutex);
@@ -114,7 +153,7 @@ static void free_fs_devices(struct btrfs_fs_devices *fs_devices)
 		device = list_entry(fs_devices->devices.next,
 				    struct btrfs_device, dev_list);
 		list_del(&device->dev_list);
-		rcu_string_free(device->name);
+		rcu_string_free(btrfs_dev_rcu_only_name(device));
 		kfree(device);
 	}
 	kfree(fs_devices);
@@ -495,12 +534,10 @@ static noinline int device_list_add(const char *path,
 			return PTR_ERR(device);
 		}
 
-		name = rcu_string_strdup(path, GFP_NOFS);
-		if (!name) {
+		if (btrfs_dev_rename(device, path)) {
 			kfree(device);
 			return -ENOMEM;
 		}
-		rcu_assign_pointer(device->name, name);
 
 		mutex_lock(&fs_devices->device_list_mutex);
 		list_add_rcu(&device->dev_list, &fs_devices->devices);
@@ -509,7 +546,11 @@ static noinline int device_list_add(const char *path,
 
 		ret = 1;
 		device->fs_devices = fs_devices;
-	} else if (!device->name || strcmp(device->name->str, path)) {
+	} else {
+		name = btrfs_dev_rcu_protected_name(device);
+		if (name && strcmp(name->str, path) == 0)
+			goto out;
+
 		/*
 		 * When FS is already mounted.
 		 * 1. If you are here and if the device->name is NULL that
@@ -547,17 +588,15 @@ static noinline int device_list_add(const char *path,
 			return -EEXIST;
 		}
 
-		name = rcu_string_strdup(path, GFP_NOFS);
-		if (!name)
+		if (btrfs_dev_rename(device, path))
 			return -ENOMEM;
-		rcu_string_free(device->name);
-		rcu_assign_pointer(device->name, name);
 		if (device->missing) {
 			fs_devices->missing_devices--;
 			device->missing = 0;
 		}
 	}
 
+out:
 	/*
 	 * Unmount does not free the btrfs_device struct but would zero
 	 * generation along with most of the other members. So just update
@@ -594,17 +633,12 @@ static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
 		if (IS_ERR(device))
 			goto error;
 
-		/*
-		 * This is ok to do without rcu read locked because we hold the
-		 * uuid mutex so nothing we touch in here is going to disappear.
-		 */
-		if (orig_dev->name) {
-			name = rcu_string_strdup(orig_dev->name->str, GFP_NOFS);
-			if (!name) {
+		name = btrfs_dev_rcu_protected_name(orig_dev);
+		if (name) {
+			if (btrfs_dev_rename(device, name->str)) {
 				kfree(device);
[PATCH 2/3] btrfs: fix suspicious RCU in BTRFS_IOC_DEV_INFO
A naked read of the value of an RCU pointer isn't safe. Put the whole
access in an RCU critical section, not just the pointer dereference.

Signed-off-by: Omar Sandoval <osan...@osandov.com>
---
 fs/btrfs/ioctl.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ecdf68f..dd55844 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2706,6 +2706,7 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
 	struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
 	int ret = 0;
 	char *s_uuid = NULL;
+	struct rcu_string *name;
 
 	di_args = memdup_user(arg, sizeof(*di_args));
 	if (IS_ERR(di_args))
@@ -2726,17 +2727,16 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
 	di_args->bytes_used = btrfs_device_get_bytes_used(dev);
 	di_args->total_bytes = btrfs_device_get_total_bytes(dev);
 	memcpy(di_args->uuid, dev->uuid, sizeof(di_args->uuid));
-	if (dev->name) {
-		struct rcu_string *name;
 
-		rcu_read_lock();
-		name = rcu_dereference(dev->name);
+	rcu_read_lock();
+	name = rcu_dereference(dev->name);
+	if (name) {
 		strncpy(di_args->path, name->str, sizeof(di_args->path));
-		rcu_read_unlock();
 		di_args->path[sizeof(di_args->path) - 1] = 0;
 	} else {
 		di_args->path[0] = '\0';
 	}
+	rcu_read_unlock();
 
 out:
 	mutex_unlock(&fs_devices->device_list_mutex);
-- 
2.1.3
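For readers without a kernel tree at hand, the "read the pointer once, then test and use that same copy" half of this fix can be sketched in user space with C11 atomics. This is only an analogy and not RCU: struct demo_device and demo_copy_name() are hypothetical names, and nothing here keeps the old string alive for a concurrent reader, which is exactly what the rcu_read_lock()/rcu_read_unlock() pair around the whole access provides in the kernel.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for a device with a replaceable name. */
struct demo_device {
	_Atomic(const char *) name;
};

/*
 * Load the shared pointer exactly once, then test and use that local copy.
 * Testing dev->name first and loading it again afterwards would leave a
 * window in which an updater could swap the pointer between the two reads.
 *
 * Unlike RCU, nothing here defers freeing of the old string while a reader
 * is still copying from it; this only illustrates the single-load pattern.
 */
static void demo_copy_name(struct demo_device *dev, char *out, size_t outlen)
{
	const char *name;

	if (outlen == 0)
		return;

	name = atomic_load_explicit(&dev->name, memory_order_acquire);
	if (name) {
		strncpy(out, name, outlen);
		out[outlen - 1] = '\0';	/* strncpy() may not terminate */
	} else {
		out[0] = '\0';
	}
}
```

The explicit terminator mirrors the di_args->path handling in the patch: strncpy() leaves the buffer unterminated when the source is at least as long as the destination.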
[PATCH 1/3] rcustring: clean up botched __rcu annotations
The rcu_string returned by rcu_string_strdup isn't technically under RCU
yet, and it makes more sense not to treat it as such. Additionally, an
rcu_string passed to rcu_string_free should already be rcu_dereferenced
and therefore not in the __rcu address space.

Signed-off-by: Omar Sandoval <osan...@osandov.com>
---
 include/linux/rcustring.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcustring.h b/include/linux/rcustring.h
index 67277ab..28bd9bc 100644
--- a/include/linux/rcustring.h
+++ b/include/linux/rcustring.h
@@ -37,8 +37,7 @@ struct rcu_string {
  * @src: The string to copy
  * @flags: Flags for kmalloc
  */
-static inline struct rcu_string __rcu *rcu_string_strdup(const char *src,
-							  gfp_t flags)
+static inline struct rcu_string *rcu_string_strdup(const char *src, gfp_t flags)
 {
 	struct rcu_string *ret;
 	size_t len = strlen(src) + 1;
@@ -54,7 +53,7 @@ static inline struct rcu_string __rcu *rcu_string_strdup(const char *src,
  * rcu_string_free() - free an RCU string
  * @str: The string
  */
-static inline void rcu_string_free(struct rcu_string __rcu *str)
+static inline void rcu_string_free(struct rcu_string *str)
 {
 	if (str)
 		kfree_rcu(str, rcu);
-- 
2.1.3
Re: Moving an entire subvol?
On Sun, Nov 30, 2014 at 9:51 AM, Marc MERLIN <m...@merlins.org> wrote:
>> So the Ubuntu Wiki BtrFS entry advises against using subvol
>> set-default because it boots its kernel using root=subvol=@ and home
>> as subvol=@home, and these two subvols are only present under the
>> subvol with ID 5. But isn't it just possible to move i.e. reparent a
>> subvol so I can move these two under another subvol and have that as
>> default?
>
> Make a new subvolume called /root and just mount subvol=root

Sorry if my question wasn't clear: I wanted to know how to move a subvol
to appear under another subvol other than its original parent. Turns
out that sudo mv @ @home target/ is quite sufficient.

>> If so why would the Ubuntu wiki require that set-default not be used?
>> Just @ @home need to be moved to the new place, no?
>
> Note that you can't mount subvols recursively in one mount AFAIK.

I'm not sure what you mean. I have a few subvols in my external HDD
which is entirely formatted as BtrFS and if I just mount the external
HDD /dev/sdc1 I am able to access all the subvols' contents as well.

--
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा
Re: Skript for backup btrfs on external HD
On 2014-11-29 at 23:18, Marc MERLIN wrote:
> On Sat, Nov 29, 2014 at 10:51:08PM +0100, Jakob Schürz wrote:
>> On 2014-11-29 at 22:11, Marc MERLIN wrote:
>>> On Sat, Nov 29, 2014 at 09:34:01PM +0100, Jakob Schürz wrote:
>>>> Hi there!
>>>>
>>>> I made a script to do backups with btrfs onto an external HD. You can
>>>> see what it does, how it works, and how to use it on my site:
>>>> http://linux.xundeenergie.at/doku.php?id=mkbtrbackup
>>>>
>>>> The site is in German. An English one will follow later.
>>>>
>>>> Do you want some explanations?
>>>
>>> Sure, how is it different from those 3?
>>> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup#Available_Backup_Tools
>>
>> Either I haven't seen it, or these scripts can't do recursive backup...
>
> That's probably right, at least not automatically.

And that's why I made the script. :)

>> If you have subvolumes in subvolumes (for example: /home, /home/user1,
>> /home/user2, /var, /var/spool, /var/lib are extra subvolumes IN the
>> normal Linux file tree), my script takes them all.
>
> For me, they are all subvolumes also mounted on /mnt/btrfs_poolx so I
> backup from there.

That's also possible with my script, because you can control it with a
config file.

For example, you have

/
|-@
|-@home
`-@var

and you want all your snapshots of these 3 subvolumes in separate
directories with a timestamp (and maybe a .hourly_X tag). Put this in
the config:

SNPMNT=/path/to/btrfs-poolmount
BKPMNT=/path/to/external/HD/mountpoint
backup @     roots backup/roots
backup @home homes backup/homes
backup @var  vars  backup/vars

Start the script with

mkbtrbackup create --interval hourly -c /path/to/backupconfig

You get 3 directories (roots, homes and vars) in
/path/to/btrfs-poolmount, and one directory "backup" on
/path/to/external/HD/mountpoint, which also contains the three
subdirectories given in the 4th column (leave this column blank and
nothing is transferred automatically to the external HD!).

In these subdirectories you get subvolumes like

@.20141130-115001.hourly_0
@home.20141130-115001.hourly_0
@var.20141130-115001.hourly_0

AND they are rotated automatically.

>> And my script changes the fstab entry in the new snapshot. The original
>> has the option subvol=@SUBVOL, where @SUBVOL is the name of the
>> original system.
>
> I don't need to do that, my script updates a symlink pointing to the
> last snapshot, and you can use subvol=symlink-name

I'm working on this; it's not finished. There are many discussions about
which is better: modifying grub.cfg for each snapshot, or working with
symlinks. I create one symlink @*.CURRENT. I will rename it to .LAST so
I can do the same with a static grub entry.

>> You get a systemd unit in the tarball, which makes a snapshot of your
>> system on successful boot, so you can switch back fast if an update
>> destroyed your system.
>>
>> And it is for minimal systems... no python, no perl, no java... only
>> shell (bash) :-)
>
> That makes sense, thanks for explaining.

For example... on a Raspberry Pi it would be a good thing. :)

Hope you try it and give me some feedback. ;-)

Jakob
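The snapshot names in the message above appear to follow the pattern subvolume name, timestamp, then interval tag and index. A minimal C sketch of that naming, assuming the pattern inferred from the three examples (snapshot_name() is a hypothetical helper, not part of mkbtrbackup, which is a bash script, and the meaning of the index during rotation is not stated in the thread):

```c
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper: format "<subvol>.<YYYYMMDD-HHMMSS>.<interval>_<index>",
 * the pattern inferred from names such as "@home.20141130-115001.hourly_0".
 * Returns 0 on success, -1 if the timestamp or the result does not fit.
 */
static int snapshot_name(char *buf, size_t buflen, const char *subvol,
			 const struct tm *when, const char *interval, int index)
{
	char stamp[16];	/* "YYYYMMDD-HHMMSS" plus the terminating NUL */
	int n;

	if (strftime(stamp, sizeof(stamp), "%Y%m%d-%H%M%S", when) == 0)
		return -1;

	n = snprintf(buf, buflen, "%s.%s.%s_%d", subvol, stamp, interval, index);
	if (n < 0 || (size_t)n >= buflen)
		return -1;
	return 0;
}
```

Because the timestamp is zero-padded and ordered from year down to second, a plain lexical sort of such names is also a chronological sort, which is convenient for any rotation scheme.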
pro/cons of raid1 with mdadm/lvm2
Hello,

recently I was migrating from Debian to openSUSE and wanted to make it
smooth by dismantling my 2x1TB raid-1 array, installing SUSE on the 1st
disk, copying the /home data, checking that everything is OK, and then
adding the 2nd disk (containing Debian) back into the raid-1 array.

However, in order to accomplish this I learnt that one cannot simply
mount such an array degraded and keep using it, as one can with mdadm;
instead I had to convert the filesystem with -dconvert=single
-mconvert=single, which takes some time.

That's why I'm considering putting all my partitions (swap, root with
several subvolumes, home) under LVM2 volumes and then creating the
raid-1 array with mdadm, since that would let me more easily and quickly
dismantle the raid-1 array temporarily, do some data manipulation from
one disk to another (sometimes I use the 2nd disk as temporary storage
when restoring archived data from tapes etc.) and then resilver the
raid-1 array.

However, I wonder if there are some 'cons' in having the raid-1
partition under mdadm and not using the native mirroring capabilities of
btrfs?

Let me add that I also want to take advantage of SUSE's snapshot
features, but I hope that's not an obstacle for the above-mentioned
layout; I'd still use btrfs' snapshot facility.

Sincerely,
Gour

--
Whenever and wherever there is a decline in religious practice, O
descendant of Bharata, and a predominant rise of irreligion - at that
time I descend Myself.
Re: root subvol id is 0 or 5?
On Sun, Nov 30, 2014 at 09:01:37AM +0530, Shriramana Sharma wrote:
> I am confused with this: should I call it the root subvol or top-level
> subvol or default subvol or doesn't it matter? Are all subvols equal,
> or some are more equal than others [hark to Orwell's Animal Farm ;-)]?

   I try to use "top level" for subvolid=5. "root subvol" is hugely
confusing, as it could be one of several things. If you mean the subvol
mounted at /, then I call that "/" or "the / subvol". "default subvol"
is the one marked as default. This starts out as subvolid=5, but can be
set to any other subvol.

> And more importantly, is the ID of the root subvol 0 or 5?

   In the data structures on disk, it's 5. The kernel aliases 0 to mean
subvolid 5.

> The Oracle guide
> (https://docs.oracle.com/cd/E37670_01/E37355/html/ol_use_case3_btrfs.html)
> seems to say it's 0:
>
> "By default, the operating system mounts the parent btrfs volume,
> which has an ID of 0"
>
> but the BtrFS wiki (and btrfs subvol manpage) reads 5:
>
> "every btrfs filesystem has a default subvolume as its initially
> top-level subvolume, whose subvolume id is 5(FS_TREE)."
>
> as also the Ubuntu Wiki:
>
> "The default subvolume to mount is always the top of the btrfs tree
> (subvolid=5)."

   As above, both are correct here.

> Now this Oracle page
> http://www.oracle.com/technetwork/articles/servers-storage-admin/advanced-btrfs-1734952.html
> says:
>
> "The only clean way to destroy the default subvolume is to rerun the
> mkfs.btrfs command, which would destroy existing data."

   OK, this is actually wrong. It's not the default subvolume if
someone's run set-default on the FS. They're correct that you can't
delete the top-level subvol. You can't delete the subvol marked as
default, either. Assuming (or implying) that the two are the same is
just plain wrong.

> So from what I've (confusedly) understood so far, 0 refers to the
> superstructure (or whatchamacallit) of the entire BtrFS-based contents
> of the device(s) and hence cannot be deleted but only reset by a
> mkfs.btrfs, but 5 is only the default subvol (mounted when the FS as a
> whole is mounted without subvol spec) provided by mkfs.btrfs, and
> subvol set-default can have another subvol mounted as default instead,
> after which 5 can actually be deleted?

   You can't delete subvolid=5. It's part of the fundamental
whatchamacallit of the FS (a good name). Even if you change the default
subvol, you still can't delete it.

   Hugo.

-- 
Hugo Mills             | People are too unreliable to be replaced by
hugo@... carfax.org.uk | machines.
http://carfax.org.uk/  |
PGP: 65E74AC0          |                      Nathan Spring, Star Cops
Re: Skript for backup btrfs on external HD
On 2014-11-29 at 22:11, Marc MERLIN wrote:
> On Sat, Nov 29, 2014 at 09:34:01PM +0100, Jakob Schürz wrote:
>> Hi there!
>>
>> I made a script to do backups with btrfs onto an external HD. You can
>> see what it does, how it works, and how to use it on my site:
>> http://linux.xundeenergie.at/doku.php?id=mkbtrbackup
>>
>> The site is in German. An English one will follow later.
>>
>> Do you want some explanations?
>
> Sure, how is it different from those 3?
> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup#Available_Backup_Tools

Either I haven't seen it, or these scripts can't do recursive backup...

If you have subvolumes in subvolumes (for example: /home, /home/user1,
/home/user2, /var, /var/spool, /var/lib are extra subvolumes IN the
normal Linux file tree), my script takes them all.

It checks on the external storage whether there is an older snapshot (in
this case I call all the subvolumes together a snapshot!) which is also
on the local machine. If so, it makes an incremental backup. If not, an
initial transfer is started. This is done for each subvolume in the
snapshot!

And my script changes the fstab entry in the new snapshot. The original
has the option subvol=@SUBVOL, where @SUBVOL is the name of the original
system. It changes the @SUBVOLUME to the subvolume ID, so you can mount
your snapshot easily.

One point is missing: modifying grub to provide boot menu entries for
older snapshots.

You get a systemd unit in the tarball, which makes a snapshot of your
system on successful boot, so you can switch back fast if an update
destroyed your system.

And it is for minimal systems... no python, no perl, no java... only
shell (bash) :-)

regards
jakob

--
http://xundeenergie.at
http://verkehrsloesungen.wordpress.com/
http://cogitationum.wordpress.com/
Re: Moving contents from one subvol to another
On Sun, Nov 30, 2014 at 9:23 AM, Shriramana Sharma <samj...@gmail.com> wrote:
> Why should noCoW affect cp --reflink anyhow? I just created a 500 MiB
> file from /dev/urandom under a chattr +C-ed dir, and copied to another
> subvol using cp --reflink, and fi df still shows 500 MiB, not 1 GiB.

Looks like I might have spoken too soon (because I've read that some
changes aren't visible until the next FS commit) so right now it
actually says 1 GiB used, which I can't grok because why should a nocow
file be physically copied (to new blocks) just because it's nocow? Is it
because it is possible that the two copies are overwritten separately at
the same time?

But still, it seems to me that mv should make it so that the nocow attr
is temporarily (atomically?) suspended/ignored just for the duration of
the relocation, since there aren't going to be any two copies to be
overwritten at the same time.

Comments?

--
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा
Re: Change total in btrfs filesystem df output to alloc
Attached patch.

On Sun, Nov 30, 2014 at 9:30 AM, Shriramana Sharma <samj...@gmail.com> wrote:
> On Sun, Aug 31, 2014 at 7:25 AM, Shriramana Sharma <samj...@gmail.com> wrote:
>> Hello. There seem to be lots of questions in various forums re the
>> output of btrfs fi df -- especially w.r.t. the usage of the word
>> total. For example see https://community.oracle.com/thread/2459838
>>
>> I feel it would make the intent clearer if total were changed to
>> alloc or allocated (if the short form is felt unclear). It would also
>> help people understand the output of regular df on a btrfs system
>> since one can understand easier that pre-allocated space would count
>> as used space as it is not free!
>>
>> Where should I report a bug to get this fixed? Thanks.
>>
>> --
>> Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

--
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

From 3d386053105ef7c2dba3643530dffe3ecd4dcf49 Mon Sep 17 00:00:00 2001
From: Shriramana Sharma <samj...@gmail.com>
Date: Sun, 30 Nov 2014 19:00:38 +0530
Subject: [PATCH] df: change total to alloc

---
 cmds-filesystem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index cd6b3c6..05f6235 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -233,7 +233,7 @@ static void print_df(struct btrfs_ioctl_space_args *sargs, unsigned unit_mode)
 	struct btrfs_ioctl_space_info *sp = sargs->spaces;
 
 	for (i = 0; i < sargs->total_spaces; i++, sp++) {
-		printf("%s, %s: total=%s, used=%s\n",
+		printf("%s, %s: alloc=%s, used=%s\n",
 			group_type_str(sp->flags),
 			group_profile_str(sp->flags),
 			pretty_size_mode(sp->total_bytes, unit_mode),
-- 
2.1.3
Re: root subvol id is 0 or 5?
On Sun, Nov 30, 2014 at 5:29 PM, Hugo Mills <h...@carfax.org.uk> wrote:
> In the data structures on disk, it's 5. The kernel aliases 0 to mean
> subvolid 5.

So why 5 and not just 0 which seems a logical choice? On top of this,
one needs to alias 0 to 5!

--
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा
Re: root subvol id is 0 or 5?
On Sun, Nov 30, 2014 at 7:08 PM, Shriramana Sharma <samj...@gmail.com> wrote:
> So why 5 and not just 0 which seems a logical choice? On top of this,
> one needs to alias 0 to 5!

Attached patch clarifying this in the documentation. (Should have done
this with the previous mail. Sorry for multiple mails.)

--
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

From 54387ff2155423d990b5a9aca95315fe6e649303 Mon Sep 17 00:00:00 2001
From: Shriramana Sharma <samj...@gmail.com>
Date: Sun, 30 Nov 2014 19:11:39 +0530
Subject: [PATCH 2/2] btrfs subvolume doc clarifications

---
 Documentation/btrfs-subvolume.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/btrfs-subvolume.txt b/Documentation/btrfs-subvolume.txt
index 1360aba..34abdef 100644
--- a/Documentation/btrfs-subvolume.txt
+++ b/Documentation/btrfs-subvolume.txt
@@ -31,7 +31,7 @@ When `mount`(8) using 'subvol' or 'subvolid' mount option, one can access
 files/directories/subvolumes inside it, but nothing in parent subvolumes.
 
 Also every btrfs filesystem has a default subvolume as its initially top-level
-subvolume, whose subvolume id is 5(FS_TREE).
+subvolume, whose subvolume id is 5. (0 is also acceptable as an alias.)
 
 A btrfs snapshot is much like a subvolume, but shares its data(and metadata)
 with other subvolume/snapshot. Due to the capabilities of COW, modifications
@@ -166,7 +166,7 @@ sleep N seconds between checks (default: 1)
 
 EXIT STATUS
 -----------
-*btrfs subvolume* returns a zero exit status if it succeeds. Non zero is
+*btrfs subvolume* returns a zero exit status if it succeeds. A non-zero value is
 returned in case of failure.
 
 AVAILABILITY
-- 
2.1.3
Considerations in snapshotting and send/receive of nocow files?
Given that snapshotting effectively reduces the usefulness of nocow, I
suppose the preferable model to snapshotting and send/receiving such
files would be different than other files.

Should nocow files (for me only VBox images) preferably be:

1) under a separate subvolume?
2) said subvol snapshotted less often?
3) sent/received any differently?

Thanks.

--
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा
Re: root subvol id is 0 or 5?
On Sun, Nov 30, 2014 at 07:08:51PM +0530, Shriramana Sharma wrote:
> On Sun, Nov 30, 2014 at 5:29 PM, Hugo Mills <h...@carfax.org.uk> wrote:
>> In the data structures on disk, it's 5. The kernel aliases 0 to mean
>> subvolid 5.
>
> So why 5 and not just 0 which seems a logical choice? On top of this,
> one needs to alias 0 to 5!

   All of the trees used in the FS metadata have an ID number. The
well-known trees have small, fixed IDs:

#define BTRFS_ROOT_TREE_OBJECTID 1ULL
#define BTRFS_EXTENT_TREE_OBJECTID 2ULL
#define BTRFS_CHUNK_TREE_OBJECTID 3ULL
#define BTRFS_DEV_TREE_OBJECTID 4ULL
#define BTRFS_FS_TREE_OBJECTID 5ULL
#define BTRFS_ROOT_TREE_DIR_OBJECTID 6ULL
#define BTRFS_CSUM_TREE_OBJECTID 7ULL
#define BTRFS_QUOTA_TREE_OBJECTID 8ULL
#define BTRFS_UUID_TREE_OBJECTID 9ULL

   Note that the FS tree has ID 5. A subvolume is basically another FS
tree. Subvolumes other than the top level are given dynamically-
allocated IDs starting from 256. Note also that the root, chunk, device
and extent trees are all more important, lower level information than
any FS tree, so they logically have lower numbers (and were probably
implemented earlier).

   There's no particular reason that it couldn't have been designed
with the initial FS tree as ID 0, but it simply wasn't. However,
changing this value now would result in two incompatible versions of
btrfs -- neither one would be able to deal with the other's
filesystems, because the FS tree has a different ID. (And writing code
to cope with both would be painful, disruptive and error-prone.) The
cost of fixing this minor nit would, I think, far outweigh any benefits
you'd get from it.

   Hence the alias for 0, which is (IIRC) done up-front in the ioctl
interface, and therefore has few places that it could go wrong or
affect the main code of the FS.

   Hugo.

-- 
Hugo Mills             | How deep will this sub go?
hugo@... carfax.org.uk | Oh, she'll go all the way to the bottom if we don't
http://carfax.org.uk/  | stop her.
PGP: 65E74AC0          |                                              U571
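As a rough illustration of the aliasing described above, the up-front translation could look something like the following user-space sketch. Only the value 5 for BTRFS_FS_TREE_OBJECTID and the starting point 256 for dynamically allocated subvolume IDs come from the message; DEMO_FIRST_DYNAMIC_SUBVOL_ID, demo_resolve_subvolid() and demo_is_subvol_id() are hypothetical names, not the kernel's actual code.

```c
#include <stdint.h>

/* Value taken from the list of well-known tree IDs quoted in the message. */
#define BTRFS_FS_TREE_OBJECTID 5ULL

/* Hypothetical name: first dynamically allocated subvolume ID (256 per the message). */
#define DEMO_FIRST_DYNAMIC_SUBVOL_ID 256ULL

/*
 * Hypothetical helper: translate a user-supplied subvolume ID once, at the
 * interface boundary, so that everything behind it only ever sees the
 * on-disk ID. 0 is treated as an alias for the top-level FS tree (5).
 */
static uint64_t demo_resolve_subvolid(uint64_t requested)
{
	return requested == 0 ? BTRFS_FS_TREE_OBJECTID : requested;
}

/*
 * Hypothetical helper: after aliasing, an ID names a subvolume only if it
 * is the top-level FS tree or lies in the dynamically allocated range; the
 * other small IDs belong to internal trees (root, extent, chunk, ...).
 */
static int demo_is_subvol_id(uint64_t id)
{
	return id == BTRFS_FS_TREE_OBJECTID ||
	       id >= DEMO_FIRST_DYNAMIC_SUBVOL_ID;
}
```

Doing the translation in one place at the boundary is what keeps the alias cheap: nothing behind that point needs to know that 0 was ever accepted.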
Re: Running out of disk space during BTRFS_IOC_CLONE - rebalance doesn't help
On Sun, Nov 30, 2014 at 08:29:42AM +0100, Guenther Starnberger wrote:
> I'm having an issue with a filesystem where I'm regularly running out
> of disk space during deduplication with bedup. Rebalancing does not
> help and the same issue occurs even after a full rebalance.
>
> Main use-case for this filesystem is a 3 TB backup disk where I'm
> creating backups by copying a newer version of the data into a new
> directory and then afterwards running bedup to deduplicate the data
> (using the older already existing data).
>
> What happens is that bedup will deduplicate some files successfully,
> but at some point fails with an errno 28 (no space left on device)
> during deduplication. I had some very limited success with running a
> balance, but afterwards the same issue happens again after a few more
> files are deduplicated (applies to balances with and without filters).
>
> According to fsck the filesystem appears to be OK. Is there anything
> else that I can try out in order to fix this issue? Or should I try to
> create a new filesystem and copy the existing data?
>
> Here's the log output:
>
> dmesg:
>
> [235491.227888] [ cut here ]
> [235491.227912] WARNING: CPU: 0 PID: 14837 at fs/btrfs/super.c:259 __btrfs_abort_transaction+0x50/0x110 [btrfs]()
> [235491.227914] BTRFS: Transaction aborted (error -28)

There is something wrong in this code: clone_finish_inode_update() is
supposed to succeed, since we've reserved some space in
btrfs_start_transaction() for it.

Thanks,

-liubo

> [235491.227916] Modules linked in: fuse btrfs xor raid6_pq uas usb_storage ctr ccm toshiba_acpi sparse_keymap toshiba_haps joydev hp_accel lis3lv02d input_polldev hdaps(O) btusb bluetooth uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common videodev qcserial media usb_wwan usbserial arc4 iwldvm snd_hda_codec_hdmi mousedev snd_hda_codec_conexant snd_hda_codec_generic mac80211 iTCO_wdt iTCO_vendor_support coretemp intel_powerclamp snd_hda_intel snd_hda_controller snd_hda_codec kvm_intel snd_hwdep iwlwifi thinkpad_acpi mei_me mei cfg80211 snd_pcm nvram lpc_ich kvm evdev snd_timer i915 snd mac_hid ac serio_raw e1000e psmouse led_class wmi rfkill shpchp drm_kms_helper intel_ips i2c_i801 soundcore drm battery hwmon ptp thermal pps_core i2c_algo_bit i2c_core video intel_agp intel_gtt button
> [235491.227968] acpi_cpufreq processor sch_fq_codel tp_smapi(O) thinkpad_ec(O) nfs lockd sunrpc fscache ext4 crc16 mbcache jbd2 algif_skcipher af_alg dm_crypt dm_mod atkbd libps2 crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd ehci_pci ehci_hcd usbcore usb_common i8042 serio ata_piix sd_mod crct10dif_generic crct10dif_pclmul crc_t10dif crct10dif_common ahci libahci ata_generic libata scsi_mod
> [235491.228001] CPU: 0 PID: 14837 Comm: bedup Tainted: GW O 3.17.4-1-ARCH #1
> [235491.228003] Hardware name: LENOVO 3680U4M/3680U4M, BIOS 6QET68WW (1.38 ) 12/01/2011
> [235491.228004] 5deed0d1 880144a57a90 81537b0e
> [235491.228006] 880144a57ad8 880144a57ac8 8107078d ffe4
> [235491.228008] 8801719dcaa0 88009e273800 a09f7630 0c46
> [235491.228010] Call Trace:
> [235491.228017] [81537b0e] dump_stack+0x4d/0x6f
> [235491.228021] [8107078d] warn_slowpath_common+0x7d/0xa0
> [235491.228024] [8107080c] warn_slowpath_fmt+0x5c/0x80
> [235491.228029] [a0949d10] __btrfs_abort_transaction+0x50/0x110 [btrfs]
> [235491.228040] [a09aa9ba] clone_finish_inode_update+0xda/0xf0 [btrfs]
> [235491.228046] [a09ad0de] btrfs_clone+0x6ae/0xcc0 [btrfs]
> [235491.228053] [a09ade69] btrfs_ioctl_clone+0x779/0x7b0 [btrfs]
> [235491.228059] [a09b18b7] btrfs_ioctl+0x10d7/0x2810 [btrfs]
> [235491.228063] [81193b19] ? free_pages_and_swap_cache+0xb9/0xe0
> [235491.228066] [8117d14c] ? tlb_flush_mmu_free+0x2c/0x50
> [235491.228068] [8117dd2d] ? tlb_finish_mmu+0x4d/0x50
> [235491.228070] [81185cd2] ? unmap_region+0xe2/0x130
> [235491.228073] [811ac539] ? kmem_cache_free+0x199/0x1d0
> [235491.228075] [811da5f0] do_vfs_ioctl+0x2d0/0x4b0
> [235491.228076] [81187fd0] ? do_munmap+0x260/0x400
> [235491.228078] [811da851] SyS_ioctl+0x81/0xa0
> [235491.228081] [8153db29] system_call_fastpath+0x16/0x1b
> [235491.228082] ---[ end trace 636d52c4c1dff6bc ]---
>
> btrfs fi show:
>
> Label: none uuid: 36c795fe-acb8-458e-87f4-721fedd81b8e
>         Total devices 1 FS bytes used 2.14TiB
>         devid 1 size 2.73TiB used 2.17TiB path /dev/mapper/crypt
>
> btrfs fi df:
>
> Data, single: total=2.12TiB, used=2.12TiB
> System, DUP: total=32.00MiB, used=248.00KiB
> Metadata, DUP: total=25.00GiB, used=23.64GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> I reported the same issue a year ago in 20131202081543.ga1...@gst.name and
Re: [PATCH 2/3] btrfs: fix suspicious RCU in BTRFS_IOC_DEV_INFO
On Sun, Nov 30, 2014 at 3:26 AM, Omar Sandoval <osan...@osandov.com> wrote:
> A naked read of the value of an RCU pointer isn't safe. Put the whole
> access in an RCU critical section, not just the pointer dereference.
>
> Signed-off-by: Omar Sandoval <osan...@osandov.com>

You can use rcu_access_pointer() in the if() condition check rather than
increasing the read critical section. We should try to keep the critical
section as small as possible.

Also, since we have rcu_str_deref() we can use that instead of
rcu_dereference() on device->name.

Thoughts?

> ---
>  fs/btrfs/ioctl.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index ecdf68f..dd55844 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2706,6 +2706,7 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
>  	struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
>  	int ret = 0;
>  	char *s_uuid = NULL;
> +	struct rcu_string *name;
>
>  	di_args = memdup_user(arg, sizeof(*di_args));
>  	if (IS_ERR(di_args))
> @@ -2726,17 +2727,16 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
>  	di_args->bytes_used = btrfs_device_get_bytes_used(dev);
>  	di_args->total_bytes = btrfs_device_get_total_bytes(dev);
>  	memcpy(di_args->uuid, dev->uuid, sizeof(di_args->uuid));
> -	if (dev->name) {
> -		struct rcu_string *name;
>
> -		rcu_read_lock();
> -		name = rcu_dereference(dev->name);
> +	rcu_read_lock();
> +	name = rcu_dereference(dev->name);
> +	if (name) {
>  		strncpy(di_args->path, name->str, sizeof(di_args->path));
> -		rcu_read_unlock();
>  		di_args->path[sizeof(di_args->path) - 1] = 0;
>  	} else {
>  		di_args->path[0] = '\0';
>  	}
> +	rcu_read_unlock();
>
>  out:
>  	mutex_unlock(&fs_devices->device_list_mutex);
> --
> 2.1.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] btrfs: refactor btrfs_device-name updates
On Sun, Nov 30, 2014 at 3:26 AM, Omar Sandoval osan...@osandov.com wrote:
> The rcu_string API introduced some new sparse errors but also revealed existing
> ones. First of all, the name in struct btrfs_device should be annotated as
> __rcu to prevent unsafe reads. Additionally, updates should go through
> rcu_dereference_protected to make it clear what's going on. This introduces
> some helper functions that factor out this functionality.
>
> Signed-off-by: Omar Sandoval osan...@osandov.com

> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 6e04f27..2298a70 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -54,7 +54,7 @@ struct btrfs_device {
>  	struct btrfs_root *dev_root;
> -	struct rcu_string *name;
> +	struct rcu_string __rcu *name;
>  	u64 generation;

Since rcu_strings are rcu specific, why not annotate the char pointer in 'struct rcu_string' with __rcu annotation? That should catch all error-prone users of rcu_string.

--
Pranith
[RFC][PATCH v2] mount.btrfs helper
Hi all,

this patch provides a "mount.btrfs" helper for the mount command. A btrfs filesystem can span several disks. This helper scans all the partitions to discover all the disks required to mount a filesystem, so it is no longer necessary to scan the partitions beforehand in order to mount a filesystem.

mount.btrfs passes the devices required to mount a filesystem in the option parameters. Suppose a filesystem is composed of several disks (/dev/sd[cdef]): when the user runs "mount /dev/sdd /mnt", mount.btrfs is called and it executes the mount(2) syscall as below:

	mount("/dev/sdd", "/mnt", "btrfs", 0, "device=/dev/sdc,device=/dev/sde,device=/dev/sdf")

This helper uses both libblkid and libmount to discover the devices, to manipulate the parameters and to update the mtab file.

I got the idea from the btrfs wiki; the initial idea was to avoid separating the scanning phase (at boot time or during block device discovery) from the mounting. But now I think its biggest advantage is that it makes possible some actions that were not possible before, like:

- check that all the disks have a different disk_uuid
  Before mounting the filesystem, the helper checks that all the disks have different UUIDs; otherwise it stops, because it is impossible to guarantee that the right disks are used (e.g. some disks may have been snapshotted by LVM...)

- wait for all the disks to become available
  It may be that not all the disks are available when mount is called. This helper waits a few seconds (currently 10, tunable via the device_timeout option) for the disks to appear. If the timeout expires, there are two possibilities:
  1) if the "degraded" option is passed, the filesystem is mounted in degraded mode
  2) otherwise the filesystem is NOT mounted and an error message is printed

All the checks above can be skipped by passing the disks explicitly:

	mount /dev/sdb -o device=/dev/sdc,device=/dev/sdd /mnt

Of course all the existing kernel checks are still present.
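To make the option-passing step concrete, here is a minimal sketch of how a helper could assemble the "device=..." option string for mount(2) from a list of discovered member devices. This is not code from the patch: build_device_opts() and its signature are hypothetical, and the real helper takes its device list from libblkid and manipulates options through libmount.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from the patch): append "device=<path>"
 * for every member device except the one already given as mount source.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int build_device_opts(const char *source, const char *const *devs,
			     size_t ndevs, char *buf, size_t buflen)
{
	size_t used = 0;
	size_t i;

	if (buflen == 0)
		return -1;
	buf[0] = '\0';

	for (i = 0; i < ndevs; i++) {
		int n;

		if (strcmp(devs[i], source) == 0)
			continue;	/* already passed as the mount source */

		/* Separate entries with a comma, except before the first one. */
		n = snprintf(buf + used, buflen - used, "%sdevice=%s",
			     used ? "," : "", devs[i]);
		if (n < 0 || (size_t)n >= buflen - used)
			return -1;	/* output would be truncated */
		used += (size_t)n;
	}
	return 0;
}
```

For the four-disk case described above, with /dev/sdd as the source, the resulting string is the same one shown in the mount(2) call.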
Below is an example of use:

ghigo@emulato:~$ sudo mkfs.btrfs /dev/vdb /dev/vdc /dev/vdd /dev/vde
ghigo@emulato:~$ sudo mount -v /dev/vdb /mnt/btrfs1/
mount: you didn't specify a filesystem type for /dev/vdb
       I will try type btrfs
INFO: scan the first device
INFO: find filesystem 'test1' [d43585b9-233e-4ce3-9201-81d68ec8e538]
INFO: source: /dev/vdb
INFO: target: /mnt/btrfs1/
INFO: vfs_opts: 0x - rw
INFO: fs_opts: (null)
INFO:dev='/dev/vdb' UUID='9e83d673-a76c-4b56-8daa-0a0659897d8c' gen=6
INFO:dev='/dev/vde' UUID='53647bb0-9c39-445a-ba3f-ce31e35026a7' gen=6
INFO:dev='/dev/vdd' UUID='8396ee54-fba1-46b3-801c-1918a9812603' gen=6
INFO:dev='/dev/vdc' UUID='577b77df-2c95-4087-90d7-2331ee10a59d' gen=6
INFO: mtab updated
INFO: mount succeded
ghigo@emulato:~$

You can pull the source from:
	https://github.com/kreijack/btrfs-progs.git
branch
	mount.btrfs

As a bonus you also get the test suite (under test/mount.btrfs-tests).

Comments are welcome.

BR
G.Baroncelli

--

diff --git a/Makefile b/Makefile
index 4cae30c..8d38138 100644
--- a/Makefile
+++ b/Makefile
@@ -48,7 +48,7 @@ MAKEOPTS = --no-print-directory Q=$(Q)
 progs = mkfs.btrfs btrfs-debug-tree btrfsck \
 	btrfs btrfs-map-logical btrfs-image btrfs-zero-log btrfs-convert \
-	btrfs-find-root btrfstune btrfs-show-super
+	btrfs-find-root btrfstune btrfs-show-super mount.btrfs
 progs_extra = btrfs-corrupt-block btrfs-fragments btrfs-calc-size \
 	btrfs-select-super
@@ -239,6 +239,12 @@ ioctl-test: $(objects) $(libs) ioctl-test.o
 	@echo [LD] $@
 	$(Q)$(CC) $(CFLAGS) -o ioctl-test $(objects) ioctl-test.o $(LDFLAGS) $(LIBS)
+mount.btrfs: btrfs-mount.o btrfs-mount-find-disks.o crc32c.o utils.o
+	@echo [LD] $@
+	$(Q)$(CC) $(CFLAGS) -o mount.btrfs -lmount -lblkid -luuid \
+		crc32c.o \
+		btrfs-mount.o btrfs-mount-find-disks.o $(LDFLAGS)
+
 send-test: $(objects) $(libs) send-test.o
 	@echo [LD] $@
 	$(Q)$(CC) $(CFLAGS) -o send-test $(objects) send-test.o $(LDFLAGS) $(LIBS) -lpthread
diff --git a/btrfs-mount-find-disks.c b/btrfs-mount-find-disks.c
new file mode 100644
index 000..89aac8b
--- /dev/null
+++ b/btrfs-mount-find-disks.c
@@ -0,0 +1,446 @@
+#define _XOPEN_SOURCE 500
+#define _GNU_SOURCE 1
+
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <sys/mount.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+#include <blkid/blkid.h>
+#include <uuid/uuid.h>
+#include <libmount/libmount.h>
+
+#include "crc32c.h"
+
+#include "kerncompat.h"
+#include "extent_io.h"
+#include "ctree.h"
+#include "disk-io.h"
+#include "btrfs-mount.h"
+
+#define BTRFS_UUID_UNPARSED_SIZE 37
+
+/*
+ * checks if a path is a block device node
+ * Returns negative errno on failure, otherwise
+ * returns 1 for blockdev, 0 for not-blockdev
+ */
+static int
Re: pro/cons of raid1 with mdadm/lvm2
When the 2 disks have different data mdadm has no way of knowing which one is correct and has a 50% chance of overwriting good data. But BTRFS does checksums on all reads and solves the problem of corrupt data - as long as you don't have 2 corrupt sectors in matching blocks.
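The difference described above can be illustrated with a toy sketch: given two mirror copies of a block and a checksum stored elsewhere, a reader can tell which copy is intact, whereas a plain mirror has nothing to compare against. The code is an illustration only, not btrfs code; pick_good_mirror() is a hypothetical name, and the bitwise CRC-32C here merely stands in for whatever checksum the filesystem keeps in its metadata.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= data[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/*
 * Hypothetical helper: return the index (0 or 1) of the first mirror whose
 * contents match the stored checksum, or -1 if both copies are corrupt.
 */
static int pick_good_mirror(const uint8_t *copy0, const uint8_t *copy1,
			    size_t len, uint32_t stored_csum)
{
	if (crc32c(copy0, len) == stored_csum)
		return 0;
	if (crc32c(copy1, len) == stored_csum)
		return 1;
	return -1;
}
```

The -1 case corresponds to the caveat in the message: if both copies of the same block are corrupt, the checksum can detect the damage but there is no good copy left to repair from.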
Re: [RFC][PATCH v2] mount.btrfs helper
Hello,

On 30 November 2014 at 17:43, Goffredo Baroncelli kreij...@libero.it wrote:
> Hi all,
>
> this patch provides a mount.btrfs helper for the mount command. A btrfs filesystem could span several disks. This helper scans all the partitions to discover all the disks required to mount a filesystem. So it would not necessary any-more to scan the partitions to mount a filesystem.

I would welcome this, as a general idea. At the moment in Debian and Ubuntu, the btrfs tools package ships udev rules to call btrfs scan whenever device nodes appear. If scan is built into mount, I would be able to drop that udev rule.

There are also some reports (not yet re-verified) that such a udev rule is not effective, that is, btrfs mount fails when attempted before udev has had a chance to run, e.g. on an initrdless boot that tries to mount btrfs filesystems before udev-trigger has been run (to process cold-plug events).

--
Regards,
Dimitri.
Re: [RFC][PATCH v2] mount.btrfs helper
In ubuntu, the initfs runs a btrfs dev scan, which should catch anything that would be missed there.

On Sun, Nov 30, 2014 at 4:11 PM, Dimitri John Ledkov x...@debian.org wrote:
> Hello,
>
> On 30 November 2014 at 17:43, Goffredo Baroncelli kreij...@libero.it wrote:
> > Hi all,
> >
> > this patch provides a mount.btrfs helper for the mount command. A btrfs filesystem could span several disks. This helper scans all the partitions to discover all the disks required to mount a filesystem. So it would not necessary any-more to scan the partitions to mount a filesystem.
>
> I would welcome this, as a general idea. At the moment in debian ubuntu, btrfs tools package ships udev rules to call btrfs scan whenever device nodes appear. If scan is built into mount, I would be able to drop that udev rule.
>
> There are also some reports (not yet re-verified) that such udev rule is not effective, that is btrfs mount fails when attempted before udev has attempted to be run - e.g. from initrdless boot trying to mount btrfs systems before udev-trigger has been run (to process cold-plug events).
>
> --
> Regards,
> Dimitri.
Re: [RFC PATCH] Btrfs: add sha256 checksum option
Agree with others about -C 256...-C sha256 is only three letters more ;)

Ideally, sha2-256 would be used, since there will be (are) other versions of sha with a 256-bit size.

Cheers,
Chris.
Re: [RFC][PATCH v2] mount.btrfs helper
On 30 November 2014 at 22:31, cwillu cwi...@cwillu.com wrote:
> In ubuntu, the initfs runs a btrfs dev scan, which should catch anything that would be missed there.

I'm sorry, but udev rule(s) are not sufficient in the initramfs-less case, as outlined.

When booting with an initramfs, indeed, both Debian and Ubuntu include snippets there to run btrfs scan.

--
Regards,
Dimitri.
Re: [RFC PATCH] Btrfs: add sha256 checksum option
On 30 November 2014 at 22:59, Christoph Anton Mitterer cales...@scientia.net wrote:
> Agree with others about -C 256...-C sha256 is only three letters more ;)
>
> Ideally, sha2-256 would be used, since there will be (are) other versions of sha which have 256 bits size.

Nope, we should use standard names.

SHA-2 256 was the first SHA algo to use 256 bits, thus it's commonly referred to as sha256 across the board in multiple pieces of software. The SHA-3 family of hashes has variants of the same length, which will thus be known as sha3-256 etc.

The shorthand variant names in the table at http://en.wikipedia.org/wiki/SHA-1#Comparison_of_SHA_functions appear to me to be how SHA hashes are currently referred to.

--
Regards,
Dimitri.
Re: [RFC][PATCH v2] mount.btrfs helper
Sorry, misread initrdless as initramfs.

In #btrfs, I usually say something like "do you gain enough by not using an initfs for this to be worth the hassle?", but of course, that's not an argument against making mount smarter.

On Sun, Nov 30, 2014 at 4:57 PM, Dimitri John Ledkov x...@debian.org wrote:
> On 30 November 2014 at 22:31, cwillu cwi...@cwillu.com wrote:
> > In ubuntu, the initfs runs a btrfs dev scan, which should catch anything that would be missed there.
>
> I'm sorry, udev rule(s) is not sufficient in the initramfs-less case, as outlined.
>
> In case of booting with initramfs, indeed, both Debian Ubuntu include snippets there to run btrfs scan.
>
> --
> Regards,
> Dimitri.
Re: Moving an entire subvol?
On Sun, Nov 30, 2014 at 03:57:06PM +0530, Shriramana Sharma wrote:
> On Sun, Nov 30, 2014 at 9:51 AM, Marc MERLIN m...@merlins.org wrote:
> > > So the Ubuntu Wiki BtrFS entry advises against using subvol set-default because it boots its kernel using root=subvol=@ and home as subvol=@home, and these two subvols are only present under the subvol with ID 5. But isn't it just possible to move i.e. reparent a subvol so I can move these two under another subvol and have that as default?
> >
> > Make a new subvolume called /root and just mount subvol=root
>
> Sorry if my question wasn't clear: I wanted to know how to move a subvol to appear under another subvol other than its original parent. Turns out that sudo mv @ @home target/ is quite sufficient.
>
> > > If so why would the Ubuntu wiki require that set-default not be used? Just @ @home need to be moved to the new place, no?
> >
> > I've never done that. If I had to move them, I'd just change the mountpoint. Note that you can't mount subvols recursively in one mount AFAIK.
>
> I'm not sure what you mean. I have a few subvols in my external HDD which is entirely formatted as BtrFS and if I just mount the external HDD /dev/sdc1 I am able to access all the subvols' contents as well.

Yes, if you mount the root, it works of course. If you mount a subvol, you cannot have it automatically mount other subvols. Subvols don't really know or care where they are mounted compared to one another, and who is under whom. It's just mount setup.

Marc
Re: Moving an entire subvol?
On Sat, Nov 29, 2014 at 8:31 PM, Shriramana Sharma samj...@gmail.com wrote:
> So the Ubuntu Wiki BtrFS entry advises against using subvol set-default because it boots its kernel using root=subvol=@ and home as subvol=@home, and these two subvols are only present under the subvol with ID 5.

The advice may have had to do with GRUB behavior prior to 2.02. Previously GRUB attempted to honor the btrfs default subvolume, and therefore treated any path in grub.cfg as relative to the default subvolume. Now GRUB behaves the same as the subvol= mount option: the path is always treated as an absolute path from subvol id 5, hence the default subvolume is ignored.

Since the default subvolume is set by a user space program, I think it's a domain violation for anything to subvert this; it really should remain a shortcut for the user's benefit only, so they can use mount without -o subvol=. Everything else should explicitly pass subvol=

> But isn't it just possible to move i.e. reparent a subvol so I can move these two under another subvol and have that as default?

You can move subvolumes. My suggestion is that subvolumes containing binaries shouldn't be located within another subvolume that ends up being mounted; that way old binaries with possible vulnerabilities aren't exposed in the normal search path.

> Possibly this is a hypothetical question as I'm not sure whether it would be actually practically required but looking at the specific Ubuntu advice on this I thought I should ask. I'm also not sure what openSUSE (or other distros) do about this... Do they mount root using subvolid, or subvol name or such?

openSUSE installs the OS to subvol id 5, and some directories are made subvolumes, such as home, var and maybe usr. Therefore when subvolid 5 is snapshotted, those are exempt and have to be snapshotted individually. The snapshots are found in the same root directory as everything else, in a dot directory (I think .snapshots ?)

Fedora uses subvolumes root and home by default, and fstab uses subvol=root and subvol=home to mount them at / and /home respectively.

I don't know any distro using subvolid right now, but that might be prudent as it's far less user domain than subvolume names.

--
Chris Murphy
Re: pro/cons of raid1 with mdadm/lvm2
On Sun, Nov 30, 2014 at 3:06 PM, Russell Coker russ...@coker.com.au wrote:
> When the 2 disks have different data mdadm has no way of knowing which one is correct and has a 50% chance of overwriting good data. But BTRFS does checksums on all reads and solves the problem of corrupt data - as long as you don't have 2 corrupt sectors in matching blocks.

Yeah. I'm not sure though if openSUSE 13.2 prevents users from creating btrfs raid1 volumes entirely, or if it's just an install time limitation. I know that Fedora's installer won't allow the user to create Btrfs on LVM, and it probably doesn't allow it on md raid either.

--
Chris Murphy
Crazy idea of cleanup the inode_record btrfsck things with SQL?
[BACKGROUND]
I'm trying to implement a function to repair a missing inode item. In that case, the inode type must be salvaged (although it can fall back to FILE).

One case is: if any dir_item/index or inode_ref refers to the inode as its parent, the type of that inode must be DIR.

However, with the current btrfsck implementation (inode_record only records backrefs), we are unable to search for the inode_backrefs whose parent is a given inode number.

[FIRST IMPLEMENT DESIGN]
My first thought was to implement a generic inode-relation structure, recording parent ino, child ino, name and namelen, and to store the structure in an rbtree, not in the child's/parent's list.

But I soon recognized that this is a perfect use case for a relational database, with 'ino' as the primary key of an INODE table and ('parent_ino', 'child_ino', 'name') as the primary key of an INODE_REF table.

[CRAZY IDEA]
So why not use SQL to implement the btrfsck inode-record things?

With such a crazy idea, it would be much easier to do any iteration from a given ino, and with an already mature RDB implementation like sqlite3, we could save hundreds of lines of code implementing the rb-tree or list.

[PROS]
1. Easy to maintain
   We no longer need to maintain the rbtree searching or list iteration, only simple SQL statements and their wrappers.

2. Easy to extend
   If we need to record something more, like extents and their relation to inodes, we only need to create 2 tables and several SQL statements and wrappers.

3. Reduced memory usage for HUGE fs
   When metadata grows to several TB or even more, the current rb-tree based implementation may run short of memory, since everything is stored in memory. But with SQL, an RDBMS like sqlite3 can store things either in memory or on disk, which may hugely reduce memory usage for a huge btrfs. Without an existing RDBMS, we would need to implement a complicated memory control system to manage memory in userland.

[CONS]
1. Heavy implementation
   SQL hides the rb-tree or B+ tree implementation but costs more memory (if not compressed) and CPU cycles, so it will be slower than the simple rb-tree implementation even with a lightweight RDBMS like sqlite3.

2. Heavy dependency
   If we use it, btrfs-progs will have an RDBMS as a build and runtime dependency. Having such low-level progs depend on a high-level program like sqlite3 may be very strange.

3. A lot of rework on existing code
   Even though SQL is easier to maintain and extend, we would still need to reimplement several hundred or even thousands of lines of code, not to mention the regression tests.

4. Copyright
   Will it cause any copyright problem to use a non-GPL RDBMS like sqlite3 in the GPLv2 btrfs-progs?

[NEED FEEDBACK]
Any feedback or discussion on this crazy idea is welcome. Since it may need a lot of work, the idea definitely needs a lot of review before it turns into code.

Thanks,
Qu
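The type-salvaging rule from the [BACKGROUND] section can be sketched in a few lines of plain C over a flat array of relation records. This is only an illustration of the rule, not btrfsck code: struct inode_rel, guess_inode_type() and the enum are hypothetical names, and a real implementation would replace the linear scan with an rb-tree lookup or, under the SQL proposal, an indexed query on parent_ino.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical record: one (parent, child, name) relation, mirroring the
 * proposed INODE_REF table.
 */
struct inode_rel {
	uint64_t parent_ino;
	uint64_t child_ino;
	const char *name;
};

enum guessed_type { GUESS_FILE, GUESS_DIR };

/*
 * If any relation names `ino` as a parent, the inode must be a directory;
 * otherwise fall back to a regular file, as the message suggests.
 * Roughly: SELECT 1 FROM INODE_REF WHERE parent_ino = ? LIMIT 1;
 */
static enum guessed_type guess_inode_type(const struct inode_rel *rels,
					  size_t nrels, uint64_t ino)
{
	size_t i;

	for (i = 0; i < nrels; i++)
		if (rels[i].parent_ino == ino)
			return GUESS_DIR;
	return GUESS_FILE;
}
```

The point of the proposal is that this "search by parent" direction is exactly what the current backref-only records cannot answer cheaply.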
Re: [PATCH 3/3] btrfs: refactor btrfs_device-name updates
On Sun, Nov 30, 2014 at 10:26:43AM -0500, Pranith Kumar wrote:
> On Sun, Nov 30, 2014 at 3:26 AM, Omar Sandoval osan...@osandov.com wrote:
> > The rcu_string API introduced some new sparse errors but also revealed existing ones. First of all, the name in struct btrfs_device should be annotated as __rcu to prevent unsafe reads. Additionally, updates should go through rcu_dereference_protected to make it clear what's going on. This introduces some helper functions that factor out this functionality.
> >
> > Signed-off-by: Omar Sandoval osan...@osandov.com
>
> > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> > index 6e04f27..2298a70 100644
> > --- a/fs/btrfs/volumes.h
> > +++ b/fs/btrfs/volumes.h
> > @@ -54,7 +54,7 @@ struct btrfs_device {
> >  	struct btrfs_root *dev_root;
> > -	struct rcu_string *name;
> > +	struct rcu_string __rcu *name;
> >  	u64 generation;
>
> Since rcu_strings are rcu specific, why not annotate the char pointer in 'struct rcu_string' with __rcu annotation? That should catch all error-prone users of rcu_string.

Because the whole structure is RCU'd, not just the str part of it. If str is annotated as __rcu, when we (correctly) rcu_dereference an rcu_string and then access the str member, we'll still get sparse warnings. In any case, the above code does what I want it to do. See the following (nonsensical but illustrative) example:

#include <linux/rcustring.h>

static void example_func(void)
{
	struct rcu_string __rcu *example;
	char *str;
	str = example->str;
}

  CHECK   /home/osandov/linux/example/example.c
/home/osandov/linux/example/example.c:7:13: warning: incorrect type in assignment (different address spaces)
/home/osandov/linux/example/example.c:7:13:    expected char *str
/home/osandov/linux/example/example.c:7:13:    got char [noderef] <asn:4>*<noident>

--
Omar
Re: [RFC PATCH] Btrfs: add sha256 checksum option
On Sun, 2014-11-30 at 23:05 +, Dimitri John Ledkov wrote:
> Nope, we should use standard names.

Well, I wouldn't know that there is really a standardised name, in the sense of one that is mandatory. People use SHA2-xxx, SHA-xxx, SHAxxx and probably even more combinations.

And just because something was started in a short-sighted and wrong way doesn't mean one cannot correct it, which is why we try to no longer use e.g. KB but kB or KiB.

Cheers,
Chris.
Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
Qu Wenruo posted on Mon, 01 Dec 2014 09:58:27 +0800 as excerpted:

> [CRAZY IDEA]
> So why not using SQL to implement the btrfsck inode-record things?

> 2. Heavy dependency
> If use it, btrfs-progs will include RDBMS as the make and runtime dependency. Such low level progs depend on high level programs like sqlite3 may be very strange.

I expect this will turn many of the traditionalists off, at least. I could see a lot of traditional sysadmins lumping btrfs in with systemd if it started requiring a db, much as one of the big objections to systemd is the dbus requirement... even for headless servers that have never required it before. Of course they could be ignored, but do we really want to go there?

(Personally, my gut reaction is "eew", and of course getting database file handling correct after an ungraceful shutdown/reboot is one of the big challenges for a filesystem as it is, so I'm not entirely sure storing information in a database file in order to use it to help fix the filesystem is a good idea since it could well be that you end up needing an fsck to restore the file... to do the fsck, but I could be convinced. I'm worried about the ones that can't be.)

> 4. Copyright
> Will it cause any copyright problem if using non-GPL RDBMS like sqlite3 in GPLv2 btrfs-progs?

I just checked and at least on gentoo, sqlite's license is registered as public domain, which is legally mergeable with code under any other license free or proprietary, so there should be no problem with it. If something else is used of course it would depend on its license.

I believe the general kernel-rules practice for such compatible license merging is to keep code under compatible licenses in separate files and keep the individual files under their individual licenses. While if it's compatible I don't believe that's generally an actual legal requirement, the BSD folks in particular tend to be /very/ sensitive about code formerly under the BSD license, for instance, merged directly into GPL headlined files, because in that case they can't reverse the process. Personally I don't see the big deal, since they seem to have /no/ problem with their code being taken proprietary where they can't even /look/ at it, but a /huge/ problem with it being taken GPL, where they may not be able to directly copy back to BSD, but they obviously have the code available to look at anyway and still mergeable provided they dual license, and that makes absolutely no sense to me. <shrug>

So while I don't think it'll go over well, there should be no license issues at least with sqlite.

--
Duncan - List replies preferred. No HTML msgs.
Re: [PATCH 2/3] btrfs: fix suspicious RCU in BTRFS_IOC_DEV_INFO
On Sun, Nov 30, 2014 at 10:11:41AM -0500, Pranith Kumar wrote:
> On Sun, Nov 30, 2014 at 3:26 AM, Omar Sandoval osan...@osandov.com wrote:
> > A naked read of the value of an RCU pointer isn't safe. Put the whole access in an RCU critical section, not just the pointer dereference.
> >
> > Signed-off-by: Omar Sandoval osan...@osandov.com
>
> You can use rcu_access_pointer() in the if() condition check rather than increasing the read critical section. We should try to keep the critical section as small as possible.
>
> Also, since we have rcu_str_deref() we can use that instead of rcu_dereference() on device->name. Thoughts?

That's right, I forgot about rcu_access_pointer. The difference is probably negligible, and I doubt the performance of this ioctl is very important. Since we're going to be dereferencing the pointer anyways in some (most?) cases, I think this is a bit more readable.

--
Omar
Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Duncan 1i5t5.dun...@cox.net
To: linux-btrfs@vger.kernel.org
Date: 2014-12-01 11:08

> Qu Wenruo posted on Mon, 01 Dec 2014 09:58:27 +0800 as excerpted:
>
> > [CRAZY IDEA]
> > So why not using SQL to implement the btrfsck inode-record things?
>
> > 2. Heavy dependency
> > If use it, btrfs-progs will include RDBMS as the make and runtime dependency. Such low level progs depend on high level programs like sqlite3 may be very strange.
>
> I expect this will turn many of the traditionalists off, at least. I could see a lot of traditional sysadmins lumping btrfs in with systemd if it started requiring a db, much as one of the big objections to systemd is the dbus requirement... even for headless servers that have never required it before. Of course they could be ignored, but do we really want to go there?

Oh, the systemd warfare is so terrible. :(
This objection sounds very solid now.

Anyway, this is a crazy idea... (Maybe it is just me being too lazy to implement the rb-tree based things.)

> (Personally, my gut reaction is eew, and of course getting database file handling correct after an ungraceful shutdown/reboot is one of the big challenges for a filesystem as it is, so I'm not entirely sure storing information in a database file in ordered to use it to help fix the filesystem is a good idea since it could well be that you end up needing an fsck to restore the file... to do the fsck, but I could be convinced. I'm worried about the ones that can't be.)

The db would mostly be used in memory; only when the metadata is really, really big, maybe when the fs tree's level is 7 or 8, might we need to use a db file. And the db file should be anonymous (unlinked but open), since it is only used within one btrfsck session and is not reused or really needed afterwards. So restoring it will not be a problem.

> > 4. Copyright
> > Will it cause any copyright problem if using non-GPL RDBMS like sqlite3 in GPLv2 btrfs-progs?
>
> I just checked and at least on gentoo, sqlite's license is registered as public domain, which is legally mergeable with code under any other license free or proprietary, so there should be no problem with it. If something else is used of course it would depend on its license.
>
> I believe the general kernel-rules practice for such compatible license merging is to keep code under compatible licenses in separate files and keep the individual files under their individual licenses. While if it's compatible I don't believe that's generally an actual legal requirement, the BSD folks in particular tend to be /very/ sensitive about code formerly under the BSD license, for instance, merged directly into GPL headlined files, because in that case they can't reverse the process. Personally I don't see the big deal, since they seem to have /no/ problem with their code being taken proprietary where they can't even /look/ at it, but a /huge/ problem with it being taken GPL, where they may not be able to directly copy back to BSD, but they obviously have the code available to look at anyway and still mergeable provided they dual license, and that makes absolutely no sense to me. shrug
>
> So while I don't think it'll go over well, there should be no license issues at least with sqlite.

Thanks for the license check anyway.

Thanks,
Qu.
[PATCH v2] btrfs: remove empty fs_devices to prevent memory runout
There is a global list @fs_uuids that keeps an @fs_devices object for each created btrfs. But when a btrfs becomes empty (all devices belonging to it are gone), its @fs_devices remains in the @fs_uuids list until module exit. If we keep running mkfs.btrfs on the same device again and again, the empty @fs_devices objects produced will eventually eat up our memory. So this case had better be prevented.

I think that each time we set up btrfs on a device, we should check whether we are stealing that device from another btrfs seen before. To facilitate the search, we can insert every @btrfs_device into an rb_root, one @btrfs_device per physical device, with @bdev->bd_dev as the key. Each time device stealing happens, we should replace the corresponding @btrfs_device in the rb_root with an up-to-date version. If the stolen device is the last device in its @fs_devices, then we have an empty btrfs to be deleted.

Actually there are 3 ways to steal devices and end up with an empty btrfs:
	1. mkfs, with -f option
	2. device add, with -f option
	3. device replace, with -f option
We should act in these cases.

Moreover, there are special cases to consider:
o If there are seed devices, then it is assured that the devices in cloned @fs_devices are not treated as valid devices.
o If a device disappears and reappears without being touched, its @bdev->bd_dev may change, so we have to re-insert it into the rb_root.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
changelog
	v1->v2: add handling for the device disappears and reappears event

*Note*
Actually this handles the case when a device disappears and reappears without being touched. We are going to recycle all dead btrfs_device in another patch. Two events lead to dead devices:
	1) device disappears and never returns again
	2) device disappears and returns with a new fs on it
A shrinker shall kill the dead ones.
---
 fs/btrfs/super.c   |   1 +
 fs/btrfs/volumes.c | 281 ++---
 fs/btrfs/volumes.h |   6 ++
 3 files changed, 230 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 54bd91e..ee09a56 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2154,6 +2154,7 @@ static void __exit exit_btrfs_fs(void)
 	btrfs_end_io_wq_exit();
 	unregister_filesystem(&btrfs_fs_type);
 	btrfs_exit_sysfs();
+	btrfs_cleanup_valid_dev_root();
 	btrfs_cleanup_fs_uuids();
 	btrfs_exit_compress();
 	btrfs_hash_exit();
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0192051..7093cce 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
 #include <linux/kthread.h>
 #include <linux/raid/pq.h>
 #include <linux/semaphore.h>
+#include <linux/rbtree.h>
 #include <asm/div64.h>
 #include "ctree.h"
 #include "extent_map.h"
@@ -52,6 +53,126 @@ static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
 DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
 
+static struct rb_root valid_dev_root = RB_ROOT;
+
+static struct btrfs_device *insert_valid_device(struct btrfs_device *new_dev)
+{
+	struct rb_node **p;
+	struct rb_node *parent;
+	struct rb_node *new;
+	struct btrfs_device *old_dev;
+
+	WARN_ON(!mutex_is_locked(&uuid_mutex));
+
+	parent = NULL;
+	new = &new_dev->rb_node;
+
+	p = &valid_dev_root.rb_node;
+	while (*p) {
+		parent = *p;
+		old_dev = rb_entry(parent, struct btrfs_device, rb_node);
+
+		if (new_dev->devnum < old_dev->devnum)
+			p = &parent->rb_left;
+		else if (new_dev->devnum > old_dev->devnum)
+			p = &parent->rb_right;
+		else {
+			rb_replace_node(parent, new, &valid_dev_root);
+			RB_CLEAR_NODE(parent);
+
+			goto out;
+		}
+	}
+
+	old_dev = NULL;
+	rb_link_node(new, parent, p);
+	rb_insert_color(new, &valid_dev_root);
+
+out:
+	return old_dev;
+}
+
+static void free_fs_devices(struct btrfs_fs_devices *fs_devices)
+{
+	struct btrfs_device *device;
+	WARN_ON(fs_devices->opened);
+	while (!list_empty(&fs_devices->devices)) {
+		device = list_entry(fs_devices->devices.next,
+				    struct btrfs_device, dev_list);
+		list_del(&device->dev_list);
+		rcu_string_free(device->name);
+		kfree(device);
+	}
+	kfree(fs_devices);
+}
+
+static void remove_empty_fs_if_need(struct btrfs_fs_devices *old_fs)
+{
+	struct btrfs_fs_devices *seed_fs;
+
+	if (!list_empty(&old_fs->devices))
+		return;
+
+	list_del(&old_fs->list);
+
+	/* free the seed clones */
+	seed_fs = old_fs->seed;
+	free_fs_devices(old_fs);
+	while (seed_fs) {
+		old_fs = seed_fs;
+
Re: root subvol id is 0 or 5?
Hugo Mills posted on Sun, 30 Nov 2014 13:53:28 + as excerpted:

> On Sun, Nov 30, 2014 at 07:08:51PM +0530, Shriramana Sharma wrote:
>> On Sun, Nov 30, 2014 at 5:29 PM, Hugo Mills <h...@carfax.org.uk> wrote:
>>> In the data structures on disk, it's 5. The kernel aliases 0 to mean subvolid 5.
>>
>> So why 5 and not just 0 which seems a logical choice? On top of this, one needs to alias 0 to 5!
>
> All of the trees used in the FS metadata have an ID number. The well-known trees have small, fixed IDs:

Thanks, Hugo.

You might wish to find a place in the wiki (probably in the FAQ) for that, since your explanation was both the clearest I can imagine and cleared up some lingering "but why?" questions along that line for me, as well. And if an answer to that basic a btrfs question is still clearing stuff up for me, I expect it could be useful to well over 90% of potential btrfs wiki FAQ readers...

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
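Hugo's list of well-known tree IDs is trimmed from the quote above. As a minimal sketch of the aliasing he describes, the following assumes the object ID constants as they appear in the btrfs on-disk format headers; `normalize_subvolid` is a hypothetical helper name used only for illustration, not a kernel function.

```c
#include <stdint.h>

/* Well-known tree object IDs, as defined by the btrfs on-disk format. */
#define BTRFS_ROOT_TREE_OBJECTID   1ULL
#define BTRFS_EXTENT_TREE_OBJECTID 2ULL
#define BTRFS_CHUNK_TREE_OBJECTID  3ULL
#define BTRFS_DEV_TREE_OBJECTID    4ULL
#define BTRFS_FS_TREE_OBJECTID     5ULL

/*
 * Hypothetical helper (not a kernel function): map a user-supplied
 * subvolid to the ID stored on disk. 0 is only an alias for the
 * top-level subvolume, whose tree ID is 5.
 */
static uint64_t normalize_subvolid(uint64_t subvolid)
{
	return subvolid ? subvolid : BTRFS_FS_TREE_OBJECTID;
}
```

Seen this way, 5 is simply the next free slot after the four internal trees, and 0 never appears on disk; it is only a convenient way of saying "the top level" at the user interface.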
Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
On 11/30/2014 05:58 PM, Qu Wenruo wrote:
> (why not use SQL to... suggestion)

SQL, as in Structured Query Language, is _terrible_ for recursion. It expresses all of its elements in terms of set theory and really can only implement union and intersection of flat sets. Several companies offer extensions to SQL in their implementations to help with this lack of recursion, such as "prior" (used with CONNECT BY) in Oracle's SQL dialect, but they are all stateful beyond reason.

Several companies, including Microsoft, have proposed and partially implemented a "relational database as a file system" paradigm and then crashed into the fact that dealing with the parent of the parent of something is different than dealing with the parent of the parent of the parent of something.

There is a humorous-but-true saying: "If you have a problem, and you decide to solve it with (regex or xml or uml or sql etc) you now have two problems."

Writing the SQL to walk the tree is harder than allocating the memory as a vector, filling it with the data, and then walking the pointers.

Your suggestion is the first step on the road to The Inner Platform Effect™. You have a specialized database (parent, inode, name) and now you want to put a generic database engine over the specialized database so that you can re-implement the specialized database with generic primitives.

http://en.wikipedia.org/wiki/Inner-platform_effect

Things need to be only as generic as they need to be, and no more generic than that.

Replacing a pointer to a record with a pointer to a cursor's result table that will give you the name of the next result to query is not a win. Even as you spell it out you can see that it is _not_ a reduction in memory or processing.

And the easy SQL lines stop being that easy when "name" stops being unique.

(I've been down this road before. Not with file systems but with managed objects in a network management system. Nodes, Parent nodes, etc. Just referring to distributed things like network switches instead of file system inodes. ... It doesn't end well. 8-) )
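To make the "walk the pointers" alternative concrete, here is a minimal sketch under stated assumptions: `struct node`, `depth` and `ancestor` are hypothetical names invented for this illustration and are not taken from btrfs-progs. It shows that reaching the parent of the parent of the parent is the same loop as reaching the parent, just run more times.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory record: one entry per inode. */
struct node {
	uint64_t ino;
	const char *name;
	struct node *parent;	/* NULL for the root */
};

/* Number of ancestors between n and the root. */
static size_t depth(const struct node *n)
{
	size_t d = 0;

	while (n->parent) {
		n = n->parent;
		d++;
	}
	return d;
}

/* Return the k-th ancestor of n (k = 1 is the parent), or NULL if none. */
static const struct node *ancestor(const struct node *n, size_t k)
{
	while (n && k--)
		n = n->parent;
	return n;
}
```

With flat SQL sets, each extra level of ancestry needs another self-join or a vendor-specific recursive construct, which is the asymmetry the message is pointing at.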
Re: [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
On Wed, Nov 26, 2014 at 03:30:39PM +, Filipe Manana wrote:
> Stress btrfs' block group allocation and deallocation while running fstrim in parallel. Part of the goal is also to get data block groups deallocated so that new metadata block groups, using the same physical device space ranges, get allocated while fstrim is running. This caused several issues ranging from invalid memory accesses, kernel crashes, metadata or data corruption, free space cache inconsistencies and free space leaks.
>
> Signed-off-by: Filipe Manana <fdman...@suse.com>

There's nothing btrfs specific about this test. Please make it generic.

> +
> +# real QA test starts here
> +_need_to_be_root
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch_nocheck
> +_require_fstrim
> +
> +rm -f $seqres.full

# needs 40GB of space in the filesystem
_scratch_mkfs
_require_fs_space $SCRATCH_MNT $((40 * 1024 * 1024))

However, does it really need 40GB? It needs 2GB for the large alloc, and then 400,000 * 4k is only 1.6GB. So this would fit in a 10GB filesystem without a problem, right? And if it's a generic test, keeping it under 10GB would mean it runs on the majority of filesystem developers' test VMs, small or large.

> +# Create a bunch of small files that get their single extent inlined in the
> +# btree, so that we consume a lot of metadata space and get a chance of a
> +# data block group getting deleted and reused for metadata later. Sometimes
> +# the creation of all these files succeeds other times we get ENOSPC failures
> +# at some point - this depends on how fast the btrfs' cleaner kthread is
> +# notified about empty block groups, how fast it deletes them and how fast
> +# the fallocate calls happen. So we don't really care if they all succeed or
> +# not, the goal is just to keep metadata space usage growing while data block
> +# groups are deleted.
> +create_files()
> +{
> +	local prefix=$1
> +
> +	for ((i = 1; i <= 40; i++)); do
> +		echo "Creating file ${prefix}_$i" >> $seqres.full 2>&1
> +		$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 3900" \
> +			$SCRATCH_MNT/${prefix}_$i >> $seqres.full 2>&1

You don't need to echo 400,000 file creates to $seqres.full. This is one of those times that directing output to /dev/null makes sense, especially as:

> +		ret=$?
> +		if [ $ret -ne 0 ]; then
> +			break
> +		fi

you can do this:

		if [ $? -ne 0 ]; then
			echo "failed creating file $prefix.$i" >> $seqres.full
			break
		fi

> +	done
> +
> +}
> +
> +fsz=`expr 40 \* 1024 \* 1024 \* 1024`
> +_scratch_mkfs_sized $fsz >> $seqres.full 2>&1 || \
> +	_fail "size=$fsz mkfs failed"
> +_scratch_mount
> +
> +for ((i = 0; i < 4; i++)); do
> +	trim_loop &
> +	trim_pids[$i]=$!
> +done
> +
> +fallocate_loop falloc_file &
> +fallocate_pid=$!
> +
> +create_files foobar
> +
> +kill $fallocate_pid
> +kill ${trim_pids[@]}
> +wait
> +
> +# Sleep a bit, otherwise umount fails often with EBUSY (TODO: investigate why).
> +sleep 3
> +
> +# Check for fs consistency. The trimming was racy and caused some btree nodes
> +# to get full of zeroes on disk, which obviously caused fs metadata corruption.
> +# The race often lead to missing free space entries in a block group's free
> +# space cache too.
> +_check_scratch_fs

Ummm, if you just use _require_scratch, you don't need to do this. The test harness will check it for you.

> index e79b848..6608005 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -84,3 +84,4 @@
>  079 auto
>  080 auto
>  081 auto quick
> +082 auto

I'd suggest that for a generic test we'd want to add the stress group to this, and allow the test to be scaled in terms of filesystem size and the number of concurrent trim and fallocate loops by $LOAD_FACTOR.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
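Dave's sizing argument can be checked with a few lines of arithmetic. The sketch below is only an illustration: `space_needed` is a hypothetical helper, "2GB" is taken as 2 GiB and "4k" as 4096 bytes, and metadata overhead is ignored.

```c
#include <stdint.h>

/* Bytes consumed by one large allocation plus nfiles small files. */
static uint64_t space_needed(uint64_t large_alloc, uint64_t nfiles,
			     uint64_t per_file)
{
	return large_alloc + nfiles * per_file;
}
```

With a 2 GiB allocation and 400,000 files of 4096 bytes each, this comes to 2,147,483,648 + 1,638,400,000 = 3,785,883,648 bytes, comfortably below a 10 GiB filesystem, which supports the suggestion to shrink the requirement.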
Re: Considerations in snapshotting and send/receive of nocow files?
Shriramana Sharma posted on Sun, 30 Nov 2014 19:17:42 +0530 as excerpted:

> Given that snapshotting effectively reduces the usefulness of nocow, I suppose the preferable model to snapshotting and send/receiving such files would be different than other files.
>
> Should nocow files (for me only VBox images) preferably be:
>
> 1) under a separate subvolume?
> 2) said subvol snapshotted less often?
> 3) sent/received any differently?

If you look back in the list history at the nocow threads, you'll see a lot of my answers to exactly this sort of question.

In general I'd say yes to 1 and 2, separate subvolume, in part to allow snapshotting it less often. For 3, I don't deal directly with send/receive for my own use case and it's complex enough that I've not become as familiar with it as I have the general fragmentation issue, but because send does require creating a read-only snapshot, I'd characterize #3 as depending on #2, and would thus suggest treating it differently to the extent that you keep send and therefore snapshotting to the low side of your reasonable range.

Here's the reasoning in a more detailed step-by-step fashion. (I'll use lettered points here to avoid confusing them with your numbered points above, which I may wish to reference below as well.)

A) The basic issue in principle: As you've apparently found from your research, snapshotting and nocow can be used together but disrupt absolute nocow, because a snapshot locks in place the existing version of the file, forcing a COW on the initial change written to a (4 KiB) file block after a snapshot covering the same file. The file does remain nocow, however, and further changes written to the same file block will be nocow -- until the next snapshot forces another lock-in-place, of course.

B) The biggest immediate practical problem leading from A is that of high-frequency automated snapshotting -- some people are going wild and snapshotting as often as once a minute... at least until they see some of the issues that can cause (like snapshots happening nearly instantly but snapshot deletion often taking longer than a minute, and the current scaling issues involved once there's several hundreds or thousands of snapshots to deal with). On a busy VM triggering change-writes with a similar 1 minute or lower frequency, the snapshotting very quickly eliminates much of the anti-fragmentation benefit of nocow in the first place.

C) On a more general level once again, it should be easily apparent that the more change-writes you can squeeze between snapshots, the more effective the nocow is going to be, because a higher percentage of them will still be nocow.

D) That leads pretty directly to your points 1 and 2, put the nocow files on their own subvolume so snapshotting the parent doesn't affect them, and then snapshot the nocow subvolume at a lower frequency, as low a frequency as can reasonably fit within your use-case target range.

For example, for a normally daily snapshot scenario you might snapshot the parent daily and the nocow subvolume every other day or twice a week. For a normal 4X-daily snapshot scenario (every six hours on a 24-hour schedule or every two hours on an 8-hour-shift schedule), you might snapshot the nocow subvolume only once or twice a day.

Tho of course if the primary goal is the snapshotting of the nocow files (the VMs in your case), then you may still be snapshotting it at a higher frequency than the parent, which you may not in fact be snapshotting at all. The point remains, snapshot the nocow subvolume at as low a frequency as can reasonably fit your use-case/goals.

E) Regarding your point #3, since send must be done from a read-only snapshot, obviously you'll need to snapshot at a frequency that at minimum equals that of your sends. However, if your VMs are low activity enough that there's a reasonable chance they won't have written any changes during the send, and the send is the primary reason for the snapshot in the first place, you may avoid /some/ of the issue by deleting most snapshots as soon after the send as possible.

It would work like this. You'd do your initial full send, creating an initial reference on both sides, with that snapshot retained on both sides /as/ that initial reference. At your primary sending frequency, say once a day, you'd do the send against the original parent and delete the sending snapshot as soon as the send completed, thus making each daily incremental against the original. At a lower frequency, perhaps once a week or once a month, you'd retain the sending snapshot but use the mitigation measures discussed in F below, and could then delete older initially-retained-weeklies and the original full-reference, perhaps keeping say two quarterly snapshots on the send side. Then if you needed to reverse the send/receive, you'd still have the last weekly as a reference on both sides and could replay the last daily
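Points A through C can be illustrated with a toy model. This is not btrfs code; `count_cow_writes` and `MAX_BLOCKS` are hypothetical names, and the model only assumes what point A states: the first write to a block after a snapshot is COWed, and later writes to that block go in place until the next snapshot.

```c
#include <stddef.h>
#include <string.h>

#define MAX_BLOCKS 64	/* every block index passed in must be below this */

/*
 * Toy model (not btrfs code): replay a sequence of block writes to a
 * nocow file, taking a snapshot before every snap_every-th write
 * (snap_every == 0 means never snapshot). Returns how many of the
 * writes had to be COWed.
 */
static size_t count_cow_writes(const unsigned *blocks, size_t nwrites,
			       size_t snap_every)
{
	unsigned char shared[MAX_BLOCKS];
	size_t cow = 0;

	memset(shared, 0, sizeof(shared));
	for (size_t i = 0; i < nwrites; i++) {
		if (snap_every && i % snap_every == 0)
			memset(shared, 1, sizeof(shared)); /* snapshot pins every block */
		if (shared[blocks[i]]) {
			cow++;			/* first write after a snapshot */
			shared[blocks[i]] = 0;	/* later writes go in place */
		}
	}
	return cow;
}
```

For eight rewrites of the same block, snapshotting before every write COWs all eight, snapshotting before every fourth write COWs only two, and never snapshotting COWs none, which is the "squeeze more change-writes between snapshots" argument in miniature.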
Re: pro/cons of raid1 with mdadm/lvm2
On Sun, 30 Nov 2014 12:11:47 +0100 Gour <g...@atmarama.net> wrote:

> However, I wonder if there are some 'cons' in having raid-1 partition under mdadm and not using native mirroring capabilities of btrfs fs?

Pros:

* mdadm RAID has much better read balancing; Btrfs reads are satisfied from what's in effect a random drive (PID-based balancing of threads to drives), mdadm reads from the less-loaded drive. Also mdadm has a way to specify some RAID1 array members as to be never used for reads if at all possible ("write-mostly"), which helps in RAID1 of HDD and SSD.

* mdadm RAID has much better write submission; In my experience [1] Btrfs RAID1 on heavy write operations first writes to one drive, then to another. The whole process takes up to 2x longer than with a single drive. On the other hand mdadm writes to both drives simultaneously.

[1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg34103.html

Con:

* You only get the ability to recover from a checksum failure with Btrfs RAID1, not with mdadm RAID1 (see Russell's reply).

-- 
With respect,
Roman
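The difference between the two read-balancing policies can be sketched as follows. Both functions are hypothetical illustrations, not code from Btrfs or md: the first mirrors the PID-based choice Roman describes, the second is a deliberately simplified "least loaded, avoid write-mostly" policy (the real md heuristic is more involved).

```c
#include <stddef.h>

/* PID-based choice: the same process always reads from the same mirror. */
static size_t pick_mirror_by_pid(unsigned long pid, size_t nmirrors)
{
	return pid % nmirrors;
}

/*
 * Load-based choice: pick the mirror with the fewest requests in flight,
 * skipping write-mostly members unless every member is write-mostly.
 */
static size_t pick_mirror_by_load(const unsigned *inflight,
				  const int *write_mostly, size_t nmirrors)
{
	size_t best = nmirrors;	/* sentinel: nothing chosen yet */

	for (size_t i = 0; i < nmirrors; i++) {
		if (write_mostly[i])
			continue;
		if (best == nmirrors || inflight[i] < inflight[best])
			best = i;
	}
	if (best != nmirrors)
		return best;

	/* All members are write-mostly: fall back to the least loaded one. */
	best = 0;
	for (size_t i = 1; i < nmirrors; i++)
		if (inflight[i] < inflight[best])
			best = i;
	return best;
}
```

A single busy reader illustrates the gap: under the PID policy all of its reads land on one drive regardless of queue depth, while the load-based policy can shift reads toward an idle SSD and keep a slow HDD marked write-mostly out of the read path.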
Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
Qu Wenruo posted on Mon, 01 Dec 2014 11:24:50 +0800 as excerpted:

> The db file is mostly used in memory, only when the metadata is really really big, maybe when the fs tree's level is 7 or 8 we may need to use db file.

So fscking the database in order to fsck the database isn't an issue. One objection down! =:^)

But seriously, the politics of the idea remains its biggest nemesis in my opinion. And in systemd we've unfortunately a live demonstration of just how big a nemesis that can be. =:^(

If the technical reasoning for it is sound and the benefit high enough, great, but IMO the benefit will need to be pretty high to justify the risk of political fallout, and I doubt it's anything close to that high.

But it's not my call, so we'll see. Things could certainly get interesting if it's judged to be worth it. *Checking popcorn stash*

-- 
Duncan - List replies preferred. No HTML msgs.
Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Robert White <rwh...@pobox.com>
To: Qu Wenruo <quwen...@cn.fujitsu.com>, linux-btrfs <linux-btrfs@vger.kernel.org>
Date: 2014-12-01 12:03

> On 11/30/2014 05:58 PM, Qu Wenruo wrote:
>> (why not use SQL to... suggestion)
>
> SQL, as in Structured Query Language, is _terrible_ for recursion. It expresses all of its elements in terms of set theory and really can only implement union and intersection of flat sets. Several companies offer extensions to SQL in their implementations to help with this lack of recursion, such as "prior" (used with CONNECT BY) in Oracle's SQL dialect, but they are all stateful beyond reason.
>
> Several companies, including Microsoft, have proposed and partially implemented a "relational database as a file system" paradigm and then crashed into the fact that dealing with the parent of the parent of something is different than dealing with the parent of the parent of the parent of something.
>
> There is a humorous-but-true saying: "If you have a problem, and you decide to solve it with (regex or xml or uml or sql etc) you now have two problems."

Wait, regex, uml and xml are OK, but I've never heard that sql is one of them...

> Writing the SQL to walk the tree is harder than allocating the memory as a vector, filling it with the data, and then walking the pointers.

In fact, such INODE and INODE_REF tables are not (completely or even mainly) used to walk the tree; they are mainly used to search for:

1. Is there any inode_ref that refers to a given ino as parent?
   This is not even a problem when the fs is *OK*, since a simple btrfs_search_slot() with key (objectid = ino, type = BTRFS_DIR_INDEX/ITEM_KEY, offset = 0) will do it. However, when a leaf is corrupted, the whole INODE_ITEM and its DIR_INDEX/ITEM are gone with the leaf, so the old way of searching is unusable and btrfs-progs has to rely on some other mechanism to determine that. And unfortunately, there is no such mechanism.

2. Is there any dir_index/dir_item that refers to a given ino as child?
   The current inode_record works fine for this purpose.

So when the crazy idea disappears and sane ideas come back, it will probably be rb-tree based (parent, ino, name, namelen) entries recording the parent-child relation (currently it is a list_head that only records backrefs inside the inode_record), plus another rb-tree of (ino) entries (same as the current inode_record structure).

> Your suggestion is the first step on the road to The Inner Platform Effect™. You have a specialized database (parent, inode, name) and now you want to put a generic database engine over the specialized database so that you can re-implement the specialized database with generic primitives.
>
> http://en.wikipedia.org/wiki/Inner-platform_effect
>
> Things need to be only as generic as they need to be, and no more generic than that.
>
> Replacing a pointer to a record with a pointer to a cursor's result table that will give you the name of the next result to query is not a win. Even as you spell it out you can see that it is _not_ a reduction in memory or processing.
>
> And the easy SQL lines stop being that easy when "name" stops being unique.

Name is still unique when the parent ino is given, so the INODE_REF table's primary key is not name but the (parent, ino, name) combination.

But the inner platform effect still seems valid for my crazy idea.

Anyway, the crazy idea came to me when I saw the RDB-like features in the inode_record structure, -and I just wanted to save some time coding the new (parent, ino, name, namelen) rb-tree-.

> (I've been down this road before. Not with file systems but with managed objects in a network management system. Nodes, Parent nodes, etc. Just referring to distributed things like network switches instead of file system inodes. ... It doesn't end well. 8-) )

The RDB idea must have come to you the same way it came to me, from wanting to write less code, right? So it seems the end may be the same. :-(

Thanks,
Qu
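The (parent, ino, name) key Qu describes can be sketched without any database. The code below is a minimal illustration under stated assumptions: `struct ref_entry`, `ref_cmp` and `has_child` are hypothetical names, names are NUL-terminated strings rather than (name, namelen) pairs, and a sorted array with binary search stands in for the rb-tree.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry: one parent-child name link. */
struct ref_entry {
	uint64_t parent;
	uint64_t ino;
	const char *name;
};

/* Order by (parent, ino, name), so all children of a parent are adjacent. */
static int ref_cmp(const struct ref_entry *a, const struct ref_entry *b)
{
	if (a->parent != b->parent)
		return a->parent < b->parent ? -1 : 1;
	if (a->ino != b->ino)
		return a->ino < b->ino ? -1 : 1;
	return strcmp(a->name, b->name);
}

/*
 * Query 1 from the message: does any entry name `parent` as its parent?
 * `refs` must be sorted with ref_cmp. This is a lower-bound binary
 * search for the first entry whose parent is >= the one asked for.
 */
static int has_child(const struct ref_entry *refs, size_t n, uint64_t parent)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (refs[mid].parent < parent)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo < n && refs[lo].parent == parent;
}
```

Because the ordering puts the parent first, the same structure answers "who are the children of X" with one range scan, which is the lookup that is missing today when the leaf holding the directory items is gone.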
Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Duncan <1i5t5.dun...@cox.net>
To: linux-btrfs@vger.kernel.org
Date: 2014-12-01 13:47

> Qu Wenruo posted on Mon, 01 Dec 2014 11:24:50 +0800 as excerpted:
>
>> The db file is mostly used in memory, only when the metadata is really really big, maybe when the fs tree's level is 7 or 8 we may need to use db file.
>
> So fscking the database in order to fsck the database isn't an issue. One objection down! =:^)
>
> But seriously, the politics of the idea remains its biggest nemesis in my opinion. And in systemd we've unfortunately a live demonstration of just how big a nemesis that can be. =:^(
>
> If the technical reasoning for it is sound and the benefit high enough, great, but IMO the benefit will need to be pretty high to justify the risk of political fallout, and I doubt it's anything close to that high.
>
> But it's not my call, so we'll see. Things could certainly get interesting if it's judged to be worth it. *Checking popcorn stash*

No (systemd civil) war, make love!

I seldom considered the political problems of the insane idea, even though the systemd warfare is still going on. (Emmm, maybe that is because Arch accepted systemd a long time ago without too much trouble?)

Anyway, it is just an insane idea, and any feedback that kills the seed of insanity as soon as possible is welcome.

Thanks,
Qu