Re: [PATCH 2/4 v3] fiemap: add EXTENT_DATA_COMPRESSED flag
David, any progress on this patch series? I never saw an updated version after the last round of reviews, but it would be great to move it forward. I have filefrag patches in my e2fsprogs tree waiting for an updated version of your patch. I recall the main changes were:

- add a FIEMAP_EXTENT_PHYS_LENGTH flag to indicate that fe_phys_length is valid
- rename fe_length to fe_logi_length and #define fe_length fe_logi_length
- always fill in fe_phys_length (= fe_logi_length for uncompressed files) and set FIEMAP_EXTENT_PHYS_LENGTH whether or not the extent is compressed
- add a WARN_ONCE() in fiemap_fill_next_extent() as described below

I don't know if there was any clear statement about whether there should be separate FIEMAP_EXTENT_PHYS_LENGTH and FIEMAP_EXTENT_DATA_COMPRESSED flags, or if the latter should be implicit; it probably makes sense to have separate flags. It should be fine to use:

#define FIEMAP_EXTENT_PHYS_LENGTH	0x0010

since this flag was never used.

Cheers, Andreas

On Dec 12, 2013, at 5:02 PM, Andreas Dilger adil...@dilger.ca wrote:

On Dec 12, 2013, at 4:24 PM, Dave Chinner da...@fromorbit.com wrote:

On Thu, Dec 12, 2013 at 04:25:59PM +0100, David Sterba wrote:

This flag was not accepted when fiemap was proposed [2] due to a lack of in-kernel users. Btrfs has had compression for a long time, and we'd like to see that an extent is compressed in the output of the 'filefrag' utility once it's taught about it. For that purpose, a reserved field from fiemap_extent is used to let the filesystem store the physical extent length when the flag is set. This keeps compatibility with applications that use FIEMAP.

I'd prefer to just see the new physical length field always filled out, regardless of whether it is a compressed extent or not. In terms of backwards compatibility to userspace, it makes no difference, because the value of reserved/unused fields is undefined by the API.
Yes, the implementation zeros them, but there's nothing in the documentation that says reserved fields must be zero. Hence I think we should just set it for every extent.

I'd actually thought the same thing while reading the patch, but I figured people would object because it implies that old kernels will return a physical length of 0 bytes (which might be valid) and badly-written tools will not work correctly on older kernels. That said, applications _should_ be checking the FIEMAP_EXTENT_DATA_COMPRESSED flag, and I suspect fewer developers will be confused in the future if fe_phys_length == fe_length going forward. If the initial tools get it right (in particular filefrag), then hopefully others will get it correct also.

From the point of view of the kernel API (fiemap_fill_next_extent), passing the physical extent size in the len parameter for normal extents, then passing 0 for the physical length, makes absolutely no sense. IOWs, what you have created is a distinction between the extent's logical length and its physical length. For uncompressed extents they are both equal, and they should both be passed to fiemap_fill_next_extent as the same value. Only for encoded extents (e.g. compressed ones) can the two differ. Perhaps fiemap_fill_next_extent() should check and warn about mismatches when they differ and the relevant flags are not set...

Seems reasonable to have a WARN_ONCE() in that case.
That would catch bugs in the filesystem code as well:

	WARN_ONCE(phys_len != logical_len &&
		  !(flags & FIEMAP_EXTENT_DATA_COMPRESSED),
		  "physical len %llu != logical length %llu without DATA_COMPRESSED\n",
		  phys_len, logical_len);

diff --git a/include/uapi/linux/fiemap.h b/include/uapi/linux/fiemap.h
index 93abfcd..0e32cae 100644
--- a/include/uapi/linux/fiemap.h
+++ b/include/uapi/linux/fiemap.h
@@ -19,7 +19,9 @@ struct fiemap_extent {
 	__u64 fe_physical; /* physical offset in bytes for the start
 			    * of the extent from the beginning of the disk */
 	__u64 fe_length;   /* length in bytes for this extent */
-	__u64 fe_reserved64[2];
+	__u64 fe_phys_length; /* physical length in bytes, undefined if
+			       * DATA_COMPRESSED not set */
+	__u64 fe_reserved64;
 	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
 	__u32 fe_reserved[3];
 };

The comment for fe_length needs to change, too, because it needs to indicate that it is the logical extent length and that it may be different from fe_phys_length depending on the flags that are set on the extent.

Would it make sense to rename fe_length to fe_logi_length (or something, I'm open to suggestions), and have a compat macro:

#define fe_length fe_logi_length

around for older applications? That way, new developers would start to use the new name, old applications would still compile for both newer and older interfaces,
[PATCH] xfstests/btrfs: _devmgt_add() to check if the device is back online
btrfs/003 removes a device as part of the test case, and after the test completes the removed device is added back to the system. However, on certain (slower) systems the device comes back a bit later, so a later sub-test within btrfs/003 fails. This patch adds a script to wait and test whether the device is back online, and reports the result to the full log.

Signed-off-by: Anand Jain anand.j...@oracle.com
---
 common/rc | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/common/rc b/common/rc
index 2c83340..4a6511f 100644
--- a/common/rc
+++ b/common/rc
@@ -2054,6 +2054,31 @@ _devmgt_add()
 	tdl=`echo ${1} | cut -d: -f 2- | sed 's/:/ /g'`
 	echo ${tdl} > /sys/class/scsi_host/host${h}/scan || _fail "Add disk failed"
+
+	# ensure the device comes online
+	dev_back_online=0
+	for i in `seq 1 10`; do
+		if [ -d /sys/class/scsi_device/${1}/device/block ]; then
+			dev=`ls /sys/class/scsi_device/${1}/device/block`
+			for j in `seq 1 10`;
+			do
+				stat /dev/$dev > /dev/null 2>&1
+				if [ $? -eq 0 ]; then
+					dev_back_online=1
+					break
+				fi
+				sleep 1
+			done
+			break
+		else
+			sleep 1
+		fi
+	done
+	if [ $dev_back_online -eq 0 ]; then
+		echo "/dev/$dev online failed" >> $seqres.full
+	else
+		echo "/dev/$dev is back online" >> $seqres.full
+	fi
 }

 _require_fstrim()
--
2.0.0.153.g79d
-- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is it safe to mount subvolumes of already-mounted volumes (even with different options)?
Hello, I need to clarify: I'm _not_ sharing a drive between multiple computers at the _same_ time. It's a portable device which I use at different locations with different computers. I just wanted to give a rationale for mounting the whole drive to some mountpoint and then also part of that drive (a subvolume) to the respective computer's /home mountpoint. So it's controlled by the same kernel in the same computer; it's just that part of the filesystem is mounted at multiple mountpoints, much like a bind-mount, but I'm interested in mounting a subvolume of the already-mounted volume to some other mountpoint. Sorry for the confusion.

Best regards
Sebastian

On 17.07.2014 01:18, Chris Murphy wrote:

On Jul 16, 2014, at 4:18 PM, Sebastian Ochmann ochm...@informatik.uni-bonn.de wrote:

Hello, I'm sharing a btrfs-formatted drive between multiple computers and each of the machines has a separate home directory on that drive.

2+ computers writing to the same block device? I don't see how this is safe. Seems possibly a bug that the 1st mount event isn't setting some metadata so that another kernel instance knows not to allow another mount.

Chris Murphy
btrfs fi df shows unknown ?
Hi there,

For a few days now, I have noticed that "btrfs fi df /" displays an entry about "unknown" used space, and I can see this on several Fedora machines, so it is not an issue related to a single system... Does anybody know what these "unknown" data are?

i.e.:

# btrfs fi df /
Data, single: total=106.00GiB, used=88.28GiB
System, DUP: total=32.00MiB, used=24.00KiB
Metadata, DUP: total=1.00GiB, used=520.36MiB
unknown, single: total=176.00MiB, used=0.00

# btrfs --version
Btrfs v3.14.2
# uname -r
3.15.5-200.fc20.x86_64

TIA, kind regards.
--
Swâmi Petaramesh sw...@petaramesh.org http://petaramesh.org PGP 9076E32E
Re: [RFC v2 0/2] vfs / btrfs: add support for ustat()
On Wed, Jul 16, 2014 at 02:37:56PM -0700, Luis R. Rodriguez wrote:

From: Luis R. Rodriguez mcg...@suse.com

This makes the implementation simpler by stuffing the struct on the driver and just letting the driver insert it and remove it from the sb list. This avoids the kzalloc() completely.

Again, NAK. Make btrfs report the proper anon dev_t in stat and everything will just work.
[PATCH v3] Btrfs: fix abnormal long waiting in fsync
xfstests generic/127 detected this problem. With commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8, fsync now only flushes data within the passed range. This is the cause of the above problem: btrfs's fsync has a stage called 'sync log' which waits for all the ordered extents it has recorded to finish.

In xfstests/generic/127, with mixed operations such as truncate, fallocate, punch hole, and mapwrite, we get some pre-allocated extents, and mapwrite will mmap and then msync. I found that msync can wait for quite a long time (about 20s in my case). Thanks to ftrace, it turns out that the previous fallocate calls btrfs_wait_ordered_range() to flush dirty pages, but as the range of dirty pages may be larger than what btrfs_wait_ordered_range() wants, there can be ordered extents that were created without getting their corresponding pages flushed. They are then left in memory until we fsync, which runs into the 'sync log' stage, and fsync will just wait for the system writeback thread to flush those pages and finish the ordered extents, so the latency is inevitable.

This adds a flush similar to btrfs_start_ordered_extent() in btrfs_wait_logged_extents() to fix that.

Reviewed-by: Miao Xie mi...@cn.fujitsu.com
Signed-off-by: Liu Bo bo.li@oracle.com
---
v3: Add a check for the IO_DONE flag to avoid an unnecessary flush.
v2: Move the flush into btrfs_wait_logged_extents() to make the flush range more precise.
 fs/btrfs/ordered-data.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index e12441c..7187b14 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -484,8 +484,19 @@ void btrfs_wait_logged_extents(struct btrfs_root *log, u64 transid)
 				   log_list);
 		list_del_init(&ordered->log_list);
 		spin_unlock_irq(&log->log_extents_lock[index]);
+
+		if (!test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags) &&
+		    !test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) {
+			struct inode *inode = ordered->inode;
+			u64 start = ordered->file_offset;
+			u64 end = ordered->file_offset + ordered->len - 1;
+
+			WARN_ON(!inode);
+			filemap_fdatawrite_range(inode->i_mapping, start, end);
+		}
 		wait_event(ordered->wait, test_bit(BTRFS_ORDERED_IO_DONE,
 						   &ordered->flags));
+		btrfs_put_ordered_extent(ordered);
 		spin_lock_irq(&log->log_extents_lock[index]);
 	}
--
1.8.1.4
-- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is it safe to mount subvolumes of already-mounted volumes (even with different options)?
On Thu, Jul 17, 2014 at 12:18:37AM +0200, Sebastian Ochmann wrote:

I'm sharing a btrfs-formatted drive between multiple computers and each of the machines has a separate home directory on that drive. The root of the drive is mounted at /mnt/tray and the home directory for machine {hostname} is under /mnt/tray/Homes/{hostname}. Up until now, I have mounted /mnt/tray like a normal volume and then did an additional bind-mount of /mnt/tray/Homes/{hostname} to /home.

You've said you're not sharing it concurrently, which is good -- as long as you've only got one machine accessing it at the same time, you're fine there.

Now I have a new drive and wanted to do things a bit more advanced by creating subvolumes for each of the machines' home directories so that I can also do independent snapshotting. I guess I could use the bind-mount method like before, but my question is whether it is considered safe to do an additional, regular mount of one of the subvolumes to /home instead, like:

mount /dev/sdxN /mnt/tray
mount -o subvol=/Homes/{hostname} /dev/sdxN /home

When I experimented with such additional mounts of subvolumes of already-mounted volumes, I noticed that the mount options of the additional subvolume mount might differ from the original mount. For instance, the root volume might be mounted with noatime while the subvolume mount may have relatime.

So my questions are: Is mounting a subvolume of an already mounted volume considered safe

Yes, absolutely:

hrm@amelia:~$ mount | grep btrfs
/dev/sda2 on /boot type btrfs (rw,noatime,space_cache)
/dev/sda2 on /home type btrfs (rw,noatime,space_cache)
/dev/sda2 on /media/video type btrfs (rw,noatime,space_cache)
/dev/sda2 on /media/pipeline type btrfs (rw,noatime,space_cache)
/dev/sda2 on /media/snarf type btrfs (rw,noatime,space_cache)
/dev/sda2 on /media/audio type btrfs (rw,noatime,space_cache)
/dev/sda2 on /srv/nfs/home type btrfs (rw,noatime,space_cache)
/dev/sda2 on /srv/nfs/video type btrfs (rw,noatime,space_cache)
/dev/sda2 on /srv/nfs/testing type btrfs (rw,noatime,space_cache)
/dev/sda2 on /srv/nfs/pipeline type btrfs (rw,noatime,space_cache)
/dev/sda2 on /srv/nfs/audio type btrfs (rw,noatime,space_cache)
/dev/sda2 on /srv/nfs/nadja type btrfs (rw,noatime,space_cache)

and are there any combinations of possibly conflicting mount options one should be aware of (compression, autodefrag, cache clearing)? Is it advisable to use the same mount options for all mounts pointing to the same physical device?

If you assume that the first mount's options are the ones used for everything, regardless of any different options provided in subsequent mounts, then you probably won't go far wrong. It's not quite true: some options do work on a per-mount basis, but most are per-filesystem. I'm sure there was a list of them on the wiki at some point, but I can't seem to track it down right now.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Try everything once, except incest and folk-dancing. ---
Re: btrfs fi df shows unknown ?
On Thu, Jul 17, 2014 at 10:02:01AM +0200, Swâmi Petaramesh wrote:

Hi there, Since a few days, I have noticed that btrfs fi df / displays an entry about unknown used space, and I can see this on several Fedora machines, so it is not an issue related to a given system... Does anybody know what these unknown data are?

It's the block reserve, which used to be part of metadata, but is now split out to its own type. An updated userspace should be able to show it properly.

Hugo.

i.e:
# btrfs fi df /
Data, single: total=106.00GiB, used=88.28GiB
System, DUP: total=32.00MiB, used=24.00KiB
Metadata, DUP: total=1.00GiB, used=520.36MiB
unknown, single: total=176.00MiB, used=0.00
# btrfs --version
Btrfs v3.14.2
# uname -r
3.15.5-200.fc20.x86_64

TIA, kind regards.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Try everything once, except incest and folk-dancing. ---
Re: Is it safe to mount subvolumes of already-mounted volumes (even with different options)?
Original Message
Subject: Re: Is it safe to mount subvolumes of already-mounted volumes (even with different options)?
From: Sebastian Ochmann ochm...@informatik.uni-bonn.de
To: Chris Murphy li...@colorremedies.com, zhe.zhang.resea...@gmail.com
Date: 2014-07-17 15:58

Hello, I need to clarify, I'm _not_ sharing a drive between multiple computers at the _same_ time. It's a portable device which I use at different locations with different computers. I just wanted to give a rationale for mounting the whole drive to some mountpoint and then also part of that drive (a subvolume) to the respective computer's /home mountpoint. So it's controlled by the same kernel in the same computer, it's just that part of the filesystem is mounted at multiple mountpoints, much like a bind-mount, but I'm interested in mounting a subvolume of the already-mounted volume to some other mountpoint. Sorry for the confusion. Best regards Sebastian

If you mean something like the following use case:

# mount /dev/sdb1 -o subvolid=257 /home
# mount /dev/sdb1 -o subvolid=5 /some/other/place

That is completely OK. But when it comes to different mount options, especially different ro/rw mount options: although it is working for 3.16-rc*, the ro/rw mount option is still under discussion, and the current rc implementation will cause a kernel warning when mounting a subvolume rw after it was first mounted ro.

So in short:
1) Mounting subvolumes when the btrfs fs is already mounted: completely OK.
2) Different mount options for different subvolumes in one btrfs fs: for most mount options, including ro/rw, no.

Thanks, Qu

On 17.07.2014 01:18, Chris Murphy wrote: On Jul 16, 2014, at 4:18 PM, Sebastian Ochmann ochm...@informatik.uni-bonn.de wrote: Hello, I'm sharing a btrfs-formatted drive between multiple computers and each of the machines has a separate home directory on that drive. 2+ computers writing to the same block device? I don't see how this is safe.
Seems possibly a bug that the 1st mount event isn't setting some metadata so that another kernel instance knows not to allow another mount. Chris Murphy
[PATCH] btrfs-progs: fix xfstest btrfs/023 random failure
xfstest btrfs/023, which does the following tests:

create_group_profile raid0
check_group_profile RAID0
create_group_profile raid1
check_group_profile RAID1
create_group_profile raid10
check_group_profile RAID10
create_group_profile raid5
check_group_profile RAID5
create_group_profile raid6
check_group_profile RAID6

fails randomly with the error below:

ERROR: device scan failed '/dev/sde' - Invalid argument

Since the failure is at a random group profile, it indicates to me that the btrfs kernel did not see the newly created btrfs on the device.

To note: I have the following patches on the kernel which are not yet integrated, but they are not related to this bug:

btrfs: RFC: code optimize use btrfs_get_bdev_and_sb() at btrfs_scan_one_device
btrfs: looping 'mkfs.btrfs -f dev' may fail with EBUSY
btrfs: check generation as replace duplicates devid+uuid

This patch calls fsync() at btrfs_prepare_device(). With this, btrfs/023 has NOT failed for several long iterations.

Signed-off-by: Anand Jain anand.j...@oracle.com
---
 utils.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/utils.c b/utils.c
index fbc5bde..e144dfd 100644
--- a/utils.c
+++ b/utils.c
@@ -741,6 +741,8 @@ int btrfs_prepare_device(int fd, char *file, int zero_end, u64 *block_count_ret,
 	}
 	*block_count_ret = block_count;
+	fsync(fd);
+
 zero_dev_error:
 	if (ret < 0) {
 		fprintf(stderr, "ERROR: failed to zero device '%s' - %s\n",
--
2.0.0.153.g79d
-- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: fix wrong manpage of defrag command
'btrfs filesystem defrag' has an option '-t', whose manpage says:

Any extent bigger than threshold given by -t option, will be considered already defragged. Use 0 to take the kernel default, and use 1 to say every single extent must be rewritten.

Here 'use 0' still works; it refers to the default value (256K). However, 'use 1' is an obvious typo: it should be -1, which means the largest value it can be. Right now we use parse_size(), which no longer allows the value '-1', so in order to keep the manpage correct, this updates it to only keep the value '0'. If you want to make sure every single extent is rewritten, please use a fairly large size, say 1G.

Reported-by: Sebastian Ochmann ochm...@informatik.uni-bonn.de
Signed-off-by: Liu Bo bo.li@oracle.com
---
 Documentation/btrfs-filesystem.txt | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/btrfs-filesystem.txt b/Documentation/btrfs-filesystem.txt
index 0ee79cb..c9c0b00 100644
--- a/Documentation/btrfs-filesystem.txt
+++ b/Documentation/btrfs-filesystem.txt
@@ -41,8 +41,7 @@ The start position and the number of bytes to defragment can be specified by
 start and len using '-s' and '-l' options below.
 Any extent bigger than threshold given by '-t' option,
 will be considered already defragged.
-Use 0 to take the kernel default, and use 1 to
-say every single extent must be rewritten.
+Use 0 to take the kernel default.
 You can also turn on compression in defragment operations.
 +
 `Options`
--
1.8.1.4
-- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFS FILE ID not unique when exporting many btrfs subvolumes
On Thu, Jul 17, 2014 at 10:40:14AM +0000, philippe.simo...@swisscom.com wrote:

I have a problem using btrfs/nfs to store my vmware images. [snip]

- vmware is basing its NFS file locks on the nfs fileid field returned from an NFS GETATTR request for the file being locked (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007909); vmware assumes that these nfs fileids are unique per storage.
- it seems that these nfs fileids are only unique per subvolume, but because my nfs export contains many subvolumes, the nfs export then has files (in different subvolumes) with the same nfs fileid.
- no problem when I start each machine alone, but when 2 machines are running at the same time, vmware seems to mix up its lock-file references and sometimes kills one vm. On the esx server, the following messages appear in /var/log/vmkwarning.log:

2014-07-17T06:31:46.854Z cpu2:268913)WARNING: NFSLock: 1315: Inode (Dup: 260 Orig: 260) has been recycled by server, freeing lock info for .lck-0401
2014-07-17T06:34:47.925Z cpu2:114740)WARNING: NFSLock: 2348: Unable to remove lockfile .invalid, not found
2014-07-17T10:18:50.320Z cpu0:32824)WARNING: NFSLock: 2348: Unable to remove lockfile .invalid, not found

and in the machine log:

Message from sncubeesx02: The lock protecting vm-w7-sysp.vmdk has been lost, possibly due to underlying storage issues. If this virtual machine is configured to be highly available, ensure that the virtual machine is running on some other host before clicking OK.

- vmware tries to do its own file locking for the following file types (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=10051):

VMNAME.vswp
DISKNAME-flat.vmdk
DISKNAME-ITERATION-delta.vmdk
VMNAME.vmx
VMNAME.vmxf
vmware.log

Is there a way to deal with this problem? Is that a bug?

Add an arbitrary and unique fsid=0x12345 value to the exports declaration.
For example, my server exports a number of subvolumes from the same FS with:

/srv/nfs/nadja  -rw,async,fsid=0x1729,no_subtree_check,no_root_squash \
                10.0.0.20 fe80::20
/srv/nfs/home   -rw,async,fsid=0x1730,no_subtree_check,no_root_squash \
                fe80::/64
/srv/nfs/video  -ro,async,fsid=0x1731,no_subtree_check \
                10.0.0.0/24 fe80::/64

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- You can get more with a kind word and a two-by-four than you --- can with just a kind word.
Re: [PATCH v3] Btrfs: fix abnormal long waiting in fsync
On 07/17/2014 04:08 AM, Liu Bo wrote:

xfstests generic/127 detected this problem. With commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8, fsync now only flushes data within the passed range. This is the cause of the above problem: btrfs's fsync has a stage called 'sync log' which waits for all the ordered extents it has recorded to finish.

In xfstests/generic/127, with mixed operations such as truncate, fallocate, punch hole, and mapwrite, we get some pre-allocated extents, and mapwrite will mmap and then msync. I found that msync can wait for quite a long time (about 20s in my case). Thanks to ftrace, it turns out that the previous fallocate calls btrfs_wait_ordered_range() to flush dirty pages, but as the range of dirty pages may be larger than what btrfs_wait_ordered_range() wants, there can be ordered extents that were created without getting their corresponding pages flushed; they are left in memory until we fsync, which runs into the 'sync log' stage, and fsync will just wait for the system writeback thread to flush those pages and finish the ordered extents, so the latency is inevitable.

This adds a flush similar to btrfs_start_ordered_extent() in btrfs_wait_logged_extents() to fix that.

I was able to trigger the stalls with plain fsx as well. Thanks!

-chris
RE: NFS FILE ID not unique when exporting many btrfs subvolumes
Hi Hugo -Original Message- From: Hugo Mills [mailto:h...@carfax.org.uk] Sent: Thursday, July 17, 2014 1:13 PM To: Simonet Philippe, INI-ON-FIT-NW-IPE Cc: linux-btrfs@vger.kernel.org Subject: Re: NFS FILE ID not unique when exporting many brtfs subvolumes On Thu, Jul 17, 2014 at 10:40:14AM +, philippe.simo...@swisscom.com wrote: I have a problem using btrfs/nfs to store my vmware images. [snip] - vmware is basing its NFS files locks on the nfs fileid field returned from a NFS GETATTR request for the file being locked http://kb.vmware.com/selfservice/microsites/search.do?language=en_ UScmd=displayKCexternalId=1007909 vmware assumes that these nfs fileid are unique per storage. - it seemed that these nfs fileid are only unique 'per-subvolume', but because my nfs export contains many subvolumes, the nfs export has then my files (in different subvolume) with the same nfs fileid. - no problem when I start all machine alone, but when 2 machines are running at the same time, vmware seems to mix its reference to lock file and sometimes kills one vm. in esx server, following messages : /var/log/vmkwarning.log : 2014-07-17T06:31:46.854Z cpu2:268913)WARNING: NFSLock: 1315: Inode (Dup: 260 Orig: 260) has been recycled by server, freeing lock info for .lck-0401 2014-07-17T06:34:47.925Z cpu2:114740)WARNING: NFSLock: 2348: Unable to remove lockfile .invalid, not found 2014-07-17T10:18:50.320Z cpu0:32824)WARNING: NFSLock: 2348: Unable to remove lockfile .invalid, not found and in machine log : Message from sncubeesx02: The lock protecting vm-w7- sysp.vmdk has been lost, possibly due to underlying storage issues. If this virtual machine is configured to be highly available, ensure that the virtual machine is running on some other host before clicking OK. 
- vmware tries to do its own file locking for the following file types (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=10051): VMNAME.vswp DISKNAME-flat.vmdk DISKNAME-ITERATION-delta.vmdk VMNAME.vmx VMNAME.vmxf vmware.log

Is there a way to deal with this problem? Is that a bug?

Add an arbitrary and unique fsid=0x12345 value to the exports declaration. For example, my server exports a number of subvolumes from the same FS with:

/srv/nfs/nadja  -rw,async,fsid=0x1729,no_subtree_check,no_root_squash \
                10.0.0.20 fe80::20
/srv/nfs/home   -rw,async,fsid=0x1730,no_subtree_check,no_root_squash \
                fe80::/64
/srv/nfs/video  -ro,async,fsid=0x1731,no_subtree_check \
                10.0.0.0/24 fe80::/64

Hugo.

First of all, thanks for your answer!

On my system I have one export, which is the root btrfs subvolume and itself contains one subvolume per vm. If I change the NFS export fsid, it does not change anything in the file IDs of the whole NFS export. (I cross-checked it just to be sure, with tshark -V -nlp -t a port 2049 | egrep "Entry: name|File ID", and effectively, fsid has no impact on the file IDs.)

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- You can get more with a kind word and a two-by-four than you --- can with just a kind word.
Re: Blocked tasks on 3.15.1
[ deadlocks during rsync in 3.15 with compression enabled ]

Hi everyone, I still haven't been able to reproduce this one here, but I'm going through a series of tests with lzo compression forced and every operation forced to ordered. Hopefully it'll kick it out soon.

While I'm hammering away, could you please try this patch. If this is the bug you're hitting, the deadlock will go away and you'll see this printk in the log.

thanks!
-chris

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3668048..8ab56df 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8157,6 +8157,13 @@ void btrfs_destroy_inode(struct inode *inode)
 		spin_unlock(&root->fs_info->ordered_root_lock);
 	}

+	spin_lock(&root->fs_info->ordered_root_lock);
+	if (!list_empty(&BTRFS_I(inode)->ordered_operations)) {
+		list_del_init(&BTRFS_I(inode)->ordered_operations);
+		printk(KERN_CRIT "racing inode deletion with ordered operations!!!\n");
+	}
+	spin_unlock(&root->fs_info->ordered_root_lock);
+
 	if (test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
 		     &BTRFS_I(inode)->runtime_flags)) {
 		btrfs_info(root->fs_info, "inode %llu still on the orphan list",
Re: NFS FILE ID not unique when exporting many btrfs subvolumes
On Thu, Jul 17, 2014 at 01:02:06PM +, philippe.simo...@swisscom.com wrote:

Hi Hugo

-----Original Message-----
From: Hugo Mills [mailto:h...@carfax.org.uk]
Sent: Thursday, July 17, 2014 1:13 PM
To: Simonet Philippe, INI-ON-FIT-NW-IPE
Cc: linux-btrfs@vger.kernel.org
Subject: Re: NFS FILE ID not unique when exporting many btrfs subvolumes

On Thu, Jul 17, 2014 at 10:40:14AM +, philippe.simo...@swisscom.com wrote:

I have a problem using btrfs/nfs to store my vmware images. [snip]

- vmware is basing its NFS file locks on the nfs fileid field returned from a NFS GETATTR request for the file being locked: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007909 -- vmware assumes that these nfs fileids are unique per storage.

- it seemed that these nfs fileids are only unique 'per-subvolume', but because my nfs export contains many subvolumes, the nfs export then has files (in different subvolumes) with the same nfs fileid.

- no problem when I start each machine alone, but when 2 machines are running at the same time, vmware seems to mix up its lock file references and sometimes kills one vm. On the esx server, the following messages appear in /var/log/vmkwarning.log:

2014-07-17T06:31:46.854Z cpu2:268913)WARNING: NFSLock: 1315: Inode (Dup: 260 Orig: 260) has been recycled by server, freeing lock info for .lck-0401
2014-07-17T06:34:47.925Z cpu2:114740)WARNING: NFSLock: 2348: Unable to remove lockfile .invalid, not found
2014-07-17T10:18:50.320Z cpu0:32824)WARNING: NFSLock: 2348: Unable to remove lockfile .invalid, not found

and in the machine log:

Message from sncubeesx02: The lock protecting vm-w7-sysp.vmdk has been lost, possibly due to underlying storage issues. If this virtual machine is configured to be highly available, ensure that the virtual machine is running on some other host before clicking OK.
- vmware tries to make its own file locking for the following file types: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=10051

  VMNAME.vswp
  DISKNAME-flat.vmdk
  DISKNAME-ITERATION-delta.vmdk
  VMNAME.vmx
  VMNAME.vmxf
  vmware.log

Is there a way to deal with this problem? Is that a bug?

Add an arbitrary and unique fsid=0x12345 value to the exports declaration. For example, my server exports a number of subvolumes from the same FS with:

/srv/nfs/nadja -rw,async,fsid=0x1729,no_subtree_check,no_root_squash \
	10.0.0.20 fe80::20
/srv/nfs/home  -rw,async,fsid=0x1730,no_subtree_check,no_root_squash \
	fe80::/64
/srv/nfs/video -ro,async,fsid=0x1731,no_subtree_check \
	10.0.0.0/24 fe80::/64

   Hugo.

First of all, thanks for your answer! On my system, I have one export, which is the root btrfs subvolume and itself contains one subvolume per vm. If I change the NFS export fsid, it does not change any of the file IDs in the whole NFS export. (I cross-checked it just to be sure, with tshark -V -nlp -t a port 2049 | egrep 'Entry: name|File ID', and effectively, fsid has no impact on file id.)

Aaah, that's interesting. I suspect that you'll have to make the mounts explicit, so for every subvolume exported from the server, there's a line in fstab to mount it to the place it's exported from. This happens as a side-effect of the recommended filesystem/subvol layout[1] anyway, since it doesn't use nested subvolumes at all, so I've never actually noticed the situation you mention.

   Hugo.

-- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- There's a Martian war machine outside -- they want to talk --- to you about a cure for the common cold.
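Hugo's suggestion of explicit per-subvolume mounts can be sketched as follows. The device, subvolume names, and fsid values below are hypothetical examples, not taken from the thread: each exported subvolume gets its own mount line, and each export its own unique fsid, so every export presents a distinct filesystem to NFS clients.

```
# /etc/fstab -- mount every exported subvolume explicitly
/dev/sdb1  /srv/nfs           btrfs  subvolid=5       0  0
/dev/sdb1  /srv/nfs/vm-alpha  btrfs  subvol=vm-alpha  0  0
/dev/sdb1  /srv/nfs/vm-beta   btrfs  subvol=vm-beta   0  0

# /etc/exports -- one export per subvolume, each with a distinct fsid
/srv/nfs/vm-alpha  10.0.0.0/24(rw,async,fsid=0x1001,no_subtree_check)
/srv/nfs/vm-beta   10.0.0.0/24(rw,async,fsid=0x1002,no_subtree_check)
```

With this layout, inode numbers that collide between subvolumes no longer collide as NFS fileids, because each subvolume is exported as a separate filesystem.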
Re: Is it safe to mount subvolumes of already-mounted volumes (even with different options)?
Hugo Mills posted on Thu, 17 Jul 2014 09:41:53 +0100 as excerpted:

and are there any combinations of possibly conflicting mount options one should be aware of (compression, autodefrag, cache clearing)? Is it advisable to use the same mount options for all mounts pointing to the same physical device?

If you assume that the first mount options are the ones used for everything, regardless of any different options provided in subsequent mounts, then you probably won't go far wrong. It's not quite true: some options do work on a per-mount basis, but most are per-filesystem. I'm sure there was a list of them on the wiki at some point, but I can't seem to track it down right now.

IIRC/AFAIK, the btrfs-specific mount options should be per filesystem, while stuff like relatime vs noatime is VFS level and should work per subvolume.

There's actually a current discussion about ro vs rw. Consider the case of a parent subvolume (perhaps but not necessarily the root subvolume, id=5) being mounted writable in one location, with a child mounted elsewhere read-only. Because it's possible to browse in the parent's subvolume down into the child subvolume as well, and someone could write a file there, that write would then show up in the elsewhere-mounted read-only child subvolume as well. That's unexpected behavior to say the least! Normally, read-only means it cannot and will not change, but in this case it wouldn't mean that at all!

My idea is that the same rules should apply to ro/rw as apply to btrfs snapshots -- they stop at subvolume borders. Any write into a child subvolume would thus throw an error, regardless of how the parent subvolume was mounted. The only way to write into a subvolume would be to mount it read-write on its own. That would solve the ambiguity, but it would also be quite a change from existing behavior, where a read-write mount of the root subvolume can write into any subvolume.
Someone else suggested that we separate filesystem read-write from subvolume read-write. There's already the concept of read-only snapshots, used in btrfs-send, for one thing. The idea here would be that a read-only filesystem/root mount means the entire filesystem is read-only, but provided the filesystem/root was mounted read-write, individual subvolumes could be mounted read-only using a different option, subv=ro, or similar, which would be hooked into the existing read-only subvolume mechanism. In that case, if the filesystem/root was read-write, then the subvolume-specific rw/ro mount option would take precedence and would trigger an error on write to that subvolume even if written from the read-write parent mount.

But while btrfs is the first filesystem to do this sort of thing and thus to deal with the problem, it might not be the last, so policy coordination with the VFS layer should be considered and a generic kernel policy for any filesystem dealing with subvolumes should be established. IOW, it's bigger than simply btrfs.

So anyway, while there was a patch applied earlier that did allow different read-only/read-write subvolume mounts, I believe that's reverted for 3.16, while this discussion continues and until it gets resolved one way or another, possibly at a kernel conference or the like.

But I believe generic VFS stuff like noatime/relatime/atime and dev/nodev/suid/nosuid/exec/noexec is fine per-subvolume, because that's enforced at the VFS layer and there's no internal or expectation inconsistency to worry about if you can, for example, access the same device-file as a device via one mountpoint and not by another.

-- 
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master.
Richard Stallman
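Per Duncan's last paragraph, generic VFS-level options can safely differ between mounts of the same btrfs filesystem even while the btrfs-specific options are shared. A hypothetical illustration (device and subvolume names are made up):

```
mount -o subvolid=5 /dev/sdc1 /mnt/top             # top-level subvolume
mount -o subvol=home,noatime,nodev /dev/sdc1 /home
# same filesystem, but noatime/nodev here are enforced per mount
# at the VFS layer, independently of the /mnt/top mount
```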
Re: [RFC v2 0/2] vfs / btrfs: add support for ustat()
On Thu, Jul 17, 2014 at 01:03:01AM -0700, Christoph Hellwig wrote:

On Wed, Jul 16, 2014 at 02:37:56PM -0700, Luis R. Rodriguez wrote:

From: Luis R. Rodriguez mcg...@suse.com

This makes the implementation simpler by stuffing the struct on the driver and just letting the driver insert it into and remove it from the sb list. This avoids the kzalloc() completely.

Again, NAK. Make btrfs report the proper anon dev_t in stat and everything will just work.

Let's consider this userspace case:

	struct stat buf;
	struct ustat ubuf;

	/* Find a valid device number */
	if (stat("/", &buf)) {
		fprintf(stderr, "Stat failed: %s\n", strerror(errno));
		return 1;
	}

	/* Call ustat on it */
	if (ustat(buf.st_dev, &ubuf)) {
		fprintf(stderr, "Ustat failed: %s\n", strerror(errno));
		return 1;
	}

In the btrfs case it has an inode op for getattr, that is used and we set the dev to an anonymous dev_t. Later ustat will use user_get_super(), which will only be able to find the superblock if the superblock's only dev_t is assigned to it. Since we have many anonymous dev_t mappings to the superblock, though, we can't complete the search for btrfs and ustat() fails with -EINVAL. The series expands the number of dev_t's that a super block can have and allows this search to complete.

  Luis
Re: [PATCH 1/1] Btrfs: fix sparse warning
@@ -515,7 +515,8 @@ static int write_buf(struct file *filp, const void *buf, u32 len, loff_t *off)

Though this probably wants to be rewritten in terms of kernel_write(). That'd give an opportunity to get rid of the sctx->send_off and have it use f_pos in the filp.

Do you mean directly call kernel_write from send_cmd/send_header? I guess that loop around vfs_write in write_buf is there for something...

write_buf() could still exist to iterate over the buffer in the case of partial writes, but it doesn't need to muck around with set_fs() and forcing casts.

- z
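The loop being discussed exists because a single write may be partial. The same pattern is easy to show as a user-space analogue with plain write(2); this is an illustrative sketch, not the kernel code -- the kernel version would call kernel_write() inside the same kind of loop:

```c
#include <errno.h>
#include <unistd.h>

/* Write all of buf, retrying on short writes and EINTR.
 * Returns 0 on success, -1 with errno set on error. */
int write_full(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t ret = write(fd, p, len);

		if (ret < 0) {
			if (errno == EINTR)
				continue;	/* interrupted early: just retry */
			return -1;		/* real error */
		}
		p += ret;			/* short write: advance and keep going */
		len -= (size_t)ret;
	}
	return 0;
}
```

The error handling is the whole point: a short write is not a failure, only a negative return is.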
[PATCH 5/5] btrfs: correctly handle return from ulist_add
ulist_add() can return '1' on success, which qgroup_subtree_accounting() doesn't take into account. As a result, that value can be bubbled up to callers, causing an error to be printed. Fix this by only returning the value of ulist_add() when it indicates an error.

Signed-off-by: Mark Fasheh mfas...@suse.de
---
 fs/btrfs/qgroup.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 2ec2432..b55870c 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1976,6 +1976,7 @@ static int qgroup_subtree_accounting(struct btrfs_trans_handle *trans,
 	struct btrfs_qgroup_list *glist;
 	struct ulist *parents;
 	int ret = 0;
+	int err;
 	struct btrfs_qgroup *qg;
 	u64 root_obj = 0;
 	struct seq_list elem = {};
@@ -2030,10 +2031,12 @@ static int qgroup_subtree_accounting(struct btrfs_trans_handle *trans,
 	 * while adding parents of the parents to our ulist.
 	 */
 	list_for_each_entry(glist, &qg->groups, next_group) {
-		ret = ulist_add(parents, glist->group->qgroupid,
+		err = ulist_add(parents, glist->group->qgroupid,
 				ptr_to_u64(glist->group), GFP_ATOMIC);
-		if (ret < 0)
+		if (err < 0) {
+			ret = err;
 			goto out_unlock;
+		}
 	}
 
 	ULIST_ITER_INIT(&uiter);
@@ -2045,10 +2048,12 @@ static int qgroup_subtree_accounting(struct btrfs_trans_handle *trans,
 
 		/* Add any parents of the parents */
 		list_for_each_entry(glist, &qg->groups, next_group) {
-			ret = ulist_add(parents, glist->group->qgroupid,
+			err = ulist_add(parents, glist->group->qgroupid,
 					ptr_to_u64(glist->group), GFP_ATOMIC);
-			if (ret < 0)
+			if (err < 0) {
+				ret = err;
 				goto out_unlock;
+			}
 		}
 	}
-- 
1.8.4.5
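The contract being fixed here -- a helper that returns 1 for "newly added", 0 for "already present", and a negative errno on failure, where the caller must propagate only the negative case -- is a common pattern and easy to get wrong. A toy user-space analogue (illustrative only; set_add/add_all are made-up names, not btrfs code):

```c
#include <errno.h>
#include <stdlib.h>

/* Toy analogue of the ulist_add() contract: return 1 when the value
 * was newly added, 0 when it was already present, and a negative
 * errno on failure. */
struct u64_set { unsigned long long *vals; size_t n, cap; };

int set_add(struct u64_set *s, unsigned long long v)
{
	size_t i;

	for (i = 0; i < s->n; i++)
		if (s->vals[i] == v)
			return 0;			/* already present */

	if (s->n == s->cap) {
		size_t ncap = s->cap ? s->cap * 2 : 4;
		void *p = realloc(s->vals, ncap * sizeof(*s->vals));

		if (!p)
			return -ENOMEM;
		s->vals = p;
		s->cap = ncap;
	}
	s->vals[s->n++] = v;
	return 1;				/* newly added */
}

/* Caller pattern from the patch: treat both 0 and 1 as success and
 * bubble up only negative values. */
int add_all(struct u64_set *s, const unsigned long long *v, size_t n)
{
	int ret = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		int err = set_add(s, v[i]);

		if (err < 0) {
			ret = err;	/* real error: propagate */
			break;
		}
		/* err == 1 ("added") must NOT end up in ret */
	}
	return ret;
}
```

Storing the helper's result in a separate `err` variable, exactly as the patch does, is what keeps the stray '1' out of the caller's return value.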
[PATCH 4/5] btrfs: delete qgroup items in drop_snapshot
btrfs_drop_snapshot() leaves subvolume qgroup items on disk after completion. This wastes space and can also cause problems with snapshot creation. If a new snapshot tries to claim the deleted subvolume's id, btrfs will get -EEXIST from add_qgroup_item() and go read-only.

We can partially fix this by catching -EEXIST in add_qgroup_item() and initializing the existing items. This will leave orphaned relation items (BTRFS_QGROUP_RELATION_KEY) around, however, which would be confusing to the end user. Also this does nothing to fix the wasted space taken up by orphaned qgroup items.

So the full fix is to delete all qgroup items related to the deleted snapshot in btrfs_drop_snapshot. If an item persists (either due to a previous drop_snapshot without the fix, or some error) we can still continue with snapshot creation instead of throwing the whole filesystem read-only. In the very small chance that some relation items persist, they will not affect functioning of our level 0 subvolume qgroup.

Signed-off-by: Mark Fasheh mfas...@suse.de
---
 fs/btrfs/extent-tree.c |   6 +++
 fs/btrfs/qgroup.c      | 114 +++--
 fs/btrfs/qgroup.h      |   3 ++
 3 files changed, 120 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ed9e13c..2dad701 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8296,6 +8296,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 	if (err)
 		goto out_end_trans;
 
+	ret = btrfs_del_qgroup_items(trans, root);
+	if (ret) {
+		btrfs_abort_transaction(trans, root, ret);
+		goto out_end_trans;
+	}
+
 	ret = btrfs_del_root(trans, tree_root, &root->root_key);
 	if (ret) {
 		btrfs_abort_transaction(trans, tree_root, ret);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 1569338..2ec2432 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -35,7 +35,6 @@
 #include "qgroup.h"
 
 /* TODO XXX FIXME
- *  - subvol delete -> delete when ref goes to 0? delete limits also?
  *  - reorganize keys
  *  - compressed
  *  - sync
@@ -99,6 +98,16 @@ struct btrfs_qgroup_list {
 	struct btrfs_qgroup *member;
 };
 
+/*
+ * used in remove_qgroup_relations() to track qgroup relations that
+ * need deleting
+ */
+struct relation_rec {
+	struct list_head list;
+	u64 src;
+	u64 dst;
+};
+
 #define ptr_to_u64(x) ((u64)(uintptr_t)x)
 #define u64_to_ptr(x) ((struct btrfs_qgroup *)(uintptr_t)x)
 
@@ -551,9 +560,15 @@ static int add_qgroup_item(struct btrfs_trans_handle *trans,
 	key.type = BTRFS_QGROUP_INFO_KEY;
 	key.offset = qgroupid;
 
+	/*
+	 * Avoid a transaction abort by catching -EEXIST here. In that
+	 * case, we proceed by re-initializing the existing structure
+	 * on disk.
+	 */
+
 	ret = btrfs_insert_empty_item(trans, quota_root, path, &key,
 				      sizeof(*qgroup_info));
-	if (ret)
+	if (ret && ret != -EEXIST)
 		goto out;
 
 	leaf = path->nodes[0];
@@ -572,7 +587,7 @@ static int add_qgroup_item(struct btrfs_trans_handle *trans,
 	key.type = BTRFS_QGROUP_LIMIT_KEY;
 	ret = btrfs_insert_empty_item(trans, quota_root, path, &key,
 				      sizeof(*qgroup_limit));
-	if (ret)
+	if (ret && ret != -EEXIST)
 		goto out;
 
 	leaf = path->nodes[0];
@@ -2817,3 +2832,96 @@ btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info)
 		btrfs_queue_work(fs_info->qgroup_rescan_workers,
 				 &fs_info->qgroup_rescan_work);
 }
+
+static struct relation_rec *
+qlist_to_relation_rec(struct btrfs_qgroup_list *qlist, struct list_head *all)
+{
+	u64 group, member;
+	struct relation_rec *rec;
+
+	BUILD_BUG_ON(sizeof(struct btrfs_qgroup_list) < sizeof(struct relation_rec));
+
+	list_del(&qlist->next_group);
+	list_del(&qlist->next_member);
+	group = qlist->group->qgroupid;
+	member = qlist->member->qgroupid;
+	rec = (struct relation_rec *)qlist;
+	rec->src = group;
+	rec->dst = member;
+
+	list_add(&rec->list, all);
+	return rec;
+}
+
+static int remove_qgroup_relations(struct btrfs_trans_handle *trans,
+				   struct btrfs_fs_info *fs_info, u64 qgroupid)
+{
+	int ret, err;
+	struct btrfs_root *quota_root = fs_info->quota_root;
+	struct relation_rec *rec;
+	struct btrfs_qgroup_list *qlist;
+	struct btrfs_qgroup *qgroup;
+	LIST_HEAD(relations);
+
+	spin_lock(&fs_info->qgroup_lock);
+	qgroup = find_qgroup_rb(fs_info, qgroupid);
+
+	while (!list_empty(&qgroup->groups)) {
+		qlist = list_first_entry(&qgroup->groups,
+					 struct btrfs_qgroup_list, next_group);
+		rec = qlist_to_relation_rec(qlist, &relations);
+	}
+
[PATCH 3/5] Btrfs: __btrfs_mod_ref should always use no_quota
From: Josef Bacik jba...@fb.com

Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't understand how snapshot delete was using it and assumed that we needed the quota operations there. With Mark's work this has turned out to be not the case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the argument and make __btrfs_mod_ref call its process function with no_quota set always. Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
Signed-off-by: Mark Fasheh mfas...@suse.de
---
 fs/btrfs/ctree.c       | 20 ++--
 fs/btrfs/ctree.h       |  4 ++--
 fs/btrfs/extent-tree.c | 24 +++-
 3 files changed, 23 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index aeab453..44ee5d2 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -280,9 +280,9 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
 	WARN_ON(btrfs_header_generation(buf) > trans->transid);
 
 	if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
-		ret = btrfs_inc_ref(trans, root, cow, 1, 1);
+		ret = btrfs_inc_ref(trans, root, cow, 1);
 	else
-		ret = btrfs_inc_ref(trans, root, cow, 0, 1);
+		ret = btrfs_inc_ref(trans, root, cow, 0);
 	if (ret)
 		return ret;
 
@@ -1035,14 +1035,14 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
 		if ((owner == root->root_key.objectid ||
 		     root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) &&
 		    !(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF)) {
-			ret = btrfs_inc_ref(trans, root, buf, 1, 1);
+			ret = btrfs_inc_ref(trans, root, buf, 1);
 			BUG_ON(ret); /* -ENOMEM */
 
 			if (root->root_key.objectid ==
 			    BTRFS_TREE_RELOC_OBJECTID) {
-				ret = btrfs_dec_ref(trans, root, buf, 0, 1);
+				ret = btrfs_dec_ref(trans, root, buf, 0);
 				BUG_ON(ret); /* -ENOMEM */
-				ret = btrfs_inc_ref(trans, root, cow, 1, 1);
+				ret = btrfs_inc_ref(trans, root, cow, 1);
 				BUG_ON(ret); /* -ENOMEM */
 			}
 			new_flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
@@ -1050,9 +1050,9 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
 			if (root->root_key.objectid ==
 			    BTRFS_TREE_RELOC_OBJECTID)
-				ret = btrfs_inc_ref(trans, root, cow, 1, 1);
+				ret = btrfs_inc_ref(trans, root, cow, 1);
 			else
-				ret = btrfs_inc_ref(trans, root, cow, 0, 1);
+				ret = btrfs_inc_ref(trans, root, cow, 0);
 			BUG_ON(ret); /* -ENOMEM */
 		}
 		if (new_flags != 0) {
@@ -1069,11 +1069,11 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
 		if (flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
 			if (root->root_key.objectid ==
 			    BTRFS_TREE_RELOC_OBJECTID)
-				ret = btrfs_inc_ref(trans, root, cow, 1, 1);
+				ret = btrfs_inc_ref(trans, root, cow, 1);
 			else
-				ret = btrfs_inc_ref(trans, root, cow, 0, 1);
+				ret = btrfs_inc_ref(trans, root, cow, 0);
 			BUG_ON(ret); /* -ENOMEM */
-			ret = btrfs_dec_ref(trans, root, buf, 1, 1);
+			ret = btrfs_dec_ref(trans, root, buf, 1);
 			BUG_ON(ret); /* -ENOMEM */
 		}
 		clean_tree_block(trans, root, buf);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index be91397..8e29b61 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3326,9 +3326,9 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
 			 u64 min_alloc_size, u64 empty_size, u64 hint_byte,
 			 struct btrfs_key *ins, int is_data, int delalloc);
 int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
-		  struct extent_buffer *buf, int full_backref, int no_quota);
+		  struct extent_buffer *buf, int full_backref);
 int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
-		  struct extent_buffer *buf, int full_backref, int no_quota);
+		  struct extent_buffer *buf, int full_backref);
 int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
 				struct btrfs_root *root,
 				u64 bytenr, u64 num_bytes, u64 flags,
diff --git
[PATCH 0/5] btrfs: qgroup fixes for btrfs_drop_snapshot V5
Hi,

the following patches try to fix a long outstanding issue with qgroups and snapshot deletion. The core problem is that btrfs_drop_snapshot will skip shared extents during its tree walk. This results in an inconsistent qgroup state once the drop is processed. We also have a bug where qgroup items are not deleted after drop_snapshot. The orphaned items will cause btrfs to go readonly when a snapshot is created with the same id as the deleted one.

The first patch adds some tracing which I found very useful in debugging qgroup operations. The second patch is an actual fix to the problem. A third patch, from Josef, is also added. We need this because it fixes at least one set of inconsistencies qgroups can get to via drop_snapshot. The fourth patch adds code to delete qgroup items from disk once drop_snapshot has completed. With this version of the patch series, I can no longer reproduce qgroup inconsistencies via drop_snapshot on my test disks.

Change from last patch set:
- Added a small fix (patch #5). I can fold this back into the main patch if requested.

Changes from V3-V4:
- Added patch 'btrfs: delete qgroup items in drop_snapshot'

Changes from V2-V3:
- search on bytenr and root, but not seq, in btrfs_record_ref when we're looking for existing qgroup operations.

Changes before that (V1-V2):
- remove extra extent_buffer_uptodate call from account_shared_subtree()
- catch return values for the accounting calls now and do the right thing (log an error and tell the user to rescan)
- remove the loop on roots in qgroup_subtree_accounting and just use the nnodes member to make our first decision.
- Don't queue up the subtree root for a change (the code in drop_snapshot handles qgroup updates for this block).
- only walk subtrees if we're actually in DROP_REFERENCE stage and we're going to call free_extent
- account leaf items for level zero blocks that we are dropping in walk_up_proc

Please review, thanks.
Diffstat follows,
	--Mark

 fs/btrfs/ctree.c             |  20 +-
 fs/btrfs/ctree.h             |   4 
 fs/btrfs/extent-tree.c       | 291 --
 fs/btrfs/qgroup.c            | 295 +--
 fs/btrfs/qgroup.h            |   4 
 fs/btrfs/super.c             |   1 
 include/trace/events/btrfs.h |  59 
 7 files changed, 641 insertions(+), 33 deletions(-)
[PATCH 1/5] btrfs: add trace for qgroup accounting
We want this to debug qgroup changes on live systems.

Signed-off-by: Mark Fasheh mfas...@suse.de
Reviewed-by: Josef Bacik jba...@fb.com
---
 fs/btrfs/qgroup.c            |  3 +++
 fs/btrfs/super.c             |  1 +
 include/trace/events/btrfs.h | 56 
 3 files changed, 60 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 98cb6b2..6a6dc62 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1290,6 +1290,7 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
 	oper->seq = atomic_inc_return(&fs_info->qgroup_op_seq);
 	INIT_LIST_HEAD(&oper->elem.list);
 	oper->elem.seq = 0;
+	trace_btrfs_qgroup_record_ref(oper);
 	ret = insert_qgroup_oper(fs_info, oper);
 	if (ret) {
 		/* Shouldn't happen so have an assert for developers */
@@ -1911,6 +1912,8 @@ static int btrfs_qgroup_account(struct btrfs_trans_handle *trans,
 
 	ASSERT(is_fstree(oper->ref_root));
 
+	trace_btrfs_qgroup_account(oper);
+
 	switch (oper->type) {
 	case BTRFS_QGROUP_OPER_ADD_EXCL:
 	case BTRFS_QGROUP_OPER_SUB_EXCL:
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8e16bca..38b8bd8 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -60,6 +60,7 @@
 #include "backref.h"
 #include "tests/btrfs-tests.h"
+#include "qgroup.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/btrfs.h>
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 4ee4e30..b8774b3 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -23,6 +23,7 @@ struct map_lookup;
 struct extent_buffer;
 struct btrfs_work;
 struct __btrfs_workqueue;
+struct btrfs_qgroup_operation;
 
 #define show_ref_type(type)				\
 	__print_symbolic(type,				\
@@ -1119,6 +1120,61 @@ DEFINE_EVENT(btrfs__workqueue_done, btrfs_workqueue_destroy,
 	TP_ARGS(wq)
 );
 
+#define show_oper_type(type)					\
+	__print_symbolic(type,					\
+		{ BTRFS_QGROUP_OPER_ADD_EXCL,	"OPER_ADD_EXCL" },	\
+		{ BTRFS_QGROUP_OPER_ADD_SHARED,	"OPER_ADD_SHARED" },	\
+		{ BTRFS_QGROUP_OPER_SUB_EXCL,	"OPER_SUB_EXCL" },	\
+		{ BTRFS_QGROUP_OPER_SUB_SHARED,	"OPER_SUB_SHARED" })
+
+DECLARE_EVENT_CLASS(btrfs_qgroup_oper,
+
+	TP_PROTO(struct btrfs_qgroup_operation *oper),
+
+	TP_ARGS(oper),
+
+	TP_STRUCT__entry(
+		__field(	u64,	ref_root	)
+		__field(	u64,	bytenr		)
+		__field(	u64,	num_bytes	)
+		__field(	u64,	seq		)
+		__field(	int,	type		)
+		__field(	u64,	elem_seq	)
+	),
+
+	TP_fast_assign(
+		__entry->ref_root	= oper->ref_root;
+		__entry->bytenr		= oper->bytenr;
+		__entry->num_bytes	= oper->num_bytes;
+		__entry->seq		= oper->seq;
+		__entry->type		= oper->type;
+		__entry->elem_seq	= oper->elem.seq;
+	),
+
+	TP_printk("ref_root = %llu, bytenr = %llu, num_bytes = %llu, "
+		  "seq = %llu, elem.seq = %llu, type = %s",
+		  (unsigned long long)__entry->ref_root,
+		  (unsigned long long)__entry->bytenr,
+		  (unsigned long long)__entry->num_bytes,
+		  (unsigned long long)__entry->seq,
+		  (unsigned long long)__entry->elem_seq,
+		  show_oper_type(__entry->type))
+);
+
+DEFINE_EVENT(btrfs_qgroup_oper, btrfs_qgroup_account,
+
+	TP_PROTO(struct btrfs_qgroup_operation *oper),
+
+	TP_ARGS(oper)
+);
+
+DEFINE_EVENT(btrfs_qgroup_oper, btrfs_qgroup_record_ref,
+
+	TP_PROTO(struct btrfs_qgroup_operation *oper),
+
+	TP_ARGS(oper)
+);
+
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
1.8.4.5
[PATCH 2/5] btrfs: qgroup: account shared subtrees during snapshot delete
During its tree walk, btrfs_drop_snapshot() will skip any shared subtrees it encounters. This is incorrect when we have qgroups turned on, as those subtrees need to have their contents accounted. In particular, the case we're concerned with is when removing our snapshot root leaves the subtree with only one root reference. In those cases we need to find the last remaining root and add each extent in the subtree to the corresponding qgroup exclusive counts.

This patch implements the shared subtree walk and a new qgroup operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of this type is encountered during qgroup accounting, we search for any root references to that extent and in the case that we find only one reference left, we go ahead and do the math on its exclusive counts.

Signed-off-by: Mark Fasheh mfas...@suse.de
Reviewed-by: Josef Bacik jba...@fb.com
---
 fs/btrfs/extent-tree.c       | 261 +++
 fs/btrfs/qgroup.c            | 165 +++
 fs/btrfs/qgroup.h            |   1 +
 include/trace/events/btrfs.h |   3 +-
 4 files changed, 429 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 813537f..1aa4325 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7478,6 +7478,220 @@ reada:
 	wc->reada_slot = slot;
 }
 
+static int account_leaf_items(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root,
+			      struct extent_buffer *eb)
+{
+	int nr = btrfs_header_nritems(eb);
+	int i, extent_type, ret;
+	struct btrfs_key key;
+	struct btrfs_file_extent_item *fi;
+	u64 bytenr, num_bytes;
+
+	for (i = 0; i < nr; i++) {
+		btrfs_item_key_to_cpu(eb, &key, i);
+
+		if (key.type != BTRFS_EXTENT_DATA_KEY)
+			continue;
+
+		fi = btrfs_item_ptr(eb, i, struct btrfs_file_extent_item);
+		/* filter out non qgroup-accountable extents  */
+		extent_type = btrfs_file_extent_type(eb, fi);
+
+		if (extent_type == BTRFS_FILE_EXTENT_INLINE)
+			continue;
+
+		bytenr = btrfs_file_extent_disk_bytenr(eb, fi);
+		if (!bytenr)
+			continue;
+
+		num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
+
+		ret = btrfs_qgroup_record_ref(trans, root->fs_info,
+					      root->objectid,
+					      bytenr, num_bytes,
+					      BTRFS_QGROUP_OPER_SUB_SUBTREE, 0);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+/*
+ * Walk up the tree from the bottom, freeing leaves and any interior
+ * nodes which have had all slots visited. If a node (leaf or
+ * interior) is freed, the node above it will have its slot
+ * incremented. The root node will never be freed.
+ *
+ * At the end of this function, we should have a path which has all
+ * slots incremented to the next position for a search. If we need to
+ * read a new node it will be NULL and the node above it will have the
+ * correct slot selected for a later read.
+ *
+ * If we increment the root nodes slot counter past the number of
+ * elements, 1 is returned to signal completion of the search.
+ */
+static int adjust_slots_upwards(struct btrfs_root *root,
+				struct btrfs_path *path, int root_level)
+{
+	int level = 0;
+	int nr, slot;
+	struct extent_buffer *eb;
+
+	if (root_level == 0)
+		return 1;
+
+	while (level <= root_level) {
+		eb = path->nodes[level];
+		nr = btrfs_header_nritems(eb);
+		path->slots[level]++;
+		slot = path->slots[level];
+		if (slot >= nr || level == 0) {
+			/*
+			 * Don't free the root - we will detect this
+			 * condition after our loop and return a
+			 * positive value for caller to stop walking the tree.
+			 */
+			if (level != root_level) {
+				btrfs_tree_unlock_rw(eb, path->locks[level]);
+				path->locks[level] = 0;
+
+				free_extent_buffer(eb);
+				path->nodes[level] = NULL;
+				path->slots[level] = 0;
+			}
+		} else {
+			/*
+			 * We have a valid slot to walk back down
+			 * from. Stop here so caller can process these
+			 * new nodes.
+			 */
+			break;
+		}
+
+		level++;
+	}
+
+	eb = path->nodes[root_level];
+	if
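The slot-adjustment walk in adjust_slots_upwards() is easier to see on a toy model: represent the path as an array of slot indices plus a per-level item count, and walk upward until some level still has items left. This is a user-space sketch mirroring the kernel logic (illustrative only, not btrfs code; locking and extent-buffer freeing are modeled as a simple slot reset):

```c
/* Toy model of the adjust_slots_upwards() walk: slots[level] is the
 * current position at each level of the path, nritems[level] the
 * number of items in that node.  Returns 1 when the root's slot runs
 * past its item count (search complete), 0 when some level still has
 * a valid slot to walk back down from.  Level 0 (the leaf) is always
 * consumed, exactly as in the kernel version. */
int adjust_slots_upwards_model(int *slots, const int *nritems, int root_level)
{
	int level = 0;

	if (root_level == 0)
		return 1;

	while (level <= root_level) {
		slots[level]++;
		if (slots[level] >= nritems[level] || level == 0) {
			/* node exhausted (or leaf): "free" it by
			 * resetting its slot, unless it is the root */
			if (level != root_level)
				slots[level] = 0;
		} else {
			/* valid slot found -- stop so the caller can
			 * read the new child nodes */
			break;
		}
		level++;
	}

	return slots[root_level] >= nritems[root_level];
}
```

Repeated calls step the root's slot through 1, 2, ... and finally report completion once it passes the root's item count, which is exactly the termination condition the kernel function signals with a positive return.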
Questions on incremental backups
I've a couple of questions on incremental backups. I've read the wiki page, and would like to confirm my understanding of some features, and also see if other features are possible that are not mentioned. I'm looking to replace my existing backup solution, and hoping to match the features I currently use, and go a little beyond.

=== Daily snapshot ===

So, if I understand correctly, I can make a daily snapshot of my filesystem with very little overhead. Then these can later be synced efficiently to another system (only syncing the differences), so I can backup regularly over the internet to my server, and also to an external HDD. After syncing, I can delete the snapshots (other than the trailing one needed for the next backup). In this way I can keep a constant stream of daily backups even when offline, and simply sync them next time I am online before deleting them locally.

=== Ignore directories ===

Due to storage limitations on my server, is it possible to ignore certain directories? For example, ignoring the folder that stores all my games, as this could be rather large, and the contents can easily be re-downloaded. The instructions involve subvolumes, so maybe it's possible to ignore a subvolume when syncing? If that is possible, then is it also possible to have a separate backup that does include the ignored directory? For example, having the smaller sync to the storage-limited server, but having a full sync to an external HDD.

=== Display backups ===

Is it possible to view the contents of all backups? So, the expected interface would be something like a tree of all files from across all snapshots. Any files that are not present in the latest snapshot would be greyed out to show they have been deleted. Selecting a file would show a list of versions of the file, with one version for each snapshot the file has been modified in.
As long as I can get access to this information, maybe some kind of diff between snapshots, I'm willing to write the actual software to display this interface. (I suppose even if it's not supported, I could crawl through the filesystems and generate some kind of database, but that sounds like a painful process.)

=== Merge snapshots down ===

Is there some way to merge snapshots down? So, I could merge the last week of daily snapshots into a single weekly snapshot. The new snapshot should include all files across all the snapshots (even if deleted in some of the snapshots), and include just the latest version of each file. This way, I'd like to maintain daily snapshots, which can be regularly merged down into weekly snapshots, and then into monthly snapshots, and then finally into yearly snapshots.

And, finally, there's no problem in deleting old snapshots? I'm assuming any data from these snapshots used by other snapshots will still be referenced by the other snapshots, and thus be retained, so nothing will break?
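The snapshot/sync/delete cycle asked about in the "Daily snapshot" section maps onto btrfs send/receive with the -p (parent) option. A sketch with hypothetical paths and dates (the snapshot layout and hostnames are examples, not from the thread):

```
# take today's read-only snapshot (send requires read-only snapshots)
btrfs subvolume snapshot -r /home /home/.snapshots/home-2014-07-17

# first backup: full send
btrfs send /home/.snapshots/home-2014-07-16 | \
    ssh backuphost btrfs receive /backup/home

# thereafter: send only the difference against the previous snapshot
btrfs send -p /home/.snapshots/home-2014-07-16 \
              /home/.snapshots/home-2014-07-17 | \
    ssh backuphost btrfs receive /backup/home

# once received, the older local snapshot can be deleted, keeping the
# newest one as the parent for the next incremental
btrfs subvolume delete /home/.snapshots/home-2014-07-16
```

Deleting a snapshot only drops its references; extents still shared with other snapshots remain on disk, which is why the trailing snapshot is all that's needed to continue the chain.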
Re: Unmountable btrfs filesystem - 'unable to find logical' / 'no mapping'
Duncan 1i5t5.duncan at cox.net writes:

Gareth Clay posted on Tue, 15 Jul 2014 14:35:22 +0100 as excerpted:

I noticed yesterday that the mount points on my btrfs RAID1 filesystem had become read-only. On a reboot, the filesystem fails to mount. I wondered if someone here might be able to offer any advice on how to recover (if possible) from this position?

I had a similar (but I think different) issue some weeks ago. It was my first real experience with btrfs troubleshooting and recovery.

First, the recommendation is do NOT run btrfs check --repair except either at the recommendation of a dev after they've seen the details and determined it can fix them, or if your next step would be a fresh mkfs of the filesystem, thus blowing away what's there anyway, so you've nothing to lose. You can try btrfs check (aka btrfsck) without --repair to see what it reports, as that's read-only and thus won't break anything further, but similarly, won't repair anything either.

Also, as a general recommendation, try a current kernel, as btrfs is still developing fast enough that if you're a kernel series behind, there are fixes in the new version that you won't have in older kernels. I see you're on an ubuntu 3.13 series kernel, and the recommendation would be the latest 3.15 series stable kernel, if not the 3.16-rc series development kernel, since that's past rc5 now and thus getting close to release. The userspace, btrfs-progs, isn't quite as critical, but running at least v3.12 (which you are) is recommended. FWIW, v3.14.2 is current (as of when I last checked a couple days ago) and is what I am running here.

In general, you can try mounting with the recovery and then the recovery,ro options, but that didn't work here. You can also try the degraded option (tho I didn't), to see if it'll mount with just one of the pair. Of course, btrfs is still not fully stable, and keeping current backups is recommended. I did have backups, but they weren't as current as I wanted.
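The escalating, least-destructive-first sequence Duncan recommends can be sketched as a single function; the device and mount point are hypothetical, and this would run as root against the broken filesystem on a 2014-era kernel (where the backup-root option was spelled `recovery`):

```shell
#!/bin/bash
# Least-destructive-first recovery attempts for a btrfs filesystem that
# won't mount. DEV and MNT are hypothetical, not from the thread.
DEV=/dev/sdb1
MNT=/mnt/broken

diagnose_and_mount() {
    # Read-only check: reports problems, changes nothing on disk.
    btrfs check "$DEV"
    # Then fall through the increasingly forgiving mount options.
    mount "$DEV" "$MNT" && return
    mount -o recovery "$DEV" "$MNT" && return      # try the backup roots
    mount -o recovery,ro "$DEV" "$MNT" && return   # same, but read-only
    mount -o degraded "$DEV" "$MNT"                # RAID1: one device missing
}
```

Only if all of these fail (and after trying btrfs restore, which the thread covers next) would `btrfs check --repair` be on the table, per the warning above.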
Beyond that, there's btrfs restore (a separate btrfs-restore executable in older btrfs-progs, part of the main btrfs executable in newer versions), which is what I ended up using and is what the rest of this reply is about. That does NOT mount or write to the filesystem, but DOES let you pull files off the unmounted filesystem and write them to a working filesystem (btrfs or other, it was reiserfs here) in order to recover what you can. You can use --dry-run to list the files that would be recovered, in order to get an idea of how much it can recover.

There's a page on the wiki about using btrfs restore in combination with btrfs-find-root, if the current root is damaged and won't let you recover much. Note that generation and transid refer to the same thing, and you want to specify the root (using the -t location option, with the location found using find-root) that lets you recover the most. The -l (list tree roots) option is also useful in this context.

https://btrfs.wiki.kernel.org/index.php/Restore

Of course, restoring in this manner means you have to have somewhere else to put what you restore, which was fine for me as I'm using relatively small independent btrfs filesystems and could restore to a larger reiserfs on a different device, but it could be rather tougher for large multi-terabyte filesystems, unless you have (or purchase) a spare disk to put it on.

One thing I did NOT realize until later, however, is that btrfs restore loses the user and permissions information (at least without -x, which says it restores extended attributes; I didn't try it with that). I hacked up a find script to compare the restore to the backup and set ownership/permissions appropriately based on the files in the backup, but of course that didn't help for files that were new since the backup, and I had to set their ownership/permissions manually.

Hi Duncan, thanks for your thorough response and the tips - sorry to hear you've had issues too. Point taken on the kernel updates!
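The ownership/permissions fix-up Duncan mentions might look something like the following; this is a guess at the shape of such a script, not his actual one, and both paths are hypothetical. It copies the backup's owner and mode onto every restored file that also exists in the backup:

```shell
#!/bin/bash
# For every path in the backup tree that also exists in the restore
# tree, copy the backup's owner and mode onto the restored copy.
# Files new since the backup are untouched and, as noted above, still
# need their ownership/permissions set manually.
fix_perms() {
    backup=$1; restore=$2
    ( cd "$backup" && find . -print0 ) |
    while IFS= read -r -d '' f; do
        if [ -e "$restore/$f" ]; then
            chown --reference="$backup/$f" "$restore/$f"
            chmod --reference="$backup/$f" "$restore/$f"
        fi
    done
}
```

Run as root it fixes ownership too; as an ordinary user, chown to an unchanged owner is a no-op and only the modes are corrected.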
I'm in a similar situation to you - this is my first btrfs recovery experience. I've been playing with the fs for some time and have had no apparent issues, but this has been a useful reality check. Read/write error counts were high, so there's a suggestion that it might be down to drive failure.

In the end I had a lot of help from xaba on the #btrfs IRC channel, whose suggestions got me to the point where, with a bang up to date version of the userspace utils, I could get a successful btrfsck run using the -b option (3.12 only got part way). At that point btrfs restore still couldn't be run, degraded mounting also wouldn't work, and I'd spent about as much time as I was prepared to spend on recovering this fs, so I took a deep breath and ran btrfsck --repair. That's got me to the point where btrfs restore can now be run, so I'm going to dump as much as I
Re: Questions on incremental backups
Daily snapshots work well with kernel 3.14 and above (I had problems with 3.13 and previous). I have snapshots every 15 mins on some subvols. Very large numbers of snapshots can cause performance problems; I suggest staying below 1000 snapshots at this time.

You can use the send/recv functionality for remote backups. So far I've used rsync; it works well, and send/recv has some limitations about filesystem structure etc. Rsync can transfer to an ext4 or ZFS filesystem if you wish.

Ignoring directories in send/recv is done by subvol. Even if you use rsync, it's a good idea to have different subvols for directory trees with different backup requirements.

Displaying backups is an issue for backup software. It is above the level that BTRFS development touches. While people here can probably offer generic advice on backup software, it's not the topic of the list.

I use date-based snapshots on my backup BTRFS filesystems, and I can easily delete snapshots in the middle of the list.
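The date-based scheme in the reply above makes retention policies easy to script: ISO-dated names sort lexicographically, and since each btrfs snapshot is an independent reference into shared extents, any snapshot can be deleted, including ones in the middle of the series, without breaking its neighbours. A sketch with a hypothetical layout (`/backup/snapshots/YYYY-MM-DD`); the function is only defined here and would be run as root on the backup filesystem:

```shell
#!/bin/bash
# Delete date-named snapshots older than a cutoff date. Data shared
# with surviving snapshots stays referenced by them, so nothing breaks.
SNAP_ROOT=/backup/snapshots   # hypothetical: one snapshot per day

prune_before() {
    cutoff=$1                 # e.g. "2014-06-15"
    for snap in "$SNAP_ROOT"/*; do
        name=$(basename "$snap")
        # ISO dates compare correctly as plain strings
        if [[ "$name" < "$cutoff" ]]; then
            btrfs subvolume delete "$snap"
        fi
    done
}
# e.g. keep 30 days:  prune_before "$(date -d '30 days ago' +%Y-%m-%d)"
```

A merge-down policy like the one asked about earlier in the thread reduces to the same operation: keep one snapshot per week (or month) and delete the rest, since each surviving snapshot already contains the latest version of every file that existed when it was taken.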