Re: cross-fs copy support
On Oct 1, 2018, at 9:49 AM, Eric Sandeen wrote:
> On 10/1/18 9:48 AM, Qu Wenruo wrote:
>> On 2018/10/1 下午10:32, Joshi wrote:
>>> I was wondering about the cross-fs copy through copy_file_range.
>>
>> The term "cross-fs" looks pretty confusing.
>>
>> If you mean "cross-subvolume", then it should work without problem in btrfs.
>>
>> If you mean reflink across two different filesystems (no matter whether they
>> are the same fs type or not), then it's impossible for it to work.
>
> I believe Joshi is talking about vfs_copy_file_range() not
> vfs_clone_file_range(), although _copy_ does call _clone_ if it can.
>
>> Reflink (clone_file_range) works by inserting data pointers into the
>> filesystem rather than really copying the data.
>> Thus if the source is outside of the fs, it's really impossible for it to
>> work, as the source pointer/data is completely out of control of the dest fs.
>
> Yes, I would expect there to be problems with his modified kernel
> for a filesystem that supports clone_file_range, because
> vfs_copy_file_range() will clone if possible, and this should fail across
> filesystems.
>
> In general, though, I don't know for sure why we don't fall back to
> do_splice_direct() across filesystems, although the filesystems that
> implement their own ->copy_file_range ops may have their own,
> further restrictions within their implementations.
>
> This call /is/ documented in the manpage as only being valid for
> files on the same filesystem, though:
> http://man7.org/linux/man-pages/man2/copy_file_range.2.html

There was a patch to allow cross-mount copy for NFS, but it hasn't landed yet.

Cheers, Andreas
Re: [PATCH 10/10] Dynamic fault injection
On May 18, 2018, at 1:10 PM, Kent Overstreet <kent.overstr...@gmail.com> wrote:
> On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
>> On May 18, 2018, at 1:49 AM, Kent Overstreet <kent.overstr...@gmail.com> wrote:
>>>
>>> Signed-off-by: Kent Overstreet <kent.overstr...@gmail.com>
>>
>> I agree with Christoph that even if there was some explanation in the cover
>> letter, there should be something at least as good in the patch itself.  The
>> cover letter is not saved, but the commit stays around forever, and should
>> explain how this should be added to code, and how to use it from userspace.
>>
>> That said, I think this is useful functionality.  We have something similar
>> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
>> test a distributed filesystem.  Ours is just a CPP macro with an unlikely()
>> branch, while this looks more sophisticated, with added functionality like
>> having more than one fault enabled at a time.  If this lands we could likely
>> switch our code over to using it.
>
> This is pretty much what I was looking for, I just wanted to know if this
> patch was interesting enough to anyone that I should spend more time on it
> or just drop it :)  Agreed on documentation.  I think it's also worth
> factoring out the functionality for the ELF section trick that dynamic
> debug uses too.
>
>> Some things that are missing from this patch that are in our code:
>>
>> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
>>   - timeout: sleep for N msec to simulate network/disk/locking delays
>>   - race: wait with one thread until a second thread hits a matching check
>>
>> We also have a "fail_val" that allows making the check conditional (e.g.
>> only operation on server "N" should fail, only RPC opcode "N", etc).
>
> Those all sound like good ideas... fail_val especially, I think with that
> we'd have all the functionality the existing fault injection framework has
> (which is way too heavyweight to actually get used, imo)

The other thing we have that is slightly orthogonal to your modes, which is possible because we just have a __u32 for the fault location, is that the "oneshot" mode is just a mask added to the fault location.  Together with "fail_val" we can add other masks: "fail N times", "fail randomly 1/N times", or "pass N times before failure".  Another mask is set in the kernel when the fault is actually hit, so that test scripts can poll until that happens, and then continue running.

The "fail randomly 1/N times" mode was useful for detecting memory allocation failure handling under load, but that has been superseded by the same functionality in kmalloc(), and it sounds like your fault injection can do this deterministically for every allocation?

Cheers, Andreas
Re: [PATCH 10/10] Dynamic fault injection
On May 18, 2018, at 1:49 AM, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet

I agree with Christoph that even if there was some explanation in the cover letter, there should be something at least as good in the patch itself. The cover letter is not saved, but the commit stays around forever, and should explain how this should be added to code, and how to use it from userspace.

That said, I think this is useful functionality. We have something similar in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to test a distributed filesystem. Ours is just a CPP macro with an unlikely() branch, while this looks more sophisticated, with added functionality like having more than one fault enabled at a time. If this lands we could likely switch our code over to using it.

Some things that are missing from this patch that are in our code:

- in addition to the basic "enabled" and "oneshot" mechanisms, we have:
  - timeout: sleep for N msec to simulate network/disk/locking delays
  - race: wait with one thread until a second thread hits a matching check

We also have a "fail_val" that allows making the check conditional (e.g. only operation on server "N" should fail, only RPC opcode "N", etc).

Cheers, Andreas

> ---
> include/asm-generic/vmlinux.lds.h |   4 +
> include/linux/dynamic_fault.h     | 117 +
> lib/Kconfig.debug                 |   5 +
> lib/Makefile                      |   2 +
> lib/dynamic_fault.c               | 760 ++
> 5 files changed, 888 insertions(+)
> create mode 100644 include/linux/dynamic_fault.h
> create mode 100644 lib/dynamic_fault.c
>
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index 1ab0e520d6..a4c9dfcbbd 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -246,6 +246,10 @@
>  	VMLINUX_SYMBOL(__start___verbose) = .;	\
>  	KEEP(*(__verbose))			\
>  	VMLINUX_SYMBOL(__stop___verbose) = .;	\
> +	. = ALIGN(8);				\
> +	VMLINUX_SYMBOL(__start___faults) = .;	\
> +	*(__faults)				\
> +	VMLINUX_SYMBOL(__stop___faults) = .;	\
>  	LIKELY_PROFILE()			\
>  	BRANCH_PROFILE()			\
>  	TRACE_PRINTKS()				\
> diff --git a/include/linux/dynamic_fault.h b/include/linux/dynamic_fault.h
> new file mode 100644
> index 00..6e7bb56ae8
> --- /dev/null
> +++ b/include/linux/dynamic_fault.h
> @@ -0,0 +1,117 @@
> +#ifndef _DYNAMIC_FAULT_H
> +#define _DYNAMIC_FAULT_H
> +
> +#include
> +#include
> +#include
> +
> +enum dfault_enabled {
> +	DFAULT_DISABLED,
> +	DFAULT_ENABLED,
> +	DFAULT_ONESHOT,
> +};
> +
> +union dfault_state {
> +	struct {
> +		unsigned	enabled:2;
> +		unsigned	count:30;
> +	};
> +
> +	struct {
> +		unsigned	v;
> +	};
> +};
> +
> +/*
> + * An instance of this structure is created in a special
> + * ELF section at every dynamic fault callsite.  At runtime,
> + * the special section is treated as an array of these.
> + */
> +struct _dfault {
> +	const char		*modname;
> +	const char		*function;
> +	const char		*filename;
> +	const char		*class;
> +
> +	const u16		line;
> +
> +	unsigned		frequency;
> +	union dfault_state	state;
> +
> +	struct static_key	enabled;
> +} __aligned(8);
> +
> +
> +#ifdef CONFIG_DYNAMIC_FAULT
> +
> +int dfault_add_module(struct _dfault *tab, unsigned int n, const char *mod);
> +int dfault_remove_module(char *mod_name);
> +bool __dynamic_fault_enabled(struct _dfault *);
> +
> +#define dynamic_fault(_class)						\
> +({									\
> +	static struct _dfault descriptor				\
> +	__used __aligned(8) __attribute__((section("__faults"))) = {	\
> +		.modname	= KBUILD_MODNAME,			\
> +		.function	= __func__,				\
> +		.filename	= __FILE__,				\
> +		.line		= __LINE__,				\
> +		.class		= _class,				\
> +	};								\
> +									\
Re: [PATCH 1/2] fs: hoist BTRFS_IOC_[SG]ET_FSLABEL to vfs
On May 9, 2018, at 10:10 AM, Darrick J. Wong <darrick.w...@oracle.com> wrote:
> On Wed, May 09, 2018 at 11:01:21AM -0500, Eric Sandeen wrote:
>> Move the btrfs label ioctls up to the vfs for general use.
>>
>> This retains 256 chars as the maximum size through the interface, which
>> is the btrfs limit and AFAIK exceeds any other filesystem's maximum
>> label size.
>>
>> Signed-off-by: Eric Sandeen <sand...@redhat.com>
>> ---
>>
>> Let the bikeshedding on the exact ioctl name begin ;)
>>
>>  fs/btrfs/ioctl.c           | 8
>>  include/uapi/linux/btrfs.h | 6 ++
>>  include/uapi/linux/fs.h    | 8 ++--
>>  3 files changed, 12 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 632e26d..2dd4cdf 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -5444,6 +5444,10 @@ long btrfs_ioctl(struct file *file, unsigned int
>>  		return btrfs_ioctl_setflags(file, argp);
>>  	case FS_IOC_GETVERSION:
>>  		return btrfs_ioctl_getversion(file, argp);
>> +	case FS_IOC_GETFSLABEL:
>> +		return btrfs_ioctl_get_fslabel(file, argp);
>> +	case FS_IOC_SETFSLABEL:
>> +		return btrfs_ioctl_set_fslabel(file, argp);
>>  	case FITRIM:
>>  		return btrfs_ioctl_fitrim(file, argp);
>>  	case BTRFS_IOC_SNAP_CREATE:
>> @@ -,10 +5559,6 @@ long btrfs_ioctl(struct file *file, unsigned int
>>  		return btrfs_ioctl_quota_rescan_wait(file, argp);
>>  	case BTRFS_IOC_DEV_REPLACE:
>>  		return btrfs_ioctl_dev_replace(fs_info, argp);
>> -	case BTRFS_IOC_GET_FSLABEL:
>> -		return btrfs_ioctl_get_fslabel(file, argp);
>> -	case BTRFS_IOC_SET_FSLABEL:
>> -		return btrfs_ioctl_set_fslabel(file, argp);
>>  	case BTRFS_IOC_GET_SUPPORTED_FEATURES:
>>  		return btrfs_ioctl_get_supported_features(argp);
>>  	case BTRFS_IOC_GET_FEATURES:
>> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
>> index c8d99b9..ec611c8 100644
>> --- a/include/uapi/linux/btrfs.h
>> +++ b/include/uapi/linux/btrfs.h
>> @@ -823,10 +823,8 @@ enum btrfs_err_code {
>>  #define BTRFS_IOC_QUOTA_RESCAN_STATUS _IOR(BTRFS_IOCTL_MAGIC, 45, \
>>  			       struct btrfs_ioctl_quota_rescan_args)
>>  #define BTRFS_IOC_QUOTA_RESCAN_WAIT _IO(BTRFS_IOCTL_MAGIC, 46)
>> -#define BTRFS_IOC_GET_FSLABEL _IOR(BTRFS_IOCTL_MAGIC, 49, \
>> -				   char[BTRFS_LABEL_SIZE])
>> -#define BTRFS_IOC_SET_FSLABEL _IOW(BTRFS_IOCTL_MAGIC, 50, \
>> -				   char[BTRFS_LABEL_SIZE])
>> +#define BTRFS_IOC_GET_FSLABEL FS_IOC_GETFSLABEL
>> +#define BTRFS_IOC_SET_FSLABEL FS_IOC_SETFSLABEL
>>  #define BTRFS_IOC_GET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
>>  				      struct btrfs_ioctl_get_dev_stats)
>>  #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
>> index d2a8313..1df3707 100644
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -242,6 +242,8 @@ struct fsxattr {
>>  #define FICLONERANGE	_IOW(0x94, 13, struct file_clone_range)
>>  #define FIDEDUPERANGE	_IOWR(0x94, 54, struct file_dedupe_range)
>>
>> +#define FSLABEL_MAX 256	/* Max chars for the interface; each fs may differ */
>> +
>>  #define FS_IOC_GETFLAGS		_IOR('f', 1, long)
>>  #define FS_IOC_SETFLAGS		_IOW('f', 2, long)
>>  #define FS_IOC_GETVERSION	_IOR('v', 1, long)
>> @@ -251,8 +253,10 @@ struct fsxattr {
>>  #define FS_IOC32_SETFLAGS	_IOW('f', 2, int)
>>  #define FS_IOC32_GETVERSION	_IOR('v', 1, int)
>>  #define FS_IOC32_SETVERSION	_IOW('v', 2, int)
>> -#define FS_IOC_FSGETXATTR	_IOR ('X', 31, struct fsxattr)
>> -#define FS_IOC_FSSETXATTR	_IOW ('X', 32, struct fsxattr)
>> +#define FS_IOC_FSGETXATTR	_IOR('X', 31, struct fsxattr)
>> +#define FS_IOC_FSSETXATTR	_IOW('X', 32, struct fsxattr)
>
> Separate patch for whitespace cleanup.

Really?  I've heard Ted complain the other way, that whitespace cleanup by itself is useless and should only be done as part of other changes.  As long as it is not overwhelming the rest of the patch I don't see an issue with a minor improvement being part of another patch.

Otherwise, no bikeshedding from me.  Looks very reasonable, and the 256-char limit is definitely large en
Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault
[remove stable@ as this is not really a stable patch]

On Dec 14, 2017, at 7:30 AM, Yan, Zheng wrote:
> On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara wrote:
>> On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
>>> We recently got an Oops report:
>>>
>>> BUG: unable to handle kernel NULL pointer dereference at (null)
>>> IP: jbd2__journal_start+0x38/0x1a2
>>> [...]
>>> Call Trace:
>>>  ext4_page_mkwrite+0x307/0x52b
>>>  _ext4_get_block+0xd8/0xd8
>>>  do_page_mkwrite+0x6e/0xd8
>>>  handle_mm_fault+0x686/0xf9b
>>>  mntput_no_expire+0x1f/0x21e
>>>  __do_page_fault+0x21d/0x465
>>>  dput+0x4a/0x2f7
>>>  page_fault+0x22/0x30
>>>  copy_user_generic_string+0x2c/0x40
>>>  copy_page_to_iter+0x8c/0x2b8
>>>  generic_file_read_iter+0x26e/0x845
>>>  timerqueue_del+0x31/0x90
>>>  ceph_read_iter+0x697/0xa33 [ceph]
>>>  hrtimer_cancel+0x23/0x41
>>>  futex_wait+0x1c8/0x24d
>>>  get_futex_key+0x32c/0x39a
>>>  __vfs_read+0xe0/0x130
>>>  vfs_read.part.1+0x6c/0x123
>>>  handle_mm_fault+0x831/0xf9b
>>>  __fget+0x7e/0xbf
>>>  SyS_read+0x4d/0xb5
>>>
>>> ceph_read_iter() uses current->journal_info to pass context info to
>>> ceph_readpages(), because ceph_readpages() needs to know if its caller
>>> has already gotten the capability for using the page cache (to distinguish
>>> read from readahead/fadvise).  ceph_read_iter() sets current->journal_info,
>>> then calls generic_file_read_iter().
>>>
>>> In the Oops above, a page fault happened while copying data to userspace.
>>> The page fault handler called ext4_page_mkwrite().  The ext4 code read
>>> current->journal_info and assumed it was a journal handle.
>>>
>>> I checked other filesystems; btrfs probably suffers a similar problem
>>> for its readpage.  (A page fault happens when write() copies data from
>>> userspace memory and that memory is mapped to a file in btrfs;
>>> verify_parent_transid() can be called during readpage.)
>>>
>>> Cc: sta...@vger.kernel.org
>>> Signed-off-by: "Yan, Zheng"
>>
>> I agree with the analysis but the patch is too ugly to live.  Ceph just
>> should not be abusing current->journal_info for passing information between
>> two random functions, or when it does a hackery like this, it should just
>> make sure the pieces hold together.  Polluting generic code to accommodate
>> this hack in Ceph is not acceptable.  Also bear in mind there are likely
>> other code paths (e.g. memory reclaim) which could recurse into another
>> filesystem, confusing it with a non-NULL current->journal_info in the same
>> way.
>
> But ...
>
> some filesystems set journal_info in their write_begin(), then clear it
> in write_end().  If the buffer for the write is mapped to another filesystem,
> current->journal_info can leak into the latter filesystem's readpage().
> The latter filesystem may read current->journal_info and treat it as its own
> journal handle.  Besides, most filesystems' vm fault handler is
> filemap_fault(), which may also trigger memory reclaim.

Shouldn't the memory reclaim be prevented from recursing into the other filesystem by use of GFP_NOFS, or the new memalloc_nofs annotation?  I don't think that ext4 is ever using current->journal_info on any read paths, only in the case of writes.

>> In this particular case I'm not sure why ceph passes 'filp' into the
>> readpage() / readpages() handler when it already gets that pointer as part
>> of the arguments...
>
> It is actually a flag which tells ceph_readpages() whether its caller is
> ceph_read_iter() or readahead/fadvise/madvise, because when multiple
> clients read/write a file at the same time, the page cache should
> be disabled.

I've wanted something similar for other reasons.  It would be better to have a separate fs-specific pointer in the task struct to handle this kind of information.  This could be used by the filesystem "upper half" to communicate with the "lower half" (doing the writeout or other IO below the VFS), and the "lower half" can keep using ->journal_info for handling the writeout.  However, some care would be needed to ensure that other processes accessing this pointer would only do so if it is their own.

Something like ->fs_private_sb and ->fs_private_data would allow this sanely.  If ->fs_private_sb != sb in the filesystem, then ->fs_private_data is not valid for this fs and cannot be used by the current filesystem code.  Alternately, we could have a single ->fs_private pointer to reduce the impact on task_struct, so long as all filesystems used the first field of the structure to point to "sb", probably with a library helper to ensure this is done consistently:

	data = current_fs_private_get(sb);
	current_fs_private_set(sb, data);
	data = current_fs_private_alloc(sb, size, gfp);

or whatever.

> Regards
> Yan, Zheng
>
>> 								Honza
>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index a728bed16c20..db2a50233c49 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma,
Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
On Nov 15, 2017, at 7:18 PM, Qu Wenruo wrote:
> [Background]
> Recently I'm considering the possibility of using checksums from the
> filesystem to enhance device-mapper raid.
>
> The idea behind it is quite simple: most modern filesystems have
> checksums for their metadata, and some (btrfs) have checksums for data.
>
> And for btrfs RAID1/10 (just ignore RAID5/6 for now), at read time
> it can use the checksum to determine which copy is correct, so it can
> return the correct data even if one copy gets corrupted.
>
> [Objective]
> The final objective is to allow device mapper to do the checksum
> verification (and repair if possible).
>
> If only for verification, it's not much different from the current endio
> hook method used by most filesystems.
> However, if we can move the repair part out of the filesystem (well, only
> btrfs supports it yet), it would benefit all filesystems.

I recall Darrick was looking into a mechanism to do this.  Rather than changing the whole block layer to take a callback to do a checksum, what we looked at was to allow the upper-layer read to specify a "retry count" to the lower-layer block device.  If the lower layer is able to retry the read then it will read a different device (or combination of devices for e.g. RAID-6) based on the retry count, until the upper layer gets a good read (based on checksum, or whatever).  If there are no more devices (or combinations) to try then a final error is returned.

Darrick can probably point at the original thread/patch.

Cheers, Andreas
Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl
On May 10, 2017, at 11:10 PM, Eric Biggers wrote:
> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
>>
>> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
>>> Theodore Ts'o writes:
>>>> On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
>>>>> 1.) Privacy implications.  Say the filesystem is being shared between
>>>>> multiple users, and one user unpacks foo.tar.gz into their home directory,
>>>>> which they've set to mode 700 to hide from other users.  Because of this
>>>>> new ioctl, all users will be able to see every (inode number, size in
>>>>> blocks) pair that was added to the filesystem, as well as the exact layout
>>>>> of the physical block allocations which might hint at how the files were
>>>>> created.  If there is a known "fingerprint" for the unpacked foo.tar.gz in
>>>>> this regard, its presence on the filesystem will be revealed to all users.
>>>>> And if any filesystems happen to prefer allocating blocks near the
>>>>> containing directory, the directory the files are in would likely be
>>>>> revealed too.
>>
>> Frankly, why are container users even allowed to make unrestricted ioctl
>> calls?  I thought we had a bunch of security infrastructure to constrain
>> what userspace can do to a system, so why don't ioctls fall under these
>> same protections?  If your containers are really that adversarial, you
>> ought to be blacklisting as much as you can.
>
> Personally I don't find the presence of sandboxing features to be a very good
> excuse for introducing random insecure ioctls.  Not everyone has everything
> perfectly "sandboxed" all the time, for obvious reasons.  It's easy to forget
> about the filesystem ioctls, too, since they can be executed on any regular
> file, without having to open some device node in /dev.
>
> (And this actually does happen; the SELinux policy in Android, for example,
> still allows apps to call any ioctl on their data files, despite all the
> effort that has gone into whitelisting other types of ioctls.  Which should
> be fixed, of course, but it shows that this kind of mistake is very easy to
> make.)

Unix/Linux has historically not been terribly concerned about trying to protect this kind of privacy between users.  So for example, in order to do this, you would have to call GETFSMAP continuously to track this sort of thing.  Someone who wanted to do this could probably get this information (and much, much more) by continuously running "ps" to see what processes are running.

(I will note, wryly, that in the bad old days, when dozens of users were sharing a one-MIPS VAX/780, it was considered a *good* thing that social pressure could be applied when it was found that someone was running a CPU or memory hogger on a time-sharing system.  The privacy right of someone running "xtrek" to be able to hide this from other users on the system was never considered important at all.  :-)

>> Not to mention someone running GETFSMAP in a loop will be pretty obvious
>> both from the high kernel cpu usage and the huge number of metadata
>> operations.
>
> Well, only if that someone running GETFSMAP actually wants to watch things in
> real-time (it's not necessary for all scenarios that have been mentioned),
> *and* there is monitoring in place which actually detects it and can do
> something about it.
>
> Yes, PIDs have traditionally been global, but today we have PID namespaces,
> and many other isolation features such as mount namespaces.  Nothing is
> perfect, of course, and containers are a lot worse than VMs, but it seems
> weird to use that as an excuse to knowingly make things worse...
>
>> Fortunately, the days of timesharing seem to be well behind us.

For those people who think that containers are as secure as VM's (hah, hah, hah), it might be that the best way to handle this is to have a mount option that requires root access to this functionality.  For those people who really care about this, they can disable access.

>> Or use separate filesystems for each container so that exploitable bugs
>> that shut down the filesystem can't be used to kill the other
>> containers.  You could use a torrent of metadata-heavy operations
>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
>> the other containers.
>>
>>> What would be the reason for not putting this behind
>>> capable(CAP_SYS_ADMIN)?
>>>
>>> What possible legitimate function could this functionality serve to
>>> users who don't own your filesystem?
>>
>> As I've said before, it's to enable dedupe tools to decide, given a set
>> of files with shareable blocks, roughly how many other times each of
>> those shareable blocks are shared
Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
On Jan 17, 2017, at 8:59 AM, Theodore Ts'o wrote:
> On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
>> OK, so I've been staring into the code and AFAIU current->journal_info
>> can contain my stored information.  I could either hijack part of the
>> word, as the ref counting is only consuming the low 12 bits — but that
>> looks too ugly to live — or I can allocate some placeholder.
>
> Yeah, I was looking at something similar.  Can you guarantee that the
> context will only take one or two bits?  (Looks like it only needs one
> bit ATM, even though at the moment you're storing the whole GFP mask,
> correct?)
>
>> But before going to play with that I am really wondering whether we need
>> all this with no journal at all.  AFAIU what Jack told me, it is the
>> journal lock(s) which are the biggest problem from the reclaim recursion
>> point of view.  What would cause a deadlock in no-journal mode?
>
> We still have the original problem for why we need GFP_NOFS even in
> ext2.  If we are in a writeback path, and we need to allocate memory,
> we don't want to recurse back into the file system's writeback path.
> Certainly not for the same inode, and while we could make it work if
> the mm was writing back another inode, or another superblock, there
> are also stack depth considerations that would make this a bad
> idea.  So we do need to be able to assert GFP_NOFS even in no-journal
> mode, and for any file system including ext2, for that matter.
>
> Because of the fact that we're going to have to play games with
> current->journal_info, maybe this is something that I should take
> responsibility for, and go through the ext4 tree after the main
> patch series goes through?  Maybe you could use xfs and ext2 as sample
> (simple) implementations?
>
> My only ask is that the memalloc nofs context be a well-defined N
> bits, where N < 16, and I'll find some place to put them (probably
> journal_info).

I think Dave was suggesting that the NOFS context allow a pointer to an arbitrary struct, so that it is possible to dereference this in the filesystem itself to determine whether the recursion is safe or not.  That way, ext2 could store an inode pointer (if that is what it cares about) and verify that writeback is not recursing on the same inode, and XFS could store something different.  It would also need to store some additional info (e.g. fstype or superblock pointer) so that it can determine how to interpret the NOFS context pointer.

I think it makes sense to add a couple of void * pointers to the task struct along with journal_info and leave it up to the filesystem to determine how to use them.

Cheers, Andreas
Re: [LSF/MM TOPIC] sharing pages between mappings
On Jan 11, 2017, at 3:29 AM, Miklos Szeredi wrote:
> I know there's work on this for xfs, but could this be done in generic mm
> code?
>
> What are the obstacles?  page->mapping and page->index are the obvious ones.
>
> If that's too difficult, is it maybe enough to share mappings between
> files while they are completely identical, and clone the mapping when
> necessary?
>
> All COW filesystems would benefit, as well as layered ones: lots of
> fuse fs, and in some cases overlayfs too.

For layered filesystems it would also be useful to have an API to move pages between mappings easily.

> Related: what can DAX do in the presence of cloned blocks?
>
> Thanks,
> Miklos

Cheers, Andreas
Re: [PATCH] f2fs: support multiple devices
On Nov 9, 2016, at 1:56 PM, Jaegeuk Kim wrote:
> This patch implements multiple device support for f2fs.
> Given multiple devices by mkfs.f2fs, f2fs shows them entirely as one big
> volume under one f2fs instance.
>
> Internal block management is very simple, but we will modify block
> allocation and background GC policy to boost IO speed by exploiting them
> according to each device's speed.

How will you integrate this with FIEMAP?  Since it is now possible for a file to be split across multiple devices, FIEMAP will return ambiguous block numbers for such a file.  I've been meaning to merge the FIEMAP handling in Lustre to support multiple devices in a single filesystem, so that this can be detected in userspace:

struct ll_fiemap_extent {
	__u64 fe_logical;  /* logical offset in bytes for the start of
			    * the extent from the beginning of the file */
	__u64 fe_physical; /* physical offset in bytes for the start
			    * of the extent from the beginning of the disk */
	__u64 fe_length;   /* length in bytes for this extent */
	__u64 fe_reserved64[2];
	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
	__u32 fe_device;   /* device number for this extent */
	__u32 fe_reserved[2];
};

This adds the 32-bit "fe_device" field, which would optionally be filled in by the filesystem (zero otherwise).  It would return the kernel device number (i.e. st_dev), or for a network filesystem (with FIEMAP_EXTENT_NET set) it could just return an integer device number, since the device number is meaningless (and may conflict) on a remote system.

Since AFAIK Btrfs also has multiple device support, there are an increasing number of places where this would be useful.
Cheers, Andreas

> Signed-off-by: Jaegeuk Kim
> ---
> fs/f2fs/data.c          |  55 ---
> fs/f2fs/f2fs.h          |  29 --
> fs/f2fs/segment.c       | 119 +
> fs/f2fs/super.c         | 138 ++--
> include/linux/f2fs_fs.h |  10 +++-
> 5 files changed, 277 insertions(+), 74 deletions(-)
>
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 47ded0c..e2be24e 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -88,6 +88,46 @@ static void f2fs_write_end_io(struct bio *bio)
>  }
>
>  /*
> + * Return true, if pre_bio's bdev is same as its target device.
> + */
> +struct block_device *f2fs_target_device(struct f2fs_sb_info *sbi,
> +				block_t blk_addr, struct bio *bio)
> +{
> +	struct block_device *bdev = sbi->sb->s_bdev;
> +	int i;
> +
> +	for (i = 0; i < sbi->s_ndevs; i++) {
> +		if (FDEV(i).start_blk <= blk_addr &&
> +					FDEV(i).end_blk >= blk_addr) {
> +			blk_addr -= FDEV(i).start_blk;
> +			bdev = FDEV(i).bdev;
> +			break;
> +		}
> +	}
> +	if (bio) {
> +		bio->bi_bdev = bdev;
> +		bio->bi_iter.bi_sector = SECTOR_FROM_BLOCK(blk_addr);
> +	}
> +	return bdev;
> +}
> +
> +int f2fs_target_device_index(struct f2fs_sb_info *sbi, block_t blkaddr)
> +{
> +	int i;
> +
> +	for (i = 0; i < sbi->s_ndevs; i++)
> +		if (FDEV(i).start_blk <= blkaddr && FDEV(i).end_blk >= blkaddr)
> +			return i;
> +	return 0;
> +}
> +
> +static bool __same_bdev(struct f2fs_sb_info *sbi,
> +				block_t blk_addr, struct bio *bio)
> +{
> +	return f2fs_target_device(sbi, blk_addr, NULL) == bio->bi_bdev;
> +}
> +
> +/*
>  * Low-level block read/write IO operations.
>  */
>  static struct bio *__bio_alloc(struct f2fs_sb_info *sbi, block_t blk_addr,
> @@ -97,8 +137,7 @@ static struct bio *__bio_alloc(struct f2fs_sb_info *sbi, block_t blk_addr,
>
>  	bio = f2fs_bio_alloc(npages);
>
> -	bio->bi_bdev = sbi->sb->s_bdev;
> -	bio->bi_iter.bi_sector = SECTOR_FROM_BLOCK(blk_addr);
> +	f2fs_target_device(sbi, blk_addr, bio);
>  	bio->bi_end_io = is_read ? f2fs_read_end_io : f2fs_write_end_io;
>  	bio->bi_private = is_read ? NULL : sbi;
>
> @@ -273,7 +312,8 @@ void f2fs_submit_page_mbio(struct f2fs_io_info *fio)
>  	down_write(&io->io_rwsem);
>
>  	if (io->bio && (io->last_block_in_bio != fio->new_blkaddr - 1 ||
> -	    (io->fio.op != fio->op || io->fio.op_flags != fio->op_flags)))
> +	    (io->fio.op != fio->op || io->fio.op_flags != fio->op_flags) ||
> +	    !__same_bdev(sbi, fio->new_blkaddr, io->bio)))
>  		__submit_merged_bio(io);
>  alloc_new:
>  	if (io->bio == NULL) {
> @@ -965,7 +1005,6 @@ static struct bio *f2fs_grab_bio(struct inode *inode, block_t blkaddr,
>  {
>  	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>  	struct fscrypt_ctx *ctx = NULL;
> -	struct
Re: [PATCH 5/5] don't set *REFERENCED unless we are on the lru list
On Oct 25, 2016, at 4:44 PM, Omar Sandoval wrote:
> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
>> With anything that populates the inode/dentry cache with a lot of one-time-use
>> inodes we can really put a lot of pressure on the system for things we don't
>> need to keep in cache.  It takes two runs through the LRU to evict these
>> one-use entries, and if you have a lot of memory you can end up with 10's of
>> millions of entries in the dcache or icache that have never actually been
>> touched since they were first instantiated, and it will take a lot of CPU and
>> a lot of pressure to evict all of them.
>>
>> So instead do what we do with pagecache: only set the *REFERENCED flags if we
>> are being used after we've been put onto the LRU.  This makes a significant
>> difference in the system's ability to evict these useless cache entries.
>> With a fs_mark workload that creates 40 million files we get the following
>> results (all in files/sec):
>>
>> Btrfs			Patched		Unpatched
>> Average Files/sec:	72209.3		63254.2
>> p50 Files/sec:		70850		57560
>> p90 Files/sec:		68757		53085
>> p99 Files/sec:		68757		53085
>>
>> XFS			Patched		Unpatched
>> Average Files/sec:	61025.5		60719.5
>> p50 Files/sec:		60107		59465
>> p90 Files/sec:		59300		57966
>> p99 Files/sec:		59227		57528
>>
>> Ext4			Patched		Unpatched
>> Average Files/sec:	121785.4	119248.0
>> p50 Files/sec:		120852		119274
>> p90 Files/sec:		116703		112783
>> p99 Files/sec:		114393		104934
>>
>> The reason Btrfs has a much larger improvement is because it holds a lot more
>> things in memory so benefits more from faster slab reclaim, but across the
>> board it is an improvement for each of the file systems.
>>
>> Signed-off-by: Josef Bacik
>> ---
>>  fs/dcache.c | 15 ++-
>>  fs/inode.c  |  5 -
>>  2 files changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/dcache.c b/fs/dcache.c
>> index 5c7cc95..a558075 100644
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -779,8 +779,6 @@ repeat:
>>  		goto kill_it;
>>  	}
>>
>> -	if (!(dentry->d_flags & DCACHE_REFERENCED))
>> -		dentry->d_flags |= DCACHE_REFERENCED;
>>  	dentry_lru_add(dentry);
>>
>>  	dentry->d_lockref.count--;
>> @@ -803,6 +801,13 @@ static inline void __dget_dlock(struct dentry *dentry)
>>  	dentry->d_lockref.count++;
>>  }
>>
>> +static inline void __dget_dlock_reference(struct dentry *dentry)
>> +{
>> +	if (dentry->d_flags & DCACHE_LRU_LIST)
>> +		dentry->d_flags |= DCACHE_REFERENCED;
>> +	dentry->d_lockref.count++;
>> +}
>> +
>>  static inline void __dget(struct dentry *dentry)
>>  {
>>  	lockref_get(&dentry->d_lockref);
>> @@ -875,7 +880,7 @@ again:
>>  		    (alias->d_flags & DCACHE_DISCONNECTED)) {
>>  			discon_alias = alias;
>>  		} else {
>> -			__dget_dlock(alias);
>> +			__dget_dlock_reference(alias);
>>  			spin_unlock(&alias->d_lock);
>>  			return alias;
>>  		}
>> @@ -886,7 +891,7 @@ again:
>>  		alias = discon_alias;
>>  		spin_lock(&alias->d_lock);
>>  		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
>> -			__dget_dlock(alias);
>> +			__dget_dlock_reference(alias);
>>  			spin_unlock(&alias->d_lock);
>>  			return alias;
>>  		}
>> @@ -2250,7 +2255,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
>>  		if (!d_same_name(dentry, parent, name))
>>  			goto next;
>>
>> -		dentry->d_lockref.count++;
>> +		__dget_dlock_reference(dentry);
>
> This misses dentries that we get through __d_lookup_rcu(), so I think
> your change made it so most things aren't getting DCACHE_REFERENCED set
> at all.  Maybe something like this instead?
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 5c7cc95..d7a56a8 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -412,15 +412,6 @@ static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
>  	list_lru_isolate_move(lru, &dentry->d_lru, list);
>  }
>
> -/*
> - * dentry_lru_(add|del)_list) must be called with d_lock held.
> - */
> -static void dentry_lru_add(struct dentry *dentry)
> -{
> -	if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
> -		d_lru_add(dentry);
> -}
> -
>  /**
>   * d_drop - drop a dentry
>   * @dentry: dentry to drop
> @@ -779,9 +770,12 @@ void dput(struct dentry *dentry)
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
> On Jul 6, 2016, at 10:33 AM, Joerg Schilling wrote:
>
> "Austin S. Hemmelgarn" wrote:
>
>> On 2016-07-06 11:22, Joerg Schilling wrote:
>>>
>>>
>>> You are mistaken.
>>>
>>> stat /proc/$$/as
>>>   File: `/proc/6518/as'
>>>   Size: 2793472   Blocks: 5456   IO Block: 512   regular file
>>> Device: 544h/88342528d   Inode: 7557   Links: 1
>>> Access: (0600/-rw-------)  Uid: ( xx/ joerg)   Gid: ( xx/ bs)
>>> Access: 2016-07-06 16:33:15.660224934 +0200
>>> Modify: 2016-07-06 16:33:15.660224934 +0200
>>> Change: 2016-07-06 16:33:15.660224934 +0200
>>>
>>> stat /proc/$$/auxv
>>>   File: `/proc/6518/auxv'
>>>   Size: 168   Blocks: 1   IO Block: 512   regular file
>>> Device: 544h/88342528d   Inode: 7568   Links: 1
>>> Access: (0400/-r--------)  Uid: ( xx/ joerg)   Gid: ( xx/ bs)
>>> Access: 2016-07-06 16:33:15.660224934 +0200
>>> Modify: 2016-07-06 16:33:15.660224934 +0200
>>> Change: 2016-07-06 16:33:15.660224934 +0200
>>>
>>> Any correct implementation of /proc returns the expected numbers in
>>> st_size as well as in st_blocks.
>>
>> Odd, because I get 0 for both values on all the files in /proc/self and
>> all the top level files on all kernels I tested prior to sending that
>
> I tested this with an official PROCFS-2 implementation that was written by
> the inventor of the PROC filesystem (Roger Faulkner) who as a sad news passed
> away last weekend.
>
> You may have done your tests on an inofficial procfs implementation

So, what you are saying is that you don't care about star working properly
on Linux, because it has an "inofficial" procfs implementation, while Solaris
has an "official" implementation?

>>> Now you know why BTRFS is still an incomplete filesystem. In a few years
>>> when it turns 10, this may change. People who implement filesystems of
>>> course need to learn that they need to hide implementation details from
>>> the official user space interfaces.
>>
>> So in other words you think we should be lying about how much is
>> actually allocated on disk and thus violating the standard directly (and
>> yes, ext4 and everyone else who does this with delayed allocation _is_
>> strictly speaking violating the standard, because _nothing_ is allocated
>> yet)?
>
> If it returns 0, it would be lying or it would be wrong anyway as it did not
> check the available space.
>
> Also note that I mentioned already that the principle availability of
> SEEK_HOLE does not help as there is e.g. NFS...

So, it's OK that NFS is not POSIX compliant in various ways, and star will
deal with it, but you aren't willing to fix a heuristic used by star for a
behaviour that is unspecified by POSIX but has caused users to lose data
when archiving from several modern filesystems?

That's fine, so long as GNU tar is fixed to use the safe fallback in such
cases (i.e. trying to archive data from files that are newly created, even
if they report st_blocks == 0).

Cheers, Andreas

signature.asc Description: Message signed with OpenPGP using GPGMail
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
On Jul 2, 2016, at 1:18 AM, Pavel Raiskup wrote:
>
> There are optimizations in archivers (tar, rsync, ...) that rely on
> up-to-date st_blocks info. For example, in GNU tar there is an optimization
> check [1]: whether 'st_size' reports more data than 'st_blocks' can hold -->
> then tar considers that the file is sparse (and does additional steps).
>
> It looks like btrfs doesn't show the correct value in 'st_blocks' until the
> data are synced. ATM, what happens is:
>
>   a) some "tool" creates a sparse file
>   b) that tool does not sync explicitly and exits ..
>   c) tar is called immediately after that to archive the sparse file
>   d) tar considers [2] the file is completely sparse (because st_blocks is
>      zero) and archives no data. Here comes data loss.
>
> Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data
> is small and is in-lined (no real data blocks) -- I consider this a bug in
> btrfs worth fixing too.

We had a similar problem with both ext4 and Lustre - the client was reporting
zero blocks due to delayed allocation until data was written to disk. While
those problems were fixed in the filesystem to report an estimate of the
block count before any blocks were actually written to disk, it seems like
this may be a problem that will come up again with other filesystems in the
future.

I think in addition to fixing btrfs (because it needs to work with existing
tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar to
handle this situation more robustly. One option: if st_blocks == 0, tar
should also check whether st_mtime is less than 60s in the past. If so, it
should either call fsync() on the file to flush any unwritten data to disk,
or assume the file is not sparse and read the whole file, so that it doesn't
incorrectly assume that the file is sparse and skip archiving the file data.
Cheers, Andreas > > [1] > http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392 > [2] > http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273 > > Tested on kernel: > kernel-4.5.7-300.fc24.x86_64 > > Originally reported here, reproducer available there: > https://bugzilla.redhat.com/show_bug.cgi?id=1352061 > > Pavel > > Cheers, Andreas
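The hardened heuristic suggested above is easy to sketch in C. This is an illustrative helper, not GNU tar's actual code; the function name and the 60-second window are assumptions taken from the suggestion in this thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <sys/stat.h>

/*
 * Illustrative sketch (not tar's real code) of the hardened sparse
 * check suggested above: a file only qualifies as sparse when
 * st_blocks cannot hold st_size, AND we are outside the window where
 * delayed allocation may simply not have assigned blocks yet.
 */
static bool safe_to_treat_as_sparse(const struct stat *st, time_t now)
{
	/* Classic heuristic: fewer allocated bytes than the file size. */
	bool looks_sparse = (long long)st->st_blocks * 512 < st->st_size;

	if (!looks_sparse)
		return false;

	/*
	 * Hardened check: st_blocks == 0 on a recently modified file is
	 * more likely delayed allocation than a truly sparse file, so
	 * the caller should fsync() and re-stat, or just read the whole
	 * file, instead of skipping the data.
	 */
	if (st->st_blocks == 0 && now - st->st_mtime < 60)
		return false;

	return true;
}
```

With a check like this, step (d) of the reproducer above would fall back to archiving the file data instead of silently writing an empty sparse member.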
Re: [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
On Apr 27, 2016, at 5:54 AM, Michal Hocko wrote:
>
> From: Michal Hocko
>
> xfs defined PF_FSTRANS to declare a scoped GFP_NOFS semantic quite
> some time ago. We would like to make this concept more generic and use
> it for other filesystems as well. Let's start by giving the flag a
> more generic name, PF_MEMALLOC_NOFS, which is in line with the existing
> PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> contexts. Replace all PF_FSTRANS usage in the xfs code in the first
> step, before we introduce a full API for it, as xfs uses the flag
> directly anyway.
>
> This patch doesn't introduce any functional change.
>
> Signed-off-by: Michal Hocko
> ---
> Hi,
> as suggested by Dave, I have split [1] up into two parts. The first one
> adds a new PF flag which is just an alias to the existing PF_FSTRANS
> and does all the renaming, and the second one introduces the generic
> API which changes only the bare minimum in the xfs proper.
>
> fs/xfs/kmem.c             | 4 ++--
> fs/xfs/kmem.h             | 2 +-
> fs/xfs/libxfs/xfs_btree.c | 2 +-
> fs/xfs/xfs_aops.c         | 6 +++---
> fs/xfs/xfs_trans.c        | 12 ++--
> include/linux/sched.h     | 2 ++
> 6 files changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 686ba6fb20dd..73f6ab59c664 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>  * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
>  * the filesystem here and potentially deadlocking.
>*/ > - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS)) > + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) > noio_flag = memalloc_noio_save(); > > lflags = kmem_flags_convert(flags); > ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL); > > - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS)) > + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) > memalloc_noio_restore(noio_flag); Not really the fault of this patch, but it brings this nasty bit of code into the light. Is all of this machinery still needed given that __vmalloc() can accept GFP flags? If yes, wouldn't it be better to fix __vmalloc() to honor the GFP flags instead of working around it in the filesystem code? Cheers, Andreas > return ptr; > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h > index d1c66e465ca5..0d83f332e5c2 100644 > --- a/fs/xfs/kmem.h > +++ b/fs/xfs/kmem.h > @@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags) > lflags = GFP_ATOMIC | __GFP_NOWARN; > } else { > lflags = GFP_KERNEL | __GFP_NOWARN; > - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS)) > + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) > lflags &= ~__GFP_FS; > } > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c > index a0eb18ce3ad3..326566f4a131 100644 > --- a/fs/xfs/libxfs/xfs_btree.c > +++ b/fs/xfs/libxfs/xfs_btree.c > @@ -2540,7 +2540,7 @@ xfs_btree_split_worker( > struct xfs_btree_split_args *args = container_of(work, > struct xfs_btree_split_args, > work); > unsigned long pflags; > - unsigned long new_pflags = PF_FSTRANS; > + unsigned long new_pflags = PF_MEMALLOC_NOFS; > > /* >* we are in a transaction context here, but may also be doing work > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index d12dfcfd0cc8..6d816ff0b763 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc( >* We hand off the transaction to the completion thread now, so >* clear the 
flag here. >*/ > - current_restore_flags_nested(>t_pflags, PF_FSTRANS); > + current_restore_flags_nested(>t_pflags, PF_MEMALLOC_NOFS); > return 0; > } > > @@ -169,7 +169,7 @@ xfs_setfilesize_ioend( >* thus we need to mark ourselves as being in a transaction manually. >* Similarly for freeze protection. >*/ > - current_set_flags_nested(>t_pflags, PF_FSTRANS); > + current_set_flags_nested(>t_pflags, PF_MEMALLOC_NOFS); > __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS); > > /* we abort the update if there was an IO error */ > @@ -979,7 +979,7 @@ xfs_vm_writepage( >* Given that we do not allow direct reclaim to call us, we should >* never be called while in a filesystem transaction. >*/ > - if (WARN_ON_ONCE(current->flags & PF_FSTRANS)) > + if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS)) > goto redirty; > > /* Is this page beyond the end of the file? */ > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c >
Re: fallocate mode flag for "unshare blocks"?
On Mar 31, 2016, at 12:08 PM, J. Bruce Fields wrote:
>
> On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
>> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
>>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
>>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
>>>>> Or is it ok that fallocate could block, potentially for a long time as
>>>>> we stream cows through the page cache (or however unshare works
>>>>> internally)? Those same programs might not be expecting fallocate to
>>>>> take a long time.
>>>>
>>>> Yes, it's perfectly fine for fallocate to block for long periods of
>>>> time. See what gfs2 does during preallocation of blocks - it ends up
>>>> calling sb_issue_zerout() because it doesn't have unwritten extents,
>>>> and hence can block for long periods of time
>>>
>>> gfs2 fallocate is an implementation that will cause all but the most
>>> trivial users real pain. Even the initial XFS implementation just
>>> marking the transactions synchronous made it unusable for all kinds
>>> of applications, and this is much worse. E.g. a NFS ALLOCATE operation
>>> to gfs2 will probably hang your connection for extended periods of
>>> time.
>>>
>>> If we need to support something like what gfs2 does we should have a
>>> separate flag for it.
>>
>> Using fallocate() for preallocation was always intended to
>> be a faster, more efficient method of allocating zeroed space
>> than having userspace write blocks of data. Faster, more efficient
>> does not mean instantaneous, and gfs2 using sb_issue_zerout() means
>> that if the hardware has zeroing offloads (deterministic trim, write
>> same, etc) it will use them, and that will be much faster than
>> writing zeros from userspace.
>>
>> IMO, what gfs2 does is definitely within the intended usage of
>> fallocate() for accelerating the preallocation of blocks.
>> >> Yes, it may not be optimal for things like NFS servers which haven't >> considered that a fallocate based offload operation might take some >> time to execute, but that's not a problem with fallocate. i.e. >> that's a problem with the nfs server ALLOCATE implementation not >> being prepared to return NFSERR_JUKEBOX to prevent client side hangs >> and timeouts while the operation is run > > That's an interesting idea, but I don't think it's really legal. I take > JUKEBOX to mean "sorry, I'm failing this operation for now, try again > later and it might succeed", not "OK, I'm working on it, try again and > you may find out I've done it". > > So if the client gets a JUKEBOX error but the server goes ahead and does > the operation anyway, that'd be unexpected. Well, the tape continued to be mounted in the background and/or the file restored from the tape into the filesystem... > I suppose it's comparable to the case where a slow fallocate is > interrupted--would it be legal to return EINTR in that case and leave > the application to sort out whether some part of the allocation had > already happened? If the later fallocate() was not re-doing the same work as the first one, it should be fine for the client to re-send the fallocate() request. The fallocate() to reserve blocks does not touch the blocks that are already allocated, so this is safe to do even if another process is writing to the file. If you have multiple processes writing and calling fallocate() with PUNCH/ZERO/COLLAPSE/INSERT to overlapping regions at the same time then the application is in for a world of hurt already. > Would it be legal to continue the fallocate under the covers even after > returning EINTR? That might produce unexpected results in some cases, but it depends on the options used. Probably the safest is to not continue, and depend on userspace to retry the operation on EINTR. For fallocate() doing prealloc or punch or zero this should eventually complete even if it is slow. 
Cheers, Andreas > But anyway my first inclination is to say that the NFS FALLOCATE > protocol just wasn't designed to handle long-running fallocates, and if > we really need that then we need to give it a way to either report > partial results or to report results asynchronously. > > --b.
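The retry-on-EINTR pattern discussed above can be sketched from userspace. The wrapper name is mine, not an existing API; it relies on the point made above that repeating a preallocation is safe because already-allocated blocks are left untouched:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Re-issue fallocate() if a long-running call is interrupted.  For
 * preallocation (mode 0) this is idempotent: blocks handled by the
 * interrupted call are simply skipped on the retry, so userspace can
 * loop until the whole range is done.
 */
static int fallocate_retry(int fd, int mode, off_t offset, off_t len)
{
	int ret;

	do {
		ret = fallocate(fd, mode, offset, len);
	} while (ret < 0 && errno == EINTR);

	return ret;
}
```

A caller that would rather abort on EINTR can instead delete or truncate the target, as suggested above; either way the decision stays in userspace.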
Re: fallocate mode flag for "unshare blocks"?
On Mar 31, 2016, at 1:55 AM, Christoph Hellwig wrote:
>
> On Wed, Mar 30, 2016 at 05:32:42PM -0700, Liu Bo wrote:
>> Well, btrfs fallocate doesn't allocate space if it's a shared one
>> because it thinks the space is already allocated. So a later overwrite
>> over this shared extent may hit enospc errors.
>
> And this makes it an incorrect implementation of posix_fallocate,
> which glibc implements using fallocate if available.

It isn't really useful for a COW filesystem to implement fallocate() to
reserve blocks. Even if it did allocate all of the blocks on the initial
fallocate() call, when it comes time to overwrite these blocks new blocks
need to be allocated, as the old ones will not be overwritten in place.
Because of snapshots that could hold references to the old blocks, there
isn't even a guarantee that the previously fallocated blocks will be
released in a reasonable time to free up an equal amount of space.

Cheers, Andreas
Re: [PATCH v7 0/4] VFS: In-kernel copy system call
> On Oct 24, 2015, at 10:52 AM, Eric Biggerswrote: > > A few comments: > >> if (!(file_in->f_mode & FMODE_READ) || >> !(file_out->f_mode & FMODE_WRITE) || >> (file_out->f_flags & O_APPEND) || >> !file_out->f_op) >> return -EBADF; > > Isn't 'f_op' always non-NULL? > > If the destination file cannot be append-only, shouldn't this be documented? Actually, wouldn't O_APPEND only be a problem if the target file wasn't being appended to? In other words, if the target i_size == start offset then it should be possible to use the copy syscall on an O_APPEND file. Cheers, Andreas >> if (inode_in->i_sb != inode_out->i_sb || >> file_in->f_path.mnt != file_out->f_path.mnt) >> return -EXDEV; > > Doesn't the same mount already imply the same superblock? > >> /* >> * copy_file_range() differs from regular file read and write in that it >> * specifically allows return partial success. When it does so is up to >> * the copy_file_range method. >> */ > > What does this mean? I thought that read() and write() can also return > partial > success. > >> f_out = fdget(fd_out); >> if (!f_in.file || !f_out.file) { >> ret = -EBADF; >> goto out; >> } > > This looked wrong at first because it may call fdput() on a 'struct fd' that > was > not successfully acquired, but it looks like it works currently because of how > the FDPUT_FPUT flag is used. It may be a good idea to write it the "obvious" > way, though (use separate labels depending on which fdput()s need to happen). > > > Other questions: > > Should FMODE_PREAD or FMODE_PWRITE access be checked if the user specifies > their > own 'off_in' or 'off_out', respectively? > > What is supposed to happen if the user passes provides a file descriptor to a > non-regular file, such as a block device or char device? > > If the 'in' file has fewer than 'len' bytes remaining until EOF, what is the > expected behavior? It looks like the btrfs implementation has different > behavior from the pagecache implementation. 
> > It appears the btrfs implementation has alignment restrictions --- where is > this > documented and how will users know what alignment to use? > > Are copies within the same file permitted and can the ranges overlap? The man > page doesn't say. > > It looks like the initial patch defines __NR_copy_file_range for the ARM > architecture but doesn't actually hook that system call up for ARM; why is > that? > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Cheers, Andreas signature.asc Description: Message signed with OpenPGP using GPGMail
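As the review points out, copy_file_range() is explicitly allowed to return partial success, so callers need a retry loop much as with write(). A minimal userspace sketch, using the glibc wrapper (available since glibc 2.27; the helper name is mine):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Copy up to len bytes from fd_in to fd_out, tolerating the partial
 * returns that copy_file_range() is documented to allow.  With NULL
 * offset pointers, both file offsets advance as the copy proceeds.
 */
static ssize_t copy_all(int fd_in, int fd_out, size_t len)
{
	size_t copied = 0;

	while (copied < len) {
		ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL,
					    len - copied, 0);
		if (n < 0)
			return -1;	/* inspect errno: EXDEV, EBADF, ... */
		if (n == 0)
			break;		/* hit EOF on the source */
		copied += n;
	}
	return (ssize_t)copied;
}
```

On filesystems with clone support the kernel may satisfy this with a reflink rather than a data copy; on older kernels a cross-filesystem fd pair fails with EXDEV, so a portable caller falls back to a read/write loop in that case.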
Re: [PATCH v6 5/4] copy_file_range.2: New page documenting copy_file_range()
> On Oct 16, 2015, at 3:08 PM, Anna Schumakerwrote: > > copy_file_range() is a new system call for copying ranges of data > completely in the kernel. This gives filesystems an opportunity to > implement some kind of "copy acceleration", such as reflinks or > server-side-copy (in the case of NFS). > > Signed-off-by: Anna Schumaker > Reviewed-by: Darrick J. Wong > --- > v6: > - Updates for removing most flags > --- > man2/copy_file_range.2 | 204 + > man2/splice.2 | 1 + > 2 files changed, 205 insertions(+) > create mode 100644 man2/copy_file_range.2 > > diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2 > new file mode 100644 > index 000..6c52c85 > --- /dev/null > +++ b/man2/copy_file_range.2 > @@ -0,0 +1,204 @@ > +.\"This manpage is Copyright (C) 2015 Anna Schumaker > > +.\" > +.\" %%%LICENSE_START(VERBATIM) > +.\" Permission is granted to make and distribute verbatim copies of this > +.\" manual provided the copyright notice and this permission notice are > +.\" preserved on all copies. > +.\" > +.\" Permission is granted to copy and distribute modified versions of > +.\" this manual under the conditions for verbatim copying, provided that > +.\" the entire resulting derived work is distributed under the terms of > +.\" a permission notice identical to this one. > +.\" > +.\" Since the Linux kernel and libraries are constantly changing, this > +.\" manual page may be incorrect or out-of-date. The author(s) assume. > +.\" no responsibility for errors or omissions, or for damages resulting. > +.\" from the use of the information contained herein. The author(s) may. > +.\" not have taken the same level of care in the production of this. > +.\" manual, which is licensed free of charge, as they might when working. Is there a reason why every. one. of. those. lines. ends. in. a. period? I don't think that is needed for nroff, and other paragraphs would support that conclusion. > +.\" professionally. 
> +.\" > +.\" Formatted or processed versions of this manual, if unaccompanied by > +.\" the source, must acknowledge the copyright and authors of this work. > +.\" %%%LICENSE_END > +.\" > +.TH COPY 2 2015-10-16 "Linux" "Linux Programmer's Manual" > +.SH NAME > +copy_file_range \- Copy a range of data from one file to another > +.SH SYNOPSIS > +.nf > +.B #include > +.B #include > +.B #include > + > +.BI "ssize_t copy_file_range(int " fd_in ", loff_t *" off_in ", int " fd_out > ", > +.BI "loff_t *" off_out ", size_t " len \ > +", unsigned int " flags ); > +.fi > +.SH DESCRIPTION > +The > +.BR copy_file_range () > +system call performs an in-kernel copy between two file descriptors > +without the additional cost of transferring data from the kernel to userspace > +and then back into the kernel. > +It copies up to > +.I len > +bytes of data from file descriptor > +.I fd_in > +to file descriptor > +.IR fd_out , > +overwriting any data that exists within the requested range of the target > file. > + > +The following semantics apply for > +.IR off_in , > +and similar statements apply to > +.IR off_out : > +.IP * 3 > +If > +.I off_in > +is NULL, then bytes are read from > +.I fd_in > +starting from the current file offset, and the offset is > +adjusted by the number of bytes copied. > +.IP * > +If > +.I off_in > +is not NULL, then > +.I off_in > +must point to a buffer that specifies the starting > +offset where bytes from > +.I fd_in > +will be read. The current file offset of > +.I fd_in > +is not changed, but > +.I off_in > +is adjusted appropriately. > +.PP > + > +The > +.I flags > +argument can have the following flag set: > +.TP 1.9i > +.B COPY_FR_REFLINK > +Create a lightweight "reflink", where data is not copied until > +one of the files is modified. This is a circular definition. Something like: Create a lightweight reference to the data blocks in the original file, where data is not copied until one of the files is modified. 
although I'm not sure if "lightweight" is really valuable there. > +.PP > +The default behavior > +.RI ( flags > +== 0) is to perform a full data copy of the requested range. > +.SH RETURN VALUE > +Upon successful completion, > +.BR copy_file_range () > +will return the number of bytes copied between files. > +This could be less than the length originally requested. This is a bit vague. When COPY_FR_REFLINK is used, no data is "copied", per se, but I doubt that "0" would be returned in that case either. It probably makes sense to write something like: ... return the number of bytes accessible in the target file. or maybe (s/accessible/transferred/) or ... return the number of bytes added to the target file. or similar. > + > +On error, > +.BR copy_file_range () > +returns \-1 and > +.I errno > +is set to indicate the error. > +.SH ERRORS > +.TP
Re: [PATCH v1 0/8] VFS: In-kernel copy system call
On Sep 4, 2015, at 2:16 PM, Anna Schumaker wrote:
>
> Copy system calls came up during Plumbers a couple of weeks ago,
> because several filesystems (including NFS and XFS) are currently
> working on copy acceleration implementations. We haven't heard from
> Zach Brown in a while, so I volunteered to push his patches upstream
> so individual filesystems don't need to keep writing their own ioctls.
>
> The first three patches are a simple reposting of Zach's patches
> from several months ago, with one minor error code fix. The remaining
> patches add in a fallback mechanism when filesystems don't provide a
> copy function. This is especially useful when doing a server-side
> copy on NFS (using the new COPY operation in NFS v4.2). This fallback
> can be disabled by passing the flag COPY_REFLINK to the system call.
>
> The last patch is a man page patch documenting this new system call,
> including an example program.
>
> I tested the fallback option by using /dev/urandom to generate files
> of varying sizes and copying them. I compared the time to copy
> against that of `cp` just to see if there is a noticeable difference.
> I found that runtimes are roughly the same, but in-kernel copy tends
> to use less of the cpu. Values in the tables below are averages
> across multiple trials.
>
> /usr/bin/cp | 512 MB | 1024 MB | 1536 MB | 2048 MB
> ------------|--------|---------|---------|--------
>    user     |  0.00s |   0.00s |   0.00s |   0.00s
>    system   |  0.32s |   0.52s |   1.04s |   1.04s
>    cpu      |    73% |     69% |     62% |     62%
>    total    |  0.446 |   0.757 |   1.197 |   1.667
>
> VFS copy    | 512 MB | 1024 MB | 1536 MB | 2048 MB
> ------------|--------|---------|---------|--------
>    user     |  0.00s |   0.00s |   0.00s |   0.00s
>    system   |  0.33s |   0.49s |   0.76s |   0.99s
>    cpu      |    77% |     62% |     60% |     59%
>    total    |  0.422 |   0.777 |   1.267 |   1.655
>
> Questions? Comments? Thoughts?

This is a bit of a surprising result, since in my testing in the past,
copy_{to/from}_user() is a major consumer of CPU time (50% of a CPU core
at 1GB/s). What backing filesystem did you test on?
In theory, the VFS copy routines should save at least 50% of the CPU usage since it only needs to make one copy (src->dest) instead of two (kernel->user, user->kernel). Ideally it wouldn't make any data copies at all and just pass page references from the source to the target. Cheers, Andreas > > Anna > > > Anna Schumaker (5): > btrfs: Add mountpoint checking during btrfs_copy_file_range > vfs: Remove copy_file_range mountpoint checks > vfs: Copy should check len after file open mode > vfs: Copy should use file_out rather than file_in > vfs: Fall back on splice if no copy function defined > > Zach Brown (3): > vfs: add copy_file_range syscall and vfs helper > x86: add sys_copy_file_range to syscall tables > btrfs: add .copy_file_range file operation > > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > fs/btrfs/ctree.h | 3 + > fs/btrfs/file.c| 1 + > fs/btrfs/ioctl.c | 95 ++-- > fs/read_write.c| 132 + > include/linux/copy.h | 6 ++ > include/linux/fs.h | 3 + > include/uapi/asm-generic/unistd.h | 4 +- > include/uapi/linux/Kbuild | 1 + > include/uapi/linux/copy.h | 6 ++ > kernel/sys_ni.c| 1 + > 12 files changed, 214 insertions(+), 40 deletions(-) > create mode 100644 include/linux/copy.h > create mode 100644 include/uapi/linux/copy.h > > -- > 2.5.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Cheers, Andreas -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 9/8] copy_file_range.2: New page documenting copy_file_range()
On Sep 4, 2015, at 3:38 PM, Darrick J. Wongwrote: > > On Fri, Sep 04, 2015 at 04:17:03PM -0400, Anna Schumaker wrote: >> copy_file_range() is a new system call for copying ranges of data >> completely in the kernel. This gives filesystems an opportunity to >> implement some kind of "copy acceleration", such as reflinks or >> server-side-copy (in the case of NFS). >> >> Signed-off-by: Anna Schumaker >> --- >> man2/copy_file_range.2 | 168 >> + >> 1 file changed, 168 insertions(+) >> create mode 100644 man2/copy_file_range.2 >> >> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2 >> new file mode 100644 >> index 000..4a4cb73 >> --- /dev/null >> +++ b/man2/copy_file_range.2 >> @@ -0,0 +1,168 @@ >> +.\"This manpage is Copyright (C) 2015 Anna Schumaker >> >> +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual" >> +.SH NAME >> +copy_file_range \- Copy a range of data from one file to another >> +.SH SYNOPSIS >> +.nf >> +.B #include >> +.B #include >> +.B #include >> + >> +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in >> ", >> +.BI "int " fd_out ", loff_t * " off_out ", size_t " len ", >> +.BI "unsigned int " flags ); >> +.fi >> +.SH DESCRIPTION >> +The >> +.BR copy_file_range () >> +system call performs an in-kernel copy between two file descriptors >> +without all that tedious mucking about in userspace. > > ;) > >> +It copies up to >> +.I len >> +bytes of data from file descriptor >> +.I fd_in >> +to file descriptor >> +.I fd_out >> +at >> +.IR off_out . >> +The file descriptors must not refer to the same file. > > Why? btrfs (and XFS) reflink can handle the case of a file sharing blocks > with itself. > >> + >> +The following semantics apply for >> +.IR fd_in , >> +and similar statements apply to >> +.IR off_out : >> +.IP * 3 >> +If >> +.I off_in >> +is NULL, then bytes are read from >> +.I fd_in >> +starting from the current file offset and the current >> +file offset is adjusted appropriately. 
>> +.IP * >> +If >> +.I off_in >> +is not NULL, then >> +.I off_in >> +must point to a buffer that specifies the starting >> +offset where bytes from >> +.I fd_in >> +will be read. The current file offset of >> +.I fd_in >> +is not changed, but >> +.I off_in >> +is adjusted appropriately. >> +.PP >> +The default behavior of >> +.BR copy_file_range () >> +is filesystem specific, and might result in creating a >> +copy-on-write reflink. >> +In the event that a given filesystem does not implement >> +any form of copy acceleration, the kernel will perform >> +a deep copy of the requested range by reading bytes from > > I wonder if it's wise to allow deep copies -- what happens if > len == 1T? Will this syscall just block for a really long time? It should be interruptible, and return the length of the number of bytes copied so far, just like read() and write(). That allows the caller to continue where it left off, or abort and delete the target file, or whatever it wants to do. Cheers, Andreas >> +.I fd_in >> +and writing them to >> +.IR fd_out . > > "...if COPY_REFLINK is not set in flags." > >> + >> +Currently, Linux only supports the following flag: >> +.TP 1.9i >> +.B COPY_REFLINK >> +Only perform the copy if the filesystem can do it as a reflink. >> +Do not fall back on performing a deep copy. >> +.SH RETURN VALUE >> +Upon successful completion, >> +.BR copy_file_range () >> +will return the number of bytes copied between files. >> +This could be less than the length originally requested. >> + >> +On error, >> +.BR copy_file_range () >> +returns \-1 and >> +.I errno >> +is set to indicate the error. >> +.SH ERRORS >> +.TP >> +.B EBADF >> +One or more file descriptors are not valid, >> +or do not have proper read-write mode. > > "or fd_out is not opened for writing"? > >> +.TP >> +.B EINVAL >> +Requested range extends beyond the end of the file; >> +.I flags >> +argument is set to an invalid value. 
>> +.TP >> +.B EOPNOTSUPP >> +.B COPY_REFLINK >> +was specified in >> +.IR flags , >> +but the target filesystem does not support reflinks. >> +.TP >> +.B EXDEV >> +Target filesystem doesn't support cross-filesystem copies. >> +.SH VERSIONS > > Perhaps this ought to list a few more errors (EIO, ENOSPC, ENOSYS, EPERM...) > that can be returned? (I was looking at the fallocate manpage.) > > --D > >> +The >> +.BR copy_file_range () >> +system call first appeared in Linux 4.3. >> +.SH CONFORMING TO >> +The >> +.BR copy_file_range () >> +system call is a nonstandard Linux extension. >> +.SH EXAMPLE >> +.nf >> + >> +#define _GNU_SOURCE >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> + >> + >> +int main(int argc, char **argv) >> +{ >> +int fd_in, fd_out; >> +struct stat stat; >> +loff_t len, ret; >> + >> +if (argc != 3) { >> +
Re: [RFC 0/8] Allow GFP_NOFS allocation to fail
On Aug 5, 2015, at 3:51 AM, mho...@kernel.org wrote: Hi, small GFP_NOFS, like GFP_KERNEL, allocations have not been failing traditionally, even though their reclaim capabilities are restricted because the VM code cannot recurse into filesystems to clean dirty pages. At the same time these allocation requests do not allow triggering the OOM killer, because that would lead to premature OOM killing during heavy fs metadata workloads. This leaves the VM code in an unfortunate situation where a GFP_NOFS request loops inside the allocator, relying on somebody else to make progress on its behalf. This is prone to deadlocks when the request is holding resources which are necessary for another task to make progress and release memory (e.g. the OOM victim is blocked on a lock held by the NOFS request). Another drawback is that the caller of the allocator cannot define any fallback strategy because the request doesn't fail. As the VM cannot do much about these requests we should face reality and allow those allocations to fail. Johannes has already posted the patch which does that (http://marc.info/?l=linux-mm&m=142726428514236&w=2) but the discussion died pretty quickly. I was playing with this patch and xfs, ext[34] and btrfs for a while to see what the effect is under heavy memory pressure. As expected this led to some fallouts. My test consisted of a simple memory hog which allocates a lot of anonymous memory and writes to a fs, mainly to trigger fs activity on exit. In parallel there is an fs metadata load (multiple tasks creating thousands of empty files and directories). All is running in a VM with a small amount of memory to emulate an under-provisioned system. The metadata load is sufficient to invoke direct reclaim even without the memory hog. 
The memory hog forks several tasks sharing the VM, and the OOM killer manages to kill it without locking up the system (this was based on the test case from Tetsuo Handa - http://www.spinics.net/lists/linux-fsdevel/msg82958.html - I just didn't want to kill my machine ;)). With all the patches applied, none of the 4 filesystems gets aborted transactions or an RO remount (well, xfs didn't need any special treatment). This is obviously not sufficient to claim that failing GFP_NOFS is OK now, but I think it is a good start for further discussion. I would be grateful if FS people could have a look at those patches. I have simply used __GFP_NOFAIL in the critical paths. This might not be the best strategy but it sounds like a good first step. The first patch in the series also allows __GFP_NOFAIL allocations to access memory reserves when the system is OOM, which should help those requests make forward progress - especially in combination with GFP_NOFS. The second patch tries to address a potential premature OOM kill from the page fault path. I have posted it separately but it didn't get much traction. The third patch allows GFP_NOFS to fail and I believe it should see much more testing coverage. It would be really great if it could sit in the mmotm tree for a few release cycles so that we can catch more fallouts. The rest are the FS-specific patches to fortify allocation requests which are really needed to finish transactions without RO remounts. There might be more needed but my test case survives with these in place. Wouldn't it make more sense to order the fs-specific patches _before_ the GFP_NOFS can fail patch (#3), so that once that patch is applied all known failures have already been fixed? Otherwise it could show test failures during bisection that would be confusing. Cheers, Andreas They would obviously need some rewording if they are going to be applied even without Patch3, and I will do that if the respective maintainers will take them. 
Ext3 and JBD are going away soon so they might be dropped, but they have been in the tree while I was testing so I've kept them. Thoughts? Opinions? -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Cheers, Andreas
Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper
On Apr 10, 2015, at 4:00 PM, Zach Brown z...@redhat.com wrote: Add a copy_file_range() system call for offloading copies between regular files. This gives an interface to underlying layers of the storage stack which can copy without reading and writing all the data. There are a few candidates that should support copy offloading in the nearer term: - btrfs shares extent references with its clone ioctl - NFS has patches to add a COPY command which copies on the server - SCSI has a family of XCOPY commands which copy in the device This system call avoids the complexity of also accelerating the creation of the destination file by operating on an existing destination file descriptor, not a path. Currently the high level vfs entry point limits copy offloading to files on the same mount and super (and not in the same file). This can be relaxed if we get implementations which can copy between file systems safely. Signed-off-by: Zach Brown z...@redhat.com --- fs/read_write.c | 129 ++ include/linux/fs.h | 3 + include/uapi/asm-generic/unistd.h | 4 +- kernel/sys_ni.c | 1 + 4 files changed, 136 insertions(+), 1 deletion(-) diff --git a/fs/read_write.c b/fs/read_write.c index 8e1b687..c65ce1d 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -17,6 +17,7 @@ #include <linux/pagemap.h> #include <linux/splice.h> #include <linux/compat.h> +#include <linux/mount.h> #include "internal.h" #include <asm/uaccess.h> @@ -1424,3 +1425,131 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, return do_sendfile(out_fd, in_fd, NULL, count, 0); } #endif + +/* + * copy_file_range() differs from regular file read and write in that it + * specifically allows return partial success. When it does so is up to + * the copy_file_range method. + */ +ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + size_t len, int flags) Minor nit - flags should be unsigned int to match the syscall. 
+{ + struct inode *inode_in; + struct inode *inode_out; + ssize_t ret; + + if (flags) + return -EINVAL; + + if (len == 0) + return 0; + + /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */ This says ssize_t, but the len parameter is size_t... + ret = rw_verify_area(READ, file_in, &pos_in, len); + if (ret >= 0) + ret = rw_verify_area(WRITE, file_out, &pos_out, len); + if (ret < 0) + return ret; + + if (!(file_in->f_mode & FMODE_READ) || + !(file_out->f_mode & FMODE_WRITE) || + (file_out->f_flags & O_APPEND) || + !file_in->f_op || !file_in->f_op->copy_file_range) + return -EINVAL; + + inode_in = file_inode(file_in); + inode_out = file_inode(file_out); + + /* make sure offsets don't wrap and the input is inside i_size */ + if (pos_in + len < pos_in || pos_out + len < pos_out || + pos_in + len > i_size_read(inode_in)) + return -EINVAL; + + /* this could be relaxed once a method supports cross-fs copies */ + if (inode_in->i_sb != inode_out->i_sb || + file_in->f_path.mnt != file_out->f_path.mnt) + return -EXDEV; + + /* forbid ranges in the same file */ + if (inode_in == inode_out) + return -EINVAL; + + ret = mnt_want_write_file(file_out); + if (ret) + return ret; + + ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out, + len, flags); + if (ret > 0) { + fsnotify_access(file_in); + add_rchar(current, ret); + fsnotify_modify(file_out); + add_wchar(current, ret); + } + inc_syscr(current); + inc_syscw(current); + + mnt_drop_write_file(file_out); + + return ret; +} +EXPORT_SYMBOL(vfs_copy_file_range); + +SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in, + int, fd_out, loff_t __user *, off_out, + size_t, len, unsigned int, flags) +{ + loff_t pos_in; + loff_t pos_out; + struct fd f_in; + struct fd f_out; + ssize_t ret; + + f_in = fdget(fd_in); + f_out = fdget(fd_out); + if (!f_in.file || !f_out.file) { + ret = -EBADF; + goto out; + } + + ret = -EFAULT; + if (off_in) { + if (copy_from_user(&pos_in, off_in, sizeof(loff_t))) + goto out; + } else { + pos_in = f_in.file->f_pos; + } + + if (off_out) { + if (copy_from_user(&pos_out, off_out, sizeof(loff_t))) + goto out; + } else { + pos_out = f_out.file->f_pos; + } + + ret =
Re: Documenting MS_LAZYTIME
On Feb 20, 2015, at 1:50 AM, Michael Kerrisk mtk.manpa...@gmail.com wrote: Hello Ted, Based on your commit message 0ae45f63d4e, I wrote the documentation below for MS_LAZYTIME, to go into the mount(2) man page. Could you please check it over and let me know if it's accurate. In particular, I added pieces marked with * below that were not part of the commit message and I'd like confirmation that they're accurate. Thanks, Michael [[ MS_LAZYTIME (since Linux 3.20) Only update filetimes (atime, mtime, ctime) on the in-memory version of the file inode. The on-disk timestamps are updated only when: (a) the inode needs to be updated for some change unrelated to file timestamps; (b) the application employs fsync(2), syncfs(2), or sync(2); (c) an undeleted inode is evicted from memory; or * (d) more than 24 hours have passed since the i-node was * written to disk. This mount option significantly reduces writes to the inode table for workloads that perform frequent random writes to preallocated files. * As at Linux 3.20, this option is supported only on ext4. I _think_ that the lazytime mount option is generic for all filesystems. I believe ext4 has an extra optimization for it, but that's it. Cheers, Andreas
Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option
On Dec 2, 2014, at 12:23 PM, Theodore Ts'o ty...@mit.edu wrote: On Tue, Dec 02, 2014 at 07:55:48PM +0200, Boaz Harrosh wrote: This I do not understand. I thought that I_DIRTY_TIME, and the whole lazytime mount option, is only for atime. So if there are dirty pages then there are also m/ctime that changed and surely we want to write these times to disk ASAP. What are the situations where you are most concerned about mtime or ctime being accurate after a crash? I've been running with it on my laptop for a while now, and it's certainly not a problem for build trees; remember, whenever you need to update the inode to update i_blocks or i_size, the inode (with its updated timestamps) will be flushed to disk anyway. [snip] I'm not aware of an application which is doing a large number of non-allocating random writes (for example, such as a database), where said database actually cares about mtime being correct. [snip] Did you have such a use case or application in mind? One thing that comes to mind is touch/utimes()/utimensat(). Those should definitely not result in timestamps being kept only in memory for 24h, since the whole point of those calls is to update the times. It makes sense for these APIs to dirty the inode for proper writeout. Cheers, Andreas if we are lazytime also with m/ctime then I think I would like an option for only atime lazy. because m/ctime is cardinal to some operations even though I might want atime lazy. If there's a sufficiently compelling use case where we do actually care about mtime/ctime being accurate, and the current semantics don't provide enough of a guarantee, it's certainly something we could do. I'd rather keep things simple unless it's really there. (After all, we did create the strictatime mount option, but I'm not sure anyone ever ends up using it. It would be a shame if we created a strictcmtime, which had the same usage rate.) 
I'll also note that if it's only about atime updates, with the default relatime mount option, I'm not sure there's enough of a win to justify a lazyatime-only option. If you really need strict c/mtime after a crash, maybe the best thing to do is to just simply not use the lazytime mount option and be done with it. Cheers, - Ted Cheers, Andreas
Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
On Nov 26, 2014, at 3:48 PM, Dave Chinner da...@fromorbit.com wrote: On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote: Add an optimization for the MS_LAZYTIME mount option so that we will opportunistically write out any inodes with the I_DIRTY_TIME flag set in a particular inode table block when we need to update some inode in that inode table block anyway. Also add some temporary code so that we can set the lazytime mount option without needing a modified /sbin/mount program which can set MS_LAZYTIME. We can eventually make this go away once util-linux has added support. Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o ty...@mit.edu --- fs/ext4/inode.c | 49 ++--- fs/ext4/super.c | 9 + include/trace/events/ext4.h | 30 +++ 3 files changed, 85 insertions(+), 3 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5653fa4..8308c82 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle, } /* + * Opportunistically update the other time fields for other inodes in + * the same inode table block. 
+ */ +static void ext4_update_other_inodes_time(struct super_block *sb, + unsigned long orig_ino, char *buf) +{ + struct ext4_inode_info *ei; + struct ext4_inode *raw_inode; + unsigned long ino; + struct inode *inode; + int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block; + int inode_size = EXT4_INODE_SIZE(sb); + + ino = orig_ino & ~(inodes_per_block - 1); + for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) { + if (ino == orig_ino) + continue; + inode = find_active_inode_nowait(sb, ino); + if (!inode || + (inode->i_state & I_DIRTY_TIME) == 0 || + !spin_trylock(&inode->i_lock)) { + iput(inode); + continue; + } + inode->i_state &= ~I_DIRTY_TIME; + inode->i_ts_dirty_day = 0; + spin_unlock(&inode->i_lock); + inode_requeue_dirtytime(inode); + + ei = EXT4_I(inode); + raw_inode = (struct ext4_inode *) buf; + + spin_lock(&ei->i_raw_lock); + EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode); + EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode); + EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode); + ext4_inode_csum_set(inode, raw_inode, ei); + spin_unlock(&ei->i_raw_lock); + trace_ext4_other_inode_update_time(inode, orig_ino); + iput(inode); + } +} Am I right in that this now does unlogged timestamp updates of inodes? What happens when that buffer gets overwritten by log recovery after a crash? The timestamp updates get lost? FYI, XFS has had all sorts of nasty log recovery corner cases caused by log recovery overwriting non-logged inode updates like this. In the past few years we've removed every single non-logged inode update optimisation so that all changes (including timestamps) are transactional, to avoid inode state on disk not matching what log recovery wrote to disk for all the other inode metadata... Optimistic unlogged inode updates are a slippery slope, and history tells me that it doesn't lead to a nice place. Since ext4/jbd2 is logging the whole block, unlike XFS which is doing logical journaling, this isn't an unlogged update. 
It is just taking advantage of the fact that the whole block is going to be logged and written to the disk anyway. If the only update needed for other inodes in the block is the timestamp then they may as well be flushed to disk at the same time and avoid the need for another update later on. Cheers, Andreas
Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale
On Nov 21, 2014, at 1:59 PM, Theodore Ts'o ty...@mit.edu wrote: Guarantee that the on-disk timestamps will be no more than 24 hours stale. Signed-off-by: Theodore Ts'o ty...@mit.edu --- fs/fs-writeback.c | 1 + fs/inode.c | 7 ++- include/linux/fs.h | 1 + 3 files changed, 8 insertions(+), 1 deletion(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index ce7de22..eb04277 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags) if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) { trace_writeback_dirty_inode_start(inode, flags); + inode->i_ts_dirty_day = 0; if (sb->s_op->dirty_inode) sb->s_op->dirty_inode(inode, flags); diff --git a/fs/inode.c b/fs/inode.c index 6e91aca..f0d6232 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode, */ static int update_time(struct inode *inode, struct timespec *time, int flags) { + int days_since_boot = jiffies / (HZ * 86400); int ret; if (inode->i_op->update_time) { @@ -1527,12 +1528,16 @@ static int update_time(struct inode *inode, struct timespec *time, int flags) if (flags & S_MTIME) inode->i_mtime = *time; } - if (inode->i_sb->s_flags & MS_LAZYTIME) { + if ((inode->i_sb->s_flags & MS_LAZYTIME) && + (!inode->i_ts_dirty_day || + inode->i_ts_dirty_day == days_since_boot)) { spin_lock(&inode->i_lock); inode->i_state |= I_DIRTY_TIME; spin_unlock(&inode->i_lock); + inode->i_ts_dirty_day = days_since_boot; It isn't clear if this is correct. It looks like the condition will only be entered if i_ts_dirty_day == days_since_boot, but that is only set inside the condition? 
Shouldn't this be: inode->i_ts_dirty_day = ~0U; if ((inode->i_sb->s_flags & MS_LAZYTIME) && inode->i_ts_dirty_day != days_since_boot) { spin_lock(&inode->i_lock); inode->i_state |= I_DIRTY_TIME; spin_unlock(&inode->i_lock); inode->i_ts_dirty_day = days_since_boot; } and days_since_boot should be declared unsigned short so it wraps in the same way as i_ts_dirty_day, maybe using typeof(i_ts_dirty_day) so that there isn't any need to update this in the future if the type changes. Conceivably this could be an unsigned char, if that packed into struct inode better, at the risk of losing a timestamp update on an inode in cache for 256+ days and only modifying it 256 days later. Cheers, Andreas return 0; } + inode->i_ts_dirty_day = 0; if (inode->i_op->write_time) return inode->i_op->write_time(inode); mark_inode_dirty_sync(inode); diff --git a/include/linux/fs.h b/include/linux/fs.h index 489b2f2..e3574cd 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -575,6 +575,7 @@ struct inode { struct timespec i_ctime; spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ unsigned short i_bytes; + unsigned short i_ts_dirty_day; unsigned int i_blkbits; blkcnt_t i_blocks; -- 2.1.0 Cheers, Andreas
Re: [PATCH 2/6] fiemap: add fe_phys_length and EXTENT_PHYS_LENGTH flag
On Jul 30, 2014, at 11:18 AM, David Sterba dste...@suse.cz wrote: Add a new member to fiemap_extent that represents the physical extent length. This value is undefined if the flag EXTENT_PHYS_LENGTH is not set. The description here of PHYS_LENGTH makes sense... The patch description should also mention the name of the new member, namely fe_phys_length +#define FIEMAP_EXTENT_PHYS_LENGTH 0x0010 /* Physical length of extent + * not the same as logical */ But the comment doesn't match. This implies that if PHYS_LENGTH is set, fe_phys_length != fe_logi_length, but I don't think that is necessarily correct. I think it makes more sense to just set PHYS_LENGTH when fe_phys_length is valid, and if PHYS_LENGTH is not set then fe_phys_length aware applications should just use fe_phys_length = fe_logi_length, and older applications would just use fe_length for both as they would already today. Cheers, Andreas
Re: [PATCH 2/4 v3] fiemap: add EXTENT_DATA_COMPRESSED flag
On Jul 24, 2014, at 1:22 PM, David Sterba dste...@suse.cz wrote: On Thu, Jul 17, 2014 at 12:07:57AM -0600, Andreas Dilger wrote: any progress on this patch series? I'm sorry I got distracted at the end of year and did not finish the series. I never saw an updated version of this patch series after the last round of reviews, but it would be great to move it forward. I have filefrag patches in my e2fsprogs tree waiting for an updated version of your patch. I recall the main changes were: - add FIEMAP_EXTENT_PHYS_LENGTH flag to indicate if fe_phys_length was valid fe_phys_length will always be valid, so the flags are set only if it's not equal to the logical length. - rename fe_length to fe_logi_length and #define fe_length fe_logi_length - always fill in fe_phys_length (= fe_logi_length for uncompressed files) and set FIEMAP_EXTENT_PHYS_LENGTH whether the extent is compressed or not This is my understanding and contradicts the first point. I think Dave Chinner's former point was that having fe_phys_length validity depend on FIEMAP_EXTENT_DATA_COMPRESSED is a non-intuitive interface. It is not true that fe_phys_length would always be valid, since that is not the case for older kernels that currently always set this field to 0, so they need some flag to indicate if fe_phys_length is valid. Alternately, userspace could do: if (ext->fe_phys_length == 0) ext->fe_phys_length = ext->fe_logi_length; but that pre-supposes that fe_phys_length == 0 is never a valid value when fe_logi_length is non-zero, and this might introduce errors in some cases. I could imagine that some compression methods might not allocate any space at all if it was all zeroes, and just store a bit in the blockpointer or extent, so having a separate FIEMAP_EXTENT_PHYS_LENGTH is probably safer in the long run. That opens up the question of whether a written zero filled space that gets compressed away is different from a hole, but I'd prefer to just return whatever the file mapping is than interpret it. 
Cheers, Andreas - add WARN_ONCE() in fiemap_fill_next_extent() as described below I don't know if there was any clear statement about whether there should be separate FIEMAP_EXTENT_PHYS_LENGTH and FIEMAP_EXTENT_DATA_COMPRESSED flags, or if the latter should be implicit? Probably makes sense to have separate flags. It should be fine to use: #define FIEMAP_EXTENT_PHYS_LENGTH 0x0010 since this flag was never used. I've kept only FIEMAP_EXTENT_DATA_COMPRESSED, I don't see a need for FIEMAP_EXTENT_PHYS_LENGTH and this would be yet another flag because the FIEMAP_EXTENT_DATA_ENCODED is also implied. I'll send V4, we can discuss the PHYS_LENGTH flag then. Cheers, Andreas
Re: [PATCH 2/4 v3] fiemap: add EXTENT_DATA_COMPRESSED flag
David, any progress on this patch series? I never saw an updated version of this patch series after the last round of reviews, but it would be great to move it forward. I have filefrag patches in my e2fsprogs tree waiting for an updated version of your patch. I recall the main changes were: - add FIEMAP_EXTENT_PHYS_LENGTH flag to indicate if fe_phys_length was valid - rename fe_length to fe_logi_length and #define fe_length fe_logi_length - always fill in fe_phys_length (= fe_logi_length for uncompressed files) and set FIEMAP_EXTENT_PHYS_LENGTH whether the extent is compressed or not - add WARN_ONCE() in fiemap_fill_next_extent() as described below I don't know if there was any clear statement about whether there should be separate FIEMAP_EXTENT_PHYS_LENGTH and FIEMAP_EXTENT_DATA_COMPRESSED flags, or if the latter should be implicit? Probably makes sense to have separate flags. It should be fine to use: #define FIEMAP_EXTENT_PHYS_LENGTH 0x0010 since this flag was never used. Cheers, Andreas On Dec 12, 2013, at 5:02 PM, Andreas Dilger adil...@dilger.ca wrote: On Dec 12, 2013, at 4:24 PM, Dave Chinner da...@fromorbit.com wrote: On Thu, Dec 12, 2013 at 04:25:59PM +0100, David Sterba wrote: This flag was not accepted when fiemap was proposed [2] due to lack of in-kernel users. Btrfs has compression for a long time and we'd like to see that an extent is compressed in the output of 'filefrag' utility once it's taught about it. For that purpose, a reserved field from fiemap_extent is used to let the filesystem store along the physical extent length when the flag is set. This keeps compatibility with applications that use FIEMAP. I'd prefer to just see the new physical length field always filled out, regardless of whether it is a compressed extent or not. In terms of backwards compatibility to userspace, it makes no difference because the value of reserved/unused fields is undefined by the API. 
Yes, the implementation zeros them, but there's nothing in the documentation that says reserved fields must be zero. Hence I think we should just set it for every extent. I'd actually thought the same thing while reading the patch, but I figured people would object because it implies that old kernels will return a physical length of 0 bytes (which might be valid) and badly-written tools will not work correctly on older kernels. That said, applications _should_ be checking the FIEMAP_EXTENT_DATA_COMPRESSED flag, and I suspect in the future fewer developers will be confused if fe_phys_length == fe_length going forward. If the initial tools get it right (in particular filefrag), then hopefully others will get it correct also. From the point of view of the kernel API (fiemap_fill_next_extent), passing the physical extent size in the len parameter for normal extents, then passing 0 for the physical length makes absolutely no sense. IOWs, what you have created is a distinction between the extent's logical length and it's physical length. For uncompressed extents, they are both equal and they should both be passed to fiemap_fill_next_extent as the same value. Extents where they are different (i.e. encoded extents) is when they can be different. Perhaps fiemap_fill_next_extent() should check and warn about mismatches when they differ and the relevant flags are not set... Seems reasonable to have a WARN_ONCE() in that case. 
That would catch bugs in the filesystem code as well: WARN_ONCE(phys_len != logical_len && !(flags & FIEMAP_EXTENT_DATA_COMPRESSED), "physical len %llu != logical length %llu without DATA_COMPRESSED\n", phys_len, logical_len); diff --git a/include/uapi/linux/fiemap.h b/include/uapi/linux/fiemap.h index 93abfcd..0e32cae 100644 --- a/include/uapi/linux/fiemap.h +++ b/include/uapi/linux/fiemap.h @@ -19,7 +19,9 @@ struct fiemap_extent { __u64 fe_physical; /* physical offset in bytes for the start * of the extent from the beginning of the disk */ __u64 fe_length; /* length in bytes for this extent */ - __u64 fe_reserved64[2]; + __u64 fe_phys_length; /* physical length in bytes, undefined if + * DATA_COMPRESSED not set */ + __u64 fe_reserved64; __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ __u32 fe_reserved[3]; }; The comment for fe_length needs to change, too, because it needs to indicate that it is the logical extent length and that it may be different to the fe_phys_length depending on the flags that are set on the extent. Would it make sense to rename fe_length to fe_logi_length (or something, I'm open to suggestions), and have a compat macro: #define fe_length fe_logi_length around for older applications? That way, new developers would start to use the new name, old applications would still compile for both newer and older interfaces
Re: [PATCH 3/4 v3] btrfs: set FIEMAP_EXTENT_DATA_COMPRESSED for compressed extents
On Dec 12, 2013, at 8:26 AM, David Sterba dste...@suse.cz wrote: Set the EXTENT_DATA_COMPRESSED flag together with EXTENT_ENCODED as defined by fiemap spec. Signed-off-by: David Sterba dste...@suse.cz --- fs/btrfs/extent_io.c | 9 +++-- 1 files changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 5ea0ef5..8a28f15 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4220,9 +4222,12 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, if (ref_cnt > 1) flags |= FIEMAP_EXTENT_SHARED; + em_phys_len = em->block_len; } - if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) + if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { flags |= FIEMAP_EXTENT_ENCODED; + flags |= FIEMAP_EXTENT_DATA_COMPRESSED; (minor nit) Could combine these (not sure it makes a difference to the compiler): flags |= (FIEMAP_EXTENT_ENCODED | FIEMAP_EXTENT_DATA_COMPRESSED); Cheers, Andreas
Re: [PATCH 0/4 v3] fiemap: introduce EXTENT_DATA_COMPRESSED flag
On Dec 12, 2013, at 8:25 AM, David Sterba dste...@suse.cz wrote: The original FIEMAP patch did not define this bit; btrfs will make use of it. The defined constant maintains the same value as originally proposed. Currently, the 'filefrag' utility has no way to recognize and denote a compressed extent. As implemented in btrfs right now, the compression step splits a big extent into smaller chunks and this is reported as a heavily fragmented file. Adding the flag to filefrag will at least give some explanation why; this has been confusing users for some time already. The whole series looks good to me (one minor nit if it needs to be resubmitted for some reason). You can add my: Reviewed-by: Andreas Dilger adil...@dilger.ca V3: Based on feedback from Andreas, implement #1 from V2, current users of fiemap_fill_next_extent (fs/, ext4, gfs2, ocfs2, nilfs2, xfs) updated accordingly, no functional change. V2: Based on feedback from Andreas, the fiemap_extent is now able to hold the physical extent length, to be filled by the filesystem callback. The filesystems do not have access to the structure that is passed back to userspace and are supposed to call fiemap_fill_next_extent, there's no direct way to fill fe_phys_length. 
There are two ways to pass it: 1) extend fiemap_fill_next_extent to take phys_length and update all users (ext4, gfs2, ocfs2, nilfs2, xfs) 2) add new function that takes arguments for all the fiemap_extent items, newly added phys_length compared to fiemap_fill_next_extent David Sterba (4): fiemap: fix comment at EXTENT_DATA_ENCRYPTED fiemap: add EXTENT_DATA_COMPRESSED flag btrfs: set FIEMAP_EXTENT_DATA_COMPRESSED for compressed extents Documentation/fiemap: Document the DATA_COMPRESSED flag Documentation/filesystems/fiemap.txt | 17 + fs/btrfs/extent_io.c | 9 +++-- fs/ext4/extents.c | 3 ++- fs/ext4/inline.c | 2 +- fs/gfs2/inode.c | 2 +- fs/ioctl.c | 18 -- fs/nilfs2/inode.c | 8 +--- fs/ocfs2/extent_map.c | 4 ++-- fs/xfs/xfs_iops.c | 2 +- include/linux/fs.h | 2 +- include/uapi/linux/fiemap.h | 8 ++-- 11 files changed, 51 insertions(+), 24 deletions(-) -- 1.7.9 Cheers, Andreas
Re: [PATCH 2/4 v3] fiemap: add EXTENT_DATA_COMPRESSED flag
On Dec 12, 2013, at 4:24 PM, Dave Chinner da...@fromorbit.com wrote: On Thu, Dec 12, 2013 at 04:25:59PM +0100, David Sterba wrote: This flag was not accepted when fiemap was proposed [2] due to lack of in-kernel users. Btrfs has had compression for a long time, and we'd like to see that an extent is compressed in the output of the 'filefrag' utility once it's taught about it. For that purpose, a reserved field from fiemap_extent is used to let the filesystem store the physical extent length alongside when the flag is set. This keeps compatibility with applications that use FIEMAP. I'd prefer to just see the new physical length field always filled out, regardless of whether it is a compressed extent or not. In terms of backwards compatibility to userspace, it makes no difference, because the value of reserved/unused fields is undefined by the API. Yes, the implementation zeros them, but there's nothing in the documentation that says reserved fields must be zero. Hence I think we should just set it for every extent. I'd actually thought the same thing while reading the patch, but I figured people would object because it implies that old kernels will return a physical length of 0 bytes (which might be valid) and badly-written tools will not work correctly on older kernels. That said, applications _should_ be checking the FIEMAP_EXTENT_DATA_COMPRESSED flag, and I suspect in the future fewer developers will be confused if fe_phys_length == fe_length going forward. If the initial tools get it right (in particular filefrag), then hopefully others will get it correct also. From the point of view of the kernel API (fiemap_fill_next_extent), passing the physical extent size in the len parameter for normal extents, then passing 0 for the physical length makes absolutely no sense. IOWs, what you have created is a distinction between the extent's logical length and its physical length.
For uncompressed extents the two are equal, and they should both be passed to fiemap_fill_next_extent as the same value; it is only for encoded extents that they can differ. Perhaps fiemap_fill_next_extent() should check and warn about mismatches when they differ and the relevant flags are not set... Seems reasonable to have a WARN_ONCE() in that case. That would catch bugs in the filesystem code as well:

WARN_ONCE(phys_len != lgcl_len &&
	  !(flags & FIEMAP_EXTENT_DATA_COMPRESSED),
	  "physical len %llu != logical length %llu without DATA_COMPRESSED\n",
	  phys_len, lgcl_len);

diff --git a/include/uapi/linux/fiemap.h b/include/uapi/linux/fiemap.h
index 93abfcd..0e32cae 100644
--- a/include/uapi/linux/fiemap.h
+++ b/include/uapi/linux/fiemap.h
@@ -19,7 +19,9 @@ struct fiemap_extent {
 	__u64 fe_physical; /* physical offset in bytes for the start
 			    * of the extent from the beginning of the disk */
 	__u64 fe_length;   /* length in bytes for this extent */
-	__u64 fe_reserved64[2];
+	__u64 fe_phys_length; /* physical length in bytes, undefined if
+			       * DATA_COMPRESSED not set */
+	__u64 fe_reserved64;
 	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
 	__u32 fe_reserved[3];
 };

The comment for fe_length needs to change, too, because it needs to indicate that it is the logical extent length and that it may be different to fe_phys_length depending on the flags that are set on the extent. Would it make sense to rename fe_length to fe_logi_length (or something, I'm open to suggestions), and have a compat macro:

#define fe_length fe_logi_length

around for older applications? That way, new developers would start to use the new name, old applications would still compile for both newer and older interfaces, and it doesn't affect the ABI at all.
And, FWIW, I wouldn't mention specific flags in the comment here, but do it at the definition of the flags that indicate there is a difference between physical and logical extent lengths. Actually, I was thinking just the opposite for this field. It seems useful that the requirement for DATA_COMPRESSED being set is beside fe_phys_length, so that anyone using this field sees the correlation clearly. I don't expect everyone would read and understand the meaning of all the flags when looking at the data structure. Cheers, Andreas

@@ -50,6 +52,8 @@ struct fiemap {
 						  * Sets EXTENT_UNKNOWN. */
 #define FIEMAP_EXTENT_ENCODED		0x0008 /* Data can not be read
 						* while fs is unmounted */
+#define FIEMAP_EXTENT_DATA_COMPRESSED	0x0040 /* Data is compressed by fs.
+						* Sets EXTENT_ENCODED */

i.e. here. Cheers, Dave. -- Dave Chinner da...@fromorbit.com Cheers, Andreas
Re: atime and filesystems with snapshots (especially Btrfs)
On 2012-05-25, at 9:59, Alexander Block abloc...@googlemail.com wrote: On Fri, May 25, 2012 at 5:42 PM, Josef Bacik jo...@redhat.com wrote: On Fri, May 25, 2012 at 05:35:37PM +0200, Alexander Block wrote: Hello, (this is a resend with proper CC for linux-fsdevel and linux-kernel) I would like to start a discussion on atime in Btrfs (and other filesystems with snapshot support). As atime is updated on every access of a file or directory, we get many changes to the trees in btrfs that as always trigger cow operations. This is no problem as long as the changed tree blocks are not shared by other subvolumes. Performance is also not a problem, no matter if shared or not (thanks to relatime which is the default). The problems start when someone starts to use snapshots. If you for example snapshot your root and continue working on your root, after some time big parts of the tree will be cowed and unshared. In the worst case, the whole tree gets unshared and thus takes up the double space. Normally, a user would expect to only use extra space for a tree if he changes something. A worst case scenario would be if someone took regular snapshots for backup purposes and later greps the contents of all snapshots to find a specific file. This would touch all inodes in all trees and thus make big parts of the trees unshared. Are you talking about the atime for the primary copy, or the atime for the snapshots? IMHO, the atime should not be updated for a snapshot unless it is explicitly mounted r/w, or it isn't really a good snapshot. relatime (which is the default) reduces this problem a little bit, as it by default only updates atime once a day. This means, if anyone wants to test this problem, mount with relatime disabled or change the system date before you try to update atime (that's the way i tested it). As a solution, I would suggest to make noatime the default for btrfs. 
I'm however not sure if it is allowed in Linux to have different default mount options for different filesystem types. I know this discussion pops up every few years (last time it resulted in making relatime the default). But this is a special case for btrfs. atime is already bad on other filesystems, but it's much, much worse in btrfs. Just mount with -o noatime, there's no chance of turning something like that on by default since it will break some applications (notably mutt). Thanks, Josef I know about the discussions regarding compatibility with existing applications. The problem here is that it is not only a compatibility problem. Having atime enabled by default may give you ENOSPC for reasons that a normal user does not understand or expect. As a normal user, I would think: if I never change something, why does it then take up more space just by reading it? -- To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks
On 2012-04-16, at 9:13 AM, Jan Kara wrote: Another potential contention point might be patch 19. In that patch we make freeze_super() refuse to freeze the filesystem when there are open but unlinked files, which may be impractical in some cases. The main reason for this is the problem with handling of file deletion from fput() called with mmap_sem held (e.g. from munmap(2)), and then there's the fact that we cannot really force such a filesystem into a consistent state... But if people think that freezing with open but unlinked files should happen, then I have some possible solutions in mind (maybe as a separate patchset since this is large enough). Looking at a desktop system, I think it is very typical that there are open-unlinked files present, so I don't know if this is really an acceptable solution. It isn't clear from your comments whether this is a blanket refusal for all open-unlinked files, or only in some particular cases...

lsof | grep deleted
nautilus  25393 adilger  19r  REG 253,0    340  253954 /home/adilger/.local/share/gvfs-metadata/home (deleted)
nautilus  25393 adilger  20r  REG 253,0  32768  253964 /home/adilger/.local/share/gvfs-metadata/home-f332a8f3.log (deleted)
gnome-ter 25623 adilger  22u  REG  0,18  17841 2717846 /tmp/vtePIRJCW (deleted)
gnome-ter 25623 adilger  23u  REG  0,18   5568 2717847 /tmp/vteDCSJCW (deleted)
gnome-ter 25623 adilger  29u  REG  0,18    480 2728484 /tmp/vte6C1TCW (deleted)

Cheers, Andreas
Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks
On 2012-04-16, at 5:43 PM, Dave Chinner wrote: On Mon, Apr 16, 2012 at 03:02:50PM -0700, Andreas Dilger wrote: On 2012-04-16, at 9:13 AM, Jan Kara wrote: Another potential contention point might be patch 19. In that patch we make freeze_super() refuse to freeze the filesystem when there are open but unlinked files which may be impractical in some cases. The main reason for this is the problem with handling of file deletion from fput() called with mmap_sem held (e.g. from munmap(2)), and then there's the fact that we cannot really force such filesystem into a consistent state... But if people think that freezing with open but unlinked files should happen, then I have some possible solutions in mind (maybe as a separate patchset since this is large enough). Looking at a desktop system, I think it is very typical that there are open-unlinked files present, so I don't know if this is really an acceptable solution. It isn't clear from your comments whether this is a blanket refusal for all open-unlinked files, or only in some particular cases... Unlinked-but-open files are the reason that XFS dirties the log after the freeze process is complete. This ensures that if the system crashes while the filesystem is frozen then log recovery during the next mount will process the unlinked (orphaned) inodes and free them correctly. i.e. you can still freeze a filesystem with inodes in this state successfully and have everything behave as you'd expect. I'm not sure how other filesystems handle this problem, but perhaps pushing this check down into filesystem specific code or adding a superblock feature flag might be a way to allow filesystems to handle this case in the way they think is best... The ext3/4 code has long been able to handle open-unlinked files properly after a crash (they are put into a singly-linked list from the superblock on disk that is processed after journal recovery).
The issue here is that blocking freeze from succeeding with open-unlinked files is an unreasonable restriction in this patch, and I don't think it is acceptable to land this patchset (which IMHO would prevent nearly every Gnome system from freezing, unless these apps have changed their behaviour in more recent releases). Like you suggest, filesystems that handle this correctly should be able to flag or otherwise indicate that this is OK, and allow the freeze to continue. For other filesystems that do not handle open-unlinked file consistency during a filesystem freeze/snapshot, the question is whether this should even be considered a new case, or something that has existed for ages already. The other question is whether this is still a problem even for filesystems handling the consistency issue, but from Jan's comment above it seems there is a new locking issue related to mmap_sem being added? Cheers, Andreas -- Andreas Dilger Whamcloud, Inc. Principal Lustre Engineer http://www.whamcloud.com/
Re: getdents - ext4 vs btrfs performance
On 2012-03-09, at 9:48 PM, Ted Ts'o wrote: On Fri, Mar 09, 2012 at 04:09:43PM -0800, Andreas Dilger wrote: Just reading this on the plane, so I can't find the exact reference that I want, but a solution to this problem with htree was discussed a few years ago between myself and Coly Li. The basic idea is that for large directories the inode allocator starts by selecting a range of (relatively) free inodes based on the current directory size, and then piecewise maps the hash value for the filename into this inode range and uses that as the goal inode. I've heard you sketch out this idea before, but it's not been clear to me how well it would work in practice. The potential downside is that it fragments the inode table, such that a single inode table block might not have as many inodes stored in it as we might otherwise would. (Ideally, with 256 byte inodes, there would be 16 of them in each 4k block.) This is something I'd definitely recommend modelling in simulation to see how well it works first. Very true, I was thinking this as I wrote the email. I'd observe that if we knew in advance how many files would be in a particular directory (i.e., via a hint from userspace at mkdir time), then it would be possible to optimize this case much more easily. In fact, if we had perfect knowledge --- if the userspace process could feed us the exact set of filenames that will be used in the directory, plus the exact file sizes for each of the file names --- then it would be possible to allocate inode numbers and starting block numbers such that when the files are read in readdir() order, the inode numbers and block numbers would line up perfectly into sequential writes. Except POSIX doesn't allow anything close to this at all. Something like fallocate() for directories would at least give the kernel a hint about how large the directory is, but fewer userspace tools have this information than the file size (e.g.
tar doesn't really store a directory, just a bunch of pathnames that happen to have largely the same components). In the cases where size IS known in advance, or at least if the directory is going to be large, an allocation policy like mine would essentially predict where a uniformly distributed hash function will allocate the inodes for better reading order. Of course, the trade off is that by optimizing for read order, the write order will tend to get painful as well! So there's a tension here; making the read part of the benchmark faster will make the process of writing the directory hierarchy require more seeks --- and this assuming that you know file names and sizes ahead of time. (Unless, of course, you play the same game that Hans Reiser did when he cooked his tar benchmark by constructing a kernel source tarball where the file order in the tarball was perfectly ordered to guarantee the desired result given Reiser4's hash! :-) No, I definitely don't want to go down that path... Optimization for the sake of benchmarks is broken. I suspect the best optimization for now is probably something like this: 1) Since the vast majority of directories are less than (say) 256k (this would be a tunable value), Yes, this would be reasonable, and I suspect could be a lot larger than 256kB. Probably variable based on the amount of RAM is reasonable. Looking at http://www.pdsi-scidac.org/fsstats/ (which is listed as HPC, but is a bit old and the filesystems aren't much larger than large SATA drives today :-) having a 1MB cache would handle virtually every directory in that survey (99.999%). The unfortunate part is that this will help small directories, where the problem is already largely hidden by readahead and the page cache, but it doesn't help at all for huge directories that do not fit into cache. That is where my approach would do well, but the difficulty is in knowing at create time how large the directory will be. 
for directories which are less than this threshold size, the entire directory is sucked in after the first readdir() after an opendir() or rewinddir(). The directory contents are then sorted by inode number (or loaded into an rbtree ordered by inode number), and returned back to userspace in the inode order via readdir(). The directory contents will be released on a closedir() or rewinddir(). What would the telldir() cookie be in this case? The inode number, or the file descriptor? What happens if the directory size crosses the threshold while the cache is in use? I guess in that case the cache could still be used (according to POSIX) because readdir() doesn't need to report entries added after it started. 2) For files larger than this size, we don't want to hold that much kernel memory during an opendir(), so fall back to ext4's current scheme. 3) If we want to do better than that, we could create a new flag O_NOTELLDIR, which can be set via fcntl. (This way programs can do something like dir = opendir
Re: [PATCH 1/7] BTRFS: Fix lseek return value for error
On 2011-09-17, at 12:10 AM, Jeff Liu jeff@oracle.com wrote: I once posted a similar patch for this issue which can be found at: http://www.spinics.net/lists/linux-btrfs/msg12169.html with an additional improvement: if the offset is larger than or equal to the file size, return -ENXIO directly:

if (offset >= inode->i_size) {
	mutex_unlock(&inode->i_mutex);
	return -ENXIO;
}

Except that is wrong, because it would then be impossible to write sparse files. Thanks, -Jeff On 09/16/2011 11:48 PM, Christoph Hellwig wrote: On Thu, Sep 15, 2011 at 04:06:47PM -0700, Andi Kleen wrote: From: Andi Kleen a...@linux.intel.com Introduced by 9a4327ca1f45f82edad7dc0a4e52ce9316e0950c I think this should go to Chris/Linus ASAP. But a slightly better patch description wouldn't hurt either. Also any reason you capitalize BTRFS? Signed-off-by: Andi Kleen a...@linux.intel.com
---
 fs/btrfs/file.c | 13 +++--
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 3c3abff..7ec0a24 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1818,19 +1818,17 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int origin)
 	case SEEK_DATA:
 	case SEEK_HOLE:
 		ret = find_desired_extent(inode, &offset, origin);
-		if (ret) {
-			mutex_unlock(&inode->i_mutex);
-			return ret;
-		}
+		if (ret)
+			goto error;
 	}
 
 	if (offset < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET)) {
 		ret = -EINVAL;
-		goto out;
+		goto error;
 	}
 	if (offset > inode->i_sb->s_maxbytes) {
 		ret = -EINVAL;
-		goto out;
+		goto error;
 	}
 
 	/* Special lock needed here? */
@@ -1841,6 +1839,9 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int origin)
 out:
 	mutex_unlock(&inode->i_mutex);
 	return offset;
+error:
+	mutex_unlock(&inode->i_mutex);
+	return ret;
 }
 
 const struct file_operations btrfs_file_operations = {
--
1.7.4.4

---end quoted text---
Re: [PATCH] xfstests 255: add a seek_data/seek_hole tester
On 2011-08-25, at 12:40 AM, Dave Chinner wrote: On Thu, Aug 25, 2011 at 02:06:32AM -0400, Christoph Hellwig wrote: On Tue, Jun 28, 2011 at 11:33:19AM -0400, Josef Bacik wrote: This is a test to make sure seek_data/seek_hole is acting like it does on Solaris. It will check to see if the fs supports finding a hole or not and will adjust as necessary. Can you resend this with any updates that happened in the meantime? Dave also still had some comments about semantics, so it might be worth to incorporate that as well. The main questions I had when looking at this was how we should handle unwritten extents - the only answer I got was along the lines of we'll deal with that once filesystems have implemented something. That's a bit of a chicken-and-egg situation, and doesn't help me decide what is the best thing to do. I don't want to have to re-implement this code when it's decided I did the wrong thing initially. Let's first clarify what you mean by an unwritten extent. Do you mean a preallocated extent that returns 0 when read, or do you mean a delayed allocation extent that was written by the application that is still in memory but not yet written to disk? Unfortunately, ZFS has no concept of preallocated extents, so we can't look to it for precedent, but it definitely has delayed allocation. Possibly if UFS on Solaris has SEEK_HOLE and also preallocated extents (I have no idea) it could be tested? The most basic question I really want answered is this: - is preallocated space a HOLE, or is it DATA? Whatever the answer, I think it should be consistently presented by all filesystems that support preallocation, and it should be encoded into the generic SEEK_HOLE/SEEK_DATA tests. My thought would be that a preallocated extent is still a HOLE, because it doesn't contain data that an application actually cares about. On the other hand, a delalloc extent is DATA because it has something that an application cares about.
The original reason SEEK_HOLE/SEEK_DATA were brought up over FIEMAP was cp being able to consistently access delalloc data that was only in the page cache, so if we don't get that right it will have been a pointless exercise. Answering that question then helps answer the more complex questions I had, like: - what does SEEK_DATA return when you have a file layout like hole-prealloc-data? I would think only the data part, since that is what the definition of SEEK_DATA is, IMHO. Answers to that sort of question need to be known so we can write corner-case tests to correctly verify the filesystem implementation. I'd like to do better than present userspace with an interface that behaves vastly differently depending on the underlying filesystem, but if the answer is that the definition and implementation are entirely filesystem specific, then I'll just go make something up. Cheers, Dave. -- Dave Chinner da...@fromorbit.com Cheers, Andreas
Re: [PATCH] xfstests 255: add a seek_data/seek_hole tester
On 2011-06-27, at 12:02 PM, Josef Bacik wrote: This is a test to make sure seek_data/seek_hole is acting like it does on Solaris. It will check to see if the fs supports finding a hole or not and will adjust as necessary.

diff --git a/src/seek-tester.c b/src/seek-tester.c
new file mode 100644
index 000..2b8c957
--- /dev/null
+++ b/src/seek-tester.c
@@ -0,0 +1,473 @@
+/*
+ * Copyright (C) 2011 Oracle.  All rights reserved.
+ * Copyright (C) 2011 Red Hat.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#define _XOPEN_SOURCE 500
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdlib.h>
+
+#define SEEK_DATA	3
+#define SEEK_HOLE	4

These should probably be wrapped in #ifndef SEEK_DATA so that gcc doesn't complain in the future when these are added to a standard header.
+#define FS_NO_HOLES	(1 << 0)
+#define QUIET		(1 << 1)
+
+static blksize_t alloc_size;
+static unsigned flags = 0;
+
+static int get_io_sizes(int fd)
+{
+	struct stat buf;
+	int ret;
+
+	ret = fstat(fd, &buf);
+	if (ret)
+		fprintf(stderr, "ERROR %d: Failed to find io blocksize\n",
+			errno);
+
+	/* st_blksize is typically also the allocation size */
+	alloc_size = buf.st_blksize;
+
+	if (!(flags & QUIET))
+		printf("Allocation size: %ld\n", alloc_size);
+
+	return ret;
+}
+
+#define do_free(x)	do { if(x) free(x); } while(0);
+
+static void *do_malloc(size_t size)
+{
+	void *buf;
+
+	buf = malloc(size);
+	if (!buf)
+		fprintf(stderr, "ERROR: Unable to allocate %ld bytes\n",
+			(long)size);
+
+	return buf;
+}
+
+static int do_truncate(int fd, off_t length)
+{
+	int ret;
+
+	ret = ftruncate(fd, length);
+	if (ret)
+		fprintf(stderr, "ERROR %d: Failed to extend file "
+			"to %ld bytes\n", errno, (long)length);
+	return ret;
+}
+
+static ssize_t do_pwrite(int fd, const void *buf, size_t count, off_t offset)
+{
+	ssize_t ret, written = 0;
+
+	while (count > written) {
+		ret = pwrite(fd, buf + written, count - written, offset + written);
+		if (ret < 0) {
+			fprintf(stderr, "ERROR %d: Failed to write %ld "
+				"bytes\n", errno, (long)count);
+			return ret;
+		}
+		written += ret;
+	}
+
+	return 0;
+}
+
+static int do_lseek(int testnum, int subtest, int fd, int origin, off_t set,
+		    off_t exp)
+{
+	off_t pos;
+	int ret = -1;
+
+	pos = lseek(fd, set, origin);
+
+	if (pos != exp) {
+		fprintf(stderr, "ERROR in Test %d.%d: POS expected %ld, "
+			"got %ld\n", testnum, subtest, (long)exp, (long)pos);
+		goto out;
+	}
+
+	if (pos == -1 && errno != ENXIO) {
+		fprintf(stderr, "ERROR in Test %d.%d: ERRNO expected %d, "
+			"got %d\n", testnum, subtest, ENXIO, errno);
+		goto out;
+	}
+
+	ret = 0;
+
+out:
+	return ret;
+}
+
+static int get_flags(int fd)
+{
+	const char *buf = "ABCDEF";
+	ssize_t written;
+	off_t pos;
+	int ret;
+
+	ret = do_truncate(fd, alloc_size * 2);
+	if (ret)
+		return ret;
+
+	written = do_pwrite(fd, buf, strlen(buf), 0);
+	if (written)
+		return -1;
+
+	pos = lseek(fd, 0, SEEK_HOLE);
+	if (pos == alloc_size * 2) {
+		if (!(flags & QUIET))
+			printf("File system does not recognize holes, the only "
+			       "hole found will be at the end.\n");
+		flags |= FS_NO_HOLES;

This is a question that I've also had about compatibility with older (well, every) Linux kernel that does not support SEEK_{HOLE,DATA} today. My reading of the existing generic_file_llseek() and default_llseek() code, along with most filesystem-specific llseek() implementations, is that they will happily ignore the @whence parameter if it is not known, and pretend like it is 0 (SEEK_SET), so they will just set the position to the @offset parameter and return this
Re: [PATCH 2/2] Btrfs: load the key from the dir item in readdir into a fake dentry
On May 26, 2011, at 14:03, Andi Kleen wrote: On Thu, May 26, 2011 at 03:02:42PM -0400, Josef Bacik wrote:

+	/*
+	 * If this dentry needs lookup, don't set the referenced flag so that it
+	 * is more likely to be cleaned up by the dcache shrinker in case of
+	 * memory pressure.
+	 */
+	if (!d_need_lookup(dentry))
+		dentry->d_flags |= DCACHE_REFERENCED;

No it doesn't at all. The allocation will just push everything else out. Really, you cannot view this by only looking at the dcache. You have to look at the complete VM behaviour. All the caches and the other memory interact. Even without this patch, if you are doing find / there will be considerable memory pressure from all of the pages that are accessed by readdir(), and to a lesser extent the directory dentry/inode entries. I'm not sure whether the issue you raise is going to be significant in the end or not. Taking this development a bit further, I've long thought it would be quite interesting if these dcache entries could be linked into a readdir-ordered linked list, and then the actual directory pages discarded from cache entirely. This would potentially allow later readdir operations to work just by walking the readdir-ordered linked list and not touching the page cache at all. Since normal operations like ls -l currently instantiate both a pagecache for readdir and a dentry for each dirent, it could potentially even reduce memory if there was no need to keep the directory pages in cache anymore. d_alloc uses a normal GFP_KERNEL, which is quite inappropriate for this. It should at least reclaim and probably more, but even then it's risky. Ah yeah, I guess I should have probably used GFP_KERNEL. Sorry about that. GFP_KERNEL is already used, but it's wrong. I'm not sure any of the existing GFP_* flags will give you the semantics you need, in fact. The new flag Minchan added for readahead may come near, but even that is probably not enough. -Andi -- a...@linux.intel.com -- Speaking for myself only.
Cheers, Andreas
Re: [PATCH 1/3] fs: add SEEK_HOLE and SEEK_DATA flags V4
On May 23, 2011, at 15:43, Josef Bacik wrote: This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out using fiemap in things like cp causes more problems than it solves, so let's try and give userspace an interface that doesn't suck. We need to match Solaris here, and the definitions are:

diff --git a/fs/read_write.c b/fs/read_write.c
index 5520f8a..9c3b453 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -64,6 +64,23 @@ generic_file_llseek_unlocked(struct file *file, loff_t offset, int origin)
 			return file->f_pos;
 		offset += file->f_pos;
 		break;
+	case SEEK_DATA:
+		/*
+		 * In the generic case the entire file is data, so as long as
+		 * offset isn't at the end of the file then the offset is data.
+		 */
+		if (offset >= inode->i_size)
+			return -ENXIO;
+		break;
+	case SEEK_HOLE:
+		/*
+		 * There is a virtual hole at the end of the file, so as long as
+		 * offset isn't i_size or larger, return i_size.
+		 */
+		if (offset >= inode->i_size)
+			return -ENXIO;
+		offset = inode->i_size;
+		break;
 	}

What about all of the existing filesystems that currently just ignore values of origin that they don't understand? Looking through those, it appears that most of them will return offset for unknown values of origin, which I guess is OK for SEEK_DATA, but is confusing for SEEK_HOLE. Some filesystems will return -EINVAL for values of origin that are unknown. Most of the filesystem-specific ->llseek() methods don't do any error checking on origin, because this is handled at the sys_llseek() level, and hasn't changed in many years. I assume this patch is also dependent upon the remove default_llseek() patch, so that the implementation of SEEK_DATA and SEEK_HOLE can be done only in generic_file_llseek()? Finally, while looking through the various ->llseek() methods I notice that many filesystems return i_size for SEEK_END, which clearly does not make sense for filesystems like ext3/ext4 htree, btrfs, etc. that use hash keys instead of byte offsets for doing directory traversal.
The comment at generic_file_llseek() is that it is intended for use by regular files. Should the ext4_llseek() code be changed to return 0x7 for the SEEK_END value? That makes more sense compared to the values returned for SEEK_CUR, so that an application can compare the current offset with the final value for a progress bar.

Another interesting use is for N threads to process a large directory in parallel, by using max_off = llseek(dirfd, 0, SEEK_END) and then having each thread call llseek(dirfd, thread_nr * max_off / N, SEEK_SET) to process 1/N of the directory.

Cheers, Andreas
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] fs: add a DCACHE_NEED_LOOKUP flag for d_flags
On 2011-05-20, at 2:07 PM, Andi Kleen <a...@firstfloor.org> wrote:
> Josef Bacik <jo...@redhat.com> writes:
>> Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order for readdir, but unfortunately in the case of anything that stats the files in order that readdir spits back (like oh say ls) that means we still have to do the normal lookup of the file, which means looking up our other index and then looking up the inode. What I want is a way to create dummy dentries when we find them in readdir so that when ls or anything else subsequently does a stat(), we already have the location information in the dentry and can go
>
> Funny, I remember discussing this optimization a long time ago with Andrea.
>
> But the problem is still that it has the potential to pollute the dcache a lot when someone is reading a large directory. Consider the "find /" case. You don't want that to turn over all of your dcache.
>
> So if you do that you need some way to put the dummy dentries on a special LRU list that gets cleaned quickly and is kept small.

Actually, I recall years ago looking into something similar. It should be possible to drop such dentries even under GFP_NOFS, because they can't be dirty and don't need any inode operations to free. Putting them at the end of the cache LRU instead of the head would allow them to be dropped quickly under memory pressure.

Cheers, Andreas
Re: [PATCH 1/2] fs: add a DCACHE_NEED_LOOKUP flag for d_flags
On May 19, 2011, at 11:58, Josef Bacik wrote:
> Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order for readdir, but unfortunately in the case of anything that stats the files in order that readdir spits back (like oh say ls) that means we still have to do the normal lookup of the file, which means looking up our other index and then looking up the inode. What I want is a way to create dummy dentries when we find them in readdir so that when ls or anything else subsequently does a stat(), we already have the location information in the dentry and can go straight to the inode itself. The lookup stuff just assumes that if it finds a dentry it is done, it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP flag so that the lookup code knows it still needs to run i_op->lookup() on the parent to get the inode for the dentry. I have tested this with btrfs and I went from something that looks like this
>
> http://people.redhat.com/jwhiter/ls-noreada.png
>
> To this
>
> http://people.redhat.com/jwhiter/ls-good.png
>
> That's a savings of 1300 seconds, or 22 minutes. That is a significant savings. Thanks,

This comment should probably mention the number of files being tested, in order to make a 1300s savings meaningful. Similarly, it would be better to provide the absolute times of the tests, in case these URLs disappear in the future. That reduces the time to do "ls -l" on a 1M file directory from 2181s to 855s.

> Signed-off-by: Josef Bacik <jo...@redhat.com>
> ---
> fs/namei.c             |   48
> include/linux/dcache.h |    1 +
> 2 files changed, 49 insertions(+), 0 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index e3c4f11..a1bff4f 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1198,6 +1198,29 @@ static struct dentry *d_alloc_and_lookup(struct dentry *parent,
>  }
>
>  /*
> + * We already have a dentry, but require a lookup to be performed on the parent
> + * directory to fill in d_inode. Returns the new dentry, or ERR_PTR on error.
> + * parent->d_inode->i_mutex must be held. d_lookup must have verified that no
> + * child exists while under i_mutex.
> + */
> +static struct dentry *d_inode_lookup(struct dentry *parent, struct dentry *dentry,
> +				     struct nameidata *nd)
> +{
> +	struct inode *inode = parent->d_inode;
> +	struct dentry *old;
> +
> +	/* Don't create child dentry for a dead directory. */
> +	if (unlikely(IS_DEADDIR(inode)))
> +		return ERR_PTR(-ENOENT);
> +
> +	old = inode->i_op->lookup(inode, dentry, nd);
> +	if (unlikely(old)) {
> +		dput(dentry);
> +		dentry = old;
> +	}
> +	return dentry;
> +}
> +
>  /*
>   * It's more convoluted than I'd like it to be, but... it's still fairly
>   * small and for now I'd prefer to have fast path as straight as possible.
>   * It _is_ time-critical.
> @@ -1236,6 +1259,13 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
>  				goto unlazy;
>  			}
>  		}
> +		if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
> +			if (nameidata_dentry_drop_rcu(nd, dentry))
> +				return -ECHILD;
> +			dput(dentry);
> +			dentry = NULL;
> +			goto retry;
> +		}
>  		path->mnt = mnt;
>  		path->dentry = dentry;
>  		if (likely(__follow_mount_rcu(nd, path, inode, false)))
> @@ -1250,6 +1280,12 @@ unlazy:
>  		}
>  	} else {
>  		dentry = __d_lookup(parent, name);
> +		if (unlikely(!dentry))
> +			goto retry;
> +		if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
> +			dput(dentry);
> +			dentry = NULL;
> +		}
>  	}
>
>  retry:
> @@ -1268,6 +1304,18 @@ retry:
>  		/* known good */
>  		need_reval = 0;
>  		status = 1;
> +	} else if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
> +		struct dentry *old;
> +
> +		dentry->d_flags &= ~DCACHE_NEED_LOOKUP;
> +		dentry = d_inode_lookup(parent, dentry, nd);

Would it make sense to keep DCACHE_NEED_LOOKUP set in d_flags until _after_ the call to d_inode_lookup()? That way the filesystem can positively know it is doing the inode lookup from d_fsdata, instead of just inferring it from the presence of d_fsdata. It is already the filesystem that is setting DCACHE_NEED_LOOKUP, so it should really be the one clearing this flag also.

I'm concerned that there may be filesystems that need d_fsdata for something already, so the presence/absence of d_fsdata is not a clear indication to the underlying filesystem of whether to do an inode lookup based on d_fsdata, which might mean that it needs to keep yet another flag for this purpose.

> + if
Re: Is it possible for the ext4/btrfs file system to pass some context related info to low level block driver?
On May 5, 2011, at 12:10, Matthew Wilcox wrote:
> On Wed, May 04, 2011 at 08:51:39AM -0600, Andreas Dilger wrote:
>> I was aware of REQ_META, but I didn't know there was any benefit to using it. I think it would be easy to set REQ_META on all ext4 metadata if there was a reason to do so.
>
> The CFQ io scheduler pays attention to it (prioritising metadata accesses over data accesses), and blktrace will print an 'M' for metadata requests if it's set, so I think that's two excellent reasons to set REQ_META today. However, ext3, ext4, and XFS already use it:
>
> fs/ext4/inode.c:1500:	ll_rw_block(READ_META, 1, &bh);	<- ext4_bread()
> fs/ext4/inode.c:4775:	submit_bh(READ_META, bh);	<- __ext4_get_inode_loc()
> fs/ext4/namei.c:924:	ll_rw_block(READ_META, 1, &bh);	<- ext4_find_entry()

Looking more closely at the code, it seems that this is handling only a subset of the code. There are many places in ext4 that are using sb_bread() instead of ext4_bread(), in particular in the extents and migration code that was developed more recently, but also in the xattr code, which has been around a long time.

Cheers, Andreas
Re: Is it possible for the ext4/btrfs file system to pass some context related info to low level block driver?
On 2011-05-04, at 5:45 AM, Gao, Yunpeng <yunpeng@intel.com> wrote:
>> Yes, I have been working on some changes that allow us to tag bios and pass the information out to storage. These patches have been on the back burner for a while due to other commitments. But I'll dig them out and post them later. We just discussed them a couple of weeks ago at the Linux Storage Workshop.
>
> That's great! Thanks for the update, and I look forward to your patches.
>
>> In the meantime: Can you point me to the relevant eMMC stuff so I can see how many tiers or classes we have to work with there?
>
> I'm investigating adding some eMMC 4.5 features support in the current Linux mmc driver (drivers/mmc). The Linux mmc driver registers the eMMC device as a normal Linux block device, so it can get all the Linux block layer bio flags. On to the eMMC 4.5 new features:
>
> Data Tag: 'The mechanism permits the device to receive from the host information about specific data types (for instance file system metadata, time-stamps, configuration parameters, etc.). The information is conveyed before a write multiple blocks operation at well defined addresses. By receiving this information the device can improve the access rate during the following read and update operations and offer a more reliable and robust storage.'
>
> I guess the existing block layer flag REQ_META can be used to notify the low level block driver/device to execute the Data Tag feature. But I don't know why currently most Linux file systems don't use the REQ_META flag at all (seems only gfs2 uses it now).

I was aware of REQ_META, but I didn't know there was any benefit to using it. I think it would be easy to set REQ_META on all ext4 metadata if there was a reason to do so.

> Context Management: 'To better differentiate between large sequential operations and small random operations, and to improve multitasking support, contexts can be associated with groups of read or write commands ... A context can be seen as an active session, configured for a specific read/write pattern (e.g. sequential in some granularity). Multiple read or write commands are associated with this context to create some logical association between them, to allow the device to optimize performance.'
>
> To my understanding, supporting this feature needs the file system (or application?) to notify the low level driver that the following data accesses will be large sequential read/write operations. And I'm not sure it is possible for the file system to pass this kind of information to the low level driver? Any idea?

A simple way to do this would be to use the inode number as the context, so writes to the same inode are grouped together. Another possibility is the PID, though that is less clearly correct (e.g. if tar is extracting a bunch of files, are they all related or not).

Cheers, Andreas
Re: [PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags
On 2011-04-22, at 11:08 AM, Sunil Mushran <sunil.mush...@oracle.com> wrote:
> On 04/22/2011 10:03 AM, Eric Blake wrote:
>>> cp can read whatever blocksize it chooses. If that block contains zero, it would signal cp that maybe it should SEEK_DATA and skip reading all those blocks. That's all. We are not trying to achieve perfection. We are just trying to reduce cpu waste. If the fs supports SEEK_*, then great. If it does not, then it is no worse than before.
>>
>> But providing just SEEK_DATA _is_ worse than before if you don't provide the correct SEEK_HOLE everywhere. Because then your algorithm of trying lseek(SEEK_DATA) after every run of zeros in the hopes of an optimization is a wasted syscall, since it will just return your current offset every time, so you end up with more syscalls than if you had used the single lseek(SEEK_DATA) that returns the end of the file up front, and known that the remainder of the file has no holes to even try seeking past.
>
> You are over-optimizing. strace any process on your box and you will find numerous wasted syscalls.

Sure, there are lots of wasted syscalls, but in this case doing extra SEEK_DATA and SEEK_HOLE operations may be fairly costly. This involves scanning the whole disk layout, possibly over a network, which might need tens or hundreds of disk seeks to read the metadata, unlike a regular SEEK_SET. Even SEEK_END is somewhat costly on a network filesystem, since the syscall needs to lock the file size in order to determine the end of the file, which can block other threads from writing to the file.

So while I agree that the Linux mantra "syscalls are cheap" is true if those syscalls don't actually do any work, I think it also ignores the cost of what some syscalls need to do behind the scenes.

Cheers, Andreas
Re: [PATCH 1/2] Btrfs: add datacow flag in inode flag
On 2011-03-15, at 2:57 PM, Christoph Hellwig wrote:
> On Tue, Mar 15, 2011 at 04:26:50PM -0400, Chris Mason wrote:
>> #define FS_EXTENT_FL    0x0008 /* Extents */
>> #define FS_DIRECTIO_FL  0x0010 /* Use direct i/o */
>> +#define FS_NOCOW_FL    0x0080 /* Do not cow file */
>> +#define FS_COW_FL      0x0100 /* Cow file */
>> #define FS_RESERVED_FL  0x8000 /* reserved for ext2 lib */
>>
>> I'm fine with it. I'll defer the check for conflicts with extN-specific flags to Ted, though.

Looking at the upstream e2fsprogs I see in that range:

#define EXT4_EXTENTS_FL           0x0008 /* Inode uses extents */
#define EXT4_EA_INODE_FL          0x0020 /* Inode used for large EA */
#define EXT4_EOFBLOCKS_FL         0x0040 /* Blocks allocated beyond EOF */
#define EXT4_SNAPFILE_FL          0x0100 /* Inode is a snapshot */
#define EXT4_SNAPFILE_DELETED_FL  0x0400 /* Snapshot is being deleted */
#define EXT4_SNAPFILE_SHRUNK_FL   0x0800 /* Snapshot shrink has completed */
#define EXT2_RESERVED_FL          0x8000 /* reserved for ext2 lib */
#define EXT2_FL_USER_VISIBLE      0x004BDFFF /* User visible flags */

so there is a conflict between FS_COW_FL and EXT4_SNAPFILE_FL. I don't know the semantics of those two flags well enough to say for sure whether it is reasonable for them to alias each other, but at first glance COW and SNAPSHOT don't seem completely unrelated.

It also isn't clear to me whether SNAPFILE_DELETED_FL and SNAPFILE_SHRUNK_FL really need to be persistent flags that are stored on disk, or if they are state flags that should only be stored in the in-memory inode. Not having them in the limited remaining inode flag space would be better... Amir?

Cheers, Andreas
Re: [RFC PATCH] Btrfs: add ioctl to set compress or cow per file/dir
On 2011-02-24, at 2:40 AM, liubo wrote:
> #define FS_DIRECTIO_FL 0x0010 /* Use direct i/o */
> +#define FS_NOCOW_FL   0x0020 /* Do not cow file */
> +#define FS_COW_FL     0x0010 /* Cow file */
> #define FS_RESERVED_FL 0x8000 /* reserved for ext2 lib */

I'm assuming that FS_COW_FL should not be the same as FS_DIRECTIO_FL?

Cheers, Andreas
Re: Bug in ext4/btrfs fallocate?
On 2010-12-28, at 09:06, Marco Stornelli wrote:
> it seems that ext4/btrfs code for fallocate doesn't check for immutable/append inode flag.

fallocate() probably shouldn't be allowed for immutable files, but it makes a lot of sense to call fallocate() on append-only files to avoid fragmentation, though it should only be called with the KEEP_SIZE flag.

Cheers, Andreas
Re: What to do about subvolumes?
On 2010-12-07, at 10:02, Trond Myklebust wrote:
> On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
>> It's just as stable as a real dev_t in the times of hotplug and udev. As long as you don't touch anything, including not upgrading the kernel, it will remain stable; otherwise it will break. That's why modern nfs-utils default to using the uuid-based filehandle schemes instead of the dev_t based ones. At least that's what I was told - I really hope it's using the real UUIDs from the filesystem and not the horrible fsid hack that was once added - for some filesystems like XFS that field does not actually have any relation to the UUID historically. And while we could have changed that, it's too late now that nfs was hacked into abusing that field.
>
> IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but they won't fit into the NFSv2 32-byte filehandles, so there is an '8-byte fsid' and '4-byte fsid + inode number' workaround for that... See the mk_fsid() helper in fs/nfsd/nfsfh.h

It looks like mk_fsid() is only actually using the UUID if it is specified in the /etc/exports file (AFAICS, this depends on ex_uuid being set from a uuid=... option). There was a patch in the open_by_handle() patch series that added an s_uuid field to the superblock, which could be used if no uuid= option is specified in the /etc/exports file.

Cheers, Andreas
Re: What to do about subvolumes?
On 2010-12-06, at 09:48, J. Bruce Fields wrote:
> On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
>> Any chance we can add a ->get_fsid(sb, inode) method to export_operations (or something similar), that allows the filesystem to generate an FSID based on the volume and inode that is being exported?
>
> No objection from here. (Though I don't understand the inode argument--aren't subvolumes usually expected to have separate superblocks?)

I thought that if two directories from the same filesystem are both being exported at the same time, they would need to have different FSID values, hence the inode parameter to allow generating an FSID that is a function of both the filesystem (sb) and the directory being exported (inode).

Cheers, Andreas
Re: What to do about subvolumes?
On 2010-12-08, at 16:07, Neil Brown wrote:
> On Mon, 6 Dec 2010 11:48:45 -0500 J. Bruce Fields <bfie...@redhat.com> wrote:
>> On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
>>> Any chance we can add a ->get_fsid(sb, inode) method to export_operations (or something similar), that allows the filesystem to generate an FSID based on the volume and inode that is being exported?
>>
>> No objection from here.
>
> My standard objection here is that you cannot guarantee that the fsid is 100% guaranteed to be unique across all filesystems in the system (including filesystems mounted from dm snapshots of filesystems that are currently mounted). NFSd needs this uniqueness.

Sure, but you also cannot guarantee that the devno is constant across reboots, yet NFS continues to use this much-less-constant value...

> This is only really an objection if user-space cannot over-ride the fsid provided by the filesystem.

Agreed. It definitely makes sense to allow this, for whatever strange circumstances might arise. However, defaulting to using the filesystem UUID definitely makes the most sense, and looking at the nfs-utils mountd code, it seems that this is already standard behaviour for local block devices (excluding btrfs filesystems).

> I'd be very happy to see an interface to user-space whereby user-space can get a reasonably unique fsid for a given filesystem.

Hmm, maybe I'm missing something, but why does userspace need to be able to get this value? I would think that nfsd gets it from the filesystem directly in the kernel, but if a uuid= option is present in the exports file, that is preferentially used over the value from the filesystem.

That said, I think Aneesh's open_by_handle patchset also made the UUID visible in /proc/pid/mountinfo, after the filesystems stored it in sb->s_uuid at mount time. That _should_ make it visible for non-block mountpoints as well, assuming they fill in s_uuid.

> Whether this is an export_operations method or some field in the 'struct super' which gets copied out doesn't matter to me.

Since Aneesh has already developed patches, is there any objection to using those (last sent to linux-fsdevel on 2010-10-29)?

[PATCH -V22 12/14] vfs: Export file system uuid via /proc/pid/mountinfo
[PATCH -V22 13/14] ext3: Copy fs UUID to superblock.
[PATCH -V22 14/14] ext4: Copy fs UUID to superblock

Cheers, Andreas
Re: What to do about subvolumes?
On 2010-12-03, at 15:45, J. Bruce Fields wrote:
> We're using statfs64.fs_fsid for this; I believe that's both stable across reboots and distinguishes between subvolumes, so that's OK. (That said, since fs_fsid doesn't work for other filesystems, we depend on an explicit check for a filesystem type of btrfs, which is awful--btrfs won't always be the only filesystem that wants to do this kind of thing, etc.)

Sigh, I have wanted to be able to specify the NFS FSID directly from within the kernel for Lustre for many years already. Glad to see that this is moving forward.

Any chance we can add a ->get_fsid(sb, inode) method to export_operations (or something similar), that allows the filesystem to generate an FSID based on the volume and inode that is being exported?

Cheers, Andreas
Re: [PATCH 1/6] fs: add hole punching to fallocate
On 2010-11-16, at 20:11, Dave Chinner wrote:
> On Tue, Nov 16, 2010 at 06:22:47PM -0600, Andreas Dilger wrote:
>> IMHO, it makes more sense for consistency, and to get what users expect, that these be treated as flags. Some users will want KEEP_SIZE, but in other cases it may make sense that a hole punch at the end of a file should shrink the file (i.e. the opposite of an append).
>
> What's wrong with ftruncate() for this?

It makes the API usage from applications more consistent. It would be inconvenient, for example, if applications had to use a different system call if they were writing in the middle of the file vs. at the end, wouldn't it?

Similarly, if multiple threads are appending vs. punching (let's assume non-overlapping regions, for sanity, like a producer/consumer model punching out completed records), then using ftruncate() to remove the last record and shrink the file would require locking the whole file from userspace (unlike the append, which does this in the kernel), or risk discarding unprocessed data beyond the record that was punched out.

> There's plenty of open questions about the interface if we allow hole punching to change the file size. e.g. where do we set the EOF (offset or offset+len)?

I would think it natural that the new size is the start of the region, like an anti-write (where a write sets the size at the end of the added bytes).

> What do we do with the rest of the blocks that are now beyond EOF? We weren't asked to punch them out, so do we leave them behind?

I definitely think they should be left as-is. If they were in the punched-out range, they would be deallocated, and if they are beyond EOF they will remain as they are - we didn't ask to remove them unless the punched-out range went to ~0ULL (which would make it equivalent to an ftruncate()).

> What if we are leaving written blocks beyond EOF - does any filesystem other than XFS support that (i.e. are we introducing different behaviour on different filesystems)?

I'm not sure I understand what a written block beyond EOF means. How can there be data beyond EOF?

I think the KEEP_SIZE flag is only relevant if the punch spans EOF, like the opposite of a write that spans EOF. If KEEP_SIZE is set, then it leaves the size unchanged; if it is unset and the punch spans EOF, it reduces the file size. If the punch is not at EOF it doesn't change the file size, just like a write that is not at EOF.

> And what happens if the offset is beyond EOF? Do we extend the file, and if so why wouldn't you just use ftruncate() instead?

Even if the effects were the same, it makes sense because applications may be using fallocate(PUNCH_HOLE) to punch out records, and having them special-case the use of ftruncate() to get certain semantics at the end of the file adds needless complexity.

Cheers, Andreas
Re: [PATCH 1/6] fs: add hole punching to fallocate
On 2010-11-16, at 20:34, Josef Bacik wrote:
> FWIW I agree with Dave, the only question at this point is do we force users to specify KEEP_SIZE with PUNCH_HOLE? On one hand it makes the interface a bit more consistent, on the other hand it makes the documentation a little weird. We have "mode" here, but if you want to use PUNCH_HOLE you also have to specify KEEP_SIZE, so really it's like a flags field, it's just named poorly.

Even if this is the case, and we decide today that PUNCH_HOLE without KEEP_SIZE is not desirable to implement, it would be better to just return -EOPNOTSUPP if both flags are not set than to assume one or the other is what the user wanted. That allows the ability to implement this in the future without breaking every application, while if it is assumed that KEEP_SIZE is always implicit there will never be a way to add that functionality without something awful like a separate CHANGE_SIZE flag for PUNCH_HOLE.

One option is to define FALLOC_FL_PUNCH_HOLE as 0x3 (so that KEEP_SIZE is always passed) and in the future we can define some new flag name like TRUNCATE_HOLE (or whatever) that is 0x2 only.

Cheers, Andreas
Re: [PATCH 1/6] fs: add hole punching to fallocate
On 2010-11-16, at 07:14, Jan Kara wrote:
>> Yeah I went back and forth on this. KEEP_SIZE won't change the behavior of PUNCH_HOLE since PUNCH_HOLE implicitly means keep the size. I figured since its mode and not flags it would be ok to make either way accepted, but if you prefer PUNCH_HOLE means you have to have KEEP_SIZE set then I'm cool with that, just let me know one way or the other.
>
> So we call it "mode" but speak about flags? Seems a bit inconsistent. I'd maybe lean a bit to the flags side and just make sure that only one of FALLOC_FL_KEEP_SIZE, FALLOC_FL_PUNCH_HOLE is set (interpreting FALLOC_FL_KEEP_SIZE as "allocate blocks beyond i_size"). But I'm not sure what others think.

IMHO, it makes more sense for consistency, and to get what users expect, that these be treated as flags. Some users will want KEEP_SIZE, but in other cases it may make sense that a hole punch at the end of a file should shrink the file (i.e. the opposite of an append).

Cheers, Andreas
Re: [PATCH 2/3] lib: improve the performance of memcpy and memmove of the general version
On 2010-09-03, at 05:03, Florian Weimer wrote:
> * Miao Xie:
>> EXPORT_SYMBOL(memcpy);
>
> I think you need to change that to EXPORT_SYMBOL_GPL, because the code is now licensed under the GPL, and not the GPL plus kernel exceptions (whatever they are, but they undoubtedly exist), unlike the original implementation.

Ouch. That would basically make it impossible to implement a non-GPL module. Also, this should only apply to the lib version of the memcpy() routine, not the arch-specific ones that are written in assembly (maybe that is already true, I'm not sure).

Lustre is GPL, so it isn't fatal for us, but it definitely seems draconian, and almost a reason not to include this patch into the kernel. Also, given that the original code is LGPL (which allows linking to non-GPL code), this seems counter to the intent of the original authors.

I suspect there isn't anything so brilliant in the glibc memcpy() that it couldn't be re-implemented without copying the code. "Implement memcpy with aligned machine-word-sized chunks" would be enough for anyone to reimplement it without ever having come close to the glibc code.

Cheers, Andreas
Re: [PATCH V3 0/8] Cleancache: overview
On 2010-08-03, at 11:35, Dan Magenheimer wrote:
> - The FS should be block-device-based (e.g. a ram-based FS such as tmpfs should not enable cleancache)

When you say block device based, does this exclude network filesystems? It would seem cleancache, like fscache, is actually best suited to high-latency network filesystems.

> - To ensure coherency/correctness, inode numbers must be unique (e.g. no emulating 64-bit inode space on 32-bit inode numbers)

Does it need to be restricted to inode numbers at all (i.e. can it use an opaque internal identifier like the NFS file handle)? Disallowing cleancache on a filesystem that uses 64-bit (or larger) inodes on a 32-bit system reduces its usefulness.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
On 2010-06-02, at 20:46, Nitin Gupta wrote:
> On 06/03/2010 04:32 AM, Dan Magenheimer wrote:
>>> From: Minchan Kim [mailto:minchan@gmail.com]
>>>> I am also eagerly awaiting Nitin Gupta's cleancache backend and implementation to do in-kernel page cache compression.
>>>
>>> Did Nitin say he will make a backend of cleancache for page cache compression? It would be a good feature. I have an interest, too. :)
>>
>> That was Nitin's plan for his GSOC project when we last discussed this. Nitin is on the cc list and can comment if this has changed.
>
> Yes, I have just started work on an in-kernel page cache compression backend for cleancache :)

Is there a design doc for this implementation? I was thinking it would be quite clever to do compression in, say, 64kB or 128kB chunks in a mapping (to get decent compression) and then write these compressed chunks directly from the page cache to disk in btrfs and/or a revived compressed ext4. That would mean that the on-disk compression algorithm needs to match the in-memory algorithm, which implies that the in-memory compression algorithm should be selectable on a per-mapping basis.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Re: Cleancache [PATCH 6/7] (was Transcendent Memory): ext4 hook
On 2010-04-22, at 07:29, Dan Magenheimer wrote:
> Cleancache [PATCH 6/7] (was Transcendent Memory): ext4 hook
>
> Filesystems must explicitly enable cleancache. For ext4, all other cleancache hooks are in the VFS layer.
>
> Signed-off-by: Dan Magenheimer <dan.magenhei...@oracle.com>

Given the minimal changes being done to ext3/ext4, I don't have any objection to this. Being able to hook ext4 into SSDs for hierarchical caching is something that will become increasingly important for huge ext4 filesystems.

Acked-by: Andreas Dilger <adil...@sun.com>

> Diffstat:
>  super.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> --- linux-2.6.34-rc5/fs/ext4/super.c	2010-04-19 17:29:56.0 -0600
> +++ linux-2.6.34-rc5-cleancache/fs/ext4/super.c	2010-04-21 10:13:00.0 -0600
> @@ -39,6 +39,7 @@
>  #include <linux/ctype.h>
>  #include <linux/log2.h>
>  #include <linux/crc16.h>
> +#include <linux/cleancache.h>
>  #include <asm/uaccess.h>
>  #include "ext4.h"
> @@ -1784,6 +1785,7 @@ static int ext4_setup_super(struct super
>  			EXT4_INODES_PER_GROUP(sb),
>  			sbi->s_mount_opt);
> +	sb->cleancache_poolid = cleancache_init_fs(PAGE_SIZE);
>  	return res;
>  }

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Re: source line numbers with x86_64 modules? [Was: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact]
On Jan 10, 2009 16:15 -0500, Theodore Ts'o wrote:
> In my experience, there are very few kernel versions and hardware for
> which kdump works.  I've talked to the people who have to make kdump
> work, and every 12-18 months, when a new set of enterprise kernels
> comes out, they have to go and fix kdump so it works again for the set
> of hardware that they care about, and for the kernel version involved.

I'm sad that netconsole/netdump never made it big.  It was fairly
useful, and extending the ethernet drivers to add the polling mode was
trivial to do.  We were using that for a few years, but it got replaced
by kdump, which appears to be less usable IMHO.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Re: Notes on support for multiple devices for a single filesystem
> ). When filesystem X is mounted (using any one device from the
> filesystem passed to mount) it can quickly scan this list to find all
> devices for that filesystem's UUID to assemble all of the required
> devices.

Since the filesystem would internally have to verify the list of devices
that would (somehow) be passed to the kernel during mount, keeping the
list in userspace doesn't really gain much.

> Disadvantages: requires a mount.fstype helper or fs-specific knowledge
>    in mount(8); requires libvolume_id / libblkid to keep state.
>
> 2) whenever libvolume_id / libblkid find a device detected as a
>    multi-device capable filesystem they call into the kernel through
>    an ioctl / sysfs / etc to add it to a list in kernel space.  The
>    kernel code then manages to select the right filesystem instances,
>    and either adds it to an already running instance, or prepares a
>    list for a future mount.

While not performance critical, calling into userspace just to call back
down into the kernel seems more complex than it needs to be (i.e. prone
to failure).  I'm not stuck on doing this in the kernel, just thinking
that the more components involved, the more likely it is that something
goes wrong.  I have enough grief with just initrd not having the right
kernel module; I'd hate to depend on tools to do the device assembly for
the root fs.

> Advantages: keeps state in the kernel instead of in libraries.
> Disadvantages: requires an additional user interface to mount.
>
> Both designs will have to deal with an option that allows adding named
> block devices on the mount command line, which is already required for
> design 1) but would have to be implemented additionally for design 2).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
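The userspace-list approach (design 1) boils down to: group whatever the scanner reports by filesystem UUID, and refuse to mount until every member device of a multi-device filesystem is present. A minimal sketch of that bookkeeping follows; the device names, UUIDs, and the idea that each superblock records how many devices the filesystem spans are all illustrative assumptions, not libblkid output.

```python
# Sketch of design 1's assembly step: scan results grouped by fs UUID,
# with a completeness check before handing the list to mount.

from collections import defaultdict

# Pretend scan results: (device, fs UUID, total devices in this fs).
# All values here are made up for illustration.
scanned = [
    ("/dev/sdb", "1b6e-a23f", 3),
    ("/dev/sdc", "1b6e-a23f", 3),
    ("/dev/sdd", "1b6e-a23f", 3),
    ("/dev/sde", "77aa-0c1d", 2),   # second member of this fs is missing
]

by_uuid = defaultdict(list)
for dev, uuid, ndev in scanned:
    by_uuid[uuid].append((dev, ndev))

def assemble(uuid):
    # Return the full device list for this filesystem, or None if any
    # member is missing (an incomplete fs must not be mounted).
    members = by_uuid.get(uuid, [])
    if not members:
        return None
    expected = members[0][1]
    if len(members) != expected:
        return None
    return [dev for dev, _ in members]

assert assemble("1b6e-a23f") == ["/dev/sdb", "/dev/sdc", "/dev/sdd"]
assert assemble("77aa-0c1d") is None
```

This also shows why the kernel has to re-verify the list anyway: userspace can only promise that the devices it saw carry the same UUID, not that the set is current or trustworthy at mount time.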
Re: Btrfs trees for linux-next
On Dec 11, 2008 09:43 -0500, Chris Mason wrote:
> The multi-device code uses a very simple brute force scan from userland
> to populate the list of devices that belong to a given FS.  Kay Sievers
> has some ideas on hotplug magic to make this less dumb.  (The scan
> isn't required for single device filesystems).

This should use libblkid to do the scanning of the devices, and it can
cache the results for efficiency.  Best would be to have the same
LABEL+UUID for all devices in the same filesystem, so that once any one
of these devices is found the mount.btrfs code can query the rest of the
devices to find the remaining parts of the filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.