Re: cross-fs copy support

2018-10-01 Thread Andreas Dilger
On Oct 1, 2018, at 9:49 AM, Eric Sandeen  wrote:
> 
> 
> On 10/1/18 9:48 AM, Qu Wenruo wrote:
>> 
>> 
>> On 2018/10/1 at 10:32 PM, Joshi wrote:
>>> I was wondering about the cross-fs copy through copy_file_range.
>> 
>> The term "cross-fs" looks pretty confusing.
>> 
>> If you mean "cross-subvolume", then it should work without problem in btrfs.
>> 
>> If you mean reflink across two different filesystems (no matter whether
>> they are the same fs type or not), then it's impossible for that to work.
> 
> I believe Joshi is talking about vfs_copy_file_range() not
> vfs_clone_file_range(), although _copy_ does call _clone_ if it can.
> 
>> Reflink (clone_file_range) works by inserting data pointers into the
>> filesystem rather than really copying the data.
>> Thus if the source is outside of the fs, it can't possibly work,
>> as the source pointer/data is completely out of the control of the dest fs.
> 
> Yes, I would expect there to be problems with his modified kernel
> for a filesystem that supports clone_file_range, because
> vfs_copy_file_range() will clone if possible, and this should fail across
> filesystems.
> 
> In general, though, I don't know for sure why we don't fall back to
> do_splice_direct() across filesystems, although the filesystems that
> implement their own ->copy_file_range ops may have their own,
> further restrictions within their implementations.
> 
> This call /is/ documented in the manpage as only being valid for
> files on the same filesystem, though:
> http://man7.org/linux/man-pages/man2/copy_file_range.2.html

There was a patch to allow cross-mount copy for NFS, but it hasn't landed
yet.
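Until such a patch lands, userspace has to handle the same-filesystem restriction itself. A minimal sketch (not from this thread; helper names are illustrative) that tries copy_file_range() and falls back to an ordinary read/write loop when the kernel reports EXDEV or ENOSYS might look like:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Try copy_file_range(); fall back to read/write on EXDEV/ENOSYS.
 * Returns 0 on success, -1 on error. */
static int copy_data(int fd_in, int fd_out, size_t len)
{
	while (len > 0) {
		ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);
		if (n > 0) {
			len -= (size_t)n;
			continue;
		}
		if (n == 0)
			return 0;	/* EOF before len bytes */
		if (errno != EXDEV && errno != ENOSYS)
			return -1;
		/* Fallback: plain read/write loop through a bounce buffer. */
		char buf[4096];
		while (len > 0) {
			ssize_t r = read(fd_in, buf,
					 len < sizeof(buf) ? len : sizeof(buf));
			if (r <= 0)
				return r == 0 ? 0 : -1;
			if (write(fd_out, buf, (size_t)r) != r)
				return -1;
			len -= (size_t)r;
		}
	}
	return 0;
}

/* Self-test: copy 5 bytes between two temp files and verify the data. */
static int copy_demo(void)
{
	char src[] = "/tmp/cfr_src_XXXXXX", dst[] = "/tmp/cfr_dst_XXXXXX";
	int in = mkstemp(src), out = mkstemp(dst);
	char back[8] = "";

	if (in < 0 || out < 0)
		return -1;
	if (write(in, "hello", 5) != 5 || lseek(in, 0, SEEK_SET) != 0)
		return -1;
	if (copy_data(in, out, 5) != 0)
		return -1;
	if (lseek(out, 0, SEEK_SET) != 0 || read(out, back, 5) != 5)
		return -1;
	unlink(src); unlink(dst); close(in); close(out);
	return strcmp(back, "hello") == 0 ? 0 : -1;
}
```

Note the fallback loses any reflink/server-side-copy optimization, which is exactly why a cross-mount path in the kernel would be attractive for NFS.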

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP


Re: [PATCH 10/10] Dynamic fault injection

2018-05-18 Thread Andreas Dilger
On May 18, 2018, at 1:10 PM, Kent Overstreet <kent.overstr...@gmail.com> wrote:
> 
> On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
>> On May 18, 2018, at 1:49 AM, Kent Overstreet <kent.overstr...@gmail.com> 
>> wrote:
>>> 
>>> Signed-off-by: Kent Overstreet <kent.overstr...@gmail.com>
>> 
>> I agree with Christoph that even if there was some explanation in the cover
>> letter, there should be something at least as good in the patch itself.  The
>> cover letter is not saved, but the commit stays around forever, and should
>> explain how this should be added to code, and how to use it from userspace.
>> 
>> 
>> That said, I think this is a useful functionality.  We have something similar
>> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
>> test a distributed filesystem, which is just a CPP macro with an unlikely()
>> branch, while this looks more sophisticated.  This looks like it has some
>> added functionality like having more than one fault enabled at a time.
>> If this lands we could likely switch our code over to using this.
> 
> This is pretty much what I was looking for, I just wanted to know if this
> patch was interesting enough to anyone that I should spend more time on it
> or just drop it :) Agreed on documentation. I think it's also worth
> factoring out the functionality for the elf section trick that dynamic
> debug uses too.
> 
>> Some things that are missing from this patch that are in our code:
>> 
>> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
>>  - timeout: sleep for N msec to simulate network/disk/locking delays
>>  - race: wait with one thread until a second thread hits matching check
>> 
>> We also have a "fail_val" that allows making the check conditional (e.g.
>> only operation on server "N" should fail, only RPC opcode "N", etc).
> 
> Those all sound like good ideas... fail_val especially, I think with that
> we'd have all the functionality the existing fault injection framework has
> (which is way too heavyweight to actually get used, imo)

The other thing we have, which is slightly orthogonal to your modes and is
possible because we use just a __u32 for the fault location, is that the
"oneshot" mode is simply a mask added to the fault location.  Together with
"fail_val" this lets us add other masks: "fail N times", "fail randomly
1/N times", or "pass N times before failure".  Another mask is set in the
kernel when the fault is actually hit, so that test scripts can poll until
that happens, and then continue running.

The "fail randomly 1/N times" was useful for detecting memory allocation
failure handling under load, but that has been superseded by the same
functionality in kmalloc(), and it sounds like your fault injection can
do this deterministically for every allocation?
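The mask scheme described above can be sketched in userspace (all names and bit assignments here are hypothetical; this is not the Lustre OBD_FAIL_CHECK() code): high bits of the __u32 fault id select a mode, low bits identify the fault location, and fail_val supplies N:

```c
#include <stdint.h>

/* Hypothetical encoding: high bits select a mode, low bits are the
 * fault location, "fail_val" supplies N for the counted modes. */
#define DF_LOC_MASK	0x0000ffffu
#define DF_ONESHOT	0x00010000u	/* fire once, then disarm */
#define DF_FAIL_N	0x00020000u	/* fail the first N hits */
#define DF_PASS_N	0x00040000u	/* pass N hits, then fail once */
#define DF_HIT		0x80000000u	/* set when the fault actually fired */

struct dfault {
	uint32_t id;		/* location | mode masks | DF_HIT */
	uint32_t fail_val;	/* N for the modes above */
	uint32_t count;		/* hits so far */
};

/* Returns nonzero if this hit of fault point "loc" should fail. */
static int dfault_check(struct dfault *df, uint32_t loc)
{
	int fail = 0;

	if ((df->id & DF_LOC_MASK) != loc)
		return 0;
	df->count++;
	if (df->id & DF_ONESHOT)
		fail = df->count == 1;
	else if (df->id & DF_FAIL_N)
		fail = df->count <= df->fail_val;
	else if (df->id & DF_PASS_N)
		fail = df->count == df->fail_val + 1;
	if (fail)
		df->id |= DF_HIT;	/* lets test scripts poll for this */
	return fail;
}

/* Self-test of the counted modes; returns 0 if behaviour matches. */
static int dfault_demo(void)
{
	struct dfault a = { .id = 7 | DF_FAIL_N, .fail_val = 2, .count = 0 };
	struct dfault b = { .id = 9 | DF_PASS_N, .fail_val = 1, .count = 0 };

	/* a: fail the first 2 hits, then pass */
	if (!dfault_check(&a, 7) || !dfault_check(&a, 7) || dfault_check(&a, 7))
		return -1;
	if (!(a.id & DF_HIT))
		return -1;
	/* b: pass 1 hit, fail the 2nd; non-matching locations never fail */
	if (dfault_check(&b, 9) || !dfault_check(&b, 9) || dfault_check(&b, 5))
		return -1;
	return 0;
}
```

The DF_HIT bit is the polling hook mentioned above: a test script can read the fault's state until the bit appears, then resume.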

Cheers, Andreas



Re: [PATCH 10/10] Dynamic fault injection

2018-05-18 Thread Andreas Dilger
On May 18, 2018, at 1:49 AM, Kent Overstreet  wrote:
> 
> Signed-off-by: Kent Overstreet 

I agree with Christoph that even if there was some explanation in the cover
letter, there should be something at least as good in the patch itself.  The
cover letter is not saved, but the commit stays around forever, and should
explain how this should be added to code, and how to use it from userspace.


That said, I think this is a useful functionality.  We have something similar
in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
test a distributed filesystem, which is just a CPP macro with an unlikely()
branch, while this looks more sophisticated.  This looks like it has some
added functionality like having more than one fault enabled at a time.
If this lands we could likely switch our code over to using this.


Some things that are missing from this patch that are in our code:

- in addition to the basic "enabled" and "oneshot" mechanisms, we have:
  - timeout: sleep for N msec to simulate network/disk/locking delays
  - race: wait with one thread until a second thread hits matching check

We also have a "fail_val" that allows making the check conditional (e.g.
only operation on server "N" should fail, only RPC opcode "N", etc).

Cheers, Andreas

> ---
> include/asm-generic/vmlinux.lds.h |   4 +
> include/linux/dynamic_fault.h | 117 +
> lib/Kconfig.debug |   5 +
> lib/Makefile  |   2 +
> lib/dynamic_fault.c   | 760 ++
> 5 files changed, 888 insertions(+)
> create mode 100644 include/linux/dynamic_fault.h
> create mode 100644 lib/dynamic_fault.c
> 
> diff --git a/include/asm-generic/vmlinux.lds.h 
> b/include/asm-generic/vmlinux.lds.h
> index 1ab0e520d6..a4c9dfcbbd 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -246,6 +246,10 @@
>   VMLINUX_SYMBOL(__start___verbose) = .;  \
>   KEEP(*(__verbose))  \
>   VMLINUX_SYMBOL(__stop___verbose) = .;   \
> + . = ALIGN(8);   \
> + VMLINUX_SYMBOL(__start___faults) = .;   \
> + *(__faults) \
> + VMLINUX_SYMBOL(__stop___faults) = .;\
>   LIKELY_PROFILE()\
>   BRANCH_PROFILE()\
>   TRACE_PRINTKS() \
> diff --git a/include/linux/dynamic_fault.h b/include/linux/dynamic_fault.h
> new file mode 100644
> index 00..6e7bb56ae8
> --- /dev/null
> +++ b/include/linux/dynamic_fault.h
> @@ -0,0 +1,117 @@
> +#ifndef _DYNAMIC_FAULT_H
> +#define _DYNAMIC_FAULT_H
> +
> +#include 
> +#include 
> +#include 
> +
> +enum dfault_enabled {
> + DFAULT_DISABLED,
> + DFAULT_ENABLED,
> + DFAULT_ONESHOT,
> +};
> +
> +union dfault_state {
> + struct {
> + unsignedenabled:2;
> + unsignedcount:30;
> + };
> +
> + struct {
> + unsignedv;
> + };
> +};
> +
> +/*
> + * An instance of this structure is created in a special
> + * ELF section at every dynamic fault callsite.  At runtime,
> + * the special section is treated as an array of these.
> + */
> +struct _dfault {
> + const char  *modname;
> + const char  *function;
> + const char  *filename;
> + const char  *class;
> +
> + const u16   line;
> +
> + unsignedfrequency;
> + union dfault_state  state;
> +
> + struct static_key   enabled;
> +} __aligned(8);
> +
> +
> +#ifdef CONFIG_DYNAMIC_FAULT
> +
> +int dfault_add_module(struct _dfault *tab, unsigned int n, const char *mod);
> +int dfault_remove_module(char *mod_name);
> +bool __dynamic_fault_enabled(struct _dfault *);
> +
> +#define dynamic_fault(_class)
> \
> +({   \
> + static struct _dfault descriptor\
> + __used __aligned(8) __attribute__((section("__faults"))) = {\
> + .modname= KBUILD_MODNAME,   \
> + .function   = __func__, \
> + .filename   = __FILE__, \
> + .line   = __LINE__, \
> + .class  = _class,   \
> + };  \
> + \
> + 

Re: [PATCH 1/2] fs: hoist BTRFS_IOC_[SG]ET_FSLABEL to vfs

2018-05-09 Thread Andreas Dilger
On May 9, 2018, at 10:10 AM, Darrick J. Wong <darrick.w...@oracle.com> wrote:
> 
> On Wed, May 09, 2018 at 11:01:21AM -0500, Eric Sandeen wrote:
>> Move the btrfs label ioctls up to the vfs for general use.
>> 
>> This retains 256 chars as the maximum size through the interface, which
>> is the btrfs limit and AFAIK exceeds any other filesystem's maximum
>> label size.
>> 
>> Signed-off-by: Eric Sandeen <sand...@redhat.com>
>> ---
>> 
>> Let the bikeshedding on the exact ioctl name begin ;)
>> 
>> fs/btrfs/ioctl.c   | 8 
>> include/uapi/linux/btrfs.h | 6 ++
>> include/uapi/linux/fs.h| 8 ++--
>> 3 files changed, 12 insertions(+), 10 deletions(-)
>> 
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 632e26d..2dd4cdf 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -5444,6 +5444,10 @@ long btrfs_ioctl(struct file *file, unsigned int
>>  return btrfs_ioctl_setflags(file, argp);
>>  case FS_IOC_GETVERSION:
>>  return btrfs_ioctl_getversion(file, argp);
>> +case FS_IOC_GETFSLABEL:
>> +return btrfs_ioctl_get_fslabel(file, argp);
>> +case FS_IOC_SETFSLABEL:
>> +return btrfs_ioctl_set_fslabel(file, argp);
>>  case FITRIM:
>>  return btrfs_ioctl_fitrim(file, argp);
>>  case BTRFS_IOC_SNAP_CREATE:
>> @@ -,10 +5559,6 @@ long btrfs_ioctl(struct file *file, unsigned int
>>  return btrfs_ioctl_quota_rescan_wait(file, argp);
>>  case BTRFS_IOC_DEV_REPLACE:
>>  return btrfs_ioctl_dev_replace(fs_info, argp);
>> -case BTRFS_IOC_GET_FSLABEL:
>> -return btrfs_ioctl_get_fslabel(file, argp);
>> -case BTRFS_IOC_SET_FSLABEL:
>> -return btrfs_ioctl_set_fslabel(file, argp);
>>  case BTRFS_IOC_GET_SUPPORTED_FEATURES:
>>  return btrfs_ioctl_get_supported_features(argp);
>>  case BTRFS_IOC_GET_FEATURES:
>> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
>> index c8d99b9..ec611c8 100644
>> --- a/include/uapi/linux/btrfs.h
>> +++ b/include/uapi/linux/btrfs.h
>> @@ -823,10 +823,8 @@ enum btrfs_err_code {
>> #define BTRFS_IOC_QUOTA_RESCAN_STATUS _IOR(BTRFS_IOCTL_MAGIC, 45, \
>> struct btrfs_ioctl_quota_rescan_args)
>> #define BTRFS_IOC_QUOTA_RESCAN_WAIT _IO(BTRFS_IOCTL_MAGIC, 46)
>> -#define BTRFS_IOC_GET_FSLABEL _IOR(BTRFS_IOCTL_MAGIC, 49, \
>> -   char[BTRFS_LABEL_SIZE])
>> -#define BTRFS_IOC_SET_FSLABEL _IOW(BTRFS_IOCTL_MAGIC, 50, \
>> -   char[BTRFS_LABEL_SIZE])
>> +#define BTRFS_IOC_GET_FSLABEL   FS_IOC_GETFSLABEL
>> +#define BTRFS_IOC_SET_FSLABEL   FS_IOC_SETFSLABEL
>> #define BTRFS_IOC_GET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
>>struct btrfs_ioctl_get_dev_stats)
>> #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
>> index d2a8313..1df3707 100644
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -242,6 +242,8 @@ struct fsxattr {
>> #define FICLONERANGE _IOW(0x94, 13, struct file_clone_range)
>> #define FIDEDUPERANGE_IOWR(0x94, 54, struct file_dedupe_range)
>> 
>> +#define FSLABEL_MAX 256 /* Max chars for the interface; each fs may 
>> differ */
>> +
>> #define  FS_IOC_GETFLAGS _IOR('f', 1, long)
>> #define  FS_IOC_SETFLAGS _IOW('f', 2, long)
>> #define  FS_IOC_GETVERSION   _IOR('v', 1, long)
>> @@ -251,8 +253,10 @@ struct fsxattr {
>> #define FS_IOC32_SETFLAGS_IOW('f', 2, int)
>> #define FS_IOC32_GETVERSION  _IOR('v', 1, int)
>> #define FS_IOC32_SETVERSION  _IOW('v', 2, int)
>> -#define FS_IOC_FSGETXATTR   _IOR ('X', 31, struct fsxattr)
>> -#define FS_IOC_FSSETXATTR   _IOW ('X', 32, struct fsxattr)
>> +#define FS_IOC_FSGETXATTR   _IOR('X', 31, struct fsxattr)
>> +#define FS_IOC_FSSETXATTR   _IOW('X', 32, struct fsxattr)
> 
> Separate patch for whitespace cleanup.

Really?  I've heard Ted complain the other way, that whitespace cleanup
by itself is useless and should only be done as part of other changes.
As long as it is not overwhelming the rest of the patch I don't see an
issue with a minor improvement being part of another patch.

Otherwise, no bikeshedding from me.  Looks very reasonable, and the 256-char
limit is definitely large enough.
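Once the ioctls are hoisted to the VFS, any filesystem label can be read with one generic call from userspace. A minimal sketch (defining the ioctl numbers locally in case the system headers predate the move; the wrapper name is illustrative):

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Values from the patch above; defined here in case <linux/fs.h>
 * on this system predates the hoist. */
#ifndef FS_IOC_GETFSLABEL
#define FSLABEL_MAX 256
#define FS_IOC_GETFSLABEL _IOR(0x94, 49, char[FSLABEL_MAX])
#define FS_IOC_SETFSLABEL _IOW(0x94, 50, char[FSLABEL_MAX])
#endif

/* Read the label of the filesystem backing fd into label[].
 * Returns 0 on success, -1 with errno set (e.g. ENOTTY when the
 * filesystem does not implement the ioctl). */
static int get_fslabel(int fd, char label[FSLABEL_MAX])
{
	memset(label, 0, FSLABEL_MAX);
	return ioctl(fd, FS_IOC_GETFSLABEL, label);
}
```

A caller should be prepared for ENOTTY, since only filesystems that wire up the new case (btrfs here, others later) will answer it.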

Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault

2017-12-14 Thread Andreas Dilger
[remove stable@ as this is not really a stable patch]

On Dec 14, 2017, at 7:30 AM, Yan, Zheng  wrote:
> 
> On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara  wrote:
>> On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
>>> We recently got an Oops report:
>>> 
>>> BUG: unable to handle kernel NULL pointer dereference at (null)
>>> IP: jbd2__journal_start+0x38/0x1a2
>>> [...]
>>> Call Trace:
>>>  ext4_page_mkwrite+0x307/0x52b
>>>  _ext4_get_block+0xd8/0xd8
>>>  do_page_mkwrite+0x6e/0xd8
>>>  handle_mm_fault+0x686/0xf9b
>>>  mntput_no_expire+0x1f/0x21e
>>>  __do_page_fault+0x21d/0x465
>>>  dput+0x4a/0x2f7
>>>  page_fault+0x22/0x30
>>>  copy_user_generic_string+0x2c/0x40
>>>  copy_page_to_iter+0x8c/0x2b8
>>>  generic_file_read_iter+0x26e/0x845
>>>  timerqueue_del+0x31/0x90
>>>  ceph_read_iter+0x697/0xa33 [ceph]
>>>  hrtimer_cancel+0x23/0x41
>>>  futex_wait+0x1c8/0x24d
>>>  get_futex_key+0x32c/0x39a
>>>  __vfs_read+0xe0/0x130
>>>  vfs_read.part.1+0x6c/0x123
>>>  handle_mm_fault+0x831/0xf9b
>>>  __fget+0x7e/0xbf
>>>  SyS_read+0x4d/0xb5
>>> 
>>> ceph_read_iter() uses current->journal_info to pass context info to
>>> ceph_readpages(), because ceph_readpages() needs to know whether its caller
>>> has already gotten the capability to use the page cache (to distinguish read
>>> from readahead/fadvise). ceph_read_iter() sets current->journal_info,
>>> then calls generic_file_read_iter().
>>> 
>>> In the above Oops, the page fault happened when copying data to userspace.
>>> The page fault handler called ext4_page_mkwrite(). The ext4 code read
>>> current->journal_info and assumed it was a journal handle.
>>> 
>>> I checked other filesystems, btrfs probably suffers similar problem
>>> for its readpage. (page fault happens when write() copies data from
>>> userspace memory and the memory is mapped to a file in btrfs.
>>> verify_parent_transid() can be called during readpage)
>>> 
>>> Cc: sta...@vger.kernel.org
>>> Signed-off-by: "Yan, Zheng" 
>> 
>> I agree with the analysis but the patch is too ugly to live. Ceph just
>> should not be abusing current->journal_info for passing information between
>> two random functions, or when it does hackery like this, it should just
>> make sure the pieces hold together. Polluting generic code to accommodate
>> this hack in Ceph is not acceptable. Also bear in mind there are likely
>> other code paths (e.g. memory reclaim) which could recurse into another
>> filesystem confusing it with non-NULL current->journal_info in the same
>> way.
> 
> But ...
> 
> some filesystems set journal_info in their write_begin(), then clear it
> in write_end(). If the buffer for the write is mapped to another filesystem,
> current->journal_info can leak into the latter filesystem's readpage().
> The latter filesystem may read current->journal_info and treat it as its own
> journal handle.  Besides, most filesystems' vm fault handler is
> filemap_fault(), and filemap may also trigger memory reclaim.

Shouldn't the memory reclaim be prevented from recursing into the other
filesystem by use of GFP_NOFS, or the new memalloc_nofs annotation?

I don't think that ext4 is ever using current->journal_info on any read paths,
only in case of writes.

>> In this particular case I'm not sure why does ceph pass 'filp' into
>> readpage() / readpages() handler when it already gets that pointer as part
>> of arguments...
> 
> It is actually a flag which tells ceph_readpages() whether its caller is
> ceph_read_iter() or readahead/fadvise/madvise, because when there are
> multiple clients reading/writing a file at the same time, the page cache
> should be disabled.

I've wanted something similar for other reasons.  It would be better to
have a separate fs-specific pointer in the task struct to handle this
kind of information.  This can be used by the filesystem "upper half" to
communicate with the "lower half" (doing the writeout or other IO below
the VFS), and the "lower half" can use ->journal_info for handling the writeout.

However, some care would be needed to ensure that other processes accessing
this pointer would only do so if it is their own.  Something like
->fs_private_sb and ->fs_private_data would allow this sanely.  If the
->fs_private_sb != sb in the filesystem then ->fs_private_data is not valid
for this fs and cannot be used by the current filesystem code.  Alternately,
we could have a single ->fs_private pointer to reduce impact on task_struct
so long as all filesystems used the first field of the structure to point to
"sb", probably with a library helper to ensure this was done consistently:

data = current_fs_private_get(sb);
current_fs_private_set(sb, data);
data = current_fs_private_alloc(sb, size, gfp);

or whatever.
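That single-pointer convention can be modelled in userspace (every name here is hypothetical; this is a sketch of the proposal, not kernel code): each filesystem's per-task private struct starts with its superblock pointer, so the accessor can cheaply reject a pointer left behind by a different filesystem:

```c
#include <stddef.h>
#include <stdlib.h>

struct super_block { int dummy; };	/* stand-in for the kernel struct */

/* Convention: the first field of any per-task fs-private struct
 * is the superblock it belongs to. */
struct fs_private_head {
	struct super_block *sb;
};

/* Stand-in for current->fs_private in this userspace model. */
static void *current_fs_private;

/* Return the private data only if it belongs to this sb. */
static void *current_fs_private_get(struct super_block *sb)
{
	struct fs_private_head *head = current_fs_private;

	if (head && head->sb == sb)
		return head;
	return NULL;
}

/* Allocate private data tagged with the owning sb and install it. */
static void *current_fs_private_alloc(struct super_block *sb, size_t size)
{
	struct fs_private_head *head = calloc(1, size);

	if (head) {
		head->sb = sb;
		current_fs_private = head;
	}
	return head;
}
```

The point of the sb tag is exactly the safety property described above: a "lower half" entered via reclaim on another filesystem sees NULL rather than misinterpreting a foreign pointer.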

> Regards
> Yan, Zheng
> 
>> 
>>Honza
>> 
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index a728bed16c20..db2a50233c49 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, 

Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?

2017-11-17 Thread Andreas Dilger
On Nov 15, 2017, at 7:18 PM, Qu Wenruo  wrote:
> 
> [Background]
> Recently I'm considering the possibility to use checksum from filesystem
> to enhance device-mapper raid.
> 
> The idea behind it is quite simple, since most modern filesystems have
> checksum for their metadata, and even some (btrfs) have checksum for data.
> 
> And for btrfs RAID1/10 (just ignore the RAID5/6 for now), at read time
> it can use the checksum to determine which copy is correct so it can
> return the correct data even if one copy gets corrupted.
> 
> [Objective]
> The final objective is to allow device mapper to do the checksum
> verification (and repair if possible).
> 
> If only for verification, it's not much different from current endio
> hook method used by most of the fs.
> However if we can move the repair part out of the filesystem (well, only
> btrfs supports it yet), it would benefit all filesystems.

I recall Darrick was looking into a mechanism to do this.  Rather than
changing the whole block layer to take a callback to do a checksum, what
we looked at was to allow the upper-layer read to specify a "retry count"
to the lower-layer block device.  If the lower layer is able to retry the
read then it will read a different device (or combination of devices for
e.g. RAID-6) based on the retry count, until the upper layer gets a good
read (based on checksum, or whatever).  If there are no more devices (or
combinations) to try then a final error is returned.

Darrick can probably point at the original thread/patch.
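The retry-count mechanism can be modelled outside the kernel (all names are hypothetical; this is a sketch of the idea, not Darrick's patch): the upper layer re-issues the read with an incrementing retry index, the lower layer maps that index to a different copy, and the loop stops at the first copy whose checksum verifies:

```c
#include <stdint.h>
#include <string.h>

#define NCOPIES 2
#define BLKSZ   8

/* Toy checksum standing in for the filesystem's data/metadata csum. */
static uint32_t csum(const uint8_t *buf)
{
	uint32_t c = 0;

	for (int i = 0; i < BLKSZ; i++)
		c = c * 31 + buf[i];
	return c;
}

/* "Lower layer": retry_count selects which mirror to read.
 * Returns -1 when there are no more copies to try. */
static int mirror_read(uint8_t copies[NCOPIES][BLKSZ], int retry_count,
		       uint8_t *out)
{
	if (retry_count >= NCOPIES)
		return -1;
	memcpy(out, copies[retry_count], BLKSZ);
	return 0;
}

/* "Upper layer": bump retry_count until the checksum matches.
 * Returns the index of the good copy, or -1 if all copies are bad. */
static int read_verified(uint8_t copies[NCOPIES][BLKSZ], uint32_t expect,
			 uint8_t *out)
{
	for (int retry = 0; retry < NCOPIES; retry++) {
		if (mirror_read(copies, retry, out) < 0)
			break;
		if (csum(out) == expect)
			return retry;
	}
	return -1;
}
```

The appeal over a checksum callback into the block layer is visible even in the toy: the lower layer stays checksum-agnostic and only needs to interpret "give me a different combination of devices this time".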

Cheers, Andreas



Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl

2017-05-13 Thread Andreas Dilger
On May 10, 2017, at 11:10 PM, Eric Biggers  wrote:
> 
> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang 
>> out]
>> 
>> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
>>> Theodore Ts'o  writes:
>>> 
 On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> 1.) Privacy implications.  Say the filesystem is being shared between 
> multiple
>users, and one user unpacks foo.tar.gz into their home directory, which
>they've set to mode 700 to hide from other users.  Because of this new
>ioctl, all users will be able to see every (inode number, size in 
> blocks)
>pair that was added to the filesystem, as well as the exact layout of 
> the
>physical block allocations which might hint at how the files were 
> created.
>If there is a known "fingerprint" for the unpacked foo.tar.gz in this
>regard, its presence on the filesystem will be revealed to all users.  
> And
>if any filesystems happen to prefer allocating blocks near the 
> containing
>directory, the directory the files are in would likely be revealed too.
>> 
>> Frankly, why are container users even allowed to make unrestricted ioctl
>> calls?  I thought we had a bunch of security infrastructure to constrain
>> what userspace can do to a system, so why don't ioctls fall under these
>> same protections?  If your containers are really that adversarial, you
>> ought to be blacklisting as much as you can.
>> 
> 
> Personally I don't find the presence of sandboxing features to be a very good
> excuse for introducing random insecure ioctls.  Not everyone has everything
> perfectly "sandboxed" all the time, for obvious reasons.  It's easy to forget
> about the filesystem ioctls, too, since they can be executed on any regular
> file, without having to open some device node in /dev.
> 
> (And this actually does happen; the SELinux policy in Android, for example,
> still allows apps to call any ioctl on their data files, despite all the 
> effort
> that has gone into whitelisting other types of ioctls.  Which should be fixed,
> of course, but it shows that this kind of mistake is very easy to make.)
> 
 Unix/Linux has historically not been terribly concerned about trying
 to protect this kind of privacy between users.  So for example, in
 order to do this, you would have to call GETFSMAP continuously to track
 this sort of thing.  Someone who wanted to do this could probably get
 this information (and much, much more) by continuously running "ps" to
 see what processes are running.
 
 (I will note, wryly, that in the bad old days, when dozens of users
 were sharing a one MIPS Vax/780, it was considered a *good* thing
 that social pressure could be applied when it was found that someone
 was running a CPU or memory hogger on a time sharing system.  The
 privacy right of someone running "xtrek" to be able to hide this from
 other users on the system was never considered important at all.  :-)
>> 
>> Not to mention someone running GETFSMAP in a loop will be pretty obvious
>> both from the high kernel cpu usage and the huge number of metadata
>> operations.
> 
> Well, only if that someone running GETFSMAP actually wants to watch things in
> real-time (it's not necessary for all scenarios that have been mentioned), 
> *and*
> there is monitoring in place which actually detects it and can do something
> about it.
> 
> Yes, PIDs have traditionally been global, but today we have PID namespaces, 
> and
> many other isolation features such as mount namespaces.  Nothing is perfect, 
> of
> course, and containers are a lot worse than VMs, but it seems weird to use 
> that
> as an excuse to knowingly make things worse...
> 
>> 
 Fortunately, the days of timesharing seem to well behind us.  For
 those people who think that containers are as secure as VM's (hah,
 hah, hah), it might be that best way to handle this is to have a mount
 option that requires root access to this functionality.  For those
 people who really care about this, they can disable access.
>> 
>> Or use separate filesystems for each container so that exploitable bugs
>> that shut down the filesystem can't be used to kill the other
>> containers.  You could use a torrent of metadata-heavy operations
>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
>> the other containers.
>> 
>>> What would be the reason for not putting this behind
>>> capable(CAP_SYS_ADMIN)?
>>> 
>>> What possible legitimate function could this functionality serve to
>>> users who don't own your filesystem?
>> 
>> As I've said before, it's to enable dedupe tools to decide, given a set
>> of files with shareable blocks, roughly how many other times each of
>> those shareable blocks are shared 

Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-17 Thread Andreas Dilger
On Jan 17, 2017, at 8:59 AM, Theodore Ts'o  wrote:
> 
> On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
>> 
>> OK, so I've been staring into the code and AFAIU current->journal_info
>> can contain my stored information. I could either hijack part of the
>> word as the ref counting is only consuming low 12b. But that looks too
>> ugly to live. Or I can allocate some placeholder.
> 
> Yeah, I was looking at something similar.  Can you guarantee that the
> context will only take one or two bits?  (Looks like it only needs one
> bit ATM, even though at the moment you're storing the whole GFP mask,
> correct?)
> 
>> But before going to play with that I am really wondering whether we need
>> all this with no journal at all. AFAIU what Jack told me it is the
>> journal lock(s) which is the biggest problem from the reclaim recursion
>> point of view. What would cause a deadlock in no journal mode?
> 
> We still have the original problem for why we need GFP_NOFS even in
> ext2.  If we are in a writeback path, and we need to allocate memory,
> we don't want to recurse back into the file system's writeback path.
> Certainly not for the same inode, and while we could make it work if
> the mm was writing back another inode, or another superblock, there
> are also stack depth considerations that would make this be a bad
> idea.  So we do need to be able to assert GFP_NOFS even in no journal
> mode, and for any file system including ext2, for that matter.
> 
> Because of the fact that we're going to have to play games with
> current->journal_info, maybe this is something that I should take
> responsibility for, and have it go through the ext4 tree after the main
> patch series goes through?  Maybe you could use xfs and ext2 as sample
> (simple) implementations?
> 
> My only ask is that the memalloc nofs context be a well defined N
> bits, where N < 16, and I'll find some place to put them (probably
> journal_info).

I think Dave was suggesting that the NOFS context allow a pointer to
an arbitrary struct, so that it is possible to dereference this in
the filesystem itself to determine if the recursion is safe or not.
That way, ext2 could store an inode pointer (if that is what it cares
about) and verify that writeback is not recursing on the same inode,
and XFS can store something different.  It would also need to store
some additional info (e.g. fstype or superblock pointer) so that it
can determine how to interpret the NOFS context pointer.

I think it makes sense to add a couple of void * pointers to the task
struct along with journal_info and leave it up to the filesystem to
determine how to use them.
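The "hijack a few low bits" option mentioned earlier in the thread can also be modelled with tagged pointers (a sketch under the stated assumption, not the ext4/jbd2 implementation): since a journal handle is at least 8-byte aligned, the low 3 bits of ->journal_info are free to carry a small NOFS context:

```c
#include <stdint.h>

#define CTX_MASK 0x7ul		/* low 3 bits: handles are 8-byte aligned */
#define CTX_NOFS 0x1ul		/* example flag: allocation context is NOFS */

/* Pack a small context into the low bits of an aligned handle pointer. */
static void *pack_journal_info(void *handle, unsigned long ctx)
{
	return (void *)((uintptr_t)handle | (ctx & CTX_MASK));
}

/* Recover the real handle by masking the context bits back off. */
static void *journal_handle(void *journal_info)
{
	return (void *)((uintptr_t)journal_info & ~CTX_MASK);
}

/* Extract just the context bits. */
static unsigned long journal_ctx(void *journal_info)
{
	return (uintptr_t)journal_info & CTX_MASK;
}
```

This is the trade-off discussed above: a handful of bits are cheap to carry this way, whereas an arbitrary per-fs struct (as suggested for the task struct) can carry real state but costs a pointer.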

Cheers, Andreas




Re: [LSF/MM TOPIC] sharing pages between mappings

2017-01-11 Thread Andreas Dilger
On Jan 11, 2017, at 3:29 AM, Miklos Szeredi  wrote:
> 
> I know there's work on this for xfs, but could this be done in generic mm 
> code?
> 
> What are the obstacles?  page->mapping and page->index are the obvious ones.
> 
> If that's too difficult is it maybe enough to share mappings between
> files while they are completely identical and clone the mapping when
> necessary?
> 
> All COW filesystems would benefit, as well as layered ones: lots of
> fuse fs, and in some cases overlayfs too.

For layered filesystems it would also be useful to have an API to move
pages between mappings easily.

> Related:  what can DAX do in the presence of cloned block?
> 
> Thanks,
> Miklos


Cheers, Andreas



Re: [PATCH] f2fs: support multiple devices

2016-11-09 Thread Andreas Dilger
On Nov 9, 2016, at 1:56 PM, Jaegeuk Kim  wrote:
> 
> This patch implements multiple devices support for f2fs.
> Given multiple devices by mkfs.f2fs, f2fs shows them entirely as one big
> volume under one f2fs instance.
> 
> Internal block management is very simple, but we will modify block
> allocation and background GC policy to boost IO speed by exploiting them
> accoording to each device speed.

How will you integrate this with FIEMAP?  Since a file can now be split
across multiple devices, FIEMAP will return ambiguous physical block
numbers for such a file.  I've been meaning to merge the FIEMAP handling in
Lustre to support multiple devices in a single filesystem, so that this
can be detected in userspace.

struct ll_fiemap_extent {
__u64 fe_logical;  /* logical offset in bytes for the start of
* the extent from the beginning of the file
*/
__u64 fe_physical; /* physical offset in bytes for the start
* of the extent from the beginning of the disk
*/
__u64 fe_length;   /* length in bytes for this extent */
__u64 fe_reserved64[2];
__u32 fe_flags;/* FIEMAP_EXTENT_* flags for this extent */
__u32 fe_device;   /* device number for this extent */
__u32 fe_reserved[2];
};

This adds the 32-bit "fe_device" field, which would optionally be filled
in by the filesystem (zero otherwise).  It would return the kernel device
number (i.e. st_dev), or for network filesystem (with FIEMAP_EXTENT_NET
set) this could just return an integer device number since the device
number is meaningless (and may conflict) on a remote system.

Since AFAIK Btrfs also has multiple device support there are an increasing
number of places where this would be useful.
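As a sketch of how userspace might consume the proposed field (fe_device is the extension proposed above, not part of the current UAPI struct fiemap_extent), a dedupe or layout tool would only compare physical ranges within the same device:

```c
#include <stdint.h>

/* Proposed extension of struct fiemap_extent; fe_device is the new
 * (hypothetical) field, the rest mirrors the layout shown above. */
struct fiemap_extent_v2 {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint64_t fe_reserved64[2];
	uint32_t fe_flags;
	uint32_t fe_device;	/* st_dev, or opaque id for a network fs */
	uint32_t fe_reserved[2];
};

/* Physical ranges are only comparable when they live on the same device;
 * identical block numbers on different devices are unrelated. */
static int extents_may_overlap(const struct fiemap_extent_v2 *a,
			       const struct fiemap_extent_v2 *b)
{
	if (a->fe_device != b->fe_device)
		return 0;
	return a->fe_physical < b->fe_physical + b->fe_length &&
	       b->fe_physical < a->fe_physical + a->fe_length;
}
```

Without the device field, a tool walking a multi-device f2fs or btrfs volume has no way to tell these two cases apart, which is the ambiguity the question raises.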

Cheers, Andreas

> 
> Signed-off-by: Jaegeuk Kim 
> ---
> fs/f2fs/data.c  |  55 ---
> fs/f2fs/f2fs.h  |  29 --
> fs/f2fs/segment.c   | 119 +
> fs/f2fs/super.c | 138 ++--
> include/linux/f2fs_fs.h |  10 +++-
> 5 files changed, 277 insertions(+), 74 deletions(-)
> 
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 47ded0c..e2be24e 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -88,6 +88,46 @@ static void f2fs_write_end_io(struct bio *bio)
> }
> 
> /*
> + * Return true, if pre_bio's bdev is same as its target device.
> + */
> +struct block_device *f2fs_target_device(struct f2fs_sb_info *sbi,
> + block_t blk_addr, struct bio *bio)
> +{
> + struct block_device *bdev = sbi->sb->s_bdev;
> + int i;
> +
> + for (i = 0; i < sbi->s_ndevs; i++) {
> + if (FDEV(i).start_blk <= blk_addr &&
> + FDEV(i).end_blk >= blk_addr) {
> + blk_addr -= FDEV(i).start_blk;
> + bdev = FDEV(i).bdev;
> + break;
> + }
> + }
> + if (bio) {
> + bio->bi_bdev = bdev;
> + bio->bi_iter.bi_sector = SECTOR_FROM_BLOCK(blk_addr);
> + }
> + return bdev;
> +}
> +
> +int f2fs_target_device_index(struct f2fs_sb_info *sbi, block_t blkaddr)
> +{
> + int i;
> +
> + for (i = 0; i < sbi->s_ndevs; i++)
> + if (FDEV(i).start_blk <= blkaddr && FDEV(i).end_blk >= blkaddr)
> + return i;
> + return 0;
> +}
> +
> +static bool __same_bdev(struct f2fs_sb_info *sbi,
> + block_t blk_addr, struct bio *bio)
> +{
> + return f2fs_target_device(sbi, blk_addr, NULL) == bio->bi_bdev;
> +}
> +
> +/*
>  * Low-level block read/write IO operations.
>  */
> static struct bio *__bio_alloc(struct f2fs_sb_info *sbi, block_t blk_addr,
> @@ -97,8 +137,7 @@ static struct bio *__bio_alloc(struct f2fs_sb_info *sbi, 
> block_t blk_addr,
> 
>   bio = f2fs_bio_alloc(npages);
> 
> - bio->bi_bdev = sbi->sb->s_bdev;
> - bio->bi_iter.bi_sector = SECTOR_FROM_BLOCK(blk_addr);
> + f2fs_target_device(sbi, blk_addr, bio);
>   bio->bi_end_io = is_read ? f2fs_read_end_io : f2fs_write_end_io;
>   bio->bi_private = is_read ? NULL : sbi;
> 
> @@ -273,7 +312,8 @@ void f2fs_submit_page_mbio(struct f2fs_io_info *fio)
>   down_write(>io_rwsem);
> 
>   if (io->bio && (io->last_block_in_bio != fio->new_blkaddr - 1 ||
> - (io->fio.op != fio->op || io->fio.op_flags != fio->op_flags)))
> + (io->fio.op != fio->op || io->fio.op_flags != fio->op_flags) ||
> + !__same_bdev(sbi, fio->new_blkaddr, io->bio)))
>   __submit_merged_bio(io);
> alloc_new:
>   if (io->bio == NULL) {
> @@ -965,7 +1005,6 @@ static struct bio *f2fs_grab_bio(struct inode *inode, 
> block_t blkaddr,
> {
>   struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>   struct fscrypt_ctx *ctx = NULL;
> - struct 

Re: [PATCH 5/5] don't set *REFERENCED unless we are on the lru list

2016-10-25 Thread Andreas Dilger
On Oct 25, 2016, at 4:44 PM, Omar Sandoval  wrote:
> 
> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
>> With anything that populates the inode/dentry cache with a lot of one time 
>> use
>> inodes we can really put a lot of pressure on the system for things we don't
>> need to keep in cache.  It takes two runs through the LRU to evict these one 
>> use
>> entries, and if you have a lot of memory you can end up with 10's of 
>> millions of
>> entries in the dcache or icache that have never actually been touched since 
>> they
>> were first instantiated, and it will take a lot of CPU and a lot of pressure 
>> to
>> evict all of them.
>> 
>> So instead do what we do with pagecache, only set the *REFERENCED flags if we
>> are being used after we've been put onto the LRU.  This makes a significant
>> difference in the system's ability to evict these useless cache entries.  
>> With a
>> fs_mark workload that creates 40 million files we get the following results 
>> (all
>> in files/sec)
>> 
>> Btrfs                Patched     Unpatched
>> Average Files/sec:   72209.3     63254.2
>> p50 Files/sec:       70850       57560
>> p90 Files/sec:       68757       53085
>> p99 Files/sec:       68757       53085
>> 
>> XFS                  Patched     Unpatched
>> Average Files/sec:   61025.5     60719.5
>> p50 Files/sec:       60107       59465
>> p90 Files/sec:       59300       57966
>> p99 Files/sec:       59227       57528
>> 
>> Ext4                 Patched     Unpatched
>> Average Files/sec:   121785.4    119248.0
>> p50 Files/sec:       120852      119274
>> p90 Files/sec:       116703      112783
>> p99 Files/sec:       114393      104934
>> 
>> The reason Btrfs has a much larger improvement is because it holds a lot more
>> things in memory so benefits more from faster slab reclaim, but across the 
>> board
>> is an improvement for each of the file systems.
>> 
>> Signed-off-by: Josef Bacik 
>> ---
>> fs/dcache.c | 15 ++-
>> fs/inode.c  |  5 -
>> 2 files changed, 14 insertions(+), 6 deletions(-)
>> 
>> diff --git a/fs/dcache.c b/fs/dcache.c
>> index 5c7cc95..a558075 100644
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -779,8 +779,6 @@ repeat:
>>  goto kill_it;
>>  }
>> 
>> -if (!(dentry->d_flags & DCACHE_REFERENCED))
>> -dentry->d_flags |= DCACHE_REFERENCED;
>>  dentry_lru_add(dentry);
>> 
>>  dentry->d_lockref.count--;
>> @@ -803,6 +801,13 @@ static inline void __dget_dlock(struct dentry *dentry)
>>  dentry->d_lockref.count++;
>> }
>> 
>> +static inline void __dget_dlock_reference(struct dentry *dentry)
>> +{
>> +if (dentry->d_flags & DCACHE_LRU_LIST)
>> +dentry->d_flags |= DCACHE_REFERENCED;
>> +dentry->d_lockref.count++;
>> +}
>> +
>> static inline void __dget(struct dentry *dentry)
>> {
>>  lockref_get(&dentry->d_lockref);
>> @@ -875,7 +880,7 @@ again:
>>  (alias->d_flags & DCACHE_DISCONNECTED)) {
>>  discon_alias = alias;
>>  } else {
>> -__dget_dlock(alias);
>> +__dget_dlock_reference(alias);
>>  spin_unlock(&alias->d_lock);
>>  return alias;
>>  }
>> @@ -886,7 +891,7 @@ again:
>>  alias = discon_alias;
>>  spin_lock(&alias->d_lock);
>>  if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
>> -__dget_dlock(alias);
>> +__dget_dlock_reference(alias);
>>  spin_unlock(&alias->d_lock);
>>  return alias;
>>  }
>> @@ -2250,7 +2255,7 @@ struct dentry *__d_lookup(const struct dentry *parent, 
>> const struct qstr *name)
>>  if (!d_same_name(dentry, parent, name))
>>  goto next;
>> 
>> -dentry->d_lockref.count++;
>> +__dget_dlock_reference(dentry);
> 
> This misses dentries that we get through __d_lookup_rcu(), so I think
> your change made it so most things aren't getting DCACHE_REFERENCED set
> at all. Maybe something like this instead?
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 5c7cc95..d7a56a8 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -412,15 +412,6 @@ static void d_lru_shrink_move(struct list_lru_one *lru, 
> struct dentry *dentry,
>   list_lru_isolate_move(lru, &dentry->d_lru, list);
> }
> 
> -/*
> - * dentry_lru_(add|del)_list) must be called with d_lock held.
> - */
> -static void dentry_lru_add(struct dentry *dentry)
> -{
> - if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
> - d_lru_add(dentry);
> -}
> -
> /**
>  * d_drop - drop a dentry
>  * @dentry: dentry to drop
> @@ -779,9 +770,12 @@ void dput(struct dentry *dentry)
>  

Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Andreas Dilger

> On Jul 6, 2016, at 10:33 AM, Joerg Schilling 
>  wrote:
> 
> "Austin S. Hemmelgarn"  wrote:
> 
>> On 2016-07-06 11:22, Joerg Schilling wrote:
>>> 
>>> 
>>> You are mistaken.
>>> 
>>> stat /proc/$$/as
>>>  File: `/proc/6518/as'
>>>  Size: 2793472   Blocks: 5456   IO Block: 512    regular file
>>> Device: 544h/88342528d  Inode: 7557Links: 1
>>> Access: (0600/-rw---)  Uid: (   xx/   joerg)   Gid: (  xx/  bs)
>>> Access: 2016-07-06 16:33:15.660224934 +0200
>>> Modify: 2016-07-06 16:33:15.660224934 +0200
>>> Change: 2016-07-06 16:33:15.660224934 +0200
>>> 
>>> stat /proc/$$/auxv
>>>  File: `/proc/6518/auxv'
>>>  Size: 168       Blocks: 1      IO Block: 512    regular file
>>> Device: 544h/88342528d  Inode: 7568Links: 1
>>> Access: (0400/-r)  Uid: (   xx/   joerg)   Gid: (  xx/  bs)
>>> Access: 2016-07-06 16:33:15.660224934 +0200
>>> Modify: 2016-07-06 16:33:15.660224934 +0200
>>> Change: 2016-07-06 16:33:15.660224934 +0200
>>> 
>>> Any correct implementation of /proc returns the expected numbers in
>>> st_size as well as in st_blocks.
>> 
>> Odd, because I get 0 for both values on all the files in /proc/self and
>> all the top level files on all kernels I tested prior to sending that
> 
> I tested this with an official PROCFS-2 implementation that was written by
> the inventor of the PROC filesystem (Roger Faulkner) who, as sad news, passed
> away last weekend.
> 
> You may have done your tests on an inofficial procfs implementation

So, what you are saying is that you don't care about star working properly
on Linux, because it has an "inofficial" procfs implementation, while Solaris
has an "official" implementation?

>>> Now you know why BTRFS is still an incomplete filesystem. In a few years
>>> when it turns 10, this may change. People who implement filesystems of
>>> course need to learn that they need to hide implementation details from
>>> the official user space interfaces.
>> 
>> So in other words you think we should be lying about how much is
>> actually allocated on disk and thus violating the standard directly (and
>> yes, ext4 and everyone else who does this with delayed allocation _is_
>> strictly speaking violating the standard, because _nothing_ is allocated
>> yet)?
> 
> If it returns 0, it would be lying or it would be wrong anyway as it did not
> check the available space.
> 
> Also note that I mentioned already that the principle availability of SEEK_HOLE
> does not help as there is e.g. NFS...

So, it's OK that NFS is not POSIX compliant in various ways, and star will
deal with it, but you aren't willing to fix a heuristic used by star for a
behaviour that is unspecified by POSIX but has caused users to lose data
when archiving from several modern filesystems?

That's fine, so long as GNU tar is fixed to use the safe fallback in such
cases (i.e. trying to archive data from files that are newly created, even
if they report st_blocks == 0).

Cheers, Andreas









Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-04 Thread Andreas Dilger
On Jul 2, 2016, at 1:18 AM, Pavel Raiskup  wrote:
> 
> There are optimizations in archivers (tar, rsync, ...) that rely on up2date
> st_blocks info.  For example, in GNU tar there is optimization check [1]
> whether the 'st_size' reports more data than the 'st_blocks' can hold --> then
> tar considers that file is sparse (and does additional steps).
> 
> It looks like btrfs doesn't show correct value in 'st_blocks' until the data
> are synced.  ATM, there happens that:
> 
>a) some "tool" creates sparse file
>b) that tool does not sync explicitly and exits ..
>c) tar is called immediately after that to archive the sparse file
>d) tar considers [2] the file is completely sparse (because st_blocks is
>   zero) and archives no data.  Here comes data loss.
> 
> Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data is
> small and is in-lined (no real data blocks) -- I consider this too is a bug in
> btrfs worth fixing.

We had a similar problem with both ext4 and Lustre - the client was reporting
zero blocks due to delayed allocation until data was written to disk.  While
those problems were fixed in the filesystem to report an estimate of the block
count before any blocks were actually written to disk, it seems like this may
be a problem that will come up again with other filesystems in the future.

I think in addition to fixing btrfs (because it needs to work with existing
tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar
to handle this situation more robustly.  One option is if st_blocks == 0 then
tar should also check if st_mtime is less than 60s in the past, and if yes
then it should call fsync() on the file to flush any unwritten data to disk,
or assume the file is not sparse and read the whole file, so that it doesn't
incorrectly assume that the file is sparse and skip archiving the file data.

Cheers, Andreas

> 
> [1] 
> http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392
> [2] 
> http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273
> 
> Tested on kernel:
> kernel-4.5.7-300.fc24.x86_64
> 
> Originally reported here, reproducer available there:
> https://bugzilla.redhat.com/show_bug.cgi?id=1352061
> 
> Pavel
> 
> 


Cheers, Andreas









Re: [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

2016-04-27 Thread Andreas Dilger
On Apr 27, 2016, at 5:54 AM, Michal Hocko  wrote:
> 
> From: Michal Hocko 
> 
> xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> some time ago. We would like to make this concept more generic and use
> it for other filesystems as well. Let's start by giving the flag a
> more generic name PF_MEMALLOC_NOFS which is in line with an existing
> PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> step before we introduce a full API for it as xfs uses the flag directly
> anyway.
> 
> This patch doesn't introduce any functional change.
> 
> Signed-off-by: Michal Hocko 
> ---
> Hi,
> as suggested by Dave, I have split up [1] into two parts. The first one
> adds a new PF flag which is just an alias to the existing PF_FSTRANS
> and does all the renaming and the second one to introduce the generic
> API which only changes the bare minimum in the xfs proper.
> 
> fs/xfs/kmem.c |  4 ++--
> fs/xfs/kmem.h |  2 +-
> fs/xfs/libxfs/xfs_btree.c |  2 +-
> fs/xfs/xfs_aops.c |  6 +++---
> fs/xfs/xfs_trans.c| 12 ++--
> include/linux/sched.h |  2 ++
> 6 files changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 686ba6fb20dd..73f6ab59c664 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
>* the filesystem here and potentially deadlocking.
>*/
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>   noio_flag = memalloc_noio_save();
> 
>   lflags = kmem_flags_convert(flags);
>   ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
> 
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>   memalloc_noio_restore(noio_flag);

Not really the fault of this patch, but it brings this nasty bit of code into
the light.  Is all of this machinery still needed given that __vmalloc() can
accept GFP flags?  If yes, wouldn't it be better to fix __vmalloc() to honor
the GFP flags instead of working around it in the filesystem code?

Cheers, Andreas

>   return ptr;
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index d1c66e465ca5..0d83f332e5c2 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>   lflags = GFP_ATOMIC | __GFP_NOWARN;
>   } else {
>   lflags = GFP_KERNEL | __GFP_NOWARN;
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>   lflags &= ~__GFP_FS;
>   }
> 
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index a0eb18ce3ad3..326566f4a131 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2540,7 +2540,7 @@ xfs_btree_split_worker(
>   struct xfs_btree_split_args *args = container_of(work,
>   struct xfs_btree_split_args, 
> work);
>   unsigned long   pflags;
> - unsigned long   new_pflags = PF_FSTRANS;
> + unsigned long   new_pflags = PF_MEMALLOC_NOFS;
> 
>   /*
>* we are in a transaction context here, but may also be doing work
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d12dfcfd0cc8..6d816ff0b763 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
>* We hand off the transaction to the completion thread now, so
>* clear the flag here.
>*/
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>   return 0;
> }
> 
> @@ -169,7 +169,7 @@ xfs_setfilesize_ioend(
>* thus we need to mark ourselves as being in a transaction manually.
>* Similarly for freeze protection.
>*/
> - current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>   __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
> 
>   /* we abort the update if there was an IO error */
> @@ -979,7 +979,7 @@ xfs_vm_writepage(
>* Given that we do not allow direct reclaim to call us, we should
>* never be called while in a filesystem transaction.
>*/
> - if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
> + if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>   goto redirty;
> 
>   /* Is this page beyond the end of the file? */
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> 

Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Andreas Dilger
On Mar 31, 2016, at 12:08 PM, J. Bruce Fields  wrote:
> 
> On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
>> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
>>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
 On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> Or is it ok that fallocate could block, potentially for a long time as
> we stream cows through the page cache (or however unshare works
> internally)?  Those same programs might not be expecting fallocate to
> take a long time.
 
 Yes, it's perfectly fine for fallocate to block for long periods of
 time. See what gfs2 does during preallocation of blocks - it ends up
 calling sb_issue_zerout() because it doesn't have unwritten
 extents, and hence can block for long periods of time
>>> 
>>> gfs2 fallocate is an implementation that will cause all but the most
>>> trivial users real pain.  Even the initial XFS implementation just
>>> marking the transactions synchronous made it unusable for all kinds
>>> of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
>>> to gfs2 will probably hang your connection for extended periods of
>>> time.
>>> 
>>> If we need to support something like what gfs2 does we should have a
>>> separate flag for it.
>> 
>> Using fallocate() for preallocation was always intended to
>> be a faster, more efficient method allocating zeroed space
>> than having userspace write blocks of data. Faster, more efficient
>> does not mean instantaneous, and gfs2 using sb_issue_zerout() means
>> that if the hardware has zeroing offloads (deterministic trim, write
>> same, etc) it will use them, and that will be much faster than
>> writing zeros from userspace.
>> 
>> IMO, what gfs2 is definitely within the intended usage of
>> fallocate() for accelerating the preallocation of blocks.
>> 
>> Yes, it may not be optimal for things like NFS servers which haven't
>> considered that a fallocate based offload operation might take some
>> time to execute, but that's not a problem with fallocate. i.e.
>> that's a problem with the nfs server ALLOCATE implementation not
>> being prepared to return NFSERR_JUKEBOX to prevent client side hangs
>> and timeouts while the operation is run
> 
> That's an interesting idea, but I don't think it's really legal.  I take
> JUKEBOX to mean "sorry, I'm failing this operation for now, try again
> later and it might succeed", not "OK, I'm working on it, try again and
> you may find out I've done it".
> 
> So if the client gets a JUKEBOX error but the server goes ahead and does
> the operation anyway, that'd be unexpected.

Well, the tape continued to be mounted in the background and/or the file
restored from the tape into the filesystem...

> I suppose it's comparable to the case where a slow fallocate is
> interrupted--would it be legal to return EINTR in that case and leave
> the application to sort out whether some part of the allocation had
> already happened?

If the later fallocate() was not re-doing the same work as the first one,
it should be fine for the client to re-send the fallocate() request.  The
fallocate() to reserve blocks does not touch the blocks that are already
allocated, so this is safe to do even if another process is writing to the
file.  If you have multiple processes writing and calling fallocate() with
PUNCH/ZERO/COLLAPSE/INSERT to overlapping regions at the same time then
the application is in for a world of hurt already.

> Would it be legal to continue the fallocate under the covers even after
> returning EINTR?

That might produce unexpected results in some cases, but it depends on
the options used.  Probably the safest is to not continue, and depend on
userspace to retry the operation on EINTR.  For fallocate() doing prealloc
or punch or zero this should eventually complete even if it is slow.

Cheers, Andreas

> But anyway my first inclination is to say that the NFS FALLOCATE
> protocol just wasn't designed to handle long-running fallocates, and if
> we really need that then we need to give it a way to either report
> partial results or to report results asynchronously.
> 
> --b.








Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Andreas Dilger
On Mar 31, 2016, at 1:55 AM, Christoph Hellwig  wrote:
> 
> On Wed, Mar 30, 2016 at 05:32:42PM -0700, Liu Bo wrote:
>> Well, btrfs fallocate doesn't allocate space if it's a shared one
>> because it thinks the space is already allocated.  So a later overwrite
>> over this shared extent may hit enospc errors.
> 
> And this makes it an incorrect implementation of posix_fallocate,
> which glibcs implements using fallocate if available.

It isn't really useful for a COW filesystem to implement fallocate()
to reserve blocks.  Even if it did allocate all of the blocks on the
initial fallocate() call, when it comes time to overwrite these blocks
new blocks need to be allocated as the old ones will not be overwritten.

Because of snapshots that could hold references to the old blocks,
there isn't even the guarantee that the previous fallocated blocks will
be released in a reasonable time to free up an equal amount of space.

Cheers, Andreas









Re: [PATCH v7 0/4] VFS: In-kernel copy system call

2015-10-24 Thread Andreas Dilger

> On Oct 24, 2015, at 10:52 AM, Eric Biggers  wrote:
> 
> A few comments:
> 
>>  if (!(file_in->f_mode & FMODE_READ) ||
>>  !(file_out->f_mode & FMODE_WRITE) ||
>>  (file_out->f_flags & O_APPEND) ||
>>  !file_out->f_op)
>>  return -EBADF;
> 
> Isn't 'f_op' always non-NULL?
> 
> If the destination file cannot be append-only, shouldn't this be documented?

Actually, wouldn't O_APPEND only be a problem if the target file wasn't
being appended to?  In other words, if the target i_size == start offset
then it should be possible to use the copy syscall on an O_APPEND file.

Cheers, Andreas

>>  if (inode_in->i_sb != inode_out->i_sb ||
>>  file_in->f_path.mnt != file_out->f_path.mnt)
>>  return -EXDEV;
> 
> Doesn't the same mount already imply the same superblock?
> 
>> /*
>> * copy_file_range() differs from regular file read and write in that it
>> * specifically allows return partial success.  When it does so is up to
>> * the copy_file_range method.
>> */
> 
> What does this mean?  I thought that read() and write() can also return 
> partial
> success.
> 
>>  f_out = fdget(fd_out);
>>  if (!f_in.file || !f_out.file) {
>>  ret = -EBADF;
>>  goto out;
>>  }
> 
> This looked wrong at first because it may call fdput() on a 'struct fd' that 
> was
> not successfully acquired, but it looks like it works currently because of how
> the FDPUT_FPUT flag is used.  It may be a good idea to write it the "obvious"
> way, though (use separate labels depending on which fdput()s need to happen).
> 
> 
> Other questions:
> 
> Should FMODE_PREAD or FMODE_PWRITE access be checked if the user specifies 
> their
> own 'off_in' or 'off_out', respectively?
> 
> What is supposed to happen if the user passes provides a file descriptor to a
> non-regular file, such as a block device or char device?
> 
> If the 'in' file has fewer than 'len' bytes remaining until EOF, what is the
> expected behavior?  It looks like the btrfs implementation has different
> behavior from the pagecache implementation.
> 
> It appears the btrfs implementation has alignment restrictions --- where is 
> this
> documented and how will users know what alignment to use?
> 
> Are copies within the same file permitted and can the ranges overlap?  The man
> page doesn't say.
> 
> It looks like the initial patch defines __NR_copy_file_range for the ARM
> architecture but doesn't actually hook that system call up for ARM; why is 
> that?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas









Re: [PATCH v6 5/4] copy_file_range.2: New page documenting copy_file_range()

2015-10-16 Thread Andreas Dilger

> On Oct 16, 2015, at 3:08 PM, Anna Schumaker  wrote:
> 
> copy_file_range() is a new system call for copying ranges of data
> completely in the kernel.  This gives filesystems an opportunity to
> implement some kind of "copy acceleration", such as reflinks or
> server-side-copy (in the case of NFS).
> 
> Signed-off-by: Anna Schumaker 
> Reviewed-by: Darrick J. Wong 
> ---
> v6:
> - Updates for removing most flags
> ---
> man2/copy_file_range.2 | 204 +
> man2/splice.2  |   1 +
> 2 files changed, 205 insertions(+)
> create mode 100644 man2/copy_file_range.2
> 
> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> new file mode 100644
> index 000..6c52c85
> --- /dev/null
> +++ b/man2/copy_file_range.2
> @@ -0,0 +1,204 @@
> +.\"This manpage is Copyright (C) 2015 Anna Schumaker 
> 
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of
> +.\" this manual under the conditions for verbatim copying, provided that
> +.\" the entire resulting derived work is distributed under the terms of
> +.\" a permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume.
> +.\" no responsibility for errors or omissions, or for damages resulting.
> +.\" from the use of the information contained herein.  The author(s) may.
> +.\" not have taken the same level of care in the production of this.
> +.\" manual, which is licensed free of charge, as they might when working.

Is there a reason why every. one. of. those. lines. ends. in. a. period?
I don't think that is needed for nroff, and other paragraphs would support
that conclusion.

> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH COPY 2 2015-10-16 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +copy_file_range \- Copy a range of data from one file to another
> +.SH SYNOPSIS
> +.nf
> +.B #include 
> +.B #include 
> +.B #include 
> +
> +.BI "ssize_t copy_file_range(int " fd_in ", loff_t *" off_in ", int " fd_out 
> ",
> +.BI "loff_t *" off_out ", size_t " len \
> +", unsigned int " flags );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR copy_file_range ()
> +system call performs an in-kernel copy between two file descriptors
> +without the additional cost of transferring data from the kernel to userspace
> +and then back into the kernel.
> +It copies up to
> +.I len
> +bytes of data from file descriptor
> +.I fd_in
> +to file descriptor
> +.IR fd_out ,
> +overwriting any data that exists within the requested range of the target 
> file.
> +
> +The following semantics apply for
> +.IR off_in ,
> +and similar statements apply to
> +.IR off_out :
> +.IP * 3
> +If
> +.I off_in
> +is NULL, then bytes are read from
> +.I fd_in
> +starting from the current file offset, and the offset is
> +adjusted by the number of bytes copied.
> +.IP *
> +If
> +.I off_in
> +is not NULL, then
> +.I off_in
> +must point to a buffer that specifies the starting
> +offset where bytes from
> +.I fd_in
> +will be read.  The current file offset of
> +.I fd_in
> +is not changed, but
> +.I off_in
> +is adjusted appropriately.
> +.PP
> +
> +The
> +.I flags
> +argument can have the following flag set:
> +.TP 1.9i
> +.B COPY_FR_REFLINK
> +Create a lightweight "reflink", where data is not copied until
> +one of the files is modified.

This is a circular definition.  Something like:

 Create a lightweight reference to the data blocks in the original
 file, where data is not copied until one of the files is modified.

although I'm not sure if "lightweight" is really valuable there.

> +.PP
> +The default behavior
> +.RI ( flags
> +== 0) is to perform a full data copy of the requested range.
> +.SH RETURN VALUE
> +Upon successful completion,
> +.BR copy_file_range ()
> +will return the number of bytes copied between files.
> +This could be less than the length originally requested.

This is a bit vague.  When COPY_FR_REFLINK is used, no data is "copied",
per se, but I doubt that "0" would be returned in that case either.  It
probably makes sense to write something like:

   ... return the number of bytes accessible in the target file.

or maybe (s/accessible/transferred/) or

   ... return the number of bytes added to the target file.

or similar.

> +
> +On error,
> +.BR copy_file_range ()
> +returns \-1 and
> +.I errno
> +is set to indicate the error.
> +.SH ERRORS
> +.TP

Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-04 Thread Andreas Dilger
On Sep 4, 2015, at 2:16 PM, Anna Schumaker  wrote:
> 
> Copy system calls came up during Plumbers a couple of weeks ago,
> because several filesystems (including NFS and XFS) are currently
> working on copy acceleration implementations.  We haven't heard from
> Zach Brown in a while, so I volunteered to push his patches upstream
> so individual filesystems don't need to keep writing their own ioctls.
> 
> The first three patches are a simple reposting of Zach's patches
> from several months ago, with one minor error code fix.  The remaining
> patches add in a fallback mechanism when filesystems don't provide a
> copy function.  This is especially useful when doing a server-side
> copy on NFS (using the new COPY operation in NFS v4.2).  This fallback
> can be disabled by passing the flag COPY_REFLINK to the system call.
> 
> The last patch is a man page patch documenting this new system call,
> including an example program.
> 
> I tested the fallback option by using /dev/urandom to generate files
> of varying sizes and copying them.  I compared the time to copy
> against that of `cp` just to see if there is a noticable difference.
> I found that runtimes are roughly the same, but in-kernel copy tends
> to use less of the cpu.  Values in the tables below are averages
> across multiple trials.
> 
> 
> /usr/bin/cp |   512 MB  |   1024 MB |   1536 MB |   2048 MB
> -|---|---|---|---
>   user  |   0.00s   |   0.00s   |   0.00s   |   0.00s
> system  |   0.32s   |   0.52s   |   1.04s   |   1.04s
>cpu  | 73%   | 69%   | 62%   | 62%
>  total  |   0.446   |   0.757   |   1.197   |   1.667
> 
> 
>   VFS copy  |   512 MB  |   1024 MB |   1536 MB |   2048 MB
> -|---|---|---|---
>   user  |   0.00s   |   0.00s   |   0.00s   |  0.00s
> system  |   0.33s   |   0.49s   |   0.76s   |  0.99s
>cpu  | 77%   | 62%   | 60%   |59%
>  total  |   0.422   |   0.777   |   1.267   |  1.655
> 
> 
> Questions?  Comments?  Thoughts?

This is a bit of a surprising result, since in my testing in the
past, copy_{to/from}_user() is a major consumer of CPU time (50%
of a CPU core at 1GB/s).  What backing filesystem did you test on?

In theory, the VFS copy routines should save at least 50% of the
CPU usage since it only needs to make one copy (src->dest) instead
of two (kernel->user, user->kernel).  Ideally it wouldn't make any
data copies at all and just pass page references from the source
to the target.

Cheers, Andreas
> 
> Anna
> 
> 
> Anna Schumaker (5):
>  btrfs: Add mountpoint checking during btrfs_copy_file_range
>  vfs: Remove copy_file_range mountpoint checks
>  vfs: Copy should check len after file open mode
>  vfs: Copy should use file_out rather than file_in
>  vfs: Fall back on splice if no copy function defined
> 
> Zach Brown (3):
>  vfs: add copy_file_range syscall and vfs helper
>  x86: add sys_copy_file_range to syscall tables
>  btrfs: add .copy_file_range file operation
> 
> arch/x86/entry/syscalls/syscall_32.tbl |   1 +
> arch/x86/entry/syscalls/syscall_64.tbl |   1 +
> fs/btrfs/ctree.h   |   3 +
> fs/btrfs/file.c|   1 +
> fs/btrfs/ioctl.c   |  95 ++--
> fs/read_write.c| 132 +
> include/linux/copy.h   |   6 ++
> include/linux/fs.h |   3 +
> include/uapi/asm-generic/unistd.h  |   4 +-
> include/uapi/linux/Kbuild  |   1 +
> include/uapi/linux/copy.h  |   6 ++
> kernel/sys_ni.c|   1 +
> 12 files changed, 214 insertions(+), 40 deletions(-)
> create mode 100644 include/linux/copy.h
> create mode 100644 include/uapi/linux/copy.h
> 
> -- 
> 2.5.1
> 


Cheers, Andreas







Re: [PATCH v1 9/8] copy_file_range.2: New page documenting copy_file_range()

2015-09-04 Thread Andreas Dilger
On Sep 4, 2015, at 3:38 PM, Darrick J. Wong  wrote:
> 
> On Fri, Sep 04, 2015 at 04:17:03PM -0400, Anna Schumaker wrote:
>> copy_file_range() is a new system call for copying ranges of data
>> completely in the kernel.  This gives filesystems an opportunity to
>> implement some kind of "copy acceleration", such as reflinks or
>> server-side-copy (in the case of NFS).
>> 
>> Signed-off-by: Anna Schumaker 
>> ---
>> man2/copy_file_range.2 | 168 
>> +
>> 1 file changed, 168 insertions(+)
>> create mode 100644 man2/copy_file_range.2
>> 
>> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
>> new file mode 100644
>> index 000..4a4cb73
>> --- /dev/null
>> +++ b/man2/copy_file_range.2
>> @@ -0,0 +1,168 @@
>> +.\"This manpage is Copyright (C) 2015 Anna Schumaker 
>> 
>> +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +copy_file_range \- Copy a range of data from one file to another
>> +.SH SYNOPSIS
>> +.nf
>> +.B #include <sys/syscall.h>
>> +.B #include <sys/types.h>
>> +.B #include <unistd.h>
>> +
>> +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
>> +.BI "int " fd_out ", loff_t * " off_out ", size_t " len ",
>> +.BI "unsigned int " flags );
>> +.fi
>> +.SH DESCRIPTION
>> +The
>> +.BR copy_file_range ()
>> +system call performs an in-kernel copy between two file descriptors
>> +without all that tedious mucking about in userspace.
> 
> ;)
> 
>> +It copies up to
>> +.I len
>> +bytes of data from file descriptor
>> +.I fd_in
>> +to file descriptor
>> +.I fd_out
>> +at
>> +.IR off_out .
>> +The file descriptors must not refer to the same file.
> 
> Why?  btrfs (and XFS) reflink can handle the case of a file sharing blocks
> with itself.
> 
>> +
>> +The following semantics apply for
>> +.IR fd_in ,
>> +and similar statements apply to
>> +.IR off_out :
>> +.IP * 3
>> +If
>> +.I off_in
>> +is NULL, then bytes are read from
>> +.I fd_in
>> +starting from the current file offset and the current
>> +file offset is adjusted appropriately.
>> +.IP *
>> +If
>> +.I off_in
>> +is not NULL, then
>> +.I off_in
>> +must point to a buffer that specifies the starting
>> +offset where bytes from
>> +.I fd_in
>> +will be read.  The current file offset of
>> +.I fd_in
>> +is not changed, but
>> +.I off_in
>> +is adjusted appropriately.
>> +.PP
>> +The default behavior of
>> +.BR copy_file_range ()
>> +is filesystem specific, and might result in creating a
>> +copy-on-write reflink.
>> +In the event that a given filesystem does not implement
>> +any form of copy acceleration, the kernel will perform
>> +a deep copy of the requested range by reading bytes from
> 
> I wonder if it's wise to allow deep copies -- what happens if
> len == 1T? Will this syscall just block for a really long time?

It should be interruptible, and return the length of the number of
bytes copied so far, just like read() and write().  That allows
the caller to continue where it left off, or abort and delete the
target file, or whatever it wants to do.

Cheers, Andreas

>> +.I fd_in
>> +and writing them to
>> +.IR fd_out .
> 
> "...if COPY_REFLINK is not set in flags."
> 
>> +
>> +Currently, Linux only supports the following flag:
>> +.TP 1.9i
>> +.B COPY_REFLINK
>> +Only perform the copy if the filesystem can do it as a reflink.
>> +Do not fall back on performing a deep copy.
>> +.SH RETURN VALUE
>> +Upon successful completion,
>> +.BR copy_file_range ()
>> +will return the number of bytes copied between files.
>> +This could be less than the length originally requested.
>> +
>> +On error,
>> +.BR copy_file_range ()
>> +returns \-1 and
>> +.I errno
>> +is set to indicate the error.
>> +.SH ERRORS
>> +.TP
>> +.B EBADF
>> +One or more file descriptors are not valid,
>> +or do not have proper read-write mode.
> 
> "or fd_out is not opened for writing"?
> 
>> +.TP
>> +.B EINVAL
>> +Requested range extends beyond the end of the file;
>> +.I flags
>> +argument is set to an invalid value.
>> +.TP
>> +.B EOPNOTSUPP
>> +.B COPY_REFLINK
>> +was specified in
>> +.IR flags ,
>> +but the target filesystem does not support reflinks.
>> +.TP
>> +.B EXDEV
>> +Target filesystem doesn't support cross-filesystem copies.
>> +.SH VERSIONS
> 
> Perhaps this ought to list a few more errors (EIO, ENOSPC, ENOSYS, EPERM...)
> that can be returned?  (I was looking at the fallocate manpage.)
> 
> --D
> 
>> +The
>> +.BR copy_file_range ()
>> +system call first appeared in Linux 4.3.
>> +.SH CONFORMING TO
>> +The
>> +.BR copy_file_range ()
>> +system call is a nonstandard Linux extension.
>> +.SH EXAMPLE
>> +.nf
>> +
>> +#define _GNU_SOURCE
>> +#include <fcntl.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <sys/stat.h>
>> +#include <sys/syscall.h>
>> +#include <sys/types.h>
>> +#include <unistd.h>
>> +
>> +
>> +int main(int argc, char **argv)
>> +{
>> +int fd_in, fd_out;
>> +struct stat stat;
>> +loff_t len, ret;
>> +
>> +if (argc != 3) {
>> + 

Re: [RFC 0/8] Allow GFP_NOFS allocation to fail

2015-08-05 Thread Andreas Dilger
On Aug 5, 2015, at 3:51 AM, mho...@kernel.org wrote:
 Hi,
 small GFP_NOFS, like GFP_KERNEL, allocations have not been failing
 traditionally even though their reclaim capabilities are restricted
 because the VM code cannot recurse into filesystems to clean dirty
 pages. At the same time these allocation requests do not allow to
 trigger the OOM killer because that would lead to pre-mature OOM killing
 during heavy fs metadata workloads.
 
 This leaves the VM code in an unfortunate situation where GFP_NOFS
 requests is looping inside the allocator relying on somebody else to
 make a progress on its behalf. This is prone to deadlocks when the
 request is holding resources which are necessary for other task to make
 a progress and release memory (e.g. OOM victim is blocked on the lock
 held by the NONFS request). Another drawback is that the caller of
 the allocator cannot define any fallback strategy because the request
 doesn't fail.
 
 As the VM cannot do much about these requests we should face the reality
 and allow those allocations to fail. Johannes has already posted the
 patch which does that (http://marc.info/?l=linux-mm&m=142726428514236&w=2)
 but the discussion died pretty quickly.
 
 I was playing with this patch and xfs, ext[34] and btrfs for a while
 to see what is the effect under heavy memory pressure. As expected
 this led to some fallouts.
 
 My test consisted of a simple memory hog which allocates a lot of
 anonymous memory and writes to a fs mainly to trigger a fs activity on
 exit. In parallel there is a parallel fs metadata load (multiple tasks
 creating thousands of empty files and directories). All is running
 in a VM with small amount of memory to emulate an under provisioned
 system. The metadata load is triggering a sufficient load to invoke
 the direct reclaim even without the memory hog. The memory hog forks
 several tasks sharing the VM and OOM killer manages to kill it without 
 locking up the system (this was based on the test case from Tetsuo
 Handa - http://www.spinics.net/lists/linux-fsdevel/msg82958.html -
 I just didn't want to kill my machine ;)).
 
 With all the patches applied none of the 4 filesystems gets aborted
 transactions and RO remount (well xfs didn't need any special
 treatment). This is obviously not sufficient to claim that failing
 GFP_NOFS is OK now but I think it is a good start for the further
 discussion. I would be grateful if FS people could have a look at
 those patches.  I have simply used __GFP_NOFAIL in the critical paths. 
 This might be not the best strategy but it sounds like a good first
 step.
 
 The first patch in the series also allows __GFP_NOFAIL allocations to
 access memory reserves when the system is OOM which should help those
 requests to make a forward progress - especially in combination with
 GFP_NOFS.
 
 The second patch tries to address a potential pre-mature OOM killer
 from the page fault path. I have posted it separately but it didn't
 get much traction.
 
 The third patch allows GFP_NOFS to fail and I believe it should see
 much more testing coverage. It would be really great if it could sit
 in the mmotm tree for few release cycles so that we can catch more
 fallouts.
 
 The rest are the FS specific patches to fortify allocations
 requests which are really needed to finish transactions without RO
 remounts. There might be more needed but my test case survives with
 these in place.

Wouldn't it make more sense to order the fs-specific patches _before_
the GFP_NOFS can fail patch (#3), so that once that patch is applied
all known failures have already been fixed?  Otherwise it could show
test failures during bisection that would be confusing.

Cheers, Andreas

 They would obviously need some rewording if they are going to be
 applied even without Patch3 and I will do that if respective
 maintainers will take them. Ext3 and JBD are going away soon so they
 might be dropped but they have been in the tree while I was testing
 so I've kept them.
 
 Thoughts? Opinions?
 


Cheers, Andreas







Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-10 Thread Andreas Dilger
On Apr 10, 2015, at 4:00 PM, Zach Brown z...@redhat.com wrote:
 
 Add a copy_file_range() system call for offloading copies between
 regular files.
 
 This gives an interface to underlying layers of the storage stack which
 can copy without reading and writing all the data.  There are a few
 candidates that should support copy offloading in the nearer term:
 
 - btrfs shares extent references with its clone ioctl
 - NFS has patches to add a COPY command which copies on the server
 - SCSI has a family of XCOPY commands which copy in the device
 
 This system call avoids the complexity of also accelerating the creation
 of the destination file by operating on an existing destination file
 descriptor, not a path.
 
 Currently the high level vfs entry point limits copy offloading to files
 on the same mount and super (and not in the same file).  This can be
 relaxed if we get implementations which can copy between file systems
 safely.
 
 Signed-off-by: Zach Brown z...@redhat.com
 ---
 fs/read_write.c   | 129 ++
 include/linux/fs.h|   3 +
 include/uapi/asm-generic/unistd.h |   4 +-
 kernel/sys_ni.c   |   1 +
 4 files changed, 136 insertions(+), 1 deletion(-)
 
 diff --git a/fs/read_write.c b/fs/read_write.c
 index 8e1b687..c65ce1d 100644
 --- a/fs/read_write.c
 +++ b/fs/read_write.c
 @@ -17,6 +17,7 @@
 #include <linux/pagemap.h>
 #include <linux/splice.h>
 #include <linux/compat.h>
 +#include <linux/mount.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
 @@ -1424,3 +1425,131 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, 
 in_fd,
   return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 #endif
 +
 +/*
 + * copy_file_range() differs from regular file read and write in that it
 + * specifically allows return partial success.  When it does so is up to
 + * the copy_file_range method.
 + */
 +ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 + struct file *file_out, loff_t pos_out,
 + size_t len, int flags)

Minor nit - flags should be unsigned int to match the syscall.

 +{
 + struct inode *inode_in;
 + struct inode *inode_out;
 + ssize_t ret;
 +
 + if (flags)
 + return -EINVAL;
 +
 + if (len == 0)
 + return 0;
 +
 + /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */

This says ssize_t, but the len parameter is size_t...

 + ret = rw_verify_area(READ, file_in, &pos_in, len);
 + if (ret >= 0)
 + ret = rw_verify_area(WRITE, file_out, &pos_out, len);
 + if (ret < 0)
 + return ret;
 +
 + if (!(file_in->f_mode & FMODE_READ) ||
 + !(file_out->f_mode & FMODE_WRITE) ||
 + (file_out->f_flags & O_APPEND) ||
 + !file_in->f_op || !file_in->f_op->copy_file_range)
 + return -EINVAL;
 +
 + inode_in = file_inode(file_in);
 + inode_out = file_inode(file_out);
 +
 + /* make sure offsets don't wrap and the input is inside i_size */
 + if (pos_in + len < pos_in || pos_out + len < pos_out ||
 + pos_in + len > i_size_read(inode_in))
 + return -EINVAL;
 +
 + /* this could be relaxed once a method supports cross-fs copies */
 + if (inode_in->i_sb != inode_out->i_sb ||
 + file_in->f_path.mnt != file_out->f_path.mnt)
 + return -EXDEV;
 +
 + /* forbid ranges in the same file */
 + if (inode_in == inode_out)
 + return -EINVAL;
 +
 + ret = mnt_want_write_file(file_out);
 + if (ret)
 + return ret;
 +
 + ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
 +  len, flags);
 + if (ret > 0) {
 + fsnotify_access(file_in);
 + add_rchar(current, ret);
 + fsnotify_modify(file_out);
 + add_wchar(current, ret);
 + }
 + inc_syscr(current);
 + inc_syscw(current);
 +
 + mnt_drop_write_file(file_out);
 +
 + return ret;
 +}
 +EXPORT_SYMBOL(vfs_copy_file_range);
 +
 +SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
 + int, fd_out, loff_t __user *, off_out,
 + size_t, len, unsigned int, flags)
 +{
 + loff_t pos_in;
 + loff_t pos_out;
 + struct fd f_in;
 + struct fd f_out;
 + ssize_t ret;
 +
 + f_in = fdget(fd_in);
 + f_out = fdget(fd_out);
 + if (!f_in.file || !f_out.file) {
 + ret = -EBADF;
 + goto out;
 + }
 +
 + ret = -EFAULT;
 + if (off_in) {
 + if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
 + goto out;
 + } else {
 + pos_in = f_in.file->f_pos;
 + }
 +
 + if (off_out) {
 + if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
 + goto out;
 + } else {
 + pos_out = f_out.file->f_pos;
 + }
 +
 + ret = 

Re: Documenting MS_LAZYTIME

2015-02-20 Thread Andreas Dilger
On Feb 20, 2015, at 1:50 AM, Michael Kerrisk mtk.manpa...@gmail.com wrote:
 
 Hello Ted,
 
 Based on your commit message 0ae45f63d4e, I wrote the documentation
 below for MS_LAZYTIME, to go into the mount(2) man page. Could you
 please check it over and let me know if it's accurate. In particular,
 I added pieces marked with * below that were not part of the commit
 message and I'd like confirmation that they're accurate.
 
 Thanks,
 
 Michael
 
 [[
   MS_LAZYTIME (since Linux 3.20)
  Only  update  filetimes (atime, mtime, ctime) on the in-
  memory version of the file  inode.   The  on-disk  time‐
  stamps are updated only when:
 
  (a)  the inode needs to be updated for some change unre‐
   lated to file timestamps;
 
  (b)  the application  employs  fsync(2),  syncfs(2),  or
   sync(2);
 
  (c)  an undeleted inode is evicted from memory; or
 
 * (d)  more than 24 hours have passed since the i-node was
 *  written to disk.
 
  This mount option significantly reduces  writes  to  the
  inode  table  for workloads that perform frequent random
  writes to preallocated files.
 
 * As at Linux 3.20, this option is supported only on ext4.

I _think_ that the lazytime mount option is generic for all filesystems.
I believe ext4 has an extra optimization for it, but that's it.

Cheers, Andreas







Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option

2014-12-02 Thread Andreas Dilger
On Dec 2, 2014, at 12:23 PM, Theodore Ts'o ty...@mit.edu wrote:
 On Tue, Dec 02, 2014 at 07:55:48PM +0200, Boaz Harrosh wrote:
 
 This I do not understand. I thought that I_DIRTY_TIME, and the all
 lazytime mount option, is only for atime. So if there are dirty
  pages then there are also m/ctime that changed and surely we want to
 write these times to disk ASAP.
 
 What are the situations where you are most concerned about mtime or
 ctime being accurate after a crash?
 
 I've been running with it on my laptop for a while now, and it's
 certainly not a problem for build trees; remember, whenever you need
 to update the inode to update i_blocks or i_size, the inode (with its
 updated timestamps) will be flushed to disk anyway.
[snip]
 I'm not aware of an application which is doing a large number of
 non-allocating random writes (for example, such as a database), where
 said database actually cares about mtime being correct.
[snip]
 Did you have such a use case or application in mind?


One thing that comes to mind is touch/utimes()/utimensat().  Those
should definitely not result in timestamps being kept only in memory
for 24h, since the whole point of those calls is to update the times.
It makes sense for these APIs to dirty the inode for proper writeout.

Cheers, Andreas

 if we are lazytime also with m/ctime then I think I would like an
 option for only atime lazy. because m/ctime is cardinal to some
 operations even though I might want atime lazy.
 
 If there's a sufficiently compelling use case where we do actually
 care about mtime/ctime being accurate, and the current semantics don't
 provide enough of a guarantee, it's certainly something we could do.
 I'd rather keep things simple unless it's really there.  (After all,
 we did create the strictatime mount option, but I'm not sure anyone
 ever ends up using it.  It would be a shame if we created a
 strictcmtime, which had the same usage rate.)
 
 I'll also note that if it's only about atime updates, with the default
 relatime mount option, I'm not sure there's enough of a win to have a
 mode to justify a lazyatime only option.  If you really need strict
 c/mtime after a crash, maybe the best thing to do is to just simply
 not use the lazytime mount option and be done with it.
 
 Cheers,
 
   - Ted


Cheers, Andreas







Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option

2014-11-26 Thread Andreas Dilger
On Nov 26, 2014, at 3:48 PM, Dave Chinner da...@fromorbit.com wrote:
 
 On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote:
 Add an optimization for the MS_LAZYTIME mount option so that we will
 opportunistically write out any inodes with the I_DIRTY_TIME flag set
 in a particular inode table block when we need to update some inode
 in that inode table block anyway.
 
 Also add some temporary code so that we can set the lazytime mount
 option without needing a modified /sbin/mount program which can set
 MS_LAZYTIME.  We can eventually make this go away once util-linux has
 added support.
 
 Google-Bug-Id: 18297052
 
 Signed-off-by: Theodore Ts'o ty...@mit.edu
 ---
 fs/ext4/inode.c | 49 
 ++---
 fs/ext4/super.c |  9 +
 include/trace/events/ext4.h | 30 +++
 3 files changed, 85 insertions(+), 3 deletions(-)
 
 diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
 index 5653fa4..8308c82 100644
 --- a/fs/ext4/inode.c
 +++ b/fs/ext4/inode.c
 @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle,
 }
 
 /*
 + * Opportunistically update the other time fields for other inodes in
 + * the same inode table block.
 + */
 +static void ext4_update_other_inodes_time(struct super_block *sb,
 +  unsigned long orig_ino, char *buf)
 +{
 +struct ext4_inode_info  *ei;
 +struct ext4_inode   *raw_inode;
 +unsigned long   ino;
 +struct inode*inode;
 +int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
 +int inode_size = EXT4_INODE_SIZE(sb);
 +
 +ino = orig_ino & ~(inodes_per_block - 1);
 +for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
 +if (ino == orig_ino)
 +continue;
 +inode = find_active_inode_nowait(sb, ino);
 +if (!inode ||
 +(inode->i_state & I_DIRTY_TIME) == 0 ||
 +!spin_trylock(&inode->i_lock)) {
 +iput(inode);
 +continue;
 +}
 +inode->i_state &= ~I_DIRTY_TIME;
 +inode->i_ts_dirty_day = 0;
 +spin_unlock(&inode->i_lock);
 +inode_requeue_dirtytime(inode);
 +
 +ei = EXT4_I(inode);
 +raw_inode = (struct ext4_inode *) buf;
 +
 +spin_lock(&ei->i_raw_lock);
 +EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
 +EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
 +EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
 +ext4_inode_csum_set(inode, raw_inode, ei);
 +spin_unlock(&ei->i_raw_lock);
 +trace_ext4_other_inode_update_time(inode, orig_ino);
 +iput(inode);
 +}
 +}
 
 Am I right in that this now does unlogged timestamp updates of
 inodes? What happens when that buffer gets overwritten by log
 recover after a crash? The timestamp updates get lost?
 
 FYI, XFS has had all sorts of nasty log recovery corner cases
 caused by log recovery overwriting non-logged inode updates like
 this. In the past few years we've removed every single non-logged
 inode update optimisation so that all changes (including timestamps)
 are transactional so inode state on disk not matching what log
 recovery wrote to disk for all the other inode metadata...
 
 Optimistic unlogged inode updates are a slippery slope, and history
 tells me that it doesn't lead to a nice place

Since ext4/jbd2 is logging the whole block, unlike XFS which is doing
logical journaling, this isn't an unlogged update.  It is just taking
advantage of the fact that the whole block is going to be logged and
written to the disk anyway.  If the only update needed for other inodes
in the block is the timestamp then they may as well be flushed to disk
at the same time and avoid the need for another update later on.

Cheers, Andreas







Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

2014-11-21 Thread Andreas Dilger
On Nov 21, 2014, at 1:59 PM, Theodore Ts'o ty...@mit.edu wrote:
 Guarantee that the on-disk timestamps will be no more than 24 hours
 stale.
 
 Signed-off-by: Theodore Ts'o ty...@mit.edu
 ---
 fs/fs-writeback.c  | 1 +
 fs/inode.c | 7 ++-
 include/linux/fs.h | 1 +
 3 files changed, 8 insertions(+), 1 deletion(-)
 
 diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
 index ce7de22..eb04277 100644
 --- a/fs/fs-writeback.c
 +++ b/fs/fs-writeback.c
 @@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
   if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
   trace_writeback_dirty_inode_start(inode, flags);
 
 + inode->i_ts_dirty_day = 0;
   if (sb->s_op->dirty_inode)
   sb->s_op->dirty_inode(inode, flags);
 
 diff --git a/fs/inode.c b/fs/inode.c
 index 6e91aca..f0d6232 100644
 --- a/fs/inode.c
 +++ b/fs/inode.c
 @@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, 
 struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
 + int days_since_boot = jiffies / (HZ * 86400);
   int ret;
 
   if (inode->i_op->update_time) {
 @@ -1527,12 +1528,16 @@ static int update_time(struct inode *inode, struct 
  timespec *time, int flags)
   if (flags & S_MTIME)
   inode->i_mtime = *time;
   }
 - if (inode->i_sb->s_flags & MS_LAZYTIME) {
 + if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
 + (!inode->i_ts_dirty_day ||
 +  inode->i_ts_dirty_day == days_since_boot)) {
   spin_lock(&inode->i_lock);
   inode->i_state |= I_DIRTY_TIME;
   spin_unlock(&inode->i_lock);
 + inode->i_ts_dirty_day = days_since_boot;

It isn't clear if this is correct.  It looks like the condition will
only be entered if i_ts_dirty_day == days_since_boot, but that is only
set inside the condition?  Shouldn't this be:

inode->i_ts_dirty_day = ~0U;

if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
    inode->i_ts_dirty_day != days_since_boot) {
	spin_lock(&inode->i_lock);
	inode->i_state |= I_DIRTY_TIME;
	spin_unlock(&inode->i_lock);
	inode->i_ts_dirty_day = days_since_boot;
}

and days_since_boot should be declared unsigned short so it wraps
in the same way as i_ts_dirty_day, maybe using typeof(i_ts_dirty_day)
so that there isn't any need to update this in the future if the type
changes.  Conceivably this could be an unsigned char, if that packs
into struct inode better, at the risk of losing a timestamp update on
an inode in cache for 256+ days and only modifying it 256 days later.

Cheers, Andreas

   return 0;
   }
 + inode->i_ts_dirty_day = 0;
   if (inode->i_op->write_time)
   return inode->i_op->write_time(inode);
   mark_inode_dirty_sync(inode);
 diff --git a/include/linux/fs.h b/include/linux/fs.h
 index 489b2f2..e3574cd 100644
 --- a/include/linux/fs.h
 +++ b/include/linux/fs.h
 @@ -575,6 +575,7 @@ struct inode {
   struct timespec i_ctime;
   spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
   unsigned short  i_bytes;
 + unsigned short  i_ts_dirty_day;
   unsigned inti_blkbits;
   blkcnt_ti_blocks;
 
 -- 
 2.1.0
 


Cheers, Andreas









Re: [PATCH 2/6] fiemap: add fe_phys_length and EXTENT_PHYS_LENGTH flag

2014-07-30 Thread Andreas Dilger

On Jul 30, 2014, at 11:18 AM, David Sterba dste...@suse.cz wrote:
 Add a new member to fiemap_extent that represents the physical extent
 length. This value is undefined if the flag EXTENT_PHYS_LENGTH is not
 set.

The description here of PHYS_LENGTH makes sense...

The patch description should also mention the name of the new member,
namely fe_phys_length

 +#define FIEMAP_EXTENT_PHYS_LENGTH	0x0010 /* Physical length of extent
 + * not the same as logical */

But the comment doesn't match.  This implies that if PHYS_LENGTH is
set, fe_phys_length != fe_logi_length, but I don't think that is
necessarily correct.  I think it makes more sense to just set
PHYS_LENGTH when fe_phys_length is valid, and if PHYS_LENGTH is not
set then fe_phys_length aware applications should just use
fe_phys_length = fe_logi_length, and older applications would just
use fe_length for both as they would already today.

Cheers, Andreas









Re: [PATCH 2/4 v3] fiemap: add EXTENT_DATA_COMPRESSED flag

2014-07-24 Thread Andreas Dilger

On Jul 24, 2014, at 1:22 PM, David Sterba dste...@suse.cz wrote:
 On Thu, Jul 17, 2014 at 12:07:57AM -0600, Andreas Dilger wrote:
 any progress on this patch series?
 
 I'm sorry I got distracted at the end of year and did not finish the
 series.
 
 I never saw an updated version of this patch series after the last round of
 reviews, but it would be great to move it forward.  I have filefrag patches
 in my e2fsprogs tree waiting for an updated version of your patch.
 
 I recall the main changes were:
 - add FIEMAP_EXTENT_PHYS_LENGTH flag to indicate if fe_phys_length was valid
 
 fe_phys_length will always be valid, so the other flags are set only if it's
 not equal to the logical length.
 
 - rename fe_length to fe_logi_length and #define fe_length fe_logi_length
 - always fill in fe_phys_length (= fe_logi_length for uncompressed files)
  and set FIEMAP_EXTENT_PHYS_LENGTH whether the extent is compressed or not
 
 This is my understanding and contradicts the first point.

I think Dave Chinner's former point was that having fe_phys_length validity
depend on FIEMAP_EXTENT_DATA_COMPRESSED is a non-intuitive interface.  It is
not true that fe_phys_length would always be valid, since that is not the
case for older kernels that currently always set this field to 0, so they
need some flag to indicate if fe_phys_length is valid.  Alternately,
userspace could do:

if (ext->fe_phys_length == 0)
	ext->fe_phys_length = ext->fe_logi_length;

but that pre-supposes that fe_phys_length == 0 is never a valid value when
fe_logi_length is non-zero, and this might introduce errors in some cases.
I could imagine that some compression methods might not allocate any space
at all if it was all zeroes, and just store a bit in the blockpointer or
extent, so having a separate FIEMAP_EXTENT_PHYS_LENGTH is probably safer
in the long run.  That opens up the question of whether a written zero
filled space that gets compressed away is different from a hole, but I'd
prefer to just return whatever the file mapping is than interpret it.

Cheers, Andreas

 - add WARN_ONCE() in fiemap_fill_next_extent() as described below
 
 I don't know if there was any clear statement about whether there should be
 separate FIEMAP_EXTENT_PHYS_LENGTH and FIEMAP_EXTENT_DATA_COMPRESSED flags,
 or if the latter should be implicit?  Probably makes sense to have separate
 flags.  It should be fine to use:
 
  #define FIEMAP_EXTENT_PHYS_LENGTH	0x0010
 
 since this flag was never used.
 
 I've kept only FIEMAP_EXTENT_DATA_COMPRESSED, I don't see a need for
 FIEMAP_EXTENT_PHYS_LENGTH and this would be yet another flag because the
 FIEMAP_EXTENT_DATA_ENCODED is also implied.
 
 I'll send V4, we can discuss the PHYS_LENGTH flag then.


Cheers, Andreas









Re: [PATCH 2/4 v3] fiemap: add EXTENT_DATA_COMPRESSED flag

2014-07-17 Thread Andreas Dilger
David,
any progress on this patch series?

I never saw an updated version of this patch series after the last round of
reviews, but it would be great to move it forward.  I have filefrag patches
in my e2fsprogs tree waiting for an updated version of your patch.

I recall the main changes were:
- add FIEMAP_EXTENT_PHYS_LENGTH flag to indicate if fe_phys_length was valid
- rename fe_length to fe_logi_length and #define fe_length fe_logi_length
- always fill in fe_phys_length (= fe_logi_length for uncompressed files)
  and set FIEMAP_EXTENT_PHYS_LENGTH whether the extent is compressed or not
- add WARN_ONCE() in fiemap_fill_next_extent() as described below

I don't know if there was any clear statement about whether there should be
separate FIEMAP_EXTENT_PHYS_LENGTH and FIEMAP_EXTENT_DATA_COMPRESSED flags,
or if the latter should be implicit?  Probably makes sense to have separate
flags.  It should be fine to use:

#define FIEMAP_EXTENT_PHYS_LENGTH   0x0010

since this flag was never used.

Cheers, Andreas

On Dec 12, 2013, at 5:02 PM, Andreas Dilger adil...@dilger.ca wrote:
 On Dec 12, 2013, at 4:24 PM, Dave Chinner da...@fromorbit.com wrote:
 On Thu, Dec 12, 2013 at 04:25:59PM +0100, David Sterba wrote:
 This flag was not accepted when fiemap was proposed [2] due to lack of
 in-kernel users. Btrfs has compression for a long time and we'd like to
 see that an extent is compressed in the output of 'filefrag' utility
 once it's taught about it.
 
 For that purpose, a reserved field from fiemap_extent is used to let the
 filesystem store along the physcial extent length when the flag is set.
 This keeps compatibility with applications that use FIEMAP.
 
 I'd prefer to just see the new physical length field always filled
 out, regardless of whether it is a compressed extent or not. In
 terms of backwards compatibility to userspace, it makes no
 difference because the value of reserved/unused fields is undefined
 by the API. Yes, the implementation zeros them, but there's nothing
 in the documentation that says reserved fields must be zero.
 Hence I think we should just set it for every extent.
 
 I'd actually thought the same thing while reading the patch, but I figured
 people would object because it implies that old kernels will return a
 physical length of 0 bytes (which might be valid) and badly-written tools
 will not work correctly on older kernels.  That said, applications _should_
 be checking the FIEMAP_EXTENT_DATA_COMPRESSED flag, and I suspect in the
 future fewer developers will be confused if fe_phys_length == fe_length
 going forward.
 
 If the initial tools get it right (in particular filefrag), then hopefully
 others will get it correct also.
 
 From the point of view of the kernel API (fiemap_fill_next_extent),
 passing the physical extent size in the len parameter for normal
 extents, then passing 0 for the physical length makes absolutely
 no sense.
 
 IOWs, what you have created is a distinction between the extent's
 logical length and it's physical length. For uncompressed
 extents, they are both equal and they should both be passed to
 fiemap_fill_next_extent as the same value. Extents where they are
 different (i.e.  encoded extents) is when they can be different.
 Perhaps fiemap_fill_next_extent() should check and warn about
 mismatches when they differ and the relevant flags are not set...
 
 Seems reasonable to have a WARN_ONCE() in that case.  That would catch bugs
 in the filesystem code as well:
 
   WARN_ONCE(phys_len != lgcl_len &&
	     !(flags & FIEMAP_EXTENT_DATA_COMPRESSED),
	     "physical len %llu != logical length %llu without "
	     "DATA_COMPRESSED\n", phys_len, logical_len);
 
 diff --git a/include/uapi/linux/fiemap.h b/include/uapi/linux/fiemap.h
 index 93abfcd..0e32cae 100644
 --- a/include/uapi/linux/fiemap.h
 +++ b/include/uapi/linux/fiemap.h
  @@ -19,7 +19,9 @@ struct fiemap_extent {
  	__u64 fe_physical; /* physical offset in bytes for the start
  			    * of the extent from the beginning of the disk */
  	__u64 fe_length;   /* length in bytes for this extent */
  -	__u64 fe_reserved64[2];
  +	__u64 fe_phys_length; /* physical length in bytes, undefined if
  +			       * DATA_COMPRESSED not set */
  +	__u64 fe_reserved64;
  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
  	__u32 fe_reserved[3];
  };
 
 The comment for fe_length needs to change, too, because it needs to
 indicate that it is the logical extent length and that it may be
 different to the fe_phys_length depending on the flags that are set
 on the extent.
 
 Would it make sense to rename fe_length to fe_logi_length (or something,
 I'm open to suggestions), and have a compat macro:
 
 #define fe_length fe_logi_length
 
 around for older applications?  That way, new developers would start to
 use the new name, old applications would still compile for both newer and
older interfaces, and it doesn't affect the ABI at all.

Re: [PATCH 3/4 v3] btrfs: set FIEMAP_EXTENT_DATA_COMPRESSED for compressed extents

2013-12-12 Thread Andreas Dilger
On Dec 12, 2013, at 8:26 AM, David Sterba dste...@suse.cz wrote:
 Set the EXTENT_DATA_COMPRESSED flag together with EXTENT_ENCODED as
 defined by fiemap spec.
 
 Signed-off-by: David Sterba dste...@suse.cz
 ---
 fs/btrfs/extent_io.c |9 +++--
 1 files changed, 7 insertions(+), 2 deletions(-)
 
 diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
 index 5ea0ef5..8a28f15 100644
 --- a/fs/btrfs/extent_io.c
 +++ b/fs/btrfs/extent_io.c
 
  @@ -4220,9 +4222,12 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
  
  		if (ref_cnt > 1)
  			flags |= FIEMAP_EXTENT_SHARED;
  +		em_phys_len = em->block_len;
  	}
  -	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
  +	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
  		flags |= FIEMAP_EXTENT_ENCODED;
  +		flags |= FIEMAP_EXTENT_DATA_COMPRESSED;

(minor nit) Could combine these (not sure it makes a difference to the 
compiler):

flags |= (FIEMAP_EXTENT_ENCODED |
  FIEMAP_EXTENT_DATA_COMPRESSED);

Cheers, Andreas









Re: [PATCH 0/4 v3] fiemap: introduce EXTENT_DATA_COMPRESSED flag

2013-12-12 Thread Andreas Dilger
On Dec 12, 2013, at 8:25 AM, David Sterba dste...@suse.cz wrote:
 The original FIEMAP patch did not define this bit, btrfs will make use of
 it.  The defined constant maintains the same value as originally proposed.
 
 Currently, the 'filefrag' utility has no way to recognize and denote a
 compressed extent. As implemented in btrfs right now, the compression step
 splits a big extent into smaller chunks and this is reported as a heavily
 fragmented file. Adding the flag to filefrag will at least give some
 explanation why, this has been confusing users for some time already.

The whole series looks good to me (one minor nit if it needs to be resubmitted
for some reason).  You can add my:

Reviewed-by: Andreas Dilger adil...@dilger.ca

 V3:
 Based on feedback from Andreas, implement #1 from V2, current users of
 fiemap_fill_next_extent (fs/, ext4, gfs2, ocfs2, nilfs2, xfs) updated
 accordingly, no functional change.
 
 V2:
 Based on feedback from Andreas, the fiemap_extent is now able to hold the
 physical extent length, to be filled by the filesystem callback.
 
 The filesystems do not have access to the structure that is passed back to
 userspace and are supposed to call fiemap_fill_next_extent, there's no direct
 way to fill fe_phys_length. There are two ways to pass it:
 
 1) extend fiemap_fill_next_extent to take phys_length and update all
   users (ext4, gfs2, ocfs2, nilfs2, xfs)
 
 2) add new function that takes arguments for all the fiemap_extent items,
   newly added phys_length compared to fiemap_fill_next_extent
 
 David Sterba (4):
  fiemap: fix comment at EXTENT_DATA_ENCRYPTED
  fiemap: add EXTENT_DATA_COMPRESSED flag
  btrfs: set FIEMAP_EXTENT_DATA_COMPRESSED for compressed extents
  Documentation/fiemap: Document the DATA_COMPRESSED flag
 
 Documentation/filesystems/fiemap.txt |   17 +
 fs/btrfs/extent_io.c |9 +++--
 fs/ext4/extents.c|3 ++-
 fs/ext4/inline.c |2 +-
 fs/gfs2/inode.c  |2 +-
 fs/ioctl.c   |   18 --
 fs/nilfs2/inode.c|8 +---
 fs/ocfs2/extent_map.c|4 ++--
 fs/xfs/xfs_iops.c|2 +-
 include/linux/fs.h   |2 +-
 include/uapi/linux/fiemap.h  |8 ++--
 11 files changed, 51 insertions(+), 24 deletions(-)
 
 -- 
 1.7.9
 


Cheers, Andreas









Re: [PATCH 2/4 v3] fiemap: add EXTENT_DATA_COMPRESSED flag

2013-12-12 Thread Andreas Dilger
On Dec 12, 2013, at 4:24 PM, Dave Chinner da...@fromorbit.com wrote:
 On Thu, Dec 12, 2013 at 04:25:59PM +0100, David Sterba wrote:
 This flag was not accepted when fiemap was proposed [2] due to lack of
 in-kernel users. Btrfs has compression for a long time and we'd like to
 see that an extent is compressed in the output of 'filefrag' utility
 once it's taught about it.
 
 For that purpose, a reserved field from fiemap_extent is used to let the
 filesystem store along the physical extent length when the flag is set.
 This keeps compatibility with applications that use FIEMAP.
 
 I'd prefer to just see the new physical length field always filled
 out, regardless of whether it is a compressed extent or not. In
 terms of backwards compatibility to userspace, it makes no
 difference because the value of reserved/unused fields is undefined
 by the API. Yes, the implementation zeros them, but there's nothing
 in the documentation that says reserved fields must be zero.
 Hence I think we should just set it for every extent.

I'd actually thought the same thing while reading the patch, but I figured
people would object because it implies that old kernels will return a
physical length of 0 bytes (which might be valid) and badly-written tools
will not work correctly on older kernels.  That said, applications _should_
be checking the FIEMAP_EXTENT_DATA_COMPRESSED flag, and I suspect in the
future fewer developers will be confused if fe_phys_length == fe_length
going forward.

If the initial tools get it right (in particular filefrag), then hopefully
others will get it correct also.

 From the point of view of the kernel API (fiemap_fill_next_extent),
 passing the physical extent size in the len parameter for normal
 extents, then passing 0 for the physical length makes absolutely
 no sense.
 
 IOWs, what you have created is a distinction between the extent's
 logical length and it's physical length. For uncompressed
 extents, they are both equal and they should both be passed to
 fiemap_fill_next_extent as the same value. Extents where they are
 different (i.e.  encoded extents) is when they can be different.
 Perhaps fiemap_fill_next_extent() should check and warn about
 mismatches when they differ and the relevant flags are not set...

Seems reasonable to have a WARN_ONCE() in that case.  That would catch bugs
in the filesystem code as well:

WARN_ONCE(phys_len != lgcl_len &&
	  !(flags & FIEMAP_EXTENT_DATA_COMPRESSED),
	  "physical len %llu != logical length %llu without "
	  "DATA_COMPRESSED\n", phys_len, logical_len);

 diff --git a/include/uapi/linux/fiemap.h b/include/uapi/linux/fiemap.h
 index 93abfcd..0e32cae 100644
 --- a/include/uapi/linux/fiemap.h
 +++ b/include/uapi/linux/fiemap.h
  @@ -19,7 +19,9 @@ struct fiemap_extent {
  	__u64 fe_physical; /* physical offset in bytes for the start
  			    * of the extent from the beginning of the disk */
  	__u64 fe_length;   /* length in bytes for this extent */
  -	__u64 fe_reserved64[2];
  +	__u64 fe_phys_length; /* physical length in bytes, undefined if
  +			       * DATA_COMPRESSED not set */
  +	__u64 fe_reserved64;
  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
  	__u32 fe_reserved[3];
  };
 
 The comment for fe_length needs to change, too, because it needs to
 indicate that it is the logical extent length and that it may be
 different to the fe_phys_length depending on the flags that are set
 on the extent.

Would it make sense to rename fe_length to fe_logi_length (or something,
I'm open to suggestions), and have a compat macro:

#define fe_length fe_logi_length

around for older applications?  That way, new developers would start to
use the new name, old applications would still compile for both newer and
older interfaces, and it doesn't affect the ABI at all.

 And, FWIW, I wouldn't mention specific flags in the comment here,
 but do it at the definition of the flags that indicate there is
 a difference between physical and logical extent lengths

Actually, I was thinking just the opposite for this field.  It seems useful
that the requirement for DATA_COMPRESSED being set is beside fe_phys_length
so that anyone using this field sees the correlation clearly.  I don't expect
everyone would read and understand the meaning of all the flags when looking
at the data structure.

Cheers, Andreas

  @@ -50,6 +52,8 @@ struct fiemap {
  					 * Sets EXTENT_UNKNOWN. */
  #define FIEMAP_EXTENT_ENCODED		0x0008 /* Data can not be read
  					        * while fs is unmounted */
  +#define FIEMAP_EXTENT_DATA_COMPRESSED	0x0040 /* Data is compressed by fs.
  +					        * Sets EXTENT_ENCODED */
 
 i.e. here.
 
 Cheers,
 
 Dave.
 -- 
 Dave Chinner
 da...@fromorbit.com


Cheers, Andreas








Re: atime and filesystems with snapshots (especially Btrfs)

2012-05-25 Thread Andreas Dilger
On 2012-05-25, at 9:59, Alexander Block abloc...@googlemail.com wrote:

 On Fri, May 25, 2012 at 5:42 PM, Josef Bacik jo...@redhat.com wrote:
 On Fri, May 25, 2012 at 05:35:37PM +0200, Alexander Block wrote:
 Hello,
 
 (this is a resend with proper CC for linux-fsdevel and linux-kernel)
 
 I would like to start a discussion on atime in Btrfs (and other
 filesystems with snapshot support).
 
 As atime is updated on every access of a file or directory, we get
 many changes to the trees in btrfs that as always trigger cow
 operations. This is no problem as long as the changed tree blocks are
 not shared by other subvolumes. Performance is also not a problem, no
 matter if shared or not (thanks to relatime which is the default).
 The problems start when someone starts to use snapshots. If you for
 example snapshot your root and continue working on your root, after
 some time big parts of the tree will be cowed and unshared. In the
 worst case, the whole tree gets unshared and thus takes up the double
 space. Normally, a user would expect to only use extra space for a
 tree if he changes something.
 A worst case scenario would be if someone took regular snapshots for
 backup purposes and later greps the contents of all snapshots to find
 a specific file. This would touch all inodes in all trees and thus
 make big parts of the trees unshared.

Are you talking about the atime for the primary copy, or the atime for the 
snapshots?  IMHO, the atime should not be updated for a snapshot unless it is 
explicitly mounted r/w, or it isn't really a good snapshot. 

 relatime (which is the default) reduces this problem a little bit, as
 it by default only updates atime once a day. This means, if anyone
 wants to test this problem, mount with relatime disabled or change the
 system date before you try to update atime (that's the way i tested
 it).
 
 As a solution, I would suggest to make noatime the default for btrfs.
 I'm however not sure if it is allowed in linux to have different
 default mount options for different filesystem types. I know this
 discussion pops up every few years (last time it resulted in making
 relatime the default). But this is a special case for btrfs. atime is
 already bad on other filesystems, but it's much much worse in btrfs.
 
 
 Just mount with -o noatime, there's no chance of turning something like that 
 on
 by default since it will break some applications (notably mutt).  Thanks,
 
 Josef
 
 I know about the discussions regarding compatibility with existing
 applications. The problem here is, that it is not only a compatibility
 problem. Having atime enabled by default, may give you ENOSPC
 for reasons that a normal user does not understand or expect.
 As a normal user, I would think: If I never change something, why
 does it then take up more space just by reading it?
 --
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks

2012-04-16 Thread Andreas Dilger
On 2012-04-16, at 9:13 AM, Jan Kara wrote:
 Another potential contention point might be patch 19. In that patch
 we make freeze_super() refuse to freeze the filesystem when there
 are open but unlinked files which may be impractical in some cases.
 The main reason for this is the problem with handling of file deletion
 from fput() called with mmap_sem held (e.g. from munmap(2)), and
 then there's the fact that we cannot really force such filesystem
 into a consistent state... But if people think that freezing with
 open but unlinked files should happen, then I have some possible
 solutions in mind (maybe as a separate patchset since this is
 large enough).

Looking at a desktop system, I think it is very typical that there
are open-unlinked files present, so I don't know if this is really
an acceptable solution.  It isn't clear from your comments whether
this is a blanket refusal for all open-unlinked files, or only in
some particular cases...

lsof | grep deleted
nautilus  25393  adilger   19r  REG  253,0    340  253954 
/home/adilger/.local/share/gvfs-metadata/home (deleted)
nautilus  25393  adilger   20r  REG  253,0  32768  253964 
/home/adilger/.local/share/gvfs-metadata/home-f332a8f3.log (deleted)
gnome-ter 25623  adilger   22u  REG   0,18  17841 2717846 
/tmp/vtePIRJCW (deleted)
gnome-ter 25623  adilger   23u  REG   0,18   5568 2717847 
/tmp/vteDCSJCW (deleted)
gnome-ter 25623  adilger   29u  REG   0,18    480 2728484 
/tmp/vte6C1TCW (deleted)

Cheers, Andreas







Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks

2012-04-16 Thread Andreas Dilger
On 2012-04-16, at 5:43 PM, Dave Chinner wrote:
 On Mon, Apr 16, 2012 at 03:02:50PM -0700, Andreas Dilger wrote:
 On 2012-04-16, at 9:13 AM, Jan Kara wrote:
 Another potential contention point might be patch 19. In that patch
 we make freeze_super() refuse to freeze the filesystem when there
 are open but unlinked files which may be impractical in some cases.
 The main reason for this is the problem with handling of file deletion from 
 fput() called with mmap_sem held (e.g. from munmap(2)),
 and then there's the fact that we cannot really force such filesystem
 into a consistent state... But if people think that freezing with
 open but unlinked files should happen, then I have some possible
 solutions in mind (maybe as a separate patchset since this is
 large enough).
 
 Looking at a desktop system, I think it is very typical that there
 are open-unlinked files present, so I don't know if this is really
 an acceptable solution.  It isn't clear from your comments whether
 this is a blanket refusal for all open-unlinked files, or only in
 some particular cases...
 
 Unlinked-but-open files are the reason that XFS dirties the log
 after the freeze process is complete. This ensures that if the
 system crashes while the filesystem is frozen then log recovery
 during the next mount will process the unlinked (orphaned) inodes
 and free the correctly. i.e. you can still freeze a filesystem with
 inodes in this state successfully and have everythign behave as
 you'd expect.
 
 I'm not sure how other filesystems handle this problem, but perhaps
 pushing this check down into filesystem specific code or adding a
 superblock feature flag might be a way to allow filesystems to
 handle this case in the way they think is best...

The ext3/4 code has long been able to handle open-unlinked files
properly after a crash (they are put into a singly-linked list from
the superblock on disk that is processed after journal recovery).

The issue here is that blocking freeze from succeeding with open-
unlinked files is an unreasonable assumption of this patch, and
I don't think it is acceptable to land this patchset (which IMHO
would prevent nearly every Gnome system from freezing unless these
apps have changed their behaviour in more recent releases).

Like you suggest, filesystems that handle this correctly should be
able to flag or otherwise indicate that this is OK, and allow the
freeze to continue.  For other filesystems that do not handle
open-unlinked file consistency during a filesystem freeze/snapshot
whether this should even be considered a new case, or is something
that has existed for ages already.

The other question is whether this is still a problem even for
filesystems handling the consistency issue, but from Jan's comment
above there is a new locking issue related to mmap_sem being added?

Cheers, Andreas
--
Andreas Dilger   Whamcloud, Inc.
Principal Lustre Engineerhttp://www.whamcloud.com/






Re: getdents - ext4 vs btrfs performance

2012-03-11 Thread Andreas Dilger
On 2012-03-09, at 9:48 PM, Ted Ts'o wrote:
 On Fri, Mar 09, 2012 at 04:09:43PM -0800, Andreas Dilger wrote:
 
 Just reading this on the plane, so I can't find the exact reference
 that I want, but a solution to this problem with htree was discussed
 a few years ago between myself and Coly Li.
 
 The basic idea is that for large directories the inode allocator
 starts by selecting a range of (relatively) free inodes based on the
 current directory size, and then piecewise maps the hash value for
 the filename into this inode range and uses that as the goal inode.
 
 I've heard you sketch out this idea before, but it's not been clear to
 me how well it would work in practice.  The potential downside is that
 it fragments the inode table, such that a single inode table block
 might not have as many inodes stored in it as we might otherwise
 would.  (Ideally, with 256 byte inodes, there would 16 of them in each
 4k block.)  This is something I'd definitely recommend modelling in
 simulation to see how well it works first.

Very true, I was thinking this as I wrote the email.

 I'd observe that if we knew in advance how many files would be in a
 particular directory (i.e., via a hint from userspace at mkdir time),
 then it would be possible to optimize this case much more easily.  In
 fact, if we had perfect knowledge --- if the userspace process could
 feed us the exact set of filenames that will be used in the directory,
 plus the exact file sizes for each of the file names --- then it would
 be possible to allocate inode numbers and starting block numbers such
 that when the files are read in readdir() order, the inode numbers and
 block numbers would line up perfectly into sequential writes.  

Except POSIX doesn't allow anything close to this at all.  Something like
fallocate() for directories would at least give the kernel a hint about
how large the directory is, but fewer userspace tools have this information
than the file size (e.g. tar doesn't really store a directory, just a
bunch of pathnames that happen to have largely the same components).

In the cases where size IS known in advance, or at least if the directory
is going to be large, an allocation policy like mine would essentially
predict where a uniformly distributed hash function will allocate the
inodes for better reading order.

 Of course, the trade off is that by optimizing for read order, the
 write order will tend to get painful as well!  So there's a tension
 here; making the read part of the benchmark faster will make the
 process of writing the directory hierarchy require more seeks --- and
 this assuming that you know file names and sizes ahead of time.
 
 (Unless, of course, you play the same game that Hans Reiser did when
 he cooked his tar benchmark by constructing a kernel source tarball
 where the file order in the tarball was perfectly ordered to guarantee
 the desired result given Reiser4's hash!  :-)

No, I definitely don't want to go down that path...  Optimization for the
sake of benchmarks is broken.

 I suspect the best optimization for now is probably something like
 this:
 
 1) Since the vast majority of directories are less than (say) 256k
 (this would be a tunable value),

Yes, this would be reasonable, and I suspect could be a lot larger than
256kB.  Probably variable based on the amount of RAM is reasonable.
Looking at http://www.pdsi-scidac.org/fsstats/ (which is listed as HPC,
but is a bit old and the filesystems aren't much larger than large SATA
drives today :-) having a 1MB cache would handle virtually every directory
in that survey (99.999%).

The unfortunate part is that this will help small directories, where the
problem is already largely hidden by readahead and the page cache, but it
doesn't help at all for huge directories that do not fit into cache.  That
is where my approach would do well, but the difficulty is in knowing at
create time how large the directory will be.

 for directories which are less than
 this threshold size, the entire directory is sucked in after the first
 readdir() after an opendir() or rewinddir().  The directory contents
 are then sorted by inode number (or loaded into an rbtree ordered by
 inode number), and returned back to userspace in the inode order via
 readdir().  The directory contents will be released on a closedir() or
 rewinddir().

What would the telldir() cookie be in this case?  The inode number, or
the file descriptor?  What happens if the directory size crosses the
threshold while the cache is in use?  I guess in that case the cache
could still be used (according to POSIX) because readdir() doesn't need
to report entries added after it started.

 2) For files larger than this size, we don't want to hold that much
 kernel memory during an opendir(), so fall back to ext4's current
 scheme.   
 
 3) If we want do to better than that, we could create new flag
 O_NOTELLDIR, which can be set via fcntl.  (This way programs can use
 do something like dir = opendir

Re: [PATCH 1/7] BTRFS: Fix lseek return value for error

2011-09-17 Thread Andreas Dilger
On 2011-09-17, at 12:10 AM, Jeff Liu jeff@oracle.com wrote:
 I once posted a similar patch for this issue which can be found at:
 http://www.spinics.net/lists/linux-btrfs/msg12169.html
 
 with an additional improvement if the offset is larger or equal to the
 file size, return -ENXIO in directly:
 
	if (offset >= inode->i_size) {
		mutex_unlock(&inode->i_mutex);
		return -ENXIO;
	}

Except that is wrong, because it would then be impossible to write sparse 
files. 

 Thanks,
 -Jeff
 On 09/16/2011 11:48 PM, Christoph Hellwig wrote:
 
 On Thu, Sep 15, 2011 at 04:06:47PM -0700, Andi Kleen wrote:
 From: Andi Kleen a...@linux.intel.com
 
 Introduced by 9a4327ca1f45f82edad7dc0a4e52ce9316e0950c
 
 I think this should go to Chris/Linus ASAP.  But a slightly better
 patch description wouldn't hurt either.
 
 Also any reason you captialize BTRFS?
 
 
 Signed-off-by: Andi Kleen a...@linux.intel.com
 ---
 fs/btrfs/file.c |   13 +++--
 1 files changed, 7 insertions(+), 6 deletions(-)
 
 diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
 index 3c3abff..7ec0a24 100644
 --- a/fs/btrfs/file.c
 +++ b/fs/btrfs/file.c
 @@ -1818,19 +1818,17 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int origin)
 	case SEEK_DATA:
 	case SEEK_HOLE:
 		ret = find_desired_extent(inode, &offset, origin);
 -		if (ret) {
 -			mutex_unlock(&inode->i_mutex);
 -			return ret;
 -		}
 +		if (ret)
 +			goto error;
 	}
 
 	if (offset < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET)) {
 		ret = -EINVAL;
 -		goto out;
 +		goto error;
 	}
 	if (offset > inode->i_sb->s_maxbytes) {
 		ret = -EINVAL;
 -		goto out;
 +		goto error;
 	}
 
 	/* Special lock needed here? */
 @@ -1841,6 +1839,9 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int origin)
  out:
 	mutex_unlock(&inode->i_mutex);
 	return offset;
 +error:
 +	mutex_unlock(&inode->i_mutex);
 +	return ret;
  }
 
 const struct file_operations btrfs_file_operations = {
 -- 
 1.7.4.4
 
 ---end quoted text---
 
 


Re: [PATCH] xfstests 255: add a seek_data/seek_hole tester

2011-08-25 Thread Andreas Dilger
On 2011-08-25, at 12:40 AM, Dave Chinner wrote:
 On Thu, Aug 25, 2011 at 02:06:32AM -0400, Christoph Hellwig wrote:
 On Tue, Jun 28, 2011 at 11:33:19AM -0400, Josef Bacik wrote:
 This is a test to make sure seek_data/seek_hole is acting like it does on
 Solaris.  It will check to see if the fs supports finding a hole or not and 
 will adjust as necessary.
 
 Can you resend this with any updates that happened in the meantime?
 
 Dave also still had some comments about semantics, so it might be worth
 to incorporate that as well.
 
 The main questions I had when looking at this was how we should
 handle unwritten extents - the only answer I got was along the lines
 of we'll deal with that once filesystems have implemented
 something. That's a bit of a chicken-and-egg situation, and doesn't
 help me decide what is the best thing to do. I don't want to have to
 re-implement this code when it's decided i did the wrong thing
 initially.

Let's first clarify what you mean by an "unwritten extent".  Do you mean a
preallocated extent that returns 0 when read, or do you mean a delayed
allocation extent that was written by the application that is still in
memory but not yet written to disk?

Unfortunately, ZFS has no concept of preallocated extents, so we can't
look to it for precedent, but it definitely has delayed allocation.
Possibly if UFS on Solaris has SEEK_HOLE and also preallocated extents
(I have no idea) it could be tested?

 The most basic question I really want answered is this:
 
   - is preallocated space a HOLE, or is it DATA?
 
 Whatever the answer, I think it should be consistently
 presented by all filesystems that support preallocation, and it
 should be encoded into the generic SEEK_HOLE/SEEK_DATA tests

My thought would be that a preallocated extent is still a HOLE, because
it doesn't contain data that an application actually cares about.  On
the other hand, a delalloc extent is DATA because it has something that
an application cares about.

The original reason SEEK_HOLE/SEEK_DATA were brought up over FIEMAP was
cp being able to consistently access delalloc data that was only in
the page cache, so if we don't get that right it will have been a pointless
exercise.

 Answering that question then helps answer the more complex questions
 I had, like:
 
   - what does SEEK_DATA return when you have a file layout
 like hole-prealloc-data?

I would think only the data part, since that is what the definition
of SEEK_DATA is IMHO.

 Answers to that sort of question need to be known so we can write
 corner-case tests to correctly verify the filesystem implementation.
 
 I like to do better than present userspace with an interface that
 behaves vastly different depending on the underlying filesystem, but
 if the answer is definition and implementation is entirely
 filesystem specific then I'll just go make something up
 
 Cheers,
 
 Dave.
 -- 
 Dave Chinner
 da...@fromorbit.com


Cheers, Andreas







Re: [PATCH] xfstests 255: add a seek_data/seek_hole tester

2011-06-27 Thread Andreas Dilger
On 2011-06-27, at 12:02 PM, Josef Bacik wrote:
 This is a test to make sure seek_data/seek_hole is acting like it does on
 Solaris.  It will check to see if the fs supports finding a hole or not and 
 will
 adjust as necessary.
 
 diff --git a/src/seek-tester.c b/src/seek-tester.c
 new file mode 100644
 index 000..2b8c957
 --- /dev/null
 +++ b/src/seek-tester.c
 @@ -0,0 +1,473 @@
 +/*
 + * Copyright (C) 2011 Oracle.  All rights reserved.
 + * Copyright (C) 2011 Red Hat.  All rights reserved.
 + *
 + * This program is free software; you can redistribute it and/or
 + * modify it under the terms of the GNU General Public
 + * License v2 as published by the Free Software Foundation.
 + *
 + * This program is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 + * General Public License for more details.
 + *
 + * You should have received a copy of the GNU General Public
 + * License along with this program; if not, write to the
 + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
 + * Boston, MA 021110-1307, USA.
 + */
 +
  +#define _XOPEN_SOURCE 500
  +#include <sys/types.h>
  +#include <sys/stat.h>
  +#include <errno.h>
  +#include <fcntl.h>
  +#include <stdio.h>
  +#include <string.h>
  +#include <unistd.h>
  +#include <stdlib.h>
  +
  +#define SEEK_DATA	3
  +#define SEEK_HOLE	4

These should probably be #ifndef SEEK_DATA so that gcc doesn't complain
in the future when these are added to a standard header.

  +#define FS_NO_HOLES	(1 << 0)
  +#define QUIET		(1 << 1)
 +
 +static blksize_t alloc_size;
 +static unsigned flags = 0;
 +
 +static int get_io_sizes(int fd)
 +{
 + struct stat buf;
 + int ret;
 +
 + ret = fstat(fd, &buf);
 + if (ret)
 + fprintf(stderr, "  ERROR %d: Failed to find io blocksize\n",
 + errno);
 +
 + /* st_blksize is typically also the allocation size */
 + alloc_size = buf.st_blksize;
 +
 + if (!(flags & QUIET))
 + printf("Allocation size: %ld\n", alloc_size);
 +
 + return ret;
 +}
 +
 +#define do_free(x)   do { if(x) free(x); } while(0);
 +
 +static void *do_malloc(size_t size)
 +{
 + void *buf;
 +
 + buf = malloc(size);
 + if (!buf)
 + fprintf(stderr, "  ERROR: Unable to allocate %ld bytes\n",
 + (long)size);
 +
 + return buf;
 +}
 +
 +static int do_truncate(int fd, off_t length)
 +{
 + int ret;
 +
 + ret = ftruncate(fd, length);
 + if (ret)
 + fprintf(stderr, "  ERROR %d: Failed to extend file "
 + "to %ld bytes\n", errno, (long)length);
 + return ret;
 +}
 +
 +static ssize_t do_pwrite(int fd, const void *buf, size_t count, off_t offset)
 +{
 + ssize_t ret, written = 0;
 +
 + while (count > written) {
 + ret = pwrite(fd, buf + written, count - written, offset + 
 written);
 + if (ret < 0) {
 + fprintf(stderr, "  ERROR %d: Failed to write %ld "
 + "bytes\n", errno, (long)count);
 + return ret;
 + }
 + written += ret;
 + }
 +
 + return 0;
 +}
 +
 +static int do_lseek(int testnum, int subtest, int fd, int origin, off_t set,
 + off_t exp)
 +{
 + off_t pos;
 + int ret = -1;
 +
 + pos = lseek(fd, set, origin);
 +
 + if (pos != exp) {
 + fprintf(stderr, "  ERROR in Test %d.%d: POS expected %ld, "
 + "got %ld\n", testnum, subtest, (long)exp, (long)pos);
 + goto out;
 + }
 +
 + if (pos == -1 && errno != ENXIO) {
 + fprintf(stderr, "  ERROR in Test %d.%d: ERRNO expected %d, "
 + "got %d\n", testnum, subtest, ENXIO, errno);
 + goto out;
 + }
 +
 + ret = 0;
 +
 +out:
 + return ret;
 +}
 +
 +static int get_flags(int fd)
 +{
 + const char *buf = "ABCDEF";
 + ssize_t written;
 + off_t pos;
 + int ret;
 +
 + ret = do_truncate(fd, alloc_size * 2);
 + if (ret)
 + return ret;
 +
 + written = do_pwrite(fd, buf, strlen(buf), 0);
 + if (written)
 + return -1;
 +
 + pos = lseek(fd, 0, SEEK_HOLE);
 + if (pos == alloc_size * 2) {
 + if (!(flags & QUIET))
 + printf("File system does not recognize holes, the only "
 +"hole found will be at the end.\n");
 + flags |= FS_NO_HOLES;

This is a question that I've also had about compatibility with older
(well, every) Linux kernel that does not support SEEK_{HOLE,DATA}
today.

My reading of the existing generic_file_llseek() and default_llseek()
code, along with most filesystem-specific llseek() implementations is
that they will happily ignore the @whence parameter if it is not
known, and pretend like it is 0 (SEEK_SET), so they will just set the
position to the @offset parameter and return this 

Re: [PATCH 2/2] Btrfs: load the key from the dir item in readdir into a fake dentry

2011-05-26 Thread Andreas Dilger
On May 26, 2011, at 14:03, Andi Kleen wrote:
 On Thu, May 26, 2011 at 03:02:42PM -0400, Josef Bacik wrote:
 +/*
 + * If this dentry needs lookup, don't set the referenced flag so that it
 + * is more likely to be cleaned up by the dcache shrinker in case of
 + * memory pressure.
 + */
 +	if (!d_need_lookup(dentry))
 +		dentry->d_flags |= DCACHE_REFERENCED;
 
 No it doesn't at all. The allocation will just push everything else
 out.
 
 Really you cannot view this by only looking at the dcache.
 You have to look at the complete VM behaviour. All the caches
 and the other memory interact.

Even without this patch, if you are doing find / there will be considerable 
memory pressure from all of the pages that are accessed by readdir(), and to a 
lesser extent the directory dentry/inode entries.  I'm not sure whether the 
issue you raise is going to be significant in the end or not.


Taking this development a bit further, I've long thought it would be quite 
interesting if these dcache entries could be linked into a readdir-ordered 
linked list, and then discard the actual directory pages from cache entirely.  
This would potentially allow later readdir operations to work just by walking 
the readdir-ordered linked list and not touching the page cache at all.

Since normal operations like ls -l currently instantiate both a pagecache for 
readdir, and a dentry for each dirent, it could potentially even reduce memory 
if there was no need to keep the directory pages in cache anymore.

 d_alloc uses a normal GFP_KERNEL, which is quite inappropriate for this.
 
 It should at least reclaim and probably more, but even then it's
 risky.
 
 
 Ah yeah I guess I should have probably used GFP_KERNEL.  Sorry about that,
 
 GFP_KERNEL is already used, but it's wrong. I'm not sure any
 of the existing GFP_* flags will give you the semantics you
 need in fact. The new flag Minchan added for readahead may come
 near, but even that is probably not enough.
 
 -Andi
 
 -- 
 a...@linux.intel.com -- Speaking for myself only.


Cheers, Andreas







Re: [PATCH 1/3] fs: add SEEK_HOLE and SEEK_DATA flags V4

2011-05-25 Thread Andreas Dilger
On May 23, 2011, at 15:43, Josef Bacik wrote:
 This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags.  Turns 
 out
 using fiemap in things like cp cause more problems than it solves, so lets try
 and give userspace an interface that doesn't suck.  We need to match solaris
 here, and the definitions are
 
 diff --git a/fs/read_write.c b/fs/read_write.c
 index 5520f8a..9c3b453 100644
 --- a/fs/read_write.c
 +++ b/fs/read_write.c
 @@ -64,6 +64,23 @@ generic_file_llseek_unlocked(struct file *file, loff_t 
 offset, int origin)
   return file->f_pos;
   offset += file->f_pos;
   break;
 + case SEEK_DATA:
 + /*
 +  * In the generic case the entire file is data, so as long as
 +  * offset isn't at the end of the file then the offset is data.
 +  */
 + if (offset >= inode->i_size)
 + return -ENXIO;
 + break;
 + case SEEK_HOLE:
 + /*
 +  * There is a virtual hole at the end of the file, so as long as
 +  * offset isn't i_size or larger, return i_size.
 +  */
 + if (offset >= inode->i_size)
 + return -ENXIO;
 + offset = inode->i_size;
 + break;
   }

What about all of the existing filesystems that currently just ignore
values of origin that they don't understand?  Looking through those
it appears that most of them will return offset for unknown values
of origin, which I guess is OK for SEEK_DATA, but is confusing for
SEEK_HOLE.  Some filesystems will return -EINVAL for values of origin
that are unknown.

Most of the filesystem-specific ->llseek() methods don't do any error
checking on origin because this is handled at the sys_llseek() level,
and hasn't changed in many years.

I assume this patch is also dependent upon the remove default_llseek()
patch, so that the implementation of SEEK_DATA and SEEK_HOLE can be done
in only generic_file_llseek()?

Finally, while looking through the various ->llseek() methods I notice
that many filesystems return i_size for SEEK_END, which clearly does
not make sense for filesystems like ext3/ext4 htree, btrfs, etc that
use hash keys instead of byte offsets for doing directory traversal.
The comment at generic_file_llseek() is that it is intended for use by
regular files.

Should the ext4_llseek() code be changed to return 0x7 for the
SEEK_END value?  That makes more sense compared to values returned for
SEEK_CUR so that an application can compare the current offset with
the final value for a progress bar.

Another interesting use is for N threads to process a large directory
in parallel by using max_off = llseek(dirfd, 0, SEEK_END) and then
each thread calls llseek(dirfd, thread_nr * max_off / N, SEEK_SET)
to process 1/N of the directory.


Cheers, Andreas






Cheers, Andreas







Re: [PATCH 1/2] fs: add a DCACHE_NEED_LOOKUP flag for d_flags

2011-05-20 Thread Andreas Dilger
On 2011-05-20, at 2:07 PM, Andi Kleen a...@firstfloor.org wrote:

 Josef Bacik jo...@redhat.com writes:
 
 Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order
 for readdir, but unfortunately in the case of anything that stats the files 
 in
 order that readdir spits back (like oh say ls) that means we still have to do
 the normal lookup of the file, which means looking up our other index and 
 then
 looking up the inode.  What I want is a way to create dummy dentries when we
 find them in readdir so that when ls or anything else subsequently does a
 stat(), we already have the location information in the dentry and can go
 
 Funny I remember discussing this optimization a long time ago with
 Andrea. But the problem is still that it has the potential to pollute
 the dcache a lot when someone is reading a large directory.
 
 Consider the find / case. You don't want that to turn over all
 of your dcache.
 
 So if you do that you need some way to put the dummy dentries on a
 special LRU list that gets cleaned quickly and is kept small.

Actually, I recall years ago looking into something similar. It should be 
possible to drop such dentries even under GFP_NOFS because they can't be dirty 
and don't need any inode operations to free.

Putting them at the end of the cache LRU instead of the head would allow them 
to be dropped quickly under memory pressure. 

Cheers, Andreas


Re: [PATCH 1/2] fs: add a DCACHE_NEED_LOOKUP flag for d_flags

2011-05-19 Thread Andreas Dilger
On May 19, 2011, at 11:58, Josef Bacik wrote:
 Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order
 for readdir, but unfortunately in the case of anything that stats the files in
 order that readdir spits back (like oh say ls) that means we still have to do
 the normal lookup of the file, which means looking up our other index and then
 looking up the inode.  What I want is a way to create dummy dentries when we
 find them in readdir so that when ls or anything else subsequently does a
 stat(), we already have the location information in the dentry and can go
 straight to the inode itself.  The lookup stuff just assumes that if it finds 
 a
 dentry it is done, it doesn't perform a lookup.  So add a DCACHE_NEED_LOOKUP
 flag so that the lookup code knows it still needs to run i_op->lookup() on the
 parent to get the inode for the dentry.  I have tested this with btrfs and I
 went from something that looks like this
 
 http://people.redhat.com/jwhiter/ls-noreada.png
 
 To this
 
 http://people.redhat.com/jwhiter/ls-good.png
 
 Thats a savings of 1300 seconds, or 22 minutes.  That is a significant 
 savings.
 Thanks,

This comment should probably mention the number of files being tested, in order
to make a 1300s savings meaningful.  Similarly, it would be better to provide
the absolute times of tests in case these URLs disappear in the future.

That reduces the time to do ls -l on a 1M file directory from 2181s to 855s.

 Signed-off-by: Josef Bacik jo...@redhat.com
 ---
 fs/namei.c |   48 
 include/linux/dcache.h |1 +
 2 files changed, 49 insertions(+), 0 deletions(-)
 
 diff --git a/fs/namei.c b/fs/namei.c
 index e3c4f11..a1bff4f 100644
 --- a/fs/namei.c
 +++ b/fs/namei.c
 @@ -1198,6 +1198,29 @@ static struct dentry *d_alloc_and_lookup(struct dentry 
 *parent,
 }
 
 /*
 + * We already have a dentry, but require a lookup to be performed on the 
 parent
 + * directory to fill in d_inode. Returns the new dentry, or ERR_PTR on error.
 + * parent->d_inode->i_mutex must be held. d_lookup must have verified that no
 + * child exists while under i_mutex.
 + */
 +static struct dentry *d_inode_lookup(struct dentry *parent, struct dentry 
 *dentry,
 +  struct nameidata *nd)
 +{
 + struct inode *inode = parent->d_inode;
 + struct dentry *old;
 +
 + /* Don't create child dentry for a dead directory. */
 + if (unlikely(IS_DEADDIR(inode)))
 + return ERR_PTR(-ENOENT);
 +
 + old = inode->i_op->lookup(inode, dentry, nd);
 + if (unlikely(old)) {
 + dput(dentry);
 + dentry = old;
 + }
 + return dentry;
 +}
 +/*
 *  It's more convoluted than I'd like it to be, but... it's still fairly
 *  small and for now I'd prefer to have fast path as straight as possible.
 *  It _is_ time-critical.
 @@ -1236,6 +1259,13 @@ static int do_lookup(struct nameidata *nd, struct qstr 
 *name,
   goto unlazy;
   }
   }
 + if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
 + if (nameidata_dentry_drop_rcu(nd, dentry))
 + return -ECHILD;
 + dput(dentry);
 + dentry = NULL;
 + goto retry;
 + }
   path-mnt = mnt;
   path-dentry = dentry;
   if (likely(__follow_mount_rcu(nd, path, &inode, false)))
 @@ -1250,6 +1280,12 @@ unlazy:
   }
   } else {
   dentry = __d_lookup(parent, name);
 + if (unlikely(!dentry))
 + goto retry;
 + if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
 + dput(dentry);
 + dentry = NULL;
 + }
   }
 
 retry:
 @@ -1268,6 +1304,18 @@ retry:
   /* known good */
   need_reval = 0;
   status = 1;
 + } else if (unlikely(dentry->d_flags & DCACHE_NEED_LOOKUP)) {
 + struct dentry *old;
 +
 + dentry->d_flags &= ~DCACHE_NEED_LOOKUP;
 + dentry = d_inode_lookup(parent, dentry, nd);

Would it make sense to keep DCACHE_NEED_LOOKUP set in d_flags until _after_
the call to d_inode_lookup()?  That way the filesystem can positively know
it is doing the inode lookup from d_fsdata, instead of just inferring it
from the presence of d_fsdata?  It is already the filesystem that is setting
DCACHE_NEED_LOOKUP, so it should really be the one clearing this flag also.

I'm concerned that there may be filesystems that need d_fsdata for something
already, so the presence/absence of d_fsdata is not a clear indication to
the underlying filesystem of whether to do an inode lookup based on d_fsdata,
which might mean that it needs to keep yet another flag for this purpose.

 + if 

Re: Is it possible for the ext4/btrfs file system to pass some context related info to low level block driver?

2011-05-05 Thread Andreas Dilger
On May 5, 2011, at 12:10, Matthew Wilcox wrote:
 On Wed, May 04, 2011 at 08:51:39AM -0600, Andreas Dilger wrote:
 I was aware of REQ_META, but I didn't know there was any benefit to
 using it. I think it would be easy to set REQ_META on all ext4 metadata
 if there was a reason to do so.
 
 The CFQ ioscheduler pays attention to it (prioritising metadata accesses
 over data accesses), and blocktrace will print an 'M' for metadata
 requests if it's set, so I think that's two excellent reasons to set
 REQ_META today.
 
 However, ext3, ext4, and XFS already use it:
 
 fs/ext4/inode.c:1500:   ll_rw_block(READ_META, 1, &bh);
  - ext4_bread()

 fs/ext4/inode.c:4775:   submit_bh(READ_META, bh);
  - __ext4_get_inode_loc()
 fs/ext4/namei.c:924:ll_rw_block(READ_META, 1, &bh);
  - ext4_find_entry()

Looking more closely at the code it seems that this is handling only a subset 
of the code.  There are many places in ext4 that are using sb_bread() instead 
of ext4_bread(), in particular in the extents and migration code that was 
developed more recently, but also in the xattr code which has been around a 
long time.

Cheers, Andreas







Re: Is it possible for the ext4/btrfs file system to pass some context related info to low level block driver?

2011-05-04 Thread Andreas Dilger
On 2011-05-04, at 5:45 AM, Gao, Yunpeng yunpeng@intel.com wrote:
 Yes, I have been working on some changes that allow us to tag bios and
 pass the information out to storage. These patches have been on the back
 burner for a while due to other commitments. But I'll dig them out and
 post them later. We just discussed them a couple of weeks ago at the
 Linux Storage Workshop.
 
 That's great! Thanks for the update and look forward to your patches.
 
 In the meantime: Can you point me to the relevant eMMC stuff so I can
 see how many tiers or classes we have to work with there?
 
 I'm investigating on add some eMMC 4.5 features support in current Linux mmc 
 driver (drivers/mmc).
 The Linux mmc driver register the eMMC device as a normal Linux block device. 
 So, it can get all the Linux block layer bio flags.
 
 To below eMMC 4.5 new features:
 
 Data Tag:
 'The mechanism permits the device to receive from the host information about 
 specific data types (for instance file system metadata, time-stamps, 
 configuration parameters, etc.). The information is conveyed before a write 
 multiple blocks operation at well defined addresses. By receiving this 
 information the device can improve the access rate during the following read 
 and update operations and offer a more reliable and robust storage.'
 
 I guess the exist block layer flag REQ_META can be used to notify the low 
 level block driver/device to execute the Data Tag feature. But don't know why 
 currently most of Linux file systems don't use the REQ_META flag at all 
 (seems only gfs2 uses it now).

I was aware of REQ_META, but I didn't know there was any benefit to using it. I 
think it would be easy to set REQ_META on all ext4 metadata if there was a 
reason to do so. 

 Context Management:
 'To better differentiate between large sequential operations and small random 
 operations, and to improve multitasking support, contexts can be associated 
 with groups of read or write commands ... A context can be seen as an active 
 session, configured for a specific read/write pattern (e.g. sequential in 
 some granularity). Multiple read or write commands are associated with this 
 context to create some logical association between them, to allow device to 
 optimize performance.'
 
 To my understanding, to support this feature, it needs file system (or 
 application?) to notify the low level driver that the following data access 
 will be large sequential read/write operations.
 And I'm not sure is it possible for file system to pass this kind of 
 information to low level driver? Any idea?

A simple way to do this would be to use the inode number as the context, so 
writes to the same inode are grouped together. Another possibility is the PID, 
though that is less clearly correct (e.g. if tar is extracting a bunch of files, 
are they all related or not).

Cheers, Andreas


Re: [PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags

2011-04-22 Thread Andreas Dilger
On 2011-04-22, at 11:08 AM, Sunil Mushran sunil.mush...@oracle.com wrote:
 On 04/22/2011 10:03 AM, Eric Blake wrote:
 cp can read whatever blocksize it chooses. If that block contains
 zero, it would signal cp that maybe it should SEEK_DATA and skip
 reading all those blocks. That's all. We are not trying to achieve
 perfection. We are just trying to reduce cpu waste.
 
 If the fs supports SEEK_*, then great. If it does not, then it is no
 worse than before.
 But providing just SEEK_DATA _is_ worse than before if you don't provide
 the correct SEEK_HOLE everywhere.  Because then your algorithm of trying
 lseek(SEEK_DATA) after every run of zeros in the hopes of an
 optimization is a wasted syscall, since it will just return your current
 offset every time, so you end up with more syscalls than if you had used
 the single lseek(SEEK_DATA) that returns the end of the file up front,
 and known that the remainder of the file has no holes to even try
 seeking past.
 
 You are over-optimizing. strace any process on your box and you
 will find numerous wasted syscalls.

Sure, there are lots of wasted syscalls, but in this case the cost of doing 
extra SEEK_DATA and SEEK_HOLE operations may be fairly costly. This involves 
scanning the whole disk layout, possibly over a network, which might need tens 
or hundreds of disk seeks to read the metadata, unlike regular SEEK_SET.

Even SEEK_END is somewhat costly on a network filesystem, since the syscall 
needs to lock the file size in order to determine the end of the file, which 
can block other threads from writing to the file. 

So while I agree that the Linux mantra that syscalls are cheap is true if 
those syscalls don't actually do any work, I think it also ignores the cost of 
what some syscalls need to do behind the scenes. 

Cheers, Andreas


Re: [PATCH 1/2] Btrfs: add datacow flag in inode flag

2011-03-15 Thread Andreas Dilger
On 2011-03-15, at 2:57 PM, Christoph Hellwig wrote:
 On Tue, Mar 15, 2011 at 04:26:50PM -0400, Chris Mason wrote:
  #define FS_EXTENT_FL 0x00080000 /* Extents */
  #define FS_DIRECTIO_FL   0x00100000 /* Use direct i/o */
 +#define FS_NOCOW_FL  0x00800000 /* Do not cow file */
 +#define FS_COW_FL0x01000000 /* Cow file */
  #define FS_RESERVED_FL   0x80000000 /* reserved for ext2 lib */
 
 I'm fine with it.  I'll defer the check for conflicts with extN-specific flags
 to Ted, though.

Looking at the upstream e2fsprogs I see in that range:

 #define EXT4_EXTENTS_FL   0x00080000 /* Inode uses extents */
 #define EXT4_EA_INODE_FL  0x00200000 /* Inode used for large EA */
 #define EXT4_EOFBLOCKS_FL 0x00400000 /* Blocks allocated beyond EOF */
 #define EXT4_SNAPFILE_FL  0x01000000 /* Inode is a snapshot */
 #define EXT4_SNAPFILE_DELETED_FL  0x04000000 /* Snapshot is being deleted */
 #define EXT4_SNAPFILE_SHRUNK_FL   0x08000000 /* Snapshot shrink has completed */
 #define EXT2_RESERVED_FL  0x80000000 /* reserved for ext2 lib */
 
 #define EXT2_FL_USER_VISIBLE  0x004BDFFF /* User visible flags */

so there is a conflict with FS_COW_FL and EXT4_SNAPFILE_FL.  I don't know the 
semantics of those two flags enough to say for sure whether it is reasonable 
that they alias to each other, but at first glance COW and SNAPSHOT don't 
seem completely unrelated.

It also isn't clear to me whether the SNAPFILE_DELETED_FL and 
SNAPFILE_SHRUNK_FL really need to be persistent flags that are stored on disk, 
or if they are state flags that should only be stored in the in-memory inode?  
Not having them in the limited remaining inode flag space would be better...  
Amir?

Cheers, Andreas







Re: [RFC PATCH] Btrfs: add ioctl to set compress or cow per file/dir

2011-02-24 Thread Andreas Dilger
On 2011-02-24, at 2:40 AM, liubo wrote:
  #define FS_DIRECTIO_FL   0x00100000 /* Use direct i/o */
 +#define FS_NOCOW_FL  0x00200000 /* Do not cow file */
 +#define FS_COW_FL0x00100000 /* Cow file */
  #define FS_RESERVED_FL   0x80000000 /* reserved for ext2 lib */

I'm assuming that FS_COW_FL should not be the same as FS_DIRECTIO_FL?

Cheers, Andreas







Re: Bug in ext4/btrfs fallocate?

2010-12-28 Thread Andreas Dilger
On 2010-12-28, at 09:06, Marco Stornelli wrote:
 it seems that ext4/btrfs code for fallocate doesn't check for
 immutable/append inode flag.

fallocate() probably shouldn't be allowed for immutable files, but it makes a 
lot of sense to call fallocate() on append-only files to avoid fragmentation, 
though it should only be called with the KEEP_SIZE flag.

Cheers, Andreas







Re: What to do about subvolumes?

2010-12-08 Thread Andreas Dilger
On 2010-12-07, at 10:02, Trond Myklebust wrote:

 On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
 It's just as stable as a real dev_t in the times of hotplug and udev.
 As long as you don't touch anything including not upgrading the kernel
 it's remain stable, otherwise it will break.  That's why modern
 nfs-utils default to using the uuid-based filehandle schemes instead of
 the dev_t based ones.  At least that's what I told - I really hope it's
 using the real UUIDs from the filesystem and not the horrible fsid hack
 that was once added - for some filesystems like XFS that field does not
 actually have any relation to the UUID historically.  And while we
 could have changed that it's too late now that nfs was hacked into
 abusing that field.
 
 IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but
 they won't fit into the NFSv2 32-byte filehandles, so there is an
 '8-byte fsid' and '4-byte fsid + inode number' workaround for that...
 
 See the mk_fsid() helper in fs/nfsd/nfsfh.h

It looks like mk_fsid() is only actually using the UUID if it is specified in 
the /etc/exports file (AFAICS, this depends on ex_uuid being set from a 
uuid=... option).

There was a patch in the open_by_handle() patch series that added an s_uuid 
field to the superblock, that could be used if no uuid= option is specified in 
the /etc/exports file.

Cheers, Andreas







Re: What to do about subvolumes?

2010-12-08 Thread Andreas Dilger
On 2010-12-06, at 09:48, J. Bruce Fields wrote:
On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
 Any chance we can add a ->get_fsid(sb, inode) method to
 export_operations (or something similar), that allows the filesystem to
 generate an FSID based on the volume and inode that is being exported?
 
 No objection from here.
 
 (Though I don't understand the inode argument--aren't subvolumes
 usually expected to have separate superblocks?)

I thought that if two directories from the same filesystem are both being 
exported at the same time that they would need to have different FSID values, 
hence the inode parameter to allow generating an FSID that is a function of 
both the filesystem (sb) and the directory being exported (inode)?

Cheers, Andreas







Re: What to do about subvolumes?

2010-12-08 Thread Andreas Dilger
On 2010-12-08, at 16:07, Neil Brown wrote:
 On Mon, 6 Dec 2010 11:48:45 -0500 J. Bruce Fields bfie...@redhat.com
 wrote:
 
 On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
  Any chance we can add a ->get_fsid(sb, inode) method to
  export_operations (or something similar), that allows the
 filesystem to generate an FSID based on the volume and
 inode that is being exported?
 
 No objection from here.
 
 My standard objection here is that you cannot guarantee that the
  fsid is 100% guaranteed to be unique across all filesystems in
 the system (including filesystems mounted from dm snapshots of
 filesystems that are currently mounted).  NFSd needs this uniqueness.

Sure, but you also cannot guarantee that the devno is constant across reboots, 
yet NFS continues to use this much-less-constant value...

 This is only really an objection if user-space cannot over-ride
 the fsid provided by the filesystem.

Agreed.  It definitely makes sense to allow this, for whatever strange 
circumstances might arise.  However, defaulting to using the filesystem UUID 
definitely makes the most sense, and looking at the nfs-utils mountd code, it 
seems that this is already standard behaviour for local block devices 
(excluding btrfs filesystems).

 I'd be very happy to see an interface to user-space whereby
 user-space can get a reasonably unique fsid for a given
 filesystem.

Hmm, maybe I'm missing something, but why does userspace need to be able to get 
this value?  I would think that nfsd gets it from the filesystem directly in 
the kernel, but if a uuid= option is present in the exports file that is 
preferentially used over the value from the filesystem.

That said, I think Aneesh's open_by_handle patchset also made the UUID visible 
in /proc/pid/mountinfo, after the filesystems stored it in
sb->s_uuid at mount time.  That _should_ make it visible for non-block 
mountpoints as well, assuming they fill in s_uuid.

 Whether this is an export_operations method or some field in the
 'struct super' which gets copied out doesn't matter to me.

Since Aneesh has already developed patches, is there any objection to using 
those (last sent to linux-fsdevel on 2010-10-29):

[PATCH -V22 12/14] vfs: Export file system uuid via /proc/pid/mountinfo
[PATCH -V22 13/14] ext3: Copy fs UUID to superblock.
[PATCH -V22 14/14] ext4: Copy fs UUID to superblock

Cheers, Andreas







Re: What to do about subvolumes?

2010-12-03 Thread Andreas Dilger
On 2010-12-03, at 15:45, J. Bruce Fields wrote:
 We're using statfs64.fs_fsid for this; I believe that's both stable
 across reboots and distinguishes between subvolumes, so that's OK.
 
 (That said, since fs_fsid doesn't work for other filesystems, we depend
 on an explicit check for a filesystem type of btrfs, which is
 awful--btrfs won't always be the only filesystem that wants to do this
 kind of thing, etc.)

Sigh, I wanted to be able to specify the NFS FSID directly from within the 
kernel for Lustre many years already.  Glad to see that this is moving forward.

Any chance we can add a ->get_fsid(sb, inode) method to export_operations
(or something similar), that allows the filesystem to generate an FSID based on 
the volume and inode that is being exported?

Cheers, Andreas







Re: [PATCH 1/6] fs: add hole punching to fallocate

2010-11-17 Thread Andreas Dilger
On 2010-11-16, at 20:11, Dave Chinner wrote:
 On Tue, Nov 16, 2010 at 06:22:47PM -0600, Andreas Dilger wrote:
 IMHO, it makes more sense for consistency and get what users
 expect that these be treated as flags.  Some users will want
 KEEP_SIZE, but in other cases it may make sense that a hole punch
 at the end of a file should shrink the file (i.e. the opposite of
 an append).
 
 What's wrong with ftruncate() for this?

It makes the API usage from applications more consistent.  It would be 
inconvenient, for example, if applications had to use a different system call 
if they were writing in the middle of the file vs. at the end, wouldn't it?

Similarly, if multiple threads are appending vs. punching (let's assume 
non-overlapping regions, for sanity, like a producer/consumer model punching 
out completed records) then using ftruncate() to remove the last record and 
shrink the file would require locking the whole file from userspace (unlike the 
append, which does this in the kernel), or risk discarding unprocessed data 
beyond the record that was punched out.

 There's plenty of open questions about the interface if we allow
 hole punching to change the file size. e.g. where do we set the EOF
 (offset or offset+len)?

I would think it natural that the new size is the start of the region, like an 
anti-write (where write sets the size at the end of the added bytes).

  What do we do with the rest of the blocks that are now beyond EOF?
 We weren't asked to punch them out, so do we leave them behind?

I definitely think they should be left as is.  If they were in the punched-out 
range, they would be deallocated, and if they are beyond EOF they will remain 
as they are - we didn't ask to remove them unless the punched-out range went to 
~0ULL (which would make it equivalent to an ftruncate()).

 What if we are leaving written blocks beyond EOF - does any filesystem other 
 than XFS support that (i.e. are we introducing different behaviour on 
 different filesystems)?

I'm not sure I understand what a written block beyond EOF means.  How can 
there be data beyond EOF?  I think the KEEP_SIZE flag is only relevant if the 
punch is spanning EOF, like the opposite of a write that is spanning EOF.  If 
KEEP_SIZE is set, then it leaves the size unchanged, and if unset and punch 
spans EOF it reduces the file size.  If the punch is not at EOF it doesn't 
change the file size, just like a write that is not at EOF.

 And what happens if the offset is beyond EOF? Do we extend the file, and if 
 so why wouldn't you just use ftruncate() instead?

Even if the effects were the same, it makes sense because applications may be 
using fallocate(PUNCH_HOLE) to punch out records, and having them special case 
the use of ftruncate() to get certain semantics at the end of the file adds 
needless complexity.

Cheers, Andreas







Re: [PATCH 1/6] fs: add hole punching to fallocate

2010-11-17 Thread Andreas Dilger
On 2010-11-16, at 20:34, Josef Bacik wrote:
 FWIW I agree with Dave, the only question at this point is do we force users 
 to specify KEEP_SIZE with PUNCH_HOLE?  On one hand it makes the interface a 
 bit more consistent, on the other hand it makes the documentation a little 
 weird
 
 We have mode here, but if you want to use PUNCH_HOLE you also have to 
 specify KEEP_SIZE, so really it's like a flags field it's just named poorly

Even if this is the case, and we decide today that PUNCH_HOLE without KEEP_SIZE 
is not desirable to implement, it would be better to just return -EOPNOTSUPP 
when PUNCH_HOLE is passed without KEEP_SIZE than to assume one or the other is 
what the user wanted.  
That allows the ability to implement this in the future without breaking every 
application, while if it is assumed that KEEP_SIZE is always implicit there 
will never be a way to add that functionality without something awful like a 
separate CHANGE_SIZE flag for PUNCH_HOLE.

One option is to define FALLOC_FL_PUNCH_HOLE as 0x3 (so that KEEP_SIZE is 
always passed) and in the future we can define some new flag name like 
TRUNCATE_HOLE (or whatever) that is 0x2 only.

Cheers, Andreas







Re: [PATCH 1/6] fs: add hole punching to fallocate

2010-11-16 Thread Andreas Dilger
On 2010-11-16, at 07:14, Jan Kara wrote:
 Yeah I went back and forth on this.  KEEP_SIZE won't change the behavior of 
 PUNCH_HOLE since PUNCH_HOLE implicitly means keep the size.  I figured since 
 its mode and not flags it would be ok to make either way accepted, but 
 if you prefer PUNCH_HOLE means you have to have KEEP_SIZE set then I'm cool 
 with that, just let me know one way or the other.
 
 So we call it mode but speak about flags? Seems a bit inconsistent.  
 I'd maybe lean a bit at the flags side and just make sure that only one of 
 FALLOC_FL_KEEP_SIZE, FALLOC_FL_PUNCH_HOLE is set (interpreting 
 FALLOC_FL_KEEP_SIZE as allocate blocks beyond i_size). But I'm not sure what 
 others think.

IMHO, it makes more sense for consistency, and to give users what they expect, 
if these are treated as flags.  Some users will want KEEP_SIZE, but in other 
cases it may make sense that a hole punch at the end of a file should shrink 
the file (i.e. the opposite of an append).

Cheers, Andreas







Re: [PATCH 2/3] lib: improve the performance of memcpy and memmove of the general version

2010-09-03 Thread Andreas Dilger
On 2010-09-03, at 05:03, Florian Weimer wrote:

 * Miao Xie:
 
 EXPORT_SYMBOL(memcpy);
 
 I think you need to change that to EXPORT_SYMBOL_GPL, because the code
 is now licensed under the GPL, and not the GPL plus kernel exceptions
 (whatever they are, but they undoubtly exist), unlike the original
 implementation.

Ouch.  That would basically make it impossible to implement a non-GPL module.  
Also, this should only apply to the lib version of the memcpy() routine, not 
the arch-specific ones that are written in assembly (maybe that already is 
true, I'm not sure).

Lustre is GPL, so it isn't fatal for us, but it definitely seems draconian, and 
almost a reason not to include this patch into the kernel.  Also, given that 
the original code is LGPL (which allows linking to non-GPL code) this seems 
counter to the intent of the original authors.

I suspect there isn't anything so brilliant in the glibc memcpy() that it 
couldn't be re-implemented without copying the code.  The hint "implement 
memcpy in aligned machine-word-sized chunks" would be enough for anyone to 
reimplement it without ever having come close to the glibc code.

Cheers, Andreas







Re: [PATCH V3 0/8] Cleancache: overview

2010-08-03 Thread Andreas Dilger
On 2010-08-03, at 11:35, Dan Magenheimer wrote:
 - The FS should be block-device-based (e.g. a ram-based FS
  such as tmpfs should not enable cleancache)

When you say block device based, does this exclude network filesystems?  It 
would seem cleancache, like fscache, is actually best suited to high-latency 
network filesystems.

 - To ensure coherency/correctness, inode numbers must be unique
  (e.g. no emulating 64-bit inode space on 32-bit inode numbers)

Does it need to be restricted to inode numbers at all (i.e. can it use an 
opaque internal identifier like the NFS file handle)?  Disallowing cleancache 
on a filesystem that uses 64-bit (or larger) inodes on a 32-bit system reduces 
its usefulness.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview

2010-06-02 Thread Andreas Dilger
On 2010-06-02, at 20:46, Nitin Gupta wrote:
 On 06/03/2010 04:32 AM, Dan Magenheimer wrote:
 From: Minchan Kim [mailto:minchan@gmail.com]
 
 I am also eagerly awaiting Nitin Gupta's cleancache backend
 and implementation to do in-kernel page cache compression.
 
 Do Nitin say he will make backend of cleancache for
 page cache compression?
 
 It would be good feature.
 I have a interest, too. :)
 
 That was Nitin's plan for his GSOC project when we last discussed
 this.  Nitin is on the cc list and can comment if this has
 changed.
 
 Yes, I have just started work on in-kernel page cache compression
 backend for cleancache :)

Is there a design doc for this implementation?  I was thinking it would be 
quite clever to do compression in, say, 64kB or 128kB chunks in a mapping (to 
get decent compression) and then write these compressed chunks directly from 
the page cache to disk in btrfs and/or a revived compressed ext4.

That would mean that the on-disk compression algorithm needs to match the 
in-memory algorithm, which implies that the in-memory compression algorithm 
should be selectable on a per-mapping basis.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



Re: Cleancache [PATCH 6/7] (was Transcendent Memory): ext4 hook

2010-04-28 Thread Andreas Dilger
On 2010-04-22, at 07:29, Dan Magenheimer wrote:
 Cleancache [PATCH 6/7] (was Transcendent Memory): ext4 hook
 
 Filesystems must explicitly enable cleancache.  For ext4,
 all other cleancache hooks are in the VFS layer.
 
 Signed-off-by: Dan Magenheimer dan.magenhei...@oracle.com

Given the minimal changes being done to ext3/ext4, I don't have any objection 
to this.  Being able to hook ext4 into SSDs for hierarchical caching is 
something that will become increasingly important for huge ext4 filesystems.

Acked-by: Andreas Dilger adil...@sun.com

 Diffstat:
 super.c  |2 ++
 1 file changed, 2 insertions(+)
 
 --- linux-2.6.34-rc5/fs/ext4/super.c  2010-04-19 17:29:56.0 -0600
 +++ linux-2.6.34-rc5-cleancache/fs/ext4/super.c   2010-04-21 10:13:00.0 -0600
 @@ -39,6 +39,7 @@
  #include <linux/ctype.h>
  #include <linux/log2.h>
  #include <linux/crc16.h>
 +#include <linux/cleancache.h>
  #include <asm/uaccess.h>
 
  #include "ext4.h"
 @@ -1784,6 +1785,7 @@ static int ext4_setup_super(struct super
 		EXT4_INODES_PER_GROUP(sb),
 		sbi->s_mount_opt);
 
 +	sb->cleancache_poolid = cleancache_init_fs(PAGE_SIZE);
 	return res;
  }


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: source line numbers with x86_64 modules? [Was: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact]

2009-01-11 Thread Andreas Dilger
On Jan 10, 2009  16:15 -0500, Theodore Ts'o wrote:
 In my experience, there are very few kernel versions and hardware for
 which kdump works.  I've talked to the people who have to make kdump
 work, and every 12-18 months, with a new set of enterprise kernels
 comes out, they have to go and fix kdump so it works again for the set
 of hardware that they care about, and for the kernel version involved.

I'm sad that netconsole/netdump never made it big.  It was fairly useful,
and extending the eth drivers to add the polling mode was trivial to do.
We were using that for a few years, but it got replaced by kdump and it
appears to be less usable IMHO.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: Notes on support for multiple devices for a single filesystem

2008-12-17 Thread Andreas Dilger
).  When
filesystem X is mounted (using any one device from the filesystem passed
to mount) it can quickly scan this list to find all devices matching that
filesystem's UUID and assemble all of the required devices.

Since the filesystem would internally have to verify the list of devices
that would (somehow) be passed to the kernel during mount, keeping the
list in userspace doesn't really gain much.

 Advantage:    Requires a mount.fstype helper or fs-specific knowledge
   in mount(8).
 Disadvantages:  Required libvolume_id / libblkid to keep state.
 
  2) whenever libvolume_id / libblkid find a device detected as a multi-device
 capable filesystem they call into the kernel through an ioctl / sysfs /
 etc to add it to a list in kernel space.  The kernel code then manages
 to select the right filesystem instances, and either adds it to an already
 running instance, or prepares a list for a future mount.

While not performance critical, calling into userspace just to call back
down into the kernel seems more complex than it needs to be (i.e. prone
to failure).

I'm not stuck on doing this in the kernel, just thinking that the more
components involved the more likely it is that something goes wrong.  I
have enough grief with just initrd not having the right kernel module,
I'd hate to depend on tools to do the device assembly for the root fs.

 Advantages:    keeps state in the kernel instead of in libraries
 Disadvantages: requires an additional user interface to mount
 
 c) would be to deal with an option that allows adding named block devices
 on the mount command line, which is already required for design 1) but
 would have to be implemented additionally for design 2).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: Btrfs trees for linux-next

2008-12-15 Thread Andreas Dilger
On Dec 11, 2008  09:43 -0500, Chris Mason wrote:
 The multi-device code uses a very simple brute force scan from userland
 to populate the list of devices that belong to a given FS.  Kay Sievers
 has some ideas on hotplug magic to make this less dumb.  (The scan isn't
 required for single device filesystems).

This should use libblkid to do the scanning of the devices, and it can
cache the results for efficiency.  Best would be to have the same LABEL+UUID
for all devices in the same filesystem, and then once any of these devices
are found the mount.btrfs code can query the rest of the devices to find
the remaining parts of the filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
