Re: [PATCH] ceph: fix race in concurrent __ceph_remove_cap invocations

2020-11-12 Thread Yan, Zheng
On Thu, Nov 12, 2020 at 6:48 PM Luis Henriques  wrote:
>
> A NULL pointer dereference may occur in __ceph_remove_cap with some of the
> callbacks used in ceph_iterate_session_caps, namely trim_caps_cb and
> remove_session_caps_cb.  These aren't protected against the concurrent
> execution of __ceph_remove_cap.
>

They are protected by the session mutex, so they never get executed concurrently.
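
For reference, the call pattern being described looks roughly like this (a
schematic sketch only; the wrapper and callback names here are hypothetical,
and in the real code the mutex is taken further up the call chain, in the
session message handlers):

    /*
     * Sketch: the callbacks passed to ceph_iterate_session_caps()
     * (trim_caps_cb, remove_session_caps_cb) are only invoked while the
     * caller holds session->s_mutex, so two walks over the same session's
     * caps never run concurrently.
     */
    static int example_cap_cb(struct inode *inode, struct ceph_cap *cap,
                              void *arg)
    {
            /* may end up calling __ceph_remove_cap() under i_ceph_lock */
            return 0;
    }

    static void walk_session_caps_example(struct ceph_mds_session *session)
    {
            lockdep_assert_held(&session->s_mutex);  /* serializes the walk */
            ceph_iterate_session_caps(session, example_cap_cb, NULL);
    }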

> Since the callers of this function hold the i_ceph_lock, the fix is simply
> a matter of returning immediately if cap->ci is NULL.
>
> Based on a patch from Jeff Layton.
>
> Cc: sta...@vger.kernel.org
> URL: https://tracker.ceph.com/issues/43272
> Link: https://www.spinics.net/lists/ceph-devel/msg47064.html
> Signed-off-by: Luis Henriques 
> ---
>  fs/ceph/caps.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index ded4229c314a..443f164760d5 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -1140,12 +1140,19 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool 
> queue_release)
>  {
> struct ceph_mds_session *session = cap->session;
> struct ceph_inode_info *ci = cap->ci;
> -   struct ceph_mds_client *mdsc =
> -   ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc;
> +   struct ceph_mds_client *mdsc;
> int removed = 0;
>
> +   /* 'ci' being NULL means the remove has already occurred */
> +   if (!ci) {
> +   dout("%s: cap inode is NULL\n", __func__);
> +   return;
> +   }
> +
> dout("__ceph_remove_cap %p from %p\n", cap, &ci->vfs_inode);
>
> +   mdsc = ceph_inode_to_client(&ci->vfs_inode)->mdsc;
> +
> /* remove from inode's cap rbtree, and clear auth cap */
> rb_erase(&cap->ci_node, &ci->i_caps);
> if (ci->i_auth_cap == cap) {


Re: [PATCH V3] fs/ceph:fix double unlock in handle_cap_export()

2020-04-30 Thread Yan, Zheng
On Thu, Apr 30, 2020 at 2:13 PM Wu Bo  wrote:
>
> If ceph_mdsc_open_export_target_session() returns an error,
> we should add a mutex_lock(&session->s_mutex) in the IS_ERR(tsession) block
> to avoid a double unlock, because session->s_mutex will be unlocked
> at the out_unlock label.
>
> --
> v2 -> v3:
>   - Rewrite solution, adding a mutex_lock(&session->s_mutex)
> to the IS_ERR(tsession) block.
>   - Make the description clearer.
> v1 -> v2:
>   - Add spin_lock(&ci->i_ceph_lock) before the goto out_unlock label.
>
>
> Signed-off-by: Wu Bo 
> ---
>  fs/ceph/caps.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 185db76..d27d778 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -3746,6 +3746,7 @@ static void handle_cap_export(struct inode *inode, 
> struct ceph_mds_caps *ex,
> WARN_ON(1);
> tsession = NULL;
> target = -1;
> +   mutex_lock(&session->s_mutex);
> }
> goto retry;
>
> --
> 1.8.3.1
>

 Reviewed-by: "Yan, Zheng" 


Re: [PATCH V2] fs/ceph:fix double unlock in handle_cap_export()

2020-04-29 Thread Yan, Zheng
On Wed, Apr 29, 2020 at 8:49 AM Wu Bo  wrote:
>
> On 2020/4/28 22:48, Jeff Layton wrote:
> > On Tue, 2020-04-28 at 21:13 +0800, Wu Bo wrote:
> >> If ceph_mdsc_open_export_target_session() returns an error, we
> >> should take the lock again to avoid a double unlock,
> >> because the lock will be released at the retry or out_unlock label.
> >>
> >
> > The problem looks real, but...
> >
> >> --
> >> v1 -> v2:
> >> add spin_lock(&ci->i_ceph_lock) before the goto out_unlock label.
> >>
> >> Signed-off-by: Wu Bo 
> >> ---
> >>   fs/ceph/caps.c | 27 +++
> >>   1 file changed, 15 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> >> index 185db76..414c0e2 100644
> >> --- a/fs/ceph/caps.c
> >> +++ b/fs/ceph/caps.c
> >> @@ -3731,22 +3731,25 @@ static void handle_cap_export(struct inode *inode, 
> >> struct ceph_mds_caps *ex,
> >>
> >>  /* open target session */
> >>  tsession = ceph_mdsc_open_export_target_session(mdsc, target);
> >> -if (!IS_ERR(tsession)) {
> >> -if (mds > target) {
> >> -mutex_lock(&session->s_mutex);
> >> -mutex_lock_nested(&tsession->s_mutex,
> >> -  SINGLE_DEPTH_NESTING);
> >> -} else {
> >> -mutex_lock(&tsession->s_mutex);
> >> -mutex_lock_nested(&session->s_mutex,
> >> -  SINGLE_DEPTH_NESTING);
> >> -}
> >> -new_cap = ceph_get_cap(mdsc, NULL);
> >> -} else {
> >> +if (IS_ERR(tsession)) {
> >>  WARN_ON(1);
> >>  tsession = NULL;
> >>  target = -1;
> >> +mutex_lock(&session->s_mutex);
> >> +spin_lock(&ci->i_ceph_lock);
> >> +goto out_unlock;
> >
> > Why did you make this case goto out_unlock instead of retrying as it did
> > before?
> >
>
> If the problem occurs, target = -1 and we goto the retry label; we then
> need to call __get_cap_for_mds() and possibly __ceph_remove_cap(), and then
> jump to the out_unlock label. I think all of that is unnecessary, so I goto
> out_unlock directly instead of retrying.
>

__ceph_remove_cap() must be called even if opening the target session
failed. I think adding a mutex_lock(&session->s_mutex) to the
IS_ERR(tsession) block should be enough.
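
In other words, roughly (a sketch of the V3 change quoted at the top of this
thread; the success branch is elided):

    /* open target session */
    tsession = ceph_mdsc_open_export_target_session(mdsc, target);
    if (!IS_ERR(tsession)) {
            /* ... take session->s_mutex and tsession->s_mutex in the right
             * order and allocate new_cap, as before ... */
    } else {
            WARN_ON(1);
            tsession = NULL;
            target = -1;
            /* re-take s_mutex so the unlock at the out_unlock label,
             * reached via the retry path, stays balanced */
            mutex_lock(&session->s_mutex);
    }
    goto retry;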


> Thanks.
> Wu Bo
>
> >> +}
> >> +
> >> +if (mds > target) {
> >> +mutex_lock(&session->s_mutex);
> >> +mutex_lock_nested(&tsession->s_mutex,
> >> +SINGLE_DEPTH_NESTING);
> >> +} else {
> >> +mutex_lock(&tsession->s_mutex);
> >> +mutex_lock_nested(&session->s_mutex,
> >> +SINGLE_DEPTH_NESTING);
> >>  }
> >> +new_cap = ceph_get_cap(mdsc, NULL);
> >>  goto retry;
> >>
> >>   out_unlock:
> >
>
>


Re: [PATCH] fs/ceph:fix double unlock in handle_cap_export()

2020-04-28 Thread Yan, Zheng
On Tue, Apr 28, 2020 at 8:50 PM Wu Bo  wrote:
>
> If ceph_mdsc_open_export_target_session() returns an error, we
> should take the lock again to avoid a double unlock,
> because the lock will be released at the retry or out_unlock label.
>

At the retry label, i_ceph_lock gets locked again. I don't see how
i_ceph_lock can get double unlocked.
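
For reference, the locking flow around the two labels in handle_cap_export()
is roughly the following (simplified sketch, details elided), which is why
the lock that can end up unbalanced is session->s_mutex rather than
i_ceph_lock:

    retry:
            spin_lock(&ci->i_ceph_lock);       /* i_ceph_lock re-taken here */
            cap = __get_cap_for_mds(ci, mds);
            if (!cap || cap->cap_id != le64_to_cpu(ex->cap_id))
                    goto out_unlock;
            /* ... migrate or remove the cap ... */

            spin_unlock(&ci->i_ceph_lock);
            mutex_unlock(&session->s_mutex);   /* dropped before opening the
                                                * target session */
            tsession = ceph_mdsc_open_export_target_session(mdsc, target);
            /* if this fails, the unpatched code sets target = -1 and goes to
             * retry, and that retry pass ends at out_unlock, which unlocks
             * the already-unlocked s_mutex */
            goto retry;

    out_unlock:
            spin_unlock(&ci->i_ceph_lock);
            mutex_unlock(&session->s_mutex);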

> Signed-off-by: Wu Bo 
> ---
>  fs/ceph/caps.c | 26 ++
>  1 file changed, 14 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 185db76..b5a62a8 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -3731,22 +3731,24 @@ static void handle_cap_export(struct inode *inode, 
> struct ceph_mds_caps *ex,
>
> /* open target session */
> tsession = ceph_mdsc_open_export_target_session(mdsc, target);
> -   if (!IS_ERR(tsession)) {
> -   if (mds > target) {
> -   mutex_lock(&session->s_mutex);
> -   mutex_lock_nested(&tsession->s_mutex,
> - SINGLE_DEPTH_NESTING);
> -   } else {
> -   mutex_lock(&tsession->s_mutex);
> -   mutex_lock_nested(&session->s_mutex,
> - SINGLE_DEPTH_NESTING);
> -   }
> -   new_cap = ceph_get_cap(mdsc, NULL);
> -   } else {
> +   if (IS_ERR(tsession)) {
> WARN_ON(1);
> tsession = NULL;
> target = -1;
> +   mutex_lock(&session->s_mutex);
> +   goto out_unlock;
> +   }
> +
> +   if (mds > target) {
> +   mutex_lock(&session->s_mutex);
> +   mutex_lock_nested(&tsession->s_mutex,
> +   SINGLE_DEPTH_NESTING);
> +   } else {
> +   mutex_lock(&tsession->s_mutex);
> +   mutex_lock_nested(&session->s_mutex,
> +   SINGLE_DEPTH_NESTING);
> }
> +   new_cap = ceph_get_cap(mdsc, NULL);
> goto retry;
>
>  out_unlock:
> --
> 1.8.3.1
>


Re: [PATCH] MAINTAINERS: take over for Zheng as CephFS kernel client maintainer

2019-06-26 Thread Yan, Zheng

On 6/26/19 7:26 PM, Jeff Layton wrote:

Zheng wants to be able to spend more time working on the MDS, so I've
volunteered to take over for him as the CephFS kernel client maintainer.



ACK

Thanks
Yan, Zheng


Signed-off-by: Jeff Layton 
---
  MAINTAINERS | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

Zheng, I'm assuming for now that you don't want to stay on as
co-maintainer. Let me know if that's incorrect and I'll resend.

diff --git a/MAINTAINERS b/MAINTAINERS
index d0ed735994a5..8836f9eb2ff0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3715,7 +3715,7 @@ F:arch/powerpc/platforms/cell/
  
  CEPH COMMON CODE (LIBCEPH)

  M: Ilya Dryomov 
-M: "Yan, Zheng" 
+M: Jeff Layton 
  M: Sage Weil 
  L: ceph-de...@vger.kernel.org
  W: http://ceph.com/
@@ -3727,7 +3727,7 @@ F: include/linux/ceph/
  F: include/linux/crush/
  
  CEPH DISTRIBUTED FILE SYSTEM CLIENT (CEPH)

-M: "Yan, Zheng" 
+M: Jeff Layton 
  M: Sage Weil 
  M: Ilya Dryomov 
  L: ceph-de...@vger.kernel.org





Re: [PATCH v4 3/3] ceph: don't NULL terminate virtual xattrs

2019-06-25 Thread Yan, Zheng
On Tue, Jun 25, 2019 at 4:18 AM Jeff Layton  wrote:
>
> The convention with xattrs is to not store the termination with string
> data, given that it returns the length. This is how setfattr/getfattr
> operate.
>
> Most of ceph's virtual xattr routines use snprintf to plop the string
> directly into the destination buffer, but snprintf always NULL
> terminates the string. This means that if we send the kernel a buffer
> that is the exact length needed to hold the string, it'll end up
> truncated.
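
From user space the failure mode looks roughly like this (illustrative
sketch only; the mount path and the 4M object size are assumptions, the
xattr name is a real CephFS vxattr):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/xattr.h>

    int main(void)
    {
            char buf[7];    /* "4194304" is exactly 7 characters */
            ssize_t len = getxattr("/mnt/cephfs/file",
                                   "ceph.file.layout.object_size",
                                   buf, sizeof(buf));
            if (len < 0) {
                    perror("getxattr");
                    return 1;
            }
            /* with the old snprintf-based handlers: len == 7 but buf holds
             * "419430" plus a NUL, i.e. the value is silently truncated;
             * with ceph_fmt_xattr() buf holds all 7 digits, unterminated */
            printf("len=%zd value=%.*s\n", len, (int)len, buf);
            return 0;
    }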
>
> Add a ceph_fmt_xattr helper function to format the string into an
> on-stack buffer that should always be large enough to hold the whole
> thing and then memcpy the result into the destination buffer. If it does
> turn out that the formatted string won't fit in the on-stack buffer,
> then return -E2BIG and do a WARN_ONCE().
>
> Change over most of the virtual xattr routines to use the new helper. A
> couple of the xattrs are sourced from strings however, and it's
> difficult to know how long they'll be. Just have those memcpy the result
> in place after verifying the length.
>
> Signed-off-by: Jeff Layton 
> ---
>  fs/ceph/xattr.c | 84 ++---
>  1 file changed, 59 insertions(+), 25 deletions(-)
>
> diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
> index 9b77dca0b786..37b458a9af3a 100644
> --- a/fs/ceph/xattr.c
> +++ b/fs/ceph/xattr.c
> @@ -109,22 +109,49 @@ static ssize_t ceph_vxattrcb_layout(struct 
> ceph_inode_info *ci, char *val,
> return ret;
>  }
>
> +/*
> + * The convention with strings in xattrs is that they should not be NULL
> + * terminated, since we're returning the length with them. snprintf always
> + * NULL terminates however, so call it on a temporary buffer and then memcpy
> + * the result into place.
> + */
> +static int ceph_fmt_xattr(char *val, size_t size, const char *fmt, ...)
> +{
> +   int ret;
> +   va_list args;
> +   char buf[96]; /* NB: reevaluate size if new vxattrs are added */
> +
> +   va_start(args, fmt);
> +   ret = vsnprintf(buf, size ? sizeof(buf) : 0, fmt, args);
> +   va_end(args);
> +
> +   /* Sanity check */
> +   if (size && ret + 1 > sizeof(buf)) {
> +   WARN_ONCE(true, "Returned length too big (%d)", ret);
> +   return -E2BIG;
> +   }
> +
> +   if (ret <= size)
> +   memcpy(val, buf, ret);
> +   return ret;
> +}
> +
>  static ssize_t ceph_vxattrcb_layout_stripe_unit(struct ceph_inode_info *ci,
> char *val, size_t size)
>  {
> -   return snprintf(val, size, "%u", ci->i_layout.stripe_unit);
> +   return ceph_fmt_xattr(val, size, "%u", ci->i_layout.stripe_unit);
>  }
>
>  static ssize_t ceph_vxattrcb_layout_stripe_count(struct ceph_inode_info *ci,
>  char *val, size_t size)
>  {
> -   return snprintf(val, size, "%u", ci->i_layout.stripe_count);
> +   return ceph_fmt_xattr(val, size, "%u", ci->i_layout.stripe_count);
>  }
>
>  static ssize_t ceph_vxattrcb_layout_object_size(struct ceph_inode_info *ci,
> char *val, size_t size)
>  {
> -   return snprintf(val, size, "%u", ci->i_layout.object_size);
> +   return ceph_fmt_xattr(val, size, "%u", ci->i_layout.object_size);
>  }
>
>  static ssize_t ceph_vxattrcb_layout_pool(struct ceph_inode_info *ci,
> @@ -138,10 +165,13 @@ static ssize_t ceph_vxattrcb_layout_pool(struct 
> ceph_inode_info *ci,
>
> down_read(&osdc->lock);
> pool_name = ceph_pg_pool_name_by_id(osdc->osdmap, pool);
> -   if (pool_name)
> -   ret = snprintf(val, size, "%s", pool_name);
> -   else
> -   ret = snprintf(val, size, "%lld", pool);
> +   if (pool_name) {
> +   ret = strlen(pool_name);
> +   if (ret <= size)
> +   memcpy(val, pool_name, ret);
> +   } else {
> +   ret = ceph_fmt_xattr(val, size, "%lld", pool);
> +   }
> up_read(&osdc->lock);
> return ret;
>  }
> @@ -149,10 +179,13 @@ static ssize_t ceph_vxattrcb_layout_pool(struct 
> ceph_inode_info *ci,
>  static ssize_t ceph_vxattrcb_layout_pool_namespace(struct ceph_inode_info 
> *ci,
>char *val, size_t size)
>  {
> -   int ret = 0;
> +   ssize_t ret = 0;
> struct ceph_string *ns = ceph_try_get_string(ci->i_layout.pool_ns);
> +
> if (ns) {
> -   ret = snprintf(val, size, "%.*s", ns->len, ns->str);
> +   ret = ns->len;
> +   if (ret <= size)
> +   memcpy(val, ns->str, ret);
> ceph_put_string(ns);
> }
> return ret;
> @@ -163,50 +196,51 @@ static ssize_t 
> ceph_vxattrcb_layout_pool_namespace(struct ceph_inode_info *ci,
>  static ssize_t ceph_vxattrcb_dir_entries(struct ceph_inode_info *ci, char 
> *val,
>   

Re: [PATCH v3 0/2] ceph: don't NULL terminate virtual xattr values

2019-06-24 Thread Yan, Zheng
On Fri, Jun 21, 2019 at 10:21 PM Jeff Layton  wrote:
>
> v3: switch to using an intermediate buffer for snprintf destination
> add patch to fix ceph_vxattrcb_layout return value
> v2: drop bogus EXPORT_SYMBOL of static function
>
> This is the 3rd posting of this patchset. Instead of adding a new
> snprintf variant that doesn't NULL terminate, this set instead has
> the vxattr handlers use an intermediate buffer as the snprintf
> destination and then memcpy's the result into the destination buffer.
>
> Also, I added a patch to fix up the return of ceph_vxattrcb_layout. The
> existing code actually worked, but relied on casting a signed negative
> value to unsigned and back, which seemed a little sketchy.
>
> Most of the rationale for this set is in the description of the first
> patch of the series.
>
> Jeff Layton (2):
>   ceph: fix buffer length handling in virtual xattrs
>   ceph: fix return of ceph_vxattrcb_layout
>
>  fs/ceph/xattr.c | 113 ++--
>  1 file changed, 81 insertions(+), 32 deletions(-)
>

Reviewed-by
> --
> 2.21.0
>


Re: [PATCH 1/3] lib/vsprintf: add snprintf_noterm

2019-06-14 Thread Yan, Zheng
On Fri, Jun 14, 2019 at 9:48 PM Jeff Layton  wrote:
>
> The getxattr interface returns a length after filling out the value
> buffer, and the convention with xattrs is to not NULL terminate string
> data.
>
> CephFS implements some virtual xattrs by using snprintf to fill the
> buffer, but that always NULL terminates the string. If userland sends
> down a buffer that is just the right length to hold the text without
> termination then we end up truncating the value.
>
> Factor the formatting piece of vsnprintf into a separate helper
> function, and have vsnprintf call that and then do the NULL termination
> afterward. Then add a snprintf_noterm function that calls the new helper
> to populate the string but skips the termination.
>
> Signed-off-by: Jeff Layton 
> ---
>  include/linux/kernel.h |   2 +
>  lib/vsprintf.c | 145 -
>  2 files changed, 103 insertions(+), 44 deletions(-)
>
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 2d14e21c16c0..2f305a347482 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -462,6 +462,8 @@ extern int num_to_str(char *buf, int size,
>  extern __printf(2, 3) int sprintf(char *buf, const char * fmt, ...);
>  extern __printf(2, 0) int vsprintf(char *buf, const char *, va_list);
>  extern __printf(3, 4)
> +int snprintf_noterm(char *buf, size_t size, const char *fmt, ...);
> +extern __printf(3, 4)
>  int snprintf(char *buf, size_t size, const char *fmt, ...);
>  extern __printf(3, 0)
>  int vsnprintf(char *buf, size_t size, const char *fmt, va_list args);
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index 791b6fa36905..ad5f4990eda3 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -2296,53 +2296,24 @@ set_precision(struct printf_spec *spec, int prec)
>  }
>
>  /**
> - * vsnprintf - Format a string and place it in a buffer
> + * vsnprintf_noterm - Format a string and place it in a buffer without NULL
> + *   terminating it
>   * @buf: The buffer to place the result into
> - * @size: The size of the buffer, including the trailing null space
> + * @end: The end of the buffer
>   * @fmt: The format string to use
>   * @args: Arguments for the format string
>   *
> - * This function generally follows C99 vsnprintf, but has some
> - * extensions and a few limitations:
> - *
> - *  - ``%n`` is unsupported
> - *  - ``%p*`` is handled by pointer()
> - *
> - * See pointer() or Documentation/core-api/printk-formats.rst for more
> - * extensive description.
> - *
> - * **Please update the documentation in both places when making changes**
> - *
> - * The return value is the number of characters which would
> - * be generated for the given input, excluding the trailing
> - * '\0', as per ISO C99. If you want to have the exact
> - * number of characters written into @buf as return value
> - * (not including the trailing '\0'), use vscnprintf(). If the
> - * return is greater than or equal to @size, the resulting
> - * string is truncated.
> - *
> - * If you're not already dealing with a va_list consider using snprintf().
> + * See the documentation over vsnprintf. This function does NOT add any NULL
> + * termination to the buffer. The caller must do that if necessary.
>   */
> -int vsnprintf(char *buf, size_t size, const char *fmt, va_list args)
> +static int vsnprintf_noterm(char *buf, char *end, const char *fmt,
> +   va_list args)
>  {
> unsigned long long num;
> -   char *str, *end;
> +   char *str;
> struct printf_spec spec = {0};
>
> -   /* Reject out-of-range values early.  Large positive sizes are
> -  used for unknown buffer sizes. */
> -   if (WARN_ON_ONCE(size > INT_MAX))
> -   return 0;
> -
> str = buf;
> -   end = buf + size;
> -
> -   /* Make sure end is always >= buf */
> -   if (end < buf) {
> -   end = ((void *)-1);
> -   size = end - buf;
> -   }
> -
> while (*fmt) {
> const char *old_fmt = fmt;
> int read = format_decode(fmt, &spec);
> @@ -2462,18 +2433,69 @@ int vsnprintf(char *buf, size_t size, const char 
> *fmt, va_list args)
> str = number(str, end, num, spec);
> }
> }
> -
>  out:
> +   /* the trailing null byte doesn't count towards the total */
> +   return str-buf;
> +}
> +EXPORT_SYMBOL(vsnprintf_noterm);

export static function?

> +
> +/**
> + * vsnprintf - Format a string and place it in a buffer
> + * @buf: The buffer to place the result into
> + * @size: The size of the buffer, including the trailing null space
> + * @fmt: The format string to use
> + * @args: Arguments for the format string
> + *
> + * This function generally follows C99 vsnprintf, but has some
> + * extensions and a few limitations:
> + *
> + *  - ``%n`` is unsupported
> + *  - ``%p*`` is handled by pointer()
> + *
> + * See pointer() or Documentation/core-

[PATCH] ceph: use ceph_evict_inode to cleanup inode's resource

2019-06-01 Thread Yan, Zheng
remove_session_caps() relies on __wait_on_freeing_inode() to wait for a
freeing inode to remove its caps. But the VFS wakes freeing-inode waiters
before calling destroy_inode().
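
For context, the relevant ordering in the VFS looks roughly like this (a
simplified sketch of evict() in fs/inode.c, most of the body elided):

    static void evict(struct inode *inode)
    {
            /* ... */
            op->evict_inode(inode);         /* caps can be removed here */
            /* ... */
            remove_inode_hash(inode);
            wake_up_bit(&inode->i_state, __I_NEW);  /* __wait_on_freeing_inode()
                                                     * waiters wake up here... */
            destroy_inode(inode);           /* ...so removing caps only in
                                             * ->destroy_inode() was too late
                                             * for remove_session_caps() */
    }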

Signed-off-by: "Yan, Zheng" 
---
 fs/ceph/inode.c | 25 ++---
 fs/ceph/super.c |  1 +
 fs/ceph/super.h |  1 +
 3 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 125ac54b5841..9e481b41d5bc 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -515,22 +515,13 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
return &ci->vfs_inode;
 }
 
-static void ceph_i_callback(struct rcu_head *head)
-{
-   struct inode *inode = container_of(head, struct inode, i_rcu);
-   struct ceph_inode_info *ci = ceph_inode(inode);
-
-   kfree(ci->i_symlink);
-   kmem_cache_free(ceph_inode_cachep, ci);
-}
-
-void ceph_destroy_inode(struct inode *inode)
+void ceph_evict_inode(struct inode *inode)
 {
struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_inode_frag *frag;
struct rb_node *n;
 
-   dout("destroy_inode %p ino %llx.%llx\n", inode, ceph_vinop(inode));
+   dout("evict_inode %p ino %llx.%llx\n", inode, ceph_vinop(inode));
 
ceph_fscache_unregister_inode_cookie(ci);
 
@@ -577,7 +568,19 @@ void ceph_destroy_inode(struct inode *inode)
ceph_buffer_put(ci->i_xattrs.prealloc_blob);
 
ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
+}
 
+static void ceph_i_callback(struct rcu_head *head)
+{
+   struct inode *inode = container_of(head, struct inode, i_rcu);
+   struct ceph_inode_info *ci = ceph_inode(inode);
+
+   kfree(ci->i_symlink);
+   kmem_cache_free(ceph_inode_cachep, ci);
+}
+
+void ceph_destroy_inode(struct inode *inode)
+{
call_rcu(&inode->i_rcu, ceph_i_callback);
 }
 
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index b1ee41372e85..67eb9d592ab7 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -842,6 +842,7 @@ static const struct super_operations ceph_super_ops = {
.destroy_inode  = ceph_destroy_inode,
.write_inode= ceph_write_inode,
.drop_inode = ceph_drop_inode,
+   .evict_inode= ceph_evict_inode,
.sync_fs= ceph_sync_fs,
.put_super  = ceph_put_super,
.remount_fs = ceph_remount,
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 9c82d213a5ab..98d2bafc2ee2 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -882,6 +882,7 @@ static inline bool __ceph_have_pending_cap_snap(struct 
ceph_inode_info *ci)
 extern const struct inode_operations ceph_file_iops;
 
 extern struct inode *ceph_alloc_inode(struct super_block *sb);
+extern void ceph_evict_inode(struct inode *inode);
 extern void ceph_destroy_inode(struct inode *inode);
 extern int ceph_drop_inode(struct inode *inode);
 
-- 
2.17.2



New CephFS kernel client maintainer

2019-05-30 Thread Yan, Zheng

Hello everyone.

I'd like to introduce the new CephFS kernel client maintainer, Jeff Layton.
Jeff is a long-time Linux kernel developer specializing in network file
systems. He has worked on CephFS for three years. He has also made
significant contributions to the kernel's NFS client and server, the CIFS
client, and the kernel's VFS layer.


I will continue to work on CephFS, but will spend more time on improving
the CephFS metadata server.


Regards
Yan, Zheng


Re: [PATCH 8/8] ceph: hold i_ceph_lock when removing caps for freeing inode

2019-05-29 Thread Yan, Zheng

On 5/29/19 9:14 PM, Sasha Levin wrote:

Hi,

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: all

The bot has tested the following trees: v5.1.4, v5.0.18, v4.19.45, v4.14.121, 
v4.9.178, v4.4.180, v3.18.140.

v5.1.4: Build OK!
v5.0.18: Failed to apply! Possible dependencies:
 e3ec8d6898f71 ("ceph: send cap releases more aggressively")

v4.19.45: Failed to apply! Possible dependencies:
 e3ec8d6898f71 ("ceph: send cap releases more aggressively")

v4.14.121: Failed to apply! Possible dependencies:
 a1c6b8358171c ("ceph: define argument structure for handle_cap_grant")
 a57d9064e4ee4 ("ceph: flush pending works before shutdown super")
 e3ec8d6898f71 ("ceph: send cap releases more aggressively")

v4.9.178: Failed to apply! Possible dependencies:
 a1c6b8358171c ("ceph: define argument structure for handle_cap_grant")
 a57d9064e4ee4 ("ceph: flush pending works before shutdown super")
 e3ec8d6898f71 ("ceph: send cap releases more aggressively")

v4.4.180: Failed to apply! Possible dependencies:
 13d1ad16d05ee ("libceph: move message allocation out of 
ceph_osdc_alloc_request()")
 34b759b4a22b0 ("ceph: kill ceph_empty_snapc")
 3f1af42ad0fad ("libceph: enable large, variable-sized OSD requests")
 5be0389dac662 ("ceph: re-send AIO write request when getting -EOLDSNAP 
error")
 7627151ea30bc ("libceph: define new ceph_file_layout structure")
 779fe0fb8e188 ("ceph: rados pool namespace support")
 922dab6134178 ("libceph, rbd: ceph_osd_linger_request, watch/notify v2")
 a1c6b8358171c ("ceph: define argument structure for handle_cap_grant")
 ae458f5a171ba ("libceph: make r_request msg_size calculation clearer")
 c41d13a31fefe ("rbd: use header_oid instead of header_name")
 c8fe9b17d055f ("ceph: Asynchronous IO support")
 d30291b985d18 ("libceph: variable-sized ceph_object_id")
 e3ec8d6898f71 ("ceph: send cap releases more aggressively")

v3.18.140: Failed to apply! Possible dependencies:
 10183a69551f7 ("ceph: check OSD caps before read/write")
 28127bdd2f843 ("ceph: convert inline data to normal data before data 
write")
 31c542a199d79 ("ceph: add inline data to pagecache")
 5be0389dac662 ("ceph: re-send AIO write request when getting -EOLDSNAP 
error")
 70db4f3629b34 ("ceph: introduce a new inode flag indicating if cached dentries 
are ordered")
 745a8e3bccbc6 ("ceph: don't pre-allocate space for cap release messages")
 7627151ea30bc ("libceph: define new ceph_file_layout structure")
 779fe0fb8e188 ("ceph: rados pool namespace support")
 83701246aee8f ("ceph: sync read inline data")
 a1c6b8358171c ("ceph: define argument structure for handle_cap_grant")
 affbc19a68f99 ("ceph: make sure syncfs flushes all cap snaps")
 c8fe9b17d055f ("ceph: Asynchronous IO support")
 d30291b985d18 ("libceph: variable-sized ceph_object_id")
 d3383a8e37f80 ("ceph: avoid block operation when !TASK_RUNNING 
(ceph_mdsc_sync)")
 e3ec8d6898f71 ("ceph: send cap releases more aggressively")
 e96a650a8174e ("ceph, rbd: delete unnecessary checks before two function 
calls")


How should we proceed with this patch?



Please use the following patch for old kernels.

Regards
Yan, Zheng

---
From 55937416f12e096621b06ada7554cacb89d06e97 Mon Sep 17 00:00:00 2001
From: "Yan, Zheng" 
Date: Thu, 23 May 2019 11:01:37 +0800
Subject: [PATCH] ceph: hold i_ceph_lock when removing caps for freeing inode

ceph_d_revalidate(, LOOKUP_RCU) may call __ceph_caps_issued_mask()
on a freeing inode.

Cc: sta...@vger.kernel.org
Signed-off-by: "Yan, Zheng" 
Reviewed-by: Jeff Layton 
---
 fs/ceph/caps.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index ff5d32cf9578..0fb4e919cdce 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1119,20 +1119,23 @@ static int send_cap_msg(struct cap_msg_args *arg)
 }

 /*
- * Queue cap releases when an inode is dropped from our cache.  Since
- * inode is about to be destroyed, there is no need for i_ceph_lock.
+ * Queue cap releases when an inode is dropped from our cache.
  */
 void ceph_queue_caps_release(struct inode *inode)
 {
struct ceph_inode_info *ci = ceph_inode(inode);
struct rb_node *p;

+   /* lock i_ceph_lock, because ceph_d_revalidate(..., LOOKUP_RCU)
+* may call __ceph_caps_issued_mask() on a freeing inode. */
+   spin_lock(&ci->i_ceph_lock);
p = rb_first(&ci->i_caps);
while (p) {
struct ceph_cap *cap = rb_entry(p, struct ceph_cap, ci_node);
p = rb_next(p);
__ceph_remove_cap(cap, true);
}
+   spin_unlock(&ci->i_ceph_lock);
 }

 /*
--
2.17.2






--
Thanks,
Sasha






Re: [PATCH] ceph: fix warning PTR_ERR_OR_ZERO can be used

2019-05-25 Thread Yan, Zheng

On 5/25/19 5:15 PM, Hariprasad Kelam wrote:

change1: fix the below warning reported by coccicheck

/fs/ceph/export.c:371:33-39: WARNING: PTR_ERR_OR_ZERO can be used

change2: cast PTR_ERR_OR_ZERO to long, as dout expects a long

Signed-off-by: Hariprasad Kelam 
---
  fs/ceph/export.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/export.c b/fs/ceph/export.c
index d3ef7ee42..15ff1b0 100644
--- a/fs/ceph/export.c
+++ b/fs/ceph/export.c
@@ -368,7 +368,7 @@ static struct dentry *ceph_get_parent(struct dentry *child)
}
  out:
dout("get_parent %p ino %llx.%llx err=%ld\n",
-child, ceph_vinop(inode), (IS_ERR(dn) ? PTR_ERR(dn) : 0));
+child, ceph_vinop(inode), (long)PTR_ERR_OR_ZERO(dn));
return dn;
  }
  



Applied.

Thanks
Yan, Zheng


[PATCH 2/8] ceph: single workqueue for inode related works

2019-05-23 Thread Yan, Zheng
We have three workqueues for inode works. A later patch will introduce
one more work item for the inode. It's not good to introduce more
workqueues and add more 'struct work_struct' fields to
'struct ceph_inode_info'.

Signed-off-by: "Yan, Zheng" 
---
 fs/ceph/file.c  |   2 +-
 fs/ceph/inode.c | 124 ++--
 fs/ceph/super.c |  28 +++
 fs/ceph/super.h |  17 ---
 4 files changed, 74 insertions(+), 97 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index ccc054794542..b7be02dfb897 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -790,7 +790,7 @@ static void ceph_aio_complete_req(struct ceph_osd_request 
*req)
if (aio_work) {
INIT_WORK(&aio_work->work, ceph_aio_retry_work);
aio_work->req = req;
-   queue_work(ceph_inode_to_client(inode)->wb_wq,
+   queue_work(ceph_inode_to_client(inode)->inode_wq,
   &aio_work->work);
return;
}
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 6eabcdb321cb..d9ff349821f0 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -33,9 +33,7 @@
 
 static const struct inode_operations ceph_symlink_iops;
 
-static void ceph_invalidate_work(struct work_struct *work);
-static void ceph_writeback_work(struct work_struct *work);
-static void ceph_vmtruncate_work(struct work_struct *work);
+static void ceph_inode_work(struct work_struct *work);
 
 /*
  * find or create an inode, given the ceph ino number
@@ -509,10 +507,8 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
INIT_LIST_HEAD(&ci->i_snap_realm_item);
INIT_LIST_HEAD(&ci->i_snap_flush_item);
 
-   INIT_WORK(&ci->i_wb_work, ceph_writeback_work);
-   INIT_WORK(&ci->i_pg_inv_work, ceph_invalidate_work);
-
-   INIT_WORK(&ci->i_vmtruncate_work, ceph_vmtruncate_work);
+   INIT_WORK(&ci->i_work, ceph_inode_work);
+   ci->i_work_mask = 0;
 
ceph_fscache_inode_init(ci);
 
@@ -1750,51 +1746,62 @@ bool ceph_inode_set_size(struct inode *inode, loff_t 
size)
  */
 void ceph_queue_writeback(struct inode *inode)
 {
+   struct ceph_inode_info *ci = ceph_inode(inode);
+   set_bit(CEPH_I_WORK_WRITEBACK, &ci->i_work_mask);
+
ihold(inode);
-   if (queue_work(ceph_inode_to_client(inode)->wb_wq,
-  &ceph_inode(inode)->i_wb_work)) {
+   if (queue_work(ceph_inode_to_client(inode)->inode_wq,
+  &ci->i_work)) {
dout("ceph_queue_writeback %p\n", inode);
} else {
-   dout("ceph_queue_writeback %p failed\n", inode);
+   dout("ceph_queue_writeback %p already queued, mask=%lx\n",
+inode, ci->i_work_mask);
iput(inode);
}
 }
 
-static void ceph_writeback_work(struct work_struct *work)
-{
-   struct ceph_inode_info *ci = container_of(work, struct ceph_inode_info,
- i_wb_work);
-   struct inode *inode = &ci->vfs_inode;
-
-   dout("writeback %p\n", inode);
-   filemap_fdatawrite(&inode->i_data);
-   iput(inode);
-}
-
 /*
  * queue an async invalidation
  */
 void ceph_queue_invalidate(struct inode *inode)
 {
+   struct ceph_inode_info *ci = ceph_inode(inode);
+   set_bit(CEPH_I_WORK_INVALIDATE_PAGES, &ci->i_work_mask);
+
ihold(inode);
-   if (queue_work(ceph_inode_to_client(inode)->pg_inv_wq,
-  &ceph_inode(inode)->i_pg_inv_work)) {
+   if (queue_work(ceph_inode_to_client(inode)->inode_wq,
+  &ceph_inode(inode)->i_work)) {
dout("ceph_queue_invalidate %p\n", inode);
} else {
-   dout("ceph_queue_invalidate %p failed\n", inode);
+   dout("ceph_queue_invalidate %p already queued, mask=%lx\n",
+inode, ci->i_work_mask);
iput(inode);
}
 }
 
 /*
- * Invalidate inode pages in a worker thread.  (This can't be done
- * in the message handler context.)
+ * Queue an async vmtruncate.  If we fail to queue work, we will handle
+ * the truncation the next time we call __ceph_do_pending_vmtruncate.
  */
-static void ceph_invalidate_work(struct work_struct *work)
+void ceph_queue_vmtruncate(struct inode *inode)
 {
-   struct ceph_inode_info *ci = container_of(work, struct ceph_inode_info,
- i_pg_inv_work);
-   struct inode *inode = &ci->vfs_inode;
+   struct ceph_inode_info *ci = ceph_inode(inode);
+   set_bit(CEPH_I_WORK_VMTRUNCATE, &ci->i_work_mask);
+
+   ihold(inode);
+   if (queue_work(ceph_in

[PATCH 3/8] ceph: avoid iput_final() while holding mutex or in dispatch thread

2019-05-23 Thread Yan, Zheng
iput_final() may wait for readahead pages. The wait can cause a deadlock.
For example:

Workqueue: ceph-msgr ceph_con_workfn [libceph]
  Call Trace:
   schedule+0x36/0x80
   io_schedule+0x16/0x40
   __lock_page+0x101/0x140
   truncate_inode_pages_range+0x556/0x9f0
   truncate_inode_pages_final+0x4d/0x60
   evict+0x182/0x1a0
   iput+0x1d2/0x220
   iterate_session_caps+0x82/0x230 [ceph]
   dispatch+0x678/0xa80 [ceph]
   ceph_con_workfn+0x95b/0x1560 [libceph]
   process_one_work+0x14d/0x410
   worker_thread+0x4b/0x460
   kthread+0x105/0x140
   ret_from_fork+0x22/0x40

Workqueue: ceph-msgr ceph_con_workfn [libceph]
  Call Trace:
   __schedule+0x3d6/0x8b0
   schedule+0x36/0x80
   schedule_preempt_disabled+0xe/0x10
   mutex_lock+0x2f/0x40
   ceph_check_caps+0x505/0xa80 [ceph]
   ceph_put_wrbuffer_cap_refs+0x1e5/0x2c0 [ceph]
   writepages_finish+0x2d3/0x410 [ceph]
   __complete_request+0x26/0x60 [libceph]
   handle_reply+0x6c8/0xa10 [libceph]
   dispatch+0x29a/0xbb0 [libceph]
   ceph_con_workfn+0x95b/0x1560 [libceph]
   process_one_work+0x14d/0x410
   worker_thread+0x4b/0x460
   kthread+0x105/0x140
   ret_from_fork+0x22/0x40

In the above example, truncate_inode_pages_range() waits for readahead pages
while holding s_mutex. ceph_check_caps() waits for s_mutex and blocks the
OSD dispatch thread. Later OSD replies (for the readahead) can't be handled.

ceph_check_caps() may also lock snap_rwsem for read. So a similar deadlock
can happen if iput_final() is called while holding snap_rwsem.

In general, it's not good to call iput_final() inside MDS/OSD dispatch
threads or while holding any mutex.

The fix is to introduce ceph_async_iput(), which calls iput_final() from a
workqueue.
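
The helper itself is outside the hunks quoted below, but the idea is roughly
the following (a sketch of the intended behaviour, not necessarily the exact
final code):

    void ceph_async_iput(struct inode *inode)
    {
            if (!inode)
                    return;
            for (;;) {
                    /* not the last reference: just drop it */
                    if (atomic_add_unless(&inode->i_count, -1, 1))
                            break;
                    /* last reference: let ceph_inode_work() on inode_wq do
                     * the final iput instead of this dispatch thread */
                    if (queue_work(ceph_inode_to_client(inode)->inode_wq,
                                   &ceph_inode(inode)->i_work))
                            break;
                    /* work already queued; it holds a reference, so retry */
            }
    }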

Signed-off-by: "Yan, Zheng" 
---
 fs/ceph/caps.c   | 12 
 fs/ceph/inode.c  | 31 +++
 fs/ceph/mds_client.c | 28 ++--
 fs/ceph/quota.c  |  9 ++---
 fs/ceph/snap.c   | 16 +++-
 fs/ceph/super.h  |  2 +-
 6 files changed, 71 insertions(+), 27 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 079d0df9650c..0176241eaea7 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2992,8 +2992,10 @@ void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info 
*ci, int nr,
}
if (complete_capsnap)
wake_up_all(&ci->i_cap_wq);
-   while (put-- > 0)
-   iput(inode);
+   while (put-- > 0) {
+   /* avoid calling iput_final() in osd dispatch threads */
+   ceph_async_iput(inode);
+   }
 }
 
 /*
@@ -3964,8 +3966,9 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 done:
mutex_unlock(&session->s_mutex);
 done_unlocked:
-   iput(inode);
ceph_put_string(extra_info.pool_ns);
+   /* avoid calling iput_final() in mds dispatch threads */
+   ceph_async_iput(inode);
return;
 
 flush_cap_releases:
@@ -4011,7 +4014,8 @@ void ceph_check_delayed_caps(struct ceph_mds_client *mdsc)
if (inode) {
dout("check_delayed_caps on %p\n", inode);
ceph_check_caps(ci, flags, NULL);
-   iput(inode);
+   /* avoid calling iput_final() in tick thread */
+   ceph_async_iput(inode);
}
}
spin_unlock(&mdsc->cap_delay_lock);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index d9ff349821f0..8cfece240ffe 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1480,7 +1480,8 @@ static int readdir_prepopulate_inodes_only(struct 
ceph_mds_request *req,
pr_err("fill_inode badness on %p got %d\n", in, rc);
err = rc;
}
-   iput(in);
+   /* avoid calling iput_final() in mds dispatch threads */
+   ceph_async_iput(in);
}
 
return err;
@@ -1678,8 +1679,11 @@ int ceph_readdir_prepopulate(struct ceph_mds_request 
*req,
 &req->r_caps_reservation);
if (ret < 0) {
pr_err("fill_inode badness on %p\n", in);
-   if (d_really_is_negative(dn))
-   iput(in);
+   if (d_really_is_negative(dn)) {
+   /* avoid calling iput_final() in mds
+* dispatch threads */
+   ceph_async_iput(in);
+   }
d_drop(dn);
err = ret;
goto next_item;
@@ -1689,7 +1693,7 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
if (ceph_security_xattr_deadlock(in)) {
dout(" skip splicing dn %p to inode %p"
 " (security xattr deadlock)\n&q

[PATCH 6/8] ceph: use READ_ONCE to access d_parent in RCU critical section

2019-05-23 Thread Yan, Zheng
Signed-off-by: "Yan, Zheng" 
---
 fs/ceph/mds_client.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 60e8ddbdfdc5..870754e9d572 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -913,7 +913,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc,
struct inode *dir;
 
rcu_read_lock();
-   parent = req->r_dentry->d_parent;
+   parent = READ_ONCE(req->r_dentry->d_parent);
dir = req->r_parent ? : d_inode_rcu(parent);
 
if (!dir || dir->i_sb != mdsc->fsc->sb) {
@@ -2131,8 +2131,8 @@ char *ceph_mdsc_build_path(struct dentry *dentry, int 
*plen, u64 *pbase,
if (inode && ceph_snap(inode) == CEPH_SNAPDIR) {
dout("build_path path+%d: %p SNAPDIR\n",
 pos, temp);
-   } else if (stop_on_nosnap && inode && dentry != temp &&
-  ceph_snap(inode) == CEPH_NOSNAP) {
+   } else if (stop_on_nosnap && dentry != temp &&
+  inode && ceph_snap(inode) == CEPH_NOSNAP) {
spin_unlock(&temp->d_lock);
pos++; /* get rid of any prepended '/' */
break;
@@ -2145,7 +2145,7 @@ char *ceph_mdsc_build_path(struct dentry *dentry, int 
*plen, u64 *pbase,
memcpy(path + pos, temp->d_name.name, temp->d_name.len);
}
spin_unlock(&temp->d_lock);
-   temp = temp->d_parent;
+   temp = READ_ONCE(temp->d_parent);
 
/* Are we at the root? */
if (IS_ROOT(temp))
-- 
2.17.2



[PATCH 8/8] ceph: hold i_ceph_lock when removing caps for freeing inode

2019-05-23 Thread Yan, Zheng
ceph_d_revalidate(, LOOKUP_RCU) may call __ceph_caps_issued_mask()
on a freeing inode.

Cc: sta...@vger.kernel.org
Signed-off-by: "Yan, Zheng" 
---
 fs/ceph/caps.c  | 10 ++
 fs/ceph/inode.c |  2 +-
 fs/ceph/super.h |  2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 0176241eaea7..7754d7679122 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1263,20 +1263,22 @@ static int send_cap_msg(struct cap_msg_args *arg)
 }
 
 /*
- * Queue cap releases when an inode is dropped from our cache.  Since
- * inode is about to be destroyed, there is no need for i_ceph_lock.
+ * Queue cap releases when an inode is dropped from our cache.
  */
-void __ceph_remove_caps(struct inode *inode)
+void __ceph_remove_caps(struct ceph_inode_info *ci)
 {
-   struct ceph_inode_info *ci = ceph_inode(inode);
struct rb_node *p;
 
+   /* lock i_ceph_lock, because ceph_d_revalidate(..., LOOKUP_RCU)
+* may call __ceph_caps_issued_mask() on a freeing inode. */
+   spin_lock(&ci->i_ceph_lock);
p = rb_first(&ci->i_caps);
while (p) {
struct ceph_cap *cap = rb_entry(p, struct ceph_cap, ci_node);
p = rb_next(p);
__ceph_remove_cap(cap, true);
}
+   spin_unlock(&ci->i_ceph_lock);
 }
 
 /*
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index e47a25495be5..30d0cdc21035 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -534,7 +534,7 @@ void ceph_destroy_inode(struct inode *inode)
 
ceph_fscache_unregister_inode_cookie(ci);
 
-   __ceph_remove_caps(inode);
+   __ceph_remove_caps(ci);
 
if (__ceph_has_any_quota(ci))
ceph_adjust_quota_realms_count(inode, false);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 11aeb540b0cf..e74867743e07 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1003,7 +1003,7 @@ extern void ceph_add_cap(struct inode *inode,
 unsigned cap, unsigned seq, u64 realmino, int flags,
 struct ceph_cap **new_cap);
 extern void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release);
-extern void __ceph_remove_caps(struct inode* inode);
+extern void __ceph_remove_caps(struct ceph_inode_info *ci);
 extern void ceph_put_cap(struct ceph_mds_client *mdsc,
 struct ceph_cap *cap);
 extern int ceph_is_any_caps(struct inode *inode);
-- 
2.17.2



[PATCH 4/8] ceph: close race between d_name_cmp() and update_dentry_lease()

2019-05-23 Thread Yan, Zheng
d_name_cmp() and update_dentry_lease() lock and unlock dentry->d_lock
respectively. The dentry may get renamed between them. The fix is to move
the dentry name comparison into update_dentry_lease().

This patch introduces two versions of update_dentry_lease(). One version is
for the case where the parent inode is locked; it does not need to check the
parent/target inode and dentry name. The other version is for the case where
the parent inode is not locked; it checks the parent/target inode and dentry
name after locking dentry->d_lock.

Signed-off-by: "Yan, Zheng" 
---
 fs/ceph/inode.c | 164 ++--
 1 file changed, 88 insertions(+), 76 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 8cfece240ffe..e47a25495be5 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1031,59 +1031,38 @@ static int fill_inode(struct inode *inode, struct page 
*locked_page,
 }
 
 /*
- * caller should hold session s_mutex.
+ * caller should hold session s_mutex and dentry->d_lock.
  */
-static void update_dentry_lease(struct dentry *dentry,
-   struct ceph_mds_reply_lease *lease,
-   struct ceph_mds_session *session,
-   unsigned long from_time,
-   struct ceph_vino *tgt_vino,
-   struct ceph_vino *dir_vino)
+static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
+ struct ceph_mds_reply_lease *lease,
+ struct ceph_mds_session *session,
+ unsigned long from_time,
+ struct ceph_mds_session **old_lease_session)
 {
struct ceph_dentry_info *di = ceph_dentry(dentry);
long unsigned duration = le32_to_cpu(lease->duration_ms);
long unsigned ttl = from_time + (duration * HZ) / 1000;
long unsigned half_ttl = from_time + (duration * HZ / 2) / 1000;
-   struct inode *dir;
-   struct ceph_mds_session *old_lease_session = NULL;
-
-   /*
-* Make sure dentry's inode matches tgt_vino. NULL tgt_vino means that
-* we expect a negative dentry.
-*/
-   if (!tgt_vino && d_really_is_positive(dentry))
-   return;
-
-   if (tgt_vino && (d_really_is_negative(dentry) ||
-   !ceph_ino_compare(d_inode(dentry), tgt_vino)))
-   return;
 
-   spin_lock(&dentry->d_lock);
dout("update_dentry_lease %p duration %lu ms ttl %lu\n",
 dentry, duration, ttl);
 
-   dir = d_inode(dentry->d_parent);
-
-   /* make sure parent matches dir_vino */
-   if (!ceph_ino_compare(dir, dir_vino))
-   goto out_unlock;
-
/* only track leases on regular dentries */
if (ceph_snap(dir) != CEPH_NOSNAP)
-   goto out_unlock;
+   return;
 
di->lease_shared_gen = atomic_read(&ceph_inode(dir)->i_shared_gen);
if (duration == 0) {
__ceph_dentry_dir_lease_touch(di);
-   goto out_unlock;
+   return;
}
 
if (di->lease_gen == session->s_cap_gen &&
time_before(ttl, di->time))
-   goto out_unlock;  /* we already have a newer lease. */
+   return;  /* we already have a newer lease. */
 
if (di->lease_session && di->lease_session != session) {
-   old_lease_session = di->lease_session;
+   *old_lease_session = di->lease_session;
di->lease_session = NULL;
}
 
@@ -1096,6 +1075,62 @@ static void update_dentry_lease(struct dentry *dentry,
di->time = ttl;
 
__ceph_dentry_lease_touch(di);
+}
+
+static inline void update_dentry_lease(struct inode *dir, struct dentry 
*dentry,
+   struct ceph_mds_reply_lease *lease,
+   struct ceph_mds_session *session,
+   unsigned long from_time)
+{
+   struct ceph_mds_session *old_lease_session = NULL;
+   spin_lock(&dentry->d_lock);
+   __update_dentry_lease(dir, dentry, lease, session, from_time,
+ &old_lease_session);
+   spin_unlock(&dentry->d_lock);
+   if (old_lease_session)
+   ceph_put_mds_session(old_lease_session);
+}
+
+/*
+ * update dentry lease without having parent inode locked
+ */
+static void update_dentry_lease_careful(struct dentry *dentry,
+   struct ceph_mds_reply_lease *lease,
+   struct ceph_mds_session *session,
+   unsigned long from_time,
+   char *dname, u32 dname_len,
+   stru

[PATCH 5/8] ceph: fix dir_lease_is_valid()

2019-05-23 Thread Yan, Zheng
It should call __ceph_dentry_dir_lease_touch() under dentry->d_lock.
Besides, ceph_dentry(dentry) can be NULL when called from a LOOKUP_RCU
d_revalidate().

Cc: sta...@vger.kernel.org # v5.1+
Signed-off-by: "Yan, Zheng" 
---
 fs/ceph/dir.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 0637149fb9f9..1271024a3797 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1512,18 +1512,26 @@ static int __dir_lease_try_check(const struct dentry 
*dentry)
 static int dir_lease_is_valid(struct inode *dir, struct dentry *dentry)
 {
struct ceph_inode_info *ci = ceph_inode(dir);
-   struct ceph_dentry_info *di = ceph_dentry(dentry);
-   int valid = 0;
+   int valid;
+   int shared_gen;
 
spin_lock(&ci->i_ceph_lock);
-   if (atomic_read(&ci->i_shared_gen) == di->lease_shared_gen)
-   valid = __ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1);
+   valid = __ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1);
+   shared_gen = atomic_read(&ci->i_shared_gen);
spin_unlock(&ci->i_ceph_lock);
-   if (valid)
-   __ceph_dentry_dir_lease_touch(di);
-   dout("dir_lease_is_valid dir %p v%u dentry %p v%u = %d\n",
-dir, (unsigned)atomic_read(&ci->i_shared_gen),
-dentry, (unsigned)di->lease_shared_gen, valid);
+   if (valid) {
+   struct ceph_dentry_info *di;
+   spin_lock(&dentry->d_lock);
+   di = ceph_dentry(dentry);
+   if (dir == d_inode(dentry->d_parent) &&
+   di && di->lease_shared_gen == shared_gen)
+   __ceph_dentry_dir_lease_touch(di);
+   else
+   valid = 0;
+   spin_unlock(&dentry->d_lock);
+   }
+   dout("dir_lease_is_valid dir %p v%u dentry %p = %d\n",
+dir, (unsigned)atomic_read(&ci->i_shared_gen), dentry, valid);
return valid;
 }
 
-- 
2.17.2



[PATCH 7/8] ceph: ensure d_name/d_parent stability in ceph_mdsc_lease_send_msg()

2019-05-23 Thread Yan, Zheng
Signed-off-by: "Yan, Zheng" 
---
 fs/ceph/dir.c|  7 +++
 fs/ceph/mds_client.c | 24 +---
 fs/ceph/mds_client.h |  1 -
 3 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 1271024a3797..72efad28857c 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1433,8 +1433,7 @@ static bool __dentry_lease_is_valid(struct 
ceph_dentry_info *di)
return false;
 }
 
-static int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags,
-struct inode *dir)
+static int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags)
 {
struct ceph_dentry_info *di;
struct ceph_mds_session *session = NULL;
@@ -1466,7 +1465,7 @@ static int dentry_lease_is_valid(struct dentry *dentry, 
unsigned int flags,
spin_unlock(&dentry->d_lock);
 
if (session) {
-   ceph_mdsc_lease_send_msg(session, dir, dentry,
+   ceph_mdsc_lease_send_msg(session, dentry,
 CEPH_MDS_LEASE_RENEW, seq);
ceph_put_mds_session(session);
}
@@ -1566,7 +1565,7 @@ static int ceph_d_revalidate(struct dentry *dentry, 
unsigned int flags)
   ceph_snap(d_inode(dentry)) == CEPH_SNAPDIR) {
valid = 1;
} else {
-   valid = dentry_lease_is_valid(dentry, flags, dir);
+   valid = dentry_lease_is_valid(dentry, flags);
if (valid == -ECHILD)
return valid;
if (valid || dir_lease_is_valid(dir, dentry)) {
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 870754e9d572..98c500dbec3f 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3941,31 +3941,33 @@ static void handle_lease(struct ceph_mds_client *mdsc,
 }
 
 void ceph_mdsc_lease_send_msg(struct ceph_mds_session *session,
- struct inode *inode,
  struct dentry *dentry, char action,
  u32 seq)
 {
struct ceph_msg *msg;
struct ceph_mds_lease *lease;
-   int len = sizeof(*lease) + sizeof(u32);
-   int dnamelen = 0;
+   struct inode *dir;
+   int len = sizeof(*lease) + sizeof(u32) + NAME_MAX;
 
-   dout("lease_send_msg inode %p dentry %p %s to mds%d\n",
-inode, dentry, ceph_lease_op_name(action), session->s_mds);
-   dnamelen = dentry->d_name.len;
-   len += dnamelen;
+   dout("lease_send_msg identry %p %s to mds%d\n",
+dentry, ceph_lease_op_name(action), session->s_mds);
 
msg = ceph_msg_new(CEPH_MSG_CLIENT_LEASE, len, GFP_NOFS, false);
if (!msg)
return;
lease = msg->front.iov_base;
lease->action = action;
-   lease->ino = cpu_to_le64(ceph_vino(inode).ino);
-   lease->first = lease->last = cpu_to_le64(ceph_vino(inode).snap);
lease->seq = cpu_to_le32(seq);
-   put_unaligned_le32(dnamelen, lease + 1);
-   memcpy((void *)(lease + 1) + 4, dentry->d_name.name, dnamelen);
 
+   spin_lock(&dentry->d_lock);
+   dir = d_inode(dentry->d_parent);
+   lease->ino = cpu_to_le64(ceph_inode(dir)->i_vino.ino);
+   lease->first = lease->last = cpu_to_le64(ceph_inode(dir)->i_vino.snap);
+
+   put_unaligned_le32(dentry->d_name.len, lease + 1);
+   memcpy((void *)(lease + 1) + 4,
+  dentry->d_name.name, dentry->d_name.len);
+   spin_unlock(&dentry->d_lock);
/*
 * if this is a preemptive lease RELEASE, no need to
 * flush request stream, since the actual request will
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 9c28b86abcf4..330769ecb601 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -505,7 +505,6 @@ extern char *ceph_mdsc_build_path(struct dentry *dentry, 
int *plen, u64 *base,
 
 extern void __ceph_mdsc_drop_dentry_lease(struct dentry *dentry);
 extern void ceph_mdsc_lease_send_msg(struct ceph_mds_session *session,
-struct inode *inode,
 struct dentry *dentry, char action,
 u32 seq);
 
-- 
2.17.2



[PATCH 1/8] ceph: fix error handling in ceph_get_caps()

2019-05-23 Thread Yan, Zheng
The function returns 0 even when interrupted or when try_get_cap_refs()
returns an error.

Introduced by commit 1199d7da2d ("ceph: simplify arguments and return
semantics of try_get_cap_refs").

Signed-off-by: "Yan, Zheng" 
---
 fs/ceph/caps.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 72f8e1311392..079d0df9650c 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2738,15 +2738,13 @@ int ceph_get_caps(struct ceph_inode_info *ci, int need, 
int want,
_got = 0;
ret = try_get_cap_refs(ci, need, want, endoff,
   false, &_got);
-   if (ret == -EAGAIN) {
+   if (ret == -EAGAIN)
continue;
-   } else if (!ret) {
-   int err;
-
+   if (!ret) {
DEFINE_WAIT_FUNC(wait, woken_wake_function);
add_wait_queue(&ci->i_cap_wq, &wait);
 
-   while (!(err = try_get_cap_refs(ci, need, want, endoff,
+   while (!(ret = try_get_cap_refs(ci, need, want, endoff,
true, &_got))) {
if (signal_pending(current)) {
ret = -ERESTARTSYS;
@@ -2756,14 +2754,16 @@ int ceph_get_caps(struct ceph_inode_info *ci, int need, 
int want,
}
 
remove_wait_queue(&ci->i_cap_wq, &wait);
-   if (err == -EAGAIN)
+   if (ret == -EAGAIN)
continue;
}
-   if (ret == -ESTALE) {
-   /* session was killed, try renew caps */
-   ret = ceph_renew_caps(&ci->vfs_inode);
-   if (ret == 0)
-   continue;
+   if (ret < 0) {
+   if (ret == -ESTALE) {
+   /* session was killed, try renew caps */
+   ret = ceph_renew_caps(&ci->vfs_inode);
+   if (ret == 0)
+   continue;
+   }
return ret;
}
 
-- 
2.17.2



Re: [PATCH 4/4] ceph: fix improper use of smp_mb__before_atomic()

2019-05-20 Thread Yan, Zheng

On 5/21/19 1:23 AM, Andrea Parri wrote:

This barrier only applies to the read-modify-write operations; in
particular, it does not apply to the atomic64_set() primitive.

Replace the barrier with an smp_mb().

Fixes: fdd4e15838e59 ("ceph: rework dcache readdir")
Cc: sta...@vger.kernel.org
Reported-by: "Paul E. McKenney" 
Reported-by: Peter Zijlstra 
Signed-off-by: Andrea Parri 
Cc: "Yan, Zheng" 
Cc: Sage Weil 
Cc: Ilya Dryomov 
Cc: ceph-de...@vger.kernel.org
Cc: "Paul E. McKenney" 
Cc: Peter Zijlstra 
---
  fs/ceph/super.h | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 6edab9a750f8a..e02f4ff0be3f1 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -541,7 +541,12 @@ static inline void __ceph_dir_set_complete(struct 
ceph_inode_info *ci,
   long long release_count,
   long long ordered_count)
  {
-   smp_mb__before_atomic();
+   /*
+* Makes sure operations that setup readdir cache (update page
+* cache and i_size) are strongly ordered w.r.t. the following
+* atomic64_set() operations.
+*/
+   smp_mb();
atomic64_set(&ci->i_complete_seq[0], release_count);
atomic64_set(&ci->i_complete_seq[1], ordered_count);
  }



Applied, thanks



Re: [PATCH 4/5] ceph: fix improper use of smp_mb__before_atomic()

2019-05-13 Thread Yan, Zheng

On 5/10/19 4:55 AM, Andrea Parri wrote:

On Tue, Apr 30, 2019 at 05:08:43PM +0800, Yan, Zheng wrote:

On Tue, Apr 30, 2019 at 4:26 PM Peter Zijlstra  wrote:


On Mon, Apr 29, 2019 at 10:15:00PM +0200, Andrea Parri wrote:

This barrier only applies to the read-modify-write operations; in
particular, it does not apply to the atomic64_set() primitive.

Replace the barrier with an smp_mb().




@@ -541,7 +541,7 @@ static inline void __ceph_dir_set_complete(struct 
ceph_inode_info *ci,
  long long release_count,
  long long ordered_count)
  {
- smp_mb__before_atomic();


same
 /*
  * XXX: the comment that explain this barrier goes here.
  */



Makes sure operations that set up the readdir cache (updating the page cache
and i_size) are strongly ordered with the following atomic64_set().


Thanks for the suggestion, Yan.

To be clear: would you like me to integrate your comment and resend?
any other suggestions?



Yes, please

Regards
Yan, Zheng


Thanx,
   Andrea





+ smp_mb();



   atomic64_set(&ci->i_complete_seq[0], release_count);
   atomic64_set(&ci->i_complete_seq[1], ordered_count);
  }
--
2.7.4





Re: [PATCH 4/5] ceph: fix improper use of smp_mb__before_atomic()

2019-04-30 Thread Yan, Zheng
On Tue, Apr 30, 2019 at 4:26 PM Peter Zijlstra  wrote:
>
> On Mon, Apr 29, 2019 at 10:15:00PM +0200, Andrea Parri wrote:
> > This barrier only applies to the read-modify-write operations; in
> > particular, it does not apply to the atomic64_set() primitive.
> >
> > Replace the barrier with an smp_mb().
> >
>
> > @@ -541,7 +541,7 @@ static inline void __ceph_dir_set_complete(struct 
> > ceph_inode_info *ci,
> >  long long release_count,
> >  long long ordered_count)
> >  {
> > - smp_mb__before_atomic();
>
> same
> /*
>  * XXX: the comment that explain this barrier goes here.
>  */
>

Makes sure operations that set up the readdir cache (updating the page cache
and i_size) are strongly ordered with the following atomic64_set().
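
The underlying rule is that smp_mb__before_atomic() only orders against a
following read-modify-write atomic op; atomic64_set() is a plain store, so it
needs a full barrier. Illustrative comparison:

    /* ordered: atomic64_add() is a read-modify-write op */
    smp_mb__before_atomic();
    atomic64_add(1, &ci->i_complete_seq[0]);

    /* NOT ordered: atomic64_set() is a plain store, the lightweight
     * barrier gives no guarantee here */
    smp_mb__before_atomic();
    atomic64_set(&ci->i_complete_seq[0], release_count);

    /* a full barrier is needed instead */
    smp_mb();
    atomic64_set(&ci->i_complete_seq[0], release_count);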

> > + smp_mb();
>
> >   atomic64_set(&ci->i_complete_seq[0], release_count);
> >   atomic64_set(&ci->i_complete_seq[1], ordered_count);
> >  }
> > --
> > 2.7.4
> >


Re: [PATCH] ceph: Fix a memory leak in ci->i_head_snapc

2019-04-17 Thread Yan, Zheng
On Tue, Apr 16, 2019 at 9:30 PM Luis Henriques  wrote:
>
> Luis Henriques  writes:
>
> > "Yan, Zheng"  writes:
> >
> >> On Fri, Mar 22, 2019 at 6:04 PM Luis Henriques  wrote:
> >>>
> >>> Luis Henriques  writes:
> >>>
> >>> > "Yan, Zheng"  writes:
> >>> >
> >>> >> On Tue, Mar 19, 2019 at 12:22 AM Luis Henriques  
> >>> >> wrote:
> >>> >>>
> >>> >>> "Yan, Zheng"  writes:
> >>> >>>
> >>> >>> > On Mon, Mar 18, 2019 at 6:33 PM Luis Henriques 
> >>> >>> >  wrote:
> >>> >>> >>
> >>> >>> >> "Yan, Zheng"  writes:
> >>> >>> >>
> >>> >>> >> > On Fri, Mar 15, 2019 at 7:13 PM Luis Henriques 
> >>> >>> >> >  wrote:
> >>> >>> >> >>
> >>> >>> >> >> I'm occasionally seeing a kmemleak warning in xfstest 
> >>> >>> >> >> generic/013:
> >>> >>> >> >>
> >>> >>> >> >> unreferenced object 0x8881fccca940 (size 32):
> >>> >>> >> >>   comm "kworker/0:1", pid 12, jiffies 4295005883 (age 130.648s)
> >>> >>> >> >>   hex dump (first 32 bytes):
> >>> >>> >> >> 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  
> >>> >>> >> >> 
> >>> >>> >> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> >>> >>> >> >> 
> >>> >>> >> >>   backtrace:
> >>> >>> >> >> [<d741a1ea>] build_snap_context+0x5b/0x2a0
> >>> >>> >> >> [<21a00533>] rebuild_snap_realms+0x27/0x90
> >>> >>> >> >> [<ac538600>] rebuild_snap_realms+0x42/0x90
> >>> >>> >> >> [<0e955fac>] ceph_update_snap_trace+0x2ee/0x610
> >>> >>> >> >> [<a9550416>] ceph_handle_snap+0x317/0x5f3
> >>> >>> >> >> [<fc287b83>] dispatch+0x362/0x176c
> >>> >>> >> >> [<a312c741>] ceph_con_workfn+0x9ce/0x2cf0
> >>> >>> >> >> [<4168e3a9>] process_one_work+0x1d4/0x400
> >>> >>> >> >> [<2188e9e7>] worker_thread+0x2d/0x3c0
> >>> >>> >> >> [<b593e4b3>] kthread+0x112/0x130
> >>> >>> >> >> [<a8587dca>] ret_from_fork+0x35/0x40
> >>> >>> >> >> [<ba1c9c1d>] 0x
> >>> >>> >> >>
> >>> >>> >> >> It looks like it is possible that we miss a flush_ack from the 
> >>> >>> >> >> MDS when,
> >>> >>> >> >> for example, umounting the filesystem.  In that case, we can 
> >>> >>> >> >> simply drop
> >>> >>> >> >> the reference to the ceph_snap_context obtained in 
> >>> >>> >> >> ceph_queue_cap_snap().
> >>> >>> >> >>
> >>> >>> >> >> Link: https://tracker.ceph.com/issues/38224
> >>> >>> >> >> Cc: sta...@vger.kernel.org
> >>> >>> >> >> Signed-off-by: Luis Henriques 
> >>> >>> >> >> ---
> >>> >>> >> >>  fs/ceph/caps.c | 7 +++
> >>> >>> >> >>  1 file changed, 7 insertions(+)
> >>> >>> >> >>
> >>> >>> >> >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> >>> >>> >> >> index 36a8dc699448..208f4dc6f574 100644
> >>> >>> >> >> --- a/fs/ceph/caps.c
> >>> >>> >> >> +++ b/fs/ceph/caps.c
> >>> >>> >> >> @@ -1054,6 +1054,7 @@ int ceph_is_any_caps(struct inode *inode)
> >>> >>> >> >>  static void drop_inode_snap_realm(struct ceph_inode_info *ci)
> >>> 

Re: [PATCH] ceph: Fix a memory leak in ci->i_head_snapc

2019-04-02 Thread Yan, Zheng
On Fri, Mar 22, 2019 at 6:04 PM Luis Henriques  wrote:
>
> Luis Henriques  writes:
>
> > "Yan, Zheng"  writes:
> >
> >> On Tue, Mar 19, 2019 at 12:22 AM Luis Henriques  
> >> wrote:
> >>>
> >>> "Yan, Zheng"  writes:
> >>>
> >>> > On Mon, Mar 18, 2019 at 6:33 PM Luis Henriques  
> >>> > wrote:
> >>> >>
> >>> >> "Yan, Zheng"  writes:
> >>> >>
> >>> >> > On Fri, Mar 15, 2019 at 7:13 PM Luis Henriques  
> >>> >> > wrote:
> >>> >> >>
> >>> >> >> I'm occasionally seeing a kmemleak warning in xfstest generic/013:
> >>> >> >>
> >>> >> >> unreferenced object 0x8881fccca940 (size 32):
> >>> >> >>   comm "kworker/0:1", pid 12, jiffies 4295005883 (age 130.648s)
> >>> >> >>   hex dump (first 32 bytes):
> >>> >> >> 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  
> >>> >> >> 
> >>> >> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> >>> >> >> 
> >>> >> >>   backtrace:
> >>> >> >> [<d741a1ea>] build_snap_context+0x5b/0x2a0
> >>> >> >> [<21a00533>] rebuild_snap_realms+0x27/0x90
> >>> >> >> [<ac538600>] rebuild_snap_realms+0x42/0x90
> >>> >> >> [<0e955fac>] ceph_update_snap_trace+0x2ee/0x610
> >>> >> >> [<a9550416>] ceph_handle_snap+0x317/0x5f3
> >>> >> >> [<fc287b83>] dispatch+0x362/0x176c
> >>> >> >> [<a312c741>] ceph_con_workfn+0x9ce/0x2cf0
> >>> >> >> [<4168e3a9>] process_one_work+0x1d4/0x400
> >>> >> >> [<2188e9e7>] worker_thread+0x2d/0x3c0
> >>> >> >> [<b593e4b3>] kthread+0x112/0x130
> >>> >> >> [<a8587dca>] ret_from_fork+0x35/0x40
> >>> >> >> [<ba1c9c1d>] 0x
> >>> >> >>
> >>> >> >> It looks like it is possible that we miss a flush_ack from the MDS 
> >>> >> >> when,
> >>> >> >> for example, umounting the filesystem.  In that case, we can simply 
> >>> >> >> drop
> >>> >> >> the reference to the ceph_snap_context obtained in 
> >>> >> >> ceph_queue_cap_snap().
> >>> >> >>
> >>> >> >> Link: https://tracker.ceph.com/issues/38224
> >>> >> >> Cc: sta...@vger.kernel.org
> >>> >> >> Signed-off-by: Luis Henriques 
> >>> >> >> ---
> >>> >> >>  fs/ceph/caps.c | 7 +++
> >>> >> >>  1 file changed, 7 insertions(+)
> >>> >> >>
> >>> >> >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> >>> >> >> index 36a8dc699448..208f4dc6f574 100644
> >>> >> >> --- a/fs/ceph/caps.c
> >>> >> >> +++ b/fs/ceph/caps.c
> >>> >> >> @@ -1054,6 +1054,7 @@ int ceph_is_any_caps(struct inode *inode)
> >>> >> >>  static void drop_inode_snap_realm(struct ceph_inode_info *ci)
> >>> >> >>  {
> >>> >> >> struct ceph_snap_realm *realm = ci->i_snap_realm;
> >>> >> >> +
> >>> >> >> spin_lock(&realm->inodes_with_caps_lock);
> >>> >> >> list_del_init(&ci->i_snap_realm_item);
> >>> >> >> ci->i_snap_realm_counter++;
> >>> >> >> @@ -1063,6 +1064,12 @@ static void drop_inode_snap_realm(struct 
> >>> >> >> ceph_inode_info *ci)
> >>> >> >> spin_unlock(&realm->inodes_with_caps_lock);
> >>> >> >> 
> >>> >> >> ceph_put_snap_realm(ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc,
> >>> >> >> realm);
> >>> >> >> +   /*
> >>> >> >> +* ci->i_head_snapc should be NULL, bu

Re: [RFC PATCH] ceph: Convert to fs_context

2019-03-27 Thread Yan, Zheng
On Wed, Mar 20, 2019 at 10:53 PM David Howells  wrote:
>
> Signed-off-by: David Howells 
> cc: Ilya Dryomov 
> cc: "Yan, Zheng" 
> cc: Sage Weil 
> cc: ceph-de...@vger.kernel.org
> ---
>
>  drivers/block/rbd.c |  362 +++-
>  fs/ceph/cache.c |9 -
>  fs/ceph/cache.h |2
>  fs/ceph/super.c |  697 
> +++
>  fs/ceph/super.h |1
>  fs/fs_context.c |2
>  fs/fs_parser.c  |2
>  include/linux/ceph/ceph_debug.h |1
>  include/linux/ceph/libceph.h|   17 +
>  net/ceph/ceph_common.c  |  410 ++-
>  10 files changed, 726 insertions(+), 777 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 4ba967d65cf9..489f6c2322a6 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -34,7 +34,7 @@
>  #include 
>  #include 
>  #include 
> -#include 
> +#include 
>  #include 
>
>  #include 
> @@ -747,40 +747,6 @@ static struct rbd_client *rbd_client_find(struct 
> ceph_options *ceph_opts)
>  /*
>   * (Per device) rbd map options
>   */
> -enum {
> -   Opt_queue_depth,
> -   Opt_alloc_size,
> -   Opt_lock_timeout,
> -   Opt_last_int,
> -   /* int args above */
> -   Opt_pool_ns,
> -   Opt_last_string,
> -   /* string args above */
> -   Opt_read_only,
> -   Opt_read_write,
> -   Opt_lock_on_read,
> -   Opt_exclusive,
> -   Opt_notrim,
> -   Opt_err
> -};
> -
> -static match_table_t rbd_opts_tokens = {
> -   {Opt_queue_depth, "queue_depth=%d"},
> -   {Opt_alloc_size, "alloc_size=%d"},
> -   {Opt_lock_timeout, "lock_timeout=%d"},
> -   /* int args above */
> -   {Opt_pool_ns, "_pool_ns=%s"},
> -   /* string args above */
> -   {Opt_read_only, "read_only"},
> -   {Opt_read_only, "ro"},  /* Alternate spelling */
> -   {Opt_read_write, "read_write"},
> -   {Opt_read_write, "rw"}, /* Alternate spelling */
> -   {Opt_lock_on_read, "lock_on_read"},
> -   {Opt_exclusive, "exclusive"},
> -   {Opt_notrim, "notrim"},
> -   {Opt_err, NULL}
> -};
> -
>  struct rbd_options {
> int queue_depth;
> int alloc_size;
> @@ -799,85 +765,98 @@ struct rbd_options {
>  #define RBD_EXCLUSIVE_DEFAULT  false
>  #define RBD_TRIM_DEFAULT   true
>
> -struct parse_rbd_opts_ctx {
> -   struct rbd_spec *spec;
> -   struct rbd_options  *opts;
> +enum {
> +   Opt_alloc_size,
> +   Opt_exclusive,
> +   Opt_lock_on_read,
> +   Opt_lock_timeout,
> +   Opt_notrim,
> +   Opt_pool_ns,
> +   Opt_queue_depth,
> +   Opt_read_only,
> +   Opt_read_write,
> +};
> +
> +static const struct fs_parameter_spec rbd_param_specs[] = {
> +   fsparam_u32 ("alloc_size",  Opt_alloc_size),
> +   fsparam_flag("exclusive",   Opt_exclusive),
> +   fsparam_flag("lock_on_read",Opt_lock_on_read),
> +   fsparam_u32 ("lock_timeout",Opt_lock_timeout),
> +   fsparam_flag("notrim",  Opt_notrim),
> +   fsparam_string  ("_pool_ns",Opt_pool_ns),
> +   fsparam_u32 ("queue_depth", Opt_queue_depth),
> +   fsparam_flag("ro",  Opt_read_only),
> +   fsparam_flag("rw",  Opt_read_write),
> +   {}
> +};
> +
> +static const struct fs_parameter_description rbd_parameters = {
> +   .name   = "rbd",
> +   .specs  = rbd_param_specs,
>  };
>
> -static int parse_rbd_opts_token(char *c, void *private)
> +static int rbd_parse_param(struct ceph_config_context *ctx, struct 
> fs_parameter *param)
>  {
> -   struct parse_rbd_opts_ctx *pctx = private;
> -   substring_t argstr[MAX_OPT_ARGS];
> -   int token, intval, ret;
> +   struct rbd_options *opts = ctx->rbd_opts;
> +   struct rbd_spec *spec = ctx->rbd_spec;
> +   struct fs_parse_result result;
> +   int ret, opt;
>
> -   token = match_token(c, rbd_opts_tokens, argstr);
> -   if (token < Opt_last_int) {
> -   ret = match_int(&argstr[0], &intval);
> -   if (ret < 0) {
> - 

Re: [RFC PATCH] ceph: Convert to fs_context

2019-03-25 Thread Yan, Zheng
On Wed, Mar 20, 2019 at 10:53 PM David Howells  wrote:
>
> Signed-off-by: David Howells 
> cc: Ilya Dryomov 
> cc: "Yan, Zheng" 
> cc: Sage Weil 
> cc: ceph-de...@vger.kernel.org
> ---
>
>  drivers/block/rbd.c |  362 +++-
>  fs/ceph/cache.c |9 -
>  fs/ceph/cache.h |2
>  fs/ceph/super.c |  697 
> +++
>  fs/ceph/super.h |1
>  fs/fs_context.c |2
>  fs/fs_parser.c  |2
>  include/linux/ceph/ceph_debug.h |1
>  include/linux/ceph/libceph.h|   17 +
>  net/ceph/ceph_common.c  |  410 ++-
>  10 files changed, 726 insertions(+), 777 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 4ba967d65cf9..489f6c2322a6 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -34,7 +34,7 @@
>  #include 
>  #include 
>  #include 
> -#include 
> +#include 
>  #include 
>
>  #include 
> @@ -747,40 +747,6 @@ static struct rbd_client *rbd_client_find(struct 
> ceph_options *ceph_opts)
>  /*
>   * (Per device) rbd map options
>   */
> -enum {
> -   Opt_queue_depth,
> -   Opt_alloc_size,
> -   Opt_lock_timeout,
> -   Opt_last_int,
> -   /* int args above */
> -   Opt_pool_ns,
> -   Opt_last_string,
> -   /* string args above */
> -   Opt_read_only,
> -   Opt_read_write,
> -   Opt_lock_on_read,
> -   Opt_exclusive,
> -   Opt_notrim,
> -   Opt_err
> -};
> -
> -static match_table_t rbd_opts_tokens = {
> -   {Opt_queue_depth, "queue_depth=%d"},
> -   {Opt_alloc_size, "alloc_size=%d"},
> -   {Opt_lock_timeout, "lock_timeout=%d"},
> -   /* int args above */
> -   {Opt_pool_ns, "_pool_ns=%s"},
> -   /* string args above */
> -   {Opt_read_only, "read_only"},
> -   {Opt_read_only, "ro"},  /* Alternate spelling */
> -   {Opt_read_write, "read_write"},
> -   {Opt_read_write, "rw"}, /* Alternate spelling */
> -   {Opt_lock_on_read, "lock_on_read"},
> -   {Opt_exclusive, "exclusive"},
> -   {Opt_notrim, "notrim"},
> -   {Opt_err, NULL}
> -};
> -
>  struct rbd_options {
> int queue_depth;
> int alloc_size;
> @@ -799,85 +765,98 @@ struct rbd_options {
>  #define RBD_EXCLUSIVE_DEFAULT  false
>  #define RBD_TRIM_DEFAULT   true
>
> -struct parse_rbd_opts_ctx {
> -   struct rbd_spec *spec;
> -   struct rbd_options  *opts;
> +enum {
> +   Opt_alloc_size,
> +   Opt_exclusive,
> +   Opt_lock_on_read,
> +   Opt_lock_timeout,
> +   Opt_notrim,
> +   Opt_pool_ns,
> +   Opt_queue_depth,
> +   Opt_read_only,
> +   Opt_read_write,
> +};
> +
> +static const struct fs_parameter_spec rbd_param_specs[] = {
> +   fsparam_u32 ("alloc_size",  Opt_alloc_size),
> +   fsparam_flag("exclusive",   Opt_exclusive),
> +   fsparam_flag("lock_on_read",Opt_lock_on_read),
> +   fsparam_u32 ("lock_timeout",Opt_lock_timeout),
> +   fsparam_flag("notrim",  Opt_notrim),
> +   fsparam_string  ("_pool_ns",Opt_pool_ns),
> +   fsparam_u32 ("queue_depth", Opt_queue_depth),
> +   fsparam_flag("ro",  Opt_read_only),
> +   fsparam_flag("rw",  Opt_read_write),
> +   {}
> +};
> +
> +static const struct fs_parameter_description rbd_parameters = {
> +   .name   = "rbd",
> +   .specs  = rbd_param_specs,
>  };
>
> -static int parse_rbd_opts_token(char *c, void *private)
> +static int rbd_parse_param(struct ceph_config_context *ctx, struct 
> fs_parameter *param)
>  {
> -   struct parse_rbd_opts_ctx *pctx = private;
> -   substring_t argstr[MAX_OPT_ARGS];
> -   int token, intval, ret;
> +   struct rbd_options *opts = ctx->rbd_opts;
> +   struct rbd_spec *spec = ctx->rbd_spec;
> +   struct fs_parse_result result;
> +   int ret, opt;
>
> -   token = match_token(c, rbd_opts_tokens, argstr);
> -   if (token < Opt_last_int) {
> -   ret = match_int(&argstr[0], &intval);
> -   if (ret < 0) {
> - 

Re: [PATCH] ceph: Fix a memory leak in ci->i_head_snapc

2019-03-18 Thread Yan, Zheng
On Tue, Mar 19, 2019 at 12:22 AM Luis Henriques  wrote:
>
> "Yan, Zheng"  writes:
>
> > On Mon, Mar 18, 2019 at 6:33 PM Luis Henriques  wrote:
> >>
> >> "Yan, Zheng"  writes:
> >>
> >> > On Fri, Mar 15, 2019 at 7:13 PM Luis Henriques  
> >> > wrote:
> >> >>
> >> >> I'm occasionally seeing a kmemleak warning in xfstest generic/013:
> >> >>
> >> >> unreferenced object 0x8881fccca940 (size 32):
> >> >>   comm "kworker/0:1", pid 12, jiffies 4295005883 (age 130.648s)
> >> >>   hex dump (first 32 bytes):
> >> >> 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  
> >> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> >> >>   backtrace:
> >> >> [<d741a1ea>] build_snap_context+0x5b/0x2a0
> >> >> [<21a00533>] rebuild_snap_realms+0x27/0x90
> >> >> [<ac538600>] rebuild_snap_realms+0x42/0x90
> >> >> [<0e955fac>] ceph_update_snap_trace+0x2ee/0x610
> >> >> [<a9550416>] ceph_handle_snap+0x317/0x5f3
> >> >> [<fc287b83>] dispatch+0x362/0x176c
> >> >> [<a312c741>] ceph_con_workfn+0x9ce/0x2cf0
> >> >> [<4168e3a9>] process_one_work+0x1d4/0x400
> >> >> [<2188e9e7>] worker_thread+0x2d/0x3c0
> >> >> [<b593e4b3>] kthread+0x112/0x130
> >> >> [<a8587dca>] ret_from_fork+0x35/0x40
> >> >> [<ba1c9c1d>] 0x
> >> >>
> >> >> It looks like it is possible that we miss a flush_ack from the MDS when,
> >> >> for example, umounting the filesystem.  In that case, we can simply drop
> >> >> the reference to the ceph_snap_context obtained in 
> >> >> ceph_queue_cap_snap().
> >> >>
> >> >> Link: https://tracker.ceph.com/issues/38224
> >> >> Cc: sta...@vger.kernel.org
> >> >> Signed-off-by: Luis Henriques 
> >> >> ---
> >> >>  fs/ceph/caps.c | 7 +++
> >> >>  1 file changed, 7 insertions(+)
> >> >>
> >> >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> >> >> index 36a8dc699448..208f4dc6f574 100644
> >> >> --- a/fs/ceph/caps.c
> >> >> +++ b/fs/ceph/caps.c
> >> >> @@ -1054,6 +1054,7 @@ int ceph_is_any_caps(struct inode *inode)
> >> >>  static void drop_inode_snap_realm(struct ceph_inode_info *ci)
> >> >>  {
> >> >> struct ceph_snap_realm *realm = ci->i_snap_realm;
> >> >> +
> >> >> spin_lock(&realm->inodes_with_caps_lock);
> >> >> list_del_init(&ci->i_snap_realm_item);
> >> >> ci->i_snap_realm_counter++;
> >> >> @@ -1063,6 +1064,12 @@ static void drop_inode_snap_realm(struct 
> >> >> ceph_inode_info *ci)
> >> >> spin_unlock(&realm->inodes_with_caps_lock);
> >> >> ceph_put_snap_realm(ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc,
> >> >> realm);
> >> >> +   /*
> >> >> +* ci->i_head_snapc should be NULL, but we may still be waiting 
> >> >> for a
> >> >> +* flush_ack from the MDS.  In that case, we still hold a ref 
> >> >> for the
> >> >> +* ceph_snap_context and we need to drop it.
> >> >> +*/
> >> >> +   ceph_put_snap_context(ci->i_head_snapc);
> >> >>  }
> >> >>
> >> >>  /*
> >> >
> >> > This does not seem right.  i_head_snapc is cleared when
> >> > (ci->i_wrbuffer_ref_head == 0 && ci->i_dirty_caps == 0 &&
> >> > ci->i_flushing_caps == 0) . Nothing do with dropping ci->i_snap_realm.
> >> > Did you see 'reconnect denied' during the test? If you did, the fix
> >> > should be in iterate_session_caps()
> >> >
> >>
> >> No, I didn't saw any 'reconnect denied' in the test.  The test actually
> >> seems to execute fine, except from the memory leak.
> >>
> >> It's very difficult to reproduce this issue, but last time I managed to
> >> get this memory leak to trigger I actually had some debugging code in
> >> drop_inode_snap_realm, something like:
> >>
> >>   if (ci->i_head_snapc)
> >> printk("i_head_snapc: 0x%px\n", ci->i_head_snapc);
> >
> > please add code that prints i_wrbuffer_ref_head, i_dirty_caps,
> > i_flushing_caps. and try reproducing it again.
> >
>
> Ok, it took me a few hours, but I managed to reproduce the bug, with
> those extra printks.  All those values are set to 0 when the bug
> triggers (and i_head_snapc != NULL).
>

Thanks. Which test triggers this bug?

I searched that code and found we may fail to clean up i_head_snapc in two
places: one is in ceph_queue_cap_snap(), the other is in
remove_session_caps_cb().

> Cheers,
> --
> Luis
>
>
> >
> >>
> >> This printk was only executed when the bug triggered (during a
> >> filesystem umount) and the address shown was the same as in the kmemleak
> >> warning.
> >>
> >> After spending some time looking, I assumed this to be a missing call to
> >> handle_cap_flush_ack, which would do the i_head_snapc cleanup.
> >>
> >> Cheers,
> >> --
> >> Luis
> >


Re: [PATCH v2 2/2] ceph: quota: fix quota subdir mounts

2019-03-18 Thread Yan, Zheng
On Tue, Mar 12, 2019 at 10:22 PM Luis Henriques  wrote:
>
> The CephFS kernel client does not enforce quotas set in a directory that
> isn't visible from the mount point.  For example, given the path
> '/dir1/dir2', if quotas are set in 'dir1' and the filesystem is mounted with
>
>   mount -t ceph ::/dir1/ /mnt
>
> then the client won't be able to access 'dir1' inode, even if 'dir2' belongs
> to a quota realm that points to it.
>
> This patch fixes this issue by simply doing an MDS LOOKUPINO operation for
> unknown inodes.  Any inode reference obtained this way will be added to a
> list in ceph_mds_client, and will only be released when the filesystem is
> umounted.
>
> Link: https://tracker.ceph.com/issues/38482
> Reported-by: Hendrik Peyerl 
> Signed-off-by: Luis Henriques 
> ---
>  fs/ceph/mds_client.c |  15 ++
>  fs/ceph/mds_client.h |   2 +
>  fs/ceph/quota.c  | 106 +++
>  fs/ceph/super.h  |   2 +
>  4 files changed, 115 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 163fc74bf221..1dc24c3525fe 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -3656,6 +3656,8 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
> mdsc->max_sessions = 0;
> mdsc->stopping = 0;
> atomic64_set(&mdsc->quotarealms_count, 0);
> +   INIT_LIST_HEAD(&mdsc->quotarealms_inodes_list);
> +   spin_lock_init(&mdsc->quotarealms_inodes_lock);
> mdsc->last_snap_seq = 0;
> init_rwsem(&mdsc->snap_rwsem);
> mdsc->snap_realms = RB_ROOT;
> @@ -3726,6 +3728,8 @@ static void wait_requests(struct ceph_mds_client *mdsc)
>   */
>  void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
>  {
> +   struct ceph_inode_info *ci;
> +
> dout("pre_umount\n");
> mdsc->stopping = 1;
>
> @@ -3738,6 +3742,17 @@ void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
>  * their inode/dcache refs
>  */
> ceph_msgr_flush();
> +   /*
> +* It's now safe to clean quotarealms_inode_list without holding
> +* mdsc->quotarealms_inodes_lock
> +*/
> +   while (!list_empty(&mdsc->quotarealms_inodes_list)) {
> +   ci = list_first_entry(&mdsc->quotarealms_inodes_list,
> + struct ceph_inode_info,
> + i_quotarealms_inode_item);
> +   list_del(&ci->i_quotarealms_inode_item);
> +   iput(&ci->vfs_inode);
> +   }
>  }
>
>  /*
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 729da155ebf0..58968fb338ec 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -329,6 +329,8 @@ struct ceph_mds_client {
> int stopping;  /* true if shutting down */
>
> atomic64_t  quotarealms_count; /* # realms with quota */
> +   struct list_headquotarealms_inodes_list;
> +   spinlock_t  quotarealms_inodes_lock;
>
> /*
>  * snap_rwsem will cover cap linkage into snaprealms, and
> diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
> index 9455d3aef0c3..d1ab1b331c0d 100644
> --- a/fs/ceph/quota.c
> +++ b/fs/ceph/quota.c
> @@ -22,7 +22,16 @@ void ceph_adjust_quota_realms_count(struct inode *inode, 
> bool inc)
>  static inline bool ceph_has_realms_with_quotas(struct inode *inode)
>  {
> struct ceph_mds_client *mdsc = ceph_inode_to_client(inode)->mdsc;
> -   return atomic64_read(&mdsc->quotarealms_count) > 0;
> +   struct super_block *sb = mdsc->fsc->sb;
> +
> +   if (atomic64_read(&mdsc->quotarealms_count) > 0)
> +   return true;
> +   /* if root is the real CephFS root, we don't have quota realms */
> +   if (sb->s_root->d_inode &&
> +   (sb->s_root->d_inode->i_ino == CEPH_INO_ROOT))
> +   return false;
> +   /* otherwise, we can't know for sure */
> +   return true;
>  }
>
>  void ceph_handle_quota(struct ceph_mds_client *mdsc,
> @@ -68,6 +77,37 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
> iput(inode);
>  }
>
> +/*
> + * This function will try to lookup a realm inode.  If the inode is found
> + * (through an MDS LOOKUPINO operation), the realm->inode will be updated and
> + * the inode will also be added to an mdsc list which will be freed only when
> + * the filesystem is umounted.
> + */
> +static struct inode *lookup_quotarealm_inode(struct ceph_mds_client *mdsc,
> +struct super_block *sb,
> +struct ceph_snap_realm *realm)
> +{
> +   struct inode *in;
> +
> +   in = ceph_lookup_inode(sb, realm->ino);
> +   if (IS_ERR(in)) {
> +   pr_warn("Can't lookup inode %llx (err: %ld)\n",
> +   realm->ino, PTR_ERR(in));
> +   return in;
> +   }
> +
> +   spin_lock(&mdsc->quotarealms_inodes_lock);

Re: [PATCH v2 2/2] ceph: quota: fix quota subdir mounts

2019-03-18 Thread Yan, Zheng
On Mon, Mar 18, 2019 at 6:55 PM Luis Henriques  wrote:
>
> "Yan, Zheng"  writes:
>
> > On Mon, Mar 18, 2019 at 5:06 PM Gregory Farnum  wrote:
> >>
> >> On Mon, Mar 18, 2019 at 2:32 PM Yan, Zheng  wrote:
> >> > After reading the code carefully. I feel a little uncomfortable with
> >> > the "lookup_ino" in get_quota_realm.  how about populating directories
> >> > above the 'mount subdir' during mounting (similar to cifs_get_root ).
>
> Wouldn't it be a problem if the directory layout (or, in this case, the
> snaprealm layout) change during the mount lifetime?  In that case we
> would need to do this lookup anyway.
>

right

> >>
> >> Isn't that going to be a problem for any clients which have
> >>restricted filesystem access permissions? They may not be able to see
> >>all the directories above their mount point.  -Greg
> >
> > using lookup_ino to get inode above the "mount subdir" has the same problem
> >
>
> In this case I believe we get an -EPERM from the MDS.  And then the
> client simply falls back to the 'default' behaviour, which is to allow
> the user to create/write to files as if there were no quotas set.
>

fair enough

> Cheers,
> --
> Luis


Re: [PATCH] ceph: Fix a memory leak in ci->i_head_snapc

2019-03-18 Thread Yan, Zheng
On Mon, Mar 18, 2019 at 6:33 PM Luis Henriques  wrote:
>
> "Yan, Zheng"  writes:
>
> > On Fri, Mar 15, 2019 at 7:13 PM Luis Henriques  wrote:
> >>
> >> I'm occasionally seeing a kmemleak warning in xfstest generic/013:
> >>
> >> unreferenced object 0x8881fccca940 (size 32):
> >>   comm "kworker/0:1", pid 12, jiffies 4295005883 (age 130.648s)
> >>   hex dump (first 32 bytes):
> >> 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  
> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> >>   backtrace:
> >> [<d741a1ea>] build_snap_context+0x5b/0x2a0
> >> [<21a00533>] rebuild_snap_realms+0x27/0x90
> >> [<ac538600>] rebuild_snap_realms+0x42/0x90
> >> [<0e955fac>] ceph_update_snap_trace+0x2ee/0x610
> >> [<a9550416>] ceph_handle_snap+0x317/0x5f3
> >> [<fc287b83>] dispatch+0x362/0x176c
> >> [<a312c741>] ceph_con_workfn+0x9ce/0x2cf0
> >> [<4168e3a9>] process_one_work+0x1d4/0x400
> >> [<2188e9e7>] worker_thread+0x2d/0x3c0
> >> [<b593e4b3>] kthread+0x112/0x130
> >> [<a8587dca>] ret_from_fork+0x35/0x40
> >> [<ba1c9c1d>] 0x
> >>
> >> It looks like it is possible that we miss a flush_ack from the MDS when,
> >> for example, umounting the filesystem.  In that case, we can simply drop
> >> the reference to the ceph_snap_context obtained in ceph_queue_cap_snap().
> >>
> >> Link: https://tracker.ceph.com/issues/38224
> >> Cc: sta...@vger.kernel.org
> >> Signed-off-by: Luis Henriques 
> >> ---
> >>  fs/ceph/caps.c | 7 +++
> >>  1 file changed, 7 insertions(+)
> >>
> >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> >> index 36a8dc699448..208f4dc6f574 100644
> >> --- a/fs/ceph/caps.c
> >> +++ b/fs/ceph/caps.c
> >> @@ -1054,6 +1054,7 @@ int ceph_is_any_caps(struct inode *inode)
> >>  static void drop_inode_snap_realm(struct ceph_inode_info *ci)
> >>  {
> >> struct ceph_snap_realm *realm = ci->i_snap_realm;
> >> +
> >> spin_lock(&realm->inodes_with_caps_lock);
> >> list_del_init(&ci->i_snap_realm_item);
> >> ci->i_snap_realm_counter++;
> >> @@ -1063,6 +1064,12 @@ static void drop_inode_snap_realm(struct 
> >> ceph_inode_info *ci)
> >> spin_unlock(&realm->inodes_with_caps_lock);
> >> ceph_put_snap_realm(ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc,
> >> realm);
> >> +   /*
> >> +* ci->i_head_snapc should be NULL, but we may still be waiting 
> >> for a
> >> +* flush_ack from the MDS.  In that case, we still hold a ref for 
> >> the
> >> +* ceph_snap_context and we need to drop it.
> >> +*/
> >> +   ceph_put_snap_context(ci->i_head_snapc);
> >>  }
> >>
> >>  /*
> >
> > This does not seem right.  i_head_snapc is cleared when
> > (ci->i_wrbuffer_ref_head == 0 && ci->i_dirty_caps == 0 &&
> > ci->i_flushing_caps == 0) . Nothing do with dropping ci->i_snap_realm.
> > Did you see 'reconnect denied' during the test? If you did, the fix
> > should be in iterate_session_caps()
> >
>
> No, I didn't saw any 'reconnect denied' in the test.  The test actually
> seems to execute fine, except from the memory leak.
>
> It's very difficult to reproduce this issue, but last time I managed to
> get this memory leak to trigger I actually had some debugging code in
> drop_inode_snap_realm, something like:
>
>   if (ci->i_head_snapc)
> printk("i_head_snapc: 0x%px\n", ci->i_head_snapc);

Please add code that prints i_wrbuffer_ref_head, i_dirty_caps and
i_flushing_caps, and try reproducing it again.
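Something along these lines would do (just a sketch; the exact output format
doesn't matter):

    if (ci->i_head_snapc)
        printk("i_head_snapc %p wrbuffer_ref_head %d dirty %s flushing %s\n",
               ci->i_head_snapc, ci->i_wrbuffer_ref_head,
               ceph_cap_string(ci->i_dirty_caps),
               ceph_cap_string(ci->i_flushing_caps));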


>
> This printk was only executed when the bug triggered (during a
> filesystem umount) and the address shown was the same as in the kmemleak
> warning.
>
> After spending some time looking, I assumed this to be a missing call to
> handle_cap_flush_ack, which would do the i_head_snapc cleanup.
>
> Cheers,
> --
> Luis


Re: [PATCH v2 2/2] ceph: quota: fix quota subdir mounts

2019-03-18 Thread Yan, Zheng
On Mon, Mar 18, 2019 at 5:06 PM Gregory Farnum  wrote:
>
> On Mon, Mar 18, 2019 at 2:32 PM Yan, Zheng  wrote:
> > After reading the code carefully. I feel a little uncomfortable with
> > the "lookup_ino" in get_quota_realm.  how about populating directories
> > above the 'mount subdir' during mounting (similar to cifs_get_root ).
>
> Isn't that going to be a problem for any clients which have restricted
> filesystem access permissions? They may not be able to see all the
> directories above their mount point.
> -Greg

Using lookup_ino to get inodes above the "mount subdir" has the same problem.


Re: [PATCH v2 2/2] ceph: quota: fix quota subdir mounts

2019-03-18 Thread Yan, Zheng
_rwsem);
> +   if (old_realm)
> +   ceph_put_snap_realm(mdsc, old_realm);
> +   goto restart;
> +   }
> is_same = (old_realm == new_realm);
> up_read(&mdsc->snap_rwsem);
>
> @@ -166,6 +240,7 @@ static bool check_quota_exceeded(struct inode *inode, 
> enum quota_check_op op,
> return false;
>
> down_read(&mdsc->snap_rwsem);
> +restart:
> realm = ceph_inode(inode)->i_snap_realm;
> if (realm)
> ceph_get_snap_realm(mdsc, realm);
> @@ -173,12 +248,23 @@ static bool check_quota_exceeded(struct inode *inode, 
> enum quota_check_op op,
> pr_err_ratelimited("check_quota_exceeded: ino (%llx.%llx) "
>"null i_snap_realm\n", ceph_vinop(inode));
> while (realm) {
> +   bool has_inode;
> +
> spin_lock(&realm->inodes_with_caps_lock);
> -   in = realm->inode ? igrab(realm->inode) : NULL;
> +   has_inode = realm->inode;
> +   in = has_inode ? igrab(realm->inode) : NULL;
> spin_unlock(&realm->inodes_with_caps_lock);
> -   if (!in)
> +   if (has_inode && !in)
> break;
> -
> +   if (!in) {
> +   up_read(&mdsc->snap_rwsem);
> +   in = lookup_quotarealm_inode(mdsc, inode->i_sb, 
> realm);
> +   down_read(&mdsc->snap_rwsem);
> +   if (IS_ERR(in))
> +   break;
> +   ceph_put_snap_realm(mdsc, realm);
> +       goto restart;
> +   }
> ci = ceph_inode(in);
> spin_lock(&ci->i_ceph_lock);
> if (op == QUOTA_CHECK_MAX_FILES_OP) {
> @@ -314,7 +400,7 @@ bool ceph_quota_update_statfs(struct ceph_fs_client *fsc, 
> struct kstatfs *buf)
> bool is_updated = false;
>
> down_read(&mdsc->snap_rwsem);
> -   realm = get_quota_realm(mdsc, d_inode(fsc->sb->s_root));
> +   realm = get_quota_realm(mdsc, d_inode(fsc->sb->s_root), true);
> up_read(&mdsc->snap_rwsem);
> if (!realm)
> return false;
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index ce51e98b08ec..cc7766aeb73b 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -375,6 +375,8 @@ struct ceph_inode_info {
> struct list_head i_snap_realm_item;
> struct list_head i_snap_flush_item;
>
> +   struct list_head i_quotarealms_inode_item;
> +
> struct work_struct i_wb_work;  /* writeback work */
> struct work_struct i_pg_inv_work;  /* page invalidation work */
>

After reading the code carefully, I feel a little uncomfortable with
the "lookup_ino" in get_quota_realm.  How about populating the directories
above the 'mount subdir' during mounting (similar to cifs_get_root)?
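Very roughly, something in the spirit of the sketch below, so that every
ancestor of the mount subdir gets instantiated at mount time
(ceph_walk_mount_path is a made-up name, the error handling is only
indicative, and I have not tried it):

    static struct dentry *ceph_walk_mount_path(struct super_block *sb,
                                               const char *path)
    {
        struct dentry *parent = dget(sb->s_root);
        char *copy, *s, *p;

        copy = kstrdup(path, GFP_KERNEL);
        if (!copy) {
            dput(parent);
            return ERR_PTR(-ENOMEM);
        }
        for (s = copy; (p = strsep(&s, "/")) != NULL; ) {
            struct dentry *child;

            if (!*p)
                continue;
            /* triggers ->lookup for components not yet in the dcache */
            child = lookup_one_len_unlocked(p, parent, strlen(p));
            dput(parent);
            if (IS_ERR(child)) {
                kfree(copy);
                return child;
            }
            if (d_really_is_negative(child)) {
                dput(child);
                kfree(copy);
                return ERR_PTR(-ENOENT);
            }
            parent = child;
        }
        kfree(copy);
        return parent;    /* caller dput()s this when done */
    }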

Regards
Yan, Zheng


Re: [PATCH] ceph: Fix a memory leak in ci->i_head_snapc

2019-03-17 Thread Yan, Zheng
On Fri, Mar 15, 2019 at 7:13 PM Luis Henriques  wrote:
>
> I'm occasionally seeing a kmemleak warning in xfstest generic/013:
>
> unreferenced object 0x8881fccca940 (size 32):
>   comm "kworker/0:1", pid 12, jiffies 4295005883 (age 130.648s)
>   hex dump (first 32 bytes):
> 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>   backtrace:
> [] build_snap_context+0x5b/0x2a0
> [<21a00533>] rebuild_snap_realms+0x27/0x90
> [] rebuild_snap_realms+0x42/0x90
> [<0e955fac>] ceph_update_snap_trace+0x2ee/0x610
> [] ceph_handle_snap+0x317/0x5f3
> [] dispatch+0x362/0x176c
> [] ceph_con_workfn+0x9ce/0x2cf0
> [<4168e3a9>] process_one_work+0x1d4/0x400
> [<2188e9e7>] worker_thread+0x2d/0x3c0
> [] kthread+0x112/0x130
> [] ret_from_fork+0x35/0x40
> [] 0x
>
> It looks like it is possible that we miss a flush_ack from the MDS when,
> for example, umounting the filesystem.  In that case, we can simply drop
> the reference to the ceph_snap_context obtained in ceph_queue_cap_snap().
>
> Link: https://tracker.ceph.com/issues/38224
> Cc: sta...@vger.kernel.org
> Signed-off-by: Luis Henriques 
> ---
>  fs/ceph/caps.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 36a8dc699448..208f4dc6f574 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -1054,6 +1054,7 @@ int ceph_is_any_caps(struct inode *inode)
>  static void drop_inode_snap_realm(struct ceph_inode_info *ci)
>  {
> struct ceph_snap_realm *realm = ci->i_snap_realm;
> +
> spin_lock(&realm->inodes_with_caps_lock);
> list_del_init(&ci->i_snap_realm_item);
> ci->i_snap_realm_counter++;
> @@ -1063,6 +1064,12 @@ static void drop_inode_snap_realm(struct 
> ceph_inode_info *ci)
> spin_unlock(&realm->inodes_with_caps_lock);
> ceph_put_snap_realm(ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc,
> realm);
> +   /*
> +* ci->i_head_snapc should be NULL, but we may still be waiting for a
> +* flush_ack from the MDS.  In that case, we still hold a ref for the
> +* ceph_snap_context and we need to drop it.
> +*/
> +   ceph_put_snap_context(ci->i_head_snapc);
>  }
>
>  /*

This does not seem right.  i_head_snapc is cleared when
(ci->i_wrbuffer_ref_head == 0 && ci->i_dirty_caps == 0 &&
ci->i_flushing_caps == 0); it has nothing to do with dropping
ci->i_snap_realm.  Did you see 'reconnect denied' during the test?  If you
did, the fix should be in iterate_session_caps().
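For context, the put that normally pairs with that reference looks roughly
like this (simplified sketch of the existing flush-ack path, not new code):

    if (ci->i_wrbuffer_ref_head == 0 &&
        ci->i_dirty_caps == 0 && ci->i_flushing_caps == 0) {
        ceph_put_snap_context(ci->i_head_snapc);
        ci->i_head_snapc = NULL;
    }

i.e. the cleanup is tied to the dirty/flushing accounting, not to the snap
realm.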


Re: [PATCH 2/2] ceph: quota: fix quota subdir mounts

2019-03-10 Thread Yan, Zheng
On Sat, Mar 9, 2019 at 12:30 AM Luis Henriques  wrote:
>
> The CephFS kernel client does not enforce quotas set in a directory that isn't
> visible from the mount point.  For example, given the path '/dir1/dir2', if 
> quotas
> are set in 'dir1' and the filesystem is mounted with
>
>   mount -t ceph ::/dir1/ /mnt
>
> then the client won't be able to access 'dir1' inode, even if 'dir2' belongs 
> to
> a quota realm that points to it.
>
> This patch fixes this issue by simply doing an MDS LOOKUPINO operation for
> unknown inodes.  Any inode reference obtained this way will be added to a list
> in ceph_mds_client, and will only be released when the filesystem is umounted.
>
> Link: https://tracker.ceph.com/issues/38482
> Reported-by: Hendrik Peyerl 
> Signed-off-by: Luis Henriques 
> ---
>  fs/ceph/mds_client.c | 14 +++
>  fs/ceph/mds_client.h |  2 +
>  fs/ceph/quota.c  | 91 +++-
>  fs/ceph/super.h  |  2 +
>  4 files changed, 99 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 163fc74bf221..72c5ce5e4209 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -3656,6 +3656,8 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
> mdsc->max_sessions = 0;
> mdsc->stopping = 0;
> atomic64_set(&mdsc->quotarealms_count, 0);
> +   INIT_LIST_HEAD(&mdsc->quotarealms_inodes_list);
> +   spin_lock_init(&mdsc->quotarealms_inodes_lock);
> mdsc->last_snap_seq = 0;
> init_rwsem(&mdsc->snap_rwsem);
> mdsc->snap_realms = RB_ROOT;
> @@ -3726,9 +3728,21 @@ static void wait_requests(struct ceph_mds_client *mdsc)
>   */
>  void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
>  {
> +   struct ceph_inode_info *ci;
> +
> dout("pre_umount\n");
> mdsc->stopping = 1;
>
> +   spin_lock(&mdsc->quotarealms_inodes_lock);
> +   while(!list_empty(&mdsc->quotarealms_inodes_list)) {
> +   ci = list_first_entry(&mdsc->quotarealms_inodes_list,
> + struct ceph_inode_info,
> + i_quotarealms_inode_item);
> +   list_del(&ci->i_quotarealms_inode_item);
> +   iput(&ci->vfs_inode);

Calling iput() while holding a spinlock is not good; iput() may sleep.
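One way around that (sketch only, using the names from this patch) is to
detach the entries while holding the lock and do the iput()s after dropping
it:

    LIST_HEAD(to_put);
    struct ceph_inode_info *ci, *tmp;

    spin_lock(&mdsc->quotarealms_inodes_lock);
    list_splice_init(&mdsc->quotarealms_inodes_list, &to_put);
    spin_unlock(&mdsc->quotarealms_inodes_lock);

    /* iput() may sleep, so it must not run under the spinlock */
    list_for_each_entry_safe(ci, tmp, &to_put, i_quotarealms_inode_item)
        iput(&ci->vfs_inode);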

> +   }
> +   spin_unlock(&mdsc->quotarealms_inodes_lock);
> +
> lock_unlock_sessions(mdsc);
> ceph_flush_dirty_caps(mdsc);
> wait_requests(mdsc);
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 729da155ebf0..58968fb338ec 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -329,6 +329,8 @@ struct ceph_mds_client {
> int stopping;  /* true if shutting down */
>
> atomic64_t  quotarealms_count; /* # realms with quota */
> +   struct list_headquotarealms_inodes_list;
> +   spinlock_t  quotarealms_inodes_lock;
>
> /*
>  * snap_rwsem will cover cap linkage into snaprealms, and
> diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
> index 9455d3aef0c3..c57c0b709efe 100644
> --- a/fs/ceph/quota.c
> +++ b/fs/ceph/quota.c
> @@ -22,7 +22,16 @@ void ceph_adjust_quota_realms_count(struct inode *inode, 
> bool inc)
>  static inline bool ceph_has_realms_with_quotas(struct inode *inode)
>  {
> struct ceph_mds_client *mdsc = ceph_inode_to_client(inode)->mdsc;
> -   return atomic64_read(&mdsc->quotarealms_count) > 0;
> +   struct super_block *sb = mdsc->fsc->sb;
> +
> +   if (atomic64_read(&mdsc->quotarealms_count) > 0)
> +   return true;
> +   /* if root is the real CephFS root, we don't have quota realms */
> +   if (sb->s_root->d_inode &&
> +   (sb->s_root->d_inode->i_ino == CEPH_INO_ROOT))
> +   return false;
> +   /* otherwise, we can't know for sure */
> +   return true;
>  }
>
>  void ceph_handle_quota(struct ceph_mds_client *mdsc,
> @@ -68,6 +77,37 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
> iput(inode);
>  }
>
> +/*
> + * This function will try to lookup a realm inode.  If the inode is found
> + * (through an MDS LOOKUPINO operation), the realm->inode will be updated and
> + * the inode will also be added to an mdsc list which will be freed only when
> + * the filesystem is umounted.
> + */
> +static struct inode *lookup_quotarealm_inode(struct ceph_mds_client *mdsc,
> +struct super_block *sb,
> +struct ceph_snap_realm *realm)
> +{
> +   struct inode *in;
> +
> +   in = ceph_lookup_inode(sb, realm->ino);
> +   if (IS_ERR(in)) {
> +   pr_warn("Can't lookup inode %llx (err: %ld)\n",
> +   realm->ino, PTR_ERR(in));
> +   return in;
> +   }
> +
> +   spin_lock(&mdsc->quotarealms_inodes_lock);
> +   list_add(&ceph_inode(in)->i_quotarealms_inode_

Re: [RFC PATCH 2/2] ceph: quota: fix quota subdir mounts

2019-03-07 Thread Yan, Zheng
On Thu, Mar 7, 2019 at 7:02 PM Luis Henriques  wrote:
>
> "Yan, Zheng"  writes:
>
> > On Thu, Mar 7, 2019 at 2:21 AM Luis Henriques  wrote:
> >>
> >> "Yan, Zheng"  writes:
> >>
> >> > On Sat, Mar 2, 2019 at 3:13 AM Luis Henriques  
> >> > wrote:
> >> >>
> >> >> The CephFS kernel client doesn't enforce quotas that are set in a
> >> >> directory that isn't visible in the mount point.  For example, given the
> >> >> path '/dir1/dir2', if quotas are set in 'dir1' and the mount is done in 
> >> >> with
> >> >>
> >> >>   mount -t ceph ::/dir1/ /mnt
> >> >>
> >> >> then the client can't access the 'dir1' inode from the quota realm dir2
> >> >> belongs to.
> >> >>
> >> >> This patch fixes this by simply doing an MDS LOOKUPINO Op and grabbing a
> >> >> reference to it (so that it doesn't disappear again).  This also 
> >> >> requires an
> >> >> extra field in ceph_snap_realm so that we know we have to release that
> >> >> reference when destroying the realm.
> >> >>
> >> >
> >> > This may cause circle reference if somehow an inode owned by snaprealm
> >> > get moved into mount subdir (other clients do rename).  how about
> >> > holding these inodes in mds_client?
> >>
> >> Ok, before proceeded any further I wanted to make sure that what you
> >> were suggesting was something like the patch below.  It simply keeps a
> >> list of inodes in ceph_mds_client until the filesystem is umounted,
> >> iput()ing them at that point.
> >>
> > yes,
> >
> >> I'm sure I'm missing another place where the reference should be
> >> dropped, but I couldn't figure it out yet.  It can't be
> >> ceph_destroy_inode; drop_inode_snap_realm is a possibility, but what if
> >> the inode becomes visible in the meantime?  Well, I'll continue thinking
> >> about it.
> >
> > why do you think we need to clean up the references at other place.
> > what problem you encountered.
>
> I'm not really seeing any issue, at least not at the moment.  I believe
> that we could just be holding refs to inodes that may not exist anymore
> in the cluster.  For example, in client 1:
>
>  mkdir -p /mnt/a/b
>  setfattr -n ceph.quota.max_files -v 5 /mnt/a
>
> In client 2 we mount:
>
>  mount :/a/b /mnt
>
> This client will access the realm and inode for 'a' (adding that inode
> to the ceph_mds_client list), because it has quotas.  If client 1 then
> deletes 'a', client 2 will continue to have a reference to that inode in
> that list.  That's why I thought we should be able to clean up that refs
> list in some other place, although that's probably a big deal, since we
> won't be able to a lot with this mount anyway.
>

Agreed, it's not a big deal.

> Cheers,
> --
> Luis


Re: [RFC PATCH 2/2] ceph: quota: fix quota subdir mounts

2019-03-06 Thread Yan, Zheng
On Thu, Mar 7, 2019 at 2:21 AM Luis Henriques  wrote:
>
> "Yan, Zheng"  writes:
>
> > On Sat, Mar 2, 2019 at 3:13 AM Luis Henriques  wrote:
> >>
> >> The CephFS kernel client doesn't enforce quotas that are set in a
> >> directory that isn't visible in the mount point.  For example, given the
> >> path '/dir1/dir2', if quotas are set in 'dir1' and the mount is done in 
> >> with
> >>
> >>   mount -t ceph ::/dir1/ /mnt
> >>
> >> then the client can't access the 'dir1' inode from the quota realm dir2
> >> belongs to.
> >>
> >> This patch fixes this by simply doing an MDS LOOKUPINO Op and grabbing a
> >> reference to it (so that it doesn't disappear again).  This also requires 
> >> an
> >> extra field in ceph_snap_realm so that we know we have to release that
> >> reference when destroying the realm.
> >>
> >
> > This may cause circle reference if somehow an inode owned by snaprealm
> > get moved into mount subdir (other clients do rename).  how about
> > holding these inodes in mds_client?
>
> Ok, before proceeded any further I wanted to make sure that what you
> were suggesting was something like the patch below.  It simply keeps a
> list of inodes in ceph_mds_client until the filesystem is umounted,
> iput()ing them at that point.
>
yes,

> I'm sure I'm missing another place where the reference should be
> dropped, but I couldn't figure it out yet.  It can't be
> ceph_destroy_inode; drop_inode_snap_realm is a possibility, but what if
> the inode becomes visible in the meantime?  Well, I'll continue thinking
> about it.

Why do you think we need to clean up the references at another place?
What problem did you encounter?

Regards
Yan, Zheng
>
> Function get_quota_realm() should have a similar construct to lookup
> inodes.  But it's a bit more tricky, because ceph_quota_is_same_realm()
> requires snap_rwsem to be held for the 2 invocations of
> get_quota_realm.
>
> Cheers,
> --
> Luis
>
> From a429a4c167186781bd235a25d72be893baf9e029 Mon Sep 17 00:00:00 2001
> From: Luis Henriques 
> Date: Wed, 6 Mar 2019 17:58:04 +
> Subject: [PATCH] ceph: quota: fix quota subdir mounts (II)
>
> Signed-off-by: Luis Henriques 
> ---
>  fs/ceph/mds_client.c | 14 ++
>  fs/ceph/mds_client.h |  2 ++
>  fs/ceph/quota.c  | 34 ++
>  fs/ceph/super.h  |  2 ++
>  4 files changed, 48 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 163fc74bf221..72c5ce5e4209 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -3656,6 +3656,8 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
> mdsc->max_sessions = 0;
> mdsc->stopping = 0;
> atomic64_set(&mdsc->quotarealms_count, 0);
> +   INIT_LIST_HEAD(&mdsc->quotarealms_inodes_list);
> +   spin_lock_init(&mdsc->quotarealms_inodes_lock);
> mdsc->last_snap_seq = 0;
> init_rwsem(&mdsc->snap_rwsem);
> mdsc->snap_realms = RB_ROOT;
> @@ -3726,9 +3728,21 @@ static void wait_requests(struct ceph_mds_client *mdsc)
>   */
>  void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
>  {
> +   struct ceph_inode_info *ci;
> +
> dout("pre_umount\n");
> mdsc->stopping = 1;
>
> +   spin_lock(&mdsc->quotarealms_inodes_lock);
> +   while(!list_empty(&mdsc->quotarealms_inodes_list)) {
> +   ci = list_first_entry(&mdsc->quotarealms_inodes_list,
> + struct ceph_inode_info,
> + i_quotarealms_inode_item);
> +   list_del(&ci->i_quotarealms_inode_item);
> +   iput(&ci->vfs_inode);
> +   }
> +   spin_unlock(&mdsc->quotarealms_inodes_lock);
> +
> lock_unlock_sessions(mdsc);
> ceph_flush_dirty_caps(mdsc);
> wait_requests(mdsc);
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 729da155ebf0..58968fb338ec 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -329,6 +329,8 @@ struct ceph_mds_client {
> int stopping;  /* true if shutting down */
>
> atomic64_t  quotarealms_count; /* # realms with quota */
> +   struct list_headquotarealms_inodes_list;
> +   spinlock_t  quotarealms_inodes_lock;
>
> /*
>  * snap_rwsem will cover cap linkage into snaprealms, a

Re: [RFC PATCH 2/2] ceph: quota: fix quota subdir mounts

2019-03-03 Thread Yan, Zheng
On Sat, Mar 2, 2019 at 3:13 AM Luis Henriques  wrote:
>
> The CephFS kernel client doesn't enforce quotas that are set in a
> directory that isn't visible in the mount point.  For example, given the
> path '/dir1/dir2', if quotas are set in 'dir1' and the mount is done in with
>
>   mount -t ceph ::/dir1/ /mnt
>
> then the client can't access the 'dir1' inode from the quota realm dir2
> belongs to.
>
> This patch fixes this by simply doing an MDS LOOKUPINO Op and grabbing a
> reference to it (so that it doesn't disappear again).  This also requires an
> extra field in ceph_snap_realm so that we know we have to release that
> reference when destroying the realm.
>

This may cause a circular reference if an inode owned by a snaprealm somehow
gets moved into the mount subdir (other clients can do a rename).  How about
holding these inodes in mds_client?
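i.e. something like the sketch below (only illustrative; the list/field names
are whatever you end up choosing): take the reference when the realm inode is
looked up, park it on a list in mds_client, and drop everything at umount
time.

    /* after a successful lookup of the realm inode 'in': */
    spin_lock(&mdsc->quotarealms_inodes_lock);
    list_add(&ceph_inode(in)->i_quotarealms_inode_item,
             &mdsc->quotarealms_inodes_list);
    spin_unlock(&mdsc->quotarealms_inodes_lock);

    /* at umount time, once no new lookups can happen: */
    while (!list_empty(&mdsc->quotarealms_inodes_list)) {
        struct ceph_inode_info *ci =
            list_first_entry(&mdsc->quotarealms_inodes_list,
                             struct ceph_inode_info,
                             i_quotarealms_inode_item);
        list_del(&ci->i_quotarealms_inode_item);
        iput(&ci->vfs_inode);
    }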

> Links: https://tracker.ceph.com/issues/3848
> Reported-by: Hendrik Peyerl 
> Signed-off-by: Luis Henriques 
> ---
>  fs/ceph/caps.c  |  2 +-
>  fs/ceph/quota.c | 30 +++---
>  fs/ceph/snap.c  |  3 +++
>  fs/ceph/super.h |  2 ++
>  4 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index bba28a5034ba..e79994ff53d6 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -1035,7 +1035,7 @@ static void drop_inode_snap_realm(struct 
> ceph_inode_info *ci)
> list_del_init(&ci->i_snap_realm_item);
> ci->i_snap_realm_counter++;
> ci->i_snap_realm = NULL;
> -   if (realm->ino == ci->i_vino.ino)
> +   if ((realm->ino == ci->i_vino.ino) && !realm->own_inode)
> realm->inode = NULL;
> spin_unlock(&realm->inodes_with_caps_lock);
> ceph_put_snap_realm(ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc,
> diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
> index 9455d3aef0c3..f6b972d222e4 100644
> --- a/fs/ceph/quota.c
> +++ b/fs/ceph/quota.c
> @@ -22,7 +22,16 @@ void ceph_adjust_quota_realms_count(struct inode *inode, 
> bool inc)
>  static inline bool ceph_has_realms_with_quotas(struct inode *inode)
>  {
> struct ceph_mds_client *mdsc = ceph_inode_to_client(inode)->mdsc;
> -   return atomic64_read(&mdsc->quotarealms_count) > 0;
> +   struct super_block *sb = mdsc->fsc->sb;
> +
> +   if (atomic64_read(&mdsc->quotarealms_count) > 0)
> +   return true;
> +   /* if root is the real CephFS root, we don't have quota realms */
> +   if (sb->s_root->d_inode &&
> +   (sb->s_root->d_inode->i_ino == CEPH_INO_ROOT))
> +   return false;
> +   /* otherwise, we can't know for sure */
> +   return true;
>  }
>
>  void ceph_handle_quota(struct ceph_mds_client *mdsc,
> @@ -166,6 +175,7 @@ static bool check_quota_exceeded(struct inode *inode, 
> enum quota_check_op op,
> return false;
>
> down_read(&mdsc->snap_rwsem);
> +restart:
> realm = ceph_inode(inode)->i_snap_realm;
> if (realm)
> ceph_get_snap_realm(mdsc, realm);
> @@ -176,8 +186,22 @@ static bool check_quota_exceeded(struct inode *inode, 
> enum quota_check_op op,
> spin_lock(&realm->inodes_with_caps_lock);
> in = realm->inode ? igrab(realm->inode) : NULL;
> spin_unlock(&realm->inodes_with_caps_lock);
> -   if (!in)
> -   break;
> +   if (!in) {
> +   up_read(&mdsc->snap_rwsem);
> +   in = ceph_lookup_inode(inode->i_sb, realm->ino);
> +   down_read(&mdsc->snap_rwsem);
> +   if (IS_ERR(in)) {
> +   pr_warn("Can't lookup inode %llx (err: 
> %ld)\n",
> +   realm->ino, PTR_ERR(in));
> +   break;
> +   }
> +   spin_lock(&realm->inodes_with_caps_lock);
> +   realm->inode = in;
> +   realm->own_inode = true;
> +   spin_unlock(&realm->inodes_with_caps_lock);
> +   ceph_put_snap_realm(mdsc, realm);
> +   goto restart;
> +   }
>
> ci = ceph_inode(in);
> spin_lock(&ci->i_ceph_lock);
> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> index f74193da0e09..c84ed8e8526a 100644
> --- a/fs/ceph/snap.c
> +++ b/fs/ceph/snap.c
> @@ -117,6 +117,7 @@ static struct ceph_snap_realm *ceph_create_snap_realm(
>
> atomic_set(&realm->nref, 1);/* for caller */
> realm->ino = ino;
> +   realm->own_inode = false;
> INIT_LIST_HEAD(&realm->children);
> INIT_LIST_HEAD(&realm->child_item);
> INIT_LIST_HEAD(&realm->empty_item);
> @@ -184,6 +185,8 @@ static void __destroy_snap_realm(struct ceph_mds_client 
> *mdsc,
> kfree(realm->prior_parent_snaps);
> kfree(realm->snaps);
> ceph_put_snap_context(realm->cached_context);
> +   

Re: [PATCH] ceph: quota: fix null pointer dereference in quota check

2018-11-06 Thread Yan, Zheng



> On Nov 5, 2018, at 19:00, Luis Henriques  wrote:
> 
> This patch fixes a possible null pointer dereference in
> check_quota_exceeded, detected by the static checker smatch, with the
> following warning:
> 
>fs/ceph/quota.c:240 check_quota_exceeded()
> error: we previously assumed 'realm' could be null (see line 188)
> 
> Reported-by: Dan Carpenter 
> Link: https://lkml.kernel.org/n/20181101065318.2cylxol6d444nzeu@kili.mountain
> Fixes: b7a2921765cf ("ceph: quota: support for ceph.quota.max_files")
> Signed-off-by: Luis Henriques 
> ---
> fs/ceph/quota.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
> index 32d4f13784ba..03f4d24db8fe 100644
> --- a/fs/ceph/quota.c
> +++ b/fs/ceph/quota.c
> @@ -237,7 +237,8 @@ static bool check_quota_exceeded(struct inode *inode, 
> enum quota_check_op op,
>   ceph_put_snap_realm(mdsc, realm);
>   realm = next;
>   }
> - ceph_put_snap_realm(mdsc, realm);
> + if (realm)
> + ceph_put_snap_realm(mdsc, realm);
>   up_read(&mdsc->snap_rwsem);
> 
>   return exceeded;

Applied, thanks.

Yan, Zheng




Re: [PATCH] ceph: add destination file data sync before doing any remote copy

2018-10-23 Thread Yan, Zheng
On Tue, Oct 23, 2018 at 5:08 PM Luis Henriques  wrote:
>
> If we try to copy into a file that was just written, any data that is remote
> copied will be overwritten by our buffered writes once they are flushed.  When
> this happens, the call to invalidate_inode_pages2_range will also return a
> -EBUSY error.
>
> This patch fixes this by also sync'ing the destination file before starting 
> any
> copy.
>
> Signed-off-by: Luis Henriques 
> ---
>  fs/ceph/file.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index f788496fafcc..b4607baa8969 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1932,10 +1932,17 @@ static ssize_t ceph_copy_file_range(struct file 
> *src_file, loff_t src_off,
> if (!prealloc_cf)
> return -ENOMEM;
>
> -   /* Start by sync'ing the source file */
> +   /* Start by sync'ing the source and destination files */
> ret = file_write_and_wait_range(src_file, src_off, (src_off + len));
> -   if (ret < 0)
> +   if (ret < 0) {
> +   dout("failed to write src file (%zd)\n", ret);
> +   goto out;
> +   }
> +   ret = file_write_and_wait_range(dst_file, dst_off, (dst_off + len));
> +   if (ret < 0) {
> +   dout("failed to write dst file (%zd)\n", ret);
>     goto out;
> +   }
>
> /*
>  * We need FILE_WR caps for dst_ci and FILE_RD for src_ci as other

Applied, thanks

Yan, Zheng


Re: [PATCH] ceph: only allow punch hole mode in fallocate

2018-10-09 Thread Yan, Zheng
On Wed, Oct 10, 2018 at 1:54 AM Luis Henriques  wrote:
>
> Current implementation of cephfs fallocate isn't correct as it doesn't
> really reserve the space in the cluster, which means that a subsequent
> call to a write may actually fail due to lack of space.  In fact, it is
> currently possible to fallocate an amount space that is larger than the
> free space in the cluster.
>
> Since there's no easy solution to fix this at the moment, this patch
> simply removes support for all fallocate operations but
> FALLOC_FL_PUNCH_HOLE (which implies FALLOC_FL_KEEP_SIZE).
>
> Link: https://tracker.ceph.com/issues/36317
> Cc: sta...@vger.kernel.org
> Fixes: ad7a60de882a ("ceph: punch hole support")
> Signed-off-by: Luis Henriques 
> ---
>  fs/ceph/file.c | 45 +
>  1 file changed, 9 insertions(+), 36 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 92ab20433682..91a7ad259bcf 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1735,7 +1735,6 @@ static long ceph_fallocate(struct file *file, int mode,
> struct ceph_file_info *fi = file->private_data;
> struct inode *inode = file_inode(file);
> struct ceph_inode_info *ci = ceph_inode(inode);
> -   struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
> struct ceph_cap_flush *prealloc_cf;
> int want, got = 0;
> int dirty;
> @@ -1743,10 +1742,7 @@ static long ceph_fallocate(struct file *file, int mode,
> loff_t endoff = 0;
> loff_t size;
>
> -   if ((offset + length) > max(i_size_read(inode), fsc->max_file_size))
> -   return -EFBIG;
> -
> -   if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> +   if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> return -EOPNOTSUPP;
>
> if (!S_ISREG(inode->i_mode))
> @@ -1763,18 +1759,6 @@ static long ceph_fallocate(struct file *file, int mode,
> goto unlock;
> }
>
> -   if (!(mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)) &&
> -   ceph_quota_is_max_bytes_exceeded(inode, offset + length)) {
> -   ret = -EDQUOT;
> -   goto unlock;
> -   }
> -
> -   if (ceph_osdmap_flag(&fsc->client->osdc, CEPH_OSDMAP_FULL) &&
> -   !(mode & FALLOC_FL_PUNCH_HOLE)) {
> -   ret = -ENOSPC;
> -   goto unlock;
> -   }
> -
> if (ci->i_inline_version != CEPH_INLINE_NONE) {
> ret = ceph_uninline_data(file, NULL);
> if (ret < 0)
> @@ -1782,12 +1766,12 @@ static long ceph_fallocate(struct file *file, int 
> mode,
> }
>
> size = i_size_read(inode);
> -   if (!(mode & FALLOC_FL_KEEP_SIZE)) {
> -   endoff = offset + length;
> -   ret = inode_newsize_ok(inode, endoff);
> -   if (ret)
> -   goto unlock;
> -   }
> +
> +   /* Are we punching a hole beyond EOF? */
> +   if (offset >= size)
> +   goto unlock;
> +   if ((offset + length) > size)
> +   length = size - offset;
>
> if (fi->fmode & CEPH_FILE_MODE_LAZY)
> want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
> @@ -1798,16 +1782,8 @@ static long ceph_fallocate(struct file *file, int mode,
> if (ret < 0)
> goto unlock;
>
> -   if (mode & FALLOC_FL_PUNCH_HOLE) {
> -   if (offset < size)
> -   ceph_zero_pagecache_range(inode, offset, length);
> -   ret = ceph_zero_objects(inode, offset, length);
> -   } else if (endoff > size) {
> -   truncate_pagecache_range(inode, size, -1);
> -   if (ceph_inode_set_size(inode, endoff))
> -   ceph_check_caps(ceph_inode(inode),
> -   CHECK_CAPS_AUTHONLY, NULL);
> -   }
> +   ceph_zero_pagecache_range(inode, offset, length);
> +   ret = ceph_zero_objects(inode, offset, length);
>
> if (!ret) {
> spin_lock(&ci->i_ceph_lock);
> @@ -1817,9 +1793,6 @@ static long ceph_fallocate(struct file *file, int mode,
> spin_unlock(&ci->i_ceph_lock);
> if (dirty)
> __mark_inode_dirty(inode, dirty);
> -   if ((endoff > size) &&
> -   ceph_quota_is_max_bytes_approaching(inode, endoff))
> -   ceph_check_caps(ci, CHECK_CAPS_NODELAY, NULL);
> }
>
> ceph_put_cap_refs(ci, got);

Applied, thanks

Yan, Zheng


Re: [PATCH v2 04/17] ceph: fix compat_ioctl for ceph_dir_operations

2018-09-12 Thread Yan, Zheng
On Wed, Sep 12, 2018 at 11:10 PM Arnd Bergmann  wrote:
>
> The ceph_ioctl function is used both for files and directories, but only
> the files support doing that in 32-bit compat mode.
>
> For consistency, add the same compat handler to the dir operations
> as well.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Arnd Bergmann 
> ---
>  fs/ceph/dir.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 82928cea0209..da73f29d7faa 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -1489,6 +1489,7 @@ const struct file_operations ceph_dir_fops = {
> .open = ceph_open,
> .release = ceph_release,
> .unlocked_ioctl = ceph_ioctl,
> +   .compat_ioctl = ceph_ioctl,
> .fsync = ceph_fsync,
> .lock = ceph_lock,
> .flock = ceph_flock,
> --
> 2.18.0
>

Reviewed-by: "Yan, Zheng" 


Re: [PATCH v2] fs: ceph: Adding new return type vm_fault_t

2018-07-23 Thread Yan, Zheng
&oldset);
> - if (ret < 0)
> - ret = (ret == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
> + if (err < 0)
> + ret = vmf_error(err);
> 
>   return ret;
> }
> @@ -1524,7 +1525,7 @@ static int ceph_filemap_fault(struct vm_fault *vmf)
> /*
>  * Reuse write_begin here for simplicity.
>  */
> -static int ceph_page_mkwrite(struct vm_fault *vmf)
> +static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
> {
>   struct vm_area_struct *vma = vmf->vma;
>   struct inode *inode = file_inode(vma->vm_file);
> @@ -1535,8 +1536,9 @@ static int ceph_page_mkwrite(struct vm_fault *vmf)
>   loff_t off = page_offset(page);
>   loff_t size = i_size_read(inode);
>   size_t len;
> - int want, got, ret;
> + int want, got, err;
>   sigset_t oldset;
> + vm_fault_t ret = VM_FAULT_SIGBUS;
> 
>   prealloc_cf = ceph_alloc_cap_flush();
>   if (!prealloc_cf)
> @@ -1550,10 +1552,10 @@ static int ceph_page_mkwrite(struct vm_fault *vmf)
>   lock_page(page);
>   locked_page = page;
>   }
> - ret = ceph_uninline_data(vma->vm_file, locked_page);
> + err = ceph_uninline_data(vma->vm_file, locked_page);
>   if (locked_page)
>   unlock_page(locked_page);
> - if (ret < 0)
> + if (err < 0)
>   goto out_free;
>   }
> 
> @@ -1570,9 +1572,9 @@ static int ceph_page_mkwrite(struct vm_fault *vmf)
>   want = CEPH_CAP_FILE_BUFFER;
> 
>   got = 0;
> - ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, off + len,
> + err = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, off + len,
>   &got, NULL);
> - if (ret < 0)
> + if (err < 0)
>   goto out_free;
> 
>   dout("page_mkwrite %p %llu~%zd got cap refs on %s\n",
> @@ -1590,13 +1592,13 @@ static int ceph_page_mkwrite(struct vm_fault *vmf)
>   break;
>   }
> 
> - ret = ceph_update_writeable_page(vma->vm_file, off, len, page);
> - if (ret >= 0) {
> + err = ceph_update_writeable_page(vma->vm_file, off, len, page);
> + if (err >= 0) {
>   /* success.  we'll keep the page locked. */
>   set_page_dirty(page);
>   ret = VM_FAULT_LOCKED;
>   }
> - } while (ret == -EAGAIN);
> + } while (err == -EAGAIN);
> 
>   if (ret == VM_FAULT_LOCKED ||
>   ci->i_inline_version != CEPH_INLINE_NONE) {
> @@ -1610,14 +1612,14 @@ static int ceph_page_mkwrite(struct vm_fault *vmf)
>   __mark_inode_dirty(inode, dirty);
>   }
> 
> - dout("page_mkwrite %p %llu~%zd dropping cap refs on %s ret %d\n",
> + dout("page_mkwrite %p %llu~%zd dropping cap refs on %s ret %x\n",
>inode, off, len, ceph_cap_string(got), ret);
>   ceph_put_cap_refs(ci, got);
> out_free:
>   ceph_restore_sigs(&oldset);
>   ceph_free_cap_flush(prealloc_cf);
> - if (ret < 0)
> - ret = (ret == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
> + if (err < 0)
> + ret = vmf_error(err);
>   return ret;
> }
> 

Applied, Thanks

Yan, Zheng

> -- 
> 1.9.1
> 



Re: [PATCH] fs: ceph: Adding new return type vm_fault_t

2018-07-23 Thread Yan, Zheng
 ceph_restore_sigs(&oldset);
> - if (ret < 0)
> - ret = (ret == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
> + if (err < 0)
> + ret = vmf_error(err);
> 
>   return ret;
> }
> @@ -1520,7 +1521,7 @@ static int ceph_filemap_fault(struct vm_fault *vmf)
> /*
>  * Reuse write_begin here for simplicity.
>  */
> -static int ceph_page_mkwrite(struct vm_fault *vmf)
> +static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
> {
>   struct vm_area_struct *vma = vmf->vma;
>   struct inode *inode = file_inode(vma->vm_file);
> @@ -1531,8 +1532,9 @@ static int ceph_page_mkwrite(struct vm_fault *vmf)
>   loff_t off = page_offset(page);
>   loff_t size = i_size_read(inode);
>   size_t len;
> - int want, got, ret;
> + int want, got, err;
>   sigset_t oldset;
> + vm_fault_t ret;
> 
>   prealloc_cf = ceph_alloc_cap_flush();
>   if (!prealloc_cf)
> @@ -1546,10 +1548,10 @@ static int ceph_page_mkwrite(struct vm_fault *vmf)
>   lock_page(page);
>   locked_page = page;
>   }
> - ret = ceph_uninline_data(vma->vm_file, locked_page);
> + err = ceph_uninline_data(vma->vm_file, locked_page);
>   if (locked_page)
>   unlock_page(locked_page);
> - if (ret < 0)
> + if (err < 0)
>   goto out_free;
>   }
> 
> @@ -1566,9 +1568,9 @@ static int ceph_page_mkwrite(struct vm_fault *vmf)
>   want = CEPH_CAP_FILE_BUFFER;
> 
>   got = 0;
> - ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, off + len,
> + err = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, off + len,
>   &got, NULL);
> - if (ret < 0)
> + if (err < 0)
>   goto out_free;
> 
>   dout("page_mkwrite %p %llu~%zd got cap refs on %s\n",
> @@ -1586,13 +1588,13 @@ static int ceph_page_mkwrite(struct vm_fault *vmf)
>   break;
>   }
> 
> - ret = ceph_update_writeable_page(vma->vm_file, off, len, page);
> - if (ret >= 0) {
> + err = ceph_update_writeable_page(vma->vm_file, off, len, page);
> + if (err >= 0) {
>   /* success.  we'll keep the page locked. */
>   set_page_dirty(page);
>   ret = VM_FAULT_LOCKED;
>   }
> - } while (ret == -EAGAIN);
> + } while (err == -EAGAIN);
> 
>   if (ret == VM_FAULT_LOCKED ||
>   ci->i_inline_version != CEPH_INLINE_NONE) {
> @@ -1606,14 +1608,14 @@ static int ceph_page_mkwrite(struct vm_fault *vmf)
>   __mark_inode_dirty(inode, dirty);
>   }
> 
> - dout("page_mkwrite %p %llu~%zd dropping cap refs on %s ret %d\n",
> + dout("page_mkwrite %p %llu~%zd dropping cap refs on %s ret %x\n",
>inode, off, len, ceph_cap_string(got), ret);
>   ceph_put_cap_refs(ci, got);
> out_free:
>   ceph_restore_sigs(&oldset);
>   ceph_free_cap_flush(prealloc_cf);
> - if (ret < 0)
> - ret = (ret == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
> + if (err < 0)
> + ret = vmf_error(err);
>   return ret;
> }
> 

Applied, thanks

Yan, Zheng

> -- 
> 1.9.1
> 



Re: [PATCH 2/5] ceph: stop using current_kernel_time()

2018-07-15 Thread Yan, Zheng
The cephfs part (patches 2~5) looks good to me.

Regards
Yan, Zheng


On Sat, Jul 14, 2018 at 4:21 AM Arnd Bergmann  wrote:
>
> ceph_mdsc_create_request() is one of the last callers of the
> deprecated current_kernel_time() as well as timespec_trunc().
>
> This changes it to use the timespec64 based interfaces instead,
> though we still need to convert the result until we are ready to
> change over req->r_stamp.
>
> The output of the two functions, ktime_get_coarse_real_ts64() and
> current_kernel_time() is the same coarse-granular timestamp,
> the only difference here is that ktime_get_coarse_real_ts64()
> doesn't overflow in 2038.
>
> Signed-off-by: Arnd Bergmann 
> ---
> v2: add clarification that this is the same timestamp
> ---
>  fs/ceph/mds_client.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index dc8bc664a871..69c839316a7a 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -1779,6 +1779,7 @@ struct ceph_mds_request *
>  ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>  {
> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
> +   struct timespec64 ts;
>
> if (!req)
> return ERR_PTR(-ENOMEM);
> @@ -1797,7 +1798,9 @@ ceph_mdsc_create_request(struct ceph_mds_client *mdsc, 
> int op, int mode)
> init_completion(&req->r_safe_completion);
> INIT_LIST_HEAD(&req->r_unsafe_item);
>
> -   req->r_stamp = timespec_trunc(current_kernel_time(), 
> mdsc->fsc->sb->s_time_gran);
> +   ktime_get_coarse_real_ts64(&ts);
> +   req->r_stamp = timespec64_to_timespec(timespec64_trunc(ts,
> +   mdsc->fsc->sb->s_time_gran));
>
> req->r_op = op;
> req->r_direct_mode = mode;
> --
> 2.9.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ceph: fix use-after-free in ceph_statfs

2018-05-29 Thread Yan, Zheng
ebugging due to kernel taint
>
> Signed-off-by: Luis Henriques 
> ---
>  fs/ceph/super.c | 11 +++
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index b33082e6878f..9c788e59fc04 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -45,7 +45,7 @@ static void ceph_put_super(struct super_block *s)
>  static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf)
>  {
> struct ceph_fs_client *fsc = ceph_inode_to_client(d_inode(dentry));
> -   struct ceph_monmap *monmap = fsc->client->monc.monmap;
> +   struct ceph_mon_client *monc = &fsc->client->monc;
> struct ceph_statfs st;
> u64 fsid;
> int err;
> @@ -58,7 +58,7 @@ static int ceph_statfs(struct dentry *dentry, struct 
> kstatfs *buf)
> }
>
> dout("statfs\n");
> -   err = ceph_monc_do_statfs(&fsc->client->monc, data_pool, &st);
> +   err = ceph_monc_do_statfs(monc, data_pool, &st);
> if (err < 0)
> return err;
>
> @@ -94,8 +94,11 @@ static int ceph_statfs(struct dentry *dentry, struct 
> kstatfs *buf)
> buf->f_namelen = NAME_MAX;
>
> /* Must convert the fsid, for consistent values across arches */
> -   fsid = le64_to_cpu(*(__le64 *)(&monmap->fsid)) ^
> -  le64_to_cpu(*((__le64 *)&monmap->fsid + 1));
> +   mutex_lock(&monc->mutex);
> +   fsid = le64_to_cpu(*(__le64 *)(&monc->monmap->fsid)) ^
> +  le64_to_cpu(*((__le64 *)&monc->monmap->fsid + 1));
> +   mutex_unlock(&monc->mutex);
> +
> buf->f_fsid.val[0] = fsid & 0x;
> buf->f_fsid.val[1] = fsid >> 32;
>


Applied, thanks

Yan, Zheng

> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ceph: fix st_nlink stat for directories

2018-05-21 Thread Yan, Zheng


> On May 21, 2018, at 17:27, Luis Henriques  wrote:
> 
> Currently, calling stat on a cephfs directory returns 1 for st_nlink.
> This behaviour has recently changed in the fuse client, as some
> applications seem to expect this value to be either 0 (if it's
> unlinked) or 2 + number of subdirectories.  This behaviour was changed
> in the fuse client with commit 67c7e4619188 ("client: use common
> interp of st_nlink for dirs").
> 
> This patch modifies the kernel client to have a similar behaviour.
> 
> Link: https://tracker.ceph.com/issues/23873
> Signed-off-by: Luis Henriques 
> ---
> fs/ceph/inode.c | 8 
> 1 file changed, 8 insertions(+)
> 
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index aa7c5a4ff137..6ad66ef8fd66 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -2288,6 +2288,14 @@ int ceph_getattr(const struct path *path, struct kstat 
> *stat,
>   stat->size = ci->i_files + ci->i_subdirs;
>   stat->blocks = 0;
>   stat->blksize = 65536;
> + /*
> +  * Some applications rely on the number of st_nlink
> +  * value on directories to be either 0 (if unlinked)
> +  * or 2 + number of subdirectories.
> +  */
> + if (stat->nlink == 1)
> + /* '.' + '..' + subdirs */
> + stat->nlink = 1 + 1 + ci->i_subdirs;
>   }
>   }
>   return err;

Applied, thanks.

Yan, Zheng



Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault

2017-12-14 Thread Yan, Zheng
On Fri, Dec 15, 2017 at 12:53 AM, Jan Kara  wrote:
> On Thu 14-12-17 22:30:26, Yan, Zheng wrote:
>> On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara  wrote:
>> > On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
>> >> We recently got an Oops report:
>> >>
>> >> BUG: unable to handle kernel NULL pointer dereference at (null)
>> >> IP: jbd2__journal_start+0x38/0x1a2
>> >> [...]
>> >> Call Trace:
>> >>   ext4_page_mkwrite+0x307/0x52b
>> >>   _ext4_get_block+0xd8/0xd8
>> >>   do_page_mkwrite+0x6e/0xd8
>> >>   handle_mm_fault+0x686/0xf9b
>> >>   mntput_no_expire+0x1f/0x21e
>> >>   __do_page_fault+0x21d/0x465
>> >>   dput+0x4a/0x2f7
>> >>   page_fault+0x22/0x30
>> >>   copy_user_generic_string+0x2c/0x40
>> >>   copy_page_to_iter+0x8c/0x2b8
>> >>   generic_file_read_iter+0x26e/0x845
>> >>   timerqueue_del+0x31/0x90
>> >>   ceph_read_iter+0x697/0xa33 [ceph]
>> >>   hrtimer_cancel+0x23/0x41
>> >>   futex_wait+0x1c8/0x24d
>> >>   get_futex_key+0x32c/0x39a
>> >>   __vfs_read+0xe0/0x130
>> >>   vfs_read.part.1+0x6c/0x123
>> >>   handle_mm_fault+0x831/0xf9b
>> >>   __fget+0x7e/0xbf
>> >>   SyS_read+0x4d/0xb5
>> >>
>> >> ceph_read_iter() uses current->journal_info to pass context info to
>> >> ceph_readpages(). Because ceph_readpages() needs to know if its caller
>> >> has already gotten capability of using page cache (distinguish read
>> >> from readahead/fadvise). ceph_read_iter() set current->journal_info,
>> >> then calls generic_file_read_iter().
>> >>
>> >> In above Oops, page fault happened when copying data to userspace.
>> >> Page fault handler called ext4_page_mkwrite(). Ext4 code read
>> >> current->journal_info and assumed it is journal handle.
>> >>
>> >> I checked other filesystems, btrfs probably suffers similar problem
>> >> for its readpage. (page fault happens when write() copies data from
>> >> userspace memory and the memory is mapped to a file in btrfs.
>> >> verify_parent_transid() can be called during readpage)
>> >>
>> >> Cc: sta...@vger.kernel.org
>> >> Signed-off-by: "Yan, Zheng" 
>> >
>> > I agree with the analysis but the patch is too ugly too live. Ceph just
>> > should not be abusing current->journal_info for passing information between
>> > two random functions or when it does a hackery like this, it should just
>> > make sure the pieces hold together. Poluting generic code to accommodate
>> > this hack in Ceph is not acceptable. Also bear in mind there are likely
>> > other code paths (e.g. memory reclaim) which could recurse into another
>> > filesystem confusing it with non-NULL current->journal_info in the same
>> > way.
>>
>> But ...
>>
>> Some filesystems set journal_info in their write_begin(), then clear it
>> in write_end(). If the buffer for the write is mapped to another filesystem,
>> current->journal_info can leak into the latter filesystem's readpage().
>> The latter filesystem may read current->journal_info and treat it as its own
>> journal handle.  Besides, most filesystems' vm fault handler is
>> filemap_fault(), and filemap code may also trigger memory reclaim.
>
> Did you really observe this? Because write path uses
> iov_iter_copy_from_user_atomic() which does not allow page faults to
> happen. All page faulting happens in iov_iter_fault_in_readable() before
> ->write_begin() is called. And the recursion problems like you mention
> above are exactly the reason why things are done in a more complicated way
> like this.

I think you are right.

>
>> >
>> > In this particular case I'm not sure why does ceph pass 'filp' into
>> > readpage() / readpages() handler when it already gets that pointer as part
>> > of arguments...
>>
>> It is actually a flag that tells ceph_readpages() whether its caller is
>> ceph_read_iter() or readahead/fadvise/madvise, because when multiple
>> clients read/write a file at the same time, the page cache should be
>> disabled.
>
> I'm not sure I understand the reasoning properly but from what you say
> above it rather seems the 'hint' should be stored in the inode (or possibly
> struct file)?
>

The capability to use the page cache is held by the process that got it.
ceph_read_iter() first gets the capability, calls
generic_file_read_iter(), then releases the capability. The capability
cannot easily be stored in the inode or file because the server can
revoke it at any time if the caller does not hold it.
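
For illustration, here is a minimal sketch of that lifetime (the function
name is made up and the error/snapshot handling is omitted, so treat it as
an assumption-laden sketch rather than the actual fs/ceph/file.c code; the
cap helpers are the ones already used elsewhere in this thread):

static ssize_t sketch_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	struct ceph_inode_info *ci = ceph_inode(inode);
	struct page *pinned_page = NULL;
	int got = 0;
	ssize_t ret;

	/* acquire the caps for this call only */
	ret = ceph_get_caps(ci, CEPH_CAP_FILE_RD, CEPH_CAP_FILE_CACHE,
			    -1, &got, &pinned_page);
	if (ret < 0)
		return ret;

	/* the page cache may be used only while the caps are held */
	ret = generic_file_read_iter(iocb, to);

	if (pinned_page)
		put_page(pinned_page);
	/* release; the MDS is free to revoke the caps afterwards */
	ceph_put_cap_refs(ci, got);
	return ret;
}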

Regards
Yan, Zheng


> Honza
> --
> Jan Kara 
> SUSE Labs, CR


Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault

2017-12-14 Thread Yan, Zheng
On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara  wrote:
> On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
>> We recently got an Oops report:
>>
>> BUG: unable to handle kernel NULL pointer dereference at (null)
>> IP: jbd2__journal_start+0x38/0x1a2
>> [...]
>> Call Trace:
>>   ext4_page_mkwrite+0x307/0x52b
>>   _ext4_get_block+0xd8/0xd8
>>   do_page_mkwrite+0x6e/0xd8
>>   handle_mm_fault+0x686/0xf9b
>>   mntput_no_expire+0x1f/0x21e
>>   __do_page_fault+0x21d/0x465
>>   dput+0x4a/0x2f7
>>   page_fault+0x22/0x30
>>   copy_user_generic_string+0x2c/0x40
>>   copy_page_to_iter+0x8c/0x2b8
>>   generic_file_read_iter+0x26e/0x845
>>   timerqueue_del+0x31/0x90
>>   ceph_read_iter+0x697/0xa33 [ceph]
>>   hrtimer_cancel+0x23/0x41
>>   futex_wait+0x1c8/0x24d
>>   get_futex_key+0x32c/0x39a
>>   __vfs_read+0xe0/0x130
>>   vfs_read.part.1+0x6c/0x123
>>   handle_mm_fault+0x831/0xf9b
>>   __fget+0x7e/0xbf
>>   SyS_read+0x4d/0xb5
>>
>> ceph_read_iter() uses current->journal_info to pass context info to
>> ceph_readpages(). Because ceph_readpages() needs to know if its caller
>> has already gotten capability of using page cache (distinguish read
>> from readahead/fadvise). ceph_read_iter() set current->journal_info,
>> then calls generic_file_read_iter().
>>
>> In above Oops, page fault happened when copying data to userspace.
>> Page fault handler called ext4_page_mkwrite(). Ext4 code read
>> current->journal_info and assumed it is journal handle.
>>
>> I checked other filesystems, btrfs probably suffers similar problem
>> for its readpage. (page fault happens when write() copies data from
>> userspace memory and the memory is mapped to a file in btrfs.
>> verify_parent_transid() can be called during readpage)
>>
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: "Yan, Zheng" 
>
> I agree with the analysis but the patch is too ugly too live. Ceph just
> should not be abusing current->journal_info for passing information between
> two random functions or when it does a hackery like this, it should just
> make sure the pieces hold together. Poluting generic code to accommodate
> this hack in Ceph is not acceptable. Also bear in mind there are likely
> other code paths (e.g. memory reclaim) which could recurse into another
> filesystem confusing it with non-NULL current->journal_info in the same
> way.

But ...

Some filesystems set journal_info in their write_begin(), then clear it
in write_end(). If the buffer for the write is mapped to another filesystem,
current->journal_info can leak into the latter filesystem's readpage().
The latter filesystem may read current->journal_info and treat it as its own
journal handle.  Besides, most filesystems' vm fault handler is
filemap_fault(), and filemap code may also trigger memory reclaim.

>
> In this particular case I'm not sure why does ceph pass 'filp' into
> readpage() / readpages() handler when it already gets that pointer as part
> of arguments...

It is actually a flag that tells ceph_readpages() whether its caller is
ceph_read_iter() or readahead/fadvise/madvise, because when multiple
clients read/write a file at the same time, the page cache should be
disabled.

Regards
Yan, Zheng

>
> Honza
>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index a728bed16c20..db2a50233c49 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, 
>> unsigned long address,
>>   unsigned int flags)
>>  {
>>   int ret;
>> + void *old_journal_info;
>>
>>   __set_current_state(TASK_RUNNING);
>>
>> @@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, 
>> unsigned long address,
>>   if (flags & FAULT_FLAG_USER)
>>   mem_cgroup_oom_enable();
>>
>> + /*
>> +  * Fault can happen when filesystem A's read_iter()/write_iter()
>> +  * copies data to/from userspace. Filesystem A may have set
>> +  * current->journal_info. If the userspace memory is MAP_SHARED
>> +  * mapped to a file in filesystem B, we later may call filesystem
>> +  * B's vm operation. Filesystem B may also want to read/set
>> +  * current->journal_info.
>> +  */
>> + old_journal_info = current->journal_info;
>> + current->journal_info = NULL;
>> +
>>   if (unlikely(is_vm_hugetlb_page(vma)))
>>   ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
>>   else
>>   ret = __handle_mm_fault(vma, address, flags);
>>
>> + current->journal_info = old_journal_info;
>> +
>>   if (flags & FAULT_FLAG_USER) {
>>   mem_cgroup_oom_disable();
>>   /*
>> --
>> 2.13.6
>>
> --
> Jan Kara 
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] mm: save/restore current->journal_info in handle_mm_fault

2017-12-14 Thread Yan, Zheng
We recently got an Oops report:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: jbd2__journal_start+0x38/0x1a2
[...]
Call Trace:
  ext4_page_mkwrite+0x307/0x52b
  _ext4_get_block+0xd8/0xd8
  do_page_mkwrite+0x6e/0xd8
  handle_mm_fault+0x686/0xf9b
  mntput_no_expire+0x1f/0x21e
  __do_page_fault+0x21d/0x465
  dput+0x4a/0x2f7
  page_fault+0x22/0x30
  copy_user_generic_string+0x2c/0x40
  copy_page_to_iter+0x8c/0x2b8
  generic_file_read_iter+0x26e/0x845
  timerqueue_del+0x31/0x90
  ceph_read_iter+0x697/0xa33 [ceph]
  hrtimer_cancel+0x23/0x41
  futex_wait+0x1c8/0x24d
  get_futex_key+0x32c/0x39a
  __vfs_read+0xe0/0x130
  vfs_read.part.1+0x6c/0x123
  handle_mm_fault+0x831/0xf9b
  __fget+0x7e/0xbf
  SyS_read+0x4d/0xb5

ceph_read_iter() uses current->journal_info to pass context info to
ceph_readpages(), because ceph_readpages() needs to know whether its
caller has already gotten the capability to use the page cache (to
distinguish a read from readahead/fadvise). ceph_read_iter() sets
current->journal_info, then calls generic_file_read_iter().

In the above Oops, the page fault happened while copying data to
userspace. The page fault handler called ext4_page_mkwrite(), and the
ext4 code read current->journal_info and assumed it was a journal handle.

I checked other filesystems; btrfs probably suffers from a similar
problem in its readpage. (A page fault can happen when write() copies
data from userspace memory and the memory is mapped to a file in btrfs;
verify_parent_transid() can be called during readpage.)

Cc: sta...@vger.kernel.org
Signed-off-by: "Yan, Zheng" 
---
 mm/memory.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index a728bed16c20..db2a50233c49 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned 
long address,
unsigned int flags)
 {
int ret;
+   void *old_journal_info;
 
__set_current_state(TASK_RUNNING);
 
@@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
if (flags & FAULT_FLAG_USER)
mem_cgroup_oom_enable();
 
+   /*
+* Fault can happen when filesystem A's read_iter()/write_iter()
+* copies data to/from userspace. Filesystem A may have set
+* current->journal_info. If the userspace memory is MAP_SHARED
+* mapped to a file in filesystem B, we later may call filesystem
+* B's vm operation. Filesystem B may also want to read/set
+* current->journal_info.
+*/
+   old_journal_info = current->journal_info;
+   current->journal_info = NULL;
+
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
ret = __handle_mm_fault(vma, address, flags);
 
+   current->journal_info = old_journal_info;
+
if (flags & FAULT_FLAG_USER) {
mem_cgroup_oom_disable();
/*
-- 
2.13.6



Re: [PATCH] mm: save current->journal_info before calling fault/page_mkwrite

2017-12-13 Thread Yan, Zheng

> On 14 Dec 2017, at 08:59, Andrew Morton  wrote:
> 
> On Wed, 13 Dec 2017 11:58:36 +0800 "Yan, Zheng"  wrote:
> 
>> We recently got an Oops report:
>> 
>> BUG: unable to handle kernel NULL pointer dereference at (null)
>> IP: jbd2__journal_start+0x38/0x1a2
>> [...]
>> Call Trace:
>>  ext4_page_mkwrite+0x307/0x52b
>>  _ext4_get_block+0xd8/0xd8
>>  do_page_mkwrite+0x6e/0xd8
>>  handle_mm_fault+0x686/0xf9b
>>  mntput_no_expire+0x1f/0x21e
>>  __do_page_fault+0x21d/0x465
>>  dput+0x4a/0x2f7
>>  page_fault+0x22/0x30
>>  copy_user_generic_string+0x2c/0x40
>>  copy_page_to_iter+0x8c/0x2b8
>>  generic_file_read_iter+0x26e/0x845
>>  timerqueue_del+0x31/0x90
>>  ceph_read_iter+0x697/0xa33 [ceph]
>>  hrtimer_cancel+0x23/0x41
>>  futex_wait+0x1c8/0x24d
>>  get_futex_key+0x32c/0x39a
>>  __vfs_read+0xe0/0x130
>>  vfs_read.part.1+0x6c/0x123
>>  handle_mm_fault+0x831/0xf9b
>>  __fget+0x7e/0xbf
>>  SyS_read+0x4d/0xb5
>> 
>> The reason is that page fault can happen when one filesystem copies
>> data from/to userspace, the filesystem may set current->journal_info.
>> If the userspace memory is mapped to a file on another filesystem,
>> the later filesystem may also want to use current->journal_info.
>> 
> 
> whoops.
> 
> A cc:stable will be needed here...
> 
> A filesystem doesn't "copy data from/to userspace".  I assume here
> we're referring to a read() where the source is a pagecache page for
> filesystem A and the destination is a MAP_SHARED page in filesystem B?
> 
> But in that case I don't see why filesystem A would have a live
> ->journal_info?  It's just doing a read.
> 
> So can we please have more detailed info on the exact scenario here?
> 
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2347,12 +2347,22 @@ static int do_page_mkwrite(struct vm_fault *vmf)
>> {
>>  int ret;
>>  struct page *page = vmf->page;
>> +void *old_journal_info = current->journal_info;
>>  unsigned int old_flags = vmf->flags;
>> 
>> +/*
>> + * If the fault happens during read_iter() copies data to
>> + * userspace, filesystem may have set current->journal_info.
>> + * If the userspace memory is mapped to a file on another
>> + * filesystem, page_mkwrite() of the later filesystem may
>> + * want to access/modify current->journal_info.
>> + */
>> +current->journal_info = NULL;
>>  vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
>> 
>>  ret = vmf->vma->vm_ops->page_mkwrite(vmf);
>> -/* Restore original flags so that caller is not surprised */
>> +/* Restore original journal_info and flags */
>> +current->journal_info = old_journal_info;
>>  vmf->flags = old_flags;
>>  if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
>>  return ret;
>> @@ -3191,9 +3201,20 @@ static int do_anonymous_page(struct vm_fault *vmf)
>> static int __do_fault(struct vm_fault *vmf)
>> {
>>  struct vm_area_struct *vma = vmf->vma;
>> +void *old_journal_info = current->journal_info;
>>  int ret;
>> 
>> +/*
>> + * If the fault happens during write_iter() copies data from
>> + * userspace, filesystem may have set current->journal_info.
>> + * If the userspace memory is mapped to a file on another
>> + * filesystem, fault handler of the later filesystem may want
>> + * to access/modify current->journal_info.
>> + */
>> +current->journal_info = NULL;
>>  ret = vma->vm_ops->fault(vmf);
>> +/* Restore original journal_info */
>> +current->journal_info = old_journal_info;
>>  if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>>  VM_FAULT_DONE_COW)))
>>  return ret;
> 
> Can you explain why you chose these two sites?  Rather than, for
> example, way up in handle_mm_fault()?

I think these are the only two places where the code can enter another filesystem.

> 
> It's hard to believe that a fault handler will alter ->journal_info if
> it is handling a read fault, so perhaps we only need to do this for a
> write fault?  Although such an optimization probably isn't worthwhile. 
> The whole thing is only about three instructions.

Ceph uses current->journal_info for both read and write operations. I think
btrfs also reads current->journal_info during read-only operations. (I
mentioned this in my previous reply.)

Regards
Yan, Zheng
 



Re: [PATCH] mm: save current->journal_info before calling fault/page_mkwrite

2017-12-13 Thread Yan, Zheng


> On 14 Dec 2017, at 08:59, Andrew Morton  wrote:
> 
> On Wed, 13 Dec 2017 11:58:36 +0800 "Yan, Zheng"  wrote:
> 
>> We recently got an Oops report:
>> 
>> BUG: unable to handle kernel NULL pointer dereference at (null)
>> IP: jbd2__journal_start+0x38/0x1a2
>> [...]
>> Call Trace:
>>  ext4_page_mkwrite+0x307/0x52b
>>  _ext4_get_block+0xd8/0xd8
>>  do_page_mkwrite+0x6e/0xd8
>>  handle_mm_fault+0x686/0xf9b
>>  mntput_no_expire+0x1f/0x21e
>>  __do_page_fault+0x21d/0x465
>>  dput+0x4a/0x2f7
>>  page_fault+0x22/0x30
>>  copy_user_generic_string+0x2c/0x40
>>  copy_page_to_iter+0x8c/0x2b8
>>  generic_file_read_iter+0x26e/0x845
>>  timerqueue_del+0x31/0x90
>>  ceph_read_iter+0x697/0xa33 [ceph]
>>  hrtimer_cancel+0x23/0x41
>>  futex_wait+0x1c8/0x24d
>>  get_futex_key+0x32c/0x39a
>>  __vfs_read+0xe0/0x130
>>  vfs_read.part.1+0x6c/0x123
>>  handle_mm_fault+0x831/0xf9b
>>  __fget+0x7e/0xbf
>>  SyS_read+0x4d/0xb5
>> 
>> The reason is that page fault can happen when one filesystem copies
>> data from/to userspace, the filesystem may set current->journal_info.
>> If the userspace memory is mapped to a file on another filesystem,
>> the later filesystem may also want to use current->journal_info.
>> 
> 
> whoops.
> 
> A cc:stable will be needed here...
> 
> A filesystem doesn't "copy data from/to userspace".  I assume here
> we're referring to a read() where the source is a pagecache page for
> filesystem A and the destination is a MAP_SHARED page in filesystem B?
> 
> But in that case I don't see why filesystem A would have a live
> ->journal_info?  It's just doing a read.


Background: when multiple cephfs clients read/write a file at the same time,
reads and writes should go directly to the object store daemon, and use of
the page cache is disabled.

ceph_read_iter() uses current->journal_info to pass context information to
ceph_readpages().  ceph_readpages() needs to know whether its caller has
already gotten the capability to use the page cache (to distinguish a read
from readahead/fadvise). If not, it tries to get the capability by itself. I
checked other filesystems; btrfs probably suffers from a similar problem in
its readpages. (verify_parent_transid() uses current->journal_info and it can
be called by btrfs_get_extent().)
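
A minimal sketch of that context passing follows (the structure and function
names are invented for illustration and this is not the actual fs/ceph
implementation; only the current->journal_info hand-off is the point):

struct sketch_rw_context {
	int caps;	/* caps the reading process already holds */
};

static ssize_t sketch_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct sketch_rw_context ctx = { .caps = 0 /* filled after get_caps */ };
	ssize_t ret;

	current->journal_info = &ctx;		/* hand context down the stack */
	ret = generic_file_read_iter(iocb, to);	/* may end up in ->readpages() */
	current->journal_info = NULL;
	return ret;
}

static int sketch_readpages(struct file *file, struct address_space *mapping,
			    struct list_head *pages, unsigned nr_pages)
{
	struct sketch_rw_context *ctx = current->journal_info;

	if (!ctx) {
		/* readahead/fadvise path: no caps held yet, get them here */
	}
	/* issue the OSD reads ... */
	return 0;
}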

Regards
Yan, Zheng

> 
> So can we please have more detailed info on the exact scenario here?
> 
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2347,12 +2347,22 @@ static int do_page_mkwrite(struct vm_fault *vmf)
>> {
>>  int ret;
>>  struct page *page = vmf->page;
>> +void *old_journal_info = current->journal_info;
>>  unsigned int old_flags = vmf->flags;
>> 
>> +/*
>> + * If the fault happens during read_iter() copies data to
>> + * userspace, filesystem may have set current->journal_info.
>> + * If the userspace memory is mapped to a file on another
>> + * filesystem, page_mkwrite() of the later filesystem may
>> + * want to access/modify current->journal_info.
>> + */
>> +current->journal_info = NULL;
>>  vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
>> 
>>  ret = vmf->vma->vm_ops->page_mkwrite(vmf);
>> -/* Restore original flags so that caller is not surprised */
>> +/* Restore original journal_info and flags */
>> +current->journal_info = old_journal_info;
>>  vmf->flags = old_flags;
>>  if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
>>  return ret;
>> @@ -3191,9 +3201,20 @@ static int do_anonymous_page(struct vm_fault *vmf)
>> static int __do_fault(struct vm_fault *vmf)
>> {
>>  struct vm_area_struct *vma = vmf->vma;
>> +void *old_journal_info = current->journal_info;
>>  int ret;
>> 
>> +/*
>> + * If the fault happens during write_iter() copies data from
>> + * userspace, filesystem may have set current->journal_info.
>> + * If the userspace memory is mapped to a file on another
>> + * filesystem, fault handler of the later filesystem may want
>> + * to access/modify current->journal_info.
>> + */
>> +current->journal_info = NULL;
>>  ret = vma->vm_ops->fault(vmf);
>> +/* Restore original journal_info */
>> +current->journal_info = old_journal_info;
>>  if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>>  VM_FAULT_DONE_COW)))
>>  return ret;
> 
> Can you explain why you chose these two sites?  Rather than, for
> example, way up in handle_mm_fault()?
> 
> It's hard to believe that a fault handler will alter ->journal_info if
> it is handling a read fault, so perhaps we only need to do this for a
> write fault?  Although such an optimization probably isn't worthwhile. 
> The whole thing is only about three instructions.
> 
> 



[PATCH] mm: save current->journal_info before calling fault/page_mkwrite

2017-12-12 Thread Yan, Zheng
We recently got an Oops report:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: jbd2__journal_start+0x38/0x1a2
[...]
Call Trace:
  ext4_page_mkwrite+0x307/0x52b
  _ext4_get_block+0xd8/0xd8
  do_page_mkwrite+0x6e/0xd8
  handle_mm_fault+0x686/0xf9b
  mntput_no_expire+0x1f/0x21e
  __do_page_fault+0x21d/0x465
  dput+0x4a/0x2f7
  page_fault+0x22/0x30
  copy_user_generic_string+0x2c/0x40
  copy_page_to_iter+0x8c/0x2b8
  generic_file_read_iter+0x26e/0x845
  timerqueue_del+0x31/0x90
  ceph_read_iter+0x697/0xa33 [ceph]
  hrtimer_cancel+0x23/0x41
  futex_wait+0x1c8/0x24d
  get_futex_key+0x32c/0x39a
  __vfs_read+0xe0/0x130
  vfs_read.part.1+0x6c/0x123
  handle_mm_fault+0x831/0xf9b
  __fget+0x7e/0xbf
  SyS_read+0x4d/0xb5

The reason is that a page fault can happen while one filesystem copies
data from/to userspace, and that filesystem may have set
current->journal_info. If the userspace memory is mapped to a file on
another filesystem, the latter filesystem may also want to use
current->journal_info.

Signed-off-by: "Yan, Zheng" 
---
 mm/memory.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5eb3d2524bdc..e51383cd49bf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2347,12 +2347,22 @@ static int do_page_mkwrite(struct vm_fault *vmf)
 {
int ret;
struct page *page = vmf->page;
+   void *old_journal_info = current->journal_info;
unsigned int old_flags = vmf->flags;
 
+   /*
+* If the fault happens during read_iter() copies data to
+* userspace, filesystem may have set current->journal_info.
+* If the userspace memory is mapped to a file on another
+* filesystem, page_mkwrite() of the later filesystem may
+* want to access/modify current->journal_info.
+*/
+   current->journal_info = NULL;
vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 
ret = vmf->vma->vm_ops->page_mkwrite(vmf);
-   /* Restore original flags so that caller is not surprised */
+   /* Restore original journal_info and flags */
+   current->journal_info = old_journal_info;
vmf->flags = old_flags;
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
return ret;
@@ -3191,9 +3201,20 @@ static int do_anonymous_page(struct vm_fault *vmf)
 static int __do_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
+   void *old_journal_info = current->journal_info;
int ret;
 
+   /*
+* If the fault happens during write_iter() copies data from
+* userspace, filesystem may have set current->journal_info.
+* If the userspace memory is mapped to a file on another
+* filesystem, fault handler of the later filesystem may want
+* to access/modify current->journal_info.
+*/
+   current->journal_info = NULL;
ret = vma->vm_ops->fault(vmf);
+   /* Restore original journal_info */
+   current->journal_info = old_journal_info;
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
VM_FAULT_DONE_COW)))
return ret;
-- 
2.13.6



Re: [PATCH] ceph: Fix bool initialization/comparison

2017-10-08 Thread Yan, Zheng


> On 7 Oct 2017, at 22:02, Thomas Meyer  wrote:
> 
> Bool initializations should use true and false. Bool tests don't need
> comparisons.
> 
> Signed-off-by: Thomas Meyer 
> ---
> 
> diff -u -p a/fs/ceph/caps.c b/fs/ceph/caps.c
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -1711,7 +1711,7 @@ void ceph_check_caps(struct ceph_inode_i
> 
>   /* if we are unmounting, flush any unused caps immediately. */
>   if (mdsc->stopping)
> - is_delayed = 1;
> + is_delayed = true;
> 
>   spin_lock(&ci->i_ceph_lock);
> 
> @@ -3185,8 +3185,8 @@ static void handle_cap_flush_ack(struct
>   int dirty = le32_to_cpu(m->dirty);
>   int cleaned = 0;
>   bool drop = false;
> - bool wake_ci = 0;
> - bool wake_mdsc = 0;
> + bool wake_ci = false;
> + bool wake_mdsc = false;
> 
>   list_for_each_entry_safe(cf, tmp_cf, &ci->i_cap_flush_list, i_list) {
>   if (cf->tid == flush_tid)

Applied, thanks

Yan, Zheng




Re: [PATCH 0/3] Ceph: Adjustments for some function implementations

2017-08-20 Thread Yan, Zheng

> On 21 Aug 2017, at 02:37, SF Markus Elfring  
> wrote:
> 
> From: Markus Elfring 
> Date: Sun, 20 Aug 2017 20:32:10 +0200
> 
> Three update suggestions were taken into account
> from static source code analysis.
> 
> Markus Elfring (3):
>  Delete an error message for a failed memory allocation in 
> __get_or_create_frag()
>  Delete an unnecessary return statement in update_dentry_lease()
>  Adjust 36 checks for null pointers
> 
> fs/ceph/addr.c   |  2 +-
> fs/ceph/cache.c  |  2 +-
> fs/ceph/caps.c   |  4 ++--
> fs/ceph/debugfs.c|  2 +-
> fs/ceph/file.c   |  2 +-
> fs/ceph/inode.c  | 14 +-
> fs/ceph/mds_client.c | 22 +++---
> fs/ceph/mdsmap.c |  6 +++---
> fs/ceph/super.c  | 18 +-
> fs/ceph/xattr.c  |  8 
> 10 files changed, 38 insertions(+), 42 deletions(-)
> 
> — 

Whole series applied, thanks

Yan, Zheng

> 2.14.0
> 



Re: [PATCH 1/2] ceph: use errseq_t for writeback error reporting

2017-07-26 Thread Yan, Zheng
On Tue, Jul 25, 2017 at 10:50 PM, Jeff Layton  wrote:
> From: Jeff Layton 
>
> Ensure that when writeback errors are marked that we report those to all
> file descriptions that were open at the time of the error.
>
> Signed-off-by: Jeff Layton 
> ---
>  fs/ceph/caps.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 7007ae2a5ad2..13f6edf24acd 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -2110,7 +2110,7 @@ int ceph_fsync(struct file *file, loff_t start, loff_t 
> end, int datasync)
>
> dout("fsync %p%s\n", inode, datasync ? " datasync" : "");
>
> -   ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
> +   ret = file_write_and_wait_range(file, start, end);
> if (ret < 0)
> goto out;
>
> --
> 2.13.3
>

Reviewed-by: "Yan, Zheng" 


Re: [PATCH 2/2] ceph: pagecache writeback fault injection switch

2017-07-26 Thread Yan, Zheng
On Tue, Jul 25, 2017 at 10:50 PM, Jeff Layton  wrote:
> From: Jeff Layton 
>
> Testing ceph for proper writeback error handling turns out to be quite
> difficult. I tried using iptables to block traffic but that didn't
> give reliable results.
>
> I hacked in this wb_fault switch that makes the filesystem pretend that
> writeback failed, even when it succeeds. With this, I could verify that
> cephfs fsync error reporting does work properly.
>
> Signed-off-by: Jeff Layton 
> ---
>  fs/ceph/addr.c| 7 +++
>  fs/ceph/debugfs.c | 8 +++-
>  fs/ceph/super.h   | 2 ++
>  3 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 50836280a6f8..a3831d100e16 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -584,6 +584,10 @@ static int writepage_nounlock(struct page *page, struct 
> writeback_control *wbc)
>page_off, len,
>truncate_seq, truncate_size,
>&inode->i_mtime, &page, 1);
> +
> +   if (fsc->wb_fault && err >= 0)
> +   err = -EIO;
> +
> if (err < 0) {
> struct writeback_control tmp_wbc;
> if (!wbc)
> @@ -666,6 +670,9 @@ static void writepages_finish(struct ceph_osd_request 
> *req)
> struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
> bool remove_page;
>
> +   if (fsc->wb_fault && rc >= 0)
> +   rc = -EIO;
> +
> dout("writepages_finish %p rc %d\n", inode, rc);
> if (rc < 0) {
> mapping_set_error(mapping, rc);
> diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
> index 4e2d112c982f..e1e6eaa12031 100644
> --- a/fs/ceph/debugfs.c
> +++ b/fs/ceph/debugfs.c
> @@ -197,7 +197,6 @@ CEPH_DEFINE_SHOW_FUNC(caps_show)
>  CEPH_DEFINE_SHOW_FUNC(dentry_lru_show)
>  CEPH_DEFINE_SHOW_FUNC(mds_sessions_show)
>
> -
>  /*
>   * debugfs
>   */
> @@ -231,6 +230,7 @@ void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
> debugfs_remove(fsc->debugfs_caps);
> debugfs_remove(fsc->debugfs_mdsc);
> debugfs_remove(fsc->debugfs_dentry_lru);
> +   debugfs_remove(fsc->debugfs_wb_fault);
>  }
>
>  int ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
> @@ -298,6 +298,12 @@ int ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
> if (!fsc->debugfs_dentry_lru)
> goto out;
>
> +   fsc->debugfs_wb_fault = debugfs_create_bool("wb_fault",
> +   0600, fsc->client->debugfs_dir,
> +   &fsc->wb_fault);
> +   if (!fsc->debugfs_wb_fault)
> +   goto out;
> +
> return 0;
>
>  out:
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index f02a2225fe42..a38fd6203b77 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -84,6 +84,7 @@ struct ceph_fs_client {
>
> unsigned long mount_state;
> int min_caps;  /* min caps i added */
> +   bool wb_fault;
>
> struct ceph_mds_client *mdsc;
>
> @@ -100,6 +101,7 @@ struct ceph_fs_client {
> struct dentry *debugfs_bdi;
> struct dentry *debugfs_mdsc, *debugfs_mdsmap;
> struct dentry *debugfs_mds_sessions;
> +   struct dentry *debugfs_wb_fault;
>  #endif
>

I think it's better not to enable this feature by default. How about
enabling it via a compile-time option or a mount option?

Regards
Yan, Zheng

>  #ifdef CONFIG_CEPH_FSCACHE
> --
> 2.13.3
>


Re: [PATCH] ceph: kernel client startsync can be removed

2017-07-23 Thread Yan, Zheng
   \
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 901bb82..5c9d696 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -863,8 +863,6 @@ static u32 osd_req_encode_op(struct ceph_osd_op *dst,
>   dst->cls.method_len = src->cls.method_len;
>   dst->cls.indata_len = cpu_to_le32(src->cls.indata_len);
>   break;
> - case CEPH_OSD_OP_STARTSYNC:
> - break;
>   case CEPH_OSD_OP_WATCH:
>   dst->watch.cookie = cpu_to_le64(src->watch.cookie);
>   dst->watch.ver = cpu_to_le64(0);
> @@ -916,9 +914,6 @@ static u32 osd_req_encode_op(struct ceph_osd_op *dst,
>  * if the file was recently truncated, we include information about its
>  * old and new size so that the object can be updated appropriately.  (we
>  * avoid synchronously deleting truncated objects because it's slow.)
> - *
> - * if @do_sync, include a 'startsync' command so that the osd will flush
> - * data quickly.
>  */
> struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *osdc,
>  struct ceph_file_layout *layout,
> -- 
> 1.8.3.1
> 

Applied, thanks.

Yan, Zheng




Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-04 Thread Yan, Zheng
On Fri, Jun 2, 2017 at 10:18 PM, Arnd Bergmann  wrote:
> On Fri, Jun 2, 2017 at 2:18 PM, Yan, Zheng  wrote:
>> On Fri, Jun 2, 2017 at 7:33 PM, Arnd Bergmann  wrote:
>>> On Fri, Jun 2, 2017 at 1:18 PM, Yan, Zheng  wrote:
>>> What I meant is another related problem in ceph_mkdir() where the
>>> i_ctime field of the parent inode is different between the persistent
>>> representation in the mds and the in-memory representation.
>>>
>>
>> I don't see any problem in mkdir case. Parent inode's i_ctime in mds is set 
>> to
>> r_stamp. When client receives request reply, it set its in-memory inode's 
>> ctime
>> to the same time stamp.
>
> Ok, I see it now, thanks for the clarification. Most other file systems do 
> this
> the other way round and update all fields in the in-memory inode structure
> first and then write that to persistent storage, so I was getting confused 
> about
> the order of events here.
>
> If I understand it all right, we have three different behaviors in ceph now,
> though the differences are very minor and probably don't ever matter:
>
> - in setattr(), we update ctime in the in-memory inode first and then send
>   the same time to the mds, and expect to set it again when the reply comes.
>
> - in ceph_write_iter write() and mmap/page_mkwrite(), we call
>   file_update_time() to set i_mtime and i_ctime to the same
>   timestamp first once a write is observed by the fs and then take
>   two other timestamps that we send to the mds, and update the
>   in-memory inode a second time when the reply comes. ctime
>   is never older than mtime here, as far as I can tell, but it may
>   be newer when the timer interrupt happens between taking the
>   two stamps.

We don't use a request to send i_mtime/i_ctime to the mds in this case.
Instead, we use a cap flush message: i_mtime/i_ctime are encoded
directly in the cap flush message. When the mds receives the cap flush
message, it writes i_mtime/i_ctime to persistent storage and sends a cap
flush ack message to the client. (When the client receives the cap flush
ack message, it does not update i_mtime/i_ctime.) So there is no issue
like the one you described.
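
For illustration only, a rough sketch of that write-side path (the function
name is made up, locking and snapshot handling are heavily simplified, so
treat the details as assumptions rather than the actual fs/ceph write path;
the helpers themselves are existing ceph ones):

static int sketch_stamp_write(struct inode *inode)
{
	struct ceph_inode_info *ci = ceph_inode(inode);
	struct ceph_cap_flush *prealloc_cf;
	int dirty;

	prealloc_cf = ceph_alloc_cap_flush();
	if (!prealloc_cf)
		return -ENOMEM;

	/* the client stamps the inode itself... */
	inode->i_mtime = inode->i_ctime = current_time(inode);

	/* ...and marks the caps dirty; the stamps later travel to the MDS
	 * in the cap flush message, not in an MDS request */
	spin_lock(&ci->i_ceph_lock);
	dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR, &prealloc_cf);
	spin_unlock(&ci->i_ceph_lock);
	__mark_inode_dirty(inode, dirty);

	ceph_free_cap_flush(prealloc_cf);
	return 0;
}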

>
> - in all other calls, we only update the inode (and/or parent inode)
>   after the reply arrives.

There are two cases. 1. The client updates the in-memory inode's ctime
and sends the new ctime to the mds through a cap flush message. 2. The
client sets the mds request's r_stamp and sends the request to the mds;
the MDS updates the relevant inodes' ctime and sends a reply to the
client, and the client updates the in-memory inodes' ctime according to
the reply.

Regards
Yan, Zheng

>
>Arnd


Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-02 Thread Yan, Zheng
On Fri, Jun 2, 2017 at 7:33 PM, Arnd Bergmann  wrote:
> On Fri, Jun 2, 2017 at 1:18 PM, Yan, Zheng  wrote:
>> On Fri, Jun 2, 2017 at 6:51 PM, Arnd Bergmann  wrote:
>>> On Fri, Jun 2, 2017 at 12:10 PM, Yan, Zheng  wrote:
>>>> On Fri, Jun 2, 2017 at 5:45 PM, Arnd Bergmann  wrote:
>>>>> On Fri, Jun 2, 2017 at 4:09 AM, Yan, Zheng  wrote:
>>>>>> On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani  
>>>>>> wrote:
>>>>>>> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz  
>>>>>>> wrote:
>>>>>>>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng  wrote:
>>>>>
>>>>> I believe the bug you see is the result of the two timestamps
>>>>> currently being almost guaranteed to be different in the latest
>>>>> kernels.
>>>>> Changing r_stamp to use current_kernel_time() will make it the
>>>>> same value most of the time (as it was before Deepa's patch),
>>>>> but when the timer interrupt happens between the timestamps,
>>>>> the two are still different, it's just much harder to hit.
>>>>>
>>>>> I think the proper solution should be to change __ceph_setattr()
>>>>> in a way that has req->r_stamp always synchronized with i_ctime.
>>>>> If we copy i_ctime to r_stamp, that will also take care of the
>>>>> future issues with the planned changes to current_time().
>>>>>
>>>> I already have a patch
>>>> https://github.com/ceph/ceph-client/commit/24f54cd18e195a002ee3d2ab50dbc952fd9f82af
>>>
>>> Looks good to me. In case anyone cares:
>>> Acked-by: Arnd Bergmann 
>>>
>>>>> The part I don't understand is what else r_stamp (i.e. the time
>>>>> stamp in ceph_msg_data with type==
>>>>> CEPH_MSG_CLIENT_REQUEST) is used for, other than setting
>>>>> ctime in CEPH_MDS_OP_SETATTR.
>>>>>
>>>>> Will this be used to update the stored i_ctime for other operations
>>>>> too? If so, we would need to synchronize it with the in-memory
>>>>> i_ctime for all operations that do this.
>>>>>
>>>>
>>>> yes,  mds uses it to update ctime of modified inodes. For example,
>>>> when handling mkdir, mds set ctime of both parent inode and new inode
>>>> to r_stamp.
>>>
>>> I see, so we may have a variation of that problem there as well: From
>>> my reading of the code, the child inode is not in memory yet, so
>>> that seems fine, but I could not find where the parent in-memory inode
>>> i_ctime is updated in ceph, but it is most likely not the same as
>>> req->r_stamp (assuming it gets updated at all).
>>
>> i_ctime is updated when handling request reply, by ceph_fill_file_time().
>> __ceph_setattr() can update the in-memory inode's ctime after request
>> reply is received. The difference between ktime_get_real_ts() and
>> current_time() can be larger than round-trip time of request. So it's
>> still possible that __ceph_setattr() make ctime go back.
>
> But the __ceph_setattr() problem should be fixed by your patch, right?
>
> What I meant is another related problem in ceph_mkdir() where the
> i_ctime field of the parent inode is different between the persistent
> representation in the mds and the in-memory representation.
>

I don't see any problem in the mkdir case. The parent inode's i_ctime in the
mds is set to r_stamp. When the client receives the request reply, it sets
its in-memory inode's ctime to the same timestamp.

Regards
Yan, Zheng

> Arnd
>
>>> Would it make sense require all callers of ceph_mdsc_do_request()
>>> to update r_stamp at the same time as i_ctime to keep them in sync?


Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-02 Thread Yan, Zheng
On Fri, Jun 2, 2017 at 6:51 PM, Arnd Bergmann  wrote:
> On Fri, Jun 2, 2017 at 12:10 PM, Yan, Zheng  wrote:
>> On Fri, Jun 2, 2017 at 5:45 PM, Arnd Bergmann  wrote:
>>> On Fri, Jun 2, 2017 at 4:09 AM, Yan, Zheng  wrote:
>>>> On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani  
>>>> wrote:
>>>>> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz  
>>>>> wrote:
>>>>>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng  wrote:
>>>
>>> I believe the bug you see is the result of the two timestamps
>>> currently being almost guaranteed to be different in the latest
>>> kernels.
>>> Changing r_stamp to use current_kernel_time() will make it the
>>> same value most of the time (as it was before Deepa's patch),
>>> but when the timer interrupt happens between the timestamps,
>>> the two are still different, it's just much harder to hit.
>>>
>>> I think the proper solution should be to change __ceph_setattr()
>>> in a way that has req->r_stamp always synchronized with i_ctime.
>>> If we copy i_ctime to r_stamp, that will also take care of the
>>> future issues with the planned changes to current_time().
>>>
>> I already have a patch
>> https://github.com/ceph/ceph-client/commit/24f54cd18e195a002ee3d2ab50dbc952fd9f82af
>
> Looks good to me. In case anyone cares:
> Acked-by: Arnd Bergmann 
>
>>> The part I don't understand is what else r_stamp (i.e. the time
>>> stamp in ceph_msg_data with type==
>>> CEPH_MSG_CLIENT_REQUEST) is used for, other than setting
>>> ctime in CEPH_MDS_OP_SETATTR.
>>>
>>> Will this be used to update the stored i_ctime for other operations
>>> too? If so, we would need to synchronize it with the in-memory
>>> i_ctime for all operations that do this.
>>>
>>
>> yes,  mds uses it to update ctime of modified inodes. For example,
>> when handling mkdir, mds set ctime of both parent inode and new inode
>> to r_stamp.
>
> I see, so we may have a variation of that problem there as well: From
> my reading of the code, the child inode is not in memory yet, so
> that seems fine, but I could not find where the parent in-memory inode
> i_ctime is updated in ceph, but it is most likely not the same as
> req->r_stamp (assuming it gets updated at all).

i_ctime is updated when handling the request reply, by ceph_fill_file_time().
__ceph_setattr() can update the in-memory inode's ctime after the request
reply is received. The difference between ktime_get_real_ts() and
current_time() can be larger than the round-trip time of the request, so it's
still possible for __ceph_setattr() to make ctime go back.

Regards
Yan, Zheng


>
> Would it make sense require all callers of ceph_mdsc_do_request()
> to update r_stamp at the same time as i_ctime to keep them in sync?
>
> Arnd


Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-02 Thread Yan, Zheng
On Fri, Jun 2, 2017 at 5:45 PM, Arnd Bergmann  wrote:
> On Fri, Jun 2, 2017 at 4:09 AM, Yan, Zheng  wrote:
>> On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani  
>> wrote:
>>> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz  wrote:
>>>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng  wrote:
>>>>> On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann  wrote:
>>>>>> On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng  wrote:
>>>>>>> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani  
>>>>>>> wrote:
>>>>>>
>>>>>>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>>>>>>> index 517838b..77204da 100644
>>>>>>>> --- a/drivers/block/rbd.c
>>>>>>>> +++ b/drivers/block/rbd.c
>>>>>>>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct 
>>>>>>>> rbd_obj_request *obj_request)
>>>>>>>>  {
>>>>>>>> struct ceph_osd_request *osd_req = obj_request->osd_req;
>>>>>>>>
>>>>>>>> -   osd_req->r_mtime = CURRENT_TIME;
>>>>>>>> +   ktime_get_real_ts(&osd_req->r_mtime);
>>>>>>>> osd_req->r_data_offset = obj_request->offset;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>>>> index c681762..1d3fa90 100644
>>>>>>>> --- a/fs/ceph/mds_client.c
>>>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>>>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request *
>>>>>>>>  ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int 
>>>>>>>> mode)
>>>>>>>>  {
>>>>>>>> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
>>>>>>>> +   struct timespec ts;
>>>>>>>>
>>>>>>>> if (!req)
>>>>>>>> return ERR_PTR(-ENOMEM);
>>>>>>>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client 
>>>>>>>> *mdsc, int op, int mode)
>>>>>>>> init_completion(&req->r_safe_completion);
>>>>>>>> INIT_LIST_HEAD(&req->r_unsafe_item);
>>>>>>>>
>>>>>>>> -   req->r_stamp = current_fs_time(mdsc->fsc->sb);
>>>>>>>> +   ktime_get_real_ts(&ts);
>>>>>>>> +   req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);
>>>>>>>
>>>>>>> This change causes our kernel_untar_tar test case to fail (inode's
>>>>>>> ctime goes back). The reason is that there is time drift between the
>>>>>>> time stamps got by  ktime_get_real_ts() and current_time(). We need to
>>>>>>> revert this change until current_time() uses ktime_get_real_ts()
>>>>>>> internally.
>>>>>>
>>>>>> Hmm, the change was not supposed to have a user-visible effect, so
>>>>>> something has gone wrong, but I don't immediately see how it
>>>>>> relates to what you observe.
>>>>>>
>>>>>> ktime_get_real_ts() and current_time() use the same time base, there
>>>>>> is no drift, but there is a difference in resolution, as the latter uses
>>>>>> the time stamp of the last jiffies update, which may be up to one jiffy
>>>>>> (10ms) behind the exact time we put in the request stamps here.
>>>>>>
>>>>>> Do you still see problems if you use current_kernel_time() instead of
>>>>>> ktime_get_real_ts()?
>>>>>
>>>>> The problem disappears after using current_kernel_time().
>>>>>
>>>>> https://github.com/ceph/ceph-client/commit/2e0f648da23167034a3cf1500bc90ec60aef2417
>>>>
>>>> From the commit above:
>>>> "It seems there is time drift between ktime_get_real_ts() and
>>>> current_kernel_time()"
>>>>
>>>> Its more of a granularity difference. current_kernel_time() returns
>>>> the cached time at the last tick, where as ktime_get_real_ts() reads
>>>> the clocksource hardware and returns the im

Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-01 Thread Yan, Zheng
On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani  wrote:
> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz  wrote:
>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng  wrote:
>>> On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann  wrote:
>>>> On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng  wrote:
>>>>> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani  
>>>>> wrote:
>>>>
>>>>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>>>>> index 517838b..77204da 100644
>>>>>> --- a/drivers/block/rbd.c
>>>>>> +++ b/drivers/block/rbd.c
>>>>>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct 
>>>>>> rbd_obj_request *obj_request)
>>>>>>  {
>>>>>> struct ceph_osd_request *osd_req = obj_request->osd_req;
>>>>>>
>>>>>> -   osd_req->r_mtime = CURRENT_TIME;
>>>>>> +   ktime_get_real_ts(&osd_req->r_mtime);
>>>>>> osd_req->r_data_offset = obj_request->offset;
>>>>>>  }
>>>>>>
>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>> index c681762..1d3fa90 100644
>>>>>> --- a/fs/ceph/mds_client.c
>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request *
>>>>>>  ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>>>>>>  {
>>>>>> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
>>>>>> +   struct timespec ts;
>>>>>>
>>>>>> if (!req)
>>>>>> return ERR_PTR(-ENOMEM);
>>>>>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client 
>>>>>> *mdsc, int op, int mode)
>>>>>> init_completion(&req->r_safe_completion);
>>>>>> INIT_LIST_HEAD(&req->r_unsafe_item);
>>>>>>
>>>>>> -   req->r_stamp = current_fs_time(mdsc->fsc->sb);
>>>>>> +   ktime_get_real_ts(&ts);
>>>>>> +   req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);
>>>>>
>>>>> This change causes our kernel_untar_tar test case to fail (inode's
>>>>> ctime goes back). The reason is that there is time drift between the
>>>>> time stamps got by  ktime_get_real_ts() and current_time(). We need to
>>>>> revert this change until current_time() uses ktime_get_real_ts()
>>>>> internally.
>>>>
>>>> Hmm, the change was not supposed to have a user-visible effect, so
>>>> something has gone wrong, but I don't immediately see how it
>>>> relates to what you observe.
>>>>
>>>> ktime_get_real_ts() and current_time() use the same time base, there
>>>> is no drift, but there is a difference in resolution, as the latter uses
>>>> the time stamp of the last jiffies update, which may be up to one jiffy
>>>> (10ms) behind the exact time we put in the request stamps here.
>>>>
>>>> Do you still see problems if you use current_kernel_time() instead of
>>>> ktime_get_real_ts()?
>>>
>>> The problem disappears after using current_kernel_time().
>>>
>>> https://github.com/ceph/ceph-client/commit/2e0f648da23167034a3cf1500bc90ec60aef2417
>>
>> From the commit above:
>> "It seems there is time drift between ktime_get_real_ts() and
>> current_kernel_time()"
>>
>> Its more of a granularity difference. current_kernel_time() returns
>> the cached time at the last tick, where as ktime_get_real_ts() reads
>> the clocksource hardware and returns the immediate time.
>>
>> Filesystems usually use the cached time (similar to
>> CLOCK_REALTIME_COARSE), for performance reasons, as touching the
>> clocksource takes time.
>
> Alternatively, it would be best for this code also to use current_time().
> I had suggested this in one of the previous versions of the patch.
> The implementation of current_time() will change when we switch vfs to
> use 64 bit time. This will prevent such errors from happening again.
> But, this also means there is more code reordering for these modules
> to get a reference to inode.
>

I took a look; it's quite inconvenient to use current_time(). I prefer
to temporarily use current_kernel_time().

Regards
Yan, Zheng



> -Deepa


Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-01 Thread Yan, Zheng
On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann  wrote:
> On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng  wrote:
>> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani  
>> wrote:
>
>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>> index 517838b..77204da 100644
>>> --- a/drivers/block/rbd.c
>>> +++ b/drivers/block/rbd.c
>>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct 
>>> rbd_obj_request *obj_request)
>>>  {
>>> struct ceph_osd_request *osd_req = obj_request->osd_req;
>>>
>>> -   osd_req->r_mtime = CURRENT_TIME;
>>> +   ktime_get_real_ts(&osd_req->r_mtime);
>>> osd_req->r_data_offset = obj_request->offset;
>>>  }
>>>
>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>> index c681762..1d3fa90 100644
>>> --- a/fs/ceph/mds_client.c
>>> +++ b/fs/ceph/mds_client.c
>>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request *
>>>  ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>>>  {
>>> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
>>> +   struct timespec ts;
>>>
>>> if (!req)
>>> return ERR_PTR(-ENOMEM);
>>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client 
>>> *mdsc, int op, int mode)
>>> init_completion(&req->r_safe_completion);
>>> INIT_LIST_HEAD(&req->r_unsafe_item);
>>>
>>> -   req->r_stamp = current_fs_time(mdsc->fsc->sb);
>>> +   ktime_get_real_ts(&ts);
>>> +   req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);
>>
>> This change causes our kernel_untar_tar test case to fail (inode's
>> ctime goes back). The reason is that there is time drift between the
>> time stamps got by  ktime_get_real_ts() and current_time(). We need to
>> revert this change until current_time() uses ktime_get_real_ts()
>> internally.
>
> Hmm, the change was not supposed to have a user-visible effect, so
> something has gone wrong, but I don't immediately see how it
> relates to what you observe.
>
> ktime_get_real_ts() and current_time() use the same time base, there
> is no drift, but there is a difference in resolution, as the latter uses
> the time stamp of the last jiffies update, which may be up to one jiffy
> (10ms) behind the exact time we put in the request stamps here.
>
It happens in the following sequence of events:

1. Create a new file; the inode's ctime is set via ktime_get_real_ts().
2. chmod the new file; the inode's ctime is set via current_time().

The inode's ctime goes back when current_time() is behind ktime_get_real_ts().
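
A small illustration of why the second stamp can compare before the first
(sketch only; the wrapper function is invented, the timespec helpers are the
ones available in this era):

static void sketch_compare_stamps(void)
{
	struct timespec fine, coarse;

	ktime_get_real_ts(&fine);	/* step 1: reads the clocksource "now" */
	coarse = current_kernel_time(); /* step 2: coarse time cached at the
					 * last tick; current_time() is based
					 * on this */

	if (timespec_compare(&coarse, &fine) < 0)
		pr_info("ctime would appear to go backwards\n");
}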

Regards
Yan, Zheng

> Do you still see problems if you use current_kernel_time() instead of
> ktime_get_real_ts()?
>
> Arnd


Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-01 Thread Yan, Zheng
On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann  wrote:
> On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng  wrote:
>> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani  
>> wrote:
>
>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>> index 517838b..77204da 100644
>>> --- a/drivers/block/rbd.c
>>> +++ b/drivers/block/rbd.c
>>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct 
>>> rbd_obj_request *obj_request)
>>>  {
>>> struct ceph_osd_request *osd_req = obj_request->osd_req;
>>>
>>> -   osd_req->r_mtime = CURRENT_TIME;
>>> +   ktime_get_real_ts(&osd_req->r_mtime);
>>> osd_req->r_data_offset = obj_request->offset;
>>>  }
>>>
>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>> index c681762..1d3fa90 100644
>>> --- a/fs/ceph/mds_client.c
>>> +++ b/fs/ceph/mds_client.c
>>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request *
>>>  ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>>>  {
>>> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
>>> +   struct timespec ts;
>>>
>>> if (!req)
>>> return ERR_PTR(-ENOMEM);
>>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client 
>>> *mdsc, int op, int mode)
>>> init_completion(&req->r_safe_completion);
>>> INIT_LIST_HEAD(&req->r_unsafe_item);
>>>
>>> -   req->r_stamp = current_fs_time(mdsc->fsc->sb);
>>> +   ktime_get_real_ts(&ts);
>>> +   req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);
>>
>> This change causes our kernel_untar_tar test case to fail (inode's
>> ctime goes back). The reason is that there is time drift between the
>> time stamps got by  ktime_get_real_ts() and current_time(). We need to
>> revert this change until current_time() uses ktime_get_real_ts()
>> internally.
>
> Hmm, the change was not supposed to have a user-visible effect, so
> something has gone wrong, but I don't immediately see how it
> relates to what you observe.
>
> ktime_get_real_ts() and current_time() use the same time base, there
> is no drift, but there is a difference in resolution, as the latter uses
> the time stamp of the last jiffies update, which may be up to one jiffy
> (10ms) behind the exact time we put in the request stamps here.
>
> Do you still see problems if you use current_kernel_time() instead of
> ktime_get_real_ts()?

The problem disappears after using current_kernel_time().

https://github.com/ceph/ceph-client/commit/2e0f648da23167034a3cf1500bc90ec60aef2417


Regards
Yan, Zheng
>
> Arnd


Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-01 Thread Yan, Zheng
On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani  wrote:
> CURRENT_TIME is not y2038 safe.
> The macro will be deleted and all the references to it
> will be replaced by ktime_get_* apis.
>
> struct timespec is also not y2038 safe.
> Retain timespec for timestamp representation here as ceph
> uses it internally everywhere.
> These references will be changed to use struct timespec64
> in a separate patch.
>
> The current_fs_time() api is being changed to use vfs
> struct inode* as an argument instead of struct super_block*.
>
> Set the new mds client request r_stamp field using
> ktime_get_real_ts() instead of using current_fs_time().
>
> Also, since r_stamp is used as mtime on the server, use
> timespec_trunc() to truncate the timestamp, using the right
> granularity from the superblock.
>
> This api will be transitioned to be y2038 safe along
> with vfs.
>
> Signed-off-by: Deepa Dinamani 
> Reviewed-by: Arnd Bergmann 
> ---
>  drivers/block/rbd.c   | 2 +-
>  fs/ceph/mds_client.c  | 4 +++-
>  net/ceph/messenger.c  | 6 --
>  net/ceph/osd_client.c | 4 ++--
>  4 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 517838b..77204da 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct 
> rbd_obj_request *obj_request)
>  {
> struct ceph_osd_request *osd_req = obj_request->osd_req;
>
> -   osd_req->r_mtime = CURRENT_TIME;
> +   ktime_get_real_ts(&osd_req->r_mtime);
> osd_req->r_data_offset = obj_request->offset;
>  }
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index c681762..1d3fa90 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -1666,6 +1666,7 @@ struct ceph_mds_request *
>  ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>  {
> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
> +   struct timespec ts;
>
> if (!req)
> return ERR_PTR(-ENOMEM);
> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client *mdsc, 
> int op, int mode)
> init_completion(&req->r_safe_completion);
> INIT_LIST_HEAD(&req->r_unsafe_item);
>
> -   req->r_stamp = current_fs_time(mdsc->fsc->sb);
> +   ktime_get_real_ts(&ts);
> +   req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);

This change causes our kernel_untar_tar test case to fail (inode's
ctime goes back). The reason is that there is time drift between the
time stamps got by  ktime_get_real_ts() and current_time(). We need to
revert this change until current_time() uses ktime_get_real_ts()
internally.

Regards
Yan, Zheng


>
> req->r_op = op;
> req->r_direct_mode = mode;
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index f76bb33..5766a6c 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1386,8 +1386,9 @@ static void prepare_write_keepalive(struct 
> ceph_connection *con)
> dout("prepare_write_keepalive %p\n", con);
> con_out_kvec_reset(con);
> if (con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2) {
> -   struct timespec now = CURRENT_TIME;
> +   struct timespec now;
>
> +   ktime_get_real_ts(&now);
> con_out_kvec_add(con, sizeof(tag_keepalive2), 
> &tag_keepalive2);
> ceph_encode_timespec(&con->out_temp_keepalive2, &now);
> con_out_kvec_add(con, sizeof(con->out_temp_keepalive2),
> @@ -3176,8 +3177,9 @@ bool ceph_con_keepalive_expired(struct ceph_connection 
> *con,
>  {
> if (interval > 0 &&
> (con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2)) {
> -   struct timespec now = CURRENT_TIME;
> +   struct timespec now;
> struct timespec ts;
> +   ktime_get_real_ts(&now);
> jiffies_to_timespec(interval, &ts);
> ts = timespec_add(con->last_keepalive_ack, ts);
> return timespec_compare(&now, &ts) >= 0;
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index e15ea9e..242d7c0 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -3574,7 +3574,7 @@ ceph_osdc_watch(struct ceph_osd_client *osdc,
> ceph_oid_copy(&lreq->t.base_oid, oid);
> ceph_oloc_copy(&lreq->t.base_oloc, oloc);
> lreq->t.flags = CEPH_OSD_FLAG_WRITE;
> -   lr

Re: [PATCH] ceph: check i_nlink while converting a file handle to dentry

2017-05-18 Thread Yan, Zheng

> On 19 May 2017, at 00:48, Luis Henriques  wrote:
> 
> On Thu, May 18, 2017 at 09:36:44PM +0800, Yan, Zheng wrote:
>> 
>>> On 17 May 2017, at 19:21, Luis Henriques  wrote:
>>> 
>>> Converting a file handle to a dentry can be done call after the inode
>>> unlink.  This means that __fh_to_dentry() requires an extra check to
>>> verify the number of links is not 0.
>>> 
>>> The issue can be easily reproduced using xfstest generic/426, which does
>>> something like:
>>> 
>>> name_to_handle_at(&fh)
>>> echo 3 > /proc/sys/vm/drop_caches
>>> unlink()
>>> open_by_handle_at(&fh)
>>> 
>>> The call to open_by_handle_at() should fail, as the file doesn't exist
>>> anymore.
>>> 
>>> Cc: sta...@vger.kernel.org
>>> Link: http://tracker.ceph.com/issues/19958
>>> Signed-off-by: Luis Henriques 
>>> ---
>>> fs/ceph/export.c | 4 
>>> 1 file changed, 4 insertions(+)
>>> 
>>> diff --git a/fs/ceph/export.c b/fs/ceph/export.c
>>> index e8f11fa565c5..7df550c13d7f 100644
>>> --- a/fs/ceph/export.c
>>> +++ b/fs/ceph/export.c
>>> @@ -91,6 +91,10 @@ static struct dentry *__fh_to_dentry(struct super_block 
>>> *sb, u64 ino)
>>> ceph_mdsc_put_request(req);
>>> if (!inode)
>>> return ERR_PTR(-ESTALE);
>>> +   if (inode->i_nlink == 0) {
>>> +   iput(inode);
>>> +   return ERR_PTR(-ESTALE);
>>> +   }
>>> }
>> 
>> maybe we should do this check in MDS
> 
> Thank you for your review.  Are you suggesting to do this check *only* in
> the MDS?  I guess that's probably a good thing to do too, but don't you
> think it's a good idea to keep it on the client side as well?

Applied. Thanks

Yan, Zheng

> 
> Cheers,
> --
> Luís



Re: [PATCH] ceph: check i_nlink while converting a file handle to dentry

2017-05-18 Thread Yan, Zheng

> On 17 May 2017, at 19:21, Luis Henriques  wrote:
> 
> Converting a file handle to a dentry can be done call after the inode
> unlink.  This means that __fh_to_dentry() requires an extra check to
> verify the number of links is not 0.
> 
> The issue can be easily reproduced using xfstest generic/426, which does
> something like:
> 
>  name_to_handle_at(&fh)
>  echo 3 > /proc/sys/vm/drop_caches
>  unlink()
>  open_by_handle_at(&fh)
> 
> The call to open_by_handle_at() should fail, as the file doesn't exist
> anymore.
> 
> Cc: sta...@vger.kernel.org
> Link: http://tracker.ceph.com/issues/19958
> Signed-off-by: Luis Henriques 
> ---
> fs/ceph/export.c | 4 
> 1 file changed, 4 insertions(+)
> 
> diff --git a/fs/ceph/export.c b/fs/ceph/export.c
> index e8f11fa565c5..7df550c13d7f 100644
> --- a/fs/ceph/export.c
> +++ b/fs/ceph/export.c
> @@ -91,6 +91,10 @@ static struct dentry *__fh_to_dentry(struct super_block 
> *sb, u64 ino)
>   ceph_mdsc_put_request(req);
>   if (!inode)
>   return ERR_PTR(-ESTALE);
> + if (inode->i_nlink == 0) {
> + iput(inode);
> + return ERR_PTR(-ESTALE);
> + }
>   }

maybe we should do this check in MDS

Regards
Yan, Zheng

> 
>   return d_obtain_alias(inode);



Re: [PATCH] ceph: Check that the new inode size is within limits in ceph_fallocate()

2017-05-07 Thread Yan, Zheng
On Sat, May 6, 2017 at 1:28 AM, Luis Henriques  wrote:
> Currently the ceph client doesn't respect the rlimit in fallocate.  This
> means that a user can allocate a file with size > RLIMIT_FSIZE.  This
> patch adds the call to inode_newsize_ok() to verify filesystem limits and
> ulimits.  This should make ceph successfully run xfstest generic/228.
>
> Signed-off-by: Luis Henriques 
> ---
>  fs/ceph/file.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 26cc95421cca..bc5809d4d2d4 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1636,8 +1636,12 @@ static long ceph_fallocate(struct file *file, int mode,
> }
>
> size = i_size_read(inode);
> -   if (!(mode & FALLOC_FL_KEEP_SIZE))
> +   if (!(mode & FALLOC_FL_KEEP_SIZE)) {
> endoff = offset + length;
> +   ret = inode_newsize_ok(inode, endoff);
> +   if (ret)
> +   goto unlock;
> +   }
>
> if (fi->fmode & CEPH_FILE_MODE_LAZY)
> want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
> --

Applied, thanks

Yan, Zheng

> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ceph: fix memory leak in __ceph_setxattr()

2017-04-28 Thread Yan, Zheng

> On 28 Apr 2017, at 18:14, Luis Henriques  wrote:
> 
> The ceph_inode_xattr needs to be released when removing an xattr.  Easily
> reproducible running the 'generic/020' test from xfstests or simply by
> doing:
> 
>  attr -s attr0 -V 0 /mnt/test && attr -r attr0 /mnt/test
> 
> While there, also fix the error path.
> 
> Here's the kmemleak splat:
> 
> unreferenced object 0x88001f86fbc0 (size 64):
>  comm "attr", pid 244, jiffies 4294904246 (age 98.464s)
>  hex dump (first 32 bytes):
>40 fa 86 1f 00 88 ff ff 80 32 38 1f 00 88 ff ff  @28.
>00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  
>  backtrace:
>[] kmemleak_alloc+0x49/0xa0
>[] kmem_cache_alloc+0x9b/0xf0
>[] __ceph_setxattr+0x17e/0x820
>[] ceph_set_xattr_handler+0x37/0x40
>[] __vfs_removexattr+0x4b/0x60
>[] vfs_removexattr+0x77/0xd0
>[] removexattr+0x41/0x60
>[] path_removexattr+0x75/0xa0
>[] SyS_lremovexattr+0xb/0x10
>[] entry_SYSCALL_64_fastpath+0x13/0x94
>[] 0x
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Luis Henriques 
> ---
> fs/ceph/xattr.c | 3 +++
> 1 file changed, 3 insertions(+)
> 
> diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
> index febc28f9e2c2..75267cdd5dfd 100644
> --- a/fs/ceph/xattr.c
> +++ b/fs/ceph/xattr.c
> @@ -392,6 +392,7 @@ static int __set_xattr(struct ceph_inode_info *ci,
> 
>   if (update_xattr) {
>   int err = 0;
> +
>   if (xattr && (flags & XATTR_CREATE))
>   err = -EEXIST;
>   else if (!xattr && (flags & XATTR_REPLACE))
> @@ -399,12 +400,14 @@ static int __set_xattr(struct ceph_inode_info *ci,
>   if (err) {
>   kfree(name);
>   kfree(val);
> + kfree(*newxattr);
>   return err;
>   }
>   if (update_xattr < 0) {
>   if (xattr)
>   __remove_xattr(ci, xattr);
>   kfree(name);
> + kfree(*newxattr);
>   return 0;
>   }
>   }

Applied, thanks

Yan, Zheng




Re: [PATCH v2] ceph: Fix file open flags on ppc64

2017-04-28 Thread Yan, Zheng

> On 28 Apr 2017, at 17:14, Alexander Graf  wrote:
> 
> 
> 
> On 28.04.17 09:57, Yan, Zheng wrote:
>> 
>>> On 28 Apr 2017, at 00:34, Alexander Graf  wrote:
>>> 
>>> The file open flags (O_foo) are platform specific and should never go
>>> out to an interface that is not local to the system.
>>> 
>>> Unfortunately these flags have leaked out onto the wire in the cephfs
>>> implementation. That lead to bogus flags getting transmitted on ppc64.
>>> 
>>> This patch converts the kernel view of flags to the ceph view of file
>>> open flags.
>>> 
>>> Fixes: 124e68e74 ("ceph: file operations")
>>> Signed-off-by: Alexander Graf 
>>> 
>>> —
>> 
>> I removed the "unused open flags” dout and applied the patch. Thank you for 
>> tracking down and fixing the bug.
> 
> I actually put that in on purpose, in case anyone in 2 years tries to find 
> out why a particular flag doesn't get populated :).
> 
Ok, I will put it back.

Regards
Yan, Zheng

> 
> Alex



Re: [PATCH v2] ceph: Fix file open flags on ppc64

2017-04-28 Thread Yan, Zheng

> On 28 Apr 2017, at 00:34, Alexander Graf  wrote:
> 
> The file open flags (O_foo) are platform specific and should never go
> out to an interface that is not local to the system.
> 
> Unfortunately these flags have leaked out onto the wire in the cephfs
> implementation. That lead to bogus flags getting transmitted on ppc64.
> 
> This patch converts the kernel view of flags to the ceph view of file
> open flags.
> 
> Fixes: 124e68e74 ("ceph: file operations")
> Signed-off-by: Alexander Graf 
> 
> —

I removed the "unused open flags” dout and applied the patch. Thank you for 
tracking down and fixing the bug.

Regards
Yan, Zheng

> 
> v1 -> v2:
> 
>  - Only convert flags mds knows about
>  - Fix le conversion
> ---
> fs/ceph/file.c   | 35 ++-
> include/linux/ceph/ceph_fs.h | 12 
> 2 files changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 26cc954..9cac018 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -103,6 +103,39 @@ static size_t dio_get_pagev_size(const struct iov_iter 
> *it)
>   return ERR_PTR(ret);
> }
> 
> +static __le32 ceph_flags_sys2wire(u32 flags)
> +{
> + u32 wire_flags = 0;
> +
> + switch (flags & O_ACCMODE) {
> + case O_RDONLY:
> + wire_flags |= CEPH_O_RDONLY;
> + break;
> + case O_WRONLY:
> + wire_flags |= CEPH_O_WRONLY;
> + break;
> + case O_RDWR:
> + wire_flags |= CEPH_O_RDWR;
> + break;
> + }
> + flags &= ~O_ACCMODE;
> +
> +#define ceph_sys2wire(a) if (flags & a) { wire_flags |= CEPH_##a; flags &= 
> ~a; }
> +
> + ceph_sys2wire(O_CREAT);
> + ceph_sys2wire(O_EXCL);
> + ceph_sys2wire(O_TRUNC);
> + ceph_sys2wire(O_DIRECTORY);
> + ceph_sys2wire(O_NOFOLLOW);
> +
> +#undef ceph_sys2wire
> +
> + if (flags)
> + dout("Unused open flags: %x", flags);
> +
> + return cpu_to_le32(wire_flags);
> +}
> +
> /*
>  * Prepare an open request.  Preallocate ceph_cap to avoid an
>  * inopportune ENOMEM later.
> @@ -123,7 +156,7 @@ static size_t dio_get_pagev_size(const struct iov_iter 
> *it)
>   if (IS_ERR(req))
>   goto out;
>   req->r_fmode = ceph_flags_to_mode(flags);
> - req->r_args.open.flags = cpu_to_le32(flags);
> + req->r_args.open.flags = ceph_flags_sys2wire(flags);
>   req->r_args.open.mode = cpu_to_le32(create_mode);
> out:
>   return req;
> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> index f4b2ee1..9581c70 100644
> --- a/include/linux/ceph/ceph_fs.h
> +++ b/include/linux/ceph/ceph_fs.h
> @@ -366,6 +366,18 @@ enum {
> #define CEPH_READDIR_FRAG_COMPLETE(1<<8)
> #define CEPH_READDIR_HASH_ORDER   (1<<9)
> 
> +/*
> + * open request flags
> + */
> +#define CEPH_O_RDONLY          00000000
> +#define CEPH_O_WRONLY          00000001
> +#define CEPH_O_RDWR            00000002
> +#define CEPH_O_CREAT           00000100
> +#define CEPH_O_EXCL            00000200
> +#define CEPH_O_TRUNC           00001000
> +#define CEPH_O_DIRECTORY       00200000
> +#define CEPH_O_NOFOLLOW        00400000
> +
> union ceph_mds_request_args {
>   struct {
>   __le32 mask; /* CEPH_CAP_* */
> -- 
> 1.8.5.6
> 



Re: [PATCH] ceph: Fix file open flags on ppc64

2017-04-24 Thread Yan, Zheng

> On 21 Apr 2017, at 21:59, Alexander Graf  wrote:
> 
> 
> 
> On 21.04.17 04:22, Yan, Zheng wrote:
>> 
>>> On 20 Apr 2017, at 20:40, Alexander Graf  wrote:
>>> 
>>> The file open flags (O_foo) are platform specific and should never go
>>> out to an interface that is not local to the system.
>>> 
>>> Unfortunately these flags have leaked out onto the wire in the cephfs
>>> implementation. That lead to bogus flags getting transmitted on ppc64.
>>> 
>>> This patch converts the kernel view of flags to the ceph view of file
>>> open flags. On x86 systems, the new function should get optimized out
>>> by smart compilers. On systems where the flags differ, it should adapt
>>> them accordingly.
>>> 
>>> Fixes: 124e68e74 ("ceph: file operations")
>>> Signed-off-by: Alexander Graf 
>>> ---
>>> fs/ceph/file.c   | 45 
>>> +++-
>>> include/linux/ceph/ceph_fs.h | 24 +++
>>> 2 files changed, 68 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 26cc954..0ed6392 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -103,6 +103,49 @@ static size_t dio_get_pagev_size(const struct iov_iter 
>>> *it)
>>> return ERR_PTR(ret);
>>> }
>>> 
>>> +static u32 ceph_flags_sys2wire(u32 flags)
>>> +{
>>> +   u32 wire_flags = 0;
>>> +
>>> +   switch (flags & O_ACCMODE) {
>>> +   case O_RDONLY:
>>> +   wire_flags |= CEPH_O_RDONLY;
>>> +   break;
>>> +   case O_WRONLY:
>>> +   wire_flags |= CEPH_O_WRONLY;
>>> +   break;
>>> +   case O_RDWR:
>>> +   wire_flags |= CEPH_O_RDWR;
>>> +   break;
>>> +   }
>>> +   flags &= ~O_ACCMODE;
>>> +
>>> +#define ceph_sys2wire(a) if (flags & a) { wire_flags |= CEPH_##a; flags &= 
>>> ~a; }
>>> +
>>> +   ceph_sys2wire(O_CREAT);
>>> +   ceph_sys2wire(O_NOCTTY);
>>> +   ceph_sys2wire(O_TRUNC);
>>> +   ceph_sys2wire(O_APPEND);
>>> +   ceph_sys2wire(O_NONBLOCK);
>>> +   ceph_sys2wire(O_DSYNC);
>>> +   ceph_sys2wire(FASYNC);
>>> +   ceph_sys2wire(O_DIRECT);
>>> +   ceph_sys2wire(O_LARGEFILE);
>>> +   ceph_sys2wire(O_DIRECTORY);
>>> +   ceph_sys2wire(O_NOFOLLOW);
>>> +   ceph_sys2wire(O_NOATIME);
>>> +   ceph_sys2wire(O_CLOEXEC);
>>> +   ceph_sys2wire(__O_SYNC);
>>> +   ceph_sys2wire(O_PATH);
>>> +   ceph_sys2wire(__O_TMPFILE);
>>> +
>>> +#undef ceph_sys2wire
>>> +
>>> +   WARN_ONCE(flags, "Found unknown open flags: %x", flags);
>>> +
>>> +   return wire_flags;
>>> +}
>>> +
>>> /*
>>> * Prepare an open request.  Preallocate ceph_cap to avoid an
>>> * inopportune ENOMEM later.
>>> @@ -123,7 +166,7 @@ static size_t dio_get_pagev_size(const struct iov_iter 
>>> *it)
>>> if (IS_ERR(req))
>>> goto out;
>>> req->r_fmode = ceph_flags_to_mode(flags);
>>> -   req->r_args.open.flags = cpu_to_le32(flags);
>>> +   req->r_args.open.flags = ceph_flags_sys2wire(cpu_to_le32(flags));
>> 
>> cpu_to_le32(ceph_flags_sys2wire(flags)) ?
> 
> Jeff also pointed to it. Definitely, yes. My thinko.
> 
>> 
>>> req->r_args.open.mode = cpu_to_le32(create_mode);
>>> out:
>>> return req;
>>> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
>>> index f4b2ee1..d254b48 100644
>>> --- a/include/linux/ceph/ceph_fs.h
>>> +++ b/include/linux/ceph/ceph_fs.h
>>> @@ -366,6 +366,30 @@ enum {
>>> #define CEPH_READDIR_FRAG_COMPLETE  (1<<8)
>>> #define CEPH_READDIR_HASH_ORDER (1<<9)
>>> 
>>> +/*
>>> + * open request flags
>>> + */
>>> +#define CEPH_O_RDONLY  
>>> +#define CEPH_O_WRONLY  0001
>>> +#define CEPH_O_RDWR0002
>>> +#define CEPH_O_CREAT   0100
>>> +#define CEPH_O_EXCL0200
>>> +#define CEPH_O_NOCTTY  0400
>>> +#define CEPH_O_TRUNC   1000
>>> +#define CEPH_O_APPEND  2000
>>> +#define CEPH_O_NONBLOCK4000
>>> +#define CEPH_O_DSYNC   0001
>>> +#define CEPH_FASYNC0002
>>> +#define CEPH_O_DIRECT  0004
>>> +#define CEPH_O_LARGEFILE   0010
>>> +#define CEPH_O_DIRECTORY   0020
>>> +#define CEPH_O_NOFOLLOW0040
>>> +#define CEPH_O_NOATIME 0100
>>> +#define CEPH_O_CLOEXEC 0200
>>> +#define CEPH___O_SYNC  0400
>>> +#define CEPH_O_PATH01000
>>> +#define CEPH___O_TMPFILE   02000
>>> +
>> 
>> RDONLY, WRONLY, RDWR, CREAT, EXCL, TRUNC, DIRECTORY, NOFOLLOW  are used by 
>> mds, no need to define the rest.
> 
> So currently the function checks for any bits it doesn't know about. Do we 
> want to drop that check then?

Yes. I think it's better to only encode the flags that the MDS understands onto the wire.

Regards
Yan, Zheng 

> 
> 
> Alex



Re: [PATCH] ceph: Fix file open flags on ppc64

2017-04-20 Thread Yan, Zheng

> On 20 Apr 2017, at 20:40, Alexander Graf  wrote:
> 
> The file open flags (O_foo) are platform specific and should never go
> out to an interface that is not local to the system.
> 
> Unfortunately these flags have leaked out onto the wire in the cephfs
> implementation. That lead to bogus flags getting transmitted on ppc64.
> 
> This patch converts the kernel view of flags to the ceph view of file
> open flags. On x86 systems, the new function should get optimized out
> by smart compilers. On systems where the flags differ, it should adapt
> them accordingly.
> 
> Fixes: 124e68e74 ("ceph: file operations")
> Signed-off-by: Alexander Graf 
> ---
> fs/ceph/file.c   | 45 +++-
> include/linux/ceph/ceph_fs.h | 24 +++
> 2 files changed, 68 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 26cc954..0ed6392 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -103,6 +103,49 @@ static size_t dio_get_pagev_size(const struct iov_iter 
> *it)
>   return ERR_PTR(ret);
> }
> 
> +static u32 ceph_flags_sys2wire(u32 flags)
> +{
> + u32 wire_flags = 0;
> +
> + switch (flags & O_ACCMODE) {
> + case O_RDONLY:
> + wire_flags |= CEPH_O_RDONLY;
> + break;
> + case O_WRONLY:
> + wire_flags |= CEPH_O_WRONLY;
> + break;
> + case O_RDWR:
> + wire_flags |= CEPH_O_RDWR;
> + break;
> + }
> + flags &= ~O_ACCMODE;
> +
> +#define ceph_sys2wire(a) if (flags & a) { wire_flags |= CEPH_##a; flags &= 
> ~a; }
> +
> + ceph_sys2wire(O_CREAT);
> + ceph_sys2wire(O_NOCTTY);
> + ceph_sys2wire(O_TRUNC);
> + ceph_sys2wire(O_APPEND);
> + ceph_sys2wire(O_NONBLOCK);
> + ceph_sys2wire(O_DSYNC);
> + ceph_sys2wire(FASYNC);
> + ceph_sys2wire(O_DIRECT);
> + ceph_sys2wire(O_LARGEFILE);
> + ceph_sys2wire(O_DIRECTORY);
> + ceph_sys2wire(O_NOFOLLOW);
> + ceph_sys2wire(O_NOATIME);
> + ceph_sys2wire(O_CLOEXEC);
> + ceph_sys2wire(__O_SYNC);
> + ceph_sys2wire(O_PATH);
> + ceph_sys2wire(__O_TMPFILE);
> +
> +#undef ceph_sys2wire
> +
> + WARN_ONCE(flags, "Found unknown open flags: %x", flags);
> +
> + return wire_flags;
> +}
> +
> /*
>  * Prepare an open request.  Preallocate ceph_cap to avoid an
>  * inopportune ENOMEM later.
> @@ -123,7 +166,7 @@ static size_t dio_get_pagev_size(const struct iov_iter 
> *it)
>   if (IS_ERR(req))
>   goto out;
>   req->r_fmode = ceph_flags_to_mode(flags);
> - req->r_args.open.flags = cpu_to_le32(flags);
> + req->r_args.open.flags = ceph_flags_sys2wire(cpu_to_le32(flags));

cpu_to_le32(ceph_flags_sys2wire(flags)) ?

>   req->r_args.open.mode = cpu_to_le32(create_mode);
> out:
>   return req;
> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> index f4b2ee1..d254b48 100644
> --- a/include/linux/ceph/ceph_fs.h
> +++ b/include/linux/ceph/ceph_fs.h
> @@ -366,6 +366,30 @@ enum {
> #define CEPH_READDIR_FRAG_COMPLETE(1<<8)
> #define CEPH_READDIR_HASH_ORDER   (1<<9)
> 
> +/*
> + * open request flags
> + */
> +#define CEPH_O_RDONLY
> +#define CEPH_O_WRONLY0001
> +#define CEPH_O_RDWR  0002
> +#define CEPH_O_CREAT 0100
> +#define CEPH_O_EXCL  0200
> +#define CEPH_O_NOCTTY0400
> +#define CEPH_O_TRUNC 1000
> +#define CEPH_O_APPEND2000
> +#define CEPH_O_NONBLOCK  4000
> +#define CEPH_O_DSYNC 0001
> +#define CEPH_FASYNC  0002
> +#define CEPH_O_DIRECT0004
> +#define CEPH_O_LARGEFILE 0010
> +#define CEPH_O_DIRECTORY 0020
> +#define CEPH_O_NOFOLLOW  0040
> +#define CEPH_O_NOATIME   0100
> +#define CEPH_O_CLOEXEC   0200
> +#define CEPH___O_SYNC04000000
> +#define CEPH_O_PATH  01000
> +#define CEPH___O_TMPFILE 02000
> +

RDONLY, WRONLY, RDWR, CREAT, EXCL, TRUNC, DIRECTORY, NOFOLLOW are used by the MDS;
no need to define the rest.

> union ceph_mds_request_args {
>   struct {
>   __le32 mask; /* CEPH_CAP_* */

Besides, the ceph MDS server requires the fix too. Do you have a patch for that?

Regards
Yan, Zheng

> -- 
> 1.8.5.6
> 



Re: [PATCH v4] ceph: set io_pages bdi hint

2017-01-10 Thread Yan, Zheng

> On 10 Jan 2017, at 21:17, Andreas Gerstmayr  
> wrote:
> 
> This patch sets the io_pages bdi hint based on the rsize mount option.
> Without this patch large buffered reads (request size > max readahead)
> are processed sequentially in chunks of the readahead size (i.e. read
> requests are sent out up to the readahead size, then the
> do_generic_file_read() function waits until the first page is received).
> 
> With this patch read requests are sent out at once up to the size
> specified in the rsize mount option (default: 64 MB).
> 
> Signed-off-by: Andreas Gerstmayr 
> ---
> 
> Changes in v4:
>  - update documentation
> 
> (Note: This patch depends on kernel version 4.10-rc1)
> 
> 
> Documentation/filesystems/ceph.txt | 5 ++---
> fs/ceph/super.c| 8 
> fs/ceph/super.h| 4 ++--
> 3 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/filesystems/ceph.txt 
> b/Documentation/filesystems/ceph.txt
> index f5306ee..0b302a1 100644
> --- a/Documentation/filesystems/ceph.txt
> +++ b/Documentation/filesystems/ceph.txt
> @@ -98,11 +98,10 @@ Mount Options
>   size.
> 
>   rsize=X
> - Specify the maximum read size in bytes.  By default there is no
> - maximum.
> + Specify the maximum read size in bytes.  Default: 64 MB.
> 
>   rasize=X
> - Specify the maximum readahead.
> + Specify the maximum readahead.  Default: 8 MB.
> 
>   mount_timeout=X
>   Specify the timeout value for mount (in seconds), in the case
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index 6bd20d7..a0a0b6d 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -952,6 +952,14 @@ static int ceph_register_bdi(struct super_block *sb,
>   fsc->backing_dev_info.ra_pages =
>   VM_MAX_READAHEAD * 1024 / PAGE_SIZE;
> 
> + if (fsc->mount_options->rsize > fsc->mount_options->rasize &&
> + fsc->mount_options->rsize >= PAGE_SIZE)
> + fsc->backing_dev_info.io_pages =
> + (fsc->mount_options->rsize + PAGE_SIZE - 1)
> + >> PAGE_SHIFT;
> + else if (fsc->mount_options->rsize == 0)
> + fsc->backing_dev_info.io_pages = ULONG_MAX;
> +
>   err = bdi_register(&fsc->backing_dev_info, NULL, "ceph-%ld",
>  atomic_long_inc_return(&bdi_seq));
>   if (!err)
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 3373b61..88b2e6e 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -45,8 +45,8 @@
> #define ceph_test_mount_opt(fsc, opt) \
>   (!!((fsc)->mount_options->flags & CEPH_MOUNT_OPT_##opt))
> 
> -#define CEPH_RSIZE_DEFAULT 0   /* max read size */
> -#define CEPH_RASIZE_DEFAULT(8192*1024) /* readahead */
> +#define CEPH_RSIZE_DEFAULT  (64*1024*1024) /* max read size */
> +#define CEPH_RASIZE_DEFAULT (8192*1024)/* max readahead */
> #define CEPH_MAX_READDIR_DEFAULT1024
> #define CEPH_MAX_READDIR_BYTES_DEFAULT  (512*1024)
> #define CEPH_SNAPDIRNAME_DEFAULT".snap”

Applied, Thanks
Yan, Zheng

> -- 
> 1.8.3.1
> 



Re: [PATCH v2] ceph/iov_iter: fix bad iov_iter handling in ceph splice codepaths

2017-01-10 Thread Yan, Zheng
st struct kvec *kvec,
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 6b415b5a100d..b8fa377b0cef 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -745,6 +745,24 @@ size_t iov_iter_single_seg_count(const struct iov_iter 
> *i)
> }
> EXPORT_SYMBOL(iov_iter_single_seg_count);
> 
> +/* Return offset into page for current iov_iter segment */
> +size_t iov_iter_single_seg_page_offset(const struct iov_iter *i)
> +{
> + size_t  base;
> +
> + if (i->type & ITER_PIPE)
> + base = i->pipe->bufs[i->idx].offset;
> + else if (i->type & ITER_BVEC)
> + base = i->bvec->bv_offset;
> + else if (i->type & ITER_KVEC)
> + base = (size_t)i->kvec->iov_base;
> + else
> + base = (size_t)i->iov->iov_base;
> +
> + return (base + i->iov_offset) & (PAGE_SIZE - 1);
> +}
> +EXPORT_SYMBOL(iov_iter_single_seg_page_offset);
> +
> void iov_iter_kvec(struct iov_iter *i, int direction,
>   const struct kvec *kvec, unsigned long nr_segs,
>   size_t count)
> @@ -830,6 +848,59 @@ unsigned long iov_iter_gap_alignment(const struct 
> iov_iter *i)
> }
> EXPORT_SYMBOL(iov_iter_gap_alignment);
> 
> +/**
> + * iov_iter_pvec_size - find length of page aligned iovecs in iov_iter
> + * @i: iov_iter to in which to find the size
> + *
> + * Some filesystems can stitch together multiple iovecs into a single
> + * page vector when both the previous tail and current base are page
> + * aligned. This function discovers the length that can fit in a single
> + * pagevec and returns it.
> + */
> +size_t iov_iter_pvec_size(const struct iov_iter *i)
> +{
> + size_t size = i->count;
> + size_t pv_size = 0;
> + bool contig = false, first = true;
> +
> + if (!size)
> + return 0;
> +
> + /* Pipes are naturally aligned for this */
> + if (unlikely(i->type & ITER_PIPE))
> + return size;
> +
> + /*
> +  * An iov can be page vectored when the current base and previous
> +  * tail are both page aligned. Note that we don't require that the
> +  * initial base in the first iovec also be page aligned.
> +  */
> + iterate_all_kinds(i, size, v,
> + ({
> +  if (first || (contig && PAGE_ALIGNED(v.iov_base))) {
> + pv_size += v.iov_len;
> + first = false;
> + contig = PAGE_ALIGNED(v.iov_base + v.iov_len);
> +  }; 0;
> +  }),
> + ({
> +  if (first || (contig && v.bv_offset == 0)) {
> +     pv_size += v.bv_len;
> + first = false;
> + contig = PAGE_ALIGNED(v.bv_offset + v.bv_len);
> +  }
> +  }),
> + ({
> +  if (first || (contig && PAGE_ALIGNED(v.iov_base))) {
> + pv_size += v.iov_len;
> + first = false;
> + contig = PAGE_ALIGNED(v.iov_base + v.iov_len);
> +  }
> +  }))
> + return pv_size;
> +}
> +EXPORT_SYMBOL(iov_iter_pvec_size);
> +
> static inline size_t __pipe_get_pages(struct iov_iter *i,
>   size_t maxsize,
>   struct page **pages,

Reviewed-by: Yan, Zheng 
> -- 
> 2.7.4
> 



Re: [PATCH] ceph: fix spelling mistake: "enabing" -> "enabling"

2016-12-29 Thread Yan, Zheng
On Fri, Dec 30, 2016 at 4:19 AM, Colin King  wrote:
> From: Colin Ian King 
>
> trivial fix to spelling mistake in debug message
>
> Signed-off-by: Colin Ian King 
> ---
>  fs/ceph/cache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ceph/cache.c b/fs/ceph/cache.c
> index 5bc5d37..4e7421c 100644
> --- a/fs/ceph/cache.c
> +++ b/fs/ceph/cache.c
> @@ -234,7 +234,7 @@ void ceph_fscache_file_set_cookie(struct inode *inode, 
> struct file *filp)
> fscache_enable_cookie(ci->fscache, ceph_fscache_can_enable,
> inode);
> if (fscache_cookie_enabled(ci->fscache)) {
> -   dout("fscache_file_set_cookie %p %p enabing cache\n",
> +   dout("fscache_file_set_cookie %p %p enabling cache\n",
>          inode, filp);
> }
> }
> --
> 2.10.2

Applied, thanks

Yan, Zheng

>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ceph: cleanup ACCESS_ONCE -> READ_ONCE

2016-12-28 Thread Yan, Zheng
equest mdsmap err %d\n", err);
> @@ -3550,7 +3550,7 @@ void ceph_mdsc_sync(struct ceph_mds_client *mdsc)
> {
>   u64 want_tid, want_flush;
> 
> - if (ACCESS_ONCE(mdsc->fsc->mount_state) == CEPH_MOUNT_SHUTDOWN)
> + if (READ_ONCE(mdsc->fsc->mount_state) == CEPH_MOUNT_SHUTDOWN)
>   return;
> 
>   dout("sync\n");
> @@ -3581,7 +3581,7 @@ void ceph_mdsc_sync(struct ceph_mds_client *mdsc)
>  */
> static bool done_closing_sessions(struct ceph_mds_client *mdsc, int skipped)
> {
> - if (ACCESS_ONCE(mdsc->fsc->mount_state) == CEPH_MOUNT_SHUTDOWN)
> + if (READ_ONCE(mdsc->fsc->mount_state) == CEPH_MOUNT_SHUTDOWN)
>   return true;
>   return atomic_read(&mdsc->num_sessions) <= skipped;
> }

Applied, Thanks

Yan, Zheng

> -- 
> 2.10.2
> 



Re: [git pull] vfs fix

2016-11-28 Thread Yan, Zheng

> On 27 Nov 2016, at 10:51, Al Viro  wrote:
> 
> On Sun, Nov 27, 2016 at 02:25:09AM +, Al Viro wrote:
> 
>> Anyway, leaving that BUG_ON() had been wrong; I can send a followup
>> massaging that thing as you've suggested, if you are interested in
>> that.  But keep in mind that the whole iov_iter_get_pages...() calling
>> conventions are going to be changed, hopefully soon.
> 
> BTW, speaking of iov_iter_get_pages_alloc() callers and splice:
> "ceph: combine as many iovec as possile into one OSD request" has added
> static size_t dio_get_pagev_size(const struct iov_iter *it)
> {
>const struct iovec *iov = it->iov;
>const struct iovec *iovend = iov + it->nr_segs;
>size_t size;
> 
>size = iov->iov_len - it->iov_offset;
>/*
> * An iov can be page vectored when both the current tail
> * and the next base are page aligned.
> */
>while (PAGE_ALIGNED((iov->iov_base + iov->iov_len)) &&
>   (++iov < iovend && PAGE_ALIGNED((iov->iov_base {
>size += iov->iov_len;
>}
>dout("dio_get_pagevlen len = %zu\n", size);
>return size;
> }

The read/write interface for the ceph cluster accepts a page vector, but we can
only specify an offset for the first page. Except for the first and last pages,
all other pages must be full size. This function finds the length of the iovec
run that satisfies this requirement.
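
As an illustration only (a userspace sketch applying the same rule to a plain
iovec array; it is not the kernel code and error handling is kept minimal),
the coalescing rule looks roughly like this:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

static size_t pagev_size(const struct iovec *iov, int nr_segs, long pg)
{
        const struct iovec *end = iov + nr_segs;
        size_t size = iov->iov_len;

        /* merge only while the current tail and the next base are page aligned */
        while (((uintptr_t)iov->iov_base + iov->iov_len) % pg == 0 &&
               ++iov < end &&
               (uintptr_t)iov->iov_base % pg == 0)
                size += iov->iov_len;
        return size;
}

int main(void)
{
        long pg = sysconf(_SC_PAGESIZE);
        void *a, *b;

        if (posix_memalign(&a, pg, 2 * pg) || posix_memalign(&b, pg, pg))
                return 1;

        /* both boundaries page aligned: the two iovecs coalesce (3 pages) */
        struct iovec v1[2] = { { a, 2 * pg }, { b, pg } };
        /* first iovec ends mid-page: only the first segment is usable */
        struct iovec v2[2] = { { a, pg + 512 }, { b, pg } };

        printf("aligned:   %zu bytes\n", pagev_size(v1, 2, pg));
        printf("unaligned: %zu bytes\n", pagev_size(v2, 2, pg));
        free(a);
        free(b);
        return 0;
}

Only the middle pages of such a run must be full-sized; the first page may
carry an offset and the last one may be short, which is why the merge stops
as soon as a boundary falls off a page edge.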

Regards
Yan, Zheng

> 
> ... with 'it' possibly being bio_vec-backed iterator.  Could somebody
> explain what that code is trying to do?
> 
> I would really like to hear details - I'm not saying that the goal mentioned
> in the commit message is worthless, but I don't know ceph codebase well
> enough to tell what would be a good way to implement that.  In its current
> form it's obviously broken.



Re: [git pull] vfs.git

2016-11-11 Thread Yan, Zheng

> On 12 Nov 2016, at 01:25, Linus Torvalds  
> wrote:
> 
> On Thu, Nov 10, 2016 at 10:05 PM, Al Viro  wrote:
>>Christoph's and Jan's aio fixes, fixup for generic_file_splice_read
>> (removal of pointless detritus that actually breaks it when used for gfs2
>> ->splice_read()) and fixup for generic_file_read_iter() interaction with
>> ITER_PIPE destinations.
> 
> Hmm. I also just pulled the Ceph update that has commit 8a8d56176635
> ("ceph: use default file splice read callback"). I _think_ this splice
> fix makes that ceph change unnecessary. But testing is always good.

The commit is still needed. Al's fix only addresses the ITER_PIPE interaction
with direct_IO (it's a no-op there). The cephfs case is special: depending on
what capabilities the client holds, it is allowed or disallowed to read data
from the page cache, and the MDS changes the client's capabilities dynamically.
We don't want splice read to fail when the client is disallowed to get pages
from the page cache.

Regards
Yan, Zheng 



> Ilya? Can you double-check the current -git tree (well, what I *will*
> push out soon after it has passed my build tests)?
> 
> Because I think Ceph can go back to using generic_file_splice_read again.
> 
>  Linus



Re: [PATCH] ceph: fix printing wrong return variable in ceph_direct_read_write()

2016-11-08 Thread Yan, Zheng

> On 8 Nov 2016, at 17:16, Zhi Zhang  wrote:
> 
> Fix printing wrong return variable for invalidate_inode_pages2_range
> in ceph_direct_read_write().
> 
> Signed-off-by: Zhi Zhang 
> ---
> fs/ceph/file.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 18630e8..0136195 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -906,7 +906,7 @@ void ceph_sync_write_wait(struct inode *inode)
>   pos >> PAGE_SHIFT,
>   (pos + count) >> PAGE_SHIFT);
>   if (ret2 < 0)
> -   dout("invalidate_inode_pages2_range returned
> %d\n", ret);
> +   dout("invalidate_inode_pages2_range returned
> %d\n", ret2);
> 
>   flags = CEPH_OSD_FLAG_ORDERSNAP |
>   CEPH_OSD_FLAG_ONDISK |
> 
> 

Applied, Thanks

Yan, Zheng

> Regards,
> Zhi Zhang (David)
> Contact: zhang.david2...@gmail.com
> zhangz.da...@outlook.com



Re: [PATCH 07/28] ceph: avoid false positive maybe-uninitialized warning

2016-10-17 Thread Yan, Zheng

> On 18 Oct 2016, at 06:08, Arnd Bergmann  wrote:
> 
> A recent rework removed the initialization of the local 'root'
> variable that is returned from ceph_real_mount:
> 
> fs/ceph/super.c: In function 'ceph_mount':
> fs/ceph/super.c:1016:38: error: 'root' may be used uninitialized in this 
> function [-Werror=maybe-uninitialized]
> 
> It's not obvious to me what the correct fix is, so this just
> returns the saved root as we did before.
> 
> Fixes: ce2728aaa82b ("ceph: avoid accessing / when mounting a subpath")
> Signed-off-by: Arnd Bergmann 
> Cc: "Yan, Zheng" 
> ---
> fs/ceph/super.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index a29ffce..79a4be8 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -821,7 +821,8 @@ static struct dentry *ceph_real_mount(struct 
> ceph_fs_client *fsc)
>   dout("mount start %p\n", fsc);
>   mutex_lock(&fsc->client->mount_mutex);
> 
> - if (!fsc->sb->s_root) {
> + root = dget(fsc->sb->s_root);
> + if (!root) {
>   const char *path;
>   err = __ceph_open_session(fsc->client, started);
>   if (err < 0)
> — 

This bug has already been fixed.

Regards
Yan, Zheng

> 2.9.0
> 



Re: [PATCH] ceph: Fix uninitialized dentry pointer in ceph_real_mount()

2016-10-13 Thread Yan, Zheng
On Thu, Oct 13, 2016 at 11:15 PM, Geert Uytterhoeven
 wrote:
> fs/ceph/super.c: In function ‘ceph_real_mount’:
> fs/ceph/super.c:818: warning: ‘root’ may be used uninitialized in this 
> function
>
> If s_root is already valid, dentry pointer root is never initialized,
> and returned by ceph_real_mount(). This will cause a crash later when
> the caller dereferences the pointer.
>
> Fix this by initializing root early.
>
> Fixes: ce2728aaa82bbeba ("ceph: avoid accessing / when mounting a subpath")
> Signed-off-by: Geert Uytterhoeven 
> ---
> Compile-tested only.
> ---
>  fs/ceph/super.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index a29ffce981879d5f..794c5fd0e0cf5e45 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -821,7 +821,8 @@ static struct dentry *ceph_real_mount(struct 
> ceph_fs_client *fsc)
> dout("mount start %p\n", fsc);
> mutex_lock(&fsc->client->mount_mutex);
>
> -   if (!fsc->sb->s_root) {
> +   root = fsc->sb->s_root;
> +   if (!root) {
> const char *path;
> err = __ceph_open_session(fsc->client, started);
> if (err < 0)

For the case where sb->s_root is not NULL, we also need to increase
sb->s_root's reference count. I applied this patch and fixed that up.

Regards
Yan, Zheng


> --
> 1.9.1
>


Re: [PATCHv2] cephfs: Fix scheduler warning due to nested blocking

2016-10-11 Thread Yan, Zheng

> On 11 Oct 2016, at 17:16, Nikolay Borisov  wrote:
> 
> try_get_cap_refs can be used as a condition in a wait_event* calls.
> This is all fine until it has to call __ceph_do_pending_vmtruncate,
> which in turn acquires the i_truncate_mutex. This leads to a situation
> in which a task's state is !TASK_RUNNING and at the same time it's
> trying to acquire a sleeping primitive. In essence a nested sleeping
> primitives are being used. This causes the following warning:
> 
> WARNING: CPU: 22 PID: 11064 at kernel/sched/core.c:7631 
> __might_sleep+0x9f/0xb0()
> do not call blocking ops when !TASK_RUNNING; state=1 set at 
> [] prepare_to_wait_event+0x5d/0x110
> ipmi_msghandler tcp_scalable ib_qib dca ib_mad ib_core ib_addr ipv6
> CPU: 22 PID: 11064 Comm: fs_checker.pl Tainted: G   O
> 4.4.20-clouder2 #6
> Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1a 10/16/2015
>  8838b416fa88 812f4409 8838b416fad0
> 81a034f2 8838b416fac0 81052b46 81a0432c
> 0061   88167bda54a0
> Call Trace:
> [] dump_stack+0x67/0x9e
> [] warn_slowpath_common+0x86/0xc0
> [] warn_slowpath_fmt+0x4c/0x50
> [] ? prepare_to_wait_event+0x5d/0x110
> [] ? prepare_to_wait_event+0x5d/0x110
> [] __might_sleep+0x9f/0xb0
> [] mutex_lock+0x20/0x40
> [] __ceph_do_pending_vmtruncate+0x44/0x1a0 [ceph]
> [] try_get_cap_refs+0xa2/0x320 [ceph]
> [] ceph_get_caps+0x255/0x2b0 [ceph]
> [] ? wait_woken+0xb0/0xb0
> [] ceph_write_iter+0x2b1/0xde0 [ceph]
> [] ? schedule_timeout+0x202/0x260
> [] ? kmem_cache_free+0x1ea/0x200
> [] ? iput+0x9e/0x230
> [] ? __might_sleep+0x52/0xb0
> [] ? __might_fault+0x37/0x40
> [] ? cp_new_stat+0x153/0x170
> [] __vfs_write+0xaa/0xe0
> [] vfs_write+0xa9/0x190
> [] ? set_close_on_exec+0x31/0x70
> [] SyS_write+0x46/0xa0
> 
> This happens since wait_event_interruptible can interfere with the
> mutex locking code, since they both fiddle with the task state.
> 
> Fix the issue by using the newly-added nested blocking infrastructure
> in 61ada528dea0 ("sched/wait: Provide infrastructure to deal with
> nested blocking")
> 
> Link: https://lwn.net/Articles/628628/
> Signed-off-by: Nikolay Borisov 
> ---
> fs/ceph/caps.c | 12 +---
> 1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index c69e1253b47b..9d401520b981 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -2467,6 +2467,7 @@ int ceph_get_caps(struct ceph_inode_info *ci, int need, 
> int want,
> loff_t endoff, int *got, struct page **pinned_page)
> {
>   int _got, ret, err = 0;
> + DEFINE_WAIT_FUNC(wait, woken_wake_function);
> 
>   ret = ceph_pool_perm_check(ci, need);
>   if (ret < 0)
> @@ -2486,9 +2487,14 @@ int ceph_get_caps(struct ceph_inode_info *ci, int 
> need, int want,
>   if (err < 0)
>   return err;
>   } else {
> - ret = wait_event_interruptible(ci->i_cap_wq,
> - try_get_cap_refs(ci, need, want, endoff,
> -  true, &_got, &err));
> + add_wait_queue(&ci->i_cap_wq, &wait);
> +
> + while (!try_get_cap_refs(ci, need, want, endoff,
> + true, &_got, &err))
> +         wait_woken(&wait, TASK_INTERRUPTIBLE, 
> MAX_SCHEDULE_TIMEOUT);
> +
> + remove_wait_queue(&ci->i_cap_wq, &wait);
> +
>   if (err == -EAGAIN)
>   continue;
>   if (err < 0)
> -- 
> 2.5.0
> 

Applied, thanks

Yan, Zheng




Re: [PATCHv2] ceph: Fix error handling in ceph_read_iter

2016-10-10 Thread Yan, Zheng

> On 10 Oct 2016, at 21:50, Ilya Dryomov  wrote:
> 
> On Mon, Oct 10, 2016 at 3:13 PM, Nikolay Borisov  wrote:
>> 
>> 
>> On 10/10/2016 04:11 PM, Yan, Zheng wrote:
>>> 
>>>> On 10 Oct 2016, at 20:56, Nikolay Borisov  wrote:
>>>> 
>>>> In case __ceph_do_getattr returns an error and the retry_op in
>>>> ceph_read_iter is not READ_INLINE, then it's possible to invoke
>>>> __free_page on a page which is NULL, this naturally leads to a crash.
>>>> This can happen when, for example, a process waiting on a MDS reply
>>>> receives sigterm.
>>>> 
>>>> Fix this by explicitly checking whether the page is set or not.
>>>> 
>>>> Signed-off-by: Nikolay Borisov 
>>>> Link: http://www.spinics.net/lists/ceph-users/msg31592.html
>>>> ---
>>>> 
>>>> Inverted the condition, so resending with correct condition
>>>> this time.
>>>> 
>>>> fs/ceph/file.c | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>> index 3c68e6aee2f0..7413313ae6c8 100644
>>>> --- a/fs/ceph/file.c
>>>> +++ b/fs/ceph/file.c
>>>> @@ -929,7 +929,8 @@ again:
>>>> statret = __ceph_do_getattr(inode, page,
>>>> CEPH_STAT_CAP_INLINE_DATA, !!page);
>>>> if (statret < 0) {
>>>> - __free_page(page);
>>>> +if (page)
>>>> +__free_page(page);
>>>> if (statret == -ENODATA) {
>>>> BUG_ON(retry_op != READ_INLINE);
>>>> goto again;
>>>> —
>>> Reviewed-by: Yan, Zheng 
>> 
>> I believe this needs to also be tagged as stable. To whomever is going
>> to merge it: can you please do that?
> 
> I'll do that.  Zheng, do you see any other issues with the killable
> wait here?

No. This change is obvious; what's your concern?

Regards
Yan, Zheng

> 
> Thanks,
> 
>Ilya



Re: [PATCHv2] ceph: Fix error handling in ceph_read_iter

2016-10-10 Thread Yan, Zheng

> On 10 Oct 2016, at 21:13, Nikolay Borisov  wrote:
> 
> 
> 
> On 10/10/2016 04:11 PM, Yan, Zheng wrote:
>> 
>>> On 10 Oct 2016, at 20:56, Nikolay Borisov  wrote:
>>> 
>>> In case __ceph_do_getattr returns an error and the retry_op in
>>> ceph_read_iter is not READ_INLINE, then it's possible to invoke
>>> __free_page on a page which is NULL, this naturally leads to a crash.
>>> This can happen when, for example, a process waiting on a MDS reply
>>> receives sigterm.
>>> 
>>> Fix this by explicitly checking whether the page is set or not.
>>> 
>>> Signed-off-by: Nikolay Borisov 
>>> Link: http://www.spinics.net/lists/ceph-users/msg31592.html
>>> ---
>>> 
>>> Inverted the condition, so resending with correct condition
>>> this time. 
>>> 
>>> fs/ceph/file.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 3c68e6aee2f0..7413313ae6c8 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -929,7 +929,8 @@ again:
>>> statret = __ceph_do_getattr(inode, page,
>>> CEPH_STAT_CAP_INLINE_DATA, !!page);
>>> if (statret < 0) {
>>> -    __free_page(page);
>>> +   if (page)
>>> +   __free_page(page);
>>> if (statret == -ENODATA) {
>>> BUG_ON(retry_op != READ_INLINE);
>>> goto again;
>>> — 
>> Reviewed-by: Yan, Zheng 
> 
> I believe this needs to also be tagged as stable. To whomever is going
> to merge it: can you please do that?
> 

We need to get it upstream first.

> 
>> 
>>> 2.5.0
>>> 
>> 



Re: [PATCHv2] ceph: Fix error handling in ceph_read_iter

2016-10-10 Thread Yan, Zheng

> On 10 Oct 2016, at 20:56, Nikolay Borisov  wrote:
> 
> In case __ceph_do_getattr returns an error and the retry_op in
> ceph_read_iter is not READ_INLINE, then it's possible to invoke
> __free_page on a page which is NULL, this naturally leads to a crash.
> This can happen when, for example, a process waiting on a MDS reply
> receives sigterm.
> 
> Fix this by explicitly checking whether the page is set or not.
> 
> Signed-off-by: Nikolay Borisov 
> Link: http://www.spinics.net/lists/ceph-users/msg31592.html
> ---
> 
> Inverted the condition, so resending with correct condition
> this time. 
> 
> fs/ceph/file.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 3c68e6aee2f0..7413313ae6c8 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -929,7 +929,8 @@ again:
>   statret = __ceph_do_getattr(inode, page,
>   CEPH_STAT_CAP_INLINE_DATA, !!page);
>   if (statret < 0) {
> -  __free_page(page);
> + if (page)
> + __free_page(page);
>   if (statret == -ENODATA) {
>           BUG_ON(retry_op != READ_INLINE);
>   goto again;
> — 
Reviewed-by: Yan, Zheng 

> 2.5.0
> 



Re: [PATCH 1/1] ceph: do not modify fi->frag in need_reset_readdir()

2016-08-28 Thread Yan, Zheng

> On Aug 29, 2016, at 00:47, Nicolas Iooss  wrote:
> 
> Commit f3c4ebe65ea1 ("ceph: using hash value to compose dentry offset")
> modified "if (fpos_frag(new_pos) != fi->frag)" to "if (fi->frag |=
> fpos_frag(new_pos))" in need_reset_readdir(), thus replacing a
> comparison operator with an assignment one.
> 
> This looks like a typo which is reported by clang when building the
> kernel with some warning flags:
> 
>fs/ceph/dir.c:600:22: error: using the result of an assignment as a
>condition without parentheses [-Werror,-Wparentheses]
>} else if (fi->frag |= fpos_frag(new_pos)) {
>   ~^
>fs/ceph/dir.c:600:22: note: place parentheses around the assignment
>to silence this warning
>} else if (fi->frag |= fpos_frag(new_pos)) {
>^
>   ( )
>fs/ceph/dir.c:600:22: note: use '!=' to turn this compound
>assignment into an inequality comparison
>} else if (fi->frag |= fpos_frag(new_pos)) {
>^~
>!=
> 
> Fixes: f3c4ebe65ea1 ("ceph: using hash value to compose dentry offset")
> Cc: sta...@vger.kernel.org # 4.7.x
> Signed-off-by: Nicolas Iooss 
> ---
> fs/ceph/dir.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index c64a0b794d49..df4b3e6fa563 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -597,7 +597,7 @@ static bool need_reset_readdir(struct ceph_file_info *fi, 
> loff_t new_pos)
>   if (is_hash_order(new_pos)) {
>   /* no need to reset last_name for a forward seek when
>* dentries are sotred in hash order */
> - } else if (fi->frag |= fpos_frag(new_pos)) {
> + } else if (fi->frag != fpos_frag(new_pos)) {
>   return true;
>   }
>   rinfo = fi->last_readdir ? &fi->last_readdir->r_reply_info : NULL;


Applied, thanks

Yan, Zheng



> -- 
> 2.9.3
> 



Re: [PATCH] ceph: Correctly return NXIO errors from ceph_llseek.

2016-07-25 Thread Yan, Zheng

> On Jul 22, 2016, at 01:43, Phil Turnbull  wrote:
> 
> ceph_llseek does not correctly return NXIO errors because the 'out' path
> always returns 'offset'.
> 
> Fixes: 06222e491e66 ("fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's 
> that define their own llseek")
> Signed-off-by: Phil Turnbull 
> ---
> fs/ceph/file.c | 12 +---
> 1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index ce2f5795e44b..13adb5b2ef29 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1448,16 +1448,14 @@ static loff_t ceph_llseek(struct file *file, loff_t 
> offset, int whence)
> {
>   struct inode *inode = file->f_mapping->host;
>   loff_t i_size;
> - int ret;
> + loff_t ret;
> 
>   inode_lock(inode);
> 
>   if (whence == SEEK_END || whence == SEEK_DATA || whence == SEEK_HOLE) {
>   ret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false);
> - if (ret < 0) {
> - offset = ret;
> + if (ret < 0)
>   goto out;
> - }
>   }
> 
>   i_size = i_size_read(inode);
> @@ -1473,7 +1471,7 @@ static loff_t ceph_llseek(struct file *file, loff_t 
> offset, int whence)
>* write() or lseek() might have altered it
>*/
>   if (offset == 0) {
> - offset = file->f_pos;
> + ret = file->f_pos;
>   goto out;
>   }
>   offset += file->f_pos;
> @@ -1493,11 +1491,11 @@ static loff_t ceph_llseek(struct file *file, loff_t 
> offset, int whence)
>   break;
>   }
> 
> - offset = vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
> +     ret = vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
> 
> out:
>   inode_unlock(inode);
> - return offset;
> + return ret;
> }
> 
> static inline void ceph_zero_partial_page(

applied, thanks

Yan, Zheng

> -- 
> 2.9.0.rc2
> 



Re: [PATCH v2] ceph: Mark the file cache as unreclaimable

2016-07-25 Thread Yan, Zheng

> On Jul 26, 2016, at 01:12, Nikolay Borisov  wrote:
> 
> Ceph creates multiple caches with the SLAB_RECLAIMABLE flag set, so
> that it can satisfy its internal needs. Inspecting the code shows that
> most of the caches are indeed reclaimable since they are directly
> related to the generic inode/dentry shrinkers. However, one of the
> cache used to satisfy struct file is not reclaimable since its
> entries are freed only when the last reference to the file is
> dropped. If a heavily loaded node opens a lot of files it can
> introduce non-trivial discrepancies between memory shown as reclaimable
> and what is actually reclaimed when drop_caches is used.
> 
> Fix this by removing the reclaimable flag for the file's cache.
> 
> Signed-off-by: Nikolay Borisov 
> ---
> 
> Fixed checkpatch warning + missing SOB line
> 
> fs/ceph/super.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index 91e02481ce06..8697cac6add0 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -672,8 +672,8 @@ static int __init init_caches(void)
>   if (ceph_dentry_cachep == NULL)
>   goto bad_dentry;
> 
> - ceph_file_cachep = KMEM_CACHE(ceph_file_info,
> -   SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD);
> + ceph_file_cachep = KMEM_CACHE(ceph_file_info, SLAB_MEM_SPREAD);
> +
>   if (ceph_file_cachep == NULL)
>   goto bad_file;
> 

Applied, thanks

Yan, Zheng

> -- 
> 2.7.4
> 



Re: [PATCH 2/8] ceph: don't use ->d_time

2016-06-28 Thread Yan, Zheng

> On Jun 28, 2016, at 16:09, Miklos Szeredi  wrote:
> 
> On Thu, Jun 23, 2016 at 8:21 AM, Yan, Zheng  wrote:
>> 
>>> On Jun 22, 2016, at 22:35, Miklos Szeredi  wrote:
>>> 
>>> Pretty simple: just use ceph_dentry_info.time instead (which was already
>>> there, unused).
>>> 
>>> Signed-off-by: Miklos Szeredi 
>>> Cc: Yan, Zheng 
>>> ---
>>> fs/ceph/dir.c| 6 +++---
>>> fs/ceph/inode.c  | 4 ++--
>>> fs/ceph/mds_client.c | 4 ++--
>>> fs/ceph/super.h  | 2 +-
>>> 4 files changed, 8 insertions(+), 8 deletions(-)
>> 
>> Reviewed-by: Yan, Zheng 
> 
> Can you please take this through your tree?

applied, thanks

Yan, Zheng


> 
> Thanks,
> Miklos



Re: [PATCH] ceph: fix spelling mistake: "resgister" -> "register"

2016-06-23 Thread Yan, Zheng

> On Jun 24, 2016, at 00:01, Joe Perches  wrote:
> 
> On Thu, 2016-06-23 at 14:45 +0100, Colin King wrote:
>> trivial fix to spelling mistake in pr_err message
> []
>> diff --git a/fs/ceph/cache.c b/fs/ceph/cache.c
> []
>> @@ -71,7 +71,7 @@ int ceph_fscache_register_fs(struct ceph_fs_client* fsc)
>>&ceph_fscache_fsid_object_def,
>>fsc, true);
>>  if (!fsc->fscache)
>> -pr_err("Unable to resgister fsid: %p fscache cookie", fsc);
>> +pr_err("Unable to register fsid: %p fscache cookie", fsc);
> 
> Could change to "cookie\n" to avoid possible interleaving
> from other messages too.

Applied , thanks

Yan, Zheng





Re: [PATCH 2/8] ceph: don't use ->d_time

2016-06-22 Thread Yan, Zheng

> On Jun 22, 2016, at 22:35, Miklos Szeredi  wrote:
> 
> Pretty simple: just use ceph_dentry_info.time instead (which was already
> there, unused).
> 
> Signed-off-by: Miklos Szeredi 
> Cc: Yan, Zheng 
> ---
> fs/ceph/dir.c| 6 +++---
> fs/ceph/inode.c  | 4 ++--
> fs/ceph/mds_client.c | 4 ++--
> fs/ceph/super.h  | 2 +-
> 4 files changed, 8 insertions(+), 8 deletions(-)

Reviewed-by: Yan, Zheng 

> 
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 6e0fedf6713b..8ff7bcc7fc88 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -59,7 +59,7 @@ int ceph_init_dentry(struct dentry *dentry)
> 
>   di->dentry = dentry;
>   di->lease_session = NULL;
> - dentry->d_time = jiffies;
> + di->time = jiffies;
>   /* avoid reordering d_fsdata setup so that the check above is safe */
>   smp_mb();
>   dentry->d_fsdata = di;
> @@ -1124,7 +1124,7 @@ static int ceph_rename(struct inode *old_dir, struct 
> dentry *old_dentry,
> void ceph_invalidate_dentry_lease(struct dentry *dentry)
> {
>   spin_lock(&dentry->d_lock);
> - dentry->d_time = jiffies;
> + ceph_dentry(dentry)->time = jiffies;
>   ceph_dentry(dentry)->lease_shared_gen = 0;
>   spin_unlock(&dentry->d_lock);
> }
> @@ -1154,7 +1154,7 @@ static int dentry_lease_is_valid(struct dentry *dentry)
>   spin_unlock(&s->s_gen_ttl_lock);
> 
>   if (di->lease_gen == gen &&
> - time_before(jiffies, dentry->d_time) &&
> + time_before(jiffies, di->time) &&
>   time_before(jiffies, ttl)) {
>   valid = 1;
>   if (di->lease_renew_after &&
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index f059b5997072..7a33178ef850 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1018,7 +1018,7 @@ static void update_dentry_lease(struct dentry *dentry,
>   goto out_unlock;
> 
>   if (di->lease_gen == session->s_cap_gen &&
> - time_before(ttl, dentry->d_time))
> + time_before(ttl, di->time))
>   goto out_unlock;  /* we already have a newer lease. */
> 
>   if (di->lease_session && di->lease_session != session)
> @@ -1032,7 +1032,7 @@ static void update_dentry_lease(struct dentry *dentry,
>   di->lease_seq = le32_to_cpu(lease->seq);
>   di->lease_renew_after = half_ttl;
>   di->lease_renew_from = 0;
> - dentry->d_time = ttl;
> + di->time = ttl;
> out_unlock:
>   spin_unlock(&dentry->d_lock);
>   return;
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 2103b823bec0..db9c654d42cd 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -3231,7 +3231,7 @@ static void handle_lease(struct ceph_mds_client *mdsc,
>   msecs_to_jiffies(le32_to_cpu(h->duration_ms));
> 
>   di->lease_seq = seq;
> - dentry->d_time = di->lease_renew_from + duration;
> + di->time = di->lease_renew_from + duration;
>   di->lease_renew_after = di->lease_renew_from +
>   (duration >> 1);
>   di->lease_renew_from = 0;
> @@ -3316,7 +3316,7 @@ void ceph_mdsc_lease_release(struct ceph_mds_client 
> *mdsc, struct inode *inode,
>   if (!di || !di->lease_session ||
>   di->lease_session->s_mds < 0 ||
>   di->lease_gen != di->lease_session->s_cap_gen ||
> - !time_before(jiffies, dentry->d_time)) {
> + !time_before(jiffies, di->time)) {
>   dout("lease_release inode %p dentry %p -- "
>"no lease\n",
>inode, dentry);
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 0168b49fb6ad..10776db93143 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -246,7 +246,7 @@ struct ceph_dentry_info {
>   unsigned long lease_renew_after, lease_renew_from;
>   struct list_head lru;
>   struct dentry *dentry;
> - u64 time;
> + unsigned long time;
>   u64 offset;
> };
> 
> -- 
> 2.5.5
> 



[PATCH] FS-Cache: make check_consistency callback return int

2016-05-20 Thread Yan, Zheng
__fscache_check_consistency() calls the check_consistency() callback
and returns the callback's return value. But the return type of
check_consistency() is bool, so __fscache_check_consistency()
returns 1 when the cache is inconsistent. This is inconsistent with
the documented behaviour.

Signed-off-by: "Yan, Zheng" 
---
 fs/cachefiles/interface.c | 2 +-
 include/linux/fscache-cache.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index 861d611..ce5f345 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -380,7 +380,7 @@ static void cachefiles_sync_cache(struct fscache_cache 
*_cache)
  * check if the backing cache is updated to FS-Cache
  * - called by FS-Cache when evaluates if need to invalidate the cache
  */
-static bool cachefiles_check_consistency(struct fscache_operation *op)
+static int cachefiles_check_consistency(struct fscache_operation *op)
 {
struct cachefiles_object *object;
struct cachefiles_cache *cache;
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index 604e152..13ba552 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -241,7 +241,7 @@ struct fscache_cache_ops {
 
/* check the consistency between the backing cache and the FS-Cache
 * cookie */
-   bool (*check_consistency)(struct fscache_operation *op);
+   int (*check_consistency)(struct fscache_operation *op);
 
/* store the updated auxiliary data on an object */
void (*update_object)(struct fscache_object *object);
-- 
2.5.5
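
The reason the callback's type matters: C converts any non-zero value to 1
when it is returned through a bool, so an error code such as -ESTALE coming
back from the backend is silently collapsed to "true".  A minimal
stand-alone illustration of that effect (hypothetical helper names, not
fscache code):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* declared bool: -ESTALE is converted to true, i.e. 1 */
static bool check_as_bool(void) { return -ESTALE; }

/* declared int: the error code is preserved for the caller */
static int check_as_int(void) { return -ESTALE; }

int main(void)
{
	printf("bool callback -> %d\n", (int)check_as_bool()); /* prints 1 */
	printf("int  callback -> %d\n", check_as_int());       /* prints -116 on Linux */
	return 0;
}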



[PATCH] FS-Cache: wake up write waiter after invalidating writes

2016-05-17 Thread Yan, Zheng
Signed-off-by: "Yan, Zheng" 
---
 fs/fscache/page.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index 3078b67..c8c4f79 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -887,6 +887,8 @@ void fscache_invalidate_writes(struct fscache_cookie 
*cookie)
put_page(results[i]);
}
 
+   wake_up_bit(&cookie->flags, 0);
+
_leave("");
 }
 
-- 
2.5.5
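
There is no changelog here, but the pattern involved is the usual
wait_on_bit()/wake_up_bit() pairing: a task sleeping in wait_on_bit() is only
guaranteed to notice the condition change if the side that clears the bit also
issues wake_up_bit() on the same word and bit number.  A rough sketch of that
contract, using made-up names (struct my_obj, MY_FLAG_BUSY) rather than the
actual fscache cookie internals:

#include <linux/atomic.h>
#include <linux/bitops.h>
#include <linux/sched.h>
#include <linux/wait.h>

#define MY_FLAG_BUSY	0

struct my_obj {
	unsigned long flags;
};

/* waiter: sleeps until bit MY_FLAG_BUSY of obj->flags is clear */
static int my_obj_wait(struct my_obj *obj)
{
	return wait_on_bit(&obj->flags, MY_FLAG_BUSY, TASK_UNINTERRUPTIBLE);
}

/* completer: clearing the bit alone leaves sleepers stuck --
 * an explicit wake_up_bit() is needed after the barrier */
static void my_obj_finish(struct my_obj *obj)
{
	clear_bit(MY_FLAG_BUSY, &obj->flags);
	smp_mb__after_atomic();
	wake_up_bit(&obj->flags, MY_FLAG_BUSY);
}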



Re: [PATCH v2 1/1] fs/ceph: make logical calculation functions return bool

2016-05-06 Thread Yan, Zheng

> On May 6, 2016, at 15:14, Zhang Zhuoyu  
> wrote:
> 
> Hi,  Yan, Viro
> 
> Any other comments on this updated version?

applied (after removing the rados.h hunk)

Thanks
Yan, Zheng

> 
> Zhuoyu
> 
>> -Original Message-
>> From: hellozzy1...@126.com [mailto:hellozzy1...@126.com] On Behalf Of
>> Zhang Zhuoyu
>> Sent: Friday, March 25, 2016 5:19 PM
>> To: ceph-de...@vger.kernel.org
>> Cc: s...@redhat.com; z...@redhat.com; idryo...@gmail.com; linux-
>> ker...@vger.kernel.org; Zhang Zhuoyu
>> 
>> Subject: [PATCH v2 1/1] fs/ceph: make logical calculation functions return
>> bool
>> 
>> This patch makes several logical calculation functions return bool to
>> improve readability, since these particular functions only use 0/1 as
>> their return value.
>> 
>> No functional change.
>> --
>> v2 changelog:
>> ceph_ino_compare() is used by ilookup5(), and ilookup5() wants a function
>> pointer of type int (*test)(struct inode *, void *), so let
>> ceph_ino_compare() keep its return value as int.
>> 
>> Signed-off-by: Zhang Zhuoyu 
>> ---
>> fs/ceph/cache.c|  2 +-
>> fs/ceph/dir.c  |  2 +-
>> include/linux/ceph/ceph_frag.h | 12 ++--
>> include/linux/ceph/decode.h|  2 +-
>> include/linux/ceph/osdmap.h|  6 +++---
>> include/linux/ceph/rados.h | 16 
>> net/ceph/ceph_common.c |  2 +-
>> 7 files changed, 21 insertions(+), 21 deletions(-)
>> 
>> diff --git a/fs/ceph/cache.c b/fs/ceph/cache.c index 834f9f3..1f3a3f6
> 100644
>> --- a/fs/ceph/cache.c
>> +++ b/fs/ceph/cache.c
>> @@ -238,7 +238,7 @@ static void
>> ceph_vfs_readpage_complete_unlock(struct page *page, void *data, int
>>  unlock_page(page);
>> }
>> 
>> -static inline int cache_valid(struct ceph_inode_info *ci)
>> +static inline bool cache_valid(struct ceph_inode_info *ci)
>> {
>>  return ((ceph_caps_issued(ci) & CEPH_CAP_FILE_CACHE) &&
>>  (ci->i_fscache_gen == ci->i_rdcache_gen)); diff --git
>> a/fs/ceph/dir.c b/fs/ceph/dir.c index 26be849..7eed41e 100644
>> --- a/fs/ceph/dir.c
>> +++ b/fs/ceph/dir.c
>> @@ -584,7 +584,7 @@ struct dentry *ceph_finish_lookup(struct
>> ceph_mds_request *req,
>>  return dentry;
>> }
>> 
>> -static int is_root_ceph_dentry(struct inode *inode, struct dentry
> *dentry)
>> +static bool is_root_ceph_dentry(struct inode *inode, struct dentry
>> +*dentry)
>> {
>>  return ceph_ino(inode) == CEPH_INO_ROOT &&
>>  strncmp(dentry->d_name.name, ".ceph", 5) == 0; diff --git
>> a/include/linux/ceph/ceph_frag.h b/include/linux/ceph/ceph_frag.h index
>> 5babb8e..44e6067 100644
>> --- a/include/linux/ceph/ceph_frag.h
>> +++ b/include/linux/ceph/ceph_frag.h
>> @@ -40,11 +40,11 @@ static inline __u32 ceph_frag_mask_shift(__u32 f)
>>  return 24 - ceph_frag_bits(f);
>> }
>> 
>> -static inline int ceph_frag_contains_value(__u32 f, __u32 v)
>> +static inline bool ceph_frag_contains_value(__u32 f, __u32 v)
>> {
>>  return (v & ceph_frag_mask(f)) == ceph_frag_value(f);  } -static
>> inline int ceph_frag_contains_frag(__u32 f, __u32 sub)
>> +static inline bool ceph_frag_contains_frag(__u32 f, __u32 sub)
>> {
>>  /* is sub as specific as us, and contained by us? */
>>  return ceph_frag_bits(sub) >= ceph_frag_bits(f) && @@ -56,12
>> +56,12 @@ static inline __u32 ceph_frag_parent(__u32 f)
>>  return ceph_frag_make(ceph_frag_bits(f) - 1,
>>   ceph_frag_value(f) & (ceph_frag_mask(f) << 1));  }
> -
>> static inline int ceph_frag_is_left_child(__u32 f)
>> +static inline bool ceph_frag_is_left_child(__u32 f)
>> {
>>  return ceph_frag_bits(f) > 0 &&
>>  (ceph_frag_value(f) & (0x100 >> ceph_frag_bits(f))) ==
>> 0;  } -static inline int ceph_frag_is_right_child(__u32 f)
>> +static inline bool ceph_frag_is_right_child(__u32 f)
>> {
>>  return ceph_frag_bits(f) > 0 &&
>>  (ceph_frag_value(f) & (0x100 >> ceph_frag_bits(f))) ==
>> 1; @@ -86,11 +86,11 @@ static inline __u32 ceph_frag_make_child(__u32 f,
>> int by, int i)
>>  return ceph_frag_make(newbits,
>>   ceph_frag_value(f) | (i << (24 - newbits)));  }
> -static
>> inline int ceph_frag_is_leftmost(__u32 f)
>> +static inline bool ceph_frag_is_leftmost(__u32 f)
>> {
>>

Re: [PATCH v2 1/1] fs/ceph: make logical calculation functions return bool

2016-05-06 Thread Yan, Zheng

> On May 6, 2016, at 15:14, Zhang Zhuoyu  
> wrote:
> 
> Hi,  Yan, Viro
> 
> Any other comments on this updated version?

I have no comment for the cephfs part. 

Ilya, do you like the libceph part?

Regards
Yan, Zheng

> 
> Zhuoyu
> 
>> -Original Message-
>> From: hellozzy1...@126.com [mailto:hellozzy1...@126.com] On Behalf Of
>> Zhang Zhuoyu
>> Sent: Friday, March 25, 2016 5:19 PM
>> To: ceph-de...@vger.kernel.org
>> Cc: s...@redhat.com; z...@redhat.com; idryo...@gmail.com; linux-
>> ker...@vger.kernel.org; Zhang Zhuoyu
>> 
>> Subject: [PATCH v2 1/1] fs/ceph: make logical calculation functions return
>> bool
>> 
>> This patch makes several logical calculation functions return bool to
>> improve readability, since these particular functions only use 0/1 as
>> their return value.
>> 
>> No functional change.
>> --
>> v2 changelog:
>> ceph_ino_compare() is used by ilookup5(), and ilookup5() wants a function
>> pointer of type int (*test)(struct inode *, void *), so let
>> ceph_ino_compare() keep its return value as int.
>> 
>> Signed-off-by: Zhang Zhuoyu 
>> ---
>> fs/ceph/cache.c|  2 +-
>> fs/ceph/dir.c  |  2 +-
>> include/linux/ceph/ceph_frag.h | 12 ++--
>> include/linux/ceph/decode.h|  2 +-
>> include/linux/ceph/osdmap.h|  6 +++---
>> include/linux/ceph/rados.h | 16 
>> net/ceph/ceph_common.c |  2 +-
>> 7 files changed, 21 insertions(+), 21 deletions(-)
>> 
>> diff --git a/fs/ceph/cache.c b/fs/ceph/cache.c index 834f9f3..1f3a3f6
> 100644
>> --- a/fs/ceph/cache.c
>> +++ b/fs/ceph/cache.c
>> @@ -238,7 +238,7 @@ static void
>> ceph_vfs_readpage_complete_unlock(struct page *page, void *data, int
>>  unlock_page(page);
>> }
>> 
>> -static inline int cache_valid(struct ceph_inode_info *ci)
>> +static inline bool cache_valid(struct ceph_inode_info *ci)
>> {
>>  return ((ceph_caps_issued(ci) & CEPH_CAP_FILE_CACHE) &&
>>  (ci->i_fscache_gen == ci->i_rdcache_gen)); diff --git
>> a/fs/ceph/dir.c b/fs/ceph/dir.c index 26be849..7eed41e 100644
>> --- a/fs/ceph/dir.c
>> +++ b/fs/ceph/dir.c
>> @@ -584,7 +584,7 @@ struct dentry *ceph_finish_lookup(struct
>> ceph_mds_request *req,
>>  return dentry;
>> }
>> 
>> -static int is_root_ceph_dentry(struct inode *inode, struct dentry
> *dentry)
>> +static bool is_root_ceph_dentry(struct inode *inode, struct dentry
>> +*dentry)
>> {
>>  return ceph_ino(inode) == CEPH_INO_ROOT &&
>>  strncmp(dentry->d_name.name, ".ceph", 5) == 0; diff --git
>> a/include/linux/ceph/ceph_frag.h b/include/linux/ceph/ceph_frag.h index
>> 5babb8e..44e6067 100644
>> --- a/include/linux/ceph/ceph_frag.h
>> +++ b/include/linux/ceph/ceph_frag.h
>> @@ -40,11 +40,11 @@ static inline __u32 ceph_frag_mask_shift(__u32 f)
>>  return 24 - ceph_frag_bits(f);
>> }
>> 
>> -static inline int ceph_frag_contains_value(__u32 f, __u32 v)
>> +static inline bool ceph_frag_contains_value(__u32 f, __u32 v)
>> {
>>  return (v & ceph_frag_mask(f)) == ceph_frag_value(f);  } -static
>> inline int ceph_frag_contains_frag(__u32 f, __u32 sub)
>> +static inline bool ceph_frag_contains_frag(__u32 f, __u32 sub)
>> {
>>  /* is sub as specific as us, and contained by us? */
>>  return ceph_frag_bits(sub) >= ceph_frag_bits(f) && @@ -56,12
>> +56,12 @@ static inline __u32 ceph_frag_parent(__u32 f)
>>  return ceph_frag_make(ceph_frag_bits(f) - 1,
>>   ceph_frag_value(f) & (ceph_frag_mask(f) << 1));  }
> -
>> static inline int ceph_frag_is_left_child(__u32 f)
>> +static inline bool ceph_frag_is_left_child(__u32 f)
>> {
>>  return ceph_frag_bits(f) > 0 &&
>>  (ceph_frag_value(f) & (0x100 >> ceph_frag_bits(f))) ==
>> 0;  } -static inline int ceph_frag_is_right_child(__u32 f)
>> +static inline bool ceph_frag_is_right_child(__u32 f)
>> {
>>  return ceph_frag_bits(f) > 0 &&
>>  (ceph_frag_value(f) & (0x100 >> ceph_frag_bits(f))) ==
>> 1; @@ -86,11 +86,11 @@ static inline __u32 ceph_frag_make_child(__u32 f,
>> int by, int i)
>>  return ceph_frag_make(newbits,
>>   ceph_frag_value(f) | (i << (24 - newbits)));  }
> -static
>> inline int ceph_frag_is_leftmost(__u32 f)
>> +static inline bool ceph_frag

Re: [GIT PULL] Ceph fixes for -rc7

2016-03-29 Thread Yan, Zheng
On Wed, Mar 30, 2016 at 8:24 AM, NeilBrown  wrote:
> On Fri, Mar 25 2016, Ilya Dryomov wrote:
>
>> On Fri, Mar 25, 2016 at 5:02 AM, NeilBrown  wrote:
>>> On Sun, Mar 06 2016, Sage Weil wrote:
>>>
>>>> Hi Linus,
>>>>
>>>> Please pull the following Ceph patch from
>>>>
>>>>   git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git 
>>>> for-linus
>>>>
>>>> This is a final commit we missed to align the protocol compatibility with
>>>> the feature bits.  It decodes a few extra fields in two different messages
>>>> and reports EIO when they are used (not yet supported).
>>>>
>>>> Thanks!
>>>> sage
>>>>
>>>>
>>>> 
>>>> Yan, Zheng (1):
>>>>   ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support
>>>
>>> Just wondering, but was CEPH_FEATURE_FS_FILE_LAYOUT_V2 supposed to have
>>> exactly the same value as CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING (and
>>> CEPH_FEATURE_CRUSH_TUNABLES5)??
>>
>> Yes, that was the point of getting it merged into -rc7.
>
> I did wonder if that might be the case.
>
>>
>>> Because when I backported this patch (and many others) to some ancient
>>> enterprise kernel, it caused mounts to fail.  If it really is meant to
>>> be the same value, then I must have some other backported issue to find
>>> and fix.
>>
>> It has to be backported in concert with changes that add support for
>> the other two bits.
>
> I have everything from fs/ceph and net/ceph as of 4.5, with adjustments
> for different core code.
>
>>  How did mount fail?
>
> "can't read superblock".
> dmesg contains
>
> [   50.822479] libceph: client144098 fsid 2b73bc29-3e78-490a-8fc6-21da1bf901ba
> [   50.823746] libceph: mon0 192.168.1.122:6789 session established
> [   51.635312] ceph: problem parsing mds trace -5
> [   51.635317] ceph: mds parse_reply err -5
> [   51.635318] ceph: mdsc_handle_reply got corrupt reply mds0(tid:1)
>
> then a hex dump of header:, front: footer:
>
> Maybe my MDS is causing the problem?  It is based on v10.0.5 which
> contains
>
> #define CEPH_FEATURE_CRUSH_TUNABLES5(1ULL<<58) /* chooseleaf stable mode 
> */
> // duplicated since it was introduced at the same time as 
> CEPH_FEATURE_CRUSH_TUN
> #define CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING   (1ULL<<58) /* New, v7 encoding 
> */
>
> in ceph_features.h  i.e. two features using bit 58, but not
> FS_FILE_LAYOUT_V2
>
> Should I expect Linux 4.5 to work with ceph 10.0.5 ??

Sorry, cephfs in Linux 4.5 does not work with ceph 10.0.5. Please upgrade
to ceph 10.1.0.

Regards
Yan, Zheng

>
> Thanks,
> NeilBrown
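
For the backport question above, the detail that matters is that all three
feature names deliberately share bit 58 in the kernel's
include/linux/ceph/ceph_features.h, roughly as below (values as of the
4.5-era tree; double-check the exact tree being backported to):

#define CEPH_FEATURE_CRUSH_TUNABLES5         (1ULL<<58) /* chooseleaf stable mode */
/* the next two overlap with CRUSH_TUNABLES5 on purpose: they were
 * introduced on the server side at the same time */
#define CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING (1ULL<<58) /* new, v7 encoding */
#define CEPH_FEATURE_FS_FILE_LAYOUT_V2       (1ULL<<58) /* file layout v2 */

So a backport that advertises FS_FILE_LAYOUT_V2 must also carry the code that
understands the TUNABLES5 and OSDOPREPLY encoding changes, and the server side
must be new enough to speak them, which is consistent with the mount failure
seen against a 10.0.5 MDS.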


Re: [GIT PULL] Ceph fixes for -rc7

2016-03-27 Thread Yan, Zheng
On Fri, Mar 25, 2016 at 12:02 PM, NeilBrown  wrote:
> On Sun, Mar 06 2016, Sage Weil wrote:
>
>> Hi Linus,
>>
>> Please pull the following Ceph patch from
>>
>>   git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git 
>> for-linus
>>
>> This is a final commit we missed to align the protocol compatibility with
>> the feature bits.  It decodes a few extra fields in two different messages
>> and reports EIO when they are used (not yet supported).
>>
>> Thanks!
>> sage
>>
>>
>> 
>> Yan, Zheng (1):
>>   ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support
>
> Just wondering, but was CEPH_FEATURE_FS_FILE_LAYOUT_V2 supposed to have
> exactly the same value as CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING (and
> CEPH_FEATURE_CRUSH_TUNABLES5)??
> Because when I backported this patch (and many others) to some ancient
> enterprise kernel, it caused mounts to fail.  If it really is meant to
> be the same value, then I must have some other backported issue to find
> and fix.

Yes, it's meant to be the same value. For the mount failure, please make
sure the MDS is compiled from the newest ceph code.

Regards
Yan, Zheng

>
> Thanks,
> NeilBrown
>
>
>>
>>  fs/ceph/addr.c |  4 
>>  fs/ceph/caps.c | 27 ---
>>  fs/ceph/inode.c|  2 ++
>>  fs/ceph/mds_client.c   | 16 
>>  fs/ceph/mds_client.h   |  1 +
>>  fs/ceph/super.h|  1 +
>>  include/linux/ceph/ceph_features.h |  1 +
>>  7 files changed, 49 insertions(+), 3 deletions(-)


Re: [PATCH] fs/ceph: make logical calculation functions return bool

2016-03-25 Thread Yan, Zheng

> On Mar 25, 2016, at 14:19, Zhang Zhuoyu  
> wrote:
> 
> This patch makes several logical calculation functions return bool to
> improve readability, since these particular functions only use 0/1 as
> their return value.
> 
> No functional change.
> 
> Signed-off-by: Zhang Zhuoyu 
> ---
> fs/ceph/cache.c|  2 +-
> fs/ceph/dir.c  |  2 +-
> fs/ceph/super.h|  2 +-
> include/linux/ceph/ceph_frag.h | 12 ++--
> include/linux/ceph/decode.h|  2 +-
> include/linux/ceph/osdmap.h|  6 +++---
> include/linux/ceph/rados.h | 16 
> net/ceph/ceph_common.c |  2 +-
> 8 files changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/ceph/cache.c b/fs/ceph/cache.c
> index 834f9f3..1f3a3f6 100644
> --- a/fs/ceph/cache.c
> +++ b/fs/ceph/cache.c
> @@ -238,7 +238,7 @@ static void ceph_vfs_readpage_complete_unlock(struct page 
> *page, void *data, int
>   unlock_page(page);
> }
> 
> -static inline int cache_valid(struct ceph_inode_info *ci)
> +static inline bool cache_valid(struct ceph_inode_info *ci)
> {
>   return ((ceph_caps_issued(ci) & CEPH_CAP_FILE_CACHE) &&
>   (ci->i_fscache_gen == ci->i_rdcache_gen));
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 26be849..7eed41e 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -584,7 +584,7 @@ struct dentry *ceph_finish_lookup(struct ceph_mds_request 
> *req,
>   return dentry;
> }
> 
> -static int is_root_ceph_dentry(struct inode *inode, struct dentry *dentry)
> +static bool is_root_ceph_dentry(struct inode *inode, struct dentry *dentry)
> {
>   return ceph_ino(inode) == CEPH_INO_ROOT &&
>   strncmp(dentry->d_name.name, ".ceph", 5) == 0;
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index b82f507..db88fef 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -415,7 +415,7 @@ static inline u64 ceph_snap(struct inode *inode)
>   return ceph_inode(inode)->i_vino.snap;
> }
> 
> -static inline int ceph_ino_compare(struct inode *inode, void *data)
> +static inline bool ceph_ino_compare(struct inode *inode, void *data)

This one is used by ilookup5().  ilookup5() wants a function pointer of type
"int (*test)(struct inode *, void *)".  The rest of the changes look good.
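
For reference, the constraint comes from the prototype ilookup5() exposes in
include/linux/fs.h, which takes the test callback as an int-returning
function pointer:

struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
		       int (*test)(struct inode *, void *), void *data);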

Regards
Yan, Zheng


> {
>   struct ceph_vino *pvino = (struct ceph_vino *)data;
>   struct ceph_inode_info *ci = ceph_inode(inode);
> diff --git a/include/linux/ceph/ceph_frag.h b/include/linux/ceph/ceph_frag.h
> index 5babb8e..44e6067 100644
> --- a/include/linux/ceph/ceph_frag.h
> +++ b/include/linux/ceph/ceph_frag.h
> @@ -40,11 +40,11 @@ static inline __u32 ceph_frag_mask_shift(__u32 f)
>   return 24 - ceph_frag_bits(f);
> }
> 
> -static inline int ceph_frag_contains_value(__u32 f, __u32 v)
> +static inline bool ceph_frag_contains_value(__u32 f, __u32 v)
> {
>   return (v & ceph_frag_mask(f)) == ceph_frag_value(f);
> }
> -static inline int ceph_frag_contains_frag(__u32 f, __u32 sub)
> +static inline bool ceph_frag_contains_frag(__u32 f, __u32 sub)
> {
>   /* is sub as specific as us, and contained by us? */
>   return ceph_frag_bits(sub) >= ceph_frag_bits(f) &&
> @@ -56,12 +56,12 @@ static inline __u32 ceph_frag_parent(__u32 f)
>   return ceph_frag_make(ceph_frag_bits(f) - 1,
>ceph_frag_value(f) & (ceph_frag_mask(f) << 1));
> }
> -static inline int ceph_frag_is_left_child(__u32 f)
> +static inline bool ceph_frag_is_left_child(__u32 f)
> {
>   return ceph_frag_bits(f) > 0 &&
>   (ceph_frag_value(f) & (0x100 >> ceph_frag_bits(f))) == 0;
> }
> -static inline int ceph_frag_is_right_child(__u32 f)
> +static inline bool ceph_frag_is_right_child(__u32 f)
> {
>   return ceph_frag_bits(f) > 0 &&
>   (ceph_frag_value(f) & (0x100 >> ceph_frag_bits(f))) == 1;
> @@ -86,11 +86,11 @@ static inline __u32 ceph_frag_make_child(__u32 f, int by, 
> int i)
>   return ceph_frag_make(newbits,
>ceph_frag_value(f) | (i << (24 - newbits)));
> }
> -static inline int ceph_frag_is_leftmost(__u32 f)
> +static inline bool ceph_frag_is_leftmost(__u32 f)
> {
>   return ceph_frag_value(f) == 0;
> }
> -static inline int ceph_frag_is_rightmost(__u32 f)
> +static inline bool ceph_frag_is_rightmost(__u32 f)
> {
>   return ceph_frag_value(f) == ceph_frag_mask(f);
> }
> diff --git a/include/linux/ceph/decode.h b/include/linux/ceph/decode.h
> index a6ef9cc..19e9932 100644
> --- a/include/linux/ceph/decode.h
> +++ b/include/linux/ceph/deco
