Re: [PATCH] ceph: don't return -ESTALE if there's still an open file

2020-05-18 Thread Gregory Farnum
Maybe we resolved this conversation; I can't quite tell...

On Fri, May 15, 2020 at 12:16 PM Jeff Layton  wrote:
>
> On Fri, 2020-05-15 at 19:56 +0300, Amir Goldstein wrote:
> > On Fri, May 15, 2020 at 2:38 PM Jeff Layton  wrote:
> > > On Fri, 2020-05-15 at 12:15 +0100, Luis Henriques wrote:
> > > > On Fri, May 15, 2020 at 09:42:24AM +0300, Amir Goldstein wrote:
> > > > > +CC: fstests
> > > > >
> > > > > On Thu, May 14, 2020 at 4:15 PM Jeff Layton  
> > > > > wrote:
> > > > > > On Thu, 2020-05-14 at 13:48 +0100, Luis Henriques wrote:
> > > > > > > On Thu, May 14, 2020 at 08:10:09AM -0400, Jeff Layton wrote:
> > > > > > > > On Thu, 2020-05-14 at 12:14 +0100, Luis Henriques wrote:
> > > > > > > > > Similarly to commit 03f219041fdb ("ceph: check i_nlink while 
> > > > > > > > > converting
> > > > > > > > > a file handle to dentry"), this fixes another corner case with
> > > > > > > > > name_to_handle_at/open_by_handle_at.  The issue has been 
> > > > > > > > > detected by
> > > > > > > > > xfstest generic/467, when doing:
> > > > > > > > >
> > > > > > > > >  - name_to_handle_at("/cephfs/myfile")
> > > > > > > > >  - open("/cephfs/myfile")
> > > > > > > > >  - unlink("/cephfs/myfile")
> > > > > > > > >  - open_by_handle_at()
> > > > > > > > >
> > > > > > > > > The call to open_by_handle_at should not fail because the 
> > > > > > > > > file still
> > > > > > > > > exists and we do have a valid handle to it.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Luis Henriques 
> > > > > > > > > ---
> > > > > > > > >  fs/ceph/export.c | 13 +++--
> > > > > > > > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/fs/ceph/export.c b/fs/ceph/export.c
> > > > > > > > > index 79dc06881e78..8556df9d94d0 100644
> > > > > > > > > --- a/fs/ceph/export.c
> > > > > > > > > +++ b/fs/ceph/export.c
> > > > > > > > > @@ -171,12 +171,21 @@ struct inode *ceph_lookup_inode(struct 
> > > > > > > > > super_block *sb, u64 ino)
> > > > > > > > >
> > > > > > > > >  static struct dentry *__fh_to_dentry(struct super_block *sb, 
> > > > > > > > > u64 ino)
> > > > > > > > >  {
> > > > > > > > > + struct ceph_inode_info *ci;
> > > > > > > > >   struct inode *inode = __lookup_inode(sb, ino);
> > > > > > > > > +
> > > > > > > > >   if (IS_ERR(inode))
> > > > > > > > >   return ERR_CAST(inode);
> > > > > > > > >   if (inode->i_nlink == 0) {
> > > > > > > > > - iput(inode);
> > > > > > > > > - return ERR_PTR(-ESTALE);
> > > > > > > > > + bool is_open;
> > > > > > > > > + ci = ceph_inode(inode);
> > > > > > > > > > + spin_lock(&ci->i_ceph_lock);
> > > > > > > > > > + is_open = __ceph_is_file_opened(ci);
> > > > > > > > > > + spin_unlock(&ci->i_ceph_lock);
> > > > > > > > > + if (!is_open) {
> > > > > > > > > + iput(inode);
> > > > > > > > > + return ERR_PTR(-ESTALE);
> > > > > > > > > + }
> > > > > > > > >   }
> > > > > > > > >   return d_obtain_alias(inode);
> > > > > > > > >  }
> > > > > > > >
> > > > > > > > Thanks Luis. Out of curiosity, is there any reason we 
> > > > > > > > shouldn't ignore
> > > > > > > > the i_nlink value here? Does anything obviously break if we do?
> > > > > > >
> > > > > > > Yes, the scenario described in commit 03f219041fdb is still 
> > > > > > > valid, which
> > > > > > > is basically the same but without the extra open(2):
> > > > > > >
> > > > > > >   - name_to_handle_at("/cephfs/myfile")
> > > > > > >   - unlink("/cephfs/myfile")
> > > > > > >   - open_by_handle_at()
> > > > > > >
> > > > > >
> > > > > > Ok, I guess we end up doing some delayed cleanup, and that allows 
> > > > > > the
> > > > > > inode to be found in that situation.
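Jeff's "delayed cleanup" point is the file-handle cousin of a classic POSIX guarantee: an unlinked file survives as long as someone holds it open, and its inode is only reclaimed on the last close. A minimal userspace sketch of that guarantee (plain descriptors rather than name_to_handle_at handles; paths are arbitrary):

```python
import os
import tempfile

# Classic POSIX behavior: unlink() removes the *name*, not the data.
# An already-open descriptor keeps the inode alive, which is why an
# nlink == 0 check alone is too aggressive for declaring -ESTALE.
fd, path = tempfile.mkstemp()
os.write(fd, b"still here")
os.unlink(path)                       # i_nlink drops to 0
name_gone = not os.path.exists(path)  # the name is gone...
data = os.pread(fd, 32, 0)            # ...but the data is not
os.close(fd)                          # inode reclaimed here
```

The ceph patch applies the same idea one step further: a handle to an unlinked-but-still-open inode should resolve, while a handle to an unlinked-and-closed inode should not.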
> > > > > >
> > > > > > > The open_by_handle_at man page isn't really clear about these 2 
> > > > > > > scenarios,
> > > > > > > but generic/426 will fail if -ESTALE isn't returned.  Want me to 
> > > > > > > add a
> > > > > > > comment to the code, describing these 2 scenarios?
> > > > > > >
> > > > > >
> > > > > > (cc'ing Amir since he added this test)
> > > > > >
> > > > > > I don't think there is any hard requirement that open_by_handle_at
> > > > > > should fail in that situation. It generally does for most 
> > > > > > filesystems
> > > > > > due to the way they handle cleaning up unlinked inodes, but I don't
> > > > > > think it's technically illegal to allow the inode to still be 
> > > > > > found. If
> > > > > > the caller cares about whether it has been unlinked it can always 
> > > > > > test
> > > > > > i_nlink itself.
> > > > > >
> > > > > > Amir, is this required for some reason that I'm not aware of?
> > > > >
> > > > > Hi Jeff,
> > > > >
> > > > > The origin of this test is in fstests commit:
> > > > > 794798fa xfsqa: test open_by_handle() on unlinked and freed inode 
> > > > > clusters
> > > > >
> > > > > It was 

Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster

2019-09-09 Thread Gregory Farnum
On Mon, Sep 9, 2019 at 4:15 AM Luis Henriques  wrote:
>
> "Jeff Layton"  writes:
>
> > On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote:
> >> OSDs are able to perform object copies across different pools.  Thus,
> >> there's no need to prevent copy_file_range from doing remote copies if the
> >> source and destination superblocks are different.  Only return -EXDEV if
> >> they have different fsid (the cluster ID).
> >>
> >> Signed-off-by: Luis Henriques 
> >> ---
> >>  fs/ceph/file.c | 18 ++
> >>  1 file changed, 14 insertions(+), 4 deletions(-)
> >>
> >> Hi,
> >>
> >> Here's the patch changelog since initial submission:
> >>
> >> - Dropped have_fsid checks on client structs
> >> - Use %pU to print the fsid instead of raw hex strings (%*ph)
> >> - Fixed 'To:' field in email so that this time the patch hits vger
> >>
> >> Cheers,
> >> --
> >> Luis
> >>
> >> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> >> index 685a03cc4b77..4a624a1dd0bb 100644
> >> --- a/fs/ceph/file.c
> >> +++ b/fs/ceph/file.c
> >> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >>  struct ceph_inode_info *src_ci = ceph_inode(src_inode);
> >>  struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
> >>  struct ceph_cap_flush *prealloc_cf;
> >> +struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> >>  struct ceph_object_locator src_oloc, dst_oloc;
> >>  struct ceph_object_id src_oid, dst_oid;
> >>  loff_t endoff = 0, size;
> >> @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >>
> >>  if (src_inode == dst_inode)
> >>  return -EINVAL;
> >> -if (src_inode->i_sb != dst_inode->i_sb)
> >> -return -EXDEV;
> >> +if (src_inode->i_sb != dst_inode->i_sb) {
> >> +struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> >> +
> >> +if (ceph_fsid_compare(&src_fsc->client->fsid,
> >> +  &dst_fsc->client->fsid)) {
> >> +dout("Copying object across different clusters:");
> >> +dout("  src fsid: %pU dst fsid: %pU\n",
> >> + &src_fsc->client->fsid, &dst_fsc->client->fsid);
> >> +return -EXDEV;
> >> +}
> >> +}
> >
> > Just to be clear: what happens here if I mount two entirely separate
> > clusters, and their OSDs don't have any access to one another? Will this
> > fail at some later point with an error that we can catch so that we can
> > fall back?
>
> This is exactly what this check prevents: if we have two CephFS from two
> unrelated clusters mounted and we try to copy a file across them, the
> operation will fail with -EXDEV[1] because the FSIDs for these two
> ceph_fs_client will be different.  OTOH, if these two filesystems are
> within the same cluster (and thus with the same FSID), then the OSDs are
> able to do 'copy-from' operations between them.
>
> I've tested all these scenarios and they seem to be handled correctly.
> Now, I'm assuming that *all* OSDs within the same ceph cluster can
> communicate between themselves; if this assumption is false, then this
> patch is broken.  But again, I'm not aware of any mechanism that
> prevents 2 OSDs from communicating between them.

Your assumption is correct: all OSDs in a Ceph cluster can communicate
with each other. I'm not aware of any plans to change this.

I spent a bit of time trying to figure out how this could break
security models and things and didn't come up with anything, so I
think functionally it's fine even though I find it a bit scary.

Also, yes, cluster FSIDs are UUIDs so they shouldn't collide.
-Greg

>
> [1] Actually, the files will still be copied because we'll fallback into
> the default VFS generic_copy_file_range behaviour, which is to do
> reads+writes operations.
>
> Cheers,
> --
> Luis
>
>
> >
> >
> >>  if (ceph_snap(dst_inode) != CEPH_NOSNAP)
> >>  return -EROFS;
> >>
> >> @@ -1928,7 +1938,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >>   * efficient).
> >>   */
> >>
> >> -if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
> >> +if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
> >>  return -EOPNOTSUPP;
> >>
> >>  if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
> >> @@ -2044,7 +2054,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >>  dst_ci->i_vino.ino, dst_objnum);
> >>  /* Do an object remote copy */
> >>  err = ceph_osdc_copy_from(
> >> -&ceph_inode_to_client(src_inode)->client->osdc,
> >> +&src_fsc->client->osdc,
> >> src_ci->i_vino.snap, 0,
> >> &src_oid, &src_oloc,
> >>  

Re: [PATCH v2 2/2] ceph: quota: fix quota subdir mounts

2019-03-18 Thread Gregory Farnum
On Mon, Mar 18, 2019 at 4:25 PM Luis Henriques  wrote:
>
> "Yan, Zheng"  writes:
>
> > On Mon, Mar 18, 2019 at 5:06 PM Gregory Farnum  wrote:
> >>
> >> On Mon, Mar 18, 2019 at 2:32 PM Yan, Zheng  wrote:
> >> > After reading the code carefully, I feel a little uncomfortable with
> >> > the "lookup_ino" in get_quota_realm.  How about populating directories
> >> > above the 'mount subdir' during mounting (similar to cifs_get_root)?
>
> Wouldn't it be a problem if the directory layout (or, in this case, the
> snaprealm layout) changed during the mount lifetime?  In that case we
> would need to do this lookup anyway.
>
> >>
> >> Isn't that going to be a problem for any clients which have
> >>restricted filesystem access permissions? They may not be able to see
> >>all the directories above their mount point.  -Greg
> >
> > using lookup_ino to get inode above the "mount subdir" has the same problem
> >
>
> In this case I believe we get an -EPERM from the MDS.  And then the
> client simply falls back to the 'default' behaviour, which is to allow
> the user to create/write to files as if there were no quotas set.

Right, but a client is a lot more likely to have access to
/home// than they are to also be able to look at /, and
/home.

So if they can do a lookup on /home// they get their quota,
but if they have to look it up from the root it's lost even though the
client has permission to see all the things that contain their quota
state... :/
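A toy model of the two lookup strategies under discussion (all names, paths, and permission rules here are hypothetical illustrations, not cephfs APIs):

```python
# Quotas live on ancestor directories; a restricted client can stat its
# own subtree but not / or /home.  Walking upward from the mount point
# finds the realm; populating from the root hits -EPERM first.
quotas = {"/home/user": 10 * 2**30}           # quota realm on the subdir
readable = {"/home/user", "/home/user/work"}  # what this client may stat

def realm_bottom_up(path):
    """Walk upward from the mount point; stop at the first -EPERM."""
    while path:
        if path not in readable:
            return None                # the MDS would return -EPERM here
        if path in quotas:
            return quotas[path]
        path = path.rsplit("/", 1)[0]
    return None

def realm_top_down(path):
    """Populate from the root downward (the cifs_get_root-style idea)."""
    parts = path.strip("/").split("/")
    cur = ""
    for p in parts:
        cur += "/" + p
        if cur not in readable:
            return None                # blocked at /home, realm never seen
        if cur in quotas:
            return quotas[cur]
    return None

found_up = realm_bottom_up("/home/user/work")    # finds the 10 GiB realm
found_down = realm_top_down("/home/user/work")   # lost to -EPERM at /home
```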

>
> Cheers,
> --
> Luis


Re: [PATCH v2 2/2] ceph: quota: fix quota subdir mounts

2019-03-18 Thread Gregory Farnum
On Mon, Mar 18, 2019 at 2:32 PM Yan, Zheng  wrote:
> After reading the code carefully, I feel a little uncomfortable with
> the "lookup_ino" in get_quota_realm.  How about populating directories
> above the 'mount subdir' during mounting (similar to cifs_get_root)?

Isn't that going to be a problem for any clients which have restricted
filesystem access permissions? They may not be able to see all the
directories above their mount point.
-Greg


Re: [PATCH] ceph: Delete unused variables in mds_client

2017-10-12 Thread Gregory Farnum
On Thu, Oct 12, 2017 at 2:55 PM, Christos Gkekas  wrote:
> Remove variables in mds_client that are set but never used.

Hmm, it looks like all the values removed here (except err) *are* used
and this patch would break the world. Specifically...

>
> Signed-off-by: Christos Gkekas 
> ---
>  fs/ceph/mds_client.c | 11 ---
>  1 file changed, 11 deletions(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index f23c820..cadf9b6 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -3797,11 +3797,8 @@ void ceph_mdsc_handle_fsmap(struct ceph_mds_client *mdsc, struct ceph_msg *msg)
> void *p = msg->front.iov_base;
> void *end = p + msg->front.iov_len;
> u32 epoch;
> -   u32 map_len;
> u32 num_fs;
> u32 mount_fscid = (u32)-1;
> -   u8 struct_v, struct_cv;
> -   int err = -EINVAL;
>
> ceph_decode_need(&p, end, sizeof(u32), bad);
> epoch = ceph_decode_32(&p);
> @@ -3809,10 +3806,6 @@ void ceph_mdsc_handle_fsmap(struct ceph_mds_client *mdsc, struct ceph_msg *msg)
> dout("handle_fsmap epoch %u\n", epoch);
>
> ceph_decode_need(&p, end, 2 + sizeof(u32), bad);
> -   struct_v = ceph_decode_8(&p);
> -   struct_cv = ceph_decode_8(&p);
> -   map_len = ceph_decode_32(&p);
> -

p here is a pointer into a buffer which was sent to us over the
network (msg->front.iov_base, shown at top of patch). ceph_decode is
reading data out of that buffer into these variables and advancing p
for the subsequent readers. If you drop these statements, we start
trying to read values at the wrong offsets, and obviously nothing
makes sense.
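The offset problem is easy to demonstrate on a made-up little-endian stream shaped like the header above (illustrative only, not ceph's exact wire format):

```python
import struct

# Each decode must advance the cursor.  Dropping the struct_v/struct_cv/
# map_len decodes does not just lose those values -- it leaves the
# cursor pointing at the wrong bytes for every field that follows.
buf = struct.pack("<IBBII", 42, 1, 1, 8, 3)  # epoch, v, cv, map_len, num_fs

def decode_full(b):
    off = 0
    epoch, = struct.unpack_from("<I", b, off); off += 4
    off += 1 + 1 + 4          # struct_v, struct_cv, map_len: consumed
    num_fs, = struct.unpack_from("<I", b, off)
    return epoch, num_fs

def decode_broken(b):
    off = 0
    epoch, = struct.unpack_from("<I", b, off); off += 4
    # "cleaned up" version forgot to advance past the three fields...
    num_fs, = struct.unpack_from("<I", b, off)
    return epoch, num_fs

correct = decode_full(buf)     # (42, 3)
garbage = decode_broken(buf)   # num_fs read from v/cv/map_len bytes
```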

> ceph_decode_need(&p, end, sizeof(u32) * 3, bad);
> p += sizeof(u32) * 2; /* skip epoch and legacy_client_fscid */
>
> @@ -3820,12 +3813,9 @@ void ceph_mdsc_handle_fsmap(struct ceph_mds_client *mdsc, struct ceph_msg *msg)
> while (num_fs-- > 0) {
> void *info_p, *info_end;
> u32 info_len;
> -   u8 info_v, info_cv;
> u32 fscid, namelen;
>
> ceph_decode_need(&p, end, 2 + sizeof(u32), bad);
> -   info_v = ceph_decode_8(&p);
> -   info_cv = ceph_decode_8(&p);
> info_len = ceph_decode_32(&p);
> ceph_decode_need(&p, end, info_len, bad);
> info_p = p;
> @@ -3852,7 +3842,6 @@ void ceph_mdsc_handle_fsmap(struct ceph_mds_client *mdsc, struct ceph_msg *msg)
> ceph_monc_want_map(&fsc->client->monc, CEPH_SUB_MDSMAP, 0, true);
> ceph_monc_renew_subs(&fsc->client->monc);
> } else {
> -   err = -ENOENT;
> goto err_out;

Amusingly, this bit does seem to be correct, as the err_out block just
unconditionally sets mdsc->mdsmap_err to -ENOENT (and is only
reachable from this goto).

> }
> return;

I assume this patch was generated by some code cleanliness tool, so it
appears to need some work... :)
-Greg


Re: [GIT PULL] Ceph fixes for -rc7

2016-03-30 Thread Gregory Farnum
On Wed, Mar 30, 2016 at 1:04 AM, Ilya Dryomov  wrote:
> On Wed, Mar 30, 2016 at 4:40 AM, NeilBrown  wrote:
>> On Wed, Mar 30 2016, Yan, Zheng wrote:
>>
>>> On Wed, Mar 30, 2016 at 8:24 AM, NeilBrown  wrote:
 On Fri, Mar 25 2016, Ilya Dryomov wrote:

> On Fri, Mar 25, 2016 at 5:02 AM, NeilBrown  wrote:
>> On Sun, Mar 06 2016, Sage Weil wrote:
>>
>>> Hi Linus,
>>>
>>> Please pull the following Ceph patch from
>>>
>>>   git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git 
>>> for-linus
>>>
>>> This is a final commit we missed to align the protocol compatibility 
>>> with
>>> the feature bits.  It decodes a few extra fields in two different 
>>> messages
>>> and reports EIO when they are used (not yet supported).
>>>
>>> Thanks!
>>> sage
>>>
>>>
>>> 
>>> Yan, Zheng (1):
>>>   ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support
>>
>> Just wondering, but was CEPH_FEATURE_FS_FILE_LAYOUT_V2 supposed to have
>> exactly the same value as CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING (and
>> CEPH_FEATURE_CRUSH_TUNABLES5)??
>
> Yes, that was the point of getting it merged into -rc7.

 I did wonder if that might be the case.

>
>> Because when I backported this patch (and many others) to some ancient
>> enterprise kernel, it caused mounts to fail.  If it really is meant to
>> be the same value, then I must have some other backported issue to find
>> and fix.
>
> It has to be backported in concert with changes that add support for
> the other two bits.

 I have everything from fs/ceph and net/ceph as of 4.5, with adjustments
 for different core code.

>  How did mount fail?

 "can't read superblock".
 dmesg contains

 [   50.822479] libceph: client144098 fsid 
 2b73bc29-3e78-490a-8fc6-21da1bf901ba
 [   50.823746] libceph: mon0 192.168.1.122:6789 session established
 [   51.635312] ceph: problem parsing mds trace -5
 [   51.635317] ceph: mds parse_reply err -5
 [   51.635318] ceph: mdsc_handle_reply got corrupt reply mds0(tid:1)

 then a hex dump of header:, front: footer:

 Maybe my MDS is causing the problem?  It is based on v10.0.5 which
 contains

 #define CEPH_FEATURE_CRUSH_TUNABLES5(1ULL<<58) /* chooseleaf stable 
 mode */
 // duplicated since it was introduced at the same time as 
 CEPH_FEATURE_CRUSH_TUN
 #define CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING   (1ULL<<58) /* New, v7 
 encoding */

 in ceph_features.h  i.e. two features using bit 58, but not
 FS_FILE_LAYOUT_V2

 Should I expect Linux 4.5 to work with ceph 10.0.5 ??
>>>
>>> Sorry, cephfs in linux 4.5 does not work with 10.0.5. Please upgrade
>>> to ceph 10.1.0
>>>
>>
>> Ahhh..  I do wonder at the point of feature flags if they don't let you
>> run any client with any server...
>> Is there a compatibility matrix published somewhere?
>> If I have to stay with 10.0.5 (I don't know yet), it is safe to use
>> Linux-4.4 code?
>
> 10.0.* are all development cuts, we didn't even build packages for
> some of them.  10.1.0 is the first release candidate.  You can think of
> 10.0.5 as a random pre-rc1 kernel snapshot, aimed at brave testers, so
> you do want to upgrade.
>
> The reason it doesn't work is those three features are all defined to
> the same value, but two of them got added earlier in the 10.0.* cycle.
> CEPH_FEATURE_FS_FILE_LAYOUT_V2 came in last, after 10.0.5.

A little more specifically: these feature bits do let you run any
client with any "real release" of Ceph that we expect not-testers to
be using. They *usually* work on our dev releases as well, but we've
gotten stingier about it as we come close to running out of feature
bits and are trying to pack more of them into the same actual bits
(we're working on freeing them up as well, but got started a little
later than is comfortable), while coordinating code merges between a
few different places. You got unlucky here.
-Greg
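The collision Neil hit can be sketched directly from the quoted header (semantics simplified to plain bit tests):

```python
# Values as in the quoted ceph_features.h: three feature names share one
# wire bit, so a peer advertising bit 58 implicitly claims all three.
# A backport that implements only one of them negotiates capabilities
# it cannot actually speak -- producing exactly the "corrupt reply"
# parse failure reported above.
CEPH_FEATURE_CRUSH_TUNABLES5         = 1 << 58
CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING = 1 << 58
CEPH_FEATURE_FS_FILE_LAYOUT_V2       = 1 << 58

peer_features = CEPH_FEATURE_CRUSH_TUNABLES5  # peer "only" has tunables5

# From the bit alone, a client cannot tell which of the three the peer
# really implements:
claims_layout_v2 = bool(peer_features & CEPH_FEATURE_FS_FILE_LAYOUT_V2)
```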


Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks

2016-03-19 Thread Gregory Farnum
On Wed, Mar 16, 2016 at 5:33 PM, Eric Sandeen  wrote:
> I may have lost the thread at this point, with poor Darrick's original
> patch submission devolving into a long thread about a NO_HIDE_STALE patch
> used at Google, but I don't *think* Ceph ever asked for NO_HIDE_STALE.
>
> At least I can't find any indication of that.
>
> Am I missing something?  cc'ing Greg on this one in case I am.

Brief background:
Ceph currently has two big local storage subsystems: FileStore and
BlueStore. FileStore is the one that's been around for forever and is
currently stable/production-ready/bla bla bla. This one represents
RADOS objects as actual files and while it's *mostly* just converting
object operations into posix FS ones, it does rely on a few pieces of
the fs namespace and posix ops to do its work.
BlueStore is our new, pure userspace solution (Sage started this about
8 months ago, I think?). It started out using xfs basically as a block
allocator, but at this point it's just doing raw block access 100% in
userspace.

So we've not asked for NO_HIDE_STALE on the mailing lists, but I think
it was one of the problems Sage had using xfs in his BlueStore
implementation and was a big part of why it moved to pure userspace.
FileStore might use NO_HIDE_STALE in some places but it would be
pretty limited. When it came up at Linux FAST we were discussing how
it and similar things had been problems for us in the past and it
would've been nice if they were upstream. What *is* a big deal for
FileStore (and would be easy to take advantage of) is the thematically
similar O_NOMTIME flag, which is also about reducing metadata updates
and got blocked on similar stupid-user grounds (although not security
ones): http://thread.gmane.org/gmane.linux.kernel.api/10727.
As noted though, we've basically given up and are moving to a
pure-userspace solution as quickly as we can. So no, Ceph isn't likely
to be a big user of these interfaces as it's too late for us. Adding
them would be an investment for future distributed storage systems
more than current ones. Maybe that's not worth it, or maybe there are
better places to keep them in the kernel. (I think I saw a reference
to some hypothetical shared block allocator? That would be *awesome*.)

=
Separately. In the particular case of the extents and data leaks, a
coworker of mine suggested you could tag any files which *ever* had
unwritten extents with something that prevents them being read by a
user who doesn't have raw block access (and, even better, let us apply
that flag on file create)...that's a weird new security rule for
people to know and requires space for tagging (no idea how bad that
is), but would work in any use cases we have and would not leak
anything the user doesn't already have access to.
-Greg


Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks

2016-03-18 Thread Gregory Farnum
On Thu, Mar 17, 2016 at 10:47 AM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
> On Wed, Mar 16, 2016 at 10:18 PM, Gregory Farnum <g...@gregs42.com> wrote:
>>
>> So we've not asked for NO_HIDE_STALE on the mailing lists, but I think
>> it was one of the problems Sage had using xfs in his BlueStore
>> implementation and was a big part of why it moved to pure userspace.
>> FileStore might use NO_HIDE_STALE in some places but it would be
>> pretty limited. When it came up at Linux FAST we were discussing how
>> it and similar things had been problems for us in the past and it
>> would've been nice if they were upstream.
>
> Hmm.
>
> So to me it really sounds like somebody should cook up a patch, but we
> shouldn't put it in the upstream kernel until we get numbers and
> actual "yes, we'd use this" from outside of google.
>
> I say "outside of google", because inside of google not only do we not
> get numbers, but google can maintain their own patch.
>
> But maybe Ted could at least post the patch google uses, and somebody
> in the Ceph community might want to at least try it out...
>
> >> What *is* a big deal for
>> FileStore (and would be easy to take advantage of) is the thematically
>> similar O_NOMTIME flag, which is also about reducing metadata updates
>> and got blocked on similar stupid-user grounds (although not security
>> ones): http://thread.gmane.org/gmane.linux.kernel.api/10727.
>
> Hmm. I don't hate that patch, because the NOATIME thing really does
> wonders on many loads. NOMTIME makes sense.
>
> It's not like you can't do this with utimes() anyway.
>
> That said, I do wonder if people wouldn't just prefer to expand on and
> improve on the lazytime.
>
> Is there some reason you guys didn't use that?

I wasn't really involved in this stuff but I gather from looking at
http://www.spinics.net/lists/xfs/msg36869.html that any durability
command other than fdatasync is going to write out the mtime updates
to the inodes on disk. Given our durability requirements and the
guarantees offered about when things actually hit disk, that doesn't
work for us. We run an fsync on the order of every 30 seconds, and we
do various combinations of fsync, fdatasync, sync_file_range (and,
well, any command we're provided) to try to bound the amount of dirty
data and prevent fs/disk throughput pauses when we do that full sync.
Anybody trying to do anything similar would want a switch that
prevents operations from updating the mtime on disk no matter what.
-Greg


Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks

2016-03-09 Thread Gregory Farnum
On Thu, Mar 3, 2016 at 3:10 PM, Dave Chinner  wrote:
> On Thu, Mar 03, 2016 at 05:39:52PM -0500, Theodore Ts'o wrote:
>> On Thu, Mar 03, 2016 at 01:54:54PM -0500, Martin K. Petersen wrote:
>> > > "Christoph" == Christoph Hellwig  writes:
>> >
>> > Christoph>  - FALLOC_FL_PUNCH_HOLE assures zeroes are returned, but
>> > Christoph> space is deallocated as much as possible -
>> > Christoph> FALLOC_FL_ZERO_RANGE assures zeroes are returned, AND blocks
>> > Christoph> are actually allocated
>> >
>> > That works for me. I think it would be great if we could have consistent
>> > interfaces for fs and block. The more commonality the merrier.
>>
>> So a question I have is do we want to add a "discard-as-a-hint" analog
>> for fallocate?
>
> Well defined, reliable behaviour only, please. If the device can't
> provide the required hardware offload, then it needs to use the
> generic, slow implementation of the functionality or report
> EOPNOTSUPP.
>
>> P.S.  Speaking of things that are powerful and too dangerous for
>> application programmers, after the Linux FAST workshop, I was having
>> dinner with the Ceph developers and Ric Wheeler, and we were talking
>> about things they really needed.  Turns out they also could use an
>> FALLOC_FL_NO_HIDE_STALE functionality.
>
> For better or for worse, Ceph is moving away from using filesystems
> for its back end object store, so the use of such a hack in Ceph
> has a very limited life.

Well, let's be clear: the reason Ceph is moving away from using local
filesystems is because we couldn't get the overheads of using them
down to what we considered an acceptable level. There are always going
to be some inefficiencies from it of course (since you have two
metadata streams) but the more issues get addressed, the fewer
userspace filesystems will feel or run up against the need to do their
own block device management. :) If none of them get fixed the same
scenario will just repeat itself — a userspace filesystem rises, it
tries to get features it needs into the kernel, it eventually gives up
and drops the kernel out of the loop, and then the fact that nobody's
using the kernel in this scenario will be considered a reason not to
make it work better.

I really am sensitive to the security concerns, just know that if it's
a permanent blocker you're essentially blocking out a growing category
of disk users (who run on an awfully large number of disks!).
-Greg

>
>> I told them I had an
>> out-of-tree patch that had that functionality, and even Ric Wheeler
>> started getting tempted  :-)
>
> You can tempt all you want, but it does not change the basic fact
> that it is dangerous and compromises system security. As such, it
> does not belong in upstream kernels. Especially in this day and age
> where ensuring the fundamental integrity of our systems is more
> important than ever.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/10] fs: ceph: Replace CURRENT_TIME by ktime_get_real_ts()

2016-02-04 Thread Gregory Farnum
On Thu, Feb 4, 2016 at 5:31 AM, Arnd Bergmann  wrote:
> On Thursday 04 February 2016 10:01:31 Ilya Dryomov wrote:
>> On Thu, Feb 4, 2016 at 9:30 AM, Arnd Bergmann  wrote:
>> > On Thursday 04 February 2016 10:00:19 Yan, Zheng wrote:
>> >> > On Feb 4, 2016, at 05:27, Arnd Bergmann  wrote:
>> >
>> > static inline void ceph_decode_timespec(struct timespec *ts,
>> > const struct ceph_timespec *tv)
>> > {
>> > ts->tv_sec = (__kernel_time_t)le32_to_cpu(tv->tv_sec);
>> > ts->tv_nsec = (long)le32_to_cpu(tv->tv_nsec);
>> > }
>> >
>> > Is that intentional and documented? If yes, what is your plan to deal
>> > with y2038 support?
>>
>> tv_sec is used as a time_t, so signed.  The problem is that ceph_timespec is
>> not only passed over the wire, but is also stored on disk, part of quite a
>> few other data structures.
>
> That is only part of the issue though:
>
> Most file systems that store a timespec on disk define the function
> differently:
>
> static inline void ceph_decode_timespec(struct timespec *ts,
> const struct ceph_timespec *tv)
> {
> ts->tv_sec = (time_t)(u32)le32_to_cpu(tv->tv_sec);
> ts->tv_nsec = (long)le32_to_cpu(tv->tv_nsec);
> }
>
> On systems that have a 64-bit time_t, the 1902..1970 interval
> (0x80000000..0xffffffff) and the 2038..2106
> interval (0x80000000..0xffffffff) are written
> as the same 32-bit numbers, so when reading back you have to
> decide which interpretation you want, and your cast to
> __kernel_time_t means that you get the first representation on
> both 32-bit and 64-bit systems.
>
> On systems with a 32-bit time_t, this is the only option you
> have anyway, and some other file systems (ext2/3/4, xfs, ...)
> made the same decision in order to behave in a consistent way
> independent of what kernel (32-bit or 64-bit) you use. This
> is generally a reasonable goal, but it means that you get the
> overflow in 2038 rather than 2106.
>
> Alex Elder changed the cephs behavior in 2013 to be the same
> way, but from the changelog c3f56102f28d ("libceph: validate
> timespec conversions"), I guess this was not intentional, as
> he was also adding a comparison against U32_MAX, which should
> have been S32_MAX.
>
> A lot of other file systems (jfs, jffs2, hpfs, minix) apparently
> prefer the 1970..2106 interpretation of time values.
>
>> The plan is to eventually switch to a 64-bit tv_sec and
>> tv_nsec, bump the version on all the structures that contain it and add
>> a cluster-wide feature bit to deal with older clients.  We've recently had
>> a discussion about this, so it may even happen in a not so distant future,
>> but no promises
>
> Ok. We have a (rough) plan to deal with file systems that don't support
> extended time stamps in the meantime, so depending on user preferences
> we would either allow them to be used as before with times clamped
> to the 2038 overflow date, or only mounted readonly for users that want
> to ensure their systems can survive without regressions in 2038.

I dug up the email conversation about it, although I think Adam has
done more work than it indicates:
http://www.spinics.net/lists/ceph-devel/msg27900.html. I can't speak
to any kernel-specific issues but this kind of transition while
maintaining wire compatibility with older code is something we've done
a lot; it shouldn't be a big deal even in the kernel where we're
slightly less prolific with such things. :)
-Greg


Re: [PATCH] locks: ensure that fl_owner is always initialized properly in flock and lease codepaths

2014-05-06 Thread Gregory Farnum
The Ceph bit is fine.
Acked-by: Greg Farnum 


On Mon, Apr 28, 2014 at 10:50 AM, Jeff Layton  wrote:
> Currently, the fl_owner isn't set for flock locks. Some filesystems use
> byte-range locks to simulate flock locks and there is a common idiom in
> those that does:
>
> fl->fl_owner = (fl_owner_t)filp;
> fl->fl_start = 0;
> fl->fl_end = OFFSET_MAX;
>
> Since flock locks are generally "owned" by the open file description,
> move this into the common flock lock setup code. The fl_start and fl_end
> fields are already set appropriately, so remove the unneeded setting of
> that in flock ops in those filesystems as well.
>
> Finally, the lease code also sets the fl_owner as if they were owned by
> the process and not the open file description. This is incorrect as
> leases have the same ownership semantics as flock locks. Set them the
> same way. The lease code doesn't actually use the fl_owner value for
> anything, so this is more for consistency's sake than a bugfix.
>
> Reported-by: Trond Myklebust 
> Signed-off-by: Jeff Layton 
> ---
>  drivers/staging/lustre/lustre/llite/file.c | 17 ++---
>  fs/9p/vfs_file.c   |  3 ---
>  fs/afs/flock.c |  4 
>  fs/ceph/locks.c| 10 ++
>  fs/fuse/file.c |  1 -
>  fs/locks.c |  4 +++-
>  fs/nfs/file.c  |  4 
>  7 files changed, 11 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
> index 8e844a6371e0..760ccd83f1f7 100644
> --- a/drivers/staging/lustre/lustre/llite/file.c
> +++ b/drivers/staging/lustre/lustre/llite/file.c
> @@ -2691,20 +2691,15 @@ int ll_file_flock(struct file *file, int cmd, struct file_lock *file_lock)
>
> ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_FLOCK, 1);
>
> -   if (file_lock->fl_flags & FL_FLOCK) {
> +   if (file_lock->fl_flags & FL_FLOCK)
> LASSERT((cmd == F_SETLKW) || (cmd == F_SETLK));
> -   /* flocks are whole-file locks */
> -   flock.l_flock.end = OFFSET_MAX;
> -   /* For flocks owner is determined by the local file descriptor*/
> -   flock.l_flock.owner = (unsigned long)file_lock->fl_file;
> -   } else if (file_lock->fl_flags & FL_POSIX) {
> -   flock.l_flock.owner = (unsigned long)file_lock->fl_owner;
> -   flock.l_flock.start = file_lock->fl_start;
> -   flock.l_flock.end = file_lock->fl_end;
> -   } else {
> +   else if (!(file_lock->fl_flags & FL_POSIX))
> return -EINVAL;
> -   }
> +
> +   flock.l_flock.owner = (unsigned long)file_lock->fl_owner;
> flock.l_flock.pid = file_lock->fl_pid;
> +   flock.l_flock.start = file_lock->fl_start;
> +   flock.l_flock.end = file_lock->fl_end;
>
> /* Somewhat ugly workaround for svc lockd.
>  * lockd installs custom fl_lmops->lm_compare_owner that checks
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index d8223209d4b1..59e3fe3d56c0 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -352,9 +352,6 @@ static int v9fs_file_flock_dotl(struct file *filp, int cmd,
> invalidate_mapping_pages(&inode->i_data, 0, -1);
> }
> /* Convert flock to posix lock */
> -   fl->fl_owner = (fl_owner_t)filp;
> -   fl->fl_start = 0;
> -   fl->fl_end = OFFSET_MAX;
> fl->fl_flags |= FL_POSIX;
> fl->fl_flags ^= FL_FLOCK;
>
> diff --git a/fs/afs/flock.c b/fs/afs/flock.c
> index a8cf2cff836c..4baf1d2b39e4 100644
> --- a/fs/afs/flock.c
> +++ b/fs/afs/flock.c
> @@ -555,10 +555,6 @@ int afs_flock(struct file *file, int cmd, struct file_lock *fl)
> return -ENOLCK;
>
> /* we're simulating flock() locks using posix locks on the server */
> -   fl->fl_owner = (fl_owner_t) file;
> -   fl->fl_start = 0;
> -   fl->fl_end = OFFSET_MAX;
> -
> if (fl->fl_type == F_UNLCK)
> return afs_do_unlk(file, fl);
> return afs_do_setlk(file, fl);
> diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> index d94ba0df9f4d..db8c1ca15d72 100644
> --- a/fs/ceph/locks.c
> +++ b/fs/ceph/locks.c
> @@ -52,10 +52,7 @@ static int ceph_lock_message(u8 lock_type, u16 operation, struct file *file,
> else
> length = fl->fl_end - fl->fl_start + 1;
>
> -   if (lock_type == CEPH_LOCK_FCNTL)
> -   owner = secure_addr(fl->fl_owner);
> -   else
> -   owner = secure_addr(fl->fl_file);
> +   owner = secure_addr(fl->fl_owner);
>
> dout("ceph_lock_message: rule: %d, op: %d, owner: %llx, pid: %llu, "
>  "start: %llu, length: %llu, wait: %d, type: %d", (int)lock_type,
> @@ -313,10 +310,7 @@ int lock_to_ceph_filelock(struct file_lock *lock,
> cephlock->length = 
