osd: new pool flags: noscrub, nodeep-scrub

2015-09-11 Thread Mykola Golub
Hi,

I would like to add new pool flags: noscrub and nodeep-scrub, to be
able to control scrubbing on a per-pool basis. In our case it could be
helpful for disabling scrubbing on cache pools, which does not
work well right now, but I can imagine other scenarios where it could
be useful too.

Before I create a pull request, I would like to see whether other people
consider this useful or maybe have some other suggestions?

-- 
Mykola Golub


Ceph Wiki has moved!

2015-09-11 Thread Patrick McGarry
Hey cephers,

Just a note to let you know that the wiki migration is complete.
http://wiki.ceph.com now 301s to the deep link inside of our Ceph
tracker instance.

All content from the original wiki has been moved over and is ready
for consumption and editing. You should be able to create a tracker
account and get started hacking right away. If you would like to see
the wiki map you can do so via the "Index by Title" link in the
sidebar or using the following URL:

http://tracker.ceph.com/projects/ceph/wiki/index

If anyone has problems, questions, or concerns, feel free to contact
me directly. Thanks.


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


[GIT PULL] Ceph changes for 4.3-rc1

2015-09-11 Thread Sage Weil
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There are a few fixes for snapshot behavior with CephFS and support for 
the new keepalive protocol from Zheng, a libceph fix that affects both RBD 
and CephFS, a few bug fixes and cleanups for RBD from Ilya, and several 
small fixes and cleanups from Jianpeng and others.

Thanks!
sage



Benoît Canet (1):
  libceph: Avoid holding the zero page on ceph_msgr_slab_init errors

Brad Hubbard (1):
  ceph: remove redundant test of head->safe and silence static analysis 
warnings

Ilya Dryomov (4):
  libceph: rename con_work() to ceph_con_workfn()
  rbd: fix double free on rbd_dev->header_name
  rbd: plug rbd_dev->header.object_prefix memory leak
  libceph: check data_len in ->alloc_msg()

Jianpeng Ma (3):
  ceph: remove the useless judgement
  ceph: no need to get parent inode in ceph_open
  ceph: cleanup use of ceph_msg_get

Nicholas Krause (1):
  libceph: remove the unused macro AES_KEY_SIZE

Yan, Zheng (7):
  ceph: EIO all operations after forced umount
  ceph: invalidate dirty pages after forced umount
  ceph: fix queuing inode to mdsdir's snaprealm
  libceph: set 'exists' flag for newly up osd
  libceph: use keepalive2 to verify the mon session is alive
  ceph: get inode size for each append write
  ceph: improve readahead for file holes

 drivers/block/rbd.c|  6 ++--
 fs/ceph/addr.c |  6 ++--
 fs/ceph/caps.c |  8 +
 fs/ceph/file.c | 14 
 fs/ceph/mds_client.c   | 59 ++
 fs/ceph/mds_client.h   |  1 +
 fs/ceph/snap.c |  7 
 fs/ceph/super.c|  1 +
 include/linux/ceph/libceph.h   |  2 ++
 include/linux/ceph/messenger.h |  4 +++
 include/linux/ceph/msgr.h  |  4 ++-
 net/ceph/ceph_common.c |  1 +
 net/ceph/crypto.c  |  4 ---
 net/ceph/messenger.c   | 82 +++---
 net/ceph/mon_client.c  | 37 ++-
 net/ceph/osd_client.c  | 51 ++
 net/ceph/osdmap.c  |  2 +-
 17 files changed, 191 insertions(+), 98 deletions(-)

Re: osd: new pool flags: noscrub, nodeep-scrub

2015-09-11 Thread Gregory Farnum
On Fri, Sep 11, 2015 at 7:42 AM, Mykola Golub  wrote:
> Hi,
>
> I would like to add new pool flags: noscrub and nodeep-scrub, to be
> able to control scrubbing on a per-pool basis. In our case it could be
> helpful for disabling scrubbing on cache pools, which does not
> work well right now, but I can imagine other scenarios where it could
> be useful too.

Can you talk more about this? It sounds to me like maybe you dislike
the performance impact of scrubbing, but it's fairly important in
terms of data integrity. I don't think we want to permanently disable
them. A corruption in the cache pool isn't any less important than in
the backing pool — it will eventually get flushed, and it's where all
the reads will be handled!
-Greg


RE: loadable objectstore

2015-09-11 Thread James (Fei) Liu-SSI
Hi Varada,
  Got a chance to go through the code. Great job. It is much cleaner. Several 
questions:
  1. What do you think about the performance impact of the new implementation, 
such as dynamic library vs. static linking?
  2. Could any vendor just provide a dynamic binary library implementing the 
objectstore interfaces for their own storage engine with the new factory framework? 
  
  Regards,
  James
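
For readers following along, the factory approach being discussed comes down to the standard dlopen/dlsym plugin pattern, similar in spirit to how the erasure-coding plugins are loaded. The sketch below is generic and self-contained; the library name, symbol name, and path are invented for illustration and are not the names used in the pull request.

#include <dlfcn.h>
#include <stdio.h>

/* Generic plugin-loading pattern: open a shared object and resolve a
 * factory symbol from it.  "libceph_examplestore.so" and
 * "objectstore_factory" are made-up names for illustration only. */
typedef void *(*objectstore_factory_fn)(const char *data_path);

int main(void)
{
    void *handle = dlopen("./libceph_examplestore.so", RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    objectstore_factory_fn factory =
        (objectstore_factory_fn)dlsym(handle, "objectstore_factory");
    if (!factory) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }

    void *store = factory("/var/lib/ceph/osd/example");  /* backend instance */
    printf("created store instance %p\n", store);

    dlclose(handle);
    return 0;
}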
  

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Varada Kari
Sent: Friday, September 11, 2015 3:28 AM
To: Sage Weil; Matt W. Benjamin; Loic Dachary
Cc: ceph-devel
Subject: RE: loadable objectstore

Hi Sage/ Matt,

I have submitted the pull request based on wip-plugin branch for the object 
store factory implementation at https://github.com/ceph/ceph/pull/5884 . 
Haven't rebased to the master yet. Working on rebase and including new store in 
the factory implementation.  Please have a look and let me know your comments. 
Will submit a rebased PR soon with new store integration. 

Thanks,
Varada

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Varada Kari
Sent: Friday, July 03, 2015 7:31 PM
To: Sage Weil ; Adam Crume 
Cc: Loic Dachary ; ceph-devel ; 
Matt W. Benjamin 
Subject: RE: loadable objectstore

Hi All,

Not able to make much progress after making common as a shared object along 
with object store. 
Compilation of the test binaries is failing with 
"./.libs/libceph_filestore.so: undefined reference to `tracepoint_dlopen'".

  CXXLDceph_streamtest
./.libs/libceph_filestore.so: undefined reference to `tracepoint_dlopen'
collect2: error: ld returned 1 exit status
make[3]: *** [ceph_streamtest] Error 1

But libfilestore.so is linked with lttng-ust.

src/.libs$ ldd libceph_filestore.so
libceph_keyvaluestore.so.1 => 
/home/varada/obs-factory/plugin-work/src/.libs/libceph_keyvaluestore.so.1 
(0x7f5e50f5)
libceph_os.so.1 => 
/home/varada/obs-factory/plugin-work/src/.libs/libceph_os.so.1 
(0x7f5e4f93a000)
libcommon.so.1 => /home/varada/ 
obs-factory/plugin-work/src/.libs/libcommon.so.1 (0x7f5e4b5df000)
liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0 
(0x7f5e4b179000)
liblttng-ust-tracepoint.so.0 => 
/usr/lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0 (0x7f5e4a021000)
liburcu-bp.so.1 => /usr/lib/liburcu-bp.so.1 (0x7f5e49e1a000)
liburcu-cds.so.1 => /usr/lib/liburcu-cds.so.1 (0x7f5e49c12000)

Edited the above output to show just the dependencies.  
Did anyone face this issue before? 
Any help would be much appreciated. 

Thanks,
Varada

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Varada Kari
Sent: Friday, June 26, 2015 3:34 PM
To: Sage Weil
Cc: Loic Dachary; ceph-devel; Matt W. Benjamin
Subject: RE: loadable objectstore

Hi,

Made some more changes to resolve lttng problems at 
https://github.com/varadakari/ceph/commits/wip-plugin.
But couldn't bypass the issues. Facing some issues like the one mentioned below.

./.libs/libceph_filestore.so: undefined reference to `tracepoint_dlopen'

Compiling with -llttng-ust is not resolving the problem. I have seen some threads on the 
devel list before mentioning this problem. 
Can anyone take a look and guide me to fix this problem?

Haven't made the changes to change the plugin name etc... will be making them 
as part of cleanup.

Thanks,
Varada

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Varada Kari
Sent: Monday, June 22, 2015 8:57 PM
To: Matt W. Benjamin
Cc: Loic Dachary; ceph-devel; Sage Weil
Subject: RE: loadable objectstore

Hi Matt,

The majority of the changes are segregating the files into the corresponding shared 
objects and creating a factory object. The naming is mostly taken from the 
erasure-coding plugins. I want a good naming convention :-), hence a preliminary 
review. I do agree we have a lot of loadable interfaces, and I think we are on the 
way to making them on-demand (if possible) loadable modules.

Varada

-Original Message-
From: Matt W. Benjamin [mailto:m...@cohortfs.com]
Sent: Monday, June 22, 2015 8:37 PM
To: Varada Kari
Cc: Loic Dachary; ceph-devel; Sage Weil
Subject: Re: loadable objectstore

Hi,

It's just aesthetic, but it feels clunky to change the names of well known 
modules to Plugin--esp. if that generalizes forward to new loadable 
modules (and we have a lot of loadable interfaces).

Matt

- "Varada Kari"  wrote:

> Hi Sage,
>
> Please find the initial implementation of the object store factory 
> (initial cut) at 
> https://github.com/varadakari/ceph/commit/9d5fe2fecf38ba106c7c7b7a3ede
> 4f189ec7e1c8
>
> This is still work in 

Re: About Fio backend with ObjectStore API

2015-09-11 Thread Casey Bodley
Hi James,

I just looked back at the results you posted, and saw that you were using 
iodepth=1. Setting this higher should help keep the FileStore busy.

Casey

- Original Message -
> From: "James (Fei) Liu-SSI" 
> To: "Casey Bodley" 
> Cc: "Haomai Wang" , ceph-devel@vger.kernel.org
> Sent: Friday, September 11, 2015 1:18:31 PM
> Subject: RE: About Fio backend with ObjectStore API
> 
> Hi Casey,
>   You are right. I think the bottleneck is on the fio side rather than on the
>   filestore side in this case. fio did not issue the IO commands fast
>   enough to saturate the filestore.
>   Here is one possible solution for it: create an async engine, which is
>   normally much faster than a sync engine in fio.
>
>   Here is a possible framework. This new Objectstore-AIO engine in FIO will
>   in theory be much faster than the sync engine. Once we have an FIO engine
>   that can saturate newstore, memstore and filestore, we can investigate in
>   detail where the bottlenecks in their designs are.
> 
> .
> struct objectstore_aio_data {
>   struct aio_ctx *q_aio_ctx;
>   struct aio_completion_data *a_data;
>   aio_ses_ctx_t *p_ses_ctx;
>   unsigned int entries;
> };
> ...
> /*
>  * Note that the structure is exported, so that fio can get it via
>  * dlsym(..., "ioengine");
>  */
> struct ioengine_ops us_aio_ioengine = {
>   .name       = "objectstore-aio",
>   .version    = FIO_IOOPS_VERSION,
>   .init       = fio_objectstore_aio_init,
>   .prep       = fio_objectstore_aio_prep,
>   .queue      = fio_objectstore_aio_queue,
>   .cancel     = fio_objectstore_aio_cancel,
>   .getevents  = fio_objectstore_aio_getevents,
>   .event      = fio_objectstore_aio_event,
>   .cleanup    = fio_objectstore_aio_cleanup,
>   .open_file  = fio_objectstore_aio_open,
>   .close_file = fio_objectstore_aio_close,
> };
> 
> 
> Let me know what you think.
> 
> Regards,
> James
> 
> -Original Message-
> From: Casey Bodley [mailto:cbod...@redhat.com]
> Sent: Friday, September 11, 2015 7:28 AM
> To: James (Fei) Liu-SSI
> Cc: Haomai Wang; ceph-devel@vger.kernel.org
> Subject: Re: About Fio backend with ObjectStore API
> 
> Hi James,
> 
> That's great that you were able to get fio-objectstore running! Thanks to you
> and Haomai for all the help with testing.
> 
> In terms of performance, it's possible that we're not handling the
> completions optimally. When profiling with MemStore I remember seeing a
> significant amount of cpu time spent in polling with
> fio_ceph_os_getevents().
> 
> The issue with reads is more of a design issue than a bug. Because the test
> starts with a mkfs(), there are no objects to read from initially. You would
> just have to add a write job to run before the read job, to make sure that
> the objects are initialized. Or perhaps the mkfs() step could be an optional
> part of the configuration.
> 
> Casey
> 
> - Original Message -
> From: "James (Fei) Liu-SSI" 
> To: "Haomai Wang" , "Casey Bodley" 
> Cc: ceph-devel@vger.kernel.org
> Sent: Thursday, September 10, 2015 8:08:04 PM
> Subject: RE: About Fio backend with ObjectStore API
> 
> Hi Casey and Haomai,
> 
>   We finally made the fio-objectstore work on our end. Here is fio data
>   against filestore with a Samsung 850 Pro. It is a sequential write and the
>   performance is very poor, which is expected though.
> 
> Run status group 0 (all jobs):
>   WRITE: io=524288KB, aggrb=9467KB/s, minb=9467KB/s, maxb=9467KB/s,
>   mint=55378msec, maxt=55378msec
> 
>   But anyway, it works, even though there are still some bugs to fix, like the
>   read and filesystem issues. Thanks a lot for your great work.
> 
>   Regards,
>   James
> 
>   jamesliu@jamesliu-OptiPlex-7010:~/WorkSpace/ceph_casey/src$ sudo ./fio/fio
>   ./test/objectstore.fio
> filestore: (g=0): rw=write, bs=128K-128K/128K-128K/128K-128K,
> ioengine=cephobjectstore, iodepth=1 fio-2.2.9-56-g736a Starting 1 process
> test1
> filestore: Laying out IO file(s) (1 file(s) / 512MB)
> 2015-09-10 16:55:40.614494 7f19d34d1840  1 filestore(/home/jamesliu/fio_ceph)
> mkfs in /home/jamesliu/fio_ceph
> 2015-09-10 16:55:40.614924 7f19d34d1840  1 filestore(/home/jamesliu/fio_ceph)
> mkfs generated fsid 5508d58e-dbfc-48a5-9f9c-c639af4fe73a
> 2015-09-10 16:55:40.630326 7f19d34d1840  1 filestore(/home/jamesliu/fio_ceph)
> write_version_stamp 4
> 2015-09-10 16:55:40.673417 7f19d34d1840  0 filestore(/home/jamesliu/fio_ceph)
> backend xfs (magic 0x58465342)
> 2015-09-10 16:55:40.724097 7f19d34d1840  1 filestore(/home/jamesliu/fio_ceph)
> leveldb db exists/created
> 2015-09-10 16:55:40.724218 7f19d34d1840 -1 journal FileJournal::_open:
> disabling aio for non-block journal.  Use journal_force_aio to 
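
To make the asynchronous-engine idea quoted above concrete, here is a minimal, self-contained sketch of the submit/poll completion-queue pattern such an engine relies on: the submit path only records the io and returns, a worker thread (standing in for the ObjectStore commit callback) marks completions, and a polling function plays the role of getevents(). All of the names and structures below are hypothetical; this is not fio's actual ioengine API or the ObjectStore API.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define QUEUE_DEPTH 16

struct io_unit {
    int id;
    int done;                     /* set by the completion callback */
};

static struct io_unit queue[QUEUE_DEPTH];
static int queued;                /* number of submitted io units */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Worker thread standing in for the backend's commit callback.  All
 * QUEUE_DEPTH units are submitted before this thread starts. */
static void *completer(void *arg)
{
    (void)arg;
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        usleep(1000);             /* pretend the backend is doing work */
        pthread_mutex_lock(&lock);
        queue[i].done = 1;        /* "on_commit" fires for io unit i */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Submit path: record the io and return immediately (queued, not completed). */
static void submit(int id)
{
    pthread_mutex_lock(&lock);
    queue[queued].id = id;
    queue[queued].done = 0;
    queued++;
    pthread_mutex_unlock(&lock);
}

/* Poll path: count how many io units have completed so far. */
static int poll_completions(void)
{
    int complete = 0;
    pthread_mutex_lock(&lock);
    for (int i = 0; i < queued; i++)
        complete += queue[i].done;
    pthread_mutex_unlock(&lock);
    return complete;
}

int main(void)
{
    pthread_t worker;

    for (int i = 0; i < QUEUE_DEPTH; i++)
        submit(i);                            /* fill the queue up front */
    pthread_create(&worker, NULL, completer, NULL);

    int done = 0;
    while (done < QUEUE_DEPTH) {              /* the "getevents" loop */
        done = poll_completions();
        printf("completed %d/%d\n", done, QUEUE_DEPTH);
        usleep(2000);
    }
    pthread_join(worker, NULL);
    return 0;
}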

Re: [PATCH] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations

2015-09-11 Thread J. Bruce Fields
On Fri, Sep 11, 2015 at 06:20:30AM -0400, Jeff Layton wrote:
> With NFSv3 nfsd will always attempt to send along WCC data to the
> client. This generally involves saving off the in-core inode information
> prior to doing the operation on the given filehandle, and then issuing a
> vfs_getattr to it after the op.
> 
> Some filesystems (particularly clustered or networked ones) have an
> expensive ->getattr inode operation. Atomicity is also often difficult 
> or impossible to guarantee on such filesystems. For those, we're best
> off not trying to provide WCC information to the client at all, and to
> simply allow it to poll for that information as needed with a GETATTR
> RPC.
> 
> This patch adds a new flags field to struct export_operations, and
> defines a new EXPORT_OP_NOWCC flag that filesystems can use to indicate
> that nfsd should not attempt to provide WCC info in NFSv3 replies. It
> also adds a blurb about the new flags field and flag to the exporting
> documentation.
> 
> The server will also now skip collecting this information for NFSv2 as
> well, since that info is never used there anyway.
> 
> Note that this patch does not add this flag to any filesystem
> export_operations structures. This was originally developed to allow
> reexporting nfs via nfsd. That code is not (and may never be) suitable
> for merging into mainline.
> 
> Other filesystems may want to consider enabling this flag too. It's hard
> to tell however which ones have export operations to enable export via
> knfsd and which ones mostly rely on them for open-by-filehandle support,

Are there any in the latter class?  I'm not sure how or why you'd
support open-by-filehandle without supporting nfs exports.

> so I'm leaving that up to the individual maintainers to decide. I am
> cc'ing the relevant lists for those filesystems that I think may want to
> consider adding this though.

I'd definitely like to see evidence from maintainers of those
filesystems that this would be useful to them.

--b.

> 
> Cc: hpdd-disc...@lists.01.org
> Cc: ceph-devel@vger.kernel.org
> Cc: cluster-de...@redhat.com
> Cc: fuse-de...@lists.sourceforge.net
> Cc: ocfs2-de...@oss.oracle.com
> Signed-off-by: Jeff Layton 
> ---
>  Documentation/filesystems/nfs/Exporting | 27 +++
>  fs/nfsd/nfs3xdr.c   |  5 -
>  fs/nfsd/nfsfh.c | 14 ++
>  fs/nfsd/nfsfh.h |  5 -
>  include/linux/exportfs.h|  2 ++
>  5 files changed, 51 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/filesystems/nfs/Exporting 
> b/Documentation/filesystems/nfs/Exporting
> index 520a4becb75c..fa636cde3907 100644
> --- a/Documentation/filesystems/nfs/Exporting
> +++ b/Documentation/filesystems/nfs/Exporting
> @@ -138,6 +138,11 @@ struct which has the following members:
>  to find potential names, and matches inode numbers to find the correct
>  match.
>  
> +  flags
> +Some filesystems may need to be handled differently than others. The
> +export_operations struct also includes a flags field that allows the
> +filesystem to communicate such information to nfsd. See the Export
> +Operations Flags section below for more explanation.
>  
>  A filehandle fragment consists of an array of 1 or more 4byte words,
>  together with a one byte "type".
> @@ -147,3 +152,25 @@ generated by encode_fh, in which case it will have been 
> padded with
>  nuls.  Rather, the encode_fh routine should choose a "type" which
>  indicates the decode_fh how much of the filehandle is valid, and how
>  it should be interpreted.
> +
> +Export Operations Flags
> +---
> +In addition to the operation vector pointers, struct export_operations also
> +contains a "flags" field that allows the filesystem to communicate to nfsd
> +that it may want to do things differently when dealing with it. The
> +following flags are defined:
> +
> +  EXPORT_OP_NOWCC
> +RFC 1813 recommends that servers always send weak cache consistency
> +(WCC) data to the client after each operation. The server should
> +atomically collect attributes about the inode, do an operation on it,
> +and then collect the attributes afterward. This allows the client to
> +skip issuing GETATTRs in some situations but means that the server
> +is calling vfs_getattr for almost all RPCs. On some filesystems
> +(particularly those that are clustered or networked) this is expensive
> +and atomicity is difficult to guarantee. This flag indicates to nfsd
> +that it should skip providing WCC attributes to the client in NFSv3
> +replies when doing operations on this filesystem. Consider enabling
> +this on filesystems that have an expensive ->getattr inode operation,
> +or when atomicity between pre and post operation attribute collection
> +is impossible to guarantee.
> diff --git a/fs/nfsd/nfs3xdr.c 

[rgw] Multi-tenancy support in radosgw

2015-09-11 Thread Radoslaw Zarzynski
Hello,

It's a well-known trait of radosgw that a user cannot create a new
bucket with a given name if the name is already occupied by another
user's bucket (a request to do that will be rejected with 409 Conflict).
This behaviour is entirely expected in S3. However, when it comes
to the Swift API, it turns into a huge limitation.

In my opinion the root cause lies in how radosgw actually handles
bucket entry points. They might be seen as symlinks which must
be resolved in order to map bucket names exposed to users into
concrete bucket instances:

  bucket name -> unique bucket instance ID

It's completely clear we need to preserve backward compatibility.
Thus we cannot simply append the ID of the user who owns a given
bucket to the argument list of the mapping function. We would have to
introduce an indirection layer - bucket namespaces:

  bucket namespace, bucket name -> unique bucket instance ID

Each already existing user would obtain an empty bucket namespace
by default. It will be possible to create a user with his own, unique
namespace.
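
As a toy illustration of the proposed indirection, the sketch below composes the entry-point lookup key from (namespace, bucket name), with an empty namespace falling back to today's bare bucket name for backward compatibility. The key format and function are invented for illustration and are not radosgw's actual on-disk layout.

#include <stdio.h>
#include <string.h>

/* Compose the key used to look up a bucket entry point.  An empty
 * namespace falls back to the bare bucket name, which is what preserves
 * backward compatibility for already existing users. */
static void entrypoint_key(const char *ns, const char *bucket,
                           char *out, size_t outlen)
{
    if (ns == NULL || *ns == '\0')
        snprintf(out, outlen, "%s", bucket);         /* legacy mapping */
    else
        snprintf(out, outlen, "%s/%s", ns, bucket);  /* namespaced mapping */
}

int main(void)
{
    char key[256];

    entrypoint_key("", "photos", key, sizeof(key));
    printf("default namespace -> %s\n", key);         /* "photos" */

    entrypoint_key("tenant-a", "photos", key, sizeof(key));
    printf("tenant-a namespace -> %s\n", key);        /* "tenant-a/photos" */
    return 0;
}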

Generally this looks to me like a stupidly simple solution. Of course,
it has limitations. At the moment I see the following things:

1. We may need to develop a new mechanism for moving buckets
between namespaces in radosgw-admin. The already existing
one for linking/unlinking might not be enough.

2. We will always need the ID of the namespace in order to access the proper
bucket entry points. This is not a problem for requests authorized
through Keystone/TempAuth because the user ID is known and thus
the BNS may be easily calculated.
Anonymous access/TempURL for a bucket with a non-empty namespace
is a different story. In this case the BNS must be provided explicitly.
Swift resolves the problem by having the account name as a part of the URL.
We could go the same way. This would be extended by decoupling
rgw_user used for storage access purposes from the one for
authorizing a given operation (RGWOp::verify_permission() method).


I would like to ask for reviews of the idea and feedback.

Best regards,
Radoslaw Zarzynski


Re: [PATCH] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations

2015-09-11 Thread Jeff Layton
On Fri, 11 Sep 2015 17:29:57 -0400
"J. Bruce Fields"  wrote:

> On Fri, Sep 11, 2015 at 06:20:30AM -0400, Jeff Layton wrote:
> > With NFSv3 nfsd will always attempt to send along WCC data to the
> > client. This generally involves saving off the in-core inode information
> > prior to doing the operation on the given filehandle, and then issuing a
> > vfs_getattr to it after the op.
> > 
> > Some filesystems (particularly clustered or networked ones) have an
> > expensive ->getattr inode operation. Atomicity is also often difficult
> > or impossible to guarantee on such filesystems. For those, we're best
> > off not trying to provide WCC information to the client at all, and to
> > simply allow it to poll for that information as needed with a GETATTR
> > RPC.
> > 
> > This patch adds a new flags field to struct export_operations, and
> > defines a new EXPORT_OP_NOWCC flag that filesystems can use to indicate
> > that nfsd should not attempt to provide WCC info in NFSv3 replies. It
> > also adds a blurb about the new flags field and flag to the exporting
> > documentation.
> > 
> > The server will also now skip collecting this information for NFSv2 as
> > well, since that info is never used there anyway.
> > 
> > Note that this patch does not add this flag to any filesystem
> > export_operations structures. This was originally developed to allow
> > reexporting nfs via nfsd. That code is not (and may never be) suitable
> > for merging into mainline.
> > 
> > Other filesystems may want to consider enabling this flag too. It's hard
> > to tell however which ones have export operations to enable export via
> > knfsd and which ones mostly rely on them for open-by-filehandle support,
> 
> Are there any in the latter class?  I'm not sure how or why you'd
> support open-by-filehandle without supporting nfs exports.
> 

I don't know. I'm not sure that there's any difference from a technical
standpoint. If you enable open by fh support, then you sort of get
knfsd exporting "for free".

That said, I imagine at least some of these filesystems are typically
exported by some sort of userland server instead of knfsd (ganesha or
whatnot), and for them knfsd v3 performance is not terribly critical
either way.

> > so I'm leaving that up to the individual maintainers to decide. I am
> > cc'ing the relevant lists for those filesystems that I think may want to
> > consider adding this though.
> 
> I'd definitely like to see evidence from maintainers of those
> filesystems that this would be useful to them.
> 

Agreed. If it turns out that there aren't any, then we can drop this
patch and I'll just plan to carry it privately.

-- 
Jeff Layton 


RE: loadable objectstore

2015-09-11 Thread Varada Kari
Hi Sage/ Matt,

I have submitted the pull request based on wip-plugin branch for the object 
store factory implementation at https://github.com/ceph/ceph/pull/5884 . 
Haven't rebased to the master yet. Working on rebase and including new store in 
the factory implementation.  Please have a look and let me know your comments. 
Will submit a rebased PR soon with new store integration. 

Thanks,
Varada

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Varada Kari
Sent: Friday, July 03, 2015 7:31 PM
To: Sage Weil ; Adam Crume 
Cc: Loic Dachary ; ceph-devel ; 
Matt W. Benjamin 
Subject: RE: loadable objectstore

Hi All,

Not able to make much progress after making common as a shared object along 
with object store. 
Compilation of the test binaries is failing with 
"./.libs/libceph_filestore.so: undefined reference to `tracepoint_dlopen'".

  CXXLDceph_streamtest
./.libs/libceph_filestore.so: undefined reference to `tracepoint_dlopen'
collect2: error: ld returned 1 exit status
make[3]: *** [ceph_streamtest] Error 1

But libfilestore.so is linked with lttng-ust.

src/.libs$ ldd libceph_filestore.so
libceph_keyvaluestore.so.1 => 
/home/varada/obs-factory/plugin-work/src/.libs/libceph_keyvaluestore.so.1 
(0x7f5e50f5)
libceph_os.so.1 => 
/home/varada/obs-factory/plugin-work/src/.libs/libceph_os.so.1 
(0x7f5e4f93a000)
libcommon.so.1 => /home/varada/ 
obs-factory/plugin-work/src/.libs/libcommon.so.1 (0x7f5e4b5df000)
liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0 
(0x7f5e4b179000)
liblttng-ust-tracepoint.so.0 => 
/usr/lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0 (0x7f5e4a021000)
liburcu-bp.so.1 => /usr/lib/liburcu-bp.so.1 (0x7f5e49e1a000)
liburcu-cds.so.1 => /usr/lib/liburcu-cds.so.1 (0x7f5e49c12000)

Edited the above output to show just the dependencies.  
Did anyone face this issue before? 
Any help would be much appreciated. 

Thanks,
Varada

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Varada Kari
Sent: Friday, June 26, 2015 3:34 PM
To: Sage Weil
Cc: Loic Dachary; ceph-devel; Matt W. Benjamin
Subject: RE: loadable objectstore

Hi,

Made some more changes to resolve lttng problems at 
https://github.com/varadakari/ceph/commits/wip-plugin.
But couldn't bypass the issues. Facing some issues like the one mentioned below.

./.libs/libceph_filestore.so: undefined reference to `tracepoint_dlopen'

Compiling with -llttng-ust is not resolving the problem. I have seen some threads on the 
devel list before mentioning this problem. 
Can anyone take a look and guide me to fix this problem?

Haven't made the changes to change the plugin name etc... will be making them 
as part of cleanup.

Thanks,
Varada

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Varada Kari
Sent: Monday, June 22, 2015 8:57 PM
To: Matt W. Benjamin
Cc: Loic Dachary; ceph-devel; Sage Weil
Subject: RE: loadable objectstore

Hi Matt,

The majority of the changes are segregating the files into the corresponding shared 
objects and creating a factory object. The naming is mostly taken from the 
erasure-coding plugins. I want a good naming convention :-), hence a preliminary 
review. I do agree we have a lot of loadable interfaces, and I think we are on the 
way to making them on-demand (if possible) loadable modules.

Varada

-Original Message-
From: Matt W. Benjamin [mailto:m...@cohortfs.com]
Sent: Monday, June 22, 2015 8:37 PM
To: Varada Kari
Cc: Loic Dachary; ceph-devel; Sage Weil
Subject: Re: loadable objectstore

Hi,

It's just aesthetic, but it feels clunky to change the names of well known 
modules to Plugin--esp. if that generalizes forward to new loadable 
modules (and we have a lot of loadable interfaces).

Matt

- "Varada Kari"  wrote:

> Hi Sage,
>
> Please find the initial implementation of the object store factory 
> (initial cut) at 
> https://github.com/varadakari/ceph/commit/9d5fe2fecf38ba106c7c7b7a3ede
> 4f189ec7e1c8
>
> This is still work in progress branch. Right now I am facing Lttng 
> issues,
> LTTng-UST: Error (-17) while registering tracepoint probe. Duplicate 
> registration of tracepoint probes having the same name is not allowed.
>
> Might be an issue with libcommon inclusion. Trying to resolve the issue 
> now. Seems I need to make libcommon a shared object as well to avoid 
> the duplicates; static linking is a problem here.
> Any suggestions or comments on this problem?
>
>
> I have commented out test binary (ceph_test_keyvaluedb_atomicity) 
> compilation, due to unresolved symbols for g_ceph_context and g_conf.
> Not able to fix/workaround the problem so far.
>
> Can you please review if this 

Re: osd: new pool flags: noscrub, nodeep-scrub

2015-09-11 Thread Andrey Korolyov
On Fri, Sep 11, 2015 at 4:24 PM, Mykola Golub  wrote:
> On Fri, Sep 11, 2015 at 05:59:56AM -0700, Sage Weil wrote:
>
>> I wonder if, in addition, we should also allow scrub and deep-scrub
>> intervals to be set on a per-pool basis?
>
> ceph osd pool set  [deep-]scrub_interval N ?

BTW it would be absolutely lovely to see copy-aware scrubs, e.g.
parallel (deep) scrubs on non-intersecting sets of PGs. Currently, as
far as I can see, once a scrub starts, the max_scrubs limit is in effect only
for the primary OSD, allowing situations where two scrubs, primary and
non-primary, can land on the same OSD.


Re: Failed on starting osd-daemon after upgrade giant-0.87.1 to hammer-0.94.3

2015-09-11 Thread Haomai Wang
Yesterday I had a chat with wangrui, and the reason is that "infos" (the legacy
oid) is missing. I'm not sure why it's missing.

PS: resend again because of plain text

On Fri, Sep 11, 2015 at 8:56 PM, Sage Weil  wrote:
> On Fri, 11 Sep 2015, Wang Rui wrote:
>> Thank Sage Weil:
>>
>> 1. I deleted some testing pools in the past, but it was a long time ago (maybe 
>> 2 months ago); in the recent upgrade, I did not delete pools.
>> 2.  ceph osd dump please see the (attachment file ceph.osd.dump.log)
>> 3. debug osd = 20' and 'debug filestore = 20  (attachment file 
>> ceph.osd.5.log.tar.gz)
>
> This one is failing on pool 54, which has been deleted.  In this case you
> can work around it by renaming current/54.* out of the way.
>
>> 4. i install the ceph-test, but output error
>> ceph-kvstore-tool /ceph/data5/current/db list
>> Invalid argument: /ceph/data5/current/db: does not exist (create_if_missing 
>> is false)
>
> Sorry, I should have said current/omap, not current/db.  I'm still curious
> to see the key dump.  I'm not sure why the leveldb key for these pgs is
> missing...
>
> Thanks!
> sage
>
>
>>
>> ls -l /ceph/data5/current/db
>> total 0
>> -rw-r--r-- 1 root root 0 Sep 11 09:41 LOCK
>> -rw-r--r-- 1 root root 0 Sep 11 09:54 LOG
>> -rw-r--r-- 1 root root 0 Sep 11 09:54 LOG.old
>>
>> Thanks very much!
>> Wang Rui
>>
>> -- Original --
>> From:  "Sage Weil";
>> Date:  Fri, Sep 11, 2015 06:23 AM
>> To:  "Wang Rui";
>> Cc:  "ceph-devel";
>> Subject:  Re: Failed on starting osd-daemon after upgrade giant-0.87.1 
>> tohammer-0.94.3
>>
>> Hi!
>>
>> On Wed, 9 Sep 2015, Wang Rui wrote:
>> > Hi all:
>> >
>> > I got an error after upgrading my ceph cluster from giant-0.87.2 to 
>> > hammer-0.94.3, my local environment is:
>> > CentOS 6.7 x86_64
>> > Kernel 3.10.86-1.el6.elrepo.x86_64
>> > HDD: XFS, 2TB
>> > Install Package: ceph.com official RPMs x86_64
>> >
>> > step 1:
>> > Upgrade MON server from 0.87.1 to 0.94.3, all is fine!
>> >
>> > step 2:
>> > Upgrade OSD server from 0.87.1 to 0.94.3. I just upgraded two servers and 
>> > noticed that some osds cannot be started!
>> > server-1 has 4 osds, all of them cannot be started;
>> > server-2 has 3 osds, 2 of them cannot be started, but 1 of them 
>> > started successfully and works fine.
>> >
>> > Error log 1:
>> > service ceph start osd.4
>> > /var/log/ceph/ceph-osd.24.log
>> > (attachment file: ceph.24.log)
>> >
>> > Error log 2:
>> > /usr/bin/ceph-osd -c /etc/ceph/ceph.conf -i 4 -f
>> >  (attachment file: cli.24.log)
>>
>> This looks a lot like a problem with a stray directory that older versions
>> did not clean up (#11429)... but not quite.  Have you deleted pools in the
>> past? (Can you attach a 'ceph osd dump'?)  Also, if you start the osd 
>> with 'debug osd = 20' and 'debug filestore = 20' we can see which PG is
>> problematic.  If you install the 'ceph-test' package which contains
>> ceph-kvstore-tool, the output of
>>
>>  ceph-kvstore-tool /var/lib/ceph/osd/ceph-$id/current/db list
>>
>> would also be helpful.
>>
>> Thanks!
>> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat


Re: make check bot failures (2 hours today)

2015-09-11 Thread Daniel Gryniewicz
Maybe periodically run git gc on the clone out-of-line?  Git runs it
occasionally when it thinks it's necessary, and that can take a while
on large and/or fragmented repos.

Daniel

On Fri, Sep 11, 2015 at 9:03 AM, Loic Dachary  wrote:
> Hi Ceph,
>
> The make check bot failed a number of pull request verifications today. Each 
> of them was notified as a false negative (you should have received a short note 
> if your pull request is affected). The problem is now fixed[1] and all 
> should be back to normal. If you want to schedule another run, you just need 
> to rebase your pull request and re-push it, the bot will notice.
>
> Sorry for the inconvenience and thanks for your patience :-)
>
> P.S. I'm not sure what it was exactly. Just that git fetch took too long to 
> answer and failed. Resetting the git clone from which the bot works fixed the 
> problem. It happened a few times in the past but has not shown up in the past 
> six months or so.
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>


Re: Failed on starting osd-daemon after upgrade giant-0.87.1 to hammer-0.94.3

2015-09-11 Thread Sage Weil
On Fri, 11 Sep 2015, Haomai Wang wrote:
> On Fri, Sep 11, 2015 at 8:56 PM, Sage Weil  wrote:
>   On Fri, 11 Sep 2015, Wang Rui wrote:
>   > Thank Sage Weil:
>   >
>   > 1. I deleted some testing pools in the past, but it was a long
>   time ago (maybe 2 months ago); in the recent upgrade, I did not
>   delete pools.
>   > 2.  ceph osd dump please see the (attachment file
>   ceph.osd.dump.log)
>   > 3. debug osd = 20' and 'debug filestore = 20  (attachment file
>   ceph.osd.5.log.tar.gz)
> 
>   This one is failing on pool 54, which has been deleted.  In this
>   case you
>   can work around it by renaming current/54.* out of the way.
> 
>   > 4. i install the ceph-test, but output error
>   > ceph-kvstore-tool /ceph/data5/current/db list
>   > Invalid argument: /ceph/data5/current/db: does not exist
>   (create_if_missing is false)
> 
>   Sorry, I should have said current/omap, not current/db.  I'm
>   still curious
>   to see the key dump.  I'm not sure why the leveldb key for these
>   pgs is
>   missing...
> 
> 
> Yesterday I had a chat with wangrui, and the reason is that "infos" (the legacy oid)
> is missing. I'm not sure why it's missing.

Probably

https://github.com/ceph/ceph/blob/hammer/src/osd/OSD.cc#L2908

Oh, I think I see what happened:

 - the pg removal was aborted pre-hammer.  On pre-hammer, this means that 
load_pgs skips it here:

 https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L2121

 - we upgrade to hammer.  we skip this pg (same reason), don't upgrade it, 
but delete the legacy infos object

 https://github.com/ceph/ceph/blob/hammer/src/osd/OSD.cc#L2908

 - now we see this crash...

I think the fix is, in hammer, to bail out of peek_map_epoch if the infos 
object isn't present, here

 https://github.com/ceph/ceph/blob/hammer/src/osd/PG.cc#L2867

Probably we should restructure so we can return a 'fail' value 
instead of a magic epoch_t meaning the same...
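
A small, hypothetical illustration of the restructuring suggested above: return an explicit status and pass the epoch back through an out-parameter, rather than overloading a magic epoch_t value. The names below are placeholders, not the actual PG.cc code.

#include <stdint.h>
#include <stdio.h>

typedef uint32_t epoch_t;

/* Restructured style: an explicit result code, epoch returned separately,
 * instead of a sentinel epoch value doubling as "could not read infos". */
static int peek_map_epoch_checked(int infos_present, epoch_t stored,
                                  epoch_t *out)
{
    if (!infos_present)
        return -1;        /* caller can skip/flag the PG instead of crashing */
    *out = stored;
    return 0;
}

int main(void)
{
    epoch_t e;
    if (peek_map_epoch_checked(0 /* infos object missing */, 1234, &e) < 0)
        printf("infos object missing: bail out of upgrade for this PG\n");
    return 0;
}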

This is similar to the bug I'm fixing on master (and I think I just 
realized what I was doing wrong there).

Thanks!
sage



>  
> 
>   Thanks!
>   sage
> 
> 
>   >
>   > ls -l /ceph/data5/current/db
>   > total 0
>   > -rw-r--r-- 1 root root 0 Sep 11 09:41 LOCK
>   > -rw-r--r-- 1 root root 0 Sep 11 09:54 LOG
>   > -rw-r--r-- 1 root root 0 Sep 11 09:54 LOG.old
>   >
>   > Thanks very much!
>   > Wang Rui
>   >
>   > -- Original --
>   > From:  "Sage Weil";
>   > Date:  Fri, Sep 11, 2015 06:23 AM
> To:  "Wang Rui";
>   > Cc:  "ceph-devel";
>   > Subject:  Re: Failed on starting osd-daemon after upgrade
>   giant-0.87.1 tohammer-0.94.3
>   >
>   > Hi!
>   >
>   > On Wed, 9 Sep 2015, Wang Rui wrote:
>   > > Hi all:
>   > >
>   > > I got an error after upgrading my ceph cluster from
>   giant-0.87.2 to hammer-0.94.3, my local environment is:
>   > > CentOS 6.7 x86_64
>   > > Kernel 3.10.86-1.el6.elrepo.x86_64
>   > > HDD: XFS, 2TB
>   > > Install Package: ceph.com official RPMs x86_64
>   > >
>   > > step 1:
>   > > Upgrade MON server from 0.87.1 to 0.94.3, all is fine!
>   > >
>   > > step 2:
>   > > Upgrade OSD server from 0.87.1 to 0.94.3. I just upgraded two
>   servers and noticed that some osds cannot be started!
>   > > server-1 has 4 osds, all of them cannot be started;
>   > > server-2 has 3 osds, 2 of them cannot be started, but 1 of
>   them started successfully and works fine.
>   > >
>   > > Error log 1:
>   > > service ceph start osd.4
>   > > /var/log/ceph/ceph-osd.24.log
>   > > (attachment file: ceph.24.log)
>   > >
>   > > Error log 2:
>   > > /usr/bin/ceph-osd -c /etc/ceph/ceph.conf -i 4 -f
>   > >  (attachment file: cli.24.log)
>   >
>   > This looks a lot like a problem with a stray directory that
>   older versions
>   > did not clean up (#11429)... but not quite.  Have you deleted
>   pools in the
>   > past? (Can you attach a 'ceph osd dump'?)  Also, if you start
>   the osd
>   > with 'debug osd = 20' and 'debug filestore = 20' we can see
>   which PG is
>   > problematic.  If you install the 'ceph-test' package which
>   contains
>   > ceph-kvstore-tool, the output of
>   >
>   >  ceph-kvstore-tool /var/lib/ceph/osd/ceph-$id/current/db list
>   >
>   > would also be helpful.
>   >
>   > Thanks!
>   > sage
>   --
>   To unsubscribe from this list: send the line "unsubscribe
>   ceph-devel" in
>   the body of a message to majord...@vger.kernel.org
>   More majordomo info at 
>   http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> 
> --
> 
> Best Regards,
> 
> Wheat
> 
> 
> 

Re: pet project: OSD compatible daemon

2015-09-11 Thread Shinobu Kinjo
What I'm thinking of is to use fluentd to get logs in a
quite human-readable format.

Is it the same as what you are thinking of?

Shinobu

- Original Message -
From: "Shinobu" 
To: ski...@redhat.com
Sent: Friday, September 11, 2015 6:16:18 PM
Subject: Fwd: pet project: OSD compatible daemon

-- Forwarded message --
From: Loic Dachary 
Date: Wed, Sep 9, 2015 at 5:45 PM
Subject: pet project: OSD compatible daemon
To: Ceph Development 


Hi Ceph,

I would like to try to write an OSD compatible daemon, as a pet project, to
learn Go and better understand the message flow. I suspect it may also be
useful for debug purposes but that's not my primary incentive.

Has anyone tried something similar ? If so I'd happily contribute instead
of starting something from scratch.

Cheers

P.S. Since it's a pet project it's likely to take months before there is
any kind of progress ;-)

--
Loïc Dachary, Artisan Logiciel Libre




-- 
Email:
 - shin...@linux.com 
Blog:
 - Life with Distributed Computational System based on OpenSource



Re: [HPDD-discuss] [PATCH] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations

2015-09-11 Thread Dilger, Andreas
On 2015/09/11, 4:20 AM, "HPDD-discuss on behalf of Jeff Layton"

wrote:

>With NFSv3 nfsd will always attempt to send along WCC data to the
>client. This generally involves saving off the in-core inode information
>prior to doing the operation on the given filehandle, and then issuing a
>vfs_getattr to it after the op.
>
>Some filesystems (particularly clustered or networked ones) have an
>expensive ->getattr inode operation. Atomicity is also often difficult
>or impossible to guarantee on such filesystems. For those, we're best
>off not trying to provide WCC information to the client at all, and to
>simply allow it to poll for that information as needed with a GETATTR
>RPC.
>
>This patch adds a new flags field to struct export_operations, and
>defines a new EXPORT_OP_NOWCC flag that filesystems can use to indicate
>that nfsd should not attempt to provide WCC info in NFSv3 replies. It
>also adds a blurb about the new flags field and flag to the exporting
>documentation.
>
>The server will also now skip collecting this information for NFSv2 as
>well, since that info is never used there anyway.
>
>Note that this patch does not add this flag to any filesystem
>export_operations structures. This was originally developed to allow
>reexporting nfs via nfsd. That code is not (and may never be) suitable
>for merging into mainline.
>
>Other filesystems may want to consider enabling this flag too. It's hard
>to tell however which ones have export operations to enable export via
>knfsd and which ones mostly rely on them for open-by-filehandle support,
>so I'm leaving that up to the individual maintainers to decide. I am
>cc'ing the relevant lists for those filesystems that I think may want to
>consider adding this though.
>
>Cc: hpdd-disc...@lists.01.org
>Cc: ceph-devel@vger.kernel.org
>Cc: cluster-de...@redhat.com
>Cc: fuse-de...@lists.sourceforge.net
>Cc: ocfs2-de...@oss.oracle.com
>Signed-off-by: Jeff Layton 
>---
> Documentation/filesystems/nfs/Exporting | 27 +++
> fs/nfsd/nfs3xdr.c   |  5 -
> fs/nfsd/nfsfh.c | 14 ++
> fs/nfsd/nfsfh.h |  5 -
> include/linux/exportfs.h|  2 ++
> 5 files changed, 51 insertions(+), 2 deletions(-)
>
>diff --git a/Documentation/filesystems/nfs/Exporting
>b/Documentation/filesystems/nfs/Exporting
>index 520a4becb75c..fa636cde3907 100644
>--- a/Documentation/filesystems/nfs/Exporting
>+++ b/Documentation/filesystems/nfs/Exporting
>@@ -138,6 +138,11 @@ struct which has the following members:
> to find potential names, and matches inode numbers to find the
>correct
> match.
> 
>+  flags
>+Some filesystems may need to be handled differently than others. The
>+export_operations struct also includes a flags field that allows the
>+filesystem to communicate such information to nfsd. See the Export
>+Operations Flags section below for more explanation.
> 
> A filehandle fragment consists of an array of 1 or more 4byte words,
> together with a one byte "type".
>@@ -147,3 +152,25 @@ generated by encode_fh, in which case it will have
>been padded with
> nuls.  Rather, the encode_fh routine should choose a "type" which
> indicates the decode_fh how much of the filehandle is valid, and how
> it should be interpreted.
>+
>+Export Operations Flags
>+---
>+In addition to the operation vector pointers, struct export_operations
>also
>+contains a "flags" field that allows the filesystem to communicate to
>nfsd
>+that it may want to do things differently when dealing with it. The
>+following flags are defined:
>+
>+  EXPORT_OP_NOWCC
>+RFC 1813 recommends that servers always send weak cache consistency
>+(WCC) data to the client after each operation. The server should
>+atomically collect attributes about the inode, do an operation on it,
>+and then collect the attributes afterward. This allows the client to
>+skip issuing GETATTRs in some situations but means that the server
>+is calling vfs_getattr for almost all RPCs. On some filesystems
>+(particularly those that are clustered or networked) this is
>expensive
>+and atomicity is difficult to guarantee. This flag indicates to nfsd
>+that it should skip providing WCC attributes to the client in NFSv3
>+replies when doing operations on this filesystem. Consider enabling
>+this on filesystems that have an expensive ->getattr inode operation,
>+or when atomicity between pre and post operation attribute collection
>+is impossible to guarantee.
>diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
>index 01dcd494f781..c30c8c604e2a 100644
>--- a/fs/nfsd/nfs3xdr.c
>+++ b/fs/nfsd/nfs3xdr.c
>@@ -203,7 +203,7 @@ static __be32 *
> encode_post_op_attr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh
>*fhp)
> {
>   struct dentry *dentry = 

[PATCH] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations

2015-09-11 Thread Jeff Layton
With NFSv3 nfsd will always attempt to send along WCC data to the
client. This generally involves saving off the in-core inode information
prior to doing the operation on the given filehandle, and then issuing a
vfs_getattr to it after the op.

Some filesystems (particularly clustered or networked ones) have an
expensive ->getattr inode operation. Atomicity is also often difficult
or impossible to guarantee on such filesystems. For those, we're best
off not trying to provide WCC information to the client at all, and to
simply allow it to poll for that information as needed with a GETATTR
RPC.

This patch adds a new flags field to struct export_operations, and
defines a new EXPORT_OP_NOWCC flag that filesystems can use to indicate
that nfsd should not attempt to provide WCC info in NFSv3 replies. It
also adds a blurb about the new flags field and flag to the exporting
documentation.

The server will also now skip collecting this information for NFSv2 as
well, since that info is never used there anyway.

Note that this patch does not add this flag to any filesystem
export_operations structures. This was originally developed to allow
reexporting nfs via nfsd. That code is not (and may never be) suitable
for merging into mainline.

Other filesystems may want to consider enabling this flag too. It's hard
to tell however which ones have export operations to enable export via
knfsd and which ones mostly rely on them for open-by-filehandle support,
so I'm leaving that up to the individual maintainers to decide. I am
cc'ing the relevant lists for those filesystems that I think may want to
consider adding this though.

Cc: hpdd-disc...@lists.01.org
Cc: ceph-devel@vger.kernel.org
Cc: cluster-de...@redhat.com
Cc: fuse-de...@lists.sourceforge.net
Cc: ocfs2-de...@oss.oracle.com
Signed-off-by: Jeff Layton 
---
 Documentation/filesystems/nfs/Exporting | 27 +++
 fs/nfsd/nfs3xdr.c   |  5 -
 fs/nfsd/nfsfh.c | 14 ++
 fs/nfsd/nfsfh.h |  5 -
 include/linux/exportfs.h|  2 ++
 5 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/nfs/Exporting 
b/Documentation/filesystems/nfs/Exporting
index 520a4becb75c..fa636cde3907 100644
--- a/Documentation/filesystems/nfs/Exporting
+++ b/Documentation/filesystems/nfs/Exporting
@@ -138,6 +138,11 @@ struct which has the following members:
 to find potential names, and matches inode numbers to find the correct
 match.
 
+  flags
+Some filesystems may need to be handled differently than others. The
+export_operations struct also includes a flags field that allows the
+filesystem to communicate such information to nfsd. See the Export
+Operations Flags section below for more explanation.
 
 A filehandle fragment consists of an array of 1 or more 4byte words,
 together with a one byte "type".
@@ -147,3 +152,25 @@ generated by encode_fh, in which case it will have been 
padded with
 nuls.  Rather, the encode_fh routine should choose a "type" which
 indicates the decode_fh how much of the filehandle is valid, and how
 it should be interpreted.
+
+Export Operations Flags
+---
+In addition to the operation vector pointers, struct export_operations also
+contains a "flags" field that allows the filesystem to communicate to nfsd
+that it may want to do things differently when dealing with it. The
+following flags are defined:
+
+  EXPORT_OP_NOWCC
+RFC 1813 recommends that servers always send weak cache consistency
+(WCC) data to the client after each operation. The server should
+atomically collect attributes about the inode, do an operation on it,
+and then collect the attributes afterward. This allows the client to
+skip issuing GETATTRs in some situations but means that the server
+is calling vfs_getattr for almost all RPCs. On some filesystems
+(particularly those that are clustered or networked) this is expensive
+and atomicity is difficult to guarantee. This flag indicates to nfsd
+that it should skip providing WCC attributes to the client in NFSv3
+replies when doing operations on this filesystem. Consider enabling
+this on filesystems that have an expensive ->getattr inode operation,
+or when atomicity between pre and post operation attribute collection
+is impossible to guarantee.
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 01dcd494f781..c30c8c604e2a 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -203,7 +203,7 @@ static __be32 *
 encode_post_op_attr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp)
 {
struct dentry *dentry = fhp->fh_dentry;
-   if (dentry && d_really_is_positive(dentry)) {
+   if (!fhp->fh_no_wcc && dentry && d_really_is_positive(dentry)) {
__be32 err;
struct kstat stat;
 
@@ -256,6 +256,9 @@ void 
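
To illustrate how a filesystem would opt in once a patch like this is applied, here is a sketch of an export_operations table setting the new flag. The filesystem name and its fh_to_dentry stub are hypothetical, and the .flags field and EXPORT_OP_NOWCC exist only with this patch applied.

#include <linux/exportfs.h>

/* Hypothetical filesystem opting out of WCC attributes.  The stub below
 * stands in for the filesystem's real filehandle lookup. */
static struct dentry *examplefs_fh_to_dentry(struct super_block *sb,
                                             struct fid *fid,
                                             int fh_len, int fh_type)
{
        return NULL;    /* real code would map fid back to an inode/dentry */
}

static const struct export_operations examplefs_export_ops = {
        .fh_to_dentry = examplefs_fh_to_dentry,
        /* Added by this patch: tell nfsd to skip pre/post-op attributes
         * in NFSv3 replies for this filesystem. */
        .flags        = EXPORT_OP_NOWCC,
};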

Re: pet project: OSD compatible daemon

2015-09-11 Thread Shinobu Kinjo
Yes, that is what I'm thinking.

Shinobu

- Original Message -
From: "Loic Dachary" 
To: "Shinobu Kinjo" 
Cc: "ceph-devel" 
Sent: Saturday, September 12, 2015 12:10:43 AM
Subject: Re: pet project: OSD compatible daemon



On 11/09/2015 16:08, Shinobu Kinjo wrote:
> What I'm thinking of is to use fluentd to get logs in a
> quite human-readable format.

If you refer to https://github.com/fluent/fluentd it's different. Or is it 
something else ?

> 
> Is it the same as what you are thinking of?
> 
> Shinobu
> 
> - Original Message -
> From: "Shinobu" 
> To: ski...@redhat.com
> Sent: Friday, September 11, 2015 6:16:18 PM
> Subject: Fwd: pet project: OSD compatible daemon
> 
> -- Forwarded message --
> From: Loic Dachary 
> Date: Wed, Sep 9, 2015 at 5:45 PM
> Subject: pet project: OSD compatible daemon
> To: Ceph Development 
> 
> 
> Hi Ceph,
> 
> I would like to try to write an OSD compatible daemon, as a pet project, to
> learn Go and better understand the message flow. I suspect it may also be
> useful for debug purposes but that's not my primary incentive.
> 
> Has anyone tried something similar ? If so I'd happily contribute instead
> of starting something from scratch.
> 
> Cheers
> 
> P.S. Since it's a pet project it's likely to take months before there is
> any kind of progress ;-)
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> 
> 
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



Re: Backfill

2015-09-11 Thread Sage Weil
On Thu, 10 Sep 2015, GuangYang wrote:
> Today I played around with recovery and backfill of a Ceph cluster (by 
> manually bringing some OSDs down/out), and got one question regarding 
> the current flow:
> 
> Does backfill push everything to the backfill target regardless of what the 
> backfill target already has? The scenario is like this - the acting set of the PG 
> is [1, 2, 3], and 3 went down (at which point it already had some data) 
> and stayed down for a sustained period (but not marked out), during 
> which time there were sustained WRITEs to the PG. At some point 3 went 
> back up, and it was not sufficient to recover via the PG log, so the PG 
> needed to be backfilled and 3 is the target. Does 1 need to push 
> everything (last_backfill starts with MIN) to 3? It seems so to me as I 
> don't see a round trip to negotiate what each OSD has and do an 
> incremental push (as recovery does), but it would be nice to get confirmation 
> :)

No.  Backfill iterates over objects on the source and destination and 
only pushes objects that are missing or out of date (and deletes ones 
that shouldn't be there).  This is all in ReplicatedPG::recover_backfill() 
(though it's not the easiest read).

sage
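
As a toy illustration of the scan described above (not the actual ReplicatedPG::recover_backfill() code, which is C++ and considerably more involved), the sketch below walks two sorted object listings in step and decides per object whether to push, delete, or skip.

#include <stdio.h>
#include <string.h>

struct obj { const char *name; int version; };

/* Source (primary) and target (backfill peer) listings, sorted by name. */
static const struct obj src[] = { {"a", 2}, {"b", 1}, {"d", 3} };
static const struct obj dst[] = { {"a", 1}, {"b", 1}, {"c", 5} };

int main(void)
{
    size_t i = 0, j = 0;
    const size_t ns = sizeof(src) / sizeof(src[0]);
    const size_t nd = sizeof(dst) / sizeof(dst[0]);

    while (i < ns || j < nd) {
        int cmp;
        if (i == ns)       cmp = 1;    /* only the target has objects left */
        else if (j == nd)  cmp = -1;   /* only the source has objects left */
        else               cmp = strcmp(src[i].name, dst[j].name);

        if (cmp < 0) {
            printf("push %s (missing on target)\n", src[i].name); i++;
        } else if (cmp > 0) {
            printf("delete %s (should not be on target)\n", dst[j].name); j++;
        } else {
            if (src[i].version != dst[j].version)
                printf("push %s (out of date on target)\n", src[i].name);
            else
                printf("skip %s (already up to date)\n", src[i].name);
            i++; j++;
        }
    }
    return 0;
}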


Re: osd: new pool flags: noscrub, nodeep-scrub

2015-09-11 Thread Sage Weil
On Fri, 11 Sep 2015, Mykola Golub wrote:
> On Fri, Sep 11, 2015 at 11:08:29AM +0100, Gregory Farnum wrote:
> > On Fri, Sep 11, 2015 at 7:42 AM, Mykola Golub  wrote:
> > > Hi,
> > >
> > > I would like to add new pool flags: noscrub and nodeep-scrub, to be
> > > able to control scrubbing on a per-pool basis. In our case it could be
> > > helpful for disabling scrubbing on cache pools, which does not
> > > work well right now, but I can imagine other scenarios where it could
> > > be useful too.
> > 
> > Can you talk more about this? It sounds to me like maybe you dislike
> > the performance impact of scrubbing, but it's fairly important in
> > terms of data integrity. I don't think we want to permanently disable
> > them. A corruption in the cache pool isn't any less important than in
> > the backing pool - it will eventually get flushed, and it's where all
> > the reads will be handled!
> 
> I was talking about this:
> 
> http://tracker.ceph.com/issues/8752
> 
> (false-negative on a caching pool). Although the best solution is
> definitely to fix the bug, I am not sure it will be resolved soon (the
> bug has been open for a year). Still these false-negatives are annoying, as
> they complicate monitoring for real inconsistent pgs. In this case I
> might want to disable periodic scrub for caching pools, as a
> workaround (I could do scrub for them manually though).
> 
> This might not be the best example where these flags could be helpful
> (I just came up with the idea when thinking about a workaround for that
> problem, and this looked useful to me in general). We already have
> 'ceph osd set no[deep-]scrub', and users use it to temporarily resolve
> high I/O load. Being able to do this per pool looks useful too.
> 
> You might have pools of different importance to you, and disabling
> scrub for some of them might be ok.

I wonder if, in addition, we should also allow scrub and deep-scrub 
intervals to be set on a per-pool basis?

sage



make check bot failures (2 hours today)

2015-09-11 Thread Loic Dachary
Hi Ceph,

The make check bot failed a number of pull request verifications today. Each of 
them was notified as a false negative (you should have received a short note if 
your pull request is affected). The problem is now fixed[1] and all should be 
back to normal. If you want to schedule another run, you just need to rebase 
your pull request and re-push it, the bot will notice.

Sorry for the inconvenience and thanks for your patience :-)

P.S. I'm not sure what it was exactly. Just that git fetch took too long to 
answer and failed. Resetting the git clone from which the bot works fixed the 
problem. It happened a few times in the past but has not shown up in the past 
six months or so.

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: Failed on starting osd-daemon after upgrade giant-0.87.1 to hammer-0.94.3

2015-09-11 Thread Sage Weil
On Fri, 11 Sep 2015, Wang Rui wrote:
> Thank Sage Weil:
> 
> 1. I deleted some testing pools in the past, but it was a long time ago (maybe 
> 2 months ago); in the recent upgrade, I did not delete pools.
> 2.  ceph osd dump please see the (attachment file ceph.osd.dump.log)
> 3. debug osd = 20' and 'debug filestore = 20  (attachment file 
> ceph.osd.5.log.tar.gz)

This one is failing on pool 54, which has been deleted.  In this case you 
can work around it by renaming current/54.* out of the way.
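
For example, assuming the affected OSD is osd.5 with its data under /ceph/data5 
as in the logs in this thread (adjust the paths for your deployment, and stop 
the OSD first):

  service ceph stop osd.5
  mkdir /ceph/data5/removed-pool-54
  mv /ceph/data5/current/54.* /ceph/data5/removed-pool-54/
  service ceph start osd.5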

> 4. I installed ceph-test, but it output an error:
> ceph-kvstore-tool /ceph/data5/current/db list 
> Invalid argument: /ceph/data5/current/db: does not exist (create_if_missing 
> is false)

Sorry, I should have said current/omap, not current/db.  I'm still curious 
to see the key dump.  I'm not sure why the leveldb key for these pgs is 
missing...
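
That is, with the path from your earlier mail:

  ceph-kvstore-tool /ceph/data5/current/omap list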

Thanks!
sage


> 
> ls -l /ceph/data5/current/db
> total 0
> -rw-r--r-- 1 root root 0 Sep 11 09:41 LOCK
> -rw-r--r-- 1 root root 0 Sep 11 09:54 LOG
> -rw-r--r-- 1 root root 0 Sep 11 09:54 LOG.old
> 
> Thanks very much!
> Wang Rui 
>  
> -- Original --
> From:  "Sage Weil";
> Date:  Fri, Sep 11, 2015 06:23 AM
> To:  "Wang Rui";
> Cc:  "ceph-devel";
> Subject:  Re: Failed on starting osd-daemon after upgrade giant-0.87.1 
> tohammer-0.94.3
>  
> Hi!
> 
> On Wed, 9 Sep 2015, Wang Rui wrote:
> > Hi all:
> > 
> > I got an error after upgrading my ceph cluster from giant-0.87.2 to 
> > hammer-0.94.3, my local environment is:
> > CentOS 6.7 x86_64
> > Kernel 3.10.86-1.el6.elrepo.x86_64
> > HDD: XFS, 2TB
> > Install Package: ceph.com official RPMs x86_64
> > 
> > step 1: 
> > Upgrade MON server from 0.87.1 to 0.94.3, all is fine!
> > 
> > step 2: 
> > Upgrade OSD server from 0.87.1 to 0.94.3. I just upgraded two servers and 
> > noticed that some OSDs cannot be started!
> > server-1 has 4 OSDs, none of them can be started;
> > server-2 has 3 OSDs, 2 of them cannot be started, but 1 of them started 
> > successfully and works fine.
> > 
> > Error log 1:
> > service ceph start osd.4
> > /var/log/ceph/ceph-osd.24.log 
> > (attachment file: ceph.24.log)
> > 
> > Error log 2:
> > /usr/bin/ceph-osd -c /etc/ceph/ceph.conf -i 4 -f
> >  (attachment file: cli.24.log)
> 
> This looks a lot like a problem with a stray directory that older versions 
> did not clean up (#11429)... but not quite.  Have you deleted pools in the 
> past? (Can you attach a 'ceph osd dump'?)  Also, if you start the osd 
> with 'debug osd = 20' and 'debug filestore = 20' we can see which PG is 
> problematic.  If you install the 'ceph-test' package which contains 
> ceph-kvstore-tool, the output of 
> 
>  ceph-kvstore-tool /var/lib/ceph/osd/ceph-$id/current/db list
> 
> would also be helpful.
> 
> Thanks!
> sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: About Fio backend with ObjectStore API

2015-09-11 Thread Casey Bodley
Hi James,

That's great that you were able to get fio-objectstore running! Thanks to you 
and Haomai for all the help with testing.

In terms of performance, it's possible that we're not handling the completions 
optimally. When profiling with MemStore I remember seeing a significant amount 
of cpu time spent in polling with fio_ceph_os_getevents().

The issue with reads is more of a design issue than a bug. Because the test 
starts with a mkfs(), there are no objects to read from initially. You would 
just have to add a write job to run before the read job, to make sure that the 
objects are initialized. Or perhaps the mkfs() step could be an optional part 
of the configuration.
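
A rough sketch of such a job file, in case it helps: the engine name is taken 
from the log below, the size/bs values mirror that run, "stonewall" is standard 
fio (it makes the read job wait for the write job to finish), and any 
objectstore-specific options the branch needs are omitted here:

  [global]
  ioengine=cephobjectstore
  size=512m
  bs=128k

  [prefill]
  rw=write

  [readtest]
  stonewall
  rw=read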

Casey

- Original Message -
From: "James (Fei) Liu-SSI" 
To: "Haomai Wang" , "Casey Bodley" 
Cc: ceph-devel@vger.kernel.org
Sent: Thursday, September 10, 2015 8:08:04 PM
Subject: RE: About Fio backend with ObjectStore API

Hi Casey and Haomai,

  We finally made the fio-objectstore engine work on our end. Here is fio data 
against filestore with a Samsung 850 Pro. It is a sequential write and the 
performance is very poor, which is expected though. 

Run status group 0 (all jobs):
  WRITE: io=524288KB, aggrb=9467KB/s, minb=9467KB/s, maxb=9467KB/s, 
mint=55378msec, maxt=55378msec

  But anyway, it works, even though there are still some bugs to fix, like the 
read and filesystem issues. Thanks a lot for your great work.

  Regards,
  James

  jamesliu@jamesliu-OptiPlex-7010:~/WorkSpace/ceph_casey/src$ sudo ./fio/fio 
./test/objectstore.fio 
filestore: (g=0): rw=write, bs=128K-128K/128K-128K/128K-128K, 
ioengine=cephobjectstore, iodepth=1
fio-2.2.9-56-g736a
Starting 1 process
test1
filestore: Laying out IO file(s) (1 file(s) / 512MB)
2015-09-10 16:55:40.614494 7f19d34d1840  1 filestore(/home/jamesliu/fio_ceph) 
mkfs in /home/jamesliu/fio_ceph
2015-09-10 16:55:40.614924 7f19d34d1840  1 filestore(/home/jamesliu/fio_ceph) 
mkfs generated fsid 5508d58e-dbfc-48a5-9f9c-c639af4fe73a
2015-09-10 16:55:40.630326 7f19d34d1840  1 filestore(/home/jamesliu/fio_ceph) 
write_version_stamp 4
2015-09-10 16:55:40.673417 7f19d34d1840  0 filestore(/home/jamesliu/fio_ceph) 
backend xfs (magic 0x58465342)
2015-09-10 16:55:40.724097 7f19d34d1840  1 filestore(/home/jamesliu/fio_ceph) 
leveldb db exists/created
2015-09-10 16:55:40.724218 7f19d34d1840 -1 journal FileJournal::_open: 
disabling aio for non-block journal.  Use journal_force_aio to force use of aio 
anyway
2015-09-10 16:55:40.724226 7f19d34d1840  1 journal _open 
/tmp/fio_ceph_filestore1 fd 5: 5368709120 bytes, block size 4096 bytes, 
directio = 1, aio = 0
2015-09-10 16:55:40.724468 7f19d34d1840 -1 journal check: ondisk fsid 
7580401a-6863-4863-9873-3adda08c9150 doesn't match expected 
5508d58e-dbfc-48a5-9f9c-c639af4fe73a, invalid (someone else's?) journal
2015-09-10 16:55:40.724481 7f19d34d1840  1 journal close 
/tmp/fio_ceph_filestore1
2015-09-10 16:55:40.724506 7f19d34d1840  1 journal _open 
/tmp/fio_ceph_filestore1 fd 5: 5368709120 bytes, block size 4096 bytes, 
directio = 1, aio = 0
2015-09-10 16:55:40.730417 7f19d34d1840  0 filestore(/home/jamesliu/fio_ceph) 
mkjournal created journal on /tmp/fio_ceph_filestore1
2015-09-10 16:55:40.730446 7f19d34d1840  1 filestore(/home/jamesliu/fio_ceph) 
mkfs done in /home/jamesliu/fio_ceph
2015-09-10 16:55:40.730527 7f19d34d1840  0 filestore(/home/jamesliu/fio_ceph) 
backend xfs (magic 0x58465342)
2015-09-10 16:55:40.730773 7f19d34d1840  0 
genericfilestorebackend(/home/jamesliu/fio_ceph) detect_features: FIEMAP ioctl 
is disabled via 'filestore fiemap' config option
2015-09-10 16:55:40.730779 7f19d34d1840  0 
genericfilestorebackend(/home/jamesliu/fio_ceph) detect_features: 
SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2015-09-10 16:55:40.730793 7f19d34d1840  0 
genericfilestorebackend(/home/jamesliu/fio_ceph) detect_features: splice is 
supported
2015-09-10 16:55:40.751951 7f19d34d1840  0 
genericfilestorebackend(/home/jamesliu/fio_ceph) detect_features: syncfs(2) 
syscall fully supported (by glibc and kernel)
2015-09-10 16:55:40.752102 7f19d34d1840  0 
xfsfilestorebackend(/home/jamesliu/fio_ceph) detect_features: extsize is 
supported and your kernel >= 3.5
2015-09-10 16:55:40.794731 7f19d34d1840  0 filestore(/home/jamesliu/fio_ceph) 
mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-09-10 16:55:40.794906 7f19d34d1840 -1 journal FileJournal::_open: 
disabling aio for non-block journal.  Use journal_force_aio to force use of aio 
anyway
2015-09-10 16:55:40.794917 7f19d34d1840  1 journal _open 
/tmp/fio_ceph_filestore1 fd 11: 5368709120 bytes, block size 4096 bytes, 
directio = 1, aio = 0
2015-09-10 16:55:40.795219 7f19d34d1840  1 journal _open 
/tmp/fio_ceph_filestore1 fd 11: 5368709120 bytes, block size 4096 bytes, 
directio = 1, aio = 0
2015-09-10 16:55:40.795533 7f19d34d1840  1 filestore(/home/jamesliu/fio_ceph) 
upgrade
2015-09-10 

Re: Failed on starting osd-daemon after upgrade giant-0.87.1 to hammer-0.94.3

2015-09-11 Thread Haomai Wang
On Fri, Sep 11, 2015 at 10:09 PM, Sage Weil  wrote:
> On Fri, 11 Sep 2015, Haomai Wang wrote:
>> On Fri, Sep 11, 2015 at 8:56 PM, Sage Weil  wrote:
>>   On Fri, 11 Sep 2015, ?? wrote:
>>   > Thank Sage Weil:
>>   >
>>   > 1. I delete some testing pools in the past, but it was a long
>>   time ago (may be 2 months ago), in recently upgrade, do not
>>   delete pools.
>>   > 2.  ceph osd dump please see the (attachment file
>>   ceph.osd.dump.log)
>>   > 3. debug osd = 20' and 'debug filestore = 20  (attachment file
>>   ceph.osd.5.log.tar.gz)
>>
>>   This one is failing on pool 54, which has been deleted.  In this
>>   case you
>>   can work around it by renaming current/54.* out of the way.
>>
>>   > 4. i install the ceph-test, but output error
>>   > ceph-kvstore-tool /ceph/data5/current/db list
>>   > Invalid argument: /ceph/data5/current/db: does not exist
>>   (create_if_missing is false)
>>
>>   Sorry, I should have said current/omap, not current/db.  I'm
>>   still curious
>>   to see the key dump.  I'm not sure why the leveldb key for these
>>   pgs is
>>   missing...
>>
>>
>> Yesterday I had a chat with Wang Rui and the reason is that "infos" (the
>> legacy oid) is missing. I'm not sure why it's missing.
>
> Probably
>
> https://github.com/ceph/ceph/blob/hammer/src/osd/OSD.cc#L2908
>
> Oh, I think I see what happened:
>
> >  - the pg removal was aborted pre-hammer.  On pre-hammer, this means that
> load_pgs skips it here:
>
>  https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L2121
>
>  - we upgrade to hammer.  we skip this pg (same reason), don't upgrade it,
> > but delete the legacy infos object
>
>  https://github.com/ceph/ceph/blob/hammer/src/osd/OSD.cc#L2908
>
>  - now we see this crash...
>
> I think the fix is, in hammer, to bail out of peek_map_epoch if the infos
> object isn't present, here
>
>  https://github.com/ceph/ceph/blob/hammer/src/osd/PG.cc#L2867
>
> Probably we should restructure so we can return a 'fail' value
> instead of a magic epoch_t meaning the same...
>
> This is similar to the bug I'm fixing on master (and I think I just
> realized what I was doing wrong there).

Hmm, I got it. So we could skip this assert, or check whether the pool
exists, the way load_pgs does?

I think it's an urgent bug, because I remember several people have shown me
a similar crash.


>
> Thanks!
> sage
>
>
>
>>
>>
>>   Thanks!
>>   sage
>>
>>
>>   >
>>   > ls -l /ceph/data5/current/db
>>   > total 0
>>   > -rw-r--r-- 1 root root 0 Sep 11 09:41 LOCK
>>   > -rw-r--r-- 1 root root 0 Sep 11 09:54 LOG
>>   > -rw-r--r-- 1 root root 0 Sep 11 09:54 LOG.old
>>   >
>>   > Thanks very much!
>>   > Wang Rui
>>   >
>>   > -- Original --
>>   > From:  "Sage Weil";
>>   > Date:  Fri, Sep 11, 2015 06:23 AM
>>   > To:  "??";
>>   > Cc:  "ceph-devel";
>>   > Subject:  Re: Failed on starting osd-daemon after upgrade
>>   giant-0.87.1 tohammer-0.94.3
>>   >
>>   > Hi!
>>   >
>>   > On Wed, 9 Sep 2015, ?? wrote:
>>   > > Hi all:
>>   > >
>>   > > I got an error after upgrading my ceph cluster from
>>   giant-0.87.2 to hammer-0.94.3, my local environment is:
>>   > > CentOS 6.7 x86_64
>>   > > Kernel 3.10.86-1.el6.elrepo.x86_64
>>   > > HDD: XFS, 2TB
>>   > > Install Package: ceph.com official RPMs x86_64
>>   > >
>>   > > step 1:
>>   > > Upgrade MON server from 0.87.1 to 0.94.3, all is fine!
>>   > >
>>   > > step 2:
>>   > > Upgrade OSD server from 0.87.1 to 0.94.3. i just upgrade two
>>   servers and noticed that some osds can not started!
>>   > > server-1 have 4 osds, all of them can not started;
>>   > > server-2 have 3 osds, 2 of them can not started, but 1 of
>>   them successfully started and work in good.
>>   > >
>>   > > Error log 1:
>>   > > service ceph start osd.4
>>   > > /var/log/ceph/ceph-osd.24.log
>>   > > (attachment file: ceph.24.log)
>>   > >
>>   > > Error log 2:
>>   > > /usr/bin/ceph-osd -c /etc/ceph/ceph.conf -i 4 -f
>>   > >  (attachment file: cli.24.log)
>>   >
>>   > This looks a lot like a problem with a stray directory that
>>   older versions
>>   > did not clean up (#11429)... but not quite.  Have you deleted
>>   pools in the
>>   > past? (Can you attach a 'ceph osd dump'?)  Also, if you start
>>   the osd
>>   > with 'debug osd = 20' and 'debug filestore = 20' we can see
>>   which PG is
>>   > problematic.  If you install the 'ceph-test' package which
>>   contains
>>   > ceph-kvstore-tool, the output of
>>   >
>>   >  ceph-kvstore-tool /var/lib/ceph/osd/ceph-$id/current/db list
>>   >
>>   > would also be helpful.
>>   

Re: About Fio backend with ObjectStore API

2015-09-11 Thread Casey Bodley
I forgot to mention for the list, you can find the latest version of the 
fio-objectstore branch at 
https://github.com/cbodley/ceph/commits/fio-objectstore.

Casey


Re: Failed on starting osd-daemon after upgrade giant-0.87.1 to hammer-0.94.3

2015-09-11 Thread Sage Weil
On Fri, 11 Sep 2015, Haomai Wang wrote:
> On Fri, Sep 11, 2015 at 10:09 PM, Sage Weil  wrote:
> > On Fri, 11 Sep 2015, Haomai Wang wrote:
> >> On Fri, Sep 11, 2015 at 8:56 PM, Sage Weil  wrote:
> >>   On Fri, 11 Sep 2015, ?? wrote:
> >>   > Thank Sage Weil:
> >>   >
> >>   > 1. I delete some testing pools in the past, but it was a long
> >>   time ago (may be 2 months ago), in recently upgrade, do not
> >>   delete pools.
> >>   > 2.  ceph osd dump please see the (attachment file
> >>   ceph.osd.dump.log)
> >>   > 3. debug osd = 20' and 'debug filestore = 20  (attachment file
> >>   ceph.osd.5.log.tar.gz)
> >>
> >>   This one is failing on pool 54, which has been deleted.  In this
> >>   case you
> >>   can work around it by renaming current/54.* out of the way.
> >>
> >>   > 4. i install the ceph-test, but output error
> >>   > ceph-kvstore-tool /ceph/data5/current/db list
> >>   > Invalid argument: /ceph/data5/current/db: does not exist
> >>   (create_if_missing is false)
> >>
> >>   Sorry, I should have said current/omap, not current/db.  I'm
> >>   still curious
> >>   to see the key dump.  I'm not sure why the leveldb key for these
> >>   pgs is
> >>   missing...
> >>
> >>
> >> Yesterday I have a chat with wangrui and the reason is "infos"(legacy oid)
> >> is missing. I'm not sure why it's missing.
> >
> > Probably
> >
> > https://github.com/ceph/ceph/blob/hammer/src/osd/OSD.cc#L2908
> >
> > Oh, I think I see what happened:
> >
> >  - the pg removal was aborted pre-hammer.  On pre-hammer, this means that
> > load_pgs skips it here:
> >
> >  https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L2121
> >
> >  - we upgrade to hammer.  we skip this pg (same reason), don't upgrade it,
> > but delete the legacy infos object
> >
> >  https://github.com/ceph/ceph/blob/hammer/src/osd/OSD.cc#L2908
> >
> >  - now we see this crash...
> >
> > I think the fix is, in hammer, to bail out of peek_map_epoch if the infos
> > object isn't present, here
> >
> >  https://github.com/ceph/ceph/blob/hammer/src/osd/PG.cc#L2867
> >
> > Probably we should restructure so we can return a 'fail' value
> > instead of a magic epoch_t meaning the same...
> >
> > This is similar to the bug I'm fixing on master (and I think I just
> > realized what I was doing wrong there).
> 
> Hmm, I got it. So we could skip this assert, or check whether the pool
> exists, the way load_pgs does?
> 
> I think it's an urgent bug, because I remember several people have shown me
> a similar crash.

Yeah.. take a look at https://github.com/ceph/ceph/pull/5892

Does that look right to you?  Packages are building now...

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pet project: OSD compatible daemon

2015-09-11 Thread Loic Dachary


On 11/09/2015 16:08, Shinobu Kinjo wrote:
> What I'm thinking of is to use fluentd to get logs in a
> quite human-readable format.

If you refer to https://github.com/fluent/fluentd, it's different. Or is it 
something else?

> 
> Is it the same as what you are thinking of?
> 
> Shinobu
> 
> - Original Message -
> From: "Shinobu" 
> To: ski...@redhat.com
> Sent: Friday, September 11, 2015 6:16:18 PM
> Subject: Fwd: pet project: OSD compatible daemon
> 
> -- Forwarded message --
> From: Loic Dachary 
> Date: Wed, Sep 9, 2015 at 5:45 PM
> Subject: pet project: OSD compatible daemon
> To: Ceph Development 
> 
> 
> Hi Ceph,
> 
> I would like to try to write an OSD compatible daemon, as a pet project, to
> learn Go and better understand the message flow. I suspect it may also be
> useful for debug purposes but that's not my primary incentive.
> 
> Has anyone tried something similar? If so I'd happily contribute instead
> of starting something from scratch.
> 
> Cheers
> 
> P.S. Since it's a pet project it's likely to take months before there is
> any kind of progress ;-)
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> 
> 
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: make check bot failures (2 hours today)

2015-09-11 Thread Loic Dachary


On 11/09/2015 15:53, Daniel Gryniewicz wrote:
> Maybe periodically run git gc on the clone out-of-line?  Git runs it
> occasionally when it thinks it's necessary, and that can take a while
> on large and/or fragmented repos.

I did git prune + git gc on the clone on both sides (receiving and sending) but 
it did not help. But I did dig deeper than that.

> 
> Daniel
> 
> On Fri, Sep 11, 2015 at 9:03 AM, Loic Dachary  wrote:
>> Hi Ceph,
>>
>> The make check bot failed a number of pull request verifications today. Each 
>> of them was flagged as a false negative (you should have received a short 
>> note if your pull request was affected). The problem is now fixed[1] and all 
>> should be back to normal. If you want to schedule another run, you just need 
>> to rebase your pull request and re-push it; the bot will notice.
>>
>> Sorry for the inconvenience and thanks for your patience :-)
>>
>> P.S. I'm not sure what it was exactly, just that git fetch took too long to 
>> answer and failed. Resetting the git clone from which the bot works fixed the 
>> problem. It happened a few times in the past but had not shown up in the past 
>> six months or so.
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





RE: About Fio backend with ObjectStore API

2015-09-11 Thread James (Fei) Liu-SSI
Hi Casey,
  You are right. I think the bottleneck is on the fio side rather than the 
filestore side in this case. fio did not issue the I/O commands fast enough to 
saturate the filestore.
  Here is one possible solution for it: create an async engine, which is 
normally much faster than a sync engine in fio.
   
   Here is a possible framework. This new Objectstore-AIO engine for fio should 
in theory be much faster than a sync engine. Once we have an fio engine that can 
saturate newstore, memstore and filestore, we can investigate in detail where 
the bottlenecks in their designs are.

.
struct objectstore_aio_data {
    struct aio_ctx *q_aio_ctx;
    struct aio_completion_data *a_data;
    aio_ses_ctx_t *p_ses_ctx;
    unsigned int entries;
};
...
/*
 * Note that the structure is exported, so that fio can get it via
 * dlsym(..., "ioengine");
 */
struct ioengine_ops us_aio_ioengine = {
    .name       = "objectstore-aio",
    .version    = FIO_IOOPS_VERSION,
    .init       = fio_objectstore_aio_init,
    .prep       = fio_objectstore_aio_prep,
    .queue      = fio_objectstore_aio_queue,
    .cancel     = fio_objectstore_aio_cancel,
    .getevents  = fio_objectstore_aio_getevents,
    .event      = fio_objectstore_aio_event,
    .cleanup    = fio_objectstore_aio_cleanup,
    .open_file  = fio_objectstore_aio_open,
    .close_file = fio_objectstore_aio_close,
};


Let me know what you think.

Regards,
James

-Original Message-
From: Casey Bodley [mailto:cbod...@redhat.com] 
Sent: Friday, September 11, 2015 7:28 AM
To: James (Fei) Liu-SSI
Cc: Haomai Wang; ceph-devel@vger.kernel.org
Subject: Re: About Fio backend with ObjectStore API

Hi James,

That's great that you were able to get fio-objectstore running! Thanks to you 
and Haomai for all the help with testing.

In terms of performance, it's possible that we're not handling the completions 
optimally. When profiling with MemStore I remember seeing a significant amount 
of cpu time spent in polling with fio_ceph_os_getevents().

The issue with reads is more of a design issue than a bug. Because the test 
starts with a mkfs(), there are no objects to read from initially. You would 
just have to add a write job to run before the read job, to make sure that the 
objects are initialized. Or perhaps the mkfs() step could be an optional part 
of the configuration.

Casey
