Re: [ceph-users] Ceph.conf

2015-09-10 Thread Gregory Farnum
On Thu, Sep 10, 2015 at 9:44 AM, Shinobu Kinjo  wrote:
> Hello,
>
> I'm seeing 859 parameters in the output of:
>
> $ ./ceph --show-config | wc -l
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> 859
>
> In:
>
> $ ./ceph --version
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> ceph version 9.0.2-1454-g050e1c5 
> (050e1c5c7471f8f237d9fa119af98c1efa9a8479)
>
> Since I'm quite new to Ceph, so my question is:
>
> Where can I know what each parameter exactly mean?
>
> I am probably right. Some parameters are just for tes-
> ting purpose.

Yes. A bunch shouldn't ever be set by users. A lot of the ones that
should be are described as part of various operations in
ceph.com/docs, but I don't know which ones of interest are missing
from there. It's not very discoverable right now, unfortunately.
-Greg
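
(For reference, individual options can at least be inspected on a live cluster; a
minimal sketch, assuming an admin keyring on the node and that osd.0 runs locally -
adjust the daemon name for your setup:)

$ ceph --show-config-value osd_journal_size      # built-in/default value of one option
$ ceph daemon osd.0 config show | grep journal   # values the running daemon actually uses
$ ceph daemon osd.0 config get osd_journal_size  # a single value from the running daemon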

>
> Thank you for your help in advance.
>
> Shinobu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph.conf

2015-09-10 Thread Shinobu Kinjo
Thank you for letting me know your thoughts, Abhishek!


> The Ceph Object Gateway will query Keystone periodically
> for a list of revoked tokens. These requests are encoded
> and signed. Also, Keystone may be configured to provide 
> self-signed tokens, which are also encoded and signed.


This is completely out of the scope of my original
question.

But I would like to ask you whether the above implementation,
which **periodically** talks to Keystone about revoked tokens,
is really secure or not.

I'm only asking because I'm thinking about Keystone
federation.

But you can ignore me anyhow or point out anything to me -;
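
(For context, the periodic polling described in the quoted text is driven by a few
rgw options; a minimal ceph.conf sketch, assuming the option names from the radosgw
Keystone docs - every value below is only a placeholder:)

[client.radosgw.gateway]
rgw keystone url = http://keystone-host:35357
rgw keystone admin token = {admin-token-placeholder}
rgw keystone accepted roles = Member, admin
rgw keystone token cache size = 500
rgw keystone revocation interval = 900   # seconds between revoked-token polls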

Shinobu

- Original Message -
From: "Abhishek L" 
To: "Shinobu Kinjo" 
Cc: "Gregory Farnum" , "ceph-users" 
, "ceph-devel" 
Sent: Thursday, September 10, 2015 6:35:31 PM
Subject: Re: [ceph-users] Ceph.conf

On Thu, Sep 10, 2015 at 2:51 PM, Shinobu Kinjo  wrote:
> Thank you for your really really quick reply, Greg.
>
>  > Yes. A bunch shouldn't ever be set by users.
>
>  Anyhow, this is one of my biggest concern right now -;
>
> rgw_keystone_admin_password =
>
>
> MUST not be there.


I know the dangers of this (ie keystone admin password being visible);
but isn't this already visible in ceph/radosgw configuration file as
well if you configure keystone.[1]

[1]: 
http://ceph.com/docs/master/radosgw/keystone/#integrating-with-openstack-keystone
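
(Two partial mitigations, sketched here as assumptions rather than anything the docs
mandate - either way a secret still sits in a file, and 600 permissions only work if
everything that reads ceph.conf runs as root:)

# prefer the shared admin token over the admin user/password pair in ceph.conf:
#   rgw keystone admin token = {admin-token-placeholder}
# and tighten permissions on whichever file carries the secret:
$ sudo chown root:root /etc/ceph/ceph.conf
$ sudo chmod 600 /etc/ceph/ceph.conf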

> Shinobu
>
> - Original Message -
> From: "Gregory Farnum" 
> To: "Shinobu Kinjo" 
> Cc: "ceph-users" , "ceph-devel" 
> 
> Sent: Thursday, September 10, 2015 5:57:52 PM
> Subject: Re: [ceph-users] Ceph.conf
>
> On Thu, Sep 10, 2015 at 9:44 AM, Shinobu Kinjo  wrote:
>> Hello,
>>
>> I'm seeing 859 parameters in the output of:
>>
>> $ ./ceph --show-config | wc -l
>> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
>> 859
>>
>> In:
>>
>> $ ./ceph --version
>> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
>> ceph version 9.0.2-1454-g050e1c5 
>> (050e1c5c7471f8f237d9fa119af98c1efa9a8479)
>>
>> Since I'm quite new to Ceph, so my question is:
>>
>> Where can I know what each parameter exactly mean?
>>
>> I am probably right. Some parameters are just for tes-
>> ting purpose.
>
> Yes. A bunch shouldn't ever be set by users. A lot of the ones that
> should be are described as part of various operations in
> ceph.com/docs, but I don't know which ones of interest are missing
> from there. It's not very discoverable right now, unfortunately.
> -Greg
>
>>
>> Thank you for your help in advance.
>>
>> Shinobu
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD with iSCSI

2015-09-10 Thread Daleep Bais
Hello ,

Can anyone please suggest a way to check and resolve this issue?


Thanks in advance.

Daleep Singh Bais

On Wed, Sep 9, 2015 at 5:43 PM, Daleep Bais  wrote:

> Hi,
>
> I am following steps from URL 
> *http://www.sebastien-han.fr/blog/2014/07/07/start-with-the-rbd-support-for-tgt/
> *
>   to create a RBD pool  and share to another initiator.
>
> I am not able to get rbd in the backstore list. Please suggest.
>
> below is the output of tgtadm command:
>
> tgtadm --lld iscsi --op show --mode system
> System:
> State: ready
> debug: off
> LLDs:
> iscsi: ready
> iser: error
> Backing stores:
> sheepdog
> bsg
> sg
> null
> ssc
> smc (bsoflags sync:direct)
> mmc (bsoflags sync:direct)
> rdwr (bsoflags sync:direct)
> Device types:
> disk
> cd/dvd
> osd
> controller
> changer
> tape
> passthrough
> iSNS:
> iSNS=Off
> iSNSServerIP=
> iSNSServerPort=3205
> iSNSAccessControl=Off
>
>
> I have installed tgt and tgt-rbd packages till now. Working on Debian
> GNU/Linux 8.1 (jessie)
>
> Thanks.
>
> Daleep Singh Bais
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph.conf

2015-09-10 Thread Shinobu Kinjo
Hello,

I'm seeing 859 parameters in the output of:

$ ./ceph --show-config | wc -l
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
859

In:

$ ./ceph --version
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
ceph version 9.0.2-1454-g050e1c5 (050e1c5c7471f8f237d9fa119af98c1efa9a8479)

Since I'm quite new to Ceph, my question is:

Where can I find out what each parameter exactly means?

I am probably right that some parameters are just for
testing purposes.

Thank you for your help in advance.

Shinobu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph.conf

2015-09-10 Thread Shinobu Kinjo
Thank you for your really really quick reply, Greg.

 > Yes. A bunch shouldn't ever be set by users.

 Anyhow, this is one of my biggest concerns right now -;

rgw_keystone_admin_password = 
   

MUST not be there.

Shinobu

- Original Message -
From: "Gregory Farnum" 
To: "Shinobu Kinjo" 
Cc: "ceph-users" , "ceph-devel" 

Sent: Thursday, September 10, 2015 5:57:52 PM
Subject: Re: [ceph-users] Ceph.conf

On Thu, Sep 10, 2015 at 9:44 AM, Shinobu Kinjo  wrote:
> Hello,
>
> I'm seeing 859 parameters in the output of:
>
> $ ./ceph --show-config | wc -l
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> 859
>
> In:
>
> $ ./ceph --version
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> ceph version 9.0.2-1454-g050e1c5 
> (050e1c5c7471f8f237d9fa119af98c1efa9a8479)
>
> Since I'm quite new to Ceph, so my question is:
>
> Where can I know what each parameter exactly mean?
>
> I am probably right. Some parameters are just for tes-
> ting purpose.

Yes. A bunch shouldn't ever be set by users. A lot of the ones that
should be are described as part of various operations in
ceph.com/docs, but I don't know which ones of interest are missing
from there. It's not very discoverable right now, unfortunately.
-Greg

>
> Thank you for your help in advance.
>
> Shinobu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph.conf

2015-09-10 Thread Abhishek L
On Thu, Sep 10, 2015 at 2:51 PM, Shinobu Kinjo  wrote:
> Thank you for your really really quick reply, Greg.
>
>  > Yes. A bunch shouldn't ever be set by users.
>
>  Anyhow, this is one of my biggest concern right now -;
>
> rgw_keystone_admin_password =
>
>
> MUST not be there.


I know the dangers of this (i.e. the keystone admin password being visible);
but isn't this already visible in the ceph/radosgw configuration file
anyway if you configure keystone? [1]

[1]: 
http://ceph.com/docs/master/radosgw/keystone/#integrating-with-openstack-keystone

> Shinobu
>
> - Original Message -
> From: "Gregory Farnum" 
> To: "Shinobu Kinjo" 
> Cc: "ceph-users" , "ceph-devel" 
> 
> Sent: Thursday, September 10, 2015 5:57:52 PM
> Subject: Re: [ceph-users] Ceph.conf
>
> On Thu, Sep 10, 2015 at 9:44 AM, Shinobu Kinjo  wrote:
>> Hello,
>>
>> I'm seeing 859 parameters in the output of:
>>
>> $ ./ceph --show-config | wc -l
>> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
>> 859
>>
>> In:
>>
>> $ ./ceph --version
>> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
>> ceph version 9.0.2-1454-g050e1c5 
>> (050e1c5c7471f8f237d9fa119af98c1efa9a8479)
>>
>> Since I'm quite new to Ceph, so my question is:
>>
>> Where can I know what each parameter exactly mean?
>>
>> I am probably right. Some parameters are just for tes-
>> ting purpose.
>
> Yes. A bunch shouldn't ever be set by users. A lot of the ones that
> should be are described as part of various operations in
> ceph.com/docs, but I don't know which ones of interest are missing
> from there. It's not very discoverable right now, unfortunately.
> -Greg
>
>>
>> Thank you for your help in advance.
>>
>> Shinobu
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on cephfs recovery tools

2015-09-10 Thread John Spray
On Wed, Sep 9, 2015 at 2:31 AM, Goncalo Borges
 wrote:
> Dear Ceph / CephFS gurus...
>
> Bear with me a bit while I give you a bit of context. Questions will appear
> at the end.
>
> 1) I am currently running ceph 9.0.3 and I have installed it to test the
> cephfs recovery tools.
>
> 2) I've created a situation where I've deliberately (on purpose) lost some
> data and metadata (check annex 1 after the main email).

You're only *maybe* losing metadata here, as your procedure is
targeting OSDs that contain data, and just hoping that those OSDs also
contain some metadata.

>
> 3) I've stopped the mds, and waited to check how the cluster reacts. After
> some time, as expected, the cluster reports a ERROR state, with a lot of PGs
> degraded and stuck
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_ERR
> 174 pgs degraded
> 48 pgs stale
> 174 pgs stuck degraded
> 41 pgs stuck inactive
> 48 pgs stuck stale
> 238 pgs stuck unclean
> 174 pgs stuck undersized
> 174 pgs undersized
> recovery 22366/463263 objects degraded (4.828%)
> recovery 8190/463263 objects misplaced (1.768%)
> too many PGs per OSD (388 > max 300)
> mds rank 0 has failed
> mds cluster is degraded
>  monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
> election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>  mdsmap e24: 0/1/1 up, 1 failed
>  osdmap e544: 21 osds: 15 up, 15 in; 87 remapped pgs
>   pgmap v25699: 2048 pgs, 2 pools, 602 GB data, 150 kobjects
> 1715 GB used, 40027 GB / 41743 GB avail
> 22366/463263 objects degraded (4.828%)
> 8190/463263 objects misplaced (1.768%)
> 1799 active+clean
>  110 active+undersized+degraded
>   60 active+remapped
>   37 stale+undersized+degraded+peered
>   23 active+undersized+degraded+remapped
>   11 stale+active+clean
>4 undersized+degraded+peered
>4 active
>
> 4) I've umounted the cephfs clients ('umount -l' worked for me this time but
> I already had situations where 'umount' would simply hang, and the only
> viable solutions would be to reboot the client).
>
> 5) I've recovered the ceph cluster by (details on the recover operations are
> in annex 2 after the main email.)
> - declaring the osds lost
> - removing the osds from the crush map
> - letting the cluster stabilize and letting all the recover I/O finish
> - identifying stuck PGs
> - checking if they existed, and if not recreate them.
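
(For readers following along, a minimal sketch of the commands behind those steps,
using osd.12 and pg 1.2f purely as placeholders - check the documentation before
running any of these against a real cluster:)

$ ceph osd lost 12 --yes-i-really-mean-it   # declare a dead OSD lost
$ ceph osd crush remove osd.12              # remove it from the CRUSH map
$ ceph auth del osd.12 && ceph osd rm 12    # drop its key and its id
$ ceph pg dump_stuck inactive               # identify PGs that never recovered
$ ceph pg force_create_pg 1.2f              # recreate a PG whose data is gone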
>
>
> 6) I've restarted the MDS. Initially, the mds cluster was considered
> degraded but after some small amount of time, that message disappeared. The
> WARNING status was just because of "too many PGs per OSD (409 > max 300)"
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_WARN
> too many PGs per OSD (409 > max 300)
> mds cluster is degraded
>  monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
> election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>  mdsmap e27: 1/1/1 up {0=rccephmds=up:reconnect}
>  osdmap e614: 15 osds: 15 up, 15 in
>   pgmap v27304: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
> 1761 GB used, 39981 GB / 41743 GB avail
> 2048 active+clean
>   client io 4151 kB/s rd, 1 op/s
>
> (wait some time)
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_WARN
> too many PGs per OSD (409 > max 300)
>  monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
> election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>  mdsmap e29: 1/1/1 up {0=rccephmds=up:active}
>  osdmap e614: 15 osds: 15 up, 15 in
>   pgmap v30442: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
> 1761 GB used, 39981 GB / 41743 GB avail
> 2048 active+clean
>
> 7) I was able to mount the cephfs filesystem in a client. When I tried to
> read a file made of some lost objects, I got holes in part of the file
> (compare with the same operation on annex 1)
>
> # od /cephfs/goncalo/5Gbytes_029.txt | head
> 000 00 00 00 00 00 00 00 00
> *
> 200 176665 053717 015710 124465 047254 102011 065275 123534
> 220 015727 131070 075673 176566 047511 154343 146334 006111
> 240 050506 102172 172362 121464 003532 005427 137554 137111
> 260 071444 052477 123364 127652 043562 144163 170405 026422
> 2000100 050316 117337 042573 171037 150704 071144 066344 116653
> 2000120 076041 041546 030235 055204 016253 136063 046012 066200
> 2000140 171626 123573 065351 032357 171326 132673 012213 016046
> 2000160 022034 

Re: [ceph-users] Question on cephfs recovery tools

2015-09-10 Thread Shinobu Kinjo
>> Finally the questions:
>>
>> a./ Under a situation as the one describe above, how can we safely terminate
>> cephfs in the clients? I have had situations where umount simply hangs and
>> there is no real way to unblock the situation unless I reboot the client. If
>> we have hundreds of clients, I would like to avoid that.
>
> In your procedure, the umount problems have nothing to do with
> corruption.  It's (sometimes) hanging because the MDS is offline.  If
> the client has dirty metadata, it may not be able to flush it until
> the MDS is online -- there's no general way to "abort" this without
> breaking userspace semantics.  Similar case:
> http://tracker.ceph.com/issues/9477
> 
> Rebooting the machine is actually correct, 

So far it might be OK, kind of, but we have to improve this kind of unfriendly
situation.
We must give users more choices, with nicer messages, or by implementing
another monitor process.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] higher read iop/s for single thread

2015-09-10 Thread Gregory Farnum
On Thu, Sep 10, 2015 at 2:34 PM, Stefan Priebe - Profihost AG
 wrote:
> Hi,
>
> while we're happy running ceph firefly in production and also reach
> enough 4k read iop/s for multithreaded apps (around 23 000) with qemu 2.2.1.
>
> We've now a customer having a single threaded application needing around
> 2000 iop/s but we don't go above 600 iop/s in this case.
>
> Any tuning hints for this case?

If the application really wants 2000 sync IOPS to disk without any
parallelism, I don't think any network storage system is likely to
satisfy him — that's only half a millisecond per IO. 600 IOPS is about
the limit of what the OSD can do right now (in terms of per-op
speeds), and although there is some work being done to improve that
it's not going to be in a released codebase for a while.

Or perhaps I misunderstood the question?
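
(For anyone wanting to reproduce that number, a minimal fio sketch for the
single-threaded case under discussion - it assumes fio is installed in the guest
and that /dev/vdb is a scratch RBD-backed disk whose contents may be destroyed:)

$ fio --name=singlethread --filename=/dev/vdb --rw=randread --bs=4k \
  --ioengine=libaio --iodepth=1 --numjobs=1 --direct=1 \
  --time_based --runtime=60

With iodepth=1 the reported IOPS is roughly 1 / (average per-op latency), which is
where the half-a-millisecond-per-IO arithmetic above comes from.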
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] higher read iop/s for single thread

2015-09-10 Thread Stefan Priebe - Profihost AG
Hi,

While we're happy running ceph firefly in production and reach enough
4k read iop/s for multithreaded apps (around 23 000) with qemu 2.2.1,
we now have a customer with a single-threaded application needing around
2000 iop/s, but we don't get above 600 iop/s in this case.

Any tuning hints for this case?

Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS/Fuse : detect package upgrade to remount

2015-09-10 Thread John Spray
On Tue, Sep 8, 2015 at 9:00 AM, Gregory Farnum  wrote:
> On Tue, Sep 8, 2015 at 2:33 PM, Florent B  wrote:
>>
>>
>> On 09/08/2015 03:26 PM, Gregory Farnum wrote:
>>> On Fri, Sep 4, 2015 at 9:15 AM, Florent B  wrote:
 Hi everyone,

 I would like to know if there is a way on Debian to detect an upgrade of
 ceph-fuse package, that "needs" remouting CephFS.

 When I upgrade my systems, I do a "aptitude update && aptitude
 safe-upgrade".

 When ceph-fuse package is upgraded, it would be nice to remount all
 CephFS points,  I suppose.

 Does someone did this ?
>>> I'm not sure how this could work. It'd be nice to smoothly upgrade for
>>> users, but
>>> 1) We don't automatically restart the OSD or monitor daemons on
>>> upgrade, because users want to control how many of their processes are
>>> down at once (and the load spike of a rebooting OSD),
>>> 2) I'm not sure how you could safely/cleanly restart a process that's
>>> serving a filesystem. It's not like we can force users to stop using
>>> the cephfs mountpoint and then reopen all their files after we reboot.
>>> -Greg
>>
>> Hi Greg,
>>
>> I understand.
>> It could be something like this : a command (or temp file) containing
>> *running* version of CephFS (per mount point, or per system).
>> Of course we can then get *installed* version of CephFS.
>> And if different, umount point.
>
> I guess I don't see how that helps compared to just remembering when
> you upgraded the package and comparing that to the running time of the
> ceph-fuse process.

I can see the utility of being able to do this at a moment in time,
rather than having to remember state.  Most orchestration tools will
make it a lot easier to implement a "compare and conditionally
restart" than a "store, upgrade, then compare and conditionally
restart".

Anyway, I think we already provide enough hooks to do this: the client
admin socket has a "version" command (`ceph daemon
/var/run/ceph/.asok version`).  The caller just needs to
know how to compare that to a package version.

John
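
(A minimal sketch of that comparison on Debian, assuming the default admin socket
path and the ceph-fuse package name - both the socket name and the version parsing
will likely need adjusting for a real deployment:)

RUNNING=$(ceph daemon /var/run/ceph/ceph-client.admin.asok version |
  python -c 'import json,sys; print(json.load(sys.stdin)["version"])')
INSTALLED=$(dpkg-query -W -f='${Version}' ceph-fuse)
[ "$RUNNING" = "${INSTALLED%%-*}" ] || echo "ceph-fuse upgraded; remount needed"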
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on cephfs recovery tools

2015-09-10 Thread John Spray
On Thu, Sep 10, 2015 at 7:44 AM, Shinobu Kinjo  wrote:
>>> Finally the questions:
>>>
>>> a./ Under a situation as the one describe above, how can we safely terminate
>>> cephfs in the clients? I have had situations where umount simply hangs and
>>> there is no real way to unblock the situation unless I reboot the client. If
>>> we have hundreds of clients, I would like to avoid that.
>>
>> In your procedure, the umount problems have nothing to do with
>> corruption.  It's (sometimes) hanging because the MDS is offline.  If
>> the client has dirty metadata, it may not be able to flush it until
>> the MDS is online -- there's no general way to "abort" this without
>> breaking userspace semantics.  Similar case:
>> http://tracker.ceph.com/issues/9477
>>
>> Rebooting the machine is actually correct,
>
> So far, it might be ok, kind of but we have to improve this kind of 
> un-friendly
> situation.
> We must give more choices to users using more nice messages, or by 
> implementing
> another monitor process.

Yes, agreed.  We may not be able to change the underlying OS rules
about filesystems, but that doesn't mean we can't add other mechanisms
to make this better for users.

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] higher read iop/s for single thread

2015-09-10 Thread Jan Schermer
Get faster CPUs (sorry, nothing else comes to mind).
What type of application is that and what exactly does it do?

Basically you would have to cache it in rbd cache or pagecache in the VM but 
that only works if the reads repeat.

Jan
 
> On 10 Sep 2015, at 15:34, Stefan Priebe - Profihost AG 
>  wrote:
> 
> Hi,
> 
> while we're happy running ceph firefly in production and also reach
> enough 4k read iop/s for multithreaded apps (around 23 000) with qemu 2.2.1.
> 
> We've now a customer having a single threaded application needing around
> 2000 iop/s but we don't go above 600 iop/s in this case.
> 
> Any tuning hints for this case?
> 
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] higher read iop/s for single thread

2015-09-10 Thread Andrija Panic
"enough 4k read iop/s for multithreaded apps (around 23 000) with qemu
2.2.1."

That is a very nice number, if I'm allowed to comment - may I know what
your setup is (in 2 lines: hardware, number of OSDs)?

Thanks

On 10 September 2015 at 15:39, Jan Schermer  wrote:

> Get faster CPUs (sorry, nothing else comes to mind).
> What type of application is that and what exactly does it do?
>
> Basically you would have to cache it in rbd cache or pagecache in the VM
> but that only works if the reads repeat.
>
> Jan
>
> > On 10 Sep 2015, at 15:34, Stefan Priebe - Profihost AG <
> s.pri...@profihost.ag> wrote:
> >
> > Hi,
> >
> > while we're happy running ceph firefly in production and also reach
> > enough 4k read iop/s for multithreaded apps (around 23 000) with qemu
> 2.2.1.
> >
> > We've now a customer having a single threaded application needing around
> > 2000 iop/s but we don't go above 600 iop/s in this case.
> >
> > Any tuning hints for this case?
> >
> > Stefan
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 

Andrija Panić
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] higher read iop/s for single thread

2015-09-10 Thread Mark Nelson
I'm not sure you will be able to get there with firefly.  I've gotten 
close to 1ms after lots of tuning on hammer, but 0.5ms is probably not 
likely to happen without all of the new work that 
Sandisk/Fujitsu/Intel/Others have been doing to improve the data path.


Your best bet is probably going to be a combination of:

1) switch to jemalloc (and make sure you have enough RAM to deal with it)
2) disabled ceph auth
3) disable all logging
4) throw a high clock speed CPU at the OSDs and keep the number of OSDs 
per server lowish (will need to be tested to see where the sweet spot is).
5) potentially implement some kind of scheme to make sure OSD threads 
stay pinned to specific cores.
6) lots of investigation to make sure the kernel/tcp stack/vm/etc isn't 
getting in the way.
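
(A sketch of what items 2 and 3 might look like in ceph.conf - for isolated
benchmarking only, since turning cephx off removes authentication entirely; the
jemalloc path below is an assumption and varies by distro:)

[global]
auth cluster required = none
auth service required = none
auth client required = none
debug ms = 0/0
debug osd = 0/0
debug filestore = 0/0
debug journal = 0/0

# item 1, roughly: start the OSD with jemalloc preloaded
# LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd -i 0 ...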


Mark

On 09/10/2015 08:34 AM, Stefan Priebe - Profihost AG wrote:

Hi,

while we're happy running ceph firefly in production and also reach
enough 4k read iop/s for multithreaded apps (around 23 000) with qemu 2.2.1.

We've now a customer having a single threaded application needing around
2000 iop/s but we don't go above 600 iop/s in this case.

Any tuning hints for this case?

Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] higher read iop/s for single thread

2015-09-10 Thread Jan Schermer
It's certainly not a problem with DRBD (yeah, it's something completely 
different but it's used for all kinds of workloads including things like 
replicated tablespaces for databases).
It won't be a problem with VSAN (again, a bit different, but most people just 
want something like that)
It surely won't be a problem with e.g. ScaleIO which should be comparable to 
Ceph.

Latency on the network can be very low (0.05ms on my 10GbE). Latency on good 
SSDs is 2 orders of magnitude lower (as low as 0.5 ms). Linux is pretty 
good nowadays at waking up threads and pushing the work. Multiply those numbers 
by whatever factor and it's still just a fraction of the 0.5ms needed.
The problem is quite frankly slow OSD code and the only solution now is to keep 
the data closer to the VM.

Jan

> On 10 Sep 2015, at 15:38, Gregory Farnum  wrote:
> 
> On Thu, Sep 10, 2015 at 2:34 PM, Stefan Priebe - Profihost AG
>  wrote:
>> Hi,
>> 
>> while we're happy running ceph firefly in production and also reach
>> enough 4k read iop/s for multithreaded apps (around 23 000) with qemu 2.2.1.
>> 
>> We've now a customer having a single threaded application needing around
>> 2000 iop/s but we don't go above 600 iop/s in this case.
>> 
>> Any tuning hints for this case?
> 
> If the application really wants 2000 sync IOPS to disk without any
> parallelism, I don't think any network storage system is likely to
> satisfy him — that's only half a millisecond per IO. 600 IOPS is about
> the limit of what the OSD can do right now (in terms of per-op
> speeds), and although there is some work being done to improve that
> it's not going to be in a released codebase for a while.
> 
> Or perhaps I misunderstood the question?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] higher read iop/s for single thread

2015-09-10 Thread Jan Schermer

> On 10 Sep 2015, at 16:26, Haomai Wang  wrote:
> 
> Actually we can reach 700us per 4k write IO for single io depth(2 copy, 
> E52650, 10Gib, intel s3700). So I think 400 read iops shouldn't be a 
> unbridgeable problem.
> 

Flushed to disk?


> CPU is critical for ssd backend, so what's your cpu model?
> 
> On Thu, Sep 10, 2015 at 9:48 PM, Jan Schermer  > wrote:
> It's certainly not a problem with DRBD (yeah, it's something completely 
> different but it's used for all kinds of workloads including things like 
> replicated tablespaces for databases).
> It won't be a problem with VSAN (again, a bit different, but most people just 
> want something like that)
> It surely won't be a problem with e.g. ScaleIO which should be comparable to 
> Ceph.
> 
> Latency on the network can be very low (0.05ms on my 10GbE). Latency on good 
> SSDs is  2 orders of magnitute lower (as low as 0.5 ms). Linux is pretty 
> good nowadays at waking up threads and pushing the work. Multiply those 
> numbers by whatever factor and it's still just a fraction of the 0.5ms needed.
> The problem is quite frankly slow OSD code and the only solution now is to 
> keep the data closer to the VM.
> 
> Jan
> 
> > On 10 Sep 2015, at 15:38, Gregory Farnum  > > wrote:
> >
> > On Thu, Sep 10, 2015 at 2:34 PM, Stefan Priebe - Profihost AG
> > > wrote:
> >> Hi,
> >>
> >> while we're happy running ceph firefly in production and also reach
> >> enough 4k read iop/s for multithreaded apps (around 23 000) with qemu 
> >> 2.2.1.
> >>
> >> We've now a customer having a single threaded application needing around
> >> 2000 iop/s but we don't go above 600 iop/s in this case.
> >>
> >> Any tuning hints for this case?
> >
> > If the application really wants 2000 sync IOPS to disk without any
> > parallelism, I don't think any network storage system is likely to
> > satisfy him — that's only half a millisecond per IO. 600 IOPS is about
> > the limit of what the OSD can do right now (in terms of per-op
> > speeds), and although there is some work being done to improve that
> > it's not going to be in a released codebase for a while.
> >
> > Or perhaps I misunderstood the question?
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> 
> 
> -- 
> Best Regards,
> 
> Wheat
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] higher read iop/s for single thread

2015-09-10 Thread Andrija Panic
We also get 2 ms for writes, with Intel S3500 journals (5 journals on 1 SSD) and
4 TB OSDs...

On 10 September 2015 at 16:41, Jan Schermer  wrote:

> What did you tune? Did you have to make a human sacrifice? :) Which
> release?
> The last proper benchmark numbers I saw were from hammer and the latencies
> were basically still the same, about 2ms for write.
>
> Jan
>
>
> On 10 Sep 2015, at 16:38, Haomai Wang  wrote:
>
>
>
> On Thu, Sep 10, 2015 at 10:36 PM, Jan Schermer  wrote:
>
>>
>> On 10 Sep 2015, at 16:26, Haomai Wang  wrote:
>>
>> Actually we can reach 700us per 4k write IO for single io depth(2 copy,
>> E52650, 10Gib, intel s3700). So I think 400 read iops shouldn't be a
>> unbridgeable problem.
>>
>>
>> Flushed to disk?
>>
>
> of course
>
>
>>
>>
>> CPU is critical for ssd backend, so what's your cpu model?
>>
>> On Thu, Sep 10, 2015 at 9:48 PM, Jan Schermer  wrote:
>>
>>> It's certainly not a problem with DRBD (yeah, it's something completely
>>> different but it's used for all kinds of workloads including things like
>>> replicated tablespaces for databases).
>>> It won't be a problem with VSAN (again, a bit different, but most people
>>> just want something like that)
>>> It surely won't be a problem with e.g. ScaleIO which should be
>>> comparable to Ceph.
>>>
>>> Latency on the network can be very low (0.05ms on my 10GbE). Latency on
>>> good SSDs is  2 orders of magnitute lower (as low as 0.5 ms). Linux is
>>> pretty good nowadays at waking up threads and pushing the work. Multiply
>>> those numbers by whatever factor and it's still just a fraction of the
>>> 0.5ms needed.
>>> The problem is quite frankly slow OSD code and the only solution now is
>>> to keep the data closer to the VM.
>>>
>>> Jan
>>>
>>> > On 10 Sep 2015, at 15:38, Gregory Farnum  wrote:
>>> >
>>> > On Thu, Sep 10, 2015 at 2:34 PM, Stefan Priebe - Profihost AG
>>> >  wrote:
>>> >> Hi,
>>> >>
>>> >> while we're happy running ceph firefly in production and also reach
>>> >> enough 4k read iop/s for multithreaded apps (around 23 000) with qemu
>>> 2.2.1.
>>> >>
>>> >> We've now a customer having a single threaded application needing
>>> around
>>> >> 2000 iop/s but we don't go above 600 iop/s in this case.
>>> >>
>>> >> Any tuning hints for this case?
>>> >
>>> > If the application really wants 2000 sync IOPS to disk without any
>>> > parallelism, I don't think any network storage system is likely to
>>> > satisfy him — that's only half a millisecond per IO. 600 IOPS is about
>>> > the limit of what the OSD can do right now (in terms of per-op
>>> > speeds), and although there is some work being done to improve that
>>> > it's not going to be in a released codebase for a while.
>>> >
>>> > Or perhaps I misunderstood the question?
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>>
>> --
>>
>> Best Regards,
>>
>> Wheat
>>
>>
>>
>
>
> --
>
> Best Regards,
>
> Wheat
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 

Andrija Panić
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] higher read iop/s for single thread

2015-09-10 Thread Jan Schermer
What did you tune? Did you have to make a human sacrifice? :) Which release?
The last proper benchmark numbers I saw were from hammer and the latencies were 
basically still the same, about 2ms for write.

Jan


> On 10 Sep 2015, at 16:38, Haomai Wang  wrote:
> 
> 
> 
> On Thu, Sep 10, 2015 at 10:36 PM, Jan Schermer  > wrote:
> 
>> On 10 Sep 2015, at 16:26, Haomai Wang > > wrote:
>> 
>> Actually we can reach 700us per 4k write IO for single io depth(2 copy, 
>> E52650, 10Gib, intel s3700). So I think 400 read iops shouldn't be a 
>> unbridgeable problem.
>> 
> 
> Flushed to disk?
> 
> of course
>  
> 
> 
>> CPU is critical for ssd backend, so what's your cpu model?
>> 
>> On Thu, Sep 10, 2015 at 9:48 PM, Jan Schermer > > wrote:
>> It's certainly not a problem with DRBD (yeah, it's something completely 
>> different but it's used for all kinds of workloads including things like 
>> replicated tablespaces for databases).
>> It won't be a problem with VSAN (again, a bit different, but most people 
>> just want something like that)
>> It surely won't be a problem with e.g. ScaleIO which should be comparable to 
>> Ceph.
>> 
>> Latency on the network can be very low (0.05ms on my 10GbE). Latency on good 
>> SSDs is  2 orders of magnitute lower (as low as 0.5 ms). Linux is pretty 
>> good nowadays at waking up threads and pushing the work. Multiply those 
>> numbers by whatever factor and it's still just a fraction of the 0.5ms 
>> needed.
>> The problem is quite frankly slow OSD code and the only solution now is to 
>> keep the data closer to the VM.
>> 
>> Jan
>> 
>> > On 10 Sep 2015, at 15:38, Gregory Farnum > > > wrote:
>> >
>> > On Thu, Sep 10, 2015 at 2:34 PM, Stefan Priebe - Profihost AG
>> > > wrote:
>> >> Hi,
>> >>
>> >> while we're happy running ceph firefly in production and also reach
>> >> enough 4k read iop/s for multithreaded apps (around 23 000) with qemu 
>> >> 2.2.1.
>> >>
>> >> We've now a customer having a single threaded application needing around
>> >> 2000 iop/s but we don't go above 600 iop/s in this case.
>> >>
>> >> Any tuning hints for this case?
>> >
>> > If the application really wants 2000 sync IOPS to disk without any
>> > parallelism, I don't think any network storage system is likely to
>> > satisfy him — that's only half a millisecond per IO. 600 IOPS is about
>> > the limit of what the OSD can do right now (in terms of per-op
>> > speeds), and although there is some work being done to improve that
>> > it's not going to be in a released codebase for a while.
>> >
>> > Or perhaps I misunderstood the question?
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com 
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> > 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
>> 
>> 
>> 
>> -- 
>> Best Regards,
>> 
>> Wheat
>> 
> 
> 
> 
> 
> -- 
> Best Regards,
> 
> Wheat
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD with iSCSI

2015-09-10 Thread Jake Young
On Wed, Sep 9, 2015 at 8:13 AM, Daleep Bais  wrote:

> Hi,
>
> I am following steps from URL 
> *http://www.sebastien-han.fr/blog/2014/07/07/start-with-the-rbd-support-for-tgt/
> *
>   to create a RBD pool  and share to another initiator.
>
> I am not able to get rbd in the backstore list. Please suggest.
>
> below is the output of tgtadm command:
>
> tgtadm --lld iscsi --op show --mode system
> System:
> State: ready
> debug: off
> LLDs:
> iscsi: ready
> iser: error
> Backing stores:
> sheepdog
> bsg
> sg
> null
> ssc
> smc (bsoflags sync:direct)
> mmc (bsoflags sync:direct)
> rdwr (bsoflags sync:direct)
> Device types:
> disk
> cd/dvd
> osd
> controller
> changer
> tape
> passthrough
> iSNS:
> iSNS=Off
> iSNSServerIP=
> iSNSServerPort=3205
> iSNSAccessControl=Off
>
>
> I have installed tgt and tgt-rbd packages till now. Working on Debian
> GNU/Linux 8.1 (jessie)
>
> Thanks.
>
> Daleep Singh Bais
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
Hey Daleep,

The tgt you have installed does not support Ceph rbd.  See the output from
my system using a more recent tgt that supports rbd.

tgtadm --lld iscsi --mode system --op show
System:
State: ready
debug: off
LLDs:
iscsi: ready
iser: error
Backing stores:
*rbd (bsoflags sync:direct)*
sheepdog
bsg
sg
null
ssc
rdwr (bsoflags sync:direct)
Device types:
disk
cd/dvd
osd
controller
changer
tape
passthrough
iSNS:
iSNS=Off
iSNSServerIP=
iSNSServerPort=3205
iSNSAccessControl=Off


You will need a newer version of tgt. I think the earliest version that
supports rbd is 1.0.42

https://github.com/fujita/tgt
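
(Once an rbd-capable tgt is installed, a minimal sketch of exporting an image -
it assumes a pool called rbd, an image called iscsi-image, and an arbitrary IQN;
the bs-type syntax follows the blog post linked earlier in the thread:)

# /etc/tgt/conf.d/ceph.conf
<target iqn.2015-09.com.example:rbd-iscsi>
    driver iscsi
    bs-type rbd
    backing-store rbd/iscsi-image
</target>

$ sudo tgt-admin --update ALL                   # or restart/reload the tgt service
$ tgtadm --lld iscsi --mode target --op show    # the LUN should now be listed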
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] higher read iop/s for single thread

2015-09-10 Thread Haomai Wang
Actually we can reach 700us per 4k write IO for single io depth (2 copies,
E5-2650, 10GbE, Intel S3700). So I think 400 read iops shouldn't be an
unbridgeable problem.

CPU is critical for ssd backend, so what's your cpu model?
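
(A quick way to sanity-check numbers like that from the Ceph side, as a sketch
assuming a throwaway pool named bench that you can delete afterwards:)

$ ceph osd pool create bench 128
$ rados bench -p bench 60 write -b 4096 -t 1 --no-cleanup   # 4k writes, queue depth 1
$ rados bench -p bench 60 rand -t 1                         # random reads of the same objects
$ rados -p bench cleanup

rados bench reports average latency per op, which maps directly onto the per-IO
figures being discussed in this thread.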

On Thu, Sep 10, 2015 at 9:48 PM, Jan Schermer  wrote:

> It's certainly not a problem with DRBD (yeah, it's something completely
> different but it's used for all kinds of workloads including things like
> replicated tablespaces for databases).
> It won't be a problem with VSAN (again, a bit different, but most people
> just want something like that)
> It surely won't be a problem with e.g. ScaleIO which should be comparable
> to Ceph.
>
> Latency on the network can be very low (0.05ms on my 10GbE). Latency on
> good SSDs is  2 orders of magnitute lower (as low as 0.5 ms). Linux is
> pretty good nowadays at waking up threads and pushing the work. Multiply
> those numbers by whatever factor and it's still just a fraction of the
> 0.5ms needed.
> The problem is quite frankly slow OSD code and the only solution now is to
> keep the data closer to the VM.
>
> Jan
>
> > On 10 Sep 2015, at 15:38, Gregory Farnum  wrote:
> >
> > On Thu, Sep 10, 2015 at 2:34 PM, Stefan Priebe - Profihost AG
> >  wrote:
> >> Hi,
> >>
> >> while we're happy running ceph firefly in production and also reach
> >> enough 4k read iop/s for multithreaded apps (around 23 000) with qemu
> 2.2.1.
> >>
> >> We've now a customer having a single threaded application needing around
> >> 2000 iop/s but we don't go above 600 iop/s in this case.
> >>
> >> Any tuning hints for this case?
> >
> > If the application really wants 2000 sync IOPS to disk without any
> > parallelism, I don't think any network storage system is likely to
> > satisfy him — that's only half a millisecond per IO. 600 IOPS is about
> > the limit of what the OSD can do right now (in terms of per-op
> > speeds), and although there is some work being done to improve that
> > it's not going to be in a released codebase for a while.
> >
> > Or perhaps I misunderstood the question?
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 

Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] higher read iop/s for single thread

2015-09-10 Thread Haomai Wang
On Thu, Sep 10, 2015 at 10:36 PM, Jan Schermer  wrote:

>
> On 10 Sep 2015, at 16:26, Haomai Wang  wrote:
>
> Actually we can reach 700us per 4k write IO for single io depth(2 copy,
> E52650, 10Gib, intel s3700). So I think 400 read iops shouldn't be a
> unbridgeable problem.
>
>
> Flushed to disk?
>

of course


>
>
> CPU is critical for ssd backend, so what's your cpu model?
>
> On Thu, Sep 10, 2015 at 9:48 PM, Jan Schermer  wrote:
>
>> It's certainly not a problem with DRBD (yeah, it's something completely
>> different but it's used for all kinds of workloads including things like
>> replicated tablespaces for databases).
>> It won't be a problem with VSAN (again, a bit different, but most people
>> just want something like that)
>> It surely won't be a problem with e.g. ScaleIO which should be comparable
>> to Ceph.
>>
>> Latency on the network can be very low (0.05ms on my 10GbE). Latency on
>> good SSDs is  2 orders of magnitute lower (as low as 0.5 ms). Linux is
>> pretty good nowadays at waking up threads and pushing the work. Multiply
>> those numbers by whatever factor and it's still just a fraction of the
>> 0.5ms needed.
>> The problem is quite frankly slow OSD code and the only solution now is
>> to keep the data closer to the VM.
>>
>> Jan
>>
>> > On 10 Sep 2015, at 15:38, Gregory Farnum  wrote:
>> >
>> > On Thu, Sep 10, 2015 at 2:34 PM, Stefan Priebe - Profihost AG
>> >  wrote:
>> >> Hi,
>> >>
>> >> while we're happy running ceph firefly in production and also reach
>> >> enough 4k read iop/s for multithreaded apps (around 23 000) with qemu
>> 2.2.1.
>> >>
>> >> We've now a customer having a single threaded application needing
>> around
>> >> 2000 iop/s but we don't go above 600 iop/s in this case.
>> >>
>> >> Any tuning hints for this case?
>> >
>> > If the application really wants 2000 sync IOPS to disk without any
>> > parallelism, I don't think any network storage system is likely to
>> > satisfy him — that's only half a millisecond per IO. 600 IOPS is about
>> > the limit of what the OSD can do right now (in terms of per-op
>> > speeds), and although there is some work being done to improve that
>> > it's not going to be in a released codebase for a while.
>> >
>> > Or perhaps I misunderstood the question?
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
>
> Best Regards,
>
> Wheat
>
>
>


-- 

Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] higher read iop/s for single thread

2015-09-10 Thread Jan Schermer
Yes, I read that and I investigated all those areas; on my Dumpling cluster the gains 
are pretty low.
If I remember correctly, Hammer didn't improve synchronous (journal) writes at 
all - or at least it hadn't when I read that...?
So is it actually that much faster? Did something change in Hammer in recent 
releases?
3x speedup is of course good, but it still needs to get even faster. I'd 
consider 0.5ms in all situations a bottom line goal to be able to run 
moderately loaded databases, but to compete with any SAN or even DAS storage it 
needs to get _much_ faster than that, somehow...
Otherwise we're still just wasting SSDs in here.

Jan

> On 10 Sep 2015, at 16:45, Haomai Wang  wrote:
> 
> 
> 
> On Thu, Sep 10, 2015 at 10:41 PM, Jan Schermer  > wrote:
> What did you tune? Did you have to make a human sacrifice? :) Which release?
> The last proper benchmark numbers I saw were from hammer and the latencies 
> were basically still the same, about 2ms for write.
> 
> No sacrifice, actually I dive into all-ssd ceph since 2013, I can see the 
> improvement from Dumpling to Hammer. 
> 
> You can find my related thread in 2014. Mainly about ensure fd hit, cpu 
> powersave disable, memory management
>  
> 
> Jan
> 
> 
>> On 10 Sep 2015, at 16:38, Haomai Wang > > wrote:
>> 
>> 
>> 
>> On Thu, Sep 10, 2015 at 10:36 PM, Jan Schermer > > wrote:
>> 
>>> On 10 Sep 2015, at 16:26, Haomai Wang >> > wrote:
>>> 
>>> Actually we can reach 700us per 4k write IO for single io depth(2 copy, 
>>> E52650, 10Gib, intel s3700). So I think 400 read iops shouldn't be a 
>>> unbridgeable problem.
>>> 
>> 
>> Flushed to disk?
>> 
>> of course
>>  
>> 
>> 
>>> CPU is critical for ssd backend, so what's your cpu model?
>>> 
>>> On Thu, Sep 10, 2015 at 9:48 PM, Jan Schermer >> > wrote:
>>> It's certainly not a problem with DRBD (yeah, it's something completely 
>>> different but it's used for all kinds of workloads including things like 
>>> replicated tablespaces for databases).
>>> It won't be a problem with VSAN (again, a bit different, but most people 
>>> just want something like that)
>>> It surely won't be a problem with e.g. ScaleIO which should be comparable 
>>> to Ceph.
>>> 
>>> Latency on the network can be very low (0.05ms on my 10GbE). Latency on 
>>> good SSDs is  2 orders of magnitute lower (as low as 0.5 ms). Linux is 
>>> pretty good nowadays at waking up threads and pushing the work. Multiply 
>>> those numbers by whatever factor and it's still just a fraction of the 
>>> 0.5ms needed.
>>> The problem is quite frankly slow OSD code and the only solution now is to 
>>> keep the data closer to the VM.
>>> 
>>> Jan
>>> 
>>> > On 10 Sep 2015, at 15:38, Gregory Farnum >> > > wrote:
>>> >
>>> > On Thu, Sep 10, 2015 at 2:34 PM, Stefan Priebe - Profihost AG
>>> > > wrote:
>>> >> Hi,
>>> >>
>>> >> while we're happy running ceph firefly in production and also reach
>>> >> enough 4k read iop/s for multithreaded apps (around 23 000) with qemu 
>>> >> 2.2.1.
>>> >>
>>> >> We've now a customer having a single threaded application needing around
>>> >> 2000 iop/s but we don't go above 600 iop/s in this case.
>>> >>
>>> >> Any tuning hints for this case?
>>> >
>>> > If the application really wants 2000 sync IOPS to disk without any
>>> > parallelism, I don't think any network storage system is likely to
>>> > satisfy him — that's only half a millisecond per IO. 600 IOPS is about
>>> > the limit of what the OSD can do right now (in terms of per-op
>>> > speeds), and although there is some work being done to improve that
>>> > it's not going to be in a released codebase for a while.
>>> >
>>> > Or perhaps I misunderstood the question?
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com 
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>> > 
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards,
>>> 
>>> Wheat
>>> 
>> 
>> 
>> 
>> 
>> -- 
>> Best Regards,
>> 
>> Wheat
>> 
> 
> 
> 
> 
> -- 
> Best Regards,
> 
> Wheat
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu core?

2015-09-10 Thread Jelle de Jong
Hello Jan,

I want to test your pincpus scripts that I got from GitHub.

I have a 2x CPU (X5550, 4 cores each, 16 threads total) system.
I have four OSDs (4x WD1003FBYX) with an SSD (SHFS37A) journal.
I have three nodes like that.
I am not sure how to configure prz-pincpus.conf.

# prz-pincpus.conf
https://paste.debian.net/plainh/70d11f19

I do not have a /cgroup/cpuset/libvirt/cpuset.cpus or /etc/cgconfig.conf
file.

# mount | grep cg
cgroup on /sys/fs/cgroup type tmpfs (rw,relatime,size=12k)
cgmfs on /run/cgmanager/fs type tmpfs (rw,relatime,size=100k,mode=755)
systemd on /sys/fs/cgroup/systemd type cgroup
(rw,nosuid,nodev,noexec,relatime,release_agent=/usr/lib/x86_64-linux-gnu/systemd-shim-cgroup-release-agent,name=systemd)
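
(Independent of the script, a minimal manual cpuset sketch for one OSD - it assumes
the cpuset controller gets mounted under /sys/fs/cgroup/cpuset (it is not in the
mount output above), NUMA node 0 with cores 0-3, and a placeholder pid for osd.0:)

# mkdir -p /sys/fs/cgroup/cpuset
# mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
# mkdir /sys/fs/cgroup/cpuset/ceph-osd-0
# echo 0-3 > /sys/fs/cgroup/cpuset/ceph-osd-0/cpuset.cpus
# echo 0   > /sys/fs/cgroup/cpuset/ceph-osd-0/cpuset.mems
# echo <pid-of-osd.0> > /sys/fs/cgroup/cpuset/ceph-osd-0/cgroup.procs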

And some unrelated benchmarks, because I know you like them as well:
https://paste.debian.net/plainh/0b5c159f
# noop scheduler on all disks and ssd
# write_cache = off on ssd
# add_random =0 on ssd
# performance governor
# more ideas tips?

Kind regards,

Jelle

On 27/07/15 14:21, Jan Schermer wrote:
> Hi!
> The /cgroup/* mount point is probably a RHEL6 thing, recent distributions 
> seem to use /sys/fs/cgroup like in your case (maybe because of systemd?). On 
> RHEL 6 the mount points are configured in /etc/cgconfig.conf and /cgroup is 
> the default.
> 
> I also saw the pull request from you on github and I don’t think I’ll merge 
> it because creating the directory if the parent does not exist could mask the 
> non-existence of cgroups or a different mountpoint, so I think it’s better to 
> fail and leave it up to the admin to modify the script.
> A more mature solution would probably be some sort of OS-specific integration 
> (automatic cgclassify rules, initscript-ed cgroup creation and such). When 
> this support is already in place maintainers only need to integrate it. In 
> newer distros a new kernel (scheduler) with more NUMA awareness and other 
> autotuning could do a better job than this script by default.
> 
> And if any CEPH devs are listening: I saw an issue on CEPH tracker for cgroup 
> classification http://tracker.ceph.com/issues/12424 and I humbly advise you 
> not to do that - this will either turn into something distro-specific or it 
> will create an Inner Platform Effect on all distros that maintainers 
> downstream will need to replace with their own anyway. Of course since 
> Inktank is somewhat part of RedHat now it makes sense to integrate it into 
> RHOS, RHEV and CEPH packages for RHEL and make a profile for “tuned” or 
> whatever does the tuning magic.
> 
> Btw has anybody else tried it? What are your results? We still use it and it 
> makes a big difference on NUMA systems, even bigger difference when mixed 
> with KVM guests on the same hardware.
>  
> Thanks
> Jan
> 
> 
> 
>> On 27 Jul 2015, at 13:23, Saverio Proto  wrote:
>>
>> Hello Jan,
>>
>> I am testing your scripts, because we want also to test OSDs and VMs
>> on the same server.
>>
>> I am new to cgroups, so this might be a very newbie question.
>> In your script you always reference to the file
>> /cgroup/cpuset/libvirt/cpuset.cpus
>>
>> but I have the file in /sys/fs/cgroup/cpuset/libvirt/cpuset.cpus
>>
>> I am working on Ubuntu 14.04
>>
>> This difference comes from something special in your setup, or maybe
>> because we are working on different Linux distributions ?
>>
>> Thanks for clarification.
>>
>> Saverio
>>
>>
>>
>> 2015-06-30 17:50 GMT+02:00 Jan Schermer :
>>> Hi all,
>>> our script is available on GitHub
>>>
>>> https://github.com/prozeta/pincpus
>>>
>>> I haven’t had much time to do a proper README, but I hope the configuration
>>> is self explanatory enough for now.
>>> What it does is pin each OSD into the most “empty” cgroup assigned to a NUMA
>>> node.
>>>
>>> Let me know how it works for you!
>>>
>>> Jan
>>>
>>>
>>> On 30 Jun 2015, at 10:50, Huang Zhiteng  wrote:
>>>
>>>
>>>
>>> On Tue, Jun 30, 2015 at 4:25 PM, Jan Schermer  wrote:

 Not having OSDs and KVMs compete against each other is one thing.
 But there are more reasons to do this

 1) not moving the processes and threads between cores that much (better
 cache utilization)
 2) aligning the processes with memory on NUMA systems (that means all
 modern dual socket systems) - you don’t want your OSD running on CPU1 with
 memory allocated to CPU2
 3) the same goes for other resources like NICs or storage controllers -
 but that’s less important and not always practical to do
 4) you can limit the scheduling domain on linux if you limit the cpuset
 for your OSDs (I’m not sure how important this is, just best practice)
 5) you can easily limit memory or CPU usage, set priority, with much
 greater granularity than without cgroups
 6) if you have HyperThreading enabled you get the most gain when the
 workloads on the threads are dissimiliar - so to have the higher throughput
 you have to pin OSD to thread1 

Re: [ceph-users] Straw2 kernel version?

2015-09-10 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

My notes show that it should have landed in 4.1, but I also have
written down that it wasn't merged yet. Just trying to get a
confirmation on the version that it did land in.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Sep 10, 2015 at 12:45 PM, Lincoln Bryant  wrote:
> Hi Robert,
>
> I believe kernel versions 4.1 and beyond support straw2.
>
> —Lincoln
>
>> On Sep 10, 2015, at 1:43 PM, Robert LeBlanc  wrote:
>>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> Has straw2 landed in the kernel and if so which version?
>>
>> Thanks,
>> - 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.0.2
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJV8c9fCRDmVDuy+mK58QAAKOoP/ibMriwPzqlY0ow1N36V
>> OX1wg+6r3nQRyGglvKVi9cmpPrgnlTZxPVv0KRr8xocRBrPYI//hob6qEVWH
>> hvaUVg5PDbgQRGi4GNWP8oY0VR7rYxjQAys3c+Mo9LSs1ZmgygIxmuNSGR1w
>> g3BCHJjBnSvrQ+NzDuIsaSnxAWCQKIJgMSmlOa0Pieqq4lXJDTNAdRILDOMn
>> eAuJcXZqq2Ll8axQnl8ymIRvq9aZ/TQi+q0lqJ/wgAkO/coZm/18HmMa/VI0
>> 1/8rZTG0Jy4lgxny5VB1OjAZLMGnKfPyKs8bvQeksNBhMhIZVeFrZ5JHQC3f
>> 4VsmAnTtDxD7RSEhlVy66kBMmdOlU6PhlSWZQ0OmLgHotX8HC9TJAq2I18yJ
>> ggk4mNkpcZwTz4PagjeEtST8/s1OIEjX4e9lh5u9einFv6mCxUMWT7bQwzFd
>> SImx589rjXLyZjdDtXsPZxN1G2Qi4HnlgKnkC44mx4soypo2sDFFmtv6YeWJ
>> e0Nr8RvFmKhPPgc71R1po9ZTOMIh3aBfMehvsAueVE8AhBZl8lvQyatAqYES
>> S7dcuhVATS4gfkEv4XWR1MVhvLDYP3l/I1H32cp5mh43BCT/DpSHvyfr0lhb
>> dxBlfSY/GYLFMGxbG73DFZO3S9o85nz2vma90rsS6AGx/oJOsJYUnXKcvUzL
>> Qgep
>> =e4Wy
>> -END PGP SIGNATURE-
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV8dG2CRDmVDuy+mK58QAALXYP/0QYtBxc17tJmxVgUzwu
nnUpiM2iePAGkyG9FNVYOBW9TENWkLz56zFbJM3/aDYgTW8L0pqthVA4whCu
+pV0WxrQ7cDEZ0UWdZ+I8Ag12g5KWmpZyt000Uuxjx9PJQSKD5KxHEFw4dcn
A4BqQbKExtaz3KY8MVt3WpO+xsXBM64ImPkcYGUyOQ5tWzVvNbIOiLe0RIGW
JX00KnqdaVm9xXz+lBqkYKBGNto5z/Xu2wWK28FCFDfxe5Uw8Pd8JRq2/rp4
MhlTYOJazW9LbLW8mkzCaxscDMjeCTkGKPAlGXU6QkOp1ounxEpsT19nxPJw
IafYjGqYbASCBcCKjHWSZziEfA1PjlZhgs2DyXYFo9PaktW1vUtOJGkz+SMa
LkkXa8L3Y920v9iNe9syOC5/CKd2DTnfIsZjCw3Np9HL3REiulINEk4R2d1E
MLUmApTE60bBtzXKy2MmuzvX2IcE3TV0Oh+f9Nijr60Cd43dhBH7h0wKJmgh
CbJmZ23vxnJIUlhCd/+y+PSfahx0z4pSL7CVNLvJ1eQLBfLsrsTWLiRyF4iQ
k8KDaNdfBvOVmg6wU7Hzvxn1Z8sn0dEqxrEz6F6gcIpFyBq28jK1kpaaJ+/3
GgIEakW0CwOpkzT8viqJMsw8DUG+30meXWja8JOD0CPt3CyqrhhbORQPUgXQ
5g3M
=psm2
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Straw2 kernel version?

2015-09-10 Thread Lincoln Bryant
Hi Robert,

I believe kernel versions 4.1 and beyond support straw2.

—Lincoln

> On Sep 10, 2015, at 1:43 PM, Robert LeBlanc  wrote:
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> Has straw2 landed in the kernel and if so which version?
> 
> Thanks,
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.0.2
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJV8c9fCRDmVDuy+mK58QAAKOoP/ibMriwPzqlY0ow1N36V
> OX1wg+6r3nQRyGglvKVi9cmpPrgnlTZxPVv0KRr8xocRBrPYI//hob6qEVWH
> hvaUVg5PDbgQRGi4GNWP8oY0VR7rYxjQAys3c+Mo9LSs1ZmgygIxmuNSGR1w
> g3BCHJjBnSvrQ+NzDuIsaSnxAWCQKIJgMSmlOa0Pieqq4lXJDTNAdRILDOMn
> eAuJcXZqq2Ll8axQnl8ymIRvq9aZ/TQi+q0lqJ/wgAkO/coZm/18HmMa/VI0
> 1/8rZTG0Jy4lgxny5VB1OjAZLMGnKfPyKs8bvQeksNBhMhIZVeFrZ5JHQC3f
> 4VsmAnTtDxD7RSEhlVy66kBMmdOlU6PhlSWZQ0OmLgHotX8HC9TJAq2I18yJ
> ggk4mNkpcZwTz4PagjeEtST8/s1OIEjX4e9lh5u9einFv6mCxUMWT7bQwzFd
> SImx589rjXLyZjdDtXsPZxN1G2Qi4HnlgKnkC44mx4soypo2sDFFmtv6YeWJ
> e0Nr8RvFmKhPPgc71R1po9ZTOMIh3aBfMehvsAueVE8AhBZl8lvQyatAqYES
> S7dcuhVATS4gfkEv4XWR1MVhvLDYP3l/I1H32cp5mh43BCT/DpSHvyfr0lhb
> dxBlfSY/GYLFMGxbG73DFZO3S9o85nz2vma90rsS6AGx/oJOsJYUnXKcvUzL
> Qgep
> =e4Wy
> -END PGP SIGNATURE-
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Straw2 kernel version?

2015-09-10 Thread Josh Durgin

On 09/10/2015 11:53 AM, Robert LeBlanc wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

My notes show that it should have landed in 4.1, but I also have
written down that it wasn't merged yet. Just trying to get a
confirmation on the version that it did land in.


Yes, it landed in 4.1.

Josh
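
(A quick sketch for checking both sides before switching buckets to straw2 - the
file names are throwaway placeholders:)

$ uname -r                          # on each kernel client: 4.1 or newer for straw2
$ ceph osd getcrushmap -o crush.bin
$ crushtool -d crush.bin -o crush.txt
$ grep 'alg straw' crush.txt        # shows which buckets use straw vs straw2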


- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Sep 10, 2015 at 12:45 PM, Lincoln Bryant  wrote:

Hi Robert,

I believe kernel versions 4.1 and beyond support straw2.

—Lincoln


On Sep 10, 2015, at 1:43 PM, Robert LeBlanc  wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Has straw2 landed in the kernel and if so which version?

Thanks,
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV8c9fCRDmVDuy+mK58QAAKOoP/ibMriwPzqlY0ow1N36V
OX1wg+6r3nQRyGglvKVi9cmpPrgnlTZxPVv0KRr8xocRBrPYI//hob6qEVWH
hvaUVg5PDbgQRGi4GNWP8oY0VR7rYxjQAys3c+Mo9LSs1ZmgygIxmuNSGR1w
g3BCHJjBnSvrQ+NzDuIsaSnxAWCQKIJgMSmlOa0Pieqq4lXJDTNAdRILDOMn
eAuJcXZqq2Ll8axQnl8ymIRvq9aZ/TQi+q0lqJ/wgAkO/coZm/18HmMa/VI0
1/8rZTG0Jy4lgxny5VB1OjAZLMGnKfPyKs8bvQeksNBhMhIZVeFrZ5JHQC3f
4VsmAnTtDxD7RSEhlVy66kBMmdOlU6PhlSWZQ0OmLgHotX8HC9TJAq2I18yJ
ggk4mNkpcZwTz4PagjeEtST8/s1OIEjX4e9lh5u9einFv6mCxUMWT7bQwzFd
SImx589rjXLyZjdDtXsPZxN1G2Qi4HnlgKnkC44mx4soypo2sDFFmtv6YeWJ
e0Nr8RvFmKhPPgc71R1po9ZTOMIh3aBfMehvsAueVE8AhBZl8lvQyatAqYES
S7dcuhVATS4gfkEv4XWR1MVhvLDYP3l/I1H32cp5mh43BCT/DpSHvyfr0lhb
dxBlfSY/GYLFMGxbG73DFZO3S9o85nz2vma90rsS6AGx/oJOsJYUnXKcvUzL
Qgep
=e4Wy
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV8dG2CRDmVDuy+mK58QAALXYP/0QYtBxc17tJmxVgUzwu
nnUpiM2iePAGkyG9FNVYOBW9TENWkLz56zFbJM3/aDYgTW8L0pqthVA4whCu
+pV0WxrQ7cDEZ0UWdZ+I8Ag12g5KWmpZyt000Uuxjx9PJQSKD5KxHEFw4dcn
A4BqQbKExtaz3KY8MVt3WpO+xsXBM64ImPkcYGUyOQ5tWzVvNbIOiLe0RIGW
JX00KnqdaVm9xXz+lBqkYKBGNto5z/Xu2wWK28FCFDfxe5Uw8Pd8JRq2/rp4
MhlTYOJazW9LbLW8mkzCaxscDMjeCTkGKPAlGXU6QkOp1ounxEpsT19nxPJw
IafYjGqYbASCBcCKjHWSZziEfA1PjlZhgs2DyXYFo9PaktW1vUtOJGkz+SMa
LkkXa8L3Y920v9iNe9syOC5/CKd2DTnfIsZjCw3Np9HL3REiulINEk4R2d1E
MLUmApTE60bBtzXKy2MmuzvX2IcE3TV0Oh+f9Nijr60Cd43dhBH7h0wKJmgh
CbJmZ23vxnJIUlhCd/+y+PSfahx0z4pSL7CVNLvJ1eQLBfLsrsTWLiRyF4iQ
k8KDaNdfBvOVmg6wU7Hzvxn1Z8sn0dEqxrEz6F6gcIpFyBq28jK1kpaaJ+/3
GgIEakW0CwOpkzT8viqJMsw8DUG+30meXWja8JOD0CPt3CyqrhhbORQPUgXQ
5g3M
=psm2
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Straw2 kernel version?

2015-09-10 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Has straw2 landed in the kernel and if so which version?

Thanks,
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV8c9fCRDmVDuy+mK58QAAKOoP/ibMriwPzqlY0ow1N36V
OX1wg+6r3nQRyGglvKVi9cmpPrgnlTZxPVv0KRr8xocRBrPYI//hob6qEVWH
hvaUVg5PDbgQRGi4GNWP8oY0VR7rYxjQAys3c+Mo9LSs1ZmgygIxmuNSGR1w
g3BCHJjBnSvrQ+NzDuIsaSnxAWCQKIJgMSmlOa0Pieqq4lXJDTNAdRILDOMn
eAuJcXZqq2Ll8axQnl8ymIRvq9aZ/TQi+q0lqJ/wgAkO/coZm/18HmMa/VI0
1/8rZTG0Jy4lgxny5VB1OjAZLMGnKfPyKs8bvQeksNBhMhIZVeFrZ5JHQC3f
4VsmAnTtDxD7RSEhlVy66kBMmdOlU6PhlSWZQ0OmLgHotX8HC9TJAq2I18yJ
ggk4mNkpcZwTz4PagjeEtST8/s1OIEjX4e9lh5u9einFv6mCxUMWT7bQwzFd
SImx589rjXLyZjdDtXsPZxN1G2Qi4HnlgKnkC44mx4soypo2sDFFmtv6YeWJ
e0Nr8RvFmKhPPgc71R1po9ZTOMIh3aBfMehvsAueVE8AhBZl8lvQyatAqYES
S7dcuhVATS4gfkEv4XWR1MVhvLDYP3l/I1H32cp5mh43BCT/DpSHvyfr0lhb
dxBlfSY/GYLFMGxbG73DFZO3S9o85nz2vma90rsS6AGx/oJOsJYUnXKcvUzL
Qgep
=e4Wy
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Christian Balzer

Hello,

On Thu, 10 Sep 2015 16:16:10 -0600 Robert LeBlanc wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> Do the recovery options kick in when there is only backfill going on?
>
Aside from having these set just in case as your cluster (and one of mine)
is clearly at the limits of its abilities, that's a good question.

Recovery and backfill are a bit blurry and, looking at my logs from yesterday
while testing ways to ease new OSDs into my test cluster, they clearly can
happen at the same time.

It would be nice if somebody in the know aka Devs would pipe up here.

What happens in the following scenarios?

1. OSD fails, is set out, etc. PGs get moved around. -> Recovery
2. Same OSD is brought back in. PGs move to their original OSDs. Recovery
or backfill?
3. New bucket (host or OSD) is added to the crush map, causing minor PG
reshuffles. Recovery or backfill?
4. The same OSD added in 3 is set "in", started. Backfill, one would
assume.

But this is a log entry from a situation like 4:
---
2015-09-10 15:53:30.084063 mon.0 203.216.0.33:6789/0 6254 : [INF] pgmap v791755: 896 pgs: 45 active+remapped+wait_backfill, 2 active+remapped+backfilling, 10 active+recovery_wait, 839 active+clean; 69546 MB data, 303 GB used, 5323 GB / 5665 GB avail; 2925/54958 objects degraded (5.322%); 15638 kB/s, 3 objects/s recovering
---

I read that as both backfilling and recovery going on at the same time.
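
For reference, a quick way to see which PGs are in which of these states at any
given moment (a sketch using only the standard CLI, nothing cluster-specific assumed):

# list PGs currently backfilling, waiting to backfill, or recovering
ceph pg dump pgs_brief 2>/dev/null | egrep 'backfill|recover'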

Christian
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Thu, Sep 10, 2015 at 3:01 PM, Somnath Roy  wrote:
> > Try all these..
> >
> > osd recovery max active = 1
> > osd max backfills = 1
> > osd recovery threads = 1
> > osd recovery op priority = 1
> >
> > Thanks & Regards
> > Somnath
> >
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Robert LeBlanc Sent: Thursday, September 10, 2015 1:56 PM
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] Hammer reduce recovery impact
> >
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA256
> >
> > We are trying to add some additional OSDs to our cluster, but the
> > impact of the backfilling has been very disruptive to client I/O and
> > we have been trying to figure out how to reduce the impact. We have
> > seen some client I/O blocked for more than 60 seconds. There has been
> > CPU and RAM head room on the OSD nodes, network has been fine, disks
> > have been busy, but not terrible.
> >
> > 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals
> > (10GB), dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta
> > S51G-1UL.
> >
> > Clients are QEMU VMs.
> >
> > [ulhglive-root@ceph5 current]# ceph --version ceph version 0.94.2
> > (5fb85614ca8f354284c713a2f9c610860720bbf3)
> >
> > Some nodes are 0.94.3
> >
> > [ulhglive-root@ceph5 current]# ceph status
> > cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
> >  health HEALTH_WARN
> > 3 pgs backfill
> > 1 pgs backfilling
> > 4 pgs stuck unclean
> > recovery 2382/33044847 objects degraded (0.007%)
> > recovery 50872/33044847 objects misplaced (0.154%)
> > noscrub,nodeep-scrub flag(s) set
> >  monmap e2: 3 mons at
> > {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> > election epoch 180, quorum 0,1,2 mon1,mon2,mon3
> >  osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> > flags noscrub,nodeep-scrub
> >   pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> > 128 TB used, 322 TB / 450 TB avail
> > 2382/33044847 objects degraded (0.007%)
> > 50872/33044847 objects misplaced (0.154%)
> > 2300 active+clean
> >3 active+remapped+wait_backfill
> >1 active+remapped+backfilling recovery io 70401
> > kB/s, 16 objects/s client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
> >
> > Each pool is size 4 with min_size 2.
> >
> > One problem we have is that the requirements of the cluster changed
> > after setting up our pools, so our PGs are really out of whack. Our
> > most active pool has only 256 PGs and each PG is about 120 GB in size.
> > We are trying to clear out a pool that has way too many PGs so that we
> > can split the PGs in that pool. I think these large PGs are part of our
> > issues.
> >
> > Things I've tried:
> >
> > * Lowered nr_requests on the spindles from 1000 to 100. This reduced
> > the max latency sometimes up to 3000 ms down to a max of 500-700 ms.
> > it has also reduced the huge swings in  latency, but has also reduced
> > throughput somewhat.
> > * Changed the scheduler from deadline to CFQ. I'm not sure if the
> > OSD process gives the recovery threads a different disk priority or if
> > changing the scheduler without restarting the OSD allows the OSD to
> > use disk priorities.
> > * 

Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Somnath Roy
I am not an expert on that, but these settings will probably help backfill
go more slowly and thus cause less degradation of client IO. You may want to try them.
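
A sketch of applying these at runtime without restarting the OSDs (assuming you
run it from a node with an admin keyring; note that osd recovery threads
generally only takes effect after an OSD restart):

# inject the throttling values into all running OSDs
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
# add the same values to the [osd] section of ceph.conf to keep them across restarts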

Thanks & Regards
Somnath

-Original Message-
From: Robert LeBlanc [mailto:rob...@leblancnet.us] 
Sent: Thursday, September 10, 2015 3:16 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hammer reduce recovery impact

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Do the recovery options kick in when there is only backfill going on?
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Sep 10, 2015 at 3:01 PM, Somnath Roy  wrote:
> Try all these..
>
> osd recovery max active = 1
> osd max backfills = 1
> osd recovery threads = 1
> osd recovery op priority = 1
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Robert LeBlanc
> Sent: Thursday, September 10, 2015 1:56 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Hammer reduce recovery impact
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We are trying to add some additional OSDs to our cluster, but the impact of 
> the backfilling has been very disruptive to client I/O and we have been 
> trying to figure out how to reduce the impact. We have seen some client I/O 
> blocked for more than 60 seconds. There has been CPU and RAM head room on the 
> OSD nodes, network has been fine, disks have been busy, but not terrible.
>
> 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals (10GB), 
> dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta S51G-1UL.
>
> Clients are QEMU VMs.
>
> [ulhglive-root@ceph5 current]# ceph --version ceph version 0.94.2 
> (5fb85614ca8f354284c713a2f9c610860720bbf3)
>
> Some nodes are 0.94.3
>
> [ulhglive-root@ceph5 current]# ceph status
> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>  health HEALTH_WARN
> 3 pgs backfill
> 1 pgs backfilling
> 4 pgs stuck unclean
> recovery 2382/33044847 objects degraded (0.007%)
> recovery 50872/33044847 objects misplaced (0.154%)
> noscrub,nodeep-scrub flag(s) set
>  monmap e2: 3 mons at
> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> election epoch 180, quorum 0,1,2 mon1,mon2,mon3
>  osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> flags noscrub,nodeep-scrub
>   pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> 128 TB used, 322 TB / 450 TB avail
> 2382/33044847 objects degraded (0.007%)
> 50872/33044847 objects misplaced (0.154%)
> 2300 active+clean
>3 active+remapped+wait_backfill
>1 active+remapped+backfilling recovery io 70401 kB/s, 16 
> objects/s
>   client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
>
> Each pool is size 4 with min_size 2.
>
> One problem we have is that the requirements of the cluster changed after 
> setting up our pools, so our PGs are really out of whack. Our most active pool
> has only 256 PGs and each PG is about 120 GB in size.
> We are trying to clear out a pool that has way too many PGs so that we can
> split the PGs in that pool. I think these large PGs are part of our issues.
>
> Things I've tried:
>
> * Lowered nr_requests on the spindles from 1000 to 100. This reduced the max 
> latency sometimes up to 3000 ms down to a max of 500-700 ms.
> it has also reduced the huge swings in  latency, but has also reduced 
> throughput somewhat.
> * Changed the scheduler from deadline to CFQ. I'm not sure if the OSD 
> process gives the recovery threads a different disk priority or if changing 
> the scheduler without restarting the OSD allows the OSD to use disk 
> priorities.
> * Reduced the number of osd_max_backfills from 2 to 1.
> * Tried setting noin to give the new OSDs time to get the PG map and peer 
> before starting the backfill. This caused more problems than it solved, as we had 
> blocked I/O (over 200 seconds) until we set the new OSDs to in.
>
> Even adding one OSD disk into the cluster is causing these slow I/O messages. 
> We still have 5 more disks to add from this server and four more servers to 
> add.
>
> In addition to trying to minimize these impacts, would it be better to split 
> the PGs then add the rest of the servers, or add the servers then do the PG 
> split. I'm thinking splitting first would be better, but I'd like to get 
> other opinions.
>
> No spindle stays at high utilization for long and the await drops below 20 ms 
> usually within 10 seconds so I/O should be serviced "pretty quick". My next 
> guess is that the journals are getting full and blocking while waiting for 
> flushes, but I'm not exactly sure how to identify that. We are using the 
> defaults for the journal except for size (10G). We'd like to have 

Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Lincoln Bryant

On 9/10/2015 5:39 PM, Lionel Bouton wrote:
For example deep-scrubs were a problem on our installation when at 
times there were several going on. We implemented a scheduler that 
enforces limits on simultaneous deep-scrubs and these problems are gone.


Hi Lionel,

Out of curiosity, how many was "several" in your case?

Cheers,
Lincoln
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Lionel Bouton
Le 10/09/2015 22:56, Robert LeBlanc a écrit :
> We are trying to add some additional OSDs to our cluster, but the
> impact of the backfilling has been very disruptive to client I/O and
> we have been trying to figure out how to reduce the impact. We have
> seen some client I/O blocked for more than 60 seconds. There has been
> CPU and RAM head room on the OSD nodes, network has been fine, disks
> have been busy, but not terrible.

It seems you've already exhausted most of the ways I know. When
confronted with this situation, I used a simple script to throttle
backfills (freezing them, then re-enabling them). This helped our VMs at
the time, but you must be prepared for very long migrations and some
experimentation with different schedules. You simply pass it the
number of seconds backfills are allowed to proceed, then the number of
seconds during which they pause.

Here's the script, which should be self-explanatory:
http://pastebin.com/sy7h1VEy

something like :

./throttler 10 120

limited the impact on our VMs (the idea being that during the 10s the
backfill won't be able to trigger filestore syncs and the 120s pause
will allow the filestore syncs to remove "dirty" data from the journals
without interfering too much with concurrent writes).
I believe you must have a high filestore sync value to hope to benefit
from this (we use 30s).
At the very least the long pause will eventually allow VMs to move data
to disk regularly instead of being nearly frozen.
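
For anyone who cannot reach the pastebin, here is a minimal sketch of the same
idea; it simply alternates osd_max_backfills between 1 and 0 via injectargs, so
it is an approximation rather than the exact original script:

#!/bin/bash
# usage: ./throttler RUN_SECONDS PAUSE_SECONDS
run=$1
pause=$2
while true; do
    # let backfills proceed for $run seconds
    ceph tell osd.* injectargs '--osd-max-backfills 1'
    sleep "$run"
    # freeze backfills for $pause seconds so filestore syncs can drain the journals
    ceph tell osd.* injectargs '--osd-max-backfills 0'
    sleep "$pause"
done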

Note that your PGs are more than 10G each; if the OSDs can't stop a
backfill before finishing the transfer of the current PG, this won't help (I
assume backfills go through the journals, which probably won't be able to
act as write-back caches anymore, as one PG will be enough to fill them up).

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Do the recovery options kick in when there is only backfill going on?
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Sep 10, 2015 at 3:01 PM, Somnath Roy  wrote:
> Try all these..
>
> osd recovery max active = 1
> osd max backfills = 1
> osd recovery threads = 1
> osd recovery op priority = 1
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Robert LeBlanc
> Sent: Thursday, September 10, 2015 1:56 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Hammer reduce recovery impact
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We are trying to add some additional OSDs to our cluster, but the impact of 
> the backfilling has been very disruptive to client I/O and we have been 
> trying to figure out how to reduce the impact. We have seen some client I/O 
> blocked for more than 60 seconds. There has been CPU and RAM head room on the 
> OSD nodes, network has been fine, disks have been busy, but not terrible.
>
> 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals (10GB), 
> dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta S51G-1UL.
>
> Clients are QEMU VMs.
>
> [ulhglive-root@ceph5 current]# ceph --version ceph version 0.94.2 
> (5fb85614ca8f354284c713a2f9c610860720bbf3)
>
> Some nodes are 0.94.3
>
> [ulhglive-root@ceph5 current]# ceph status
> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>  health HEALTH_WARN
> 3 pgs backfill
> 1 pgs backfilling
> 4 pgs stuck unclean
> recovery 2382/33044847 objects degraded (0.007%)
> recovery 50872/33044847 objects misplaced (0.154%)
> noscrub,nodeep-scrub flag(s) set
>  monmap e2: 3 mons at
> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> election epoch 180, quorum 0,1,2 mon1,mon2,mon3
>  osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> flags noscrub,nodeep-scrub
>   pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> 128 TB used, 322 TB / 450 TB avail
> 2382/33044847 objects degraded (0.007%)
> 50872/33044847 objects misplaced (0.154%)
> 2300 active+clean
>3 active+remapped+wait_backfill
>1 active+remapped+backfilling recovery io 70401 kB/s, 16 
> objects/s
>   client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
>
> Each pool is size 4 with min_size 2.
>
> One problem we have is that the requirements of the cluster changed after 
> setting up our pools, so our PGs are really out of whack. Our most active pool
> has only 256 PGs and each PG is about 120 GB in size.
> We are trying to clear out a pool that has way too many PGs so that we can
> split the PGs in that pool. I think these large PGs are part of our issues.
>
> Things I've tried:
>
> * Lowered nr_requests on the spindles from 1000 to 100. This reduced the max 
> latency sometimes up to 3000 ms down to a max of 500-700 ms.
> it has also reduced the huge swings in  latency, but has also reduced 
> throughput somewhat.
> * Changed the scheduler from deadline to CFQ. I'm not sure if the OSD 
> process gives the recovery threads a different disk priority or if changing 
> the scheduler without restarting the OSD allows the OSD to use disk 
> priorities.
> * Reduced the number of osd_max_backfills from 2 to 1.
> * Tried setting noin to give the new OSDs time to get the PG map and peer 
> before starting the backfill. This caused more problems than it solved, as we had 
> blocked I/O (over 200 seconds) until we set the new OSDs to in.
>
> Even adding one OSD disk into the cluster is causing these slow I/O messages. 
> We still have 5 more disks to add from this server and four more servers to 
> add.
>
> In addition to trying to minimize these impacts, would it be better to split 
> the PGs then add the rest of the servers, or add the servers then do the PG 
> split. I'm thinking splitting first would be better, but I'd like to get 
> other opinions.
>
> No spindle stays at high utilization for long and the await drops below 20 ms 
> usually within 10 seconds so I/O should be serviced "pretty quick". My next 
> guess is that the journals are getting full and blocking while waiting for 
> flushes, but I'm not exactly sure how to identify that. We are using the 
> defaults for the journal except for size (10G). We'd like to have journals 
> large to handle bursts, but if they are getting filled with backfill traffic, 
> it may be counter productive. Can/does backfill/recovery bypass the journal?
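
One way to check whether the journals are the choke point is to watch the
filestore journal counters on a busy OSD while backfill runs (a sketch via the
admin socket; osd.12 is a placeholder id):

# journal queue depth, latency and "journal full" events
ceph daemon osd.12 perf dump | python -m json.tool | grep -E 'journal_queue|journal_latency|journal_full'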
>
> Thanks,
>
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1 -BEGIN 
> PGP SIGNATURE-
> Version: Mailvelope v1.0.2
> Comment: https://www.mailvelope.com
>
> 

Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I don't think the script will help our situation, as it is just setting
osd_max_backfill from 1 to 0. It looks like that change doesn't go
into effect until after the current PG finishes. It would be nice if
backfill/recovery could skip the journal, but there would have to be
some logic in case the object was changed while it was being replicated. Maybe
just log in the journal when an object starts and finishes restore, so
the journal flush knows whether it needs to commit the write?
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Sep 10, 2015 at 3:33 PM, Lionel Bouton  wrote:
> Le 10/09/2015 22:56, Robert LeBlanc a écrit :
>> We are trying to add some additional OSDs to our cluster, but the
>> impact of the backfilling has been very disruptive to client I/O and
>> we have been trying to figure out how to reduce the impact. We have
>> seen some client I/O blocked for more than 60 seconds. There has been
>> CPU and RAM head room on the OSD nodes, network has been fine, disks
>> have been busy, but not terrible.
>
> It seems you've already exhausted most of the ways I know. When
> confronted with this situation, I used a simple script to throttle
> backfills (freezing them, then re-enabling them). This helped our VMs at
> the time, but you must be prepared for very long migrations and some
> experimentation with different schedules. You simply pass it the
> number of seconds backfills are allowed to proceed, then the number of
> seconds during which they pause.
>
> Here's the script, which should be self-explanatory:
> http://pastebin.com/sy7h1VEy
>
> something like :
>
> ./throttler 10 120
>
> limited the impact on our VMs (the idea being that during the 10s the
> backfill won't be able to trigger filestore syncs and the 120s pause
> will allow the filestore syncs to remove "dirty" data from the journals
> without interfering too much with concurrent writes).
> I believe you must have a high filestore sync value to hope to benefit
> from this (we use 30s).
> At the very least the long pause will eventually allow VMs to move data
> to disk regularly instead of being nearly frozen.
>
> Note that your PGs are more than 10G each; if the OSDs can't stop a
> backfill before finishing the transfer of the current PG, this won't help (I
> assume backfills go through the journals, which probably won't be able to
> act as write-back caches anymore, as one PG will be enough to fill them up).
>
> Best regards,
>
> Lionel

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV8gIbCRDmVDuy+mK58QAAGOgQAMLGgbrgsHF2n9ZVGxol
4X1jezsXAjrPc19U38u8JLv1kVsSal6MBh+uSt1O6RnHWT+fMYOh1knPSYgl
aWvjYP9yJ+yVnWtuz5YxRI45WJ8XvJ8V7FPUYLRxSId7IX4EToupUf30AjdD
KZfjfLgpNKz98UMmFBRporTsvIX1cHGVtN7tiqhAtRPQYMhgXCA2pyqUFkhJ
H86287DZnnXrlDOsT7e+0Gel+eYKjUF7QsUYKCUMVx1Mj5oAm9gC0ZIm+icS
YIeUOzIO8LGV3YXHWmUQClzV9w0uQ7CBvvLoCBbFjvQOgQizsOUpgXv818Fr
Fp6ihpoNKDGaQ7lylLmT8Yu4Rf+JFQn3xfLBE0lPg41CkI8/MQIQsyYLlr5D
Pdd1msxy14Y1lvRbwsNnn+ICzvz/YhbuwtTSVFT+EnRSwc+fkRhKi1ipB1Zx
5zyvVI0ge8SRIelXYfueBmC/LCxjYp9ntfSSQujxlVejgUCxmG3HTd3TvBcn
SdyA7F5sQOpOSK+Hc/eRGwxYgWq4r/jd3TJQt6F2qRHi/nx2K4oFFv6r6SgT
zkDdZewlE+kVx8GkKnB4h1xI3DhGsIyPaS7rCSqy1DrMmxUSFFGgYto7umok
s5cpOeq35owbiv9Da8t3MCzoZvYfhuXCitWn+Jl69v5vfGHm6ha4A59mcigz
S9DN
=6xla
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Lionel Bouton
Le 11/09/2015 00:20, Robert LeBlanc a écrit :
> I don't think the script will help our situation as it is just setting
> osd_max_backfill from 1 to 0. It looks like that change doesn't go
> into effect until after it finishes the PG.

That was what I was afraid of. Note that it should help a little anyway
(if not, that's worrying: setting backfills to 0 completely should solve
your clients' IO problems in a matter of minutes).
You may have better results by allowing backfills on only a few of your
OSDs at a time. For example, deep-scrubs were a problem on our
installation when at times there were several going on. We implemented a
scheduler that enforces limits on simultaneous deep-scrubs and these
problems are gone.
That's a last resort and rough around the edges but if every other means
of reducing the impact on your clients has failed, that's the best you
can hope for.
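
Not our actual scheduler, but a rough sketch of the idea: cap concurrent
deep-scrubs by toggling the cluster-wide nodeep-scrub flag (the threshold and
polling interval here are arbitrary):

#!/bin/bash
max_deep_scrubs=2
while true; do
    active=$(ceph pg dump pgs_brief 2>/dev/null | grep -c 'scrubbing+deep')
    if [ "$active" -ge "$max_deep_scrubs" ]; then
        ceph osd set nodeep-scrub      # stop new deep-scrubs from starting
    else
        ceph osd unset nodeep-scrub    # allow them again
    fi
    sleep 60
done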

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Lionel Bouton
Le 11/09/2015 01:24, Lincoln Bryant a écrit :
> On 9/10/2015 5:39 PM, Lionel Bouton wrote:
>> For example deep-scrubs were a problem on our installation when at
>> times there were several going on. We implemented a scheduler that
>> enforces limits on simultaneous deep-scrubs and these problems are gone.
>
> Hi Lionel,
>
> Out of curiosity, how many was "several" in your case?

I had to issue ceph osd set nodeep-scrub several times with 3 or 4
concurrent deep-scrubs to avoid processes blocked in D state on VMs and
I could see the VM loads start rising with only 2. At the time I had
only 3 or 4 servers with 18 or 24 OSDs on Firefly. Obviously the more
servers and OSDs you have the more simultaneous deep scrubs you can handle.

One PG is ~5GB on our installation and it was probably ~4GB at the time.
As deep scrubs must read data on all replicas, with size=3, having 3 or 4
concurrent on only 3 or 4 servers means reading anywhere between 10 and
20G from disks on each server (and I don't think the OSDs are trying to
bypass the kernel cache).

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on cephfs recovery tools

2015-09-10 Thread Shinobu Kinjo
>> c./ After recovering the cluster, I though I was in a cephfs situation where
>> I had
>> c.1 files with holes (because of lost PGs and objects in the data pool)
>> c.2 files without metadata (because of lost PGs and objects in the
>> metadata pool)
>
> What does "files without metadata" mean?  Do you mean their objects
> were in the data pool but they didn't appear in your filesystem mount?
>
>> c.3 metadata without associated files (because of lost PGs and objects
>> in the data pool)
>
> So you mean you had files with the expected size but zero data, right?
>
>> I've tried to run the recovery tools, but I have several doubts which I did
>> not found described in the documentation
>> - Is there a specific order / a way to run the tools for the c.1, c.2
>> and c.3 cases I mentioned?

I'm still trying to understand what you are trying to say in your
original message, but I have not been able to follow you yet.

Can you summarize it like this:

 1. What current status is.
  e.g: working but not as expected.

 2. What your thought (, guess or whatever) is about your cluster.
  e.g: broken metadata, data or whatever you're thinking now.
 
 3. What exactly you did, briefly (not bla bla bla...).

 4. What you really want to do (briefly)?

Otherwise there will be a bunch of back-and-forth messages.

Shinobu

- Original Message -
From: "John Spray" 
To: "Goncalo Borges" 
Cc: ceph-users@lists.ceph.com
Sent: Thursday, September 10, 2015 8:49:46 PM
Subject: Re: [ceph-users] Question on cephfs recovery tools

On Wed, Sep 9, 2015 at 2:31 AM, Goncalo Borges
 wrote:
> Dear Ceph / CephFS gurus...
>
> Bare a bit with me while I give you a bit of context. Questions will appear
> at the end.
>
> 1) I am currently running ceph 9.0.3 and I have install it  to test the
> cephfs recovery tools.
>
> 2) I've created a situation where I've deliberately (on purpose) lost some
> data and metadata (check annex 1 after the main email).

You're only *maybe* losing metadata here, as your procedure is
targeting OSDs that contain data, and just hoping that those OSDs also
contain some metadata.

>
> 3) I've stopped the mds, and waited to check how the cluster reacts. After
> some time, as expected, the cluster reports a ERROR state, with a lot of PGs
> degraded and stuck
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_ERR
> 174 pgs degraded
> 48 pgs stale
> 174 pgs stuck degraded
> 41 pgs stuck inactive
> 48 pgs stuck stale
> 238 pgs stuck unclean
> 174 pgs stuck undersized
> 174 pgs undersized
> recovery 22366/463263 objects degraded (4.828%)
> recovery 8190/463263 objects misplaced (1.768%)
> too many PGs per OSD (388 > max 300)
> mds rank 0 has failed
> mds cluster is degraded
>  monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
> election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>  mdsmap e24: 0/1/1 up, 1 failed
>  osdmap e544: 21 osds: 15 up, 15 in; 87 remapped pgs
>   pgmap v25699: 2048 pgs, 2 pools, 602 GB data, 150 kobjects
> 1715 GB used, 40027 GB / 41743 GB avail
> 22366/463263 objects degraded (4.828%)
> 8190/463263 objects misplaced (1.768%)
> 1799 active+clean
>  110 active+undersized+degraded
>   60 active+remapped
>   37 stale+undersized+degraded+peered
>   23 active+undersized+degraded+remapped
>   11 stale+active+clean
>4 undersized+degraded+peered
>4 active
>
> 4) I've unmounted the cephfs clients ('umount -l' worked for me this time but
> I already had situations where 'umount' would simply hang, and the only
> viable solution was to reboot the client).
>
> 5) I've recovered the ceph cluster by (details on the recovery operations are
> in annex 2 after the main email):
> - declaring the osds lost
> - removing the osds from the crush map
> - letting the cluster stabilize and letting all the recover I/O finish
> - identifying stuck PGs
> - checking if they existed, and if not recreate them.
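
For reference, a sketch of the commands such a procedure usually maps to on a
Hammer/Infernalis-era cluster (osd id 7 and the pg id are placeholders, not
values taken from this cluster):

# declare the dead OSD lost and remove it from the cluster and crush map
ceph osd lost 7 --yes-i-really-mean-it
ceph osd crush remove osd.7
ceph auth del osd.7
ceph osd rm 7

# once recovery I/O settles, look for PGs that are still stuck
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean

# recreate a PG whose data is gone for good
ceph pg force_create_pg 1.2a3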
>
>
> 6) I've restarted the MDS. Initially, the mds cluster was considered
> degraded but after some small amount of time, that message disappeared. The
> WARNING status was just because of "too many PGs per OSD (409 > max 300)"
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_WARN
> too many PGs per OSD (409 > max 300)
> mds cluster is degraded
>  monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
> election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>  mdsmap e27: 1/1/1 up {0=rccephmds=up:reconnect}
>  osdmap e614: 15 

[ceph-users] ceph shows health_ok but cluster completely jacked up

2015-09-10 Thread Xu (Simon) Chen
Hi all,

I am using ceph 0.94.1. Recently, I ran into a somewhat serious issue.
"ceph -s" reports everything ok, all PGs active+clean, no blocked
requests, etc. However, everything on top (the VMs' rbd disks) is
completely jacked up. VM dmesg was reporting blocked IO requests, and
reboots would just get stuck. I had to reboot my entire ceph cluster before
things started to clear up.

That said, now I remember seeing non-zero blocked requests in "ceph
report" before - I wonder why that didn't get reported by "ceph -s".

Any idea why this happened?
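
In case it helps others hitting this, a sketch of where blocked requests do
show up (assuming admin-socket access on the OSD hosts; osd.3 is a placeholder):

# cluster-wide: slow/blocked request warnings, when present
ceph health detail
# per OSD: requests currently in flight and the slowest recent ones
ceph daemon osd.3 dump_ops_in_flight
ceph daemon osd.3 dump_historic_ops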

Thanks!
-Simon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to use query string of s3 Restful API to use RADOSGW

2015-09-10 Thread Fulin Sun
Hi, ceph experts

Newbie here. Just want to try the Ceph object gateway and use the S3 RESTful API for 
some performance tests. 

We had configured and started radosgw according to this : 
http://ceph.com/docs/master/radosgw/config/ 

And we had successfully run the python test for S3 access. 

Question is: how can we use a URL that passes the authentication information in the 
query string when accessing the S3 connection? We had referred to

the following doc but had no idea how to proceed. 

http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html 

Can any experts shed a little light on this ? 

Best,
Sun.





CertusNet

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] bad perf for librbd vs krbd using FIO

2015-09-10 Thread Rafael Lopez
Hi all,

I am seeing a big discrepancy between librbd and kRBD/ext4 performance
using FIO with single RBD image. RBD images are coming from same RBD pool,
same size and settings for both. The librbd results are quite bad by
comparison, and in addition, if I scale up the kRBD FIO job with more
jobs/threads it increases up to 3-4x the results below, but librbd doesn't seem
to scale much at all. I figured that it should be close to the kRBD result
for a single job/thread before parallelism comes into play though. RBD
cache settings are all default.

I can see some obvious differences in FIO output, but not being well versed
with FIO I'm not sure what to make of it or where to start diagnosing the
discrepancy. Hunted around but haven't found anything useful, any
suggestions/insights would be appreciated.

RBD cache settings:
[root@rcmktdc1r72-09-ac rafaell]# ceph --admin-daemon
/var/run/ceph/ceph-osd.659.asok config show | grep rbd_cache
"rbd_cache": "true",
"rbd_cache_writethrough_until_flush": "true",
"rbd_cache_size": "33554432",
"rbd_cache_max_dirty": "25165824",
"rbd_cache_target_dirty": "16777216",
"rbd_cache_max_dirty_age": "1",
"rbd_cache_max_dirty_object": "0",
"rbd_cache_block_writes_upfront": "false",
[root@rcmktdc1r72-09-ac rafaell]#

This is the FIO job file for the kRBD job:

[root@rcprsdc1r72-01-ac rafaell]# cat ext4_test
; -- start job file --
[global]
rw=rw
size=100g
filename=/mnt/rbd/fio_test_file_ext4
rwmixread=0
rwmixwrite=100
percentage_random=0
bs=1024k
direct=0
iodepth=16
thread=1
numjobs=1
[job1]
; -- end job file --

[root@rcprsdc1r72-01-ac rafaell]#

This is the FIO job file for the librbd job:

[root@rcprsdc1r72-01-ac rafaell]# cat fio_rbd_test
; -- start job file --
[global]
rw=rw
size=100g
rwmixread=0
rwmixwrite=100
percentage_random=0
bs=1024k
direct=0
iodepth=16
thread=1
numjobs=1
ioengine=rbd
rbdname=nas1-rds-stg31
pool=rbd
[job1]
; -- end job file --


Here are the results:

[root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=sync, iodepth=16
fio-2.2.8
Starting 1 thread
job1: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/321.7MB/0KB /s] [0/321/0 iops]
[eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=37981: Fri Sep 11 12:33:13 2015
  write: io=102400MB, bw=399741KB/s, iops=390, runt=262314msec
clat (usec): min=411, max=574082, avg=2492.91, stdev=7316.96
 lat (usec): min=418, max=574113, avg=2520.12, stdev=7318.53
clat percentiles (usec):
 |  1.00th=[  446],  5.00th=[  458], 10.00th=[  474], 20.00th=[  510],
 | 30.00th=[ 1064], 40.00th=[ 1096], 50.00th=[ 1160], 60.00th=[ 1320],
 | 70.00th=[ 1592], 80.00th=[ 2448], 90.00th=[ 7712], 95.00th=[ 7904],
 | 99.00th=[11072], 99.50th=[11712], 99.90th=[13120], 99.95th=[73216],
 | 99.99th=[464896]
bw (KB  /s): min=  264, max=2156544, per=100.00%, avg=412986.27,
stdev=375092.66
lat (usec) : 500=18.68%, 750=7.43%, 1000=2.11%
lat (msec) : 2=48.89%, 4=4.35%, 10=16.79%, 20=1.67%, 50=0.03%
lat (msec) : 100=0.03%, 250=0.02%, 500=0.01%, 750=0.01%
  cpu  : usr=1.24%, sys=45.38%, ctx=19298, majf=0, minf=974
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 issued: total=r=0/w=102400/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400MB, aggrb=399740KB/s, minb=399740KB/s, maxb=399740KB/s,
mint=262314msec, maxt=262314msec

Disk stats (read/write):
  rbd0: ios=0/150890, merge=0/49, ticks=0/36117700, in_queue=36145277,
util=96.97%
[root@rcprsdc1r72-01-ac rafaell]#

[root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=rbd, iodepth=16
fio-2.2.8
Starting 1 thread
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/65405KB/0KB /s] [0/63/0 iops] [eta
00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=43960: Fri Sep 11 12:54:25 2015
  write: io=102400MB, bw=121882KB/s, iops=119, runt=860318msec
slat (usec): min=355, max=7300, avg=908.97, stdev=361.02
clat (msec): min=11, max=1468, avg=129.59, stdev=130.68
 lat (msec): min=12, max=1468, avg=130.50, stdev=130.69
clat percentiles (msec):
 |  1.00th=[   21],  5.00th=[   26], 10.00th=[   29], 20.00th=[   34],
 | 30.00th=[   37], 40.00th=[   40], 50.00th=[   44], 60.00th=[   63],
 | 70.00th=[  233], 80.00th=[  241], 90.00th=[  269], 95.00th=[  367],
 | 99.00th=[  553], 99.50th=[  652], 99.90th=[  832], 99.95th=[  848],
 | 99.99th=[ 1369]
bw (KB  /s): min=20363, max=248543, per=100.00%, avg=124381.19,
stdev=42313.29
lat (msec) : 20=0.95%, 50=55.27%, 100=5.55%, 250=24.83%, 500=12.28%
lat (msec) : 750=0.89%, 1000=0.21%, 2000=0.01%
  cpu  

Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-10 Thread Rafael Lopez
Ok I ran the two tests again with direct=1, smaller block size (4k) and
smaller total io (100m), disabled cache at ceph.conf side on client by
adding:

[client]
rbd cache = false
rbd cache max dirty = 0
rbd cache size = 0
rbd cache target dirty = 0


The result seems to have swapped around, now the librbd job is running ~50%
faster than the krbd job!

### krbd job:

[root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=16
fio-2.2.8
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/571KB/0KB /s] [0/142/0 iops] [eta
00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=29095: Fri Sep 11 14:48:21 2015
  write: io=102400KB, bw=647137B/s, iops=157, runt=162033msec
clat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
 lat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
clat percentiles (usec):
 |  1.00th=[ 2896],  5.00th=[ 4320], 10.00th=[ 4768], 20.00th=[ 5536],
 | 30.00th=[ 5920], 40.00th=[ 6176], 50.00th=[ 6432], 60.00th=[ 6624],
 | 70.00th=[ 6816], 80.00th=[ 7136], 90.00th=[ 7584], 95.00th=[ 7968],
 | 99.00th=[ 9024], 99.50th=[ 9664], 99.90th=[15808], 99.95th=[17536],
 | 99.99th=[19328]
bw (KB  /s): min=  506, max= 1171, per=100.00%, avg=632.22, stdev=104.77
lat (msec) : 4=2.88%, 10=96.69%, 20=0.43%, 50=0.01%
  cpu  : usr=0.17%, sys=0.71%, ctx=25634, majf=0, minf=35
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 issued: total=r=0/w=25600/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=631KB/s, minb=631KB/s, maxb=631KB/s,
mint=162033msec, maxt=162033msec

Disk stats (read/write):
  rbd0: ios=0/25638, merge=0/32, ticks=0/160765, in_queue=160745,
util=99.11%
[root@rcprsdc1r72-01-ac rafaell]#

## librb job:

[root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=16
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/703KB/0KB /s] [0/175/0 iops] [eta
00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=30568: Fri Sep 11 14:50:24 2015
  write: io=102400KB, bw=950141B/s, iops=231, runt=110360msec
slat (usec): min=70, max=992, avg=115.05, stdev=30.07
clat (msec): min=13, max=117, avg=67.91, stdev=24.93
 lat (msec): min=13, max=117, avg=68.03, stdev=24.93
clat percentiles (msec):
 |  1.00th=[   19],  5.00th=[   26], 10.00th=[   38], 20.00th=[   40],
 | 30.00th=[   46], 40.00th=[   62], 50.00th=[   77], 60.00th=[   85],
 | 70.00th=[   88], 80.00th=[   91], 90.00th=[   95], 95.00th=[   99],
 | 99.00th=[  105], 99.50th=[  110], 99.90th=[  116], 99.95th=[  117],
 | 99.99th=[  118]
bw (KB  /s): min=  565, max= 3174, per=100.00%, avg=935.74, stdev=407.67
lat (msec) : 20=2.41%, 50=29.85%, 100=64.46%, 250=3.29%
  cpu  : usr=2.43%, sys=0.29%, ctx=7847, majf=0, minf=2750
  IO depths: 1=6.2%, 2=12.5%, 4=25.0%, 8=50.0%, 16=6.2%, 32=0.0%,
>=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 complete  : 0=0.0%, 4=94.1%, 8=0.0%, 16=5.9%, 32=0.0%, 64=0.0%,
>=64=0.0%
 issued: total=r=0/w=25600/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=927KB/s, minb=927KB/s, maxb=927KB/s,
mint=110360msec, maxt=110360msec

Disk stats (read/write):
dm-1: ios=240/369, merge=0/0, ticks=742/40, in_queue=782, util=0.38%,
aggrios=240/379, aggrmerge=0/19, aggrticks=742/41, aggrin_queue=783,
aggrutil=0.39%
  sda: ios=240/379, merge=0/19, ticks=742/41, in_queue=783, util=0.39%
[root@rcprsdc1r72-01-ac rafaell]#



Confirmed speed (at least for krbd) using dd:
[root@rcprsdc1r72-01-ac rafaell]# dd if=/mnt/ssd/random100g
of=/mnt/rbd/dd_io_test bs=4k count=1 oflag=direct
1+0 records in
1+0 records out
4096 bytes (41 MB) copied, 64.9799 s, 630 kB/s
[root@rcprsdc1r72-01-ac rafaell]#


Back to FIO, the gap is even bigger for 1M block size (librbd gets about
100% better perf).
1M librbd:
Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=112641KB/s, minb=112641KB/s, maxb=112641KB/s,
mint=9309msec, maxt=9309msec

1M krbd:
Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=49939KB/s, minb=49939KB/s, maxb=49939KB/s,
mint=20997msec, maxt=20997msec

Raf

On 11 September 2015 at 14:33, Somnath Roy  wrote:

> Only changing client side ceph.conf and rerunning the tests is sufficient.
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> *From:* Rafael Lopez [mailto:rafael.lo...@monash.edu]
> *Sent:* Thursday, September 10, 2015 8:58 PM
> *To:* Somnath Roy
> *Cc:* 

Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-10 Thread Somnath Roy
That’s probably because the krbd version you are using doesn’t have the 
TCP_NODELAY patch. We have submitted it (and you can build it from the latest rbd 
source), but I am not sure when it will be in the Linux mainline.

Thanks & Regards
Somnath

From: Rafael Lopez [mailto:rafael.lo...@monash.edu]
Sent: Thursday, September 10, 2015 10:12 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO

Ok I ran the two tests again with direct=1, smaller block size (4k) and smaller 
total io (100m), disabled cache at ceph.conf side on client by adding:

[client]
rbd cache = false
rbd cache max dirty = 0
rbd cache size = 0
rbd cache target dirty = 0


The result seems to have swapped around, now the librbd job is running ~50% 
faster than the krbd job!

### krbd job:

[root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=16
fio-2.2.8
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/571KB/0KB /s] [0/142/0 iops] [eta 
00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=29095: Fri Sep 11 14:48:21 2015
  write: io=102400KB, bw=647137B/s, iops=157, runt=162033msec
clat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
 lat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
clat percentiles (usec):
 |  1.00th=[ 2896],  5.00th=[ 4320], 10.00th=[ 4768], 20.00th=[ 5536],
 | 30.00th=[ 5920], 40.00th=[ 6176], 50.00th=[ 6432], 60.00th=[ 6624],
 | 70.00th=[ 6816], 80.00th=[ 7136], 90.00th=[ 7584], 95.00th=[ 7968],
 | 99.00th=[ 9024], 99.50th=[ 9664], 99.90th=[15808], 99.95th=[17536],
 | 99.99th=[19328]
bw (KB  /s): min=  506, max= 1171, per=100.00%, avg=632.22, stdev=104.77
lat (msec) : 4=2.88%, 10=96.69%, 20=0.43%, 50=0.01%
  cpu  : usr=0.17%, sys=0.71%, ctx=25634, majf=0, minf=35
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=25600/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=631KB/s, minb=631KB/s, maxb=631KB/s, 
mint=162033msec, maxt=162033msec

Disk stats (read/write):
  rbd0: ios=0/25638, merge=0/32, ticks=0/160765, in_queue=160745, util=99.11%
[root@rcprsdc1r72-01-ac rafaell]#

## librb job:

[root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=16
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/703KB/0KB /s] [0/175/0 iops] [eta 
00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=30568: Fri Sep 11 14:50:24 2015
  write: io=102400KB, bw=950141B/s, iops=231, runt=110360msec
slat (usec): min=70, max=992, avg=115.05, stdev=30.07
clat (msec): min=13, max=117, avg=67.91, stdev=24.93
 lat (msec): min=13, max=117, avg=68.03, stdev=24.93
clat percentiles (msec):
 |  1.00th=[   19],  5.00th=[   26], 10.00th=[   38], 20.00th=[   40],
 | 30.00th=[   46], 40.00th=[   62], 50.00th=[   77], 60.00th=[   85],
 | 70.00th=[   88], 80.00th=[   91], 90.00th=[   95], 95.00th=[   99],
 | 99.00th=[  105], 99.50th=[  110], 99.90th=[  116], 99.95th=[  117],
 | 99.99th=[  118]
bw (KB  /s): min=  565, max= 3174, per=100.00%, avg=935.74, stdev=407.67
lat (msec) : 20=2.41%, 50=29.85%, 100=64.46%, 250=3.29%
  cpu  : usr=2.43%, sys=0.29%, ctx=7847, majf=0, minf=2750
  IO depths: 1=6.2%, 2=12.5%, 4=25.0%, 8=50.0%, 16=6.2%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=94.1%, 8=0.0%, 16=5.9%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=25600/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=927KB/s, minb=927KB/s, maxb=927KB/s, 
mint=110360msec, maxt=110360msec

Disk stats (read/write):
dm-1: ios=240/369, merge=0/0, ticks=742/40, in_queue=782, util=0.38%, 
aggrios=240/379, aggrmerge=0/19, aggrticks=742/41, aggrin_queue=783, 
aggrutil=0.39%
  sda: ios=240/379, merge=0/19, ticks=742/41, in_queue=783, util=0.39%
[root@rcprsdc1r72-01-ac rafaell]#



Confirmed speed (at least for krbd) using dd:
[root@rcprsdc1r72-01-ac rafaell]# dd if=/mnt/ssd/random100g 
of=/mnt/rbd/dd_io_test bs=4k count=1 oflag=direct
1+0 records in
1+0 records out
4096 bytes (41 MB) copied, 64.9799 s, 630 kB/s
[root@rcprsdc1r72-01-ac rafaell]#


Back to FIO, it's worse for 1M block size (librbd is about ~100% better perf).
1M librbd:
Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=112641KB/s, minb=112641KB/s, maxb=112641KB/s, 
mint=9309msec, maxt=9309msec

1M krbd:
Run status group 0 (all 

Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-10 Thread Christian Balzer

Hello,

On Fri, 11 Sep 2015 13:24:24 +1000 Rafael Lopez wrote:

> Hi all,
> 
> I am seeing a big discrepancy between librbd and kRBD/ext4 performance
> using FIO with single RBD image. RBD images are coming from same RBD
> pool, same size and settings for both. The librbd results are quite bad
> by comparison, and in addition if I scale up the kRBD FIO job with more
> jobs/threads it increases up to 3-4x results below, but librbd doesn't
> seem to scale much at all. I figured that it should be close to the kRBD
> result for a single job/thread before parallelism comes into play
> though. RBD cache settings are all default.
>
librbd as in FUSE or to KVM client VM?

RBD cache settings only influence librbd, the kernel will use all of the
available memory for page cache.

And this is what you're probably seeing, with the kernel RBD being so much 
faster.

Anyway, for a good comparison and an idea of what your cluster can do, start
with a block size of 4KB (and a smaller total size, of course) and
direct=1.

Christian
 
> I can see some obvious differences in FIO output, but not being well
> versed with FIO I'm not sure what to make of it or where to start
> diagnosing the discrepancy. Hunted around but haven't found anything
> useful, any suggestions/insights would be appreciated.
> 
> RBD cache settings:
> [root@rcmktdc1r72-09-ac rafaell]# ceph --admin-daemon
> /var/run/ceph/ceph-osd.659.asok config show | grep rbd_cache
> "rbd_cache": "true",
> "rbd_cache_writethrough_until_flush": "true",
> "rbd_cache_size": "33554432",
> "rbd_cache_max_dirty": "25165824",
> "rbd_cache_target_dirty": "16777216",
> "rbd_cache_max_dirty_age": "1",
> "rbd_cache_max_dirty_object": "0",
> "rbd_cache_block_writes_upfront": "false",
> [root@rcmktdc1r72-09-ac rafaell]#
> 
> This is the FIO job file for the kRBD job:
> 
> [root@rcprsdc1r72-01-ac rafaell]# cat ext4_test
> ; -- start job file --
> [global]
> rw=rw
> size=100g
> filename=/mnt/rbd/fio_test_file_ext4
> rwmixread=0
> rwmixwrite=100
> percentage_random=0
> bs=1024k
> direct=0
> iodepth=16
> thread=1
> numjobs=1
> [job1]
> ; -- end job file --
> 
> [root@rcprsdc1r72-01-ac rafaell]#
> 
> This is the FIO job file for the librbd job:
> 
> [root@rcprsdc1r72-01-ac rafaell]# cat fio_rbd_test
> ; -- start job file --
> [global]
> rw=rw
> size=100g
> rwmixread=0
> rwmixwrite=100
> percentage_random=0
> bs=1024k
> direct=0
> iodepth=16
> thread=1
> numjobs=1
> ioengine=rbd
> rbdname=nas1-rds-stg31
> pool=rbd
> [job1]
> ; -- end job file --
> 
> 
> Here are the results:
> 
> [root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
> job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=sync, iodepth=16
> fio-2.2.8
> Starting 1 thread
> job1: Laying out IO file(s) (1 file(s) / 102400MB)
> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/321.7MB/0KB /s] [0/321/0 iops]
> [eta 00m:00s]
> job1: (groupid=0, jobs=1): err= 0: pid=37981: Fri Sep 11 12:33:13 2015
>   write: io=102400MB, bw=399741KB/s, iops=390, runt=262314msec
> clat (usec): min=411, max=574082, avg=2492.91, stdev=7316.96
>  lat (usec): min=418, max=574113, avg=2520.12, stdev=7318.53
> clat percentiles (usec):
>  |  1.00th=[  446],  5.00th=[  458], 10.00th=[  474],
> 20.00th=[  510], | 30.00th=[ 1064], 40.00th=[ 1096], 50.00th=[ 1160],
> 60.00th=[ 1320], | 70.00th=[ 1592], 80.00th=[ 2448], 90.00th=[ 7712],
> 95.00th=[ 7904], | 99.00th=[11072], 99.50th=[11712], 99.90th=[13120],
> 99.95th=[73216], | 99.99th=[464896]
> bw (KB  /s): min=  264, max=2156544, per=100.00%, avg=412986.27,
> stdev=375092.66
> lat (usec) : 500=18.68%, 750=7.43%, 1000=2.11%
> lat (msec) : 2=48.89%, 4=4.35%, 10=16.79%, 20=1.67%, 50=0.03%
> lat (msec) : 100=0.03%, 250=0.02%, 500=0.01%, 750=0.01%
>   cpu  : usr=1.24%, sys=45.38%, ctx=19298, majf=0, minf=974
>   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>  issued: total=r=0/w=102400/d=0, short=r=0/w=0/d=0,
> drop=r=0/w=0/d=0 latency   : target=0, window=0, percentile=100.00%,
> depth=16
> 
> Run status group 0 (all jobs):
>   WRITE: io=102400MB, aggrb=399740KB/s, minb=399740KB/s, maxb=399740KB/s,
> mint=262314msec, maxt=262314msec
> 
> Disk stats (read/write):
>   rbd0: ios=0/150890, merge=0/49, ticks=0/36117700, in_queue=36145277,
> util=96.97%
> [root@rcprsdc1r72-01-ac rafaell]#
> 
> [root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
> job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=rbd, iodepth=16
> fio-2.2.8
> Starting 1 thread
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/65405KB/0KB /s] [0/63/0 iops]
> [eta 00m:00s]
> job1: (groupid=0, jobs=1): err= 0: pid=43960: Fri Sep 11 12:54:25 2015
>   write: io=102400MB, bw=121882KB/s, iops=119, runt=860318msec
> slat (usec): min=355, max=7300, avg=908.97, stdev=361.02
>  

[ceph-users] 答复: ceph shows health_ok but cluster completely jacked up

2015-09-10 Thread Duanweijun

You can use s3cmd:
http://www.cnblogs.com/zhyg6516/archive/2011/09/02/2163933.html

or

or use s3curl.pl (search for it on Baidu)

Example:
// get the bucket index with --debug
./s3curl.pl --debug --id=personal  -- 
http://node110/admin/bucket?index\=bkt-test

s3curl: Found the url: host=node110; port=; uri=/admin/bucket; 
query=index=bkt-test;
s3curl: StringToSign='GET\n\n\nFri, 11 Sep 2015 02:33:49 +\n/admin/bucket'
s3curl: exec curl -H Date: Fri, 11 Sep 2015 02:33:49 + -H Authorization: 
AWS keyuser:3OyxtJYnUmYjtC94W4yYF1udgAs= -L -H content-type:  
http://node110/admin/bucket?index=bkt-test
["_multipart_ceph_0.94.2-1.tar.gz"]

The debug output shows how the curl command is composed, like this:
// the Date is tied to the Authorization signature, so it expires after a while
curl -H "Date: Mon, 31 Aug 2015 08:55:21 +" -H "Authorization: AWS 
keyuser:wG0VkR5QlNdh7j5N2kLAhQb1x5s=" -L -H "content-type:"  
http://node110/admin/bucket?index\=bkt-test
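
To get at the query-string authentication the original question asked about,
recent s3cmd versions can also generate a pre-signed URL directly (a sketch;
bucket and object names here are placeholders):

# produce a URL with AWSAccessKeyId, Expires and Signature in the query string, valid for 1 hour
s3cmd signurl s3://bkt-test/some-object +3600
# the printed URL can then be fetched with no extra headers
curl 'http://node110/bkt-test/some-object?AWSAccessKeyId=...&Expires=...&Signature=...'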



-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Xu (Simon) Chen
Sent: Friday, September 11, 2015 10:12
To: ceph-users@lists.ceph.com
Subject: [ceph-users] ceph shows health_ok but cluster completely jacked up

Hi all,

I am using ceph 0.94.1. Recently, I ran into a somewhat serious issue.
"ceph -s" reports everything ok, all PGs active+clean, no blocked requests, 
etc. However, everything on top (the VMs' rbd disks) is completely jacked up. VM 
dmesg was reporting blocked IO requests, and reboots would just get stuck. I had to 
reboot my entire ceph cluster before things started to clear up.

That said, now I remember seeing non-zero blocked requests in "ceph report" 
before - I wonder why that didn't get reported by "ceph -s".

Any idea why this happened?

Thanks!
-Simon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
-
This e-mail and its attachments contain confidential information from H3C, 
which is
intended only for the person or entity whose address is listed above. Any use 
of the
information contained herein in any way (including, but not limited to, total 
or partial
disclosure, reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error, please notify 
the sender
by phone or email immediately and delete it!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-10 Thread Rafael Lopez
Thanks for the quick reply Somnath, will give this a try.

In order to set the rbd cache settings, is it a matter of updating the
ceph.conf file on the client only prior to running the test, or do I need
to inject args to all OSDs ?

Raf


On 11 September 2015 at 13:39, Somnath Roy  wrote:

> It may be due to rbd cache effect..
>
> Try the following..
>
>
>
> Run your test with direct = 1 both the cases and rbd_cache = false
> (disable all other rbd cache option as well). This should give you similar
> result like krbd.
>
>
>
> In direct =1 case, we saw ~10-20% degradation if we make rbd_cache = true.
>
> But, direct = 0 case, it could be more as you are seeing..
>
>
>
> I think there is a delta (or need to tune properly) if you want to use rbd
> cache.
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
>
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Rafael Lopez
> *Sent:* Thursday, September 10, 2015 8:24 PM
> *To:* ceph-users@lists.ceph.com
> *Subject:* [ceph-users] bad perf for librbd vs krbd using FIO
>
>
>
> Hi all,
>
>
>
> I am seeing a big discrepancy between librbd and kRBD/ext4 performance
> using FIO with single RBD image. RBD images are coming from same RBD pool,
> same size and settings for both. The librbd results are quite bad by
> comparison, and in addition if I scale up the kRBD FIO job with more
> jobs/threads it increases up to 3-4x results below, but librbd doesn't seem
> to scale much at all. I figured that it should be close to the kRBD result
> for a single job/thread before parallelism comes into play though. RBD
> cache settings are all default.
>
>
>
> I can see some obvious differences in FIO output, but not being well
> versed with FIO I'm not sure what to make of it or where to start
> diagnosing the discrepancy. Hunted around but haven't found anything
> useful, any suggestions/insights would be appreciated.
>
>
>
> RBD cache settings:
>
> [root@rcmktdc1r72-09-ac rafaell]# ceph --admin-daemon
> /var/run/ceph/ceph-osd.659.asok config show | grep rbd_cache
>
> "rbd_cache": "true",
>
> "rbd_cache_writethrough_until_flush": "true",
>
> "rbd_cache_size": "33554432",
>
> "rbd_cache_max_dirty": "25165824",
>
> "rbd_cache_target_dirty": "16777216",
>
> "rbd_cache_max_dirty_age": "1",
>
> "rbd_cache_max_dirty_object": "0",
>
> "rbd_cache_block_writes_upfront": "false",
>
> [root@rcmktdc1r72-09-ac rafaell]#
>
>
>
> This is the FIO job file for the kRBD job:
>
>
>
> [root@rcprsdc1r72-01-ac rafaell]# cat ext4_test
>
> ; -- start job file --
>
> [global]
>
> rw=rw
>
> size=100g
>
> filename=/mnt/rbd/fio_test_file_ext4
>
> rwmixread=0
>
> rwmixwrite=100
>
> percentage_random=0
>
> bs=1024k
>
> direct=0
>
> iodepth=16
>
> thread=1
>
> numjobs=1
>
> [job1]
>
> ; -- end job file --
>
>
>
> [root@rcprsdc1r72-01-ac rafaell]#
>
>
>
> This is the FIO job file for the librbd job:
>
>
>
> [root@rcprsdc1r72-01-ac rafaell]# cat fio_rbd_test
>
> ; -- start job file --
>
> [global]
>
> rw=rw
>
> size=100g
>
> rwmixread=0
>
> rwmixwrite=100
>
> percentage_random=0
>
> bs=1024k
>
> direct=0
>
> iodepth=16
>
> thread=1
>
> numjobs=1
>
> ioengine=rbd
>
> rbdname=nas1-rds-stg31
>
> pool=rbd
>
> [job1]
>
> ; -- end job file --
>
>
>
>
>
> Here are the results:
>
>
>
> [root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
>
> job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=sync, iodepth=16
>
> fio-2.2.8
>
> Starting 1 thread
>
> job1: Laying out IO file(s) (1 file(s) / 102400MB)
>
> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/321.7MB/0KB /s] [0/321/0 iops]
> [eta 00m:00s]
>
> job1: (groupid=0, jobs=1): err= 0: pid=37981: Fri Sep 11 12:33:13 2015
>
>   write: io=102400MB, bw=399741KB/s, iops=390, runt=262314msec
>
> clat (usec): min=411, max=574082, avg=2492.91, stdev=7316.96
>
>  lat (usec): min=418, max=574113, avg=2520.12, stdev=7318.53
>
> clat percentiles (usec):
>
>  |  1.00th=[  446],  5.00th=[  458], 10.00th=[  474], 20.00th=[  510],
>
>  | 30.00th=[ 1064], 40.00th=[ 1096], 50.00th=[ 1160], 60.00th=[ 1320],
>
>  | 70.00th=[ 1592], 80.00th=[ 2448], 90.00th=[ 7712], 95.00th=[ 7904],
>
>  | 99.00th=[11072], 99.50th=[11712], 99.90th=[13120], 99.95th=[73216],
>
>  | 99.99th=[464896]
>
> bw (KB  /s): min=  264, max=2156544, per=100.00%, avg=412986.27,
> stdev=375092.66
>
> lat (usec) : 500=18.68%, 750=7.43%, 1000=2.11%
>
> lat (msec) : 2=48.89%, 4=4.35%, 10=16.79%, 20=1.67%, 50=0.03%
>
> lat (msec) : 100=0.03%, 250=0.02%, 500=0.01%, 750=0.01%
>
>   cpu  : usr=1.24%, sys=45.38%, ctx=19298, majf=0, minf=974
>
>   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >=64=0.0%
>
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>
>  issued: total=r=0/w=102400/d=0, short=r=0/w=0/d=0,
> drop=r=0/w=0/d=0
>
>  latency   : target=0, 

Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-10 Thread Somnath Roy
It may be due to rbd cache effect..
Try the following..

Run your test with direct = 1 in both cases and rbd_cache = false (disable 
all other rbd cache options as well). This should give you a result similar to 
krbd.

In the direct = 1 case, we saw ~10-20% degradation if we set rbd_cache = true.
But in the direct = 0 case, it could be more, as you are seeing.

I think there is a delta (or need to tune properly) if you want to use rbd 
cache.

Thanks & Regards
Somnath



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Rafael 
Lopez
Sent: Thursday, September 10, 2015 8:24 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] bad perf for librbd vs krbd using FIO

Hi all,

I am seeing a big discrepancy between librbd and kRBD/ext4 performance using 
FIO with single RBD image. RBD images are coming from same RBD pool, same size 
and settings for both. The librbd results are quite bad by comparison, and in 
addition if I scale up the kRBD FIO job with more jobs/threads it increases up 
to 3-4x results below, but librbd doesn't seem to scale much at all. I figured 
that it should be close to the kRBD result for a single job/thread before 
parallelism comes into play though. RBD cache settings are all default.

I can see some obvious differences in FIO output, but not being well versed 
with FIO I'm not sure what to make of it or where to start diagnosing the 
discrepancy. Hunted around but haven't found anything useful, any 
suggestions/insights would be appreciated.

RBD cache settings:
[root@rcmktdc1r72-09-ac rafaell]# ceph --admin-daemon 
/var/run/ceph/ceph-osd.659.asok config show | grep rbd_cache
"rbd_cache": "true",
"rbd_cache_writethrough_until_flush": "true",
"rbd_cache_size": "33554432",
"rbd_cache_max_dirty": "25165824",
"rbd_cache_target_dirty": "16777216",
"rbd_cache_max_dirty_age": "1",
"rbd_cache_max_dirty_object": "0",
"rbd_cache_block_writes_upfront": "false",
[root@rcmktdc1r72-09-ac rafaell]#

This is the FIO job file for the kRBD job:

[root@rcprsdc1r72-01-ac rafaell]# cat ext4_test
; -- start job file --
[global]
rw=rw
size=100g
filename=/mnt/rbd/fio_test_file_ext4
rwmixread=0
rwmixwrite=100
percentage_random=0
bs=1024k
direct=0
iodepth=16
thread=1
numjobs=1
[job1]
; -- end job file --

[root@rcprsdc1r72-01-ac rafaell]#

This is the FIO job file for the librbd job:

[root@rcprsdc1r72-01-ac rafaell]# cat fio_rbd_test
; -- start job file --
[global]
rw=rw
size=100g
rwmixread=0
rwmixwrite=100
percentage_random=0
bs=1024k
direct=0
iodepth=16
thread=1
numjobs=1
ioengine=rbd
rbdname=nas1-rds-stg31
pool=rbd
[job1]
; -- end job file --


Here are the results:

[root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=sync, iodepth=16
fio-2.2.8
Starting 1 thread
job1: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/321.7MB/0KB /s] [0/321/0 iops] [eta 
00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=37981: Fri Sep 11 12:33:13 2015
  write: io=102400MB, bw=399741KB/s, iops=390, runt=262314msec
clat (usec): min=411, max=574082, avg=2492.91, stdev=7316.96
 lat (usec): min=418, max=574113, avg=2520.12, stdev=7318.53
clat percentiles (usec):
 |  1.00th=[  446],  5.00th=[  458], 10.00th=[  474], 20.00th=[  510],
 | 30.00th=[ 1064], 40.00th=[ 1096], 50.00th=[ 1160], 60.00th=[ 1320],
 | 70.00th=[ 1592], 80.00th=[ 2448], 90.00th=[ 7712], 95.00th=[ 7904],
 | 99.00th=[11072], 99.50th=[11712], 99.90th=[13120], 99.95th=[73216],
 | 99.99th=[464896]
bw (KB  /s): min=  264, max=2156544, per=100.00%, avg=412986.27, 
stdev=375092.66
lat (usec) : 500=18.68%, 750=7.43%, 1000=2.11%
lat (msec) : 2=48.89%, 4=4.35%, 10=16.79%, 20=1.67%, 50=0.03%
lat (msec) : 100=0.03%, 250=0.02%, 500=0.01%, 750=0.01%
  cpu  : usr=1.24%, sys=45.38%, ctx=19298, majf=0, minf=974
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=102400/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400MB, aggrb=399740KB/s, minb=399740KB/s, maxb=399740KB/s, 
mint=262314msec, maxt=262314msec

Disk stats (read/write):
  rbd0: ios=0/150890, merge=0/49, ticks=0/36117700, in_queue=36145277, 
util=96.97%
[root@rcprsdc1r72-01-ac rafaell]#

[root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=rbd, iodepth=16
fio-2.2.8
Starting 1 thread
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/65405KB/0KB /s] [0/63/0 iops] [eta 
00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=43960: Fri Sep 11 12:54:25 2015
  write: io=102400MB, bw=121882KB/s, iops=119, runt=860318msec
slat (usec): min=355, max=7300, 

Re: [ceph-users] higher read iop/s for single thread

2015-09-10 Thread Stefan Priebe

On 10.09.2015 at 16:26, Haomai Wang wrote:

Actually we can reach 700us per 4k write IO for single io depth (2 copies,
E5-2650, 10Gib, Intel S3700). So I think 400 read iops shouldn't be an
unbridgeable problem.


How did you measure it?
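
For reference, one way to measure single-depth 4k write latency directly is fio's rbd
engine (a sketch only; pool and image names are placeholders):

fio --name=lat4k --ioengine=rbd --pool=rbd --rbdname=testimg \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --direct=1 \
    --time_based --runtime=60

The average "clat" in the output is the per-IO latency; its inverse (in seconds) is
roughly the single-threaded IOPS ceiling for that path.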



CPU is critical for ssd backend, so what's your cpu model?

On Thu, Sep 10, 2015 at 9:48 PM, Jan Schermer wrote:

It's certainly not a problem with DRBD (yeah, it's something
completely different but it's used for all kinds of workloads
including things like replicated tablespaces for databases).
It won't be a problem with VSAN (again, a bit different, but most
people just want something like that)
It surely won't be a problem with e.g. ScaleIO which should be
comparable to Ceph.

Latency on the network can be very low (0.05ms on my 10GbE). Latency
on good SSDs is 2 orders of magnitude lower (as low as 0.5 ms).
Linux is pretty good nowadays at waking up threads and pushing the
work. Multiply those numbers by whatever factor and it's still just
a fraction of the 0.5ms needed.
The problem is quite frankly slow OSD code and the only solution now
is to keep the data closer to the VM.

Jan

 > On 10 Sep 2015, at 15:38, Gregory Farnum wrote:
 >
 > On Thu, Sep 10, 2015 at 2:34 PM, Stefan Priebe - Profihost AG
 > wrote:
 >> Hi,
 >>
 >> while we're happy running ceph firefly in production and also reach
 >> enough 4k read iop/s for multithreaded apps (around 23 000) with
 >> qemu 2.2.1.
 >>
 >> We now have a customer with a single-threaded application needing
 >> around 2000 iop/s, but we don't go above 600 iop/s in this case.
 >>
 >> Any tuning hints for this case?
 >
 > If the application really wants 2000 sync IOPS to disk without any
 > parallelism, I don't think any network storage system is likely to
 > satisfy him — that's only half a millisecond per IO. 600 IOPS is
 >> > about
 > the limit of what the OSD can do right now (in terms of per-op
 > speeds), and although there is some work being done to improve that
 > it's not going to be in a released codebase for a while.
 >
 > Or perhaps I misunderstood the question?
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com 
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--

Best Regards,

Wheat



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-10 Thread Somnath Roy
Only changing client side ceph.conf and rerunning the tests is sufficient.
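
A minimal sketch of that client-side override (option names as spelled in Hammer; put it
on the node running fio):

[client]
    rbd cache = false
    rbd cache writethrough until flush = false

and in both fio job files set direct=1 so the librbd and krbd runs bypass caching the
same way.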

Thanks & Regards
Somnath

From: Rafael Lopez [mailto:rafael.lo...@monash.edu]
Sent: Thursday, September 10, 2015 8:58 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO

Thanks for the quick reply Somnath, will give this a try.

In order to set the rbd cache settings, is it a matter of updating the 
ceph.conf file on the client only prior to running the test, or do I need to 
inject args to all OSDs ?

Raf


On 11 September 2015 at 13:39, Somnath Roy 
wrote:
It may be due to rbd cache effect..
Try the following..

Run your test with direct = 1 in both cases and rbd_cache = false (disable 
all other rbd cache options as well). This should give you a result similar to 
krbd.

In the direct = 1 case, we saw ~10-20% degradation if we set rbd_cache = true.
But in the direct = 0 case, it could be more, as you are seeing.

I think there is a delta (or need to tune properly) if you want to use rbd 
cache.

Thanks & Regards
Somnath



From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com]
 On Behalf Of Rafael Lopez
Sent: Thursday, September 10, 2015 8:24 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] bad perf for librbd vs krbd using FIO

Hi all,

I am seeing a big discrepancy between librbd and kRBD/ext4 performance using 
FIO with single RBD image. RBD images are coming from same RBD pool, same size 
and settings for both. The librbd results are quite bad by comparison, and in 
addition if I scale up the kRBD FIO job with more jobs/threads it increases up 
to 3-4x results below, but librbd doesn't seem to scale much at all. I figured 
that it should be close to the kRBD result for a single job/thread before 
parallelism comes into play though. RBD cache settings are all default.

I can see some obvious differences in FIO output, but not being well versed 
with FIO I'm not sure what to make of it or where to start diagnosing the 
discrepancy. Hunted around but haven't found anything useful, any 
suggestions/insights would be appreciated.

RBD cache settings:
[root@rcmktdc1r72-09-ac rafaell]# ceph --admin-daemon 
/var/run/ceph/ceph-osd.659.asok config show | grep rbd_cache
"rbd_cache": "true",
"rbd_cache_writethrough_until_flush": "true",
"rbd_cache_size": "33554432",
"rbd_cache_max_dirty": "25165824",
"rbd_cache_target_dirty": "16777216",
"rbd_cache_max_dirty_age": "1",
"rbd_cache_max_dirty_object": "0",
"rbd_cache_block_writes_upfront": "false",
[root@rcmktdc1r72-09-ac rafaell]#

This is the FIO job file for the kRBD job:

[root@rcprsdc1r72-01-ac rafaell]# cat ext4_test
; -- start job file --
[global]
rw=rw
size=100g
filename=/mnt/rbd/fio_test_file_ext4
rwmixread=0
rwmixwrite=100
percentage_random=0
bs=1024k
direct=0
iodepth=16
thread=1
numjobs=1
[job1]
; -- end job file --

[root@rcprsdc1r72-01-ac rafaell]#

This is the FIO job file for the librbd job:

[root@rcprsdc1r72-01-ac rafaell]# cat fio_rbd_test
; -- start job file --
[global]
rw=rw
size=100g
rwmixread=0
rwmixwrite=100
percentage_random=0
bs=1024k
direct=0
iodepth=16
thread=1
numjobs=1
ioengine=rbd
rbdname=nas1-rds-stg31
pool=rbd
[job1]
; -- end job file --


Here are the results:

[root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=sync, iodepth=16
fio-2.2.8
Starting 1 thread
job1: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/321.7MB/0KB /s] [0/321/0 iops] [eta 
00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=37981: Fri Sep 11 12:33:13 
2015
  write: io=102400MB, bw=399741KB/s, iops=390, runt=262314msec
clat (usec): min=411, max=574082, avg=2492.91, stdev=7316.96
 lat (usec): min=418, max=574113, avg=2520.12, stdev=7318.53
clat percentiles (usec):
 |  1.00th=[  446],  5.00th=[  458], 10.00th=[  474], 20.00th=[  510],
 | 30.00th=[ 1064], 40.00th=[ 1096], 50.00th=[ 1160], 60.00th=[ 1320],
 | 70.00th=[ 1592], 80.00th=[ 2448], 90.00th=[ 7712], 95.00th=[ 7904],
 | 99.00th=[11072], 99.50th=[11712], 99.90th=[13120], 99.95th=[73216],
 | 99.99th=[464896]
bw (KB  /s): min=  264, max=2156544, per=100.00%, avg=412986.27, 
stdev=375092.66
lat (usec) : 500=18.68%, 750=7.43%, 1000=2.11%
lat (msec) : 2=48.89%, 4=4.35%, 10=16.79%, 20=1.67%, 50=0.03%
lat (msec) : 100=0.03%, 250=0.02%, 500=0.01%, 750=0.01%
  cpu  : usr=1.24%, sys=45.38%, ctx=19298, majf=0, minf=974
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=102400/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency   : target=0, window=0, 

Re: [ceph-users] higher read iop/s for single thread

2015-09-10 Thread Stefan Priebe


On 10.09.2015 at 17:20, Mark Nelson wrote:

I'm not sure you will be able to get there with firefly.  I've gotten
close to 1ms after lots of tuning on hammer, but 0.5ms is probably not
likely to happen without all of the new work that
Sandisk/Fujitsu/Intel/Others have been doing to improve the data path.

Your best bet is probably going to be a combination of:

1) switch to jemalloc (and make sure you have enough RAM to deal with it)
2) disabled ceph auth
3) disable all logging
4) throw a high clock speed CPU at the OSDs and keep the number of OSDs
per server lowish (will need to be tested to see where the sweet spot is).
5) potentially implement some kind of scheme to make sure OSD threads
stay pinned to specific cores.
6) lots of investigation to make sure the kernel/tcp stack/vm/etc isn't
getting in the way.
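
As a rough sketch of items 2, 3 and 5 above (option names are the Hammer spellings; the
core mask and OSD id are examples only):

[global]
    auth cluster required = none
    auth service required = none
    auth client required = none
    debug ms = 0/0
    debug osd = 0/0
    debug filestore = 0/0

# pin an OSD's threads to a fixed set of cores, e.g. from a wrapper or init override
taskset -c 0-5 /usr/bin/ceph-osd -i 12 --cluster ceph

Item 1 (jemalloc) usually means LD_PRELOAD-ing or rebuilding against the allocator and
is worth benchmarking on its own.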


Thanks, will do so. The strange thing currently is that iotop shows 
more threads involved (4-6). And fio can easily reach 5000 iop/s reading 
with 4 threads doing 16k randread. So currently I don't understand the 
difference in workload.


Stefan



Mark

On 09/10/2015 08:34 AM, Stefan Priebe - Profihost AG wrote:

Hi,

while we're happy running ceph firefly in production and also reach
enough 4k read iop/s for multithreaded apps (around 23 000) with qemu
2.2.1.

We now have a customer with a single-threaded application needing around
2000 iop/s, but we don't go above 600 iop/s in this case.

Any tuning hints for this case?

Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rados bench seq throttling

2015-09-10 Thread Deneau, Tom
Running 9.0.3 rados bench on a 9.0.3 cluster...
In the following experiments this cluster is only 2 osd nodes, 6 osds each
and a separate mon node (and a separate client running rados bench).

I have two pools populated with 4M objects.  The pools are replicated x2
with identical parameters.  The objects appear to be spread evenly across the 
12 osds.

In all cases I drop caches on all nodes before doing a rados bench seq test.
In all cases I run rados bench seq for identical times (30 seconds) and in that 
time
we do not run out of objects to read from the pool.

I am seeing significant bandwidth differences between the following:

   * running a single instance of rados bench reading from one pool with 32
 threads (bandwidth approx. 300 MB/s)

   * running two instances of rados bench, each reading from one of the two pools
 with 16 threads per instance (combined bandwidth approx. 450 MB/s)

I have already increased the following:
  objecter_inflight_op_bytes = 10485760
  objecter_inflight_ops = 8192
  ms_dispatch_throttle_bytes = 1048576000  #didn't seem to have any effect

The disks and network are not reaching anywhere near 100% utilization

What is the best way to diagnose what is throttling things in the one-instance 
case?
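
One place to start (admin sockets are on by default for OSDs; daemon ids below are
examples):

# throttle counters on an OSD; a throttle whose "val" sits at its "max" is a suspect
ceph daemon osd.0 perf dump | grep -A 8 '"throttle-'

# in-flight and recently completed slow ops while the bench is running
ceph daemon osd.0 dump_ops_in_flight
ceph daemon osd.0 dump_historic_ops

On the client side, setting "admin socket = /var/run/ceph/$name.$pid.asok" in the
[client] section of ceph.conf on the bench node gives the rados bench process its own
socket, so the objecter throttles can be dumped the same way.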

-- Tom Deneau, AMD
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and caching

2015-09-10 Thread Kyle Hutson
A 'rados -p cachepool ls' takes about 3 hours - not exactly useful.

I'm intrigued that you say a single read may not promote it into the cache.
My understanding is that if you have an EC-backed pool the clients can't
talk to it directly, which means objects would necessarily be promoted to
the cache pool so the client could read them. Is my understanding wrong?

I'm also wondering if it's possible to use RAM as a read-cache layer.
Obviously, we don't want this for write-cache because of power outages,
motherboard failures, etc., but it seems to make sense for a read-cache. Is
that something that's being done, can be done, is going to be done, or has
even been considered?
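
For what it's worth, the inode-to-object mapping Greg describes below can be scripted,
which at least narrows the listing down to one object prefix (a sketch; the mount point
and file name are examples):

# hex inode of the file
ino=$(printf '%x' "$(stat -c %i /mnt/cephfs/bigfile)")

# CephFS data objects are named <hex inode>.<8-hex-digit block index>,
# so any promoted pieces of the file show up under that prefix
rados -p cachepool ls | grep "^${ino}\."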

On Wed, Sep 9, 2015 at 10:33 AM, Gregory Farnum  wrote:

> On Wed, Sep 9, 2015 at 4:26 PM, Kyle Hutson  wrote:
> >
> >
> > On Wed, Sep 9, 2015 at 9:34 AM, Gregory Farnum 
> wrote:
> >>
> >> On Wed, Sep 9, 2015 at 3:27 PM, Kyle Hutson  wrote:
> >> > We are using Hammer - latest released version. How do I check if it's
> >> > getting promoted into the cache?
> >>
> >> Umm...that's a good question. You can run rados ls on the cache pool,
> >> but that's not exactly scalable; you can turn up logging and dig into
> >> them to see if redirects are happening, or watch the OSD operations
> >> happening via the admin socket. But I don't know if there's a good
> >> interface for users to just query the cache state of a single object.
> >> :/
> >
> >
> > even using 'rados ls', I (naturally) get cephfs object names - is there a
> > way to see a filename -> objectname conversion ... or objectname ->
> filename
> > ?
>
> The object name is <inode number in hex>.<object index>. So you can
> look at the file inode and then see which of its objects are actually
> in the pool.
> -Greg
>
> >
> >>
> >> > We're using the latest ceph kernel client. Where do I poke at
> readahead
> >> > settings there?
> >>
> >> Just the standard kernel readahead settings; I'm not actually familiar
> >> with how to configure those but I don't believe Ceph's are in any way
> >> special. What do you mean by "latest ceph kernel client"; are you
> >> running one of the developer testing kernels or something?
> >
> >
> > No, just what comes with the latest stock kernel. Sorry for any
> confusion.
> >
> >>
> >> I think
> >> Ilya might have mentioned some issues with readahead being
> >> artificially blocked, but that might have only been with RBD.
> >>
> >> Oh, are the files you're using sparse? There was a bug with sparse
> >> files not filling in pages that just got patched yesterday or
> >> something.
> >
> >
> > No, these are not sparse files. Just really big.
> >
> >>
> >> >
> >> > On Tue, Sep 8, 2015 at 8:29 AM, Gregory Farnum 
> >> > wrote:
> >> >>
> >> >> On Thu, Sep 3, 2015 at 11:58 PM, Kyle Hutson 
> >> >> wrote:
> >> >> > I was wondering if anybody could give me some insight as to how
> >> >> > CephFS
> >> >> > does
> >> >> > its caching - read-caching in particular.
> >> >> >
> >> >> > We are using CephFS with an EC pool on the backend with a
> replicated
> >> >> > cache
> >> >> > pool in front of it. We're seeing some very slow read times. Trying
> >> >> > to
> >> >> > compute an md5sum on a 15GB file twice in a row (so it should be in
> >> >> > cache)
> >> >> > takes the time from 23 minutes down to 17 minutes, but this is
> over a
> >> >> > 10Gbps
> >> >> > network and with a crap-ton of OSDs (over 300), so I would expect
> it
> >> >> > to
> >> >> > be
> >> >> > down in the 2-3 minute range.
> >> >>
> >> >> A single sequential read won't necessarily promote an object into the
> >> >> cache pool (although if you're using Hammer I think it will), so you
> >> >> want to check if it's actually getting promoted into the cache before
> >> >> assuming that's happened.
> >> >>
> >> >> >
> >> >> > I'm just trying to figure out what we can do to increase the
> >> >> > performance. I
> >> >> > have over 300 TB of live data that I have to be careful with,
> though,
> >> >> > so
> >> >> > I
> >> >> > have to have some level of caution.
> >> >> >
> >> >> > Is there some other caching we can do (client-side or server-side)
> >> >> > that
> >> >> > might give us a decent performance boost?
> >> >>
> >> >> Which client are you using for this testing? Have you looked at the
> >> >> readahead settings? That's usually the big one; if you're only asking
> >> >> for 4KB at once then stuff is going to be slow no matter what (a
> >> >> single IO takes at minimum about 2 milliseconds right now, although
> >> >> the RADOS team is working to improve that).
> >> >> -Greg
> >> >>
> >> >> >
> >> >> > ___
> >> >> > ceph-users mailing list
> >> >> > ceph-users@lists.ceph.com
> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >
> >> >
> >> >
> >
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com

[ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We are trying to add some additional OSDs to our cluster, but the
impact of the backfilling has been very disruptive to client I/O and
we have been trying to figure out how to reduce the impact. We have
seen some client I/O blocked for more than 60 seconds. There has been
CPU and RAM head room on the OSD nodes, network has been fine, disks
have been busy, but not terrible.

11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals
(10GB), dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta
S51G-1UL.

Clients are QEMU VMs.

[ulhglive-root@ceph5 current]# ceph --version
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)

Some nodes are 0.94.3

[ulhglive-root@ceph5 current]# ceph status
cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
 health HEALTH_WARN
3 pgs backfill
1 pgs backfilling
4 pgs stuck unclean
recovery 2382/33044847 objects degraded (0.007%)
recovery 50872/33044847 objects misplaced (0.154%)
noscrub,nodeep-scrub flag(s) set
 monmap e2: 3 mons at
{mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
election epoch 180, quorum 0,1,2 mon1,mon2,mon3
 osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
flags noscrub,nodeep-scrub
  pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
128 TB used, 322 TB / 450 TB avail
2382/33044847 objects degraded (0.007%)
50872/33044847 objects misplaced (0.154%)
2300 active+clean
   3 active+remapped+wait_backfill
   1 active+remapped+backfilling
recovery io 70401 kB/s, 16 objects/s
  client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s

Each pool is size 4 with min_size 2.

One problem we have is that the requirements of the cluster changed
after setting up our pools, so our PGs are really out of whack. Our
most active pool has only 256 PGs and each PG is about 120 GB in size.
We are trying to clear out a pool that has way too many PGs so that we
can split the PGs in that pool. I think these large PGs are part of our
issues.

Things I've tried:

* Lowered nr_requests on the spindles from 1000 to 100. This reduced
the max latency, which sometimes hit 3000 ms, down to a max of 500-700 ms.
It has also reduced the huge swings in latency, but has also reduced
throughput somewhat.
* Changed the scheduler from deadline to CFQ. I'm not sure if the
OSD process gives the recovery threads a different disk priority or if
changing the scheduler without restarting the OSD allows the OSD to
use disk priorities.
* Reduced the number of osd_max_backfills from 2 to 1.
* Tried setting noin to give the new OSDs time to get the PG map and
peer before starting the backfill. This caused more problems than it
solved as we had blocked I/O (over 200 seconds) until we set the new
OSDs to in.

Even adding one OSD disk into the cluster is causing these slow I/O
messages. We still have 5 more disks to add from this server and four
more servers to add.

In addition to trying to minimize these impacts, would it be better to
split the PGs then add the rest of the servers, or add the servers
then do the PG split? I'm thinking splitting first would be better,
but I'd like to get other opinions.

No spindle stays at high utilization for long and the await drops
below 20 ms usually within 10 seconds so I/O should be serviced
"pretty quick". My next guess is that the journals are getting full
and blocking while waiting for flushes, but I'm not exactly sure how
to identify that. We are using the defaults for the journal except for
size (10G). We'd like to have journals large to handle bursts, but if
they are getting filled with backfill traffic, it may be counter
productive. Can/does backfill/recovery bypass the journal?
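
One rough way to watch for journal back-pressure while the backfill is running (counter
names are from a Hammer-era "perf dump" and may differ between releases; the OSD id is
an example):

# journal/filestore queue depth and journal latency on a busy OSD
watch -n 2 "ceph daemon osd.12 perf dump | grep -E 'journal_queue|journal_latency|op_queue'"

If journal_queue_bytes keeps riding its configured max while the spindle itself is not
saturated, the journal (or the filestore op queue behind it) is the likely choke point.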

Thanks,

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV8e5qCRDmVDuy+mK58QAAaIwQAMN5DJlhrZkqwqsVXaKB
nnegQjG6Y02ObLRrg96ghHr+AGgY/HRm3iShng6E1N9CL+XjcHSLeb1JqH9n
2SgGQGoRAU1dY6DIlOs5K8Fwd2bBECh863VymYbO+OLgtXbpp2mWfZZVAkTf
V9ryaEh7tZOY1Mhx7mSIyr9Ur7IxTUOjzExAFPGfTLP1cbjE/FXoQMHh10fe
zSzk/qK0AvajFD0PR04uRyEsGYeCLl68kGQi1R7IQlxZWc7hMhWXKNIFlbKB
lk5+8OGx/LawW7qxpFm8a1SNoiAwMtrPKepvHYGi8u3rfXJa6ZE38jGuoqRs
8jD+b+gS0yxKbahT6S/gAEbgzAH0JF4YSz+nHNrvS6eSebykE9/7HGe9W7WA
HRAkrESi/f1MKtRkud2Nhycx2R0MZLK/HoumnCN8WUmgvOtKsyYpXj6FXghv
VGpi3r6uyC5Xlb8JGREqB1hAUTHAv0+z4biDBvPYrENwFUaerWiIujIeLWV9
aYuiQBjjDCLoqWZj0+gQwn9/zXo8gE7jo3XAemYqGB8NJY1e+RZW6+TgC2rD
Floa1en1PzZsynm1Ho+RPWW509kog5fFkt41nJmmxRi3kNWwiJfKLJvysetl
RYudFG1cEumfI68VyNcuL4dMzf9FsiADsBaHue8g9a5bjJH8LjK4fKZDCCJf
Rzgu
=vlrz
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Somnath Roy
Try all these..

osd recovery max active = 1
osd max backfills = 1
osd recovery threads = 1
osd recovery op priority = 1
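
Most of these can be tried at runtime before committing them to ceph.conf (a sketch;
dashes and underscores are interchangeable in option names):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

osd_recovery_threads may not take effect until the OSDs restart, so it is safest to also
persist all four under [osd] in ceph.conf.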

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Robert 
LeBlanc
Sent: Thursday, September 10, 2015 1:56 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Hammer reduce recovery impact

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We are trying to add some additional OSDs to our cluster, but the impact of the 
backfilling has been very disruptive to client I/O and we have been trying to 
figure out how to reduce the impact. We have seen some client I/O blocked for 
more than 60 seconds. There has been CPU and RAM head room on the OSD nodes, 
network has been fine, disks have been busy, but not terrible.

11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals (10GB), 
dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta S51G-1UL.

Clients are QEMU VMs.

[ulhglive-root@ceph5 current]# ceph --version ceph version 0.94.2 
(5fb85614ca8f354284c713a2f9c610860720bbf3)

Some nodes are 0.94.3

[ulhglive-root@ceph5 current]# ceph status
cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
 health HEALTH_WARN
3 pgs backfill
1 pgs backfilling
4 pgs stuck unclean
recovery 2382/33044847 objects degraded (0.007%)
recovery 50872/33044847 objects misplaced (0.154%)
noscrub,nodeep-scrub flag(s) set
 monmap e2: 3 mons at
{mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
election epoch 180, quorum 0,1,2 mon1,mon2,mon3
 osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
flags noscrub,nodeep-scrub
  pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
128 TB used, 322 TB / 450 TB avail
2382/33044847 objects degraded (0.007%)
50872/33044847 objects misplaced (0.154%)
2300 active+clean
   3 active+remapped+wait_backfill
   1 active+remapped+backfilling recovery io 70401 kB/s, 16 
objects/s
  client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s

Each pool is size 4 with min_size 2.

One problem we have is that the requirements of the cluster changed after 
setting up our pools, so our PGs are really out of whack. Our most active pool 
has only 256 PGs and each PG is about 120 GB in size.
We are trying to clear out a pool that has way too many PGs so that we can 
split the PGs in that pool. I think these large PGs are part of our issues.

Things I've tried:

* Lowered nr_requests on the spindles from 1000 to 100. This reduced the max 
latency, which sometimes hit 3000 ms, down to a max of 500-700 ms.
It has also reduced the huge swings in latency, but has also reduced 
throughput somewhat.
* Changed the scheduler from deadline to CFQ. I'm not sure if the OSD 
process gives the recovery threads a different disk priority or if changing the 
scheduler without restarting the OSD allows the OSD to use disk priorities.
* Reduced the number of osd_max_backfills from 2 to 1.
* Tried setting noin to give the new OSDs time to get the PG map and peer 
before starting the backfill. This caused more problems than it solved, as we had 
blocked I/O (over 200 seconds) until we set the new OSDs to in.

Even adding one OSD disk into the cluster is causing these slow I/O messages. 
We still have 5 more disks to add from this server and four more servers to add.

In addition to trying to minimize these impacts, would it be better to split 
the PGs then add the rest of the servers, or add the servers then do the PG 
split? I'm thinking splitting first would be better, but I'd like to get other 
opinions.

No spindle stays at high utilization for long and the await drops below 20 ms 
usually within 10 seconds so I/O should be serviced "pretty quick". My next 
guess is that the journals are getting full and blocking while waiting for 
flushes, but I'm not exactly sure how to identify that. We are using the 
defaults for the journal except for size (10G). We'd like to have journals 
large to handle bursts, but if they are getting filled with backfill traffic, 
it may be counter productive. Can/does backfill/recovery bypass the journal?

Thanks,

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1 -BEGIN 
PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV8e5qCRDmVDuy+mK58QAAaIwQAMN5DJlhrZkqwqsVXaKB
nnegQjG6Y02ObLRrg96ghHr+AGgY/HRm3iShng6E1N9CL+XjcHSLeb1JqH9n
2SgGQGoRAU1dY6DIlOs5K8Fwd2bBECh863VymYbO+OLgtXbpp2mWfZZVAkTf
V9ryaEh7tZOY1Mhx7mSIyr9Ur7IxTUOjzExAFPGfTLP1cbjE/FXoQMHh10fe
zSzk/qK0AvajFD0PR04uRyEsGYeCLl68kGQi1R7IQlxZWc7hMhWXKNIFlbKB
lk5+8OGx/LawW7qxpFm8a1SNoiAwMtrPKepvHYGi8u3rfXJa6ZE38jGuoqRs
8jD+b+gS0yxKbahT6S/gAEbgzAH0JF4YSz+nHNrvS6eSebykE9/7HGe9W7WA

[ceph-users] 9 PGs stay incomplete

2015-09-10 Thread Wido den Hollander
Hi,

I'm running into an issue with Ceph 0.94.2/3 where, after doing a recovery
test, 9 PGs stay incomplete:

osdmap e78770: 2294 osds: 2294 up, 2294 in
pgmap v1972391: 51840 pgs, 7 pools, 220 TB data, 185 Mobjects
   755 TB used, 14468 TB / 15224 TB avail
  51831 active+clean
  9 incomplete

As you can see, all 2294 OSDs are online and almost all PGs became
active+clean again, except for 9.

I found out that these PGs are the problem:

10.3762
7.309e
7.29a2
10.2289
7.17dd
10.165a
7.1050
7.c65
10.abf

Digging further, all the PGs map back to an OSD running on the
same host, 'ceph-stg-01' in this case.

$ ceph pg 10.3762 query

Looking at the recovery state, this is shown:

{
"first": 65286,
"last": 67355,
"maybe_went_rw": 0,
"up": [
1420,
854,
1105
],
"acting": [
1420
],
"primary": 1420,
"up_primary": 1420
},

osd.1420 is online. I tried restarting it, but nothing happens, these 9
PGs stay incomplete.

Under 'peer_info' info I see both osd.854 and osd.1105 reporting about
the PG with identical numbers.

I restarted both 854 and 1105, without result.

The output of PG query can be found here: http://pastebin.com/qQL699zC

The cluster is running a mix of 0.94.2 and .3 on Ubuntu 14.04.2 with the
3.13 kernel. XFS is being used as the backing filesystem.

Any suggestions to fix this issue? There is no valuable data in these
pools, so I can remove them, but I'd rather fix the root-cause.
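
For reference, pulling the interesting bits out of all nine PGs at once looks something
like this (a sketch; the PG ids are the nine listed above):

for pg in 10.3762 7.309e 7.29a2 10.2289 7.17dd 10.165a 7.1050 7.c65 10.abf; do
    echo "== $pg =="
    ceph pg $pg query | grep -E '"blocked_by"|"down_osds_we_would_probe"|"peering_blocked_by"'
done

A non-empty blocked_by or down_osds_we_would_probe usually points at the OSD (or an
already-removed OSD id) whose history the PG still wants before it will go active.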

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to observed civetweb.

2015-09-10 Thread Kobi Laredo
We haven't had the need to explore civetweb's SSL termination ability, so I
don't know the answer to your question.
Either way, haproxy is a safer bet.
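
For anyone terminating SSL in front of civetweb, a minimal haproxy sketch (ports, cert
path and backend address are examples only, not a tested config):

frontend rgw_https
    bind *:443 ssl crt /etc/haproxy/rgw.pem
    mode http
    default_backend rgw_civetweb

backend rgw_civetweb
    mode http
    server rgw1 127.0.0.1:7480 check

with radosgw itself running plain HTTP along the lines of
"rgw frontends = civetweb port=7480" in ceph.conf.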

*Kobi Laredo*
*Cloud Systems Engineer* | (*408) 409-KOBI*

On Tue, Sep 8, 2015 at 8:50 PM, Vickie ch  wrote:

> Thanks a lot!!
> One more question. I understand that haproxy is a better way to do
> load balancing.
> And GitHub says civetweb already supports https.
> But I found some documents mentioning that civetweb needs haproxy for https.
> Which one is true?
>
>
>
> Best wishes,
> Mika
>
>
> 2015-09-09 2:21 GMT+08:00 Kobi Laredo :
>
>> Vickie,
>>
>> You can add:
>> *access_log_file=/var/log/civetweb/access.log
>> error_log_file=/var/log/civetweb/error.log*
>>
>> to *rgw frontends* in ceph.conf though these logs are thin on info
>> (Source IP, date, and request)
>>
>> Check out
>> https://github.com/civetweb/civetweb/blob/master/docs/UserManual.md for
>> more civetweb configs you can inject through  *rgw frontends* config
>> attribute in ceph.conf
>>
>> We are currently testing tuning civetweb's num_threads
>> and request_timeout_ms to improve radosgw performance
>>
>> *Kobi Laredo*
>> *Cloud Systems Engineer* | (*408) 409-KOBI*
>>
>> On Tue, Sep 8, 2015 at 8:20 AM, Yehuda Sadeh-Weinraub 
>> wrote:
>>
>>> You can increase the civetweb logs by adding 'debug civetweb = 10' in
>>> your ceph.conf. The output will go into the rgw logs.
>>>
>>> Yehuda
>>>
>>> On Tue, Sep 8, 2015 at 2:24 AM, Vickie ch 
>>> wrote:
>>> > Dear cephers,
>>> >Just upgrade radosgw from apache to civetweb.
>>> > It's really simple to install and use. But I can't find any
>>> > parameters or
>>> > logs to adjust (or observe) for civetweb (like the apache log). I'm really
>>> > confused.
>>> > Any ideas?
>>> >
>>> >
>>> > Best wishes,
>>> > Mika
>>> >
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com