Re: [ceph-users] Poor performance on all SSD cluster

2014-06-24 Thread Mark Kirkwood

On 24/06/14 17:37, Alexandre DERUMIER wrote:

Hi Greg,


So the only way to improve performance would be to not use O_DIRECT (as this 
should bypass rbd cache as well, right?).


yes, indeed, O_DIRECT bypasses the cache.



BTW, do you need to use MySQL with O_DIRECT? The default innodb_flush_method is 
fdatasync, so it should work with the cache
(but you can lose some writes in case of a crash).



While this suggestion is good, I don't believe that the "you could lose 
data" statement is correct with respect to fdatasync (or fsync) [1]. 
With all modern kernels I think you will find that fdatasync will 
actually flush modified buffers to the device (i.e. write through the file 
buffer cache).


All of which means that MySQL performance (looking at you, binlog) may 
still suffer due to lots of small-block-size sync writes.
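
For anyone wanting to experiment, the relevant MySQL knobs are roughly these (a
sketch only; the option names are standard InnoDB/binlog settings, the values are
illustrative and version-dependent):

  [mysqld]
  # use fsync/fdatasync instead of O_DIRECT so writes can be absorbed by the rbd cache
  innodb_flush_method            = fdatasync
  # 1 = flush the InnoDB log at every commit (safest, most sync writes)
  innodb_flush_log_at_trx_commit = 1
  # 1 = fsync the binlog at every commit (this is where the small sync writes bite)
  sync_binlog                    = 1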


regards

Mark

[1] See kernel archives concerning REQ_FLUSH and friends.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep scrub versus osd scrub load threshold

2014-06-24 Thread Christian Balzer


Hello,

On Mon, 23 Jun 2014 21:50:50 -0700 David Zafman wrote:

 
 By default osd_scrub_max_interval and osd_deep_scrub_interval are 1 week
 604800 seconds (60*60*24*7) and osd_scrub_min_interval is 1 day 86400
 seconds (60*60*24).  As long as osd_scrub_max_interval =
 osd_deep_scrub_interval then the load won’t impact when deep scrub
 occurs.   I suggest that osd_scrub_min_interval =
 osd_scrub_max_interval = osd_deep_scrub_interval.
 
 I’d like to know how you have those 3 values set, so I can confirm that
 this explains the issue.
 
They are and were unsurprisingly set to the default values.

Now to provide some more information: shortly after the inception of this
cluster I did initiate a deep scrub on all OSDs at 00:30 on a Sunday
morning (the things we do for Ceph; a scheduler with a variety of rules
would be nice, but I digress). 
This took until 05:30 despite the cluster being idle and with close to no
data in it. In retrospect it seems clear to me that this already was
influenced by the load threshold (a scrub I initiated with the new
threshold value of 1.5 finished in just 30 minutes last night).
Consequently all the normal scrubs happened in the same time frame until
this weekend on the 21st (normal scrub).
The deep scrub on the 22nd clearly ran into the load threshold.

So if I understand you correctly, setting osd_scrub_max_interval to 6 days
should have deep scrubs ignore the load threshold, as per the documentation?
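
In other words, something along these lines in ceph.conf (all intervals in
seconds; the figures are just illustrative of the settings under discussion):

  [osd]
  osd scrub min interval   = 86400    ; 1 day (default)
  osd scrub max interval   = 518400   ; 6 days instead of the 604800 default
  osd deep scrub interval  = 604800   ; 1 week (default)
  osd scrub load threshold = 1.5      ; the value I used above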

Regards,

Christian

 
 David Zafman
 Senior Developer
 http://www.inktank.com
 http://www.redhat.com
 
 On Jun 23, 2014, at 7:01 PM, Christian Balzer ch...@gol.com wrote:
 
  
  Hello,
  
  On Mon, 23 Jun 2014 14:20:37 -0400 Gregory Farnum wrote:
  
  Looks like it's a doc error (at least on master), but it might have
  changed over time. If you're running Dumpling we should change the
  docs.
  
  Nope, I'm running 0.80.1 currently.
  
  Christian
  
  -Greg
  Software Engineer #42 @ http://inktank.com | http://ceph.com
  
  
  On Sun, Jun 22, 2014 at 10:18 PM, Christian Balzer ch...@gol.com
  wrote:
  
  Hello,
  
  This weekend I noticed that the deep scrubbing took a lot longer than
  usual (long periods without a scrub running/finishing), even though
  the cluster wasn't all that busy.
  It was however busier than in the past and the load average was above
  0.5 frequently.
  
  Now according to the documentation osd scrub load threshold is
  ignored when it comes to deep scrubs.
  
  However after setting it to 1.5 and restarting the OSDs the
  floodgates opened and all those deep scrubs are now running at full
  speed.
  
  Documentation error or did I unstuck something by the OSD restart?
  
  Regards,
  
  Christian
  --
  Christian Balzer        Network/Systems Engineer
  ch...@gol.com   Global OnLine Japan/Fusion Communications
  http://www.gol.com/
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
  
  
  -- 
  Christian Balzer        Network/Systems Engineer
  ch...@gol.com   Global OnLine Japan/Fusion Communications
  http://www.gol.com/
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance on all SSD cluster

2014-06-24 Thread Robert van Leeuwen
 All of which means that Mysql performance (looking at you binlog) may
 still suffer due to lots of small block size sync writes.

Which begs the question:
is anyone running a reasonably busy MySQL server on Ceph-backed storage?

We tried and it did not perform well enough. 
We have a small Ceph cluster: 3 machines, each with 2 SSD journals and 10 spinning 
disks.
Using Ceph through KVM RBD we were seeing performance equal to about 1-2 
spinning disks.

Reading this thread, it now looks a bit as if there are inherent architecture and 
latency issues that would prevent it from performing well as a MySQL database 
store.
I'd be interested in example setups where people are running busy databases on 
Ceph-backed volumes.

Cheers,
Robert
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance on all SSD cluster

2014-06-24 Thread Mark Kirkwood

On 24/06/14 18:15, Robert van Leeuwen wrote:

All of which means that Mysql performance (looking at you binlog) may
still suffer due to lots of small block size sync writes.


Which begs the question:
Anyone running a reasonable busy Mysql server on Ceph backed storage?

We tried and it did not perform good enough.
We have a small ceph cluster: 3 machines with 2 SSD journals and 10 spinning 
disks each.
Using ceph trough kvm rbd we were seeing performance equal to about 1-2 
spinning disks.

Reading this thread it now looks a bit if there are inherent architecture + 
latency issues that would prevent it from performing great as a Mysql database 
store.
I'd be interested in example setups where people are running busy databases on 
Ceph backed volumes.


Yes indeed,

We have looked extensively at Postgres performance on rbd - and while it 
is not MySQL, the underlying mechanism for durable writes (i.e. commit) 
is essentially very similar (fsync, fdatasync and friends). We achieved 
quite reasonable performance (by that I mean sufficiently encouraging to 
be happy to host real datastores for our moderately busy systems - and 
we are continuing to investigate using it for our really busy ones).


I have not experimented extensively with the various choices of flush 
method (called sync method in Postgres, but the same idea), as we found 
quite good performance with the default (fdatasync). However, this is 
clearly an area that is worth investigating.
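
For reference, the Postgres knob in question is wal_sync_method; a minimal
postgresql.conf sketch (assuming a reasonably recent Postgres on Linux, values
illustrative):

  # syscall used to flush WAL at commit time; fdatasync is the Linux default
  wal_sync_method    = fdatasync
  # per-transaction durability; leave on unless losing recent commits is acceptable
  synchronous_commit = on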



Regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Firefly OSDs : set_extsize: FSSETXATTR: (22) Invalid argument

2014-06-24 Thread Ilya Dryomov
On Tue, Jun 24, 2014 at 12:02 PM, Florent B flor...@coppint.com wrote:
 Hi all,

 On 2 Firefly cluster, I have a lot of errors like this on my OSDs :

 2014-06-24 09:54:39.088469 7fb5b8628700  0
 xfsfilestorebackend(/var/lib/ceph/osd/ceph-4) set_extsize: FSSETXATTR:
 (22) Invalid argument

 Both are using XFS, *without* filestore_xattr_use_omap = true. I read
 that was not necessary for XFS...

 What could be the problem ?

 Both clusters are using a RedHat 3.10 kernel on Debian Wheezy.

Have you done a Ceph upgrade recently?  This is most probably an
artifact of the upgrade, caused by a bug (omission) that has been
fixed.  Nothing serious: set_extsize simply tries to set an allocation
size hint; it doesn't affect anything else.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Firefly OSDs : set_extsize: FSSETXATTR: (22) Invalid argument

2014-06-24 Thread Ilya Dryomov
On Tue, Jun 24, 2014 at 1:15 PM, Florent B flor...@coppint.com wrote:
 On 06/24/2014 11:13 AM, Ilya Dryomov wrote:
 On Tue, Jun 24, 2014 at 12:02 PM, Florent B flor...@coppint.com wrote:
 Hi all,

 On 2 Firefly cluster, I have a lot of errors like this on my OSDs :

 2014-06-24 09:54:39.088469 7fb5b8628700  0
 xfsfilestorebackend(/var/lib/ceph/osd/ceph-4) set_extsize: FSSETXATTR:
 (22) Invalid argument

 Both are using XFS, *without* filestore_xattr_use_omap = true. I read
 that was not necessary for XFS...

 What could be the problem ?

 Both clusters are using a RedHat 3.10 kernel on Debian Wheezy.
 Have you done a ceph upgrade recently?  This is most probably an
 artifact of the upgrade, caused by a bug (omission) that has been
 fixed.  Nothing serious: set_extsize simply tries to set an allocation
 size hint, it doesn't affect anything else.

 Thanks,

 Ilya

 Yes of course I upgraded from Emperor when Firefly was released.

 What did I miss ?

You missed nothing; the set_extsize code just didn't take upgrades into
account.  The fix for that should be in the next Firefly release.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance on all SSD cluster

2014-06-24 Thread Mark Kirkwood

On 23/06/14 19:16, Mark Kirkwood wrote:

For database types (and yes I'm one of those)...you want to know that
your writes (particularly your commit writes) are actually making it to
persistent storage (that ACID thing you know). Now I see RBD cache very
like battery backed RAID cards - your commits (i.e fsync or O_DIRECT
writes) are not actually written, but are cached - so you are depending
on the reliability of a) your RAID controller battery etc in that case
or more interestingly b) your Ceph topology - to withstand node
failures. Given we usually design a Ceph cluster with these things in
mind it is probably ok!



Thinking about this a bit more (and noting Mark N's comment too), this 
is a bit more subtle than what I indicated above:


The rbd cache lives at the *client* level, so (thinking in OpenStack 
terms): if your VM fails - no problem, the compute node still has the write 
cache in memory... but how about if the compute node itself fails? 
This is analogous to: what if your battery-backed RAID card self 
destructs? The answer would appear to be data loss, so rbd cache 
reliability looks to be dependent on the resilience of the 
client/compute design.
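
For completeness, the cache being discussed is the librbd one on the compute
node, enabled roughly like this (option names as of Firefly; values illustrative
and the defaults may differ by version):

  [client]
  rbd cache = true
  rbd cache size = 33554432                    ; 32 MB of cache per client
  rbd cache max dirty = 25165824               ; dirty bytes allowed before writes block
  rbd cache writethrough until flush = true    ; stay writethrough until the guest flushes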


Regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 'osd pool set-quota' behaviour with CephFS

2014-06-24 Thread george.ryall
Last week I decided to take a look at the 'osd pool set-quota' option.

I have a directory in CephFS that uses a pool called pool-2 (configured by 
following this: 
http://www.sebastien-han.fr/blog/2013/02/11/mount-a-specific-pool-with-cephfs/).
Inside that I have a directory filled with cat pictures. I ran 'rados df'. I 
then copied a couple more cat pictures into my directory using 'cp file 
destination && sync'. I then ran 'rados df' again; this showed an increase in 
the object count for the pool equal to the number of additional cat pictures 
and an increase in the pool size equal to the size of the cat pictures, as 
expected.

I then used the command 'ceph osd pool set-quota {pool-name} [max_objects 
{obj-count}] [max_bytes {bytes}]', as per 
http://ceph.com/docs/master/rados/operations/pools/, and set an object limit a 
couple of objects above the current object count. I then ran a loop copying 
more cat pictures one at a time (again with '&& sync' each time). Whilst doing 
this I ran 'rados df'; the number of objects in the pool increased up to the 
limit and stopped. However, on the machine copying the cat pictures, the copying 
appeared to work fine and running ls showed more pictures than the 'rados df' 
output would suggest should be there. If I accessed the same directory from a 
different machine, I saw only the pictures that were copied up to the 
limit. If I then removed the limit, the images would appear in the directory 
and 'rados df' would report a larger number of objects. Similar behaviour was 
observed when setting a size limit. What's going on? Is this expected 
behaviour?
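
For reference, the commands involved were along these lines (pool name and
limits illustrative):

  ceph osd pool set-quota pool-2 max_objects 120        # cap the pool at 120 objects
  ceph osd pool set-quota pool-2 max_bytes 1073741824   # or cap it at 1 GiB
  rados df                                              # watch the object/byte counts
  ceph osd pool set-quota pool-2 max_objects 0          # 0 removes the quota again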


George Ryall

Scientific Computing | STFC Rutherford Appleton Laboratory | Harwell Oxford | 
Didcot | OX11 0QX
(01235 44) 5021



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance on all SSD cluster

2014-06-24 Thread Mark Nelson

On 06/24/2014 03:45 AM, Mark Kirkwood wrote:

On 24/06/14 18:15, Robert van Leeuwen wrote:

All of which means that Mysql performance (looking at you binlog) may
still suffer due to lots of small block size sync writes.


Which begs the question:
Anyone running a reasonable busy Mysql server on Ceph backed storage?

We tried and it did not perform good enough.
We have a small ceph cluster: 3 machines with 2 SSD journals and 10
spinning disks each.
Using ceph trough kvm rbd we were seeing performance equal to about
1-2 spinning disks.

Reading this thread it now looks a bit if there are inherent
architecture + latency issues that would prevent it from performing
great as a Mysql database store.
I'd be interested in example setups where people are running busy
databases on Ceph backed volumes.


Yes indeed,

We have looked extensively at Postgres performance on rbd - and while it
is not Mysql, the underlying mechanism for durable writes (i.e commit)
is essentially very similar (fsync, fdatasync and friends). We achieved
quite reasonable performance (by that I mean sufficiently encouraging to
be happy to host real datastores for our moderately busy systems - and
we are continuing to investigate using it for our really busy ones).

I have not experimented exptensively with the various choices of flush
method (called sync method in Postgres but the same idea), as we found
quite good performance with the default (fdatasync). However this is
clearly an area that is worth investigation.


FWIW, I ran through the DBT-3 benchmark suite on MariaDB on top of 
qemu/kvm RBD, with a 3x-replicated pool on 30 OSDs. 
I kept buffer sizes small to try to force disk IO and benchmarked 
against a local disk passed through to the VM.  We were typically about 
3-4x faster on queries than the local disk, but there were a couple of 
queries where we were slower.  I didn't look at how multiple databases 
scaled, though.  That may have its own benefits and challenges.


I'm encouraged overall, though.  From your comments and from my own testing, 
it looks like it's possible to have at least passable performance 
with a single database and, as we reduce latency in Ceph, potentially make 
it even better.  With multiple databases, it's entirely possible that we 
can do pretty well even now.





Regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance on all SSD cluster

2014-06-24 Thread Mark Nelson

On 06/24/2014 04:46 AM, Mark Kirkwood wrote:

On 23/06/14 19:16, Mark Kirkwood wrote:

For database types (and yes I'm one of those)...you want to know that
your writes (particularly your commit writes) are actually making it to
persistent storage (that ACID thing you know). Now I see RBD cache very
like battery backed RAID cards - your commits (i.e fsync or O_DIRECT
writes) are not actually written, but are cached - so you are depending
on the reliability of a) your RAID controller battery etc in that case
or more interestingly b) your Ceph topology - to withstand node
failures. Given we usually design a Ceph cluster with these things in
mind it is probably ok!



Thinking about this a bit more (and noting Mark N's comment too), this
is a bit more subtle that what I indicated above:

The rbd cache lives at the *client* level so (thinking in Openstack
terms): if your VM fails - no problem, the compute node has the write
cache in memory...ok, but how about if the compute node itself fails?
This is analogous to: how about if your battery backed raid card self
destructs? The answer would appear to be data loss, so rbd cache
reliability looks to be dependent on the resilience of the
client/compute design.


Well, it's the same problem you have with the cache on most spinning disks. 
You just have to assume that anything that wasn't flushed might not 
have made it.  Depending on the use case, that might or might not be an 
acceptable assumption.


In terms of data loss, the way I like to look at this is that there is 
always a spectrum.  Even with battery-backed RAID cards you don't have 
any guarantee that any given write is going to make it out of RAM and to 
the controller before a system crash.  What's more important imho is 
making sure you know exactly what the granularity is and what kind of 
guarantees you do get.




Regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance on all SSD cluster

2014-06-24 Thread Jake Young
On Mon, Jun 23, 2014 at 3:03 PM, Mark Nelson mark.nel...@inktank.com
wrote:

 Well, for random IO you often can't do much coalescing.  You have to bite
 the bullet and either parallelize things or reduce per-op latency.  Ceph
 already handles parallelism very well.  You just throw more disks at the
 problem and so long as there are enough client requests it more or less
 just scales (limited by things like network bisection bandwidth or other
 complications).  On the latency side, spinning disks aren't fast enough for
 Ceph's extra latency overhead to matter much, but with SSDs the story is
 different.  That's why we are very interested in reducing latency.

 Regarding journals:  Journal writes are always sequential (even for random
 IO!), but are O_DIRECT so they'll skip linux buffer cache.  If you have
 hardware that is fast at writing sequential small IO (say a controller with
 WB cache or an SSD), you can do journal writes very quickly.  For bursts of
  small random IO, performance can be quite good.  The downside is that you
 can hit journal limits very quickly, meaning you have to flush and wait for
 the underlying filestore to catch up. This results in performance that
 starts out super fast, then stalls once the journal limits are hit, back to
 super fast again for a bit, then another stall, etc.  This is less than
 ideal given the way crush distributes data across OSDs.  The alternative is
 setting a soft limit on how much data is in the journal and flushing
 smaller amounts of data more quickly to limit the spikey behaviour.  On the
 whole, that can be good but limits the burst potential and also limits the
 amount of data that could potentially be coalesced in the journal.


Mark,

What settings are you suggesting for setting a soft limit on journal size
and flushing smaller amounts of data?

Something like this?
filestore_queue_max_bytes: 10485760
filestore_queue_committing_max_bytes: 10485760
journal_max_write_bytes: 10485760
journal_queue_max_bytes: 10485760
ms_dispatch_throttle_bytes: 10485760
objecter_inflight_op_bytes: 10485760

(see Small bytes in
http://ceph.com/community/ceph-bobtail-jbod-performance-tuning)



 Luckily with RBD you can (when applicable) coalesce on the client with RBD
 cache instead, which is arguably better anyway since you can send bigger
 IOs to the OSDs earlier in the write path.  So long as you are ok with what
 RBD cache does and does not guarantee, it's definitely worth enabling imho.


Thanks,

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 'osd pool set-quota' behaviour with CephFS

2014-06-24 Thread Travis Rhoden
Hi George,

I actually asked Sage about a similar scenario at the OpenStack summit in
Atlanta this year -- namely if I could use the new pool quota functionality
to enforce quotas on CephFS.  The answer was no, that the pool quota
functionality is mostly intended for radosgw and that the existing cephfs
clients have no support for it.  He said the quota should work, actually,
but that you were likely to see some very strange behavior in cephfs.  That
sounds like what you've seen.  It won't be a graceful failure at all.

Quota support in CephFS is a different task, and one that I'm following as well.
See here: https://github.com/ceph/ceph/pull/1122

The pull request is old, but Sage did mention he was in contact with the
team working on the code and was hopeful to see it finished.

 - Travis


On Tue, Jun 24, 2014 at 7:06 AM, george.ry...@stfc.ac.uk wrote:

  Last week I decided to take a look at the ‘osd pool set-quota’ option.



 I have a directory in cephFS that uses a pool called pool-2 (configured by
 following this:
 http://www.sebastien-han.fr/blog/2013/02/11/mount-a-specific-pool-with-cephfs/).
 I have a directory in that filled with cat pictures. I ran ‘rados df’. I
 then copied a couple more cat pictures into my directory using ‘cp file
 destination && sync’. I then ran ‘rados df’ again, this showed an increase
 in the object count for the pool equal to the number of additional cat
 pictures and an increase in the pool size equal to the size of the cat
 pictures, as expected.



 I then used the command ‘ceph osd pool set-quota {pool-name} [max_objects
 {obj-count}] [max_bytes {bytes}]’, as per
 http://ceph.com/docs/master/rados/operations/pools/, and set an object
 limit a couple of objects bigger than the current pool size. I then ran a
 loop copying more cat pictures one at a time (again with ‘&& sync’) each
 time. Whilst doing this I ran ‘rados df’, the number of objects in the pool
 increased up to the limit and stopped. However on the machine copying the
 cat pictures, the copying appeared to work fine and running ls showed more
 pictures than the ‘rados df’ command would suggest should be there. If I
 accessed the same directory from a different machine, then I saw only the
 pictures that were copied up to the limit. If I then removed the limit, the
 images would appear in the directory and ‘rados df’ would report a larger
 number of objects. Similar behaviour was observed when setting a size
 limit.  What’s going on? Is this expected behaviour?





 George Ryall


 Scientific Computing | STFC Rutherford Appleton Laboratory | Harwell
 Oxford | Didcot | OX11 0QX

 (01235 44) 5021





 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] limitations of erasure coded pools

2014-06-24 Thread Chad Seys
Hi All,
  Could someone point me to a document (possibly a FAQ :) ) describing the 
limitations of erasure coded pools?  Hopefully it would contain the when and 
how to use them as well.
   E.g. I read about people using replicated pools as a front end to erasure 
coded pools, but I don't know why they're deciding to do this, or how they are 
setting this up.

Thanks!
Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple hierarchies and custom placement

2014-06-24 Thread Gregory Farnum
There's not really a simple way to do this. There are functions in the
OSDMap structure to calculate the location of a particular PG, but there
are a lot of independent places that map objects into PGs.
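
(If the aim is just to see where a particular object currently maps, the CLI can
do the calculation, e.g., with an illustrative pool/object name:

  ceph osd map rbd myobject    # prints the PG id and the acting set of OSDs
)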

On Monday, June 23, 2014, Shayan Saeed shayansaee...@gmail.com wrote:

 Thanks for getting back with a helpful reply. Assuming that I change the
 source code to do custom placement, what are the places I need to look in
 the code to do that? I am currently trying to change the CRUSH code, but is
 there any place else I need to be concerned about?

 Regards,
 Shayan Saeed


 On Mon, Jun 23, 2014 at 2:14 PM, Gregory Farnum g...@inktank.com wrote:

 On Fri, Jun 20, 2014 at 4:23 PM, Shayan Saeed shayansaee...@gmail.com wrote:
  Is it allowed for crush maps to have multiple hierarchies for different
  pools. So for example, I want one pool to treat my cluster as flat with
  every host being equal but the other pool to have a more hierarchical
 idea
  as hosts-racks-root?

 Yes. It can get complicated, so make sure you know exactly what you're
 doing, but you can create different root buckets and link the OSDs
 in to each root in different ways.
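
 A minimal sketch of what such a map can look like once decompiled, with an extra
 flat root alongside the usual tree (ids, names and weights illustrative):

  root flat {
          id -10
          alg straw
          hash 0
          item osd.0 weight 1.000
          item osd.1 weight 1.000
          item osd.2 weight 1.000
  }

  rule flat_ruleset {
          ruleset 1
          type replicated
          min_size 1
          max_size 10
          step take flat
          step chooseleaf firstn 0 type osd
          step emit
  }

 The other pool keeps a rule that does 'step take default' and
 'chooseleaf ... type rack', and each pool is pointed at its rule with
 'ceph osd pool set {pool} crush_ruleset {n}'.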

 
  Also, is it currently possible in ceph to have a custom placement of
 erasure
  coded chunks. So for example within a pool, I want objects to reside
 exactly
  on the OSDs I choose instead of doing placement for load balancing. Can
 I
  specify something like: For object 1, I want systematic chunks on
 rack1 and
  non systematic distributed between rack2 and rack3 and then for object
 2, I
  want systematic ones on rack2 and non systematic distributed between
 rack1
  and rack3?

 Not generally, no — you need to let the CRUSH algorithm place them.
 You can do things like specify specific buckets within a CRUSH rule,
 but that applies on a pool level.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com

 
  I would greatly appreciate any suggestions I get.
 
  Regards,
  Shayan Saeed
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 




-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Continuing placement group problems

2014-06-24 Thread Peter Howell
We are running two Ceph clusters, both on version 0.80 and both on ZFS. We are 
frequently getting inconsistent placement groups on both clusters.


We suspect that there is a problem with the network that is 
randomly corrupting the update of placement groups. Does anyone have any 
suggestions as to where and how to look for the problem? The network 
does not seem to have any problems, ZFS is not reporting any problems 
with the disks, and the OSDs are fine.
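
For reference, this is roughly how the inconsistent PGs can be listed, inspected
and, after checking the OSD logs for the scrub errors, repaired (pg ids
illustrative):

  ceph health detail | grep inconsistent    # list the PGs marked inconsistent
  ceph pg 2.3f query                        # inspect the PG's state and acting OSDs
  ceph pg repair 2.3f                       # ask the primary OSD to repair it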


Thanks

Peter.

Cluster status output as follows:

 health HEALTH_ERR 50 pgs inconsistent; 121 scrub errors
 monmap e8: 6 mons at 
{broll=10.5.8.9:6789/0,gelbin=10.5.8.10:6789/0,magni=10.5.8.12:6789/0,sicco=10.5.8.11:6789/0,tyrande=10.5.8.8:6789/0,varian=10.5.8.14:6789/0},
 election epoch 272, quorum 0,1,2,3,4,5 tyrande,broll,gelbin,sicco,magni,varian
 mdsmap e430: 1/1/1 up {0=broll=up:active}, 5 up:standby
 osdmap e18928: 7 osds: 7 up, 7 in
  pgmap v4910054: 512 pgs, 4 pools, 13043 MB data, 3681 objects
40800 MB used, 856 GB / 895 GB avail
 462 active+clean
  50 active+clean+inconsistent
  client io 12769 B/s rd, 5 op/s


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RGW + S3 Client (s3cmd)

2014-06-24 Thread Francois Deppierraz
Hi Vickey,

This really looks like a DNS issue. Are you sure that the host from
which s3cmd is running is able to resolve the host 'bmi-pocfe2.scc.fi'?

Does a regular ping work?

$ ping bmi-pocfe2.scc.fi

François

On 23. 06. 14 16:24, Vickey Singh wrote:
 # s3cmd ls
 
 WARNING: Retrying failed request: / ([Errno -2] Name or service not known)
 
 WARNING: Waiting 3 sec...
 
 WARNING: Retrying failed request: / ([Errno -2] Name or service not known)
 
 WARNING: Waiting 6 sec...
 
 WARNING: Retrying failed request: / ([Errno -2] Name or service not known)
 
 WARNING: Waiting 9 sec...
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep scrub versus osd scrub load threshold

2014-06-24 Thread David Zafman

Unfortunately, decreasing the osd_scrub_max_interval to 6 days isn’t going to 
fix it.

There is sort of a quirk in the way the deep scrub is initiated.  It doesn't 
trigger a deep scrub until a regular scrub is about to start.  So with 
osd_scrub_max_interval set to 1 week and a high load, the next possible scrub or 
deep-scrub is 1 week from the last REGULAR scrub, even if the last deep scrub 
was more than 7 days ago.  

The longest wait for a deep scrub is osd_scrub_max_interval + 
osd_deep_scrub_interval between deep scrubs.

For example, a deep scrub happens on Jan 1.  Each day after that for six days a 
regular scrub happens with low load.  After 6 regular scrubs ending on Jan 7 
the load goes high.  Now with the load high no scrub can start until Jan 14 
because you must get past osd_scrub_max_interval since the last regular scrub 
on Jan 7.  At that time it will be a deep scrub because it is more than 7 days 
since the last deep scrub on Jan 1.

See also http://tracker.ceph.com/issues/6735

There may be a need for more documentation clarification in this area or a 
change to the behavior.
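
(For anyone wanting to check where an individual PG stands, the last scrub and
deep-scrub timestamps are visible in the PG stats, e.g. with an illustrative pg
id:

  ceph pg 2.1f query | grep scrub_stamp

which should show last_scrub_stamp and last_deep_scrub_stamp.)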

David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com

On Jun 23, 2014, at 11:10 PM, Christian Balzer ch...@gol.com wrote:

 
 
 Hello,
 
 On Mon, 23 Jun 2014 21:50:50 -0700 David Zafman wrote:
 
 
 By default osd_scrub_max_interval and osd_deep_scrub_interval are 1 week
 604800 seconds (60*60*24*7) and osd_scrub_min_interval is 1 day 86400
 seconds (60*60*24).  As long as osd_scrub_max_interval =
 osd_deep_scrub_interval then the load won’t impact when deep scrub
 occurs.   I suggest that osd_scrub_min_interval =
 osd_scrub_max_interval = osd_deep_scrub_interval.
 
 I’d like to know how you have those 3 values set, so I can confirm that
 this explains the issue.
 
 They are and were unsurprisingly set to the default values.
 
 Now to provide some more information, shortly after the inception of this
 cluster I did initiate a deep scrub on all OSDs on 00:30 on a Sunday
 morning (the things we do for Ceph, a scheduler with a variety of rules
 would be nice, but I digress). 
 This took until 05:30 despite the cluster being idle and with close to no
 data in it. In retrospect it seems clear to me that this already was
 influenced by the load threshold (a scrub I initiated with the new
 threshold value of 1.5 finished in just 30 minutes last night).
 Consequently all the normal scrubs happened in the same time frame until
 this weekend on the 21st (normal scrub).
 The deep scrub on the 22nd clearly ran into the load threshold.
 
 So if I understand you correctly setting osd_scrub_max_interval to 6 days
 should have deep scrubs ignore the load threshold as per the documentation?
 
 Regards,
 
 Christian
 
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 http://www.redhat.com
 
 On Jun 23, 2014, at 7:01 PM, Christian Balzer ch...@gol.com wrote:
 
 
 Hello,
 
 On Mon, 23 Jun 2014 14:20:37 -0400 Gregory Farnum wrote:
 
 Looks like it's a doc error (at least on master), but it might have
 changed over time. If you're running Dumpling we should change the
 docs.
 
 Nope, I'm running 0.80.1 currently.
 
 Christian
 
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 On Sun, Jun 22, 2014 at 10:18 PM, Christian Balzer ch...@gol.com
 wrote:
 
 Hello,
 
 This weekend I noticed that the deep scrubbing took a lot longer than
 usual (long periods without a scrub running/finishing), even though
 the cluster wasn't all that busy.
 It was however busier than in the past and the load average was above
 0.5 frequently.
 
 Now according to the documentation osd scrub load threshold is
 ignored when it comes to deep scrubs.
 
 However after setting it to 1.5 and restarting the OSDs the
 floodgates opened and all those deep scrubs are now running at full
 speed.
 
 Documentation error or did I unstuck something by the OSD restart?
 
 Regards,
 
 Christian
 --
 Christian Balzer        Network/Systems Engineer
 ch...@gol.com   Global OnLine Japan/Fusion Communications
 http://www.gol.com/
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 -- 
 Christian Balzer        Network/Systems Engineer
 ch...@gol.com   Global OnLine Japan/Fusion Communications
 http://www.gol.com/
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 -- 
 Christian Balzer        Network/Systems Engineer
 ch...@gol.com Global OnLine Japan/Fusion Communications
 http://www.gol.com/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RGW + S3 Client (s3cmd)

2014-06-24 Thread Stephan Fabel

On 06/23/2014 04:24 AM, Vickey Singh wrote:
 host_bucket = %(bucket)s.bmi-pocfe2.scc.fi

Should there be a '.' (period) between %(bucket) and
s.bmi-pocfe2.scc.fi?

-Stephan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] limitations of erasure coded pools

2014-06-24 Thread Blair Bethwaite
 Message: 24
 Date: Tue, 24 Jun 2014 09:39:50 -0500
 From: Chad Seys cws...@physics.wisc.edu
 To: ceph-users@lists.ceph.com ceph-users@lists.ceph.com
 Subject: [ceph-users] limitations of erasure coded pools
 Message-ID: 201406240939.50550.cws...@physics.wisc.edu
 Content-Type: Text/Plain;  charset=us-ascii

 Hi All,
   Could someone point me to a document (possibly a FAQ :) ) describing the
 limitations of erasure coded pools?  Hopefully it would contain the when and
 how to use them as well.

Hi Chad, this Ceph Enterprise 1.2 FAQ provides a good overview:
https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf

E.g. I read about people using replicated pools as a front end to erasure
 coded pools, but I don't know why they're deciding to do this, or how they are
 setting this up.

Unless you have a very specific use-case, you don't want to
interact directly with an EC pool, for a number of reasons - but
here's one really good one: objects cannot be modified in place, so
to update one the OSDs have to read (at least) k chunks, reconstruct the
data, apply the write, and recompute the updated object's erasure coding.
This extra overhead probably makes EC unsuitable for block storage,
whereas it might be OK for particularly read-dominated object storage.
The replicated cache front-end/tier helps to service random IO bursts
(in writeback mode) and keeps hot objects available to serve to clients
without recomputing.
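
For the "how they are setting this up" part, the Firefly commands look roughly
like this (pool names, PG counts and the default erasure-code profile are
illustrative):

  ceph osd pool create ecpool 128 128 erasure     # erasure-coded backing pool
  ceph osd pool create cachepool 128              # replicated pool to act as the cache
  ceph osd tier add ecpool cachepool              # attach the cache tier to the EC pool
  ceph osd tier cache-mode cachepool writeback    # absorb writes in the cache
  ceph osd tier set-overlay ecpool cachepool      # route client IO to the cache first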

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance on all SSD cluster

2014-06-24 Thread Mark Kirkwood

On 24/06/14 23:39, Mark Nelson wrote:

On 06/24/2014 03:45 AM, Mark Kirkwood wrote:

On 24/06/14 18:15, Robert van Leeuwen wrote:

All of which means that Mysql performance (looking at you binlog) may
still suffer due to lots of small block size sync writes.


Which begs the question:
Anyone running a reasonable busy Mysql server on Ceph backed storage?

We tried and it did not perform good enough.
We have a small ceph cluster: 3 machines with 2 SSD journals and 10
spinning disks each.
Using ceph trough kvm rbd we were seeing performance equal to about
1-2 spinning disks.

Reading this thread it now looks a bit if there are inherent
architecture + latency issues that would prevent it from performing
great as a Mysql database store.
I'd be interested in example setups where people are running busy
databases on Ceph backed volumes.


Yes indeed,

We have looked extensively at Postgres performance on rbd - and while it
is not Mysql, the underlying mechanism for durable writes (i.e commit)
is essentially very similar (fsync, fdatasync and friends). We achieved
quite reasonable performance (by that I mean sufficiently encouraging to
be happy to host real datastores for our moderately busy systems - and
we are continuing to investigate using it for our really busy ones).

I have not experimented exptensively with the various choices of flush
method (called sync method in Postgres but the same idea), as we found
quite good performance with the default (fdatasync). However this is
clearly an area that is worth investigation.


FWIW, I ran through the DBT-3 benchmark suite on MariaDB ontop of
qemu/kvm RBD with a 3X replication pool on 30 OSDs with 3x replication.
  I kept buffer sizes small to try to force disk IO and benchmarked
against a local disk passed through to the VM.  We typically did about
3-4x faster on queries than the local disk, but there were a couple of
queries were we were slower.  I didn't look at how multiple databases
scaled though.  That may have it's own benefits and challenges.

I'm encouraged overall though.  It looks like from your comments and
from my own testing it's possible to have at least passable performance
with a single database and potentially as we reduce latency in Ceph make
it even better.  With multiple databases, it's entirely possible that we
can do pretty good even now.



Yes - same kind of findings, specifically:

- random read and write (e.g. index access): faster than local disk
- sequential write (e.g. batch inserts): similar to or faster than local disk
- sequential read (e.g. table scan): slower than local disk

Regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com