Re: [ceph-users] requests are blocked - problem

2015-08-19 Thread Jacek Jarosiewicz

Hi,

On 08/19/2015 11:01 AM, Christian Balzer wrote:


Hello,

That's a pretty small cluster all things considered, so your rather
intensive test setup is likely to run into any or all of the following
issues:

1) The amount of data you're moving around is going to cause a lot of
promotions to and from the cache tier. This is expensive and slow.
2) EC-coded pools are slow. You may actually get better results with a
classic Ceph approach: 2-4 HDDs per journal SSD. Also, 6TB HDDs combined
with EC may look nice to you from a cost/density perspective, but more HDDs
mean more IOPS and thus more speed.
3) Scrubbing (unless throttled very aggressively) will
impact your performance on top of the items above.
4) You already noted the kernel versus userland bit.
5) Having all your storage in a single JBOD chassis strikes me as
ill-advised, though I don't think it's an actual bottleneck at 4x12Gb/s.



We use two of these (I forgot to mention that).
Each chassis has two internal controllers - both exposing all the disks 
to the connected hosts. There are two OSD nodes connected to each chassis.



When you ran the fio tests I assume nothing else was going on and the
dataset size would have fit easily into the cache pool, right?

Look at your nodes with atop or iostat, I venture all your HDDs are at
100%.

Christian



Yes, the problem was a full cache pool. I'm currently wondering how 
to tune the cache pool parameters so that the whole cluster doesn't slow 
down that much when the cache is full...
I'm thinking of doing some tests on a pool w/o the cache tier so I can 
compare the results. Any suggestions would be greatly appreciated..


J

--
Jacek Jarosiewicz
Systems Administrator


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] requests are blocked - problem

2015-08-19 Thread Jacek Jarosiewicz

On 08/19/2015 10:58 AM, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Jacek Jarosiewicz
Sent: 19 August 2015 09:29
To: ceph-us...@ceph.com
Subject: [ceph-users] requests are blocked - problem


I would suggest running the fio tests again, just to make sure that there isn't 
a problem with your newer config, but I suspect you will see equally bad 
performance with the fio tests now that the cache tier has begun to be more 
populated.



OK, I did the tests and you're right - the full cache was the problem. 
After flushing the cache and re-running fio, the results are again good (fast).
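For reference, a small random-write fio run along these lines is the kind of test
meant here - the path, size and runtime are illustrative only, not the exact test used:

fio --name=randwrite --filename=/mnt/rbdtest/fio.dat --size=4G \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=32 \
    --runtime=60 --time_based --group_reporting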


Is there a way to tune the cache parameters so that the whole cluster 
doesn't slow down that much and doesn't block requests?


We use defaults for the cache pool from the documentation:
hit_set_period 3600
cache_min_flush_age 600
cache_min_evict_age 1800

J

--
Jacek Jarosiewicz
Systems Administrator


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Latency impact on RBD performance

2015-08-19 Thread Logan Barfield
Hi,

We are currently using 2 OSD hosts with SSDs to provide RBD backed volumes
for KVM hypervisors.  This 'cluster' is currently set up in 'Location A'.

We are looking to move our hypervisors/VMs over to a new location, and will
have a 1Gbit link between the two datacenters.  We can run Layer 2 over the
link, and it should have ~10ms of latency.  Call the new datacenter
'Location B'.

One proposed solution for the migration is to set up new RBD hosts in the
new location, set up a new pool, and move the VM volumes to it.

The potential issue with this solution is that we can end up in a scenario
where the VM is running on a hypervisor in 'Location A', but
writing/reading to a volume in 'Location B'.

My question is: what kind of performance impact should we expect when
reading/writing over a link with ~10ms of latency?  Will it bring I/O
intensive operations (like databases) to a halt, or will it be 'tolerable'
for a short period (a few days).  Most of the VMs are running database
backed e-commerce sites.

My expectation is that 10ms for every I/O operation will cause a
significant impact, but we wanted to verify that before ruling it out as a
solution.  We will also be doing some internal testing of course.


I appreciate any feedback the community has.

- Logan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster_network with linklocal ipv6

2015-08-19 Thread Björn Lässig
On 08/18/2015 03:39 PM, Björn Lässig wrote:
 For not having any dependencies in my cluster network, i want to use
 only ipv6 link-local addresses on interface 'cephnet'.
 cluster_network = fe80::%cephnet/64

RFC4007 11.7

   The IPv6 addressing architecture [1] also defines the syntax of IPv6
   prefixes.  If the address portion of a prefix is non-global and its
   scope zone should be disambiguated, the address portion SHOULD be in
   the <address>%<zone_id> format.  For example, a link-local prefix fe80::/64 on the second
   link can be represented as follows:

fe80::%2/64

   In this combination, it is important to place the zone index portion
   before the prefix length when we consider parsing the format by a
   name-to-address library function [11].  That is, we can first
   separate the address with the zone index from the prefix length, and
   just pass the former to the library function.

so my original format would be the correct one.

just looked into the code ... and this is exactly what happens in:

src/common/ipaddr.cc: bool parse_network(const char *s, […]) {

It looks for a '/' and tries to cast everything after it to a long int.
Everything before the '/' is thrown into inet_pton:

ok = inet_pton(AF_INET6, addr, &((struct sockaddr_in6*)network)->sin6_addr);

and inet_pton6 from glibc does not look at '%'.
There is no chance this code could work with an interface name after or before
the '/'.

I'm using addresses with global scope now.

Björn



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] requests are blocked - problem

2015-08-19 Thread Nick Fisk




 -Original Message-
 From: Jacek Jarosiewicz [mailto:jjarosiew...@supermedia.pl]
 Sent: 19 August 2015 14:28
 To: Nick Fisk n...@fisk.me.uk; ceph-us...@ceph.com
 Subject: Re: [ceph-users] requests are blocked - problem
 
 On 08/19/2015 10:58 AM, Nick Fisk wrote:
  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
  Of Jacek Jarosiewicz
  Sent: 19 August 2015 09:29
  To: ceph-us...@ceph.com
  Subject: [ceph-users] requests are blocked - problem
 
  I would suggest running the fio tests again, just to make sure that there
 isn't a problem with your newer config, but I suspect you will see equally bad
 performance with the fio tests now that the cache tier has begun to be more
 populated.
 
 
 Ok, I did the tests and You're right - the full cache was the problem.
 After flushing cache and running fio results are again good (fast).
 
 Is there a way to tune the cache parameters so that the whole cluster
 doesn't slow down that much and doesn't block requests?
 
 We use defaults for the cache pool from the documentation:
 hit_set_period 3600
 cache_min_flush_age 600
 cache_min_evict_age 1800
 

Although you may get some benefit from tweaking parameters, I suspect you are 
near the performance ceiling of the current implementation of the tiering 
code. Could you post all the variables you set for the tiering, including 
target_max_bytes and the dirty/full ratios?
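For reference, these are set per cache pool with 'ceph osd pool set'; the pool name
and values below are placeholders, not recommendations:

ceph osd pool set cachepool target_max_bytes 200000000000
ceph osd pool set cachepool target_max_objects 1000000
ceph osd pool set cachepool cache_target_dirty_ratio 0.4
ceph osd pool set cachepool cache_target_full_ratio 0.8
ceph osd pool set cachepool cache_min_flush_age 600
ceph osd pool set cachepool cache_min_evict_age 1800

Lowering the dirty/full ratios (or target_max_bytes) makes flushing and eviction start
earlier, so the tier is less likely to hit its hard full limit under load.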

Since you are doing maildirs, which will have lots of small files, you might 
also want to try making the object size of the RBD smaller. This means less 
data needs to be shifted on each promotion/flush.
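For example, at image creation time the object size can be reduced via the order
parameter (2^order bytes per object); a hypothetical 100GB image with 1MB objects
instead of the default 4MB (order 22) would be something like:

rbd create --size 102400 --order 20 rbd/mailstore

Note this only applies to newly created images; an existing image would need to be
copied/migrated to pick up a different object size.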

 J
 
 --
 Jacek Jarosiewicz
 Administrator Systemów Informatycznych
 
 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Latency impact on RBD performance

2015-08-19 Thread Nick Fisk
I would suspect that you will notice a significant slowdown. Don't forget 
that's an extra 10ms on top of however long each IO already takes. Also, when 
the cluster does any sort of recovery it will likely get much worse.


 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Logan Barfield
 Sent: 19 August 2015 15:20
 To: ceph-us...@ceph.com
 Subject: [ceph-users] Latency impact on RBD performance
 
 Hi,
 
 We are currently using 2 OSD hosts with SSDs to provide RBD backed volumes
 for KVM hypervisors.  This 'cluster' is currently set up in 'Location A'.
 
 We are looking to move our hypervisors/VMs over to a new location, and will
 have a 1Gbit link between the two datacenters.  We can run Layer 2 over the
 link, and it should have ~10ms of latency.  Call the new datacenter 'Location
 B'.
 
 One proposed solution for the migration is to set up new RBD hosts in the
 new location, set up a new pool, and move the VM volumes to it.
 
 The potential issue with this solution is that we can end up in a scenario
 where the VM is running on a hypervisor in 'Location A', but writing/reading
 to a volume in 'Location B'.
 
 My question is: what kind of performance impact should we expect when
 reading/writing over a link with ~10ms of latency?  Will it bring I/O 
 intensive
 operations (like databases) to a halt, or will it be 'tolerable' for a short 
 period
 (a few days).  Most of the VMs are running database backed e-commerce
 sites.
 
 My expectation is that 10ms for every I/O operation will cause a significant
 impact, but we wanted to verify that before ruling it out as a solution.  We
 will also be doing some internal testing of course.
 
 
 I appreciate any feedback the community has.
 
 
 - Logan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Latency impact on RBD performance

2015-08-19 Thread Jan Schermer
This simply depends on what your workload is. I know this is a non-answer for 
you, but that's how it is.

Databases are the worst, because they tend to hit the disks with every 
transaction, and the transaction throughput is in direct proportion to the 
number of IOPS you can get. And the number of IOPS you can get (in this 
scenario) is basically iops = 1000 / latency_in_ms. There's not much parallelism 
when committing.

So if your storage has a 2ms latency now, it can achieve at most 500 IOPS 
(single thread/queue depth, synchronous) - which in theory equals 500 
durable transactions per second in a database.

But in practice:
a) mysql/innodb has an option not to flush every transaction but every Xth 
transaction. With X=10 it's like having a 5000 IOPS disk more or less.
b) there is filesystem overhead - if you are appending to the transaction log 
you not only have to flush the data, but also filesystem metadata, and that 
could be many more IOPS before you even get to the transaction - I've seen a 
factor of 10(!) with no preallocation and xfs. That's terrible. But not every 
database does this, for example mysql creates and preallocates iblogfileX files 
so that should not be an issue if you run mysql.
c) there can be other IOs blocking the submission, you have a limited queue 
depth of in-flight iops so it can clog up even if you turn the synchronous IO 
into asynchronous, and you often have to flush even the asynchronous writes
d) we must not forget about reads - if there's a webserver connecting to the 
database then it will need more memory because all requests will take 
longer (and they also often consume CPU even when waiting), and if there are 
multiple requests or subrequests it cascades and goes up fast.
---

Take a look in iostat at the drive utilization and latency of your database's 
disk. Then calculate:

resulting_disk_utilization = (latency + 10) / latency * utilization

Taking the figures from the previous example, let's say you have a 2ms latency and 
the drive is 20% utilized: (2 + 10) / 2 * 20 = 120%
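A quick back-of-the-envelope check of both formulas (using the same numbers as the
example above):

awk 'BEGIN { lat=2; add=10; util=20;
  printf "now: %d IOPS, with +%dms: %d IOPS, projected util: %d%%\n",
         1000/lat, add, 1000/(lat+add), (lat+add)/lat*util }'
# -> now: 500 IOPS, with +10ms: 83 IOPS, projected util: 120%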

My guess is that it's going to be horrible unless you turn everything into 
async IO, enable cache=unsafe in qemu and pray really hard :/

Jan


 On 19 Aug 2015, at 16:20, Logan Barfield lbarfi...@tqhosting.com wrote:
 
 Hi,
 
 We are currently using 2 OSD hosts with SSDs to provide RBD backed volumes 
 for KVM hypervisors.  This 'cluster' is currently set up in 'Location A'.
 
 We are looking to move our hypervisors/VMs over to a new location, and will 
 have a 1Gbit link between the two datacenters.  We can run Layer 2 over the 
 link, and it should have ~10ms of latency.  Call the new datacenter 'Location 
 B'.
 
 One proposed solution for the migration is to set up new RBD hosts in the new 
 location, set up a new pool, and move the VM volumes to it.
 
 The potential issue with this solution is that we can end up in a scenario 
 where the VM is running on a hypervisor in 'Location A', but writing/reading 
 to a volume in 'Location B'.
 
 My question is: what kind of performance impact should we expect when 
 reading/writing over a link with ~10ms of latency?  Will it bring I/O 
 intensive operations (like databases) to a halt, or will it be 'tolerable' 
 for a short period (a few days).  Most of the VMs are running database backed 
 e-commerce sites.
 
 My expectation is that 10ms for every I/O operation will cause a significant 
 impact, but we wanted to verify that before ruling it out as a solution.  We 
 will also be doing some internal testing of course.
 
 
 I appreciate any feedback the community has.
 
 - Logan 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph distributed osd

2015-08-19 Thread Robert LeBlanc

By default, all pools will use all OSDs. Each RBD, for instance, is
broken up into 4 MB objects and those objects are somewhat uniformly
distributed between the OSDs. When you add another OSD, the CRUSH map
is recalculated and the OSDs shuffle the objects to their new
locations somewhat uniformly distributing them across all available
OSDs.

I say "somewhat uniformly distributed" because placement is based on a hash
of the object name; object size is not taken into account, so you may
have more large objects on some OSDs than on others. The number of PGs
affects the ability to distribute the data more uniformly (more hash
buckets for data to land in).

You can create CRUSH rules that limit selection of OSDs to a subset
and then configure a pool to use those rules. This is a pretty
advanced configuration option.
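A rough sketch of that kind of setup (bucket, rule, and pool names are made up;
the rule id comes from 'ceph osd crush rule dump'):

# put the chosen hosts under their own CRUSH root
ceph osd crush add-bucket poolA-root root
ceph osd crush move node1 root=poolA-root
ceph osd crush move node2 root=poolA-root
# a rule that picks OSDs only from that root, one replica per host
ceph osd crush rule create-simple poolA-rule poolA-root host
# point the pool at the rule
ceph osd pool set poolA crush_ruleset 1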

I hope that helps with your question.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Aug 18, 2015 at 8:26 AM, gjprabu gjpr...@zohocorp.com wrote:
 Hi Luis,

 What I mean is: we have three OSDs, each with a 1TB hard disk, and two
 pools (poolA and poolB) with replica 2. The write behavior is what confuses
 us. Our assumptions are below.

 PoolA -- may write to OSD1 and OSD2 (is this correct?)

 PoolB -- may write to OSD3 and OSD1 (is this correct?)

 Suppose the hard disks get full; how many OSDs need to be added,
 and what will the write behavior to the new OSDs be?

 After adding a few OSDs:

 PoolA -- may write to OSD4 and OSD5 (is this correct?)
 PoolB -- may write to OSD5 and OSD6 (is this correct?)


 Regards
 Prabu

  On Mon, 17 Aug 2015 19:41:53 +0530 Luis Periquito periqu...@gmail.com
 wrote 

 I don't understand your question. You created a 1G RBD/disk and it's full.
 You are able to grow it, though - but that's a Linux management issue, not
 Ceph.

 As everything is thin-provisioned you can create an RBD with an arbitrary
 size - I've created one with 1PB when the cluster only had 600G raw
 available.
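 For the 1GB image from the earlier mail, growing it would look roughly like this
 (the size is in MB and purely illustrative):

 rbd resize repo/integrepotest --size 10240
 # then grow the filesystem on the client to match the new device size
 # (resize2fs for ext4; OCFS2 ships its own tunefs.ocfs2 tool for this)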

 On Mon, Aug 17, 2015 at 1:18 PM, gjprabu gjpr...@zohocorp.com wrote:

 Hi All,

   Can anybody help with this issue?

 Regards
 Prabu

  On Mon, 17 Aug 2015 12:08:28 +0530 gjprabu gjpr...@zohocorp.com wrote
 

 Hi All,

Also please find osd information.

 ceph osd dump | grep 'replicated size'
 pool 2 'repo' replicated size 2 min_size 2 crush_ruleset 0 object_hash
 rjenkins pg_num 126 pgp_num 126 last_change 21573 flags hashpspool
 stripe_width 0

 Regards
 Prabu




  On Mon, 17 Aug 2015 11:58:55 +0530 gjprabu gjpr...@zohocorp.com wrote
 



 Hi All,

We need to test three OSDs and one image with replica 2 (size 1GB). While
 testing, data cannot be written beyond 1GB. Is there any option to write to the
 third OSD?

 ceph osd pool get  repo  pg_num
 pg_num: 126

 # rbd showmapped
 id pool image          snap device
 0  rbd  integdownloads -    /dev/rbd0    -- already existing one
 2  repo integrepotest  -    /dev/rbd2    -- newly created


 [root@hm2 repository]# df -Th
 Filesystem   Type  Size  Used Avail Use% Mounted on
 /dev/sda5ext4  289G   18G  257G   7% /
 devtmpfs devtmpfs  252G 0  252G   0% /dev
 tmpfstmpfs 252G 0  252G   0% /dev/shm
 tmpfstmpfs 252G  538M  252G   1% /run
 tmpfstmpfs 252G 0  252G   0% /sys/fs/cgroup
 /dev/sda2ext4  488M  212M  241M  47% /boot
 /dev/sda4ext4  1.9T   20G  1.8T   2% /var
 /dev/mapper/vg0-zoho ext4  8.6T  1.7T  6.5T  21% /zoho
 /dev/rbd0ocfs2 977G  101G  877G  11% /zoho/build/downloads
 /dev/rbd2ocfs21000M 1000M 0 100% /zoho/build/repository

 @:~$ scp -r sample.txt root@integ-hm2:/zoho/build/repository/
 root@integ-hm2's password:
 sample.txt
 100% 1024MB   4.5MB/s   03:48
 scp: /zoho/build/repository//sample.txt: No space left on device

 Regards
 Prabu




  On Thu, 13 Aug 2015 19:42:11 +0530 gjprabu gjpr...@zohocorp.com wrote
 



 Dear Team,

  We are using two 

Re: [ceph-users] ceph distributed osd

2015-08-19 Thread gjprabu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bad performances in recovery

2015-08-19 Thread J-P Methot
Hi,

Thank you for the quick reply. However, we do have those exact settings
for recovery and it still strongly affects client io. I have looked at
various ceph logs and osd logs and nothing is out of the ordinary.
Here's an idea though, please tell me if I am wrong.

We use Intel SSDs for journaling and Samsung SSDs as the proper OSDs. As was
explained several times on this mailing list, Samsung SSDs suck in Ceph:
they have horrible O_DSYNC speed and die easily when used as journals.
That's why we're using Intel SSDs for journaling, so that we didn't end
up putting 96 Samsung SSDs in the trash.

In recovery though, what is the Ceph behaviour? What kind of write does
it do on the OSD SSDs? Does it write directly to the SSDs or through the
journal?

Additionally, something else we notice: the ceph cluster is MUCH slower
after recovery than before. Clearly there is a bottleneck somewhere and
that bottleneck does not get cleared up after the recovery is done.


On 2015-08-19 3:32 PM, Somnath Roy wrote:
 If you are concerned about *client io performance* during recovery, use these 
 settings..
 
 osd recovery max active = 1
 osd max backfills = 1
 osd recovery threads = 1
 osd recovery op priority = 1
 
 If you are concerned about *recovery performance*, you may want to bump this 
 up, but I doubt it will help much from default settings..
 
 Thanks  Regards
 Somnath
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J-P 
 Methot
 Sent: Wednesday, August 19, 2015 12:17 PM
 To: ceph-us...@ceph.com
 Subject: [ceph-users] Bad performances in recovery
 
 Hi,
 
 Our setup is currently comprised of 5 OSD nodes with 12 OSD each, for a total 
 of 60 OSDs. All of these are SSDs with 4 SSD journals on each. The ceph 
 version is hammer v0.94.1 . There is a performance overhead because we're 
 using SSDs (I've heard it gets better in infernalis, but we're not upgrading 
 just yet) but we can reach numbers that I would consider alright.
 
 Now, the issue is, when the cluster goes into recovery it's very fast at 
 first, but then slows down to ridiculous levels as it moves forward. You can 
 go from 7% to 2% to recover in ten minutes, but it may take 2 hours to 
 recover the last 2%. While this happens, the attached openstack setup becomes 
 incredibly slow, even though there is only a small fraction of objects still 
 recovering (less than 1%). The settings that may affect recovery speed are 
 very low, as they are by default, yet they still affect client io speed way 
 more than it should.
 
 Why would ceph recovery become so slow as it progress and affect client io 
 even though it's recovering at a snail's pace? And by a snail's pace, I mean 
 a few kb/second on 10gbps uplinks.
 --
 ==
 Jean-Philippe Méthot
 Administrateur système / System administrator GloboTech Communications
 Phone: 1-514-907-0050
 Toll Free: 1-(888)-GTCOMM1
 Fax: 1-(514)-907-0750
 jpmet...@gtcomm.net
 http://www.gtcomm.net
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 


-- 
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bad performances in recovery

2015-08-19 Thread Somnath Roy
All the writes will go through the journal.
It may be that your SSDs are not well preconditioned, and after the large amount of 
writes during recovery the IOPS have stabilized at a lower number. This is quite 
common for SSDs.

Thanks & Regards
Somnath

-Original Message-
From: J-P Methot [mailto:jpmet...@gtcomm.net] 
Sent: Wednesday, August 19, 2015 1:03 PM
To: Somnath Roy; ceph-us...@ceph.com
Subject: Re: [ceph-users] Bad performances in recovery

Hi,

Thank you for the quick reply. However, we do have those exact settings for 
recovery and it still strongly affects client io. I have looked at various ceph 
logs and osd logs and nothing is out of the ordinary.
Here's an idea though, please tell me if I am wrong.

We use intel SSDs for journaling and samsung SSDs as proper OSDs. As was 
explained several times on this mailing list, Samsung SSDs suck in ceph.
They have horrible O_dsync speed and die easily, when used as journal.
That's why we're using Intel ssds for journaling, so that we didn't end up 
putting 96 samsung SSDs in the trash.

In recovery though, what is the ceph behaviour? What kind of write does it do 
on the OSD SSDs? Does it write directly to the SSDs or through the journal?

Additionally, something else we notice: the ceph cluster is MUCH slower after 
recovery than before. Clearly there is a bottleneck somewhere and that 
bottleneck does not get cleared up after the recovery is done.


On 2015-08-19 3:32 PM, Somnath Roy wrote:
 If you are concerned about *client io performance* during recovery, use these 
 settings..
 
 osd recovery max active = 1
 osd max backfills = 1
 osd recovery threads = 1
 osd recovery op priority = 1
 
 If you are concerned about *recovery performance*, you may want to bump this 
 up, but I doubt it will help much from default settings..
 
 Thanks  Regards
 Somnath
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
 Of J-P Methot
 Sent: Wednesday, August 19, 2015 12:17 PM
 To: ceph-us...@ceph.com
 Subject: [ceph-users] Bad performances in recovery
 
 Hi,
 
 Our setup is currently comprised of 5 OSD nodes with 12 OSD each, for a total 
 of 60 OSDs. All of these are SSDs with 4 SSD journals on each. The ceph 
 version is hammer v0.94.1 . There is a performance overhead because we're 
 using SSDs (I've heard it gets better in infernalis, but we're not upgrading 
 just yet) but we can reach numbers that I would consider alright.
 
 Now, the issue is, when the cluster goes into recovery it's very fast at 
 first, but then slows down to ridiculous levels as it moves forward. You can 
 go from 7% to 2% to recover in ten minutes, but it may take 2 hours to 
 recover the last 2%. While this happens, the attached openstack setup becomes 
 incredibly slow, even though there is only a small fraction of objects still 
 recovering (less than 1%). The settings that may affect recovery speed are 
 very low, as they are by default, yet they still affect client io speed way 
 more than it should.
 
 Why would ceph recovery become so slow as it progress and affect client io 
 even though it's recovering at a snail's pace? And by a snail's pace, I mean 
 a few kb/second on 10gbps uplinks.
 --
 ==
 Jean-Philippe Méthot
 Administrateur système / System administrator GloboTech Communications
 Phone: 1-514-907-0050
 Toll Free: 1-(888)-GTCOMM1
 Fax: 1-(514)-907-0750
 jpmet...@gtcomm.net
 http://www.gtcomm.net
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 


--
==
Jean-Philippe Méthot
Administrateur système / System administrator GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Latency impact on RBD performance

2015-08-19 Thread Robert LeBlanc

You didn't specify your database, but if you are using mysql you can use:

[mysqld]
# Change disk flush to every second instead of after each transaction.
innodb_flush_log_at_trx_commit=2

to flush the logs roughly once a second instead of after every
transaction. It really depends on whether you can afford to lose any
transactions. This helped on a machine that had some high disk waits
and where I felt comfortable enough losing a couple of seconds of transactions.
In my situation it took an almost 30-minute query down to ~30 seconds. Now
that the disk issue is resolved, I don't use that option any more.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Aug 19, 2015 at 8:20 AM, Logan Barfield lbarfi...@tqhosting.com wrote:
 Hi,

 We are currently using 2 OSD hosts with SSDs to provide RBD backed volumes
 for KVM hypervisors.  This 'cluster' is currently set up in 'Location A'.

 We are looking to move our hypervisors/VMs over to a new location, and will
 have a 1Gbit link between the two datacenters.  We can run Layer 2 over the
 link, and it should have ~10ms of latency.  Call the new datacenter
 'Location B'.

 One proposed solution for the migration is to set up new RBD hosts in the
 new location, set up a new pool, and move the VM volumes to it.

 The potential issue with this solution is that we can end up in a scenario
 where the VM is running on a hypervisor in 'Location A', but writing/reading
 to a volume in 'Location B'.

 My question is: what kind of performance impact should we expect when
 reading/writing over a link with ~10ms of latency?  Will it bring I/O
 intensive operations (like databases) to a halt, or will it be 'tolerable'
 for a short period (a few days).  Most of the VMs are running database
 backed e-commerce sites.

 My expectation is that 10ms for every I/O operation will cause a significant
 impact, but we wanted to verify that before ruling it out as a solution.  We
 will also be doing some internal testing of course.


 I appreciate any feedback the community has.

 - Logan

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bad performances in recovery

2015-08-19 Thread J-P Methot
Hi,

Our setup is currently comprised of 5 OSD nodes with 12 OSD each, for a
total of 60 OSDs. All of these are SSDs with 4 SSD journals on each. The
ceph version is hammer v0.94.1 . There is a performance overhead because
we're using SSDs (I've heard it gets better in infernalis, but we're not
upgrading just yet) but we can reach numbers that I would consider
alright.

Now, the issue is, when the cluster goes into recovery it's very fast at
first, but then slows down to ridiculous levels as it moves forward. You
can go from 7% to 2% to recover in ten minutes, but it may take 2 hours
to recover the last 2%. While this happens, the attached openstack setup
becomes incredibly slow, even though there is only a small fraction of
objects still recovering (less than 1%). The settings that may affect
recovery speed are very low, as they are by default, yet they still
affect client IO speed way more than they should.

Why would Ceph recovery become so slow as it progresses, and affect client
IO, even though it's recovering at a snail's pace? And by a snail's pace,
I mean a few kB/second on 10Gbps uplinks.
-- 
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bad performances in recovery

2015-08-19 Thread Somnath Roy
If you are concerned about *client io performance* during recovery, use these 
settings..

osd recovery max active = 1
osd max backfills = 1
osd recovery threads = 1
osd recovery op priority = 1

If you are concerned about *recovery performance*, you may want to bump this 
up, but I doubt it will help much from default settings..
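These can go in the [osd] section of ceph.conf, or be applied to a running cluster
with something like this (values illustrative):

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'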

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J-P 
Methot
Sent: Wednesday, August 19, 2015 12:17 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] Bad performances in recovery

Hi,

Our setup is currently comprised of 5 OSD nodes with 12 OSD each, for a total 
of 60 OSDs. All of these are SSDs with 4 SSD journals on each. The ceph version 
is hammer v0.94.1 . There is a performance overhead because we're using SSDs 
(I've heard it gets better in infernalis, but we're not upgrading just yet) but 
we can reach numbers that I would consider alright.

Now, the issue is, when the cluster goes into recovery it's very fast at first, 
but then slows down to ridiculous levels as it moves forward. You can go from 
7% to 2% to recover in ten minutes, but it may take 2 hours to recover the last 
2%. While this happens, the attached openstack setup becomes incredibly slow, 
even though there is only a small fraction of objects still recovering (less 
than 1%). The settings that may affect recovery speed are very low, as they are 
by default, yet they still affect client io speed way more than it should.

Why would ceph recovery become so slow as it progress and affect client io even 
though it's recovering at a snail's pace? And by a snail's pace, I mean a few 
kb/second on 10gbps uplinks.
--
==
Jean-Philippe Méthot
Administrateur système / System administrator GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] any recommendation of using EnhanceIO?

2015-08-19 Thread Robert LeBlanc

 Probably the big question is: what are the pain points?  The most common
 answer we get when asking folks what applications they run on top of Ceph is
 "everything!".  This is wonderful, but not helpful when trying to figure out
 what performance issues matter most! :)

 IE, should we be focusing on IOPS?  Latency?  Finding a way to avoid journal
 overhead for large writes?  Are there specific use cases where we should
 specifically be focusing attention? general iscsi?  S3? databases directly
 on RBD? etc.  There's tons of different areas that we can work on (general
 OSD threading improvements, different messenger implementations, newstore,
 client side bottlenecks, etc) but all of those things tackle different kinds
 of problems.

We run "everything", or so it sure seems. I did some computation
over a large sampling of our servers and found that the average request
size was ~12K/~18K (read/write) at roughly a ~30%/70% split (it looks like I didn't
save that spreadsheet, so I can't get exact numbers).

So, any optimization for smaller I/O sizes would really benefit us.

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bad performances in recovery

2015-08-19 Thread Somnath Roy
Also, check if scrubbing started in the cluster or not. That may considerably 
slow down the cluster.
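A quick way to check, and to temporarily rule scrubbing out while testing (the flags
are cluster-wide; remember to unset them afterwards):

ceph -s                     # look for PGs in ...+scrubbing / +deep states
ceph osd set noscrub
ceph osd set nodeep-scrub
# ... re-test ...
ceph osd unset noscrub
ceph osd unset nodeep-scrub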

-Original Message-
From: Somnath Roy 
Sent: Wednesday, August 19, 2015 1:35 PM
To: 'J-P Methot'; ceph-us...@ceph.com
Subject: RE: [ceph-users] Bad performances in recovery

All the writes will go through the journal.
It may happen your SSDs are not preconditioned well and after a lot of writes 
during recovery IOs are stabilized to lower number. This is quite common for 
SSDs if that is the case.

Thanks  Regards
Somnath

-Original Message-
From: J-P Methot [mailto:jpmet...@gtcomm.net]
Sent: Wednesday, August 19, 2015 1:03 PM
To: Somnath Roy; ceph-us...@ceph.com
Subject: Re: [ceph-users] Bad performances in recovery

Hi,

Thank you for the quick reply. However, we do have those exact settings for 
recovery and it still strongly affects client io. I have looked at various ceph 
logs and osd logs and nothing is out of the ordinary.
Here's an idea though, please tell me if I am wrong.

We use intel SSDs for journaling and samsung SSDs as proper OSDs. As was 
explained several times on this mailing list, Samsung SSDs suck in ceph.
They have horrible O_dsync speed and die easily, when used as journal.
That's why we're using Intel ssds for journaling, so that we didn't end up 
putting 96 samsung SSDs in the trash.

In recovery though, what is the ceph behaviour? What kind of write does it do 
on the OSD SSDs? Does it write directly to the SSDs or through the journal?

Additionally, something else we notice: the ceph cluster is MUCH slower after 
recovery than before. Clearly there is a bottleneck somewhere and that 
bottleneck does not get cleared up after the recovery is done.


On 2015-08-19 3:32 PM, Somnath Roy wrote:
 If you are concerned about *client io performance* during recovery, use these 
 settings..
 
 osd recovery max active = 1
 osd max backfills = 1
 osd recovery threads = 1
 osd recovery op priority = 1
 
 If you are concerned about *recovery performance*, you may want to bump this 
 up, but I doubt it will help much from default settings..
 
 Thanks  Regards
 Somnath
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
 Of J-P Methot
 Sent: Wednesday, August 19, 2015 12:17 PM
 To: ceph-us...@ceph.com
 Subject: [ceph-users] Bad performances in recovery
 
 Hi,
 
 Our setup is currently comprised of 5 OSD nodes with 12 OSD each, for a total 
 of 60 OSDs. All of these are SSDs with 4 SSD journals on each. The ceph 
 version is hammer v0.94.1 . There is a performance overhead because we're 
 using SSDs (I've heard it gets better in infernalis, but we're not upgrading 
 just yet) but we can reach numbers that I would consider alright.
 
 Now, the issue is, when the cluster goes into recovery it's very fast at 
 first, but then slows down to ridiculous levels as it moves forward. You can 
 go from 7% to 2% to recover in ten minutes, but it may take 2 hours to 
 recover the last 2%. While this happens, the attached openstack setup becomes 
 incredibly slow, even though there is only a small fraction of objects still 
 recovering (less than 1%). The settings that may affect recovery speed are 
 very low, as they are by default, yet they still affect client io speed way 
 more than it should.
 
 Why would ceph recovery become so slow as it progress and affect client io 
 even though it's recovering at a snail's pace? And by a snail's pace, I mean 
 a few kb/second on 10gbps uplinks.
 --
 ==
 Jean-Philippe Méthot
 Administrateur système / System administrator GloboTech Communications
 Phone: 1-514-907-0050
 Toll Free: 1-(888)-GTCOMM1
 Fax: 1-(514)-907-0750
 jpmet...@gtcomm.net
 http://www.gtcomm.net
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 


--
==
Jean-Philippe Méthot
Administrateur système / System administrator GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph OSD nodes in XenServer VMs

2015-08-19 Thread Jiri Kanicky

Hi all,

We are experimenting with an idea to run OSD nodes in XenServer VMs. We 
believe this could provide better flexibility, backups for the nodes etc.


For example:
Xenserver with 4 HDDs dedicated for Ceph.
We would introduce 1 VM (OSD node) with raw/direct access to 4 HDDs or 2 
VMs (2 OSD nodes) with 2 HDDs each.


Do you have any experience with this? Any thoughts on this? Good or bad 
idea?


Thank you
Jiri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] any recommendation of using EnhanceIO?

2015-08-19 Thread Christian Balzer
On Wed, 19 Aug 2015 10:02:25 +0100 Nick Fisk wrote:

 
 
 
 
  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
  Of Christian Balzer
  Sent: 19 August 2015 03:32
  To: ceph-users@lists.ceph.com
  Cc: Nick Fisk n...@fisk.me.uk
  Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
  
  On Tue, 18 Aug 2015 20:48:26 +0100 Nick Fisk wrote:
  
  [mega snip]
   4. Disk-based OSD with SSD journal performance: As I touched on
   earlier, I would expect a disk-based OSD with an SSD journal to have
   similar performance to a pure SSD OSD when dealing with sequential
   small IOs. Currently the levelDB sync and potentially other things
   slow this down.
  
  
  Has anybody tried symlinking the omap directory to an SSD and tested if
  that makes a (significant) difference?
 
 I thought I remembered reading somewhere that all these items need to
 remain on the OSD itself so that when the OSD calls fsync it can be sure
 they are all in sync at the same time.

Would be nice to have this confirmed by the devs.
It being leveldb, you'd think it would be in sync by default.

But even if it were potentially unsafe (not crash safe) in the current
incarnation, the results of such a test might make any needed changes
attractive.
Unfortunately I don't have anything resembling a SSD in my test cluster. 
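For what it's worth, the untested idea would look roughly like this on a FileStore OSD
(which keeps its omap under current/omap); the OSD must be stopped first, and the
crash-safety concerns above still apply:

service ceph stop osd.0
rsync -a /var/lib/ceph/osd/ceph-0/current/omap/ /ssd/osd-0-omap/
mv /var/lib/ceph/osd/ceph-0/current/omap /var/lib/ceph/osd/ceph-0/current/omap.bak
ln -s /ssd/osd-0-omap /var/lib/ceph/osd/ceph-0/current/omap
service ceph start osd.0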

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] requests are blocked - problem

2015-08-19 Thread Christian Balzer

Hello,

On Wed, 19 Aug 2015 15:27:29 +0200 Jacek Jarosiewicz wrote:

 Hi,
 
 On 08/19/2015 11:01 AM, Christian Balzer wrote:
 
  Hello,
 
  That's a pretty small cluster all things considered, so your rather
  intensive test setup is likely to run into any or all of the following
  issues:
 
  1) The amount of data you're moving around is going to cause a lot of
  promotions to and from the cache tier. This is expensive and slow.
  2) EC-coded pools are slow. You may actually get better results with a
  classic Ceph approach: 2-4 HDDs per journal SSD. Also, 6TB HDDs
  combined with EC may look nice to you from a cost/density perspective,
  but more HDDs mean more IOPS and thus more speed.
  3) Scrubbing (unless throttled very aggressively) will
  impact your performance on top of the items above.
  4) You already noted the kernel versus userland bit.
  5) Having all your storage in a single JBOD chassis strikes me as
  ill-advised, though I don't think it's an actual bottleneck at 4x12Gb/s.
 
 
  We use two of these (I forgot to mention that).
  Each chassis has two internal controllers - both exposing all the disks 
  to the connected hosts. There are two OSD nodes connected to each chassis.

Ah, so you have the dual controller version.
 
  When you ran the fio tests I assume nothing else was going on and the
  dataset size would have fit easily into the cache pool, right?
 
  Look at your nodes with atop or iostat, I venture all your HDDs are at
  100%.
 
  Christian
 
 
  Yes, the problem was a full cache pool. I'm currently wondering how 
  to tune the cache pool parameters so that the whole cluster doesn't slow 
  down that much when the cache is full...

Nick already gave you some advice on this, however with the current
versions of Ceph cache tiering is simply expensive and slow.

 I'm thinking of doing some tests on a pool w/o the cache tier so I can 
 compare the results. Any suggestions would be greatly appreciated..
 
For a realistic comparison with your current setup, a total rebuild would
be in order, provided your cluster is for testing only at this point.

Given your current HW, that means the same 2-3 HDDs per storage node and 1
SSD as journal.

What exact maker/model are your SSDs?

Again, more HDDs mean more (sustainable) IOPS, so unless your space
requirements (data and physical) are very demanding, double the number of
3TB HDDs would be noticeably better. 

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd suddenly dies and no longer can be started

2015-08-19 Thread Евгений Д .
I kind of fixed it by creating a new journal in a file instead of the separate
partition, which probably caused some data loss, but at least allowed the OSD
to start and join the cluster. Backfilling is now in progress.
The old journal is still there on the separate device, if it can help in the
investigation. But this malloc -> ENOMEM/OOM killer -> corrupted journal
-> trying to recover -> ENOMEM/OOM killer ... loop looks like a bug.
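For anyone hitting the same thing, the workaround was roughly along these lines
(paths are illustrative; recreating the journal throws away whatever it contained,
hence the possible data loss):

service ceph stop osd.3
rm /var/lib/ceph/osd/ceph-3/journal     # was a symlink to the old journal partition
ceph-osd -i 3 --mkjournal               # creates a fresh journal at the configured path
service ceph start osd.3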

2015-08-19 0:13 GMT+03:00 Евгений Д. ineu.m...@gmail.com:

 Hello.

 I have a small Ceph cluster running 9 OSDs, using XFS on separate disks
 and dedicated partitions on system disk for journals.
 After creation it worked fine for a while. Then suddenly one of OSDs
 stopped and didn't start. I had to recreate it. Recovery started.
 After few days of recovery OSD on another machine also stopped. I try to
 start it, it runs for few minutes and dies, looks like it is not able to
 recover journal.
 According to strace, it tries to allocate too much memory and stops with
 ENOMEM. Sometimes it is being killed by kernel's OOM killer.

 I tried flushing journal manually with `ceph-osd -i 3 --flush-journal`,
 but it didn't work either. Error log is as follows:

 [root@assets-2 ~]# ceph-osd -i 3 --flush-journal
 SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0d 00 00 00 00
 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 2015-08-18 23:00:37.956714 7ff102040880 -1
 filestore(/var/lib/ceph/osd/ceph-3) could not find
 225eff8c/default.4323.18_22783306dc51892b40b49e3e26f79baf_55c38b33172600566c46_s.jpeg/head//8
 in index: (2) No such file or directory
 2015-08-18 23:00:37.956741 7ff102040880 -1
 filestore(/var/lib/ceph/osd/ceph-3) could not find
 235eff8c/default.4323.16_3018ff7c6066bddc0c867b293724d7b1_dolar7_106_m.jpg/head//8
 in index: (2) No such file or directory
 skipped
 2015-08-18 23:00:37.958424 7ff102040880 -1
 filestore(/var/lib/ceph/osd/ceph-3) could not find c//head//8 in index: (2)
 No such file or directory
 tcmalloc: large alloc 1073741824 bytes == 0x66b1 @  0x7ff10115ae6a
 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f
 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064
 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
 tcmalloc: large alloc 2147483648 bytes == 0xbf49 @  0x7ff10115ae6a
 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f
 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064
 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
 tcmalloc: large alloc 4294967296 bytes == 0x16e32 @  0x7ff10115ae6a
 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f
 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064
 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
 tcmalloc: large alloc 8589934592 bytes == (nil) @  0x7ff10115ae6a
 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f
 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064
 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
 terminate called after throwing an instance of 'std::bad_alloc'
   what():  std::bad_alloc
 *** Caught signal (Aborted) **
  in thread 7ff102040880
  ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
  1: ceph-osd() [0xac5642]
  2: (()+0xf130) [0x7ff1009d4130]
  3: (gsignal()+0x37) [0x7ff0ff3ee5d7]
  4: (abort()+0x148) [0x7ff0ff3efcc8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7ff0ffcf29b5]
  6: (()+0x5e926) [0x7ff0ffcf0926]
  7: (()+0x5e953) [0x7ff0ffcf0953]
  8: (()+0x5eb73) [0x7ff0ffcf0b73]
  9: (()+0x15d3e) [0x7ff10115ad3e]
  10: (tc_new()+0x1e0) [0x7ff10117ade0]
  11: (std::string::_Rep::_S_create(unsigned long, unsigned long,
 std::allocatorchar const)+0x59) [0x7ff0ffd4fc29]
  12: (std::string::_Rep::_M_clone(std::allocatorchar const, unsigned
 long)+0x1b) [0x7ff0ffd5086b]
  13: (std::string::reserve(unsigned long)+0x44) [0x7ff0ffd50914]
  14: (std::string::append(char const*, unsigned long)+0x4f)
 [0x7ff0ffd50b7f]
  15: (LevelDBStore::LevelDBTransactionImpl::rmkeys_by_prefix(std::string
 const)+0xdf) [0x968a0f]
  16:
 (DBObjectMap::clear_header(std::tr1::shared_ptrDBObjectMap::_Header,
 std::tr1::shared_ptrKeyValueDB::TransactionImpl)+0xd3) [0xa572b3]
  17: (DBObjectMap::_clear(std::tr1::shared_ptrDBObjectMap::_Header,
 std::tr1::shared_ptrKeyValueDB::TransactionImpl)+0xa1) [0xa5c6b1]
  18: (DBObjectMap::clear(ghobject_t const, SequencerPosition
 const*)+0x202) [0xa5f762]
  19: (FileStore::lfn_unlink(coll_t, ghobject_t const, SequencerPosition
 const, bool)+0x16a) [0x9018ba]
  20: (FileStore::_remove(coll_t, ghobject_t const, SequencerPosition
 const)+0x9e) [0x90238e]
  21: (FileStore::_do_transaction(ObjectStore::Transaction, unsigned long,
 int, ThreadPool::TPHandle*)+0x252c) [0x911b2c]
  22: (FileStore::_do_transactions(std::listObjectStore::Transaction*,
 std::allocatorObjectStore::Transaction* , unsigned long,
 ThreadPool::TPHandle*)+0x64) 

Re: [ceph-users] any recommendation of using EnhanceIO?

2015-08-19 Thread Stefan Priebe - Profihost AG
On 18.08.2015 at 15:43, Campbell, Bill wrote:
 Hey Stefan,
 Are you using your Ceph cluster for virtualization storage?
Yes

  Is dm-writeboost configured on the OSD nodes themselves?
Yes

Stefan

 
 
 From: Stefan Priebe - Profihost AG s.pri...@profihost.ag
 To: Mark Nelson mnel...@redhat.com, ceph-users@lists.ceph.com
 Sent: Tuesday, August 18, 2015 7:36:10 AM
 Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
 
 We're using an extra caching layer for ceph since the beginning for our
 older ceph deployments. All new deployments go with full SSDs.
 
 I've tested so far:
 - EnhanceIO
 - Flashcache
 - Bcache
 - dm-cache
 - dm-writeboost
 
 The best working solution was and is bcache, except for its buggy code.
 The current code in the 4.2-rc7 vanilla kernel still contains bugs, e.g.
 discards result in a crashed FS after reboots and so on. But it's still
 the fastest for Ceph.
 
 The 2nd best solution which we already use in production is
 dm-writeboost (https://github.com/akiradeveloper/dm-writeboost).
 
 Everything else is too slow.
 
 Stefan
 On 18.08.2015 at 13:33, Mark Nelson wrote:
 Hi Jan,

 Out of curiosity did you ever try dm-cache?  I've been meaning to give
 it a spin but haven't had the spare cycles.

 Mark

 On 08/18/2015 04:00 AM, Jan Schermer wrote:
 I already evaluated EnhanceIO in combination with CentOS 6 (and
 backported 3.10 and 4.0 kernel-lt if I remember correctly).
 It worked fine during benchmarks and stress tests, but once we ran DB2
 on it, it panicked within minutes and took all the data with it (almost
 literally - files that weren't touched, like OS binaries, were b0rked
 and the filesystem was unsalvageable).
 If you disregard this warning - the performance gains weren't that
 great either, at least in a VM. It had problems when flushing to disk
 after reaching the dirty watermark, and the block size has some
 not-well-documented implications (not sure now, but I think it only
 cached IO _larger_ than the block size, so if your database keeps
 incrementing an XX-byte counter it will go straight to disk).

 Flashcache doesn't respect barriers (or does it now?) - if that's ok
 for you than go for it, it should be stable and I used it in the past
 in production without problems.

 bcache seemed to work fine, but I needed to
 a) use it for root
 b) disable and enable it on the fly (doh)
 c) make it non-persisent (flush it) before reboot - not sure if that
 was possible either.
 d) all that in a customer's VM, and that customer didn't have a strong
 technical background to be able to fiddle with it...
 So I haven't tested it heavily.

 Bcache should be the obvious choice if you are in control of the
 environment. At least you can cry on LKML's shoulder when you lose
 data :-)

 Jan


 On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote:

 What about https://github.com/Frontier314/EnhanceIO?  Last commit 2
 months ago, but no external contributors :(

 The nice thing about EnhanceIO is there is no need to change device
 name, unlike bcache, flashcache etc.

 Best regards,
 Alex

 On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com
 wrote:
 I did some (non-ceph) work on these, and concluded that bcache was
 the best supported, most stable, and fastest. This was ~1 year ago,
 so take it with a grain of salt, but that's what I would recommend.

 Daniel


 
 From: Dominik Zalewski dzalew...@optlink.net
 To: German Anders gand...@despegar.com
 Cc: ceph-users ceph-users@lists.ceph.com
 Sent: Wednesday, July 1, 2015 5:28:10 PM
 Subject: Re: [ceph-users] any recommendation of using EnhanceIO?


 Hi,

 I’ve asked same question last weeks or so (just search the mailing list
 archives for EnhanceIO :) and got some interesting answers.

 Looks like the project is pretty much dead since it was bought out
 by HGST.
 Even their website has some broken links in regards to EnhanceIO

 I’m keen to try flashcache or bcache (its been in the mainline
 kernel for
 some time)

 Dominik

 On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote:

 Hi cephers,

   Is anyone out there running EnhanceIO in a production environment?
 Any recommendations? Any performance output to share showing the difference
 between using it and not?

 Thanks in advance,

 German
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 

[ceph-users] RE: Rename Ceph cluster

2015-08-19 Thread Межов Игорь Александрович
Hi!

I think that renaming a cluster is not just moving the config file. We tried to change
the name of a test Hammer cluster, created with ceph-deploy, and ran into some issues.

In a default install, the names of many parts are derived from the cluster name. For
example, cephx keys are stored not in ceph.client.admin.keyring but in
$CLUSTERNAME.client.admin.keyring, so we had to rename the keyrings as well.

The same goes for the OSD/MON mount points: instead of
/var/lib/ceph/osd/ceph-$OSDNUM, after renaming the cluster the daemons try to
run the OSDs from /var/lib/ceph/osd/$CLUSTERNAME-$OSDNUM.
Of course, there are no such mountpoints, so we created them manually, mounted
the filesystems and restarted the OSDs.
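
To make that step concrete, a minimal per-OSD sketch (hypothetical names: new
cluster name "prod", osd.0 on /dev/sdb1, Upstart on Ubuntu; adjust to your own
layout before using):

  mkdir -p /var/lib/ceph/osd/prod-0          # mountpoint with the new cluster name
  mount /dev/sdb1 /var/lib/ceph/osd/prod-0   # mount the OSD filesystem there
  start ceph-osd cluster=prod id=0           # restart the daemon under the new name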

There is one unresolved issue with udev rules: after a node reboot, the filesystems
are mounted by udev into the old mountpoints. As the cluster is only for testing,
this is not a big deal.

So, be careful when renaming a production or loaded cluster.

PS: All above is my IMHO and I may be wrong. ;)

Megov Igor
CIO, Yuterra



From: ceph-users ceph-users-boun...@lists.ceph.com on behalf of Jan Schermer
j...@schermer.cz
Sent: 18 August 2015 15:18
To: Erik McCormick
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Rename Ceph cluster

I think it's pretty clear:

http://ceph.com/docs/master/install/manual-deployment/

For example, when you run multiple clusters in a federated architecture, the 
cluster name (e.g., us-west, us-east) identifies the cluster for the current 
CLI session. Note: To identify the cluster name on the command line interface, 
specify the Ceph configuration file with the cluster name (e.g., ceph.conf, 
us-west.conf, us-east.conf, etc.). Also see CLI usage (ceph --cluster 
{cluster-name}).

But it could be tricky on the OSDs that are running, depending on the
distribution initscripts - you could find out that you can't "service ceph stop
osd..." anymore after the change, since it can't find its pidfile anymore.
Looking at the CentOS initscript, it looks like it accepts a -c conffile
argument, though.
(So you should be managing OSDs with -c ceph-prod.conf now?)
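
As an illustration (my sketch, not from the docs quoted above): a named cluster
can be addressed from the CLI either by pointing at its conf file explicitly or
via the --cluster shorthand, where the latter assumes /etc/ceph/ceph-prod.conf
and a matching ceph-prod.client.admin.keyring exist:

  ceph -c /etc/ceph/ceph-prod.conf status
  ceph --cluster ceph-prod status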

Jan


On 18 Aug 2015, at 14:13, Erik McCormick emccorm...@cirrusseven.com wrote:


I've got a custom-named cluster integrated with OpenStack (Juno) and didn't run
into any hard-coded name issues that I can recall. Where are you seeing that?

As to the name change itself, I think it's really just a label applying to a
configuration set. The name doesn't actually appear *in* the configuration
files. It stands to reason you should be able to rename the configuration files
on the client side and leave the cluster alone. It'd be worth trying in a test
environment first anyway.

-Erik

On Aug 18, 2015 7:59 AM, Jan Schermer j...@schermer.cz wrote:
This should be simple enough

mv /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf

No? :-)

Or you could set this in nova.conf:
images_rbd_ceph_conf=/etc/ceph/ceph-prod.conf

Obviously, since different parts of OpenStack have their own configs, you'd have
to do something similar for cinder/glance... so it's not worth the hassle.
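
For completeness, the equivalent options for the other services would look
roughly like this (a sketch from memory for Juno-era releases; treat the option
names as assumptions to verify against your installed versions):

  # cinder.conf, in the RBD backend section
  rbd_ceph_conf = /etc/ceph/ceph-prod.conf

  # glance-api.conf
  rbd_store_ceph_conf = /etc/ceph/ceph-prod.conf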

Jan

 On 18 Aug 2015, at 13:50, Vasiliy Angapov 
 anga...@gmail.commailto:anga...@gmail.com wrote:

 Hi,

 Does anyone know what steps should be taken to rename a Ceph cluster?
 Btw, is it ever possible without data loss?

 Background: I have a cluster named ceph-prod integrated with
 OpenStack, however I found out that the default cluster name ceph is
 very much hardcoded into OpenStack so I decided to change it to the
 default value.

 Regards, Vasily.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RE: Rename Ceph cluster

2015-08-19 Thread Vasiliy Angapov
Thanks to all!

Everything worked like a charm:
1) Stopped the cluster (I guess it's faster than moving OSDs one by one)
2) Unmounted OSDs and fixed fstab entries for them
3) Renamed the MON and OSD folders
4) Renamed config file and keyrings, fixed paths to keyrings in config
5) Mounted OSDs back (mount -a)
6) Started everything
7) Fixed the path to the config in nova.conf, cinder.conf and glance-api.conf
all over OpenStack

Everything works as expected. It took about half an hour to do the whole job
(a condensed sketch of the commands is below).
Again, thanks to all!
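
For anyone repeating this, a condensed, untested sketch of the above for one
node (hypothetical rename from "ceph-prod" back to the default "ceph"; paths,
IDs and the Upstart commands are placeholders for an Ubuntu 14.04 setup):

  stop ceph-all                                          # 1) stop the cluster
  umount /var/lib/ceph/osd/ceph-prod-0                   # 2) unmount OSDs, then edit fstab to the new paths
  mv /var/lib/ceph/mon/ceph-prod-$(hostname -s) /var/lib/ceph/mon/ceph-$(hostname -s)   # 3) rename MON/OSD dirs
  mv /var/lib/ceph/osd/ceph-prod-0 /var/lib/ceph/osd/ceph-0
  mv /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf        # 4) rename config and keyrings, fix keyring paths inside
  mv /etc/ceph/ceph-prod.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
  mount -a                                               # 5) mount OSDs back
  start ceph-all                                         # 6) start everything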

Regards, Vasily.

2015-08-19 15:10 GMT+08:00 Межов Игорь Александрович me...@yuterra.ru:
 Hi!

 I think that renaming a cluster is not just moving the config file. We tried to
 change the name of a test Hammer cluster, created with ceph-deploy, and ran
 into some issues.

 In a default install, the names of many parts are derived from the cluster name.
 For example, cephx keys are stored not in ceph.client.admin.keyring but in
 $CLUSTERNAME.client.admin.keyring, so we had to rename the keyrings as well.

 The same goes for the OSD/MON mount points: instead of
 /var/lib/ceph/osd/ceph-$OSDNUM, after renaming the cluster the daemons try to
 run the OSDs from /var/lib/ceph/osd/$CLUSTERNAME-$OSDNUM.
 Of course, there are no such mountpoints, so we created them manually, mounted
 the filesystems and restarted the OSDs.

 There is one unresolved issue with udev rules: after a node reboot, the
 filesystems are mounted by udev into the old mountpoints. As the cluster is
 only for testing, this is not a big deal.

 So, be careful when renaming a production or loaded cluster.

 PS: All above is my IMHO and I may be wrong. ;)

 Megov Igor
 CIO, Yuterra


 
 From: ceph-users ceph-users-boun...@lists.ceph.com on behalf of Jan Schermer
 j...@schermer.cz
 Sent: 18 August 2015 15:18
 To: Erik McCormick
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Rename Ceph cluster

 I think it's pretty clear:

 http://ceph.com/docs/master/install/manual-deployment/

 For example, when you run multiple clusters in a federated architecture,
 the cluster name (e.g., us-west, us-east) identifies the cluster for the
 current CLI session. Note: To identify the cluster name on the command line
 interface, specify the Ceph configuration file with the cluster name
 (e.g., ceph.conf, us-west.conf, us-east.conf, etc.). Also see CLI usage
 (ceph --cluster {cluster-name}).

 But it could be tricky on the OSDs that are running, depending on the
 distribution initscripts - you could find out that you can't "service ceph
 stop osd..." anymore after the change, since it can't find its pidfile
 anymore. Looking at the CentOS initscript, it looks like it accepts a -c
 conffile argument, though.
 (So you should be managing OSDs with -c ceph-prod.conf now?)

 Jan


 On 18 Aug 2015, at 14:13, Erik McCormick emccorm...@cirrusseven.com wrote:

 I've got a custom-named cluster integrated with OpenStack (Juno) and didn't
 run into any hard-coded name issues that I can recall. Where are you seeing
 that?

 As to the name change itself, I think it's really just a label applying to a
 configuration set. The name doesn't actually appear *in* the configuration
 files. It stands to reason you should be able to rename the configuration
 files on the client side and leave the cluster alone. It'd be worth trying in
 a test environment first anyway.

 -Erik

 On Aug 18, 2015 7:59 AM, Jan Schermer j...@schermer.cz wrote:

 This should be simple enough

 mv /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf

 No? :-)

 Or you could set this in nova.conf:
 images_rbd_ceph_conf=/etc/ceph/ceph-prod.conf

 Obviously, since different parts of OpenStack have their own configs, you'd
 have to do something similar for cinder/glance... so it's not worth the hassle.

 Jan

  On 18 Aug 2015, at 13:50, Vasiliy Angapov anga...@gmail.com wrote:
 
  Hi,
 
  Does anyone know what steps should be taken to rename a Ceph cluster?
  Btw, is it ever possible without data loss?
 
  Background: I have a cluster named ceph-prod integrated with
  OpenStack, however I found out that the default cluster name ceph is
  very much hardcoded into OpenStack so I decided to change it to the
  default value.
 
  Regards, Vasily.
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] requests are blocked - problem

2015-08-19 Thread Nick Fisk
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Jacek Jarosiewicz
 Sent: 19 August 2015 09:29
 To: ceph-us...@ceph.com
 Subject: [ceph-users] requests are blocked - problem
 
 Hi,
 
 Our setup is this:
 
 4 x OSD nodes:
 E5-1630 CPU
 32 GB RAM
 Mellanox MT27520 56Gbps network cards
 SATA controller LSI Logic SAS3008
 Storage nodes are connected to SuperMicro chassis: 847E1C-R1K28JBOD
 Each node has 2-3 spinning OSDs (6TB drives) and 2 SSD OSDs (240GB drives)
 3 monitors running on OSD nodes
 ceph hammer 0.94.2
 Ubuntu 14.04
 cache tier with ecpool (3+1)
 
 We've run some tests on the cluster and the results were promising - speeds,
 IOPS etc. as expected - but now we have tried to use more than one client for
 testing and ran into some problems:

When you ran this test did you fill the RBD enough that it would have been 
flushing the dirty contents down to the base tier?

 
 We've created a couple of rbd images and mapped them on clients (kernel
 rbd), running two rsync processes and one dd on a large number of files
 (~250 GB of maildirs rsync'ed from one rbd image to the other, and a dd
 process writing one big 2TB file on another rbd image).
 
 And the speeds now are less than OK, plus we get a lot of "requests are blocked"
 warnings. We left the processes to run overnight, but this morning I came in
 and none of the processes had finished - they are able to write data, but at a
 very, very slow rate.

Once a block has been written and its underlying object exists on the
EC pool, that object will have to be promoted to the cache pool before it can be
written to again. This can have a severe impact on performance, especially if you
are hitting lots of different blocks and the tiering agent can't keep up with the
promotion requests.
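
Purely as an illustration (my sketch, not something taken from your setup): the
Hammer-era pool settings that bound how full the cache tier gets before the
agent starts flushing/evicting look roughly like this, assuming a cache pool
called "hot-pool"; the values are placeholders to tune, not recommendations:

  ceph osd pool set hot-pool target_max_bytes 1099511627776   # the size the agent works against (1 TiB here)
  ceph osd pool set hot-pool cache_target_dirty_ratio 0.4     # start flushing dirty objects at 40% of that
  ceph osd pool set hot-pool cache_target_full_ratio 0.8      # start evicting clean objects at 80% of that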

 
 Please help me diagnose this problem; everything seems to work, just very,
 very slowly... when we ran the tests with fio (librbd engine) everything seemed
 fine. I know that the kernel implementation is slower, but is this normal? I
 can't understand why so many requests are blocked.

I would suggest running the fio tests again, just to make sure that there isn't 
a problem with your newer config, but I suspect you will see equally bad 
performance with the fio tests now that the cache tier has begun to be more 
populated.
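
For reference, a librbd fio invocation along these lines could be used for that
re-test (a sketch only; the pool and image names are placeholders, and it
assumes your fio build includes the rbd ioengine):

  fio --name=cache-retest --ioengine=rbd --clientname=admin \
      --pool=rbd --rbdname=fio-test \
      --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based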

 
 Some diagnostic data:
 
 root@cf01:/var/log/ceph# ceph -s
  cluster 3469081f-9852-4b6e-b7ed-900e77c48bb5
   health HEALTH_WARN
 31 requests are blocked > 32 sec
   monmap e1: 3 mons at
 {cf01=10.4.10.211:6789/0,cf02=10.4.10.212:6789/0,cf03=10.4.10.213:6789/0}
  election epoch 202, quorum 0,1,2 cf01,cf02,cf03
   osdmap e1319: 18 osds: 18 up, 18 in
pgmap v933010: 2112 pgs, 19 pools, 10552 GB data, 2664 kobjects
  14379 GB used, 42812 GB / 57192 GB avail
  2111 active+clean
 1 active+clean+scrubbing
client io 0 B/s rd, 12896 kB/s wr, 35 op/s
 
 
 root@cf01:/var/log/ceph# ceph health detail
 HEALTH_WARN 23 requests are blocked > 32 sec; 6 osds have slow requests
 1 ops are blocked > 131.072 sec
 22 ops are blocked > 65.536 sec
 1 ops are blocked > 65.536 sec on osd.2
 1 ops are blocked > 65.536 sec on osd.3
 1 ops are blocked > 65.536 sec on osd.4
 1 ops are blocked > 131.072 sec on osd.7
 18 ops are blocked > 65.536 sec on osd.10
 1 ops are blocked > 65.536 sec on osd.12
 6 osds have slow requests
 
 
 root@cf01:/var/log/ceph# grep WRN ceph.log | tail -50
 2015-08-19 10:23:34.505669 osd.14 10.4.10.213:6810/21207 17942 : cluster
 [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.575870 secs
 2015-08-19 10:23:34.505796 osd.14 10.4.10.213:6810/21207 17943 : cluster
 [WRN] slow request 30.575870 seconds old, received at 2015-08-19
 10:23:03.929722: osd_op(client.9203.1:22822591
 rbd_data.37e02ae8944a.000180ca [set-alloc-hint object_size
 4194304 write_size 4194304,write 0~462848] 5.1c2aff5f ondisk+write
 e1319) currently waiting for blocked object
 2015-08-19 10:23:34.505803 osd.14 10.4.10.213:6810/21207 17944 : cluster
 [WRN] slow request 30.560009 seconds old, received at 2015-08-19
 10:23:03.945583: osd_op(client.9203.1:22822592
 rbd_data.37e02ae8944a.000180ca [set-alloc-hint object_size
 4194304 write_size 4194304,write 462848~524288] 5.1c2aff5f ondisk+write
 e1319) currently waiting for blocked object
 2015-08-19 10:23:35.489927 osd.1 10.4.10.211:6812/9198 7921 : cluster
 [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.112783 secs
 2015-08-19 10:23:35.490326 osd.1 10.4.10.211:6812/9198 7922 : cluster
 [WRN] slow request 30.112783 seconds old, received at 2015-08-19
 10:23:05.376339: osd_op(osd.14.1299:731492
 rbd_data.37e02ae8944a.00017a90 [copy-from ver 61293] 4.5aa9fb69
 ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319)
 currently commit_sent
 2015-08-19 10:23:36.799861 osd.6 10.4.10.213:6806/22569 22663 : cluster
 [WRN] 2 slow requests, 1 included below; oldest 

Re: [ceph-users] requests are blocked - problem

2015-08-19 Thread Christian Balzer

Hello,

That's a pretty small cluster all things considered, so your rather
intensive test setup is likely to run into any or all of the following
issues:

1) The amount of data you're moving around is going cause a lot of
promotions from and to the cache tier. This is expensive and slow.
2) EC coded pools are slow. You may have actually better results with a
Ceph classic approach, 2-4 HDDs per journal SSD. Also 6TB HDDs combined
with EC may look nice to you from a cost/density prospect, but more HDDs
means more IOPS and thus speed.
3) scrubbing (unless configured very aggressively down) will
impact your performance on top of the items above.
4) You already noted the kernel versus userland bit.
5) Having all your storage in a single JBOD chassis strikes me as ill
advised, though I don't think it's an actual bottleneck at 4x12Gb/s.

When you ran the fio tests I assume nothing else was going on and the
dataset size would have fit easily into the cache pool, right?

Look at your nodes with atop or iostat, I venture all your HDDs are at
100%.
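
For reference, a generic way to make that check (not from the thread itself;
assumes the sysstat package is installed on the OSD nodes):

  iostat -xm 5        # watch the %util column per device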

Christian

On Wed, 19 Aug 2015 10:29:28 +0200 Jacek Jarosiewicz wrote:

 Hi,
 
 Our setup is this:
 
 4 x OSD nodes:
 E5-1630 CPU
 32 GB RAM
 Mellanox MT27520 56Gbps network cards
 SATA controller LSI Logic SAS3008
 Storage nodes are connected to SuperMicro chassis: 847E1C-R1K28JBOD
 Each node has 2-3 spinning OSDs (6TB drives) and 2 SSD OSDs (240GB drives)
 3 monitors running on OSD nodes
 ceph hammer 0.94.2
 Ubuntu 14.04
 cache tier with ecpool (3+1)
 
 We've run some tests on the cluster and the results were promising - speeds,
 IOPS etc. as expected - but now we have tried to use more than one client for
 testing and ran into some problems:
 
 We've created a couple of rbd images and mapped them on clients (kernel
 rbd), running two rsync processes and one dd on a large number of files
 (~250 GB of maildirs rsync'ed from one rbd image to the other, and a dd
 process writing one big 2TB file on another rbd image).
 
 And the speeds now are less than OK, plus we get a lot of "requests are blocked"
 warnings. We left the processes to run overnight, but this morning I came in
 and none of the processes had finished - they are able to write data,
 but at a very, very slow rate.
 
 Please help me diagnose this problem; everything seems to work, just
 very, very slowly... when we ran the tests with fio (librbd engine)
 everything seemed fine. I know that the kernel implementation is slower, but
 is this normal? I can't understand why so many requests are blocked.
 
 Some diagnostic data:
 
 root@cf01:/var/log/ceph# ceph -s
  cluster 3469081f-9852-4b6e-b7ed-900e77c48bb5
   health HEALTH_WARN
 31 requests are blocked > 32 sec
   monmap e1: 3 mons at 
 {cf01=10.4.10.211:6789/0,cf02=10.4.10.212:6789/0,cf03=10.4.10.213:6789/0}
  election epoch 202, quorum 0,1,2 cf01,cf02,cf03
   osdmap e1319: 18 osds: 18 up, 18 in
pgmap v933010: 2112 pgs, 19 pools, 10552 GB data, 2664 kobjects
  14379 GB used, 42812 GB / 57192 GB avail
  2111 active+clean
 1 active+clean+scrubbing
client io 0 B/s rd, 12896 kB/s wr, 35 op/s
 
 
 root@cf01:/var/log/ceph# ceph health detail
 HEALTH_WARN 23 requests are blocked > 32 sec; 6 osds have slow requests
 1 ops are blocked > 131.072 sec
 22 ops are blocked > 65.536 sec
 1 ops are blocked > 65.536 sec on osd.2
 1 ops are blocked > 65.536 sec on osd.3
 1 ops are blocked > 65.536 sec on osd.4
 1 ops are blocked > 131.072 sec on osd.7
 18 ops are blocked > 65.536 sec on osd.10
 1 ops are blocked > 65.536 sec on osd.12
 6 osds have slow requests
 
 
 root@cf01:/var/log/ceph# grep WRN ceph.log | tail -50
 2015-08-19 10:23:34.505669 osd.14 10.4.10.213:6810/21207 17942 : cluster 
 [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.575870
 secs 2015-08-19 10:23:34.505796 osd.14 10.4.10.213:6810/21207 17943 :
 cluster [WRN] slow request 30.575870 seconds old, received at 2015-08-19 
 10:23:03.929722: osd_op(client.9203.1:22822591 
 rbd_data.37e02ae8944a.000180ca [set-alloc-hint object_size 
 4194304 write_size 4194304,write 0~462848] 5.1c2aff5f ondisk+write 
 e1319) currently waiting for blocked object
 2015-08-19 10:23:34.505803 osd.14 10.4.10.213:6810/21207 17944 : cluster 
 [WRN] slow request 30.560009 seconds old, received at 2015-08-19 
 10:23:03.945583: osd_op(client.9203.1:22822592 
 rbd_data.37e02ae8944a.000180ca [set-alloc-hint object_size 
 4194304 write_size 4194304,write 462848~524288] 5.1c2aff5f ondisk+write 
 e1319) currently waiting for blocked object
 2015-08-19 10:23:35.489927 osd.1 10.4.10.211:6812/9198 7921 : cluster 
 [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.112783
 secs 2015-08-19 10:23:35.490326 osd.1 10.4.10.211:6812/9198 7922 :
 cluster [WRN] slow request 30.112783 seconds old, received at 2015-08-19 
 10:23:05.376339: osd_op(osd.14.1299:731492 
 rbd_data.37e02ae8944a.00017a90 [copy-from ver 61293] 4.5aa9fb69 
 

Re: [ceph-users] any recommendation of using EnhanceIO?

2015-08-19 Thread Nick Fisk




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Christian Balzer
 Sent: 19 August 2015 03:32
 To: ceph-users@lists.ceph.com
 Cc: Nick Fisk n...@fisk.me.uk
 Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
 
 On Tue, 18 Aug 2015 20:48:26 +0100 Nick Fisk wrote:
 
 [mega snip]
  4. Disk-based OSD with SSD journal performance: As I touched on
  earlier, I would expect a disk-based OSD with an SSD journal to have
  similar performance to a pure SSD OSD when dealing with sequential
  small IOs. Currently the levelDB sync and potentially other things
  slow this down.
 
 
  Has anybody tried symlinking the omap directory to an SSD and tested if that
  makes a (significant) difference?

I thought I remembered reading somewhere that all these items need to remain
on the OSD itself, so that when the OSD calls fsync it can be sure they are
all in sync at the same time.
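
For anyone who wants to try the experiment anyway, the symlink idea being
discussed would look roughly like this (an untested sketch, assuming a
FileStore OSD whose omap lives under current/omap, and bearing in mind the
consistency concern above):

  stop ceph-osd id=0                                        # Upstart on Ubuntu; use your init system's equivalent
  mv /var/lib/ceph/osd/ceph-0/current/omap /mnt/ssd/osd-0-omap
  ln -s /mnt/ssd/osd-0-omap /var/lib/ceph/osd/ceph-0/current/omap
  start ceph-osd id=0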

 
 Christian
 --
 Christian Balzer           Network/Systems Engineer
 ch...@gol.com              Global OnLine Japan/Fusion Communications
 http://www.gol.com/
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com