[ceph-users] IRQ balancing, distribution

2014-09-22 Thread Christian Balzer

Hello,

not really specific to Ceph, but since one of the default questions by the
Ceph team when people are facing performance problems seems to be 
"Have you tried turning it off and on again?" ^o^ err,
"Are all your interrupts on one CPU?"
I'm going to wax on about this for a bit and hope for some feedback from
others with different experiences and architectures than me.

Now, firstly, the question of whether all your IRQ handling is happening on the
same CPU is a valid one, as depending on a bewildering range of factors,
from kernel parameters to the actual hardware, one often does indeed
wind up with that scenario, usually with everything on CPU0.
Which certainly is the case with all my recent hardware and Debian
kernels.

I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and
thus feedback from Intel users is very much sought after, as I'm
considering Intel based storage nodes in the future. 
It's vaguely amusing that Ceph storage nodes seem to have higher CPU
requirements (individual core performance, not necessarily # of cores) and
similar RAM requirements to my VM hosts. ^o^

So the common wisdom is that all IRQs on one CPU is a bad thing, lest it
get overloaded and, for example, drop network packets as a result.
And while that is true, I'm hard pressed to generate any load on my
clusters where the IRQ ratio on CPU0 goes much beyond 50%.
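(A quick way to see whether you're in that situation, and how busy CPU0
actually is; purely for illustration, mpstat comes from the sysstat package:)

# per-CPU interrupt counters, one column per CPU
cat /proc/interrupts

# per-CPU utilisation incl. %irq and %soft, sampled every 5 seconds
mpstat -P ALL 5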

Thus it should come as no surprise that spreading out IRQs with irqbalance
or more accurately by manually setting the /proc/irq/xx/smp_affinity mask
doesn't give me any discernible differences when it comes to benchmark
results. 

With irqbalance spreading things out willy-nilly, without any regard for or
knowledge of the hardware and which IRQ does what, it's definitely something
I won't be using out of the box. This goes especially for systems
with different NUMA regions without proper policy scripts for irqbalance.

So for my current hardware I'm going to keep IRQs on CPU0 and CPU1, which
are in the same Bulldozer module and thus share L2 and L3 cache.
In particular the AHCI (journal SSDs) and HBA or RAID controller IRQs go on
CPU0 and the network (Infiniband) on CPU1.
That should give me sufficient reserves in processing power and keep
intra-module and NUMA (additional physical CPUs) traffic to a minimum.
This also will (within a certain load range) allow these 2 CPUs (module)
to be ramped up to full speed while other cores can remain at a lower
frequency.
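(In concrete terms that just means writing CPU masks into smp_affinity; a
rough sketch of what I have in mind, with the IRQ numbers below being pure
placeholders that have to be looked up in /proc/interrupts on the actual box:)

# find the IRQ numbers of the AHCI, HBA and Infiniband devices
grep -iE 'ahci|sas|mlx|ib' /proc/interrupts

# the mask is a hex CPU bitmap: 1 = CPU0, 2 = CPU1, 3 = CPU0+CPU1
echo 1 > /proc/irq/40/smp_affinity   # AHCI (journal SSDs)  -> CPU0
echo 1 > /proc/irq/41/smp_affinity   # HBA/RAID controller  -> CPU0
echo 2 > /proc/irq/42/smp_affinity   # Infiniband           -> CPU1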

Now with Intel some PCIe lanes are handled by a specific CPU (that's why
you often see the need for adding a 2nd CPU to use all slots) and in that
case pinning the IRQ handling for those slots to a specific CPU might
actually make a lot of sense, especially if not all the traffic generated
by that card has to be transferred to the other CPU anyway.


Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Stijn De Weirdt

hi christian,

we once were debugging some performance issues, and IRQ balancing was
one of the things we looked into, but there was no real benefit there for us.
all interrupts on one cpu is only an issue if the hardware itself is not
the bottleneck. we were running some default SAS HBA (Dell H200), and
those simply can't generate enough load to cause any IRQ issue even on
older AMD cpus (we did tests on R515 boxes). (there was a ceph
presentation somewhere that highlights the impact of using the proper
disk controller; we'll have to fix that first in our case. i'll be
happy if IRQ balancing actually becomes an issue ;)


but another issue is the OSD processes: do you pin those as well? and 
how much data do they actually handle. to checksum, the OSD process 
needs all data, so that can also cause a lot of NUMA traffic, esp if 
they are not pinned.
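(by pinning i mean something along these lines; just a sketch, the OSD id and
core numbers are made up:)

# start one OSD confined to NUMA node 0, for both CPUs and memory
numactl --cpunodebind=0 --membind=0 ceph-osd -i 3 -c /etc/ceph/ceph.conf

# or move an already running OSD onto a set of cores (replace <pid>)
taskset -pc 0-5 <pid>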


i sort of hope that current CPUs have enough pcie lanes and cores so we 
can use single socket nodes, to avoid at least the NUMA traffic.


stijn




Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Christian Balzer

Hello,

On Mon, 22 Sep 2014 09:35:10 +0200 Stijn De Weirdt wrote:

 hi christian,
 
 we once were debugging some performance isssues, and IRQ balancing was 
 one of the issues we looked in, but no real benefit there for us.
 all interrupts on one cpu is only an issue if the hardware itself is not 
 the bottleneck. 
In particular the spinning rust. ^o^
But this crept up in recent discussions about all SSD OSD storage servers,
so there is some (remote) possibility for this to happen.

we were running some default SAS HBA (Dell H200), and 
 those simply can't generated enough load to cause any IRQ issue even on 
 older AMD cpus (we did tests on R515 boxes). (there was a ceph 
 persentation somewhere that highlights the impact of using the proper 
 the disk controller, we'll have to fix that first in our case. i'll be 
 happy if IRQ balancing actually becomes an issue ;)
 
Yeah, this pretty much matches what I'm seeing and experienced over the
years.

 but another issue is the OSD processes: do you pin those as well? and 
 how much data do they actually handle. to checksum, the OSD process 
 needs all data, so that can also cause a lot of NUMA traffic, esp if 
 they are not pinned.
 
That's why all my (production) storage nodes have only a single 6 or 8
core CPU. Unfortunately that also limits the amount of RAM in there; 16GB
modules have only recently become an economically viable alternative to
8GB ones.

Thus I don't pin OSD processes, given that on my 8 core nodes with 8 OSDs
and 4 journal SSDs I can make Ceph eat babies and nearly all CPU (not
IOwait!) resources with the right (or is that wrong) tests, namely 4K
FIOs. 

The linux scheduler usually is quite decent in keeping processes where the
action is, thus you see for example a clear preference of DRBD or KVM vnet
processes to be near or on the CPU(s) where the IRQs are.
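(Easy enough to check, for example via the PSR column, i.e. the core a thread
last ran on; just an illustration:)

# -L lists threads, PSR = processor the thread last ran on
ps -Leo pid,tid,psr,pcpu,comm | grep -E 'ceph-osd|vhost|drbd'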

 i sort of hope that current CPUs have enough pcie lanes and cores so we 
 can use single socket nodes, to avoid at least the NUMA traffic.
 
Even the lackluster Opterons with just PCIe v2 and fewer lanes than current
Intel CPUs are plenty fast enough (sufficient bandwidth) when it comes to
the storage node density I'm deploying.

Christian

Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Florian Haas
On Mon, Sep 22, 2014 at 10:21 AM, Christian Balzer ch...@gol.com wrote:
 The linux scheduler usually is quite decent in keeping processes where the
 action is, thus you see for example a clear preference of DRBD or KVM vnet
 processes to be near or on the CPU(s) where the IRQs are.

Since you're just mentioning it: DRBD, for one, needs to *tell* the
kernel that its sender, receiver and worker threads should be on the
same CPU. It has done that for some time now, but you shouldn't assume
that this is some kernel magic that DRBD can just use. Not suggesting
that you're unaware of this, but the casual reader might be. :)

Cheers,
Florian


Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Stijn De Weirdt

but another issue is the OSD processes: do you pin those as well? and
how much data do they actually handle. to checksum, the OSD process
needs all data, so that can also cause a lot of NUMA traffic, esp if
they are not pinned.


That's why all my (production) storage nodes have only a single 6 or 8
core CPU. Unfortunately that also limits the amount of RAM in there, 16GB
modules have just recently become an economically viable alternative to
8GB ones.

Thus I don't pin OSD processes, given that on my 8 core nodes with 8 OSDs
and 4 journal SSDs I can make Ceph eat babies and nearly all CPU (not
IOwait!) resources with the right (or is that wrong) tests, namely 4K
FIOs.

The linux scheduler usually is quite decent in keeping processes where the
action is, thus you see for example a clear preference of DRBD or KVM vnet
processes to be near or on the CPU(s) where the IRQs are.
the scheduler has improved recently, but i don't know since what version 
(certainly not backported to RHEL6 kernel).


pinning the OSDs might actually be a bad idea, unless the page cache is
flushed before each osd restart. the kernel VM has this nice feature where
allocating memory in a NUMA domain does not trigger freeing of cache
memory in that domain; it will first try to allocate memory on
another NUMA domain. although typically the VM cache will be maxed out
on OSD boxes, i'm not sure the cache clearing itself is NUMA aware, so
who knows where the memory is located when it's allocated.
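(you can at least see where the memory of a running OSD ended up, e.g.:)

# per-node totals for the whole box
numactl --hardware
numastat

# per-process breakdown; -p needs a reasonably recent numastat
numastat -p ceph-osd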



stijn


Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Anand Bhat
Page reclamation in Linux is NUMA aware.  So page reclamation is not an issue.

You can see performance improvements only if all the components of a given IO
complete on a single core. This is hard to achieve in Ceph, as a single IO
goes through multiple thread switches and the threads are not bound to any
core.  Starting an OSD with numactl and binding it to one core might aggravate
the problem, as all the threads spawned by that OSD will compete for the CPU on
a single core.  An OSD with the default configuration has 20+ threads.  Binding
the OSD process to one core using taskset does not help either, as some memory
(especially the heap) may already be allocated on the other NUMA node.

It looks like the design principle followed is to fan out by spawning multiple
threads at each pipelining stage to utilize the available cores in the system.
Because the IOs won't complete on the same core they were issued on, lots of
cycles are lost to cache coherency.
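For illustration, both points are easy to check on a running OSD; the pid
below is a placeholder:

# number of threads in one OSD process
ls /proc/<osd-pid>/task | wc -l

# on which NUMA node(s) its heap currently lives
grep heap /proc/<osd-pid>/numa_maps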

Regards,
Anand





Re: [ceph-users] Newbie Ceph Design Questions

2014-09-22 Thread Udo Lembke
Hi Christian,

On 22.09.2014 05:36, Christian Balzer wrote:
 Hello,

 On Sun, 21 Sep 2014 21:00:48 +0200 Udo Lembke wrote:

 Hi Christian,

 On 21.09.2014 07:18, Christian Balzer wrote:
 ...
 Personally I found ext4 to be faster than XFS in nearly all use cases
 and the lack of full, real kernel integration of ZFS is something that
 doesn't appeal to me either.
 a little bit OT... what kind of ext4-mount options do you use?
 I have an 5-node cluster with xfs (60 osds), and perhaps the performance
 with ext4 would be better?!
 Hard to tell w/o testing your particular load, I/O patterns.

 When benchmarking directly with single disks or RAIDs it is fairly
 straightforward to see:
 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/028540.html

 Also note that the actual question has never been answered by the Ceph
 team, which is a shame as I venture that it would make things faster.
do you run your cluster without filestore_xattr_use_omap = true, or with it
(to be on the safe side), given the missing answer?

Udo



Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Stijn De Weirdt

hi,

Page reclamation in Linux is NUMA aware.  So page reclamation is not
an issue.

except for the first min_free_kbytes? those can come from anywhere, no?
or is the reclamation such that it tries to free an equal portion for each
NUMA domain? if the OSD allocates memory in chunks smaller than that
value, you might be lucky.



You can see performance improvements only if all the components of a
given IO completes  on a single core. This is hard to achieve in Ceph
as a single IO goes through multiple thread switches and the threads
are not bound to any core.  Starting an OSD with numactl  and binding
it to one core might aggravate the problem as all the threads spawned
by that OSD will compete for the CPU on a single core.  OSD with
default configuration has 20+ threads .  Binding the OSD process to
one core using taskset does not help as some memory (especially heap)
may be already allocated on the other NUMA node.

this is not true if you start the process under numactl, is it?

but binding an OSD to a NUMA domain makes sense.



Looks the design principle followed is to fan out by spawning
multiple threads at each of the pipelining stage to utilize the
available cores in the system.  Because the IOs won't complete on the
same core as issued, lots of cycles are lost for cache coherency.
is intel HT a solution/help for this? turn on HT and start the OSD on 
the L2 (e.g. with hwloc-bind)
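(e.g. something like this; core numbers are made up, and the lstopo/hwloc-bind
syntax is from memory, so double-check against the hwloc docs:)

# show the topology, incl. which hardware threads share a core and L2
lstopo --no-io

# bind an OSD to both hardware threads of cores 2 and 3
hwloc-bind core:2-3 -- ceph-osd -i <id> -c /etc/ceph/ceph.conf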


as a more general question, the recommendation for ceph to have one cpu 
core for each OSD; can these be HT cores or actual physical cores?




stijn





[ceph-users] Pgs are in stale+down+peering state

2014-09-22 Thread Sahana Lokeshappa
Hi all,

I used the 'ceph osd thrash' command, and after all osds are up and in, 3
pgs are in stale+down+peering state

sudo ceph -s
cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
 health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck 
inactive; 3 pgs stuck stale; 3 pgs stuck unclean
 monmap e1: 3 mons at 
{rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0},
 election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
 osdmap e17031: 64 osds: 64 up, 64 in
  pgmap v76728: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
12501 GB used, 10975 GB / 23476 GB avail
2145 active+clean
   3 stale+down+peering

sudo ceph health detail
HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs 
stuck stale; 3 pgs stuck unclean
pg 0.4d is stuck inactive for 341048.948643, current state stale+down+peering, 
last acting [12,56,27]
pg 0.49 is stuck inactive for 341048.948667, current state stale+down+peering, 
last acting [12,6,25]
pg 0.1c is stuck inactive for 341048.949362, current state stale+down+peering, 
last acting [12,25,23]
pg 0.4d is stuck unclean for 341048.948665, current state stale+down+peering, 
last acting [12,56,27]
pg 0.49 is stuck unclean for 341048.948687, current state stale+down+peering, 
last acting [12,6,25]
pg 0.1c is stuck unclean for 341048.949382, current state stale+down+peering, 
last acting [12,25,23]
pg 0.4d is stuck stale for 339823.956929, current state stale+down+peering, 
last acting [12,56,27]
pg 0.49 is stuck stale for 339823.956930, current state stale+down+peering, 
last acting [12,6,25]
pg 0.1c is stuck stale for 339823.956925, current state stale+down+peering, 
last acting [12,25,23]


Please, can anyone explain why pgs are in this state.
Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com






Re: [ceph-users] Timeout on ceph-disk activate

2014-09-22 Thread Alfredo Deza
I would run that one command (sudo ceph-disk -v activate --mark-init sysvinit
--mount /data/osd ) on the hp10 box and see what is going on when you do so.
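Something along these lines, so the verbose output and the OSD log stay on
your screen (log path assumed to be the default):

# on hp10 itself, run the activation by hand and keep the -v output
sudo ceph-disk -v activate --mark-init sysvinit --mount /data/osd

# in a second terminal, watch the OSD log while it runs
tail -f /var/log/ceph/ceph-osd.*.log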



On Thu, Sep 18, 2014 at 12:09 PM, BG bglac...@nyx.com wrote:
 I've hit a timeout issue on calls to ceph-disk activate.

 Initially, I followed the 'Storage Cluster Quick Start' on the CEPH website to
 get a cluster up and running. I wanted to tweak the configuration however and
 decided to blow away the initial setup using the purge / purgedata / 
 forgetkeys
 commands with ceph-deploy.

 Next time around I'm getting a timeout error when attempting to activate an 
 OSD
 on two out of the three boxes I'm using:
 [ceph_deploy.cli][INFO  ] Invoked (1.5.15): /usr/bin/ceph-deploy osd activate
 hp10:/data/osd
 [ceph_deploy.osd][DEBUG ] Activating cluster ceph disks hp10:/data/osd:
 [hp10][DEBUG ] connected to host: hp10
 [hp10][DEBUG ] detect platform information from remote host
 [hp10][DEBUG ] detect machine type
 [ceph_deploy.osd][INFO  ] Distro info: CentOS Linux 7.0.1406 Core
 [ceph_deploy.osd][DEBUG ] activating host hp10 disk /data/osd
 [ceph_deploy.osd][DEBUG ] will use init type: sysvinit
 [hp10][INFO  ] Running command: sudo ceph-disk -v activate --mark-init 
 sysvinit
 --mount /data/osd
 [hp10][WARNIN] No data was received after 300 seconds, disconnecting...
 [hp10][INFO  ] checking OSD status...
 [hp10][INFO  ] Running command: sudo ceph --cluster=ceph osd stat 
 --format=json

 This is on CentOS 7, ceph-deploy version is 1.5.15. The firewalld service is
 disabled, network connectivity should be good as the cluster previously worked
 on these boxes.

 Any suggestions where I should start looking to track down the root cause
 of the timeout?



Re: [ceph-users] ceph health related message

2014-09-22 Thread Sean Sullivan
I had this happen to me as well. Turned out to be a connlimit thing for me.
I would check dmesg/the kernel log and see if you see any conntrack
"limit reached, connection dropped" type messages, then increase connlimit.
Odd, as I connected over ssh for this, but I can't deny syslog.
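Roughly what I checked, from memory (sysctl names may differ per distro/kernel):

# look for dropped-connection messages
dmesg | grep -i conntrack

# current limit and usage
sysctl net.netfilter.nf_conntrack_max
cat /proc/sys/net/netfilter/nf_conntrack_count

# raise it, and persist it in /etc/sysctl.conf if it helps
sysctl -w net.netfilter.nf_conntrack_max=262144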


Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Mark Nelson

On 09/22/2014 01:55 AM, Christian Balzer wrote:


Hello,

not really specific to Ceph, but since one of the default questions by the
Ceph team when people are facing performance problems seems to be
Have you tried turning it off and on again? ^o^ err,
Are all your interrupts on one CPU?
I'm going to wax on about this for a bit and hope for some feedback from
others with different experiences and architectures than me.


This may be a result of me harping about this after a customer's
clusters had mysterious performance issues where irqbalance didn't
appear to be working properly. :)




Now firstly that question if all your IRQ handling is happening on the
same CPU is a valid one, as depending on a bewildering range of factors
ranging from kernel parameters to actual hardware one often does indeed
wind up with that scenario, usually with all on CPU0.
Which certainly is the case with all my recent hardware and Debian
kernels.


Yes, there are certainly a lot of scenarios where this can happen.  I 
think the hope has been that with MSI-X, interrupts will get evenly 
distributed by default and that is typically better than throwing them 
all at core 0, but things are still quite complicated.




I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and
thus feedback from Intel users is very much sought after, as I'm
considering Intel based storage nodes in the future.
It's vaguely amusing that Ceph storage nodes seem to have more CPU
(individual core performance, not necessarily # of cores) and similar RAM
requirements than my VM hosts. ^o^


It might be reasonable to say that Ceph is a pretty intensive piece of 
software.  With lots of OSDs on a system there are hundreds if not 
thousands of threads.  Under heavy load conditions the CPUs, network 
cards, HBAs, memory, socket interconnects, possibly SAS expanders are 
all getting worked pretty hard and possibly in unusual ways where both 
throughput and latency are important.  At the cluster scale things like 
switch bisection bandwidth and network topology become issues too.  High 
performance clustered storage is imho one of the most complicated 
performance subjects in computing.


The good news is that much of this can be avoided by sticking to simple 
designs with fewer OSDs per node.  The more OSDs you try to stick in 1 
system, the more you need to worry about all of this if you care about 
high performance.




So the common wisdom is that all IRQs on one CPU is a bad thing, lest it
gets overloaded and for example drop network packets because of this.
And while that is true, I'm hard pressed to generate any load on my
clusters where the IRQ ratio on CPU0 goes much beyond 50%.

Thus it should come as no surprise that spreading out IRQs with irqbalance
or more accurately by manually setting the /proc/irq/xx/smp_affinity mask
doesn't give me any discernible differences when it comes to benchmark
results.


Ok, that's fine, but this is pretty subjective.  Without knowing the 
load and the hardware setup I don't think we can really draw any 
conclusions other than that in your test on your hardware this wasn't 
the bottleneck.




With irqbalance spreading things out willy-nilly w/o any regards or
knowledge about the hardware and what IRQ does what it's definitely
something I won't be using out of the box. This goes especially for systems
with different NUMA regions without proper policyscripts for irqbalance.


I believe irqbalance takes PCI topology into account when making mapping 
decisions.  See:


http://dcs.nac.uci.edu/support/sysadmin/security/archive/msg09707.html
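You can also cross-check what the kernel itself thinks the right home for a
device and its IRQs is, for example (the IRQ number is a placeholder):

# NUMA node the NIC or HBA hangs off of (-1 means no/unknown affinity)
cat /sys/class/net/eth0/device/numa_node

# kernel-provided hint and currently configured affinity for one IRQ
cat /proc/irq/42/affinity_hint
cat /proc/irq/42/smp_affinity_list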



So for my current hardware I'm going to keep IRQs on CPU0 and CPU1 which
are the same Bulldozer module and thus sharing L2 and L3 cache.
In particular the AHCI (journal SSDs) and HBA or RAID controller IRQs on
CPU0 and the network (Infiniband) on CPU1.
That should give me sufficient reserves in processing power and keep intra
core (module) and NUMA (additional physical CPUs) traffic to a minimum.
This also will (within a certain load range) allow these 2 CPUs (module)
to be ramped up to full speed while other cores can remain at a lower
frequency.


So it's been a while since I looked at AMD CPU interconnect topology,
but back in the Magny-Cours era I drew up some diagrams:


2 socket:

https://docs.google.com/drawings/d/1_egexLqN14k9bhoN2nkv3iTgAbbPcwuwJmhwWAmakwo/edit?usp=sharing

4 socket:

https://docs.google.com/drawings/d/1V5sFSInKq3uuKRbETx1LVOURyYQF_9Z4zElPrl1YIrw/edit?usp=sharing

I think Interlagos looks somewhat similar from a HyperTransport
perspective.  My gut instinct is that you really want to keep
everything you can local to the socket on these kinds of systems.  So if
your HBA is on the first socket, you want your processing and interrupt
handling there too.  In the 4-socket configuration this is especially
true.  It's entirely possible that you may have to go through both an
on-die and an inter-socket HT link before you get to a neighbour

Re: [ceph-users] Newbie Ceph Design Questions

2014-09-22 Thread Christian Balzer
On Mon, 22 Sep 2014 13:35:26 +0200 Udo Lembke wrote:

 Hi Christian,
 
 On 22.09.2014 05:36, Christian Balzer wrote:
  Hello,
 
  On Sun, 21 Sep 2014 21:00:48 +0200 Udo Lembke wrote:
 
  Hi Christian,
 
  On 21.09.2014 07:18, Christian Balzer wrote:
  ...
  Personally I found ext4 to be faster than XFS in nearly all use cases
  and the lack of full, real kernel integration of ZFS is something
  that doesn't appeal to me either.
  a little bit OT... what kind of ext4-mount options do you use?
  I have an 5-node cluster with xfs (60 osds), and perhaps the
  performance with ext4 would be better?!
  Hard to tell w/o testing your particular load, I/O patterns.
 
  When benchmarking directly with single disks or RAIDs it is fairly
  straightforward to see:
  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/028540.html
 
  Also note that the actual question has never been answered by the Ceph
  team, which is a shame as I venture that it would make things faster.
 do you run your cluster without filestore_xattr_use_omap = true, or with it
 (to be on the safe side), given the missing answer?
 
For the time being at the default, aka filestore_xattr_use_omap = true.
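(Spelled out in ceph.conf terms, that is simply the following; it can live in
the [osd] or [global] section:)

[osd]
        filestore xattr use omap = true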

Christian 
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/


[ceph-users] Adding another radosgw node

2014-09-22 Thread Jon Kåre Hellan

Hi

We've got a three node ceph cluster, and radosgw on a fourth machine. We 
would like to add another radosgw machine for high availability. Here 
are a few questions I have:


- We aren't expecting to deploy to multiple regions and zones anywhere
  soon. So presumably, we do not have to worry about federated
  deployment. Would it be hard to move to a federated deployment later?
- What is a radosgw instance? I was guessing that it was a machine
  running radosgw. If not, is it a separate gateway with a separate
  set of user and pools, possibly running on the same machine?
- Can I simply deploy another radosgw machine with the same
  configuration as the first one? If the second interpretation is true,
  I guess I could.
- Am I right that all gateway users go in the same keyring, which is
  copied to all the gateway nodes and all the monitor nodes?
- The gateway nodes obviously need a
  [client.radosgw.{instance-name}] stanza in /etc/ceph.conf. Do the
  monitor nodes also need a copy of the stanza?
- Do the gateway nodes need all of the monitors' [global] stanza in
  their /etc/ceph.conf? Presumably, they at least need mon_host to know
  who to talk to. What else?
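To make the stanza question concrete, I imagine something along these lines
on the second box (names and paths made up, and I may well be missing options):

[client.radosgw.gateway2]
        host = rgw2
        keyring = /etc/ceph/ceph.client.radosgw.keyring
        rgw socket path = /var/run/ceph/ceph.radosgw.gateway2.fastcgi.sock
        log file = /var/log/ceph/client.radosgw.gateway2.log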

Regards

Jon

Jon Kåre Hellan, UNINETT AS, Trondheim, Norway


Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-22 Thread Sage Weil
Stale means that the primary OSD for the PG went down and its status has not
been updated since.  They all seem to be from OSD.12... Seems like something
is preventing that OSD from reporting to the mon?

sage



Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-22 Thread Varada Kari
Hi Sage,

To give more context on this problem,

This cluster has two pools, rbd and one user-created.

Osd.12 is a primary for some other PGs as well, but the problem happens only
for these three PGs.

$ sudo ceph osd lspools
0 rbd,2 pool1,

$ sudo ceph -s
cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
 health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck 
inactive; 3 pgs stuck stale; 3 pgs stuck unclean; 1 requests are blocked > 32 sec
monmap e1: 3 mons at 
{rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0},
 election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
 osdmap e17842: 64 osds: 64 up, 64 in
  pgmap v79729: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
12504 GB used, 10971 GB / 23476 GB avail
2145 active+clean
   3 stale+down+peering

Snippet from pg dump:

2.a9 518 0 0 0 0 2172649472 3001 3001 active+clean 2014-09-22 17:49:35.357586 6826'35762 17842:72706 [12,7,28] 12 [12,7,28] 12 6826'35762 2014-09-22 11:33:55.985449 0'0 2014-09-16 20:11:32.693864
0.59 0 0 0 0 0 0 0 0 active+clean 2014-09-22 17:50:00.751218 0'0 17842:4472 [12,41,2] 12 [12,41,2] 12 0'0 2014-09-22 16:47:09.315499 0'0 2014-09-16 12:20:48.618726
0.4d 0 0 0 0 0 0 4 4 stale+down+peering 2014-09-18 17:51:10.038247 186'4 11134:498 [12,56,27] 12 [12,56,27] 12 186'4 2014-09-18 17:30:32.393188 0'0 2014-09-16 12:20:48.615322
0.49 0 0 0 0 0 0 0 0 stale+down+peering 2014-09-18 17:44:52.681513 0'0 11134:498 [12,6,25] 12 [12,6,25] 12 0'0 2014-09-18 17:16:12.986658 0'0 2014-09-16 12:20:48.614192
0.1c 0 0 0 0 0 0 12 12 stale+down+peering 2014-09-18 17:51:16.735549 186'12 11134:522 [12,25,23] 12 [12,25,23] 12 186'12 2014-09-18 17:16:04.457863 186'10 2014-09-16 14:23:58.731465
2.17 510 0 0 0 0 2139095040 3001 3001 active+clean 2014-09-22 17:52:20.364754 6784'30742 17842:72033 [12,27,23] 12 [12,27,23] 12 6784'30742 2014-09-22 00:19:39.905291 0'0 2014-09-16 20:11:17.016299
2.7e8 508 0 0 0 0 2130706432 3433 3433 active+clean 2014-09-22 17:52:20.365083 6702'21132 17842:64769 [12,25,23] 12 [12,25,23] 12 6702'21132 2014-09-22 17:01:20.546126 0'0 2014-09-16 14:42:32.079187
2.6a5 528 0 0 0 0 2214592512 2840 2840 active+clean 2014-09-22 22:50:38.092084 6775'34416 17842:83221 [12,58,0] 12 [12,58,0] 12 6775'34416 2014-09-22 22:50:38.091989 0'0 2014-09-16 20:11:32.703368

And we couldn't observe any peering events happening on the primary osd.

$ sudo ceph pg 0.49 query
Error ENOENT: i don't have pgid 0.49
$ sudo ceph pg 0.4d query
Error ENOENT: i don't have pgid 0.4d
$ sudo ceph pg 0.1c query
Error ENOENT: i don't have pgid 0.1c

Not able to explain why the peering was stuck. BTW, the rbd pool doesn't
contain any data.
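(For reference, the usual things to try against the primary in this state,
sketched from memory:)

# confirm the osd actually responds on the cluster
ceph osd find 12
ceph tell osd.12 version

# force it to re-peer by marking it down (it should come straight back up)
ceph osd down 12

# or restart the daemon on its node
/etc/init.d/ceph restart osd.12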

Varada


[ceph-users] XenServer and Ceph - any updates?

2014-09-22 Thread Andrei Mikhailovsky
Hello guys, 

I was wondering if there have been any updates on getting XenServer ready for
ceph? I've seen a howto that was written well over a year ago (I think) for a
PoC integration of XenServer and Ceph. However, I've not seen any developments
lately. It would be cool to see other hypervisors adopting Ceph ))

Cheers 

Andrei 


[ceph-users] Reassigning admin server

2014-09-22 Thread LaBarre, James (CTR) A6IT
If I have a machine/VM I am using as an Admin node for a ceph cluster, can I
relocate that admin to another machine/VM after I've built a cluster?  I would
expect that, as the Admin isn't an actual operating part of the cluster itself
(other than Calamari, if it happens to be running), the rest of the cluster
should be adequately served with a -update-conf.
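(I'm assuming the usual ceph-deploy workflow here, i.e. roughly the following
from a working directory on the new machine; command names from memory, so
please double-check:)

# pull the existing cluster config and keys, then make the new node an admin
ceph-deploy config pull <mon-host>
ceph-deploy gatherkeys <mon-host>
ceph-deploy admin <new-admin-host>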



Re: [ceph-users] Bcache / Enhanceio with osds

2014-09-22 Thread Robert LeBlanc
We are still in the middle of testing things, but so far we have had more
improvement with SSD journals than with the OSDs cached by bcache (five OSDs
fronted by one SSD). We still have yet to test whether adding a bcache layer in
addition to the SSD journals provides any additional improvements.

Robert LeBlanc

On Sun, Sep 14, 2014 at 6:13 PM, Mark Nelson mark.nel...@inktank.com
wrote:

 On 09/14/2014 05:11 PM, Andrei Mikhailovsky wrote:

 Hello guys,

 Was wondering if anyone uses or done some testing with using bcache or
 enhanceio caching in front of ceph osds?

 I've got a small cluster of 2 osd servers, 16 osds in total and 4 ssds
 for journals. I've recently purchased four additional ssds to be used
 for ceph cache pool, but i've found performance of guest vms to be
 slower with the cache pool for many benchmarks. The write performance
 has slightly improved, but the read performance has suffered a lot (as
 much as 60% in some tests).

 Therefore, I am planning to scrap the cache pool (at least until it
 matures) and use either bcache or enhanceio instead.


 We're actually looking at dm-cache a bit right now. (and talking some of
 the developers about the challenges they are facing to help improve our own
 cache tiering)  No meaningful benchmarks of dm-cache yet though. Bcache,
 enhanceio, and flashcache all look interesting too.  Regarding the cache
 pool: we've got a couple of ideas that should help improve performance,
 especially for reads.  There are definitely advantages to keeping cache
 local to the node though.  I think some form of local node caching could be
 pretty useful going forward.


 Thanks

 Andrei




Re: [ceph-users] Bcache / Enhanceio with osds

2014-09-22 Thread Mark Nelson
Likely it won't, since the OSD is already coalescing journal writes.
FWIW, I ran through a bunch of tests using seekwatcher and blktrace at 
4k, 128k, and 4m IO sizes on a 4 OSD cluster (3x replication) to get a 
feel for what the IO patterns are like for the dm-cache developers.  I 
included both the raw blktrace data and seekwatcher graphs here:


http://nhm.ceph.com/firefly_blktrace/

There are some interesting patterns but they aren't too easy to spot (I
don't know why Chris decided to use blue and green by default!)


Mark



[ceph-users] Ceph Day Speaking Slots

2014-09-22 Thread Patrick McGarry
Hey cephers,

As we finalize the next couple schedules for Ceph Days in NYC and
London it looks like there are still a couple of speaking slots open.
If you are available in NYC on 08 OCT or in London on 22 OCT and would
be interested in speaking about your Ceph experiences (of any kind)
please contact me as soon as possible.  Thanks.

http://ceph.com/cephdays/


Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


Re: [ceph-users] Bcache / Enhanceio with osds

2014-09-22 Thread Andrei Mikhailovsky
I've done a bit of testing with enhanceio on my cluster and I can see a
definite improvement in read performance for cached data. The performance
increase is around 3-4 times the cluster speed prior to using enhanceio, based
on large block size IO (1M and 4M).

I've done a concurrent test of running a single 'dd if=/dev/vda of=/dev/null
bs=1M/4M iflag=direct' instance over 20 vms which were running on 4 host
servers. Prior to enhanceio i was getting around 30-35MB/s per guest vm
regardless of how many times i ran the test. With enhanceio (from the second
run) I was hitting over 130MB/s per vm. I've not seen any lag in the performance
of other vms while using enhanceio, unlike the considerable lag without it. The
ssd disk utilisation was not hitting much over 60%.
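(Roughly, each guest ran something like this; just a sketch of the test, the
device name obviously depends on the vm:)

# direct sequential read of the whole virtual disk, bypassing the guest cache
dd if=/dev/vda of=/dev/null bs=4M iflag=direct

# optionally drop the guest page cache between runs
echo 3 > /proc/sys/vm/drop_caches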

The small block size (4K) performance hasn't changed with enhanceio, which made 
me think that the performance of osds themselves is limited when using small 
block sizes. I wasn't getting much over 2-3MB/s per guest vm. 

On the contrary, when I tried to use the firefly cache pool on the same
hardware, my cluster performed significantly slower with the cache pool. The
whole cluster seemed under a lot more load and the performance dropped to
around 12-15MB/s, and other guest vms were very very slow. The ssd disks were
utilised 100% all the time during the test, with a majority of write IO.

I admit that these tests shouldn't be considered definitive, full performance
tests of a ceph cluster, as this is a live cluster with disk io activity
outside of the test vms. The average load is not much (300-500 IO/s), mainly
reads. However, it still indicates that there is room for improvement in
ceph's cache pool implementation. Looking at my results, I think ceph is
missing a lot of hits on the read cache, which causes the osds to write a lot
of data. With enhanceio I was getting well over 50% read hit ratio and the
main activity on the ssds was read io, unlike with ceph.

Outside of the tests, i've left enhanceio running on the osd servers. It has
been a few days now and the hit ratio on the osds is around 8-11%, which seems
a bit low. I was wondering if I should change the default block size of
enhanceio to 2K instead of the default 4K. Taking into account ceph's object
size of 4M I am not sure if this will help the hit ratio. Does anyone have an
idea?

Andrei 

Re: [ceph-users] OSDs are crashing with Cannot fork or cannot create thread but plenty of memory is left

2014-09-22 Thread Nathan O'Sullivan

Hi Christian,

Your problem is probably that your kernel.pid_max (the maximum 
threads+processes across the entire system) needs to be increased - the 
default is 32768, which is too low for even a medium density 
deployment.  You can test this easily enough with


$ ps axms | wc -l

If you get a number around the 30,000 mark then you are going to be 
affected.
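If that is indeed the limit you are hitting, raising it is a one-line sysctl; 
a minimal sketch (the value and the file name below are arbitrary examples): 
---
# check the current limit and the current usage
sysctl kernel.pid_max
ps axms | wc -l
# raise it at runtime and persist it across reboots
sysctl -w kernel.pid_max=4194303
echo 'kernel.pid_max = 4194303' > /etc/sysctl.d/90-pid-max.conf
---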


There's an issue here http://tracker.ceph.com/issues/6142 , although it 
doesn't seem to have gotten much traction in terms of informing users.


Regards
Nathan

On 15/09/2014 7:13 PM, Christian Eichelmann wrote:

Hi all,

I have no idea why running out of filehandles should produce an out of
memory error, but well. I've increased the ulimit as you told me, and
nothing changed. I've noticed that the osd init script sets the max open
file handles explicitly, so I set the corresponding option in my
ceph.conf. Now the limits of an OSD process look like this:

Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            8388608      unlimited    bytes
Max core file size        unlimited    unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             2067478      2067478      processes
Max open files            65536        65536        files
Max locked memory         65536        65536        bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       2067478      2067478      signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us

Anyways, the exact same behavior as before. I also found a mail on this
list from someone who had the exact same problem:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040059.html

Unfortunately, there was also no real solution for this problem.

So again: this is *NOT* a ulimit issue. We were running emperor and
dumpling on the same hardware without any issues. They first started
after our upgrade to firefly.

Regards,
Christian


Am 12.09.2014 18:26, schrieb Christian Balzer:

On Fri, 12 Sep 2014 12:05:06 -0400 Brian Rak wrote:


That's not how ulimit works.  Check the `ulimit -a` output.


Indeed.

And to forestall the next questions, see man initscript, mine looks like
this:
---
ulimit -Hn 131072
ulimit -Sn 65536

# Execute the program.
eval exec $4
---

And also a /etc/security/limits.d/tuning.conf (debian) like this:
---
rootsoftnofile  65536
roothardnofile  131072
*   softnofile  16384
*   hardnofile  65536
---

Adjust these to your actual needs. There might be other limits you're
hitting, but that one is the most likely.
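To check that these settings actually reach the running daemons, a quick
sanity test (plain /proc, nothing Ceph specific) could look like this:
---
# effective open-file limits of every running ceph-osd process
for pid in $(pidof ceph-osd); do
    echo "== ceph-osd pid $pid =="
    grep 'Max open files' /proc/$pid/limits
done
---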

Also 45 OSDs with 12 (24 with HT, bleah) CPU cores is pretty ballsy.
I personally would rather do 4 RAID6 (10 disks, with OSD SSD journals)
with that kind of case and enjoy the fact that my OSDs never fail. ^o^

Christian (another one)



On 9/12/2014 10:15 AM, Christian Eichelmann wrote:

Hi,

I am running all commands as root, so there are no limits for the
processes.

Regards,
Christian
___
Von: Mariusz Gronczewski [mariusz.gronczew...@efigence.com]
Gesendet: Freitag, 12. September 2014 15:33
An: Christian Eichelmann
Cc: ceph-users@lists.ceph.com
Betreff: Re: [ceph-users] OSDs are crashing with Cannot fork or
cannot create thread but plenty of memory is left

do cat /proc/pid/limits

probably you hit max processes limit or max FD limit


Hi Ceph-Users,

I have absolutely no idea what is going on on my systems...

Hardware:
45 x 4TB Harddisks
2 x 6 Core CPUs
256GB Memory

When initializing all disks and joining them to the cluster, after
approximately 30 OSDs, other osds start crashing. When I try to start
them again I see different kinds of errors. For example:


Starting Ceph osd.316 on ceph-osd-bs04...already running
=== osd.317 ===
Traceback (most recent call last):
File "/usr/bin/ceph", line 830, in <module>
  sys.exit(main())
File "/usr/bin/ceph", line 773, in main
  sigdict, inbuf, verbose)
File "/usr/bin/ceph", line 420, in new_style_command
  inbuf=inbuf)
File "/usr/lib/python2.7/dist-packages/ceph_argparse.py", line
1112, in json_command
  raise RuntimeError('{0}: exception {1}'.format(cmd, e))
NameError: global name 'cmd' is not defined
Exception thread.error: error("can't start new thread",) in <bound
method Rados.__del__ of <rados.Rados object
at 0x29ee410>> ignored


or:
/etc/init.d/ceph: 190: /etc/init.d/ceph: Cannot fork
/etc/init.d/ceph: 191: 

[ceph-users] get amount of space used by snapshots

2014-09-22 Thread Steve Anthony
Hello,

If I have an rbd image and a series of snapshots of that image, is there
a fast way to determine how much space the objects composing the
original image and all the snapshots are using in the cluster, or even
just the space used by the snaps?

The only way I've been able to find so far is to get the
block_name_prefix for the image with rbd info and then grep for that
prefix in the output of rados ls, eg. rados ls|grep
rb.0.396de.238e1f29|wc -l. This is relatively slow, printing ~250
objects/s, which means hours to count through 10s of TB of objects.
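In script form, that approach looks roughly like this (pool and image names
are placeholders; the result is only an upper bound, since sparse or partially
written objects are counted as full 4MB ones):
---
POOL=rbd IMAGE=myimage    # placeholders
PREFIX=$(rbd info $POOL/$IMAGE | awk '/block_name_prefix/ {print $2}')
COUNT=$(rados -p $POOL ls | grep -c "$PREFIX")
echo "$IMAGE: $COUNT objects, at most $((COUNT * 4)) MB allocated"
---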

Basically, if I'm keeping daily snapshots for a set of images, I'd like
to be able to tell how much space those snapshots are using so I can
determine how frequently I need to prune old snaps. Thanks!

-Steve

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Christian Balzer

Hello,

On Mon, 22 Sep 2014 08:55:48 -0500 Mark Nelson wrote:

 On 09/22/2014 01:55 AM, Christian Balzer wrote:
 
  Hello,
 
  not really specific to Ceph, but since one of the default questions by
  the Ceph team when people are facing performance problems seems to be
  Have you tried turning it off and on again? ^o^ err,
  Are all your interrupts on one CPU?
  I'm going to wax on about this for a bit and hope for some feedback
  from others with different experiences and architectures than me.
 
 This may be a result of me harping about this after a customer's 
 clusters had mysterious performance issues where irqbalance didn't 
 appear to be working properly. :)
 
 
  Now firstly that question if all your IRQ handling is happening on the
  same CPU is a valid one, as depending on a bewildering range of factors
  ranging from kernel parameters to actual hardware one often does indeed
  wind up with that scenario, usually with all on CPU0.
  Which certainly is the case with all my recent hardware and Debian
  kernels.
 
 Yes, there are certainly a lot of scenarios where this can happen.  I 
 think the hope has been that with MSI-X, interrupts will get evenly 
 distributed by default and that is typically better than throwing them 
 all at core 0, but things are still quite complicated.
 
 
  I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and
  thus feedback from Intel users is very much sought after, as I'm
  considering Intel based storage nodes in the future.
  It's vaguely amusing that Ceph storage nodes seem to have more CPU
  (individual core performance, not necessarily # of cores) and similar
  RAM requirements than my VM hosts. ^o^
 
 It might be reasonable to say that Ceph is a pretty intensive piece of 
 software.  With lots of OSDs on a system there are hundreds if not 
 thousands of threads.  Under heavy load conditions the CPUs, network 
 cards, HBAs, memory, socket interconnects, possibly SAS expanders are 
 all getting worked pretty hard and possibly in unusual ways where both 
 throughput and latency are important.  At the cluster scale things like 
 switch bisection bandwidth and network topology become issues too.  High 
 performance clustered storage is imho one of the most complicated 
 performance subjects in computing.
 
Nobody will argue that. ^.^

 The good news is that much of this can be avoided by sticking to simple 
 designs with fewer OSDs per node.  The more OSDs you try to stick in 1 
 system, the more you need to worry about all of this if you care about 
 high performance.
 
I'd say that 8 OSDs isn't exactly dense (my case), but the advantages
of less densely populated nodes come with the significant price tag of
rack space and hardware costs.

 
  So the common wisdom is that all IRQs on one CPU is a bad thing, lest
  it gets overloaded and for example drop network packets because of
  this. And while that is true, I'm hard pressed to generate any load on
  my clusters where the IRQ ratio on CPU0 goes much beyond 50%.
 
  Thus it should come as no surprise that spreading out IRQs with
  irqbalance or more accurately by manually setting
  the /proc/irq/xx/smp_affinity mask doesn't give me any discernible
  differences when it comes to benchmark results.
 
 Ok, that's fine, but this is pretty subjective.  Without knowing the 
 load and the hardware setup I don't think we can really draw any 
 conclusions other than that in your test on your hardware this wasn't 
 the bottleneck.
 
Of course, I can only realistically talk about what I have tested and thus
invited feedback from others. 
I can certainly see situations where this could be an issue with Ceph and
do have experience with VM hosts that benefited from spreading IRQ
handling over more than one CPU. 

What I'm trying to get across is that people should not fall into a cargo
cult trap, but should think about and examine things for themselves, as
blindly turning on indiscriminate IRQ balancing might do more harm than good
in certain scenarios.
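For that kind of examination the plain kernel interfaces are all that is
needed; a minimal sketch (the device name patterns and the IRQ number are
placeholders):
---
# see which CPUs are actually fielding the interesting interrupts
grep -E 'CPU|eth|mlx|ahci|megasas' /proc/interrupts
# pin a single IRQ to CPU0+CPU1 by hand (mask 0x3); replace <irq> with a number
echo 3 > /proc/irq/<irq>/smp_affinity
---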

 
  With irqbalance spreading things out willy-nilly w/o any regards or
  knowledge about the hardware and what IRQ does what it's definitely
  something I won't be using out of the box. This goes especially for
  systems with different NUMA regions without proper policyscripts for
  irqbalance.
 
 I believe irqbalance takes PCI topology into account when making mapping 
 decisions.  See:
 
 http://dcs.nac.uci.edu/support/sysadmin/security/archive/msg09707.html
 

I'm sure it tries to do the right thing and it gets at least some things
right, like what my system (single Opteron 4386) looks like:
---
Package 0:  numa_node is 0 cpu mask is 00ff (load 0)
Cache domain 0:  numa_node is 0 cpu mask is 0003  (load 0) 
CPU number 0  numa_node is 0 (load 0)
CPU number 1  numa_node is 0 (load 0)
Cache domain 1:  numa_node is 0 cpu mask is 000c  (load 0) 
CPU number 2  numa_node is 0 (load 0)