Re: Ceph performance improvement

2012-08-22 Thread Alexandre DERUMIER
Not sure what version of glibc Wheezy has, but try to make sure you have 
one that supports syncfs (you'll also need a semi-new kernel, 3.0+ 
should be fine). 

Hi, glibc from Wheezy doesn't have syncfs support.

- Original Message - 

From: Mark Nelson mark.nel...@inktank.com 
To: Denis Fondras c...@ledeuns.net 
Cc: ceph-devel@vger.kernel.org 
Sent: Wednesday, August 22, 2012 14:35:28 
Subject: Re: Ceph performance improvement 

On 08/22/2012 03:54 AM, Denis Fondras wrote: 
 Hello all, 

Hello! 

David had some good comments in his reply, so I'll just add in a couple 
of extra thoughts... 

 
 I'm currently testing Ceph. So far it seems that HA and recovering are 
 very good. 
 The only point that prevents me from using it at datacenter-scale is 
 performance. 
 
 First of all, here is my setup : 
 - 1 OSD/MDS/MON on a Supermicro X9DR3-F/X9DR3-F (1x Intel Xeon E5-2603 - 
 4 cores and 8GB RAM) running Debian Sid/Wheezy and Ceph version 0.49 

Not sure what version of glibc Wheezy has, but try to make sure you have 
one that supports syncfs (you'll also need a semi-new kernel, 3.0+ 
should be fine). 

 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). It has 1x 320GB drive 
 for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the journal 
 and 4x 3TB drive (Western Digital WD30EZRX). Everything but the boot 
 partition is BTRFS-formatted and 4K-aligned. 
 - 1 client (P4 3.00GHz dual-core, 1GB RAM) running Debian Sid/Wheezy and 
 Ceph version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). 
 Both servers are linked over a 1Gb Ethernet switch (iperf shows about 
 960Mb/s). 
 
 Here is my ceph.conf : 
 --cut-here-- 
 [global] 
 auth supported = cephx 
 keyring = /etc/ceph/keyring 
 journal dio = true 
 osd op threads = 24 
 osd disk threads = 24 
 filestore op threads = 6 
 filestore queue max ops = 24 
 osd client message size cap = 1400 
 ms dispatch throttle bytes = 1750 
 

The default values are quite a bit lower for most of these. You may want to 
play with them and see if that has an effect. 

 [mon] 
 mon data = /home/mon.$id 
 keyring = /etc/ceph/keyring.$name 
 
 [mon.a] 
 host = ceph-osd-0 
 mon addr = 192.168.0.132:6789 
 
 [mds] 
 keyring = /etc/ceph/keyring.$name 
 
 [mds.a] 
 host = ceph-osd-0 
 
 [osd] 
 osd data = /home/osd.$id 
 osd journal = /home/osd.$id.journal 
 osd journal size = 1000 
 keyring = /etc/ceph/keyring.$name 
 
 [osd.0] 
 host = ceph-osd-0 
 btrfs devs = /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201 
 btrfs options = rw,noatime 

Just FYI, we are trying to get away from 'btrfs devs'. 

 --cut-here-- 
 
 Here are some figures : 
 * Test with dd on the OSD server (on drive 
 /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) : 
 # dd if=/dev/zero of=testdd bs=4k count=4M 
 17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s 

Good job using a data file that is much bigger than main memory! That 
looks pretty accurate for a 7200rpm spinning disk. For dd benchmarks, 
you should probably throw in conv=fdatasync at the end though. 
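
For example, a run with the flush included might look like this (the output 
path is just a placeholder; conv=fdatasync makes dd call fdatasync() once at 
the end, so the reported rate includes flushing the page cache to disk): 

# dd if=/dev/zero of=/mnt/osd-disk/testdd bs=4k count=4M conv=fdatasync 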

 
 = iostat (on the OSD server) : 
 avg-cpu: %user %nice %system %iowait %steal %idle 
 0,00 0,00 0,52 41,99 0,00 57,48 
 
 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn 
 sdf 247,00 0,00 125520,00 0 125520 
 
 * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz to the OSD 
 server (on drive 
 /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) : 
 # time tar xzf src.tar.gz 
 real 0m9.669s 
 user 0m8.405s 
 sys 0m4.736s 
 
 # time rm -rf * 
 real 0m3.647s 
 user 0m0.036s 
 sys 0m3.552s 
 
 = iostat (on the OSD server) : 
 avg-cpu: %user %nice %system %iowait %steal %idle 
 10,83 0,00 28,72 16,62 0,00 43,83 
 
 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn 
 sdf 1369,00 0,00 9300,00 0 9300 
 
 * Test with dd from the client using RBD : 
 # dd if=/dev/zero of=testdd bs=4k count=4M 
 17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s 

RBD caching should definitely be enabled for a test like this. I'd be 
surprised if you got 42MB/s without it though... 
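
If you want to try it, a minimal sketch is something like the following on 
the client side of ceph.conf (the option names assume a release with librbd 
caching; the sizes are only examples, not tuned values): 

[client] 
rbd cache = true 
rbd cache size = 33554432 
rbd cache max dirty = 25165824 

Note this applies to librbd clients such as qemu/KVM; a kernel-mapped 
/dev/rbd device goes through the normal page cache instead. 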

 
 = iostat (on the OSD server) : 
 avg-cpu: %user %nice %system %iowait %steal %idle 
 4,57 0,00 30,46 27,66 0,00 37,31 
 
 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn 
 sda 317,00 0,00 57400,00 0 57400 
 sdf 237,00 0,00 88336,00 0 88336 
 
 * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the 
 client using RBD : 
 # time tar xzf src.tar.gz 
 real 0m26.955s 
 user 0m9.233s 
 sys 0m11.425s 
 
 # time rm -rf * 
 real 0m8.545s 
 user 0m0.128s 
 sys 0m8.297s 
 
 = iostat (on the OSD server) : 
 avg-cpu: %user %nice %system %iowait %steal %idle 
 4,59 0,00 24,74 30,61 0,00 40,05 
 
 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn 
 sda 239,00 0,00 54772,00 0 54772 
 sdf 441,00 0,00 50836,00 0 50836 
 
 * Test with dd from the client using CephFS : 
 # dd if=/dev/zero of=testdd bs=4k count=4M 
 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s 
 
 = iostat (on the OSD server) : 
 avg-cpu: %user %nice %system %iowait 

Ideal hardware spec?

2012-08-22 Thread Jonathan Proulx
Hi All,

Yes, I'm asking the impossible question: what is the best hardware
config?

I'm looking at (possibly) using ceph as backing store for images and
volumes on OpenStack as well as exposing at least the object store for
direct use.  

The OpenStack cluster exists and is currently in the early stages of
use by researchers here: approx. 1500 vCPUs (counting hyperthreads;
768 physical cores) and 3T of RAM across 64 physical nodes.

On the object store side it would be a new resource for us, and it's hard to
say what people would do with it except that it would be many
different things and the use profile would be constantly changing
(which is true of all our existing storage).

In this sense, even though it's a private cloud, the somewhat
unpredictable usage profile gives it some characteristics of a small
public cloud.

Size-wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes
to end up with 20-30T of 3x-replicated storage (call me paranoid).

So the monitor specs seem relatively easy to come up with.  For the
OSDs it looks like
http://ceph.com/docs/master/install/hardware-recommendations suggests
1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage
node).  On-list discussions seem to frequently include an SSD for
journaling (which is similar to what we do for our current ZFS-backed
NFS storage).

I'm hoping to wrap the hardware in a grant and am willing to experiment a
bit with different software configurations to tune it up when/if I get
the hardware in.  So my immediate concern is a hardware spec that will
have a reasonable processor:memory:disk ratio, and opinions (or better,
data) on the utility of SSDs.

First, is the documented core-to-disk ratio still current best
practice?  Given a platform with more drive slots, could 8 cores handle
more disks? Would that need/like more memory?

Have SSDs been shown to improve performance with this architecture?

If so, given the 8-drive-slot example with seven OSDs presented in the
docs, how workable would it be to use a high-performance SSD for the
OS image and also cut journal/log partitions out of it for the
remaining seven 2-3T near-line SAS drives?

Thanks,
-Jon


Re: Ideal hardware spec?

2012-08-22 Thread Wido den Hollander

Hi,

On 08/22/2012 03:55 PM, Jonathan Proulx wrote:

Hi All,

Yes, I'm asking the impossible question: what is the best hardware
config?

I'm looking at (possibly) using ceph as backing store for images and
volumes on OpenStack as well as exposing at least the object store for
direct use.

The OpenStack cluster exists and is currently in the early stages of
use by researchers here: approx. 1500 vCPUs (counting hyperthreads;
768 physical cores) and 3T of RAM across 64 physical nodes.

On the object store side it would be a new resource for us, and it's hard to
say what people would do with it except that it would be many
different things and the use profile would be constantly changing
(which is true of all our existing storage).

In this sense, even though it's a private cloud, the somewhat
unpredictable usage profile gives it some characteristics of a small
public cloud.

Size-wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes
to end up with 20-30T of 3x-replicated storage (call me paranoid).



I prefer 3x replication as well. I've seen the wrong OSDs die on me 
too often.



So the monitor specs seem relatively easy to come up with.  For the
OSDs it looks like
http://ceph.com/docs/master/install/hardware-recommendations suggests
1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage
node).  On-list discussions seem to frequently include an SSD for
journaling (which is similar to what we do for our current ZFS-backed
NFS storage).

I'm hoping to wrap the hardware in a grant and am willing to experiment a
bit with different software configurations to tune it up when/if I get
the hardware in.  So my immediate concern is a hardware spec that will
have a reasonable processor:memory:disk ratio, and opinions (or better,
data) on the utility of SSDs.

First, is the documented core-to-disk ratio still current best
practice?  Given a platform with more drive slots, could 8 cores handle
more disks? Would that need/like more memory?



I'd still suggest about 2GB of RAM per OSD. The more RAM you have in the 
OSD machines, the more the kernel can buffer, which will always be a 
performance gain.


You should, however, ask yourself whether you want a lot of OSDs 
per server or whether you'd rather go for smaller machines with fewer disks.


For example

- 1U
- 4 cores
- 8GB RAM
- 4 disks
- 1 SSD

Or

- 2U
- 8 cores
- 16GB RAM
- 8 disks
- 1|2 SSDs

Both will give you the same amount of storage, but the impact of losing 
one physical machine will be larger with the 2U machine.


If you take 1TB disks you'd lose 8TB of storage; that is a lot of 
recovery to be done.


Since btrfs (assuming you are going to use it) is still in development, 
it's not impossible that a machine goes down due to a kernel panic or 
other problems.


My personal preference is multiple small(er) machines rather than a 
couple of large machines.



Have SSDs been shown to improve performance with this architecture?



I've seen an improvement in performance indeed. Make sure, however, that you 
have a recent version of glibc with syncfs support.
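
A quick way to sanity-check a node (the glibc wrapper for syncfs() appeared 
in glibc 2.14 and the syscall itself in kernel 2.6.39): 

# ldd --version | head -n1    (glibc version; want 2.14 or newer) 
# uname -r                    (kernel version; want 2.6.39 or newer) 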



If so, given the 8-drive-slot example with seven OSDs presented in the
docs, how workable would it be to use a high-performance SSD for the
OS image and also cut journal/log partitions out of it for the
remaining seven 2-3T near-line SAS drives?



You should make sure your SSD is capable of sustaining the line speed of your 
network.


If you are connecting the machines with 4G trunks, make sure the SSD is 
capable of doing around 400MB/sec of sustained writes.
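
A crude way to check a candidate SSD's sustained write rate (the path is just 
a placeholder; oflag=direct bypasses the page cache so the figure reflects 
the device rather than RAM): 

# dd if=/dev/zero of=/mnt/ssd/writetest bs=1M count=4096 oflag=direct 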


I'd recommend the Intel 520 SSDs and reducing their available capacity 
with hdparm to about 20% of their original capacity. This way the SSD 
always has a lot of free cells available for writing. Reprogramming 
cells is expensive on an SSD.
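
The knob for that is the drive's host protected area, which hdparm can 
shrink; a rough sketch (device and sector count are placeholders -- read the 
native max first, and note this hides capacity from the OS): 

# hdparm -N /dev/sdX              (show current and native max sectors) 
# hdparm -N p<sectors> /dev/sdX   (the 'p' prefix makes the smaller limit persistent) 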


You can run the OS on the same SSD since that won't do that much I/O. 
I'd recommend not logging locally though, since that will also write to 
the same SSD. Try using remote syslog.
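
A minimal sketch of that, assuming rsyslog and a log host reachable as 
'loghost': let Ceph log to syslog and have rsyslog forward everything off 
the box. 

In ceph.conf: 
[global] 
log_to_syslog = true 

In /etc/rsyslog.d/remote.conf: 
*.* @loghost 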


You can also use the USB sticks[0] from Stec; they have server-grade 
onboard USB sticks for these kinds of applications.


A couple of questions still need to be answered though:
* Which OS are you planning on using? Ubuntu 12.04 is recommended
* Which filesystem do you want to use underneath the OSDs?

Wido

[0]: http://www.stec-inc.com/product/ufm.php


Thanks,
-Jon





Re: wip-crush

2012-08-22 Thread Florian Haas
On 08/22/2012 03:10 AM, Sage Weil wrote:
 I pushed a branch that changes some of the crush terminology.  Instead of 
 having a crush type called pool that requires you to say things like 
 pool=default in the ceph osd crush set ... command, it uses root 
 instead.  That hopefully reinforces that it is a tree/hierarchy.
 
 There is also a patch that changes bucket to node throughout, since 
 bucket is a term also used by radosgw.
 
 Thoughts?  I think the main pain in making this transition is that old 
 clusters have maps that have a type 'pool' and new ones won't, and the 
 docs will need to walk people through both...

pool in a crushmap being completely unrelated to a RADOS pool is
something that I've heard customers/users report as confusing, as well.
So changing that is probably a good thing. Naming it root is probably
a good choice as well, as it happens to match
http://ceph.com/wiki/Custom_data_placement_with_CRUSH.
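
For anyone who hasn't looked at a decompiled map, the change is just a rename
of the top-level bucket type; a simplified sketch with invented names and
weights (not a real map):

# today
pool default {
        id -1
        alg straw
        hash 0
        item rack1 weight 4.000
        item rack2 weight 4.000
}

# after the proposed rename, the same bucket would read
root default {
        id -1
        alg straw
        hash 0
        item rack1 weight 4.000
        item rack2 weight 4.000
}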

As for changing bucket to node... a node is normally simply a
physical server (at least in HA terminology, which many potential Ceph
users will be familiar with), and CRUSH uses host for that. So that's
another recipe for confusion. How about using something super-generic,
like element or item?

Cheers,
Florian



Re: Ceph performance improvement

2012-08-22 Thread Tommi Virtanen
On Wed, Aug 22, 2012 at 9:23 AM, Denis Fondras c...@ledeuns.net wrote:
 Are you sure your osd data and journal are on the disks you think? The
 /home paths look suspicious -- especially for journal, which often
 should be a block device.
 I am :)
...
 -rw-r--r-- 1 root root 1048576000 août  22 17:22 /home/osd.0.journal

Your journal is a file on a btrfs partition. That is probably a bad
idea for performance. I'd recommend partitioning the drive and using
partitions as journals directly.


[GIT PULL] Ceph fixes for 3.6-rc3

2012-08-22 Thread Sage Weil
Hi Linus,

Please pull the following Ceph fixes for -rc3 from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

Jim's fix closes a narrow race introduced with the msgr changes.  One fix 
resolves problems with debugfs initialization that Yan found when multiple 
client instances are created (e.g., two clusters mounted, or rbd + 
cephfs), another one fixes problems with mounting a nonexistent server 
subdirectory, and the last one fixes a divide by zero error from 
unsanitized ioctl input that Dan Carpenter found.

Thanks!
sage



Jim Schutt (1):
  libceph: avoid truncation due to racing banners

Sage Weil (3):
  libceph: delay debugfs initialization until we learn global_id
  ceph: tolerate (and warn on) extraneous dentry from mds
  ceph: avoid divide by zero in __validate_layout()

 fs/ceph/debugfs.c  |1 +
 fs/ceph/inode.c|   15 +
 fs/ceph/ioctl.c|3 +-
 net/ceph/ceph_common.c |1 -
 net/ceph/debugfs.c |4 +++
 net/ceph/messenger.c   |   11 -
 net/ceph/mon_client.c  |   51 +++
 7 files changed, 72 insertions(+), 14 deletions(-)


Re: wip-crush

2012-08-22 Thread Gregory Farnum
On Wed, Aug 22, 2012 at 9:33 AM, Sage Weil s...@inktank.com wrote:
 On Wed, 22 Aug 2012, Atchley, Scott wrote:
 On Aug 22, 2012, at 10:46 AM, Florian Haas wrote:

  On 08/22/2012 03:10 AM, Sage Weil wrote:
  I pushed a branch that changes some of the crush terminology.  Instead of
  having a crush type called pool that requires you to say things like
  pool=default in the ceph osd crush set ... command, it uses root
  instead.  That hopefully reinforces that it is a tree/hierarchy.
 
  There is also a patch that changes bucket to node throughout, since
  bucket is a term also used by radosgw.
 
  Thoughts?  I think the main pain in making this transition is that old
  clusters have maps that have a type 'pool' and new ones won't, and the
  docs will need to walk people through both...
 
  pool in a crushmap being completely unrelated to a RADOS pool is
  something that I've heard customers/users report as confusing, as well.
  So changing that is probably a good thing. Naming it root is probably
  a good choice as well, as it happens to match
  http://ceph.com/wiki/Custom_data_placement_with_CRUSH.
 
  As for changing bucket to node... a node is normally simply a
  physical server (at least in HA terminology, which many potential Ceph
  users will be familiar with), and CRUSH uses host for that. So that's
  another recipe for confusion. How about using something super-generic,
  like element or item?
 
  Cheers,
  Florian

 My guess is that he is trying to use data structure tree nomenclature
 (root, node, leaf). I agree that node is an overloaded term (as is
 pool).

 Yeah...

 As for an alternative to bucket which indicates the item is a
 collection, what about subtree or branch?

 I think fixing the overloading of 'pool' in the default crush map is the
 biggest pain point.  I can live with crush 'buckets' staying the same (esp
 since that's what the papers and code use pervasively) if we can't come up
 with a better option.

I'm definitely most interested in replacing pool, and root works
for that in my mind. RGW buckets live at a sufficiently different
level that I think people are unlikely to be confused — and bucket
is actually a good name for what they are (I'm open to better ones,
but I don't think that node qualifies).


 On the pool part, though, the challenge is how to transition.  Existing
 clusters have maps that use 'pool', and new clusters will use 'root' (or
 whatever).  Some options:

  - document both.  this kills much of the benefit of switching, but is
probably inevitable since people will be running different versions.
  - make the upgrade process transparently rename the type.  this lets
all the tools use the new names.
  - make the tools silently translate old names to new names.  this is
kludgey in that it makes the code make assumptions about the names of
the data it is working with, but would cover everyone except those who
created their own crush maps from scratch.
  - ?
I would go with option two, and only document the new options — I
wouldn't be surprised if the number of people who had changed those
was zero. Anybody who has done so can certainly be counted on to pay
enough attention that a note saying 'CRUSH names changed (see here if
you customized your map)' would be sufficient, right?
-Greg


Re: SimpleMessenger dispatching: cause of performance problems?

2012-08-22 Thread Samuel Just
What rbd block size were you using?
-Sam

On Tue, Aug 21, 2012 at 10:29 PM, Andreas Bluemle
andreas.blue...@itxperts.de wrote:
 Hi,


 Samuel Just wrote:

 Was the cluster completely healthy at the time that those traces were taken?
 If there were osds going in/out/up/down, it would trigger osdmap updates
 which
 would tend to hold the osd_lock for an extended period of time.



 The cluster was completely healthy.

 v0.50 included some changes that drastically reduce the purview of
 osd_lock.
 In particular, pg op handling no longer grabs the osd_lock and
 handle_osd_map
 can proceed independently of the pg worker threads.  Trying that might be
 interesting.



 I'll grab v0.50 and take a look.


 -Sam

 On Tue, Aug 21, 2012 at 12:20 PM, Sage Weil s...@inktank.com wrote:


 On Tue, 21 Aug 2012, Sage Weil wrote:


 On Tue, 21 Aug 2012, Andreas Bluemle wrote:


 Hi Sage,

 as mentioned, the workload is a single sequential write on
 the client. The write is not O_DIRECT; and consequently
 the messages arrive at the OSD with 124 KByte per write request.

 The attached pdf shows a timing diagram of two concurrent
 write operations (one primary and one replication / secondary).

 The time spent on the primary write to get the OSD::osd_lock
 correlates nicely with the time when this lock is released by the
 secondary write.


 Looking again at this diagram, I'm a bit confused.  Is the Y axis the
 thread id or something?  And the X axis is time in seconds?



 The X axis is time; the Y axis is the absolute offset of the write request
 on the RADOS block device.

 The big question for me is what on earth the secondary write (or primary,
 for that matter) is doing with osd_lock for a full 3 ms...  If my reading
 of the units is correct, *that* is the real problem.  It shouldn't be
 doing anything that takes that long.  The exception is osdmap handling,
 which can do more work, but request processing should be very fast.

 Thanks-
 sage




 Ah, I see.

 There isn't a trivial way to pull osd_lock out of the picture; there are
 several data structures it's protecting (pg_map, osdmaps, peer epoch
 map,
 etc.).  Before we try going down that road, I suspect it might be more
 fruitful to see where cpu time is being spent while osd_lock is held.

 How much of an issue does it look like this specific contention is for
 you?  Does it go away with larger writes?

 sage




 Hope this helps

 Andreas



 Sage Weil wrote:


 On Mon, 20 Aug 2012, Andreas Bluemle wrote:



 Hi Sage,

 Sage Weil wrote:



 Hi Andreas,

 On Thu, 16 Aug 2012, Andreas Bluemle wrote:



 Hi,

 I have been trying to migrate a ceph cluster (ceph-0.48argonaut)
 to a high speed cluster network and encounter scalability problems:
 the overall performance of the ceph cluster does not scale well
 with an increase in the underlying networking speed.

 In short:

 I believe that the dispatching from SimpleMessenger to
 OSD worker queues causes that scalability issue.

 Question: is it possible that this dispatching is causing
 performance
 problems?



 There is a single 'dispatch' thread that's processing this queue,
 and
 conveniently perf lets you break down its profiling data on a
 per-thread
 basis.  Once you've ruled out the throttler as the culprit, you
 might
 try
 running the daemon with 'perf record -g -- ceph-osd ...' and then
 look
 specifically at where that thread is spending its time.  We
 shouldn't be
 burning that much CPU just doing the sanity checks and then handing
 requests
 off to PGs...

 sage





 The effect which I am seeing may be related to some locking issue.
 As I read the code, there are multiple dispatchers running: one per
 SimpleMessenger.

 On a typical OSD node, there are:

 - the instance of the SimpleMessenger processing input from the client
 (primary writes)
 - other instances of SimpleMessenger, which process input from neighbor
 OSD nodes

 The latter generate replication writes to the OSD I am looking at.

 On the other hand, there is a single instance of the OSD object within
 the ceph-osd daemon. When dispatching messages to the OSD, the
 OSD::osd_lock is held for the complete process of dispatching; see code
 below.

 When the write load increases, multiple SimpleMessenger instances start
 to contend for the OSD::osd_lock, and this may cause delays in the
 individual dispatch process.



 This is definitely possible, yes, although it would surprise me if it's
 happening here (unless your workload is all small writes).  Just to
 confirm, are you actually observing osd_lock contention, or speculating
 about what is causing the delays you're seeing?

 I'm not sure what the best tool is to measure lock contention.  Mark was
 playing with a 'poor man's wall clock profiler' using stack trace
 sampling from gdb.  That would tell us whether threads were really
 blocking while obtaining the osd_lock...

 Can you tell us a bit more about what your workload is?

 sage





 bool OSD::ms_dispatch(Message *m)
 

Re: Ceph performance improvement / journal on block-dev

2012-08-22 Thread Dieter Kasper (KD)
On Wed, Aug 22, 2012 at 06:29:12PM +0200, Tommi Virtanen wrote:
(...)
 
 Your journal is a file on a btrfs partition. That is probably a bad
 idea for performance. I'd recommend partitioning the drive and using
 partitions as journals directly.

Hi Tommi,

can you please teach me how to use the right parameter(s) to realize 'journal 
on block-dev' ?

It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf 
--mkbtrfs'
(see below)

Regards,
-Dieter


e.g.
---snip---
modprobe -v brd rd_nr=6 rd_size=1000 # 6x 10G RAM DISK

/etc/ceph/ceph.conf
--
[global]
auth supported = none

# set log file
log file = /ceph/log/$name.log
log_to_syslog = true # uncomment this line to log to syslog

# set up pid files
pid file = /var/run/ceph/$name.pid

[mon]  
mon data = /ceph/$name
debug optracker = 0

[mon.alpha]
host = 127.0.0.1
mon addr = 127.0.0.1:6789

[mds]
debug optracker = 0

[mds.0]
host = 127.0.0.1

[osd]
osd data = /data/$name

[osd.0]
host = 127.0.0.1
btrfs devs  = /dev/ram0
osd journal = /dev/ram3

[osd.1]
host = 127.0.0.1
btrfs devs  = /dev/ram1
osd journal = /dev/ram4

[osd.2]
host = 127.0.0.1
btrfs devs  = /dev/ram2
osd journal = /dev/ram5
--

root # mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs
temp dir is /tmp/mkcephfs.wzARGSpFB6
preparing monmap in /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool --create --clobber --add alpha 127.0.0.1:6789 --print 
/tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: generated fsid 40b997ea-387a-4deb-9a30-805cd076a0de
epoch 0
fsid 40b997ea-387a-4deb-9a30-805cd076a0de
last_changed 2012-08-22 21:04:00.553972
created 2012-08-22 21:04:00.553972
0: 127.0.0.1:6789/0 mon.alpha
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.wzARGSpFB6/monmap (1 
monitors)
=== osd.0 === 
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.0: not mounted
umount: /dev/ram0: not mounted

Btrfs v0.19.1+

ATTENTION:

mkfs.btrfs is not intended to be used directly. Please use the
YaST partitioner to create and manage btrfs filesystems to be
in a supported state on SUSE Linux Enterprise systems.

fs created label (null) on /dev/ram0
nodesize 4096 leafsize 4096 sectorsize 4096 size 9.54GiB
Scanning for Btrfs filesystems
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 
8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected 
ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.923505 7fb475e8b780 -1 filestore(/data/osd.0) could not 
find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2012-08-22 21:04:01.937429 7fb475e8b780 -1 created object store /data/osd.0 
journal /dev/ram3 for osd.0 fsid 40b997ea-387a-4deb-9a30-805cd076a0de
creating private key for osd.0 keyring /data/osd.0/keyring
creating /data/osd.0/keyring
collecting osd.0 key
=== osd.1 === 
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.1: not mounted
(...)




Re: OSD crash

2012-08-22 Thread Sage Weil
On Thu, 23 Aug 2012, Andrey Korolyov wrote:
 Hi,
 
 today during a heavy test a pair of OSDs and one mon died, resulting in
 a hard lockup of some kvm processes - they went unresponsive and were
 killed, leaving zombie processes ([kvm] defunct). The entire cluster
 contains sixteen OSDs on eight nodes and three mons, on the first and
 last nodes and on a VM outside the cluster.
 
 osd bt:
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 (gdb) bt
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #1  0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #2  0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
 #3  0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at
 /usr/include/c++/4.7/bits/basic_string.h:246
 #4  ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at
 /usr/include/c++/4.7/bits/basic_string.h:536
 #5  ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out)
 at /usr/include/c++/4.7/sstream:60
 #6  ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized
 out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439
 #7  pretty_version_to_str () at common/version.cc:40
 #8  0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10,
 out=...) at common/BackTrace.cc:19
 #9  0x0078f450 in handle_fatal_signal (signum=11) at
 global/signal_handler.cc:91
 #10 signal handler called
 #11 0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4
 #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() ()
 from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #15 0x7fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #16 0x7fc37d1c47c3 in std::terminate() () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #17 0x7fc37d1c49ee in __cxa_throw () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c
 0 == \unexpected error\, file=optimized out, line=3007,
 func=0x90ef80 unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int))
 at common/assert.cc:77

This means it got an unexpected error when talking to the file system.  If 
you look in the osd log, it may tell you what that was.  (It may 
not--there isn't usually the other tcmalloc stuff triggered from the 
assert handler.)

What happens if you restart that ceph-osd daemon?

sage


 #19 0x0073148f in FileStore::_do_transaction
 (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545,
 trans_num=trans_num@entry=0) at os/FileStore.cc:3007
 #20 0x0073484e in FileStore::do_transactions (this=0x2cde000,
 tls=..., op_seq=429545) at os/FileStore.cc:2436
 #21 0x0070c680 in FileStore::_do_op (this=0x2cde000,
 osr=optimized out) at os/FileStore.cc:2259
 #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at
 common/WorkQueue.cc:54
 #23 0x006823ed in ThreadPool::WorkThread::entry
 (this=optimized out) at ./common/WorkQueue.h:126
 #24 0x7fc37e3eee9a in start_thread () from
 /lib/x86_64-linux-gnu/libpthread.so.0
 #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
 #26 0x in ?? ()
 
 mon bt was exactly the same as in http://tracker.newdream.net/issues/2762
 
 


Re: OSD crash

2012-08-22 Thread Andrey Korolyov
On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote:
 On Thu, 23 Aug 2012, Andrey Korolyov wrote:
 Hi,

 today during a heavy test a pair of OSDs and one mon died, resulting in
 a hard lockup of some kvm processes - they went unresponsive and were
 killed, leaving zombie processes ([kvm] defunct). The entire cluster
 contains sixteen OSDs on eight nodes and three mons, on the first and
 last nodes and on a VM outside the cluster.

 osd bt:
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 (gdb) bt
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #1  0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #2  0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
 #3  0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at
 /usr/include/c++/4.7/bits/basic_string.h:246
 #4  ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at
 /usr/include/c++/4.7/bits/basic_string.h:536
 #5  ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out)
 at /usr/include/c++/4.7/sstream:60
 #6  ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized
 out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439
 #7  pretty_version_to_str () at common/version.cc:40
 #8  0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10,
 out=...) at common/BackTrace.cc:19
 #9  0x0078f450 in handle_fatal_signal (signum=11) at
 global/signal_handler.cc:91
 #10 signal handler called
 #11 0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4
 #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() ()
 from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #15 0x7fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #16 0x7fc37d1c47c3 in std::terminate() () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #17 0x7fc37d1c49ee in __cxa_throw () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c
 0 == \unexpected error\, file=optimized out, line=3007,
 func=0x90ef80 unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int))
 at common/assert.cc:77

 This means it got an unexpected error when talking to the file system.  If
 you look in the osd log, it may tell you what that was.  (It may
 not--there isn't usually the other tcmalloc stuff triggered from the
 assert handler.)

 What happens if you restart that ceph-osd daemon?

 sage



Unfortunately I had completely disabled logs during the test, so there
is no hint of what caused the assert failure. The main problem was
revealed - the created VMs were pointed at one monitor instead of the
set of three, so there may be some unusual effects (btw, the crashed mon
isn't the one above, but a neighbor of the crashed OSDs on the first
node). After an IPMI reset the node came back fine and the cluster
behavior seems to be okay - the stuck kvm I/O somehow prevented even
loading or unloading other modules on this node, so I finally decided to
do a hard reset. Although I'm running an almost generic wheezy, glibc
was updated to 2.15; maybe that is why this trace appeared for the first
time. I'm almost sure the fs did not trigger this crash and mainly
suspect the stuck kvm processes. I'll rerun the test with the same
conditions tomorrow (~500 VMs pointed at one mon and very high I/O, but
with osd logging).

 #19 0x0073148f in FileStore::_do_transaction
 (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545,
 trans_num=trans_num@entry=0) at os/FileStore.cc:3007
 #20 0x0073484e in FileStore::do_transactions (this=0x2cde000,
 tls=..., op_seq=429545) at os/FileStore.cc:2436
 #21 0x0070c680 in FileStore::_do_op (this=0x2cde000,
 osr=optimized out) at os/FileStore.cc:2259
 #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at
 common/WorkQueue.cc:54
 #23 0x006823ed in ThreadPool::WorkThread::entry
 (this=optimized out) at ./common/WorkQueue.h:126
 #24 0x7fc37e3eee9a in start_thread () from
 /lib/x86_64-linux-gnu/libpthread.so.0
 #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
 #26 0x in ?? ()

 mon bt was exactly the same as in http://tracker.newdream.net/issues/2762



Re: Ceph performance improvement / journal on block-dev

2012-08-22 Thread Tommi Virtanen
On Wed, Aug 22, 2012 at 12:12 PM, Dieter Kasper (KD)
d.kas...@kabelmail.de wrote:
 Your journal is a file on a btrfs partition. That is probably a bad
 idea for performance. I'd recommend partitioning the drive and using
 partitions as journals directly.
 can you please teach me how to use the right parameter(s) to realize 'journal 
 on block-dev' ?

Replacing the example paths, use sudo parted /dev/sdg or gksu
gparted /dev/sdg, create partitions, set osd journal to point to a
block device for a partition.

[osd.42]
osd journal = /dev/sdg4
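
A possible sequence for carving out two 10G journal partitions (device name
and sizes are placeholders; this is destructive, so double-check the device
first):

# parted -s /dev/sdg mklabel gpt
# parted -s /dev/sdg mkpart journal0 1MiB 10GiB
# parted -s /dev/sdg mkpart journal1 10GiB 20GiB
# parted -s /dev/sdg print

and then point each osd's journal at the matching /dev/sdgN.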

 It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf 
 --mkbtrfs'
 (see below)

Try running it with -x for any chance of extracting debuggable
information from the monster.

 Scanning for Btrfs filesystems
  HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 
 8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected 
 ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal

Based on that, my best guess would be that you're seeing a journal
from an old run -- perhaps you need to explicitly clear out the block
device contents..
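
Since the journals in this setup are the /dev/ram* devices from the config
above, one blunt way to do that is to zero the start of each journal device
before re-running mkcephfs (destructive, so only on devices that hold nothing
else):

# dd if=/dev/zero of=/dev/ram3 bs=1M count=16
# dd if=/dev/zero of=/dev/ram4 bs=1M count=16
# dd if=/dev/zero of=/dev/ram5 bs=1M count=16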

Frankly, you should not use btrfs devs. Any convenience you may gain
is more than doubly offset by pains exactly like these.


Re: OSD crash

2012-08-22 Thread Gregory Farnum
The tcmalloc backtrace on the OSD suggests this may be unrelated, but
what's the fd limit on your monitor process? You may be approaching
that limit if you've got 500 OSDs and a similar number of clients.
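
A quick way to check on the monitor host (assuming a single ceph-mon process
there):

# cat /proc/$(pidof ceph-mon)/limits | grep 'open files'
# ls /proc/$(pidof ceph-mon)/fd | wc -l

The first shows the limit the running process actually has, the second how
many descriptors it currently holds.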

On Wed, Aug 22, 2012 at 6:55 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote:
 On Thu, 23 Aug 2012, Andrey Korolyov wrote:
 Hi,

 today during a heavy test a pair of OSDs and one mon died, resulting in
 a hard lockup of some kvm processes - they went unresponsive and were
 killed, leaving zombie processes ([kvm] defunct). The entire cluster
 contains sixteen OSDs on eight nodes and three mons, on the first and
 last nodes and on a VM outside the cluster.

 osd bt:
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 (gdb) bt
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #1  0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #2  0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
 #3  0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at
 /usr/include/c++/4.7/bits/basic_string.h:246
 #4  ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at
 /usr/include/c++/4.7/bits/basic_string.h:536
 #5  ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out)
 at /usr/include/c++/4.7/sstream:60
 #6  ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized
 out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439
 #7  pretty_version_to_str () at common/version.cc:40
 #8  0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10,
 out=...) at common/BackTrace.cc:19
 #9  0x0078f450 in handle_fatal_signal (signum=11) at
 global/signal_handler.cc:91
 #10 signal handler called
 #11 0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4
 #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() ()
 from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #15 0x7fc37d1c4796 in ?? () from 
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #16 0x7fc37d1c47c3 in std::terminate() () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #17 0x7fc37d1c49ee in __cxa_throw () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c
 0 == \unexpected error\, file=optimized out, line=3007,
 func=0x90ef80 unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int))
 at common/assert.cc:77

 This means it got an unexpected error when talking to the file system.  If
 you look in the osd log, it may tell you what that was.  (It may
 not--there isn't usually the other tcmalloc stuff triggered from the
 assert handler.)

 What happens if you restart that ceph-osd daemon?

 sage



 Unfortunately I had completely disabled logs during the test, so there
 is no hint of what caused the assert failure. The main problem was
 revealed - the created VMs were pointed at one monitor instead of the
 set of three, so there may be some unusual effects (btw, the crashed mon
 isn't the one above, but a neighbor of the crashed OSDs on the first
 node). After an IPMI reset the node came back fine and the cluster
 behavior seems to be okay - the stuck kvm I/O somehow prevented even
 loading or unloading other modules on this node, so I finally decided to
 do a hard reset. Although I'm running an almost generic wheezy, glibc
 was updated to 2.15; maybe that is why this trace appeared for the first
 time. I'm almost sure the fs did not trigger this crash and mainly
 suspect the stuck kvm processes. I'll rerun the test with the same
 conditions tomorrow (~500 VMs pointed at one mon and very high I/O, but
 with osd logging).

 #19 0x0073148f in FileStore::_do_transaction
 (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545,
 trans_num=trans_num@entry=0) at os/FileStore.cc:3007
 #20 0x0073484e in FileStore::do_transactions (this=0x2cde000,
 tls=..., op_seq=429545) at os/FileStore.cc:2436
 #21 0x0070c680 in FileStore::_do_op (this=0x2cde000,
 osr=optimized out) at os/FileStore.cc:2259
 #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at
 common/WorkQueue.cc:54
 #23 0x006823ed in ThreadPool::WorkThread::entry
 (this=optimized out) at ./common/WorkQueue.h:126
 #24 0x7fc37e3eee9a in start_thread () from
 /lib/x86_64-linux-gnu/libpthread.so.0
 #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
 #26 0x in ?? ()

 mon bt was exactly the same as in http://tracker.newdream.net/issues/2762

Re: Ceph performance improvement

2012-08-22 Thread Mark Kirkwood

On 22/08/12 22:24, David McBride wrote:

On 22/08/12 09:54, Denis Fondras wrote:


* Test with dd from the client using CephFS :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s


Again, the synchronous nature of 'dd' is probably severely affecting 
apparent performance.  I'd suggest looking at some other tools, like 
fio, bonnie++, or iozone, which might generate more representative load.


(Or, if you have a specific use-case in mind, something that generates 
an IO pattern like what you'll be using in production would be ideal!)





Appending conv=fsync to the dd will make the comparison fair enough. 
Looking at the ceph code, it does



sync_file_range(fd, offset, blocksz, SYNC_FILE_RANGE_WRITE);

which is very fast - way faster than fdatasync() and friends (I have 
tested this ... see prev posting on random write performance with file 
writetest.c attached).


I am not convinced that these sorts of tests are in any way 'unfair' - for 
instance, I would like to use rbd for postgres or mysql data volumes... 
and many database actions involve a stream of block writes similar 
enough to doing dd (e.g. bulk row loads, appends to transaction log 
journals).


Cheers

Mark