Re: Ceph on btrfs 3.4rc
Hi, the ceph cluster has been running under heavy load for the last 13 hours without a problem, dmesg is empty and the performance is good. -martin

On 23.05.2012 21:12, Martin Mailand wrote: this patch has been running for 3 hours without a Bug and without the Warning. I will let it run overnight and report tomorrow. It looks very good ;-)
how to free space from rados bench command?
Hi, every rados bench write uses disk space and my space fills up. How can I free this space again? The command used was: rados -p data bench 60 write -t 16 Stefan
Re: how to free space from rados bench command?
Hi, On 24-05-12 09:01, Stefan Priebe - Profihost AG wrote: Hi, every rados bench write uses disk space and my space fills up. How can I free this space again? The command used was: rados -p data bench 60 write -t 16

What does this show:

$ rados -p data ls | wc -l

If that shows something greater than 0 it means you still have objects in that pool which are using up space. Try removing those objects manually. Be cautious not to remove any other objects! To be safe I'd recommend running benchmark commands in a separate pool. Also note that when you remove objects it will take some time before the OSDs have removed them and you see the usage go down with ceph -s. Wido

Stefan
Re: how to free space from rados bench command?
On 24.05.2012 09:28, Wido den Hollander wrote: Hi, On 24-05-12 09:01, Stefan Priebe - Profihost AG wrote: Hi, every rados bench write uses disk space and my space fills up. How can I free this space again? The command used was: rados -p data bench 60 write -t 16 What does this show: $ rados -p data ls | wc -l

~# rados -p data ls | wc -l
46631

I do not use the data pool, so it is separate ;-) I only use the rbd pool for block devices. So I will free the space with:

for i in `rados -p data ls`; do echo $i; rados -p data rm $i; done

Thanks! Stefan
Re: Multiple named clusters on same nodes
On Wednesday 23 May 2012, Tommi Virtanen wrote:
On Wed, May 23, 2012 at 2:00 AM, Amon Ott a@m-privacy.de wrote: So I started experimenting with the new cluster variable, but it does not seem to be well supported so far. mkcephfs does not even know about it and always uses ceph as the cluster name. Setting a value for cluster in the global section of ceph.conf (homeuser.conf, backup.conf, ...) does not work; it is not even used in the same config file, instead it has the fixed value ceph. [...]
I don't think anyone is likely to fix mkcephfs to work with it -- I'm personally trying to get mkcephfs declared obsolete. It's fundamentally the wrong tool; for example, it cannot expand or reconfigure an existing cluster.

Attached is a patch based on current git stable that makes mkcephfs work fine for me with --cluster name. ceph-mon uses the wrong mkfs path for mon data (default ceph instead of the supplied cluster name), so I put in a workaround. Please have a look and consider inclusion, as well as fixing the mon data path. Thanks.

Amon Ott
--
Dr. Amon Ott, m-privacy GmbH, Tel: +49 30 24342334, Fax: +49 30 24342336
Am Köllnischen Park 1, 10179 Berlin, http://www.m-privacy.de
Amtsgericht Charlottenburg, HRB 84946
Geschäftsführer: Dipl.-Kfm. Holger Maczkowsky, Roman Maczkowsky
GnuPG-Key-ID: 0x2DD3A649

commit fc394c63b9fd4f5fea4bc3a430f57164a96dc543
Author: Amon Ott a...@rsbac.org
Date:   Thu May 24 09:48:29 2012 +0200

    mkcephfs: Support --cluster name for cluster naming

    Current mkcephfs can only create clusters with the name ceph. This patch
    allows specifying the cluster name and fixes some default paths to the
    new $cluster based locations. Parameter --conf is now optional and
    defaults to /etc/ceph/$cluster.conf.

    Signed-off-by: Amon Ott a@m-privacy.de

diff --git a/src/mkcephfs.in b/src/mkcephfs.in
index 17b6014..e1c061e 100644
--- a/src/mkcephfs.in
+++ b/src/mkcephfs.in
@@ -60,7 +60,7 @@ else
 fi

 usage_exit() {
-    echo "usage: $0 -a -c ceph.conf [-k adminkeyring] [--mkbtrfs]"
+    echo "usage: $0 [--cluster name] -a [-c ceph.conf] [-k adminkeyring] [--mkbtrfs]"
     echo "       to generate a new ceph cluster on all nodes; for advanced usage see man page"
     echo "       ** be careful, this WILL clobber old data; check your ceph.conf carefully **"
     exit
@@ -89,6 +89,7 @@ moreargs=""
 auto_action=0
 manual_action=0
 nocopyconf=0
+cluster="ceph"

 while [ $# -ge 1 ]; do
 case $1 in
@@ -141,6 +142,11 @@ case $1 in
             shift
             conf=$1
             ;;
+    --cluster | -C)
+            [ -z "$2" ] && usage_exit
+            shift
+            cluster=$1
+            ;;
     --numosd)
             [ -z "$2" ] && usage_exit
             shift
@@ -181,6 +187,8 @@ done

 [ -z "$conf" ] && [ -n "$dir" ] && conf=$dir/conf

+[ -z "$conf" ] && conf=/etc/ceph/$cluster.conf
+
 if [ $manual_action -eq 0 ]; then
     if [ $auto_action -eq 0 ]; then
         echo "You must specify an action. See man page."
@@ -245,19 +253,19 @@ if [ -n "$initdaemon" ]; then
     name="$type.$id"

     # create /var/run/ceph (or wherever pid file and/or admin socket live)
-    get_conf pid_file "/var/run/ceph/$name.pid" "pid file"
+    get_conf pid_file "/var/run/ceph/$type/$cluster-$id.pid" "pid file"
     rundir=`dirname $pid_file`
     if [ "$rundir" != "." ] && [ ! -d "$rundir" ]; then
         mkdir -p $rundir
     fi
-    get_conf asok_file "/var/run/ceph/$name.asok" "admin socket"
+    get_conf asok_file "/var/run/ceph/$type/$cluster-$id.asok" "admin socket"
     rundir=`dirname $asok_file`
     if [ "$rundir" != "." ] && [ ! -d "$rundir" ]; then
         mkdir -p $rundir
     fi

     if [ $type = osd ]; then
-        $BINDIR/ceph-osd -c $conf --monmap $dir/monmap -i $id --mkfs
+        $BINDIR/ceph-osd --cluster $cluster -c $conf --monmap $dir/monmap -i $id --mkfs
         create_private_key
     fi

@@ -266,7 +274,9 @@ if [ -n "$initdaemon" ]; then
     fi

     if [ $type = mon ]; then
-        $BINDIR/ceph-mon -c $conf --mkfs -i $id --monmap $dir/monmap --osdmap $dir/osdmap -k $dir/keyring.mon
+        get_conf mondata "" "mon data"
+        test -z "$mondata" && mondata=/var/lib/ceph/mon/$cluster-$id
+        $BINDIR/ceph-mon --cluster $cluster -c $conf --mon-data=$mondata --mkfs -i $id --monmap $dir/monmap --osdmap $dir/osdmap -k $dir/keyring.mon
     fi

     exit 0
@@ -442,14 +452,14 @@ if [ $allhosts -eq 1 ]; then

         if [ $nocopyconf -eq 0 ]; then
             # also put conf at /etc/ceph/ceph.conf
-            scp -q $dir/conf $host:/etc/ceph/ceph.conf
+            scp -q $dir/conf $host:/etc/ceph/$cluster.conf
         fi
     else
         rdir=$dir

         if [ $nocopyconf -eq 0 ]; then
             # also put conf at /etc/ceph/ceph.conf
-            cp $dir/conf /etc/ceph/ceph.conf
+            cp $dir/conf /etc/ceph/$cluster.conf
         fi
     fi

@@ -486,15 +496,15 @@ if [ $allhosts -eq 1 ]; then
         scp -q $dir/* $host:$rdir

         if [ $nocopyconf -eq 0 ]; then
-            # also put conf at /etc/ceph/ceph.conf
-            scp -q $dir/conf $host:/etc/ceph/ceph.conf
+            # also put conf at /etc/ceph/$cluster.conf
+            scp -q $dir/conf $host:/etc/ceph/$cluster.conf
         fi
     else
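For context, a hypothetical invocation of the patched mkcephfs for a second cluster would look something like the following; the cluster name "backup" and the paths are only examples, not part of the patch:

# create a second cluster named "backup"; with the patch, -c defaults to /etc/ceph/backup.conf
mkcephfs --cluster backup -a -k /etc/ceph/backup.keyring

# or with an explicit config file
mkcephfs --cluster backup -a -c /etc/ceph/backup.conf -k /etc/ceph/backup.keyring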
Re: how to free space from rados bench command?
On 24-05-12 09:38, Stefan Priebe - Profihost AG wrote: On 24.05.2012 09:28, Wido den Hollander wrote: Hi, On 24-05-12 09:01, Stefan Priebe - Profihost AG wrote: Hi, every rados bench write uses disk space and my space fills up. How can I free this space again? The command used was: rados -p data bench 60 write -t 16 What does this show: $ rados -p data ls | wc -l

~# rados -p data ls | wc -l
46631

That is weird, I thought the bench tool cleaned up its mess. Imho it should clean up after it's done, but there might be a reason why it doesn't. Did you abort the benchmark or did you let it do the whole run?

I do not use the data pool, so it is separate ;-) I only use the rbd pool for block devices. So I will free the space with: for i in `rados -p data ls`; do echo $i; rados -p data rm $i; done

rados -p data ls | xargs -n 1 rados -p data rm

I love shorter commands ;) Wido

Thanks! Stefan
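To keep benchmark objects out of production pools entirely, as Wido suggests above, a rough sketch would be to benchmark in a throwaway pool and delete it afterwards (the pool name "bench" is just an example, and rmpool may ask for extra confirmation flags in newer versions):

rados mkpool bench
rados -p bench bench 60 write -t 16
rados rmpool bench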
Re: RGW, future directions
On Thu, May 24, 2012 at 7:15 AM, Wido den Hollander w...@widodh.nl wrote: On 22-05-12 20:07, Yehuda Sadeh wrote: RGW is maturing. Besides looking at performance, which ties closely into RADOS performance, we'd like to hear whether there are certain pain points or future directions that you (you as in the ceph community) would like to see us taking. There are a few directions that we were thinking about:

1. Extend Object Storage API. Swift and S3 have some features that we don't currently support. We can certainly extend our functionality, however, is there any demand for more features? E.g., self destructing objects, web site, user logs, etc.

2. Better OpenStack interoperability. Keystone support? Other?

3. New features. Some examples: - multitenancy: api for domains and user management - snapshots - computation front end: upload object, then do some data transformation/calculation. - simple key-value api

4. CDMI. Sage brought up the CDMI support question to ceph-devel, and I don't remember him getting any response. Is there any interest in CDMI?

5. Native apache/nginx module or embedded web server. We still need to prove that the web server is a bottleneck, or poses scaling issues. Writing a correct native nginx module would require turning the rgw process model into an event driven one, which is not going to be easy.

I'd not go for a native nginx or Apache module; that would bring extra C code into the story, which would mean extra dependencies. My vote would still go to an embedded webserver written in something like Python. You could then use Apache/nginx/Varnish as a reverse proxy in front and do all kinds of cool stuff. You could even do caching in nginx or Varnish and let RGW notify those proxies when an object has changed so they can purge their cache. This would dramatically improve the performance of the gateway. It would also simplify the code; why try to do caching on your own when some great HTTP caches are out there?

100% +1

6. Improve garbage collection. Currently rgw generates intent logs for garbage removal that require running an external tool later, which is an administrative pain. We can implement other solutions (OSD side garbage collection, integrating the cleanup process into the gateway, etc.) but we need to understand the priority.

7. libradosgw. We have had this in mind for some time now. Creating a programming api for rgw, not too different from librados and librbd. It'll hopefully make the code much cleaner. It will allow users to write different front ends for the rgw backend, and it will make it easier for users to write applications that interact with the backend, e.g., do processing on objects that users uploaded, FUSE for rgw without S3 as an intermediate, etc.

Yes, I would really like this. Combine this with the Python stand-alone/embedded webserver I proposed and you get a really nice RGW I think.

8. Administration tools improvement. We can always do better there. When we have libradosgw it wouldn't be that hard to make a nice web front-end where you can manage the whole thing.

9. Other ideas?

Any comments are welcome! Thanks, Yehuda

--
Regards, Sławek sZiBis Skowron
Re: Designing a cluster guide
On Wed, 23 May 2012, Gregory Farnum wrote: On Wed, May 23, 2012 at 12:47 PM, Jerker Nyberg jer...@update.uu.se wrote: * Scratch file system for HPC. (kernel client) * Scratch file system for research groups. (SMB, NFS, SSH) * Backend for simple disk backup. (SSH/rsync, AFP, BackupPC) * Metropolitan cluster. * VDI backend. KVM with RBD. Hmm. Sounds to me like scratch filesystems would get a lot out of not having to hit disk on the commit, but not much out of having separate caching locations versus just letting the OSD page cache handle it. :) The others, I don't really see collaborative caching helping much either. Oh, sorry, those were my use cases for ceph in general. Yes, scratch is mostly of interest. But also fast backup. Currently IOPS is limiting our backup speed on a small cluster with many files but not much data. I have problems scanning through and backing up all changed files every night. Currently I am backing up to ZFS, but Ceph might help with scaling up performance and size. Another option is going for SSD instead of mechanical drives. Anyway, make a bug for it in the tracker (I don't think one exists yet, though I could be wrong) and someday when we start work on the filesystem again we should be able to get to it. :) Thank you for your thoughts on this. I hope to be able to do that soon. Regards, Jerker Nyberg, Uppsala, Sweden.
Problems while doing rpmbuild on CentOS
Hi All, I am trying to build the Ceph stuff from the bzip2 source archive by using the command below: rpmbuild -tb /home/src/ceph.related/ceph-0.47.2.tar.bz2 I get the following error: error: File /home/src/ceph.related/libs3-trunk.tar.gz: No such file or directory What is this and where do I get this file from? Can anyone please give me some pointers on this? Regards, --ajit
Re: Multiple named clusters on same nodes
On Thursday 24 May 2012, Amon Ott wrote:
Attached is a patch based on current git stable that makes mkcephfs work fine for me with --cluster name. ceph-mon uses the wrong mkfs path for mon data (default ceph instead of the supplied cluster name), so I put in a workaround. Please have a look and consider inclusion, as well as fixing the mon data path. Thanks.

And another patch for the init script to handle multiple clusters.

Amon Ott
--
Dr. Amon Ott, m-privacy GmbH, Tel: +49 30 24342334, Fax: +49 30 24342336
Am Köllnischen Park 1, 10179 Berlin, http://www.m-privacy.de
Amtsgericht Charlottenburg, HRB 84946
Geschäftsführer: Dipl.-Kfm. Holger Maczkowsky, Roman Maczkowsky
GnuPG-Key-ID: 0x2DD3A649

commit d446077dc93894784348f7560ee29eaf6e3ce272
Author: Amon Ott a...@rsbac.org
Date:   Thu May 24 10:55:27 2012 +0200

    Make init script init-ceph.in cluster name aware.

    Add --cluster clustername parameter to start/stop/etc. a specific cluster
    with default config file /etc/ceph/<cluster>.conf. If no clustername is
    given, walk through /etc/ceph/*.conf and try to start/stop/etc. them all,
    with clustername taken from the conf basename.

    Signed-off-by: Amon Ott a@m-privacy.de

diff --git a/src/init-ceph.in b/src/init-ceph.in
index f2702e3..6efe7f0 100644
--- a/src/init-ceph.in
+++ b/src/init-ceph.in
@@ -28,6 +28,7 @@ fi
 usage_exit() {
     echo "usage: $0 [options] {start|stop|restart} [mon|osd|mds]..."
+    printf "\t--cluster clustername\n"
     printf "\t-c ceph.conf\n"
     printf "\t--valgrind\trun via valgrind\n"
     printf "\t--hostname [hostname]\toverride hostname lookup\n"
@@ -36,6 +37,8 @@ usage_exit() {

 . $LIBDIR/ceph_common.sh

+conf=""
+
 EXIT_STATUS=0

 signal_daemon() {
@@ -45,7 +48,7 @@ signal_daemon() {
     signal=$4
     action=$5
     [ -z "$action" ] && action="Stopping"
-    echo -n "$action Ceph $name on $host..."
+    echo -n "$action Ceph $cluster $name on $host..."
     do_cmd "if [ -e $pidfile ]; then
         pid=`cat $pidfile`
         if [ -e /proc/\$pid ] && grep -q $daemon /proc/\$pid/cmdline ; then
@@ -75,7 +78,7 @@ stop_daemon() {
     signal=$4
     action=$5
     [ -z "$action" ] && action="Stopping"
-    echo -n "$action Ceph $name on $host..."
+    echo -n "$action Ceph $cluster $name on $host..."
     do_cmd "while [ 1 ]; do
         [ -e $pidfile ] || break
         pid=\`cat $pidfile\`
@@ -103,6 +106,7 @@ monaddr=
 dobtrfs=1
 dobtrfsumount=0
 verbose=0
+cluster=

 while echo $1 | grep -q '^-'; do    # FIXME: why not '^-'?
 case $1 in
@@ -151,6 +155,12 @@ case $1 in
             shift
             hostname=$1
             ;;
+    --cluster )
+            [ -z "$2" ] && usage_exit
+            options="$options $1"
+            shift
+            cluster=$1
+            ;;
     *)
             echo unrecognized option \'$1\'
             usage_exit
@@ -160,11 +170,25 @@ options=$options $1
     shift
 done

-verify_conf
-
 command=$1
 [ -n "$*" ] && shift

+if test -z "$cluster"
+then
+    for c in /etc/ceph/*.conf
+    do
+        test -f "$c" && $0 --cluster $(basename $c .conf) $command $@
+    done
+    exit 0
+fi
+
+if test -z "$conf"
+then
+    conf="/etc/ceph/$cluster.conf"
+fi
+
+verify_conf
+
 get_name_list "$@"

 for name in $what; do
@@ -176,9 +200,9 @@ for name in $what; do
     check_host || continue

     binary="$BINDIR/ceph-$type"
-    cmd="$binary -i $id"
+    cmd="$binary --cluster $cluster -i $id"

-    get_conf pid_file "$RUN_DIR/$type.$id.pid" "pid file"
+    get_conf pid_file "$RUN_DIR/$type/$cluster-$id.pid" "pid file"
     if [ -n "$pid_file" ]; then
         do_cmd "mkdir -p `dirname $pid_file`"
         cmd="$cmd --pid-file $pid_file"
@@ -191,13 +215,13 @@ for name in $what; do
     get_conf auto_start "" "auto start"
     if [ "$auto_start" = "no" ] || [ "$auto_start" = "false" ] || [ "$auto_start" = "0" ]; then
         if [ -z "$@" ]; then
-            echo "Skipping Ceph $name on $host... auto start is disabled"
+            echo "Skipping Ceph $cluster $name on $host... auto start is disabled"
             continue
         fi
     fi

     if daemon_is_running $name ceph-$type $id $pid_file; then
-        echo "Starting Ceph $name on $host...already running"
+        echo "Starting Ceph $cluster $name on $host...already running"
         continue
     fi

@@ -228,7 +252,7 @@ for name in $what; do
     fi

     # do lockfile, if RH
-    get_conf lockfile "/var/lock/subsys/ceph" "lock file"
+    get_conf lockfile "/var/lock/subsys/ceph/$cluster" "lock file"
     lockdir=`dirname $lockfile`
     if [ ! -d "$lockdir" ]; then
         lockfile=""
@@ -270,7 +294,7 @@ for name in $what; do
             echo Mounting Btrfs on $host:$btrfs_path
             do_root_cmd "modprobe btrfs ; btrfs device scan || btrfsctl -a ; egrep -q '^[^ ]+ $btrfs_path' /proc/mounts || mount -t btrfs $btrfs_opt $first_dev $btrfs_path"
         fi
-        echo Starting Ceph $name on $host...
+        echo Starting Ceph $cluster $name on $host...
         mkdir -p $RUN_DIR
         get_conf pre_start_eval "" "pre start eval"
         [ -n "$pre_start_eval" ] && $pre_start_eval
@@ -297,14 +321,14 @@ for name in $what; do
     status)
         if
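Assuming the patch above is applied, usage of the init script would look roughly like this (the cluster name "backup" is only an example):

# act on a single named cluster, read from /etc/ceph/backup.conf
/etc/init.d/ceph --cluster backup start
/etc/init.d/ceph --cluster backup stop osd

# with no --cluster argument the patched script walks /etc/ceph/*.conf
# and runs the command for every cluster it finds
/etc/init.d/ceph restart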
Re: Ceph on btrfs 3.4rc
Same thing here. I've tried really hard, but even after 12 hours I wasn't able to get a single warning from btrfs. I think you cracked it! Thanks, Christian 2012/5/24 Martin Mailand mar...@tuxadero.com: Hi, the ceph cluster has been running under heavy load for the last 13 hours without a problem, dmesg is empty and the performance is good. -martin On 23.05.2012 21:12, Martin Mailand wrote: this patch has been running for 3 hours without a Bug and without the Warning. I will let it run overnight and report tomorrow. It looks very good ;-)
ceph rbd crashes/stalls while random write 4k blocks
Hi list, i'm still testing ceph rbd with kvm. Right now i'm testing a rbd block device within a network booted kvm. Sequential write/reads and random reads are fine. No problems so far. But when i trigger lots of 4k random writes all of them stall after short time and i get 0 iops and 0 transfer. used command: fio --filename=/dev/vda --direct=1 --rw=randwrite --bs=4k --size=20G --numjobs=50 --runtime=30 --group_reporting --name=file1 Then some time later i see this call trace: INFO: task ceph-osd:3065 blocked for more than 120 seconds. echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. ceph-osdD 8803b0e61d88 0 3065 1 0x0004 88032f3ab7f8 0086 8803bffdac08 8803 8803b0e61820 00010800 88032f3abfd8 88032f3aa010 88032f3abfd8 00010800 81a0b020 8803b0e61820 Call Trace: [815e0e1a] schedule+0x3a/0x60 [815e127d] schedule_timeout+0x1fd/0x2e0 [812696c4] ? xfs_iext_bno_to_ext+0x84/0x160 [81074db1] ? down_trylock+0x31/0x50 [812696c4] ? xfs_iext_bno_to_ext+0x84/0x160 [815e20b9] __down+0x69/0xb0 [8128c4a6] ? _xfs_buf_find+0xf6/0x280 [81074e6b] down+0x3b/0x50 [8128b7b0] xfs_buf_lock+0x40/0xe0 [8128c4a6] _xfs_buf_find+0xf6/0x280 [8128c689] xfs_buf_get+0x59/0x190 [8128ccf7] xfs_buf_read+0x27/0x100 [81282f97] xfs_trans_read_buf+0x1e7/0x420 [81239371] xfs_read_agf+0x61/0x1a0 [812394e4] xfs_alloc_read_agf+0x34/0xd0 [8123c877] xfs_alloc_fix_freelist+0x3f7/0x470 [81288005] ? kmem_free+0x35/0x40 [8127ff6e] ? xfs_trans_free_item_desc+0x2e/0x30 [812800a7] ? xfs_trans_free_items+0x87/0xb0 [8127cc73] ? xfs_perag_get+0x33/0xb0 [8123c97f] ? xfs_free_extent+0x8f/0x120 [8123c990] xfs_free_extent+0xa0/0x120 [81287f07] ? kmem_zone_alloc+0x77/0xf0 [81245ead] xfs_bmap_finish+0x15d/0x1a0 [8126d15e] xfs_itruncate_finish+0x15e/0x340 [81285495] xfs_setattr+0x365/0x980 [812926e6] xfs_vn_setattr+0x16/0x20 [8111e0ad] notify_change+0x11d/0x300 [81103ccc] do_truncate+0x5c/0x90 [8110ea35] ? get_write_access+0x15/0x50 [81103ef7] sys_truncate+0x127/0x130 [815e367b] system_call_fastpath+0x16/0x1b INFO: task flush-8:16:3089 blocked for more than 120 seconds. echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. flush-8:16 D 8803af0d9d88 0 3089 2 0x 88032e835940 0046 00010fe0 8803 8803af0d9820 00010800 88032e835fd8 88032e834010 88032e835fd8 00010800 8803b0f7e080 8803af0d9820 Call Trace: [810be570] ? __lock_page+0x70/0x70 [815e0e1a] schedule+0x3a/0x60 [815e0ec7] io_schedule+0x87/0xd0 [810be579] sleep_on_page+0x9/0x10 [815e1412] __wait_on_bit_lock+0x52/0xb0 [810be562] __lock_page+0x62/0x70 [8106fb80] ? autoremove_wake_function+0x40/0x40 [810c8fd0] ? pagevec_lookup_tag+0x20/0x30 [810c7f66] write_cache_pages+0x386/0x4d0 [810c6c10] ? set_page_dirty+0x70/0x70 [810fd7ab] ? kmem_cache_free+0x1b/0xe0 [810c80fc] generic_writepages+0x4c/0x70 [81288bcf] xfs_vm_writepages+0x4f/0x60 [810c813c] do_writepages+0x1c/0x40 [81128854] writeback_single_inode+0xf4/0x260 [81128c45] writeback_sb_inodes+0xe5/0x1b0 [811290a8] writeback_inodes_wb+0x98/0x160 [81129ac3] wb_writeback+0x2f3/0x460 [815e089e] ? __schedule+0x3ae/0x850 [8105df47] ? lock_timer_base+0x37/0x70 [81129e4f] wb_do_writeback+0x21f/0x270 [81129f3a] bdi_writeback_thread+0x9a/0x230 [81129ea0] ? wb_do_writeback+0x270/0x270 [81129ea0] ? wb_do_writeback+0x270/0x270 [8106f646] kthread+0x96/0xa0 [815e46d4] kernel_thread_helper+0x4/0x10 [8106f5b0] ? kthread_worker_fn+0x130/0x130 [815e46d0] ? 
gs_change+0xb/0xb Stefan
Re: ceph rbd crashes/stalls while random write 4k blocks
Stefan, On 05/24/12 13:07, Stefan Priebe - Profihost AG wrote: Hi list, i'm still testing ceph rbd with kvm. Right now i'm testing a rbd block device within a network booted kvm. Sequential write/reads and random reads are fine. No problems so far. But when i trigger lots of 4k random writes all of them stall after short time and i get 0 iops and 0 transfer. used command: fio --filename=/dev/vda --direct=1 --rw=randwrite --bs=4k --size=20G --numjobs=50 --runtime=30 --group_reporting --name=file1 Then some time later i see this call trace: INFO: task ceph-osd:3065 blocked for more than 120 seconds. echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. ceph-osdD 8803b0e61d88 0 3065 1 0x0004 88032f3ab7f8 0086 8803bffdac08 8803 8803b0e61820 00010800 88032f3abfd8 88032f3aa010 88032f3abfd8 00010800 81a0b020 8803b0e61820 Call Trace: [815e0e1a] schedule+0x3a/0x60 [815e127d] schedule_timeout+0x1fd/0x2e0 [812696c4] ? xfs_iext_bno_to_ext+0x84/0x160 [81074db1] ? down_trylock+0x31/0x50 [812696c4] ? xfs_iext_bno_to_ext+0x84/0x160 [815e20b9] __down+0x69/0xb0 [8128c4a6] ? _xfs_buf_find+0xf6/0x280 [81074e6b] down+0x3b/0x50 sorry I'm coming a bit late to the various threads you've posted recently, but on this particular issue: what kernel are your OSDs running on, and do these hung tasks occur if you're using a local filesystem other than XFS? As of late XFS has occasionally been producing seemingly random kernel hangs. Your call trace doesn't have the signature entries from xfssyncd that identify a particular problem that I've been struggling with lately, but you just might be affected by some other effect of the same root issue. Take a look at these to see if anything looks familiar: http://oss.sgi.com/bugzilla/show_bug.cgi?id=922 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498 http://oss.sgi.com/archives/xfs/2011-11/msg00400.html Not sure if this helps at all; just thought I might pitch that in. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MDS crash, won't start up again
Hi, i was using the Debian Packages, but i tried now from source. I used the same version from GIT (cb7f1c9c7520848b0899b26440ac34a8acea58d1) and compiled it. Same crash report. Then i applied your patch but again the same crash, i think the backtrace is also the same: (gdb) thread 1 [Switching to thread 1 (Thread 9564)]#0 0x7f33a3e58ebb in raise (sig=value optimized out) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41 41 in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c (gdb) backtrace #0 0x7f33a3e58ebb in raise (sig=value optimized out) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41 #1 0x0081423e in reraise_fatal (signum=11) at global/signal_handler.cc:58 #2 handle_fatal_signal (signum=11) at global/signal_handler.cc:104 #3 signal handler called #4 SnapRealm::have_past_parents_open (this=0x0, first=..., last=...) at mds/snap.cc:112 #5 0x0055d58b in MDCache::check_realm_past_parents (this=0x27a7200, realm=0x0) at mds/MDCache.cc:4495 #6 0x00572eec in MDCache::choose_lock_states_and_reconnect_caps (this=0x27a7200) at mds/MDCache.cc:4533 #7 0x005931a0 in MDCache::rejoin_gather_finish (this=0x27a7200) at mds/MDCache.cc: #8 0x0059b9d5 in MDCache::rejoin_send_rejoins (this=0x27a7200) at mds/MDCache.cc:3388 #9 0x004a8721 in MDS::rejoin_joint_start (this=0x27bc000) at mds/MDS.cc:1404 #10 0x004c253a in MDS::handle_mds_map (this=0x27bc000, m=value optimized out) at mds/MDS.cc:968 #11 0x004c4513 in MDS::handle_core_message (this=0x27bc000, m=0x27ab800) at mds/MDS.cc:1651 #12 0x004c45ef in MDS::_dispatch (this=0x27bc000, m=0x27ab800) at mds/MDS.cc:1790 #13 0x004c628b in MDS::ms_dispatch (this=0x27bc000, m=0x27ab800) at mds/MDS.cc:1602 #14 0x00732609 in Messenger::ms_deliver_dispatch (this=0x279f680) at msg/Messenger.h:178 #15 SimpleMessenger::dispatch_entry (this=0x279f680) at msg/SimpleMessenger.cc:363 #16 0x007207ad in SimpleMessenger::DispatchThread::entry() () #17 0x7f33a3e508ca in start_thread (arg=value optimized out) at pthread_create.c:300 #18 0x7f33a26d892d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #19 0x in ?? () Any more ideas? :) Or can i get you more debugging output? 2012/5/23 Gregory Farnum g...@inktank.com: On Wed, May 23, 2012 at 5:28 AM, Felix Feinhals f...@turtle-entertainment.de wrote: Hey, ok i installed libc-dbg and run your commands now this comes up: gdb /usr/bin/ceph-mds core snip GNU gdb (GDB) 7.0.1-debian Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type show copying and show warranty for details. This GDB was configured as x86_64-linux-gnu. For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/... Reading symbols from /usr/bin/ceph-mds...Reading symbols from /usr/lib/debug/usr/bin/ceph-mds...done. (no debugging symbols found)...done. [New Thread 22980] [New Thread 22984] [New Thread 22986] [New Thread 22979] [New Thread 22970] [New Thread 22981] [New Thread 22971] [New Thread 22976] [New Thread 22973] [New Thread 22975] [New Thread 22974] [New Thread 22972] [New Thread 22978] [New Thread 22982] warning: Can't read pathname for load map: Input/output error. Reading symbols from /lib/libpthread.so.0...Reading symbols from /usr/lib/debug/lib/libpthread-2.11.3.so...done. (no debugging symbols found)...done. Loaded symbols for /lib/libpthread.so.0 Reading symbols from /usr/lib/libcrypto++.so.8...(no debugging symbols found)...done. 
Loaded symbols for /usr/lib/libcrypto++.so.8 Reading symbols from /lib/libuuid.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/libuuid.so.1 Reading symbols from /lib/librt.so.1...Reading symbols from /usr/lib/debug/lib/librt-2.11.3.so...done. (no debugging symbols found)...done. Loaded symbols for /lib/librt.so.1 Reading symbols from /usr/lib/libtcmalloc.so.0...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libtcmalloc.so.0 Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libstdc++.so.6 Reading symbols from /lib/libm.so.6...Reading symbols from /usr/lib/debug/lib/libm-2.11.3.so...done. (no debugging symbols found)...done. Loaded symbols for /lib/libm.so.6 Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/libgcc_s.so.1 Reading symbols from /lib/libc.so.6...Reading symbols from /usr/lib/debug/lib/libc-2.11.3.so...done. (no debugging symbols found)...done. Loaded symbols for /lib/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/lib/ld-2.11.3.so...done. (no debugging symbols
Re: ceph rbd crashes/stalls while random write 4k blocks
Am 24.05.2012 14:12, schrieb Florian Haas: Stefan, sorry I'm coming a bit late to the various threads you've posted recently, but on this particular issue: what kernel are your OSDs running on, and do these hung tasks occur if you're using a local filesystem other than XFS? OSDs run 3.0.30 but i tried 3.3.7 too - no difference (regarding XFS crash and random writes). Just tried btrfs with 3.4 kernel and the posted patch from yesterday. But with kernel 3.4 the performance is in general pretty low doesn't matter if i use xfs or btrfs: ~# rados -p data bench 10 write -t 16 Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds. sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 163519 75.982476 0.294869 0.376607 2 165135 69.984464 0.103118 0.345375 3 16725674.65284 0.1139090.5364 4 168872 71.986664 0.641818 0.786378 5 169579 63.188728 0.131084 0.737699 6 16 11397 64.655372 0.232688 0.851319 7 16 129 113 64.560464 0.35199 0.822971 8 16 148 132 65.988876 0.09892 0.739852 9 16 149 133 59.1007 4 0.833541 0.740556 10 16 157 141 56.389932 0.101306 0.715187 11 16 157 141 51.2634 0 - 0.715187 12 16 157 141 46.9914 0 - 0.715187 13 16 157 141 43.3766 0 - 0.715187 14 16 157 141 40.2782 0 - 0.715187 15 16 157 14137.593 0 - 0.715187 16 16 157 141 35.2434 0 - 0.715187 Total time run:16.471636 Total writes made: 158 Write size:4194304 Bandwidth (MB/sec):38.369 Average Latency: 1.66534 Max latency: 13.554 Min latency: 0.095194 As of late XFS has occasionally been producing seemingly random kernel hangs. Your call trace doesn't have the signature entries from xfssyncd that identify a particular problem that I've been struggling with lately, but you just might be affected by some other effect of the same root issue. Take a look at these to see if anything looks familiar: http://oss.sgi.com/bugzilla/show_bug.cgi?id=922 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498 http://oss.sgi.com/archives/xfs/2011-11/msg00400.html These are solved by using 3.0.20. Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
poor OSD performance using kernel 3.4
Hi list, today while testing btrfs i discovered a very poor osd performance using kernel 3.4. Underlying FS is XFS but it is the same with btrfs. 3.0.30: ~# rados -p data bench 10 write -t 16 Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds. sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 164125 99.9767 100 0.586984 0.447293 2 167155 109.979 120 0.934388 0.488375 3 169983 110.647 112 1.15982 0.503111 4 16 130 114 113.981 124 1.05952 0.516925 5 16 159 143 114.382 116 0.149313 0.510734 6 16 188 172 114.649 116 0.287166 0.52203 7 16 215 199 113.697 108 0.151784 0.531461 8 16 242 226 112.984 108 0.623478 0.539896 9 16 265 249 110.65192 0.50354 0.538504 10 16 296 280 111.984 124 0.155048 0.542846 Total time run:10.776153 Total writes made: 297 Write size:4194304 Bandwidth (MB/sec):110.243 Average Latency: 0.577534 Max latency: 1.85499 Min latency: 0.091473 3.4: ~# rados -p data bench 10 write -t 16 Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds. sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 164024 95.979496 0.393196 0.455936 2 166852 103.983 112 0.835652 0.517297 3 168569 91.984968 1.00535 0.493058 4 169680 79.986944 0.096564 0.577948 5 16 10387 69.587928 0.092722 0.589147 6 16 117 101 67.321656 0.222175 0.675334 7 16 130 114 65.132152 0.15677 0.623806 8 16 144 128 63.989656 0.089157 0.56746 9 16 144 128 56.8794 0 - 0.56746 10 16 144 128 51.1912 0 - 0.56746 11 16 144 128 46.5373 0 - 0.56746 12 16 144 128 42.6591 0 - 0.56746 13 16 144 128 39.3776 0 - 0.56746 14 16 144 128 36.5649 0 - 0.56746 15 16 144 128 34.1272 0 - 0.56746 16 16 145 129 32.2443 0.5 11.3422 0.650985 Total time run:16.193871 Total writes made: 145 Write size:4194304 Bandwidth (MB/sec):35.816 Average Latency: 1.78467 Max latency: 14.4744 Min latency: 0.088753 Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ceph rbd crashes/stalls while random write 4k blocks
On Thu, May 24, 2012 at 4:09 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Take a look at these to see if anything looks familiar: http://oss.sgi.com/bugzilla/show_bug.cgi?id=922 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498 http://oss.sgi.com/archives/xfs/2011-11/msg00400.html These are solved by using 3.0.20. ... or so Christoph says, but comment #4 in bug 922 seems to indicate otherwise. Florian -- Need help with High Availability? http://www.hastexo.com/now
Re: poor OSD performance using kernel 3.4
Hi Stefan, Were these both tested on fresh filesystems? If you still have any 3.0.30 available, could you try a couple of longer running tests (say 5 minutes) and see how they compare? Thanks, Mark On 05/24/2012 09:10 AM, Stefan Priebe - Profihost AG wrote: Hi list, today while testing btrfs i discovered a very poor osd performance using kernel 3.4. Underlying FS is XFS but it is the same with btrfs. 3.0.30: ~# rados -p data bench 10 write -t 16 Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds. sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 164125 99.9767 100 0.586984 0.447293 2 167155 109.979 120 0.934388 0.488375 3 169983 110.647 112 1.15982 0.503111 4 16 130 114 113.981 124 1.05952 0.516925 5 16 159 143 114.382 116 0.149313 0.510734 6 16 188 172 114.649 116 0.287166 0.52203 7 16 215 199 113.697 108 0.151784 0.531461 8 16 242 226 112.984 108 0.623478 0.539896 9 16 265 249 110.65192 0.50354 0.538504 10 16 296 280 111.984 124 0.155048 0.542846 Total time run:10.776153 Total writes made: 297 Write size:4194304 Bandwidth (MB/sec):110.243 Average Latency: 0.577534 Max latency: 1.85499 Min latency: 0.091473 3.4: ~# rados -p data bench 10 write -t 16 Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds. sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 164024 95.979496 0.393196 0.455936 2 166852 103.983 112 0.835652 0.517297 3 168569 91.984968 1.00535 0.493058 4 169680 79.986944 0.096564 0.577948 5 16 10387 69.587928 0.092722 0.589147 6 16 117 101 67.321656 0.222175 0.675334 7 16 130 114 65.132152 0.15677 0.623806 8 16 144 128 63.989656 0.089157 0.56746 9 16 144 128 56.8794 0 - 0.56746 10 16 144 128 51.1912 0 - 0.56746 11 16 144 128 46.5373 0 - 0.56746 12 16 144 128 42.6591 0 - 0.56746 13 16 144 128 39.3776 0 - 0.56746 14 16 144 128 36.5649 0 - 0.56746 15 16 144 128 34.1272 0 - 0.56746 16 16 145 129 32.2443 0.5 11.3422 0.650985 Total time run:16.193871 Total writes made: 145 Write size:4194304 Bandwidth (MB/sec):35.816 Average Latency: 1.78467 Max latency: 14.4744 Min latency: 0.088753 Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
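For reference, a 5-minute run along the lines Mark suggests would simply extend the duration of the same benchmark (the numbers are only an example):

rados -p data bench 300 write -t 16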
Re: how to free space from rados bench command?
On Thursday, May 24, 2012 at 1:51 AM, Stefan Priebe - Profihost AG wrote: Am 24.05.2012 10:22, schrieb Wido den Hollander: On 24-05-12 09:38, Stefan Priebe - Profihost AG wrote: ~# rados -p data ls|wc -l 46631 That is weird, I thought the bench tool cleaned up it's mess. Imho it should cleanup after it's done, but there might be a reason why it's not. Did you abort the benchmark or did you let it do the whole run? No it doesn't BUG? It doesn't because you might want to leave around the data for read benchmarking (or so that your cluster is full of data). There should probably be an option to clean up bench data, though! I've created a bug: http://tracker.newdream.net/issues/2477 ~# rados -p data ls ~# ~# rados -p data bench 20 write -t 16 ... ~# rados -p data ls| wc -l 589 I do not use the data pool so it is seperate ;-) i only use the rbd pool for block devices. So i will free the space with: for i in `rados -p data ls`; do echo $i; rados -p data rm $i; done rados -p data ls|xargs -n 1 rados -p data rm I love shorter commands ;) me too i just tried it without -n and hoped that this works but rados didn't support more than 1 file per command and i didn't remembered -n1 ;) Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org (mailto:majord...@vger.kernel.org) More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: I have some problem to mount ceph file system
That's not an option any more, since malicious clients can fake it so easily. :(

On Wednesday, May 23, 2012 at 10:35 PM, FrankWOO Su wrote: So in this version, can I do some settings to limit the mount command by IP? Any example? Thanks -Frank

2012/5/24 Sage Weil s...@inktank.com: On Wed, 23 May 2012, Gregory Farnum wrote: On Wed, May 23, 2012 at 1:51 AM, Frank frankwoo@gmail.com wrote: Hello, I have a question about ceph. When I mount ceph, I use the command as follows: # mount -t ceph -o name=admin,secret=XX 10.1.0.1:6789:/ /mnt/ceph -vv Now I create a user foo and make a secret key with ceph-authtool like this: # ceph-authtool /etc/ceph/keyring.bin -n client.foo --gen-key Then I add the key into ceph: # ceph auth add client.foo osd 'allow *' mon 'allow *' mds 'allow' -i /etc/ceph/keyring.bin So I can mount ceph as foo: # mount -t ceph -o name=foo,secret=XOXOXO 10.1.0.1:6789:/ /mnt/ceph -vv My question is: if I don't want foo to have permission to mount 10.1.0.1:6789:/, how do I do it? If there is a directory foo, I want him to be able to mount 10.1.0.1:6789:/foo/ but have no access to mount 10.1.0.1:6789:/.

I'm afraid that's not an option with Ceph right now, that I'm aware of. It was built and designed for a trusted set of servers and clients, and while we're slowly carving out areas of security, this isn't one we've done yet. If it's an important feature for you, you should create a feature request in the tracker (tracker.newdream.net) for it, which we will prioritize and work on once we've moved to focus on the full filesystem. :)

http://tracker.newdream.net/issues/1237 (tho the final config will probably not look like that; suggestions welcome.)

sage
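As a side note, mounting a subdirectory does work today; it is only a convenience, not an enforced restriction, since any client with a valid key can still mount /. The addresses and paths below are examples, and secretfile= assumes the mount.ceph helper is available:

# mount only the /foo subtree, keeping the key out of the command line
mount -t ceph -o name=foo,secretfile=/etc/ceph/foo.secret 10.1.0.1:6789:/foo /mnt/foo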
Re: NFS re-exporting CEPH cluster
On Wednesday, May 23, 2012 at 10:14 PM, Madhusudhana U wrote: Hi all, has anyone tried re-exporting a Ceph cluster via NFS with success (I mean to say, mount the Ceph cluster on one of the machines and then export that via NFS to clients)? I need to do this because of my client kernel version and some EDA tools compatibility. Can someone suggest how I can successfully re-export Ceph over NFS?

Have you tried something and it failed? Or are you looking for suggestions? If the former, please report the failure. :) If the latter: http://ceph.com/wiki/Re-exporting_NFS -Greg
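A minimal sketch of the kernel-NFS approach discussed on that wiki page (the hostnames, paths and export options here are examples only):

# on the re-export host: mount the ceph filesystem, then export it
mount -t ceph 10.1.0.1:6789:/ /mnt/ceph
echo '/mnt/ceph 192.168.0.0/24(rw,no_subtree_check,fsid=100)' >> /etc/exports
exportfs -ra

# on the NFS client
mount -t nfs exporthost:/mnt/ceph /mnt/ceph-nfs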
Re: how to free space from rados bench command?
On 05/24/2012 10:55 AM, Greg Farnum wrote: On Thursday, May 24, 2012 at 1:51 AM, Stefan Priebe - Profihost AG wrote: Am 24.05.2012 10:22, schrieb Wido den Hollander: On 24-05-12 09:38, Stefan Priebe - Profihost AG wrote: ~# rados -p data ls|wc -l 46631 That is weird, I thought the bench tool cleaned up it's mess. Imho it should cleanup after it's done, but there might be a reason why it's not. Did you abort the benchmark or did you let it do the whole run? No it doesn't BUG? It doesn't because you might want to leave around the data for read benchmarking (or so that your cluster is full of data). There should probably be an option to clean up bench data, though! I've created a bug: http://tracker.newdream.net/issues/2477 Why not have the read benchmark write data itself, and then benchmark reading? Then both read and write benchmarks can clean up after themselves. It's a bit odd to have the read benchmark depend on you running a write benchmark first. Josh -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSDMap::apply_incremental not updating crush map
On Thursday, May 24, 2012 at 10:58 AM, Adam Crume wrote: I'm trying to simulate adding an OSD to a cluster. I set up an OSDMap::Incremental and apply it, but nothing ever gets mapped to the new OSD. Apparently, the crush map never gets updated. Do I have to do that manually?

Yes. If you need help, check out the OSDMonitor::prepare_command code crush section. :)

It seems like apply_incremental should do it automatically.

apply_incremental has no idea where the new ID is located in terms of failure domains.

My test case is below. It shows that the OSDMap is updated to have 11 OSDs, but the crush map still shows only 10. Thanks, Adam Crume

#include <assert.h>
#include "osd/OSDMap.h"
#include "common/code_environment.h"

int main() {
  OSDMap *osdmap = new OSDMap();
  CephContext *cct = new CephContext(CODE_ENVIRONMENT_UTILITY);
  uuid_d fsid;
  int num_osds = 10;
  osdmap->build_simple(cct, 1, fsid, num_osds, 7, 8);
  for (int i = 0; i < num_osds; i++) {
    osdmap->set_state(i, osdmap->get_state(i) | CEPH_OSD_UP | CEPH_OSD_EXISTS);
    osdmap->set_weight(i, CEPH_OSD_IN);
  }

  int osd_num = 10;
  OSDMap::Incremental inc(osdmap->get_epoch() + 1);
  inc.new_max_osd = osdmap->get_max_osd() + 1;
  inc.new_weight[osd_num] = CEPH_OSD_IN;
  inc.new_state[osd_num] = CEPH_OSD_UP | CEPH_OSD_EXISTS;
  inc.new_up_client[osd_num] = entity_addr_t();
  inc.new_up_internal[osd_num] = entity_addr_t();
  inc.new_hb_up[osd_num] = entity_addr_t();
  inc.new_up_thru[osd_num] = inc.epoch;
  uuid_d new_uuid;
  new_uuid.generate_random();
  inc.new_uuid[osd_num] = new_uuid;
  int e = osdmap->apply_incremental(inc);
  assert(e == 0);

  printf("State for 10: %d, State for 0: %d\n", osdmap->get_state(10), osdmap->get_state(0));
  printf("10 exists: %s\n", osdmap->exists(10) ? "yes" : "no");
  printf("10 is in: %s\n", osdmap->is_in(10) ? "yes" : "no");
  printf("10 is up: %s\n", osdmap->is_up(10) ? "yes" : "no");
  printf("OSDMap max OSD: %d\n", osdmap->get_max_osd());
  printf("CRUSH max devices: %d\n", osdmap->crush->get_max_devices());
}
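For comparison, the administrative path that goes through that same monitor code is roughly the following; the exact syntax and the location keys vary between versions and are only meant as an illustration. In the simulation, the equivalent step is inserting the new device into osdmap->crush before expecting any PGs to map to it:

# tell the cluster where osd.10 lives in the crush hierarchy and give it a weight
ceph osd crush set 10 osd.10 1.0 pool=default rack=unknownrack host=newhost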
Re: how to free space from rados bench command?
On Thursday, May 24, 2012 at 11:05 AM, Josh Durgin wrote: Why not have the read benchmark write data itself, and then benchmark reading? Then both read and write benchmarks can clean up after themselves. It's a bit odd to have the read benchmark depend on you running a write benchmark first. Josh

We've talked about that and decided we didn't like it. I think it was about being able to repeat large read benchmarks without having to wait for all the data to get written out first, and also (although this was never implemented) being able to implement random read benchmarks and things in ways that allowed you to make the cache cold first. Which is not to say that changing it is a bad idea; I could be talked into that or somebody else could do it. :) -Greg
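As it stands today, a read benchmark that reuses the leftovers of a write run looks roughly like this (pool, duration and concurrency are examples):

rados -p data bench 60 write -t 16                # leaves its objects in the pool
rados -p data bench 60 seq -t 16                  # sequential reads of those objects
rados -p data ls | xargs -n 1 rados -p data rm    # manual cleanup afterwards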
Re: poor OSD performance using kernel 3.4
Hi Stefan, Thanks for the info! I've been testing on 3.4 for the last couple of days but haven't run into that problem here. It looks like your journal has writes going to it quickly and then things stall as it tries to write out to your data disk. I wonder if any of the data actually makes it to the disk... Can you run iostat or collectl or something and see what kind of write throughput you get to the OSD data disks? Thanks, Mark On 05/24/2012 01:15 PM, Stefan Priebe wrote: Am 24.05.2012 16:55, schrieb Mark Nelson: Hi Stefan, Were these both tested on fresh filesystems? If you still have any 3.0.30 available, could you try a couple of longer running tests (say 5 minutes) and see how they compare? Yes with 3.4 it totally stalls. Tested with XFS and btrfs. Client always had the same Kernel. So i just changed the kernel on osd side. Kernel 3.4 http://pastebin.com/raw.php?i=CApKbSNj Kernel 3.0.30 http://pastebin.com/raw.php?i=kZ7rnwcM Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
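Something along these lines on the OSD nodes is usually enough to see whether writes are reaching the data disks (device names are examples; collectl works just as well):

iostat -xm 1 sdb sdc     # watch wMB/s and %util on the OSD data and journal devices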
Re: poor OSD performance using kernel 3.4
On 24.05.2012 20:53, Mark Nelson wrote: Hi Stefan, Thanks for the info! I've been testing on 3.4 for the last couple of days but haven't run into that problem here. It looks like your journal has writes going to it quickly and then things stall as it tries to write out to your data disk.

That's a good point. Right now while testing I'm using a tmpfs ramdisk for the journal and have set journal dio = false in ceph.conf. Might this be the difference / problem? 3.2.18 works fine too.

I wonder if any of the data actually makes it to the disk... Can you run iostat or collectl or something and see what kind of write throughput you get to the OSD data disks?

None... so it seems it never gets transferred from the journal to disk. Stefan
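For reference, the test setup described above corresponds roughly to a ceph.conf fragment like this (the journal path and size are examples; journal dio = false is needed because tmpfs does not support O_DIRECT):

[osd]
        osd journal = /dev/shm/osd.$id.journal
        osd journal size = 1000
        journal dio = false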
mkcephfs regression in current master branch
Hi, In my testing I make repeated use of the manual mkcephfs sequence described in the man page: master# mkdir /tmp/foo master# mkcephfs -c /etc/ceph/ceph.conf --prepare-monmap -d /tmp/foo osdnode# mkcephfs --init-local-daemons osd -d /tmp/foo mdsnode# mkcephfs --init-local-daemons mds -d /tmp/foo master# mkcephfs --prepare-mon -d /tmp/foo monnode# mkcephfs --init-local-daemons mon -d /tmp/foo Using current master branch (commit ca79f45a33f9), the mkcephfs --init-local-daemons osd phase breaks like this: 2012-05-24 13:24:06.325905 7fdcd61d4780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory 2012-05-24 13:24:06.326768 7fdcd1ee2700 -1 filestore(/ram/mnt/ceph/data.osd.0) async snap create 'snap_1' transid 0 got (17) File exists os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792 os/FileStore.cc: 3564: FAILED assert(0 == async snap ioctl error) ceph version 0.47.1-157-gb003815 (commit:b003815c222add8bdcf645d9ba4ef7e13f34587e) 1: (FileStore::sync_entry()+0x34f0) [0x68e190] 2: (FileStore::SyncThread::entry()+0xd) [0x6a2abd] 3: (Thread::_entry_func(void*)+0x1b9) [0x803499] 4: (()+0x77f1) [0x7fdcd58637f1] 5: (clone()+0x6d) [0x7fdcd4cb4ccd] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 2012-05-24 13:24:06.327991 7fdcd1ee2700 -1 os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792 os/FileStore.cc: 3564: FAILED assert(0 == async snap ioctl error) FWIW, with commit 598dea12411 (filestore: mkfs: only create snap_0 if we created current_op_seq) reverted, I am able to create a new filesystem using the above sequence, and a typical mkcephfs --init-local-daemons osd looks like this: 2012-05-24 13:06:25.918663 7fe7ac829780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory 2012-05-24 13:06:26.301738 7fe7ac829780 -1 created object store /ram/mnt/ceph/data.osd.0 journal /dev/mapper/cs32s01p2 for osd.0 fsid f8cc9fa2-a300-45a1-ae6d-e0c0ef418d0f creating private key for osd.0 keyring /mnt/ceph/misc.osd.0/keyring.osd.0 creating /mnt/ceph/misc.osd.0/keyring.osd.0 Thanks -- Jim -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mkcephfs regression in current master branch
Hi Jim, On Thu, 24 May 2012, Jim Schutt wrote: Hi, In my testing I make repeated use of the manual mkcephfs sequence described in the man page: master# mkdir /tmp/foo master# mkcephfs -c /etc/ceph/ceph.conf --prepare-monmap -d /tmp/foo osdnode# mkcephfs --init-local-daemons osd -d /tmp/foo mdsnode# mkcephfs --init-local-daemons mds -d /tmp/foo master# mkcephfs --prepare-mon -d /tmp/foo monnode# mkcephfs --init-local-daemons mon -d /tmp/foo Using current master branch (commit ca79f45a33f9), the mkcephfs --init-local-daemons osd phase breaks like this: 2012-05-24 13:24:06.325905 7fdcd61d4780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory 2012-05-24 13:24:06.326768 7fdcd1ee2700 -1 filestore(/ram/mnt/ceph/data.osd.0) async snap create 'snap_1' transid 0 got (17) File exists os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792 os/FileStore.cc: 3564: FAILED assert(0 == async snap ioctl error) ceph version 0.47.1-157-gb003815 (commit:b003815c222add8bdcf645d9ba4ef7e13f34587e) 1: (FileStore::sync_entry()+0x34f0) [0x68e190] 2: (FileStore::SyncThread::entry()+0xd) [0x6a2abd] 3: (Thread::_entry_func(void*)+0x1b9) [0x803499] 4: (()+0x77f1) [0x7fdcd58637f1] 5: (clone()+0x6d) [0x7fdcd4cb4ccd] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 2012-05-24 13:24:06.327991 7fdcd1ee2700 -1 os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792 os/FileStore.cc: 3564: FAILED assert(0 == async snap ioctl error) I just pushed a fix for this to master. BTW, the real change happening with these patches is that --mkfs no longer clobbers existing data. If you want to wipe out an osd and start anew, you need to rm -r (and btrfs sub delete snap_* and current), or re-run mkfs.btrfs. Thanks for the report! sage FWIW, with commit 598dea12411 (filestore: mkfs: only create snap_0 if we created current_op_seq) reverted, I am able to create a new filesystem using the above sequence, and a typical mkcephfs --init-local-daemons osd looks like this: 2012-05-24 13:06:25.918663 7fe7ac829780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory 2012-05-24 13:06:26.301738 7fe7ac829780 -1 created object store /ram/mnt/ceph/data.osd.0 journal /dev/mapper/cs32s01p2 for osd.0 fsid f8cc9fa2-a300-45a1-ae6d-e0c0ef418d0f creating private key for osd.0 keyring /mnt/ceph/misc.osd.0/keyring.osd.0 creating /mnt/ceph/misc.osd.0/keyring.osd.0 Thanks -- Jim -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
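Spelling out Sage's note, wiping a btrfs-backed OSD data directory before re-running --mkfs would look roughly like this (the path matches Jim's setup from the logs above; the device name is an example):

# either remove the old state by hand (subvolumes first, then the rest) ...
for s in /ram/mnt/ceph/data.osd.0/snap_* /ram/mnt/ceph/data.osd.0/current; do
    btrfs subvolume delete "$s"
done
rm -rf /ram/mnt/ceph/data.osd.0/*

# ... or simply recreate the filesystem
umount /ram/mnt/ceph/data.osd.0
mkfs.btrfs /dev/sdX
mount /dev/sdX /ram/mnt/ceph/data.osd.0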
Re: mkcephfs regression in current master branch
On 05/24/2012 03:13 PM, Sage Weil wrote: Hi Jim, On Thu, 24 May 2012, Jim Schutt wrote: Hi, In my testing I make repeated use of the manual mkcephfs sequence described in the man page: master# mkdir /tmp/foo master# mkcephfs -c /etc/ceph/ceph.conf --prepare-monmap -d /tmp/foo osdnode# mkcephfs --init-local-daemons osd -d /tmp/foo mdsnode# mkcephfs --init-local-daemons mds -d /tmp/foo master# mkcephfs --prepare-mon -d /tmp/foo monnode# mkcephfs --init-local-daemons mon -d /tmp/foo Using current master branch (commit ca79f45a33f9), the mkcephfs --init-local-daemons osd phase breaks like this: 2012-05-24 13:24:06.325905 7fdcd61d4780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory 2012-05-24 13:24:06.326768 7fdcd1ee2700 -1 filestore(/ram/mnt/ceph/data.osd.0) async snap create 'snap_1' transid 0 got (17) File exists os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792 os/FileStore.cc: 3564: FAILED assert(0 == async snap ioctl error) ceph version 0.47.1-157-gb003815 (commit:b003815c222add8bdcf645d9ba4ef7e13f34587e) 1: (FileStore::sync_entry()+0x34f0) [0x68e190] 2: (FileStore::SyncThread::entry()+0xd) [0x6a2abd] 3: (Thread::_entry_func(void*)+0x1b9) [0x803499] 4: (()+0x77f1) [0x7fdcd58637f1] 5: (clone()+0x6d) [0x7fdcd4cb4ccd] NOTE: a copy of the executable, or `objdump -rdSexecutable` is needed to interpret this. 2012-05-24 13:24:06.327991 7fdcd1ee2700 -1 os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792 os/FileStore.cc: 3564: FAILED assert(0 == async snap ioctl error) I just pushed a fix for this to master. Great, that fixed it for me. Thanks for the quick response! BTW, the real change happening with these patches is that --mkfs no longer clobbers existing data. If you want to wipe out an osd and start anew, you need to rm -r (and btrfs sub delete snap_* and current), or re-run mkfs.btrfs. Ah, OK. It turns out I always run mkfs.btrfs anyway, but this is good to know. -- Jim Thanks for the report! sage FWIW, with commit 598dea12411 (filestore: mkfs: only create snap_0 if we created current_op_seq) reverted, I am able to create a new filesystem using the above sequence, and a typical mkcephfs --init-local-daemons osd looks like this: 2012-05-24 13:06:25.918663 7fe7ac829780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory 2012-05-24 13:06:26.301738 7fe7ac829780 -1 created object store /ram/mnt/ceph/data.osd.0 journal /dev/mapper/cs32s01p2 for osd.0 fsid f8cc9fa2-a300-45a1-ae6d-e0c0ef418d0f creating private key for osd.0 keyring /mnt/ceph/misc.osd.0/keyring.osd.0 creating /mnt/ceph/misc.osd.0/keyring.osd.0 Thanks -- Jim -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RBD format changes and layering
RBD object format changes
=========================

To enable us to add more features to rbd, including copy-on-write cloning via layering, we need to change the rbd header object format. Since this won't be backwards compatible, the old format will still be used by default. Once layering is implemented, the old format will be deprecated, but still usable with an extra option (something like rbd create --legacy ...). Clients will still be able to read the old format, and images can be converted by exporting and importing them.

While we're making these changes, we can clean up the way librbd and the rbd kernel module access the header, so that they don't have to change each time we change the header format. Instead of reading the header directly, they can use the OSD class mechanism to interact with it. librbd already does this for snapshots, but kernel rbd reads the entire header directly. Making them both use a well-defined api will make later format additions much simpler.

I'll describe the changes needed in general, and then those that are needed for rbd layering.

New format, pre-layering

Right now the header object is named $image_name.rbd, and the data objects are named rb.$image_id_lowbits.$image_id_highbits.$object_number. Since we're making other incompatible changes, we have a chance to rename these to be less likely to collide with other objects. Prefixing them with a more specific string will help, and will work well with a new security feature for layering discussed later. The new names are:

rbd_header.$image_name
rbd_data.$id.$object_number

The new header will have the existing (used) fields of the old format as key/value pairs in an omap (this is the rados interface that stores key/value pairs in leveldb). Specifically, the existing fields are:

* object_prefix // previously known as block_name
* order         // bit shift to determine size of the data objects
* size          // total size of the image in bytes
* snap_seq      // latest snapshot id used with the image
* snapshots     // list of (snap_name, snap_id, image_size) tuples

To make adding new things easier, there will be an additional 'features' field, which is a mask of the features used by the image. Clients will know whether they can use an image by checking if they support all the features the image uses that the osd reports as being incompatible (see get_info() below).

RBD class interface
===================

Here's a proposed basic interface - new features will add more functions and data to existing ones.

/**
 * Initialize the header with basic metadata.
 * Extra features may initialize more fields in the future.
 * Everything is stored as key/value pairs as omaps in the header object.
 *
 * If features the OSD does not understand are requested, -ENOSYS is
 * returned.
 */
create(__le64 size, __le32 order, __le64 features)

/**
 * Get the metadata about the image required to do I/O to it.
 * In the future this may include extra information for features
 * that require it, like encryption/compression type. This extra
 * data will be added at the end of the response, so clients that
 * don't support it don't interpret it.
 *
 * Features that would require clients to be updated to access
 * the image correctly (such as image bitmaps) are set in the
 * incompat_features field. A client that doesn't understand
 * those features will return an error when they try to open
 * the image.
 *
 * The size and any extra information is read from the appropriate
 * snapshot metadata, if snapid is not CEPH_NOSNAP.
 *
 * Returns __le64 size, __le64 order, __le64 features,
 * __le64 incompat_features, __le64 snapseq and
 * list of __le64 snapids
 */
get_info(__le64 snapid)

/**
 * Used when resizing the image. Sets the size in bytes.
 */
set_size(__le64 size)

/**
 * The same as the existing snap_add/snap_remove methods, but using the
 * new format.
 */
snapshot_add(string snap_name, __le64 snap_id)
snapshot_remove(string snap_name)

/**
 * list snapshots - like the existing snap_list, but
 * can return a subset of them.
 *
 * Returns __le64 snap_seq, __le64 snap_count, and a list of tuples
 * (snap_id, snap_size) just like the current snap_list
 */
snapshot_list(__le64 max_len)

/**
 * The same as the existing method. Should only be called
 * on the rbd_info object.
 * Returns an id number to use for a new image.
 */
assign_bid()

RBD layering

The first step is to implement trivial layering, i.e. layering without bitmaps, as described at:

http://marc.info/?l=ceph-devel&m=129867273303846&w=2

There are a couple of things that complicate the implementation:

1) making sure parent images are not deleted when children still refer to them

A simple way to solve this is to add a reference count to the parent image. This can cause issues with partially deleted images, if the reference count is decremented more than once because the child image's header was only deleted the
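Not part of the proposal itself, but as a concrete illustration of the omap-based header described above: assuming a rados CLI built with the omap subcommands (listomapvals, getomapval), the key/value pairs of such a header object could be inspected directly. The pool and image name below are made up, and the real values would be binary-encoded rather than plain strings:

$ rados -p rbd listomapvals rbd_header.myimage
$ rados -p rbd getomapval rbd_header.myimage size

Clients would normally go through the class methods above rather than touching the omap directly; this is just a way to see the storage layout this section describes.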
Re: poor OSD performance using kernel 3.4
On 05/24/2012 02:05 PM, Stefan Priebe wrote:
On 24.05.2012 20:53, Mark Nelson wrote:
Hi Stefan,
Thanks for the info! I've been testing on 3.4 for the last couple of days but haven't run into that problem here. It looks like your journal has writes going to it quickly and then things stall as it tries to write out to your data disk.

That's a good point. Right now while testing I'm using a tmpfs ramdisk for the journal and have set journal dio = false in ceph.conf. Might this be the difference / problem? 3.2.18 works fine too.

Honestly I don't know if a tmpfs journal with dio = false would lead to that kind of behavior. Anything interesting in the logs if you turn debugging up?

I wonder if any of the data actually makes it to the disk... Can you run iostat or collectl or something and see what kind of write throughput you get to the OSD data disks?

None... so it seems the data never gets transferred from journal to disk.

This might be a stupid question, but writes to those partitions work outside of Ceph with the new kernel, right?

Stefan

Thanks,
Mark
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
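Following up on Mark's suggestion, a quick way to check whether writes are reaching the osd data disks while a test runs might look like the following; exact flags vary between versions, so treat this as a sketch:

$ iostat -xm 1      # extended per-device stats in MB/s, 1 second interval
$ collectl -sD -i 1 # disk-detail view, 1 second interval

If the journal device shows traffic but the data disk stays idle, that would match the symptom of data never leaving the journal.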