Re: Ceph on btrfs 3.4rc

2012-05-24 Thread Martin Mailand

Hi,
the ceph cluster has been running under heavy load for the last 13 hours
without a problem; dmesg is empty and the performance is good.


-martin

On 23.05.2012 21:12, Martin Mailand wrote:

this patch has been running for 3 hours without a Bug and without the Warning.
I will let it run overnight and report tomorrow.
It looks very good ;-)



how to free space from rados bench command?

2012-05-24 Thread Stefan Priebe - Profihost AG
Hi,

every rados bench write uses disk space and my space fills up. How do I
free this space again?

Command used:
rados -p data bench 60 write -t 16

Stefan


Re: how to free space from rados bench command?

2012-05-24 Thread Wido den Hollander

Hi,

On 24-05-12 09:01, Stefan Priebe - Profihost AG wrote:

Hi,

every rados bench write uses disk space and my space fills up. How to
free this space again?

Used command?
rados -p data bench 60 write -t 16


What does this show:

$ rados -p data ls|wc -l

If that shows something greater than 0 it means you still have objects 
in that pool which are using up space.


Try removing those objects manually. Be cautious not to remove any other 
objects!


To be safe I'd recommend running benchmark commands in a separate pool.

Also note that when you remove objects it will take some time before the 
OSDs have removed them and you see the usage go down with ceph -s.
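
As a rough sketch of that approach (the pool name "bench" and the pg count 
are only placeholders, not something from this thread):

$ ceph osd pool create bench 128
$ rados -p bench bench 60 write -t 16
$ rados -p bench ls | xargs -n 1 rados -p bench rm

That way the benchmark objects can never collide with real data, and the 
throwaway pool can simply be deleted afterwards.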


Wido



Stefan


Re: how to free space from rados bench command?

2012-05-24 Thread Stefan Priebe - Profihost AG
On 24.05.2012 09:28, Wido den Hollander wrote:
 Hi,
 
 On 24-05-12 09:01, Stefan Priebe - Profihost AG wrote:
 Hi,

 every rados bench write uses disk space and my space fills up. How to
 free this space again?

 Used command?
 rados -p data bench 60 write -t 16
 
 What does this show:
 
 $ rados -p data ls|wc -l

~# rados -p data ls|wc -l
46631

I do not use the data pool, so it is separate ;-) I only use the rbd pool
for block devices.

So I will free the space with:
for i in `rados -p data ls`; do echo $i; rados -p data rm $i; done

Thanks!

Stefan


Re: Multiple named clusters on same nodes

2012-05-24 Thread Amon Ott
On Wednesday 23 May 2012, Tommi Virtanen wrote:
 On Wed, May 23, 2012 at 2:00 AM, Amon Ott a@m-privacy.de wrote:
  So I started experimenting with the new cluster variable, but it does
  not seem to be well supported so far. mkcephfs does not even know about
  it and always uses "ceph" as the cluster name. Setting a value for cluster
  in the global section of ceph.conf (homeuser.conf, backup.conf, ...) does not
  work; it is not even used in the same config file, instead it has the
  fixed value "ceph".
 [...]
 I don't think anyone is likely to fix mkcephfs to work with it -- I'm
 personally trying to get mkcephfs declared obsolete. It's
 fundamentally the wrong tool; for example, it cannot expand or
 reconfigure an existing cluster.

Attached is a patch based on current git stable that makes mkcephfs work fine 
for me with --cluster name. ceph-mon uses the wrong mkfs path for mon data 
(the default "ceph" instead of the supplied cluster name), so I put in a workaround.

Please have a look and consider inclusion as well as fixing mon data path. 
Thanks.
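
With the patch applied, the intended usage looks roughly like this (the
cluster name "backup" is only an example):

mkcephfs --cluster backup -a
# -c is now optional and defaults to /etc/ceph/backup.conf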

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH   Tel: +49 30 24342334
Am Köllnischen Park 1Fax: +49 30 24342336
10179 Berlin http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649
commit fc394c63b9fd4f5fea4bc3a430f57164a96dc543
Author: Amon Ott a...@rsbac.org
Date:   Thu May 24 09:48:29 2012 +0200

mkcephfs: Support --cluster name for cluster naming

Current mkcephfs can only create clusters with the name "ceph".
This patch allows specifying the cluster name and fixes some default paths
to the new $cluster based locations.
The --conf parameter is now optional and defaults to /etc/ceph/$cluster.conf.

Signed-off-by: Amon Ott a@m-privacy.de

diff --git a/src/mkcephfs.in b/src/mkcephfs.in
index 17b6014..e1c061e 100644
--- a/src/mkcephfs.in
+++ b/src/mkcephfs.in
@@ -60,7 +60,7 @@ else
 fi
 
 usage_exit() {
-echo "usage: $0 -a -c ceph.conf [-k adminkeyring] [--mkbtrfs]"
+echo "usage: $0 [--cluster name] -a [-c ceph.conf] [-k adminkeyring] [--mkbtrfs]"
 echo "   to generate a new ceph cluster on all nodes; for advanced usage see man page"
 echo "   ** be careful, this WILL clobber old data; check your ceph.conf carefully **"
 exit
@@ -89,6 +89,7 @@ moreargs=
 auto_action=0
 manual_action=0
 nocopyconf=0
+cluster=ceph
 
 while [ $# -ge 1 ]; do
 case $1 in
@@ -141,6 +142,11 @@ case $1 in
 	shift
 	conf=$1
 	;;
+--cluster | -C)
+	[ -z "$2" ] && usage_exit
+	shift
+	cluster=$1
+	;;
 --numosd)
 	[ -z "$2" ] && usage_exit
 	shift
@@ -181,6 +187,8 @@ done
 
 [ -z "$conf" ] && [ -n "$dir" ] && conf=$dir/conf
 
+[ -z "$conf" ] && conf=/etc/ceph/$cluster.conf
+
 if [ $manual_action -eq 0 ]; then
 if [ $auto_action -eq 0 ]; then
 echo You must specify an action. See man page.
@@ -245,19 +253,19 @@ if [ -n $initdaemon ]; then
 name=$type.$id
 
 # create /var/run/ceph (or wherever pid file and/or admin socket live)
-get_conf pid_file "/var/run/ceph/$name.pid" "pid file"
+get_conf pid_file "/var/run/ceph/$type/$cluster-$id.pid" "pid file"
 rundir=`dirname $pid_file`
 if [ "$rundir" != "." ] && [ ! -d "$rundir" ]; then
 	mkdir -p $rundir
 fi
-get_conf asok_file "/var/run/ceph/$name.asok" "admin socket"
+get_conf asok_file "/var/run/ceph/$type/$cluster-$id.asok" "admin socket"
 rundir=`dirname $asok_file`
 if [ "$rundir" != "." ] && [ ! -d "$rundir" ]; then
 	mkdir -p $rundir
 fi
 
 if [ $type = osd ]; then
-	$BINDIR/ceph-osd -c $conf --monmap $dir/monmap -i $id --mkfs
+	$BINDIR/ceph-osd --cluster $cluster -c $conf --monmap $dir/monmap -i $id --mkfs
 	create_private_key
 fi
 
@@ -266,7 +274,9 @@ if [ -n $initdaemon ]; then
 fi
 
 if [ $type = mon ]; then
-	$BINDIR/ceph-mon -c $conf --mkfs -i $id --monmap $dir/monmap --osdmap $dir/osdmap -k $dir/keyring.mon
+get_conf mondata "" "mon data"
+test -z "$mondata" && mondata=/var/lib/ceph/mon/$cluster-$id
+	$BINDIR/ceph-mon --cluster $cluster -c $conf --mon-data=$mondata --mkfs -i $id --monmap $dir/monmap --osdmap $dir/osdmap -k $dir/keyring.mon
 fi
 
 exit 0
@@ -442,14 +452,14 @@ if [ $allhosts -eq 1 ]; then
 
 	if [ $nocopyconf -eq 0 ]; then
 		# also put conf at /etc/ceph/ceph.conf
-		scp -q $dir/conf $host:/etc/ceph/ceph.conf
+		scp -q $dir/conf $host:/etc/ceph/$cluster.conf
 	fi
 	else
 	rdir=$dir
 
 	if [ $nocopyconf -eq 0 ]; then
 		# also put conf at /etc/ceph/ceph.conf
-		cp $dir/conf /etc/ceph/ceph.conf
+		cp $dir/conf /etc/ceph/$cluster.conf
 	fi
 	fi
 	
@@ -486,15 +496,15 @@ if [ $allhosts -eq 1 ]; then
 	scp -q $dir/* $host:$rdir
 
 	if [ $nocopyconf -eq 0 ]; then
-		# also put conf at /etc/ceph/ceph.conf
-		scp -q $dir/conf $host:/etc/ceph/ceph.conf
+		# also put conf at /etc/ceph/$cluster.conf
+		scp -q $dir/conf $host:/etc/ceph/$cluster.conf
 	fi
 	else
 	  

Re: how to free space from rados bench command?

2012-05-24 Thread Wido den Hollander



On 24-05-12 09:38, Stefan Priebe - Profihost AG wrote:

On 24.05.2012 09:28, Wido den Hollander wrote:

Hi,

On 24-05-12 09:01, Stefan Priebe - Profihost AG wrote:

Hi,

every rados bench write uses disk space and my space fills up. How to
free this space again?

Used command?
rados -p data bench 60 write -t 16


What does this show:

$ rados -p data ls|wc -l


~# rados -p data ls|wc -l
46631


That is weird, I thought the bench tool cleaned up its mess.

Imho it should clean up after it's done, but there might be a reason why 
it's not. Did you abort the benchmark or did you let it do the whole run?




I do not use the data pool, so it is separate ;-) I only use the rbd pool
for block devices.

So I will free the space with:
for i in `rados -p data ls`; do echo $i; rados -p data rm $i; done


rados -p data ls|xargs -n 1 rados -p data rm

I love shorter commands ;)
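
If deletion speed matters, xargs can also run several rm processes in
parallel (the -P value below is arbitrary):

rados -p data ls|xargs -n 1 -P 8 rados -p data rm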

Wido



Thanks!

Stefan


Re: RGW, future directions

2012-05-24 Thread Sławomir Skowron
On Thu, May 24, 2012 at 7:15 AM, Wido den Hollander w...@widodh.nl wrote:


 On 22-05-12 20:07, Yehuda Sadeh wrote:

 RGW is maturing. Beside looking at performance, which highly ties into
 RADOS performance, we'd like to hear whether there are certain pain
 points or future directions that you (you as in the ceph community)
 would like to see us taking.

 There are a few directions that we were thinking about:

 1. Extend Object Storage API

 Swift and S3 have some features that we don't currently support. We can
 certainly extend our functionality, however, is there any demand for
 more features? E.g., self destructing objects, web site, user logs,
 etc.

 2. Better OpenStack interoperability

 Keystone support? Other?

 3. New features

 Some examples:

  - multitenancy: api for domains and user management
  - snapshots
  - computation front end: upload object, then do some data
 transformation/calculation.
  - simple key-value api

 4. CDMI

 Sage brought up the CDMI support question to ceph-devel, and I don't
 remember him getting any response. Is there any interest in CDMI?


 5. Native apache/nginx module or embedded web server

 We still need to prove that the web server is a bottleneck, or poses
 scaling issues. Writing a correct native nginx module will require
 turning rgw process model into event driven, which is not going to be
 easy.


 I'd not go for a native nginx or Apache module, that would bring extra C
 code into the story which would mean extra dependencies.

 My vote would still go to an embedded webserver written in something like
 Python. You could then use Apache/nginx/Varnish as a reverse proxy in front
 and do all kinds of cool stuff.

 You could even do caching in nginx or Varnish and let the RGW notify
 those proxies when an object has changed so they can purge their cache. This
 would dramatically improve the performance of the gateway.

 It would also simplify the code, why try to do caching on your own when some
 great HTTP caches are out there?
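
(Purely as an illustration of that purge idea, and assuming the proxy is
configured to accept PURGE requests, the notification could be as simple as:

curl -X PURGE http://cache.example.com/bucket/object

where the cache host and object URL are placeholders.)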


100% +1


 6. Improve garbage collection

 Currently rgw generates intent logs for garbage removal that require
 running an external tool later, which is an administrative pain. We
 can implement other solutions (OSD side garbage collection,
 integrating cleanup process into the gateway, etc.) but we need to
 understand the priority.

 7. libradosgw

 We have had this in mind for some time now. Creating a programming api
 for rgw, not too different from librados and librbd. It'll hopefully
 make code much cleaner. It will allow users to write different front
 ends for the rgw backend, and it will make it easier for users to
 write applications that interact with the backend, e.g., do processing
 on objects that users uploaded, FUSE for rgw without S3 as an
 intermediate, etc.


 Yes, I would really like this. Combine this with the Python
 stand-alone/embedded webserver I proposed and you get a really nice RGW I
 think.


 8. Administration tools improvement

 We can always do better there.


 When we have libradosgw it wouldn't be that hard to make a nice web
 front-end where you can manage the whole thing.


 9. Other ideas?


 Any comments are welcome!

 Thanks,
 Yehuda



-- 
-
Pozdrawiam

Sławek sZiBis Skowron


Re: Designing a cluster guide

2012-05-24 Thread Jerker Nyberg

On Wed, 23 May 2012, Gregory Farnum wrote:


On Wed, May 23, 2012 at 12:47 PM, Jerker Nyberg jer...@update.uu.se wrote:


 * Scratch file system for HPC. (kernel client)
 * Scratch file system for research groups. (SMB, NFS, SSH)
 * Backend for simple disk backup. (SSH/rsync, AFP, BackupPC)
 * Metropolitan cluster.
 * VDI backend. KVM with RBD.


Hmm. Sounds to me like scratch filesystems would get a lot out of not
having to hit disk on the commit, but not much out of having separate
caching locations versus just letting the OSD page cache handle it. :)
The others, I don't really see collaborative caching helping much either.


Oh, sorry, those were my use cases for ceph in general. Yes, scratch is 
mostly of interest. But also fast backup. Currently IOPS is limiting our 
backup speed on a small cluster with many files but not much data. I have 
problems scanning through and backing up all changed files every night. 
Currently I am backing up to ZFS, but Ceph might help with scaling up 
performance and size. Another option is going for SSD instead of 
mechanical drives.



Anyway, make a bug for it in the tracker (I don't think one exists
yet, though I could be wrong) and someday when we start work on the
filesystem again we should be able to get to it. :)


Thank you for your thoughts on this. I hope to be able to do that soon.

Regards,
Jerker Nyberg, Uppsala, Sweden.

Problems while doing rpmbuild on CENTOS

2012-05-24 Thread Ajit K Jena
Hi All,

I am trying to build the CEPH stuff from the bzip2 source archive by
using the command below:

  rpmbuild -tb /home/src/ceph.related/ceph-0.47.2.tar.bz2

I get the following error:

  error: File /home/src/ceph.related/libs3-trunk.tar.gz: No such
file or directory

What is this and where do I get this file from? Can anyone please give
me some pointers on this?

Regards.

--ajit


  



Re: Multiple named clusters on same nodes

2012-05-24 Thread Amon Ott
On Thursday 24 May 2012, Amon Ott wrote:
 Attached is a patch based on current git stable that makes mkcephfs work
 fine for me with --cluster name. ceph-mon uses the wrong mkfs path for mon
 data (default ceph instead of supplied cluster name), so I put in a
 workaround.

 Please have a look and consider inclusion as well as fixing mon data path.
 Thanks.

And another patch for the init script to handle multiple clusters.
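
A quick usage sketch with both patches applied (the cluster name "backup" is
again only an example):

/etc/init.d/ceph --cluster backup start   # act on the "backup" cluster only
/etc/init.d/ceph start                    # no --cluster: walks /etc/ceph/*.conf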

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH   Tel: +49 30 24342334
Am Köllnischen Park 1Fax: +49 30 24342336
10179 Berlin http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649
commit d446077dc93894784348f7560ee29eaf6e3ce272
Author: Amon Ott a...@rsbac.org
Date:   Thu May 24 10:55:27 2012 +0200

Make init script init-ceph.in cluster name aware.

Add a --cluster clustername parameter to start/stop/etc. a specific cluster
with the default config file /etc/ceph/$cluster.conf.
If no clustername is given, walk through /etc/ceph/*.conf and try to
start/stop/etc. them all, with the cluster name taken from the conf basename.

Signed-off-by: Amon Ott a@m-privacy.de

diff --git a/src/init-ceph.in b/src/init-ceph.in
index f2702e3..6efe7f0 100644
--- a/src/init-ceph.in
+++ b/src/init-ceph.in
@@ -28,6 +28,7 @@ fi
 
 usage_exit() {
 echo "usage: $0 [options] {start|stop|restart} [mon|osd|mds]..."
+printf "\t--cluster clustername\n"
 printf "\t-c ceph.conf\n"
 printf "\t--valgrind\trun via valgrind\n"
 printf "\t--hostname [hostname]\toverride hostname lookup\n"
@@ -36,6 +37,8 @@ usage_exit() {
 
 . $LIBDIR/ceph_common.sh
 
+conf=
+
 EXIT_STATUS=0
 
 signal_daemon() {
@@ -45,7 +48,7 @@ signal_daemon() {
 signal=$4
 action=$5
 [ -z "$action" ] && action="Stopping"
-echo -n "$action Ceph $name on $host..."
+echo -n "$action Ceph $cluster $name on $host..."
 do_cmd if [ -e $pidfile ]; then
 pid=`cat $pidfile`
 if [ -e /proc/\$pid ]  grep -q $daemon /proc/\$pid/cmdline ; then
@@ -75,7 +78,7 @@ stop_daemon() {
 signal=$4
 action=$5
 [ -z "$action" ] && action="Stopping"
-echo -n "$action Ceph $name on $host..."
+echo -n "$action Ceph $cluster $name on $host..."
 do_cmd while [ 1 ]; do 
 	[ -e $pidfile ] || break
 	pid=\`cat $pidfile\`
@@ -103,6 +106,7 @@ monaddr=
 dobtrfs=1
 dobtrfsumount=0
 verbose=0
+cluster=
 
 while echo $1 | grep -q '^-'; do # FIXME: why not '^-'?
 case $1 in
@@ -151,6 +155,12 @@ case $1 in
 	shift
 	hostname=$1
 ;;
+--cluster )
+	[ -z "$2" ] && usage_exit
+	options="$options $1"
+	shift
+	cluster=$1
+;;
 *)
 	echo unrecognized option \'$1\'
 	usage_exit
@@ -160,11 +170,25 @@ options=$options $1
 shift
 done
 
-verify_conf
-
 command=$1
 [ -n "$*" ] && shift
 
+if test -z "$cluster"
+then
+for c in /etc/ceph/*.conf
+do
+test -f $c && $0 --cluster $(basename $c .conf) $command $@
+done
+exit 0
+fi
+
+if test -z "$conf"
+then
+conf=/etc/ceph/$cluster.conf
+fi
+
+verify_conf
+
 get_name_list $@
 
 for name in $what; do
@@ -176,9 +200,9 @@ for name in $what; do
 check_host || continue
 
 binary=$BINDIR/ceph-$type
-cmd="$binary -i $id"
+cmd="$binary --cluster $cluster -i $id"
 
-get_conf pid_file "$RUN_DIR/$type.$id.pid" "pid file"
+get_conf pid_file "$RUN_DIR/$type/$cluster-$id.pid" "pid file"
 if [ -n $pid_file ]; then
 	do_cmd mkdir -p `dirname $pid_file`
 	cmd=$cmd --pid-file $pid_file
@@ -191,13 +215,13 @@ for name in $what; do
 get_conf auto_start  auto start
 if [ $auto_start = no ] || [ $auto_start = false ] || [ $auto_start = 0 ]; then
 if [ -z $@ ]; then
-echo Skipping Ceph $name on $host... auto start is disabled
+echo Skipping Ceph $cluster $name on $host... auto start is disabled
 continue
 fi
 fi
 
 	if daemon_is_running $name ceph-$type $id $pid_file; then
-	echo Starting Ceph $name on $host...already running
+	echo Starting Ceph $cluster $name on $host...already running
 	continue
 	fi
 
@@ -228,7 +252,7 @@ for name in $what; do
 fi
 
 # do lockfile, if RH
-get_conf lockfile "/var/lock/subsys/ceph" "lock file"
+get_conf lockfile "/var/lock/subsys/ceph/$cluster" "lock file"
 lockdir=`dirname $lockfile`
 if [ ! -d $lockdir ]; then
 	lockfile=
@@ -270,7 +294,7 @@ for name in $what; do
 		echo Mounting Btrfs on $host:$btrfs_path
 		do_root_cmd modprobe btrfs ; btrfs device scan || btrfsctl -a ; egrep -q '^[^ ]+ $btrfs_path' /proc/mounts || mount -t btrfs $btrfs_opt $first_dev $btrfs_path
 	fi
-	echo Starting Ceph $name on $host...
+	echo Starting Ceph $cluster $name on $host...
 	mkdir -p $RUN_DIR
 	get_conf pre_start_eval "" "pre start eval"
 	[ -n "$pre_start_eval" ] && $pre_start_eval
@@ -297,14 +321,14 @@ for name in $what; do
 
 	status)
 	if 

Re: Ceph on btrfs 3.4rc

2012-05-24 Thread Christian Brunner
Same thing here.

I've tried really hard, but even after 12 hours I wasn't able to get a
single warning from btrfs.

I think you cracked it!

Thanks,
Christian

2012/5/24 Martin Mailand mar...@tuxadero.com:
 Hi,
 the ceph cluster is running under heavy load for the last 13 hours without a
 problem, dmesg is empty and the performance is good.

 -martin

 On 23.05.2012 21:12, Martin Mailand wrote:

 this patch is running for 3 hours without a Bug and without the Warning.
 I will let it run overnight and report tomorrow.
 It looks very good ;-)


ceph rbd crashes/stalls while random write 4k blocks

2012-05-24 Thread Stefan Priebe - Profihost AG
Hi list,

I'm still testing ceph rbd with kvm. Right now I'm testing an rbd block
device within a network-booted kvm.

Sequential writes/reads and random reads are fine. No problems so far.

But when I trigger lots of 4k random writes, all of them stall after a
short time and I get 0 iops and 0 transfer.

Used command:
fio --filename=/dev/vda --direct=1 --rw=randwrite --bs=4k --size=20G
--numjobs=50 --runtime=30 --group_reporting --name=file1

Then some time later i see this call trace:

INFO: task ceph-osd:3065 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osdD 8803b0e61d88 0  3065  1 0x0004
 88032f3ab7f8 0086 8803bffdac08 8803
 8803b0e61820 00010800 88032f3abfd8 88032f3aa010
 88032f3abfd8 00010800 81a0b020 8803b0e61820
Call Trace:
 [815e0e1a] schedule+0x3a/0x60
 [815e127d] schedule_timeout+0x1fd/0x2e0
 [812696c4] ? xfs_iext_bno_to_ext+0x84/0x160
 [81074db1] ? down_trylock+0x31/0x50
 [812696c4] ? xfs_iext_bno_to_ext+0x84/0x160
 [815e20b9] __down+0x69/0xb0
 [8128c4a6] ? _xfs_buf_find+0xf6/0x280
 [81074e6b] down+0x3b/0x50
 [8128b7b0] xfs_buf_lock+0x40/0xe0
 [8128c4a6] _xfs_buf_find+0xf6/0x280
 [8128c689] xfs_buf_get+0x59/0x190
 [8128ccf7] xfs_buf_read+0x27/0x100
 [81282f97] xfs_trans_read_buf+0x1e7/0x420
 [81239371] xfs_read_agf+0x61/0x1a0
 [812394e4] xfs_alloc_read_agf+0x34/0xd0
 [8123c877] xfs_alloc_fix_freelist+0x3f7/0x470
 [81288005] ? kmem_free+0x35/0x40
 [8127ff6e] ? xfs_trans_free_item_desc+0x2e/0x30
 [812800a7] ? xfs_trans_free_items+0x87/0xb0
 [8127cc73] ? xfs_perag_get+0x33/0xb0
 [8123c97f] ? xfs_free_extent+0x8f/0x120
 [8123c990] xfs_free_extent+0xa0/0x120
 [81287f07] ? kmem_zone_alloc+0x77/0xf0
 [81245ead] xfs_bmap_finish+0x15d/0x1a0
 [8126d15e] xfs_itruncate_finish+0x15e/0x340
 [81285495] xfs_setattr+0x365/0x980
 [812926e6] xfs_vn_setattr+0x16/0x20
 [8111e0ad] notify_change+0x11d/0x300
 [81103ccc] do_truncate+0x5c/0x90
 [8110ea35] ? get_write_access+0x15/0x50
 [81103ef7] sys_truncate+0x127/0x130
 [815e367b] system_call_fastpath+0x16/0x1b
INFO: task flush-8:16:3089 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-8:16  D 8803af0d9d88 0  3089  2 0x
 88032e835940 0046 00010fe0 8803
 8803af0d9820 00010800 88032e835fd8 88032e834010
 88032e835fd8 00010800 8803b0f7e080 8803af0d9820
Call Trace:
 [810be570] ? __lock_page+0x70/0x70
 [815e0e1a] schedule+0x3a/0x60
 [815e0ec7] io_schedule+0x87/0xd0
 [810be579] sleep_on_page+0x9/0x10
 [815e1412] __wait_on_bit_lock+0x52/0xb0
 [810be562] __lock_page+0x62/0x70
 [8106fb80] ? autoremove_wake_function+0x40/0x40
 [810c8fd0] ? pagevec_lookup_tag+0x20/0x30
 [810c7f66] write_cache_pages+0x386/0x4d0
 [810c6c10] ? set_page_dirty+0x70/0x70
 [810fd7ab] ? kmem_cache_free+0x1b/0xe0
 [810c80fc] generic_writepages+0x4c/0x70
 [81288bcf] xfs_vm_writepages+0x4f/0x60
 [810c813c] do_writepages+0x1c/0x40
 [81128854] writeback_single_inode+0xf4/0x260
 [81128c45] writeback_sb_inodes+0xe5/0x1b0
 [811290a8] writeback_inodes_wb+0x98/0x160
 [81129ac3] wb_writeback+0x2f3/0x460
 [815e089e] ? __schedule+0x3ae/0x850
 [8105df47] ? lock_timer_base+0x37/0x70
 [81129e4f] wb_do_writeback+0x21f/0x270
 [81129f3a] bdi_writeback_thread+0x9a/0x230
 [81129ea0] ? wb_do_writeback+0x270/0x270
 [81129ea0] ? wb_do_writeback+0x270/0x270
 [8106f646] kthread+0x96/0xa0
 [815e46d4] kernel_thread_helper+0x4/0x10
 [8106f5b0] ? kthread_worker_fn+0x130/0x130
 [815e46d0] ? gs_change+0xb/0xb

Stefan


Re: ceph rbd crashes/stalls while random write 4k blocks

2012-05-24 Thread Florian Haas
Stefan,

On 05/24/12 13:07, Stefan Priebe - Profihost AG wrote:
 Hi list,

 i'm still testing ceph rbd with kvm. Right now i'm testing a rbd block
 device within a network booted kvm.

 Sequential write/reads and random reads are fine. No problems so far.

 But when i trigger lots of 4k random writes all of them stall after
 short time and i get 0 iops and 0 transfer.

 used command:
 fio --filename=/dev/vda --direct=1 --rw=randwrite --bs=4k --size=20G
 --numjobs=50 --runtime=30 --group_reporting --name=file1

 Then some time later i see this call trace:

 INFO: task ceph-osd:3065 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ceph-osdD 8803b0e61d88 0  3065  1 0x0004
  88032f3ab7f8 0086 8803bffdac08 8803
  8803b0e61820 00010800 88032f3abfd8 88032f3aa010
  88032f3abfd8 00010800 81a0b020 8803b0e61820
 Call Trace:
  [815e0e1a] schedule+0x3a/0x60
  [815e127d] schedule_timeout+0x1fd/0x2e0
  [812696c4] ? xfs_iext_bno_to_ext+0x84/0x160
  [81074db1] ? down_trylock+0x31/0x50
  [812696c4] ? xfs_iext_bno_to_ext+0x84/0x160
  [815e20b9] __down+0x69/0xb0
  [8128c4a6] ? _xfs_buf_find+0xf6/0x280
  [81074e6b] down+0x3b/0x50

sorry I'm coming a bit late to the various threads you've posted
recently, but on this particular issue: what kernel are your OSDs
running on, and do these hung tasks occur if you're using a local
filesystem other than XFS?

As of late XFS has occasionally been producing seemingly random kernel
hangs. Your call trace doesn't have the signature entries from xfssyncd
that identify a particular problem that I've been struggling with
lately, but you just might be affected by some other effect of the same
root issue.

Take a look at these to see if anything looks familiar:

http://oss.sgi.com/bugzilla/show_bug.cgi?id=922
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498
http://oss.sgi.com/archives/xfs/2011-11/msg00400.html

Not sure if this helps at all; just thought I might pitch that in.

Cheers,
Florian


Re: MDS crash, wont startup again

2012-05-24 Thread Felix Feinhals
Hi,

i was using the Debian Packages, but i tried now from source.
I used the same version from GIT
(cb7f1c9c7520848b0899b26440ac34a8acea58d1) and compiled it. Same crash
report.
Then i applied your patch but again the same crash, i think the
backtrace is also the same:

 (gdb) thread 1
[Switching to thread 1 (Thread 9564)]#0  0x7f33a3e58ebb in raise
(sig=<value optimized out>)
at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
41  in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c
(gdb) backtrace
#0  0x7f33a3e58ebb in raise (sig=<value optimized out>)
at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
#1  0x0081423e in reraise_fatal (signum=11) at
global/signal_handler.cc:58
#2  handle_fatal_signal (signum=11) at global/signal_handler.cc:104
#3  signal handler called
#4  SnapRealm::have_past_parents_open (this=0x0, first=..., last=...)
at mds/snap.cc:112
#5  0x0055d58b in MDCache::check_realm_past_parents
(this=0x27a7200, realm=0x0)
at mds/MDCache.cc:4495
#6  0x00572eec in
MDCache::choose_lock_states_and_reconnect_caps (this=0x27a7200)
at mds/MDCache.cc:4533
#7  0x005931a0 in MDCache::rejoin_gather_finish
(this=0x27a7200) at mds/MDCache.cc:
#8  0x0059b9d5 in MDCache::rejoin_send_rejoins
(this=0x27a7200) at mds/MDCache.cc:3388
#9  0x004a8721 in MDS::rejoin_joint_start (this=0x27bc000) at
mds/MDS.cc:1404
#10 0x004c253a in MDS::handle_mds_map (this=0x27bc000,
m=<value optimized out>)
at mds/MDS.cc:968
#11 0x004c4513 in MDS::handle_core_message (this=0x27bc000,
m=0x27ab800) at mds/MDS.cc:1651
#12 0x004c45ef in MDS::_dispatch (this=0x27bc000, m=0x27ab800)
at mds/MDS.cc:1790
#13 0x004c628b in MDS::ms_dispatch (this=0x27bc000,
m=0x27ab800) at mds/MDS.cc:1602
#14 0x00732609 in Messenger::ms_deliver_dispatch
(this=0x279f680) at msg/Messenger.h:178
#15 SimpleMessenger::dispatch_entry (this=0x279f680) at
msg/SimpleMessenger.cc:363
#16 0x007207ad in SimpleMessenger::DispatchThread::entry() ()
#17 0x7f33a3e508ca in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#18 0x7f33a26d892d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#19 0x in ?? ()

Any more ideas? :)
Or can I get you more debugging output?



2012/5/23 Gregory Farnum g...@inktank.com:
 On Wed, May 23, 2012 at 5:28 AM, Felix Feinhals
 f...@turtle-entertainment.de wrote:
 Hey,

 OK, I installed libc-dbg and ran your commands; now this comes up:

 gdb /usr/bin/ceph-mds core

 snip

 GNU gdb (GDB) 7.0.1-debian
 Copyright (C) 2009 Free Software Foundation, Inc.
 License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
 This is free software: you are free to change and redistribute it.
 There is NO WARRANTY, to the extent permitted by law.  Type show copying
 and show warranty for details.
 This GDB was configured as x86_64-linux-gnu.
 For bug reporting instructions, please see:
 http://www.gnu.org/software/gdb/bugs/...
 Reading symbols from /usr/bin/ceph-mds...Reading symbols from
 /usr/lib/debug/usr/bin/ceph-mds...done.
 (no debugging symbols found)...done.
 [New Thread 22980]
 [New Thread 22984]
 [New Thread 22986]
 [New Thread 22979]
 [New Thread 22970]
 [New Thread 22981]
 [New Thread 22971]
 [New Thread 22976]
 [New Thread 22973]
 [New Thread 22975]
 [New Thread 22974]
 [New Thread 22972]
 [New Thread 22978]
 [New Thread 22982]

 warning: Can't read pathname for load map: Input/output error.
 Reading symbols from /lib/libpthread.so.0...Reading symbols from
 /usr/lib/debug/lib/libpthread-2.11.3.so...done.
 (no debugging symbols found)...done.
 Loaded symbols for /lib/libpthread.so.0
 Reading symbols from /usr/lib/libcrypto++.so.8...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libcrypto++.so.8
 Reading symbols from /lib/libuuid.so.1...(no debugging symbols found)...done.
 Loaded symbols for /lib/libuuid.so.1
 Reading symbols from /lib/librt.so.1...Reading symbols from
 /usr/lib/debug/lib/librt-2.11.3.so...done.
 (no debugging symbols found)...done.
 Loaded symbols for /lib/librt.so.1
 Reading symbols from /usr/lib/libtcmalloc.so.0...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libtcmalloc.so.0
 Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libstdc++.so.6
 Reading symbols from /lib/libm.so.6...Reading symbols from
 /usr/lib/debug/lib/libm-2.11.3.so...done.
 (no debugging symbols found)...done.
 Loaded symbols for /lib/libm.so.6
 Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols 
 found)...done.
 Loaded symbols for /lib/libgcc_s.so.1
 Reading symbols from /lib/libc.so.6...Reading symbols from
 /usr/lib/debug/lib/libc-2.11.3.so...done.
 (no debugging symbols found)...done.
 Loaded symbols for /lib/libc.so.6
 Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols
 from /usr/lib/debug/lib/ld-2.11.3.so...done.
 (no debugging symbols 

Re: ceph rbd crashes/stalls while random write 4k blocks

2012-05-24 Thread Stefan Priebe - Profihost AG
On 24.05.2012 14:12, Florian Haas wrote:
 Stefan,
 sorry I'm coming a bit late to the various threads you've posted
 recently, but on this particular issue: what kernel are your OSDs
 running on, and do these hung tasks occur if you're using a local
 filesystem other than XFS?

The OSDs run 3.0.30, but I tried 3.3.7 too - no difference (regarding the XFS
crash and random writes).

I just tried btrfs with the 3.4 kernel and the posted patch from yesterday.

But with kernel 3.4 the performance is in general pretty low, no matter
whether I use xfs or btrfs:

~# rados -p data bench 10 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   0 0 0 0 0 - 0
1  163519   75.982476  0.294869  0.376607
2  165135   69.984464  0.103118  0.345375
3  16725674.65284  0.1139090.5364
4  168872   71.986664  0.641818  0.786378
5  169579   63.188728  0.131084  0.737699
6  16   11397   64.655372  0.232688  0.851319
7  16   129   113   64.560464   0.35199  0.822971
8  16   148   132   65.988876   0.09892  0.739852
9  16   149   133   59.1007 4  0.833541  0.740556
   10  16   157   141   56.389932  0.101306  0.715187
   11  16   157   141   51.2634 0 -  0.715187
   12  16   157   141   46.9914 0 -  0.715187
   13  16   157   141   43.3766 0 -  0.715187
   14  16   157   141   40.2782 0 -  0.715187
   15  16   157   14137.593 0 -  0.715187
   16  16   157   141   35.2434 0 -  0.715187
Total time run:16.471636
Total writes made: 158
Write size:4194304
Bandwidth (MB/sec):38.369

Average Latency:   1.66534
Max latency:   13.554
Min latency:   0.095194

 As of late XFS has occasionally been producing seemingly random kernel
 hangs. Your call trace doesn't have the signature entries from xfssyncd
 that identify a particular problem that I've been struggling with
 lately, but you just might be affected by some other effect of the same
 root issue.

 Take a look at these to see if anything looks familiar:

 http://oss.sgi.com/bugzilla/show_bug.cgi?id=922
 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498
 http://oss.sgi.com/archives/xfs/2011-11/msg00400.html

These are solved by using 3.0.20.

Stefan





poor OSD performance using kernel 3.4

2012-05-24 Thread Stefan Priebe - Profihost AG
Hi list,

Today while testing btrfs I discovered very poor OSD performance using
kernel 3.4.

The underlying FS is XFS, but it is the same with btrfs.

3.0.30:
~# rados -p data bench 10 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   0 0 0 0 0 - 0
1  164125   99.9767   100  0.586984  0.447293
2  167155   109.979   120  0.934388  0.488375
3  169983   110.647   112   1.15982  0.503111
4  16   130   114   113.981   124   1.05952  0.516925
5  16   159   143   114.382   116  0.149313  0.510734
6  16   188   172   114.649   116  0.287166   0.52203
7  16   215   199   113.697   108  0.151784  0.531461
8  16   242   226   112.984   108  0.623478  0.539896
9  16   265   249   110.65192   0.50354  0.538504
   10  16   296   280   111.984   124  0.155048  0.542846
Total time run:10.776153
Total writes made: 297
Write size:4194304
Bandwidth (MB/sec):110.243

Average Latency:   0.577534
Max latency:   1.85499
Min latency:   0.091473


3.4:
~# rados -p data bench 10 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   0 0 0 0 0 - 0
1  164024   95.979496  0.393196  0.455936
2  166852   103.983   112  0.835652  0.517297
3  168569   91.984968   1.00535  0.493058
4  169680   79.986944  0.096564  0.577948
5  16   10387   69.587928  0.092722  0.589147
6  16   117   101   67.321656  0.222175  0.675334
7  16   130   114   65.132152   0.15677  0.623806
8  16   144   128   63.989656  0.089157   0.56746
9  16   144   128   56.8794 0 -   0.56746
   10  16   144   128   51.1912 0 -   0.56746
   11  16   144   128   46.5373 0 -   0.56746
   12  16   144   128   42.6591 0 -   0.56746
   13  16   144   128   39.3776 0 -   0.56746
   14  16   144   128   36.5649 0 -   0.56746
   15  16   144   128   34.1272 0 -   0.56746
   16  16   145   129   32.2443   0.5   11.3422  0.650985
Total time run:16.193871
Total writes made: 145
Write size:4194304
Bandwidth (MB/sec):35.816

Average Latency:   1.78467
Max latency:   14.4744
Min latency:   0.088753

Stefan


Re: ceph rbd crashes/stalls while random write 4k blocks

2012-05-24 Thread Florian Haas
On Thu, May 24, 2012 at 4:09 PM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
 Take a look at these to see if anything looks familiar:

 http://oss.sgi.com/bugzilla/show_bug.cgi?id=922
 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498
 http://oss.sgi.com/archives/xfs/2011-11/msg00400.html

 These are solved by using 3.0.20.

... or so Christoph says, but comment #4 in bug 922 seems to indicate otherwise.

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now


Re: poor OSD performance using kernel 3.4

2012-05-24 Thread Mark Nelson

Hi Stefan,

Were these both tested on fresh filesystems?  If you still have any 
3.0.30 available, could you try a couple of longer running tests (say 5 
minutes) and see how they compare?


Thanks,
Mark

On 05/24/2012 09:10 AM, Stefan Priebe - Profihost AG wrote:

Hi list,

today while testing btrfs i discovered a very poor osd performance using
kernel 3.4.

Underlying FS is XFS but it is the same with btrfs.

3.0.30:
~# rados -p data bench 10 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 0   0 0 0 0 0 - 0
 1  164125   99.9767   100  0.586984  0.447293
 2  167155   109.979   120  0.934388  0.488375
 3  169983   110.647   112   1.15982  0.503111
 4  16   130   114   113.981   124   1.05952  0.516925
 5  16   159   143   114.382   116  0.149313  0.510734
 6  16   188   172   114.649   116  0.287166   0.52203
 7  16   215   199   113.697   108  0.151784  0.531461
 8  16   242   226   112.984   108  0.623478  0.539896
 9  16   265   249   110.65192   0.50354  0.538504
10  16   296   280   111.984   124  0.155048  0.542846
Total time run:10.776153
Total writes made: 297
Write size:4194304
Bandwidth (MB/sec):110.243

Average Latency:   0.577534
Max latency:   1.85499
Min latency:   0.091473


3.4:
~# rados -p data bench 10 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 0   0 0 0 0 0 - 0
 1  164024   95.979496  0.393196  0.455936
 2  166852   103.983   112  0.835652  0.517297
 3  168569   91.984968   1.00535  0.493058
 4  169680   79.986944  0.096564  0.577948
 5  16   10387   69.587928  0.092722  0.589147
 6  16   117   101   67.321656  0.222175  0.675334
 7  16   130   114   65.132152   0.15677  0.623806
 8  16   144   128   63.989656  0.089157   0.56746
 9  16   144   128   56.8794 0 -   0.56746
10  16   144   128   51.1912 0 -   0.56746
11  16   144   128   46.5373 0 -   0.56746
12  16   144   128   42.6591 0 -   0.56746
13  16   144   128   39.3776 0 -   0.56746
14  16   144   128   36.5649 0 -   0.56746
15  16   144   128   34.1272 0 -   0.56746
16  16   145   129   32.2443   0.5   11.3422  0.650985
Total time run:16.193871
Total writes made: 145
Write size:4194304
Bandwidth (MB/sec):35.816

Average Latency:   1.78467
Max latency:   14.4744
Min latency:   0.088753

Stefan


Re: how to free space from rados bench command?

2012-05-24 Thread Greg Farnum
On Thursday, May 24, 2012 at 1:51 AM, Stefan Priebe - Profihost AG wrote:
 On 24.05.2012 10:22, Wido den Hollander wrote:
  On 24-05-12 09:38, Stefan Priebe - Profihost AG wrote:
   
   ~# rados -p data ls|wc -l
   46631
  
  
  
  That is weird, I thought the bench tool cleaned up it's mess.
  
  Imho it should cleanup after it's done, but there might be a reason why
  it's not. Did you abort the benchmark or did you let it do the whole run?
 
 
 No it doesn't BUG?
It doesn't because you might want to leave around the data for read 
benchmarking (or so that your cluster is full of data).
There should probably be an option to clean up bench data, though! I've created 
a bug: http://tracker.newdream.net/issues/2477
 
 
 ~# rados -p data ls
 ~#
 ~# rados -p data bench 20 write -t 16
 ...
 ~# rados -p data ls| wc -l
 589
 
   I do not use the data pool so it is seperate ;-) i only use the rbd pool
   for block devices.
   
   So i will free the space with:
   for i in `rados -p data ls`; do echo $i; rados -p data rm $i; done
  
  
  
  rados -p data ls|xargs -n 1 rados -p data rm
  
  I love shorter commands ;)
 Me too. I just tried it without -n and hoped that it would work, but rados
 doesn't support more than one file per command and I didn't remember -n1 ;)
 
 Stefan


Re: I have some problem to mount ceph file system

2012-05-24 Thread Greg Farnum
That's not an option any more, since malicious clients can fake it so easily. 
:(  


On Wednesday, May 23, 2012 at 10:35 PM, FrankWOO Su wrote:

 So in this version, can I do some settings to limit the mount command by IP?
  
 Any example?
  
 Thanks
 -Frank
  
 2012/5/24 Sage Weil s...@inktank.com
  On Wed, 23 May 2012, Gregory Farnum wrote:
   On Wed, May 23, 2012 at 1:51 AM, Frank frankwoo@gmail.com wrote:
Hello
I have a question about ceph.
 
When I mount ceph, I do the command as follow :
 
# mount -t ceph -o name=admin,secret=XX 10.1.0.1:6789/ /mnt/ceph -vv
 
now I create an user foo and make a secretkey by ceph-authtool like 
that :
 
# ceph-authtool /etc/ceph/keyring.bin -n client.foo --gen-key
 
then I add the key into ceph :
 
# ceph auth add client.foo osd 'allow *' mon 'allow *' mds 'allow' -i
/etc/ceph/keyring.bin
 
so i can mount ceph by foo :
 
# mount -t ceph -o name=foo,secret=XOXOXO 10.1.0.1:6789/ /mnt/ceph -vv
 
my question is: what if I don't want foo to have permission to mount
10.1.0.1:6789/ ?

HOW TO DO IT?

if there is a directory foo,

I want him to be able to mount 10.1.0.1:6789:/foo/

but to have no access to mount 10.1.0.1:6789:/

   I'm afraid that's not an option with Ceph right now, that I'm aware
   of. It was built and designed for a trusted set of servers and
   clients, and while we're slowly carving out areas of security, this
   isn't one we've done yet.
   If it's an important feature for you, you should create a feature
    request in the tracker (tracker.newdream.net) for it, which we will
   prioritize and work on once we've moved to focus on the full
   filesystem. :)
   
   
  http://tracker.newdream.net/issues/1237
   
  (tho the final config will probably not look like that; suggestions
  welcome.)
   
  sage


Re: NFS re-exporting CEPH cluster

2012-05-24 Thread Greg Farnum
On Wednesday, May 23, 2012 at 10:14 PM, Madhusudhana U wrote:
 Hi all,
 Has anyone tried re-exporting a CEPH cluster via NFS with success (I mean,
 mount the CEPH cluster on one of the machines and then export that via
 NFS to clients)? I need to do this because of my client kernel version and some
 EDA tools compatibility. Can someone suggest how I can successfully
 re-export CEPH over NFS?

Have you tried something and it failed? Or are you looking for suggestions?
If the former, please report the failure. :)
If the latter: http://ceph.com/wiki/Re-exporting_NFS
-Greg
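
For reference, a minimal sketch of one common approach (mount with the kernel
client, then re-export through knfsd; the monitor address, mount options and
fsid value below are placeholders):

# mount -t ceph 10.1.0.1:6789:/ /mnt/ceph -o name=admin,secret=XX
# echo '/mnt/ceph *(rw,no_subtree_check,fsid=101)' >> /etc/exports
# exportfs -ra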




Re: how to free space from rados bench command?

2012-05-24 Thread Josh Durgin

On 05/24/2012 10:55 AM, Greg Farnum wrote:

On Thursday, May 24, 2012 at 1:51 AM, Stefan Priebe - Profihost AG wrote:

On 24.05.2012 10:22, Wido den Hollander wrote:

On 24-05-12 09:38, Stefan Priebe - Profihost AG wrote:


~# rados -p data ls|wc -l
46631




That is weird, I thought the bench tool cleaned up it's mess.

Imho it should cleanup after it's done, but there might be a reason why
it's not. Did you abort the benchmark or did you let it do the whole run?



No it doesn't BUG?

It doesn't because you might want to leave around the data for read 
benchmarking (or so that your cluster is full of data).
There should probably be an option to clean up bench data, though! I've created 
a bug: http://tracker.newdream.net/issues/2477


Why not have the read benchmark write data itself, and then benchmark
reading? Then both read and write benchmarks can clean up after
themselves.

It's a bit odd to have the read benchmark depend on you running a write
benchmark first.

Josh


Re: OSDMap::apply_incremental not updating crush map

2012-05-24 Thread Greg Farnum
On Thursday, May 24, 2012 at 10:58 AM, Adam Crume wrote:
 I'm trying to simulate adding an OSD to a cluster. I set up an
 OSDMap::Incremental and apply it, but nothing ever gets mapped to the
 new OSD. Apparently, the crush map never gets updated. Do I have to
 do that manually?

Yes. If you need help, check out the crush section of the
OSDMonitor::prepare_command code. :)
 
 It seems like apply_incremental should do it
 automatically.

apply_incremental has no idea where the new ID is located in terms of failure 
domains.
 
 My test case is below. It shows that the OSDMap is
 updated to have 11 OSDs, but the crush map still shows only 10.
 
 Thanks,
 Adam Crume
 
 #include <assert.h>
 #include "osd/OSDMap.h"
 #include "common/code_environment.h"
 
 int main() {
     OSDMap *osdmap = new OSDMap();
     CephContext *cct = new CephContext(CODE_ENVIRONMENT_UTILITY);
     uuid_d fsid;
     int num_osds = 10;
     osdmap->build_simple(cct, 1, fsid, num_osds, 7, 8);
     for (int i = 0; i < num_osds; i++) {
         osdmap->set_state(i, osdmap->get_state(i) | CEPH_OSD_UP |
             CEPH_OSD_EXISTS);
         osdmap->set_weight(i, CEPH_OSD_IN);
     }
 
     int osd_num = 10;
     OSDMap::Incremental inc(osdmap->get_epoch() + 1);
     inc.new_max_osd = osdmap->get_max_osd() + 1;
     inc.new_weight[osd_num] = CEPH_OSD_IN;
     inc.new_state[osd_num] = CEPH_OSD_UP | CEPH_OSD_EXISTS;
     inc.new_up_client[osd_num] = entity_addr_t();
     inc.new_up_internal[osd_num] = entity_addr_t();
     inc.new_hb_up[osd_num] = entity_addr_t();
     inc.new_up_thru[osd_num] = inc.epoch;
     uuid_d new_uuid;
     new_uuid.generate_random();
     inc.new_uuid[osd_num] = new_uuid;
     int e = osdmap->apply_incremental(inc);
     assert(e == 0);
     printf("State for 10: %d, State for 0: %d\n",
            osdmap->get_state(10), osdmap->get_state(0));
     printf("10 exists: %s\n", osdmap->exists(10) ? "yes" : "no");
     printf("10 is in: %s\n", osdmap->is_in(10) ? "yes" : "no");
     printf("10 is up: %s\n", osdmap->is_up(10) ? "yes" : "no");
     printf("OSDMap max OSD: %d\n", osdmap->get_max_osd());
     printf("CRUSH max devices: %d\n", osdmap->crush->get_max_devices());
 }


Re: how to free space from rados bench command?

2012-05-24 Thread Greg Farnum
On Thursday, May 24, 2012 at 11:05 AM, Josh Durgin wrote:
 Why not have the read benchmark write data itself, and then benchmark
 reading? Then both read and write benchmarks can clean up after
 themselves.
 
 It's a bit odd to have the read benchmark depend on you running a write
 benchmark first.
 
 Josh 
We've talked about that and decided we didn't like it. I think it was about 
being able to repeat large read benchmarks without having to wait for all the 
data to get written out first, and also (although this was never implemented) 
being able to implement random read benchmarks and things in ways that allowed 
you to make the cache cold first.
Which is not to say that changing it is a bad idea; I could be talked into that 
or somebody else could do it. :)
-Greg



Re: poor OSD performance using kernel 3.4

2012-05-24 Thread Mark Nelson

Hi Stefan,

Thanks for the info!  I've been testing on 3.4 for the last couple of 
days but haven't run into that problem here.  It looks like your journal 
has writes going to it quickly and then things stall as it tries to 
write out to your data disk.  I wonder if any of the data actually makes 
it to the disk...  Can you run iostat or collectl or something and see 
what kind of write throughput you get to the OSD data disks?
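
(For example, something as simple as running "iostat -x -k 5" on an OSD node
while the benchmark is going would show whether the data disks see any writes
at all.)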


Thanks,
Mark

On 05/24/2012 01:15 PM, Stefan Priebe wrote:


On 24.05.2012 16:55, Mark Nelson wrote:

Hi Stefan,

Were these both tested on fresh filesystems?  If you still have any
3.0.30 available, could you try a couple of longer running tests (say 5
minutes) and see how they compare?


Yes with 3.4 it totally stalls. Tested with XFS and btrfs. Client 
always had the same Kernel. So i just changed the kernel on osd side.


Kernel 3.4
http://pastebin.com/raw.php?i=CApKbSNj

Kernel 3.0.30
http://pastebin.com/raw.php?i=kZ7rnwcM

Stefan




Re: poor OSD performance using kernel 3.4

2012-05-24 Thread Stefan Priebe

On 24.05.2012 20:53, Mark Nelson wrote:

Hi Stefan,

Thanks for the info! I've been testing on 3.4 for the last couple of
days but haven't run into that problem here. It looks like your journal
has writes going to it quickly and then things stall as it tries to
write out to your data disk.
That's a good point. Right now while testing I'm using a tmpfs ramdisk 
for the journal and have set journal dio = false in ceph.conf. Might 
this be the difference / problem?


3.2.18 works fine too.

 I wonder if any of the data actually makes

it to the disk... Can you run iostat or collectl or something and see
what kind of write throughput you get to the OSD data disks?

None... so it seems the data never gets transferred from the journal to disk.

Stefan


mkcephfs regression in current master branch

2012-05-24 Thread Jim Schutt

Hi,

In my testing I make repeated use of the manual mkcephfs
sequence described in the man page:

   master# mkdir /tmp/foo
   master# mkcephfs -c /etc/ceph/ceph.conf --prepare-monmap -d /tmp/foo
   osdnode# mkcephfs --init-local-daemons osd -d /tmp/foo
   mdsnode# mkcephfs --init-local-daemons mds -d /tmp/foo
   master# mkcephfs --prepare-mon -d /tmp/foo
   monnode# mkcephfs --init-local-daemons mon -d /tmp/foo

Using current master branch (commit ca79f45a33f9), the
mkcephfs --init-local-daemons osd phase breaks like this:

2012-05-24 13:24:06.325905 7fdcd61d4780 -1 filestore(/ram/mnt/ceph/data.osd.0) 
could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory
2012-05-24 13:24:06.326768 7fdcd1ee2700 -1 filestore(/ram/mnt/ceph/data.osd.0) 
async snap create 'snap_1' transid 0 got (17) File exists
os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 
time 2012-05-24 13:24:06.326792
os/FileStore.cc: 3564: FAILED assert(0 == async snap ioctl error)
 ceph version 0.47.1-157-gb003815 
(commit:b003815c222add8bdcf645d9ba4ef7e13f34587e)
 1: (FileStore::sync_entry()+0x34f0) [0x68e190]
 2: (FileStore::SyncThread::entry()+0xd) [0x6a2abd]
 3: (Thread::_entry_func(void*)+0x1b9) [0x803499]
 4: (()+0x77f1) [0x7fdcd58637f1]
 5: (clone()+0x6d) [0x7fdcd4cb4ccd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.
2012-05-24 13:24:06.327991 7fdcd1ee2700 -1 os/FileStore.cc: In function 'void 
FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792
os/FileStore.cc: 3564: FAILED assert(0 == async snap ioctl error)

FWIW, with commit 598dea12411 (filestore: mkfs: only
create snap_0 if we created current_op_seq) reverted,
I am able to create a new filesystem using the above
sequence, and a typical mkcephfs --init-local-daemons osd
looks like this:

2012-05-24 13:06:25.918663 7fe7ac829780 -1 filestore(/ram/mnt/ceph/data.osd.0) 
could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory
2012-05-24 13:06:26.301738 7fe7ac829780 -1 created object store 
/ram/mnt/ceph/data.osd.0 journal /dev/mapper/cs32s01p2 for osd.0 fsid 
f8cc9fa2-a300-45a1-ae6d-e0c0ef418d0f
creating private key for osd.0 keyring /mnt/ceph/misc.osd.0/keyring.osd.0
creating /mnt/ceph/misc.osd.0/keyring.osd.0

Thanks -- Jim



Re: mkcephfs regression in current master branch

2012-05-24 Thread Sage Weil
Hi Jim,

On Thu, 24 May 2012, Jim Schutt wrote:

 Hi,
 
 In my testing I make repeated use of the manual mkcephfs
 sequence described in the man page:
 
master# mkdir /tmp/foo
master# mkcephfs -c /etc/ceph/ceph.conf --prepare-monmap -d /tmp/foo
osdnode# mkcephfs --init-local-daemons osd -d /tmp/foo
mdsnode# mkcephfs --init-local-daemons mds -d /tmp/foo
master# mkcephfs --prepare-mon -d /tmp/foo
monnode# mkcephfs --init-local-daemons mon -d /tmp/foo
 
 Using current master branch (commit ca79f45a33f9), the
 mkcephfs --init-local-daemons osd phase breaks like this:
 
 2012-05-24 13:24:06.325905 7fdcd61d4780 -1 filestore(/ram/mnt/ceph/data.osd.0)
 could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or
 directory
 2012-05-24 13:24:06.326768 7fdcd1ee2700 -1 filestore(/ram/mnt/ceph/data.osd.0)
 async snap create 'snap_1' transid 0 got (17) File exists
 os/FileStore.cc: In function 'void FileStore::sync_entry()' thread
 7fdcd1ee2700 time 2012-05-24 13:24:06.326792
 os/FileStore.cc: 3564: FAILED assert(0 == async snap ioctl error)
  ceph version 0.47.1-157-gb003815
 (commit:b003815c222add8bdcf645d9ba4ef7e13f34587e)
  1: (FileStore::sync_entry()+0x34f0) [0x68e190]
  2: (FileStore::SyncThread::entry()+0xd) [0x6a2abd]
  3: (Thread::_entry_func(void*)+0x1b9) [0x803499]
  4: (()+0x77f1) [0x7fdcd58637f1]
  5: (clone()+0x6d) [0x7fdcd4cb4ccd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is needed to
 interpret this.
 2012-05-24 13:24:06.327991 7fdcd1ee2700 -1 os/FileStore.cc: In function 'void
 FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792
 os/FileStore.cc: 3564: FAILED assert(0 == async snap ioctl error)

I just pushed a fix for this to master.

BTW, the real change happening with these patches is that --mkfs no longer 
clobbers existing data.  If you want to wipe out an osd and start anew, 
you need to rm -r (and btrfs sub delete snap_* and current), or re-run 
mkfs.btrfs.
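
A rough sketch of that cleanup, assuming the osd data directory is
/data/osd.0 (adjust the path to your own layout):

# btrfs subvolume delete /data/osd.0/current
# for s in /data/osd.0/snap_*; do btrfs subvolume delete $s; done
# rm -rf /data/osd.0/*

...or just re-run mkfs.btrfs on the device instead.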

Thanks for the report!

sage



 
 FWIW, with commit 598dea12411 (filestore: mkfs: only
 create snap_0 if we created current_op_seq) reverted,
 I am able to create a new filesystem using the above
 sequence, and a typical mkcephfs --init-local-daemons osd
 looks like this:
 
 2012-05-24 13:06:25.918663 7fe7ac829780 -1 filestore(/ram/mnt/ceph/data.osd.0)
 could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or
 directory
 2012-05-24 13:06:26.301738 7fe7ac829780 -1 created object store
 /ram/mnt/ceph/data.osd.0 journal /dev/mapper/cs32s01p2 for osd.0 fsid
 f8cc9fa2-a300-45a1-ae6d-e0c0ef418d0f
 creating private key for osd.0 keyring /mnt/ceph/misc.osd.0/keyring.osd.0
 creating /mnt/ceph/misc.osd.0/keyring.osd.0
 
 Thanks -- Jim
 
 
 


Re: mkcephfs regression in current master branch

2012-05-24 Thread Jim Schutt

On 05/24/2012 03:13 PM, Sage Weil wrote:

Hi Jim,

On Thu, 24 May 2012, Jim Schutt wrote:


Hi,

In my testing I make repeated use of the manual mkcephfs
sequence described in the man page:

master# mkdir /tmp/foo
master# mkcephfs -c /etc/ceph/ceph.conf --prepare-monmap -d /tmp/foo
osdnode# mkcephfs --init-local-daemons osd -d /tmp/foo
mdsnode# mkcephfs --init-local-daemons mds -d /tmp/foo
master# mkcephfs --prepare-mon -d /tmp/foo
monnode# mkcephfs --init-local-daemons mon -d /tmp/foo

Using current master branch (commit ca79f45a33f9), the
mkcephfs --init-local-daemons osd phase breaks like this:

2012-05-24 13:24:06.325905 7fdcd61d4780 -1 filestore(/ram/mnt/ceph/data.osd.0)
could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or
directory
2012-05-24 13:24:06.326768 7fdcd1ee2700 -1 filestore(/ram/mnt/ceph/data.osd.0)
async snap create 'snap_1' transid 0 got (17) File exists
os/FileStore.cc: In function 'void FileStore::sync_entry()' thread
7fdcd1ee2700 time 2012-05-24 13:24:06.326792
os/FileStore.cc: 3564: FAILED assert(0 == async snap ioctl error)
  ceph version 0.47.1-157-gb003815
(commit:b003815c222add8bdcf645d9ba4ef7e13f34587e)
  1: (FileStore::sync_entry()+0x34f0) [0x68e190]
  2: (FileStore::SyncThread::entry()+0xd) [0x6a2abd]
  3: (Thread::_entry_func(void*)+0x1b9) [0x803499]
  4: (()+0x77f1) [0x7fdcd58637f1]
  5: (clone()+0x6d) [0x7fdcd4cb4ccd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is needed to
interpret this.
2012-05-24 13:24:06.327991 7fdcd1ee2700 -1 os/FileStore.cc: In function 'void
FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792
os/FileStore.cc: 3564: FAILED assert(0 == async snap ioctl error)


I just pushed a fix for this to master.


Great, that fixed it for me.

Thanks for the quick response!



BTW, the real change happening with these patches is that --mkfs no longer
clobbers existing data.  If you want to wipe out an osd and start anew,
you need to rm -r (and btrfs sub delete snap_* and current), or re-run
mkfs.btrfs.


Ah, OK.  It turns out I always run mkfs.btrfs anyway, but
this is good to know.

-- Jim



Thanks for the report!

sage





FWIW, with commit 598dea12411 (filestore: mkfs: only
create snap_0 if we created current_op_seq) reverted,
I am able to create a new filesystem using the above
sequence, and a typical mkcephfs --init-local-daemons osd
looks like this:

2012-05-24 13:06:25.918663 7fe7ac829780 -1 filestore(/ram/mnt/ceph/data.osd.0)
could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or
directory
2012-05-24 13:06:26.301738 7fe7ac829780 -1 created object store
/ram/mnt/ceph/data.osd.0 journal /dev/mapper/cs32s01p2 for osd.0 fsid
f8cc9fa2-a300-45a1-ae6d-e0c0ef418d0f
creating private key for osd.0 keyring /mnt/ceph/misc.osd.0/keyring.osd.0
creating /mnt/ceph/misc.osd.0/keyring.osd.0

Thanks -- Jim











RBD format changes and layering

2012-05-24 Thread Josh Durgin

RBD object format changes
=========================

To enable us to add more features to rbd, including copy-on-write
cloning via layering, we need to change the rbd header object
format. Since this won't be backwards compatible, the old format will
still be used by default. Once layering is implemented, the old format
will be deprecated, but still usable with an extra option (something
like rbd create --legacy ...). Clients will still be able to read the
old format, and images can be converted by exporting and importing them.

While we're making these changes, we can clean up the way librbd and
the rbd kernel module access the header, so that they don't have to
change each time we change the header format. Instead of reading the
header directly, they can use the OSD class mechanism to interact with
it. librbd already does this for snapshots, but kernel rbd reads the
entire header directly. Making them both use a well-defined api will
make later format additions much simpler. I'll describe the changes
needed in general, and then those that are needed for rbd layering.

New format, pre-layering
------------------------

Right now the header object is named $image_name.rbd, and the data
objects are named rb.$image_id_lowbits.$image_id_highbits.$object_number.
Since we're making other incompatible changes, we have a chance to
rename these to be less likely to collide with other objects. Prefixing
them with a more specific string will help, and will work well with
a new security feature for layering discussed later. The new
names are:

rbd_header.$image_name
rbd_data.$id.$object_number
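
For instance, a hypothetical image vm1 with id 1234 would get objects named
along the lines of (the zero-padding of $object_number is illustrative):

rbd_header.vm1
rbd_data.1234.0000000000000000
rbd_data.1234.0000000000000001
...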

The new header will have the existing (used) fields of the old format as
key/value pairs in an omap (this is the rados interface that stores
key/value pairs in leveldb). Specifically, the existing fields are:

 * object_prefix // previously known as block_name
 * order // bit shift to determine size of the data objects
 * size  // total size of the image in bytes
 * snap_seq  // latest snapshot id used with the image
 * snapshots // list of (snap_name, snap_id, image_size) tuples
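
As a purely hypothetical illustration, the omap of a 10 GB image with 4 MB
data objects and a single snapshot might hold:

 * object_prefix = rbd_data.1234
 * order         = 22            // 2^22 = 4 MB data objects
 * size          = 10737418240   // 10 GB
 * snap_seq      = 1
 * snapshots     = [(snap1, 1, 10737418240)]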

To make adding new things easier, there will be an additional
'features' field, which is a mask of the features used by the image.
Clients will know whether they can use an image by checking that they
support every feature the image uses which the OSD reports as
incompatible (see get_info() below).
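
In client code that check would boil down to something like the following
sketch (the names here are illustrative, not an existing librbd API):

#include <cerrno>
#include <stdint.h>

// features this client implements; the value is just an example
static const uint64_t MY_SUPPORTED_FEATURES = 0x1;

int check_image_features(uint64_t incompat_features)
{
  if (incompat_features & ~MY_SUPPORTED_FEATURES)
    return -ENOSYS;   // image needs a feature this client doesn't understand
  return 0;
}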

RBD class interface
===================

Here's a proposed basic interface - new features will
add more functions and data to existing ones.

/**
 * Initialize the header with basic metadata.
 * Extra features may initialize more fields in the future.
 * Everything is stored as key/value pairs as omaps in the header object.
 *
 * If features the OSD does not understand are requested, -ENOSYS is
 * returned.
 */
create(__le64 size, __le32 order, __le64 features)

/**
 * Get the metadata about the image required to do I/O
 * to it. In the future this may include extra information for
 * features that require it, like encryption/compression type.
 * This extra data will be added at the end of the response, so
 * clients that don't support it don't interpret it.
 *
 * Features that would require clients to be updated to access
 * the image correctly (such as image bitmaps) are set in
 * the incompat_features field. A client that doesn't understand
 * those features will return an error when they try to open
 * the image.
 *
 * The size and any extra information is read from the appropriate
 * snapshot metadata, if snapid is not CEPH_NOSNAP.
 *
 * Returns __le64 size, __le64 order, __le64 features,
 * __le64 incompat_features, __le64 snapseq and
 * list of __le64 snapids
 */
get_info(__le64 snapid)

/**
 * Used when resizing the image. Sets the size in bytes.
 */
set_size(__le64 size)

/**
 * The same as the existing snap_add/snap_remove methods, but using the
 * new format.
 */
snapshot_add(string snap_name, __le64 snap_id)
snapshot_remove(string snap_name)

/**
 * list snapshots - like the existing snap_list, but
 * can return a subset of them.
 *
 * Returns __le64 snap_seq, __le64 snap_count, and a list of tuples
 * (snap_id, snap_size) just like the current snap_list
 */
snapshot_list(__le64 max_len)

/**
 * The same as the existing method. Should only be called
 * on the rbd_info object.
 * Returns an id number to use for a new image.
 */
assign_bid()
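
Since these are ordinary OSD class methods, librbd (and eventually the
kernel client) would reach them through the usual class-call path. A rough
librados sketch, using the class/method names proposed above (the encoding
of the snapid is simplified and assumes a little-endian host):

#include <rados/librados.hpp>
#include <stdint.h>
#include <string>

// Sketch: fetch image metadata via the proposed get_info method.
// A real client would use ceph's encode/decode helpers for the input
// and would decode size/order/features etc. from 'out'.
int get_image_info(librados::IoCtx& ioctx, const std::string& image,
                   uint64_t snapid, librados::bufferlist& out)
{
  librados::bufferlist in;
  in.append(reinterpret_cast<const char*>(&snapid), sizeof(snapid));
  return ioctx.exec("rbd_header." + image, "rbd", "get_info", in, out);
}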


RBD layering
------------

The first step is to implement trivial layering, i.e.
layering without bitmaps, as described at:

http://marc.info/?l=ceph-devel&m=129867273303846&w=2
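
The core idea, as a rough sketch only (not the actual implementation):
reads that find no object in the child image fall back to the parent, and
writes copy the affected object up into the child first.

#include <cerrno>
#include <stddef.h>
#include <stdint.h>
#include <string>

// Hedged sketch of the read path under trivial layering (no bitmap).
typedef int (*read_fn)(const std::string& oid, char *buf,
                       size_t len, uint64_t off);

int layered_read(read_fn read_child, read_fn read_parent,
                 const std::string& oid, char *buf, size_t len, uint64_t off)
{
  int r = read_child(oid, buf, len, off);
  if (r == -ENOENT)                 // object never written in the child
    r = read_parent(oid, buf, len, off);
  return r;
}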

There are a couple of things that complicate the implementation:

1) making sure parent images are not deleted when children still
   refer to them

A simple way to solve this is to add a reference count to the parent
image. This can cause issues with partially deleted images, if the
reference count is decremented more than once because the child
image's header was only deleted the 

Re: RBD format changes and layering

2012-05-24 Thread Yehuda Sadeh
On Thu, May 24, 2012 at 4:05 PM, Josh Durgin josh.dur...@inktank.com wrote:
 RBD object format changes
 =

 To enable us to add more features to rbd, including copy-on-write
 cloning via layering, we need to change the rbd header object
 format. Since this won't be backwards compatible, the old format will
 still be used by default. Once layering is implemented, the old format
 will be deprecated, but still usable with an extra option (something
 like rbd create --legacy ...). Clients will still be able to read the
 old format, and images can be converted by exporting and importing them.

 While we're making these changes, we can clean up the way librbd and
 the rbd kernel module access the header, so that they don't have to
 change each time we change the header format. Instead of reading the
 header directly, they can use the OSD class mechanism to interact with
 it. librbd already does this for snapshots, but kernel rbd reads the
 entire header directly. Making them both use a well-defined api will
 make later format additions much simpler. I'll describe the changes
 needed in general, and then those that are needed for rbd layering.

 New format, pre-layering
 

 Right now the header object is named $image_name.rbd, and the data
 objects are named rb.$image_id_lowbits.$image_id_highbits.$object_number.
 Since we're making other incompatible changes, we have a chance to
 rename these to be less likely to collide with other objects. Prefixing
 them with a more specific string will help, and will work well with
 a new security feature for layering discussed later. The new
 names are:

 rbd_header.$image_name
 rbd_data.$id.$object_number

 The new header will have the existing (used) fields of the old format as
 key/value pairs in an omap (this is the rados interface that stores
 key/value pairs in leveldb). Specifically, the existing fields are:

  * object_prefix // previously known as block_name
  * order         // bit shift to determine size of the data objects
  * size          // total size of the image in bytes
  * snap_seq      // latest snapshot id used with the image
  * snapshots     // list of (snap_name, snap_id, image_size) tuples

 To make adding new things easier, there will be an additional
 'features' field, which is a mask of the features used by the image.
 Clients will know whether they can use an image by checking if they
 support all the features the image uses that the osd reports as being
 incompatible (see get_info() below).

 RBD class interface
 ===

 Here's a proposed basic interface - new features will
 add more functions and data to existing ones.

 /**
  * Initialize the header with basic metadata.
  * Extra features may initialize more fields in the future.
  * Everything is stored as key/value pairs as omaps in the header object.
  *
  * If features the OSD does not understand are requested, -ENOSYS is
  * returned.
  */
 create(__le64 size, __le32 order, __le64 features)

 /**
  * Get the metadata about the image required to do I/O
  * to it. In the future this may include extra information for
  * features that require it, like encryption/compression type.
  * This extra data will be added at the end of the response, so
  * clients that don't support it don't interpret it.
  *
  * Features that would require clients to be updated to access
  * the image correctly (such as image bitmaps) are set in
  * the incompat_features field. A client that doesn't understand
  * those features will return an error when they try to open
  * the image.
  *
  * The size and any extra information is read from the appropriate
  * snapshot metadata, if snapid is not CEPH_NOSNAP.
  *
  * Returns __le64 size, __le64 order, __le64 features,
  *         __le64 incompat_features, __le64 snapseq and
  *         list of __le64 snapids
  */
 get_info(__le64 snapid)

 /**
  * Used when resizing the image. Sets the size in bytes.
  */
 set_size(__le64 size)

 /**
  * The same as the existing snap_add/snap_remove methods, but using the
  * new format.
  */
 snapshot_add(string snap_name, __le64 snap_id)
 snapshot_remove(string snap_name)

 /**
  * list snapshots - like the existing snap_list, but
  * can return a subset of them.
  *
  * Returns __le64 snap_seq, __le64 snap_count, and a list of tuples
  * (snap_id, snap_size) just like the current snap_list
  */
 snapshot_list(__le64 max_len)

 /**
  * The same as the existing method. Should only be called
  * on the rbd_info object.
  * Returns an id number to use for a new image.
  */
 assign_bid()


 RBD layering
 

 The first step is to implement trivial layering, i.e.
 layering without bitmaps, as described at:

 http://marc.info/?l=ceph-devel&m=129867273303846&w=2

 There are a couple of things that complicate the implementation:

 1) making sure parent images are not deleted when children still
   refer to them

 A simple way to solve this is to add a reference count to 

Re: poor OSD performance using kernel 3.4

2012-05-24 Thread Mark Nelson

On 05/24/2012 02:05 PM, Stefan Priebe wrote:

On 24.05.2012 20:53, Mark Nelson wrote:

Hi Stefan,

Thanks for the info! I've been testing on 3.4 for the last couple of
days but haven't run into that problem here. It looks like your journal
has writes going to it quickly and then things stall as it tries to
write out to your data disk.
That's a good point. Right now while testing I'm using a tmpfs ramdisk 
for the journal and have set journal dio = false in ceph.conf (roughly as 
sketched below). Might this be the difference / problem?
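
Something along these lines (the journal path here is only an example):

[osd]
        osd journal = /dev/shm/journal.osd.$id   ; tmpfs-backed journal
        osd journal size = 512                   ; MB
        journal dio = false                      ; O_DIRECT does not work on tmpfs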


3.2.18 works fine too.


Honestly I don't know if tmpfs journal with dio = false would lead to 
that kind of behavior.  Anything interesting in the logs if you turn 
debugging up?




 I wonder if any of the data actually makes
it to the disk... Can you run iostat or collectl or something and see
what kind of write throughput you get to the OSD data disks?

None... so it seems the data never gets transferred from the journal to disk.


This might be a stupid question, but writes to those partitions work 
outside of Ceph with the new kernel right?
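
For example, something quick like this (the path is a placeholder on the
OSD data partition) would confirm raw write throughput looks sane:

  dd if=/dev/zero of=/srv/osd.0/ddtest bs=4M count=256 oflag=direct
  dd if=/dev/zero of=/srv/osd.0/ddtest bs=4M count=256 conv=fdatasync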




Stefan


Thanks,
Mark