Re: Poor read performance in KVM

2012-07-16 Thread Josh Durgin

On 07/15/2012 06:13 AM, Vladimir Bashkirtsev wrote:

Hello,

Lately I have been trying to get KVM to perform well on RBD, but good
read performance still appears elusive.

[root@alpha etc]# rados -p rbd bench 120 seq -t 8

Total time run:        16.873277
Total reads made:      302
Read size:             4194304
Bandwidth (MB/sec):    71.592

Average Latency:       0.437984
Max latency:           3.26817
Min latency:           0.015786

Fairly good performance. But when I run in KVM:

[root@mail ~]# hdparm -tT /dev/vda

/dev/vda:
  Timing cached reads:   8808 MB in  2.00 seconds = 4411.49 MB/sec


This is just the guest page cache - it's reading the first two
megabytes of the device repeatedly.


  Timing buffered disk reads:  10 MB in  6.21 seconds =   1.61 MB/sec


This is a sequential read, so readahead in the guest should help here.
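
A more representative in-guest number comes from a sequential read that
bypasses the guest page cache entirely, e.g. with direct I/O (a minimal
sketch; block size and count are arbitrary):

dd if=/dev/vda of=/dev/null bs=4M count=256 iflag=direct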


Not even close to what rados bench shows! I have even seen performance
as low as 900 KB/sec. Such slow reads are, of course, affecting the guests.

Any ideas on where to start looking for a performance boost?


Do you have rbd caching enabled? It would also be interesting to see
how the guest reads are translating to rados reads. hdparm is doing
2MiB sequential reads of the block device. If you add
admin_socket=/var/run/ceph/kvm.asok to the rbd device on the qemu
command line, you can see the number of requests, latency, and
request size info while the guest is running via:

ceph --admin-daemon /var/run/ceph/kvm.asok perf dump
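
For reference, appending the option to the rbd drive spec might look
roughly like this (a hedged sketch -- the pool/image name and the cache
flag are placeholders; extra librbd options are passed as colon-separated
key=value pairs):

-drive format=rbd,file=rbd:rbd/vmdisk:rbd_cache=true:admin_socket=/var/run/ceph/kvm.asok,if=virtio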

Josh


Re: How to compile Java-Rados.

2012-07-16 Thread Noah Watkins
Ramu,

I receive the same error code in my installation when the
CEPH_CONF_FILE environment variable contains an invalid path. Could
you please verify that the path you are using points to a valid Ceph
configuration?

Thanks,
- Noah

On Sun, Jul 15, 2012 at 10:32 PM, ramu ramu.freesyst...@gmail.com wrote:

 Hi Noah,

 I tried but am getting following error in terminal ,

 Buildfile: /home/vu/java-rados/build.xml

 makedir:

 compile-rados:

 compile-tests:

 jar:

 test:
 [junit] Running ClusterStatsTest
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.04 sec

 BUILD FAILED
 /home/vu/java-rados/build.xml:134: Test ClusterStatsTest failed

 Total time: 1 second

 and one more error file
 is /home/vu/java-rados/TEST-ClusterStatsTest.txt in that file the text is like
 this,

 Testsuite: ClusterStatsTest
 Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.04 sec

 Testcase: test_ClusterStats took 0.022 sec
 Caused an ERROR
 conf_read_file: ret=-22
 net.newdream.ceph.rados.RadosException: conf_read_file: ret=-22
 at net.newdream.ceph.rados.Cluster.native_conf_read_file(Native 
 Method)
 at net.newdream.ceph.rados.Cluster.readConfigFile(Unknown Source)
 at ClusterStatsTest.setup(Unknown Source)

 Thanks,
 Ramu.





Re: How to compile Java-Rados.

2012-07-16 Thread ramu
Hi Noah,

I printed CONF_FILE, the file, and the states (these are in
ClusterStatsTest.java); the output is as follows:

test:
[junit] Running ClusterStatsTest
[junit] conffile---/etc/ceph/ceph.conf
[junit] state--CONFIGURING
[junit] file---/etc/ceph/ceph.conf
[junit] state--CONFIGURING
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.123 sec

BUILD FAILED
/home/ramu/java-rados/build.xml:134: Test ClusterStatsTest failed






kernel hanged when try to remove a rbd device

2012-07-16 Thread Guan Jun He
Hi,

The kernel hangs when trying to remove an rbd device. Detailed steps:

Create an rbd image and map it to a client;
then stop the ceph cluster with '/etc/init.d/ceph -a stop';

then, on the client side, run 'echo <id> > /sys/bus/rbd/remove'. This
command does not return. Checking dmesg, it seems to enter an endless
loop, trying to reconnect to the osds and mons;

Then press Ctrl+C to send an INT signal to
'echo <id> > /sys/bus/rbd/remove'; at that point the kernel hangs.
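
In consolidated form (a sketch of the steps above; the image name and
device id are placeholders):

rbd create test --size 1024
rbd map test                    # shows up as /dev/rbd0, id 0
/etc/init.d/ceph -a stop        # stop the whole cluster
echo 0 > /sys/bus/rbd/remove    # never returns, keeps retrying osds/mons
# pressing Ctrl+C here hangs the kernel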

Can I use rados in this way?


With the following patch the kernel no longer hangs, but the patch is
not good either: there are transactions that have not finished, and if
we just delete them the data may become inconsistent.

However, there seems to be no way to stop such a transaction safely, i.e.
to cancel the transaction (avoiding data inconsistency) and tell its
caller that the transaction has failed and been canceled.
(If anyone knows of a way, or several ways, please tell me - thanks.)
Also, if there are plans to do this, I would be very glad to join in and
do some of the work.

Or are there any other plans to resolve this?

Thanks a lot for your reply!


Signed-off-by: Guanjun He hegua...@gmail.com
---
 net/ceph/osd_client.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 1ffebed..4dba062 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -688,11 +688,20 @@ static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
 
 static void remove_all_osds(struct ceph_osd_client *osdc)
 {
+        struct list_head *pos, *q;
+        struct ceph_osd_request *req;
+
         dout("__remove_old_osds %p\n", osdc);
         mutex_lock(&osdc->request_mutex);
         while (!RB_EMPTY_ROOT(&osdc->osds)) {
                 struct ceph_osd *osd = rb_entry(rb_first(&osdc->osds),
                                                 struct ceph_osd, o_node);
+                list_for_each_safe(pos, q, &osd->o_requests) {
+                        req = list_entry(pos, struct ceph_osd_request,
+                                         r_osd_item);
+                        list_del(pos);
+                        __unregister_request(osdc, req);
+                        kfree(req);
+                }
                 __remove_osd(osdc, osd);
         }
         mutex_unlock(&osdc->request_mutex);

-- 
best,
Guanjun g...@suse.com



Re: [PATCH] libceph: trivial fix for the incorrect debug output

2012-07-16 Thread Alex Elder
On 07/15/2012 01:45 AM, Jiaju Zhang wrote:
 This is a trivial fix for the debug output, as it is inconsistent
 with the function name so may confuse people when debugging.
 
 Signed-off-by: Jiaju Zhang jjzh...@suse.de

I have been converting these to use __func__ whenever I touch
code nearby.  Mind if I do that here as well?

Reviewed-by: Alex Elder el...@inktank.com

 ---
  net/ceph/osd_client.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index 1ffebed..ad6d745 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -688,7 +688,7 @@ static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
 
  static void remove_all_osds(struct ceph_osd_client *osdc)
  {
 -        dout("__remove_old_osds %p\n", osdc);
 +        dout("__remove_all_osds %p\n", osdc);

dout("%s %p\n", __func__, osdc);

          mutex_lock(&osdc->request_mutex);
          while (!RB_EMPTY_ROOT(&osdc->osds)) {
                  struct ceph_osd *osd = rb_entry(rb_first(&osdc->osds),


Re: [PATCH] libceph: trivial fix for the incorrect debug output

2012-07-16 Thread Jiaju Zhang
On Mon, 2012-07-16 at 07:55 -0500, Alex Elder wrote:
 On 07/15/2012 01:45 AM, Jiaju Zhang wrote:
  This is a trivial fix for the debug output, as it is inconsistent
  with the function name so may confuse people when debugging.
  
  Signed-off-by: Jiaju Zhang jjzh...@suse.de
 
 I have been converting these to use __func__ whenever I touch
 code nearby.  Mind if I do that here as well?
 
 Reviewed-by: Alex Elder el...@inktank.com

Oh, please do;) Using __func__ would be good.
Thanks for the review.

Thanks,
Jiaju

 
  ---
   net/ceph/osd_client.c |2 +-
   1 files changed, 1 insertions(+), 1 deletions(-)
  
  diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
  index 1ffebed..ad6d745 100644
  --- a/net/ceph/osd_client.c
  +++ b/net/ceph/osd_client.c
  @@ -688,7 +688,7 @@ static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
  
   static void remove_all_osds(struct ceph_osd_client *osdc)
   {
  -        dout("__remove_old_osds %p\n", osdc);
  +        dout("__remove_all_osds %p\n", osdc);
 
 dout("%s %p\n", __func__, osdc);
 
           mutex_lock(&osdc->request_mutex);
           while (!RB_EMPTY_ROOT(&osdc->osds)) {
                   struct ceph_osd *osd = rb_entry(rb_first(&osdc->osds),


Re: How to compile Java-Rados.

2012-07-16 Thread Sage Weil
On Mon, 16 Jul 2012, Noah Watkins wrote:
 Ramu,
 
 I receive the same error code in my installation when the
 CEPH_CONF_FILE environment variable contains an invalid path. Could
 you please verify that the path you are using points to a valid Ceph
 configuration?

BTW it's CEPH_CONF for the config file (not CEPH_CONF_FILE).

Alternatively, you can stick config options in the same format as the 
command line arguments in CEPH_ARGS.  E.g.,

CEPH_ARGS="--debug-ms 1 --log-file foo" some_command ...
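
For instance, pointing the java-rados tests at a config file could look
like this (a hedged example; the path is a placeholder):

CEPH_CONF=/etc/ceph/ceph.conf ant test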

sage


 
 Thanks,
 - Noah
 
 On Sun, Jul 15, 2012 at 10:32 PM, ramu ramu.freesyst...@gmail.com wrote:
 
  Hi Noah,
 
  I tried but am getting following error in terminal ,
 
  Buildfile: /home/vu/java-rados/build.xml
 
  makedir:
 
  compile-rados:
 
  compile-tests:
 
  jar:
 
  test:
  [junit] Running ClusterStatsTest
  [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.04 sec
 
  BUILD FAILED
  /home/vu/java-rados/build.xml:134: Test ClusterStatsTest failed
 
  Total time: 1 second
 
  and one more error file
  is /home/vu/java-rados/TEST-ClusterStatsTest.txt in that file the text is 
  like
  this,
 
  Testsuite: ClusterStatsTest
  Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.04 sec
 
  Testcase: test_ClusterStats took 0.022 sec
  Caused an ERROR
  conf_read_file: ret=-22
  net.newdream.ceph.rados.RadosException: conf_read_file: ret=-22
  at net.newdream.ceph.rados.Cluster.native_conf_read_file(Native 
  Method)
  at net.newdream.ceph.rados.Cluster.readConfigFile(Unknown Source)
  at ClusterStatsTest.setup(Unknown Source)
 
  Thanks,
  Ramu.
 
 
 


obsync crashes when src file is unreadable

2012-07-16 Thread Adrian Fita
Hello. I'm trying to synchronize 2 Amazon S3 buckets, but obsync keeps
crashing when it tries to read a certain file. Here is the message
displayed when run with --verbose --more-verbose:

Mon, 16 Jul 2012 15:24:48 GMT
/src-bucket/1356/file
Traceback (most recent call last):
  File "/usr/bin/obsync", line 1165, in <module>
    sobj = src_all_objects.next()
  File "/usr/bin/obsync", line 636, in next
    k = self.bucket.get_key(key.name)
  File "/usr/lib/python2.7/dist-packages/boto/s3/bucket.py", line 195, in get_key
    response.status, response.reason, '')
S3ResponseError: S3ResponseError: 403 Forbidden

ERROR TYPE: unknown, ORIGIN: source

I checked with other S3 clients the state of the file and indeed, it
can not be read and I can't change its ACLs. From what I gather, this
is one of those very rare cases of unreadable Amazon S3 files due to
the failure of Amazon's underlying storage (I think the file will be
automatically recovered at some point later by S3).

It would be nice if obsync could skip past this error, continue with
the other files, and just display a warning that the file couldn't be
read. Is there something I can do right now to make obsync continue
copying (maybe a quick hack in the code or something)?

This is happening on Ubuntu 12.10 2012-07-05, obsync version 0.47.2-0ubuntu2.

Thanks.
--
Fita Adrian


Re: How to compile Java-Rados.

2012-07-16 Thread Noah Watkins
On Mon, Jul 16, 2012 at 8:26 AM, Sage Weil s...@inktank.com wrote:
 On Mon, 16 Jul 2012, Noah Watkins wrote:
 Ramu,

 I receive the same error code in my installation when the
 CEPH_CONF_FILE environment variable contains an invalid path. Could
 you please verify that the path you are using points to a valid Ceph
 configuration?

 BTW it's CEPH_CONF for the config file (not CEPH_CONFG_FILE).

It turns out to be quite awkward to tell the unit tests about the
configuration location, and using an environment variable is
convenient. However, I wasn't aware of CEPH_CONF, and in fact the unit
test framework looks for CEPH_CONFIG_FILE. The latter should definitely
be removed (Ramu, can you try CEPH_CONF?), and this isn't an issue in the
libcephfs wrappers -- the rados wrappers are incredibly old.

Thanks, Noah



 Alternatively, you can stick config options in the same format as the
 command line arguments in CEPH_ARGS.  E.g.,

 CEPH_ARGS=--debug-ms 1 --log-file foo some_command ...

 sage



 Thanks,
 - Noah

 On Sun, Jul 15, 2012 at 10:32 PM, ramu ramu.freesyst...@gmail.com wrote:
 
  Hi Noah,
 
  I tried but am getting following error in terminal ,
 
  Buildfile: /home/vu/java-rados/build.xml
 
  makedir:
 
  compile-rados:
 
  compile-tests:
 
  jar:
 
  test:
  [junit] Running ClusterStatsTest
  [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.04 sec
 
  BUILD FAILED
  /home/vu/java-rados/build.xml:134: Test ClusterStatsTest failed
 
  Total time: 1 second
 
  and one more error file
  is /home/vu/java-rados/TEST-ClusterStatsTest.txt in that file the text is 
  like
  this,
 
  Testsuite: ClusterStatsTest
  Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.04 sec
 
  Testcase: test_ClusterStats took 0.022 sec
  Caused an ERROR
  conf_read_file: ret=-22
  net.newdream.ceph.rados.RadosException: conf_read_file: ret=-22
  at net.newdream.ceph.rados.Cluster.native_conf_read_file(Native 
  Method)
  at net.newdream.ceph.rados.Cluster.readConfigFile(Unknown Source)
  at ClusterStatsTest.setup(Unknown Source)
 
  Thanks,
  Ramu.
 
 
 


Re: How to compile Java-Rados.

2012-07-16 Thread Noah Watkins
Disregard that last message. Ok, I'm not sure where else EINVAL gets
returned in the configuration path, but I can look into it this
evening. I tested the wrappers on a clean install last night and they
seem to be working for me. Can you turn on debug logging with
CEPH_ARGS (as per Sage's last email)?
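
One way to do that when running the tests (a hedged example; the debug
levels and log path are arbitrary):

CEPH_ARGS="--debug-ms 1 --debug-rados 20 --log-file /tmp/java-rados.log" ant test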

On Mon, Jul 16, 2012 at 12:23 AM, ramu ramu.freesyst...@gmail.com wrote:
 Hi Noah,

 I printed the CONF_FILE and file and states,the printed as following,these
 are in ClusterStatsTest.java.

 test:
 [junit] Running ClusterStatsTest
 [junit] conffile---/etc/ceph/ceph.conf
 [junit] state--CONFIGURING
 [junit] file---/etc/ceph/ceph.conf
 [junit] state--CONFIGURING
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.123 sec

 BUILD FAILED
 /home/ramu/java-rados/build.xml:134: Test ClusterStatsTest failed






Re: ceph status reporting non-existing osd

2012-07-16 Thread Gregory Farnum
On Saturday, July 14, 2012 at 7:20 AM, Andrey Korolyov wrote:
 On Fri, Jul 13, 2012 at 9:09 PM, Sage Weil s...@inktank.com 
 (mailto:s...@inktank.com) wrote:
  On Fri, 13 Jul 2012, Gregory Farnum wrote:
   On Fri, Jul 13, 2012 at 1:17 AM, Andrey Korolyov and...@xdel.ru 
   (mailto:and...@xdel.ru) wrote:
Hi,
 
Recently I`ve reduced my test suite from 6 to 4 osds at ~60% usage on
six-node,
and I have removed a bunch of rbd objects during recovery to avoid
overfill.
Right now I`m constantly receiving a warn about nearfull state on
non-existing osd:
 
health HEALTH_WARN 1 near full osd(s)
monmap e3: 3 mons at
{0=192.168.10.129:6789/0,1=192.168.10.128:6789/0,2=192.168.10.127:6789/0},
election epoch 240, quorum 0,1,2 0,1,2
osdmap e2098: 4 osds: 4 up, 4 in
pgmap v518696: 464 pgs: 464 active+clean; 61070 MB data, 181 GB
used, 143 GB / 324 GB avail
mdsmap e181: 1/1/1 up {0=a=up:active}
 
HEALTH_WARN 1 near full osd(s)
osd.4 is near full at 89%
 
Needless to say, osd.4 remains only in ceph.conf, but not at crushmap.
Reducing has been done 'on-line', e.g. without restart entire cluster.



   Whoops! It looks like Sage has written some patches to fix this, but
   for now you should be good if you just update your ratios to a larger
   number, and then bring them back down again. :)
   
   
   
  Restarting ceph-mon should also do the trick.
   
  Thanks for the bug report!
  sage
  
  
  
 Should I restart mons simultaneously?
I don't think restarting will actually do the trick for you — you actually will 
need to set the ratios again.
  
 Restarting one by one has no
 effect, same as filling up data pool up to ~95 percent(btw, when I
 deleted this 50Gb file on cephfs, mds was stuck permanently and usage
 remained same until I dropped and recreated data pool - hope it`s one
 of known posix layer bugs). I also deleted entry from config, and then
 restarted mons, with no effect. Any suggestions?

I'm not sure what you're asking about here?  
-Greg



Re: ceph status reporting non-existing osd

2012-07-16 Thread Gregory Farnum
ceph pg set_full_ratio 0.95  
ceph pg set_nearfull_ratio 0.94


On Monday, July 16, 2012 at 11:42 AM, Andrey Korolyov wrote:

 On Mon, Jul 16, 2012 at 8:12 PM, Gregory Farnum g...@inktank.com 
 (mailto:g...@inktank.com) wrote:
  On Saturday, July 14, 2012 at 7:20 AM, Andrey Korolyov wrote:
   On Fri, Jul 13, 2012 at 9:09 PM, Sage Weil s...@inktank.com 
   (mailto:s...@inktank.com) wrote:
On Fri, 13 Jul 2012, Gregory Farnum wrote:
 On Fri, Jul 13, 2012 at 1:17 AM, Andrey Korolyov and...@xdel.ru 
 (mailto:and...@xdel.ru) wrote:
  Hi,
   
  Recently I`ve reduced my test suite from 6 to 4 osds at ~60% usage 
  on
  six-node,
  and I have removed a bunch of rbd objects during recovery to avoid
  overfill.
  Right now I`m constantly receiving a warn about nearfull state on
  non-existing osd:
   
  health HEALTH_WARN 1 near full osd(s)
  monmap e3: 3 mons at
  {0=192.168.10.129:6789/0,1=192.168.10.128:6789/0,2=192.168.10.127:6789/0},
  election epoch 240, quorum 0,1,2 0,1,2
  osdmap e2098: 4 osds: 4 up, 4 in
  pgmap v518696: 464 pgs: 464 active+clean; 61070 MB data, 181 GB
  used, 143 GB / 324 GB avail
  mdsmap e181: 1/1/1 up {0=a=up:active}
   
  HEALTH_WARN 1 near full osd(s)
  osd.4 is near full at 89%
   
  Needless to say, osd.4 remains only in ceph.conf, but not at 
  crushmap.
  Reducing has been done 'on-line', e.g. without restart entire 
  cluster.
  
  
  
  
  
 Whoops! It looks like Sage has written some patches to fix this, but
 for now you should be good if you just update your ratios to a larger
 number, and then bring them back down again. :)
 
 
 
 
 
Restarting ceph-mon should also do the trick.
 
Thanks for the bug report!
sage





   Should I restart mons simultaneously?
  I don't think restarting will actually do the trick for you — you actually 
  will need to set the ratios again.
   
   Restarting one by one has no
   effect, same as filling up data pool up to ~95 percent(btw, when I
   deleted this 50Gb file on cephfs, mds was stuck permanently and usage
   remained same until I dropped and recreated data pool - hope it`s one
   of known posix layer bugs). I also deleted entry from config, and then
   restarted mons, with no effect. Any suggestions?
   
   
   
  I'm not sure what you're asking about here?
  -Greg
  
  
  
 Oh, sorry, I have mislooked and thought that you suggested filling up
 osds. How do I can set full/nearfull ratios correctly?
  
 $ceph injectargs '--mon_osd_full_ratio 96'
 parsed options
 $ ceph injectargs '--mon_osd_near_full_ratio 94'
 parsed options
  
 ceph pg dump | grep 'full'
 full_ratio 0.95
 nearfull_ratio 0.85
  
 Setting parameters in the ceph.conf and then restarting mons does not
 affect ratios either.





Re: ceph status reporting non-existing osd

2012-07-16 Thread Andrey Korolyov
On Mon, Jul 16, 2012 at 10:48 PM, Gregory Farnum g...@inktank.com wrote:
 ceph pg set_full_ratio 0.95
 ceph pg set_nearfull_ratio 0.94


 On Monday, July 16, 2012 at 11:42 AM, Andrey Korolyov wrote:

 On Mon, Jul 16, 2012 at 8:12 PM, Gregory Farnum g...@inktank.com 
 (mailto:g...@inktank.com) wrote:
  On Saturday, July 14, 2012 at 7:20 AM, Andrey Korolyov wrote:
   On Fri, Jul 13, 2012 at 9:09 PM, Sage Weil s...@inktank.com 
   (mailto:s...@inktank.com) wrote:
On Fri, 13 Jul 2012, Gregory Farnum wrote:
 On Fri, Jul 13, 2012 at 1:17 AM, Andrey Korolyov and...@xdel.ru 
 (mailto:and...@xdel.ru) wrote:
  Hi,
 
  Recently I`ve reduced my test suite from 6 to 4 osds at ~60% usage 
  on
  six-node,
  and I have removed a bunch of rbd objects during recovery to avoid
  overfill.
  Right now I`m constantly receiving a warn about nearfull state on
  non-existing osd:
 
  health HEALTH_WARN 1 near full osd(s)
  monmap e3: 3 mons at
  {0=192.168.10.129:6789/0,1=192.168.10.128:6789/0,2=192.168.10.127:6789/0},
  election epoch 240, quorum 0,1,2 0,1,2
  osdmap e2098: 4 osds: 4 up, 4 in
  pgmap v518696: 464 pgs: 464 active+clean; 61070 MB data, 181 GB
  used, 143 GB / 324 GB avail
  mdsmap e181: 1/1/1 up {0=a=up:active}
 
  HEALTH_WARN 1 near full osd(s)
  osd.4 is near full at 89%
 
  Needless to say, osd.4 remains only in ceph.conf, but not at 
  crushmap.
  Reducing has been done 'on-line', e.g. without restart entire 
  cluster.





 Whoops! It looks like Sage has written some patches to fix this, but
 for now you should be good if you just update your ratios to a larger
 number, and then bring them back down again. :)
   
   
   
   
   
Restarting ceph-mon should also do the trick.
   
Thanks for the bug report!
sage
  
  
  
  
  
   Should I restart mons simultaneously?
  I don't think restarting will actually do the trick for you — you actually 
  will need to set the ratios again.
 
   Restarting one by one has no
   effect, same as filling up data pool up to ~95 percent(btw, when I
   deleted this 50Gb file on cephfs, mds was stuck permanently and usage
   remained same until I dropped and recreated data pool - hope it`s one
   of known posix layer bugs). I also deleted entry from config, and then
   restarted mons, with no effect. Any suggestions?
 
 
 
  I'm not sure what you're asking about here?
  -Greg



 Oh, sorry, I have mislooked and thought that you suggested filling up
 osds. How do I can set full/nearfull ratios correctly?

 $ceph injectargs '--mon_osd_full_ratio 96'
 parsed options
 $ ceph injectargs '--mon_osd_near_full_ratio 94'
 parsed options

 ceph pg dump | grep 'full'
 full_ratio 0.95
 nearfull_ratio 0.85

 Setting parameters in the ceph.conf and then restarting mons does not
 affect ratios either.




Thanks, it worked, but setting the values back makes the warning return.


can rbd unmap detect if device is mounted?

2012-07-16 Thread Travis Rhoden
Hi folks,

I've made this mistake a couple of times now (completely my fault,
when will I learn?), and am wondering if a bit of protection can be
put in place against user errors.

I mapped a device (rbd map), then formatted and mounted the device
(mkfs.ext4 /dev/rbd0..., mount /dev/rbd0...).  Then sometime later,
I want to remove the RBD device.  Stupidly, I do the rbd unmap
command before I unmount the device.  The kernel doesn't really care
for this.  Or more accurately, I can't remap that same RBD because I
run into:

kernel: [2248653.941688] sysfs: cannot create duplicate filename
'/devices/virtual/block/rbd0'

kernel: [2248653.941833] kobject_add_internal failed for rbd0 with
-EEXIST, don't try to register things with the same name in the same
directory.

At this point, the rbd map command hangs indefinitely (producing the
logs from above).  Ctrl-C does exit out, though.  But if I try to fix
my mistake by doing the unmount now, I get the error:

umount: device is busy.

So really I get stuck.  I can't unmount without the device, and I
can't remap the device to the old block device.  I have to reboot to
clean up and move on.

I imagine other bad things can happen when the block device goes away
out from under the mount point.  Is there any way the rbd unmap command
can detect when the device is in use or mounted and inform the user?

Thanks,

 - Travis


Re: can rbd unmap detect if device is mounted?

2012-07-16 Thread Josh Durgin

On 07/16/2012 12:59 PM, Travis Rhoden wrote:

Hi folks,

I've made this mistake a couple of times now (completely my fault,
when will I learn?), and am wondering if a bit of protection can be
put in place against user errors.


Yeah, we've been working on advisory locking. The first step is
just adding an option to lock via the rbd command line tool, so you
could script lock/map and unmap/unlock.

This is described a little more in this thread:

http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/7094
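
Once the lock subcommands are available, that pairing could be scripted
roughly like this (a hedged sketch -- image name, lock id, and locker are
placeholders, and the exact CLI syntax may differ):

rbd lock add mypool/myimage mylockid && rbd map mypool/myimage
# ... use /dev/rbd0 ...
rbd unmap /dev/rbd0 && rbd lock remove mypool/myimage mylockid client.4123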


I mapped a device rbd map, then formatted and and mounted the device
(mkfs.extf /dev/rbd0..., mount /dev/rbd0...).  Then sometime later,
I want to remove the RBD device.  Stupidly, I do the rbd unmap
command before I unmount the device.  The kernel doesn't really care
for this.  Or more accurately, I can't remap that same RBD because I
run into:

kernel: [2248653.941688] sysfs: cannot create duplicate filename
'/devices/virtual/block/rbd0'

kernel: [2248653.941833] kobject_add_internal failed for rbd0 with
-EEXIST, don't try to register things with the same name in the same
directory.

At this point, the rbd map command hangs indefinitely (producing the
logs from above).  Ctrl-C does exit out, though.  But if I try to fix
my mistake by doing the unmount now, I get the error:

umount: device is busy.

So really I get stuck.  I can't unmount without the device, and I
can't remap the device to the old block device.  I have to reboot to
clean up and move on.


A similar issue was fixed in 3.4 (see 
http://tracker.newdream.net/issues/1907).

What kernel are you using? 3.2 had a nasty possibility of preventing
further operations if mapping hung while trying to connect to the
monitors.


I imagine other bad things can happen with the block device goes away
out from under the mount point.  Any way the rbd unmap command can
detect when the device is in use or mounted and inform the user?


Before actually unmapping the device, rbd unmap could check whether it
is present in mtab. If the device is being used as a raw block device
and not mounted, or you created and used your own device node, this
wouldn't help, but it would be better than nothing.
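
In the meantime, a wrapper script can catch the mounted case (a minimal
sketch; it assumes the device path, e.g. /dev/rbd0, is passed as $1):

#!/bin/sh
dev="$1"
if grep -q "^$dev " /proc/mounts; then
        echo "$dev appears to be mounted; unmount it first" >&2
        exit 1
fi
rbd unmap "$dev"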

Josh


Re: can rbd unmap detect if device is mounted?

2012-07-16 Thread Travis Rhoden
Thanks for the response, Josh. Sorry I didn't send in my version info
with the initial message.

On Mon, Jul 16, 2012 at 6:43 PM, Josh Durgin josh.dur...@inktank.com wrote:
 On 07/16/2012 12:59 PM, Travis Rhoden wrote:

 Hi folks,

 I've made this mistake a couple of times now (completely my fault,
 when will I learn?), and am wondering if a bit of protection can be
 put in place against user errors.


 Yeah, we've been working on advisory locking. The first step is
 just adding an option to lock via the rbd command line tool, so you
 could script lock/map and unmap/unlock.

 This is described a little more in this thread:

 http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/7094

I had in fact seen this.  Looks like the advisory locking is targeted
for 0.49.  That's great!  Thanks for reminding me of it.  I'll look
forward to it.


 I mapped a device rbd map, then formatted and and mounted the device
 (mkfs.extf /dev/rbd0..., mount /dev/rbd0...).  Then sometime later,
 I want to remove the RBD device.  Stupidly, I do the rbd unmap
 command before I unmount the device.  The kernel doesn't really care
 for this.  Or more accurately, I can't remap that same RBD because I
 run into:

 kernel: [2248653.941688] sysfs: cannot create duplicate filename
 '/devices/virtual/block/rbd0'
 
 kernel: [2248653.941833] kobject_add_internal failed for rbd0 with
 -EEXIST, don't try to register things with the same name in the same
 directory.

 At this point, the rbd map command hangs indefinitely (producing the
 logs from above).  Ctrl-C does exit out, though.  But if I try to fix
 my mistake by doing the unmount now, I get the error:

 umount: device is busy.

 So really I get stuck.  I can't unmount without the device, and I
 can't remap the device to the old block device.  I have to reboot to
 clean up and move on.


 A similar issue was fixed in 3.4 (see
 http://tracker.newdream.net/issues/1907).
 What kernel are you using? 3.2 had a nasty possibility of preventing
 further operations if mapping hung while trying to connect to the
 monitors.

I am using the stock Ubuntu 12.04 kernel, which is in fact 3.2.  Good
point.  So, the only way for me to get the updates you mentioned is to
upgrade to a 3.4 kernel, correct?


 I imagine other bad things can happen with the block device goes away
 out from under the mount point.  Any way the rbd unmap command can
 detect when the device is in use or mounted and inform the user?


 Before actually unmapping the device, rbd unmap could check if it was
 present in mtab. If it's being used as a raw block device and not
 mounted or you created and used your own device node this wouldn't
 help, but it would be better than nothing.

That would be awesome.  Every little bit helps.

 Josh


Re: can rbd unmap detect if device is mounted?

2012-07-16 Thread Tommi Virtanen
On Mon, Jul 16, 2012 at 3:43 PM, Josh Durgin josh.dur...@inktank.com wrote:
 I've made this mistake a couple of times now (completely my fault,
 when will I learn?), and am wondering if a bit of protection can be
 put in place against user errors.
 Yeah, we've been working on advisory locking. The first step is
 just adding an option to lock via the rbd command line tool, so you
 could script lock/map and unmap/unlock.

Is his problem really about the locking?

It sounded to me like he has something (the mount) referencing a
block device, and we're letting the block device disappear.

The locking you guys have been talking about sounds like that lock
would be held whenever the image is mapped, regardless of whether it's
mounted or not (think mkfs).


Should unmap even be possible while the block device is open?
Shouldn't there be a refcount and an -EBUSY? That's what other block
device providers do:

[0 tv@dreamer ~]$ dd if=/dev/zero of=foo bs=1M count=40
40+0 records in
40+0 records out
41943040 bytes (42 MB) copied, 0.167171 s, 251 MB/s
[1 tv@dreamer ~]$ sudo losetup --show -f foo
/dev/loop0
[0 tv@dreamer ~]$ sudo mkfs /dev/loop0
mke2fs 1.42 (29-Nov-2011)
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=0 blocks, Stripe width=0 blocks
10240 inodes, 40960 blocks
2048 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=41943040
5 block groups
8192 blocks per group, 8192 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
8193, 24577

Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done

[0 tv@dreamer ~]$ sudo mount /dev/loop0 /mnt
[0 tv@dreamer ~]$ sudo losetup -d /dev/loop0
loop: can't delete device /dev/loop0: Device or resource busy
[1 tv@dreamer ~]$ sudo umount /mnt
[0 tv@dreamer ~]$ sudo losetup -d /dev/loop0
[0 tv@dreamer ~]$


Re: can rbd unmap detect if device is mounted?

2012-07-16 Thread Josh Durgin

On 07/16/2012 04:37 PM, Tommi Virtanen wrote:

On Mon, Jul 16, 2012 at 3:43 PM, Josh Durgin josh.dur...@inktank.com wrote:

I've made this mistake a couple of times now (completely my fault,
when will I learn?), and am wondering if a bit of protection can be
put in place against user errors.

Yeah, we've been working on advisory locking. The first step is
just adding an option to lock via the rbd command line tool, so you
could script lock/map and unmap/unlock.


Is his problem really about the locking?

It sounded to me like, he has something (the mount) referencing a
block device, and we're letting the block device disappear.

The locking you guys have been talking about sounds like that lock
would be held whenever the image is mapped, regardless of whether it's
mounted or not (think mkfs).


Should unmap even be possible while the block device is open?
Shouldn't there be a refcount and an -EBUSY? That's what other block
device providers do:


That would be the best solution. Looking into it, the rbd driver is
already keeping a refcount just like the loop driver (in
block_device_operations .open/.release). The rbd driver is only using
it for the struct device though, instead of struct rbd_device. This
shouldn't be too hard to fix.

Josh


Re: [PATCH] Robustify ceph-rbdnamer and adapt udev rules

2012-07-16 Thread Josh Durgin

On 07/12/2012 12:49 AM, Pascal de Bruijn | Unilogic Networks B.V. wrote:

On Wed, 2012-07-11 at 09:28 -0700, Josh Durgin wrote:

On 07/11/2012 06:23 AM, Pascal de Bruijn | Unilogic Networks B.V. wrote:

Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.

On our setup we encountered a symlink which was linked to the wrong rbd:

/dev/rbd/mypool/myrbd -> /dev/rbd1

While that link should have gone to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).

Now, the old udev rule passes %n to the ceph-rbdnamer script. The problem
with %n is that it results in a value of 3 for rbd3, but in a value of
1 for rbd3p1, so it can't be depended upon for rbd naming.

In the patch below the ceph-rbdnamer script is made more robust and it
now it can be called in various ways:

/usr/bin/ceph-rbdnamer /dev/rbd3
/usr/bin/ceph-rbdnamer /dev/rbd3p1
/usr/bin/ceph-rbdnamer rbd3
/usr/bin/ceph-rbdnamer rbd3p1
/usr/bin/ceph-rbdnamer 3

Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change has to be combined
with calling it from udev with %k though.

With that fixed, we hit the second problem. We ended up with:

/dev/rbd/mypool/myrbd -> /dev/rbd3p1

So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. So what probably went wrong is udev discovering the disk and
running ceph-rbdnamer which resolved it to myrbd so the following
symlink was created:

/dev/rbd/mypool/myrbd -> /dev/rbd3

However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:

/dev/rbd/mypool/myrbd -> /dev/rbd3p1

The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):

/dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1

Please let me know any feedback you have on this patch or the approach
used.


This all makes sense, but maybe we should put the -part suffix in
another namespace to avoid colliding with images that happen to have
-partN in their name, e.g.:

  /dev/rbd/mypool/myrbd/part1 -> /dev/rbd3p1


Well my current patch changes the udev rules in a way that's consistent
with other udev bits. For example:

   /dev/disk/by-id/cciss-3600508b100103835322020202026
   /dev/disk/by-id/cciss-3600508b100103835322020202026-part1
   /dev/disk/by-id/cciss-3600508b100103835322020202026-part2

There is no namespacing there either. That said, those rules tends to
use serials/unique-id's for naming (and not user specified strings), so
there is little risk of conflicting with the -part%n bit.

Also, having a namespace as suggested:

   /dev/rbd/mypool/myrbd/part1 -> /dev/rbd3p1

Also precludes:

   /dev/rbd/mypool/myrbd -> /dev/rbd3

from existing, as myrbd can't be both a device file and a directory at the
same time :)

Assuming you'd want to continue with this approach the disk udev link
should probably be something like:

   /dev/rbd/mypool/myrbd/disk -> /dev/rbd3

Please do note that this would change the udev rules in a way that could
potentially break people's existing scripts which might assume the old
udev scheme (whereas my current patch does not break the old scheme).


Good point.


Maybe it's worth considering applying my patch as-is to the 0.48.x
stable tree, and experimenting with other udev schemes in newer
development releases?


That sounds best. I've applied your patch to the stable, next, and
master branches.

Thanks!
Josh


Regards,
Pascal de Bruijn




Intermittent loss of connectivity with KVM-Ceph-Network (solved)

2012-07-16 Thread Australian Jade

Hello,

Just want to share my recent experience with KVM backed by RBD. Ceph
appears not to be at fault, but I'm posting it here for others to read,
since my configuration is something other Ceph users may have as well.


Over the last three weeks I was battling an elusive issue: a KVM guest
backed by RBD intermittently lost its network under a consistent (but
relatively low) load. It was driving me nuts, as nothing appeared in the
logs and everything was seemingly OK, with one exception: pings to a
nearby host sometimes came back with a "No buffer space available"
error, and a number of pings could be delayed by 20-30 seconds
(obviously such a delay causes a lot of timeouts). The KVM guest has two
virtio network interfaces, and the vhost_net module runs on the host.
One interface is publicly available, the second is connected to a
private network. I tried changing virtio to e1000 and increasing the
network buffers - all in vain. I also noticed that when a hold-up on the
interface happened, pings came back seconds apart: they piled up on the
interface and were then suddenly sent through all at once, and the
remote host returned all of them pretty much simultaneously. Another
observation I made: this behaviour was clearly evident when network/disk
activity was moderate - during backup. I stopped the backup for one day
but it did not help (although the loss of connectivity did not happen as
often).


Being unable to identify the cause, I started to pull things apart: I
moved the image from RBD to qcow and magically everything became
normal. Back on RBD, the issue manifested itself again. On the other
hand, I had a number of freshly installed VMs which are also backed by
RBD and do not have this issue. The VMs which had this fault were
different: they had been migrated from hardware hosts into the VM
environment. Fresh VMs and migrated VMs are distro-synced FC17, so I did
not expect any difference. The only difference left was that the
migrated VMs were 32-bit and the freshly installed ones were 64-bit. So
in the end I upgraded the kernel in one faulty VM to 64-bit (while
leaving the rest of the system 32-bit) and the problem disappeared! The
next day I upgraded another VM the same way and it also became
problem-free. So I am now sure the problem lies in a 32-bit kernel
running on a 64-bit host.


So I guess there is some race condition, likely in the virtio_net driver
or in the TCP stack, which is apparently triggered by the context switch
from 32-bit to 64-bit and the I/O delays introduced by the QEMU RBD
driver. Only when a 32-bit VM runs on a 64-bit host and is backed by an
RBD image does this issue appear. Being unable to identify the exact
spot in the kernel where the problem lies, I'm not even sure where I
should report it, so I decided to post it here as the place where people
with VMs backed by RBD will most likely look for a solution.


Regards,
Vladimir


Re: can rbd unmap detect if device is mounted?

2012-07-16 Thread Travis Rhoden
On Mon, Jul 16, 2012 at 7:37 PM, Tommi Virtanen t...@inktank.com wrote:

 [0 tv@dreamer ~]$ sudo mount /dev/loop0 /mnt
 [0 tv@dreamer ~]$ sudo losetup -d /dev/loop0
 loop: can't delete device /dev/loop0: Device or resource busy
 [1 tv@dreamer ~]$ sudo umount /mnt
 [0 tv@dreamer ~]$ sudo losetup -d /dev/loop0
 [0 tv@dreamer ~]$

Thanks for the ingenious loop device example, Tommi.  You illustrated
my point better than I ever could have.  Sounds like you guys have a
handle on the situation.

 - Travis


Re: How to compile Java-Rados.

2012-07-16 Thread ramu
Hi Noah,

I hardcoded CONF_FILE=/etc/ceph/ceph.conf and commented out
System.getProperty("CEPH_CONF_FILE") in RadosTestUtils.java.
Then I ran ant and ant test, but I am getting the same error.

Buildfile: /home/vu/java-rados/build.xml

makedir:

compile-rados:

compile-tests:
[javac] Compiling 1 source file to /home/vu/java-rados/build/test

jar:

test:
[junit] Running ClusterStatsTest
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.068 sec

BUILD FAILED
/home/vu/java-rados/build.xml:134: Test ClusterStatsTest failed

Total time: 2 seconds
and in the TEST-ClusterStatsTest.txt file I am also getting an error:

Testsuite: ClusterStatsTest
Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.068 sec

Testcase: test_ClusterStats took 0.036 sec
Caused an ERROR
rados_connect: ret=-1
net.newdream.ceph.rados.RadosException: rados_connect: ret=-1
at net.newdream.ceph.rados.Cluster.native_connect(Native Method)
at net.newdream.ceph.rados.Cluster.connect(Unknown Source)
at ClusterStatsTest.setup(Unknown Source)
Also, how do I give CEPH_ARGS?

Thanks,
Ramu.
