[ceph-users] how to resolve : start mon assert == 0
Hello all, when I restart any mon in the mon cluster {mon.a, mon.b, mon.c} after killing all mons (cephx disabled), an exception occurred as follows:

# ceph-mon -i b
mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7fc801c78780 time 2014-10-20 15:29:31.966367
mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
 1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
 2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
 4: (Monitor::init_paxos()+0xf5) [0x54a515]
 5: (Monitor::preinit()+0x69f) [0x56291f]
 6: (main()+0x2665) [0x534df5]
 7: (__libc_start_main()+0xed) [0x7fc7ffc7876d]
 8: ceph-mon() [0x537bf9]

Can anyone help to solve this problem?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Same rbd mount from multiple servers
Hello, I made a 2GB RBD on Ceph and mounted it on three separate servers. I followed this: http://ceph.com/docs/master/start/quick-rbd/ Setup, mkfs (ext4) and mount all finished successfully, but every node seems to see a different rbd volume. :-o If I copy a 100 MB file on the test1 node I don't see this file on the test2 and test3 nodes. I'm using Ubuntu 14.04 x64 with the latest stable ceph (0.80.7). What's wrong? Thank you, Mihaly http://www.virtual-call-center.eu/ http://www.virtual-call-center.hu/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Same rbd mount from multiple servers
Hi Sean, Thank you for your quick response! Okay, I see; is there any preferred clustered FS in this case? OCFS2, GFS? Thanks, Mihaly

2014-10-20 10:36 GMT+02:00 Sean Redmond sean.redm...@ukfast.co.uk:
Hi Mihaly, To my understanding you cannot mount an ext4 file system on more than one server at the same time; you would need to look at using a clustered file system. Thanks
*From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Mihály Árva-Tóth *Sent:* 20 October 2014 09:34 *To:* ceph-users@lists.ceph.com *Subject:* [ceph-users] Same rbd mount from multiple servers
Hello, I made a 2GB RBD on Ceph and mounted it on three separate servers. I followed this: http://ceph.com/docs/master/start/quick-rbd/ Setup, mkfs (ext4) and mount all finished successfully, but every node seems to see a different rbd volume. :-o If I copy a 100 MB file on the test1 node I don't see this file on the test2 and test3 nodes. I'm using Ubuntu 14.04 x64 with the latest stable ceph (0.80.7). What's wrong? Thank you, Mihaly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
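For a small shared RBD on Ubuntu, OCFS2 is usually the least setup work. A minimal sketch, assuming the image from the quick-start is test-pool/test-image, that it maps to /dev/rbd0 on every node, and that /etc/ocfs2/cluster.conf already lists the three nodes (all of these names are assumptions, not from the thread):

# apt-get install ocfs2-tools            (on every node)
# rbd map test-pool/test-image           (on every node)
# mkfs.ocfs2 -N 3 -L shared /dev/rbd0    (on one node only; -N = number of node slots)
# service o2cb online                    (on every node)
# mount -t ocfs2 /dev/rbd0 /mnt/shared   (on every node)

GFS2 works on the same principle but needs the corosync/dlm cluster stack, which is typically more effort to configure.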
Re: [ceph-users] how to resolve : start mon assert == 0
Please refer to http://tracker.ceph.com/issues/8851 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of minchen Sent: Monday, October 20, 2014 3:42 PM To: ceph-users; ceph-de...@vger.kernel.org Subject: [ceph-users] how to resolve : start mon assert == 0 Hello , all when i restart any mon in mon cluster{mon.a, mon.b, mon.c} after kill all mons(disabled cephx). An exception occured as follows: # ceph-mon -i b mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread thread 7fc801c78780 time 2014-10-20 15:29:31.966367 mon/AuthMonitor.cc: 155: FAILED assert(ret == 0) ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f) 1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6] 2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347] 4: (Monitor::init_paxos()+0xf5) [0x54a515] 5: (Monitor::preinit()+0x69f) [0x56291f] 6: (main()+0x2665) [0x534df5] 7: (__libc_start_main()+0xed) [0x7fc7ffc7876d] 8: ceph-mon() [0x537bf9] Anyone can help to solve this problem? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
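The tracker entry above has the details; if more information is needed to attach to it, one option (just a sketch, reusing the mon id from the original post) is to run the failing monitor in the foreground with verbose logging:

# ceph-mon -i b -d --debug_mon 20 --debug_paxos 20 --debug_ms 1

The output around the FAILED assert(ret == 0) line is usually what gets requested on the tracker.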
[ceph-users] How to calculate file size when mount a block device from rbd image
Hello all, I have a question about how to calculate file size when mounting a block device from an rbd image.
[Cluster information:]
1. The cluster has 1 mon and 6 osds. Every osd is 1T. Total space is 5556G.
2. rbd pool: replicated size 2, min_size 1, pg num = 128. Except for the rbd pool, all other pools are empty.
[Steps]
1. On a Linux client I use the rbd command to create a 1.5T rbd image and format it with ext4.
2. Use the dd command to create a 1.2T file: #dd if=/dev/zero of=/mnt/ceph-mount/test12T bs=1M count=12288000
3. When dd finished, it reported No space left on device. But parted -l displays the disk size as 1611G.
Why does the system say there is not enough space? Is there something I misunderstand or did wrong? Best wishes, Mika
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] real beginner question
Hi list, This is a real newbie question (and hopefully the right list to ask!). Is it possible to set up ceph in an already virtualized environment? i.e. we have a scenario here where we have a virtual machine (as opposed to individual physical machines) with an Ubuntu OS on it. We are trying to create a ceph cluster on this virtual machine (not sure if this is a sensible thing to do!). In our effort to install ceph we used vagrant (we came across some notes through google). We thought that would be the easiest route, as we do not know anything yet. But we are unsuccessful. We can go as far as creating a virtual machine, but it fails at the provisioning stage (i.e. mons, osds, mdss, rgws etc. do not get created). Any suggestions? Thanks Ranju Upadhyay Maynooth University, Ireland. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to calculate file size when mount a block device from rbd image
On 10/20/2014 11:16 AM, Vickie CH wrote: Hello all, I have a question about how to calculate file size when mount a block device from rbd image . [Cluster information:] 1.The cluster with 1 mon and 6 osds. Every osd is 1T. Total spaces is 5556G. 2.rbd pool:replicated size 2 min_size 1. num = 128. Except rbd pool other pools is empty. [Steps] 1.On Linux client I use rbd command to create a 1.5T rbd image and format it with ext4. 2.Use dd command to create a 1.2T file. #dd if=/dev/zero of=/mnt/ceph-mount/test12T bs=1M count=12288000 3.When dd finished the information shows No space left on device. But parted -l display the disk space is 1611G. Why does the system show space not enough? Is there something I misunderstand or wrong? Probably the rounding of GB and GiB. Keep in mind that 1.5TiB is 1.39TB and that ext4 also eats up space and reserves 5% for the superuser. Best wishes, Mika ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
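Two quick checks that make the difference visible (a sketch only; it assumes the image is mapped as /dev/rbd0 and mounted on /mnt/ceph-mount, which the original post does not spell out):

# df -h /mnt/ceph-mount
# tune2fs -l /dev/rbd0 | grep -i 'reserved block count'
# tune2fs -m 0 /dev/rbd0       (drops the 5% root reserve if the filesystem only holds bulk data)

Also note that parted reports decimal units: the 1611GB it shows is about 1500 GiB, and ext4 metadata plus the 5% reserve come out of that before df reports it as available.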
[ceph-users] slow requests - what is causing them?
Hello cephers, I've been testing flashcache and enhanceio block device caching for the osds and I've noticed I have started getting slow requests. The caching type that I use is read only, so all writes bypass the caching ssds and go directly to the osds, just like before introducing the caching layer. Prior to introducing caching, I rarely had slow requests. Judging by the logs, all slow requests look like these:

2014-10-16 01:09:15.600807 osd.7 192.168.168.200:6836/32031 100 : [WRN] slow request 30.999641 seconds old, received at 2014-10-16 01:08:44.601040: osd_op(client.36035566.0:16626375 rbd_data.51da686763845e .5a15 [set-alloc-hint object_size 4194304 write_size 4194304,write 2007040~16384] 5.7b16421b snapc c4=[c4] ack+ondisk+write e61892) v4 currently waiting for subops from 9
2014-10-16 01:09:15.600811 osd.7 192.168.168.200:6836/32031 101 : [WRN] slow request 30.999581 seconds old, received at 2014-10-16 01:08:44.601100: osd_op(client.36035566.0:16626376 rbd_data.51da686763845e .5a15 [set-alloc-hint object_size 4194304 write_size 4194304,write 2039808~16384] 5.7b16421b snapc c4=[c4] ack+ondisk+write e61892) v4 currently waiting for subops from 9
2014-10-16 01:09:16.185530 osd.2 192.168.168.200:6811/31891 76 : [WRN] 20 slow requests, 1 included below; oldest blocked for 57.003961 secs
2014-10-16 01:09:16.185564 osd.2 192.168.168.200:6811/31891 77 : [WRN] slow request 30.098574 seconds old, received at 2014-10-16 01:08:46.086854: osd_op(client.38917806.0:3481697 rbd_data.251d05e3db45a54. 0304 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 2732032~8192] 5.e4683bbb ack+ondisk+write e61892) v4 currently waiting for subops from 11
2014-10-16 01:09:16.601020 osd.7 192.168.168.200:6836/32031 102 : [WRN] 16 slow requests, 2 included below; oldest blocked for 43.531516 secs

In general, I see between 0 and about 2,000 slow request log entries per day. On one day I saw over 100k entries, but it only happened once. I am struggling to understand what is causing the slow requests. If all the writes go the same path as before caching was introduced, how come I am getting them? How can I investigate this further? Thanks Andrei
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
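One generic way to dig into blocked and recently slow ops is the OSD admin socket (a sketch; the osd ids come from the log excerpts above and the socket path assumes the default location, which may differ on your systems):

# ceph --admin-daemon /var/run/ceph/ceph-osd.7.asok dump_ops_in_flight
# ceph --admin-daemon /var/run/ceph/ceph-osd.9.asok dump_historic_ops
# ceph osd perf        (per-OSD commit/apply latency, if your release supports it)

The "currently waiting for subops from 9" / "from 11" parts of the warnings point at the replica OSDs, so those are the ones worth inspecting first.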
Re: [ceph-users] Reweight a host
I don't think so, check this out:

# id    weight    type name               up/down  reweight
-6      3.05      root ssd
-7      0.04999       host ceph-01-ssd
11      0.04999           osd.11          up       1
-8      1             host ceph-02-ssd
12      0.04999           osd.12          up       1
-9      1             host ceph-03-ssd
13      0.03999           osd.13          up       1
-10     1             host ceph-04-ssd
14      0.03999           osd.14          up       1

As you can see, only host ceph-01-ssd has the same weight as its osd; the other three hosts have weight 1, which is different from their associated osd. If the weight of the host -should- be the sum of all osd weights on the host, then my question becomes: how do I make that so for the three hosts where this is currently not the case? Thanks, Erik.

On 20-10-14 03:55, Lei Dong wrote: According to my understanding, the weight of a host is the sum of all osd weights on this host. So you just reweight any osd on this host, the weight of this host is reweighed. Thanks LeiDong On 10/20/14, 7:11 AM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, Simple question: how do I reweight a host in crushmap? I can use ceph osd crush reweight to reweight an osd, but I would like to change the weight of a host instead. I tried exporting the crushmap, but I noticed that the weights of all hosts are commented out, like so: # weight 5.460 And they are not the same values as seen in ceph osd tree. So how do I keep everything as it currently is, but simply change one single weight of one single host? Thanks, Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Reweight a host
I've never seen this before. The weight of the host is commented out because its weight is the sum of the weights in the following lines starting with "item". Can you attach your crush map? Did you manually change it?

On 10/20/14, 6:06 PM, Erik Logtenberg e...@logtenberg.eu wrote: I don't think so, check this out: # id weight type name up/down reweight -6 3.05 root ssd -7 0.04999 host ceph-01-ssd 11 0.04999 osd.11 up 1 -8 1 host ceph-02-ssd 12 0.04999 osd.12 up 1 -9 1 host ceph-03-ssd 13 0.03999 osd.13 up 1 -10 1 host ceph-04-ssd 14 0.03999 osd.14 up 1 As you can see, only host ceph-01-ssd has the same weight as its osd, the other three hosts have weight 1 which is different from their associated osd. If the weight of the host -should- be the sum of all osd weights on this host, then my question becomes: how do I make that so for the three hosts where this is currently not the case? Thanks, Erik. On 20-10-14 03:55, Lei Dong wrote: According to my understanding, the weight of a host is the sum of all osd weights on this host. So you just reweight any osd on this host, the weight of this host is reweighed. Thanks LeiDong On 10/20/14, 7:11 AM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, Simple question: how do I reweight a host in crushmap? I can use ceph osd crush reweight to reweight an osd, but I would like to change the weight of a host instead. I tried exporting the crushmap, but I noticed that the weights of all hosts are commented out, like so: # weight 5.460 And they are not the same values as seen in ceph osd tree. So how do I keep everything as it currently is, but simply change one single weight of one single host? Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
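For reference, the usual way to change a bucket (host) weight by hand is to decompile, edit and re-inject the crush map; the file names below are only examples:

# ceph osd getcrushmap -o /tmp/crushmap
# crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
    (edit /tmp/crushmap.txt: the host's own weight is on the "item ceph-02-ssd weight 1.000" line inside the root bucket)
# crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
# ceph osd setcrushmap -i /tmp/crushmap.new

Alternatively, "ceph osd crush reweight osd.12 0.04999" adjusts the osd item weight and updates the ancestor buckets; re-running it may resync the host weights, but that depends on how they got out of sync in the first place.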
Re: [ceph-users] real beginner question
Hi Ranju, Is it possible to set up ceph in an already virtualized environment? Yes, obviously; you can try out all the features of ceph in a virtualized environment. In fact it is the easiest and recommended way of playing with Ceph. The Ceph docs list the way to do this; it should take hardly any time to get it done. http://ceph.com/docs/master/start/quick-start-preflight/ Please reach out in case of any issues.

On Mon, Oct 20, 2014 at 3:53 PM, Ranju Upadhyay ranju.upadh...@nuim.ie wrote: Hi list, This is a real newbie question (and hopefully the right list to ask!). Is it possible to set up ceph in an already virtualized environment? i.e. we have a scenario here where we have a virtual machine (as opposed to individual physical machines) with an Ubuntu OS on it. We are trying to create a ceph cluster on this virtual machine (not sure if this is a sensible thing to do!). In our effort to install ceph we used vagrant (we came across some notes through google). We thought that would be the easiest route, as we do not know anything yet. But we are unsuccessful. We can go as far as creating a virtual machine, but it fails at the provisioning stage (i.e. mons, osds, mdss, rgws etc. do not get created). Any suggestions? Thanks Ranju Upadhyay Maynooth University, Ireland.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Thanks and Regards Ashish Chandra Openstack Developer, Cloud Engineering Reliance Jio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
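If vagrant keeps failing at the provisioning stage, it can be simpler to follow the preflight/quick-start pages by hand with ceph-deploy. A minimal sketch, where the node names and the data disk (/dev/vdb) are made-up examples:

# ceph-deploy new node1                                 (node1 will run the first monitor)
# ceph-deploy install node1 node2 node3
# ceph-deploy mon create-initial
# ceph-deploy osd prepare node2:/dev/vdb node3:/dev/vdb
# ceph-deploy osd activate node2:/dev/vdb1 node3:/dev/vdb1
# ceph -s                                               (should eventually report HEALTH_OK)

That takes vagrant out of the picture, so any failure shows up directly in the ceph-deploy output instead.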
Re: [ceph-users] Performance doesn't scale well on a full ssd cluster.
Test result update:

Number of Hosts   Max single volume IOPS   Max aggregated IOPS   SSD Disk IOPS   SSD Disk Utilization
7                 14k                      45k                   9800+           90%
8                 21k                      50k                   9800+           90%
9                 30k                      56k                   9800+           90%
10                40k                      54k                   8200+           70%

Note: the disk average request size is about 20 sectors, not the same as the client side (4k).

I have two questions about the result:

1. No matter how many nodes the cluster has, the backend write throughput is always almost 8 times the client side. Is this normal behavior in Ceph, or caused by some wrong configuration in my setup? The following data is captured in the 9 hosts test. Roughly, the aggregated backend write throughput is 10000 * 22 * 512 * 2 * 9 ≈ 1980M/s. The client side is 56k * 4 = 244M/s.

Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s
Device:  rrqm/s  wrqm/s  r/s   w/s       rsec/s  wsec/s     avgrq-sz  avgqu-sz  await  svctm  %util
sda      0.00    0.33    0.00  1.33      0.00    10.67      8.00      0.00      0.00   0.00   0.00
sdb      0.00    6.00    0.00  10219.67  0.00    223561.67  21.88     4.08      0.40   0.09   89.43
sdc      0.00    6.00    0.00  9750.67   0.00    220286.67  22.59     2.47      0.25   0.09   89.83
dm-0     0.00    0.00    0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00
dm-1     0.00    0.00    0.00  1.33      0.00    10.67      8.00      0.00      0.00   0.00   0.00

Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s
Device:  rrqm/s  wrqm/s  r/s   w/s       rsec/s  wsec/s     avgrq-sz  avgqu-sz  await  svctm  %util
sda      0.00    0.00    0.00  1.00      0.00    26.67      26.67     0.00      0.00   0.00   0.00
sdb      0.00    6.33    0.00  10389.00  0.00    224668.67  21.63     3.78      0.36   0.09   89.23
sdc      0.00    4.33    0.00  10106.67  0.00    217986.00  21.57     3.83      0.38   0.09   91.10
dm-0     0.00    0.00    0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00
dm-1     0.00    0.00    0.00  1.00      0.00    26.67      26.67     0.00      0.00   0.00   0.00

2. For the scalability issue (10 hosts performs worse than 9 hosts), is there any tuning suggestion to improve it?

Thanks!

2014-10-17 16:52 GMT+08:00 Mark Wu wud...@gmail.com:
I assume you added more clients and checked that it didn't scale past that?
Yes, correct.
You might look through the list archives; there are a number of discussions about how and how far you can scale SSD-backed cluster performance.
I have looked at those discussions before, in particular the one initiated by Sebastien: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12486.html From that thread I found that Giant can provide better utilization of the SSD backend. It does improve a lot in the 4k random write test, compared with Firefly. In the previous tests with Firefly and 16 osds, I found that the iops of 4k random write on a single volume is 14k, which almost reaches the peak of the whole cluster, while the iops on the SSD disk is less than 1000, which is far away from the hardware limitation. It looks like ceph doesn't dispatch fast enough. With 0.86, the following options and disabling debugging can improve things noticeably:
throttler perf counter = false
osd enable op tracker = false
Just scanning through the config options you set, you might want to bump up all the filestore and journal queue values a lot farther.
Tried the following options. It doesn't change.
ournal_queue_max_ops=3000
objecter_inflight_ops=10240
journal_max_write_bytes=1048576000
journal_queue_max_bytes=1048576000
ms_dispatch_throttle_bytes=1048576000
objecter_infilght_op_bytes=1048576000
filestore_max_sync_interval=10
I have a question about the relationship between the write I/O numbers performed on the ceph client and the osd disks. From the iostat pasted in the first message, the writes per second are about 5000 and the average request size is 17~22 sectors.
Roughly, the write throughput on all osd nodes is 20 * 512 * 5000 * 30 = 1500MB/s. The replica setting is 2 and the journal and osd data are on the same disk, so can we assume the write on the ssd disks is 40k (fio client result) * 4k * 2 * 2 = 640MB/s in theory? I don't understand why the actual write is so high compared with the theoretical value. And the average request size is also more than twice the client request size. I run blktrace to check if
[ceph-users] recovery process stops
Dear All, I have at the moment an issue with my cluster: the recovery process stops.

ceph -s
  health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%)
  monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6
  osdmap e6748: 24 osds: 23 up, 23 in
  pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%)

I have tried to restart all OSDs in the cluster, but that does not help to finish the recovery of the cluster. Does anyone have an idea? Kind Regards Harald Rößler
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to calculate file size when mount a block device from rbd image
Hi Mika, 2014-10-20 11:16 GMT+02:00 Vickie CH mika.leaf...@gmail.com: 2. Use the dd command to create a 1.2T file: #dd if=/dev/zero of=/mnt/ceph-mount/test12T bs=1M count=12288000 I think you're off by one zero: 12288000 / 1024 / 1024 ≈ 11.7, which means you're instructing it to create an ~11.7TB file on a 1.5T volume. Cheers Benedikt ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
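For reference, a count that actually produces roughly a 1.2T test file with bs=1M would be around 1228800 (1228800 MiB = 1200 GiB); the path below just reuses the one from the original post:

# dd if=/dev/zero of=/mnt/ceph-mount/test12T bs=1M count=1228800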
Re: [ceph-users] recovery process stops
On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] recovery process stops
I think it's because you have OSDs that are too full, as in the warning message. I had a similar problem recently and I did: ceph osd reweight-by-utilization But first read what this command does. It solved the problem for me.

2014-10-20 14:45 GMT+02:00 Harald Rößler harald.roess...@btd.de: Dear All I have in them moment a issue with my cluster. The recovery process stops. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0= 10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
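A sketch of the two usual knobs (the threshold and the osd id below are only examples, not taken from this cluster):

# ceph osd reweight-by-utilization 110     (only adjusts OSDs more than 10% above the average utilization)
# ceph osd reweight 13 0.85                (manually lower the override weight of one over-full OSD; 1 = full weight)

Note that this changes the temporary "reweight" column shown in ceph osd tree, not the crush weight, so it is easy to set back to 1 once the cluster has rebalanced.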
Re: [ceph-users] Performance doesn't scale well on a full ssd cluster.
On 10/20/2014 06:27 AM, Mark Wu wrote: Test result Update: Number of Hosts Maximum single volume IOPS Maximum aggregated IOPS SSD Disk IOPS SSD Disk Utilization 7 14k 45k 9800+90% 8 21k 50k 9800+90% 9 30k 56k 9800+ 90% 1040k 54k 8200+70% Note: the disk average request size is about 20 sectors, not same as client side (4k) I have two questions about the result: 1. No matter how many nodes the cluster has, the backend write throughput is always almost 8 times of client side. Is it normal behavior in Ceph, or caused by some wrong configuration in my setup? Are you counting journal writes and replication into this? Also note that journal writes will be slightly larger and padded to a 4K boundary for each write due to header information. I suspect for coalesced journal writes we may be able to pack the headers together to reduce this overhead. The following data is captured in the 9 hosts test. Roughly, the aggregated backend write throughput is 1000 * 22 * 512 * 2 * 9 = 1980M/s The client side is 56k * 4 = 244M/s Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/srops/swops/s Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.330.001.33 0.0010.67 8.00 0.000.00 0.00 0.00 sdb 0.00 6.000.00 10219.67 0.00 223561.67 21.88 4.080.40 0.09 89.43 sdc 0.00 6.000.00 9750.67 0.00 220286.67 22.59 2.470.25 0.09 89.83 dm-0 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.000.001.33 0.0010.67 8.00 0.000.00 0.00 0.00 Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/srops/swops/s Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.000.001.00 0.0026.67 26.67 0.000.00 0.00 0.00 sdb 0.00 6.330.00 10389.00 0.00 224668.67 21.63 3.780.36 0.09 89.23 sdc 0.00 4.330.00 10106.67 0.00 217986.00 21.57 3.830.38 0.09 91.10 dm-0 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.000.001.00 0.0026.67 26.67 0.000.00 0.00 0.00 2. For the scalability issue ( 10 hosts performs worse than 9 hosts), is there any tuning suggestion to improve it? Can you post exactly the test you are running and on how many hosts/volumes? That would help us debug. Thanks! Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] why the erasure code pool not support random write?
hi, cephers: When I looked into the ceph source code, I found that the erasure code pool does not support random writes, only append writes. Why? Is it because random writes are high cost for erasure coding and the performance of deep scrub would be very poor? Thanks. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] real beginner question
Hi, Ranju. Are you talking about setting up Ceph Monitors and OSD nodes on VMs for the purposes of learning, or adding a Ceph storage cluster to an existing KVM-based infrastructure that's using local storage/NFS/iSCSI for block storage now? - If the former, this is pretty easy. Although performance will suffer, OSDs and Monitors will run fine in VMs, just observe the minimum specs in the official hardware howtos. I setup my first cluster like this: Ubuntu 14.04 Workstation (with LVM) -Ceph1: 14.04 with Mon1 and OSD using Raw disk access from different LVM partitions of hypervisor OS -Ceph2: -Ceph3: -Test VM1: 14.04 desktop with 20G filesystem exposed through RBD to libvirt. What's neat (and was non-obvious) was that simply configuring the KVM hypervisor as a Ceph client allowed you to leverage its exposed storage even though the hosts exposing that storage were VMs on the same machine (horribly non-resilient design, yes, but it helped teach the concepts). - If you're looking to do the latter, you can create your Ceph cluster of nodes adjacent your existing infrastructure, configure your hypervisor nodes as ceph/rbd clients (and test them with ceph -w, etc) then convert/copy the disk images one by one to rbd block images: http://ceph.com/docs/master/rbd/libvirt/ http://ceph.com/docs/master/rbd/qemu-rbd/ Once you create a few test VMs on local disk and get into the practice of migrating them over, you'll find it's pretty straightforward with the commands listed in those pages. Dan Dan Geist dan(@)polter.net - Original Message - From: Ranju Upadhyay ranju.upadh...@nuim.ie To: ceph-users@lists.ceph.com Sent: Monday, October 20, 2014 6:23:59 AM Subject: [ceph-users] real beginner question Hi list, This is a real newbie question.(and hopefully the right list to ask to!) Is it possible to set up ceph in an already virtualized environment? i.e. we have a scenario here, where we have virtual machine ( as opposed to individual physical machines) with ubuntu OS on it. We are trying to create ceph cluster on this virtual machine . (not sure if this is a sensible thing to do!) On our effort to install ceph we used vagrant ( came across some notes through google). We thought that would be the easiest route, as we do not know anything yet. But we are unsuccessful. We can go as far as creating a virtual machine but it fails as provisioning stage (i.e. mons;osds;mdss;rgws etc do not get created) Any suggestions? Thanks Ranju Upadhyay Maynooth University, Ireland. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] why the erasure code pool not support random write?
This is a common constraint in many erasure coding storage system. It arises because random writes turn into a read-modify-write cycle (in order to redo the parity calculations). So we simply disallow them in EC pools, which works fine for the target use cases right now. -Greg On Monday, October 20, 2014, 池信泽 xmdx...@gmail.com wrote: hi, cephers: When I look into the ceph source code, I found the erasure code pool not support the random write, it only support the append write. Why? Is that random write of is erasure code high cost and the performance of the deep scrub is very poor? Thanks. -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] recovery process stops
Yes, I had some OSD which was near full, after that I tried to fix the problem with ceph osd reweight-by-utilization, but this does not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. Also a restart of the OSD doesn’t help. At the same time I had a hardware failure of on disk. :-(. After that failure the recovery process start at degraded ~ 13%“ and stops at 7%. Honestly I am scared in the moment I am doing the wrong operation. Regards Harald Rößler Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Performance doesn't scale well on a full ssd cluster.
2014-10-20 21:04 GMT+08:00 Mark Nelson mark.nel...@inktank.com: On 10/20/2014 06:27 AM, Mark Wu wrote: Test result Update: Number of Hosts Maximum single volume IOPS Maximum aggregated IOPS SSD Disk IOPS SSD Disk Utilization 7 14k 45k 9800+ 90% 8 21k 50k 9800+90% 9 30k 56k 9800+ 90% 1040k 54k 8200+70% Note: the disk average request size is about 20 sectors, not same as client side (4k) I have two questions about the result: 1. No matter how many nodes the cluster has, the backend write throughput is always almost 8 times of client side. Is it normal behavior in Ceph, or caused by some wrong configuration in my setup? Are you counting journal writes and replication into this? Also note that journal writes will be slightly larger and padded to a 4K boundary for each write due to header information. I suspect for coalesced journal writes we may be able to pack the headers together to reduce this overhead. Yes, the journal writes and replication are counted into backend writes. Each ssd disk has two partitions: the raw one is used for journal and the one formatted as xfs is used osd data. The replica setting is 2. So considering the journal writes and replication, I expect the writes on backend is 4 times of client side. From the perspective of disk utilization, it's good because it's already close to the physical limitation. But the overhead is too big. Is it possible to try your idea without modifying code? If yes, I am glad to give it a try. The following data is captured in the 9 hosts test. Roughly, the aggregated backend write throughput is 1000 * 22 * 512 * 2 * 9 = 1980M/s The client side is 56k * 4 = 244M/s Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/srops/swops/s Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.330.001.33 0.0010.67 8.00 0.000.00 0.00 0.00 sdb 0.00 6.000.00 10219.67 0.00 223561.67 21.88 4.080.40 0.09 89.43 sdc 0.00 6.000.00 9750.67 0.00 220286.67 22.59 2.470.25 0.09 89.83 dm-0 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.000.001.33 0.0010.67 8.00 0.000.00 0.00 0.00 Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/srops/swops/s Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.000.001.00 0.0026.67 26.67 0.000.00 0.00 0.00 sdb 0.00 6.330.00 10389.00 0.00 224668.67 21.63 3.780.36 0.09 89.23 sdc 0.00 4.330.00 10106.67 0.00 217986.00 21.57 3.830.38 0.09 91.10 dm-0 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.000.001.00 0.0026.67 26.67 0.000.00 0.00 0.00 2. For the scalability issue ( 10 hosts performs worse than 9 hosts), is there any tuning suggestion to improve it? Can you post exactly the test you are running and on how many hosts/volumes? That would help us debug. In the test, we run vdbench with the following parameters on one host: sd=sd1,lun=/dev/rbd2,threads=128 sd=sd2,lun=/dev/rbd0,threads=128 sd=sd3,lun=/dev/rbd1,threads=128 *sd=sd4,lun=/dev/rbd3,threads=128 wd=wd1,sd=sd1,xfersize=4k,rdpct=0,openflags=o_direct wd=wd2,sd=sd2,xfersize=4k,rdpct=0,openflags=o_direct wd=wd3,sd=sd3,xfersize=4k,rdpct=0,openflags=o_direct *wd=wd4,sd=sd4,xfersize=4k,rdpct=0,openflags=o_direct rd=run1,wd=wd*,iorate=10,elapsed=500,interval=1 Thanks! Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
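One way to cross-check where the amplification comes from is to read the OSD perf counters directly and compare client write bytes, replica (subop) write bytes and journal bytes (a sketch; counter names can differ slightly between releases, and the socket path assumes the default location):

# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | grep -E 'op_w_in_bytes|subop_w_in_bytes|journal_wr'

Summing those across all OSDs and comparing against the client-side MB/s from vdbench should show whether the extra factor of ~2 beyond journal+replication is really hitting the disks, or is an artifact of how iostat accounts for the co-located journal partition.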
Re: [ceph-users] recovery process stops
On 10/20/2014 04:43 PM, Harald Rößler wrote: Yes, I had some OSD which was near full, after that I tried to fix the problem with ceph osd reweight-by-utilization, but this does not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. Also a restart of the OSD doesn’t help. At the same time I had a hardware failure of on disk. :-(. After that failure the recovery process start at degraded ~ 13%“ and stops at 7%. Honestly I am scared in the moment I am doing the wrong operation. Any chance of adding a new node with some fresh disks? Seems like you are operating on the storage capacity limit of the nodes and that your only remedy would be adding more spindles. Wido Regards Harald Rößler Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] recovery process stops
On 10/20/2014 05:10 PM, Harald Rößler wrote: yes, tomorrow I will get the replacement of the failed disk, to get a new node with many disk will take a few days. No other idea? If the disks are all full, then, no. Sorry to say this, but it came down to poor capacity management. Never let any disk in your cluster fill over 80% to prevent these situations. Wido Harald Rößler Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 04:43 PM, Harald Rößler wrote: Yes, I had some OSD which was near full, after that I tried to fix the problem with ceph osd reweight-by-utilization, but this does not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. Also a restart of the OSD doesn’t help. At the same time I had a hardware failure of on disk. :-(. After that failure the recovery process start at degraded ~ 13%“ and stops at 7%. Honestly I am scared in the moment I am doing the wrong operation. Any chance of adding a new node with some fresh disks? Seems like you are operating on the storage capacity limit of the nodes and that your only remedy would be adding more spindles. Wido Regards Harald Rößler Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] why the erasure code pool not support random write?
On 10/20/2014 03:25 PM, 池信泽 wrote: hi, cephers: When I look into the ceph source code, I found the erasure code pool not support the random write, it only support the append write. Why? Is that random write of is erasure code high cost and the performance of the deep scrub is very poor? To modify a EC object you need to read all chunks in order to compute the parity again. So that would involve a lot of reads for what might be just a very small write. That's also why EC can't be used for RBD images. Thanks. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph OSD very slow startup
Hi, More information on our Btrfs tests.

On 14/10/2014 19:53, Lionel Bouton wrote: Current plan: wait at least a week to study 3.17.0 behavior and upgrade the 3.12.21 nodes to 3.17.0 if all goes well.

3.17.0 and 3.17.1 have a bug which remounts Btrfs filesystems read-only (no corruption but the OSD goes down) on some access patterns with snapshots: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36483.html The bug may be present in earlier kernels (at least the 3.16.4 code in fs/btrfs/qgroup.c doesn't handle the case differently than 3.17.0 and 3.17.1) but seems at least less likely to show up (never saw it with 3.16.4 in several weeks but it happened with 3.17.1 three times in just a few hours). As far as I can tell from its Changelog, 3.17.1 didn't patch any vfs/btrfs path vs 3.17.0, so I assume 3.17.0 has the same behaviour. I switched all servers to 3.16.4, which I had previously tested without any problem.

The performance problem is still there with 3.16.4. In fact one of the 2 large OSDs was so slow it was repeatedly marked out and generated lots of latencies when in. I just had to remove it: when this OSD is shut down with noout to avoid backfills slowing down the storage network, latencies are back to normal. I chose to reformat this one with XFS. The other big node has a nearly perfectly identical system (same hardware, same software configuration, same logical volume configuration, same weight in the crush map, comparable disk usage in the OSD fs, ...) but is behaving itself (maybe slower than our smaller XFS and Btrfs OSDs, but usable). The only notable difference is that it was formatted more recently. So the performance problem might be linked to the cumulative amount of data access to the OSD over time. If my suspicion is true I believe we might see performance problems on the other Btrfs OSDs later (we'll have to wait).

Is any Btrfs developer subscribed to this list? I could forward this information to linux-btrfs@vger if needed, but I can't offer much debugging help (the storage cluster is in production and I'm more inclined to migrate slow OSDs to XFS than to do invasive debugging with Btrfs). Best regards, Lionel Bouton
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] recovery process stops
Yes I agree 100%, but actual every disk have a maximum of 86% usage, there should a way to recover the cluster. To set the near full ratio to higher than 85% should be only a short term solution. New disk for higher capacity are already ordered, I only don’t like degraded situation, for a week or more. Also one of the VM’s doesn’t start because an slow request warning. Thanks for your advise. Harald Rößler Am 20.10.2014 um 17:12 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 05:10 PM, Harald Rößler wrote: yes, tomorrow I will get the replacement of the failed disk, to get a new node with many disk will take a few days. No other idea? If the disks are all full, then, no. Sorry to say this, but it came down to poor capacity management. Never let any disk in your cluster fill over 80% to prevent these situations. Wido Harald Rößler Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 04:43 PM, Harald Rößler wrote: Yes, I had some OSD which was near full, after that I tried to fix the problem with ceph osd reweight-by-utilization, but this does not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. Also a restart of the OSD doesn’t help. At the same time I had a hardware failure of on disk. :-(. After that failure the recovery process start at degraded ~ 13%“ and stops at 7%. Honestly I am scared in the moment I am doing the wrong operation. Any chance of adding a new node with some fresh disks? Seems like you are operating on the storage capacity limit of the nodes and that your only remedy would be adding more spindles. Wido Regards Harald Rößler Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. 
Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
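If adding capacity has to wait for the new disks, the setting that actually gates the backfill_toofull state is the OSDs' backfill full ratio (85% by default in releases of this era), which can be raised temporarily at runtime. A sketch, to be treated with the same caution as raising the near-full ratio:

# ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.90'

The injected value does not survive an OSD restart unless it is also put in ceph.conf, and it should be set back once the over-full OSDs have drained.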
Re: [ceph-users] Performance doesn't scale well on a full ssd cluster.
On 10/20/2014 09:28 AM, Mark Wu wrote: 2014-10-20 21:04 GMT+08:00 Mark Nelson mark.nel...@inktank.com mailto:mark.nel...@inktank.com: On 10/20/2014 06:27 AM, Mark Wu wrote: Test result Update: Number of Hosts Maximum single volume IOPS Maximum aggregated IOPS SSD Disk IOPS SSD Disk Utilization 7 14k 45k 9800+ 90% 8 21k 50k 9800+ 90% 9 30k 56k 9800+ 90% 1040k 54k 8200+ 70% Note: the disk average request size is about 20 sectors, not same as client side (4k) I have two questions about the result: 1. No matter how many nodes the cluster has, the backend write throughput is always almost 8 times of client side. Is it normal behavior in Ceph, or caused by some wrong configuration in my setup? Are you counting journal writes and replication into this? Also note that journal writes will be slightly larger and padded to a 4K boundary for each write due to header information. I suspect for coalesced journal writes we may be able to pack the headers together to reduce this overhead. Yes, the journal writes and replication are counted into backend writes. Each ssd disk has two partitions: the raw one is used for journal and the one formatted as xfs is used osd data. The replica setting is 2. So considering the journal writes and replication, I expect the writes on backend is 4 times of client side. From the perspective of disk utilization, it's good because it's already close to the physical limitation. But the overhead is too big. Is it possible to try your idea without modifying code? If yes, I am glad to give it a try. Sadly it will require code changes and is something we've only briefly talked about. So it is surprising that you would see 8x writes with 2x replication and on-disk journals imho. In the past one of the things I've done is add up all of the totals for the entire test both on the client side and on the server side just to make sure that the numbers are right. At least in past testing things properly added up, at least on our test rig. The following data is captured in the 9 hosts test. Roughly, the aggregated backend write throughput is 1000 * 22 * 512 * 2 * 9 = 1980M/s The client side is 56k * 4 = 244M/s Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/srops/swops/s Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.330.001.33 0.0010.67 8.00 0.000.00 0.00 0.00 sdb 0.00 6.000.00 10219.67 0.00 223561.67 21.88 4.080.40 0.09 89.43 sdc 0.00 6.000.00 9750.67 0.00 220286.67 22.59 2.470.25 0.09 89.83 dm-0 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.000.001.33 0.0010.67 8.00 0.000.00 0.00 0.00 Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/srops/swops/s Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.000.001.00 0.0026.67 26.67 0.000.00 0.00 0.00 sdb 0.00 6.330.00 10389.00 0.00 224668.67 21.63 3.780.36 0.09 89.23 sdc 0.00 4.330.00 10106.67 0.00 217986.00 21.57 3.830.38 0.09 91.10 dm-0 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.000.001.00 0.0026.67 26.67 0.000.00 0.00 0.00 2. For the scalability issue ( 10 hosts performs worse than 9 hosts), is there any tuning suggestion to improve it? Can you post exactly the test you are running and on how many hosts/volumes? That would help us debug. 
In the test, we run vdbench with the following parameters on one host: sd=sd1,lun=/dev/rbd2,threads=128 sd=sd2,lun=/dev/rbd0,threads=128 sd=sd3,lun=/dev/rbd1,threads=128 *sd=sd4,lun=/dev/rbd3,threads=128 wd=wd1,sd=sd1,xfersize=4k,rdpct=0,openflags=o_direct
[ceph-users] RADOS pool snaps and RBD
Hi, It seems Ceph doesn't allow rados pool snapshots on RBD pools which have or had RBD snapshots. They only work on RBD pools which never had an RBD snapshot.

So, basically this works:
rados mkpool test-pool 1024 1024 replicated
rbd -p test-pool create --size=102400 test-image
ceph osd pool mksnap test-pool rados-snap

But this doesn't:
rados mkpool test-pool 1024 1024 replicated
rbd -p test-pool create --size=102400 test-image
rbd -p test-pool snap create test-image@rbd-snap
ceph osd pool mksnap test-pool rados-snap

And we get the following error message: Error EINVAL: pool test-pool is in unmanaged snaps mode

I've been checking the source code and it seems to be the expected behavior, but I did not manage to find any information regarding unmanaged snaps mode. Also I did not find any information about RBD snapshots and pool snapshots being mutually exclusive. And even deleting all the RBD snapshots in a pool doesn't enable RADOS snapshots again. So, I have a couple of questions:
- Are RBD and RADOS snapshots mutually exclusive?
- What does the unmanaged snaps mode message mean?
- Is there any way to revert a pool's status to allow RADOS pool snapshots after all RBD snapshots are removed?

We are designing a quite interesting way to perform incremental backups of RBD pools managed by OpenStack Cinder. The idea is to do the incremental backup at the RADOS level, basically using the mtime property of each object and comparing it against the time we did the last backup / pool snapshot. That way it should be really easy to find modified objects and transfer only them, making the implementation of a DR solution easier. But the issue explained here would be a big problem, as the backup solution would stop working if just one user creates an RBD snapshot on the pool (for example using Cinder Backup). I hope somebody could give us more information about this unmanaged snaps mode or point us to a way to revert this behavior once all RBD snapshots have been removed from a pool. Thanks!

Best regards, Xavier Trilla P. Silicon Hosting Did you know that SiliconHosting now answers your technical questions for free? More information at: siliconhosting.com/qa/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
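For the mtime-based incremental idea, the per-object modification time is already exposed by the rados CLI, so a first rough pass could look like this (a sketch only; the pool name "volumes" is an assumption, the stat output format varies between releases, and issuing one stat per object will be slow on large pools compared to doing the same via librados):

# rados -p volumes ls > /tmp/objects
# while read obj; do
#     rados -p volumes stat "$obj"      # prints the object's size and mtime
# done < /tmp/objects

Objects whose mtime is newer than the previous run's timestamp would then be fetched with rados get (or read via librados) and shipped to the backup site.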
Re: [ceph-users] why the erasure code pool not support random write?
On 20/10/2014 16:39, Wido den Hollander wrote:
On 10/20/2014 03:25 PM, 池信泽 wrote:

hi, cephers: When I look into the Ceph source code, I found that erasure coded pools do not support random writes, only append writes. Why? Is it because random writes are high cost for erasure code, and the performance of deep scrub would be very poor?

To modify an EC object you need to read all chunks in order to compute the parity again. So that would involve a lot of reads for what might be just a very small write. That's also why EC can't be used for RBD images.

I'm surprised this is a show stopper. Even if writes are really slow, I can see several use cases for RBD images on EC pools (archiving, template RBDs, ...). Using tier caching in a write-back configuration might even alleviate some of the performance problems, if writes from the cache pool are done on properly aligned and sized chunks of data.

It may be overly optimistic (the small benchmark on the following page might have been done with all planets aligned...) but Sheepdog seems to implement EC storage with performance that would make theoretical Ceph EC RBDs interesting for me, if I could get equivalent performance on purely sequential accesses.

https://github.com/sheepdog/sheepdog/wiki/Erasure-Code-Support#performance

Lionel Bouton
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
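To make the read-modify-write cost concrete, here is a toy sketch with simple XOR parity (k=2 data chunks plus one parity chunk; Ceph's jerasure plugins are more general, but the update pattern for a small overwrite is the same):

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    # One stripe: k=2 data chunks plus a single XOR parity chunk.
    d0 = bytearray(b'AAAA')
    d1 = bytearray(b'BBBB')
    parity = xor(d0, d1)

    # A small in-place overwrite of d1 forces a read-modify-write:
    old = bytes(d1)                 # 1. read the old data chunk (or old parity)
    d1[2:3] = b'X'                  # 2. apply the 1-byte overwrite
    parity = xor(parity, xor(old, bytes(d1)))   # 3. recompute parity from the delta
    # 4. write back both the modified data chunk and the new parity chunk

    assert parity == xor(d0, d1)    # parity is consistent with the new data

Appending a whole new stripe only needs the new chunks plus a fresh parity write; the in-place overwrite above needs the old chunk (or old parity) read back before the new parity can be written, which is the extra read traffic Wido describes above.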
Re: [ceph-users] recovery process stops
You can set lower weight on full osds, or try changing the osd_near_full_ratio parameter in your cluster from 85 to for example 89. But i don't know what can go wrong when you do that. 2014-10-20 17:12 GMT+02:00 Wido den Hollander w...@42on.com: On 10/20/2014 05:10 PM, Harald Rößler wrote: yes, tomorrow I will get the replacement of the failed disk, to get a new node with many disk will take a few days. No other idea? If the disks are all full, then, no. Sorry to say this, but it came down to poor capacity management. Never let any disk in your cluster fill over 80% to prevent these situations. Wido Harald Rößler Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 04:43 PM, Harald Rößler wrote: Yes, I had some OSD which was near full, after that I tried to fix the problem with ceph osd reweight-by-utilization, but this does not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. Also a restart of the OSD doesn’t help. At the same time I had a hardware failure of on disk. :-(. After that failure the recovery process start at degraded ~ 13%“ and stops at 7%. Honestly I am scared in the moment I am doing the wrong operation. Any chance of adding a new node with some fresh disks? Seems like you are operating on the storage capacity limit of the nodes and that your only remedy would be adding more spindles. Wido Regards Harald Rößler Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0= 10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. 
Phone: +31 (0)20 700 9902 Skype: contact42on -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph OSD very slow startup
On Mon, Oct 20, 2014 at 8:25 AM, Lionel Bouton lionel+c...@bouton.name wrote: Hi, More information on our Btrfs tests. Le 14/10/2014 19:53, Lionel Bouton a écrit : Current plan: wait at least a week to study 3.17.0 behavior and upgrade the 3.12.21 nodes to 3.17.0 if all goes well. 3.17.0 and 3.17.1 have a bug which remounts Btrfs filesystems read-only (no corruption but OSD goes down) on some access patterns with snapshots: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36483.html The bug may be present in earlier kernels (at least the 3.16.4 code in fs/btrfs/qgroup.c doesn't handle the case differently than 3.17.0 and 3.17.1) but seems at least less likely to show up (never saw it with 3.16.4 in several weeks but it happened with 3.17.1 three times in just a few hours). As far as I can tell from its Changelog, 3.17.1 didn't patch any vfs/btrfs path vs 3.17.0 so I assume 3.17.0 has the same behaviour. I switched all servers to 3.16.4 which I had previously tested without any problem. The performance problem is still there with 3.16.4. In fact one of the 2 large OSD was so slow it was repeatedly marked out and generated lots of latencies when in. I just had to remove it: when this OSD is shut down with noout to avoid backfills slowing down the storage network, latencies are back to normal. I chose to reformat this one with XFS. The other big node has a nearly perfectly identical system (same hardware, same software configuration, same logical volume configuration, same weight in the crush map, comparable disk usage in the OSD fs, ...) but is behaving itself (maybe slower than our smaller XFS and Btrfs OSD, but usable). The only notable difference is that it was formatted more recently. So the performance problem might be linked to the cumulative amount of data access to the OSD over time. Yeah; we've seen this before and it appears to be related to our aggressive use of btrfs snapshots; it seems that btrfs doesn't defrag well under our use case. The btrfs developers make sporadic concerted efforts to improve things (and succeed!), but it apparently still hasn't gotten enough better yet. :( -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph RBD
On Mon, 20 Oct 2014, Dianis Dimoglo wrote: I installed ceph two nodes, 2 mon 2 osd in xfs, also used the RBD and mount the pool on two different ceph host and when I write data through one of the hosts at the other I do not see the data, what's wrong? Although the RBD disk can be shared, that will only be useful if the file system you put on top is designed to allow that. The usual suspects (ext4, xfs, etc.) do not--they assume only a single host is using the disk at any time. That means that unless you deploy a cluster fs like ocfs2 or gfs2, you can only use an RBD on a single host at a time. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] recovery process stops
yes, tomorrow I will get the replacement of the failed disk, to get a new node with many disk will take a few days. No other idea? Harald Rößler Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 04:43 PM, Harald Rößler wrote: Yes, I had some OSD which was near full, after that I tried to fix the problem with ceph osd reweight-by-utilization, but this does not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. Also a restart of the OSD doesn’t help. At the same time I had a hardware failure of on disk. :-(. After that failure the recovery process start at degraded ~ 13%“ and stops at 7%. Honestly I am scared in the moment I am doing the wrong operation. Any chance of adding a new node with some fresh disks? Seems like you are operating on the storage capacity limit of the nodes and that your only remedy would be adding more spindles. Wido Regards Harald Rößler Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com: On 10/20/2014 02:45 PM, Harald Rößler wrote: Dear All I have in them moment a issue with my cluster. The recovery process stops. See this: 2 active+degraded+remapped+backfill_toofull 156 pgs backfill_toofull You have one or more OSDs which are to full and that causes recovery to stop. If you add more capacity to the cluster recovery will continue and finish. ceph -s health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%) monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6 osdmap e6748: 24 osds: 23 up, 23 in pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%) I have tried to restart all OSD in the cluster, but does not help to finish the recovery of the cluster. Have someone any idea Kind Regards Harald Rößler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph RBD
Sage, Even with cluster file system, it will still need a fencing mechanism to allow SCSI device shared by multiple host, what kind of SCSI reservation RBD currently support? Fred Sent from my Samsung Galaxy S3 On Oct 20, 2014 4:42 PM, Sage Weil s...@newdream.net wrote: On Mon, 20 Oct 2014, Dianis Dimoglo wrote: I installed ceph two nodes, 2 mon 2 osd in xfs, also used the RBD and mount the pool on two different ceph host and when I write data through one of the hosts at the other I do not see the data, what's wrong? Although the RBD disk can be shared, that will only be useful if the file system you put on top is designed to allow that. The usual suspects (ext4, xfs, etc.) do not--they assume only a single host is using the disk at any time. That means that unless you deploy a cluster fs like ocfs2 or gfs2, you can only use an RBD on a single host at a time. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] why the erasure code pool not support random write?
Thanks. Another reason is that the checksum stored in the object's attr, which is used for deep scrub in EC pools, has to be recomputed when the object is modified. If random writes were supported, we would have to recalculate the checksum over the whole object even if only one bit were modified. If only append writes are supported, we can derive the new checksum from the previous checksum and the appended data, which is much quicker. Am I right?

2014-10-21 0:36 GMT+08:00 Gregory Farnum g...@inktank.com:

This is a common constraint in many erasure coding storage systems. It arises because random writes turn into a read-modify-write cycle (in order to redo the parity calculations). So we simply disallow them in EC pools, which works fine for the target use cases right now. -Greg

On Monday, October 20, 2014, 池信泽 xmdx...@gmail.com wrote:

hi, cephers: When I look into the Ceph source code, I found that erasure coded pools do not support random writes, only append writes. Why? Is it because random writes are high cost for erasure code, and the performance of deep scrub would be very poor? Thanks.

-- Software Engineer #42 @ http://inktank.com | http://ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
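The append-only case maps nicely onto a running digest: the existing hash state can simply be fed the appended bytes, whereas an overwrite in the middle forces re-reading and re-hashing the whole object. A small illustration (sha1 is used here purely as an example; it is not what Ceph's scrub code actually uses):

    import hashlib

    obj = bytearray(b'0123456789')
    h = hashlib.sha1(obj)                # running checksum of the current object

    # Append-only write: update the running digest with just the new bytes.
    appended = b'ABCDEF'
    obj += appended
    h.update(appended)                   # cost proportional to the appended data
    assert h.hexdigest() == hashlib.sha1(obj).hexdigest()

    # Random overwrite: the digest cannot be patched incrementally,
    # so the whole object has to be re-read and re-hashed.
    obj[3:5] = b'XY'
    h = hashlib.sha1(obj)                # cost proportional to the full object size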
Re: [ceph-users] why the erasure code pool not support random write?
Hi,

On 21/10/2014 01:10, 池信泽 wrote:
Thanks. Another reason is that the checksum stored in the object's attr, which is used for deep scrub in EC pools, has to be recomputed when the object is modified. If random writes were supported, we would have to recalculate the checksum over the whole object even if only one bit were modified. If only append writes are supported, we can derive the new checksum from the previous checksum and the appended data, which is much quicker. Am I right?

From what I understand, the deep scrub doesn't use a Ceph checksum but compares data between OSDs (and probably uses a "majority wins" rule for repair). If you are using Btrfs it will report an I/O error, because it uses an internal checksum by default, which will force Ceph to use other OSDs for repair. I'd be glad to be proven wrong on this subject though.

Best regards,

Lionel Bouton
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD (and probably other settings) not being picked up outside of the [global] section
I'm still running Emperor, but I'm not seeing that behavior. My ceph.conf is pretty similar: [global] mon initial members = ceph0 mon host = 10.129.0.6:6789, 10.129.0.7:6789, 10.129.0.8:6789 cluster network = 10.130.0.0/16 osd pool default flag hashpspool = true osd pool default min size = 2 osd pool default size = 3 public network = 10.129.0.0/16 [osd] osd journal size = 6144 osd mkfs options xfs = -s size=4096 osd mkfs type = xfs osd mount options xfs = rw,noatime,nodiratime,nosuid,noexec,inode64 If you manually run ceph-disk-prepare and ceph-disk-activate, are the mkfs params being picked up? For the daemon configs, you can query a running daemon to see what it's config params are: root@ceph0:~# ceph daemon osd.0 config get 'osd_op_threads' { osd_op_threads: 2} root@ceph0:~# ceph daemon osd.0 config get 'osd_scrub_load_threshold' { osd_scrub_load_threshold: 0.5} While we try to figure this out, you can tell the running daemons to use your values with: ceph tell osd.\* --inject_args '--osd_op_threads 10' On Thu, Oct 16, 2014 at 6:54 PM, Christian Balzer ch...@gol.com wrote: Hello, Consider this rather basic configuration file: --- [global] fsid = e6687ef7-54e1-44bd-8072-f9ecab00815 mon_initial_members = ceph-01, comp-01, comp-02 mon_host = 10.0.0.21,10.0.0.5,10.0.0.6 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true mon_osd_downout_subtree_limit = host public_network = 10.0.0.0/8 osd_pool_default_pg_num = 2048 osd_pool_default_pgp_num = 2048 osd_crush_chooseleaf_type = 1 [osd] osd_mkfs_type = ext4 osd_mkfs_options_ext4 = -J size=1024 -E lazy_itable_init=0,lazy_journal_init=0 osd_op_threads = 10 osd_scrub_load_threshold = 2.5 filestore_max_sync_interval = 10 --- Let us slide the annoying fact that ceph ignores the pg and pgp settings when creating the initial pools. And that monitors are preferred based on IP address instead of the sequence they're listed in the config file. Interestingly ceph-deploy correctly picks up the mkfs_options but why it fails to choose the mkfs_type as default is beyond me. The real issue is that the other three OSD setting are NOT picked up by ceph on startup. But they sure are when moved to the global section. Anybody else seeing this (both with 0.80.1 and 0.80.6)? Regards, Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] urgent- object unfound
It's probably a bit late now, but did you get the issue resolved? If not, why is OSD.49 down? I'd start by trying to get all of your OSDs back UP and IN. It may take a little while to unblock the requests. Recovery doesn't appear to prioritize blocked PGs, so it might take a while for recovery to get to the PG you care about. PGs tracks which OSDs were the primary over time (see ceph pg 6.766 query) . If your OSDs are flapping, it's possible for OSD.49 to have the most recent version of object1, and OSD.21 to have the most recent version of object2. Ceph can repair this, but it needs all of the PGs with the most recent version to be UP, and it blocks until the current primary has the latest version of the object requested. On Thu, Oct 16, 2014 at 5:36 AM, Ta Ba Tuan tuant...@vccorp.vn wrote: Hi eveyone, I use replicate 3, many unfound object and Ceph very slow. pg 6.9d8 is active+recovery_wait+degraded+remapped, acting [22,93], 4 unfound pg 6.766 is active+recovery_wait+degraded+remapped, acting [21,36], 1 unfound pg 6.73f is active+recovery_wait+degraded+remapped, acting [19,84], 2 unfound pg 6.63c is active+recovery_wait+degraded+remapped, acting [10,37], 2 unfound pg 6.56c is active+recovery_wait+degraded+remapped, acting [124,93], 2 unfound pg 6.4d3 is active+recovering+degraded+remapped, acting [33,94], 2 unfound pg 6.4a5 is active+recovery_wait+degraded+remapped, acting [11,94], 2 unfound pg 6.2f9 is active+recovery_wait+degraded+remapped, acting [22,34], 2 unfound recovery 535673/52672768 objects degraded (1.017%); 17/17470639 unfound (0.000%) ceph pg map 6.766 osdmap e94990 pg 6.766 (6.766) - up [49,36,21] acting [21,36] I can't resolve it. I need data on those objects. Guide me, please! Thank you! -- Tuan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
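For anyone scripting this, the "which OSDs might still hold the unfound copies" check can be pulled out of the pg query output; a rough sketch (the recovery_state/might_have_unfound layout is what Firefly prints, but field names can differ between releases, so treat this as an assumption to verify):

    import json
    import subprocess

    def might_have_unfound(pgid):
        out = subprocess.check_output(['ceph', 'pg', pgid, 'query'])
        info = json.loads(out.decode())
        hits = []
        # recovery_state is a list of state dicts; the active one usually
        # carries a might_have_unfound list of {osd, status} entries.
        for state in info.get('recovery_state', []):
            for entry in state.get('might_have_unfound', []):
                hits.append((entry.get('osd'), entry.get('status')))
        return hits

    for pgid in ['6.766', '6.9d8', '6.73f']:
        print(pgid, might_have_unfound(pgid))

The output should line up with what "ceph pg <pgid> query" prints by hand; the script just makes it easy to loop over all the PGs reporting unfound objects.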
Re: [ceph-users] Ceph RBD
Hi Fred, There is a fencing mechanism. There is work underway to wire it up to an iSCSI target (LIO in this case), but I think that isn't needed to simply run ocfs2 (or similar) directly on top of an RBD device. Honestly I'm not quite sure how that would glue together. sage On Mon, 20 Oct 2014, Fred Yang wrote: Sage, Even with cluster file system, it will still need a fencing mechanism to allow SCSI device shared by multiple host, what kind of SCSI reservation RBD currently support? Fred Sent from my Samsung Galaxy S3 On Oct 20, 2014 4:42 PM, Sage Weil s...@newdream.net wrote: On Mon, 20 Oct 2014, Dianis Dimoglo wrote: I installed ceph two nodes, 2 mon 2 osd in xfs, also used the RBD and mount the pool on two different ceph host and when I write data through one of the hosts at the other I do not see the data, what's wrong? Although the RBD disk can be shared, that will only be useful if the file system you put on top is designed to allow that. The usual suspects (ext4, xfs, etc.) do not--they assume only a single host is using the disk at any time. That means that unless you deploy a cluster fs like ocfs2 or gfs2, you can only use an RBD on a single host at a time. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Use case: one-way RADOS replication between two clusters by time period
RadosGW Federation can fulfill this use case: http://ceph.com/docs/master/radosgw/federated-config/ . Depending on your setup, it may or may not be easy.

To start, radosgw-agent handles the replication. It does the metadata (users and buckets) and the data (objects in a bucket). It only flows from the primary to the secondary, so you're good there. It tracks what's been replicated, and maintains this state in (I believe) the secondary cluster. If replication is started up after being down, it starts from the last replication timestamp and runs up to now (whatever "now" is when the run starts). Objects that have been deleted and garbage collected in the primary won't replicate, but that won't cause the replication to fail.

The current version of radosgw-agent, 1.2, attempts to get everything from its last replication timestamp to current in a single pass. It doesn't persist its replication state until it finishes that pass. Because of this, any interruption of the replication will start over. This is really only a problem if you have large buckets. If you have many buckets with a small amount of data, you'll just want to run a lot of replication threads. I have a few buckets, with ~1M objects and ~1 TiB of data per bucket. Took me a while to figure out that nightly log rotation was restarting the daemon. Once I disabled log rotation, I ran into problems with the stability of my VPN connection.

It's definitely doable. I would set up some virtual test clusters and try it out.

On Thu, Oct 16, 2014 at 2:05 AM, Anthony Alba ascanio.al...@gmail.com wrote:

Hi list, Can RADOS fulfil the following use case: I wish to have a radosgw-S3 object store that is LIVE; this represents the current objects of users. Separated by an air gap is another radosgw-S3 object store that is ARCHIVE. The objects will only be created and manipulated by radosgw. Periodically (on the order of every 3-6 months), I want to connect the two clusters and replicate all objects from LIVE to ARCHIVE created in the time period DDMM1 - DDMM2, or better yet from the last timestamp. This is a one-way replication and the objects are transferred only in the LIVE -> ARCHIVE direction. Can this be done easily? Thanks Anthony

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD (and probably other settings) not being picked up outside of the [global] section
Hello, On Mon, 20 Oct 2014 17:09:57 -0700 Craig Lewis wrote: I'm still running Emperor, but I'm not seeing that behavior. My ceph.conf is pretty similar: Yeah, I tested things extensively with Emperor back in the day and at that time frequently verified that changes in the config file were reflected in the running configuration after a restart. Until last week I of course blissfully assumed that this basic functionality would still work in Firefly. ^o^ [global] mon initial members = ceph0 mon host = 10.129.0.6:6789, 10.129.0.7:6789, 10.129.0.8:6789 cluster network = 10.130.0.0/16 osd pool default flag hashpspool = true osd pool default min size = 2 osd pool default size = 3 public network = 10.129.0.0/16 [osd] osd journal size = 6144 osd mkfs options xfs = -s size=4096 osd mkfs type = xfs osd mount options xfs = rw,noatime,nodiratime,nosuid,noexec,inode64 If you manually run ceph-disk-prepare and ceph-disk-activate, are the mkfs params being picked up? No idea really, I will have to test that. Of course with ceph-deploy (and I assume ceph-disk-prepare) the activate bit is a bit of misnomer, as the udev magic will happily activate an OSD instantly after creation despite me using just ceph-deploy osd prepare For the daemon configs, you can query a running daemon to see what it's config params are: root@ceph0:~# ceph daemon osd.0 config get 'osd_op_threads' { osd_op_threads: 2} root@ceph0:~# ceph daemon osd.0 config get 'osd_scrub_load_threshold' { osd_scrub_load_threshold: 0.5} I of course know that, that is how I found out that things didn't get picked up. While we try to figure this out, you can tell the running daemons to use your values with: ceph tell osd.\* --inject_args '--osd_op_threads 10' That I'm also aware of, but for the time being having everything in [global] resolves the problem and more importantly makes it reboot proof. Christian On Thu, Oct 16, 2014 at 6:54 PM, Christian Balzer ch...@gol.com wrote: Hello, Consider this rather basic configuration file: --- [global] fsid = e6687ef7-54e1-44bd-8072-f9ecab00815 mon_initial_members = ceph-01, comp-01, comp-02 mon_host = 10.0.0.21,10.0.0.5,10.0.0.6 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true mon_osd_downout_subtree_limit = host public_network = 10.0.0.0/8 osd_pool_default_pg_num = 2048 osd_pool_default_pgp_num = 2048 osd_crush_chooseleaf_type = 1 [osd] osd_mkfs_type = ext4 osd_mkfs_options_ext4 = -J size=1024 -E lazy_itable_init=0,lazy_journal_init=0 osd_op_threads = 10 osd_scrub_load_threshold = 2.5 filestore_max_sync_interval = 10 --- Let us slide the annoying fact that ceph ignores the pg and pgp settings when creating the initial pools. And that monitors are preferred based on IP address instead of the sequence they're listed in the config file. Interestingly ceph-deploy correctly picks up the mkfs_options but why it fails to choose the mkfs_type as default is beyond me. The real issue is that the other three OSD setting are NOT picked up by ceph on startup. But they sure are when moved to the global section. Anybody else seeing this (both with 0.80.1 and 0.80.6)? 
Regards, Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] how to resolve : start mon assert == 0
thank you very much, Shu, Xinxin

I just started all mons, with the command "ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0" on each mon node.

Min Chen

On 2014-10-20, Shu, Xinxin xinxin@intel.com wrote:

-----Original Message-----
From: Shu, Xinxin xinxin@intel.com
Sent: Monday, October 20, 2014
To: minchen minc...@ubuntukylin.com, ceph-users ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org
Subject: RE: [ceph-users] how to resolve : start mon assert == 0

Please refer to http://tracker.ceph.com/issues/8851

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of minchen
Sent: Monday, October 20, 2014 3:42 PM
To: ceph-users; ceph-de...@vger.kernel.org
Subject: [ceph-users] how to resolve : start mon assert == 0

Hello, all. When I restart any mon in the mon cluster {mon.a, mon.b, mon.c} after killing all mons (cephx disabled), an exception occurs as follows:

# ceph-mon -i b
mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7fc801c78780 time 2014-10-20 15:29:31.966367
mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
4: (Monitor::init_paxos()+0xf5) [0x54a515]
5: (Monitor::preinit()+0x69f) [0x56291f]
6: (main()+0x2665) [0x534df5]
7: (__libc_start_main()+0xed) [0x7fc7ffc7876d]
8: ceph-mon() [0x537bf9]

Anyone can help to solve this problem?

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RADOS pool snaps and RBD
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Xavier Trilla Sent: Tuesday, October 21, 2014 12:42 AM To: ceph-users@lists.ceph.com Subject: [ceph-users] RADOS pool snaps and RBD Hi, It seems Ceph doesn't allow rados pool snapshots on RBD pools which have or had RBD snapshots. They only work on RBD pools which never had a RBD snapshot. So, basically this works: rados mkpool test-pool 1024 1024 replicated rbd -p test-pool create --size=102400 test-image ceph osd pool mksnap test-pool rados-snap But this doesn't: rados mkpool test-pool 1024 1024 replicated rbd -p test-pool create --size=102400 test-image rbd -p test-pool snap create test-image@rbd-snap ceph osd pool mksnap test-pool rados-snap And we get the following error message: Error EINVAL: pool test-pool is in unmanaged snaps mode I've been checking the source code and it seems to be the expecte behavior, but I did not manage to find any information regarding unmanaged snaps mode. Also I did not find any information about RBD snapshots and pool snapshots being mutually exclusive. And even deleting all the RBD snapshots in a pool doesn't enable RADOS snapshots again. So, I have a couple of questions: - Are RBD and RADOS snapshots mutually exclusive? I think the answer is yes , this will be checked before you get your snap_seq in OSDMonitor. - What does mean unmanaged snaps mode message? This means you have create a rbd snapshot , you cannot create a snapshot for rados - Is there any way to revert a pool status to allow RADOS pool snapshots after all RBD snapshots are removed? not very sure We are designing a quite interesting way to perform incremental backups of RBD pools managed by OpenStack Cinder. The idea is to do the incremental backup at a RADOS level, basically using the mtime property of the object and comparing it against the time we did the last backup / pool snapshot. That way it should be really easy to find modified objects transferring only them, making the implementation of a DR solution easier.. But the issue explained here would be a big problem, as the backup solution would stop working if just one user creates a RBD snapshot on the pool (For example using Cinder Backup). I hope somebody could give us more information about this unmanaged snaps mode or point us to a way to revert this behavior once all RBD snapshots have been removed from a pool. Thanks! Saludos cordiales, Xavier Trilla P. Silicon Hosting ¿Sabías que ahora en SiliconHosting resolvemos tus dudas técnicas gratis? Más información en: siliconhosting.com/qa/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] radosGW balancer best practices
I'm using my existing HAProxy server to also balance my RadosGW nodes. I'm not going to run into bandwidth problems on that link any time soon, but I'll split RadosGW off onto it's own HAProxy instance when it does become congested. I have a smaller cluster, only 5 nodes. I'm running mon on the first 3 nodes, and osd and rgw on all 5 nodes. rgw and mon have very little overhead. I don't plan put those services on dedicated nodes anytime soon. MONs need a decent amount of Disk I/O; bad things happen if the disks can't keep up with the MONMAP updates. Virtual machines with dedicated IOPS should work fine for them. RadosGW doesn't need much CPU or Disk I/O. I would have no problem testing virtual machines as RadosGW nodes. As far as few-and-big or many-and-small, my gut feeling is that Apache FastCGI isn't going to scale up to 10 GigE speeds. Obviously you should test this. I don't see much downside to going many-and-small, unless you're planning to go crazy with the many. But that also depends on your network, and how the GigE and 10 GigE networks switch/route. If you have some spare CPU on your MON machines, I see no reason they can't double as RadosGW nodes too. On Tue, Oct 14, 2014 at 11:41 AM, Simone Spinelli simone.spine...@unipi.it wrote: Dear all, we are going to add rados-gw to our ceph cluster (144 OSD on 12 servers + 3 monitors connected via 10giga network) and we have a couple of questions. The first question is about the load balancer, do you have some advice based on real-world experience? Second question is about the number of gateway instances: is it better to have many littlegiga-connected servers or less fat10giga-connected servers considering that the total bandwidth available is 10 giga anyway? Do you use real or virtual servers? Any advice in terms of performances and reliability? Many thanks! Simone -- Simone Spinelli simone.spine...@unipi.it Università di Pisa Direzione ICT - Servizi di Rete ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Use case: one-way RADOS replication between two clusters by time period
Great information, thanks. I would like to confirm that if I regularly delete older buckets off the LIVE primary system, the extra objects on the ARCHIVE secondaries are ignored during replication. I.e. it does not behave like rsync -avz --delete LIVE/ ARCHIVE/ Rather it behaves more like rsync -avz LIVE/ ARCHIVE/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph counters
I've just started on this myself. I started with https://ceph.com/docs/v0.80/dev/perf_counters/

I'm currently monitoring the latency, using (to pick one example) [op_w_latency][sum] and [op_w_latency][avgcount]. Both values are counters, so they only increase with time. The lifetime average latency of the cluster isn't very useful, so I track the deltas of those values, then divide the recent deltas to get the average latency over my sample period.

Just graphing the latencies let me see a spike in write latency on all disks on one node, which eventually led me to a dead write-cache battery.

That's for the OSDs. I have similar things set up for MON and RadosGW.

I'm sure there are many more useful things to graph. One of the things I'm interested in (but haven't found time to research yet) is the journal usage, with maybe some alerts if the journal is more than 90% full.

On Mon, Oct 13, 2014 at 2:57 PM, Jakes John jakesjohn12...@gmail.com wrote:

Bump :). It would be helpful if someone can share info related to debugging using counters/stats.

On Sun, Oct 12, 2014 at 7:42 PM, Jakes John jakesjohn12...@gmail.com wrote:

Hi All, I would like to know if there are useful performance counters in Ceph which can help to debug the cluster. I have seen hundreds of stat counters in various daemon dumps. Some of them are:
1. commit_latency_ms
2. apply_latency_ms
3. snap_trim_queue_len
4. num_snap_trimming
What do these indicate? I have used iostat and atop for cluster statistics, but none of them indicate the internal Ceph status. Machines might be new but OSDs can still be slow. If some of these counters can help to debug why certain OSDs are bad (or can get bad later), it would be great. Some counters like total processed requests, pending requests in queue, avg time taken to process a request, etc.? Are there any docs for all performance counters which I can read? I couldn't find anything in the Ceph docs. Thanks

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
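In case it helps, this is roughly how the delta calculation can be scripted against the admin socket (a sketch; it assumes the script runs on the OSD host, that "ceph daemon osd.N perf dump" is available there, and that op_w_latency is the usual sum/avgcount pair; adjust the names to whatever your release exposes):

    import json
    import subprocess
    import time

    def op_w_latency(osd_id):
        out = subprocess.check_output(
            ['ceph', 'daemon', 'osd.%d' % osd_id, 'perf', 'dump'])
        lat = json.loads(out.decode())['osd']['op_w_latency']
        return lat['sum'], lat['avgcount']

    prev_sum, prev_cnt = op_w_latency(0)
    time.sleep(60)                        # sample period
    cur_sum, cur_cnt = op_w_latency(0)

    ops = cur_cnt - prev_cnt
    if ops:
        avg_ms = (cur_sum - prev_sum) / ops * 1000.0
        print('avg op_w_latency over the last minute: %.2f ms (%d ops)' % (avg_ms, ops))

The same delta trick applies to the other sum/avgcount counters.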
Re: [ceph-users] Ceph counters
On 10/20/2014 08:22 PM, Craig Lewis wrote: I've just started on this myself.. I started with https://ceph.com/docs/v0.80/dev/perf_counters/ I'm currently monitoring the latency, using the (to pick one example) [op_w_latency][sum] and [op_w_latency][avgcount]. Both values are counters, so they only increase with time. The lifetime average latency of the cluster isn't verify useful, so I track the deltas of those values, then divide the recent deltas to get the average latency over my sample period. Just graphing the latencies let me see a spike in write latency on all disks on one node, which eventually led me to a dead write-cache battery. That's for the OSDs. I have similar things setup for MON and RadosGW. I'm sure there are many more useful things to graph. One of things I'm interested in (but haven't found time to research yet) is the journal usage, with maybe some alerts if the journal is more than 90% full. This is not likely to be an issue with the default journal config since the wbthrottle code is pretty aggressive about flushing the journal to avoid spiky client IO. Having said that, I tend to agree that we need to do a better job of documenting everything from the perf counters to the states described in dump_historic_ops. Even internally it can get confusing trying to keep track of what's going on where. Mark On Mon, Oct 13, 2014 at 2:57 PM, Jakes John jakesjohn12...@gmail.com mailto:jakesjohn12...@gmail.com wrote: Bump:). It would be helpful, if someone can share info related to debugging using counters/stats On Sun, Oct 12, 2014 at 7:42 PM, Jakes John jakesjohn12...@gmail.com mailto:jakesjohn12...@gmail.com wrote: Hi All, I would like to know if there are useful performance counters in ceph which can help to debug the cluster. I have seen hundreds of stat counters in various daemon dumps. Some of them are, 1. commit_latency_ms 2. apply_latency_ms 3. snap_trim_queue_len 4. num_snap_trimming What do these indicate?. . I have used iostat, atop for cluster statistics but, none of them indicate the internal ceph status. Machines might be new but, osds can still be slow. If some of these counters can help to debug why certain osds are bad( or can get bad later), it would be great. Some counters like total processed requests, pending requests in queue, avg time taken to process a request etc ? Are there any docs for all performance counters which I can read?. I couldn't find anything in ceph docs. Thanks ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Use case: one-way RADOS replication between two clusters by time period
In a normal setup, where radosgw-agent runs all the time, it will delete the objects and buckets fairly quickly after they're deleted in the primary zone. If you shut down radosgw-agent, then nothing will update in the secondary cluster. Once you re-enable radosgw-agent, it will eventually process the deletes (along with all the writes). radosgw-agent is a relatively straight-forward python script. It shouldn't be too difficult to ignore the deletes, or write them to a database and process them 6 months later. I'm working on some snapshot capabilities for RadosGW ( https://wiki.ceph.com/Planning/Blueprints/Hammer/rgw%3A_Snapshots). Even if I (or my code) does something really stupid, I'll be able to go back and read the deleted objects from the snapshots. It's not perfect, it won't protect against malicious actions, but it will give me a safety net. On Mon, Oct 20, 2014 at 6:18 PM, Anthony Alba ascanio.al...@gmail.com wrote: Great information, thanks. I would like to confirm that if I regularly delete older buckets off the LIVE primary system, the extra objects on the ARCHIVE secondaries are ignored during replication. I.e. it does not behave like rsync -avz --delete LIVE/ ARCHIVE/ Rather it behaves more like rsync -avz LIVE/ ARCHIVE/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph counters
I'm sure there are many more useful things to graph. One of the things I'm interested in (but haven't found time to research yet) is the journal usage, with maybe some alerts if the journal is more than 90% full.

This is not likely to be an issue with the default journal config since the wbthrottle code is pretty aggressive about flushing the journal to avoid spiky client IO. Having said that, I tend to agree that we need to do a better job of documenting everything from the perf counters to the states described in dump_historic_ops. Even internally it can get confusing trying to keep track of what's going on where. Mark

I've always had issues during deep-scrubbing, particularly when there is a lot of deep-scrubbing going on for a long time. For example, I left nodeep-scrub set for a month. Things were pretty painful when I unset it. Everything was fine, but after ~8 hours I started getting slow requests, then OSDs marked down for being unresponsive. So "full journals" is just my most recent theory. I haven't figured out how to test my theory. I've tested (and fixed) a lot of other issues, which have made things better.

It's less of a problem now with journals on SSD, but it's something I ran into several times when my journals were on the HDD. With the SSD journals, if I do something that affects ~20% of my OSDs, I start having issues. I only have 5 nodes, and I can trigger this by re-formatting all of the OSDs on one node. I haven't (yet) had problems with smaller operations that affect less than 5% of my OSDs. My disks are 4TB, ~70% full, and a fresh format takes 24-48 hours to backfill.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RADOS pool snaps and RBD
On Mon, 20 Oct 2014, Xavier Trilla wrote:

Hi,

It seems Ceph doesn't allow RADOS pool snapshots on RBD pools which have or had RBD snapshots. They only work on RBD pools which never had an RBD snapshot.

So, basically this works:

rados mkpool test-pool 1024 1024 replicated
rbd -p test-pool create --size=102400 test-image
ceph osd pool mksnap test-pool rados-snap

But this doesn't:

rados mkpool test-pool 1024 1024 replicated
rbd -p test-pool create --size=102400 test-image
rbd -p test-pool snap create test-image@rbd-snap
ceph osd pool mksnap test-pool rados-snap

And we get the following error message:

Error EINVAL: pool test-pool is in unmanaged snaps mode

I've been checking the source code and it seems to be the expected behavior, but I did not manage to find any information regarding unmanaged snaps mode. Also I did not find any information about RBD snapshots and pool snapshots being mutually exclusive. And even deleting all the RBD snapshots in a pool doesn't enable RADOS snapshots again.

So, I have a couple of questions:

- Are RBD and RADOS snapshots mutually exclusive?

Xinxin already mentioned this, but to confirm, yes.

- What does the "unmanaged snaps mode" message mean?

It means the librados user is managing its own snapshot metadata. In this case, that's RBD; it stores information about what snapshots apply to what images in the RBD header object.

- Is there any way to revert a pool's status to allow RADOS pool snapshots after all RBD snapshots are removed?

No.

We are designing a quite interesting way to perform incremental backups of RBD pools managed by OpenStack Cinder. The idea is to do the incremental backup at the RADOS level, basically using the mtime property of each object and comparing it against the time we did the last backup / pool snapshot. That way it should be really easy to find modified objects and transfer only them, making the implementation of a DR solution easier. But the issue explained here would be a big problem, as the backup solution would stop working if just one user creates an RBD snapshot on the pool (for example using Cinder Backup).

This is already possible using the export-diff and import-diff functions of RBD on a per-image granularity. I think the only thing it doesn't provide is the ability to build a consistency group of lots of images and snapshot them together.

Note also that listing all objects to find the changed ones is not very efficient. The export-diff function is currently also not very efficient (it enumerates image objects), but the 'object map' changes that Jason is working on for RBD will fix this and make it quite fast.

sage

I hope somebody could give us more information about this unmanaged snaps mode or point us to a way to revert this behavior once all RBD snapshots have been removed from a pool.

Thanks!

Best regards,
Xavier Trilla P.
Silicon Hosting

Did you know that at SiliconHosting we now answer your technical questions for free? More information at: siliconhosting.com/qa/

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
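For reference, a minimal per-image sketch of the export-diff / import-diff flow Sage mentions (pool, image and snapshot names are made up; it assumes one snapshot is taken per backup cycle and the previous cycle's snapshot is kept until the diff has been applied on the backup side):

    import subprocess

    POOL = 'volumes'                 # example pool
    IMAGE = 'volume-0001'            # example image
    PREV_SNAP = 'backup-2014-09'     # snapshot taken by the previous cycle
    CUR_SNAP = 'backup-2014-10'      # snapshot for this cycle

    image_spec = '%s/%s' % (POOL, IMAGE)
    diff_path = '/tmp/%s-%s.diff' % (IMAGE, CUR_SNAP)

    # 1. Snapshot the image on the primary cluster for this backup cycle.
    subprocess.check_call(['rbd', 'snap', 'create', '%s@%s' % (image_spec, CUR_SNAP)])

    # 2. Export only the extents that changed between the two snapshots.
    subprocess.check_call(['rbd', 'export-diff', '--from-snap', PREV_SNAP,
                           '%s@%s' % (image_spec, CUR_SNAP), diff_path])

    # 3. On the backup cluster (after shipping diff_path across), apply it:
    # subprocess.check_call(['rbd', 'import-diff', diff_path, image_spec])

Once the diff has been applied, the old snapshot can be dropped and the current one becomes the base for the next cycle.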