Re: [ceph-users] Understanding incomplete PGs

2019-07-05 Thread Kyle
On Friday, July 5, 2019 11:50:44 AM CDT Paul Emmerich wrote:
> * There are virtually no use cases for ec pools with m=1, this is a bad
> configuration as you can't have both availability and durability

I'll have to look into this more. The cluster only has 4 hosts, so it might be 
worth switching to osd failure domain for the EC pools and using k=5,m=2.
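
For reference, on Luminous or newer that would look roughly like the following (profile and pool names here are just examples, and since an EC profile can't be changed on an existing pool, it would mean creating a new pool and migrating the data):

ceph osd erasure-code-profile set ec-5-2-osd k=5 m=2 crush-failure-domain=osd
ceph osd pool create ecpool-new 128 128 erasure ec-5-2-osd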

> 
> * Due to weird internal restrictions ec pools below their min size can't
> recover, you'll probably have to reduce min_size temporarily to recover it

Lowering min_size to 2 did allow it to recover.
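
For anyone else in the same spot, the sequence was essentially (pool name is an example; with k=2,m=1 the normal min_size is k+1 = 3):

ceph osd pool set ecpool min_size 2    # let the incomplete PGs go active and recover
# ...wait for recovery to finish...
ceph osd pool set ecpool min_size 3    # restore the safe default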

> 
> * Depending on your version it might be necessary to restart some of the
> OSDs due to a bug (fixed by now) that caused it to mark some objects as
> degraded if you remove or restart an OSD while you have remapped objects
> 
> * run "ceph osd safe-to-destroy X" to check if it's safe to destroy a given
> OSD

Excellent, thanks!
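
For the archives, a safer drain-and-remove flow on Luminous or newer looks roughly like this (using osd.6 from above):

ceph osd crush reweight osd.6 0
# wait for backfill, then confirm nothing still depends on the OSD:
while ! ceph osd safe-to-destroy osd.6; do sleep 60; done
ceph osd out 6
systemctl stop ceph-osd@6
ceph osd purge 6 --yes-i-really-mean-it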

> 
> > Hello,
> > 
> > I'm working with a small ceph cluster (about 10TB, 7-9 OSDs, all Bluestore
> > on
> > lvm) and recently ran into a problem with 17 pgs marked as incomplete
> > after
> > adding/removing OSDs.
> > 
> > Here's the sequence of events:
> > 1. 7 osds in the cluster, health is OK, all pgs are active+clean
> > 2. 3 new osds on a new host are added, lots of backfilling in progress
> > 3. osd 6 needs to be removed, so we do "ceph osd crush reweight osd.6 0"
> > 4. after a few hours we see "min osd.6 with 0 pgs" from "ceph osd
> > utilization"
> > 5. ceph osd out 6
> > 6. systemctl stop ceph-osd@6
> > 7. the drive backing osd 6 is pulled and wiped
> > 8. backfilling has now finished all pgs are active+clean except for 17
> > incomplete pgs
> > 
> > From reading the docs, it sounds like there has been unrecoverable data
> > loss
> > in those 17 pgs. That raises some questions for me:
> > 
> > Was "ceph osd utilization" only showing a goal of 0 pgs allocated instead
> > of
> > the current actual allocation?
> > 
> > Why is there data loss from a single osd being removed? Shouldn't that be
> > recoverable?
> > All pools in the cluster are either replicated 3 or erasure-coded k=2,m=1
> > with
> > default "host" failure domain. They shouldn't suffer data loss with a
> > single
> > osd being removed even if there were no reweighting beforehand. Does the
> > backfilling temporarily reduce data durability in some way?
> > 
> > Is there a way to see which pgs actually have data on a given osd?
> > 
> > I attached an example of one of the incomplete pgs.
> > 
> > Thanks for any help,
> > 
> > Kyle




Re: [ceph-users] Understanding incomplete PGs

2019-07-05 Thread Kyle
On Friday, July 5, 2019 11:28:32 AM CDT Caspar Smit wrote:
> Kyle,
> 
> Was the cluster still backfilling when you removed osd 6 or did you only
> check its utilization?

Yes, still backfilling.

> 
> Running an EC pool with m=1 is a bad idea. EC pool min_size = k+1 so losing
> a single OSD results in inaccessible data.
> Your incomplete PG's are probably all EC pool pgs, please verify.

Yes, also correct.
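
For anyone else checking this: the pool id is the number before the dot in each PG id, and EC shards carry an 'sN' suffix (like the 15.59s0 in the attached example), so something like this confirms it:

ceph pg ls incomplete
ceph osd pool ls detail    # maps pool ids to names and shows replicated vs erasure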

> 
> If the above statement is true, you could *temporarily* set min_size to 2
> (on your EC pools) to get back access to your data again but this is a very
> dangerous action. Losing another OSD during this period results in actual
> data loss.

This resolved the issue. I had seen reducing min_size mentioned elsewhere, but 
for some reason I thought that applied only to replicated pools. Thank you!

> 
> Kind regards,
> Caspar Smit
> 
> On Fri, Jul 5, 2019 at 01:17 Kyle  wrote:
> > Hello,
> > 
> > I'm working with a small ceph cluster (about 10TB, 7-9 OSDs, all Bluestore
> > on
> > lvm) and recently ran into a problem with 17 pgs marked as incomplete
> > after
> > adding/removing OSDs.
> > 
> > Here's the sequence of events:
> > 1. 7 osds in the cluster, health is OK, all pgs are active+clean
> > 2. 3 new osds on a new host are added, lots of backfilling in progress
> > 3. osd 6 needs to be removed, so we do "ceph osd crush reweight osd.6 0"
> > 4. after a few hours we see "min osd.6 with 0 pgs" from "ceph osd
> > utilization"
> > 5. ceph osd out 6
> > 6. systemctl stop ceph-osd@6
> > 7. the drive backing osd 6 is pulled and wiped
> > 8. backfilling has now finished all pgs are active+clean except for 17
> > incomplete pgs
> > 
> > From reading the docs, it sounds like there has been unrecoverable data
> > loss
> > in those 17 pgs. That raises some questions for me:
> > 
> > Was "ceph osd utilization" only showing a goal of 0 pgs allocated instead
> > of
> > the current actual allocation?
> > 
> > Why is there data loss from a single osd being removed? Shouldn't that be
> > recoverable?
> > All pools in the cluster are either replicated 3 or erasure-coded k=2,m=1
> > with
> > default "host" failure domain. They shouldn't suffer data loss with a
> > single
> > osd being removed even if there were no reweighting beforehand. Does the
> > backfilling temporarily reduce data durability in some way?
> > 
> > Is there a way to see which pgs actually have data on a given osd?
> > 
> > I attached an example of one of the incomplete pgs.
> > 
> > Thanks for any help,
> > 
> > Kyle




[ceph-users] Understanding incomplete PGs

2019-07-04 Thread Kyle
Hello,

I'm working with a small ceph cluster (about 10TB, 7-9 OSDs, all Bluestore on 
lvm) and recently ran into a problem with 17 pgs marked as incomplete after 
adding/removing OSDs.

Here's the sequence of events:
1. 7 osds in the cluster, health is OK, all pgs are active+clean
2. 3 new osds on a new host are added, lots of backfilling in progress
3. osd 6 needs to be removed, so we do "ceph osd crush reweight osd.6 0"
4. after a few hours we see "min osd.6 with 0 pgs" from "ceph osd utilization"
5. ceph osd out 6
6. systemctl stop ceph-osd@6
7. the drive backing osd 6 is pulled and wiped
8. backfilling has now finished; all pgs are active+clean except for 17
incomplete pgs

From reading the docs, it sounds like there has been unrecoverable data loss 
in those 17 pgs. That raises some questions for me:

Was "ceph osd utilization" only showing a goal of 0 pgs allocated instead of 
the current actual allocation?
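
For what it's worth, the actual per-OSD placement can be read directly; the PGS column here shows how many PGs each OSD really holds:

ceph osd df tree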

Why is there data loss from a single osd being removed? Shouldn't that be 
recoverable?
All pools in the cluster are either replicated 3 or erasure-coded k=2,m=1 with 
default "host" failure domain. They shouldn't suffer data loss with a single 
osd being removed even if there were no reweighting beforehand. Does the 
backfilling temporarily reduce data durability in some way?

Is there a way to see which pgs actually have data on a given osd?
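
For reference, something like the following should answer this directly (any osd id works):

ceph pg ls-by-osd osd.3    # lists the PGs that currently map to osd.3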

I attached an example of one of the incomplete pgs.

Thanks for any help,

Kyle

{
    "state": "incomplete",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 2087,
    "up": [
        4,
        3,
        8
    ],
    "acting": [
        4,
        3,
        8
    ],
    "info": {
        "pgid": "15.59s0",
        "last_update": "753'7465",
        "last_complete": "753'7465",
        "log_tail": "663'4401",
        "last_user_version": 6947,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 0,
        "purged_snaps": [],
        "history": {
            "epoch_created": 603,
            "epoch_pool_created": 603,
            "last_epoch_started": 1581,
            "last_interval_started": 1580,
            "last_epoch_clean": 945,
            "last_interval_clean": 944,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 2082,
            "same_interval_since": 2082,
            "same_primary_since": 2076,
            "last_scrub": "753'7465",
            "last_scrub_stamp": "2019-07-02 13:40:58.935208",
            "last_deep_scrub": "0'0",
            "last_deep_scrub_stamp": "2019-06-27 17:42:04.685790",
            "last_clean_scrub_stamp": "2019-07-02 13:40:58.935208"
        },
        "stats": {
            "version": "753'7465",
            "reported_seq": "12691",
            "reported_epoch": "2087",
            "state": "incomplete",
            "last_fresh": "2019-07-04 14:30:47.930190",
            "last_change": "2019-07-04 14:30:47.930190",
            "last_active": "2019-07-03 13:04:00.967354",
            "last_peered": "2019-07-03 13:02:40.242867",
            "last_clean": "2019-07-02 23:04:26.601070",
            "last_became_active": "2019-07-03 08:35:12.459857",
            "last_became_peered": "2019-07-03 08:35:12.459857",
            "last_unstale": "2019-07-04 14:30:47.930190",
            "last_undegraded": "2019-07-04 14:30:47.930190",
            "last_fullsized": "2019-07-04 14:30:47.930190",
            "mapping_epoch": 2082,
            "log_start": "663'4401",
            "ondisk_log_start": "663'4401",
            "created": 603,
            "last_epoch_clean": 945,
            "parent": "0.0",
            "parent_split_bits": 0,
            "last_scrub": "753'7465",
            "last_scrub_stamp": "2019-07-02 13:40:58.935208",
            "last_deep_scrub": "0'0",
            "last_deep_scrub_stamp": "2019-06-27 17:42:04.685790",
            "last_clean_scrub_stamp": "2019-07-02 1

Re: [ceph-users] Prioritized pool recovery

2019-05-06 Thread Kyle Brantley

On 5/6/2019 6:37 PM, Gregory Farnum wrote:

> Hmm, I didn't know we had this functionality before. It looks to be
> changing quite a lot at the moment, so be aware this will likely
> require reconfiguring later.


Good to know, and not a problem. In any case, I'd assume it won't change 
substantially for luminous, correct?
 
 

> I'm not seeing this in the luminous docs, are you sure? The source


You're probably right, but there are options for this in luminous:

# ceph osd pool get vm
Invalid command: missing required parameter var([...] 
recovery_priority|recovery_op_priority [...])



> code indicates in Luminous it's 0-254. (As I said, things have
> changed, so in the current master build it seems to be -10 to 10 and
> configured a bit differently.)
 

> The 1-63 values generally apply to op priorities within the OSD, and
> are used as part of a weighted priority queue when selecting the next
> op to work on out of those available; you may have been looking at
> osd_recovery_op_priority which is on that scale and should apply to
> individual recovery messages/ops but will not work to schedule PGs
> differently.


So I was probably looking at the OSD level then.




> > Questions:
> > 1) If I have pools 1-4, what would I set these values to in order to backfill
> > pools 1, 2, 3, and then 4 in order?


> So if I'm reading the code right, they just need to be different
> weights, and the higher value will win when trying to get a
> reservation if there's a queue of them. (However, it's possible that
> lower-priority pools will send off requests first and get to do one or
> two PGs first, then the higher-priority pool will get to do all its
> work before that pool continues.)


Where higher is 0, or higher is 254? And what's the difference between 
recovery_priority and recovery_op_priority?

In reading the docs for the OSD, _op_ is "priority set for recovery operations," and 
non-op is "priority set for recovery work queue." For someone new to ceph such as myself, 
this reads like the same thing at a glance. Would the recovery operations not be a part of the work 
queue?

And would this apply the same for the pools?




> > 2) Assuming this is possible, how do I ensure that backfill isn't prioritized
> > over client I/O?


> This is an ongoing issue but I don't think the pool prioritization
> will change the existing mechanisms.


Okay, understood. Not a huge problem, I'm primarily looking for understanding.



> > 3) Is there a command that enumerates the weights of the current operations (so
> > that I can observe what's going on)?


> "ceph osd pool ls detail" will include them.



Perfect!

Thank you very much for the information. Once I have a little more, I'm 
probably going to work towards sending a pull request in for the docs...


--Kyle


[ceph-users] Prioritized pool recovery

2019-05-05 Thread Kyle Brantley

I've been running luminous / ceph-12.2.11-0.el7.x86_64 on CentOS 7 for about a 
month now, and had a few times when I've needed to recreate the OSDs on a 
server. (no I'm not planning on routinely doing this...)

What I've noticed is that recovery is generally staggered so 
that the pools on the cluster will finish around the same time (+/- a few 
hours). What I'm hoping to do is prioritize specific pools over others, so that 
ceph will recover all of pool 1 before it moves on to pool 2, for example.

In the docs, recovery_{,op}_priority both have roughly the same description, which is 
"the priority set for recovery operations" as well as a valid range of 1-63, 
default 5. This doesn't tell me if a value of 1 is considered a higher priority than 63, 
and it doesn't tell me how it fits in line with other ceph operations.
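
For reference, the current values can at least be read back per pool, e.g. with the 'vm' pool listed below:

ceph osd pool get vm recovery_priority
ceph osd pool get vm recovery_op_priority
ceph osd pool ls detail    # dumps the per-pool settings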

Questions:
1) If I have pools 1-4, what would I set these values to in order to backfill 
pools 1, 2, 3, and then 4 in order?
2) Assuming this is possible, how do I ensure that backfill isn't prioritized 
over client I/O?
3) Is there a command that enumerates the weights of the current operations (so 
that I can observe what's going on)?

For context, my pools are:
1) cephfs_metadata
2) vm (RBD pool, VM OS drives)
3) storage (RBD pool, VM data drives)
4) cephfs_data

These are sorted by both size (smallest to largest) and criticality of recovery 
(most to least). If there's a critique of this setup / a better way of 
organizing this, suggestions are welcome.
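
Assuming higher values win the recovery reservation (which the reply above suggests), the intent would be a sketch like this, where only the relative order of the values should matter:

ceph osd pool set cephfs_metadata recovery_priority 4
ceph osd pool set vm recovery_priority 3
ceph osd pool set storage recovery_priority 2
ceph osd pool set cephfs_data recovery_priority 1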

Thanks,
--Kyle


Re: [ceph-users] OSD Segfaults after Bluestore conversion

2018-02-28 Thread Kyle Hutson
I'm following up from awhile ago. I don't think this is the same bug. The
bug referenced shows "abort: Corruption: block checksum mismatch", and I'm
not seeing that on mine.

Now I've had 8 OSDs down on this one server for a couple of weeks, and I
just tried to start it back up. Here's a link to the log of that OSD (which
segfaulted right after starting up):
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.414.log

To me, it looks like the logs are providing surprisingly few hints as to
where the problem lies. Is there a way I can turn up logging to see if I
can get any more info as to why this is happening?
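
For reference, the usual way to get more detail out of a single crashing OSD is to raise its debug levels before starting it, either in ceph.conf on that host:

[osd.414]
debug osd = 20
debug bluestore = 20
debug bluefs = 20
debug rocksdb = 10

or as a one-shot run in the foreground (flag values are examples):

ceph-osd -f --id 414 --debug-osd 20 --debug-bluestore 20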

On Thu, Feb 8, 2018 at 3:02 AM, Mike O'Connor  wrote:

> On 7/02/2018 8:23 AM, Kyle Hutson wrote:
> > We had a 26-node production ceph cluster which we upgraded to Luminous
> > a little over a month ago. I added a 27th-node with Bluestore and
> > didn't have any issues, so I began converting the others, one at a
> > time. The first two went off pretty smoothly, but the 3rd is doing
> > something strange.
> >
> > Initially, all the OSDs came up fine, but then some started to
> > segfault. Out of curiosity more than anything else, I did reboot the
> > server to see if it would get better or worse, and it pretty much
> > stayed the same - 12 of the 18 OSDs did not properly come up. Of
> > those, 3 again segfaulted
> >
> > I picked one that didn't properly come up and copied the log to where
> > anybody can view it:
> > http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log
> >
> > You can contrast that with one that is up:
> > http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log
> >
> > (which is still showing segfaults in the logs, but seems to be
> > recovering from them OK?)
> >
> > Any ideas?
> Ideas ? yes
>
> There is a bug which is hitting a small number of systems and at this
> time there is no solution. Issue details at
> http://tracker.ceph.com/issues/22102.
>
> Please submit more details of your problem on the ticket.
>
> Mike
>
>


[ceph-users] OSD Segfaults after Bluestore conversion

2018-02-06 Thread Kyle Hutson
We had a 26-node production ceph cluster which we upgraded to Luminous a
little over a month ago. I added a 27th-node with Bluestore and didn't have
any issues, so I began converting the others, one at a time. The first two
went off pretty smoothly, but the 3rd is doing something strange.

Initially, all the OSDs came up fine, but then some started to segfault.
Out of curiosity more than anything else, I did reboot the server to see if
it would get better or worse, and it pretty much stayed the same - 12 of
the 18 OSDs did not properly come up. Of those, 3 again segfaulted

I picked one that didn't properly come up and copied the log to where
anybody can view it:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log

You can contrast that with one that is up:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log

(which is still showing segfaults in the logs, but seems to be recovering
from them OK?)

Any ideas?


Re: [ceph-users] CephFS kernel driver is 10-15x slower than FUSE driver

2017-04-09 Thread Kyle Drake
I tried Fedora 25 (kernel 4.10.8-200.fc25.x86_64) with the kernel driver
and it works great. Perhaps you're on to something with the kernel version.
I didn't realize how far behind 16.04 was on this.

I will give upgrading Ubuntu 16.04 to a newer kernel the old college try.
Thanks.

On Sun, Apr 9, 2017 at 11:41 AM, Kyle Drake  wrote:

> On Sun, Apr 9, 2017 at 9:31 AM, John Spray  wrote:
>
>> On Sun, Apr 9, 2017 at 12:48 AM, Kyle Drake  wrote:
>> > Pretty much says it all. 1GB test file copy to local:
>> >
>> > $ time cp /mnt/ceph-kernel-driver-test/test.img .
>> >
>> > real 2m50.063s
>> > user 0m0.000s
>> > sys 0m9.000s
>> >
>> > $ time cp /mnt/ceph-fuse-test/test.img .
>> >
>> > real 0m3.648s
>> > user 0m0.000s
>> > sys 0m1.872s
>> >
>> > Yikes. The kernel driver averages ~5MB and the fuse driver averages
>> > ~150MBish? Something crazy is happening here. It's not caching, I ran
>> both
>> > tests fresh.
>>
>> What does "fresh" mean in this context?  i.e. what did you do in
>> between runs to reset it?  Have you tried running your procedure in
>> the reverse order (i.e. is the kernel client still slow when you're
>> running it after the fuse client)?
>>
>
> I rebooted the machine and ran the same test.
>
> I just repeated the exercise by creating two completely different test
> files, one for each driver, and got the same results.
>
> The FUSE driver has never under any circumstances been as slow, though
> when I feed a lot of activity into it at once, it tends to get stuck on
> something and hang for a while, so it's not a solution for me unfortunately.
>
>
>>
>> > Ubuntu 16.04.2, 4.4.0-72-generic, ceph-fuse 10.2.6-1xenial,
>> ceph-fs-common
>> > 10.2.6-0ubuntu0.16.04.1 (I also tried the 16.04.2 one, same issue).
>>
>> I don't know of any issues in the older kernel that you're running,
>> but you should be aware that 4.4 is over a year old and as far as I
>> know there is no backporting of cephfs stuff to the Ubuntu kernel, so
>> you're not getting the latest fixes.
>>
>
> That could be related, but Ubuntu 16.04 is going to be around for a long
> time, so this is probably something that needs to get addressed (unless I'm
> literally the only person on the planet experiencing this bug, as it seems
> to be right now). I don't know how to get on a newer kernel than that
> (without potentially wrecking the distro).
>
> I was actually about to try 14.04 to see if it does the same thing. If it
> works I'll post an update.
>
> -Kyle
>


Re: [ceph-users] CephFS kernel driver is 10-15x slower than FUSE driver

2017-04-09 Thread Kyle Drake
On Sun, Apr 9, 2017 at 9:31 AM, John Spray  wrote:

> On Sun, Apr 9, 2017 at 12:48 AM, Kyle Drake  wrote:
> > Pretty much says it all. 1GB test file copy to local:
> >
> > $ time cp /mnt/ceph-kernel-driver-test/test.img .
> >
> > real 2m50.063s
> > user 0m0.000s
> > sys 0m9.000s
> >
> > $ time cp /mnt/ceph-fuse-test/test.img .
> >
> > real 0m3.648s
> > user 0m0.000s
> > sys 0m1.872s
> >
> > Yikes. The kernel driver averages ~5MB and the fuse driver averages
> > ~150MBish? Something crazy is happening here. It's not caching, I ran
> both
> > tests fresh.
>
> What does "fresh" mean in this context?  i.e. what did you do in
> between runs to reset it?  Have you tried running your procedure in
> the reverse order (i.e. is the kernel client still slow when you're
> running it after the fuse client)?
>

I rebooted the machine and ran the same test.

I just repeated the exercise by creating two completely different test
files, one for each driver, and got the same results.

The FUSE driver has never under any circumstances been as slow, though when
I feed a lot of activity into it at once, it tends to get stuck on
something and hang for a while, so it's not a solution for me unfortunately.


>
> > Ubuntu 16.04.2, 4.4.0-72-generic, ceph-fuse 10.2.6-1xenial,
> ceph-fs-common
> > 10.2.6-0ubuntu0.16.04.1 (I also tried the 16.04.2 one, same issue).
>
> I don't know of any issues in the older kernel that you're running,
> but you should be aware that 4.4 is over a year old and as far as I
> know there is no backporting of cephfs stuff to the Ubuntu kernel, so
> you're not getting the latest fixes.
>

That could be related, but Ubuntu 16.04 is going to be around for a long
time, so this is probably something that needs to get addressed (unless I'm
literally the only person on the planet experiencing this bug, as it seems
to be right now). I don't know how to get on a newer kernel than that
(without potentially wrecking the distro).

I was actually about to try 14.04 to see if it does the same thing. If it
works I'll post an update.

-Kyle


[ceph-users] CephFS kernel driver is 10-15x slower than FUSE driver

2017-04-08 Thread Kyle Drake
Pretty much says it all. 1GB test file copy to local:

$ time cp /mnt/ceph-kernel-driver-test/test.img .

real 2m50.063s
user 0m0.000s
sys 0m9.000s

$ time cp /mnt/ceph-fuse-test/test.img .

real 0m3.648s
user 0m0.000s
sys 0m1.872s

Yikes. The kernel driver averages ~5MB/s and the fuse driver averages
~150MB/s-ish? Something crazy is happening here. It's not caching, I ran both
tests fresh.
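
In case it helps anyone reproducing this: instead of rebooting between runs, dropping the page cache should be enough to rule caching out:

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches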

Ubuntu 16.04.2, 4.4.0-72-generic, ceph-fuse 10.2.6-1xenial,
ceph-fs-common 10.2.6-0ubuntu0.16.04.1 (I also tried the 16.04.2 one, same
issue).

Anyone run into this? Did a lot of digging on the ML and didn't see
anything. I was going to use FUSE for production, but it tends to lag
more on a lot of small requests so I had to fall back to the kernel driver.


Re: [ceph-users] Default CRUSH Weight Set To 0 ?

2016-02-05 Thread Kyle
Burkhard Linke  writes:

> 
> The default weight is the size of the OSD in terabytes. Did you use
> a very small OSD partition for test purposes, e.g. 20 GB? In that
> case the weight is rounded and results in an effective weight of
> 0.0. As a result the OSD will not be used for data storage.
> Regards,
> Burkhard
>   
> 
> ___
> ceph-users mailing list
> ceph-users@...
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

Thanks Burkhard,

Yes, I used only 5 GB drives for this test.  So how is this calculated 
in the Infernalis release?  I used the exact same setup and the CRUSH 
weight turned out to be 321?




[ceph-users] Default CRUSH Weight Set To 0 ?

2016-02-04 Thread Kyle Harris
Hello,

I have been working on a very basic cluster with 3 nodes and a single OSD
per node.  I am using Hammer installed on CentOS 7
(ceph-0.94.5-0.el7.x86_64) since it is the LTS version.  I kept running
into an issue of not getting past the status of
undersized+degraded+peered.  I finally discovered the problem was that in
the default CRUSH map, the weight assigned is 0.  I changed the weight and
everything came up as it should.  I did the same test using the Infernalis
release and everything worked as expected as the weight has been changed to
a default of 321.
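
For anyone else testing with tiny OSDs, the symptom and fix look roughly like this (osd id and value are examples; CRUSH weights are in TB):

ceph osd tree                        # the WEIGHT column shows 0 for the 5 GB OSDs
ceph osd crush reweight osd.0 0.005  # ~5 GB expressed in TB, repeat for each OSD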

- Is this a bug or by design and if the latter, why?  Perhaps I'm missing
something?
- Has anyone else ran into this?
- Am I correct in assuming a weight of 0 won't allow the OSDs to be used or
is there some other purpose for this?

Hopefully this will help others that may run into this same situation.

Thank you.


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Kyle Hutson
Nice! Thanks!

On Wed, Oct 14, 2015 at 1:23 PM, Sage Weil  wrote:

> On Wed, 14 Oct 2015, Kyle Hutson wrote:
> > > Which bug?  We want to fix hammer, too!
> >
> > This
> > one:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg23915.html
> >
> > (Adam sits about 5' from me.)
>
> Oh... that fix is already in the hammer branch and will be in 0.94.4.
> Since you have to go to that anyway before infernalis you may as well stop
> there (unless there is something else you want from infernalis!).
>
> sage


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Kyle Hutson
> Which bug?  We want to fix hammer, too!

This one:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg23915.html

(Adam sits about 5' from me.)


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Kyle Hutson
A couple of questions related to this, especially since we have a hammer
bug that's biting us so we're anxious to upgrade to Infernalis.

1) RE: librbd and librados ABI compatibility is broken.  Be careful installing
this RC on client machines (e.g., those running qemu). It will be fixed in
the final v9.2.0 release.

We have several qemu clients. If we upgrade the ceph servers (and not the
qemu clients), will this affect us?

2) RE: Upgrading directly from Firefly v0.80.z is not possible.  All
clusters must first upgrade to Hammer v0.94.4 or a later v0.94.z release;
only then is it possible to upgrade to Infernalis 9.2.z.

I think I understand this, but want to verify. We're on 0.94.3. Can we
upgrade to the RC 9.1.0 and then safely upgrade to 9.2.z when it is
finalized? Any foreseen issues with this upgrade path?

On Wed, Oct 14, 2015 at 7:30 AM, Sage Weil  wrote:

> On Wed, 14 Oct 2015, Dan van der Ster wrote:
> > Hi Goncalo,
> >
> > On Wed, Oct 14, 2015 at 6:51 AM, Goncalo Borges
> >  wrote:
> > > Hi Sage...
> > >
> > > I've seen that the rh6 derivatives have been ruled out.
> > >
> > > This is a problem in our case since the OS choice in our systems is,
> > > somehow, imposed by CERN. The experiments software is certified for
> SL6 and
> > > the transition to SL7 will take some time.
> >
> > Are you accessing Ceph directly from "physics" machines? Here at CERN
> > we run CentOS 7 on the native clients (e.g. qemu-kvm hosts) and by the
> > time we upgrade to Infernalis the servers will all be CentOS 7 as
> > well. Batch nodes running SL6 don't (currently) talk to Ceph directly
> > (in the future they might talk to Ceph-based storage via an xroot
> > gateway). But if there are use-cases then perhaps we could find a
> > place to build and distributing the newer ceph clients.
> >
> > There's a ML ceph-t...@cern.ch where we could take this discussion.
> > Mail me if have trouble joining that e-Group.
>
> Also note that it *is* possible to build infernalis on el6, but it
> requires a lot more effort... enough that we would rather spend our time
> elsewhere (at least as far as ceph.com packages go).  If someone else
> wants to do that work we'd be happy to take patches to update the build and/or
> release process.
>
> IIRC the thing that eventually made me stop going down this patch was the
> fact that the newer gcc had a runtime dependency on the newer libstdc++,
> which wasn't part of the base distro... which means we'd need also to
> publish those packages in the ceph.com repos, or users would have to
> add some backport repo or ppa or whatever to get things running.  Bleh.
>
> sage
>
>
> >
> > Cheers, Dan
> > CERN IT-DSS
> >
> > > This is kind of a showstopper specially if we can't deploy clients in
> SL6 /
> > > Centos6.
> > >
> > > Is there any alternative?
> > >
> > > TIA
> > > Goncalo
> > >
> > >
> > >
> > > On 10/14/2015 08:01 AM, Sage Weil wrote:
> > >>
> > >> This is the first Infernalis release candidate.  There have been some
> > >> major changes since hammer, and the upgrade process is non-trivial.
> > >> Please read carefully.
> > >>
> > >> Getting the release candidate
> > >> -
> > >>
> > >> The v9.1.0 packages are pushed to the development release
> repositories::
> > >>
> > >>http://download.ceph.com/rpm-testing
> > >>http://download.ceph.com/debian-testing
> > >>
> > >> For for info, see::
> > >>
> > >>http://docs.ceph.com/docs/master/install/get-packages/
> > >>
> > >> Or install with ceph-deploy via::
> > >>
> > >>ceph-deploy install --testing HOST
> > >>
> > >> Known issues
> > >> 
> > >>
> > >> * librbd and librados ABI compatibility is broken.  Be careful
> > >>installing this RC on client machines (e.g., those running qemu).
> > >>It will be fixed in the final v9.2.0 release.
> > >>
> > >> Major Changes from Hammer
> > >> -
> > >>
> > >> * *General*:
> > >>* Ceph daemons are now managed via systemd (with the exception of
> > >>  Ubuntu Trusty, which still uses upstart).
> > >>* Ceph daemons run as 'ceph' user instead of root.
> > >>* On Red Hat distros, there is also an SELinux policy.
> > >> * *RADOS*:
> > >>* The RADOS cache tier can now proxy write operations to the base
> > >>  tier, allowing writes to be handled without forcing migration of
> > >>  an object into the cache.
> > >>* The SHEC erasure coding support is no longer flagged as
> > >>  experimental. SHEC trades some additional storage space for
> faster
> > >>  repair.
> > >>* There is now a unified queue (and thus prioritization) of client
> > >>  IO, recovery, scrubbing, and snapshot trimming.
> > >>* There have been many improvements to low-level repair tooling
> > >>  (ceph-objectstore-tool).
> > >>* The internal ObjectStore API has been significantly cleaned up in order
> > >>  to facilitate new storage backends like NewStore.
> > >> * *RGW*:
> > >>* The Swift API now 

Re: [ceph-users] CephFS and caching

2015-09-10 Thread Kyle Hutson
A 'rados -p cachepool ls' takes about 3 hours - not exactly useful.

I'm intrigued that you say a single read may not promote it into the cache.
My understanding is that if you have an EC-backed pool the clients can't
talk to them directly, which means they would necessarily be promoted to
the cache pool so the client could read it. Is my understanding wrong?

I'm also wondering if it's possible to use RAM as a read-cache layer.
Obviously, we don't want this for write-cache because of power outages,
motherboard failures, etc., but it seems to make sense for a read-cache. Is
that something that's being done, can be done, is going to be done, or has
even been considered?

On Wed, Sep 9, 2015 at 10:33 AM, Gregory Farnum  wrote:

> On Wed, Sep 9, 2015 at 4:26 PM, Kyle Hutson  wrote:
> >
> >
> > On Wed, Sep 9, 2015 at 9:34 AM, Gregory Farnum 
> wrote:
> >>
> >> On Wed, Sep 9, 2015 at 3:27 PM, Kyle Hutson  wrote:
> >> > We are using Hammer - latest released version. How do I check if it's
> >> > getting promoted into the cache?
> >>
> >> Umm...that's a good question. You can run rados ls on the cache pool,
> >> but that's not exactly scalable; you can turn up logging and dig into
> >> them to see if redirects are happening, or watch the OSD operations
> >> happening via the admin socket. But I don't know if there's a good
> >> interface for users to just query the cache state of a single object.
> >> :/
> >
> >
> > even using 'rados ls', I (naturally) get cephfs object names - is there a
> > way to see a filename -> objectname conversion ... or objectname ->
> filename
> > ?
>
> The object name is <inode number in hex>.<object index>. So you can
> look at the file inode and then see which of its objects are actually
> in the pool.
> -Greg
>
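
For the archives, that mapping works out to something like this (paths and the inode value are made up; CephFS object names are the file's inode in hex plus a per-object index):

stat -c %i /mnt/cephfs/path/to/bigfile         # -> e.g. 1099511627776
printf '%x\n' 1099511627776                    # -> 10000000000
rados -p cachepool stat 10000000000.00000000   # succeeds only if that object is in the cache tier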
> >
> >>
> >> > We're using the latest ceph kernel client. Where do I poke at
> readahead
> >> > settings there?
> >>
> >> Just the standard kernel readahead settings; I'm not actually familiar
> >> with how to configure those but I don't believe Ceph's are in any way
> >> special. What do you mean by "latest ceph kernel client"; are you
> >> running one of the developer testing kernels or something?
> >
> >
> > No, just what comes with the latest stock kernel. Sorry for any
> confusion.
> >
> >>
> >> I think
> >> Ilya might have mentioned some issues with readahead being
> >> artificially blocked, but that might have only been with RBD.
> >>
> >> Oh, are the files you're using sparse? There was a bug with sparse
> >> files not filling in pages that just got patched yesterday or
> >> something.
> >
> >
> > No, these are not sparse files. Just really big.
> >
> >>
> >> >
> >> > On Tue, Sep 8, 2015 at 8:29 AM, Gregory Farnum 
> >> > wrote:
> >> >>
> >> >> On Thu, Sep 3, 2015 at 11:58 PM, Kyle Hutson 
> >> >> wrote:
> >> >> > I was wondering if anybody could give me some insight as to how
> >> >> > CephFS
> >> >> > does
> >> >> > its caching - read-caching in particular.
> >> >> >
> >> >> > We are using CephFS with an EC pool on the backend with a
> replicated
> >> >> > cache
> >> >> > pool in front of it. We're seeing some very slow read times. Trying
> >> >> > to
> >> >> > compute an md5sum on a 15GB file twice in a row (so it should be in
> >> >> > cache)
> >> >> > takes the time from 23 minutes down to 17 minutes, but this is
> over a
> >> >> > 10Gbps
> >> >> > network and with a crap-ton of OSDs (over 300), so I would expect
> it
> >> >> > to
> >> >> > be
> >> >> > down in the 2-3 minute range.
> >> >>
> >> >> A single sequential read won't necessarily promote an object into the
> >> >> cache pool (although if you're using Hammer I think it will), so you
> >> >> want to check if it's actually getting promoted into the cache before
> >> >> assuming that's happened.
> >> >>
> >> >> >
> >> >> > I'm just trying to figure out what we can do to increase the
> >> >> > performance. I
> >> >> > have over 300 TB of live data that I have to be careful with,
> though,
> >> >> > so
> >> >> > I
> >> >> > have to have some level of caution.
> >> >> >
> >> >> > Is there some other caching we can do (client-side or server-side)
> >> >> > that
> >> >> > might give us a decent performance boost?
> >> >>
> >> >> Which client are you using for this testing? Have you looked at the
> >> >> readahead settings? That's usually the big one; if you're only asking
> >> >> for 4KB at once then stuff is going to be slow no matter what (a
> >> >> single IO takes at minimum about 2 milliseconds right now, although
> >> >> the RADOS team is working to improve that).
> >> >> -Greg
> >> >>
> >> >> >
> >> >> > ___
> >> >> > ceph-users mailing list
> >> >> > ceph-users@lists.ceph.com
> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >
> >> >
> >> >
> >
> >
>


Re: [ceph-users] CephFS and caching

2015-09-09 Thread Kyle Hutson
On Wed, Sep 9, 2015 at 9:34 AM, Gregory Farnum  wrote:

> On Wed, Sep 9, 2015 at 3:27 PM, Kyle Hutson  wrote:
> > We are using Hammer - latest released version. How do I check if it's
> > getting promoted into the cache?
>
> Umm...that's a good question. You can run rados ls on the cache pool,
> but that's not exactly scalable; you can turn up logging and dig into
> them to see if redirects are happening, or watch the OSD operations
> happening via the admin socket. But I don't know if there's a good
> interface for users to just query the cache state of a single object.
> :/
>

even using 'rados ls', I (naturally) get cephfs object names - is there a
way to see a filename -> objectname conversion ... or objectname ->
filename ?


> > We're using the latest ceph kernel client. Where do I poke at readahead
> > settings there?
>
> Just the standard kernel readahead settings; I'm not actually familiar
> with how to configure those but I don't believe Ceph's are in any way
> special. What do you mean by "latest ceph kernel client"; are you
> running one of the developer testing kernels or something?


No, just what comes with the latest stock kernel. Sorry for any confusion.


> I think
> Ilya might have mentioned some issues with readahead being
> artificially blocked, but that might have only been with RBD.
>
> Oh, are the files you're using sparse? There was a bug with sparse
> files not filling in pages that just got patched yesterday or
> something.
>

No, these are not sparse files. Just really big.


> >
> > On Tue, Sep 8, 2015 at 8:29 AM, Gregory Farnum 
> wrote:
> >>
> >> On Thu, Sep 3, 2015 at 11:58 PM, Kyle Hutson 
> wrote:
> >> > I was wondering if anybody could give me some insight as to how CephFS
> >> > does
> >> > its caching - read-caching in particular.
> >> >
> >> > We are using CephFS with an EC pool on the backend with a replicated
> >> > cache
> >> > pool in front of it. We're seeing some very slow read times. Trying to
> >> > compute an md5sum on a 15GB file twice in a row (so it should be in
> >> > cache)
> >> > takes the time from 23 minutes down to 17 minutes, but this is over a
> >> > 10Gbps
> >> > network and with a crap-ton of OSDs (over 300), so I would expect it
> to
> >> > be
> >> > down in the 2-3 minute range.
> >>
> >> A single sequential read won't necessarily promote an object into the
> >> cache pool (although if you're using Hammer I think it will), so you
> >> want to check if it's actually getting promoted into the cache before
> >> assuming that's happened.
> >>
> >> >
> >> > I'm just trying to figure out what we can do to increase the
> >> > performance. I
> >> > have over 300 TB of live data that I have to be careful with, though,
> so
> >> > I
> >> > have to have some level of caution.
> >> >
> >> > Is there some other caching we can do (client-side or server-side)
> that
> >> > might give us a decent performance boost?
> >>
> >> Which client are you using for this testing? Have you looked at the
> >> readahead settings? That's usually the big one; if you're only asking
> >> for 4KB at once then stuff is going to be slow no matter what (a
> >> single IO takes at minimum about 2 milliseconds right now, although
> >> the RADOS team is working to improve that).
> >> -Greg
> >>
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >
> >
>


Re: [ceph-users] CephFS and caching

2015-09-09 Thread Kyle Hutson
We are using Hammer - latest released version. How do I check if it's
getting promoted into the cache?

We're using the latest ceph kernel client. Where do I poke at readahead
settings there?
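
For anyone else looking: with the kernel client the readahead window is a mount option, e.g. (monitor address and paths are placeholders; rasize is in bytes, default 8 MB):

mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=67108864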

On Tue, Sep 8, 2015 at 8:29 AM, Gregory Farnum  wrote:

> On Thu, Sep 3, 2015 at 11:58 PM, Kyle Hutson  wrote:
> > I was wondering if anybody could give me some insight as to how CephFS
> does
> > its caching - read-caching in particular.
> >
> > We are using CephFS with an EC pool on the backend with a replicated
> cache
> > pool in front of it. We're seeing some very slow read times. Trying to
> > compute an md5sum on a 15GB file twice in a row (so it should be in
> cache)
> > takes the time from 23 minutes down to 17 minutes, but this is over a
> 10Gbps
> > network and with a crap-ton of OSDs (over 300), so I would expect it to
> be
> > down in the 2-3 minute range.
>
> A single sequential read won't necessarily promote an object into the
> cache pool (although if you're using Hammer I think it will), so you
> want to check if it's actually getting promoted into the cache before
> assuming that's happened.
>
> >
> > I'm just trying to figure out what we can do to increase the
> performance. I
> > have over 300 TB of live data that I have to be careful with, though, so
> I
> > have to have some level of caution.
> >
> > Is there some other caching we can do (client-side or server-side) that
> > might give us a decent performance boost?
>
> Which client are you using for this testing? Have you looked at the
> readahead settings? That's usually the big one; if you're only asking
> for 4KB at once then stuff is going to be slow no matter what (a
> single IO takes at minimum about 2 milliseconds right now, although
> the RADOS team is working to improve that).
> -Greg
>
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>


[ceph-users] CephFS and caching

2015-09-03 Thread Kyle Hutson
I was wondering if anybody could give me some insight as to how CephFS does
its caching - read-caching in particular.

We are using CephFS with an EC pool on the backend with a replicated cache
pool in front of it. We're seeing some very slow read times. Trying to
compute an md5sum on a 15GB file twice in a row (so it should be in cache)
takes the time from 23 minutes down to 17 minutes, but this is over a
10Gbps network and with a crap-ton of OSDs (over 300), so I would expect it
to be down in the 2-3 minute range.

I'm just trying to figure out what we can do to increase the performance. I
have over 300 TB of live data that I have to be careful with, though, so I
have to have some level of caution.

Is there some other caching we can do (client-side or server-side) that
might give us a decent performance boost?


Re: [ceph-users] Ceph migration to AWS

2015-05-04 Thread Kyle Bader
> To those interested in a tricky problem,
>
> We have a Ceph cluster running at one of our data centers. One of our
> client's requirements is to have them hosted at AWS. My question is: How do
> we effectively migrate our data on our internal Ceph cluster to an AWS Ceph
> cluster?
>
> Ideas currently on the table:
>
> 1. Build OSDs at AWS and add them to our current Ceph cluster. Build quorum
> at AWS then sever the connection between AWS and our data center.

I would highly discourage this.

> 2. Build a Ceph cluster at AWS and send snapshots from our data center to
> our AWS cluster allowing us to "migrate" to AWS.

This sounds far more sensible. I'd look at the I2 (iops) or D2
(density) class instances, depending on use case.
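
A rough sketch of option 2 for RBD images (pool, image, and host names are made up; the destination image has to exist with the same size before the first import-diff):

# initial sync
rbd snap create vm/disk1@migrate-base
rbd export-diff vm/disk1@migrate-base - | ssh aws-gw rbd import-diff - vm/disk1
# incremental catch-up before cutover
rbd snap create vm/disk1@migrate-1
rbd export-diff --from-snap migrate-base vm/disk1@migrate-1 - | ssh aws-gw rbd import-diff - vm/disk1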

-- 

Kyle


Re: [ceph-users] mds crashing

2015-04-15 Thread Kyle Hutson
Thank you, John!

That was exactly the bug we were hitting. My Google-fu didn't lead me to
this one.
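
For anyone else who lands here, the workaround John describes boils down to something like this (mount point and MDS name are examples):

umount -f -l /mnt/cephfs               # on each client, while the MDS is down
ceph daemon mds.hobbit01 session ls    # after restart, to confirm the stale sessions are gone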

On Wed, Apr 15, 2015 at 4:16 PM, John Spray  wrote:

> On 15/04/2015 20:02, Kyle Hutson wrote:
>
>> I upgraded to 0.94.1 from 0.94 on Monday, and everything had been going
>> pretty well.
>>
>> Then, about noon today, we had an mds crash. And then the failover mds
>> crashed. And this cascaded through all 4 mds servers we have.
>>
>> If I try to start it ('service ceph start mds' on CentOS 7.1), it appears
>> to be OK for a little while. ceph -w goes through 'replay' 'reconnect'
>> 'rejoin' 'clientreplay' and 'active' but nearly immediately after getting
>> to 'active', it crashes again.
>>
>> I have the mds log at
>> http://people.beocat.cis.ksu.edu/~kylehutson/ceph-mds.hobbit01.log
>>
>> For the possibly, but not necessarily, useful background info.
>> - Yesterday we took our erasure coded pool and increased both pg_num and
>> pgp_num from 2048 to 4096. We still have several objects misplaced (~17%),
>> but those seem to be continuing to clean themselves up.
>> - We are in the midst of a large (300+ TB) rsync from our old (non-ceph)
>> filesystem to this filesystem.
>> - Before we realized the mds crashes, we had just changed the size of our
>> metadata pool from 2 to 4.
>>
>
> It looks like you're seeing http://tracker.ceph.com/issues/10449, which
> is a situation where the SessionMap object becomes too big for the MDS to
> save. The cause of it in that case was stuck requests from a misbehaving
> client running a slightly older kernel.
>
> Assuming you're using the kernel client and having a similar problem, you
> could try to work around this situation by forcibly unmounting the clients
> while the MDS is offline, such that during clientreplay the MDS will remove
> them from the SessionMap after timing out, and then next time it tries to
> save the map it won't be oversized.  If that works, you could then look
> into getting newer kernels on the clients to avoid hitting the issue again
> -- the #10449 ticket has some pointers about which kernel changes were
> relevant.
>
> Cheers,
> John
>


[ceph-users] mds crashing

2015-04-15 Thread Kyle Hutson
I upgraded to 0.94.1 from 0.94 on Monday, and everything had been going
pretty well.

Then, about noon today, we had an mds crash. And then the failover mds
crashed. And this cascaded through all 4 mds servers we have.

If I try to start it ('service ceph start mds' on CentOS 7.1), it appears
to be OK for a little while. ceph -w goes through 'replay' 'reconnect'
'rejoin' 'clientreplay' and 'active' but nearly immediately after getting
to 'active', it crashes again.

I have the mds log at
http://people.beocat.cis.ksu.edu/~kylehutson/ceph-mds.hobbit01.log

For the possibly, but not necessarily, useful background info.
- Yesterday we took our erasure coded pool and increased both pg_num and
pgp_num from 2048 to 4096. We still have several objects misplaced (~17%),
but those seem to be continuing to clean themselves up.
- We are in the midst of a large (300+ TB) rsync from our old (non-ceph)
filesystem to this filesystem.
- Before we realized the mds crashes, we had just changed the size of our
metadata pool from 2 to 4.


Re: [ceph-users] "protocol feature mismatch" after upgrading to Hammer

2015-04-09 Thread Kyle Hutson
http://people.beocat.cis.ksu.edu/~kylehutson/crushmap

On Thu, Apr 9, 2015 at 11:25 AM, Gregory Farnum  wrote:

> Hmmm. That does look right and neither I nor Sage can come up with
> anything via code inspection. Can you post the actual binary crush map
> somewhere for download so that we can inspect it with our tools?
> -Greg
>
> On Thu, Apr 9, 2015 at 7:57 AM, Kyle Hutson  wrote:
> > Here 'tis:
> > https://dpaste.de/POr1
> >
> >
> > On Thu, Apr 9, 2015 at 9:49 AM, Gregory Farnum  wrote:
> >>
> >> Can you dump your crush map and post it on pastebin or something?
> >>
> >> On Thu, Apr 9, 2015 at 7:26 AM, Kyle Hutson  wrote:
> >> > Nope - it's 64-bit.
> >> >
> >> > (Sorry, I missed the reply-all last time.)
> >> >
> >> > On Thu, Apr 9, 2015 at 9:24 AM, Gregory Farnum 
> wrote:
> >> >>
> >> >> [Re-added the list]
> >> >>
> >> >> Hmm, I'm checking the code and that shouldn't be possible. What's your
> >> >> client? (In particular, is it 32-bit? That's the only thing I can
> >> >> think of that might have slipped through our QA.)
> >> >>
> >> >> On Thu, Apr 9, 2015 at 7:17 AM, Kyle Hutson 
> wrote:
> >> >> > I did nothing to enable anything else. Just changed my ceph repo
> from
> >> >> > 'giant' to 'hammer', then did 'yum update' and restarted services.
> >> >> >
> >> >> > On Thu, Apr 9, 2015 at 9:15 AM, Gregory Farnum 
> >> >> > wrote:
> >> >> >>
> >> >> >> Did you enable the straw2 stuff? CRUSHV4 shouldn't be required by
> >> >> >> the
> >> >> >> cluster unless you made changes to the layout requiring it.
> >> >> >>
> >> >> >> If you did, the clients have to be upgraded to understand it. You
> >> >> >> could disable all the v4 features; that should let them connect
> >> >> >> again.
> >> >> >> -Greg
> >> >> >>
> >> >> >> On Thu, Apr 9, 2015 at 7:07 AM, Kyle Hutson 
> >> >> >> wrote:
> >> >> >> > This particular problem I just figured out myself ('ceph -w' was
> >> >> >> > still
> >> >> >> > running from before the upgrade, and ctrl-c and restarting
> solved
> >> >> >> > that
> >> >> >> > issue), but I'm still having a similar problem on the ceph
> client:
> >> >> >> >
> >> >> >> > libceph: mon19 10.5.38.20:6789 feature set mismatch, my
> >> >> >> > 2b84a042aca <
> >> >> >> > server's 102b84a042aca, missing 1
> >> >> >> >
> >> >> >> > It appears that even the latest kernel doesn't have support for
> >> >> >> > CEPH_FEATURE_CRUSH_V4
> >> >> >> >
> >> >> >> > How do I make my ceph cluster backward-compatible with the old
> >> >> >> > cephfs
> >> >> >> > client?
> >> >> >> >
> >> >> >> > On Thu, Apr 9, 2015 at 8:58 AM, Kyle Hutson  >
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> I upgraded from giant to hammer yesterday and now 'ceph -w' is
> >> >> >> >> constantly
> >> >> >> >> repeating this message:
> >> >> >> >>
> >> >> >> >> 2015-04-09 08:50:26.318042 7f95dbf86700  0 --
> 10.5.38.1:0/2037478
> >> >> >> >> >>
> >> >> >> >> 10.5.38.1:6789/0 pipe(0x7f95e00256e0 sd=3 :39489 s=1 pgs=0
> cs=0
> >> >> >> >> l=1
> >> >> >> >> c=0x7f95e0023670).connect protocol feature mismatch, my
> >> >> >> >> 3fff
> >> >> >> >> <
> >> >> >> >> peer
> >> >> >> >> 13fff missing 1
> >> >> >> >>
> >> >> >> >> It isn't always the same IP for the destination - here's
> another:
> >> >> >> >> 2015-04-09 08:50:20.322059 7f95dc087700  0 --
> 10.5.38.1:0/2037478
> >> >> >> >> >>
> >> >> >> >> 10.5.38.8:6789/0 pipe(0x7f95e00262f0 sd=3 :54047 s=1 pgs=0
> cs=0
> >> >> >> >> l=1
> >> >> >> >> c=0x7f95e002b480).connect protocol feature mismatch, my
> >> >> >> >> 3fff
> >> >> >> >> <
> >> >> >> >> peer
> >> >> >> >> 13fff missing 1
> >> >> >> >>
> >> >> >> >> Some details about our install:
> >> >> >> >> We have 24 hosts with 18 OSDs each. 16 per host are spinning
> >> >> >> >> disks
> >> >> >> >> in
> >> >> >> >> an
> >> >> >> >> erasure coded pool (k=8 m=4). 2 OSDs per host are SSD
> partitions
> >> >> >> >> used
> >> >> >> >> for a
> >> >> >> >> caching tier in front of the EC pool. All 24 hosts are
> monitors.
> >> >> >> >> 4
> >> >> >> >> hosts are
> >> >> >> >> mds. We are running cephfs with a client trying to write data
> >> >> >> >> over
> >> >> >> >> cephfs
> >> >> >> >> when we're seeing these messages.
> >> >> >> >>
> >> >> >> >> Any ideas?
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > ___
> >> >> >> > ceph-users mailing list
> >> >> >> > ceph-users@lists.ceph.com
> >> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>


Re: [ceph-users] "protocol feature mismatch" after upgrading to Hammer

2015-04-09 Thread Kyle Hutson
Here 'tis:
https://dpaste.de/POr1


On Thu, Apr 9, 2015 at 9:49 AM, Gregory Farnum  wrote:

> Can you dump your crush map and post it on pastebin or something?
>
> On Thu, Apr 9, 2015 at 7:26 AM, Kyle Hutson  wrote:
> > Nope - it's 64-bit.
> >
> > (Sorry, I missed the reply-all last time.)
> >
> > On Thu, Apr 9, 2015 at 9:24 AM, Gregory Farnum  wrote:
> >>
> >> [Re-added the list]
> >>
> >> Hmm, I'm checking the code and that shouldn't be possible. What's your
> >> client? (In particular, is it 32-bit? That's the only thing I can
> >> think of that might have slipped through our QA.)
> >>
> >> On Thu, Apr 9, 2015 at 7:17 AM, Kyle Hutson  wrote:
> >> > I did nothing to enable anything else. Just changed my ceph repo from
> >> > 'giant' to 'hammer', then did 'yum update' and restarted services.
> >> >
> >> > On Thu, Apr 9, 2015 at 9:15 AM, Gregory Farnum 
> wrote:
> >> >>
> >> >> Did you enable the straw2 stuff? CRUSHV4 shouldn't be required by the
> >> >> cluster unless you made changes to the layout requiring it.
> >> >>
> >> >> If you did, the clients have to be upgraded to understand it. You
> >> >> could disable all the v4 features; that should let them connect
> again.
> >> >> -Greg
> >> >>
> >> >> On Thu, Apr 9, 2015 at 7:07 AM, Kyle Hutson 
> wrote:
> >> >> > This particular problem I just figured out myself ('ceph -w' was
> >> >> > still
> >> >> > running from before the upgrade, and ctrl-c and restarting solved
> >> >> > that
> >> >> > issue), but I'm still having a similar problem on the ceph client:
> >> >> >
> >> >> > libceph: mon19 10.5.38.20:6789 feature set mismatch, my
> 2b84a042aca <
> >> >> > server's 102b84a042aca, missing 1
> >> >> >
> >> >> > It appears that even the latest kernel doesn't have support for
> >> >> > CEPH_FEATURE_CRUSH_V4
> >> >> >
> >> >> > How do I make my ceph cluster backward-compatible with the old
> cephfs
> >> >> > client?
> >> >> >
> >> >> > On Thu, Apr 9, 2015 at 8:58 AM, Kyle Hutson 
> >> >> > wrote:
> >> >> >>
> >> >> >> I upgraded from giant to hammer yesterday and now 'ceph -w' is
> >> >> >> constantly
> >> >> >> repeating this message:
> >> >> >>
> >> >> >> 2015-04-09 08:50:26.318042 7f95dbf86700  0 -- 10.5.38.1:0/2037478
> >>
> >> >> >> 10.5.38.1:6789/0 pipe(0x7f95e00256e0 sd=3 :39489 s=1 pgs=0 cs=0
> l=1
> >> >> >> c=0x7f95e0023670).connect protocol feature mismatch, my
> 3fff
> >> >> >> <
> >> >> >> peer
> >> >> >> 13fff missing 1
> >> >> >>
> >> >> >> It isn't always the same IP for the destination - here's another:
> >> >> >> 2015-04-09 08:50:20.322059 7f95dc087700  0 -- 10.5.38.1:0/2037478
> >>
> >> >> >> 10.5.38.8:6789/0 pipe(0x7f95e00262f0 sd=3 :54047 s=1 pgs=0 cs=0
> l=1
> >> >> >> c=0x7f95e002b480).connect protocol feature mismatch, my
> 3fff
> >> >> >> <
> >> >> >> peer
> >> >> >> 13fff missing 1
> >> >> >>
> >> >> >> Some details about our install:
> >> >> >> We have 24 hosts with 18 OSDs each. 16 per host are spinning disks
> >> >> >> in
> >> >> >> an
> >> >> >> erasure coded pool (k=8 m=4). 2 OSDs per host are SSD partitions
> >> >> >> used
> >> >> >> for a
> >> >> >> caching tier in front of the EC pool. All 24 hosts are monitors. 4
> >> >> >> hosts are
> >> >> >> mds. We are running cephfs with a client trying to write data over
> >> >> >> cephfs
> >> >> >> when we're seeing these messages.
> >> >> >>
> >> >> >> Any ideas?
> >> >> >
> >> >> >
> >> >> >
> >> >> > ___
> >> >> > ceph-users mailing list
> >> >> > ceph-users@lists.ceph.com
> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >
> >> >
> >> >
> >
> >
>


Re: [ceph-users] "protocol feature mismatch" after upgrading to Hammer

2015-04-09 Thread Kyle Hutson
Nope - it's 64-bit.

(Sorry, I missed the reply-all last time.)

On Thu, Apr 9, 2015 at 9:24 AM, Gregory Farnum  wrote:

> [Re-added the list]
>
> Hmm, I'm checking the code and that shouldn't be possible. What's your
> client? (In particular, is it 32-bit? That's the only thing I can
> think of that might have slipped through our QA.)
>
> On Thu, Apr 9, 2015 at 7:17 AM, Kyle Hutson  wrote:
> > I did nothing to enable anything else. Just changed my ceph repo from
> > 'giant' to 'hammer', then did 'yum update' and restarted services.
> >
> > On Thu, Apr 9, 2015 at 9:15 AM, Gregory Farnum  wrote:
> >>
> >> Did you enable the straw2 stuff? CRUSHV4 shouldn't be required by the
> >> cluster unless you made changes to the layout requiring it.
> >>
> >> If you did, the clients have to be upgraded to understand it. You
> >> could disable all the v4 features; that should let them connect again.
> >> -Greg
> >>
> >> On Thu, Apr 9, 2015 at 7:07 AM, Kyle Hutson  wrote:
> >> > This particular problem I just figured out myself ('ceph -w' was still
> >> > running from before the upgrade, and ctrl-c and restarting solved that
> >> > issue), but I'm still having a similar problem on the ceph client:
> >> >
> >> > libceph: mon19 10.5.38.20:6789 feature set mismatch, my 2b84a042aca <
> >> > server's 102b84a042aca, missing 1
> >> >
> >> > It appears that even the latest kernel doesn't have support for
> >> > CEPH_FEATURE_CRUSH_V4
> >> >
> >> > How do I make my ceph cluster backward-compatible with the old cephfs
> >> > client?
> >> >
> >> > On Thu, Apr 9, 2015 at 8:58 AM, Kyle Hutson 
> wrote:
> >> >>
> >> >> I upgraded from giant to hammer yesterday and now 'ceph -w' is
> >> >> constantly
> >> >> repeating this message:
> >> >>
> >> >> 2015-04-09 08:50:26.318042 7f95dbf86700  0 -- 10.5.38.1:0/2037478 >>
> >> >> 10.5.38.1:6789/0 pipe(0x7f95e00256e0 sd=3 :39489 s=1 pgs=0 cs=0 l=1
> >> >> c=0x7f95e0023670).connect protocol feature mismatch, my 3fff
> <
> >> >> peer
> >> >> 13fff missing 1
> >> >>
> >> >> It isn't always the same IP for the destination - here's another:
> >> >> 2015-04-09 08:50:20.322059 7f95dc087700  0 -- 10.5.38.1:0/2037478 >>
> >> >> 10.5.38.8:6789/0 pipe(0x7f95e00262f0 sd=3 :54047 s=1 pgs=0 cs=0 l=1
> >> >> c=0x7f95e002b480).connect protocol feature mismatch, my 3fff
> <
> >> >> peer
> >> >> 13fff missing 1
> >> >>
> >> >> Some details about our install:
> >> >> We have 24 hosts with 18 OSDs each. 16 per host are spinning disks in
> >> >> an
> >> >> erasure coded pool (k=8 m=4). 2 OSDs per host are SSD partitions used
> >> >> for a
> >> >> caching tier in front of the EC pool. All 24 hosts are monitors. 4
> >> >> hosts are
> >> >> mds. We are running cephfs with a client trying to write data over
> >> >> cephfs
> >> >> when we're seeing these messages.
> >> >>
> >> >> Any ideas?
> >> >
> >> >
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "protocol feature mismatch" after upgrading to Hammer

2015-04-09 Thread Kyle Hutson
This particular problem I just figured out myself ('ceph -w' was still
running from before the upgrade, and ctrl-c and restarting solved that
issue), but I'm still having a similar problem on the ceph client:

libceph: mon19 10.5.38.20:6789 feature set mismatch, my 2b84a042aca <
server's 102b84a042aca, missing 1

It appears that even the latest kernel doesn't have support
for CEPH_FEATURE_CRUSH_V4

How do I make my ceph cluster backward-compatible with the old cephfs
client?
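
A rough sketch of what "disabling the v4 features" could look like, assuming straw2 buckets are what is setting the CRUSH_V4 requirement (check your own decompiled map before editing anything):

ceph osd crush show-tunables          # see which tunables/profile the cluster currently requires
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt and change any "alg straw2" back to "alg straw"
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new

Older kernel clients should then be able to connect again, at the cost of losing the straw2 placement improvements.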

On Thu, Apr 9, 2015 at 8:58 AM, Kyle Hutson  wrote:

> I upgraded from giant to hammer yesterday and now 'ceph -w' is constantly
> repeating this message:
>
> 2015-04-09 08:50:26.318042 7f95dbf86700  0 -- 10.5.38.1:0/2037478 >>
> 10.5.38.1:6789/0 pipe(0x7f95e00256e0 sd=3 :39489 s=1 pgs=0 cs=0 l=1
> c=0x7f95e0023670).connect protocol feature mismatch, my 3fff < peer
> 13fff missing 1
>
> It isn't always the same IP for the destination - here's another:
> 2015-04-09 08:50:20.322059 7f95dc087700  0 -- 10.5.38.1:0/2037478 >>
> 10.5.38.8:6789/0 pipe(0x7f95e00262f0 sd=3 :54047 s=1 pgs=0 cs=0 l=1
> c=0x7f95e002b480).connect protocol feature mismatch, my 3fff < peer
> 13fff missing 1
>
> Some details about our install:
> We have 24 hosts with 18 OSDs each. 16 per host are spinning disks in an
> erasure coded pool (k=8 m=4). 2 OSDs per host are SSD partitions used for a
> caching tier in front of the EC pool. All 24 hosts are monitors. 4 hosts
> are mds. We are running cephfs with a client trying to write data over
> cephfs when we're seeing these messages.
>
> Any ideas?
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] "protocol feature mismatch" after upgrading to Hammer

2015-04-09 Thread Kyle Hutson
I upgraded from giant to hammer yesterday and now 'ceph -w' is constantly
repeating this message:

2015-04-09 08:50:26.318042 7f95dbf86700  0 -- 10.5.38.1:0/2037478 >>
10.5.38.1:6789/0 pipe(0x7f95e00256e0 sd=3 :39489 s=1 pgs=0 cs=0 l=1
c=0x7f95e0023670).connect protocol feature mismatch, my 3fff < peer
13fff missing 1

It isn't always the same IP for the destination - here's another:
2015-04-09 08:50:20.322059 7f95dc087700  0 -- 10.5.38.1:0/2037478 >>
10.5.38.8:6789/0 pipe(0x7f95e00262f0 sd=3 :54047 s=1 pgs=0 cs=0 l=1
c=0x7f95e002b480).connect protocol feature mismatch, my 3fff < peer
13fff missing 1

Some details about our install:
We have 24 hosts with 18 OSDs each. 16 per host are spinning disks in an
erasure coded pool (k=8 m=4). 2 OSDs per host are SSD partitions used for a
caching tier in front of the EC pool. All 24 hosts are monitors. 4 hosts
are mds. We are running cephfs with a client trying to write data over
cephfs when we're seeing these messages.

Any ideas?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)

2015-03-26 Thread Kyle Hutson
For what it's worth, I don't think "being patient" was the answer. I was
having the same problem a couple of weeks ago, and I waited from before 5pm
one day until after 8am the next, and still got the same errors. I ended up
creating a "new" cephfs on a newly-created small pool, but was never
able to actually remove cephfs altogether.
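
For the archives, the teardown sequence that eventually worked in the thread below was roughly this (the filesystem name is the one from Jake's cluster; substitute your own):

ceph mds set_max_mds 0
ceph mds stop 0        # may need to be repeated / given time until the MDS is no longer up:stopping
# stop the ceph-mds daemon(s) once they are out of the map
ceph fs rm cephfs2 --yes-i-really-mean-it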

On Thu, Mar 26, 2015 at 12:45 PM, Jake Grimmett 
wrote:

> On 03/25/2015 05:44 PM, Gregory Farnum wrote:
>
>> On Wed, Mar 25, 2015 at 10:36 AM, Jake Grimmett 
>> wrote:
>>
>>> Dear All,
>>>
>>> Please forgive this post if it's naive, I'm trying to familiarise myself
>>> with cephfs!
>>>
>>> I'm using Scientific Linux 6.6. with Ceph 0.87.1
>>>
>>> My first steps with cephfs using a replicated pool worked OK.
>>>
>>> Now trying now to test cephfs via a replicated caching tier on top of an
>>> erasure pool. I've created an erasure pool, cannot put it under the
>>> existing
>>> replicated pool.
>>>
>>> My thoughts were to delete the existing cephfs, and start again, however
>>> I
>>> cannot delete the existing cephfs:
>>>
>>> errors are as follows:
>>>
>>> [root@ceph1 ~]# ceph fs rm cephfs2
>>> Error EINVAL: all MDS daemons must be inactive before removing filesystem
>>>
>>> I've tried killing the ceph-mds process, but this does not prevent the
>>> above
>>> error.
>>>
>>> I've also tried this, which also errors:
>>>
>>> [root@ceph1 ~]# ceph mds stop 0
>>> Error EBUSY: must decrease max_mds or else MDS will immediately
>>> reactivate
>>>
>>
>> Right, so did you run "ceph mds set_max_mds 0" and then repeating the
>> stop command? :)
>>
>>
>>> This also fail...
>>>
>>> [root@ceph1 ~]# ceph-deploy mds destroy
>>> [ceph_deploy.conf][DEBUG ] found configuration file at:
>>> /root/.cephdeploy.conf
>>> [ceph_deploy.cli][INFO  ] Invoked (1.5.21): /usr/bin/ceph-deploy mds
>>> destroy
>>> [ceph_deploy.mds][ERROR ] subcommand destroy not implemented
>>>
>>> Am I doing the right thing in trying to wipe the original cephfs config
>>> before attempting to use an erasure cold tier? Or can I just redefine the
>>> cephfs?
>>>
>>
>> Yeah, unfortunately you need to recreate it if you want to try and use
>> an EC pool with cache tiering, because CephFS knows what pools it
>> expects data to belong to. Things are unlikely to behave correctly if
>> you try and stick an EC pool under an existing one. :(
>>
>> Sounds like this is all just testing, which is good because the
>> suitability of EC+cache is very dependent on how much hot data you
>> have, etc...good luck!
>> -Greg
>>
>>
>>> many thanks,
>>>
>>> Jake Grimmett
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
> Thanks for your help - much appreciated.
>
> The "set_max_mds 0" command worked, but only after I rebooted the server,
> and restarted ceph twice. Before this I still got an
> "mds active" error, and so was unable to destroy the cephfs.
>
> Possibly I was being impatient, and needed to let mds go inactive? there
> were ~1 million files on the system.
>
> [root@ceph1 ~]# ceph mds set_max_mds 0
> max_mds = 0
>
> [root@ceph1 ~]# ceph mds stop 0
> telling mds.0 10.1.0.86:6811/3249 to deactivate
>
> [root@ceph1 ~]# ceph mds stop 0
> Error EEXIST: mds.0 not active (up:stopping)
>
> [root@ceph1 ~]# ceph fs rm cephfs2
> Error EINVAL: all MDS daemons must be inactive before removing filesystem
>
> There shouldn't be any other mds servers running..
> [root@ceph1 ~]# ceph mds stop 1
> Error EEXIST: mds.1 not active (down:dne)
>
> At this point I rebooted the server, did a "service ceph restart" twice.
> Shutdown ceph, then restarted ceph before this command worked:
>
> [root@ceph1 ~]# ceph fs rm cephfs2 --yes-i-really-mean-it
>
> Anyhow, I've now been able to create an erasure coded pool, with a
> replicated tier which cephfs is running on :)
>
> *Lots* of testing to go!
>
> Again, many thanks
>
> Jake
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New EC pool undersized

2015-03-04 Thread Kyle Hutson
So it sounds like I should figure out at 'how many nodes' I need to
increase pg_num to 4096, and again to 8192, and increase those
incrementally as I add more hosts, correct?
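
Working through Don's formula below for this cluster, as a sketch (the ~100 PGs per OSD target is the usual rule of thumb):

# pg_num ~= (OSDs * 100) / (k + m), rounded up to the next power of 2
# 144 OSDs: 144 * 100 / 8 = 1800  -> 2048
# 384 OSDs: 384 * 100 / 8 = 4800  -> 8192

and since pg_num can only ever be increased, the bump at each growth step would be e.g.:

ceph osd pool set ec44pool pg_num 4096
ceph osd pool set ec44pool pgp_num 4096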

On Wed, Mar 4, 2015 at 3:04 PM, Don Doerner  wrote:

>  Sorry, I missed your other questions, down at the bottom.  See here
> <http://ceph.com/docs/master/rados/operations/placement-groups/> (look
> for “number of replicas for replicated pools or the K+M sum for erasure
> coded pools”) for the formula; 38400/8 probably implies 8192.
>
>
>
> The thing is, you’ve got to think about how many ways you can form
> combinations of 8 unique OSDs (with replacement) that match your failure
> domain rules.  If you’ve only got 8 hosts, and your failure domain is
> hosts, it severely limits this number.  And I have read that too many
> isn’t good either – a serialization issue, I believe.
>
>
>
> -don-
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Don Doerner
> *Sent:* 04 March, 2015 12:49
> *To:* Kyle Hutson
> *Cc:* ceph-users@lists.ceph.com
>
> *Subject:* Re: [ceph-users] New EC pool undersized
>
>
>
> Hmmm, I just struggled through this myself.  How many racks do you have?  If
> not more than 8, you might want to make your failure domain smaller?  I.e.,
> maybe host?  That, at least, would allow you to debug the situation…
>
>
>
> -don-
>
>
>
> *From:* Kyle Hutson [mailto:kylehut...@ksu.edu ]
> *Sent:* 04 March, 2015 12:43
> *To:* Don Doerner
> *Cc:* Ceph Users
> *Subject:* Re: [ceph-users] New EC pool undersized
>
>
>
> It wouldn't let me simply change the pg_num, giving
>
> Error EEXIST: specified pg_num 2048 <= current 8192
>
>
>
> But that's not a big deal, I just deleted the pool and recreated with
> 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile'
>
> ...and the result is quite similar: 'ceph status' is now
>
> ceph status
>
> cluster 196e5eb8-d6a7-4435-907e-ea028e946923
>
>  health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs
> undersized
>
>  monmap e1: 4 mons at {hobbit01=
> 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0
> <https://urldefense.proofpoint.com/v1/url?u=http://10.5.38.1:6789/0%2Chobbit02%3D10.5.38.2:6789/0%2Chobbit13%3D10.5.38.13:6789/0%2Chobbit14%3D10.5.38.14:6789/0&k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0A&r=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0A&m=fHQcjtxx3uADdikQAQAh65Z0s%2FzNFIj544bRY5zThgI%3D%0A&s=01b7463be37041310163f5d75abc634fab3280633eaef2158ed6609c6f3978d8>},
> election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14
>
>  osdmap e412: 144 osds: 144 up, 144 in
>
>   pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects
>
> 90590 MB used, 640 TB / 640 TB avail
>
>4 active+undersized+degraded
>
> 6140 active+clean
>
>
>
> 'ceph pg dump_stuck results' in
>
> ok
>
> pg_stat   objects   mip  degr misp unf  bytes log  disklog state
> state_stampvreported  up   up_primary actingacting_primary
> last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
>
> 2.296 00000000
> active+undersized+degraded2015-03-04 14:33:26.672224 0'0  412:9
> [5,55,91,2147483647,83,135,53,26]  5 [5,55,91,2147483647,83,135,53,26]
> 50'0  2015-03-04 14:33:15.649911 0'0  2015-03-04 14:33:15.649911
>
> 2.69c 00000000
> active+undersized+degraded2015-03-04 14:33:24.984802 0'0  412:9
> [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647
> ,60] 93   0'0  2015-03-04 14:33:15.695747 0'0  2015-03-04
> 14:33:15.695747
>
> 2.36d 00000000
> active+undersized+degraded2015-03-04 14:33:21.937620 0'0  412:9
> [12,108,136,104,52,18,63,2147483647]12   [12,108,136,104,52,18,63,
> 2147483647]12   0'0  2015-03-04 14:33:15.6524800'0  2015-03-04
> 14:33:15.652480
>
> 2.5f7 00000000
> active+undersized+degraded2015-03-04 14:33:26.169242 0'0  412:9
> [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647
> ,113] 94   0'0  2015-03-04 14:33:15.687695 0'0  2015-03-04
> 14:33:15.687695
>
>
>
> I do have questions for you, even at this point, though.
>
> 1) Where did you find the formula (14400/(k+m))?
>
> 2) I was really trying to size this for when it goes to production, at
> which point it may have as many

Re: [ceph-users] New EC pool undersized

2015-03-04 Thread Kyle Hutson
That did it.

'step set_choose_tries 200'  fixed the problem right away.

Thanks Yann!
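
For anyone who hits the same thing: the fix from the "CRUSH gives up too soon" doc linked below is applied by round-tripping the crushmap, something like:

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# in the rule used by the EC pool, add (before the "step take" line):
#     step set_choose_tries 200
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new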

On Wed, Mar 4, 2015 at 2:59 PM, Yann Dupont  wrote:

>
> On 04/03/2015 21:48, Don Doerner wrote:
>
>  Hmmm, I just struggled through this myself.  How many racks do you have?
> If not more than 8, you might want to make your failure domain smaller?  I.e.,
> maybe host?  That, at least, would allow you to debug the situation…
>
>
>
> -don-
>
>
>
>
> Hello, I think I already had this problem.
> It's explained here
> http://tracker.ceph.com/issues/10350
>
> And solution is probably here :
> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
>
> Section : "CRUSH gives up too soon"
>
> Cheers,
> Yann
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New EC pool undersized

2015-03-04 Thread Kyle Hutson
My lowest level (other than OSD) is 'disktype' (based on the crushmaps at
http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
) since I have SSDs and HDDs on the same host.

I just made that change (deleted the pool, deleted the profile, deleted the
crush ruleset), then re-created using ruleset-failure-domain=disktype. Very
similar results.
health HEALTH_WARN 3 pgs degraded; 3 pgs stuck unclean; 3 pgs undersized
'ceph pg dump stuck' looks very similar to the last one I posted.
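
(The recreate sequence, for reference -- pool/profile names are the ones used elsewhere in this thread, and the crush rule auto-created for an EC pool normally takes the pool's name:

ceph osd pool delete ec44pool ec44pool --yes-i-really-really-mean-it
ceph osd crush rule rm ec44pool
ceph osd erasure-code-profile rm ec44profile
ceph osd erasure-code-profile set ec44profile k=4 m=4 ruleset-failure-domain=disktype
ceph osd pool create ec44pool 2048 2048 erasure ec44profile)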

On Wed, Mar 4, 2015 at 2:48 PM, Don Doerner  wrote:

>  Hmmm, I just struggled through this myself.  How many racks do you have?
> If not more than 8, you might want to make your failure domain smaller?  I.e.,
> maybe host?  That, at least, would allow you to debug the situation…
>
>
>
> -don-
>
>
>
> *From:* Kyle Hutson [mailto:kylehut...@ksu.edu]
> *Sent:* 04 March, 2015 12:43
> *To:* Don Doerner
> *Cc:* Ceph Users
>
> *Subject:* Re: [ceph-users] New EC pool undersized
>
>
>
> It wouldn't let me simply change the pg_num, giving
>
> Error EEXIST: specified pg_num 2048 <= current 8192
>
>
>
> But that's not a big deal, I just deleted the pool and recreated with
> 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile'
>
> ...and the result is quite similar: 'ceph status' is now
>
> ceph status
>
> cluster 196e5eb8-d6a7-4435-907e-ea028e946923
>
>  health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs
> undersized
>
>  monmap e1: 4 mons at {hobbit01=
> 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0
> <https://urldefense.proofpoint.com/v1/url?u=http://10.5.38.1:6789/0%2Chobbit02%3D10.5.38.2:6789/0%2Chobbit13%3D10.5.38.13:6789/0%2Chobbit14%3D10.5.38.14:6789/0&k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0A&r=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0A&m=fHQcjtxx3uADdikQAQAh65Z0s%2FzNFIj544bRY5zThgI%3D%0A&s=01b7463be37041310163f5d75abc634fab3280633eaef2158ed6609c6f3978d8>},
> election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14
>
>  osdmap e412: 144 osds: 144 up, 144 in
>
>   pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects
>
> 90590 MB used, 640 TB / 640 TB avail
>
>4 active+undersized+degraded
>
> 6140 active+clean
>
>
>
> 'ceph pg dump_stuck results' in
>
> ok
>
> pg_stat   objects   mip  degr misp unf  bytes log  disklog state
> state_stampvreported  up   up_primary actingacting_primary
> last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
>
> 2.296 00000000
> active+undersized+degraded2015-03-04 14:33:26.672224 0'0  412:9
> [5,55,91,2147483647,83,135,53,26]  5 [5,55,91,2147483647,83,135,53,26]
> 50'0  2015-03-04 14:33:15.649911 0'0  2015-03-04 14:33:15.649911
>
> 2.69c 00000000
> active+undersized+degraded2015-03-04 14:33:24.984802 0'0  412:9
> [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647
> ,60] 93   0'0  2015-03-04 14:33:15.695747 0'0  2015-03-04
> 14:33:15.695747
>
> 2.36d 00000000
> active+undersized+degraded2015-03-04 14:33:21.937620 0'0  412:9
> [12,108,136,104,52,18,63,2147483647]12 [12,108,136,104,52,18,63,
> 2147483647]12   0'0  2015-03-04 14:33:15.6524800'0  2015-03-04
> 14:33:15.652480
>
> 2.5f7 00000000
> active+undersized+degraded2015-03-04 14:33:26.169242 0'0  412:9
> [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647
> ,113] 94   0'0  2015-03-04 14:33:15.687695 0'0  2015-03-04
> 14:33:15.687695
>
>
>
> I do have questions for you, even at this point, though.
>
> 1) Where did you find the formula (14400/(k+m))?
>
> 2) I was really trying to size this for when it goes to production, at
> which point it may have as many as 384 OSDs. Doesn't that imply I should
> have even more pgs?
>
>
>
> On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner 
> wrote:
>
> Oh duh…  OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so
> try 2048.
>
>
>
> -don-
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Don Doerner
> *Sent:* 04 March, 2015 12:14
> *To:* Kyle Hutson; Ceph Users
> *Subject:* Re: [ceph-users] New EC pool undersized
>
>
>
> In this case, that number means that there is not an OSD that can be
> assigned.  What’s your k, m fro

Re: [ceph-users] New EC pool undersized

2015-03-04 Thread Kyle Hutson
It wouldn't let me simply change the pg_num, giving
Error EEXIST: specified pg_num 2048 <= current 8192

But that's not a big deal, I just deleted the pool and recreated with 'ceph
osd pool create ec44pool 2048 2048 erasure ec44profile'
...and the result is quite similar: 'ceph status' is now
ceph status
cluster 196e5eb8-d6a7-4435-907e-ea028e946923
 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs
undersized
 monmap e1: 4 mons at {hobbit01=
10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0},
election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14
 osdmap e412: 144 osds: 144 up, 144 in
  pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects
90590 MB used, 640 TB / 640 TB avail
   4 active+undersized+degraded
6140 active+clean

'ceph pg dump_stuck' results in
ok
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
2.296 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911
2.69c 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747
2.36d 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:21.937620 0'0 412:9 [12,108,136,104,52,18,63,2147483647] 12 [12,108,136,104,52,18,63,2147483647] 12 0'0 2015-03-04 14:33:15.652480 0'0 2015-03-04 14:33:15.652480
2.5f7 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:26.169242 0'0 412:9 [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647,113] 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695

I do have questions for you, even at this point, though.
1) Where did you find the formula (14400/(k+m))?
2) I was really trying to size this for when it goes to production, at
which point it may have as many as 384 OSDs. Doesn't that imply I should
have even more pgs?

On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner  wrote:

>  Oh duh…  OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so
> try 2048.
>
>
>
> -don-
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Don Doerner
> *Sent:* 04 March, 2015 12:14
> *To:* Kyle Hutson; Ceph Users
> *Subject:* Re: [ceph-users] New EC pool undersized
>
>
>
> In this case, that number means that there is not an OSD that can be
> assigned.  What’s your k, m from you erasure coded pool?  You’ll need
> approximately (14400/(k+m)) PGs, rounded up to the next power of 2…
>
>
>
> -don-
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] *On Behalf Of *Kyle Hutson
> *Sent:* 04 March, 2015 12:06
> *To:* Ceph Users
> *Subject:* [ceph-users] New EC pool undersized
>
>
>
> Last night I blew away my previous ceph configuration (this environment is
> pre-production) and have 0.87.1 installed. I've manually edited the
> crushmap so it down looks like https://dpaste.de/OLEa
> <https://urldefense.proofpoint.com/v1/url?u=https://dpaste.de/OLEa&k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0A&r=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0A&m=JSfAuDHRgKln0yM%2FQGMT3hZb3rVLUpdn2wGdV3C0Rbk%3D%0A&s=c1bd46dcd96e656554817882d7f6581903b1e3c6a50313f4bf7494acfd12b442>
>
>
>
> I currently have 144 OSDs on 8 nodes.
>
>
>
> After increasing pg_num and pgp_num to a more suitable 1024 (due to the
> high number of OSDs), everything looked happy.
>
> So, now I'm trying to play with an erasure-coded pool.
>
> I did:
>
> ceph osd erasure-code-profile set ec44profile k=4 m=4
> ruleset-failure-domain=rack
>
> ceph osd pool create ec44pool 8192 8192 erasure ec44profile
>
>
>
> After settling for a bit 'ceph status' gives
>
> cluster 196e5eb8-d6a7-4435-907e-ea028e946923
>
>  health HEALTH_WARN 7 pgs degraded; 7 pgs stuck degraded; 7 pgs stuck
> unclean; 7 pgs stuck undersized; 7 pgs undersized
>
>  monmap e1: 4 mons at {hobbit01=
> 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0
> <https://urldefense.proofpoint.com/v1/url?u=http://10.5.38.1:6789/0%2Chobbit02%3D10.5.38.2:6789/0%2Chobbit13%3D10.5.38.13:6789/0%2Chobbit14%3D10.5.38.14:6789/0&k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0A&r=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0A&m=JSfAuDHRgKln0yM%2FQGMT3hZb3rVLUpdn2w

[ceph-users] New EC pool undersized

2015-03-04 Thread Kyle Hutson
Last night I blew away my previous ceph configuration (this environment is
pre-production) and have 0.87.1 installed. I've manually edited the
crushmap so it now looks like https://dpaste.de/OLEa

I currently have 144 OSDs on 8 nodes.

After increasing pg_num and pgp_num to a more suitable 1024 (due to the
high number of OSDs), everything looked happy.
So, now I'm trying to play with an erasure-coded pool.
I did:
ceph osd erasure-code-profile set ec44profile k=4 m=4
ruleset-failure-domain=rack
ceph osd pool create ec44pool 8192 8192 erasure ec44profile

After settling for a bit 'ceph status' gives
cluster 196e5eb8-d6a7-4435-907e-ea028e946923
 health HEALTH_WARN 7 pgs degraded; 7 pgs stuck degraded; 7 pgs stuck
unclean; 7 pgs stuck undersized; 7 pgs undersized
 monmap e1: 4 mons at {hobbit01=
10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0},
election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14
 osdmap e409: 144 osds: 144 up, 144 in
  pgmap v6763: 12288 pgs, 2 pools, 0 bytes data, 0 objects
90598 MB used, 640 TB / 640 TB avail
   7 active+undersized+degraded
   12281 active+clean

So to troubleshoot the undersized pgs, I issued 'ceph pg dump_stuck'
ok
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
1.d77 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:33:57.502849 0'0 408:12 [15,95,58,73,52,31,116,2147483647] 15 [15,95,58,73,52,31,116,2147483647] 15 0'0 2015-03-04 11:33:42.100752 0'0 2015-03-04 11:33:42.100752
1.10fa 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:29.362554 0'0 408:12 [23,12,99,114,132,53,56,2147483647] 23 [23,12,99,114,132,53,56,2147483647] 23 0'0 2015-03-04 11:33:42.168571 0'0 2015-03-04 11:33:42.168571
1.1271 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:33:48.795742 0'0 408:12 [135,112,69,4,22,95,2147483647,83] 135 [135,112,69,4,22,95,2147483647,83] 135 0'0 2015-03-04 11:33:42.139555 0'0 2015-03-04 11:33:42.139555
1.2b5 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:32.189738 0'0 408:12 [11,115,139,19,76,52,94,2147483647] 11 [11,115,139,19,76,52,94,2147483647] 11 0'0 2015-03-04 11:33:42.079673 0'0 2015-03-04 11:33:42.079673
1.7ae 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:26.848344 0'0 408:12 [27,5,132,119,94,56,52,2147483647] 27 [27,5,132,119,94,56,52,2147483647] 27 0'0 2015-03-04 11:33:42.109832 0'0 2015-03-04 11:33:42.109832
1.1a97 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:25.457454 0'0 408:12 [20,53,14,54,102,118,2147483647,72] 20 [20,53,14,54,102,118,2147483647,72] 20 0'0 2015-03-04 11:33:42.833850 0'0 2015-03-04 11:33:42.833850
1.10a6 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:30.059936 0'0 408:12 [136,22,4,2147483647,72,52,101,55] 136 [136,22,4,2147483647,72,52,101,55] 136 0'0 2015-03-04 11:33:42.125871 0'0 2015-03-04 11:33:42.125871

This appears to have a number on all these (2147483647) that is way out of
line from what I would expect.

Thoughts?
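
As a note for later readers: 2147483647 is 2^31-1, which is how CRUSH reports "no OSD found" for a slot, i.e. the rule gave up before filling all 8 positions. The mapping can be reproduced offline with crushtool; the rule number below is an assumption, use whatever 'ceph osd pool get ec44pool crush_ruleset' reports:

ceph osd getcrushmap -o crush.bin
crushtool -i crush.bin --test --rule 1 --num-rep 8 --show-bad-mappings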
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Centos 7 OSD silently fail to start

2015-02-25 Thread Kyle Hutson
Just did it. Thanks for suggesting it.

On Wed, Feb 25, 2015 at 5:59 PM, Brad Hubbard  wrote:

> On 02/26/2015 09:05 AM, Kyle Hutson wrote:
>
>> Thank you Thomas. You at least made me look it the right spot. Their
>> long-form is showing what to do for a mon, not an osd.
>>
>> At the bottom of step 11, instead of
>> sudo touch /var/lib/ceph/mon/{cluster-name}-{hostname}/sysvinit
>>
>> It should read
>> sudo touch /var/lib/ceph/osd/{cluster-name}-{osd-num}/sysvinit
>>
>> Once I did that 'service ceph status' correctly shows that I have that
>> OSD available and I can start or stop it from there.
>>
>>
> Could you open a bug for this?
>
> Cheers,
> Brad
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Centos 7 OSD silently fail to start

2015-02-25 Thread Kyle Hutson
Thank you Thomas. You at least made me look in the right spot. Their
long-form instructions show what to do for a mon, not an osd.

At the bottom of step 11, instead of
sudo touch /var/lib/ceph/mon/{cluster-name}-{hostname}/sysvinit

It should read
sudo touch /var/lib/ceph/osd/{cluster-name}-{osd-num}/sysvinit

Once I did that 'service ceph status' correctly shows that I have that OSD
available and I can start or stop it from there.

On Wed, Feb 25, 2015 at 4:55 PM, Thomas Foster 
wrote:

> I am using the long form and have it working.  The one thing that I saw
> was to change from osd_host to just host. See if that works.
> On Feb 25, 2015 5:44 PM, "Kyle Hutson"  wrote:
>
>> I just tried it, and that does indeed get the OSD to start.
>>
>> However, it doesn't add it to the appropriate place so it would survive a
>> reboot. In my case, running 'service ceph status osd.16' still results in
>> the same line I posted above.
>>
>> There's still something broken such that 'ceph-disk activate' works
>> correctly, but using the long-form version does not.
>>
>> On Wed, Feb 25, 2015 at 4:03 PM, Robert LeBlanc 
>> wrote:
>>
>>> Step #6 in
>>> http://ceph.com/docs/master/install/manual-deployment/#long-form
>>> only set-ups the file structure for the OSD, it doesn't start the long
>>> running process.
>>>
>>> On Wed, Feb 25, 2015 at 2:59 PM, Kyle Hutson  wrote:
>>> > But I already issued that command (back in step 6).
>>> >
>>> > The interesting part is that "ceph-disk activate" apparently does it
>>> > correctly. Even after reboot, the services start as they should.
>>> >
>>> > On Wed, Feb 25, 2015 at 3:54 PM, Robert LeBlanc 
>>> > wrote:
>>> >>
>>> >> I think that your problem lies with systemd (even though you are using
>>> >> SysV syntax, systemd is really doing the work). Systemd does not like
>>> >> multiple arguments and I think this is why it is failing. There is
>>> >> supposed to be some work done to get systemd working ok, but I think
>>> >> it has the limitation of only working with a cluster named 'ceph'
>>> >> currently.
>>> >>
>>> >> What I did to get around the problem was to run the osd command
>>> manually:
>>> >>
>>> >> ceph-osd -i 
>>> >>
>>> >> Once I understand the under-the-hood stuff, I moved to ceph-disk and
>>> >> now because of the GPT partition IDs, udev automatically starts up the
>>> >> OSD process at boot/creation and moves to the appropiate CRUSH
>>> >> location (configuratble in ceph.conf
>>> >>
>>> http://ceph.com/docs/master/rados/operations/crush-map/#crush-location,
>>> >> an example: crush location = host=test rack=rack3 row=row8
>>> >> datacenter=local region=na-west root=default). To restart an OSD
>>> >> process, I just kill the PID for the OSD then issue ceph-disk activate
>>> >> /dev/sdx1 to restart the OSD process. You probably could stop it with
>>> >> systemctl since I believe udev creates a resource for it (I should
>>> >> probably look into that now that this system will be going production
>>> >> soon).
>>> >>
>>> >> On Wed, Feb 25, 2015 at 2:13 PM, Kyle Hutson 
>>> wrote:
>>> >> > I'm having a similar issue.
>>> >> >
>>> >> > I'm following
>>> http://ceph.com/docs/master/install/manual-deployment/ to
>>> >> > a T.
>>> >> >
>>> >> > I have OSDs on the same host deployed with the short-form and they
>>> work
>>> >> > fine. I am trying to deploy some more via the long form (because I
>>> want
>>> >> > them
>>> >> > to appear in a different location in the crush map). Everything
>>> through
>>> >> > step
>>> >> > 10 (i.e. ceph osd crush add {id-or-name} {weight}
>>> >> > [{bucket-type}={bucket-name} ...] ) works just fine. When I go to
>>> step
>>> >> > 11
>>> >> > (sudo /etc/init.d/ceph start osd.{osd-num}) I get:
>>> >> > /etc/init.d/ceph: osd.16 not found (/etc/ceph/ceph.conf defines
>>> >> > mon.hobbit01
>>> >> > osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8
>>> osd.12
>>> >>

Re: [ceph-users] Centos 7 OSD silently fail to start

2015-02-25 Thread Kyle Hutson
I just tried it, and that does indeed get the OSD to start.

However, it doesn't add it to the appropriate place so it would survive a
reboot. In my case, running 'service ceph status osd.16' still results in
the same line I posted above.

There's still something broken such that 'ceph-disk activate' works
correctly, but using the long-form version does not.
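
For the record, the kind of stanza the sysvinit script expects in /etc/ceph/ceph.conf is simply the following (the hostname is a guess based on the node names in this thread, and note it is "host", not "osd_host", per the suggestion quoted below):

[osd.16]
        host = hobbit01

With that present, 'service ceph start osd.16' and 'service ceph status osd.16' should find the daemon; the sysvinit marker file discussed elsewhere in this thread covers the /var/lib/ceph side.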

On Wed, Feb 25, 2015 at 4:03 PM, Robert LeBlanc 
wrote:

> Step #6 in
> http://ceph.com/docs/master/install/manual-deployment/#long-form
> only set-ups the file structure for the OSD, it doesn't start the long
> running process.
>
> On Wed, Feb 25, 2015 at 2:59 PM, Kyle Hutson  wrote:
> > But I already issued that command (back in step 6).
> >
> > The interesting part is that "ceph-disk activate" apparently does it
> > correctly. Even after reboot, the services start as they should.
> >
> > On Wed, Feb 25, 2015 at 3:54 PM, Robert LeBlanc 
> > wrote:
> >>
> >> I think that your problem lies with systemd (even though you are using
> >> SysV syntax, systemd is really doing the work). Systemd does not like
> >> multiple arguments and I think this is why it is failing. There is
> >> supposed to be some work done to get systemd working ok, but I think
> >> it has the limitation of only working with a cluster named 'ceph'
> >> currently.
> >>
> >> What I did to get around the problem was to run the osd command
> manually:
> >>
> >> ceph-osd -i 
> >>
> >> Once I understand the under-the-hood stuff, I moved to ceph-disk and
> >> now because of the GPT partition IDs, udev automatically starts up the
> >> OSD process at boot/creation and moves to the appropiate CRUSH
> >> location (configuratble in ceph.conf
> >> http://ceph.com/docs/master/rados/operations/crush-map/#crush-location,
> >> an example: crush location = host=test rack=rack3 row=row8
> >> datacenter=local region=na-west root=default). To restart an OSD
> >> process, I just kill the PID for the OSD then issue ceph-disk activate
> >> /dev/sdx1 to restart the OSD process. You probably could stop it with
> >> systemctl since I believe udev creates a resource for it (I should
> >> probably look into that now that this system will be going production
> >> soon).
> >>
> >> On Wed, Feb 25, 2015 at 2:13 PM, Kyle Hutson 
> wrote:
> >> > I'm having a similar issue.
> >> >
> >> > I'm following http://ceph.com/docs/master/install/manual-deployment/
> to
> >> > a T.
> >> >
> >> > I have OSDs on the same host deployed with the short-form and they
> work
> >> > fine. I am trying to deploy some more via the long form (because I
> want
> >> > them
> >> > to appear in a different location in the crush map). Everything
> through
> >> > step
> >> > 10 (i.e. ceph osd crush add {id-or-name} {weight}
> >> > [{bucket-type}={bucket-name} ...] ) works just fine. When I go to step
> >> > 11
> >> > (sudo /etc/init.d/ceph start osd.{osd-num}) I get:
> >> > /etc/init.d/ceph: osd.16 not found (/etc/ceph/ceph.conf defines
> >> > mon.hobbit01
> >> > osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12
> >> > osd.6
> >> > osd.11 osd.5 osd.4 osd.0 , /var/lib/ceph defines mon.hobbit01 osd.7
> >> > osd.15
> >> > osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 osd.6 osd.11
> >> > osd.5
> >> > osd.4 osd.0)
> >> >
> >> >
> >> >
> >> > On Wed, Feb 25, 2015 at 11:55 AM, Travis Rhoden 
> >> > wrote:
> >> >>
> >> >> Also, did you successfully start your monitor(s), and define/create
> the
> >> >> OSDs within the Ceph cluster itself?
> >> >>
> >> >> There are several steps to creating a Ceph cluster manually.  I'm
> >> >> unsure
> >> >> if you have done the steps to actually create and register the OSDs
> >> >> with the
> >> >> cluster.
> >> >>
> >> >>  - Travis
> >> >>
> >> >> On Wed, Feb 25, 2015 at 9:49 AM, Leszek Master 
> >> >> wrote:
> >> >>>
> >> >>> Check firewall rules and selinux. It sometimes is a pain in the ...
> :)
> >> >>>
> >> >>> 25 lut 2015 01:46 "Barclay Jameson" 
> >> >>> napisał(

Re: [ceph-users] Centos 7 OSD silently fail to start

2015-02-25 Thread Kyle Hutson
So I issue it twice? e.g.
ceph-osd -i X --mkfs --mkkey
...other commands...
ceph-osd -i X

?


On Wed, Feb 25, 2015 at 4:03 PM, Robert LeBlanc 
wrote:

> Step #6 in
> http://ceph.com/docs/master/install/manual-deployment/#long-form
> only set-ups the file structure for the OSD, it doesn't start the long
> running process.
>
> On Wed, Feb 25, 2015 at 2:59 PM, Kyle Hutson  wrote:
> > But I already issued that command (back in step 6).
> >
> > The interesting part is that "ceph-disk activate" apparently does it
> > correctly. Even after reboot, the services start as they should.
> >
> > On Wed, Feb 25, 2015 at 3:54 PM, Robert LeBlanc 
> > wrote:
> >>
> >> I think that your problem lies with systemd (even though you are using
> >> SysV syntax, systemd is really doing the work). Systemd does not like
> >> multiple arguments and I think this is why it is failing. There is
> >> supposed to be some work done to get systemd working ok, but I think
> >> it has the limitation of only working with a cluster named 'ceph'
> >> currently.
> >>
> >> What I did to get around the problem was to run the osd command
> manually:
> >>
> >> ceph-osd -i 
> >>
> >> Once I understand the under-the-hood stuff, I moved to ceph-disk and
> >> now because of the GPT partition IDs, udev automatically starts up the
> >> OSD process at boot/creation and moves to the appropiate CRUSH
> >> location (configuratble in ceph.conf
> >> http://ceph.com/docs/master/rados/operations/crush-map/#crush-location,
> >> an example: crush location = host=test rack=rack3 row=row8
> >> datacenter=local region=na-west root=default). To restart an OSD
> >> process, I just kill the PID for the OSD then issue ceph-disk activate
> >> /dev/sdx1 to restart the OSD process. You probably could stop it with
> >> systemctl since I believe udev creates a resource for it (I should
> >> probably look into that now that this system will be going production
> >> soon).
> >>
> >> On Wed, Feb 25, 2015 at 2:13 PM, Kyle Hutson 
> wrote:
> >> > I'm having a similar issue.
> >> >
> >> > I'm following http://ceph.com/docs/master/install/manual-deployment/
> to
> >> > a T.
> >> >
> >> > I have OSDs on the same host deployed with the short-form and they
> work
> >> > fine. I am trying to deploy some more via the long form (because I
> want
> >> > them
> >> > to appear in a different location in the crush map). Everything
> through
> >> > step
> >> > 10 (i.e. ceph osd crush add {id-or-name} {weight}
> >> > [{bucket-type}={bucket-name} ...] ) works just fine. When I go to step
> >> > 11
> >> > (sudo /etc/init.d/ceph start osd.{osd-num}) I get:
> >> > /etc/init.d/ceph: osd.16 not found (/etc/ceph/ceph.conf defines
> >> > mon.hobbit01
> >> > osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12
> >> > osd.6
> >> > osd.11 osd.5 osd.4 osd.0 , /var/lib/ceph defines mon.hobbit01 osd.7
> >> > osd.15
> >> > osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 osd.6 osd.11
> >> > osd.5
> >> > osd.4 osd.0)
> >> >
> >> >
> >> >
> >> > On Wed, Feb 25, 2015 at 11:55 AM, Travis Rhoden 
> >> > wrote:
> >> >>
> >> >> Also, did you successfully start your monitor(s), and define/create
> the
> >> >> OSDs within the Ceph cluster itself?
> >> >>
> >> >> There are several steps to creating a Ceph cluster manually.  I'm
> >> >> unsure
> >> >> if you have done the steps to actually create and register the OSDs
> >> >> with the
> >> >> cluster.
> >> >>
> >> >>  - Travis
> >> >>
> >> >> On Wed, Feb 25, 2015 at 9:49 AM, Leszek Master 
> >> >> wrote:
> >> >>>
> >> >>> Check firewall rules and selinux. It sometimes is a pain in the ...
> :)
> >> >>>
> >> >>> 25 lut 2015 01:46 "Barclay Jameson" 
> >> >>> napisał(a):
> >> >>>
> >> >>>> I have tried to install ceph using ceph-deploy but sgdisk seems to
> >> >>>> have too many issues so I did a manual install. After mkfs.btrfs on
> >> >>>> the disks and journals and mounted them 

Re: [ceph-users] Centos 7 OSD silently fail to start

2015-02-25 Thread Kyle Hutson
But I already issued that command (back in step 6).

The interesting part is that "ceph-disk activate" apparently does it
correctly. Even after reboot, the services start as they should.

On Wed, Feb 25, 2015 at 3:54 PM, Robert LeBlanc 
wrote:

> I think that your problem lies with systemd (even though you are using
> SysV syntax, systemd is really doing the work). Systemd does not like
> multiple arguments and I think this is why it is failing. There is
> supposed to be some work done to get systemd working ok, but I think
> it has the limitation of only working with a cluster named 'ceph'
> currently.
>
> What I did to get around the problem was to run the osd command manually:
>
> ceph-osd -i 
>
> Once I understand the under-the-hood stuff, I moved to ceph-disk and
> now because of the GPT partition IDs, udev automatically starts up the
> OSD process at boot/creation and moves to the appropiate CRUSH
> location (configuratble in ceph.conf
> http://ceph.com/docs/master/rados/operations/crush-map/#crush-location,
> an example: crush location = host=test rack=rack3 row=row8
> datacenter=local region=na-west root=default). To restart an OSD
> process, I just kill the PID for the OSD then issue ceph-disk activate
> /dev/sdx1 to restart the OSD process. You probably could stop it with
> systemctl since I believe udev creates a resource for it (I should
> probably look into that now that this system will be going production
> soon).
>
> On Wed, Feb 25, 2015 at 2:13 PM, Kyle Hutson  wrote:
> > I'm having a similar issue.
> >
> > I'm following http://ceph.com/docs/master/install/manual-deployment/ to
> a T.
> >
> > I have OSDs on the same host deployed with the short-form and they work
> > fine. I am trying to deploy some more via the long form (because I want
> them
> > to appear in a different location in the crush map). Everything through
> step
> > 10 (i.e. ceph osd crush add {id-or-name} {weight}
> > [{bucket-type}={bucket-name} ...] ) works just fine. When I go to step 11
> > (sudo /etc/init.d/ceph start osd.{osd-num}) I get:
> > /etc/init.d/ceph: osd.16 not found (/etc/ceph/ceph.conf defines
> mon.hobbit01
> > osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12
> osd.6
> > osd.11 osd.5 osd.4 osd.0 , /var/lib/ceph defines mon.hobbit01 osd.7
> osd.15
> > osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 osd.6 osd.11
> osd.5
> > osd.4 osd.0)
> >
> >
> >
> > On Wed, Feb 25, 2015 at 11:55 AM, Travis Rhoden 
> wrote:
> >>
> >> Also, did you successfully start your monitor(s), and define/create the
> >> OSDs within the Ceph cluster itself?
> >>
> >> There are several steps to creating a Ceph cluster manually.  I'm unsure
> >> if you have done the steps to actually create and register the OSDs
> with the
> >> cluster.
> >>
> >>  - Travis
> >>
> >> On Wed, Feb 25, 2015 at 9:49 AM, Leszek Master 
> wrote:
> >>>
> >>> Check firewall rules and selinux. It sometimes is a pain in the ... :)
> >>>
> >>> 25 lut 2015 01:46 "Barclay Jameson" 
> napisał(a):
> >>>
> >>>> I have tried to install ceph using ceph-deploy but sgdisk seems to
> >>>> have too many issues so I did a manual install. After mkfs.btrfs on
> >>>> the disks and journals and mounted them I then tried to start the osds
> >>>> which failed. The first error was:
> >>>> #/etc/init.d/ceph start osd.0
> >>>> /etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines ,
> >>>> /var/lib/ceph defines )
> >>>>
> >>>> I then manually added the osds to the conf file with the following as
> >>>> an example:
> >>>> [osd.0]
> >>>> osd_host = node01
> >>>>
> >>>> Now when I run the command :
> >>>> # /etc/init.d/ceph start osd.0
> >>>>
> >>>> There is no error or output from the command and in fact when I do a
> >>>> ceph -s no osds are listed as being up.
> >>>> Doing as ps aux | grep -i ceph or ps aux | grep -i osd shows there are
> >>>> no osd running.
> >>>> I also have done htop to see if any process are running and none are
> >>>> shown.
> >>>>
> >>>> I had this working on SL6.5 with Firefly but Giant on Centos 7 has
> >>>> been nothing but a giant pain.
> >>>> ___
> >>>&g

Re: [ceph-users] Centos 7 OSD silently fail to start

2015-02-25 Thread Kyle Hutson
I'm having a similar issue.

I'm following http://ceph.com/docs/master/install/manual-deployment/ to a T.

I have OSDs on the same host deployed with the short-form and they work
fine. I am trying to deploy some more via the long form (because I want
them to appear in a different location in the crush map). Everything
through step 10 (i.e. ceph osd crush add {id-or-name} {weight}
[{bucket-type}={bucket-name} ...] ) works just fine. When I go to step 11 (sudo
/etc/init.d/ceph start osd.{osd-num}) I get:
/etc/init.d/ceph: osd.16 not found (/etc/ceph/ceph.conf defines
mon.hobbit01 osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13
osd.8 osd.12 osd.6 osd.11 osd.5 osd.4 osd.0 , /var/lib/ceph defines
mon.hobbit01 osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13
osd.8 osd.12 osd.6 osd.11 osd.5 osd.4 osd.0)



On Wed, Feb 25, 2015 at 11:55 AM, Travis Rhoden  wrote:

> Also, did you successfully start your monitor(s), and define/create the
> OSDs within the Ceph cluster itself?
>
> There are several steps to creating a Ceph cluster manually.  I'm unsure
> if you have done the steps to actually create and register the OSDs with
> the cluster.
>
>  - Travis
>
> On Wed, Feb 25, 2015 at 9:49 AM, Leszek Master  wrote:
>
>> Check firewall rules and selinux. It sometimes is a pain in the ... :)
>> On 25 Feb 2015 01:46, "Barclay Jameson"  wrote:
>>
>> I have tried to install ceph using ceph-deploy but sgdisk seems to
>>> have too many issues so I did a manual install. After mkfs.btrfs on
>>> the disks and journals and mounted them I then tried to start the osds
>>> which failed. The first error was:
>>> #/etc/init.d/ceph start osd.0
>>> /etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines ,
>>> /var/lib/ceph defines )
>>>
>>> I then manually added the osds to the conf file with the following as
>>> an example:
>>> [osd.0]
>>> osd_host = node01
>>>
>>> Now when I run the command :
>>> # /etc/init.d/ceph start osd.0
>>>
>>> There is no error or output from the command and in fact when I do a
>>> ceph -s no osds are listed as being up.
>>> Doing as ps aux | grep -i ceph or ps aux | grep -i osd shows there are
>>> no osd running.
>>> I also have done htop to see if any process are running and none are
>>> shown.
>>>
>>> I had this working on SL6.5 with Firefly but Giant on Centos 7 has
>>> been nothing but a giant pain.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fixing a crushmap

2015-02-20 Thread Kyle Hutson
Oh, and I don't yet have any important data here, so I'm not worried about
losing anything at this point. I just need to get my cluster happy again so
I can play with it some more.

On Fri, Feb 20, 2015 at 11:00 AM, Kyle Hutson  wrote:

> Here was the process I went through.
> 1) I created an EC pool which created ruleset 1
> 2) I edited the crushmap to approximately its current form
> 3) I discovered my previous EC pool wasn't doing what I meant for it to
> do, so I deleted it.
> 4) I created a new EC pool with the parameters I wanted and told it to use
> ruleset 3
>
> On Fri, Feb 20, 2015 at 10:55 AM, Luis Periquito 
> wrote:
>
>> The process of creating an erasure coded pool and a replicated one is
>> slightly different. You can use Sebastian's guide to create/manage the osd
>> tree, but you should follow this guide
>> http://ceph.com/docs/giant/dev/erasure-coded-pool/ to create the EC pool.
>>
>> I'm not sure (i.e. I never tried) to create a EC pool the way you did.
>> The normal replicated ones do work like this.
>>
>> On Fri, Feb 20, 2015 at 4:49 PM, Kyle Hutson  wrote:
>>
>>> I manually edited my crushmap, basing my changes on
>>> http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
>>> I have SSDs and HDDs in the same box and was wanting to separate them by
>>> ruleset. My current crushmap can be seen at http://pastie.org/9966238
>>>
>>> I had it installed and everything looked gooduntil I created a new
>>> pool. All of the new pgs are stuck in "creating". I first tried creating an
>>> erasure-coded pool using ruleset 3, then created another pool using ruleset
>>> 0. Same result.
>>>
>>> I'm not opposed to an 'RTFM' answer, so long as you can point me to the
>>> right one. I've seen very little documentation on crushmap rules, in
>>> particular.
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fixing a crushmap

2015-02-20 Thread Kyle Hutson
Here was the process I went through.
1) I created an EC pool which created ruleset 1
2) I edited the crushmap to approximately its current form
3) I discovered my previous EC pool wasn't doing what I meant for it to do,
so I deleted it.
4) I created a new EC pool with the parameters I wanted and told it to use
ruleset 3

On Fri, Feb 20, 2015 at 10:55 AM, Luis Periquito 
wrote:

> The process of creating an erasure coded pool and a replicated one is
> slightly different. You can use Sebastian's guide to create/manage the osd
> tree, but you should follow this guide
> http://ceph.com/docs/giant/dev/erasure-coded-pool/ to create the EC pool.
>
> I'm not sure (i.e. I never tried) to create a EC pool the way you did. The
> normal replicated ones do work like this.
>
> On Fri, Feb 20, 2015 at 4:49 PM, Kyle Hutson  wrote:
>
>> I manually edited my crushmap, basing my changes on
>> http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
>> I have SSDs and HDDs in the same box and was wanting to separate them by
>> ruleset. My current crushmap can be seen at http://pastie.org/9966238
>>
>> I had it installed and everything looked gooduntil I created a new
>> pool. All of the new pgs are stuck in "creating". I first tried creating an
>> erasure-coded pool using ruleset 3, then created another pool using ruleset
>> 0. Same result.
>>
>> I'm not opposed to an 'RTFM' answer, so long as you can point me to the
>> right one. I've seen very little documentation on crushmap rules, in
>> particular.
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fixing a crushmap

2015-02-20 Thread Kyle Hutson
I manually edited my crushmap, basing my changes on
http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
I have SSDs and HDDs in the same box and was wanting to separate them by
ruleset. My current crushmap can be seen at http://pastie.org/9966238

I had it installed and everything looked good... until I created a new
pool. All of the new pgs are stuck in "creating". I first tried creating an
erasure-coded pool using ruleset 3, then created another pool using ruleset
0. Same result.

I'm not opposed to an 'RTFM' answer, so long as you can point me to the
right one. I've seen very little documentation on crushmap rules, in
particular.
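
In case it helps anyone searching later, the general shape of a rule in a decompiled crushmap looks like the sketch below (adapted from the blog post above; the "ssd" root bucket and the ruleset number are placeholders for whatever your own map defines):

rule ssd {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}

An erasure-coded pool's rule is similar but uses "type erasure" and "step chooseleaf indep ...", and is normally generated from the erasure-code profile rather than written by hand.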
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Kyle Bader
> do people consider a UPS + Shutdown procedures a suitable substitute?

I certainly wouldn't. I've seen utility power fail and the transfer
switch fail to transition to the UPS strings. Had this happened to me with
nobarrier set, it would have been a very sad day.
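
For comparison, a conservative set of xfs options in ceph.conf leaves write barriers alone entirely (this is just the common noatime/inode64 pairing shown as a sketch; nothing here disables barriers):

[osd]
        osd mount options xfs = rw,noatime,inode64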

-- 

Kyle Bader
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] private network - VLAN vs separate switch

2014-11-26 Thread Kyle Bader
> Thanks for all the help. Can the moving over from VLAN to separate
> switches be done on a live cluster? Or does there need to be a down
> time?

You can do it on a live cluster. The more cavalier approach would be
to quickly switch the link over one server at a time, which might
cause a short io stall. The more careful approach would be to 'ceph
osd set noup', mark all the osds on a node down, move the link, 'ceph
osd unset noup', and then wait for their peers to mark them back up
before proceeding to the next host.
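
A sketch of that careful approach, per host (the osd ids are placeholders for the osds on the node being moved):

ceph osd set noup
ceph osd down 10 11 12        # the osds on this host
# move/re-cable the link for this host and verify connectivity
ceph osd unset noup
# watch 'ceph -w' until the peers report those osds up again, then repeat on the next host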

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] private network - VLAN vs separate switch

2014-11-25 Thread Kyle Bader
> For a large network (say 100 servers and 2500 disks), are there any
> strong advantages to using separate switch and physical network
> instead of VLAN?

Physical isolation will ensure that congestion on one does not affect
the other. On the flip side, asymmetric network failures tend to be
more difficult to troubleshoot, e.g. a backend failure with a functional
front end. That said, in a pinch you can switch to using the front end
network for both until you can repair the backend.

> Also, how difficult it would be to switch from a VLAN to using
> separate switches later?

Should be relatively straight forward. Simply configure the
VLAN/subnets on the new physical switches and move links over one by
one. Once all the links are moved over you can remove the VLAN and
subnets that are now on the new kit from the original hardware.
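
The Ceph side of that is just the two subnet definitions in ceph.conf (the values below are placeholders); only the cabling/switchports change:

[global]
        public network  = 10.1.0.0/24
        cluster network = 10.2.0.0/24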

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-08-06 Thread Kyle Bader
> Can you paste me the whole output of the install? I am curious why/how you 
> are getting el7 and el6 packages.

priority=1 required in /etc/yum.repos.d/ceph.repo entries
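
i.e. something along these lines in each ceph stanza (the baseurl is shown for giant/el7 purely as an example), plus yum-plugin-priorities installed so the priority line is actually honoured:

[ceph]
name=Ceph packages
baseurl=http://ceph.com/rpm-giant/el7/x86_64/
enabled=1
gpgcheck=1
priority=1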

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is OSDs based on VFS?

2014-07-21 Thread Kyle Bader
> I wonder that OSDs use system calls of Virtual File System (i.e. open, read,
> write, etc) when they access disks.
>
> I mean ... Could I monitor I/O command requested by OSD to disks if I
> monitor VFS?

Ceph OSDs run on top of a traditional filesystem, so long as they
support xattrs - xfs by default. As such you can use kernel
instrumentation to view what is going on "under" the Ceph OSDs.
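
For example (purely illustrative; the device name and syscall list are whatever you care about):

# VFS level: trace one running ceph-osd process
strace -f -e trace=open,read,write,pread64,pwrite64,fsync,fdatasync -p $(pidof ceph-osd | awk '{print $1}')

# block level, underneath the filesystem:
blktrace -d /dev/sdb -o - | blkparse -i -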

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bypass Cache-Tiering for special reads (Backups)

2014-07-02 Thread Kyle Bader
> I was wondering, having a cache pool in front of an RBD pool is all fine
> and dandy, but imagine you want to pull backups of all your VMs (or one
> of them, or multiple...). Going to the cache for all those reads isn't
> only pointless, it'll also potentially fill up the cache and possibly
> evict actually frequently used data. Which got me thinking... wouldn't
> it be nifty if there was a special way of doing specific backup reads
> where you'd bypass the cache, ensuring the dirty cache contents get
> written to cold pool first? Or at least doing special reads where a
> cache-miss won't actually cache the requested data?
>
> AFAIK the backup routine for an RBD-backed KVM usually involves creating
> a snapshot of the RBD and putting that into a backup storage/tape, all
> done via librbd/API.
>
> Maybe something like that even already exists?

When used in the context of OpenStack Cinder, it does:

http://ceph.com/docs/next/rbd/rbd-openstack/#configuring-cinder-backup

You can have the backup pool use the default crush rules, assuming the
default isn't your hot pool. Another option might be to put backups on
an erasure coded pool; I'm not sure whether that has been tested, but in
principle it should work, since the objects composing a snapshot should be
immutable.
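
The cinder-backup side of that link boils down to something like the following in cinder.conf (option names as documented there; the pool and user names are whatever you created):

backup_driver = cinder.backup.drivers.ceph
backup_ceph_conf = /etc/ceph/ceph.conf
backup_ceph_user = cinder-backup
backup_ceph_pool = backups

and the "backups" pool is just a normal pool, so it can sit under whatever crush rule (or EC profile) you prefer.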

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journal SSD durability

2014-05-13 Thread Kyle Bader
> TL;DR: Power outages are more common than your colo facility will admit.

Seconded. I've seen power failures in at least 4 different facilities,
and all of them had the usual gamut of batteries/generators/etc. In some
of those facilities I've seen problems multiple times in a single
year. Even a datacenter with five nines power availability is going to
see > 5m of downtime per year, and that would qualify for the highest
rating from the Uptime Institute (Tier IV)! I've lost power to Ceph
clusters on several occasions, in all cases the journals were on
spinning media.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate whole clusters

2014-05-13 Thread Kyle Bader
> Anyway replacing set of monitors means downtime for every client, so
> I`m in doubt if 'no outage' word is still applicable there.

Taking the entire quorum down for migration would be bad. It's better
to add one in the new location, remove one at the old, ad infinitum.
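
Roughly, for each monitor (names and addresses are placeholders, and the new
mon's data dir needs to be bootstrapped with ceph-mon --mkfs first):

ceph mon add newmon 192.0.2.10:6789    # grow the quorum at the new site
# start ceph-mon on the new host and wait for quorum to settle
ceph mon remove oldmon                 # then retire one at the old site

Repeat until all monitors live in the new location, keeping an odd number in
quorum as much as possible.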

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate whole clusters

2014-05-09 Thread Kyle Bader
> Let's assume a test cluster up and running with real data on it.
> Which is the best way to migrate everything to a production (and
> larger) cluster?
>
> I'm thinking to add production MONs to the test cluster, after that,
> add productions OSDs to the test cluster, waiting for a full rebalance
> and then starting to remove test OSDs and test mons.
>
> This should migrate everything with no outage.

It's possible and I've done it, this was around the argonaut/bobtail
timeframe on a pre-production cluster. If your cluster has a lot of
data then it may take a long time or be disruptive, make sure you've
tested that your recovery tunables are suitable for your hardware
configuration.
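
For example, to throttle recovery/backfill on a live cluster (these values
are just a conservative starting point):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

You can raise them again once you've confirmed client I/O isn't suffering.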

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache

2014-04-17 Thread Kyle Bader
>> >> I think the timing should work that we'll be deploying with Firefly and
>> >> so
>> >> have Ceph cache pool tiering as an option, but I'm also evaluating
>> >> Bcache
>> >> versus Tier to act as node-local block cache device. Does anybody have
>> >> real
>> >> or anecdotal evidence about which approach has better performance?
>> > New idea that is dependent on failure behaviour of the cache tier...
>>
>> The problem with this type of configuration is it ties a VM to a
>> specific hypervisor, in theory it should be faster because you don't
>> have network latency from round trips to the cache tier, resulting in
>> higher iops. Large sequential workloads may achieve higher throughput
>> by parallelizing across many OSDs in a cache tier, whereas local flash
>> would be limited to single device throughput.
>
> Ah, I was ambiguous. When I said node-local I meant OSD-local. So I'm really
> looking at:
> 2-copy write-back object ssd cache-pool
> versus
> OSD write-back ssd block-cache
> versus
> 1-copy write-around object cache-pool & ssd journal

Ceph cache pools allow you to scale the size of the cache pool
independently of the underlying storage and avoid constraints about
disk:ssd ratios (for flashcache, bcache, etc). Local block caches
should have lower latency than a cache tier for a cache miss, due to
the extra hop(s) across the network. I would lean towards using Ceph's
cache tiers for the scaling independence.

> This is undoubtedly true for a write-back cache-tier. But in the scenario
> I'm suggesting, a write-around cache, that needn't be bad news - if a
> cache-tier OSD is lost the cache simply just got smaller and some cached
> objects were unceremoniously flushed. The next read on those objects should
> just miss and bring them into the now smaller cache.
>
> The thing I'm trying to avoid with the above is double read-caching of
> objects (so as to get more aggregate read cache). I assume the standard
> wisdom with write-back cache-tiering is that the backing data pool shouldn't
> bother with ssd journals?

Currently, all cache tiers need to be durable - regardless of cache
mode. As such, cache tiers should be erasure coded or N+1 replicated
(I'd recommend N+2 or 3x replica). Ceph could potentially do what you
described in the future, it just doesn't yet.
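
For reference, putting a durable writeback tier in front of an existing pool
looks roughly like this (pool names and pg counts are examples):

ceph osd pool create hot 2048 2048 replicated
ceph osd pool set hot size 3
ceph osd tier add cold hot
ceph osd tier cache-mode hot writeback
ceph osd tier set-overlay cold hot
ceph osd pool set hot hit_set_type bloom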

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache

2014-04-16 Thread Kyle Bader
>> Obviously the ssds could be used as journal devices, but I'm not really
>> convinced whether this is worthwhile when all nodes have 1GB of hardware
>> writeback cache (writes to journal and data areas on the same spindle have
>> time to coalesce in the cache and minimise seek time hurt). Any advice on
>> this?

All writes need to be written to the journal before being written to
the data volume, so it's going to impact your overall throughput and
cause seeking; a hardware cache will only help with the latter (unless
you use btrfs).

>> I think the timing should work that we'll be deploying with Firefly and so
>> have Ceph cache pool tiering as an option, but I'm also evaluating Bcache
>> versus Tier to act as node-local block cache device. Does anybody have real
>> or anecdotal evidence about which approach has better performance?
> New idea that is dependent on failure behaviour of the cache tier...

The problem with this type of configuration is it ties a VM to a
specific hypervisor, in theory it should be faster because you don't
have network latency from round trips to the cache tier, resulting in
higher iops. Large sequential workloads may achieve higher throughput
by parallelizing across many OSDs in a cache tier, whereas local flash
would be limited to single device throughput.

> Carve the ssds 4-ways: each with 3 partitions for journals servicing the
> backing data pool and a fourth larger partition serving a write-around cache
> tier with only 1 object copy. Thus both reads and writes hit ssd but the ssd
> capacity is not halved by replication for availability.
>
> ...The crux is how the current implementation behaves in the face of cache
> tier OSD failures?

Cache tiers are durable by way of replication or erasure coding; OSDs
will remap degraded placement groups and backfill as appropriate. With
single replica cache pools loss of OSDs becomes a real concern, in the
case of RBD this means losing arbitrary chunk(s) of your block devices
- bad news. If you want host independence, durability and speed your
best bet is a replicated cache pool (2-3x).

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on harvesting freed space

2014-04-15 Thread Kyle Bader
> I'm assuming Ceph/RBD doesn't have any direct awareness of this since
> the file system doesn't traditionally have a "give back blocks"
> operation to the block device.  Is there anything special RBD does in
> this case that communicates the release of the Ceph storage back to the
> pool?

VMs running a 3.2+ kernel (iirc) can "give back blocks" by issuing TRIM.

http://wiki.qemu.org/Features/QED/Trim
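
A hedged sketch of what that looks like with libvirt/virtio-scsi (the
discard attribute needs a reasonably recent libvirt/QEMU):

<driver name='qemu' type='raw' cache='writeback' discard='unmap'/>

and then inside the guest, either mount with -o discard or run fstrim
periodically:

fstrim -v /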

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 答复: 答复: why object can't be recovered when delete one replica

2014-03-24 Thread Kyle Bader
> I have run the repair command, and the warning info disappears in the
output of "ceph health detail", but the replicas isn't recovered in the
"current" directory.
> In all, the ceph cluster status can recover (the pg's status recover from
inconsistent to active and clean), but not the replica.

If you run a pg query does it still show the osd you removed the object
from in the acting set? It could be that the pg has a different member now
and the restored copy is on another osd.
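
Something like this will show the current up/acting sets (the pgid is just
an example):

ceph pg map 3.1f
ceph pg 3.1f query | less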
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error initializing cluster client: Error

2014-03-22 Thread Kyle Bader
> I have two nodes with 8 OSDs on each. First node running 2 monitors on 
> different virtual machines (mon.1 and mon.2), second node runing mon.3
> After several reboots (I have tested power failure scenarios) "ceph -w" on 
> node 2 always fails with message:
>
> root@bes-mon3:~# ceph --verbose -w
> Error initializing cluster client: Error

The cluster is simply protecting itself from a split brain situation.
Say you have:

mon.1  mon.2  mon.3

If mon.1 fails, no big deal, you still have 2/3 so no problem.

Now instead, say mon.1 is separated from mon.2 and mon.3 because of a
network partition (trunk failure, whatever). If one monitor of the
three could elect itself as leader then you might have divergence
between your monitors. Self-elected mon.1 thinks it's the leader and
mon.{2,3} have elected a leader amongst themselves. The harsh reality
is you really need to have monitors on 3 distinct physical hosts to
protect against the failure of a physical host.
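
You can check who is in the quorum (and who the leader is) with:

ceph quorum_status --format json-pretty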

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why object can't be recovered when delete one replica

2014-03-22 Thread Kyle Bader
> I upload a file through swift API, then delete it in the “current” directory
> in the secondary OSD manually, why the object can’t be recovered?
>
> If I delete it in the primary OSD, the object is deleted directly in the
> pool .rgw.bucket and it can’t be recovered from the secondary OSD.
>
> Do anyone know this behavior?

This is because the placement group containing that object likely
needs to scrub (just a light scrub should do). The scrub will compare
the two replicas, notice the replica is missing from the secondary and
trigger recovery/backfill. Can you try scrubbing the placement group
containing the removed object and let us know if it triggers recovery?
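
For example (the pgid is a placeholder):

ceph pg scrub 11.2c        # light scrub
ceph pg deep-scrub 11.2c   # if the light scrub doesn't surface it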

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mounting with dmcrypt still fails

2014-03-22 Thread Kyle Bader
> ceph-disk-prepare --fs-type xfs --dmcrypt --dmcrypt-key-dir 
> /etc/ceph/dmcrypt-keys --cluster ceph -- /dev/sdb
> ceph-disk: Error: Device /dev/sdb2 is in use by a device-mapper mapping 
> (dm-crypt?): dm-0

It sounds like device-mapper still thinks it's using the volume;
you might be able to track it down with this:

for i in `ls -1 /sys/block/ | grep sd`; do echo $i: `ls
/sys/block/$i/${i}1/holders/`; done

Then it's a matter of making sure there are no open file handles on
the encrypted volume and unmounting it. You will still need to
completely clear out the partition table on that disk, which can be
tricky with GPT because it's not as simple as dd'ing the start of the
volume. This is what the zapdisk parameter is for in
ceph-disk-prepare; I don't know enough about ceph-deploy to know if
you can somehow pass it.

After you know the device/dm mapping you can use udevadm to find out
where it should map to (uuids replaced with xxx's):

udevadm test /block/sdc/sdc1

run: '/sbin/cryptsetup --key-file /etc/ceph/dmcrypt-keys/x
--key-size 256 create  /dev/sdc1'
run: '/bin/bash -c 'while [ ! -e /dev/mapper/x ];do sleep 1; done''
run: '/usr/sbin/ceph-disk-activate /dev/mapper/x'

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd rebalance question

2014-03-22 Thread Kyle Bader
>  I need to add a extend server, which reside several osds, to a
> running ceph cluster. During add osds, ceph would not automatically modify
> the ceph.conf. So I manually modify the ceph.conf
>
> And restart the whole ceph cluster with command: ’service ceph –a restart’.
> I just confused that if I restart the ceph cluster, ceph would rebalance the
> whole data(redistribution whole data) among osds? Or just move some
>
> Data from existed osds to new osds? Anybody knows?

It depends on how you added the OSDs, if the initial crush weight is
set to 0 then no data will be moved to the OSD when it joins the
cluster. Only once the weight has been increased with the rest of the
OSD population will data start to move to the new OSD(s). If you add
new OSD(s) with an initial weight > 0 then they will start accepting
data from peers as soon as they are up/in.
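
A sketch of the weight-0 approach (ids, hostname and weights are examples):

ceph osd crush add osd.12 0 host=node4    # joins the map but takes no data
ceph osd crush reweight osd.12 0.5        # then step the weight up gradually
ceph osd crush reweight osd.12 1.0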

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD + FlashCache vs. Cache Pool for RBD...

2014-03-22 Thread Kyle Bader
> One downside of the above arrangement: I read that support for mapping
> newer-format RBDs is only present in fairly recent kernels.  I'm running
> Ubuntu 12.04 on the cluster at present with its stock 3.2 kernel.  There
> is a PPA for the 3.11 kernel used in Ubuntu 13.10, but if you're looking
> at a new deployment it might be better to wait until 14.04: then you'll
> get kernel 3.13.
>
> Anyone else have any ideas on the above?

I don't think there are any hairy udev issues or similar that will
make using a newer kernel on precise problematic. The one caveat I can
think of with this kind of setup is that if you lose a hypervisor the
cache goes with it, and you likely won't be able to migrate the guest
to another host. The alternative is to use flashcache on top of the
OSD partition, but then you introduce network hops and it is closer to
what the tiering feature will offer, except the flashcache OSD method
is more particular about disk:ssd ratio, whereas in a tier the flash
could be on completely separate hosts (possibly dedicated flash
machines).

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the difference between using /dev/sdb and /dev/sdb1 as osd?

2014-03-22 Thread Kyle Bader
> If I want to use a disk dedicated for osd, can I just use something like
> /dev/sdb instead of /dev/sdb1? Is there any negative impact on performance?

You can pass /dev/sdb to ceph-disk-prepare and it will create two
partitions, one for the journal (raw partition) and one for the data
volume (defaults to formatting xfs). This is known as a single device
OSD, in contrast with a multi-device OSD where the journal is on a
completely different device (like a partition on a shared journaling
SSD).
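
For example (device names are placeholders):

ceph-disk-prepare --fs-type xfs /dev/sdb              # single device OSD
ceph-disk-prepare --fs-type xfs /dev/sdb /dev/sdc1    # data on sdb, journal on sdc1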

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] if partition name changes, will ceph get corrupted?

2014-03-12 Thread Kyle Bader
> We use /dev/disk/by-path for this reason, but we confirmed that is stable
> for our HBAs. Maybe /dev/disk/by-something is consistent with your
> controller.

The upstart/udev scripts will handle mounting and osd id detection, at
least on Ubuntu.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB

2014-03-12 Thread Kyle Bader
> This is in my lab. Plain passthrough setup with automap enabled on the F5. s3 
> & curl work fine as far as queries go. But file transfer rate degrades badly 
> once I start file up/download.

Maybe the difference can be attributed to LAN client traffic with
jumbo frames vs F5 using a smaller WAN MTU?

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB

2014-03-12 Thread Kyle Bader
> You're right.  Sorry didn't specify I was trying this for Radosgw.  Even for 
> this I'm seeing performance degrade once my clients start to hit the LB VIP.

Could you tell us more about your load balancer and configuration?

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB

2014-03-12 Thread Kyle Bader
> Anybody has a good practice on how to set up a ceph cluster behind a pair of 
> load balancer?

The only place you would want to put a load balancer in the context of
a Ceph cluster would be north of RGW nodes. You can do L3 transparent
load balancing or balance with a L7 proxy, ie Linux Virtual Server or
HAProxy/Nginx. The other components of Ceph are horizontally scalable
and because of the way Ceph's native protocols work you don't need
load balancers doing L2/L3/L7 tricks to achieve HA.
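
A minimal HAProxy sketch for RGW, just to illustrate (addresses, ports and
the check method are hypothetical):

frontend rgw_frontend
    bind *:80
    mode http
    default_backend rgw_backend

backend rgw_backend
    mode http
    balance leastconn
    option httpchk GET /
    server rgw1 10.0.1.11:80 check
    server rgw2 10.0.1.12:80 check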

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-rbd

2014-03-11 Thread Kyle Bader
> I tried rbd-fuse and it's throughput using fio is approx. 1/4 that of the 
> kernel client.
>
> Can you please let me know how to setup RBD backend for FIO? I'm assuming 
> this RBD backend is also based on librbd?

You will probably have to build fio from source since the rbd engine is new:

https://github.com/axboe/fio

Assuming you already have a cluster and a client configured this
should do the trick:

https://github.com/axboe/fio/blob/master/examples/rbd.fio
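
A job file along the lines of that example should work; the pool/image names
here are placeholders and the image must already exist:

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio-test
rw=randwrite
bs=4k
runtime=60

[rbd-iodepth-32]
iodepth=32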

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Utilizing DAS on XEN or XCP hosts for Openstack Cinder

2014-03-11 Thread Kyle Bader
> 1.   Is it possible to install Ceph and Ceph monitors on the the XCP
> (XEN) Dom0 or would we need to install it on the DomU containing the
> Openstack components?

I'm not a Xen guru but in the case of KVM I would run the OSDs on the
hypervisor to avoid virtualization overhead.

> 2.   Is Ceph server aware, or Rack aware so that replicas are not stored
> on the same server?

Yes, placement is defined with your crush map and placement rules.

> 3.   Are 4Tb OSD’s too large? We are attempting to restrict the qty of
> OSD’s per server to minimise system overhead

Nope!

> Any other feedback regarding our plan would also be welcomed.

I would probably run each disk as its own OSD, which means you need a
bit more memory per host. Networking could certainly be a bottleneck
with 8 to 16 spindle nodes. YMMV.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Encryption/Multi-tennancy

2014-03-11 Thread Kyle Bader
> There could be millions of tennants. Looking deeper at the docs, it looks 
> like Ceph prefers to have one OSD per disk.  We're aiming at having 
> backblazes, so will be looking at 45 OSDs per machine, many machines.  I want 
> to separate the tennants and separately encrypt their data.  The encryption 
> will be provided by us, but I was originally intending to have 
> passphrase-based encryption, and use programmatic means to either hash the 
> passphrase or/and encrypt it using the same passphrase.  This way, we 
> wouldn't be able to access the tennant's data, or the key for the passphrase, 
> although we'd still be able to store both.


The way I see it you have several options:

1. Encrypted OSDs

Preserve confidentiality in the event someone gets physical access to
a disk, whether theft or RMA. Requires tenant to trust provider.

vm
rbd
rados
osd <-here
disks

2. Whole disk VM encryption

Preserve confidentiality in the event someone gets physical access to
a disk, whether theft or RMA.

tenant: key/passphrase
provider: nothing

tenant: passphrase
provider: key

tenant: nothing
provider: key

vm <--- here
rbd
rados
osd
disks

3. Encryption further up stack (application perhaps?)

To me, #1/#2 are identical except in the case of #2 when the rbd
volume is not attached to a VM. Block devices attached to a VM and
mounted will be decrypted, making the encryption only useful for
defending against unauthorized access to storage media. With a
different key per VM, with potentially millions of tenants, you now
have a massive key escrow/management problem that only buys you a bit
of additional security when block devices are detached. Sounds like a
crappy deal to me, I'd either go with #1 or #3.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recommended node size for Ceph

2014-03-10 Thread Kyle Bader
> Why the limit of 6 OSDs per SSD?

SATA/SAS throughput generally.

> I am doing testing with a PCI-e based SSD, and showing that even with 15
OSD disk drives per SSD that the SSD is keeping up.

That will probably be fine performance wise but it's worth noting that all
OSDs will fail if the flash fails (same as node failure).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Encryption/Multi-tennancy

2014-03-10 Thread Kyle Bader
> Ceph is seriously badass, but my requirements are to create a cluster in 
> which I can host my customer's data in separate areas which are independently 
> encrypted, with passphrases which we as cloud admins do not have access to.
>
> My current thoughts are:
> 1. Create an OSD per machine stretching over all installed disks, then create 
> a user-sized block device per customer.  Mount this block device on an access 
> VM and create a LUKS container in to it followed by a zpool and then I can 
> allow the users to create separate bins of data as separate ZFS filesystems 
> in the container which is actually a blockdevice striped across the OSDs.
> 2. Create an OSD per customer and use dm-crypt, then store the dm-crypt key 
> somewhere which is rendered in some way so that we cannot access it, such as 
> a pgp-encrypted file using a passphrase which only the customer knows.

> My questions are:
> 1. What are people's comments regarding this problem (irrespective of my 
> thoughts)

What is the threat model that leads to these requirements? The story
"cloud admins do not have access" is not achievable through technology
alone.

> 2. Which would be the most efficient of (1) and (2) above?

In the case of #1 and #2, you are only protecting data at rest. With
#2 you would need to decrypt the key to open the block device, and the
key would remain in memory until it is unmounted (which the cloud
admin could access). This means #2 is safe so long as you never mount
the volume, which means its utility is rather limited (archive
perhaps). Neither of these schemes buy you much more than the
encryption handling provided by ceph-disk-prepare (dmcrypted osd
data/journal volumes), the key management problem becomes more acute,
eg. per tenant.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running a mon on a USB stick

2014-03-08 Thread Kyle Bader
> Is there an issue with IO performance?

Ceph monitors store cluster maps and various other things in leveldb,
which persists to disk. I wouldn't recommend using sd/usb cards for
the monitor store because they tend to be slow and have poor
durability.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] questions about ceph cluster in multi-dacenter

2014-02-20 Thread Kyle Bader
> What could be the best replication ?

Are you using two sites to increase availability, durability, or both?

For availability you're really better off using three sites and using
CRUSH to place each of three replicas in a different datacenter. In
this setup you can survive losing 1 of 3 datacenters. If two sites
are the only option and your goal is availability and durability then I
would do 4 replicas, using osd_pool_default_min_size = 2.
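
A hedged sketch of a rule for that 2+2 layout, assuming your CRUSH map
already has datacenter buckets defined:

rule two_dc {
        ruleset 1
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}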

> How to tune the crushmap of this kind of setup ?
> and last question : It's possible to have the reads from vms on DC1 to always 
> read datas on DC1 ?

Not yet!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How client choose among replications?

2014-02-11 Thread Kyle Bader
> Why would it help? Since it's not that ONE OSD will be primary for all
objects. There will be 1 Primary OSD per PG and you'll probably have a
couple of thousands PGs.

The primary may be across an oversubscribed/expensive link, in which case a
local replica with a common ancestor to the client may be preferable. It's
WIP with the goal of landing in firefly iirc.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor data distribution

2014-02-01 Thread Kyle Bader
> Change pg_num for .rgw.buckets to power of 2, an 'crush tunables
> optimal' didn't help :(

Did you bump pgp_num as well? The split pgs will stay in place until
pgp_num is bumped as well; if you do this, be prepared for (potentially
lots of) data movement.
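
For example (the target value is illustrative and should match pg_num):

ceph osd pool set .rgw.buckets pgp_num 2048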
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RADOS Gateway Issues

2014-01-23 Thread Kyle Bader
> HEALTH_WARN 1 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs
stuck unclean; 7 requests are blocked > 32 sec; 3 osds have slow requests;
pool cloudstack has too few pgs; pool .rgw.buckets has too few pgs
> pg 14.0 is stuck inactive since forever, current state incomplete, last
acting [5,0]
> pg 14.2 is stuck inactive since forever, current state incomplete, last
acting [0,5]
> pg 14.6 is stuck inactive since forever, current state down+incomplete,
last acting [4,2]
> pg 14.0 is stuck unclean since forever, current state incomplete, last
acting [5,0]
> pg 14.2 is stuck unclean since forever, current state incomplete, last
acting [0,5]
> pg 14.6 is stuck unclean since forever, current state down+incomplete,
last acting [4,2]
> pg 14.0 is incomplete, acting [5,0]
> pg 14.2 is incomplete, acting [0,5]
> pg 14.6 is down+incomplete, acting [4,2]
> 3 ops are blocked > 8388.61 sec
> 3 ops are blocked > 4194.3 sec
> 1 ops are blocked > 2097.15 sec
> 1 ops are blocked > 8388.61 sec on osd.0
> 1 ops are blocked > 4194.3 sec on osd.0
> 2 ops are blocked > 8388.61 sec on osd.4
> 2 ops are blocked > 4194.3 sec on osd.5
> 1 ops are blocked > 2097.15 sec on osd.5
> 3 osds have slow requests
> pool cloudstack objects per pg (37316) is more than 27.1587 times cluster
average (1374)
> pool .rgw.buckets objects per pg (76219) is more than 55.4723 times
cluster average (1374)
>
>
> Ignore the cloudstack pool, I was using cloudstack but not anymore, it's
an inactive pool.

You will probably want to check osd 0,2,4,5 to make sure they are all up
and in. Pg 14.6 needs (4,2) and the others need (0,5). Other than that you
may find that a pg query on the inactive/incomplete pgs will provide more
insight.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Power Cycle Problems

2014-01-16 Thread Kyle Bader
> On two separate occasions I have lost power to my Ceph cluster. Both times, I 
> had trouble bringing the cluster back to good health. I am wondering if I 
> need to config something that would solve this problem?

No special configuration should be necessary, I've had the unfortunate
luck of witnessing several power loss events with large Ceph clusters.
In each case something other than Ceph was the source of frustration
once power was returned. That said, monitor daemons should be started
first and must form a quorum before the cluster will be usable. It
sounds like you have made it that far if your getting output from
"ceph health" commands. The next step is to get your Ceph OSD daemons
running, which will require the data partitions to be mounted and the
journal device present. In Ubuntu installations this is handled by
udev scripts installed by the Ceph packages (I think this is may be
true for RHEL/CentOS but have not verified). Short of the udev method
you can mount the data partition manually. Once the data partition is
mounted you can start the OSDs manually in the event that init still
doesn't work after mounting, to do so you will need to know the
location of your keyring, ceph.conf and the OSD id. If you are unsure
of what the OSD id is then you can look at the root of the OSD data
partition, after it is mounted, in a file named "whoami". To manually
start:

/usr/bin/ceph-osd -i ${OSD_ID} --pid-file
/var/run/ceph/osd.${OSD_ID}.pid -c /etc/ceph/ceph.conf

After that it's a matter of examining the logs if you're still having
issues getting the OSDs to boot.

> After powering back up the cluster, “ceph health” revealed stale pages, mds 
> cluster degraded, 3/3 OSDs down. I tried to issue “sudo /etc/init.d/ceph -a 
> start” but I got no output from the command and the health status did not 
> change.

The placement groups are stale because none of the OSDs have reported
their state recently since they are down.

> I ended up having to re-install the cluster to fix the issue, but as my group 
> wants to use Ceph for VM storage in the future, we need to find a solution.

That's a shame, but at least you will be better prepared if it happens
again, hopefully your luck is not as unfortunate as mine!

-- 

Kyle Bader
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Networking questions

2013-12-26 Thread Kyle Bader
> Do monitors have to be on the cluster network as well or is it sufficient
> for them to be on the public network as
> http://ceph.com/docs/master/rados/configuration/network-config-ref/
> suggests?

Monitors only need to be on the public network.

> Also would the OSDs re-route their traffic over the public network if that
> was still available in case the cluster network fails?

Ceph doesn't currently support this type of configuration.

Hope that clears things up!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failure probability with largish deployments

2013-12-26 Thread Kyle Bader
> Yes, that also makes perfect sense, so the aforementioned 12500 objects
> for a 50GB image, at a 60 TB cluster/pool with 72 disk/OSDs and 3 way
> replication that makes 2400 PGs, following the recommended formula.
>
>> > What amount of disks (OSDs) did you punch in for the following run?
>> >> Disk Modeling Parameters
>> >> size:   3TiB
>> >> FIT rate:826 (MTBF = 138.1 years)
>> >> NRE rate:1.0E-16
>> >> RADOS parameters
>> >> auto mark-out: 10 minutes
>> >> recovery rate:50MiB/s (40 seconds/drive)
>> > Blink???
>> > I guess that goes back to the number of disks, but to restore 2.25GB at
>> > 50MB/s with 40 seconds per drive...
>>
>> The surviving replicas for placement groups that the failed OSDs
>> participated will naturally be distributed across many OSDs in the
>> cluster, when the failed OSD is marked out, it's replicas will be
>> remapped to many OSDs. It's not a 1:1 replacement like you might find
>> in a RAID array.
>>
> I completely get that part, however the total amount of data to be
> rebalanced after a single disk/OSD failure to fully restore redundancy is
> still 2.25TB (mistyped that as GB earlier) at the 75% utilization you
> assumed.
> What I'm still missing in this pictures is how many disks (OSDs) you
> calculated this with. Maybe I'm just misreading the 40 seconds per drive
> bit there. Because if that means each drive is only required to be just
> active for 40 seconds to do it's bit of recovery, we're talking 1100
> drives. ^o^ 1100 PGs would be another story.

To recreate the modeling:

git clone https://github.com/ceph/ceph-tools.git
cd ceph-tools/models/reliability/
python main.py -g

I used the following values:

Disk Type: Enterprise
Size: 3000 GiB
Primary FITs: 826
Secondary FITS: 826
NRE Rate: 1.0E-16

RAID Type: RAID6
Replace (hours): 6
Rebuild (MiB/s): 500
Volumes: 11

RADOS Copies: 3
Mark-out (min): 10
Recovery (MiB/s): 50
Space Usage: 75%
Declustering (pg): 1100
Stripe length: 1100 (limited by pgs anyway)

RADOS sites: 1
Rep Latency (s): 0
Recovery (MiB/s): 10
Disaster (years): 1000
Site Recovery (days): 30

NRE Model: Fail
Period (years): 1
Object Size: 4MB

It seems that the number of disks is not considered when calculating
the recovery window, only the number of pgs:

https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L68

I could also see the recovery rates varying based on the max osd
backfill tunable.

http://ceph.com/docs/master/rados/configuration/osd-config-ref/#backfilling

Doing both would improve the quality of models generated by the tool.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failure probability with largish deployments

2013-12-23 Thread Kyle Bader
> Is an object a CephFS file or a RBD image or is it the 4MB blob on the
> actual OSD FS?

Objects are at the RADOS level; CephFS filesystems, RBD images and RGW
objects are all composed by striping RADOS objects - the default is 4MB.

> In my case, I'm only looking at RBD images for KVM volume storage, even
> given the default striping configuration I would assume that those 12500
> OSD objects for a 50GB image  would not be in the same PG and thus just on
> 3 (with 3 replicas set) OSDs total?

Objects are striped across placement groups, so you take your RBD size
/ 4 MB and cap it at the total number of placement groups in your
cluster.

> What amount of disks (OSDs) did you punch in for the following run?
>> Disk Modeling Parameters
>> size:   3TiB
>> FIT rate:826 (MTBF = 138.1 years)
>> NRE rate:1.0E-16
>> RADOS parameters
>> auto mark-out: 10 minutes
>> recovery rate:50MiB/s (40 seconds/drive)
> Blink???
> I guess that goes back to the number of disks, but to restore 2.25GB at
> 50MB/s with 40 seconds per drive...

The surviving replicas for placement groups that the failed OSD
participated in will naturally be distributed across many OSDs in the
cluster; when the failed OSD is marked out, its replicas will be
remapped to many OSDs. It's not a 1:1 replacement like you might find
in a RAID array.

>> osd fullness:  75%
>> declustering:1100 PG/OSD
>> NRE model:  fail
>> object size:  4MB
>> stripe length:   1100
> I take it that is to mean that any RBD volume of sufficient size is indeed
> spread over all disks?

Spread over all placement groups, the difference is subtle but there
is a difference.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failure probability with largish deployments

2013-12-20 Thread Kyle Bader
Using your data as inputs to in the Ceph reliability calculator [1]
results in the following:

Disk Modeling Parameters
    size:          3TiB
    FIT rate:      826 (MTBF = 138.1 years)
    NRE rate:      1.0E-16
RAID parameters
    replace:       6 hours
    recovery rate: 500MiB/s (100 minutes)
    NRE model:     fail
    object size:   4MiB

Column legends
1 storage unit/configuration being modeled
2 probability of object survival (per 1 years)
3 probability of loss due to site failures (per 1 years)
4 probability of loss due to drive failures (per 1 years)
5 probability of loss due to NREs during recovery (per 1 years)
6 probability of loss due to replication failure (per 1 years)
7 expected data loss per Petabyte (per 1 years)

storage       durability  PL(site)    PL(copies)  PL(NRE)   PL(rep)     loss/PiB
------------  ----------  ----------  ----------  --------  ----------  ----------
RAID-6: 9+2   6-nines     0.000e+00   2.763e-10   0.11%     0.000e+00   9.317e+07


Disk Modeling Parameters
    size:          3TiB
    FIT rate:      826 (MTBF = 138.1 years)
    NRE rate:      1.0E-16
RADOS parameters
    auto mark-out: 10 minutes
    recovery rate: 50MiB/s (40 seconds/drive)
    osd fullness:  75%
    declustering:  1100 PG/OSD
    NRE model:     fail
    object size:   4MB
    stripe length: 1100

Column legends
1 storage unit/configuration being modeled
2 probability of object survival (per 1 years)
3 probability of loss due to site failures (per 1 years)
4 probability of loss due to drive failures (per 1 years)
5 probability of loss due to NREs during recovery (per 1 years)
6 probability of loss due to replication failure (per 1 years)
7 expected data loss per Petabyte (per 1 years)

storage       durability  PL(site)    PL(copies)  PL(NRE)    PL(rep)     loss/PiB
------------  ----------  ----------  ----------  ---------  ----------  ----------
RADOS: 3 cp   10-nines    0.000e+00   5.232e-08   0.000116%  0.000e+00   6.486e+03

[1] https://github.com/ceph/ceph-tools/tree/master/models/reliability

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph network topology with redundant switches

2013-12-20 Thread Kyle Bader
> The area I'm currently investigating is how to configure the
> networking. To avoid a SPOF I'd like to have redundant switches for
> both the public network and the internal network, most likely running
> at 10Gb. I'm considering splitting the nodes in to two separate racks
> and connecting each half to its own switch, and then trunk the
> switches together to allow the two halves of the cluster to see each
> other. The idea being that if a single switch fails I'd only lose half
> of the cluster.

This is fine if you are using a replication factor of 2; you would need 2/3 of
the cluster surviving if using a replication factor 3 with "osd pool default min
size" set to 2.

> My question is about configuring the public network. If it's all one
> subnet then the clients consuming the Ceph resources can't have both
> links active, so they'd be configured in an active/standby role. But
> this results in quite heavy usage of the trunk between the two
> switches when a client accesses nodes on the other switch than the one
> they're actively connected to.

The linux bonding driver supports several strategies for teaming network
adapters on L2 networks.

> So, can I configure multiple public networks? I think so, based on the
> documentation, but I'm not completely sure. Can I have one half of the
> cluster on one subnet, and the other half on another? And then the
> client machine can have interfaces in different subnets and "do the
> right thing" with both interfaces to talk to all the nodes. This seems
> like a fairly simple solution that avoids a SPOF in Ceph or the network
> layer.

You can have multiple networks for both the public and cluster networks,
the only restriction is that all subnets for a given type be within the same
supernet. For example

10.0.0.0/16 - Public supernet (configured in ceph.conf)
10.0.1.0/24 - Public rack 1
10.0.2.0/24 - Public rack 2
10.1.0.0/16 - Cluster supernet (configured in ceph.conf)
10.1.1.0/24 - Cluster rack 1
10.1.2.0/24 - Cluster rack 2
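
In ceph.conf that would look roughly like:

[global]
    public network = 10.0.0.0/16
    cluster network = 10.1.0.0/16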

> Or maybe I'm missing an alternative that would be better? I'm aiming
> for something that keeps things as simple as possible while meeting
> the redundancy requirements.
>
> As an aside, there's a similar issue on the cluster network side with
> heavy traffic on the trunk between the two cluster switches. But I
> can't see that's avoidable, and presumably it's something people just
> have to deal with in larger Ceph installations?

A proper CRUSH configuration is going to place a replica on a node in
each rack, which means every write is going to cross the trunk. Other
traffic that you will see on the trunk:

* OSDs gossiping with one another
* OSD/Monitor traffic in the case where an OSD is connected to a
  monitor connected in the adjacent rack (map updates, heartbeats).
* OSD/Client traffic where the OSD and client are in adjacent racks

If you use all 4 40GbE uplinks (common on 10GbE ToR) then your
cluster level bandwidth is oversubscribed 4:1. To lower oversubscription
you are going to have to steal some of the other 48 ports, 12 for 2:1 and
24 for a non-blocking fabric. Given the number of nodes you have/plan to
have you will be utilizing 6-12 links per switch, leaving you with 12-18
links for clients on a non-blocking fabric, 24-30 for 2:1 and 36-48 for 4:1.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw daemon stalls on download of some files

2013-12-19 Thread Kyle Bader
> Do you have any futher detail on this radosgw bug?

https://github.com/ceph/ceph/commit/0f36eddbe7e745665a634a16bf3bf35a3d0ac424
https://github.com/ceph/ceph/commit/0b9dc0e5890237368ba3dc34cb029010cb0b67fd

> Does it only apply to emperor?

The bug is present in dumpling too.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rbd image performance

2013-12-15 Thread Kyle Bader
>> Has anyone tried scaling a VMs io by adding additional disks and
>> striping them in the guest os?  I am curious what effect this would have
>> on io performance?

> Why would it? You can also change the stripe size of the RBD image.
Depending on the workload you might change it from 4MB to something like
1MB or 32MB? That would give you more or less RADOS objects which will also
give you a different I/O pattern.

The question comes up because it's common for people operating on EC2 to
stripe EBS volumes together for higher iops rates. I've tried striping
kernel RBD volumes before but hit some sort of thread limitation where
throughput was consistent despite the volume count. I've since learned the
thread limit is configurable. I don't think there is a thread limit that
needs to be tweaked for RBD via KVM/QEMU but I haven't tested this
empirically. As Wido mentioned, if you are operating your own cluster
configuring the stripe size may achieve similar results. Google used to use
a 64MB chunk size with GFS but switched to 1MB after they started
supporting more and more seek heavy workloads.
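
For what it's worth, the object size is set per image at creation time; a
sketch (size and name are placeholders):

rbd create --size 51200 --order 20 rbd/smallobj-vol    # 2^20 bytes = 1MiB objects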
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SysAdvent: Day 15 - Distributed Storage with Ceph

2013-12-15 Thread Kyle Bader
For you holiday pleasure I've prepared a SysAdvent article on Ceph:

http://sysadvent.blogspot.com/2013/12/day-15-distributed-storage-with-ceph.html

Check it out!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH and Savanna Integration

2013-12-14 Thread Kyle Bader
> Introduction of Savanna for those haven't heard of it:
>
> Savanna project aims to provide users with simple means to provision a
> Hadoop
>
> cluster at OpenStack by specifying several parameters like Hadoop version,
> cluster
>
> topology, nodes hardware details and a few more.
>
> For now, Savanna can use Swift as a storage for data that will be processed
> by
> Hadoop jobs. As far as I know, we can use Hadoop with CephFS.
> Is there anybody interested in CEPH and Savanna integration? and how to?

You could use a Ceph RADOS gateway as a drop-in replacement that
provides a Swift-compatible endpoint. Alternatively, the docs for
using Hadoop in conjunction with CephFS are here:

http://ceph.com/docs/master/cephfs/hadoop/

Hopefully that puts you in the right direction!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NUMA and ceph

2013-12-12 Thread Kyle Bader
It seems that NUMA can be problematic for ceph-osd daemons in certain
circumstances. Namely, it seems that if a NUMA zone is running out of
memory due to uneven allocation, it is possible for that zone to
enter reclaim mode when threads/processes are scheduled on a core in
that zone and request memory allocations greater than the zone's
remaining memory. In order for the kernel to satisfy
the memory allocation for those processes it needs to page out some of
the contents of the contentious zone, which can have dramatic
performance implications due to cache misses, etc. I see two ways an
operator could alleviate these issues:

Set the vm.zone_reclaim_mode sysctl setting to 0, along with prefixing
ceph-osd daemons with "numactl --interleave=all". This should probably
be activated by a flag in /etc/default/ceph and modifying the
ceph-osd.conf upstart script, along with adding a depend to the ceph
package's debian/rules file on the "numactl" package.

The alternative is to use a cgroup for each ceph-osd daemon, pinning
each one to cores in the same NUMA zone using cpuset.cpu and
cpuset.mems. This would probably also live in /etc/default/ceph and
the upstart scripts.
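
Concretely, the first option boils down to something like this (the paths
assume a stock Ubuntu install):

sysctl -w vm.zone_reclaim_mode=0
numactl --interleave=all /usr/bin/ceph-osd -i 0 \
    --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf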

-- 
Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph reliability in large RBD setups

2013-12-10 Thread Kyle Bader
> I've been running similar calculations recently. I've been using this
> tool from Inktank to calculate RADOS reliabilities with different
> assumptions:
>   https://github.com/ceph/ceph-tools/tree/master/models/reliability
>
> But I've also had similar questions about RBD (or any multi-part files
> stored in RADOS) -- naively, a file/device stored in N objects would
> be N times less reliable than a single object. But I hope there's an
> error in that logic.

It's worth pointing out that Ceph's RGW will actually stripe S3 objects
across many RADOS objects - even when it's not a multi-part upload, this
has been the case since the Bobtail release. There is an in-depth Google
paper about availability modeling; it might provide some insight into what
the math should look like:

http://research.google.com/pubs/archive/36737.pdf

When reading it you can think of objects as chunks and pgs as stripes.
CRUSH should be configured based on failure domains that cause correlated
failures, ie power and networking. You also want to consider the
availability of the facility itself:

"Typical availability estimates used in the industry range from 99.7%
availability for tier II datacenters to 99.98% and 99.995% for tiers III
and IV, respectively."

http://www.morganclaypool.com/doi/pdf/10.2200/s00193ed1v01y200905cac006

If you combine the cluster availability metric and the facility
availability metric, you might be surprised. A cluster with 99.995%
availability in a Tier II facility is going to be dragged down to 99.7%
availability.  If a cluster goes down in the forest, does anyone know?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anybody doing Ceph for OpenStack with OSDs across compute/hypervisor nodes?

2013-12-09 Thread Kyle Bader
> We're running OpenStack (KVM) with local disk for ephemeral storage.
> Currently we use local RAID10 arrays of 10k SAS drives, so we're quite rich
> for IOPS and have 20GE across the board. Some recent patches in OpenStack
> Havana make it possible to use Ceph RBD as the source of ephemeral VM
> storage, so I'm interested in the potential for clustered storage across our
> hypervisors for this purpose. Any experience out there?

I believe Piston converges their storage/compute; they refer to it as
a null-tier architecture.

http://www.pistoncloud.com/openstack-cloud-software/technology/#storage
-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimal setup with 4 x ethernet ports

2013-12-06 Thread Kyle Bader
> looking at tcpdump all the traffic is going exactly where it is supposed to 
> go, in particular an osd on the 192.168.228.x network appears to talk to an 
> osd on the 192.168.229.x network without anything strange happening. I was 
> just wondering if there was anything about ceph that could make this 
> non-optimal, assuming traffic was reasonably balanced between all the osd's 
> (eg all the same weights). I think the only time it would suffer is if writes 
> to other osds result in a replica write to a single osd, and even then a 
> single OSD is still limited to 7200RPM disk speed anyway so the loss isn't 
> going to be that great.

Should be fine given you only have a 1:1 ratio of link to disk.

> I think I'll be moving over to bonded setup anyway, although I'm not sure if 
> rr or lacp is best... rr will give the best potential throughput, but lacp 
> should give similar aggregate throughput if there are plenty of connections 
> going on, and less cpu load as no need to reassemble fragments.

One of the DreamHost clusters is using a pair of bonded 1GbE links on
the public network and another pair for the cluster network, we
configured each to use mode 802.3ad.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimal setup with 4 x ethernet ports

2013-12-04 Thread Kyle Bader
>> Is having two cluster networks like this a supported configuration? Every
>> osd and mon can reach every other so I think it should be.
>
> Maybe. If your back end network is a supernet and each cluster network is a
> subnet of that supernet. For example:
>
> Ceph.conf cluster network (supernet): 10.0.0.0/8
>
> Cluster network #1:  10.1.1.0/24
> Cluster network #2: 10.1.2.0/24
>
> With that configuration OSD address autodetection *should* just work.

It should work, but thinking more about it, the OSDs will likely be
assigned IPs on a single network, whichever is inspected and matches
the supernet range (which could be in either subnet). In order to have
OSDs on two distinct networks you will likely have to use a
declarative configuration in /etc/ceph/ceph.conf which lists the OSD
IP addresses for each OSD (making sure to balance between links).

>> 1. move osd traffic to eth1. This obviously limits maximum throughput to
>> ~100Mbytes/second, but I'm getting nowhere near that right now anyway.
>
> Given three links I would probably do this if your replication factor is >=
> 3. Keep in mind 100Mbps links could very well end up being a limiting
> factor.

Sorry, I read Mbytes as Mbps; big difference, the former is much preferable!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimal setup with 4 x ethernet ports

2013-12-02 Thread Kyle Bader
> Is having two cluster networks like this a supported configuration? Every
osd and mon can reach every other so I think it should be.

Maybe. If your back end network is a supernet and each cluster network is a
subnet of that supernet. For example:

Ceph.conf cluster network (supernet): 10.0.0.0/8

Cluster network #1:  10.1.1.0/24
Cluster network #2: 10.1.2.0/24

With that configuration OSD address autodetection *should* just work.

> 1. move osd traffic to eth1. This obviously limits maximum throughput to
~100Mbytes/second, but I'm getting nowhere near that right now anyway.

Given three links I would probably do this if your replication factor is >=
3. Keep in mind 100Mbps links could very well end up being a limiting
factor.

What are you backing each OSD with storage wise and how many OSDs do you
expect to participate in this cluster?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Impact of fancy striping

2013-11-30 Thread Kyle Bader
> This journal problem is a bit of wizardry to me, I even had weird
intermittent issues with OSDs not starting because the journal was not
found, so please do not hesitate to suggest a better journal setup.

You mentioned using SAS for the journal; if your OSDs are SATA and an expander
is in the data path it might be slow from MUX/STP/etc overhead. If the
setup is all SAS you might try collocating each journal with its matching
data partition on a single disk. Two spindles being contended for by nine
OSDs is a lot. How are your drives attached?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] installing OS on software RAID

2013-11-30 Thread Kyle Bader
> > Is the OS doing anything apart from ceph? Would booting a ramdisk-only
system from USB or compact flash work?

I haven't tested this kind of configuration myself but I can't think of
anything that would preclude this type of setup. I'd probably use sqashfs
layered with a tmpfs via aufs to avoid any writes to the USB drive. I would
also mount spinning high capacity media for /var/log or setup log streaming
to something like rsyslog/syslog-ng/logstash.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 回复:Re: testing ceph performance issue

2013-11-27 Thread Kyle Bader
> How much performance can be improved if use SSDs  to storage journals?

You will see roughly twice the throughput unless you are using btrfs
(still improved but not as dramatic). You will also see lower latency
because the disk head doesn't have to seek back and forth between
journal and data partitions.

>   Kernel RBD Driver  ,  what is this ?

There are several RBD implementations, one is the kernel RBD driver in
upstream Linux, another is built into Qemu/KVM.

> and we want to know the RBD if  support XEN virual  ?

It is possible, but not nearly as well tested and not as prevalent as RBD
via Qemu/KVM. This might be a starting point if you're interested in
testing Xen/RBD integration:

http://wiki.xenproject.org/wiki/Ceph_and_libvirt_technology_preview

Hope that helps!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD on an external, shared device

2013-11-26 Thread Kyle Bader
> Is there any way to manually configure which OSDs are started on which
> machines? The osd configuration block includes the osd name and host, so is
> there a way to say that, say, osd.0 should only be started on host vashti
> and osd.1 should only be started on host zadok?  I tried using this
> configuration:

The ceph udev rules are going to automatically mount disks that match
the ceph "magic" guids, to dig through the full logic you need to
inspect these files:

/lib/udev/rules.d/60-ceph-partuuid-workaround.rules
/lib/udev/rules.d/95-ceph-osd.rules

The upstart scripts look to see what is mounted at /var/lib/ceph/osd/
and starts osd daemons as appropriate:

/etc/init/ceph-osd-all-starter.conf

In theory you should be able to remove the udev scripts and mount the
osds in /var/lib/ceph/osd if you're using upstart. You will want to make
sure that upgrades to the ceph package don't replace the files, maybe
that means making a null rule and using "-o
Dpkg::Options::='--force-confold" in ceph-deploy/chef/puppet/whatever.
You will also want to avoid putting the mounts in fstab because it
could render your node unbootable if the device or filesystem fails.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] installing OS on software RAID

2013-11-25 Thread Kyle Bader
Several people have reported issues with combining OS and OSD journals
on the same SSD drives/RAID due to contention. If you do something
like this I would definitely test to make sure it meets your
expectations. Ceph logs are going to compose the majority of the
writes to the OS storage devices.

On Mon, Nov 25, 2013 at 12:46 PM, James Harper
 wrote:
>>
>> We need to install the OS on the 3TB harddisks that come with our Dell
>> servers. (After many attempts, I've discovered that Dell servers won't allow
>> attaching an external harddisk via the PCIe slot. (I've tried everything). )
>>
>> But, must I therefore sacrifice two hard disks (RAID-1) for the OS?  I don't 
>> see
>> why I can't just create a small partition  (~30GB) on all 6 of my hard 
>> disks, do a
>> software-based RAID 1 on it, and be done.
>>
>> I know that software based RAID-5 seems computationally expensive, but
>> shouldn't RAID 1 be fast and computationally inexpensive for a computer
>> built over the last 4 years? I wouldn't think that a CEPH systems (with lots 
>> of
>> VMs but little data changes) would even do much writing to the OS
>> partitionbut I'm not sure. (And in the past, I have noticed that RAID5
>> systems did suck up a lot of CPU and caused lots of waits, unlike what the
>> blogs implied. But I'm thinking that a RAID 1 takes little CPU and the OS 
>> does
>> little writing to disk; it's mostly reads, which should hit the RAM.)
>>
>> Does anyone see any holes in the above idea? Any gut instincts? (I would try
>> it, but it's hard to tell how well the system would really behave under 
>> "real"
>> load conditions without some degree of experience and/or strong
>> theoretical knowledge.)
>
> Is the OS doing anything apart from ceph? Would booting a ramdisk-only system 
> from USB or compact flash work?
>
> If the OS doesn't produce a lot of writes then having it on the main disk 
> should work okay. I've done it exactly as you describe before.
>
> James
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] misc performance tuning queries (related to OpenStack in particular)

2013-11-19 Thread Kyle Bader
> So quick correction based on Michael's response. In question 4, I should
> have not made any reference to Ceph objects, since objects are not striped
> (per Michael's response). Instead, I should simply have used the words "Ceph
> VM Image" instead of "Ceph objects". A Ceph VM image would constitute 1000s
> of objects, and the different objects are striped/spread across multiple
> OSDs from multiple servers. In that situation, what's answer to #4

It depends on which Linux bonding driver is in use: some drivers load
share on transmit, some load share on receive, some do both and some
only provide active/passive fault tolerance. I have Ceph OSD hosts
using LACP (bond-mode 802.3ad) and they load share on both receive and
transmit. We're utilizing a pair of bonded 1GbE links for the Ceph
public network and another pair of bonded 1GbE links for the cluster
network. The issues we've seen with 1GbE are complexity, shallow
buffers on 1GbE top of rack switch gear (Cisco 4948-10G) and the fact
that not all flows are equal (4x 1GbE != 4GbE).

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance

2013-11-15 Thread Kyle Bader
> We have the plan to run ceph as block storage for openstack, but from test
> we found the IOPS is slow.
>
> Our apps primarily use the block storage for saving logs (i.e, nginx's
> access logs).
> How to improve this?

There are a number of things you can do, notably:

1. Tuning cache on the hypervisor

http://ceph.com/docs/master/rbd/rbd-config-ref/

2. Separate device OSD journals, usually SSD is used (no longer
seeking between data and journal volumes on a single disk)

3. Flash based writeback cache for OSD data volume

https://github.com/facebook/flashcache/
http://bcache.evilpiepirate.org/
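
For #1, the relevant ceph.conf bits on the hypervisor look roughly like
this (the sizes are illustrative):

[client]
    rbd cache = true
    rbd cache size = 67108864
    rbd cache max dirty = 50331648
    rbd cache writethrough until flush = true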

If you have any questions let us know!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Today I’ve encountered multiple OSD down and multiple OSD won’t start and OSD disk access “Input/Output” error”

2013-11-15 Thread Kyle Bader
> 3).Comment out,  #hashtag the bad OSD drives in the “/etc/fstab”.

This is unnecessary if you're using the provided upstart and udev
scripts; OSD data devices will be identified by label and mounted. If
you choose not to use the upstart and udev scripts then you should
write init scripts that do similar so that you don't have to have
/etc/fstab entries.

> 3).Login to Ceph Node  with bad OSD net/serial/video.

I'd put "check dmesg" somewhere near the top of this section; often if
you lose an OSD due to a filesystem hiccup then it will be evident in
dmesg output.

>  4).Stop only this local Ceph node  with “service Ceph stop”

You may want to set "noout" depending on whether you expect it to come
back online within your "mon osd down out interval" threshold.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph User Committee

2013-11-07 Thread Kyle Bader
> Would this be something like 
> http://wiki.ceph.com/01Planning/02Blueprints/Firefly/Ceph-Brag ?

Something very much like that :)

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

