Re: [ceph-users] unsubscribe

2019-07-12 Thread Brian Topping
It’s in the mail headers on every email: 
mailto:ceph-users-requ...@lists.ceph.com?subject=unsubscribe

> On Jul 12, 2019, at 5:00 PM, Robert Stanford  wrote:
> 
> unsubscribe
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird behaviour of ceph-deploy

2019-06-17 Thread Brian Topping
I don’t have an answer for you, but it will help others if you show:
- Versions of all nodes involved and the multi-master configuration
- Confirmation of forward and reverse DNS and of SSH / remote sudo, since you are using ceph-deploy
- The specific steps that did not behave properly
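
For the DNS / SSH point, a quick pre-flight sketch run from the admin node might look
like this (the hostname is the one from your mail; everything else is generic):

  # forward and reverse DNS should agree on the new node
  getent hosts sd0051
  getent hosts $(getent hosts sd0051 | awk '{print $1}')
  # passwordless SSH plus passwordless sudo, which ceph-deploy relies on
  ssh sd0051 sudo whoami    # should print "root" without prompting
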
> On Jun 17, 2019, at 6:29 AM, CUZA Frédéric  wrote:
> 
> I’ll keep updating this until I find a solution, so if anyone faces the same 
> problem they might have a solution.
>  
> At the moment: I installed the new OSD node with ceph-deploy and nothing changed; 
> the node is still not present in the cluster nor in the crushmap.
> I decided to manually add it to the crush map:
> ceph osd crush add-bucket sd0051 host
> and move it to where it should be:
> ceph osd crush move sd0051 room=roomA
> Then I added an OSD to that node:
> ceph-deploy osd create sd0051 --data /dev/sde --block-db /dev/sda1 
> --block-wal /dev/sdb1 --bluestore
> Once finally created, the OSD is still not linked to the host where it was 
> created, and I can’t move it to this host right now.
>  
>  
> Regards,
>  
>  
> From: ceph-users  On behalf of CUZA 
> Frédéric
> Sent: 15 June 2019 00:34
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Weird behaviour of ceph-deploy
>  
> Little update:
> I checked one OSD I’ve installed, even though the host is not present in the 
> crushmap (or in the cluster, I guess), and I found this:
>  
> monclient: wait_auth_rotating timed out after 30
> osd.xxx 0 unable to obtain rotating service keys; retrying
>  
> I also added the host to the admin hosts:
> ceph-deploy admin sd0051
> and nothing changed.
>  
> When I do the install, no .conf is pushed to the new node.
>  
> Regards,
>  
> From: ceph-users  On behalf of CUZA Frédéric
> Sent: 14 June 2019 18:28
> To: ceph-users@lists.ceph.com 
> Subject: [ceph-users] Weird behaviour of ceph-deploy
>  
> Hi everyone,
>  
> I am facing some strange behaviour from ceph-deploy.
> I am trying to add a new node to our cluster:
> ceph-deploy install --no-adjust-repos sd0051
>  
> Everything seems to work fine, but the new bucket (host) is not created in the 
> crushmap, and when I try to add a new OSD to that host, the OSD is created but 
> is not linked to any host (normal behaviour, since the host is not present).
> Has anyone faced this before?
>  
> FYI: We have already added new nodes this way, and this is the first time we have faced this.
>  
> Thanks!
>  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pool migration for cephfs?

2019-05-15 Thread Brian Topping
Lars, I just got done doing this after generating about a dozen CephFS subtrees 
for different Kubernetes clients. 

tl;dr: there is no way for files to move between filesystem formats (i.e. CephFS 
-> RBD) without copying them.

If you are doing the same thing, there may be some relevance for you in 
https://github.com/kubernetes/enhancements/pull/643. It’s worth checking to see 
if it meets your use case if so.

In any event, what I ended up doing was letting Kubernetes create the new PV 
with the RBD provisioner, then using find piped to cpio to move the file 
subtree. In a non-Kubernetes environment, one would simply create the 
destination RBD as usual. It should be most performant to do this on a monitor 
node.

cpio ensures you don’t lose metadata. It’s been fine for me, but if you have 
special xattrs that the clients of the files need, be sure to test that those 
are copied over. It’s very difficult to move that metadata once a file is 
copied and even harder to deal with a situation where the destination volume 
went live and some files on the destination are both newer versions and missing 
metadata. 
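
For reference, a minimal sketch of the find-piped-to-cpio copy described above, assuming 
the CephFS subtree is mounted at /mnt/cephfs/subtree and the new RBD-backed filesystem at 
/mnt/newvol (paths and exact flags are illustrative):

  # copy the subtree, preserving ownership, permissions and timestamps
  cd /mnt/cephfs/subtree
  find . -depth -print0 | cpio --null -pdm /mnt/newvol
  # spot-check xattrs afterwards if your clients depend on them, e.g.:
  getfattr -d -m - /mnt/newvol/some/file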

Brian

> On May 15, 2019, at 6:05 AM, Lars Täuber  wrote:
> 
> Hi,
> 
> is there a way to migrate a cephfs to a new data pool like it is for rbd on 
> nautilus?
> https://ceph.com/geen-categorie/ceph-pool-migration/
> 
> Thanks
> Lars
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-04-26 Thread Brian Topping
> On Apr 26, 2019, at 1:50 PM, Gregory Farnum  wrote:
> 
> Hmm yeah, it's probably not using UTC. (Despite it being good
> practice, it's actually not an easy default to adhere to.) cephx
> requires synchronized clocks and probably the same timezone (though I
> can't swear to that.)

Apps don’t “see” timezones; a timezone is a rendering transform of an absolute 
time. The instant “now” is the same throughout space and time, regardless of 
how that instant is quantified. UNIX wall time is just one such quantification.

Problems ensue when the rendered time is incorrect for the time zone shown in 
the rendering. If a machine that is “not using time zones” shows that the time 
is 3 PM UTC and one lives in London, the internal time will be correct. On the 
other hand, if one lives in NYC, the internal time is incorrect. That is to say, 
15:00 UTC rendered as 3 PM in NYC is very wrong *because it’s not 3 PM in London*, 
where UTC holds.
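
In practice, a quick way to check this on each node (assuming systemd and chrony; ntpd 
works similarly):

  # wall clock, time zone and NTP sync state as the node sees them
  timedatectl
  # offset from the NTP source; keep the skew between nodes well under a second
  chronyc tracking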

tl;dr: Make sure that your clock is set to the correct time for whatever time zone 
the rendering uses. It doesn’t matter where the system actually resides or whether 
the TZ matches its geographic location.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SOLVED: Multi-site replication speed

2019-04-20 Thread Brian Topping
Followup: Seems to be solved, thanks again for your help. I did have some 
issues with the replication that may have been solved by getting the metadata 
init/run finished first. I haven’t replicated that back to the production 
servers yet, but I’m a lot more comfortable with the behaviors by setting this 
up on a test cluster. 

I believe this is the desired state for unidirectional replication:

> [root@left01 ~]# radosgw-admin sync status
>   realm d5078dd2-6a6e-49f8-941e-55c02ad58af7 (example-test)
>   zonegroup de533461-2593-45d2-8975-99072d860bb2 (us)
>zone 479d3f20-d57d-4b37-995b-510ba10756bf (left)
>   metadata sync no sync (zone is master)
>   data sync source: 5dc80bbc-3d9d-46d5-8f3e-4611fbc17fbe (right)
> not syncing from zone


> [root@right01 ~]# radosgw-admin sync status
>   realm d5078dd2-6a6e-49f8-941e-55c02ad58af7 (example-test)
>   zonegroup de533461-2593-45d2-8975-99072d860bb2 (us)
>zone 5dc80bbc-3d9d-46d5-8f3e-4611fbc17fbe (right)
>   metadata sync syncing
> full sync: 0/64 shards
> incremental sync: 64/64 shards
> metadata is caught up with master
>   data sync source: 479d3f20-d57d-4b37-995b-510ba10756bf (left)
> syncing
> full sync: 0/128 shards
> incremental sync: 128/128 shards
> data is caught up with source

My confusion at the start of this thread was not knowing what would be 
replicated. It was not clear to me that the only objects that would be 
replicated were the S3/Swift objects that were created by other S3/Swift 
clients. The note at [1] from John Spray pretty much sorted out everything I 
was looking to do as well as what I did not fully understand about the Ceph 
stack. All in all, a very informative adventure! 
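
For anyone validating a setup like this, the simplest end-to-end check is to write an 
object through the master zone and watch it appear on the secondary. A hedged sketch 
with awscli (the bucket name, credentials and endpoint URLs are assumptions; left01 and 
right01 are the gateway hosts from this thread):

  # create a bucket and an object via the master zone ("left")
  aws --endpoint-url http://left01.example.com:7480 s3 mb s3://sync-test
  aws --endpoint-url http://left01.example.com:7480 s3 cp ./hello.txt s3://sync-test/hello.txt
  # replication is asynchronous; give it a moment, then list through "right"
  aws --endpoint-url http://right01.example.com:7480 s3 ls s3://sync-test/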

Hopefully the thread is helpful to others who follow. I’m happy to answer 
questions off-thread as well.

best, Brian

[1] 
https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/415#note_16192610

> On Apr 19, 2019, at 10:21 PM, Brian Topping  wrote:
> 
> Hi Casey,
> 
> I set up a completely fresh cluster on a new VM host.. everything is fresh 
> fresh fresh. I feel like it installed cleanly and because there is 
> practically zero latency and unlimited bandwidth as peer VMs, this is a 
> better place to experiment. The behavior is the same as the other cluster.
> 
> The realm is “example-test”, has a single zone group named “us”, and there 
> are zones “left” and “right”. The master zone is “left” and I am trying to 
> unidirectionally replicate to “right”. “left” is a two node cluster and right 
> is a single node cluster. Both show "too few PGs per OSD” but are otherwise 
> 100% active+clean. Both clusters have been completely restarted to make sure 
> there are no latent config issues, although only the RGW nodes should require 
> that. 
> 
> The thread at [1] is the most involved engagement I’ve found with a staff 
> member on the subject, so I checked and believe I attached all the logs that 
> were requested there. They all appear to be consistent and are attached below.
> 
> For start: 
>> [root@right01 ~]# radosgw-admin sync status
>>   realm d5078dd2-6a6e-49f8-941e-55c02ad58af7 (example-test)
>>   zonegroup de533461-2593-45d2-8975-99072d860bb2 (us)
>>zone 5dc80bbc-3d9d-46d5-8f3e-4611fbc17fbe (right)
>>   metadata sync syncing
>> full sync: 0/64 shards
>> incremental sync: 64/64 shards
>> metadata is caught up with master
>>   data sync source: 479d3f20-d57d-4b37-995b-510ba10756bf (left)
>> syncing
>> full sync: 0/128 shards
>> incremental sync: 128/128 shards
>> data is caught up with source
> 
> 
> I tried the information at [2] and do not see any ops in progress, just 
> “linger_ops”. I don’t know what those are, but probably explain the slow 
> stream of requests back and forth between the two RGW endpoints:
>> [root@right01 ~]# ceph daemon client.rgw.right01.54395.94074682941968 
>> objecter_requests
>> {
>> "ops": [],
>> "linger_ops": [
>> {
>> "linger_id": 2,
>> "pg": "2.16dafda0",
>> "osd": 0,
>> "object_id": "notify.1",
>> "object_locator": "@2",
>> "target_object_id": "notify.1",
>> "target_object_locator": "@

Re: [ceph-users] Multi-site replication speed

2019-04-19 Thread Brian Topping
Hi Casey,

I set up a completely fresh cluster on a new VM host.. everything is fresh 
fresh fresh. I feel like it installed cleanly and because there is practically 
zero latency and unlimited bandwidth as peer VMs, this is a better place to 
experiment. The behavior is the same as the other cluster.

The realm is “example-test”, has a single zone group named “us”, and there are 
zones “left” and “right”. The master zone is “left” and I am trying to 
unidirectionally replicate to “right”. “left” is a two node cluster and right 
is a single node cluster. Both show "too few PGs per OSD” but are otherwise 
100% active+clean. Both clusters have been completely restarted to make sure 
there are no latent config issues, although only the RGW nodes should require 
that. 

The thread at [1] is the most involved engagement I’ve found with a staff 
member on the subject, so I checked and believe I attached all the logs that 
were requested there. They all appear to be consistent and are attached below.

For start: 
> [root@right01 ~]# radosgw-admin sync status
>   realm d5078dd2-6a6e-49f8-941e-55c02ad58af7 (example-test)
>   zonegroup de533461-2593-45d2-8975-99072d860bb2 (us)
>zone 5dc80bbc-3d9d-46d5-8f3e-4611fbc17fbe (right)
>   metadata sync syncing
> full sync: 0/64 shards
> incremental sync: 64/64 shards
> metadata is caught up with master
>   data sync source: 479d3f20-d57d-4b37-995b-510ba10756bf (left)
> syncing
> full sync: 0/128 shards
> incremental sync: 128/128 shards
> data is caught up with source


I tried the information at [2] and do not see any ops in progress, just 
“linger_ops”. I don’t know what those are, but probably explain the slow stream 
of requests back and forth between the two RGW endpoints:
> [root@right01 ~]# ceph daemon client.rgw.right01.54395.94074682941968 
> objecter_requests
> {
> "ops": [],
> "linger_ops": [
> {
> "linger_id": 2,
> "pg": "2.16dafda0",
> "osd": 0,
> "object_id": "notify.1",
> "object_locator": "@2",
> "target_object_id": "notify.1",
> "target_object_locator": "@2",
> "paused": 0,
> "used_replica": 0,
> "precalc_pgid": 0,
> "snapid": "head",
> "registered": "1"
> },
> ...
> ],
> "pool_ops": [],
> "pool_stat_ops": [],
> "statfs_ops": [],
> "command_ops": []
> }
> 


The next thing I tried is `radosgw-admin data sync run --source-zone=left` from 
the right side. I get bursts of messages of the following form:
> 2019-04-19 21:46:34.281 7f1c006ad580  0 RGW-SYNC:data:sync:shard[1]: ERROR: 
> failed to read remote data log info: ret=-2
> 2019-04-19 21:46:34.281 7f1c006ad580  0 meta sync: ERROR: RGWBackoffControlCR 
> called coroutine returned -2


When I sorted and filtered the messages, each burst has one RGW-SYNC message 
for each of the PGs on the left side identified by the number in “[]”. Since 
left has 128 PGs, these are the numbers between 0-127. The bursts happen about 
once every five seconds.

The packet traces between the nodes during the `data sync run` are mostly 
requests and responses of the following form:
> HTTP GET: 
> http://right01.example.com:7480/admin/log/?type=data&id=7&marker&extra-info=true&rgwx-zonegroup=de533461-2593-45d2-8975-99072d860bb2
> HTTP 404 RESPONSE: 
> {"Code":"NoSuchKey","RequestId":"tx02a01-005cba9593-371d-right","HostId":"371d-right-us"}

When I stop the `data sync run`, these 404s stop, so clearly the `data sync 
run` isn’t changing a state in the rgw, but doing something synchronously. In 
the past, I have done a `data sync init` but it doesn’t seem like doing it 
repeatedly will make a difference so I didn’t do it any more.

NEXT STEPS:

I am working on how to get better logging output from the daemons and hope to find 
something in there that will help. If I am lucky, I will be able to report back so 
this thread is useful for others. If I have not written back, I probably haven’t 
found anything, so I would be grateful for any leads.
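
In case it helps someone else, a sketch of how one might turn up the sync-related 
logging on a running gateway (the admin socket name is the one from the 
objecter_requests example above; treat the exact values as assumptions):

  # raise rgw and messenger verbosity via the daemon's admin socket
  ceph daemon client.rgw.right01.54395.94074682941968 config set debug_rgw 20
  ceph daemon client.rgw.right01.54395.94074682941968 config set debug_ms 1
  # or persistently in ceph.conf, under the gateway's [client.rgw.*] section:
  #   debug rgw = 20
  #   debug ms = 1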

Kind regards and thank you!

Brian

[1] 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013188.html 

[2] 
http://docs.ceph.com/docs/master/radosgw/troubleshooting/?highlight=linger_ops#blocked-radosgw-requests
 


CONFIG DUMPS:

> [root@left01 ~]# radosgw-admin period get-current
> {
> "current_period": "cdc3d603-2bc8-493b-ba6a-c6a51c49cc0c"
> }
> [root@le

Re: [ceph-users] Are there any statistics available on how most production ceph clusters are being used?

2019-04-19 Thread Brian Topping
> On Apr 19, 2019, at 10:59 AM, Janne Johansson  wrote:
> 
> May the most significant bit of your life be positive.

Marc, my favorite thing about open source software is it has a 100% money back 
satisfaction guarantee: If you are not completely satisfied, you can have an 
instant refund, just for waving your arm! :D

Seriously though, Janne is right, for any OSS project. Think of it like a party 
where some people go home “when it’s over” and some people stick around and 
help clean up. Using myself as an example, I’ve been asking questions about RGW 
multi-site, and now that I have a little more experience with it (not much more 
— it’s not working yet, just where I can see gaps in the documentation), I owe 
it to those that have helped me get here by filling those gaps in the docs. 

That’s where I can start, and when I understand what’s going on with more 
authority, I can go into the source and create changes that alter how it works 
for others to review.

Note in both cases I am proposing concrete changes, which is far more effective 
than trying to describe situations that others may have never been in. Many can 
try to help, but if it is frustrating for them, they will lose interest. Good 
pull requests are never frustrating to understand, even if they need more work 
to handle cases others know about. It’s a more quantitative means of expression.

If that kind of commitment doesn’t sound appealing, buy support contracts. Pay 
back into the community so that those with a passion for the product can do 
exactly what I’ve described here. There’s no shame in that, but users like you 
and me need to be careful with the time of those who have put their lives into 
this, at least until we can put more into the party than we have taken out.

Hope that helps!  :B
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-site replication speed

2019-04-18 Thread Brian Topping
Hi Casey, thanks for this info. It’s been doing something for 36 hours, but not 
updating the status at all. So it either takes a really long time for 
“preparing for full sync” or I’m doing something wrong. This is helpful 
information, but there’s a myriad of states that the system could be in. 

With that, I’m going to set up a lab rig and see if I can build a fully 
replicated state. At that point, I’ll have a better understanding of what a 
working system responds like and maybe I can at least ask better questions, 
hopefully figure it out myself. 

Thanks again! Brian

> On Apr 16, 2019, at 08:38, Casey Bodley  wrote:
> 
> Hi Brian,
> 
> On 4/16/19 1:57 AM, Brian Topping wrote:
>>> On Apr 15, 2019, at 5:18 PM, Brian Topping <brian.topp...@gmail.com> wrote:
>>> 
>>> If I am correct, how do I trigger the full sync?
>> 
>> Apologies for the noise on this thread. I came to discover the 
>> `radosgw-admin [meta]data sync init` command. That left me with 
>> something that looked like this for several hours:
>> 
>>> [root@master ~]# radosgw-admin  sync status
>>>   realm 54bb8477-f221-429a-bbf0-76678c767b5f (example)
>>>   zonegroup 8e33f5e9-02c8-4ab8-a0ab-c6a37c2bcf07 (us)
>>>zone b6e32bc8-f07e-4971-b825-299b5181a5f0 (secondary)
>>>   metadata sync preparing for full sync
>>> full sync: 64/64 shards
>>> full sync: 0 entries to sync
>>> incremental sync: 0/64 shards
>>> metadata is behind on 64 shards
>>> behind shards: 
>>> [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63]
>>>   data sync source: 35835cb0-4639-43f4-81fd-624d40c7dd6f (master)
>>> preparing for full sync
>>> full sync: 1/128 shards
>>> full sync: 0 buckets to sync
>>> incremental sync: 127/128 shards
>>> data is behind on 1 shards
>>> behind shards: [0]
>> 
>> I also had the data sync showing a list of “behind shards”, but both of them 
>> sat in “preparing for full sync” for several hours, so I tried 
>> `radosgw-admin [meta]data sync run`. My sense is that was a bad idea, but 
>> neither of the commands seem to be documented and the thread I found them on 
>> indicated they wouldn’t damage the source data.
>> 
>> QUESTIONS at this point:
>> 
>> 1) What is the best sequence of commands to properly start the sync? Does 
>> init just set things up and do nothing until a run is started?
> The sync is always running. Each shard starts with full sync (where it lists 
> everything on the remote, and replicates each), then switches to incremental 
> sync (where it polls the replication logs for changes). The 'metadata sync 
> init' command clears the sync status, but this isn't synchronized with the 
> metadata sync process running in radosgw(s) - so the gateways need to restart 
> before they'll see the new status and restart the full sync. The same goes 
> for 'data sync init'.
>> 2) Are there commands I should run before that to clear out any previous bad 
>> runs?
> Just restart gateways, and you should see progress via 'sync status'.
>> 
>> *Thanks very kindly for any assistance. *As I didn’t really see any 
>> documentation outside of setting up the realms/zones/groups, it seems like 
>> this would be useful information for others that follow.
>> 
>> best, Brian
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-site replication speed

2019-04-15 Thread Brian Topping
> On Apr 15, 2019, at 5:18 PM, Brian Topping  wrote:
> 
> If I am correct, how do I trigger the full sync?

Apologies for the noise on this thread. I came to discover the `radosgw-admin 
[meta]data sync init` command. That left me with something that looked like 
this for several hours:

> [root@master ~]# radosgw-admin  sync status
>   realm 54bb8477-f221-429a-bbf0-76678c767b5f (example)
>   zonegroup 8e33f5e9-02c8-4ab8-a0ab-c6a37c2bcf07 (us)
>zone b6e32bc8-f07e-4971-b825-299b5181a5f0 (secondary)
>   metadata sync preparing for full sync
> full sync: 64/64 shards
> full sync: 0 entries to sync
> incremental sync: 0/64 shards
> metadata is behind on 64 shards
> behind shards: 
> [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63]
>   data sync source: 35835cb0-4639-43f4-81fd-624d40c7dd6f (master)
> preparing for full sync
> full sync: 1/128 shards
> full sync: 0 buckets to sync
> incremental sync: 127/128 shards
> data is behind on 1 shards
> behind shards: [0]

I also had the data sync showing a list of “behind shards”, but both of them 
sat in “preparing for full sync” for several hours, so I tried `radosgw-admin 
[meta]data sync run`. My sense is that was a bad idea, but neither of the 
commands seem to be documented and the thread I found them on indicated they 
wouldn’t damage the source data. 

QUESTIONS at this point:

1) What is the best sequence of commands to properly start the sync? Does init 
just set things up and do nothing until a run is started?
2) Are there commands I should run before that to clear out any previous bad 
runs?

Thanks very kindly for any assistance. As I didn’t really see any documentation 
outside of setting up the realms/zones/groups, it seems like this would be 
useful information for others that follow.

best, Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-site replication speed

2019-04-15 Thread Brian Topping
I’m starting to wonder if I actually have things configured and working 
correctly, and the light traffic I am seeing is just that of an incremental 
replication. That would make sense; the cluster being replicated does not have 
a lot of traffic on it yet. Obviously, without the full replication, the 
incremental is pretty useless.

Here’s the status coming from the secondary side:

> [root@secondary ~]# radosgw-admin  sync status
>   realm 54bb8477-f221-429a-bbf0-76678c767b5f (example)
>   zonegroup 8e33f5e9-02c8-4ab8-a0ab-c6a37c2bcf07 (us)
>zone b6e32bc8-f07e-4971-b825-299b5181a5f0 (secondary)
>   metadata sync syncing
> full sync: 0/64 shards
> incremental sync: 64/64 shards
> metadata is caught up with master
>   data sync source: 35835cb0-4639-43f4-81fd-624d40c7dd6f (master)
> syncing
> full sync: 0/128 shards
> incremental sync: 128/128 shards
> data is caught up with source


If I am correct, how do I trigger the full sync?

Thanks!! Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-site replication speed

2019-04-14 Thread Brian Topping


> On Apr 14, 2019, at 2:08 PM, Brian Topping  wrote:
> 
> Every so often I might see the link running at 20 Mbits/sec, but it’s not 
> consistent. It’s probably going to take a very long time at this rate, if 
> ever. What can I do?

Correction: I was looking at statistics on an aggregate interface while my 
laptop was rebuilding a mailbox. The typical transfer is around 60Kbits/sec, 
but as I said, iperf3 can easily push the link between the two points to 
>750Mbits/sec. Also, system load always has >90% idle on both machines...

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Multi-site replication speed

2019-04-14 Thread Brian Topping
Hi all! I’m finally running with Ceph multi-site per 
http://docs.ceph.com/docs/nautilus/radosgw/multisite/, woo hoo!

I wanted to confirm that the process can be slow. It’s been a couple of hours 
since the sync started and `radosgw-admin sync status` does not report any 
errors, but the speeds are nowhere near link saturation. iperf3 reports 773 
Mbits/sec on the link in TCP mode, latency is about 5ms. 

Every so often I might see the link running at 20 Mbits/sec, but it’s not 
consistent. It’s probably going to take a very long time at this rate, if ever. 
What can I do?

I’m using civetweb without SSL on the gateway endpoints, only one 
master/mon/rgw for each end on Nautilus 14.2.0.

Apologies if I’ve missed some crucial tuning docs or archive messages somewhere 
on the subject.

Thanks! Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1/3 mon not working after upgrade to Nautilus

2019-03-25 Thread Brian Topping
Did you check port access from other nodes?  My guess is a forgotten firewall 
re-emerged on that node after reboot. 

Sent from my iPhone

> On Mar 25, 2019, at 07:26, Clausen, Jörn  wrote:
> 
> Hi again!
> 
>> moment, one of my three MONs (the then active one) fell out of the 
> 
> "active one" is of course nonsense, I confused it with MGRs. Which are 
> running okay, btw, on the same three hosts.
> 
> I reverted the MON back to a snapshot (vSphere) before the upgrade, repeated 
> the upgrade, and ended up in the same situation. ceph-mon.log is filled with 
> ~3000 lines per second.
> 
> The only line I can assume has any value to this is
> 
> mon.cephtmon03@-1(probing) e1  my rank is now 2 (was -1)
> 
> What does that mean?
> 
> -- 
> Jörn Clausen
> Daten- und Rechenzentrum
> GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
> Düsternbrookerweg 20
> 24105 Kiel
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook

2019-02-19 Thread Brian Topping
> On Feb 19, 2019, at 3:30 PM, Vitaliy Filippov  wrote:
> 
> In our Russian-speaking Ceph chat we swear at "ceph inside kuber" people all the 
> time because they often do not understand what state their cluster is in at 
> all

Agreed 100%. This is a really good way to lock yourself out of your data (and 
maybe lose it), especially if you’re new to Kubernetes and using Rook to manage 
Ceph. 

Some months ago, I was on VMs running on Citrix. Everything is stable on 
Kubernetes and Ceph now, but it’s been a lot of work. I’d suggest starting with 
Kubernetes first, especially if you are going to do this on bare metal. I can 
give you some ideas about how to lay things out if you are running with limited 
hardware.

Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Brian Topping
Thanks Hector. So many things going through my head and I totally forgot to 
explore if just turning off the warnings (if only until I get more disks) was 
an option. 

This is 1000% more sensible for sure.

> On Feb 8, 2019, at 7:19 PM, Hector Martin  wrote:
> 
> My practical suggestion would be to do nothing for now (perhaps tweaking
> the config settings to shut up the warnings about PGs per OSD). Ceph
> will gain the ability to downsize pools soon, and in the meantime,
> anecdotally, I have a production cluster where we overshot the current
> recommendation by 10x due to confusing documentation at the time, and
> it's doing fine :-)
> 
> Stable multi-FS support is also coming, so really, multiple ways to fix
> your problem will probably materialize Real Soon Now, and in the
> meantime having more PGs than recommended isn't the end of the world.
> 
> (resending because the previous reply wound up off-list)
> 
> On 09/02/2019 10.39, Brian Topping wrote:
>> Thanks again to Jan, Burkhard, Marc and Hector for responses on this. To
>> review, I am removing OSDs from a small cluster and running up against
>> the “too many PGs per OSD problem due to lack of clarity. Here’s a
>> summary of what I have collected on it:
>> 
>> 1. The CephFS data pool can’t be changed, only added to. 
>> 2. CephFS metadata pool might be rebuildable
>>via https://www.spinics.net/lists/ceph-users/msg29536.html, but the
>>post is a couple of years old, and even then, the author stated that
>>he wouldn’t do this unless it was an emergency.
>> 3. Running multiple clusters on the same hardware is deprecated, so
>>there’s no way to make a new cluster with properly-sized pools and
>>cpio across.
>> 4. Running multiple filesystems on the same hardware is considered
>>experimental: 
>> http://docs.ceph.com/docs/master/cephfs/experimental-features/#multiple-filesystems-within-a-ceph-cluster.
>>It’s unclear what permanent changes this will effect on the cluster
>>that I’d like to use moving forward. This would be a second option
>>to mount and cpio across.
>> 5. Importing pools (ie `zpool export …`, `zpool import …`) from other
>>clusters is likely not supported, so even if I created a new cluster
>>on a different machine, getting the pools back in the original
>>cluster is fraught.
>> 6. There’s really no way to tell Ceph where to put pools, so when the
>>new drives are added to CRUSH, everything starts rebalancing unless
>>`max pg per osd` is set to some small number that is already
>>exceeded. But if I start copying data to the new pool, doesn’t it fail?
>> 7. Maybe the former problem can be avoided by changing the weights of
>>the OSDs...
>> 
>> 
>> All these options so far seem either a) dangerous or b) like I’m going
>> to have a less-than-pristine cluster to kick off the next ten years
>> with. Unless I am mistaken in that, the only options are to copy
>> everything at least once or twice more:
>> 
>> 1. Copy everything back off CephFS to a `mdadm` RAID 1 with two of the
>>6TB drives. Blow away the cluster and start over with the other two
>>drives, copy everything back to CephFS, then re-add the freed drive
>>used as a store. Might be done by the end of next week.
>> 2. Create a new, properly sized cluster on a second machine, copy
>>everything over ethernet, then move the drives and the
>>`/var/lib/ceph` and `/etc/ceph` back to the cluster seed.
>> 
>> 
>> I appreciate small clusters are not the target use case of Ceph, but
>> everyone has to start somewhere!
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> -- 
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Brian Topping
Thanks again to Jan, Burkhard, Marc and Hector for responses on this. To 
review, I am removing OSDs from a small cluster and running up against the “too 
many PGs per OSD” problem due to lack of clarity. Here’s a summary of what I 
have collected on it:

1. The CephFS data pool can’t be changed, only added to. 
2. CephFS metadata pool might be rebuildable via https://www.spinics.net/lists/ceph-users/msg29536.html, but the post is a couple of years old, and even then, the author stated that he wouldn’t do this unless it was an emergency.
3. Running multiple clusters on the same hardware is deprecated, so there’s no way to make a new cluster with properly-sized pools and cpio across.
4. Running multiple filesystems on the same hardware is considered experimental: http://docs.ceph.com/docs/master/cephfs/experimental-features/#multiple-filesystems-within-a-ceph-cluster. It’s unclear what permanent changes this will effect on the cluster that I’d like to use moving forward. This would be a second option to mount and cpio across.
5. Importing pools (ie `zpool export …`, `zpool import …`) from other clusters is likely not supported, so even if I created a new cluster on a different machine, getting the pools back in the original cluster is fraught.
6. There’s really no way to tell Ceph where to put pools, so when the new drives are added to CRUSH, everything starts rebalancing unless `max pg per osd` is set to some small number that is already exceeded. But if I start copying data to the new pool, doesn’t it fail?
7. Maybe the former problem can be avoided by changing the weights of the OSDs (a rough sketch of that follows below)...
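
For the record, the weight idea in the last point would look something like this (purely 
a sketch; the OSD id and weight are hypothetical):

  # in ceph.conf on the OSD hosts, before creating the new OSDs:
  #   [osd]
  #   osd crush initial weight = 0
  # new OSDs then join CRUSH with weight 0 and receive no data until ramped in:
  ceph osd crush reweight osd.9 5.46    # hypothetical id; weight roughly equals size in TiB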

All these options so far seem either a) dangerous or b) like I’m going to have 
a less-than-pristine cluster to kick off the next ten years with. Unless I am 
mistaken in that, the only options are to copy everything at least once or 
twice more:

1. Copy everything back off CephFS to a `mdadm` RAID 1 with two of the 6TB drives. Blow away the cluster and start over with the other two drives, copy everything back to CephFS, then re-add the freed drive used as a store. Might be done by the end of next week.
2. Create a new, properly sized cluster on a second machine, copy everything over ethernet, then move the drives and the `/var/lib/ceph` and `/etc/ceph` back to the cluster seed.

I appreciate small clusters are not the target use case of Ceph, but everyone 
has to start somewhere!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Brian Topping
Thanks Marc and Burkhard. I think what I am learning is it’s best to copy 
between filesystems with cpio, if not impossible to do it any other way due to 
the “fs metadata in first pool” problem.

FWIW, the mimic docs still describe how to create a differently named cluster 
on the same hardware. But then I see 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021560.html 
saying that behavior is deprecated and problematic. 

A hard lesson, but no data was lost. I will set up two machines and a new 
cluster with the larger drives tomorrow.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Brian Topping
Hi Marc, that’s great advice, thanks! I’m always grateful for the knowledge. 

What about the issue with the pools containing a CephFS though? Is it something 
where I can just turn off the MDS, copy the pools and rename them back to the 
original name, then restart the MDS? 

Agreed about using smaller numbers. When I went to seven disks, I was 
getting warnings about too few PGs per OSD. I’m sure this is something one 
learns to cope with via experience, and I’m still picking that up. I had hoped 
not to get in a bind like this so quickly, but hey, here I am again :)

> On Feb 8, 2019, at 01:53, Marc Roos  wrote:
> 
> 
> There is a setting to set the max PGs per OSD. I would set that 
> temporarily so you can work, create a new pool with 8 PGs and move data 
> over to the new pool, remove the old pool, then unset this max PGs per 
> OSD.
> 
> PS. I always create pools starting with 8 PGs, and when I know I am at 
> what I want in production I can always increase the PG count.
> 
> 
> 
> -Original Message-
> From: Brian Topping [mailto:brian.topp...@gmail.com] 
> Sent: 08 February 2019 05:30
> To: Ceph Users
> Subject: [ceph-users] Downsizing a cephfs pool
> 
> Hi all, I created a problem when moving data to Ceph and I would be 
> grateful for some guidance before I do something dumb.
> 
> 
> 1.I started with the 4x 6TB source disks that came together as a 
> single XFS filesystem via software RAID. The goal is to have the same 
> data on a cephfs volume, but with these four disks formatted for 
> bluestore under Ceph.
> 2.The only spare disks I had were 2TB, so put 7x together. I sized 
> data and metadata for cephfs at 256 PG, but it was wrong.
> 3.The copy went smoothly, so I zapped and added the original 4x 6TB 
> disks to the cluster.
> 4.I realized what I did, that when the 7x2TB disks were removed, 
> there were going to be far too many PGs per OSD.
> 
> 
> I just read over https://stackoverflow.com/a/39637015/478209, but that 
> addresses how to do this with a generic pool, not pools used by CephFS. 
> It looks easy to copy the pools, but once copied and renamed, CephFS may 
> not recognize them as the target and the data may be lost.
> 
> Do I need to create new pools and copy again using cpio? Is there a 
> better way?
> 
> Thanks! Brian
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Downsizing a cephfs pool

2019-02-07 Thread Brian Topping
Hi all, I created a problem when moving data to Ceph and I would be grateful 
for some guidance before I do something dumb.

1. I started with the 4x 6TB source disks that came together as a single XFS filesystem via software RAID. The goal is to have the same data on a cephfs volume, but with these four disks formatted for bluestore under Ceph.
2. The only spare disks I had were 2TB, so put 7x together. I sized data and metadata for cephfs at 256 PG, but it was wrong.
3. The copy went smoothly, so I zapped and added the original 4x 6TB disks to the cluster.
4. I realized what I did, that when the 7x2TB disks were removed, there were going to be far too many PGs per OSD.

I just read over https://stackoverflow.com/a/39637015/478209, but that addresses 
how to do this with a generic pool, not pools used by CephFS. It looks easy to 
copy the pools, but once copied and renamed, CephFS may not recognize them as 
the target and the data may be lost.

Do I need to create new pools and copy again using cpio? Is there a better way?

Thanks! Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online

2019-01-26 Thread Brian Topping
I went through this as I reformatted all the OSDs with a much smaller cluster 
last weekend. When turning nodes back on, PGs would sometimes move, only to 
move back, prolonging the operation and system stress. 

What I took away is that it’s least overall system stress to get the OSD tree back 
to its target state as quickly as is safe and practical. Replication will happen as 
replication will, but if the strategy changes midway, it just means the same 
speed of movement over a longer time. 
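
If you do decide to nudge recovery along once the host is back, the recovery priority 
Chris mentions below can be raised via a couple of OSD settings; a hedged sketch (the 
values are illustrative, not recommendations):

  # allow more parallel backfill/recovery work while watching client latency
  ceph tell 'osd.*' injectargs '--osd-max-backfills 2 --osd-recovery-max-active 4'
  # put the defaults (1 and 3 respectively) back when recovery is done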

> On Jan 26, 2019, at 15:41, Chris  wrote:
> 
> It sort of depends on your workload/use case.  Recovery operations can be 
> computationally expensive.  If your load is light because its the weekend you 
> should be able to turn that host back on  as soon as you resolve whatever the 
> issue is with minimal impact.  You can also increase the priority of the 
> recovery operation to make it go faster if you feel you can spare additional 
> IO and it won't affect clients.
> 
> We do this in our cluster regularly and have yet to see an issue (given that 
> we take care to do it during periods of lower client io)
> 
>> On January 26, 2019 17:16:38 Götz Reinicke  
>> wrote:
>> 
>> Hi,
>> 
>> one host out of 10 is down for yet unknown reasons. I guess a power failure. 
>> I could not yet see the server.
>> 
>> The Cluster is recovering and remapping fine, but still has some objects to 
>> process.
>> 
>> My question: May I just switch the server back on and in best case, the 24 
>> OSDs get back online and recovering will do the job without problems.
>> 
>> Or what might be a good way to handle that host? Should I first wait till 
>> the recover is finished?
>> 
>> Thanks for feedback and suggestions - Happy Saturday Night  :) . Regards . 
>> Götz
>> 
>> 
>> --
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem with OSDs

2019-01-21 Thread Brian Topping
> On Jan 21, 2019, at 6:47 AM, Alfredo Deza  wrote:
> 
> When creating an OSD, ceph-volume will capture the ID and the FSID and
> use these to create a systemd unit. When the system boots, it queries
> LVM for devices that match that ID/FSID information.

Thanks Alfredo, I see that now. The name comes from the symlink and is passed 
into the script as %i. I should have seen that before, but at best I would have 
done a hacky job of recreating them manually, so in hindsight I’m glad I did 
not see that sooner.
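
For anyone else hitting this, the units and fsids line up like so (the unit name below 
is the failing one from this thread; the rest is a sketch):

  # list the per-OSD activation units that ceph-volume enabled on this host
  systemctl list-units --all 'ceph-volume@*'
  # compare the <id>-<fsid> in each unit name against the "osd id" / "osd fsid"
  # pairs shown by `ceph-volume lvm list`; disable any stale unit, then re-activate
  systemctl disable ceph-volume@lvm-1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce.service
  ceph-volume lvm activate --all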

> Is it possible you've attempted to create an OSD and then failed, and
> tried again? That would explain why there would be a systemd unit with
> an FSID that doesn't match. By the output, it does look like
> you have an OSD 1, but with a different FSID (467... instead of
> e3b...). You could try to disable the failing systemd unit with:
> 
>systemctl disable
> ceph-volume@lvm-1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce.service 
> 
> 
> (Follow up with OSD 3) and then run:
> 
>ceph-volume lvm activate --all

That worked and recovered startup of all four OSDs on the second node. In an 
abundance of caution, I only disabled one of the volumes with systemctl disable 
and then ran ceph-volume lvm activate --all. That cleaned up all of them 
though, so there was nothing left to do.

https://bugzilla.redhat.com/show_bug.cgi?id=1567346#c21 helped resolve the 
final issue getting to HEALTH_OK. After rebuilding the mon/mgr node, I did not 
properly clear / restore the firewall. It’s odd that osd tree was reporting 
that two of the OSDs were up and in when the ports for mon/mgr/mds were all 
inaccessible.

I don’t believe there were any failed creation attempts. Cardinal process rule 
with filesystems: Always maintain a known-good state that can be rolled back 
to. If an error comes up that can’t be fully explained, roll back and restart. 
Sometimes a command gets missed by the best of fingers and fully caffeinated 
minds.. :)  I do see that I didn’t do a `ceph osd purge` on the empty/downed 
OSDs that were gracefully `out`. That explains the tree with the even numbered 
OSDs on the rebuilt node. After purging the references to the empty OSDs and 
re-adding the volumes, I am back to full health with all devices and OSDs up/in.

THANK YOU!!! :D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] quick questions about a 5-node homelab setup

2019-01-21 Thread Brian Topping
> On Jan 18, 2019, at 3:48 AM, Eugen Leitl  wrote:
> 
> 
> (Crossposting this from Reddit /r/ceph , since likely to have more technical 
> audience present here).
> 
> I've scrounged up 5 old Atom Supermicro nodes and would like to run them 
> 365/7 for limited production as RBD with Bluestore (ideally latest 13.2.4 
> Mimic), triple copy redundancy. Underlying OS is a Debian 9 64 bit, minimal 
> install.

The other thing to consider about a lab is “what do you want to learn?” If 
reliability isn’t an issue (ie you aren’t putting your family pictures on it), 
regardless of the cluster technology, you can often learn basics more quickly 
without the overhead of maintaining quorums and all that stuff on day one. So 
at risk of being a heretic, start small, for instance with a single mon/manager, 
and add more later. 

Adding mons into a running cluster is just as unique and valuable an experience 
as maintaining perfect quorum. Knowing when, why and where to add resources is 
much harder if one builds out a monster cluster from the start. This is to say 
“if there are no bottlenecks to solve for, there is far less learning being 
required”. And when it comes to being proficient with such a critical piece of 
production infrastructure, you’ll want to have as many experiences with the 
system going sideways and bringing it back as you can. 

Production heroes are measured by their uptime statistics, and when things get 
testy, the more cluster-foo you have (regardless of the cluster), the less risk 
you’ll have maintaining perfect stats.

$0.02… 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problem with OSDs

2019-01-20 Thread Brian Topping
Hi all, looks like I might have pooched something. Between the two nodes I 
have, I moved all the PGs to one machine, reformatted the other machine, 
rebuilt that machine, and moved the PGs back. In both cases, I did this by 
marking the OSDs on the machine being drained “out”, waiting for health to 
be restored, and then taking them down. 

This worked great up to the point where I had the mon/manager/rgw where they started, 
and all the OSDs/PGs on the other machine that had been rebuilt. The next step was 
to rebuild the master machine, copy /etc/ceph and /var/lib/ceph over with cpio, then 
re-add new OSDs on the master machine, as it were.

This didn’t work so well. The master has come up just fine, but it’s not 
connecting to the OSDs. Of the four OSDs, only two came up, and the other two 
did not (IDs 1 and 3). For its part, the OSD machine is reporting lines like 
the following in its logs:

> [2019-01-20 16:22:10,106][systemd][WARNING] failed activating OSD, retries 
> left: 2
> [2019-01-20 16:22:15,111][ceph_volume.process][INFO  ] Running command: 
> /usr/sbin/ceph-volume lvm trigger 1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce
> [2019-01-20 16:22:15,271][ceph_volume.process][INFO  ] stderr -->  
> RuntimeError: could not find osd.1 with fsid 
> e3bfc69e-a145-4e19-aac2-5f888e1ed2ce


I see this for the volumes:

> [root@gw02 ceph]# ceph-volume lvm list 
> 
> == osd.1 ===
> 
>   [block]
> /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
> 
>   type  block
>   osd id1
>   cluster fsid  1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>   cluster name  ceph
>   osd fsid  4672bb90-8cea-4580-85f2-1e692811a05a
>   encrypted 0
>   cephx lockbox secret  
>   block uuid3M5fen-JgsL-t4vz-bh3m-k3pf-hjBV-4R7Cff
>   block device  
> /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
>   vdo   0
>   crush device classNone
>   devices   /dev/sda3
> 
> == osd.3 ===
> 
>   [block]
> /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
> 
>   type  block
>   osd id3
>   cluster fsid  1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>   cluster name  ceph
>   osd fsid  084cf33d-8a38-4c82-884a-7c88e3161479
>   encrypted 0
>   cephx lockbox secret  
>   block uuidPSU2ba-6PbF-qhm7-RMER-lCkR-j58b-G9B6A7
>   block device  
> /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
>   vdo   0
>   crush device classNone
>   devices   /dev/sdb3
> 
> == osd.5 ===
> 
>   [block]
> /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
> 
>   type  block
>   osd id5
>   cluster fsid  1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>   cluster name  ceph
>   osd fsid  e854930d-1617-4fe7-b3cd-98ef284643fd
>   encrypted 0
>   cephx lockbox secret  
>   block uuidF5YIfz-quO4-gbmW-rxyP-qXxe-iN7a-Po1mL9
>   block device  
> /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
>   vdo   0
>   crush device classNone
>   devices   /dev/sdc3
> 
> == osd.7 ===
> 
>   [block]
> /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
> 
>   type  block
>   osd id7
>   cluster fsid  1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>   cluster name  ceph
>   osd fsid  5c0d0404-390e-4801-94a9-da52c104206f
>   encrypted 0
>   cephx lockbox secret  
>   block uuidwgfOqi-iCu0-WIGb-uZPb-0R3n-ClQ3-0IewMe
>   block device  
> /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
>   vdo   0
>   crush device classNone
>   devices   /dev/sdd3

What I am wondering is if device mapper has lost something with a kernel or 
library change:

> [root@gw02 ceph]# ls -l /dev/dm*
> brw-rw. 1 root disk 253, 0 Jan 20 16:19 /dev/dm-0
> brw-rw. 1 ceph ceph 253, 1 Jan 20 16:19 /dev/dm-1
> brw-rw. 1 ceph ceph 253, 2 Jan 20 16:19 /dev/dm-2
> brw-rw. 1 ceph ceph 253, 3 Jan 20 16:19 /dev/dm-3
> brw-rw. 1 ceph ceph 253, 4 Jan 20 16:19 /dev/dm-4
> [root@gw02 ~]# dmsetup ls
> ceph--1f3d4406

Re: [ceph-users] Boot volume on OSD device

2019-01-19 Thread Brian Topping
> On Jan 18, 2019, at 10:58 AM, Hector Martin  wrote:
> 
> Just to add a related experience: you still need 1.0 metadata (that's
> the 1.x variant at the end of the partition, like 0.9.0) for an
> mdadm-backed EFI system partition if you boot using UEFI. This generally
> works well, except on some Dell servers where the firmware inexplicably
> *writes* to the ESP, messing up the RAID mirroring. 

I love this list. You guys are great. I have to admit I was kind of intimidated 
at first, I felt a little unworthy in the face of such cutting-edge tech. 
Thanks to everyone that’s helped with my posts.

Hector, one of the things I was thinking through last night and finally pulled 
the trigger on today was the overhead of various subsystems. LVM does not 
create much overhead, but tiny initial mistakes explode into a lot of wasted 
CPU over the course of a deployment lifetime. So I wanted to review everything 
and thought I would share my notes here.

My main constraint is that I had four disks on a single machine to start with, and 
any one of the disks should be able to fail without affecting the machine’s ability 
to boot, with the bad disk replaceable without requiring obscure admin skills, and 
the final recovery reaching the promised land of “HEALTH_OK”. A single-machine Ceph 
deployment is not much better than just using local storage, except for the ability 
to later scale out. That’s the use case I’m addressing here.

The first exploration I had was how to optimize for a good balance between 
safety for mon logs, disk usage and performance for the boot partitions. As I 
learned, an OSD can fit in a single partition with no spillover, so I had three 
partitions to work with. `inotifywait -mr /var/lib/ceph/` provided a good 
handle on what was being written to the log and with what frequency and I could 
see that the log was mostly writes.

https://theithollow.com/2012/03/21/understanding-raid-penalty/ provided a good 
background that I did not previously have on the RAID write penalty. I combined 
this with what I learned in 
https://serverfault.com/questions/685289/software-vs-hardware-raid-performance-and-cache-usage/685328. 
By the end of these two articles, I felt like I knew all the tradeoffs, but 
the final decision really came down to the penalty table in the first article 
and a “RAID penalty” of 2 for RAID 10, which was the same as the penalty for 
RAID 1, but with 50% better storage efficiency.

For the boot partition, there are fewer choices. Specifying anything other than 
RAID 1 will not keep all the copies of /boot both up-to-date and ready to 
seamlessly restart the machine in case of a disk failure. Combined with a 
choice of RAID 10 for the root partition, we are left with a configuration that 
can reliably boot from any single drive failure (maybe two, I don’t know what 
mdadm would do if a “less than perfect storm” happened with one mirror from 
each stripe were to be lost instead of two mirrors from one stripe…)

With this setup, each disk used exactly two partitions and mdadm is using the 
latest MD metadata because Grub2 knows how to deal with everything. As well, 
`sfdisk /dev/sd[abcd]` shows all disks marked with the first partition as 
bootable. Milestone 1 success!

The next piece I was unsure of (but didn’t want to spam the list with, since I 
could just try it) was how many partitions an OSD would use. Hector mentioned that 
he was using LVM for Bluestore volumes. I privately wondered about the value of 
creating LVM VGs when the groups did not span disks, but this is exactly what the 
`ceph-deploy osd create` command as documented does when creating Bluestore OSDs. 
Knowing how to wire LVM is not rocket science, but if possible, I wanted to 
avoid as many manual steps as possible. This was a biggie.
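
For later readers, a rough sketch of the two equivalent paths (device, host and VG/LV 
names are hypothetical):

  # let ceph-deploy / ceph-volume build the single-PV VG and LV for you:
  ceph-deploy osd create --data /dev/sda3 node1
  # or carve the LV yourself and hand it to ceph-volume directly:
  vgcreate ceph-vg-sda /dev/sda3
  lvcreate -l 100%FREE -n osd-block ceph-vg-sda
  ceph-volume lvm create --bluestore --data ceph-vg-sda/osd-block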

And after adding the OSD partitions one after the other, “HEALTH_OK”. w00t!!! 
Final Milestone Success!!

I know there’s no perfect starter configuration for every hardware environment, 
but I thought I would share exactly what I ended up with here for future 
seekers. This has been a fun adventure. 

Next up: convert my existing two pre-production nodes that need to use this 
layout. Fortunately there’s nothing on the second node except Ceph, and I can 
take that one down pretty easily. It will be good practice to gracefully shut 
down the four OSDs on that node without losing any data, reformat the node with 
this pattern, bring the cluster back to health, then migrate the mon (and 
the workloads) to it while I do the same for the first node. With that, I’ll be 
able to remove these satanic SATADOMs and get back to some real work!!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Today's DocuBetter meeting topic is... SEO

2019-01-18 Thread Brian Topping
Hi Noah!

With an eye toward improving documentation and community, two things come to 
mind:

1. I didn’t know about this meeting or I would have done my very best to enlist 
my roommate, who probably could have answered these questions very quickly. I 
do know there’s something to do with the metadata tags in the HTML <head> that 
manage most of this. Web spiders see these tags and know what to do.

2. I realized I really didn’t know there were any Ceph meetings like this, and 
thought I would raise awareness of 
https://github.com/kubernetes/community/blob/master/events/community-meeting.md, 
where the kubernetes team has created an iCal subscription through which one can 
automatically get alerts and updates for upcoming events. Best of all, they work 
accurately across time zones, so there is no need to have people doing math (“daylight 
savings time” is a pet peeve, please don’t get me started! :))

Hope this provides some value! 

Brian

> On Jan 18, 2019, at 11:37 AM, Noah Watkins  wrote:
> 
> 1 PM PST / 9 PM GMT
> https://bluejeans.com/908675367
> 
> On Fri, Jan 18, 2019 at 10:31 AM Noah Watkins  wrote:
>> 
>> We'll be discussing SEO for the Ceph documentation site today at the
>> DocuBetter meeting. Currently when Googling or DuckDuckGoing for
>> Ceph-related things you may see results from master, mimic, or what's
>> a dumpling? The goal is figure out what sort of approach we can take
>> to make these results more relevant. If you happen to know a bit about
>> the topic of SEO please join and contribute to the conversation.
>> 
>> Best,
>> Noah
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Boot volume on OSD device

2019-01-18 Thread Brian Topping


> On Jan 18, 2019, at 4:29 AM, Hector Martin  wrote:
> 
> On 12/01/2019 15:07, Brian Topping wrote:
>> I’m a little nervous that BlueStore assumes it owns the partition table and 
>> will not be happy that a couple of primary partitions have been used. Will 
>> this be a problem?
> 
> You should look into using ceph-volume in LVM mode. This will allow you to 
> create an OSD out of any arbitrary LVM logical volume, and it doesn't care 
> about other volumes on the same PV/VG. I'm running BlueStore OSDs sharing PVs 
> with some non-Ceph stuff without any issues. It's the easiest way for OSDs to 
> coexist with other stuff right now.

Very interesting, thanks!

On the subject, I just rediscovered the technique of putting boot and root 
volumes on mdadm-backed stores. The last time I felt the need for this, it was 
a lot of careful planning and commands. 

Now, at least with RHEL/CentOS, it’s available in Anaconda. As it’s set up 
before mkfs, there’s no manual hackery to reduce the size of a volume to make 
room for the metadata. Even better, one isn’t stuck using metadata 0.9.0 just 
because they need the /boot volume to have the header at the end (grub now 
understands mdadm 1.2 headers). Just be sure /boot is RAID 1 and it doesn’t 
seem to matter what one does with the rest of the volumes. Kernel upgrades 
process correctly as well (another major hassle in the old days since mkinitrd 
had to be carefully managed).

best, B

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Offsite replication scenario

2019-01-16 Thread Brian Topping
> On Jan 16, 2019, at 12:08 PM, Anthony Verevkin  wrote:
> 
> I would definitely see huge value in going to 3 MONs here (and btw 2 on-site 
> MGR and 2 on-site MDS)
> However 350Kbps is quite low and MONs may be latency sensitive, so I suggest 
> you do heavy QoS if you want to use that link for ANYTHING else.
> If you do so, make sure your clients are only listing the on-site MONs so 
> they don't try to read from the off-site MON.
> Still you risk the stability of the cluster if the off-site MON starts 
> lagging. If it's still considered on while lagging, all changes to cluster 
> (osd going up/down, etc) would be blocked by waiting it to commit.

Using QoS is definitely something I hadn’t thought of, thanks. Setting up tc 
wouldn’t be rocket science. I’d probably also make sure the offsite mon wasn’t 
reachable from clients at all, only from the other mons.
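
Something like the following is what I have in mind for the WAN-facing interface
(interface name, rates, and class split are made up; 6789 is the v1 mon port,
and the far end would want a matching sport rule for the return traffic):

tc qdisc add dev eth0 root handle 1: htb default 20
tc class add dev eth0 parent 1: classid 1:10 htb rate 256kbit ceil 350kbit   # mon traffic
tc class add dev eth0 parent 1: classid 1:20 htb rate 64kbit ceil 350kbit    # everything else
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 6789 0xffff flowid 1:10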

> Even if you choose against an off-site MON, maybe consider 2 on-site MON 
> instead. Yes, you'd double the risk of cluster going to a halt if any one 
> node dies vs one specific node dying. But if that happens you have a manual 
> way of downgrading to a single MON (and you still have your MON's data) vs 
> risking to get stuck with a OSD-only node that had never had MON installed 
> and not having a copy of MON DB.

I had thought through some of that. Having it ready to go like that is a good 
idea, though. 
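
(Noting for my own reference: the manual downgrade to a single mon would
presumably be the documented procedure for removing monitors from an unhealthy
cluster, roughly the sketch below; mon names are placeholders.)

systemctl stop ceph-mon@gw01                    # on the surviving mon
ceph-mon -i gw01 --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --rm gw02                # drop the dead/unreachable mon(s)
ceph-mon -i gw01 --inject-monmap /tmp/monmap
systemctl start ceph-mon@gw01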

It also got me thinking whether it would be practical to have the two local and one 
remote set up and ready to go as you note, but only two running at a time. So 
if I need to take a primary node down, I would re-add the remote, do the 
service on the primary node, bring it back, re-establish mon health, then 
remove the remote. That’s probably great until there’s an actual primary 
failure — no quorum and the out-of-date remote can’t be re-added so one is just 
as bad off. 

> I also see how you want to get the data out for backups.
> Having a third replica off-site definitely won't fly with such bandwidth as 
> it would once again block the IO until committed by the off-site OSD.
> I am not quite sure RBD mirroring would play nicely with this kind of link 
> either. Maybe stick with application-level off-site backups.
> And again, whatever replication/backup strategy you do, need to QoS or else 
> you'd cripple your connection which I assume is used for some other 
> communications as well.

The connection is consumer-grade gig fiber. It would be great if I could 
optimize it for higher speeds; the locations are only 30 miles from each other! 

It seems at this point that I’m not missing anything; I’m grateful for your 
thoughts! I think I just need to get another node in there, whether through a 
VPS from the provider or another unit. The cost of the house of cards I’m 
building becomes much higher the moment there’s a single unplanned configuration 
change and hours go into bringing it back to health.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] /var/lib/ceph/mon/ceph-{node}/store.db on mon nodes

2019-01-16 Thread Brian Topping
Thanks guys! This does leave me a little worried that I only have one mon at 
the moment, for the reasons in my previous emails to the list (the physical 
limit of two nodes). Going to have to get more creative!

Sent from my iPhone

> On Jan 16, 2019, at 02:56, Wido den Hollander  wrote:
> 
> 
> 
>> On 1/16/19 10:36 AM, Matthew Vernon wrote:
>> Hi,
>> 
>>> On 16/01/2019 09:02, Brian Topping wrote:
>>> 
>>> I’m looking at writes to a fragile SSD on a mon node,
>>> /var/lib/ceph/mon/ceph-{node}/store.db is the big offender at the
>>> moment.
>>> Is it required to be on a physical disk or can it be in tempfs? One
>>> of the log files has paxos strings, so I’m guessing it has to be on
>>> disk for a panic recovery? Are there other options?
>> Yeah, the mon store is worth keeping ;-) It can get quite large with a
>> large cluster and/or big rebalances. We bought some extra storage for
>> our mons and put the mon store onto dedicated storage.
> 
> Yes, this can't be stressed enough. Keep in mind: If you lose the MON
> stores you will effectively lose your cluster and thus data!
> 
> With some tooling you might be able to rebuild your MON store, but
> that's a task you don't want to take.
> 
> Use a DC-grade SSD for your MON stores with enough space (~100GB) and
> you'll be fine.
> 
> Wido
> 
>> 
>> Regards,
>> 
>> Matthew
>> 
>> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] /var/lib/ceph/mon/ceph-{node}/store.db on mon nodes

2019-01-16 Thread Brian Topping
I’m looking at writes to a fragile SSD on a mon node; 
/var/lib/ceph/mon/ceph-{node}/store.db is the big offender at the moment.

Is it required to be on a physical disk, or can it live in tmpfs? One of the log 
files has paxos strings, so I’m guessing it has to be on disk for a panic 
recovery? Are there other options?
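
For reference, what I'm watching is roughly the following (hostname and device
are placeholders):

du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db
iostat -x 5 /dev/sda    # watch the write rate to the SSD backing the mon store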

Thanks, Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Offsite replication scenario

2019-01-14 Thread Brian Topping
Ah! Makes perfect sense now. Thanks!! 
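
(For the archive, the snapshot/export-diff flow Greg describes below would look
roughly like this; pool/image/host names are made up, and the full copy plus
base snapshot have to exist on the far side before the incremental diffs.)

rbd snap create rbd/myimage@base
rbd export rbd/myimage@base - | ssh offsite rbd import - rbd/myimage    # one-time full copy
ssh offsite rbd snap create rbd/myimage@base
rbd snap create rbd/myimage@daily1
rbd export-diff --from-snap base rbd/myimage@daily1 - | ssh offsite rbd import-diff - rbd/myimage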

Sent from my iPhone

> On Jan 14, 2019, at 12:30, Gregory Farnum  wrote:
> 
>> On Fri, Jan 11, 2019 at 10:07 PM Brian Topping  
>> wrote:
>> Hi all,
>> 
>> I have a simple two-node Ceph cluster that I’m comfortable with the care and 
>> feeding of. Both nodes are in a single rack and captured in the attached 
>> dump, it has two nodes, only one mon, all pools size 2. Due to physical 
>> limitations, the primary location can’t move past two nodes at the present 
>> time. As far as hardware, those two nodes are 18-core Xeon with 128GB RAM 
>> and connected with 10GbE. 
>> 
>> My next goal is to add an offsite replica and would like to validate the 
>> plan I have in mind. For it’s part, the offsite replica can be considered 
>> read-only except for the occasional snapshot in order to run backups to 
>> tape. The offsite location is connected with a reliable and secured ~350Kbps 
>> WAN link. 
> 
> Unfortunately this is just not going to work. All writes to a Ceph OSD are 
> replicated synchronously to every replica, all reads are served from the 
> primary OSD for any given piece of data, and unless you do some hackery on 
> your CRUSH map each of your 3 OSD nodes is going to be a primary for about 
> 1/3 of the total data.
> 
> If you want to move your data off-site asynchronously, there are various 
> options for doing that in RBD (either periodic snapshots and export-diff, or 
> by maintaining a journal and streaming it out) and RGW (with the multi-site 
> stuff). But you're not going to be successful trying to stretch a Ceph 
> cluster over that link.
> -Greg
>  
>> 
>> The following presuppositions bear challenge:
>> 
>> * There is only a single mon at the present time, which could be expanded to 
>> three with the offsite location. Two mons at the primary location is 
>> obviously a lower MTBF than one, but  with a third one on the other side of 
>> the WAN, I could create resiliency against *either* a WAN failure or a 
>> single node maintenance event. 
>> * Because there are two mons at the primary location and one at the offsite, 
>> the degradation mode for a WAN loss (most likely scenario due to facility 
>> support) leaves the primary nodes maintaining the quorum, which is 
>> desirable. 
>> * It’s clear that a WAN failure and a mon failure at the primary location 
>> will halt cluster access.
>> * The CRUSH maps will be managed to reflect the topology change.
>> 
>> If that’s a good capture so far, I’m comfortable with it. What I don’t 
>> understand is what to expect in actual use:
>> 
>> * Is the link speed asymmetry between the two primary nodes and the offsite 
>> node going to create significant risk or unexpected behaviors?
>> * Will the performance of the two primary nodes be limited to the speed that 
>> the offsite mon can participate? Or will the primary mons correctly 
>> calculate they have quorum and keep moving forward under normal operation?
>> * In the case of an extended WAN outage (and presuming full uptime on 
>> primary site mons), would return to full cluster health be simply a matter 
>> of time? Are there any limits on how long the WAN could be down if the other 
>> two maintain quorum?
>> 
>> I hope I’m asking the right questions here. Any feedback appreciated, 
>> including blogs and RTFM pointers.
>> 
>> 
>> Thanks for a great product!! I’m really excited for this next frontier!
>> 
>> Brian
>> 
>> > [root@gw01 ~]# ceph -s
>> >  cluster:
>> >id: 
>> >health: HEALTH_OK
>> > 
>> >  services:
>> >mon: 1 daemons, quorum gw01
>> >mgr: gw01(active)
>> >mds: cephfs-1/1/1 up  {0=gw01=up:active}
>> >osd: 8 osds: 8 up, 8 in
>> > 
>> >  data:
>> >pools:   3 pools, 380 pgs
>> >objects: 172.9 k objects, 11 GiB
>> >usage:   30 GiB used, 5.8 TiB / 5.8 TiB avail
>> >pgs: 380 active+clean
>> > 
>> >  io:
>> >client:   612 KiB/s wr, 0 op/s rd, 50 op/s wr
>> > 
>> > [root@gw01 ~]# ceph df
>> > GLOBAL:
>> >    SIZE     AVAIL    RAW USED  %RAW USED 
>> >    5.8 TiB  5.8 TiB  30 GiB    0.51 
>> > POOLS:
>> >    NAME             ID  USED     %USED  MAX AVAIL  OBJECTS 
>> >    cephfs_metadata  2   264 MiB  0      2.7 TiB    1085 
>> >    cephfs_data      3   8.3 GiB  0.29   2.7 TiB    171283 
>> >   

[ceph-users] Boot volume on OSD device

2019-01-11 Thread Brian Topping
Question about OSD sizes: I have two cluster nodes, each with 4x 800GiB SLC SSD 
using BlueStore. They boot from SATADOM so the OSDs are data-only, but the MLC 
SATADOM have terrible reliability and the SLC are way overpriced for this 
application.

Can I carve off 64GiB from one of the four drives on a node without causing 
problems? If I understand the strategy properly, this will cause mild extra 
load on the other three drives as the weight goes down on the partitioned 
drive, but it probably won’t be a big deal.

Assuming the correct procedure is the one documented at 
http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-osds/, I would first 
remove the OSD as documented, zap it, carve the new partition off the freed 
drive, then add the remaining space back in as a fresh OSD.
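
Concretely, I picture the disk-level part looking something like this once the
OSD has been removed per those docs (device name and partition layout are
hypothetical):

ceph-volume lvm zap --destroy /dev/sdd     # wipe the freed drive
sgdisk -n 1:0:+64G -t 1:8300 /dev/sdd      # 64 GiB partition for the boot volume
sgdisk -n 2:0:0 /dev/sdd                   # the rest goes back to the OSD
ceph-volume lvm create --bluestore --data /dev/sdd2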

I’m a little nervous that BlueStore assumes it owns the partition table and 
will not be happy that a couple of primary partitions have been used. Will this 
be a problem?

Thanks, Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Offsite replication scenario

2019-01-11 Thread Brian Topping
Hi all,

I have a simple two-node Ceph cluster that I’m comfortable with the care and 
feeding of. Both nodes are in a single rack and captured in the attached dump, 
it has two nodes, only one mon, all pools size 2. Due to physical limitations, 
the primary location can’t move past two nodes at the present time. As far as 
hardware, those two nodes are 18-core Xeon with 128GB RAM and connected with 
10GbE. 

My next goal is to add an offsite replica and would like to validate the plan I 
have in mind. For its part, the offsite replica can be considered read-only 
except for the occasional snapshot in order to run backups to tape. The offsite 
location is connected with a reliable and secured ~350Kbps WAN link. 

The following presuppositions bear challenge:

* There is only a single mon at the present time, which could be expanded to 
three with the offsite location. Two mons at the primary location obviously have 
a lower MTBF than one, but with a third one on the other side of the WAN, I 
could create resiliency against *either* a WAN failure or a single node 
maintenance event. 
* Because there are two mons at the primary location and one at the offsite, 
the degradation mode for a WAN loss (most likely scenario due to facility 
support) leaves the primary nodes maintaining the quorum, which is desirable. 
* It’s clear that a WAN failure and a mon failure at the primary location will 
halt cluster access.
* The CRUSH maps will be managed to reflect the topology change.

If that’s a good capture so far, I’m comfortable with it. What I don’t 
understand is what to expect in actual use:

* Is the link speed asymmetry between the two primary nodes and the offsite 
node going to create significant risk or unexpected behaviors?
* Will the performance of the two primary nodes be limited to the speed that 
the offsite mon can participate? Or will the primary mons correctly calculate 
they have quorum and keep moving forward under normal operation?
* In the case of an extended WAN outage (and presuming full uptime on primary 
site mons), would return to full cluster health be simply a matter of time? Are 
there any limits on how long the WAN could be down if the other two maintain 
quorum?

I hope I’m asking the right questions here. Any feedback appreciated, including 
blogs and RTFM pointers.


Thanks for a great product!! I’m really excited for this next frontier!

Brian

> [root@gw01 ~]# ceph -s
>  cluster:
>id: 
>health: HEALTH_OK
> 
>  services:
>mon: 1 daemons, quorum gw01
>mgr: gw01(active)
>mds: cephfs-1/1/1 up  {0=gw01=up:active}
>osd: 8 osds: 8 up, 8 in
> 
>  data:
>pools:   3 pools, 380 pgs
>objects: 172.9 k objects, 11 GiB
>usage:   30 GiB used, 5.8 TiB / 5.8 TiB avail
>pgs: 380 active+clean
> 
>  io:
>client:   612 KiB/s wr, 0 op/s rd, 50 op/s wr
> 
> [root@gw01 ~]# ceph df
> GLOBAL:
>    SIZE     AVAIL    RAW USED  %RAW USED 
>    5.8 TiB  5.8 TiB  30 GiB    0.51 
> POOLS:
>    NAME             ID  USED     %USED  MAX AVAIL  OBJECTS 
>    cephfs_metadata  2   264 MiB  0      2.7 TiB    1085 
>    cephfs_data      3   8.3 GiB  0.29   2.7 TiB    171283 
>    rbd              4   2.0 GiB  0.07   2.7 TiB    542 
> [root@gw01 ~]# ceph osd tree
> ID CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF 
> -1   5.82153 root default  
> -3   2.91077 host gw01 
> 0   ssd 0.72769 osd.0 up  1.0 1.0 
> 2   ssd 0.72769 osd.2 up  1.0 1.0 
> 4   ssd 0.72769 osd.4 up  1.0 1.0 
> 6   ssd 0.72769 osd.6 up  1.0 1.0 
> -5   2.91077 host gw02 
> 1   ssd 0.72769 osd.1 up  1.0 1.0 
> 3   ssd 0.72769 osd.3 up  1.0 1.0 
> 5   ssd 0.72769 osd.5 up  1.0 1.0 
> 7   ssd 0.72769 osd.7 up  1.0 1.0 
> [root@gw01 ~]# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE VAR  PGS 
> 0   ssd 0.72769  1.0 745 GiB 4.9 GiB 740 GiB 0.66 1.29 115 
> 2   ssd 0.72769  1.0 745 GiB 3.1 GiB 742 GiB 0.42 0.82  83 
> 4   ssd 0.72769  1.0 745 GiB 3.6 GiB 742 GiB 0.49 0.96  90 
> 6   ssd 0.72769  1.0 745 GiB 3.5 GiB 742 GiB 0.47 0.93  92 
> 1   ssd 0.72769  1.0 745 GiB 3.4 GiB 742 GiB 0.46 0.90  76 
> 3   ssd 0.72769  1.0 745 GiB 3.9 GiB 741 GiB 0.52 1.02 102 
> 5   ssd 0.72769  1.0 745 GiB 3.9 GiB 741 GiB 0.52 1.02  98 
> 7   ssd 0.72769  1.0 745 GiB 4.0 GiB 741 GiB 0.54 1.06 104 
>TOTAL 5.8 TiB  30 GiB 5.8 TiB 0.51  
> MIN/MAX VAR: 0.82/1.29  STDDEV: 0.07
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com