Re: [ceph-users] Sizing your MON storage with a large cluster

2018-02-05 Thread Wes Dillingham
Good data point on the MONs not trimming when non-active+clean PGs are present. So
am I reading this correctly: it grew to roughly 32GB? Did it end up growing beyond
that, and what was the max? Also, is only ~18 PGs per OSD a reasonable amount of
PGs per OSD? I would think about quadruple that would be ideal. Is this an
artifact of a steadily growing cluster or a design choice?

On Sat, Feb 3, 2018 at 10:50 AM, Wido den Hollander <w...@42on.com> wrote:

> Hi,
>
> I just wanted to inform people about the fact that Monitor databases can
> grow quite big when you have a large cluster which is performing a very
> long rebalance.
>
> I'm posting this on ceph-users and ceph-large as it applies to both, but
> you'll see this sooner on a cluster with a lot of OSDs.
>
> Some information:
>
> - Version: Luminous 12.2.2
> - Number of OSDs: 2175
> - Data used: ~2PB
>
> We are in the middle of migrating from FileStore to BlueStore and this is
> causing a lot of PGs to backfill at the moment:
>
>  33488 active+clean
>  4802  active+undersized+degraded+remapped+backfill_wait
>  1670  active+remapped+backfill_wait
>  263   active+undersized+degraded+remapped+backfilling
>  250   active+recovery_wait+degraded
>  54    active+recovery_wait+degraded+remapped
>  27    active+remapped+backfilling
>  13    active+recovery_wait+undersized+degraded+remapped
>  2     active+recovering+degraded
>
> This has been running for a few days now and it has caused this warning:
>
> MON_DISK_BIG mons srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,srv-zmb07-05 are using a lot of disk space
> mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB)
> mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB)
> mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB)
> mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB)
> mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB)
>
> This is to be expected, as MONs do not trim their store if one or more PGs
> are not active+clean.
>
> In this case we expected this and the MONs are each running on a 1TB Intel
> DC-series SSD to make sure we do not run out of space before the backfill
> finishes.
>
> The cluster is spread out over racks and in CRUSH we replicate over racks.
> Rack by rack we are wiping/destroying the OSDs and bringing them back as
> BlueStore OSDs and letting the backfill handle everything.
>
> In between we wait for the cluster to become HEALTH_OK (all PGs
> active+clean) so that the Monitors can trim their database before we start
> with the next rack.
>
> I just want to warn and inform people about this. Under normal
> circumstances a MON database isn't that big, but if you have a very long
> period of backfills/recoveries and also have a large number of OSDs you'll
> see the DB grow quite big.
>
> This has improved significantly going to Jewel and Luminous, but it is
> still something to watch out for.
>
> Make sure your MONs have enough free space to handle this!
>
> Wido
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Senior CyberInfrastructure Storage Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 204
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests on a specific osd

2018-01-15 Thread Wes Dillingham
My understanding is that the exact same objects would move back to the OSD
if the weight went 1 -> 0 -> 1, given the same cluster state and the same object
names; CRUSH is deterministic, so that is almost certainly what would happen.
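A quick way to sanity-check this on your own cluster, if you like (the object
name below is just a placeholder, substitute one of your own):

# note the acting set for a given object
ceph osd map rbd <some-object-name>
# reweight to 0, let backfill finish, reweight back to 1, let it finish again,
# then re-run the same command; with an unchanged CRUSH map and weights the
# mapping should come back identical
ceph osd map rbd <some-object-name>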

On Mon, Jan 15, 2018 at 2:46 PM, lists <li...@merit.unu.edu> wrote:

> Hi Wes,
>
> On 15-1-2018 20:32, Wes Dillingham wrote:
>
>> I don't hear a lot of people discuss using xfs_fsr on OSDs, and going over
>> the mailing list history it seems to have been brought up very infrequently
>> and never as a suggestion for regular maintenance. Perhaps it's not needed.
>>
> True, it's just something we've always done on all our xfs filesystems, to
> keep them speedy and snappy. I've disabled it, and now it doesn't happen.
>
> Perhaps I'll keep it disabled.
>
> But on this last question, about data distribution across OSDs:
>
>> In that case, how about reweighting that osd.10 to "0", waiting until
>> all data has moved off osd.10, and then setting it back to "1"?
>> Would this result in *exactly* the same situation as before, or
>> would it at least cause the data to spread better across
>> the other OSDs?
>>
>
> Would it work like that? Or would setting it back to "1" give me again the
> same data on this OSD that we started with?
>
> Thanks for your comments,
>
> MJ
> ___________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Senior CyberInfrastructure Storage Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 204
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests on a specific osd

2018-01-15 Thread Wes Dillingham
I don't hear a lot of people discuss using xfs_fsr on OSDs, and going over
the mailing list history it seems to have been brought up very infrequently
and never as a suggestion for regular maintenance. Perhaps it's not needed.

One thing to consider trying, to rule out something funky with the XFS
filesystem on that particular OSD/drive, would be to remove the OSD entirely
from the cluster, reformat the disk, and then rebuild the OSD, putting a
brand new XFS on it.
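Roughly the procedure I have in mind, as a sketch only (IDs and devices are
placeholders; double-check against the docs for your release before running it):

ceph osd out osd.10                      # drain it first
# wait for HEALTH_OK / all PGs active+clean
systemctl stop ceph-osd@10
ceph osd crush remove osd.10
ceph auth del osd.10
ceph osd rm osd.10
ceph-disk zap /dev/sdX                   # wipe the old filesystem
ceph-disk prepare /dev/sdX /dev/sdY1     # fresh XFS, journal partition on SSD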

On Mon, Jan 15, 2018 at 7:36 AM, lists <li...@merit.unu.edu> wrote:

> Hi,
>
> On our three-node, 24-OSD Ceph 10.2.10 cluster, we have started seeing
> slow requests on a specific OSD during the two-hour nightly xfs_fsr
> run from 05:00 - 07:00. This started after we applied the Meltdown patches.
>
> The specific osd.10 also has the highest space utilization of all OSDs
> cluster-wide, with 45%, while the others are mostly around 40%. All OSDs
> are the same 4TB platters with journal on ssd, all with weight 1.
>
> Smart info for osd.10 shows nothing interesting I think:
>
>> Current Drive Temperature: 27 C
>> Drive Trip Temperature: 60 C
>>
>> Manufactured in week 04 of year 2016
>> Specified cycle count over device lifetime:  1
>> Accumulated start-stop cycles:  53
>> Specified load-unload count over device lifetime:  30
>> Accumulated load-unload cycles:  697
>> Elements in grown defect list: 0
>>
>> Vendor (Seagate) cache information
>>   Blocks sent to initiator = 1933129649
>>   Blocks received from initiator = 869206640
>>   Blocks read from cache and sent to initiator = 2149311508
>>   Number of read and write commands whose size <= segment size = 676356809
>>   Number of read and write commands whose size > segment size = 12734900
>>
>> Vendor (Seagate/Hitachi) factory information
>>   number of hours powered up = 13625.88
>>   number of minutes until next internal SMART test = 8
>>
>
> Now my question:
> Could it be that osd.10 just happens to contain some data chunks that are
> heavily needed by the VMs around that time, and that the added load of an
> xfs_fsr is simply too much for it to handle?
>
> In that case, how about reweighting that osd.10 to "0", waiting until all
> data has moved off osd.10, and then setting it back to "1"? Would this
> result in *exactly* the same situation as before, or would it at least
> cause the data to spread better across the other OSDs?
>
> (with the idea that a better data spread across OSDs also brings a better
> distribution of load between the OSDs)
>
> Or other ideas to check out?
>
> MJ
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Senior CyberInfrastructure Storage Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 204
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Switching a pool from EC to replicated online ?

2018-01-15 Thread Wes Dillingham
You would need to create a new pool and migrate the data to that new pool.

A replicated pool fronting an EC pool for RBD is a known-bad workload:
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/#a-word-of-caution
but others' mileage may vary, I suppose.

In order to migrate you could do one RBD image at a time; I would probably take
a snapshot and then do an `rbd cp` operation from poolA/image@snap to
poolB/image.

If you are okay with the VMs being powered down you could look at `rbd mv`,
though it doesn't support renames across pools, so I would prefer the cp
method anyway.

You could also do a wholesale pool copy using `rados cppool`; see
http://ceph.com/geen-categorie/ceph-pool-migration/
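A minimal sketch of the per-image approach (pool and image names are
placeholders, untested as written):

rbd snap create poolA/image1@migrate
rbd cp poolA/image1@migrate poolB/image1
# verify the copy and repoint the client/VM at poolB/image1, then clean up:
rbd snap rm poolA/image1@migrate
rbd rm poolA/image1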

Best of luck.

On Sat, Jan 13, 2018 at 6:37 PM, moftah moftah <moft...@gmail.com> wrote:

> Hi All,
> Is there a way to switch a pool that is set up as EC to being replicated
> without the need to switch to a new pool and migrate the data?
>
> I am getting poor results from EC and want to switch to replicated, but I
> already have customers on the system.
> I am using Ceph 11.
> The EC pool already has a cache tier that is replicated.
>
> Thanks
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Senior CyberInfrastructure Storage Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 204
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Open Compute (OCP) servers for Ceph

2018-01-10 Thread Wes Dillingham
Not OCP, but regarding 12x 3.5" drives in 1U with a decent CPU, QCT makes the
following:
https://www.qct.io/product/index/Server/rackmount-server/1U-Rackmount-Server/QuantaGrid-S51G-1UL
and they have a few other models that include some additional SSDs in addition
to the 3.5" bays.

Both of those compared here:
https://www.qct.io/product/compare?model=1323,215

QCT does manufacture quite a few OCP models; this one may fit the mold:
https://www.qct.io/product/index/Rack/Rackgo-X-RSD/Rackgo-X-RSD-Storage/Rackgo-X-RSD-Knoxville#specifications

On Fri, Dec 22, 2017 at 9:22 AM, Wido den Hollander <w...@42on.com> wrote:

>
>
> On 12/22/2017 02:40 PM, Dan van der Ster wrote:
>
>> Hi Wido,
>>
>> We have used a few racks of Wiwynn OCP servers in a Ceph cluster for a
>> couple of years.
>> The machines are dual Xeon [1] and use some of those 2U 30-disk "Knox"
>> enclosures.
>>
>>
> Yes, I see. I was looking for a solution without a JBOD, with about 12x
> 3.5" drives or ~20x 2.5" drives in 1U and a decent CPU to run OSDs on.
>
> Other than that, I have nothing particularly interesting to say about
>> these. Our data centre procurement team have also moved on with
>> standard racked equipment, so I suppose they also found these
>> uninteresting.
>>
>>
> It really depends. When properly deployed OCP can seriously lower power
> costs for numerous reasons and thus lower the TCO of a Ceph cluster.
>
> But I dislike the machines with a lot of disks for Ceph, I prefer smaller
> machines.
>
> Hopefully somebody knows a vendor who makes such OCP machines.
>
> Wido
>
>
> Cheers, Dan
>>
>> [1] http://www.wiwynn.com/english/product/type/details/32?ptype=28
>>
>>
>> On Fri, Dec 22, 2017 at 12:04 PM, Wido den Hollander <w...@42on.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm looking at OCP [0] servers for Ceph and I'm not able to find yet what
>>> I'm looking for.
>>>
>>> First of all, the geek in me loves OCP and the design :-) Now I'm trying
>>> to
>>> match it with Ceph.
>>>
>>> Looking at wiwynn [1] they offer a few OCP servers:
>>>
>>> - 3 nodes in 2U with a single 3.5" disk [2]
>>> - 2U node with 30 disks and a Atom C2000 [3]
>>> - 2U JDOD with 12G SAS [4]
>>>
>>> For Ceph I would want:
>>>
>>> - 1U node / 12x 3.5" / Fast CPU
>>> - 1U node / 24x 2.5" / Fast CPU
>>>
>>> They don't seem to exist yet when looking for OCP server.
>>>
>>> Although 30 drives is fine, it would become a very large Ceph cluster
>>> when
>>> building with something like that.
>>>
>>> Has anybody built Ceph clusters using OCP hardware yet? If so, which
>>> vendor, and what are your experiences?
>>>
>>> Thanks!
>>>
>>> Wido
>>>
>>> [0]: http://www.opencompute.org/
>>> [1]: http://www.wiwynn.com/
>>> [2]: http://www.wiwynn.com/english/product/type/details/65?ptype=28
>>> [3]: http://www.wiwynn.com/english/product/type/details/33?ptype=28
>>> [4]: http://www.wiwynn.com/english/product/type/details/43?ptype=28
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Senior CyberInfrastructure Storage Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 204
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-09 Thread Wes Dillingham
Similar to Dan's situation, we utilize the --cluster name concept for our
operations, primarily for "datamover" nodes which do incremental rbd
import/export between distinct clusters. This is entirely coordinated by
utilizing the --cluster option throughout.

The way we set it up is that all clusters are actually named "ceph" on the
mons, osds, etc., but the clients themselves get /etc/ceph/clusterA.conf
and /etc/ceph/clusterB.conf so that we can differentiate. I would like to
see the ability for clients to specify which conf file to read preserved.
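For illustration, a datamover invocation looks roughly like this (image and
snapshot names are made up), assuming /etc/ceph/clusterA.conf,
/etc/ceph/clusterB.conf and matching keyrings are in place on the node:

rbd --cluster clusterA export-diff --from-snap 2017-06-08 rbd/vm1@2017-06-09 - | \
  rbd --cluster clusterB import-diff - rbd/vm1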

As a note, though, we went the route of naming all clusters "ceph" to work
around difficulties with non-standard naming, so this issue does need some
attention.

On Fri, Jun 9, 2017 at 8:19 AM, Alfredo Deza <ad...@redhat.com> wrote:

> On Thu, Jun 8, 2017 at 3:54 PM, Sage Weil <sw...@redhat.com> wrote:
> > On Thu, 8 Jun 2017, Bassam Tabbara wrote:
> >> Thanks Sage.
> >>
> >> > At CDM yesterday we talked about removing the ability to name your
> ceph
> >> > clusters.
> >>
> >> Just to be clear, it would still be possible to run multiple ceph
> >> clusters on the same nodes, right?
> >
> > Yes, but you'd need to either (1) use containers (so that different
> > daemons see a different /etc/ceph/ceph.conf) or (2) modify the systemd
> > unit files to do... something.
>
> In the container case, I need to clarify that ceph-docker deployed
> with ceph-ansible is not capable of doing this, since
> the ad-hoc systemd units use the hostname as part of the identifier
> for the daemon, e.g:
>
> systemctl enable ceph-mon@{{ ansible_hostname }}.service
>
>
> >
> > This is actually no different from Jewel. It's just that currently you
> can
> > run a single cluster on a host (without containers) but call it 'foo' and
> > knock yourself out by passing '--cluster foo' every time you invoke the
> > CLI.
> >
> > I'm guessing you're in the (1) case anyway and this doesn't affect you at
> > all :)
> >
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Senior CyberInfrastructure Storage Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 102
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Upper limit of MONs and MDSs in a Cluster

2017-05-25 Thread Wes Dillingham
How much testing has there been, and what are the implications of having a
large number of Monitor and Metadata daemons running in a cluster?

Thus far I have deployed all of our Ceph clusters as a single service type
per physical machine, but I am interested in a use case where we deploy
dozens (hundreds?) of boxes, each of which would be a mon, mds, mgr, osd, and
rgw all in one, and all in a single cluster. I do realize it is somewhat
trivial (with config mgmt and all) to dedicate a couple of lean boxes as MDSs
and MONs and only expand at the OSD level, but I'm still curious.

The use case I have in mind is backup targets where pools span the entire
cluster, and I am looking to streamline the process for possible rack-and-stack
situations where boxes can just be added in place, booted up, and they
auto-join the cluster as a mon/mds/mgr/osd/rgw.

So does anyone run clusters with dozens of MONs and/or MDSs, or is anyone
aware of any testing with very high numbers of each? At the MDS level I would
just be looking for 1 active, 1 standby-replay, and X standby until multiple
active MDSs are production ready. Thanks!

-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 102
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Available tools for deploying ceph cluster as a backend storage ?

2017-05-18 Thread Wes Dillingham
If you don't want a full-fledged configuration management approach,
ceph-deploy is your best bet:
http://docs.ceph.com/docs/master/rados/deployment/ceph-deploy-new/
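A very rough outline of the ceph-deploy workflow (hostnames and devices are
placeholders; see the docs above for the authoritative steps):

ceph-deploy new mon1 mon2 mon3
ceph-deploy install mon1 mon2 mon3 osd1 osd2 osd3
ceph-deploy mon create-initial
ceph-deploy admin mon1 mon2 mon3 osd1 osd2 osd3
ceph-deploy osd create osd1:/dev/sdb osd2:/dev/sdb osd3:/dev/sdb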

On Thu, May 18, 2017 at 8:28 AM, Shambhu Rajak <sra...@sandvine.com> wrote:

> Hi ceph-users,
>
>
>
> I want to deploy a Ceph cluster as backend storage for OpenStack, so I am
> trying to find the best tool available for deploying the cluster.
>
> Few are in my mind:
>
> https://github.com/ceph/ceph-ansible
>
> https://github.com/01org/virtual-storage-manager/wiki/Getting-Started-with-VSM
>
>
>
> Is there anything else available that would be easier to use and suitable
> for a production deployment?
>
>
>
> Thanks,
> Shambhu Rajak
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 102
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Client's read affinity

2017-04-05 Thread Wes Dillingham
This is a big development for us. I have not heard of this option either. I
am excited to play with this feature and the implications it may have in
improving RBD reads in our multi-datacenter RBD pools.

Just to clarify the following options:
"rbd localize parent reads = true" and "crush location = foo=bar" are
configuration options for the client's ceph.conf and are not needed for OSD
hosts as their locations are already encoded in the CRUSH map.
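In other words, something like this in the client's ceph.conf (the location
values are just examples, adjust to your own topology):

[client]
    rbd localize parent reads = true
    crush location = host=compute01 rack=rack1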

It looks like this is a pretty old option (http://narkive.com/ZkTahBVu:5.455.67),
so I am assuming it is relatively tried and true, but I have never heard of it
before... is anyone out there using this in a production RBD environment?




On Tue, Apr 4, 2017 at 7:36 PM, Jason Dillaman <jdill...@redhat.com> wrote:

> AFAIK, the OSDs should discover their location in the CRUSH map
> automatically -- therefore, this "crush location" config override
> would be used for librbd client configuration ("i.e. [client]
> section") to describe their location in the CRUSH map relative to
> racks, hosts, etc.
>
> On Tue, Apr 4, 2017 at 3:12 PM, Brian Andrus <brian.and...@dreamhost.com>
> wrote:
> > Jason, I haven't heard much about this feature.
> >
> > Will the localization have effect if the crush location configuration is
> set
> > in the [osd] section, or does it need to apply globally for clients as
> well?
> >
> > On Fri, Mar 31, 2017 at 6:38 AM, Jason Dillaman <jdill...@redhat.com>
> wrote:
> >>
> >> Assuming you are asking about RBD-back VMs, it is not possible to
> >> localize the all reads to the VM image. You can, however, enable
> >> localization of the parent image since that is a read-only data set.
> >> To enable that feature, set "rbd localize parent reads = true" and
> >> populate the "crush location = host=X rack=Y etc=Z" in your ceph.conf.
> >>
> >> On Fri, Mar 31, 2017 at 9:00 AM, Alejandro Comisario
> >> <alejan...@nubeliu.com> wrote:
> >> > any experiences ?
> >> >
> >> > On Wed, Mar 29, 2017 at 2:02 PM, Alejandro Comisario
> >> > <alejan...@nubeliu.com> wrote:
> >> >> Guys hi.
> >> >> I have a Jewel Cluster divided into two racks which is configured on
> >> >> the crush map.
> >> >> I have clients (openstack compute nodes) that are closer from one
> rack
> >> >> than to another.
> >> >>
> >> >> I would love to (if is possible) to specify in some way the clients
> to
> >> >> read first from the nodes on a specific rack then try the other one
> if
> >> >> is not possible.
> >> >>
> >> >> Is that doable ? can somebody explain me how to do it ?
> >> >> best.
> >> >>
> >> >> --
> >> >> Alejandrito
> >> >
> >> >
> >> >
> >> > --
> >> > Alejandro Comisario
> >> > CTO | NUBELIU
> >> > E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
> >> > _
> >> > www.nubeliu.com
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >>
> >> --
> >> Jason
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> >
> > --
> > Brian Andrus | Cloud Systems Engineer | DreamHost
> > brian.and...@dreamhost.com | www.dreamhost.com
>
>
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] INFO:ceph-create-keys:ceph-mon admin socket not ready yet.

2017-03-21 Thread Wes Dillingham
Generally this means the monitor daemon is not running. Is the monitor
daemon running? The monitor daemon creates the admin socket in
/var/run/ceph/$socket

Please elaborate on how you are attempting to deploy Ceph.
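A few things to check (assuming the mon id is "mon1", adjust to yours):

systemctl status ceph-mon@mon1                 # is the daemon actually running?
ls -l /var/run/ceph/                           # the .asok should show up here
ceph --admin-daemon /var/run/ceph/ceph-mon.mon1.asok mon_status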

On Tue, Mar 21, 2017 at 9:01 AM, Vince <vi...@ihnetworks.com> wrote:

> Hi,
>
> I am getting the below error in messages after setting up ceph monitor.
>
> ===
> Mar 21 08:48:23 mon1 ceph-create-keys: admin_socket: exception getting
> command descriptions: [Errno 2] No such file or directory
> Mar 21 08:48:23 mon1 ceph-create-keys: INFO:ceph-create-keys:ceph-mon
> admin socket not ready yet.
> Mar 21 08:48:23 mon1 ceph-create-keys: admin_socket: exception getting
> command descriptions: [Errno 2] No such file or directory
> Mar 21 08:48:23 mon1 ceph-create-keys: INFO:ceph-create-keys:ceph-mon
> admin socket not ready yet.
> ===
>
> On checking the ceph-create-keys service status, getting the below error.
>
> ===
> [root@mon1 ~]# systemctl status ceph-create-keys@mon1.service
> ● ceph-create-keys@mon1.service - Ceph cluster key creator task
> Loaded: loaded (/usr/lib/systemd/system/ceph-create-keys@.service;
> static; vendor preset: disabled)
> Active: inactive (dead) since Thu 2017-02-16 10:47:14 PST; 1 months 2 days
> ago
> Condition: start condition failed at Tue 2017-03-21 05:47:42 PDT; 2s ago
> ConditionPathExists=!/var/lib/ceph/bootstrap-mds/ceph.keyring was not met
> Main PID: 2576 (code=exited, status=0/SUCCESS)
> ===
>
> Have anyone faced this error before ?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-21 Thread Wes Dillingham
If you had set min_size to 1 you would not have seen the writes pause. A
min_size of 1 is dangerous, though, because it means you are one hard disk
failure away from losing the objects within that placement group entirely.
A min_size of 2 is generally considered the minimum you want, but many
people ignore that advice; some wish they hadn't.
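For reference, checking and changing it is just (pool name assumed to be rbd):

ceph osd pool get rbd min_size
ceph osd pool set rbd min_size 1    # allows IO with a single surviving replica,
                                    # with the risk described above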

On Tue, Mar 21, 2017 at 11:46 AM, Adam Carheden <carhe...@ucar.edu> wrote:

> Thanks everyone for the replies. Very informative. However, should I
> have expected writes to pause if I'd had min_size set to 1 instead of 2?
>
> And yes, I was under the false impression that my rbd device was a
> single object. That explains what all those other things are on a test
> cluster where I only created a single object!
>
>
> --
> Adam Carheden
>
> On 03/20/2017 08:24 PM, Wes Dillingham wrote:
> > This is because of the min_size specification. I would bet you have it
> > set at 2 (which is good).
> >
> > ceph osd pool get rbd min_size
> >
> > With 4 hosts and a size of 3, removing 2 of the hosts (or 2 drives, 1
> > from each host) results in some of the objects only having 1 replica.
> > min_size dictates that IO freezes for those objects until min_size is
> > achieved. http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
> >
> > I can't tell if you're under the impression that your RBD device is a
> > single object. It is not. It is chunked up into many objects and spread
> > throughout the cluster, as Kjetil mentioned earlier.
> >
> > On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <kje...@medallia.com> wrote:
> >
> > Hi,
> >
> > rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents
> > will get you a "prefix", which then gets you on to
> > rbd_header.<prefix>; rbd_header.<prefix> contains block size,
> > striping, etc. The actual data-bearing objects will be named
> > something like rbd_data.<prefix>.%-016x.
> >
> > Example - vm-100-disk-1 has the prefix 86ce2ae8944a, so the first
> > <block size> chunk of that image will be named
> > rbd_data.86ce2ae8944a.0000000000000000, the second will be
> > rbd_data.86ce2ae8944a.0000000000000001, and so on; chances are that
> > one of these objects is mapped to a pg which has both host3 and
> > host4 among its replicas.
> >
> > An rbd image will end up scattered across most/all osds of the pool
> > it's in.
> >
> > Cheers,
> > -KJ
> >
> > On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden <carhe...@ucar.edu> wrote:
> >
> > I have a 4 node cluster shown by `ceph osd tree` below. Monitors
> are
> > running on hosts 1, 2 and 3. It has a single replicated pool of
> size
> > 3. I have a VM with its hard drive replicated to OSDs 11(host3),
> > 5(host1) and 3(host2).
> >
> > I can 'fail' any one host by disabling the SAN network interface
> and
> > the VM keeps running with a simple slowdown in I/O performance
> > just as
> > expected. However, if I 'fail' both nodes 3 and 4, I/O hangs on
> > the VM.
> > (i.e. `df` never completes, etc.) The monitors on hosts 1 and 2
> > still
> > have quorum, so that shouldn't be an issue. The placement group
> > still
> > has 2 of its 3 replicas online.
> >
> > Why does I/O hang even though host4 isn't running a monitor and
> > doesn't have anything to do with my VM's hard drive?
> >
> >
> > Size?
> > # ceph osd pool get rbd size
> > size: 3
> >
> > Where's rbd_id.vm-100-disk-1?
> > # ceph osd getmap -o /tmp/map && osdmaptool --pool 0
> > --test-map-object
> > rbd_id.vm-100-disk-1 /tmp/map
> > got osdmap epoch 1043
> > osdmaptool: osdmap file '/tmp/map'
> >  object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
> >
> > # ceph osd tree
> > ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > -1 8.06160 root default
> > -7 5.50308 room A
> > -3 1.88754 host host1
> >  4 0.40369 osd.4   up  1.0  1.0
> >  5 0.40369 osd.5   up  1.0  1.0
> >  6 0.54008 osd.6   up  1.0  1.0
> >  7 0.54008  

Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-20 Thread Wes Dillingham
This is because of the min_size specification. I would bet you have it set
at 2 (which is good).

ceph osd pool get rbd min_size

With 4 hosts and a size of 3, removing 2 of the hosts (or 2 drives, 1 from
each host) results in some of the objects only having 1 replica.
min_size dictates that IO freezes for those objects until min_size is
achieved.
http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas

I can't tell if you're under the impression that your RBD device is a single
object. It is not. It is chunked up into many objects and spread throughout
the cluster, as Kjetil mentioned earlier.
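You can see the chunking for yourself; using Kjetil's example prefix below
(substitute your own pool/image):

rbd info rbd/vm-100-disk-1 | grep block_name_prefix      # e.g. rbd_data.86ce2ae8944a
rados -p rbd ls | grep rbd_data.86ce2ae8944a | head       # the actual data objects
ceph osd map rbd rbd_data.86ce2ae8944a.0000000000000000   # where one of them lives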

On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <kje...@medallia.com>
wrote:

> Hi,
>
> rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents will get
> you a "prefix", which then gets you on to rbd_header.<prefix>;
> rbd_header.<prefix> contains block size, striping, etc. The actual
> data-bearing objects will be named something like rbd_data.<prefix>.%-016x.
>
> Example - vm-100-disk-1 has the prefix 86ce2ae8944a, so the first <block
> size> chunk of that image will be named rbd_data.86ce2ae8944a.0000000000000000,
> the second will be rbd_data.86ce2ae8944a.0000000000000001, and so on; chances
> are that one of these objects is mapped to a pg which has both host3 and
> host4 among its replicas.
>
> An rbd image will end up scattered across most/all osds of the pool it's
> in.
>
> Cheers,
> -KJ
>
> On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden <carhe...@ucar.edu> wrote:
>
>> I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
>> running on hosts 1, 2 and 3. It has a single replicated pool of size
>> 3. I have a VM with its hard drive replicated to OSDs 11(host3),
>> 5(host1) and 3(host2).
>>
>> I can 'fail' any one host by disabling the SAN network interface and
>> the VM keeps running with a simple slowdown in I/O performance just as
>> expected. However, if I 'fail' both nodes 3 and 4, I/O hangs on the VM.
>> (i.e. `df` never completes, etc.) The monitors on hosts 1 and 2 still
>> have quorum, so that shouldn't be an issue. The placement group still
>> has 2 of its 3 replicas online.
>>
>> Why does I/O hang even though host4 isn't running a monitor and
>> doesn't have anything to do with my VM's hard drive?
>>
>>
>> Size?
>> # ceph osd pool get rbd size
>> size: 3
>>
>> Where's rbd_id.vm-100-disk-1?
>> # ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object
>> rbd_id.vm-100-disk-1 /tmp/map
>> got osdmap epoch 1043
>> osdmaptool: osdmap file '/tmp/map'
>>  object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
>>
>> # ceph osd tree
>> ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 8.06160 root default
>> -7 5.50308 room A
>> -3 1.88754 host host1
>>  4 0.40369 osd.4   up  1.0  1.0
>>  5 0.40369 osd.5   up  1.0  1.0
>>  6 0.54008 osd.6   up  1.0  1.0
>>  7 0.54008 osd.7   up  1.0  1.0
>> -2 3.61554 host host2
>>  0 0.90388 osd.0   up  1.0  1.0
>>  1 0.90388 osd.1   up  1.0  1.0
>>  2 0.90388 osd.2   up  1.0  1.0
>>  3 0.90388 osd.3   up  1.0  1.0
>> -6 2.55852 room B
>> -4 1.75114 host host3
>>  8 0.40369 osd.8   up  1.0  1.0
>>  9 0.40369 osd.9   up  1.0  1.0
>> 10 0.40369 osd.10  up  1.0  1.0
>> 11 0.54008 osd.11  up  1.0  1.0
>> -5 0.80737 host host4
>> 12 0.40369 osd.12  up  1.0  1.0
>> 13 0.40369 osd.13  up  1.0  1.0
>>
>>
>> --
>> Adam Carheden
>> _______
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Kjetil Joergensen <kje...@medallia.com>
> SRE, Medallia Inc
> Phone: +1 (650) 739-6580 <(650)%20739-6580>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephalocon Sponsorships Open

2016-12-22 Thread Wes Dillingham
I / my group / our organization would be interested in discussing our
deployment of Ceph and how we are using it, deploying it, future plans etc.
This sounds like an exciting event. We look forward to hearing more
details.

On Thu, Dec 22, 2016 at 1:44 PM, Patrick McGarry <pmcga...@redhat.com>
wrote:

> Hey cephers,
>
> Just letting you know that we're opening the flood gates for
> sponsorship opportunities at Cephalocon next year (23-25 Aug 2017,
> Boston, MA). If you would be interested in sponsoring/exhibiting at
> our inaugural Ceph conference, please drop me a line. Thanks!
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about writing a program that transfer snapshot diffs between ceph clusters

2016-11-01 Thread Wes Dillingham
You might want to have a look at this:
https://github.com/camptocamp/ceph-rbd-backup/blob/master/ceph-rbd-backup.py

I have a bash implementation of this, but it basically boils down to
wrapping what Peter said: an export-diff to stdout piped to an
import-diff on a different cluster. The "transfer" node is a client of
both clusters and simply iterates over all rbd devices, snapshotting
them daily, exporting the diff between today's snap and yesterday's
snap, and layering that diff onto a sister rbd on the remote side.
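Stripped down, the daily loop looks roughly like the following (untested here;
cluster and pool names are placeholders, and the very first run has to be a
full export/import to seed the sister image):

TODAY=$(date +%F); YESTERDAY=$(date -d yesterday +%F)
for IMG in $(rbd --cluster clusterA -p rbd ls); do
    rbd --cluster clusterA snap create rbd/$IMG@$TODAY
    rbd --cluster clusterA export-diff --from-snap $YESTERDAY rbd/$IMG@$TODAY - | \
        rbd --cluster clusterB import-diff - rbd/$IMG
done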


On Tue, Nov 1, 2016 at 5:23 AM, Peter Maloney
<peter.malo...@brockmann-consult.de> wrote:
> On 11/01/16 10:22, Peter Maloney wrote:
>
> On 11/01/16 06:57, xxhdx1985126 wrote:
>
> Hi, everyone.
>
> I'm trying to write a program based on the librbd API that transfers
> snapshot diffs between ceph clusters without the need for temporary
> storage, which is required if I use the "rbd export-diff" and "rbd
> import-diff" pair.
>
>
> You don't need a temp file for this... eg.
>
>
> oops forgot the "-" in the commands corrected:
>
> ssh node1 rbd export-diff rbd/blah@snap1 - | rbd import-diff - rbd/blah
> ssh node1 rbd export-diff --from-snap snap1 rbd/blah@snap2 - | rbd
> import-diff - rbd/blah
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reliable monitor restarts

2016-10-24 Thread Wes Dillingham
What do the logs of the monitor service say? Increase their verbosity
and check the logs at the time of the crash. Are you doing any sort of
monitoring on the nodes such that you can forensically check what the
system was up to prior to the crash?

As others have said, systemd can handle this via unit files; in fact this is
set up for you when installing Ceph (at least in version 10.x / Jewel). Which
version of Ceph are you running?

Also, as others have stated, the MON service is very reliable and should not
be crashing; we have had zero crashes of the mon service in 1.5 years of
running. Something is afoot.

Configuration management platforms can also ensure daemons remain running,
but that is belt and suspenders on top of systemd.

On Sat, Oct 22, 2016 at 6:57 AM, Ruben Kerkhof <ru...@rubenkerkhof.com> wrote:
> On Fri, Oct 21, 2016 at 9:31 PM, Steffen Weißgerber
> <weissgerb...@ksnb.de> wrote:
>> Hello,
>>
>> we're running a 6 node ceph cluster with 3 mons on Ubuntu (14.04.4).
>>
>> Sometimes it happens that the mon services die and have to be restarted
>> manually.
>>
>> To have reliable service restarts I normally use D.J. Bernstein's daemontools
>> on other Linux distributions. Until now I have never done this on Ubuntu.
>>
>> Is there a comparable way to configure such a watcher on services on Ubuntu
>> (i.e. under systemd)?
>
> Systemd handles this for you.
> The ceph-mon unit file has:
>
> Restart=on-failure
> StartLimitInterval=30min
> StartLimitBurst=3
>
> Note that systemd only restarts it 3 times in 30 minutes. If it fails
> more often, you'll have to reset the unit.
>
> You can finetune this with drop-ins, see systemd.service(5) for details.
>
>>
>> Regards and have a nice weekend.
>>
>> Steffen
>
> Kind regards,
>
> Ruben Kerkhof
> _______
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph on two data centers far away

2016-10-21 Thread Wes Dillingham
What is the use case that requires you to have it in two datacenters?
In addition to RBD mirroring already mentioned by others, you can do
RBD snapshots and ship those snapshots to a remote location (separate
cluster or separate pool). Similar to RBD mirroring, in this situation
your client writes are not subject to that latency.
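A one-off example of what that shipping looks like (image, snapshot and host
names are placeholders):

rbd snap create rbd/vm1@2016-10-21
rbd export-diff --from-snap 2016-10-20 rbd/vm1@2016-10-21 - | \
    ssh remote-site rbd import-diff - rbd/vm1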

On Thu, Oct 20, 2016 at 1:51 PM, German Anders <gand...@despegar.com> wrote:
> Thanks, that's too far actually, lol. And how are things going with rbd
> mirroring?
>
> German
>
> 2016-10-20 14:49 GMT-03:00 yan cui <ccuiy...@gmail.com>:
>>
>> The two data centers are actually across the US. One is in the west, and the
>> other in the east.
>> We are trying to sync rbd images using RBD mirroring.
>>
>> 2016-10-20 9:54 GMT-07:00 German Anders <gand...@despegar.com>:
>>>
>>> Out of curiosity, I wanted to ask what kind of network topology you are
>>> trying to use across the cluster. In this type of scenario you really need
>>> an ultra-low-latency network; how far apart are the sites?
>>>
>>> Best,
>>>
>>> German
>>>
>>> 2016-10-18 16:22 GMT-03:00 Sean Redmond <sean.redmo...@gmail.com>:
>>>>
>>>> Maybe this would be an option for you:
>>>>
>>>> http://docs.ceph.com/docs/jewel/rbd/rbd-mirroring/
>>>>
>>>>
>>>> On Tue, Oct 18, 2016 at 8:18 PM, yan cui <ccuiy...@gmail.com> wrote:
>>>>>
>>>>> Hi Guys,
>>>>>
>>>>> Our company has a use case which needs Ceph to span two data centers
>>>>> (one data center is far away from the other). The experience of using a
>>>>> single data center is good. We did some benchmarking across two data
>>>>> centers, and the performance is bad because of the synchronous nature of
>>>>> replication in Ceph and the large latency between the data centers. So,
>>>>> are there setups or data-center-aware features in Ceph that would give
>>>>> us good locality? Usually we use rbd to create volumes and snapshots,
>>>>> but we want the volumes to be highly available with acceptable
>>>>> performance in case one data center goes down. Our current setup does
>>>>> not take data center differences into account. Any ideas?
>>>>>
>>>>>
>>>>> Thanks, Yan
>>>>>
>>>>> --
>>>>> Think big; Dream impossible; Make it happen.
>>>>>
>>>>> _______
>>>>> ceph-users mailing list
>>>>> ceph-users@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>>
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>
>>
>>
>> --
>> Think big; Dream impossible; Make it happen.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coding general information Openstack+kvm virtual machine block storage

2016-09-16 Thread Wes Dillingham
Erick,

You can use erasure coding, but it has to be fronted by a replicated cache
tier, or so the documentation states. I have never set up this configuration,
and I always opt to use RBD directly on replicated pools.

https://access.redhat.com/documentation/en/red-hat-ceph-storage/1.3/paged/storage-strategies/chapter-31-erasure-coded-pools-and-cache-tiering
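For what it's worth, the general shape of that setup is something like the
following (pool names, PG counts, and the EC profile are placeholders; read
the cache tiering docs carefully before using this in production):

ceph osd pool create ecpool 1024 1024 erasure
ceph osd pool create cachepool 512
ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool
ceph osd pool set cachepool hit_set_type bloom
rbd create --pool ecpool --size 10240 testimage    # clients then use ecpool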

On Fri, Sep 16, 2016 at 1:15 AM, Josh Durgin <jdur...@redhat.com> wrote:

> On 09/16/2016 09:46 AM, Erick Perez - Quadrian Enterprises wrote:
>
>> Can someone point me to a thread or site that uses ceph+erasure coding
>> to serve block storage for Virtual Machines running with Openstack+KVM?
>> All references that I found are using erasure coding for cold data or
>> *not* VM block access.
>>
>
> Erasure coding is not supported by RBD currently, since EC pools only
> support append operations. There's work in progress to make it
> possible, by allowing overwrites for EC pools, but it won't be usable
> until at earliest Luminous [0].
>
> Josh
>
> [0] http://tracker.ceph.com/issues/14031
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com