[ceph-users] New BlueJeans Meeting ID for Performance and Testing Meetings

2018-09-11 Thread Mike Perez
Hey all,

I have updated the testing and performance meetings with a new call number and
meeting ID to help with future recordings. The events should have the
information you need, but just in case:

US and Canada: 408-915-6466
See all numbers: https://www.redhat.com/en/conference-numbers

Enter Meeting ID: 908675367

Thanks!

-- 
Mike Perez (thingee)


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] omap vs. xattr in librados

2018-09-11 Thread Benjamin Cherian
Ok, that’s good to know. I was planning on using an EC pool. Maybe I'll
store some of the larger kv pairs in their own objects or move the metadata
into its own replicated pool entirely. If the storage mechanism is the
same, is there a reason xattrs are supported and omap is not? (Or is there
some hidden cost to storing kv pairs in an EC pool I’m unaware of, e.g.,
does the kv data get replicated across all OSDs being used for a PG or
something?)

Thanks,
Ben

On Tue, Sep 11, 2018 at 1:46 PM Patrick Donnelly 
wrote:

> On Tue, Sep 11, 2018 at 12:43 PM, Benjamin Cherian
>  wrote:
> > On Tue, Sep 11, 2018 at 10:44 AM Gregory Farnum 
> wrote:
> >>
> >> 
> >> In general, if the key-value storage is of unpredictable or non-trivial
> >> size, you should use omap.
> >>
> >> At the bottom layer where the data is actually stored, they're likely to
> >> be in the same places (if using BlueStore, they are the same — in
> FileStore,
> >> a rados xattr *might* be in the local FS xattrs, or it might not). It is
> >> somewhat more likely that something stored in an xattr will get pulled
> into
> >> memory at the same time as the object's internal metadata, but that only
> >> happens if it's quite small (think the xfs or ext4 xattr rules).
> >
> >
> > Based on this description, if I'm planning on using Bluestore, there is
> no
> > particular reason to ever prefer using xattrs over omap (outside of ease
> of
> > use in the API), correct?
>
> You may prefer xattrs on bluestore if the metadata is small and you
> may need to store it on an EC pool; omap is not supported on EC pools.
>
> --
> Patrick Donnelly
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding node efficient data move.

2018-09-11 Thread Anthony D'Atri
> When adding a node, if I increment the crush weight like this, do I have 
> the most efficient data transfer to the 4th node?
> 
> sudo -u ceph ceph osd crush reweight osd.23 1 
> sudo -u ceph ceph osd crush reweight osd.24 1 
> sudo -u ceph ceph osd crush reweight osd.25 1 
> sudo -u ceph ceph osd crush reweight osd.26 1 
> sudo -u ceph ceph osd crush reweight osd.27 1 
> sudo -u ceph ceph osd crush reweight osd.28 1 
> sudo -u ceph ceph osd crush reweight osd.29 1 
> 
> And then after recovery
> 
> sudo -u ceph ceph osd crush reweight osd.23 2

I'm not sure if you're asking for the most *efficient* way to add capacity, or 
the least *impactful*.

The most *efficient* would be to have the new OSDs start out at their full 
CRUSH weight.  This way, data only has to move once.  However, the overhead of 
that much movement can be quite significant, especially if I correctly read 
that you're expanding the size of the cluster by 33%.

What I prefer to do (on replicated clusters) is to use this script:

https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight

I set the CRUSH weights to 0, then run the script like:

ceph-gentle-reweight -o  -b 10 -d 0.01 -t 3.48169 -i 10 -r | tee 
-a /var/tmp/upweight.log
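
Zeroing the initial CRUSH weights beforehand is just a one-liner; a sketch
using the osd.23-29 IDs from your example:

for i in $(seq 23 29); do ceph osd crush reweight osd.$i 0; done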

Note that I disable measure_latency() out of paranoia.  This is less 
*efficient* in that some data ends up being moved more than once, and the 
elapsed time to complete is longer, but has the advantage of less impact.  It 
also allows one to quickly stop data movement if a drive/HBA/server/network 
issue causes difficulties.  Small steps mean that each one completes quickly.

I also set

osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
osd_recovery_max_single_start = 1
osd_scrub_during_recovery = false

to additionally limit the impact of data movement on client operations.
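
If you want to apply those at runtime rather than via ceph.conf, something
like this works (a sketch; injectargs may warn that a change "may require
restart", but these recovery settings are generally picked up immediately):

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1 --osd_recovery_max_single_start 1'
ceph tell osd.* injectargs '--osd_scrub_during_recovery false'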

YMMV. 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Day at University of Santa Cruz - September 19

2018-09-11 Thread Mike Perez
Hey all,

Just a reminder that Ceph Day at the UCSC Silicon Valley campus is coming up
on September 19. This is a great opportunity to hear about various use cases
around Ceph and have discussions with contributors in the community.

We may also hear a presentation from the university that helped Sage with
the research that started it all, and learn how Ceph enables genomic
research at the campus today!

Registration is up and the schedule is posted:

https://ceph.com/cephdays/ceph-day-silicon-valley-university-santa-cruz-silicon-valley-campus/

--
Mike Perez (thingee)


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Day in Berlin - November 12 - CFP now open

2018-09-11 Thread Mike Perez
Hey all,

The call for presentations is now open for Ceph Day in Berlin November 12!

https://ceph.formstack.com/forms/ceph_day_berlin_cfp

The deadline is October 5th, 11:59 UTC.

Registration and other information for the event:

https://ceph.com/cephdays/ceph-day-berlin/

We are happy to be working with the OpenStack Foundation to provide a Ceph
Day at the same venue where the OpenStack Summit will be taking place. Note
that you do need to purchase a Ceph Day pass even if you have a full-access
pass for the OpenStack Summit.

-- 
Mike Perez (thingee)


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore DB size and onode count

2018-09-11 Thread Igor Fedotov



On 9/10/2018 11:39 PM, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
Nelson
Sent: 10 September 2018 18:27
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore DB size and onode count

On 09/10/2018 12:22 PM, Igor Fedotov wrote:


Hi Nick.


On 9/10/2018 1:30 PM, Nick Fisk wrote:

If anybody has 5 minutes, could they just clarify a couple of things
for me?

1. onode count, should this be equal to the number of objects stored
on the OSD?
Through reading several posts, there seems to be a general indication
that this is the case, but looking at my OSDs the maths don't
work.

onode_count is the number of onodes in the cache, not the total number
of onodes at an OSD.
Hence the difference...

Ok, thanks, that makes sense. I assume there isn't actually a counter which 
gives you the total number of objects on an OSD then?
IIRC "bin/ceph daemon osd.1 calc_objectstore_db_histogram" might report 
what you need, see "num_onodes" field in the report..
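
For example (a sketch; run against that OSD's admin socket on the OSD host):

ceph daemon osd.1 calc_objectstore_db_histogram | grep num_onodes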



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] omap vs. xattr in librados

2018-09-11 Thread Patrick Donnelly
On Tue, Sep 11, 2018 at 12:43 PM, Benjamin Cherian
 wrote:
> On Tue, Sep 11, 2018 at 10:44 AM Gregory Farnum  wrote:
>>
>> 
>> In general, if the key-value storage is of unpredictable or non-trivial
>> size, you should use omap.
>>
>> At the bottom layer where the data is actually stored, they're likely to
>> be in the same places (if using BlueStore, they are the same — in FileStore,
>> a rados xattr *might* be in the local FS xattrs, or it might not). It is
>> somewhat more likely that something stored in an xattr will get pulled into
>> memory at the same time as the object's internal metadata, but that only
>> happens if it's quite small (think the xfs or ext4 xattr rules).
>
>
> Based on this description, if I'm planning on using Bluestore, there is no
> particular reason to ever prefer using xattrs over omap (outside of ease of
> use in the API), correct?

You may prefer xattrs on bluestore if the metadata is small and you
may need to store it on an EC pool; omap is not supported on EC pools.
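
You can see the difference from the command line; a quick sketch, assuming a
hypothetical EC pool named "ecpool":

rados -p ecpool create testobj
rados -p ecpool setxattr testobj myattr myvalue     # xattrs work on an EC pool
rados -p ecpool setomapval testobj mykey myvalue    # expected to fail: omap is not supported on EC pools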

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] omap vs. xattr in librados

2018-09-11 Thread Gregory Farnum
Yeah that’s basically the case. RADOS support for xattrs significantly
precedes the introduction of the omap concept and is perfectly acceptable
to use but also kind of vestigial at this point. :)
On Tue, Sep 11, 2018 at 12:43 PM Benjamin Cherian <
benjamin.cher...@gmail.com> wrote:

> On Tue, Sep 11, 2018 at 10:44 AM Gregory Farnum 
> wrote:
>
>> 
>> In general, if the key-value storage is of unpredictable or non-trivial
>> size, you should use omap.
>>
>> At the bottom layer where the data is actually stored, they're likely to
>> be in the same places (if using BlueStore, they are the same — in
>> FileStore, a rados xattr *might* be in the local FS xattrs, or it might
>> not). It is somewhat more likely that something stored in an xattr will get
>> pulled into memory at the same time as the object's internal metadata, but
>> that only happens if it's quite small (think the xfs or ext4 xattr rules).
>>
>
> Based on this description, if I'm planning on using Bluestore, there is no
> particular reason to ever prefer using xattrs over omap (outside of ease of
> use in the API), correct?
>
> Thanks,
> Ben
>
> On Tue, Sep 11, 2018 at 10:44 AM Gregory Farnum 
> wrote:
>
>> On Tue, Sep 11, 2018 at 7:48 AM Benjamin Cherian <
>> benjamin.cher...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm interested in writing a relatively simple application that would use
>>> librados for storage. Are there recommendations for when to use the omap as
>>> opposed to an xattr? In theory, you could use either a set of xattrs or an
>>> omap as a kv store associated with a specific object. Are there
>>> recommendations for what kind of data xattrs and omaps are intended to
>>> store?
>>>
>>
>> In general, if the key-value storage is of unpredictable or non-trivial
>> size, you should use omap.
>>
>> At the bottom layer where the data is actually stored, they're likely to
>> be in the same places (if using BlueStore, they are the same — in
>> FileStore, a rados xattr *might* be in the local FS xattrs, or it might
>> not). It is somewhat more likely that something stored in an xattr will get
>> pulled into memory at the same time as the object's internal metadata, but
>> that only happens if it's quite small (think the xfs or ext4 xattr rules).
>>
>> If you have 250KB of key-value data, omap is definitely the place for it.
>> -Greg
>>
>>
>>> Just for background, I have some metadata i'd like to associate with
>>> each object (total size of all kv pairs in object metadata is ~250k, some
>>> values a few bytes, while others are 10-20k.) The object will store actual
>>> data (a relatively large FP array) as a binary blob (~3-5 MB).
>>>
>>>
>>> Thanks,
>>> Ben
>>> --
>>> Regards,
>>>
>>> Benjamin Cherian
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] omap vs. xattr in librados

2018-09-11 Thread Benjamin Cherian
On Tue, Sep 11, 2018 at 10:44 AM Gregory Farnum  wrote:

> 
> In general, if the key-value storage is of unpredictable or non-trivial
> size, you should use omap.
>
> At the bottom layer where the data is actually stored, they're likely to
> be in the same places (if using BlueStore, they are the same — in
> FileStore, a rados xattr *might* be in the local FS xattrs, or it might
> not). It is somewhat more likely that something stored in an xattr will get
> pulled into memory at the same time as the object's internal metadata, but
> that only happens if it's quite small (think the xfs or ext4 xattr rules).
>

Based on this description, if I'm planning on using Bluestore, there is no
particular reason to ever prefer using xattrs over omap (outside of ease of
use in the API), correct?

Thanks,
Ben

On Tue, Sep 11, 2018 at 10:44 AM Gregory Farnum  wrote:

> On Tue, Sep 11, 2018 at 7:48 AM Benjamin Cherian <
> benjamin.cher...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm interested in writing a relatively simple application that would use
>> librados for storage. Are there recommendations for when to use the omap as
>> opposed to an xattr? In theory, you could use either a set of xattrs or an
>> omap as a kv store associated with a specific object. Are there
>> recommendations for what kind of data xattrs and omaps are intended to
>> store?
>>
>
> In general, if the key-value storage is of unpredictable or non-trivial
> size, you should use omap.
>
> At the bottom layer where the data is actually stored, they're likely to
> be in the same places (if using BlueStore, they are the same — in
> FileStore, a rados xattr *might* be in the local FS xattrs, or it might
> not). It is somewhat more likely that something stored in an xattr will get
> pulled into memory at the same time as the object's internal metadata, but
> that only happens if it's quite small (think the xfs or ext4 xattr rules).
>
> If you have 250KB of key-value data, omap is definitely the place for it.
> -Greg
>
>
>> Just for background, I have some metadata i'd like to associate with each
>> object (total size of all kv pairs in object metadata is ~250k, some values
>> a few bytes, while others are 10-20k.) The object will store actual data (a
>> relatively large FP array) as a binary blob (~3-5 MB).
>>
>>
>> Thanks,
>> Ben
>> --
>> Regards,
>>
>> Benjamin Cherian
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS does not always failover to hot standby

2018-09-11 Thread Gregory Farnum
The monitors can't replace an MDS if they're trying to agree amongst
themselves. Note the repeated monitor elections happening at the same time.

A monitor election is *often* transparent to clients, since they and the
OSDs only need the monitors when something happens in the cluster. But when
losing an MDS coincides with losing monitors, the clients can't complete
their MDS requests, and the monitors can't do a fast failover of the MDS
because they are busy trying to establish a quorum.

(There are also the time sync warnings going on; those may or may not be
causing issues here, but they certainly aren't helping anything!)
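
For what it's worth, both can be checked from the cluster side; a sketch,
where <id> is a placeholder for your mon name and the first command runs on
that monitor's host:

ceph daemon mon.<id> config get mds_beacon_grace
ceph time-sync-status    # shows clock skew between the monitors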
-Greg

On Fri, Sep 7, 2018 at 7:24 PM Bryan Henderson 
wrote:

> > It's mds_beacon_grace.  Set that on the monitor to control the
> > replacement of laggy MDS daemons,
>
> Sounds like William's issue is something else.  William shuts down MDS 2
> and MON 4 simultaneously.  The log shows that some time later (we don't
> know how long), MON 3 detects that MDS 2 is gone ("MDS_ALL_DOWN"), but
> does nothing about it until 30 seconds later, which happens to be when
> MDS 2 and MON 4 come back.  At that point, MON 3 reports that the rank
> has been reassigned to MDS 1.
>
> 'mds_beacon_grace' determines when a monitor declares MDS_ALL_DOWN, right?
>
> I think if things are working as designed, the log should show MON 3
> reassigning the rank to MDS 1 immediately after it reports MDS 2 is gone.
>
>
> From the original post:
>
> 2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 55 : cluster [ERR] Health check failed: 1 filesystem is offline
> (MDS_ALL_DOWN)
> 2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0
> 226 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
> 2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 56 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
> 2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 57 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons
> dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
> 2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 62 : cluster [WRN] Health check failed: 1/3 mons down, quorum
> dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
> 2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 63 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down;
> 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
> 2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 64 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs
> inactive, 115 pgs peering (PG_AVAILABILITY)
> 2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 66 : cluster [WRN] Health check failed: Degraded data redundancy: 712/2504
> objects degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
> 2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 67 : cluster [WRN] Health check update: Reduced data availability: 1 pg
> inactive, 69 pgs peering (PG_AVAILABILITY)
> 2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 68 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
> availability: 1 pg inactive, 69 pgs peering)
> 2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 69 : cluster [WRN] Health check update: Degraded data redundancy: 1286/2572
> objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
> 2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 71 : cluster [WRN] Health check update: Degraded data redundancy: 1292/2584
> objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
> 2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0
> 1 : cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
> 2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0
> 2 : cluster [WRN] message from mon.0 was stamped 0.817433s in the future,
> clocks not synchronized
> 2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 72 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
> 2018-08-25 03:30:32.175864 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0
> 227 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
> 2018-08-25 03:30:32.180615 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 73 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons
> dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05 in quorum (ranks 0,1,2)
> 2018-08-25 03:30:32.189593 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 78 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down,
> quorum dub-sitv-ceph-03,dub-sitv-ceph-05)
> 2018-08-25 03:30:32.190820 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 79 : cluster [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max
> 0.05s
> 2018-08-25 03:30:32.194280 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 80 : cl

Re: [ceph-users] omap vs. xattr in librados

2018-09-11 Thread Gregory Farnum
On Tue, Sep 11, 2018 at 7:48 AM Benjamin Cherian 
wrote:

> Hi,
>
> I'm interested in writing a relatively simple application that would use
> librados for storage. Are there recommendations for when to use the omap as
> opposed to an xattr? In theory, you could use either a set of xattrs or an
> omap as a kv store associated with a specific object. Are there
> recommendations for what kind of data xattrs and omaps are intended to
> store?
>

In general, if the key-value storage is of unpredictable or non-trivial
size, you should use omap.

At the bottom layer where the data is actually stored, they're likely to be
in the same places (if using BlueStore, they are the same — in FileStore, a
rados xattr *might* be in the local FS xattrs, or it might not). It is
somewhat more likely that something stored in an xattr will get pulled into
memory at the same time as the object's internal metadata, but that only
happens if it's quite small (think the xfs or ext4 xattr rules).

If you have 250KB of key-value data, omap is definitely the place for it.
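
If you want to poke at both interfaces before writing librados code, the
rados CLI exposes them; a sketch, assuming a hypothetical pool named "mypool":

rados -p mypool create myobject
rados -p mypool setxattr myobject small_attr somevalue        # small, fixed-size metadata
rados -p mypool setomapval myobject big_key some_large_value  # larger or variable kv data
rados -p mypool listxattr myobject
rados -p mypool listomapkeys myobject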
-Greg


> Just for background, I have some metadata i'd like to associate with each
> object (total size of all kv pairs in object metadata is ~250k, some values
> a few bytes, while others are 10-20k.) The object will store actual data (a
> relatively large FP array) as a binary blob (~3-5 MB).
>
>
> Thanks,
> Ben
> --
> Regards,
>
> Benjamin Cherian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph balancer "Error EAGAIN: compat weight-set not available"

2018-09-11 Thread David Turner
ceph balancer status
ceph config-key dump | grep balancer
ceph osd dump | grep min_compat_client
ceph osd crush dump | grep straw
ceph osd crush dump | grep profile
ceph features

You didn't mention it, but based on your error and my experiences over the
last week getting the balancer working, you're trying to use crush-compat.
Running all of those commands should give you the information you need to
fix everything up for the balancer to work.  With the first 2, you need to
make sure that you have your mode set properly as well as double check any
other settings you're going for with the balancer.  Everything else stems
from the requirement that your buckets be straw2 instead of straw for the
balancer to work.  I'm sure you'll notice that your cluster has compatibility
requirements and a crush profile older than hammer, and that your buckets
are using the straw algorithm instead of straw2.

Running these commands [1] will fix up your cluster so that you are using
straw2, with your minimum required client version and crush profile set to
hammer, which is the ceph release that introduced straw2.  Before running these
commands make sure that the output of `ceph features` does not show any
firefly clients connected to your cluster.  If you do have any, it is
likely due to outdated kernels or clients installed without the upstream
ceph repo and just using the version of ceph in the canonical repos or
similar for your distribution.  If you do happen to have any firefly, or
older, clients connected to your cluster, then you need to update those
clients before running the commands.

There will be some data movement, but I didn't see more than ~5% data
movement on any of the 8 clusters I ran them on.  That data movement will
be higher if you do not have a uniform OSD drive size in your cluster; a mix
of some 2TB disks and some 8TB disks across your cluster will probably cause
more data movement than I saw, but it should still be within
reason.  This data movement is because straw2 can handle that situation
better than straw did and will allow your cluster to better balance itself
even without the balancer module.

If you don't have any hammer clients either, then go ahead and set the
min-compat-client to jewel as well as the crush tunables to jewel.  Setting
them to jewel will cause a bit more data movement, but again for good
reasons.

The tl;dr of your error is that your cluster has been running since at
least hammer, which started with older default settings than the balancer
module requires.  As you've updated your cluster, you haven't allowed it to
utilize new backend features because your crush tunables were left alone
during all of the upgrades to new versions.  To learn more about the
changes to the crush tunables, you can check out the ceph documentation [2].

[1]
ceph osd set-require-min-compat-client hammer
ceph osd crush set-all-straw-buckets-to-straw2
ceph osd crush tunables hammer

[2] http://docs.ceph.com/docs/master/rados/operations/crush-map/
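
Once that's done, turning the balancer on in crush-compat mode looks roughly
like this (a sketch):

ceph mgr module enable balancer
ceph balancer mode crush-compat
ceph balancer on
ceph balancer status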

On Tue, Sep 11, 2018 at 6:24 AM Marc Roos  wrote:

>
> I am new to using the balancer. I think this should generate a plan,
> no? I do not get what this error is about.
>
>
> [@c01 ~]# ceph balancer optimize balancer-test.plan
> Error EAGAIN: compat weight-set not available
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] omap vs. xattr in librados

2018-09-11 Thread Benjamin Cherian
Hi,

I'm interested in writing a relatively simple application that would use
librados for storage. Are there recommendations for when to use the omap as
opposed to an xattr? In theory, you could use either a set of xattrs or an
omap as a kv store associated with a specific object. Are there
recommendations for what kind of data xattrs and omaps are intended to
store?

Just for background, I have some metadata i'd like to associate with each
object (total size of all kv pairs in object metadata is ~250k, some values
a few bytes, while others are 10-20k.) The object will store actual data (a
relatively large FP array) as a binary blob (~3-5 MB).


Thanks,
Ben
-- 
Regards,

Benjamin Cherian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds_cache_memory_limit

2018-09-11 Thread Dan van der Ster
We set it to 50% -- there seems to be some mystery inflation and
possibly a small leak (in luminous, at least).
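
For example, on a 64 GB MDS host that would be roughly the following in
ceph.conf (a sketch; the value is in bytes):

[mds]
mds_cache_memory_limit = 34359738368    # ~32 GB, about 50% of RAM

or at runtime, where <name> is a placeholder for your MDS daemon name:

ceph tell mds.<name> injectargs '--mds_cache_memory_limit 34359738368'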

-- dan

On Tue, Sep 11, 2018 at 4:04 PM marc-antoine desrochers
 wrote:
>
> Hi,
>
>
>
> Is there any recommendation for the mds_cache_memory_limit ? Like a % of the 
> total ram or something ?
>
>
>
> Thanks.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mds_cache_memory_limit

2018-09-11 Thread marc-antoine desrochers
Hi,

 

Is there any recommendation for the mds_cache_memory_limit ? Like a % of the
total ram or something ?

 

Thanks.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Get supported features of all connected clients

2018-09-11 Thread Ilya Dryomov
On Tue, Sep 11, 2018 at 1:00 PM Tobias Florek  wrote:
>
> Hi!
>
> I have a cluster serving RBDs and CephFS that has a big number of
> clients I don't control.  I want to know what feature flags I can safely
> set without locking out clients.  Is there a command analogous to `ceph
> versions` that shows the connected clients and their feature support?

Yes, "ceph features".

https://ceph.com/community/new-luminous-upgrade-complete/

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Get supported features of all connected clients

2018-09-11 Thread Hervé Ballans

$ sudo ceph tell mds.0 client ls  ?

Le 11/09/2018 à 13:00, Tobias Florek a écrit :

Hi!

I have a cluster serving RBDs and CephFS that has a big number of
clients I don't control.  I want to know what feature flags I can safely
set without locking out clients.  Is there a command analogous to `ceph
versions` that shows the connected clients and their feature support?

If so, I would simply monitor this output for a few days and would have
a pretty good estimate on how Ceph is used.

If there is no such command, can I get that information from the mon, OSD,
or MDS logs?

Cheers,
  Tobias Florek


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Get supported features of all connected clients

2018-09-11 Thread Tobias Florek
Hi!

Answering myself: there is `ceph features`, but I am missing
something.

I have this (on my Mimic-only test cluster):

 # ceph features
 {
"mon": [
{
"features": "0x3ffddff8ffa4fffb",
"release": "luminous",
"num": 3
}
],
"mds": [
{
"features": "0x3ffddff8ffa4fffb",
"release": "luminous",
"num": 3
}
],
"osd": [
{
"features": "0x3ffddff8ffa4fffb",
"release": "luminous",
"num": 10
}
],
"client": [
{
"features": "0x7018fb86aa42ada",
"release": "jewel",
"num": 21
},
{
"features": "0x3ffddff8ffa4fffb",
"release": "luminous",
"num": 4
}
],
"mgr": [
{
"features": "0x3ffddff8ffa4fffb",
"release": "luminous",
"num": 3
}
]
 }

This does not make sense to me.  Is there no Mimic client connected even
though I have e.g. an active Mimic ceph-fuse mount?

Cheers,
 Tobias Florek


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Get supported features of all connected clients

2018-09-11 Thread Tobias Florek
Hi!

I have a cluster serving RBDs and CephFS that has a big number of
clients I don't control.  I want to know what feature flags I can safely
set without locking out clients.  Is there a command analogous to `ceph
versions` that shows the connected clients and their feature support?

If so, I would simply monitor this output for a few days and would have
a pretty good estimate on how Ceph is used.

If there is no such command, can I get that information from the mon, OSD,
or MDS logs?

Cheers,
 Tobias Florek


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph balancer "Error EAGAIN: compat weight-set not available"

2018-09-11 Thread Marc Roos


I am new to using the balancer. I think this should generate a plan,
no? I do not get what this error is about.


[@c01 ~]# ceph balancer optimize balancer-test.plan
Error EAGAIN: compat weight-set not available
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.8 Luminous released

2018-09-11 Thread Dan van der Ster
This is a friendly reminder that multi-active MDS clusters must be
reduced to only 1 active during upgrades [1].

In the case of v12.2.8, the CEPH_MDS_PROTOCOL version has changed, so if
you try to upgrade one MDS while others are still active, it will get stuck
in the resolve state, logging:

conn(0x55e3d9671000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH
pgs=0 cs=0 l=0).handle_connect_reply connect protocol version
mismatch, my 31 != 30
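
In practice, before restarting any MDS, that means something like the
following (a sketch of the procedure in [1]; <fs_name> is a placeholder for
your filesystem name):

ceph fs set <fs_name> max_mds 1
ceph mds deactivate <fs_name>:1    # repeat for each rank other than 0
ceph status                        # wait until only one MDS is active, then upgrade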

Cheers, Dan

[1] http://docs.ceph.com/docs/luminous/cephfs/upgrading/

On Wed, Sep 5, 2018 at 4:20 PM Dan van der Ster  wrote:
>
> Thanks for the release!
>
> We've updated some test clusters (rbd, cephfs) and it looks good so far.
>
> -- dan
>
>
> On Tue, Sep 4, 2018 at 6:30 PM Abhishek Lekshmanan  wrote:
> >
> >
> > We're glad to announce the next point release in the Luminous v12.2.X
> > stable release series. This release contains a range of bugfixes and
> > stability improvements across all the components of ceph. For detailed
> > release notes with links to tracker issues and pull requests, refer to
> > the blog post at http://ceph.com/releases/v12-2-8-released/
> >
> > Upgrade Notes from previous luminous releases
> > -
> >
> > When upgrading from v12.2.5 or v12.2.6 please note that upgrade caveats from
> > 12.2.5 will apply to any _newer_ luminous version including 12.2.8. Please 
> > read
> > the notes at 
> > https://ceph.com/releases/12-2-7-luminous-released/#upgrading-from-v12-2-6
> >
> > For the cluster that installed the broken 12.2.6 release, 12.2.7 fixed the
> > regression and introduced a workaround option `osd distrust data digest = 
> > true`,
> > but 12.2.7 clusters still generated health warnings like ::
> >
> >   [ERR] 11.288 shard 207: soid
> >   11:1155c332:::rbd_data.207dce238e1f29.0527:head data_digest
> >   0xc8997a5b != data_digest 0x2ca15853
> >
> >
> > 12.2.8 improves the deep scrub code to automatically repair these
> > inconsistencies. Once the entire cluster has been upgraded and then fully 
> > deep
> > scrubbed, and all such inconsistencies are resolved; it will be safe to 
> > disable
> > the `osd distrust data digest = true` workaround option.
> >
> > Changelog
> > -
> > * bluestore: set correctly shard for existed Collection (issue#24761, 
> > pr#22860, Jianpeng Ma)
> > * build/ops: Boost system library is no longer required to compile and link 
> > example librados program (issue#25054, pr#23202, Nathan Cutler)
> > * build/ops: Bring back diff -y for non-FreeBSD (issue#24396, issue#21664, 
> > pr#22848, Sage Weil, David Zafman)
> > * build/ops: install-deps.sh fails on newest openSUSE Leap (issue#25064, 
> > pr#23179, Kyr Shatskyy)
> > * build/ops: Mimic build fails with -DWITH_RADOSGW=0 (issue#24437, 
> > pr#22864, Dan Mick)
> > * build/ops: order rbdmap.service before remote-fs-pre.target (issue#24713, 
> > pr#22844, Ilya Dryomov)
> > * build/ops: rpm: silence osd block chown (issue#25152, pr#23313, Dan van 
> > der Ster)
> > * cephfs-journal-tool: Fix purging when importing an zero-length journal 
> > (issue#24239, pr#22980, yupeng chen, zhongyan gu)
> > * cephfs: MDSMonitor: uncommitted state exposed to clients/mdss 
> > (issue#23768, pr#23013, Patrick Donnelly)
> > * ceph-fuse mount failed because no mds (issue#22205, pr#22895, liyan)
> > * ceph-volume add a __release__ string, to help version-conditional calls 
> > (issue#25170, pr#23331, Alfredo Deza)
> > * ceph-volume: adds test for `ceph-volume lvm list /dev/sda` (issue#24784, 
> > issue#24957, pr#23350, Andrew Schoen)
> > * ceph-volume: do not use stdin in luminous (issue#25173, issue#23260, 
> > pr#23367, Alfredo Deza)
> > * ceph-volume enable the ceph-osd during lvm activation (issue#24152, 
> > pr#23394, Dan van der Ster, Alfredo Deza)
> > * ceph-volume expand on the LVM API to create multiple LVs at different 
> > sizes (issue#24020, pr#23395, Alfredo Deza)
> > * ceph-volume lvm.activate conditional mon-config on prime-osd-dir 
> > (issue#25216, pr#23397, Alfredo Deza)
> > * ceph-volume lvm.batch remove non-existent sys_api property (issue#34310, 
> > pr#23811, Alfredo Deza)
> > * ceph-volume lvm.listing only include devices if they exist (issue#24952, 
> > pr#23150, Alfredo Deza)
> > * ceph-volume: process.call with stdin in Python 3 fix (issue#24993, 
> > pr#23238, Alfredo Deza)
> > * ceph-volume: PVolumes.get() should return one PV when using name or uuid 
> > (issue#24784, pr#23329, Andrew Schoen)
> > * ceph-volume: refuse to zap mapper devices (issue#24504, pr#23374, Andrew 
> > Schoen)
> > * ceph-volume: tests.functional inherit SSH_ARGS from ansible (issue#34311, 
> > pr#23813, Alfredo Deza)
> > * ceph-volume tests/functional run lvm list after OSD provisioning 
> > (issue#24961, pr#23147, Alfredo Deza)
> > * ceph-volume: unmount lvs correctly before zapping (issue#24796, pr#23128, 
> > Andrew Schoen)
> > * ceph-volume: update batch documentation to explain filestore strategies 
> > (issue#34309, pr#23825, Alfredo Deza)
> > * c