Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-18 Thread Brett Chancellor
For me, it was the .rgw.meta pool that had very dense placement groups. The
OSDs would fail to start and would then commit suicide while trying to scan
the PGs. We had to remove all references to those placement groups just to
get the OSDs to start. It wasn't pretty.
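
For anyone in the same spot, the general shape of that removal, per OSD, is
something like the following (pg id and data path are illustrative, and
--op remove discards that PG's copy on the OSD, so it's only safe when the
data is recoverable elsewhere):

  systemctl stop ceph-osd@45
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-45 --op list-pgs
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-45 --pgid 17.2a --op remove --force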


On Mon, Aug 19, 2019, 2:09 AM Troy Ablan  wrote:

> Yes, it's possible that they do, but since all of the affected OSDs are
> still down and the monitors have been restarted since, all of those
> pools have pgs that are in unknown state and don't return anything in
> ceph pg ls.
>
> There weren't that many placement groups for the SSDs, but also I don't
> know that there were that many objects.  There were of course a ton of
> omap key/values.
>
> -Troy
>
> On 8/18/19 10:57 PM, Brett Chancellor wrote:
> > This sounds familiar. Do any of these pools on the SSD have fairly dense
> > placement group to object ratios? Like more than 500k objects per pg?
> > (ceph pg ls)
> >
>


Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-18 Thread Brett Chancellor
This sounds familiar. Do any of these pools on the SSD have fairly dense
placement group to object ratios? Like more than 500k objects per pg? (ceph
pg ls)
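
One quick way to eyeball that (pool name is just an example, and column
order differs a bit between releases, so treat the awk as a sketch):

  ceph pg ls-by-pool .rgw.meta | awk '{print $1, $2}' | sort -nk2 | tail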

On Sun, Aug 18, 2019, 10:12 PM Brad Hubbard  wrote:

> On Thu, Aug 15, 2019 at 2:09 AM Troy Ablan  wrote:
> >
> > Paul,
> >
> > Thanks for the reply.  All of these seemed to fail except for pulling
> > the osdmap from the live cluster.
> >
> > -Troy
> >
> > -[~:#]- ceph-objectstore-tool --op get-osdmap --data-path
> > /var/lib/ceph/osd/ceph-45/ --file osdmap45
> > terminate called after throwing an instance of
> > 'ceph::buffer::malformed_input'
> >what():  buffer::malformed_input: unsupported bucket algorithm: -1
> > *** Caught signal (Aborted) **
> >   in thread 7f945ee04f00 thread_name:ceph-objectstor
> >   ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
> > (stable)
> >   1: (()+0xf5d0) [0x7f94531935d0]
> >   2: (gsignal()+0x37) [0x7f9451d80207]
> >   3: (abort()+0x148) [0x7f9451d818f8]
> >   4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f945268f7d5]
> >   5: (()+0x5e746) [0x7f945268d746]
> >   6: (()+0x5e773) [0x7f945268d773]
> >   7: (__cxa_rethrow()+0x49) [0x7f945268d9e9]
> >   8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8)
> > [0x7f94553218d8]
> >   9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad)
> [0x7f94550ff4ad]
> >   10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f9455101db1]
> >   11: (get_osdmap(ObjectStore*, unsigned int, OSDMap&,
> > ceph::buffer::list&)+0x1d0) [0x55de1f9a6e60]
> >   12: (main()+0x5340) [0x55de1f8c8870]
> >   13: (__libc_start_main()+0xf5) [0x7f9451d6c3d5]
> >   14: (()+0x3adc10) [0x55de1f9a1c10]
> > Aborted
> >
> > -[~:#]- ceph-objectstore-tool --op get-osdmap --data-path
> > /var/lib/ceph/osd/ceph-46/ --file osdmap46
> > terminate called after throwing an instance of
> > 'ceph::buffer::malformed_input'
> >what():  buffer::malformed_input: unsupported bucket algorithm: -1
> > *** Caught signal (Aborted) **
> >   in thread 7f9ce4135f00 thread_name:ceph-objectstor
> >   ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
> > (stable)
> >   1: (()+0xf5d0) [0x7f9cd84c45d0]
> >   2: (gsignal()+0x37) [0x7f9cd70b1207]
> >   3: (abort()+0x148) [0x7f9cd70b28f8]
> >   4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f9cd79c07d5]
> >   5: (()+0x5e746) [0x7f9cd79be746]
> >   6: (()+0x5e773) [0x7f9cd79be773]
> >   7: (__cxa_rethrow()+0x49) [0x7f9cd79be9e9]
> >   8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8)
> > [0x7f9cda6528d8]
> >   9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad)
> [0x7f9cda4304ad]
> >   10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f9cda432db1]
> >   11: (get_osdmap(ObjectStore*, unsigned int, OSDMap&,
> > ceph::buffer::list&)+0x1d0) [0x55cea26c8e60]
> >   12: (main()+0x5340) [0x55cea25ea870]
> >   13: (__libc_start_main()+0xf5) [0x7f9cd709d3d5]
> >   14: (()+0x3adc10) [0x55cea26c3c10]
> > Aborted
> >
> > -[~:#]- ceph osd getmap -o osdmap
> > got osdmap epoch 81298
> >
> > -[~:#]- ceph-objectstore-tool --op set-osdmap --data-path
> > /var/lib/ceph/osd/ceph-46/ --file osdmap
> > osdmap (#-1:92f679f2:::osdmap.81298:0#) does not exist.
> >
> > -[~:#]- ceph-objectstore-tool --op set-osdmap --data-path
> > /var/lib/ceph/osd/ceph-45/ --file osdmap
> > osdmap (#-1:92f679f2:::osdmap.81298:0#) does not exist.
>
> 819   auto ch = store->open_collection(coll_t::meta());
>  820   const ghobject_t full_oid = OSD::get_osdmap_pobject_name(e);
>  821   if (!store->exists(ch, full_oid)) {
>  822 cerr << "osdmap (" << full_oid << ") does not exist." <<
> std::endl;
>  823 if (!force) {
>  824   return -ENOENT;
>  825 }
>  826 cout << "Creating a new epoch." << std::endl;
>  827   }
>
> Adding "--force"should get you past that error.
>
> >
> >
> >
> > On 8/14/19 2:54 AM, Paul Emmerich wrote:
> > > Starting point to debug/fix this would be to extract the osdmap from
> > > one of the dead OSDs:
> > >
> > > ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/...
> > >
> > > Then try to run osdmaptool on that osdmap to see if it also crashes,
> > > set some --debug options (don't know which one off the top of my
> > > head).
> > > Does it also crash? How does it differ from the map retrieved with
> > > "ceph osd getmap"?
> > >
> > > You can also set the osdmap with "--op set-osdmap", does it help to
> > > set the osdmap retrieved by "ceph osd getmap"?
> > >
> > > Paul
> > >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Re: [ceph-users] Is the admin burden avoidable? "1 pg inconsistent" every other day?

2019-08-04 Thread Brett Chancellor
If all you want to do is repair the pg when it finds an inconsistent pg,
you could set osd_scrub_auto_repair to true.
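
On a Mimic or Nautilus cluster with centralized config that would be
something like the line below; on older releases the same option goes in
ceph.conf under [osd] and needs an OSD restart:

  ceph config set osd osd_scrub_auto_repair true

Repairs triggered this way should still show up in the cluster log, so you
keep a record of which OSDs keep throwing scrub errors.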

On Sun, Aug 4, 2019, 9:16 AM Harry G. Coin  wrote:

> Question:  If you have enough OSDs, it seems an almost daily thing: you
> get to work in the morning and there's a "ceph health error", "1 pg
> inconsistent", arising from a 'scrub error'.  Or 2, etc.  Then, like
> most such mornings, you look and see there are two or more valid
> instances of the pg and one with an issue.  So, like putting on socks,
> it just takes time every day: run 'ceph pg repair xx' (making note of
> the likely soon-to-fail osd), and hey presto, on with the day.
>
> Am I missing some way to automate this, so I'm notified only when an
> attempt at pg repair has failed, with just a log entry for successful
> repairs?  I don't need calls about dashboard "HEALTH ERR" warnings this
> often.
>
> Ideas welcome!
>
> Thanks
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Large OMAP Objects in zone.rgw.log pool

2019-07-31 Thread Brett Chancellor
I was able to answer my own question. For future interested parties, I
initiated a deep scrub on the placement group, which cleared the error.
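
For completeness, that was just the standard command; the pg id below is an
example, the one to scrub is whichever pg the 'Large omap object found'
line in the cluster log points at:

  ceph pg deep-scrub 51.16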

On Tue, Jul 30, 2019 at 1:48 PM Brett Chancellor 
wrote:

> I was able to remove the meta objects, but the cluster is still in WARN
> state
> HEALTH_WARN 1 large omap objects
> LARGE_OMAP_OBJECTS 1 large omap objects
> 1 large objects found in pool 'us-prd-1.rgw.log'
> Search the cluster log for 'Large omap object found' for more details.
>
> How do I go about clearing it out? I don't see any other references to
> large omap in any of the logs.  I've tried restarting the mgrs, the
> monitors, and even the osd that reported the issue.
>
> -Brett
>
> On Thu, Jul 25, 2019 at 2:55 PM Brett Chancellor <
> bchancel...@salesforce.com> wrote:
>
>> 14.2.1
>> Thanks, I'll try that.
>>
>> On Thu, Jul 25, 2019 at 2:54 PM Casey Bodley  wrote:
>>
>>> What ceph version is this cluster running? Luminous or later should not
>>> be writing any new meta.log entries when it detects a single-zone
>>> configuration.
>>>
>>> I'd recommend editing your zonegroup configuration (via 'radosgw-admin
>>> zonegroup get' and 'put') to set both log_meta and log_data to false,
>>> then commit the change with 'radosgw-admin period update --commit'.
>>>
>>> You can then delete any meta.log.* and data_log.* objects from your log
>>> pool using the rados tool.
>>>
>>> On 7/25/19 2:30 PM, Brett Chancellor wrote:
>>> > Casey,
>>> >   These clusters were setup with the intention of one day doing multi
>>> > site replication. That has never happened. The cluster has a single
>>> > realm, which contains a single zonegroup, and that zonegroup contains
>>> > a single zone.
>>> >
>>> > -Brett
>>> >
>>> > On Thu, Jul 25, 2019 at 2:16 PM Casey Bodley >> > <mailto:cbod...@redhat.com>> wrote:
>>> >
>>> > Hi Brett,
>>> >
>>> > These meta.log objects store the replication logs for metadata
>>> > sync in
>>> > multisite. Log entries are trimmed automatically once all other
>>> zones
>>> > have processed them. Can you verify that all zones in the multisite
>>> > configuration are reachable and syncing? Does 'radosgw-admin sync
>>> > status' on any zone show that it's stuck behind on metadata sync?
>>> > That
>>> > would prevent these logs from being trimmed and result in these
>>> large
>>> > omap warnings.
>>> >
>>> > On 7/25/19 1:59 PM, Brett Chancellor wrote:
>>> > > I'm having an issue similar to
>>> > >
>>> >
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033611.html
>>>  .
>>> >
>>> > > I don't see where any solution was proposed.
>>> > >
>>> > > $ ceph health detail
>>> > > HEALTH_WARN 1 large omap objects
>>> > > LARGE_OMAP_OBJECTS 1 large omap objects
>>> > > 1 large objects found in pool 'us-prd-1.rgw.log'
>>> > > Search the cluster log for 'Large omap object found' for
>>> > more details.
>>> > >
>>> > > $ grep "Large omap object" /var/log/ceph/ceph.log
>>> > > 2019-07-25 14:58:21.758321 osd.3 (osd.3) 15 : cluster [WRN]
>>> > Large omap
>>> > > object found. Object:
>>> > >
>>> 51:61eb35fe:::meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19:head
>>> > > Key count: 3382154 Size (bytes): 611384043
>>> > >
>>> > > $ rados -p us-prd-1.rgw.log listomapkeys
>>> > > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19 |wc -l
>>> > > 3382154
>>> > >
>>> > > $ rados -p us-prd-1.rgw.log listomapvals
>>> > > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19
>>> > > This returns entries from almost every bucket, across multiple
>>> > > tenants. Several of the entries are from buckets that no longer
>>> > exist
>>> > > on the system.
>>> > >
>>> > > $ ceph df |egrep 'OBJECTS|.rgw.log'
>>> > > POOL              ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
>>> > > us-prd-1.rgw.log  51  758 MiB  228      758 MiB  0      102 TiB
>>> > >
>>> > > Thanks,
>>> > >
>>> > > -Brett
>>> > >
>>> > > ___
>>> > > ceph-users mailing list
>>> > > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>>
>>


Re: [ceph-users] Large OMAP Objects in zone.rgw.log pool

2019-07-30 Thread Brett Chancellor
I was able to remove the meta objects, but the cluster is still in WARN
state
HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
1 large objects found in pool 'us-prd-1.rgw.log'
Search the cluster log for 'Large omap object found' for more details.

How do I go about clearing it out? I don't see any other references to
large omap in any of the logs.  I've tried restarting the mgrs, the
monitors, and even the osd that reported the issue.

-Brett

On Thu, Jul 25, 2019 at 2:55 PM Brett Chancellor 
wrote:

> 14.2.1
> Thanks, I'll try that.
>
> On Thu, Jul 25, 2019 at 2:54 PM Casey Bodley  wrote:
>
>> What ceph version is this cluster running? Luminous or later should not
>> be writing any new meta.log entries when it detects a single-zone
>> configuration.
>>
>> I'd recommend editing your zonegroup configuration (via 'radosgw-admin
>> zonegroup get' and 'put') to set both log_meta and log_data to false,
>> then commit the change with 'radosgw-admin period update --commit'.
>>
>> You can then delete any meta.log.* and data_log.* objects from your log
>> pool using the rados tool.
>>
>> On 7/25/19 2:30 PM, Brett Chancellor wrote:
>> > Casey,
>> >   These clusters were setup with the intention of one day doing multi
>> > site replication. That has never happened. The cluster has a single
>> > realm, which contains a single zonegroup, and that zonegroup contains
>> > a single zone.
>> >
>> > -Brett
>> >
>> > On Thu, Jul 25, 2019 at 2:16 PM Casey Bodley > > <mailto:cbod...@redhat.com>> wrote:
>> >
>> > Hi Brett,
>> >
>> > These meta.log objects store the replication logs for metadata
>> > sync in
>> > multisite. Log entries are trimmed automatically once all other
>> zones
>> > have processed them. Can you verify that all zones in the multisite
>> > configuration are reachable and syncing? Does 'radosgw-admin sync
>> > status' on any zone show that it's stuck behind on metadata sync?
>> > That
>> > would prevent these logs from being trimmed and result in these
>> large
>> > omap warnings.
>> >
>> > On 7/25/19 1:59 PM, Brett Chancellor wrote:
>> > > I'm having an issue similar to
>> > >
>> >
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033611.html
>>  .
>> >
>> > > I don't see where any solution was proposed.
>> > >
>> > > $ ceph health detail
>> > > HEALTH_WARN 1 large omap objects
>> > > LARGE_OMAP_OBJECTS 1 large omap objects
>> > > 1 large objects found in pool 'us-prd-1.rgw.log'
>> > > Search the cluster log for 'Large omap object found' for
>> > more details.
>> > >
>> > > $ grep "Large omap object" /var/log/ceph/ceph.log
>> > > 2019-07-25 14:58:21.758321 osd.3 (osd.3) 15 : cluster [WRN]
>> > Large omap
>> > > object found. Object:
>> > >
>> 51:61eb35fe:::meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19:head
>> > > Key count: 3382154 Size (bytes): 611384043
>> > >
>> > > $ rados -p us-prd-1.rgw.log listomapkeys
>> > > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19 |wc -l
>> > > 3382154
>> > >
>> > > $ rados -p us-prd-1.rgw.log listomapvals
>> > > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19
>> > > This returns entries from almost every bucket, across multiple
>> > > tenants. Several of the entries are from buckets that no longer
>> > exist
>> > > on the system.
>> > >
>> > > $ ceph df |egrep 'OBJECTS|.rgw.log'
>> > > POOL              ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
>> > > us-prd-1.rgw.log  51  758 MiB  228      758 MiB  0      102 TiB
>> > >
>> > > Thanks,
>> > >
>> > > -Brett
>> > >
>> > > ___
>> > > ceph-users mailing list
>> > > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>


Re: [ceph-users] Large OMAP Objects in zone.rgw.log pool

2019-07-25 Thread Brett Chancellor
14.2.1
Thanks, I'll try that.

On Thu, Jul 25, 2019 at 2:54 PM Casey Bodley  wrote:

> What ceph version is this cluster running? Luminous or later should not
> be writing any new meta.log entries when it detects a single-zone
> configuration.
>
> I'd recommend editing your zonegroup configuration (via 'radosgw-admin
> zonegroup get' and 'put') to set both log_meta and log_data to false,
> then commit the change with 'radosgw-admin period update --commit'.
>
> You can then delete any meta.log.* and data_log.* objects from your log
> pool using the rados tool.
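>
> Roughly, that sequence is (field names as of Luminous+, so double-check
> the JSON your version emits, and add --rgw-zonegroup=<name> if you have
> more than the default zonegroup):
>
>   radosgw-admin zonegroup get > zonegroup.json
>   # edit zonegroup.json: set "log_meta" and "log_data" to "false"
>   radosgw-admin zonegroup set < zonegroup.json
>   radosgw-admin period update --commit
>   rados -p us-prd-1.rgw.log ls | egrep '^(meta\.log|data_log)' | \
>     while read obj; do rados -p us-prd-1.rgw.log rm "$obj"; done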
>
> On 7/25/19 2:30 PM, Brett Chancellor wrote:
> > Casey,
> >   These clusters were setup with the intention of one day doing multi
> > site replication. That has never happened. The cluster has a single
> > realm, which contains a single zonegroup, and that zonegroup contains
> > a single zone.
> >
> > -Brett
> >
> > On Thu, Jul 25, 2019 at 2:16 PM Casey Bodley  > <mailto:cbod...@redhat.com>> wrote:
> >
> > Hi Brett,
> >
> > These meta.log objects store the replication logs for metadata
> > sync in
> > multisite. Log entries are trimmed automatically once all other zones
> > have processed them. Can you verify that all zones in the multisite
> > configuration are reachable and syncing? Does 'radosgw-admin sync
> > status' on any zone show that it's stuck behind on metadata sync?
> > That
> > would prevent these logs from being trimmed and result in these large
> > omap warnings.
> >
> > On 7/25/19 1:59 PM, Brett Chancellor wrote:
> > > I'm having an issue similar to
> > >
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033611.html
>  .
> >
> > > I don't see where any solution was proposed.
> > >
> > > $ ceph health detail
> > > HEALTH_WARN 1 large omap objects
> > > LARGE_OMAP_OBJECTS 1 large omap objects
> > > 1 large objects found in pool 'us-prd-1.rgw.log'
> > > Search the cluster log for 'Large omap object found' for
> > more details.
> > >
> > > $ grep "Large omap object" /var/log/ceph/ceph.log
> > > 2019-07-25 14:58:21.758321 osd.3 (osd.3) 15 : cluster [WRN]
> > Large omap
> > > object found. Object:
> > > 51:61eb35fe:::meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19:head
> > > Key count: 3382154 Size (bytes): 611384043
> > >
> > > $ rados -p us-prd-1.rgw.log listomapkeys
> > > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19 |wc -l
> > > 3382154
> > >
> > > $ rados -p us-prd-1.rgw.log listomapvals
> > > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19
> > > This returns entries from almost every bucket, across multiple
> > > tenants. Several of the entries are from buckets that no longer
> > exist
> > > on the system.
> > >
> > > $ ceph df |egrep 'OBJECTS|.rgw.log'
> > > POOL              ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
> > > us-prd-1.rgw.log  51  758 MiB  228      758 MiB  0      102 TiB
> > >
> > > Thanks,
> > >
> > > -Brett
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>


Re: [ceph-users] Large OMAP Objects in zone.rgw.log pool

2019-07-25 Thread Brett Chancellor
Casey,
  These clusters were setup with the intention of one day doing multi site
replication. That has never happened. The cluster has a single realm, which
contains a single zonegroup, and that zonegroup contains a single zone.
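
For reference, that topology can be double-checked with the usual listing
commands:

  radosgw-admin realm list
  radosgw-admin zonegroup list
  radosgw-admin zone list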

-Brett

On Thu, Jul 25, 2019 at 2:16 PM Casey Bodley  wrote:

> Hi Brett,
>
> These meta.log objects store the replication logs for metadata sync in
> multisite. Log entries are trimmed automatically once all other zones
> have processed them. Can you verify that all zones in the multisite
> configuration are reachable and syncing? Does 'radosgw-admin sync
> status' on any zone show that it's stuck behind on metadata sync? That
> would prevent these logs from being trimmed and result in these large
> omap warnings.
>
> On 7/25/19 1:59 PM, Brett Chancellor wrote:
> > I'm having an issue similar to
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033611.html .
>
> > I don't see where any solution was proposed.
> >
> > $ ceph health detail
> > HEALTH_WARN 1 large omap objects
> > LARGE_OMAP_OBJECTS 1 large omap objects
> > 1 large objects found in pool 'us-prd-1.rgw.log'
> > Search the cluster log for 'Large omap object found' for more
> details.
> >
> > $ grep "Large omap object" /var/log/ceph/ceph.log
> > 2019-07-25 14:58:21.758321 osd.3 (osd.3) 15 : cluster [WRN] Large omap
> > object found. Object:
> > 51:61eb35fe:::meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19:head
> > Key count: 3382154 Size (bytes): 611384043
> >
> > $ rados -p us-prd-1.rgw.log listomapkeys
> > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19 |wc -l
> > 3382154
> >
> > $ rados -p us-prd-1.rgw.log listomapvals
> > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19
> > This returns entries from almost every bucket, across multiple
> > tenants. Several of the entries are from buckets that no longer exist
> > on the system.
> >
> > $ ceph df |egrep 'OBJECTS|.rgw.log'
> > POOL              ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
> > us-prd-1.rgw.log  51  758 MiB  228      758 MiB  0      102 TiB
> >
> > Thanks,
> >
> > -Brett
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


[ceph-users] Large OMAP Objects in zone.rgw.log pool

2019-07-25 Thread Brett Chancellor
I'm having an issue similar to
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033611.html .
I don't see where any solution was proposed.

$ ceph health detail
HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
1 large objects found in pool 'us-prd-1.rgw.log'
Search the cluster log for 'Large omap object found' for more details.

$ grep "Large omap object" /var/log/ceph/ceph.log
2019-07-25 14:58:21.758321 osd.3 (osd.3) 15 : cluster [WRN] Large omap
object found. Object:
51:61eb35fe:::meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19:head Key
count: 3382154 Size (bytes): 611384043

$ rados -p us-prd-1.rgw.log listomapkeys
meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19 |wc -l
3382154

$ rados -p us-prd-1.rgw.log listomapvals
meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19
This returns entries from almost every bucket, across multiple tenants.
Several of the entries are from buckets that no longer exist on the system.

$ ceph df |egrep 'OBJECTS|.rgw.log'
POOL              ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
us-prd-1.rgw.log  51  758 MiB  228      758 MiB  0      102 TiB

Thanks,

-Brett


Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-11 Thread Brett Chancellor
I did try running sudo ceph-bluestore-tool --out-dir /mnt/ceph
bluefs-export, but it died after writing out 93GB and filling up my root
partition.
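
Presumably the right form is something along these lines, with --out-dir
pointed at a filesystem that has enough free space for the BlueFS contents
(paths here are examples):

  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-45 --out-dir /mnt/bluefs-export bluefs-export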

On Thu, Jul 11, 2019 at 3:32 PM Brett Chancellor 
wrote:

> We moved the .rgw.meta data pool over to SSD to try and improve
> performance; during the backfill the SSDs began dying en masse. Log attached to
> this case
> https://tracker.ceph.com/issues/40741
>
> Right now the SSD's wont come up with either allocator and the cluster is
> pretty much dead.
>
> What are the consequences of deleting the .rgw.meta pool? Can it be
> recreated?
>
> On Wed, Jul 10, 2019 at 3:31 PM ifedo...@suse.de  wrote:
>
>> You might want to try manual rocksdb compaction using ceph-kvstore-tool..
>>
>> Sent from my Huawei tablet
>>
>>
>>  Original Message 
>> Subject: Re: [ceph-users] 3 OSDs stopped and unable to restart
>> From: Brett Chancellor
>> To: Igor Fedotov
>> CC: Ceph Users
>>
>> Once backfilling finished, the cluster was super slow, most osd's were
>> filled with heartbeat_map errors.  When an OSD restarts it causes a cascade
>> of other osd's to follow suit and restart.. logs like..
>>   -3> 2019-07-10 18:34:50.046 7f34abf5b700 -1 osd.69 1348581
>> get_health_metrics reporting 21 slow ops, oldest is
>> osd_op(client.115295041.0:17575966 15.c37fa482 15.c37fa482 (undecoded)
>> ack+ondisk+write+known_if_redirected e1348522)
>> -2> 2019-07-10 18:34:50.967 7f34acf5d700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f3493f2b700' had timed out after 90
>> -1> 2019-07-10 18:34:50.967 7f34acf5d700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f3493f2b700' had suicide timed out after 150
>>  0> 2019-07-10 18:34:51.025 7f3493f2b700 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7f3493f2b700 thread_name:tp_osd_tp
>>
>>  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
>> (stable)
>>  1: (()+0xf5d0) [0x7f34b57c25d0]
>>  2: (pread64()+0x33) [0x7f34b57c1f63]
>>  3: (KernelDevice::read_random(unsigned long, unsigned long, char*,
>> bool)+0x238) [0x55bfdae5a448]
>>  4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned
>> long, char*)+0xca) [0x55bfdae1271a]
>>  5: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long,
>> rocksdb::Slice*, char*) const+0x20) [0x55bfdae3b440]
>>  6: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long,
>> rocksdb::Slice*, char*) const+0x960) [0x55bfdb466ba0]
>>  7: (rocksdb::BlockFetcher::ReadBlockContents()+0x3e7) [0x55bfdb420c27]
>>  8: (()+0x11146a4) [0x55bfdb40d6a4]
>>  9:
>> (rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache(rocksdb::FilePrefetchBuffer*,
>> rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&,
>> rocksdb::BlockHandle const&, rocksdb::Slice,
>> rocksdb::BlockBasedTable::CachableEntry*, bool,
>> rocksdb::GetContext*)+0x2cc) [0x55bfdb40f63c]
>>  10: (rocksdb::DataBlockIter*
>> rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*,
>> rocksdb::ReadOptions const&, rocksdb::BlockHandle const&,
>> rocksdb::DataBlockIter*, bool, bool, bool, rocksdb::GetContext*,
>> rocksdb::Status, rocksdb::FilePrefetchBuffer*)+0x169) [0x55bfdb41cb29]
>>  11: (rocksdb::BlockBasedTableIterator> rocksdb::Slice>::InitDataBlock()+0xc8) [0x55bfdb41e588]
>>  12: (rocksdb::BlockBasedTableIterator> rocksdb::Slice>::FindKeyForward()+0x8d) [0x55bfdb41e89d]
>>  13: (()+0x10adde9) [0x55bfdb3a6de9]
>>  14: (rocksdb::MergingIterator::Next()+0x44) [0x55bfdb4357c4]
>>  15: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x762)
>> [0x55bfdb32a092]
>>  16: (rocksdb::DBIter::Next()+0x1d6) [0x55bfdb32b6c6]
>>  17: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::next()+0x2d)
>> [0x55bfdad9fa8d]
>>  18: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t
>> const&, ghobject_t const&, int, std::vector> std::allocator >*, ghobject_t*)+0xdf6) [0x55bfdad12466]
>>  19:
>> (BlueStore::collection_list(boost::intrusive_ptr&,
>> ghobject_t const&, ghobject_t const&, int, std::vector> std::allocator >*, ghobject_t*)+0x9b) [0x55bfdad1393b]
>>  20: (PG::_delete_some(ObjectStore::Transaction*)+0x1e0) [0x55bfda984120]
>>  21: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38)
>> [0x55bfda985598]
>>  22: (boost::statechart::simple_state> PG::RecoveryState::ToDelete, boost::mpl::list> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>> 

Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-11 Thread Brett Chancellor
We moved the .rgw.meta data pool over to SSD to try and improve
performance; during the backfill the SSDs began dying en masse. Log attached to
this case
https://tracker.ceph.com/issues/40741

Right now the SSD's wont come up with either allocator and the cluster is
pretty much dead.

What are the consequences of deleting the .rgw.meta pool? Can it be
recreated?

On Wed, Jul 10, 2019 at 3:31 PM ifedo...@suse.de  wrote:

> You might want to try manual rocksdb compaction using ceph-kvstore-tool..
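>
> Something along these lines, with the OSD stopped first (path is an
> example):
>
>   systemctl stop ceph-osd@45
>   ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-45 compact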
>
> Sent from my Huawei tablet
>
>
>  Original Message 
> Subject: Re: [ceph-users] 3 OSDs stopped and unable to restart
> From: Brett Chancellor
> To: Igor Fedotov
> CC: Ceph Users
>
> Once backfilling finished, the cluster was super slow, most osd's were
> filled with heartbeat_map errors.  When an OSD restarts it causes a cascade
> of other osd's to follow suit and restart.. logs like..
>   -3> 2019-07-10 18:34:50.046 7f34abf5b700 -1 osd.69 1348581
> get_health_metrics reporting 21 slow ops, oldest is
> osd_op(client.115295041.0:17575966 15.c37fa482 15.c37fa482 (undecoded)
> ack+ondisk+write+known_if_redirected e1348522)
> -2> 2019-07-10 18:34:50.967 7f34acf5d700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f3493f2b700' had timed out after 90
> -1> 2019-07-10 18:34:50.967 7f34acf5d700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f3493f2b700' had suicide timed out after 150
>  0> 2019-07-10 18:34:51.025 7f3493f2b700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f3493f2b700 thread_name:tp_osd_tp
>
>  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
> (stable)
>  1: (()+0xf5d0) [0x7f34b57c25d0]
>  2: (pread64()+0x33) [0x7f34b57c1f63]
>  3: (KernelDevice::read_random(unsigned long, unsigned long, char*,
> bool)+0x238) [0x55bfdae5a448]
>  4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned
> long, char*)+0xca) [0x55bfdae1271a]
>  5: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long,
> rocksdb::Slice*, char*) const+0x20) [0x55bfdae3b440]
>  6: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long,
> rocksdb::Slice*, char*) const+0x960) [0x55bfdb466ba0]
>  7: (rocksdb::BlockFetcher::ReadBlockContents()+0x3e7) [0x55bfdb420c27]
>  8: (()+0x11146a4) [0x55bfdb40d6a4]
>  9:
> (rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache(rocksdb::FilePrefetchBuffer*,
> rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&,
> rocksdb::BlockHandle const&, rocksdb::Slice,
> rocksdb::BlockBasedTable::CachableEntry*, bool,
> rocksdb::GetContext*)+0x2cc) [0x55bfdb40f63c]
>  10: (rocksdb::DataBlockIter*
> rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*,
> rocksdb::ReadOptions const&, rocksdb::BlockHandle const&,
> rocksdb::DataBlockIter*, bool, bool, bool, rocksdb::GetContext*,
> rocksdb::Status, rocksdb::FilePrefetchBuffer*)+0x169) [0x55bfdb41cb29]
>  11: (rocksdb::BlockBasedTableIterator rocksdb::Slice>::InitDataBlock()+0xc8) [0x55bfdb41e588]
>  12: (rocksdb::BlockBasedTableIterator rocksdb::Slice>::FindKeyForward()+0x8d) [0x55bfdb41e89d]
>  13: (()+0x10adde9) [0x55bfdb3a6de9]
>  14: (rocksdb::MergingIterator::Next()+0x44) [0x55bfdb4357c4]
>  15: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x762)
> [0x55bfdb32a092]
>  16: (rocksdb::DBIter::Next()+0x1d6) [0x55bfdb32b6c6]
>  17: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::next()+0x2d)
> [0x55bfdad9fa8d]
>  18: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t
> const&, ghobject_t const&, int, std::vector std::allocator >*, ghobject_t*)+0xdf6) [0x55bfdad12466]
>  19:
> (BlueStore::collection_list(boost::intrusive_ptr&,
> ghobject_t const&, ghobject_t const&, int, std::vector std::allocator >*, ghobject_t*)+0x9b) [0x55bfdad1393b]
>  20: (PG::_delete_some(ObjectStore::Transaction*)+0x1e0) [0x55bfda984120]
>  21: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38)
> [0x55bfda985598]
>  22: (boost::statechart::simple_state PG::RecoveryState::ToDelete, boost::mpl::list mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na>,
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
> const&, void const*)+0x16a) [0x55bfda9c45ca]
>  23: (boost::statechart::state_machine PG::RecoveryState::Initial, std::allocator,
> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
> const&)+0x5a) [0x55bfda9a20ca]
>  24: (PG::do_peering_event(std::shared_ptr,
> PG::RecoveryCtx*)+0x119) [0x55bfda991389]
&

Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-10 Thread Brett Chancellor
ompressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-osd.69.log
--- end dump of recent events ---

On Tue, Jul 9, 2019 at 1:38 PM Igor Fedotov  wrote:

> This will cap single bluefs space allocation. Currently it attempts to
> allocate 70Gb which seems to overflow some 32-bit length fields. With the
> adjustment such allocation should be capped at ~700MB.
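>
> In config terms that's the same mechanism as the allocator switch
> mentioned earlier in the thread, e.g. (N being the affected OSD id):
>
>   ceph config set osd.N bluestore_bluefs_gift_ratio 0.0002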
>
> I doubt there is any relation between this specific failure and the pool.
> At least at the moment.
>
> In short the history is: starting OSD tries to flush bluefs data to disk,
> detects lack of space and asks for more from main device - allocations
> succeeds but returned extent has length field set to 0.
> On 7/9/2019 8:33 PM, Brett Chancellor wrote:
>
> What does bluestore_bluefs_gift_ratio do?  I can't find any documentation
> on it.  Also do you think this could be related to the .rgw.meta pool
> having too many objects per PG? The disks that die always seem to be
> backfilling a pg from that pool, and they have ~550k objects per PG.
>
> -Brett
>
> On Tue, Jul 9, 2019 at 1:03 PM Igor Fedotov  wrote:
>
>> Please try to set bluestore_bluefs_gift_ratio to 0.0002
>>
>>
>> On 7/9/2019 7:39 PM, Brett Chancellor wrote:
>>
>> Too large for pastebin.. The problem is continually crashing new OSDs.
>> Here is the latest one.
>>
>> On Tue, Jul 9, 2019 at 11:46 AM Igor Fedotov  wrote:
>>
>>> could you please set debug bluestore to 20 and collect startup log for
>>> this specific OSD once again?
>>>
>>>
>>> On 7/9/2019 6:29 PM, Brett Chancellor wrote:
>>>
>>> I restarted most of the OSDs with the stupid allocator (6 of them
>>> wouldn't start unless bitmap allocator was set), but I'm still seeing
>>> issues with OSDs crashing.  Interestingly it seems that the dying OSDs are
>>> always working on a pg from the .rgw.meta pool when they crash.
>>>
>>> Log : https://pastebin.com/yuJKcPvX
>>>
>>> On Tue, Jul 9, 2019 at 5:14 AM Igor Fedotov  wrote:
>>>
>>>> Hi Brett,
>>>>
>>>> in Nautilus you can do that via
>>>>
>>>> ceph config set osd.N bluestore_allocator stupid
>>>>
>>>> ceph config set osd.N bluefs_allocator stupid
>>>>
>>>> See
>>>> https://ceph.com/community/new-mimic-centralized-configuration-management/
>>>> for more details on a new way of configuration options setting.
>>>>
>>>>
>>>> A known issue with Stupid allocator is gradual write request latency
>>>> increase (occurred within several days after OSD restart). Seldom observed
>>>> though. There were some posts about that behavior in the mail list  this
>>>> year.
>>>>
>>>> Thanks,
>>>>
>>>> Igor.
>>>>
>>>>
>>>> On 7/8/2019 8:33 PM, Brett Chancellor wrote:
>>>>
>>>>
>>>> I'll give that a try.  Is it something like...
>>>> ceph tell 'osd.*' bluestore_allocator stupid
>>>> ceph tell 'osd.*' bluefs_allocator stupid
>>>>
>>>> And should I expect any issues doing this?
>>>>
>>>>
>>>> On Mon, Jul 8, 2019 at 1:04 PM Igor Fedotov  wrote:
>>>>
>>>>> I should read call stack more carefully... It's not about lacking free
>>>>> space - this is rather the bug from this ticket:
>>>>>
>>>>> http://tracker.ceph.com/issues/40080
>>>>>
>>>>>
>>>>> You should upgrade to v14.2.2 (once it's available) or temporarily
>>>>> switch to stupid allocator as a workaround.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Igor
>>>>>
>>>>>
>>>>>
>>>>> On 7/8/2019 8:00 PM, Igor Fedotov wrote:
>>>>>
>>>>> Hi Brett,
>>>>>
>>>>> looks like BlueStore is unable to allocate additional space for BlueFS
>>>>> at main device. It's either lacking free space or it's too fragmented...
>>>>>
>>>>> Would you share osd log, please?
>>>>>
>>>>> Also please run "

Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-09 Thread Brett Chancellor
What does bluestore_bluefs_gift_ratio do?  I can't find any documentation
on it.  Also do you think this could be related to the .rgw.meta pool
having too many objects per PG? The disks that die always seem to be
backfilling a pg from that pool, and they have ~550k objects per PG.

-Brett

On Tue, Jul 9, 2019 at 1:03 PM Igor Fedotov  wrote:

> Please try to set bluestore_bluefs_gift_ratio to 0.0002
>
>
> On 7/9/2019 7:39 PM, Brett Chancellor wrote:
>
> Too large for pastebin.. The problem is continually crashing new OSDs.
> Here is the latest one.
>
> On Tue, Jul 9, 2019 at 11:46 AM Igor Fedotov  wrote:
>
>> could you please set debug bluestore to 20 and collect startup log for
>> this specific OSD once again?
>>
>>
>> On 7/9/2019 6:29 PM, Brett Chancellor wrote:
>>
>> I restarted most of the OSDs with the stupid allocator (6 of them
>> wouldn't start unless bitmap allocator was set), but I'm still seeing
>> issues with OSDs crashing.  Interestingly it seems that the dying OSDs are
>> always working on a pg from the .rgw.meta pool when they crash.
>>
>> Log : https://pastebin.com/yuJKcPvX
>>
>> On Tue, Jul 9, 2019 at 5:14 AM Igor Fedotov  wrote:
>>
>>> Hi Brett,
>>>
>>> in Nautilus you can do that via
>>>
>>> ceph config set osd.N bluestore_allocator stupid
>>>
>>> ceph config set osd.N bluefs_allocator stupid
>>>
>>> See
>>> https://ceph.com/community/new-mimic-centralized-configuration-management/
>>> for more details on a new way of configuration options setting.
>>>
>>>
>>> A known issue with Stupid allocator is gradual write request latency
>>> increase (occurred within several days after OSD restart). Seldom observed
>>> though. There were some posts about that behavior in the mail list  this
>>> year.
>>>
>>> Thanks,
>>>
>>> Igor.
>>>
>>>
>>> On 7/8/2019 8:33 PM, Brett Chancellor wrote:
>>>
>>>
>>> I'll give that a try.  Is it something like...
>>> ceph tell 'osd.*' bluestore_allocator stupid
>>> ceph tell 'osd.*' bluefs_allocator stupid
>>>
>>> And should I expect any issues doing this?
>>>
>>>
>>> On Mon, Jul 8, 2019 at 1:04 PM Igor Fedotov  wrote:
>>>
>>>> I should read call stack more carefully... It's not about lacking free
>>>> space - this is rather the bug from this ticket:
>>>>
>>>> http://tracker.ceph.com/issues/40080
>>>>
>>>>
>>>> You should upgrade to v14.2.2 (once it's available) or temporarily
>>>> switch to stupid allocator as a workaround.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>>
>>>> On 7/8/2019 8:00 PM, Igor Fedotov wrote:
>>>>
>>>> Hi Brett,
>>>>
>>>> looks like BlueStore is unable to allocate additional space for BlueFS
>>>> at main device. It's either lacking free space or it's too fragmented...
>>>>
>>>> Would you share osd log, please?
>>>>
>>>> Also please run "ceph-bluestore-tool --path <path-to-osd> bluefs-bdev-sizes" and share the output.
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>> On 7/3/2019 9:59 PM, Brett Chancellor wrote:
>>>>
>>>> Hi All! Today I've had 3 OSDs stop themselves and are unable to
>>>> restart, all with the same error. These OSDs are all on different hosts.
>>>> All are running 14.2.1
>>>>
>>>> I did try the following two commands
>>>> - ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
>>>>   ## This failed with the same error below
>>>> - ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
>>>>  ## After a couple of hours returned...
>>>> 2019-07-03 18:30:02.095 7fe7c1c1ef00 -1
>>>> bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs record
>>>> found, suggest to run store repair to get consistent statistic reports
>>>> fsck success
>>>>
>>>>
>>>> ## Error when trying to start one of the OSDs
>>>>-12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal
>>>> (Aborted) **
>>>>  in thread 7f5e42366700 thread_name:rocksdb:low0
>>>>
>>>>  ceph vers

Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-09 Thread Brett Chancellor
I restarted most of the OSDs with the stupid allocator (6 of them wouldn't
start unless bitmap allocator was set), but I'm still seeing issues with
OSDs crashing.  Interestingly it seems that the dying OSDs are always
working on a pg from the .rgw.meta pool when they crash.

Log : https://pastebin.com/yuJKcPvX

On Tue, Jul 9, 2019 at 5:14 AM Igor Fedotov  wrote:

> Hi Brett,
>
> in Nautilus you can do that via
>
> ceph config set osd.N bluestore_allocator stupid
>
> ceph config set osd.N bluefs_allocator stupid
>
> See
> https://ceph.com/community/new-mimic-centralized-configuration-management/
> for more details on a new way of configuration options setting.
>
>
> A known issue with Stupid allocator is gradual write request latency
> increase (occurred within several days after OSD restart). Seldom observed
> though. There were some posts about that behavior in the mail list  this
> year.
>
> Thanks,
>
> Igor.
>
>
> On 7/8/2019 8:33 PM, Brett Chancellor wrote:
>
>
> I'll give that a try.  Is it something like...
> ceph tell 'osd.*' bluestore_allocator stupid
> ceph tell 'osd.*' bluefs_allocator stupid
>
> And should I expect any issues doing this?
>
>
> On Mon, Jul 8, 2019 at 1:04 PM Igor Fedotov  wrote:
>
>> I should read call stack more carefully... It's not about lacking free
>> space - this is rather the bug from this ticket:
>>
>> http://tracker.ceph.com/issues/40080
>>
>>
>> You should upgrade to v14.2.2 (once it's available) or temporarily switch
>> to stupid allocator as a workaround.
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>>
>> On 7/8/2019 8:00 PM, Igor Fedotov wrote:
>>
>> Hi Brett,
>>
>> looks like BlueStore is unable to allocate additional space for BlueFS at
>> main device. It's either lacking free space or it's too fragmented...
>>
>> Would you share osd log, please?
>>
>> Also please run "ceph-bluestore-tool --path <path-to-osd> bluefs-bdev-sizes" and share the output.
>>
>> Thanks,
>>
>> Igor
>> On 7/3/2019 9:59 PM, Brett Chancellor wrote:
>>
>> Hi All! Today I've had 3 OSDs stop themselves and are unable to restart,
>> all with the same error. These OSDs are all on different hosts. All are
>> running 14.2.1
>>
>> I did try the following two commands
>> - ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
>>   ## This failed with the same error below
>> - ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
>>  ## After a couple of hours returned...
>> 2019-07-03 18:30:02.095 7fe7c1c1ef00 -1
>> bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs record
>> found, suggest to run store repair to get consistent statistic reports
>> fsck success
>>
>>
>> ## Error when trying to start one of the OSDs
>>-12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7f5e42366700 thread_name:rocksdb:low0
>>
>>  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
>> (stable)
>>  1: (()+0xf5d0) [0x7f5e50bd75d0]
>>  2: (gsignal()+0x37) [0x7f5e4f9ce207]
>>  3: (abort()+0x148) [0x7f5e4f9cf8f8]
>>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x199) [0x55a7aaee96ab]
>>  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char
>> const*, char const*, ...)+0) [0x55a7aaee982a]
>>  6: (interval_set> std::less, std::allocator> unsigned long> > > >::insert(unsigned long, unsigned long, unsigned long*,
>> unsigned long*)+0x3c6) [0x55a7ab212a66]
>>  7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned long,
>> std::vector> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
>> >*)+0x74e) [0x55a7ab48253e]
>>  8: (BlueFS::_expand_slow_device(unsigned long,
>> std::vector> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
>> >&)+0x111) [0x55a7ab59e921]
>>  9: (BlueFS::_allocate(unsigned char, unsigned long,
>> bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
>>  10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
>> long)+0xe5) [0x55a7ab59fce5]
>>  11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55a7ab5a1b4b]
>>  12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
>>  13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
>>  14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
>>  15: (roc

Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-08 Thread Brett Chancellor
I'll give that a try.  Is it something like...
ceph tell 'osd.*' bluestore_allocator stupid
ceph tell 'osd.*' bluefs_allocator stupid

And should I expect any issues doing this?


On Mon, Jul 8, 2019 at 1:04 PM Igor Fedotov  wrote:

> I should read call stack more carefully... It's not about lacking free
> space - this is rather the bug from this ticket:
>
> http://tracker.ceph.com/issues/40080
>
>
> You should upgrade to v14.2.2 (once it's available) or temporarily switch
> to stupid allocator as a workaround.
>
>
> Thanks,
>
> Igor
>
>
>
> On 7/8/2019 8:00 PM, Igor Fedotov wrote:
>
> Hi Brett,
>
> looks like BlueStore is unable to allocate additional space for BlueFS at
> main device. It's either lacking free space or it's too fragmented...
>
> Would you share osd log, please?
>
> Also please run "ceph-bluestore-tool --path <path-to-osd> bluefs-bdev-sizes" and share the output.
>
> Thanks,
>
> Igor
> On 7/3/2019 9:59 PM, Brett Chancellor wrote:
>
> Hi All! Today I've had 3 OSDs stop themselves and are unable to restart,
> all with the same error. These OSDs are all on different hosts. All are
> running 14.2.1
>
> I did try the following two commands
> - ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
>   ## This failed with the same error below
> - ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
>  ## After a couple of hours returned...
> 2019-07-03 18:30:02.095 7fe7c1c1ef00 -1
> bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs record
> found, suggest to run store repair to get consistent statistic reports
> fsck success
>
>
> ## Error when trying to start one of the OSDs
>-12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f5e42366700 thread_name:rocksdb:low0
>
>  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
> (stable)
>  1: (()+0xf5d0) [0x7f5e50bd75d0]
>  2: (gsignal()+0x37) [0x7f5e4f9ce207]
>  3: (abort()+0x148) [0x7f5e4f9cf8f8]
>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x199) [0x55a7aaee96ab]
>  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*,
> char const*, ...)+0) [0x55a7aaee982a]
>  6: (interval_set std::less, std::allocator unsigned long> > > >::insert(unsigned long, unsigned long, unsigned long*,
> unsigned long*)+0x3c6) [0x55a7ab212a66]
>  7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned long,
> std::vector mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
> >*)+0x74e) [0x55a7ab48253e]
>  8: (BlueFS::_expand_slow_device(unsigned long,
> std::vector mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
> >&)+0x111) [0x55a7ab59e921]
>  9: (BlueFS::_allocate(unsigned char, unsigned long,
> bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
>  10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
> long)+0xe5) [0x55a7ab59fce5]
>  11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55a7ab5a1b4b]
>  12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
>  13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
>  14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
>  15: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status
> const&, rocksdb::CompactionJob::SubcompactionState*,
> rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice
> const*)+0xbaa) [0x55a7abc3b73a]
>  16:
> (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0)
> [0x55a7abc3f150]
>  17: (rocksdb::CompactionJob::Run()+0x298) [0x55a7abc40618]
>  18: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*,
> rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xcb7)
> [0x55a7aba7fb67]
>  19:
> (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
> rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]
>  20: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a) [0x55a7aba8190a]
>  21: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264)
> [0x55a7abc8d9c4]
>  22: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f)
> [0x55a7abc8db4f]
>  23: (()+0x129dfff) [0x55a7abd1afff]
>  24: (()+0x7dd5) [0x7f5e50bcfdd5]
>  25: (clone()+0x6d) [0x7f5e4fa95ead]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
>
> ___
> ceph-users mailing 
> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing 
> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-08 Thread Brett Chancellor
Just seeing if anybody has seen this? About 15 more OSDs have failed since
then. The cluster can't backfill fast enough, and I fear data loss may be
imminent.  I did notice that one of the latest ones to fail has lines
similar to this one right before the crash:

2019-07-08 15:18:56.170 7fc732475700  5
bluestore(/var/lib/ceph/osd/ceph-59) allocate_bluefs_freespace gifting
0x4d18d0~40 to bluefs

Any thoughts?

On Sat, Jul 6, 2019 at 3:06 PM Brett Chancellor 
wrote:

> Has anybody else run into this? It seems to be slowly spreading to other
> OSDs, maybe it gets to a bad pg in the backfill process and kills off
> another OSD (just guessing since the failures are hours apart).  It's kind
> of a pain because I have to continually rebuild these OSDs before the
> cluster runs out of space.
>
> On Wed, Jul 3, 2019 at 2:59 PM Brett Chancellor <
> bchancel...@salesforce.com> wrote:
>
>> Hi All! Today I've had 3 OSDs stop themselves and are unable to restart,
>> all with the same error. These OSDs are all on different hosts. All are
>> running 14.2.1
>>
>> I did try the following two commands
>> - ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
>>   ## This failed with the same error below
>> - ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
>>  ## After a couple of hours returned...
>> 2019-07-03 18:30:02.095 7fe7c1c1ef00 -1
>> bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs record
>> found, suggest to run store repair to get consistent statistic reports
>> fsck success
>>
>>
>> ## Error when trying to start one of the OSDs
>>-12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7f5e42366700 thread_name:rocksdb:low0
>>
>>  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
>> (stable)
>>  1: (()+0xf5d0) [0x7f5e50bd75d0]
>>  2: (gsignal()+0x37) [0x7f5e4f9ce207]
>>  3: (abort()+0x148) [0x7f5e4f9cf8f8]
>>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x199) [0x55a7aaee96ab]
>>  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char
>> const*, char const*, ...)+0) [0x55a7aaee982a]
>>  6: (interval_set> std::less, std::allocator> unsigned long> > > >::insert(unsigned long, unsigned long, unsigned long*,
>> unsigned long*)+0x3c6) [0x55a7ab212a66]
>>  7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned long,
>> std::vector> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
>> >*)+0x74e) [0x55a7ab48253e]
>>  8: (BlueFS::_expand_slow_device(unsigned long,
>> std::vector> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
>> >&)+0x111) [0x55a7ab59e921]
>>  9: (BlueFS::_allocate(unsigned char, unsigned long,
>> bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
>>  10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
>> long)+0xe5) [0x55a7ab59fce5]
>>  11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55a7ab5a1b4b]
>>  12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
>>  13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
>>  14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
>>  15: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status
>> const&, rocksdb::CompactionJob::SubcompactionState*,
>> rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice
>> const*)+0xbaa) [0x55a7abc3b73a]
>>  16:
>> (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0)
>> [0x55a7abc3f150]
>>  17: (rocksdb::CompactionJob::Run()+0x298) [0x55a7abc40618]
>>  18: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*,
>> rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xcb7)
>> [0x55a7aba7fb67]
>>  19:
>> (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
>> rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]
>>  20: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a) [0x55a7aba8190a]
>>  21: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264)
>> [0x55a7abc8d9c4]
>>  22: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f)
>> [0x55a7abc8db4f]
>>  23: (()+0x129dfff) [0x55a7abd1afff]
>>  24: (()+0x7dd5) [0x7f5e50bcfdd5]
>>  25: (clone()+0x6d) [0x7f5e4fa95ead]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
>> to interpret this.
>>
>


Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-06 Thread Brett Chancellor
Has anybody else run into this? It seems to be slowly spreading to other
OSDs, maybe it gets to a bad pg in the backfill process and kills off
another OSD (just guessing since the failures are hours apart).  It's kind
of a pain because I have to continually rebuild these OSDs before the
cluster runs out of space.

On Wed, Jul 3, 2019 at 2:59 PM Brett Chancellor 
wrote:

> Hi All! Today I've had 3 OSDs stop themselves and are unable to restart,
> all with the same error. These OSDs are all on different hosts. All are
> running 14.2.1
>
> I did try the following two commands
> - ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
>   ## This failed with the same error below
> - ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
>  ## After a couple of hours returned...
> 2019-07-03 18:30:02.095 7fe7c1c1ef00 -1
> bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs record
> found, suggest to run store repair to get consistent statistic reports
> fsck success
>
>
> ## Error when trying to start one of the OSDs
>-12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f5e42366700 thread_name:rocksdb:low0
>
>  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
> (stable)
>  1: (()+0xf5d0) [0x7f5e50bd75d0]
>  2: (gsignal()+0x37) [0x7f5e4f9ce207]
>  3: (abort()+0x148) [0x7f5e4f9cf8f8]
>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x199) [0x55a7aaee96ab]
>  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*,
> char const*, ...)+0) [0x55a7aaee982a]
>  6: (interval_set std::less, std::allocator unsigned long> > > >::insert(unsigned long, unsigned long, unsigned long*,
> unsigned long*)+0x3c6) [0x55a7ab212a66]
>  7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned long,
> std::vector mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
> >*)+0x74e) [0x55a7ab48253e]
>  8: (BlueFS::_expand_slow_device(unsigned long,
> std::vector mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
> >&)+0x111) [0x55a7ab59e921]
>  9: (BlueFS::_allocate(unsigned char, unsigned long,
> bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
>  10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
> long)+0xe5) [0x55a7ab59fce5]
>  11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55a7ab5a1b4b]
>  12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
>  13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
>  14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
>  15: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status
> const&, rocksdb::CompactionJob::SubcompactionState*,
> rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice
> const*)+0xbaa) [0x55a7abc3b73a]
>  16:
> (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0)
> [0x55a7abc3f150]
>  17: (rocksdb::CompactionJob::Run()+0x298) [0x55a7abc40618]
>  18: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*,
> rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xcb7)
> [0x55a7aba7fb67]
>  19:
> (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
> rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]
>  20: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a) [0x55a7aba8190a]
>  21: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264)
> [0x55a7abc8d9c4]
>  22: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f)
> [0x55a7abc8db4f]
>  23: (()+0x129dfff) [0x55a7abd1afff]
>  24: (()+0x7dd5) [0x7f5e50bcfdd5]
>  25: (clone()+0x6d) [0x7f5e4fa95ead]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 3 OSDs stopped and unable to restart

2019-07-03 Thread Brett Chancellor
Hi All! Today I've had 3 OSDs stop themselves and are unable to restart,
all with the same error. These OSDs are all on different hosts. All are
running 14.2.1

I did try the following two commands
- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
  ## This failed with the same error below
- ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
 ## After a couple of hours returned...
2019-07-03 18:30:02.095 7fe7c1c1ef00 -1
bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs record
found, suggest to run store repair to get consistent statistic reports
fsck success


## Error when trying to start one of the OSDs
   -12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal (Aborted)
**
 in thread 7f5e42366700 thread_name:rocksdb:low0

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
(stable)
 1: (()+0xf5d0) [0x7f5e50bd75d0]
 2: (gsignal()+0x37) [0x7f5e4f9ce207]
 3: (abort()+0x148) [0x7f5e4f9cf8f8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x199) [0x55a7aaee96ab]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*,
char const*, ...)+0) [0x55a7aaee982a]
 6: (interval_set, std::allocator > > >::insert(unsigned long, unsigned long, unsigned long*,
unsigned long*)+0x3c6) [0x55a7ab212a66]
 7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned long,
std::vector
>*)+0x74e) [0x55a7ab48253e]
 8: (BlueFS::_expand_slow_device(unsigned long,
std::vector
>&)+0x111) [0x55a7ab59e921]
 9: (BlueFS::_allocate(unsigned char, unsigned long,
bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
 10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
long)+0xe5) [0x55a7ab59fce5]
 11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55a7ab5a1b4b]
 12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
 13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
 14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
 15: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status
const&, rocksdb::CompactionJob::SubcompactionState*,
rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice
const*)+0xbaa) [0x55a7abc3b73a]
 16:
(rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0)
[0x55a7abc3f150]
 17: (rocksdb::CompactionJob::Run()+0x298) [0x55a7abc40618]
 18: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*,
rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xcb7)
[0x55a7aba7fb67]
 19:
(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]
 20: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a) [0x55a7aba8190a]
 21: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264)
[0x55a7abc8d9c4]
 22: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f)
[0x55a7abc8db4f]
 23: (()+0x129dfff) [0x55a7abd1afff]
 24: (()+0x7dd5) [0x7f5e50bcfdd5]
 25: (clone()+0x6d) [0x7f5e4fa95ead]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] increase pg_num error

2019-07-01 Thread Brett Chancellor
In Nautilus just pg_num is sufficient for both increases and decreases.

On Mon, Jul 1, 2019 at 10:55 AM Robert LeBlanc  wrote:

> I believe he needs to increase the pgp_num first, then pg_num.
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Mon, Jul 1, 2019 at 7:21 AM Nathan Fish  wrote:
>
>> I ran into this recently. Try running "ceph osd require-osd-release
>> nautilus". This drops backwards compat with pre-nautilus and allows
>> changing settings.
>>
>> On Mon, Jul 1, 2019 at 4:24 AM Sylvain PORTIER  wrote:
>> >
>> > Hi all,
>> >
>> > I am using ceph 14.2.1 (Nautilus)
>> >
>> > I am unable to increase the pg_num of a pool.
>> >
>> > I have a pool named Backup, the current pg_num is 64 : ceph osd pool get
>> > Backup pg_num => result pg_num: 64
>> >
>> > And when I try to increase it using the command
>> >
>> > ceph osd pool set Backup pg_num 512 => result "set pool 6 pg_num to 512"
>> >
>> > And then I check with the command : ceph osd pool get Backup pg_num =>
>> > result pg_num: 64
>> >
>> > I don't how to increase the pg_num of a pool, I also tried the autoscale
>> > module, but it doesn't work (unable to activate the autoscale, always
>> > warn mode).
>> >
>> > Thank you for your help,
>> >
>> >
>> > Cabeur.
>> >
>> >
>> > ---
>> > L'absence de virus dans ce courrier électronique a été vérifiée par le
>> logiciel antivirus Avast.
>> > https://www.avast.com/antivirus
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
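
For reference, the fix discussed in this thread boils down to the following
commands (a rough sketch, using the pool name from Sylvain's post and assuming
Nautilus 14.2.x):

ceph osd require-osd-release nautilus
ceph osd pool set Backup pg_num 512
ceph osd pool get Backup pg_num
## optionally, let the autoscaler manage it instead of manual sizing:
ceph osd pool set Backup pg_autoscale_mode on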


Re: [ceph-users] details about cloning objects using librados

2019-07-01 Thread Brett Chancellor
Ceph already does this by default. For each replicated pool, you can set
the 'size' which is the number of copies you want Ceph to maintain. The
accepted norm for replicas is 3, but you can set it higher if you want to
incur the performance penalty.

On Mon, Jul 1, 2019, 6:01 AM nokia ceph  wrote:

> Hi Brad,
>
> Thank you for your response , and we will check this video as well.
> Our requirement is that, while writing an object into the cluster, we can
> specify the number of copies to be made, so the network traffic between
> client and cluster is only for one object write. The cluster would then
> clone/copy the object into multiple copies and store them internally.
>
> Thanks,
> Muthu
>
> On Fri, Jun 28, 2019 at 9:23 AM Brad Hubbard  wrote:
>
>> On Thu, Jun 27, 2019 at 8:58 PM nokia ceph 
>> wrote:
>> >
>> > Hi Team,
>> >
>> > We have a requirement to create multiple copies of an object and
>> currently we are handling it in client side to write as separate objects
>> and this causes huge network traffic between client and cluster.
>> > Is there possibility of cloning an object to multiple copies using
>> librados api?
>> > Please share the document details if it is feasible.
>>
>> It may be possible to use an object class to accomplish what you want
>> to achieve but the more we understand what you are trying to do, the
>> better the advice we can offer (at the moment your description sounds
>> like replication which is already part of RADOS as you know).
>>
>> More on object classes from Cephalocon Barcelona in May this year:
>> https://www.youtube.com/watch?v=EVrP9MXiiuU
>>
>> >
>> > Thanks,
>> > Muthu
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Cheers,
>> Brad
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
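
As a small illustration of the 'size' setting Brett refers to (the pool name
here is hypothetical):

ceph osd pool create mypool 64 64 replicated
ceph osd pool set mypool size 3      ## keep 3 copies of every object
ceph osd pool get mypool size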


Re: [ceph-users] Using Ceph Ansible to Add Nodes to Cluster at Weight 0

2019-06-24 Thread Brett Chancellor
I have used the gentle reweight script many times in the past. But more
recently, I expanded one cluster from 334 to 1114 OSDs by just changing
the crush weight of 100 OSDs at a time. Once all pgs from those 100 were
stable and backfilling, I'd add another hundred. I stopped at 500 and let the
backfill finish. I repeated the process for the last 500 drives and it was
finished in a weekend without any complaints.
Don't forget to adjust your PG count for the new OSDs once rebalancing is
done.

-Brett

On Sun, Jun 23, 2019, 2:51 PM  wrote:

> Hello,
>
> I would advise using this script from Dan:
>
> https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight
>
> I have used it many times and it works great - also if you want to drain
> the OSDs.
>
> Hth
> Mehmet
>
> Am 30. Mai 2019 22:59:05 MESZ schrieb Michel Raabe :
>>
>> Hi Mike,
>>
>> On 30.05.19 02:00, Mike Cave wrote:
>>
>>> I'd like as little friction for the cluster as possible as it is in
>>> heavy use right now.
>>>
>>> I’m running mimic (13.2.5) on CentOS.
>>>
>>> Any suggestions on best practices for this?
>>>
>>
>> You can limit the recovery for example
>>
>> * max backfills
>> * recovery max active
>> * recovery sleep
>>
>> It will slow down the rebalance but will not hurt the users too much.
>>
>>
>> Michel.
>> --
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
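
A rough sketch of the staged crush-weight approach described above; the OSD id,
step values and sleep are illustrative, and in practice you would watch
"ceph -s" until backfill settles before each step:

ceph osd crush reweight osd.123 0.0    ## start (or reset) the new OSD at weight 0
for w in 0.5 1.0 1.5 2.0; do
  ceph osd crush reweight osd.123 $w
  sleep 3600   ## or poll ceph -s until backfill is quiet before continuing
done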


[ceph-users] Invalid metric type, prometheus module with rbd mirroring

2019-06-20 Thread Brett Chancellor
Has anybody else encountered this issue? Prometheus is failing to scrape
the prometheus module, returning invalid metric type
"cef431ab_b67a_43f9_9b87_ebe992dec94e_replay_bytes counter"

Ceph version: 14.2.1
Prometheus version: 2.10.0-rc.0

This started happening when I set up one-way rbd mirroring to this cluster.
Here is the matching metric from the scrape. I don't see any other metrics that
have '/' in them; perhaps Prometheus doesn't like it?

# HELP 
ceph_rbd_mirror_vir401_volumes/cef431ab_b67a_43f9_9b87_ebe992dec94e_replay_bytes
Replayed data
# TYPE 
ceph_rbd_mirror_vir401_volumes/cef431ab_b67a_43f9_9b87_ebe992dec94e_replay_bytes
counter
ceph_rbd_mirror_vir401_volumes/cef431ab_b67a_43f9_9b87_ebe992dec94e_replay_bytes{ceph_daemon="rbd-mirror.595769852"}
0.0


-Brett
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Possible to move RBD volumes between pools?

2019-06-19 Thread Brett Chancellor
Both pools are in the same Ceph cluster. Do you have any documentation on
the live migration process? I'm running 14.2.1

On Wed, Jun 19, 2019, 8:35 PM Jason Dillaman  wrote:

> On Wed, Jun 19, 2019 at 6:25 PM Brett Chancellor
>  wrote:
> >
> > Background: We have a few ceph clusters, each serves multiple Openstack
> cluster. Each cluster has it's own set of pools.
> >
> > I'd like to move ~50TB of volumes from an old cluster (we'll call the
> pool cluster01-volumes) to an existing pool (cluster02-volumes) to later be
> imported by a different Openstack cluster. I could run something like
> this...
> > rbd export cluster01-volumes/volume-12345 | rbd import
> cluster02-volumes/volume-12345 .
>
> I'm getting a little confused by the dual use of "cluster" for both
> Ceph and OpenStack. Are both pools in the same Ceph cluster? If so,
> could you just clone the image to the new pool? The Nautilus release
> also includes a simple image live migration tool where it creates a
> clone, copies the data and all snapshots to the clone, and then
> deletes the original image.
>
> > But that would be slow and duplicate the data which I'd rather not do.
> Are there any better ways to it?
> >
> > Thanks,
> >
> > -Brett
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
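
For reference, a sketch of the Nautilus live-migration flow Jason describes,
using the pool/image names from the thread; the source image should not be in
active use by clients that are unaware of the migration while it runs:

rbd migration prepare cluster01-volumes/volume-12345 cluster02-volumes/volume-12345
rbd migration execute cluster02-volumes/volume-12345
rbd migration commit cluster02-volumes/volume-12345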


[ceph-users] Possible to move RBD volumes between pools?

2019-06-19 Thread Brett Chancellor
Background: We have a few ceph clusters, each serving multiple Openstack
clusters. Each cluster has its own set of pools.

I'd like to move ~50TB of volumes from an old cluster (we'll call the pool
cluster01-volumes) to an existing pool (cluster02-volumes) to later be
imported by a different Openstack cluster. I could run something like
this...
rbd export cluster01-volumes/volume-12345 | rbd import
cluster02-volumes/volume-12345 .

But that would be slow and duplicate the data, which I'd rather not do. Are
there any better ways to do it?

Thanks,

-Brett
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueFS spillover detected - 14.2.1

2019-06-18 Thread Brett Chancellor
Thanks Igor. I'm fine turning the warnings off, but it's curious that only
this cluster is showing the alerts.  Is there any value in rebuilding them
with smaller SSD metadata volumes? Say 60GB or 30GB?

-Brett

On Tue, Jun 18, 2019 at 1:55 PM Igor Fedotov  wrote:

> Hi Brett,
>
> this issue has been with you since long before the upgrade to 14.2.1. The
> upgrade just made the corresponding alert visible.
>
> You can turn the alert off by setting
> bluestore_warn_on_bluefs_spillover=false.
>
> But generally this warning shows a DB data layout inefficiency - some data
> is kept on the slow device - which might have some negative performance impact.
>
> Unfortunately that's a known issue with the current RocksDB/BlueStore
> interaction - spillovers to the slow device might take place even when there is
> plenty of free space on the fast one.
>
>
> Thanks,
>
> Igor
>
>
>
> On 6/18/2019 8:46 PM, Brett Chancellor wrote:
>
> Does anybody have a fix for BlueFS spillover detected? This started
> happening 2 days after an upgrade to 14.2.1 and has increased from 3 OSDs
> to 118 in the last 4 days.  I read you could fix it by rebuilding the OSDs,
> but rebuilding the 264 OSDs on this cluster will take months of
> rebalancing.
>
> $ sudo ceph health detail
> HEALTH_WARN BlueFS spillover detected on 118 OSD(s)
> BLUEFS_SPILLOVER BlueFS spillover detected on 118 OSD(s)
>  osd.0 spilled over 22 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.1 spilled over 103 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.5 spilled over 21 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.6 spilled over 64 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.11 spilled over 22 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.13 spilled over 23 GiB metadata from 'db' device (29 GiB used of
> 148 GiB) to slow device
>  osd.21 spilled over 102 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.22 spilled over 103 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.23 spilled over 24 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.24 spilled over 25 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.25 spilled over 24 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.26 spilled over 64 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.27 spilled over 21 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.30 spilled over 65 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.32 spilled over 21 GiB metadata from 'db' device (29 GiB used of
> 148 GiB) to slow device
>  osd.34 spilled over 24 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.42 spilled over 25 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.45 spilled over 103 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.46 spilled over 24 GiB metadata from 'db' device (29 GiB used of
> 148 GiB) to slow device
>  osd.47 spilled over 63 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.48 spilled over 63 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.49 spilled over 62 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.50 spilled over 24 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.52 spilled over 140 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.53 spilled over 22 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.54 spilled over 59 GiB metadata from 'db' device (29 GiB used of
> 148 GiB) to slow device
>  osd.55 spilled over 134 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.56 spilled over 19 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.57 spilled over 61 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slow device
>  osd.58 spilled over 66 GiB metadata from 'db' device (28 GiB used of
> 148 GiB) to slo
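
For reference, Igor's suggestion above can be applied at runtime (Nautilus
config syntax assumed):

ceph config set osd bluestore_warn_on_bluefs_spillover false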

[ceph-users] BlueFS spillover detected - 14.2.1

2019-06-18 Thread Brett Chancellor
Does anybody have a fix for BlueFS spillover detected? This started
happening 2 days after an upgrade to 14.2.1 and has increased from 3 OSDs
to 118 in the last 4 days.  I read you could fix it by rebuilding the OSDs,
but rebuilding the 264 OSDs on this cluster will take months of rebalancing.

$ sudo ceph health detail
HEALTH_WARN BlueFS spillover detected on 118 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 118 OSD(s)
 osd.0 spilled over 22 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.1 spilled over 103 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.5 spilled over 21 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.6 spilled over 64 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.11 spilled over 22 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.13 spilled over 23 GiB metadata from 'db' device (29 GiB used of
148 GiB) to slow device
 osd.21 spilled over 102 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.22 spilled over 103 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.23 spilled over 24 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.24 spilled over 25 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.25 spilled over 24 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.26 spilled over 64 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.27 spilled over 21 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.30 spilled over 65 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.32 spilled over 21 GiB metadata from 'db' device (29 GiB used of
148 GiB) to slow device
 osd.34 spilled over 24 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.42 spilled over 25 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.45 spilled over 103 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.46 spilled over 24 GiB metadata from 'db' device (29 GiB used of
148 GiB) to slow device
 osd.47 spilled over 63 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.48 spilled over 63 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.49 spilled over 62 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.50 spilled over 24 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.52 spilled over 140 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.53 spilled over 22 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.54 spilled over 59 GiB metadata from 'db' device (29 GiB used of
148 GiB) to slow device
 osd.55 spilled over 134 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.56 spilled over 19 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.57 spilled over 61 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.58 spilled over 66 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.59 spilled over 24 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.61 spilled over 24 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.62 spilled over 59 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.65 spilled over 19 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.67 spilled over 62 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.69 spilled over 20 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.71 spilled over 21 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.73 spilled over 24 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.74 spilled over 17 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.75 spilled over 24 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.76 spilled over 22 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.78 spilled over 64 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.80 spilled over 100 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.81 spilled over 63 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.82 spilled over 24 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.83 spilled over 60 GiB metadata from 'db' device (28 GiB used of
148 GiB) to slow device
 osd.84 spilled over 60 GiB metadata from 'db' device

Re: [ceph-users] Nautilus HEALTH_WARN for msgr2 protocol

2019-06-14 Thread Brett Chancellor
If you don't figure out how to enable it on your monitor, you can always
disable it to squash the warnings
*ceph config set mon.node01 ms_bind_msgr2 false*

On Fri, Jun 14, 2019 at 12:11 PM Bob Farrell  wrote:

> Hi. Firstly thanks to all involved in this great mailing list, I learn
> lots from it every day.
>
> We are running Ceph with a huge amount of success to store website
> themes/templates across a large collection of websites. We are very pleased
> with the solution in every way.
>
> The only issue we have, which we have had since day 1, is we always see
> HEALTH_WARN:
>
> health: HEALTH_WARN
> 1 monitors have not enabled msgr2
>
> And this is reflected in the monmap:
>
> monmaptool: monmap file /tmp/monmap
> epoch 7
> fsid 7273720d-04d7-480f-a77c-f0207ae35852
> last_changed 2019-04-02 17:21:56.935381
> created 2019-04-02 17:21:09.925941
> min_mon_release 14 (nautilus)
> 0: v1:172.30.0.144:6789/0 mon.node01.homeflow.co.uk
> 1: [v2:172.30.0.146:3300/0,v1:172.30.0.146:6789/0]
> mon.node03.homeflow.co.uk
> 2: [v2:172.30.0.147:3300/0,v1:172.30.0.147:6789/0]
> mon.node04.homeflow.co.uk
> 3: [v2:172.30.0.148:3300/0,v1:172.30.0.148:6789/0]
> mon.node05.homeflow.co.uk
> 4: [v2:172.30.0.145:3300/0,v1:172.30.0.145:6789/0]
> mon.node02.homeflow.co.uk
> 5: [v2:172.30.0.149:3300/0,v1:172.30.0.149:6789/0]
> mon.node06.homeflow.co.uk
> 6: [v2:172.30.0.150:3300/0,v1:172.30.0.150:6789/0]
> mon.node07.homeflow.co.uk
>
> I never figured out the correct syntax to set up the first monitor to use
> both 6789 and 3300. The other monitors that join the cluster set this
> config automatically but I couldn't work out how to apply it to the first
> monitor node.
>
> The cluster has been operating in production for at least a month now with
> no issues at all, so it would be nice to remove this warning as, at the
> moment, it's not really very useful as a monitoring metric.
>
> Could somebody advise me on the safest/most sensible way to update the
> monmap so that node01 listens on v2 and v1 ?
>
> Thanks for any help !
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
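
Two possible ways to get that first monitor onto msgr2, sketched from the
thread above; the name and addresses are taken from the quoted monmap, and
set-addrs should be used with care since it rewrites the monitor's entry (the
daemon may need a restart to bind the new port):

ceph mon enable-msgr2
## or, explicitly give node01 both a v2 and a v1 address:
ceph mon set-addrs node01.homeflow.co.uk [v2:172.30.0.144:3300,v1:172.30.0.144:6789]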


Re: [ceph-users] radosgw dying

2019-06-09 Thread Brett Chancellor
radosgw will try to create all of the default pools if they are missing.
The number of pools changes depending on the version, but it's somewhere
around 5.

On Sun, Jun 9, 2019, 1:00 PM  wrote:

> Huan;
>
> I get that, but the pool already exists, why is radosgw trying to create
> one?
>
> Dominic Hilsbos
>
> Get Outlook for Android 
>
>
>
>
> On Sat, Jun 8, 2019 at 2:55 AM -0700, "huang jun" 
> wrote:
>
> From the error message, I'm inclined to think that 'mon_max_pg_per_osd' was exceeded.
>> You can check its value; its default is 250, so you
>> can have at most 1500 pgs (250 * 6 osds),
>> and for replicated pools with size=3 you can have 500 pgs across all pools.
>> You already have 448 pgs, so the next pool can create at most 500-448=52 pgs.
>>  于2019年6月8日周六 下午2:41写道:
>> >
>> > All;
>> >
>> > I have a test and demonstration cluster running (3 hosts, MON, MGR, 2x OSD 
>> > per host), and I'm trying to add a 4th host for gateway purposes.
>> >
>> > The radosgw process keeps dying with:
>> > 2019-06-07 15:59:50.700 7fc4ef273780  0 ceph version 14.2.1 
>> > (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable), process 
>> > radosgw, pid 17588
>> > 2019-06-07 15:59:51.358 7fc4ef273780  0 rgw_init_ioctx ERROR: 
>> > librados::Rados::pool_create returned (34) Numerical result out of range 
>> > (this can be due to a pool or placement group misconfiguration, e.g. 
>> > pg_num < pgp_num or mon_max_pg_per_osd exceeded)
>> > 2019-06-07 15:59:51.396 7fc4ef273780 -1 Couldn't init storage provider 
>> > (RADOS)
>> >
>> > The .rgw.root pool already exists.
>> >
>> > ceph status returns:
>> >   cluster:
>> > id: 1a8a1693-fa54-4cb3-89d2-7951d4cee6a3
>> > health: HEALTH_OK
>> >
>> >   services:
>> > mon: 3 daemons, quorum S700028,S700029,S700030 (age 30m)
>> > mgr: S700028(active, since 47h), standbys: S700030, S700029
>> > osd: 6 osds: 6 up (since 2d), 6 in (since 3d)
>> >
>> >   data:
>> > pools:   5 pools, 448 pgs
>> > objects: 12 objects, 1.2 KiB
>> > usage:   722 GiB used, 65 TiB / 66 TiB avail
>> > pgs: 448 active+clean
>> >
>> > and ceph osd tree returns:
>> > ID CLASS WEIGHT   TYPE NAMESTATUS REWEIGHT PRI-AFF
>> > -1   66.17697 root default
>> > -5   22.05899 host S700029
>> >  2   hdd 11.02950 osd.2up  1.0 1.0
>> >  3   hdd 11.02950 osd.3up  1.0 1.0
>> > -7   22.05899 host S700030
>> >  4   hdd 11.02950 osd.4up  1.0 1.0
>> >  5   hdd 11.02950 osd.5up  1.0 1.0
>> > -3   22.05899 host s700028
>> >  0   hdd 11.02950 osd.0up  1.0 1.0
>> >  1   hdd 11.02950 osd.1up  1.0 1.0
>> >
>> > Any thoughts on what I'm missing?
>> >
>> > Thank you,
>> >
>> > Dominic L. Hilsbos, MBA
>> > Director - Information Technology
>> > Perform Air International Inc.
>> > dhils...@performair.com
>> > www.PerformAir.com
>> >
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Thank you!
>> HuangJun
>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
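
Two ways out of the pool-creation failure discussed above, as a rough sketch;
the limit and pg counts are illustrative, and the pool names assume the stock
"default" rgw zone:

## either give the monitors more PG headroom per OSD
ceph config set global mon_max_pg_per_osd 400
## or pre-create the rgw pools with small pg counts so radosgw finds them
for p in default.rgw.control default.rgw.meta default.rgw.log; do
  ceph osd pool create $p 8 8 replicated
done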


Re: [ceph-users] Radosgw in container

2019-06-05 Thread Brett Chancellor
It works okay. You need a ceph.conf and a generic radosgw cephx key. That's
it.

On Wed, Jun 5, 2019, 5:37 AM Marc Roos  wrote:

>
>
> Has anyone put the radosgw in a container? What files do I need to put
> in the sandbox directory? Are there other things I should consider?
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
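
A sketch of the "generic radosgw cephx key" Brett mentions; the client name is
hypothetical and the caps shown are the usual ones for an rgw daemon:

ceph auth get-or-create client.rgw.gateway1 mon 'allow rw' osd 'allow rwx' \
  -o /etc/ceph/ceph.client.rgw.gateway1.keyring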


Re: [ceph-users] Fixing a HEALTH_ERR situation

2019-05-18 Thread Brett Chancellor
It won't kick off right away if other deep scrubs are running on those OSDs.
You can set nodeep-scrub on the cluster, wait till your other deep scrubs
have finished, then turn deep scrubs back on and immediately run the
repair. You should see that pg do a deep scrub and then repair.

On Sat, May 18, 2019, 6:41 PM Jorge Garcia  wrote:

> I have tried ceph pg repair several times. It claims "instructing pg
> 2.798s0 on osd.41 to repair" but then nothing happens as far as I can tell.
> Any way of knowing if it's doing more?
>
> On Sat, May 18, 2019 at 3:33 PM Brett Chancellor <
> bchancel...@salesforce.com> wrote:
>
>> I would try the ceph pg repair. If you see the pg go into deep scrubbing,
>> then back to inconsistent you probably have a bad drive. Find which of the
>> drives in the pg are bad (pg query or go to the host and look through
>> dmesg). Take that osd offline and mark it out. Once backfill is complete,
>> it should clear up.
>>
>> On Sat, May 18, 2019, 6:05 PM Jorge Garcia  wrote:
>>
>>> We are testing a ceph cluster mostly using cephfs. We are using an
>>> erasure-code pool, and have been loading it up with data. Recently, we got
>>> a HEALTH_ERR response when we were querying the ceph status. We stopped all
>>> activity to the filesystem, and waited to see if the error would go away.
>>> It didn't. Then we tried a couple of suggestions from the internet (ceph pg
>>> repair, ceph pg scrub, ceph pg deep-scrub) to no avail. I'm not sure how to
>>> find out more information about what the problem is, and how to repair the
>>> filesystem to bring it back to normal health. Any suggestions?
>>>
>>> Current status:
>>>
>>> # ceph -s
>>>
>>>   cluster:
>>>
>>> id: 28ef32f1-4350-491b-9003-b19b9c3a2076
>>>
>>> health: HEALTH_ERR
>>>
>>> 5 scrub errors
>>>
>>> Possible data damage: 1 pg inconsistent
>>>
>>>
>>>
>>>   services:
>>>
>>> mon: 3 daemons, quorum gi-cba-01,gi-cba-02,gi-cba-03
>>>
>>> mgr: gi-cba-01(active), standbys: gi-cba-02, gi-cba-03
>>>
>>> mds: backups-1/1/1 up  {0=gi-cbmd=up:active}
>>>
>>> osd: 87 osds: 87 up, 87 in
>>>
>>>
>>>
>>>   data:
>>>
>>> pools:   2 pools, 4096 pgs
>>>
>>> objects: 90.98 M objects, 134 TiB
>>>
>>> usage:   210 TiB used, 845 TiB / 1.0 PiB avail
>>>
>>> pgs: 4088 active+clean
>>>
>>>  5active+clean+scrubbing+deep
>>>
>>>  2active+clean+scrubbing
>>>
>>>  1active+clean+inconsistent
>>>
>>> # ceph health detail
>>>
>>> HEALTH_ERR 5 scrub errors; Possible data damage: 1 pg inconsistent
>>>
>>> OSD_SCRUB_ERRORS 5 scrub errors
>>>
>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>>
>>> pg 2.798 is active+clean+inconsistent, acting [41,50,17,2,86,70,61]
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
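
The sequence Brett describes, sketched as commands against the pg from this
thread (2.798); the point is that the repair only runs once a deep-scrub slot
is free:

ceph osd set nodeep-scrub        ## stop new deep scrubs from being scheduled
## wait for the in-flight deep scrubs to finish (watch ceph -s), then:
ceph osd unset nodeep-scrub
ceph pg repair 2.798
rados list-inconsistent-obj 2.798 --format=json-pretty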


Re: [ceph-users] Fixing a HEALTH_ERR situation

2019-05-18 Thread Brett Chancellor
I would try the ceph pg repair. If you see the pg go into deep scrubbing,
then back to inconsistent you probably have a bad drive. Find which of the
drives in the pg are bad (pg query or go to the host and look through
dmesg). Take that osd offline and mark it out. Once backfill is complete,
it should clear up.

On Sat, May 18, 2019, 6:05 PM Jorge Garcia  wrote:

> We are testing a ceph cluster mostly using cephfs. We are using an
> erasure-code pool, and have been loading it up with data. Recently, we got
> a HEALTH_ERR response when we were querying the ceph status. We stopped all
> activity to the filesystem, and waited to see if the error would go away.
> It didn't. Then we tried a couple of suggestions from the internet (ceph pg
> repair, ceph pg scrub, ceph pg deep-scrub) to no avail. I'm not sure how to
> find out more information about what the problem is, and how to repair the
> filesystem to bring it back to normal health. Any suggestions?
>
> Current status:
>
> # ceph -s
>
>   cluster:
>
> id: 28ef32f1-4350-491b-9003-b19b9c3a2076
>
> health: HEALTH_ERR
>
> 5 scrub errors
>
> Possible data damage: 1 pg inconsistent
>
>
>
>   services:
>
> mon: 3 daemons, quorum gi-cba-01,gi-cba-02,gi-cba-03
>
> mgr: gi-cba-01(active), standbys: gi-cba-02, gi-cba-03
>
> mds: backups-1/1/1 up  {0=gi-cbmd=up:active}
>
> osd: 87 osds: 87 up, 87 in
>
>
>
>   data:
>
> pools:   2 pools, 4096 pgs
>
> objects: 90.98 M objects, 134 TiB
>
> usage:   210 TiB used, 845 TiB / 1.0 PiB avail
>
> pgs: 4088 active+clean
>
>  5active+clean+scrubbing+deep
>
>  2active+clean+scrubbing
>
>  1active+clean+inconsistent
>
> # ceph health detail
>
> HEALTH_ERR 5 scrub errors; Possible data damage: 1 pg inconsistent
>
> OSD_SCRUB_ERRORS 5 scrub errors
>
> PG_DAMAGED Possible data damage: 1 pg inconsistent
>
> pg 2.798 is active+clean+inconsistent, acting [41,50,17,2,86,70,61]
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
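
If the drive does turn out to be bad, the usual follow-up looks roughly like
this (shown here with osd.41, the primary of the pg above; substitute
whichever replica is actually failing):

dmesg | grep -i 'I/O error'      ## on the OSD host, look for medium/IO errors
systemctl stop ceph-osd@41
ceph osd out 41
## backfill then re-creates the pg's data on a healthy OSD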


Re: [ceph-users] PG scrub stamps reset to 0.000000 in 14.2.1

2019-05-17 Thread Brett Chancellor
Not sure if it's related, but this only happens to PGs whose primary OSD
is one where osd_numa_node has been set.

On Wed, May 15, 2019 at 7:13 PM Brett Chancellor 
wrote:

> After upgrading from 14.2.0 to 14.2.1, I've noticed PGs are frequently
> resetting their scrub and deep scrub time stamps to 0.00.  It's extra
> strange because the peers show timestamps for deep scrubs.
>
> ## First entry from a pg list at 7pm
> $ grep 11.2f2 ~/pgs-active.7pm
> 11.2f2 6910 0   0 2897477632   0
> 0 2091 active+clean3h  7378'12291   8048:36261[1,6,37]p1
> [1,6,37]p1 2019-05-14 21:01:29.172460 2019-05-14 21:01:29.172460
>
> ## Next Entry 3 minutes later
> $ ceph pg ls active |grep 11.2f2
> 11.2f2 6950 0   0 2914713600   0
> 0 2091 active+clean6s  7378'12291   8049:36330[1,6,37]p1
> [1,6,37]p1   0.00   0.00
>
> ## PG Query
> {
> "state": "active+clean",
> "snap_trimq": "[]",
> "snap_trimq_len": 0,
> "epoch": 8049,
> "up": [
> 1,
> 6,
> 37
> ],
> "acting": [
> 1,
> 6,
> 37
> ],
> "acting_recovery_backfill": [
> "1",
> "6",
> "37"
> ],
> "info": {
> "pgid": "11.2f2",
> "last_update": "7378'12291",
> "last_complete": "7378'12291",
> "log_tail": "1087'10200",
> "last_user_version": 12291,
> "last_backfill": "MAX",
> "last_backfill_bitwise": 1,
> "purged_snaps": [],
> "history": {
> "epoch_created": 1549,
> "epoch_pool_created": 216,
> "last_epoch_started": 6148,
> "last_interval_started": 6147,
> "last_epoch_clean": 6148,
> "last_interval_clean": 6147,
> "last_epoch_split": 6147,
> "last_epoch_marked_full": 0,
> "same_up_since": 6126,
> "same_interval_since": 6147,
> "same_primary_since": 6126,
> "last_scrub": "7378'12291",
> "last_scrub_stamp": "0.00",
> "last_deep_scrub": "6103'12186",
> "last_deep_scrub_stamp": "0.00",
> "last_clean_scrub_stamp": "2019-05-15 23:08:17.014575"
> },
> "stats": {
> "version": "7378'12291",
> "reported_seq": "36700",
> "reported_epoch": "8049",
> "state": "active+clean",
> "last_fresh": "2019-05-15 23:08:17.014609",
> "last_change": "2019-05-15 23:08:17.014609",
> "last_active": "2019-05-15 23:08:17.014609",
> "last_peered": "2019-05-15 23:08:17.014609",
> "last_clean": "2019-05-15 23:08:17.014609",
> "last_became_active": "2019-05-15 19:25:01.484322",
> "last_became_peered": "2019-05-15 19:25:01.484322",
> "last_unstale": "2019-05-15 23:08:17.014609",
> "last_undegraded": "2019-05-15 23:08:17.014609",
> "last_fullsized": "2019-05-15 23:08:17.014609",
> "mapping_epoch": 6126,
> "log_start": "1087'10200",
> "ondisk_log_start": "1087'10200",
> "created": 1549,
> "last_epoch_clean": 6148,
> "parent": "0.0",
> "parent_split_bits": 10,
> "last_scrub": "7378'12291",
> "last_scrub_stamp": "0.00",
> "last_deep_scrub": "6103'12186",
> "last_deep_scrub_stamp": "0.00",
> "last_clean_scrub_stamp": "2019-05-15 23:08:17.014575",
> "log_size": 209

[ceph-users] PG scrub stamps reset to 0.000000 in 14.2.1

2019-05-15 Thread Brett Chancellor
After upgrading from 14.2.0 to 14.2.1, I've noticed PGs are frequently
resetting their scrub and deep scrub time stamps to 0.00.  It's extra
strange because the peers show timestamps for deep scrubs.

## First entry from a pg list at 7pm
$ grep 11.2f2 ~/pgs-active.7pm
11.2f2 6910 0   0 2897477632   0  0
2091 active+clean3h  7378'12291   8048:36261[1,6,37]p1
[1,6,37]p1 2019-05-14 21:01:29.172460 2019-05-14 21:01:29.172460

## Next Entry 3 minutes later
$ ceph pg ls active |grep 11.2f2
11.2f2 6950 0   0 2914713600   0  0
2091 active+clean6s  7378'12291   8049:36330[1,6,37]p1
[1,6,37]p1   0.00   0.00

## PG Query
{
"state": "active+clean",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 8049,
"up": [
1,
6,
37
],
"acting": [
1,
6,
37
],
"acting_recovery_backfill": [
"1",
"6",
"37"
],
"info": {
"pgid": "11.2f2",
"last_update": "7378'12291",
"last_complete": "7378'12291",
"log_tail": "1087'10200",
"last_user_version": 12291,
"last_backfill": "MAX",
"last_backfill_bitwise": 1,
"purged_snaps": [],
"history": {
"epoch_created": 1549,
"epoch_pool_created": 216,
"last_epoch_started": 6148,
"last_interval_started": 6147,
"last_epoch_clean": 6148,
"last_interval_clean": 6147,
"last_epoch_split": 6147,
"last_epoch_marked_full": 0,
"same_up_since": 6126,
"same_interval_since": 6147,
"same_primary_since": 6126,
"last_scrub": "7378'12291",
"last_scrub_stamp": "0.00",
"last_deep_scrub": "6103'12186",
"last_deep_scrub_stamp": "0.00",
"last_clean_scrub_stamp": "2019-05-15 23:08:17.014575"
},
"stats": {
"version": "7378'12291",
"reported_seq": "36700",
"reported_epoch": "8049",
"state": "active+clean",
"last_fresh": "2019-05-15 23:08:17.014609",
"last_change": "2019-05-15 23:08:17.014609",
"last_active": "2019-05-15 23:08:17.014609",
"last_peered": "2019-05-15 23:08:17.014609",
"last_clean": "2019-05-15 23:08:17.014609",
"last_became_active": "2019-05-15 19:25:01.484322",
"last_became_peered": "2019-05-15 19:25:01.484322",
"last_unstale": "2019-05-15 23:08:17.014609",
"last_undegraded": "2019-05-15 23:08:17.014609",
"last_fullsized": "2019-05-15 23:08:17.014609",
"mapping_epoch": 6126,
"log_start": "1087'10200",
"ondisk_log_start": "1087'10200",
"created": 1549,
"last_epoch_clean": 6148,
"parent": "0.0",
"parent_split_bits": 10,
"last_scrub": "7378'12291",
"last_scrub_stamp": "0.00",
"last_deep_scrub": "6103'12186",
"last_deep_scrub_stamp": "0.00",
"last_clean_scrub_stamp": "2019-05-15 23:08:17.014575",
"log_size": 2091,
"ondisk_log_size": 2091,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": true,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 2914713600,
"num_objects": 695,
"num_object_clones": 0,
"num_object_copies": 2085,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 695,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
  

Re: [ceph-users] ceph nautilus deep-scrub health error

2019-05-14 Thread Brett Chancellor
You can increase your scrub intervals.
osd deep scrub interval
osd scrub max interval

On Tue, May 14, 2019 at 7:00 AM EDH - Manuel Rios Fernandez <
mrios...@easydatahost.com> wrote:

> Hi Muthu
>
>
>
> We found the same issue with nearly 2000 pgs not deep-scrubbed in time.
>
>
>
> We’re manually force-scrubbing with:
>
>
>
> ceph health detail | grep -i not | awk '{print $2}' | while read i; do
> ceph pg deep-scrub ${i}; done
>
>
>
> It launches around 20-30 pgs to be deep-scrubbed. I think you can improve
> this with a sleep of 120 secs between scrubs to prevent overloading your osds.
>
>
>
> To disable deep-scrub you can use “ceph osd set nodeep-scrub”. You can also
> set up deep-scrub with thresholds:
>
> #Start Scrub 22:00
>
> osd scrub begin hour = 22
>
> #Stop Scrub 8
>
> osd scrub end hour = 8
>
> #Scrub Load 0.5
>
> osd scrub load threshold = 0.5
>
>
>
> Regards,
>
>
>
> Manuel
>
>
>
>
>
>
>
>
>
> *De:* ceph-users  *En nombre de *nokia
> ceph
> *Enviado el:* martes, 14 de mayo de 2019 11:44
> *Para:* Ceph Users 
> *Asunto:* [ceph-users] ceph nautilus deep-scrub health error
>
>
>
> Hi Team,
>
>
>
> After upgrading from Luminous to Nautilus, we see a "654 pgs not
> deep-scrubbed in time" error in ceph status. How can we disable this warning?
> In our setup we disable deep-scrubbing for performance reasons.
>
>
>
> Thanks,
>
> Muthu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
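
For reference, the intervals Brett lists can be raised at runtime; the values
below are illustrative (4 weeks for deep scrub, 2 weeks as the shallow-scrub
ceiling). If deep scrubbing stays disabled entirely, the warning will
eventually return once PGs age past whatever interval is set.

ceph config set osd osd_deep_scrub_interval 2419200
ceph config set osd osd_scrub_max_interval 1209600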


Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-06 Thread Brett Chancellor
This seems right. You are doing a single benchmark from a single client.
Your limiting factor will be the network latency. For most networks this is
between 0.2 and 0.3ms.  If you're trying to test the potential of your
cluster, you'll need multiple workers and clients.

On Thu, Feb 7, 2019, 2:17 AM  wrote:
> Hi List
>
> We are in the process of moving to the next use case for our ceph cluster
> (bulk, cheap, slow, erasure-coded cephfs storage was the first - and
> that works fine).
>
> We're currently on luminous / bluestore, if upgrading is deemed to
> change what we're seeing then please let us know.
>
> We have 6 OSD hosts, each with one 1TB S4510 SSD, connected
> through an H700 MegaRAID Perc BBWC (each disk as a single-disk RAID0) - and
> the scheduler set to deadline, nomerges = 1, rotational = 0.
>
> Each disk "should" give approximately 36K IOPS random write and double
> that for random read.
>
> The pool is set up with 3x replication. We would like a "scale-out" setup of
> well-performing SSD block devices - potentially to host databases and
> things like that. I read through this nice document [0]; I know the
> HW is radically different from mine, but I still think I'm at the
> very low end of what 6 x S4510 should be capable of doing.
>
> Since it is IOPS I care about, I have lowered the block size to 4096 -- a 4M
> block size nicely saturates the NICs in both directions.
>
>
> $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
> up to 10 seconds or 0 objects
> Object prefix: benchmark_data_torsk2_11207
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
> 0   0 0 0 0 0   -
>  0
> 1  16  5857  5841   22.8155   22.8164  0.00238437
> 0.00273434
> 2  15 11768 11753   22.9533   23.0938   0.0028559
> 0.00271944
> 3  16 17264 17248   22.4564   21.4648  0.0024
> 0.00278101
> 4  16 22857 22841   22.3037   21.84770.002716
> 0.00280023
> 5  16 28462 28446   22.2213   21.8945  0.00220186
> 0.002811
> 6  16 34216 34200   22.2635   22.4766  0.00234315
> 0.00280552
> 7  16 39616 39600   22.0962   21.0938  0.00290661
> 0.00282718
> 8  16 45510 45494   22.2118   23.0234   0.0033541
> 0.00281253
> 9  16 50995 50979   22.1243   21.4258  0.00267282
> 0.00282371
>10  16 56745 56729   22.1577   22.4609  0.00252583
>  0.0028193
> Total time run: 10.002668
> Total writes made:  56745
> Write size: 4096
> Object size:4096
> Bandwidth (MB/sec): 22.1601
> Stddev Bandwidth:   0.712297
> Max bandwidth (MB/sec): 23.0938
> Min bandwidth (MB/sec): 21.0938
> Average IOPS:   5672
> Stddev IOPS:182
> Max IOPS:   5912
> Min IOPS:   5400
> Average Latency(s): 0.00281953
> Stddev Latency(s):  0.00190771
> Max latency(s): 0.0834767
> Min latency(s): 0.00120945
>
> Min latency is fine -- but Max latency of 83ms ?
> Average IOPS @ 5672 ?
>
> $ sudo rados bench -p scbench  10 rand
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
> 0   0 0 0 0 0   -
>  0
> 1  15 23329 23314   91.0537   91.0703 0.000349856
> 0.000679074
> 2  16 48555 48539   94.7884   98.5352 0.000499159
> 0.000652067
> 3  16 76193 76177   99.1747   107.961 0.000443877
> 0.000622775
> 4  15103923103908   101.459   108.324 0.000678589
> 0.000609182
> 5  15132720132705   103.663   112.488 0.000741734
> 0.000595998
> 6  15161811161796   105.323   113.637 0.000333166
> 0.000586323
> 7  15190196190181   106.115   110.879 0.000612227
> 0.000582014
> 8  15221155221140   107.966   120.934 0.000471219
> 0.000571944
> 9  16251143251127   108.984   117.137 0.000267528
> 0.000566659
> Total time run:   10.000640
> Total reads made: 282097
> Read size:4096
> Object size:  4096
> Bandwidth (MB/sec):   110.187
> Average IOPS: 28207
> Stddev IOPS:  2357
> Max IOPS: 30959
> Min IOPS: 23314
> Average Latency(s):   0.000560402
> Max latency(s):   0.109804
> Min latency(s):   0.000212671
>
> This is also quite far from expected. I have 12GB of memory on the OSD
> daemon for caching on each host - close to idle cluster - thus 50GB+ for
> caching with a working set of < 6GB .. this should - in this case
> not really be bound by the underlying SSD. But if it were:
>
> IOPS/disk * num disks / replication => 95K * 6 / 3 => 190K or 6x off?
>
> No measurable service time in iostat when running tests, thus I have
> come to the conclusion that it has to be either client side,
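
A sketch of what Brett suggests - more parallelism and more than one client.
The thread count, runtime and run-name are illustrative; repeat on each client
host with its own run-name so the object sets don't collide:

rados bench -p scbench -b 4096 -t 64 60 write --no-cleanup --run-name client1
rados bench -p scbench -t 64 60 rand --run-name client1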

Re: [ceph-users] Inconsistent PG, repair doesn't work

2018-10-11 Thread Brett Chancellor
This seems like a bug. If I'm kicking off a repair manually it should take
place immediately, and ignore flags such as max scrubs, or minimum scrub
window.

-Brett

On Thu, Oct 11, 2018 at 1:11 PM David Turner  wrote:

> Part of a repair is queuing a deep scrub. As soon as the repair part
> is over, the deep scrub continues until it is done.
>
> On Thu, Oct 11, 2018, 12:26 PM Brett Chancellor <
> bchancel...@salesforce.com> wrote:
>
>> Does the "repair" function use the same rules as a deep scrub? I couldn't
>> get one to kick off, until I temporarily increased the max_scrubs and
>> lowered the scrub_min_interval on all 3 OSDs for that placement group. This
>> ended up fixing the issue, so I'll leave this here in case somebody else
>> runs into it.
>>
>> sudo ceph tell 'osd.208' injectargs '--osd_max_scrubs 3'
>> sudo ceph tell 'osd.120' injectargs '--osd_max_scrubs 3'
>> sudo ceph tell 'osd.235' injectargs '--osd_max_scrubs 3'
>> sudo ceph tell 'osd.208' injectargs '--osd_scrub_min_interval 1.0'
>> sudo ceph tell 'osd.120' injectargs '--osd_scrub_min_interval 1.0'
>> sudo ceph tell 'osd.235' injectargs '--osd_scrub_min_interval 1.0'
>> sudo ceph pg repair 75.302
>>
>> -Brett
>>
>>
>> On Thu, Oct 11, 2018 at 8:42 AM Maks Kowalik 
>> wrote:
>>
>>> Imho moving was not the best idea (a copying attempt would have told you if
>>> the read error was the case here).
>>> Scrubs might not want to start if there are many other scrubs ongoing.
>>>
>>> czw., 11 paź 2018 o 14:27 Brett Chancellor 
>>> napisał(a):
>>>
>>>> I moved the file. But the cluster won't actually start any scrub/repair
>>>> I manually initiate.
>>>>
>>>> On Thu, Oct 11, 2018, 7:51 AM Maks Kowalik 
>>>> wrote:
>>>>
>>>>> Based on the log output it looks like you're having a damaged file on
>>>>> OSD 235 where the shard is stored.
>>>>> To ensure if that's the case you should find the file (using
>>>>> 81d5654895863d as a part of its name) and try to copy it to another
>>>>> directory.
>>>>> If you get the I/O error while copying, the next steps would be to
>>>>> delete the file, run the scrub on 75.302 and take a deep look at the
>>>>> OSD.235 for any other errors.
>>>>>
>>>>> Kind regards,
>>>>> Maks
>>>>>
>>>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PG, repair doesn't work

2018-10-11 Thread Brett Chancellor
Does the "repair" function use the same rules as a deep scrub? I couldn't
get one to kick off, until I temporarily increased the max_scrubs and
lowered the scrub_min_interval on all 3 OSDs for that placement group. This
ended up fixing the issue, so I'll leave this here in case somebody else
runs into it.

sudo ceph tell 'osd.208' injectargs '--osd_max_scrubs 3'
sudo ceph tell 'osd.120' injectargs '--osd_max_scrubs 3'
sudo ceph tell 'osd.235' injectargs '--osd_max_scrubs 3'
sudo ceph tell 'osd.208' injectargs '--osd_scrub_min_interval 1.0'
sudo ceph tell 'osd.120' injectargs '--osd_scrub_min_interval 1.0'
sudo ceph tell 'osd.235' injectargs '--osd_scrub_min_interval 1.0'
sudo ceph pg repair 75.302

-Brett


On Thu, Oct 11, 2018 at 8:42 AM Maks Kowalik  wrote:

> Imho moving was not the best idea (a copying attempt would have told you if
> the read error was the case here).
> Scrubs might not want to start if there are many other scrubs ongoing.
>
> czw., 11 paź 2018 o 14:27 Brett Chancellor 
> napisał(a):
>
>> I moved the file. But the cluster won't actually start any scrub/repair I
>> manually initiate.
>>
>> On Thu, Oct 11, 2018, 7:51 AM Maks Kowalik 
>> wrote:
>>
>>> Based on the log output it looks like you're having a damaged file on
>>> OSD 235 where the shard is stored.
>>> To ensure if that's the case you should find the file (using
>>> 81d5654895863d as a part of its name) and try to copy it to another
>>> directory.
>>> If you get the I/O error while copying, the next steps would be to
>>> delete the file, run the scrub on 75.302 and take a deep look at the
>>> OSD.235 for any other errors.
>>>
>>> Kind regards,
>>> Maks
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Inconsistent PG, repair doesn't work

2018-10-10 Thread Brett Chancellor
Hi all,
  I have an inconsistent PG. I've tried running a repair and manual deep
scrub, but neither operation seems to actually do anything.  I've also
tried stopping the primary OSD, removing the object, and restarting the
OSD. The system copies the object back, but the inconsistent PG ERR remains.

## Ceph Health
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 75.302 is active+clean+inconsistent, acting [208,120,235]

## OSD log
2018-10-10 13:43:08.734034 7feb3bf96700  0 log_channel(cluster) log [DBG] :
75.302 deep-scrub starts
2018-10-10 13:43:35.355037 7feb3bf96700 -1 log_channel(cluster) log [ERR] :
75.302 shard 235: soid
75:40d6b566:::rbd_data.81d5654895863d.1900:head candidate had a
read error
2018-10-10 13:44:06.476651 7feb3bf96700 -1 log_channel(cluster) log [ERR] :
75.302 deep-scrub 0 missing, 1 inconsistent objects
2018-10-10 13:44:06.476659 7feb3bf96700 -1 log_channel(cluster) log [ERR] :
75.302 deep-scrub 1 errors

## list-inconsistent-obj fails to report anything
$ sudo rados list-inconsistent-pg vir400-volumes
["75.302"]
$ sudo rados list-inconsistent-obj 75.302
No scrub information available for pg 75.302
error 2: (2) No such file or directory

## PG Query Information
https://pastebin.com/5wa3mWDC


Thanks,
-Brett
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hardware heterogeneous in same pool

2018-10-04 Thread Brett Chancellor
You could also set *osd_crush_initial_weight = 0*. New OSDs will
automatically come up with a 0 weight and you won't have to race the clock.

-Brett

On Thu, Oct 4, 2018 at 3:50 AM Janne Johansson  wrote:

>
>
> Den tors 4 okt. 2018 kl 00:09 skrev Bruno Carvalho :
>
>> Hi Cephers, I would like to know how you are growing the cluster.
>> Using dissimilar hardware in the same pool or creating a pool for each
>> different hardware group.
>> What problems would I have using different hardware (CPU,
>> memory, disk) in the same pool?
>
>
> I don't think CPU and RAM (and other hw-related things like HBA controller
> card brand) matter
> a lot; more is always nicer, but as long as you don't add worse machines,
> like Jonathan wrote, you
> should not see any degradation.
>
> What you might want to look out for is if the new disks are very uneven
> compared to the old
> setup, so if you used to have servers with 10x2TB drives and suddenly add
> one with 2x10TB,
> things might become very unbalanced, since those differences will not be
> handled seamlessly
> by the crush map.
>
> Apart from that, the only issues for us is "add drives, quickly set crush
> reweight to 0.0 before
> all existing OSD hosts shoot massive amounts of I/O on them, then script a
> slower raise of
> crush weight up to what they should end up at", to lessen the impact for
> our 24/7 operations.
>
> If you have weekends where noone accesses the cluster or night-time low-IO
> usage patterns,
> just upping the weight at the right hour might suffice.
>
> Lastly, for ssd/nvme setups with good networking, this is almost moot,
> they converge so fast
> its almost unfair. A real joy working with expanding flash-only
> pools/clusters.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
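
The option Brett mentions goes into ceph.conf on the OSD hosts (or via "ceph
config set osd osd_crush_initial_weight 0" on releases with the config
database); a minimal sketch:

[osd]
osd_crush_initial_weight = 0

Once you are ready for data to move, raise each new OSD toward its real
size-based weight, e.g. "ceph osd crush reweight osd.200 1.8" (the id and
weight here are illustrative).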


Re: [ceph-users] Help! OSDs across the cluster just crashed

2018-10-03 Thread Brett Chancellor
That turned out to be exactly the issue (And boy was it fun clearing pgs
out on 71 OSDs). I think it's caused by a combination of two factors.
1. This cluster has way too many placement groups per OSD (just north of
800). It was fine when we first created all the pools, but upgrades (most
recently to luminous 12.2.4) have cemented the fact that a high PG:OSD ratio
is a bad thing.
2. We had a host in a failed state for an extended period of time. That
host finally coming online is what triggered the event. The system dug
itself into a hole it couldn't get out of.

-Brett

On Wed, Oct 3, 2018 at 11:49 AM Gregory Farnum  wrote:

> Yeah, don't run these commands blind. They are changing the local metadata
> of the PG in ways that may make it inconsistent with the overall cluster
> and result in lost data.
>
> Brett, it seems this issue has come up several times in the field but we
> haven't been able to reproduce it locally or get enough info to debug
> what's going on: https://tracker.ceph.com/issues/21142
> Maybe run through that ticket and see if you can contribute new logs or
> add detail about possible sources?
> -Greg
>
> On Tue, Oct 2, 2018 at 3:18 PM Goktug Yildirim 
> wrote:
>
>> Hi,
>>
>> Sorry to hear that. I’ve been battling with mine for 2 weeks :/
>>
>> I’ve corrected my OSDs with the following commands. My OSD logs
>> (/var/log/ceph/ceph-OSDx.log) have a line including log(ERR) with the PG
>> number beside it, just before the crash dump.
>>
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op
>> trim-pg-log --pgid $2
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op
>> fix-lost --pgid $2
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op repair
>> --pgid $2
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op
>> mark-complete --pgid $2
>> systemctl restart ceph-osd@$1
>>
>> I don't know if it will work for you, but there may be no harm in trying it on one OSD.
>>
>> There is such less information about this tools. So it might be risky. I
>> hope someone much experienced could help more.
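If this has to be repeated across many OSD/PG pairs, a small wrapper in the
same spirit might look like the sketch below ($1 = OSD id, $2 = PG id as in the
commands above; note that the OSD has to be stopped before ceph-objectstore-tool
touches its store):

#!/bin/bash
# usage: ./repair-pg.sh <osd-id> <pg-id>   (hypothetical helper, not from the thread)
set -e
systemctl stop ceph-osd@$1
for op in trim-pg-log fix-lost repair mark-complete; do
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op $op --pgid $2
done
systemctl start ceph-osd@$1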
>>
>>
>> > On 2 Oct 2018, at 23:23, Brett Chancellor 
>> wrote:
>> >
>> > Help. I have a 60-node cluster and most of the OSDs decided to crash
>> themselves at the same time. They won't restart; the messages look like...
>> >
>> > --- begin dump of recent events ---
>> >  0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal
>> (Aborted) **
>> >  in thread 7f57ab5b7d80 thread_name:ceph-osd
>> >
>> >  ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
>> luminous (stable)
>> >  1: (()+0xa3c611) [0x556d618bb611]
>> >  2: (()+0xf6d0) [0x7f57a885e6d0]
>> >  3: (gsignal()+0x37) [0x7f57a787f277]
>> >  4: (abort()+0x148) [0x7f57a7880968]
>> >  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x284) [0x556d618fa6e4]
>> >  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
>> const&)+0x3b2) [0x556d615c74a2]
>> >  7: (PastIntervals::check_new_interval(int, int, std::vector> std::allocator > const&, std::vector >
>> const&, int, int, std::vector > const&,
>> std::vector > const&, unsigned int, unsigned int,
>> std::shared_ptr, std::shared_ptr, pg_t,
>> IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
>> [0x556d615ae6c0]
>> >  8: (OSD::build_past_intervals_parallel()+0x9ff) [0x556d613707af]
>> >  9: (OSD::load_pgs()+0x545) [0x556d61373095]
>> >  10: (OSD::init()+0x2169) [0x556d613919d9]
>> >  11: (main()+0x2d07) [0x556d61295dd7]
>> >  12: (__libc_start_main()+0xf5) [0x7f57a786b445]
>> >  13: (()+0x4b53e3) [0x556d613343e3]
>> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>> >
>> >
>> > Some hosts have no working OSDs, others seem to have 1 working, and 2
>> dead.  It's spread all across the cluster, across several different racks.
>> Any idea on where to look next? The cluster is dead in the water right now.
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Help! OSDs across the cluster just crashed

2018-10-02 Thread Brett Chancellor
Help. I have a 60-node cluster and most of the OSDs decided to crash
themselves at the same time. They won't restart; the messages look like...

--- begin dump of recent events ---
 0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal
(Aborted) **
 in thread 7f57ab5b7d80 thread_name:ceph-osd

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous
(stable)
 1: (()+0xa3c611) [0x556d618bb611]
 2: (()+0xf6d0) [0x7f57a885e6d0]
 3: (gsignal()+0x37) [0x7f57a787f277]
 4: (abort()+0x148) [0x7f57a7880968]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x284) [0x556d618fa6e4]
 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
const&)+0x3b2) [0x556d615c74a2]
 7: (PastIntervals::check_new_interval(int, int, std::vector > const&, std::vector >
const&, int, int, std::vector > const&,
std::vector > const&, unsigned int, unsigned int,
std::shared_ptr, std::shared_ptr, pg_t,
IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
[0x556d615ae6c0]
 8: (OSD::build_past_intervals_parallel()+0x9ff) [0x556d613707af]
 9: (OSD::load_pgs()+0x545) [0x556d61373095]
 10: (OSD::init()+0x2169) [0x556d613919d9]
 11: (main()+0x2d07) [0x556d61295dd7]
 12: (__libc_start_main()+0xf5) [0x7f57a786b445]
 13: (()+0x4b53e3) [0x556d613343e3]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.


Some hosts have no working OSDs, others seem to have 1 working, and 2
dead.  It's spread all across the cluster, across several different racks.
Any idea on where to look next? The cluster is dead in the water right now.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore DB showing as ssd

2018-09-21 Thread Brett Chancellor
Hi all. Quick question about osd metadata information. I have several OSDs
set up with the data dir on HDD and the db going to a partition on ssd. But
when I look at the metadata for all the OSDs, it's showing the db as "hdd".
Does this affect anything? And is there any way to change it? (A check of the
kernel's rotational flag is sketched after the metadata dump below.)

$ sudo ceph osd metadata 1
{
"id": 1,
"arch": "x86_64",
"back_addr": ":6805/2053608",
"back_iface": "eth0",
"bluefs": "1",
"bluefs_db_access_mode": "blk",
"bluefs_db_block_size": "4096",
"bluefs_db_dev": "8:80",
"bluefs_db_dev_node": "sdf",
"bluefs_db_driver": "KernelDevice",
"bluefs_db_model": "PERC H730 Mini  ",
"bluefs_db_partition_path": "/dev/sdf2",
"bluefs_db_rotational": "1",
"bluefs_db_size": "266287972352",
*"bluefs_db_type": "hdd",*
"bluefs_single_shared_device": "0",
"bluefs_slow_access_mode": "blk",
"bluefs_slow_block_size": "4096",
"bluefs_slow_dev": "253:1",
"bluefs_slow_dev_node": "dm-1",
"bluefs_slow_driver": "KernelDevice",
"bluefs_slow_model": "",
"bluefs_slow_partition_path": "/dev/dm-1",
"bluefs_slow_rotational": "1",
"bluefs_slow_size": "6000601989120",
"bluefs_slow_type": "hdd",
"bluestore_bdev_access_mode": "blk",
"bluestore_bdev_block_size": "4096",
"bluestore_bdev_dev": "253:1",
"bluestore_bdev_dev_node": "dm-1",
"bluestore_bdev_driver": "KernelDevice",
"bluestore_bdev_model": "",
"bluestore_bdev_partition_path": "/dev/dm-1",
"bluestore_bdev_rotational": "1",
"bluestore_bdev_size": "6000601989120",
"bluestore_bdev_type": "hdd",
"ceph_version": "ceph version 12.2.4
(52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)",
"cpu": "Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz",
"default_device_class": "hdd",
"distro": "centos",
"distro_description": "CentOS Linux 7 (Core)",
"distro_version": "7",
"front_addr": ":6804/2053608",
"front_iface": "eth0",
"hb_back_addr": ".78:6806/2053608",
"hb_front_addr": ".78:6807/2053608",
"hostname": "ceph0rdi-osd2-1-xrd.eng.sfdc.net",
"journal_rotational": "1",
"kernel_description": "#1 SMP Tue Jun 26 16:32:21 UTC 2018",
"kernel_version": "3.10.0-862.6.3.el7.x86_64",
"mem_swap_kb": "0",
"mem_total_kb": "131743604",
"os": "Linux",
"osd_data": "/var/lib/ceph/osd/ceph-1",
"osd_objectstore": "bluestore",
"rotational": "1"
}
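A sketch of where that flag comes from: bluestore records the kernel's
"rotational" attribute for the underlying block device, and a RAID controller
like the PERC H730 shown in the metadata above often reports its volumes as
rotational even when the media is an SSD.

# 1 = rotational (hdd), 0 = non-rotational (ssd); sdf is the db device from the
# metadata above
cat /sys/block/sdf/queue/rotational
# the attribute can be overridden at runtime (persist it with a udev rule);
# the OSD metadata may only pick the change up when the OSD is recreated
echo 0 > /sys/block/sdf/queue/rotational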
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2018-09-07 Thread Brett Chancellor
I saw above that the recommended size for the db partition was 5% of the data,
yet the recommendation is 40GB partitions for 4TB drives. Isn't that closer
to 1%?
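Working the numbers from the figures quoted in this thread:

40 GB / 4 TB  = 40 / 4000  = 1%     (the 40GB-per-4TB example)
4% of 4 TB    = 160 GB              (the "at least 4% of block" guideline below)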

On Fri, Sep 7, 2018 at 10:06 AM, Muhammad Junaid 
wrote:

> Thanks very much. It is very clear now. Because we are just in the planning
> stage right now, could you tell me whether, if we use 7200rpm SAS 3-4TB
> drives for the OSDs, write speed will be fine with this new scenario? Because
> it will apparently be writing to the slower disks before the actual
> confirmation. (I understand there must be advantages to bluestore using
> direct partitions).
> Regards.
>
> Muhammad Junaid
>
> On Fri, Sep 7, 2018 at 6:39 PM Richard Hesketh <
> richard.hesk...@rd.bbc.co.uk> wrote:
>
>> It can get confusing.
>>
>> There will always be a WAL, and there will always be a metadata DB, for
>> a bluestore OSD. However, if a separate device is not specified for the
>> WAL, it is kept in the same device/partition as the DB; in the same way,
>> if a separate device is not specified for the DB, it is kept on the same
>> device as the actual data (an "all-in-one" OSD). Unless you have a
>> separate, even faster device for the WAL to go on, you shouldn't specify
>> it separately from the DB; just make one partition on your SSD per OSD,
>> and make them as large as will fit together on the SSD.
>>
>> Also, just to be clear, the WAL is not exactly a journal in the same way
>> that Filestore required a journal. Because Bluestore can provide write
>> atomicity without requiring a separate journal, data is *usually*
>> written directly to the longterm storage; writes are only journalled in
>> the WAL to be flushed/synced later if they're below a certain size (IIRC
>> 32kb by default), to avoid latency by excessive seeking on HDDs.
>>
>> Rich
>>
>> On 07/09/18 14:23, Muhammad Junaid wrote:
>> > Thanks again, but sorry again too. I couldn't understand the following.
>> >
>> > 1. As per the docs, block.db is used only for bluestore (file system
>> > metadata info etc.). It has nothing to do with the actual data (for
>> > journaling), which will ultimately be written to the slower disks.
>> > 2. How will the actual journaling work if there is no WAL (as you
>> > suggested)?
>> >
>> > Regards.
>> >
>> > On Fri, Sep 7, 2018 at 6:09 PM Alfredo Deza > > > wrote:
>> >
>> > On Fri, Sep 7, 2018 at 9:02 AM, Muhammad Junaid
>> > mailto:junaid.fsd...@gmail.com>> wrote:
>> > > Thanks Alfredo. Just to be clear, my configuration has 5 OSDs (7200 rpm
>> > > SAS HDDs) which are slower than the 200G SSD. That's why I asked for a
>> > > 10G WAL partition for each OSD on the SSD.
>> > >
>> > > Are you asking us to do 40GB  * 5 partitions on SSD just for
>> block.db?
>> >
>> > Yes.
>> >
>> > You don't need a separate WAL defined. It only makes sense when you
>> > have something *faster* than where block.db will live.
>> >
>> > In your case 'data' will go in the slower spinning devices,
>> 'block.db'
>> > will go in the SSD, and there is no need for WAL. You would only
>> > benefit
>> > from WAL if you had another device, like an NVMe, where 2GB
>> partitions
>> > (or LVs) could be created for block.wal
>> >
>> >
>> > >
>> > > On Fri, Sep 7, 2018 at 5:36 PM Alfredo Deza > > > wrote:
>> > >>
>> > >> On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid
>> > mailto:junaid.fsd...@gmail.com>>
>> > >> wrote:
>> > >> > Hi there
>> > >> >
>> > >> > Asking the questions as a newbie. They may have been asked a number
>> > >> > of times before by many, but sorry, it is not clear to me yet.
>> > >> >
>> > >> > 1. The WAL device is just like the journaling device used before
>> > >> > bluestore, and Ceph confirms the write to the client after writing
>> > >> > to it (before the actual write to the primary device)?
>> > >> >
>> > >> > 2. If we have, let's say, 5 OSDs (4 TB SAS) and one 200GB SSD, should
>> > >> > we partition the SSD into 10 partitions? Should/can we set the WAL
>> > >> > partition size for each OSD as 10GB? Or what min/max should we set
>> > >> > for the WAL partition? And can we set the remaining 150GB as
>> > >> > (30GB * 5) for 5 db partitions for all OSDs?
>> > >>
>> > >> A WAL partition would only help if you have a device faster than
>> the
>> > >> SSD where the block.db would go.
>> > >>
>> > >> We recently updated our sizing recommendations for block.db at
>> least
>> > >> 4% of the size of block (also referenced as the data device):
>> > >>
>> > >>
>> > >>
>> > http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
>> > >>
>> > >> In your case, what you want is to create 5 logical volumes from
>> your
>> > >> 200GB at 40GB each, without a need for a WAL device.
>> > >>
>> > >>
>
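To make that concrete, a sketch of the five 40GB block.db volumes (device, VG
and LV names are hypothetical):

# carve the 200GB SSD into five 40GB logical volumes for block.db
pvcreate /dev/sdf
vgcreate ceph-db /dev/sdf
for i in 1 2 3 4 5; do lvcreate -L 40G -n db-$i ceph-db; done
# one OSD per spinning disk, each with its block.db on the SSD; no separate WAL
ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-db/db-1
ceph-volume lvm create --bluestore --data /dev/sdc --block.db ceph-db/db-2
# ...and so on for the remaining three data disks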

Re: [ceph-users] Slow requests from bluestore osds

2018-09-05 Thread Brett Chancellor
Mine is currently at 1000 due to the high number of pgs we had coming from
Jewel. I do find it odd that only the bluestore OSDs have this issue.
Filestore OSDs seem to be unaffected.
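For reference, a sketch of how such an override is typically applied (values are
illustrative; depending on the release the option is read by the mon or the mgr,
and a restart may be needed for it to take effect):

# ceph.conf (hypothetical snippet)
[global]
mon_max_pg_per_osd = 1000

# or injected at runtime
ceph tell mon.* injectargs '--mon_max_pg_per_osd 1000'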

On Wed, Sep 5, 2018, 3:43 PM Samuel Taylor Liston 
wrote:

> Just a thought - have you looked at increasing your "mon_max_pg_per_osd"
> both on the mons and osds?  I was having a similar issue while trying to
> add more OSDs to my cluster (12.2.27, CentOS7.5,
> 3.10.0-862.9.1.el7.x86_64).   I increased mine to 300 temporarily while
> adding OSDs and stopped having blocked requests.
> --
> Sam Liston (sam.lis...@utah.edu)
> 
> Center for High Performance Computing
> 155 S. 1452 E. Rm 405
> Salt Lake City, Utah 84112 (801)232-6932
> 
>
>
>
>
> On Sep 5, 2018, at 12:46 PM, Daniel Pryor  wrote:
>
> I've experienced the same thing during scrubbing and/or any kind of
> expansion activity.
>
> *Daniel Pryor*
>
> On Mon, Sep 3, 2018 at 2:13 AM Marc Schöchlin  wrote:
>
>> Hi,
>>
>> we are also experiencing this type of behavior for some weeks on our not
>> so performance critical hdd pools.
>> We haven't spent so much time on this problem, because there are
>> currently more important tasks - but here are a few details:
>>
>> Running the following loop results in the following output:
>>
>> while true; do ceph health|grep -q HEALTH_OK || (date;  ceph health
>> detail); sleep 2; done
>>
>> Sun Sep  2 20:59:47 CEST 2018
>> HEALTH_WARN 4 slow requests are blocked > 32 sec
>> REQUEST_SLOW 4 slow requests are blocked > 32 sec
>> 4 ops are blocked > 32.768 sec
>> osd.43 has blocked requests > 32.768 sec
>> Sun Sep  2 20:59:50 CEST 2018
>> HEALTH_WARN 4 slow requests are blocked > 32 sec
>> REQUEST_SLOW 4 slow requests are blocked > 32 sec
>> 4 ops are blocked > 32.768 sec
>> osd.43 has blocked requests > 32.768 sec
>> Sun Sep  2 20:59:52 CEST 2018
>> HEALTH_OK
>> Sun Sep  2 21:00:28 CEST 2018
>> HEALTH_WARN 1 slow requests are blocked > 32 sec
>> REQUEST_SLOW 1 slow requests are blocked > 32 sec
>> 1 ops are blocked > 32.768 sec
>> osd.41 has blocked requests > 32.768 sec
>> Sun Sep  2 21:00:31 CEST 2018
>> HEALTH_WARN 7 slow requests are blocked > 32 sec
>> REQUEST_SLOW 7 slow requests are blocked > 32 sec
>> 7 ops are blocked > 32.768 sec
>> osds 35,41 have blocked requests > 32.768 sec
>> Sun Sep  2 21:00:33 CEST 2018
>> HEALTH_WARN 7 slow requests are blocked > 32 sec
>> REQUEST_SLOW 7 slow requests are blocked > 32 sec
>> 7 ops are blocked > 32.768 sec
>> osds 35,51 have blocked requests > 32.768 sec
>> Sun Sep  2 21:00:35 CEST 2018
>> HEALTH_WARN 7 slow requests are blocked > 32 sec
>> REQUEST_SLOW 7 slow requests are blocked > 32 sec
>> 7 ops are blocked > 32.768 sec
>> osds 35,51 have blocked requests > 32.768 sec
>>
>> Our details:
>>
>>   * system details:
>> * Ubuntu 16.04
>>  * Kernel 4.13.0-39
>>  * 30 * 8 TB Disk (SEAGATE/ST8000NM0075)
>>  * 3* Dell Power Edge R730xd (Firmware 2.50.50.50)
>>* Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
>>* 2*10GBITS SFP+ Network Adapters
>>* 192GB RAM
>>  * Pools are using replication factor 3, 2MB object size,
>>85% write load, 1700 write IOPS/sec
>>(ops mainly between 4k and 16k size), 300 read IOPS/sec
>>   * we have the impression that this appears on deepscrub/scrub activity.
>>   * Ceph 12.2.5, we already played with the OSD settings
>> (our assumption was that the problem is related to rocksdb compaction)
>> bluestore cache kv max = 2147483648
>> bluestore cache kv ratio = 0.9
>> bluestore cache meta ratio = 0.1
>> bluestore cache size hdd = 10737418240
>>   * this type of problem only appears on hdd/bluestore osds; ssd/bluestore
>> osds never experienced that problem
>>   * the system is healthy, no swapping, no high load, no errors in dmesg
>>
>> I attached a log excerpt of osd.35 - probably this is useful for
>> investigating the problem if someone has deeper bluestore knowledge.
>> (slow requests appeared on Sun Sep  2 21:00:35)
>>
>> Regards
>> Marc
>>
>>
>> Am 02.09.2018 um 15:50 schrieb Brett Chancellor:
>> > The warnings look like this.
>> >
>> > 6 ops are blocked > 32.768 sec on osd.219
> 1 osds have slow requests

Re: [ceph-users] Slow requests from bluestore osds

2018-09-05 Thread Brett Chancellor
I'm running CentOS 7.5. If I turn off Spectre/Meltdown protection then a
security sweep will disconnect it from the network.

-Brett

On Wed, Sep 5, 2018 at 2:24 PM, Uwe Sauter  wrote:

> I'm also experiencing slow requests though I cannot point it to scrubbing.
>
> Which kernel do you run? Would you be able to test against the same kernel
> with Spectre/Meltdown mitigations disabled ("noibrs noibpb nopti
> nospectre_v2" as boot option)?
>
> Uwe
>
> Am 05.09.18 um 19:30 schrieb Brett Chancellor:
>
>> Marc,
>>As with you, this problem manifests itself only when the bluestore OSD
>> is involved in some form of deep scrub.  Anybody have any insight on what
>> might be causing this?
>>
>> -Brett
>>
>> On Mon, Sep 3, 2018 at 4:13 AM, Marc Schöchlin > m...@256bit.org>> wrote:
>>
>> Hi,
>>
>> we are also experiencing this type of behavior for some weeks on our
>> not
>> so performance critical hdd pools.
>> We haven't spent so much time on this problem, because there are
>> currently more important tasks - but here are a few details:
>>
>> Running the following loop results in the following output:
>>
>> while true; do ceph health|grep -q HEALTH_OK || (date;  ceph health
>> detail); sleep 2; done
>>
>> Sun Sep  2 20:59:47 CEST 2018
>> HEALTH_WARN 4 slow requests are blocked > 32 sec
>> REQUEST_SLOW 4 slow requests are blocked > 32 sec
>>  4 ops are blocked > 32.768 sec
>>  osd.43 has blocked requests > 32.768 sec
>> Sun Sep  2 20:59:50 CEST 2018
>> HEALTH_WARN 4 slow requests are blocked > 32 sec
>> REQUEST_SLOW 4 slow requests are blocked > 32 sec
>>  4 ops are blocked > 32.768 sec
>>  osd.43 has blocked requests > 32.768 sec
>> Sun Sep  2 20:59:52 CEST 2018
>> HEALTH_OK
>> Sun Sep  2 21:00:28 CEST 2018
>> HEALTH_WARN 1 slow requests are blocked > 32 sec
>> REQUEST_SLOW 1 slow requests are blocked > 32 sec
>>  1 ops are blocked > 32.768 sec
>>  osd.41 has blocked requests > 32.768 sec
>> Sun Sep  2 21:00:31 CEST 2018
>> HEALTH_WARN 7 slow requests are blocked > 32 sec
>> REQUEST_SLOW 7 slow requests are blocked > 32 sec
>>  7 ops are blocked > 32.768 sec
>>  osds 35,41 have blocked requests > 32.768 sec
>> Sun Sep  2 21:00:33 CEST 2018
>> HEALTH_WARN 7 slow requests are blocked > 32 sec
>> REQUEST_SLOW 7 slow requests are blocked > 32 sec
>>  7 ops are blocked > 32.768 sec
>>  osds 35,51 have blocked requests > 32.768 sec
>> Sun Sep  2 21:00:35 CEST 2018
>> HEALTH_WARN 7 slow requests are blocked > 32 sec
>> REQUEST_SLOW 7 slow requests are blocked > 32 sec
>>  7 ops are blocked > 32.768 sec
>>  osds 35,51 have blocked requests > 32.768 sec
>>
>> Our details:
>>
>>* system details:
>>  * Ubuntu 16.04
>>   * Kernel 4.13.0-39
>>   * 30 * 8 TB Disk (SEAGATE/ST8000NM0075)
>>   * 3* Dell Power Edge R730xd (Firmware 2.50.50.50)
>> * Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
>> * 2*10GBITS SFP+ Network Adapters
>> * 192GB RAM
>>   * Pools are using replication factor 3, 2MB object size,
>> 85% write load, 1700 write IOPS/sec
>> (ops mainly between 4k and 16k size), 300 read IOPS/sec
>>* we have the impression that this appears on deepscrub/scrub
>> activity.
>>    * Ceph 12.2.5, we already played with the OSD settings
>>  (our assumption was that the problem is related to rocksdb
>> compaction)
>>  bluestore cache kv max = 2147483648
>>  bluestore cache kv ratio = 0.9
>>      bluestore cache meta ratio = 0.1
>>  bluestore cache size hdd = 10737418240
>>* this type of problem only appears on hdd/bluestore osds;
>> ssd/bluestore
>>  osds never experienced that problem
>>* the system is healthy, no swapping, no high load, no errors in
>> dmesg
>>
>>     I attached a log excerpt of osd.35 - probably this is useful for
>> investigating the problem if someone has deeper bluestore knowledge.
>> (slow requests appeared on Sun Sep  2 21:00:35)
>>
>> Regards
>> Marc
>>
>>
>> Am 02.09.2018 um 15:50 schrieb Brett Chancellor:
&

Re: [ceph-users] Slow requests from bluestore osds

2018-09-05 Thread Brett Chancellor
Marc,
  As with you, this problem manifests itself only when the bluestore OSD is
involved in some form of deep scrub.  Anybody have any insight on what
might be causing this?

-Brett
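Not something suggested in the thread, but one way to test the deep-scrub
correlation is to temporarily rein in scrubbing and watch whether the slow
requests disappear (a sketch; values are illustrative and everything here can
be reverted afterwards):

# stop new scrubs entirely for the duration of the test
ceph osd set noscrub
ceph osd set nodeep-scrub
# or, less drastic: keep scrubbing but slow it down (runtime-only injection)
ceph tell osd.* injectargs '--osd_scrub_sleep 0.1 --osd_scrub_chunk_max 5'
# revert when done
ceph osd unset noscrub
ceph osd unset nodeep-scrub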

On Mon, Sep 3, 2018 at 4:13 AM, Marc Schöchlin  wrote:

> Hi,
>
> we are also experiencing this type of behavior for some weeks on our not
> so performance critical hdd pools.
> We haven't spent so much time on this problem, because there are
> currently more important tasks - but here are a few details:
>
> Running the following loop results in the following output:
>
> while true; do ceph health|grep -q HEALTH_OK || (date;  ceph health
> detail); sleep 2; done
>
> Sun Sep  2 20:59:47 CEST 2018
> HEALTH_WARN 4 slow requests are blocked > 32 sec
> REQUEST_SLOW 4 slow requests are blocked > 32 sec
> 4 ops are blocked > 32.768 sec
> osd.43 has blocked requests > 32.768 sec
> Sun Sep  2 20:59:50 CEST 2018
> HEALTH_WARN 4 slow requests are blocked > 32 sec
> REQUEST_SLOW 4 slow requests are blocked > 32 sec
> 4 ops are blocked > 32.768 sec
> osd.43 has blocked requests > 32.768 sec
> Sun Sep  2 20:59:52 CEST 2018
> HEALTH_OK
> Sun Sep  2 21:00:28 CEST 2018
> HEALTH_WARN 1 slow requests are blocked > 32 sec
> REQUEST_SLOW 1 slow requests are blocked > 32 sec
> 1 ops are blocked > 32.768 sec
> osd.41 has blocked requests > 32.768 sec
> Sun Sep  2 21:00:31 CEST 2018
> HEALTH_WARN 7 slow requests are blocked > 32 sec
> REQUEST_SLOW 7 slow requests are blocked > 32 sec
> 7 ops are blocked > 32.768 sec
> osds 35,41 have blocked requests > 32.768 sec
> Sun Sep  2 21:00:33 CEST 2018
> HEALTH_WARN 7 slow requests are blocked > 32 sec
> REQUEST_SLOW 7 slow requests are blocked > 32 sec
> 7 ops are blocked > 32.768 sec
> osds 35,51 have blocked requests > 32.768 sec
> Sun Sep  2 21:00:35 CEST 2018
> HEALTH_WARN 7 slow requests are blocked > 32 sec
> REQUEST_SLOW 7 slow requests are blocked > 32 sec
> 7 ops are blocked > 32.768 sec
> osds 35,51 have blocked requests > 32.768 sec
>
> Our details:
>
>   * system details:
> * Ubuntu 16.04
>  * Kernel 4.13.0-39
>  * 30 * 8 TB Disk (SEAGATE/ST8000NM0075)
>  * 3* Dell Power Edge R730xd (Firmware 2.50.50.50)
>* Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
>* 2*10GBITS SFP+ Network Adapters
>* 192GB RAM
>  * Pools are using replication factor 3, 2MB object size,
>85% write load, 1700 write IOPS/sec
>(ops mainly between 4k and 16k size), 300 read IOPS/sec
>   * we have the impression that this appears on deepscrub/scrub activity.
>   * Ceph 12.2.5, we already played with the OSD settings
> (our assumption was that the problem is related to rocksdb compaction)
> bluestore cache kv max = 2147483648
> bluestore cache kv ratio = 0.9
> bluestore cache meta ratio = 0.1
> bluestore cache size hdd = 10737418240
>   * this type of problem only appears on hdd/bluestore osds; ssd/bluestore
> osds never experienced that problem
>   * the system is healthy, no swapping, no high load, no errors in dmesg
>
> I attached a log excerpt of osd.35 - probably this is useful for
> investigating the problem if someone has deeper bluestore knowledge.
> (slow requests appeared on Sun Sep  2 21:00:35)
>
> Regards
> Marc
>
>
> Am 02.09.2018 um 15:50 schrieb Brett Chancellor:
> > The warnings look like this.
> >
> > 6 ops are blocked > 32.768 sec on osd.219
> > 1 osds have slow requests
> >
> > On Sun, Sep 2, 2018, 8:45 AM Alfredo Deza  > <mailto:ad...@redhat.com>> wrote:
> >
> > On Sat, Sep 1, 2018 at 12:45 PM, Brett Chancellor
> > mailto:bchancel...@salesforce.com>>
> > wrote:
> > > Hi Cephers,
> > >   I am in the process of upgrading a cluster from Filestore to
> > bluestore,
> > > but I'm concerned about frequent warnings popping up against the
> new
> > > bluestore devices. I'm frequently seeing messages like this,
> > although the
> > > specific osd changes, it's always one of the few hosts I've
> > converted to
> > > bluestore.
> > >
> > > 6 ops are blocked > 32.768 sec on osd.219
> > > 1 osds have slow requests
> > >
> > > I'm running 12.2.4, have any of you seen similar issues? It
> > seems as though
> > > these messages pop up more frequently when one of the bluestore
> > pgs is
>> > > involved in a scrub.  I'll include my bluestore creation process below.

Re: [ceph-users] Slow requests from bluestore osds

2018-09-02 Thread Brett Chancellor
The warnings look like this.

6 ops are blocked > 32.768 sec on osd.219
1 osds have slow requests

On Sun, Sep 2, 2018, 8:45 AM Alfredo Deza  wrote:

> On Sat, Sep 1, 2018 at 12:45 PM, Brett Chancellor
>  wrote:
> > Hi Cephers,
> >   I am in the process of upgrading a cluster from Filestore to bluestore,
> > but I'm concerned about frequent warnings popping up against the new
> > bluestore devices. I'm frequently seeing messages like this, although the
> > specific osd changes, it's always one of the few hosts I've converted to
> > bluestore.
> >
> > 6 ops are blocked > 32.768 sec on osd.219
> > 1 osds have slow requests
> >
> > I'm running 12.2.4, have any of you seen similar issues? It seems as
> though
> > these messages pop up more frequently when one of the bluestore pgs is
> > involved in a scrub.  I'll include my bluestore creation process below,
> in
> > case that might cause an issue. (sdb, sdc, sdd are SATA, sde and sdf are
> > SSD)
>
> Would be useful to include what those warnings say. The ceph-volume
> commands look OK to me
>
> >
> >
> > ## Process used to create osds
> > sudo ceph-disk zap /dev/sdb /dev/sdc /dev/sdd /dev/sdd /dev/sde /dev/sdf
> > sudo ceph-volume lvm zap /dev/sdb
> > sudo ceph-volume lvm zap /dev/sdc
> > sudo ceph-volume lvm zap /dev/sdd
> > sudo ceph-volume lvm zap /dev/sde
> > sudo ceph-volume lvm zap /dev/sdf
> > sudo sgdisk -n 0:2048:+133GiB -t 0: -c 1:"ceph block.db sdb" /dev/sdf
> > sudo sgdisk -n 0:0:+133GiB -t 0: -c 2:"ceph block.db sdc" /dev/sdf
> > sudo sgdisk -n 0:0:+133GiB -t 0: -c 3:"ceph block.db sdd" /dev/sdf
> > sudo sgdisk -n 0:0:+133GiB -t 0: -c 4:"ceph block.db sde" /dev/sdf
> > sudo ceph-volume lvm create --bluestore --crush-device-class hdd --data
> > /dev/sdb --block.db /dev/sdf1
> > sudo ceph-volume lvm create --bluestore --crush-device-class hdd --data
> > /dev/sdc --block.db /dev/sdf2
> > sudo ceph-volume lvm create --bluestore --crush-device-class hdd --data
> > /dev/sdd --block.db /dev/sdf3
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Slow requests from bluestore osds

2018-09-01 Thread Brett Chancellor
Hi Cephers,
  I am in the process of upgrading a cluster from Filestore to bluestore,
but I'm concerned about frequent warnings popping up against the new
bluestore devices. I'm frequently seeing messages like this, although the
specific osd changes, it's always one of the few hosts I've converted to
bluestore.

6 ops are blocked > 32.768 sec on osd.219
1 osds have slow requests

I'm running 12.2.4, have any of you seen similar issues? It seems as though
these messages pop up more frequently when one of the bluestore pgs is
involved in a scrub.  I'll include my bluestore creation process below, in
case that might cause an issue. (sdb, sdc, sdd are SATA, sde and sdf are
SSD)
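A sketch of how to see what the flagged OSD is actually stuck on (run on the
host carrying that OSD; osd.219 is just the id from the warning above):

# requests currently in flight and how long each has been blocked
ceph daemon osd.219 dump_ops_in_flight
# recently completed slow operations, with per-step timestamps
ceph daemon osd.219 dump_historic_ops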


## Process used to create osds
sudo ceph-disk zap /dev/sdb /dev/sdc /dev/sdd /dev/sdd /dev/sde /dev/sdf
sudo ceph-volume lvm zap /dev/sdb
sudo ceph-volume lvm zap /dev/sdc
sudo ceph-volume lvm zap /dev/sdd
sudo ceph-volume lvm zap /dev/sde
sudo ceph-volume lvm zap /dev/sdf
sudo sgdisk -n 0:2048:+133GiB -t 0: -c 1:"ceph block.db sdb" /dev/sdf
sudo sgdisk -n 0:0:+133GiB -t 0: -c 2:"ceph block.db sdc" /dev/sdf
sudo sgdisk -n 0:0:+133GiB -t 0: -c 3:"ceph block.db sdd" /dev/sdf
sudo sgdisk -n 0:0:+133GiB -t 0: -c 4:"ceph block.db sde" /dev/sdf
sudo ceph-volume lvm create --bluestore --crush-device-class hdd --data
/dev/sdb --block.db /dev/sdf1
sudo ceph-volume lvm create --bluestore --crush-device-class hdd --data
/dev/sdc --block.db /dev/sdf2
sudo ceph-volume lvm create --bluestore --crush-device-class hdd --data
/dev/sdd --block.db /dev/sdf3
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph cluster "hung" after node failure

2018-08-29 Thread Brett Chancellor
Hi All. I have a ceph cluster that's partially upgraded to Luminous. Last
night a host died and since then the cluster is failing to recover. It
finished backfilling, but was left with thousands of requests degraded,
inactive, or stale.  In order to move past the issue, I put the cluster in
noout,noscrub,nodeep-scrub and restarted all services one by one.

Here is the current state of the cluster, any idea how to get past the
stale and stuck pgs? Any help would be very appreciated. Thanks.

-Brett


## ceph -s output
###
$ sudo ceph -s
  cluster:
id:
health: HEALTH_ERR
165 pgs are stuck inactive for more than 60 seconds
243 pgs backfill_wait
144 pgs backfilling
332 pgs degraded
5 pgs peering
1 pgs recovery_wait
22 pgs stale
332 pgs stuck degraded
143 pgs stuck inactive
22 pgs stuck stale
531 pgs stuck unclean
330 pgs stuck undersized
330 pgs undersized
671 requests are blocked > 32 sec
603 requests are blocked > 4096 sec
recovery 3524906/412016682 objects degraded (0.856%)
recovery 2462252/412016682 objects misplaced (0.598%)
noout,noscrub,nodeep-scrub flag(s) set
mon.ceph0rdi-mon1-1-prd store is getting too big! 17612 MB >=
15360 MB
mon.ceph0rdi-mon2-1-prd store is getting too big! 17669 MB >=
15360 MB
mon.ceph0rdi-mon3-1-prd store is getting too big! 17586 MB >=
15360 MB

  services:
mon: 3 daemons, quorum
ceph0rdi-mon1-1-prd,ceph0rdi-mon2-1-prd,ceph0rdi-mon3-1-prd
mgr: ceph0rdi-mon3-1-prd(active), standbys: ceph0rdi-mon2-1-prd,
ceph0rdi-mon1-1-prd
osd: 222 osds: 218 up, 218 in; 428 remapped pgs
 flags noout,noscrub,nodeep-scrub

  data:
pools:   35 pools, 38144 pgs
objects: 130M objects, 172 TB
usage:   538 TB used, 337 TB / 875 TB avail
pgs: 0.375% pgs not active
 3524906/412016682 objects degraded (0.856%)
 2462252/412016682 objects misplaced (0.598%)
 37599 active+clean
 173   active+undersized+degraded+remapped+backfill_wait
 133   active+undersized+degraded+remapped+backfilling
 93activating
 68active+remapped+backfill_wait
 22activating+undersized+degraded+remapped
 13stale+active+clean
 11active+remapped+backfilling
 9 activating+remapped
 5 remapped
 5 stale+activating+remapped
 3 remapped+peering
 2 stale+remapped
 2 stale+remapped+peering
 1 activating+degraded+remapped
 1 active+clean+remapped
 1 active+degraded+remapped+backfill_wait
 1 active+undersized+remapped+backfill_wait
 1 activating+degraded
 1 active+recovery_wait+undersized+degraded+remapped

  io:
client:   187 kB/s rd, 2595 kB/s wr, 99 op/s rd, 343 op/s wr
recovery: 1509 MB/s, 1541 objects/s

## ceph pg dump_stuck stale (this number doesn't seem to decrease)

$ sudo ceph pg dump_stuck stale
ok
PG_STAT STATE UPUP_PRIMARY ACTING
ACTING_PRIMARY
17.6d7 stale+remapped[5,223,96]  5  [223,96,148]
223
2.5c5  stale+active+clean  [224,48,179]224  [224,48,179]
224
17.64e stale+active+clean  [224,84,109]224  [224,84,109]
224
19.5b4  stale+activating+remapped  [124,130,20]124   [124,20,11]
124
17.4c6 stale+active+clean  [224,216,95]224  [224,216,95]
224
73.413  stale+activating+remapped [117,130,189]117 [117,189,137]
117
2.431  stale+remapped+peering   [5,180,142]  5  [180,142,40]
180
69.1dc stale+active+clean[62,36,54] 62[62,36,54]
 62
14.790 stale+active+clean   [81,114,19] 81   [81,114,19]
 81
2.78e  stale+active+clean [224,143,124]224 [224,143,124]
224
73.37a stale+active+clean   [224,84,38]224   [224,84,38]
224
17.42d  stale+activating+remapped  [220,130,25]220  [220,25,137]
220
72.263 stale+active+clean [224,148,117]224 [224,148,117]
224
67.40  stale+active+clean   [62,170,71] 62   [62,170,71]
 62
67.16d stale+remapped+peering[3,147,22]  3   [147,22,29]
147
20.3de stale+active+clean [224,103,126]224 [224,103,126]
224
19.721 stale+remapped[3,34,179]  3  [34,179,128]
 34
19.2f1  stale+activating+remapped [126,130,178]126  [126,178,72]
126
74.28b stale+active+clean   [224,95,56]224 
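A sketch of how one of those stuck PGs can be examined further (17.6d7 is just
the first pgid from the listing above; the query output shows which OSDs the PG
is waiting on and why it is stuck):

# detailed peering/recovery state for a single stuck PG
ceph pg 17.6d7 query
# summary of which PGs and OSDs are behind the health warnings
ceph health detail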

[ceph-users] Trouble Creating OSD after rolling back from from Luminous to Jewel

2018-06-12 Thread Brett Chancellor
Hi all!
  I'm having trouble creating OSDs on some boxes that once held Bluestore
OSDs.  I have rolled the ceph software back from 12.2.4 -> 10.2.9 on the
boxes, but I'm running into this error when creating osds.
2018-06-12 22:32:42.78 7fcaf39e2800  0 ceph version 10.2.9
(2ee413f77150c0f375ff6f10edd6c8f9c7d060d0), process ceph-osd, pid 414399
2018-06-12 22:32:42.700326 7fcaf39e2800 -1 bluestore(/dev/sdf1)
_read_bdev_label unable to decode label at offset 102:
buffer::malformed_input: void
bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past
end of struct encoding
2018-06-12 22:32:42.703062 7fcaf39e2800  1 journal _open /dev/sdf1 fd 4:
10485760 bytes, block size 4096 bytes, directio = 0, aio = 0
2018-06-12 22:32:42.703151 7fcaf39e2800  1 journal close /dev/sdf1
2018-06-12 22:32:42.703203 7fcaf39e2800  0 probe_block_device_fsid
/dev/sdf1 is filestore, ----

I've tried:
Wiping all drives with dd if=/dev/zero of=/dev/sdX bs=1M count=1000
running ceph-disk zap /dev/sdb /dev/sdf
ceph-disk prepare /dev/sdb /dev/sdf

The monitors are and will remain running 12.2.4, since I can't find a safe
way to roll those back just yet.
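In case the leftover bluestore label on /dev/sdf1 is what probe_block_device_fsid
is tripping over, a more thorough wipe of that partition before prepare might be
worth a try (a sketch only; double-check device names before running anything
destructive):

# remove any filesystem/bluestore signatures the kernel still sees on the partition
wipefs -a /dev/sdf1
# zero the first few MB of the partition (the bluestore label sits at the very start)
dd if=/dev/zero of=/dev/sdf1 bs=1M count=10 oflag=direct
# then recreate the partitions and retry
ceph-disk zap /dev/sdb /dev/sdf
ceph-disk prepare /dev/sdb /dev/sdf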

Any help would be appreciated. Thanks.

-Brett
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pool has many more objects per pg than average

2018-06-07 Thread Brett Chancellor
The error will go away once you start storing data in the other pools. Or,
you could simply silence the message with mon_pg_warn_max_object_skew = 0
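A sketch of where that setting lives if you do decide to silence the warning
(depending on the release it is read by the mon or the mgr daemons, so it may
need a restart or a runtime injection there):

# ceph.conf (hypothetical snippet)
[global]
mon_pg_warn_max_object_skew = 0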


On Thu, Jun 7, 2018 at 10:48 AM, Torin Woltjer 
wrote:

> I have a ceph cluster and status shows this error: pool libvirt-pool has
> many more objects per pg than average (too few pgs?) This pool has the most
> stored in it currently, by a large margin. The other pools are
> underutilized currently, but are purposed to take a role much greater than
> libvirt-pool. Once the other pools begin storing more objects, will this
> error go away, or am I misunderstanding the message?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW unable to start gateway for 2nd realm

2018-05-31 Thread Brett Chancellor
be3a6a0893",
"val": 4096339366
},
{
"key": "8501d6d3-4a70-452b-b546-a9b9e95c7cdf",
"val": 2637200738
}
]
},
"master_zonegroup": "73bf78f0-d5a3-4981-8d22-211795e86dc2",
"master_zone": "66a7bbfb-2ecb-40ab-8252-8fbe3a6a0893",
"period_config": {
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
},
"realm_id": "ef91fd09-32a9-4da6-9707-85cb9922a034",
"realm_name": "realmtest",
"realm_epoch": 2
}


## List zonegroups, looking for 'default', don't see it
$ sudo radosgw-admin zonegroup list
{
"default_info": "f0417b11-ebe9-4eb9-bbbe-9396c0061190",
"zonegroups": [
"us",
"maintest"
]
}


On Thu, May 31, 2018 at 11:08 AM, David Turner 
wrote:

> This is the documentation that has always worked for me to set up multiple
> realms.  http://docs.ceph.com/docs/luminous/radosgw/multisite/  It's for
> the multisite configuration, but if you aren't using multi-site, just stop
> after setting up the first site.  I didn't read through all of your steps,
> but I've never had a problem following these steps.  Make sure to replace
> 'luminous' with your ceph version if it isn't luminous.  One thing I
> noticed in Luminous is that you only need like 5 metadata pools for RGW now
> due to namespaces in the pools.
>
> On Wed, May 30, 2018 at 11:38 AM Brett Chancellor <
> bchancel...@salesforce.com> wrote:
>
>> Hi All,
>>   I'm having issues trying to get a 2nd Rados GW realm/zone up and
>> running.  The configuration seemed to go well, but I'm unable to start the
>> gateway.
>>
>> 2018-05-29 21:21:27.119192 7fd26cfdd9c0  0 ERROR: failed to decode obj
>> from .rgw.root:zone_info.fe2e0680-d7e8-415f-bf91-501dda96d075
>> 2018-05-29 21:21:27.119198 7fd26cfdd9c0  0 replace_region_with_zonegroup:
>> error initializing default zone params: (5) Input/output error
>> 2018-05-29 21:21:27.119207 7fd26cfdd9c0 -1 failed converting region to
>> zonegroup : ret -5 (5) Input/output error
>> 2018-05-29 21:21:27.120479 7fd26cfdd9c0  1 -- 10.252.174.9:0/3447328109 
>> mark_down
>> 0x55dddc157a30 -- 0x55dddc153630
>> 2018-05-29 21:21:27.11 7fd26cfdd9c0  1 -- 10.252.174.9:0/3447328109
>> mark_down_all
>> 2018-05-29 21:21:27.122393 7fd26cfdd9c0  1 -- 10.252.174.9:0/3447328109 
>> shutdown
>> complete.
>> 2018-05-29 21:21:27.122800 7fd26cfdd9c0 -1 Couldn't init storage provider
>> (RADOS)
>>
>> Existing RadosGW .. this one works fine
>> Realm : realm01 (default)
>> zonegroup: us (master, default)
>> zone: us-prd-1 (master, default)
>>
>> The problem comes when I'm attempting to add a new realm.
>> New Realm: realmtest
>> new ZG: maintest (master)
>> new zone: lumtest (master)
>>
>> Steps taken:
>> =
>> * Created new rgw pools lumtest.rgw.* (14 pools) on dedicated root
>> *  radosgw-admin realm create --rgw-realm=realmtest
>> *  radosgw-admin zonegroup create --rgw-zonegroup=maintest
>> --rgw-realm=realmtest --master
>> * radosgw-admin zone create --rgw-realm=realmtest
>> --rgw-zonegroup=maintest --rgw-zone=lumtest --master
>> * radosgw-admin user create --rgw-realm realmtest --rgw-zonegroup
>> maintest --rgw-zone lumtest --uid="REMOVED" --display-name="System User"
>> --system
>> * radosgw-admin zone modify -rgw-realm realmtest --rgw-zonegroup maintest
>> --rgw-zone lumtest  [added the access key and secret of system user]
>> *  radosgw-admin user create --rgw-realm realmtest --rgw-zonegroup
>> maintest --rgw-zone lumtest --uid="test" --display-name="test User"
>> * radosgw-admin period update --rgw-realm realmtest
>>
>> ceph.conf
>> 
>> [client.radosgw.rgw-test]
>> host = rgw-test
>> keyring = /etc/ceph/ceph.client.radosgw.rgw-test
>> log file = /var/log/ceph/radosgw.rgw-test
>> rgw frontends = civetweb port=80
>> rgw realm=realmtest
>> rgw zonegroup=maintest
>> rgw zone=lumtest
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW unable to start gateway for 2nd realm

2018-05-30 Thread Brett Chancellor
Hi All,
  I'm having issues trying to get a 2nd Rados GW realm/zone up and
running.  The configuration seemed to go well, but I'm unable to start the
gateway.

2018-05-29 21:21:27.119192 7fd26cfdd9c0  0 ERROR: failed to decode obj from
.rgw.root:zone_info.fe2e0680-d7e8-415f-bf91-501dda96d075
2018-05-29 21:21:27.119198 7fd26cfdd9c0  0 replace_region_with_zonegroup:
error initializing default zone params: (5) Input/output error
2018-05-29 21:21:27.119207 7fd26cfdd9c0 -1 failed converting region to
zonegroup : ret -5 (5) Input/output error
2018-05-29 21:21:27.120479 7fd26cfdd9c0  1 --
10.252.174.9:0/3447328109 mark_down
0x55dddc157a30 -- 0x55dddc153630
2018-05-29 21:21:27.11 7fd26cfdd9c0  1 -- 10.252.174.9:0/3447328109
 mark_down_all
2018-05-29 21:21:27.122393 7fd26cfdd9c0  1 --
10.252.174.9:0/3447328109 shutdown
complete.
2018-05-29 21:21:27.122800 7fd26cfdd9c0 -1 Couldn't init storage provider
(RADOS)

Existing RadosGW .. this one works fine
Realm : realm01 (default)
zonegroup: us (master, default)
zone: us-prd-1 (master, default)

The problem comes when I'm attempting to add a new realm.
New Realm: realmtest
new ZG: maintest (master)
new zone: lumtest (master)

Steps taken:
=
* Created new rgw pools lumtest.rgw.* (14 pools) on dedicated root
*  radosgw-admin realm create --rgw-realm=realmtest
*  radosgw-admin zonegroup create --rgw-zonegroup=maintest
--rgw-realm=realmtest --master
* radosgw-admin zone create --rgw-realm=realmtest --rgw-zonegroup=maintest
--rgw-zone=lumtest --master
* radosgw-admin user create --rgw-realm realmtest --rgw-zonegroup maintest
--rgw-zone lumtest --uid="REMOVED" --display-name="System User" --system
* radosgw-admin zone modify -rgw-realm realmtest --rgw-zonegroup maintest
--rgw-zone lumtest  [added the access key and secret of system user]
*  radosgw-admin user create --rgw-realm realmtest --rgw-zonegroup maintest
--rgw-zone lumtest --uid="test" --display-name="test User"
* radosgw-admin period update --rgw-realm realmtest

ceph.conf

[client.radosgw.rgw-test]
host = rgw-test
keyring = /etc/ceph/ceph.client.radosgw.rgw-test
log file = /var/log/ceph/radosgw.rgw-test
rgw frontends = civetweb port=80
rgw realm=realmtest
rgw zonegroup=maintest
rgw zone=lumtest
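One thing that may be worth double-checking, given the steps above: whether the
period update for the new realm was ever committed, since a staged but
uncommitted period can leave the gateway looking at stale zone information (a
sketch; realm/zonegroup names as used above):

radosgw-admin period update --commit --rgw-realm=realmtest
radosgw-admin period get --rgw-realm=realmtest
radosgw-admin zonegroup get --rgw-zonegroup=maintest --rgw-realm=realmtest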
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com