Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?

2018-03-06 Thread David Turner
Marking osds down is not without risks. You are taking away one of the
copies of data for every PG on that osd. Also you are causing every PG on
that osd to peer. If that osd comes back up, every PG on it again needs to
peer and then they need to recover.

That is a lot of load and risk to automate into the system. Now let's take
into consideration other causes of slow requests, like having more IO load
than your spindles can handle, backfill settings set too aggressively
(related to the first option), or networking problems. If the mon were
detecting slow requests on OSDs and marking them down, you could end up
marking half of your cluster down or causing corrupt data by flapping OSDs.

The mon will mark osds down if those settings I mentioned are met. If the
osd is still responsive enough to answer the other OSDs and the mons, then
there really isn't much that Ceph can do to automate this safely.
There are just so many variables. If Ceph were a closed system on specific
hardware, it could certainly be monitoring that hardware closely for early
warning signs... But people are running Ceph on everything they can compile
it for, including Raspberry Pis. The cluster admin, however, should be able
to add their own early detection for failures.

You can monitor a lot about disks including things such as average await in
a host to see if the disks are taking longer than normal to respond. That
particular check led us to find that we had several storage nodes with bad
cache batteries on the controllers. Finding that explained some slowness we
had noticed in the cluster. It also led us to a better method to catch that
scenario sooner.
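For illustration, a minimal way to watch that metric on an OSD host is iostat
from the sysstat package (just an example; column names vary a bit between
sysstat versions):

# Sample extended device statistics every 5 seconds and watch the await
# (or r_await/w_await) columns. A device whose latency drifts well above
# its peers is worth investigating before Ceph starts logging slow requests
# against the OSD on it.
iostat -x 5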

On Tue, Mar 6, 2018, 11:22 PM shadow_lin  wrote:

> Hi Turner,
> Thanks for your insight.
> I am wondering: if the mon can detect slow/blocked requests from a certain
> osd, why can't the mon mark an osd with blocked requests down if the
> requests have been blocked for more than a certain time?
>
> 2018-03-07
> --
> shadow_lin
> --
>
> *From:* David Turner 
> *Sent:* 2018-03-06 23:56
> *Subject:* Re: [ceph-users] Why one crippled osd can slow down or block all
> request to the whole ceph cluster?
> *To:* "shadow_lin"
> *Cc:* "ceph-users"
>
>
> There are multiple settings that affect this.  osd_heartbeat_grace is
> probably the most apt.  If an OSD is not getting a response from another
> OSD for more than the heartbeat_grace period, then it will tell the mons
> that the OSD is down.  Once mon_osd_min_down_reporters have told the mons
> that an OSD is down, then the OSD will be marked down by the cluster.  If
> the OSD does not then talk to the mons directly to say that it is up, it
> will be marked out after mon_osd_down_out_interval is reached.  If it does
> talk to the mons to say that it is up, then it should be responding again
> and be fine.
>
> In your case where the OSD is half up, half down... I believe all you can
> really do is monitor your cluster and troubleshoot OSDs causing problems
> like this.  Basically every storage solution is vulnerable to this.
> Sometimes an OSD just needs to be restarted due to being in a bad state
> somehow, or simply removed from the cluster because the disk is going bad.
>
> On Sun, Mar 4, 2018 at 2:28 AM shadow_lin  wrote:
>
>> Hi list,
>> During my test of ceph,I find sometime the whole ceph cluster are blocked
>> and the reason was one unfunctional osd.Ceph can heal itself if some osd is
>> down, but it seems if some osd is half dead (have heart beat but can't
>> handle request) then all the request which are directed to that osd would
>> be blocked. If all osds are in one pool and the whole cluster would be
>> blocked due to that one hanged osd.
>> I think this is because ceph will try to distribute the request to all
>> osds and if one of the osd wont confirm the request is done then everything
>> is blocked.
>> Is there a way to let ceph to mark the the crippled osd down if the
>> requests direct to that osd are blocked more than certain time to avoid the
>> whole cluster is blocked?
>>
>> 2018-03-04
>> --
>> shadow_lin
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?

2018-03-06 Thread shadow_lin
Hi Turner,
Thanks for your insight.
I am wondering: if the mon can detect slow/blocked requests from a certain OSD,
why can't the mon mark an OSD with blocked requests down if the requests have
been blocked for more than a certain time?

2018-03-07 

shadow_lin 



From: David Turner 
Sent: 2018-03-06 23:56
Subject: Re: [ceph-users] Why one crippled osd can slow down or block all request to 
the whole ceph cluster?
To: "shadow_lin"
Cc: "ceph-users"

There are multiple settings that affect this.  osd_heartbeat_grace is probably 
the most apt.  If an OSD is not getting a response from another OSD for more 
than the heartbeat_grace period, then it will tell the mons that the OSD is 
down.  Once mon_osd_min_down_reporters have told the mons that an OSD is down, 
then the OSD will be marked down by the cluster.  If the OSD does not then talk 
to the mons directly to say that it is up, it will be marked out after 
mon_osd_down_out_interval is reached.  If it does talk to the mons to say that 
it is up, then it should be responding again and be fine.
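For reference, a sketch of where those options live in ceph.conf. The values
below are roughly the Luminous defaults and are only illustrative, not tuning
advice:

[global]
# seconds without a heartbeat reply before peer OSDs report this OSD to the mons
osd_heartbeat_grace = 20
# how many distinct reporters the mons need before marking an OSD down
mon_osd_min_down_reporters = 2
# seconds an OSD may stay down before the cluster marks it out
mon_osd_down_out_interval = 600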


In your case where the OSD is half up, half down... I believe all you can 
really do is monitor your cluster and troubleshoot OSDs causing problems like 
this.  Basically every storage solution is vulnerable to this.  Sometimes an 
OSD just needs to be restarted due to being in a bad state somehow, or simply 
removed from the cluster because the disk is going bad.


On Sun, Mar 4, 2018 at 2:28 AM shadow_lin  wrote:

Hi list,
During my tests of Ceph, I find that sometimes the whole Ceph cluster is
blocked and the reason is one malfunctioning OSD. Ceph can heal itself if some
OSD is down, but it seems that if some OSD is half dead (it has a heartbeat but
can't handle requests), then all the requests which are directed to that OSD
are blocked. If all OSDs are in one pool, the whole cluster can be blocked due
to that one hung OSD.
I think this is because Ceph will try to distribute requests to all OSDs, and
if one of the OSDs won't confirm a request is done then everything is blocked.
Is there a way to let Ceph mark the crippled OSD down if the requests
directed to that OSD are blocked for more than a certain time, to avoid
blocking the whole cluster?

2018-03-04


shadow_lin 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Memory leak in Ceph OSD?

2018-03-06 Thread Alexandre DERUMIER
Hi,

I'm also seeing a slow memory increase over time with my BlueStore NVMe OSDs
(3.2 TB each), with default ceph.conf settings (Ceph 12.2.2).

Each OSD starts at around 5 GB of memory and grows to around 8 GB.

Currently I'm restarting them about once a month to free memory.


Here is a dump of osd.0 after one week of running:

ceph 2894538  3.9  9.9 7358564 6553080 ? Ssl  mars01 303:03 
/usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph


root@ceph4-1:~#  ceph daemon osd.0 dump_mempools 
{
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 84070208,
"bytes": 84070208
},
"bluestore_cache_data": {
"items": 168,
"bytes": 2908160
},
"bluestore_cache_onode": {
"items": 947820,
"bytes": 636935040
},
"bluestore_cache_other": {
"items": 101250372,
"bytes": 2043476720
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 8,
"bytes": 5760
},
"bluestore_writing_deferred": {
"items": 85,
"bytes": 1203200
},
"bluestore_writing": {
"items": 7,
"bytes": 569584
},
"bluefs": {
"items": 1774,
"bytes": 106360
},
"buffer_anon": {
"items": 68307,
"bytes": 17188636
},
"buffer_meta": {
"items": 284,
"bytes": 24992
},
"osd": {
"items": 333,
"bytes": 4017312
},
"osd_mapbl": {
"items": 0,
"bytes": 0
},
"osd_pglog": {
"items": 1195884,
"bytes": 298139520
},
"osdmap": {
"items": 4542,
"bytes": 384464
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
},
"total": {
"items": 187539792,
"bytes": 3089029956
}
}



another osd after 1 month:


USER PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
ceph 1718009  2.5 11.7 8542012 7725992 ? Ssl   2017 2463:28 
/usr/bin/ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph

root@ceph4-1:~# ceph daemon osd.5 dump_mempools 
{
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 98449088,
"bytes": 98449088
},
"bluestore_cache_data": {
"items": 759,
"bytes": 17276928
},
"bluestore_cache_onode": {
"items": 884140,
"bytes": 594142080
},
"bluestore_cache_other": {
"items": 116375567,
"bytes": 2072801299
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 6,
"bytes": 4320
},
"bluestore_writing_deferred": {
"items": 99,
"bytes": 1190045
},
"bluestore_writing": {
"items": 11,
"bytes": 4510159
},
"bluefs": {
"items": 1202,
"bytes": 64136
},
"buffer_anon": {
"items": 76863,
"bytes": 21327234
},
"buffer_meta": {
"items": 910,
"bytes": 80080
},
"osd": {
"items": 328,
"bytes": 3956992
},
"osd_mapbl": {
"items": 0,
"bytes": 0
},
"osd_pglog": {
"items": 1118050,
"bytes": 286277600
},
"osdmap": {
"items": 6073,
"bytes": 551872
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
},
"total": {
"items": 216913096,
"bytes": 3100631833
}
}
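A minimal sketch of how these totals could be logged over time, assuming the
default admin socket paths under /var/run/ceph, the Luminous dump format shown
above, and that jq is installed (adapt the paths and the output file to your
setup):

# append each OSD's mempool byte total to a log so growth can be graphed later
for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=$(basename "$sock" .asok | cut -d. -f2)
    bytes=$(ceph daemon osd."$id" dump_mempools | jq '.total.bytes')
    echo "$(date '+%F %T') osd.$id mempool_total_bytes=$bytes"
done >> /var/log/ceph-mempool-trend.log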

----- Original Message -----
From: "Kjetil Joergensen" 
To: "ceph-users" 
Sent: Wednesday, 7 March 2018 01:07:06
Subject: Re: [ceph-users] Memory leak in Ceph OSD?

Hi, 
addendum: We're running 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b). 

The workload is a mix of 3xreplicated & ec-coded (rbd, cephfs, rgw). 

-KJ 

On Tue, Mar 6, 2018 at 3:53 PM, Kjetil Joergensen < [ 
mailto:kje...@medallia.com | kje...@medallia.com ] > wrote: 



Hi, 
so.. +1 

We don't run compression as far as I know, so that wouldn't be it. We do 
actually run a mix of bluestore & filestore - due to the rest of the cluster 
predating a stable bluestore by some amount. 

The interesting part is - the behavior seems to be specific to our bluestore 
nodes. 

Below - yellow line, node with 10 x ~4TB SSDs, green line 8 x 800GB SSDs. Blue 
line - dump_mempools total bytes for all the OSDs running 

Re: [ceph-users] Memory leak in Ceph OSD?

2018-03-06 Thread Kjetil Joergensen
Hi,

addendum: We're running 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b).

The workload is a mix of 3xreplicated & ec-coded (rbd, cephfs, rgw).

-KJ

On Tue, Mar 6, 2018 at 3:53 PM, Kjetil Joergensen 
wrote:

> Hi,
>
> so.. +1
>
> We don't run compression as far as I know, so that wouldn't be it. We do
> actually run a mix of bluestore & filestore - due to the rest of the
> cluster predating a stable bluestore by some amount.
>
> The interesting part is - the behavior seems to be specific to our
> bluestore nodes.
>
> Below - yellow line, node with 10 x ~4TB SSDs, green line 8 x 800GB SSDs.
> Blue line - dump_mempools total bytes for all the OSDs running on the
> yellow line. The big dips - forced restarts after having suffered through
> after effects of letting linux deal with it by OOM->SIGKILL previously.
>
>
> A gross extrapolation - "right now" the "memory used" seems to be close
> enough to "sum of RSS of ceph-osd processes" running on the machines.
>
> -KJ
>
> On Thu, Mar 1, 2018 at 7:18 PM, Alex Gorbachev 
> wrote:
>
>> On Thu, Mar 1, 2018 at 5:37 PM, Subhachandra Chandra
>>  wrote:
>> > Even with bluestore we saw memory usage plateau at 3-4GB with 8TB drives
>> > filled to around 90%. One thing that does increase memory usage is the
>> > number of clients simultaneously sending write requests to a particular
>> > primary OSD if the write sizes are large.
>>
>> We have not seen a memory increase in Ubuntu 16.04, but I also
>> observed repeatedly the following phenomenon:
>>
>> When doing a VMotion in ESXi of a large 3TB file (this generates a log
>> of IO requests of small size) to a Ceph pool with compression set to
>> force, after some time the Ceph cluster shows a large number of
>> blocked requests and eventually timeouts become very large (to the
>> point where ESXi aborts the IO due to timeouts).  After abort, the
>> blocked/slow requests messages disappear.  There are no OSD errors.  I
>> have OSD logs if anyone is interested.
>>
>> This does not occur when compression is unset.
>>
>> --
>> Alex Gorbachev
>> Storcium
>>
>> >
>> > Subhachandra
>> >
>> > On Thu, Mar 1, 2018 at 6:18 AM, David Turner 
>> wrote:
>> >>
>> >> With default memory settings, the general rule is 1GB ram/1TB OSD.  If
>> you
>> >> have a 4TB OSD, you should plan to have at least 4GB ram.  This was the
>> >> recommendation for filestore OSDs, but it was a bit much memory for the
>> >> OSDs.  From what I've seen, this rule is a little more appropriate with
>> >> bluestore now and should still be observed.
>> >>
>> >> Please note that memory usage in a HEALTH_OK cluster is not the same
>> >> amount of memory that the daemons will use during recovery.  I have
>> seen
>> >> deployments with 4x memory usage during recovery.
>> >>
>> >> On Thu, Mar 1, 2018 at 8:11 AM Stefan Kooman  wrote:
>> >>>
>> >>> Quoting Caspar Smit (caspars...@supernas.eu):
>> >>> > Stefan,
>> >>> >
>> >>> > How many OSD's and how much RAM are in each server?
>> >>>
>> >>> Currently 7 OSDs, 128 GB RAM. Max will be 10 OSDs in these servers. 12
>> >>> cores (at least one core per OSD).
>> >>>
>> >>> > bluestore_cache_size=6G will not mean each OSD is using max 6GB RAM
>> >>> > right?
>> >>>
>> >>> Apparently. Sure they will use more RAM than just cache to function
>> >>> correctly. I figured 3 GB per OSD would be enough ...
>> >>>
>> >>> > Our bluestore hdd OSD's with bluestore_cache_size at 1G use ~4GB of
>> >>> > total
>> >>> > RAM. The cache is a part of the memory usage by bluestore OSD's.
>> >>>
>> >>> A factor 4 is quite high, isn't it? Where is all this RAM used for
>> >>> besides cache? RocksDB?
>> >>>
>> >>> So how should I size the amount of RAM in a OSD server for 10
>> bluestore
>> >>> SSDs in a
>> >>> replicated setup?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Stefan
>> >>>
>> >>> --
>> >>> | BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
>> >>> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
>> >>> ___
>> >>> ceph-users mailing list
>> >>> ceph-users@lists.ceph.com
>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >>
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Kjetil Joergensen 
> SRE, Medallia Inc
>



-- 
Kjetil Joergensen 
SRE, Medallia Inc
___

Re: [ceph-users] Memory leak in Ceph OSD?

2018-03-06 Thread Kjetil Joergensen
Hi,

so.. +1

We don't run compression as far as I know, so that wouldn't be it. We do
actually run a mix of bluestore & filestore - due to the rest of the
cluster predating a stable bluestore by some amount.

The interesting part is - the behavior seems to be specific to our
bluestore nodes.

Below - yellow line, node with 10 x ~4TB SSDs, green line 8 x 800GB SSDs.
Blue line - dump_mempools total bytes for all the OSDs running on the
yellow line. The big dips - forced restarts after having suffered through
after effects of letting linux deal with it by OOM->SIGKILL previously.


​
A gross extrapolation - "right now" the "memory used" seems to be close
enough to "sum of RSS of ceph-osd processes" running on the machines.

-KJ

On Thu, Mar 1, 2018 at 7:18 PM, Alex Gorbachev 
wrote:

> On Thu, Mar 1, 2018 at 5:37 PM, Subhachandra Chandra
>  wrote:
> > Even with bluestore we saw memory usage plateau at 3-4GB with 8TB drives
> > filled to around 90%. One thing that does increase memory usage is the
> > number of clients simultaneously sending write requests to a particular
> > primary OSD if the write sizes are large.
>
> We have not seen a memory increase in Ubuntu 16.04, but I also
> observed repeatedly the following phenomenon:
>
> When doing a VMotion in ESXi of a large 3TB file (this generates a log
> of IO requests of small size) to a Ceph pool with compression set to
> force, after some time the Ceph cluster shows a large number of
> blocked requests and eventually timeouts become very large (to the
> point where ESXi aborts the IO due to timeouts).  After abort, the
> blocked/slow requests messages disappear.  There are no OSD errors.  I
> have OSD logs if anyone is interested.
>
> This does not occur when compression is unset.
>
> --
> Alex Gorbachev
> Storcium
>
> >
> > Subhachandra
> >
> > On Thu, Mar 1, 2018 at 6:18 AM, David Turner 
> wrote:
> >>
> >> With default memory settings, the general rule is 1GB ram/1TB OSD.  If
> you
> >> have a 4TB OSD, you should plan to have at least 4GB ram.  This was the
> >> recommendation for filestore OSDs, but it was a bit much memory for the
> >> OSDs.  From what I've seen, this rule is a little more appropriate with
> >> bluestore now and should still be observed.
> >>
> >> Please note that memory usage in a HEALTH_OK cluster is not the same
> >> amount of memory that the daemons will use during recovery.  I have seen
> >> deployments with 4x memory usage during recovery.
> >>
> >> On Thu, Mar 1, 2018 at 8:11 AM Stefan Kooman  wrote:
> >>>
> >>> Quoting Caspar Smit (caspars...@supernas.eu):
> >>> > Stefan,
> >>> >
> >>> > How many OSD's and how much RAM are in each server?
> >>>
> >>> Currently 7 OSDs, 128 GB RAM. Max will be 10 OSDs in these servers. 12
> >>> cores (at least one core per OSD).
> >>>
> >>> > bluestore_cache_size=6G will not mean each OSD is using max 6GB RAM
> >>> > right?
> >>>
> >>> Apparently. Sure they will use more RAM than just cache to function
> >>> correctly. I figured 3 GB per OSD would be enough ...
> >>>
> >>> > Our bluestore hdd OSD's with bluestore_cache_size at 1G use ~4GB of
> >>> > total
> >>> > RAM. The cache is a part of the memory usage by bluestore OSD's.
> >>>
> >>> A factor 4 is quite high, isn't it? Where is all this RAM used for
> >>> besides cache? RocksDB?
> >>>
> >>> So how should I size the amount of RAM in a OSD server for 10 bluestore
> >>> SSDs in a
> >>> replicated setup?
> >>>
> >>> Thanks,
> >>>
> >>> Stefan
> >>>
> >>> --
> >>> | BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
> >>> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Kjetil Joergensen 
SRE, Medallia Inc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] change radosgw object owner

2018-03-06 Thread Yehuda Sadeh-Weinraub
On Tue, Mar 6, 2018 at 11:40 AM, Ryan Leimenstoll
 wrote:
> Hi all,
>
> We are trying to move a bucket in radosgw from one user to another in an 
> effort both change ownership and attribute the storage usage of the data to 
> the receiving user’s quota.
>
> I have unlinked the bucket and linked it to the new user using:
>
> radosgw-admin bucket unlink —bucket=$MYBUCKET —uid=$USER
> radosgw-admin bucket link —bucket=$MYBUCKET —bucket-id=$BUCKET_ID 
> —uid=$NEWUSER
>
> However, perhaps as expected, the owner of all the objects in the bucket 
> remain as $USER. I don’t believe changing the owner is a supported operation 
> from the S3 protocol, however it would be very helpful to have the ability to 
> do this on the radosgw backend. This is especially useful for large 
> buckets/datasets where copying the objects out and into radosgw could be time 
> consuming.
>
>  Is this something that is currently possible within radosgw? We are running 
> Ceph 12.2.2.

Maybe try copying the objects onto themselves with the new owner (as long
as it can read them; if not, you first need to change the objects'
ACLs to allow read)? Note that you need to do a copy that retains
the old meta attributes of the old object.
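A rough sketch of that self-copy using the AWS CLI against RGW, run with the
new owner's credentials. The endpoint and object key below are placeholders,
and whether RGW accepts a same-key copy with the COPY metadata directive can
depend on the version, so test it on a throwaway object first:

aws s3api copy-object \
    --endpoint-url http://RGW-ENDPOINT \
    --bucket "$MYBUCKET" \
    --key OBJECT_KEY \
    --copy-source "$MYBUCKET/OBJECT_KEY" \
    --metadata-directive COPY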

Yehuda

>
> Thanks,
> Ryan Leimenstoll
> rleim...@umiacs.umd.edu
> University of Maryland Institute for Advanced Computer Studies
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD crash during pg repair - recovery_info.ss.clone_snaps.end and other problems

2018-03-06 Thread Gregory Farnum
On Sat, Mar 3, 2018 at 2:28 AM Jan Pekař - Imatic 
wrote:

> Hi all,
>
> I have few problems on my cluster, that are maybe linked together and
> now caused OSD down during pg repair.
>
> First few notes about my cluster:
>
> 4 nodes, 15 OSDs installed on Luminous (no upgrade).
> Replicated pools with 1 pool (pool 6) cached by ssd disks.
> I don't detect any hardware failures (disk IO errors, restarts,
> corrupted data etc).
> I'm running RBDs using libvirt on debian wheezy and jessie (stable and
> oldstable).
> I'm snapshotting RBD's using Luminous client on Debian Jessie only.
>

When you say "cached by", do you mean there's a cache pool? Or are you
using bcache or something underneath?


>
> Now problems, from light to severe:
>
> 1)
> Almost every day I notice health some problems after deep scrub
> 1-2 inconsistent PG's with "read_error" on some osd's.
> When I don't repair it, it disappears after few days (? another deep
> scrub). There is no read_error on disks (disk check ok, no errors logged
> in syslog).
>

> 2)
> I noticed on my pool 6 (cached pool), that scrub reports some objects,
> that shouldn't be there:
>
> 2018-02-27 23:43:06.490152 7f4b3820e700 -1 osd.1 pg_epoch: 8712 pg[6.20(
> v 8712'771984 (8712'770478,8712'771984] local-lis/les=8710/8711 n=14299
> ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807 8710/8710/8710)
> [1,10,14] r=0 lpr=8710 crt=8712'771984 lcod 8712'771983 mlcod
> 8712'771983 active+clean+scrubbing+deep+inconsistent+repair] _scan_snaps
> no head for 6:07ffbc7b:::rbd_data.967992ae8944a.00061cb8:c2
> (have MIN)
>
> I think that means an orphaned snap object without its head replica. Maybe
> snaptrim left it there? Why? Maybe an error during snaptrim? Or maybe
> fstrim/discard removed the "head" object (this is, I hope, nonsense)?
>
> 3)
> I ended up with one object (probably a snap object) that has only 1 replica
> (out of size 3), and when I try to repair it, my OSD crashes with
>
> /build/ceph-12.2.3/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p !=
> recovery_info.ss.clone_snaps.end())
> I guess that it detected the orphaned snap object I noticed in 2) and doesn't
> repair it, just asserts and stops the OSD. Am I right?
>
> I noticed the comment "// hmm, should we warn?" in the Ceph source at that
> assert. So should someone remove that assert?
>

There's a ticket https://tracker.ceph.com/issues/23030, which links to a
much longer discussion on this mailing list between Sage and Stefan which
discusses this particular assert. I'm not entirely clear from the rest of
your story (and the long history in that thread) if there are other
potential causes, or if your story might help diagnose it. But I'd start
there since AFAIK it's still a mystery that looks serious but has only a
very small number of incidences. :/
-Greg


>
> And my questions are:
>
> How can I fix the issue with the crashing OSD?
> How can I safely remove those objects with a missing head? Is there any
> tool, or a force-snaptrim for non-existent snapshots? It is a prod cluster so
> I want to be careful. I have no problems with data availability right now.
> My last idea is to move the RBDs to another pool, but I don't have enough
> space to do that (as far as I know RBD can only copy, not move), so I'm
> looking for another clean solution.
> And the last question - how can I find what is causing the read_errors and
> snap object leftovers?
>
> Should I paste my whole log? It is bigger than allowed post size.
> Pasting most important events:
>
> -23> 2018-02-27 23:43:07.903368 7f4b3820e700  2 osd.1 pg_epoch: 8712
> pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711
> n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807
> 8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985
> mlcod 8712'771985 active+clean+scrubbing+deep+inconsistent+repair] 6.20
> repair 1 missing, 0 inconsistent objects
> -22> 2018-02-27 23:43:07.903410 7f4b3820e700 -1 log_channel(cluster)
> log [ERR] : 6.20 repair 1 missing, 0 inconsistent objects
> -21> 2018-02-27 23:43:07.903446 7f4b3820e700 -1 log_channel(cluster)
> log [ERR] : 6.20 repair 3 errors, 2 fixed
> -20> 2018-02-27 23:43:07.903480 7f4b3820e700  5
> write_log_and_missing with: dirty_to: 0'0, dirty_from:
> 4294967295'18446744073709551615, writeout_from:
> 4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
> clear_divergent_priors: 0
> -19> 2018-02-27 23:43:07.903604 7f4b3820e700  1 --
> [2a01:430:22a::cef:c011]:6805/514544 -->
> [2a01:430:22a::cef:c021]:6803/3001666 -- MOSDScrubReserve(6.20 RELEASE
> e8712) v1 -- 0x55a4c5459c00 con 0
> -18> 2018-02-27 23:43:07.903651 7f4b3820e700  1 --
> [2a01:430:22a::cef:c011]:6805/514544 -->
> [2a01:430:22a::cef:c041]:6802/3012729 -- MOSDScrubReserve(6.20 RELEASE
> e8712) v1 -- 0x55a4cb6dee00 con 0
> -17> 2018-02-27 23:43:07.903679 7f4b3820e700  1 --
> [2a01:430:22a::cef:c011]:6805/514544 -->
> [2a01:430:22a::cef:c021]:6803/3001666 -- pg_info((query:8712 sent:8712
> 6.20( v 8712'771986 

[ceph-users] Civetweb log format

2018-03-06 Thread Aaron Bassett
Hey all,
I'm trying to get something of an audit log out of radosgw. To that end I was 
wondering if there's a mechanism to customize the log format of civetweb. It's 
already writing IP, HTTP verb, path, response and time, but I'm hoping to get 
it to print the Authorization header of the request, which contains the 
access key id that we can tie back into the systems we use to issue 
credentials. Any thoughts?

Thanks,
Aaron
CONFIDENTIALITY NOTICE
This e-mail message and any attachments are only for the use of the intended 
recipient and may contain information that is privileged, confidential or 
exempt from disclosure under applicable law. If you are not the intended 
recipient, any disclosure, distribution or other use of this e-mail message or 
attachments is prohibited. If you have received this e-mail message in error, 
please delete and notify the sender immediately. Thank you.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] change radosgw object owner

2018-03-06 Thread Robin H. Johnson
On Tue, Mar 06, 2018 at 02:40:11PM -0500, Ryan Leimenstoll wrote:
> Hi all, 
> 
> We are trying to move a bucket in radosgw from one user to another in an 
> effort both change ownership and attribute the storage usage of the data to 
> the receiving user’s quota. 
> 
> I have unlinked the bucket and linked it to the new user using: 
> 
> radosgw-admin bucket unlink —bucket=$MYBUCKET —uid=$USER
> radosgw-admin bucket link —bucket=$MYBUCKET —bucket-id=$BUCKET_ID 
> —uid=$NEWUSER
> 
> However, perhaps as expected, the owner of all the objects in the
> bucket remain as $USER. I don’t believe changing the owner is a
> supported operation from the S3 protocol, however it would be very
> helpful to have the ability to do this on the radosgw backend. This is
> especially useful for large buckets/datasets where copying the objects
> out and into radosgw could be time consuming.
At the raw radosgw-admin level, you should be able to do it with
bi-list/bi-get/bi-put. The downside here is that I don't think the BI ops are
exposed in the HTTP Admin API, so it's going to be really expensive to chown
lots of objects.

Using a quick example:
# radosgw-admin \
  --uid UID-CENSORED \
  --bucket BUCKET-CENSORED \
  bi get \
  --object=OBJECTNAME-CENSORED
{
"type": "plain",
"idx": "OBJECTNAME-CENSORED",
"entry": {
"name": "OBJECTNAME-CENSORED",
"instance": "",
"ver": {
"pool": 5,
"epoch": 266028
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 1066,
"mtime": "2016-11-17 17:01:29.668746Z",
"etag": "e7a75c39df3d123c716d5351059ad2d9",
"owner": "UID-CENSORED",
"owner_display_name": "UID-CENSORED",
"content_type": "image/png",
"accounted_size": 1066,
"user_data": ""
},
"tag": "default.293024600.1188196",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
}
}
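A very rough sketch of rewriting the owner on a single object via the bucket
index, based on the entry above. Treat it as pseudocode: check
radosgw-admin bi put --help on your version for how it expects the edited
entry (stdin vs an --infile flag), and test on a scratch bucket first.

radosgw-admin --bucket BUCKET-CENSORED bi get \
  --object=OBJECTNAME-CENSORED > entry.json
# rewrite the owner fields in the index entry (requires jq)
jq '.entry.meta.owner = "NEWUSER" |
    .entry.meta.owner_display_name = "NEWUSER"' entry.json > entry.new.json
radosgw-admin --bucket BUCKET-CENSORED bi put \
  --object=OBJECTNAME-CENSORED < entry.new.json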

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


signature.asc
Description: Digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-06 Thread Mike Christie
On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
> Hi,
> 
> I want to do load balanced multipathing (multiple iSCSI gateway/exporter
> nodes) of iSCSI backed with RBD images. Should I disable exclusive lock
> feature? What if I don't disable that feature? I'm using TGT (manual
> way) since I get so many CPU stuck error messages when I was using LIO.
> 

You are using LIO/TGT with krbd right?

You cannot or shouldn't do active/active multipathing. If you have the
lock enabled then it bounces between paths for each IO and will be slow.
If you do not have it enabled then you can end up with stale IO
overwriting current data.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] change radosgw object owner

2018-03-06 Thread Ryan Leimenstoll
Hi all, 

We are trying to move a bucket in radosgw from one user to another in an effort 
both change ownership and attribute the storage usage of the data to the 
receiving user’s quota. 

I have unlinked the bucket and linked it to the new user using: 

radosgw-admin bucket unlink —bucket=$MYBUCKET —uid=$USER
radosgw-admin bucket link —bucket=$MYBUCKET —bucket-id=$BUCKET_ID —uid=$NEWUSER

However, perhaps as expected, the owner of all the objects in the bucket remain 
as $USER. I don’t believe changing the owner is a supported operation from the 
S3 protocol, however it would be very helpful to have the ability to do this on 
the radosgw backend. This is especially useful for large buckets/datasets where 
copying the objects out and into radosgw could be time consuming.

 Is this something that is currently possible within radosgw? We are running 
Ceph 12.2.2. 

Thanks,
Ryan Leimenstoll
rleim...@umiacs.umd.edu
University of Maryland Institute for Advanced Computer Studies
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-06 Thread Lazuardi Nasution
Hi,

I want to do load balanced multipathing (multiple iSCSI gateway/exporter
nodes) of iSCSI backed with RBD images. Should I disable exclusive lock
feature? What if I don't disable that feature? I'm using TGT (manual way)
since I get so many CPU stuck error messages when I was using LIO.

Best regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] When all Mons are down, does existing RBD volume continue to work

2018-03-06 Thread Gregory Farnum
I think things would keep running, but I'm really not sure. This is just
not a realistic concern as there are lots of little housekeeping things
that can be deferred for a little while but eventually will stop forward
progress if you can't talk to the monitors to persist cluster state updates.

On Tue, Mar 6, 2018 at 9:50 AM Mayank Kumar  wrote:

> Thanks Gregory. This is basically just trying to understand the behavior
> of the system in a failure scenario. Ideally we would track and fix mons
> going down promptly.
>
> In an ideal world where nothing else fails and cephx is not in use, but
> the mons are down, what happens when the OSD pings to the mons time out?
> Would that start resulting in I/O failures?
>
>
> On Mon, Mar 5, 2018 at 9:44 PM Gregory Farnum  wrote:
>
>> On Sun, Mar 4, 2018 at 12:02 AM Mayank Kumar  wrote:
>>
>>> Ceph Users,
>>>
>>> My question is if all mons are down(i know its a terrible situation to
>>> be), does an existing rbd volume which is mapped to a host and being
>>> used(read/written to) continues to work?
>>>
>>> I understand that it wont get notifications about osdmap, etc, but
>>> assuming nothing fails, does the read/write ios on the exsiting rbd volume
>>> continue to work or that would start failing ?
>>>
>>
>> Clients will continue to function if there are transient monitor issues,
>> but you can't rely on them continuing in a long-term failure scenario.
>> Eventually *something* will hit a timeout, whether that's an OSD on its
>> pings, or some kind of key rotation for cephx, or
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] When all Mons are down, does existing RBD volume continue to work

2018-03-06 Thread Mayank Kumar
Thanks Gregory. This is basically just trying to understand the behavior of
the system in a failure scenario. Ideally we would track and fix mons
going down promptly.

In an ideal world where nothing else fails and cephx is not in use, but the
mons are down, what happens when the OSD pings to the mons time out? Would
that start resulting in I/O failures?


On Mon, Mar 5, 2018 at 9:44 PM Gregory Farnum  wrote:

> On Sun, Mar 4, 2018 at 12:02 AM Mayank Kumar  wrote:
>
>> Ceph Users,
>>
>> My question is if all mons are down(i know its a terrible situation to
>> be), does an existing rbd volume which is mapped to a host and being
>> used(read/written to) continues to work?
>>
>> I understand that it wont get notifications about osdmap, etc, but
>> assuming nothing fails, does the read/write ios on the exsiting rbd volume
>> continue to work or that would start failing ?
>>
>
> Clients will continue to function if there are transient monitor issues,
> but you can't rely on them continuing in a long-term failure scenario.
> Eventually *something* will hit a timeout, whether that's an OSD on its
> pings, or some kind of key rotation for cephx, or
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?

2018-03-06 Thread David Turner
There are multiple settings that affect this.  osd_heartbeat_grace is
probably the most apt.  If an OSD is not getting a response from another
OSD for more than the heartbeat_grace period, then it will tell the mons
that the OSD is down.  Once mon_osd_min_down_reporters have told the mons
that an OSD is down, then the OSD will be marked down by the cluster.  If
the OSD does not then talk to the mons directly to say that it is up, it
will be marked out after mon_osd_down_out_interval is reached.  If it does
talk to the mons to say that it is up, then it should be responding again
and be fine.

In your case where the OSD is half up, half down... I believe all you can
really do is monitor your cluster and troubleshoot OSDs causing problems
like this.  Basically every storage solution is vulnerable to this.
Sometimes an OSD just needs to be restarted due to being in a bad state
somehow, or simply removed from the cluster because the disk is going bad.

On Sun, Mar 4, 2018 at 2:28 AM shadow_lin  wrote:

> Hi list,
> During my tests of Ceph, I find that sometimes the whole Ceph cluster is
> blocked and the reason is one malfunctioning OSD. Ceph can heal itself if some
> OSD is down, but it seems that if some OSD is half dead (it has a heartbeat
> but can't handle requests), then all the requests which are directed to that
> OSD are blocked. If all OSDs are in one pool, the whole cluster can be blocked
> due to that one hung OSD.
> I think this is because Ceph will try to distribute requests to all OSDs, and
> if one of the OSDs won't confirm a request is done then everything is blocked.
> Is there a way to let Ceph mark the crippled OSD down if the requests
> directed to that OSD are blocked for more than a certain time, to avoid
> blocking the whole cluster?
>
> 2018-03-04
> --
> shadow_lin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-06 Thread Martin Emrich

Hi!

On 02.03.18 at 13:27, Federico Lucifredi wrote:


We do speak to the Xen team every once in a while, but while there is 
interest in adding Ceph support on their side, I think we are somewhat 
down the list of their priorities.



Maybe things change with XCP-ng (https://xcp-ng.github.io). Now as 
Citrix is removing features from 7.3 and cutting off users of the free 
version, this project looks very interesting (Trying to be what CentOS 
is/was to RHEL).


And they have Ceph RBD support on their ideas list already.

Cheers,

Martin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep Scrub distribution

2018-03-06 Thread Jonathan Proulx
On Tue, Mar 06, 2018 at 03:48:30PM +, David Turner wrote:
:I'm pretty sure I put up one of those scripts in the past.  Basically what
:we did was we set our scrub cycle to something like 40 days, we then sort
:all PGs by the last time they were deep scrubbed.  We grab the oldest 1/30
:of those PGs and tell them to deep-scrub manually, the next day we do it
:again.  After a month or so, your PGs should be fairly evenly spaced out
:over 30 days.  With those numbers you could disable the cron to run the
:deep-scrubs for maintenance up to 10 days every 40 days and still scrub all
:of your PGs during that time.

I think I had that script :)

But in Jewel (I believe it was Jewel) Ceph got smarter about spacing things
out and we ditched the cron job (though we probably still have a copy of the
script).

Now that we're on Luminous, things have bunched up again.  The main problem is
that they are bunched into 4 days or so, so there wouldn't be space for the
cron solution to work.

I have a theory on my potential mistake.  I had briefly dropped a zero from
the config, so things were scheduled for 4.2 days rather than 42. I
"corrected" that and restarted all OSDs, but the 'mgr' processes
still showed the 4.2d config.

Which process actually decides to start scrubs?  osd, mgr, mon?

In any case I've just ensured all instance of all three are showing
the same value for osd_deep_scrub_interval.

I guess if we go from everything scrubbing to nothing scrubbing I'll
dust off the cron script so we even out rather than just have the same
pileup less frequently.

Thanks,
-Jon


:On Mon, Mar 5, 2018 at 2:00 PM Gregory Farnum  wrote:
:
:> On Mon, Mar 5, 2018 at 9:56 AM Jonathan D. Proulx 
:> wrote:
:>
:>> Hi All,
:>>
:>> I've recently noticed my deep scrubs are EXTREMELY poorly
:>> distributed.  They are starting within the 18->06 local time start/stop
:>> window but are not distributed over enough days or well distributed
:>> over the range of days they have.
:>>
:>> root@ceph-mon0:~# for date in `ceph pg dump | awk '/active/{print
:>> $20}'`; do date +%D -d $date; done | sort | uniq -c
:>> dumped all
:>>   1 03/01/18
:>>   6 03/03/18
:>>8358 03/04/18
:>>1875 03/05/18
:>>
:>> So very nearly all 10240 pgs scrubbed lastnight/this morning.  I've
:>> been kicking this around for a while since I noticed poor distribution
:>> over a 7 day range when I was really pretty sure I'd changed that from
:>> the 7d default to 28d.
:>>
:>> Tried kicking it out to 42 days about a week ago with:
:>>
:>> ceph tell osd.* injectargs '--osd_deep_scrub_interval 3628800'
:>>
:>>
:>> There were many errors suggesting it could not reread the change and I'd
:>> need to restart the OSDs but 'ceph daemon osd.0 config show |grep
:>> osd_deep_scrub_interval' showed the right value so I let it roll for a
:>> week but the scrubs did not spread out.
:>>
:>> So Friday I set that value in ceph.conf and did rolling restarts of
:>> all OSDs.  Then doubled checked running value on all daemons.
:>> Checking Sunday, the nightly deep scrubs (based on LAST_DEEP_SCRUB
:>> voodoo above) show near enough 1/42nd of PGs had been scrubbed
:>> Saturday night that I thought this was working.
:>>
:>> This morning I checked again and got the results above.
:>>
:>> I would expect after changing to a 42d scrub cycle I'd see approx 1/42
:>> of the PGs deep scrub each night until there was a roughly even
:>> distribution over the past 42 days.
:>>
:>> So which thing is broken my config or my expectations?
:>>
:>
:> Sadly, changing the interval settings does not directly change the
:> scheduling of deep scrubs. Instead, it merely influences whether a PG will
:> get queued for scrub when it is examined as a candidate, based on how
:> out-of-date its scrub is. (That is, nothing holistically goes "I need to
:> scrub 1/n of these PGs every night"; there's a simple task that says "is
:> this PG's last scrub more than n days old?")
:>
:> Users have shared various scripts on the list for setting up a more even
:> scrub distribution by fiddling with the settings and poking at specific PGs
:> to try and smear them out over the whole time period; I'd check archives or
:> google for those. :)
:> -Greg
:> ___
:> ceph-users mailing list
:> ceph-users@lists.ceph.com
:> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
:>

-- 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delete a Pool - how hard should be?

2018-03-06 Thread Max Cuttins


On 06/03/2018 16:15, David Turner wrote:
I've never deleted a bucket, pool, etc at the request of a user that 
they then wanted back because I force them to go through a process to 
have their data deleted. They have to prove to me, and I have to 
agree, that they don't need it before I'll delete it.


Of course I cannot keep in touch with the customers of my reseller (whom
I don't know)
... or I should say with the end customer [of the customer] [of the
customer] [of the customer] of my resellers
... in order to obsessively ask them to please PROVE to me that their data is
not useful anymore.


And even if I could, I don't want to call all the end customers either, wasting
my time just so they can confirm that I can go on and do my job.


It just sounds like you need to either learn to be a storage admin, 
hire someone that is, or buy a solution that doesn't care if you are.


Uh! That's bad.
It is so sad when somebody cannot take a proposal as constructive
criticism but needs instead to mark others as incompetent.
Everybody has different admin experience and a different point of view, and
that's all, folks.

You don't have sub-sub-sub customers whom you don't know? I do.
Are you the one that makes everybody obey "the process"? I can't. I
need to solve my customers' requests, not yell at them when they are so
careless as to delete important data.
I just wrote to throw out a proposal in order to improve the admin's life,
not, of course, to be offended.

Thanks!



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep Scrub distribution

2018-03-06 Thread David Turner
I'm pretty sure I put up one of those scripts in the past.  Basically what
we did was we set our scrub cycle to something like 40 days, we then sort
all PGs by the last time they were deep scrubbed.  We grab the oldest 1/30
of those PGs and tell them to deep-scrub manually, the next day we do it
again.  After a month or so, your PGs should be fairly evenly spaced out
over 30 days.  With those numbers you could disable the cron to run the
deep-scrubs for maintenance up to 10 days every 40 days and still scrub all
of your PGs during that time.
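A rough sketch of that as a nightly cron job. The awk column numbers assume
the same 'ceph pg dump' layout as the one-liner quoted below (pgid in column 1,
deep-scrub stamp in columns 20-21), so verify them on your release before
relying on it:

#!/bin/bash
# deep-scrub the 1/30th of PGs with the oldest deep-scrub timestamps
total=$(ceph pg dump 2>/dev/null | awk '/active/' | wc -l)
batch=$(( total / 30 + 1 ))
ceph pg dump 2>/dev/null \
  | awk '/active/ {print $20, $21, $1}' \
  | sort \
  | head -n "$batch" \
  | awk '{print $3}' \
  | while read -r pgid; do
      ceph pg deep-scrub "$pgid"
    done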

On Mon, Mar 5, 2018 at 2:00 PM Gregory Farnum  wrote:

> On Mon, Mar 5, 2018 at 9:56 AM Jonathan D. Proulx 
> wrote:
>
>> Hi All,
>>
>> I've recently noticed my deep scrubs are EXTREMELY poorly
>> distributed.  They are starting within the 18->06 local time start/stop
>> window but are not distributed over enough days or well distributed
>> over the range of days they have.
>>
>> root@ceph-mon0:~# for date in `ceph pg dump | awk '/active/{print
>> $20}'`; do date +%D -d $date; done | sort | uniq -c
>> dumped all
>>   1 03/01/18
>>   6 03/03/18
>>8358 03/04/18
>>1875 03/05/18
>>
>> So very nearly all 10240 pgs scrubbed lastnight/this morning.  I've
>> been kicking this around for a while since I noticed poor distribution
>> over a 7 day range when I was really pretty sure I'd changed that from
>> the 7d default to 28d.
>>
>> Tried kicking it out to 42 days about a week ago with:
>>
>> ceph tell osd.* injectargs '--osd_deep_scrub_interval 3628800'
>>
>>
>> There were many errors suggesting it could not reread the change and I'd
>> need to restart the OSDs but 'ceph daemon osd.0 config show |grep
>> osd_deep_scrub_interval' showed the right value so I let it roll for a
>> week but the scrubs did not spread out.
>>
>> So Friday I set that value in ceph.conf and did rolling restarts of
>> all OSDs.  Then doubled checked running value on all daemons.
>> Checking Sunday, the nightly deep scrubs (based on LAST_DEEP_SCRUB
>> voodoo above) show near enough 1/42nd of PGs had been scrubbed
>> Saturday night that I thought this was working.
>>
>> This morning I checked again and got the results above.
>>
>> I would expect after changing to a 42d scrub cycle I'd see approx 1/42
>> of the PGs deep scrub each night until there was a roughly even
>> distribution over the past 42 days.
>>
>> So which thing is broken my config or my expectations?
>>
>
> Sadly, changing the interval settings does not directly change the
> scheduling of deep scrubs. Instead, it merely influences whether a PG will
> get queued for scrub when it is examined as a candidate, based on how
> out-of-date its scrub is. (That is, nothing holistically goes "I need to
> scrub 1/n of these PGs every night"; there's a simple task that says "is
> this PG's last scrub more than n days old?")
>
> Users have shared various scripts on the list for setting up a more even
> scrub distribution by fiddling with the settings and poking at specific PGs
> to try and smear them out over the whole time period; I'd check archives or
> google for those. :)
> -Greg
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delete a Pool - how hard should be?

2018-03-06 Thread Max Cuttins




On 06/03/2018 11:13, Ronny Aasen wrote:

On 06 March 2018 10:26, Max Cuttins wrote:

On 05/03/2018 20:17, Gregory Farnum wrote:


You're not wrong, and indeed that's why I pushed back on the latest 
attempt to make deleting pools even more cumbersome.


But having a "trash" concept is also pretty weird. If admins can 
override it to just immediately delete the data (if they need the 
space), how is that different from just being another hoop to jump 
through? If we want to give the data owners a chance to undo, how do 
we identify and notify *them* rather than the admin running the 
command? But if admins can't override the trash and delete 
immediately, what do we do for things like testing and proofs of 
concept where large-scale data creates and deletes are to be expected?

-Greg


I'm talking about my experience:

  * Data Owners are a little bit in LA LA LAND, and think that they
    can safely delete some of their data without losses.
  * Data Owners should think that their pool has really been deleted.
  * Data Owners should not be made aware of the existence of the
    "trash".
  * So the Data Owner asks to restore from backup (but instead we'll
    simply use the trash).

That said, we also have to consider that:

  * The Administrator is always GOD, so he needs to be able to
    override whenever he needs to.
  * However, the Administrator should just mark the pool as deleted without
    overriding this behaviour if there is no need to do so.
  * Overriding should be allowed only with many cumbersome warnings telling
    you that YOU SHOULD NOT OVERRIDE - PLEASE AVOID OVERRIDE.

I don't like software that limits administrators from doing their
job... in the end the Administrator will always find a way to do what he
wants (he's root).
Of course I like features that push the Admin to follow the right
behaviour.



some sort of active/inactive toggle on RBD images, pools, buckets
and filesystem trees is nice to allow admins to perform scream tests.


"data owner requests deletion - admin disables pool (kicks all clients)
- data owner screams - admin reactivates"


sounds much better than the last step being the admin checking whether the
backups are good...


I try to do something similar by renaming pools that are to be deleted, but
that is not always the same as inactive.




EXACTLY! :)
I like the name "scream test"... it really look like that! :)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-06 Thread Konstantin Shalygin

Dear all,
   I wonder how we could support VM systems with Ceph storage (block
device)? My colleagues are waiting for my answer for VMware (vSphere 5), and
I myself use oVirt (RHEV). The default protocol is iSCSI.
   I know that OpenStack/Cinder works well with Ceph, and Proxmox (just heard)
too. But currently we are using VMware and oVirt.


Your wise suggestion is appreciated

Cheers
Joshua



oVirt works with Ceph natively via librbd.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delete a Pool - how hard should be?

2018-03-06 Thread Ronny Aasen

On 06 March 2018 10:26, Max Cuttins wrote:

On 05/03/2018 20:17, Gregory Farnum wrote:


You're not wrong, and indeed that's why I pushed back on the latest 
attempt to make deleting pools even more cumbersome.


But having a "trash" concept is also pretty weird. If admins can 
override it to just immediately delete the data (if they need the 
space), how is that different from just being another hoop to jump 
through? If we want to give the data owners a chance to undo, how do 
we identify and notify *them* rather than the admin running the 
command? But if admins can't override the trash and delete 
immediately, what do we do for things like testing and proofs of 
concept where large-scale data creates and deletes are to be expected?

-Greg


I'm talking about my experience:

  * Data Owners are a little bit in LA LA LAND, and think that they
can safely delete some of their data without losses.
  * Data Owners should think that their pool has really been deleted.
  * Data Owners should not be made aware of the existence of the
"trash".
  * So the Data Owner asks to restore from backup (but instead we'll
simply use the trash).

That said, we also have to consider that:

  * The Administrator is always GOD, so he needs to be able to
override whenever he needs to.
  * However, the Administrator should just mark the pool as deleted without
overriding this behaviour if there is no need to do so.
  * Overriding should be allowed only with many cumbersome warnings telling you
that YOU SHOULD NOT OVERRIDE - PLEASE AVOID OVERRIDE.

I don't like software that limits administrators from doing their job...
in the end the Administrator will always find a way to do what he wants
(he's root).
Of course I like features that push the Admin to follow the right
behaviour.



some sort of active/inactive toggle on RBD images, pools, buckets
and filesystem trees is nice to allow admins to perform scream tests.


"data owner requests deletion - admin disables pool (kicks all clients) -
data owner screams - admin reactivates"


sounds much better than the last step being the admin checking whether the
backups are good...


I try to do something similar by renaming pools that are to be deleted, but
that is not always the same as inactive.



kind regards
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Packages for Debian 8 "Jessie" missing from download.ceph.com APT repository

2018-03-06 Thread Simon Fredsted
Hi,

I'm trying to install "ceph-common" on Debian 8 "Jessie", but it seems the 
packages aren't available for it. Searching for “jessie” on 
https://download.ceph.com/debian-luminous/pool/main/c/ceph/ yields no results.

I've tried to install it like it is documented here: 
http://docs.ceph.com/docs/master/install/get-packages/#debian-packages

However, after adding the repository, only version 10.2 and 0.80.7 from the 
official Debian repositories show up in "apt-cache policy ceph-common”

So far, my solution has been to use the “trusty” packages from Ubuntu, which
seem to work on my Debian box, for anybody else that’s seeking to resolve this
issue.
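Concretely, that boils down to a sources.list entry pointing at an Ubuntu
codename that does have builds, e.g. something like the line below (a sketch
only; adjust the codename and verify the key and repo layout against the
get-packages docs):

deb https://download.ceph.com/debian-luminous/ trusty main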

Thanks,

Best regards,
Simon Fredsted

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delete a Pool - how hard should be?

2018-03-06 Thread Max Cuttins

On 05/03/2018 20:17, Gregory Farnum wrote:


You're not wrong, and indeed that's why I pushed back on the latest 
attempt to make deleting pools even more cumbersome.


But having a "trash" concept is also pretty weird. If admins can 
override it to just immediately delete the data (if they need the 
space), how is that different from just being another hoop to jump 
through? If we want to give the data owners a chance to undo, how do 
we identify and notify *them* rather than the admin running the 
command? But if admins can't override the trash and delete 
immediately, what do we do for things like testing and proofs of 
concept where large-scale data creates and deletes are to be expected?

-Greg


I'm talking about my experience:

 * Data Owners are a little bit in LA LA LAND, and think that they
   can safely delete some of their data without losses.
 * Data Owners should think that their pool has really been deleted.
 * Data Owners should not be made aware of the existence of the
   "trash".
 * So the Data Owner asks to restore from backup (but instead we'll
   simply use the trash).

That said, we also have to consider that:

 * The Administrator is always GOD, so he needs to be able to
   override whenever he needs to.
 * However, the Administrator should just mark the pool as deleted without
   overriding this behaviour if there is no need to do so.
 * Overriding should be allowed only with many cumbersome warnings telling
   you that YOU SHOULD NOT OVERRIDE - PLEASE AVOID OVERRIDE.

I don't like software that limits administrators from doing their job...
in the end the Administrator will always find a way to do what he wants
(he's root).
Of course I like features that push the Admin to follow the right
behaviour.








___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delete a Pool - how hard should be?

2018-03-06 Thread Max Cuttins



What about using the at command:

echo 'ceph osd pool rm   --yes-i-really-really-mean-it' | at now + 30 days

Regards,
Alex


How do you know that this command has been scheduled?
How do you cancel the scheduled command once it has been queued?
This is weird. We need something within Ceph that lets you see the 
"status" of the pool as "pending delete".



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-06 Thread Brad Hubbard
debug_osd that is... :)

On Tue, Mar 6, 2018 at 7:10 PM, Brad Hubbard  wrote:

>
>
> On Tue, Mar 6, 2018 at 5:26 PM, Marco Baldini - H.S. Amiata <
> mbald...@hsamiata.it> wrote:
>
>> Hi
>>
>> I monitor dmesg in each of the 3 nodes, and no hardware issue is reported. 
>> The problem happens with various different OSDs in different nodes, so to me
>> it is clear it's not a hardware problem.
>>
>
> If you have osd_debug set to 25 or greater when you run the deep scrub you
> should get more information about the nature of the read error in the
> ReplicatedBackend::be_deep_scrub() function (assuming this is a
> replicated pool).
>
> This may create large logs, so watch that they don't exhaust storage.
>
>> Thanks for reply
>>
>>
>>
>> Il 05/03/2018 21:45, Vladimir Prokofev ha scritto:
>>
>> > always solved by ceph pg repair 
>> That doesn't necessarily mean that there's no hardware issue. In my case
>> repair also worked fine and returned the cluster to the OK state every time,
>> but over time the faulty disk failed another scrub, and this repeated
>> multiple times before we replaced that disk.
>> One last thing to look into is dmesg at your OSD nodes. If there's a
>> hardware read error it will be logged in dmesg.
>>
>> 2018-03-05 18:26 GMT+03:00 Marco Baldini - H.S. Amiata <
>> mbald...@hsamiata.it>:
>>
>>> Hi and thanks for reply
>>>
>>> The OSDs are all healthy, in fact after a ceph pg repair  the ceph
>>> health is back to OK and in the OSD log I see   repair ok, 0 fixed
>>>
>>> The SMART data of the 3 OSDs seems fine
>>>
>>> *OSD.5*
>>>
>>> # ceph-disk list | grep osd.5
>>>  /dev/sdd1 ceph data, active, cluster ceph, osd.5, block /dev/sdd2
>>>
>>> # smartctl -a /dev/sdd
>>> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
>>> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>>>
>>> === START OF INFORMATION SECTION ===
>>> Model Family: Seagate Barracuda 7200.14 (AF)
>>> Device Model: ST1000DM003-1SB10C
>>> Serial Number:Z9A1MA1V
>>> LU WWN Device Id: 5 000c50 090c7028b
>>> Firmware Version: CC43
>>> User Capacity:1,000,204,886,016 bytes [1.00 TB]
>>> Sector Sizes: 512 bytes logical, 4096 bytes physical
>>> Rotation Rate:7200 rpm
>>> Form Factor:  3.5 inches
>>> Device is:In smartctl database [for details use: -P show]
>>> ATA Version is:   ATA8-ACS T13/1699-D revision 4
>>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>>> Local Time is:Mon Mar  5 16:17:22 2018 CET
>>> SMART support is: Available - device has SMART capability.
>>> SMART support is: Enabled
>>>
>>> === START OF READ SMART DATA SECTION ===
>>> SMART overall-health self-assessment test result: PASSED
>>>
>>> General SMART Values:
>>> Offline data collection status:  (0x82) Offline data collection activity
>>> was completed without error.
>>> Auto Offline Data Collection: Enabled.
>>> Self-test execution status:  (   0) The previous self-test routine 
>>> completed
>>> without error or no self-test has ever
>>> been run.
>>> Total time to complete Offline
>>> data collection:(0) seconds.
>>> Offline data collection
>>> capabilities:(0x7b) SMART execute Offline immediate.
>>> Auto Offline data collection on/off 
>>> support.
>>> Suspend Offline collection upon new
>>> command.
>>> Offline surface scan supported.
>>> Self-test supported.
>>> Conveyance Self-test supported.
>>> Selective Self-test supported.
>>> SMART capabilities:(0x0003) Saves SMART data before entering
>>> power-saving mode.
>>> Supports SMART auto save timer.
>>> Error logging capability:(0x01) Error logging supported.
>>> General Purpose Logging supported.
>>> Short self-test routine
>>> recommended polling time:(   1) minutes.
>>> Extended self-test routine
>>> recommended polling time:( 109) minutes.
>>> Conveyance self-test routine
>>> recommended polling time:(   2) minutes.
>>> SCT capabilities:  (0x1085) SCT Status supported.
>>>
>>> SMART Attributes Data Structure revision number: 10
>>> Vendor Specific SMART Attributes with Thresholds:
>>> ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE  UPDATED  
>>> WHEN_FAILED RAW_VALUE
>>>   1 Raw_Read_Error_Rate 0x000f   082   063   006Pre-fail  Always
>>>-   193297722
>>>   3 Spin_Up_Time0x0003   097   097   000Pre-fail  Always
>>>-   0
>>>   4 Start_Stop_Count   

Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-06 Thread Brad Hubbard
On Tue, Mar 6, 2018 at 5:26 PM, Marco Baldini - H.S. Amiata <
mbald...@hsamiata.it> wrote:

> Hi
>
> I monitor dmesg in each of the 3 nodes, and no hardware issue is reported. 
> The problem happens with various different OSDs in different nodes, so to me
> it is clear it's not a hardware problem.
>

If you have osd_debug set to 25 or greater when you run the deep scrub you
should get more information about the nature of the read error in the
ReplicatedBackend::be_deep_scrub() function (assuming this is a replicated
pool).

This may create large logs, so watch that they don't exhaust storage.
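
A quick, temporary way to do that (osd.5 and the pg id are placeholders here; 
remember to turn the level back down afterwards) is along these lines:

ceph tell osd.5 injectargs '--debug_osd 25'
ceph pg deep-scrub <pgid>
# check /var/log/ceph/ceph-osd.5.log for the be_deep_scrub output, then revert
ceph tell osd.5 injectargs '--debug_osd 1/5'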

> Thanks for reply
>
>
>
> Il 05/03/2018 21:45, Vladimir Prokofev ha scritto:
>
> > always solved by ceph pg repair 
> That doesn't necessarily mean that there's no hardware issue. In my case
> repair also worked fine and returned the cluster to the OK state every time,
> but over time the faulty disk failed another scrub, and this repeated
> multiple times before we replaced that disk.
> One last thing to look into is dmesg at your OSD nodes. If there's a
> hardware read error it will be logged in dmesg.
>
> 2018-03-05 18:26 GMT+03:00 Marco Baldini - H.S. Amiata <
> mbald...@hsamiata.it>:
>
>> Hi and thanks for reply
>>
>> The OSDs are all healthy, in fact after a ceph pg repair  the ceph
>> health is back to OK and in the OSD log I see   repair ok, 0 fixed
>>
>> The SMART data of the 3 OSDs seems fine
>>
>> *OSD.5*
>>
>> # ceph-disk list | grep osd.5
>>  /dev/sdd1 ceph data, active, cluster ceph, osd.5, block /dev/sdd2
>>
>> # smartctl -a /dev/sdd
>> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
>> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>>
>> === START OF INFORMATION SECTION ===
>> Model Family: Seagate Barracuda 7200.14 (AF)
>> Device Model: ST1000DM003-1SB10C
>> Serial Number:Z9A1MA1V
>> LU WWN Device Id: 5 000c50 090c7028b
>> Firmware Version: CC43
>> User Capacity:1,000,204,886,016 bytes [1.00 TB]
>> Sector Sizes: 512 bytes logical, 4096 bytes physical
>> Rotation Rate:7200 rpm
>> Form Factor:  3.5 inches
>> Device is:In smartctl database [for details use: -P show]
>> ATA Version is:   ATA8-ACS T13/1699-D revision 4
>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>> Local Time is:Mon Mar  5 16:17:22 2018 CET
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>>
>> General SMART Values:
>> Offline data collection status:  (0x82)  Offline data collection activity
>>  was completed without error.
>>  Auto Offline Data Collection: Enabled.
>> Self-test execution status:  (   0)  The previous self-test routine 
>> completed
>>  without error or no self-test has ever
>>  been run.
>> Total time to complete Offline
>> data collection: (0) seconds.
>> Offline data collection
>> capabilities: (0x7b) SMART execute Offline immediate.
>>  Auto Offline data collection on/off 
>> support.
>>  Suspend Offline collection upon new
>>  command.
>>  Offline surface scan supported.
>>  Self-test supported.
>>  Conveyance Self-test supported.
>>  Selective Self-test supported.
>> SMART capabilities:(0x0003)  Saves SMART data before entering
>>  power-saving mode.
>>  Supports SMART auto save timer.
>> Error logging capability:(0x01)  Error logging supported.
>>  General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time: (   1) minutes.
>> Extended self-test routine
>> recommended polling time: ( 109) minutes.
>> Conveyance self-test routine
>> recommended polling time: (   2) minutes.
>> SCT capabilities:   (0x1085) SCT Status supported.
>>
>> SMART Attributes Data Structure revision number: 10
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE  UPDATED  
>> WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate 0x000f   082   063   006Pre-fail  Always 
>>   -   193297722
>>   3 Spin_Up_Time0x0003   097   097   000Pre-fail  Always 
>>   -   0
>>   4 Start_Stop_Count0x0032   100   100   020Old_age   Always 
>>   -   60
>>   5 Reallocated_Sector_Ct   0x0033   100   100   010Pre-fail  Always 
>>   -   0
>>   7 Seek_Error_Rate 

Re: [ceph-users] Cache tier

2018-03-06 Thread Захаров Алексей
Hi,
We use a write-around cache tier with libradosstriper-based clients. We ran 
into a bug which causes performance degradation: 
http://tracker.ceph.com/issues/22528 . It hits especially hard with a lot of 
small objects (the size of one striper chunk), since such objects get 
promoted on every read/write lock :).
And a cache tier is very hard to benchmark.

Also, we have a small testing pool with RBD disks for VMs. It works better 
with a cache tier on SSDs, but there is no heavy I/O load on it.

It's best to benchmark the cache tier for your specific use case and choose 
the cache mode based on the benchmark results.
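
As a very rough starting point (pool names are placeholders), something like 
this lets you attach the tier and rerun the same rados bench workload while 
switching cache modes:

ceph osd tier add base-pool cache-pool
ceph osd tier cache-mode cache-pool writeback
ceph osd tier set-overlay base-pool cache-pool

# same workload before and after, and again after changing the cache mode
rados bench -p base-pool 60 write --no-cleanup
rados bench -p base-pool 60 rand
rados -p base-pool cleanup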

06.03.2018, 02:28, "Budai Laszlo" :
> Dear all,
>
> I have some questions about cache tier in ceph:
>
> 1. Can someone share experiences with cache tiering? What are the sensitive 
> things to pay attention to regarding the cache tier? Can one use the same ssd 
> for both cache and
> 2. Is cache tiering supported with bluestore? Any advice for using a cache 
> tier with Bluestore?
>
> Kind regards,
> Laszlo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Regards,
Aleksei Zakharov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problem with UID starting with underscores

2018-03-06 Thread Arvydas Opulskis
Hi all,

because one of our scripts misbehaved, a new user with a bad UID was created via
the API, and now we can't remove, view or modify it. I believe it's because the
UID has three underscores at the beginning:

[root@rgw001 /]# radosgw-admin metadata list user | grep "___pro_"
"___pro_",

[root@rgw001 /]# radosgw-admin user info --uid="___pro_"
could not fetch user info: no user info saved

Do you have any ideas how to work around this problem? If such naming isn't
supported, maybe the API shouldn't allow creating it in the first place?
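
If going through the raw metadata key instead of --uid is a sane workaround, 
I guess it would look something like this (untested):

radosgw-admin metadata get user:___pro_
radosgw-admin metadata rm user:___pro_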

We are using Jewel version 10.2.10 on CentOS 7.4.

Thanks for any ideas,
Arvydas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com