Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?
Marking OSDs down is not without risks. You are taking away one of the copies of data for every PG on that OSD, and you are causing every PG on that OSD to peer. If that OSD comes back up, every PG on it needs to peer again and then recover. That is a lot of load and risk to automate into the system. Now take into consideration other causes of slow requests, such as more IO load than your spindles can handle, backfill settings set too aggressively (related to the first point), or networking problems. If the mon were detecting slow requests on OSDs and marking them down, you could end up marking half of your cluster down or corrupting data through flapping OSDs. The mons will mark OSDs down if the conditions I mentioned are met. If the OSD isn't unresponsive enough to stop responding to other OSDs or the mons, then there really isn't much that Ceph can do to automate this safely. There are just so many variables. If Ceph were a closed system on specific hardware, it could certainly monitor that hardware closely for early warning signs... but people run Ceph on everything they can compile it for, including Raspberry Pis.

The cluster admin, however, can add their own early detection for failures. You can monitor a lot about disks, including things such as average await on a host, to see if the disks are taking longer than normal to respond. That particular check led us to find several storage nodes with bad cache batteries on their controllers. Finding that explained some slowness we had noticed in the cluster, and it led us to a better method to catch that scenario sooner.

On Tue, Mar 6, 2018, 11:22 PM shadow_lin wrote:
> Hi Turner,
> Thanks for your insight.
> I am wondering, if the mon can detect slow/blocked requests from a certain osd,
> why can't the mon mark an osd with blocked requests down if the requests are
> blocked for a certain time.
> 2018-03-07
> --
> shadow_lin
> --
>
> *From:* David Turner
> *Sent:* 2018-03-06 23:56
> *Subject:* Re: [ceph-users] Why one crippled osd can slow down or block all
> request to the whole ceph cluster?
> *To:* "shadow_lin"
> *Cc:* "ceph-users"
>
> There are multiple settings that affect this. osd_heartbeat_grace is
> probably the most apt. If an OSD is not getting a response from another
> OSD for more than the heartbeat_grace period, then it will tell the mons
> that the OSD is down. Once mon_osd_min_down_reporters have told the mons
> that an OSD is down, the OSD will be marked down by the cluster. If
> the OSD does not then talk to the mons directly to say that it is up, it
> will be marked out after mon_osd_down_out_interval is reached. If it does
> talk to the mons to say that it is up, then it should be responding again
> and be fine.
>
> In your case where the OSD is half up, half down... I believe all you can
> really do is monitor your cluster and troubleshoot OSDs causing problems
> like this. Basically every storage solution is vulnerable to this.
> Sometimes an OSD just needs to be restarted due to being in a bad state
> somehow, or simply removed from the cluster because the disk is going bad.
>
> On Sun, Mar 4, 2018 at 2:28 AM shadow_lin wrote:
>
>> Hi list,
>> During my test of ceph, I found that sometimes the whole ceph cluster was blocked
>> and the reason was one unfunctional osd. Ceph can heal itself if some osd is
>> down, but it seems that if some osd is half dead (has a heartbeat but can't
>> handle requests) then all the requests directed to that osd are
>> blocked. If all osds are in one pool, the whole cluster can be
>> blocked due to that one hung osd.
>> I think this is because ceph will distribute requests across all
>> osds, and if one osd won't confirm a request is done then everything
>> is blocked.
>> Is there a way to let ceph mark the crippled osd down if the
>> requests directed to that osd are blocked for more than a certain time, to avoid the
>> whole cluster being blocked?
>>
>> 2018-03-04
>> --
>> shadow_lin
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
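David's suggestion of watching average await per disk can be sketched from two /proc/diskstats samples (using the Linux whole-device fields: reads completed, time spent reading, writes completed, time spent writing). A minimal sketch only — the helper names and any alerting threshold you'd bolt on are illustrative, not part of any Ceph tooling:

```python
# Sketch of the "average await" early-warning check described above.
# Assumes the Linux /proc/diskstats field layout; names and thresholds
# are illustrative, not a recommendation.

def parse_diskstats(text, device):
    """Pick one device's counters out of /proc/diskstats content.

    Uses fields 4, 7, 8 and 11 of a whole-device line: reads completed,
    ms spent reading, writes completed, ms spent writing.
    """
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 14 and f[2] == device:
            return {"reads": int(f[3]), "read_ms": int(f[6]),
                    "writes": int(f[7]), "write_ms": int(f[10])}
    raise KeyError(device)

def await_ms(prev, curr):
    """Average I/O wait (ms per I/O) between two diskstats samples."""
    ios = (curr["reads"] - prev["reads"]) + (curr["writes"] - prev["writes"])
    ticks = (curr["read_ms"] - prev["read_ms"]) + (curr["write_ms"] - prev["write_ms"])
    return ticks / ios if ios else 0.0
```

Polling each OSD disk twice a few seconds apart and alerting when await drifts well above its baseline is roughly what an iostat-based check would do.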
Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?
Hi Turner,
Thanks for your insight. I am wondering, if the mon can detect slow/blocked requests from a certain osd, why can't the mon mark an osd with blocked requests down if the requests are blocked for a certain time?

2018-03-07
shadow_lin

*From:* David Turner
*Sent:* 2018-03-06 23:56
*Subject:* Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?
*To:* "shadow_lin"
*Cc:* "ceph-users"

There are multiple settings that affect this. osd_heartbeat_grace is probably the most apt. If an OSD is not getting a response from another OSD for more than the heartbeat_grace period, then it will tell the mons that the OSD is down. Once mon_osd_min_down_reporters have told the mons that an OSD is down, the OSD will be marked down by the cluster. If the OSD does not then talk to the mons directly to say that it is up, it will be marked out after mon_osd_down_out_interval is reached. If it does talk to the mons to say that it is up, then it should be responding again and be fine.

In your case where the OSD is half up, half down... I believe all you can really do is monitor your cluster and troubleshoot OSDs causing problems like this. Basically every storage solution is vulnerable to this. Sometimes an OSD just needs to be restarted due to being in a bad state somehow, or simply removed from the cluster because the disk is going bad.

On Sun, Mar 4, 2018 at 2:28 AM shadow_lin wrote:

Hi list,
During my test of ceph, I found that sometimes the whole ceph cluster was blocked and the reason was one unfunctional osd. Ceph can heal itself if some osd is down, but it seems that if some osd is half dead (has a heartbeat but can't handle requests) then all the requests directed to that osd are blocked. If all osds are in one pool, the whole cluster can be blocked due to that one hung osd.
I think this is because ceph will distribute requests across all osds, and if one osd won't confirm a request is done then everything is blocked.
Is there a way to let ceph mark the crippled osd down if the requests directed to that osd are blocked for more than a certain time, to avoid the whole cluster being blocked?

2018-03-04
shadow_lin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Memory leak in Ceph OSD?
Hi,

I'm also seeing a slow memory increase over time with my bluestore nvme osds (3.2 TB each), with default ceph.conf settings (ceph 12.2.2). Each osd starts around 5 GB of memory and goes up to 8 GB. Currently I'm restarting them around once a month to free the memory.

Here is a dump of osd.0 after 1 week of running:

ceph 2894538 3.9 9.9 7358564 6553080 ? Ssl Mar01 303:03 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph

root@ceph4-1:~# ceph daemon osd.0 dump_mempools
{
  "bloom_filter": { "items": 0, "bytes": 0 },
  "bluestore_alloc": { "items": 84070208, "bytes": 84070208 },
  "bluestore_cache_data": { "items": 168, "bytes": 2908160 },
  "bluestore_cache_onode": { "items": 947820, "bytes": 636935040 },
  "bluestore_cache_other": { "items": 101250372, "bytes": 2043476720 },
  "bluestore_fsck": { "items": 0, "bytes": 0 },
  "bluestore_txc": { "items": 8, "bytes": 5760 },
  "bluestore_writing_deferred": { "items": 85, "bytes": 1203200 },
  "bluestore_writing": { "items": 7, "bytes": 569584 },
  "bluefs": { "items": 1774, "bytes": 106360 },
  "buffer_anon": { "items": 68307, "bytes": 17188636 },
  "buffer_meta": { "items": 284, "bytes": 24992 },
  "osd": { "items": 333, "bytes": 4017312 },
  "osd_mapbl": { "items": 0, "bytes": 0 },
  "osd_pglog": { "items": 1195884, "bytes": 298139520 },
  "osdmap": { "items": 4542, "bytes": 384464 },
  "osdmap_mapping": { "items": 0, "bytes": 0 },
  "pgmap": { "items": 0, "bytes": 0 },
  "mds_co": { "items": 0, "bytes": 0 },
  "unittest_1": { "items": 0, "bytes": 0 },
  "unittest_2": { "items": 0, "bytes": 0 },
  "total": { "items": 187539792, "bytes": 3089029956 }
}

Another osd after 1 month:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ceph 1718009 2.5 11.7 8542012 7725992 ?
Ssl 2017 2463:28 /usr/bin/ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph

root@ceph4-1:~# ceph daemon osd.5 dump_mempools
{
  "bloom_filter": { "items": 0, "bytes": 0 },
  "bluestore_alloc": { "items": 98449088, "bytes": 98449088 },
  "bluestore_cache_data": { "items": 759, "bytes": 17276928 },
  "bluestore_cache_onode": { "items": 884140, "bytes": 594142080 },
  "bluestore_cache_other": { "items": 116375567, "bytes": 2072801299 },
  "bluestore_fsck": { "items": 0, "bytes": 0 },
  "bluestore_txc": { "items": 6, "bytes": 4320 },
  "bluestore_writing_deferred": { "items": 99, "bytes": 1190045 },
  "bluestore_writing": { "items": 11, "bytes": 4510159 },
  "bluefs": { "items": 1202, "bytes": 64136 },
  "buffer_anon": { "items": 76863, "bytes": 21327234 },
  "buffer_meta": { "items": 910, "bytes": 80080 },
  "osd": { "items": 328, "bytes": 3956992 },
  "osd_mapbl": { "items": 0, "bytes": 0 },
  "osd_pglog": { "items": 1118050, "bytes": 286277600 },
  "osdmap": { "items": 6073, "bytes": 551872 },
  "osdmap_mapping": { "items": 0, "bytes": 0 },
  "pgmap": { "items": 0, "bytes": 0 },
  "mds_co": { "items": 0, "bytes": 0 },
  "unittest_1": { "items": 0, "bytes": 0 },
  "unittest_2": { "items": 0, "bytes": 0 },
  "total": { "items": 216913096, "bytes": 3100631833 }
}

----- Original message -----
From: "Kjetil Joergensen"
To: "ceph-users"
Sent: Wednesday, March 7, 2018 01:07:06
Subject: Re: [ceph-users] Memory leak in Ceph OSD?

Hi,

addendum: We're running 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b). The workload is a mix of 3x replicated & EC-coded pools (rbd, cephfs, rgw).

-KJ

On Tue, Mar 6, 2018 at 3:53 PM, Kjetil Joergensen <kje...@medallia.com> wrote:

Hi,

so.. +1

We don't run compression as far as I know, so that wouldn't be it. We do actually run a mix of bluestore & filestore - due to the rest of the cluster predating a stable bluestore by some amount.

The interesting part is - the behavior seems to be specific to our bluestore nodes.
Below - yellow line, node with 10 x ~4TB SSDs, green line 8 x 800GB SSDs. Blue line - dump_mempools total bytes for all the OSDs running on the yellow line.
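The per-pool breakdown from `ceph daemon osd.N dump_mempools` (as in the dumps above) can be diffed over time to see which pool is actually growing. A minimal sketch — `top_pools` is a hypothetical helper name, and the flat JSON shape matches the 12.2-era output quoted above (newer releases may nest the pools differently):

```python
# Rank mempools by bytes from a `ceph daemon osd.N dump_mempools` dump,
# using the flat JSON shape shown in the messages above. Useful for
# spotting which pool (onode cache, pg log, ...) grows between samples.
import json

def top_pools(dump_json, n=3):
    """Return the n largest non-empty pools as (name, bytes) pairs."""
    pools = json.loads(dump_json)
    ranked = sorted(((v["bytes"], k) for k, v in pools.items()
                     if k != "total" and v["bytes"] > 0), reverse=True)
    return [(name, nbytes) for nbytes, name in ranked[:n]]
```

Running this against the osd.0 dump above would put bluestore_cache_other, bluestore_cache_onode and osd_pglog on top — note the totals here account for ~3 GB while RSS is 6.5-7.7 GB, which is part of the puzzle in this thread.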
Re: [ceph-users] Memory leak in Ceph OSD?
Hi, addendum: We're running 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b). The workload is a mix of 3x replicated & EC-coded pools (rbd, cephfs, rgw). -KJ On Tue, Mar 6, 2018 at 3:53 PM, Kjetil Joergensen wrote: > Hi, > > so.. +1 > > We don't run compression as far as I know, so that wouldn't be it. We do > actually run a mix of bluestore & filestore - due to the rest of the > cluster predating a stable bluestore by some amount. > > The interesting part is - the behavior seems to be specific to our > bluestore nodes. > > Below - yellow line, node with 10 x ~4TB SSDs, green line 8 x 800GB SSDs. > Blue line - dump_mempools total bytes for all the OSDs running on the > yellow line. The big dips - forced restarts after having suffered through > after effects of letting linux deal with it by OOM->SIGKILL previously. > > > > A gross extrapolation - "right now" the "memory used" seems to be close > enough to "sum of RSS of ceph-osd processes" running on the machines. > > -KJ > > On Thu, Mar 1, 2018 at 7:18 PM, Alex Gorbachev > wrote: > >> On Thu, Mar 1, 2018 at 5:37 PM, Subhachandra Chandra >> wrote: >> > Even with bluestore we saw memory usage plateau at 3-4GB with 8TB drives >> > filled to around 90%. One thing that does increase memory usage is the >> > number of clients simultaneously sending write requests to a particular >> > primary OSD if the write sizes are large. >> >> We have not seen a memory increase in Ubuntu 16.04, but I also >> observed repeatedly the following phenomenon: >> >> When doing a VMotion in ESXi of a large 3TB file (this generates a lot >> of IO requests of small size) to a Ceph pool with compression set to >> force, after some time the Ceph cluster shows a large number of >> blocked requests and eventually timeouts become very large (to the >> point where ESXi aborts the IO due to timeouts). After abort, the >> blocked/slow requests messages disappear. There are no OSD errors. I >> have OSD logs if anyone is interested.
>> >> This does not occur when compression is unset. >> -- >> Alex Gorbachev >> Storcium >> >> > >> > Subhachandra >> > >> > On Thu, Mar 1, 2018 at 6:18 AM, David Turner >> wrote: >> >> >> >> With default memory settings, the general rule is 1GB ram/1TB OSD. If >> you >> >> have a 4TB OSD, you should plan to have at least 4GB ram. This was the >> >> recommendation for filestore OSDs, but it was a bit much memory for the >> >> OSDs. From what I've seen, this rule is a little more appropriate with >> >> bluestore now and should still be observed. >> >> >> >> Please note that memory usage in a HEALTH_OK cluster is not the same >> >> amount of memory that the daemons will use during recovery. I have >> seen >> >> deployments with 4x memory usage during recovery. >> >> >> >> On Thu, Mar 1, 2018 at 8:11 AM Stefan Kooman wrote: >> >>> >> >>> Quoting Caspar Smit (caspars...@supernas.eu): >> >>> > Stefan, >> >>> > >> >>> > How many OSD's and how much RAM are in each server? >> >>> >> >>> Currently 7 OSDs, 128 GB RAM. Max will be 10 OSDs in these servers. 12 >> >>> cores (at least one core per OSD). >> >>> >> >>> > bluestore_cache_size=6G will not mean each OSD is using max 6GB RAM >> >>> > right? >> >>> >> >>> Apparently. Sure they will use more RAM than just cache to function >> >>> correctly. I figured 3 GB per OSD would be enough ... >> >>> >> >>> > Our bluestore hdd OSD's with bluestore_cache_size at 1G use ~4GB of >> >>> > total >> >>> > RAM. The cache is a part of the memory usage by bluestore OSD's. >> >>> >> >>> A factor 4 is quite high, isn't it? Where is all this RAM used for >> >>> besides cache? RocksDB? >> >>> >> >>> So how should I size the amount of RAM in a OSD server for 10 >> bluestore >> >>> SSDs in a >> >>> replicated setup?
>> >>> >> >>> Thanks, >> >>> >> >>> Stefan >> >>> >> >>> -- >> >>> | BIT BV http://www.bit.nl/Kamer van Koophandel 09090351 >> >>> | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl >> >>> ___ >> >>> ceph-users mailing list >> >>> ceph-users@lists.ceph.com >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> >> >> >> >> ___ >> >> ceph-users mailing list >> >> ceph-users@lists.ceph.com >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> >> > >> > >> > ___ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > > > > -- > Kjetil Joergensen > SRE, Medallia Inc > -- Kjetil Joergensen SRE, Medallia Inc ___
Re: [ceph-users] Memory leak in Ceph OSD?
Hi, so.. +1 We don't run compression as far as I know, so that wouldn't be it. We do actually run a mix of bluestore & filestore - due to the rest of the cluster predating a stable bluestore by some amount. The interesting part is - the behavior seems to be specific to our bluestore nodes. Below - yellow line, node with 10 x ~4TB SSDs, green line 8 x 800GB SSDs. Blue line - dump_mempools total bytes for all the OSDs running on the yellow line. The big dips - forced restarts after having suffered through after effects of letting linux deal with it by OOM->SIGKILL previously. A gross extrapolation - "right now" the "memory used" seems to be close enough to "sum of RSS of ceph-osd processes" running on the machines. -KJ On Thu, Mar 1, 2018 at 7:18 PM, Alex Gorbachev wrote: > On Thu, Mar 1, 2018 at 5:37 PM, Subhachandra Chandra > wrote: > > Even with bluestore we saw memory usage plateau at 3-4GB with 8TB drives > > filled to around 90%. One thing that does increase memory usage is the > > number of clients simultaneously sending write requests to a particular > > primary OSD if the write sizes are large. > > We have not seen a memory increase in Ubuntu 16.04, but I also > observed repeatedly the following phenomenon: > > When doing a VMotion in ESXi of a large 3TB file (this generates a lot > of IO requests of small size) to a Ceph pool with compression set to > force, after some time the Ceph cluster shows a large number of > blocked requests and eventually timeouts become very large (to the > point where ESXi aborts the IO due to timeouts). After abort, the > blocked/slow requests messages disappear. There are no OSD errors. I > have OSD logs if anyone is interested. > > This does not occur when compression is unset. > > -- > Alex Gorbachev > Storcium > > > > > Subhachandra > > > > On Thu, Mar 1, 2018 at 6:18 AM, David Turner > wrote: > >> > >> With default memory settings, the general rule is 1GB ram/1TB OSD.
If > you > >> have a 4TB OSD, you should plan to have at least 4GB ram. This was the > >> recommendation for filestore OSDs, but it was a bit much memory for the > >> OSDs. From what I've seen, this rule is a little more appropriate with > >> bluestore now and should still be observed. > >> > >> Please note that memory usage in a HEALTH_OK cluster is not the same > >> amount of memory that the daemons will use during recovery. I have seen > >> deployments with 4x memory usage during recovery. > >> > >> On Thu, Mar 1, 2018 at 8:11 AM Stefan Kooman wrote: > >>> > >>> Quoting Caspar Smit (caspars...@supernas.eu): > >>> > Stefan, > >>> > > >>> > How many OSD's and how much RAM are in each server? > >>> > >>> Currently 7 OSDs, 128 GB RAM. Max will be 10 OSDs in these servers. 12 > >>> cores (at least one core per OSD). > >>> > >>> > bluestore_cache_size=6G will not mean each OSD is using max 6GB RAM > >>> > right? > >>> > >>> Apparently. Sure they will use more RAM than just cache to function > >>> correctly. I figured 3 GB per OSD would be enough ... > >>> > >>> > Our bluestore hdd OSD's with bluestore_cache_size at 1G use ~4GB of > >>> > total > >>> > RAM. The cache is a part of the memory usage by bluestore OSD's. > >>> > >>> A factor 4 is quite high, isn't it? Where is all this RAM used for > >>> besides cache? RocksDB? > >>> > >>> So how should I size the amount of RAM in a OSD server for 10 bluestore > >>> SSDs in a > >>> replicated setup?
> >>> > >>> Thanks, > >>> > >>> Stefan > >>> > >>> -- > >>> | BIT BV http://www.bit.nl/Kamer van Koophandel 09090351 > >>> | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl > >>> ___ > >>> ceph-users mailing list > >>> ceph-users@lists.ceph.com > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> > >> > >> ___ > >> ceph-users mailing list > >> ceph-users@lists.ceph.com > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Kjetil Joergensen SRE, Medallia Inc ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
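The sizing rule of thumb quoted in this thread (roughly 1 GB of RAM per 1 TB of OSD, with up to ~4x that observed during recovery) is simple enough to turn into arithmetic. `node_ram_gb` is a hypothetical helper and the 4x factor is just the worst case David reported, not a guarantee:

```python
# Back-of-the-envelope RAM sizing per OSD node, using the rule of thumb
# from this thread: ~1 GB RAM per TB of OSD, times a recovery headroom
# factor (up to ~4x usage was reported during recovery). Illustrative only.

def node_ram_gb(osd_tb, n_osds, recovery_factor=4):
    baseline = osd_tb * n_osds          # 1 GB per TB, steady state
    return baseline * recovery_factor   # headroom for recovery

# e.g. 10 x 4 TB OSDs: 40 GB steady state, 160 GB with 4x recovery headroom
```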
Re: [ceph-users] change radosgw object owner
On Tue, Mar 6, 2018 at 11:40 AM, Ryan Leimenstoll wrote: > Hi all, > > We are trying to move a bucket in radosgw from one user to another in an > effort to both change ownership and attribute the storage usage of the data to > the receiving user's quota. > > I have unlinked the bucket and linked it to the new user using: > > radosgw-admin bucket unlink --bucket=$MYBUCKET --uid=$USER > radosgw-admin bucket link --bucket=$MYBUCKET --bucket-id=$BUCKET_ID > --uid=$NEWUSER > > However, perhaps as expected, the owner of all the objects in the bucket > remains $USER. I don't believe changing the owner is a supported operation > in the S3 protocol; however, it would be very helpful to have the ability to > do this on the radosgw backend. This is especially useful for large > buckets/datasets where copying the objects out of and back into radosgw could be time > consuming. > > Is this something that is currently possible within radosgw? We are running > Ceph 12.2.2. Maybe try to copy the objects onto themselves with the new owner (as long as it can read them; if not, you first need to change the objects' ACLs to allow read)? Note that you need to do a copy that retains the old meta attributes of the old object. Yehuda > > Thanks, > Ryan Leimenstoll > rleim...@umiacs.umd.edu > University of Maryland Institute for Advanced Computer Studies > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
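Yehuda's copy-onto-itself suggestion could be driven through the S3 API, for instance with boto3 using the *new* user's credentials so the rewritten objects come out owned by $NEWUSER. A hedged sketch, not tested against radosgw: `self_copy_args` and `chown_bucket` are hypothetical helper names, `MetadataDirective="COPY"` asks S3 to retain the old object metadata, and the new user must first be able to read the objects (ACLs):

```python
# Sketch of Yehuda's suggestion: re-write each object in place using the
# NEW owner's credentials, so each copy becomes owned by the new user.
# Untested against RGW; endpoint and credentials are placeholders.

def self_copy_args(bucket, key):
    """kwargs for an in-place S3 copy that preserves object metadata."""
    return {
        "Bucket": bucket,
        "Key": key,
        "CopySource": {"Bucket": bucket, "Key": key},
        "MetadataDirective": "COPY",  # retain the old meta attributes
    }

def chown_bucket(bucket, endpoint, access_key, secret_key):
    import boto3  # use the new user's keys -> copies owned by the new user
    s3 = boto3.client("s3", endpoint_url=endpoint,
                      aws_access_key_id=access_key,
                      aws_secret_access_key=secret_key)
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            s3.copy_object(**self_copy_args(bucket, obj["Key"]))
```

For a large bucket this is still one full rewrite per object, which is exactly the cost the original question hoped to avoid.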
Re: [ceph-users] OSD crash during pg repair - recovery_info.ss.clone_snaps.end and other problems
On Sat, Mar 3, 2018 at 2:28 AM Jan Pekař - Imatic wrote: > Hi all, > > I have a few problems on my cluster that are maybe linked together and > have now caused an OSD to go down during pg repair. > > First, a few notes about my cluster: > > 4 nodes, 15 OSDs installed on Luminous (no upgrade). > Replicated pools with 1 pool (pool 6) cached by ssd disks. > I don't detect any hardware failures (disk IO errors, restarts, > corrupted data etc). > I'm running RBDs using libvirt on debian wheezy and jessie (stable and > oldstable). > I'm snapshotting RBDs using a Luminous client on Debian Jessie only. > When you say "cached by", do you mean there's a cache pool? Or are you using bcache or something underneath? > > Now the problems, from light to severe: > > 1) > Almost every day I notice some health problems after deep scrub - > 1-2 inconsistent PGs with "read_error" on some osds. > When I don't repair it, it disappears after a few days (? another deep > scrub). There are no read errors on the disks (disk check ok, no errors logged > in syslog). > > 2) > I noticed on my pool 6 (cached pool) that scrub reports some objects > that shouldn't be there: > > 2018-02-27 23:43:06.490152 7f4b3820e700 -1 osd.1 pg_epoch: 8712 pg[6.20( > v 8712'771984 (8712'770478,8712'771984] local-lis/les=8710/8711 n=14299 > ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807 8710/8710/8710) > [1,10,14] r=0 lpr=8710 crt=8712'771984 lcod 8712'771983 mlcod > 8712'771983 active+clean+scrubbing+deep+inconsistent+repair] _scan_snaps > no head for 6:07ffbc7b:::rbd_data.967992ae8944a.00061cb8:c2 > (have MIN) > > I think that means an orphaned snap object without its head replica. Maybe > snaptrim left it there? Why? Maybe an error during snaptrim? Or > fstrim/discard removed the "head" object (this is, I hope, nonsense)?
> > 3) > I ended up with one object (probably a snap object) that has only 1 replica > (out of size 3), and when I try to repair it, my OSD crashes with > > /build/ceph-12.2.3/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != > recovery_info.ss.clone_snaps.end()) > I guess it detected the orphaned snap object I noticed in 2) and doesn't > repair it, just asserts and stops the OSD. Am I right? > > I noticed the comment "// hmm, should we warn?" in the ceph source at that > assert. So should someone remove that assert? > There's a ticket https://tracker.ceph.com/issues/23030, which links to a much longer discussion on this mailing list between Sage and Stefan which discusses this particular assert. I'm not entirely clear from the rest of your story (and the long history in that thread) if there are other potential causes, or if your story might help diagnose it. But I'd start there since AFAIK it's still a mystery that looks serious but has only a very small number of incidences. :/ -Greg > > And my questions are: > > How can I fix the issue with the crashing OSD? > How can I safely remove the objects with a missing head? Is there any > tool or force-snaptrim for non-existent snapshots? It is a prod cluster, so > I want to be careful. I have no problems now with data availability. > My last idea is to move the RBDs to another pool, but I don't have enough space > to do that (as far as I know an RBD can only be copied, not moved), so I'm looking for > another clean solution. > And a last question - how can I find what is causing the read_errors and > snap object leftovers? > > Should I paste my whole log? It is bigger than the allowed post size.
> Pasting most important events: > > -23> 2018-02-27 23:43:07.903368 7f4b3820e700 2 osd.1 pg_epoch: 8712 > pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711 > n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807 > 8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985 > mlcod 8712'771985 active+clean+scrubbing+deep+inconsistent+repair] 6.20 > repair 1 missing, 0 inconsistent objects > -22> 2018-02-27 23:43:07.903410 7f4b3820e700 -1 log_channel(cluster) > log [ERR] : 6.20 repair 1 missing, 0 inconsistent objects > -21> 2018-02-27 23:43:07.903446 7f4b3820e700 -1 log_channel(cluster) > log [ERR] : 6.20 repair 3 errors, 2 fixed > -20> 2018-02-27 23:43:07.903480 7f4b3820e700 5 > write_log_and_missing with: dirty_to: 0'0, dirty_from: > 4294967295'18446744073709551615, writeout_from: > 4294967295'18446744073709551615, trimmed: , trimmed_dups: , > clear_divergent_priors: 0 > -19> 2018-02-27 23:43:07.903604 7f4b3820e700 1 -- > [2a01:430:22a::cef:c011]:6805/514544 --> > [2a01:430:22a::cef:c021]:6803/3001666 -- MOSDScrubReserve(6.20 RELEASE > e8712) v1 -- 0x55a4c5459c00 con 0 > -18> 2018-02-27 23:43:07.903651 7f4b3820e700 1 -- > [2a01:430:22a::cef:c011]:6805/514544 --> > [2a01:430:22a::cef:c041]:6802/3012729 -- MOSDScrubReserve(6.20 RELEASE > e8712) v1 -- 0x55a4cb6dee00 con 0 > -17> 2018-02-27 23:43:07.903679 7f4b3820e700 1 -- > [2a01:430:22a::cef:c011]:6805/514544 --> > [2a01:430:22a::cef:c021]:6803/3001666 -- pg_info((query:8712 sent:8712 > 6.20( v 8712'771986
[ceph-users] Civetweb log format
Hey all, I'm trying to get something of an audit log out of radosgw. To that end I was wondering if there's a mechanism to customize the log format of civetweb. It's already writing IP, HTTP verb, path, response and time, but I'm hoping to get it to print the Authorization header of the request, which contains the access key ID that we can tie back into the systems we use to issue credentials. Any thoughts? Thanks, Aaron CONFIDENTIALITY NOTICE This e-mail message and any attachments are only for the use of the intended recipient and may contain information that is privileged, confidential or exempt from disclosure under applicable law. If you are not the intended recipient, any disclosure, distribution or other use of this e-mail message or attachments is prohibited. If you have received this e-mail message in error, please delete and notify the sender immediately. Thank you. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] change radosgw object owner
On Tue, Mar 06, 2018 at 02:40:11PM -0500, Ryan Leimenstoll wrote: > Hi all, > > We are trying to move a bucket in radosgw from one user to another in an > effort to both change ownership and attribute the storage usage of the data to > the receiving user's quota. > > I have unlinked the bucket and linked it to the new user using: > > radosgw-admin bucket unlink --bucket=$MYBUCKET --uid=$USER > radosgw-admin bucket link --bucket=$MYBUCKET --bucket-id=$BUCKET_ID > --uid=$NEWUSER > > However, perhaps as expected, the owner of all the objects in the > bucket remains $USER. I don't believe changing the owner is a > supported operation in the S3 protocol; however, it would be very > helpful to have the ability to do this on the radosgw backend. This is > especially useful for large buckets/datasets where copying the objects > out of and back into radosgw could be time consuming. At the raw radosgw-admin level, you should be able to do it with bi-list/bi-get/bi-put. The downside here is that I don't think the BI ops are exposed in the HTTP Admin API, so it's going to be really expensive to chown lots of objects.
Using a quick example: # radosgw-admin \ --uid UID-CENSORED \ --bucket BUCKET-CENSORED \ bi get \ --object=OBJECTNAME-CENSORED { "type": "plain", "idx": "OBJECTNAME-CENSORED", "entry": { "name": "OBJECTNAME-CENSORED", "instance": "", "ver": { "pool": 5, "epoch": 266028 }, "locator": "", "exists": "true", "meta": { "category": 1, "size": 1066, "mtime": "2016-11-17 17:01:29.668746Z", "etag": "e7a75c39df3d123c716d5351059ad2d9", "owner": "UID-CENSORED", "owner_display_name": "UID-CENSORED", "content_type": "image/png", "accounted_size": 1066, "user_data": "" }, "tag": "default.293024600.1188196", "flags": 0, "pending_map": [], "versioned_epoch": 0 } } -- Robin Hugh Johnson Gentoo Linux: Dev, Infra Lead, Foundation Treasurer E-Mail : robb...@gentoo.org GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85 GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136 signature.asc Description: Digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
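Robin's bi-get/bi-put route could in principle be scripted: fetch each entry, rewrite the owner fields, and feed the result back with `radosgw-admin bi put`. A sketch under the assumption that the entry JSON keeps the shape shown above; `chown_bi_entry` is a hypothetical helper, and this path is untested, so back up the bucket index before trying anything like it:

```python
# Sketch of the bi-get/edit/bi-put approach: rewrite the owner fields of
# a bucket-index entry (JSON shape as in the `bi get` example above)
# before feeding it back with `radosgw-admin bi put`. Illustrative only.
import json

def chown_bi_entry(bi_json, new_uid, new_display=None):
    """Return the bi entry JSON with meta.owner re-stamped to new_uid."""
    entry = json.loads(bi_json)
    meta = entry["entry"]["meta"]
    meta["owner"] = new_uid
    meta["owner_display_name"] = new_display or new_uid
    return json.dumps(entry)
```

Note this only rewrites the bucket-index metadata; whether the per-object ACLs in the head objects also need updating is a separate question.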
Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
On 03/06/2018 01:17 PM, Lazuardi Nasution wrote: > Hi, > > I want to do load balanced multipathing (multiple iSCSI gateway/exporter > nodes) of iSCSI backed with RBD images. Should I disable exclusive lock > feature? What if I don't disable that feature? I'm using TGT (manual > way) since I get so many CPU stuck error messages when I was using LIO. > You are using LIO/TGT with krbd right? You cannot or shouldn't do active/active multipathing. If you have the lock enabled then it bounces between paths for each IO and will be slow. If you do not have it enabled then you can end up with stale IO overwriting current data. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] change radosgw object owner
Hi all, We are trying to move a bucket in radosgw from one user to another in an effort to both change ownership and attribute the storage usage of the data to the receiving user's quota. I have unlinked the bucket and linked it to the new user using: radosgw-admin bucket unlink --bucket=$MYBUCKET --uid=$USER radosgw-admin bucket link --bucket=$MYBUCKET --bucket-id=$BUCKET_ID --uid=$NEWUSER However, perhaps as expected, the owner of all the objects in the bucket remains $USER. I don't believe changing the owner is a supported operation in the S3 protocol; however, it would be very helpful to have the ability to do this on the radosgw backend. This is especially useful for large buckets/datasets where copying the objects out of and back into radosgw could be time consuming. Is this something that is currently possible within radosgw? We are running Ceph 12.2.2. Thanks, Ryan Leimenstoll rleim...@umiacs.umd.edu University of Maryland Institute for Advanced Computer Studies ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
Hi, I want to do load balanced multipathing (multiple iSCSI gateway/exporter nodes) of iSCSI backed with RBD images. Should I disable exclusive lock feature? What if I don't disable that feature? I'm using TGT (manual way) since I get so many CPU stuck error messages when I was using LIO. Best regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] When all Mons are down, does existing RBD volume continue to work
I think things would keep running, but I'm really not sure. This is just not a realistic concern, as there are lots of little housekeeping things that can be deferred for a little while but will eventually stop forward progress if you can't talk to the monitors to persist cluster state updates. On Tue, Mar 6, 2018 at 9:50 AM Mayank Kumar wrote: > Thanks Gregory. This is basically just trying to understand the behavior > of the system in a failure scenario. Ideally we would track and fix mons > going down promptly. > > In an ideal world where nothing else fails and cephx is not in use > but mons are down, what happens if the osd pings to mons time out? Would > that start resulting in I/O failures? > > > On Mon, Mar 5, 2018 at 9:44 PM Gregory Farnum wrote: > >> On Sun, Mar 4, 2018 at 12:02 AM Mayank Kumar wrote: >> >>> Ceph Users, >>> >>> My question is: if all mons are down (I know it's a terrible situation to >>> be in), does an existing rbd volume which is mapped to a host and being >>> used (read/written to) continue to work? >>> >>> I understand that it won't get notifications about the osdmap, etc., but >>> assuming nothing fails, do the read/write IOs on the existing rbd volume >>> continue to work, or would they start failing? >>> >> >> Clients will continue to function if there are transient monitor issues, >> but you can't rely on them continuing in a long-term failure scenario. >> Eventually *something* will hit a timeout, whether that's an OSD on its >> pings, or some kind of key rotation for cephx, or >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] When all Mons are down, does existing RBD volume continue to work
Thanks Gregory. This is basically just trying to understand the behavior of the system in a failure scenario. Ideally we would track and fix mons going down promptly. In an ideal world where nothing else fails and cephx is not in use but mons are down, what happens if the osd pings to mons time out? Would that start resulting in I/O failures? On Mon, Mar 5, 2018 at 9:44 PM Gregory Farnum wrote: > On Sun, Mar 4, 2018 at 12:02 AM Mayank Kumar wrote: > >> Ceph Users, >> >> My question is: if all mons are down (I know it's a terrible situation to >> be in), does an existing rbd volume which is mapped to a host and being >> used (read/written to) continue to work? >> >> I understand that it won't get notifications about osdmap, etc., but >> assuming nothing fails, do the read/write IOs on the existing rbd volume >> continue to work, or would they start failing? >> > > Clients will continue to function if there are transient monitor issues, > but you can't rely on them continuing in a long-term failure scenario. > Eventually *something* will hit a timeout, whether that's an OSD on its > pings, or some kind of key rotation for cephx, or >
Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?
There are multiple settings that affect this. osd_heartbeat_grace is probably the most apt. If an OSD is not getting a response from another OSD for more than the heartbeat_grace period, then it will tell the mons that the OSD is down. Once mon_osd_min_down_reporters have told the mons that an OSD is down, then the OSD will be marked down by the cluster. If the OSD does not then talk to the mons directly to say that it is up, it will be marked out after mon_osd_down_out_interval is reached. If it does talk to the mons to say that it is up, then it should be responding again and be fine. In your case where the OSD is half up, half down... I believe all you can really do is monitor your cluster and troubleshoot OSDs causing problems like this. Basically every storage solution is vulnerable to this. Sometimes an OSD just needs to be restarted due to being in a bad state somehow, or simply removed from the cluster because the disk is going bad. On Sun, Mar 4, 2018 at 2:28 AM shadow_lin wrote: > Hi list, > During my testing of ceph, I found that sometimes the whole ceph cluster gets blocked, > and the reason was one nonfunctional osd. Ceph can heal itself if some osd is > down, but it seems that if some osd is half dead (it has a heartbeat but can't > handle requests), then all the requests which are directed to that osd are > blocked. If all osds are in one pool, the whole cluster can be > blocked due to that one hung osd. > I think this is because ceph will try to distribute requests to all > osds, and if one of the osds won't confirm a request is done, then everything > is blocked. > Is there a way to let ceph mark the crippled osd down if the > requests directed to that osd are blocked for more than a certain time, to avoid the > whole cluster being blocked? 
> > 2018-03-04 > -- > shadow_lin
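The settings David names can be sketched as a ceph.conf fragment. The values shown here are illustrative assumptions, not recommendations; check the defaults for your release with `ceph daemon osd.0 config show`:

```ini
[osd]
# Seconds a peer waits for heartbeat replies before reporting this OSD
# to the mons as down (assumed default shown).
osd_heartbeat_grace = 20

[mon]
# How many distinct OSDs must report an OSD down before the mons mark it down.
mon_osd_min_down_reporters = 2
# Seconds an OSD stays "down" before the mons also mark it "out"
# and data starts re-replicating elsewhere.
mon_osd_down_out_interval = 600
```

Raising min_down_reporters is the usual guard against a single flaky OSD wrongly reporting healthy peers down, at the cost of slower failure detection.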
Re: [ceph-users] Ceph iSCSI is a prank?
Hi! On 02.03.18 at 13:27, Federico Lucifredi wrote: We do speak to the Xen team every once in a while, but while there is interest in adding Ceph support on their side, I think we are somewhat down the list of their priorities. Maybe things will change with XCP-ng (https://xcp-ng.github.io). Now that Citrix is removing features from 7.3 and cutting off users of the free version, this project looks very interesting (trying to be what CentOS is/was to RHEL). And they have Ceph RBD support on their ideas list already. Cheers, Martin
Re: [ceph-users] Deep Scrub distribution
On Tue, Mar 06, 2018 at 03:48:30PM +, David Turner wrote: :I'm pretty sure I put up one of those scripts in the past. Basically what :we did was we set our scrub cycle to something like 40 days, then we sorted :all PGs by the last time they were deep scrubbed. We grab the oldest 1/30 :of those PGs and tell them to deep-scrub manually, and the next day we do it :again. After a month or so, your PGs should be fairly evenly spaced out :over 30 days. With those numbers you could disable the cron that runs the :deep-scrubs for maintenance up to 10 days every 40 days and still scrub all :of your PGs during that time. I think I had that script :) But in Jewel (I believe it was Jewel) ceph got smarter about spacing things out, and we ditched the cron job (though we probably still have a copy of the script). Now that we're on Luminous, things bunched up again. The main problem is that they are bunched into 4 days or so, so there wouldn't be space for the cron solution to work. I have a theory about my potential mistake. I had dropped a zero from the config briefly, so things were scheduled for 4.2 days rather than 42, but I "corrected" that and restarted all OSDs; the 'mgr' processes, however, still showed the 4.2d config. Which process actually decides to start scrubs? osd, mgr, mon? In any case I've just ensured all instances of all three are showing the same value for osd_deep_scrub_interval. I guess if we go from everything scrubbing to nothing scrubbing I'll dust off the cron script so we even out, rather than just have the same pileup less frequently. Thanks, -Jon :On Mon, Mar 5, 2018 at 2:00 PM Gregory Farnum wrote: : :> On Mon, Mar 5, 2018 at 9:56 AM Jonathan D. Proulx :> wrote: :> :>> Hi All, :>> :>> I've recently noticed my deep scrubs are EXTREMELY poorly :>> distributed. They are starting within the 18->06 local time start/ :>> stop window but are not distributed over enough days or well distributed :>> over the range of days they have. 
:>> :>> root@ceph-mon0:~# for date in `ceph pg dump | awk '/active/{print :>> $20}'`; do date +%D -d $date; done | sort | uniq -c :>> dumped all :>> 1 03/01/18 :>> 6 03/03/18 :>>8358 03/04/18 :>>1875 03/05/18 :>> :>> So very nearly all 10240 pgs scrubbed last night/this morning. I've :>> been kicking this around for a while, since I noticed poor distribution :>> over a 7-day range when I was really pretty sure I'd changed that from :>> the 7d default to 28d. :>> :>> Tried kicking it out to 42 days about a week ago with: :>> :>> ceph tell osd.* injectargs '--osd_deep_scrub_interval 3628800' :>> :>> :>> There were many errors suggesting it could not reread the change and I'd :>> need to restart the OSDs, but 'ceph daemon osd.0 config show | grep :>> osd_deep_scrub_interval' showed the right value, so I let it roll for a :>> week, but the scrubs did not spread out. :>> :>> So Friday I set that value in ceph.conf and did rolling restarts of :>> all OSDs. Then double-checked the running value on all daemons. :>> Checking Sunday, the nightly deep scrubs (based on the LAST_DEEP_SCRUB :>> voodoo above) showed near enough 1/42nd of PGs had been scrubbed :>> Saturday night that I thought this was working. :>> :>> This morning I checked again and got the results above. :>> :>> I would expect after changing to a 42d scrub cycle I'd see approx 1/42 :>> of the PGs deep scrub each night until there was a roughly even :>> distribution over the past 42 days. :>> :>> So which thing is broken, my config or my expectations? :>> :> :> Sadly, changing the interval settings does not directly change the :> scheduling of deep scrubs. Instead, it merely influences whether a PG will :> get queued for scrub when it is examined as a candidate, based on how :> out-of-date its scrub is. 
(That is, nothing holistically goes "I need to :> scrub 1/n of these PGs every night"; there's a simple task that says "is :> this PG's last scrub more than n days old?") :> :> Users have shared various scripts on the list for setting up a more even :> scrub distribution by fiddling with the settings and poking at specific PGs :> to try and smear them out over the whole time period; I'd check the archives or :> Google for those. :) :> -Greg :> --
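The cron approach David describes (manually deep-scrub the oldest 1/30th of PGs each night) can be sketched like this. The `pick_oldest` helper is hypothetical, and the `$20` column index for the last-deep-scrub timestamp is an assumption from the thread above; `ceph pg dump` column layout varies by release, so verify it on your cluster first:

```shell
# pick_oldest N: read "pgid last-deep-scrub-stamp" lines on stdin and print
# the pgids of the oldest 1/Nth of them (rounded up).
pick_oldest() {
  sort -k2 | awk -v n="$1" '
    { pg[NR] = $1 }
    END { batch = int((NR + n - 1) / n); for (i = 1; i <= batch; i++) print pg[i] }'
}

# Nightly cron sketch against a real cluster (untested assumption; check
# that $20 really is the DEEP_SCRUB_STAMP column in your release):
#   ceph pg dump | awk '/active/{print $1, $20}' | pick_oldest 30 \
#     | xargs -n1 ceph pg deep-scrub
```

After roughly N days of this, the per-PG scrub stamps end up smeared over the whole window instead of bunched into a few nights.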
Re: [ceph-users] Delete a Pool - how hard should be?
On 06/03/2018 16:15, David Turner wrote: I've never deleted a bucket, pool, etc. at the request of a user that they then wanted back, because I force them to go through a process to have their data deleted. They have to prove to me, and I have to agree, that they don't need it before I'll delete it. Of course I cannot keep in touch with the customers of my resellers (whom I don't know) .. or I should say with the end customer [of the customer] [of the customer] [of the customer] of my resellers ... in order to obsessively ask them to please PROVE to me that their data are not useful anymore. And even if I could, I don't want to call all the end customers, wasting my time, to have them confirm _*I can go on*_ and do my job. It just sounds like you need to either learn to be a storage admin, hire someone that is, or buy a solution that doesn't care if you are. Uh! That's bad. It is so sad when somebody cannot take a proposal as constructive criticism but instead needs to mark others as incompetent. Everybody has different admin experience and a different point of view, and that's all, folks. You don't have sub-sub-sub customers whom you don't know? I do. You are the one that makes everybody obey "the process"? I can't. I need to solve the requests of my customers, not yell when they are so dumb as to delete important data. I just wrote to throw out a proposal to improve the admin's life, not of course to be offended. Thanks!
Re: [ceph-users] Deep Scrub distribution
I'm pretty sure I put up one of those scripts in the past. Basically what we did was we set our scrub cycle to something like 40 days, then we sorted all PGs by the last time they were deep scrubbed. We grab the oldest 1/30 of those PGs and tell them to deep-scrub manually, and the next day we do it again. After a month or so, your PGs should be fairly evenly spaced out over 30 days. With those numbers you could disable the cron that runs the deep-scrubs for maintenance up to 10 days every 40 days and still scrub all of your PGs during that time. On Mon, Mar 5, 2018 at 2:00 PM Gregory Farnum wrote: > On Mon, Mar 5, 2018 at 9:56 AM Jonathan D. Proulx > wrote: > >> Hi All, >> >> I've recently noticed my deep scrubs are EXTREMELY poorly >> distributed. They are starting within the 18->06 local time start/ >> stop window but are not distributed over enough days or well distributed >> over the range of days they have. >> >> root@ceph-mon0:~# for date in `ceph pg dump | awk '/active/{print >> $20}'`; do date +%D -d $date; done | sort | uniq -c >> dumped all >> 1 03/01/18 >> 6 03/03/18 >>8358 03/04/18 >>1875 03/05/18 >> >> So very nearly all 10240 pgs scrubbed last night/this morning. I've >> been kicking this around for a while, since I noticed poor distribution >> over a 7-day range when I was really pretty sure I'd changed that from >> the 7d default to 28d. >> >> Tried kicking it out to 42 days about a week ago with: >> >> ceph tell osd.* injectargs '--osd_deep_scrub_interval 3628800' >> >> >> There were many errors suggesting it could not reread the change and I'd >> need to restart the OSDs, but 'ceph daemon osd.0 config show | grep >> osd_deep_scrub_interval' showed the right value, so I let it roll for a >> week, but the scrubs did not spread out. >> >> So Friday I set that value in ceph.conf and did rolling restarts of >> all OSDs. Then double-checked the running value on all daemons. 
>> Checking Sunday, the nightly deep scrubs (based on the LAST_DEEP_SCRUB >> voodoo above) showed near enough 1/42nd of PGs had been scrubbed >> Saturday night that I thought this was working. >> >> This morning I checked again and got the results above. >> >> I would expect after changing to a 42d scrub cycle I'd see approx 1/42 >> of the PGs deep scrub each night until there was a roughly even >> distribution over the past 42 days. >> >> So which thing is broken, my config or my expectations? >> > > Sadly, changing the interval settings does not directly change the > scheduling of deep scrubs. Instead, it merely influences whether a PG will > get queued for scrub when it is examined as a candidate, based on how > out-of-date its scrub is. (That is, nothing holistically goes "I need to > scrub 1/n of these PGs every night"; there's a simple task that says "is > this PG's last scrub more than n days old?") > > Users have shared various scripts on the list for setting up a more even > scrub distribution by fiddling with the settings and poking at specific PGs > to try and smear them out over the whole time period; I'd check the archives or > Google for those. :) > -Greg
Re: [ceph-users] Delete a Pool - how hard should be?
On 06/03/2018 11:13, Ronny Aasen wrote: On 06 March 2018 10:26, Max Cuttins wrote: On 05/03/2018 20:17, Gregory Farnum wrote: You're not wrong, and indeed that's why I pushed back on the latest attempt to make deleting pools even more cumbersome. But having a "trash" concept is also pretty weird. If admins can override it to just immediately delete the data (if they need the space), how is that different from just being another hoop to jump through? If we want to give the data owners a chance to undo, how do we identify and notify *them* rather than the admin running the command? But if admins can't override the trash and delete immediately, what do we do for things like testing and proofs of concept where large-scale data creates and deletes are to be expected? -Greg I'm talking about my experience: * Data owners are a little bit in LA LA LAND, and think that they can safely delete some of their data without losses. * Data owners should think that their pool really has been deleted. * Data owners should not be made aware of the existence of the "trash". * So the data owner asks to restore from backup (but instead we'll easily use the trash). That said, we also have to think that: * The administrator is always GOD, so he needs the possibility to override whenever he needs to. * However, the administrator should just set the delete status, without overriding this behaviour if there is no need to do so. * Override should be allowed only with many cumbersome warnings telling you YOU SHOULD NOT OVERRIDE - PLEASE AVOID OVERRIDE. I don't like software that limits administrators in doing their job... in the end the administrator will always find a way to do what he wants (it's root). Of course I like a feature that pushes the admin to follow the right behaviour. Some sort of active/inactive toggle on RBD images, pools, buckets and filesystem trees is nice to allow admins to perform scream tests. 
"data owner requests deletion - admin disables pool(kicks all clients) - data owner screams - admin reactivates" sounds much better then the last step beeing admin checking if the backups are good.,.. i try to do something similar by renaming pools to be deleted but that is not allways the same as inactive. EXACTLY! :) I like the name "scream test"... it really look like that! :) ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph iSCSI is a prank?
Dear all, I wonder how we could support VM systems with ceph storage (block device)? My colleagues are waiting for my answer for VMware (vSphere 5), and I myself use oVirt (RHEV); the default protocol is iSCSI. I know that openstack/cinder works well with ceph, and proxmox (just heard) too. But currently we are using VMware and oVirt. Your wise suggestion is appreciated. Cheers, Joshua oVirt works with Ceph natively via librbd. k
Re: [ceph-users] Delete a Pool - how hard should be?
On 06 March 2018 10:26, Max Cuttins wrote: On 05/03/2018 20:17, Gregory Farnum wrote: You're not wrong, and indeed that's why I pushed back on the latest attempt to make deleting pools even more cumbersome. But having a "trash" concept is also pretty weird. If admins can override it to just immediately delete the data (if they need the space), how is that different from just being another hoop to jump through? If we want to give the data owners a chance to undo, how do we identify and notify *them* rather than the admin running the command? But if admins can't override the trash and delete immediately, what do we do for things like testing and proofs of concept where large-scale data creates and deletes are to be expected? -Greg I'm talking about my experience: * Data owners are a little bit in LA LA LAND, and think that they can safely delete some of their data without losses. * Data owners should think that their pool really has been deleted. * Data owners should not be made aware of the existence of the "trash". * So the data owner asks to restore from backup (but instead we'll easily use the trash). That said, we also have to think that: * The administrator is always GOD, so he needs the possibility to override whenever he needs to. * However, the administrator should just set the delete status, without overriding this behaviour if there is no need to do so. * Override should be allowed only with many cumbersome warnings telling you YOU SHOULD NOT OVERRIDE - PLEASE AVOID OVERRIDE. I don't like software that limits administrators in doing their job... in the end the administrator will always find a way to do what he wants (it's root). Of course I like a feature that pushes the admin to follow the right behaviour. Some sort of active/inactive toggle on RBD images, pools, buckets and filesystem trees is nice to allow admins to perform scream tests. 
"data owner requests deletion - admin disables pool(kicks all clients) - data owner screams - admin reactivates" sounds much better then the last step beeing admin checking if the backups are good.,.. i try to do something similar by renaming pools to be deleted but that is not allways the same as inactive. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Packages for Debian 8 "Jessie" missing from download.ceph.com APT repository
Hi, I'm trying to install "ceph-common" on Debian 8 "Jessie", but it seems the packages aren't available for it. Searching for "jessie" on https://download.ceph.com/debian-luminous/pool/main/c/ceph/ yields no results. I've tried to install it like it is documented here: http://docs.ceph.com/docs/master/install/get-packages/#debian-packages However, after adding the repository, only versions 10.2 and 0.80.7 from the official Debian repositories show up in "apt-cache policy ceph-common". So far, my solution is using the "trusty" packages from Ubuntu, which seem to work on my Debian box, for anybody else that's seeking to resolve this issue. Thanks, Best regards, Simon Fredsted
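For reference, the trusty workaround described above corresponds to an APT source entry roughly like this (a sketch; the suite name "trusty" is taken from the message and should be verified against what the repository actually publishes):

```ini
# /etc/apt/sources.list.d/ceph.list -- borrow the Ubuntu trusty build on Jessie
deb https://download.ceph.com/debian-luminous/ trusty main
```

Mixing a foreign suite like this can pull in mismatched dependencies, so pinning it to just the ceph packages is prudent.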
Re: [ceph-users] Delete a Pool - how hard should be?
On 05/03/2018 20:17, Gregory Farnum wrote: You're not wrong, and indeed that's why I pushed back on the latest attempt to make deleting pools even more cumbersome. But having a "trash" concept is also pretty weird. If admins can override it to just immediately delete the data (if they need the space), how is that different from just being another hoop to jump through? If we want to give the data owners a chance to undo, how do we identify and notify *them* rather than the admin running the command? But if admins can't override the trash and delete immediately, what do we do for things like testing and proofs of concept where large-scale data creates and deletes are to be expected? -Greg I'm talking about my experience: * Data owners are a little bit in LA LA LAND, and think that they can safely delete some of their data without losses. * Data owners should think that their pool really has been deleted. * Data owners should not be made aware of the existence of the "trash". * So the data owner asks to restore from backup (but instead we'll easily use the trash). That said, we also have to think that: * The administrator is always GOD, so he needs the possibility to override whenever he needs to. * However, the administrator should just set the delete status, without overriding this behaviour if there is no need to do so. * Override should be allowed only with many cumbersome warnings telling you YOU SHOULD NOT OVERRIDE - PLEASE AVOID OVERRIDE. I don't like software that limits administrators in doing their job... in the end the administrator will always find a way to do what he wants (it's root). Of course I like a feature that pushes the admin to follow the right behaviour.
Re: [ceph-users] Delete a Pool - how hard should be?
What about using the at command: echo "ceph osd pool rm --yes-i-really-really-mean-it" | at now + 30 days Regards, Alex How do you know that this command is scheduled? How do you delete the scheduled command once it has been submitted? This is weird. We need something within CEPH that makes you see the "status" of the pool as "pending delete".
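To answer the two questions above for the `at` approach: `at` prints a "job N at <date>" line when the job is queued, and that job number is the handle for listing, inspecting, and cancelling it. A sketch, with a hypothetical `job_id` helper to capture the number (the pool name `mypool` is an example):

```shell
# job_id: extract the job number from at's "job N at <date>" confirmation line.
job_id() { awk '/^job/{print $2}'; }

# Against a real system with atd running (untested sketch):
#   id=$(echo "ceph osd pool rm mypool mypool --yes-i-really-really-mean-it" \
#        | at now + 30 days 2>&1 | job_id)
#   atq          # lists pending jobs -- this is how you see it is scheduled
#   at -c "$id"  # prints the exact command (and environment) that will run
#   atrm "$id"   # cancels the pending delete if the data owner screams
```

This still doesn't surface any "pending delete" status inside Ceph itself, which is the point being made; the schedule lives only on the host that queued the job.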
Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK
debug_osd, that is... :) On Tue, Mar 6, 2018 at 7:10 PM, Brad Hubbard wrote: > > > On Tue, Mar 6, 2018 at 5:26 PM, Marco Baldini - H.S. Amiata < > mbald...@hsamiata.it> wrote: > >> Hi >> >> I monitor dmesg on each of the 3 nodes, no hardware issue reported. And >> the problem happens with various different OSDs in different nodes, so for me >> it is clear it's not a hardware problem. >> > > If you have osd_debug set to 25 or greater when you run the deep scrub, you > should get more information about the nature of the read error in the > ReplicatedBackend::be_deep_scrub() function (assuming this is a > replicated pool). > > This may create large logs, so watch that they don't exhaust storage.
Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK
On Tue, Mar 6, 2018 at 5:26 PM, Marco Baldini - H.S. Amiata < mbald...@hsamiata.it> wrote: > Hi > > I monitor dmesg on each of the 3 nodes, no hardware issue reported. And > the problem happens with various different OSDs in different nodes, so for me > it is clear it's not a hardware problem. > If you have osd_debug set to 25 or greater when you run the deep scrub, you should get more information about the nature of the read error in the ReplicatedBackend::be_deep_scrub() function (assuming this is a replicated pool). This may create large logs, so watch that they don't exhaust storage. > Thanks for reply > > > > On 05/03/2018 21:45, Vladimir Prokofev wrote: > > > always solved by ceph pg repair > That doesn't necessarily mean that there's no hardware issue. In my case > repair also worked fine and returned the cluster to the OK state every time, but in > time the faulty disk failed another scrub operation, and this repeated multiple > times before we replaced that disk. > One last thing to look into is dmesg on your OSD nodes. If there's a > hardware read error it will be logged in dmesg. > > 2018-03-05 18:26 GMT+03:00 Marco Baldini - H.S. 
Amiata < > mbald...@hsamiata.it>: > >> Hi and thanks for reply >> >> The OSDs are all healthy, in fact after a ceph pg repair the ceph >> health is back to OK and in the OSD log I see repair ok, 0 fixed >> >> The SMART data of the 3 OSDs seems fine >> >> *OSD.5* >> >> # ceph-disk list | grep osd.5 >> /dev/sdd1 ceph data, active, cluster ceph, osd.5, block /dev/sdd2 >> >> # smartctl -a /dev/sdd >> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build) >> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org >> >> === START OF INFORMATION SECTION === >> Model Family: Seagate Barracuda 7200.14 (AF) >> Device Model: ST1000DM003-1SB10C >> Serial Number:Z9A1MA1V >> LU WWN Device Id: 5 000c50 090c7028b >> Firmware Version: CC43 >> User Capacity:1,000,204,886,016 bytes [1.00 TB] >> Sector Sizes: 512 bytes logical, 4096 bytes physical >> Rotation Rate:7200 rpm >> Form Factor: 3.5 inches >> Device is:In smartctl database [for details use: -P show] >> ATA Version is: ATA8-ACS T13/1699-D revision 4 >> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) >> Local Time is:Mon Mar 5 16:17:22 2018 CET >> SMART support is: Available - device has SMART capability. >> SMART support is: Enabled >> >> === START OF READ SMART DATA SECTION === >> SMART overall-health self-assessment test result: PASSED >> >> General SMART Values: >> Offline data collection status: (0x82) Offline data collection activity >> was completed without error. >> Auto Offline Data Collection: Enabled. >> Self-test execution status: ( 0) The previous self-test routine >> completed >> without error or no self-test has ever >> been run. >> Total time to complete Offline >> data collection: (0) seconds. >> Offline data collection >> capabilities: (0x7b) SMART execute Offline immediate. >> Auto Offline data collection on/off >> support. >> Suspend Offline collection upon new >> command. >> Offline surface scan supported. >> Self-test supported. 
>> Conveyance Self-test supported. >> Selective Self-test supported. >> SMART capabilities:(0x0003) Saves SMART data before entering >> power-saving mode. >> Supports SMART auto save timer. >> Error logging capability:(0x01) Error logging supported. >> General Purpose Logging supported. >> Short self-test routine >> recommended polling time: ( 1) minutes. >> Extended self-test routine >> recommended polling time: ( 109) minutes. >> Conveyance self-test routine >> recommended polling time: ( 2) minutes. >> SCT capabilities: (0x1085) SCT Status supported. >> >> SMART Attributes Data Structure revision number: 10 >> Vendor Specific SMART Attributes with Thresholds: >> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED >> WHEN_FAILED RAW_VALUE >> 1 Raw_Read_Error_Rate 0x000f 082 063 006Pre-fail Always >> - 193297722 >> 3 Spin_Up_Time0x0003 097 097 000Pre-fail Always >> - 0 >> 4 Start_Stop_Count0x0032 100 100 020Old_age Always >> - 60 >> 5 Reallocated_Sector_Ct 0x0033 100 100 010Pre-fail Always >> - 0 >> 7 Seek_Error_Rate
Re: [ceph-users] Cache tier
Hi, We use a write-around cache tier with libradosstriper-based clients. We ran into a bug which causes performance degradation: http://tracker.ceph.com/issues/22528 . Especially with a lot of small objects - sizeof(1 striper chunk). Such objects will be promoted on every read/write lock :). And it is very hard to benchmark a cache tier. Also, we have a little testing pool with rbd disks for VMs. It works better with a cache tier on SSDs. But there's no heavy I/O load there. It's better to benchmark the cache tier for your specific case and choose the cache mode based on benchmark results. 06.03.2018, 02:28, "Budai Laszlo": > Dear all, > > I have some questions about cache tier in ceph: > > 1. Can someone share experiences with cache tiering? What are the sensitive > things to pay attention to regarding the cache tier? Can one use the same ssd > for both cache and > 2. Is cache tiering supported with bluestore? Any advice for using cache > tier with Bluestore? > > Kind regards, > Laszlo -- Regards, Aleksei Zakharov
[ceph-users] Problem with UID starting with underscores
Hi all, because one of our scripts misbehaved, a new user with a bad UID was created via the API, and now we can't remove, view or modify it. I believe it's because it has three underscores at the beginning: [root@rgw001 /]# radosgw-admin metadata list user | grep "___pro_" "___pro_", [root@rgw001 /]# radosgw-admin user info --uid="___pro_" could not fetch user info: no user info saved Do you have any ideas how to work around this problem? If it's not a supported naming, maybe the API shouldn't allow creating it? We are using the Jewel 10.2.10 version on CentOS 7.4. Thanks for any ideas, Arvydas
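One avenue worth trying (an assumption, not a confirmed fix; test on a non-production setup first): radosgw-admin can address raw metadata entries directly with `metadata get`/`metadata rm`, which take a `user:<uid>` key and may accept the entry even where `user info` refuses the uid. The `meta_key` helper below is just illustrative glue for building that key:

```shell
# meta_key: build the raw metadata key for a given rgw uid.
meta_key() { printf 'user:%s\n' "$1"; }

# Against the cluster (untested sketch -- back up the output of 'metadata get'
# before removing anything):
#   radosgw-admin metadata get "$(meta_key '___pro_')"
#   radosgw-admin metadata rm  "$(meta_key '___pro_')"
```

If the raw metadata path also rejects the uid, the remaining option is likely removing the underlying objects from the rgw metadata pool with `rados`, which is riskier and worth a tracker ticket either way.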