[ceph-users] cephbot - a Slack bot for Ceph has been added to the github.com/ceph project

2023-07-26 Thread David Turner
cephbot [1] is a project that I've been working on and using for years,
and it has now been added to the github.com/ceph project to increase its
visibility for other people who would like to implement Slack-ops for
their Ceph clusters.

The instructions show how to set it up so that only read-only operations
can be performed from Slack, for security. There are also settings that
let you lock down who can communicate with cephbot, which could make it
reasonably safe to run administrative tasks as well.
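
As an aside, one way to keep things read-only at the cluster level
(independent of the bot's own settings) is to give the bot a dedicated
cephx user with read-only caps. A minimal sketch, where the user name and
cap set are just an example:

# Example only: a dedicated, read-only cephx identity for the bot
ceph auth get-or-create client.cephbot mon 'allow r' mgr 'allow r' \
    -o /etc/ceph/ceph.client.cephbot.keyring

Anything run with that identity, e.g. "ceph -n client.cephbot -k
/etc/ceph/ceph.client.cephbot.keyring status", is then limited to
read-only operations regardless of what gets typed into Slack.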

Ask here or in the Ceph Slack instance if you have any questions about
its uses or implementation, or if you would like to contribute. I hope
you find it as useful as I have.

David Turner
Sony Interactive Entertainment


[1] https://github.com/ceph/cephbot-slack
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS health warnings after deleting millions of files

2022-04-19 Thread David Turner
A rogue process wrote 38M files into a single CephFS directory, and it
took about a month to delete them. We had to increase the MDS cache sizes
to handle the increased file volume, but we've since been able to reduce
all of our settings back to their defaults.

The Ceph cluster is on 15.2.11. CephFS clients are ceph-fuse, either
version 14.2.16 or 15.2.11 depending on whether they've been upgraded
yet. Nothing has changed in the last ~6 months with regard to client
versions or the cluster version.

We are dealing with 2 issues now that things seem to be cleaned up.

1. MDSs report slow requests. [1] Dumping the blocked requests gives the
same output for all of them. They seemingly get stuck AFTER the "acquired
locks" event succeeds. I can't find any information about what happens
after that point or why requests are getting stuck there. (A small triage
sketch is below.)

2. Clients failing to advance the oldest client/flush tid. There are 2
clients that are the worst offenders for this, but a few other clients
are hitting the same issue. All of the clients having this issue are on
14.2.16, but we also have a hundred clients on the same version that
don't have this issue at all. [2] The logs make it look like the clients
just have a bad integer/pointer somehow. We can clear the error by
remounting the filesystem or rebooting the server, but these 2 clients in
particular keep ending up in this state. No other repeat offenders yet,
but we've had 4 other servers in this state over the last couple of
weeks.
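
For anyone who wants to reproduce the summary, something like the
following (a sketch assuming jq is installed, using the mds.mon1 admin
socket from [1] below) groups the blocked ops by client and flag_point:

sudo ceph daemon mds.mon1 dump_blocked_ops |
  jq -r '.ops[] | "\(.type_data.client_info.client) \(.type_data.flag_point)"' |
  sort | uniq -c | sort -rn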

Are there any ideas what the next steps might be for diagnosing either of
these issues? Thank you.

-David Turner



[1] $ sudo ceph daemon mds.mon1 dump_blocked_ops
{
    "ops": [
        {
            "description": "client_request(client.17709580:39254 open #0x10001c99cd4 2022-02-22T16:25:40.231547+ caller_uid=0, caller_gid=0{})",
            "initiated_at": "2022-04-19T19:07:10.663552+",
            "age": 90.920778446,
            "duration": 90.92080624405,
            "type_data": {
                "flag_point": "acquired locks",
                "reqid": "client.17709580:39254",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.17709580",
                    "tid": 39254
                },
                "events": [
                    {
                        "time": "2022-04-19T19:07:10.663552+",
                        "event": "initiated"
                    },
                    {
                        "time": "2022-04-19T19:07:10.663549+",
                        "event": "throttled"
                    },
                    {
                        "time": "2022-04-19T19:07:10.663552+",
                        "event": "header_read"
                    },
                    {
                        "time": "2022-04-19T19:07:10.663555+",
                        "event": "all_read"
                    },
                    {
                        "time": "2022-04-19T19:07:10.665744+",
                        "event": "dispatched"
                    },
                    {
                        "time": "2022-04-19T19:07:10.773894+",
                        "event": "failed to xlock, waiting"
                    },
                    {
                        "time": "2022-04-19T19:07:10.807249+",
                        "event": "acquired locks"
                    }
                ]
            }
        },


[2] 2022-04-19 06:15:36.108 7fb28b7fe700  0 client.30095002
handle_cap_flush_ack mds.1 got unexpected flush ack tid 338611 expected is 0
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph-mds using a lot of buffer_anon memory

2021-01-26 Thread David Turner
We just upgraded a CephFS cluster from 12.2.12 to 14.2.11. Our next step
is to upgrade to 14.2.16 to troubleshoot this issue, but I thought I'd
reach out here first in case anyone has any ideas. The clients are still
running an older ceph-fuse, 12.2.4, and it's very difficult to remount
all of them; it would probably take a team of us a couple of days to
restart them all. I've looked around online and through the release
notes, and all of the known memory leaks I've been able to find were
fixed prior to 14.2.11, so this would be an unknown memory leak.

All of the memory is in use by [1] buffer_anon. If left unchecked it
will use up over 700GB of memory within 24 hours. On an identical cluster
with an equivalent workload still running 12.2.12, the [2] buffer_anon
numbers are much healthier.

Without any other options or ideas our plan is to upgrade the cluster to
14.2.16 first and then upgrade the clients. Has anyone else come across
high buffer_anon usage?
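
For anyone who wants to compare numbers on their own cluster, the
snippets in [1] and [2] below look like the buffer_anon section of the
MDS mempool dump, which can be pulled with something like this (the
daemon name is an example, and the exact JSON nesting varies a little
between releases):

# Example only: grab the buffer_anon mempool stats from a running MDS
sudo ceph daemon mds.$(hostname -s) dump_mempools | grep -A 3 '"buffer_anon"'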


[1]
"buffer_anon": {
"items": 33756758,
"bytes": 135025912897
 },

[2]
"buffer_anon": {
"items": 636,
"bytes": 273118
},
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD corruption and down PGs

2020-05-12 Thread David Turner
Do you have access to another Ceph cluster with enough available space
to create RBDs that you can dd these failing disks into? That's what I'm
doing right now with some failing disks; I've recovered 2 of the 6 OSDs
that failed in this way. I would recommend against using the same cluster
for this, but a staging cluster or something similar would be great.
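
Roughly what that looks like, as a sketch only (pool/image names and
sizes are made up, and ddrescue instead of plain dd keeps read errors
from aborting the copy):

# On the cluster with spare space: create and map an image big enough
# for the failing disk
rbd create rescue/osd-92-disk --size 4T
sudo rbd map rescue/osd-92-disk        # prints the device, e.g. /dev/rbd0

# On the host with the failing disk: copy it, skipping unreadable sectors
sudo ddrescue -f /dev/sdX /dev/rbd0 /root/osd-92-disk.map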

On Tue, May 12, 2020, 7:36 PM Kári Bertilsson  wrote:

> Hi Paul
>
> I was able to mount both OSD's i need data from successfully using
> "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-92 --op fuse
> --mountpoint /osd92/"
>
> I see the PG slices that are missing in the mounted folder
> "41.b3s3_head" "41.ccs5_head" etc. And i can copy any data from inside the
> mounted folder and that works fine.
>
> But when i try to export it fails. I get the same error when trying to
> list.
>
> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-92 --op list
> --debug
> Output @ https://pastebin.com/nXScEL6L
>
> Any ideas ?
>
> On Tue, May 12, 2020 at 12:17 PM Paul Emmerich 
> wrote:
>
> > First thing I'd try is to use objectstore-tool to scrape the
> > inactive/broken PGs from the dead OSDs using it's PG export feature.
> > Then import these PGs into any other OSD which will automatically recover
> > it.
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> >
> > On Tue, May 12, 2020 at 2:07 PM Kári Bertilsson 
> > wrote:
> >
> >> Yes
> >> ceph osd df tree and ceph -s is at https://pastebin.com/By6b1ps1
> >>
> >> On Tue, May 12, 2020 at 10:39 AM Eugen Block  wrote:
> >>
> >> > Can you share your osd tree and the current ceph status?
> >> >
> >> >
> >> > Zitat von Kári Bertilsson :
> >> >
> >> > > Hello
> >> > >
> >> > > I had an incidence where 3 OSD's crashed at once completely and
> won't
> >> > power
> >> > > up. And during recovery 3 OSD's in another host have somehow become
> >> > > corrupted. I am running erasure coding with 8+2 setup using crush
> map
> >> > which
> >> > > takes 2 OSDs per host, and after losing the other 2 OSD i have few
> >> PG's
> >> > > down. Unfortunately these PG's seem to overlap almost all data on
> the
> >> > pool,
> >> > > so i believe the entire pool is mostly lost after only these 2% of
> >> PG's
> >> > > down.
> >> > >
> >> > > I am running ceph 14.2.9.
> >> > >
> >> > > OSD 92 log https://pastebin.com/5aq8SyCW
> >> > > OSD 97 log https://pastebin.com/uJELZxwr
> >> > >
> >> > > ceph-bluestore-tool repair without --deep showed "success" but OSD's
> >> > still
> >> > > fail with the log above.
> >> > >
> >> > > Log from trying ceph-bluestore-tool repair --deep which is still
> >> running,
> >> > > not sure if it will actually fix anything and log looks pretty bad.
> >> > > https://pastebin.com/gkqTZpY3
> >> > >
> >> > > Trying "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-97
> >> --op
> >> > > list" gave me input/output error. But everything in SMART looks OK,
> >> and i
> >> > > see no indication of hardware read error in any logs. Same for both
> >> OSD.
> >> > >
> >> > > The OSD's with corruption have absolutely no bad sectors and likely
> >> have
> >> > > only a minor corruption but at important locations.
> >> > >
> >> > > Any ideas on how to recover this kind of scenario ? Any tips would
> be
> >> > > highly appreciated.
> >> > >
> >> > > Best regards,
> >> > > Kári Bertilsson
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Re: Re: OSDs continuously restarting under load

2020-05-01 Thread David Turner
badblocks has found over 50 bad sectors so far and is still running.
xfs_repair stopped twice with just the message "Killed", likely
indicating that it hit a bus error similar to the one ceph-osd is running
into. This seems like a fairly simple case of failing disks. I just hope
I can get through it without data loss.
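
If it does come down to pulling PGs off these disks, the rough shape of
the export/import is below (a sketch only: the PG ID and the destination
OSD are placeholders, both OSDs have to be stopped, and a filestore OSD
may also want --journal-path if its journal isn't reachable from the data
dir):

# Export a PG from the failing (stopped) filestore OSD
sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-285 \
    --pgid 12.1a --op export --file /root/12.1a.export

# Import it into a healthy (also stopped) OSD, then start that OSD
sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-300 \
    --op import --file /root/12.1a.export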

On Thu, Apr 30, 2020 at 10:14 PM David Turner  wrote:

> I have 2 filestore OSDs in a cluster facing "Caught signal (Bus error)" as
> well and can't find anything about it. Ceph 12.2.12. The disks are less
> than 50% full and basic writes have been successful. Both disks are on
> different nodes. The other 14 disks on each node are unaffected.
>
> Restarting the node doesn't change the behavior. The affected OSD still
> crashes and the other 14 start fine (which likely rules out the controller
> and other shared components along those lines).
>
> I've attempted [1] these commands on the OSDs to see how much of the disk
> I could access cleanly. The first is just to flush the journal to disk and
> it crashed out with the same error. The second command is to compact the DB
> which also crashed with the same error. On one of the OSDs I was able to
> make it a fair bit into compacting the DB before it crashed the first time
> and now it crashes instantly.
>
> That leads me to think that it might have gotten to a specific part of the
> disk and/or filesystem that is having problems. I'm currently running [2]
> xfs_repair on one of the disks to see if it might be the filesystem. On the
> other disk I'm running [3] badblocks to check for problems with underlying
> sectors.
>
> I'm assuming that if it's a bad block on the disk that is preventing the
> disk from starting that there's really nothing that I can do to recover the
> OSD and I'll just need to export any PGs on the disks that aren't active.
> Here's hoping I make it through this without data loss. Since I started
> this data migration I've already lost a couple disks (completely unreadable
> by the OS so I can't get copies of the PGs off of them). Luckily these ones
> seem like I might be able to access that part of the data at least. As
> well, I only have some unfound objects at the moment, but all of my PGs are
> active, which is an improvement.
>
>
> [1] sudo -u ceph ceph-osd -i 285 --flush-journal
> sudo -u ceph ceph-kvstore-tool leveldb
> /var/lib/ceph/osd/ceph-285/current/omap compact
>
> [2] xfs_repair -n /dev/sdi1
> [3] badblocks -b 4096 -v /dev/sdn
>
> On Thu, Mar 19, 2020 at 9:04 AM huxia...@horebdata.cn <
> huxia...@horebdata.cn> wrote:
>
>> Hi, Igor,
>>
>> thanks for the tip. Dmesg does not say any suspicious information.
>>
>> I will investigate whether hardware has any problem or not.
>>
>> best regards,
>>
>> samuel
>>
>>
>>
>>
>>
>> huxia...@horebdata.cn
>>
>> From: Igor Fedotov
>> Sent: 2020-03-19 12:07
>> To: huxia...@horebdata.cn; ceph-users; ceph-users
>> Subject: Re: [ceph-users] OSDs continuously restarting under load
>> Hi, Samuel,
>>
>> I've never seen that sort of signal in the real life:
>>
>> 2020-03-18 18:39:26.426584 201e35fdb40 -1 *** Caught signal (Bus error) **
>>
>>
>> I suppose this has some hardware roots. Have you checked dmesg output?
>>
>>
>> Just in case, here is some info on "Bus Error" signal, may be it will
>> provide some insight: https://en.wikipedia.org/wiki/Bus_error
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 3/18/2020 5:06 PM, huxia...@horebdata.cn wrote:
>> > Hello, folks,
>> >
>> > I am trying to add a ceph node into an existing ceph cluster. Once the
>> reweight of newly-added OSD on the new node exceed 0.4 somewhere, the osd
>> becomes unresponsive and restarting, eventually go down.
>> >
>> > What could be the problem?  Any suggestion would be highly appreciated.
>> >
>> > best regards,
>> >
>> > samuel
>> >
>> > 
>> > root@node81:/var/log/ceph#
>> > root@node81:/var/log/ceph#
>> > root@node81:/var/log/ceph#
>> > root@node81:/var/log/ceph# ceph osd df
>> > ID CLASS  WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE VAR  PGS
>> > 12 hybrid 1.0  1.0 3.81TiB 38.3GiB 3.77TiB 0.98 1.32 316
>> > 13 hybrid 1.0  1.0 3.81TiB 37.6GiB 3.77TiB 0.96 1.29 308
>> > 14 hybrid 1.0  1.0 3.81TiB 36.9GiB 3.77TiB 0.95 1.27 301
>> > 15 hybrid 1.0  1.0 3.81TiB 37.1GiB 3.77TiB 0.95 1.28 297
>

[ceph-users] Re: Re: Re: OSDs continuously restarting under load

2020-04-30 Thread David Turner
I have 2 filestore OSDs in a cluster facing "Caught signal (Bus error)" as
well and can't find anything about it. Ceph 12.2.12. The disks are less
than 50% full and basic writes have been successful. Both disks are on
different nodes. The other 14 disks on each node are unaffected.

Restarting the node doesn't change the behavior. The affected OSD still
crashes and the other 14 start fine (which likely rules out the controller
and other shared components along those lines).

I've attempted [1] these commands on the OSDs to see how much of the disk I
could access cleanly. The first is just to flush the journal to disk and it
crashed out with the same error. The second command is to compact the DB
which also crashed with the same error. On one of the OSDs I was able to
make it a fair bit into compacting the DB before it crashed the first time
and now it crashes instantly.

That leads me to think that it might have gotten to a specific part of the
disk and/or filesystem that is having problems. I'm currently running [2]
xfs_repair on one of the disks to see if it might be the filesystem. On the
other disk I'm running [3] badblocks to check for problems with underlying
sectors.

I'm assuming that if it's a bad block on the disk that is preventing the
disk from starting that there's really nothing that I can do to recover the
OSD and I'll just need to export any PGs on the disks that aren't active.
Here's hoping I make it through this without data loss. Since I started
this data migration I've already lost a couple disks (completely unreadable
by the OS so I can't get copies of the PGs off of them). Luckily these ones
seem like I might be able to access that part of the data at least. As
well, I only have some unfound objects at the moment, but all of my PGs are
active, which is an improvement.


[1] sudo -u ceph ceph-osd -i 285 --flush-journal
sudo -u ceph ceph-kvstore-tool leveldb
/var/lib/ceph/osd/ceph-285/current/omap compact

[2] xfs_repair -n /dev/sdi1
[3] badblocks -b 4096 -v /dev/sdn
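
To double-check whether it really is failing media, the usual suspects
(device name as in [3], purely illustrative):

# SMART health and the error counters that matter most
sudo smartctl -a /dev/sdn | egrep -i 'health|reallocated|pending|uncorrect'

# Kernel-level I/O errors, which tend to accompany SIGBUS on mmap'd files
dmesg -T | egrep -i 'sdn|i/o error|medium error'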

On Thu, Mar 19, 2020 at 9:04 AM huxia...@horebdata.cn 
wrote:

> Hi, Igor,
>
> thanks for the tip. Dmesg does not say any suspicious information.
>
> I will investigate whether hardware has any problem or not.
>
> best regards,
>
> samuel
>
>
>
>
>
> huxia...@horebdata.cn
>
> From: Igor Fedotov
> Sent: 2020-03-19 12:07
> To: huxia...@horebdata.cn; ceph-users; ceph-users
> Subject: Re: [ceph-users] OSDs continuously restarting under load
> Hi, Samuel,
>
> I've never seen that sort of signal in the real life:
>
> 2020-03-18 18:39:26.426584 201e35fdb40 -1 *** Caught signal (Bus error) **
>
>
> I suppose this has some hardware roots. Have you checked dmesg output?
>
>
> Just in case, here is some info on "Bus Error" signal, may be it will
> provide some insight: https://en.wikipedia.org/wiki/Bus_error
>
>
> Thanks,
>
> Igor
>
>
> On 3/18/2020 5:06 PM, huxia...@horebdata.cn wrote:
> > Hello, folks,
> >
> > I am trying to add a ceph node into an existing ceph cluster. Once the
> reweight of newly-added OSD on the new node exceed 0.4 somewhere, the osd
> becomes unresponsive and restarting, eventually go down.
> >
> > What could be the problem?  Any suggestion would be highly appreciated.
> >
> > best regards,
> >
> > samuel
> >
> > 
> > root@node81:/var/log/ceph#
> > root@node81:/var/log/ceph#
> > root@node81:/var/log/ceph#
> > root@node81:/var/log/ceph# ceph osd df
> > ID CLASS  WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE VAR  PGS
> > 12 hybrid 1.0  1.0 3.81TiB 38.3GiB 3.77TiB 0.98 1.32 316
> > 13 hybrid 1.0  1.0 3.81TiB 37.6GiB 3.77TiB 0.96 1.29 308
> > 14 hybrid 1.0  1.0 3.81TiB 36.9GiB 3.77TiB 0.95 1.27 301
> > 15 hybrid 1.0  1.0 3.81TiB 37.1GiB 3.77TiB 0.95 1.28 297
> >   0 hybrid 1.0  1.0 3.81TiB 37.6GiB 3.77TiB 0.96 1.29 305
> >   1 hybrid 1.0  1.0 3.81TiB 38.2GiB 3.77TiB 0.98 1.31 309
> >   2 hybrid 1.0  1.0 3.81TiB 37.4GiB 3.77TiB 0.96 1.29 296
> >   3 hybrid 1.0  1.0 3.81TiB 37.9GiB 3.77TiB 0.97 1.30 303
> >   4hdd 0.2  1.0 3.42TiB 10.5GiB 3.41TiB 0.30 0.40   0
> >   5hdd 0.2  1.0 3.42TiB 9.63GiB 3.41TiB 0.28 0.37  87
> >   6hdd 0.2  1.0 3.42TiB 1.91GiB 3.42TiB 0.05 0.07   0
> >   7hdd 0.2  1.0 3.42TiB 11.3GiB 3.41TiB 0.32 0.43  83
> > 16hdd 0.3  1.0 1.79TiB 16.3GiB 1.78TiB 0.89 1.19 142
> >   TOTAL 45.9TiB  351GiB 45.6TiB 0.75
> >
> >
> 
> Log
> >
> > root@node81:/var/log/ceph# cat ceph-osd.6.log | grep load_pgs
> > 2020-03-18 18:33:57.808747 2000b556000  0 osd.6 0 load_pgs
> > 2020-03-18 18:33:57.808763 2000b556000  0 osd.6 0 load_pgs opened 0 pgs
> >   -1324> 2020-03-18 18:33:57.808747 2000b556000  0 osd.6 0 load_pgs
> >   -1323> 2020-03-18 18:33:57.808763 2000b556000  0 osd.6 0 load_pgs
> opened 0 pgs
> > 2020-03-18 18:35:04.363341 2000327 

[ceph-users] Re: increasing PG count - limiting disruption

2019-11-14 Thread David Turner
There are a few factors to consider. I've gone from 16k pgs to 32k pgs
before and learned some lessons.

The first and most immediate is the peering that happens when you
increase the PG count. I like to increase the pg_num and pgp_num values
slowly to mitigate this. Something like [1] this should do the trick: it
increases your PG count slowly and waits for all peering and such to
finish before continuing. It also waits out a few other statuses during
which you shouldn't be doing maintenance like this.

The second is that mons do not compact their databases while any PG is
in a non-"clean" state. That means that while your cluster is creating
these new PGs and moving data around, your mon stores will grow with new
maps until everything is healthy again. This is desired behavior to keep
everything healthy in Ceph in the face of failures, BUT it means you need
to be aware of how much space you have on your mons for the mon store to
grow into. When I was increasing from 16k to 32k PGs, that meant we could
only create 4k PGs at a time, and in that cluster each batch took about 2
weeks to finish. When we tried to do more than that, our mons ran out of
space and we had to add disks to the mons to move the mon stores onto so
that the mons could continue to run.

Finally, know that this is just going to take a while (depending on how
much data is in your cluster and how full it is). Be patient. Either you
increase max_backfills, lower the backfill/recovery sleep, and so on to
make the backfilling go faster (at the cost of IOPS that your clients
then can't use), or you keep these throttled so as not to impact clients
as much. Keep a good balance, though, as putting off finishing the
recovery for too long leaves your cluster in a riskier position for that
much longer.
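
For reference, those knobs can be changed at runtime; a sketch with
example values only:

# Speed recovery up (or slow it down) across all OSDs
ceph tell osd.* injectargs '--osd-max-backfills 2 --osd-recovery-sleep 0.1'

# Check what an OSD is currently using
ceph daemon osd.0 config get osd_max_backfills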

Good luck.



[1] *Note that I typed this in gmail and not copied from a script. Please
test before using.
# Set $pool to the pool you are splitting before running this.
ceph osd set nobackfill
ceph osd set norebalance
function healthy_wait() {
  while ceph health | grep -q 'peering\|inactive\|activating\|creating\|down\|inconsistent\|stale'; do
    echo waiting for ceph to be healthier
    sleep 10
  done
}
for count in {2048..4096..256}; do
  healthy_wait
  ceph osd pool set $pool pg_num $count
  healthy_wait
  ceph osd pool set $pool pgp_num $count
done
healthy_wait
ceph osd unset nobackfill
ceph osd unset norebalance

On Thu, Nov 14, 2019 at 11:19 AM Frank R  wrote:

> Hi all,
>
> When increasing the number of placement groups for a pool by a large
> amount (say 2048 to 4096) is it better to go in small steps or all at once?
>
> This is a filestore cluster.
>
> Thanks,
> Frank
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mix ceph-disk and ceph-volume

2019-10-22 Thread David Turner
Yes, there is nothing wrong with this, and it has been a common scenario
for people during their migration from filestore to bluestore.
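
A minimal sketch of what that looks like (the device name is an example;
the new OSD will be bluestore here while the existing ceph-disk OSDs stay
exactly as they are):

# Create the new OSD with ceph-volume
sudo ceph-volume lvm create --bluestore --data /dev/sdk

# Both tools can coexist on the host; list what each one manages
sudo ceph-volume lvm list
sudo ceph-disk list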

On Tue, Oct 22, 2019, 9:46 PM Frank R  wrote:

> Is it ok to create a new OSD using ceph-volume on a server where the other
> OSDs were created with ceph-disk?
>
> thx
> Frank
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: minimum osd size?

2019-10-22 Thread David Turner
I did a set of 30GB OSDs before, carved out of extra disk space on my
SSDs for the CephFS metadata pool, and my entire cluster locked up about
3 weeks later. Some metadata operation filled some of the 30GB disks to
100%, and all IO in the cluster was blocked. I did some trickery,
deleting 1 copy of a few PGs on each OSD (such that I still had at least
2 copies of each PG), and was able to backfill the pool back onto my HDDs
and restore cluster functionality. I would say that trying to use that
space is definitely not worth it.
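
If you do try it anyway, keep a close eye on per-OSD fullness and on the
full ratios, because tiny OSDs hit them fast; roughly:

# Per-OSD utilization; watch the %USE column on the small OSDs
ceph osd df tree

# The ratios at which Ceph warns (nearfull) and then blocks writes (full)
ceph osd dump | grep -i ratio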

In one of my production clusters I occasionally get a warning state that an
omap object is too large in my buckets.index pool. I could very easily
imagine that stalling the entire cluster if my index pool were on such
small OSDs.

On Tue, Oct 22, 2019, 6:55 PM Frank R  wrote:

> Hi all,
>
> I have 40 nvme drives with about 20G free space each.
>
> Would creating a 10GB partition/lvm on each of the nvmes for an rgw index
> pool be a bad idea?
>
> RGW has about about 5 million objects
>
> I don't think space will be an issue but I am worried about the 10G size,
> is it just too small for a bluestore OSD?
>
> thx
> Frank
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: download.ceph.com repository changes

2019-09-24 Thread David Turner
Regarding a testing/cutting-edge repo: the non-LTS releases of Ceph were
dropped because very few people ever used or tested them. The majority of
people who would use a testing repo would be people needing a bug fix
ASAP. Very few people would actually use it regularly, so its
effectiveness at preventing problems from slipping through would be
almost zero.

At work I haven't had a problem with which version of Ceph gets
installed, because we always run local mirrors of the repo and only sync
them from upstream when we're ready to test a new version in our QA
environments, long before we promote that version for production use.
That said, I've been bitten by this multiple times in my home
environment, where I've accidentally updated or reinstalled a server and
then needed to upgrade my Ceph cluster before I could finish, because it
installed a newer version of Ceph. I have had to download an entire copy
of a release, put it into a folder on disk, and set up a repo pointing at
that local folder just to install a specific version. It would be much
handier to simply use the ability in apt or yum to specify a particular
version of a package from the repo.
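
For anyone who wants to do this today, the package managers can already
pin a version; a sketch (version strings are examples only, and yum
versionlock needs the versionlock plugin installed):

# Debian/Ubuntu: pin the ceph packages via apt preferences
cat <<'EOF' | sudo tee /etc/apt/preferences.d/ceph
Package: ceph*
Pin: version 12.2.12-*
Pin-Priority: 1001
EOF

# RHEL/CentOS: install a specific version, or lock the current one in place
sudo yum install ceph-12.2.12
sudo yum versionlock add 'ceph-*'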

Problem releases have been more problematic than necessary because the
packages were left as the default in the repo even after a bug was known,
since there was no way to remove them. People keep seeing the upgrade and
grabbing it, not realizing it's a busted release. I've only seen that
happen on the ML here, but I personally will not touch a new release for
at least 2 weeks after it's been released, even in my testing clusters.

On Tue, Sep 24, 2019 at 4:06 PM Ken Dreyer  wrote:

> On Tue, Sep 17, 2019 at 8:03 AM Sasha Litvak
>  wrote:
> >
> > * I am bothered with a quality of the releases of a very complex system
> that
> > can bring down a whole house and keep it down for a while.  While I wish
> the
> > QA would be perfect, I wonder if it would be practical to release new
> > packages to a testing repo before moving it to a main one.  There is a
> > chance then someone will detect a problem before it becomes a production
> > issue.  Let it seat for a couple days or weeks in testing.  People who
> need
> > new update right away or just want to test will install it and report the
> > problems.  Others will not be affected.
>
> I think it would be a good step forward to have a separate "testing"
> repository. This repository would be a little more cutting-edge, and we'd
> copy
> all the binaries over to the "main" repository location after 48 hours or
> something.
>
> This would let us all publicly test the candidate GPG-signed packages, for
> example.
>
> - Ken
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io