This has been completed and I've put ftp-osl back into rotation! Thanks for your patience.
On Fri, Jun 22, 2018 at 12:44 PM, Lance Albertson <[email protected]> wrote:

> The sync has been completed and I will be switching this over to the local
> drives at 1:30PM PDT (2030 UTC) today. I'm also going to reboot the machine
> so that it's back on the normal CentOS kernel instead of the custom
> mainline kernel we needed for Ceph. This outage should only last for about
> 10 minutes while the machine reboots.
>
> This does not affect anything pointed at ftp.osuosl.org, only ftp-osl
> (which is out of rotation).
>
> Thanks-
>
> On Tue, Jun 19, 2018 at 8:57 AM, Lance Albertson <[email protected]> wrote:
>
>> It's taking longer than I expected to sync the data back to the local
>> disks. This is because the system is also rebuilding two RAID6 arrays,
>> which I forgot to account for. This is also making the system slower
>> than I expected. At this rate it might take a few days to copy all of
>> the data back. Hopefully once the RAID6 arrays have finished rebuilding,
>> the I/O rate will speed up the syncing. The arrays are currently at 55%
>> and 47%, and we've transferred just over 993G of 8.8T of data to the
>> local disks.
>>
>> I will send another update once I'm ready to switch the system back over.
>>
>> Thanks-
>>
>> On Mon, Jun 18, 2018 at 3:49 PM, Lance Albertson <[email protected]>
>> wrote:
>>
>>> I just wanted to send you all an update on where we're at in the process.
>>>
>>> As of right now, ftp-osl is back online and serving its content from
>>> the Ceph volume. I've gone ahead and kicked off a few manual syncs to
>>> catch everything up; however, if you're using us as a master I recommend
>>> you kick off an update job right now. I'm also currently copying the
>>> content to the local disks, which I expect to run through tomorrow
>>> sometime.
>>>
>>> The rebuild took a little bit longer than originally planned due to some
>>> issues I ran into building the new RAID array. My original plan didn't
>>> work so I had to go with plan B, which took a little longer. Plan B
>>> resulted in creating two separate RAID6 arrays, which means I lost about
>>> 2T in capacity from my original plan.
>>>
>>> I'm keeping ftp-osl out of the public rotation for now since its I/O
>>> throughput likely isn't as good as before while it's serving the content
>>> via Ceph.
>>>
>>> I'll send another update tomorrow when I'm ready to switch back over to
>>> local storage. Please let me know if you notice any issues.
>>>
>>> Thanks-
>>>
>>> On Thu, Jun 14, 2018 at 3:52 PM, Lance Albertson <[email protected]>
>>> wrote:
>>>
>>>> I had a few questions regarding this outage that I wanted to clarify
>>>> for everyone.
>>>>
>>>> 1. There should be no outage during the 5.5 hour outage window for
>>>>    anything pointed to ftp.osuosl.org (unless your DNS is directly
>>>>    pointing at ftp-osl.osuosl.org).
>>>> 2. During the 18-24 hour sync from Ceph to local storage, ftp-osl
>>>>    should have normal read/write operations. There might be a small
>>>>    I/O performance hit during that window but it's hard to tell.
>>>>    However, there will be a short (likely 5 min) read/write outage on
>>>>    ftp-osl when I do the final switch back to local storage.
>>>>
>>>> On Thu, Jun 14, 2018 at 10:00 AM, Lance Albertson <[email protected]>
>>>> wrote:
>>>>
>>>>> Service(s) affected: ftp.osuosl.org
>>>>>
>>>>> During the outage, the master syncing node for our FTP cluster
>>>>> (ftp-osl) will be offline, which means any updates to our software
>>>>> mirrors will be delayed.
>>>>>
>>>>> Outage Window:
>>>>> Start: Mon, Jun 18 9:30AM PDT (Mon Jun 18 1630 UTC)
>>>>> End: Mon, Jun 18 3:00PM PDT (Mon Jun 18 2200 UTC)
>>>>>
>>>>> Reason for outage:
>>>>>
>>>>> Our FTP cluster is starting to run low on disk space and we will be
>>>>> adding additional hard drives to the system. Our system currently has
>>>>> 9.375T of disk space and we're planning on upgrading it to 18.75T
>>>>> (this takes into account the RAID6 configuration).
>>>>>
>>>>> Unfortunately, due to the nature of how the disk arrays are
>>>>> configured, we will not be able to grow the RAID array without a
>>>>> complete rebuild. This means we're going to have to re-copy all 8.8TB
>>>>> of data off of the machine and back onto it. Since this task is rather
>>>>> large and time consuming, we've come up with a better alternative so
>>>>> that we don't have our master FTP server offline for very long.
>>>>>
>>>>> We have just recently built a new Ceph cluster for some new storage
>>>>> needs at the OSL and we are going to temporarily use this cluster to
>>>>> serve the ftp-osl content. I've already copied the content onto a new
>>>>> volume and have tested it enough to feel it can handle the load. This
>>>>> should make the transition plan much easier and quicker than initially
>>>>> planned. This server is already out of DNS rotation and we are
>>>>> planning on keeping it out of rotation until this process is complete
>>>>> to reduce the I/O load.
>>>>>
>>>>> So here's the plan thus far, starting on Monday:
>>>>>
>>>>> 1. Stop all services on the system and do one final rsync to the
>>>>>    Ceph volume
>>>>> 2. Reboot the machine, destroy the current RAID, and create a new one
>>>>>    with the new disks
>>>>> 3. Reinstall the OS
>>>>> 4. Bootstrap the machine without FTP components initially and set up
>>>>>    the Ceph volume
>>>>> 5. Deploy FTP components after the Ceph volume is set up and ready to
>>>>>    go
>>>>> 6. Ensure inter-node FTP syncing is working using the Ceph volume
>>>>> 7. Sync data from the Ceph volume back over to local disks (I'm
>>>>>    guessing this will take 18-24 hours)
>>>>> 8. Once the sync is complete, shut down all services and switch the
>>>>>    mount point over to the local disks
>>>>> 9. Profit!
>>>>>
>>>>> I would like to thank IBM for donating the hard drives needed for this
>>>>> upgrade.
>>>>>
>>>>> We plan on doing the storage upgrades on our two other nodes
>>>>> (ftp-nyc & ftp-chi) soon; however, we won't be using the Ceph cluster
>>>>> for this since they are remote. The current plan is to take one
>>>>> machine out for several days and sync the data back between the nodes.
>>>>> I will send another outage announcement for those two nodes once we're
>>>>> ready for that. We still need to ship the drives to the locations and
>>>>> work with the local data centers to get them installed.
>>>>>
>>>>> Projects affected: Any project using our FTP cluster as a master
>>>>> syncing point
>>>>>
>>>>
>> --
>> Lance Albertson
>> Director
>> Oregon State University | Open Source Lab
>>
>
> --
> Lance Albertson
> Director
> Oregon State University | Open Source Lab
>

--
Lance Albertson
Director
Oregon State University | Open Source Lab
_______________________________________________
Hosting mailing list
[email protected]
https://lists.osuosl.org/mailman/listinfo/hosting
