Re: [ceph-users] question on reusing OSD

John-Paul Robinson Wed, 16 Sep 2015 05:22:24 -0700

The move-journal, resize-partition, grow-file-system approach would
work nicely if the spare capacity were at the end of the disk.

Unfortunately, the gdisk (0.8.1) end-of-disk location bug placed the
journal at the 800GB mark, leaving the largest remaining free region at
the end of the disk. I'm assuming the bug was caused by overflowing a
32-bit integer during the -1000M offset-from-end-of-disk calculation.
When gdisk computed the end of the disk for the journal placement on
disks >2TB, it dropped the 2TB part of the size and was left with only
the 800GB value, putting the journal there. After gdisk created the
journal at the 800GB mark (splitting the disk), ceph-disk-prepare told
gdisk to use the largest remaining free region for data, which was the
2TB region at the end.

Here's an example of the buggy partitioning:

    crowbar@da0-36-9f-0e-28-2c:~$ sudo gdisk -l /dev/sdd
    GPT fdisk (gdisk) version 0.8.8

    Partition table scan:
      MBR: protective
      BSD: not present
      APM: not present
      GPT: present

    Found valid GPT with protective MBR; using GPT.
    Disk /dev/sdd: 5859442688 sectors, 2.7 TiB
    Logical sector size: 512 bytes
    Disk identifier (GUID): 6F76BD12-05D6-4FA2-A132-CAC3E1C26C81
    Partition table holds up to 128 entries
    First usable sector is 34, last usable sector is 5859442654
    Partitions will be aligned on 2048-sector boundaries
    Total free space is 1562425343 sectors (745.0 GiB)

    Number  Start (sector)    End (sector)  Size       Code  Name
       1      1564475392      5859442654   2.0 TiB     FFFF  ceph data
       2      1562425344      1564475358   1001.0 MiB  FFFF  ceph journal
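
The 32-bit overflow guess can be checked against the numbers in the
listing above. This is only a sketch under assumptions, not gdisk's
actual code: it assumes the end-of-disk sector was truncated to 32 bits,
that -1000M means 1000 MiB of 512-byte sectors, and that the start was
rounded down to a 2048-sector boundary. The name wrapped_journal is
made up for illustration.

```python
# Sketch: reproduce the journal placement from the listing under the
# ASSUMPTION that the end-of-disk sector was truncated to 32 bits.
SECTOR_BYTES = 512
ALIGN = 2048                      # alignment boundary from the listing
LAST_USABLE = 5859442654          # last usable sector from the listing
JOURNAL_SECTORS = 1000 * 1024 * 1024 // SECTOR_BYTES  # "-1000M", assumed MiB


def wrapped_journal(last_usable):
    """Return (start, end) of the journal if the end sector wraps at 2**32."""
    end = last_usable % 2**32                          # hypothetical truncation
    start = (end - JOURNAL_SECTORS) // ALIGN * ALIGN   # round down to boundary
    return start, end


start, end = wrapped_journal(LAST_USABLE)
print(start, end)                                    # 1562425344 1564475358
print(round(start * SECTOR_BYTES / 1e9), "GB mark")  # 800 GB mark
```

Both values match the start and end sectors of partition 2 in the
listing, which is consistent with the overflow explanation but does not
prove it.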

I assume I could still do a disk-level relocation of the data with dd,
shifting all my content toward the start of the disk and then growing
the file system to the end, but this would take a significant amount of
time, far more than a quick restart of the OSD.
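
To put a rough number on "a significant amount of time", the sketch
below estimates how long copying the 2.0 TiB data partition would take.
The 100 MB/s throughput is purely an assumed figure, not a measurement
from this cluster, and an in-place shift that reads and writes the same
spindle would likely be slower. The name copy_hours is hypothetical.

```python
# Sketch: rough duration of relocating the data partition with dd.
# ASSUMED_MBPS is a hypothetical sustained rate, not a measured value.
SECTOR_BYTES = 512
DATA_START, DATA_END = 1564475392, 5859442654   # partition 1 in the listing
ASSUMED_MBPS = 100


def copy_hours(start, end, mb_per_s):
    """Hours needed to copy sectors start..end (inclusive) at mb_per_s MB/s."""
    size_bytes = (end - start + 1) * SECTOR_BYTES
    return size_bytes / (mb_per_s * 1e6) / 3600


print(f"{copy_hours(DATA_START, DATA_END, ASSUMED_MBPS):.1f} hours")
```

At the assumed rate this comes to roughly six hours per disk, during
which the OSD would be down.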

This leaves me two options: set noout and hope for the best (no other
failures) during the somewhat lengthy dd data movement, or take the OSD
down and let the cluster begin repairing the redundancy.

If I follow the second option of normal OSD loss repair, my disk
repartition step would be fast and I could bring the OSD back up rather
quickly.  Does taking an OSD out of service, erasing it and bringing the
same OSD back into service put any undue stress on the cluster?

I'd prefer to use the second option if I can because I'm likely to
repeat this in the near future in order to add encryption to these disks.

~jpr

On 09/15/2015 06:44 PM, Lionel Bouton wrote:
> On 16/09/2015 01:21, John-Paul Robinson wrote:
>> Hi,
>>
>> I'm working to correct a partitioning error from when our cluster was
>> first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
>> partitions for our OSDs, instead of the 2.8TB actually available on
>> disk, a 29% space hit.  (The error was due to a gdisk bug that
>> mis-computed the end of the disk during the ceph-disk-prepare and placed
>> the journal at the 2TB mark instead of the true end of the disk at
>> 2.8TB. I've updated gdisk to a newer release that works correctly.)
>>
>> I'd like to fix this problem by taking my existing 2TB OSDs offline one
>> at a time, repartitioning them and then bringing them back into the
>> cluster.  Unfortunately I can't just grow the partitions, so the
>> repartition will be destructive.
> Hum, why should it be? If the journal is at the 2TB mark, you should be
> able to:
> - stop the OSD,
> - flush the journal, (ceph-osd -i <osdid> --flush-journal),
> - unmount the data filesystem (might be superfluous but the kernel seems
> to cache the partition layout when a partition is active),
> - remove the journal partition,
> - extend the data partition,
> - place the journal partition at the end of the drive (in fact you
> probably want to write a precomputed partition layout in one go).
> - mount the data filesystem, resize it online,
> - ceph-osd -i <osdid> --mkjournal (assuming your setup can find the
> partition again automatically without reconfiguration)
> - start the OSD
>
> If you script this you should not have to use noout: the OSD should come
> back in a matter of seconds and the impact on the storage network minimal.
>
> Note that the start of the disk is where you get the best sequential
> reads/writes. Given that most data accesses are random and all journal
> accesses are sequential I put the journal at the start of the disk when
> data and journal are sharing the same platters.
>
> Best regards,
>
> Lionel
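
The "precomputed partition layout" Lionel mentions could start from
arithmetic like the following. This is only a sketch: it assumes the
data partition begins at the first aligned sector, a journal of about
1000 MiB at the very end of the drive, and 2048-sector alignment, and
plan_layout is a hypothetical helper. Since the data partition on these
disks does not actually start at the beginning of the disk, it shows
the intended layout rather than an in-place fix.

```python
# Sketch: intended layout with data first and the journal at the end.
# Sector limits come from the gdisk listing; everything else is assumed.
SECTOR_BYTES = 512
ALIGN = 2048
FIRST_USABLE, LAST_USABLE = 34, 5859442654
JOURNAL_SECTORS = 1000 * 1024 * 1024 // SECTOR_BYTES   # 1000 MiB


def plan_layout(first_usable, last_usable):
    """Return inclusive (start, end) sector ranges for data and journal."""
    data_start = -(-first_usable // ALIGN) * ALIGN      # round up to boundary
    journal_end = last_usable
    # Round the journal start down so it is aligned and at least 1000 MiB.
    journal_start = (journal_end - JOURNAL_SECTORS + 1) // ALIGN * ALIGN
    return {"data": (data_start, journal_start - 1),
            "journal": (journal_start, journal_end)}


layout = plan_layout(FIRST_USABLE, LAST_USABLE)
data_sectors = layout["data"][1] - layout["data"][0] + 1
print(layout)
print(f"data: {data_sectors * SECTOR_BYTES / 2**40:.2f} TiB")
```

With these assumptions the data partition would span about 2.73 TiB
instead of the current 2.0 TiB.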

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
