Re: [ceph-users] question on reusing OSD

2015-09-16 Thread John-Paul Robinson
The move-journal, resize-partition, grow-filesystem approach would
work nicely if the spare capacity were at the end of the disk.

Unfortunately, the gdisk (0.8.1) end-of-disk location bug placed the
journal at the 800GB mark, leaving the largest remaining partition at
the end of the disk.  I'm assuming the bug was caused by overflowing a
32-bit int during the -1000M-from-end-of-disk offset calculation: when
gdisk computed the end of the disk for journal placement on disks >2TB,
it dropped the 2TB part of the size and was left with only the 800GB
value, so it put the journal there.  After gdisk created the journal at
the 800GB mark (splitting the disk), ceph-disk-prepare told gdisk to
take the largest remaining partition for data, which was the 2TB
partition at the end.

Here's an example of the buggy partitioning:

crowbar@da0-36-9f-0e-28-2c:~$ sudo gdisk -l /dev/sdd
GPT fdisk (gdisk) version 0.8.8

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdd: 5859442688 sectors, 2.7 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): 6F76BD12-05D6-4FA2-A132-CAC3E1C26C81
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 5859442654
Partitions will be aligned on 2048-sector boundaries
Total free space is 1562425343 sectors (745.0 GiB)

Number  Start (sector)    End (sector)   Size         Code  Name
   1      1564475392      5859442654    2.0 TiB            ceph data
   2      1562425344      1564475358    1001.0 MiB         ceph journal
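
A quick arithmetic check is consistent with the 32-bit overflow theory:
the last usable sector (5859442654) truncated to 32 bits is exactly the
journal's end sector above, and it sits right at the 800GB mark:

$ echo $(( 5859442654 & 0xFFFFFFFF ))                  # sector mod 2^32
1564475358
$ echo $(( (5859442654 & 0xFFFFFFFF) * 512 / 10**9 ))  # as GB
800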



I assume I could still do a disk-level relocation of the data using dd,
shifting all my content forward on the disk and then growing the file
system to the end, but this would take a significant amount of time,
far more than a quick restart of the OSD.
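
Roughly what I have in mind for that (a sketch only, using the sector
numbers above; copying forward on the same spindle is only safe because
the destination offset precedes the source and dd reads sequentially):

# OSD stopped, journal flushed, filesystem unmounted first
sudo dd if=/dev/sdd of=/dev/sdd bs=512 skip=1564475392 seek=2048 count=4294967263
# (a larger bs would be faster, but skip/seek units change with it)
# then rewrite the GPT so the data partition starts at sector 2048,
# xfs_growfs the filesystem, and rebuild the journal at the end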

This leaves me with two options: set noout and hope for the best (no
other failures) during my somewhat lengthy dd data movement, or take my
OSD down and let the cluster begin repairing the redundancy.

If I follow the second option (normal OSD-loss repair), my disk
repartition step would be fast and I could bring the OSD back up rather
quickly.  Does taking an OSD out of service, erasing it, and bringing
the same OSD back into service present any undue stress to the cluster?

I'd prefer to use the second option if I can because I'm likely to
repeat this in the near future in order to add encryption to these disks.

~jpr

On 09/15/2015 06:44 PM, Lionel Bouton wrote:
> On 16/09/2015 01:21, John-Paul Robinson wrote:
>> Hi,
>>
>> I'm working to correct a partitioning error from when our cluster was
>> first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
>> partitions for our OSDs, instead of the 2.8TB actually available on
>> disk, a 29% space hit.  (The error was due to a gdisk bug that
>> mis-computed the end of the disk during the ceph-disk-prepare and placed
>> the journal at the 2TB mark instead of the true end of the disk at
>> 2.8TB. I've updated gdisk to a newer release that works correctly.)
>>
>> I'd like to fix this problem by taking my existing 2TB OSDs offline one
>> at a time, repartitioning them and then bringing them back into the
>> cluster.  Unfortunately I can't just grow the partitions, so the
>> repartition will be destructive.
> Hum, why should it be? If the journal is at the 2TB mark, you should be
> able to:
> - stop the OSD,
> - flush the journal (ceph-osd -i <osd-id> --flush-journal),
> - unmount the data filesystem (might be superfluous but the kernel seems
> to cache the partition layout when a partition is active),
> - remove the journal partition,
> - extend the data partition,
> - place the journal partition at the end of the drive (in fact you
> probably want to write a precomputed partition layout in one go).
> - mount the data filesystem, resize it online,
> - ceph-osd -i <osd-id> --mkjournal (assuming your setup can find the
> partition again automatically without reconfiguration)
> - start the OSD
>
> If you script this you should not have to use noout: the OSD should come
> back in a matter of seconds and the impact on the storage network minimal.
>
> Note that the start of the disk is where you get the best sequential
> reads/writes. Given that most data accesses are random and all journal
> accesses are sequential I put the journal at the start of the disk when
> data and journal are sharing the same platters.
>
> Best regards,
>
> Lionel



Re: [ceph-users] question on reusing OSD

2015-09-16 Thread John-Paul Robinson (Campus)
So I just realized I had described the partition error incorrectly in my 
initial post. The journal was placed at the 800GB mark leaving the 2TB data 
partition at the end of the disk. (See my follow-up to Lionel for details.) 

I'm working to correct that so I have a single large partition the size of the 
disk, save for the journal.

Sorry for any confusion. 

~jpr



> On Sep 15, 2015, at 6:21 PM, John-Paul Robinson  wrote:
> 
> I'm working to correct a partitioning error from when our cluster was
> first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
> partitions for our OSDs, instead of the 2.8TB actually available on
> disk, a 29% space hit.  (The error was due to a gdisk bug that
> mis-computed the end of the disk during the ceph-disk-prepare and placed
> the journal at the 2TB mark instead of the true end of the disk at
> 2.8TB. I've updated gdisk to a newer release that works correctly.)


Re: [ceph-users] question on reusing OSD

2015-09-16 Thread Christian Balzer

Hello,

On Wed, 16 Sep 2015 07:21:26 -0500 John-Paul Robinson wrote:

> The move-journal, resize-partition, grow-filesystem approach would
> work nicely if the spare capacity were at the end of the disk.
>
That shouldn't matter, you can "safely" lose your journal in controlled
circumstances.

This would also be an ideal time to put your journals on SSDs. ^o^

Roughly (you do have a test cluster, don't you? Or at least try this with
just one OSD):

1. set noout just to be sure.
2. stop the OSD
3. "ceph-osd -i osdnum --flush-journal" for warm fuzzies (see man page or
--help)
4. clobber your partitions in a way that leaves you with an intact data
partition, grow that and the FS in it as desired.
5. re-init the journal with "ceph-osd -i osdnum --mkjournal"
6. start the OSD and rejoice. 
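
An untested sketch of the above (the osd id, device, and sector
boundaries are placeholders; service commands depend on your init
system):

ceph osd set noout
service ceph stop osd.12
ceph-osd -i 12 --flush-journal
# drop both partitions and recreate them in one go: data keeps its start
# sector but gets a new end, journal goes into the remaining free space
# (an end sector of 0 tells sgdisk "end of the free block")
sgdisk -d 2 -d 1 -n 1:<data-start>:<new-data-end> -n 2:<journal-start>:0 /dev/sdd
sgdisk -c 1:"ceph data" -c 2:"ceph journal" /dev/sdd  # if anything relies on the names
mount /dev/sdd1 /var/lib/ceph/osd/ceph-12
xfs_growfs /var/lib/ceph/osd/ceph-12
ceph-osd -i 12 --mkjournal
service ceph start osd.12
ceph osd unset noout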
 
More below.

> Unfortunately, the gdisk (0.8.1) end-of-disk location bug placed the
> journal at the 800GB mark, leaving the largest remaining partition at
> the end of the disk.  I'm assuming the bug was caused by overflowing a
> 32-bit int during the -1000M-from-end-of-disk offset calculation: when
> gdisk computed the end of the disk for journal placement on disks >2TB,
> it dropped the 2TB part of the size and was left with only the 800GB
> value, so it put the journal there.  After gdisk created the journal at
> the 800GB mark (splitting the disk), ceph-disk-prepare told gdisk to
> take the largest remaining partition for data, which was the 2TB
> partition at the end.
> 
> Here's an example of the buggy partitioning:
> 
> crowbar@da0-36-9f-0e-28-2c:~$ sudo gdisk -l /dev/sdd
> GPT fdisk (gdisk) version 0.8.8
> 
> Partition table scan:
>   MBR: protective
>   BSD: not present
>   APM: not present
>   GPT: present
> 
> Found valid GPT with protective MBR; using GPT.
> Disk /dev/sdd: 5859442688 sectors, 2.7 TiB
> Logical sector size: 512 bytes
> Disk identifier (GUID): 6F76BD12-05D6-4FA2-A132-CAC3E1C26C81
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 5859442654
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 1562425343 sectors (745.0 GiB)
> 
> Number  Start (sector)    End (sector)   Size         Code  Name
>    1      1564475392      5859442654    2.0 TiB            ceph data
>    2      1562425344      1564475358    1001.0 MiB         ceph journal
> 
> 
> 
> I assume I could still do a disk-level relocation of the data using dd,
> shifting all my content forward on the disk and then growing the file
> system to the end, but this would take a significant amount of time,
> far more than a quick restart of the OSD.
> 
> This leaves me with two options: set noout and hope for the best (no
> other failures) during my somewhat lengthy dd data movement, or take my
> OSD down and let the cluster begin repairing the redundancy.
> 
> If I follow the second option (normal OSD-loss repair), my disk
> repartition step would be fast and I could bring the OSD back up rather
> quickly.  Does taking an OSD out of service, erasing it, and bringing
> the same OSD back into service present any undue stress to the cluster?
> 
Undue is such a nicely ambiguous word.
Recovering/backfilling an OSD will stress your cluster, especially
considering that you're not using SSDs and are running a positively
ancient version of Ceph.

Make sure to set all appropriate recovery/backfill options to their
minimum.
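
For example, injected at runtime (option names as of roughly this era
of Ceph; check your version's documentation):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'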

OTOH your cluster should be able to handle the loss of OSDs w/o melting
down, and given the presumed age of your cluster you must have had OSD
failures before.
How did it fare then?

I have one cluster where losing an OSD would be just background noise,
while another one would be seriously impacted by such a loss (working on
correcting that).

Regards,

Christian
> I'd prefer to use the second option if I can because I'm likely to
> repeat this in the near future in order to add encryption to these disks.
> 
> ~jpr
> 
> On 09/15/2015 06:44 PM, Lionel Bouton wrote:
> > On 16/09/2015 01:21, John-Paul Robinson wrote:
> >> Hi,
> >>
> >> I'm working to correct a partitioning error from when our cluster was
> >> first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
> >> partitions for our OSDs, instead of the 2.8TB actually available on
> >> disk, a 29% space hit.  (The error was due to a gdisk bug that
> >> mis-computed the end of the disk during the ceph-disk-prepare and
> >> placed the journal at the 2TB mark instead of the true end of the
> >> disk at 2.8TB. I've updated gdisk to a newer release that works
> >> correctly.)
> >>
> >> I'd like to fix this problem by taking my existing 2TB OSDs offline
> >> one at a time, repartitioning them and then bringing them back into
> >> the cluster.  Unfortunately I can't just grow the partitions, so the
> >> repartition will be destructive.
> > Hum, why should it be? If the journal is at the 2TB mark, you should be
> > able to:
> > - stop the OSD,
> 

Re: [ceph-users] question on reusing OSD

2015-09-16 Thread Robert LeBlanc

My understanding of growing file systems is the same as yours: they can
only grow at the end, not the beginning. In addition, having
partition 2 sit before partition 1 just cries out to me to be fixed, but
that is purely aesthetic.

Because the weights of the drives will be different, there will be
some additional data movement (probably minimized if you are using
straw2). Setting noout will prevent Ceph from shuffling data around
while you are making the changes. When you bring the OSD back in, it
should receive only the PGs that were on it before, minimizing the data
movement in the cluster. But because you are adding 800 GB, it will
want to take a few more PGs, so some shuffling in the cluster is
inevitable.

I don't know how well it would work, but you could bring all the
reformatted OSDs in at their current weight and then, when you have
them all re-done, edit the crush map to set the weights right; ideally
the ratios would stay the same so no (or very little) data movement
would occur. Due to an error in the straw algorithm, there is still the
potential for large amounts of data movement from small weight changes.
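
For example (osd number and weights purely illustrative, with 1.82
roughly the old 2TB and 2.73 roughly the new 2.8TB expressed in TiB):

ceph osd crush reweight osd.12 1.82   # keep the old weight after reformatting
# ... once all the OSDs have been re-done ...
ceph osd crush reweight osd.12 2.73   # then move to the true size-based weight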

As to your question about adding the disk back before the rebalance has
completed: it is fine to do so. Ceph will complete the PGs that are
currently being relocated, but compute new locations based on the new
disk. This may result in a PG that just finished moving being relocated
again. The cluster will still perform and will not lose data.

About saving OSD IDs: I only know that if you don't have gaps in your
OSD numbering (from OSDs that were retired and not replaced), then if
you remove an OSD and recreate it, it will get the same number, because
the lowest available number is that of the OSD being replaced. I don't
know whether saving off the files before wiping the OSD will preserve
its identity.
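
For what it's worth, the usual remove/recreate sequence looks like this
(osd number illustrative); "ceph osd create" hands out the lowest free
id, which is why gap-free numbering gives you the same one back:

ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12
ceph osd create    # returns 12 again if no lower id is free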

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Sep 16, 2015 at 10:12 AM, John-Paul Robinson  wrote:
> Christian,
>
> Thanks for the feedback.
>
> I guess I'm wondering about step 4, "clobber partition, leaving data
> intact and grow partition and the file system as needed".
>
> My understanding of xfs_growfs is that the free space must be at the end
> of the existing file system.  In this case the existing partition starts
> around the 800GB mark on the disk and extends to the end of the
> disk.  My goal is to add the first 800GB of the disk to that partition
> so it can become a single data partition.
>
> Note that my volumes are not LVM based so I can't extend the volume by
> incorporating the free space at the start of the disk.
>
> Am I misunderstanding something about file system grow commands?
>
> Regarding your comments on the impact to the cluster of a downed OSD: I
> have lost OSDs and the impact was minimal (acceptable).
>
> My concern is around taking an OSD down, having the cluster initiate
> recovery, and then bringing that same OSD back into the cluster in an
> empty state.  Are the placement groups that originally had data on this
> OSD already remapped by that point (even if they aren't fully recovered),
> so that bringing the empty replacement OSD online simply causes a
> different set of placement groups to be mapped onto it to achieve the
> rebalance?
>
> Thanks,
>
> ~jpr
>
> On 09/16/2015 08:37 AM, Christian Balzer wrote:
>> Hello,
>>
>> On Wed, 16 Sep 2015 07:21:26 -0500 John-Paul Robinson wrote:
>>
>>> > The move-journal, resize-partition, grow-filesystem approach would
>>> > work nicely if the spare capacity were at the end of the disk.
>>> >
>> That shouldn't matter, you can "safely" lose your journal in controlled
>> circumstances.
>>
>> This would also be an ideal time to put your journals on SSDs. ^o^
>>
>> Roughly (you do have a test cluster, don't you? Or at least try this with
>> just one OSD):
>>
>> 1. set noout just to be sure.
>> 2. stop the OSD
>> 3. "ceph-osd -i osdnum --flush-journal" for warm fuzzies (see man page or
>> --help)
>> 4. clobber your partitions in a way that leaves you with an intact data
>> partition, grow that and the FS in it as desired.

Re: [ceph-users] question on reusing OSD

2015-09-16 Thread John-Paul Robinson
Christian,

Thanks for the feedback.

I guess I'm wondering about step 4, "clobber partition, leaving data
intact and grow partition and the file system as needed".

My understanding of xfs_growfs is that the free space must be at the end
of the existing file system.  In this case the existing partition starts
around the 800GB mark on the disk and extends to the end of the
disk.  My goal is to add the first 800GB of the disk to that partition
so it can become a single data partition.

Note that my volumes are not LVM based so I can't extend the volume by
incorporating the free space at the start of the disk.

Am I misunderstanding something about file system grow commands?

Regarding your comments on the impact to the cluster of a downed OSD: I
have lost OSDs and the impact was minimal (acceptable).

My concern is around taking an OSD down, having the cluster initiate
recovery, and then bringing that same OSD back into the cluster in an
empty state.  Are the placement groups that originally had data on this
OSD already remapped by that point (even if they aren't fully recovered),
so that bringing the empty replacement OSD online simply causes a
different set of placement groups to be mapped onto it to achieve the
rebalance?

Thanks,

~jpr

On 09/16/2015 08:37 AM, Christian Balzer wrote:
> Hello,
>
> On Wed, 16 Sep 2015 07:21:26 -0500 John-Paul Robinson wrote:
>
>> > The move-journal, resize-partition, grow-filesystem approach would
>> > work nicely if the spare capacity were at the end of the disk.
>> >
> That shouldn't matter, you can "safely" lose your journal in controlled
> circumstances.
>
> This would also be an ideal time to put your journals on SSDs. ^o^
>
> Roughly (you do have a test cluster, don't you? Or at least try this with
> just one OSD):
>
> 1. set noout just to be sure.
> 2. stop the OSD
> 3. "ceph-osd -i osdnum --flush-journal" for warm fuzzies (see man page or
> --help)
> 4. clobber your partitions in a way that leaves you with an intact data
> partition, grow that and the FS in it as desired.
> 5. re-init the journal with "ceph-osd -i osdnum --mkjournal"
> 6. start the OSD and rejoice. 
>  
> More below.
>



Re: [ceph-users] question on reusing OSD

2015-09-15 Thread Lionel Bouton
On 16/09/2015 01:21, John-Paul Robinson wrote:
> Hi,
>
> I'm working to correct a partitioning error from when our cluster was
> first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
> partitions for our OSDs, instead of the 2.8TB actually available on
> disk, a 29% space hit.  (The error was due to a gdisk bug that
> mis-computed the end of the disk during the ceph-disk-prepare and placed
> the journal at the 2TB mark instead of the true end of the disk at
> 2.8TB. I've updated gdisk to a newer release that works correctly.)
>
> I'd like to fix this problem by taking my existing 2TB OSDs offline one
> at a time, repartitioning them and then bringing them back into the
> cluster.  Unfortunately I can't just grow the partitions, so the
> repartition will be destructive.

Hum, why should it be? If the journal is at the 2TB mark, you should be
able to:
- stop the OSD,
- flush the journal (ceph-osd -i <osd-id> --flush-journal),
- unmount the data filesystem (might be superfluous but the kernel seems
to cache the partition layout when a partition is active),
- remove the journal partition,
- extend the data partition,
- place the journal partition at the end of the drive (in fact you
probably want to write a precomputed partition layout in one go).
- mount the data filesystem, resize it online,
- ceph-osd -i <osd-id> --mkjournal (assuming your setup can find the
partition again automatically without reconfiguration)
- start the OSD
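
For example (untested; the boundary sectors have to be precomputed for
your drive, and sgdisk applies the whole rewrite in a single invocation):

sgdisk --backup=/root/sdd-gpt.bak /dev/sdd    # escape hatch first
sgdisk -d 2 -d 1 \
       -n 1:<data-start>:<new-data-end> \
       -n 2:<journal-start>:<last-usable-sector> /dev/sdd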

If you script this you should not have to use noout: the OSD should come
back in a matter of seconds and the impact on the storage network minimal.

Note that the start of the disk is where you get the best sequential
reads/writes. Given that most data accesses are random and all journal
accesses are sequential I put the journal at the start of the disk when
data and journal are sharing the same platters.

Best regards,

Lionel


[ceph-users] question on reusing OSD

2015-09-15 Thread John-Paul Robinson
Hi,

I'm working to correct a partitioning error from when our cluster was
first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
partitions for our OSDs, instead of the 2.8TB actually available on
disk, a 29% space hit.  (The error was due to a gdisk bug that
mis-computed the end of the disk during the ceph-disk-prepare and placed
the journal at the 2TB mark instead of the true end of the disk at
2.8TB. I've updated gdisk to a newer release that works correctly.)

I'd like to fix this problem by taking my existing 2TB OSDs offline one
at a time, repartitioning them and then bringing them back into the
cluster.  Unfortunately I can't just grow the partitions, so the
repartition will be destructive.

I would like the reformatted OSD to come back into the cluster looking
just like the original OSD, except that it now has 2.8TB for its data.
That is, I'd like the OSD number to stay the same and for the cluster to
treat it as the original disk (save for not having any data on it).

Ordinarily, I would add an OSD by bringing a system into the cluster
triggering these events:

ceph-disk-prepare /dev/sdb /dev/sdb   # partitions the disk; journal on the
                                      # same disk in this older cluster
ceph-disk-activate /dev/sdb           # registers the OSD with the cluster

The ceph-disk-prepare is focused on partitioning and doesn't interact
with the cluster.  The ceph-disk-activate takes care of making the OSD
look like an OSD and adding it into the cluster.

Inside ceph-disk-activate, the code looks for some special files at the
top of the /dev/sdb1 file system, including magic, ceph_fsid, and
whoami (which is where the OSD number is stored).

My first question is: can I preserve these special files and put them
back on the repartitioned/reformatted drive, causing ceph-disk-activate
to just bring the OSD back into the cluster under its original identity,
or is there a better way to do what I want?
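
Concretely, what I'm picturing (paths assume the default OSD data mount;
the osd number is illustrative):

mkdir /root/osd.12-identity
cp /var/lib/ceph/osd/ceph-12/{magic,ceph_fsid,whoami} /root/osd.12-identity/
# ... repartition and mkfs the disk ...
mount /dev/sdd1 /var/lib/ceph/osd/ceph-12
cp /root/osd.12-identity/* /var/lib/ceph/osd/ceph-12/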

My second question is: if I take an OSD out of the cluster, should I
wait for the subsequent rebalance to complete before bringing the
reformatted OSD back into the cluster?  That is, will it cause problems
to drop an OSD out of the cluster and then bring the same OSD back in,
except without any of the data?  I'm assuming this is similar to what
would happen in a standard disk-replacement scenario.

I reviewed the thread from Sept 2014
(https://www.mail-archive.com/ceph-users@lists.ceph.com/msg13394.html)
discussing a similar scenario.  That discussion was more focused on
re-using a journal slot on an SSD; in my case the journal is on the same
disk as the data.  Also, I don't have a recent release of ceph, so I
likely won't benefit from the associated fix.

Thanks for any suggestions.

~jpr