Very unbalanced storage

2012-08-31 Thread Xiaopong Tran

Hi,

Ceph storage on each disk in the cluster is very unbalanced. On each
node, the data seems to go to one or two disks, while other disks
are almost empty.

I can't find anything wrong from the crush map, it's just the
default for now. Attached is the crush map.

Here is the current situation on node s11:

Filesystem  Size  Used Avail Use% Mounted on
/dev/sdb1   932G  4.3G  927G   1% /disk1
/dev/sdc1   932G  4.3G  927G   1% /disk2
/dev/sdd1   932G  4.3G  927G   1% /disk3
/dev/sde1   932G  4.3G  927G   1% /disk4
/dev/sdf1   932G  4.3G  927G   1% /disk5
/dev/sdg1   932G  4.3G  927G   1% /disk6
/dev/sdh1   932G  4.3G  927G   1% /disk7
/dev/sdi1   932G  4.3G  927G   1% /disk8
/dev/sdj1   932G  4.3G  927G   1% /disk9
/dev/sdk1   932G  445G  487G  48% /disk10


Here, we can see that all the data seems to go to one osd only, while the
others are almost empty.

And here's the situation on node s21:

Filesystem  Size  Used Avail Use% Mounted on
/dev/sdb1   932G  443G  489G  48% /disk1
/dev/sdc1   932G  4.3G  927G   1% /disk2
/dev/sdd1   932G  4.3G  927G   1% /disk3
/dev/sde1   932G  4.3G  927G   1% /disk4
/dev/sdf1   932G  4.3G  927G   1% /disk5
/dev/sdg1   932G  4.3G  927G   1% /disk6
/dev/sdh1   932G  4.3G  927G   1% /disk7
/dev/sdi1   932G  4.3G  927G   1% /disk8
/dev/sdj1   932G  449G  483G  49% /disk9
/dev/sdk1   932G  4.3G  927G   1% /disk10


The situation is a bit better, but not by much; the data is stored mainly
on two disks.

Here is a better situation, on node s12:

Filesystem  Size  Used Avail Use% Mounted on
/dev/sdb1   1.9T  453G  1.4T  25% /disk1
/dev/sdc1   1.9T  4.3G  1.9T   1% /disk2
/dev/sdd1   1.9T  4.4G  1.9T   1% /disk3
/dev/sde1   1.9T  4.3G  1.9T   1% /disk4
/dev/sdf1   1.9T  457G  1.4T  25% /disk5
/dev/sdg1   1.9T  443G  1.4T  24% /disk6
/dev/sdh1   1.9T  4.4G  1.9T   1% /disk7
/dev/sdi1   1.9T  4.4G  1.9T   1% /disk8
/dev/sdj1   1.9T  427G  1.5T  23% /disk9
/dev/sdk1   1.9T  4.4G  1.9T   1% /disk10


It's better than the other two, but still not what I expected. I
expected the data to be spread out according to the weight of each
osd, as defined in the crush map, or at least as close to that
as possible. It might just be some obviously stupid config error,
but I don't know. This can't be normal, can it?

Thanks for any hint.

Xiaopong
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 osd.39
device 40 osd.40
device 41 osd.41
device 42 osd.42
device 43 osd.43
device 44 osd.44
device 45 osd.45
device 46 osd.46
device 47 osd.47
device 48 osd.48
device 49 osd.49
device 50 osd.50
device 51 osd.51
device 52 osd.52
device 53 osd.53
device 54 osd.54
device 55 osd.55
device 56 osd.56
device 57 osd.57
device 58 osd.58
device 59 osd.59
device 60 osd.60
device 61 osd.61
device 62 osd.62
device 63 osd.63
device 64 osd.64
device 65 osd.65
device 66 osd.66
device 67 osd.67
device 68 osd.68
device 69 osd.69
device 70 osd.70

Re: Very unbalanced storage

2012-08-31 Thread Andrew Thompson

On 8/31/2012 7:11 AM, Xiaopong Tran wrote:

Hi,

Ceph storage on each disk in the cluster is very unbalanced. On each
node, the data seems to go to one or two disks, while other disks
are almost empty.

I can't find anything wrong from the crush map, it's just the
default for now. Attached is the crush map.


Have you been reweight-ing osds? I went round and round with my cluster 
a few days ago reloading different crush maps, only to find that 
re-injecting a crush map didn't seem to overwrite the reweights.


Take a look at `ceph osd tree` to see if the reweight column matches the 
weight column.


Note: I'm new at this. This is my experience only, with 0.48.1, and may 
not be entirely correct.


--
Andrew Thompson
http://aktzero.com/



Re: Very unbalanced storage

2012-08-31 Thread Sage Weil
On Fri, 31 Aug 2012, Xiaopong Tran wrote:
> Hi,
> 
> Ceph storage on each disk in the cluster is very unbalanced. On each
> node, the data seems to go to one or two disks, while other disks
> are almost empty.
> 
> I can't find anything wrong from the crush map, it's just the
> default for now. Attached is the crush map.

This is usually a problem with the pg_num for the pool you are using.  Can 
you include the output from 'ceph osd dump | grep ^pool'?  By default, 
pools get 8 pgs, which will distribute poorly.

sage
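A rough back-of-the-envelope for why 8 PGs distributes poorly: with rep size 3,
a pool of 8 PGs occupies at most 8 * 3 = 24 OSDs, so on a cluster of ~70 OSDs
most disks never receive any of that pool's data at all, and a handful of disks
each carry about 1/8 of it. The check Sage asks for is literally:

ceph osd dump | grep ^pool

and the fields to compare against the OSD count are pg_num and pgp_num on the
pool being written to.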


> 
> Here is the current situation on node s11:
> 
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sdb1   932G  4.3G  927G   1% /disk1
> /dev/sdc1   932G  4.3G  927G   1% /disk2
> /dev/sdd1   932G  4.3G  927G   1% /disk3
> /dev/sde1   932G  4.3G  927G   1% /disk4
> /dev/sdf1   932G  4.3G  927G   1% /disk5
> /dev/sdg1   932G  4.3G  927G   1% /disk6
> /dev/sdh1   932G  4.3G  927G   1% /disk7
> /dev/sdi1   932G  4.3G  927G   1% /disk8
> /dev/sdj1   932G  4.3G  927G   1% /disk9
> /dev/sdk1   932G  445G  487G  48% /disk10
> 
> Here, we can see that all data seem to go to one osd only, while others
> are almost empty.
> 
> And here's the situation on node s21:
> 
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sdb1   932G  443G  489G  48% /disk1
> /dev/sdc1   932G  4.3G  927G   1% /disk2
> /dev/sdd1   932G  4.3G  927G   1% /disk3
> /dev/sde1   932G  4.3G  927G   1% /disk4
> /dev/sdf1   932G  4.3G  927G   1% /disk5
> /dev/sdg1   932G  4.3G  927G   1% /disk6
> /dev/sdh1   932G  4.3G  927G   1% /disk7
> /dev/sdi1   932G  4.3G  927G   1% /disk8
> /dev/sdj1   932G  449G  483G  49% /disk9
> /dev/sdk1   932G  4.3G  927G   1% /disk10
> 
> The situation is a bit better, but not much, the data are stored on two
> disks mainly.
> 
> Here is a better situation, on node s12:
> 
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sdb1   1.9T  453G  1.4T  25% /disk1
> /dev/sdc1   1.9T  4.3G  1.9T   1% /disk2
> /dev/sdd1   1.9T  4.4G  1.9T   1% /disk3
> /dev/sde1   1.9T  4.3G  1.9T   1% /disk4
> /dev/sdf1   1.9T  457G  1.4T  25% /disk5
> /dev/sdg1   1.9T  443G  1.4T  24% /disk6
> /dev/sdh1   1.9T  4.4G  1.9T   1% /disk7
> /dev/sdi1   1.9T  4.4G  1.9T   1% /disk8
> /dev/sdj1   1.9T  427G  1.5T  23% /disk9
> /dev/sdk1   1.9T  4.4G  1.9T   1% /disk10
> 
> It's better than the other two, but still not what I expected. I
> expected the data to be spread out according to the weight of each
> osd, as defined in the crush map. Or at least, as close to that
> as possible. It might be just some obviously stupid config error,
> but I don't know. This can't be normal, can it?
> 
> Thanks for any hint.
> 
> Xiaopong
> 


Re: Very unbalanced storage

2012-08-31 Thread Sage Weil
On Fri, 31 Aug 2012, Andrew Thompson wrote:
> On 8/31/2012 7:11 AM, Xiaopong Tran wrote:
> > Hi,
> > 
> > Ceph storage on each disk in the cluster is very unbalanced. On each
> > node, the data seems to go to one or two disks, while other disks
> > are almost empty.
> > 
> > I can't find anything wrong from the crush map, it's just the
> > default for now. Attached is the crush map.
> 
> Have you been reweight-ing osds? I went round and round with my cluster a few
> days ago reloading different crush maps only to find that it re-injecting a
> crush map didn't seem to overwrite reweights.
> 
> Take a look at `ceph osd tree` to see if the reweight column matches the
> weight column.

Note that the ideal situation is for reweight to be 1, regardless of what 
the crush weight is.  If you find the utilizations are skewed, I would 
look for other causes before resorting to reweight-by-utilization; it is 
meant to adjust the normal statistical variation you expect from a 
(pseudo)random placement, but if the variance is high there is likely 
another cause.

sage

> 
> Note: I'm new at this. This is my experience only, with 0.48.1, and may not be
> entirely correct.
> 
> -- 
> Andrew Thompson
> http://aktzero.com/
> 


Re: Very unbalanced storage

2012-08-31 Thread Andrew Thompson

On 8/31/2012 12:10 PM, Sage Weil wrote:

On Fri, 31 Aug 2012, Andrew Thompson wrote:
Have you been reweight-ing osds? I went round and round with my 
cluster a few days ago reloading different crush maps only to find 
that it re-injecting a crush map didn't seem to overwrite reweights. 
Take a look at `ceph osd tree` to see if the reweight column matches 
the weight column. 

Note that the ideal situation is for reweight to be 1, regardless of what
the crush weight is.  If you find the utilizations are skewed, I would
look for other causes before resorting to reweight-by-utilization; it is
meant to adjust the normal statistical variation you expect from a
(pseudo)random placement, but if the variance is high there is likely
another cause.


So if someone (me, guilty) had been messing with reweight, will setting 
them all to 1 return it to a normal un-reweight-ed state?


--
Andrew Thompson
http://aktzero.com/



Re: Very unbalanced storage

2012-08-31 Thread Gregory Farnum
On Fri, Aug 31, 2012 at 9:24 AM, Andrew Thompson  wrote:
> On 8/31/2012 12:10 PM, Sage Weil wrote:
>>
>> On Fri, 31 Aug 2012, Andrew Thompson wrote:
>>>
>>> Have you been reweight-ing osds? I went round and round with my cluster a
>>> few days ago reloading different crush maps only to find that it
>>> re-injecting a crush map didn't seem to overwrite reweights. Take a look at
>>> `ceph osd tree` to see if the reweight column matches the weight column.
>>
>> Note that the ideal situation is for reweight to be 1, regardless of what
>> the crush weight is.  If you find the utilizations are skewed, I would
>> look for other causes before resorting to reweight-by-utilization; it is
>> meant to adjust the normal statistical variation you expect from a
>> (pseudo)random placement, but if the variance is high there is likely
>> another cause.
>
>
> So if someone(me, guilty) had been messing with reweight, will setting them
> all to 1 return it to a normal un-reweight-ed state?

Yep!
If you have OSDs with different sizes you'll want to adjust the CRUSH
weights, not the reweight values:
http://ceph.com/docs/master/ops/manage/crush/#adjusting-the-crush-weight
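A minimal sketch of what that amounts to on the command line (osd numbers and
weights here are just examples for a 2TB and a 1TB drive):

ceph osd reweight 30 1                 # clear any temporary reweight back to normal
ceph osd crush reweight osd.30 2.0     # CRUSH weight proportional to capacity: 2TB drive
ceph osd crush reweight osd.31 1.0     # 1TB drive

The first command undoes earlier reweight experiments; the other two encode the
relative disk sizes in the CRUSH map itself, which is the long-term knob Greg
is pointing at.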


Re: Very unbalanced storage

2012-08-31 Thread Sage Weil
On Fri, 31 Aug 2012, Andrew Thompson wrote:
> On 8/31/2012 12:10 PM, Sage Weil wrote:
> > On Fri, 31 Aug 2012, Andrew Thompson wrote:
> > > Have you been reweight-ing osds? I went round and round with my cluster a
> > > few days ago reloading different crush maps only to find that it
> > > re-injecting a crush map didn't seem to overwrite reweights. Take a look
> > > at `ceph osd tree` to see if the reweight column matches the weight
> > > column. 
> > Note that the ideal situation is for reweight to be 1, regardless of what
> > the crush weight is.  If you find the utilizations are skewed, I would
> > look for other causes before resorting to reweight-by-utilization; it is
> > meant to adjust the normal statistical variation you expect from a
> > (pseudo)random placement, but if the variance is high there is likely
> > another cause.
> 
> So if someone(me, guilty) had been messing with reweight, will setting them
> all to 1 return it to a normal un-reweight-ed state?

Yep!  :)

sage



Re: Very unbalanced storage

2012-08-31 Thread Xiaopong Tran

On 09/01/2012 12:39 AM, Gregory Farnum wrote:

On Fri, Aug 31, 2012 at 9:24 AM, Andrew Thompson  wrote:

On 8/31/2012 12:10 PM, Sage Weil wrote:


On Fri, 31 Aug 2012, Andrew Thompson wrote:


Have you been reweight-ing osds? I went round and round with my cluster a
few days ago reloading different crush maps only to find that it
re-injecting a crush map didn't seem to overwrite reweights. Take a look at
`ceph osd tree` to see if the reweight column matches the weight column.


Note that the ideal situation is for reweight to be 1, regardless of what
the crush weight is.  If you find the utilizations are skewed, I would
look for other causes before resorting to reweight-by-utilization; it is
meant to adjust the normal statistical variation you expect from a
(pseudo)random placement, but if the variance is high there is likely
another cause.



So if someone(me, guilty) had been messing with reweight, will setting them
all to 1 return it to a normal un-reweight-ed state?


Yep!
If you have OSDs with different sizes you'll want to adjust the CRUSH
weights, not the reweight values:
http://ceph.com/docs/master/ops/manage/crush/#adjusting-the-crush-weight


Thanks for the reply. Yes, that's what I did. We have 1TB and 2TB drives,
so using 1TB as the baseline with a weight of 1.0, I'd like the 2TB drives
to store 2x the amount of data, so that the disks always hold roughly the
same relative amount of data.

Originally, every osd had a weight of 1.0, and I ran:

ceph osd crush reweight osd.30 2.0

and the same for all the other 2TB disks.

And that's probably what caused the skew afterward. The crush map
attached in my last message was fetched from the cluster, and

ceph osd tree

does show the weight of the 2TB disks as 2, but the reweight is 1.

Now I'm getting confused by the meaning of crush weight :)

Best,

Xiaopong







Re: Very unbalanced storage

2012-08-31 Thread Xiaopong Tran

On 09/01/2012 12:05 AM, Sage Weil wrote:

On Fri, 31 Aug 2012, Xiaopong Tran wrote:

Hi,

Ceph storage on each disk in the cluster is very unbalanced. On each
node, the data seems to go to one or two disks, while other disks
are almost empty.

I can't find anything wrong from the crush map, it's just the
default for now. Attached is the crush map.


This is usually a problem with the pg_num for the pool you are using.  Can
you include the output from 'ceph osd dump | grep ^pool'?  By default,
pools get 8 pgs, which will distribute poorly.

sage



Here is the pool I'm interested in:

pool 9 'yunio2' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8 
pgp_num 8 last_change 216 owner 0


So, ok, by default, the pg_num is really small. That's a very dumb
mistake I made. Is there any easy way to change this?

I looked at the tunables; if I upgrade to v0.48.1 or v0.49,
would I be able to tune the pg_num value?

Thanks for any help, this is quite a serious issue.

Xiaopong


Re: Very unbalanced storage

2012-08-31 Thread Sage Weil
On Sat, 1 Sep 2012, Xiaopong Tran wrote:
> On 09/01/2012 12:05 AM, Sage Weil wrote:
> > On Fri, 31 Aug 2012, Xiaopong Tran wrote:
> > > Hi,
> > > 
> > > Ceph storage on each disk in the cluster is very unbalanced. On each
> > > node, the data seems to go to one or two disks, while other disks
> > > are almost empty.
> > > 
> > > I can't find anything wrong from the crush map, it's just the
> > > default for now. Attached is the crush map.
> > 
> > This is usually a problem with the pg_num for the pool you are using.  Can
> > you include the output from 'ceph osd dump | grep ^pool'?  By default,
> > pools get 8 pgs, which will distribute poorly.
> > 
> > sage
> > 
> > 
> Here is the pool I'm interested in:
> 
> pool 9 'yunio2' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8
> pgp_num 8 last_change 216 owner 0
> 
> So, ok, by default, the pg_num is really small. That's a very dumb
> mistake I made. Is there any easy way to change this?

I think me choosing 8 as the default was the dumb thing :)
 
> I looked at the tunables, if I upgrade to v0.48.1 or v0.49,
> then would I be able to tune the pg_num value?

Sadly you can't yet adjust pg_num for an active pool.  You can create a 
new pool,

ceph osd pool create <poolname> <pg_num>

I would aim for 20 * num_osd, or thereabouts.. see 

http://ceph.com/docs/master/ops/manage/grow/placement-groups/

Then you can copy the data from the old pool to the new one with

rados cppool yunio2 yunio3

This won't be particularly fast, but it will work.  You can also do

ceph osd pool rename <oldname> <newname>
ceph osd pool delete <poolname>

I hope this solves your problem!
sage
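Put together, one possible sequence for a pool of this size (a sketch only: it
assumes roughly 70 OSDs, that the pool is accessed directly via librados rather
than as a CephFS data pool, and that clients can be kept off it while the copy
runs; 'yunio2-old' is just a placeholder name):

ceph osd pool create yunio3 1400       # ~20 PGs per OSD for ~70 OSDs
rados cppool yunio2 yunio3             # slow; stop writers to yunio2 first
ceph osd pool rename yunio2 yunio2-old
ceph osd pool rename yunio3 yunio2
ceph osd pool delete yunio2-old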



Re: Very unbalanced storage

2012-08-31 Thread Sage Weil
On Sat, 1 Sep 2012, Xiaopong Tran wrote:
> On 09/01/2012 12:39 AM, Gregory Farnum wrote:
> > On Fri, Aug 31, 2012 at 9:24 AM, Andrew Thompson 
> > wrote:
> > > On 8/31/2012 12:10 PM, Sage Weil wrote:
> > > > 
> > > > On Fri, 31 Aug 2012, Andrew Thompson wrote:
> > > > > 
> > > > > Have you been reweight-ing osds? I went round and round with my
> > > > > cluster a
> > > > > few days ago reloading different crush maps only to find that it
> > > > > re-injecting a crush map didn't seem to overwrite reweights. Take a
> > > > > look at
> > > > > `ceph osd tree` to see if the reweight column matches the weight
> > > > > column.
> > > > 
> > > > Note that the ideal situation is for reweight to be 1, regardless of
> > > > what
> > > > the crush weight is.  If you find the utilizations are skewed, I would
> > > > look for other causes before resorting to reweight-by-utilization; it is
> > > > meant to adjust the normal statistical variation you expect from a
> > > > (pseudo)random placement, but if the variance is high there is likely
> > > > another cause.
> > > 
> > > 
> > > So if someone(me, guilty) had been messing with reweight, will setting
> > > them
> > > all to 1 return it to a normal un-reweight-ed state?
> > 
> > Yep!
> > If you have OSDs with different sizes you'll want to adjust the CRUSH
> > weights, not the reweight values:
> > http://ceph.com/docs/master/ops/manage/crush/#adjusting-the-crush-weight
> 
> Thanks for the reply. Yes, this was what I did, we had 1TB and 2TB HD,
> so using 1TB as the base line, with weight being 1.0, then I'd like that
> the 2TB HD store 2x amount of data, so that the disks always have
> roughly same relative amount of data.
> 
> Originally, every osd has weight of 1.0, and I did:
> 
> ceph osd crush reweight osd.30 2.0
> 
> and all the 2TB disks.
> 
> And that's probably what caused the skew afterward. The crush map
> attached in my last message was fetched from the cluster, and
> 
> ceph osd tree
> 
> does show that the weight of the 2TB disks as 2, but reweight is 1.
> 
> Now I'm getting confused by the meaning of crush weight :)

Yes, sorry.  The left one (in osd tree) is the 'crush weight', and the 
right one is the 'reweight', which you can think of as a non-binary state 
of failure.  0 == failed (and everything remapped elsewhere), 1 == normal, 
and anything in between meaning that some fraction of the content is 
remapped elsewhere.

sage
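As an illustrative contrast between the two knobs (values are only examples):

ceph osd crush reweight osd.30 2.0     # long-term: osd.30 should hold ~2x the data of a weight-1.0 osd
ceph osd reweight 30 0.5               # temporary override: remap roughly half of osd.30's PGs elsewhere
ceph osd reweight 30 1                 # back to normal
ceph osd out 30                        # same effect as reweight 0: everything remapped away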


Re: Very unbalanced storage

2012-08-31 Thread Xiaopong Tran

On 09/01/2012 11:05 AM, Sage Weil wrote:

On Sat, 1 Sep 2012, Xiaopong Tran wrote:

On 09/01/2012 12:05 AM, Sage Weil wrote:

On Fri, 31 Aug 2012, Xiaopong Tran wrote:

Hi,

Ceph storage on each disk in the cluster is very unbalanced. On each
node, the data seems to go to one or two disks, while other disks
are almost empty.

I can't find anything wrong from the crush map, it's just the
default for now. Attached is the crush map.


This is usually a problem with the pg_num for the pool you are using.  Can
you include the output from 'ceph osd dump | grep ^pool'?  By default,
pools get 8 pgs, which will distribute poorly.

sage



Here is the pool I'm interested in:

pool 9 'yunio2' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8
pgp_num 8 last_change 216 owner 0

So, ok, by default, the pg_num is really small. That's a very dumb
mistake I made. Is there any easy way to change this?


I think me choosing 8 as the default was the dumb thing :)


I looked at the tunables, if I upgrade to v0.48.1 or v0.49,
then would I be able to tune the pg_num value?


Sadly you can't yet adjust pg_num for an active pool.  You can create a
new pool,

ceph osd pool create <poolname> <pg_num>

I would aim for 20 * num_osd, or thereabouts.. see

http://ceph.com/docs/master/ops/manage/grow/placement-groups/

Then you can copy the data from the old pool to the new one with

rados cppool yunio2 yunio3

This won't be particularly fast, but it will work.  You can also do

ceph osd pool rename <oldname> <newname>
ceph osd pool delete <poolname>

I hope this solves your problem!
sage



Ok, this is going to be painful. But do I have to stop using
the current pool completely while I do

rados cppool yunio2 yunio3

? This is not something I can do now :)

But this wiki describes a nice way to increase the number of PGs:

http://ceph.com/wiki/Changing_the_number_of_PGs

Even if I upgrade to v0.48.1, this command can only change the number of PGs
when the pool is empty?

Thanks

Xiaopong


Re: Very unbalanced storage

2012-09-01 Thread Andrew Thompson

On 8/31/2012 11:05 PM, Sage Weil wrote:

Sadly you can't yet adjust pg_num for an active pool.  You can create a
new pool,

ceph osd pool create <poolname> <pg_num>

I would aim for 20 * num_osd, or thereabouts.. see

http://ceph.com/docs/master/ops/manage/grow/placement-groups/

Then you can copy the data from the old pool to the new one with

rados cppool yunio2 yunio3

This won't be particularly fast, but it will work.  You can also do

ceph osd pool rename <oldname> <newname>
ceph osd pool delete <poolname>

I hope this solves your problem!


Looking at old archives, I found this thread which shows that to mount a 
pool as cephfs, it needs to be added to mds:


http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5685

I started a `rados cppool data tempstore` a couple hours ago. When it 
finishes, will I need to remove the current pool from mds somehow (other 
than just deleting the pool)?


Is `ceph mds add_data_pool <pool>` still required? (It's not listed 
in `ceph --help`.)


Thanks.

--
Andrew Thompson
http://aktzero.com/



Re: Very unbalanced storage

2012-09-04 Thread Tommi Virtanen
On Fri, Aug 31, 2012 at 11:58 PM, Andrew Thompson  wrote:
> Looking at old archives, I found this thread which shows that to mount a
> pool as cephfs, it needs to be added to mds:
>
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5685
>
> I started a `rados cppool data tempstore` a couple hours ago. When it
> finishes, will I need to remove the current pool from mds somehow(other than
> just deleting the pool)?
>
> Is `ceph mds add_data_pool ` still required? (It's not listed in
> `ceph --help`.)

If the pool you are trying to grow pg_num for really is a CephFS data
pool, I fear a "rados cppool" is nowhere near enough to perform a
migration. My understanding is that each of the inodes stored in
cephfs/on ceph-mds'es knows what pool the file data resides in; you
shoveling the objects into another pool with "rados cppool" doesn't
change these pointers, so removing the old pool will just break the
filesystem.

Before we go too far down this road: is your problem pool *really*
being used as a cephfs data pool? Based on how it's not named "data"
and you're just now asking about "ceph mds add_data_pool", it seems
that's not likely..


Re: Very unbalanced storage

2012-09-04 Thread Andrew Thompson

On 9/4/2012 11:59 AM, Tommi Virtanen wrote:

On Fri, Aug 31, 2012 at 11:58 PM, Andrew Thompson  wrote:

Looking at old archives, I found this thread which shows that to mount a
pool as cephfs, it needs to be added to mds:

http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5685

I started a `rados cppool data tempstore` a couple hours ago. When it
finishes, will I need to remove the current pool from mds somehow(other than
just deleting the pool)?

Is `ceph mds add_data_pool ` still required? (It's not listed in
`ceph --help`.)

If the pool you are trying to grow pg_num for really is a CephFS data
pool, I fear a "rados cppool" is nowhere near enough to perform a
migration. My understanding is that each of the inodes stored in
cephfs/on ceph-mds'es knows what pool the file data resides in; you
shoveling the objects into another pool with "rados cppool" doesn't
change these pointers, removing the old pool will just break the
filesystem.

Before we go too far down this road: is your problem pool *really*
being use as a cephfs data pool? Based on how it's not named "data"
and you're just now asking about "ceph mds add_data_pool", it seems
that's not likely..


Well, I guess it's time to wipe this cluster and start over.

Yes, it was my `data` pool I was trying to grow. After renaming and 
removing the original data pool, I can `ls` my folders/files, but not 
access them.


I attempted a tar backup beforehand, so unless it flaked out, I should 
be able to recover data.


I was concerned the small number of PGs created by default by mkcephfs 
would be an issue, so I was trying to up it a bit. I'm not going to have 
100+ OSDs or petabytes of data. I just want a relatively safe place to 
store my files that I can easily extend as needed.


So far, I'm 0 and 5... I keep blowing up the filesystem, one way or another.

--
Andrew Thompson
http://aktzero.com/



Re: Very unbalanced storage

2012-09-04 Thread Sage Weil
On Tue, 4 Sep 2012, Andrew Thompson wrote:
> On 9/4/2012 11:59 AM, Tommi Virtanen wrote:
> > On Fri, Aug 31, 2012 at 11:58 PM, Andrew Thompson 
> > wrote:
> > > Looking at old archives, I found this thread which shows that to mount a
> > > pool as cephfs, it needs to be added to mds:
> > > 
> > > http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5685
> > > 
> > > I started a `rados cppool data tempstore` a couple hours ago. When it
> > > finishes, will I need to remove the current pool from mds somehow(other
> > > than
> > > just deleting the pool)?
> > > 
> > > Is `ceph mds add_data_pool ` still required? (It's not listed in
> > > `ceph --help`.)
> > If the pool you are trying to grow pg_num for really is a CephFS data
> > pool, I fear a "rados cppool" is nowhere near enough to perform a
> > migration. My understanding is that each of the inodes stored in
> > cephfs/on ceph-mds'es knows what pool the file data resides in; you
> > shoveling the objects into another pool with "rados cppool" doesn't
> > change these pointers, removing the old pool will just break the
> > filesystem.
> > 
> > Before we go too far down this road: is your problem pool *really*
> > being use as a cephfs data pool? Based on how it's not named "data"
> > and you're just now asking about "ceph mds add_data_pool", it seems
> > that's not likely..
> 
> Well, I guess it's time to wipe this cluster and start over.
> 
> Yes, it was my `data` pool I was trying to grow. After renaming and removing
> the original data pool, I can `ls` my folders/files, but not access them.

Yeah.  Sorry I didn't catch this earlier, but TV is right: the ceph fs 
inodes refer to the data pool by id #, not by name, so the cppool 
trick won't work in the fs case.

> I attempted a tar backup beforehand, so unless it flaked out, I should be able
> to recover data.
> 
> I was concerned the small number of PGs created by default by mkcephfs would
> be an issue, so I was trying to up it a bit. I'm not going to have 100+ OSDs
> or petabytes of data. I just want a relatively safe place to store my files
> that I can easily extend as needed.

mkcephfs picks the pg_num by taking the initial osd count and shifting it 'osd 
pg bits' bits to the left.  Adjusting that (by default it is 6) in 
ceph.conf should give you larger initial pools.
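For example, a hedged ceph.conf sketch (the section placement is an assumption;
the option only matters at mkcephfs time):

[global]
        # with e.g. 10 OSDs at mkcephfs time, 10 << 7 = 1280 PGs per initial pool
        osd pg bits = 7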

> So far, I'm 0 and 5... I keep blowing up the filesystem, one way or another.

Sorry to hear that!  The pg splitting (i.e., online pg_num adjustment) is 
still the next major osd project on the roadmap, but we've been a bit 
sidetracked with performance these past few weeks.

sage


> 
> -- 
> Andrew Thompson
> http://aktzero.com/
> 


Re: Very unbalanced storage

2012-09-04 Thread Tommi Virtanen
On Tue, Sep 4, 2012 at 9:19 AM, Andrew Thompson  wrote:
> Yes, it was my `data` pool I was trying to grow. After renaming and removing
> the original data pool, I can `ls` my folders/files, but not access them.

Yup, you're seeing ceph-mds being able to access the "metadata" pool,
but all the directory entries pointing at file data that resides in a
pool_id that no longer exists.

While this would be recoverable by rewriting all the directory entries
etc, the simple answer is that your file data is not easily accessible
anymore. If this is just a test filesystem, and you have a recent
backup anyway, you might just go forward with restoring that. If there
is any doubt about that, you can leave the existing content around
until you're sure you can restore the backup successfully, and you
don't really need to re-create the cluster either. If this sounds
necessary, let me know and I'll walk you through the process; but the
simple answer really is recreating the cluster from scratch, so if
this is just test data, go with that.