Re: [ceph-users] PGs degraded with 3 MONs and 1 OSD node

2015-01-19 Thread Jiri Kanicky

Hi.

I am just curious. This is just lab environment and we are short on 
hardware :). We will have more hardware later, but right now this is all 
I have. Monitors are VMs.


Anyway, we will have to survive with this somehow :).

Thanks
Jiri

On 20/01/2015 15:33, Lindsay Mathieson wrote:



On 20 January 2015 at 14:10, Jiri Kanicky wrote:


Hi,

BTW, is there a way to achieve redundancy over multiple OSDs
in one box by changing the CRUSH map?



I asked that same question myself a few weeks back :)

The answer was yes - but it's fiddly, and why would you do that?

It's kinda defeating the purpose of ceph, which is large amounts of data 
stored redundantly over multiple nodes.
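For reference, the change itself is small once you accept that caveat. A rough
sketch (assuming the default CRUSH rule; treat it as an illustration, not a
recommendation):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# in crush.txt, change the replicated rule's placement step from
#   step chooseleaf firstn 0 type host
# to
#   step chooseleaf firstn 0 type osd
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new

With that, replicas are spread across OSDs rather than hosts, so a single box
can satisfy size = 2 or 3 - at the cost of no protection against the box
itself failing.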


Perhaps you should re-examine your requirements. If what you want is 
data redundantly stored on hard disks on one node, perhaps you would 
be better served by creating a ZFS raid setup. With just one node it 
would be easier and more flexible - better performance as well.


Alternatively, could you put some OSDs on your monitor nodes? What 
spec are they?




Thank you
Jiri


On 20/01/2015 13:37, Jiri Kanicky wrote:

Hi,

Thanks for the reply. That clarifies it. I thought that the
redundancy can be achieved with multiple OSDs (like multiple
disks in RAID) in case you don't have more nodes. Obviously the
single point of failure would be the box.

My current setting is:
osd_pool_default_size = 2

Thank you
Jiri


On 20/01/2015 13:13, Lindsay Mathieson wrote:

You only have one osd node (ceph4). The default replication
requirements for your pools (size = 3) require OSDs spread
over three nodes, so the data can be replicated on three
different nodes. That will be why your pgs are degraded.

You need to either add more osd nodes or reduce your size
setting down to the number of osd nodes you have.

Setting your size to 1 would be a bad idea; there would be no
redundancy in your data at all. Losing one disk would destroy
all your data.

The command to see your pool size is:

sudo ceph osd pool get <poolname> size

assuming default setup:

ceph osd pool get rbd size
returns: 3
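For completeness, the size change itself is a one-liner per pool (a sketch,
assuming the default rbd pool name):

ceph osd pool set rbd size 1

though, as noted above, size = 1 means no redundancy at all, and with the
default CRUSH rule (one replica per host) anything higher than your host count
will stay undersized. osd_pool_default_size in ceph.conf only affects pools
created afterwards.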

On 20 January 2015 at 10:51, Jiri Kanicky <j...@ganomi.com> wrote:

Hi,

I would just like to clarify whether I should expect degraded PGs
with 11 OSDs in one node. I am not sure if a setup with 3 MON
and 1 OSD (11 disks) nodes allows me to have a healthy cluster.

$ sudo ceph osd pool create test 512
pool 'test' created

$ sudo ceph status
cluster 4e77327a-118d-450d-ab69-455df6458cd4
 health HEALTH_WARN 512 pgs degraded; 512 pgs stuck unclean; 512 pgs undersized
 monmap e1: 3 mons at {ceph1=172.16.41.31:6789/0,ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0},
election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3
 osdmap e190: 11 osds: 11 up, 11 in
  pgmap v342: 512 pgs, 1 pools, 0 bytes data, 0 objects
53724 kB used, 9709 GB / 9720 GB avail
 512 active+undersized+degraded

$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      9.45    root default
-2      9.45            host ceph4
0       0.45                    osd.0   up      1
1       0.9                     osd.1   up      1
2       0.9                     osd.2   up      1
3       0.9                     osd.3   up      1
4       0.9                     osd.4   up      1
5       0.9                     osd.5   up      1
6       0.9                     osd.6   up      1
7       0.9                     osd.7   up      1
8       0.9                     osd.8   up      1
9       0.9                     osd.9   up      1
10      0.9                     osd.10  up      1


Thank you,
Jiri
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 
Lindsay







--
Lindsay


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache data consistency among multiple RGW instances

2015-01-19 Thread Gregory Farnum
You don't need to list them anywhere for this to work. They set up the
necessary communication on their own by making use of watch-notify.
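If you want to see the mechanism at work: each radosgw registers a watch on
the notification objects in its control pool, and you can list the watchers
with the rados tool (a sketch, assuming the default control pool name
.rgw.control with its notify.0..notify.7 objects, and a rados binary that has
the listwatchers command):

rados -p .rgw.control listwatchers notify.0

Every running RGW instance should show up as a watcher; a cache-invalidating
update on one instance triggers a notify that the others pick up.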
On Mon, Jan 19, 2015 at 6:55 PM ZHOU Yuan  wrote:

> Thanks Greg, that's an awesome feature I missed. I found some
> explanation of the watch-notify thing:
> http://www.slideshare.net/Inktank_Ceph/sweil-librados.
>
> Just want to confirm, it looks like I need to list all the RGW
> instances in ceph.conf, and then these RGW instances will
> automatically do the cache invalidation if necessary?
>
>
> Sincerely, Yuan
>
>
> On Mon, Jan 19, 2015 at 10:58 PM, Gregory Farnum  wrote:
> > On Sun, Jan 18, 2015 at 6:40 PM, ZHOU Yuan  wrote:
> >> Hi list,
> >>
> >> I'm trying to understand the RGW cache consistency model. My Ceph
> >> cluster has multiple RGW instances with HAProxy as the load balancer.
> >> HAProxy would choose one RGW instance to serve the request(with
> >> round-robin).
> >> The question is: if the RGW cache is enabled, which is the default
> >> behavior, there seems to be a cache inconsistency issue. e.g.,
> >> object0 was cached in RGW-0 and RGW-1 at the same time. Sometime later
> >> it was updated from RGW-0. In this case if the next read was issued to
> >> RGW-1, the outdated cache would be served out then since RGW-1 wasn't
> >> aware of the updates. Thus the data would be inconsistent. Is this
> >> behavior expected or is there anything I missed?
> >
> > The RGW instances make use of the watch-notify primitive to keep their
> > caches consistent. It shouldn't be a problem.
> > -Greg
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs degraded with 3 MONs and 1 OSD node

2015-01-19 Thread Lindsay Mathieson
On 20 January 2015 at 14:10, Jiri Kanicky  wrote:

>  Hi,
>
> BTW, is there a way to achieve redundancy over multiple OSDs in one
> box by changing the CRUSH map?
>


I asked that same question myself a few weeks back :)

The answer was yes - but it's fiddly, and why would you do that?

It's kinda defeating the purpose of ceph, which is large amounts of data
stored redundantly over multiple nodes.

Perhaps you should re-examine your requirements. If what you want is data
redundantly stored on hard disks on one node, perhaps you would be better
served by creating a ZFS raid setup. With just one node it would be easier
and more flexible - better performance as well.

Alternatively, could you put some OSDs on your monitor nodes? What spec
are they?




>
> Thank you
> Jiri
>
>
> On 20/01/2015 13:37, Jiri Kanicky wrote:
>
> Hi,
>
> Thanks for the reply. That clarifies it. I thought that the redundancy can
> be achieved with multiple OSDs (like multiple disks in RAID) in case you
> don't have more nodes. Obviously the single point of failure would be the
> box.
>
> My current setting is:
> osd_pool_default_size = 2
>
> Thank you
> Jiri
>
>
> On 20/01/2015 13:13, Lindsay Mathieson wrote:
>
> You only have one osd node (ceph4). The default replication
> requirements for your pools (size = 3) require OSDs spread over three
> nodes, so the data can be replicated on three different nodes. That will be
> why your pgs are degraded.
>
>  You need to either add more osd nodes or reduce your size setting down to
> the number of osd nodes you have.
>
>  Setting your size to 1 would be a bad idea; there would be no redundancy
> in your data at all. Losing one disk would destroy all your data.
>
>  The command to see your pool size is:
>
>  sudo ceph osd pool get <poolname> size
>
>  assuming default setup:
>
> ceph osd pool get rbd size
>  returns: 3
>
> On 20 January 2015 at 10:51, Jiri Kanicky  wrote:
>
>> Hi,
>>
>> I would just like to clarify whether I should expect degraded PGs with 11 OSDs
>> in one node. I am not sure if a setup with 3 MON and 1 OSD (11 disks) nodes
>> allows me to have a healthy cluster.
>>
>> $ sudo ceph osd pool create test 512
>> pool 'test' created
>>
>> $ sudo ceph status
>> cluster 4e77327a-118d-450d-ab69-455df6458cd4
>>  health HEALTH_WARN 512 pgs degraded; 512 pgs stuck unclean; 512 pgs
>> undersized
>>  monmap e1: 3 mons at {ceph1=
>> 172.16.41.31:6789/0,ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0},
>> election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3
>>  osdmap e190: 11 osds: 11 up, 11 in
>>   pgmap v342: 512 pgs, 1 pools, 0 bytes data, 0 objects
>> 53724 kB used, 9709 GB / 9720 GB avail
>>  512 active+undersized+degraded
>>
>> $ sudo ceph osd tree
>> # id    weight  type name       up/down reweight
>> -1      9.45    root default
>> -2      9.45            host ceph4
>> 0       0.45                    osd.0   up      1
>> 1       0.9                     osd.1   up      1
>> 2       0.9                     osd.2   up      1
>> 3       0.9                     osd.3   up      1
>> 4       0.9                     osd.4   up      1
>> 5       0.9                     osd.5   up      1
>> 6       0.9                     osd.6   up      1
>> 7       0.9                     osd.7   up      1
>> 8       0.9                     osd.8   up      1
>> 9       0.9                     osd.9   up      1
>> 10      0.9                     osd.10  up      1
>>
>>
>> Thank you,
>> Jiri
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Lindsay
>
>
>
>


-- 
Lindsay
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PGs degraded with 3 MONs and 1 OSD node

2015-01-19 Thread Jiri Kanicky

Hi,

I would just like to clarify whether I should expect degraded PGs with 11 OSDs 
in one node. I am not sure if a setup with 3 MON and 1 OSD (11 disks) 
nodes allows me to have a healthy cluster.


$ sudo ceph osd pool create test 512
pool 'test' created

$ sudo ceph status
cluster 4e77327a-118d-450d-ab69-455df6458cd4
 health HEALTH_WARN 512 pgs degraded; 512 pgs stuck unclean; 512 
pgs undersized
 monmap e1: 3 mons at 
{ceph1=172.16.41.31:6789/0,ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0}, 
election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3

 osdmap e190: 11 osds: 11 up, 11 in
  pgmap v342: 512 pgs, 1 pools, 0 bytes data, 0 objects
53724 kB used, 9709 GB / 9720 GB avail
 512 active+undersized+degraded

$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      9.45    root default
-2      9.45            host ceph4
0       0.45                    osd.0   up      1
1       0.9                     osd.1   up      1
2       0.9                     osd.2   up      1
3       0.9                     osd.3   up      1
4       0.9                     osd.4   up      1
5       0.9                     osd.5   up      1
6       0.9                     osd.6   up      1
7       0.9                     osd.7   up      1
8       0.9                     osd.8   up      1
9       0.9                     osd.9   up      1
10      0.9                     osd.10  up      1


Thank you,
Jiri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexplainable slow request

2015-01-19 Thread Christian Balzer

Hello,

Sorry for the thread necromancy, but this just happened again.
Still the exact same cluster as in the original thread (0.80.7).

Same OSD, same behavior.
Slow requests that never returned, and any new requests to that OSD also
went into that state until the OSD was restarted - of course causing all
clients (VMs) that dealt with that OSD to hang.

Since I was nowhere near a computer at the time, I asked a coworker to do
the restart (as this is now a production cluster), so I have no historic ops.
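For next time, the in-flight and recent slow ops can be captured from the
OSD's admin socket before the restart, e.g. on the node hosting osd.8:

ceph daemon osd.8 dump_ops_in_flight
ceph daemon osd.8 dump_historic_ops

(or the equivalent ceph --admin-daemon /var/run/ceph/ceph-osd.8.asok ... form).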

Alas I _did_ install collectd and graphite, though nothing really stands
out there in the throttle sections.
The cluster wasn't particular busy at the time, definitely no CPU or
memory pressure, nothing wrong with the HW.

What did catch my eye, however, is the gauge-osd_numpg: after the restart of
OSD.8 it dropped from 122 to 121 and, most alarmingly (to me at least), none
of the other OSDs increased their numpg count...
Where did that PG go? The cluster is healthy and shows all 1024 PGs, and a
scrub later that night showed no problems either.

If anybody from the Ceph team would like access to my graphite server
while the data is not expired, give me a holler.

Given that it was the same OSD as last time and I'm not seeing anything in
the 0.80.8 changelog that looks like this bug, would a "Windows solution"
maybe be the way forward, as in deleting and re-creating that OSD?

Christian

On Tue, 9 Dec 2014 13:51:39 +0900 Christian Balzer wrote:

> On Mon, 8 Dec 2014 20:36:17 -0800 Gregory Farnum wrote:
> 
> > They never fixed themselves? 
> As I wrote, it took a restart of OSD 8 to resolve this on the next day.
> 
> > Did the reported times ever increase?
> Indeed, the last before the reboot was:
> ---
> 2014-12-07 13:12:42.933396 7fceac82f700  0 log [WRN] : 14 slow requests,
> 5 included below; oldest blocked for > 64336.578995 secs ---
> 
> All IOPS hitting that osd.8 (eventually the other VM did as well during a
> log write I suppose) were blocked.
> 
> > If not I think that's just a reporting bug which is fixed in an
> > unreleased branch, but I'd have to check the tracker to be sure.
> > 
> > On Mon, Dec 8, 2014 at 8:23 PM, Christian Balzer  wrote:
> > >
> > > Hello,
> > >
> > > On Mon, 8 Dec 2014 19:51:00 -0800 Gregory Farnum wrote:
> > >
> > >> On Mon, Dec 8, 2014 at 6:39 PM, Christian Balzer 
> > >> wrote:
> > >> >
> > >> > Hello,
> > >> >
> > >> > Debian Jessie cluster, thus kernel 3.16, ceph 0.80.7.
> > >> > 3 storage nodes with 8 OSDs (journals on 4 SSDs) each, 3 mons.
> > >> > 2 compute nodes, everything connected via Infiniband.
> > >> >
> > >> > This is pre-production, currently there are only 3 VMs and 2 of
> > >> > them were idle at the time. The non-idle one was having 600GB of
> > >> > maildirs copied onto it, which stresses things but not Ceph as
> > >> > those millions of small files coalesce nicely and result in
> > >> > rather few Ceph ops.
> > >> >
> > >> > A couple of hours into that copy marathon (the source FS and
> > >> > machine are slow and rsync isn't particularly speedy with this kind
> > >> > of operation either) this happened:
> > >> > ---
> > >> > 2014-12-06 19:20:57.023974 osd.23 10.0.8.23:6815/3552 77 : [WRN]
> > >> > slow request 30.673939 seconds old, received at 2014-12-06
> > >> > 19:20:26.346746: osd_op(client.33776.0:743596
> > >> > rb.0.819b.238e1f29.0003f52f [set-alloc-hint object_size
> > >> > 4194304 write_size 4194304,write 1748992~4096] 3.efa97e35
> > >> > ack+ondisk+write e380) v4 currently waiting for subops from 4,8
> > >> > 2014-12-06 19:20:57.023991 osd.23 10.0.8.23:6815/3552 78 : [WRN]
> > >> > slow request 30.673886 seconds old, received at 2014-12-06
> > >> > 19:20:26.346799: osd_op(client.33776.0:743597
> > >> > rb.0.819b.238e1f29.0003f52f [set-alloc-hint object_size
> > >> > 4194304 write_size 4194304,write 1945600~4096] 3.efa97e35
> > >> > ack+ondisk+write e380) v4 currently waiting for subops from 4,8
> > >> > 2014-12-06 19:20:57.323976 osd.1 10.0.8.21:6815/4868 123 : [WRN]
> > >> > slow request 30.910821 seconds old, received at 2014-12-06
> > >> > 19:20:26.413051: osd_op(client.33776.0:743604
> > >> > rb.0.819b.238e1f29.0003e628 [set-alloc-hint object_size
> > >> > 4194304 write_size 4194304,write 1794048~1835008] 3.5e76b8ba
> > >> > ack+ondisk+write e380) v4 currently waiting for subops from 8,17
> > >> > ---
> > >> >
> > >> > There were a few more later, but they all involved OSD 8 as common
> > >> > factor.
> > >> >
> > >> > Alas there's nothing in the osd-8.log indicating why:
> > >> > ---
> > >> > 2014-12-06 19:13:13.933636 7fce85552700  0 -- 10.0.8.22:6835/5389
> > >> > >> 10.0.8.6:0/716350435 pipe(0x7fcec3c25900 sd=23 :6835 s=0
> > >> > >> pgs=0 cs=0
> > >> > l=0 c=0x7fcebfad03c0).accept peer addr is really
> > >> > 10.0.8.6:0/716350435 (socket is 10.0.8.6:50592/0) 2014-12-06
> > >> > 19:20:56.595773 7fceac82f700 0 log [WRN] : 3 slow requests, 3
> > >> > included below; oldest blocked for > 30.241397 secs 2014-12-06
> > >> > 19:2

Re: [ceph-users] Is it possible to compile and use ceph with Raspberry Pi single-board computers?

2015-01-19 Thread Joao Eduardo Luis

On 01/19/2015 02:54 PM, Gregory Farnum wrote:

Joao has done it in the past so it's definitely possible, but I
confess I don't know what if anything he had to hack up to make it
work or what's changed since then. ARMv6 is definitely not something
we worry about when adding dependencies. :/
-Greg


I did in fact compile Ceph on a pi, but it was early last year or 
something.  I did encounter a few issues, one of which was tcmalloc not 
compiling on the pi [1], but after being reported it ended up being 
fixed.  On the Ceph side, the most problematic thing was linking 
ceph-dencoder: it's statically linked and it will eventually be 
OOM-killed.  I solved this by removing its target from the Makefile.


However, when I went through all of this there was no common/Cycles, as 
this was added in early December 2014.  AFAICT, though, it is used 
solely in an objectstore benchmark.  It should be safe enough to remove it 
from the Makefile and blow away the benchmark's target.
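A quick way to find the spots to strip (a rough sketch, assuming the 0.91
tarball layout, where the automake rules are spread over several *.am files):

grep -rn --include="*.am" "Cycles" src/
grep -rn --include="*.am" "dencoder" src/

Comment out the matching source/target lines, re-run ./autogen.sh and
./configure, and the build should get past both hurdles.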


  -Joao

[1] http://code.google.com/p/gperftools/issues/detail?id=596



On Thu, Jan 15, 2015 at 12:17 AM, Prof. Dr. Christian Baun
 wrote:

Hi all,

I try to compile and use Ceph on a cluster of Raspberry Pi
single-board computers with Raspbian as operating system. I tried it
this way:

wget http://ceph.com/download/ceph-0.91.tar.bz2
tar -xvjf ceph-0.91.tar.bz2
cd ceph-0.91
./autogen.sh
./configure  --without-tcmalloc
make -j2

But as a result, I got this error message:

...
  CC       common/module.lo
  CXX      common/Readahead.lo
  CXX      common/Cycles.lo
In file included from common/Cycles.cc:38:0:
common/Cycles.h:76:2: error: #error No high-precision counter
available for your OS/arch
common/Cycles.h: In static member function 'static uint64_t Cycles::rdtsc()':
common/Cycles.h:78:3: warning: no return statement in function
returning non-void [-Wreturn-type]
Makefile:13166: recipe for target 'common/Cycles.lo' failed
make[3]: *** [common/Cycles.lo] Error 1
make[3]: *** Waiting for unfinished jobs
make[3]: Leaving directory '/usr/src/ceph-0.91/src'
Makefile:17129: recipe for target 'all-recursive' failed
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory '/usr/src/ceph-0.91/src'
Makefile:6645: recipe for target 'all' failed
make[1]: *** [all] Error 2
make[1]: Leaving directory '/usr/src/ceph-0.91/src'
Makefile:405: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1

Is it possible at all to build and use Ceph on the ARMv6 architecture?

Thanks for any help.

Best Regards
Christian Baun
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Create file bigger than osd

2015-01-19 Thread Fabian Zimmermann
Hi,

On 19.01.15 at 13:08, Luis Periquito wrote:
> What is the current issue? Cluster near-full? Cluster too-full? Can you
> send the output of ceph -s?

cluster 0d75b6f9-83fb-4287-aa01-59962bbff4ad
 health HEALTH_ERR 1 full osd(s); 1 near full osd(s)
 monmap e1: 3 mons at 
{ceph0=10.0.29.0:6789/0,ceph1=10.0.29.1:6789/0,ceph2=10.0.29.2:6789/0}, 
election epoch 92, quorum 0,1,2 ceph0,ceph1,ceph2
 mdsmap e16: 1/1/1 up {0=2=up:active}, 1 up:standby
 osdmap e415: 24 osds: 24 up, 24 in
flags full
  pgmap v396664: 704 pgs, 4 pools, 3372 GB data, 866 kobjects
6750 GB used, 3270 GB / 10020 GB avail
 704 active+clean

2015-01-19 08:19:23.429198 mon.0 [INF] pgmap v396664: 704 pgs: 704 
active+clean; 3372 GB data, 6750 GB used, 3270 GB / 10020 GB avail; 39 B/s rd, 
0 op/s

> If this is the case you can look at the output of ceph df detail to figure
> out which pool is using the disk space. How many PGs these pools have? can
> you send the output of ceph df detail and ceph osd dump | grep pool?
> Is there anything else on these nodes taking up disk space? Like the
> journals...
No, I placed the journals on ssd, so they shouldn't use space in the
datadir.

I already got the cluster "back to normal". I just

* shutdown one osd (id=23)
* removed a pg-dir
* started osd (id=23)
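(For reference, a less invasive sequence would probably have been to bump the
full threshold just long enough to delete data through the normal interfaces -
a sketch for firefly, with the ratios here only as assumptions:

ceph pg set_full_ratio 0.97
# delete the offending rbd image / cephfs file and wait for space to be freed
ceph pg set_full_ratio 0.95

and then let the cluster settle back below the default 0.95 full ratio.)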

After I shut down the osd, the following logs appeared in ceph -w:
-- 

2015-01-19 10:13:00.391222 mon.0 [INF] osd.23 out (down for 301.648942)
2015-01-19 10:13:00.406649 mon.0 [INF] osdmap e418: 24 osds: 23 up, 23 in full
2015-01-19 10:13:00.414374 mon.0 [INF] pgmap v396684: 704 pgs: 74 
active+undersized+degraded, 630 active+clean; 3372 GB data, 6353 GB used, 3249 
GB / 9603 GB avail; 104813/1774446 objects degraded (5.907%)
2015-01-19 10:13:01.422289 mon.0 [INF] osdmap e419: 24 osds: 23 up, 23 in full
2015-01-19 10:13:01.428216 mon.0 [INF] pgmap v396685: 704 pgs: 74 
active+undersized+degraded, 630 active+clean; 3372 GB data, 6353 GB used, 3249 
GB / 9603 GB avail; 104813/1774446 objects degraded (5.907%)
2015-01-19 10:13:02.413598 mon.0 [INF] osdmap e420: 24 osds: 23 up, 23 in full
2015-01-19 10:13:02.443216 mon.0 [INF] pgmap v396686: 704 pgs: 74 
active+undersized+degraded, 630 active+clean; 3372 GB data, 6353 GB used, 3249 
GB / 9603 GB avail; 104813/1774446 objects degraded (5.907%)
2015-01-19 10:13:03.455175 mon.0 [INF] pgmap v396687: 704 pgs: 74 
active+undersized+degraded, 630 active+clean; 3372 GB data, 6353 GB used, 3249 
GB / 9603 GB avail; 104813/1774446 objects degraded (5.907%)
2015-01-19 10:13:04.483793 mon.0 [INF] pgmap v396688: 704 pgs: 74 
active+undersized+degraded, 630 active+clean; 3372 GB data, 6353 GB used, 3249 
GB / 9603 GB avail; 104813/1774446 objects degraded (5.907%)
2015-01-19 10:13:05.431367 mon.0 [INF] osdmap e421: 24 osds: 23 up, 23 in
2015-01-19 10:13:05.451241 mon.0 [INF] pgmap v396689: 704 pgs: 74 
active+undersized+degraded, 630 active+clean; 3372 GB data, 6353 GB used, 3249 
GB / 9603 GB avail; 104813/1774446 objects degraded (5.907%)
2015-01-19 10:13:06.505841 mon.0 [INF] pgmap v396690: 704 pgs: 74 
active+undersized+degraded, 630 active+clean; 3372 GB data, 6353 GB used, 3249 
GB / 9603 GB avail; 0 B/s rd, 101 kB/s wr, 63 op/s; 104813/1774446 objects 
degraded (5.907%)
--

And here are the logs after I started the osd again:
--

2015-01-19 10:13:00.391222 mon.0 [INF] osd.23 out (down for 301.648942)
2015-01-19 10:13:00.406649 mon.0 [INF] osdmap e418: 24 osds: 23 up, 23 in full
2015-01-19 10:13:00.414374 mon.0 [INF] pgmap v396684: 704 pgs: 74 
active+undersized+degraded, 630 active+clean; 3372 GB data, 6353 GB used, 3249 
GB / 9603 GB avail; 104813/1774446 objects degraded (5.907%)
2015-01-19 10:13:01.422289 mon.0 [INF] osdmap e419: 24 osds: 23 up, 23 in full
2015-01-19 10:13:01.428216 mon.0 [INF] pgmap v396685: 704 pgs: 74 
active+undersized+degraded, 630 active+clean; 3372 GB data, 6353 GB used, 3249 
GB / 9603 GB avail; 104813/1774446 objects degraded (5.907%)
2015-01-19 10:13:02.413598 mon.0 [INF] osdmap e420: 24 osds: 23 up, 23 in full
2015-01-19 10:13:02.443216 mon.0 [INF] pgmap v396686: 704 pgs: 74 
active+undersized+degraded, 630 active+clean; 3372 GB data, 6353 GB used, 3249 
GB / 9603 GB avail; 104813/1774446 objects degraded (5.907%)
2015-01-19 10:13:03.455175 mon.0 [INF] pgmap v396687: 704 pgs: 74 
active+undersized+degraded, 630 active+clean; 3372 GB data, 6353 GB used, 3249 
GB / 9603 GB avail; 104813/1774446 objects degraded (5.907%)
2015-01-19 10:13:04.483793 mon.0 [INF] pgmap v396688: 704 pgs: 74 
active+undersized+degraded, 630 active+clean; 3372 GB data, 6353 GB used, 3249 
GB / 9603 GB avail; 104813/1774446 objects degraded (5.907%)
2015-01-19 10:13:05.431367 mon.0 [INF] osdmap e421: 24 osds: 23 up, 23 in
2015-01-19 10:13:05.451241 mon.0 [INF] pgmap v396689: 704 pgs: 74 
active+undersized+degraded, 630 active+clean; 3372 GB data, 6353 GB used, 3249 
GB / 9603 GB avail; 104813/1774446 objects degraded (5.907%)
2015-01-19 10:13:0

Re: [ceph-users] Create file bigger than osd

2015-01-19 Thread Luis Periquito
AFAIK there is no such limitation.

When you create a file, that file is split into several objects (4MB IIRC
each by default), and those objects will get mapped to a PG -
http://ceph.com/docs/master/rados/operations/placement-groups/
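You can see the split directly on any image (a sketch, assuming the default
rbd pool and the default 4 MB object size):

rbd create test --size 10240        # 10 GB image
rbd info test
# ... size 10240 MB in 2560 objects
# ... order 22 (4096 kB objects)

So a 500G volume is just ~128000 small objects scattered over all the PGs
(and therefore all the OSDs) of the pool, not one big blob on a single OSD.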

On Mon, Jan 19, 2015 at 11:15 AM, Fabian Zimmermann 
wrote:

> Hi,
>
> if I understand the pg-system correctly it's impossible to create a
> file/volume which is bigger than the smallest osd of a pg, isn't it?
>
> What could I do to get rid of this limitation?
>
>
> Thanks,
>
> Fabian
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Create file bigger than osd

2015-01-19 Thread Fabian Zimmermann
Hi,

On 19.01.15 at 12:47, Luis Periquito wrote:
> Each object will get mapped to a different PG. The size of an OSD will
> affect its weight and the number of PGs assigned to it, so a smaller OSD
> will get fewer PGs.
Great! Good to know, thanks a lot!
> And BTW, with a replica of 3, a 2TB file will need 6TB of storage - each object
> is replicated 3 times, so taking up triple the space.
>
of course, small typo.

I'm just trying to debug a situation which filled my cluster/osds tonight.

We are currently running a small testcluster:

3 mon's
2 mds (active + standby)
2 nodes = 2x12x410G HDD/OSDs

A user created a 500G rbd-volume. First I thought the 500G rbd may have
caused the osd to fill, but after reading your explanations this seems
impossible.
I just found another 500G file created by this user in cephfs; might this
have caused the trouble?


Thanks a lot for your fast support!

Fabian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache data consistency among multiple RGW instances

2015-01-19 Thread Gregory Farnum
On Sun, Jan 18, 2015 at 6:40 PM, ZHOU Yuan  wrote:
> Hi list,
>
> I'm trying to understand the RGW cache consistency model. My Ceph
> cluster has multiple RGW instances with HAProxy as the load balancer.
> HAProxy would choose one RGW instance to serve the request(with
> round-robin).
> The question is: if the RGW cache is enabled, which is the default
> behavior, there seems to be a cache inconsistency issue. e.g.,
> object0 was cached in RGW-0 and RGW-1 at the same time. Sometime later
> it was updated from RGW-0. In this case if the next read was issued to
> RGW-1, the outdated cache would be served out then since RGW-1 wasn't
> aware of the updates. Thus the data would be inconsistent. Is this
> behavior expected or is there anything I missed?

The RGW instances make use of the watch-notify primitive to keep their
caches consistent. It shouldn't be a problem.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] subscribe

2015-01-19 Thread Brian Rak


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Create file bigger than osd

2015-01-19 Thread Luis Periquito
>
> I'm just trying to debug a situation which filled my cluster/osds tonight.
>
> We are currently running a small testcluster:
>
> 3 mon's
> 2 mds (active + standby)
> 2 nodes = 2x12x410G HDD/OSDs
>
> A user created a 500G rbd-volume. First I thought the 500G rbd may have
> caused the osd to fill, but after reading your explanations this seems
> impossible.
> I just found another 500G file created by this user in cephfs; might this
> have caused the trouble?

What is the current issue? Cluster near-full? Cluster too-full? Can you
send the output of ceph -s?

If this is the case you can look at the output of ceph df detail to figure
out which pool is using the disk space. How many PGs do these pools have? Can
you send the output of ceph df detail and ceph osd dump | grep pool?
Is there anything else on these nodes taking up disk space? Like the
journals...

With that setup (and 3x replication) you should be able to store around
1-1.2T without any warnings, but that will depend on PG distribution which
is hard to predict...


> Thanks a lot for your fast support!
>
> Fabian
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD backup and snapshot

2015-01-19 Thread Luis Periquito
Hi,

I'm currently creating a business case around ceph RBD, and one of the
issues revolves around backup.

After having a look at
http://ceph.com/dev-notes/incremental-snapshots-with-rbd/ I was thinking of
creating hourly snapshots (corporate policy) on the original cluster
(replicated pool), and then copying these snapshots to a replica cluster
(EC pool) located offsite.

After some time we would delete the original snapshot (weeks to months),
but we would need to maintain the replica for a lot more time (years).

Will this solution work properly? Can we keep a huge number of snapshots of
an RBD? Potentially tens of thousands per RBD, with hundreds of
thousands possible? Has anyone been through these scenarios?
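The mechanics from that post would look roughly like this per image (a sketch,
assuming a pool named rbd, an image named vm1, and the second cluster
reachable through /etc/ceph/backup.conf; all names are made up):

# on the primary, once per hour
rbd snap create rbd/vm1@2015-01-19-1500
rbd export-diff --from-snap 2015-01-19-1400 rbd/vm1@2015-01-19-1500 - \
  | rbd -c /etc/ceph/backup.conf import-diff - rbd/vm1

The first run would be a plain export-diff (no --from-snap) into a pre-created
image on the replica side.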
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw-agent failed to parse

2015-01-19 Thread ghislain.chevalier
HI all,

Context: Ubuntu 14.04 LTS, firefly 0.80.7

I recently encountered the same issue as described below.
Maybe I missed something between July and January…

I found that the http request was malformed by 
/usr/lib/python2.7/dist-packages/radosgw_agent/client.py

I made the changes below:

#   url = '{protocol}://{host}{path}'.format(protocol=request.protocol,
#                                            host=request.host,
#                                            path=request.path)
    url = '{path}'.format(protocol="", host="", path=request.path)

The request is then correctly formed and sent.

Best regards


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of Peter
Sent: Wednesday, 23 July 2014 13:38
To: Craig Lewis; Ceph Users
Subject: Re: [ceph-users] radosgw-agent failed to parse

Hello again,

I have reviewed my deployment, and the --name argument is used everywhere with 
the radosgw-admin command. 

As I create the pools during deploy, there is no rgw.root pool, so I cannot 
make changes to it.

I still think this is an issue with radosgw-agent, as if I change the 
destination on the command line, it also changes in the botched-up URL it tries 
to hit. With "radosgw-agent http://example2.net", this appears in the error:

DEBUG:boto:url = 'http://example2.nethttp://example2.net/admin/config'

So it is definitely coming from input from the command line. 
On 22/07/14 20:44, Craig Lewis wrote:
You should use the --name argument with every radosgw-admin command.  If you 
don't, you'll end up making changes to .rgw.root, not .us.rgw.root. 

I'd run through the federation setup again, making sure to include the 
appropriate --name.  As Kurt said, it's safe to reload and reapply the configs. 
 Make sure you restart radosgw when it says.  

One of my problems during setup was that I had a bad config loaded in 
.rgw.root, but the correct one in .us.rgw.root.  It caused all sorts of 
problems when I forgot the --name arg.

Setting up federation is somewhat sensitive to order of operations.  When I was 
testing it, I frequently messed something up.  Several times it was faster to 
delete all the pools and start over, rather than figuring out what I broke. 


On Tue, Jul 22, 2014 at 7:46 AM, Peter  wrote:
Adding --name to the regionmap update command has allowed me to update the 
regionmap:


radosgw-admin regionmap update --name client.radosgw.us-master-1

So now I have reloaded the zone and region and updated the region map on the gateway in 
each zone, then restarted whole clusters, then restarted apache and radosgw; 
same problem. 

I cannot see how this can be anything other than an issue inside radosgw-agent, 
as it is not hitting the gateway, due to the botched URL:


DEBUG:boto:url = 'https://example.comhttps://example.com/admin/config'

I'm out of ideas. Should I submit this as a bug? 


On 22/07/14 15:25, Bachelder, Kurt wrote:
It certainly doesn’t hurt to reload your zone and region configurations on your 
RGWs and re-run the regionmap update for the instances tied to each zone, just 
to ensure consistency. 
 
From: Peter [mailto:ptier...@tchpc.tcd.ie] 
Sent: Tuesday, July 22, 2014 10:20 AM
To: Bachelder, Kurt; Craig Lewis
Cc: Ceph Users
Subject: Re: [ceph-users] radosgw-agent failed to parse
 
Thanks for the suggestion. I've attempted a regionmap update but I'm hitting this 
error:

failed to list regions: (2) No such file or directory
2014-07-22 14:13:04.096601 7ff825ac77c0 -1 failed to list objects 
pool_iterate_begin() returned r=-2

So perhaps I do have some issue with my configuration, although I would have 
thought that if the gateway is outputting the correct regionmap at the 
/admin/config path, then all should be well with the regionmap.


On 22/07/14 14:13, Bachelder, Kurt wrote:
I’m sure you’ve already tried this, but we’ve gotten burned a few times by not 
running radosgw-admin regionmap update after making region/zone changes.  
Bouncing the RGW’s probably wouldn’t hurt either.  
 
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Peter
Sent: Tuesday, July 22, 2014 4:51 AM
To: Craig Lewis
Cc: Ceph Users
Subject: Re: [ceph-users] radosgw-agent failed to parse
 
Yes, I'm scratching my head over this too. It doesn't seem to be an 
authentication issue, as the radosgw-agent never reaches the us-secondary 
gateway (I've kept an eye on us-secondary logs as I execute radosgw-agent on 
us-master). 

On 22/07/14 03:51, Craig Lewis wrote:
I was hoping for some easy fixes :-P 
 
I created two system users, in both zones.  Each user has different access and 
secret, but I copied the access and secret from the primary to the secondary.  
I can't imagine that this would cause the problem you're seeing, but it is 
something different from the examples.
 
Sorry, I'm out of ideas.
 
 
On Mon, Jul 21, 2014 at 7:13 AM, Peter  wrote:
Hello again,

I couldn't find 
'http://us-secondary.example.comhttp://us-secondary.example.com/' in any zone 
or region config files. How could it be getting the UR

Re: [ceph-users] Create file bigger than osd

2015-01-19 Thread Luis Periquito
On Mon, Jan 19, 2015 at 11:38 AM, Fabian Zimmermann 
wrote:

> Hi,
>
> On 19.01.15 at 12:24, Luis Periquito wrote:
> > AFAIK there is no such limitation.
> >
> > When you create a file, that file is split into several objects (4MB IIRC
> > each by default), and those objects will get mapped to a PG -
> > http://ceph.com/docs/master/rados/operations/placement-groups/
> Right, and this pg is mapped to X osds (X = level of replication).
>
> Let's assume X=3.
>
> So, even if I split 2T into 4MB chunks, I have to store these 3T at all
> osds in this pg, isn't it?
>

Each object will get mapped to a different PG. The size of an OSD will
affect its weight and the number of PGs assigned to it, so a smaller OSD
will get fewer PGs.

And BTW, with a replica of 3, a 2TB file will need 6TB of storage - each object
is replicated 3 times, so taking up triple the space.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Create file bigger than osd

2015-01-19 Thread Fabian Zimmermann
Hi,

if I understand the pg-system correctly it's impossible to create a
file/volume which is bigger than the smallest osd of a pg, isn't it?

What could I do to get rid of this limitation?


Thanks,

Fabian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com