Re: [ceph-users] ceph / ovirt / multipath

2014-02-03 Thread Loic Dachary
[cc'ing ceph-users, fishing for experienced users ;-]

On 03/02/2014 11:55, Federico Simoncelli wrote:
> Hi, do you have any news about the /dev/mapper device for ceph?
> Is it there? What's the output of:
> 
> # multipath -ll

root@bm0014:~# rbd --pool ovh create --size 1 foobar
root@bm0014:~# rbd --pool ovh map foobar
root@bm0014:~# ls -l /dev/rbd1
brw-rw 1 root disk 251, 0 Feb  3 12:03 /dev/rbd1
root@bm0014:~# ls /dev/mapper
control
root@bm0014:~# multipath -ll
root@bm0014:~#
root@bm0014:~# uname -a
Linux bm0014.the.re 3.11.0-13-generic #20~precise2-Ubuntu SMP Thu Oct 24 
21:04:34 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
root@bm0014:~# modinfo rbd
filename:   /lib/modules/3.11.0-13-generic/kernel/drivers/block/rbd.ko
license:GPL
author: Jeff Garzik 
description:rados block device
author: Yehuda Sadeh 
author: Sage Weil 
author: Alex Elder 
srcversion: 53327A48E48E2776F1B7EA1
depends:libceph
intree: Y
vermagic:   3.11.0-13-generic SMP mod_unload modversions

That does not look good, does it ? :-)
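For reference, a minimal sketch of follow-up checks on the same host
(assuming the test image is still mapped; the multipath verbosity flag is
an assumption and may differ per version):

  rbd showmapped                            # the kernel client's view of mapped images
  ls -l /dev/rbd* /dev/rbd/*/* 2>/dev/null  # udev symlinks, if the rbd udev rules are installed
  dmsetup ls                                # device-mapper tables ("No devices found" means none)
  multipath -v3 -ll 2>&1 | tail -20         # verbose probe, shows why paths were skipped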

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] During copy new rbd image is totally thick

2014-02-03 Thread Sebastien Han
I have the same behaviour here.
I believe this is expected since you’re calling “copy”; “clone” is what does
the copy-on-write.
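For comparison, a minimal sketch of the copy-on-write path (image names are
made up; clones need a format 2 parent):

  rbd create rbd/base --size 1024 --image-format 2
  rbd snap create rbd/base@snap1
  rbd snap protect rbd/base@snap1
  rbd clone rbd/base@snap1 rbd/child        # COW child; no data is copied up front
  rbd diff rbd/child | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'   # stays ~0 MB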

 
Sébastien Han 
Cloud Engineer 

"Always give 100%. Unless you're giving blood.” 

Phone: +33 (0)1 49 70 99 72 
Mail: sebastien@enovance.com 
Address : 10, rue de la Victoire - 75009 Paris 
Web : www.enovance.com - Twitter : @enovance 

On 03 Feb 2014, at 08:43, Igor Laskovy  wrote:

> Anybody? ;)
> 
> 
> On Thu, Jan 30, 2014 at 9:10 PM, Igor Laskovy  wrote:
> Hello list,
> 
> Is it correct behavior that the new rbd image becomes fully thick during copy?
> 
> igor@hv03:~$ rbd create rbd/test -s 1024
> igor@hv03:~$ rbd diff rbd/test | awk '{ SUM += $2 } END { print SUM/1024/1024 
> " MB" }'
> 0 MB
> igor@hv03:~$ rbd copy rbd/test rbd/cloneoftest
> Image copy: 100% complete...done.
> igor@hv03:~$ rbd diff rbd/cloneoftest | awk '{ SUM += $2 } END { print 
> SUM/1024/1024 " MB" }'
> 1024 MB
> 
> -- 
> Igor Laskovy
> facebook.com/igor.laskovy
> studiogrizzly.com
> 
> 
> 
> -- 
> Igor Laskovy
> facebook.com/igor.laskovy
> studiogrizzly.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] keyring generation

2014-02-03 Thread Alfredo Deza
On Sun, Feb 2, 2014 at 12:18 AM, Kei.masumoto  wrote:
> Hi,
>
> I am a newbie to Ceph; I am now trying to deploy following
> "http://ceph.com/docs/master/start/quick-ceph-deploy/".
> ceph1, ceph2 and ceph3 exist according to the above tutorial. I got
> WARNING messages when I executed "ceph-deploy mon create-initial".
>
> 2014-02-01 14:06:37,385 [ceph_deploy.gatherkeys][WARNING] Unable to find
> /etc/ceph/ceph.client.admin.keyring on ['ceph1']
> 2014-02-01 14:06:37,516 [ceph_deploy.gatherkeys][WARNING] Unable to find
> /var/lib/ceph/bootstrap-osd/ceph.keyring on ['ceph1']
> 2014-02-01 14:06:37,639 [ceph_deploy.gatherkeys][WARNING] Unable to find
> /var/lib/ceph/bootstrap-mds/ceph.keyring on ['ceph1']
>
> Thinking about when those 3 keyrings should be created, I think
> "ceph-deploy mon create" is the right time for keyring creation. I
> checked my environment and found
> /etc/ceph/ceph.client.admin.keyring.14081.tmp. It looks like this file
> is created by ceph-create-keys when executing "stop ceph-all && start
> ceph-all", but ceph-create-keys never finishes.

ceph-deploy tries to help a lot here with create-initial, and although
the warnings are useful, they are only meaningful in the context of the
rest of the output.

When the whole process completes, does ceph-deploy say all mons are up
and running?

It would be better to paste the complete output of the call so we can
see the details.
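In the meantime, a quick sketch of things worth checking on ceph1 (paths
taken from the quick-start docs; adjust if your layout differs):

  ps aux | grep [c]eph-create-keys          # is ceph-create-keys still running or hung?
  ls -l /etc/ceph/ /var/lib/ceph/bootstrap-osd/ /var/lib/ceph/bootstrap-mds/
  ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph1.asok mon_status
  grep -E 'mon_host|mon_initial_members|public' /etc/ceph/ceph.conf
  # the address in ceph.conf should match the mon address reported above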
>
> When I execute ceph-create-keys manually, it keeps generating the log
> below; it looks like it is waiting for a reply...
>
> 2014-02-01 20:13:02.847737 7f55e81a4700  0 -- :/1001774 >>
> 192.168.11.8:6789/0 pipe(0x7f55e4024400 sd=3 :0 s=1 pgs=0 cs=0 l=1
> c=0x7f55e4024660).fault
>
> Since I found that the mon listens on 6789, I straced the mon; the mon
> is also waiting for something...
>
> root@ceph1:~/src/ceph-0.56.7# strace -p 1047
> Process 1047 attached - interrupt to quit
> futex(0x7f37c14839d0, FUTEX_WAIT, 1102, NULL
>
> I have no idea what situation should be, any hints?
>
> P.S. Somebody advised me to check the output below, but I don't see
> anything wrong here.
> root@ceph1:~/my-cluster# ceph daemon mon.`hostname` mon_status
> { "name": "ceph1",
>   "rank": 0,
>   "state": "leader",
>   "election_epoch": 1,
>   "quorum": [
> 0],
>   "outside_quorum": [],
>   "extra_probe_peers": [],
>   "sync_provider": [],
>   "monmap": { "epoch": 1,
>   "fsid": "26835656-6b29-455d-9d1f-545cad8f1e23",
>   "modified": "0.00",
>   "created": "0.00",
>   "mons": [
> { "rank": 0,
>   "name": "ceph1",
>   "addr": "192.168.111.11:6789\/0"}]}}
>
>
> Kei
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RADOS + deep scrubbing performance issues in production environment

2014-02-03 Thread Guang
+ceph-users.

Does anybody have similar experience with scrubbing / deep-scrubbing?

Thanks,
Guang

On Jan 29, 2014, at 10:35 AM, Guang  wrote:

> Glad to see there is some discussion around scrubbing / deep-scrubbing.
> 
> We are experiencing the same: scrubbing can affect latency quite a bit,
> and so far I have found two slow patterns (dump_historic_ops): 1) waiting
> to be dispatched, 2) waiting in the op working queue to be fetched by an
> available op thread. For the first pattern, it looks like there is a lock
> involved (the dispatcher stops working for 2 seconds and then resumes,
> same for the scrubber thread); that needs further investigation. For the
> second pattern, scrubbing generates extra ops (for the scrub checks),
> which increases the op threads' workload (client ops have a lower
> priority). I think that could be improved by increasing the op thread
> count; I will confirm this analysis by adding more op threads and turning
> on scrubbing on a per-OSD basis.
> 
> Does the above observation and analysis make sense?
> 
> Thanks,
> Guang
> 
> On Jan 29, 2014, at 2:13 AM, Filippos Giannakos  wrote:
> 
>> On Mon, Jan 27, 2014 at 10:45:48AM -0800, Sage Weil wrote:
>>> There is also 
>>> 
>>> ceph osd set noscrub
>>> 
>>> and then later
>>> 
>>> ceph osd unset noscrub
>>> 
>>> I forget whether this pauses an in-progress PG scrub or just makes it stop 
>>> when it gets to the next PG boundary.
>>> 
>>> sage
>> 
>> I bumped into those settings but I couldn't find any documentation about
>> them. When I first tried them, they didn't do anything immediately, so I
>> thought they weren't the answer. After your mention, I tried them again,
>> and after a while the deep-scrubbing stopped. So I'm guessing they stop
>> scrubbing on the next PG boundary.
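For reference, a minimal sketch of setting and checking the flags from the
CLI (as far as I can tell they only prevent new (deep-)scrubs from being
scheduled):

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  ceph osd dump | grep flags        # should now list noscrub,nodeep-scrub
  # ... and later:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub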
>> 
>> I see from this thread and others before it that some people think it is
>> a spindle issue. I'm not sure it is just that. Reproducing it on an idle
>> cluster that can do more than 250MiB/s, and still pausing for 4-5 seconds
>> on a single request, sounds like an issue in itself. Maybe there is too
>> much locking, or not enough priority given to the actual I/O? Plus, the
>> idea of throttling deep scrubbing based on IOPS sounds appealing.
>> 
>> Kind Regards,
>> -- 
>> Filippos
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] active+remapped after reweight-by-utilization

2014-02-03 Thread Dominik Mostowiec
Hi,
After the command:
"ceph osd reweight-by-utilization 105"
the cluster got stuck in a "249 active+remapped" state.

I have 'crush tunables optimal'.
head -n 6 /tmp/crush.txt
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
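For anyone looking at this, a sketch of the standard commands I can pull
output from (<pgid> is a placeholder for one of the stuck PGs):

  ceph -s
  ceph health detail | grep remapped | head
  ceph pg dump_stuck unclean
  ceph pg <pgid> query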

-- 
Regards
Dominik
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] keyring generation

2014-02-03 Thread Kei.masumoto

Hi Alfredo,

Thanks for the reply!  I pasted the logs below.


2014-02-01 14:06:33,350 [ceph_deploy.cli][INFO  ] Invoked (1.3.4): 
/usr/bin/ceph-deploy mon create-initial
2014-02-01 14:06:33,353 [ceph_deploy.mon][DEBUG ] Deploying mon, cluster 
ceph hosts ceph1
2014-02-01 14:06:33,354 [ceph_deploy.mon][DEBUG ] detecting platform for 
host ceph1 ...

2014-02-01 14:06:33,770 [ceph1][DEBUG ] connected to host: ceph1
2014-02-01 14:06:33,775 [ceph1][DEBUG ] detect platform information from 
remote host

2014-02-01 14:06:33,874 [ceph1][DEBUG ] detect machine type
2014-02-01 14:06:33,909 [ceph_deploy.mon][INFO  ] distro info: Ubuntu 
13.04 raring
2014-02-01 14:06:33,910 [ceph1][DEBUG ] determining if provided host has 
same hostname in remote

2014-02-01 14:06:33,911 [ceph1][DEBUG ] get remote short hostname
2014-02-01 14:06:33,914 [ceph1][DEBUG ] deploying mon to ceph1
2014-02-01 14:06:33,915 [ceph1][DEBUG ] get remote short hostname
2014-02-01 14:06:33,917 [ceph1][DEBUG ] remote hostname: ceph1
2014-02-01 14:06:33,919 [ceph1][DEBUG ] write cluster configuration to 
/etc/ceph/{cluster}.conf
2014-02-01 14:06:33,933 [ceph1][DEBUG ] create the mon path if it does 
not exist
2014-02-01 14:06:33,939 [ceph1][DEBUG ] checking for done path: 
/var/lib/ceph/mon/ceph-ceph1/done
2014-02-01 14:06:33,941 [ceph1][DEBUG ] create a done file to avoid 
re-doing the mon deployment
2014-02-01 14:06:33,944 [ceph1][DEBUG ] create the init path if it does 
not exist

2014-02-01 14:06:33,946 [ceph1][DEBUG ] locating the `service` executable...
2014-02-01 14:06:33,949 [ceph1][INFO  ] Running command: sudo initctl 
emit ceph-mon cluster=ceph id=ceph1
2014-02-01 14:06:36,119 [ceph1][INFO  ] Running command: sudo ceph 
--cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph1.asok mon_status
2014-02-01 14:06:36,805 [ceph1][DEBUG ] 


2014-02-01 14:06:36,807 [ceph1][DEBUG ] status for monitor: mon.ceph1
2014-02-01 14:06:36,809 [ceph1][DEBUG ] {
2014-02-01 14:06:36,810 [ceph1][DEBUG ]   "election_epoch": 1,
2014-02-01 14:06:36,812 [ceph1][DEBUG ]   "extra_probe_peers": [],
2014-02-01 14:06:36,813 [ceph1][DEBUG ]   "monmap": {
2014-02-01 14:06:36,814 [ceph1][DEBUG ] "created": "0.00",
2014-02-01 14:06:36,815 [ceph1][DEBUG ] "epoch": 1,
2014-02-01 14:06:36,815 [ceph1][DEBUG ] "fsid": 
"26835656-6b29-455d-9d1f-545cad8f1e23",

2014-02-01 14:06:36,816 [ceph1][DEBUG ] "modified": "0.00",
2014-02-01 14:06:36,816 [ceph1][DEBUG ] "mons": [
2014-02-01 14:06:36,817 [ceph1][DEBUG ]   {
2014-02-01 14:06:36,818 [ceph1][DEBUG ] "addr": 
"192.168.111.11:6789/0",

2014-02-01 14:06:36,818 [ceph1][DEBUG ] "name": "ceph1",
2014-02-01 14:06:36,819 [ceph1][DEBUG ] "rank": 0
2014-02-01 14:06:36,820 [ceph1][DEBUG ]   }
2014-02-01 14:06:36,821 [ceph1][DEBUG ] ]
2014-02-01 14:06:36,822 [ceph1][DEBUG ]   },
2014-02-01 14:06:36,826 [ceph1][DEBUG ]   "name": "ceph1",
2014-02-01 14:06:36,826 [ceph1][DEBUG ]   "outside_quorum": [],
2014-02-01 14:06:36,826 [ceph1][DEBUG ]   "quorum": [
2014-02-01 14:06:36,827 [ceph1][DEBUG ] 0
2014-02-01 14:06:36,827 [ceph1][DEBUG ]   ],
2014-02-01 14:06:36,827 [ceph1][DEBUG ]   "rank": 0,
2014-02-01 14:06:36,827 [ceph1][DEBUG ]   "state": "leader",
2014-02-01 14:06:36,828 [ceph1][DEBUG ]   "sync_provider": []
2014-02-01 14:06:36,828 [ceph1][DEBUG ] }
2014-02-01 14:06:36,828 [ceph1][DEBUG ] 


2014-02-01 14:06:36,829 [ceph1][INFO  ] monitor: mon.ceph1 is running
2014-02-01 14:06:36,830 [ceph1][INFO  ] Running command: sudo ceph 
--cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph1.asok mon_status
2014-02-01 14:06:37,005 [ceph_deploy.mon][INFO  ] processing monitor 
mon.ceph1

2014-02-01 14:06:37,079 [ceph1][DEBUG ] connected to host: ceph1
2014-02-01 14:06:37,081 [ceph1][INFO  ] Running command: sudo ceph 
--cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph1.asok mon_status
2014-02-01 14:06:37,258 [ceph_deploy.mon][INFO  ] mon.ceph1 monitor has 
reached quorum!
2014-02-01 14:06:37,259 [ceph_deploy.mon][INFO  ] all initial monitors 
are running and have formed quorum

2014-02-01 14:06:37,266 [ceph_deploy.mon][INFO  ] Running gatherkeys...
2014-02-01 14:06:37,268 [ceph_deploy.gatherkeys][DEBUG ] Checking ceph1 
for /etc/ceph/ceph.client.admin.keyring

2014-02-01 14:06:37,336 [ceph1][DEBUG ] connected to host: ceph1
2014-02-01 14:06:37,340 [ceph1][DEBUG ] detect platform information from 
remote host

2014-02-01 14:06:37,373 [ceph1][DEBUG ] detect machine type
2014-02-01 14:06:37,383 [ceph1][DEBUG ] fetch remote file
2014-02-01 14:06:37,385 [ceph_deploy.gatherkeys][WARNING] Unable to find 
/etc/ceph/ceph.client.admin.keyring on ['ceph1']
2014-02-01 14:06:37,391 [ceph_deploy.gatherkeys][DEBUG ] Have 
ceph.mon.keyring
2014-02-0

Re: [ceph-users] poor data distribution

2014-02-03 Thread Dominik Mostowiec
In other words,
1. we've got 3 racks (1 replica per rack)
2. in every rack we have 3 hosts
3. every host has 22 OSDs
4. all pg_nums are 2^n for every pool
5. we enabled "crush tunables optimal".
6. on every machine we disabled 4 unused disks (osd out, osd reweight
0 and osd rm)

Pool ".rgw.buckets": one OSD has 105 PGs and another one (on the same
machine) has 144 PGs (37% more!).
Other pools have this problem too. The placement is not efficient.
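A rough sketch of how I'm counting PGs per OSD for one pool; the field that
holds the acting set in 'ceph pg dump' output ($15 below) is an assumption
and shifts between versions, so adjust it for your build:

  ceph pg dump 2>/dev/null \
    | awk '$1 ~ /^3\./ { col=$15; gsub(/[][]/,"",col); n=split(col,osds,",");
                         for (i=1;i<=n;i++) cnt[osds[i]]++ }
           END { for (o in cnt) print cnt[o], "osd." o }' \
    | sort -rn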

--
Regards
Dominik


2014-02-02 Dominik Mostowiec :
> Hi,
> For more info:
>   crush: http://dysk.onet.pl/link/r4wGK
>   osd_dump: http://dysk.onet.pl/link/I3YMZ
>   pg_dump: http://dysk.onet.pl/link/4jkqM
>
> --
> Regards
> Dominik
>
> 2014-02-02 Dominik Mostowiec :
>> Hi,
>> Hmm,
>> I think you mean the sum of PGs from different pools on one OSD.
>> But for one pool (.rgw.buckets), where I have almost all of my data, the PG
>> count on OSDs is also different.
>> For example, 105 vs 144 PGs from pool .rgw.buckets; in the first case it is
>> 52% disk usage, in the second 74%.
>>
>> --
>> Regards
>> Dominik
>>
>>
>> 2014-02-02 Sage Weil :
>>> It occurs to me that this (and other unexplain variance reports) could
>>> easily be the 'hashpspool' flag not being set.  The old behavior had the
>>> misfeature where consecutive pool's pg's would 'line up' on the same osds,
>>> so that 1.7 == 2.6 == 3.5 == 4.4 etc would map to the same nodes.  This
>>> tends to 'amplify' any variance in the placement.  The default is still to
>>> use the old behavior for compatibility (this will finally change in
>>> firefly).
>>>
>>> You can do
>>>
>>>  ceph osd pool set  hashpspool true
>>>
>>> to enable the new placement logic on an existing pool, but be warned that
>>> this will rebalance *all* of the data in the pool, which can be a very
>>> heavyweight operation...
>>>
>>> sage
>>>
>>>
>>> On Sun, 2 Feb 2014, Dominik Mostowiec wrote:
>>>
 Hi,
 After scrubbing, almost all PGs have a roughly equal number of objects.
 I found something else.
 On one host, the PG count on OSDs:
 OSD with small(52%) disk usage:
 count, pool
 105 3
  18 4
   3 5

 Osd with larger(74%) disk usage:
 144 3
  31 4
   2 5

 Pool 3 is .rgw.buckets (where is almost of all data).
 Pool 4 is .log, where is no data.

 Shouldn't the count of PGs be (roughly) the same per OSD?
 Or maybe the PG hash algorithm is disrupted by a wrong PG count for pool
 '4': there are 1440 PGs (this is not a power of 2).

 ceph osd dump:
 pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash
 rjenkins pg_num 64 pgp_num 64 last_change 28459 owner 0
 crash_replay_interval 45
 pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash
 rjenkins pg_num 64 pgp_num 64 last_change 28460 owner 0
 pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash
 rjenkins pg_num 64 pgp_num 64 last_change 28461 owner 0
 pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0
 object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner
 0
 pool 4 '.log' rep size 3 min_size 1 crush_ruleset 0 object_hash
 rjenkins pg_num 1440 pgp_num 1440 last_change 28463 owner 0
 pool 5 '.rgw' rep size 3 min_size 1 crush_ruleset 0 object_hash
 rjenkins pg_num 128 pgp_num 128 last_change 72467 owner 0
 pool 6 '.users.uid' rep size 3 min_size 1 crush_ruleset 0 object_hash
 rjenkins pg_num 8 pgp_num 8 last_change 28465 owner 0
 pool 7 '.users' rep size 3 min_size 1 crush_ruleset 0 object_hash
 rjenkins pg_num 8 pgp_num 8 last_change 28466 owner 0
 pool 8 '.usage' rep size 2 min_size 1 crush_ruleset 0 object_hash
 rjenkins pg_num 8 pgp_num 8 last_change 28467 owner
 18446744073709551615
 pool 9 '.intent-log' rep size 3 min_size 1 crush_ruleset 0 object_hash
 rjenkins pg_num 8 pgp_num 8 last_change 28468 owner
 18446744073709551615
 pool 10 '.rgw.control' rep size 3 min_size 1 crush_ruleset 0
 object_hash rjenkins pg_num 8 pgp_num 8 last_change 33485 owner
 18446744073709551615
 pool 11 '.rgw.gc' rep size 3 min_size 1 crush_ruleset 0 object_hash
 rjenkins pg_num 8 pgp_num 8 last_change 33487 owner
 18446744073709551615
 pool 12 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash
 rjenkins pg_num 8 pgp_num 8 last_change 44540 owner 0
 pool 13 '' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
 pg_num 8 pgp_num 8 last_change 46912 owner 0

 --
 Regards
 Dominik

 2014-02-01 Dominik Mostowiec :
 > Hi,
 >> Did you bump pgp_num as well?
 > Yes.
 >
 > See: http://dysk.onet.pl/link/BZ968
 >
 >> 25% pools is two times smaller from other.
 > This is changing after scrubbing.
 >
 > --
 > Regards
 > Dominik
 >
 > 2014-02-01 Kyle Bader :
 >>
 >>> Change pg_num for .rgw.buckets to power of 2, an 'crush tunables
 >>> optimal' didn't h

Re: [ceph-users] keyring generation

2014-02-03 Thread Alfredo Deza
On Mon, Feb 3, 2014 at 10:07 AM, Kei.masumoto  wrote:
> Hi Alfredo,
>
> Thanks for reply!  I pasted the logs below.
>
> 
> 2014-02-01 14:06:33,350 [ceph_deploy.cli][INFO  ] Invoked (1.3.4):
> /usr/bin/ceph-deploy mon create-initial
> 2014-02-01 14:06:33,353 [ceph_deploy.mon][DEBUG ] Deploying mon, cluster
> ceph hosts ceph1
> 2014-02-01 14:06:33,354 [ceph_deploy.mon][DEBUG ] detecting platform for
> host ceph1 ...
> 2014-02-01 14:06:33,770 [ceph1][DEBUG ] connected to host: ceph1
> 2014-02-01 14:06:33,775 [ceph1][DEBUG ] detect platform information from
> remote host
> 2014-02-01 14:06:33,874 [ceph1][DEBUG ] detect machine type
> 2014-02-01 14:06:33,909 [ceph_deploy.mon][INFO  ] distro info: Ubuntu 13.04
> raring
> 2014-02-01 14:06:33,910 [ceph1][DEBUG ] determining if provided host has
> same hostname in remote
> 2014-02-01 14:06:33,911 [ceph1][DEBUG ] get remote short hostname
> 2014-02-01 14:06:33,914 [ceph1][DEBUG ] deploying mon to ceph1
> 2014-02-01 14:06:33,915 [ceph1][DEBUG ] get remote short hostname
> 2014-02-01 14:06:33,917 [ceph1][DEBUG ] remote hostname: ceph1
> 2014-02-01 14:06:33,919 [ceph1][DEBUG ] write cluster configuration to
> /etc/ceph/{cluster}.conf
> 2014-02-01 14:06:33,933 [ceph1][DEBUG ] create the mon path if it does not
> exist
> 2014-02-01 14:06:33,939 [ceph1][DEBUG ] checking for done path:
> /var/lib/ceph/mon/ceph-ceph1/done
> 2014-02-01 14:06:33,941 [ceph1][DEBUG ] create a done file to avoid re-doing
> the mon deployment
> 2014-02-01 14:06:33,944 [ceph1][DEBUG ] create the init path if it does not
> exist
> 2014-02-01 14:06:33,946 [ceph1][DEBUG ] locating the `service` executable...
> 2014-02-01 14:06:33,949 [ceph1][INFO  ] Running command: sudo initctl emit
> ceph-mon cluster=ceph id=ceph1
> 2014-02-01 14:06:36,119 [ceph1][INFO  ] Running command: sudo ceph
> --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph1.asok mon_status
> 2014-02-01 14:06:36,805 [ceph1][DEBUG ]
> 
> 2014-02-01 14:06:36,807 [ceph1][DEBUG ] status for monitor: mon.ceph1
> 2014-02-01 14:06:36,809 [ceph1][DEBUG ] {
> 2014-02-01 14:06:36,810 [ceph1][DEBUG ]   "election_epoch": 1,
> 2014-02-01 14:06:36,812 [ceph1][DEBUG ]   "extra_probe_peers": [],
> 2014-02-01 14:06:36,813 [ceph1][DEBUG ]   "monmap": {
> 2014-02-01 14:06:36,814 [ceph1][DEBUG ] "created": "0.00",
> 2014-02-01 14:06:36,815 [ceph1][DEBUG ] "epoch": 1,
> 2014-02-01 14:06:36,815 [ceph1][DEBUG ] "fsid":
> "26835656-6b29-455d-9d1f-545cad8f1e23",
> 2014-02-01 14:06:36,816 [ceph1][DEBUG ] "modified": "0.00",
> 2014-02-01 14:06:36,816 [ceph1][DEBUG ] "mons": [
> 2014-02-01 14:06:36,817 [ceph1][DEBUG ]   {
> 2014-02-01 14:06:36,818 [ceph1][DEBUG ] "addr":
> "192.168.111.11:6789/0",
> 2014-02-01 14:06:36,818 [ceph1][DEBUG ] "name": "ceph1",
> 2014-02-01 14:06:36,819 [ceph1][DEBUG ] "rank": 0
> 2014-02-01 14:06:36,820 [ceph1][DEBUG ]   }
> 2014-02-01 14:06:36,821 [ceph1][DEBUG ] ]
> 2014-02-01 14:06:36,822 [ceph1][DEBUG ]   },
> 2014-02-01 14:06:36,826 [ceph1][DEBUG ]   "name": "ceph1",
> 2014-02-01 14:06:36,826 [ceph1][DEBUG ]   "outside_quorum": [],
> 2014-02-01 14:06:36,826 [ceph1][DEBUG ]   "quorum": [
> 2014-02-01 14:06:36,827 [ceph1][DEBUG ] 0
> 2014-02-01 14:06:36,827 [ceph1][DEBUG ]   ],
> 2014-02-01 14:06:36,827 [ceph1][DEBUG ]   "rank": 0,
> 2014-02-01 14:06:36,827 [ceph1][DEBUG ]   "state": "leader",
> 2014-02-01 14:06:36,828 [ceph1][DEBUG ]   "sync_provider": []
> 2014-02-01 14:06:36,828 [ceph1][DEBUG ] }
> 2014-02-01 14:06:36,828 [ceph1][DEBUG ]
> 
> 2014-02-01 14:06:36,829 [ceph1][INFO  ] monitor: mon.ceph1 is running
> 2014-02-01 14:06:36,830 [ceph1][INFO  ] Running command: sudo ceph
> --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph1.asok mon_status
> 2014-02-01 14:06:37,005 [ceph_deploy.mon][INFO  ] processing monitor
> mon.ceph1
> 2014-02-01 14:06:37,079 [ceph1][DEBUG ] connected to host: ceph1
> 2014-02-01 14:06:37,081 [ceph1][INFO  ] Running command: sudo ceph
> --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph1.asok mon_status
> 2014-02-01 14:06:37,258 [ceph_deploy.mon][INFO  ] mon.ceph1 monitor has
> reached quorum!
> 2014-02-01 14:06:37,259 [ceph_deploy.mon][INFO  ] all initial monitors are
> running and have formed quorum
> 2014-02-01 14:06:37,266 [ceph_deploy.mon][INFO  ] Running gatherkeys...
> 2014-02-01 14:06:37,268 [ceph_deploy.gatherkeys][DEBUG ] Checking ceph1 for
> /etc/ceph/ceph.client.admin.keyring
> 2014-02-01 14:06:37,336 [ceph1][DEBUG ] connected to host: ceph1
> 2014-02-01 14:06:37,340 [ceph1][DEBUG ] detect platform information from
> remote host
> 2014-02-01 14:06:37,373 [ceph1][DEBUG ] detect machine type
> 2014-02-01 14:06:37,383 [ceph1][DEBUG ] fetch remote file
>
> 2014-02-01 14

[ceph-users] get virtual size and used

2014-02-03 Thread zorg

hi,
We use rbd pool for
and I wonder how can i have
the real size use by my drb image

I can have the virtual size rbd info
but  how can i have the real size use by my drbd image


--
probeSys - spécialiste GNU/Linux
site web : http://www.probesys.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] keyring generation

2014-02-03 Thread Kei.masumoto

Hi Alfredo,

Thanks for your reply!

I think I pasted all the logs from ceph.log, but anyway, I re-executed
"ceph-deploy mon create-initial" again.

Does that make sense? It seems like stack traces were added...

--
[ceph_deploy.cli][INFO  ] Invoked (1.3.4): /usr/bin/ceph-deploy mon 
create-initial

[ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts ceph1
[ceph_deploy.mon][DEBUG ] detecting platform for host ceph1 ...
[ceph1][DEBUG ] connected to host: ceph1
[ceph1][DEBUG ] detect platform information from remote host
[ceph1][DEBUG ] detect machine type
[ceph_deploy.mon][INFO  ] distro info: Ubuntu 13.04 raring
[ceph1][DEBUG ] determining if provided host has same hostname in remote
[ceph1][DEBUG ] get remote short hostname
[ceph1][DEBUG ] deploying mon to ceph1
[ceph1][DEBUG ] get remote short hostname
[ceph1][DEBUG ] remote hostname: ceph1
[ceph1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph1][DEBUG ] create the mon path if it does not exist
[ceph1][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-ceph1/done
[ceph1][DEBUG ] create a done file to avoid re-doing the mon deployment
[ceph1][DEBUG ] create the init path if it does not exist
[ceph1][DEBUG ] locating the `service` executable...
[ceph1][INFO  ] Running command: sudo initctl emit ceph-mon cluster=ceph 
id=ceph1
[ceph1][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon 
/var/run/ceph/ceph-mon.ceph1.asok mon_status
[ceph1][DEBUG ] 


[ceph1][DEBUG ] status for monitor: mon.ceph1
[ceph1][DEBUG ] {
[ceph1][DEBUG ]   "election_epoch": 1,
[ceph1][DEBUG ]   "extra_probe_peers": [],
[ceph1][DEBUG ]   "monmap": {
[ceph1][DEBUG ] "created": "0.00",
[ceph1][DEBUG ] "epoch": 1,
[ceph1][DEBUG ] "fsid": "26835656-6b29-455d-9d1f-545cad8f1e23",
[ceph1][DEBUG ] "modified": "0.00",
[ceph1][DEBUG ] "mons": [
[ceph1][DEBUG ]   {
[ceph1][DEBUG ] "addr": "192.168.111.11:6789/0",
[ceph1][DEBUG ] "name": "ceph1",
[ceph1][DEBUG ] "rank": 0
[ceph1][DEBUG ]   }
[ceph1][DEBUG ] ]
[ceph1][DEBUG ]   },
[ceph1][DEBUG ]   "name": "ceph1",
[ceph1][DEBUG ]   "outside_quorum": [],
[ceph1][DEBUG ]   "quorum": [
[ceph1][DEBUG ] 0
[ceph1][DEBUG ]   ],
[ceph1][DEBUG ]   "rank": 0,
[ceph1][DEBUG ]   "state": "leader",
[ceph1][DEBUG ]   "sync_provider": []
[ceph1][DEBUG ] }
[ceph1][DEBUG ] 


[ceph1][INFO  ] monitor: mon.ceph1 is running
[ceph1][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon 
/var/run/ceph/ceph-mon.ceph1.asok mon_status

[ceph_deploy.mon][INFO  ] processing monitor mon.ceph1
[ceph1][DEBUG ] connected to host: ceph1
[ceph1][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon 
/var/run/ceph/ceph-mon.ceph1.asok mon_status

[ceph_deploy.mon][INFO  ] mon.ceph1 monitor has reached quorum!
[ceph_deploy.mon][INFO  ] all initial monitors are running and have 
formed quorum

[ceph_deploy.mon][INFO  ] Running gatherkeys...
gatherkeys.fetch_fileNamespace(cluster='ceph', dry_run=False, 
func=, mon=['ceph1'], overwrite_conf=False, 
prog='ceph-deploy', quiet=False, subcommand='create-initial', 
username=None, verbose=False) :: /etc/ceph/ceph.client.admin.keyring :: 
ceph.client.admin.keyring :: ['ceph1']
[ceph_deploy.gatherkeys][DEBUG ] Checking ceph1 for 
/etc/ceph/ceph.client.admin.keyring

[ceph1][DEBUG ] connected to host: ceph1
[ceph1][DEBUG ] detect platform information from remote host
[ceph1][DEBUG ] detect machine type
[ceph1][DEBUG ] fetch remote file
[ceph_deploy.gatherkeys][WARNIN] Unable to find 
/etc/ceph/ceph.client.admin.keyring on ['ceph1']

Traceback (most recent call last):
  File "", line 1, in 
  File "", line 6, in 
  File 
"/usr/lib/python2.7/dist-packages/ceph_deploy/lib/remoto/lib/execnet/gateway_base.py", 
line 1220, in serve
gatherkeys.fetch_fileNamespace(cluster='ceph', dry_run=False, 
func=, mon=['ceph1'], overwrite_conf=False, 
prog='ceph-deploy', quiet=False, subcommand='create-initial', 
username=None, verbose=False) :: 
/var/lib/ceph/mon/ceph-{hostname}/keyring :: ceph.mon.keyring :: ['ceph1']

[ceph_deploy.gatherkeys][DEBUG ] Have ceph.mon.keyring
gatherkeys.fetch_fileNamespace(cluster='ceph', dry_run=False, 
func=, mon=['ceph1'], overwrite_conf=False, 
prog='ceph-deploy', quiet=False, subcommand='create-initial', 
username=None, verbose=False) :: 
/var/lib/ceph/bootstrap-osd/ceph.keyring :: ceph.bootstrap-osd.keyring 
:: ['ceph1']

SlaveGateway(io=io, id=id, _startcount=2).serve()
  File 
"/usr/lib/python2.7/dist-packages/ceph_deploy/lib/remoto/lib/execnet/gateway_base.py", 
line 764, in serve
[ceph_deploy.gatherkeys][DEBUG ] Checking ceph1 for 
/var/lib/ceph/bootstrap-osd/ceph.keyring

self._io.close_write(

Re: [ceph-users] poor data distribution

2014-02-03 Thread Sage Weil
Hi Dominik,

Can you send a copy of your osdmap?

 ceph osd getmap -o /tmp/osdmap

(Can send it off list if the IP addresses are sensitive.)  I'm tweaking 
osdmaptool to have a --test-map-pgs option to look at this offline.
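(For reference, the map can also be inspected offline with the stock tools;
a quick sketch:)

  ceph osd getmap -o /tmp/osdmap
  osdmaptool /tmp/osdmap --print | head           # epoch, pools, flags, osd weights
  osdmaptool /tmp/osdmap --export-crush /tmp/crush
  crushtool -d /tmp/crush -o /tmp/crush.txt       # decompiled crush map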

Thanks!
sage


On Mon, 3 Feb 2014, Dominik Mostowiec wrote:

> In other words,
> 1. we've got 3 racks ( 1 replica per rack )
> 2. in every rack we have 3 hosts
> 3. every host has 22 OSD's
> 4. all pg_num's are 2^n for every pool
> 5. we enabled "crush tunables optimal".
> 6. on every machine we disabled 4 unused disk's (osd out, osd reweight
> 0 and osd rm)
> 
> Pool ".rgw.buckets": one osd has 105 PGs and other one (on the same
> machine) has 144 PGs (37% more!).
> Other pools also have got this problem. It's not efficient placement.
> 
> --
> Regards
> Dominik
> 
> 
> 2014-02-02 Dominik Mostowiec :
> > Hi,
> > For more info:
> >   crush: http://dysk.onet.pl/link/r4wGK
> >   osd_dump: http://dysk.onet.pl/link/I3YMZ
> >   pg_dump: http://dysk.onet.pl/link/4jkqM
> >
> > --
> > Regards
> > Dominik
> >
> > 2014-02-02 Dominik Mostowiec :
> >> Hi,
> >> Hmm,
> >> You think about sumarize PGs from different pools on one OSD's i think.
> >> But for one pool (.rgw.buckets) where i have almost of all my data, PG
> >> count on OSDs is aslo different.
> >> For example 105 vs 144 PGs from pool .rgw.buckets. In first case it is
> >> 52% disk usage, second 74%.
> >>
> >> --
> >> Regards
> >> Dominik
> >>
> >>
> >> 2014-02-02 Sage Weil :
> >>> It occurs to me that this (and other unexplain variance reports) could
> >>> easily be the 'hashpspool' flag not being set.  The old behavior had the
> >>> misfeature where consecutive pool's pg's would 'line up' on the same osds,
> >>> so that 1.7 == 2.6 == 3.5 == 4.4 etc would map to the same nodes.  This
> >>> tends to 'amplify' any variance in the placement.  The default is still to
> >>> use the old behavior for compatibility (this will finally change in
> >>> firefly).
> >>>
> >>> You can do
> >>>
> >>>  ceph osd pool set  hashpspool true
> >>>
> >>> to enable the new placement logic on an existing pool, but be warned that
> >>> this will rebalance *all* of the data in the pool, which can be a very
> >>> heavyweight operation...
> >>>
> >>> sage
> >>>
> >>>
> >>> On Sun, 2 Feb 2014, Dominik Mostowiec wrote:
> >>>
>  Hi,
>  After scrubbing almost all PGs has equal(~) num of objects.
>  I found something else.
>  On one host PG coun on OSDs:
>  OSD with small(52%) disk usage:
>  count, pool
>  105 3
>   18 4
>    3 5
> 
>  Osd with larger(74%) disk usage:
>  144 3
>   31 4
>    2 5
> 
>  Pool 3 is .rgw.buckets (where is almost of all data).
>  Pool 4 is .log, where is no data.
> 
>  Count of PGs shouldn't be the same per OSD ?
>  Or maybe PG hash algorithm is disrupted by wrong count of PG for pool
>  '4'. There is 1440 PGs ( this is not power of 2 ).
> 
>  ceph osd dump:
>  pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash
>  rjenkins pg_num 64 pgp_num 64 last_change 28459 owner 0
>  crash_replay_interval 45
>  pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash
>  rjenkins pg_num 64 pgp_num 64 last_change 28460 owner 0
>  pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash
>  rjenkins pg_num 64 pgp_num 64 last_change 28461 owner 0
>  pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0
>  object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner
>  0
>  pool 4 '.log' rep size 3 min_size 1 crush_ruleset 0 object_hash
>  rjenkins pg_num 1440 pgp_num 1440 last_change 28463 owner 0
>  pool 5 '.rgw' rep size 3 min_size 1 crush_ruleset 0 object_hash
>  rjenkins pg_num 128 pgp_num 128 last_change 72467 owner 0
>  pool 6 '.users.uid' rep size 3 min_size 1 crush_ruleset 0 object_hash
>  rjenkins pg_num 8 pgp_num 8 last_change 28465 owner 0
>  pool 7 '.users' rep size 3 min_size 1 crush_ruleset 0 object_hash
>  rjenkins pg_num 8 pgp_num 8 last_change 28466 owner 0
>  pool 8 '.usage' rep size 2 min_size 1 crush_ruleset 0 object_hash
>  rjenkins pg_num 8 pgp_num 8 last_change 28467 owner
>  18446744073709551615
>  pool 9 '.intent-log' rep size 3 min_size 1 crush_ruleset 0 object_hash
>  rjenkins pg_num 8 pgp_num 8 last_change 28468 owner
>  18446744073709551615
>  pool 10 '.rgw.control' rep size 3 min_size 1 crush_ruleset 0
>  object_hash rjenkins pg_num 8 pgp_num 8 last_change 33485 owner
>  18446744073709551615
>  pool 11 '.rgw.gc' rep size 3 min_size 1 crush_ruleset 0 object_hash
>  rjenkins pg_num 8 pgp_num 8 last_change 33487 owner
>  18446744073709551615
>  pool 12 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash
>  rjenkins pg_num 8 pgp_num 8 last_change 44540 owner 0
>  pool 13 '' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
> >>>

Re: [ceph-users] poor data distribution

2014-02-03 Thread Dominik Mostowiec
Sorry, I forgot to tell you.
It can be important.
We ran:
ceph osd reweight-by-utilization 105 (as I wrote in the second mail),
and after the cluster got stuck on 'active+remapped' PGs we had to reweight
them back to 1.0 (all reweighted OSDs).
This osdmap is not from an active+clean cluster; rebalancing is in progress.
If you need it, I'll send you an osdmap from a clean cluster. Let me know.
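(The reset itself was just a loop over the OSD ids, roughly like this; it
blindly resets every OSD's reweight to 1.0, so restrict the id list if some
should stay reweighted:)

  for id in $(ceph osd ls); do ceph osd reweight "$id" 1.0; done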

--
Regards
Dominik




2014-02-03 Dominik Mostowiec :
> Hi,
> Thanks,
> In attachement.
>
>
> --
> Regards
> Dominik
>
>
> 2014-02-03 Sage Weil :
>> Hi Dominik,
>>
>> Can you send a copy of your osdmap?
>>
>>  ceph osd getmap -o /tmp/osdmap
>>
>> (Can send it off list if the IP addresses are sensitive.)  I'm tweaking
>> osdmaptool to have a --test-map-pgs option to look at this offline.
>>
>> Thanks!
>> sage
>>
>>
>> On Mon, 3 Feb 2014, Dominik Mostowiec wrote:
>>
>>> In other words,
>>> 1. we've got 3 racks ( 1 replica per rack )
>>> 2. in every rack we have 3 hosts
>>> 3. every host has 22 OSD's
>>> 4. all pg_num's are 2^n for every pool
>>> 5. we enabled "crush tunables optimal".
>>> 6. on every machine we disabled 4 unused disk's (osd out, osd reweight
>>> 0 and osd rm)
>>>
>>> Pool ".rgw.buckets": one osd has 105 PGs and other one (on the same
>>> machine) has 144 PGs (37% more!).
>>> Other pools also have got this problem. It's not efficient placement.
>>>
>>> --
>>> Regards
>>> Dominik
>>>
>>>
>>> 2014-02-02 Dominik Mostowiec :
>>> > Hi,
>>> > For more info:
>>> >   crush: http://dysk.onet.pl/link/r4wGK
>>> >   osd_dump: http://dysk.onet.pl/link/I3YMZ
>>> >   pg_dump: http://dysk.onet.pl/link/4jkqM
>>> >
>>> > --
>>> > Regards
>>> > Dominik
>>> >
>>> > 2014-02-02 Dominik Mostowiec :
>>> >> Hi,
>>> >> Hmm,
>>> >> You think about sumarize PGs from different pools on one OSD's i think.
>>> >> But for one pool (.rgw.buckets) where i have almost of all my data, PG
>>> >> count on OSDs is aslo different.
>>> >> For example 105 vs 144 PGs from pool .rgw.buckets. In first case it is
>>> >> 52% disk usage, second 74%.
>>> >>
>>> >> --
>>> >> Regards
>>> >> Dominik
>>> >>
>>> >>
>>> >> 2014-02-02 Sage Weil :
>>> >>> It occurs to me that this (and other unexplain variance reports) could
>>> >>> easily be the 'hashpspool' flag not being set.  The old behavior had the
>>> >>> misfeature where consecutive pool's pg's would 'line up' on the same 
>>> >>> osds,
>>> >>> so that 1.7 == 2.6 == 3.5 == 4.4 etc would map to the same nodes.  This
>>> >>> tends to 'amplify' any variance in the placement.  The default is still 
>>> >>> to
>>> >>> use the old behavior for compatibility (this will finally change in
>>> >>> firefly).
>>> >>>
>>> >>> You can do
>>> >>>
>>> >>>  ceph osd pool set  hashpspool true
>>> >>>
>>> >>> to enable the new placement logic on an existing pool, but be warned 
>>> >>> that
>>> >>> this will rebalance *all* of the data in the pool, which can be a very
>>> >>> heavyweight operation...
>>> >>>
>>> >>> sage
>>> >>>
>>> >>>
>>> >>> On Sun, 2 Feb 2014, Dominik Mostowiec wrote:
>>> >>>
>>>  Hi,
>>>  After scrubbing almost all PGs has equal(~) num of objects.
>>>  I found something else.
>>>  On one host PG coun on OSDs:
>>>  OSD with small(52%) disk usage:
>>>  count, pool
>>>  105 3
>>>   18 4
>>>    3 5
>>> 
>>>  Osd with larger(74%) disk usage:
>>>  144 3
>>>   31 4
>>>    2 5
>>> 
>>>  Pool 3 is .rgw.buckets (where is almost of all data).
>>>  Pool 4 is .log, where is no data.
>>> 
>>>  Count of PGs shouldn't be the same per OSD ?
>>>  Or maybe PG hash algorithm is disrupted by wrong count of PG for pool
>>>  '4'. There is 1440 PGs ( this is not power of 2 ).
>>> 
>>>  ceph osd dump:
>>>  pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash
>>>  rjenkins pg_num 64 pgp_num 64 last_change 28459 owner 0
>>>  crash_replay_interval 45
>>>  pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash
>>>  rjenkins pg_num 64 pgp_num 64 last_change 28460 owner 0
>>>  pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash
>>>  rjenkins pg_num 64 pgp_num 64 last_change 28461 owner 0
>>>  pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0
>>>  object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner
>>>  0
>>>  pool 4 '.log' rep size 3 min_size 1 crush_ruleset 0 object_hash
>>>  rjenkins pg_num 1440 pgp_num 1440 last_change 28463 owner 0
>>>  pool 5 '.rgw' rep size 3 min_size 1 crush_ruleset 0 object_hash
>>>  rjenkins pg_num 128 pgp_num 128 last_change 72467 owner 0
>>>  pool 6 '.users.uid' rep size 3 min_size 1 crush_ruleset 0 object_hash
>>>  rjenkins pg_num 8 pgp_num 8 last_change 28465 owner 0
>>>  pool 7 '.users' rep size 3 min_size 1 crush_ruleset 0 object_hash
>>>  rjenkins pg_num 8 pgp_num 8 last_change 28466 owner 0
>>>  pool 8 '.usage' rep size 2 min_size 1 crush_ruleset 0 object_hash
>>> >>>

Re: [ceph-users] poor data distribution

2014-02-03 Thread Sage Weil
On Mon, 3 Feb 2014, Dominik Mostowiec wrote:
> Sory, i forgot to tell You.
> It can be important.
> We done:
> ceph osd reweight-by-utilization 105 ( as i wrote in second mail ).
> and after cluster stack on 'active+remapped' PGs we had to reweight it
> back to 1.0. (all reweighted osd's)
> This osdmap is not from active+clean cluster, rebalancing is in progress.
> If you need i'll send you osdmap from clean cluster. Let me know.

A clean osdmap would be helpful.

Thanks!
sage

> 
> --
> Regards
> Dominik
> 
> 
> 
> 
> 2014-02-03 Dominik Mostowiec :
> > Hi,
> > Thanks,
> > In attachement.
> >
> >
> > --
> > Regards
> > Dominik
> >
> >
> > 2014-02-03 Sage Weil :
> >> Hi Dominik,
> >>
> >> Can you send a copy of your osdmap?
> >>
> >>  ceph osd getmap -o /tmp/osdmap
> >>
> >> (Can send it off list if the IP addresses are sensitive.)  I'm tweaking
> >> osdmaptool to have a --test-map-pgs option to look at this offline.
> >>
> >> Thanks!
> >> sage
> >>
> >>
> >> On Mon, 3 Feb 2014, Dominik Mostowiec wrote:
> >>
> >>> In other words,
> >>> 1. we've got 3 racks ( 1 replica per rack )
> >>> 2. in every rack we have 3 hosts
> >>> 3. every host has 22 OSD's
> >>> 4. all pg_num's are 2^n for every pool
> >>> 5. we enabled "crush tunables optimal".
> >>> 6. on every machine we disabled 4 unused disk's (osd out, osd reweight
> >>> 0 and osd rm)
> >>>
> >>> Pool ".rgw.buckets": one osd has 105 PGs and other one (on the same
> >>> machine) has 144 PGs (37% more!).
> >>> Other pools also have got this problem. It's not efficient placement.
> >>>
> >>> --
> >>> Regards
> >>> Dominik
> >>>
> >>>
> >>> 2014-02-02 Dominik Mostowiec :
> >>> > Hi,
> >>> > For more info:
> >>> >   crush: http://dysk.onet.pl/link/r4wGK
> >>> >   osd_dump: http://dysk.onet.pl/link/I3YMZ
> >>> >   pg_dump: http://dysk.onet.pl/link/4jkqM
> >>> >
> >>> > --
> >>> > Regards
> >>> > Dominik
> >>> >
> >>> > 2014-02-02 Dominik Mostowiec :
> >>> >> Hi,
> >>> >> Hmm,
> >>> >> You think about sumarize PGs from different pools on one OSD's i think.
> >>> >> But for one pool (.rgw.buckets) where i have almost of all my data, PG
> >>> >> count on OSDs is aslo different.
> >>> >> For example 105 vs 144 PGs from pool .rgw.buckets. In first case it is
> >>> >> 52% disk usage, second 74%.
> >>> >>
> >>> >> --
> >>> >> Regards
> >>> >> Dominik
> >>> >>
> >>> >>
> >>> >> 2014-02-02 Sage Weil :
> >>> >>> It occurs to me that this (and other unexplain variance reports) could
> >>> >>> easily be the 'hashpspool' flag not being set.  The old behavior had 
> >>> >>> the
> >>> >>> misfeature where consecutive pool's pg's would 'line up' on the same 
> >>> >>> osds,
> >>> >>> so that 1.7 == 2.6 == 3.5 == 4.4 etc would map to the same nodes.  
> >>> >>> This
> >>> >>> tends to 'amplify' any variance in the placement.  The default is 
> >>> >>> still to
> >>> >>> use the old behavior for compatibility (this will finally change in
> >>> >>> firefly).
> >>> >>>
> >>> >>> You can do
> >>> >>>
> >>> >>>  ceph osd pool set  hashpspool true
> >>> >>>
> >>> >>> to enable the new placement logic on an existing pool, but be warned 
> >>> >>> that
> >>> >>> this will rebalance *all* of the data in the pool, which can be a very
> >>> >>> heavyweight operation...
> >>> >>>
> >>> >>> sage
> >>> >>>
> >>> >>>
> >>> >>> On Sun, 2 Feb 2014, Dominik Mostowiec wrote:
> >>> >>>
> >>>  Hi,
> >>>  After scrubbing almost all PGs has equal(~) num of objects.
> >>>  I found something else.
> >>>  On one host PG coun on OSDs:
> >>>  OSD with small(52%) disk usage:
> >>>  count, pool
> >>>  105 3
> >>>   18 4
> >>>    3 5
> >>> 
> >>>  Osd with larger(74%) disk usage:
> >>>  144 3
> >>>   31 4
> >>>    2 5
> >>> 
> >>>  Pool 3 is .rgw.buckets (where is almost of all data).
> >>>  Pool 4 is .log, where is no data.
> >>> 
> >>>  Count of PGs shouldn't be the same per OSD ?
> >>>  Or maybe PG hash algorithm is disrupted by wrong count of PG for pool
> >>>  '4'. There is 1440 PGs ( this is not power of 2 ).
> >>> 
> >>>  ceph osd dump:
> >>>  pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash
> >>>  rjenkins pg_num 64 pgp_num 64 last_change 28459 owner 0
> >>>  crash_replay_interval 45
> >>>  pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash
> >>>  rjenkins pg_num 64 pgp_num 64 last_change 28460 owner 0
> >>>  pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash
> >>>  rjenkins pg_num 64 pgp_num 64 last_change 28461 owner 0
> >>>  pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0
> >>>  object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner
> >>>  0
> >>>  pool 4 '.log' rep size 3 min_size 1 crush_ruleset 0 object_hash
> >>>  rjenkins pg_num 1440 pgp_num 1440 last_change 28463 owner 0
> >>>  pool 5 '.rgw' rep size 3 min_size 1 crush_ruleset 0 object_hash
> >>>  rjenkins

[ceph-users] RGW Replication

2014-02-03 Thread Craig Lewis
I've been noticing something strange with my RGW federation.  I added
some statistics to radosgw-agent to try and get some insight 
(https://github.com/ceph/radosgw-agent/pull/7), but that just showed me 
that I don't understand how replication works.


When PUT traffic was relatively slow to the master zone, replication had 
no issues keeping up.  Now I'm trying to cause replication to fall 
behind, by deliberately exceeding the amount of bandwidth between the 
two zones (they're in different datacenters).  Instead of falling 
behind, both the radosgw-agent logs and the stats I added say that the slave
zone is keeping up.


But some of the numbers don't add up.  I'm not using enough bandwidth 
between the two facilities, and I'm not using enough disk space in the 
slave zone.  The disk usage in the slave zone continues to fall further 
and further behind the master.  Despite this, I'm always able to 
download objects from both zones.



How does radosgw-agent actually replicate metadata and data?  Does 
radosgw-agent actually copy all the bytes, or does it create 
placeholders in the slave zone?  If radosgw-agent is creating 
placeholders in the slave zone, and radosgw populates the placeholder in 
the background, then that would explain the behavior I'm seeing.  If 
this is true, how can I tell if replication is keeping up?
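In case it helps, this is roughly how I'm comparing the two zones (a
sketch; <bucket> is a placeholder, and each command is run against the
zone's own cluster/gateway):

  radosgw-admin bucket stats --bucket=<bucket> | grep -E 'num_objects|size_kb'
  rados df
  ceph df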



--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website   | Twitter 
  | Facebook 
  | LinkedIn 
  | Blog 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Replication

2014-02-03 Thread Gregory Farnum
On Mon, Feb 3, 2014 at 10:43 AM, Craig Lewis  wrote:
> I've been noticing somethings strange with my RGW federation.  I added some
> statistics to radosgw-agent to try and get some insight
> (https://github.com/ceph/radosgw-agent/pull/7), but that just showed me that
> I don't understand how replication works.
>
> When PUT traffic was relatively slow to the master zone, replication had no
> issues keeping up.  Now I'm trying to cause replication to fall behind, by
> deliberately exceeding the amount of bandwidth between the two zones
> (they're in different datacenters).  Instead of falling behind, both the
> radosgw-agent logs and the stats I added say that slave zone is keeping up.
>
> But some of the numbers don't add up.  I'm not using enough bandwidth
> between the two facilities, and I'm not using enough disk space in the slave
> zone.  The disk usage in the slave zone continues to fall further and
> further behind the master.  Despite this, I'm always able to download
> objects from both zones.
>
>
> How does radosgw-agent actually replicate metadata and data?  Does
> radosgw-agent actually copy all the bytes, or does it create placeholders in
> the slave zone?  If radosgw-agent is creating placeholders in the slave
> zone, and radosgw populates the placeholder in the background, then that
> would explain the behavior I'm seeing.  If this is true, how can I tell if
> replication is keeping up?

Are you overwriting the same objects? Replication copies over the
"present" version of an object, not all the versions which have ever
existed. Similarly, the slave zone doesn't keep all the
(garbage-collected) logs that the master zone has to, so those factors
would be one way to get differing disk counts.
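One quick way to sanity-check the GC angle on the master (a sketch; the
listing can be long):

  radosgw-admin gc list --include-all | head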
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Chef cookbooks

2014-02-03 Thread Craig Lewis
The Chef recipes support the ceph daemons, but not things that live 
inside ceph.  You can't manage pools or users (yet).  Github has a few 
open tickets for managing things that live inside Ceph.


You'll want to browse through the open pull requests.  There are a bunch 
of minor fixes waiting to be merged.  I've merged several of them into 
my repo, since merges into master don't seem to be happening.



The ceph::conf recipe will set a cluster addr and public addr, if you
tell it to.




I have a Federated RadosGW setup, and I disabled CephX.  (I need to 
revisit that decision).


Here's the config for my environment:

"ceph": {
  "radosgw-agent": {
"config": 
"/etc/ceph/radosgw.replicate.us-west-1-to-us-central-1.conf"

  },
  "monitor-secret": "*",
  "version": "emperor",
  "config": {
"fsid": "*",
"rgw": {
  "admin socket": "/var/run/ceph/radosgw.asok",
  "rgw region": "us",
  "rgw zone": "us-east-1",
  "rgw dns name": "us-east-1.ceph.cdlocal",
  "rgw zone root pool": ".us-east-1.rgw.root",
  "rgw region root pool": ".us.rgw.root"
},
"mon_initial_members": "ceph0",
"osd": {
  "osd journal size": 6144,
  "osd mount options xfs": 
"rw,noatime,nodiratime,nosuid,noexec,inode64",

  "osd mkfs type": "xfs",
  "osd mkfs options xfs": "-l size=1024m -n size=64k -i 
size=2048 -s size=4096"

},
"global": {
  "osd pool default size": 2,
  "osd pool default min size": 1,
  "cluster network": "192.168.0.0/24",
  "auth cluster required": "none",
  "auth service required": "none",
  "public network": "192.168.1.0/24",
  "auth client required": "none",
  "osd pool default flag hashpspool": "true"
}
  },
  "radosgw": {
"rgw_addr": "*:80",
"webserver_companion": "apache2",
"api_fqdn": "us-east-1.ceph.cdlocal",
"admin_email": "*",
"api_aliases": [
  "*.us-east-1.ceph.cdlocal"
]
  },
  "config-sections": {
"client.radosgw": {
  "rgw print continue": "false"
}
  }
},



And the node config for ceph0:

"ceph": {
  "config-sections": {
  },
  "osd_devices": [
{
  "device": "/dev/sdc",
  "encrypted": true,
  "dmcrypt": true,
  "status": "deployed",
  "journal": "JOURNAL"
},
...
}


*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website   | Twitter 
  | Facebook 
  | LinkedIn 
  | Blog 



On 1/29/14 01:19 , Gandalf Corvotempesta wrote:

I'm looking at this:
https://github.com/ceph/ceph-cookbooks

seems to support the whole ceph stack (rgw, mons, osd, msd)

Here:
http://wiki.ceph.com/Guides/General_Guides/Deploying_Ceph_with_Chef#Configure_your_Ceph_Environment
I can see that I need to configure the environment as for example and
I can see a "cluster network" setting.

But the OSD recipe doesn't set any "cluster addr" or "network addr",
is that good?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] get virtual size and used

2014-02-03 Thread Sebastien Han
Hi,

$ rbd diff rbd/toto | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }’
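That sums the allocated extents, so it is the "real" usage (before
replication). For comparison, the provisioned (virtual) size:

  rbd info rbd/toto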

 
Sébastien Han 
Cloud Engineer 

"Always give 100%. Unless you're giving blood.” 

Phone: +33 (0)1 49 70 99 72 
Mail: sebastien@enovance.com 
Address : 10, rue de la Victoire - 75009 Paris 
Web : www.enovance.com - Twitter : @enovance 

On 03 Feb 2014, at 17:10, zorg  wrote:

> hi,
> We use rbd pool for
> and I wonder how can i have
> the real size use by my drb image
> 
> I can have the virtual size rbd info
> but  how can i have the real size use by my drbd image
> 
> 
> -- 
> probeSys - spécialiste GNU/Linux
> site web : http://www.probesys.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CDS and GSoC

2014-02-03 Thread Patrick McGarry
Hey folks,

It's time for our favorite gameshow again, the Ceph Developer
Summit...where everyone is a winner! This quarter the grand prize is a
google hangout date with Sage and the gang to talk about Giant.
Hooray!

http://ceph.com/community/ceph-developer-summit-giant/

As you can see from the first paragraph, we're working through some
account creation issues with our auth plugin.  So, if you have issues
creating an account on the wiki, please let me know and I can make sure
you get your blueprint posted OK. (But everyone already has wiki
accounts, right? Right? For the copious amount of documentation you've
been doing? I thought so! :)



We're also putting Ceph in as a Google Summer of Code project, so if
anyone from the community would like to step forward as a mentor on a
project you have in mind for Ceph development, please let me know as
soon as possible.  If you would like to see an example of what we
submitted before, it's still available:

http://ceph.com/gsoc2013/

The deadline to tell me about your project idea is Feb 12th.  Thanks,
and see you at CDS!


Best Regards,

Patrick McGarry
Director, Community || Inktank
http://ceph.com  ||  http://inktank.com
@scuttlemonkey || @ceph || @inktank
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph interactive mode tab completion

2014-02-03 Thread Ben Sherman
Hello all,

I noticed ceph has an interactive mode.

I did a quick search and I don't see that tab completion is in there,
but there are some mentions of readline in the source, so I'm
wondering if it is on the horizon.
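(In the meantime, wrapping the shell gets basic line editing and history; a
crude sketch, assuming rlwrap is installed:)

  rlwrap ceph     # readline editing/history, though not ceph-aware tab completion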



--ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Replication

2014-02-03 Thread Craig Lewis




On 2/3/14 10:51 , Gregory Farnum wrote:

On Mon, Feb 3, 2014 at 10:43 AM, Craig Lewis  wrote:

I've been noticing somethings strange with my RGW federation.  I added some
statistics to radosgw-agent to try and get some insight
(https://github.com/ceph/radosgw-agent/pull/7), but that just showed me that
I don't understand how replication works.

When PUT traffic was relatively slow to the master zone, replication had no
issues keeping up.  Now I'm trying to cause replication to fall behind, by
deliberately exceeding the amount of bandwidth between the two zones
(they're in different datacenters).  Instead of falling behind, both the
radosgw-agent logs and the stats I added say that slave zone is keeping up.

But some of the numbers don't add up.  I'm not using enough bandwidth
between the two facilities, and I'm not using enough disk space in the slave
zone.  The disk usage in the slave zone continues to fall further and
further behind the master.  Despite this, I'm always able to download
objects from both zones.


How does radosgw-agent actually replicate metadata and data?  Does
radosgw-agent actually copy all the bytes, or does it create placeholders in
the slave zone?  If radosgw-agent is creating placeholders in the slave
zone, and radosgw populates the placeholder in the background, then that
would explain the behavior I'm seeing.  If this is true, how can I tell if
replication is keeping up?

Are you overwriting the same objects? Replication copies over the
"present" version of an object, not all the versions which have ever
existed. Similarly, the slave zone doesn't keep all the
(garbage-collected) logs that the master zone has to, so those factors
would be one way to get differing disk counts.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com



Before I started this import, the master zone was using 3.54TB (raw), 
and the slave zone was using 3.42 TB (raw).  I did overwrite some 
objects, and the 120GB is plausible for overwrites.


I haven't deleted anything yet, so the only garbage collection would be 
overwritten objects.  Right?



I imported 1.93TB of data.  Replication is currently 2x, so that's 
3.86TB (raw).  Now the master is using 7.48TB (raw), and the slave is 
using 4.89TB (raw).  The master zone looks correct, but the slave zone 
is missing 2.59TB (raw).  That's 66% of my imported data.


The 33% of data the slave does have is in line with the amount of 
bandwidth I see between the two facilities.  I see an increase of ~150 
Mbps when the import is running on the master, and ~50 Mbps on the slave.




Just to verify that I'm not overwriting objects, I checked the apache
logs.  Since I started the import, there have been 1328542 PUTs 
(including normal site traffic).  1301511 of those are unique. I'll 
investigate the 27031 duplicates, but the dups are only 34GB. Not nearly 
enough to account for the discrepancy.



From your answer, I'll assume there are no placeholders involved. If 
radosgw-agent says we're up to date, the data should exist in the slave 
zone.


Now I'm really confused.


*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website   | Twitter 
  | Facebook 
  | LinkedIn 
  | Blog 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor data distribution

2014-02-03 Thread Sage Weil
Hi,

I spent a couple hours looking at your map because it did look like there 
was something wrong.  After some experimentation and adding a bunch of
improvements to osdmaptool to test the distribution, though, I think 
everything is working as expected.  For pool 3, your map has a standard 
deviation in utilizations of ~8%, and we should expect ~9% for this number 
of PGs.  For all pools, it is slightly higher (~9% vs expected ~8%).  
This is either just in the noise, or slightly confounded by the lack of 
the hashpspool flag on the pools (which slightly amplifies placement 
nonuniformity with multiple pools... not enough that it is worth changing 
anything though).

The bad news is that that order of standard deviation results in a pretty
wide min/max range of 118 to 202 pgs.  That seems a *bit* higher than what a
perfectly random placement generates (I'm seeing a spread that is
usually 50-70 pgs), but I think *that* is where the pool overlap (no
hashpspool) is rearing its head; for just pool three the spread of 50 is
about what is expected.

Long story short: you have two options.  One is increasing the number of 
PGs.  Note that this helps but has diminishing returns (doubling PGs 
only takes you from ~8% to ~6% standard deviation, quadrupling to ~4%).

The other is to use reweight-by-utilization.  That is the best approach, 
IMO.  I'm not sure why you were seeing PGs stuck in the remapped state 
after you did that, though, but I'm happy to dig into that too.
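
For reference, a rough sketch of both options (the pool name and the target 
pg_num here are just illustrative, not recommendations for this particular 
cluster):

  # option 1: more PGs for the big pool; pgp_num has to follow pg_num
  ceph osd pool set .rgw.buckets pg_num 16384
  ceph osd pool set .rgw.buckets pgp_num 16384

  # option 2: reweight OSDs more than 20% above the mean utilization
  ceph osd reweight-by-utilization 120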

BTW, the osdmaptool addition I was using to play with is here:
https://github.com/ceph/ceph/pull/1178
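
Once that lands, repeating the experiment should look roughly like this 
(assuming the new test mode keeps the --test-map-pgs name; the output file 
name is arbitrary):

  ceph osd getmap -o osdmap.bin
  osdmaptool osdmap.bin --test-map-pgs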

sage


On Mon, 3 Feb 2014, Dominik Mostowiec wrote:

> In other words,
> 1. we've got 3 racks ( 1 replica per rack )
> 2. in every rack we have 3 hosts
> 3. every host has 22 OSD's
> 4. all pg_num's are 2^n for every pool
> 5. we enabled "crush tunables optimal".
> 6. on every machine we disabled 4 unused disks (osd out, osd reweight
> 0 and osd rm)
> 
> Pool ".rgw.buckets": one OSD has 105 PGs and another one (on the same
> machine) has 144 PGs (37% more!).
> Other pools have this problem as well. It's not efficient placement.
> 
> --
> Regards
> Dominik
> 
> 
> 2014-02-02 Dominik Mostowiec :
> > Hi,
> > For more info:
> >   crush: http://dysk.onet.pl/link/r4wGK
> >   osd_dump: http://dysk.onet.pl/link/I3YMZ
> >   pg_dump: http://dysk.onet.pl/link/4jkqM
> >
> > --
> > Regards
> > Dominik
> >
> > 2014-02-02 Dominik Mostowiec :
> >> Hi,
> >> Hmm,
> >> You are thinking about summing PGs from different pools on one OSD, I think.
> >> But for the one pool (.rgw.buckets) where I have almost all of my data, the PG
> >> count on the OSDs is also different.
> >> For example, 105 vs 144 PGs from pool .rgw.buckets. In the first case that is
> >> 52% disk usage, in the second 74%.
> >>
> >> --
> >> Regards
> >> Dominik
> >>
> >>
> >> 2014-02-02 Sage Weil :
> >>> It occurs to me that this (and other unexplain variance reports) could
> >>> easily be the 'hashpspool' flag not being set.  The old behavior had the
> >>> misfeature where consecutive pools' PGs would 'line up' on the same OSDs,
> >>> so that 1.7 == 2.6 == 3.5 == 4.4 etc would map to the same nodes.  This
> >>> tends to 'amplify' any variance in the placement.  The default is still to
> >>> use the old behavior for compatibility (this will finally change in
> >>> firefly).
> >>>
> >>> You can do
> >>>
> >>>  ceph osd pool set  hashpspool true
> >>>
> >>> to enable the new placement logic on an existing pool, but be warned that
> >>> this will rebalance *all* of the data in the pool, which can be a very
> >>> heavyweight operation...
> >>>
> >>> sage
> >>>
> >>>
> >>> On Sun, 2 Feb 2014, Dominik Mostowiec wrote:
> >>>
>  Hi,
>  After scrubbing, almost all PGs have a roughly equal number of objects.
>  I found something else.
>  On one host, the PG count on the OSDs:
>  OSD with small (52%) disk usage:
>  count, pool
>  105 3
>   18 4
>    3 5
> 
>  OSD with larger (74%) disk usage:
>  144 3
>   31 4
>    2 5
> 
>  Pool 3 is .rgw.buckets (where is almost of all data).
>  Pool 4 is .log, where is no data.
> 
>  Shouldn't the PG count be the same per OSD?
>  Or maybe the PG hash algorithm is disrupted by the wrong PG count for pool
>  '4'. It has 1440 PGs (which is not a power of 2).
> 
>  ceph osd dump:
>  pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash
>  rjenkins pg_num 64 pgp_num 64 last_change 28459 owner 0
>  crash_replay_interval 45
>  pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash
>  rjenkins pg_num 64 pgp_num 64 last_change 28460 owner 0
>  pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash
>  rjenkins pg_num 64 pgp_num 64 last_change 28461 owner 0
>  pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0
>  object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner
>  0
>  pool 4 '.log' rep size 3 min_size 1 crus

[ceph-users] Low RBD Performance

2014-02-03 Thread Gruher, Joseph R
Hi folks-

I'm having trouble demonstrating reasonable performance of RBDs.  I'm running 
Ceph 0.72.2 on Ubuntu 13.04 with the 3.12 kernel.  I have four dual-Xeon 
servers, each with 24GB RAM, and an Intel 320 SSD for journals and four WD 10K 
RPM SAS drives for OSDs, all connected with an LSI 1078.  This is just a lab 
experiment using scrounged hardware so everything isn't sized to be a Ceph 
cluster, it's just what I have lying around, but I should have more than enough 
CPU and memory resources.  Everything is connected with a single 10GbE.

When testing with RBDs from four clients (also running Ubuntu 13.04 with 3.12 
kernel) I am having trouble breaking 300 IOPS on a 4KB random read or write 
workload (cephx set to none, replication set to one).  IO is generated using 
FIO from four clients, each hosting a single 1TB RBD, and I've experimented 
with queue depths and increasing the number of RBDs without any benefit.  300 
IOPS for a pool of 16 10K RPM HDDs seems quite low, not to mention the journal 
should provide a good boost on write workloads.  When I run a 4KB object write 
workload in Cosbench I can approach 3500 Obj/Sec which seems more reasonable.

Sample FIO configuration:

[global]
ioengine=libaio
direct=1
ramp_time=300
runtime=300
[4k-rw]
description=4k-rw
filename=/dev/rbd1
rw=randwrite
bs=4k
stonewall

I use --iodepth=X on the FIO command line to set the queue depth when testing.

I notice in the FIO output that, despite the iodepth setting, it seems to be reporting 
an IO depth of only 1, which would certainly help explain poor performance, but 
I'm at a loss as to why. I wonder if it could be something specific to RBD 
behavior, like needing a different IO engine to establish queue depth.

IO depths: 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%

Any thoughts appreciated!

Thanks,
Joe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Low RBD Performance

2014-02-03 Thread Christian Balzer

Hello,

On Tue, 4 Feb 2014 01:29:18 + Gruher, Joseph R wrote:

[snip, nice enough test setup]

> I notice in the FIO output despite the iodepth setting it seems to be
> reporting an IO depth of only 1, which would certainly help explain poor
> performance, but I'm at a loss as to why, I wonder if it could be
> something specific to RBD behavior, like I need to use a different IO
> engine to establish queue depth.
> 
> IO depths: 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >=64=0.0%
> 

This is definitely something with how you invoke fio, because when using
the iometer simulation I get:
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.4%, 32=0.8%, >=64=98.4%

/usr/share/doc/fio/examples/iometer-file-access-server.fio on Debian.
And that uses libaio as well.
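
If it helps, a minimal job file in that spirit, with the queue depth set in 
the file itself (device path, depth and runtime are placeholders to adjust):

  [global]
  ioengine=libaio
  direct=1
  iodepth=32
  runtime=300
  time_based

  [4k-randwrite]
  filename=/dev/rbd1
  rw=randwrite
  bs=4k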

Your Cosbench results sound about right. I get about 300 IOPS with the
above fio parameters against a 2-node cluster with just one weak, SSD-less
OSD each, and on 100Mb/s (yes, fast ether!) to boot. 
Clearly I'm less fortunate when it comes to hardware lying around for test
setups. ^o^

Also from your own company: ^o^
http://software.intel.com/en-us/blogs/2013/10/25/measure-ceph-rbd-performance-in-a-quantitative-way-part-i

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Low RBD Performance

2014-02-03 Thread Mark Nelson

On 02/03/2014 07:29 PM, Gruher, Joseph R wrote:

Hi folks-

I’m having trouble demonstrating reasonable performance of RBDs.  I’m
running Ceph 0.72.2 on Ubuntu 13.04 with the 3.12 kernel.  I have four
dual-Xeon servers, each with 24GB RAM, and an Intel 320 SSD for journals
and four WD 10K RPM SAS drives for OSDs, all connected with an LSI
1078.  This is just a lab experiment using scrounged hardware so
everything isn’t sized to be a Ceph cluster, it’s just what I have lying
around, but I should have more than enough CPU and memory resources.
Everything is connected with a single 10GbE.

When testing with RBDs from four clients (also running Ubuntu 13.04 with
3.12 kernel) I am having trouble breaking 300 IOPS on a 4KB random read
or write workload (cephx set to none, replication set to one).  IO is
generated using FIO from four clients, each hosting a single 1TB RBD,
and I’ve experimented with queue depths and increasing the number of
RBDs without any benefit.  300 IOPS for a pool of 16 10K RPM HDDs seems
quite low, not to mention the journal should provide a good boost on
write workloads.  When I run a 4KB object write workload in Cosbench I
can approach 3500 Obj/Sec which seems more reasonable.

Sample FIO configuration:

[global]

ioengine=libaio

direct=1

ramp_time=300

runtime=300

[4k-rw]

description=4k-rw

filename=/dev/rbd1

rw=randwrite

bs=4k

stonewall

I use --iodepth=X on the FIO command line to set the queue depth when
testing.

I notice in the FIO output despite the iodepth setting it seems to be
reporting an IO depth of only 1, which would certainly help explain poor
performance, but I’m at a loss as to why, I wonder if it could be
something specific to RBD behavior, like I need to use a different IO
engine to establish queue depth.

IO depths: 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%

Any thoughts appreciated!


Interesting results with the io depth at 1.  I haven't seen that 
behaviour when using libaio, direct=1, and higher io depths.  Is this 
kernel RBD or QEMU/KVM?  If it's QEMU/KVM, is it the libvirt driver?


Certainly 300 IOPS is low for that kind of setup compared to what we've 
seen for RBD on other systems (especially with 1x replication).  Given 
that you are seeing more reasonable performance with RGW, I guess I'd 
look at a couple things:


- Figure out why fio is reporting queue depth = 1
- Does increasing the num jobs help (i.e. get concurrency another way)?
- Do you have enough PGs in the RBD pool? (see the quick checks below)
- Are you using the virtio driver if QEMU/KVM?
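
A couple of quick checks along those lines (the pool name "rbd", the device 
path and the job parameters are assumptions to adapt):

  ceph osd pool get rbd pg_num     # PG count of the pool backing the images
  fio --name=4k-randwrite --ioengine=libaio --direct=1 --filename=/dev/rbd1 \
      --rw=randwrite --bs=4k --numjobs=8 --iodepth=16 --group_reporting --runtime=300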



Thanks,

Joe



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] v0.76 released

2014-02-03 Thread Sage Weil
This release includes another batch of updates for firefly
functionality.  Most notably, the cache pool infrastructure now
support snapshots, the OSD backfill functionality has been generalized
to include multiple targets (necessary for the coming erasure pools),
and there were performance improvements to the erasure code plugin on
capable processors.  The MDS now properly utilizes (and seamlessly
migrates to) the OSD key/value interface (aka omap) for storing directory
objects.  There continue to be many other fixes and improvements for
usability and code portability across the tree.

Upgrading
~~~~~~~~~

* 'rbd ls' on a pool which never held rbd images now exits with code
  0. It outputs nothing in plain format, or an empty list in
  non-plain format. This is consistent with the behavior for a pool
  which used to hold images, but contains none. Scripts relying on
  this behavior should be updated.

* The MDS requires a new OSD operation TMAP2OMAP, added in this release.  When
  upgrading, be sure to upgrade and restart the ceph-osd daemons before the
  ceph-mds daemon.  The MDS will refuse to start if any up OSDs do not support
  the new feature (see the example after this list).

* The 'ceph mds set_max_mds N' command is now deprecated in favor of
  'ceph mds set max_mds N'.
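
On a sysvinit-managed cluster, for example, that order would look roughly like
this (adjust to however your daemons are actually managed):

  # on each OSD host, after installing the new packages:
  /etc/init.d/ceph restart osd
  # then, once all OSDs are back up on the new version:
  /etc/init.d/ceph restart mds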

Notable Changes
~~~~~~~~~~~~~~~

* build: misc improvements (Ken Dreyer)
* ceph-disk: generalize path names, add tests (Loic Dachary)
* ceph-disk: misc improvements for puppet (Loic Dachary)
* ceph-disk: several bug fixes (Loic Dachary)
* ceph-fuse: fix race for sync reads (Sage Weil)
* config: recursive metavariable expansion (Loic Dachary)
* crush: usability and test improvements (Loic Dachary)
* doc: misc fixes (David Moreau Simard, Kun Huang)
* erasure-code: improve buffer alignment (Loic Dachary)
* erasure-code: rewrite region-xor using vector operations (Andreas 
  Peters)
* librados, osd: new TMAP2OMAP operation (Yan, Zheng)
* mailmap updates (Loic Dachary)
* many portability improvements (Noah Watkins)
* many unit test improvements (Loic Dachary)
* mds: always store backtrace in default pool (Yan, Zheng)
* mds: store directories in omap instead of tmap (Yan, Zheng)
* mon: allow adjustment of cephfs max file size via 'ceph mds set 
  max_file_size' (Sage Weil)
* mon: do not create erasure rules by default (Sage Weil)
* mon: do not generate spurious MDSMaps in certain cases (Sage Weil)
* mon: do not use keyring if auth = none (Loic Dachary)
* mon: fix pg_temp leaks (Joao Eduardo Luis)
* osd: backfill to multiple targets (David Zafman)
* osd: cache pool support for snapshots (Sage Weil)
* osd: fix and cleanup misc backfill issues (David Zafman)
* osd: fix omap_clear operation to not zap xattrs (Sam Just, Yan, Zheng)
* osd: ignore num_objects_dirty on scrub for old pools (Sage Weil)
* osd: include more info in pg query result (Sage Weil)
* osd: track erasure compatibility (David Zafman)
* rbd: make 'rbd list' return empty list and success on empty pool (Josh 
  Durgin)
* rgw: fix object placement read op (Yehuda Sadeh)
* rgw: fix several CORS bugs (Robin H. Johnson)
* specfile: fix RPM build on RHEL6 (Ken Dreyer, Derek Yarnell)
* specfile: ship libdir/ceph (Ken Dreyer)

You can get v0.76 from the usual locations:

* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.76.tar.gz
* For packages, see http://ceph.com/docs/master/install/get-packages
* For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com